Fractured-sorry-bench: Framework For Revealing Attacks In Conversational Turns Undermining Refusal Efficacy And Defenses Over Sorry-bench

Aman Priyanshu, Supriti Vijay. arXiv 2024

[Paper]

Tags: GPT, Model Architecture, Prompting, Responsible AI, Security, Tools

This paper introduces FRACTURED-SORRY-Bench, a framework for evaluating the safety of Large Language Models (LLMs) against multi-turn conversational attacks. Building upon the SORRY-Bench dataset, we propose a simple yet effective method for generating adversarial prompts by decomposing harmful queries into seemingly innocuous sub-questions spread across conversational turns. Our approach achieves a maximum increase of +46.22% in Attack Success Rates (ASRs) across GPT-4, GPT-4o, GPT-4o-mini, and GPT-3.5-Turbo compared to baseline single-turn prompting. These results show that such decomposition poses a challenge to current LLM safety measures and highlight the need for more robust defenses against subtle, multi-turn attacks.
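The reported +46.22% figure is a difference in Attack Success Rate, i.e. the fraction of harmful queries whose final model response is judged harmful. A minimal sketch of how such a delta could be computed (the helper name and the judged outcomes below are illustrative assumptions, not the paper's actual evaluation code):

```python
def attack_success_rate(outcomes):
    """ASR = fraction of queries whose final response was judged harmful.

    `outcomes` is a list of booleans, one per harmful query:
    True  -> the model ultimately complied with the harmful intent
    False -> the model refused or deflected
    """
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# Illustrative per-query judgments (not real benchmark data):
baseline_outcomes  = [False, False, True, False]  # single-turn harmful prompts
fractured_outcomes = [True, False, True, True]    # same intents, split into
                                                  # multi-turn sub-questions

baseline_asr  = attack_success_rate(baseline_outcomes)   # 0.25
fractured_asr = attack_success_rate(fractured_outcomes)  # 0.75

# Increase in ASR attributable to the multi-turn decomposition, in points:
delta = (fractured_asr - baseline_asr) * 100
```

In this toy example the decomposed multi-turn prompting lifts ASR by 50 percentage points; the paper's evaluation performs the analogous comparison over the full SORRY-Bench query set for each model.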
