SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

Xie Tinghao, Qi Xiangyu, Zeng Yi, Huang Yangsibo, Sehwag Udari Madhushani, Huang Kaixuan, He Luxi, Wei Boyi, Li Dacheng, Sheng Ying, Jia Ruoxi, Li Bo, Li Kai, Chen Danqi, Henderson Peter, Mittal Prateek. arXiv 2024

Applications GPT Model Architecture Prompting Reinforcement Learning Responsible AI

Evaluating aligned large language models’ (LLMs) ability to recognize and reject unsafe user requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts, however, face three limitations that we address with SORRY-Bench, our proposed benchmark. First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics. For example, among the ten existing datasets that we evaluated, tests for refusals of self-harm instructions are over 3x less represented than tests for fraudulent activities. SORRY-Bench improves on this by using a fine-grained taxonomy of 45 potentially unsafe topics and 450 class-balanced unsafe instructions, compiled through human-in-the-loop methods. Second, the linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations. We supplement SORRY-Bench with 20 diverse linguistic augmentations to systematically examine these effects. Third, existing evaluations rely on large LLMs (e.g., GPT-4) as judges, which can be computationally expensive. We investigate design choices for creating a fast, accurate automated safety evaluator. By collecting 7K+ human annotations and conducting a meta-evaluation of diverse LLM-as-a-judge designs, we show that fine-tuned 7B LLMs can achieve accuracy comparable to GPT-4-scale LLMs at lower computational cost. Putting these together, we evaluate over 40 proprietary and open-source LLMs on SORRY-Bench, analyzing their distinctive refusal behaviors. We hope our effort provides a building block for systematic evaluations of LLMs’ safety refusal capabilities in a balanced, granular, and efficient manner.
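
As a rough illustration of the two measurements the abstract describes, the sketch below computes a per-topic refusal rate over labeled responses and the agreement between an automated judge and human annotations. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the (topic, refused) record layout, and the toy labels are all hypothetical and carry no results from the paper.

```python
from collections import defaultdict


def refusal_rate_by_topic(records):
    """Return the fraction of refused requests per topic.

    `records` is an iterable of (topic, refused) pairs, where `refused`
    is a bool. This layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for topic, refused in records:
        totals[topic] += 1
        refusals[topic] += int(refused)
    return {topic: refusals[topic] / totals[topic] for topic in totals}


def judge_agreement(judge_labels, human_labels):
    """Return the fraction of items where the judge matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Toy, made-up labels (two topics, two prompts each) to show the calls.
toy_records = [
    ("self-harm", True),
    ("self-harm", True),
    ("fraud", True),
    ("fraud", False),
]
rates = refusal_rate_by_topic(toy_records)

# Hypothetical judge vs. human verdicts on the same four responses.
agreement = judge_agreement(
    [True, True, True, False],
    [True, True, False, False],
)
print(rates, agreement)
```

With a class-balanced set, each topic contributes the same number of prompts, so per-topic rates can be compared directly without one heavily sampled topic dominating an overall average.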
