Counterfactuals As A Means For Evaluating Faithfulness Of Attribution Methods In Autoregressive Language Models

Kamahi Sepehr, Yaghoobzadeh Yadollah. arXiv 2024

[Paper] [Code]    
BERT GPT Has Code Interpretability And Explainability Language Modeling Masked Language Model Pretraining Methods RAG Training Techniques

Despite the widespread adoption of autoregressive language models, explainability evaluation research has predominantly focused on span infilling and masked language models (MLMs). Evaluating the faithfulness of an explanation method – how accurately it reflects the model's inner workings and decision-making – is challenging, largely because the model cannot easily be separated from its explanation. Most faithfulness evaluation techniques corrupt or remove input tokens deemed important by a particular attribution (feature importance) method and observe the change in the model's output. For causal language models (CLMs), this approach creates out-of-distribution inputs, because their training objective is next-token prediction. In this study, we propose a technique that leverages counterfactual generation to evaluate the faithfulness of attribution methods in autoregressive language modeling scenarios. Our technique generates fluent, in-distribution counterfactuals, making the evaluation protocol more reliable. Code is available at https://github.com/Sepehr-Kamahi/faith
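
The sketch below illustrates the conventional perturbation-style faithfulness check that the abstract critiques: score input tokens with an attribution method, drop the most important ones, and measure how the model's prediction changes. It is a minimal sketch only, assuming a HuggingFace causal LM (`gpt2` here) and a simple gradient-times-input attribution; it is not the authors' counterfactual-based protocol, which instead replaces this token-removal step with fluent, in-distribution counterfactuals.

```python
# Sketch of perturbation-based faithfulness evaluation for a causal LM.
# Assumptions (not from the paper): gpt2 as the model, gradient-x-input as the
# attribution method, and "drop the top-k tokens" as the corruption scheme.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The capital of France is"
enc = tok(text, return_tensors="pt")
input_ids = enc["input_ids"]

# Attribution: gradient x input on the embeddings, w.r.t. the predicted next token.
embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
pred_id = logits[0, -1].argmax()
logits[0, -1, pred_id].backward()
importance = (embeds.grad * embeds).sum(-1).abs()[0]  # one score per input token

# Conventional protocol: remove the k most important tokens and measure the drop
# in probability of the original prediction. The shortened sequence is typically
# ungrammatical, i.e. out-of-distribution for a next-token-prediction model,
# which is the problem the counterfactual-based evaluation aims to avoid.
k = 2
keep = importance.argsort()[:-k]          # indices of the tokens we keep
keep, _ = keep.sort()                     # restore original token order
corrupted_ids = input_ids[:, keep]

with torch.no_grad():
    p_full = torch.softmax(model(input_ids).logits[0, -1], dim=-1)[pred_id]
    p_corr = torch.softmax(model(corrupted_ids).logits[0, -1], dim=-1)[pred_id]

print(f"prob drop after removing top-{k} tokens: {(p_full - p_corr).item():.4f}")
```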

Similar Work