The Base-rate Effect On LLM Benchmark Performance: Disambiguating Test-taking Strategies From Benchmark Performance

Moore Kyle, Roberts Jesse, Pham Thao, Ewaleifoh Oseremhen, Fisher Doug. arXiv 2024

[Paper]
Prompting

Cloze testing is a common method for measuring the behavior of large language models on a number of benchmark tasks. Using the MMLU dataset, we show that base-rate probability (BRP) differences across answer tokens are significant and affect task performance (e.g., guess A if uncertain). We find that counterfactual prompting sufficiently mitigates the BRP effect. The BRP effect is found to act similarly to the test-taking strategies employed by humans, leading to the conflation of task performance and test-taking ability. We propose Nvr-X-MMLU, a variation of MMLU that helps to disambiguate test-taking ability from task performance and reports the latter.
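
The core measurement is straightforward to reproduce in spirit: query the model's next-token distribution after a cloze-style answer prompt and compare the probability mass assigned to each answer letter. The sketch below uses a Hugging Face causal LM; the model name, prompt template, and tokenization details are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of measuring base-rate probabilities (BRP) of answer
# tokens under cloze testing. Assumes a Hugging Face causal LM; the
# specific model and prompt are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper evaluates other LLMs
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# A content-free cloze prompt: with no information in the question,
# a bias-free model should assign roughly equal mass to A-D. Any skew
# (e.g., toward "A") reflects a base-rate effect, not task knowledge.
prompt = "Question: ?\nA. ?\nB. ?\nC. ?\nD. ?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token distribution
probs = torch.softmax(logits, dim=-1)

for letter in ["A", "B", "C", "D"]:
    # " A" (with a leading space) is how BPE tokenizers typically
    # encode the answer letter following "Answer:"
    token_id = tokenizer.encode(" " + letter)[0]
    print(f"P({letter}) = {probs[token_id].item():.4f}")
```

Running this across many content-free or shuffled prompts gives an estimate of each letter's base rate, which can then be compared against per-letter accuracy on the real benchmark.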

Similar Work