
MR-BEN: A Comprehensive Meta-reasoning Benchmark For Large Language Models

Zeng Zhongshen, Liu Yinhong, Wan Yingjia, Li Jingyao, Chen Pengguang, Dai Jianbo, Yao Yuxuan, Xu Rongwu, Qi Zehan, Zhao Wanru, Shen Linling, Lu Jianqiao, Tan Haochen, Chen Yukang, Zhang Hao, Shi Zhan, Wang Bailin, Guo Zhijiang, Jia Jiaya. arXiv 2024

[Paper] [Code]    
Tags: GPT, Has Code, Model Architecture, Reinforcement Learning

Large language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely driven by step-by-step chain-of-thought reasoning. However, evaluating the reasoning capability of LLMs has become increasingly challenging: existing outcome-based benchmarks are beginning to saturate and no longer suffice to monitor progress. To this end, we present MR-BEN, a process-based benchmark that demands meta-reasoning skill: LMs are asked to locate and analyse potential errors in automatically generated reasoning steps. MR-BEN is a comprehensive benchmark comprising 5,975 questions collected from human experts, covering subjects such as physics, chemistry, logic, coding, and more. Using the metrics we designed to assess meta-reasoning on this benchmark, we identify notable limitations and weaknesses of current open-source and closed-source LLMs. For example, open-source models appear comparable to GPT-4 on outcome-based benchmarks, but they lag far behind on our benchmark, revealing the underlying gap in reasoning capability. Our dataset and code are available at https://randolph-zeng.github.io/Mr-Ben.github.io/.
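To make the meta-reasoning task format concrete, below is a minimal sketch of how one evaluation item might be represented and graded. The item schema (`solution_is_correct`, `first_error_step`) and the grading logic are illustrative assumptions, not the benchmark's actual format or official metrics; see the linked code for those.

```python
# Hypothetical illustration of a meta-reasoning item in the MR-BEN style: the
# model must judge whether a chain-of-thought solution is correct and, if not,
# locate the first erroneous step. Field names are assumptions for illustration.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class MetaReasoningItem:
    question: str
    solution_steps: list[str]      # automatically generated reasoning steps
    solution_is_correct: bool      # gold label: is the solution sound?
    first_error_step: int | None   # gold index of the first wrong step, if any

def grade(item: MetaReasoningItem,
          pred_correct: bool,
          pred_error_step: int | None) -> dict[str, bool]:
    """Score one model prediction against the gold annotation."""
    correctness_ok = pred_correct == item.solution_is_correct
    # Error localization only counts when the solution is actually flawed
    # and the model also flagged it as flawed.
    location_ok = (
        not item.solution_is_correct
        and not pred_correct
        and pred_error_step == item.first_error_step
    )
    return {"correctness": correctness_ok, "error_location": location_ok}

# Example: a flawed two-step arithmetic solution; step 1 (0-indexed) miscomputes
# 12 * 5 as 50 instead of 60, so the final answer 170 is wrong (should be 180).
item = MetaReasoningItem(
    question="What is 12 * 15?",
    solution_steps=["12 * 15 = 12 * 10 + 12 * 5", "= 120 + 50 = 170"],
    solution_is_correct=False,
    first_error_step=1,
)
print(grade(item, pred_correct=False, pred_error_step=1))
# -> {'correctness': True, 'error_location': True}
```

This sketch separates the outcome judgment (is the solution correct?) from error localization (where does it first go wrong?), which mirrors the paper's process-based framing: a model can only earn localization credit on flawed solutions it correctly flags as flawed.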

Similar Work