Struc-bench: Are Large Language Models Really Good At Generating Complex Structured Data?

Xiangru Tang, Yiming Zong, Jason Phang, Yilun Zhao, Wangchunshu Zhou, Arman Cohan, Mark Gerstein. arXiv 2023

[Paper] [Code]
Fine Tuning, GPT, Has Code, Model Architecture, Pretraining Methods, Prompting, RAG, Training Techniques

Despite the remarkable capabilities of Large Language Models (LLMs) like GPT-4, producing complex, structured tabular data remains challenging. Our study assesses LLMs' proficiency in structuring tables and introduces a novel, structure-aware fine-tuning method to bolster their performance. We unveil Struc-Bench, a comprehensive benchmark featuring prominent LLMs (GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna), which spans text tables, HTML, and LaTeX formats. Our proposed FormatCoT aids in crafting format-specific instructions from the intended outputs to populate this benchmark. Addressing the gap in task-centered evaluation, we propose two novel metrics, P-Score (Prompting Score) and H-Score (Heuristical Score), to more accurately gauge LLM performance. Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains, outperforming its LLM counterparts across most measures. In-depth error analysis and an ability map across six dimensions (coverage, formatting, reasoning, comprehension, pragmatics, and hallucination) highlight areas for improvement and suggest directions for future research. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.
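The H-Score is described as a heuristic measure of how faithfully a generated table matches a reference. As an illustration only (this is not the authors' actual metric, and both function names are hypothetical), the sketch below scores a generated LaTeX table against a reference by comparing row/column shape and exact cell matches:

```python
import re

def latex_table_rows(src):
    """Split a simple LaTeX tabular into a list of rows of cell strings.
    Hypothetical helper; handles only basic tables without nesting."""
    # Extract the tabular body if the \begin{tabular} wrapper is present.
    m = re.search(r"\\begin\{tabular\}\{[^}]*\}(.*?)\\end\{tabular\}", src, re.S)
    body = m.group(1) if m else src
    rows = []
    for raw in body.split("\\\\"):          # LaTeX row separator \\
        raw = raw.replace("\\hline", "").strip()
        if raw:
            rows.append([cell.strip() for cell in raw.split("&")])
    return rows

def heuristic_structure_score(generated, reference):
    """Return a score in [0, 1]: 0 on any shape mismatch,
    otherwise the fraction of exactly matching cells."""
    g, r = latex_table_rows(generated), latex_table_rows(reference)
    if not r:
        return 0.0
    if len(g) != len(r) or any(len(a) != len(b) for a, b in zip(g, r)):
        return 0.0  # wrong number of rows or columns
    matches = sum(a == b for grow, rrow in zip(g, r)
                  for a, b in zip(grow, rrow))
    total = sum(len(rrow) for rrow in r)
    return matches / total
```

A real heuristic metric would likely also normalize whitespace and formatting commands and award partial credit for near-miss cells; this sketch only conveys the general idea of format-aware, task-centered scoring.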

Similar Work