
Comparing Plausibility Estimates In Base And Instruction-tuned Large Language Models

Kauf Carina, Chersoni Emmanuele, Lenci Alessandro, Fedorenko Evelina, Ivanova Anna A. arXiv 2024

[Paper]    
Model Architecture Prompting Survey Paper Training Techniques

Instruction-tuned LLMs can respond to explicit queries formulated as prompts, which greatly facilitates interaction with human users. However, prompt-based approaches might not always be able to tap into the wealth of implicit knowledge acquired by LLMs during pre-training. This paper presents a comprehensive study of ways to evaluate semantic plausibility in LLMs. We compare base and instruction-tuned LLM performance on an English sentence plausibility task via (a) explicit prompting and (b) implicit estimation via direct readout of the probabilities models assign to strings. Experiment 1 shows that, across model architectures and plausibility datasets, (i) log likelihood (\(\textit{LL}\)) scores are the most reliable indicator of sentence plausibility, with zero-shot prompting yielding inconsistent and typically poor results; (ii) \(\textit{LL}\)-based performance is still inferior to human performance; (iii) instruction-tuned models have worse \(\textit{LL}\)-based performance than base models. In Experiment 2, we show that \(\textit{LL}\) scores across models are modulated by context in the expected way, showing high performance on three metrics of context-sensitive plausibility and providing a direct match to explicit human plausibility judgments. Overall, \(\textit{LL}\) estimates remain a more reliable measure of plausibility in LLMs than direct prompting.
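The implicit "direct readout" approach described above can be approximated with standard tooling. The sketch below is not the authors' code: the model name, helper function, and example sentences are illustrative assumptions. It uses HuggingFace Transformers to compute the total log probability a causal LM assigns to a sentence and compares a minimal plausible/implausible pair, assigning the higher \(\textit{LL}\) score to the sentence judged more plausible.

```python
# Minimal sketch of LL-based plausibility scoring (assumed setup, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper compares base vs. instruction-tuned LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Sum of log P(token_t | tokens_<t) over the sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the model returns the mean cross-entropy over the
        # (len - 1) predicted tokens; negate and rescale to get the total LL.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

# Plausibility decision for a minimal pair (illustrative sentences): the more
# plausible sentence should receive the higher LL score.
plausible = "The teacher bought the laptop."
implausible = "The laptop bought the teacher."
print(sentence_log_likelihood(plausible) > sentence_log_likelihood(implausible))
```

Note that this sums raw token log probabilities; length- or frequency-corrected variants of the score can be substituted in the same function without changing the comparison logic.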

Similar Work