BLT: Can Large Language Models Handle Basic Legal Text?

Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme. arXiv 2023

[Paper]    
Fine-Tuning, GPT, Model Architecture, Pretraining Methods, Reinforcement Learning, Training Techniques

We find that the best publicly available LLMs like GPT-4, Claude, and PaLM 2 currently perform poorly at basic legal text handling. We introduce a benchmark consisting of tasks that lawyers and paralegals would expect LLMs to handle zero-shot, such as looking up the text at a given line of a witness deposition or at a given subsection of a contract. LLMs' poor performance on this benchmark casts doubt on their reliability, as-is, for legal practice. However, fine-tuning on these tasks brings even a smaller model to near-perfect performance on our test set and also raises performance on a related legal task. These results suggest that many of the simple behaviors a domain requires may not be present in foundation LLMs without additional engagement from subject-matter experts.
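To make the benchmark's task style concrete, here is a minimal sketch of a deposition line-lookup task of the kind the abstract describes. The transcript, prompt wording, and function name are illustrative assumptions, not the paper's actual format; a natural scoring rule for such a task is exact string match against the gold line.

```python
import random

def make_deposition_lookup_task(lines, rng=random.Random(0)):
    """Build one BLT-style zero-shot lookup task: given a line-numbered
    deposition excerpt, ask for the exact text at a specific line.
    (Illustrative only; the real benchmark's format may differ.)"""
    numbered = "\n".join(f"{i + 1}: {text}" for i, text in enumerate(lines))
    target = rng.randrange(1, len(lines) + 1)
    prompt = (
        "Below is a numbered excerpt from a witness deposition.\n\n"
        f"{numbered}\n\n"
        f"What is the exact text of line {target}? Reply with that line only."
    )
    gold = lines[target - 1]
    return prompt, gold

# Toy transcript; real depositions are much longer.
transcript = [
    "Q. Please state your name for the record.",
    "A. John Q. Witness.",
    "Q. Where were you on the night of March 3rd?",
    "A. At home, watching television.",
]
prompt, gold = make_deposition_lookup_task(transcript)
print(prompt)
print("GOLD:", gold)
```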

Similar Work