[Paper]
Evaluating the factuality of long-form text generated by large language
models (LMs) is non-trivial because (1) generations often contain a mixture of
supported and unsupported pieces of information, making binary judgments of
quality inadequate, and (2) human evaluation is time-consuming and costly. In
this paper, we introduce FACTSCORE, a new evaluation that breaks a generation
into a series of atomic facts and computes the percentage of atomic facts
supported by a reliable knowledge source. We conduct an extensive human
evaluation to obtain FACTSCOREs of biographies of people generated by several
state-of-the-art commercial LMs – InstructGPT, ChatGPT, and the
retrieval-augmented PerplexityAI – and report new analysis demonstrating the
need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since
human evaluation is costly, we also introduce an automated model that estimates
FACTSCORE using retrieval and a strong language model, with less than a 2%
error rate. Finally, we use this automated metric to evaluate 6,500 generations
from a new set of 13 recent LMs that would have cost $26K if evaluated by
humans, with various findings: GPT-4 and ChatGPT are more factual than public
models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is
available for public use via pip install factscore.
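
To make the metric concrete, the following is a minimal sketch of the FACTSCORE definition as described above: a generation is decomposed into atomic facts, each fact is judged as supported or unsupported against a knowledge source (e.g., via retrieval plus an LM judge), and the score is the supported fraction. The function names and data structures here are illustrative assumptions, not the API of the released factscore package.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AtomicFact:
    text: str  # a single short statement extracted from the generation


def factscore(
    atomic_facts: List[AtomicFact],
    is_supported: Callable[[AtomicFact], bool],  # e.g., retrieval + LM judge
) -> float:
    """Return the fraction of atomic facts supported by the knowledge source."""
    if not atomic_facts:
        return 0.0
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return supported / len(atomic_facts)


# Example: 5 atomic facts from a generated biography, 3 judged as supported.
facts = [AtomicFact(t) for t in [
    "She was born in 1968.",
    "She is a physicist.",
    "She won a Nobel Prize.",
    "She studied at MIT.",
    "She published over 100 papers.",
]]
labels = {f.text: s for f, s in zip(facts, [True, True, False, True, False])}
print(factscore(facts, lambda f: labels[f.text]))  # -> 0.6
```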