News Verifiers Showdown: A Comparative Performance Evaluation Of ChatGPT 3.5, ChatGPT 4.0, Bing AI, And Bard In News Fact-checking
Caramancion Kevin Matthe. arXiv 2023
[Paper]
GPT
Model Architecture
RAG
This study aimed to evaluate the proficiency of prominent Large Language
Models (LLMs), namely OpenAI’s ChatGPT 3.5 and 4.0, Google’s Bard (LaMDA), and
Microsoft’s Bing AI in discerning the truthfulness of news items using black
box testing. A total of 100 fact-checked news items, all sourced from
independent fact-checking agencies, were presented to each of these LLMs under
controlled conditions. Their responses were classified into one of three
categories: True, False, and Partially True/False. The effectiveness of the
LLMs was gauged based on the accuracy of their classifications against the
verified facts provided by the independent agencies. The results showed a
moderate proficiency across all models, with an average score of 65.25 out of
100. Among the models, OpenAI’s GPT-4.0 stood out with a score of 71,
suggesting an edge in newer LLMs’ abilities to differentiate fact from
deception. However, when juxtaposed against the performance of human
fact-checkers, the AI models, despite showing promise, lag in comprehending the
subtleties and contexts inherent in news information. The findings highlight
the potential of AI in the domain of fact-checking while underscoring the
continued importance of human cognitive skills and the necessity for persistent
advancements in AI capabilities. Finally, the experimental data produced from
the simulation of this work is openly available on Kaggle.
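
The scoring described in the abstract can be illustrated with a minimal sketch. It assumes that a response earns one point only when its label exactly matches the fact-checkers' label, so that with 100 items the count of matches is the score out of 100; the abstract does not spell out the exact rule. The function name `score`, the model names, and the toy labels below are hypothetical and are not data from the study.

```python
# Minimal sketch of exact-match scoring against verified labels.
# Assumption: one point per exact label match (not confirmed by the abstract).

LABELS = {"True", "False", "Partially True/False"}


def score(predictions, verified):
    """Return the number of predictions that match the verified labels."""
    if len(predictions) != len(verified):
        raise ValueError("predictions and verified must have the same length")
    for label in list(predictions) + list(verified):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
    return sum(p == v for p, v in zip(predictions, verified))


# Hypothetical toy data: four news items and two made-up models.
verified = ["True", "False", "Partially True/False", "False"]
model_outputs = {
    "model_a": ["True", "False", "False", "False"],
    "model_b": ["True", "True", "Partially True/False", "True"],
}

# Per-model score, then the mean across models (the analogue of the
# reported average across the four LLMs).
scores = {name: score(preds, verified) for name, preds in model_outputs.items()}
average = sum(scores.values()) / len(scores)

print(scores)
print(average)
```

Treating "Partially True/False" as its own class means a model that answers "False" for a partially true item gets no credit under this rule; a partial-credit scheme would be a different design choice.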
Similar Work