
Elsevier Arena: Human Evaluation Of Chemistry/Biology/Health Foundational Large Language Models

Thorne Camilo, Druckenbrodt Christian, Szarkowska Kinga, Goyal Deepika, Marajan Pranita, Somanath Vijay, Harper Corey, Yan Mao, Scerri Tony. Arxiv 2024

[Paper]    
GPT Model Architecture Pretraining Methods Tools Training Techniques Transformer

The quality and capabilities of large language models cannot currently be fully assessed with automated benchmark evaluations. Instead, human evaluations that expand on traditional qualitative techniques from the natural language generation literature are required. One recent best practice is to use A/B-testing frameworks, which capture the preferences of human evaluators for specific models. In this paper we describe a human evaluation experiment focused on the biomedical domain (health, biology, chemistry/pharmacology) carried out at Elsevier. In it, a large but not massive (8.8B-parameter) decoder-only foundational transformer, trained on a relatively small (135B tokens) but highly curated collection of Elsevier datasets, is compared against OpenAI’s GPT-3.5-turbo and Meta’s foundational 7B-parameter Llama 2 model on multiple criteria. The results indicate, even though inter-rater reliability (IRR) scores were generally low, a preference towards GPT-3.5-turbo, and hence towards models that possess conversational abilities, are very large, and were trained on very large datasets. At the same time, they indicate that, for less massive models, training on smaller but well-curated datasets can potentially give rise to viable alternatives in the biomedical domain.
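The evaluation described above rests on two ingredients: pairwise A/B preference judgments and inter-rater reliability (IRR). As a rough illustration only, and not the authors' code, the Python sketch below shows how such judgments could be turned into per-model win rates and how Cohen's kappa could quantify agreement between two raters. The model names, rater identifiers, and judgments are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical A/B judgments: (prompt_id, rater, preferred_model).
# Data and model labels are illustrative, not taken from the paper.
judgments = [
    ("q1", "rater1", "gpt-3.5-turbo"),
    ("q1", "rater2", "gpt-3.5-turbo"),
    ("q2", "rater1", "elsevier-8.8b"),
    ("q2", "rater2", "gpt-3.5-turbo"),
    ("q3", "rater1", "llama-2-7b"),
    ("q3", "rater2", "llama-2-7b"),
]

# Win rate per model: fraction of all judgments in which it was preferred.
counts = Counter(model for _, _, model in judgments)
total = sum(counts.values())
for model, wins in counts.most_common():
    print(f"{model}: {wins / total:.2f}")


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two raters labelling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence of the two raters.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)


# Align the two raters' choices by prompt and compute their agreement.
by_rater = {}
for prompt, rater, model in judgments:
    by_rater.setdefault(rater, {})[prompt] = model
shared = sorted(by_rater["rater1"].keys() & by_rater["rater2"].keys())
a = [by_rater["rater1"][p] for p in shared]
b = [by_rater["rater2"][p] for p in shared]
print(f"Cohen's kappa (IRR): {cohens_kappa(a, b):.2f}")
```

A low kappa, as reported in the abstract, would mean raters often disagreed on which model's answer was better, so aggregate win rates should be read with caution.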

Similar Work