Large Language Models Can Self-improve At Web Agent Tasks

Patel Ajay, Hofmarcher Markus, Leoveanu-condrei Claudiu, Dinu Marius-constantin, Callison-burch Chris, Hochreiter Sepp. Arxiv 2024

[Paper]
Agentic Few Shot Fine Tuning Pretraining Methods Prompting Security Tools Training Techniques

Training models to act as agents that can effectively navigate and perform actions in a complex environment, such as a web browser, has typically been challenging due to lack of training data. Large language models (LLMs) have recently demonstrated some capability to navigate novel environments as agents in a zero-shot or few-shot fashion, purely guided by natural language instructions as prompts. Recent research has also demonstrated LLMs have the capability to exceed their base performance through self-improvement, i.e. fine-tuning on data generated by the model itself. In this work, we explore the extent to which LLMs can self-improve their performance as agents in long-horizon tasks in a complex environment using the WebArena benchmark. In WebArena, an agent must autonomously navigate and perform actions on web pages to achieve a specified objective. We explore fine-tuning on three distinct synthetic training data mixtures and achieve a 31% improvement in task completion rate over the base model on the WebArena benchmark through a self-improvement procedure. We additionally contribute novel evaluation metrics for assessing the performance, robustness, capabilities, and quality of trajectories of our fine-tuned agent models to a greater degree than simple, aggregate-level benchmark scores currently used to measure self-improvement.

The Large Language Model Bible

Large Language Models Can Self-improve At Web Agent Tasks

Patel Ajay, Hofmarcher Markus, Leoveanu-condrei Claudiu, Dinu Marius-constantin, Callison-burch Chris, Hochreiter Sepp. Arxiv 2024

Similar Work