Reinforcement Learning In The Era Of LLMs: What Is Essential? What Is Needed? An RL Perspective On RLHF, Prompting, And Beyond
Sun Hao. arXiv 2023
[Paper]
Agentic
Attention Mechanism
Efficiency And Optimization
Fine Tuning
GPT
Model Architecture
Prompting
Reinforcement Learning
Recent advancements in Large Language Models (LLMs) have garnered wide
attention and led to successful products such as ChatGPT and GPT-4. Their
proficiency in adhering to instructions and delivering harmless, helpful, and
honest (3H) responses can largely be attributed to the technique of
Reinforcement Learning from Human Feedback (RLHF). In this paper, we aim to
link research in conventional RL to the RL techniques used in LLM research
and to demystify these techniques by discussing why, when, and how RL excels.
Furthermore, we explore potential future avenues that could either benefit from
or contribute to RLHF research.
Highlighted Takeaways:
- RLHF is Online Inverse RL with Offline Demonstration Data.
- RLHF \(>\) SFT because Imitation Learning (and Inverse RL) \(>\) Behavior
Cloning (BC): interactive learning alleviates the problem of compounding error
(a standard bound is sketched after this list).
- The RM step in RLHF produces a proxy for expensive human feedback; this
insight generalizes to other LLM tasks, such as prompt evaluation and
optimization, where feedback is also costly (a minimal reward-model sketch
follows the list).
- Policy learning in RLHF is harder than the problems conventionally studied
in IRL because of its high action dimensionality and sparse feedback.
- The main advantage of PPO over off-policy value-based methods is the
stability it gains from (almost) on-policy data and conservative policy
updates (see the clipped-objective sketch after this list).
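To make the compounding-error takeaway concrete, the standard imitation-learning bound (Ross and Bagnell, 2010) is reproduced below purely as an illustration; the paper's exact statement may differ. If the learned policy deviates from the expert with probability \(\epsilon\) at each of \(T\) steps, behavior cloning can drift into states the expert never visited, so its cost gap grows quadratically in the horizon, whereas interactive imitation (and IRL-style methods that gather feedback on the learner's own rollouts) keeps the gap linear:

\[
J(\pi_{\text{BC}}) \le J(\pi^{\ast}) + O(\epsilon T^{2}),
\qquad
J(\pi_{\text{IL}}) \le J(\pi^{\ast}) + O(\epsilon T),
\]

where \(J\) denotes expected cumulative cost and \(\pi^{\ast}\) the expert (demonstration) policy.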
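The reward-modelling insight can likewise be sketched in code. The snippet below is a minimal, hypothetical example rather than the paper's implementation: it assumes pooled LLM embeddings of chosen/rejected responses are already available, and the names (`RewardModel`, `preference_loss`, `embed_dim`) are illustrative. It fits a scalar reward head with a Bradley-Terry pairwise loss, so the learned reward can serve as a cheap proxy for further human feedback.

```python
# Minimal sketch of the RM step: fit a scalar reward to pairwise preferences.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a pooled (prompt, response) representation to a scalar reward."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood: the chosen response should score higher."""
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: random embeddings stand in for pooled LLM hidden states.
rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-4)
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)
loss = preference_loss(rm(chosen), rm(rejected))
loss.backward()
opt.step()
```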
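Similarly, the stability argument for PPO rests on its clipped surrogate objective, which keeps the updated policy close to the behaviour policy that generated the (almost) on-policy data. The sketch below shows that objective only; the tensor shapes, clip range, and toy data are assumptions, and a full RLHF loop would additionally use a KL penalty against the reference model and a learned value baseline.

```python
# Minimal sketch of PPO's clipped surrogate objective (conservative update).
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate: bounds the importance ratio near 1, so the update
    stays close to the data-collecting (behaviour) policy."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage: per-token log-probs under the updated and behaviour policies.
logp_new = torch.randn(16, requires_grad=True)
logp_old = logp_new.detach() + 0.05 * torch.randn(16)
advantages = torch.randn(16)
loss = ppo_clip_loss(logp_new, logp_old, advantages)
loss.backward()
```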
Similar Work