RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

  • Shreyas Chaudhari
  • Pranjal Aggarwal
  • Vishvak Murahari
  • Tanmay Rajpurohit
  • Ashwin Kalyan
  • Karthik Narasimhan
  • Ameet Deshpande
  • Bruno Castro da Silva

Research output: Contribution to journal › Article › peer-review

Abstract

A significant challenge in training large language models (LLMs) as effective assistants is aligning them with human preferences. Reinforcement learning from human feedback (RLHF) has emerged as a promising solution. However, our understanding of RLHF is often limited to initial design choices. This article analyzes RLHF through reinforcement learning principles, focusing on the reward model. It examines modeling choices and function approximation caveats, highlighting assumptions about reward expressivity and revealing limitations like incorrect generalization, model misspecification, and sparse feedback. A categorical review of current literature provides insights for researchers to understand the challenges of RLHF and build upon existing methods.

Original language: English (US)
Article number: 53
Journal: ACM Computing Surveys
Volume: 58
Issue number: 2
DOIs
State: Published - Sep 10 2025

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science

Keywords

  • Preference learning
  • function approximation
  • survey
  • taxonomy
