
Paper Abstract

Preference-based alignment methods, most prominently Reinforcement Learning with Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers' own judgments about what outputs should be. The second is evidence: annotators provide independent evidence about some facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority (as representatives of the broader population) to determine system outputs. I argue that these models have implications for how RLHF pipelines should solicit, validate and aggregate annotations. I survey landmark papers in the literature on RLHF and related methods to illustrate how they implicitly draw on these models, describe failure modes that come from unintentionally or intentionally conflating them, and offer normative criteria for choosing among them. My central recommendation is that RLHF pipeline designers should decompose annotation into separable dimensions and tailor each pipeline to the model most appropriate for that dimension, rather than seeking a single unified pipeline.

Three Models of RLHF Annotation: Extension, Evidence, and Authority
Reinforcement Learning with Human Feedback (RLHF) is the primary method used to align large language models with human values. While this process relies heavily on human annotators, the specific role these annotators play is often left undefined. This paper argues that there are three distinct conceptual models for how annotators function: as extensions of the designers, as providers of evidence, or as holders of independent authority. By distinguishing these roles, the author provides a framework for designers to better structure their annotation pipelines, suggesting that a "one-size-fits-all" approach is often ineffective.

The Three Models of Annotation

The author identifies three ways to interpret the normative role of an annotator:

  • Extension: Annotators act as proxies for the system designers. Their job is to apply the designers' own preferences and beliefs to new scenarios that the designers do not have the time to review themselves.

  • Evidence: Annotators serve as sources of information. They provide data about objective facts (such as whether a statement is true) or about social facts (such as whether a community finds a specific output offensive).

  • Authority: Annotators act as a "mini-legislature." They possess the legitimate power to decide how a system should behave, not because they are experts, but because they represent the broader population.

Implications for Pipeline Design

The choice of model significantly changes how a developer should build an RLHF pipeline. For instance, if a designer views annotators as an "extension" of themselves, they might validate work by comparing it to "gold-standard" labels they created. However, if they view annotators as "authority" figures, using gold-standard labels would be inappropriate because it undermines the annotators' independent role. The method of aggregation changes as well: a simple majority vote might suffice for an "extension" model, whereas an "evidence" model aimed at community preferences may call for a scalar aggregate that preserves the diversity of opinion rather than collapsing it to a single verdict.
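To make the contrast concrete, the sketch below shows, in Python, how validation and aggregation might differ between an "extension" pipeline and an "evidence" pipeline. The paper does not give code; the 0/1 preference encoding, the function names, and the example labels are illustrative assumptions.

```python
from collections import Counter
from statistics import mean

# Hypothetical pairwise preference labels: 1 = annotator prefers output B,
# 0 = annotator prefers output A.
annotations = [1, 1, 0, 1, 0]

def aggregate_extension(labels, gold_label=None):
    """Extension model: annotators stand in for the designers, so a majority
    vote approximates the label the designers would have given, and agreement
    with a designer-written gold label is a legitimate quality check."""
    if gold_label is not None:
        agreement = mean(int(label == gold_label) for label in labels)
        print(f"Agreement with designer gold label: {agreement:.2f}")
    return Counter(labels).most_common(1)[0][0]

def aggregate_evidence(labels):
    """Evidence model (about community preferences): the spread of judgments
    is itself the signal, so keep a scalar preference rate instead of
    collapsing the labels to a single winner."""
    return mean(labels)

print(aggregate_extension(annotations, gold_label=1))  # -> 1
print(aggregate_evidence(annotations))                 # -> 0.6
```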

A Call for Heterogeneous Pipelines

A central problem in current AI development is the unintentional conflation of these models. When designers mix these roles without clear intent, it can lead to inconsistent instructions, poor validation metrics, and confusion about how to handle disagreements among annotators. The author recommends that instead of seeking a single, unified pipeline, designers should decompose their tasks into separable dimensions. By identifying whether a specific task requires an extension, evidence, or authority-based approach, developers can tailor their selection, instruction, and aggregation methods to be more effective and transparent.
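As a hedged illustration of what "decomposing annotation into separable dimensions" could look like in practice, the sketch below maps each dimension of a hypothetical task to one of the three models and to a matching validation and aggregation strategy. The dimension names and strategy descriptions are assumptions for illustration, not drawn from the paper.

```python
from dataclasses import dataclass

@dataclass
class DimensionSpec:
    """Configuration for one annotation dimension in a heterogeneous pipeline."""
    dimension: str    # what annotators are asked to judge
    model: str        # "extension", "evidence", or "authority"
    validation: str   # how annotator quality is checked
    aggregation: str  # how individual judgments are combined

# Hypothetical decomposition of a single RLHF task into three dimensions,
# each handled under the model that fits it best.
pipeline = [
    DimensionSpec(
        dimension="factual accuracy",
        model="evidence",
        validation="cross-check against verifiable sources",
        aggregation="weight annotators by demonstrated reliability",
    ),
    DimensionSpec(
        dimension="policy compliance",
        model="extension",
        validation="agreement with designer-written gold labels",
        aggregation="majority vote",
    ),
    DimensionSpec(
        dimension="acceptability of tone",
        model="authority",
        validation="representative sampling of the annotator pool",
        aggregation="report the full distribution, not a single winner",
    ),
]

for spec in pipeline:
    print(f"{spec.dimension}: {spec.model} -> {spec.aggregation}")
```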
