Three Models of RLHF Annotation: Extension, Evidence, and Authority
Reinforcement Learning from Human Feedback (RLHF) is the primary method used to align large language models with human values. While the process relies heavily on human annotators, the specific role those annotators play is often left undefined. This paper argues that there are three distinct conceptual models of how annotators function: as extensions of the designers, as providers of evidence, or as holders of independent authority. By distinguishing these roles, the author gives designers a framework for structuring their annotation pipelines, arguing that a "one-size-fits-all" approach is often ineffective.
The Three Models of Annotation
The author identifies three ways to interpret the normative role of an annotator:
Extension: Annotators act as proxies for the system designers. Their job is to apply the designers' own preferences and beliefs to new scenarios that the designers do not have the time to review themselves.
Evidence: Annotators serve as sources of information. They provide data about objective facts (such as whether a statement is true) or about social facts (such as whether a community finds a specific output offensive).
Authority: Annotators act as a "mini-legislature." They possess the legitimate power to decide how a system should behave, not because they are experts, but because they represent the broader population.
Implications for Pipeline Design
The choice of model significantly changes how a developer should build an RLHF pipeline. For instance, if a designer views annotators as an "extension" of themselves, they might validate work by comparing it to "gold-standard" labels they created. However, if they view annotators as "authority" figures, using gold-standard labels would be inappropriate because it undermines the annotators' independent role. Similarly, the method of aggregation changes: while a simple majority vote might work for an "extension" model, an "evidence" model regarding community preferences might require a more nuanced, scalar approach to capture the diversity of public opinion.
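The contrast between aggregation methods can be made concrete. Below is a minimal sketch (not from the paper; the function names and label scheme are illustrative) of the two approaches: a majority vote that collapses labels to a single winner, versus a scalar summary that preserves how much annotators disagree.

```python
from collections import Counter
from statistics import mean

def majority_vote(labels):
    """Collapse categorical labels to the single most common one.
    A natural fit for the "extension" model, where annotators proxy
    for one underlying designer judgment."""
    return Counter(labels).most_common(1)[0][0]

def scalar_aggregate(scores):
    """Return the mean score plus the fraction of annotators who
    dissent from the majority. For an "evidence" model of community
    preference, the dissent rate is itself informative and should
    not be discarded by a winner-take-all vote."""
    avg = mean(scores)
    top = majority_vote(scores)
    dissent = sum(1 for s in scores if s != top) / len(scores)
    return avg, dissent

# Hypothetical annotations of one model output by four annotators.
labels = ["acceptable", "acceptable", "offensive", "acceptable"]
scores = [1, 1, 0, 1]  # 1 = acceptable, 0 = offensive

print(majority_vote(labels))     # acceptable
print(scalar_aggregate(scores))  # (0.75, 0.25)
```

The majority vote reports only "acceptable", while the scalar aggregate records that a quarter of annotators found the output offensive, which is exactly the diversity of opinion the "evidence" framing wants to capture.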
A Call for Heterogeneous Pipelines
A central problem in current AI development is the unintentional conflation of these models. When designers mix these roles without clear intent, it can lead to inconsistent instructions, poor validation metrics, and confusion about how to handle disagreements among annotators. The author recommends that instead of seeking a single, unified pipeline, designers should decompose their tasks into separable dimensions. By identifying whether a specific task requires an extension, evidence, or authority-based approach, developers can tailor their selection, instruction, and aggregation methods to be more effective and transparent.
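The recommended decomposition could be recorded explicitly in a pipeline configuration. The sketch below is a hypothetical illustration (the dimension names and policies are assumptions, not from the paper) of mapping each task dimension to a conceptual model and the selection, validation, and aggregation choices that follow from it.

```python
# Hypothetical decomposition of one annotation task into separable
# dimensions, each assigned the conceptual model that fits it.
PIPELINE = {
    "factual_accuracy": {
        "model": "evidence",    # annotators report objective facts
        "selection": "domain experts",
        "validation": "cross-check against reliable sources",
        "aggregation": "weight by demonstrated expertise",
    },
    "style_compliance": {
        "model": "extension",   # annotators apply designer preferences
        "selection": "trained in-house raters",
        "validation": "gold-standard labels from designers",
        "aggregation": "majority vote",
    },
    "community_offensiveness": {
        "model": "authority",   # annotators legitimately decide
        "selection": "representative sample of the population",
        "validation": "no gold labels (would undermine the role)",
        "aggregation": "preserve the distribution of judgments",
    },
}

for dimension, policy in PIPELINE.items():
    print(f"{dimension}: {policy['model']} -> {policy['aggregation']}")
```

Making the model explicit per dimension forces the inconsistencies the author warns about, such as validating "authority" judgments against designer gold labels, to surface as visible configuration conflicts rather than silent pipeline defaults.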