
RLHF (Reinforcement Learning from Human Feedback)

RLHF aligns a model’s behavior with human preferences by training on rankings, ratings, or corrections from people, not just raw internet text.

It is a major reason chat models feel helpful, refuse harmful requests, and follow instructions more reliably than base models alone.

What it means

Humans compare or score model outputs. Those judgments are typically used to train a reward model, and the chat model is then fine-tuned with reinforcement learning to produce responses the reward model scores highly.
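To make the "comparison becomes a training signal" idea concrete, here is a minimal sketch. It assumes a pairwise (Bradley-Terry style) setup, which is one common way to train a reward model but not necessarily what any given vendor uses. The function names and the example scores are hypothetical, chosen only for illustration.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Probability the reward model assigns to the human-preferred response winning."""
    margin = reward_chosen - reward_rejected
    return 1.0 / (1.0 + math.exp(-margin))


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss, -log(sigmoid(margin)): small when the preferred response scores higher."""
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores a reward model might give two answers to the same prompt.
# A human rater preferred answer A over answer B.
score_a, score_b = 1.2, -0.3

print(preference_probability(score_a, score_b))  # above 0.5: model agrees with the rater
print(preference_loss(score_a, score_b))         # low loss: little correction needed
print(preference_loss(score_b, score_a))         # high loss if the scores were reversed
```

The loss is small when the reward model already ranks the pair the way the rater did and large when it disagrees, so training pushes its scores toward human preferences. The chat model is then tuned against those scores.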

Why designers should care

RLHF shapes tone, refusals, and “personality” you inherit from the vendor. Your system prompt and guardrails sit on top of alignment choices you did not make.

Example

Two assistants use the same base model, but one feels more cautious on medical questions because its RLHF training and policy layers instilled different refusal and hedging patterns.

Common mistakes

  • Blaming prompt design alone when alignment or base model choice drives refusals and tone.
  • Expecting RLHF to eliminate hallucinations or bias without product-level evals and UX.
  • Writing UI copy that promises capabilities the aligned model will refuse to deliver.
