RLHF (Reinforcement Learning from Human Feedback)
RLHF aligns a model’s behavior with human preferences by training on rankings, ratings, or corrections from people, not just raw internet text.
It is a major reason chat models feel helpful, refuse harmful requests, and follow instructions more reliably than the base models they are built on.
What it means
Humans compare or score model outputs. Those judgments are turned into a reward signal, typically by training a reward model on them, and that signal is used to fine-tune the model toward preferred responses.
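To make the step from human comparisons to a reward signal concrete, here is a minimal sketch of one common formulation, a Bradley-Terry style pairwise loss. It is a toy under stated assumptions: the response names and the `fit_scores` helper are hypothetical, and it learns one number per response, whereas a production reward model is typically a neural network that scores arbitrary text.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_loss(score_preferred: float, score_other: float) -> float:
    """Bradley-Terry style loss: small when the preferred response scores higher."""
    return -math.log(sigmoid(score_preferred - score_other))


def fit_scores(comparisons, steps=200, lr=0.1):
    """Learn one scalar reward per response from (preferred, other) pairs.

    Hypothetical helper for illustration only; it runs plain gradient
    descent on pairwise_loss over the given comparisons.
    """
    scores = {}
    for preferred, other in comparisons:
        scores.setdefault(preferred, 0.0)
        scores.setdefault(other, 0.0)

    for _ in range(steps):
        for preferred, other in comparisons:
            # Gradient step: the worse the current scores explain the
            # human choice, the larger the push.
            push = 1.0 - sigmoid(scores[preferred] - scores[other])
            scores[preferred] += lr * push
            scores[other] -= lr * push
    return scores


# Made-up human judgments: each pair is (preferred response, other response).
comparisons = [
    ("polite_answer", "curt_answer"),
    ("curt_answer", "rude_answer"),
    ("polite_answer", "rude_answer"),
]

scores = fit_scores(comparisons)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The learned scores rank the responses in the order people preferred them. In a full RLHF pipeline, scores like these would then steer fine-tuning of the model itself, a step this sketch leaves out.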
Why designers should care
RLHF shapes tone, refusals, and “personality” you inherit from the vendor. Your system prompt and guardrails sit on top of alignment choices you did not make.
Example
Two assistants use the same base model; one feels more cautious on medical questions because its RLHF and policy layers instilled different refusal and hedging patterns.
Common mistakes
- Blaming prompt design alone when alignment or base model choice drives refusals and tone.
- Expecting RLHF to eliminate hallucinations or bias without product-level evals and UX.
- Writing UI copy that promises capabilities the aligned model will refuse to deliver.