Inference
Inference is the process of running a trained model on new inputs to produce outputs: the live “prediction” step that users experience as chat, classification, or generation.
Product decisions about model size, caching, batching, and deployment region affect both the cost of each inference call and the latency budget you design around.
What it means
Each user request triggers inference: tokens in, tokens out, optionally interleaved with tool calls and retrieval.
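Because every request is tokens in and tokens out, its cost can be roughed out from two token counts and two prices. The sketch below is only an illustration: the function name, its parameters, and the prices in the usage line are hypothetical, and real providers may bill differently (for example, discounting cached input).

```python
def estimate_request_cost(
    input_tokens: int,
    output_tokens: int,
    price_in_per_1k: float,
    price_out_per_1k: float,
) -> float:
    """Estimate the cost of one inference request from its token counts.

    Prices are per 1,000 tokens and are supplied by the caller; nothing
    here reflects any real provider's pricing.
    """
    input_cost = (input_tokens / 1000) * price_in_per_1k
    output_cost = (output_tokens / 1000) * price_out_per_1k
    return input_cost + output_cost


# Hypothetical prices, chosen only to make the arithmetic easy to follow.
cost = estimate_request_cost(1000, 500, price_in_per_1k=0.5, price_out_per_1k=1.5)
print(cost)  # 1.25
```

Multiplying this per-request figure by the number of calls a feature chains gives a first-pass estimate of what one user action costs.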
Why designers should care
Features that chain many inference calls (agents, auto-rewrite loops) need budgets, caps, and user-visible cost or rate limits to stay usable.
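One way to picture a budget with caps is a small counter that every chained call must pass through. This is a minimal sketch, not a reference implementation: `InferenceBudget`, `BudgetExceeded`, and `run_agent_steps` are hypothetical names, and the per-step token costs stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Iterable


class BudgetExceeded(Exception):
    """Raised when a chained feature would exceed its inference budget."""


@dataclass
class InferenceBudget:
    max_calls: int        # hard cap on inference calls per user action
    max_tokens: int       # hard cap on total tokens across those calls
    calls_used: int = 0
    tokens_used: int = 0

    def charge(self, tokens: int) -> None:
        """Record one inference call, or refuse it if a cap would be exceeded."""
        if self.calls_used + 1 > self.max_calls:
            raise BudgetExceeded("call cap reached")
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded("token cap reached")
        self.calls_used += 1
        self.tokens_used += tokens

    def remaining_calls(self) -> int:
        """Calls still available; suitable for showing in the UI."""
        return self.max_calls - self.calls_used


def run_agent_steps(step_token_costs: Iterable[int], budget: InferenceBudget) -> int:
    """Run steps in order until they finish or the budget stops the chain.

    Returns the number of steps completed so the UI can report progress.
    """
    completed = 0
    for tokens in step_token_costs:
        try:
            budget.charge(tokens)
        except BudgetExceeded:
            break  # stop here and surface the limit instead of continuing silently
        completed += 1
    return completed
```

The point of routing every call through `charge` is that the remaining allowance is always known, so the interface can display it rather than failing unexpectedly mid-task.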
Example
An “Improve tone” button runs a single inference pass; a “Rewrite entire doc” button shows the estimated number of sections and asks for confirmation before firing ten chained calls.
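The confirm-before-chaining pattern in this example could be sketched as follows. The assumptions are that each section costs one inference call and that the UI supplies a `confirm` callback; `rewrite_document`, `run_inference`, and `confirm` are hypothetical names, not part of any real API.

```python
from typing import Callable, List


def rewrite_document(
    sections: List[str],
    run_inference: Callable[[str], str],
    confirm: Callable[[int], bool],
) -> List[str]:
    """Rewrite each section with one inference call per section.

    A single-section request runs immediately. A multi-section request
    first asks the user to confirm the estimated number of calls.
    """
    estimated_calls = len(sections)
    if estimated_calls > 1 and not confirm(estimated_calls):
        return sections  # user declined: no inference runs, text is unchanged
    return [run_inference(section) for section in sections]
```

In a real product `confirm` would open a dialog showing the estimate; here it is simply a function that receives the call count and returns whether to proceed.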
Common mistakes
- Hidden auto-retries that multiply inference calls without the user's knowledge.
- No offline or degraded mode when inference APIs fail.
- Treating inference as free in a UI that encourages unlimited regenerations.
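The first two mistakes can be avoided together by capping retries, reporting how many attempts were made, and falling back to a degraded result when the API stays down. The sketch below assumes a hypothetical `InferenceUnavailable` error and caller-supplied `call` and `fallback` functions; it is one possible shape, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Callable


class InferenceUnavailable(Exception):
    """Hypothetical error raised when the inference API cannot be reached."""


@dataclass
class InferenceResult:
    text: str
    attempts: int    # how many inference calls were actually made
    degraded: bool   # True when the fallback produced the text


def infer_with_fallback(
    call: Callable[[str], str],
    prompt: str,
    fallback: Callable[[str], str],
    max_attempts: int = 2,
) -> InferenceResult:
    """Try inference up to max_attempts times, then use the fallback.

    The attempt count is returned so the UI can show it instead of
    hiding the retries from the user.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            text = call(prompt)
        except InferenceUnavailable:
            continue  # retry, but only up to the cap
        return InferenceResult(text=text, attempts=attempts, degraded=False)
    # Every attempt failed: return a degraded result rather than an error screen.
    return InferenceResult(text=fallback(prompt), attempts=attempts, degraded=True)
```

A fallback might return the user's original text unchanged or a cached earlier result; the `degraded` flag lets the interface say so plainly.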