
Glossary | Product and performance | Updated June 17, 2026

Inference

Inference is the process of running a trained model on new inputs to produce outputs. It is the live “prediction” step that users experience as chat, classification, or generation.

Product decisions about model size, caching, batching, and region affect inference cost and the latency budget you design around.

What it means

Each user request triggers inference: tokens in, tokens out, optionally interleaved with tool calls and retrieval.
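As a rough illustration of "tokens in, tokens out", the cost of a single request can be estimated from its token counts. This is a minimal sketch that assumes per-token pricing with separate input and output rates; the `Pricing` class, the `request_cost` function, and the prices themselves are hypothetical and not taken from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Hypothetical prices in USD per one million tokens (illustrative only).
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Estimate the cost of one inference call from its token counts."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


# Assumed example values: a 2,000-token prompt and a 500-token reply.
pricing = Pricing(input_per_million=1.0, output_per_million=4.0)
cost = request_cost(input_tokens=2_000, output_tokens=500, pricing=pricing)
print(f"Estimated cost per request: ${cost:.4f}")
```

Multiplying this per-request figure by expected requests per user gives a first estimate of what a feature costs to run.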

Why designers should care

Features that chain many inference calls (agents, auto-rewrite loops) need budgets, caps, and user-visible cost or rate limits to stay usable.
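One way to picture a budget with caps is a small counter that every chained call must pass through. The sketch below is illustrative only: `InferenceBudget`, `BudgetExceeded`, and the specific limits are hypothetical names and values, not part of any particular product or API.

```python
class BudgetExceeded(Exception):
    """Raised when a chained feature would go past its call or token cap."""


class InferenceBudget:
    def __init__(self, max_calls: int, max_tokens: int) -> None:
        self.max_calls = max_calls
        self.max_tokens = max_tokens
        self.calls = 0
        self.tokens = 0

    def charge(self, tokens: int) -> None:
        """Record one inference call, or refuse it if a cap would be exceeded."""
        if self.calls + 1 > self.max_calls:
            raise BudgetExceeded("call cap reached")
        if self.tokens + tokens > self.max_tokens:
            raise BudgetExceeded("token cap reached")
        self.calls += 1
        self.tokens += tokens

    def remaining_calls(self) -> int:
        """Number of calls left, suitable for showing in the UI."""
        return self.max_calls - self.calls


# Assumed limits for one agent run: at most 3 calls and 1,000 tokens.
budget = InferenceBudget(max_calls=3, max_tokens=1_000)
budget.charge(200)
print(f"{budget.remaining_calls()} calls left in this run")
```

Surfacing `remaining_calls()` in the interface is one way to make the limit visible to the user instead of failing silently when the cap is hit.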

Example

An “Improve tone” button runs one inference pass; “Rewrite entire doc” shows the estimated number of sections and asks for confirmation before firing ten chained calls.
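The confirm-before-chaining flow in this example might look roughly like the sketch below. It assumes sections are separated by blank lines and that each section costs one inference call; `split_sections`, `rewrite_document`, `rewrite_section`, and `confirm` are hypothetical names, with `rewrite_section` standing in for a real model call.

```python
from typing import Callable, Optional


def split_sections(doc: str) -> list[str]:
    """Split a document into sections on blank lines (an assumed convention)."""
    return [part.strip() for part in doc.split("\n\n") if part.strip()]


def rewrite_document(
    doc: str,
    rewrite_section: Callable[[str], str],  # stand-in for one inference call
    confirm: Callable[[int], bool],         # shows the estimate, returns the user's choice
) -> Optional[str]:
    sections = split_sections(doc)
    # Show the estimated number of calls and stop if the user declines.
    if not confirm(len(sections)):
        return None
    # One chained inference call per section, only after confirmation.
    return "\n\n".join(rewrite_section(section) for section in sections)


doc = "Intro paragraph.\n\nSecond section.\n\nClosing notes."
result = rewrite_document(
    doc,
    rewrite_section=str.upper,       # placeholder for a model call
    confirm=lambda count: count <= 10,  # placeholder for a confirmation dialog
)
print(result)
```

No inference runs until `confirm` returns `True`, so declining the dialog costs nothing.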

Common mistakes

• Hidden auto-retries that multiply inference without user awareness.
• No offline or degraded mode when inference APIs fail.
• Treating inference as free in UI that encourages unlimited regenerations.
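The first two mistakes can be avoided with retries that are capped and reported, plus a fallback when every attempt fails. The following is a minimal sketch under stated assumptions: `InferenceError`, `call_with_visible_retries`, the `infer` callable, and the `on_retry` hook are all hypothetical and do not correspond to a specific SDK.

```python
from typing import Callable, Optional


class InferenceError(Exception):
    """Hypothetical error type raised when an inference call fails."""


def call_with_visible_retries(
    infer: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
    on_retry: Optional[Callable[[int, Exception], None]] = None,
    fallback: Optional[str] = None,
) -> Optional[str]:
    """Try an inference call a bounded number of times, reporting each retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return infer(prompt)
        except InferenceError as exc:
            # Report the failed attempt so the UI can show "Retrying..." to the user.
            if attempt < max_attempts and on_retry is not None:
                on_retry(attempt, exc)
    # Degraded mode: every attempt failed, so return the fallback instead of raising.
    return fallback


def unavailable(prompt: str) -> str:
    raise InferenceError("service unavailable")


reply = call_with_visible_retries(
    unavailable,
    "Improve tone",
    max_attempts=2,
    on_retry=lambda attempt, exc: print(f"Attempt {attempt} failed, retrying"),
    fallback="Suggestions are unavailable right now. Your draft is unchanged.",
)
print(reply)
```

Capping `max_attempts` bounds the extra inference spend, and the fallback message keeps the feature usable when the API is down.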

Related patterns

• Streaming: Token by token
• Model Selection UI: Let users choose AI model (speed vs quality)
