Latency for AI Designers: Definition, Examples, and UX Tips

What it means

Latency is the time between a user's request and output they can use. Network transfer, queueing, retrieval, tool calls, and generation length all add wait time before users can read, edit, or act on output.
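Because these sources add up, it can help to write them down as a simple latency budget. The sketch below is illustrative only: the `Stage` type, the helper names, and every number in it are assumptions made for the example, not measurements from a real product.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One contributor to the wait a user experiences."""
    name: str
    seconds: float


def total_latency(stages):
    """Sum the wait across all stages of a single request."""
    return sum(stage.seconds for stage in stages)


def slowest_stage(stages):
    """Return the stage that contributes the most wait time."""
    return max(stages, key=lambda stage: stage.seconds)


# Hypothetical numbers for one request; replace with your own measurements.
stages = [
    Stage("network", 0.15),
    Stage("queue", 0.30),
    Stage("retrieval", 0.60),
    Stage("tool calls", 1.20),
    Stage("generation", 2.50),
]

print(f"total wait: {total_latency(stages):.2f}s")
print(f"largest contributor: {slowest_stage(stages).name}")
```

A breakdown like this shows designers where a progress signal or partial output would help most, since the largest stage is usually the one users notice.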

Why designers should care

Without latency-aware UX, users double-submit prompts, abandon flows, or distrust “thinking” states that hang with no progress signal.

Example

A report generator streams section headings first, shows elapsed time, and lets users cancel, while heavy charts render in a background step that sends an email notification when it finishes.
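A minimal sketch of the streaming, elapsed-time, and cancel parts of this example, using Python's `asyncio`. The function names (`stream_headings`, `run_report`, `demo`) and the simulated delay are hypothetical stand-ins for a real model stream and UI; the background chart step and email notification are left out.

```python
import asyncio
import time


async def stream_headings(headings, delay=0.01):
    """Stand-in for a model stream: yield one section heading at a time."""
    for heading in headings:
        await asyncio.sleep(delay)  # simulated generation time
        yield heading


async def run_report(headings, cancel_event, on_update):
    """Show headings as they arrive, report elapsed time, honor cancel."""
    start = time.monotonic()
    shown = []
    async for heading in stream_headings(headings):
        if cancel_event.is_set():
            # Stop early but keep the partial output already shown.
            return {"status": "cancelled", "shown": shown}
        shown.append(heading)
        on_update(heading, time.monotonic() - start)
    return {"status": "complete", "shown": shown}


async def demo():
    cancel = asyncio.Event()
    updates = []

    def on_update(heading, elapsed):
        updates.append((heading, elapsed))
        if len(updates) == 2:
            cancel.set()  # simulate the user pressing Cancel

    return await run_report(["Summary", "Method", "Results"], cancel, on_update)


result = asyncio.run(demo())
print(result)
```

In this sketch, cancelling keeps the partial output instead of discarding it, so the user still has the headings that were already shown.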

Common mistakes

• Blocking the entire screen with no partial output during long runs.
• Promising “real-time” when retrieval plus tools routinely exceed two seconds.
• Ignoring how agent loops compound latency for users on mobile or low-bandwidth connections.
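One way to avoid the "real-time" mistake above is to check measured latencies before choosing the wording. This is a sketch under stated assumptions: the two-second threshold comes from the list, while treating "routinely" as half or more of requests, the function names, the label text, and the sample values are all made up for illustration.

```python
def routinely_exceeds(samples_seconds, threshold_seconds=2.0, routine_share=0.5):
    """True when at least `routine_share` of samples are slower than the threshold."""
    if not samples_seconds:
        raise ValueError("need at least one latency sample")
    slow = sum(1 for s in samples_seconds if s > threshold_seconds)
    return slow / len(samples_seconds) >= routine_share


def speed_label(samples_seconds):
    """Pick UI copy that matches what users will actually experience."""
    if routinely_exceeds(samples_seconds):
        return "Usually takes a few seconds"
    return "Real-time"


# Hypothetical end-to-end latencies in seconds (retrieval plus tools included).
samples = [0.8, 2.4, 3.1, 2.2, 1.9]
print(speed_label(samples))
```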

Related terms

Streaming Response: The client renders partial text as it arrives over HTTP or WebSocket until the model signals completion or stop.
Inference: Each user request triggers inference: tokens in, tokens out, optionally interleaved with tool calls and retrieval.
Token: Models process and bill text as tokens; a rough rule is ~¾ of a word per token in English, but code and symbols can consume more.
Agent: The model decides sequences of actions (search, draft, click, call APIs) within policies until the task completes or hits a human checkpoint.
Workflow: Defined stages (inputs → generate → review → export) often spanning people, systems, and multiple model calls with clear deliverables.
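
The token rule of thumb above also gives a rough way to estimate how long generation will take for a given output length. This is a sketch, not a measurement: the ¾-word ratio is only approximate for English prose, and the throughput of 50 tokens per second is a hypothetical figure that varies by model and load.

```python
import math

WORDS_PER_TOKEN = 0.75  # rough English rule of thumb; code and symbols use more tokens


def estimate_tokens(word_count):
    """Approximate token count for English text of the given length."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def estimate_generation_seconds(word_count, tokens_per_second):
    """Approximate generation time; tokens_per_second is an assumed throughput."""
    return estimate_tokens(word_count) / tokens_per_second


words = 300
tokens = estimate_tokens(words)
seconds = estimate_generation_seconds(words, tokens_per_second=50)
print(f"{words} words is about {tokens} tokens, roughly {seconds:.1f}s at 50 tokens/s")
```

An estimate like this is enough to decide whether a response needs streaming or a progress indicator rather than a plain spinner.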
