Compute for AI Designers: Definition, Examples, and UX Tips

What it means

Compute is the processing power an AI request consumes. Every request runs on hardware that has a cost, a finite capacity, and a queue, and the load is heaviest for large models, image diffusion, and long agent runs.

Why designers should care

Compute constraints drive tiering (“Pro uses a larger model”), queue UX, offline modes, and honest messaging when a feature is unavailable in a region or on a weak device.

Example

A video summarizer offers “Fast (cloud)” vs “Private (on-device)” with estimated wait times and a note when the device cannot run the local model.
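The picker in this example can be sketched as a small function that returns what each option should display. This is only an illustration: the 8 GB memory threshold, the field names, and the note copy are hypothetical, and a real product would use its own capability check and measured wait times.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical threshold: memory the local model is assumed to need.
LOCAL_MODEL_MIN_RAM_GB = 8.0


@dataclass
class ModeOption:
    label: str                        # text shown in the picker
    available: bool                   # whether the user can select this mode
    est_wait_seconds: Optional[int]   # None when the mode cannot run
    note: str                         # explanation shown under the option


def build_mode_options(device_ram_gb: float,
                       cloud_queue_seconds: int,
                       cloud_run_seconds: int,
                       local_run_seconds: int) -> List[ModeOption]:
    """Return the two picker options with wait estimates and notes."""
    cloud = ModeOption(
        label="Fast (cloud)",
        available=True,
        # Cloud wait includes time spent in the queue, not just processing.
        est_wait_seconds=cloud_queue_seconds + cloud_run_seconds,
        note="Video is uploaded for processing.",
    )
    if device_ram_gb >= LOCAL_MODEL_MIN_RAM_GB:
        local = ModeOption(
            label="Private (on-device)",
            available=True,
            est_wait_seconds=local_run_seconds,
            note="Video stays on this device.",
        )
    else:
        # Keep the option visible but disabled, with an explanation.
        local = ModeOption(
            label="Private (on-device)",
            available=False,
            est_wait_seconds=None,
            note="This device cannot run the local model.",
        )
    return [cloud, local]
```

Keeping the unavailable option visible with a reason, rather than hiding it, is one way to stay honest about why the private mode is missing.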

Common mistakes

• Unlimited heavy features with no queue, tier, or degrade path when compute is saturated.
• Hiding cloud-only dependency from users who expect on-device privacy.
• No capacity-planning UX in place before launch traffic drives up queue times.
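To make the first mistake concrete, here is a minimal sketch of a degrade path that chooses between running immediately, queueing with an estimate, and falling back to a smaller model. The thresholds, the wait formula (queue depth times average job time, divided by workers), and the message copy are all assumptions for illustration, not a recommended policy.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them against real usage data.
RUN_NOW_MAX_WAIT_S = 30     # below this, start without mentioning a queue
QUEUE_MAX_WAIT_S = 300      # above this, stop queueing and degrade instead


@dataclass
class Decision:
    action: str    # "run", "queue", or "degrade"
    message: str   # copy shown to the user


def plan_request(queue_depth: int, avg_job_seconds: float, workers: int) -> Decision:
    """Pick a user-facing path from a rough wait estimate."""
    if workers <= 0:
        # No capacity at all: degrade rather than queue indefinitely.
        return Decision("degrade", "High demand. Using a faster, smaller model for now.")

    est_wait = queue_depth * avg_job_seconds / workers
    if est_wait <= RUN_NOW_MAX_WAIT_S:
        return Decision("run", "Starting now.")
    if est_wait <= QUEUE_MAX_WAIT_S:
        minutes = max(1, round(est_wait / 60))
        return Decision("queue", f"High demand. Estimated wait: about {minutes} min.")
    return Decision("degrade", "High demand. Using a faster, smaller model for now.")
```

For instance, under these assumed thresholds, 24 queued jobs averaging 20 seconds across 4 workers gives a 120-second estimate, so the user sees the queue message with a two-minute wait.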

Related terms

Inference: Each user request triggers inference: tokens in, tokens out, optionally interleaved with tool calls and retrieval.
Latency: Network, queue, retrieval, tool calls, and generation length all add wait time before users can read, edit, or act on output.
Fine-Tuning: Additional training (full or lightweight) on your labeled data so the model’s default behavior skews toward your product’s patterns.
Token Burn Rate: Tokens in plus tokens out, multiplied by call frequency and retries, equals burn rate. Long context, verbose prompts, and chained inference multiply it quickly.
Model: A model is a packaged set of learned weights and settings that maps prompts plus context to generated text, classifications, or tool calls.
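The Token Burn Rate arithmetic above can be worked through in a few lines. This sketch assumes retries are expressed as an average number of extra attempts per call (so 0.25 means one retry for every four calls); the function name, parameters, and sample numbers are illustrative only.

```python
def token_burn_rate(tokens_in: int, tokens_out: int,
                    calls_per_hour: int, retry_rate: float = 0.0) -> float:
    """Tokens consumed per hour.

    retry_rate is the assumed average number of extra attempts per call.
    """
    return (tokens_in + tokens_out) * calls_per_hour * (1 + retry_rate)


# Baseline: 1,500 tokens in, 500 out, 100 calls per hour, 25% retried.
baseline = token_burn_rate(1_500, 500, 100, retry_rate=0.25)      # 250000.0

# Same traffic with a long context: only the input size changes.
long_context = token_burn_rate(6_000, 500, 100, retry_rate=0.25)  # 812500.0
```

With these sample numbers, quadrupling the prompt size alone raises the hourly burn from 250,000 to 812,500 tokens, which is why long context is worth surfacing in usage meters and tier limits.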

Related patterns

Model Selection UI
