
Multimodal

Multimodal AI can process and generate more than text: images, audio, video, and structured files in the same workflow.

Capability varies by model, but multimodal systems power screenshot analysis, voice interfaces, document Q&A, and visual copilots.

What it means

A multimodal model accepts or produces multiple media types, not just strings of text, often in one conversation or agent run.

Why designers should care

Multimodal features need affordances specific to each input type, clear latency expectations, accessibility fallbacks, and honest disclosure when a modality is read-only or unsupported.
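One way to keep those limits honest is to drive interface copy from a single capability table rather than from marketing claims. The sketch below is illustrative only: the `Support` levels, the `CAPABILITIES` values, and the `describe` helper are hypothetical names, and the table describes an imaginary model, not any real one.

```python
from enum import Enum


class Support(Enum):
    NONE = "unsupported"
    INPUT_ONLY = "read-only"  # the model can read this modality but not produce it
    INPUT_AND_OUTPUT = "read and generate"


# Hypothetical capability table for an imaginary model. Real values
# depend on the model and should come from its documentation.
CAPABILITIES = {
    "text": Support.INPUT_AND_OUTPUT,
    "image": Support.INPUT_ONLY,
    "audio": Support.NONE,
}


def describe(modality: str) -> str:
    """Return a short, user-facing statement of what the product can do."""
    # Anything missing from the table is treated as unsupported.
    support = CAPABILITIES.get(modality, Support.NONE)
    if support is Support.NONE:
        return f"{modality}: not supported yet"
    if support is Support.INPUT_ONLY:
        return f"{modality}: can be analyzed, but not generated"
    return f"{modality}: can be analyzed and generated"
```

Calling `describe("image")` yields copy that states the read-only limit directly, and a modality absent from the table falls back to "not supported yet" instead of failing silently.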

Example

A design copilot accepts PNG mockups and returns annotated feedback, plus optional generated code for one component, with clear labels on which outputs are vision-based and which are inferred.

Common mistakes

  • Marketing a product as “multimodal” when it only accepts text and URLs.
  • Offering no fallback when vision or audio fails on low-quality inputs.
  • Using the same UI chrome for text chat and heavy media uploads, with no progress indicator or file-size guidance.
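The last two mistakes can be caught before a file ever reaches the model. This is a minimal sketch under stated assumptions: the 10 MB limit, the 400 px width floor, and the `check_image_upload` helper are all invented for illustration, and a real product would take its thresholds from the model and platform it uses.

```python
from dataclasses import dataclass

MAX_UPLOAD_MB = 10        # assumed limit, for illustration only
MIN_IMAGE_WIDTH_PX = 400  # assumed quality floor, for illustration only


@dataclass
class UploadCheck:
    accepted: bool
    message: str  # text shown to the user, including any fallback


def check_image_upload(size_mb: float, width_px: int) -> UploadCheck:
    """Validate an image upload and return user-facing guidance."""
    if size_mb > MAX_UPLOAD_MB:
        # State both the actual size and the limit so the user can act on it.
        return UploadCheck(
            False, f"File is {size_mb:.1f} MB; the limit is {MAX_UPLOAD_MB} MB."
        )
    if width_px < MIN_IMAGE_WIDTH_PX:
        # Low-quality input: offer a text fallback instead of failing silently.
        return UploadCheck(
            False,
            "Image is too small to analyze reliably. "
            "Upload a larger version or describe it in text instead.",
        )
    return UploadCheck(
        True, "Image accepted. Analysis may take longer than a text reply."
    )
```

Each branch returns a message the interface can show as is: the size guidance names the limit, the quality failure offers a text path, and the success case sets a latency expectation.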
