Multimodal
Multimodal AI can process and generate more than text: images, audio, video, and structured files in the same workflow.
Capability varies by model, but multimodal systems power screenshot analysis, voice interfaces, document Q&A, and visual copilots.
What it means
A multimodal model accepts or produces multiple media types, not just strings of text, often in one conversation or agent run.
Why designers should care
Multimodal features need input-specific affordances, latency expectations, accessibility fallbacks, and honest limits when a modality is input-only (the model can read it but not generate it) or unsupported.
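One way to keep those limits honest is to drive the interface copy from a single capability table instead of hard-coding it per screen. The sketch below is illustrative only: the `ModalitySupport` type, the `CAPABILITIES` table, the `ui_hint` function, and every value in them are hypothetical and do not describe any particular model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalitySupport:
    can_input: bool   # the user may upload or record this modality
    can_output: bool  # the model may generate this modality
    fallback: str     # what the UI suggests when the modality is unavailable


# Hypothetical capability table; real values depend on the model in use.
CAPABILITIES = {
    "text": ModalitySupport(True, True, ""),
    "image": ModalitySupport(True, False, "Describe the image in text instead."),
    "audio": ModalitySupport(False, False, "Type your message instead."),
}


def ui_hint(modality: str) -> str:
    """Return the short label the UI shows next to a modality control."""
    support = CAPABILITIES.get(modality)
    if support is None or not (support.can_input or support.can_output):
        # Unknown or fully unsupported modality: say so and offer a fallback.
        fallback = support.fallback if support else "Use text instead."
        return f"{modality}: not supported. {fallback}"
    if support.can_input and not support.can_output:
        return f"{modality}: upload only; responses come back as text."
    if support.can_output and not support.can_input:
        return f"{modality}: generated only; cannot be uploaded."
    return f"{modality}: supported for input and output."
```

Under these assumptions, `ui_hint("image")` yields an input-only label, while an unlisted modality such as `"video"` falls back to a plain "not supported" message instead of a silent failure.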
Example
A design copilot accepts PNG mockups and returns annotated feedback plus optional text-to-code for one component, with clear labels on which outputs are vision-based vs inferred.
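The labeling in this example could be modeled by tagging every piece of feedback with its provenance before it reaches the interface. This is a minimal sketch, assuming a hypothetical `Feedback` record and label wording; neither comes from a real copilot.

```python
from dataclasses import dataclass
from typing import Literal

# Provenance of a piece of feedback: read from the image, or guessed.
Source = Literal["vision", "inferred"]

# Hypothetical label wording shown to the user.
LABELS = {"vision": "Seen in mockup", "inferred": "Inferred"}


@dataclass(frozen=True)
class Feedback:
    text: str
    source: Source


def render(item: Feedback) -> str:
    """Prefix each feedback line with a visible provenance label."""
    return f"[{LABELS[item.source]}] {item.text}"
```

For instance, `render(Feedback("Button contrast looks low", "vision"))` would display as `[Seen in mockup] Button contrast looks low`, so the user can tell observations apart from guesses.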
Common mistakes
- Marketing “multimodal” when the product only accepts text and URLs.
- No fallback when vision or audio fails on low-quality inputs.
- Same UI chrome for text chat and heavy media uploads with no progress or size guidance.
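The last two mistakes can often be caught before anything is sent to the model. The following is a rough sketch of a pre-upload check that states the size limit and offers a fallback for low-quality images. The constants `MAX_UPLOAD_BYTES` and `MIN_IMAGE_WIDTH`, the `UploadCheck` type, and the message wording are assumptions chosen for illustration, not values from any real product.

```python
from dataclasses import dataclass

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed 10 MB limit
MIN_IMAGE_WIDTH = 400                # assumed minimum width in pixels


@dataclass(frozen=True)
class UploadCheck:
    accepted: bool
    message: str  # shown to the user next to the upload control


def check_image_upload(size_bytes: int, width_px: int) -> UploadCheck:
    """Validate an image before upload and return user-facing guidance."""
    if size_bytes > MAX_UPLOAD_BYTES:
        limit_mb = MAX_UPLOAD_BYTES // (1024 * 1024)
        return UploadCheck(False, f"File is too large. The limit is {limit_mb} MB.")
    if width_px < MIN_IMAGE_WIDTH:
        # Low-quality input: offer a fallback instead of failing silently.
        return UploadCheck(
            False,
            "Image may be too small to read. Upload a larger version or describe it in text.",
        )
    return UploadCheck(True, "Uploading...")
```

Showing the limit in the rejection message gives size guidance at the moment it matters, and the low-resolution branch gives the user a text fallback instead of an unexplained poor answer.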