
Multimodal Input

Multimodal input lets users combine text with images, audio, video, or files in one request so the model can reason across media types.

It underlies screenshot critique, voice notes, and “look at this PDF” flows in modern copilots.

What it means

The interface accepts multiple input modalities in a single turn: type a question, attach a photo, paste a link, or record audio, then send one combined prompt.
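One way to picture "one combined prompt" is as a single turn holding an ordered list of parts, each tagged with its modality. The sketch below is illustrative only: `Part`, `Turn`, and the file name are hypothetical and do not correspond to any specific model API.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical modality labels, taken from the input types named above.
MODALITIES = {"text", "image", "audio", "video", "file", "link"}


@dataclass
class Part:
    modality: str
    content: str  # the text itself, or a reference (path or URL) to the media


@dataclass
class Turn:
    parts: List[Part] = field(default_factory=list)

    def add(self, modality: str, content: str) -> "Turn":
        # Reject labels the interface does not know how to render or send.
        if modality not in MODALITIES:
            raise ValueError(f"unknown modality: {modality}")
        self.parts.append(Part(modality, content))
        return self

    def modalities(self) -> List[str]:
        # Unique modalities in the order the user added them.
        seen: List[str] = []
        for part in self.parts:
            if part.modality not in seen:
                seen.append(part.modality)
        return seen


# A typed question and an attached screenshot travel together in one turn.
turn = (
    Turn()
    .add("text", "Check contrast on this modal")
    .add("image", "modal-screenshot.png")
)
```

Keeping the parts in one ordered list means the preview shown before send and the payload that is sent can be built from the same structure.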

Why designers should care

Multimodal input needs clear attachment affordances, preview, size limits, privacy cues, and failure states when a file type or size is unsupported.
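The size-limit and failure-state requirements can be sketched as a check that runs before send and returns a message the interface can show next to the attachment. The accepted types and limits below are assumptions chosen for illustration, not values from any particular product.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024

# Hypothetical per-type size limits; a real product would define its own.
MAX_BYTES = {
    "image/png": 10 * MB,
    "image/jpeg": 10 * MB,
    "application/pdf": 25 * MB,
}


@dataclass
class Attachment:
    name: str
    mime_type: str
    size_bytes: int


def validate(attachment: Attachment) -> Optional[str]:
    """Return a user-facing error message, or None if the file is acceptable."""
    limit = MAX_BYTES.get(attachment.mime_type)
    if limit is None:
        # Unsupported type: say what is accepted instead of failing silently.
        supported = ", ".join(sorted(MAX_BYTES))
        return (
            f"{attachment.name}: {attachment.mime_type} is not supported. "
            f"Supported types: {supported}."
        )
    if attachment.size_bytes > limit:
        # Oversized file: state both the actual size and the limit.
        size_mb = attachment.size_bytes / MB
        return (
            f"{attachment.name} is too large ({size_mb:.1f} MB). "
            f"The limit is {limit // MB} MB."
        )
    return None
```

Returning a specific message rather than a boolean gives the failure state something concrete to display, such as which limit was exceeded or which types are accepted.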

Example

A UX review tool accepts a Figma screenshot plus “check contrast on this modal”; the model references regions in the image and returns annotated findings.

Common mistakes

  • Attachments with no preview or remove control before send.
  • Treating image upload like generic file attach with no capture guidance.
  • No way to describe the image in text (alt text or a typed description) when an upload fails on a slow network.
