A modality is a type of data, such as text, images, audio, or video. Current commercial multimodal models typically handle some combination of the following inputs and outputs:
- Text-to-text: the classic LLM capability.
- Image-to-text (vision): describing, analyzing, and answering questions about images.
- Text-to-image: generating images from text descriptions (e.g., DALL-E, Stable Diffusion).
- Audio-to-text: transcribing and understanding speech.
- Video understanding: interpreting sequences of frames over time.
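
One way to make the list above concrete is to treat each capability as a pairing of an input modality with an output modality. The Python sketch below is illustrative only: the names `Modality`, `Capability`, `CAPABILITIES`, and `supports` are hypothetical and do not come from any real library, and it assumes that video understanding produces text, which the list does not state.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    """The data types mentioned in the list above."""
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


@dataclass(frozen=True)
class Capability:
    """A capability maps one input modality to one output modality."""
    name: str
    source: Modality
    target: Modality


# Hypothetical catalogue mirroring the five bullets above.
CAPABILITIES = [
    Capability("text-to-text", Modality.TEXT, Modality.TEXT),
    Capability("image-to-text", Modality.IMAGE, Modality.TEXT),
    Capability("text-to-image", Modality.TEXT, Modality.IMAGE),
    Capability("audio-to-text", Modality.AUDIO, Modality.TEXT),
    # Assumption: the output of video understanding is text.
    Capability("video understanding", Modality.VIDEO, Modality.TEXT),
]


def supports(source: Modality, target: Modality) -> bool:
    """Return True if the catalogue contains a capability for this pairing."""
    return any(c.source == source and c.target == target for c in CAPABILITIES)


if __name__ == "__main__":
    for cap in CAPABILITIES:
        print(f"{cap.name}: {cap.source.value} -> {cap.target.value}")
```

A real model usually accepts several input modalities at once (for example, an image together with a text question), so a fuller representation would likely use a set of input modalities rather than a single one.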