Multimodal Prompts
Tool: ChatGPT-4o for iOS.
This is an emerging capability of several systems. As of June 2024, these capabilities are limited. More should be available in coming months.
The modal in multimodal refers to different modes of communication – text, image, audio, video, and other data. For now, the different media may have to be uploaded as files, but capabilities are coming to allow use of the camera as well as microphone for real-time visual and spoken interactions. These are likely to include interpretation of gestures and moods.
Preparation
Any AI chatbot that allows uploading of files or images has at least some capability to use multimodal prompts. Some experimentation and research may be needed to figure out what the tool can actually do. (Asking the AI about its capabilities may result in incorrect or misleading information.) The specific device interface also affects capabilities and use. The speech and audio abilities of the ChatGPT iOS app, for instance, allow wider possibilities than the web interface on a laptop.
Ingredients
- Media file(s).
- Instructions on what to do with the file(s).
- Suggested actions might include:
- Transcribe
- Extract
- Summarize
- Ask questions
An Ai App with Voice Capabilities
Try this prompt in a voice-capable AI app:
Write some text on paper.
Photograph that image with your phone.
In your chosen AI app:
- Upload the image as an image or file.
- Add the text prompt: “Transcribe the text of this image.”
- Activate the voice feature and ask it: “Read back the transcribed text.”