Multimodal Prompts

a simple multimodal prompt involving uploading an image of handwritten text, and its transcription. The input involved an image and text. The output involved text and audio.
Example of a simple multimodal prompt involving uploading an image of handwritten text, and its transcription. The input involved an image and text. The output involved text and audio.
Prompt: [upload an image] then add the text prompt: “Transcribe the text of this image.” Next, click the audio button (headphones icon in the screenshot) and ask it to “Read back the transcribed text.”

Tool: ChatGPT-4o for iOS.

This is an emerging capability of several systems. As of June 2024, these capabilities are limited. More should be available in coming months.

The modal in multimodal refers to different modes of communication – text, image, audio, video, and other data. For now, the different media may have to be uploaded as files, but capabilities are coming to allow use of the camera as well as microphone for real-time visual and spoken interactions. These are likely to include interpretation of gestures and moods.

Preparation

Any AI chatbot that allows uploading of files or images has at least some capability to use multimodal prompts. Some experimentation and research may be needed to figure out what the tool can actually do. (Asking the AI about its capabilities may result in incorrect or misleading information.) The specific device interface also affects capabilities and use. The speech and audio abilities of the ChatGPT iOS app, for instance, allow wider possibilities than the web interface on a laptop.

Ingredients

  • Media file(s).
  • Instructions on what to do with the file(s).
  • Suggested actions might include:
    • Transcribe
    • Extract
    • Summarize
    • Ask questions

An Ai App with Voice Capabilities

Try this prompt in a voice-capable AI app:

Write some text on paper.
Photograph that image with your phone.
In your chosen AI app:

  • Upload the image as an image or file.
  • Add the text prompt: “Transcribe the text of this image.”
  • Activate the voice feature and ask it: “Read back the transcribed text.”

 

License

Icon for the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

AI Cookbook: Recipes and More from the University of Missouri Copyright © 2024 by University of Missouri is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.

Share This Book