For a long time, I used models separately: Natural Language Processing for text and Computer Vision for images. Each was powerful, but they missed context. An NLP model couldn’t see the image you meant, and a CV model couldn’t understand your question. Now, things are different. Multimodal AI is here, and it’s easier to use than you might expect. Let’s learn about it by building a Visual Question Answering App with Python.
What is Multimodal AI?
Think about how you understand the world. If you see a picture of a dog in a park and someone asks, “What is the dog doing?” you don’t just analyze the text of the question or the pixels of the image. You instantly fuse both. You see the dog (vision) and understand the query (language) to form an answer: “It’s catching a frisbee.”
That, in a nutshell, is multimodal AI. It’s a system that can process, understand, and reason about information from multiple modalities (such as text, images, and audio) simultaneously.
The specific task we’ll build today is called Visual Question Answering (VQA). It’s a classic multimodal task:
- Input: An image + a text-based question about the image.
- Output: A text-based answer.
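Before reaching for a real model, it helps to see the VQA contract as a plain function signature. This is a minimal sketch with a hypothetical rule-based stub standing in for the model; everything about the stub's logic is illustrative, not how a real VQA model works:

```python
from PIL import Image

# The VQA contract: (image, question) -> answer.
def answer_question(image: Image.Image, question: str) -> str:
    # A real multimodal model would fuse vision and language here.
    # This placeholder only inspects basic image properties.
    if "size" in question.lower():
        return f"{image.width}x{image.height}"
    return "unknown"

img = Image.new("RGB", (640, 480), color="green")
print(answer_question(img, "What size is the image?"))  # 640x480
```

The rest of this article replaces the stub's body with an actual vision-and-language model while keeping this same input/output shape.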
Visual Question Answering App
Let’s build a web app that lets you upload any image, ask a question about it, and get an answer from an AI. We can do this in about 20 lines of Python, thanks to the amazing open-source community.
We’ll use two key tools:
- Hugging Face transformers: To download and run a powerful, pre-trained VQA model.
- Gradio: To instantly create a simple, shareable web interface for our model.
Step 1: Set Up Your Environment
First, you need to install the necessary libraries. Open your terminal and run:
pip install gradio transformers torch Pillow
Here’s a breakdown of each of these libraries:
- gradio builds the UI.
- transformers gets us the AI model.
- torch is a deep learning framework.
- Pillow (PIL) helps us handle the images.
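If you haven't used Pillow before, here's a quick taste of the kind of image handling it does behind the scenes. This snippet creates a test image in memory (standing in for an uploaded file) and applies two typical pre-processing steps:

```python
from PIL import Image

# Create a small test image in memory (stands in for an uploaded file).
img = Image.new("RGB", (200, 100), color="steelblue")

# Typical pre-processing a vision model's processor might apply:
img = img.convert("RGB")       # ensure 3 color channels
thumb = img.resize((64, 32))   # resize to a fixed shape
print(thumb.size, thumb.mode)  # (64, 32) RGB
```

In our app, the Hugging Face processor handles this kind of work for us, so we never call these methods directly.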
Step 2: Building the App
Create a new Python file named app.py and write the following code. I’ve added comments to explain every single line:
import gradio as gr
from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image

# 1. Load the pre-trained model and its processor
# A "processor" prepares the data (image + text) for the model.
# We're using a "ViLT" (Vision-and-Language Transformer) model
# fine-tuned for visual question answering.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# 2. Define the core "prediction" function
# This function will be called every time a user clicks "Submit".
def answer_question(image, text):
    try:
        # 3. Prepare the inputs
        # The processor converts the raw image and text query into
        # the specific numerical format the model expects.
        encoding = processor(image, text, return_tensors="pt")

        # 4. Run the model
        # We pass the processed inputs to the model...
        outputs = model(**encoding)
        logits = outputs.logits

        # 5. Decode the answer
        # The model's raw output ("logits") is just a set of numbers.
        # We find the highest-scoring number (the "argmax") and use
        # the model's config to turn it back into a readable word.
        idx = logits.argmax(-1).item()
        answer = model.config.id2label[idx]
        return answer
    except Exception as e:
        print(f"Error: {e}")
        return "Sorry, I had trouble processing that. Try a different image or question."

# 6. Create the Gradio web interface
# This one call builds the entire UI!
iface = gr.Interface(
    fn=answer_question,  # The function to call
    inputs=[
        gr.Image(type="pil"),  # An image upload box (provides a PIL image)
        gr.Textbox(label="Ask a question about the image...")  # A text input box
    ],
    outputs=gr.Textbox(label="Answer"),  # A text output box
    title="🤖 Multimodal AI: Visual Question Answering",
    description="Upload an image and ask any question about it. (Model: dandelin/vilt-b32-finetuned-vqa)"
)

# 7. Launch the app!
iface.launch()

Step 3: Run Your App
Go back to your terminal, make sure you’re in the same directory as your app.py file, and run:
python app.py
Your terminal will show a local URL (typically http://127.0.0.1:7860). Open that link in your browser.
That’s it! You now have a running multimodal AI application. Upload a picture of your pet, your room, or a landscape, and ask it questions like:
- “What colour is the cat?”
- “How many chairs are in the image?”
- “Is there a person on the beach?”
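If you're curious what step 5 of the code (the argmax + id2label lookup) actually does, here's a toy version in plain Python. The scores and labels below are made up for illustration; the real model scores a few thousand candidate answers stored in model.config.id2label:

```python
# Toy illustration of the decoding step: the model emits one score
# per candidate answer, and we pick the label with the highest score.
logits = [0.2, 3.1, -1.0, 0.7]
id2label = {0: "no", 1: "frisbee", 2: "cat", 3: "yes"}

idx = max(range(len(logits)), key=lambda i: logits[i])  # argmax
print(id2label[idx])  # frisbee
```

This is also why VQA answers from this model are short: the model classifies over a fixed answer vocabulary rather than generating free-form sentences.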
Final Words
By building this simple Visual Question Answering app, you’ve done more than just link two libraries. You’ve created a system that perceives the world in a more human-like way. This is the foundation for everything from apps that describe the world to the visually impaired to creative co-pilots that can brainstorm ideas based on a sketch and a conversation.
I hope you liked this article on building a Visual Question Answering App using Python. Feel free to ask your questions in the comments section below. You can follow me on Instagram for many more resources.
