Build a Multimodal AI App Using Gemini API

A few years ago, building an app that could process both images and text at once was a huge challenge. You had to set up an OCR pipeline for text, use a separate object detection model, and then connect everything to an NLP model, hoping your code worked smoothly. Now, multimodal AI models have made this process much simpler. In this article, I’ll show you how to build a Multimodal AI App using the Gemini API.

Multimodal AI App Using Gemini API: Getting Started

We’ll use a simple set of tools to build our app:

Google Generative AI SDK: To communicate with the model.
Pillow (PIL): To handle image processing in memory.
Streamlit: To build the user interface in pure Python.

Install these libraries to get started:

pip install streamlit google-generativeai pillow

Before we begin, keep in mind that some Gemini API features, like multimodal support, might need you to set up a billing account in your Google Cloud project. If you want to use the Gemini API for free, check out my earlier article on API setup.

Let’s start building our Multimodal AI App with the Gemini API.

Step 1: Configuration and Initialization

First, import the necessary libraries and set up authentication:

import streamlit as st
import google.generativeai as genai
from PIL import Image

# 1. Configure API Key
genai.configure(api_key="Your_API_Key")

# 2. Initialize Model
model = genai.GenerativeModel("gemini-2.0-flash")

We’re using gemini-2.0-flash here. For interactive apps, speed matters a lot. The Flash version is very fast and handles multimodal tasks well, so it’s a great fit for web apps where users want quick responses.
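
The listing in Step 1 hard-codes the API key, which is fine for a quick local test but easy to leak once the file is shared or committed. Below is a minimal sketch of one alternative: reading the key from an environment variable. The helper name load_api_key and the variable name GEMINI_API_KEY are assumptions chosen for illustration, not something the SDK requires.

```python
import os


def load_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is an assumption for this sketch; use whatever
    name fits your deployment.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a readable message instead of a confusing API error later.
        raise RuntimeError(
            f"Set the {env_var} environment variable before starting the app."
        )
    return key


# In the app, this call would replace the hard-coded string:
# genai.configure(api_key=load_api_key())
```

With this in place, you would set the variable in your shell before running the Streamlit app, and the key never appears in the source file.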

Step 2: Building the Streamlit UI

Streamlit is a great tool for data scientists and ML engineers. It lets you build interactive web apps without needing to use HTML or React:

# 3. Streamlit UI
st.set_page_config(page_title="Multimodal AI Explorer", layout="centered")

st.title("👁️ Multimodal AI Explorer")
st.write("Upload an image and ask the AI a question about it.")

# 4. Inputs
uploaded_file = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])

user_prompt = st.text_input(
    "Enter your question",
    placeholder="e.g., Extract items and prices from this receipt in JSON format"
)

In this step, we define the layout. The page gets a title and two main inputs:

a file uploader for standard image formats
and a text field where users describe what they want the AI to find.

Step 3: Handling the Logic and Processing

This is where the multimodal processing takes place. To make sure the API is only called when the user is ready, we use a button to control when the code runs:

# 5. Button
analyze_clicked = st.button(
    "Analyze Image",
    disabled=(uploaded_file is None or user_prompt.strip() == "")
)

# 6. Processing
if analyze_clicked:
    try:
        # Open image
        image = Image.open(uploaded_file)

        # Display the uploaded image at the full width of the page column
        st.image(image, caption="Uploaded Image", use_container_width=True)

        # Convert image properly for Gemini
        image = image.convert("RGB")

        # Generate response
        with st.spinner("Analyzing image..."):
            response = model.generate_content([user_prompt, image])

        # Show result
        st.subheader("AI Response")
        st.write(response.text)

    except Exception as e:
        # TEMP: Show real error for debugging
        st.error(f"Error: {e}")

Key takeaways from this block:

State Management: The “Analyze Image” button stays disabled until both an image is uploaded and a prompt is entered. This helps avoid empty API calls and saves your quota.
Image Conversion (image.convert("RGB")): This step is important. Uploaded images may carry an alpha channel (RGBA), a palette, or another color mode, depending on their source. Converting to standard RGB gives the model a consistent three-channel input and helps avoid mode-related errors.
The API Call (model.generate_content([user_prompt, image])): This step is simple. You pass a Python list with the text prompt and the PIL Image object to the model. The SDK takes care of serializing the image and sending it to Google’s servers.
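
The conversion step above assumes the upload is a readable image. Here is a small sketch of how the open-and-convert logic could be pulled into a helper that also rejects corrupt files and caps the image size before it is sent. The function name prepare_image and the max_side default of 2048 pixels are hypothetical choices for this example, not values required by the Gemini API.

```python
import io
from typing import Optional

from PIL import Image, UnidentifiedImageError


def prepare_image(data: bytes, max_side: int = 2048) -> Optional[Image.Image]:
    """Decode raw bytes into an RGB PIL image, or return None if unreadable.

    max_side is an arbitrary cap chosen for this sketch.
    """
    try:
        image = Image.open(io.BytesIO(data))
        image.load()  # force a full decode so truncated files fail here
    except (UnidentifiedImageError, OSError):
        return None

    # Drop any alpha channel or palette so the result is plain RGB.
    image = image.convert("RGB")

    # Shrink in place, preserving aspect ratio; smaller images are left as-is.
    image.thumbnail((max_side, max_side))
    return image
```

In the app, one way to use this would be to pass the uploaded file's bytes (for example, uploaded_file.getvalue()) to prepare_image and show a friendly st.error message when it returns None, instead of relying on the generic exception handler.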

Here’s how your final app will look:

Closing Thoughts

That’s how you can build a Multimodal AI App with the Gemini API.

The most important skill to build now is product sense. It’s not just about making the API call; think about what happens before and after. Learn to handle unreadable images. Learn to format the AI’s output for a database. Dive deep into keeping your app fast and reliable.
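
As one illustration of the "after" part: if you prompt the model for JSON, as in the receipt example from the text input placeholder, the reply may arrive wrapped in a Markdown code fence or may not be JSON at all. The sketch below shows one way to parse such a reply before storing it. The function name parse_model_json is hypothetical, and the assumption that replies are sometimes fenced is based on typical model behavior rather than on anything the API guarantees.

```python
import json
from typing import Any

# Three backticks, built programmatically so this listing stays easy to embed.
FENCE = "`" * 3


def parse_model_json(text: str) -> Any:
    """Parse JSON from a model reply, tolerating a surrounding Markdown fence.

    Raises ValueError (json.JSONDecodeError) if the reply is not valid JSON.
    """
    cleaned = text.strip()
    if cleaned.startswith(FENCE) and cleaned.endswith(FENCE):
        # Remove the opening and closing fence markers.
        cleaned = cleaned[len(FENCE):-len(FENCE)]
        # Drop an optional language tag such as "json" on the first line.
        first_line, _, rest = cleaned.partition("\n")
        if first_line.strip().lower() in ("", "json"):
            cleaned = rest
    return json.loads(cleaned)
```

In the app, you could call parse_model_json(response.text) inside a try block and fall back to showing the raw text when a ValueError is raised, so a malformed reply never reaches your database.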

Hope you liked the article! Follow me on Instagram for more AI/ML tips. Check out my book, Hands-On GenAI, LLMs & AI Agents, to get career-ready in AI.