Multimodal AI is one of the most exciting fields in artificial intelligence today. It powers some of the smartest systems around, from Google Lens to GPT-4o. Gartner predicts that by 2026, 70% of customer interactions will involve multimodal AI, up from less than 15% in 2023. So, if you want to learn how to build multimodal AI models, this article is for you. In it, I’ll walk you through building a multimodal AI model with Python.
What Is Multimodal AI?
Multimodal AI refers to models that process and understand multiple types of data, like text, images, audio, and video, simultaneously. These models can analyze an image, read a sentence, and understand how both relate.
OpenAI’s CLIP (Contrastive Language-Image Pretraining) is one of the most powerful examples. Trained on 400 million (image, text) pairs, CLIP can “see” an image and “read” text in the same semantic space, which allows you to compare the two directly.
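To see that shared embedding space in action before building anything, here is a minimal sketch of the idea. It assumes you have torch, transformers, and Pillow installed, and "cat.png" is just a placeholder path for any local image: CLIP scores the image against two short text prompts directly.

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# load the pretrained CLIP model and its processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# "cat.png" is a placeholder; swap in any local image
image = Image.open("cat.png").convert("RGB")
texts = ["a photo of a cat", "a photo of a dog"]

# encode the image and both prompts, then compare them in the shared space
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape: (1, 2)
print(logits_per_image.softmax(dim=1))  # higher probability = better match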
Let’s understand multimodal AI in detail by building a model with Python.
Building a Multimodal AI Model with Python
Here, we’ll build a caption-matching AI system that:
- Takes an input image (say, a cup of tea)
- Compares it to a list of 70+ potential captions
- Returns the Top 5 captions that best describe the image, using cosine similarity
We’ll build all of this using Python, PyTorch, and Hugging Face Transformers.
Now let’s get started with building a multimodal AI model with Python step-by-step.
Step 1: Load and Preprocess the Image
We’ll use Pillow to load the image and CLIPProcessor to preprocess it for the CLIP model:

import torch
from PIL import Image
from sklearn.metrics.pairwise import cosine_similarity
from transformers import CLIPProcessor, CLIPModel

# image loading and preprocessing
def load_and_preprocess_image(image_path):
    image = Image.open(image_path).convert("RGB")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(images=image, return_tensors="pt")
    return inputs, processor

This will convert the image into a tensor that the CLIP model can process.
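As a quick sanity check (not part of the original walkthrough), you can inspect what the processor returns; for the clip-vit-base-patch32 checkpoint, the image is resized and normalized into a single 224×224 RGB tensor. Here, "tea.png" is a placeholder for whatever image you use:

inputs, processor = load_and_preprocess_image("tea.png")  # placeholder path
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])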
Step 2: Extract Image Embeddings with CLIP
Next, we’ll use CLIPModel to extract feature embeddings from the image:
# image understanding with CLIP
def generate_image_embeddings(inputs):
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    with torch.no_grad():
        image_features = model.get_image_features(**inputs)
    return image_features, model

This will give us a vector that captures the semantic meaning of the image.
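For reference, get_image_features returns a 512-dimensional embedding per image for this checkpoint. It is also common to L2-normalize the vector before comparing it to others, although the cosine similarity used below handles that implicitly. Continuing from the inputs produced in Step 1:

image_features, clip_model = generate_image_embeddings(inputs)
print(image_features.shape)  # torch.Size([1, 512]) for clip-vit-base-patch32
normalized = image_features / image_features.norm(dim=-1, keepdim=True)  # optional L2 normalization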
Now, before moving forward, we’ll create a list of candidate captions to compare against the image features:
candidate_captions = [
"Trees, Travel and Tea!",
"A refreshing beverage.",
"A moment of indulgence.",
"The perfect thirst quencher.",
"Your daily dose of delight.",
"Taste the tradition.",
"Savor the flavor.",
"Refresh and rejuvenate.",
"Unwind and enjoy.",
"The taste of home.",
"A treat for your senses.",
"A taste of adventure.",
"A moment of bliss.",
"Your travel companion.",
"Fuel for your journey.",
"The essence of nature.",
"The warmth of comfort.",
"A sip of happiness.",
"Pure indulgence.",
"Quench your thirst, ignite your spirit.",
"Awaken your senses, embrace the moment.",
"The taste of faraway lands.",
"A taste of home, wherever you are.",
"Your daily dose of delight.",
"Your moment of serenity.",
"The perfect pick-me-up.",
"The perfect way to unwind.",
"Taste the difference.",
"Experience the difference.",
"A refreshing escape.",
"A delightful escape.",
"The taste of tradition, the spirit of adventure.",
"The warmth of home, the joy of discovery.",
"Your passport to flavor.",
"Your ticket to tranquility.",
"Sip, savor, and explore.",
"Indulge, relax, and rejuvenate.",
"The taste of wanderlust.",
"The comfort of home.",
"A journey for your taste buds.",
"A haven for your senses.",
"Your refreshing companion.",
"Your delightful escape.",
"Taste the world, one sip at a time.",
"Embrace the moment, one cup at a time.",
"The essence of exploration.",
"The comfort of connection.",
"Quench your thirst for adventure.",
"Savor the moment of peace.",
"The taste of discovery.",
"The warmth of belonging.",
"Your travel companion, your daily delight.",
"Your moment of peace, your daily indulgence.",
"The spirit of exploration, the comfort of home.",
"The joy of discovery, the warmth of connection.",
"Sip, savor, and set off on an adventure.",
"Indulge, relax, and find your peace.",
"A delightful beverage.",
"A moment of relaxation.",
"The perfect way to start your day.",
"The perfect way to end your day.",
"A treat for yourself.",
"Something to savor.",
"A moment of calm.",
"A taste of something special.",
"A refreshing pick-me-up.",
"A comforting drink.",
"A taste of adventure.",
"A moment of peace.",
"A small indulgence.",
"A daily ritual.",
"A way to connect with others.",
"A way to connect with yourself.",
"A taste of home.",
"A taste of something new.",
"A moment to enjoy.",
"A moment to remember."
]

Step 3: Match the Image to the Captions
Now, we’ll compare the image features to the text features of all possible captions:
# caption matching (using CLIP text embeddings)
def match_captions(image_features, captions, clip_model, processor):
    # 1. get text embeddings for the captions:
    text_inputs = processor(text=captions, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_features = clip_model.get_text_features(**text_inputs)

    # 2. calculate cosine similarity between image and text features:
    image_features = image_features.detach().cpu().numpy()
    text_features = text_features.detach().cpu().numpy()
    similarities = cosine_similarity(image_features, text_features)

    # 3. find the best matching captions:
    best_indices = similarities.argsort(axis=1)[0][::-1]
    best_captions = [captions[i] for i in best_indices]
    return best_captions, similarities[0][best_indices].tolist()

Here, we used cosine similarity to find how closely the image vector aligns with each caption vector.
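As an aside, and purely as an alternative to the scikit-learn call above, the same cosine similarities can be computed in plain PyTorch by L2-normalizing both feature matrices and taking a matrix product:

# alternative sketch: cosine similarity in pure PyTorch (no scikit-learn needed)
def cosine_scores(image_features, text_features):
    img = image_features / image_features.norm(dim=-1, keepdim=True)
    txt = text_features / text_features.norm(dim=-1, keepdim=True)
    return img @ txt.T  # shape: (num_images, num_captions)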
Step 4: Wrap It All Together
Now, we will write the final function for our multimodal AI model and try it out on an image:
# main function
def image_captioning(image_path, candidate_captions):
    inputs, processor = load_and_preprocess_image(image_path)
    image_features, clip_model = generate_image_embeddings(inputs)
    best_captions, similarities = match_captions(image_features, candidate_captions, clip_model, processor)
    return best_captions, similarities

best_captions, similarities = image_captioning("/content/aman.png", candidate_captions)

top_n = min(5, len(best_captions))
top_best_captions = best_captions[:top_n]
top_similarities = similarities[:top_n]

print("Top 5 Best Captions:")
for i, (caption, similarity) in enumerate(zip(top_best_captions, top_similarities)):
    print(f"{i+1}. {caption} (Similarity: {similarity:.4f})")

Output:

Top 5 Best Captions:
1. Your moment of peace, your daily indulgence. (Similarity: 0.2538)
2. Embrace the moment, one cup at a time. (Similarity: 0.2515)
3. Taste the world, one sip at a time. (Similarity: 0.2495)
4. Unwind and enjoy. (Similarity: 0.2487)
5. Savor the moment of peace. (Similarity: 0.2486)
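One practical note of my own: image_captioning reloads the CLIP model and processor on every call, which is fine for a demo but slow if you want to caption many images. A minimal sketch of reusing the already-loaded model, with "photo1.png" and "photo2.png" as placeholder paths:

# sketch: load CLIP once, then score several images against the same captions
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def caption_image(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        image_features = model.get_image_features(**inputs)
    return match_captions(image_features, candidate_captions, model, processor)

for path in ["photo1.png", "photo2.png"]:  # placeholder paths
    captions, scores = caption_image(path)
    print(path, "->", captions[0], f"({scores[0]:.4f})")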
The input image I used was a photo of a cup of tea. This exact approach can power:
- Smart caption generators for social media platforms
- Image search engines
- Visual product recommenders in e-commerce
- Automated marketing content creation
- AI agents that “see” and “talk”
Final Words
So, you’re no longer working with just plain text or static images; you’re bridging vision and language. In just a few lines of Python, you’ve built a multimodal AI system that would have required a team of researchers only a few years ago. I hope you liked this article on building a multimodal AI model with Python. Feel free to ask your questions in the comments section below. You can follow me on Instagram for many more resources.