We’ve all watched Iron Man and quietly envied Tony Stark’s banter with JARVIS. There is something profoundly magical about speaking to a machine and having it understand, think, and reply, not with a pre-programmed script, but with genuine intelligence. For a long time, building something like that required a PhD or massive cloud subscriptions. Today, we can make it in an afternoon with a few lines of Python. In this article, we will build a voice AI assistant that listens to you, thinks using a powerful local AI model (Llama 3), and responds to you.
How Does a Voice AI Assistant Work in Real Time?
Before we write the code, let’s understand what we are building. An AI voice assistant is essentially a loop of three distinct biological functions replicated in code:
- The Ears (Speech-to-Text): We capture audio vibrations and translate them into text.
- The Brain (LLM Inference): We send that text to a Large Language Model (Ollama/Llama 3) to generate a smart response.
- The Mouth (Text-to-Speech): We convert the AI’s text response back into audio so we can hear it.
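Strung together, the three stages form a simple loop. Here is a minimal sketch with stand-in functions (placeholders only; the real implementations follow in the steps below):

```python
# A minimal sketch of the Listen -> Think -> Speak cycle.
# Each stub stands in for the real implementation built later.

def listen():
    # Stand-in for speech-to-text: pretend the user said this.
    return "hello"

def think(text):
    # Stand-in for the LLM call: echo a canned reply.
    return f"You said: {text}"

def speak(text):
    # Stand-in for text-to-speech: just print.
    print(text)

# One turn of the assistant's cycle:
speak(think(listen()))  # prints "You said: hello"
```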
Let’s understand it practically by building a real-time voice AI assistant using Python.
Building a Real-Time Voice AI Assistant
To get it running, we are relying on three key libraries. You will need to install them via your terminal:
pip install speechrecognition ollama pyttsx3 pyaudio
You must have the Ollama application installed on your computer and the Llama 3 model pulled (ollama pull llama3) for the brain part of our code to work. Here’s a tutorial if you are a first-timer.
Step 1: Importing the Tools
Here we are grabbing our tools. Think of this as laying out your ingredients before cooking. sr is our listener, ollama is our thinker, and pyttsx3 is our speaker:
import speech_recognition as sr
import ollama
import pyttsx3
Step 2: The Ears
This function is responsible for the physical world interface, the microphone:
def listen():
    recognizer = sr.Recognizer()
    try:
        with sr.Microphone() as source:
            print("Listening... (Speak now)")
            # Adjust for ambient noise
            recognizer.adjust_for_ambient_noise(source, duration=0.5)
            # Listen for audio input
            audio = recognizer.listen(source, timeout=5, phrase_time_limit=10)
        print("Processing...")
        # Recognize speech using Google's free API
        text = recognizer.recognize_google(audio)
        print(f"You said: {text}")
        return text
    except sr.WaitTimeoutError:
        print("No speech detected (timeout).")
        return None
    except sr.UnknownValueError:
        print("Sorry, I didn't catch that.")
        return None
    except sr.RequestError:
        print("Speech recognition service unavailable.")
        return None
    except Exception as e:
        print(f"An error occurred in listen(): {e}")
        return None

Here’s what’s happening:
- adjust_for_ambient_noise: Microphones pick up fan hums and static. This line tells the code to listen to the silence for 0.5 seconds to understand the room’s baseline noise, which makes the actual recognition much more accurate.
- recognize_google: We are using Google’s Web Speech API to convert audio to text. It’s free and generally very accurate, though it does require an internet connection.
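Since recognize_google needs the network, a common refinement is to try one recognizer and fall back to another if it fails. SpeechRecognition also ships an offline recognize_sphinx (it requires the separate pocketsphinx package); the fallback logic itself is just a chain of callables, sketched here with hypothetical stand-in functions:

```python
# Generic fallback chain: try each recognizer in order and return
# the first successful result. The stand-ins below are hypothetical;
# in the real script they would wrap recognizer.recognize_google()
# and recognizer.recognize_sphinx().

def recognize_online(audio):
    # Stand-in for recognizer.recognize_google(audio)
    raise ConnectionError("no internet")

def recognize_offline(audio):
    # Stand-in for recognizer.recognize_sphinx(audio)
    return "hello world"

def transcribe(audio, recognizers):
    for recognize in recognizers:
        try:
            return recognize(audio)
        except Exception as e:
            print(f"{recognize.__name__} failed: {e}")
    return None

text = transcribe(b"...", [recognize_online, recognize_offline])
print(text)  # "hello world", via the offline fallback
```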
Step 3: The Brain
This is the key part of our assistant. We take the raw text and give it intelligence:
def think(text: str):
    if not text:
        return None
    print("Thinking...")
    try:
        # Ensure you have pulled the model via: ollama pull llama3
        response = ollama.chat(
            model="llama3",
            messages=[
                {
                    "role": "user",
                    "content": text,
                }
            ],
        )
        response_text = response["message"]["content"]
        print(f"AI: {response_text}")
        return response_text
    except Exception as e:
        print(f"An error occurred in think(): {e}")
        return "Sorry, something went wrong while thinking."

Here’s what’s happening:
- ollama.chat: This is the interface to your local Llama 3 model. We send a list of messages (in this case, just one from the “user”) and wait for the model to complete the pattern.
- Latency: Since Llama 3 is running locally on your device, this might take a second or two, depending on your GPU/CPU, but it’s completely private. No data is sent to a cloud server for thinking.
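As written, think() sends only the latest utterance, so the assistant forgets everything between turns. Ollama's chat API accepts the whole conversation as the messages list, so giving the assistant memory is just a matter of appending each exchange to a shared history. Here is a sketch of that bookkeeping, with the actual ollama call stubbed out:

```python
# Keep the running conversation so the model sees previous turns.
history = []

def fake_llm(messages):
    # Stand-in for ollama.chat(model="llama3", messages=messages);
    # the real call returns {"message": {"content": ...}}.
    return {"message": {"content": f"reply #{len(messages)}"}}

def think_with_memory(text):
    history.append({"role": "user", "content": text})
    response = fake_llm(history)
    reply = response["message"]["content"]
    # Store the assistant's answer so the next turn can refer to it.
    history.append({"role": "assistant", "content": reply})
    return reply

think_with_memory("My name is Sam.")
think_with_memory("What is my name?")
print(len(history))  # 4 messages: two turns of user + assistant
```

With the real model, passing history instead of a single message is what lets a follow-up like "What is my name?" actually work.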
Step 4: The Mouth
An assistant isn’t an assistant if you have to read the screen. Here’s how to give it the ability to speak:
def speak(text: str):
    if not text:
        return
    try:
        engine = pyttsx3.init()
        # Optional: Change voice properties
        voices = engine.getProperty("voices")
        if voices:
            # Try changing index 0 -> 1 for an alternative voice
            engine.setProperty("voice", voices[0].id)
        engine.setProperty("rate", 175)  # Speed of speech
        engine.say(text)
        engine.runAndWait()
    except Exception as e:
        print(f"An error occurred in speak(): {e}")

Here’s what’s happening:
- pyttsx3.init(): This initialises the speech engine driver on your OS (sapi5 on Windows, nsss on macOS, espeak on Linux).
- engine.runAndWait(): This is critical. It blocks the code execution until the speaking is done. Without this, the program might try to listen while it’s still speaking, causing it to hear itself!
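One practical wrinkle: Llama 3 often answers in Markdown, and pyttsx3 will read the asterisks and backticks out loud. A small cleanup pass before speaking keeps the audio natural. The regexes below are a rough sketch, not a full Markdown parser:

```python
import re

def clean_for_speech(text: str) -> str:
    # Strip common Markdown markers the model tends to emit.
    text = re.sub(r"[*_`#]+", "", text)  # emphasis, code, headings
    # Replace [label](url) links with just the label.
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Collapse leftover whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(clean_for_speech("**Hello!** Check [the docs](https://example.com)."))
# Hello! Check the docs.
```

You would call speak(clean_for_speech(text)) instead of speak(text).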
Step 5: The Main Function
Finally, we stitch the organs together into a living body:
def main():
    print("--- Voice Assistant Started ---")
    speak("Hello, I am ready. You can start speaking.")
    while True:
        # 1. Listen
        user_input = listen()
        # Skip if nothing heard
        if not user_input:
            continue
        # 2. Check for exit keywords
        if user_input.lower().strip() in ["exit", "stop", "quit"]:
            speak("Goodbye!")
            print("Exiting...")
            break
        # 3. Think
        ai_response = think(user_input)
        # 4. Speak
        speak(ai_response)

if __name__ == "__main__":
    main()

Here’s what’s happening:
- The while True Loop: This creates the always-on behaviour. The program enters a cycle of Listen -> Think -> Speak, and then immediately goes back to Listen.
- Exit Strategy: We added a simple check for exit, stop, or quit so you can gracefully shut down the assistant without force-quitting the terminal.
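The exact-match check is brittle: "stop please" or "Exit." won't trigger it because of the extra word or punctuation. A slightly more forgiving matcher, as a sketch (the word list, including the added "goodbye", is just a suggestion):

```python
import string

EXIT_WORDS = {"exit", "stop", "quit", "goodbye"}

def is_exit_command(text: str) -> bool:
    # Strip punctuation, lowercase, and look for any exit word.
    words = text.lower().translate(
        str.maketrans("", "", string.punctuation)
    ).split()
    return any(word in EXIT_WORDS for word in words)

print(is_exit_command("Exit."))            # True
print(is_exit_command("stop please"))      # True
print(is_exit_command("tell me a story"))  # False
```

Per-word matching still has limits (a sentence like "don't stop" would also trigger it), but it handles the punctuation and filler words speech recognition tends to produce.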
Closing Thoughts
When you run this script and hear the AI respond to your voice, take a moment to appreciate what just happened. You essentially built a synthetic neocortex (Llama 3) and gave it sensory organs (mic/speakers).
Today it’s just chatting; tomorrow, you could hook the think() function up to your calendar API or email client, turning this from a chatbot into a true proactive agent.
I hope you liked this article on building a voice AI assistant that listens to you, thinks using a powerful local AI model, and speaks back to you. Follow me on Instagram for many more resources.