Build a Production-Ready LLM API

If you want to turn your data science experiments into real products, you’ll need an API. Today, we’re not just running a model; we’re building infrastructure ready for real-world use. In this article, I’ll show you how to build a production-ready LLM API with FastAPI and Hugging Face.


We’ll build a fast API to serve TinyLlama, a compact (1.1B-parameter) open-source language model, using FastAPI. The best part is that it’s completely free, runs on your own machine, and doesn’t require an OpenAI API key.

Step 0: The Setup

First, let’s set up the environment. Make sure you have Python installed. We’ll use torch for the main computations and fastapi to run the web server.

Open your terminal and run:

pip install torch transformers accelerate
pip install "fastapi[standard]"

The "fastapi[standard]" extra already pulls in uvicorn, and pydantic ships as a core FastAPI dependency, so neither needs a separate install.

To keep things organized, as any experienced engineer would, we’ll split our code into three main files:

  1. ml_engine.py: Handles the model logic.
  2. schemas.py: Defines the data rules.
  3. main.py: The API server itself.

Step 1: The Engine

This part is the core of our app. We’ll wrap the model inside a Python class.

Here, we load the model into memory once and keep it there. If we loaded it every time a user made a request, the API would be very slow and might even crash:

# ml_engine.py
import torch
from transformers import pipeline

class LLMEngine:
    def __init__(self):
        self.pipe = None

    def load_model(self):
        """Loads the model into memory. This happens only once."""
        print("⏳ Loading TinyLlama... (This might take a minute)")
        
        # We use 'pipeline' for simplicity. 
        # model_id can be swapped for other small models like 'distilgpt2'
        model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
        
        self.pipe = pipeline(
            "text-generation",
            model=model_id,
            torch_dtype=torch.bfloat16, # Saves memory
            device_map="auto" # Uses GPU if available, otherwise CPU
        )
        print("Model loaded successfully!")

    def generate(self, prompt: str, max_new_tokens: int = 256, temperature: float = 0.7):
        """Generates text based on the prompt."""
        if not self.pipe:
            raise RuntimeError("Model is not loaded!")

        # Create a chat-like format for TinyLlama
        messages = [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": prompt},
        ]

        # Apply the chat template
        prompt_formatted = self.pipe.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        outputs = self.pipe(
            prompt_formatted,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_k=50,
            top_p=0.95
        )
        
        # Clean up the output to return just the generated text
        generated_text = outputs[0]["generated_text"]
        # Remove the prompt from the response to be cleaner
        return generated_text[len(prompt_formatted):]

# Create a global instance
llm_engine = LLMEngine()

Notice the use of torch_dtype=torch.bfloat16. Using 2 bytes per parameter instead of 4 halves the memory needed for the weights compared to float32, so the model fits comfortably on modest hardware. One caveat: bfloat16 needs reasonably recent hardware support, so if you hit dtype errors on an older machine, fall back to torch.float32.
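To see why this matters, here’s a quick back-of-the-envelope calculation. The 1.1B parameter count comes from the model name; the rest is a rough estimate that ignores activations, the KV cache, and framework overhead:

```python
# Rough memory estimate for TinyLlama's weights at different precisions.
PARAMS = 1.1e9  # TinyLlama has ~1.1 billion parameters

def weight_memory_gb(bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return PARAMS * bytes_per_param / 1024**3

fp32 = weight_memory_gb(4)  # float32: 4 bytes per parameter
bf16 = weight_memory_gb(2)  # bfloat16: 2 bytes per parameter

print(f"float32:  {fp32:.1f} GB")  # ~4.1 GB
print(f"bfloat16: {bf16:.1f} GB")  # ~2.0 GB
```

That difference is often exactly what decides whether the model fits in a laptop’s RAM or a small GPU’s VRAM.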

Step 2: The Contract

In software engineering, bad input leads to bad results. To avoid this, we use Pydantic. It clearly defines what our API should receive and return. Think of it as a bouncer, making sure only valid data gets through:

# schemas.py
from pydantic import BaseModel, Field

class GenerationRequest(BaseModel):
    prompt: str = Field(
        ...,
        min_length=1,
        json_schema_extra={
            "example": "Explain quantum physics in 5-year-old terms."
        }
    )
    max_tokens: int = Field(
        256,
        ge=10,
        le=1024,
        json_schema_extra={
            "example": 128
        }
    )
    temperature: float = Field(
        0.7,
        ge=0.0,
        le=1.0,
        json_schema_extra={
            "example": 0.7
        }
    )

class GenerationResponse(BaseModel):
    result: str
    token_usage: int

If someone sends a temperature of 5.0 (far outside the 0.0–1.0 range we allowed) or a negative token count, Pydantic instantly returns a clear 422 validation error before the request ever reaches the model.
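You can see the bouncer in action without starting the server. This snippet redefines the same constraints as schemas.py so it runs standalone:

```python
from pydantic import BaseModel, Field, ValidationError

# Same constraints as GenerationRequest in schemas.py
class GenerationRequest(BaseModel):
    prompt: str = Field(..., min_length=1)
    max_tokens: int = Field(256, ge=10, le=1024)
    temperature: float = Field(0.7, ge=0.0, le=1.0)

# Valid input passes through, with defaults filled in
ok = GenerationRequest(prompt="Why is the sky blue?")
print(ok.max_tokens)  # 256

# Out-of-range temperature is rejected before any model code runs
try:
    GenerationRequest(prompt="Hi", temperature=5.0)
except ValidationError as e:
    print(e.errors()[0]["loc"])  # e.g. ('temperature',)
```

FastAPI performs exactly this validation automatically for every request body typed as GenerationRequest.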

Step 3: The Server

Now, we tie it all together with FastAPI.

An important part here is the lifespan context manager. FastAPI’s older @app.on_event("startup") handlers are deprecated; lifespan is now the standard. It lets us control what happens when the server starts up and shuts down, like loading or unloading the model:

# main.py
from fastapi import FastAPI, HTTPException
from contextlib import asynccontextmanager
from ml_engine import llm_engine
from schemas import GenerationRequest, GenerationResponse

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: Load the model
    llm_engine.load_model()
    yield
    # Shutdown: Clean up resources (if needed)
    print("🛑 Shutting down model engine...")

# Initialize the app with the lifespan
app = FastAPI(title="TinyLlama API", lifespan=lifespan)

@app.get("/")
def read_root():
    return {"status": "online", "model": "TinyLlama-1.1B"}

@app.post("/generate", response_model=GenerationResponse)
def generate_text(request: GenerationRequest):
    """
    Endpoint to generate text.
    Note: We use a standard 'def' (not async def) here.
    """
    try:
        # Generate the text
        result = llm_engine.generate(
            prompt=request.prompt,
            max_new_tokens=request.max_tokens,
            temperature=request.temperature
        )
        
        return GenerationResponse(
            result=result,
            token_usage=len(result.split()) # Rough estimate
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

Notice that generate_text is defined with def, not async def. Text generation is CPU/GPU-bound and blocking: inside an async def handler, it would run directly on the event loop and freeze the server for every other user until generation finished. With a plain def, FastAPI automatically runs the handler in a worker threadpool, so the event loop stays free to serve other requests.

Now, in your terminal, run the following command:

fastapi dev main.py

FastAPI provides automatic documentation. Open your browser and go to:

http://127.0.0.1:8000/docs

You’ll see an interactive interface called Swagger UI.

  1. Click on the /generate endpoint.
  2. Click “Try it out”.
  3. Enter a prompt like “Why is the sky blue?”
  4. Hit “Execute”.

The API will process your request and return a JSON response with the generated text.
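You can also call the endpoint programmatically. Here’s a minimal client sketch using only the standard library; it assumes the server started by fastapi dev main.py is running locally on port 8000:

```python
import json
from urllib import request as urlrequest

# Build the request body our GenerationRequest schema expects
payload = {"prompt": "Why is the sky blue?", "max_tokens": 128}

req = urlrequest.Request(
    "http://127.0.0.1:8000/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

try:
    with urlrequest.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    print(body["result"])
    print("tokens:", body["token_usage"])
except OSError as e:
    print("Could not reach the API. Is the server running?", e)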

Closing Thoughts

That’s how you can build a production-ready LLM API with FastAPI and Hugging Face.

In practice, big tech companies don’t use one huge script for everything. Instead, they use small, specialized services that communicate with each other. By putting your LLM in an API, you’ve made a portable intelligence unit. Now, you can connect it to a React frontend, a mobile app, or even a Discord bot, all without changing your model code.

If you found this article helpful, you can follow me on Instagram for daily AI tips and practical resources. You may also be interested in my latest book, Hands-On GenAI, LLMs & AI Agents, a step-by-step guide to prepare you for careers in today’s AI industry.

Aman Kharwal
Aman Kharwal

AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.

Articles: 2070

Leave a Reply

Discover more from AmanXai by Aman Kharwal

Subscribe now to keep reading and get access to the full archive.

Continue reading