Building a Large Language Model from Scratch

Large Language Models (LLMs) are the backbone of modern AI applications, from chatbots to code generators. But how are these models actually built? In this article, I’ll walk you through building a large language model from scratch using Python, one component at a time.

Building a Large Language Model from Scratch

Modern language models (like GPT-4) use transformers, a deep learning architecture that learns relationships between words through self-attention. We’ll build a basic transformer-based model to see how the pieces fit together. The goal of our language model will be to predict the next word in a sequence.

Here are the six main components we’ll build, followed by training and using the model:

  1. Tokenization
  2. Embedding Layer
  3. Positional Encoding
  4. Self-Attention
  5. Transformer Block
  6. Full Language Model

Step 1: Tokenization

Computers can’t understand words directly, so we map each word to a unique number (ID). This process is called tokenization. Here’s how to tokenize text:

import torch
import torch.nn as nn
import torch.optim as optim
import math

def tokenize(text, vocab):
    return [vocab.get(word, vocab["<UNK>"]) for word in text.split()]

Here’s how this works:

  1. text.split(): Splits a sentence into words (e.g., “hello world” becomes [“hello”, “world”]).
  2. vocab: A dictionary that assigns numbers to words (e.g., {“hello”: 0, “world”: 1, “<UNK>”: 2}).
  3. vocab.get(word, vocab[“<UNK>”]): Returns a word’s assigned number. If it’s missing, assigns <UNK> (unknown).

Think of this as giving each word an ID, so the model can work with numbers instead of text.
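
For example, a quick check with a tiny vocabulary (the same one used later in the training step) shows what the tokenizer returns:

# Example: tokenizing with a tiny vocabulary (same one used in the training step)
vocab = {"hello": 0, "world": 1, "how": 2, "are": 3, "you": 4, "<UNK>": 5}
print(tokenize("hello world everyone", vocab))  # [0, 1, 5] -> "everyone" is not in the vocab, so it becomes <UNK>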

Step 2: Embedding Layer

Numbers alone (like 0 and 1) don’t carry meaning. An embedding layer transforms these numbers into vectors (lists of numbers), allowing words with similar meanings to have similar representations. Here’s how to implement it:

class Embedding(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(Embedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, x):
        return self.embedding(x)

Here’s how the embedding layer works:

  1. nn.Embedding(vocab_size, embedding_dim): Creates a table where each word ID maps to a vector.
  2. embedding_dim: Defines the length of each vector (e.g., 16 numbers per word).

Think of embeddings as assigning each word a personality, so words like happy and joyful get similar vectors.
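
As a quick sanity check, here is the embedding layer mapping token IDs to vectors (the exact numbers are random until the model is trained):

# Example: map three token IDs to 16-dimensional vectors
embedding = Embedding(vocab_size=6, embedding_dim=16)
token_ids = torch.tensor([[0, 1, 2]])  # one sentence: "hello world how"
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([1, 3, 16]) -> one sentence, three words, 16 numbers per word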

Step 3: Positional Encoding

Transformers process all words at once, so they don’t naturally understand order (e.g., “I love you” ≠ “You love I”). Positional encoding fixes this by adding a unique “position signal” to each word. Here’s how to implement positional encoding:

class PositionalEncoding(nn.Module):
    def __init__(self, embedding_dim, max_seq_len=5000):
        super(PositionalEncoding, self).__init__()
        self.embedding_dim = embedding_dim
        pe = torch.zeros(max_seq_len, embedding_dim)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embedding_dim, 2).float() * (-math.log(10000.0) / embedding_dim))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        return x + self.pe[:x.size(0), :]

Here’s how positional encoding works:

  1. embedding_dim: Matches the vector size from the embedding layer.
  2. max_seq_len: The longest sentence we’ll handle (e.g., 5000 words).
  3. Math (sine and cosine): Creates a pattern of numbers that change based on position (e.g., word 1 gets one pattern, word 2 gets another).
  4. forward: Adds these position numbers to the word vectors.

Think of this as tagging each word with a position stamp so the model understands word order.
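
Note that this implementation expects the sequence dimension first (the full model below transposes before and after calling it). A minimal shape check, assuming 16-dimensional embeddings:

# Example: add position signals to word vectors (shape: sequence length, batch, embedding_dim)
pos_enc = PositionalEncoding(embedding_dim=16)
word_vectors = torch.zeros(3, 1, 16)  # three words, batch of one, 16-dim vectors
encoded = pos_enc(word_vectors)
print(encoded.shape)     # torch.Size([3, 1, 16]) -> same shape, with position patterns added
print(encoded[0, 0, :4]) # position 0 gives the pattern tensor([0., 1., 0., 1.])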

Step 4: Self-Attention

Self-attention helps the model focus on important words. For example, in “The cat sat on the mat”, “sat” relates more to “cat” than “mat”. Here’s how to implement it:

class SelfAttention(nn.Module):
    def __init__(self, embedding_dim):
        super(SelfAttention, self).__init__()
        self.query = nn.Linear(embedding_dim, embedding_dim)
        self.key = nn.Linear(embedding_dim, embedding_dim)
        self.value = nn.Linear(embedding_dim, embedding_dim)

    def forward(self, x):
        queries = self.query(x)
        keys = self.key(x)
        values = self.value(x)
        scores = torch.bmm(queries, keys.transpose(1, 2)) / torch.sqrt(torch.tensor(x.size(-1), dtype=torch.float32))
        attention_weights = torch.softmax(scores, dim=-1)
        attended_values = torch.bmm(attention_weights, values)
        return attended_values

Here’s how self-attention works:

  1. query, key, value: Three transformations of the input vectors. Think of them as asking “What do I care about?” (query), “What’s available?” (key), and “What do I take?” (value).
  2. scores: Measures how much each word relates to every other word.
  3. attention_weights: Turns scores into probabilities (e.g., 70% focus on “how”, 30% on “are”).
  4. attended_values: Combines the important parts of the sentence.

Think of self-attention as a smart highlighter that finds important words to focus on.
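
A quick check of the input and output shapes (batch first, which is how the full model calls it):

# Example: self-attention keeps the shape, but each word now mixes in information from the others
attention = SelfAttention(embedding_dim=16)
x = torch.randn(1, 3, 16)  # batch of one, three words, 16-dim vectors
print(attention(x).shape)  # torch.Size([1, 3, 16])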

Step 5: Transformer Block

A single attention layer isn’t enough. Transformer blocks combine attention with deeper processing. Here’s how to implement a transformer block:

class TransformerBlock(nn.Module):
    def __init__(self, embedding_dim, hidden_dim):
        super(TransformerBlock, self).__init__()
        self.attention = SelfAttention(embedding_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embedding_dim)
        )
        self.norm1 = nn.LayerNorm(embedding_dim)
        self.norm2 = nn.LayerNorm(embedding_dim)

    def forward(self, x):
        attended = self.attention(x)
        x = self.norm1(x + attended)
        forwarded = self.feed_forward(x)
        x = self.norm2(x + forwarded)
        return x

Here’s how the transformer block works:

  1. attention: The self-attention we just built.
  2. feed_forward: A small neural network to process each word further.
  3. norm1, norm2: Normalizes the numbers so they don’t get too big or small (like keeping everyone on the same scale).
  4. x + attended: Adds the original input to the attention output (a trick called “residual connection”).

This is like a brain cell: it listens (attention), thinks (feed-forward), and keeps things stable (normalization).
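
Because a transformer block returns the same shape it receives, blocks can be stacked one after another. A minimal check:

# Example: a transformer block keeps the shape, so several blocks can be stacked
block = TransformerBlock(embedding_dim=16, hidden_dim=32)
x = torch.randn(1, 3, 16)  # batch of one, three words, 16-dim vectors
print(block(x).shape)      # torch.Size([1, 3, 16]) -> same shape in, same shape out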

Step 6: Full Language Model

Now, we will combine all the pieces into one model that predicts the next word:

class SimpleLLM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers):
        super(SimpleLLM, self).__init__()
        self.embedding = Embedding(vocab_size, embedding_dim)
        self.positional_encoding = PositionalEncoding(embedding_dim)
        self.transformer_blocks = nn.Sequential(*[TransformerBlock(embedding_dim, hidden_dim) for _ in range(num_layers)])
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        x = x.transpose(0, 1) # Transpose for positional encoding
        x = self.positional_encoding(x)
        x = x.transpose(0, 1) # Transpose back
        x = self.transformer_blocks(x)
        x = self.output(x)
        return x

Some key components you should know:

  1. num_layers: How many transformer blocks to stack (more layers = deeper thinking).
  2. output: Turns the final vectors back into word predictions (e.g., probabilities for each word in the vocab).

This is the final system: it reads the sentence, understands it, and guesses the next word.
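
Before training, a quick shape check (on a freshly initialized, untrained instance) confirms that the model produces one score per vocabulary word at every position:

# Example: the model outputs one score per vocabulary word, for every position in the input
model = SimpleLLM(vocab_size=6, embedding_dim=16, hidden_dim=32, num_layers=2)
input_ids = torch.tensor([[0, 1, 2]])  # "hello world how"
print(model(input_ids).shape)          # torch.Size([1, 3, 6]) -> 6 scores (one per vocab word) at each position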

Step 7: Training the Model

Now, we will teach the model by showing it examples and correcting its mistakes:

vocab = {"hello": 0, "world": 1, "how": 2, "are": 3, "you": 4, "<UNK>": 5}
vocab_size = len(vocab)
embedding_dim = 16
hidden_dim = 32
num_layers = 2

model = SimpleLLM(vocab_size, embedding_dim, hidden_dim, num_layers)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

data = ["hello world how are you", "how are you hello world"]
tokenized_data = [tokenize(sentence, vocab) for sentence in data]

for epoch in range(100):
    for sentence in tokenized_data:
        for i in range(1, len(sentence)):
            input_seq = torch.tensor(sentence[:i]).unsqueeze(0)
            target = torch.tensor(sentence[i]).unsqueeze(0)
            optimizer.zero_grad()
            output = model(input_seq)
            loss = criterion(output[:, -1, :], target)
            loss.backward()
            optimizer.step()
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item()}")
Output:

Epoch 0, Loss: 1.7691224813461304
Epoch 10, Loss: 0.6396194696426392
Epoch 20, Loss: 0.2903057932853699
Epoch 30, Loss: 0.1653764843940735
Epoch 40, Loss: 0.10594221949577332
Epoch 50, Loss: 0.07302528619766235
Epoch 60, Loss: 0.05297106131911278
Epoch 70, Loss: 0.039956752210855484
Epoch 80, Loss: 0.031084876507520676
Epoch 90, Loss: 0.024792836979031563

Some key components you should know (a concrete example of these pairs follows the list):

  1. input_seq: The words so far (e.g., [0, 1] for “hello world”).
  2. target: The next word (e.g., 2 for “how”).
  3. loss: How far off the prediction was.
  4. optimizer.step(): Updates the model to improve.
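
To make these pairs concrete, here is a small sketch that only prints the (input, target) pairs generated from the first training sentence:

# Example: the (input, target) pairs built from "hello world how are you"
sentence = tokenize("hello world how are you", vocab)  # [0, 1, 2, 3, 4]
for i in range(1, len(sentence)):
    print(sentence[:i], "->", sentence[i])
# [0] -> 1
# [0, 1] -> 2
# [0, 1, 2] -> 3
# [0, 1, 2, 3] -> 4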

Step 8: Using the Model

Now, let’s predict the next word using our model:

input_text = "hello world how"
input_tokens = tokenize(input_text, vocab)
input_tensor = torch.tensor(input_tokens).unsqueeze(0)
output = model(input_tensor)
predicted_token = torch.argmax(output[:, -1, :]).item()
print(f"Input: {input_text}, Predicted: {list(vocab.keys())[list(vocab.values()).index(predicted_token)]}")
Output:

Input: hello world how, Predicted: are

How to Build an Actual LLM from This?

To scale up this model into a practical LLM, several key changes are needed. First, the vocabulary size must expand from just 6 words to 50,000+ words or subwords using techniques like Byte-Pair Encoding (BPE) and tokenizers from libraries like Hugging Face. Instead of two sentences, real-world training requires millions of sentences sourced from books, Wikipedia, or large datasets.
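
As an illustration of subword tokenization (separate from the toy model above, and assuming the Hugging Face transformers library is installed), a pretrained BPE tokenizer can be used like this:

# Illustration only: subword (BPE) tokenization with a pretrained GPT-2 tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # roughly 50,000 subword tokens
print(tokenizer.tokenize("Tokenization handles unseen words"))
print(tokenizer.encode("Tokenization handles unseen words"))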

The embedding dimension should increase from 16 to 512 or 1024 for richer word representations, while the hidden dimension should grow from 32 to at least 2048 for greater processing power. The number of transformer layers needs to scale from 2 to 12–96, similar to models like GPT-3.

Instead of simple self-attention, multi-head attention should be implemented using nn.MultiheadAttention for better contextual understanding. Training also becomes significantly more complex, moving from 100 CPU epochs to multi-GPU/TPU training over days or weeks, requiring optimizations like batching (DataLoader), gradient clipping, and learning rate schedulers.
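
For example, the SelfAttention class above could be swapped for PyTorch’s built-in multi-head attention. A minimal sketch (masking, dropout, and other details omitted):

# Sketch: multi-head attention with nn.MultiheadAttention instead of the single-head version above
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(1, 10, 512)            # batch of one, 10 tokens, 512-dim embeddings
attended, attn_weights = mha(x, x, x)  # query, key, and value all come from the same input
print(attended.shape)                  # torch.Size([1, 10, 512])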

Hardware-wise, a real LLM demands multiple high-end GPUs (e.g., 8+ A100s) and frameworks like PyTorch Lightning or DeepSpeed for efficient scaling.

Summary

I hope you now understand how to build a large language model from scratch using this example. To build an actual LLM, you need to use libraries like Hugging Face, scale up the architecture, and train on massive datasets. I hope you liked this article on building a Large Language Model from scratch. Feel free to ask your questions in the comments section below. You can follow me on Instagram for many more resources.

