Fine-Tuning a Small Language Model Locally

Fine-tuning a small language model on your own computer is no longer reserved for researchers with racks of hardware. Thanks to advances in quantization and parameter-efficient training, you can take capable models like Meta’s Llama 3 8B or Microsoft’s Phi-3 and customize them on a single consumer GPU. It’s a practical skill for anyone who wants a model adapted to their own data without depending on a hosted API.

In this article, I’ll walk you through a hands-on tutorial for fine-tuning a small language model on your own machine.

How Fine-Tuning Works Locally

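When I first tried working with language models, doing it locally felt out of reach unless you had a huge multi-GPU setup. Full fine-tuning updates every one of the model’s billions of weights, and the optimizer has to keep gradients and optimizer states for each of them on top of the weights themselves, so memory use ends up several times the size of the model.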
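To put rough numbers on that, here is a back-of-the-envelope sketch. The byte counts are my assumptions for a common mixed-precision Adam setup (they are not measurements from this tutorial), and activations are ignored, so treat the result as a lower bound:

```python
# Rough memory estimate for full fine-tuning with Adam in mixed precision.
# All byte counts below are assumptions for a typical setup, not measurements.

BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def full_finetune_gb(num_params: float) -> float:
    """Approximate training memory in GB (10^9 bytes), excluding activations."""
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / 1e9

params_8b = 8e9  # an 8B-parameter model, rounded
print(f"~{full_finetune_gb(params_8b):.0f} GB before activations")
```

Even with generous rounding, that is far beyond any single consumer GPU, which is why the techniques below matter.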

Enter LoRA (Low-Rank Adaptation) and Quantization.

LoRA works by freezing the original model weights and adding small, trainable low-rank matrices to the model’s layers. It’s like studying from a textbook you can’t write in: instead of changing the printed text, you keep separate notes with your own updates. In practice, this often gets close to the quality of full fine-tuning while training only a small fraction of the parameters (about 0.5% in the setup used later in this tutorial).
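Here is a minimal sketch of that idea on plain Python lists, so it runs without any deep-learning library. The shapes, values, and helper names (matvec, lora_forward) are toy assumptions of mine; real implementations apply the same formula, W x + (alpha / r) · B (A x), to large tensors:

```python
# Minimal LoRA forward pass on plain Python lists (no deep-learning library).
# Shapes and values are toy assumptions chosen for readability.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B would train."""
    base = matvec(W, x)                # frozen pretrained path
    update = matvec(B, matvec(A, x))   # low-rank path: d_in -> r -> d_out
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Frozen 2x3 weight matrix (d_out = 2, d_in = 3)
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]
r = 1
A = [[0.5, 0.5, 0.5]]      # r x d_in
B_zero = [[0.0], [0.0]]    # d_out x r, starts at zero as in the LoRA paper
x = [1.0, 2.0, 3.0]

# With B at zero the adapter contributes nothing: output equals the base model.
out_initial = lora_forward(W, A, B_zero, x, alpha=1, r=r)
print(out_initial)

# Training would move B away from zero; here it is simply set by hand.
B_trained = [[1.0], [-1.0]]
out_adapted = lora_forward(W, A, B_trained, x, alpha=1, r=r)
print(out_adapted)
```

Because B starts at zero, attaching the adapter does not change the model’s behavior until training begins, and the base weights W are never modified.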

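We also use 4-bit quantization with tools like bitsandbytes. Quantization stores the frozen base weights at lower precision, here 4 bits instead of 16, so they take up roughly a quarter of the memory. With QLoRA (Quantized LoRA), which trains LoRA adapters on top of a 4-bit base model, an 8B model whose weights alone take roughly 16GB in 16-bit precision (32GB in 32-bit) can be fine-tuned on a regular 12GB or 16GB consumer GPU.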
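To make the idea concrete, here is a toy sketch of block-wise 4-bit quantization. It is a simplified stand-in, not what bitsandbytes does internally: I use a uniform integer grid, whereas bitsandbytes relies on non-uniform 4-bit data types such as NF4. The helper names (quantize_block, dequantize_block) are hypothetical:

```python
# Toy block-wise 4-bit quantization using a uniform integer grid.
# Assumption: simplified stand-in; bitsandbytes uses non-uniform 4-bit
# data types (such as NF4), not the uniform grid shown here.

def quantize_block(values):
    """Map floats to integers in [-7, 7] plus one float scale for the block."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from the 4-bit codes and the block scale."""
    return [c * scale for c in codes]

weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)

print(codes)     # small integers, each storable in 4 bits
print(restored)  # close to the original weights, but not identical

# Storage arithmetic: 4 bits per weight is half a byte, versus 2 bytes in fp16.
params = 8e9
fp16_gb = params * 2 / 1e9
int4_gb = params * 0.5 / 1e9
print(f"fp16: ~{fp16_gb:.0f} GB, 4-bit: ~{int4_gb:.0f} GB (weights only)")
```

The small rounding error in the restored values is the price of the memory savings. Real 4-bit checkpoints come out somewhat larger than the raw arithmetic suggests, since they also store per-block scales and typically keep some layers at higher precision.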

Fine-Tuning a Small Language Model Locally: Getting Started

Let’s see how this works step by step. In this tutorial, we won’t use any paid APIs. Instead, we’ll use open-source libraries like unsloth (which makes local training faster and more memory-efficient), Hugging Face’s transformers, trl, and peft.

We’ll use Llama-3-8B and get it ready to train on a basic instruction-following dataset.

Step 1: Environment Setup

You’ll need a computer with an NVIDIA GPU. Be sure to install these libraries:

pip install unsloth
pip install trl peft accelerate
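
Before going further, it can help to confirm that the NVIDIA driver tools are actually visible on your machine. The sketch below is one way to do that with only the standard library; it assumes nvidia-smi is on your PATH when a driver is installed, and the helper name list_nvidia_gpus is mine:

```python
# Check whether the NVIDIA driver tools are visible on this machine.
# Assumption: nvidia-smi is on PATH when an NVIDIA driver is installed.
import shutil
import subprocess

def list_nvidia_gpus():
    """Return a list of 'name, total memory' strings, or [] if none are detected."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

gpus = list_nvidia_gpus()
print(gpus if gpus else "No NVIDIA GPU detected")
```

If this prints an empty result, fix the driver installation first; the training steps below will not run without a working GPU.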

Step 2: Loading the Quantized Model

Next, we’ll load the model in 4-bit precision to save memory. Unsloth’s FastLanguageModel makes this process simple:

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 # A solid default for most local GPU constraints
dtype = None # Auto-detects fp16 or bf16
load_in_4bit = True # saves your VRAM

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth

config.json: 100%|██████████████████████████████| 1.20k/1.20k [00:00<00:00, 4.56MB/s]
model.safetensors: 100%|████████████████████████| 5.70G/5.70G [00:45<00:00, 126MB/s]
generation_config.json: 100%|███████████████████| 172/172 [00:00<00:00, 780kB/s]
tokenizer_config.json: 100%|████████████████████| 50.6k/50.6k [00:00<00:00, 12.4MB/s]
tokenizer.json: 100%|███████████████████████████| 9.09M/9.09M [00:00<00:00, 34.2MB/s]
special_tokens_map.json: 100%|██████████████████| 464/464 [00:00<00:00, 2.10MB/s]

Step 3: Applying LoRA Adapters

Now, we attach those “notes” to the model:

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Suggested starting points: 8, 16, 32, 64
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # = 0 is optimized
    bias = "none",    # = "none" is optimized
    use_gradient_checkpointing = "unsloth", # Crucial for saving memory on long contexts
    random_state = 3407,
)
Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5195983464188562

Take note of the r parameter, which stands for rank. Setting it to 16 is a good starting point for instruction tuning. Higher ranks can capture more detail but will use more VRAM.
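
You can check the trainable parameter count in the log above by hand. The sketch below assumes the layer shapes from the public Llama 3 8B configuration (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128); those numbers are not stated in this article, so treat them as my assumption:

```python
# Count LoRA trainable parameters for the configuration used above.
# Assumption: layer shapes follow the public Llama 3 8B config
# (hidden size 4096, MLP size 14336, 32 layers, 8 KV heads of dim 128).

HIDDEN = 4096
KV_DIM = 8 * 128   # grouped-query attention makes k_proj and v_proj narrower
MLP = 14336
NUM_LAYERS = 32
ALL_PARAMS = 8_072_204_288  # "all params" as reported in the log above

# (input features, output features) for each targeted projection in one layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(rank: int) -> int:
    """Each adapted projection adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS

for rank in (8, 16, 32, 64):
    count = lora_params(rank)
    print(f"r = {rank:>2}: {count:,} trainable ({100 * count / ALL_PARAMS:.4f}%)")
```

For r = 16 this reproduces the 41,943,040 figure from the log, and it shows that the number of trainable parameters grows linearly with the rank.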

Step 4: Formatting the Dataset

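The model learns from plain text, so every record has to be rendered into one consistent prompt format. We’ll convert our dataset to an Alpaca-style prompt and append the end-of-sequence token so the model learns where a response stops: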

from datasets import load_dataset

prompt_template = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

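def formatting_prompts_func(examples):
    # With batched = True, each field arrives as a list covering the whole batch
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    # input_text avoids shadowing Python's built-in input()
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Append EOS so the model learns where a response ends
        text = prompt_template.format(instruction, input_text, output) + tokenizer.eos_token
        texts.append(text)
    return { "text" : texts, }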

# Load a standard dataset and format it
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True)
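
To see what a single training example looks like after formatting, here is a standalone sketch that renders one record with the same template. The record is invented for illustration, and EOS_TOKEN is a placeholder; in the real script that value comes from tokenizer.eos_token:

```python
# Render one hand-written record with the same Alpaca-style template.
# Assumptions: the record is invented for illustration, and EOS_TOKEN is a
# placeholder; in the real script it comes from tokenizer.eos_token.

PROMPT_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<eos>"  # hypothetical placeholder

def format_example(instruction: str, input_text: str, output: str) -> str:
    """Fill the template and append EOS so generation learns where to stop."""
    return PROMPT_TEMPLATE.format(instruction, input_text, output) + EOS_TOKEN

sample = {
    "instruction": "Translate the sentence to French.",
    "input": "Good morning.",
    "output": "Bonjour.",
}
text = format_example(sample["instruction"], sample["input"], sample["output"])
print(text)
```

Printing a few formatted rows like this before training is a cheap way to catch template mistakes, such as a missing field or a forgotten end-of-sequence token.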

Step 5: Training Setup and Execution

Finally, we’ll use the SFTTrainer from Hugging Face’s trl library to run the training loop. With a per-device batch size of 2 and 4 gradient accumulation steps, each optimizer update sees 8 examples, which simulates a larger batch without going over our VRAM limit:

from trl import SFTTrainer
from transformers import TrainingArguments

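# Note: these argument names match the trl release this script was written against.
# Newer trl releases may expect some of them (e.g. dataset_text_field,
# max_seq_length) in SFTConfig instead, so check your installed version.
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,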
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60, # A small number of steps just to verify the script works
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

# Start training!
trainer_stats = trainer.train()
[60/60 03:15, Epoch 0/1]
Step Training Loss
1 1.854300
2 1.811200
3 1.789100
4 1.623400
5 1.542100
6 1.498200
... ...
55 0.912300
56 0.895400
57 0.887100
58 0.881200
59 0.879500
60 0.875200

TrainOutput(global_step=60, training_loss=1.245312, metrics={'train_runtime': 195.45, 'train_samples_per_second': 2.45, 'train_steps_per_second': 0.307, 'total_flos': 1.45e+15, 'train_loss': 1.245312, 'epoch': 0.009})
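
The numbers in that log are easier to interpret with a little arithmetic. The sketch below works out the effective batch size and how much of the dataset 60 steps actually cover. DATASET_SIZE is my assumption for the approximate row count of yahma/alpaca-cleaned, and num_gpus = 1 assumes a single-GPU run:

```python
# Effective batch size and how much of the dataset 60 steps actually cover.
# Assumptions: DATASET_SIZE approximates the row count of yahma/alpaca-cleaned,
# and training runs on a single GPU.

per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 60
DATASET_SIZE = 51_760

# Gradients from several small batches are summed before each optimizer step
effective_batch = per_device_batch * grad_accum_steps * num_gpus
samples_seen = effective_batch * max_steps
epoch_fraction = samples_seen / DATASET_SIZE

print(f"Effective batch size: {effective_batch}")
print(f"Samples seen in {max_steps} steps: {samples_seen}")
print(f"Fraction of one epoch: {epoch_fraction:.3f}")
```

The resulting fraction lines up with the 'epoch': 0.009 value in the log: this short run touches under 1% of the data, which is fine for a smoke test but not for a real fine-tune.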

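When the training loop is done, you’ll have a set of LoRA adapter weights tuned to the training data. Keep in mind that 60 steps cover under 1% of this dataset, so this run only verifies that the pipeline works; for a real fine-tune, raise max_steps (or train for full epochs) and swap in your own data. Either way, the training itself runs entirely on your machine, so none of your private data has to leave it.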

Closing Thoughts

There’s a big difference between just using AI as an API and really understanding how the model works behind the scenes.

Fine-tuning a small language model on your own machine makes you deal with hardware limits, memory management, and data quality. It helps you build real intuition.

I hope you found this article on fine-tuning a small language model locally helpful.

For more AI and machine learning tips, follow me on Instagram. My book, Hands-On GenAI, LLMs & AI Agents, can also help you grow your AI career.

Aman Kharwal

AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.
