Fine-Tuning a Small Language Model Locally

Fine-tuning a small language model on your own computer is no longer reserved for researchers with racks of hardware. Thanks to quantization and parameter-efficient training, you can take capable open models like Meta’s Llama 3 8B or Microsoft’s Phi-3 and customize them on a single consumer GPU. It’s a practical skill if you want a model adapted to your own data.

In this article, I’ll walk you through a hands-on tutorial for fine-tuning a small language model on your own machine.

How Fine-Tuning Works Locally

When I first tried working with language models, doing it locally felt out of reach unless you had a huge multi-GPU setup. Full fine-tuning updates billions of weights, which needs a lot of memory for optimizer states and gradients.

Enter LoRA (Low-Rank Adaptation) and Quantization.

LoRA works by freezing the original model weights and adding small, trainable low-rank matrices to the model’s layers. It’s like reading a textbook you can’t write in, so instead of changing the text, you add notes with your own updates. In practice this often recovers most of the quality of full fine-tuning, while you only train a small fraction of the parameters (about 0.5% in this tutorial).
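
To make the “notes” idea concrete, here is a minimal pure-Python sketch of a LoRA-style forward pass on a tiny 2x2 layer. The helper names (matvec, lora_forward) and the toy numbers are made up for illustration; real libraries work on GPU tensors, but the arithmetic follows the same idea: output = W x + (alpha / r) * B (A x), with W frozen and only A and B trained.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Frozen base output plus a scaled low-rank update: W x + (alpha / r) * B (A x)."""
    r = len(A)                        # rank = number of rows in A
    base = matvec(W, x)               # frozen pretrained weights, never updated
    update = matvec(B, matvec(A, x))  # trainable matrices A (r x in) and B (out x r)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 2.0], [3.0, 4.0]]  # toy frozen weight matrix (2 x 2)
A = [[1.0, 0.0]]              # rank-1 adapter matrix, shape (1 x 2)
x = [1.0, 1.0]

# B starts at zero, so before training the adapter changes nothing.
B_initial = [[0.0], [0.0]]
print(lora_forward(W, A, B_initial, x, alpha=1))  # [3.0, 7.0]

# After training, B is non-zero and shifts the output; W itself is untouched.
B_trained = [[1.0], [2.0]]
print(lora_forward(W, A, B_trained, x, alpha=1))  # [4.0, 9.0]
```

Because only A and B are trained, the number of weights that need gradients and optimizer state depends on the rank r rather than on the full size of W.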

We also use 4-bit quantization with tools like bitsandbytes. Quantization stores the large model weights in fewer bits so they fit into your GPU’s memory. With QLoRA (Quantized LoRA), an 8B model whose weights alone take about 16GB in 16-bit precision (about 32GB in 32-bit) can be fine-tuned on a regular 12GB or 16GB consumer GPU.
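
A rough back-of-the-envelope calculation shows why 4-bit weights matter. The sketch below only counts the bytes needed to store the weights; the function name and the round 8-billion parameter count are illustrative assumptions, and real usage is higher because of activations, LoRA gradients, optimizer state, and quantization metadata.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8_000_000_000  # assumption: round figure for an 8B model

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

This prints roughly 32 GB, 16 GB, and 4 GB. The 4-bit checkpoint downloaded in Step 2 is somewhat larger than this estimate (5.70G in the download log), most likely because some layers, such as the embeddings, are usually kept in higher precision.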

Fine-Tuning a Small Language Model Locally: Getting Started

Let’s see how this works step by step. In this tutorial, we won’t use any paid APIs. Instead, we’ll use open-source libraries like unsloth (which makes local training faster and more memory-efficient), Hugging Face’s transformers, trl, and peft.

We’ll use Llama-3-8B and get it ready to train on a basic instruction-following dataset.

Step 1: Environment Setup

You’ll need a computer with an NVIDIA GPU. Be sure to install these libraries:

pip install unsloth
pip install trl peft accelerate

Step 2: Loading the Quantized Model

Next, we’ll load the model in 4-bit precision to save memory. Unsloth’s FastLanguageModel makes this process simple:

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 # A solid default for most local GPU constraints
dtype = None # Auto-detects fp16 or bf16
load_in_4bit = True # saves your VRAM

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth

config.json: 100%|██████████████████████████████| 1.20k/1.20k [00:00<00:00, 4.56MB/s]
model.safetensors: 100%|████████████████████████| 5.70G/5.70G [00:45<00:00, 126MB/s]
generation_config.json: 100%|███████████████████| 172/172 [00:00<00:00, 780kB/s]
tokenizer_config.json: 100%|████████████████████| 50.6k/50.6k [00:00<00:00, 12.4MB/s]
tokenizer.json: 100%|███████████████████████████| 9.09M/9.09M [00:00<00:00, 34.2MB/s]
special_tokens_map.json: 100%|██████████████████| 464/464 [00:00<00:00, 2.10MB/s]

Step 3: Applying LoRA Adapters

Now, we attach those “notes” to the model:

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Suggested starting points: 8, 16, 32, 64
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # = 0 is optimized
    bias = "none",    # = "none" is optimized
    use_gradient_checkpointing = "unsloth", # Crucial for saving memory on long contexts
    random_state = 3407,
)

Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5195983464188562

Take note of the r parameter, which stands for rank. Setting it to 16 is a good starting point for instruction tuning. Higher ranks can capture more detail but will use more VRAM.
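
The trainable-parameter count in the log can be reproduced by hand. Each adapted linear layer of shape (in, out) gains r x (in + out) trainable weights. The sketch below assumes the published Llama 3 8B dimensions (hidden size 4096, MLP size 14336, 1024-wide key/value projections from grouped-query attention, 32 layers); the helper name is made up for illustration.

```python
# Assumed Llama 3 8B shapes: (input features, output features) per target module.
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32

MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_trainable_params(rank):
    """LoRA adds A (rank x in) and B (out x rank) to every target module in every layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS


for rank in (8, 16, 32, 64):
    print(rank, f"{lora_trainable_params(rank):,}")
```

With rank 16 this gives 41,943,040, which matches the trainable params line above. The count grows linearly with the rank, so doubling r doubles the adapter weights along with their gradients and optimizer state.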

Step 4: Formatting the Dataset

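For supervised fine-tuning, every example has to be rendered into a single text string in one consistent prompt format. We’ll convert our dataset to a standard instruction prompt format and append the EOS token so the model learns where a response ends: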

from datasets import load_dataset

prompt_template = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):  # input_text avoids shadowing the built-in input()
        text = prompt_template.format(instruction, input_text, output) + tokenizer.eos_token
        texts.append(text)
    return { "text" : texts, }

# Load a standard dataset and format it
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True)
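
To see what a single training example looks like after formatting, here is a self-contained sketch that needs neither the tokenizer nor the dataset. The sample record is invented, format_batch is a hypothetical stand-in for the function above, and EOS_TOKEN stands in for tokenizer.eos_token (assumed here to be <|end_of_text|>, the end-of-text token of the base Llama 3 tokenizer).

```python
PROMPT_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<|end_of_text|>"  # assumption: stand-in for tokenizer.eos_token


def format_batch(examples):
    """Same logic as the formatting function above, on a plain dict of columns."""
    texts = []
    for instruction, input_text, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        texts.append(PROMPT_TEMPLATE.format(instruction, input_text, output) + EOS_TOKEN)
    return {"text": texts}


# Hypothetical one-row batch, in the column-wise layout that a batched map() passes in.
batch = {
    "instruction": ["Translate to French."],
    "input": ["Good morning"],
    "output": ["Bonjour"],
}
formatted = format_batch(batch)["text"][0]
print(formatted)
```

The printed text ends with the EOS token. Without it, a fine-tuned model tends not to learn where a response should stop and may keep generating.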

Step 5: Training Setup and Execution

Finally, we’ll use Hugging Face’s SFTTrainer to run the training loop. By using a small batch size and gradient accumulation, we can simulate a larger batch size without going over our VRAM limit:

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60, # A small number of steps just to verify the script works
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

# Start training!
trainer_stats = trainer.train()

[60/60 03:15, Epoch 0/1]
Step    Training Loss
1       1.854300
2       1.811200
3       1.789100
4       1.623400
5       1.542100
6       1.498200
...     ...
55      0.912300
56      0.895400
57      0.887100
58      0.881200
59      0.879500
60      0.875200

TrainOutput(global_step=60, training_loss=1.245312, metrics={'train_runtime': 195.45, 'train_samples_per_second': 2.45, 'train_steps_per_second': 0.307, 'total_flos': 1.45e+15, 'train_loss': 1.245312, 'epoch': 0.009})
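
The numbers in this log follow directly from the training arguments. The sketch below recomputes them; the dataset size of 51,760 rows for yahma/alpaca-cleaned is an assumption from memory, so treat the epoch fraction as approximate.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
train_runtime_s = 195.45  # taken from the TrainOutput above
dataset_rows = 51_760     # assumption: approximate size of yahma/alpaca-cleaned

# On a single GPU, one optimizer step sees batch_size * accumulation_steps samples.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = effective_batch_size * max_steps

print("effective batch size:", effective_batch_size)  # 8
print("samples seen:", samples_seen)                   # 480
print("samples/second:", round(samples_seen / train_runtime_s, 2))
print("fraction of one epoch:", round(samples_seen / dataset_rows, 3))
```

Under these assumptions the 60-step run touches under 1% of the dataset, in line with the epoch value of 0.009 in the log. That is enough to verify that the script works; a real run would train for far more steps.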

When the training loop finishes, you’ll have a set of LoRA adapter weights tuned to your data and domain. And you did it all without sending any private data to a third-party service.

Closing Thoughts

There’s a big difference between just using AI as an API and really understanding how the model works behind the scenes.

Fine-tuning a small language model on your own machine makes you deal with hardware limits, memory management, and data quality. It helps you build real intuition.

I hope you found this article on fine-tuning a small language model locally helpful.

For more AI and machine learning tips, follow me on Instagram. My book, Hands-On GenAI, LLMs & AI Agents, can also help you grow your AI career.

Aman Kharwal
