Fine-tuning a small language model on your own computer isn’t just for researchers with lots of hardware anymore. With new advances in quantization and parameter-efficient training, you can now take powerful models like Meta’s Llama 3 8B or Microsoft’s Phi-3 and customize them on a regular consumer GPU. Learning this skill can be a real game-changer.
In this article, I’ll walk you through a hands-on tutorial for fine-tuning a small language model on your own machine.
How Fine-Tuning Works Locally
When I first tried working with language models, doing it locally felt out of reach unless you had a huge multi-GPU setup. Full fine-tuning updates billions of weights, which needs a lot of memory for optimizer states and gradients.
Enter LoRA (Low-Rank Adaptation) and Quantization.
LoRA works by keeping the original model weights the same and adding small, trainable matrices to the model’s layers. It’s like reading a textbook you can’t write in, so instead of changing the text, you add notes with your own updates. This method gives you about 90% of the performance of full fine-tuning, but you only need to train a small part of the model.
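To make the "notes on a textbook" idea concrete, here is a minimal sketch of a LoRA-style forward pass in plain Python. The helper names (`matmul`, `lora_forward`), the toy dimensions, and the row-vector convention are assumptions chosen for illustration; real libraries such as peft implement this on GPU tensors. The frozen weight `W` is never modified, and only the two small matrices `A` and `B` would be trained.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha):
    """Return x @ W + (alpha / r) * (x @ A) @ B, where r is the adapter rank."""
    r = len(B)                      # B has shape (r, d_out)
    scale = alpha / r
    base = matmul(x, W)             # frozen path: original weights
    update = matmul(matmul(x, A), B)  # trainable low-rank path
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

random.seed(0)
d_in, d_out, r = 64, 64, 4          # toy sizes, assumed for illustration

W = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]   # frozen
A = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d_in)]    # trainable
B = [[0.0] * d_out for _ in range(r)]                                   # trainable, zero-initialized
x = [[random.gauss(0, 1) for _ in range(d_in)]]

frozen_params = d_in * d_out
trainable_params = d_in * r + r * d_out

y_base = matmul(x, W)
y_lora = lora_forward(x, W, A, B, alpha=4)

print(f"frozen: {frozen_params}, trainable: {trainable_params}")
print("adapter changes the output before training:", y_lora != y_base)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the original one; training then moves `A` and `B` while `W` stays untouched. Even in this toy layer the adapter holds 512 values against 4,096 frozen ones.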
We also use 4-bit quantization with tools like bitsandbytes. Quantization shrinks the large model weights so they fit into your GPU’s memory. With QLoRA (Quantized LoRA), a model that would usually need 32GB of VRAM can now run on a regular 12GB or 16GB consumer GPU.
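A quick back-of-the-envelope calculation shows where figures like these come from. The sketch below only counts the memory needed to store the weights, assuming roughly 8.03 billion parameters for an 8B-class model; the function name and the parameter count are assumptions for illustration, and real usage is higher once activations, adapter gradients, and optimizer states are included.

```python
# Approximate bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}

NUM_PARAMS = 8_030_000_000  # assumed parameter count for an 8B-class model

def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

estimates = {
    name: weight_memory_gb(NUM_PARAMS, size)
    for name, size in BYTES_PER_PARAM.items()
}

for name, gb in estimates.items():
    print(f"{name:>9}: {gb:5.1f} GB")
```

Under these assumptions the weights take about 32 GB in fp32 and about 4 GB in 4-bit. Actual 4-bit checkpoints tend to be somewhat larger than this estimate, likely because some layers and the quantization constants are stored at higher precision.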
Fine-Tuning a Small Language Model Locally: Getting Started
Let’s see how this works step by step. In this tutorial, we won’t use any paid APIs. Instead, we’ll use open-source libraries like unsloth (which makes local training faster and more memory-efficient), Hugging Face’s transformers, trl, and peft.
We’ll use Llama-3-8B and get it ready to train on a basic instruction-following dataset.
If you want to master this shift from simple LLM apps to real-world AI agent systems, I’ve broken it down step-by-step in my book: Hands-On GenAI, LLMs & AI Agents.
Step 1: Environment Setup
You’ll need a computer with an NVIDIA GPU. Be sure to install these libraries:
pip install unsloth
pip install trl peft accelerate
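Before moving on, it can help to confirm that the installation worked. The following is a small pre-flight sketch, not part of any of these libraries: the function name `environment_report` is hypothetical, and it only checks whether each package can be found and whether the `nvidia-smi` tool is on your PATH, without importing anything heavy.

```python
import importlib.util
import shutil

# Packages named in the install commands above, plus torch, which they depend on.
REQUIRED_PACKAGES = ("torch", "unsloth", "trl", "peft", "accelerate")

def environment_report():
    """Return a dict mapping each check to True/False without importing the packages."""
    report = {"nvidia-smi": shutil.which("nvidia-smi") is not None}
    for package in REQUIRED_PACKAGES:
        report[package] = importlib.util.find_spec(package) is not None
    return report

if __name__ == "__main__":
    for name, found in environment_report().items():
        status = "found" if found else "MISSING"
        print(f"{name:<12} {status}")
```

If anything is reported as missing, rerun the matching install command. Finding `nvidia-smi` only suggests that an NVIDIA driver is installed; it does not guarantee that the GPU has enough memory for training.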
Step 2: Loading the Quantized Model
Next, we’ll load the model in 4-bit precision to save memory. Unsloth’s FastLanguageModel makes this process simple:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # A solid default for most local GPU constraints
dtype = None           # Auto-detects fp16 or bf16
load_in_4bit = True    # Saves your VRAM

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
config.json: 100%|██████████████████████████████| 1.20k/1.20k [00:00<00:00, 4.56MB/s]
model.safetensors: 100%|████████████████████████| 5.70G/5.70G [00:45<00:00, 126MB/s]
generation_config.json: 100%|███████████████████| 172/172 [00:00<00:00, 780kB/s]
tokenizer_config.json: 100%|████████████████████| 50.6k/50.6k [00:00<00:00, 12.4MB/s]
tokenizer.json: 100%|███████████████████████████| 9.09M/9.09M [00:00<00:00, 34.2MB/s]
special_tokens_map.json: 100%|██████████████████| 464/464 [00:00<00:00, 2.10MB/s]
Step 3: Applying LoRA Adapters
Now, we attach those “notes” to the model:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # Suggested starting points: 8, 16, 32, 64
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,  # = 0 is optimized
    bias = "none",     # = "none" is optimized
    use_gradient_checkpointing = "unsloth",  # Crucial for saving memory on long contexts
    random_state = 3407,
)

Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5195983464188562
Take note of the r parameter, which stands for rank. Setting it to 16 is a good starting point for instruction tuning. Higher ranks can capture more detail but will use more VRAM.
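To see how the rank drives the number of trainable parameters, the sketch below recomputes the figure from the log above. The layer dimensions are assumptions taken from the published Llama 3 8B configuration (hidden size 4096, MLP size 14336, a 1024-wide key/value projection, 32 layers), and the function name is hypothetical. Each adapted module adds two matrices, one of shape (input size × r) and one of shape (r × output size).

```python
# Assumed Llama 3 8B dimensions (not stated in this article).
HIDDEN = 4096          # model hidden size
INTERMEDIATE = 14336   # MLP inner size
KV_DIM = 1024          # key/value projection width (8 KV heads x 128)
NUM_LAYERS = 32

# (input size, output size) for every module listed in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_trainable_params(rank):
    """Adapter parameters across all layers: rank * (d_in + d_out) per module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS

for rank in (8, 16, 32, 64):
    print(f"r = {rank:>2}: {lora_trainable_params(rank):>11,} trainable parameters")
```

With r = 16 this reproduces the 41,943,040 trainable parameters reported in the log, and the count grows linearly with the rank, so doubling r doubles the adapter size.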
Step 4: Formatting the Dataset
Models need data to be organized in a very specific format. We’ll convert our dataset to a standard prompt format:
from datasets import load_dataset

prompt_template = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Append the EOS token so the model learns where a response ends
        text = prompt_template.format(instruction, input_text, output) + tokenizer.eos_token
        texts.append(text)
    return {"text": texts}

# Load a standard dataset and format it
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True)
Step 5: Training Setup and Execution
Finally, we’ll use the SFTTrainer from Hugging Face’s trl library to run the training loop. By using a small batch size and gradient accumulation, we can simulate a larger batch size (here 2 × 4 = 8 sequences per optimizer step) without going over our VRAM limit:
from trl import SFTTrainer
from transformers import TrainingArguments

# Note: these arguments match the TRL release used for this run. Newer TRL
# releases may expect dataset_text_field and max_seq_length in SFTConfig instead.
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,  # A small number of steps just to verify the script works
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

# Start training!
trainer_stats = trainer.train()

[60/60 03:15, Epoch 0/1]
Step Training Loss
1 1.854300
2 1.811200
3 1.789100
4 1.623400
5 1.542100
6 1.498200
... ...
55 0.912300
56 0.895400
57 0.887100
58 0.881200
59 0.879500
60 0.875200
TrainOutput(global_step=60, training_loss=1.245312, metrics={'train_runtime': 195.45, 'train_samples_per_second': 2.45, 'train_steps_per_second': 0.307, 'total_flos': 1.45e+15, 'train_loss': 1.245312, 'epoch': 0.009})
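The numbers in this log can be cross-checked with a little arithmetic. The sketch below assumes a single GPU and a training split of 51,760 examples for yahma/alpaca-cleaned; that dataset size is an assumption not stated in this article, and the function name is hypothetical.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Sequences contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices

PER_DEVICE_BATCH = 2     # per_device_train_batch_size from the training arguments
GRAD_ACCUM_STEPS = 4     # gradient_accumulation_steps
MAX_STEPS = 60           # max_steps
DATASET_SIZE = 51_760    # assumed number of rows in the training split

batch = effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS)
samples_seen = batch * MAX_STEPS
epoch_fraction = samples_seen / DATASET_SIZE

print(f"effective batch size: {batch}")
print(f"samples seen:         {samples_seen}")
print(f"fraction of an epoch: {epoch_fraction:.3f}")
```

Under these assumptions the run covers 480 examples, which is about 0.009 of an epoch and agrees with the 'epoch' value in the output. This is a smoke test rather than a full training run; raising max_steps or training for a full epoch would be the next step.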
When the training loop is done, you’ll have a set of custom weights made just for your data and domain. And you did it all without sending any private information online.
Closing Thoughts
There’s a big difference between just using AI as an API and really understanding how the model works behind the scenes.
Fine-tuning a small language model on your own machine makes you deal with hardware limits, memory management, and data quality. It helps you build real intuition.
I hope you found this article on fine-tuning a small language model locally helpful.
For more AI and machine learning tips, follow me on Instagram. My book, Hands-On GenAI, LLMs & AI Agents, can also help you grow your AI career.





