Fine-Tuning an Open-Source LLM

For a long time, training Large Language Models (LLMs) seemed reserved for big tech companies with clusters of H100s. That’s changed. Now, with tools like Unsloth and methods like QLoRA (Quantized Low-Rank Adaptation), you can fine-tune powerful open-source models like Llama 3 on a free Google Colab account in less than an hour. In this article, I’ll show you how to fine-tune an open-source LLM step by step.

Fine-Tuning an Open-Source LLM: Getting Started

We’ll use Unsloth, a library that makes fine-tuning faster and more memory-efficient. According to the project, it can roughly double training speed and use up to 60% less memory compared to standard Hugging Face methods.

Here’s what you need:

A Google Colab account (Free tier works).
Select Runtime > Change runtime type > T4 GPU.

Step 1: Installation and Setup

First, we’ll replace Colab’s default accelerate package with a pinned version, then install Unsloth along with xformers (which provides memory-efficient attention), trl, peft, and bitsandbytes:

%%capture
!pip uninstall -y accelerate
!pip install accelerate==0.27.2
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft bitsandbytes

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
from unsloth import FastLanguageModel

print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0))

CUDA available: True
GPU: Tesla T4

We use %%capture to hide the long install logs. We install unsloth, trl (a library for Supervised Fine-Tuning), and peft. Then, we check that PyTorch detects the GPU.

Step 2: Loading the 4-bit Model

Next, we load the Llama 3 model. The checkpoint we use is already quantized to 4-bit with bitsandbytes, so Unsloth downloads the smaller weights and loads them directly onto the GPU:

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

Here are the key parameters we used:

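max_seq_length = 2048: This sets the maximum sequence length used for training. Llama 3 supports a context window of up to 8,192 tokens, but 2048 is a good default for efficient training on free accounts.
dtype = None: This lets Unsloth pick the precision automatically (float16 on a T4, bfloat16 on Ampere or newer GPUs).
load_in_4bit = True: This key setting lets the model fit into about 6GB of VRAM, so there’s room left for activations, gradients, and optimizer state.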
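
To see why 4-bit loading matters, here is a rough back-of-the-envelope estimate of the memory needed just for the weights. It assumes about 8 billion parameters and ignores quantization overhead, activations, and optimizer state, so treat the numbers as a lower bound rather than a measurement. The function name weight_memory_gib is my own and is not part of Unsloth.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Assumption: roughly 8 billion parameters for an "8B" model.
ASSUMED_PARAMS = 8e9

fp16_gib = weight_memory_gib(ASSUMED_PARAMS, 16)  # about 14.9 GiB
int4_gib = weight_memory_gib(ASSUMED_PARAMS, 4)   # about 3.7 GiB

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
```

At 16-bit precision the weights alone would nearly fill a T4’s 16 GB, while the 4-bit version leaves several gigabytes free for training.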

Step 3: Configuring LoRA

Here, we attach LoRA adapters and choose which layers they apply to. The original weights stay frozen; only the small adapter matrices are trained:

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

# No explicit model.to("cuda") is needed: from_pretrained already places the model on the GPU.

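In this case, r = 16 sets the rank of the adapter matrices. Higher values mean more trainable parameters and more capacity, at the cost of extra memory. Common choices are 16, 32, or 64. The lora_alpha value scales the adapter’s contribution by lora_alpha / r, so with both set to 16 the scale is 1.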
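
If the idea of a low-rank update feels abstract, this tiny pure-Python sketch may help. It is not how Unsloth or PEFT implement LoRA internally; it only illustrates the usual formulation, where a frozen weight W is combined with two small matrices B and A and scaled by lora_alpha / r. The helper names matmul and lora_effective_weight are made up for this illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, r, lora_alpha):
    """Return W + (lora_alpha / r) * (B @ A) without modifying W."""
    scale = lora_alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Toy shapes: d_out = 2, d_in = 3, rank r = 1.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]   # frozen base weight
A = [[1.0, 2.0, 3.0]]   # r x d_in, trainable
B = [[0.5], [0.0]]      # d_out x r, trainable

print(lora_effective_weight(W, A, B, r=1, lora_alpha=2))
# [[2.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
```

In the usual LoRA setup B starts at zero, so the adapted model begins as an exact copy of the base model and only drifts away from it as training updates A and B.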

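With target_modules, we apply LoRA to all linear layers in each transformer block: the attention projections (q_proj, k_proj, v_proj, o_proj) and the MLP projections (gate_proj, up_proj, down_proj). Earlier guides often targeted only the attention query and value projections (q_proj and v_proj), but adapting all linear layers usually gives better results.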
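
To get a feel for how much r and target_modules affect the adapter size, here is a small counting sketch. Each adapted linear layer with shape d_in x d_out adds r * (d_in + d_out) parameters. The layer shapes below are my assumptions based on the published Llama 3 8B configuration (hidden size 4096, MLP size 14336, 32 layers, and 1024-wide key/value projections), so double-check them against the model you actually load.

```python
# Assumed Llama 3 8B shapes (d_in, d_out); verify against your model's config.
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024          # assumed: 8 key/value heads x 128 dims each
NUM_LAYERS = 32

TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r, target_modules, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA parameters: r * (d_in + d_out) per adapted layer."""
    per_block = sum(r * (shapes[name][0] + shapes[name][1]) for name in target_modules)
    return per_block * num_layers


all_linear = lora_param_count(16, list(TARGET_SHAPES))
attention_qv_only = lora_param_count(16, ["q_proj", "v_proj"])

print(f"All linear layers: {all_linear:,}")         # 41,943,040
print(f"q_proj and v_proj: {attention_qv_only:,}")  # 6,815,744
```

Under these assumptions, adapting every linear layer trains about 42 million parameters, which is still only around half a percent of an 8-billion-parameter model.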

Notice use_gradient_checkpointing = "unsloth". Gradient checkpointing recomputes some activations during the backward pass instead of storing them, which costs a little extra compute time but greatly reduces VRAM usage.

Step 4: Formatting the Data

LLMs need consistently formatted training data. If your examples have no clear structure or end marker, the model won’t learn where a response begins and ends. We’ll use the standard Alpaca prompt format:

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]

    texts = []
    # Build one training string per row; input_text avoids shadowing the built-in input().
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}

from datasets import load_dataset

dataset = load_dataset("yahma/alpaca-cleaned", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True, num_proc=2)

Be sure to add tokenizer.eos_token at the end of each training example. Without it, the model might keep generating text endlessly because it never learned when to stop.
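
A quick sanity check before training can catch a missing EOS token early. The sketch below runs without a tokenizer, so it uses a stand-in EOS string. That string is an assumption for illustration only; in the notebook you would pass tokenizer.eos_token instead. The helper names format_example and missing_eos are hypothetical.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Stand-in value so the sketch runs on its own; use tokenizer.eos_token in practice.
STAND_IN_EOS = "<|end_of_text|>"


def format_example(instruction, input_text, output, eos_token=STAND_IN_EOS):
    """Fill the Alpaca template and append the end-of-sequence token."""
    return ALPACA_PROMPT.format(instruction, input_text, output) + eos_token


def missing_eos(texts, eos_token=STAND_IN_EOS):
    """Return the indices of training strings that do not end with EOS."""
    return [i for i, text in enumerate(texts) if not text.endswith(eos_token)]


good = format_example("Add the numbers.", "2 and 3", "5")
bad = ALPACA_PROMPT.format("Add the numbers.", "2 and 3", "5")  # EOS forgotten

print(missing_eos([good, bad]))  # [1]
```

Running a check like this over dataset["text"] should return an empty list if every example was formatted correctly.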

Step 5: The Training Loop

We’ll use the SFTTrainer (Supervised Fine-Tuning Trainer) from Hugging Face’s TRL library, which Unsloth has optimized:

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,              # increase for real training
        learning_rate = 2e-4,
        fp16 = True,                 # FORCE fp16 (no bf16 auto-detect)
        bf16 = False,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
        no_cuda = False,
    ),
)
trainer_stats = trainer.train()

Here are the key parameters we used:

max_steps = 60: This is just for demonstration. For a real project, you’d usually run 1 or 2 full passes (epochs) over your dataset.
optim = "adamw_8bit": This 8-bit AdamW optimizer keeps its state in 8 bits, which saves extra memory.
fp16 = True: This sets 16-bit floating point precision. If you have an Ampere or newer GPU (such as an A100 or A10), you can use bf16=True instead, but the T4 (the Colab default) does not support bf16, so stick with fp16.
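
It helps to work out how much data those 60 steps actually cover. The sketch below is simple arithmetic, not a call into the Trainer. The dataset size of 52,000 is a round, assumed figure for an Alpaca-sized dataset, and the step count per epoch is an approximation of what the Trainer would compute.

```python
import math


def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of examples that contribute to each optimizer step."""
    return per_device_batch * accumulation_steps * num_devices


def approx_steps_per_epoch(dataset_size, per_device_batch, accumulation_steps):
    """Approximate optimizer steps needed for one full pass over the data."""
    return math.ceil(dataset_size / effective_batch_size(per_device_batch, accumulation_steps))


batch = effective_batch_size(per_device_batch=2, accumulation_steps=4)
examples_in_demo = batch * 60           # max_steps = 60
ASSUMED_DATASET_SIZE = 52_000           # assumption: roughly Alpaca-sized

print("Effective batch size:", batch)                  # 8
print("Examples seen in 60 steps:", examples_in_demo)  # 480
print("Approx. steps for one epoch:",
      approx_steps_per_epoch(ASSUMED_DATASET_SIZE, 2, 4))  # 6500
```

So the demo run sees less than 1% of a dataset that size. For a full epoch you would drop max_steps and set num_train_epochs = 1 in TrainingArguments instead.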

You will see a training loss table updating every step:

Step | Training Loss
1    | 1.812300
2    | 2.241500
...
60   | 0.891200

Individual steps are noisy because each one only sees a handful of examples, but the loss should trend downward overall.
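
Because per-step loss jumps around, a moving average makes the trend easier to see. This sketch assumes log entries shaped like the dictionaries in trainer.state.log_history, where training steps carry a "loss" key. The sample values are made up for illustration and are not results from a real run.

```python
def extract_losses(log_history):
    """Keep only entries that record a training loss."""
    return [entry["loss"] for entry in log_history if "loss" in entry]


def moving_average(values, window):
    """Average each value with up to window - 1 preceding values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical log entries; in the notebook use trainer.state.log_history.
sample_log = [
    {"step": 1, "loss": 2.0},
    {"step": 2, "loss": 1.0},
    {"step": 3, "loss": 3.0},
    {"step": 4, "loss": 1.0},
    {"train_runtime": 10.0},  # summary entry without a "loss" key
]

losses = extract_losses(sample_log)
print(moving_average(losses, window=2))  # [2.0, 1.5, 2.0, 2.0]
```

With real logs, a window of 5 to 10 steps is a reasonable starting point for a 60-step run.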

Step 6: Testing the Model

Once training is done, we put the model into inference mode and test it:

FastLanguageModel.for_inference(model)

with torch.inference_mode():
    inputs = tokenizer(
        [
            alpaca_prompt.format(
                "Continue the fibonacci sequence.",
                "1, 1, 2, 3, 5, 8",
                "",
            )
        ],
        return_tensors="pt",
    ).to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        use_cache=True,
    )

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

13, 21, 34, 55, 89, 144, 233, 377, 610, 987

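The decoded string contains the full prompt followed by the model’s answer, so look at the text after “### Response:”. The model should continue the sequence correctly, as in the output above.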
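
If you only want the answer, you can cut the prompt off the decoded text. Here is a minimal sketch, assuming the Alpaca template used in this article. The helper name extract_response is hypothetical, and the sample string stands in for what tokenizer.batch_decode would return.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded_text):
    """Return the text after the first response marker, or the whole text if absent."""
    _, marker, after = decoded_text.partition(RESPONSE_MARKER)
    if not marker:
        return decoded_text.strip()
    return after.strip()


# Stand-in for one decoded output string.
sample_decoded = (
    "### Instruction:\nContinue the fibonacci sequence.\n\n"
    "### Input:\n1, 1, 2, 3, 5, 8\n\n"
    "### Response:\n13, 21, 34"
)

print(extract_response(sample_decoded))  # 13, 21, 34
```

In the notebook you could apply it as [extract_response(t) for t in tokenizer.batch_decode(outputs, skip_special_tokens=True)].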

Closing Thoughts

That’s how you fine-tune an open-source LLM. Not long ago, only research labs had this kind of access to AI. Now, you can do it right in your browser.

Don’t just run the script and move on. Try changing the dataset. Look for something on Hugging Face that matches your interests, like recipes, code snippets, or poems, and fine-tune Llama 3 to become an expert in that area.
