Data Augmentation using LLMs

Data augmentation is a fundamental technique in machine learning for expanding and diversifying datasets with synthetic data. With the rise of Large Language Models (LLMs), it has become one of their most practical industry applications, because a model can be prompted to produce varied examples for many use cases. In this article, I’ll take you through the concept of data augmentation using LLMs with Python.

What is Data Augmentation?

Data Augmentation involves creating new, synthetic data points from an existing dataset to improve model robustness, generalization, and performance. In the context of text, this could mean paraphrasing sentences, generating new examples, or creating entirely new structured data based on patterns. LLMs excel at this task because they can understand context, mimic writing styles, and generate plausible outputs based on prompts.

For structured data like tables (e.g., CSV files), LLMs can be guided to produce rows that follow a specific format, such as employee records with fields like Employee ID, Name, Age, Department, Salary, and Experience.

Let’s understand data augmentation using LLMs with a practical example!

Data Augmentation using LLMs

Let’s consider a scenario where we have an Employee Salary Dataset (with columns like: Employee ID, Name, Age, Department, Salary, and Experience), but it contains only a few samples. Our goal is to generate additional realistic records to improve training data quality for a salary prediction model.

Step 1: Load a Pretrained LLM

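I am using GPT-2 for demonstration because it is small and freely downloadable. For better results, you can try larger open models like LLaMA or Mistral, which load the same way, or a hosted model like GPT-3.5, which is accessed through its API rather than through this pipeline: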

from transformers import pipeline

# load GPT-2 model for text generation
generator = pipeline("text-generation", model="gpt2")

When this code runs, the GPT-2 model and its tokenizer are downloaded (if not already cached) and loaded into memory. The generator object is now ready to produce text based on input prompts.

Step 2: Define a Structured Prompt for Data Generation

LLMs like GPT-2 rely on patterns in the input prompt. By providing a detailed example, we guide the model to mimic the structure (comma-separated values) and content (realistic employee data). You can also build the example rows from an existing CSV file instead of typing them by hand.
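As a minimal sketch of building the prompt from CSV data, the snippet below reads a few seed rows and formats them as examples. The `seed_csv` string, the `build_prompt` helper, and the `max_rows` parameter are hypothetical names used only for illustration; in practice the text would come from your own file.

```python
import csv
from io import StringIO

# Hypothetical stand-in for the contents of a small seed CSV file.
seed_csv = """Employee ID,Name,Age,Department,Salary,Experience
1,John Doe,28,Engineering,75000,3
2,Jane Smith,32,Marketing,85000,5
3,Alice Brown,45,HR,95000,10
"""

def build_prompt(csv_text, max_rows=20):
    """Turn the first max_rows records of a CSV string into a few-shot prompt."""
    reader = csv.reader(StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row][:max_rows]
    lines = [
        "Generate a structured table in CSV format with columns:",
        ", ".join(header) + ".",
        "",
        "Example:",
    ]
    lines += [", ".join(row) for row in rows]
    # End with a newline so the model continues with a fresh row.
    return "\n".join(lines) + "\n"

prompt_from_csv = build_prompt(seed_csv)
print(prompt_from_csv)
```

Capping the number of example rows keeps the prompt short, which matters for a small model like GPT-2.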

For now, we will just give some sample data in the prompt:

prompt = """
Generate a structured table in CSV format with columns:
Employee ID, Name, Age, Department, Salary ($), Experience (Years).

Example:
1, John Doe, 28, Engineering, 75000, 3
2, Jane Smith, 32, Marketing, 85000, 5
3, Alice Brown, 45, HR, 95000, 10
4, Robert White, 38, Engineering, 90000, 7
5, Emily Davis, 29, Finance, 72000, 4
6, Michael Johnson, 50, Sales, 110000, 20
7, Sarah Wilson, 31, HR, 78000, 6
8, David Lee, 42, Marketing, 88000, 12
9, Jennifer Moore, 27, Engineering, 71000, 2
10, Kevin Clark, 35, Finance, 93000, 8
11, Jessica Taylor, 30, Sales, 79000, 5
12, William Martin, 37, HR, 87000, 9
13, Olivia Adams, 40, Engineering, 99000, 14
14, Daniel Harris, 26, Finance, 70000, 2
15, Sophia Anderson, 33, Marketing, 85000, 7
16, Matthew Thomas, 29, Sales, 73000, 3
17, Laura Jackson, 36, HR, 89000, 10
18, Anthony Rodriguez, 41, Engineering, 105000, 15
19, Lisa Scott, 39, Marketing, 92000, 11
20, Andrew Hall, 34, Finance, 94000, 9
"""

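# generate synthetic data using GPT-2; GPT-2 can process at most 1024 tokens
# (prompt included), so cap the number of newly generated tokens
generated_data = generator(prompt, max_new_tokens=400, num_return_sequences=1)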

# print generated output
print(generated_data[0]['generated_text'])

[Figure: output of the generated samples]

This prompt primes GPT-2 to generate additional rows that follow the same format and style. When we run this code, GPT-2 uses its learned patterns to predict what comes next after the prompt, ideally producing new CSV rows with realistic employee data.
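One practical constraint is that GPT-2 can attend to at most 1,024 tokens, prompt included, so the number of new tokens worth requesting depends on how long the prompt is. The sketch below shows that arithmetic. The `new_token_budget` helper and the figure of roughly 20 tokens per row are assumptions for illustration, not measured values.

```python
GPT2_CONTEXT_WINDOW = 1024  # maximum number of positions GPT-2 supports

def new_token_budget(prompt_tokens, context_window=GPT2_CONTEXT_WINDOW,
                     rows_wanted=20, tokens_per_row=20):
    """Return how many new tokens to request without exceeding the context window."""
    if prompt_tokens >= context_window:
        raise ValueError("prompt alone fills the context window; shorten it")
    remaining = context_window - prompt_tokens
    # tokens_per_row is a rough assumption, not a measured value.
    return min(remaining, rows_wanted * tokens_per_row)

# With the pipeline from Step 1, the prompt length could be measured as:
#   prompt_tokens = len(generator.tokenizer(prompt)["input_ids"])
print(new_token_budget(prompt_tokens=400))  # 400
print(new_token_budget(prompt_tokens=900))  # 124
```

The returned value can be passed to the generator as `max_new_tokens`.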

Step 3: Parse the Generated Text into a DataFrame

The generated output is still plain text, so we need to extract structured data from it. The resulting DataFrame, data, should contain the synthetic rows generated by GPT-2, structured as a table:

import pandas as pd
from io import StringIO

# extract generated text
generated_text = generated_data[0]['generated_text']

# remove the prompt portion from the output
generated_text = generated_text.replace(prompt, "").strip()

# convert to dataframe
data = pd.read_csv(StringIO(generated_text), names=["Employee ID", "Name", "Age", "Department", "Salary", "Experience"])
print(data)

    Employee ID                  Name  Age     Department       Salary  \
0            21            Jodi Hynes   34    Engineering        91000   
1            22          Nicole Burch   35      Marketing        70000   
2            23        Scott Thompson   28        Finance        40000   
3            24        Julia Williams   29     Accounting       240000   
4            25     Julia Jones-Smith   35      Education        80000   
5            26           Lisa Wilson   36   Hired Talent        90000   
6            27              Amy Fung   33   Construction        80000   
7            28        Kathy Foulburn   32      Education       140000   
8            29          Lisa Mathers   18         Hiring        70000   
9            30      Sarah R. Johnson   30         Hiring        75000   
10           31   Laura Johnson-Garré   30    Social Club       180000   
11           32        Jannice Taylor   21         Hiring        80000   
12           33       Mary T. Johnson   27    Engineering        70000   
13           34         David Johnson   25         Hiring        75000   
14           35          Paul Johnson   25         Hiring        85000   
15           36        Ann J. Johnson   21         Hiring        90000   
16           37       Jill K. Johnson   24     Management        70000   
17           38           Ann Johnson   21         Hiring       100000   
18           39         Laura Johnson   19    Engineering        70000   
19           40      Jennifer Johnson   20     Technology        70000   
20           41       John R. Johnson   36         Hiring   75000 0=No   

    Experience  
0         12.0  
1         10.0  
2          9.0  
3         11.0  
4         11.0  
5         14.0  
6         16.0  
7         22.0  
8          4.0  
9          8.0  
10         NaN  
11         NaN  
12         NaN  
13        10.0  
14         9.0  
15        11.0  
16         4.0  
17         NaN  
18         NaN  
19         NaN  
20         NaN

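This step transforms the raw text into a usable format for analysis or machine learning. Note that the output above is not clean: several rows have a missing Experience value (NaN), one Salary field contains stray text (75000 0=No), and departments such as Hired Talent and Social Club do not appear in the examples. The generated rows should therefore be validated before they are added to the training data.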
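Generated rows usually need a validation pass before they are used. The following is a minimal sketch of one possible filter, using only the standard library. The `clean_rows` and `to_int` helpers, the allowed-department list (taken from the prompt examples), and the age rule are assumptions chosen for illustration. The sample rows are copied from the output above.

```python
import csv
from io import StringIO

# Departments that appear in the prompt examples from Step 2.
ALLOWED_DEPARTMENTS = {"Engineering", "Marketing", "HR", "Finance", "Sales"}
COLUMNS = ["Employee ID", "Name", "Age", "Department", "Salary", "Experience"]

def to_int(text):
    """Return text as an int, or None if it is not a plain integer."""
    try:
        return int(text)
    except ValueError:
        return None

def clean_rows(generated_text):
    """Keep rows with six fields, integer values, and a known department."""
    kept = []
    for fields in csv.reader(StringIO(generated_text), skipinitialspace=True):
        fields = [f.strip() for f in fields]
        if len(fields) != len(COLUMNS):
            continue  # missing or extra fields
        emp_id, name, age, dept, salary, exp = fields
        numbers = [to_int(v) for v in (emp_id, age, salary, exp)]
        if None in numbers or dept not in ALLOWED_DEPARTMENTS:
            continue
        emp_id, age, salary, exp = numbers
        # Hypothetical plausibility rule: no work experience before age 16.
        if age - exp < 16:
            continue
        kept.append(dict(zip(COLUMNS, [emp_id, name, age, dept, salary, exp])))
    return kept

sample = """21, Jodi Hynes, 34, Engineering, 91000, 12
24, Julia Williams, 29, Accounting, 240000, 11
32, Jannice Taylor, 21, Hiring, 80000
41, John R. Johnson, 36, Hiring, 75000 0=No
22, Nicole Burch, 35, Marketing, 70000, 10
"""

cleaned = clean_rows(sample)
for row in cleaned:
    print(row)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(cleaned)` if you want to continue working in pandas.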

Summary

In this article, we demonstrated how an LLM can generate synthetic tabular data to augment a small dataset: load a pretrained model, prime it with a structured prompt of example rows, and parse the generated text into a DataFrame. I hope you liked this article on data augmentation using LLMs. Feel free to ask questions in the comments section below. You can follow me on Instagram for many more resources.