Data Augmentation using LLMs

Data Augmentation is a fundamental technique in machine learning used to expand and diversify datasets by generating synthetic data. With the rise of Large Language Models (LLMs), generating synthetic data has become one of their practical industry applications, because a well-written prompt can produce varied examples for many use cases. So, in this article, I’ll take you through the concept of data augmentation using LLMs with Python.

What is Data Augmentation?

Data Augmentation involves creating new, synthetic data points from an existing dataset to improve model robustness, generalization, and performance. In the context of text, this could mean paraphrasing sentences, generating new examples, or creating entirely new structured data based on patterns. LLMs excel at this task because they can understand context, mimic writing styles, and generate plausible outputs based on prompts.

For structured data like tables (e.g., CSV files), LLMs can be guided to produce rows that follow a specific format, such as employee records with fields like Employee ID, Name, Age, Department, Salary, and Experience.

Let’s understand data augmentation using LLMs with a practical example!

Data Augmentation using LLMs

Let’s consider a scenario where we have an Employee Salary Dataset (with columns like: Employee ID, Name, Age, Department, Salary, and Experience), but it contains only a few samples. Our goal is to generate additional realistic records to improve training data quality for a salary prediction model.

Step 1: Load a Pretrained LLM

I am using GPT-2 for demonstration. For better results, you can try larger open models like LLaMA or Mistral through the same pipeline, or a hosted model like GPT-3.5 through its API:

from transformers import pipeline

# load GPT-2 model for text generation
generator = pipeline("text-generation", model="gpt2")

When this code runs, the GPT-2 model and its tokenizer are downloaded (if not already cached) and loaded into memory. The generator object is now ready to produce text based on input prompts.

Step 2: Define a Structured Prompt for Data Generation

LLMs like GPT-2 rely on patterns in the input prompt. By providing a detailed example, we guide the model to mimic the structure (comma-separated values) and content (realistic employee data). You can also build the example rows in the prompt from an existing CSV file.
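As a minimal sketch of taking the example rows from a CSV file, the snippet below builds a prompt in the same layout as the one used in this article. The helper name build_prompt and the in-memory sample_csv string are assumptions made for illustration; with a real file you would read its text first.

```python
import csv
from io import StringIO


def build_prompt(csv_text: str, n_examples: int = 5) -> str:
    """Build a few-shot prompt from the first n_examples rows of CSV text."""
    reader = csv.reader(StringIO(csv_text), skipinitialspace=True)
    header = next(reader)  # first line holds the column names
    rows = [row for row in reader if row][:n_examples]
    lines = [
        "Generate a structured table in CSV format with columns:",
        ", ".join(header) + ".",
        "",
        "Example:",
    ]
    lines += [", ".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Stand-in for the text of a real CSV file (hypothetical data source)
sample_csv = """Employee ID,Name,Age,Department,Salary ($),Experience (Years)
1,John Doe,28,Engineering,75000,3
2,Jane Smith,32,Marketing,85000,5
3,Alice Brown,45,HR,95000,10
"""

prompt_from_csv = build_prompt(sample_csv, n_examples=2)
print(prompt_from_csv)
```

Keeping the number of example rows small leaves more of the model's context window for the generated rows.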

For now, we will just give some sample data in the prompt:

prompt = """
Generate a structured table in CSV format with columns:
Employee ID, Name, Age, Department, Salary ($), Experience (Years).

Example:
1, John Doe, 28, Engineering, 75000, 3
2, Jane Smith, 32, Marketing, 85000, 5
3, Alice Brown, 45, HR, 95000, 10
4, Robert White, 38, Engineering, 90000, 7
5, Emily Davis, 29, Finance, 72000, 4
6, Michael Johnson, 50, Sales, 110000, 20
7, Sarah Wilson, 31, HR, 78000, 6
8, David Lee, 42, Marketing, 88000, 12
9, Jennifer Moore, 27, Engineering, 71000, 2
10, Kevin Clark, 35, Finance, 93000, 8
11, Jessica Taylor, 30, Sales, 79000, 5
12, William Martin, 37, HR, 87000, 9
13, Olivia Adams, 40, Engineering, 99000, 14
14, Daniel Harris, 26, Finance, 70000, 2
15, Sophia Anderson, 33, Marketing, 85000, 7
16, Matthew Thomas, 29, Sales, 73000, 3
17, Laura Jackson, 36, HR, 89000, 10
18, Anthony Rodriguez, 41, Engineering, 105000, 15
19, Lisa Scott, 39, Marketing, 92000, 11
20, Andrew Hall, 34, Finance, 94000, 9
"""

# generate synthetic data using GPT-2
# GPT-2's context window is 1,024 tokens (prompt + output), so cap the new tokens
generated_data = generator(prompt, max_new_tokens=400, num_return_sequences=1)

# print generated output
print(generated_data[0]['generated_text'])

[Figure: Output of the generated samples]

This prompt primes GPT-2 to generate additional rows that follow the same format and style. When we run this code, GPT-2 uses its learned patterns to predict what comes next after the prompt, ideally producing new CSV rows with realistic employee data.
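Because the model simply continues the prompt, one call gives only a handful of rows, and repeated calls can return duplicates. Here is a rough sketch of collecting unique new rows over several calls. The function collect_new_rows and the stub fake_generate are hypothetical names used for illustration, and the stub stands in for the real model so the example runs without downloading anything.

```python
from typing import Callable, List


def collect_new_rows(generate_fn: Callable[[str], str], prompt: str, n_calls: int = 3) -> List[str]:
    """Call the generator n_calls times and keep unique lines not already in the prompt."""
    seen = {line.strip() for line in prompt.splitlines() if line.strip()}
    new_rows = []
    for _ in range(n_calls):
        full_text = generate_fn(prompt)
        # The generated text starts with the prompt itself, so drop that part
        continuation = full_text[len(prompt):] if full_text.startswith(prompt) else full_text
        for line in continuation.splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.add(line)
                new_rows.append(line)
    return new_rows


def fake_generate(p: str) -> str:
    """Stub generator with made-up rows, used in place of GPT-2 for this demo."""
    return p + "21, Ann Lee, 30, HR, 80000, 6\n21, Ann Lee, 30, HR, 80000, 6\n1, John Doe, 28, Engineering, 75000, 3\n"


demo_prompt = "Example:\n1, John Doe, 28, Engineering, 75000, 3\n"
rows = collect_new_rows(fake_generate, demo_prompt, n_calls=2)
print(rows)
```

With the real model, you could pass something like lambda p: generator(p, max_new_tokens=200)[0]['generated_text'] as generate_fn. Repeated calls only add variety when sampling is enabled.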

Step 3: Parse the Generated Text into a DataFrame

The generated output is a single string, so we need to extract the structured data from it. The resulting DataFrame, data, should contain the synthetic rows generated by GPT-2, structured as a table:

import pandas as pd
from io import StringIO

# extract generated text
generated_text = generated_data[0]['generated_text']

# remove the prompt portion from the output
generated_text = generated_text.replace(prompt, "").strip()

# convert to dataframe
data = pd.read_csv(StringIO(generated_text), names=["Employee ID", "Name", "Age", "Department", "Salary", "Experience"])
print(data)

    Employee ID  Name                 Age  Department    Salary      Experience
0   21           Jodi Hynes           34   Engineering   91000       12.0
1   22           Nicole Burch         35   Marketing     70000       10.0
2   23           Scott Thompson       28   Finance       40000       9.0
3   24           Julia Williams       29   Accounting    240000      11.0
4   25           Julia Jones-Smith    35   Education     80000       11.0
5   26           Lisa Wilson          36   Hired Talent  90000       14.0
6   27           Amy Fung             33   Construction  80000       16.0
7   28           Kathy Foulburn       32   Education     140000      22.0
8   29           Lisa Mathers         18   Hiring        70000       4.0
9   30           Sarah R. Johnson     30   Hiring        75000       8.0
10  31           Laura Johnson-Garré  30   Social Club   180000      NaN
11  32           Jannice Taylor       21   Hiring        80000       NaN
12  33           Mary T. Johnson      27   Engineering   70000       NaN
13  34           David Johnson        25   Hiring        75000       10.0
14  35           Paul Johnson         25   Hiring        85000       9.0
15  36           Ann J. Johnson       21   Hiring        90000       11.0
16  37           Jill K. Johnson      24   Management    70000       4.0
17  38           Ann Johnson          21   Hiring        100000      NaN
18  39           Laura Johnson        19   Engineering   70000       NaN
19  40           Jennifer Johnson     20   Technology    70000       NaN
20  41           John R. Johnson      36   Hiring        75000 0=No  NaN

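This step transforms the raw text into a usable format for analysis or machine learning. Note that the output above is not clean: several rows have a missing Experience value (NaN), departments such as "Hiring" and "Social Club" never appear in the prompt, and one Salary field contains stray text ("0=No"). Rows like these should be filtered or corrected before the synthetic data is used for training.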
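As a minimal sketch of screening generated rows before using them, the snippet below keeps only lines that have six fields, numeric values, and a department from the example prompt. The function name parse_valid_rows, the allowed-department set, and the age bounds are assumptions chosen for illustration. The sample lines are modeled on the output above and are not the exact raw text from the model.

```python
import csv
from io import StringIO

COLUMNS = ["Employee ID", "Name", "Age", "Department", "Salary", "Experience"]
# Assumption: accept only departments that appear in the example prompt
ALLOWED_DEPARTMENTS = {"Engineering", "Marketing", "HR", "Finance", "Sales"}


def parse_valid_rows(text: str) -> list:
    """Return a list of dicts for rows that pass basic format and range checks."""
    valid = []
    for fields in csv.reader(StringIO(text), skipinitialspace=True):
        fields = [f.strip() for f in fields]
        if len(fields) != len(COLUMNS):
            continue  # missing or extra fields
        emp_id, name, age, dept, salary, exp = fields
        if not all(v.isdigit() for v in (emp_id, age, salary, exp)):
            continue  # non-numeric value where a whole number is expected
        if dept not in ALLOWED_DEPARTMENTS:
            continue  # department not seen in the prompt
        if not 18 <= int(age) <= 70:  # assumed plausible age range
            continue
        valid.append({
            "Employee ID": int(emp_id),
            "Name": name,
            "Age": int(age),
            "Department": dept,
            "Salary": int(salary),
            "Experience": int(exp),
        })
    return valid


# Illustrative lines modeled on the generated output above
sample_text = """21, Jodi Hynes, 34, Engineering, 91000, 12
29, Lisa Mathers, 18, Hiring, 70000, 4
31, Laura Johnson, 30, Sales, 180000
41, John R. Johnson, 36, HR, 75000 0=No, 3
"""

valid_rows = parse_valid_rows(sample_text)
print(valid_rows)
```

The resulting list of dictionaries can be passed to pd.DataFrame(valid_rows) to get a cleaned table. Stricter checks, such as salary ranges or experience that is consistent with age, can be added in the same way.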

Summary

So, data augmentation is a fundamental technique in machine learning used to expand and diversify datasets by generating synthetic data. In this article, we saw how an LLM can generate synthetic tabular data to augment a small dataset. I hope you liked this article on data augmentation using LLMs. Feel free to ask questions in the comments section below. You can follow me on Instagram for many more resources.

Aman Kharwal

AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.
