Code Generation Model using LLMs

A code generation model is a type of artificial intelligence that can automatically generate source code from a given input, which can be natural language instructions, existing code snippets, or structured data. If you want to learn how to build one, this article is for you: I’ll take you through the task of building a code generation model with LLMs using Python.

How to Build a Code Generation Model using LLMs?

To build a code generation model using Large Language Models (LLMs), we can start by collecting a large, diverse dataset of code examples and related documentation across various programming languages. Then, preprocess this data to ensure quality and consistency. In the end, fine-tune a pre-trained LLM on the dataset to specialize it in understanding and generating code; here we will use an openly available code model from Salesforce.

To build a code generation model using LLMs, we need plenty of code snippets, which we can collect from GitHub through the GitHub API. So, before proceeding with the task of building a code generation model, I recommend you create a GitHub account and generate a personal access token (a short authentication sketch follows the list below). Here’s the process you can follow:

  • Go to GitHub Settings → Developer settings → Personal access tokens.
  • Click on “Generate new token”.
  • Select the necessary scopes (at least the repo scope to access repositories).
  • Generate the token and copy it.
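
Hardcoding a token into a script makes it easy to leak by accident. Here’s a minimal sketch of a safer setup, usable once PyGithub is installed (see the pip commands below), assuming you store the token in an environment variable called GITHUB_TOKEN (that variable name is my convention, not a GitHub requirement):

import os
from github import Github

# read the personal access token from an environment variable
# (set it first, e.g. export GITHUB_TOKEN=ghp_... in your shell)
token = os.environ.get("GITHUB_TOKEN")
if token is None:
    raise RuntimeError("Set the GITHUB_TOKEN environment variable first.")

# initialize the PyGithub client and confirm the token works
g = Github(token)
print("Authenticated as:", g.get_user().login)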

Please feel free to reach out to me on Instagram or LinkedIn if you run into any issues while generating a token.

Code Generation Model using LLMs

Now, let’s get started with the task of building a Code Generation model using LLMs. Before proceeding, here are the commands to install the libraries you will need if this is your first time working with LLMs:

  • pip install transformers datasets
  • pip install transformers[torch] accelerate -U

And, as we are using GitHub in this task to collect code snippets, run this command as well in your command prompt before getting started:

  • pip install PyGithub datasets

Now, let’s get started by collecting Python code snippets from GitHub to build a code generation model:

from github import Github
import re
from datasets import Dataset

# initialize PyGithub with the GitHub token
g = Github("Your GitHub Token")

# specify the repository
repo = g.get_repo("openai/gym")

# function to extract Python functions from a script
def extract_functions_from_code(code):
    pattern = re.compile(r"def\s+(\w+)\s*\(.*\):")
    functions = pattern.findall(code)
    return functions

# fetch Python files from the repository
python_files = []
contents = repo.get_contents("")
while contents:
    file_content = contents.pop(0)
    if file_content.type == "dir":
        contents.extend(repo.get_contents(file_content.path))
    elif file_content.path.endswith(".py"):
        python_files.append(file_content)

# extract functions and create the dataset
# note: each row stores the full file source alongside one function name,
# so a file with several functions appears once per function
data = {"code": [], "function_name": []}
for file in python_files:
    code = file.decoded_content.decode("utf-8")
    functions = extract_functions_from_code(code)
    for function in functions:
        data["code"].append(code)
        data["function_name"].append(function)

# create a Hugging Face dataset
dataset = Dataset.from_dict(data)

# save the dataset to disk
dataset.save_to_disk("code_generation_dataset")

print("Dataset created and saved to disk.")
[Screenshot: the type of files you will receive in the code_generation_dataset folder]

In the above code, we are initializing a GitHub client with a personal access token, specifying the “openai/gym” repository, and defining a function to extract Python function definitions from the code. We are then iterating over the contents of the repository to collect Python files, extracting function definitions from each file, and storing them in a dataset. In the end, we are creating a Hugging Face dataset from the extracted data and saving it to disk, which will allow us to use this dataset for tasks such as training or fine-tuning a code generation model.
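
Before fine-tuning, it’s worth confirming the dataset was written correctly. Here’s a quick sanity check; it simply reloads the folder we just saved and prints a sample:

from datasets import load_from_disk

# reload the saved dataset and inspect a sample
dataset = load_from_disk("code_generation_dataset")
print(dataset)                       # columns and number of rows
print(dataset[0]["function_name"])   # first extracted function name
print(dataset[0]["code"][:200])      # first 200 characters of its source file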

Now, we will use a pre-trained LLM from Salesforce (CodeGen-350M-mono) and fine-tune it on our dataset for the task of code generation:

from datasets import load_from_disk
from transformers import (AutoTokenizer, AutoModelForCausalLM, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

# load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

# CodeGen has no pad token by default, so reuse the eos token for padding
tokenizer.pad_token = tokenizer.eos_token

# load the dataset
dataset = load_from_disk("code_generation_dataset")

# split the dataset into training and test sets
dataset = dataset.train_test_split(test_size=0.1)

# preprocess the dataset: tokenize each code sample, truncating and padding
# to a fixed length (512 tokens) to keep memory usage manageable
def preprocess_function(examples):
    return tokenizer(examples['code'], truncation=True,
                     padding='max_length', max_length=512)

In the above code, we are initializing the tokenizer and model for code generation using a pre-trained model from Salesforce, and setting the pad token to ensure proper input formatting. We are then loading our previously saved dataset of code snippets from disk, splitting it into training and test sets to create a validation framework, and defining a preprocessing function to tokenize the code examples to ensure they are appropriately truncated and padded to a consistent length for model training.
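
To see what the preprocessing produces, you can run it on a single toy snippet (the add function below is a made-up example, purely for illustration):

# tokenize one toy example and inspect the output
sample = preprocess_function({"code": ["def add(a, b):\n    return a + b"]})
print(sample.keys())                # input_ids and attention_mask
print(len(sample["input_ids"][0]))  # 512, the fixed padded length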

The above step prepares the data for efficient fine-tuning of the code generation model. And now, here’s how to fine-tune the model:

tokenized_datasets = dataset.map(preprocess_function, batched=True)

# fine-tune the model
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    save_steps=10_000,
    save_total_limit=2,
)

# the data collator copies each batch's input_ids into labels so the
# Trainer can compute the causal language modelling loss during fine-tuning
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator
)

trainer.train()

In the above code, we tokenize the dataset by applying the preprocessing function in batches, which prepares the data for training. We then define the training arguments, specifying parameters such as the output directory, batch size, number of training epochs, and checkpoint saving strategy. The data collator turns each batch of input IDs into matching labels, which is what lets the Trainer compute a causal language modelling loss; without it, training would fail because the model would receive no labels. With these settings, we initialize a Trainer object with the model, training arguments, data collator, and the tokenized training and evaluation datasets. Finally, we start the fine-tuning process by calling the train method on the Trainer object, which trains the model on the prepared dataset to improve its performance on code generation tasks.
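
Once training finishes, it’s a good idea to persist the fine-tuned weights so you don’t have to retrain next time. A minimal sketch (the directory name fine_tuned_codegen is my choice, nothing special):

# save the fine-tuned model and tokenizer to a local directory
trainer.save_model("fine_tuned_codegen")
tokenizer.save_pretrained("fine_tuned_codegen")

# later, both can be reloaded from that same directory:
# model = AutoModelForCausalLM.from_pretrained("fine_tuned_codegen")
# tokenizer = AutoTokenizer.from_pretrained("fine_tuned_codegen")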

The training step will take time, depending on the computing power of your system. After it completes, here’s how we can test our code generation model:

# define a function to generate code using the fine-tuned model
def generate_code(prompt, max_length=100):
    # tokenize the prompt and move it to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(inputs['input_ids'],
                             attention_mask=inputs['attention_mask'],
                             max_length=max_length,
                             pad_token_id=tokenizer.eos_token_id)
    generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_code

# test the model with a code generation prompt
prompt = "def merge_sort(arr):"
generated_code = generate_code(prompt)

print("Generated Code:")
print(generated_code)
Generated Code:
def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])
    return merge(left, right)

def merge(left, right):
    result = []
    while left and right:
        if left[0] < right[0]:
In the above code, we are defining a function generate_code that takes a code prompt and uses the fine-tuned model to generate a continuation of the code. By tokenizing the input prompt and passing it to the model, we are generating a sequence of tokens up to a specified maximum length. These tokens are then decoded back into a readable string of code.
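
Note that generate uses greedy decoding by default, which always picks the most likely next token and can therefore loop or repeat itself. For more varied completions, you can sample instead; here’s a sketch, with the prompt and hyperparameters chosen purely for illustration:

# sample a completion instead of decoding greedily
inputs = tokenizer("def quick_sort(arr):", return_tensors="pt").to(model.device)
outputs = model.generate(inputs["input_ids"],
                         attention_mask=inputs["attention_mask"],
                         max_length=120,
                         do_sample=True,    # sample from the distribution
                         temperature=0.8,   # soften next-token probabilities
                         top_p=0.95,        # nucleus sampling
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))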

So, this is how we can build a code generation model using LLMs.

Summary

A code generation model is a type of artificial intelligence that can automatically generate source code based on a given input, which can be natural language instructions, existing code snippets, or structured data. I hope you liked this article on building a code generation model using LLMs. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.
