Large Language Models (LLMs): A Dive into Fine-Tuning with Practical Examples
In this article, we’ll explore the process of fine-tuning a pre-trained LLM. We’ll begin by introducing essential concepts and techniques for fine-tuning and conclude with a specific example demonstrating how to fine-tune a model locally using Python and Hugging Face’s software ecosystem.
Concept of Fine Tuning
In the context of LLMs, fine-tuning means taking a pre-trained model and training at least one of its internal parameters (i.e., the weights). This typically converts a general-purpose base model (such as GPT-3) into a specialized model for a particular use case.
A fair question is: why fine-tune these models at all? Can’t we just use them as they are?
Fine-tuning can make a smaller model outperform a much larger one on a specific task. OpenAI demonstrated this with InstructGPT, where a fine-tuned model roughly 100x smaller than GPT-3 produced outputs users preferred. Because task knowledge is baked into the weights, fine-tuned models also need less context in the prompt, which reduces inference costs.
Supervised Learning to Fine-tune an LLM
This section focuses on supervised learning for fine-tuning language models. We’ll delve into a high-level process, emphasizing the crucial steps involved.
- Choose a fine-tuning task — Identify the task you want the model to perform, whether it’s summarization, question answering, or text classification.
- Prepare the training dataset — Create a training dataset of input-output pairs (typically 100–10k examples) and preprocess the data by tokenizing, truncating, and padding the text (see the short sketch after this list).
- Choose a base model — Experiment with different models and select the one that performs best for your task.
- Fine-tune with supervised learning — Execute the fine-tuning process using supervised learning techniques.
- Evaluate model performance — Assess the performance of your fine-tuned model.
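To make step 2 concrete, here is a minimal sketch of what a labeled training dataset can look like using Hugging Face’s datasets library. The toy reviews and the split below are purely illustrative; the worked example later in this article loads a ready-made dataset instead.

from datasets import Dataset

# toy input-output pairs for a sentiment-classification task (hypothetical examples)
raw_pairs = {
    "text": ["Loved every minute of it.", "Terrible pacing and a weak ending.",
             "A solid, enjoyable watch.", "I want those two hours back."],
    "label": [1, 0, 1, 0],  # 1 = positive, 0 = negative
}

# wrap the pairs in a Dataset and carve out a small validation split
toy_dataset = Dataset.from_dict(raw_pairs).train_test_split(test_size=0.25, seed=42)
print(toy_dataset)  # DatasetDict with "train" and "test" splits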
3 Options for Parameter Training
When it comes to training a model with ~100M–100B parameters, there are three generic options to consider.
Option 1: Retrain All Parameters
Train all internal model parameters, a computationally expensive but conceptually simple approach.
Option 2: Transfer Learning
Preserve useful representations/features from past training while adapting to a new task.
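To make the difference between the first two options concrete, here is a minimal sketch using a Hugging Face sequence-classification model. The model name is just an example, and base_model is the generic attribute transformers exposes for the underlying network.

from transformers import AutoModelForSequenceClassification

# load a small pre-trained model with a fresh classification head
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Option 1: full fine-tuning -- every parameter stays trainable (the default for a freshly loaded model)
for param in model.parameters():
    param.requires_grad = True

# Option 2: transfer learning -- freeze the pre-trained transformer and train only the new head
for param in model.base_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters after freezing the base model: {trainable:,}")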
Option 3: Parameter Efficient Fine-tuning (PEFT)
With PEFT, a base model is augmented with a relatively small set of trainable parameters while the original weights stay frozen. The key result is a fine-tuning approach that performs on par with full parameter tuning at a tiny fraction of the computational and storage cost.
PEFT is actually a family of methods, among which the most widely used is LoRA (Low-Rank Adaptation). LoRA’s basic idea is to pick a subset of layers in an existing model and modify their weights according to the following equation.
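In the formulation from the original LoRA paper (Hu et al., 2021), a pre-trained weight matrix W_0 is kept frozen and augmented with a low-rank update, so a layer’s output h becomes

h = W_0 x + ΔW x = W_0 x + B A x

where W_0 (d×k) stays frozen and only the much smaller matrices B (d×r) and A (r×k), with rank r ≪ min(d, k), are trained. Because r is small, the number of trainable parameters added per layer is tiny compared to the original weight matrix.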
Let’s see how we can use LoRA to fine-tune a language model efficiently enough to run on a personal computer.
Example Code: Fine-tuning an LLM using LoRA
In this example, we will fine-tune a language model to classify text as either “positive” or “negative” using the Hugging Face ecosystem. Here, we fine-tune distilbert-base-uncased, a BERT-based model with about 67 million parameters. Because the base model was trained for language modeling rather than classification, we use transfer learning to swap out its head for a classification head. We also use LoRA to fine-tune the model efficiently enough that it can run in Google Colab.
Imports
We start by importing helpful libraries and modules. Datasets, transformers, peft, and evaluate are all libraries from Hugging Face (HF).
from datasets import load_dataset, DatasetDict, Dataset
from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer)
from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig
import evaluate
import torch
import numpy as np
Base model
Next, we load in our base model. The base model here is a relatively small one, but there are several other (larger) ones that we could have used (e.g. roberta-base, llama2, gpt2). A full list of available models can be found on the Hugging Face Hub.
model_checkpoint = 'distilbert-base-uncased'
# define label maps
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative": 0, "Positive": 1}

# generate classification model from model_checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=2, id2label=id2label, label2id=label2id)
Load data
We can then load our training and validation data from HF’s datasets library. This is a dataset of 2,000 movie reviews (1,000 for training and 1,000 for validation) with binary labels indicating whether each review is positive or negative.
# load dataset
dataset = load_dataset("shawhin/imdb-truncated")
dataset
# dataset =
# DatasetDict({
# train: Dataset({
# features: ['label', 'text'],
# num_rows: 1000
# })
# validation: Dataset({
# features: ['label', 'text'],
# num_rows: 1000
# })
# })
Preprocess data
Next, we need to preprocess our data so that it can be used for training. This consists of using a tokenizer to convert the text into an integer representation understood by the base model.
# create tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)
To apply the tokenizer to the dataset, we use the .map() method. This takes in a custom function that specifies how the text should be preprocessed. In this case, that function is called tokenize_function(). In addition to converting text to integers, this function truncates the token sequences so that they are no longer than 512 tokens, which conforms to the base model’s maximum input length.
# create tokenize function
def tokenize_function(examples):
    # extract text
    text = examples["text"]

    # tokenize and truncate text
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )
    return tokenized_inputs

# add pad token if none exists
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

# tokenize training and validation datasets
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset

# tokenized_dataset =
# DatasetDict({
# train: Dataset({
# features: ['label', 'text', 'input_ids', 'attention_mask'],
# num_rows: 1000
# })
# validation: Dataset({
# features: ['label', 'text', 'input_ids', 'attention_mask'],
# num_rows: 1000
# })
# })
At this point, we can also create a data collator, which will dynamically pad examples in each batch during training such that they all have the same length. This is computationally more efficient than padding all examples to be equal in length across the entire dataset.
# create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
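As a quick sanity check (not part of the original pipeline), you can call the collator on two tokenized examples of different lengths and confirm that they come back padded to a common length:

# sanity check: the collator pads a small batch only up to the longest example in that batch
sample_batch = data_collator([tokenizer("It was good."),
                              tokenizer("This is not worth watching even once.")])
print(sample_batch["input_ids"].shape)  # both rows padded to the length of the longer example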
Evaluation metrics
We can define how we want to evaluate our fine-tuned model via a custom function. Here, we define the compute_metrics() function to compute the model’s accuracy.
# import accuracy evaluation metric
accuracy = evaluate.load("accuracy")
# define an evaluation function to pass into trainer later
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)

    return {"accuracy": accuracy.compute(predictions=predictions, references=labels)}
Untrained model performance
Before training our model, we can evaluate how the base model with a randomly initialized classification head performs on some example inputs.
# define list of examples
text_list = ["It was good.", "Not a fan, don't recommed.",
"Better than the first one.", "This is not worth watching even once.",
"This one is a pass."]
print("Untrained model predictions:")
print("----------------------------")
for text in text_list:
    # tokenize text
    inputs = tokenizer.encode(text, return_tensors="pt")
    # compute logits
    logits = model(inputs).logits
    # convert logits to label
    predictions = torch.argmax(logits)

    print(text + " - " + id2label[predictions.tolist()])

# Output:
# Untrained model predictions:
# ----------------------------
# It was good. - Negative
# Not a fan, don't recommed. - Negative
# Better than the first one. - Negative
# This is not worth watching even once. - Negative
# This one is a pass. - Negative
As expected, the model performance is equivalent to random guessing. Let’s see how we can improve this with fine-tuning.
Fine-tuning with LoRA
To use LoRA for fine-tuning, we first need a config file. This sets all the parameters for the LoRA algorithm. See comments in the code block for more details.
peft_config = LoraConfig(task_type="SEQ_CLS", # sequence classification
                         r=4, # intrinsic rank of trainable weight matrix
                         lora_alpha=32, # scaling factor for the LoRA update (plays a role similar to a learning rate)
                         lora_dropout=0.01, # probability of dropout
                         target_modules=['q_lin']) # we apply LoRA to the query layers only
We can then create a new version of our model that can be trained via PEFT. Notice that only about 2% of the parameters are trainable, roughly a 50x reduction.
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# trainable params: 1,221,124 || all params: 67,584,004 || trainable%: 1.8068239934408148
Next, we define hyperparameters for model training.
# hyperparameters
lr = 1e-3 # size of optimization step
batch_size = 4 # number of examples processed per optimization step
num_epochs = 10 # number of times model runs through training data
# define training arguments
training_args = TrainingArguments(
    output_dir=model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
Finally, we create a Trainer object and fine-tune the model!
# create trainer object
trainer = Trainer(
    model=model, # our peft model
    args=training_args, # hyperparameters
    train_dataset=tokenized_dataset["train"], # training data
    eval_dataset=tokenized_dataset["validation"], # validation data
    tokenizer=tokenizer, # define tokenizer
    data_collator=data_collator, # this will dynamically pad examples in each batch to be equal length
    compute_metrics=compute_metrics, # evaluates model using compute_metrics() function from before
)
# train model
trainer.train()
The above code will generate the following table of metrics during training.
Model training metrics. Image by author.
Trained model performance
To see how the model performance has improved, let’s apply it to the same 5 examples from before.
model.to('mps') # moving to mps for Mac (can alternatively do 'cpu')
print("Trained model predictions:")
print("--------------------------")
for text in text_list:
    inputs = tokenizer.encode(text, return_tensors="pt").to("mps") # moving to mps for Mac (can alternatively do 'cpu')
    logits = model(inputs).logits
    predictions = torch.max(logits, 1).indices

    print(text + " - " + id2label[predictions.tolist()[0]])

# Output:
# Trained model predictions:
# ----------------------------
# It was good. - Positive
# Not a fan, don't recommed. - Negative
# Better than the first one. - Positive
# This is not worth watching even once. - Negative
# This one is a pass. - Positive # this one is tricky
The fine-tuned model improved significantly from its prior random guessing, correctly classifying all but one of the examples in the above code. This aligns with the ~90% accuracy metric we saw during training.
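If you want to confirm that number outside of the training loop, the same Trainer object can be reused for a standalone evaluation pass; here is a minimal sketch using the objects defined above.

# optional: re-run evaluation on the validation split; the reported metrics include the
# accuracy produced by compute_metrics()
eval_metrics = trainer.evaluate(eval_dataset=tokenized_dataset["validation"])
print(eval_metrics)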
Conclusions
Even though fine-tuning an existing model requires more computational resources and technical expertise than using one out of the box, (smaller) fine-tuned models can outperform (larger) pre-trained base models for a given use case, even when the larger models are steered with clever prompt-engineering tactics. Moreover, thanks to the abundance of open-source LLM resources available, it has never been easier to fine-tune a model for a custom application.
The full code for the above example is available in the GitHub repo.