GENERATIVE AI

LoRA: Linear Algebra Under the Hood

Key Matrix Concepts behind LoRA-based Fine-Tuning of LLMs

Kaush B
5 min read · Dec 31, 2023



As GenAI and LLMs surge in popularity, it is reassuring to find good old linear algebra playing a pivotal role in fine-tuning LLMs (Large Language Models). LoRA (Low-Rank Adaptation) is one of the most popular PEFT (Parameter-Efficient Fine-Tuning) techniques for fine-tuning LLMs. This article explains, in very simple terms, how LoRA works and why it is crucial for affordable Transformer fine-tuning.

Why LoRA?

LoRA learns low-rank matrix decompositions to slash the cost of adapting large language models. It trains only low-rank factors instead of entire weight matrices, achieving major memory and performance wins. Please feel free to refer to the original paper, but if you want an explanation that is easier to visualize and comprehend, read on.

Basics first!

What is fine-tuning?

Fine-tuning is the process wherein we pass data through a pre-trained deep learning network, calculate delta weights (i.e., weight updates via back-propagation), and combine the delta weights with the base weights to get new weights. We keep repeating this until we are satisfied with the result (i.e., the error is low enough). It is very similar to the feedback loop in a control system. Below is what a single fine-tuning step looks like, where h = W'x.

Figure 1. Simplistic representation of fine-tuning
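To make that loop concrete, here is a minimal sketch of one such step in PyTorch. The layer size, learning rate, and the mean-squared-error loss are illustrative assumptions, not details from the paper.

```python
import torch

# Sketch of one fine-tuning step on a single linear layer,
# assuming PyTorch and a made-up regression loss (illustration only).
W = torch.nn.Parameter(torch.randn(100, 100))   # pre-trained weights
x = torch.randn(32, 100)                        # a batch of inputs
target = torch.randn(32, 100)                   # dummy targets

h = x @ W.T                                     # forward pass: h = W'x (W' = W before the first update)
loss = torch.nn.functional.mse_loss(h, target)
loss.backward()                                 # back-propagation produces the gradient

lr = 1e-3
with torch.no_grad():
    delta_W = -lr * W.grad                      # the "delta weights" for this step
    W += delta_W                                # combine delta weights with the base weights
    W.grad.zero_()
```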

We can also represent it slightly differently. Assume that we have frozen pre-trained weights (W) and frozen inputs (x), and all the trainable parameters are in the delta weights (highlighted in green below). We calculate the delta weights and the hidden-layer activation gets updated as before. But now we track the delta weights separately instead of combining them with the base weights, because LoRA takes advantage of this representation. In mathematical terms, h = W'x = (W + ΔW)x = Wx + ΔWx.

Figure 2. Alternative representation of fine-tuning for easier explanation of LoRA
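A quick numerical check of that identity, as a sketch with random matrices just to show that the two forms agree:

```python
import torch

# Sketch: keeping delta weights separate gives the same activations
# as folding them into the base weights, since (W + ΔW)x = Wx + ΔWx.
W = torch.randn(100, 100)        # frozen pre-trained weights
delta_W = torch.randn(100, 100)  # trainable delta weights
x = torch.randn(100)             # frozen input

h_combined = (W + delta_W) @ x   # combine first, then multiply
h_separate = W @ x + delta_W @ x # track the delta term separately

print(torch.allclose(h_combined, h_separate, atol=1e-5))  # True
```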

Let’s take a pause here and understand the basics of matrix decomposition before we delve deeper into LoRA.

What is matrix decomposition?

Let's assume that the delta weights are represented by a matrix ΔW with dimensions A x B. The key claim in the LoRA paper is that pre-trained language models have a low "intrinsic dimension". That is, the weight updates can be represented almost as accurately using far fewer dimensions (i.e., there is a lot of intrinsic redundancy we can get rid of without losing much).

The LoRA paper then hypothesizes that ΔW will also have a low intrinsic rank during adaptation. Now, the rank of a matrix isn't necessarily the same as its dimensions. Rather, it equals the number of linearly independent rows or columns. That is why we can represent the large matrix of delta weights as the product of two much smaller matrices through a process called matrix decomposition, where ΔW = WA x WB, as shown below.

Figure 3. Matrix decomposition (adapted from the works of Chris Alexiuk)

ΔW has dimensions A x B; WA has dimensions A x r; WB has dimensions r x B. Here r is the rank used during adaptation, and the LoRA hypothesis is that r << min(A, B) is sufficient.
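Here is the same decomposition in code. The sizes A = 100, B = 200, r = 4 are illustrative assumptions chosen just to show the shapes and the resulting rank.

```python
import torch

# Sketch of the low-rank factorization ΔW = WA x WB,
# with illustrative sizes (not values from the paper).
A, B, r = 100, 200, 4
WA = torch.randn(A, r)        # A x r
WB = torch.randn(r, B)        # r x B
delta_W = WA @ WB             # A x B, but rank at most r

print(delta_W.shape)                      # torch.Size([100, 200])
print(torch.linalg.matrix_rank(delta_W))  # tensor(4)
```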

How does LoRA work?

LoRA treats r as a hyper-parameter, and the paper empirically shows that fine-tuning with various low values of r achieves accuracy very close to that of full fine-tuning. The diagram below shows the LoRA representation of fine-tuning.

Figure 4. LoRA representation of fine-tuning
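To make Figure 4 concrete, below is a minimal PyTorch sketch of a LoRA-adapted linear layer. The class name LoRALinear, the initialization details, and the alpha value are this article's illustrative choices, not code from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer: h = Wx + (WA @ WB)x, with W frozen."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                        # freeze the pre-trained weights W
        self.W_A = nn.Parameter(torch.zeros(out_features, r))         # A x r, zero-initialized so ΔW starts at 0
        self.W_B = nn.Parameter(torch.randn(r, in_features) * 0.01)   # r x B, small random initialization
        self.scaling = alpha / r                                      # alpha/r scaling, as in the paper

    def forward(self, x):
        delta_W = self.W_A @ self.W_B                                 # ΔW = WA x WB (never stored as a trainable full matrix)
        return self.base(x) + x @ delta_W.T * self.scaling            # h = Wx + ΔWx

layer = LoRALinear(100, 100, r=5)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 1000 = (100 x 5) + (5 x 100), versus 10,000 in the full delta-weight matrix
```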

That’s all it is!

But wait! Why does it matter?

You might wonder: we are just representing the same thing in a different way! So where do the compute savings come from?

Lower compute requirements at training time

Let’s assume that ΔW has dimensions 100 x 100 (i.e., 10,000 elements). Let’s also assume that r is 5. Therefore, WA has dimensions 100 x 5 (i.e., 500 elements); WB has dimensions 5 x 100 (i.e., 500 elements).

So we now have to store and process only 500 + 500 = 1,000 elements instead of 10,000, which means we do not have to train the remaining 9,000 elements. That is a huge 90% reduction in trainable parameters during training, without sacrificing much accuracy!
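The arithmetic generalizes: for an A x B matrix and rank r, the trainable-element count drops from A*B to r*(A + B). A quick check of the numbers above:

```python
# Sketch: parameter savings for a low-rank update of an A x B weight matrix.
def lora_savings(A, B, r):
    full = A * B               # elements in the full delta-weight matrix
    low_rank = r * (A + B)     # elements in WA (A x r) plus WB (r x B)
    return full, low_rank, 1 - low_rank / full

print(lora_savings(100, 100, 5))   # (10000, 1000, 0.9) -> 90% fewer trainable elements
```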

Does it add to inference latency?

You might notice in Figure 4 that at inference time there is a bit of inefficiency (we need to perform the addition as the last step after the data flows through the matrices), which can add some inference latency. But we can simply merge these updates into the pre-trained weights before inference, which leads to zero additional inference latency.
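The merge is a one-time matrix addition before serving. A sketch with random matrices, just to show the merged and unmerged forms produce the same output:

```python
import torch

# Sketch: folding the learned low-rank update into the frozen weights before inference,
# so the deployed layer is a single matrix multiply again (no extra addition when serving).
A, B, r = 100, 100, 5
W  = torch.randn(A, B)      # frozen pre-trained weights
WA = torch.randn(A, r)      # learned low-rank factors
WB = torch.randn(r, B)

W_merged = W + WA @ WB      # one-time merge: W' = W + ΔW

x = torch.randn(B)
h_unmerged = W @ x + (WA @ WB) @ x
h_merged   = W_merged @ x
print(torch.allclose(h_unmerged, h_merged, atol=1e-4))   # True: same output, one matmul fewer
```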

That’s just the best of both worlds!

But how does it apply to fine-tuning LLMs?

Attention is all you need!

All LLMs (Large Language Models) follow a Transformer-based architecture. The process above applies to any weight matrix, and hence to LLMs too. In fact, LLMs benefit from LoRA enormously because they have billions of trainable parameters, so any compute saving translates into huge benefits!

The LoRA paper proposes to adapt the attention weights in LLM Transformers, which account for a large chunk of all the parameters in a Transformer. In fact, the paper applies LoRA only to the query (q) and value (v) projection matrices in the attention layers. Everything else stays frozen, so we do not need optimizer states for those parameters. We just need those small injected matrix pairs.
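In practice this is usually done with an adapter library rather than by hand. A sketch using Hugging Face's peft library might look like the following; the base model "gpt2" is only an example, and the target module names vary by architecture (GPT-2 fuses q/k/v into one "c_attn" module, while Llama-style models expose "q_proj" and "v_proj").

```python
# Sketch: injecting LoRA only into the attention projections,
# assuming Hugging Face's transformers + peft libraries.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")   # illustrative base model

config = LoraConfig(
    r=8,                           # the low rank, a hyper-parameter
    lora_alpha=16,
    target_modules=["c_attn"],     # check your model's attention module names
    lora_dropout=0.05,
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a tiny fraction of the full model is trainable
```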

The paper empirically shows, for GPT-3 175B, a reduction in training memory from 1.2TB to 350GB and roughly a 25% training speed-up while retaining accuracy comparable to full fine-tuning, with a small r (such as 8) applied to the attention weights of all 96 layers (so that every layer or block keeps a similar structure).

One more cool thing.

Imagine a situation where we have one foundation model and 10 different downstream tasks. With LoRA, we only have to swap the small matrix pairs for different downstream tasks while keeping the foundation model frozen. These swaps can even be done at inference time. That means customers can choose the downstream task at inference time without us having to deploy a zoo of 10 differently fine-tuned copies of the foundation model. LoRA allows this plug-and-play of task-specific small matrices at inference time while sharing the same foundation model, which is a huge cost saving without imposing a significant inference penalty!
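A sketch of that plug-and-play idea, keeping one frozen base weight matrix and swapping per-task low-rank pairs at request time (the task names and shapes are made up for illustration):

```python
import torch

# Sketch: one shared frozen weight matrix, many small task-specific adapters.
A, B, r = 100, 100, 5
W = torch.randn(A, B)                                   # shared frozen foundation weights

adapters = {                                            # per-task low-rank pairs (WA, WB)
    "summarization": (torch.randn(A, r), torch.randn(r, B)),
    "sentiment":     (torch.randn(A, r), torch.randn(r, B)),
}

def forward(x, task):
    WA, WB = adapters[task]                             # swap in the task's small matrices
    return W @ x + WA @ (WB @ x)                        # same base model, different adaptation

x = torch.randn(B)
print(forward(x, "summarization").shape, forward(x, "sentiment").shape)
```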
