Jan 13, 2024

Unveiling LoRA 🚀: Fine-tuning Neural Networks with Low-Rank Adaptation

Introduction 🌐

Hey, folks! Welcome to this exploration of a groundbreaking technique: LoRA, or Low-Rank Adaptation of large language models. LoRA emerged from the labs at Microsoft around two years ago, and today we'll uncover its mysteries: why it matters and how it works. So, let's dive into the realm of language models and explore the wonders of LoRA.

Fine-Tuning and the Challenges 🤔

Before we embark on our LoRA journey, let's grasp why we need such an adaptation. Neural networks, with their inputs, hidden layers, and outputs, form the backbone of deep learning. During training, the weights of these layers are adjusted.
The above diagram shows how fine-tuning is done in a standard neural network setting. The hidden layers are typically represented by matrices. During training, we compare the network's output with a target to compute a loss. This loss is then backpropagated through all layers, and the resulting gradients are used to adjust the weights and biases. For instance, each layer has a weight matrix and a bias vector, and both are updated to reduce the loss and refine the network's performance.
Fine-tuning large pre-trained models this way is computationally challenging because it involves adjusting millions, or even billions, of parameters. This traditional approach demands substantial computational resources and time, which makes it a bottleneck when adapting these models to specific tasks.
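To make the training step described above concrete, here is a minimal NumPy sketch of one full fine-tuning update on a single linear layer. The toy sizes, the squared-error loss, and the learning rate are all assumptions chosen for illustration; they are not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 4, 6                      # toy layer sizes (assumed for illustration)
W = rng.normal(size=(d, k))      # weight matrix of the layer
b = np.zeros(d)                  # bias vector of the layer
x = rng.normal(size=k)           # one input example
target = rng.normal(size=d)      # the output we want for that input


def loss_fn(W, b):
    """Squared-error loss between the layer output and the target."""
    y = W @ x + b
    return 0.5 * np.sum((y - target) ** 2)


# Forward pass: compare the output with the target.
error = (W @ x + b) - target
loss_before = loss_fn(W, b)

# Backward pass: gradients of the loss with respect to W and b.
grad_W = np.outer(error, x)      # shape (d, k), one entry per weight
grad_b = error                   # shape (d,), one entry per bias

# Full fine-tuning: every weight and every bias gets updated.
lr = 0.01                        # assumed learning rate
W_new = W - lr * grad_W
b_new = b - lr * grad_b
loss_after = loss_fn(W_new, b_new)

updated = grad_W.size + grad_b.size
print("parameters updated:", updated)   # d*k + d = 28
print("loss went down:", loss_after < loss_before)
```

Even in this tiny layer, every one of the d*k + d parameters needs its own gradient entry. Scale d and k up to the sizes used in large models and that bookkeeping is what makes full fine-tuning expensive.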

LoRA's Solution 🛠️

Enter LoRA, a revolutionary approach to fine-tuning. Instead of modifying all the weights (W), LoRA freezes the original pre-trained model's weights (W) and introduces two new trainable matrices, B and A. There's no need to create matrices B and A for every layer of the original model; this can be done selectively for specific layers.
The key distinction between the original weight matrix W and the LoRA matrices B and A lies in their dimensions. Let's go through an example.
The above diagram shows the original weight matrix W, which has dimensions d by k. Suppose d = 1000 and k = 5000; then W has d × k = 5,000,000 entries, i.e. 5 million parameters.
LoRA introduces two new matrices, B with dimensions d by r and A with dimensions r by k, where the rank r is much smaller than d and k.
When multiplied together, B (d by r) times A (r by k) yields a new matrix with dimensions d by k, the same as the original matrix W. This ensures compatibility with the original matrix: the product can be added directly to W.
So if we take r = 1, then B has dimensions 1000 × 1 and A has dimensions 1 × 5000, giving 1000 + 5000 = 6000 parameters in total. And if we multiply them, we still get a matrix with the original dimensions.
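The arithmetic in this example is easy to check in a few lines of Python. The helper name `count_params` and the extra ranks in the loop are assumptions added for illustration; only d = 1000, k = 5000, and r = 1 come from the example itself.

```python
def count_params(d, k, r):
    """Return (full, lora): entries in W versus entries in B and A combined."""
    full = d * k             # W is d by k
    lora = d * r + r * k     # B is d by r, A is r by k
    return full, lora


d, k = 1000, 5000
full, lora = count_params(d, k, r=1)
print(full)   # 5000000
print(lora)   # 6000

# Hypothetical larger ranks, to see how the count grows with r.
for r in (1, 4, 8, 16):
    _, lora_r = count_params(d, k, r)
    print(f"r={r}: {lora_r} trainable parameters "
          f"({100 * lora_r / full:.2f}% of W)")
```

The LoRA parameter count grows linearly with r, so even at r = 16 the two matrices together hold 96,000 numbers, which is under 2% of the 5 million in W.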
You might be concerned that the new matrices B and A, being smaller, won't capture the same information as the original matrix W. While their product has the same dimensions, it is a far more compact representation. The concept behind LoRA is that a full-size matrix often carries a lot of redundancy: the change needed to adapt W to a new task can usually be described with far fewer numbers than d × k. By learning a low-rank representation with B and A, we aim to retain the essential information and still fine-tune the model effectively.
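Here is a minimal NumPy sketch of the idea so far: W stays frozen, and only B and A receive gradient updates. The toy sizes, learning rate, squared-error loss, and step count are assumptions for illustration. Starting B at zero (so that B times A is initially zero and the layer behaves exactly like the pre-trained one) follows the initialization described in the LoRA paper; the scaling factor that real implementations usually apply to the B·A term is left out to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 8, 10, 2                  # toy sizes (assumed); r is the LoRA rank
W = rng.normal(size=(d, k))         # pre-trained weights, kept frozen
W_before = W.copy()                 # copy used only to verify W never changes
A = rng.normal(size=(r, k))         # trainable, random start
B = np.zeros((d, r))                # trainable, zero start so B @ A == 0

x = rng.normal(size=k)              # one input example
target = rng.normal(size=d)         # desired output for that input


def forward(x):
    """Frozen path plus low-rank path: W x + B (A x)."""
    return W @ x + B @ (A @ x)


def loss():
    return 0.5 * np.sum((forward(x) - target) ** 2)


loss_before = loss()
lr = 0.001                          # assumed small learning rate
for _ in range(20):
    error = forward(x) - target
    grad_B = np.outer(error, A @ x)       # shape (d, r)
    grad_A = np.outer(B.T @ error, x)     # shape (r, k)
    B -= lr * grad_B                      # only B and A are updated
    A -= lr * grad_A
loss_after = loss()

trainable = A.size + B.size
print("trainable parameters:", trainable, "vs", W.size, "in W")  # 36 vs 80
print("W untouched:", np.array_equal(W, W_before))
print("loss went down:", loss_after < loss_before)
```

Note that no gradient is ever computed for W: the loop touches only the d*r + r*k numbers in B and A.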

LoRA provides key benefits:

  1. A single pre-trained model can be shared and used for many tasks by swapping in task-specific matrices A and B, which significantly reduces storage and task-switching overhead.
  2. LoRA makes training more efficient and lowers the hardware barrier to entry by up to three times when using adaptive optimizers, because gradients and optimizer states only need to be computed for the much smaller low-rank matrices.
  3. LoRA's simple design allows the trainable matrices to be merged with the frozen weights at deployment, introducing no additional inference latency compared to a fully fine-tuned model.
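The first and third benefits can be sketched in NumPy. In this illustration, the task names, the random "trained" adapters, and the `merge` helper are all hypothetical stand-ins; the point is only to show that folding B·A into W gives the same output as running the two paths separately, and that switching tasks means swapping one small pair of matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 8, 10, 2                       # toy sizes (assumed)
W = rng.normal(size=(d, k))              # shared frozen pre-trained weights

# Hypothetical adapters: random values stand in for trained (B, A) pairs.
adapters = {
    "task_a": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
    "task_b": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
}
x = rng.normal(size=k)


def merge(W, B, A):
    """Fold the low-rank update into one d by k matrix for deployment."""
    return W + B @ A


# Benefit 3: the merged layer is a single matrix multiply, same result.
B_a, A_a = adapters["task_a"]
unmerged_out = W @ x + B_a @ (A_a @ x)   # two paths, as during training
W_a = merge(W, B_a, A_a)
merged_out = W_a @ x                     # one path, as at deployment
print("same output:", np.allclose(merged_out, unmerged_out))

# Benefit 1: switch tasks by subtracting one update and adding another.
B_b, A_b = adapters["task_b"]
W_b = W_a - B_a @ A_a + B_b @ A_b
print("switched cleanly:", np.allclose(W_b, merge(W, B_b, A_b)))

# Each task stores only d*r + r*k numbers instead of a full copy of W.
per_task = B_a.size + A_a.size
print("stored per task:", per_task, "vs", W.size, "for a full copy")  # 36 vs 80
```

Because the merged matrix has the same shape as W, the deployed layer costs exactly one matrix multiply, just like the original model.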

Conclusion and Future Exploration 🚀

In conclusion, LoRA emerges as a powerful tool for fine-tuning neural networks efficiently. As you explore LoRA further, don't hesitate to experiment with different ranks and models. Share your insights in the comments!
Thanks for joining me on this exploration of LoRA. If you found this blog informative, give it a thumbs up and stay tuned for more content on machine learning and deep learning. Until next time, happy coding! 🌟
