title: Parameter-Efficient Fine-Tuning (PEFT): A Complete Guide to LoRA and QLoRA Fine-Tuning for Large Models | Daoman PythonAI
description: An in-depth look at the core techniques of parameter-efficient fine-tuning. Covers the theory and practice of PEFT methods such as LoRA, QLoRA, and Adapter for fine-tuning large models, so that large models can be fine-tuned efficiently on consumer-grade GPUs.
keywords: [PEFT, LoRA, QLoRA, parameter-efficient fine-tuning, large model fine-tuning, artificial intelligence, machine learning, deep learning, quantization, low-rank adaptation]

Parameter-Efficient Fine-Tuning (PEFT): A Complete Guide to LoRA and QLoRA Fine-Tuning for Large Models


PEFT Overview

Parameter-Efficient Fine-Tuning (PEFT) is what puts large-model fine-tuning within reach of ordinary users. The core idea is simple: keep most of the pre-trained parameters frozen and add or update only a very small fraction (roughly 0.01% to 5%) of lightweight parameters, so that a large model can quickly adapt to your specific domain or task, often with results comparable to full fine-tuning.

Why choose PEFT?

The most direct benefits are:

  1. Much lower GPU memory threshold: Tasks that originally required multiple A100/H100 GPUs can now be completed on a single RTX 3090/4090.
  2. Extremely low storage cost: Each task only needs a small parameter file of a few MB to a few dozen MB, instead of a model copy of hundreds of GB.
  3. Much shorter training cycles: Lightweight parameters converge quickly, cutting training from days to hours or even tens of minutes.
  4. Easy switching between tasks: Only one copy of the base model is kept, and each task loads its own lightweight adapter, so the model never has to be duplicated.

Pain points of full fine-tuning

Before the advent of PEFT, fine-tuning large models was almost exclusively a game for big tech companies. Several practical problems stood in the way:

  1. Enormous compute requirements: Taking the 65B-parameter LLaMA as an example, even fine-tuning in BF16 precision requires at least a cluster of 8×80GB A100 GPUs, which is out of reach for individuals and small teams.
  2. Heavy storage pressure: Each fine-tuned version is a complete model copy. A 65B model in BF16 occupies about 130 GB of disk, so a handful of tasks is enough to fill a hard drive.
  3. Risk of catastrophic forgetting: Updating all parameters directly can easily wash out the general knowledge learned during pre-training, so the model degrades noticeably on other capabilities.
  4. Chaotic version management: With a large number of complete model copies for different tasks and scenarios, managing and migrating them becomes a nightmare.
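The storage figures above can be checked with a few lines of arithmetic. This is only a rough sketch: the 50 MB adapter size and the count of ten tasks are illustrative assumptions, not measurements.

```python
# Rough storage arithmetic for keeping one fine-tuned variant per task.
# Assumptions (illustrative): 65B parameters stored in BF16 (2 bytes each),
# a hypothetical 50 MB adapter file per task, and 10 tasks.

def full_copy_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Disk size of one full model copy, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

def total_storage_gb(num_tasks: int, base_gb: float, per_task_gb: float) -> float:
    """One shared base model plus one small artifact per task."""
    return base_gb + num_tasks * per_task_gb

base = full_copy_gb(65e9)                    # one BF16 copy of a 65B model
tasks = 10
full_ft = tasks * base                       # full fine-tuning: one full copy per task
peft = total_storage_gb(tasks, base, 0.05)   # base model once + 10 adapters

print(f"one BF16 copy:      {base:.0f} GB")
print(f"10 full fine-tunes: {full_ft:.0f} GB")
print(f"base + 10 adapters: {peft:.1f} GB")
```

Under these assumptions, ten fully fine-tuned copies need about 1.3 TB, while one base model plus ten adapters stays close to the size of the base model alone.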

These pain points are what gave rise to PEFT: a family of techniques that change only a small part of the model yet deliver clear results.


LoRA: low-rank adaptation in detail

LoRA (Low-Rank Adaptation) is currently the most mainstream and stable PEFT technique. It is inspired by an observation: the weight updates a large model needs during fine-tuning tend to lie in a low-rank subspace. In other words, a complex weight change can be approximated by the product of two very small matrices.

LoRA’s core process

  1. Freeze the original model weights: All pre-trained parameters are set to non-trainable and remain unchanged.
  2. Insert lightweight low-rank matrices: In parallel with the attention layers (especially the Q/K/V projection layers) or the feed-forward layers, add two new small matrices A and B.
  3. Incremental output: The final output of a layer = the original pre-trained output + the output of the low-rank path.
  4. (Optional) Scaling factor: The low-rank output is multiplied by a scaling factor (alpha / r in the standard formulation) to control how strongly it affects the overall behavior.
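The four steps above can be sketched as a toy forward pass. This is a dependency-free illustration in plain Python, not how the PEFT library implements it; the helper names (`matvec`, `lora_forward`) and the tiny dimensions are made up for the example. It assumes the usual LoRA formulation h = W0·x + (alpha / r)·B·(A·x), with A initialized to small random values and B to zeros.

```python
# Toy LoRA forward pass on plain Python lists (illustration only).
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x, alpha, r):
    """h = W0 x + (alpha / r) * B (A x); W0 stays frozen, only A and B would be trained."""
    base = matvec(W0, x)               # frozen pre-trained path
    delta = matvec(B, matvec(A, x))    # low-rank path: d_in -> r -> d_out
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 4     # hypothetical tiny sizes
W0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d_out)]                                  # zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

h = lora_forward(W0, A, B, x, alpha, r)
print(h == matvec(W0, x))  # True: with B = 0 the adapted layer matches the original
```

Because B starts at zero, the low-rank path contributes nothing at the beginning of training, so fine-tuning starts exactly from the pre-trained model's behavior.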

Intuitive understanding: **The original model is like an experienced master craftsman. LoRA only adds a "flexible assistant" at a few key points, so the model can adapt quickly to new tasks without retraining the craftsman himself.**

LoRA key parameters

| Parameter name | Recommended range | Function |
| --- | --- | --- |
| rank (r) | 4~32 | Rank of the low-rank matrices. The smaller r is, the fewer parameters are added, but the expressive capacity also decreases |
| alpha | 8~64 | Scaling factor for the low-rank output, usually set to about 2 times r |
| lora_dropout | 0.0~0.1 | Dropout rate, used to prevent overfitting |
| target_modules | - | Names of the modules LoRA is attached to; for GPT-style models such as LLaMA, commonly q_proj, k_proj, v_proj, etc. |
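To see how rank translates into parameter count, the sketch below counts the extra weights one LoRA pair adds: A is r × d_in and B is d_out × r, so the pair adds r × (d_in + d_out) parameters. The 4096 × 4096 projection, the 32 layers, and the four adapted modules per layer are assumed shapes for illustration; real models differ (for example, models with grouped-query attention have smaller K/V projections).

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA pair: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

# Assumed shape (illustrative): a square 4096 x 4096 projection matrix.
d = 4096
full = d * d
for r in (4, 8, 16, 32):
    added = lora_param_count(d, d, r)
    print(f"r={r:>2}: {added:>7,} LoRA params = {added / full:.2%} of the frozen matrix")

# Assumed adapter layout: 32 layers x 4 adapted projections, r=16, stored in 16-bit.
adapter_params = 32 * 4 * lora_param_count(d, d, 16)
adapter_mib = adapter_params * 2 / 2**20
print(f"adapter: {adapter_params:,} params, about {adapter_mib:.0f} MiB")
```

The count grows linearly with r, so doubling the rank doubles both the trainable parameters and the size of the saved adapter file.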

QLoRA: quantized LoRA

QLoRA (Quantized LoRA) is LoRA adapted for consumer-grade graphics cards. On top of LoRA it introduces 4-bit quantization of the base model: a 7B/13B model that would otherwise need tens of GB of GPU memory can be loaded in roughly 4GB/8GB, and with LoRA training on top the total comes to roughly 8GB/16GB, so an ordinary gaming GPU can finally fine-tune a large model.

QLoRA core technology stack

  1. NF4 quantization: A 4-bit data type (4-bit NormalFloat) designed for pre-trained weights that are approximately normally distributed, which preserves accuracy noticeably better than plain Int4.
  2. Double quantization: The scaling factors produced by quantization are themselves quantized, squeezing out additional GPU memory.
  3. Combined with LoRA: Only the lightweight LoRA matrices are trained on top of the quantized model; the quantized weights themselves remain frozen.
  4. BF16 computation: Intermediate calculations are carried out in BF16, balancing efficiency and numerical stability.
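The block-wise idea behind steps 1 and 2 can be illustrated with a toy quantizer. Note the simplification: the sketch uses 16 evenly spaced levels, not the actual NF4 levels (which are placed according to a normal distribution), and all function names are made up for the example. The overhead figures assume the block sizes described in the QLoRA paper as I understand it (64 weights per block, scales re-quantized to 8 bits in groups of 256); treat them as assumptions.

```python
import random

LEVELS = 16  # 4 bits -> 16 codes

def quantize_block(block):
    """Absmax-scale a block to [-1, 1], then snap each value to one of 16 evenly spaced codes."""
    absmax = max(abs(v) for v in block) or 1.0   # avoid dividing by zero for an all-zero block
    step = 2.0 / (LEVELS - 1)
    codes = [round((v / absmax + 1.0) / step) for v in block]
    return codes, absmax

def dequantize_block(codes, absmax):
    """Map codes back to floats using the stored per-block scale."""
    step = 2.0 / (LEVELS - 1)
    return [(c * step - 1.0) * absmax for c in codes]

def scale_overhead_bits(block_size=64, scale_bits=32):
    """Extra bits per weight spent on storing one full-precision scale per block."""
    return scale_bits / block_size

def double_quant_overhead_bits(block_size=64, scale_bits=8,
                               group_size=256, group_scale_bits=32):
    """Scales stored in 8 bits, plus one 32-bit constant per group of 256 scales."""
    return scale_bits / block_size + group_scale_bits / (block_size * group_size)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]   # one block of 64 weights
codes, absmax = quantize_block(weights)
restored = dequantize_block(codes, absmax)
max_err = max(abs(w - q) for w, q in zip(weights, restored))

print(f"max round-trip error: {max_err:.5f} (block absmax {absmax:.5f})")
print(f"scale overhead, plain:  {scale_overhead_bits():.3f} bits/weight")
print(f"scale overhead, double: {double_quant_overhead_bits():.3f} bits/weight")
```

Under these assumptions, double quantization cuts the per-weight bookkeeping from 0.5 bits to about 0.127 bits, which adds up to a few hundred MB on a multi-billion-parameter model.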

QLoRA Hardware Reference

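| Model size | Minimum GPU memory (training) | Minimum GPU memory (inference) |
| --- | --- | --- |
| 7B | 8GB (e.g. RTX 3070 Ti) | 4GB |
| 13B | 16GB (e.g. RTX 4080) | 8GB |
| 70B | 48GB (e.g. RTX A6000) | about 40GB (the 4-bit weights alone take about 35GB) |

These are tight lower bounds; longer sequences and larger batches need more headroom.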
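As a sanity check on numbers like these, the memory taken by the weights alone is simply parameters × bits per weight ÷ 8. The helper below is a hypothetical back-of-the-envelope calculator, not part of any library.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for size in (7, 13, 70):
    bf16 = weight_memory_gb(size, 16)
    four_bit = weight_memory_gb(size, 4)
    print(f"{size:>2}B: BF16 {bf16:6.1f} GB -> 4-bit {four_bit:5.1f} GB")
```

Real usage adds activations, LoRA gradients and optimizer states (for training), the KV cache (for inference), and framework overhead on top of these weights-only figures.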

Comparison of mainstream PEFT methods

In addition to LoRA and QLoRA, there are several common PEFT routes suitable for different scenarios:

| Method | Trainable parameters | Performance | Ease of use | Applicable scenarios |
| --- | --- | --- | --- | --- |
| LoRA | About 0.1% | Excellent | High | General fine-tuning: code, text, multi-modality, etc. |
| QLoRA | About 0.1% | Excellent | Medium | Fine-tuning large models when hardware is severely limited |
| Adapter | 1%~5% | Good | Medium | Multi-task parallel learning |
| Prompt Tuning | About 0.01% | Fair | High | Few-shot learning, simple text generation |

PEFT implementation

The following uses Hugging Face's Transformers and PEFT libraries to demonstrate the core workflow of fine-tuning with QLoRA (a small demo model stands in for a 7B model, and complete code would also need dataset preprocessing and a training loop).

1. Install dependencies

```bash
pip install transformers peft bitsandbytes accelerate datasets
```

2. Load the quantized model and configure LoRA

```python
import torch
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
)
from peft import (
    LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
)

# 4-bit quantization config: NF4 weights, double quantization, BF16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the base model and tokenizer
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # small demo model; swap in a 7B/13B model for real use
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # fall back to EOS when no pad token is defined

# Prepare the quantized model for training (PEFT helper for k-bit models)
model = prepare_model_for_kbit_training(model)

# Configure LoRA
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
)

# Apply LoRA
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # print the share of trainable parameters
```

3. Inference and model saving/loading

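```python
from peft import PeftModel

# Inference (after fine-tuning is complete)
prompt = "Hello, please give a brief introduction to PEFT."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# temperature only takes effect when sampling is enabled
outputs = model.generate(
    **inputs, max_new_tokens=200, do_sample=True, temperature=0.7
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Save the adapter weights (only the LoRA parameters, not the base model)
model.save_pretrained("./my_peft_model")
tokenizer.save_pretrained("./my_peft_model")

# Reload: load the quantized base model again, then attach the saved adapter
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./my_peft_model")
```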

Performance and efficiency optimization

Three tips for saving GPU memory

  1. Gradient checkpointing: Trades a small amount of extra computation for roughly 30% to 50% less GPU memory, at the cost of about 20% longer training time.
    model.gradient_checkpointing_enable()
  2. Gradient accumulation: Use several small batches to simulate one large batch, which eases memory shortages when training quantized models.
    # Set in TrainingArguments
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16
  3. Optimizer upgrade: Replace ordinary AdamW with paged_adamw_32bit, which can page optimizer states out to CPU memory when the GPU runs short and so helps avoid out-of-memory spikes.
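The three tips can be collected into one set of training settings. In the sketch below the keyword names match fields of `transformers.TrainingArguments`, but the values are illustrative starting points rather than tuned settings, `./qlora-out` is a hypothetical output path, and `effective_batch_size` is a helper made up for the example.

```python
# Memory-saving training settings, kept as a plain dict so the arithmetic
# can be checked without a GPU or any library installed.
training_kwargs = dict(
    output_dir="./qlora-out",            # hypothetical output path
    per_device_train_batch_size=1,       # keep per-step activations small
    gradient_accumulation_steps=16,      # accumulate 16 micro-batches per optimizer step
    gradient_checkpointing=True,         # recompute activations in the backward pass
    optim="paged_adamw_32bit",           # paged optimizer states via bitsandbytes
    bf16=True,                           # needs a GPU with bfloat16 support
)

def effective_batch_size(kwargs: dict, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return (kwargs["per_device_train_batch_size"]
            * kwargs["gradient_accumulation_steps"]
            * num_devices)

print(effective_batch_size(training_kwargs))  # 16

# With transformers installed, the same settings would be passed as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
```

With a micro-batch of 1 and 16 accumulation steps, each weight update still sees 16 samples, while peak activation memory stays at the level of a single sample.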

Training efficiency optimization

  1. Raise the learning rate: PEFT usually needs a learning rate about an order of magnitude higher than full fine-tuning (typically 1e-4~2e-4).
  2. Keep the target modules lean: Start by adapting only the Q/K/V projections of the attention layers, and avoid adapting too many modules at once.
  3. Keep the number of epochs modest: PEFT converges quickly, and 1 to 3 epochs are generally enough. Training much longer risks overfitting.

It is recommended to start by **fine-tuning a small model with LoRA** (such as TinyLlama) to run through the whole process, and then move on to a real project that **fine-tunes a 7B/13B model with QLoRA**. In practice, key hyperparameters such as rank, learning rate, and target modules can be adjusted according to your hardware, the difficulty of the task, and the results you expect.

Summary

PEFT largely removes the resource barrier to fine-tuning large models. LoRA is the first choice for general scenarios today, while QLoRA is the lifeline when hardware is limited. PEFT will continue to evolve and incorporate more efficient methods, further lowering the threshold for putting large models into practice.

