道满PythonAI

title: Instruction Tuning: A Complete Guide to Large Model Alignment Technology and RLHF | Daoman PythonAI
description: Learn the core techniques of instruction fine-tuning in depth, and master the theory and practice of model alignment methods such as supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO).
keywords: [Instruction Tuning, RLHF, Model Alignment, Artificial Intelligence, Large Language Model, LLM, Reinforcement Learning, Human Feedback, DPO, SFT, PPO]

# Instruction Tuning: A Complete Guide to Large Model Alignment Technology and RLHF

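From "continuing text" to "understanding instructions"
Limitations of pre-trained models
SFT: The first step in instruction fine-tuning
RLHF: Advanced Human Preference Alignment
DPO: Simplified Alignment Alternative
Technology comparison and selection guide
Actual open source alignment case
Summary and further reading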

From "continuing text" to "understanding instructions"

Think back to when GPT-3 was first released: you typed "write a poem about autumn", and it might hand you a block of prose, song lyrics about autumn from its training corpus, or even text that reads like someone else's unfinished draft. The one thing missing was a complete poem written for you. This is not because the model is "stupid", but because it has only learned to continue text. The core task of the pre-training phase is language modeling, which in plain terms means predicting the most likely next token given the tokens that came before. The model has never been taught to treat the user's words as an instruction to carry out.
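To make "predict the next word" concrete, here is a minimal sketch of a toy bigram model. It is only an illustration: the tiny corpus and the function names `train_bigram` and `continue_text` are made up for this example, and real language models use neural networks over tokens rather than word counts. The point is that such a model continues the prompt instead of carrying it out.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def continue_text(counts, prompt, max_new_words=5):
    """Greedily append the most frequent next word, one word at a time."""
    words = prompt.split()
    for _ in range(max_new_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # the last word was never seen with a successor
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical "training corpus": the instruction only ever appears as text.
corpus = [
    "write a poem about autumn leaves",
    "write a poem about autumn rain",
    "write a poem about autumn leaves falling",
]
model = train_bigram(corpus)

# The model extends the prompt with likely words; it does not write a poem.
print(continue_text(model, "write a poem about autumn"))
```

The output is just the prompt with a few statistically likely words appended, which is the behavior instruction tuning is meant to change.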

To turn a large model from a "continuation machine" into an intelligent assistant that understands requests such as "summarize this paragraph" or "organize the following information in a table", we have to teach it one key lesson: instruction tuning, followed by model alignment work, so that its output is not only accurate but also safe and in line with human preferences.

Limitations of pre-trained models

A purely pre-trained model is like a "super bookworm" who has read the entire Internet but does not understand social etiquette. Three problems stand out:

1. The output is uncontrollable and the security risk is high

The model simply maximizes the probability of the next token according to patterns in its training data and has no notion of what should or should not be said. It may generate detailed instructions for making weapons, hand out medical advice with unwarranted confidence, or even leak personal information contained in the training data.

2. The format is completely random

You ask for a piece of JSON and it writes you an essay; you want a three-sentence summary and it may ramble on for ten paragraphs. Because it has never been trained on instruction-formatted data, the model does not understand that the format constraint is itself part of the task.

3. Can only continue to write, but cannot “understand”

Enter "Translate: The weather is nice today -> English" and the model will not recognize a translation instruction. It is more likely to treat the text as the start of a document and keep writing more lines in the same pattern.

These problems are addressed through instruction fine-tuning plus preference alignment, which is the technical route covered below.

SFT: The first step in instruction fine-tuning

SFT (Supervised Fine-Tuning), in one sentence: take question-and-answer demonstration data and teach the model, example by example, how to respond when it is given a particular request. The goal of this stage is to give the model the ability to follow instructions, moving it from "only continuing text" to "knowing how to answer".

Core principles

Data driven: Collect or construct a large number of (instruction, input, ideal output) triplets covering tasks such as translation, question answering, code generation, and summarization.
Loss calculation: The usual language-model cross-entropy loss is used, but only on the output part. The instruction and the preceding context do not contribute to the gradient update (their labels are usually set to -100 so that the loss ignores them).
Fast results: On many open-source base models, a few thousand to a few tens of thousands of high-quality instruction samples and a few hours to a few days of training are enough to take the model from "does not follow instructions" to "basically usable".
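The loss masking described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a framework implementation: the function name `masked_cross_entropy` and the toy logits are assumptions, and the sketch assumes the logits are already aligned with their target labels (libraries such as Transformers handle the one-position shift internally).

```python
import math

IGNORE_INDEX = -100  # label value that marks positions excluded from the loss


def masked_cross_entropy(logits, labels):
    """Average cross-entropy over positions whose label is not IGNORE_INDEX."""
    total, counted = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # prompt position: contributes no loss and no gradient
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]  # -log softmax(row)[label]
        counted += 1
    return total / counted if counted else 0.0


# Toy example: 4 positions, vocabulary of 3 tokens.
# The first two positions belong to the prompt, the last two to the answer.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]
labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 0]
loss = masked_cross_entropy(logits, labels)
print(loss)
```

Changing the logits at the two prompt positions leaves `loss` unchanged, which is exactly what "the instruction does not participate in the gradient update" means.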

Where does the data come from?

Manual annotation: Hire domain experts to write answers directly. This gives the highest quality, but also the highest cost.
Crowdsourcing platforms: Services such as Amazon Mechanical Turk can scale up the amount of data quickly, but must be paired with strict quality audits.
Open-source datasets: Alpaca-52K, ShareGPT, Belle, and others are ready to use.
Synthetic data: Use an existing strong model to generate candidate answers, then screen and correct them manually, balancing efficiency and quality.
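Whatever the source, the data usually needs a cleaning pass before training. The following is a minimal sketch of such a pass, assuming records use the `instruction` / `input` / `output` field names from the triplets above. The function name `clean_sft_records` and the specific rules (drop empty fields, drop exact duplicates) are illustrative choices, not a standard recipe.

```python
def clean_sft_records(records, min_output_chars=1):
    """Drop records with missing fields and exact duplicate prompts."""
    seen = set()
    cleaned = []
    for rec in records:
        instruction = rec.get("instruction", "").strip()
        input_text = rec.get("input", "").strip()
        output = rec.get("output", "").strip()
        # A usable sample needs an instruction and a non-trivial answer.
        if not instruction or len(output) < min_output_chars:
            continue
        key = (instruction, input_text)
        if key in seen:
            continue  # keep only the first answer for an identical prompt
        seen.add(key)
        cleaned.append(
            {"instruction": instruction, "input": input_text, "output": output}
        )
    return cleaned


raw = [
    {"instruction": "Translate to English", "input": "你好", "output": "Hello"},
    {"instruction": "Translate to English", "input": "你好", "output": "Hi there"},
    {"instruction": "Summarize the text", "input": "Some text", "output": "   "},
    {"instruction": "", "input": "", "output": "An answer with no question"},
]
print(clean_sft_records(raw))
```

Real pipelines typically add further checks (length limits, language detection, human review), but even a simple filter like this removes samples that would teach the model nothing.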

Minimalist SFT code example

The code below demonstrates supervised fine-tuning on a small model such as Qwen2.5-1.5B. In real work, parameter-efficient fine-tuning methods such as LoRA are strongly recommended to save GPU memory. For teaching clarity, a simplified full-parameter fine-tuning setup is used here.

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

# Load the base model and tokenizer (the base checkpoint, not the -Instruct variant)
model_name = "Qwen/Qwen2.5-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Keep the weights in float32: fp16=True below turns on mixed precision,
# and its gradient scaler cannot unscale gradients of float16 weights.
model = AutoModelForCausalLM.from_pretrained(model_name)

MAX_LENGTH = 256

# Format the prompt up to the start of the answer; Qwen uses <|im_start|>/<|im_end|> markers
def format_prompt(instruction, input_text=""):
    return (
        "<|im_start|>system\n你是一个有用的助手<|im_end|>\n"  # "You are a helpful assistant"
        f"<|im_start|>user\n{instruction}\n{input_text}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

# Build the training data (a few examples; the instructions are in Chinese)
train_data = [
    {"instruction": "将以下中文翻译成英文", "input": "今天北京的天气很好", "output": "The weather in Beijing is very nice today."},
    {"instruction": "写一段Python代码", "input": "计算1-100的和", "output": "total = sum(range(1, 101))\nprint(total)"}
]
dataset = Dataset.from_list(train_data)

# Preprocessing: tokenize prompt and answer separately so the boundary is exact,
# then build labels that compute loss only on the assistant part
def tokenize_function(examples):
    all_input_ids, all_attention_mask, all_labels = [], [], []
    for inst, inp, out in zip(examples["instruction"], examples["input"], examples["output"]):
        prompt_ids = tokenizer(format_prompt(inst, inp), add_special_tokens=False)["input_ids"]
        # The answer ends with <|im_end|> so the model learns where to stop
        answer_ids = tokenizer(out + "<|im_end|>", add_special_tokens=False)["input_ids"]

        input_ids = (prompt_ids + answer_ids)[:MAX_LENGTH]
        # Prompt positions get label -100 (ignored); answer positions keep their token ids
        labels = ([-100] * len(prompt_ids) + answer_ids)[:MAX_LENGTH]

        all_input_ids.append(input_ids)
        all_attention_mask.append([1] * len(input_ids))
        all_labels.append(labels)

    return {"input_ids": all_input_ids, "attention_mask": all_attention_mask, "labels": all_labels}

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)

# Training configuration (fp16=True requires a CUDA GPU)
training_args = TrainingArguments(
    output_dir="./sft_model",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    fp16=True,
    logging_steps=10,
    learning_rate=5e-5
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    # Pads input_ids with the pad token and labels with -100 when batching
    data_collator=DataCollatorForSeq2Seq(tokenizer, padding=True, label_pad_token_id=-100),
)
trainer.train()

Tip

The code above is only a teaching demonstration. In real scenarios, SFT data usually contains thousands or even tens of thousands of samples, and LoRA or QLoRA is used to reduce memory usage. Otherwise, full-parameter fine-tuning of a model with 7B or more parameters is very expensive.
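A rough back-of-the-envelope calculation shows why. The sketch below rests on stated assumptions: about 16 bytes per parameter for mixed-precision training with Adam (fp16 weights and gradients, fp32 master weights, two fp32 optimizer moments), activations ignored, and a hypothetical 4096 x 4096 weight matrix for the LoRA comparison. The function names are made up for illustration.

```python
def full_finetune_bytes(num_params, bytes_per_param=16):
    """Rough training-state memory for full fine-tuning (activations excluded).

    Assumed breakdown per parameter: 2 (fp16 weight) + 2 (fp16 gradient)
    + 4 (fp32 master weight) + 4 + 4 (two fp32 Adam moments) = 16 bytes.
    """
    return num_params * bytes_per_param


def lora_trainable_params(d_out, d_in, rank):
    """LoRA trains B (d_out x rank) and A (rank x d_in) instead of the full matrix."""
    return rank * (d_out + d_in)


# Full fine-tuning of a 7B-parameter model under the assumption above.
print(full_finetune_bytes(7_000_000_000) / 1e9, "GB of weights, gradients and optimizer state")

# One hypothetical 4096 x 4096 projection matrix with LoRA rank 8.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=8)
print(f"full matrix: {full} params, LoRA: {lora} params, ratio: {full // lora}x fewer")
```

Under these assumptions the trainable parameters per adapted matrix shrink by a factor of 256, and the gradient and optimizer memory shrinks with them. QLoRA goes further by also quantizing the frozen base weights.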

After SFT, the model can understand an instruction such as "Help me write a poem", but it may not write the version people like most. The next section shows how human preference judgments are introduced to raise output quality further.

RLHF: Advanced Human Preference Alignment

SFT allows the model to learn to "follow instructions", but the same instruction often has multiple reasonable answers:

User: "What is there to do in Beijing on the weekend?"
Answer A: "There are many scenic spots in Beijing: the Forbidden City, the Great Wall, the Summer Palace..."
Answer B: "The weather in Beijing is fine this week. I recommend going boating in the Summer Palace or going to 798 to see the exhibition. There are relatively fewer people."

Both answers are correct, but B is clearly closer to what people prefer: specific, actionable, and considerate. This is where RLHF (Reinforcement Learning from Human Feedback) comes in.

Three-step process

1. SFT pre-alignment: First train a basic supervised model on high-quality instruction data as the starting point for later optimization.
2. Reward model (RM) training: Collect human preference rankings over multiple answers to the same instruction and train a model that can score any answer. The higher the score, the more humans like the answer.
3. PPO reinforcement learning optimization: Treat the SFT model as the policy network and the RM as the reward source, and use the PPO algorithm to update the model parameters iteratively so that the answers the policy generates earn high RM scores. A KL divergence constraint is added at the same time to keep the policy close to the SFT model, so that it cannot drift into degenerate output just to raise its score.
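The two quantities at the heart of steps 2 and 3 can be sketched numerically. This is a minimal sketch under assumptions: the pairwise loss below is the commonly used Bradley-Terry style formulation, the KL term is the usual per-sample log-ratio penalty, and the function names, the `beta` value, and the toy numbers are all illustrative. A real pipeline computes these over batches of token log-probabilities inside a PPO training loop.

```python
import math


def reward_model_loss(score_chosen, score_rejected):
    """Pairwise loss -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the preferred answer already scores higher.
    """
    diff = score_chosen - score_rejected
    return math.log1p(math.exp(-diff))  # equals -log(sigmoid(diff))


def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """RM score minus a penalty for drifting away from the SFT (reference) model.

    logp_policy and logp_ref are the log-probabilities that the current policy
    and the frozen SFT model assign to the same generated answer.
    """
    return rm_score - beta * (logp_policy - logp_ref)


# Step 2: the RM is pushed to rank the human-preferred answer higher.
print(reward_model_loss(2.0, 0.0))  # ranking already correct: small loss
print(reward_model_loss(0.0, 2.0))  # ranking wrong: large loss

# Step 3: an answer the policy has become much more confident about than
# the SFT model did is penalized, even if the RM likes it.
print(kl_shaped_reward(rm_score=1.0, logp_policy=-5.0, logp_ref=-5.0))
print(kl_shaped_reward(rm_score=1.0, logp_policy=-2.0, logp_ref=-5.0))
```

The `beta` coefficient controls the trade-off: a larger value keeps the policy closer to the SFT model, a smaller value lets it chase RM score more aggressively.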

RLHF is one of the most effective and widely used alignment schemes (both ChatGPT and GPT-4 rely on it), but it has clear shortcomings: the pipeline is complex, several models must be trained and maintained, reinforcement learning training can be unstable, there are many hyperparameters that are hard to tune, and the compute and labor costs of the whole process are very high.

DPO: Simplified Alignment Alternative

Faced with RLHF being "expensive and hard to train", DPO (Direct Preference Optimization), proposed in 2023, offers an elegant alternative.

The core insight of DPO is: **the reward model is essentially an implicit function that expresses preferences, and the optimal policy can be derived directly from preference data without explicitly training an RM and then running reinforcement learning.** DPO therefore reduces the whole alignment process to a classification-style task: starting from the SFT model, it raises the log probability of human-preferred responses and lowers that of dispreferred ones. Training needs only preference pairs of the form (instruction, preferred response, dispreferred response), and the loss function optimizes the model itself directly.
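The DPO objective for a single preference pair can be written in a few lines. This is a minimal sketch assuming the standard form of the DPO loss, in which a frozen copy of the SFT model serves as the reference. The function name `dpo_loss`, the `beta` value, and the toy log-probabilities are illustrative, and each log-probability stands for the summed token log-probability of a whole response.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (instruction, preferred, dispreferred) pair.

    Each margin measures how much more likely the policy makes a response
    than the frozen reference (SFT) model does.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return math.log1p(math.exp(-logits))  # equals -log(sigmoid(logits))


# At the start of training the policy equals the reference, so the loss is log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After the policy shifts probability toward the preferred response, the loss drops.
later = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(start, later)
```

No reward model or sampling loop appears anywhere: the only inputs are log-probabilities from two forward passes of the policy and two of the reference model.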

Outstanding advantages of DPO

Process simplification: No separate RM has to be trained, and no complex reinforcement learning framework such as PPO is needed.
Training stability: Problems such as exploding reward values and policy collapse are largely avoided, and sensitivity to hyperparameters is much lower.
Effect close to RLHF: In many academic benchmarks and practical scenarios, DPO's alignment quality is on par with RLHF, and on some harmlessness metrics it is even better.
Easy to implement: Hugging Face's trl library ships a built-in DPOTrainer, so you can load the data and start training.
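Before training, the preference data has to be put into the shape the trainer expects. The sketch below prepares records with `prompt`, `chosen`, and `rejected` fields, which is the column naming assumed here for trl's preference format. The helper name `to_preference_records` and its filtering rules are illustrative, and the sketch deliberately stops short of calling trl itself, since trainer arguments vary between library versions.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def to_preference_records(raw_pairs):
    """Convert (instruction, preferred, dispreferred) tuples into preference records."""
    records = []
    for instruction, preferred, dispreferred in raw_pairs:
        record = {"prompt": instruction, "chosen": preferred, "rejected": dispreferred}
        # Skip incomplete pairs: every field must contain some text.
        if any(not record[key].strip() for key in REQUIRED_KEYS):
            continue
        # A pair with identical responses carries no preference signal.
        if preferred == dispreferred:
            continue
        records.append(record)
    return records


raw_pairs = [
    (
        "What is there to do in Beijing on the weekend?",
        "I recommend boating at the Summer Palace or visiting an exhibition at 798.",
        "There are many scenic spots in Beijing: the Forbidden City, the Great Wall...",
    ),
    ("Summarize this paragraph.", "A short summary.", "A short summary."),
    ("Translate this sentence.", "", "Some translation."),
]
records = to_preference_records(raw_pairs)
print(len(records), "usable preference pair(s)")
```

A list like `records` can then be wrapped in a `datasets.Dataset` and handed to the trainer, following the trl documentation for the installed version.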

Even with a limited budget and a small team, DPO makes it possible to align a large model to a well-behaved level on consumer-grade hardware.

Technology comparison and selection guide

| Technology | Core Idea | Advantages | Disadvantages | Applicable Scenarios |
| --- | --- | --- | --- | --- |
| SFT | Supervised learning on instruction-answer demonstrations | Simple to implement, quick results | Relies on annotated data; limited generalization and creativity | Rapid prototyping, basic instruction following, the starting point before alignment |
| RLHF | Reinforcement learning + human preference feedback | Highest alignment quality, best safety and usefulness | Complex process, unstable training, extremely high cost | Commercial products with strict safety and quality requirements |
| DPO | Direct preference optimization that skips the reward model | Simple and stable, with results close to RLHF | May be slightly worse than RLHF under some extreme distributions | Most practical scenarios that balance cost and effect |

If your goal is to make the model follow instructions, start with SFT. If you also want it to answer the way a helpful person would and to stay safe, try DPO on top of SFT first. Consider the full RLHF pipeline only when resources and time are plentiful and the requirements on output quality are extremely high.

Actual open source alignment case

1. Vicuna

Fine-tuned from LLaMA (with later versions built on Llama 2) using SFT on a large set of user-shared conversations from ShareGPT, it was one of the first projects to show that an open-source dialogue model can approach the ChatGPT experience. Its training recipe has since become a template for many community fine-tuning efforts.

2. Qwen2.5‑Instruct Series

The open-source instruction models from Alibaba Cloud's Tongyi Qianwen (Qwen) team go through a full pipeline that combines SFT, multi-level safety alignment, and several preference optimization techniques. Their Chinese understanding and generation are particularly strong, and they also perform well in other languages. Among practical models from China, they currently combine a very low barrier to entry with very good results.

Summary and further reading

Instruction fine-tuning and alignment are the necessary path for turning a pre-trained model that is "knowledgeable but unable to get things done" into an intelligent assistant that "understands you and is reliable":

SFT is the foundation: Give the model the ability to follow instructions and answer "what should be done".
RLHF is the ceiling: Make the answer more in line with human preferences and answer "how to do it better".
DPO is a balanced choice: approach the ceiling effect with a simpler solution, taking into account both cost and stability.
Safety alignment cannot be ignored: No matter which technique is used, the final model must not output harmful, illegal, or unethical content.
