title: Instruction Tuning: A Complete Guide to Large Model Alignment Technology and RLHF | Daoman PythonAI
description: An in-depth look at the core techniques of instruction fine-tuning, covering the theory and practice of model alignment methods such as SFT (supervised fine-tuning), RLHF (reinforcement learning from human feedback), and DPO (direct preference optimization).
keywords: [Instruction Tuning, RLHF, Model Alignment, Artificial Intelligence, Large Language Model, LLM, Reinforcement Learning, Human Feedback, DPO, SFT, PPO]
# Instruction Tuning: A Complete Guide to Large Model Alignment Technology and RLHF
## From "continuing text" to "understanding instructions"
Think back to when GPT‑3 was first released: you typed in "write a poem about autumn", and it might hand back a block of prose, song lyrics about autumn from its training corpus, or even more text that read like other people's requests. The one thing missing was a complete poem written for you. This is not because the model is "stupid", but because it has only learned to continue text. The core task of the pre-training phase is called Language Modeling, which in plain English means: predict the words most likely to come next, given the words that have already appeared. The model has never learned to treat the user's words as an instruction to carry out.
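To make "predict the next word" concrete, here is a toy sketch that uses bigram counts in place of a neural network. The function names and the one-sentence corpus are illustrative assumptions, not part of any real model, but the behavior is the same in spirit: the prompt is treated as a beginning to extend, not a request to fulfil.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for every word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def continue_text(counts, prompt, max_new_words=5):
    """Greedily append the most frequent follower of the last word."""
    words = prompt.split()
    for _ in range(max_new_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # the last word was never seen with a follower
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical one-sentence "training corpus"
corpus = ["write a poem about autumn and post it online"]
counts = train_bigram(corpus)
print(continue_text(counts, "write a poem about autumn", max_new_words=4))
# prints: write a poem about autumn and post it online
```

The "model" never writes a poem; it simply extends the sentence the way its training text did.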
To turn a large model from a "continuation machine" into an intelligent assistant that understands requests such as "summarize this paragraph" or "organize the following information in a table", we have to give it one key lesson, Instruction Tuning, followed by model alignment work, so that its output is not only accurate but also safe and in line with human preferences.
## Limitations of pre-trained models
A purely pre-trained model is like a "super bookworm" who has read the entire Internet but has no grasp of social etiquette. Three problems stand out:
### 1. The output is uncontrollable and the safety risk is high
The model only maximizes the probability of the next word according to patterns in the training data, and has no notion of what should and should not be said. It could generate detailed instructions for making weapons, give confident but unfounded medical advice, or even leak personal information contained in the training data.
### 2. The format is unpredictable
You ask for a piece of JSON and it writes you an essay; you want a three-sentence summary and it may produce ten paragraphs. Because it has not been trained on instruction formats, the model does not understand that a format constraint is itself part of the task.
### 3. It can only continue text, not "understand" it
Enter "Translate: The weather is nice today -> English" and the model will not recognize this as a translation instruction. It is more likely to treat the text as an opening and continue with more sentences of the same pattern.
These problems are addressed through instruction fine-tuning plus preference alignment, which is the technical route covered in the rest of this article.
## SFT: The first step in instruction fine-tuning
SFT (Supervised Fine‑Tuning), in one sentence: take "one question, one answer" demonstration data and teach the model by example how to respond when it is asked something. The goal of this stage is to give the model the ability to follow instructions, moving it from "only continuing text" to "knowing how to answer".
### Core principles
- Data driven: Collect or construct a large number of `(instruction, input, ideal output)` triplets covering tasks such as translation, question answering, code generation, and summarization.
- Loss calculation: The standard language-model cross-entropy loss is still used, but only over the output tokens. The instruction and context tokens do not contribute to the gradient (their labels are usually set to `-100` so that the loss ignores them).
- Fast results: On many open-source base models, a few thousand to a few tens of thousands of high-quality instruction examples and a few hours to a few days of training are enough to take the model from "does not follow instructions" to "basically usable".
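The loss-masking rule above can be sketched in plain Python. This is a minimal illustration, not a library implementation: `masked_cross_entropy` is a hypothetical helper, and it assumes the logits and labels are already aligned position by position (real training code also shifts labels by one token).

```python
import math

IGNORE_INDEX = -100  # label value marking positions excluded from the loss


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored.

    logits: one list of scores per position (one score per vocabulary entry).
    labels: target token id per position, or ignore_index for prompt tokens.
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, labels):
        if target == ignore_index:
            continue  # instruction/context token: contributes no loss
        top = max(scores)
        # log-sum-exp, shifted by the max score for numerical stability
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else 0.0


# Three positions over a 4-token vocabulary; the first is a prompt token.
logits = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
labels = [IGNORE_INDEX, 2, 1]
print(masked_cross_entropy(logits, labels))
```

Changing the scores at the masked first position leaves the result untouched, which is exactly what "does not contribute to the gradient" means here.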
### Where does the data come from?
- Manual annotation: Hire domain experts to write the answers directly. This gives the highest quality, but also the highest cost.
- Crowdsourcing platforms: Services such as Amazon Mechanical Turk can scale up the amount of data quickly, but must be paired with strict quality review.
- Open-source datasets: Alpaca‑52K, ShareGPT, Belle, and others are ready to use as-is.
- Synthetic data: Use an existing strong model to generate candidate answers, then screen and correct them manually, balancing efficiency and quality.
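Whatever the source, each record usually ends up in the same triplet shape and is rendered into a prompt before training. The sketch below shows one way to do that; the `### Instruction:` style markers are loosely modeled on Alpaca-style templates and are an illustrative assumption, as is the `format_example` helper itself.

```python
def format_example(instruction, input_text, output):
    """Render one (instruction, input, ideal output) triplet as prompt/response text."""
    if input_text:
        prompt = (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    else:
        # Many instructions need no separate input; drop that section entirely.
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return {"prompt": prompt, "response": output}


record = format_example(
    "Translate the input into English.",
    "今天天气很好",
    "The weather is nice today.",
)
print(record["prompt"] + record["response"])
```

Keeping the prompt and the response in separate fields makes it easy to mask the prompt tokens out of the loss later.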
### Minimalist SFT code example
This section demonstrates supervised fine-tuning on a small model such as Qwen2.5‑1.5B. In real work it is strongly recommended to use a parameter-efficient fine-tuning method such as LoRA to save GPU memory. For clarity of teaching, a simplified full-parameter fine-tuning approach is used here.
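A minimal sketch of such a full-parameter run follows. It assumes `torch` and `transformers` are installed and that the Hugging Face hub id `Qwen/Qwen2.5-1.5B` corresponds to the model named above. The helper names, batch size of one, and learning rate are illustrative choices, not a recommended recipe.

```python
IGNORE_INDEX = -100  # label value that the Hugging Face causal-LM loss skips


def build_example(prompt_ids, answer_ids, eos_id):
    """Concatenate prompt and answer; mask the prompt so only the answer is learned."""
    input_ids = list(prompt_ids) + list(answer_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids) + [eos_id]
    return input_ids, labels


def train_sft(pairs, model_name="Qwen/Qwen2.5-1.5B", epochs=1, lr=1e-5):
    """Full-parameter SFT over (prompt, answer) string pairs, one sample per step."""
    # Heavy imports stay local so build_example works without them installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    for _ in range(epochs):
        for prompt, answer in pairs:
            prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
            answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]
            input_ids, labels = build_example(
                prompt_ids, answer_ids, tokenizer.eos_token_id
            )
            # The model shifts labels internally and ignores -100 positions.
            outputs = model(
                input_ids=torch.tensor([input_ids]),
                labels=torch.tensor([labels]),
            )
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model, tokenizer
```

Calling `train_sft([("Translate to English: 今天天气很好\n", "The weather is nice today.")])` downloads the model and runs on CPU as written, so expect it to be slow and memory-hungry. Moving the model and tensors to a GPU, batching with padding, and switching to LoRA are the first things a real setup would change.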
A demonstration at this scale is for teaching only. In real scenarios, SFT data often contains thousands or even tens of thousands of samples, and LoRA or QLoRA is used to reduce memory usage. Without them, full-parameter fine-tuning of a model with 7B or more parameters is very expensive.
After SFT, the model can understand an instruction like "help me write a poem", but it will not necessarily write the poem people like best. The next section shows how human preference judgments are introduced to raise output quality further.
## RLHF: Advanced Human Preference Alignment
SFT teaches the model to "follow instructions", but the same instruction often has several reasonable answers:
- User: "What is there to do in Beijing on the weekend?"
- Answer A: "There are many scenic spots in Beijing: the Forbidden City, the Great Wall, the Summer Palace..."
- Answer B: "The weather in Beijing is fine this week. I recommend boating at the Summer Palace or visiting an exhibition at 798. Both tend to be less crowded."
Both answers are correct, but B is clearly closer to what people prefer: specific, actionable, concise, and considerate. This is where RLHF (Reinforcement Learning from Human Feedback) comes in.
### Three-step process
1. SFT pre-alignment: First train a basic supervised model on high-quality instruction data, as the starting point for later optimization.
2. Reward Model (RM) training: Collect human preference rankings over multiple answers to the same instruction, and train a model that can score any answer. The higher the score, the more humans like the answer.
3. PPO reinforcement learning: Treat the SFT model as the policy network and the RM as the reward source, and use the PPO algorithm to update the policy iteratively so that its answers earn high RM scores. A KL divergence constraint against the SFT model is added at the same time, so that the policy does not drift into degenerate text just to raise its score.
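The reward-model and PPO steps above can be made concrete with two small formulas. The sketch below assumes the pairwise ranking loss used in InstructGPT-style reward modeling and a per-sequence KL penalty estimated from token log-probabilities. The function names and the default `beta` are illustrative assumptions, and a full PPO loop (advantages, clipping, value model) is deliberately left out.

```python
import math


def reward_model_loss(score_chosen, score_rejected):
    """Pairwise ranking loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # Equivalent to log(1 + exp(-margin)), written to avoid overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """RM score minus a penalty for drifting away from the SFT (reference) model.

    policy_logprobs / ref_logprobs: log-probabilities that each model assigns
    to the tokens of the sampled answer.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate


# The RM is penalized when it scores the human-preferred answer lower.
print(reward_model_loss(2.0, 0.0) < reward_model_loss(0.0, 2.0))  # prints: True

# A policy identical to the reference keeps the full RM score.
print(kl_shaped_reward(1.0, [-1.0, -1.0], [-1.0, -1.0]))  # prints: 1.0
```

A larger `beta` keeps the policy closer to the SFT model at the cost of smaller reward gains.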
RLHF is one of the most effective and widely used alignment schemes (both ChatGPT and GPT‑4 rely on it), but it has clear shortcomings: the pipeline is complex, several models must be trained and maintained, reinforcement learning training is often unstable, there are many hyperparameters that are hard to tune, and the compute and labeling costs of the whole process are very high.
## DPO: Simplified Alignment Alternative
Faced with RLHF being "expensive and difficult to train", DPO (Direct Preference Optimization), proposed in 2023, offers an elegant alternative.
The core insight of DPO is: **the reward model is essentially an implicit function that expresses preferences. The optimal policy can be derived directly from the preference data, without explicitly training an RM and then running reinforcement learning.**
DPO therefore reduces the whole alignment process to a classification-style task: starting directly from the SFT model, raise the log probability of human-preferred responses and lower the probability of rejected ones. Training needs only preference pairs of the form `(instruction, preferred response, rejected response)`, and the loss function optimizes the model itself directly.
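For a single preference pair, the DPO objective can be written in a few lines. The sketch below follows the loss from the DPO paper, with the frozen SFT model as the reference. It takes summed sequence log-probabilities as plain floats, and the `dpo_loss` name and default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (instruction, preferred, rejected) pair.

    Each argument is the total log-probability a model assigns to a response.
    """
    # How much more the policy likes each response than the reference does.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(z)), written to avoid overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Before any training the policy equals the reference, so the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# Raising the preferred response's log-probability lowers the loss.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0) < dpo_loss(-12.0, -15.0, -12.0, -15.0))  # prints: True
```

No reward model appears anywhere: the difference of log-ratios plays the role of an implicit reward margin.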
### Key advantages of DPO
- Simpler process: No separate RM training is required, and no complex reinforcement learning framework such as PPO.
- Training stability: It avoids RL-specific failure modes such as exploding reward values and policy collapse, and is much less sensitive to hyperparameters.
- Results close to RLHF: In many academic benchmarks and practical scenarios, DPO's alignment quality is on par with RLHF, and on some harmlessness metrics it is even better.
- Easy to implement: Hugging Face's `trl` library ships a built-in `DPOTrainer`; load the data and start training.
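Preparing data for DPO training is mostly a matter of getting preference pairs into the right shape. The sketch below writes them as JSON Lines using the `prompt` / `chosen` / `rejected` field names that TRL's documentation describes for preference data. The helper names and file path are hypothetical, and the exact trainer arguments should be checked against the installed TRL version.

```python
import json


def to_preference_record(instruction, preferred, rejected):
    """Map one preference triple onto prompt/chosen/rejected fields."""
    return {"prompt": instruction, "chosen": preferred, "rejected": rejected}


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


records = [
    to_preference_record(
        "What is there to do in Beijing on the weekend?",
        "I recommend boating at the Summer Palace or an exhibition at 798.",
        "There are many scenic spots in Beijing.",
    )
]
write_jsonl(records, "preferences.jsonl")  # hypothetical output path
```

A file like this can be loaded with `datasets.load_dataset("json", data_files="preferences.jsonl")` and passed to the trainer as its training dataset.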
Even with a limited budget and a small team, DPO makes it feasible to align a large model to a well-behaved level on consumer-grade hardware.
## Technology comparison and selection guide
| Method | Training data | Models to train | Stability and cost | When to choose it |
| --- | --- | --- | --- | --- |
| SFT | `(instruction, input, ideal output)` demonstrations | The model itself | Stable; a few thousand to tens of thousands of samples can be enough | Always the first step, to give the model instruction-following ability |
| RLHF | Human preference rankings over multiple answers | SFT model, reward model, and PPO-trained policy | Unstable training, many hyperparameters, high compute and labeling cost | When budget and engineering capacity allow pushing quality as far as possible |
| DPO | `(instruction, preferred response, rejected response)` pairs | The model itself, starting from the SFT model | More stable and much simpler than RLHF | Limited budget or a small team that still needs preference alignment |
## Open-source alignment case studies
### 1. Vicuna
Built on LLaMA (with later versions on Llama 2) and fine-tuned with SFT on a large amount of user-shared ShareGPT conversation data, Vicuna was one of the first projects to show that open-source dialogue models can come close to the ChatGPT experience. Its training recipe has since served as a template for many community fine-tunes.
### 2. Qwen2.5‑Instruct series
The open-source instruction models from Alibaba Cloud's Tongyi Qianwen went through a full pipeline that combines SFT, multi-stage safety alignment, and several preference optimization techniques. Their Chinese understanding and generation are particularly strong, and they also perform well across other languages. They are currently among the most practical models in China, with a low barrier to entry and very good results.
## Summary and further reading
Instruction fine-tuning and alignment are the necessary path for turning a pre-trained model that is "knowledgeable but unable to get things done" into an intelligent assistant that "understands you and is reliable":
- SFT is the foundation: It gives the model the ability to follow instructions and answers "what should be done".
- RLHF is the ceiling: It brings answers closer to human preferences and answers "how to do it better".
- DPO is the balanced choice: It approaches the ceiling with a simpler method, balancing cost and stability.
- Safety alignment cannot be ignored: Whichever technique is used, the final model must not output harmful, illegal, or unethical content.
### Further reading
- InstructGPT paper: the first systematic account of applying RLHF to language models
- DPO paper: the original Direct Preference Optimization paper
- TRL official documentation: standard training interfaces for SFT, DPO, and RLHF

