GPT series evolution: complete development history from GPT-1 to GPT-4o


Overview of the development history of the GPT series

GPT (Generative Pre-trained Transformer) is a family of large unidirectional (decoder-only) Transformer language models developed by OpenAI. From GPT-1 in 2018 to GPT-4o in 2024, the series has advanced along two lines, scaling up and alignment, turning large language models step by step from laboratory research into multimodal assistants that anyone can use.

The timeline below gives a quick sense of the chronology:

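graph LR
    A[2017<br>Transformer paper<br>Groundwork for the decoder architecture] --> B
    B[2018<br>GPT-1<br>117M<br>Pre-training + fine-tuning] --> C
    C[2019<br>GPT-2<br>1.5B<br>Zero-shot generalization] --> D
    D[2020<br>GPT-3<br>175B<br>Emergent abilities / ICL] --> E
    E[2022<br>GPT-3.5<br>RLHF alignment<br>ChatGPT] --> F
    F[2023<br>GPT-4<br>Multimodal + safety alignment] --> G
    G[2024<br>GPT-4o<br>Native multimodal<br>Real-time interaction]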

Comparison of core parameters and capabilities

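| Model | Year | Parameters (approx.) | Core inputs | Signature capabilities |
| --- | --- | --- | --- | --- |
| GPT-1 | 2018 | 117M | Text | Pre-training + fine-tuning transfer learning paradigm |
| GPT-2 | 2019 | 1.5B | Text | Zero-shot generalization |
| GPT-3 | 2020 | 175B | Text | In-context learning (ICL), multiple emergent abilities |
| GPT-3.5 | 2022 | ~175B (not officially confirmed) | Text | RLHF alignment with human preferences, deployed as ChatGPT |
| GPT-4 | 2023 | Undisclosed (unofficial estimates around 1.8T) | Text + images | Cross-modal reasoning, context up to 128K tokens (GPT-4 Turbo) |
| GPT-4o | 2024 | Undisclosed (unofficial estimates around 200B) | Text + audio + images + video | Native multimodality, real-time voice interaction with sub-second latency, API price halved |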

GPT-1: pre-training + fine-tuning foundation

In June 2018, a few months before BERT appeared, OpenAI quietly released the first pre-trained language model built only from the Transformer decoder. LSTMs and other RNNs were still the mainstream at the time, so this choice looked quite "avant-garde".

Core architecture selection

  • It drops the then-popular recurrent networks entirely and uses the unidirectional decoder from the 2017 Transformer paper (each position only sees the words to its left), which is a natural fit for text generation.
  • The structure is simple: 12 Transformer blocks, hidden dimension 768, 12 attention heads per layer, a vocabulary of 40,478 tokens, and about 117 million parameters in total.
  • The pre-training data is BooksCorpus (roughly 5GB), which the GPT-1 paper describes as containing over 7,000 unique unpublished books. The text is long and continuous, which suits learning long-range dependencies.
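
As a sanity check on the "about 117 million" figure, the parameter count can be estimated from the hyperparameters listed above. This is a rough sketch, not OpenAI's code: the context length of 512, the 4x-wide feed-forward layer, and the sharing of the output layer with the token embeddings are assumptions not stated in this article, and the function name `decoder_param_count` is made up for illustration.

```python
def decoder_param_count(n_layers, d_model, vocab_size, context_len):
    """Rough parameter count for a GPT-style decoder-only Transformer."""
    # Token embeddings plus learned position embeddings
    # (assumes the output layer reuses the token embedding matrix).
    embeddings = vocab_size * d_model + context_len * d_model
    # Self-attention: Q, K, V and output projections, each d_model x d_model plus a bias.
    attention = 4 * (d_model * d_model + d_model)
    # Feed-forward network with a hidden layer assumed to be 4x wider, plus biases.
    d_ff = 4 * d_model
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms per block, each with a scale vector and a bias vector.
    layer_norms = 2 * 2 * d_model
    per_block = attention + feed_forward + layer_norms
    return embeddings + n_layers * per_block


gpt1_params = decoder_param_count(n_layers=12, d_model=768, vocab_size=40478, context_len=512)
print(f"{gpt1_params / 1e6:.1f}M")  # 116.5M
```

Most of the total comes from the 12 blocks (about 85M); the embedding table accounts for most of the rest.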

Paradigm Innovation: Two-Phase Training

The biggest contribution of GPT-1 is not its scale but its training paradigm: it made "general-purpose language pre-training + downstream task fine-tuning" standard practice.

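def two_stage_training():
    """Summarize GPT-1's two-stage training recipe."""
    return {
        "Stage 1: unsupervised pre-training": "Next-token prediction on BooksCorpus only, to build general-purpose language representations",
        "Stage 2: supervised fine-tuning": "For downstream tasks such as classification or question answering, add a small task head and fine-tune the pre-trained model on labeled data",
    }

# Print each stage on its own line.
for stage, description in two_stage_training().items():
    print(f"{stage}: {description}")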

This two-stage process greatly eases the shortage of labeled data for supervised tasks. Almost all later large models (including BERT) follow a similar pre-training + fine-tuning approach.
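
The "next-token prediction" objective used in the pre-training stage can be sketched in a few lines. This is a minimal illustration only: `next_token_loss` and the uniform stand-in model are hypothetical names, and a real model would be a neural network producing the probabilities.

```python
import math


def next_token_loss(model, tokens):
    """Average negative log-likelihood of each token given its left context."""
    total = 0.0
    for i in range(1, len(tokens)):
        context = tokens[:i]        # unidirectional: only tokens to the left are visible
        probs = model(context)      # dict mapping candidate next token -> probability
        total += -math.log(probs[tokens[i]])
    return total / (len(tokens) - 1)


def uniform_model(context, vocab=("the", "cat", "sat", "down")):
    """Hypothetical stand-in model that ignores the context and guesses uniformly."""
    return {tok: 1.0 / len(vocab) for tok in vocab}


loss = next_token_loss(uniform_model, ["the", "cat", "sat", "down"])
print(round(loss, 4))  # 1.3863, i.e. log(4) for a uniform guess over 4 tokens
```

Pre-training lowers this loss on raw text; no labels are needed because the "label" at each position is simply the next token.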


GPT-2: Zero-shot generalization emerges

In 2019, OpenAI tested a bold hypothesis: if the model and the data are scaled up together, the model may be able to perform tasks it has never been trained on, without any fine-tuning. They scaled GPT-1's parameters by roughly 13 times and its data by about 8 times. The answer turned out to be an encouraging "yes".

Architecture and data upgrade

  • 48 Transformer blocks, hidden dimension 1600, 25 attention heads, about 1.5 billion parameters in total.
  • The training data was replaced with 40GB of WebText: about 8 million web pages reached through outbound Reddit links that received at least 3 karma, which serves as a human-curated quality filter. This text is closer to real Internet language and gave the model broader common sense and stronger writing ability.

Zero-shot learning: ask directly, without examples

In the GPT-1 era, if you wanted a model to perform sentiment analysis, you had to fine-tune it on labeled training data. With GPT-2 this step is no longer required: you give it a natural-language prompt, and it can infer the intent and produce an answer, even if the task never appeared explicitly during training.

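Prompt:
Here is a movie review: The special effects in this film are stunning, but the plot drags.
The sentiment of this review is:
GPT-2 output:
Negative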

This was the first time people saw clearly that scale itself can turn into generalization ability.


GPT-3: Emergent Capabilities Revolution

In 2020, GPT-3 debuted with 175 billion parameters. It is not just bigger: abilities that small models largely lack appear once the model reaches a certain scale, almost like a sudden "enlightenment". Researchers call this phenomenon "emergent abilities".

Core emergent capabilities

1. In-context learning (ICL)

At inference time, you give the model just one or two examples in the prompt, and it picks up the new task without any parameter updates. For example, ask it to convert English month abbreviations into their full Chinese names:

Prompt:
Task: convert English month abbreviations to their full Chinese names
Jan → 一月
Feb → 二月
Mar →
GPT-3 output:
三月

This ability to learn a task on the fly from the prompt is largely out of reach for small models.
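
Prompts of this shape are easy to assemble programmatically. The helper below is a hypothetical sketch (the name `build_prompt` and its parameters are made up for illustration, and no model is called): it only builds the text. With an empty example list it produces a zero-shot prompt; with one or two examples it produces the few-shot prompt used for in-context learning.

```python
def build_prompt(task, examples, query, arrow=" → "):
    """Assemble a zero-shot or few-shot prompt as plain text."""
    lines = [f"Task: {task}"]
    # Each demonstration is an input/output pair shown to the model in the prompt.
    for source, target in examples:
        lines.append(f"{source}{arrow}{target}")
    # The final line leaves the answer blank for the model to complete.
    lines.append(f"{query}{arrow}".rstrip())
    return "\n".join(lines)


prompt = build_prompt(
    task="Convert English month abbreviations to their full Chinese names",
    examples=[("Jan", "一月"), ("Feb", "二月")],
    query="Mar",
)
print(prompt)
```

The model's parameters never change between tasks; only the text passed in changes.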

2. Other emergent abilities

  • Basic arithmetic (such as addition and subtraction of two- and three-digit numbers)
  • Code generation (write functions based on comments)
  • Multi-language translation (even without explicit translation pre-training)
  • Creative writing (stories, poems, speeches, even imitating the style of a specific author)

GPT-3 showed the industry that scaling up parameters, together with data and compute, can unlock more advanced skills without task-specific training. Since then, "scale expansion → capability emergence" has become one of the core philosophies of large-model development.


GPT-3.5: RLHF alignment and the launch of ChatGPT

Although GPT-3 is very powerful, its raw output is quite "wild": it often misreads the question, makes up facts, or responds bluntly. In November 2022, ChatGPT launched on top of GPT-3.5, a further-optimized version of GPT-3 trained with RLHF (reinforcement learning from human feedback), turning large models from geek toys into mass-market tools almost overnight.

RLHF’s three-step core process

To make the model helpful, honest, and safe, OpenAI designed a three-step training process:

1. Supervised fine-tuning (SFT): collect tens of thousands of high-quality "human question + ideal human answer" pairs and let GPT-3 first learn basic instruction following and a natural conversational tone.
2. Reward model training: for the same question, have the model generate several different answers, which human labelers rank from best to worst. These rankings are used to train a separate reward model that scores answers, quantifying "what kind of answers do humans prefer?"
3. Reinforcement learning with PPO: treat the fine-tuned model as the policy, use the reward model's score as the feedback signal, and keep optimizing the policy with the PPO algorithm so that it learns to generate high-reward answers that match human preferences.
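
The reward-model and PPO steps described above can be made concrete with a small numerical sketch. It assumes the commonly used pairwise ranking loss, -log(sigmoid(r_better - r_worse)), and a KL-style penalty against the original model in the PPO reward. Neither detail is spelled out in this article, so treat both, along with the function names and the `beta` value, as illustrative assumptions rather than OpenAI's exact recipe.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(ranked_scores):
    """Mean of -log(sigmoid(r_better - r_worse)) over all pairs in one ranking.

    ranked_scores: reward-model scores for several answers to the same question,
    ordered from best to worst according to the human ranking.
    """
    losses = []
    for better, worse in combinations(ranked_scores, 2):
        margin = better - worse
        # log(1 + exp(-margin)) is the same as -log(sigmoid(margin)).
        losses.append(math.log(1.0 + math.exp(-margin)))
    return sum(losses) / len(losses)


def shaped_reward(rm_score, logp_policy, logp_reference, beta=0.1):
    """Reward fed to PPO: reward-model score minus an assumed KL-style penalty."""
    return rm_score - beta * (logp_policy - logp_reference)


print(round(pairwise_ranking_loss([0.0, 0.0]), 4))  # 0.6931, i.e. log(2) when scores tie
agrees = pairwise_ranking_loss([2.0, 0.0, -1.0])     # scores agree with the human ranking
disagrees = pairwise_ranking_loss([-1.0, 0.0, 2.0])  # scores contradict the human ranking
print(agrees < disagrees)  # True
print(shaped_reward(1.0, logp_policy=-2.0, logp_reference=-3.0))
```

The loss is small when the reward model scores answers in the same order as the human labelers. The penalty term in `shaped_reward` discourages the policy from drifting too far from the model it started from.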

After these three steps, the model is far better at "speaking human language", makes things up less often, and "knows when to say no", so the chat experience became much smoother. This is the technical basis for ChatGPT's worldwide popularity.


GPT-4: The multi-modal era begins

In March 2023, GPT-4 brought two major upgrades: cross-modal understanding and significantly stronger safety alignment. It accepts images as well as text. The context window also grew: GPT-4 launched with 8K and 32K-token variants, and GPT-4 Turbo in November 2023 extended it to 128K tokens (about 96,000 Chinese characters), enough to hold roughly the entire first volume of "The Three-Body Problem".
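
The conversion from 128K tokens to about 96,000 Chinese characters is a simple ratio. The sketch below assumes 0.75 characters per token, which is just the ratio implied by those two figures. Real tokenizers vary by text and by model, and the function names are made up for illustration.

```python
def chars_that_fit(context_tokens, chars_per_token=0.75):
    """Estimate how many characters fit in a context window of the given size."""
    return int(context_tokens * chars_per_token)


def fits_in_context(text_chars, context_tokens, chars_per_token=0.75):
    """Check whether a text of text_chars characters fits in the window."""
    return text_chars <= chars_that_fit(context_tokens, chars_per_token)


print(chars_that_fit(128_000))           # 96000
print(fits_in_context(90_000, 128_000))  # True
```

For a reliable answer in practice, count tokens with the model's actual tokenizer instead of a fixed ratio.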

Typical applications of multimodality

  • Financial Statement Analysis: Upload a screenshot of the financial report, it can automatically extract key indicators and generate a summary.
  • Code Repair: Take a photo of your handwritten code and it will identify the error and give you a corrected version of the code.
  • Creative Design: Upload a hand-drawn sketch and it will automatically generate detailed UI instructions or product copy.

The release of GPT-4 marks the true transformation of large models from "pure language" to "multi-modal", and the boundaries of capabilities are further broadened.


GPT-4o: native multi-modal + real-time interaction

In May 2024, GPT-4o was released (the "o" stands for "omni", meaning "all"). It is the first model in the GPT series designed for multimodality from the ground up, trained end to end across text, vision, and audio, leaving behind the earlier stopgap of bolting an image encoder onto a text-pretrained model.

Core upgrade

  • Native multimodality: text, audio, images, and video are processed in one unified space, so understanding is no longer fragmented across modalities.
  • Real-time voice interaction: it responds to speech in as little as 232 ms (about 320 ms on average), supports interruption at any time, and can pick up on tone and express empathy, so conversations feel close to talking with a real person.
  • Significant cost reduction: compared with GPT-4 Turbo, API prices for text input and output dropped by 50%, with image processing reported to be about 75% cheaper, making multimodal AI far more affordable.

The design philosophy of GPT-4o is very clear: Let AI see, listen, speak and read naturally like humans, instead of switching back and forth between various modes.


Core Concept Supplement: Emergent Capability and Alignment

A simple explanation of emergent ability

You can think of the model parameters as ants: a few ants can only carry small debris, but when thousands of ants form a colony, they can build complex nests and achieve a high degree of division of labor and collaboration. After the parameters reach the tens of billions level, the model is like an ant colony, suddenly unlocking advanced capabilities that small models do not have - this is emergence.

The essence of alignment

Alignment means giving this large and powerful "universal brain" a set of values and safety constraints so that what it says is consistent with facts and with human preferences and norms. Put simply: it is not enough for a model to be smart, it must also be reliable.


Comparison of mainstream large models in early 2025

As of early 2025, the global large-model market has become crowded and diverse. Some representative players:

| Model | Developer | Core features |
| --- | --- | --- |
| GPT-4o | OpenAI | Native multimodal, real-time interaction, easy to use |
| Claude 3.5 | Anthropic | Ultra-long context, safety alignment, strong reasoning |
| Gemini 2.0 | Google | Native multimodal, deeply integrated with search |
| LLaMA 3.1 | Meta | Open source, supports local private deployment |
| DeepSeek R1 | DeepSeek | Open source, strong mathematical and logical reasoning |

Looking ahead, several trends stand out:

  • More efficient architectures: Mixture of Experts (MoE) has become mainstream, cutting inference cost substantially at comparable capability because only a few experts are active per token.
  • Vertical specialization: specialized models for medicine, law, coding, and other fields are growing rapidly, and depth in a domain matters more than generality.
  • Personalized AI: everyone can have a private small-model assistant that runs locally and belongs entirely to them.
  • Stricter safety alignment: as AI regulations take effect in various countries, alignment and explainability will shift from "nice to have" to "must have".
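
The cost advantage of Mixture of Experts (MoE) mentioned in the list above comes from running only a few experts per token. The toy sketch below shows top-k routing under loud assumptions: the "experts" are hypothetical scalar functions and the router logits are hard-coded, whereas a real MoE layer uses learned routers and feed-forward experts.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    gates = route_top_k(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Four toy experts that just scale their input (purely illustrative).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, 0.3, 2.0]

gates = route_top_k(logits, k=2)
print(sorted(gates))                      # [1, 3]
print(moe_forward(1.0, experts, logits))  # 3.0
```

Only 2 of the 4 experts run for this input, so compute per token stays close to that of a much smaller dense model even though the total parameter count is larger.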

The main thread of the GPT series can be summarized in three keywords: **scale, architecture, and alignment**. To go deeper, a recommended learning path is: first learn the basics of the Transformer, then read the GPT-3 paper and the original RLHF literature carefully, and finally run open-source models yourself to feel the differences across scales.
