title: Practical Project 2: Automatic Summary Generator | Daoman PythonAI
description: From extractive to generative methods, we build an automatic summarization system based on BERT, T5, and ChatGPT, covering key technologies such as the TextRank algorithm, ROUGE evaluation metrics, and FastAPI deployment, to create an enterprise-grade summarization service.
keywords: [automatic summarization, text summarization, extractive summarization, generative summarization, TextRank, BERT, T5, ChatGPT, ROUGE, NLP, natural language processing, summary generation]

Practical Project 2: Automatic Summary Generator


Project Overview

Automatic summarization is one of the most common NLP tasks: extracting the key information from a long document and producing concise, accurate content. This project balances the speed of classic algorithms against the quality of large models, so that you can build a summarization service that is ready to go online.

Our goals at a glance

A Python dictionary lays out the three dimensions of the project in one place:

objectives = {
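    "technology": ["TextRank/BERT extraction", "T5/BART/GPT generation", "ROUGE evaluation", "FastAPI service"],
    "performance": ["ROUGE-1 > 0.4", "single-request latency < 2 s", "summary is 10%~20% of the source length"],
    "business": ["multiple styles (concise/technical)", "multiple lengths (100~5000 characters)", "batch processing"]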
}
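
The two measurable performance targets above (latency and summary-to-source ratio) can be checked with a small helper. This is only a sketch: `check_targets` and the stand-in summarizer `first_15_percent` are hypothetical names, not part of the project code, and length is counted in characters as a simplifying assumption.

```python
import time

def check_targets(summarize, text, max_seconds=2.0, ratio_range=(0.10, 0.20)):
    """Run `summarize` once and compare the result against the stated targets."""
    start = time.perf_counter()
    summary = summarize(text)
    elapsed = time.perf_counter() - start
    ratio = len(summary) / max(len(text), 1)  # character-based length ratio
    return {
        "seconds": round(elapsed, 3),
        "ratio": round(ratio, 3),
        "fast_enough": elapsed < max_seconds,
        "ratio_ok": ratio_range[0] <= ratio <= ratio_range[1],
    }

# Stand-in summarizer (hypothetical): keeps the first 15% of the characters.
def first_15_percent(text):
    return text[: max(1, int(len(text) * 0.15))]

report = check_targets(first_15_percent, "x" * 1000)
print(report)
```

Any callable that maps a string to a string can be passed in, so the same check works for each summarizer built later in the project.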

The sections below walk through extraction, generation, evaluation, and deployment step by step.


Classification of Summarization Techniques

Extractive vs. Generative: A Scenario Selection Guide for 2026

Rather than weighing the question at length, use the table below to make a quick decision:

| Comparison dimension | Extractive (TextRank/BERT) | Generative (T5/GPT) |
| --- | --- | --- |
| Vocabulary source | 100% from the original text, no hallucinated wording | Can rewrite and compress; occasionally produces inaccurate content |
| Processing speed | Very fast (<0.5 seconds/item) | Slower (0.5~2 seconds/item, longer for GPT) |
| Applicable scenarios | Short news, real-time requirements, rule-based documents | Long text, polished technical writing, quality-first scenarios |

In short: choose extractive methods when speed matters most, and generative methods when polish matters most.

A Minimal History of the Technology

There is no need to memorize a pile of papers; just remember these three milestones:

  1. Classic stage (up to around 2004): methods based on word frequency and PageRank variants such as TextRank; fast, but unable to understand semantics.
  2. Pre-training stage (2017-2021): the Transformer appears, and BERT, T5, and BART greatly improve summary quality.
  3. Large-model stage (2022 to present): ChatGPT and later GPT-series models can directly understand complex text and generate fluent summaries.

Core implementation: Extraction→Generation→Evaluation

1. Extractive approach: get TextRank running in 10 minutes (no large model required)

TextRank is the fastest and cheapest way to get started, and it is especially suitable for scenarios with strict real-time requirements and clearly rule-based documents.

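import re

import numpy as np
import jieba
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

class TextRank:
    def __init__(self, top_k=3):
        self.top_k = top_k      # number of sentences to keep
        self.damping = 0.85     # PageRank damping factor
        self.max_iter = 100     # upper bound on iteration steps
        self.tol = 1e-5         # convergence threshold (total change in scores)

    def _split_sent(self, text):
        # Split on Chinese and ASCII sentence-ending punctuation,
        # dropping fragments of 10 characters or fewer
        sents = []
        for s in re.split(r'[。!?!?]', text):
            s = s.strip()
            if len(s) > 10:
                sents.append(s)
        return sents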

    def _build_sim_mat(self, sents):
        # Tokenize with jieba, build TF-IDF vectors, and compute sentence similarity
        tfidf = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None)
        tfidf_mat = tfidf.fit_transform(sents)
        sim_mat = cosine_similarity(tfidf_mat)
        np.fill_diagonal(sim_mat, 0)
        # Row-normalize so that each non-empty row sums to 1
        row_sum = sim_mat.sum(axis=1, keepdims=True)
        row_sum[row_sum == 0] = 1
        return sim_mat / row_sum

    def summarize(self, text):
        sents = self._split_sent(text)
        if len(sents) <= self.top_k:
            return '。'.join(sents) + '。'

        sim_mat = self._build_sim_mat(sents)
        scores = np.ones(len(sents)) / len(sents)

        # PageRank iteration
        for _ in range(self.max_iter):
            new_scores = (1 - self.damping)/len(sents) + self.damping * sim_mat.T.dot(scores)
            if abs(new_scores - scores).sum() < self.tol:
                break
            scores = new_scores

        # Pick the top-k sentences, then restore their original order
        top_idx = np.argsort(scores)[-self.top_k:][::-1]
        top_idx.sort()
        return '。'.join([sents[i] for i in top_idx]) + '。'


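# Quick test. The sample has only three sentences, so top_k=2 is used here;
# with the default top_k=3 every sentence would be returned without ranking.
text = (
    "自然语言处理是AI的重要分支,研究人机语言交互。"
    "包含语音识别、文本分类、机器翻译等任务。"
    "近年来大模型取得突破,BERT和GPT是代表,ChatGPT在对话中表现出色。"
)
print(TextRank(top_k=2).summarize(text))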
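
To make the ranking step concrete, the same damped iteration can be run by hand on a tiny graph. The sketch below uses plain Python lists instead of NumPy, and the 3x3 similarity matrix is made up for illustration: the middle sentence overlaps with both neighbours, while the outer two do not overlap with each other.

```python
def pagerank(sim, damping=0.85, max_iter=100, tol=1e-5):
    """Damped power iteration over a sentence-similarity matrix (list of lists)."""
    n = len(sim)
    # Row-normalise so each sentence spreads its score across its neighbours.
    norm = []
    for row in sim:
        total = sum(row) or 1.0
        norm.append([v / total for v in row])
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new = [
            (1 - damping) / n + damping * sum(norm[i][j] * scores[i] for i in range(n))
            for j in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores

# Hypothetical similarities between three sentences.
sim = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.5, 0.0],
]
scores = pagerank(sim)
print([round(s, 3) for s in scores])
```

The sentence that is similar to the most other sentences ends up with the highest score, which is exactly why TextRank tends to select "central" sentences for the summary.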

2. Evaluation metrics: a simplified use of ROUGE

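With the ready-made rouge library, a few lines of code cover the core evaluation. The code below follows the API of the rouge package on PyPI, which splits text on whitespace, so Chinese text is segmented with jieba first:

# Install: pip install rouge jieba
import jieba
from rouge import Rouge

def eval_summary(candidate, reference):
    """Simplified ROUGE evaluation that returns only the F1 values."""
    # Segment Chinese text into space-separated words before scoring
    candidate = ' '.join(jieba.lcut(candidate))
    reference = ' '.join(jieba.lcut(reference))
    rouge = Rouge()
    scores = rouge.get_scores(candidate, reference)[0]
    return {
        "ROUGE-1-F1": round(scores['rouge-1']['f'], 4),
        "ROUGE-2-F1": round(scores['rouge-2']['f'], 4),
        "ROUGE-L-F1": round(scores['rouge-l']['f'], 4)
    }

# Test
print(eval_summary(
    "NLP是AI重要分支,大模型近年突破",
    "自然语言处理是人工智能重要分支,近年大模型取得突破性进展"
))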
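
To see what the ROUGE-1 number actually measures, it can be computed by hand for pre-tokenized input. This is a minimal sketch of the unigram-overlap idea only; `rouge_1` is a hypothetical helper, and library implementations may differ in details such as tokenization and n-gram counting, so the values are not guaranteed to match a given package exactly.

```python
from collections import Counter

def rouge_1(candidate_tokens, reference_tokens):
    """Unigram-overlap precision, recall and F1 for pre-tokenised inputs."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    precision = overlap / max(len(candidate_tokens), 1)
    recall = overlap / max(len(reference_tokens), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

candidate = "large models made a breakthrough".split()
reference = "large models made a major breakthrough recently".split()
result = rouge_1(candidate, reference)
print(result)
```

Here all five candidate words appear in the seven-word reference, so precision is 1.0 while recall is 5/7: a short summary can be fully "correct" and still miss content, which is why F1 is usually the value reported.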

3. Generative approach: get T5 running in 5 minutes (open source and controllable)

T5 is a classic unified text-to-text framework. The summary style can be adjusted flexibly, and running it costs far less than calling GPT.

# Install: pip install transformers torch sentencepiece
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

class T5Summary:
    def __init__(self):
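        # t5-small is enough to get started; switch to t5-base for better quality.
        # Both are trained mainly on English text; for Chinese, use an mT5
        # checkpoint that has been fine-tuned for summarization.
        self.model_name = "t5-small"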
        self.tokenizer = T5Tokenizer.from_pretrained(self.model_name)
        self.model = T5ForConditionalGeneration.from_pretrained(self.model_name)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)

    def summarize(self, text, max_len=100, min_len=30):
        # T5 expects the task prefix "summarize: "
        input_text = "summarize: " + text
        inputs = self.tokenizer(
            input_text, return_tensors="pt", truncation=True, max_length=512
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_length=max_len,
                min_length=min_len,
                num_beams=4,               # beam search, more stable than random sampling
                no_repeat_ngram_size=2     # prevent repeated phrases
            )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)


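# Test (t5-small is an English model, so the sample input is given in English)
en_text = (
    "Natural language processing is an important branch of AI that studies "
    "how humans and computers interact through language. "
    "It covers tasks such as speech recognition, text classification, and machine translation. "
    "In recent years large models have made breakthroughs, with BERT and GPT as "
    "representatives, and ChatGPT performs well in dialogue."
)
print(T5Summary().summarize(en_text))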
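
The tokenizer call above truncates the input at 512 tokens, so anything beyond that point is silently ignored. One common workaround is to split a long document into sentence-aligned chunks, summarize each chunk, and join the results. The sketch below assumes a character budget (`max_chars`) as a rough proxy for the token limit; `chunk_sentences` and `summarize_long` are hypothetical helpers, and `summarize` can be any callable from string to string.

```python
import re

# A sentence is a run of non-terminator characters plus its closing punctuation
# and any whitespace that follows (Chinese and ASCII terminators).
SENTENCE = re.compile(r"[^.!?。!?]+[.!?。!?]*\s*")

def chunk_sentences(text, max_chars=1000):
    """Group whole sentences into chunks of at most `max_chars` characters.
    A single sentence longer than `max_chars` becomes its own chunk."""
    chunks, current = [], ""
    for sent in SENTENCE.findall(text):
        if current and len(current) + len(sent) > max_chars:
            chunks.append(current)
            current = ""
        current += sent
    if current:
        chunks.append(current)
    return chunks

def summarize_long(text, summarize, max_chars=1000):
    """Summarise each chunk with `summarize` and join the partial summaries."""
    return " ".join(summarize(chunk).strip() for chunk in chunk_sentences(text, max_chars))

doc = "One two three. Four five six. Seven eight nine."
chunks = chunk_sentences(doc, max_chars=32)
# Stand-in summarizer: keep the first sentence of each chunk.
result = summarize_long(doc, lambda chunk: chunk.split(".")[0], max_chars=32)
print(chunks)
print(result)
```

In practice the budget should be verified against the real tokenizer, since the number of tokens per character varies a lot between English and Chinese.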

Model fusion and deployment

Hybrid Strategy: Balancing Speed and Quality

The approach most recommended for production is to first use TextRank to extract the top-5 key sentences and then polish them with T5/mT5, which balances speed and quality.

class HybridSummary:
    def __init__(self):
        self.extract = TextRank(top_k=5)
        self.generate = T5Summary()

    def summarize(self, text, short_threshold=500):
        if len(text) < short_threshold:
            # Short text: extract directly
            return self.extract.summarize(text)
        else:
            # Long text: extract first, then polish with the generative model
            extracted = self.extract.summarize(text)
            return self.generate.summarize(extracted)
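
Because the hybrid class loads a real T5 model in its constructor, checking the routing logic alone is slow. One option, sketched below under the assumption that both stages can be passed in as plain callables, is a variant whose stages are injected. `RoutedSummary`, `fake_extract`, and `fake_generate` are hypothetical names used only for this illustration.

```python
class RoutedSummary:
    """Same routing idea as the hybrid class, with both stages passed in as callables."""

    def __init__(self, extract, generate, short_threshold=500):
        self.extract = extract
        self.generate = generate
        self.short_threshold = short_threshold

    def summarize(self, text):
        extracted = self.extract(text)
        if len(text) < self.short_threshold:
            return extracted             # short text: extraction only
        return self.generate(extracted)  # long text: extract, then rewrite

# Stand-in stages, so the routing can be checked without downloading a model.
calls = []

def fake_extract(text):
    calls.append("extract")
    return text[:20]

def fake_generate(text):
    calls.append("generate")
    return text.upper()

router = RoutedSummary(fake_extract, fake_generate, short_threshold=50)
short_result = router.summarize("a short note")
long_result = router.summarize("a much longer document " * 10)
print(calls)
```

With real components, `extract` could be `TextRank(top_k=5).summarize` and `generate` could be `T5Summary().summarize`.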

FastAPI one-click deployment

Wrap the hybrid model above in an API; combined with Docker, it can go online quickly:

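# app.py
from fastapi import FastAPI
from pydantic import BaseModel

# HybridSummary is the class defined above; define or import it in this module

app = FastAPI(title="Automatic Summary API", version="1.0")
hybrid = HybridSummary()

class SummaryReq(BaseModel):
    text: str
    short_threshold: int = 500

# A plain def (not async def) lets FastAPI run the blocking model call in its thread pool
@app.post("/summary")
def get_summary(req: SummaryReq):
    return {"summary": hybrid.summarize(req.text, req.short_threshold)}

# Start command (development; drop --reload in production):
# uvicorn app:app --host 0.0.0.0 --port 8000 --reload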
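
Once the service is running, it can be called with nothing but the standard library. The sketch below assumes the server listens on `http://localhost:8000`, as in the start command; `build_summary_request` and `fetch_summary` are hypothetical helper names, and only the request is built here so the example runs without a live server.

```python
import json
import urllib.request

def build_summary_request(text, short_threshold=500, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /summary endpoint."""
    payload = json.dumps({"text": text, "short_threshold": short_threshold}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/summary",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def fetch_summary(text, **kwargs):
    """Send the request and return the `summary` field of the JSON response."""
    with urllib.request.urlopen(build_summary_request(text, **kwargs), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))["summary"]

req = build_summary_request("Some long article text ...")
print(req.get_method(), req.full_url)
```

Calling `fetch_summary("...")` against a running instance returns the summary string; the JSON body mirrors the two fields of `SummaryReq`.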

Best Practices and Summary

Best practices for implementation in 2026

  1. Scenario layering: real-time processing of short texts → TextRank; long texts with high quality requirements → the hybrid model, moving on to a fine-tuned GPT model if needed.
  2. Dual-track evaluation: quantify with ROUGE metrics and also run small-batch manual spot checks; neither can be skipped.
  3. Term protection: important domain vocabulary can be forcibly retained through prompts or rules, so that it is not rewritten by mistake.
  4. Cache optimization: cache high-frequency requests such as popular news with an LRU cache to avoid repeated computation.
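
Two of the practices above, LRU caching and term protection, can be sketched with the standard library alone. `make_cached` and `missing_terms` are hypothetical helpers; the cache here is in-process only (a shared cache such as Redis would be a separate design choice), and the term check is a plain substring test used as a post-hoc guard rather than a way of forcing terms into the output.

```python
from functools import lru_cache

def make_cached(summarize, maxsize=1024):
    """Wrap any `summarize(text) -> str` callable in an in-process LRU cache."""
    @lru_cache(maxsize=maxsize)
    def cached(text):
        return summarize(text)
    return cached

def missing_terms(source, summary, protected_terms):
    """Return protected terms that occur in the source but were lost in the summary."""
    return [t for t in protected_terms if t in source and t not in summary]

calls = []

def slow_summarize(text):
    calls.append(text)   # record every real (uncached) call
    return text[:12]     # stand-in for an expensive model call

cached_summarize = make_cached(slow_summarize)
source = "BERT and T5 improve summaries"
first = cached_summarize(source)
second = cached_summarize(source)  # served from the cache
lost = missing_terms(source, first, ["BERT", "T5", "ROUGE", "summaries"])
print(len(calls), cached_summarize.cache_info().hits, lost)
```

When `missing_terms` returns a non-empty list, the service could fall back to the extractive summary or retry generation; which of these is appropriate depends on the business scenario.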

Summary

The core logic of automatic summarization has never changed: select the key content → organize it into coherent sentences → polish the wording until it reads smoothly. The 2026 tool chain is mature enough that newcomers can start with TextRank, advance to T5/mT5, and finally plug in a fine-tuned large model as business needs dictate, ending up with a practical enterprise-grade summarization service.


We recommend mastering TextRank and ROUGE evaluation first, running through the whole pipeline quickly, and then gradually adding the pre-trained models.
