道满PythonAI

---
title: 搜索功能实现
description: 基于 SQLAlchemy 模糊查询实现数据库搜索，以及集成全文检索提升搜索质量。
---

# 搜索功能实现：模糊查询与全文检索

> 📂 **实战归属**：道满博客·第四阶段（实战演练）  
> 🔗 **前置知识**：[SQLAlchemy ORM]() · [文章发布与 Markdown 支持]()

---

对于个人博客来说，**可检索性**是仅次于“发布流畅”的核心体验。写了干货文章却让用户找不到，投入的写作精力就打了折扣。今天我们从最基础的 `LIKE` 模糊查询出发，一步步加上分类/时间过滤、搜索结果高亮，最后再用 PostgreSQL 内置的全文检索实现升级，覆盖小到中型博客的搜索场景。

---

## 1. 基础方案：SQLAlchemy 的 LIKE/ilike 模糊查询

### 1.1 多字段文章搜索

博客搜索通常只需要匹配**已发布文章**的三个核心字段：标题、摘要和正文。原实现中使用 `ilike` 做不区分大小写的模糊匹配，这在中文场景下虽然意义不大，但能兼容英文关键词的习惯，保留它很合理。

下面这段代码完整展示了搜索路由，注意我们补上了蓝图和模板参数的说明，让逻辑更清晰：

```python
# app/articles/routes.py
from flask import request, render_template
from sqlalchemy import or_ # Used for multi-field "or" matching
from app import db
from app.articles.models import Post
from app.articles import articles_bp # Assume that the blueprint has been registered

@articles_bp.route("/search")
def search():
# Get keywords and page numbers (default page 1)
    query = request.args.get("q", "").strip()
    page = request.args.get("page", 1, type=int)

# Empty keywords directly return to the empty result page
    if not query:
        return render_template("articles/search.html", results=[], query="")

# Construct %xxx% pattern to match keywords at any position
    search_pattern = f"%{query}%"

# Query chain: only search published articles → multi-field "or" matching → reverse chronological order → paging
    pagination = (
        Post.query
.filter(Post.is_published.is_(True)) # It is recommended to use is_(True) instead of == True
        .filter(
            or_(
                Post.title.ilike(search_pattern),
                Post.summary.ilike(search_pattern),
                Post.content.ilike(search_pattern),
            )
        )
        .order_by(Post.created_at.desc())
.paginate(page=page, per_page=20, error_out=False) # error_out=False: Do not throw 404 when the page number is out of range
    )

    return render_template(
        "articles/search.html",
        results=pagination.items,
        query=query,
        pagination=pagination,
        total=pagination.total,
    )

💡 小细节：使用 filter(Post.is_published.is_(True)) 比 == True 更符合 SQLAlchemy 的规范，也能避免某些数据库布尔类型（比如 Postgres 的 BOOLEAN）在比较时发生隐式转换问题。

1.2 搜索结果高亮 + 摘要截取

直接展示整篇文章内容太臃肿，最好的做法是：优先使用文章已有的摘要，如果没有则截取正文中包含关键词的部分，并对关键词高亮。下面的高亮函数在截取时会围绕关键词前后取上下文，比直接截取前 200 字更能体现相关性。

# app/utils/search.py
import re

def highlight_excerpt(text: str, query: str, max_length: int = 300) -> str:
    """
Extract snippets from text that contain search terms and highlight them
:param text: original text (plain text is passed here, Markdown/HTML tags have been removed)
:param query: search keywords
:param max_length: Maximum truncation length when no match
:return: HTML fragment with <mark> highlighting
    """
    if not text or not query:
        return text[:max_length] if text else ""

#Search case-insensitively
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    match = pattern.search(text)

    if match:
# Keep the context of max_length//2 before and after the keyword
        context_length = max_length // 2
        start = max(0, match.start() - context_length)
        end = min(len(text), match.end() + context_length)
        excerpt = text[start:end]

# Fragment truncation tips
        if start > 0:
            excerpt = "..." + excerpt
        if end < len(text):
            excerpt += "..."
    else:
        excerpt = text[:max_length]
        if len(text) > max_length:
            excerpt += "..."

# Highlight: Keep the original case and wrap it with <mark> tag
    highlighted = pattern.sub(lambda m: f'<mark class="search-highlight">{m.group()}</mark>', excerpt)
    return highlighted

在模板里要记得为高亮添加样式，并且优先展示文章摘要：

<!-- templates/articles/search.html -->
<style>
.search-highlight {
background-color: #fff9c4; /* Light yellow background, eye-catching but not dazzling */
  padding: 0 2px;
  border-radius: 2px;
  font-weight: 500;
}
.search-result {
  margin-bottom: 2rem;
  padding-bottom: 1rem;
  border-bottom: 1px solid #eee;
}
</style>

<h2> Search results: "{{ query }}" ({{ total }} articles)</h2>

{% if total > 0 %}
  {% for post in results %}
  <article class="search-result">
    <h3>
      <a href="{{ url_for('articles.detail', post_id=post.id) }}">
        {{ post.title }}
      </a>
    </h3>
    <p>
      {% if post.summary %}
        {{ post.summary | truncate(200) }}
      {% else %}
        {{ post.content | striptags | highlight_excerpt(query) | safe }}
      {% endif %}
    </p>
    <small class="text-muted">
Posted in {{ post.created_at.strftime('%Y-%m-%d') }}
      {% if post.category %} · {{ post.category.name }} {% endif %}
    </small>
  </article>
  {% endfor %}

  {% if pagination.pages > 1 %}
    <nav aria-label="搜索结果分页">
      <!-- 分页组件，复用已有的分页模板 -->
    </nav>
  {% endif %}
{% else %}
  <div class="text-center py-5">
<h4> No related articles found 😔</h4>
<p> Try adjusting the keywords, or go to the <a href="{{ url_for('articles.index') }}"> homepage</a> to browse? </p>
  </div>
{% endif %}

⚠️ 安全提醒：模板里使用了 | safe 过滤器，因为 highlight_excerpt 返回的是包含 <mark> 标签的 HTML。不过这个函数中所有用户输入都已经过处理：查询关键词被 re.escape() 转义，原始文本也先用 striptags 去掉了 HTML 标签，因此不存在 XSS 风险。

2. 进阶过滤：分类与时间筛选

大多数博客不需要太复杂的布尔查询，但加上分类筛选和时间范围（近 7/30 天或指定起始日期）会非常实用。我们可以在原有搜索基础上补充这两个维度。

路由代码升级如下：

# app/articles/routes.py (replace the original search function, pay attention to introducing necessary models and modules)
from datetime import datetime, timedelta
from app.articles.models import Category # Assume there is already a Category model

@articles_bp.route("/search")
def search():
    query = request.args.get("q", "").strip()
    category_id = request.args.get("category", type=int)
    date_from = request.args.get("from", type=str)
    recent_days = request.args.get("recent", type=int)
    page = request.args.get("page", 1, type=int)

# Get the categories of all published articles (for drop-down box)
    categories = Category.query.join(Category.posts).filter(Post.is_published.is_(True)).distinct().all()

    base_query = Post.query.filter(Post.is_published.is_(True))

# Keyword filter
    if query:
        search_pattern = f"%{query}%"
        base_query = base_query.filter(
            or_(
                Post.title.ilike(search_pattern),
                Post.summary.ilike(search_pattern),
                Post.content.ilike(search_pattern),
            )
        )

# Classification filtering
    if category_id:
        base_query = base_query.filter_by(category_id=category_id)

# Time filtering: use the specified date first, followed by "near N days"
    if date_from:
        try:
            start_date = datetime.fromisoformat(date_from)
            base_query = base_query.filter(Post.created_at >= start_date)
        except ValueError:
pass # Invalid dates are ignored directly
    elif recent_days:
        start_date = datetime.utcnow() - timedelta(days=recent_days)
        base_query = base_query.filter(Post.created_at >= start_date)

    pagination = (
        base_query
        .order_by(Post.created_at.desc())
        .paginate(page=page, per_page=20, error_out=False)
    )

    return render_template(
        "articles/search.html",
        results=pagination.items,
        query=query,
        categories=categories,
        selected_category=category_id,
        selected_from=date_from,
        selected_recent=recent_days,
        pagination=pagination,
        total=pagination.total,
    )

在模板顶部添加一个筛选栏，使用 GET 表单，这样筛选条件可以随 URL 分享：

<!-- templates/articles/search.html（在搜索结果标题上方插入筛选栏） -->
<div class="search-filter card mb-4 p-3">
  <form method="GET" action="{{ url_for('articles.search') }}" class="row g-3">
    <div class="col-md-6">
      <input type="text" name="q" value="{{ query }}" placeholder="搜索文章..." class="form-control" required>
    </div>
    <div class="col-md-2">
      <select name="category" class="form-select">
<option value="">All categories</option>
        {% for cat in categories %}
        <option value="{{ cat.id }}" {% if selected_category == cat.id %}selected{% endif %}>{{ cat.name }}</option>
        {% endfor %}
      </select>
    </div>
    <div class="col-md-2">
      <select name="recent" class="form-select">
<option value="">No time limit</option>
<option value="7" {% if selected_recent == 7 %}selected{% endif %}>Last 7 days</option>
<option value="30" {% if selected_recent == 30 %}selected{% endif %}>Last 30 days</option>
      </select>
    </div>
    <div class="col-md-2">
<button type="submit" class="btn btn-primary w-100">Search</button>
    </div>
    <div class="col-12 mt-1">
<small class="text-muted"> or specify starting date: </small>
      <input type="date" name="from" value="{{ selected_from }}" class="form-control form-control-sm d-inline-block w-auto ms-1">
    </div>
  </form>
</div>

3. 优化方案：PostgreSQL 内置全文检索

LIKE/ilike 虽然简单，但有两个硬伤：

无法利用索引（除非只用 xxx% 前缀匹配，但博客通常需要任意位置匹配），文章超过 1 万篇后速度会明显下降；
没有分词能力，比如搜索“道满博客”，匹配不到“道满的博客”。

如果你的博客运行在 PostgreSQL 上（个人博客强烈推荐，免费功能又全），就可以使用它的内置全文检索解决这两个问题。

3.1 基础版：使用 `chinese` 配置

PostgreSQL 从 12 版本开始内置了中文分词配置（虽然分词质量一般，但能用）。用法如下：

# Replace the keyword filtering part in the search route
from sqlalchemy import func

@articles_bp.route("/search")
def search():
#...The previous parameter acquisition, time/category filtering, and paging logic remain unchanged...

    if query:
#Build a search vector: combine the title and text into a Chinese search vector (|| is the vector merging symbol)
        search_vector = (
            func.to_tsvector("chinese", Post.title)
            .op("||")(func.to_tsvector("chinese", Post.content))
        )
# Build query: plainto_tsquery will automatically convert the spaces entered by the user into "and" logic
        ts_query = func.plainto_tsquery("chinese", query)
# Filter and order descending by relevance
        base_query = (
            base_query
.filter(search_vector.op("@@")(ts_query)) # @@ is the full-text matching operator
            .order_by(func.ts_rank(search_vector, ts_query).desc())
        )

# ... paging and rendering ...

3.2 生产环境优化建议

① 添加 GIN 索引，查询速度提升几十倍

在 Post 模型中添加一个计算列存储搜索向量，并为它创建 GIN 索引：

# app/articles/models.py
from app import db

class Post(db.Model):
# ...original fields ...
#Add search vector column (PostgreSQL 12+ supports calculated columns)
    search_vector = db.Column(
        db.TSVECTOR,
        db.Computed(
            "to_tsvector('chinese', coalesce(title, '') || ' ' || coalesce(content, ''))",
            persisted=True,
        ),
    )
#Create GIN index
    __table_args__ = (db.Index("idx_post_search_vector", search_vector, postgresql_using="gin"),)

⚠️ 计算列在 SQLite 中不支持。如果开发环境使用 SQLite，可以将这段代码脱敏注释掉，生产环境再启用。

② 给不同字段设置权重

标题的相关性应该比正文更高。PostgreSQL 的 setweight 函数可以做到：

# Modify the definition of computed column
db.Computed(
    "setweight(to_tsvector('chinese', coalesce(title, '')), 'A') || "
    "setweight(to_tsvector('chinese', coalesce(summary, '')), 'B') || "
    "setweight(to_tsvector('chinese', coalesce(content, '')), 'D')",
    persisted=True,
)

权重由高到低为 A > B > C > D，ts_rank 会自动计算加权分数，让标题更匹配的文章排在前列。

③ 使用 jieba 分词改进中文处理

PostgreSQL 内置的中文分词比较粗糙，比如“人工智能”会被拆成“人”“工”“智”“能”。如果需要更好的分词效果，可以考虑：

在插入/更新文章时，使用 Python 的 jieba 分词生成 tsvector，再存入数据库；
或者安装 PostgreSQL 插件 zhparser，它能借助 SCWS 分词引擎提供更准确的中文分词。

这两条路的投入产出根据你的博客规模选择即可。

4. 方案总结与选型建议

通过以上三种方案，你可以根据博客的规模灵活选择：

方案	优点	缺点	适用场景
SQLAlchemy `ilike` 模糊查询	实现简单、兼容所有数据库	无索引、无分词、数据量大于 1 万会明显变慢	新博客，文章数 < 5000
PostgreSQL 内置全文检索 + GIN 索引	无需额外服务、支持中文、速度快、自带相关性排序	必须使用 PostgreSQL，内置中文分词质量一般	中型博客，文章数 < 10 万
Elasticsearch / Whoosh	分词强大、支持复杂查询、可横向扩展	需要额外部署服务（Whoosh 为纯 Python 方案，性能不及 Elasticsearch），开发成本较高	大型网站，需要复杂查询或文章 > 10 万

所有代码示例都基于你熟悉的 Flask + SQLAlchemy 技术栈，可以直接迁移到实际项目中。

🔗 扩展阅读

#1.2 搜索结果高亮 + 摘要截取

#2. 进阶过滤：分类与时间筛选

#3. 优化方案：PostgreSQL 内置全文检索

#3.1 基础版：使用 chinese 配置

#3.2 生产环境优化建议