title: Practical project three: Semantic search and question answering system | Daoman PythonAI
description: Based on vector retrieval and RAG architecture, an enterprise FAQ semantic search question and answer system is built to support natural language questions and accurate answer matching. This tutorial introduces in detail the complete implementation of core technologies such as semantic search, vector database, and RAG question and answer system.
keywords: [Semantic search, question answering system, RAG, vector retrieval, Sentence-BERT, FAISS, vector database, knowledge base question answering, natural language processing, semantic understanding]

Practical project three: Semantic search and question answering system


Project Overview

In scenarios such as intelligent customer service and knowledge-base question answering, semantic search and question answering systems are rapidly replacing traditional keyword matching. With vector retrieval, the system can match queries that have similar meanings but different wording, which improves both the hit rate and the user experience.

This tutorial focuses on a lightweight enterprise FAQ question answering scenario and walks you step by step through building a semantic Q&A system that works out of the box. The project goals are:

  • ✅ Supports natural-language questions in Chinese
  • ✅ Retrieval accuracy above 85%
  • ✅ Single-query response time under 300 milliseconds
  • ✅ No GPU required; deployable on an ordinary CPU server
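
The accuracy and latency goals above can only be confirmed by measuring them. The following is a minimal sketch of one way to do that, assuming a small hand-labeled test set of (query, expected FAQ id) pairs. The evaluate helper, the fake_search stand-in, and the two test queries are hypothetical and are not part of the original project.

```python
import time


def evaluate(search_fn, test_cases):
    """Measure top-1 hit rate and worst-case latency for a search function.

    search_fn(query) must return a list of dicts with an "id" key, best first.
    test_cases is a list of (query, expected_faq_id) pairs.
    """
    hits = 0
    latencies_ms = []
    for query, expected_id in test_cases:
        start = time.perf_counter()
        results = search_fn(query)
        latencies_ms.append((time.perf_counter() - start) * 1000)
        if results and results[0]["id"] == expected_id:
            hits += 1
    return {
        "top1_accuracy": hits / len(test_cases) if test_cases else 0.0,
        "max_latency_ms": max(latencies_ms, default=0.0),
    }


# Stand-in search function so the sketch runs on its own (hypothetical).
def fake_search(query):
    return [{"id": "faq_001"}] if "密码" in query else []


report = evaluate(fake_search, [("密码忘记了怎么办?", "faq_001"), ("如何开发票?", "faq_002")])
print(report)
```

In the real system, the retriever's search method built later in this tutorial would take the place of fake_search, and the test set should come from real user questions.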

Core architecture and technology stack

Streamlined architecture

The entire system consists of four layers with clear responsibilities:

┌─────────────┐
│  Streamlit  │ ← User interaction
└──────┬──────┘
       │
┌──────▼──────┐
│   FastAPI   │ ← API gateway, authentication
└──────┬──────┘
       │
┌──────▼───────────────────────────┐
│  Retrieval-augmented generation  │
│  (RAG) engine                    │
│  ┌───────────┐   ┌────────────┐  │
│  │ Retrieval │ → │ Generation │  │
│  └───────────┘   └────────────┘  │
└──────┬───────────────────────────┘
       │
┌──────▼────────────────────────┐
│  Vectorization + FAISS index  │ ← Storage layer
└───────────────────────────────┘

  • User interaction layer: built with Streamlit, so you can create a clean interface without front-end experience.
  • API gateway layer: FastAPI provides high-performance RESTful interfaces with automatically generated documentation.
  • RAG engine layer: retrieves first and then generates, using the retrieved knowledge to ground the answer.
  • Storage layer: a FAISS-based vector index that holds the semantic vectors of all FAQs.

Selection criteria and recommendations

This project favors free, open-source components that are lightweight and easy to deploy. The components are selected as follows:

| Module | Recommended solution | Description |
| --- | --- | --- |
| Embedding model | paraphrase-multilingual-MiniLM-L12-v2 | Lightweight multilingual model, 384 dimensions; Chinese quality is moderate but sufficient for FAQ matching; loads quickly and runs on CPU or GPU |
| Vector index | FAISS IndexFlatIP (after normalization) | Exact inner-product retrieval, suitable for FAQ sets of up to about 100,000 entries, with controllable memory usage |
| Backend framework | FastAPI + Uvicorn | Asynchronous and high-performance, generates API documentation automatically, easy to debug |
| Front-end framework | Streamlit | An interactive interface in about 10 minutes; friendly to data scientists |
| Generation fallback | Rule matching based on search results | Runs without any external API; private, secure, and zero cost |

IndexFlatIP is chosen over IndexFlatL2 because, after vector normalization, the inner product equals cosine similarity, so the returned scores can be read and thresholded directly as similarities.
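
The equivalence between inner product and cosine similarity can be checked numerically. The sketch below uses random vectors as stand-ins for real embeddings (an assumption made only so the example runs without the model). It also shows that for unit vectors the squared L2 distance is a fixed function of the inner product, so both metrics rank neighbors identically.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (a small epsilon guards against zero rows)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / (norms + 1e-8)


rng = np.random.default_rng(0)
a = rng.normal(size=(1, 384))  # stand-in for a query embedding
b = rng.normal(size=(1, 384))  # stand-in for an FAQ embedding

# Cosine similarity computed from the raw vectors.
cosine = float((a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inner product of the normalized vectors.
a_n, b_n = l2_normalize(a), l2_normalize(b)
inner = float((a_n @ b_n.T).item())

# For unit vectors: ||a - b||^2 = 2 - 2 * (a . b).
sq_l2 = float(np.sum((a_n - b_n) ** 2))

print(cosine, inner, sq_l2)
```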


Data and vectorized indexing

1. Data preparation

In its simplest form, FAQ data consists of question-answer pairs. To improve retrieval quality, we add a search_text field that concatenates the original question, common paraphrases, keywords, and other information.

Example data structure:

[
  {
    "id": "faq_001",
    "question": "密码忘记了怎么办?",
    "answer": "请访问登录页面→点击'忘记密码'→输入注册邮箱→查收重置链接;或联系客服人工处理。",
    "category": "账户",
    "search_text": "密码忘记了怎么办?忘记密码怎么找回?密码丢失如何重置?"
  }
]

Tip: the quality of search_text directly affects retrieval recall. It is recommended to regularly add high-frequency paraphrases based on real user logs.
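
One way to assemble search_text during preprocessing is sketched below. The similar_questions and keywords fields and the build_search_text helper are hypothetical names introduced for illustration; the original data structure only shows the finished search_text string. A space is used as the separator here, which is also an assumption.

```python
def build_search_text(faq: dict) -> str:
    """Concatenate the question, its paraphrases, and keywords into one string."""
    parts = [faq["question"]]
    parts.extend(faq.get("similar_questions", []))  # hypothetical field
    parts.extend(faq.get("keywords", []))           # hypothetical field

    # Drop empty entries and duplicates while preserving order.
    seen = set()
    unique = []
    for part in parts:
        part = part.strip()
        if part and part not in seen:
            seen.add(part)
            unique.append(part)
    return " ".join(unique)


raw_faq = {
    "id": "faq_001",
    "question": "密码忘记了怎么办?",
    "similar_questions": ["忘记密码怎么找回?", "密码丢失如何重置?"],
    "keywords": ["密码", "重置"],
}
raw_faq["search_text"] = build_search_text(raw_faq)
print(raw_faq["search_text"])
```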

2. Vectorization and index construction

The following code completes model loading, batch encoding, normalization, and construction and persistence of the FAISS index.

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import pickle

# ----------------------
# 1. Initialize the embedding model
# ----------------------
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# ----------------------
# 2. Batch-encode the FAQs
# ----------------------
# Assumes processed_faqs is the preprocessed list of FAQ dicts shown above
search_texts = [faq["search_text"] for faq in processed_faqs]
embeddings = embedder.encode(search_texts, convert_to_numpy=True, show_progress_bar=True)

# ----------------------
# 3. Normalize and build the FAISS index
# ----------------------
# After L2 normalization, the inner product equals cosine similarity
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = embeddings / (norms + 1e-8)

dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)   # exact inner-product index
index.add(embeddings.astype("float32"))

# ----------------------
# 4. Save the index and the data
# ----------------------
faiss.write_index(index, "faq_index.faiss")
with open("faq_data.pkl", "wb") as f:
    pickle.dump(processed_faqs, f)

After running, you will get two files locally:

  • faq_index.faiss: vector index
  • faq_data.pkl: original FAQ data

When retrieving later, just load these two files.


Semantic retrieval and RAG implementation

1. Semantic retrieval module

We encapsulate the retrieval logic into a class to facilitate subsequent calls.

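import pickle

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer


class FAQRetriever:
    def __init__(self, index_path="faq_index.faiss", data_path="faq_data.pkl",
                 model_name="paraphrase-multilingual-MiniLM-L12-v2"):
        # Must be the same model that was used to build the index
        self.embedder = SentenceTransformer(model_name)
        self.index = faiss.read_index(index_path)
        with open(data_path, "rb") as f:
            self.faqs = pickle.load(f)

    def search(self, query: str, top_k: int = 3, threshold: float = 0.3):
        # Encode the query and L2-normalize it, matching the indexed vectors
        query_emb = self.embedder.encode([query], convert_to_numpy=True)
        query_emb = query_emb / (np.linalg.norm(query_emb, axis=1, keepdims=True) + 1e-8)

        # Retrieve the top_k nearest neighbors by inner product
        scores, indices = self.index.search(query_emb.astype("float32"), top_k)

        # Drop low-similarity hits so the system does not force an answer;
        # FAISS pads missing results with index -1
        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx >= 0 and score >= threshold:
                results.append({**self.faqs[int(idx)], "score": float(score)})
        return results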

threshold is the similarity threshold. A starting value of 0.3 is recommended, and it can be tuned later based on observed results. Results scoring below it are considered too far from the query semantically and are discarded.
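
Tuning the threshold is easier with a small labeled sample. The sketch below assumes you have collected, for a set of test queries, the top-1 similarity score and whether that top hit was actually the right FAQ. The sweep_thresholds helper is hypothetical, and the scores are made up purely for illustration; they are not measurements from this project.

```python
def sweep_thresholds(scored, thresholds):
    """Report precision and recall of the kept top-1 hits at each threshold.

    scored is a list of (score, is_relevant) pairs, one per test query.
    """
    total_relevant = sum(1 for _, ok in scored if ok)
    rows = []
    for t in thresholds:
        kept = [(s, ok) for s, ok in scored if s >= t]
        true_pos = sum(1 for _, ok in kept if ok)
        # With nothing kept there are no wrong answers; treat precision as 1.0.
        precision = true_pos / len(kept) if kept else 1.0
        recall = true_pos / total_relevant if total_relevant else 0.0
        rows.append({"threshold": t, "precision": precision, "recall": recall})
    return rows


# Made-up scores for illustration only; replace with scores from real query logs.
labeled = [(0.82, True), (0.64, True), (0.41, True),
           (0.35, False), (0.28, False), (0.22, True)]

for row in sweep_thresholds(labeled, [0.2, 0.3, 0.4, 0.5]):
    print(row)
```

A higher threshold generally trades recall for precision: fewer wrong answers are returned, but more questions fall through to the "not found" reply.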

2. Lightweight RAG engine

Once the relevant content has been retrieved, how is the final answer generated? Two modes are provided here:

  • Rule mode (default): directly returns the standard answer of the most similar FAQ; fully offline and zero cost.
  • LLM-enhanced mode: passes the search results as context to a large language model to generate a more natural answer; the results are better, but it depends on an external API.

class LightRAG:
    def __init__(self, retriever: FAQRetriever, llm_client=None):
        self.retriever = retriever
        self.llm = llm_client  # e.g. OpenAI() or a local LLM client

    def query(self, user_question: str, use_llm: bool = False):
        # 1. Retrieve
        docs = self.retriever.search(user_question)
        if not docs:
            # "Sorry, no relevant information was found; please rephrase or contact support."
            return "抱歉,未找到相关信息,请尝试重新表述问题或联系客服。"

        # 2. Build the context from the retrieved question-answer pairs
        context = "\n\n".join([
            f"【问题】{d['question']}\n【答案】{d['answer']}"
            for d in docs
        ])

        # 3. Generate the answer
        if use_llm and self.llm:
            # Prompt (in Chinese): answer strictly from the knowledge base,
            # do not make things up, stay within 200 characters
            prompt = f"""
            你是一个智能客服助手。请根据以下知识库信息准确回答用户问题,不要编造内容,控制在200字以内。
            知识库:
            {context}
            用户问题:{user_question}
            """
            try:
                resp = self.llm.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=300,
                    temperature=0.3
                )
                return resp.choices[0].message.content
            except Exception as e:
                # On failure, log and fall through to the rule-based answer
                print(f"LLM调用失败:{e}")

        # Fallback without an LLM: return the answer of the top-ranked FAQ
        return f"根据知识库信息:{docs[0]['answer']}"

In a production environment, you can make threshold and use_llm configurable and switch modes dynamically.
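
A minimal sketch of such configuration follows, assuming the settings come from environment variables. The RAGSettings class, the load_settings helper, and the variable names FAQ_THRESHOLD, FAQ_USE_LLM, and FAQ_TOP_K are all hypothetical; any configuration mechanism would work equally well.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RAGSettings:
    threshold: float = 0.3
    use_llm: bool = False
    top_k: int = 3


def load_settings(env=None) -> RAGSettings:
    """Read settings from environment variables, falling back to the defaults."""
    env = os.environ if env is None else env
    return RAGSettings(
        threshold=float(env.get("FAQ_THRESHOLD", "0.3")),
        use_llm=env.get("FAQ_USE_LLM", "false").strip().lower() in {"1", "true", "yes"},
        top_k=int(env.get("FAQ_TOP_K", "3")),
    )


# A plain dict stands in for os.environ so the example is reproducible.
settings = load_settings({"FAQ_THRESHOLD": "0.45", "FAQ_USE_LLM": "true"})
print(settings)
```

The loaded values can then be passed along at call time, for example retriever.search(question, settings.top_k, settings.threshold) and rag.query(question, settings.use_llm).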


Rapid deployment solution

1. Backend FastAPI core

The backend keeps only the core functionality and omits logging, authentication, and caching code, which makes it easy to get started quickly.

from fastapi import FastAPI
from pydantic import BaseModel
from typing import Optional, List

# FAQRetriever and LightRAG are the classes defined in the previous section;
# define them in this file or import them from wherever you saved them.

app = FastAPI(title="FAQ智能问答API")

# Initialize global objects once (typically loaded when the app starts)
retriever = FAQRetriever()
rag = LightRAG(retriever)

class QueryRequest(BaseModel):
    question: str
    use_llm: bool = False
    top_k: int = 3  # 0 means "do not return reference documents"

class QueryResponse(BaseModel):
    answer: str
    docs: Optional[List[dict]] = None

@app.post("/api/query", response_model=QueryResponse)
async def query_endpoint(req: QueryRequest):
    # FAISS expects k > 0, so skip the reference search when top_k is 0
    docs = retriever.search(req.question, req.top_k) if req.top_k > 0 else None
    answer = rag.query(req.question, req.use_llm)
    return QueryResponse(answer=answer, docs=docs)

Start command:

uvicorn backend:app --host 0.0.0.0 --port 8000

Visit http://localhost:8000/docs to see the automatically generated Swagger documentation and test the endpoint directly in the browser.

2. Front-end Streamlit core

The front-end code is minimal but fully functional: enter a question, toggle LLM enhancements, view reference sources.

import streamlit as st
import requests

# ----------------------
# Page configuration
# ----------------------
st.set_page_config(page_title="FAQ智能问答", page_icon="🤖")
st.title("🤖 FAQ 智能问答助手")

# ----------------------
# Sidebar settings
# ----------------------
with st.sidebar:
    api_url = st.text_input("后端API", "http://localhost:8000/api/query")
    use_llm = st.checkbox("使用LLM增强(可选)", value=False)
    show_docs = st.checkbox("显示参考来源", value=True)

# ----------------------
# Main interaction
# ----------------------
user_q = st.text_input("请输入您的问题:")
if st.button("🔍 提问", type="primary") and user_q:
    with st.spinner("思考中..."):
        try:
            resp = requests.post(
                api_url,
                json={"question": user_q, "use_llm": use_llm, "top_k": 3 if show_docs else 0},
                timeout=30,  # avoid hanging indefinitely if the backend is down
            )
            resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
            resp_data = resp.json()
            
            st.success(resp_data["answer"])
            
            if show_docs and resp_data.get("docs"):
                with st.expander("📋 查看参考来源"):
                    for i, d in enumerate(resp_data["docs"], 1):
                        st.markdown(f"**来源{i}(相似度:{d['score']:.2f})**")
                        st.write(f"问题:{d['question']}")
                        st.write(f"答案:{d['answer']}")
                        st.divider()
        except Exception as e:
            st.error(f"请求失败:{e}")

Start the frontend:

streamlit run frontend.py

3. One-click start command

# Install backend dependencies first
pip install fastapi uvicorn sentence-transformers faiss-cpu numpy

# Terminal 1: start the backend
uvicorn backend:app --host 0.0.0.0 --port 8000

# Terminal 2: install frontend dependencies
pip install streamlit requests

# Start the frontend
streamlit run frontend.py

Open your browser and you can ask questions to the FAQ knowledge base using natural language in the Streamlit interface.


Summary of optimization points

  1. Vectorization Optimization
  • Merge similar questions, categories, and tags into the search_text field to enrich semantic information.
  • Prefer batch encoding over encoding items one at a time.
  • If you have a GPU, pass device="cuda" when initializing the model; encoding can be several times faster.
  2. Search Optimization
  • Below roughly 100,000 entries, IndexFlatIP is both simple and exact.
  • For larger datasets, switch to IndexIVFFlat, which trades a small amount of accuracy for speed.
  • A reasonable similarity threshold (e.g. 0.3) filters out irrelevant results.
  • Introduce hybrid retrieval when necessary: combine semantic similarity with simple exact keyword matching to improve recall on long-tail questions.
  3. Performance Optimization
  • Keep a local LRU cache of embedding vectors and final results for high-frequency queries to avoid repeated computation.
  • When deploying multiple instances, use Redis as a centralized cache.
  • In production, run Uvicorn with multiple workers (for example, --workers 4) or pair it with Gunicorn to improve concurrency.
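
The hybrid retrieval idea from the search optimization list can be sketched as a re-ranking step over the semantic hits. This is an illustrative simplification, not the project's implementation: the keywords field, the keyword_score and hybrid_rerank helpers, the weight alpha, and the sample candidates are all hypothetical. Substring matching is used because it works on Chinese text without a tokenizer.

```python
def keyword_score(query: str, keywords: list) -> float:
    """Fraction of an FAQ's keywords that appear verbatim in the query."""
    if not keywords:
        return 0.0
    hits = sum(1 for kw in keywords if kw and kw in query)
    return hits / len(keywords)


def hybrid_rerank(query: str, candidates: list, alpha: float = 0.8) -> list:
    """Re-rank semantic hits by a weighted sum of semantic and keyword scores.

    Each candidate is a dict with a "score" (cosine similarity from the
    retriever) and an optional "keywords" list (a hypothetical field).
    """
    reranked = []
    for cand in candidates:
        kw = keyword_score(query, cand.get("keywords", []))
        combined = alpha * cand["score"] + (1 - alpha) * kw
        reranked.append({**cand, "keyword_score": kw, "hybrid_score": combined})
    return sorted(reranked, key=lambda c: c["hybrid_score"], reverse=True)


# Made-up candidates for illustration only.
candidates = [
    {"id": "faq_002", "score": 0.62, "keywords": ["发票", "报销"]},
    {"id": "faq_001", "score": 0.58, "keywords": ["密码", "重置"]},
]
ranked = hybrid_rerank("怎么重置密码", candidates)
print([c["id"] for c in ranked])
```

Because this version only re-orders candidates that semantic search already returned, a fuller hybrid setup would also add FAQs found by keyword match alone, which is where the recall gain on long-tail questions comes from.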

From data processing, vectorization, and index construction to retrieval, question answering, and front-end and back-end deployment, this project covers a complete minimum viable product for a semantic question answering system. You can use it as a starting point for enterprise FAQ, product consultation, online course Q&A, and similar scenarios, then add features such as intent recognition, multi-turn dialogue, and automatic knowledge base updates as your needs grow.