Personal knowledge base assistant project

1. Introduction

1. Project background introduction

In a contemporary society where data volumes are exploding, effectively managing and retrieving information has become a critical skill. To address this challenge, this project came into being, aiming to build a personal knowledge base assistant based on Langchain. The assistant provides users with a reliable information acquisition platform through an efficient information management system and powerful retrieval functions.

The core goal of this project is to make full use of the strengths of large language models in handling natural language queries, while customizing the system to user needs so that it can understand complex information and respond to it accurately. During development, the team analyzed the potential and limitations of large language models in depth, in particular their tendency to generate hallucinated information. To address this, the project integrates RAG (retrieval-augmented generation), which retrieves relevant information from a large body of data before generating an answer, thereby improving the accuracy and reliability of the answer.

By introducing RAG, this project not only improves the accuracy of information retrieval but also effectively suppresses misleading information that the large language model might otherwise generate. Combining retrieval with generation keeps the assistant's answers accurate and grounded in the knowledge base, making it a useful helper when users face massive amounts of data.

2. Goal and significance

This project is committed to developing an efficient and intelligent personal knowledge base system, aiming to optimize the user's knowledge acquisition process in the flood of information. By integrating Langchain and natural language processing technologies, the system enables rapid access and integration of dispersed data sources, enabling users to efficiently retrieve and utilize information through intuitive natural language interactions.

The core value of the project is reflected in the following aspects:

  1. Optimize information retrieval efficiency: Using a LangChain-based framework, the system retrieves relevant information from a wide range of data sets before generating answers, which speeds up locating and extracting information.

  2. Strengthen knowledge organization and management: Users can build personalized knowledge bases. Structured storage and classification promote the accumulation and effective management of knowledge, and thereby improve users' mastery and application of professional knowledge.

  3. Assisted decision-making: By providing and analyzing accurate information, the system improves the user's ability to make decisions in complex situations, especially when rapid judgment and response are required.

  4. Personalized information service: Users can customize the knowledge base according to their specific needs, so that retrieval and services are personalized and users obtain the most relevant and valuable knowledge.

  5. Technological innovation demonstration: The project demonstrates the advantages of RAG in mitigating the hallucination problem of large language models. By combining retrieval and generation, it improves the accuracy and reliability of information and offers new ideas for intelligent information management.

  6. Promote intelligent assistant applications: Through user-friendly interface design and convenient deployment options, the project makes intelligent assistant technology easier to understand and use, promoting its adoption in a wider range of fields.

3. Main functions

This project can implement knowledge Q&A based on the README of existing projects in Datawhale, allowing users to quickly understand the status of existing projects in Datawhale.

Project start interface

Question and answer demonstration interface

Example demonstration interface

  1. Introduction to joyrl (query: 介绍下joyrl)
  2. Relationship between joyrl-book and joyrl (query: joyrl-book与joyrl是什么关系)

2. Technical implementation

1. Environmental dependence

1.1 Technical resource requirements

  • CPU: Intel 5th generation processor (for cloud CPU, it is recommended to choose a cloud CPU service with more than 2 cores)

  • Memory (RAM): At least 4 GB

  • Operating system: Windows, macOS, Linux are all available

1.2 Project settings

Clone repository

git clone https://github.com/logan-zou/Chat_with_Datawhale_langchain.git
cd Chat_with_Datawhale_langchain

Create Conda environment and install dependencies

  • python>=3.9
  • pytorch>=2.0.0
# Create the Conda environment
conda create -n llm-universe python==3.9.0
# Activate the Conda environment
conda activate llm-universe
# Install dependencies
pip install -r requirements.txt

1.3 Project operation

  • Start the service as a local API
# Linux
cd project/serve
uvicorn api:app --reload 
# Windows
cd project/serve
python api.py
  • Run the project
cd llm-universe/project/serve
python run_gradio.py -model_name='chatglm_std' -embedding_model='m3e' -db_path='../../data_base/knowledge_db' -persist_path='../../data_base/vector_db'

2. Brief description of development process

2.1 Current project version and future plans

  • Current version: 0.2.0 (updated on 2024.3.17)

  • UPDATE CONTENT

  • [√] Added m3e embedding

  • [√] Added new knowledge base content

  • [√] Added summary of all MDs of Datawhale

  • [√] Fixed Gradio display error

  • Currently supported models

    • OpenAi
      • [√] gpt-3.5-turbo
      • [√] gpt-3.5-turbo-16k-0613
      • [√] gpt-3.5-turbo-0613
      • [√] gpt-4
      • [√] gpt-4-32k
    • Wenxin Yiyan (ERNIE Bot)
      • [√] ERNIE-Bot
      • [√] ERNIE-Bot-4
      • [√] ERNIE-Bot-turbo
    • iFlytek Spark
      • [√] Spark-1.5
      • [√] Spark-2.0
    • Zhipu AI
      • [√] chatglm_pro
      • [√] chatglm_std
      • [√] chatglm_lite

  • Future Planning

  • Update Ai embedding

2.2 Core Idea

The core idea is to wrap the APIs of four large model providers behind a common low-level interface, build a retrieval question and answer chain on LangChain that can switch between models, and deploy the result as a lightweight personal large model application through an API and a Gradio interface.

2.3 Technology stack used

This project is a personal knowledge base assistant based on a large model, built on the LangChain framework. The core technologies include LLM API calls, vector databases, search question and answer chains, etc. The overall structure of the project is as follows:

Overall architecture

As above, this project is divided into LLM layer, data layer, database layer, application layer and service layer from bottom to top.

① The LLM layer mainly encapsulates LLM calls based on four popular LLM APIs, allowing users to access different models through a unified entrance and method, and supporting model switching at any time;

② The data layer mainly includes the source data of the personal knowledge base and the Embedding API. The source data can be used by the vector database after Embedding processing;

③ The database layer is mainly a vector database built based on the source data of the personal knowledge base. In this project, we chose Chroma;

④ The application layer is the top-level encapsulation of the core functions. We further encapsulate it based on the retrieval question and answer chain base class provided by LangChain, thereby supporting different model switching and convenient implementation of database-based retrieval and question answering;

⑤ The top layer is the service layer. We have implemented two methods: Gradio to build Demo and FastAPI to build API to support service access of this project.

3. Application details

1. Core architecture

llm-universe personal knowledge base assistant address:

https://github.com/datawhalechina/llm-universe/tree/main

This project is a typical RAG project. It uses langchain+LLM to realize local knowledge base question and answer, and establishes a local knowledge base dialogue application that can be implemented using open source models throughout the process. Currently, it supports access to ChatGPT, spark model, Wenxin large model, GLM and other large language models. The implementation principle of this project is the same as that of the general RAG project, as shown in the previous article and the figure below:

The entire RAG process includes the following operations:

  1. Users ask questions Query

  2. Load and read knowledge base documents

  3. Segment the knowledge base documents

  4. Vectorize the segmented knowledge base text and store it in the vector base to create an index.

  5. Query vectorization

  6. Match the top k most similar to the question Query vector in the knowledge base document vector.

  7. The matched knowledge base text is added to the prompt together with the question as context.

  8. Submit to LLM to generate answer Answer

It can be roughly divided into three stages: indexing, retrieval and generation. These three stages will be dismantled in the following sections in conjunction with the llm-universe knowledge base assistant project.
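
The eight steps above can be sketched end to end in a few lines. This is only a toy illustration under stated assumptions: word overlap stands in for embedding similarity, the knowledge base is three hard-coded sentences, and no real LLM is called. The function names (split_into_chunks, vectorize, retrieve, build_prompt) are hypothetical and do not come from the llm-universe code.

```python
# Toy walk-through of the three RAG stages: indexing, retrieval, generation.
# Word overlap stands in for embedding similarity; no LLM is called.

def split_into_chunks(document: str) -> list[str]:
    """Steps 2-3: read a document and split it (here: one chunk per line)."""
    return [line.strip() for line in document.splitlines() if line.strip()]

def vectorize(text: str) -> set[str]:
    """Steps 4-5: stand-in 'vector', the set of lowercase words."""
    return set(text.lower().split())

def retrieve(query: str, index: list[tuple[str, set[str]]], k: int = 2) -> list[str]:
    """Step 6: rank chunks by word overlap with the query and keep the top k."""
    query_words = vectorize(query)
    ranked = sorted(index, key=lambda item: len(query_words & item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query: str, context: list[str]) -> str:
    """Step 7: put the retrieved chunks and the question into one prompt."""
    return "Context:\n" + "\n".join(context) + f"\nQuestion: {query}\nAnswer:"

knowledge_base = """Chroma is a vector database.
Gradio builds demo interfaces.
FastAPI builds web APIs."""

# Indexing stage: split the documents and store each chunk with its "vector".
index = [(chunk, vectorize(chunk)) for chunk in split_into_chunks(knowledge_base)]

# Retrieval stage: step 1 is the user query, step 6 finds the closest chunk.
query = "which vector database is used"
top_chunks = retrieve(query, index, k=1)

# Generation stage: step 8 would send this prompt to an LLM.
prompt = build_prompt(query, top_chunks)
print(prompt)
```

In the real project, each stand-in is replaced by a LangChain component: a document loader and text splitter for indexing, an embedding model plus Chroma for retrieval, and an LLM call for generation.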

2. Index-indexing

This section describes how the llm-universe personal knowledge base assistant creates its knowledge base: loading files, reading files, splitting text (text splitter), vectorizing the knowledge base text (embedding), and storing the vectors in a vector database.

Among these steps:

  • Load file: reads the knowledge base files stored locally.
  • Read file: reads the contents of the loaded files, usually converting them into text.
  • Text splitter: splits the text according to certain rules (such as paragraphs, sentences, or words).
  • Text vectorization: usually involves NLP feature extraction. This project can use the local m3e text embedding model or the OpenAI and ZhipuAI APIs to convert the split text into numerical vectors and store them in the vector database.

2.1 Knowledge base construction-loading and reading

The project llm-universe personal knowledge base assistant uses some classic open source courses and videos (parts) of Datawhale as examples, including:

These knowledge base source data are placed in the ../../data_base/knowledge_db directory. Users can also store their own other files.

1. First, how to obtain the READMEs of all open source projects in the Datawhale organization. Run the project/database/test_get_all_repo.py file to fetch them. The code is as follows:

import json
import requests
import os
import base64
import loguru
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
# Read the GitHub token from the environment
TOKEN = os.getenv('TOKEN')
# Fetch the list of repositories that belong to an organization
def get_repos(org_name, token, export_dir):
    headers = {
        'Authorization': f'token {token}',
    }
    url = f'https://api.github.com/orgs/{org_name}/repos'
    repos = []
    page = 1
    # The GitHub API returns at most 100 items per page and numbers pages from 1,
    # so keep requesting pages until an empty one comes back
    while True:
        response = requests.get(url, headers=headers, params={'per_page': 100, 'page': page})
        if response.status_code != 200:
            loguru.logger.error(f"Error fetching repositories: {response.status_code}")
            loguru.logger.error(response.text)
            return []
        batch = response.json()
        if not batch:
            break
        repos.extend(batch)
        page += 1
    loguru.logger.info(f'Fetched {len(repos)} repositories for {org_name}.')
    # Save the repository names to a file under export_dir
    os.makedirs(export_dir, exist_ok=True)
    repositories_path = os.path.join(export_dir, 'repositories.txt')
    with open(repositories_path, 'w', encoding='utf-8') as file:
        for repo in repos:
            file.write(repo['name'] + '\n')
    return repos
# Fetch the README file of one repository
def fetch_repo_readme(org_name, repo_name, token, export_dir):
    headers = {
        'Authorization': f'token {token}',
    }
    url = f'https://api.github.com/repos/{org_name}/{repo_name}/readme'
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        readme_content = response.json()['content']
        # The API returns the README base64-encoded; decode it
        readme_content = base64.b64decode(readme_content).decode('utf-8')
        # Save the README under export_dir/<repo_name>/README.md
        repo_dir = os.path.join(export_dir, repo_name)
        os.makedirs(repo_dir, exist_ok=True)
        readme_path = os.path.join(repo_dir, 'README.md')
        with open(readme_path, 'w', encoding='utf-8') as file:
            file.write(readme_content)
    else:
        loguru.logger.error(f"Error fetching README for {repo_name}: {response.status_code}")
        loguru.logger.error(response.text)
# Entry point
if __name__ == '__main__':
    # Organization name
    org_name = 'datawhalechina'
    # Configure export_dir
    export_dir = "../../database/readme_db"  # replace with the actual directory path
    # Get the repository list
    repos = get_repos(org_name, TOKEN, export_dir)
    if repos:
        for repo in repos:
            repo_name = repo['name']
            # Fetch the README of each repository
            fetch_repo_readme(org_name, repo_name, TOKEN, export_dir)
    # Clean up the temporary folder
    # if os.path.exists('temp'):
    #     shutil.rmtree('temp')

By default, these README files are placed in the readme_db folder in the same directory as database. Because the README files contain a lot of irrelevant information, you can run the project/database/text_summary_readme.py file to call a large model that generates a summary of each README and saves it to the ../../data_base/knowledge_db/readme_summary folder of the knowledge base directory above. The code is as follows:

import os
from dotenv import load_dotenv
import openai
from test_get_all_repo import get_repos
from bs4 import BeautifulSoup
import markdown
import re
import time
# Load environment variables
load_dotenv()
TOKEN = os.getenv('TOKEN')
# Set up the OpenAI API client
openai_api_key = os.environ["OPENAI_API_KEY"]

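# Filter links out of the text so the LLM provider's content moderation is not triggered
def remove_urls(text):
    # Regular expression pattern that matches URLs
    url_pattern = re.compile(r'https?://[^\s]*')
    # Replace every matched URL with an empty string
    text = re.sub(url_pattern, '', text)
    # Regular expression pattern that matches specific phrases
    specific_text_pattern = re.compile(r'扫描下方二维码关注公众号|提取码|关注|科学上网|回复关键词|侵权|版权|致谢|引用|LICENSE'
                                       r'|组队打卡|任务打卡|组队学习的那些事|学习周期|开源内容|打卡|组队学习|链接')
    # Replace every matched phrase with an empty string
    text = re.sub(specific_text_pattern, '', text)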
    return text

# Extract the plain text from Markdown content
def extract_text_from_md(md_content):
    # Convert Markdown to HTML
    html = markdown.markdown(md_content)
    # Use BeautifulSoup to extract text
    soup = BeautifulSoup(html, 'html.parser')

    return remove_urls(soup.get_text())

def generate_llm_summary(repo_name, readme_content,model):
    prompt = f"1:这个仓库名是 {repo_name}. 此仓库的readme全部内容是: {readme_content}\
               2:请用约200以内的中文概括这个仓库readme的内容,返回的概括格式要求:这个仓库名是...,这仓库内容主要是..."
    openai.api_key = openai_api_key
    # Call the model
    messages = [{"role": "system", "content": "你是一个人工智能助手"},
                {"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
    )
    return response.choices[0].message["content"]

def main(org_name,export_dir,summary_dir,model):
    repos = get_repos(org_name, TOKEN, export_dir)

    # Create a directory to save summaries
    os.makedirs(summary_dir, exist_ok=True)

    for id, repo in enumerate(repos):
        repo_name = repo['name']
        readme_path = os.path.join(export_dir, repo_name, 'README.md')
        print(repo_name)
        if os.path.exists(readme_path):
            with open(readme_path, 'r', encoding='utf-8') as file:
                readme_content = file.read()
            # Extract text from the README
            readme_text = extract_text_from_md(readme_content)
            # Generate a summary for the README
            # Access is rate limited: one request per minute
            time.sleep(60)
            print('第' + str(id) + '条' + 'summary开始')
            try:
                summary = generate_llm_summary(repo_name, readme_text,model)
                print(summary)
                # Write summary to a Markdown file in the summary directory
                summary_file_path = os.path.join(summary_dir, f"{repo_name}_summary.md")
                with open(summary_file_path, 'w', encoding='utf-8') as summary_file:
                    summary_file.write(f"# {repo_name} Summary\n\n")
                    summary_file.write(summary)
            except openai.OpenAIError as e:
                summary_file_path = os.path.join(summary_dir, f"{repo_name}_summary风控.md")
                with open(summary_file_path, 'w', encoding='utf-8') as summary_file:
                    summary_file.write(f"# {repo_name} Summary风控\n\n")
                    summary_file.write("README内容风控。\n")
                print(f"Error generating summary for {repo_name}: {e}")
                # print(readme_text)
        else:
            print(f"文件不存在: {readme_path}")
            # If README doesn't exist, create an empty Markdown file
            summary_file_path = os.path.join(summary_dir, f"{repo_name}_summary不存在.md")
            with open(summary_file_path, 'w', encoding='utf-8') as summary_file:
                summary_file.write(f"# {repo_name} Summary不存在\n\n")
                summary_file.write("README文件不存在。\n")
if __name__ == '__main__':
    # Organization name
    org_name = 'datawhalechina'
    # Configure export_dir
    export_dir = "../database/readme_db"  # replace with the actual README directory path
    summary_dir="../../data_base/knowledge_db/readme_summary"# replace with the actual directory path for the README summaries
    model="gpt-3.5-turbo"  #deepseek-chat,gpt-3.5-turbo,moonshot-v1-8k
    main(org_name,export_dir,summary_dir,model)

Among them, the extract_text_from_md() function is used to extract the text in the md file, and the remove_urls() function filters out some web links in the readme text and filters out some words that may cause risk control for large models. Then call generate_llm_summary() to let the large model generate a summary of each readme.

2. After the knowledge base above is built, the ../../data_base/knowledge_db directory contains the Markdown summaries of the READMEs of all Datawhale open source projects, as well as 《机器学习公式详解》 (PDF), 《面向开发者的 LLM 入门教程 第一部分 Prompt Engineering》 (Markdown), 《强化学习入门指南》 (MP4), and other files.

The knowledge base therefore holds MP4, Markdown, and PDF files. The code that loads them is in project/database/create_db.py, and part of it is shown below. PDF files use the PyMuPDFLoader loader, and Markdown files use the UnstructuredMarkdownLoader loader. Note that data processing is in practice complex and business-specific: a PDF file, for example, contains charts, pictures, body text, and headings at different levels, all of which need refined handling according to the business. For specific techniques, see the second part of the tutorial on advanced RAG techniques and explore on your own:

import os
import re
import tempfile

from langchain.document_loaders import UnstructuredFileLoader
from langchain.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyMuPDFLoader
from langchain.vectorstores import Chroma
# Basic configuration first

DEFAULT_DB_PATH = "../../data_base/knowledge_db"
DEFAULT_PERSIST_PATH = "../../data_base/vector_db"
... 
...
...
def file_loader(file, loaders):
    if isinstance(file, tempfile._TemporaryFileWrapper):
        file = file.name
    if not os.path.isfile(file):
        [file_loader(os.path.join(file, f), loaders) for f in  os.listdir(file)]
        return
    file_type = file.split('.')[-1]
    if file_type == 'pdf':
        loaders.append(PyMuPDFLoader(file))
    elif file_type == 'md':
        pattern = r"不存在|风控"
        match = re.search(pattern, file)
        if not match:
            loaders.append(UnstructuredMarkdownLoader(file))
    elif file_type == 'txt':
        loaders.append(UnstructuredFileLoader(file))
    return
...
...

2.2 Text segmentation and vectorization

Text splitting and vectorization are essential to the whole RAG process. The knowledge base loaded above needs to be split into chunks, whether by a fixed size, by token length, or by a semantic model. This project uses the text splitter in LangChain, splitting by chunk_size (chunk size) and chunk_overlap (overlap between adjacent chunks).

  • chunk_size refers to the number of characters or Tokens (such as words, sentences, etc.) contained in each chunk
  • chunk_overlap refers to the number of characters shared between two chunks, which is used to maintain the coherence of the context and avoid losing context information during segmentation

1. You can set a maximum Token length, and then split the document according to this maximum Token length. The document fragments segmented in this way are document fragments of uniform length. Some overlapping content between fragments can ensure that relevant document fragments can be retrieved during retrieval. This part of the text splitting code is also in the project/database/create_db.py file. The project uses the RecursiveCharacterTextSplitter text splitter in langchain for splitting. The code is as follows:

......
def create_db(files=DEFAULT_DB_PATH, persist_directory=DEFAULT_PERSIST_PATH, embeddings="openai"):
    """
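    Load the files, split the documents, generate their embeddings, and create the vector database.

    Args:
    files: path of the files to load.
    embeddings: the model used to produce embeddings.

    Returns:
    vectordb: the created database.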
    """
    if files == None:
        return "can't load empty file"
    if type(files) != list:
        files = [files]
    loaders = []
    [file_loader(file, loaders) for file in files]
    docs = []
    for loader in loaders:
        if loader is not None:
            docs.extend(loader.load())
    # Split the documents
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500, chunk_overlap=150)
    split_docs = text_splitter.split_documents(docs)
    ....
    ....
    ....other code omitted here
    ....
    return vectordb
...........    
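
To make chunk_size and chunk_overlap concrete, here is a minimal fixed-window sketch in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraph and line breaks before falling back to smaller units); the function name split_by_characters is hypothetical and only illustrates how the two parameters interact.

```python
def split_by_characters(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

chunks = split_by_characters("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the role the 150-character overlap plays for the 500-character chunks in create_db().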

2. After the knowledge base text is split, it needs to be vectorized. The project does this in project/embedding/call_embedding.py. For text embedding you can choose the local m3e model or call the OpenAI or ZhipuAI API. The code is as follows:

import os
import sys

sys.path.append(os.path.dirname(os.path.dirname(__file__)))
sys.path.append(r"../../")
from embedding.zhipuai_embedding import ZhipuAIEmbeddings
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.embeddings.openai import OpenAIEmbeddings
from llm.call_llm import parse_llm_api_key


def get_embedding(embedding: str, embedding_key: str = None, env_file: str = None):
   if embedding == 'm3e':
      return HuggingFaceEmbeddings(model_name="moka-ai/m3e-base")
   if embedding_key == None:
      embedding_key = parse_llm_api_key(embedding)
   if embedding == "openai":
      return OpenAIEmbeddings(openai_api_key=embedding_key)
   elif embedding == "zhipuai":
      return ZhipuAIEmbeddings(zhipuai_api_key=embedding_key)
   else:
      raise ValueError(f"embedding {embedding} not support ")

2.3 Vector database

After the knowledge base text is split and vectorized, you need a vector database to store the document fragments and their vector representations. In a vector database, data is represented as vectors, and each vector represents one data item, which may originally be text, an image, or another type of data.

Vector databases use efficient indexing and query algorithms to speed up the storage and retrieval of vector data. This project uses the chromadb vector database (similar vector databases include faiss, among others). The code that defines the vector database is also in the project/database/create_db.py file. persist_directory is the local persistence path. Calling vectordb.persist() saves the vector database to disk, so the existing local database can be loaded again later. The complete code for splitting the text, obtaining the embeddings, and defining the vector database is as follows:

def create_db(files=DEFAULT_DB_PATH, persist_directory=DEFAULT_PERSIST_PATH, embeddings="openai"):
    """
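    Load the files, split the documents, generate their embeddings, and create the vector database.

    Args:
    files: path of the files to load.
    embeddings: the model used to produce embeddings.

    Returns:
    vectordb: the created database.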
    """
    if files == None:
        return "can't load empty file"
    if type(files) != list:
        files = [files]
    loaders = []
    [file_loader(file, loaders) for file in files]
    docs = []
    for loader in loaders:
        if loader is not None:
            docs.extend(loader.load())
    # Split the documents
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500, chunk_overlap=150)
    split_docs = text_splitter.split_documents(docs)
    if type(embeddings) == str:
        embeddings = get_embedding(embedding=embeddings)
    # Define the persistence path (note: this overrides the persist_directory argument)
    persist_directory = '../../data_base/vector_db/chroma'
    # Build the database from the split documents
    vectordb = Chroma.from_documents(
    documents=split_docs,
    embedding=embeddings,
    persist_directory=persist_directory  # lets us save the database to the persist_directory directory on disk
    ) 

    vectordb.persist()
    return vectordb

3. Retrieval (Retriever) and Generation (Generator)

This section covers the retrieval and generation stages of RAG: after the question (Query) is vectorized, the top k fragments most similar to the query vector are matched among the knowledge base document vectors. The matched knowledge base text is added to the prompt as context together with the question, and the prompt is then submitted to the LLM to generate the answer. The following explanation is based on the llm_universe personal knowledge base assistant.

3.1 Vector database retrieval

By segmenting and vectorizing the text in the previous chapter and building a vector database index, the vector database can be used for efficient retrieval. Vector database is a library for efficiently searching for similarities in large-scale high-dimensional vector spaces, which can quickly find the most similar vector to a given query vector in large-scale data sets. As shown in the following example:

question="什么是机器学习"
sim_docs = vectordb.similarity_search(question,k=3)
print(f"检索到的内容数:{len(sim_docs)}")

检索到的内容数:3
for i, sim_doc in enumerate(sim_docs):
    print(f"检索到的第{i}个内容: \n{sim_doc.page_content[:200]}", end="\n--------------\n")
检索到的第0个内容: 
导,同时也能体会到这三门数学课在机器学习上碰撞产生的“数学之美”。
1.1
引言
本节以概念理解为主,在此对“算法”和“模型”作补充说明。“算法”是指从数据中学得“模型”的具
体方法,例如后续章节中将会讲述的线性回归、对数几率回归、决策树等。“算法”产出的结果称为“模型”,
通常是具体的函数或者可抽象地看作为函数,例如一元线性回归算法产出的模型即为形如 f(x) = wx + b

的一元一次函数。
--------------

检索到的第1个内容: 
模型:机器学习的一般流程如下:首先收集若干样本(假设此时有 100 个),然后将其分为训练样本
(80 个)和测试样本(20 个),其中 80 个训练样本构成的集合称为“训练集”,20 个测试样本构成的集合
称为“测试集”,接着选用某个机器学习算法,让其在训练集上进行“学习”(或称为“训练”),然后产出

得到“模型”(或称为“学习器”),最后用测试集来测试模型的效果。执行以上流程时,表示我们已经默
--------------

检索到的第2个内容: 
→_→
欢迎去各大电商平台选购纸质版南瓜书《机器学习公式详解》
←_←
第 1 章
绪论
本章作为“西瓜书”的开篇,主要讲解什么是机器学习以及机器学习的相关数学符号,为后续内容作
铺垫,并未涉及复杂的算法理论,因此阅读本章时只需耐心梳理清楚所有概念和数学符号即可。此外,在
阅读本章前建议先阅读西瓜书目录前页的《主要符号表》,它能解答在阅读“西瓜书”过程中产生的大部
分对数学符号的疑惑。
本章也作为
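
Conceptually, a similarity search like the one above scores every stored vector against the query vector and returns the k best matches. The sketch below shows that idea with brute-force cosine similarity over made-up three-dimensional vectors. It is an assumption-laden illustration: the vectors and document labels are invented, cosine similarity is only one common measure (the metric Chroma actually applies depends on how the collection is configured), and a real vector database uses an index rather than scanning every vector.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 3):
    """Score every stored vector against the query and return the k best (score, text) pairs."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Made-up embeddings for three document fragments.
store = [
    ("fragment about machine learning", [0.9, 0.1, 0.0]),
    ("fragment about cooking", [0.0, 0.2, 0.9]),
    ("fragment about statistics", [0.7, 0.3, 0.1]),
]

results = top_k([1.0, 0.0, 0.0], store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```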

3.2 Calling large model llm

Here we take project/qa_chain/model_to_llm.py as an example. The wrappers for the Spark, GLM, Wenxin, and other model APIs are defined under the project/llm/ directory. The project/qa_chain/model_to_llm.py file imports these modules and returns the LLM that matches the model name passed in by the user. The code is as follows:

def model_to_llm(model:str=None, temperature:float=0.0, appid:str=None, api_key:str=None,Spark_api_secret:str=None,Wenxin_secret_key:str=None):
        """
        Spark: model, temperature, appid, api_key, api_secret
        Baidu Wenxin: model, temperature, api_key, api_secret
        ZhipuAI: model, temperature, api_key
        OpenAI: model, temperature, api_key
        """
        if model in ["gpt-3.5-turbo", "gpt-3.5-turbo-16k-0613", "gpt-3.5-turbo-0613", "gpt-4", "gpt-4-32k"]:
            if api_key == None:
                api_key = parse_llm_api_key("openai")
            llm = ChatOpenAI(model_name = model, temperature = temperature , openai_api_key = api_key)
        elif model in ["ERNIE-Bot", "ERNIE-Bot-4", "ERNIE-Bot-turbo"]:
            if api_key == None or Wenxin_secret_key == None:
                api_key, Wenxin_secret_key = parse_llm_api_key("wenxin")
            llm = Wenxin_LLM(model=model, temperature = temperature, api_key=api_key, secret_key=Wenxin_secret_key)
        elif model in ["Spark-1.5", "Spark-2.0"]:
            if api_key == None or appid == None or Spark_api_secret == None:
                api_key, appid, Spark_api_secret = parse_llm_api_key("spark")
            llm = Spark_LLM(model=model, temperature = temperature, appid=appid, api_secret=Spark_api_secret, api_key=api_key)
        elif model in ["chatglm_pro", "chatglm_std", "chatglm_lite"]:
            if api_key == None:
                api_key = parse_llm_api_key("zhipuai")
            llm = ZhipuAILLM(model=model, zhipuai_api_key=api_key, temperature = temperature)
        else:
            raise ValueError(f"model{model} not support!!!")
        return llm
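
The if/elif chain above maps a model name to a provider. The same lookup can be expressed as a table, which keeps the list of supported models in one place. This is only a sketch of an alternative layout, not the project's code: PROVIDER_MODELS and resolve_provider are hypothetical names, and the model names are the ones listed in the function above.

```python
# Hypothetical lookup table: provider key -> supported model names.
PROVIDER_MODELS = {
    "openai": ["gpt-3.5-turbo", "gpt-3.5-turbo-16k-0613", "gpt-3.5-turbo-0613", "gpt-4", "gpt-4-32k"],
    "wenxin": ["ERNIE-Bot", "ERNIE-Bot-4", "ERNIE-Bot-turbo"],
    "spark": ["Spark-1.5", "Spark-2.0"],
    "zhipuai": ["chatglm_pro", "chatglm_std", "chatglm_lite"],
}

def resolve_provider(model: str) -> str:
    """Return the provider key for a model name, or raise if the model is unknown."""
    for provider, models in PROVIDER_MODELS.items():
        if model in models:
            return provider
    raise ValueError(f"model {model} is not supported")

print(resolve_provider("chatglm_std"))  # zhipuai
```

With such a table, adding a model means editing one list, and the provider key can then select both the API key lookup and the wrapper class to construct.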

3.3 prompt and build question and answer chain

Next comes the last step. After designing the prompt based on the knowledge base Q&A, the answer can be generated by combining the above retrieval and large model call. The format of building prompt is as follows, which can be modified according to your business needs:

from langchain.prompts import PromptTemplate

# template = """基于以下已知信息,简洁和专业的来回答用户的问题。
#             如果无法从中得到答案,请说 "根据已知信息无法回答该问题" 或 "没有提供足够的相关信息",不允许在答案中添加编造成分。
#             答案请使用中文。
#             总是在回答的最后说“谢谢你的提问!”。
# 已知信息:{context}
# 问题: {question}"""
template = """使用以下上下文来回答最后的问题。如果你不知道答案,就说你不知道,不要试图编造答
案。最多使用三句话。尽量使答案简明扼要。总是在回答的最后说“谢谢你的提问!”。
{context}
问题: {question}
有用的回答:"""

QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context","question"],
                                 template=template)

# Run the chain

Next, build the question and answer chain. RetrievalQA.from_chain_type(), the method that creates a retrieval QA chain, takes the following parameters:

  • llm: the LLM to use
  • chain_type: the chain type, for example RetrievalQA.from_chain_type(chain_type="map_reduce"); you can also use the load_qa_chain() method to specify the chain type
  • Custom prompt: pass the chain_type_kwargs parameter to RetrievalQA.from_chain_type(), for example chain_type_kwargs = {"prompt": PROMPT}
  • Return the source documents: pass return_source_documents=True to RetrievalQA.from_chain_type(); you can also use RetrievalQAWithSourcesChain to return references to the source documents (coordinates, primary key, or index)

# Custom QA chain
self.qa_chain = RetrievalQA.from_chain_type(llm=self.llm,
                                        retriever=self.retriever,
                                        return_source_documents=True,
                                        chain_type_kwargs={"prompt":self.QA_CHAIN_PROMPT})

The effect of the question and answer chain is shown below. The prompt is built by combining the recalled results with the query:

question_1 = "什么是南瓜书?"
question_2 = "王阳明是谁?"
result = qa_chain({"query": question_1})
print("大模型+知识库后回答 question_1 的结果:")
print(result["result"])
大模型+知识库后回答 question_1 的结果:
南瓜书是对《机器学习》(西瓜书)中难以理解的公式进行解析和补充推导细节的一本书。谢谢你的提问!
result = qa_chain({"query": question_2})
print("大模型+知识库后回答 question_2 的结果:")
print(result["result"])
大模型+知识库后回答 question_2 的结果:
我不知道王阳明是谁,谢谢你的提问!
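
The second answer shows the prompt doing its job: when the retrieved context does not contain the answer, the model is told to say so. The sketch below shows roughly what the chain sends to the model when the retrieved fragments are simply concatenated into the {context} slot. It is an illustration in plain Python, not LangChain's implementation: the English template is a shortened stand-in for the project's prompt, stuff_prompt is a hypothetical name, and joining fragments with a blank line is an assumption.

```python
# Shortened English stand-in for the project's prompt template.
TEMPLATE = """Use the following context to answer the question at the end.
If you do not know the answer, say that you do not know.
{context}
Question: {question}
Helpful answer:"""

def stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Concatenate the retrieved fragments and fill them into the template with the question."""
    context = "\n\n".join(retrieved_chunks)
    return TEMPLATE.format(context=context, question=question)

prompt = stuff_prompt(
    "What is the Pumpkin Book?",
    ["fragment A from the knowledge base", "fragment B from the knowledge base"],
)
print(prompt)
```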

The complete code for the retrieval question and answer chain without memory described above is in project/qa_chain/QA_chain_self.py. The project also implements a retrieval question and answer chain with memory. The two custom chains share similar internals but call different LangChain chains. The complete code for the chain with memory, project/qa_chain/Chat_QA_chain_self.py, is as follows:

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.chat_models import ChatOpenAI

from qa_chain.model_to_llm import model_to_llm
from qa_chain.get_vectordb import get_vectordb


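class Chat_QA_chain_self:
    """
    QA chain with conversation history
    - model: name of the model to call
    - temperature: temperature coefficient; controls the randomness of generation
    - top_k: return the top k most similar retrieved documents
    - chat_history: conversation history; takes a list, empty by default
    - history_len: keep only the most recent history_len turns
    - file_path: path to the files used to build the knowledge base
    - persist_path: persistence path of the vector database
    - appid: Spark app ID
    - api_key: required by Spark, Baidu Wenxin, OpenAI, and Zhipu
    - Spark_api_secret: Spark API secret
    - Wenxin_secret_key: Wenxin secret key
    - embedding: embedding model to use
    - embedding_key: key for the embedding model (Zhipu or OpenAI)
    """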
    def __init__(self,model:str, temperature:float=0.0, top_k:int=4, chat_history:list=None, file_path:str=None, persist_path:str=None, appid:str=None, api_key:str=None, Spark_api_secret:str=None,Wenxin_secret_key:str=None, embedding = "openai",embedding_key:str=None):
        self.model = model
        self.temperature = temperature
        self.top_k = top_k
        # Create a fresh list per instance; a mutable default ([]) would be shared by all instances
        self.chat_history = chat_history if chat_history is not None else []
        #self.history_len = history_len
        self.file_path = file_path
        self.persist_path = persist_path
        self.appid = appid
        self.api_key = api_key
        self.Spark_api_secret = Spark_api_secret
        self.Wenxin_secret_key = Wenxin_secret_key
        self.embedding = embedding
        self.embedding_key = embedding_key


        self.vectordb = get_vectordb(self.file_path, self.persist_path, self.embedding,self.embedding_key)
        
    
    def clear_history(self):
        "Clear the conversation history"
        return self.chat_history.clear()

    
    def change_history_length(self,history_len:int=1):
        """
        Keep the history for the given number of conversation turns
        Arguments:
        - history_len: keep only the most recent history_len turns
        - chat_history: the current conversation history
        Returns: the most recent history_len turns
        """
        n = len(self.chat_history)
        return self.chat_history[n-history_len:]

 
    def answer(self, question:str=None,temperature = None, top_k = 4):
        """
        Core method: calls the QA chain
        arguments: 
        - question: the user's question
        """
        
        # Nothing to answer for a missing or empty question;
        # return the unchanged history, the same type as the normal path
        if not question:
            return self.chat_history
        
        if temperature is None:
            temperature = self.temperature

        llm = model_to_llm(self.model, temperature, self.appid, self.api_key, self.Spark_api_secret,self.Wenxin_secret_key)

        #self.memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

        retriever = self.vectordb.as_retriever(search_type="similarity",   
                                        search_kwargs={'k': top_k})  # similarity search by default, k=4

        qa = ConversationalRetrievalChain.from_llm(
            llm = llm,
            retriever = retriever
        )

        #print(self.llm)
        result = qa({"question": question,"chat_history": self.chat_history})       # result contains question, chat_history, and answer
        answer =  result['answer']
        self.chat_history.append((question,answer)) # update the history

        return self.chat_history  # return the updated history; its last entry is this turn's (question, answer)
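
In the class above, the history is a list of (question, answer) tuples, and change_history_length returns the most recent turns without modifying the stored list. The sketch below reproduces that bookkeeping in plain Python so it can be tried without a model or vector store. The helper names `add_turn` and `trim_history` and the sample strings are hypothetical and are not part of the project.

```python
def add_turn(chat_history, question, answer):
    """Append one turn as a (question, answer) tuple, the shape the chain stores."""
    chat_history.append((question, answer))
    return chat_history


def trim_history(chat_history, history_len=1):
    """Return the most recent history_len turns without modifying the input."""
    if history_len <= 0:
        return []
    return chat_history[-history_len:]


# Placeholder turns; in the real chain the answers come from the LLM.
history = []
add_turn(history, "What is the Pumpkin Book?", "placeholder answer 1")
add_turn(history, "Who wrote it?", "placeholder answer 2")
add_turn(history, "Where can I read it?", "placeholder answer 3")

recent = trim_history(history, history_len=2)
print(recent)
```

A caller that wants to bound the context sent to the model could pass a trimmed list like `recent` as chat_history instead of the full history; whether that trade-off is worthwhile depends on the model's context window.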

3. Summary and outlook

3.1 Summary of key points of personal knowledge base

This example is a personal knowledge base assistant project based on a large language model (LLM). Through intelligent retrieval and a question-answering system, it helps users quickly locate and obtain knowledge related to Datawhale. The key points of the project are:

Key point one

  1. The project uses several methods to extract and summarize all md files in Datawhale and to generate the corresponding knowledge base. During extraction and summarization, it also filters out web links in the README text and vocabulary that may trigger the risk controls of large models;

  2. Before vectorizing the knowledge base, the project segments the text with the text splitter in Langchain. The vector database uses efficient indexing and query algorithms to speed up the storage and retrieval of vector data, so the personal knowledge base can be built and queried quickly.
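
The two preprocessing ideas in this key point, link filtering and splitting text before vectorization, can be sketched in a few lines. This is a simplified stand-in written in plain Python: the regular expression, the fixed-size character windows, and the names `strip_links` and `split_text` are assumptions for illustration. The project itself relies on LangChain's text splitter, which is more sophisticated than this.

```python
import re

# Matches bare http(s) URLs up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_links(text):
    """Remove bare URLs; a simplified stand-in for the project's link filtering."""
    return URL_PATTERN.sub("", text)


def split_text(text, chunk_size=50, chunk_overlap=10):
    """Cut text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


readme = "Datawhale tutorial, see https://example.com/docs for details. " * 3
clean = strip_links(readme)
chunks = split_text(clean, chunk_size=50, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

The overlap means a sentence cut at a window boundary still appears intact in at least one neighbouring chunk, at the cost of storing some text twice.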

Key point two

The project provides a low-level wrapper around the different APIs, so users can call the corresponding large language model directly without dealing with the wrapping details.
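
One common way to hide several provider APIs behind a single call is a small factory that dispatches on the model name. The sketch below shows that pattern only; `EchoLLM`, `FACTORIES`, `get_llm`, and the name prefixes are hypothetical and do not reproduce the project's model_to_llm implementation or any real provider client.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class EchoLLM:
    """Hypothetical stand-in for a provider client; it only echoes the prompt."""
    provider: str
    temperature: float = 0.0

    def __call__(self, prompt: str) -> str:
        return f"[{self.provider}] {prompt}"


# Hypothetical mapping from a model-name prefix to a factory for that provider.
FACTORIES: Dict[str, Callable[[float], EchoLLM]] = {
    "gpt": lambda t: EchoLLM("openai", t),
    "ernie": lambda t: EchoLLM("wenxin", t),
    "spark": lambda t: EchoLLM("spark", t),
    "chatglm": lambda t: EchoLLM("zhipu", t),
}


def get_llm(model: str, temperature: float = 0.0) -> EchoLLM:
    """Return a client for the given model name, or fail loudly if unsupported."""
    for prefix, factory in FACTORIES.items():
        if model.lower().startswith(prefix):
            return factory(temperature)
    raise ValueError(f"model {model!r} is not supported")


llm = get_llm("gpt-3.5-turbo", temperature=0.1)
print(llm("hello"))
```

Callers only ever see `get_llm` and a callable object, so adding a provider means adding one entry to the mapping rather than changing every call site.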

3.2 Future development direction

  1. User experience upgrade: allow users to upload files and build their own personal knowledge base, creating a personal knowledge base assistant of their own;

  2. Model architecture upgrade: move from the general RAG architecture to a Multi-Agent framework;

  3. Function optimization and upgrade: Optimize the retrieval function within the existing structure to improve the retrieval accuracy of the personal knowledge base.

4. Acknowledgments

I would like to thank Master San for the project's crawler and summarization components.