Chapter 3 Building a Vector Knowledge Base

3.1 Vectors and vector knowledge base

3.1.1 Word vectors and vectors

In machine learning and natural language processing (NLP), word embedding is a technique that converts each word into a real-valued vector. Computers can process these vectors more easily than raw text. The main idea behind word vectors is that similar or related objects should be close together in vector space. For example, we can use word vectors to represent text data: each word is converted into a vector that captures its semantic information. The words "king" and "queen" will be very close in vector space because they have similar meanings, and "apple" and "orange" will be close because they are both fruits. The words "king" and "apple", on the other hand, will be far apart because their meanings are different.

Word vectors convert words into fixed, static vectors. Although they capture semantic information to some extent, they ignore the fact that the meaning of a word changes with its context. Therefore, RAG applications generally use general-purpose text embeddings, which can vectorize text of any length within a certain range. Unlike word vectors, the unit of vectorization is no longer a single word but the whole input text, so the output vector captures more semantic information.

There are two main advantages of vectors in RAG (Retrieval Augmented Generation):

  • Vectors are better suited to retrieval than text. When a database stores text, we mainly find matching data through keyword retrieval (lexical search) and similar methods, so the degree of matching depends on whether the documents contain the keywords in the query. A vector, by contrast, contains the semantic information of the original text, so we can obtain the semantic similarity between the question and the stored data directly by calculating the dot product, cosine distance, Euclidean distance, or similar measures between their vectors;
  • Vectors integrate information better than other media. When traditional databases store text, sound, images, video, and other media, it is difficult to build associations and cross-modal queries across them; vectors, however, can map all of these data types into a unified vector form through a variety of embedding models.
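
As a minimal sketch of the three measures named in the first point, the following pure-Python functions compute the dot product, cosine similarity, and Euclidean distance between two vectors. The three-dimensional vectors are invented toy values chosen only for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance between a and b (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy vectors, not produced by any real embedding model.
king = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.1]
apple = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))
print(cosine_similarity(king, apple))
print(euclidean_distance(king, queen))
print(euclidean_distance(king, apple))
```

With these toy values, the "king"/"queen" pair has the higher cosine similarity and the smaller Euclidean distance, which is the behaviour the text describes for semantically related items.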

When building a RAG system, we usually produce vectors with an embedding model. We can choose to:

  • Use the Embedding API of a provider;
  • Run an embedding model locally to convert data into vectors.

3.1.2 Vector database

A vector database is a solution for efficiently computing over and managing large amounts of vector data. It is a database system specifically designed to store and retrieve vector data (embeddings). Unlike traditional databases built on the relational model, it focuses on the features of vector data and the similarity between vectors.

In a vector database, data is represented as vectors, with each vector representing a data item. The underlying items can be numbers, text, images, or other types of data. Vector databases use efficient indexing and query algorithms to speed up the storage and retrieval of vector data.

A vector database uses the vector as its basic unit for storage, processing, and retrieval. It obtains the similarity between a stored vector and the target vector by calculating measures such as cosine distance or dot product. When processing large or even massive amounts of vector data, the indexing and query algorithms of a vector database are significantly more efficient than those of a traditional database.

Mainstream vector databases include:

  • Chroma: a lightweight vector database with a simple API. It is simple, easy to use, and lightweight, but its functionality is relatively basic and it does not support GPU acceleration, which makes it suitable for beginners.
  • Weaviate: an open source vector database. In addition to similarity search and Maximal Marginal Relevance (MMR) search, it supports hybrid search that combines multiple search algorithms (lexical search and vector search), thereby improving the relevance and accuracy of search results.
  • Qdrant: developed in Rust, with very high retrieval efficiency and RPS (Requests Per Second). It supports three deployment modes: running locally, deploying on a local server, and Qdrant Cloud. Data can also be reused by specifying different keys for page content and metadata.
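
The similarity search described in this section can be sketched as an exhaustive scan over an in-memory dictionary. This is only an illustration of the idea: the store, the document ids, and the top_k helper are hypothetical, and real vector databases typically replace the linear scan with specialised index structures.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, store, k=2):
    """Return the k (doc_id, score) pairs most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical in-memory "database": document id -> toy vector.
store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.8, 0.6, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}

results = top_k([1.0, 0.1, 0.0], store, k=2)
print([doc_id for doc_id, _ in results])
```

The scan costs one similarity computation per stored vector, which is why dedicated indexes matter once the collection grows large.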

3.2 Using Embedding API

Note: To make embedding API calls easier, put your API keys in the .env file under llm_universe; the code will read the file and load the environment variables automatically.

3.2.1 Using OpenAI API

OpenAI provides a packaged interface for embeddings, which we can wrap in a simple function. OpenAI currently offers three embedding models, whose performance is as follows:

Model                    Pages per dollar   MTEB score   MIRACL score
text-embedding-3-large   9,615              64.6         54.9
text-embedding-3-small   62,500             62.3         44.0
text-embedding-ada-002   12,500             61.0         31.4

  • The MTEB score is the average score of the embedding model on eight tasks, including classification, clustering, and pairing.
  • The MIRACL score is the average score of the embedding model on the retrieval task.

Comparing the three embedding models, text-embedding-3-large has the best performance and the highest price, so it is a good choice when the application needs better performance and the budget allows. text-embedding-3-small offers a good balance of performance and price, so we can choose it when the budget is limited. text-embedding-ada-002 is OpenAI's previous-generation model; it scores lower than both newer models and costs far more per page than text-embedding-3-small, so it is not recommended.

import os
from openai import OpenAI
from dotenv import load_dotenv, find_dotenv


# Read the local/project environment variables.
# find_dotenv() locates the path of the .env file.
# load_dotenv() reads that .env file and loads its variables into the current environment.
# If the variables are already set globally, this line has no effect.
_ = load_dotenv(find_dotenv())

# If you need to access the API through a proxy port, configure it as follows.
# os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'
# os.environ["HTTP_PROXY"] = 'http://127.0.0.1:7890'

def openai_embedding(text: str, model: str = None):
    # Read the OPENAI_API_KEY environment variable.
    api_key = os.environ['OPENAI_API_KEY']
    client = OpenAI(api_key=api_key)

    # Embedding models: 'text-embedding-3-small', 'text-embedding-3-large', 'text-embedding-ada-002'
    if model is None:
        model = "text-embedding-3-small"

    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response

# The input text to embed, as a string (Chinese sample sentence kept from the original).
response = openai_embedding(text='要生成 embedding 的输入文本,字符串形式。')

The data returned by the API is in JSON format. Besides the object type object, it also contains data, which stores the embeddings; model, which names the embedding model; usage, which reports the token usage of this request; and other fields, as shown below:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        -0.006929283495992422,
        ... (omitted)
        -4.547132266452536e-05,
      ],
    }
  ],
  "model": "text-embedding-3-small",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}
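
To make the layout of this JSON concrete, here is a small sketch that parses a body of the same shape with the standard library. The sample body is a shortened, hypothetical stand-in (three invented numbers instead of a full embedding). Note that the openai Python SDK returns a typed object rather than a dict, which is why the code in this section uses attribute access such as response.data instead.

```python
import json

# A shortened, hypothetical response body in the shape shown above.
raw = """
{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 0, "embedding": [-0.0069, 0.0121, -0.00004]}
  ],
  "model": "text-embedding-3-small",
  "usage": {"prompt_tokens": 5, "total_tokens": 5}
}
"""

body = json.loads(raw)

# Each entry of "data" holds one embedding; collect them in input order.
vectors = [item["embedding"] for item in body["data"]]

print(body["object"])
print(len(vectors), len(vectors[0]))
print(body["model"], body["usage"]["total_tokens"])
```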

We can read the object attribute of the response to see the type of the returned data.

print(f'The returned embedding type is: {response.object}')

The returned embedding type is: list

The embeddings are stored in data. We can check the length of the embedding and the embedding itself.

print(f'The embedding length is: {len(response.data[0].embedding)}')
print(f'The embedding (first 10) is: {response.data[0].embedding[:10]}')

The embedding length is: 1536
The embedding (first 10) is: [0.038827355951070786, 0.013531949371099472, -0.0025024667847901583, -0.016542360186576843, 0.02412303350865841, -0.017386866733431816, 0.042086150497198105, 0.011515072546899319, -0.0282362699508667, -0.006800748407840729]

We can also check the model and the token usage of this request.

print(f'The embedding model is: {response.model}')
print(f'The token usage is: {response.usage}')

The embedding model is: text-embedding-3-small
The token usage is: Usage(prompt_tokens=12, total_tokens=12)

3.2.2 Using Wenxin Qianfan API

Embedding-V1 is a text representation model based on Baidu's Wenxin large model technology. The access token is the credential for calling the interface. To use Embedding-V1, first obtain an access token with the API Key and Secret Key, then call the interface with the access token to embed the text. The Qianfan large model platform also supports other embedding models such as bge-large-zh.

import os
import json
import requests

def wenxin_embedding(text: str):
    # Read the QIANFAN_AK and QIANFAN_SK environment variables.
    api_key = os.environ['QIANFAN_AK']
    secret_key = os.environ['QIANFAN_SK']

    # Use the API Key and Secret Key to request an access token from https://aip.baidubce.com/oauth/2.0/token
    url = "https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id={0}&client_secret={1}".format(api_key, secret_key)
    payload = json.dumps("")
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json'
    }
    response = requests.request("POST", url, headers=headers, data=payload)

    # Use the access token to embed the text.
    url = "https://aip.baidubce.com/rpc/2.0/ai_custom/v1/wenxinworkshop/embeddings/embedding-v1?access_token=" + str(response.json().get("access_token"))
    # The interface expects a list of strings, so wrap the single text in a list.
    texts = [text]
    payload = json.dumps({
        "input": texts
    })
    headers = {
        'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data=payload)

    return json.loads(response.text)

# The input text to embed (Chinese sample sentence kept from the original).
text = "要生成 embedding 的输入文本,字符串形式。"
response = wenxin_embedding(text=text)

In addition to a separate id for each embedding request, the Embedding-V1 response also contains a timestamp that records when the embedding was created.

print('The embedding id is: {}'.format(response['id']))
print('The embedding timestamp is: {}'.format(response['created']))

The embedding id is: as-h7u5sde3ga
The embedding timestamp is: 1741091715

Similarly, we can also get the type of the returned data and the embedding from the response.

print('The returned embedding type is: {}'.format(response['object']))
print('The embedding length is: {}'.format(len(response['data'][0]['embedding'])))
print('The embedding (first 10) is: {}'.format(response['data'][0]['embedding'][:10]))

The returned embedding type is: embedding_list
The embedding length is: 384
The embedding (first 10) is: [0.060567744076251984, 0.020958080887794495, 0.053234219551086426, 0.02243831567466259, -0.024505289271473885, -0.09820500761270523, 0.04375714063644409, -0.009092536754906178, -0.020122773945331573, 0.015808865427970886]

3.2.3 Using iFlytek Spark API

When using iFlytek embedding, pay attention to the spark_embedding_domain parameter: it should be "query" when vectorizing a question and "para" when vectorizing the knowledge base.

from sparkai.embedding.spark_embedding import Embeddingmodel
import os
xunfei_embedding = Embeddingmodel(
    spark_embedding_app_id=os.environ["IFLYTEK_SPARK_APP_ID"],
    spark_embedding_api_key=os.environ["IFLYTEK_SPARK_API_KEY"],
    spark_embedding_api_secret=os.environ["IFLYTEK_SPARK_API_SECRET"],
    spark_embedding_domain="para"
)

# The input text to embed (Chinese sample sentence kept from the original).
text = {"content": '要生成 embedding 的输入文本。', "role": "user"}
response = xunfei_embedding.embedding(text=text)
print(f'The generated embedding length is: {len(response)}')
print(f'The embedding (first 10) is: {response[:10]}')

The generated embedding length is: 2560
The embedding (first 10) is: [-0.448486328125, 0.84130859375, 0.67919921875, 0.214599609375, 0.374267578125, 0.384033203125, -0.488525390625, -0.6103515625, 0.1571044921875, 0.81494140625]

3.2.4 Using Zhipu API

Zhipu provides a packaged SDK that we can call directly.

import os
from zhipuai import ZhipuAI

def zhipu_embedding(text: str):
    # Read the ZHIPUAI_API_KEY environment variable.
    api_key = os.environ['ZHIPUAI_API_KEY']
    client = ZhipuAI(api_key=api_key)
    response = client.embeddings.create(
        model="embedding-3",
        input=text,
    )
    return response

# The input text to embed (Chinese sample sentence kept from the original).
text = '要生成 embedding 的输入文本,字符串形式。'
response = zhipu_embedding(text=text)

The response is of type zhipuai.types.embeddings.EmbeddingsResponded. We can read its object, data, model, and usage attributes to view the type of the returned data, the embeddings, the embedding model, and the token usage.

print(f'The response type is: {type(response)}')
print(f'The embedding type is: {response.object}')
print(f'The model that generates the embedding is: {response.model}')
print(f'The generated embedding length is: {len(response.data[0].embedding)}')
print(f'The embedding (first 10) is: {response.data[0].embedding[:10]}')

The response type is: <class 'zhipuai.types.embeddings.EmbeddingsResponded'>
The embedding type is: list
The model that generates the embedding is: embedding-3
The generated embedding length is: 2048
The embedding (first 10) is: [-0.0042974288, 0.040918995, -0.0036798029, 0.034753118, -0.047749206, -0.015196704, -0.023666998, -0.002935019, 0.0015090306, -0.011958062]

3.3 Data processing

To build our local knowledge base, we need to process local documents stored in several formats: read them, convert their contents into vectors with the embedding methods described above, and build a vector database. In this section, we use some practical examples to explain how to process local documents.

3.3.1 Source document selection

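We choose some classic open source courses from Datawhale as examples. The two used in the rest of this section are the Pumpkin Book (pumpkin_book.pdf, a PDF document) and the introduction chapter of the Prompt Engineering for Developers tutorial (a Markdown document).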

3.3.2 Data reading

For PDF documents, we can use LangChain's PyMuPDFLoader to read the PDF files of the knowledge base. PyMuPDFLoader is one of the fastest PDF parsers, and the results contain detailed metadata of the PDF and its pages, returning one document per page.

from langchain_community.document_loaders import PyMuPDFLoader

# Create a PyMuPDFLoader instance; the input is the path of the PDF document to load.
loader = PyMuPDFLoader("../data_base/knowledge_db/pumkin_book/pumpkin_book.pdf")

# Call the load method of PyMuPDFLoader to load the PDF file.
pdf_pages = loader.load()

After the document is loaded, it is stored in the pdf_pages variable:

  • The type of pdf_pages is List
  • Printing the length of pdf_pages shows how many pages the PDF contains in total

print(f"The loaded variable type is: {type(pdf_pages)}.", f"The PDF contains a total of {len(pdf_pages)} pages.")

The loaded variable type is: <class 'list'>. The PDF contains a total of 196 pages.

Each element of pdf_pages is a document of type langchain_core.documents.base.Document, which has two attributes:

  • page_content contains the content of the document.
  • metadata holds descriptive data related to the document.

pdf_page = pdf_pages[1]
print(f"The type of each element: {type(pdf_page)}.",
    f"Descriptive data for this document: {pdf_page.metadata}",
    f"View the contents of this document:\n{pdf_page.page_content}",
    sep="\n------\n")

The type of each element: <class 'langchain_core.documents.base.Document'>.
------
Descriptive data for this document: {'source': '../../data_base/knowledge_db/pumkin_book/pumpkin_book.pdf', 'file_path': '../../data_base/knowledge_db/pumkin_book/pumpkin_book.pdf', 'page': 1, 'total_pages': 196, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'xdvipdfmx (20200315)', 'creationDate': "D:20230303170709-00'00'", 'modDate': '', 'trapped': ''}
------
View the contents of this document:
Preface "Teacher Zhou Zhihua's "Machine Learning" (Xigua Book) is one of the classic introductory textbooks in the field of machine learning. In order to make as many readers as possible Readers have some understanding of machine learning through the Xigua book, so the details of the derivation of some formulas are not detailed in the book, but this is useful for those who want to delve deeper into the derivation of formulas. It may not be "friendly" to readers who want to learn details. This book aims to analyze the more difficult-to-understand formulas in Xigua's book and to supplement some formulas. Specific derivation details. " After reading this, you may wonder why the previous paragraph is in quotation marks, because this is just our initial reverie. Later we learned that Zhou The real reason why the teacher omits these derivation details is that he himself believes that “sophomore students with a solid foundation in science and engineering mathematics should learn about Xigua Shu” There is no difficulty in deriving the details in the book. The key points are all in the book. The omitted details should be able to be supplemented by brain or practice." So... this pumpkin book can only be regarded as my I hope that the notes I took down when I was studying on my own will help everyone become a qualified "sophomore with a solid foundation in science and engineering mathematics." "Lost student".
Instructions for use • All the contents of the Pumpkin Book are expressed based on the content of the Watermelon Book as pre-knowledge, so the best way to use the Pumpkin Book is to use the Watermelon Book For the main line, refer to the Pumpkin Book when you encounter a formula that you cannot derive or understand; • For beginners who are new to machine learning, it is strongly not recommended to study the formulas in Chapters 1 and 2 of Xigua Book. You can simply go through them and wait until you learn them. When it gets a little drifting, it’s still time to come back and nibble; • We strive (zhi) strive (neng) to explain the analysis and derivation of each formula from the perspective of basic undergraduate mathematics, so the mathematical knowledge beyond the scope We usually provide them in the form of appendices and references. Interested students can continue to study in depth along the materials we provide; • If there is no formula you want to check in the Pumpkin Book, or if you find an error somewhere in the Pumpkin Book, please do not hesitate to go to our GitHub Issues (Address: https://github.com/datawhalechina/pumpkin-book/issues) to give feedback, and in the corresponding section Submit the formula number or errata information you wish to add, and we will usually reply to you within 24 hours.
If we do not reply within 24 hours, If you want, you can contact us via WeChat (WeChat ID: at-Sm1les); Supporting video tutorial: https://www.bilibili.com/video/BV1Mh411e7VU Online reading address: https://datawhalechina.github.io/pumpkin-book (1st edition only) The latest version of PDF is available at: https://github.com/datawhalechina/pumpkin-book/releases Editorial Board Editor-in-chief: Sm1les, archwalker, jbb0523 Editorial Board: juxiao, Majingmin, MrBigFan, shanry, Ye980226 Cover design: Concept-Sm1les, Creation-Lin Wangmaosheng Acknowledgments Special thanks to awyd234, feijuan, Ggmatch, Heitao5200, huaqing89, LongJH, LilRachel, LeoLRH, Nono17, spareribs, sunchaothu, StevenLzq for their earliest contributions to the Pumpkin Book. Scan the QR code below and reply with the keyword "Pumpkin Book" to join the "Pumpkin Book Readers Exchange Group" Copyright statement This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

For markdown documents we can read them in almost exactly the same way.

from langchain_community.document_loaders.markdown import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader("../data_base/knowledge_db/prompt_engineering/1. 简介 Introduction.md")
md_pages = loader.load()

The loaded object has exactly the same structure as the one loaded from the PDF document:

print(f"The loaded variable type is: {type(md_pages)}.", f"The Markdown contains a total of {len(md_pages)} page(s).")

The loaded variable type is: <class 'list'>. The Markdown contains a total of 1 page(s).

md_page = md_pages[0]
print(f"The type of each element: {type(md_page)}.",
    f"Descriptive data for this document: {md_page.metadata}",
    f"View the contents of this document:\n{md_page.page_content[:200]}",
    sep="\n------\n")

The type of each element: <class 'langchain_core.documents.base.Document'>.
------
Descriptive data for this document: {'source': '../../data_base/knowledge_db/prompt_engineering/1. 简介 Introduction.md'}
------
View the contents of this document:
Chapter 1 Introduction

Welcome to the Prompt Engineering for Developers section. The content of this section is based on the "Prompt Engineering for Developer" course taught by Andrew Ng. The "Prompt Engineering for Developer" course is taught by Andrew Ng in collaboration with Isa Fulford, a member of the OpenAI technical team. Isa has developed the popular ChatGPT search plug-in and is teaching LLM (Larg

3.3.3 Data cleaning

We expect the data in the knowledge base to be as orderly, high-quality, and concise as possible, so we need to delete low-quality text, especially text that gets in the way of understanding. As can be seen, the PDF file read above not only has a line break \n added at the end of each line of the original layout, but also has \n inserted between pairs of symbols. We can use a regular expression to match and delete these \n characters.

import re
# Match a newline whose neighbouring characters are both non-Chinese, then remove that newline.
pattern = re.compile(r'[^\u4e00-\u9fff](\n)[^\u4e00-\u9fff]', re.DOTALL)
pdf_page.page_content = re.sub(pattern, lambda match: match.group(0).replace('\n', ''), pdf_page.page_content)
print(pdf_page.page_content)

Preface "Teacher Zhou Zhihua's "Machine Learning" (Xigua Book) is one of the classic introductory textbooks in the field of machine learning. In order to make as many readers as possible Readers have some understanding of machine learning through the Xigua book, so the details of the derivation of some formulas are not detailed in the book, but this is useful for those who want to delve deeper into the derivation of formulas. It may not be "friendly" to readers who want to learn details. This book aims to analyze the more difficult-to-understand formulas in Xigua's book and to supplement some formulas. Specific derivation details. " After reading this, you may wonder why the previous paragraph is in quotation marks, because this is just our initial reverie. Later we learned that Zhou The real reason why the teacher omits these derivation details is that he himself believes that “sophomore students with a solid foundation in science and engineering mathematics should learn about Xigua Shu” There is no difficulty in deriving the details in the book. The key points are all in the book. The omitted details should be able to be supplemented by brain or practice." So... this pumpkin book can only be regarded as my I hope that the notes I took down when I was studying on my own will help everyone become a qualified "sophomore with a solid foundation in science and engineering mathematics." "Lost student". Instructions for use • All the contents of the Pumpkin Book are expressed based on the content of the Watermelon Book as pre-knowledge, so the best way to use the Pumpkin Book is to use the Watermelon Book The main line is the main line. When you encounter a formula that you cannot derive or understand, you can refer to the Pumpkin Book; • For beginners who are new to machine learning, it is strongly not recommended to go into the formulas in Chapters 1 and 2 of the Watermelon Book. Just go through it briefly and wait until you learn it. 
When you feel a little lost, you can come back to it in time; • We strive to explain the analysis and derivation of each formula from the perspective of basic undergraduate mathematics, so the mathematical knowledge beyond the scope is We usually provide them in the form of appendices and references. Interested students can continue to study in depth along the materials we provide; • If there is no formula you want to check in the Pumpkin Book, or you find an error somewhere in the Pumpkin Book, please do not hesitate to go to our GitHub Issues (Address: https://github.com/datawhalechina/pumpkin-book/issues) to give feedback, and in the corresponding section Submit the formula number or errata information you wish to add, and we will usually reply to you within 24 hours. If we do not reply within 24 hours, If you want, you can contact us via WeChat (WeChat ID: at-Sm1les); Supporting video tutorial: https://www.bilibili.com/video/BV1Mh411e7VU Online reading address: https://datawhalechina.github.io/pumpkin-book (1st edition only) The latest version of PDF is available at: https://github.com/datawhalechina/pumpkin-book/releases Editorial Board Editor-in-chief: Sm1les, archwalker, jbb0523 Editorial Board: juxiao, Majingmin, MrBigFan, shanry, Ye980226 Cover design: Concept-Sm1les, Creation-Lin Wangmaosheng Acknowledgments Special thanks to awyd234, feijuan, Ggmatch, Heitao5200, huaqing89, LongJH, LilRachel, LeoLRH, Nono17, spareribs, sunchaothu, StevenLzq for their early contributions to the Pumpkin Book. Scan the QR code below and reply with the keyword "Pumpkin Book" to join the "Pumpkin Book Readers Exchange Group" Copyright statement This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
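
To see what the regular expression above does in isolation, here is a small sketch on an invented sample string. The helper name drop_inner_newlines and the sample text are hypothetical. The last two lines show one property of the pattern worth knowing: each match consumes both neighbouring characters, so a newline that directly follows an already matched character is left in place. A lookaround variant, shown purely as an assumed alternative and not as the method used in this chapter, avoids that.

```python
import re

# Same pattern as above: a newline whose neighbours are both non-Chinese characters.
pattern = re.compile(r'[^\u4e00-\u9fff](\n)[^\u4e00-\u9fff]')

def drop_inner_newlines(text):
    """Remove the newline inside every match of the pattern."""
    return pattern.sub(lambda m: m.group(0).replace('\n', ''), text)

# Invented sample: the first newline sits between two Chinese characters,
# the other two sit between non-Chinese characters.
sample = "公式推导\n细节,\n“友好”。A\nB"
cleaned = drop_inner_newlines(sample)
print(repr(cleaned))

# Each match consumes both neighbours, so the second newline here survives.
print(repr(drop_inner_newlines("A\nB\nC")))

# Assumed alternative: lookarounds check the neighbours without consuming them.
lookaround = re.compile(r'(?<=[^\u4e00-\u9fff])\n(?=[^\u4e00-\u9fff])')
print(repr(lookaround.sub('', "A\nB\nC")))
```

On the sample, the newline between the two Chinese characters is kept, while the newlines after the comma and between A and B are removed.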

Further analyzing the data, we find that it contains many • characters and spaces, which we can remove with the replace method.

pdf_page.page_content = pdf_page.page_content.replace('•', '')
pdf_page.page_content = pdf_page.page_content.replace(' ', '')
print(pdf_page.page_content)

Preface "Teacher Zhou Zhihua's "Machine Learning" (Xigua Book) is one of the classic introductory textbooks in the field of machine learning. In order to make as many readers as possible Readers have some understanding of machine learning through the Xigua book, so the details of the derivation of some formulas are not detailed in the book, but this is useful for those who want to delve deeper into the derivation of formulas. It may not be "friendly" to readers who want to learn details. This book aims to analyze the more difficult-to-understand formulas in Xigua's book and to supplement some formulas. Specific derivation details. " After reading this, you may wonder why the previous paragraph is in quotation marks, because this is just our initial reverie. Later we learned that Zhou The real reason why the teacher omits these derivation details is that he himself believes that “sophomore students with a solid foundation in science and engineering mathematics should learn about Xigua Shu” There is no difficulty in deriving the details in the book. The key points are all in the book. The omitted details should be able to be supplemented by brain or practice." So... this pumpkin book can only be regarded as my I hope that the notes I took down when I was studying on my own will help everyone become a qualified "sophomore with a solid foundation in science and engineering mathematics." "Lost student". Instructions for use All the contents of the Pumpkin Book are expressed using the content of the Watermelon Book as pre-knowledge, so the best way to use the Pumpkin Book is to use the Watermelon Book The main line is the main line. When you encounter a formula that you cannot derive or understand, you can refer to the Pumpkin Book. For beginners who are new to machine learning, it is strongly not recommended to go into the formulas in Chapters 1 and 2 of the Watermelon Book. Just go through it briefly and wait until you learn it. 
When you feel a little lost, you can come back to it in time; we strive to explain the analysis and derivation of each formula from the perspective of basic undergraduate mathematics, so the mathematical knowledge beyond the basics We usually provide them in the form of appendices and references. Interested students can continue to study in depth along the materials we have given; if there is no formula you want to check in the Pumpkin Book, or if you find an error somewhere in the Pumpkin Book, please do not hesitate to go to our GitHub Issues (Address: https://github.com/datawhalechina/pumpkin-book/issues) to give feedback, and in the corresponding section Submit the formula number or errata information you wish to add. We will usually reply to you within 24 hours. If we do not reply within 24 hours, If you want, you can contact us via WeChat (WeChat ID: at-Sm1les); Supporting video tutorial: https://www.bilibili.com/video/BV1Mh411e7VU Online reading address: https://datawhalechina.github.io/pumpkin-book (1st edition only) The latest version of PDF is available at: https://github.com/datawhalechina/pumpkin-book/releases Editorial Board Editor-in-chief: Sm1les, archwalker, jbb0523 Editorial Board: juxiao, Majingmin, MrBigFan, shanry, Ye980226 Cover design: Concept-Sm1les, Creation-Lin Wangmaosheng Acknowledgments Special thanks to awyd234, feijuan, Ggmatch, Heitao5200, huaqing89, LongJH, LilRachel, LeoLRH, Nono17, spareribs, sunchaothu, StevenLzq for their early contributions to the Pumpkin Book. Scan the QR code below and reply with the keyword "Pumpkin Book" to join the "Pumpkin Book Readers Exchange Group" Copyright statement This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Each paragraph of the Markdown file read above is separated by a blank line (two newline characters). We can also use the replace method to collapse each blank line into a single newline.

md_page.page_content = md_page.page_content.replace('\n\n', '\n')
print(md_page.page_content)

Chapter 1 Introduction Welcome to the Prompt Engineering for Developers section. The content of this section is based on the "Prompt Engineering for Developer" course taught by Andrew Ng. The course is taught by Andrew Ng in cooperation with Isa Fulford, a member of the OpenAI technical team. Isa developed the popular ChatGPT search plug-in and has made great contributions to teaching the application of LLM (Large Language Model) technology in products. She also co-wrote the OpenAI cookbook that teaches people to use prompts. We hope that through this module we can share with you the best practices and techniques for developing LLM applications using prompts. There is a lot of material on the Internet about prompt word (prompt (this term will be retained in this tutorial)) design, such as articles such as "30 prompts everyone has to know". These articles mainly focus on the web interface of ChatGPT, which many people use to perform specific, usually one-time tasks. But we believe that for developers, the more powerful feature of large language models (LLM) is that it can be called through API interfaces to quickly build software applications. In fact, we learned that the team at DeepLearning.AI’s sister company AI Fund has been working with many startups to apply these technologies to many applications. It’s exciting to see how the LLM API enables developers to build applications very quickly. In this module, we will share with readers various techniques and best practices for getting better results from large language models. The book covers a wide range of topics, including typical application scenarios of language models such as prompt design for software development, text summarization, inference, transformation, expansion, and building chatbots. We sincerely hope that this course will inspire readers to develop better language model applications.
As LLMs have developed, they have come to be roughly divided into two types, referred to below as base LLMs and instruction-tuned LLMs. A base LLM is trained on text data to predict the next word. It is typically trained on large amounts of data from the Internet and other sources to determine the most likely words that follow. For example, if you take "Once upon a time, there was a unicorn" as a prompt, the base LLM might go on to predict "She lived in a magical forest with her unicorn friends." However, if you take "What is the capital of France" as the prompt, the base LLM may predict "What is the largest city in France? What is the population of France?", because articles on the Internet are likely to be lists of quiz questions about France. Unlike base language models, instruction-tuned LLMs are specially trained to better understand and follow instructions. For example, when asked "What is the capital of France?", this type of model is likely to answer directly "The capital of France is Paris." The training of an instruction-tuned LLM usually starts from a pre-trained language model: it is first pre-trained on large-scale text data to master the basic regularities of language, and is then further trained and fine-tuned on inputs that are instructions and outputs that are the correct replies to those instructions. Sometimes RLHF (reinforcement learning from human feedback) is also used to further enhance the model's ability to follow instructions based on human feedback on the model output. Through this controlled training process, instruction-tuned LLMs can produce output that is highly responsive to instructions, safer and more reliable, and less likely to be irrelevant or harmful.
Therefore, many practical applications have turned to using such large language models. This course will accordingly focus on best practices for instruction-tuned LLMs, which we recommend for most use cases. When you use an instruction-tuned LLM, you can think of it as giving instructions to another person (assuming that person is smart but doesn't know the specific details of your task). So when an LLM doesn't work properly, it is sometimes because the instructions aren't clear enough. For example, if you want to ask "Write something for me about Alan Turing," it might be more helpful to make it clear that you want the text to focus on his scientific work, personal life, historical role, or other aspects. In addition, you can specify the tone of the answer to better meet your needs, for example the style of a professional journalist or of a casual note to a friend. If you think of the LLM as a new graduate whom you ask to complete this task, you can even specify in advance which text fragments they should read before writing about Alan Turing, which helps them complete the task better. The next chapter of this book explains in detail two key principles of prompt design: clarity and sufficient time to think.

3.3.4 Document segmentation

A single document is often longer than the context window the model supports, so retrieved knowledge that is too long would exceed the model's processing capacity. Therefore, when building a vector knowledge base, we usually split each document into several chunks by length or by fixed rules, then convert each chunk into a vector and store it in the vector database.

During retrieval, the chunk is the basic unit: the k chunks retrieved for each query are the knowledge the model can refer to when answering the user's question. We are free to choose k.

LangChain's text splitters all split text based on chunk_size (chunk size) and chunk_overlap (the size of the overlap between chunks).

  • chunk_size refers to the number of characters or Tokens (such as words, sentences, etc.) contained in each chunk

  • chunk_overlap refers to the number of characters shared between two chunks, which is used to maintain the coherence of the context and avoid losing context information during segmentation
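
To make the two parameters concrete, here is a minimal sketch of fixed-size splitting with overlap in plain Python. The function name split_with_overlap is hypothetical and this is not how LangChain's splitters are implemented (they also respect separators); it only shows how chunk_size and chunk_overlap interact when length is measured in characters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, which is the context that chunk_overlap is meant to preserve across a boundary.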

Langchain provides a variety of document segmentation methods. The difference lies in how to determine the boundaries between blocks, which characters/tokens a block consists of, and how to measure the block size.

  • RecursiveCharacterTextSplitter(): Split text by string, recursively try to split text by different separators.
  • CharacterTextSplitter(): Split text by characters.
  • MarkdownHeaderTextSplitter(): Split markdown files based on the specified header.
  • TokenTextSplitter(): Split text by token.
  • SentenceTransformersTokenTextSplitter(): Split text by token.
  • Language: an enumeration of the languages (CPP, Python, Ruby, Markdown, etc.) supported by language-aware splitting.
  • NLTKTextSplitter(): Split text by sentences using NLTK (Natural Language Toolkit).
  • SpacyTextSplitter(): Use Spacy to split text by sentence.
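
As a rough illustration of the recursive strategy behind RecursiveCharacterTextSplitter(), the sketch below tries coarse separators first and falls back to finer ones only for pieces that are still too long. The names recursive_split and merge_pieces are hypothetical, chunk_overlap is ignored for brevity, and the code is a simplification under these assumptions rather than LangChain's actual implementation.

```python
def merge_pieces(pieces, sep, chunk_size):
    """Greedily join neighbouring pieces with sep while the result still fits."""
    merged, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged


def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters, preferring coarse separators."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, small = [], []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            if part:
                small.append(part)
        else:
            # Flush the short pieces collected so far, then split the long part more finely.
            chunks.extend(merge_pieces(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(part, chunk_size, finer))
    chunks.extend(merge_pieces(small, sep, chunk_size))
    return chunks


text = "Para one is short.\n\nPara two is a bit longer than the limit allows.\n\nEnd."
result = recursive_split(text, chunk_size=30)
print(result)
```

Only the second paragraph exceeds the limit, so only that paragraph is broken at word boundaries; the other two paragraphs stay intact.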
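''' 
* RecursiveCharacterTextSplitter: recursive character text splitting
RecursiveCharacterTextSplitter splits text recursively on different characters
    (in the priority order ["\n\n", "\n", " ", ""]),
    which keeps semantically related content together for as long as possible.
RecursiveCharacterTextSplitter has four parameters worth noting:

* separators - list of separator strings
* chunk_size - maximum number of characters in each chunk
* chunk_overlap - length of the overlap between two adjacent chunks
* length_function - function used to measure length
'''
# Import the text splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Length of a single text chunk in the knowledge base
CHUNK_SIZE = 500

# Overlap length between adjacent chunks in the knowledge base
OVERLAP_SIZE = 50
# Use the recursive character text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=OVERLAP_SIZE
)
text_splitter.split_text(pdf_page.page_content[0:1000])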

['Preface\n"Teacher Zhou Zhihua's "Machine Learning" (Xigua Book) is one of the classic introductory textbooks in the field of machine learning. In order to let as many readers as possible\n learn about machine learning through Xigua Book, Teacher Zhou does not elaborate on the derivation details of some formulas in the book. However, this may not be "unfriendly" to readers who want to delve into the details of formula derivation\n. This book aims to analyze the more difficult to understand formulas in Xigua Book, and to explain some of them. The formula adds\nspecific derivation details. "\nAfter reading this, you may wonder why the quotation marks are added to the previous paragraph, because this is just our initial reverie. Later we learned that Teacher Zhou omitted these deductions. The real reason for explaining the details is that he personally believes that "sophomore students with a solid foundation in science and engineering mathematics should have no difficulty with the details of the derivation in the Xigua book\n. The key points are all in the book, and the omitted details should be able to make up for it in their heads or do exercises." So... this pumpkin book can only be regarded as the notes that I\nother math bastards took down when they were studying on their own. I hope it can help everyone become a qualified "sophomore student with a solid foundation in mathematics in science and engineering." \nInstructions for use\nAll the contents of the Pumpkin Book are expressed with the content of the Watermelon Book as pre-knowledge, so the best way to use the Pumpkin Book is to use the Watermelon Book\n as the main line. When you encounter formulas that you cannot derive or understand, refer to the Pumpkin Book; for beginners who are new to machine learning, it is strongly not recommended to go into the formulas in Chapters 1 and 2 of the Watermelon Book. 
Just go through it briefly and wait until you learn it', 'When I feel a little lost, I can come back to read it in time; we strive to explain the analysis and derivation of each formula from the perspective of undergraduate mathematics basics, so the mathematical knowledge beyond the program\n is usually given in the form of appendices and references. Interested students can continue to study in depth along the materials we have provided; if there is no formula you want to check in the Pumpkin Book, or if you find an error somewhere in the Pumpkin Book, please do not hesitate to go to our GitHub\nIssues (Address: https://github.com/datawhalechina/pumpkin-book/issues) to give feedback and submit, in the corresponding section,\nthe formula number you would like added or the erratum; we usually reply within 24 hours, and if you have not received a reply after 24 hours\nyou can contact us on WeChat (WeChat ID: at-Sm1les);\nCompanion video tutorial: https://www.bilibili.com/video/BV1Mh411e7VU\nOnline reading address: https://datawhalechina.github.io/pumpkin-book (1st edition only)\nLatest PDF download address: https://github.com/datawhalechina/pumpkin-book/releases\nEditorial Board', 'Editorial Board\nEditor-in-Chief: Sm1les, archwalker']

split_docs = text_splitter.split_documents(pdf_pages)
print(f"Number of files after splitting: {len(split_docs)}")

Number of files after splitting: 711

print(f"Number of characters after segmentation (can be used to roughly evaluate the number of tokens): {sum([len(doc.page_content) for doc in split_docs])}")

Number of characters after segmentation (can be used to roughly evaluate the number of tokens): 305816

Note: How documents are segmented is a core step in data processing and often determines the lower bound of the retrieval system's performance. The right segmentation method, however, depends heavily on the business scenario: different businesses and different source data often need their own tailored segmentation methods. In this chapter, we therefore simply split documents by chunk_size. Readers interested in exploring further are welcome to read the project examples in Part 3 to see how existing projects segment documents.

3.4 Build and use vector database

3.4.1 Preliminary configuration

This section focuses on building and using a vector database, so after reading the data we skip the data processing steps and go straight to the topic. For data cleaning and other steps, please refer to Section 3 of this chapter.

import os
from dotenv import load_dotenv, find_dotenv

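# Read the local/project environment variables.
# find_dotenv() searches for and locates the path of the .env file
# load_dotenv() reads that .env file and loads its environment variables into the current runtime environment
# If you have set global environment variables, this line has no effect.
_ = load_dotenv(find_dotenv())

# If you need to access the network through a proxy port, configure it as follows
# os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'
# os.environ["HTTP_PROXY"] = 'http://127.0.0.1:7890'

# Collect the paths of all files under folder_path and store them in file_paths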
file_paths = []
folder_path = '../../data_base/knowledge_db'
for root, dirs, files in os.walk(folder_path):
    for file in files:
        file_path = os.path.join(root, file)
        file_paths.append(file_path)
print(file_paths[:3])

['../../data_base/knowledge_db/prompt_engineering/6. Text Transformation Transforming.md', '../../data_base/knowledge_db/prompt_engineering/4. Text summary Summarizing.md', '../../data_base/knowledge_db/prompt_engineering/5. Inferring.md']

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import UnstructuredMarkdownLoader

# Iterate over the file paths and store the instantiated loaders in loaders
loaders = []

for file_path in file_paths:
    file_type = file_path.split('.')[-1]
    if file_type == 'pdf':
        loaders.append(PyMuPDFLoader(file_path))
    elif file_type == 'md':
        loaders.append(UnstructuredMarkdownLoader(file_path))
# Load the files and store the results in texts
texts = []

for loader in loaders:
    texts.extend(loader.load())

After loading, each element has the type langchain_core.documents.base.Document, which has two attributes:

  • page_content: the contents of the document.
  • metadata: descriptive data related to the document.
text = texts[1]
print(f"The type of each element: {type(text)}.", 
    f"Descriptive data for this document: {text.metadata}", 
    f"View the contents of this document:\n{text.page_content[0:]}", 
    sep="\n------\n")

The type of each element: <class 'langchain_core.documents.base.Document'>.
------
Descriptive data for this document: {'source': '../../data_base/knowledge_db/prompt_engineering/4. Text Summary Summarizing.md'}
------
View the contents of this document:
Chapter 4 Text Summary

In the busy information age, Xiao Ming is an enthusiastic developer and faces the challenge of processing massive text information. He needed to research countless documents to find key information for his project, but there was never enough time. When he was struggling, he discovered the text summarization function of large language models (LLM).

This function is like a beacon to Xiao Ming, illuminating his way to deal with the ocean of information. The powerful ability of LLM is that it can simplify complex text information and extract key points, which is undoubtedly a huge help to him. He no longer needs to spend a lot of time reading all the documents, he only needs to use LLM to summarize them, and he can quickly obtain the information he needs.

By calling the API interface programmatically, Xiao Ming successfully implemented this text summary function. He sighed: "This is like magic, turning the endless ocean of information into a clear source of information." Xiao Ming's experience demonstrates the huge advantages of LLM's text summary function: saving time, improving efficiency, and accurately obtaining information. This is what we are going to introduce in this chapter. Let us explore how to use programming and call API interfaces to master this powerful tool.

  1. Single text summary

Take the task of summarizing product reviews as an example: For e-commerce platforms, there are often a large number of product reviews on the website, and these reviews reflect the thoughts of all customers. If we have a tool to summarize these massive and lengthy reviews, we can quickly browse more reviews and gain insight into customer preferences, thereby guiding the platform and merchants to provide better services.

Next, we provide an online product review as an example, which may come from an online shopping platform, such as Amazon, Taobao, JD.com, etc. The evaluator reviewed a panda doll. The evaluation included factors such as the quality, size, price, and logistics speed of the product, as well as how much his daughter liked the product.

python prod_review = """ This panda doll is a birthday gift for my daughter. She likes it very much and takes it with her everywhere. The doll is soft, super cute, and has a kind facial expression. But compared to the price, it is a bit small. I feel that I can buy a bigger one at the same price elsewhere. The express arrived one day earlier than expected, so I played with it myself before giving it to my daughter. """

1.1 Limit the length of output text

We first try to limit the length of the text to 30 words.

```python from tool import get_completion

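prompt = f""" Your task is to generate a short summary of a product review from an e-commerce website.

Please summarize the review text between the three backticks in at most 30 words.

Review: {prod_review} """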

response = get_completion(prompt) print(response) ```

The panda doll is soft and cute, and my daughter likes it, but it’s a bit small. The express arrived one day early.

We can see that the language model gives us a result that meets the requirements.

Note: In the previous section we mentioned that language models rely on tokenizers when calculating and judging text length, and tokenizers do not have perfect accuracy in character statistics.

1.2 Set key angles to focus on

In some cases, we will focus on the text differently for different business scenarios. For example, in product review text, the logistics department may focus more on the timeliness of transportation, the merchant may focus more on price and product quality, and the platform may focus more on the overall user experience.

We can emphasize our emphasis on a specific perspective by enhancing the input prompt (Prompt).

1.2.1 Focus on express delivery services

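```python prompt = f""" Your task is to generate a short summary of a product review from an e-commerce website.

Please summarize the review text between the three backticks in at most 30 words, focusing on the express delivery service.

Review: {prod_review} """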

response = get_completion(prompt) print(response) ```

The express arrived early. The doll is cute but a bit small.

From the output, we can see that the text begins with "The express arrived early", which reflects the emphasis on express delivery efficiency.

1.2.2 Focus on price and quality

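```python prompt = f""" Your task is to generate a short summary of a product review from an e-commerce website.

Please summarize the review text between the three backticks in at most 30 words, focusing on the product's price and quality.

Review: {prod_review} """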

response = get_completion(prompt) print(response) ```

Cute panda figurine, good quality but a bit small and slightly expensive. The express arrived early.

From the output, we can see that the text begins with "Cute panda figurine, good quality but a bit small and slightly expensive", which reflects the emphasis on product price and quality.

1.3 Key information extraction

In Section 1.2, adding a Prompt that focuses on a key perspective did make the summary concentrate more on that aspect, but some other information was still retained in the results. For example, the summary that focuses on price and quality still retains the information "The express arrived early". If we only want to extract information from a certain angle and filter out everything else, we can ask the LLM to perform text extraction (Extract) instead of summarization (Summarize).

Let’s extract information from the text together!

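```python prompt = f""" Your task is to extract relevant information from a product review on an e-commerce website.

From the review text between the three backticks below, please extract the information related to product shipping, in at most 30 words.

Review: {prod_review} """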

response = get_completion(prompt) print(response) ```

Information related to product transportation: Express delivery arrives one day in advance.

  2. Summarize multiple texts at the same time

In actual workflows, we often have to deal with a large amount of review text. The following example collects multiple user reviews in a list, and uses a for loop and text summary (Summarize) prompt words to summarize the reviews to less than 20 words and print them in order. Of course, in actual production, for comment texts of different sizes, in addition to using for loops, you may also need to consider integrating comments, distribution and other methods to improve computing efficiency. You can build a main control panel to summarize a large number of user comments and facilitate quick browsing by you or others. You can also click to view the original comments. In this way, you can effectively capture all the thoughts of your customers.

```python review_1 = prod_review

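Review of a floor lamp

review_2 = """ I needed a nice bedroom lamp, and this one not only has extra storage but is also not too expensive. Delivery was very fast, arriving in just two days. However, the lamp's pull cord had a problem during shipping; fortunately, the company was happy to send a brand-new one. The new cord also arrived quickly, within a few days. Assembly was very easy. However, I later found that a part was missing, so I contacted customer service and they quickly sent me the missing part! To me, this is an excellent company that cares a lot about its customers and products. """

Review of an electric toothbrush

review_3 = """ My dental hygienist recommended an electric toothbrush, so I bought this one. So far, the battery life has been quite good. After the initial charge, I kept the charger plugged in for the first week to condition the battery. For the past 3 weeks I have used it to brush my teeth every morning and evening, and the battery has still held its original charge. However, the brush head is too small. I have seen baby toothbrushes with bigger heads than this one. I wish the brush head were bigger, with bristles of different lengths, so it could clean the gaps between teeth better, but this toothbrush cannot do that. Overall, if you can buy this toothbrush for around $50, it is a good deal. The manufacturer's replacement heads are quite expensive, but you can buy more reasonably priced generic heads. This toothbrush makes me feel like I have been to the dentist every day, and my teeth feel very clean! """

Review of a blender

review_4 = """ In November, this 17-piece set was still on seasonal sale for about $49, around 50% off. But for some reason (we can call it a price hike), by the second week of December all the prices had gone up, and the same set rose to anywhere between $70 and $89. The 11-piece set also went up by about $10 from its earlier price of $29. It looks okay, but if you look closely at the base, the part where the blade locks in does not look as good as in versions from previous years. Still, I plan to use it very carefully (for example, I will first grind hard foods such as beans, ice, and rice in the blender, then grind them to the desired size, and then switch to the whisk blade for a finer flour if I need finer or less pulpy food). When making smoothies, I cut the fruits and vegetables I will use into small pieces and freeze them (if using spinach, I lightly cook the spinach first and then freeze it until ready to use. If making sorbet, I use a small to medium-sized food processor), so that you can avoid adding too much ice. After about a year, the motor started making a strange noise. I called customer service, but the warranty had already expired, so I had to buy another one. It is worth noting that the overall quality of this type of product has declined over the past few years, so they are relying to some extent on brand recognition and consumer loyalty to maintain sales. I received the new blender in about two days. """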

reviews = [review_1, review_2, review_3, review_4]

```

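```python for i in range(len(reviews)): prompt = f""" Your task is to extract relevant information from a product review on an e-commerce website.

Please summarize the review text between the three backticks in at most 20 words.

Review text: ```{reviews[i]}```
"""
response = get_completion(prompt)
print(f"Review {i+1}: ", response, "\n")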

```

Review 1: The panda doll is a birthday gift that my daughter likes. It is soft and cute with a kind facial expression, but a bit small for the price. The express arrived one day early.

Review 2: Beautiful bedroom lamp, storage function, fast delivery, light cord problem, quick fix, easy to assemble, care about customers and product.

Review 3: This electric toothbrush has good battery life, but the toothbrush head is too small. The price is reasonable and the cleaning effect is good.

Review 4: This review mentions a 17-piece set that was on sale in November but saw a price increase in December. The reviewer mentioned the appearance and use of the product, as well as the declining quality of this type of product. Finally, the reviewer mentioned that they purchased another blender.

  3. English version

1.1 Single text summary

python prod_review = """ Got this panda plush toy for my daughter's birthday, \ who loves it and takes it everywhere. It's soft and \ super cute, and its face has a friendly look. It's \ a bit small for what I paid though. I think there \ might be other options that are bigger for the \ same price. It arrived a day earlier than expected, \ so I got to play with it myself before I gave it \ to her. """

```python prompt = f""" Your task is to generate a short summary of a product \ review from an ecommerce site.

Summarize the review below, delimited by triple backticks, in at most 30 words.

Review: {prod_review} """

response = get_completion(prompt) print(response) ```

This panda plush toy is loved by the reviewer's daughter, but they feel it is a bit small for the price.

1.2 Set key angles to focus on

1.2.1 Focus on express delivery services

```python prompt = f""" Your task is to generate a short summary of a product \ review from an ecommerce site to give feedback to the \ Shipping department.

Summarize the review below, delimited by triple backticks, in at most 30 words, and focusing on any aspects \ that mention shipping and delivery of the product.

Review: {prod_review} """

response = get_completion(prompt) print(response) ```

The customer is happy with the product but suggests offering larger options for the same price. They were pleased with the early delivery.

1.2.2 Focus on price and quality

```python prompt = f""" Your task is to generate a short summary of a product \ review from an ecommerce site to give feedback to the \ pricing department, responsible for determining the \ price of the product.

Summarize the review below, delimited by triple backticks, in at most 30 words, and focusing on any aspects \ that are relevant to the price and perceived value.

Review: {prod_review} """

response = get_completion(prompt) print(response) ```

The customer loves the panda plush toy for its softness and cuteness, but feels it is overpriced compared to other options available.

1.3 Key information extraction

```python prompt = f""" Your task is to extract relevant information from \ a product review from an ecommerce site to give \ feedback to the Shipping department.

From the review below, delimited by triple quotes \ extract the information relevant to shipping and \ delivery. Limit to 30 words.

Review: {prod_review} """

response = get_completion(prompt) print(response) ```

The shipping department should take note that the product arrived a day earlier than expected.

2.1 Summarize multiple texts at the same time

```python review_1 = prod_review

review for a standing lamp

review_2 = """ Needed a nice lamp for my bedroom, and this one \ had additional storage and not too high of a price \ point. Got it fast - arrived in 2 days. The string \ to the lamp broke during the transit and the company \ happily sent over a new one. Came within a few days \ as well. It was easy to put together. Then I had a \ missing part, so I contacted their support and they \ very quickly got me the missing piece! Seems to me \ to be a great company that cares about their customers \ and products. """

review for an electric toothbrush

review_3 = """ My dental hygienist recommended an electric toothbrush, \ which is why I got this. The battery life seems to be \ pretty impressive so far. After initial charging and \ leaving the charger plugged in for the first week to \ condition the battery, I've unplugged the charger and \ been using it for twice daily brushing for the last \ 3 weeks all on the same charge. But the toothbrush head \ is too small. I’ve seen baby toothbrushes bigger than \ this one. I wish the head was bigger with different \ length bristles to get between teeth better because \ this one doesn’t. Overall if you can get this one \ around the $50 mark, it's a good deal. The manufactuer's \ replacements heads are pretty expensive, but you can \ get generic ones that're more reasonably priced. This \ toothbrush makes me feel like I've been to the dentist \ every day. My teeth feel sparkly clean! """

review for a blender

review_4 = """ So, they still had the 17 piece system on seasonal \ sale for around $49 in the month of November, about \ half off, but for some reason (call it price gouging) \ around the second week of December the prices all went \ up to about anywhere from between $70-$89 for the same \ system. And the 11 piece system went up around $10 or \ so in price also from the earlier sale price of $29. \ So it looks okay, but if you look at the base, the part \ where the blade locks into place doesn’t look as good \ as in previous editions from a few years ago, but I \ plan to be very gentle with it (example, I crush \ very hard items like beans, ice, rice, etc. in the \ blender first then pulverize them in the serving size \ I want in the blender then switch to the whipping \ blade for a finer flour, and use the cross cutting blade \ first when making smoothies, then use the flat blade \ if I need them finer/less pulpy). Special tip when making \ smoothies, finely cut and freeze the fruits and \ vegetables (if using spinach-lightly stew soften the \ spinach then freeze until ready for use-and if making \ sorbet, use a small to medium sized food processor) \ that you plan to use that way you can avoid adding so \ much ice if at all-when making your smoothie. \ After about a year, the motor was making a funny noise. \ I called customer service but the warranty expired \ already, so I had to buy another one. FYI: The overall \ quality has gone done in these types of products, so \ they are kind of counting on brand recognition and \ consumer loyalty to maintain sales. Got it in about \ two days. """

reviews = [review_1, review_2, review_3, review_4] ```

```python for i in range(len(reviews)): prompt = f""" Your task is to generate a short summary of a product \ review from an ecommerce site.

Summarize the review below, delimited by triple \
backticks in at most 20 words.

Review: ```{reviews[i]}```
"""
response = get_completion(prompt)
print(i, response, "\n")

```

0 Soft and cute panda plush toy loved by daughter, but small for the price. Arrived early.

1 Great lamp with storage, fast delivery, excellent customer service, and easy assembly. Highly recommended.

2 Impressive battery life, but toothbrush head is too small. Good deal if bought around $50.

3 The reviewer found the price increase after the sale disappointing and noticed a decrease in quality over time.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50)

split_docs = text_splitter.split_documents(texts)

3.4.2 Build Chroma vector library

LangChain integrates with more than 30 different vector stores. We chose Chroma because it is lightweight and keeps data in memory, which makes it very easy to start up and use.

LangChain can directly use the Embedding of OpenAI and Baidu Qianfan. We can also wrap Embedding APIs that it does not support. For example, based on the interface LangChain provides, we can wrap a zhipuai_embedding that connects Zhipu's Embedding API to LangChain. In this chapter's [Attached explanation of LangChain custom Embedding encapsulation], we take the Zhipu Embedding API as an example to show how to wrap other Embedding APIs for LangChain; interested readers are welcome to read it.

Note: If you use the Zhipu AI API, you can follow that explanation to implement the wrapper code yourself, or use our wrapper code [zhipuai_embedding.py] directly: download it to the same directory as this Notebook, and you can then import our wrapped functions. In the code cell below, we use Zhipu's Embedding by default and show the usage code for the other two Embeddings as comments. If you are using the Baidu API or the OpenAI API, adapt the code in the cell below accordingly.
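
To give a feel for what such a wrapper has to provide, here is a minimal sketch. LangChain's Embeddings interface centres on two methods, embed_documents and embed_query. The class ToyHashEmbeddings below is purely hypothetical: it produces deterministic toy vectors from word hashes instead of calling a real model, and it does not inherit from any LangChain class so that it runs without dependencies. A real wrapper would subclass LangChain's Embeddings base class and replace _embed with a call to the vendor's API.

```python
import hashlib
import math


class ToyHashEmbeddings:
    """Hypothetical stand-in for a wrapped embedding API (not a real model)."""

    def __init__(self, dim: int = 8):
        self.dim = dim

    def _embed(self, text: str) -> list[float]:
        # Count words into hash buckets, then L2-normalise the vector.
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.md5(token.encode("utf-8")).digest()
            vec[digest[0] % self.dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        """Embed a batch of document chunks."""
        return [self._embed(t) for t in texts]

    def embed_query(self, text: str) -> list[float]:
        """Embed a single query string."""
        return self._embed(text)


embedding_sketch = ToyHashEmbeddings(dim=8)
doc_vectors = embedding_sketch.embed_documents(["large language model", "vector database"])
query_vector = embedding_sketch.embed_query("large language model")
print(len(doc_vectors), len(query_vector))
```

The vector store only ever calls these two methods, which is why any embedding service can be plugged in once it is wrapped this way.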

# Use the OpenAI Embedding
# from langchain.embeddings.openai import OpenAIEmbeddings
# Use the Baidu Qianfan Embedding
# from langchain.embeddings.baidu_qianfan_endpoint import QianfanEmbeddingsEndpoint
# Use our own wrapped Zhipu Embedding; the wrapper code must be downloaded locally first
from zhipuai_embedding import ZhipuAIEmbeddings

# Define the Embeddings
# embedding = OpenAIEmbeddings() 
embedding = ZhipuAIEmbeddings()
# embedding = QianfanEmbeddingsEndpoint()

# Define the persistence path
persist_directory = '../../data_base/vector_db/chroma'
!rm -rf '../../data_base/vector_db/chroma'  # Delete old database files (if the folder contains any); on Windows, delete them manually
from langchain_community.vectorstores import Chroma

vectordb = Chroma.from_documents(
    documents=split_docs,
    embedding=embedding,
    persist_directory=persist_directory  # Lets us save the database to the persist_directory directory on disk
)
print(f"Number stored in vector library: {vectordb._collection.count()}")

Number stored in vector library: 1004

3.4.3 Vector retrieval

Chroma's similarity search uses cosine distance, that is:

$$
\mathrm{similarity} = \cos(A, B) = \frac{A \cdot B}{\parallel A \parallel \parallel B \parallel} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\sqrt{\sum_{i=1}^{n} b_i^2}}
$$

where $a_i$ and $b_i$ are the components of vectors $A$ and $B$, respectively.
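
The cosine similarity formula can be checked directly in a few lines of plain Python. This is a minimal sketch for illustration; the function name cosine_similarity is our own, and a vector database computes the same quantity with optimized numeric code rather than Python loops.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return cos(A, B) = (A . B) / (|A| * |B|) for two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1 regardless of their length, orthogonal vectors score 0, and opposite vectors score -1, which is why the measure suits comparing texts of different lengths.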

When you need the database to return results strictly sorted by cosine similarity, you can use the similarity_search function.

question="What is a large language model"
sim_docs = vectordb.similarity_search(question,k=3)
print(f"Number of items retrieved: {len(sim_docs)}")

Number of items retrieved: 3

for i, sim_doc in enumerate(sim_docs):
    print(f"Retrieved content {i}: \n{sim_doc.page_content[:200]}", end="\n--------------\n")

Retrieved content 0: 
There is a lot of material on the Internet about prompt word (prompt (this term will be retained in this tutorial)) design, such as articles such as "30 prompts everyone has to know". These articles mainly focus on the web interface of ChatGPT, which many people use to perform specific, usually one-time tasks. But we believe that for developers, the more powerful feature of large language models (LLM) is that it can be called through API interfaces to quickly build software applications. In fact, we learned that Deep
--------------
Retrieved content 1: 
Chapter 6 Text Conversion

The large language model has powerful text conversion capabilities and can realize different types of text conversion tasks such as multi-language translation, spelling correction, grammar adjustment, and format conversion. Using language models to perform various transformations is one of its typical applications.

In this chapter, we will introduce how to call the API interface through programming and use the language model to implement the text conversion function. Through code examples, readers can learn specific methods to convert input text into the desired output format.

Mastering the skill of calling large language model interfaces for text conversion is an important step in developing various language applications. arts
--------------
Retrieved content 2: 
Total cost for student calculations: 450x + $100,000 Actual calculated total cost: 360x + $100,000 Are student calculated fees and actual calculated fees the same: No Are the student's solution and the actual solution the same: No Student Grade: Incorrect

  1. Limitations

When developing applications related to large models, please remember:

False knowledge: The model occasionally generates knowledge that appears to be real but is actually made up.

When developing and applying language models, one needs to be aware of the risk that they may generate false information. Although the model has been extensively pre-trained and has a wealth of knowledge, it
--------------

If we consider only the relevance of the retrieved content, the results tend to be too homogeneous and important information may be lost.

Maximal marginal relevance (MMR) helps us increase the diversity of the results while keeping them relevant.

The core idea is that, once a highly relevant document has been selected, the next document chosen should be one that is less similar to the already-selected documents but still informative. This increases the diversity of the content while maintaining relevance and avoids overly monotonous results.
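
The selection rule can be sketched in plain Python as follows. This is a minimal illustration of the standard MMR criterion, not Chroma's or LangChain's internal code: at each step it picks the candidate that maximizes lambda_mult * relevance - (1 - lambda_mult) * redundancy, where redundancy is the highest similarity to any document already selected. The function names and the toy two-dimensional vectors are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return the indices of up to k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 1.0]
docs = [
    [1.0, 0.9],  # most relevant
    [1.0, 0.8],  # almost a duplicate of the first document
    [0.2, 1.0],  # less relevant but different
]
diverse = mmr_select(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With lambda_mult=0.5 the near-duplicate is skipped in favour of the more distinct document, while lambda_mult=1.0 reduces to plain relevance ranking and returns the two near-identical documents.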

mmr_docs = vectordb.max_marginal_relevance_search(question,k=3)
for i, sim_doc in enumerate(mmr_docs):
    print(f"MMR retrieved content {i}: \n{sim_doc.page_content[:200]}", end="\n--------------\n")

MMR retrieved content 0: 
There is a lot of material on the Internet about prompt word (prompt (this term will be retained in this tutorial)) design, such as articles such as "30 prompts everyone has to know". These articles mainly focus on the web interface of ChatGPT, which many people use to perform specific, usually one-time tasks. But we believe that for developers, the more powerful feature of large language models (LLM) is that it can be called through API interfaces to quickly build software applications. In fact, we learned that Deep
--------------
MMR retrieved content 1: 
Total cost for student calculations: 450x + $100,000 Actual calculated total cost: 360x + $100,000 Are student calculated fees and actual calculated fees the same: No Are the student's solution and the actual solution the same: No Student Grade: Incorrect

  1. Limitations

When developing applications related to large models, please remember:

False knowledge: The model occasionally generates knowledge that appears to be real but is actually made up.

When developing and applying language models, one needs to be aware of the risk that they may generate false information. Although the model has been extensively pre-trained and has a wealth of knowledge, it
--------------
MMR retrieved content 2: 
Line derivation. For any sample, without considering the sample itself (that is, a priori), if you make a blind guess at the probability that it is generated by the i-th Gaussian mixture component P (zj = i), then guessing must be based on the prior probabilities α1, α2, . . . , αk, that is, P (zj = i) = αi. If we consider the information brought by the sample itself, information (i.e. posteriori), at this time guess the probability pM (zj = i | Rate pM (zj =
--------------