Chapter 5 System Evaluation and Optimization

5.1 How to evaluate LLM applications

5.1.1 General idea of verification evaluation

Now, we have built a simple, generalized large model application. Looking back at the entire development process, we can find that large model development, which centers on calling and playing large models, pays more attention to verification and iteration than traditional AI development. Since you can quickly build an LLM-based application, define a prompt in minutes, and get feedback within hours, stopping to collect a thousand test samples can become extremely tedious. Because now, you can get the results without any training samples.

So when building an application using LLM, you might go through the following process: First, you'll tweak Prompt in a small sample of one to three samples and try to make it work on those samples. Later, when you further test the system, you may encounter some tricky examples that cannot be solved by prompts or algorithms. This is the challenge faced by developers building applications using LLM. In this case, you can add these extra few examples to the collection you're testing, organically adding otherwise unwieldy examples. Eventually, you'll add enough of these examples to your growing development set that manually running each one to test Prompt becomes inconvenient. Then you start developing some metrics for measuring performance on these small sample sets, such as average accuracy. The interesting thing about this process is that if you feel that your system is good enough, you can stop at any time and no longer improve it. In fact, many deployed applications stop at step one or two, and they work perfectly fine.

In this chapter, we will introduce the general methods of verification and evaluation of large model applications one by one, and design the verification iteration process of this project to achieve the optimization of application functions. However, please note that since system evaluation and optimization is a topic closely related to business, we mainly focus on theoretical introduction in this chapter, and readers are welcome to actively carry out self-practice and exploration.

We will first introduce several methods for large model development evaluation. For tasks with simple standard answers, evaluation is easy to implement; however, large model development generally requires the implementation of complex generation tasks. How to implement evaluation without simple answers or even standard answers can accurately reflect the effect of the application. We will briefly introduce several methods.

As we continue to find Bad Cases and make targeted optimizations, we can gradually add these Bad Cases to the verification set to form a verification set with a certain number of samples. For this kind of validation set, it is impractical to evaluate them one by one. We need an automatic evaluation method to achieve an overall evaluation of the performance on this validation set.

After mastering the general idea, we will explore how to evaluate and optimize application performance in large model applications based on the RAG paradigm. Since large model applications developed based on the RAG paradigm generally include two core parts: retrieval and generation, our evaluation optimization will also focus on these two parts respectively to optimize the system retrieval accuracy and the generation quality under certain given materials.

In each section, we will first introduce some tips on how to find Bad Cases, as well as general ideas for targeted retrieval optimization or Prompt optimization for Bad Cases. Note that during this process, you should always keep in mind the series of large model development principles and techniques we described in previous chapters, and always ensure that the optimized system will not make mistakes on the original examples that performed well.

Verification iteration is an indispensable and important step in building LLM-centric applications. By constantly looking for Bad Cases, adjusting Prompts or optimizing retrieval performance, we can promote the application to achieve the performance and accuracy we target. Next, we will briefly introduce several methods for large model development evaluation, and outline the general idea from targeted optimization of a few Bad Cases to overall automated evaluation.

5.1.2 Large model evaluation method

In specific large model application development, we can find Bad Cases and continuously optimize Prompt or retrieval architecture to solve Bad Cases, thereby optimizing system performance. We will add every Bad Case we find to our verification set. After each optimization, we will re-verify all verification cases in the verification set to ensure that the optimized system will not lose capabilities or degrade performance on the original Good Case. When the verification set is small, we can use manual evaluation, that is, manually evaluate the quality of the system output for each verification case in the verification set; however, when the verification set continues to expand with the optimization of the system, its size will continue to increase, so that the time and labor costs of manual evaluation expand to an unacceptable level. Therefore, we need to adopt an automatic evaluation method to automatically evaluate the system's output quality for each verification case to evaluate the overall performance of the system.

We will first introduce the general idea of manual evaluation for reference, then introduce in depth the general method of automatic evaluation of large models, and conduct actual verification on this system to comprehensively evaluate the performance of this system and prepare for further optimization iterations of the system. Again, before we start, we first load our vector database and retrieval chain:

import sys
sys.path.append("../C3 搭建知识库") # 将父目录放入系统路径中

# 使用智谱 Embedding API，注意，需要将上一章实现的封装代码下载到本地
from zhipuai_embedding import ZhipuAIEmbeddings

from langchain_community.vectorstores.chroma import Chroma
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv, find_dotenv
import os

_ = load_dotenv(find_dotenv())    # read local .env file
zhipuai_api_key = os.environ['ZHIPUAI_API_KEY']
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

# 定义 Embeddings
embedding = ZhipuAIEmbeddings()

# 向量数据库持久化路径
persist_directory = '../../data_base/vector_db/chroma'

# 加载数据库
vectordb = Chroma(
    persist_directory=persist_directory,  # 允许我们将persist_directory目录保存到磁盘上
    embedding_function=embedding
)
retriever = vectordb.as_retriever()
# 使用 OpenAI GPT-4o 模型
llm = ChatOpenAI(model_name = "gpt-4o", temperature = 0)

/var/folders/yd/c4q_f88j1l70g7_jcb6pdnb80000gn/T/ipykernel_25665/3013114604.py:23: LangChainDeprecationWarning: The class `Chroma` was deprecated in LangChain 0.2.9 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-chroma package and should be used instead. To use it run `pip install -U :class:`~langchain-chroma` and import as `from :class:`~langchain_chroma import Chroma``.
  vectordb = Chroma(

In the early stages of system development, the verification set is small, and the simplest and most intuitive method is to manually evaluate each verification case in the verification set. However, there are also some basic guidelines and ideas for manual assessment, which are briefly introduced here for learners’ reference. However, please note that the evaluation of the system is strongly related to the business, and the design of specific evaluation methods and dimensions requires in-depth consideration based on the specific business.

5.1.2.1 Quantitative Assessment

In order to ensure a good comparison of system performance of different versions, quantitative evaluation indicators are very necessary. We should give a score for the answer to each verification case, and finally calculate the average score of all verification cases to get the score of this version of the system. The quantitative dimension can be 0~~5 or 0~~100, which can be determined according to personal style and actual business conditions.

The quantified evaluation indicators should have certain evaluation specifications. For example, if condition A is met, it can be scored as y points to ensure the relative consistency of evaluations between different evaluators.

For example, we give two verification cases:

① Who is the author of "The Pumpkin Book"?

② How to use Pumpkin Book?

Next, we use version A prompt (concise) and version B prompt (detailed and specific) to ask the model to answer:

from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableParallel, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

template_v1 = """使用以下上下文来回答最后的问题。如果你不知道答案，就说你不知道，不要试图编造答
案。最多使用三句话。尽量使答案简明扼要。总是在回答的最后说“谢谢你的提问！”。
{context}
问题: {question}
"""

QA_CHAIN_PROMPT = PromptTemplate(template=template_v1)

def combine_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
retrievel_chain = retriever | RunnableLambda(combine_docs)
qa_chain = (
    RunnableParallel(context=retrievel_chain, question=RunnablePassthrough())
    |{
        "answer": QA_CHAIN_PROMPT | llm | StrOutputParser(),
        "context": lambda x: x["context"]
    }
)
print("问题一：")
question = "南瓜书和西瓜书有什么关系？"
result = qa_chain.invoke(question)
print(result["answer"])

print("问题二：")
question = "应该如何使用南瓜书？"
result = qa_chain.invoke(question)
print(result["answer"])

Question one: The Pumpkin Book is expressed using the Watermelon Book as prerequisite knowledge, and aims to analyze and supplement the derivation details of the formulas that are difficult to understand in the Watermelon Book. The best way to use the Pumpkin Book is to use the Watermelon Book as the main line, and then refer to the Pumpkin Book when you encounter a formula that cannot be derived or cannot be understood. Thank you for your question! Question two: The best way to use the Pumpkin Book is to use the Watermelon Book as the main line. When you encounter a formula that cannot be derived or cannot be understood, refer to the Pumpkin Book. For beginners, it is recommended to briefly browse the formulas in Chapters 1 and 2 of the Xigua book, and then study in depth after having a certain foundation. If there is no content you need in the Pumpkin Book, you can provide feedback through Issues on GitHub or contact the editorial board. Thank you for your question!

The above is the answer to version A prompt. Let’s test version B again:

template_v2 = """使用以下上下文来回答最后的问题。如果你不知道答案，就说你不知道，不要试图编造答
案。你应该使答案尽可能详细具体，但不要偏题。如果答案比较长，请酌情进行分段，以提高答案的阅读体验。
{context}
问题: {question}
有用的回答:"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template_v2)

qa_chain = (
    RunnableParallel(context=retrievel_chain, question=RunnablePassthrough())
    | {
        "answer": QA_CHAIN_PROMPT | llm | StrOutputParser(),
        "context": lambda x: x["context"]
    }
)

print("问题一：")
question = "南瓜书和西瓜书有什么关系？"
result = qa_chain.invoke(question)
print(result["answer"])

print("问题二：")
question = "应该如何使用南瓜书？"
result = qa_chain.invoke(question)
print(result["answer"])

Question one: The relationship between the Pumpkin Book and the Watermelon Book is that the Pumpkin Book is a learning material that is expanded and supplemented based on the Watermelon Book. Specifically:

Watermelon Book: Written by teacher Zhou Zhihua, it is a classic introductory textbook in the field of machine learning. The book does not elaborate on the derivation details of some formulas. It is assumed that readers have a certain mathematical foundation and can derive these details by themselves or master them through practice.
Pumpkin Book: Designed to help readers who have difficulty with the details of formula derivation in the Watermelon Book during the process of self-study. The Pumpkin Book analyzes the formulas that are difficult to understand in the Watermelon Book and adds specific derivation details. It is expressed based on the content of the Watermelon Book as pre-knowledge. Readers are advised to refer to the Pumpkin Book if they encounter difficulties in learning the Watermelon Book.

Therefore, the Pumpkin Book can be regarded as an auxiliary learning tool for the Watermelon Book, especially suitable for readers who require a more detailed derivation process. Question two: The best way to use the Pumpkin Book is as a supplement to the Watermelon Book, especially when you encounter formulas that you cannot derive or understand. Here are some specific usage suggestions:

Used in conjunction with the Watermelon Book: The content of the Pumpkin Book is expressed based on the pre-knowledge of the Watermelon Book. Therefore, it is recommended to study with the Watermelon Book as the main line. When you encounter a formula that is difficult to understand, refer to the Pumpkin Book for more detailed derivation and explanation.
Advice for Beginners: For beginners who are just starting to learn machine learning, it is recommended not to delve too deeply into the formulas in Chapters 1 and 2 of the Xigua Book. You can briefly understand it first, and then go back and study these formulas in depth after you have a certain foundation and understanding of machine learning.
Basic Mathematics Perspective: Pumpkin Book strives to analyze and derive formulas from the perspective of basic undergraduate mathematics. If it involves knowledge beyond the basics of undergraduate mathematics, Pumpkin Books usually provide relevant information in appendices and references for interested readers to study further.
Feedback and Interaction: If the formula you need is not available in the Pumpkin Book, or you find an error, you can provide feedback through the Issues page of GitHub. The editorial board usually responds within 24 hours. In addition, you can also contact the editorial board through WeChat.
Supporting Resources: The Pumpkin Book also provides supporting video tutorials and online reading addresses, through which you can more fully understand the contents of the book.
Join the reader exchange group: By scanning the QR code and replying to the keyword "Pumpkin Book", you can join the reader exchange group and exchange learning experiences with other readers.

Through the above methods, readers can more effectively use the pumpkin book to assist in learning complex formulas and concepts in machine learning.

It can be seen that the prompt of version A has a better effect in case ①, but the prompt of version B has a better effect in case ②. If we do not quantify the evaluation indicators and only use the evaluation of relative merits, we cannot judge which prompt is better between version A and version B. Therefore, we have to find a prompt that performs better in all cases before we can further iterate; however, this is obviously very difficult and not conducive to our iterative optimization.

We can assign a score of 1 to 5 to each answer. For example, in the above case, we give version A a score of 4 for answer ① and answer ② of 2, and give version B a score of 3 for answer ① and answer ② of 5. Then, the average score of version A is 3 points and the average score of version B is 4 points, so version B is better than version A.

5.1.2.2 Multidimensional Assessment

The large model is a typical generative model, that is, its answer is a statement generated by the model. In general, responses from large models need to be evaluated along multiple dimensions. For example, in the personal knowledge base Q&A project of this project, user questions are generally based on the content of the personal knowledge base. The model's answer must also meet the requirements of making full use of the personal knowledge base content, the answer is consistent with the question, the answer is true and effective, and the answer sentence is fluent, etc. An excellent Q&A assistant should be able to answer users' questions well, ensure the correctness of the answers, and demonstrate sufficient intelligence.

Therefore, we often need to start from multiple dimensions, design evaluation indicators for each dimension, and score in each dimension to comprehensively evaluate system performance. At the same time, it should be noted that multi-dimensional evaluation should be effectively combined with quantitative evaluation. For each dimension, the same dimension can be set or different dimensions can be set, and it should be fully combined with business reality.

For example, in this project, we can design assessments in the following dimensions:

① Correctness of knowledge search. This dimension requires viewing the intermediate results of the system's search for relevant knowledge fragments from the vector database, and evaluating whether the knowledge fragments found by the system can answer the question. This dimension is evaluated from 0 to 1, that is, a score of 0 means that the found knowledge fragment cannot be answered, and a score of 1 means that the found knowledge fragment can be answered.

② Answer consistency. This dimension evaluates whether the system's answers are specific to the user's questions, and whether there are any deviations or misunderstandings of the meaning of the questions. This dimension is also designed to range from 0 to 1, with 0 being completely off-topic, 1 being completely relevant, and any intermediate result can be taken.

③ Answer the hallucination proportion. This dimension needs to integrate the system's answers and the found knowledge fragments to evaluate whether the system's answers are hallucinated and how high the proportion of hallucinations is. This dimension is also designed to range from 0 to 1, with 0 being all model illusions and 1 being no illusions.

④ Answer correctness. This dimension evaluates whether the system's answers are correct and whether it fully answers the user's questions. It is one of the core evaluation indicators of the system. This dimension can be scored arbitrarily between 0 and 1.

The above four dimensions all revolve around knowledge and the correctness of answers, and are highly relevant to the question; the next few dimensions will revolve around the anthropomorphism and grammatical correctness of the results generated by the large model, and are less relevant to the question:

⑤ Logic. This dimension evaluates whether the system's answers are logically coherent and whether there are conflicts or logical confusion. This dimension is evaluated on a 0-1 basis.

⑥ Fluency. This dimension evaluates whether the system's answers are fluent and grammatical, and can be scored arbitrarily between 0 and 1.

⑦ Intelligence. This dimension evaluates whether the system's answers are anthropomorphic and intelligent, and whether it can sufficiently confuse users with artificial answers and intelligent answers. This dimension can be scored arbitrarily between 0 and 1.

For example, we evaluate responses to:

print("问题：")
question = "应该如何使用南瓜书？"
print(question)
print("模型回答：")
result = qa_chain.invoke(question)
print(result["answer"])

question: How should I use the Pumpkin Book? Model answer: The best way to use the Pumpkin Book is to use it as a supplement to the Watermelon Book. Especially when you encounter a formula that cannot be derived or cannot be understood, you can consult the Pumpkin Book for more detailed analysis and derivation details. Here are some specific usage suggestions:

Used in combination with the Watermelon Book: The content of the Pumpkin Book is expressed based on the pre-knowledge of the Watermelon Book, so it is recommended to study the Watermelon Book as the main line. When you get stuck, consult the Pumpkin Book for a deeper understanding.
Advice for Beginners: For beginners, especially those who are new to machine learning, it is recommended not to delve too deeply into the formulas in Chapters 1 and 2 of the Xigua Book. You can start with a brief understanding and then come back to study these formulas when you have a deeper understanding of machine learning.
Basic Mathematics Perspective: The Pumpkin Book strives to explain from the perspective of the basics of undergraduate mathematics. Therefore, knowledge beyond the scope of undergraduate mathematics is usually provided in the form of appendices and references. Interested readers can conduct in-depth study based on these materials.
Feedback and Interaction: If the formula you need is not available in the Pumpkin Book, or you find an error, you can provide feedback through the Issues page of GitHub. The editorial board usually responds within 24 hours.
Supporting Resources: You can use the supporting video tutorials and online reading addresses provided by Pumpkin Book for learning. In addition, you can also join the "Pumpkin Book Readers Exchange Group" by scanning the QR code for more communication and support.
Get the latest version: You can get the latest PDF version through GitHub to ensure you learn the latest content and revisions.

Through these methods, Pumpkin Book can help you better understand and master complex formulas and concepts in machine learning.

The following are the pieces of knowledge found by the system:

print(result["context"])

Preface "Teacher Zhou Zhihua's "Machine Learning" (Xigua Book) is one of the classic introductory textbooks in the field of machine learning. In order to make as many readers as possible Readers have some understanding of machine learning through the Xigua book, so the details of the derivation of some formulas are not detailed in the book, but this is useful for those who want to delve deeper into the derivation of formulas. It may not be "friendly" to readers who want to learn details. This book aims to analyze the more difficult-to-understand formulas in Xigua's book and to supplement some formulas. Specific derivation details. " After reading this, you may wonder why the previous paragraph is in quotation marks, because this is just our initial reverie. Later we learned that Zhou The real reason why the teacher omits these derivation details is that he himself believes that “sophomore students with a solid foundation in science and engineering mathematics should learn about Xigua Shu” There is no difficulty in deriving the details in the book. The key points are all in the book. The omitted details should be able to be supplemented by brain or practice." So... this pumpkin book can only be regarded as my I hope that the notes I took down when I was studying on my own will help everyone become a qualified "sophomore with a solid foundation in science and engineering mathematics." "Lost student". Instructions for use • All the contents of the Pumpkin Book are expressed based on the content of the Watermelon Book as pre-knowledge, so the best way to use the Pumpkin Book is to use the Watermelon Book For the main line, refer to the Pumpkin Book when you encounter a formula that you cannot derive or understand; • For beginners who are new to machine learning, it is strongly not recommended to study the formulas in Chapters 1 and 2 of Xigua Book. You can simply go through them and wait until you learn them.

• For beginners who are new to machine learning, it is strongly not recommended to study the formulas in Chapters 1 and 2 of Xigua Book. You can simply go through them and wait until you learn them. When it gets a little drifting, it’s still time to come back and nibble; • We strive (zhi) strive (neng) to explain the analysis and derivation of each formula from the perspective of basic undergraduate mathematics, so the mathematical knowledge beyond the scope We usually provide them in the form of appendices and references. Interested students can continue to study in depth along the materials we provide; • If there is no formula you want to check in the Pumpkin Book, or if you find an error somewhere in the Pumpkin Book, please do not hesitate to go to our GitHub Issues (Address: https://github.com/datawhalechina/pumpkin-book/issues）进行反馈，在对应版块 Submit the formula number or errata information you wish to add, and we will usually reply to you within 24 hours. If we do not reply within 24 hours, If you want, you can contact us via WeChat (WeChat ID: at-Sm1les); Supporting video tutorial: https://www.bilibili.com/video/BV1Mh411e7VU Online reading address: https://datawhalechina.github.io/pumpkin-book（仅供第1 version)

The latest version of PDF is available at: https://github.com/datawhalechina/pumpkin-book/releases Editorial Board Editor-in-chief: Sm1les, archwalker, jbb0523 Editorial Board: juxiao, Majingmin, MrBigFan, shanry, Ye980226 Cover design: Concept-Sm1les, Creation-Lin Wangmaosheng Acknowledgments Special thanks to awyd234, feijuan, Ggmatch, Heitao5200, huaqing89, LongJH, LilRachel, LeoLRH, Nono17, spareribs, sunchaothu, StevenLzq for their earliest contributions to the Pumpkin Book. Scan the QR code below and reply with the keyword "Pumpkin Book" to join the "Pumpkin Book Readers Exchange Group" Copyright statement This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

→_→

Welcome to purchase the paper version of the pumpkin book "Detailed Explanation of Machine Learning Formulas" on major e-commerce platforms. ←_← 9.2.4 Explanation of equation (9.8) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 9.2.5 Explanation of equation (9.12) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 9.3 distance calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 9.3.1 Explanation of equation (9.21) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 9.4 prototype clustering

We evaluate accordingly:

① Correctness of knowledge search——1

② Answer consistency - 0.8 (the question was answered, but the topic similar to "feedback" was off topic)

③ Answer hallucination ratio——1

④ Answer correctness——0.8 (same reason as above)

⑤ Logic - 0.7 (the subsequent content is not logically consistent with the previous content)

⑥ Fluency - 0.6 (the final summary is long-winded and ineffective)

⑦ Intelligence——0.5 (with obvious style of AI answer)

Combining the above seven dimensions, we can comprehensively and comprehensively evaluate the performance of the system in each case. By comprehensively considering the scores of all cases, we can evaluate the performance of the system in each dimension. If all dimensions are unified, we can also calculate the average score of all dimensions to evaluate the score of the system. We can also assign weights to the different importance of different dimensions, and then calculate the weighted average of all dimensions to represent the system score.

However, we can see that the more comprehensive and specific the assessment, the greater the difficulty and cost of assessment. Taking the above seven-dimensional evaluation as an example, we need to conduct seven evaluations for each case of each version of the system. If we have two versions of the system and there are 10 verification cases in the verification set, then we will need $10 \times 2 \times 7 = 140$ for each evaluation; but as our system continues to improve and iterate, the verification set expands rapidly. Generally speaking, a mature system verification set should be at least several hundred in size, with at least dozens of iteratively improved versions, so the total number of our evaluations will reach tens of thousands, and the labor and time costs will be very high. Therefore, we need a way to automatically evaluate model answers.

5.1.2.4 Construct objective questions

An important reason why large model evaluation is complicated is that it is difficult to judge the answers of the generated model. That is, it is easy to judge the objective questions, but it is very difficult to judge the subjective questions. Especially for some questions that do not have standard answers, it is particularly difficult to implement automatic assessment. However, at the expense of a certain degree of evaluation accuracy, we can transform complex subjective questions without standard answers into questions with standard answers, and then achieve this through simple automatic assessment. Two methods are introduced here: constructing objective questions and calculating the similarity of standard answers.

The evaluation of subjective questions is very difficult, but objective questions can directly compare whether the system answer is consistent with the standard answer, thus achieving simple evaluation. We can construct some subjective questions into multiple-choice or single-choice objective questions to achieve simple assessment. For example, for the question:

【Q&A】Who is the author of the Pumpkin Book?

We can construct this subjective question as the following objective question:

[Multiple Choice Question] Who is the author of the Pumpkin Book? A Zhou Zhiming B Xie Wenrui C Qinzhou D Jia Binbin

The model is required to answer this objective question. We give the standard answer as BCD, and the evaluation score can be achieved by comparing the answer given by the model with the standard answer. Based on the above ideas, we can construct a Prompt question template:

prompt_template = '''
请你做如下选择题：
题目：南瓜书的作者是谁？
选项：A 周志明 B 谢文睿 C 秦州 D 贾彬彬
你可以参考的知识片段：

{}

请仅返回选择的选项
如果你无法做出选择，请返回空
'''

Of course, due to the instability of large models, even if we ask it to only give selection options, the system may return a lot of text explaining in detail why the following options were selected. Therefore, we need to extract the options from the model answers. At the same time, we need to design a scoring strategy. Under normal circumstances, we can use the general scoring strategy for multiple-choice questions: 1 point for all choices, 0.5 points for missed choices, and no points for wrong choices:

def multi_select_score_v1(true_answer : str, generate_answer : str) -> float:
    # true_anser : 正确答案，str 类型，例如 'BCD'
    # generate_answer : 模型生成答案，str 类型
    true_answers = list(true_answer)
    '''为便于计算，我们假设每道题都只有 A B C D 四个选项'''
    # 先找出错误答案集合
    false_answers = [item for item in ['A', 'B', 'C', 'D'] if item not in true_answers]
    # 如果生成答案出现了错误答案
    for one_answer in false_answers:
        if one_answer in generate_answer:
            return 0
    # 再判断是否全选了正确答案
    if_correct = 0
    for one_answer in true_answers:
        if one_answer in generate_answer:
            if_correct += 1
            continue
    if if_correct == 0:
        # 不选
        return 0
    elif if_correct == len(true_answers):
        # 全选
        return 1
    else:
        # 漏选
        return 0.5

Based on the above scoring function, we can test four answers:

① B C

② Except for A Zhou Zhihua, all others are the authors of Pumpkin Books

③ You should choose B C D

④ I don’t know

answer1 = 'B C'
answer2 = '西瓜书的作者是 A 周志华'
answer3 = '应该选择 B C D'
answer4 = '我不知道'
true_answer = 'BCD'
print("答案一得分：", multi_select_score_v1(true_answer, answer1))
print("答案二得分：", multi_select_score_v1(true_answer, answer2))
print("答案三得分：", multi_select_score_v1(true_answer, answer3))
print("答案四得分：", multi_select_score_v1(true_answer, answer4))

Answer one score: 0.5 Answer two score: 0 Score for answer three: 1 Answer four points: 0

But we can see that we require the model not to make a choice if it cannot answer, rather than choosing randomly. However, in our scoring strategy, both wrong selection and non-selection are given 0 points, which actually encourages the model to give illusory answers. Therefore, we can adjust the scoring strategy according to the situation and deduct one point for wrong selection:

def multi_select_score_v2(true_answer : str, generate_answer : str) -> float:
    # true_anser : 正确答案，str 类型，例如 'BCD'
    # generate_answer : 模型生成答案，str 类型
    true_answers = list(true_answer)
    '''为便于计算，我们假设每道题都只有 A B C D 四个选项'''
    # 先找出错误答案集合
    false_answers = [item for item in ['A', 'B', 'C', 'D'] if item not in true_answers]
    # 如果生成答案出现了错误答案
    for one_answer in false_answers:
        if one_answer in generate_answer:
            return -1
    # 再判断是否全选了正确答案
    if_correct = 0
    for one_answer in true_answers:
        if one_answer in generate_answer:
            if_correct += 1
            continue
    if if_correct == 0:
        # 不选
        return 0
    elif if_correct == len(true_answers):
        # 全选
        return 1
    else:
        # 漏选
        return 0.5

As above, we use the second version of the scoring function to score the four answers again:

answer1 = 'B C'
answer2 = '西瓜书的作者是 A 周志华'
answer3 = '应该选择 B C D'
answer4 = '我不知道'
true_answer = 'BCD'
print("答案一得分：", multi_select_score_v2(true_answer, answer1))
print("答案二得分：", multi_select_score_v2(true_answer, answer2))
print("答案三得分：", multi_select_score_v2(true_answer, answer3))
print("答案四得分：", multi_select_score_v2(true_answer, answer4))

Answer one score: 0.5 Score for answer two: -1 Score for answer three: 1 Answer four points: 0

As you can see, in this way we achieve fast, automatic and differentiated automatic evaluation. Under this method, we only need to construct each verification case, and each subsequent verification and iteration can be fully automated, thus achieving efficient verification.

However, not all cases can be constructed as objective questions. For some cases that cannot be constructed as objective questions or constructed as objective questions will cause the difficulty of the questions to drop sharply, we need to use the second method: calculating answer similarity.

5.1.2.5 Calculate answer similarity

The evaluation of answers to generated questions is actually not a new problem in NLP. Whether it is machine translation, automatic summarization and other tasks, it is actually necessary to evaluate the quality of the generated answers. NLP generally uses the method of manually constructing standard answers to generated questions and calculating the similarity between the answers and the standard answers to achieve automatic evaluation.

For example, to the question:

What is the goal of The Pumpkin Book?

We can first manually construct a standard answer:

Teacher Zhou Zhihua's "Machine Learning" (Xigua Book) is one of the classic introductory textbooks in the field of machine learning. In order to let as many readers as possible understand machine learning through Xigua Book, Teacher Zhou does not elaborate on the derivation details of some formulas in the book. However, this may not be "unfriendly" to readers who want to delve into the details of formula derivation. This book aims to analyze the more difficult to understand formulas in Xigua Book and add specific derivation details to some formulas.

Then we calculate the similarity between the model answer and the standard answer. The more similar it is, the more correct we think the answer is.

There are many ways to calculate similarity. We can generally use BLEU to calculate similarity. For details, see: 知乎|BLEU详解. For students who do not want to delve into the principles of the algorithm, it can be simply understood as topic similarity.

We can call the bleu scoring function in the nltk library to calculate:

from nltk.translate.bleu_score import sentence_bleu
import jieba

def bleu_score(true_answer : str, generate_answer : str) -> float:
    # true_anser : 标准答案，str 类型
    # generate_answer : 模型生成答案，str 类型
    true_answers = list(jieba.cut(true_answer))
    # print(true_answers)
    generate_answers = list(jieba.cut(generate_answer))
    # print(generate_answers)
    bleu_score = sentence_bleu(true_answers, generate_answers)
    return bleu_score

Test it out:

true_answer = '周志华老师的《机器学习》（西瓜书）是机器学习领域的经典入门教材之一，周老师为了使尽可能多的读者通过西瓜书对机器学习有所了解, 所以在书中对部分公式的推导细节没有详述，但是这对那些想深究公式推导细节的读者来说可能“不太友好”，本书旨在对西瓜书里比较难理解的公式加以解析，以及对部分公式补充具体的推导细节。'

print("答案一：")
answer1 = '周志华老师的《机器学习》（西瓜书）是机器学习领域的经典入门教材之一，周老师为了使尽可能多的读者通过西瓜书对机器学习有所了解, 所以在书中对部分公式的推导细节没有详述，但是这对那些想深究公式推导细节的读者来说可能“不太友好”，本书旨在对西瓜书里比较难理解的公式加以解析，以及对部分公式补充具体的推导细节。'
print(answer1)
score = bleu_score(true_answer, answer1)
print("得分：", score)
print("答案二：")
answer2 = '本南瓜书只能算是我等数学渣渣在自学的时候记下来的笔记，希望能够帮助大家都成为一名合格的“理工科数学基础扎实点的大二下学生”'
print(answer2)
score = bleu_score(true_answer, answer2)
print("得分：", score)

Building prefix dict from the default dictionary ...

Answer one: Teacher Zhou Zhihua's "Machine Learning" (Xigua Book) is one of the classic introductory textbooks in the field of machine learning. In order to let as many readers as possible understand machine learning through Xigua Book, Teacher Zhou does not elaborate on the derivation details of some formulas in the book. However, this may not be "unfriendly" to readers who want to delve into the details of formula derivation. This book aims to analyze the more difficult to understand formulas in Xigua Book and add specific derivation details to some formulas.

Dumping model to file cache /var/folders/yd/c4q_f88j1l70g7_jcb6pdnb80000gn/T/jieba.cache
Loading model cost 0.308 seconds.
Prefix dict has been built successfully.

Score: 1.2705543769116016e-231 Answer two: This pumpkin book can only be regarded as the notes me and other math bastards took down when I was studying on my own. I hope it can help everyone become a qualified "sophomore student with a solid foundation in science and engineering mathematics." Score: 1.1935398790363042e-231

/Users/lta/anaconda3/envs/universe_0_3/lib/python3.10/site-packages/nltk/translate/bleu_score.py:577: UserWarning: 
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
  warnings.warn(_msg)
/Users/lta/anaconda3/envs/universe_0_3/lib/python3.10/site-packages/nltk/translate/bleu_score.py:577: UserWarning: 
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
  warnings.warn(_msg)
/Users/lta/anaconda3/envs/universe_0_3/lib/python3.10/site-packages/nltk/translate/bleu_score.py:577: UserWarning: 
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
  warnings.warn(_msg)

It can be seen that the higher the consistency between the answer and the standard answer, the higher the evaluation score. Through this method, we also only need to construct a standard answer for each question in the verification set, and then we can achieve automatic and efficient evaluation.

However, this method also has several problems: ① It requires manual construction of standard answers. For some vertical fields, constructing standard answers may be difficult; ② There may be problems in evaluating through similarity. For example, if the generated answer is highly consistent with the standard answer but the answer is completely wrong in several core places, the bleu score will still be very high; ③ The flexibility of the consistency with the standard answer through calculation is very poor. If the model generates a better answer than the standard answer, the evaluation score will be reduced; ④ The intelligence and fluency of the answer cannot be evaluated. If the answer is spliced together from keywords in each standard answer, we think that such an answer is unusable and incomprehensible, but the bleu score will be higher.

Therefore, depending on the business situation, sometimes we also need some advanced evaluation methods that do not require the construction of standard answers.

5.1.2.6 Use large models for evaluation

Using manual assessment has high accuracy and comprehensiveness, but the labor cost and time cost are high; using automatic assessment has low cost and fast assessment speed, but there are problems such as insufficient accuracy and insufficient comprehensive assessment. So, do we have a way to combine the advantages of both to achieve rapid and comprehensive evaluation of generated problems?

Large models represented by GPT-4 provide us with a new method: using large models for evaluation. We can construct Prompt Engineering to let the large model act as an evaluator, thereby replacing the evaluator of manual assessment; at the same time, the large model can give results similar to manual assessment, so the multi-dimensional quantitative assessment method in manual assessment can be adopted to achieve rapid and comprehensive assessment.

For example, we can construct the following Prompt Engineering to allow large models to score:

prompt = '''
你是一个模型回答评估员。
接下来，我将给你一个问题、对应的知识片段以及模型根据知识片段对问题的回答。
请你依次评估以下维度模型回答的表现，分别给出打分：

① 知识查找正确性。评估系统给定的知识片段是否能够对问题做出回答。如果知识片段不能做出回答，打分为0；如果知识片段可以做出回答，打分为1。

② 回答一致性。评估系统的回答是否针对用户问题展开，是否有偏题、错误理解题意的情况，打分分值在0~1之间，0为完全偏题，1为完全切题。

③ 回答幻觉比例。该维度需要综合系统回答与查找到的知识片段，评估系统的回答是否出现幻觉，打分分值在0~1之间,0为全部是模型幻觉，1为没有任何幻觉。

④ 回答正确性。该维度评估系统回答是否正确，是否充分解答了用户问题，打分分值在0~1之间，0为完全不正确，1为完全正确。

⑤ 逻辑性。该维度评估系统回答是否逻辑连贯，是否出现前后冲突、逻辑混乱的情况。打分分值在0~1之间，0为逻辑完全混乱，1为完全没有逻辑问题。

⑥ 通顺性。该维度评估系统回答是否通顺、合乎语法。打分分值在0~1之间，0为语句完全不通顺，1为语句完全通顺没有任何语法问题。

⑦ 智能性。该维度评估系统回答是否拟人化、智能化，是否能充分让用户混淆人工回答与智能回答。打分分值在0~1之间，0为非常明显的模型回答，1为与人工回答高度一致。

你应该是比较严苛的评估员，很少给出满分的高评估。
用户问题：

{}

待评估的回答：

{}

给定的知识片段：

{}

你应该返回给我一个可直接解析的 Python 字典，字典的键是如上维度，值是每一个维度对应的评估打分。
不要输出任何其他内容。
'''

We can actually test its effect:

# 使用第二章讲过的 OpenAI 原生接口

from openai import OpenAI

client = OpenAI()


def gen_gpt_messages(prompt):
    '''
    构造 GPT 模型请求参数 messages
    
    请求参数：
        prompt: 对应的用户提示词
    '''
    messages = [{"role": "user", "content": prompt}]
    return messages


def get_completion(prompt, model="gpt-4o", temperature = 0):
    '''
    获取 GPT 模型调用结果

    请求参数：
        prompt: 对应的提示词
        model: 调用的模型，默认为 gpt-4o，也可以按需选择 gpt-4 等其他模型
        temperature: 模型输出的温度系数，控制输出的随机程度，取值范围是 0~2。温度系数越低，输出内容越一致。
    '''
    response = client.chat.completions.create(
        model=model,
        messages=gen_gpt_messages(prompt),
        temperature=temperature,
    )
    if len(response.choices) > 0:
        return response.choices[0].message.content
    return "generate answer error"

question = "应该如何使用南瓜书？"
result = qa_chain.invoke(question)
answer = result["answer"]
knowledge = result["context"]

response = get_completion(prompt.format(question, answer, knowledge))
response

'```python\n{\n    "知识查找正确性": 1,\n    "回答一致性": 1,\n    "回答幻觉比例": 0.9,\n    "回答正确性": 0.9,\n    "逻辑性": 1,\n    "通顺性": 1,\n    "智能性": 0.9\n}\n```'

Note, however, that there are still problems with using large models for evaluation:

① Our goal is to iteratively improve Prompt to improve the performance of large models. Therefore, the large model we choose for evaluation needs to have better performance than the large model base we use. For example, the most powerful large model currently is still GPT-4. It is recommended to use GPT-4 for evaluation, which has the best effect.

② Large models have powerful capabilities, but there are also boundaries of capabilities. If the questions and answers are too complex, the knowledge fragments are too long, or too many evaluation dimensions are required, even GPT-4 will suffer from incorrect evaluations, incorrect formats, and inability to understand instructions. In response to these situations, we recommend considering the following solutions to improve the performance of large models:

Improve Prompt Engineering. Iteratively optimize and evaluate Prompt Engineering in a manner similar to the Prompt Engineering improvement of the system itself, paying particular attention to whether Prompt Engineering's basic guidelines, core recommendations, etc. are adhered to;
Split the evaluation dimensions. If there are too many evaluation dimensions, the model may have an incorrect format and the returned data cannot be parsed. You can consider splitting the multiple dimensions to be evaluated, calling a large model for each dimension for evaluation, and finally getting a unified result;
Merge assessment dimensions. If the evaluation dimensions are too detailed, the model may not be understood correctly and the evaluation may be incorrect. You can consider merging multiple dimensions to be evaluated, for example, merging logic, fluency, and intelligence into intelligence, etc.;
Provide detailed evaluation specifications. Without evaluation specifications, it is difficult for the model to give ideal evaluation results. You can consider giving detailed and specific evaluation specifications to improve the evaluation capabilities of the model;
Provide a few examples. It may be difficult for the model to understand the evaluation specification. In this case, a small number of evaluation examples can be given for the model to refer to for correct evaluation.

5.1.2.7 Hybrid evaluation

In fact, none of the above evaluation methods are isolated or antagonistic. Rather than using a certain evaluation method independently, we recommend mixing multiple evaluation methods and selecting an appropriate evaluation method for each dimension, taking into account the comprehensiveness, accuracy and efficiency of the evaluation.

For example, for this project's personal knowledge base assistant, we can design the following hybrid evaluation method:

Objective correctness. Objective correctness means that the model can give correct answers to some questions that have fixed correct answers. We can select some cases and use the method of constructing objective questions to evaluate the model and evaluate its objective correctness.
Subjective correctness. Subjective correctness means that the model can give correct and comprehensive answers to subjective questions that do not have fixed correct answers. We can select some cases and use large model evaluation to evaluate whether the model answers are correct.
Intelligence. Intelligence refers to whether the model’s answers are sufficiently anthropomorphic. Since intelligence is weakly related to the problem itself and strongly related to the model and prompt, and the model's ability to judge intelligence is weak, we can manually evaluate its intelligence using a small number of samples.
Knowledge search accuracy. Knowledge search correctness refers to whether the knowledge fragments retrieved from the knowledge base are correct and sufficient to answer the question for a specific question. It is recommended that the correctness of knowledge search be evaluated using a large model, which requires the model to determine whether a given piece of knowledge is sufficient to answer the question. At the same time, the evaluation results of this dimension combined with the subjective correctness can calculate the hallucination situation, that is, if the subjective answer is correct but the knowledge search is incorrect, it means that model hallucination has occurred.

Using the above evaluation methods, a reasonable evaluation of the project can be made based on the obtained validation set examples. Due to time and manpower limitations, they will not be shown in detail here.

5.2 Evaluate and optimize the generation part

In the previous chapter, we talked about how to evaluate the overall performance of a large model application based on the RAG framework. By constructing a verification set in a targeted manner, a variety of methods can be used to evaluate system performance from multiple dimensions. However, the purpose of evaluation is to better optimize application effects. To optimize application performance, we need to combine the evaluation results, split the evaluated Bad Cases, and evaluate and optimize each part separately.

RAG stands for Retrieval Enhanced Generation, so it has two core parts: the retrieval part and the generation part. The core function of the retrieval part is to ensure that the system can find the corresponding answer fragment according to the user query, and the core function of the generation part is to ensure that after the system obtains the correct answer fragment, it can fully utilize the large model capabilities to generate a correct answer that meets the user's requirements.

To optimize a large model application, we often need to start from these two parts at the same time, evaluate the performance of the retrieval part and the optimization part respectively, find Bad Cases and perform targeted performance optimization. As for the generation part specifically, when the large model base has been restricted for use, we often optimize the generated answers by optimizing Prompt Engineering. In this chapter, we will first combine the large model application example we just built - Personal Knowledge Base Assistant to explain to you how to evaluate the performance of the analysis and generation part, identify Bad Cases, and optimize the generation part by optimizing Prompt Engineering.

Before we officially start, we first load our vector database and search chain:

import sys
sys.path.append("../C3 搭建知识库") # 将父目录放入系统路径中

# 使用智谱 Embedding API，注意，需要将上一章实现的封装代码下载到本地
from zhipuai_embedding import ZhipuAIEmbeddings

from langchain_community.vectorstores.chroma import Chroma
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv, find_dotenv
import os

_ = load_dotenv(find_dotenv())    # read local .env file
zhipuai_api_key = os.environ['ZHIPUAI_API_KEY']
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

# 定义 Embeddings
embedding = ZhipuAIEmbeddings()

# 向量数据库持久化路径
persist_directory = '../../data_base/vector_db/chroma'

# 加载数据库
vectordb = Chroma(
    persist_directory=persist_directory,  # 允许我们将persist_directory目录保存到磁盘上
    embedding_function=embedding
)

# 使用 OpenAI GPT-4o 模型
llm = ChatOpenAI(model_name = "gpt-4o", temperature = 0)

# os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'
# os.environ["HTTP_PROXY"] = 'http://127.0.0.1:7890'

/var/folders/yd/c4q_f88j1l70g7_jcb6pdnb80000gn/T/ipykernel_26332/2254682865.py:23: LangChainDeprecationWarning: The class `Chroma` was deprecated in LangChain 0.2.9 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-chroma package and should be used instead. To use it run `pip install -U :class:`~langchain-chroma` and import as `from :class:`~langchain_chroma import Chroma``.
  vectordb = Chroma(

We first use the initialized Prompt to create a template-based retrieval chain:

from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnableParallel, RunnableLambda, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
template_v1 = """使用以下上下文来回答最后的问题。如果你不知道答案，就说你不知道，不要试图编造答
案。最多使用三句话。尽量使答案简明扼要。总是在回答的最后说“谢谢你的提问！”。
{context}
问题: {question}
"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template_v1)

def combine_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

retrievel_chain = vectordb.as_retriever() | RunnableLambda(combine_docs)

qa_chain = (
    RunnableParallel(context=retrievel_chain, question=RunnablePassthrough())
    | QA_CHAIN_PROMPT
    | llm
    | StrOutputParser()
)

Test the effect first:

question = "什么是南瓜书"
result = qa_chain.invoke(question)
print(result)

The Pumpkin Book is a book that supplements the analysis and derivation details of the more difficult to understand formulas in "Machine Learning" (Watermelon Book), aiming to help readers better understand the mathematical formulas in machine learning. It uses the Watermelon Book as prerequisite knowledge and is suitable for reference when encountering derivation difficulties. The content of the Pumpkin Book is explained from the perspective of undergraduate mathematics foundation, and supporting video tutorials and online reading resources are provided. Thank you for your question!

5.2.1 Improve the quality of intuitive answers

There are many ways to find Bad Cases. The most intuitive and simplest is to evaluate the quality of intuitive answers and determine where there are shortcomings based on the original data content. For example, we can construct the above test into a Bad Case:

Question: What is a Pumpkin Book? Initial answer: The Pumpkin Book is a book that analyzes the difficult-to-understand formulas in "Machine Learning" (Watermelon Book) and adds derivation details. Thank you for your question! There are shortcomings: the answer is too brief, and the answer needs to be more specific; thank you for your question, which seems rigid and can be removed. We then modify the Prompt template in a targeted manner, adding a requirement for specific answers, and removing the "Thank you for your question" part:

template_v2 = """使用以下上下文来回答最后的问题。如果你不知道答案，就说你不知道，不要试图编造答
案。你应该使答案尽可能详细具体，但不要偏题。如果答案比较长，请酌情进行分段，以提高答案的阅读体验。
{context}
问题: {question}
有用的回答:"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template_v2)
qa_chain = (
    RunnableParallel(context=retrievel_chain, question=RunnablePassthrough())
    | QA_CHAIN_PROMPT
    | llm
    | StrOutputParser()
)

question = "什么是南瓜书"
result = qa_chain.invoke(question)
print(result)

Pumpkin Book is an auxiliary textbook designed to help readers deeply understand the details of formula derivation in "Machine Learning" (Watermelon Book). It is especially suitable for readers who have difficulties in derivation when studying the Xigua book. The content of the Pumpkin Book is expressed based on the pre-knowledge of the Watermelon Book. Therefore, it is recommended that readers consult the Pumpkin Book when they encounter formulas they do not understand when studying the Watermelon Book. The goal of the Pumpkin Book is to help readers become "sophomore students with a solid foundation in science and engineering mathematics", that is, learners with a solid foundation in mathematics.

The authors of the Pumpkin Book strive to analyze and derive formulas from the perspective of undergraduate mathematics foundations, and provide appendices and references for readers to learn in depth. Pumpkin Book also provides resources such as online reading, video tutorials and reader communication groups so that readers can learn and communicate better.

If readers cannot find the formula they need in the pumpkin book, or find an error, they can provide feedback through the Issues page on GitHub, and the editorial board will usually respond within 24 hours. The content of the Pumpkin Book is also constantly being updated and improved to better serve the learning needs of readers.

It can be seen that the improved v2 version can give more specific and detailed answers, solving the previous problems. But we can think further and ask the model to give specific and detailed answers. Will it lead to unfocused and vague answers to some key points? We test the following questions:

question = "使用大模型时，构造 Prompt 的原则有哪些"
result = qa_chain.invoke(question)
print(result)

When using a large language model (LLM), the principles for constructing Prompt mainly include the following two key points:

Write clear, specific instructions:

Ensure that the Prompt clearly expresses the requirements and provides sufficient contextual information so that the model accurately understands the task intent. Like explaining the human world to an alien, a prompt that is too simplistic can result in a model that fails to grasp the specific requirements of the task.
When designing prompts, avoid vague and uncertain language, and try to use specific descriptions and clear instructions to improve the accuracy of model generation results.

Give the model enough time to think:

Allow the model enough time to reason and think, similar to how humans need time to think when solving problems to avoid rushing to the wrong conclusion.
Add the requirement for step-by-step reasoning in Prompt to ensure that the model has time to think deeply, thereby generating more accurate and reliable results.

By optimizing these two aspects, developers can fully unleash the potential of language models to complete complex reasoning and generation tasks. In addition, the design of Prompt is an iterative optimization process. Developers need to gradually find the most suitable Prompt form for the application through multiple attempts and adjustments.

It can be seen that in response to our questions about the LLM course, the model's answer is indeed detailed and specific, and fully refers to the course content. However, the answer starts with the words first, second, etc., and the overall answer is divided into 4 paragraphs, resulting in the answer being not particularly focused and clear, and difficult to read. Therefore, we construct the following Bad Case:

Question: What are the principles for constructing Prompt when using large models? Initial answer: slightly Weaknesses: no focus, vagueness

For this Bad Case, we can improve Prompt and require it to mark answers with several points to make the answer clear and specific:

template_v3 = """使用以下上下文来回答最后的问题。如果你不知道答案，就说你不知道，不要试图编造答
案。你应该使答案尽可能详细具体，但不要偏题。如果答案比较长，请酌情进行分段，以提高答案的阅读体验。
如果答案有几点，你应该分点标号回答，让答案清晰具体
{context}
问题: {question}
有用的回答:"""

QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context","question"],
                                 template=template_v3)
qa_chain = (
    RunnableParallel(context=retrievel_chain, question=RunnablePassthrough())
    | QA_CHAIN_PROMPT
    | llm
    | StrOutputParser()
)

question = "使用大模型时，构造 Prompt 的原则有哪些"
result = qa_chain.invoke(question)
print(result)

When using large models, the principles for constructing Prompt mainly include the following points:

Write clear, specific instructions:

Ensure that the Prompt clearly expresses the requirements and provides sufficient contextual information so that the language model can accurately understand the task intent.
Avoid overly brief descriptions that may make it difficult for the model to grasp specific task requirements.

Give the model enough time to think:

When designing Prompt, step-by-step reasoning requirements should be included, similar to the human problem-solving process, to avoid hasty conclusions.
By giving the model enough time to infer, the results generated will be more accurate and reliable.

Iterative Optimization:

Optimize Prompt through multiple iterations, gradually improving to achieve the best results.
After running the first version of Prompt, check the results and analyze the reasons for unsatisfactory results, and make adjustments and improvements.
For complex applications, iterative training can be performed on multiple samples to evaluate the average performance of Prompt.

Practice and Assessment:

Practice Prompt examples on tools such as Jupyter Notebook, observe different outputs, and gain a deeper understanding of the iterative optimization process.
After the application becomes more mature, detailed optimization will be carried out by evaluating Prompt performance on multiple sample sets.

By mastering these principles, developers can leverage large language models more effectively and build reliable applications.

There are many ways to improve the quality of answers. The core is to think about the specific business, find out the unsatisfactory points in the initial answers, and make targeted improvements. I will not go into details here.

5.2.2 Indicate the source of knowledge to improve credibility

Due to the hallucination problem in large models, we sometimes suspect that model answers are not derived from existing knowledge base content. This is especially important for some scenarios where authenticity needs to be ensured, such as:

question = "强化学习的定义是什么"
result = qa_chain.invoke(question)
print(result)

I don't know the definition of reinforcement learning because it is not mentioned in the context provided.

We can require the model to indicate the source of knowledge when generating answers. This can prevent the model from fabricating knowledge that does not exist in the given data. At the same time, it can also improve our credibility in the answers generated by the model:

template_v4 = """使用以下上下文来回答最后的问题。如果你不知道答案，就说你不知道，不要试图编造答
案。你应该使答案尽可能详细具体，但不要偏题。如果答案比较长，请酌情进行分段，以提高答案的阅读体验。
如果答案有几点，你应该分点标号回答，让答案清晰具体。
请你附上回答的来源原文，以保证回答的正确性。
{context}
问题: {question}
有用的回答:"""

QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context","question"],
                                 template=template_v4)
qa_chain = (
    RunnableParallel(context=retrievel_chain, question=RunnablePassthrough())
    | QA_CHAIN_PROMPT
    | llm
    | StrOutputParser()
)

question = "强化学习的定义是什么"
result = qa_chain.invoke(question)
print(result)

According to the context provided, the definition of reinforcement learning is not mentioned directly, so I cannot derive the definition of reinforcement learning from it. However, I can provide a general definition of reinforcement learning:

Reinforcement Learning (RL) is a machine learning method that aims to learn how to take actions through interaction with the environment to maximize accumulated rewards. The core of reinforcement learning is that the agent selects actions in different states and adjusts the policy based on the rewards from environmental feedback in order to obtain the maximum total reward in the long term. Reinforcement learning typically involves the following key elements:

Agent: An entity that performs actions in the environment.
Environment: The external system with which the agent interacts.
State: The specific situation of the environment at a certain moment.
Action: The operations that the agent can perform in a certain state.
Reward: The signal of environmental feedback after the agent performs the action, used to guide the learning of the agent.

If you need more detailed information or specific examples, it is recommended to consult relevant reinforcement learning textbooks or literature.

However, attaching the source of the original text often leads to an increase in context and a reduction in response speed. We need to consider whether to require attachment of the original text based on the business scenario.

5.2.3 Construct a thinking chain

Large models can often understand and execute instructions well, but the models themselves still have some limitations in their capabilities, such as the illusion of large models, the inability to understand more complex instructions, and the inability to execute complex steps. We can minimize its ability limitations by constructing a thinking chain and structuring Prompt into a series of steps. For example, we can construct a two-step thinking chain and require the model to reflect in the second step to eliminate the illusion problem of large models as much as possible.

We first have such a Bad Case:

Question: How should we structure an LLM project Initial answer: slightly There are shortcomings: In fact, the content in the knowledge base on how to construct an LLM project is to use the LLM API to build an application. The model's answer seems reasonable, but in fact it is the illusion of a large model. It is obtained by splicing some related texts, and there are problems

question = "我们应该如何去构造一个LLM项目"
result = qa_chain.invoke(question)
print(result)

Constructing a large language model (LLM) project is a complex task with multiple aspects to consider. Here are some key steps and considerations:

Project Goals and Requirements Analysis:

Clarify the goals and application scenarios of the project, such as whether it is used for dialogue systems, text generation, translation, or other tasks.
Determine the performance indicators of the model, such as accuracy, response speed, resource consumption, etc.

Data Collection and Preparation:

Collect large amounts of high-quality text data to ensure data diversity and representativeness.
Clean and preprocess data, including removing noise, handling missing values, and standardization.

Model Selection and Design:

Select the appropriate model architecture, such as Transformer, GPT, BERT, etc., based on project requirements and resource constraints.
Consider the size and complexity of the model and balance performance and computing resources.

Training and Optimization:

Configure the training environment, including hardware (such as GPU, TPU) and software framework (such as TensorFlow, PyTorch).
Perform model training and adjust hyperparameters to optimize model performance.
Use techniques such as transfer learning, fine-tuning, etc. to improve the efficiency and effectiveness of the model.

Assessment and Testing:

Evaluate models using standard datasets and metrics to ensure their performance on different tasks.
Conduct A/B testing and user testing to collect feedback to further improve the model.

Deployment and Maintenance:

Deploy the model to production environment to ensure its stability and scalability.
Regularly update and maintain models, process new data and feedback, and improve model performance.

Ethics and Compliance:

Ensure that the use of models complies with ethical standards and legal regulations, especially with regard to data privacy and security.
Consider the fairness and transparency of the model to avoid bias and discrimination.

These steps require collaboration across disciplines, including expertise in areas such as data science, machine learning, software engineering, and product management. Through a systematic approach and continuous iterative improvement, an effective LLM project can be successfully built and deployed.

In this regard, we can optimize Prompt and turn the previous Prompt into two steps, requiring the model to reflect in the second step:

template_v4 = """
请你依次执行以下步骤：
① 使用以下上下文来回答最后的问题。如果你不知道答案，就说你不知道，不要试图编造答案。
你应该使答案尽可能详细具体，但不要偏题。如果答案比较长，请酌情进行分段，以提高答案的阅读体验。
如果答案有几点，你应该分点标号回答，让答案清晰具体。
上下文：
{context}
问题: 
{question}
有用的回答:
② 基于提供的上下文，反思回答中有没有不正确或不是基于上下文得到的内容，如果有，回答你不知道
确保你执行了每一个步骤，不要跳过任意一个步骤。
"""

QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context","question"],
                                 template=template_v4)
qa_chain = (
    RunnableParallel(context=retrievel_chain, question=RunnablePassthrough())
    | QA_CHAIN_PROMPT
    | llm
    | StrOutputParser()
)

question = "我们应该如何去构造一个LLM项目"
result = qa_chain.invoke(question)
print(result)

Based on the context provided, here are the detailed steps on how to structure an LLM (Large Language Model) project:

Understand goals and needs:

First clarify the goals and needs of the project. Understand what functions you want to achieve or what problems you want to solve through LLM.
Determine the scope and constraints of the project, such as budget, time, technology stack, etc.

Choose appropriate models and algorithms:

Select the appropriate model architecture based on project requirements. You can consider using existing pre-trained models (such as GPT, BERT, etc.) for fine-tuning, or training a new model from scratch.
Understand the basic principles and applicable scenarios of the model to ensure that the selected model can meet the project needs.

Data preparation:

Collect and prepare training data. The quality and quantity of data are critical to model performance.
Clean and preprocess data to ensure data consistency and accuracy.

Model training and optimization:

Use appropriate frameworks (such as TensorFlow, PyTorch, etc.) for model training.
Improve the performance of the model by adjusting hyperparameters, optimization algorithms, etc.
Use methods such as cross-validation to evaluate the generalization ability of the model.

Assessment and Testing:

Comprehensive evaluation and testing of the model to ensure its performance in different scenarios.
Evaluate the performance of the model using appropriate metrics (such as precision, recall, F1 score, etc.).

Deployment and Maintenance:

Deploy the model to the production environment to ensure that it runs stably.
Monitor and maintain the model regularly, and update and optimize the model in a timely manner to meet new needs and challenges.

Documentation and User Training:

Write detailed project documents to record all aspects of the project, including design, implementation, testing, etc.
Provide training and support to users to help them better use and understand the model.

In the above steps, ensure that each step is adapted and optimized based on the specific needs and context of the project.

Reflection: In the context provided, there are no specific steps directly involved in how to construct an LLM project, so the above answer is a summary based on the general project construction process and LLM related knowledge. If anything is out of context, please point it out.

It can be seen that after asking the model to self-reflect, the model repaired its illusion and gave the correct answer. We can also accomplish more functions by constructing a thinking chain, which I won’t go into details here. Readers are welcome to try.

5.2.4 Add a command parsing

We often face a requirement that we need the model to output in a format we specify. However, because we use Prompt Template to populate user questions, the formatting requirements present in user questions are often ignored, such as:

question = "LLM的分类是什么？给我返回一个 Python List"
result = qa_chain.invoke(question)
print(result)

① Answer the questions based on the context provided:

What are the classifications of LLM? Returns me a Python List

Useful answers:

Sentiment inference: LLM can be used for sentiment analysis, by writing prompts to analyze the sentiment tendency of comments, such as positive or negative.
Recognize the language: LLM can identify the language of the text, for example, recognize that "Combien coûte le lampadaire?" is French.
Multilingual translation: LLM can translate text into multiple languages, such as "I want to order a basketball." into Chinese, English, French and Spanish.
Simultaneous tone conversion: LLM can perform tone conversion and information extraction based on the input prompt, such as extracting information about "NASA" from the text.

Python List:
```
llm_classifications = [
    "情感推断",
    "识别语种",
    "多语种翻译",
    "同时进行语气转换"
]
```

② Reflect on whether there is anything incorrect or not based on the context in the answer:

In the context provided, the classification of LLM is based on the description of functions such as emotion inference, language recognition, multilingual translation, and tone conversion. Therefore, responses are contextual and contain nothing incorrect.

As you can see, although we asked the model to return a Python List, the output request was wrapped in a Template and ignored by the model. To address this problem, we can construct a Bad Case:

Question: What are the classifications of LLM? Returns me a Python List Initial answer: Based on the context provided, the classification of LLM can be divided into basic LLM and instruction fine-tuning LLM. There is a shortcoming: the output is not according to the requirements in the instruction.

To solve this problem, an existing solution is to add a layer of LLM before our retrieval LLM to realize the parsing of instructions and separate the format requirements of user questions and question content. This idea is actually the prototype of the currently popular Agent mechanism, that is, for user instructions, set up an LLM (i.e. Agent) to understand the instructions, determine what tools need to be executed by the instructions, and then specifically call the tools that need to be executed. Each of these tools can be an LLM based on different Prompt Engineering, or it can be a database, API, etc. There is actually an Agent mechanism designed in LangChain, but we will not go into details in this tutorial. Here we only simply implement this function based on OpenAI's native interface:

# 使用第二章讲过的 OpenAI 原生接口

from openai import OpenAI

client = OpenAI()


def gen_gpt_messages(prompt):
    '''
    构造 GPT 模型请求参数 messages
    
    请求参数：
        prompt: 对应的用户提示词
    '''
    messages = [{"role": "user", "content": prompt}]
    return messages


def get_completion(prompt, model="gpt-4o", temperature = 0):
    '''
    获取 GPT 模型调用结果

    请求参数：
        prompt: 对应的提示词
        model: 调用的模型，默认为 gpt-4o，也可以按需选择 gpt-4 等其他模型
        temperature: 模型输出的温度系数，控制输出的随机程度，取值范围是 0~2。温度系数越低，输出内容越一致。
    '''
    response = client.chat.completions.create(
        model=model,
        messages=gen_gpt_messages(prompt),
        temperature=temperature,
    )
    if len(response.choices) > 0:
        return response.choices[0].message.content
    return "generate answer error"

prompt_input = '''
请判断以下问题中是否包含对输出的格式要求，并按以下要求输出：
请返回给我一个可解析的Python列表，列表第一个元素是对输出的格式要求，应该是一个指令；第二个元素是去掉格式要求的问题原文
如果没有格式要求，请将第一个元素置为空
需要判断的问题：

{}

不要输出任何其他内容或格式，确保返回结果可解析。
'''

Let’s test the LLM’s ability to decompose format requirements:

response = get_completion(prompt_input.format(question))
response

'["Return me a Python List", "What is the classification of LLM?"]'

It can be seen that through the above prompt, LLM can effectively parse the output format. Next, we can set up another LLM to parse the output content according to the output format requirements:

prompt_output = '''
请根据回答文本和输出格式要求，按照给定的格式要求对问题做出回答
需要回答的问题：

{}

回答文本：

{}

输出格式要求：

{}

'''

We can then concatenate the two LLMs with the retrieval chain:

question = 'LLM的分类是什么？给我返回一个 Python List'
# 首先将格式要求与问题拆分
input_lst_s = get_completion(prompt_input.format(question))
# 找到拆分之后列表的起始和结束字符
start_loc = input_lst_s.find('[')
end_loc = input_lst_s.find(']')
rule, new_question = eval(input_lst_s[start_loc:end_loc+1])
# 接着使用拆分后的问题调用检索链
result = qa_chain.invoke(new_question)
result_context = result
# 接着调用输出格式解析
response = get_completion(prompt_output.format(new_question, result_context, rule))
response

'```python\n[\n    "生成式模型",\n    "判别式模型"\n]\n```'

It can be seen that after the above steps, we have successfully achieved the limitation of the output format. Of course, in the above code, the core is to introduce the idea of Agent. In fact, whether it is the Agent mechanism or the Parser mechanism (that is, limited output format), LangChain provides a mature tool chain for use. Interested readers are welcome to discuss it in depth, and I will not explain it here.

Through the ideas explained above and combined with actual business conditions, we can continuously discover Bad Cases and optimize Prompts accordingly, thereby improving the performance of the generated part. However, the premise of the above optimization is that the retrieval part can retrieve the correct answer fragment, that is, the retrieval accuracy and recall rate are as high as possible. So, how can we evaluate and optimize the performance of the retrieval part? We will explore this issue in depth in the next chapter.

5.3 Evaluate and optimize the search part

In the previous chapter, we explained how to optimize Prompt Engineering for generation part evaluation to improve the generation quality of large models. But the premise of generation is retrieval. Only when the retrieval part of our application can retrieve the correct answer document according to the user query, the generated results of the large model may be correct. Therefore, the retrieval precision and recall rate of the retrieval part actually affect the overall performance of the application to a greater extent. However, the optimization of the search part is a more engineering and in-depth proposition. We often need to use many advanced advanced techniques derived from search, explore more practical tools, and even hand-write some tools for optimization. Therefore, in this chapter, we only briefly discuss the ideas for evaluation and optimization of the retrieval part, without going into code practice in depth. If readers feel unsatisfied after reading this part and want to learn more advanced techniques to further optimize their applications, please read the second part of our upcoming tutorial "LLM Development Techniques".

5.3.1 Evaluate retrieval results

First let's review the functionality of the entire RAG system.

For a query entered by the user, the system will convert it into a vector and match the most relevant text segments in the vector database. Then according to our settings, 3 to 5 text paragraphs will be selected and handed over to the large model together with the user's query. The large model will then answer the questions raised in the user's query based on the retrieved text paragraphs. In this entire system, we call the part where the vector database retrieves relevant text paragraphs the retrieval part, and the part where the large model generates answers based on the retrieved text paragraphs is called the generation part.

Therefore, the core function of the retrieval part is to find text paragraphs that exist in the knowledge base and can correctly answer the questions in the user query. Therefore, we can define the most intuitive accuracy rate to evaluate the retrieval effect: for N given queries, we ensure that the correct answer corresponding to each query exists in the knowledge base. Assume that for each query, the system finds K text fragments. If the correct answer is in one of the K text fragments, then we consider the retrieval successful; if the correct answer is not in one of the K text fragments, our task retrieval fails. Then, the retrieval accuracy of the system can be simply calculated as:

$accuracy = \frac{M}{N}$

Among them, M is the number of successfully retrieved queries.

Through the above accuracy rate, we can measure the system's retrieval capabilities. For the queries that the system can successfully retrieve, we can further optimize the prompt to improve system performance. For queries whose system retrieval fails, we must improve the retrieval system to optimize the retrieval effect. But note that when we calculate the accuracy defined above, we must ensure that the correct answer to each of our verification queries does exist in the knowledge base; if the correct answer does not exist, then we should attribute the Bad Case to the knowledge base construction part, indicating that the breadth and processing accuracy of the knowledge base construction need to be improved.

Of course, this is just the simplest evaluation method. In fact, this evaluation method has many shortcomings. For example:

Some queries may require combining multiple knowledge fragments to answer. How do we evaluate such queries?
The order of retrieved knowledge fragments will actually affect the generation of large models. Should we take the order of retrieved fragments into consideration?
In addition to retrieving correct knowledge fragments, our system should also try to avoid retrieving wrong and misleading knowledge fragments, otherwise the generation results of large models are likely to be misled by wrong fragments. Should we include the retrieved erroneous fragments in the metric calculation?

There are no standard answers to the above questions, and they need to be comprehensively considered based on the actual business targeted by the project and the estimated costs.

In addition to evaluating the retrieval effect through the above methods, we can also model the retrieval part as a classic search task. Let's look at a classic search scenario. The task of the search scenario is to find and sort relevant content from a given range of content (usually web pages) for the retrieval query given by the user, and try to make the top-ranked content meet the user's needs.

In fact, the tasks of our retrieval part are very similar to the search scenario. They are also targeted at user queries, but we place more emphasis on recall rather than sorting, and the content we retrieve is not web pages but knowledge fragments. Therefore, we can similarly model our retrieval task as a search task. Then, we can introduce classic evaluation ideas (such as accuracy, recall, etc.) and optimization ideas (such as building indexes, rearrangements, etc.) in search algorithms to more fully evaluate and optimize our retrieval effects. This part will not be described in detail, and interested readers are welcome to conduct in-depth research and sharing.

5.3.2 Ideas for optimizing retrieval

The above describes several general ideas for evaluating the retrieval effect. When we make a reasonable evaluation of the system's retrieval effect and find the corresponding Bad Case, we can disassemble the Bad Case into multiple dimensions to optimize the retrieval part. Note that although in the evaluation section above, we emphasized that the verification query to evaluate the retrieval effect must ensure that its correct answer exists in the knowledge base, here we default to knowledge base construction as part of the retrieval section. Therefore, we also need to solve Bad Cases caused by incorrect knowledge base construction in this section. Here, we share some common bad case attributions and possible optimization ideas.

Fragments of knowledge are fragmented and answers are lost

This problem generally manifests itself as: for a user query, we can be sure that the question must exist in the knowledge base, but we find that the retrieved knowledge fragments separate the correct answer, resulting in the inability to form a complete and reasonable answer. This kind of question is more common on queries that require a long answer.

The general optimization idea for this type of problem is to optimize the text cutting method. What we use in "C3 Building Knowledge Base" is the most primitive segmentation method, that is, segmentation based on specific characters and chunk size. However, this type of segmentation method often cannot take into account the semantics of the text, and it is easy to cause the strongly related context of the same topic to be divided into two chunks. For some knowledge documents with a unified format and clear organization, we can construct more appropriate segmentation rules; for documents with a chaotic format and unable to form unified segmentation rules, we can consider incorporating a certain amount of manpower for segmentation. We can also consider training a model dedicated to text segmentation to achieve chunk segmentation based on semantics and topics.

query requires a long context and summary answer

This problem is also a problem in the construction of knowledge base. That is, some of the questions raised by the query require retrieval parts spanning a long context to give a general answer, that is, multiple chunks need to be spanned to comprehensively answer the question. However, due to model context limitations, it is often difficult for us to give a sufficient number of chunks.

The general optimization idea for this type of problem is to optimize the way the knowledge base is constructed. For documents that may require such answers, we can add a step to solve this problem to a certain extent by using LLM to summarize long documents, or preset questions for LLM to answer, thereby pre-filling the possible answers to such questions into the knowledge base as a separate chunk.

Keywords misleading

This problem generally manifests itself as, for a user query, the knowledge fragment retrieved by the system has many keywords that are strongly related to the query, but the knowledge fragment itself is not an answer to the query. This situation generally results from the fact that there are multiple keywords in the query, and the matching effect of the secondary keywords affects the primary keywords.

The general optimization idea for this type of problem is to rewrite the user query, which is also a common idea in many large model applications. That is, for the user input query, we first use LLM to rewrite the user query into a reasonable form, removing the impact of secondary keywords and possible typos and omissions. The specific form of rewriting depends on the specific business. You can ask LLM to refine the query to form a Json object, or you can ask LLM to expand the query, etc.

The matching relationship is unreasonable

This problem is relatively common, that is, the matched strongly relevant text segment does not contain the answer text. The core problem with this problem is that the vector model we use does not match our initial assumptions. When explaining the framework of RAG, we mentioned that there is a core assumption behind the effectiveness of RAG, that is, we assume that the strongly relevant text segment we match is the answer text segment corresponding to the question. However, many vector models actually construct "pairing" semantic similarity rather than "causal" semantic similarity. For example, for the query-"How is the weather today", "I want to know the weather today" is considered to be more relevant than "The weather is good".

The general optimization ideas for this type of problem are to optimize the vector model or build an inverted index. We can choose a vector model with better performance, or collect some data and fine-tune a vector model that is more suitable for our own business. We can also consider building an inverted index, that is, for each knowledge fragment in the knowledge base, build an index that can characterize the content of the fragment but has a more accurate relative correlation with the query, and match the correlation between the index and the query instead of the full text during retrieval, thereby improving the accuracy of the matching relationship.

There are many ideas for optimizing the retrieval part. In fact, the optimization of the retrieval part is often the core engineering part of RAG application development. Due to space limitations, we will not go into details about more techniques and methods here. Interested readers are welcome to read our upcoming second part "LLM Development Techniques".

#Chapter 5 System Evaluation and Optimization

#5.1 How to evaluate LLM applications

#5.1.1 General idea of ​​verification evaluation

#5.1.2 Large model evaluation method

#5.1.2.1 Quantitative Assessment

#5.1.2.2 Multidimensional Assessment

#5.1.2.4 Construct objective questions

#5.1.2.5 Calculate answer similarity

#5.1.2.6 Use large models for evaluation

#5.1.2.7 Hybrid evaluation

#5.2 Evaluate and optimize the generation part

#5.2.1 Improve the quality of intuitive answers

#5.2.2 Indicate the source of knowledge to improve credibility

#5.2.3 Construct a thinking chain

#5.2.4 Add a command parsing

#5.3 Evaluate and optimize the search part

#5.3.1 Evaluate retrieval results

#5.3.2 Ideas for optimizing retrieval