Embedding packaging explanation

LangChain provides an efficient framework for building custom applications on top of LLMs, letting developers quickly put LLM capabilities to use. LangChain also supports Embeddings from a variety of large models, with built-in calling interfaces for models such as OpenAI and LLAMA. However, LangChain does not build in every large model. Instead, it stays extensible by letting users define custom Embeddings types.

In this section, we take Zhipu AI as an example to describe how to customize Embeddings based on LangChain.

This part involves more of the technical details of LangChain and large model calls. If you have the time and energy, work through the implementation. Otherwise, you can use the code below directly to support the calls.

To implement custom Embeddings, define a class that inherits from LangChain's Embeddings base class and implements two methods: ① embed_query, which embeds a single string (the query); ② embed_documents, which embeds a list of strings (the documents).

First we import the required libraries:

from typing import List
from langchain_core.embeddings import Embeddings

Here we define a custom Embeddings class that inherits from the Embeddings class:

class ZhipuAIEmbeddings(Embeddings):
    """`Zhipuai Embeddings` embedding models."""
    def __init__(self):
        """
        Instantiate the ZhipuAI client and store it as self.client.

        Raises:
            ModuleNotFoundError: If the zhipuai library is not installed
                ("No module named 'zhipuai'").
        """
        # Imported here so the missing-library error appears at instantiation.
        from zhipuai import ZhipuAI
        # With no api_key argument, the client reads the key from the
        # ZHIPUAI_API_KEY environment variable.
        self.client = ZhipuAI()

embed_documents is the method that computes embeddings for a list of strings (List[str]). Here we override it and use the ZhipuAI client instantiated in __init__ to call the remote API and return the embedding results.

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """
        Generate embeddings for a list of input texts.

        Args:
            texts (List[str]): The list of texts to embed.

        Returns:
            List[List[float]]: One embedding per input document, each represented as a list of floats.
        """
        response = self.client.embeddings.create(
            model="embedding-3",
            input=texts
        )
        # Each item in response.data holds the embedding for one input text.
        return [item.embedding for item in response.data]

embed_query is the method that computes the embedding for a single text (str). Here we call the embed_documents method we just defined and return the first sublist.

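    def embed_query(self, text: str) -> List[float]:
        """
        Generate the embedding for a single input text.

        Args:
            text (str): The text to embed.

        Returns:
            List[float]: The embedding of the input text, as a list of floats.
        """
        # Wrap the text in a list, reuse embed_documents, and unwrap the result.
        return self.embed_documents([text])[0]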

Before requesting the embedding, you can add some preprocessing to the methods above. For example, if a text is particularly long, consider splitting it into segments so that it does not exceed the maximum token limit. Improvements like this are left to the reader; what is shown here is only a simple demo.
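
As one possible way to do the segmenting mentioned above, here is a minimal sketch. It is not part of the original code: split_text and embed_in_chunks are hypothetical helpers, and the character-based limit is only a rough stand-in for the model's real token limit, which you would need to check in the Zhipu AI documentation.

```python
from typing import Callable, List, Tuple


def split_text(text: str, max_chars: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into windows of at most max_chars characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the chunks.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    if len(text) <= max_chars:
        return [text]
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def embed_in_chunks(
    embed_documents: Callable[[List[str]], List[List[float]]],
    text: str,
    max_chars: int = 1000,
    overlap: int = 100,
) -> Tuple[List[str], List[List[float]]]:
    """Split one long text and embed every chunk with the given function."""
    chunks = split_text(text, max_chars, overlap)
    return chunks, embed_documents(chunks)


# Offline demo: a fake embedding function (assumption) replaces the remote API.
def fake_embed(texts: List[str]) -> List[List[float]]:
    return [[float(len(t))] for t in texts]


chunks, vectors = embed_in_chunks(fake_embed, "abcdefghij", max_chars=4, overlap=1)
print(chunks)   # ['abcd', 'defg', 'ghij']
print(vectors)  # [[4.0], [4.0], [4.0]]
```

In real use you would pass the embed_documents method of a ZhipuAIEmbeddings instance instead of fake_embed, and pick max_chars conservatively, since the number of tokens per character varies with the language of the text.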

Through the above steps, we have defined an embedding interface based on LangChain and Zhipu AI. We encapsulate this code in the zhipuai_embedding.py file.
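
To check the logic of the class without an API key or network access, one option is to swap in a stub client. The sketch below is an assumption-laden illustration, not the original code: StubClient, StubEmbeddingsAPI, and OfflineZhipuAIEmbeddings are hypothetical names, the stub only mimics the response shape that the code above reads (`.data[i].embedding`), and the client is passed in through the constructor instead of being created inside it.

```python
from types import SimpleNamespace
from typing import List

try:
    from langchain_core.embeddings import Embeddings
except ImportError:
    # Fallback so the sketch runs without LangChain installed.
    class Embeddings:  # stand-in, not the real LangChain base class
        pass


class StubEmbeddingsAPI:
    """Hypothetical stand-in for client.embeddings; makes no network calls."""

    def create(self, model: str, input: List[str]):
        # Fake vector per text: [number of characters, number of words].
        data = [
            SimpleNamespace(embedding=[float(len(t)), float(len(t.split()))])
            for t in input
        ]
        return SimpleNamespace(data=data)


class StubClient:
    """Hypothetical stand-in for the ZhipuAI client object."""

    def __init__(self):
        self.embeddings = StubEmbeddingsAPI()


class OfflineZhipuAIEmbeddings(Embeddings):
    """Same method logic as ZhipuAIEmbeddings, with an injected client."""

    def __init__(self, client):
        self.client = client

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        response = self.client.embeddings.create(model="embedding-3", input=texts)
        return [item.embedding for item in response.data]

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]


embedder = OfflineZhipuAIEmbeddings(StubClient())
doc_vectors = embedder.embed_documents(["hello world", "LangChain"])
query_vector = embedder.embed_query("hello world")
print(doc_vectors)   # [[11.0, 2.0], [9.0, 1.0]]
print(query_vector)  # [11.0, 2.0]
```

The stub vectors carry no meaning; the point is only that embed_documents returns one vector per input text and that embed_query returns the same vector embed_documents produces for that text.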

The source code for this article is available here. If you need to reproduce the results, you can download and run it.
