LLM Universe | Hands-On LLM Application Development

📢 This tutorial comes from an open-source project. Project address: https://github.com/datawhalechina/llm-universe

Project Introduction

This project is a tutorial on large model application development for novice developers. Built around a personal knowledge base assistant project and designed to run on an Alibaba Cloud server, it covers the essentials of large model development in a single course. The main contents are:

  1. Introduction to large models: what a large model is, what its characteristics are, what LangChain is, and how to develop an LLM application, as a brief introduction for novice developers;
  2. How to call large model APIs: the various ways to call well-known large model APIs from China and abroad, including calling the native API, wrapping it as a LangChain LLM, and wrapping it as a FastAPI service, along with a unified wrapper around several large model APIs such as Baidu Wenxin, iFlytek Spark, and Zhipu AI;
  3. Knowledge base construction: loading and processing different types of knowledge base documents, and building a vector database;
  4. Building a RAG application: connecting an LLM to LangChain to build a retrieval question-answering chain, and deploying the application with Streamlit;
  5. Validation and iteration: how to validate and iterate in large model development, and what the common evaluation methods are.
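
Items 3 and 4 above (knowledge base construction and the retrieval question-answering chain) can be pictured with a minimal sketch. It is not the project's code: the bag-of-words `embed` function stands in for a real Embedding API, `ToyVectorStore` stands in for a real vector database, and every name here is hypothetical. The final prompt would be sent to an LLM, which this sketch does not call.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, max_words: int = 20) -> list[str]:
    """Slice a document into fixed-size word windows (stand-in for a real text splitter)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts. ASSUMPTION: a real app calls an Embedding API."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory list of (chunk, vector) pairs; hypothetical stand-in for a vector database."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add_document(self, text: str) -> None:
        """Split the document, embed each chunk, and store both."""
        for chunk in split_into_chunks(text):
            self._items.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 2) -> list[str]:
        """Return the k chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the retrieved chunks and the question into one prompt string."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {question}"


if __name__ == "__main__":
    store = ToyVectorStore()
    store.add_document("Streamlit is a Python library for deploying small web applications quickly.")
    store.add_document("A vector database stores embeddings and supports similarity search.")
    question = "How do I deploy with Streamlit?"
    print(build_prompt(question, store.search(question, k=1)))
```

In a real application each stand-in would likely be swapped for the corresponding component (an embedding model, a vector database, an LLM call), while the load, split, embed, retrieve, and prompt steps keep the same shape.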

This project consists of three main parts:

  1. Introduction to LLM development. A simplified version of V1, designed to help beginners get started with LLM development as quickly and easily as possible. By understanding the general process of LLM development, they can build a simple demo.
  2. LLM development skills. More advanced LLM development techniques, including but not limited to prompt engineering, processing multiple types of source data, retrieval optimization, recall and fine-grained reranking, and Agent frameworks.
  3. LLM application examples. Introduces some successful open-source cases and, from the perspective of this course, analyzes their approach, core ideas, and implementation frameworks to help beginners understand what kinds of applications they can build with LLMs.

Project significance

LLMs are gradually becoming a new revolutionary force in the information world. Their strong natural language understanding and generation capabilities give developers new and more powerful options for building applications. With LLM API services opening up rapidly in China and abroad, quickly and conveniently building more capable, LLM-integrated applications on top of these APIs is becoming an important skill for developers.

At present, there are many introductions to LLMs and many scattered courses on LLM development skills, but their quality is uneven and they are not well integrated. Developers have to search through a large number of tutorials and read a great deal of marginally relevant, non-essential content just to pick up the basic skills of large model development. Learning is inefficient and the barrier to entry is high.

This project starts from practice. Using the personal knowledge base assistant, one of the most common project types, it explains the general process and steps of LLM development in plain terms, breaking them down step by step. It is designed to help novices with no background in algorithms complete a basic introduction to large model development in a single course. We will also add advanced RAG development skills and walkthroughs of some successful LLM applications, so that readers who have completed the first part can master more advanced RAG techniques and build their own fun applications by learning from existing successful projects.

Project Audience

All developers who have basic Python skills and want to master LLM application development skills.

**This project does not require learners to have any background in artificial intelligence or algorithms. Basic Python syntax and basic Python development skills are enough.**

To ease environment setup, this project gives students free access to Alibaba Cloud servers. Student readers can claim an Alibaba Cloud server for free and complete the course on it. The project also provides environment setup guides for personal computers and non-Alibaba Cloud servers. It places essentially no requirements on local hardware and does not need a GPU, so both personal computers and servers can be used for learning.

**Note: This project mainly uses APIs provided by major large model vendors for application development. If you want to learn how to deploy and use open-source LLMs locally, you are welcome to study Self LLM | 开源大模型食用指南, also produced by Datawhale. That project walks you step by step through the full pipeline of deploying and fine-tuning open-source LLMs!**

**Note: To keep the learning curve manageable, this project is aimed mainly at beginners and introduces how to build applications with LLMs. If you want to study the theoretical foundations of LLMs and apply them with a deeper theoretical understanding, you are welcome to study So Large LM | 大模型基础, also produced by Datawhale. That project provides comprehensive, in-depth theoretical knowledge and practical methods for LLMs!**

Project Highlights

  1. Fully practice-oriented, hands-on LLM development. Unlike similar tutorials that start from theory and leave a large gap between theory and practice, this tutorial is built on a broadly applicable personal knowledge base assistant project. It weaves general large model development concepts into project practice and helps learners master large model development skills by building their own project.

  2. Starting from scratch, a comprehensive yet short tutorial on large models. This project restructures the relevant theory, concepts, and basic skills of large model development around the personal knowledge base assistant project, removing underlying principles and algorithmic details that do not need to be understood while still covering all the core skills. The whole tutorial takes only a few hours, yet after completing it you will have the core skills of basic large model development.

  3. Both unified and extensible. This project provides a unified wrapper around major LLM APIs from China and abroad, such as GPT, Baidu Wenxin, iFlytek Spark, and Zhipu GLM. It supports calling different LLMs with one click, so developers can focus on learning the application and on optimizing the model itself instead of spending time on tedious calling details. This tutorial is also planned to launch on 奇想星球 | AIGC共创社区平台, which lets learners add their own custom projects as extension content, making the tutorial fully extensible.
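
The idea of a unified wrapper with one-click switching between LLMs can be illustrated with a minimal sketch. It is an assumption about the general shape of such a design, not the project's actual code: `BaseLLM`, `register`, `get_llm`, and the two fake providers are hypothetical names, and the fake providers echo the prompt instead of calling any real vendor SDK.

```python
from abc import ABC, abstractmethod


class BaseLLM(ABC):
    """Common interface that every provider wrapper implements (hypothetical)."""

    def __init__(self, model: str, temperature: float = 0.1) -> None:
        self.model = model
        self.temperature = temperature

    @abstractmethod
    def chat(self, prompt: str) -> str:
        """Send one prompt and return the model's reply as text."""


# Maps a provider name to the wrapper class that handles it.
_REGISTRY: dict[str, type] = {}


def register(name: str):
    """Class decorator that records a wrapper class under a provider name."""
    def decorator(cls: type) -> type:
        _REGISTRY[name] = cls
        return cls
    return decorator


@register("fake-provider-a")
class FakeProviderA(BaseLLM):
    def chat(self, prompt: str) -> str:
        # ASSUMPTION: a real wrapper would call the vendor's SDK or HTTP API here.
        return f"[A:{self.model}] echo: {prompt}"


@register("fake-provider-b")
class FakeProviderB(BaseLLM):
    def chat(self, prompt: str) -> str:
        # Same interface, different (fake) backend.
        return f"[B:{self.model}] echo: {prompt}"


def get_llm(provider: str, model: str, **kwargs) -> BaseLLM:
    """Build a wrapper by provider name, so callers never touch vendor-specific code."""
    if provider not in _REGISTRY:
        raise ValueError(f"Unknown provider {provider!r}; known: {sorted(_REGISTRY)}")
    return _REGISTRY[provider](model=model, **kwargs)


if __name__ == "__main__":
    # Switching vendors is a one-argument change; the calling code stays the same.
    for provider in ("fake-provider-a", "fake-provider-b"):
        llm = get_llm(provider, model="demo-model", temperature=0.0)
        print(llm.chat("Hello"))
```

Because application code depends only on `chat`, adding another vendor in this sketch means writing one more registered subclass, with no change to the code that uses it.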

Online reading address

https://datawhalechina.github.io/llm-universe/

PDF address

https://github.com/datawhalechina/llm-universe/releases/tag/v1

Content Outline

Part 1 Introduction to LLM Development

Person in charge: Zou Yuheng

  1. Introduction to LLMs @高立业
    1. Theoretical introduction to LLMs
    2. What is RAG
    3. What is LangChain
    4. The overall process of developing LLM applications
    5. Basic use of Alibaba Cloud servers
    6. Basic use of GitHub Codespaces (optional)
    7. Environment setup
  2. Developing applications with LLM APIs @小雨
    1. Basic concepts
    2. Using LLM APIs
      • ChatGPT
      • Baidu Wenxin
      • iFlytek Spark
      • GLM
    3. Prompt Engineering
  3. Building the knowledge base @loutianao
    1. Introduction to word vectors and vector knowledge bases
    2. Using the Embedding API
    3. Data processing: reading, cleaning, and slicing
    4. Building and using a vector database
  4. Building a RAG application @Xu Hu
    1. Connecting LLMs to LangChain
      • ChatGPT
      • Baidu Wenxin
      • iFlytek Spark
      • GLM
    2. Building a retrieval question-answering chain with LangChain
    3. Deploying the knowledge base assistant with Streamlit
  5. System evaluation and optimization @Zou Yuheng
    1. How to evaluate LLM applications
    2. Evaluating and optimizing the generation part
    3. Evaluating and optimizing the retrieval part

Part 2 Advanced RAG Techniques (in progress)

Person in charge: Gao Liye

  1. Background
    1. Architecture overview
    2. Existing problems
    3. Solutions
  2. Data processing
    1. Multi-type document processing
    2. Chunking optimization
    3. Choosing a vector model
    4. Fine-tuning the vector model (advanced)
  3. Index level
    1. Index structure
    2. Hybrid retrieval
    3. Hypothetical questions
  4. Retrieval stage
    1. Query filtering
    2. Aligning queries and documents
    3. Aligning retrieval and the LLM
  5. Generation stage
    1. Post-processing
    2. Fine-tuning the LLM (advanced)
    3. Citing references
  6. Enhancement stage
    1. Context enhancement
    2. Enhancing the process
  7. RAG engineering evaluation

Part 3 Interpretation of Open Source LLM Applications

Person in charge: Xu Hu

  1. ChatWithDatawhale: a walkthrough of the personal knowledge base assistant
  2. Tianji: a walkthrough of a large model for social etiquette and interpersonal skills

Acknowledgments

Core Contributor

Main Contributors

Other

  1. Special thanks to @Sm1les and @LSGOMYP for their help and support on this project;
  2. Special thanks to 奇想星球 | AIGC共创社区平台 for its support; everyone is welcome to follow it;
  3. If you have any ideas, please contact us. Datawhale also welcomes everyone to open issues;
  4. Special thanks to all the students who contributed to the tutorial!
