LLM Universe | Hands-on learning of large model application development
📢 This tutorial comes from an open-source project. Project address: https://github.com/datawhalechina/llm-universe
Project Introduction
This project is a large model application development tutorial for novice developers. Built around Alibaba Cloud servers and a personal knowledge base assistant project, it aims to cover the essentials of large model development in a single course. The main contents include:
- Introduction to large models: what a large model is, what its characteristics are, what LangChain is, and how to develop an LLM application; a brief introduction for novice developers;
- How to call large model APIs: this section introduces several ways to call the APIs of well-known large model products from China and abroad, including calling the native API, wrapping it as a LangChain LLM, and wrapping it with FastAPI. It also wraps the APIs of Baidu Wenxin, iFlytek Spark, Zhipu AI, and other large models in a unified form;
- Knowledge base construction: loading and processing different types of knowledge base documents, and building a vector database;
- Building a RAG application: connecting an LLM to LangChain to build a retrieval question-answering chain, and deploying the application with Streamlit;
- Verification and iteration: how to carry out verification and iteration in large model development, and what the general evaluation methods are.
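The knowledge base and RAG steps listed above can be pictured in miniature. The following is a toy sketch, not the project's code: it assumes a bag-of-words count vector in place of a real embedding API, a plain Python list in place of a vector database, and hypothetical helper names (`split_text`, `embed`, `retrieve`, `build_prompt`). The tutorial itself relies on LangChain and real embedding services.

```python
import math
import re
from collections import Counter


def split_text(text: str) -> list[str]:
    """Slice a document into sentence-level chunks (a stand-in for real text splitting)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (assumption: replaces an embedding API)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the retrieved chunks and the question into one prompt for an LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the context below.\nContext:\n{joined}\nQuestion: {query}"


DOCUMENT = (
    "LangChain is a framework for building LLM applications. "
    "Streamlit turns Python scripts into web apps. "
    "A vector database stores embeddings for similarity search."
)
QUESTION = "What does a vector database store?"

chunks = split_text(DOCUMENT)
top = retrieve(QUESTION, chunks, k=1)
prompt = build_prompt(QUESTION, top)
print(prompt)
```

In a real application the final prompt would be sent to an LLM API; here it is only printed, so the sketch runs without any API key.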
This project mainly includes three parts:
- Introduction to LLM development. A simplified version of V1, designed to help beginners get started with LLM development as quickly and easily as possible. By understanding the general process of LLM development, they can build a simple demo.
- LLM development skills. More advanced LLM development techniques, including but not limited to prompt engineering, processing multiple types of source data, retrieval optimization, recall and re-ranking, and agent frameworks.
- LLM application examples. Introduces some successful open-source cases and, from the perspective of this course, analyzes the approach, core ideas, and implementation frameworks of these examples to help beginners understand what kinds of applications they can build with LLMs.
Project Significance
LLMs are becoming a revolutionary new force in the information world. Their strong natural language understanding and generation capabilities give developers new, more powerful options for building applications. As LLM API services open up rapidly in China and abroad, the ability to quickly and conveniently build more capable, LLM-integrated applications on top of these APIs is becoming an important skill for developers.
There are already many introductions to LLMs and scattered courses on LLM development skills, but their quality is uneven and they are not well integrated. To gain an initial grasp of the skills needed for large model development, developers have to search through many tutorials and read a great deal of loosely relevant, nonessential content. Learning efficiency is low and the barrier to entry is high.
This project starts from practice. Using the personal knowledge base assistant, one of the most common LLM application projects, it explains the general process and steps of LLM development in simple terms and breaks them down step by step, aiming to help novices without an algorithm background complete a basic introduction to large model development in a single course. We will also add advanced RAG development techniques and interpretations of some successful LLM application cases, to help readers who have finished the first part master higher-level RAG development skills and build their own fun applications by learning from existing successful projects.
Project Audience
All developers who have basic Python skills and want to master LLM application development skills.
**This project does not require learners to have any background in artificial intelligence or algorithms. They only need basic Python syntax and basic Python development skills.**
To ease environment setup, this project offers student readers free access to Alibaba Cloud servers, on which they can complete the course. It also provides environment-setup guides for personal computers and non-Alibaba Cloud servers. The project has essentially no local hardware requirements and does not need a GPU; both personal computers and servers can be used for learning.
**Note: This project mainly uses the APIs provided by major large model vendors for application development. If you want to learn how to deploy and apply open-source LLMs locally, you are welcome to study Self LLM | 开源大模型食用指南, also produced by Datawhale. That project teaches you, step by step, the full pipeline of deploying and fine-tuning open-source LLMs!**
**Note: To keep the learning curve manageable, this project is aimed mainly at beginners and introduces how to use LLMs to build applications. If you want to study the theoretical foundations of LLMs and build a deeper understanding and application of LLMs on that basis, you are welcome to study So Large LM | 大模型基础, also produced by Datawhale. That project provides comprehensive, in-depth theoretical knowledge and practical methods for LLMs!**
Project Highlights
- Fully practice-oriented, hands-on LLM development. Whereas similar tutorials start from theory and leave a large gap between theory and practice, this tutorial is built on a broadly applicable personal knowledge base assistant project. It weaves general large model development concepts into project practice and helps learners master large model development skills by building their own project.
- Starting from scratch, a comprehensive yet concise large model tutorial. Centered on the personal knowledge base assistant project, this project reorganizes the relevant large model development theory, concepts, and basic skills around the project itself, dropping underlying principles and algorithm details that learners do not need to understand while covering all the core skills of large model development. The whole tutorial takes only a few hours, yet after completing it you can master all the core skills of basic large model development.
- Both unified and extensible. This project wraps major domestic and foreign LLM APIs such as GPT, Baidu Wenxin, iFlytek Spark, and Zhipu GLM behind a unified interface and supports one-click calling of different LLMs, so developers can focus on learning the application and optimizing the model itself rather than on tedious calling details. This tutorial is also planned to launch on 奇想星球 | AIGC共创社区平台, which lets learners add their own custom projects as extension content for this tutorial, making it fully extensible.
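One way to picture a unified wrapper with one-click switching between LLMs is a small dispatch table. This is a minimal sketch under stated assumptions, not the project's actual API: the names `register`, `get_completion`, the model names `gpt-demo` and `glm-demo`, and the echo backends are all hypothetical. A real backend would call the corresponding vendor SDK instead of echoing the prompt.

```python
from typing import Callable, Dict

# Hypothetical registry: model-name prefix -> backend callable(prompt, model).
_BACKENDS: Dict[str, Callable[[str, str], str]] = {}


def register(prefix: str):
    """Decorator that registers a backend for every model name starting with prefix."""
    def decorator(fn: Callable[[str, str], str]) -> Callable[[str, str], str]:
        _BACKENDS[prefix] = fn
        return fn
    return decorator


@register("gpt")
def _gpt_backend(prompt: str, model: str) -> str:
    # Placeholder: a real implementation would call the vendor's API here.
    return f"[{model}] echo: {prompt}"


@register("glm")
def _glm_backend(prompt: str, model: str) -> str:
    # Placeholder: a real implementation would call the vendor's API here.
    return f"[{model}] echo: {prompt}"


def get_completion(prompt: str, model: str) -> str:
    """Single entry point: pick the backend by model-name prefix and call it."""
    for prefix, backend in _BACKENDS.items():
        if model.startswith(prefix):
            return backend(prompt, model)
    raise ValueError(f"Unsupported model: {model}")


# Switching models only changes the `model` argument, not the calling code.
print(get_completion("Hello", model="gpt-demo"))
print(get_completion("Hello", model="glm-demo"))
```

The design choice here is that application code depends only on `get_completion`, so adding another provider means registering one more backend rather than touching every call site.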
Online reading address
https://datawhalechina.github.io/llm-universe/
PDF address
https://github.com/datawhalechina/llm-universe/releases/tag/v1
Content Outline
Part 1 Introduction to LLM Development
Person in charge: Zou Yuheng
- Introduction to LLM @高立业
- Developing Applications with LLM APIs @小雨
  - Basic concepts
  - Using LLM API
    - ChatGPT
    - Wenxin Yiyan
    - iFlytek Spark
    - GLM
  - Prompt Engineering
- Building the Knowledge Base @loutianao
  - Introduction to word vectors and vector knowledge base
  - Using Embedding API
  - Data processing: reading, cleaning and slicing
  - Build and use vector database
- Building a RAG Application @Xu Hu
  - Connect LLM to LangChain
    - ChatGPT
    - Wenxin Yiyan
    - iFlytek Spark
    - GLM
  - Build a retrieval question-answering chain based on LangChain
  - Deploy knowledge base assistant based on Streamlit
- System Evaluation and Optimization @Zou Yuheng
  - How to evaluate LLM applications
  - Evaluate and optimize the generation part
  - Evaluate and optimize the retrieval part
Part 2 Advanced RAG Techniques (in progress)
Person in charge: Gao Liye
- Background
  - Architecture overview
  - Problems
  - Solutions
- Data processing
  - Multi-type document processing
  - Chunking optimization
  - Selection of vector model
  - Fine-tuning vector model (advanced)
- Index level
  - Index structure
  - Hybrid retrieval
  - Hypothetical questions
- Retrieval stage
  - Query filtering
  - Aligning query and document
  - Aligning retrieval and LLM
- Generation stage
  - Post-processing
  - Fine-tuning LLM (advanced)
  - Citing references
- Enhancement stage
  - Context enhancement
  - Enhancing the process
- RAG engineering evaluation
Part 3 Interpretation of Open Source LLM Applications
Person in charge: Xu Hu
- ChatWithDatawhale - Interpretation of a personal knowledge base assistant
- Tianji - Interpretation of a large model for social etiquette and interpersonal skills
Acknowledgments
Core Contributors
- 娄天奥 - Project lead (Datawhale member, graduate student at University of Chinese Academy of Sciences)
- 邹雨衡 - Project lead (Datawhale member, graduate student at University of International Business and Economics)
- 高立业 - Part 2 lead (Datawhale member, algorithm engineer)
- 徐虎 - Part 3 lead (Datawhale member, algorithm engineer)
Main Contributors
- 毛雨 - Content creator (backend development engineer)
- 崔腾松 - Project supporter (Datawhale member, co-initiator of 奇想星球)
- June - Project supporter (Datawhale member, co-initiator of 奇想星球)
Other
- Special thanks to @Sm1les and @LSGOMYP for their help and support on this project;
- Special thanks to 奇想星球 | AIGC共创社区平台 for its support; everyone is welcome to follow it;
- If you have any ideas, feel free to contact us; Datawhale also welcomes everyone to open issues;
- Special thanks to all the students who contributed to the tutorial!

