Complete Guide to FastAPI Streaming Responses
What is a streaming response?
Imagine waiting for a 3-hour movie file to download. The traditional approach is for the server to prepare the entire file and then hand it over in one go. This leads to two problems:
- Server memory may be exhausted (for example, when a 4 GB file is read entirely into memory)
- You have to wait a long time to see the first frame
A streaming response is like playing while downloading: the server sends each small piece of data as soon as it is ready, and the client can consume that piece immediately. Nobody waits for the full payload, and nobody has to hold all the data at once.
Traditional HTTP vs Streaming Response
Streaming responses in FastAPI mainly rely on StreamingResponse. Combined with a Python async generator, it lets data flow out gradually, like water from a faucet.
SSE or WebSocket: which to choose?
Many readers wonder how to choose between StreamingResponse, SSE, and WebSocket. In short:
- StreamingResponse: the most basic streaming output, suited to one-off streams (such as file downloads or simple number sequences). It is the easiest to develop.
- SSE (Server-Sent Events): built on StreamingResponse, it is a standardized one-way push protocol. Browsers support it natively, and it comes with automatic reconnection and event types, which makes it a good fit for token streams in AI conversations, status updates, and similar cases.
- WebSocket: full-duplex, two-way communication with higher complexity, suited to scenarios where the client frequently sends messages to the server, such as chat, games, and collaborative editing.
This article focuses on StreamingResponse and the most common layer built on top of it, SSE.
StreamingResponse basics and optimization
Simplest example: a number stream
Let's start with the simplest streaming response, letting the server push a number every 0.5 seconds:
Visit http://localhost:8000/simple-stream and you will find that the browser does not display the entire content at once but adds a new line every 0.5 seconds, much like the output of a terminal command.
Handle client disconnection gracefully
The user may refresh the page or close the tab midway. When that happens, FastAPI cancels the running asynchronous task, which raises asyncio.CancelledError inside the generator. Catch this exception in the generator so that resources such as database connections and open files can be released.
Key point: after catching CancelledError you must re-raise it. asyncio relies on the exception propagating to complete the cancellation, and FastAPI's internal lifecycle management depends on that; swallowing it may cause connection leaks.
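The catch, clean up, re-raise pattern can be sketched with the standard library alone. The names guarded_stream and cleanup_log are hypothetical; cleanup_log stands in for real cleanup such as closing a database connection.

```python
import asyncio
from typing import AsyncIterator, List

cleanup_log: List[str] = []  # hypothetical stand-in for real resource cleanup


async def guarded_stream(delay: float = 0.5) -> AsyncIterator[str]:
    """Stream lines until cancelled, then release resources and re-raise."""
    try:
        i = 0
        while True:
            i += 1
            yield f"line {i}\n"
            await asyncio.sleep(delay)
    except asyncio.CancelledError:
        # The consumer was cancelled (for example, the client disconnected).
        cleanup_log.append("cancelled")
        raise  # let the cancellation keep propagating
    finally:
        # Runs whether the stream ends normally, is cancelled, or fails.
        cleanup_log.append("resources released")
```

In a route, such a generator would be passed to StreamingResponse in the usual way. Putting the release logic in finally means it also runs when the stream ends for other reasons.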
SSE server sends events
SSE (Server-Sent Events) is a message format specification that, in FastAPI, is implemented on top of StreamingResponse. Its purpose is to let the browser's EventSource API receive server-pushed messages directly and cleanly. Its benefits include:
- Automatic reconnection: the browser reconnects on its own after a network interruption, typically retrying after about 3 seconds by default.
- Event Classification: You can add event names to messages, and the front end processes them according to different event types.
- Lightweight: Based on HTTP, no additional libraries required, firewall friendly.
SSE data format
SSE is delivered via a plain text stream, with strict format requirements:
- Each message starts with data:, followed by the content (plain text or JSON)
- The end of a message is marked by two consecutive newlines (\n\n)
- Optional fields: event: defines the event type, id: sets the message ID (used to resume after a reconnect), retry: sets the reconnection interval (in milliseconds)
Let’s look at a complete SSE endpoint example:
⚠️ The X-Accel-Buffering: no response header is key. Without it, when Nginx acts as a reverse proxy the data is buffered, and the front end may wait several seconds before receiving a batch of messages, losing all real-time behavior.
AI dialogue typewriter effect
In AI dialogue products such as ChatGPT, responses are "typed" out word by word, which is the classic typewriter effect. The underlying mechanism is SSE streaming output: each time the backend generates a token, it is pushed immediately.
Simulate AI reply
Below we simulate a simplified AI answering process that sends a sentence character by character:
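A minimal sketch of that simulation, under stated assumptions: the canned REPLY text, the typewriter_events name, and the {"content": ...} payload shape are hypothetical and not taken from any particular AI API.

```python
import asyncio
import json
from typing import AsyncIterator

REPLY = "Streaming lets the client render text as it arrives."  # hypothetical canned answer


async def typewriter_events(text: str = REPLY, delay: float = 0.05) -> AsyncIterator[str]:
    """Emit one SSE message per character to mimic token-by-token output."""
    for ch in text:
        # JSON-encode each chunk so spaces and newlines survive SSE framing.
        payload = json.dumps({"content": ch})
        yield f"data: {payload}\n\n"
        await asyncio.sleep(delay)  # stands in for model generation latency
```

A route would return this generator through StreamingResponse with the text/event-stream media type. A real backend would yield model tokens instead of single characters.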
As the front end receives each chunk, it can append it to the conversation view in real time to create the typing effect. A "[DONE]" marker is usually added to signal the end of the stream, but it is omitted here for simplicity. In real scenarios, refer to OpenAI's response format.
File stream and log real-time push
Download large files in chunks
Streaming responses also apply to file downloads. The naive approach, reading the entire file into memory and returning it in one response, breaks down once the file reaches gigabytes. (FileResponse itself sends a file from disk in chunks; StreamingResponse is the tool when you want to control the chunking yourself or the bytes are produced on the fly.)
Using StreamingResponse with asynchronous file reading, we can read and send one small block (such as 8 KB) at a time, so memory usage stays bounded by the block size.
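One way to sketch chunked reading without extra dependencies is to run the blocking reads in a worker thread via asyncio.to_thread (Python 3.9+). This is an assumption standing in for whichever async file library the application actually uses; file_chunks and CHUNK_SIZE are hypothetical names.

```python
import asyncio
from pathlib import Path
from typing import AsyncIterator

CHUNK_SIZE = 8 * 1024  # 8 KB, the block size mentioned in the text


async def file_chunks(path: Path, chunk_size: int = CHUNK_SIZE) -> AsyncIterator[bytes]:
    """Yield a file piece by piece so only one chunk is held in memory."""
    f = await asyncio.to_thread(path.open, "rb")
    try:
        while True:
            # The blocking read runs in a worker thread, keeping the event loop free.
            chunk = await asyncio.to_thread(f.read, chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            yield chunk
    finally:
        f.close()  # also runs if the client disconnects mid-download
```

In a route, the generator would be passed to StreamingResponse with a media type such as application/octet-stream, usually alongside a Content-Disposition header so the browser treats the response as a download.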
Real-time log push
Similarly, we can push the logs generated by the application to the front-end dashboard in real time. For example, there is a log queue in the background, and the generator continuously fetches log lines from it and sends them:
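The queue-driven generator described above might look like the following sketch. The log_events name, the log event type, and the None sentinel used to end the stream are assumptions; log lines are assumed to contain no newlines.

```python
import asyncio
from typing import AsyncIterator, Optional


async def log_events(queue: "asyncio.Queue[Optional[str]]") -> AsyncIterator[str]:
    """Forward queued log lines as SSE messages until a None sentinel arrives."""
    while True:
        # Suspends until a line is available, so there is no busy-waiting.
        line = await queue.get()
        if line is None:  # hypothetical shutdown signal from the producer
            break
        yield f"event: log\ndata: {line}\n\n"
```

The application's logging side would put lines on the queue (for example from a custom logging handler), and the route would return StreamingResponse(log_events(queue), media_type="text/event-stream").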
This method is often used for monitoring systems and CI/CD pipeline log display.
Key configuration of production environment
Nginx reverse proxy
Most FastAPI applications sit behind a layer of Nginx. If Nginx buffers SSE or other streaming endpoints, your data is held back and the front end sees nothing until the buffer fills, which defeats the purpose of streaming. Buffering must therefore be disabled for the relevant paths, typically by setting proxy_buffering off; in the location block that matches them (for example /stream/).
Note that if your streaming endpoint path is not /stream/, remember to adjust the location block accordingly, or disable buffering through the X-Accel-Buffering: no response header instead (either one of the two is enough).
Uvicorn / Gunicorn parameters
It is recommended to use asynchronous workers to take advantage of the asynchronous features of FastAPI, and at the same time increase the timeout and concurrency limits:
- --workers 4: in production, generally set to the number of CPU cores
- --worker-class uvicorn.workers.UvicornWorker: selects the asynchronous worker when Gunicorn manages the processes
- --timeout-keep-alive 300: maximum idle time (in seconds) for keep-alive connections, to suit long-lived connections
- --limit-concurrency 100: caps the number of requests handled simultaneously to prevent resource exhaustion
Minimalist front-end integration solution
The most convenient way for the front end to receive SSE messages is the browser's native EventSource object, which requires no third-party libraries.
EventSource's built-in reconnection mechanism is very practical: after network jitter or a server restart, it reconnects automatically after an interval. The server can adjust that interval by sending the retry: field.
📝 Summary: FastAPI's StreamingResponse is the cornerstone of streaming responses. Combined with the standard SSE protocol, it lets you build smooth AI conversations, real-time log panels, and large file download services. The key is: asynchronous generators + appropriate MIME types + disabling buffering at every layer. Master these three points and you can run streaming applications reliably in production.

