Scrapyrt Practical Guide: Turn a Scrapy Crawler into an HTTP API Service
Have you ever written a solid Scrapy crawler and then had no quick way to expose it to a backend, a frontend, or a third-party caller? Or wanted a real-time interface that takes a URL and returns structured data, without building a Flask or Django service from scratch? This guide shows how to turn your crawler into an HTTP API in about 3 minutes.
🎯 What is Scrapyrt?
Scrapyrt (Scrapy Real-Time) is a lightweight HTTP server for Scrapy projects, developed under the scrapinghub organization (now Zyte, the company that maintains Scrapy). Its appeal is that it can expose an existing Scrapy project as an HTTP API with little or no change to the crawler code. It is particularly suitable for the following scenarios:
- User-triggered "instant on-demand crawling" (for example, the user enters a product link and crawls the price in real time)
- "Data collection component" in microservice architecture (called through HTTP, decoupled from other services)
- "Secure Data Interface" behind API Gateway (just expose an authenticated endpoint)
To sum up, Scrapyrt has four core advantages:
- Service-based architecture: crawlers become directly callable web services
- Stateless and easy to scale: each request is independent, so instances can be scaled horizontally
- JSON communication: a format that both frontends and backends work with easily
- Compatible with existing Scrapy projects: no need to modify the original settings.py, middlewares, and so on
🚀 3-minute quick start
Step 1: Prepare the environment and project
Make sure your Python version is ≥ 3.6, then install Scrapy and Scrapyrt with pip (pip install scrapy scrapyrt).
Then create a sample Scrapy project (scrapy startproject demo_spider) and generate a basic spider named example inside it with scrapy genspider.
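At this point the project follows the standard Scrapy layout: scrapy.cfg sits in the project root, next to the project package that holds settings.py and a spiders directory containing example.py.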
Step 2: Change one method so the crawler supports dynamic URLs
Open demo_spider/spiders/example.py and change the start_requests method to the following (leave the rest unchanged):
Explanation of key points:
- getattr(self, 'url', None) reads the url parameter passed by the caller.
- If a string is passed, it is wrapped in a list; if it is already a list, it is iterated directly.
- This allows multiple URLs to be passed in one request, which is very flexible.
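The normalization described in these key points can be sketched as a small helper. This is a minimal sketch rather than the guide's original code: normalize_urls is a hypothetical name, and it assumes the url value reaches the spider as an attribute (for example as a spider argument) in the form of None, a single string, or a list.

```python
def normalize_urls(value):
    """Turn the caller-supplied url value into a list of URLs.

    Hypothetical helper: accepts None, a single URL string, or an
    iterable of URL strings.
    """
    if value is None:
        return []
    if isinstance(value, str):
        return [value]  # a single URL becomes a one-element list
    return list(value)  # already a list (or tuple): iterate as-is


# Inside the spider, start_requests could then be written as:
#
#     def start_requests(self):
#         for url in normalize_urls(getattr(self, "url", None)):
#             yield scrapy.Request(url, callback=self.parse)
```

Keeping the normalization in a plain function makes it easy to check without starting Scrapy or Scrapyrt.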
Step 3: Start the service
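In the demo_spider root directory (the one that contains scrapy.cfg), start the service with the scrapyrt command, passing -i 0.0.0.0 -p 6023 to set the listen address and port.
With these options the service listens on 0.0.0.0:6023 (without them, Scrapyrt's documented defaults are localhost and port 9080). The console prints a line such as Starting Scrapyrt 1.2.0 on http://0.0.0.0:6023, which indicates that the service started successfully.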
Step 4: Verify the service
Open a browser or terminal and enter:
You will immediately receive a JSON response whose items field contains the data parsed by the crawler (page title, status code, and so on). That is all it takes: a real-time crawler API is up and running.
🔌 Core API in detail
Scrapyrt mainly provides two ways to request a crawl, which cover most usage scenarios.
1. /crawl.json: synchronous crawling (most commonly used)
Request method: GET
Applicable scenarios: crawling tasks with small data volume and short time (such as single page, lightweight list)
Example:
To pass multiple URLs, you can repeat the url parameter, separated by & (the actual behavior depends on how your crawler handles a list of url values), but the recommended way is to pass a JSON list through crawl_args:
(Here %7B%22url%22%3A%5B... is the URL-encoded form of {"url":["...","..."]}; you can also construct it directly with Python or Postman.)
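Constructing the encoded crawl_args value by hand is error-prone, so here is a minimal sketch of doing it in Python with the standard library. The function name build_crawl_url, the spider name example, and the target URLs are illustrative assumptions. Note also that Scrapyrt's documentation makes the top-level url parameter optional only when start_requests is enabled, so a real call may need that parameter as well.

```python
import json
from urllib.parse import urlencode


def build_crawl_url(base_url, spider_name, urls):
    """Build a GET URL for /crawl.json that carries a JSON list in crawl_args."""
    # Compact separators keep the JSON free of spaces, so the encoded value
    # begins with %7B%22url%22%3A%5B, matching the form shown in the text.
    crawl_args = json.dumps({"url": urls}, separators=(",", ":"))
    query = urlencode({"spider_name": spider_name, "crawl_args": crawl_args})
    return f"{base_url}/crawl.json?{query}"


url = build_crawl_url(
    "http://localhost:6023",
    "example",  # assumed spider name
    ["https://example.com/a", "https://example.com/b"],  # placeholder targets
)
print(url)
```

urlencode handles the percent-encoding, so the JSON never has to be escaped manually.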
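2. /crawl.json with a POST body: custom request
Request method: POST. Scrapyrt's documentation describes a single /crawl.json endpoint that accepts both GET and POST, so the POST variant is sent to /crawl.json with a JSON body rather than to a separate /crawl.json/request path.
Applicable scenarios: complex requests that need custom headers, cookies, request meta, and so on.
Request body example:
This approach gives you almost complete control over Scrapy's Request object, including cookies, method, and so on, which makes it very flexible.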
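As an illustration of such a request body, the sketch below builds (but does not send) a POST request with the standard library. It assumes the body has a spider_name field and a request object whose keys are passed to Scrapy's Request, and that the endpoint is /crawl.json on localhost:6023. The helper name build_post_request, the spider name, and all header, cookie, and meta values are placeholders.

```python
import json
import urllib.request


def build_post_request(endpoint, spider_name, request_args):
    """Prepare a POST request whose JSON body describes one Scrapy Request."""
    body = json.dumps({"spider_name": spider_name, "request": request_args})
    return urllib.request.Request(
        endpoint,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_post_request(
    "http://localhost:6023/crawl.json",  # assumed endpoint; adjust to your deployment
    "example",  # assumed spider name
    {
        "url": "https://example.com/item/1",
        "method": "GET",
        "headers": {"User-Agent": "demo-client/0.1"},
        "cookies": {"session": "placeholder"},
        "meta": {"source": "api"},
    },
)
# Passing req to urllib.request.urlopen would send it to the running service.
```

Keeping the request construction separate from sending makes the payload easy to inspect before it hits the crawler.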
🐍 Minimal Python client
If you don't want to hand-write a curl command every time, you can wrap the API in a small Python client for use in other projects:
You can also build asynchronous calls, retries on error, and similar logic on top of this client, which makes it well suited for integration into your backend services.
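A client of this kind might look like the following minimal sketch, which uses only the standard library. The class name ScrapyrtClient, the default base URL, and the opener parameter are assumptions made for illustration. It also assumes that a successful response carries "status": "ok" alongside the items field.

```python
import json
import urllib.parse
import urllib.request


class ScrapyrtClient:
    """Minimal synchronous client for a Scrapyrt service (illustrative sketch)."""

    def __init__(self, base_url="http://localhost:6023", timeout=30,
                 opener=urllib.request.urlopen):
        self.base_url = base_url.rstrip("/")
        self.timeout = timeout
        self._open = opener  # injectable so the client can be exercised offline

    def crawl(self, spider_name, url, **params):
        """Run one crawl through GET /crawl.json and return the scraped items."""
        query = urllib.parse.urlencode(
            {"spider_name": spider_name, "url": url, **params}
        )
        endpoint = f"{self.base_url}/crawl.json?{query}"
        with self._open(endpoint, timeout=self.timeout) as response:
            payload = json.load(response)
        # Assumption: a successful response carries "status": "ok".
        if payload.get("status") != "ok":
            raise RuntimeError(f"crawl failed: {payload.get('message', payload)}")
        return payload.get("items", [])
```

With a running service, ScrapyrtClient().crawl("example", "https://example.com", max_requests=5) would return the list of items; extra keyword arguments are forwarded as query parameters.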
🔒 Quick security hardening (required for production)
Exposing Scrapyrt to production unprotected? Absolutely not. Here are three practical security measures.
1. Nginx reverse proxy + HTTPS
Have Scrapyrt listen only on an internal address, and put Nginx in front of it to handle HTTPS, authentication, and rate limiting.
Simple configuration example (/etc/nginx/conf.d/scrapyrt.conf):
2. Add simple API Key verification
You can use Nginx's auth_request module together with an authentication service, or put a simple Flask proxy in front of Scrapyrt that checks the Authorization header. A complete implementation is beyond the scope of this guide, but the idea should be part of your design.
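The header check at the heart of such a proxy can be sketched in a few lines. This is only an illustration of the idea, not a complete authentication layer: the function name is_authorized and the "Bearer <key>" header format are assumptions, and the key would normally come from configuration or a secret store.

```python
import hmac


def is_authorized(authorization_header, expected_key):
    """Return True when the header has the form 'Bearer <key>' with the right key."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the key matched.
    return hmac.compare_digest(token.strip().encode(), expected_key.encode())
```

A proxy would call this on each incoming request and return 401 when it yields False, forwarding the request to Scrapyrt otherwise.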
3. Rate limiting to prevent abuse
Add a rate limit in Nginx so that a single client cannot flood the service:
With these three measures in place, your API service will be much more secure.
🐳 Docker one-click deployment
Containerized deployment keeps the environment consistent and makes horizontal scaling easier. Here is a simple Dockerfile you can use directly:
requirements.txt lists dependencies such as Scrapy, for example:
Then build and run:
You can now reach the crawler service in the container from the host at http://localhost:6023. To link it with other services in Compose, only a few lines of configuration are needed.
❌ Troubleshooting common problems
In practice you may run into the following pitfalls. The fixes are below.
1. The port is occupied
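One quick way to confirm that a port conflict is the cause is to probe the port before starting the service. The sketch below is a standard-library check, not a Scrapyrt feature; port_in_use is a hypothetical helper, and 6023 is simply the port used throughout this guide.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1)
        # connect_ex returns 0 on success instead of raising an exception.
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(6023):
        print("Port 6023 is busy: stop the other process or choose another port.")
    else:
        print("Port 6023 is free.")
```

If the port is busy, either stop the process that holds it or start Scrapyrt on a different port.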
2. Crawler timeout
Some websites respond slowly, or the crawler logic is complex, and the default timeout may not be enough. This can be adjusted via both startup parameters and Scrapy settings:
Also add a download timeout in the project's settings.py:
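A settings.py fragment for this might look like the sketch below. DOWNLOAD_TIMEOUT and RETRY_TIMES are standard Scrapy settings; the values are illustrative assumptions and should be tuned to the sites you crawl.

```python
# settings.py (fragment)

# Abandon a single download after 30 seconds (Scrapy's default is 180).
DOWNLOAD_TIMEOUT = 30

# Retry a failed request at most twice, which is also Scrapy's default.
RETRY_TIMES = 2
```

A shorter download timeout keeps one slow page from holding the whole API response open.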
3. The returned data is too large
If the crawler fetches too many pages, it may exhaust memory or take too long to respond. The total number of requests can be capped with the max_requests API parameter:
Or limit concurrency and depth in Scrapy settings:
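As a rough sketch, the relevant settings.py entries could look like this. All four names are standard Scrapy settings; the numbers are illustrative assumptions, chosen to be lower than Scrapy's defaults.

```python
# settings.py (fragment)

# Total concurrent requests across the crawl (Scrapy's default is 16).
CONCURRENT_REQUESTS = 8

# Concurrent requests to any single domain (Scrapy's default is 8).
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Do not follow links more than 2 levels deep (0, the default, means no limit).
DEPTH_LIMIT = 2

# Close the spider once 50 responses have been crawled.
CLOSESPIDER_PAGECOUNT = 50
```

These limits apply to every crawl the project runs, so they act as a safety net even when a caller forgets to cap an individual request.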
💡 Core Best Practices
Finally, here are five hard-won suggestions for running Scrapyrt reliably:
- Containerized deployment: Use Docker to isolate the environment, which makes one-click deployment and rapid scaling easier.
- Nginx reverse proxy: Handle HTTPS, IP allowlisting, and rate limiting in one place so that security comes first.
- Limit resources: Set max_requests on each API call where possible, and also pass --max-concurrent-requests when Scrapyrt starts to control global concurrency.
- Log management: With Docker, use the json-file or syslog logging driver; on a bare-metal deployment, remember to configure logrotate so that logs do not fill the disk.
- Health check: Add a health check to the Dockerfile or Compose file, and let your orchestration tool restart unhealthy services automatically to improve availability.

