Scrapyrt Practical Guide: Turn a Crawler into an HTTP API Service


Have you ever been in this situation: you have written a solid Scrapy crawler, but you don't know how to quickly hand it to the backend, the frontend, or third-party callers? Or you want a real-time interface that takes a URL and returns structured data, without building a Flask/Django service from scratch? This guide shows how to turn your crawler into an HTTP API in about 3 minutes.


🎯 What is Scrapyrt?

Scrapyrt (Scrapy Real-Time) is a lightweight HTTP server for Scrapy, maintained by Scrapinghub (now Zyte), the company behind Scrapy. Its appeal: it exposes an entire Scrapy project as an HTTP API that returns JSON, with little or no change to the crawler code. It is particularly suitable for the following scenarios:

  • User-triggered "instant on-demand crawling" (for example, the user enters a product link and crawls the price in real time)
  • "Data collection component" in microservice architecture (called through HTTP, decoupled from other services)
  • "Secure Data Interface" behind API Gateway (just expose an authenticated endpoint)

To sum up, Scrapyrt has four core advantages:

  1. Service-based architecture: crawlers become directly callable web services
  2. Stateless: each request is independent, which makes horizontal scaling straightforward
  3. JSON communication: a format that both frontend and backend handle easily
  4. Compatible with existing Scrapy projects: no need to modify the existing settings.py, middlewares, and so on

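🚀 3-minute quick start

Step 1: Prepare the environment and project

Make sure your Python version is supported by the Scrapy release you install (Scrapy 2.11, used later in this guide, requires Python 3.8 or newer), then install Scrapy and Scrapyrt: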

pip install scrapy scrapyrt

Then quickly create a sample Scrapy project and generate a basic crawler:

scrapy startproject demo_spider
cd demo_spider
scrapy genspider example example.com

At this point your directory structure probably looks like this:

demo_spider/
├── demo_spider/
│   ├── spiders/
│   │   └── example.py
│   └── settings.py
└── scrapy.cfg

Step 2: Modify the spider to support dynamic URLs

Open demo_spider/spiders/example.py and change the start_requests method to the following (leave the rest unchanged):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'

    # Core change: build requests from the url argument passed through the API
    def start_requests(self):
        # Read the url argument from the spider attributes (a string or a list)
        urls = getattr(self, 'url', None)
        if urls:
            # Normalize to a list so a single URL and multiple URLs are handled alike
            urls = [urls] if isinstance(urls, str) else urls
            for url in urls:
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {
            'url': response.url,
            # default='' avoids an AttributeError when the page has no <title>
            'title': response.css('title::text').get(default='').strip(),
            'status': response.status
        }

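Explanation of key points:

  • getattr(self, 'url', None) reads the url argument supplied by the caller as a spider attribute.
  • If a string is passed, it is wrapped in a list; if it is already a list, it is iterated directly.
  • This allows multiple URLs to be passed in one request.
  • Note that Scrapyrt skips start_requests by default and schedules the url query parameter itself, using parse as the callback. The method above runs only when the call includes start_requests=true, with url supplied through crawl_args.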
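
The string-or-list normalization described in these points can be isolated into a small helper and checked without running a crawl. This is a minimal sketch; normalize_urls is a hypothetical name used only for illustration and is not part of Scrapy or Scrapyrt:

```python
def normalize_urls(value):
    """Return a list of URLs from a single string, a list/tuple, or None."""
    if not value:
        return []          # nothing was passed in
    if isinstance(value, str):
        return [value]     # a single URL becomes a one-element list
    return list(value)     # any other iterable is copied into a list


print(normalize_urls("https://httpbin.org/get"))
print(normalize_urls(["https://httpbin.org/get", "https://httpbin.org/headers"]))
print(normalize_urls(None))
```

Inside the spider, start_requests could loop over normalize_urls(getattr(self, 'url', None)) instead of doing the check inline.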

Step 3: Start the service

In the demo_spider root directory (the one that contains scrapy.cfg), run:

scrapyrt -p 6023

Scrapyrt listens on localhost:9080 by default; the -p option makes it listen on port 6023, the port used throughout this guide. Once the console log reports that the site has started on that port, the service is up.

Step 4: Verify the service

Open a terminal and run the following (or paste the URL into a browser):

curl "http://localhost:6023/crawl.json?spider_name=example&url=https://httpbin.org/get"

You will receive a JSON response whose items field contains the data parsed by the crawler (page URL, title, status code). That is all it takes: a real-time crawler API is up and running.
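
On the calling side, it helps to check the response before reading items. The sketch below assumes the decoded response carries a top-level status field that equals "ok" on success and a message field on failure; that shape is an assumption to verify against the Scrapyrt version you run. extract_items and both sample payloads are hypothetical:

```python
def extract_items(payload):
    """Return the scraped items from a decoded Scrapyrt response, or raise."""
    if payload.get("status") != "ok":
        # Assumed error shape: {"status": "error", "message": "..."}
        raise RuntimeError(payload.get("message", "crawl failed"))
    return payload.get("items", [])


# Hypothetical payloads, shaped like the assumed success and error responses.
sample_ok = {
    "status": "ok",
    "spider_name": "example",
    "items": [{"url": "https://httpbin.org/get", "title": "", "status": 200}],
}
sample_error = {"status": "error", "code": 400, "message": "bad request"}

print(extract_items(sample_ok))
try:
    extract_items(sample_error)
except RuntimeError as exc:
    print("crawl failed:", exc)
```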


🔌 Core API in detail

Scrapyrt's crawling API is centered on /crawl.json, which can be called in two ways that together cover most usage scenarios.

1. /crawl.json (GET): synchronous crawling (most commonly used)

Request method: GET
Applicable scenarios: crawling tasks with a small data volume and short run time (such as a single page or a lightweight list)

Parameter | Function | Required
spider_name | Spider name (the spider's name attribute) | Yes
url | URL to crawl | Yes, unless the spider's own start requests are enabled
max_requests | Maximum number of requests; limits the depth or breadth of the crawl | No
crawl_args | Extra arguments passed to the spider, as a URL-encoded JSON string | No

Example:

curl "http://localhost:6023/crawl.json?spider_name=example&url=https://httpbin.org/get&max_requests=3"

To pass multiple URLs, send a JSON list through crawl_args and add start_requests=true so that the spider's own start_requests method runs and reads the list:

curl "http://localhost:6023/crawl.json?spider_name=example&start_requests=true&crawl_args=%7B%22url%22%3A%5B%22https://httpbin.org/get%22%2C%22https://httpbin.org/headers%22%5D%7D"

(Here %7B%22url%22%3A%5B... is the URL-encoded form of {"url":["...","..."]}; you can also build it with Python or Postman.)
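
Building that query string by hand is error-prone, so here is a minimal sketch that lets the standard library do the encoding. build_crawl_url is a hypothetical helper; start_requests=true is included on the assumption that the spider's own start_requests method should read the url list:

```python
import json
from urllib.parse import urlencode


def build_crawl_url(base_url, spider_name, urls):
    """Build a /crawl.json URL that passes a list of URLs through crawl_args."""
    crawl_args = json.dumps({"url": urls})        # crawl_args must be a JSON string
    query = urlencode({
        "spider_name": spider_name,
        "start_requests": "true",                 # run the spider's start_requests
        "crawl_args": crawl_args,                 # urlencode percent-encodes the JSON
    })
    return f"{base_url.rstrip('/')}/crawl.json?{query}"


print(build_crawl_url(
    "http://localhost:6023",
    "example",
    ["https://httpbin.org/get", "https://httpbin.org/headers"],
))
```

urlencode escapes more characters than the hand-written example (for instance ":" and "/"), which is equally valid in a query string.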

2. /crawl.json (POST): custom requests

Request method: POST
Applicable scenarios: complex requests that need custom headers, cookies, request meta, and so on. The documented endpoint is the same /crawl.json path; the request is described in a JSON body instead of query parameters.

Request body example:

curl -X POST http://localhost:6023/crawl.json \
  -H "Content-Type: application/json" \
  -d '{
    "spider_name": "example",
    "request": {
      "url": "https://httpbin.org/headers",
      "headers": {
        "User-Agent": "Custom Spider 1.0"
      },
      "meta": {
        "custom_key": "custom_value"
      }
    }
  }'

This approach gives you almost complete control over Scrapy's Request object, including cookies, method, and so on, which makes it very flexible.
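
The same JSON body can be assembled in Python before it is sent. The sketch below only builds and prints the payload; build_request_payload is a hypothetical helper, and the keys inside request are assumed to map onto scrapy.Request arguments, as in the curl example:

```python
import json


def build_request_payload(spider_name, url, method="GET",
                          headers=None, cookies=None, meta=None):
    """Assemble a POST body: spider_name plus a nested request description."""
    request = {"url": url, "method": method}
    # Only include optional parts that were actually provided.
    if headers:
        request["headers"] = headers
    if cookies:
        request["cookies"] = cookies
    if meta:
        request["meta"] = meta
    return {"spider_name": spider_name, "request": request}


payload = build_request_payload(
    "example",
    "https://httpbin.org/headers",
    headers={"User-Agent": "Custom Spider 1.0"},
    meta={"custom_key": "custom_value"},
)
print(json.dumps(payload, indent=2))
```

With the requests library, the payload could then be sent by passing it as the json= argument of requests.post.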


🐍 Minimal Python client

If you don't want to write curl commands by hand every time, you can wrap the API in a small Python client and call it from other projects:

import requests

class ScrapyrtClient:
    def __init__(self, base_url="http://localhost:6023"):
        self.base_url = base_url.rstrip('/')

    def crawl(self, spider_name, url, **kwargs):
        params = {"spider_name": spider_name, "url": url}
        params.update(kwargs)  # supports max_requests, crawl_args, and so on
        resp = requests.get(f"{self.base_url}/crawl.json", params=params, timeout=300)
        resp.raise_for_status()
        return resp.json()

# Usage example
client = ScrapyrtClient()
result = client.crawl("example", "https://httpbin.org/get")
print(result['items'][0]['title'])

You can also build asynchronous calls, error retries, and similar logic on top of this client, which makes it well suited for integration into your backend services.
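
As one example of the retry logic mentioned here, the sketch below wraps any callable in a retry loop with linear backoff. call_with_retries and the flaky demo function are hypothetical; in practice the callable would be something like lambda: client.crawl("example", url), with exceptions narrowed to requests.RequestException:

```python
import time


def call_with_retries(func, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call func() up to `attempts` times, sleeping longer after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise                      # out of attempts: surface the last error
            time.sleep(delay * attempt)    # linear backoff: delay, 2*delay, ...


# Demo with a function that fails twice before succeeding.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

print(call_with_retries(flaky, attempts=3, delay=0))
```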


🔒 Quick security hardening (required for production)

Going straight to production with no protection? Absolutely not. Here are three of the most practical security measures.

1. Nginx reverse proxy + HTTPS

Have Scrapyrt listen only on an internal address, and put Nginx in front of it to handle HTTPS, authentication, and rate limiting.

Simple configuration example (/etc/nginx/conf.d/scrapyrt.conf):

server {
    listen 443 ssl;
    server_name your-domain.com;

    ssl_certificate /path/to/fullchain.pem;
    ssl_certificate_key /path/to/privkey.pem;

    location / {
        # Only allow trusted IPs (for example the office network or the API gateway)
        allow 123.45.67.89;
        deny all;

        proxy_pass http://127.0.0.1:6023;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

2. Add simple API Key verification

This can be done with Nginx's auth_request module in combination with an authentication service, or by putting a small Flask proxy in front of Scrapyrt that verifies the Authorization header. A complete implementation is beyond the scope of this guide, but the idea is worth keeping in mind.
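
The core of such a check is small and independent of the web framework. The sketch below assumes the key is sent as "Authorization: Bearer <key>"; that header format, the is_authorized name, and the demo key are all illustrative assumptions:

```python
import hmac


def is_authorized(headers, expected_key):
    """Check an 'Authorization: Bearer <key>' header against the expected key."""
    value = headers.get("Authorization", "")
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest compares in constant time, which avoids timing leaks.
    return hmac.compare_digest(token.strip().encode(), expected_key.encode())


API_KEY = "demo-key-change-me"  # hypothetical; load from configuration in practice

print(is_authorized({"Authorization": "Bearer demo-key-change-me"}, API_KEY))
print(is_authorized({"Authorization": "Bearer wrong"}, API_KEY))
print(is_authorized({}, API_KEY))
```

A proxy in front of Scrapyrt would call this for every incoming request and respond with 401 when it returns False.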

3. Rate limiting to prevent abuse

Add a rate limit in Nginx so that a single client cannot flood the service:

limit_req_zone $binary_remote_addr zone=scrapyrt:10m rate=5r/s;

server {
    ...
    location / {
        limit_req zone=scrapyrt burst=10 nodelay;
        proxy_pass http://127.0.0.1:6023;
    }
}
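
To get a feel for what rate=5r/s with burst=10 means, here is a simplified leaky-bucket model in Python. It is a sketch of the general technique under the assumption that each request adds one unit of "excess" that drains at the configured rate; it is not Nginx's actual implementation, and LeakyBucket is a hypothetical class:

```python
class LeakyBucket:
    """Simplified leaky-bucket limiter: `rate` requests/second, `burst` extra."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.excess = 0.0    # requests currently above the steady rate
        self.last = None     # timestamp of the last accepted request

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) passes."""
        if self.last is None:
            self.last = now
            return True
        elapsed = now - self.last
        # Drain the bucket for the elapsed time, then add this request.
        excess = max(0.0, self.excess - self.rate * elapsed + 1.0)
        if excess > self.burst:
            return False     # rejected; state is left unchanged
        self.excess = excess
        self.last = now
        return True


bucket = LeakyBucket(rate=5, burst=10)
accepted = sum(bucket.allow(0.0) for _ in range(15))
print("accepted out of 15 simultaneous requests:", accepted)
print("request one second later allowed:", bucket.allow(1.0))
```

In this model a client can send a short burst at once, after which it is held to the steady rate until the bucket drains.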

With these three measures in place, your API service will be much more secure.


🐳 Docker one-click deployment

Containerized deployment keeps the environment consistent and makes horizontal scaling easier. Here is a simple Dockerfile that you can use directly:

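FROM python:3.9-slim

WORKDIR /app

# Install build dependencies (needed by libraries such as lxml)
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc libxml2-dev libxslt1-dev && \
    rm -rf /var/lib/apt/lists/*

# Copy the dependency file and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Install Scrapyrt
RUN pip install scrapyrt

# Copy the whole project (scrapy.cfg must end up in /app)
COPY . .

EXPOSE 6023

# Health check: every 30 seconds, verify that the port accepts connections
# (the slim image ships without curl, so Python is used instead)
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import socket; socket.create_connection(('localhost', 6023), timeout=5)" || exit 1

# -i sets the listen address; 0.0.0.0 makes the service reachable from outside the container
CMD ["scrapyrt", "-p", "6023", "-i", "0.0.0.0"]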

requirements.txt lists dependencies such as Scrapy, for example:

Scrapy>=2.11

Then build and run:

docker build -t scrapyrt-demo .
docker run -d -p 6023:6023 --name scrapyrt-container scrapyrt-demo

You can now reach the crawler service in the container from the host at http://localhost:6023. To connect it with other services in Compose, only a few lines of configuration are needed.
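
Before sending crawl requests from a script, it can be useful to confirm that the container's port is actually accepting connections. A minimal sketch using only the standard library; port_is_open is a hypothetical helper, and the host and port are the ones assumed in this guide:

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("Scrapyrt reachable:", port_is_open("localhost", 6023))
```

This only proves that something is listening on the port; a real crawl request is still the definitive check.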


❌ Troubleshooting high-frequency problems

In real-world use you may run into the following pitfalls. Don't panic; the solutions are below.

1. The port is already in use

# See which process is using the port
lsof -i :6023
# Or start on a different port
scrapyrt -p 6024

2. Crawler timeout

Some websites respond slowly, or the crawler logic is complex, so the default limits may not be enough. There are two levels to adjust: Scrapyrt's overall crawl time limit and Scrapy's per-request download timeout.

# Raise the overall crawl time limit in seconds (Scrapyrt's TIMEOUT_LIMIT setting)
scrapyrt -p 6023 -s TIMEOUT_LIMIT=600

Then, in the project's settings.py, set the download timeout:

DOWNLOAD_TIMEOUT = 120   # wait at most 120 seconds for a single request

3. The returned data is too large

If the crawler fetches too many pages, it may exhaust memory or take too long to respond. You can cap the total number of requests with an API parameter:

curl "http://localhost:6023/crawl.json?spider_name=example&url=xxx&max_requests=5"

Or limit concurrency and depth in Scrapy settings:

CONCURRENT_REQUESTS = 8
DEPTH_LIMIT = 2
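
Beyond concurrency and depth, Scrapy has further settings that bound how much a single crawl can fetch. The excerpt below is a sketch for settings.py; the setting names are standard Scrapy settings, but every value is an illustrative assumption to tune per site:

```python
# settings.py (excerpt): illustrative values, not recommendations
CONCURRENT_REQUESTS = 8               # parallel requests across the crawl
DEPTH_LIMIT = 2                       # do not follow links deeper than 2 levels
CLOSESPIDER_PAGECOUNT = 50            # stop the spider after this many responses
CLOSESPIDER_ITEMCOUNT = 200           # stop the spider after this many items
DOWNLOAD_MAXSIZE = 10 * 1024 * 1024   # skip responses larger than 10 MiB
```

The CLOSESPIDER_* limits act as a safety net on the project side, independent of whether the caller remembers to send max_requests.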

💡 Core Best Practices

Finally, here are five tried-and-true suggestions to help you run Scrapyrt reliably:

  1. Containerized deployment: use Docker to isolate the environment, which simplifies deployment and scaling out.
  2. Nginx reverse proxy: handle HTTPS, the IP whitelist, and rate limiting in one place, and put security first.
  3. Limit resources: set max_requests on each API call, and cap concurrency with Scrapy's CONCURRENT_REQUESTS setting in settings.py.
  4. Log management: with Docker, use the json-file or syslog logging driver; on a bare-metal deployment, configure logrotate so that logs do not fill the disk.
  5. Health check: add a health check to the Dockerfile or Compose file, and let the orchestration tool restart unhealthy services automatically to improve availability.
