Create your first Scrapy project - a complete guide to project structure, configuration and initialization

After installing Scrapy, the first hurdle for most people comes right after running scrapy startproject: faced with a pile of generated files, they don't know where to start. This article clarifies the core project structure and the configuration changes you actually need, then walks through generating a sample crawler that runs.

Environment preparation

Basic requirements

  • Python version: 3.8 and above (3.9/3.10 is recommended for optimal compatibility and efficiency)
  • Operating system: Windows, macOS, or Linux

Installation and verification

It is strongly recommended to install Scrapy in a virtual environment to facilitate dependency isolation:

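# Create and activate a virtual environment
# (on Windows, activate with scrapy_env\Scripts\activate)
python3 -m venv scrapy_env
source scrapy_env/bin/activate

# Install Scrapy
pip install scrapy

# Check that the installation succeeded
scrapy version
# Example output: Scrapy 2.11.0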

Create project

Scrapy provides standardized project generation commands, eliminating the need to manually build a folder structure:

# 1. Generate the project (pick a meaningful name, e.g. news_crawler)
scrapy startproject news_crawler

# 2. Enter the project root directory
cd news_crawler

# 3. Inspect the generated structure
ls -la       # Linux/macOS
dir          # Windows

Detailed project structure

The generated project adopts the Python package architecture. The core files and directories are as follows:

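news_crawler/
├── scrapy.cfg                 # Marks the project root, points to the settings module, holds Scrapyd deploy config (rarely edited during development)
└── news_crawler/              # The project's Python package
    ├── __init__.py            # Package marker file (can stay empty)
    ├── items.py               # Structured data field definitions (similar to ORM models)
    ├── middlewares.py         # Custom request/response processing (middlewares)
    ├── pipelines.py           # Data cleaning, deduplication, and storage (pipelines)
    ├── settings.py            # Project-wide configuration (the key file!)
    └── spiders/               # Directory that holds all spiders
        └── __init__.py        # Subpackage marker (spider files are added later with genspider)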
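
To confirm that scrapy startproject produced the core files shown above, a small script can check for them. This is a minimal sketch using only the standard library; the helper name missing_project_files and the hard-coded file list are assumptions made for illustration, not part of Scrapy.

```python
from pathlib import Path

# Core files expected after `scrapy startproject news_crawler`,
# relative to the outer project directory.
EXPECTED_FILES = [
    "scrapy.cfg",
    "{pkg}/__init__.py",
    "{pkg}/items.py",
    "{pkg}/middlewares.py",
    "{pkg}/pipelines.py",
    "{pkg}/settings.py",
    "{pkg}/spiders/__init__.py",
]


def missing_project_files(root, package="news_crawler"):
    """Return the expected files that do not exist under `root`."""
    root = Path(root)
    expected = [name.format(pkg=package) for name in EXPECTED_FILES]
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Run from the project root (the directory that contains scrapy.cfg).
    missing = missing_project_files(".")
    print("Project layout looks complete." if not missing else f"Missing: {missing}")
```

An empty result means every core file is in place; otherwise the list names what is missing.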

Quick overview of core file priorities

In the early stages of development, focus on the following files:

File/Directory    Initial Development Priority    Main Purpose
settings.py       ⭐⭐⭐⭐⭐                      Global configuration (UA, concurrency, delay, etc.)
spiders/          ⭐⭐⭐⭐⭐                      The actual crawling logic
items.py          ⭐⭐⭐⭐                        Structured data definitions for scraped content
pipelines.py      ⭐⭐⭐⭐                        Data persistence, cleaning, etc.
middlewares.py    ⭐⭐⭐                          Advanced features: proxies, UA pools, and more
scrapy.cfg        -                               Needs changing only when deploying to Scrapyd

Detailed explanation of configuration files

Most of the options in the default configuration can be left unchanged for the time being. First, change the following ones to suit your own projects.

1. Basic Identity and User‑Agent

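# settings.py
BOT_NAME = 'news_crawler'   # Project name; usually left unchanged

# Must change: set a User-Agent (many sites block Scrapy's default UA outright)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'

# For learning/testing you may turn off ROBOTSTXT_OBEY
# (in production, respect robots.txt or reach an agreement with the site)
ROBOTSTXT_OBEY = False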
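
To see what ROBOTSTXT_OBEY governs, the standard library's urllib.robotparser can evaluate robots.txt rules by hand. The sketch below parses a made-up robots.txt (the rules and URLs are hypothetical, not those of any real site) and asks whether a URL may be fetched. With ROBOTSTXT_OBEY = True, Scrapy performs a comparable check and skips disallowed URLs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `url` may be fetched by `user_agent` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A path outside /private/ is permitted; a path inside it is not.
print(is_allowed(SAMPLE_ROBOTS_TXT, "news_crawler", "https://example.com/news/1"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "news_crawler", "https://example.com/private/x"))
```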

2. Anti-crawling and rate limiting

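# settings.py
# Concurrent requests to a single domain (2-4 is polite; 8 is fine while learning)
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Base delay between two requests to the same site, in seconds
DOWNLOAD_DELAY = 1

# Randomize the actual delay to between 0.5x and 1.5x of DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY = True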
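
The effect of RANDOMIZE_DOWNLOAD_DELAY can be sketched in plain Python. Per the Scrapy documentation, the wait between requests to the same site becomes a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. The function name below is hypothetical, and the code only imitates that behavior; it is not Scrapy's implementation.

```python
import random


def randomized_delay(download_delay, rng=random):
    """Pick a wait time between 0.5x and 1.5x of `download_delay` (seconds)."""
    return rng.uniform(0.5 * download_delay, 1.5 * download_delay)


# With DOWNLOAD_DELAY = 1, every sampled delay falls between 0.5 s and 1.5 s.
samples = [randomized_delay(1) for _ in range(5)]
print([round(s, 2) for s in samples])
```

The jitter makes the request rhythm less regular than a fixed delay, which is why it is usually left enabled.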

3. Data export encoding (essential for learning and testing)

# settings.py
FEED_EXPORT_ENCODING = 'utf-8'
# With this set, exported JSON/CSV files keep non-ASCII text readable on both Windows and macOS
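
The difference this setting makes is easiest to see with the standard json module. As a rough analogy (not Scrapy's exporter code): when no export encoding is set, Scrapy's JSON output writes non-ASCII characters as \uXXXX escape sequences, much like json.dumps with its default ensure_ascii=True. With 'utf-8', the characters are written as-is. The sample item below is made up.

```python
import json

item = {"title": "新闻标题", "url": "https://example.com/a"}  # sample data, made up

escaped = json.dumps(item)                       # non-ASCII becomes \uXXXX escapes
readable = json.dumps(item, ensure_ascii=False)  # characters kept as-is

print(escaped)
print(readable)
```

Both strings decode to the same data; only the on-disk readability differs.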

Create the first crawler

Use the genspider command to quickly generate a spider template in the spiders/ directory:

# Command format: scrapy genspider <spider name> <allowed domain>
scrapy genspider techcrunch techcrunch.com

Running it creates spiders/techcrunch.py automatically. With a few modifications, it can crawl article titles and links:

# spiders/techcrunch.py
import scrapy

class TechcrunchSpider(scrapy.Spider):
    name = 'techcrunch'                              # Unique identifier of the spider
    allowed_domains = ['techcrunch.com']             # Links outside this domain are filtered out
    start_urls = ['https://techcrunch.com/category/startups/']

    def parse(self, response):
        """
        Parse the response: extract article titles and links, then follow pagination.
        """
        for article in response.css('article.post-block'):
            # default='' avoids an AttributeError when the title link is missing
            yield {
                'title': article.css('a.post-block__title__link::text').get(default='').strip(),
                'url': article.css('a.post-block__title__link::attr(href)').get(),
            }

        # Pagination: look for the next-page link and keep following it
        next_page = response.css('div.wp-block-query-pagination a.next::attr(href)').get()
        if next_page:
            # response.follow resolves relative URLs against the current page;
            # requests outside allowed_domains are dropped by Scrapy's offsite filtering
            yield response.follow(next_page, callback=self.parse)
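
How response.follow turns a relative next-page link into an absolute URL can be approximated with urllib.parse.urljoin. The sketch below is a standard-library analogy rather than Scrapy's own code, and the page paths are hypothetical examples.

```python
from urllib.parse import urljoin

current_page = "https://techcrunch.com/category/startups/"

# A relative href is resolved against the URL of the page it was found on.
relative_result = urljoin(current_page, "page/2/")

# An absolute href is returned unchanged.
absolute_result = urljoin(current_page, "https://techcrunch.com/category/startups/page/3/")

print(relative_result)
print(absolute_result)
```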

Verification and testing

1. Run the crawler and view the results in real time

Execute in the project root directory:

scrapy crawl techcrunch

2. Export data to local file

# Export as JSONL (recommended: one record per line is easy to post-process)
scrapy crawl techcrunch -o techcrunch_startups.jsonl

# Other formats are supported too: JSON, CSV, XML, etc.
scrapy crawl techcrunch -o techcrunch_startups.csv
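
Once the export exists, the JSONL file can be read back one record per line. This is a minimal sketch, assuming the file name from the command above and the title/url fields yielded by the spider; the helper name load_jsonl is hypothetical.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file and return a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    export = Path("techcrunch_startups.jsonl")
    if export.exists():
        # Show the first few records as a quick sanity check.
        for record in load_jsonl(export)[:5]:
            print(record.get("title"), "->", record.get("url"))
```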

3. Interactively debug selectors with Scrapy Shell

If you are not sure whether the CSS selector is written correctly, you can use Shell to test it in real time:

scrapy shell 'https://techcrunch.com/category/startups/'

Once inside the shell, try an extraction:

# Extract the title of the first article
response.css('a.post-block__title__link::text').get().strip()

# Exit when you are done debugging
exit()

Troubleshooting common problems

1. scrapy: command not found

  • Cause: Python's Scripts directory has not been added to the system PATH
  • Solution:
    • Virtual environment: activate the corresponding environment before running scrapy
    • Global installation:
      • Windows: add C:\Users\<your-username>\AppData\Local\Programs\Python\Python3x\Scripts to PATH
      • macOS/Linux: make sure which python3 and which pip3 point to the Python environment you are using
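
Both situations can be checked from Python itself. The sketch below uses only the standard library: shutil.which reports where (or whether) a scrapy executable is found on PATH, and comparing sys.prefix with sys.base_prefix indicates whether a virtual environment is active. The function name diagnose_scrapy is hypothetical.

```python
import shutil
import sys


def diagnose_scrapy():
    """Collect basic facts about the current interpreter and the scrapy command."""
    return {
        "python": sys.executable,
        # In a venv, sys.prefix points to the environment and differs from base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # Full path of the scrapy executable, or None if it is not on PATH.
        "scrapy_on_path": shutil.which("scrapy"),
    }


for key, value in diagnose_scrapy().items():
    print(f"{key}: {value}")
```

If scrapy_on_path is None while in_virtualenv is False, activating the project's virtual environment is usually the first thing to try.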

2. The crawler exits immediately after starting and no data is output.

  • Cause: There is a high probability that the CSS/XPath selector is written incorrectly and cannot match any content.
  • Solution: Use Scrapy Shell to verify the selector item by item and confirm that the content can be extracted.

3. Requests return 403 Forbidden

  • Cause: the default UA is blocked, cookies are missing, or the request frequency is too high
  • Solution: change USER_AGENT, increase DOWNLOAD_DELAY as appropriate, and add cookies if necessary

Best practice recommendations

  1. Virtual environment isolation: create a separate virtual environment for each project to avoid dependency conflicts
  2. Tune selectors first, then write code: test every selector in Scrapy Shell before putting it in a spider
  3. Structured data: where possible, define the data model in items.py instead of yielding plain dictionaries; this makes later maintenance and extension easier
  4. Keep logs: add LOG_FILE = 'logs/scrapy.log' to settings.py (make sure the logs/ directory exists) so that issues can be traced back later
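
For point 3, one lightweight option is a dataclass: Scrapy accepts dataclass instances as items alongside scrapy.Item subclasses and plain dicts. The sketch below is an assumption about how items.py might look for this tutorial's spider; the class name ArticleItem and the sample values are illustrative.

```python
# items.py (sketch)
from dataclasses import dataclass, asdict


@dataclass
class ArticleItem:
    """One article, with the two fields the spider extracts."""
    title: str
    url: str


# In the spider, `yield ArticleItem(title=..., url=...)` would replace the dict.
item = ArticleItem(title="Example headline", url="https://techcrunch.com/example/")
print(asdict(item))
```

A misspelled field name then fails immediately at construction time instead of silently producing a differently shaped record.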

💡 Key points: when you are just getting started, focus on two places, settings.py and spiders/. The other files (middlewares, pipelines) can be explored when you actually need them. Don't overwhelm yourself at the beginning.