Create your first Scrapy project - a complete guide to project structure, configuration and initialization
📂 Stage: Stage 1 - Getting started (core framework) 🔗 Related chapters: Scrapy's Five Core Components · Spider in Practice
After installing Scrapy, the first hurdle for most people comes right after typing scrapy startproject: faced with a pile of generated files, they don't know where to start. This article clarifies the core project structure and the configuration changes you actually need, then walks through generating a sample crawler that runs.
Environment preparation
Basic requirements
- Python version: 3.8 or later (3.9/3.10 recommended; newer Scrapy releases may require a newer Python, so check the release notes for the version you install)
- Operating system: Windows, macOS, or Linux
Installation and verification
It is strongly recommended to install Scrapy in a virtual environment to facilitate dependency isolation:
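The usual sequence is to create an environment with python -m venv, activate it, and run pip install scrapy inside it. As a quick sanity check afterwards, a small script along these lines can confirm the basics. The function name check_environment and the (3, 8) floor (taken from the requirement above) are illustrative choices, not part of Scrapy.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 8)):
    """Report whether this interpreter looks ready for Scrapy."""
    return {
        # Interpreter meets the minimum version assumed by this guide
        "python_ok": sys.version_info >= min_python,
        # Inside a venv, sys.prefix differs from the base interpreter's prefix
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # Scrapy can be found from this environment (without importing it)
        "scrapy_installed": importlib.util.find_spec("scrapy") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

If scrapy_installed is reported as no while in_virtualenv is yes, Scrapy was most likely installed into a different interpreter than the one running the script.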
Create project
Scrapy provides standardized project generation commands, eliminating the need to manually build a folder structure:
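On the command line this is scrapy startproject followed by the project name. The sketch below performs the same step from Python, assuming Scrapy is installed in the current interpreter. The helper names startproject_command and create_project are hypothetical, myproject is a placeholder, and the name check is a conservative approximation of what Scrapy accepts.

```python
import re
import subprocess
import sys

# Conservative rule: start with a letter, then letters, digits or underscores
PROJECT_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def startproject_command(name):
    """Build the argument list for `scrapy startproject <name>`."""
    if not PROJECT_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid project name: {name!r}")
    # `python -m scrapy` runs the same command-line tool as the `scrapy` script
    return [sys.executable, "-m", "scrapy", "startproject", name]


def create_project(name, parent_dir="."):
    """Generate the project under parent_dir; raises CalledProcessError on failure."""
    subprocess.run(startproject_command(name), cwd=parent_dir, check=True)


if __name__ == "__main__":
    create_project("myproject")
```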
Detailed project structure
The generated project is laid out as a Python package. The core files and directories are as follows:
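For a project created with scrapy startproject myproject (the name is a placeholder), the default template puts scrapy.cfg in the top-level folder and the code in an inner package of the same name. The sketch below encodes that layout as a list of paths and reports any that are missing. The helper names are hypothetical.

```python
from pathlib import Path


def expected_project_files(name):
    """Relative paths created by the default startproject template."""
    package = Path(name)
    return [
        Path("scrapy.cfg"),                   # marks the project root
        package / "__init__.py",
        package / "items.py",                 # data models
        package / "middlewares.py",           # spider and downloader middlewares
        package / "pipelines.py",             # item post-processing
        package / "settings.py",              # project configuration
        package / "spiders" / "__init__.py",  # spiders live in this package
    ]


def missing_project_files(root, name):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [p for p in expected_project_files(name) if not (root / p).is_file()]


if __name__ == "__main__":
    # Run from the folder that contains scrapy.cfg
    for path in missing_project_files(".", "myproject"):
        print(f"missing: {path}")
```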
Quick overview of core file priorities
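In the early stages of development, focus on settings.py and the spiders/ directory; items.py, pipelines.py and middlewares.py can wait until you actually need them.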
Detailed explanation of configuration files
Most of the options in the default configuration can be left unchanged for the time being. First, change the following ones to suit your own projects.
1. Basic Identity and User‑Agent
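A minimal sketch of the identity settings in settings.py. BOT_NAME and USER_AGENT are standard Scrapy settings; the values shown (myproject and the contact URL) are placeholders to replace with your own.

```python
# settings.py (fragment)

# Project/bot name; startproject fills this in with the project name
BOT_NAME = "myproject"

# Sent as the User-Agent header on every request.
# Placeholder value: identify your crawler and give a way to contact you.
USER_AGENT = "myproject (+https://example.com/contact)"
```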
2. Anti-crawling and rate limiting
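The settings below are one conservative starting point for polite crawling. The setting names are standard Scrapy settings; the numeric values are illustrative assumptions and should be tuned to the target site.

```python
# settings.py (fragment)

# Respect the site's robots.txt rules
ROBOTSTXT_OBEY = True

# Wait roughly this many seconds between requests to the same site
DOWNLOAD_DELAY = 2

# Cap the number of parallel requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let Scrapy adapt the delay to the latency it observes
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
```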
3. Data export encoding (necessary for learning and testing)
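A one-line sketch of the export encoding setting. Without it, JSON exports write non-ASCII characters as \uXXXX escape sequences, which makes the output hard to read while learning and testing.

```python
# settings.py (fragment)

# Write exported feeds as UTF-8 so non-ASCII text stays readable
FEED_EXPORT_ENCODING = "utf-8"
```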
Create the first crawler
Use the genspider command (for example, scrapy genspider techcrunch techcrunch.com) to quickly generate a crawler template in the spiders/ directory:
Running it generates spiders/techcrunch.py automatically; with slight modifications it can crawl article titles and links:
Verification and testing
1. Run the crawler and view the results in real time
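From the project root directory (the one containing scrapy.cfg), run scrapy crawl techcrunch. Scraped items appear in the console log as they are yielded.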
2. Export data to local file
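On the command line, the -o option of scrapy crawl writes the scraped items to a file (for example, scrapy crawl techcrunch -o articles.json). The same export can be configured once in settings.py through the FEEDS setting. The output path below is a placeholder.

```python
# settings.py (fragment)

FEEDS = {
    # Placeholder path, relative to the directory where the crawl is started
    "output/articles.json": {
        "format": "json",
        "encoding": "utf8",
        "indent": 2,
        # Replace the file on each run instead of appending to it
        "overwrite": True,
    },
}
```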
3. Interactively debug selectors with Scrapy Shell
If you are not sure whether a CSS selector is correct, start Scrapy Shell against the page (scrapy shell followed by the URL) and test it interactively.
Once inside the shell, try extracting:
Troubleshooting common problems
1. scrapy: command not found
- Cause: Python's Scripts directory is not on the system PATH
- Solution:
  - Virtual environment: activate the corresponding virtual environment before running scrapy
  - Global installation:
    - Windows: add C:\Users\<your-username>\AppData\Local\Programs\Python\Python3x\Scripts to PATH
    - macOS/Linux: make sure which python3 and which pip3 point to the Python environment you are using
2. The crawler exits immediately after starting and outputs no data
- Cause: most likely the CSS/XPath selector is wrong and matches nothing
- Solution: verify each selector in Scrapy Shell and confirm that it actually extracts content
3. The server returns 403 Forbidden
- Cause: the default User-Agent is blocked, required cookies are missing, or the request rate is too high
- Solution: change USER_AGENT, increase DOWNLOAD_DELAY appropriately, and add cookies if necessary
Best practice recommendations
- Virtual environment isolation: Create a separate virtual environment for each project to avoid dependency conflicts
- Tune selectors before writing code: test every selector in Scrapy Shell first
- Structured data: where possible, define the data model in items.py instead of yielding plain dictionaries, which makes later maintenance and extension easier
- Keep logs: add LOG_FILE = 'logs/scrapy.log' to settings.py so that problems can be traced afterwards (create the logs/ directory beforehand)
💡 Key Points: When you first get started, focus on two places: settings.py and spiders/. The other files (middlewares, pipelines) can be explored when you really need them. Don't overwhelm yourself at the beginning.

