Scrapy: A Complete Guide to ImagesPipeline and FilesPipeline - Downloading and Processing Multimedia Resources
Anyone who has worked with Scrapy has run into this scenario: the crawler has collected hundreds of large product images, and you need to download them locally and generate thumbnails in several sizes; or you need to save a pile of PDF and video files with automatic deduplication and sensible file names for easy retrieval. Writing the download logic by hand every time means reinventing the wheel, and it is easy to get wrong.
Scrapy's built-in `ImagesPipeline` and `FilesPipeline` were made for exactly this kind of multimedia download scenario. They handle downloading, deduplication, format validation, and path management automatically, keeping your crawler code clean and efficient.
This article starts with the differences between the two, then walks through basic configuration, practical usage, and advanced customization, and closes with the most common problems. By the end you should be able to handle the download pipeline for most kinds of multimedia resources.
ImagesPipeline vs FilesPipeline quick overview
Newcomers often confuse these two pipelines, but their roles are clearly distinct.
In one sentence: **use ImagesPipeline when you only download images, and use FilesPipeline for mixed files or non-image resources.** If you have both images and other files, both pipelines can be enabled at the same time and Scrapy runs them in priority order.
💡 Tip: `ImagesPipeline` inherits from `FilesPipeline` and adds image processing on top via Pillow (PIL). Images can therefore also be downloaded with `FilesPipeline`, but you lose features such as thumbnails and format conversion.
Basic configuration and simplest practice
Next, let's get the pipelines running step by step. Whether you are downloading images or files, the configuration follows the same pattern: enable → set parameters → define the Item → output URLs from the Spider.
Step 1: Enable Pipeline
In `settings.py`, add the target pipeline to the `ITEM_PIPELINES` dictionary. The number is the priority: the smaller the value, the earlier the pipeline runs. If you enable both, letting `ImagesPipeline` run first is a common habit (purely a convention, not a requirement).
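A minimal sketch of that setting, enabling both built-in pipelines by their import paths. The priority values 1 and 2 are an arbitrary choice for illustration; any integers work as long as the ordering is what you want.

```python
# settings.py
# Lower numbers run earlier; the exact values here are illustrative.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
    "scrapy.pipelines.files.FilesPipeline": 2,
}
```

If you later subclass a pipeline, replace the built-in path with the import path of your own class.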
Step 2: Configure core parameters
ImagesPipeline common configuration
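The settings below are a sketch of a typical image configuration. The setting names are Scrapy's own; the store path matches the `./static/images/` directory used later in this article, while the thumbnail sizes and minimum dimensions are assumed values chosen only for illustration.

```python
# settings.py
IMAGES_STORE = "./static/images"      # root directory for downloaded images
IMAGES_URLS_FIELD = "image_urls"      # Item field holding the URLs (Scrapy's default name)
IMAGES_RESULT_FIELD = "images"        # Item field receiving the results (Scrapy's default name)
IMAGES_EXPIRES = 90                   # days before a stored image is downloaded again

# Thumbnail name -> (width, height); the sizes are assumptions for this example.
IMAGES_THUMBS = {
    "small": (100, 100),
    "medium": (300, 300),
}

# Drop images smaller than this; the thresholds are assumptions for this example.
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110
```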
⚠️ NOTE: `IMAGES_MIN_WIDTH` and `IMAGES_MIN_HEIGHT` only filter images that were downloaded successfully. If the image URL itself is invalid, the size filter is never triggered, so failed downloads need separate handling.
FilesPipeline common configuration
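A comparable sketch for files. Again the setting names are Scrapy's own, the store path matches the `./static/files/` directory used later in this article, and the concrete values are assumptions.

```python
# settings.py
FILES_STORE = "./static/files"    # root directory for downloaded files
FILES_URLS_FIELD = "file_urls"    # Item field holding the URLs (Scrapy's default name)
FILES_RESULT_FIELD = "files"      # Item field receiving the results (Scrapy's default name)
FILES_EXPIRES = 90                # days before a stored file is downloaded again

# Media requests do not follow redirects unless this is enabled.
MEDIA_ALLOW_REDIRECTS = True
```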
📌 For both pipelines, the `*_URLS_FIELD` and `*_RESULT_FIELD` settings let you choose your own field names in the Item; just keep the Item and `settings.py` in sync.
Step 3: Define Item
An Item needs two fields per pipeline: one holding the list of URLs and one receiving the download results. The image and file fields can live in the same Item or in separate Items.
Scrapy's convention is that the URL field must be an iterable (usually a list). Even a single URL has to be wrapped in a list.
Step 4: Extract URL from Spider and yield Item
Finally, fill the extracted URLs into the Item in the Spider and `yield` it. The pipeline then takes over the rest of the download process automatically.
After running the crawler, the original images appear under `./static/images/full/` and the corresponding thumbnails under `thumbs/small/` and `thumbs/medium/`; downloaded files appear under `./static/files/full/`. At the same time, `item['images']` and `item['files']` are populated with lists of dictionaries containing `url`, `path`, `checksum` and other information, ready for storage downstream.
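To make the shape of those results concrete, here is a sketch of a populated Item and a small helper that turns the recorded relative paths into local paths. The `path` and `checksum` values are placeholders rather than real hashes, and `local_paths` is a hypothetical helper, not part of Scrapy.

```python
import os

# Illustrative Item after the pipeline has run (values are placeholders).
item = {
    "title": "Lamp",
    "image_urls": ["https://example.com/img/lamp.jpg"],
    "images": [
        {
            "url": "https://example.com/img/lamp.jpg",
            "path": "full/<sha1-of-url>.jpg",   # relative to IMAGES_STORE
            "checksum": "<md5-of-content>",
            "status": "downloaded",
        }
    ],
}


def local_paths(item, store_dir, result_field="images"):
    """Join the store directory with each relative path recorded by the pipeline."""
    return [
        os.path.join(store_dir, entry["path"])
        for entry in item.get(result_field, [])
    ]


paths = local_paths(item, "./static/images")
```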
Frequently used customization techniques
The default pipelines cover most download requirements (roughly 80%), but real projects often need more: **custom file names, anti-anti-crawling request headers, converting every image to WebP**... In those cases, subclass the pipeline and override the corresponding methods.
Customize file names and say goodbye to unreadable hashes
The default file name is the SHA-1 hash of the URL, which is meaningless to a human reader. By overriding the `file_path` method, you can give files a `category/timestamp-ID-original_filename` structure, which avoids collisions and makes files easy to find.
Custom file names in ImagesPipeline
🧠 Tip: `thumb_path` works fine without being overridden, but if you want thumbnails to share the original's name with only a suffix added, you need to override it in the same way as `file_path`.
`FilesPipeline` can be customized in exactly the same way: just change the parent class to `FilesPipeline`.
Add anti-anti-crawling headers to download requests
Many resource servers check `Referer` and `User-Agent`, and some even require cookies. The requests generated by the default pipeline are fairly bare, so override `get_media_requests` to customize them.
⚠️ About cookies: it is usually better to rely on Scrapy's global `COOKIES_ENABLED = True` together with middleware to manage cookies, but for special cases you can also pass `headers` or `cookies` arguments to the request here.
Convert all images to WebP (advanced)
WebP files are often substantially smaller than equivalent JPG/PNG files (savings of roughly 30%~80% are cited, depending on content and quality settings), and the format has become standard on modern websites. By overriding the `convert_image` method, you can convert every downloaded image to WebP and handle the necessary mode conversion and resizing in the same place.
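At the same time, change the extension returned by `file_path` to `.webp` so that the file name matches its content. Note that Scrapy writes the bytes returned by `convert_image` as they are: the extension alone does not select an encoder, so the WebP encoding itself has to happen inside `convert_image` through Pillow.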
At this point, a fully automatic image downloading and converting pipeline to WebP is completed.
Frequently Asked Questions and Solutions
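❓ An image/file download failed, but the log shows no obvious alarm
Cause: by default the pipeline does not fail the Item when a download fails. The failed URL is simply missing from the result field and the Item carries on through the pipeline; `DropItem` is raised only if you add that check yourself, for example in `item_completed` when every URL in the Item has failed.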
Solution:
- Run the crawler with `--logfile=scrapy.log`, then inspect the log (down to DEBUG level) and search for the pipeline's download error messages for the affected URLs.
- You can also override the `media_failed` method to log failed URLs and their reasons explicitly.
❓ Files still overwrite each other despite custom file names
Cause: the naming rule is not unique enough, for example several Items share the same `id`, or the name contains no timestamp or hash.
Solution: when building the file name, combine at least a timestamp, the Item's unique ID, and a short hash of the URL (the first 8 characters).
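That combination can be sketched as a small helper. `unique_stem` is a hypothetical function, and the timestamp is passed in (for example from an Item field) so that repeated calls for the same URL return the same name.

```python
import hashlib


def unique_stem(item_id, url, timestamp):
    """Return 'timestamp-ID-hash8', where hash8 is the first 8 hex chars of SHA-1(url)."""
    short_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{timestamp}-{item_id}-{short_hash}"


# Two URLs on the same Item at the same second still get different names.
first = unique_stem(42, "https://example.com/img/front.jpg", 1700000000)
second = unique_stem(42, "https://example.com/img/back.jpg", 1700000000)
```

The result can be returned from `file_path` with a directory prefix and an extension added.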
❓ The image cache keeps expiring and I don't want to download again; I want to keep files permanently
Cause: `IMAGES_EXPIRES` / `FILES_EXPIRES` is set too low.
Solution: set the expiration to a very large number of days, for example `36500` (100 years). Avoid `0`: the pipelines download a file again once it is older than the configured number of days, so `0` would treat every stored file as expired instead of keeping it forever.
❓ Memory blow-up or process crash when processing large images
Cause: nothing limits image dimensions or memory use, so huge images are loaded and resized in one go.
Solution:
- In `convert_image`, cap the maximum size: for example, scale the image down proportionally when its width or height exceeds 2048px.
- Release large intermediate objects promptly. Python has a garbage collector, but dropping references to big objects with `del` as soon as they are no longer needed lets the memory be reclaimed sooner.
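The proportional scaling rule from the first point can be sketched as a pure function. `fit_within` is a hypothetical helper; inside `convert_image` you could feed its result to Pillow's `image.resize(...)`, or simply call `image.thumbnail((2048, 2048))`, which applies the same idea in place.

```python
def fit_within(width, height, max_side=2048):
    """Scale (width, height) down proportionally so neither side exceeds max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, leave untouched
    scale = max_side / longest
    # Never let a side collapse to zero for extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))


landscape = fit_within(4096, 2048)
portrait = fit_within(3000, 6000)
small = fit_within(1000, 500)
```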
❓ After thumbnails are generated, the original still sits in the full directory, but I only want the thumbnails
By default the pipeline keeps the original image alongside the thumbnails. To keep only the thumbnails, override the `item_completed` method and delete the original file manually once processing has finished.
⚠️ Note: with this approach the original is gone, so it has to be downloaded again whenever you need it. Weigh this against your actual business needs.
Final thoughts
`ImagesPipeline` and `FilesPipeline` are Scrapy's gift to developers of multimedia crawlers. Once you are comfortable with them, you can set up a robust resource download pipeline quickly and concentrate on data extraction and business logic. Combined with custom file names, anti-anti-crawling request headers, and a unified image format, they cover the needs of most production environments.
If you run into other thorny problems in practice, go back to the official documentation, or print the attributes available on `info.spider` in your code; Scrapy's pipelines are more powerful and flexible than they first appear. Happy, worry-free downloading!
📌 Extended reading: next, you can explore more ways to use Item Pipelines and learn how to clean and validate the downloaded data.

