#Dockerized Crawlers - Cloud-Native Spider Deployment and Management in Depth
📂 Stage: Stage 6 — Operations & Monitoring (Engineering)
🔗 Related chapters: Scrapyd and ScrapydWeb · Scraping Monitoring Dashboard · Scrapy-Redis Distributed Architecture
#Table of Contents
- Docker Containerization Overview
- Dockerfile Best Practices
- Multi-Stage Build Optimization
- Docker Compose Orchestration
- Production Deployment
- Network and Storage Configuration
- Security Configuration and Permission Management
- CI/CD Integration
- Performance Optimization and Monitoring
- Troubleshooting and Maintenance
- Best Practices Summary
#Docker Containerization Overview
Docker gives Scrapy spiders a standardized runtime environment. It eliminates the classic "it works on my machine" problem and provides environment consistency, portability, and scalability.
#Benefits of Containerization
1. Environment consistency
- Identical development, testing, and production environments
- Locked dependency versions
- Isolated system libraries
2. Easy deployment
- One-command deployment
- Fast start/stop
- Version management
3. Resource isolation (see the example after this list)
- CPU/memory limits
- Network isolation
- Filesystem isolation
4. Scalability
- Horizontal scaling
- Load balancing
- Auto-scaling
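Resource isolation is enforced when a container is started. The commands below are a minimal sketch of applying CPU and memory limits plus a dedicated network to a spider container; the image name scrapy-spider and the network name crawl-net are placeholders.
# run a spider with CPU/memory limits on its own bridge network
docker network create crawl-net
docker run -d --name spider-1 \
  --cpus="0.5" \
  --memory="512m" \
  --network crawl-net \
  scrapy-spider:latest
# confirm the limits that were applied
docker inspect -f '{{.HostConfig.NanoCpus}} {{.HostConfig.Memory}}' spider-1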
#Containerization Architecture Patterns
1. Monolithic
- A single container runs the entire crawler
- Suited to small projects
- Simple to deploy
2. Microservices
- Multiple containers working together
- Crawling, storage, and monitoring are separated
- Suited to large projects
3. Hybrid
- Combines monolithic and microservice elements
- Mix and match according to business needs
#Dockerfile Best Practices
#Basic Dockerfile
# Dockerfile - Scrapy爬虫基础镜像
# 使用官方Python基础镜像
FROM python:3.9-slim
# 设置环境变量
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PYTHONPATH=/app \
SCRAPY_SETTINGS_MODULE=myproject.settings
# 设置工作目录
WORKDIR /app
# 安装系统依赖
RUN apt-get update && apt-get install -y \
gcc \
g++ \
libxml2-dev \
libxslt1-dev \
libffi-dev \
libssl-dev \
libpng-dev \
libjpeg-dev \
zlib1g-dev \
wget \
curl \
git \
&& rm -rf /var/lib/apt/lists/*
# 复制依赖文件
COPY requirements.txt .
# 安装Python依赖
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r requirements.txt
# 创建非root用户
RUN groupadd -r appuser && useradd -r -g appuser appuser
RUN chown -R appuser:appuser /app
USER appuser
# 复制应用代码
COPY --chown=appuser:appuser . .
# 创建必要的目录
RUN mkdir -p /app/logs /app/data
# 暴露端口(如果需要)
EXPOSE 8080
# 健康检查
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD python -c "import requests; requests.get('http://localhost:8080')" || exit 1
# 启动命令
CMD ["python", "run_spider.py"]#优化版Dockerfile
# Dockerfile.optimized - 优化版镜像
FROM python:3.9-slim
# 设置构建参数
ARG BUILD_DATE
ARG VCS_REF
ARG VERSION="1.0.0"
# 设置环境变量
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PYTHONPATH=/app \
SCRAPY_SETTINGS_MODULE=myproject.settings
# 设置工作目录
WORKDIR /app
# 安装系统依赖(合并为一条命令减少层数)
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc \
g++ \
libxml2-dev \
libxslt1-dev \
libffi-dev \
libssl-dev \
libpng-dev \
libjpeg-dev \
zlib1g-dev \
wget \
curl \
git \
ca-certificates \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean
# 复制requirements文件并安装Python依赖
COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r requirements.txt && \
pip cache purge
# 创建非root用户
RUN groupadd -r appuser && useradd -r -g appuser appuser -u 1000
RUN chown -R appuser:appuser /app
USER appuser
# 复制应用代码
COPY --chown=appuser:appuser . .
# 设置文件权限
RUN chmod +x run_spider.py entrypoint.sh
# 创建必要的目录
RUN mkdir -p /app/logs /app/data /app/cache
# 标签
LABEL org.label-schema.build-date=$BUILD_DATE \
org.label-schema.vcs-ref=$VCS_REF \
org.label-schema.version=$VERSION \
org.label-schema.schema-version="1.0"
# 暴露端口
EXPOSE 8080
# 健康检查
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD python -c "import requests; requests.get('http://localhost:8080')" || exit 1
# 启动脚本
ENTRYPOINT ["/app/entrypoint.sh"]#专用Scrapy镜像
# Dockerfile.scrapy - 专用Scrapy镜像
FROM python:3.9-slim
# 安装系统依赖
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc \
g++ \
libxml2-dev \
libxslt1-dev \
libffi-dev \
libssl-dev \
libpng-dev \
libjpeg-dev \
zlib1g-dev \
libgumbo-dev \
libevent-dev \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean
# 安装Python依赖
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir \
scrapy \
twisted[tls,http2] \
cryptography \
pyopenssl \
service_identity \
cssselect \
lxml \
&& pip cache purge
# 创建工作目录
WORKDIR /app
# 创建非root用户
RUN groupadd -r scrapy && useradd -r -g scrapy scrapy -u 1000
USER scrapy
# 暴露端口
EXPOSE 8080
# 启动命令
CMD ["scrapy", "--version"]#多阶段构建优化
#Multi-Stage Build Example
# Dockerfile.multi-stage - 多阶段构建
# 构建阶段
FROM python:3.9-slim AS builder
# 安装构建依赖
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc \
g++ \
libxml2-dev \
libxslt1-dev \
libffi-dev \
libssl-dev \
&& rm -rf /var/lib/apt/lists/*
# 设置工作目录
WORKDIR /app
# 复制依赖文件并安装
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt
# 运行时阶段
FROM python:3.9-slim
# 安装运行时依赖
RUN apt-get update && apt-get install -y --no-install-recommends \
libxml2 \
libxslt1.1 \
libffi7 \
libssl1.1 \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean
# 复制Python包
COPY --from=builder /root/.local /root/.local
# 设置环境变量
ENV PATH=/root/.local/bin:$PATH \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1
# 创建工作目录
WORKDIR /app
# 创建非root用户
RUN groupadd -r appuser && useradd -r -g appuser appuser -u 1000
USER appuser
# 复制应用代码
COPY --chown=appuser:appuser . .
# 暴露端口
EXPOSE 8080
# 启动命令
CMD ["python", "run_spider.py"]#针对不同环境的构建
# Dockerfile.env-specific - 环境特定构建
ARG TARGET_ENV=production
FROM python:3.9-slim AS base
# 基础系统依赖
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc \
g++ \
libxml2-dev \
libxslt1-dev \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1
# 开发环境
FROM base AS development
RUN apt-get update && apt-get install -y --no-install-recommends \
vim \
curl \
&& rm -rf /var/lib/apt/lists/*
COPY requirements-dev.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
# 生产环境
FROM base AS production
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# 最终镜像
FROM ${TARGET_ENV}
COPY . .
USER 1000
EXPOSE 8080
CMD ["python", "run_spider.py"]#Docker Compose编排
#Basic Compose Configuration
# docker-compose.yml - 基础编排配置
version: '3.8'
services:
# Redis缓存
redis:
image: redis:7-alpine
container_name: scrapy-redis
restart: unless-stopped
ports:
- "6379:6379"
volumes:
- redis_data:/data
command: redis-server --appendonly yes
networks:
- scrapy-network
# MongoDB存储
mongodb:
image: mongo:6
container_name: scrapy-mongodb
restart: unless-stopped
ports:
- "27017:27017"
environment:
MONGO_INITDB_ROOT_USERNAME: scrapy
MONGO_INITDB_ROOT_PASSWORD: scrapy123
volumes:
- mongodb_data:/data/db
networks:
- scrapy-network
# Scrapy爬虫
spider:
build: .
container_name: scrapy-spider
restart: unless-stopped
depends_on:
- redis
- mongodb
environment:
- REDIS_URL=redis://redis:6379
- MONGODB_URI=mongodb://scrapy:scrapy123@mongodb:27017
- SCRAPY_SETTINGS_MODULE=myproject.settings
volumes:
- ./logs:/app/logs
- ./data:/app/data
networks:
- scrapy-network
deploy:
replicas: 2 # 启动2个爬虫实例
# 监控面板
grafana:
image: grafana/grafana-enterprise
container_name: scrapy-grafana
restart: unless-stopped
ports:
- "3000:3000"
volumes:
- grafana_data:/var/lib/grafana
networks:
- scrapy-network
depends_on:
- spider
volumes:
redis_data:
mongodb_data:
grafana_data:
networks:
scrapy-network:
    driver: bridge
#Advanced Compose Configuration
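The advanced file below relies on an overlay network and per-service deploy policies, which implies Docker Swarm rather than plain docker compose; note also that docker stack deploy ignores build:, so the spider image must be built and pushed beforehand. A deployment sketch:
# overlay networks and deploy.* policies are swarm-mode features
docker swarm init
docker stack deploy -c docker-compose.advanced.yml scrapy
docker stack services scrapy
docker service logs -f scrapy_spider-cluster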
# docker-compose.advanced.yml - 高级编排配置
version: '3.8'
services:
# 负载均衡器
nginx:
image: nginx:alpine
container_name: scrapy-nginx
restart: unless-stopped
ports:
- "80:80"
- "443:443"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf
- ./ssl:/etc/nginx/ssl
networks:
- scrapy-network
depends_on:
- spider-cluster
# 爬虫集群
spider-cluster:
build: .
restart: unless-stopped
depends_on:
- redis
- postgres
environment:
- REDIS_URL=redis://redis:6379
- DATABASE_URL=postgresql://scrapy:scrapy123@postgres:5432/scrapy_db
- LOG_LEVEL=INFO
- CONCURRENT_REQUESTS=16
volumes:
- spider_logs:/app/logs
- spider_data:/app/data
networks:
- scrapy-network
deploy:
replicas: 3
resources:
limits:
cpus: '0.5'
memory: 512M
reservations:
cpus: '0.25'
memory: 256M
# Redis集群
redis:
image: redis:7-alpine
restart: unless-stopped
command: redis-server --appendonly yes --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendfilename appendonly.aof --appendfsync always
volumes:
- redis_cluster_data:/data
networks:
- scrapy-network
deploy:
replicas: 3
# PostgreSQL数据库
postgres:
image: postgres:15
restart: unless-stopped
environment:
POSTGRES_DB: scrapy_db
POSTGRES_USER: scrapy
POSTGRES_PASSWORD: scrapy123
volumes:
- postgres_data:/var/lib/postgresql/data
- ./init.sql:/docker-entrypoint-initdb.d/init.sql
networks:
- scrapy-network
deploy:
resources:
limits:
memory: 1G
reservations:
memory: 512M
# 消息队列
rabbitmq:
image: rabbitmq:3-management
restart: unless-stopped
environment:
RABBITMQ_DEFAULT_USER: scrapy
RABBITMQ_DEFAULT_PASS: scrapy123
ports:
- "15672:15672" # 管理界面
- "5672:5672" # AMQP端口
volumes:
- rabbitmq_data:/var/lib/rabbitmq
networks:
- scrapy-network
# 监控系统
prometheus:
image: prom/prometheus
restart: unless-stopped
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--storage.tsdb.retention.time=200h'
- '--web.enable-lifecycle'
ports:
- "9090:9090"
networks:
- scrapy-network
grafana:
image: grafana/grafana-enterprise
restart: unless-stopped
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin123
- GF_USERS_ALLOW_SIGN_UP=false
ports:
- "3000:3000"
networks:
- scrapy-network
depends_on:
- prometheus
volumes:
spider_logs:
spider_data:
redis_cluster_data:
postgres_data:
rabbitmq_data:
prometheus_data:
grafana_data:
networks:
scrapy-network:
driver: overlay
    attachable: true
#Environment-Specific Configuration
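Compose merges the files passed with -f in order, later files overriding earlier ones, and docker-compose.override.yml is applied automatically when no -f flags are given. A sketch of the two workflows:
# development: docker-compose.yml + docker-compose.override.yml are merged automatically
docker compose up -d
# production: list the files explicitly so the development override is NOT applied
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d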
# docker-compose.override.yml - 开发环境覆盖配置
version: '3.8'
services:
spider:
build:
context: .
target: development
volumes:
- .:/app # 代码热重载
- /app/__pycache__ # 排除缓存
environment:
- LOG_LEVEL=DEBUG
- DEBUG=true
ports:
- "8080:8080" # 暴露端口便于调试
command: ["python", "-m", "pdb", "run_spider.py"]
# docker-compose.prod.yml - 生产环境配置
version: '3.8'
services:
spider:
image: mycompany/scrapy-spider:latest
environment:
- LOG_LEVEL=WARNING
- DEBUG=false
security_opt:
- no-new-privileges:true
read_only: true
tmpfs:
- /tmp
- /var/run
ulimits:
nproc: 65535
nofile:
soft: 20000
        hard: 40000
#Production Deployment
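The deployment script below logs into a private registry only when DOCKER_REGISTRY is set, so credentials are expected in the environment. A usage sketch (all values are placeholders):
# provide registry credentials, then run the deployment script
export DOCKER_REGISTRY="myregistry.com"
export DOCKER_USERNAME="deploy-bot"
export DOCKER_PASSWORD="********"
chmod +x deploy.sh
./deploy.sh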
#Deployment Script
#!/bin/bash
# deploy.sh - 生产环境部署脚本
set -e # 遇到错误立即退出
# 配置变量
IMAGE_NAME="scrapy-spider"
CONTAINER_NAME="scrapy-production"
REGISTRY="myregistry.com"
TAG="latest"
ENV_FILE="./.env.production"
echo "🚀 Starting production deployment..."
# 检查Docker是否运行
if ! docker info >/dev/null 2>&1; then
echo "❌ Docker is not running. Please start Docker first."
exit 1
fi
# 登录私有仓库
if [ -n "$DOCKER_REGISTRY" ]; then
echo "🔐 Logging into Docker registry..."
echo "$DOCKER_PASSWORD" | docker login -u "$DOCKER_USERNAME" --password-stdin "$DOCKER_REGISTRY"
fi
# 构建镜像
echo "🔨 Building Docker image..."
docker build -t "$IMAGE_NAME:$TAG" .
# 标记镜像
echo "🏷️ Tagging image..."
docker tag "$IMAGE_NAME:$TAG" "$REGISTRY/$IMAGE_NAME:$TAG"
# 推送镜像
echo "📤 Pushing image to registry..."
docker push "$REGISTRY/$IMAGE_NAME:$TAG"
# 停止现有容器
echo "🛑 Stopping existing containers..."
docker-compose -f docker-compose.prod.yml down --remove-orphans || true
# 拉取最新镜像
echo "📥 Pulling latest image..."
docker-compose -f docker-compose.prod.yml pull
# 启动服务
echo "▶️ Starting services..."
docker-compose -f docker-compose.prod.yml up -d
# 等待服务启动
echo "⏳ Waiting for services to start..."
sleep 10
# 验证服务状态
echo "🔍 Verifying service status..."
if docker-compose -f docker-compose.prod.yml ps | grep -q "Up"; then
echo "✅ Deployment successful!"
else
echo "❌ Deployment failed!"
docker-compose -f docker-compose.prod.yml logs
exit 1
fi
# 清理旧镜像
echo "🧹 Cleaning up old images..."
docker image prune -f
echo "🎉 Production deployment completed successfully!"#Kubernetes部署配置
# k8s-scrapy.yaml - Kubernetes部署配置
apiVersion: apps/v1
kind: Deployment
metadata:
name: scrapy-spider
labels:
app: scrapy-spider
spec:
replicas: 3
selector:
matchLabels:
app: scrapy-spider
template:
metadata:
labels:
app: scrapy-spider
spec:
containers:
- name: spider
image: mycompany/scrapy-spider:latest
ports:
- containerPort: 8080
env:
- name: REDIS_URL
value: "redis://redis-service:6379"
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: db-secret
key: database-url
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: scrapy-service
spec:
selector:
app: scrapy-spider
ports:
- protocol: TCP
port: 80
targetPort: 8080
type: LoadBalancer
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: scrapy-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: scrapy-spider
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
        averageUtilization: 70
#Network and Storage Configuration
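A network marked internal: true has no route to the outside world, which is a useful default for services that must never reach the public internet. A quick sanity check of that behavior, as a sketch:
# containers on an internal network cannot reach external hosts
docker network create --internal internal-only
docker run --rm --network internal-only alpine ping -c 1 -W 2 8.8.8.8 || echo "blocked, as expected"
docker network rm internal-only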
#Network Security Configuration
# network-config.yml - 网络安全配置
version: '3.8'
services:
  # Note: the "internal" and "external" networks used by the services below are
  # defined under the top-level networks: key at the end of this file; the
  # internal network sets internal: true to block all external access
# 爬虫服务(仅内部访问)
spider-internal:
build: .
networks:
- internal
environment:
- ALLOWED_DOMAINS=internal-only.com
# API网关(对外暴露)
api-gateway:
image: nginx:alpine
ports:
- "80:80"
networks:
- internal
- external
volumes:
- ./gateway.conf:/etc/nginx/nginx.conf
# 监控服务
monitoring:
image: grafana/grafana
networks:
- internal
ports:
- "3000:3000" # 仅在特定IP开放
networks:
internal:
driver: bridge
internal: true
external:
    driver: bridge
#Storage Configuration
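The storage file below maps named volumes onto fixed host paths via the local driver's bind options, so those directories must exist before the stack starts. A preparation sketch:
# pre-create the host paths backing the bind-mounted named volumes, then start the stack
sudo mkdir -p /host/data/scrapy /host/logs/scrapy
docker compose -f storage-config.yml up -d
docker volume ls | grep scrapy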
# storage-config.yml - 存储配置
version: '3.8'
services:
# 持久化存储
spider-data:
build: .
volumes:
# 本地持久化
- scrapy-data:/app/data
# 日志持久化
- scrapy-logs:/app/logs
# 配置文件
- ./config:/app/config:ro
# SSL证书
- ./ssl:/app/ssl:ro
# 临时存储
tmpfs:
- /tmp:size=100M,mode=1777
- /app/temp:size=50M,mode=1777
volumes:
scrapy-data:
driver: local
driver_opts:
type: none
o: bind
device: /host/data/scrapy
scrapy-logs:
driver: local
driver_opts:
type: none
o: bind
      device: /host/logs/scrapy
#Security Configuration and Permission Management
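The hardened image below is meant to run with a read-only root filesystem, which must be requested at run time. A minimal launch sketch:
# run the hardened image read-only, with writable tmpfs and no privilege escalation
docker build -f Dockerfile.security -t scrapy-secure .
docker run -d --name secure-spider \
  --read-only \
  --tmpfs /tmp \
  --security-opt no-new-privileges:true \
  --cap-drop ALL \
  scrapy-secure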
#Security Baseline Configuration
# Dockerfile.security - 安全基线配置
FROM python:3.9-slim
# 创建专用用户和组
RUN groupadd -r scrapy --gid=1001 && \
useradd -r -g scrapy --uid=1001 -d /app scrapy
# 安装最小化系统依赖
RUN apt-get update && apt-get install -y --no-install-recommends \
libxml2 \
libxslt1.1 \
libffi7 \
libssl1.1 \
ca-certificates \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean
# 设置工作目录
WORKDIR /app
# 复制并安装Python依赖
COPY --chown=scrapy:scrapy requirements.txt .
USER scrapy
RUN pip install --user --no-cache-dir -r requirements.txt
# 复制应用代码
COPY --chown=scrapy:scrapy . .
# 设置安全限制
USER root
RUN chown -R scrapy:scrapy /app
USER scrapy
# 只读根文件系统
# 注意:这需要在运行时使用 --read-only 标志
CMD ["python", "run_spider.py"]#运行时安全配置
# security-compose.yml - 运行时安全配置
version: '3.8'
services:
secure-spider:
build:
context: .
dockerfile: Dockerfile.security
security_opt:
- no-new-privileges:true
- label=type:container_runtime_t
cap_drop:
- ALL
cap_add:
- NET_BIND_SERVICE
- SETUID
- SETGID
read_only: true
tmpfs:
- /tmp
- /run
- /var/run
user: "1001:1001"
privileged: false
devices: []
shm_size: 64m
ulimits:
nproc: 1024
nofile:
soft: 1024
hard: 2048
environment:
- PYTHONPATH=/app
- HOME=/app
volumes:
- spider-data:/app/data:rw
- spider-logs:/app/logs:rw
- /etc/passwd:/etc/passwd:ro
- /etc/group:/etc/group:ro
volumes:
spider-data:
  spider-logs:
#CI/CD Integration
#GitHub Actions Configuration
# .github/workflows/docker.yml - Docker CI/CD配置
name: Docker Build and Push
on:
push:
branches: [ main, develop ]
pull_request:
branches: [ main ]
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.9'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt
pip install pytest pytest-cov
- name: Run tests
run: |
pytest tests/ -v --cov=scrapy_project
- name: Run linting
run: |
pip install flake8
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
build-and-push:
needs: test
runs-on: ubuntu-latest
permissions:
contents: read
packages: write
steps:
- name: Checkout repository
uses: actions/checkout@v3
- name: Log in to Container Registry
uses: docker/login-action@v2
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata
id: meta
uses: docker/metadata-action@v4
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=ref,event=branch
type=ref,event=pr
type=sha,prefix={{branch}}-
type=raw,value=latest,enable={{is_default_branch}}
- name: Build and push Docker image
uses: docker/build-push-action@v4
with:
context: .
platforms: linux/amd64,linux/arm64
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
deploy:
needs: build-and-push
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main'
steps:
- name: Deploy to production
run: |
echo "Deploying to production..."
          # Add your deployment commands here
#Jenkins Pipeline Configuration
// Jenkinsfile - Jenkins流水线配置
pipeline {
agent any
environment {
DOCKER_IMAGE = 'scrapy-spider'
DOCKER_REGISTRY = 'myregistry.com'
KUBE_NAMESPACE = 'scrapy-prod'
}
stages {
stage('Checkout') {
steps {
checkout scm
}
}
stage('Test') {
steps {
script {
sh '''
python -m pip install --upgrade pip
pip install -r requirements.txt
pip install pytest
pytest tests/ -v
'''
}
}
}
stage('Security Scan') {
steps {
script {
sh '''
# Run security scans
pip install bandit safety
bandit -r . -f json -o bandit-report.json
safety check -o json > safety-report.json
'''
}
}
post {
always {
publishHTML([
allowMissing: false,
alwaysLinkToLastBuild: true,
keepAll: true,
reportDir: '.',
reportFiles: 'bandit-report.json,safety-report.json',
reportName: 'Security Reports'
])
}
}
}
stage('Build Docker Image') {
steps {
script {
def image = docker.build("${DOCKER_REGISTRY}/${DOCKER_IMAGE}:${env.BUILD_NUMBER}")
sh """
docker tag ${DOCKER_REGISTRY}/${DOCKER_IMAGE}:${env.BUILD_NUMBER} ${DOCKER_REGISTRY}/${DOCKER_IMAGE}:latest
"""
}
}
}
stage('Push to Registry') {
steps {
script {
docker.withRegistry("https://${DOCKER_REGISTRY}", 'docker-registry-credentials') {
sh """
docker push ${DOCKER_REGISTRY}/${DOCKER_IMAGE}:${env.BUILD_NUMBER}
docker push ${DOCKER_REGISTRY}/${DOCKER_IMAGE}:latest
"""
}
}
}
}
stage('Deploy to Kubernetes') {
when {
branch 'main'
}
steps {
script {
sh """
kubectl set image deployment/scrapy-spider spider=${DOCKER_REGISTRY}/${DOCKER_IMAGE}:${env.BUILD_NUMBER} -n ${KUBE_NAMESPACE}
kubectl rollout status deployment/scrapy-spider -n ${KUBE_NAMESPACE}
"""
}
}
}
}
post {
always {
cleanWs()
}
success {
echo '✅ Pipeline completed successfully!'
}
failure {
echo '❌ Pipeline failed!'
}
}
}
#Performance Optimization and Monitoring
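The monitoring script below exports container metrics in Prometheus format on port 8000. Once it is running, the exposed series can be spot-checked with a plain HTTP request (assuming port 8000 is reachable from where you run it):
# scrape the exporter started by monitoring.py and look for the container gauges
curl -s http://localhost:8000/metrics | grep -E 'container_(cpu|memory)_usage'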
#Performance Monitoring Configuration
# monitoring.py - 性能监控脚本
import docker
import psutil
import time
from prometheus_client import start_http_server, Gauge, Counter, Histogram
import logging
# Prometheus指标定义
CONTAINER_CPU_USAGE = Gauge('container_cpu_usage_percent', 'Container CPU usage', ['container_name'])
CONTAINER_MEMORY_USAGE = Gauge('container_memory_usage_bytes', 'Container memory usage', ['container_name'])
CONTAINER_NETWORK_RX = Counter('container_network_receive_bytes_total', 'Container network receive bytes', ['container_name'])
CONTAINER_NETWORK_TX = Counter('container_network_transmit_bytes_total', 'Container network transmit bytes', ['container_name'])
SPIDER_REQUEST_DURATION = Histogram('spider_request_duration_seconds', 'Spider request duration')
class DockerMonitor:
"""
Docker容器监控器
"""
def __init__(self, interval=10):
self.client = docker.from_env()
self.interval = interval
self.logger = logging.getLogger(__name__)
def collect_metrics(self):
"""
收集容器指标
"""
try:
containers = self.client.containers.list()
for container in containers:
try:
# 获取容器统计信息
stats = container.stats(stream=False)
# CPU使用率
cpu_delta = stats['cpu_stats']['cpu_usage']['total_usage'] - \
stats['precpu_stats']['cpu_usage']['total_usage']
system_delta = stats['cpu_stats']['system_cpu_usage'] - \
stats['precpu_stats']['system_cpu_usage']
if system_delta > 0:
cpu_percent = (cpu_delta / system_delta) * len(stats['cpu_stats']['cpu_usage']['percpu_usage']) * 100
CONTAINER_CPU_USAGE.labels(container_name=container.name).set(cpu_percent)
# 内存使用
mem_usage = stats['memory_stats']['usage']
CONTAINER_MEMORY_USAGE.labels(container_name=container.name).set(mem_usage)
# 网络流量
if 'networks' in stats:
for interface, net_stats in stats['networks'].items():
CONTAINER_NETWORK_RX.labels(container_name=container.name).inc(net_stats['rx_bytes'])
CONTAINER_NETWORK_TX.labels(container_name=container.name).inc(net_stats['tx_bytes'])
except Exception as e:
self.logger.error(f"Error collecting metrics for {container.name}: {e}")
except Exception as e:
self.logger.error(f"Error collecting container metrics: {e}")
def start_monitoring(self):
"""
开始监控
"""
self.logger.info("Starting Docker monitoring...")
while True:
try:
self.collect_metrics()
time.sleep(self.interval)
except KeyboardInterrupt:
self.logger.info("Monitoring stopped.")
break
except Exception as e:
self.logger.error(f"Monitoring error: {e}")
time.sleep(self.interval)
if __name__ == "__main__":
# 启动Prometheus指标服务器
start_http_server(8000)
# 开始监控
monitor = DockerMonitor(interval=5)
    monitor.start_monitoring()
#Resource Optimization Configuration
# optimization-compose.yml - 资源优化配置
version: '3.8'
services:
optimized-spider:
build: .
# 资源限制
deploy:
resources:
limits:
cpus: '1.0'
memory: 1024M
reservations:
cpus: '0.5'
memory: 512M
# 重启策略
restart: unless-stopped
# 健康检查
healthcheck:
test: ["CMD", "python", "-c", "import requests; requests.get('http://localhost:8080/health')"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
# 环境变量优化
environment:
# Python优化
- PYTHONUNBUFFERED=1
- PYTHONDONTWRITEBYTECODE=1
# Scrapy优化
- CONCURRENT_REQUESTS=16
- DOWNLOAD_DELAY=1
- RANDOMIZE_DOWNLOAD_DELAY=0.5
# 内存优化
- PYTHON_GC_THRESHOLD=700,10,10
# 日志优化
- LOG_LEVEL=INFO
# 存储优化
volumes:
# 使用高性能存储
- type: volume
source: spider-data
target: /app/data
volume:
nocopy: true
# 临时文件使用tmpfs
- type: tmpfs
target: /tmp
tmpfs:
size: 100M
mode: 1777
# 网络优化
networks:
- optimized
dns:
- 8.8.8.8
- 8.8.4.4
networks:
optimized:
driver: bridge
ipam:
config:
- subnet: 172.20.0.0/16
volumes:
spider-data:
driver: local
driver_opts:
type: none
o: bind
      device: /fast-ssd/scrapy-data
#Troubleshooting and Maintenance
#Troubleshooting Script
#!/bin/bash
# troubleshoot.sh - Docker容器故障诊断脚本
echo "🔍 Starting Docker container troubleshooting..."
# 检查Docker服务状态
echo "📋 Checking Docker service status..."
if ! systemctl is-active --quiet docker; then
echo "❌ Docker service is not running"
sudo systemctl start docker
else
echo "✅ Docker service is running"
fi
# 检查磁盘空间
echo "💾 Checking disk space..."
df -h | grep -E '(overlay|/dev)' | while read line; do
usage=$(echo $line | awk '{print $5}' | sed 's/%//')
if [ $usage -gt 80 ]; then
echo "⚠️ High disk usage: $line"
fi
done
# 检查容器状态
echo "🐳 Checking container status..."
docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}" | while read line; do
if [[ $line =~ Exited|Restarting ]]; then
echo "⚠️ Container issue: $line"
fi
done
# 检查日志错误
echo "📝 Checking container logs for errors..."
for container in $(docker ps -aq); do
if [ -n "$(docker logs $container 2>&1 | grep -i error)" ]; then
echo "⚠️ Errors found in container: $(docker inspect --format='{{.Name}}' $container | sed 's/\///')"
docker logs $container --tail 50 | grep -i error
fi
done
# 检查Docker镜像
echo "📦 Checking Docker images..."
if [ $(docker images -q | wc -l) -gt 50 ]; then
echo "⚠️ Too many Docker images, consider cleaning up"
echo "Run: docker image prune -a"
fi
# 检查Docker网络
echo "🌐 Checking Docker networks..."
docker network ls --format '{{.Name}}' | grep -vE '^(bridge|host|none)$' | while read -r network; do
    if [ "$(docker network inspect "$network" | jq -r '.[0].Containers | length')" -eq 0 ]; then
        echo "⚠️ Unused network detected: $network"
    fi
done
# 性能检查
echo "⚡ Checking container performance..."
docker stats --no-stream | while read line; do
if [[ $line =~ ([0-9]+)% ]]; then
cpu_usage=${BASH_REMATCH[1]}
if [ $cpu_usage -gt 80 ]; then
echo "⚠️ High CPU usage: $line"
fi
fi
done
echo "✅ Troubleshooting completed!"#维护脚本
#!/bin/bash
# maintenance.sh - Docker容器维护脚本
# 配置变量
RETENTION_DAYS=30
LOG_RETENTION_DAYS=7
echo "🔧 Starting Docker maintenance tasks..."
# 清理未使用的容器
echo "🗑️ Cleaning up stopped containers..."
docker container prune -f
# 清理未使用的镜像
echo "🗑️ Cleaning up unused images..."
docker image prune -f
# 清理未使用的卷
echo "🗑️ Cleaning up unused volumes..."
docker volume prune -f
# 清理未使用的网络
echo "🗑️ Cleaning up unused networks..."
docker network prune -f
# 清理构建缓存
echo "🗑️ Cleaning up build cache..."
docker builder prune -f
# 清理旧日志
echo "📝 Rotating container logs..."
for container in $(docker ps -q); do
log_file=$(docker inspect --format='{{.LogPath}}' $container)
if [ -f "$log_file" ]; then
size=$(stat -c%s "$log_file")
if [ $size -gt 104857600 ]; then # 100MB
echo "Rotating large log: $log_file"
mv "$log_file" "${log_file}.old"
docker restart $container > /dev/null 2>&1
fi
fi
done
# 清理旧日志文件
find /var/lib/docker/containers -name "*.log" -mtime +$LOG_RETENTION_DAYS -delete
# 检查磁盘使用
echo "💾 Checking disk usage..."
docker system df
# 更新系统
echo "🔄 Updating Docker system..."
docker system prune -f
echo "✅ Maintenance tasks completed!"#最佳实践总结
#Dockerfile Best Practices
1. Use an official base image
FROM python:3.9-slim  # official slim image
2. Use multi-stage builds
FROM python:3.9-slim AS builder   # build-time dependencies
FROM python:3.9-slim AS runtime   # runtime dependencies
3. Run as a non-root user
RUN useradd -r -s /bin/false appuser
USER appuser
4. Optimize layer caching
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .  # copy the code only after dependencies are installed
#Deployment Best Practices
1. Manage configuration through environment variables
environment:
  - ENVIRONMENT=production
  - LOG_LEVEL=INFO
2. Define health checks
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD python -c "import requests; requests.get('http://localhost:8080/health')"
3. Set resource limits
deploy:
  resources:
    limits:
      cpus: '0.5'
      memory: 512M
4. Harden the runtime
security_opt:
  - no-new-privileges:true
read_only: true
#Monitoring Best Practices
1. Metrics collection
- CPU/memory utilization
- Network I/O
- Disk I/O
- Application-level performance metrics
2. Log management
- Structured logging
- Log rotation
- Centralized collection
3. Alerting
- Resource-usage alerts
- Service-availability alerts
- Performance-degradation alerts
💡 Key takeaway: Docker containerization is the standard way to deploy modern crawlers. With sound configuration, thorough monitoring, and disciplined operations, spiders can run efficiently and reliably.
🔗 Recommended related tutorials
- Scrapyd and ScrapydWeb - spider deployment and management
- Scraping Monitoring Dashboard - building a monitoring system
- Scrapy-Redis Distributed Architecture - distributed crawling

