My first encounter with Scrapy was in 2013, on an e-commerce price-monitoring project that needed to collect hundreds of thousands of product records from more than 200 websites every day. After trying all sorts of approaches, this open-source Python framework solved my pain points outright with its excellent performance and clean architecture. Scrapy is not just another request library but a complete crawling solution: it is built on the Twisted asynchronous networking framework, and a single machine can comfortably handle requests for millions of pages per day.
In today's data-driven world, web crawling has become basic infrastructure for data analysis, market research, and competitor monitoring. Compared with the traditional Requests + BeautifulSoup combination, Scrapy offers a more professional project structure, automatic request scheduling, built-in data pipelines, and solid error handling. Its modular design lets developers focus on core business logic instead of reinventing the wheel for low-level concerns such as cookie management and retry policies.
Scrapy follows the classic "engine-middleware-pipeline" architecture, with components communicating through well-defined interfaces. The Engine is the central nervous system, controlling how data flows between components; the Scheduler manages the queue of URLs waiting to be crawled; the Downloader performs the actual network requests; Spiders contain the core parsing logic; and Item Pipelines handle data persistence. Because of this design, every component can be extended or replaced independently.
Tip: understanding how data flows between components is essential for debugging. When requests seem to stall, switch the log level to DEBUG and watch what each component is doing.
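For example, a minimal way to turn on verbose output (the spider name in the comment is just a placeholder):

```python
# settings.py
LOG_LEVEL = 'DEBUG'
# or per run, without touching settings:  scrapy crawl <spider_name> -L DEBUG
```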
The Twisted-based asynchronous I/O model is the key to Scrapy's performance. Unlike synchronous requests that block while waiting, Scrapy can manage hundreds of network connections within a single thread: while one request is waiting for its response, CPU time goes to other pending tasks. In my tests on identical hardware, Scrapy's throughput reached roughly 5-8x that of synchronous requests.
```python
# Typical synchronous requests vs. Scrapy's asynchronous model

# Synchronous: each call blocks until its response arrives
import requests
for url in urls:
    response = requests.get(url)  # blocks until the response is received

# Scrapy (asynchronous): parsing happens in callbacks while other requests are in flight
def parse(self, response):  # callback method on a Spider
    item = {}
    # parsing logic...
    yield item
```
Middleware is Scrapy's most powerful extension mechanism. Downloader middlewares can modify requests and responses (for example adding proxies or rewriting headers), while spider middlewares process crawl results (for example filtering duplicates). The built-in middlewares already cover common scenarios such as retries, redirects, cookies, and compression.
A Python 3.7+ environment is recommended, with dependencies isolated in a virtual environment:
```bash
python -m venv scrapy_env
source scrapy_env/bin/activate  # Linux/Mac
pip install scrapy
```
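A new project skeleton is generated with the startproject command (the name myproject matches the layout shown below):

```bash
scrapy startproject myproject
cd myproject
```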
The generated project has the following standard directory layout:
```
myproject/
├── scrapy.cfg          # deployment configuration file
└── myproject/          # the project's Python module
    ├── __init__.py
    ├── items.py        # item (data model) definitions
    ├── middlewares.py  # custom middlewares
    ├── pipelines.py    # item processing pipelines
    ├── settings.py     # project-wide settings
    └── spiders/        # directory for spider code
        └── __init__.py
```
Taking the Douban Movie Top 250 as an example, first define the data model:
```python
# items.py
import scrapy

class DoubanMovieItem(scrapy.Item):
    title = scrapy.Field()       # movie title
    rating = scrapy.Field()      # rating
    quote = scrapy.Field()       # one-line quote
    detail_url = scrapy.Field()  # link to the detail page
```
Then create the core spider class:
```python
# spiders/douban_spider.py
import scrapy
from myproject.items import DoubanMovieItem

class DoubanSpider(scrapy.Spider):
    name = "douban_movie"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        for movie in response.css('div.item'):
            item = DoubanMovieItem()
            item['title'] = movie.css('span.title::text').get()
            item['rating'] = movie.css('span.rating_num::text').get()
            item['quote'] = movie.css('span.inq::text').get()
            item['detail_url'] = movie.css('div.hd > a::attr(href)').get()
            yield item

        # Pagination: follow the "next" link until it disappears
        next_page = response.css('span.next > a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
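The spider can then be run from the project root; the -o flag uses Scrapy's built-in feed exports to write items straight to a file:

```bash
scrapy crawl douban_movie -o movies.json
```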
The key configuration parameters live in settings.py:
```python
# Concurrency and politeness
CONCURRENT_REQUESTS = 16   # global number of concurrent requests
CONCURRENT_ITEMS = 100     # items processed in parallel in the pipelines (per response)
DOWNLOAD_DELAY = 0.5       # delay between requests, in seconds

# Middleware activation
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the built-in UA middleware
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
}

# Item pipelines
ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 300,
    'myproject.pipelines.MongoDBPipeline': 800,
}
```
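The RandomUserAgentMiddleware enabled above is not a built-in Scrapy class; a minimal sketch of what it might look like (the class body and the two user-agent strings are illustrative):

```python
# middlewares.py -- sketch of the RandomUserAgentMiddleware referenced in settings.py
import random

class RandomUserAgentMiddleware:
    # A short illustrative pool; in practice load a larger list from a file or package
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
```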
For JavaScript-rendered pages there are several common approaches. The first is scrapy-splash, which hands rendering off to a Splash service and is enabled through its middlewares:
```python
# settings.py
SPLASH_URL = 'http://localhost:8050'  # address of the Splash rendering service

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
}
```
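With those settings in place, spiders issue SplashRequest objects instead of plain Requests so pages are rendered before parsing. A minimal sketch (the spider name and target URL are placeholders):

```python
# spiders/js_spider.py -- assumes a Splash instance is running at SPLASH_URL
import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = "js_demo"  # hypothetical spider name, for illustration only

    def start_requests(self):
        # Render the page in Splash and wait briefly for JavaScript to finish
        yield SplashRequest(
            "https://example.com",
            callback=self.parse,
            args={"wait": 1},
        )

    def parse(self, response):
        # response.text now contains the rendered HTML
        yield {"title": response.css("title::text").get()}
```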
Another option is a Selenium-based downloader middleware that drives a real browser:

```python
# middlewares.py
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware:
    def process_request(self, request, spider):
        # Only render requests that explicitly opt in via request.meta
        if request.meta.get('selenium'):
            driver = webdriver.Chrome()  # starting a browser per request is costly; reuse one in production
            driver.get(request.url)
            body = driver.page_source
            driver.quit()
            # Returning a Response here short-circuits the normal download
            return HtmlResponse(url=request.url, body=body, encoding='utf-8', request=request)
```
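The middleware only intercepts requests that opt in through request.meta, and like any custom middleware it has to be enabled in settings. A sketch of both sides (the priority value 543 is arbitrary):

```python
# settings.py -- enable the custom middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumMiddleware': 543,
}

# In the spider: mark individual requests for Selenium rendering
yield scrapy.Request(url, callback=self.parse, meta={'selenium': True})
```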
Two mainstream ways to break through single-machine performance limits are scrapy-redis and Scrapy Cluster. scrapy-redis swaps in a Redis-backed scheduler and duplicate filter so multiple workers can share one request queue:
```python
# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://:password@localhost:6379/0'
```
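On the spider side, scrapy-redis provides RedisSpider, which reads its start URLs from a Redis list instead of start_urls, so any number of workers can share the same queue. A minimal sketch (the redis_key name is an assumption):

```python
# spiders/distributed_spider.py
from scrapy_redis.spiders import RedisSpider

class DistributedDoubanSpider(RedisSpider):
    name = "douban_distributed"
    # Workers pop URLs pushed to this Redis list, e.g.:
    #   redis-cli lpush douban:start_urls https://movie.douban.com/top250
    redis_key = "douban:start_urls"

    def parse(self, response):
        for movie in response.css("div.item"):
            yield {
                "title": movie.css("span.title::text").get(),
                "rating": movie.css("span.rating_num::text").get(),
            }
```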
Scrapy Cluster goes further and coordinates workers through Redis and Kafka:

```python
# settings.py
SCHEDULER = 'scrapy_cluster.scheduler.DistributedScheduler'
REDIS_HOST = 'redis-service'
KAFKA_HOSTS = 'kafka-service:9092'
```
Some anti-bot countermeasures I have found effective in practice: diversify request fingerprints (for example by rotating User-Agent headers with the middleware shown earlier), and pick an IP proxy scheme suited to the target site's blocking behaviour. Attaching a proxy to each request is itself just another downloader middleware:
```python
# middlewares.py
from w3lib.http import basic_auth_header

class ProxyMiddleware:
    def process_request(self, request, spider):
        # proxy address and credentials are placeholders
        request.meta['proxy'] = "http://user:pass@proxy_ip:port"
        request.headers['Proxy-Authorization'] = basic_auth_header('user', 'pass')
```
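A single hard-coded proxy rarely survives long; a common variant rotates through a pool on every request. A sketch assuming a custom PROXY_POOL list in settings.py (the setting name is hypothetical):

```python
# middlewares.py -- rotate proxies from a pool; PROXY_POOL is an assumed custom setting
import random

class RotatingProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # e.g. PROXY_POOL = ["http://user:pass@ip1:port", "http://user:pass@ip2:port"]
        return cls(crawler.settings.getlist("PROXY_POOL"))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
```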
When a target escalates to CAPTCHAs despite these measures, a dedicated handling workflow is needed, whether that means slowing the crawl down, solving challenges manually, or using a third-party solving service.
Data cleaning and validation belong in item pipelines. The CleanPipeline below normalizes the rating and strips whitespace from string fields:

```python
# pipelines.py
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class CleanPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Normalize the rating field; drop items without one
        if adapter.get('rating'):
            adapter['rating'] = float(adapter['rating'])
        else:
            raise DropItem("Missing rating in %s" % item)

        # Strip surrounding whitespace from all string fields
        for field in adapter.field_names():
            if isinstance(adapter.get(field), str):
                adapter[field] = adapter[field].strip()
        return item
```
MongoDB storage example:
```python
# pipelines.py
import pymongo
from itemadapter import ItemAdapter

class MongoDBPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Connection details come from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One collection per item class, e.g. DoubanMovieItem
        collection_name = item.__class__.__name__
        self.db[collection_name].insert_one(ItemAdapter(item).asdict())
        return item
```
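The from_crawler hook above pulls its connection details from two settings, so settings.py needs something like the following (the values are placeholders):

```python
# settings.py -- connection settings read by MongoDBPipeline.from_crawler
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "scrapy_data"
```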
MySQL storage example:
```python
# pipelines.py
import pymysql

class MySQLPipeline:
    def __init__(self, host, database, user, password):
        self.conn = pymysql.connect(
            host=host,
            user=user,
            password=password,
            database=database,
            charset='utf8mb4'
        )
        self.cursor = self.conn.cursor()

    @classmethod
    def from_crawler(cls, crawler):
        # Setting names below are illustrative; define them in settings.py
        settings = crawler.settings
        return cls(
            host=settings.get('MYSQL_HOST', 'localhost'),
            database=settings.get('MYSQL_DATABASE'),
            user=settings.get('MYSQL_USER'),
            password=settings.get('MYSQL_PASSWORD'),
        )

    def process_item(self, item, spider):
        sql = """INSERT INTO movies(title, rating, quote)
                 VALUES (%s, %s, %s)"""
        self.cursor.execute(sql, (
            item['title'],
            item['rating'],
            item['quote']
        ))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```
In production a layered logging setup is recommended:
```python
# settings.py
LOG_LEVEL = 'INFO'
LOG_FILE = 'logs/scrapy.log'
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'

# Custom log handling (e.g. in the script that launches the crawl)
from scrapy.utils.log import configure_logging
import logging

configure_logging(install_root_handler=False)
logging.basicConfig(
    filename='logs/scrapy.log',
    format=LOG_FORMAT,
    datefmt=LOG_DATEFORMAT,
    level=logging.INFO
)
```
Integrating Prometheus monitoring metrics:
```python
# extensions.py
from prometheus_client import start_http_server, Counter
from scrapy import signals

class PrometheusExtension:
    def __init__(self):
        self.items_scraped = Counter(
            'scrapy_items_scraped_total',
            'Total items scraped'
        )

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        # Expose a metrics endpoint for Prometheus to scrape
        start_http_server(8000)
        return ext

    def item_scraped(self, item, spider):
        self.items_scraped.inc()
```
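As with middlewares, the extension only runs once it is registered in settings; Prometheus can then scrape the endpoint started on port 8000. A registration sketch (the module path and priority are assumptions):

```python
# settings.py -- enable the custom extension
EXTENSIONS = {
    'myproject.extensions.PrometheusExtension': 500,
}
```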
Best practices for HTTP error handling:
```python
import scrapy
from twisted.internet.error import TimeoutError
from scrapy.spidermiddlewares.httperror import HttpError

class MySpider(scrapy.Spider):
    name = "error_demo"

    def start_requests(self):
        urls = [...]  # initial URL list
        for url in urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                errback=self.errback_httpbin,
                dont_filter=True
            )

    def errback_httpbin(self, failure):
        # Log every failed request
        self.logger.error(repr(failure))
        # Handle different failure types separately
        if failure.check(TimeoutError):
            request = failure.request
            self.logger.warning(f"Timeout on {request.url}")
        elif failure.check(HttpError):
            response = failure.value.response
            self.logger.error(f"HTTPError on {response.url}")
```
For deployment, Scrapyd provides a small HTTP service for running spiders on a server. Install and start it:

```bash
pip install scrapyd
scrapyd  # start the service (listens on port 6800 by default)
```

Deploy the project to it (the scrapyd-deploy command comes from the scrapyd-client package; "default" is the deploy target defined in scrapy.cfg):

```bash
scrapyd-deploy default -p myproject
```

Then schedule runs through the JSON API:

```bash
curl http://localhost:6800/schedule.json -d project=myproject -d spider=douban_movie
```
For scheduled crawls there are two common approaches. Option 1: Crontab + the Scrapy CLI
```bash
# Run every day at 2:00 AM
0 2 * * * cd /path/to/project && scrapy crawl douban_movie -o output_$(date +\%Y\%m\%d).json
```
Option 2: Airflow integration
```python
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

dag = DAG(
    'scrapy_douban',
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),  # a start_date is required; the value here is illustrative
)

run_spider = BashOperator(
    task_id='run_spider',
    bash_command='cd /path/to/project && scrapy crawl douban_movie',
    dag=dag
)
```
Scrapy obeys robots.txt by default; this can be turned off in settings:
```python
# settings.py
ROBOTSTXT_OBEY = False  # changing this is not recommended
```
The AutoThrottle extension adjusts the crawl rate automatically:
```python
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60.0          # maximum delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrency to aim for per remote site
```
In short: respect robots.txt where possible, keep request rates polite, and stay within the target site's terms of service.
Over years of using Scrapy, the thing I have seen neglected most often is sensible middleware configuration. Many developers cram all their logic into the Spider, which leads to bloated, hard-to-maintain code. Instead, follow the single-responsibility principle: downloader middlewares focus on request handling, spider middlewares on response filtering, and pipelines on data cleaning and storage. This structure not only keeps the code clear, it also lets you adapt to different crawling scenarios quickly by recombining middlewares.