In today's data-driven world, collecting web data efficiently has become a core requirement for many businesses. Traditional single-machine crawlers suffer from IP bans, performance bottlenecks, and poor fault tolerance. The distributed crawler cluster we built on Celery + Playwright achieves stable, efficient data collection through task distribution, concurrent execution, and automatic retries.
The core advantages of this system: tasks are distributed through a message queue, so any number of worker nodes can join the cluster; each worker drives a real browser via Playwright, so JavaScript-heavy pages render correctly; failed jobs are retried automatically; and the whole cluster can be monitored through Flower.
First, prepare a Python 3.8+ environment (recent Celery and Playwright releases no longer support 3.7). We recommend creating an isolated environment with venv:
python -m venv crawler_env
source crawler_env/bin/activate   # Linux/macOS
crawler_env\Scripts\activate      # Windows
Install the core dependencies:
pip install "celery[redis]" playwright
playwright install   # download the browser binaries
Note: Playwright downloads the Chromium, Firefox, and WebKit browsers, roughly 300 MB in total. If you only need a specific browser, install it alone with playwright install chromium.
Create celery_app.py as the Celery application entry point:
import os

from celery import Celery
from celery.schedules import crontab

app = Celery(
    "distributed_crawler",
    # Environment variables (set in docker-compose.yml below) override the local defaults
    broker=os.getenv("CELERY_BROKER_URL", "redis://localhost:6379/0"),       # message broker
    backend=os.getenv("CELERY_RESULT_BACKEND", "redis://localhost:6379/1"),  # result store
    include=["tasks"],                                                       # task modules to load
)

# Advanced configuration
app.conf.update(
    task_serializer="json",
    result_serializer="json",
    accept_content=["json"],
    timezone="Asia/Shanghai",
    enable_utc=False,
    task_acks_late=True,                      # make sure tasks are not lost
    worker_prefetch_multiplier=1,             # fair scheduling
    task_reject_on_worker_lost=True,
    broker_connection_retry_on_startup=True,
    task_routes={
        "tasks.critical_task": {"queue": "high_priority"},
        "tasks.*": {"queue": "default"},
    },
    beat_schedule={
        "periodic_crawl": {
            "task": "tasks.daily_crawl",
            "schedule": crontab(hour=2, minute=30),
            "args": (),
        },
    },
)
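The task_routes above only assign tasks to queues; a queue is drained only if some worker subscribes to it. As a minimal sketch (the file name run_priority_worker.py and the concurrency value are illustrative, not part of the original project), a dedicated worker for the high_priority queue can be started programmatically:

# run_priority_worker.py - start a worker that consumes only the high_priority queue
from celery_app import app

if __name__ == "__main__":
    app.worker_main(argv=[
        "worker",
        "--loglevel=info",
        "--queues=high_priority",   # subscribe to the priority queue only
        "--concurrency=2",          # illustrative value
    ])

The equivalent CLI form is celery -A celery_app worker -Q high_priority.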
Key configuration notes:
- task_acks_late=True: a task is acknowledged only after it actually finishes, so an unexpected interruption does not lose it
- worker_prefetch_multiplier=1: fair scheduling, preventing long tasks from starving short ones
- broker_connection_retry_on_startup=True: avoids start-order races in Docker environments

Because Celery tasks must be synchronous functions while the Playwright API is asynchronous, we need a small bridge:
# tasks.py
import asyncio

from celery_app import app
from playwright.async_api import async_playwright


@app.task(
    bind=True,
    autoretry_for=(Exception,),
    retry_backoff=3,
    retry_kwargs={"max_retries": 3},
)
def crawl_page(self, url):
    """Synchronous wrapper around the async crawl."""
    return asyncio.run(async_crawl(url))


async def async_crawl(url):
    """The actual crawling logic."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=["--disable-blink-features=AutomationControlled"],
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
            viewport={"width": 1920, "height": 1080},
        )
        try:
            page = await context.new_page()
            await page.goto(url, timeout=60000)
            # Basic anti-detection tweak
            await page.evaluate("() => { delete navigator.webdriver; }")
            # Wait for a key element
            await page.wait_for_selector("body", state="attached")
            # Extract data
            title = await page.title()
            content = await page.content()
            return {
                "url": url,
                "title": title,
                "content": content[:5000],  # truncate to keep the result small
                "success": True,
            }
        finally:
            await context.close()
            await browser.close()
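With the task defined, jobs can be submitted from any process that can reach the broker. A minimal dispatch sketch (the example.com URLs and the 300-second timeout are placeholders):

# dispatch.py - submit crawl jobs to the cluster
from celery import group

from tasks import crawl_page

urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
]

# Fire-and-forget for a single URL
crawl_page.delay(urls[0])

# Fan a batch out across the workers and wait for all results
job = group(crawl_page.s(u) for u in urls)
results = job.apply_async().get(timeout=300)
print(results)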
To get past the anti-bot mechanisms of modern websites, we implemented a stealth scheme:
# stealth.py
USE_PROXY_AUTH = False  # set to True when the upstream proxy requires credentials


async def create_stealth_context(browser):
    context = await browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
        viewport={"width": 1920, "height": 1080},
        locale="zh-CN",
        timezone_id="Asia/Shanghai",
        http_credentials={
            "username": "user",
            "password": "pass",
        } if USE_PROXY_AUTH else None,
    )
    # Inject stealth.js
    await context.add_init_script("""
        delete navigator.__proto__.webdriver;
        Object.defineProperty(navigator, 'plugins', {
            get: () => [1, 2, 3],
        });
        // more anti-detection code ...
    """)
    # Placeholder hook for randomized mouse-movement logic
    await context.expose_binding("_mouse_move", lambda source: None)
    return context
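To actually use it, swap the stealth context in for the plain browser.new_context() call in the crawl. A sketch (async_crawl_stealth is a hypothetical variant, not a function from the original tasks.py):

import asyncio

from playwright.async_api import async_playwright

from stealth import create_stealth_context


async def async_crawl_stealth(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await create_stealth_context(browser)
        try:
            page = await context.new_page()
            await page.goto(url, timeout=60000)
            return {"url": url, "title": await page.title()}
        finally:
            await context.close()
            await browser.close()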
For production we recommend deploying with Docker Compose:
# docker-compose.yml
version: '3.8'

services:
  redis:
    image: redis:6-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

  worker:
    build: .
    command: celery -A celery_app worker --loglevel=info --concurrency=4
    environment:
      - CELERY_BROKER_URL=redis://redis:6379/0
      - CELERY_RESULT_BACKEND=redis://redis:6379/1
    deploy:
      replicas: 3
    depends_on:
      - redis

  flower:
    image: mher/flower
    # The flower image does not contain our project code, so point it at the broker directly
    command: celery --broker=redis://redis:6379/0 flower --port=5555
    ports:
      - "5555:5555"
    depends_on:
      - redis

volumes:
  redis_data:
Start the stack with:
docker-compose up -d --scale worker=5   # start 5 worker nodes
Concurrency settings: as a rule of thumb, use concurrency = CPU cores × 1.5 for CPU-bound work and up to CPU cores × 3 for I/O-bound crawling, since the workers spend most of their time waiting on the network and the browser. A small helper for deriving the per-node value is sketched below.
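A minimal sketch of that helper, assuming the multipliers above (the file name concurrency.py is illustrative, not part of the original project):

# concurrency.py - derive a worker concurrency from the node's CPU count
import os


def suggested_concurrency(io_bound: bool = True) -> int:
    """Rule of thumb: cores x 3 for I/O-bound crawling, cores x 1.5 for CPU-bound work."""
    cores = os.cpu_count() or 1
    return int(cores * (3 if io_bound else 1.5))


if __name__ == "__main__":
    print(suggested_concurrency())

The value can be fed straight to the worker, e.g. celery -A celery_app worker --concurrency=$(python concurrency.py).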
Memory optimization:

# Periodically free memory from inside a task (Chromium-only, uses the DevTools protocol)
async def memory_cleanup(page):
    # Touch the heap stats so they are refreshed (harmless no-op elsewhere)
    await page.evaluate("""() => {
        if (window.performance && window.performance.memory) {
            window.performance.memory.jsHeapSizeLimit;
        }
    }""")
    # Force a garbage collection via a CDP session (public API, unlike page._client)
    cdp = await page.context.new_cdp_session(page)
    await cdp.send("HeapProfiler.collectGarbage")
Broker connection tuning:

app.conf.broker_pool_limit = 20          # size of the Redis connection pool
app.conf.broker_connection_timeout = 30  # seconds
Beyond the basic monitoring Flower provides, we recommend integrating Prometheus + Grafana:
pip install celery-prometheus-exporter
celery -A celery_app flower --port=5555 &
celery-prometheus-exporter --broker redis://localhost:6379/0 &   # exporter from the package above; check its docs for options
Problem 1: tasks are stuck and never execute
- Check that the broker is reachable: redis-cli ping
- Follow the worker logs: journalctl -u celery -f
- List what the workers are doing right now: celery -A celery_app inspect active
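The same checks can be scripted; a sketch using redis-py and the Celery inspect API (assumes redis-py is installed and the broker runs on localhost):

# health_check.py
import redis

from celery_app import app


def check():
    # 1. Is the broker reachable?
    r = redis.Redis(host="localhost", port=6379, db=0)
    print("redis ping:", r.ping())

    # 2. What are the workers doing right now?
    inspector = app.control.inspect()
    print("active tasks:", inspector.active())
    print("reserved tasks:", inspector.reserved())


if __name__ == "__main__":
    check()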
Problem 2: browser instances leak

import asyncio

import psutil

from celery_app import app
from tasks import async_crawl


@app.task(bind=True)
def crawl_task(self, url):
    try:
        return asyncio.run(async_crawl(url))
    except Exception as e:
        self.retry(exc=e, countdown=60)
    finally:
        # Force-kill leftover browser processes
        # (note: this also kills browsers belonging to other tasks on the same worker)
        for proc in psutil.process_iter():
            if 'chrome' in proc.name().lower():
                proc.kill()
Problem 3: the target site bans our IP

import random

PROXY_ROTATION = [
    "http://proxy1:port",
    "http://proxy2:port",
    # ...
]
PROXY_USER = "user"  # proxy credentials
PROXY_PASS = "pass"


async def get_rotating_proxy():
    proxy = random.choice(PROXY_ROTATION)
    return {
        "server": proxy,
        "username": PROXY_USER,
        "password": PROXY_PASS,
    }
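The returned dict matches Playwright's proxy parameter, so it can be passed straight to the launch call. A sketch (assumes it lives in the same module as get_rotating_proxy above):

import asyncio

from playwright.async_api import async_playwright


async def crawl_via_proxy(url):
    # get_rotating_proxy() is the helper defined above
    proxy = await get_rotating_proxy()
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy)
        try:
            page = await browser.new_page()
            await page.goto(url, timeout=60000)
            return await page.title()
        finally:
            await browser.close()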
The crawler cluster can be extended into a general-purpose headless-browser service:
import asyncio
import base64

from celery_app import app
from playwright.async_api import async_playwright


@app.task(bind=True, time_limit=120)
def render_page(self, url, actions=None):
    """Generic page-rendering service."""
    async def _render():
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            context = await browser.new_context()
            page = await context.new_page()
            try:
                await page.goto(url)
                # Run any custom actions
                if actions:
                    for action in actions:
                        if action["type"] == "click":
                            await page.click(action["selector"])
                        elif action["type"] == "scroll":
                            await page.evaluate(f"window.scrollBy(0, {action['y']})")
                # Return the rendered result; binary data is base64-encoded
                # so it survives the JSON result serializer
                return {
                    "html": await page.content(),
                    "screenshot": base64.b64encode(
                        await page.screenshot(type="jpeg", quality=80)
                    ).decode(),
                    "pdf": base64.b64encode(await page.pdf(format="A4")).decode(),
                }
            finally:
                await context.close()
                await browser.close()
    return asyncio.run(_render())
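Callers submit a URL plus an optional list of actions; a usage sketch (the selector, scroll offset, and timeout are placeholders, and render_page is assumed to be importable from the tasks module):

from tasks import render_page  # adjust to wherever render_page is defined

result = render_page.delay(
    "https://example.com",
    actions=[
        {"type": "click", "selector": "#load-more"},
        {"type": "scroll", "y": 2000},
    ],
).get(timeout=120)

html = result["html"]                   # rendered HTML
screenshot_b64 = result["screenshot"]   # base64-encoded JPEG
pdf_b64 = result["pdf"]                 # base64-encoded PDF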
Going further, scheduling can be made dynamic with a machine-learning model:
import asyncio
import time
from datetime import datetime

from sklearn.linear_model import SGDRegressor

from celery_app import app
from tasks import async_crawl


class TaskScheduler:
    def __init__(self):
        # SGDRegressor supports partial_fit for online updates
        # (RandomForestRegressor does not, so it was swapped out here)
        self.model = SGDRegressor()
        self.features = ["url_domain", "time_of_day", "worker_load"]
        self.target = "execution_time"

    async def predict_duration(self, task_properties):
        """Predict the execution time of a task."""
        return self.model.predict([task_properties])[0]

    def update_model(self, new_data):
        """Update the prediction model online."""
        # Note: categorical features such as url_domain must be encoded as numbers
        X = [[d[f] for f in self.features] for d in new_data]
        y = [d[self.target] for d in new_data]
        self.model.partial_fit(X, y)


scheduler = TaskScheduler()


# Record execution data inside the task
@app.task(bind=True)
def smart_crawl(self, url):
    start = time.time()
    try:
        result = asyncio.run(async_crawl(url))
        duration = time.time() - start
        # Report execution data; extract_domain and current_worker are
        # project-specific helpers left to the reader
        scheduler.update_model([{
            "url_domain": extract_domain(url),
            "time_of_day": datetime.now().hour,
            "worker_load": current_worker().stats["pool"]["max-concurrency"],
            "execution_time": duration,
        }])
        return result
    except Exception as e:
        self.retry(exc=e)
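One way to act on the prediction is to pick the queue at dispatch time. A rough sketch (the 10-second threshold and the domain-hashing feature encoding are arbitrary illustrations, and the model must have seen some data before predict works):

import asyncio
from datetime import datetime


def dispatch_with_prediction(url):
    # Features must be numeric; hashing the domain is a crude stand-in for real encoding
    features = [hash(extract_domain(url)) % 1000, datetime.now().hour, 4]
    predicted = asyncio.run(scheduler.predict_duration(features))
    queue = "high_priority" if predicted < 10 else "default"
    smart_crawl.apply_async(args=[url], queue=queue)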
This distributed crawler architecture has held up well in real projects, including a large e-commerce data-collection case. The key lessons: acknowledge tasks late so interrupted work is never lost, always close contexts and browsers (and clean up leftover processes) to avoid leaks, rotate proxies to stay ahead of IP bans, and keep the cluster observable with Flower and Prometheus.