The first time I wrote an async crawler with aiohttp, I stared at the ServerDisconnectedError messages piling up on my screen and felt like I was losing my mind. The code logic looked fine, but the moment it ran, connections dropped and timeouts fired everywhere, as if the server were playing hide-and-seek with me. If you've been there too, don't worry: this is practically a rite of passage for every Python async-crawler developer.
When using aiohttp for high-concurrency crawling, three kinds of errors come up again and again:

- `ServerDisconnectedError`: the server actively closes the connection
- `ClientOSError`: a connection failure caused by a local network problem
- `TimeoutError`: the request times out without a response

On the surface these look like plain network glitches, but there are deeper causes behind them. Let's dissect the most common failure scenarios.
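Before we get into root causes, here is a minimal sketch of where each of these exceptions typically surfaces when you catch them separately; the URL handling and print statements are placeholders, not a recommended pattern:

```python
import asyncio
import aiohttp

async def fetch_once(session, url):
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return await resp.text()
    except aiohttp.ServerDisconnectedError:
        print("server closed the connection mid-request")
    except aiohttp.ClientOSError as e:
        print(f"low-level socket error on our side: {e}")
    except asyncio.TimeoutError:
        print("no complete response within the timeout window")
    return None
```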
Many beginners write code like this:
```python
import aiohttp

async def fetch(url):
    async with aiohttp.ClientSession() as session:  # a brand-new Session for every request
        async with session.get(url) as response:
            return await response.text()
```
This looks clean, but it hides a serious problem. Creating a new Session object for every request means:

- no connection reuse: keep-alive connections die with each Session, so every request pays the full TCP (and TLS) handshake again
- DNS lookups are repeated, because the per-Session DNS cache is thrown away each time
- connectors are constantly built and torn down, and that overhead balloons under high concurrency
The correct approach is to share one Session:
```python
import asyncio
import aiohttp

async def fetch(url, session):  # the session is passed in from outside
    async with session.get(url) as response:
        return await response.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:  # create just one Session
        tasks = [fetch(url, session) for url in urls]
        return await asyncio.gather(*tasks)
```
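A typical entry point for the shared-session version might look like this (the URLs are placeholders):

```python
import asyncio

if __name__ == "__main__":
    urls = ["https://example.com/page1", "https://example.com/page2"]
    pages = asyncio.run(main(urls))
    print(f"fetched {len(pages)} pages")
```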
Modern sites ship with mature anti-bot defenses, and sustained high-frequency requests will typically trigger rate limiting (HTTP 429), WAF or bot-detection rules, and temporary IP bans. Once a server decides your traffic looks abnormal, the cheapest defense is to drop the connection outright, and that is one of the main sources of ServerDisconnectedError.
Your own environment may also be imposing limits:
| Limit | Default | Impact |
|---|---|---|
| File descriptors | 1024 (Linux) | no more TCP connections can be opened |
| Memory | system-dependent | OOM under high concurrency |
| CPU threads | number of logical cores | coroutine-scheduling bottleneck |
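If the file-descriptor ceiling is what's biting you, you can inspect it and raise the soft limit (never above the hard limit) from inside the process. This sketch assumes Linux/macOS, since the `resource` module is not available on Windows:

```python
import resource

# current soft/hard limits on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# raise the soft limit before starting a high-concurrency crawl
target = 8192 if hard == resource.RLIM_INFINITY else min(hard, 8192)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```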
aiohttp's ClientSession actually has connection pooling built in, but it needs sensible configuration:
```python
import aiohttp
from aiohttp import TCPConnector

async def main():
    connector = TCPConnector(
        limit=100,                   # maximum number of connections overall
        limit_per_host=20,           # maximum connections to a single host
        enable_cleanup_closed=True,  # clean up closed connections automatically
        force_close=False            # keep connections alive between requests
    )
    async with aiohttp.ClientSession(connector=connector) as session:
        ...  # make requests with the configured session
```
Key parameters:

- `limit`: caps the total number of connections so you don't exhaust system resources
- `limit_per_host`: prevents hammering a single host with too many connections
- `ttl_dns_cache`: DNS cache lifetime; 300 seconds is a reasonable setting

Let's go straight to code and see how to implement smart delays:
```python
import asyncio
import random

import aiohttp


class SmartCrawler:
    def __init__(self):
        self.delay_range = (1, 3)  # base delay range in seconds
        self.error_count = 0

    async def request_with_backoff(self, url, session):
        try:
            async with session.get(url) as resp:
                if resp.status == 429:  # Too Many Requests
                    # exponential backoff plus random jitter
                    backoff = 2 ** self.error_count + random.random()
                    await asyncio.sleep(backoff)
                    self.error_count += 1
                    return await self.request_with_backoff(url, session)
                self.error_count = 0  # reset the error counter on success
                return await resp.json()
        except aiohttp.ClientError:
            # connection-level error: wait a random base delay, then try again
            await asyncio.sleep(random.uniform(*self.delay_range))
            return await self.request_with_backoff(url, session)
```
This approach gives you:

- exponential backoff with random jitter when the server answers 429
- a randomized base delay and retry on connection-level ClientError exceptions
- an error counter that resets as soon as a request succeeds
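A minimal way to drive it, reusing the shared-session pattern from earlier (the URL is a placeholder, and the endpoint is assumed to return JSON since `request_with_backoff` calls `resp.json()`):

```python
import asyncio
import aiohttp

async def main():
    crawler = SmartCrawler()
    async with aiohttp.ClientSession() as session:
        data = await crawler.request_with_backoff("https://example.com/api/items", session)
        print(data)

asyncio.run(main())
```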
A complete exception-handling strategy should cover all of these cases:
```python
import asyncio

import aiohttp


async def robust_fetch(url, session, retry=3):
    exceptions = (
        aiohttp.ClientError,
        asyncio.TimeoutError,
        ConnectionResetError,
    )
    for attempt in range(retry):
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=20)) as response:
                if response.status == 200:
                    return await response.text()
                await handle_http_error(response.status)
        except exceptions:
            if attempt == retry - 1:  # the last attempt also failed
                raise
            await asyncio.sleep(1 * (attempt + 1))  # linearly increasing delay
    raise ValueError(f"Failed after {retry} attempts")


async def handle_http_error(status):
    if status == 429:
        await asyncio.sleep(10)  # long pause to ride out rate limiting
    elif status == 403:
        raise RuntimeError("IP address may have been banned")
    # handle other status codes here...
```
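Combined with the shared-session pattern, `robust_fetch` drops straight into `asyncio.gather`. A small sketch (URLs are placeholders), where `return_exceptions=True` keeps one permanently failing URL from cancelling the whole batch:

```python
import asyncio
import aiohttp

async def crawl(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [robust_fetch(url, session) for url in urls]
        # collect results and exceptions side by side instead of failing fast
        return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(crawl(["https://example.com/a", "https://example.com/b"]))
```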
A professional crawler should also be able to monitor itself:
```python
import time


class CrawlerMonitor:
    def __init__(self):
        self.request_count = 0
        self.error_count = 0
        self.start_time = time.time()

    @property
    def error_rate(self):
        return self.error_count / max(1, self.request_count)

    def adjust_strategy(self):
        if self.error_rate > 0.2:
            # error rate too high: back off automatically
            return {"concurrency": "low", "delay": "high"}
        elif self.error_rate < 0.05:
            # things look healthy: we can afford to be more aggressive
            return {"concurrency": "high", "delay": "low"}
        return {"concurrency": "medium", "delay": "medium"}
```
Let's put all of these strategies together into a complete crawler framework:
```python
import asyncio
import random
from collections import deque

import aiohttp
from aiohttp import TCPConnector


class AsyncCrawler:
    def __init__(self, urls, concurrency=100):
        self.urls = deque(urls)
        self.concurrency = concurrency
        self.semaphore = asyncio.Semaphore(concurrency)
        self.results = []
        self.stats = {
            'success': 0,
            'errors': 0,
            'retries': 0
        }

    async def fetch(self, url, session):
        async with self.semaphore:  # throttle concurrency
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=20)) as response:
                    if response.status == 200:
                        data = await response.text()
                        self.results.append(data)
                        self.stats['success'] += 1
                        return data
                    await self.handle_error(response.status)
            except (aiohttp.ClientError, asyncio.TimeoutError):
                self.stats['errors'] += 1
                if self.urls:  # put the URL back in the queue to retry later
                    self.urls.append(url)
                    self.stats['retries'] += 1

    async def handle_error(self, status):
        if status == 429:
            await asyncio.sleep(10 + random.random() * 5)
        elif status in (500, 502, 503, 504):
            await asyncio.sleep(3)

    async def worker(self, session):
        while self.urls:
            url = self.urls.popleft()
            await self.fetch(url, session)

    async def run(self):
        connector = TCPConnector(limit=self.concurrency, limit_per_host=10)
        async with aiohttp.ClientSession(connector=connector) as session:
            workers = [self.worker(session) for _ in range(self.concurrency)]
            await asyncio.gather(*workers)
        print(f"Crawl finished. success: {self.stats['success']}, "
              f"errors: {self.stats['errors']}, retries: {self.stats['retries']}")
```
This framework gives you:

- semaphore-based concurrency control on top of a tuned TCPConnector
- per-status handling of 429 and 5xx responses with built-in cooldown sleeps
- failed URLs pushed back onto the queue for retry, plus success/error/retry counters so you can see what actually happened
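Running it is then a one-liner (the URL list is a placeholder):

```python
import asyncio

urls = [f"https://example.com/item/{i}" for i in range(1000)]
asyncio.run(AsyncCrawler(urls, concurrency=50).run())
```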
Servers usually inspect these request headers:
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Referer': 'https://www.google.com/',
    'DNT': '1'
}

# when making the request
async with session.get(url, headers=headers) as response:
    ...
```
Key tricks: keep the header set consistent with what a real browser sends, rotate the User-Agent across requests, include a plausible Referer, and keep `Connection: keep-alive` so requests ride on existing connections.
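One common refinement is rotating the User-Agent per request; here is a small sketch building on the `headers` dict above (the extra UA strings are just examples):

```python
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

def build_headers():
    # start from the baseline headers and swap in a random User-Agent
    h = dict(headers)
    h['User-Agent'] = random.choice(USER_AGENTS)
    return h

# async with session.get(url, headers=build_headers()) as response: ...
```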
When you need to crawl at large scale, a single machine becomes the bottleneck. At that point, consider a distributed layout:
```
[task queue] → [crawler nodes] → [central storage]
      ↑              ↑
 [scheduler]   [node monitoring]
```
Implementation points:

- a shared task queue feeds URLs to several crawler nodes, so capacity scales horizontally
- a scheduler hands out work and rebalances it when nodes fall behind
- results flow into a single central store instead of sitting on individual machines
- node monitoring tracks error rates and throughput so unhealthy nodes can be throttled or restarted
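As a sketch of what one crawler node might look like, assuming Redis serves as the shared task queue (using redis-py's asyncio client; the queue and result key names are my own placeholders):

```python
import asyncio
import aiohttp
import redis.asyncio as redis

QUEUE_KEY = "crawler:urls"      # hypothetical shared queue
RESULT_KEY = "crawler:results"  # hypothetical central result store

async def node_worker():
    r = redis.Redis(host="localhost", port=6379, decode_responses=True)
    async with aiohttp.ClientSession() as session:
        while True:
            # block until the scheduler pushes a URL onto the shared queue
            item = await r.brpop(QUEUE_KEY, timeout=5)
            if item is None:
                break  # queue drained, let this node exit
            _, url = item
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=20)) as resp:
                    await r.hset(RESULT_KEY, url, await resp.text())
            except aiohttp.ClientError:
                await r.lpush(QUEUE_KEY, url)  # hand the URL back for another node

asyncio.run(node_worker())
```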
For sites with strict anti-bot measures, you may need something heavier, such as driving a headless browser:
```python
import asyncio
import random

from pyppeteer import launch


async def advanced_crawl(url):
    browser = await launch(headless=True)
    page = await browser.newPage()

    # make the environment look like a real user's browser
    await page.setViewport({'width': 1366, 'height': 768})
    await page.setUserAgent('Mozilla/5.0...')

    await page.goto(url)

    # scroll the page a few times at random intervals
    for _ in range(random.randint(2, 5)):
        await page.evaluate('window.scrollBy(0, 500)')
        await asyncio.sleep(random.uniform(0.5, 2))

    content = await page.content()
    await browser.close()
    return content
```
This approach is slower, but it gets past most anti-scraping mechanisms.
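Calling it from a synchronous entry point is then just (the URL is a placeholder):

```python
import asyncio

html = asyncio.run(advanced_crawl("https://example.com"))
```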