In the data-driven era, web scraping has become a standard way to acquire information. Yet traditional crawlers often fall short against modern dynamic pages: content loaded by JavaScript, structured data hidden behind API endpoints, and elaborate anti-scraping defenses all stand between you and the data. The requests_html and json pairing covers exactly this spectrum, offering a complete solution from simple static pages to complex dynamic content.
requests_html layers HTML parsing and JavaScript execution on top of the classic requests library, while the json module is the Swiss Army knife for modern Web API responses. On an e-commerce price-monitoring project, this combination let me collect competitor data across every platform in about 200 lines of code, nearly an 8x efficiency gain over our previous approach. This article walks through how the pair handles real-world scraping scenarios, including tactics the official documentation never spells out.
The library is far more than requests with an HTML parser bolted on. Its core value shows in how little code a JavaScript-rendered fetch takes:
```python
from requests_html import HTMLSession

session = HTMLSession()

# Fetch, then render with the bundled Chromium engine
response = session.get('https://dynamic-site.com',
                       headers={'User-Agent': 'Mozilla/5.0'})
response.html.render(timeout=20)  # execute the page's JavaScript
```
The key advantages: the same session API as plain requests, built-in CSS-selector and XPath queries on the parsed document, and one-call JavaScript rendering via render().
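Continuing from the fetch above, the rendered DOM can be queried directly, with no separate parser needed (every call below is a documented requests_html API):

```python
# CSS selectors, XPath, and link extraction on the rendered DOM
title = response.html.find('title', first=True).text
first_heading = response.html.xpath('//h1', first=True)
print(title, first_heading.text if first_heading else None)
print(response.html.absolute_links)  # set of fully-qualified URLs on the page
```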
Practical tip: the first call to render() automatically downloads Chromium (about 130 MB), so it pays to pre-install it in your Docker environment. If you hit a TimeoutError, try increasing the sleep time or adding a retry mechanism.
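One way to handle that pre-install at image build time, sketched in Python on the assumption that pyppeteer's chromium_downloader module (the machinery requests_html uses internally) is available:

```python
# Download Chromium once at build time so the first render() never blocks
from pyppeteer import chromium_downloader

if not chromium_downloader.check_chromium():
    chromium_downloader.download_chromium()
print(f'Chromium ready at {chromium_downloader.chromium_executable()}')
```

Run this in a Dockerfile RUN step and the browser ships baked into the image.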
With modern sites pushing as much as 60% of their data through APIs, the json module earns its keep. Its advanced usage starts with fault-tolerant parsing:
```python
import json
from json import JSONDecodeError

# Fault-tolerant handling of not-quite-valid JSON
try:
    data = json.loads(response.text, strict=False)
except JSONDecodeError:
    # Crude fix for Python literals leaking into "JSON"; fine for quick
    # scripts, but too blunt for payloads whose values contain quotes
    fixed_text = (response.text
                  .replace("'", '"')
                  .replace('True', 'true')
                  .replace('False', 'false')
                  .replace('None', 'null'))
    data = json.loads(fixed_text)
```
Particularly useful techniques live in json.loads' parsing hooks, shown in the sketch below.
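parse_float can route prices through Decimal to dodge float rounding, and object_hook post-processes every decoded object (the payload here is illustrative):

```python
import json
from decimal import Decimal

raw = '{"Name": "Widget", "Price": 19.99}'

# parse_float: every fractional JSON number becomes a Decimal, not a float
data = json.loads(raw, parse_float=Decimal)
assert data['Price'] == Decimal('19.99')

# object_hook: runs on every decoded object; here it normalizes the keys
data = json.loads(raw, object_hook=lambda d: {k.lower(): v for k, v in d.items()})
assert 'price' in data
```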
Take an e-commerce site built with React or Vue as an example:
```python
# Render, wait for client-side code, and scroll to trigger lazy loading
r = session.get('https://shop.example.com/list')  # URL is illustrative
r.html.render(sleep=2, keep_page=True, scrolldown=3)

products = []
for item in r.html.find('div.product-card'):
    products.append({
        'name': item.find('h3', first=True).text,
        'price': float(item.find('.price', first=True).text.replace('¥', '')),
        # data-* attributes are exposed through .attrs
        'sku': item.attrs['data-sku'],
    })
```
Key parameters: sleep=2 pauses two seconds after the initial load so client-side rendering can finish; scrolldown=3 scrolls the page three times to trigger lazy loading; keep_page=True keeps the Chromium page alive on r.html.page for further interaction.
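This is also where the two libraries meet: the scraped records serialize straight to disk, with ensure_ascii=False keeping the Chinese product names human-readable:

```python
import json

# Persist the scraped list; indent is purely cosmetic
with open('products.json', 'w', encoding='utf-8') as f:
    json.dump(products, f, ensure_ascii=False, indent=2)
```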
Infinite scrolling on social media feeds needs special handling. The page kept alive by keep_page=True is an async pyppeteer Page, so its methods must be driven through the session's event loop:
```python
import time
import random

page = r.html.page   # pyppeteer Page, kept alive by keep_page=True
loop = session.loop  # the asyncio loop the session runs Chromium on

last_height = loop.run_until_complete(
    page.evaluate('document.body.scrollHeight'))
while True:
    loop.run_until_complete(page.keyboard.press('PageDown'))
    time.sleep(random.uniform(1.0, 2.5))  # randomized delay to avoid bans
    new_height = loop.run_until_complete(
        page.evaluate('document.body.scrollHeight'))
    if new_height == last_height:  # nothing new loaded: we hit the bottom
        break
    last_height = new_height
```
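Once the scroll loop bottoms out, the fully loaded DOM can be pulled back into a requests_html HTML object for the usual selector queries (the article.post selector is illustrative):

```python
from requests_html import HTML

# page.content() returns the live DOM's outer HTML as a string
full_source = session.loop.run_until_complete(page.content())
doc = HTML(html=full_source)
posts = doc.find('article.post')
```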
Many dynamic sites serve their real data over XHR endpoints, which can be called directly with browser-like headers:

```python
import random

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'X-Requested-With': 'XMLHttpRequest',  # marks the request as AJAX
    'Referer': 'https://target-site.com',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    # Dynamically generated cookie: a random session id only works against
    # sites that don't validate it server-side
    'Cookie': f'sessionid={random.randint(100000, 999999)}',
}
```
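With those headers in place, the endpoint can be queried straight from the existing session (the URL is illustrative):

```python
# Hit the XHR endpoint directly; response.json() wraps json.loads for us
api_url = 'https://target-site.com/api/products?page=1'
resp = session.get(api_url, headers=headers, timeout=10)
resp.raise_for_status()
payload = resp.json()
```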
Rotate through a proxy pool to spread requests across exit IPs:

```python
from itertools import cycle

proxy_pool = cycle([
    'http://user:pass@proxy1:port',
    'http://user:pass@proxy2:port',
])

proxy = next(proxy_pool)  # use the same exit for both schemes
response = session.get(url,
                       proxies={'http': proxy, 'https': proxy},
                       timeout=10)
```
Hash page content to avoid re-processing pages you have already seen:

```python
import hashlib
from pathlib import Path

def get_content_hash(content):
    # MD5 is plenty for deduplication (this is not a security context)
    return hashlib.md5(content.encode()).hexdigest()

cache_dir = Path('./cache')
cache_dir.mkdir(exist_ok=True)
hash_file = cache_dir / 'processed_hashes.txt'
processed = set(hash_file.read_text().splitlines()) if hash_file.exists() else set()
```
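A sketch of the dedup check inside the crawl loop (handle_new_content is a hypothetical stand-in for whatever your pipeline does next):

```python
h = get_content_hash(response.text)
if h not in processed:
    processed.add(h)
    with hash_file.open('a') as f:
        f.write(h + '\n')  # persist so the next run skips this page too
    handle_new_content(response)  # hypothetical downstream handler
```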
Normalize scraped records before they reach storage:

```python
import re

def clean_product(data):
    # Price normalization: strip currency symbols and thousands separators
    if isinstance(data['price'], str):
        data['price'] = float(re.sub(r'[^\d.]', '', data['price']))
    # Stock status normalization across languages, down to one boolean
    stock_text = data.get('stock', '')
    data['in_stock'] = any(x in stock_text.lower()
                           for x in ['有货', 'in stock', 'available'])
    return data
```
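A quick check of what the cleaner produces (input values are illustrative):

```python
raw = {'price': '¥1,299.00', 'stock': 'In Stock'}
print(clean_product(raw))
# {'price': 1299.0, 'stock': 'In Stock', 'in_stock': True}
```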
For bulk crawls, AsyncHTMLSession renders pages concurrently:

```python
from requests_html import AsyncHTMLSession

# One shared session: a session per request would launch Chromium each time
asession = AsyncHTMLSession()

async def fetch(url):
    r = await asession.get(url)
    await r.html.arender()  # the async counterpart of render()
    return r

# Batch execution; run() gathers the coroutines on the session's own loop
urls = ['https://site.com/page1', 'https://site.com/page2']
results = asession.run(*[lambda url=url: fetch(url) for url in urls])
# Note: results arrive in completion order, not input order
```
A few things to watch when a scraper runs for days:
```python
# Restart the session every so often to release Chromium memory
pages_rendered += 1
if pages_rendered % 50 == 0:   # threshold is illustrative
    session.close()            # shuts down the Chromium subprocess
    session = HTMLSession()

# Disable what you don't need at launch; browser_args is forwarded to
# pyppeteer's Chromium launcher. Leave JavaScript on, render() needs it.
session = HTMLSession(browser_args=[
    '--no-sandbox',
    '--disable-gpu',
    '--blink-settings=imagesEnabled=false',  # skip image downloads
])
```
Wrap flaky renders in exponential-backoff retries (here via the tenacity library):

```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3),
       wait=wait_exponential(multiplier=1, min=4, max=10))
def safe_render(url):
    try:
        r = session.get(url)
        r.html.render(timeout=20)
        return r
    except Exception as e:
        print(f"Retrying {url} due to {e}")
        raise  # re-raise so tenacity counts the attempt
```
When a captcha blocks the way, hand the image to a solving service:

```python
# Example using a third-party solving service (the endpoint is a placeholder)
def solve_captcha(image_url):
    from io import BytesIO

    import requests
    from PIL import Image

    resp = requests.get(image_url)
    img = Image.open(BytesIO(resp.content))
    img.save('captcha.png')  # normalize whatever format the site serves to PNG

    # Submit to the solving platform's API
    api_url = 'http://captcha-service.com/solve'
    with open('captcha.png', 'rb') as f:
        result = requests.post(api_url, files={'image': f}).json()
    return result['solution']
```
A recommended project layout:
```
/scraper
├── /core
│   ├── downloader.py   # request logic
│   ├── parser.py       # parsing logic
│   └── storage.py      # storage logic
├── /utils
│   ├── anti_ban.py     # anti-ban measures
│   └── logger.py       # logging setup
└── run.py              # main entry point
```
Manage per-site settings centrally in a config.yaml:
```yaml
targets:
  - name: "example-site"
    start_url: "https://example.com/api/v1"
    headers:
      User-Agent: "Mozilla/5.0"
    render_js: true
    pagination:
      type: "query_param"
      param: "page"
      start: 1
      step: 1
```
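Loading and iterating the config takes a few lines; this sketch assumes PyYAML is installed (pip install pyyaml):

```python
import yaml

with open('config.yaml', encoding='utf-8') as f:
    config = yaml.safe_load(f)

for target in config['targets']:
    print(target['name'], target['start_url'], target.get('render_js', False))
```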
Check robots.txt before crawling a path:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def check_robots_permission(url):
    parsed = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch("*", url)
```
Throttle request frequency with a small context manager:

```python
import time
from random import uniform

class RequestThrottler:
    def __init__(self, base_delay=1.0):
        self.base_delay = base_delay

    def __enter__(self):
        # Jittered delay: anywhere from half to 1.5x the base delay
        time.sleep(uniform(self.base_delay * 0.5, self.base_delay * 1.5))

    def __exit__(self, *args):
        pass

# Usage
with RequestThrottler(base_delay=2.0):
    response = session.get(url)
```
A typical application is hourly price monitoring with alerts (the helpers here are placeholders for your own implementations):

```python
def price_monitor():
    while True:
        current = get_current_price()  # placeholder: your scraping logic
        if current < threshold_price:
            send_alert_email(          # placeholder: your notifier
                subject="Price alert",
                content=f"Current price: {current}",
            )
        time.sleep(3600)  # check once an hour
```
Regression-test the API endpoints you depend on:

```python
def test_api_endpoint():
    test_data = {
        "user": "testuser",
        "action": "search",
        "query": "test",
    }
    # json= serializes the dict and sets the Content-Type header for us
    response = session.post(
        "https://api.example.com/v3",
        json=test_data,
    )
    assert response.status_code == 200
    assert "results" in response.json()
```
Across years of scraper development I've settled on one golden rule: spend 20% of your time fetching data and 80% handling exceptions and hardening the system. requests_html simplifies dynamic-content scraping, but every site is its own battlefield and demands adapted tactics. One recent find: injecting custom JavaScript through render()'s script parameter can slip past some front-end bot checks:
```python
# Mask the navigator.webdriver flag that headless Chromium exposes
js = """
Object.defineProperty(navigator, 'webdriver', {
    get: () => undefined
})
"""
response.html.render(script=js)
```