requests_html and json are a natural pairing for modern web scraping and data processing in Python. As a developer who has spent years on data collection work, I find this combination covers roughly 90% of routine web data extraction needs. requests_html inherits the clean API of requests while folding in the power of PyQuery and pyppeteer, and json is the standard tool for the web's de facto data format.
The core value of requests_html is true browser-level page rendering. Its biggest advance over the traditional requests library is a bundled Chromium engine (driven through pyppeteer), which means it can execute a page's JavaScript and hand you the fully rendered DOM instead of the bare initial HTML.
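As a quick illustration (a sketch only; the URL is a placeholder for any page that builds its content in JavaScript), the same selector often comes back empty before render() and populated after it:

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://example.com/spa')  # placeholder URL

before = len(r.html.find('.item'))  # often 0 on a JS-driven page
r.html.render()                     # execute the scripts in headless Chromium
after = len(r.html.find('.item'))   # reflects the rendered DOM

print(f'.item elements: {before} before render, {after} after')
```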
Typical scenarios include single-page applications, AJAX-loaded listings, and pages that only populate content as the user scrolls.
The json module looks modest next to it, but a scraping pipeline cannot do without it: extracted records need to be serialized for storage and exchange, and many sites return JSON payloads directly.
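In practice that boils down to two calls, dumps for persistence and loads for parsing; a minimal round trip:

```python
import json

record = {'title': '示例标题', 'price': 19.9}

# Serialize; ensure_ascii=False keeps CJK characters readable
text = json.dumps(record, ensure_ascii=False)

# Parse back, e.g. when a site returns JSON directly
assert json.loads(text) == record
```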
A Python 3.7+ environment is recommended. Install with:
```bash
pip install requests-html pyppeteer
```
Note: the first render() call automatically downloads a Chromium build, so it pays to route that download through a nearby mirror.
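pyppeteer reads the PYPPETEER_DOWNLOAD_HOST environment variable when it fetches Chromium, so one option is to point it at a mirror before the first render(); the URL below is a placeholder, not a known-good mirror:

```bash
export PYPPETEER_DOWNLOAD_HOST=https://your-chromium-mirror.example.com
```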
A complete end-to-end example:
```python
from requests_html import HTMLSession
import json

session = HTMLSession()
r = session.get('https://example.com')

# Execute the page's JavaScript in headless Chromium
r.html.render(sleep=2)

# Extract data from the rendered DOM
data = {
    'title': r.html.find('h1', first=True).text,
    'links': sorted(r.html.absolute_links)  # every link on the page, resolved to an absolute URL
}

# Persist as JSON
with open('output.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
```
```python
from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()

async def fetch(url):
    r = await asession.get(url)
    await r.html.arender()   # async counterpart of render()
    return r.html.raw_html

# Run several fetch() calls concurrently (see the sketch below)
```
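To fan those calls out, AsyncHTMLSession.run() schedules the given coroutine functions on the session's event loop and runs them concurrently; the URLs below are placeholders:

```python
results = asession.run(
    lambda: fetch('https://example.com/page1'),
    lambda: fetch('https://example.com/page2'),
)
```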
The render() call exposes several tuning knobs:

```python
r.html.render(
    sleep=1,        # seconds to pause after the initial render
    wait=2,         # seconds to wait before rendering starts
    timeout=10,     # give up if rendering takes longer than this
    scrolldown=3    # page down 3 times to trigger lazy loading
)
```
When writing output, disable ASCII escaping so Chinese text stays readable:

```python
json.dump(data, f, ensure_ascii=False)  # f is an open file handle; keeps non-ASCII intact
```
Types the json module cannot serialize natively are handled by subclassing JSONEncoder:

```python
from json import JSONEncoder
from datetime import datetime

class CustomEncoder(JSONEncoder):
    def default(self, obj):
        # Custom serialization logic, e.g. for datetime values
        if isinstance(obj, datetime):
            return obj.isoformat()
        return super().default(obj)
```
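It plugs in through the cls argument:

```python
import json

json.dumps({'fetched_at': datetime.now()}, cls=CustomEncoder)
```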
Rendered pages can be cached with diskcache so repeat visits skip both the network and the browser:

```python
from requests_html import HTMLSession
from diskcache import Cache

cache = Cache('tmp_cache')
session = HTMLSession()

@cache.memoize(expire=3600)  # keep entries for one hour
def get_page(url):
    r = session.get(url)
    r.html.render()
    return r.html.html  # cache the rendered HTML string (render() itself returns None)
```
Browser-like request headers reduce the odds of naive bot blocking:

```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive'
}
response = session.get(url, headers=headers)
```
Randomized delays make request timing look less mechanical:

```python
from random import uniform
from time import sleep

sleep(uniform(1, 3))  # wait a random 1-3 seconds between requests
```
Traffic can also be routed through a proxy pool:

```python
# Placeholder credentials, host, and port; substitute your own
proxies = {
    'http': 'http://user:pass@proxy:port',
    'https': 'https://user:pass@proxy:port'
}
response = session.get(url, proxies=proxies)
```
Extraction should degrade gracefully when a selector misses:

```python
try:
    element = r.html.find('.price', first=True)
    price = element.text if element else 'N/A'
except Exception as e:
    print(f"Extraction failed: {e}")
    price = None
```
A simple per-record completeness check:

```python
def validate_data(item):
    # A record is usable only if every required field is present
    required_fields = ['title', 'price', 'url']
    return all(field in item for field in required_fields)
```
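Applied to a batch (scraped_items stands in for whatever list the extraction step produced):

```python
clean = [item for item in scraped_items if validate_data(item)]
print(f'{len(clean)}/{len(scraped_items)} records passed validation')
```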
The full flow is: fetch the product page, render its JavaScript, extract price and stock status, and stamp each record with the capture time. Key code snippet:
```python
from datetime import datetime

def monitor_product(url):
    r = session.get(url)
    r.html.render()
    return {
        'price': r.html.find('.price-value', first=True).text,
        'stock': 'in stock' if r.html.find('.add-to-cart') else 'out of stock',
        'timestamp': datetime.now().isoformat()
    }
```
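A sketch of driving that snapshot on a schedule; the five-minute interval and the output file name are my own assumptions:

```python
import json
import time

def run_monitor(url, interval=300):
    # Append one timestamped snapshot per poll to a JSON Lines file
    while True:
        snapshot = monitor_product(url)
        with open('price_history.jl', 'a', encoding='utf-8') as f:
            f.write(json.dumps(snapshot, ensure_ascii=False) + '\n')
        time.sleep(interval)
```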
Technical highlights: each news source is a small config entry (URL, CSS selector, display name), so new sources can be added without touching the crawl logic. Data processing flow:
```python
articles = []
for source in news_sources:
    r = session.get(source['url'])
    r.html.render()
    items = r.html.find(source['selector'])
    for item in items:
        articles.append({
            'title': item.find('h3', first=True).text,
            'content': item.find('.summary', first=True).text,
            'source': source['name']
        })

# Store in JSON Lines format: one JSON object per line
with open('news.jl', 'a', encoding='utf-8') as f:
    for article in articles:
        f.write(json.dumps(article, ensure_ascii=False) + '\n')
```
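Reading the JSON Lines file back is one json.loads per line:

```python
import json

with open('news.jl', encoding='utf-8') as f:
    articles = [json.loads(line) for line in f if line.strip()]
```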
requests_html also mixes well with Scrapy's selectors:

```python
from scrapy import Selector
from requests_html import HTML

def parse(response):
    sel = Selector(text=response.text)
    # Let requests_html render the JavaScript in the fetched page
    html = HTML(html=response.text)
    html.render()
    # Mix the two parsers: Scrapy XPath on the raw HTML,
    # requests_html CSS selectors on the rendered DOM ('.price' is a placeholder)
    title = sel.xpath('//title/text()').get()
    prices = [el.text for el in html.find('.price')]
    return {'title': title, 'prices': prices}
```
A minimal smoke test keeps the crawler honest:

```python
import unittest
from requests_html import HTMLSession

TEST_URL = 'https://example.com'  # placeholder for the page under test

class TestCrawler(unittest.TestCase):
    def setUp(self):
        self.session = HTMLSession()

    def test_page_load(self):
        r = self.session.get(TEST_URL)
        self.assertIn('expected_text', r.html.text)  # 'expected_text' is a placeholder
```
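Run it with the standard runner (the file name is an assumption):

```bash
python -m unittest test_crawler.py
```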
Each crawl run can be scored on a few simple quality metrics:

```python
from datetime import datetime

def check_data_quality(data, expected_count, fetched_at):
    # data: list of scraped records; fetched_at: datetime when the crawl ran
    metrics = {
        'completeness': len(data) / expected_count,
        'validity': sum(1 for x in data if validate_data(x)) / len(data),
        'timeliness': (datetime.now() - fetched_at).total_seconds()
    }
    return metrics
```
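One way to act on the scores; the thresholds are illustrative, and crawl_started stands in for a datetime captured at the start of the run:

```python
metrics = check_data_quality(articles, expected_count=100, fetched_at=crawl_started)
if metrics['completeness'] < 0.9 or metrics['validity'] < 0.95:
    print(f'Data quality below threshold: {metrics}')
```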
In real projects I've found that a well-chosen wait value for render() noticeably raises scraping success rates. For AJAX-heavy sites, pair it with scrolldown to simulate user scrolling, and keep timeout bounded so one stuck page doesn't block the whole run. When storing JSON, the indent parameter does grow the files, but the gain in readability and debugging speed is well worth it.