While planning a trip recently, I found Ctrip's listings of seasonal trending independent-travel destinations genuinely useful, but collecting that information by hand is far too slow. As a Python developer who works with data every day, I decided to write a scraper to automate the process. Instead of the traditional requests + BeautifulSoup approach, this time I chose Playwright as the core tool, because it copes well with modern sites' dynamic loading and anti-bot mechanisms.
This project is a particularly good fit for the following scenarios:

A traditional scraper typically runs into three major obstacles on a large travel platform like Ctrip:

Playwright's advantages here are:

The project is split into four core modules:
```python
# Pseudocode sketch of the core flow
import asyncio

import pandas as pd
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        # Request layer
        await page.goto('https://vacations.ctrip.com/')
        await page.wait_for_selector('.destination-item')
        # Parsing layer
        destinations = await page.evaluate('''() => {
            return [...document.querySelectorAll('.destination-item')].map(el => ({
                name: el.querySelector('.title').innerText,
                heat: el.querySelector('.heat-index').innerText,
                reason: el.querySelector('.recommend-reason').innerText
            }))
        }''')
        # Storage layer
        pd.DataFrame(destinations).to_csv('destinations.csv', index=False)
        await browser.close()

asyncio.run(main())
```
Python 3.8+ is recommended, along with a dedicated virtual environment:
```bash
python -m venv ctrip_env
source ctrip_env/bin/activate   # Linux/macOS
ctrip_env\Scripts\activate      # Windows
```
Besides the main Playwright package, we add a couple of practical helpers:
```bash
pip install playwright pandas
playwright install   # download the browser binaries
```
Note: by default, Playwright downloads all three browser engines: Chromium, Firefox, and WebKit. If you only need Chromium, run `playwright install chromium` to save disk space.
- the Playwright Inspector (enable it with the `PWDEBUG=1` environment variable)
- the `page.on('request')` and `page.on('response')` events for watching network traffic

Ctrip's pages load content lazily, so a few points deserve extra care:
Prefer `wait_for_selector` plus custom wait conditions:

```python
await page.wait_for_function('''() => {
    const items = document.querySelectorAll('.destination-item');
    return items.length > 10 && items[0].querySelector('.heat-index');
}''')
```
```python
async def auto_scroll(page):
    # Scroll to the bottom in small steps so lazy-loaded items render
    await page.evaluate('''async () => {
        await new Promise((resolve) => {
            let totalHeight = 0;
            const distance = 100;
            const timer = setInterval(() => {
                const scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;
                if (totalHeight >= scrollHeight) {
                    clearInterval(timer);
                    resolve();
                }
            }, 100);
        });
    }''')
```
```python
import asyncio

async def retry_request(page, url, max_retries=3):
    # Retry with exponential backoff: 1s, 2s, 4s, ...
    for attempt in range(max_retries):
        try:
            response = await page.goto(url, timeout=15000)
            if response and response.ok:
                return response
        except Exception:
            if attempt == max_retries - 1:
                raise
        await asyncio.sleep(2 ** attempt)
```
Ctrip's DOM structure has a few quirks worth noting:
```python
# Not recommended: exact class names change frequently
await page.locator('.recommend-item').all()

# Recommended: match on a substring of the class attribute
await page.locator('[class*="recommend-item"]').all()
```
```python
heat_level = await item.get_attribute('data-heat')
```
```python
import re

def clean_html(raw):
    # Strip any HTML tags and surrounding whitespace
    return re.sub(r'<[^>]+>', '', raw).strip()
```
Beyond basic CSV output, we also implemented:
```python
import hashlib

def get_content_hash(item):
    # Fingerprint a record by its name + recommendation text
    return hashlib.md5(f"{item['name']}{item['reason']}".encode()).hexdigest()
```
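As a quick sketch of how such a fingerprint can drive deduplication across runs (the `dedup_by_hash` helper is my own illustration, not part of the original code; the hash function is repeated so the snippet runs on its own):

```python
import hashlib

def get_content_hash(item):
    # Same fingerprint as above, repeated for self-containment
    return hashlib.md5(f"{item['name']}{item['reason']}".encode()).hexdigest()

def dedup_by_hash(items):
    # Keep only the first record seen for each content hash
    seen, unique = set(), []
    for item in items:
        h = get_content_hash(item)
        if h not in seen:
            seen.add(h)
            unique.append(item)
    return unique

records = [
    {'name': 'Sanya', 'reason': 'warm in winter'},
    {'name': 'Sanya', 'reason': 'warm in winter'},  # exact duplicate
    {'name': 'Lijiang', 'reason': 'old town views'},
]
print(len(dedup_by_hash(records)))  # → 2
```

Persisting the set of seen hashes between runs would turn this into a simple incremental crawler.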
```python
df.to_excel('destinations.xlsx', index=False)
df.to_json('destinations.json', orient='records', force_ascii=False)
```
```python
df.drop_duplicates(subset=['name', 'heat'], keep='last', inplace=True)
```
In my testing, Ctrip relies mainly on the following defenses:
- header checks on the `User-Agent`, `Referer`, and related fields

One countermeasure is to mimic human input:

```python
import random

# Move the mouse along random paths to look less robotic
async def random_mouse_move(page):
    for _ in range(5):
        await page.mouse.move(
            random.randint(0, 1000),
            random.randint(0, 600),
            steps=random.randint(5, 20)
        )
        await page.wait_for_timeout(random.randint(200, 800))
```
```python
await page.set_extra_http_headers({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Referer': 'https://vacations.ctrip.com/',
    'Accept-Language': 'zh-CN,zh;q=0.9'
})
```
```python
import random
from datetime import datetime

async def smart_delay(page):
    base = random.uniform(1.0, 3.0)
    # Stretch the delay during daytime hours, when traffic is heavier
    if 9 <= datetime.now().hour < 22:
        base *= 1.5
    await page.wait_for_timeout(int(base * 1000))
```
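The time-of-day logic is easier to sanity-check if the delay computation is factored into a pure function (a sketch; `compute_delay_ms` and its parameters are my names, not from the original):

```python
import random

def compute_delay_ms(hour, low=1.0, high=3.0, peak_factor=1.5):
    # Base delay in seconds, stretched during busy daytime hours (9:00-22:00)
    base = random.uniform(low, high)
    if 9 <= hour < 22:
        base *= peak_factor
    return int(base * 1000)

# The async wrapper then becomes a one-liner:
# await page.wait_for_timeout(compute_delay_ms(datetime.now().hour))
```

This keeps the randomized policy testable without spinning up a browser.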
```python
import asyncio

import pandas as pd
from playwright.async_api import async_playwright

class CtripSpider:
    def __init__(self):
        self.destinations = []

    async def fetch_page(self, page, url):
        # Page-fetching logic goes here
        pass

    async def parse_page(self, page):
        # Data-parsing logic goes here
        pass

    async def save_results(self):
        # Storage logic goes here
        pass

    async def run(self):
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            context = await browser.new_context()
            page = await context.new_page()
            await self.fetch_page(page, 'https://vacations.ctrip.com/')
            await self.parse_page(page)
            await self.save_results()
            await browser.close()

if __name__ == '__main__':
    spider = CtripSpider()
    asyncio.run(spider.run())
```
| Destination | Heat Index | Why It's Recommended | Reference Price |
|---|---|---|---|
| Sanya | 98 | Top pick for escaping winter cold; great value on five-star hotels | from ¥3200 |
| Lijiang | 95 | Gorgeous snow scenery over the old town; distinctive guesthouses | from ¥2800 |
| Xiamen | 92 | Artsy and laid-back; ideal for a short weekend getaway | from ¥1800 |
Symptom: `TimeoutError: Waiting for selector ".destination-item" failed`
Fix:

```python
await page.wait_for_selector('.destination-item', timeout=20000)
```
Symptom: only part of the destination list is captured
Troubleshooting steps:
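One quick diagnostic, sketched here (the `find_missing` helper and the record shape are my assumptions, not Ctrip's API), is to diff the scraped names against a few destinations you can visually confirm in the browser:

```python
def find_missing(expected_names, scraped_records):
    # Which visually confirmed destinations never made it into the results?
    scraped = {r['name'] for r in scraped_records}
    return [n for n in expected_names if n not in scraped]

scraped = [{'name': 'Sanya'}, {'name': 'Xiamen'}]
print(find_missing(['Sanya', 'Lijiang', 'Xiamen'], scraped))  # → ['Lijiang']
```

If names are missing, the usual culprits are insufficient scrolling (run `auto_scroll` again) or parsing too early (tighten the `wait_for_function` condition).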
Symptom: `playwright._impl._errors.Error: Failed to launch chromium`
Fix:

```bash
playwright install
```
Developers who need more advanced functionality could consider:
```python
# Example: a simple price-drop monitor
def check_price_drop(df, threshold=0.1):
    latest = df.iloc[-1]['price']
    previous = df.iloc[-2]['price']
    return (previous - latest) / previous > threshold
```
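A quick sanity check with a toy price series (values invented; the function is repeated so the snippet runs on its own):

```python
import pandas as pd

def check_price_drop(df, threshold=0.1):
    # True when the latest price fell more than `threshold` vs. the previous one
    latest = df.iloc[-1]['price']
    previous = df.iloc[-2]['price']
    return (previous - latest) / previous > threshold

prices = pd.DataFrame({'price': [3200, 2800]})
print(check_price_drop(prices))  # → True: a 12.5% drop beats the 10% threshold
```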
What surprised me most about this project was Playwright's stability: even against a commercial site as complex as Ctrip, it maintained a very high success rate. During development, lean on `page.screenshot()` to capture intermediate states; it helps enormously with debugging. Also plan to update your selectors and simulated user behavior regularly, since large sites tweak their UI all the time.
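The screenshot habit can be baked into a small wrapper (a sketch of my own, not a Playwright API; `page.screenshot` itself is real):

```python
async def with_screenshot_on_error(page, coro, path='debug.png'):
    # Run any awaitable step; on failure, save a screenshot before re-raising
    try:
        return await coro
    except Exception:
        await page.screenshot(path=path)
        raise

# Usage inside the spider:
# await with_screenshot_on_error(page, self.parse_page(page), 'parse_fail.png')
```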