While planning a trip recently, I found Ctrip's listings of seasonal trending independent-travel destinations genuinely useful, but collecting that information by hand is far too slow. As a Python developer who works with data every day, I decided to write a scraper to automate the process. Instead of the traditional requests + BeautifulSoup approach, this time I chose Playwright as the core tool, because it copes well with modern sites' dynamic loading and anti-bot mechanisms.
This project is a particularly good fit for the following scenarios:

A traditional scraper typically runs into three major obstacles on a large travel platform like Ctrip:

Playwright's advantages here are:

The project is split into four core modules:
```python
# Pseudocode sketch of the core flow
import asyncio

import pandas as pd
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        # Request layer
        await page.goto('https://vacations.ctrip.com/')
        await page.wait_for_selector('.destination-item')
        # Parsing layer
        destinations = await page.evaluate('''() => {
            return [...document.querySelectorAll('.destination-item')].map(el => ({
                name: el.querySelector('.title').innerText,
                heat: el.querySelector('.heat-index').innerText,
                reason: el.querySelector('.recommend-reason').innerText
            }))
        }''')
        # Storage layer
        pd.DataFrame(destinations).to_csv('destinations.csv', index=False)
        await browser.close()

asyncio.run(main())
```
Python 3.8+ is recommended, along with a dedicated virtual environment:
```bash
python -m venv ctrip_env
source ctrip_env/bin/activate   # Linux/macOS
ctrip_env\Scripts\activate      # Windows
```
Besides the main Playwright package, we add a couple of practical helpers:
```bash
pip install playwright pandas
playwright install   # download the browser binaries
```
Note: by default, Playwright downloads all three browser engines: Chromium, Firefox, and WebKit. If you only need Chromium, run `playwright install chromium` to save disk space.
- the Playwright Inspector (enable it with the `PWDEBUG=1` environment variable)
- the `page.on('request')` and `page.on('response')` events for watching network traffic

Ctrip's pages load content lazily, so a few points deserve extra care:
Prefer `wait_for_selector` plus custom wait conditions:

```python
await page.wait_for_function('''() => {
    const items = document.querySelectorAll('.destination-item');
    return items.length > 10 && items[0].querySelector('.heat-index');
}''')
```
```python
async def auto_scroll(page):
    # Scroll to the bottom in small steps so lazy-loaded items render
    await page.evaluate('''async () => {
        await new Promise((resolve) => {
            let totalHeight = 0;
            const distance = 100;
            const timer = setInterval(() => {
                const scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;
                if (totalHeight >= scrollHeight) {
                    clearInterval(timer);
                    resolve();
                }
            }, 100);
        });
    }''')
```
```python
import asyncio

async def retry_request(page, url, max_retries=3):
    # Retry with exponential backoff: 1s, 2s, 4s, ...
    for attempt in range(max_retries):
        try:
            response = await page.goto(url, timeout=15000)
            if response and response.ok:
                return response
        except Exception:
            if attempt == max_retries - 1:
                raise
        await asyncio.sleep(2 ** attempt)
```
Ctrip's DOM structure has a few quirks worth noting:
```python
# Not recommended: exact class names change frequently
await page.locator('.recommend-item').all()

# Recommended: match on a substring of the class attribute
await page.locator('[class*="recommend-item"]').all()
```
```python
heat_level = await item.get_attribute('data-heat')
```
```python
import re

def clean_html(raw):
    # Strip any HTML tags and surrounding whitespace
    return re.sub(r'<[^>]+>', '', raw).strip()
```
Beyond basic CSV output, we also implemented:
```python
import hashlib

def get_content_hash(item):
    # Fingerprint a record by its name + recommendation text
    return hashlib.md5(f"{item['name']}{item['reason']}".encode()).hexdigest()
```
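As a quick sketch of how such a fingerprint can drive deduplication across runs (the `dedup_by_hash` helper is my own illustration, not part of the original code; the hash function is repeated so the snippet runs on its own):

```python
import hashlib

def get_content_hash(item):
    # Same fingerprint as above, repeated for self-containment
    return hashlib.md5(f"{item['name']}{item['reason']}".encode()).hexdigest()

def dedup_by_hash(items):
    # Keep only the first record seen for each content hash
    seen, unique = set(), []
    for item in items:
        h = get_content_hash(item)
        if h not in seen:
            seen.add(h)
            unique.append(item)
    return unique

records = [
    {'name': 'Sanya', 'reason': 'warm in winter'},
    {'name': 'Sanya', 'reason': 'warm in winter'},  # exact duplicate
    {'name': 'Lijiang', 'reason': 'old town views'},
]
print(len(dedup_by_hash(records)))  # → 2
```

Persisting the set of seen hashes between runs would turn this into a simple incremental crawler.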
```python
df.to_excel('destinations.xlsx', index=False)
df.to_json('destinations.json', orient='records', force_ascii=False)
```
```python
df.drop_duplicates(subset=['name', 'heat'], keep='last', inplace=True)
```
In my testing, Ctrip relies mainly on the following defenses:
- header checks on the `User-Agent`, `Referer`, and related fields

One countermeasure is to mimic human input:

```python
import random

# Move the mouse along random paths to look less robotic
async def random_mouse_move(page):
    for _ in range(5):
        await page.mouse.move(
            random.randint(0, 1000),
            random.randint(0, 600),
            steps=random.randint(5, 20)
        )
        await page.wait_for_timeout(random.randint(200, 800))
```
```python
await page.set_extra_http_headers({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Referer': 'https://vacations.ctrip.com/',
    'Accept-Language': 'zh-CN,zh;q=0.9'
})
```
```python
import random
from datetime import datetime

async def smart_delay(page):
    base = random.uniform(1.0, 3.0)
    # Stretch the delay during daytime hours, when traffic is heavier
    if 9 <= datetime.now().hour < 22:
        base *= 1.5
    await page.wait_for_timeout(int(base * 1000))
```
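The time-of-day logic is easier to sanity-check if the delay computation is factored into a pure function (a sketch; `compute_delay_ms` and its parameters are my names, not from the original):

```python
import random

def compute_delay_ms(hour, low=1.0, high=3.0, peak_factor=1.5):
    # Base delay in seconds, stretched during busy daytime hours (9:00-22:00)
    base = random.uniform(low, high)
    if 9 <= hour < 22:
        base *= peak_factor
    return int(base * 1000)

# The async wrapper then becomes a one-liner:
# await page.wait_for_timeout(compute_delay_ms(datetime.now().hour))
```

This keeps the randomized policy testable without spinning up a browser.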
```python
import asyncio

import pandas as pd
from playwright.async_api import async_playwright

class CtripSpider:
    def __init__(self):
        self.destinations = []

    async def fetch_page(self, page, url):
        # Page-fetching logic goes here
        pass

    async def parse_page(self, page):
        # Data-parsing logic goes here
        pass

    async def save_results(self):
        # Storage logic goes here
        pass

    async def run(self):
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            context = await browser.new_context()
            page = await context.new_page()
            await self.fetch_page(page, 'https://vacations.ctrip.com/')
            await self.parse_page(page)
            await self.save_results()
            await browser.close()

if __name__ == '__main__':
    spider = CtripSpider()
    asyncio.run(spider.run())
```
| Destination | Heat Index | Why It's Recommended | Reference Price |
|---|---|---|---|
| Sanya | 98 | Top pick for escaping winter cold; great value on five-star hotels | from ¥3200 |
| Lijiang | 95 | Gorgeous snow scenery over the old town; distinctive guesthouses | from ¥2800 |
| Xiamen | 92 | Artsy and laid-back; ideal for a short weekend getaway | from ¥1800 |
Symptom: `TimeoutError: Waiting for selector ".destination-item" failed`
Fix:

```python
await page.wait_for_selector('.destination-item', timeout=20000)
```
Symptom: only part of the destination list is captured
Troubleshooting steps:
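One quick diagnostic, sketched here (the `find_missing` helper and the record shape are my assumptions, not Ctrip's API), is to diff the scraped names against a few destinations you can visually confirm in the browser:

```python
def find_missing(expected_names, scraped_records):
    # Which visually confirmed destinations never made it into the results?
    scraped = {r['name'] for r in scraped_records}
    return [n for n in expected_names if n not in scraped]

scraped = [{'name': 'Sanya'}, {'name': 'Xiamen'}]
print(find_missing(['Sanya', 'Lijiang', 'Xiamen'], scraped))  # → ['Lijiang']
```

If names are missing, the usual culprits are insufficient scrolling (run `auto_scroll` again) or parsing too early (tighten the `wait_for_function` condition).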
Symptom: `playwright._impl._errors.Error: Failed to launch chromium`
Fix:

```bash
playwright install
```
Developers who need more advanced functionality could consider:
```python
# Example: a simple price-drop monitor
def check_price_drop(df, threshold=0.1):
    latest = df.iloc[-1]['price']
    previous = df.iloc[-2]['price']
    return (previous - latest) / previous > threshold
```
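A quick sanity check with a toy price series (values invented; the function is repeated so the snippet runs on its own):

```python
import pandas as pd

def check_price_drop(df, threshold=0.1):
    # True when the latest price fell more than `threshold` vs. the previous one
    latest = df.iloc[-1]['price']
    previous = df.iloc[-2]['price']
    return (previous - latest) / previous > threshold

prices = pd.DataFrame({'price': [3200, 2800]})
print(check_price_drop(prices))  # → True: a 12.5% drop beats the 10% threshold
```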
What surprised me most about this project was Playwright's stability: even against a commercial site as complex as Ctrip, it maintained a very high success rate. During development, lean on `page.screenshot()` to capture intermediate states; it helps enormously with debugging. Also plan to update your selectors and simulated user behavior regularly, since large sites tweak their UI all the time.
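The screenshot habit can be baked into a small wrapper (a sketch of my own, not a Playwright API; `page.screenshot` itself is real):

```python
async def with_screenshot_on_error(page, coro, path='debug.png'):
    # Run any awaitable step; on failure, save a screenshot before re-raising
    try:
        return await coro
    except Exception:
        await page.screenshot(path=path)
        raise

# Usage inside the spider:
# await with_screenshot_on_error(page, self.parse_page(page), 'parse_fail.png')
```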