Hands-On Python Web Scraping: Collecting Douban Book Rating Data

徐小疼

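1. Project Overview

As a developer who has worked on data collection for a long time, I often need to pull structured data from websites for analysis. Douban Books is one of the most trusted book-rating platforms in China, and its ratings and reader feedback are valuable both for book market research and for building reading recommendation systems. In this post I share a field-tested Python scraping approach built specifically for collecting rating information from Douban Books.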

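The main strengths of this approach are:

Covers both static page parsing and dynamically loaded data
Built-in measures for coping with anti-scraping defenses; it has run stably in my tests
Produces structured output that can go straight into later analysis
Modular code that is easy to extend and maintain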

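Important: always follow Douban's robots.txt and throttle your requests. I recommend waiting at least 3 seconds between requests so that you do not put undue load on Douban's servers.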
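
One way to make the robots.txt rule enforceable in code is to check each URL against the parsed rules before requesting it. The sketch below uses the standard library's `urllib.robotparser`. The rules in `SAMPLE_RULES` are hypothetical and are not Douban's actual robots.txt; they only illustrate the mechanics.

```python
from urllib.robotparser import RobotFileParser


def build_robot_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules for illustration only; NOT Douban's real robots.txt.
SAMPLE_RULES = """\
User-agent: *
Disallow: /search
Crawl-delay: 5
"""

parser = build_robot_parser(SAMPLE_RULES)

# can_fetch() answers "may this user agent request this URL?"
allowed = parser.can_fetch("*", "https://book.douban.com/subject/4913064/")
blocked = parser.can_fetch("*", "https://book.douban.com/search?q=test")

# crawl_delay() returns the Crawl-delay value for the agent, or None if absent.
delay = parser.crawl_delay("*")
```

Against the live site you would instead call `parser.set_url("https://book.douban.com/robots.txt")` followed by `parser.read()`, and skip any URL for which `can_fetch` returns False.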

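2. Environment Setup and Tool Selection

2.1 Development Environment

I recommend the following setup, which is also my team's standard development environment: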

# Create a virtual environment
python -m venv douban_spider
source douban_spider/bin/activate  # Linux/Mac
douban_spider\Scripts\activate  # Windows

# Install the core dependencies
pip install requests beautifulsoup4 pandas openpyxl

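Why these libraries:

requests: a friendlier HTTP library than urllib, with support for persistent sessions
beautifulsoup4: an excellent HTML parser that works with several parser backends
pandas: the Swiss Army knife of data processing and analysis
openpyxl: a dependable choice for reading and writing Excel files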

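2.2 Development Tools

For day-to-day development I strongly recommend:

VS Code + the Python extension: lightweight but powerful
Jupyter Notebook: well suited to the data exploration phase
Postman: for debugging API requests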

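3. Analyzing the Douban Book Page Structure

3.1 Locating Elements on the Static Page

Taking the Douban page for 《活着》 (To Live) as an example (https://book.douban.com/subject/4913064/), the key data is laid out as follows: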

<!-- Title -->
<h1>
    <span property="v:itemreviewed">活着</span>
</h1>

<!-- Rating -->
<strong class="ll rating_num" property="v:average">9.4</strong>

<!-- Number of ratings -->
<span property="v:votes">824873人评价</span>

<!-- Author info (simplified; the parsing code in 4.1 expects the name inside an <a> element) -->
<div id="info">
    <span class="pl">作者:</span>
    余华
</div>

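3.2 Handling Dynamically Loaded Data

Some data (such as detailed reviews) is loaded dynamically via AJAX. In the Network panel of the browser's developer tools (F12) you can find API requests like this one:

GET https://book.douban.com/j/subject_abstract?subject_id=4913064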
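
A request like the one above can be reproduced from Python once you know the endpoint and its query parameter. The following is a minimal sketch that uses only the standard library so it has no third-party dependency; `build_abstract_url` and `fetch_json` are illustrative names, and the shape of the JSON response is not documented in this article, so the sketch returns the decoded object without assuming any fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint as observed in the Network panel example above.
ABSTRACT_ENDPOINT = "https://book.douban.com/j/subject_abstract"


def build_abstract_url(subject_id: str) -> str:
    """Build the AJAX URL for one book's subject id."""
    return f"{ABSTRACT_ENDPOINT}?{urlencode({'subject_id': subject_id})}"


def fetch_json(url: str, user_agent: str, timeout: float = 10.0):
    """GET a URL and decode the body as JSON (response schema not assumed)."""
    request = Request(
        url,
        headers={
            "User-Agent": user_agent,
            "Referer": "https://book.douban.com/",
        },
    )
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With requests, the equivalent call would be `requests.get(url, headers=headers, timeout=10).json()`. Inspect the returned object in a notebook before writing any field-extraction code against it.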

4. Core Scraper Implementation

4.1 Basic Crawl Function

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

def get_book_info(url):
    """Fetch one Douban book page and return its core fields as a dict (None on failure)."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    try:
        # Random delay of 3-5 seconds, matching the article's 3-second minimum
        time.sleep(random.uniform(3, 5))

        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract the core fields
        title = soup.find('span', property='v:itemreviewed').text.strip()
        rating = soup.find('strong', class_='rating_num').text.strip()
        # Strip the '人评价' ("people rated") suffix if it is part of the element text
        rating_count = soup.find('span', property='v:votes').text.replace('人评价', '').strip()

        # Extract the author (the info block's HTML is messy, so take its first link)
        info_div = soup.find('div', id='info')
        author_link = info_div.find('a') if info_div else None
        author = author_link.text.strip() if author_link else '未知'  # '未知' = "unknown"

        return {
            '书名': title,                  # title
            '评分': float(rating),          # rating
            '评价人数': int(rating_count),  # number of ratings
            '作者': author,                 # author
            '链接': url                     # link
        }
    except Exception as e:
        print(f"Error while crawling {url}: {e}")
        return None

4.2 Batch Crawling

def batch_crawl(book_urls, output_file='douban_books.xlsx'):
    results = []
    for url in book_urls:
        book_info = get_book_info(url)
        if book_info:
            results.append(book_info)
            print(f"Crawled: {book_info['书名']}")

    # Save to Excel
    df = pd.DataFrame(results)
    df.to_excel(output_file, index=False)
    print(f"Data saved to {output_file}")
    return df

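5. Anti-Scraping Defenses and Optimization

5.1 Douban's Anti-Scraping Mechanisms

In my experience, Douban mainly relies on the following anti-scraping measures:

User-Agent checks: you must send a plausible browser User-Agent
Rate limiting: a burst of requests in a short period gets you temporarily blocked
Cookie validation: some pages require a valid Cookie
IP restrictions: frequent access from a single IP gets that IP blocked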

5.2 Countermeasures

5.2.1 Request Header Tuning

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Referer': 'https://book.douban.com/',
    'Connection': 'keep-alive'
}

5.2.2 Proxy IP Pool

PROXY_POOL = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    # Add more proxies...
]

def get_with_proxy(url):
    proxy = random.choice(PROXY_POOL)
    try:
        # Douban pages are served over HTTPS, so register the proxy for both schemes
        response = requests.get(
            url,
            headers=headers,
            proxies={'http': proxy, 'https': proxy},
            timeout=10,
        )
        return response
    except requests.RequestException:
        return None
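
Picking a proxy at random works until one of them goes dead, at which point a share of requests starts failing silently. As a sketch of one possible refinement, the hypothetical `ProxyPool` class below counts consecutive failures per proxy and drops a proxy once it reaches a threshold. The class name, the method names, and the default of three failures are all assumptions made for illustration, not part of the original scraper.

```python
import random


class ProxyPool:
    """Hand out random proxies and drop the ones that keep failing."""

    def __init__(self, proxies, max_failures=3):
        # Consecutive failure count per live proxy.
        self._failures = {proxy: 0 for proxy in proxies}
        self._max_failures = max_failures

    def __len__(self):
        return len(self._failures)

    def pick(self):
        """Return a random live proxy, or None when the pool is exhausted."""
        live = list(self._failures)
        return random.choice(live) if live else None

    def report_success(self, proxy):
        """Reset the failure count after a successful request."""
        if proxy in self._failures:
            self._failures[proxy] = 0

    def report_failure(self, proxy):
        """Count a failure and evict the proxy once it hits the threshold."""
        if proxy not in self._failures:
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self._max_failures:
            del self._failures[proxy]


def as_requests_proxies(proxy):
    """Map one proxy URL to the dict that requests expects for `proxies=`."""
    return {"http": proxy, "https": proxy}
```

In `get_with_proxy`, you would call `pool.pick()`, pass `as_requests_proxies(proxy)` to `requests.get`, and then report success or failure depending on whether the request raised.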

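5.2.3 Request Interval Tuning

I recommend combining random intervals with exponential backoff. The function below handles the random-interval part: if fewer than 3 seconds have passed since the last request, it sleeps for a further 3 to 5 seconds.

def smart_delay(last_request_time):
    elapsed = time.time() - last_request_time
    if elapsed < 3:  # Less than 3 seconds since the last request
        sleep_time = 3 + random.random() * 2  # Sleep a further 3-5 seconds
        time.sleep(sleep_time)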
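
The exponential-backoff half of that strategy can be sketched as a small pure function: each failed attempt doubles the wait, up to a ceiling, and a little random jitter keeps retries from lining up. The name `backoff_delay` and the defaults (3-second base, 60-second cap, up to 1 second of jitter) are assumptions chosen to fit the 3-second guideline in this article, not measured values.

```python
import random


def backoff_delay(attempt, base=3.0, cap=60.0, jitter=1.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    The wait doubles with every attempt (base, 2*base, 4*base, ...) and is
    capped at `cap`; a random amount in [0, jitter) is added on top.
    `rng` is injectable so the function can be tested deterministically.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + rng() * jitter
```

A typical use is to call `time.sleep(backoff_delay(attempt))` after a blocked or failed response, increasing `attempt` each time and resetting it to 0 after a success.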

6. Data Storage and Post-Processing

6.1 Structured Storage

Beyond basic Excel output, I recommend the following more advanced options:

# Save to JSON (`results` is a list of dicts like the one built in batch_crawl)
import json
with open('books.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

# Save to SQLite (`df` is the DataFrame returned by batch_crawl)
import sqlite3
conn = sqlite3.connect('books.db')
df.to_sql('douban_books', conn, if_exists='replace', index=False)
conn.close()
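
Note that `if_exists='replace'` rewrites the whole table on every run. For crawls that run in several sessions, one alternative is to key rows by book link and upsert them, so re-crawling a book updates its row instead of duplicating it. The sketch below uses only the standard `sqlite3` module; the English column names and the `save_books` helper are assumptions for illustration, while the dict keys match those returned by `get_book_info`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS douban_books (
    link         TEXT PRIMARY KEY,
    title        TEXT NOT NULL,
    rating       REAL,
    rating_count INTEGER,
    author       TEXT
)
"""


def save_books(conn, books):
    """Insert or update one row per book, keyed by its link.

    `books` is a list of dicts with the keys produced by get_book_info.
    Returns the number of rows written.
    """
    conn.execute(SCHEMA)
    rows = [
        (b['链接'], b['书名'], b['评分'], b['评价人数'], b['作者'])
        for b in books
    ]
    # `with conn` commits on success and rolls back if an insert fails.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO douban_books "
            "(link, title, rating, rating_count, author) VALUES (?, ?, ?, ?, ?)",
            rows,
        )
    return len(rows)
```

Calling `save_books(sqlite3.connect('books.db'), results)` after each batch keeps the database current even if a later batch fails.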

6.2 Data Cleaning Tips

In real projects I often have to deal with data problems like these:

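# Handle the "万" (ten thousand) unit in rating counts
def clean_rating_count(text):
    if '万' in text:
        # round() guards against floating-point results such as 22999.999...
        return int(round(float(text.replace('万', '')) * 10000))
    return int(text)

# Handle books with multiple authors
def clean_authors(info_div):
    authors = [a.text for a in info_div.find_all('a')]
    return '、'.join(authors)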

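7. Lessons Learned and Pitfalls to Avoid

7.1 Troubleshooting Common Problems

403 Forbidden errors
- Check that the User-Agent is valid
- Try a different IP address
- Check whether a required Cookie is missing
Incomplete data extraction
- Confirm whether the page structure has changed
- Re-inspect the DOM with the browser's developer tools
Connection timeouts
- Increase the timeout: requests.get(url, timeout=10)
- Implement a retry mechanism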
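
For the retry mechanism mentioned in the last item, a generic wrapper is often enough. The sketch below retries a callable on the listed exception types, pausing between attempts according to a caller-supplied schedule. `call_with_retries` and its parameters are hypothetical names, and the default schedule of 3, 6, and 12 seconds is an assumption.

```python
import time


def call_with_retries(func, *args, delays=(3, 6, 12), retry_on=(Exception,),
                      sleep=time.sleep):
    """Call func(*args), retrying after each pause listed in `delays`.

    Only exceptions listed in `retry_on` trigger a retry. After the
    schedule is exhausted, one final attempt is made and any exception
    it raises propagates to the caller.
    """
    for delay in delays:
        try:
            return func(*args)
        except retry_on:
            sleep(delay)
    return func(*args)  # Final attempt: let the exception propagate
```

With requests, a reasonable choice is `retry_on=(requests.Timeout, requests.ConnectionError)`. The wrapped function has to let those exceptions escape, so wrap the raw `requests.get` call rather than `get_book_info`, which catches every exception and returns None.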

7.2 Performance Tips

Concurrency control
Use concurrent.futures for bounded concurrency:

from concurrent.futures import ThreadPoolExecutor, as_completed

def concurrent_crawl(urls, max_workers=3):
    # Note: each worker sleeps independently inside get_book_info, so with
    # N workers the overall request rate can be up to N times higher.
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(get_book_info, url): url for url in urls}
        for future in as_completed(futures):
            result = future.result()
            if result:
                results.append(result)
    return results
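
Because every worker thread sleeps on its own, three workers can send roughly three times as many requests per second as a single-threaded crawl. If you want concurrency without exceeding the per-domain interval, one option is a limiter shared by all threads. The `RateLimiter` class below is a sketch under that assumption; its name and interface are illustrative.

```python
import threading
import time


class RateLimiter:
    """Allow at most one request per `min_interval` seconds across all threads."""

    def __init__(self, min_interval=3.0):
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0  # Monotonic time of the next free slot

    def wait(self):
        """Block until this caller's reserved time slot arrives."""
        # Reserve a slot while holding the lock, then sleep outside it so
        # other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self._min_interval
        pause = slot - now
        if pause > 0:
            time.sleep(pause)
```

Create one `RateLimiter` at module level and call its `wait()` method immediately before each `requests.get`, in place of the per-thread random sleep.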

Caching
For large-scale crawls, I recommend caching responses:

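import os
from hashlib import md5

import requests

def get_cached(url, cache_dir='cache'):
    os.makedirs(cache_dir, exist_ok=True)
    # Use the MD5 of the URL as a filesystem-safe cache key
    filename = md5(url.encode()).hexdigest() + '.html'
    path = os.path.join(cache_dir, filename)

    # Cache hit: return the stored page without touching the network
    if os.path.exists(path):
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Do not cache error pages
    content = response.text
    with open(path, 'w', encoding='utf-8') as f:
        f.write(content)
    return content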
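
A cache like this never expires, which is fine for page structure but not for ratings, since scores and vote counts change over time. A simple remedy is to treat a cached file as valid only for a limited period. The helper below is a sketch; `is_cache_fresh` is a hypothetical name and the 7-day default is an arbitrary assumption.

```python
import os
import time


def is_cache_fresh(path, max_age_seconds=7 * 24 * 3600, now=None):
    """Return True if `path` exists and was modified recently enough.

    `now` can be supplied (as a Unix timestamp) to make the check testable;
    by default the current time is used.
    """
    if not os.path.exists(path):
        return False
    current = time.time() if now is None else now
    return current - os.path.getmtime(path) <= max_age_seconds
```

In `get_cached`, replacing the `os.path.exists(path)` test with `is_cache_fresh(path)` makes stale pages get fetched again and overwritten.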

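8. Legal and Ethical Considerations

When building a scraper, keep the following legal and ethical points in mind:

Follow robots.txt: Douban's robots.txt places some restrictions on crawlers
Throttle requests: avoid putting excessive load on Douban's servers
Limit data use: use scraped data for personal research only
Protect user privacy: do not scrape users' personal information

My personal rule of thumb: crawl no more than 1,000 pages per day from a single domain, leave at least 3 seconds between requests, and never use scraped data directly in a commercial project.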
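
At 3 seconds per request, 1,000 pages take at least 50 minutes, so the daily cap is easy to exceed by accident in a long-running job. One way to make the rule self-enforcing is a small counter that refuses further requests once the day's budget is spent. `DailyBudget` below is a hypothetical helper written for this article's rule of thumb, not part of the original scraper.

```python
import datetime


class DailyBudget:
    """Count pages fetched today and refuse requests beyond a daily cap."""

    def __init__(self, max_pages=1000, today=datetime.date.today):
        self._max_pages = max_pages
        self._today = today  # Injectable clock, useful for testing
        self._day = today()
        self._used = 0

    def try_consume(self):
        """Return True and count one page if budget remains, else False."""
        current = self._today()
        if current != self._day:
            # A new day has started: reset the counter.
            self._day = current
            self._used = 0
        if self._used >= self._max_pages:
            return False
        self._used += 1
        return True
```

In the crawl loop, check `budget.try_consume()` before each request and stop, or sleep until the next day, when it returns False.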

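The most practical part of this project is its modular design. You can easily extend it to collect other kinds of information, such as:

Detailed tables of contents
Reader tag data
Book sales information
Related book recommendations

Only the HTML parsing needs to change; the core request handling, anti-scraping measures, and storage logic can all be reused. In my own projects I have used this framework to collect data on more than 50,000 books for training a reading-interest analysis model, and it worked very well.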