While organizing my personal library of technical books recently, I found I needed to collect ratings and publication details for programming books in bulk. Doing this by hand was too slow, so I decided to write a Python scraper to automate the process. I picked Douban as the data source because its book information is comprehensive and well structured. This project implements, from scratch, the full pipeline of scraping, parsing, and storing book information across multiple pages, making it a good practice project for developers new to web scraping.
Tip: before doing any of this, read Douban's robots.txt and keep the request rate low; an interval of 3-5 seconds between pages is recommended.
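As a minimal sketch of that pre-flight check (the specific robots.txt URL and user-agent wildcard here are my own assumptions), the standard library's urllib.robotparser can tell you whether a path is allowed before you send a single request:

```python
# Sketch: consult robots.txt before scraping; the URL and user-agent are illustrative.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://book.douban.com/robots.txt")
rp.read()

target = "https://book.douban.com/tag/编程"
if rp.can_fetch("*", target):
    print("robots.txt allows this path; keep 3-5s between requests anyway")
else:
    print("robots.txt disallows this path; do not scrape it")
```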
I went with the requests + BeautifulSoup combination; the core dependencies are:
```python
# Core dependencies
import requests
from bs4 import BeautifulSoup
import time
import random
import pandas as pd
```
Douban has basic anti-scraping measures in place, so a few preparations are needed:
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml',
    'Referer': 'https://book.douban.com/'
}
```
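Before writing any parsing code, it is worth firing a single request with these headers to confirm the page comes back normally; the status-code check below is my own addition rather than anything Douban documents:

```python
# One-off sanity check with the headers defined above.
resp = requests.get("https://book.douban.com/tag/编程", headers=headers, timeout=10)
print(resp.status_code)  # 200 means the page came back; 403 or other codes usually mean the request was blocked
print(len(resp.text))    # an unusually short body can also indicate a block or login page
```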
Take the programming tag page as an example (https://book.douban.com/tag/编程):
Inspecting the page with Chrome DevTools shows that each book entry lives in a div with the class 'info':
```python
soup = BeautifulSoup(html, 'lxml')
books = soup.find_all('div', class_='info')
```
```python
def parse_page(html):
    soup = BeautifulSoup(html, 'lxml')
    book_list = []
    for item in soup.find_all('div', class_='info'):
        try:
            title = item.h2.a['title']
            rating = item.find('span', class_='rating_nums').text
            pl = item.find('span', class_='pl').text
            pl = pl.replace('(', '').replace('人评价)', '').strip()  # keep just the number of ratings
            pub = item.find('div', class_='pub').text.strip()
            book_list.append([title, rating, pl, pub])
        except Exception as e:
            print(f"Parse error: {e}")
            continue
    return book_list
```
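To sanity-check the parser, you can run it on a single page and print the first few rows (a quick snippet of my own, reusing the headers defined earlier):

```python
# Fetch one page and inspect the parsed rows.
html = requests.get("https://book.douban.com/tag/编程?start=0", headers=headers, timeout=10).text
for row in parse_page(html)[:3]:
    print(row)  # [title, rating, number of ratings, publication info]
```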
Douban's pagination is controlled through a URL parameter:
```python
base_url = "https://book.douban.com/tag/编程?start={}"
data = []
for i in range(0, 100, 20):  # scrape the first 5 pages, 20 books per page
    url = base_url.format(i)
    html = requests.get(url, headers=headers).text
    data.extend(parse_page(html))
    time.sleep(random.uniform(2, 5))  # random delay between pages
```
pandas makes it easy to export the results to Excel:
```python
df = pd.DataFrame(data, columns=['书名', '评分', '评价人数', '出版信息'])
df.to_excel('douban_books.xlsx', index=False)
```
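Note that writing .xlsx files goes through the openpyxl engine, so that package needs to be installed alongside pandas. A quick read-back (my own check, not part of the original flow) confirms the export:

```python
# Read the exported file back to verify it (requires openpyxl for .xlsx support).
check = pd.read_excel('douban_books.xlsx')
print(check.shape)   # (number of books, 4)
print(check.head())
```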
The publication info usually contains the publisher, publication year, price, and so on, which can be split apart with a regular expression:
```python
import re

def clean_pub(text):
    # typically: author / publisher / year-month / price, separated by slashes
    pattern = r'^(.*?)\s*/\s*(.*?)\s*/\s*(\d{4}-\d{1,2})?\s*/\s*(\d+\.?\d*)'
    match = re.match(pattern, text)
    if match:
        return match.groups()
    return (None, None, None, None)
```
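To fold the split fields back into the DataFrame, clean_pub can be applied to the publication-info column; the English column names below are my own labels for the four captured groups:

```python
# Split the raw publication info into separate columns (column names are illustrative).
df[['author', 'publisher', 'pub_date', 'price']] = df['出版信息'].apply(
    lambda s: pd.Series(clean_pub(s))
)
print(df[['author', 'publisher', 'pub_date', 'price']].head())
```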
In practice, requests can fail intermittently or time out. One mitigation is to let the session retry failed requests automatically:
```python
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

session = requests.Session()
retry_strategy = Retry(
    total=3,
    backoff_factor=1  # exponentially increasing wait between retries
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("https://", adapter)  # Douban is served over HTTPS
session.mount("http://", adapter)
```
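Requests made through this session are then retried automatically on connection errors, with the delay between attempts governed by backoff_factor; a minimal usage sketch:

```python
# The mounted adapter applies to every request made through this session.
resp = session.get("https://book.douban.com/tag/编程?start=0",
                   headers=headers, timeout=10)
print(resp.status_code)
```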
The scraped data can then be used to organize a personal book library — in my case, comparing ratings and publication details across programming books.
Important: all scraped data is for personal study only; commercial use or large-scale redistribution is prohibited.
Putting everything together, the complete script:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

def main():
    session = requests.Session()
    retries = Retry(total=3, backoff_factor=1)
    session.mount('https://', HTTPAdapter(max_retries=retries))
    session.mount('http://', HTTPAdapter(max_retries=retries))
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    all_books = []
    base_url = "https://book.douban.com/tag/编程?start={}"
    for start in range(0, 100, 20):
        url = base_url.format(start)
        try:
            resp = session.get(url, headers=headers, timeout=10)
            soup = BeautifulSoup(resp.text, 'lxml')
            for item in soup.find_all('div', class_='info'):
                # parsing logic (same as parse_page above)...
                pass
            time.sleep(random.uniform(3, 6))
        except Exception as e:
            print(f"Request failed: {e}")
            continue
    df = pd.DataFrame(all_books)
    df.to_excel('douban_books.xlsx', index=False)

if __name__ == '__main__':
    main()
```
Once the basic scraping works, there is plenty of room to extend the script further.
From my own testing, Douban's tolerance for scrapers depends directly on request frequency. With intervals above 5 seconds, I scraped 20 consecutive pages without triggering any protection. A sudden spike in request volume, however, results in a temporary ban; when that happens you need to switch IPs or wait 1-2 hours before trying again.