While organizing my personal library of technical books recently, I found I needed to collect ratings and publication details for programming books in bulk. Doing this by hand was too slow, so I decided to write a Python scraper to automate the process. I picked Douban as the data source because its book information is comprehensive and well structured. This project implements, from scratch, the full pipeline of scraping, parsing, and storing book information across multiple pages, making it a good practice project for developers new to web scraping.
Tip: before doing any of this, read Douban's robots.txt and keep the request rate low; an interval of 3-5 seconds between pages is recommended.
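As a minimal sketch of that pre-flight check (the specific robots.txt URL and user-agent wildcard here are my own assumptions), the standard library's urllib.robotparser can tell you whether a path is allowed before you send a single request:

```python
# Sketch: consult robots.txt before scraping; the URL and user-agent are illustrative.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://book.douban.com/robots.txt")
rp.read()

target = "https://book.douban.com/tag/编程"
if rp.can_fetch("*", target):
    print("robots.txt allows this path; keep 3-5s between requests anyway")
else:
    print("robots.txt disallows this path; do not scrape it")
```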
I went with the requests + BeautifulSoup combination; the core dependencies are:
```python
# Core dependencies
import requests
from bs4 import BeautifulSoup
import time
import random
import pandas as pd
```
Douban has basic anti-scraping measures in place, so a few preparations are needed:
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml',
    'Referer': 'https://book.douban.com/'
}
```
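Before writing any parsing code, it is worth firing a single request with these headers to confirm the page comes back normally; the status-code check below is my own addition rather than anything Douban documents:

```python
# One-off sanity check with the headers defined above.
resp = requests.get("https://book.douban.com/tag/编程", headers=headers, timeout=10)
print(resp.status_code)  # 200 means the page came back; 403 or other codes usually mean the request was blocked
print(len(resp.text))    # an unusually short body can also indicate a block or login page
```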
Take the programming tag page as an example (https://book.douban.com/tag/编程):
Inspecting the page with Chrome DevTools shows that each book entry lives in a div with the class 'info':
```python
soup = BeautifulSoup(html, 'lxml')
books = soup.find_all('div', class_='info')
```
```python
def parse_page(html):
    soup = BeautifulSoup(html, 'lxml')
    book_list = []
    for item in soup.find_all('div', class_='info'):
        try:
            title = item.h2.a['title']
            rating = item.find('span', class_='rating_nums').text
            pl = item.find('span', class_='pl').text
            pl = pl.replace('(', '').replace('人评价)', '').strip()  # keep just the number of ratings
            pub = item.find('div', class_='pub').text.strip()
            book_list.append([title, rating, pl, pub])
        except Exception as e:
            print(f"Parse error: {e}")
            continue
    return book_list
```
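To sanity-check the parser, you can run it on a single page and print the first few rows (a quick snippet of my own, reusing the headers defined earlier):

```python
# Fetch one page and inspect the parsed rows.
html = requests.get("https://book.douban.com/tag/编程?start=0", headers=headers, timeout=10).text
for row in parse_page(html)[:3]:
    print(row)  # [title, rating, number of ratings, publication info]
```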
Douban's pagination is controlled through a URL parameter:
```python
base_url = "https://book.douban.com/tag/编程?start={}"
data = []
for i in range(0, 100, 20):  # scrape the first 5 pages, 20 books per page
    url = base_url.format(i)
    html = requests.get(url, headers=headers).text
    data.extend(parse_page(html))
    time.sleep(random.uniform(2, 5))  # random delay between pages
```
pandas makes it easy to export the results to Excel:
```python
df = pd.DataFrame(data, columns=['书名', '评分', '评价人数', '出版信息'])
df.to_excel('douban_books.xlsx', index=False)
```
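Note that writing .xlsx files goes through the openpyxl engine, so that package needs to be installed alongside pandas. A quick read-back (my own check, not part of the original flow) confirms the export:

```python
# Read the exported file back to verify it (requires openpyxl for .xlsx support).
check = pd.read_excel('douban_books.xlsx')
print(check.shape)   # (number of books, 4)
print(check.head())
```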
The publication info usually contains the publisher, publication year, price, and so on, which can be split apart with a regular expression:
```python
import re

def clean_pub(text):
    # typically: author / publisher / year-month / price, separated by slashes
    pattern = r'^(.*?)\s*/\s*(.*?)\s*/\s*(\d{4}-\d{1,2})?\s*/\s*(\d+\.?\d*)'
    match = re.match(pattern, text)
    if match:
        return match.groups()
    return (None, None, None, None)
```
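To fold the split fields back into the DataFrame, clean_pub can be applied to the publication-info column; the English column names below are my own labels for the four captured groups:

```python
# Split the raw publication info into separate columns (column names are illustrative).
df[['author', 'publisher', 'pub_date', 'price']] = df['出版信息'].apply(
    lambda s: pd.Series(clean_pub(s))
)
print(df[['author', 'publisher', 'pub_date', 'price']].head())
```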
In practice, requests can fail intermittently or time out. One mitigation is to let the session retry failed requests automatically:
```python
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

session = requests.Session()
retry_strategy = Retry(
    total=3,
    backoff_factor=1  # exponentially increasing wait between retries
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("https://", adapter)  # Douban is served over HTTPS
session.mount("http://", adapter)
```
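Requests made through this session are then retried automatically on connection errors, with the delay between attempts governed by backoff_factor; a minimal usage sketch:

```python
# The mounted adapter applies to every request made through this session.
resp = session.get("https://book.douban.com/tag/编程?start=0",
                   headers=headers, timeout=10)
print(resp.status_code)
```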
The scraped data can then be used to organize a personal book library — in my case, comparing ratings and publication details across programming books.
Important: all scraped data is for personal study only; commercial use or large-scale redistribution is prohibited.
Putting everything together, the complete script:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

def main():
    session = requests.Session()
    retries = Retry(total=3, backoff_factor=1)
    session.mount('https://', HTTPAdapter(max_retries=retries))
    session.mount('http://', HTTPAdapter(max_retries=retries))
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    all_books = []
    base_url = "https://book.douban.com/tag/编程?start={}"
    for start in range(0, 100, 20):
        url = base_url.format(start)
        try:
            resp = session.get(url, headers=headers, timeout=10)
            soup = BeautifulSoup(resp.text, 'lxml')
            for item in soup.find_all('div', class_='info'):
                # parsing logic (same as parse_page above)...
                pass
            time.sleep(random.uniform(3, 6))
        except Exception as e:
            print(f"Request failed: {e}")
            continue
    df = pd.DataFrame(all_books)
    df.to_excel('douban_books.xlsx', index=False)

if __name__ == '__main__':
    main()
```
Once the basic scraping works, there is plenty of room to extend the script further.
From my own testing, Douban's tolerance for scrapers depends directly on request frequency. With intervals above 5 seconds, I scraped 20 consecutive pages without triggering any protection. A sudden spike in request volume, however, results in a temporary ban; when that happens you need to switch IPs or wait 1-2 hours before trying again.