Python Web Scraping in Practice: Monitoring and Analyzing New Anime Data on Bilibili

小狐狸与小道士

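1. Project Overview

I recently built a small project that I found quite fun: a Python scraper that monitors Bilibili's new anime timetable and its popularity metrics. The need came from my own habits as an anime fan. I kept refreshing the page by hand to check for new episodes, which was far too inefficient, so I decided to automate the monitoring.

The scraper's core features are:

Fetch Bilibili's new anime timetable
Extract popularity metrics for each series, such as view count, danmaku (bullet comment) count, and follower count
Refresh the data automatically on a schedule
Visualize the data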

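2. Technology Choices and Overall Workflow

2.1 Why Reverse-Engineer the API

Traditional scrapers parse HTML pages directly, but modern sites commonly separate front end and back end and deliver their data through API endpoints. Inspecting the page shows that Bilibili's new anime data also comes from an API, so this project calls that API directly instead of parsing HTML.

Compared with HTML parsing, calling the API has these advantages:

The data is structured JSON and easy to parse
Responses are smaller, so each request is cheaper
It is more stable, because front-end redesigns do not affect it

2.2 Overall Workflow

Analyze the API requests with the browser's developer tools
Replay the requests to fetch the JSON data
Parse and clean the data
Store it in a database
Schedule the job to run periodically
Visualize the data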

3. Environment Setup

3.1 Base Environment

Python 3.8+
pip install requests pandas matplotlib schedule

3.2 Recommended Tools

Chrome + Developer Tools
Postman (for API debugging)
Jupyter Notebook (for data analysis)
VS Code (for writing code)

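4. Core Implementation

4.1 Analyzing and Reverse-Engineering the API

Open Bilibili's new anime timetable page, press F12 to open the developer tools, switch to the Network tab, and filter for XHR requests. The analysis turns up two main API endpoints:

# New anime timetable API
SEASON_API = "https://api.bilibili.com/pgc/web/timeline/v2"
# Series detail API
DETAIL_API = "https://api.bilibili.com/pgc/view/web/season"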

4.2 Request Headers

The Bilibili API expects a few request headers:

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://www.bilibili.com/",
    "Origin": "https://www.bilibili.com"
}

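4.3 Wrapping the Request

import requests

def fetch_season_data():
    """Fetch the timetable JSON; return None if the request fails."""
    try:
        # A timeout keeps a stalled connection from blocking the job forever
        response = requests.get(SEASON_API, headers=headers, timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None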

5. Data Parsing

5.1 Parsing the Timetable Data

The API returns JSON with the following structure:

{
    "code": 0,
    "message": "success",
    "result": [
        {
            "date": "2023-10-01",
            "day_of_week": 7,
            "is_today": 1,
            "seasons": [
                {
                    "cover": "http://i0.hdslb.com/bfs/bangumi/...",
                    "delay": 0,
                    "ep_id": 123456,
                    "favorites": 123456,
                    "follow": 123456,
                    "pub_index": "第1话",
                    "pub_time": "20:00",
                    "season_id": 12345,
                    "square_cover": "http://i0.hdslb.com/bfs/bangumi/...",
                    "title": "某科学的超电磁炮T"
                }
            ]
        }
    ]
}

Parsing code:

def parse_season_data(data):
    if not data or data["code"] != 0:
        return []
    
    result = []
    for day_data in data["result"]:
        date = day_data["date"]
        for season in day_data["seasons"]:
            season_info = {
                "date": date,
                "title": season["title"],
                "season_id": season["season_id"],
                "pub_time": season["pub_time"],
                "pub_index": season["pub_index"],
                "favorites": season["favorites"]
            }
            result.append(season_info)
    return result

6. Data Storage

6.1 Database Design

SQLite serves as a lightweight database:

CREATE TABLE IF NOT EXISTS bangumi (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    date TEXT NOT NULL,
    title TEXT NOT NULL,
    season_id INTEGER NOT NULL UNIQUE,
    pub_time TEXT,
    pub_index TEXT,
    favorites INTEGER,
    views INTEGER,
    danmaku INTEGER,
    update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
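
The functions in the following sections assume this table already exists in bangumi.db. A minimal sketch of creating it from Python with the standard sqlite3 module is shown here; the helper name init_db and the SCHEMA constant are illustrative and not part of the original project.

```python
import sqlite3

# Schema copied from the CREATE TABLE statement above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS bangumi (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    date TEXT NOT NULL,
    title TEXT NOT NULL,
    season_id INTEGER NOT NULL UNIQUE,
    pub_time TEXT,
    pub_index TEXT,
    favorites INTEGER,
    views INTEGER,
    danmaku INTEGER,
    update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"""

def init_db(path="bangumi.db"):
    """Open the database at `path` and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn
```

Because of IF NOT EXISTS, calling init_db() at the start of every run is harmless.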

6.2 Writing to the Database

import sqlite3

def save_to_db(data):
    conn = sqlite3.connect('bangumi.db')
    cursor = conn.cursor()
    
    for item in data:
        try:
            # OR REPLACE overwrites the existing row for this season_id.
            # views and danmaku are not in the timetable payload, so they
            # are not set here and end up NULL.
            cursor.execute('''
                INSERT OR REPLACE INTO bangumi 
                (date, title, season_id, pub_time, pub_index, favorites, update_time)
                VALUES (?, ?, ?, ?, ?, ?, datetime('now'))
            ''', (
                item['date'],
                item['title'],
                item['season_id'],
                item['pub_time'],
                item['pub_index'],
                item['favorites']
            ))
        except sqlite3.Error as e:
            print(f"Database operation failed: {e}")
    
    conn.commit()
    conn.close()

7. Scheduled Tasks

The schedule library handles the scheduling:

import schedule
import time

def job():
    print("Starting scheduled job...")
    data = fetch_season_data()
    if data:
        parsed_data = parse_season_data(data)
        save_to_db(parsed_data)
    print("Job finished")

# Run every day at 08:00
schedule.every().day.at("08:00").do(job)

# Check once a minute whether a job is due
while True:
    schedule.run_pending()
    time.sleep(60)

8. Data Visualization

8.1 Popularity Trend Chart

import sqlite3

import pandas as pd
import matplotlib.pyplot as plt

def plot_trend(season_id):
    conn = sqlite3.connect('bangumi.db')
    # Bind season_id as a parameter instead of formatting it into the SQL
    df = pd.read_sql('''
        SELECT date, favorites, views, danmaku 
        FROM bangumi 
        WHERE season_id = ?
        ORDER BY date
    ''', conn, params=(season_id,))
    conn.close()
    
    plt.figure(figsize=(12, 6))
    plt.plot(df['date'], df['favorites'], label='Followers')
    plt.plot(df['date'], df['views'], label='Views')
    plt.plot(df['date'], df['danmaku'], label='Danmaku')
    plt.title('Series Popularity Trend')
    plt.xlabel('Date')
    plt.ylabel('Count')
    plt.legend()
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
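
One caveat: season_id is declared UNIQUE and rows are written with INSERT OR REPLACE, so the bangumi table holds a single row per series and the query above returns only one point. A trend line needs one row per snapshot. The sketch below shows one possible way to keep that history; the table bangumi_history and the helpers record_snapshot and load_history are hypothetical names introduced for illustration, not part of the original project.

```python
import sqlite3
from datetime import date

# Hypothetical table: one row per series per snapshot day.
HISTORY_SCHEMA = """
CREATE TABLE IF NOT EXISTS bangumi_history (
    season_id INTEGER NOT NULL,
    snapshot_date TEXT NOT NULL,
    favorites INTEGER,
    views INTEGER,
    danmaku INTEGER,
    PRIMARY KEY (season_id, snapshot_date)
);
"""

def record_snapshot(conn, item, snapshot_date=None):
    """Store the metrics of one series for a day; the same day is overwritten."""
    snapshot_date = snapshot_date or date.today().isoformat()
    conn.execute(HISTORY_SCHEMA)
    conn.execute(
        "INSERT OR REPLACE INTO bangumi_history VALUES (?, ?, ?, ?, ?)",
        (item["season_id"], snapshot_date, item.get("favorites"),
         item.get("views"), item.get("danmaku")),
    )

def load_history(conn, season_id):
    """Return (snapshot_date, favorites, views, danmaku) rows, oldest first."""
    return conn.execute(
        "SELECT snapshot_date, favorites, views, danmaku FROM bangumi_history "
        "WHERE season_id = ? ORDER BY snapshot_date",
        (season_id,),
    ).fetchall()
```

Calling record_snapshot for every parsed item inside the daily job would build up the series that a trend plot can then read back with load_history.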

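9. Common Problems and Solutions

9.1 Request Rejected

Symptom: the server returns HTTP 403
Cause: the request headers are incomplete, or the request was identified as a scraper
Solution:

Complete the request headers, especially User-Agent and Referer
Add cookie information
Lower the request rate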
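
Lowering the request rate can be as simple as enforcing a minimum gap between consecutive requests. The following is a minimal sketch using only the standard library; the class name Throttle and the two-second interval in the usage note are arbitrary choices for illustration, not values known to satisfy Bilibili's limits.

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous call: sleep for the rest
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A typical use is to create throttle = Throttle(2.0) once and call throttle.wait() immediately before each requests.get call.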

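9.2 Data Parsing Fails

Symptom: JSON parsing raises an error
Cause: the structure of the API response changed
Solution:

Inspect the raw API response
Update the parsing logic
Add exception handling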
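
As one way to make the parser tolerate structural changes, the sketch below is a defensive variant of parse_season_data from section 5.1. It assumes the same field names as the sample response shown there and skips entries with missing required fields instead of raising KeyError. The name parse_season_data_safe is introduced here for illustration.

```python
def parse_season_data_safe(data):
    """Like parse_season_data, but skip malformed entries instead of raising."""
    if not isinstance(data, dict) or data.get("code") != 0:
        return []

    rows = []
    for day in data.get("result") or []:
        day_date = day.get("date")
        if day_date is None:
            continue  # the table requires a date
        for season in day.get("seasons") or []:
            # title and season_id are NOT NULL in the table schema
            if "title" not in season or "season_id" not in season:
                continue
            rows.append({
                "date": day_date,
                "title": season["title"],
                "season_id": season["season_id"],
                "pub_time": season.get("pub_time"),
                "pub_index": season.get("pub_index"),
                "favorites": season.get("favorites"),
            })
    return rows
```

Optional fields that are missing come back as None, which SQLite stores as NULL.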

9.3 Database Write Conflict

Symptom: UNIQUE constraint failed
Cause: the same season_id was inserted more than once
Solution:
Use the INSERT OR REPLACE syntax
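
One side effect of INSERT OR REPLACE is worth knowing: SQLite resolves the conflict by deleting the old row and inserting a new one, so columns that the INSERT does not list (here views and danmaku) fall back to NULL and the row gets a new id. An alternative is SQLite's upsert clause, ON CONFLICT ... DO UPDATE, available from SQLite 3.24. The sketch below assumes the bangumi table from section 6.1; the function name upsert_season is illustrative.

```python
import sqlite3

UPSERT_SQL = """
INSERT INTO bangumi (date, title, season_id, pub_time, pub_index, favorites)
VALUES (:date, :title, :season_id, :pub_time, :pub_index, :favorites)
ON CONFLICT(season_id) DO UPDATE SET
    date = excluded.date,
    title = excluded.title,
    pub_time = excluded.pub_time,
    pub_index = excluded.pub_index,
    favorites = excluded.favorites,
    update_time = datetime('now')
"""

def upsert_season(conn, item):
    """Insert a row, or update the timetable columns of the existing one."""
    # item is a dict such as the ones produced by parse_season_data
    conn.execute(UPSERT_SQL, item)
```

With this statement the existing id, views, and danmaku values survive a timetable refresh.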

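10. Advanced Optimizations

10.1 Multithreaded Fetching

When fetching detail data for many series, multithreading can speed things up:

from concurrent.futures import ThreadPoolExecutor

def fetch_detail(season_id):
    """Fetch the detail JSON for one series."""
    params = {"season_id": season_id}
    response = requests.get(DETAIL_API, headers=headers, params=params, timeout=10)
    response.raise_for_status()
    return response.json()

def batch_fetch_details(season_ids):
    # At most 5 requests in flight; results come back in input order
    with ThreadPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(fetch_detail, season_ids))
    return results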
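
With executor.map, the first exception raised by any worker propagates when the results are consumed, so one failed request discards the whole batch. A possible variant that collects successes and failures separately is sketched below. The function name batch_fetch_safe is illustrative, and the fetch function is passed in as a parameter so the sketch does not depend on the network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def batch_fetch_safe(fetch, season_ids, max_workers=5):
    """Run fetch(season_id) in a thread pool; return (results, errors) dicts."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the season_id it was submitted for
        futures = {executor.submit(fetch, sid): sid for sid in season_ids}
        for future in as_completed(futures):
            sid = futures[future]
            try:
                results[sid] = future.result()
            except Exception as exc:
                # Record the failure and keep processing the other futures
                errors[sid] = exc
    return results, errors
```

Calling batch_fetch_safe(fetch_detail, season_ids) would return the detail data keyed by season_id, plus the ids that failed and can be retried later.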

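10.2 Retrying on Errors

Add retry logic with the tenacity library (pip install tenacity; it is not in the install list from section 3.1):

from tenacity import retry, stop_after_attempt, wait_exponential

# Up to 3 attempts, with an exponential wait clamped between 4 and 10 seconds
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def fetch_with_retry(url):
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response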
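
For readers who prefer not to add a dependency, the same idea can be approximated with the standard library. The sketch below is a simplified stand-in and not tenacity's implementation: it retries on any exception and doubles the delay after each failure. The name retry_call and its default values are assumptions made for illustration.

```python
import time

def retry_call(func, attempts=3, base_delay=1.0, max_delay=10.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**n (capped) and retry."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: propagate the last error
            sleep(min(base_delay * 2 ** attempt, max_delay))
```

A call such as retry_call(lambda: fetch_detail(season_id)) would wrap a single request.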

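10.3 Incremental Updates

Optimize the database writes so that only changed data is updated:

def smart_update(cursor, item):
    # Look up the metrics currently stored for this series
    cursor.execute('''
        SELECT favorites, views, danmaku 
        FROM bangumi 
        WHERE season_id = ?
    ''', (item['season_id'],))
    
    existing = cursor.fetchone()
    # views and danmaku may be absent from item (the timetable parser
    # does not set them), so read them with .get()
    views = item.get('views')
    danmaku = item.get('danmaku')
    # Write only when the row is missing or at least one metric changed
    if not existing or any([
        existing[0] != item['favorites'],
        existing[1] != views,
        existing[2] != danmaku
    ]):
        cursor.execute('''
            INSERT OR REPLACE INTO bangumi 
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, datetime('now'))
        ''', (
            None,  # id: let SQLite assign it
            item['date'],
            item['title'],
            item['season_id'],
            item['pub_time'],
            item['pub_index'],
            item['favorites'],
            views,
            danmaku
        ))
        return True
    return False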

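11. Ideas for Extending the Project

Notifications: send an email or WeChat message when a followed series updates
Data analysis: analyze the genre distribution of each season, the performance of production studios, and so on
Prediction model: predict a series' final popularity from historical data
Web display: build a visualization dashboard with Flask or Django
Mobile: build a mini program or app for checking on the go

During development I found Bilibili's API design fairly consistent, but a few points deserve attention:

Keep the request rate low to avoid being banned
Check regularly whether the API has changed
Keep local backups of important data
Aggregate the data before plotting so the charts do not become cluttered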