Python Web Scraping in Practice: Efficiently Collecting University Academic Lecture Information

AngstEssenSeele

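1. Project Overview and Requirements Analysis

While helping my advisor organize academic resources recently, I found that collecting lecture announcements by hand from the computer science schools of various universities is extremely inefficient. Take the official website of the computer science school at one Project 985 university as an example: its "Academic News" (学术动态) section adds 3-5 lecture announcements every week, each carrying key data such as the speaker, topic, time, and venue. Copying and pasting by hand is slow and makes it easy to miss entries, so I decided to write a Python crawler that collects this structured data automatically.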

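The crawler needs to provide the following core features:

- Fetch every page of the Academic News section automatically
- Extract the four key fields of each lecture accurately: topic, speaker, time, and venue
- Handle pagination so that no historical record is missed
- Export the results as standard JSON for later analysis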

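Note: the target site is on an .edu.cn domain, and testing showed that it has no anti-scraping mechanism in place. We still follow its robots.txt rules and keep the request rate reasonable (no more than one request per second).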
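
The robots.txt check mentioned in the note can be done with the standard library. The sketch below is only an illustration: the helper name `is_allowed` and the sample robots.txt content are assumptions, not taken from the target site. A real run would download the file from the site root first.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used here only for demonstration
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://cs.xxx.edu.cn/xsdt/list.htm"))
print(is_allowed(SAMPLE_ROBOTS, "https://cs.xxx.edu.cn/admin/login.htm"))
```

Calling such a check once per URL before requesting it keeps the crawler within whatever rules the site publishes.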

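2. Technology Choices and Tooling

2.1 Core Toolchain

After some comparison testing, I settled on the following stack:

- Requests: requests (lightweight) + fake-useragent (browser-like headers)
- Parsing: BeautifulSoup (DOM parsing) + re (regular expressions as a supplement)
- Storage format: JSON (standard structured data)
- Helpers: tqdm (progress bar), logging (error records)

Reasons for these choices:

- The target pages are traditional server-rendered HTML, so there is no JavaScript to deal with
- Lecture information is shown in a fixed format, which suits a combination of DOM parsing and regular expressions
- The data volume is small (about 200 records per year), so a single-machine script is enough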

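2.2 Development Environment Setup

Python 3.8 or later is recommended. Install the dependencies with:

pip install requests beautifulsoup4 fake-useragent tqdm

For easier debugging, I like to keep global settings in a config.py file at the project root:

import os
from fake_useragent import UserAgent

BASE_URL = "https://cs.xxx.edu.cn/xsdt/list.htm"  # replace with the target school's actual URL
MAX_PAGE = 5       # estimated maximum number of list pages
SAVE_PATH = os.path.join(os.getcwd(), 'lectures.json')
HEADERS = {'User-Agent': UserAgent().random}  # chosen once, when config.py is imported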

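3. Page Structure Analysis and Parsing Strategy

3.1 Breaking Down the Target Page

Inspecting the page with the browser's developer tools shows the following:

- The pager uses conventional numbered links (for example, list_2.htm)
- Each lecture record is an <li> under <ul class="news_list">
- The key fields are distributed as follows:
  - Topic: the text of the <a> tag
  - Time: inside <span class="news_meta">
  - Venue/speaker: only available in the detail page or the summary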

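3.2 Two-Level Parsing

Because some key fields can only be obtained from the detail page, parsing happens in two levels:

graph TD
    A[Starting list page] --> B[Extract detail-page links]
    B --> C[Fetch detail page]
    C --> D[Extract full record]

The corresponding code skeleton: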

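from bs4 import BeautifulSoup

def parse_list_page(html):
    """Parse a list page and return the detail-page links."""
    soup = BeautifulSoup(html, 'html.parser')
    # a[href] skips anchors without an href, which would otherwise raise KeyError
    links = [a['href'] for a in soup.select('.news_list li a[href]')]
    return links

def parse_detail_page(html):
    """Parse a detail page and return the full record."""
    soup = BeautifulSoup(html, 'html.parser')
    # Adjust the selectors to match the actual page structure
    title = soup.select_one('h1').text.strip()
    lecture_data = {"title": title}
    # Extraction logic for the other fields...
    return lecture_data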
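
The href values collected from a list page are often relative, so they usually need to be resolved against the list page's own URL before they can be requested. The following is a minimal sketch of that step. The helper name `resolve_links` and the sample paths are hypothetical and not taken from the target site.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Convert hrefs found on page_url to absolute URLs, dropping duplicates in order."""
    seen = set()
    resolved = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


# Hypothetical hrefs: one site-relative, one page-relative, one duplicate
links = resolve_links(
    "https://cs.xxx.edu.cn/xsdt/list.htm",
    ["/2023/0901/page.htm", "detail_1.htm", "/2023/0901/page.htm"],
)
print(links)
```

Dropping duplicates also helps when a list item contains more than one anchor that points to the same detail page.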

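4. Core Implementation

4.1 Network Requests

Wrap requests in a function with error handling:

import requests
from tqdm import tqdm
import time
import logging

from config import HEADERS  # defined in config.py (section 2.2)

logging.basicConfig(filename='spider.log', level=logging.INFO)

def safe_request(url, retry=3):
    """GET url and return the response text, or None after `retry` failed attempts."""
    for i in range(retry):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException as e:
            logging.error(f"Request failed: {url} - {e}")
            if i < retry - 1:
                time.sleep(2 ** i)  # exponential backoff: 1 s, 2 s, ...
    return None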

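4.2 Pagination

The URL pattern makes it possible to generate every list-page URL up front:

def generate_page_urls():
    """Generate the URLs of all list pages (BASE_URL and MAX_PAGE come from config.py)."""
    urls = [BASE_URL]  # page 1 is list.htm itself
    for i in range(2, MAX_PAGE + 1):
        urls.append(BASE_URL.replace('list.htm', f'list_{i}.htm'))
    return urls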

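4.3 Data Extraction

Combine CSS selectors and regular expressions to extract each field:

import re
from datetime import datetime

def parse_detail_page(html):
    """Full detail-page parser."""
    soup = BeautifulSoup(html, 'html.parser')

    # Topic
    title = soup.select_one('.article-title').text.strip()

    # Time and venue via regular expressions
    content = soup.select_one('.article-content').text
    time_pattern = r"时间：(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2})"
    # '.' does not match a newline, so this captures the rest of the line,
    # including a final line that has no trailing newline
    location_pattern = r"地点：(.+)"

    # re.search returns None when nothing matches, so guard before calling .group()
    time_match = re.search(time_pattern, content)
    location_match = re.search(location_pattern, content)
    lecture_time = time_match.group(1) if time_match else None
    location = location_match.group(1).strip() if location_match else None

    # The speaker may appear in the title, e.g. "【学术讲座】张教授：..."
    speaker = title.split('】')[-1].split('：')[0] if '】' in title else None

    return {
        "title": title,
        "speaker": speaker,
        "time": lecture_time,
        "location": location,
        "crawl_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    }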

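5. Data Storage and Export

5.1 Standardizing the JSON Format

Define a single data structure for the output:

{
    "metadata": {
        "source": "Official website of the target university's computer science school",
        "crawl_time": "2023-08-20 15:00:00",
        "count": 42
    },
    "data": [
        {
            "id": 1,
            "title": "【学术讲座】张教授：量子计算前沿进展",  # "[Academic Lecture] Prof. Zhang: Advances in Quantum Computing"
            "speaker": "张教授",  # "Prof. Zhang"
            "time": "2023-09-01 14:00",
            "location": "计算机学院101报告厅",  # "Lecture Hall 101, School of Computer Science"
            "url": "https://cs.xxx.edu.cn/xxxx"
        },
        # more records...
    ]
}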
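
The structure above includes an id for every record and a metadata header. One possible way to build that envelope from a list of parsed records is sketched below. The function name `build_result` is an assumption for illustration, and ids are simply assigned in list order starting from 1.

```python
from datetime import datetime


def build_result(lectures, source):
    """Wrap parsed lecture dicts in the metadata/data envelope, numbering them from 1."""
    data = [{"id": i, **lecture} for i, lecture in enumerate(lectures, start=1)]
    return {
        "metadata": {
            "source": source,
            "crawl_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "count": len(data),
        },
        "data": data,
    }


# Minimal sample records, for demonstration only
sample = build_result(
    [{"title": "Lecture A"}, {"title": "Lecture B"}],
    source="https://cs.xxx.edu.cn/xsdt/list.htm",
)
print(sample["metadata"]["count"], [item["id"] for item in sample["data"]])
```

Deriving count from the list itself keeps the metadata from drifting out of sync with the data.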

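5.2 Writing the File

Write to a temporary file first and then swap it into place, so that a crash cannot leave a half-written output file:

import json
import os

def save_to_json(data, filename=SAVE_PATH):
    temp_file = f"{filename}.tmp"
    with open(temp_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

    # os.replace overwrites an existing target in one step (atomic on POSIX);
    # deleting the old file first would leave a window in which no file exists
    os.replace(temp_file, filename)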

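6. Putting It All Together

The main function coordinates the modules:

from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime
from urllib.parse import urljoin

def main():
    all_lectures = []
    page_urls = generate_page_urls()

    # Each worker sleeps 1 s between its own requests, so 3 workers can reach
    # about 3 requests/s in total; use max_workers=1 to stay at 1 request/s
    with ThreadPoolExecutor(max_workers=3) as executor:
        futures = []
        for url in page_urls:
            futures.append(executor.submit(process_list_page, url))

        for future in tqdm(as_completed(futures), total=len(futures)):
            all_lectures.extend(future.result())

    result = {
        "metadata": {
            "source": BASE_URL,
            "crawl_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "count": len(all_lectures)
        },
        "data": all_lectures
    }

    save_to_json(result)
    print(f"Collected {len(all_lectures)} lecture records")

def process_list_page(url):
    """Process a single list page."""
    html = safe_request(url)
    if not html:
        return []

    detail_links = parse_list_page(html)
    lectures = []

    for link in detail_links:
        detail_url = urljoin(url, link)  # hrefs on the list page may be relative
        detail_html = safe_request(detail_url)
        if detail_html:
            try:
                lecture = parse_detail_page(detail_html)
                lecture["url"] = detail_url  # matches the "url" field in section 5.1
                lectures.append(lecture)
            except Exception as e:
                # one malformed page should not abort the whole run
                logging.error(f"Parse failed: {detail_url} - {e}")
        time.sleep(1)  # polite delay

    return lectures

if __name__ == "__main__":
    main()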
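
When several worker threads each sleep independently, the combined request rate can exceed the one-request-per-second budget stated at the start. A shared limiter is one way to enforce a global interval. The sketch below is an assumption about how this could be done, not part of the original script, and the class name `RateLimiter` is made up for illustration.

```python
import threading
import time


class RateLimiter:
    """Allow at most one call per min_interval seconds across all threads."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0  # monotonic timestamp of the next free slot

    def wait(self):
        # Reserve the next slot under the lock, then sleep outside it so that
        # other threads can reserve their own slots in the meantime
        with self._lock:
            now = time.monotonic()
            scheduled = max(now, self._next_allowed)
            self._next_allowed = scheduled + self.min_interval
        delay = scheduled - now
        if delay > 0:
            time.sleep(delay)


limiter = RateLimiter(min_interval=1.0)
```

Calling `limiter.wait()` immediately before each `requests.get` would then space out requests globally, regardless of how many workers are running.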

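7. Common Problems and Solutions

7.1 Handling Failed Field Extraction

Common problems and how to handle them:

Symptom	Likely cause	Fix
Empty result list	CSS selector is out of date	Update the selector or fall back to a regex
Inconsistent time formats	Pages present the time differently	Match against several regex patterns
Speaker information missing	Inconsistent naming	Try extracting it from the title, or mark it as unknown

An enhanced extraction function: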

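import re

def extract_speaker(title):
    """Heuristically extract the speaker from a title; patterns are tried in order."""
    patterns = [
        r"主讲人[:：]\s*(.+?)\s*$",   # format 1: 主讲人：张教授
        r"】\s*([^:：]+?)\s*[:：]",   # format 2: 【学术讲座】张教授：... (as in section 5.1)
        r"【(.+?)】",                 # format 3: 【张教授】
        r"(.+?教授)"                  # format 4: 张教授讲座
    ]
    for pattern in patterns:
        match = re.search(pattern, title)
        if match:
            return match.group(1)
    return "未知主讲人"  # placeholder meaning "unknown speaker"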
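
For the "inconsistent time formats" row in the table above, one option is to try several patterns and normalize every match to a single format. The sketch below covers only two example layouts. Both the layouts and the function name `normalize_time` are assumptions, and real pages may need additional patterns.

```python
import re
from datetime import datetime

# Each pattern captures year, month, day, hour, minute in that order
TIME_PATTERNS = [
    # e.g. 2023-09-01 14:00 or 2023/9/1 14:00
    re.compile(r"(\d{4})[-/](\d{1,2})[-/](\d{1,2})\s+(\d{1,2})[:：](\d{2})"),
    # e.g. 2023年9月1日 14:00 or 2023年9月1日14：00
    re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日\s*(\d{1,2})[:：](\d{2})"),
]


def normalize_time(text):
    """Return the first recognized time in text as 'YYYY-MM-DD HH:MM', else None."""
    for pattern in TIME_PATTERNS:
        match = pattern.search(text)
        if match:
            year, month, day, hour, minute = map(int, match.groups())
            try:
                return datetime(year, month, day, hour, minute).strftime("%Y-%m-%d %H:%M")
            except ValueError:
                continue  # digits matched but do not form a valid date
    return None


print(normalize_time("时间：2023年9月1日14：00"))
print(normalize_time("2023/9/1 9:30"))
```

Normalizing at extraction time means later steps, such as validation or pandas parsing, only have to deal with one format.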

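7.2 Dealing with Anti-Scraping Measures

Although .edu.cn sites usually have weak anti-scraping defenses, it is still advisable to:

- Set a random User-Agent
- Wait 1-2 seconds between requests
- Retry automatically on 403 errors
- Use a proxy IP pool (only if large-scale collection is needed)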
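
The delay and retry suggestions in the list above can be expressed as small helper functions. This is a rough sketch with assumed names and values: 403 comes from the list above, while the other status codes in `RETRYABLE_STATUS` and the backoff parameters are illustrative choices rather than settings from the original script.

```python
import random

# 403 is from the list above; the remaining codes are an illustrative assumption
RETRYABLE_STATUS = {403, 429, 500, 502, 503, 504}


def polite_delay(low=1.0, high=2.0):
    """Pick a random pause between requests, in seconds."""
    return random.uniform(low, high)


def backoff_delay(attempt, base=2.0, cap=60.0):
    """Exponential backoff with jitter: a random wait of up to base * 2**attempt, capped."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def should_retry(status_code, attempt, max_retries=3):
    """Retry only for retryable status codes and while attempts remain."""
    return status_code in RETRYABLE_STATUS and attempt < max_retries


print(should_retry(403, attempt=0), should_retry(404, attempt=0))
```

A request loop would pass the result of `polite_delay()` or `backoff_delay(attempt)` to `time.sleep`. Randomizing the wait avoids sending requests at a perfectly regular rhythm.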

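Example proxy configuration:

PROXIES = {
    'http': 'http://user:pass@proxy_ip:port',
    # the key selects the target scheme; most HTTP proxies are themselves
    # reached over http://, even when the target URL is HTTPS
    'https': 'http://user:pass@proxy_ip:port'
}

resp = requests.get(url, headers=HEADERS, proxies=PROXIES, timeout=10)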

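8. Further Optimization

8.1 Scheduling

Add a scheduled job to collect automatically once a day. This uses the third-party schedule package, which is not in the earlier install command (pip install schedule):

import schedule
import time
from datetime import datetime

def daily_job():
    main()
    print(f"{datetime.now()} collection job finished")

schedule.every().day.at("02:00").do(daily_job)

while True:
    schedule.run_pending()
    time.sleep(60)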

8.2 Data Quality Monitoring

Add a validation step:

def validate_lecture(lecture):
    """Check that every required field is present and non-empty."""
    required_fields = ['title', 'time', 'location']
    return all(lecture.get(field) for field in required_fields)

# Filter the records in the main flow
valid_lectures = [lecture for lecture in all_lectures if validate_lecture(lecture)]
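
Beyond checking that fields are present, the validation could also confirm that the time string matches the format used in the output and report why a record was rejected. The following is one possible sketch. The function name `validation_errors` and the error messages are assumptions for illustration.

```python
from datetime import datetime


def validation_errors(lecture, time_format="%Y-%m-%d %H:%M"):
    """Return a list of problems found in one lecture record (empty if it is valid)."""
    errors = []
    for field in ("title", "time", "location"):
        value = lecture.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"missing {field}")

    time_value = lecture.get("time")
    if isinstance(time_value, str) and time_value.strip():
        try:
            datetime.strptime(time_value.strip(), time_format)
        except ValueError:
            errors.append(f"bad time format: {time_value!r}")
    return errors


# Sample record with a malformed time and no location, for demonstration only
print(validation_errors({"title": "Lecture A", "time": "Sept 1"}))
```

Logging the returned messages for rejected records makes it easier to notice when a site redesign breaks one of the extraction patterns.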

8.3 Visualization

Use pandas to produce basic statistics:

import pandas as pd
from matplotlib import pyplot as plt

df = pd.DataFrame(all_lectures)
df['time'] = pd.to_datetime(df['time'])

# Count lectures per month ('M' means month end; pandas 2.2+ prefers the alias 'ME')
monthly_count = df.resample('M', on='time').size()
monthly_count.plot(kind='bar')
plt.title('Lectures per month')
plt.savefig('lecture_stats.png')

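9. Summary and Usage Notes

Although this crawler is small in terms of code, it covers the full workflow from data collection to structured storage. In real use, keep the following in mind:

- Legality: check periodically whether robots.txt has changed
- Stability: watch for redesigns of the target site
- Using the data: the JSON can be imported into a database or a knowledge-graph system

A few practical tips from running it:

- Make BASE_URL a configuration parameter so the crawler can be adapted quickly to other departments' sites
- Add a --test flag for a test mode that only processes the first 2 pages
- Wrap each parsing step in try/except so that one failed record does not affect the rest

The complete project code is packaged as an executable script with command-line argument support:

python academic_spider.py --url https://cs.xxx.edu.cn/xsdt/list.htm --pages 5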
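
The packaged script itself is not shown here, so the following is only a guess at how its command-line interface could be wired with argparse. The --url and --pages options come from the command above and --test from the tips list. The defaults and the function name `parse_args` are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; argv=None means read from sys.argv."""
    parser = argparse.ArgumentParser(description="Collect academic lecture listings")
    parser.add_argument("--url", required=True, help="URL of the first list page")
    parser.add_argument("--pages", type=int, default=5, help="maximum number of list pages")
    parser.add_argument("--test", action="store_true", help="test mode: first 2 pages only")
    args = parser.parse_args(argv)
    if args.test:
        # Test mode caps the crawl at the first 2 pages
        args.pages = min(args.pages, 2)
    return args


args = parse_args(["--url", "https://cs.xxx.edu.cn/xsdt/list.htm", "--pages", "5", "--test"])
print(args.url, args.pages, args.test)
```

The parsed values would then replace BASE_URL and MAX_PAGE from config.py, which keeps the same script usable for other departments' sites.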