Python Web Scraping in Practice: Scraping and Analyzing GitHub's Daily Trending Data - 代码聚汇网

股海求生

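1. Project Overview

I recently built a fun little project: a Python scraper that collects GitHub's daily trending data. It grew out of a pain point in my day-to-day work. As a Python developer I like to keep an eye on popular Python projects on GitHub, but checking them by hand one at a time takes too long.

The scraper automatically collects the following key fields:

Repository name
Total stars
Stars gained today
Project description
Author (the owner part of the repository name)
Project URL

The data is then exported as CSV for later analysis and tracking. The process covers sending the request, parsing the page, cleaning the data, and storing it, which makes it a fairly complete scraping exercise.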

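2. Technology Choices and Setup

2.1 Why These Tools

For this project I chose the following stack:

Requests: a lightweight HTTP library, well suited to simple page fetching
BeautifulSoup: an HTML parser with a gentle learning curve
Pandas: convenient for data handling and CSV export

I did not use a framework such as Scrapy because the project is small and lightweight tools fit better. This combination of libraries is also friendlier to beginners.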

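2.2 Environment Setup

First make sure Python 3.6+ is installed, then install the dependencies with pip:

pip install requests beautifulsoup4 pandas

I recommend using a virtual environment to manage the dependencies. Create and activate it before running the pip command above:

python -m venv github-spider
source github-spider/bin/activate  # Linux/Mac
github-spider\Scripts\activate  # Windows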

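3. Core Scraper Implementation

3.1 Analyzing the Target Page Structure

A few points about the HTML structure of the GitHub trending page (https://github.com/trending/python) are worth noting:

Each repository is wrapped in an <article class="Box-row"> tag
The repository name is inside an <h1> tag (GitHub changes this markup from time to time, so confirm the heading level against the live page)
The star count is in an <a> tag whose href contains "stargazers"
Today's new stars are in a <span> tag, usually with a color style; its text ends with "today"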

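3.2 Request Layer

Start with a robust request function:

import requests
from time import sleep
from random import uniform

def fetch_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'text/html,application/xhtml+xml',
    }
    
    try:
        # Wait a random 1-3 seconds so requests are not sent too frequently
        sleep(uniform(1, 3))
        # Without a timeout, requests can wait indefinitely on a stalled connection
        response = requests.get(url, headers=headers, timeout=10)
        # Raise an exception for 4xx/5xx responses
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

A few key points here:

A realistic User-Agent is set
A random delay reduces the risk of being blocked
A timeout and error handling keep a failed request from hanging or crashing the script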

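3.3 Parsing Layer

Parse the HTML with BeautifulSoup:

from bs4 import BeautifulSoup

def parse_page(html):
    if not html:
        return []
    
    soup = BeautifulSoup(html, 'html.parser')
    projects = []
    
    for article in soup.find_all('article', class_='Box-row'):
        try:
            # Repository heading; accept <h1> or <h2> because the heading level may change
            heading = article.find(['h1', 'h2'])
            
            # Repository name, with all whitespace removed ("owner / repo" -> "owner/repo")
            repo = ''.join(heading.get_text().split())
            
            # Description (some repositories have none)
            description = article.find('p')
            description = description.get_text(strip=True) if description else "No description"
            
            # Total stars: the link whose href contains "stargazers"
            stars = article.find('a', href=lambda x: x and 'stargazers' in x)
            stars = stars.get_text(strip=True) if stars else "0"
            
            # Stars gained today: matched by its visible text, which ends with "today"
            stars_today = article.find(lambda tag: tag.name == 'span' and 'today' in tag.get_text())
            stars_today = stars_today.get_text(strip=True) if stars_today else "0"
            
            # Project URL, built from the relative link in the heading
            link = heading.find('a')
            url = f"https://github.com{link['href']}" if link else ""
            
            projects.append({
                'repository': repo,
                'description': description,
                'stars': stars,
                'stars_today': stars_today,
                'url': url
            })
        except Exception as e:
            # Skip a malformed entry instead of aborting the whole page
            print(f"Error while parsing a project: {e}")
            continue
    
    return projects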

4. Data Storage and Export

4.1 Data Cleaning

The data needs cleaning before it is stored:

import re

def clean_data(projects):
    for project in projects:
        # Remove thousands separators from the total, e.g. "1,024" -> "1024"
        project['stars'] = project['stars'].replace(',', '').strip()
        
        # Keep only the number from today's count, e.g. "1,234 stars today" -> "1234"
        match = re.search(r'\d[\d,]*', project['stars_today'])
        project['stars_today'] = match.group().replace(',', '') if match else "0"
    
    return projects
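
The cleaning step depends on what the raw strings look like, so it can help to see the conversion on its own. Below is a minimal sketch of a helper, here called parse_count (a hypothetical name, not part of the script above), that turns a star-count string into an integer. It assumes inputs such as "1,234" or "56 stars today". Abbreviated values such as "12.3k" are handled only as a precaution; that format is an assumption and the trending page may never show it.

```python
import re

def parse_count(text):
    """Convert a star-count string to an int; return 0 if no number is found."""
    # A number with optional thousands separators and decimals,
    # optionally followed directly by a k/m suffix (assumed format).
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)([kKmM]\b)?", text)
    if not match:
        return 0
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").lower()
    multiplier = {"k": 1_000, "m": 1_000_000}.get(suffix, 1)
    return int(round(number * multiplier))

for raw in ["1,234", "56 stars today", "12.3k", ""]:
    print(repr(raw), "->", parse_count(raw))
```

Returning 0 for text without digits keeps one odd entry from breaking a later numeric sort or a database insert.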

4.2 CSV Export

Export to CSV with Pandas:

import pandas as pd
from datetime import datetime

def save_to_csv(projects, filename=None):
    # Default to a file name that includes today's date
    if not filename:
        today = datetime.now().strftime('%Y-%m-%d')
        filename = f"github_trending_python_{today}.csv"
    
    df = pd.DataFrame(projects)
    # utf-8-sig writes a BOM so that Excel detects the encoding correctly
    df.to_csv(filename, index=False, encoding='utf-8-sig')
    print(f"Data saved to {filename}")

5. Full Workflow and Execution

5.1 Main Function

Put all the pieces together:

def main():
    url = "https://github.com/trending/python?since=daily"
    print(f"Fetching GitHub Python daily trending: {url}")
    
    html = fetch_page(url)
    if not html:
        print("Failed to fetch the page content")
        return
    
    projects = parse_page(html)
    if not projects:
        print("No project data was parsed")
        return
    
    projects = clean_data(projects)
    save_to_csv(projects)
    
    print(f"Successfully scraped {len(projects)} trending projects")

if __name__ == "__main__":
    main()

5.2 Sample Output

Running the script produces a CSV file like the following:

repository,description,stars,stars_today,url
owner/repo1,A great Python project,1024,56,https://github.com/owner/repo1
owner/repo2,Another Python tool,2048,128,https://github.com/owner/repo2
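
Once the CSV exists, a first pass at analysis can be very small. The sketch below ranks repositories by stars gained today and shows that gain as a share of the total. It uses the standard csv module rather than Pandas so that it runs without extra dependencies. The function name top_gainers is hypothetical, and the sample rows are the illustrative ones shown above, not real data.

```python
import csv
import io

def top_gainers(csv_text, n=3):
    """Return the n rows with the largest stars_today value."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        # CSV values are strings; convert the numeric columns before sorting
        row["stars"] = int(row["stars"])
        row["stars_today"] = int(row["stars_today"])
    return sorted(rows, key=lambda r: r["stars_today"], reverse=True)[:n]

sample = """repository,description,stars,stars_today,url
owner/repo1,A great Python project,1024,56,https://github.com/owner/repo1
owner/repo2,Another Python tool,2048,128,https://github.com/owner/repo2
"""

for row in top_gainers(sample):
    share = row["stars_today"] / row["stars"]
    print(f"{row['repository']}: +{row['stars_today']} today ({share:.1%} of total)")
```

When reading a file written with encoding='utf-8-sig', open it with the same encoding so the byte order mark does not end up in the first column name.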

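6. Common Problems and Solutions

6.1 Request Rejected or 403 Returned

Possible causes:

Requests are sent too frequently
The User-Agent is identified as a scraper

Solutions:

Increase the interval between requests
Rotate the User-Agent
Consider using requests.Session()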
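
The first two suggestions can be combined in a small retry wrapper. The sketch below is one possible shape, not part of the script above: fetch_with_retry and backoff_delays are hypothetical names, the USER_AGENTS pool is a placeholder, and fetch is assumed to be a callable that takes (url, headers) and returns None on failure. The sleep function is a parameter so the logic can be exercised without real waiting.

```python
import random
import time

# Hypothetical pool of User-Agent strings; replace with values you actually want to send.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def backoff_delays(retries=3, base=2.0, jitter=1.0):
    """Return wait times that double on each retry, plus random jitter."""
    return [base * (2 ** i) + random.uniform(0, jitter) for i in range(retries)]

def fetch_with_retry(fetch, url, retries=3, base=2.0, sleep=time.sleep):
    """Try once immediately, then up to `retries` more times with growing waits."""
    for delay in [0.0] + backoff_delays(retries, base):
        sleep(delay)
        # Pick a different User-Agent for each attempt
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        result = fetch(url, headers)
        if result is not None:
            return result
    return None
```

Doubling the wait after each failure backs off quickly when the server is rejecting requests, while the jitter avoids retrying on a fixed rhythm.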

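6.2 No Data Parsed

Possible causes:

The page structure has changed
HTML class names have been modified

Solutions:

Re-inspect the page structure
Use more general selectors, for example locating elements by their position in the tag hierarchy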

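6.3 Inconsistent Data Formats

Possible causes:

Different projects are displayed in different ways
Some fields may be missing

Solutions:

Add more robust data-cleaning logic
Set default values for fields that may be missing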
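
Default values are easiest to apply in one place. Here is a minimal sketch, assuming records are dictionaries with the keys used in the parsing step. DEFAULTS, normalize_project and is_valid are hypothetical names, and the default values themselves are only a suggestion.

```python
# Hypothetical defaults; the keys mirror the dictionaries built by parse_page.
DEFAULTS = {
    "repository": "",
    "description": "No description",
    "stars": "0",
    "stars_today": "0",
    "url": "",
}

def normalize_project(project):
    """Return a copy with every expected key present and no None or blank values."""
    normalized = dict(DEFAULTS)
    for key, value in project.items():
        # Ignore unknown keys, None and whitespace-only values
        if key in DEFAULTS and value is not None and str(value).strip():
            normalized[key] = str(value).strip()
    return normalized

def is_valid(project):
    """A record is only useful if it at least names a repository."""
    return bool(project["repository"])

record = normalize_project({"repository": " owner/repo1 ", "stars": None})
print(record, is_valid(record))
```

Running every record through one function like this before export means the CSV always has the same columns, even when a project has no description or the star count could not be found.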

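7. Further Improvements

7.1 Adding a Scheduled Job

APScheduler (installed with pip install apscheduler) can run the scraper automatically every day:

from apscheduler.schedulers.blocking import BlockingScheduler

scheduler = BlockingScheduler()

# Run main() once a day at 10:00
@scheduler.scheduled_job('cron', hour=10)
def daily_job():
    main()

# Blocks the current process and keeps the scheduler running
scheduler.start()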

7.2 Adding Database Support

Data collected over a long period can be stored in SQLite or MySQL:

import sqlite3
from datetime import datetime

def save_to_db(projects):
    conn = sqlite3.connect('github_trending.db')
    c = conn.cursor()
    
    # Create the table on first run
    c.execute('''CREATE TABLE IF NOT EXISTS projects
                 (date TEXT, repo TEXT, stars INT, stars_today INT)''')
    
    # Store one row per project, tagged with today's date
    today = datetime.now().strftime('%Y-%m-%d')
    for p in projects:
        c.execute("INSERT INTO projects VALUES (?,?,?,?)",
                 (today, p['repository'], p['stars'], p['stars_today']))
    
    conn.commit()
    conn.close()
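
One thing to watch with daily storage is that running the scraper twice on the same day inserts every project twice. A possible variation, sketched below under that assumption, adds a primary key on (date, repo) and uses INSERT OR REPLACE. The function names init_db, upsert and days_on_list are hypothetical, the rows are made-up sample values, and an in-memory database is used only for demonstration.

```python
import sqlite3

def init_db(conn):
    # PRIMARY KEY (date, repo) allows one row per repository per day.
    conn.execute("""CREATE TABLE IF NOT EXISTS projects
                    (date TEXT, repo TEXT, stars INT, stars_today INT,
                     PRIMARY KEY (date, repo))""")

def upsert(conn, date, repo, stars, stars_today):
    # INSERT OR REPLACE overwrites the existing row for the same (date, repo).
    conn.execute("INSERT OR REPLACE INTO projects VALUES (?,?,?,?)",
                 (date, repo, int(stars), int(stars_today)))

def days_on_list(conn):
    """Return (repo, number of days seen), most frequent first."""
    return conn.execute("""SELECT repo, COUNT(*) AS days FROM projects
                           GROUP BY repo ORDER BY days DESC, repo""").fetchall()

conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
init_db(conn)
upsert(conn, "2024-01-01", "owner/repo1", "1024", "56")
upsert(conn, "2024-01-01", "owner/repo1", "1030", "62")  # second run on the same day
upsert(conn, "2024-01-02", "owner/repo1", "1100", "70")
upsert(conn, "2024-01-02", "owner/repo2", "2048", "128")
print(days_on_list(conn))
conn.close()
```

CREATE TABLE IF NOT EXISTS does not change a table that already exists, so a database created with the earlier schema would have to be recreated for the primary key to take effect.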

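7.3 Adding Error Monitoring

Sentry or a custom logging setup can be used to monitor how the scraper is running:

import logging
from logging.handlers import RotatingFileHandler

def setup_logger():
    logger = logging.getLogger('github_spider')
    logger.setLevel(logging.INFO)
    
    # Attach the handler only once, so repeated calls do not duplicate log lines
    if not logger.handlers:
        # Rotate at about 1 MB and keep 3 old files
        handler = RotatingFileHandler('spider.log', maxBytes=1_000_000,
                                      backupCount=3, encoding='utf-8')
        formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    
    return logger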
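
A logger only helps if failures actually reach it. The sketch below shows one way to wrap a run so that any exception is written to the log with its traceback. The name run_safely is hypothetical, and job stands for any zero-argument callable such as the main function.

```python
import logging

def run_safely(job, logger):
    """Run job(); log the outcome and report whether it succeeded."""
    try:
        job()
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback
        logger.exception("Scraper run failed")
        return False
    logger.info("Scraper run finished")
    return True
```

Returning a boolean instead of re-raising lets a scheduled job keep running on later days even when a single run fails.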

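8. Notes on Compliant Use

Strictly follow GitHub's robots.txt rules
Set a reasonable request interval (at least 1 second is recommended)
Do not put excessive load on the server
Use the collected data for personal learning only
Do not scrape at large scale or use the data commercially

In real projects I have also found a few useful techniques:

Rotating proxy IPs can reduce the risk of being blocked
A cache avoids requesting the same content repeatedly
For important data, resumable scraping is well worth implementing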
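
As an illustration of the caching idea, here is a minimal file-based sketch. It assumes a fetch callable that takes a URL and returns the page text, or None on failure. cached_fetch, cache_dir and max_age are hypothetical names chosen for this example.

```python
import hashlib
import os
import time

def cached_fetch(url, fetch, cache_dir, max_age=3600):
    """Return the cached body for url if it is fresh, else call fetch(url) and store it."""
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the URL to get a safe, fixed-length file name
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(cache_dir, name)

    # Reuse the file if it was written less than max_age seconds ago
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < max_age:
        with open(path, encoding="utf-8") as f:
            return f.read()

    body = fetch(url)
    if body is not None:  # do not cache failed requests
        with open(path, "w", encoding="utf-8") as f:
            f.write(body)
    return body
```

Since the trending list is refreshed over the day, a short max_age is a reasonable starting point; the value here is only a placeholder.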

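This scraper is simple, but it covers the full flow from request to storage. Depending on your needs you can extend it further, for example with email notifications or data analysis and visualization.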
