Python Web Scraping in Practice: Scraping the UC Berkeley News Website - 代码聚汇网

橙心橙怡

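1. Project Overview

I recently worked on a small project to scrape the UC Berkeley news site (https://news.berkeley.edu/). The goal was to collect each article's title, body, author, publication time, URL, and an article snapshot. The project looks simple, but in practice it surfaced quite a few technical details and pitfalls worth sharing.

As a researcher who regularly needs to gather academic news, I found that organizing this information by hand was too slow, so I decided to automate it with a Python scraper. Below I walk through the whole development process: page analysis, code implementation, and the typical problems I ran into.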

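2. Page Analysis and Scraping Strategy

2.1 Site Structure

The Berkeley news site has a fairly clear structure. The home page shows a list of the latest news, and each item includes a title, summary, publication time, and image. Clicking a title opens the article's detail page.

Analyzing the pages with Chrome DevTools, I noted several key points:

The news list is paginated, controlled by a page parameter
Every news item is wrapped in an <article> tag
The content area of the detail page is identified by distinct CSS class names
Images are lazy-loaded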
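
As a minimal sketch of the pagination point, the snippet below only builds listing URLs. The `/news` path is taken from the complete example later in this article; the helper name `listing_url` is hypothetical and not part of the original scraper.

```python
from urllib.parse import urlencode, urljoin

ROOT_URL = 'https://news.berkeley.edu'

def listing_url(page, root_url=ROOT_URL, path='/news'):
    """Return the URL of one listing page (hypothetical helper)."""
    return urljoin(root_url, path) + '?' + urlencode({'page': page})

# Build the URLs of the first three listing pages
urls = [listing_url(p) for p in range(1, 4)]
for u in urls:
    print(u)
```

Building the query string with `urlencode` rather than string formatting keeps the URL valid if more parameters are added later.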

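2.2 Dealing with Anti-Scraping Measures

The site's anti-scraping measures are not strict, but a few points still need attention:

Set a reasonable User-Agent in the request headers
Handle cookies (although this project did not strictly require it)
Do not send requests too quickly; add a suitable delay between them
Images may be hotlink-protected, so set a Referer header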
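
One way to apply the delay advice is a small throttle object that is called before every request. This is a sketch under my own assumptions: the class name `Throttle` and the one-second interval are illustrative, and the User-Agent value is the same elided placeholder used elsewhere in this article.

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect min_interval, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

# Headers named in the list above; the User-Agent value is a placeholder.
HEADERS = {
    'User-Agent': 'Mozilla/5.0...',
    'Referer': 'https://news.berkeley.edu/',
}

throttle = Throttle(min_interval=1.0)
# Before each request: throttle.wait(), then requests.get(url, headers=HEADERS)
```

Unlike a fixed `time.sleep(1)`, the throttle subtracts the time already spent parsing the previous page, so the crawl is no slower than it has to be.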

3. Core Implementation

3.1 Base Scraper Class

I created a MitnewsScraper class to encapsulate all of the scraping logic. Its main parts are:

class MitnewsScraper:
    def __init__(self, root_url, model_url, img_output_dir):
        self.root_url = root_url  # site root URL
        self.model_url = model_url  # URL of the section (listing) to scrape
        self.img_output_dir = img_output_dir  # directory for saved images
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...',
            'Referer': 'https://news.berkeley.edu/'
        }

    def catalogue_all_pages(self):
        """Walk every page of the listing."""

    def parse_catalogues(self, params):
        """Parse a single listing page."""

    def parse_cards_list(self, url, catalogue_id, cardupdatetime, cardtitle):
        """Parse an article detail page."""

    def download_images(self, img_urls, card_id):
        """Download image resources."""

3.2 Paginated Crawling

The news list is paginated; inspecting the URLs shows that a page parameter selects the page:

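import re
import time

import requests
from bs4 import BeautifulSoup

def catalogue_all_pages(self):
    try:
        response = requests.get(self.model_url, headers=self.headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract the total page count from the page text ("... of N")
        match = re.search(r'of (\d+)', soup.text)
        if match is None:
            raise ValueError('no "of N" page marker found')
        num_pages = int(match.group(1))

        for page in range(1, num_pages + 1):
            self.parse_catalogues({'page': page})
            time.sleep(1)  # polite delay between listing pages
    except Exception as e:
        print(f'Failed to crawl listing pages: {e}')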
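
The page-count step can be isolated into a small function that is easy to check on its own. This is a sketch, not the scraper's actual code: the function name `extract_total_pages` and the sample strings are made up for illustration, and the real pager text on the site may look different.

```python
import re

def extract_total_pages(text, default=1):
    """Return N from the first '... of N' marker in text, or default if absent."""
    match = re.search(r'\bof\s+(\d+)\b', text)
    return int(match.group(1)) if match else default

print(extract_total_pages('Page 1 of 37'))   # sample pager text (made up)
print(extract_total_pages('No pager here'))  # falls back to the default
```

Because the pattern runs over whatever text it is given, a phrase such as "class of 2024" in a headline could match before the pager does. Passing only the pager element's text, rather than the whole page, would likely be safer.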

3.3 Parsing News Items

Each news item is an <article> tag. The key fields are extracted as follows:

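from datetime import datetime
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def parse_catalogues(self, params):
    response = requests.get(self.model_url, params=params, headers=self.headers, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')

    catalogue_list = soup.find('div', 'filtered-items')
    if catalogue_list is None:  # layout changed or empty page
        return
    for article in catalogue_list.find_all('article'):
        # Title: text of the link inside the description block
        title = article.find('div', 'news-item__description').find('a').get_text(strip=True)

        # Publication date from the <time datetime="..."> attribute
        pub_time = article.find('time').get('datetime')
        pub_date = datetime.strptime(pub_time, '%Y-%m-%d')

        # Detail-page link; urljoin also copes with absolute hrefs
        relative_url = article.find('a').get('href')
        absolute_url = urljoin(self.root_url, relative_url)

        # Continue with the detail page
        self.parse_cards_list(absolute_url, relative_url[1:], pub_date, title)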
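
Two details in this step are easy to get wrong: the `datetime` attribute may carry a time and timezone after the date, and the link may already be absolute. The helpers below are a hedged sketch of how both could be handled; the names `parse_pub_date` and `absolute_link` are hypothetical, and the sample values in the comments are invented.

```python
from datetime import datetime
from urllib.parse import urljoin

def parse_pub_date(value):
    """Parse a <time datetime="..."> value, keeping only the date part."""
    # The first 10 characters of an ISO 8601 timestamp are YYYY-MM-DD
    return datetime.strptime(value.strip()[:10], '%Y-%m-%d')

def absolute_link(root_url, href):
    """Resolve href against root_url; absolute hrefs pass through unchanged."""
    return urljoin(root_url, href)

# Works for a bare date and for a full timestamp (sample values)
print(parse_pub_date('2024-03-05'))
print(parse_pub_date('2024-03-05T09:30:00-08:00'))
print(absolute_link('https://news.berkeley.edu', '/2024/03/05/example/'))
```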

4. Extracting Detail-Page Content

4.1 Basic Fields

The detail page carries richer content and needs careful handling:

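import re

def parse_cards_list(self, url, catalogue_id, cardupdatetime, cardtitle):
    response = requests.get(url, headers=self.headers, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Author: first link whose href starts with /author/
    author_link = soup.find('a', href=re.compile(r'^/author/'))
    author = author_link.get_text(strip=True) if author_link else "Unknown"

    # Main content container
    content_div = soup.find('div', 'single-post cb-section cb-stretch')
    if content_div is None:
        print(f'No content container found: {url}')
        return

    # Remove elements that should not end up in the text
    for element in content_div.find_all(['script', 'style', 'iframe']):
        element.decompose()

    # Extract plain text, one line per text node
    content = content_div.get_text(separator='\n', strip=True)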
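
To show what the clean-up step does without needing a live page, here is a dependency-free sketch of the same idea. Note that it swaps BeautifulSoup for the standard library's `html.parser`, so it is an illustration of the technique rather than the scraper's code; the class name `TextExtractor` and the sample HTML are made up.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of unwanted tags."""

    SKIP = {'script', 'style', 'iframe'}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)

def html_to_text(html):
    """Return the visible text of an HTML fragment, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return '\n'.join(parser.chunks)

sample = ('<div><p>First paragraph.</p><script>var x = 1;</script>'
          '<p>Second <b>bold</b> one.</p></div>')
print(html_to_text(sample))
```

As with `get_text(separator='\n', strip=True)`, inline tags such as `<b>` split a sentence across lines, which is worth keeping in mind if the text is later fed to an NLP pipeline.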

4.2 Image Handling

Images in an article need separate handling: each one is downloaded and stored locally:

import os

def download_images(self, img_urls, card_id):
    # One subdirectory per article, named after the article ID
    article_dir = os.path.join(self.img_output_dir, card_id)
    os.makedirs(article_dir, exist_ok=True)

    downloaded = []
    for img_url in img_urls:
        try:
            # File name from the URL path, ignoring any query string
            filename = os.path.basename(img_url.split('?')[0])
            save_path = os.path.join(article_dir, filename)

            # Stream the image to disk in chunks
            with requests.get(img_url, stream=True, headers=self.headers, timeout=10) as r:
                r.raise_for_status()  # do not save 403/404 error pages as images
                with open(save_path, 'wb') as f:
                    for chunk in r.iter_content(8192):
                        f.write(chunk)
            downloaded.append(save_path)
        except Exception as e:
            print(f'Failed to download image {img_url}: {e}')

    return downloaded
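
Deriving the file name with `split('?')` works for typical URLs but yields an empty name when the path ends in a slash. The sketch below shows one possible, more defensive variant; the helper name `image_filename`, the hash fallback, and the example.org URLs are my own assumptions.

```python
import hashlib
import posixpath
from urllib.parse import unquote, urlparse

def image_filename(img_url):
    """Derive a local file name from an image URL (hypothetical helper)."""
    path = urlparse(img_url).path              # drops query string and fragment
    name = posixpath.basename(unquote(path))   # decode %20 etc., keep last segment
    if name in ('', '.', '..'):
        # No usable name in the URL: fall back to a short hash of the URL
        name = hashlib.sha1(img_url.encode('utf-8')).hexdigest()[:16]
    return name

print(image_filename('https://example.org/img/photo%20one.jpg?w=750'))
print(image_filename('https://example.org/img/'))
```

Two different images with the same base name would still overwrite each other inside one article directory; prefixing the hash to every name is one way to avoid that.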

5. Data Storage

5.1 MongoDB Design

I chose MongoDB to store the scraped data, with two collections:

catalogues: news list entries
cards: article detail content

from datetime import datetime

from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['berkeley-news']

# Store the list entry (the variables come from the parsing methods above)
catalogues_col = db['catalogues']
catalogue_data = {
    'id': catalogue_id,
    'title': title,
    'url': url,
    'publish_date': pub_date,
    'scrape_time': datetime.now()
}
# upsert=True inserts the document if no entry with this id exists yet
catalogues_col.update_one({'id': catalogue_id}, {'$set': catalogue_data}, upsert=True)

# Store the article detail
cards_col = db['cards']
card_data = {
    'id': catalogue_id,
    'title': cardtitle,
    'author': author,
    'content': content,
    'images': downloaded_images,
    'html_content': str(content_div),
    'url': url,
    'publish_date': cardupdatetime,
    'scrape_time': datetime.now()
}
cards_col.update_one({'id': catalogue_id}, {'$set': card_data}, upsert=True)
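
With `$set` alone, `scrape_time` is overwritten on every run, so the time an article was first seen is lost. MongoDB's `$setOnInsert` operator writes a field only when the upsert creates the document. The sketch below builds the two arguments for `update_one` without needing a database; the helper name `build_upsert` and the `first_seen` field are hypothetical additions, not part of the original schema, and it uses UTC timestamps where the original uses local time.

```python
from datetime import datetime, timezone

def build_upsert(doc, now=None):
    """Return (filter, update) arguments for update_one(..., upsert=True)."""
    now = now or datetime.now(timezone.utc)
    fields = {k: v for k, v in doc.items() if k != 'id'}
    fields['scrape_time'] = now
    update = {
        '$set': fields,                       # refreshed on every run
        '$setOnInsert': {'first_seen': now},  # written only when the document is created
    }
    return {'id': doc['id']}, update

flt, upd = build_upsert({'id': '2024/03/05/example', 'title': 'Example'})
print(flt)
print(sorted(upd))
# With a live collection: catalogues_col.update_one(flt, upd, upsert=True)
```

The `id` field does not need to appear in `$set`: when an upsert inserts a document, MongoDB copies the equality fields of the filter into it.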

5.2 Deduplication

To avoid scraping the same article twice, I added an ID-based check:

# Inside the per-article method: skip articles that are already stored
existing = catalogues_col.find_one({'id': catalogue_id})
if existing:
    print(f'Article {catalogue_id} already exists, skipping')
    return
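
The check above costs one database query per article. A possible alternative, sketched here under my own assumptions, is to load the known IDs once per run and partition each listing page in memory. The function name `split_new_and_seen` and the sample IDs are made up.

```python
def split_new_and_seen(candidate_ids, existing_ids):
    """Partition candidate IDs into (new, seen), preserving listing order."""
    existing = set(existing_ids)
    new, seen = [], []
    for article_id in candidate_ids:
        if article_id in existing:
            seen.append(article_id)
        else:
            new.append(article_id)
            existing.add(article_id)  # also catches repeats within the same batch
    return new, seen

# With a live collection the known IDs could be loaded once per run:
#   existing_ids = catalogues_col.distinct('id')
new_ids, seen_ids = split_new_and_seen(['a', 'b', 'a', 'c'], ['b'])
print(new_ids, seen_ids)
```

Either way, a unique index on the field (`catalogues_col.create_index('id', unique=True)` in pymongo) would let the database itself reject duplicates if two runs overlap.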

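6. Common Problems and Solutions

6.1 Requests Rejected

Symptom: the server returns a 403 status code or a verification page

Solutions:

Set reasonable request headers, especially User-Agent and Referer
Add a delay between requests
Use a Session to keep cookies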

import requests

# A Session reuses the connection and keeps cookies across requests
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0...',
    'Referer': 'https://news.berkeley.edu/'
})
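
When a rejection is caused by request rate rather than by headers, waiting and retrying with growing delays can help. The following is a minimal sketch of exponential backoff; the set of retryable status codes, the function names, and the default delays are my assumptions, not something the site documents.

```python
import random

# Status codes treated as "worth retrying later" (an assumption)
RETRY_STATUSES = {403, 429, 500, 502, 503, 504}

def should_retry(status_code):
    """Return True if a response with this status may succeed on a later attempt."""
    return status_code in RETRY_STATUSES

def backoff_delays(retries, base=1.0, cap=30.0, jitter=0.0, rng=random.random):
    """Yield `retries` delays: base, 2*base, 4*base, ... capped at `cap`.

    A random amount of up to `jitter` seconds is added to each delay.
    """
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))
        yield delay + jitter * rng()

print(list(backoff_delays(4)))

# Usage with the session above (not executed here):
#   for delay in backoff_delays(4, jitter=0.5):
#       response = session.get(url, timeout=10)
#       if not should_retry(response.status_code):
#           break
#       time.sleep(delay)
```

A 403 caused by a missing header will not go away with retries, so it is worth checking the headers first and keeping the retry count small.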

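6.2 Dynamically Loaded Content

Symptom: part of the content is loaded dynamically by JavaScript

Solutions:

Inspect the XHR requests to find the underlying API
Use a browser automation tool such as Selenium or Playwright
Fortunately, in this case the main content is all served as static HTML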

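6.3 Image Download Failures

Symptom: image requests return 403 or 404

Solutions:

Make sure the Referer header is set
Handle lazy-loaded images (the data-src attribute)
Add a retry mechanism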

def get_image_url(img_tag):
    # Lazy-loaded images keep the real URL in data-src; fall back to src
    return img_tag.get('data-src') or img_tag.get('src')
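
Lazy-loading scripts commonly put an inline `data:` placeholder in `src`, and image paths may be relative to the page. The variant below is a sketch that covers both cases. It takes a plain dict of attributes (with BeautifulSoup, `img_tag.attrs` provides one); the function name `pick_image_url` and the example.org URLs are hypothetical.

```python
from urllib.parse import urljoin

def pick_image_url(attrs, page_url):
    """Choose the real image URL from an <img> tag's attributes.

    attrs is a plain dict of attribute names to values.
    Returns an absolute URL, or None if no usable URL is present.
    """
    for key in ('data-src', 'src'):
        value = (attrs.get(key) or '').strip()
        if value and not value.startswith('data:'):
            return urljoin(page_url, value)
    return None

lazy = {'src': 'data:image/gif;base64,R0lGOD', 'data-src': '/img/a.jpg'}
print(pick_image_url(lazy, 'https://example.org/post/'))
```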

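7. Directions for Improvement

7.1 Performance

Use asynchronous requests (aiohttp) to speed up scraping
Distribute the crawl across multiple workers
Add caching to avoid downloading the same resource twice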
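
For the asynchronous option, the key design point is bounding concurrency so that the speed-up does not turn into a flood of requests. The sketch below shows that pattern with the standard library's `asyncio` only; `fake_fetch` is a stand-in I made up for a real HTTP call such as one made through an aiohttp session, and the limit of 5 is an arbitrary assumption.

```python
import asyncio

async def fetch_all(urls, fetch, max_concurrency=5):
    """Run fetch(url) for every URL, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))

async def fake_fetch(url):
    """Stand-in for a real HTTP request; sleeps briefly and returns a string."""
    await asyncio.sleep(0.01)
    return f'body of {url}'

page_urls = [f'https://example.org/news?page={i}' for i in range(1, 4)]
pages = asyncio.run(fetch_all(page_urls, fake_fetch))
print(len(pages))
```

Swapping `fake_fetch` for a coroutine that performs the real request leaves `fetch_all` unchanged, which keeps the concurrency policy in one place.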

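7.2 Feature Extensions

Add automatic categorization (keyword-based or machine learning)
Support scheduled runs and incremental updates
Expose an API for other systems to call

7.3 Robustness

Improve logging and error handling
Add monitoring and alerting
Support resuming an interrupted crawl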
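
Resuming an interrupted crawl can be as simple as recording the next listing page in a small file. This is a sketch under my own assumptions: the JSON layout, the `next_page` key, and both function names are hypothetical.

```python
import json
import os
import tempfile

def load_checkpoint(path, default_page=1):
    """Return the next page to scrape, or default_page if no valid checkpoint exists."""
    try:
        with open(path, encoding='utf-8') as f:
            return int(json.load(f)['next_page'])
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        return default_page

def save_checkpoint(path, next_page):
    """Atomically record the next page to scrape."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    with os.fdopen(fd, 'w', encoding='utf-8') as f:
        json.dump({'next_page': next_page}, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem
```

The loaded value could be passed as `start_page` to the listing loop, with `save_checkpoint(path, page + 1)` called after each page finishes. Writing to a temporary file and renaming it means a crash mid-write cannot leave a truncated checkpoint behind.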

8. Complete Code Example

Here is the consolidated core code:

import os
import re
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

class BerkeleyNewsScraper:
    def __init__(self, root_url, img_output_dir):
        self.root_url = root_url
        self.img_output_dir = img_output_dir
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0...',
            'Referer': 'https://news.berkeley.edu/'
        })

        # MongoDB connection
        self.client = MongoClient('mongodb://localhost:27017/')
        self.db = self.client['berkeley-news']

    def scrape_news_list(self, start_page=1):
        """Scrape the news listing page by page."""
        page = start_page
        while True:
            print(f'Processing page {page}...')
            url = f'{self.root_url}/news?page={page}'
            try:
                response = self.session.get(url, timeout=10)
                response.raise_for_status()

                soup = BeautifulSoup(response.text, 'html.parser')
                articles = soup.select('div.filtered-items article')
                if not articles:
                    break  # an empty page means we are past the last one

                for article in articles:
                    self.process_article(article)

                page += 1
                # Polite delay between pages
                time.sleep(1)

            except Exception as e:
                print(f'Error while processing page {page}: {e}')
                break
    
    def process_article(self, article):
        """Process a single news item from the listing."""
        try:
            title_elem = article.select_one('div.news-item__description a')
            title = title_elem.get_text(strip=True)
            relative_url = title_elem['href']
            absolute_url = self.root_url + relative_url
            article_id = relative_url.strip('/')

            time_elem = article.select_one('div.news-item__description time')
            pub_time = datetime.strptime(time_elem['datetime'], '%Y-%m-%d')

            # Skip articles that are already stored
            if self.db.catalogues.find_one({'id': article_id}):
                print(f'Article {article_id} already exists, skipping')
                return

            # Scrape the detail page
            detail_data = self.scrape_article_detail(absolute_url, article_id)

            # Save to MongoDB
            self.save_to_db(article_id, title, absolute_url, pub_time, detail_data)

        except Exception as e:
            print(f'Error while processing article: {e}')
    
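    def scrape_article_detail(self, url, article_id):
        """Scrape an article detail page."""
        response = self.session.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Author: first link under /author/, if any
        author_elem = soup.select_one('a[href^="/author/"]')
        author = author_elem.get_text(strip=True) if author_elem else "Unknown"

        # Main content container
        content_div = soup.select_one('div.single-post.cb-section')
        if content_div is None:
            raise ValueError(f'no content container found at {url}')

        # Remove elements that should not end up in the text
        for element in content_div.select('script, style, iframe, nav, footer'):
            element.decompose()

        # Collect image URLs (preferring the lazy-load data-src) and download them
        img_urls = [img.get('data-src') or img.get('src') for img in content_div.select('img')]
        img_urls = [u for u in img_urls if u]
        local_images = self.download_images(img_urls, article_id)

        return {
            'author': author,
            'content': content_div.get_text('\n', strip=True),
            'html_content': str(content_div),
            'images': local_images
        }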
    
    def download_images(self, img_urls, article_id):
        """Download images to a per-article local directory."""
        article_dir = os.path.join(self.img_output_dir, article_id)
        os.makedirs(article_dir, exist_ok=True)

        local_paths = []
        for img_url in img_urls:
            try:
                # File name from the URL path, ignoring any query string
                filename = os.path.basename(img_url.split('?')[0])
                save_path = os.path.join(article_dir, filename)

                with self.session.get(img_url, stream=True, timeout=10) as r:
                    r.raise_for_status()  # do not save error pages as images
                    with open(save_path, 'wb') as f:
                        for chunk in r.iter_content(8192):
                            f.write(chunk)
                local_paths.append(save_path)
            except Exception as e:
                print(f'Failed to download image {img_url}: {e}')

        return local_paths
    
    def save_to_db(self, article_id, title, url, pub_date, detail_data):
        """Save the list entry and the article detail to MongoDB."""
        # List entry
        list_data = {
            'id': article_id,
            'title': title,
            'url': url,
            'publish_date': pub_date,
            'scrape_time': datetime.now()
        }
        self.db.catalogues.update_one(
            {'id': article_id},
            {'$set': list_data},
            upsert=True
        )

        # Article detail, with the same fields as the cards schema in section 5.1
        detail_data.update({
            'id': article_id,
            'title': title,
            'url': url,
            'publish_date': pub_date,
            'scrape_time': datetime.now()
        })
        self.db.cards.update_one(
            {'id': article_id},
            {'$set': detail_data},
            upsert=True
        )

if __name__ == '__main__':
    scraper = BerkeleyNewsScraper(
        root_url='https://news.berkeley.edu',
        img_output_dir='./berkeley_images'
    )
    scraper.scrape_news_list()

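9. Practical Suggestions

Scheduling: use APScheduler or Celery to run the scraper on a schedule and pick up new articles daily
Data cleaning: add more thorough cleaning logic to strip ads, recommended-content blocks, and other noise
Content analysis: apply NLP techniques such as keyword extraction and sentiment analysis to the article text
Visualization: build a simple web interface with Flask or Django to present the results

Although this project is small, it covers the typical workflow and key techniques of web scraping. Working through it shows how to analyze page structure, handle anti-scraping measures, and design a storage scheme.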
