Python Web Scraping in Practice: Collecting News Site Data and Storing It in MongoDB

贵萌兄

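1. Project Overview

I have been practicing web scraping in Python recently and picked the University of Notre Dame news site (news.nd.edu) as a hands-on target. The goal is to crawl the site's news articles, extract the key fields (title, body, author, publication time), and store them in MongoDB. Images embedded in the articles are also downloaded and saved locally.

The scraper is modular and splits the crawl into three levels: module (a top-level grouping of news; in this project, a yearly archive), page (a news listing page), and article (a single news item). This layering keeps the code structure clear and makes later maintenance and extension easier.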

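2. Technology Choices and Preparation

2.1 Main Technology Stack

The project uses the following combination:

Requests: sends HTTP requests and fetches page content
BeautifulSoup: parses HTML documents and extracts the required data
PyMongo: connects to and operates on MongoDB
re: Python's regular expression module, used for string matching and processing
datetime: handles dates and times

The main reasons for choosing these libraries:

Requests plus BeautifulSoup is a classic pairing for Python scrapers, with a gentle learning curve and solid community support
MongoDB, as a NoSQL database, suits unstructured web content
Regular expressions take some effort to learn, but they are very efficient for complex text patterns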

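2.2 Development Environment

Before writing any code, make sure the following are in place:

Python 3.7+ (the complete example in section 8 calls datetime.fromisoformat, which was added in Python 3.7)

The required Python libraries are installed:

pip install requests beautifulsoup4 pymongo

The MongoDB service is running on the default port 27017
A local directory for storing images has been created (for example D://imgs//nd-news)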

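3. Page Analysis and Crawl Strategy

3.1 Structure of the Target Site

The Notre Dame news site has a fairly clear structure:

Home page: shows the latest news and the main categories
Archive pages: past news organized by year and month
Article pages: the full news content and its metadata

The site offers two ways to browse news:

By category
By date archive

After comparing the two, I chose to crawl through the date archives, because that route covers the news content more completely.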

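3.2 Crawl Logic

The whole crawl follows a three-level structure:

Module level: the news archive for one year
Page level: the paginated listing under a given year
Article level: the page of an individual news article

The benefits of this layering:

The logic is clear and easy to understand and maintain
The crawl range can be controlled flexibly (for example, only specific years)
After a failure, the crawl can resume from where it stopped instead of starting over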
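
The three levels map naturally onto three nested loops. The following is only a minimal sketch: `crawl`, `list_pages`, `list_articles`, and `only_years` are hypothetical names, the two callables stand in for the real fetch-and-parse steps, and the URLs are made-up placeholders rather than real archive addresses.

```python
from typing import Callable, Iterable, Iterator, Optional, Set

def crawl(modules: Iterable[str],
          list_pages: Callable[[str], Iterable[str]],
          list_articles: Callable[[str], Iterable[str]],
          only_years: Optional[Set[str]] = None) -> Iterator[str]:
    """Yield article URLs by walking module -> page -> article."""
    for module_url in modules:                       # module level: one year archive
        year = module_url.rstrip('/').rsplit('/', 1)[-1]
        if only_years is not None and year not in only_years:
            continue                                 # skip years outside the chosen range
        for page_url in list_pages(module_url):      # page level: one listing page
            for article_url in list_articles(page_url):  # article level
                yield article_url

# Stand-in data (hypothetical URLs) instead of real network calls
pages = {'https://example.org/archives/2023/': ['p1', 'p2'],
         'https://example.org/archives/2024/': ['p3']}
articles = {'p1': ['a1', 'a2'], 'p2': ['a3'], 'p3': ['a4']}

found = list(crawl(pages, pages.__getitem__, articles.__getitem__, only_years={'2024'}))
```

Passing the fetch steps in as callables keeps the traversal testable without touching the network.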

4. Core Implementation

4.1 Initializing the Scraper Class

First, create a scraper class that sets up the required parameters and request headers:

class MitnewsScraper:
    def __init__(self, root_url, model_url, img_output_dir):
        self.root_url = root_url  # site root URL
        self.model_url = model_url  # URL of the current module
        self.img_output_dir = img_output_dir  # directory where images are saved
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
            'Cookie': 'replace with your own'
        }

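4.2 Crawling Modules

A module corresponds to the news archive page for one year. Start by collecting all module URLs:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def run():
    root_url = 'https://news.nd.edu/'
    output_dir = 'D://imgs//nd-news'
    response = requests.get('https://news.nd.edu/news/archives/')
    soup = BeautifulSoup(response.text, 'html.parser')
    model_urls = []
    # The year links sit in the <ul> that carries these two classes
    model_url_array = soup.find('ul', 'archives-by-year archives-list').find_all('li')
    for item in model_url_array:
        # urljoin avoids a doubled slash when href already starts with '/'
        model_url = urljoin(root_url, item.find('a').get('href'))
        model_urls.append(model_url)
    # Crawl every year archive in turn
    for model_url in model_urls:
        scraper = MitnewsScraper(root_url, model_url, output_dir)
        scraper.catalogue_all_pages()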
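
How the module URL is assembled matters when the `href` is root-relative. The sketch below compares plain concatenation with `urllib.parse.urljoin`; the `href` values are assumptions for illustration, not values confirmed from the site.

```python
from urllib.parse import urljoin

root_url = 'https://news.nd.edu/'

# Plain concatenation doubles the slash when the href is root-relative
concatenated = root_url + '/news/archives/2023/'

# urljoin resolves root-relative, relative, and absolute hrefs consistently
joined_root_relative = urljoin(root_url, '/news/archives/2023/')
joined_relative = urljoin(root_url, 'news/archives/2023/')
joined_absolute = urljoin(root_url, 'https://example.org/a.jpg')

print(concatenated)
print(joined_root_relative)
```

The same applies to image `src` attributes, which may already be absolute URLs.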

4.3 Crawling Listing Pages

The news under each module (year) may span several pages, so the total page count has to be read first. The parse_catalogues method called here, which handles a single listing page, is not shown in this article.

def catalogue_all_pages(self):
    response = requests.get(self.model_url, headers=self.headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    try:
        # Look up the pagination links once and reuse them
        list_catalogues_page = soup.find('div', 'pagination').find_all('a')
        # The second-to-last link is taken to hold the last page number
        num_pages = int(list_catalogues_page[-2].get_text())
        print(f'{self.model_url}: this module has {num_pages} listing pages')
        for page in range(1, num_pages + 1):
            print(f"======== Start crawling listing page {page}/{num_pages} ========")
            self.parse_catalogues(page)
            print(f"======== Finished listing page {page}/{num_pages} ========")
    except Exception as e:
        print(f'Error: {e}')
        traceback.print_exc()  # needs `import traceback` at module level
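
Reading the page count from the second-to-last link depends on the pagination layout (typically a trailing "Next" link). As a hedged alternative, the sketch below takes the largest numeric label among the link texts; `last_page_number` is a hypothetical helper and the sample labels are invented.

```python
from typing import Iterable

def last_page_number(link_texts: Iterable[str]) -> int:
    """Return the largest numeric label among pagination links, or 1 if none."""
    numbers = [int(t.strip()) for t in link_texts if t.strip().isdecimal()]
    return max(numbers, default=1)

# Invented pagination labels, as they might come from a.get_text() calls
labels = ['1', '2', '3', '...', '12', 'Next']
total_pages = last_page_number(labels)
```

With BeautifulSoup this would be fed with `[a.get_text() for a in pagination.find_all('a')]`.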

4.4 Crawling Articles

For each article on a listing page, extract the key fields and store them in the database:

# Needs at module level: from datetime import datetime,
# from urllib.parse import urljoin, from pymongo import MongoClient
def parse_cards_list(self, url, catalogue_id, cardupdatetime, cardtitle):
    card_response = requests.get(url, headers=self.headers)
    soup = BeautifulSoup(card_response.text, 'html.parser')

    # Basic article fields (the ID passed in is stored as both id and catalogueId)
    card_id = catalogue_id
    card_title = cardtitle
    updateTime = cardupdatetime
    date = datetime.now()  # crawl time
    author = soup.find('article', 'article span-md-2').find('p', 'author').find('span', property='name').get_text()

    # Article body
    html_dom = soup.find('article', 'article span-md-2')
    # Remove share widgets and profile blocks; match by class on any tag,
    # since 'section-profile' is a class name rather than a tag name
    for element in html_dom.find_all(class_=['meta-share-group', 'social-share', 'section-profile']):
        element.decompose()

    # Collect image URLs and download them
    imgs = []
    img_array = soup.find('div', 'article-content entry-content').find_all('img')
    for item in img_array:
        # urljoin handles relative and absolute src values alike
        imgs.append(urljoin(self.root_url, item.get('src')))
    illustrations = self.download_images(imgs, card_id) if imgs else []

    # Store in MongoDB
    client = MongoClient('mongodb://localhost:27017/')
    db = client['nd-news']
    cards_collection = db['cards']

    card_data = {
        'id': card_id,
        'catalogueId': catalogue_id,
        'type': 'nd-news',
        'date': date,
        'title': card_title,
        'author': author,
        'updatetime': updateTime,
        'url': url,
        'html_content': str(html_dom),
        'content': self.clean_content(html_dom.get_text()),
        'illustrations': illustrations,
    }

    # Avoid inserting the same article twice
    if not cards_collection.find_one({'id': card_id}):
        cards_collection.insert_one(card_data)
        print(f"[article] {url} saved successfully")
    else:
        print(f"[article] {url} already exists, skipping")
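
The find-then-insert check above takes two round trips and can race if several workers run at once. One possible alternative, sketched here under assumptions, is a single upsert with MongoDB's `$setOnInsert` operator. `build_upsert` is a hypothetical helper and the sample document is invented; only the filter and update documents are built, so the snippet runs without a database.

```python
from typing import Any, Dict, Tuple

def build_upsert(card: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Build the (filter, update) pair for an insert-if-absent write."""
    query = {'id': card['id']}
    # $setOnInsert applies only when the upsert creates a new document;
    # the 'id' field itself is filled in from the equality filter
    fields = {k: v for k, v in card.items() if k != 'id'}
    return query, {'$setOnInsert': fields}

query, update = build_upsert({'id': 'some-article', 'title': 'Example'})

# With a live collection this pair would be passed as:
#   cards_collection.update_one(query, update, upsert=True)
# A unique index makes the guarantee explicit:
#   cards_collection.create_index('id', unique=True)
```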

5. Data Processing and Storage

5.1 Data Cleaning

Before storage, the text extracted from the HTML needs to be cleaned:

def clean_content(self, content):
    if not content:
        return ''

    # Collapse runs of whitespace into a single space
    content = re.sub(r'\s+', ' ', content)
    # Remove specific leftover strings
    content = content.replace('![](../../../image/zxbl.gif)', '')
    content = content.replace('![](****处理标记%ef%bc%9a[Article]时，%20字段%20[SnapUrl]%20在数据源中没有找到!%20****)', '')
    # Remove HTML comments
    content = re.sub(r'<!--.*?-->', '', content)
    return content.strip()
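
To make the effect of the whitespace step concrete, here is a small standalone version applied to an invented sample string. `clean_text` is a hypothetical name, and it covers only the whitespace collapsing and trimming, not the site-specific replacements.

```python
import re

def clean_text(content: str) -> str:
    """Collapse whitespace runs and trim, mirroring the core of clean_content."""
    if not content:
        return ''
    return re.sub(r'\s+', ' ', content).strip()

sample = '  Campus   news\n\n\tfrom   the archive  '
cleaned = clean_text(sample)
print(repr(cleaned))
```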

5.2 Image Download

The image download step has to handle URL joining and local storage:

def download_images(self, img_urls, card_id):
    # Use the last path segment of card_id as the directory name
    last_word = re.search(r'[^/]+$', card_id).group(0)
    images_dir = os.path.join(self.img_output_dir, last_word)

    if not os.path.exists(images_dir):
        os.makedirs(images_dir)

    downloaded_images = []
    for img_url in img_urls:
        try:
            response = requests.get(img_url, stream=True, headers=self.headers)
            if response.status_code == 200:
                # Take the file name from the URL, dropping any query string
                img_name = re.search(r'^[^?]*', img_url.split('/')[-1]).group(0)
                # Save the image
                with open(os.path.join(images_dir, img_name), 'wb') as f:
                    f.write(response.content)
                downloaded_images.append({
                    'url': img_url,
                    'local_path': os.path.join(images_dir, img_name)
                })
                print(f'[image] {img_name} downloaded')
        except Exception as e:
            print(f'[image] error while downloading {img_url}: {e}')

    return downloaded_images
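
Deriving the file name with string splitting works for simple URLs. A slightly more defensive variant, sketched below with the hypothetical helper `image_filename`, uses `urllib.parse` to drop the query and fragment and to decode percent-escapes. The URLs are invented examples.

```python
import posixpath
from urllib.parse import unquote, urlsplit

def image_filename(img_url: str, fallback: str = 'image') -> str:
    """Derive a local file name from an image URL, ignoring query and fragment."""
    path = urlsplit(img_url).path                # drops ?query and #fragment
    name = posixpath.basename(unquote(path))     # decode %20 etc., keep the last segment
    return name or fallback                      # URLs ending in '/' have no file name

name_with_query = image_filename('https://example.org/a/b/photo.jpg?w=600#top')
name_with_space = image_filename('https://example.org/a/my%20pic.png')
name_missing = image_filename('https://example.org/')
```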

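5.3 Database Design

Two collections are defined in MongoDB:

catalogues: stores listing page information
- id: unique identifier of the listing page
- date: crawl time
- title: listing page title
- url: listing page URL
- cardSize: number of articles it contains
- updatetime: time the listing page was updated
cards: stores article details
- id: article ID
- catalogueId: ID of the listing page it belongs to
- type: article type
- date: crawl time
- title: article title
- author: author
- updatetime: article publication time
- url: article URL
- html_content: original HTML content
- content: cleaned text content
- illustrations: image information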
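
As an illustration of the `cards` layout, the sketch below builds one document with invented values and checks it for empty required fields. Which fields count as required is an assumption made here, and `missing_fields` is a hypothetical helper.

```python
from datetime import datetime
from typing import Any, Dict, List

# Assumption: these are the fields an article must not be missing
REQUIRED_CARD_FIELDS = ('id', 'catalogueId', 'title', 'url', 'content')

def missing_fields(card: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent or empty in a cards document."""
    return [field for field in REQUIRED_CARD_FIELDS if not card.get(field)]

# Invented example values following the field list above
example_card = {
    'id': 'example-article',
    'catalogueId': 'example-catalogue',
    'type': 'nd-news',
    'date': datetime(2024, 1, 1, 12, 0),
    'title': 'Example headline',
    'author': 'Example Author',
    'updatetime': 'January 01, 2024',
    'url': 'https://example.org/news/example-article/',
    'html_content': '<article>Example body</article>',
    'content': 'Example body',
    'illustrations': [],
}

problems = missing_fields(example_card)
```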

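6. Common Problems and Solutions

6.1 Dealing with Anti-Scraping Measures

During a real crawl you may run into the following anti-scraping measures, each with possible countermeasures:

Request rate limiting:
- Add a random delay between requests: time.sleep(random.uniform(1, 3))
- Rotate requests through a pool of proxy IPs
User-Agent checks:
- Prepare several common User-Agent strings and pick one at random
- Keep the request headers consistent with those of an ordinary browser
Cookie validation:
- Refresh the Cookie regularly
- Simulate a login to obtain a valid session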
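
The random delay and User-Agent rotation can be packaged as two small helpers. This is a sketch under assumptions: `polite_headers` and `next_delay` are hypothetical names, and the second User-Agent string is just another common browser signature chosen for illustration.

```python
import random
from typing import Dict, Sequence

USER_AGENTS = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
)

def polite_headers(agents: Sequence[str] = USER_AGENTS) -> Dict[str, str]:
    """Build request headers with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(agents), 'Accept-Language': 'en-US,en;q=0.9'}

def next_delay(low: float = 1.0, high: float = 3.0) -> float:
    """Pick a random pause length in seconds between two requests."""
    return random.uniform(low, high)

headers = polite_headers()
pause = next_delay()
```

In the scraper, each request would then be preceded by `time.sleep(next_delay())` and sent with `headers=polite_headers()`.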

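6.2 Data Integrity

A few key points for keeping the data complete:

Exception handling: wrap every network request and parsing step in try/except
Resumable crawling: record the URLs already crawled so the program can continue after a restart
Data validation: check that key fields are present, for example that the title and body are not empty
Deduplication: avoid storing duplicates, keyed on the URL or the article ID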
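
Resumable crawling needs some record of what has already been processed. One simple option, sketched here with the hypothetical class `SeenUrls`, is a JSON file that is reloaded on startup; the file name is an arbitrary choice.

```python
import json
import os
from typing import Set

class SeenUrls:
    """Persist crawled URLs to a JSON file so a restart can skip them."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.urls: Set[str] = set()
        if os.path.exists(path):
            # Reload the record left behind by a previous run
            with open(path, encoding='utf-8') as f:
                self.urls = set(json.load(f))

    def __contains__(self, url: str) -> bool:
        return url in self.urls

    def add(self, url: str) -> None:
        self.urls.add(url)
        # Rewrite the whole file on each add; simple, and fine for modest crawls
        with open(self.path, 'w', encoding='utf-8') as f:
            json.dump(sorted(self.urls), f)
```

A crawl loop would check `if url in seen: continue` before fetching and call `seen.add(url)` after a successful save.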

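6.3 Performance Suggestions

When a large amount of data has to be crawled, consider the following optimizations:

Multithreaded or asynchronous crawling: use concurrent.futures or asyncio to raise throughput
Connection pooling: reuse HTTP connections to cut handshake overhead
Incremental crawling: fetch only new or updated content
Distributed crawling: use a framework such as Scrapy-Redis to distribute the work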
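
For the multithreading option, `concurrent.futures.ThreadPoolExecutor` is often enough for I/O-bound fetching. The sketch below uses a stand-in `fake_fetch` function instead of real HTTP calls; `fetch_all` and `fake_fetch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable

def fetch_all(urls: Iterable[str], fetch: Callable[[str], str],
              max_workers: int = 4) -> Dict[str, str]:
    """Run fetch(url) for every URL on a small thread pool and collect the results."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with url_list
        return dict(zip(url_list, pool.map(fetch, url_list)))

def fake_fetch(url: str) -> str:
    """Stand-in for a real download such as requests.get(url).text."""
    return f'<html>{url}</html>'

pages = fetch_all(['a', 'b', 'c'], fake_fetch)
```

Keeping `max_workers` small also keeps the load on the target site modest.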

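7. Extending and Improving the Project

7.1 Possible Feature Extensions

The current scraper can be extended further:

Scheduled jobs: add scheduled crawling to pick up the latest news automatically
Content analysis: run keyword extraction, sentiment analysis, and similar tasks on the crawled text
Visualization: generate charts and statistics from the crawled data
API: expose a RESTful API so other systems can query the data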

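7.2 Code Improvement Suggestions

Separate configuration: move the database connection, request headers, and similar settings into their own file
Logging: replace print with the logging module to make troubleshooting easier
Unit tests: add test cases for the key functions
Type hints: add Python type annotations to improve readability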
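
As one way to replace `print` with the `logging` module, the sketch below sets up a named logger with a timestamped format. The logger name and the format string are arbitrary choices, and `build_logger` is a hypothetical helper.

```python
import logging

def build_logger(name: str = 'news_scraper', level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes timestamped messages to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
        logger.addHandler(handler)
    return logger

log = build_logger()
log.info('saved article %s', 'some-article-id')
log.warning('image download failed for %s', 'https://example.org/a.jpg')
```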

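7.3 Legal and Ethical Considerations

Points to keep in mind when building a scraper:

robots.txt: follow the crawling rules published by the target site
Data use: personal study only, no commercial use
Request rate: throttle requests so the target site is not burdened
Privacy: do not crawl or store users' personal information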
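
Python's standard library can evaluate robots.txt rules through `urllib.robotparser`. The sketch below parses a hypothetical robots.txt body inline rather than fetching one, so the rules, URLs, and user agent name are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a network fetch
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

agent = 'news-scraper-demo'  # hypothetical user agent name
allowed_news = parser.can_fetch(agent, 'https://example.org/news/archives/')
allowed_private = parser.can_fetch(agent, 'https://example.org/private/report')
delay = parser.crawl_delay(agent)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would load the real file before the same checks.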

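8. Complete Code Example

Below is the consolidated scraper. Note that this version renames the class to NewsScraper, stores articles in an articles collection, and does not populate catalogues: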

import os
import re
import time
import random
import traceback
from datetime import datetime
from urllib.parse import urljoin
from pymongo import MongoClient
import requests
from bs4 import BeautifulSoup

class NewsScraper:
    def __init__(self, root_url, img_output_dir, db_config):
        self.root_url = root_url
        self.img_output_dir = img_output_dir
        self.db_config = db_config
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
        }
        # One shared client for the whole crawl
        self.client = MongoClient(db_config['uri'])
        self.db = self.client[db_config['db_name']]

    def get_all_modules(self):
        """Return the URLs of all modules (year archives)."""
        response = requests.get(f'{self.root_url}news/archives/', headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        # urljoin avoids a doubled slash when href is root-relative
        return [urljoin(self.root_url, li.find('a')['href'])
                for li in soup.find('ul', 'archives-by-year').find_all('li')]

    def scrape_module(self, module_url):
        """Crawl a single module and return the number of pages processed."""
        response = requests.get(module_url, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Read the total page count from the pagination links
        pagination = soup.find('div', 'pagination')
        if not pagination:
            # No pagination: treat the module page itself as the only listing page
            self.scrape_page(module_url)
            return 1

        last_page = int(pagination.find_all('a')[-2].get_text())

        for page in range(1, last_page + 1):
            time.sleep(random.uniform(1, 3))  # random delay between requests
            # rstrip avoids a doubled slash when module_url ends with '/'
            self.scrape_page(f"{module_url.rstrip('/')}/page/{page}")

        return last_page

    def scrape_page(self, page_url):
        """Crawl a single listing page."""
        response = requests.get(page_url, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')

        for article in soup.find('ol', 'no-bullets').find_all('li'):
            article_url = urljoin(self.root_url, article.find('h2').find('a')['href'])
            self.scrape_article(article_url)

    def scrape_article(self, article_url):
        """Crawl a single article."""
        try:
            response = requests.get(article_url, headers=self.headers)
            soup = BeautifulSoup(response.text, 'html.parser')

            # Article metadata; the trailing slash in the URL is optional
            article_id = re.search(r'/news/(.+?)/?$', article_url).group(1)
            title = soup.find('h1').get_text(strip=True)
            # datetime.fromisoformat requires Python 3.7+
            publish_time = datetime.fromisoformat(soup.find('time')['datetime'])
            author = soup.find('span', property='name').get_text(strip=True)

            # Article body: drop elements that carry no readable content
            content_div = soup.find('div', 'article-content')
            for elem in content_div.find_all(['script', 'style', 'iframe']):
                elem.decompose()

            # Download images; urljoin handles relative and absolute src values
            imgs = [urljoin(self.root_url, img['src']) for img in content_div.find_all('img')]
            illustrations = self.download_images(imgs, article_id) if imgs else []

            # Store in the database
            article_data = {
                'id': article_id,
                'title': title,
                'author': author,
                'publish_time': publish_time,
                'url': article_url,
                'content': self.clean_content(content_div.get_text()),
                'html_content': str(content_div),
                'illustrations': illustrations,
                'crawl_time': datetime.now()
            }

            # Skip articles that are already stored
            if not self.db.articles.find_one({'id': article_id}):
                self.db.articles.insert_one(article_data)
                print(f'[ok] article {title} saved')

        except Exception as e:
            print(f'[error] failed while processing {article_url}: {str(e)}')
            traceback.print_exc()
    
    def download_images(self, img_urls, article_id):
        """Download the images of one article."""
        image_dir = os.path.join(self.img_output_dir, article_id)
        os.makedirs(image_dir, exist_ok=True)

        downloaded = []
        for img_url in img_urls:
            try:
                # File name is the last URL segment without the query string
                img_name = os.path.basename(img_url.split('?')[0])
                img_path = os.path.join(image_dir, img_name)

                # Stream the response to disk in chunks
                with requests.get(img_url, stream=True, headers=self.headers) as r:
                    r.raise_for_status()
                    with open(img_path, 'wb') as f:
                        for chunk in r.iter_content(chunk_size=8192):
                            f.write(chunk)

                downloaded.append({
                    'url': img_url,
                    'local_path': img_path,
                    'filename': img_name
                })
                print(f'[image] {img_name} downloaded')

            except Exception as e:
                print(f'[image error] failed to download {img_url}: {str(e)}')

        return downloaded

    def clean_content(self, text):
        """Clean the text content."""
        text = re.sub(r'\s+', ' ', text)  # collapse whitespace
        text = re.sub(r'<!--.*?-->', '', text)  # remove HTML comments
        return text.strip()

    def run(self):
        """Start the scraper."""
        modules = self.get_all_modules()
        print(f'Found {len(modules)} modules')

        for module_url in modules:
            print(f'Start crawling module: {module_url}')
            page_count = self.scrape_module(module_url)
            print(f'Done, processed {page_count} pages')

if __name__ == '__main__':
    config = {
        'root_url': 'https://news.nd.edu/',
        'img_output_dir': 'D:/imgs/nd-news',
        'db_config': {
            'uri': 'mongodb://localhost:27017/',
            'db_name': 'nd-news'
        }
    }

    scraper = NewsScraper(**config)
    scraper.run()
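
One detail worth hedging in the code above is `datetime.fromisoformat`: it only accepts a trailing `Z` (UTC) from Python 3.11 onward. Whether the site's `<time datetime="...">` values use that form is not confirmed here. A small wrapper such as the hypothetical `parse_publish_time` below normalizes it either way.

```python
from datetime import datetime

def parse_publish_time(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    value = value.strip()
    if value.endswith('Z'):
        # Older Python versions reject 'Z', so spell out the UTC offset
        value = value[:-1] + '+00:00'
    return datetime.fromisoformat(value)

utc_time = parse_publish_time('2024-03-01T12:00:00Z')
offset_time = parse_publish_time('2024-03-01T12:00:00-05:00')
naive_time = parse_publish_time('2024-03-01T12:00:00')
```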

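9. Practical Recommendations

9.1 Deployment and Running

Requirements:
- Python 3.7+ (needed for datetime.fromisoformat)
- MongoDB 4.0+
- Enough disk space for the images and data
How to run:
```
python news_scraper.py
```
Monitoring and maintenance:
- Check the crawl logs regularly
- Monitor the database storage usage
- Update the User-Agent and Cookie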

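9.2 Examples of Using the Data

The crawled data can be used for:

News analysis: topic trends, keyword extraction
Content aggregation: building a news archive
Research: media studies, communication analysis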
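
As a taste of keyword extraction, the sketch below counts word frequencies in a cleaned `content` string using only the standard library. The stop-word list and the sample sentence are invented for illustration, `top_keywords` is a hypothetical helper, and real analysis would likely use a dedicated NLP library.

```python
import re
from collections import Counter
from typing import List, Tuple

# A deliberately tiny stop-word list, for illustration only
STOP_WORDS = frozenset({'the', 'a', 'an', 'and', 'of', 'to', 'in', 'for', 'on', 'is'})

def top_keywords(text: str, n: int = 3) -> List[Tuple[str, int]]:
    """Return the n most frequent words, skipping stop words and very short words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(n)

sample = 'Research funding grows. Research teams expand, and funding for research follows.'
keywords = top_keywords(sample, 2)
```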

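9.3 Things to Keep in Mind

Respect copyright: use the data only for personal study and research
Control the rate: avoid high-frequency requests that affect the target site
Back up data: back up important data regularly
Stay compliant: follow the applicable laws, regulations, and site terms

Through this project I practiced each stage of building a Python scraper in a systematic way, from page analysis and data extraction to storage and optimization. The layered architecture can be adapted flexibly to data collection on other, similar sites.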