Building a News Crawler with Scrapy: Development Practice and Optimization Strategies

三月Moon

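1. Project Background and Goals

While helping a friend collect news coverage about Notre-Dame de Paris, I found that copying and pasting articles by hand from one news site after another was far too slow. As someone who works in tech, I naturally thought of solving the problem with a crawler. This project designs a lightweight, focused crawler around the structure of news sites, dedicated to collecting coverage related to Notre-Dame de Paris.

The crawler needs several core capabilities: recognizing the page structure of a news site, extracting the article body accurately, filtering out ads and other noise automatically, and coping with each site's particular anti-scraping measures. The end goal is an extensible news collection system that can gather articles on a given topic with only simple configuration.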

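2. Technology Choices and Tooling

2.1 Choosing a Crawler Framework

After comparing several mainstream crawler frameworks, I chose Scrapy as the foundation, for these reasons:

Scrapy's built-in Selector handles the DOM structure of news sites well
The built-in middleware system makes it easy to add anti-blocking strategies later
The mature Pipeline mechanism suits cleaning and storing news data
An active community and a rich plugin ecosystem

Installation is simple:

pip install scrapy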

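2.2 Supporting Packages

Besides Scrapy itself, a few supporting packages are needed:

newspaper3k: an extraction library built specifically for news content
dateparser: parses the varied formats of publication timestamps
langdetect: detects the language of an article (the crawl covers coverage in several languages)

Install them with:

pip install newspaper3k dateparser langdetect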

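3. Core Crawler Implementation

3.1 Site Analysis and Extraction Rules

The first step is to analyze the structure of the target news site. Taking the BBC as an example:

Search page: https://www.bbc.com/search?q=Notre+Dame
Article page: https://www.bbc.com/news/world-europe-47942275

Inspecting the pages with the browser developer tools shows the following (class names of the ssrcss-... form look auto-generated, so re-verify them before relying on them):

Each article in the search results is wrapped in <div class="ssrcss-1ocoo3l-Wrap e42f8511">
The body text sits mainly inside the <article> tag
The publication time is in the datetime attribute of the <time> tag

From this, the corresponding XPath rules are:

title_xpath = '//h1/text()'
content_xpath = '//article//p/text()'
time_xpath = '//time/@datetime'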
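One detail of these rules is easy to miss: in XPath, a p/text() step selects only the text nodes that are direct children of each <p>, so words wrapped in nested tags such as <b> or <a> are skipped. The sketch below illustrates the effect with the standard library's ElementTree instead of Scrapy's Selector, purely so it runs without extra packages. The sample markup is hypothetical and far simpler than a real BBC page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article markup (well-formed so ElementTree can parse it).
SAMPLE = """
<html><body>
  <h1>Notre Dame fire</h1>
  <time datetime="2019-04-16T08:00:00.000Z">16 April 2019</time>
  <article>
    <p>The spire <b>collapsed</b> on Monday.</p>
    <p>Firefighters saved the towers.</p>
  </article>
</body></html>
"""


def extract_fields(html):
    """Return title, publication timestamp and paragraph texts from the sample markup."""
    root = ET.fromstring(html)
    title = root.findtext('.//h1')
    published = root.find('.//time').get('datetime')
    paragraphs = root.findall('.//article//p')
    # p.text holds only the text before the first nested tag.
    leading = [p.text or '' for p in paragraphs]
    # itertext() walks nested tags too, so no words are lost.
    full = [''.join(p.itertext()) for p in paragraphs]
    return title, published, leading, full


title, published, leading, full = extract_fields(SAMPLE)
print(title, published)
print(leading[0])
print(full[0])
```

In Scrapy itself, a rule such as '//article//p//text()' (note the double slash before text()) would also pick up the nested text, at the cost of returning more fragments to join.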

3.2 Scrapy Spider Implementation

Create a basic Scrapy project:

scrapy startproject notre_dame_news
cd notre_dame_news
scrapy genspider bbc_news bbc.com

The core spider class:

import scrapy
from newspaper import Article
from dateparser import parse


class BbcNewsSpider(scrapy.Spider):
    name = 'bbc_news'
    allowed_domains = ['bbc.com']
    start_urls = [
        'https://www.bbc.com/search?q=Notre+Dame'
    ]

    def parse(self, response):
        # Extract article links from the search results page
        article_links = response.xpath('//div[contains(@class, "ssrcss-1ocoo3l-Wrap")]//a/@href').getall()
        for link in article_links:
            if '/news/' in link:  # keep news links only
                yield response.follow(link, self.parse_article)

        # Follow pagination
        next_page = response.xpath('//a[contains(@class, "ssrcss-1j3alh1-PageLink")]/@href').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_article(self, response):
        # Let newspaper3k extract the body from the HTML Scrapy already downloaded
        article = Article(response.url)
        article.download(input_html=response.text)
        article.parse()

        # article.publish_date is already a datetime (or None), so it must not be
        # passed to dateparser; fall back to the <time datetime="..."> string instead
        publish_date = article.publish_date
        if publish_date is None:
            raw_date = response.xpath('//time/@datetime').get()
            publish_date = parse(raw_date) if raw_date else None

        yield {
            'title': article.title,
            'text': article.text,
            'authors': article.authors,
            'publish_date': publish_date,
            'url': response.url,
            'source': 'BBC'
        }
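The '/news/' check in parse is the only filter between the search page and the article parser, so it can help to test that logic outside of a running crawl. The following is a minimal standard-library sketch of a stricter filter; the function name select_article_links is hypothetical and not part of the project, and Scrapy's own duplicate filter already drops repeated requests, so this only makes the rule explicit and testable.

```python
from urllib.parse import urljoin, urlparse


def select_article_links(base_url, hrefs, domain='bbc.com'):
    """Resolve hrefs against base_url and keep unique on-site /news/ links."""
    seen = set()
    selected = []
    for href in hrefs:
        parsed = urlparse(urljoin(base_url, href))
        host = parsed.hostname or ''
        # Accept the bare domain and its subdomains, nothing else.
        if host != domain and not host.endswith('.' + domain):
            continue
        if '/news/' not in parsed.path:
            continue
        # Drop query string and fragment so variants of one article collapse.
        clean = parsed._replace(query='', fragment='').geturl()
        if clean not in seen:
            seen.add(clean)
            selected.append(clean)
    return selected


links = select_article_links(
    'https://www.bbc.com/search?q=Notre+Dame',
    [
        '/news/world-europe-47942275',
        'https://www.bbc.com/news/world-europe-47942275#comments',
        'https://www.bbc.com/sport/123',
        'https://example.com/news/elsewhere',
    ],
)
print(links)
```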

3.3 Handling Anti-Scraping Measures

Common anti-scraping measures on news sites and how to deal with them:

User-Agent checks:

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'

Request rate limits:

# settings.py
DOWNLOAD_DELAY = 2
AUTOTHROTTLE_ENABLED = True
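AutoThrottle has a few companion settings that control how far it may adjust the delay. The fragment below is a sketch of how they might be combined; the setting names are Scrapy's own, but the values are illustrative starting points rather than tuned recommendations. With AutoThrottle enabled, DOWNLOAD_DELAY acts as the lower bound for the delay.

```python
# settings.py (illustrative values; adjust per target site)
ROBOTSTXT_OBEY = True                  # respect each site's robots.txt
CONCURRENT_REQUESTS_PER_DOMAIN = 2     # keep parallelism per site low
AUTOTHROTTLE_START_DELAY = 2           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30            # upper bound when the server responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
RETRY_TIMES = 3                        # extra attempts for failed requests
```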

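IP bans:
A proxy middleware helps here. A simple rotation scheme picks a proxy from a pool for each request (the pool entry below is a placeholder):

# middlewares.py (enable the class under DOWNLOADER_MIDDLEWARES in settings.py)
import random


class ProxyMiddleware:
    PROXIES = ['http://your_proxy_server:port']  # placeholder pool; list real proxies here

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.PROXIES)  # rotate per request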

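4. Data Cleaning and Storage

4.1 Cleaning Pipeline

# pipelines.py
import re
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException


class CleanPipeline:
    def process_item(self, item, spider):
        # Collapse runs of whitespace into single spaces
        item['text'] = re.sub(r'\s+', ' ', item['text']).strip()

        # Detect the language; langdetect raises on empty or featureless text
        try:
            item['language'] = detect(item['text'])
        except LangDetectException:
            item['language'] = 'unknown'

        return item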

4.2 Storage Options

Choose a backend according to data volume:

Small scale: SQLite

# pipelines.py
import sqlite3


class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect('news.db')
        self.cur = self.conn.cursor()
        # url is UNIQUE so that re-crawled articles are not stored twice
        self.cur.execute('''
            CREATE TABLE IF NOT EXISTS articles(
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                title TEXT,
                text TEXT,
                authors TEXT,
                publish_date TEXT,
                url TEXT UNIQUE,
                source TEXT,
                language TEXT
            )
        ''')

    def process_item(self, item, spider):
        publish_date = item['publish_date']
        self.cur.execute('''
            INSERT OR IGNORE INTO articles
            (title, text, authors, publish_date, url, source, language)
            VALUES (?,?,?,?,?,?,?)
        ''', (
            item['title'],
            item['text'],
            ','.join(item['authors']),
            str(publish_date) if publish_date else None,  # store NULL, not the string 'None'
            item['url'],
            item['source'],
            item['language']
        ))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Release the database file when the crawl ends
        self.conn.close()
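The deduplication here rests entirely on the UNIQUE constraint on url combined with INSERT OR IGNORE. The sketch below shows that behavior in isolation with an in-memory database and a reduced two-column table; the helper name insert_article and the example URLs are made up for the demonstration.

```python
import sqlite3


def insert_article(conn, title, url):
    """Insert one row; return True if stored, False if the URL already existed."""
    cur = conn.execute(
        'INSERT OR IGNORE INTO articles (title, url) VALUES (?, ?)',
        (title, url),
    )
    conn.commit()
    # rowcount is 0 when the UNIQUE constraint caused the insert to be ignored.
    return cur.rowcount == 1


conn = sqlite3.connect(':memory:')  # throwaway database for the demonstration
conn.execute(
    'CREATE TABLE articles ('
    'id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, url TEXT UNIQUE)'
)

first = insert_article(conn, 'Fire at Notre-Dame', 'https://example.com/a')
second = insert_article(conn, 'Fire at Notre-Dame (updated)', 'https://example.com/a')
total = conn.execute('SELECT COUNT(*) FROM articles').fetchone()[0]
stored_title = conn.execute('SELECT title FROM articles').fetchone()[0]
print(first, second, total, stored_title)
```

Note the consequence: the first version of an article wins, and a later re-crawl does not refresh it. If updated articles should overwrite earlier copies, an upsert is the better fit.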

Large scale: MongoDB

# pipelines.py
import pymongo


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # Upsert keyed on the URL: insert new articles, refresh existing ones
        self.db['articles'].update_one(
            {'url': item['url']},
            {'$set': dict(item)},
            upsert=True
        )
        return item

    def close_spider(self, spider):
        # Close the client connection when the crawl ends
        self.client.close()

5. Deployment and Scheduling

5.1 Scheduled Runs

Once the project is deployed to Scrapyd, runs can be scheduled through its API:

curl http://localhost:6800/schedule.json -d project=notre_dame_news -d spider=bbc_news

Or use a more full-featured scheduler such as Airflow:

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x path; 1.10 used airflow.operators.bash_operator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2023, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'notre_dame_news',
    default_args=default_args,
    schedule_interval=timedelta(hours=6),  # run every six hours
    catchup=False,  # do not backfill every interval since start_date
)

task = BashOperator(
    task_id='run_spider',
    bash_command='cd /path/to/project && scrapy crawl bbc_news',
    dag=dag
)

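5.2 Monitoring and Alerting

A simple monitoring script can check that the crawler service is up:

import requests
import smtplib
from email.mime.text import MIMEText


def check_spider():
    try:
        resp = requests.get('http://localhost:6800/daemonstatus.json', timeout=10)
        if resp.json()['status'] != 'ok':
            send_alert('Scrapyd daemon not running')
    except (requests.RequestException, ValueError, KeyError):
        # Connection failure, invalid JSON, or a missing "status" field
        send_alert('Scrapyd check failed')


def send_alert(message):
    msg = MIMEText(message)
    msg['Subject'] = 'Crawler monitoring alert'
    msg['From'] = 'alert@example.com'
    msg['To'] = 'admin@example.com'

    # The context manager closes the SMTP connection even if sending fails
    with smtplib.SMTP('smtp.example.com') as s:
        s.send_message(msg)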
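The daemonstatus.json response also reports job counts (running, pending, finished), which allows a slightly richer check than "is the daemon up". The sketch below keeps the decision logic in a pure function so it can be tested without a running Scrapyd. The function name evaluate_status and the max_pending threshold are assumptions for illustration, and the counts are passed through int() because they may arrive as strings or numbers.

```python
def evaluate_status(payload, max_pending=50):
    """Return a list of alert messages for a daemonstatus.json payload."""
    alerts = []
    if payload.get('status') != 'ok':
        alerts.append('Scrapyd daemon not running')
    # A growing queue suggests spiders are stuck or scheduled too often.
    pending = int(payload.get('pending', 0))
    if pending > max_pending:
        alerts.append(f'{pending} jobs pending (limit {max_pending})')
    return alerts


healthy = evaluate_status({'status': 'ok', 'running': '1', 'pending': '0'})
down = evaluate_status({'status': 'error'})
backlog = evaluate_status({'status': 'ok', 'pending': 75})
print(healthy, down, backlog)
```

Each returned message could then be handed to the alerting function, one email per message or joined into a single body.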

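6. Common Problems and Solutions

6.1 Inaccurate Content Extraction

Symptom: the extracted body contains a lot of unrelated content (ads, recommended links, and so on).

Solutions:

Try adjusting the newspaper3k configuration:

article = Article(response.url, language='en')  # state the language explicitly
article.set_html(response.text)
article.parse()
article.nlp()  # optional: fills article.keywords and article.summary (needs NLTK tokenizer data); article.text is unchanged

Or switch to a more precise XPath rule:

content = response.xpath('//div[contains(@class, "article-body")]//p/text()').getall()
content = ' '.join([p.strip() for p in content if p.strip()])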

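6.2 Getting Blocked

Symptom: the site returns 403 errors or a CAPTCHA page.

Solutions:

Make the request headers more realistic:

# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://www.google.com/'
}

Simulate more realistic browsing behavior:

# middlewares.py
import random
import time


class RandomDelayMiddleware:
    def process_request(self, request, spider):
        # Caution: time.sleep blocks Scrapy's event loop, so every in-flight
        # request pauses too; acceptable only for small, low-concurrency crawls
        delay = random.uniform(1, 3)
        time.sleep(delay)
        return None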
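Sleeping inside a downloader middleware stalls the whole crawler while it waits. Scrapy also has a built-in alternative: RANDOMIZE_DOWNLOAD_DELAY is enabled by default and makes each wait a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. The sketch below reproduces only that arithmetic to show the resulting range; jittered_delay is an illustrative helper, not a Scrapy function.

```python
import random


def jittered_delay(base_delay, rng=random):
    """Return a wait time between 0.5x and 1.5x of base_delay, in seconds."""
    return rng.uniform(0.5 * base_delay, 1.5 * base_delay)


# With a base delay of 2 seconds every sample falls between 1 and 3 seconds.
samples = [jittered_delay(2.0, random.Random(seed)) for seed in range(5)]
print([round(s, 2) for s in samples])
```

So DOWNLOAD_DELAY = 2 with the default randomization already yields waits in the same 1 to 3 second range as the middleware, without blocking other requests.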

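6.3 Handling Multiple Languages

Symptom: articles in different languages end up mixed together and are hard to classify.

Solution:

Add a language filter to the pipeline:

# pipelines.py
from scrapy.exceptions import DropItem


class LanguageFilterPipeline:
    def __init__(self, allowed_langs):
        self.allowed_langs = allowed_langs

    @classmethod
    def from_crawler(cls, crawler):
        # ALLOWED_LANGUAGES is a custom setting; default to English only
        return cls(
            allowed_langs=crawler.settings.get('ALLOWED_LANGUAGES', ['en'])
        )

    def process_item(self, item, spider):
        # Discard items whose detected language is not on the allow list
        if item.get('language') not in self.allowed_langs:
            raise DropItem(f"Unsupported language: {item.get('language')}")
        return item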
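The filter only works if item['language'] has already been set, so pipeline order matters: Scrapy runs the classes listed in ITEM_PIPELINES from the lowest number to the highest. The fragment below is a sketch of one possible ordering. The module path assumes the pipelines live in notre_dame_news/pipelines.py as created by startproject, and the priority numbers and the language list are illustrative.

```python
# settings.py (sketch; priority numbers are illustrative)
ITEM_PIPELINES = {
    'notre_dame_news.pipelines.CleanPipeline': 100,           # sets item['language']
    'notre_dame_news.pipelines.LanguageFilterPipeline': 200,  # reads item['language']
    'notre_dame_news.pipelines.SQLitePipeline': 300,          # stores what survives
}

# Custom setting read by LanguageFilterPipeline.from_crawler
ALLOWED_LANGUAGES = ['en', 'fr']
```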

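7. Extending and Optimizing the Project

7.1 Supporting More News Sites

Additional sites can be supported by subclassing the base spider:

class CnnNewsSpider(BbcNewsSpider):
    name = 'cnn_news'
    allowed_domains = ['cnn.com']
    start_urls = ['https://edition.cnn.com/search?q=Notre+Dame']

    # Note: the inherited parse() still uses BBC-specific selectors; override it
    # with selectors that match CNN's search results page

    def parse_article(self, response):
        # CNN-specific parsing logic
        article = Article(response.url)
        article.download(input_html=response.text)
        article.parse()

        yield {
            'title': article.title,
            'text': article.text,
            'authors': article.authors,
            'publish_date': article.publish_date,  # already a datetime or None
            'url': response.url,
            'source': 'CNN'
        }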

7.2 Adding Sentiment Analysis

TextBlob provides simple sentiment analysis:

from textblob import TextBlob


class SentimentPipeline:
    def process_item(self, item, spider):
        blob = TextBlob(item['text'])
        item['sentiment'] = blob.sentiment.polarity         # -1.0 (negative) to 1.0 (positive)
        item['subjectivity'] = blob.sentiment.subjectivity  # 0.0 (objective) to 1.0 (subjective)
        return item
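A raw polarity score is awkward to aggregate, so one common follow-up is to bucket it into labels. The sketch below is one way to do that; the function name label_sentiment and the 0.1 threshold are assumptions chosen for illustration, not values taken from TextBlob.

```python
def label_sentiment(polarity, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a coarse label."""
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'


labels = [label_sentiment(p) for p in (0.6, -0.4, 0.05, -0.1)]
print(labels)
```

Scores inside the band from -threshold to +threshold count as neutral, which keeps mildly worded news reports from being forced into one camp.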

7.3 Visualization

Pandas and Matplotlib can generate simple charts:

import sqlite3

import pandas as pd
import matplotlib.pyplot as plt


def generate_report():
    conn = sqlite3.connect('news.db')
    df = pd.read_sql('SELECT * FROM articles', con=conn)
    conn.close()

    # Article count by source (separate figure so the charts do not overlap)
    plt.figure()
    df['source'].value_counts().plot(kind='bar')
    plt.title('News Count by Source')
    plt.savefig('source_dist.png')
    plt.close()

    # Article count by day; unparseable dates become NaT and are dropped
    df['date'] = pd.to_datetime(df['publish_date'], errors='coerce', utc=True)
    daily_counts = df.dropna(subset=['date']).set_index('date').resample('D').size()
    plt.figure()
    daily_counts.plot()
    plt.title('Daily News Count')
    plt.savefig('daily_trend.png')
    plt.close()
