I've recently been working on a fun little side project: using a Python crawler to collect joke data from the web. Projects like this look simple, but they involve quite a few technical details. As a developer who often uses crawlers to process unstructured data, I've found that scraping joke data comes with a few distinctive challenges: page structures change frequently, anti-scraping mechanisms are involved, and the data needs heavy cleaning. Below is my complete implementation along with the pitfalls I ran into.
This write-up should be useful to readers who want hands-on practice with Python crawlers and data processing.
I chose two typical joke websites as data sources: Duanziwang (duanziwang.com) and Lengxiaohua Jingxuan (lengxiaohua.com). The two have quite different characteristics: the former serves mostly static HTML, while the latter loads its content dynamically.
Tip: in real development, start with the simpler static pages first, and move on to dynamically loaded content once the core crawler logic is stable.
Basic components: requests for HTTP and BeautifulSoup for HTML parsing.
Advanced tools: Selenium for dynamic pages, MongoDB for storage, and Redis for distributed scheduling.
# Install the basic dependencies
pip install requests beautifulsoup4 selenium pymongo redis
Taking Duanziwang as an example, the core scraping flow is: analyze the page structure first. Each joke sits inside a <div class="content"> tag, so the parsing code looks like this:
import requests
from bs4 import BeautifulSoup

def get_jokes(page=1):
    url = f"https://duanziwang.com/page/{page}/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    jokes = []
    # Each joke lives inside a div.content block
    for item in soup.select('div.content'):
        title = item.select_one('h2').text.strip()
        content = item.select_one('p').text.strip()
        jokes.append({'title': title, 'content': content})
    return jokes
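For completeness, this is roughly how I drive the function across multiple pages; the page count and delay values below are illustrative, not taken from the original run.

import time
import random

all_jokes = []
for page in range(1, 6):  # number of pages is an arbitrary example
    all_jokes.extend(get_jokes(page))
    time.sleep(random.uniform(1, 3))  # polite random delay between pages
print(f'Collected {len(all_jokes)} jokes')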
Lengxiaohua Jingxuan loads its content dynamically, so Selenium is needed to drive a real browser:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_dynamic_jokes():
    driver = webdriver.Chrome()
    driver.get("https://lengxiaohua.com")
    try:
        # Wait until the dynamically loaded joke items appear
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "joke-item"))
        )
        jokes = []
        items = driver.find_elements(By.CLASS_NAME, "joke-item")
        for item in items:
            content = item.find_element(By.CLASS_NAME, "content").text
            jokes.append(content)
        return jokes
    finally:
        driver.quit()
Common anti-scraping measures and how to deal with them:
User-Agent checks: maintain a pool of realistic browser user agents (a rotation sketch follows the snippet below).
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)'
]
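As a rough sketch of how the pool can be used (the helper name fetch_with_random_ua is my own, not from the original project): pick a random entry for every request so the crawler doesn't present the same fingerprint each time.

import random
import requests

def fetch_with_random_ua(url, user_agents):
    # Rotate the User-Agent on every request
    headers = {'User-Agent': random.choice(user_agents)}
    return requests.get(url, headers=headers, timeout=10)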
IP rate limiting: route traffic through proxies (a usage sketch follows the snippet below).
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'https://proxy.example.com:8080'
}
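A minimal sketch of passing the proxy mapping to requests; proxy.example.com is a placeholder rather than a real proxy service, so swap in your own pool.

import requests

response = requests.get(
    'https://duanziwang.com/',
    proxies=proxies,  # the mapping defined above
    timeout=10
)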
Captcha interception: slow down and randomize request timing so the threshold that triggers captchas is less likely to be hit:
import time
import random

# Random delay between requests
time.sleep(random.uniform(1, 3))
Considering the characteristics of the joke data, I chose MongoDB as the main store:
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['joke_db']
collection = db['jokes']

def save_to_mongo(jokes):
    collection.insert_many(jokes)
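One addition worth considering (my own suggestion, not part of the original code): store a hash of each joke's content and put a unique index on it, so MongoDB itself rejects duplicates at insert time.

import hashlib
from pymongo.errors import DuplicateKeyError

collection.create_index('content_hash', unique=True)

def save_joke(joke):
    joke['content_hash'] = hashlib.md5(joke['content'].encode('utf-8')).hexdigest()
    try:
        collection.insert_one(joke)
    except DuplicateKeyError:
        pass  # same content already stored, skip it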
Common problems in the raw data include embedded ads, zero-width characters, and duplicate entries. A cleaning example:
import re

def clean_content(text):
    # Strip embedded promotions ("follow our official account" style ads)
    text = re.sub(r'关注.*?公众号', '', text)
    # Remove zero-width spaces and surrounding whitespace
    text = text.replace('\u200b', '').strip()
    return text
def remove_duplicates(jokes):
    seen = set()
    unique_jokes = []
    for joke in jokes:
        # Hash the content so the same joke from different pages is caught
        joke_hash = hash(joke['content'])
        if joke_hash not in seen:
            seen.add(joke_hash)
            unique_jokes.append(joke)
    return unique_jokes
When you need to scrape at larger scale, a Scrapy-Redis setup works well:
pip install scrapy scrapy-redis
# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://localhost:6379'
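To make the distributed setup concrete, here is a minimal spider sketch that runs under the settings above; the spider name, redis_key, and CSS selectors are illustrative placeholders, not values from the original project.

# spider.py
from scrapy_redis.spiders import RedisSpider

class JokeSpider(RedisSpider):
    name = 'joke_spider'
    redis_key = 'joke_spider:start_urls'  # workers pop start URLs from this Redis list

    def parse(self, response):
        for item in response.css('div.content'):
            yield {
                'title': item.css('h2::text').get(default='').strip(),
                'content': item.css('p::text').get(default='').strip(),
            }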
Use Supervisor to keep the crawler process running:
[program:joke_spider]
command=/usr/bin/python /path/to/spider.py
autostart=true
autorestart=true
stderr_logfile=/var/log/joke_spider.err.log
stdout_logfile=/var/log/joke_spider.out.log
Set up a simple quality check as a monitoring mechanism:
def quality_check(joke):
    # Too short to be a real joke
    if len(joke['content']) < 10:
        return False
    # "Click to view" teasers mean the content was truncated
    if '点击查看' in joke['content']:
        return False
    return True
Typical symptom: the site starts rejecting requests after the crawler has been running for a while. Solution: use a persistent session and send more complete, browser-like headers:
session = requests.Session()
session.headers.update({
    'Referer': 'https://example.com',
    'Accept-Language': 'zh-CN,zh;q=0.9'
})
Another frequent failure is the parser breaking. The common cause is that the site has changed its page structure, so the old selectors no longer match. The strategy I use is to fall back to a looser selector when the specific one fails:
try:
    # Prefer the more specific selector
    title = item.select_one('h2.title').text
except AttributeError:
    # Fall back to a looser selector if the structure changed
    title = item.select_one('h2').text
For write performance, batch the MongoDB inserts instead of inserting one document at a time:
# Buffer documents and insert them into MongoDB in batches
buffer = []
MAX_BUFFER = 100

def save_to_buffer(joke):
    buffer.append(joke)
    if len(buffer) >= MAX_BUFFER:
        collection.insert_many(buffer)
        buffer.clear()

# Remember to flush whatever is left in the buffer when the crawler shuts down
Run a word-frequency analysis on the collected jokes:
from collections import Counter
import jieba

def word_frequency(jokes):
    # Join all joke texts and tokenize with jieba
    all_text = ' '.join([j['content'] for j in jokes])
    words = jieba.cut(all_text)
    return Counter(words).most_common(50)
Expose the data through a small Flask API:
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/jokes/random')
def random_joke():
    # $sample pulls one random document from the collection
    joke = list(collection.aggregate([{'$sample': {'size': 1}}]))[0]
    joke.pop('_id', None)  # ObjectId is not JSON serializable
    return jsonify(joke)
Set up a scheduled job (Linux crontab):
0 */6 * * * /usr/bin/python /path/to/spider.py >> /var/log/joke_spider.log 2>&1
In practice I've found that joke sites change their page structure fairly often, so it's worth checking on the crawler about once a week. For anything important running in production, it's best to build an automated page-structure check and a way to update selectors; a rough sketch of such a check follows.
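This is only a sketch of what an automated selector health check could look like, my own illustration rather than part of the original project; it fetches one page and verifies that the expected selectors still match something.

import requests
from bs4 import BeautifulSoup

EXPECTED_SELECTORS = ['div.content', 'div.content h2', 'div.content p']

def check_selectors(url='https://duanziwang.com/'):
    html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    broken = [sel for sel in EXPECTED_SELECTORS if not soup.select(sel)]
    if broken:
        # Hook this into whatever alerting already exists (log file, email, etc.)
        print(f'Selector check failed for: {broken}')
    return not broken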