When scraping recruitment sites, we keep running into the same pattern: the homepage only shows job categories and summary information, a second-level page holds the actual job listings, and only the third-level detail page carries the full job description. This three-tier drill-down is the classic design of information aggregation sites.
I hit the pitfalls myself last year while building a data collection system for an HR company. My first attempt was a naive crawler that fetched pages directly, and it ran into trouble almost immediately.
Switching to Scrapy's LinkExtractor + Rule combination, backed by MongoDB's flexible storage, improved throughput roughly 4x. Below is the battle-tested solution that came out of that project.
First, create the basic project structure from the command line:
```bash
scrapy startproject job_spider
cd job_spider
scrapy genspider career example.com
```
The key configuration lives in settings.py:
```python
# Enable the built-in duplicate filter
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
# Lower concurrency to avoid getting banned
CONCURRENT_REQUESTS = 4
# A User-Agent must be set
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0)'
```
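On top of these, a few pacing-related settings are worth considering; the values below are illustrative rather than prescriptive:

```python
# Additional politeness settings (illustrative values)
DOWNLOAD_DELAY = 1.0           # base delay between requests to the same domain
AUTOTHROTTLE_ENABLED = True    # let Scrapy adapt the delay to observed latency
ROBOTSTXT_OBEY = True          # respect robots.txt where appropriate
```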
Taking 公考雷达 as the example site, the typical data flow runs homepage → category/list page → job detail page.
The corresponding Spider class structure looks like this:
```python
import scrapy

class CareerSpider(scrapy.Spider):
    name = 'career'

    def parse(self, response):            # handle the homepage
        yield from self.extract_level1_links(response)

    def parse_list(self, response):       # handle list pages
        yield from self.handle_pagination(response)
        yield from self.extract_detail_links(response)

    def parse_detail(self, response):     # handle detail pages
        yield self.construct_final_item(response)
```
Many beginners extract links directly with XPath, but LinkExtractor's restrict_xpaths parameter is more effective. For example, to pull the province links out of the navigation menu:
```python
from scrapy.linkextractors import LinkExtractor

nav_links = LinkExtractor(
    restrict_xpaths='//div[@class="province-nav"]//a',
    deny=['/about/', '/contact/']  # exclude non-job links
)
```
In practice I found three optimization points, combined in the sketch below:

- the deny parameter filters out irrelevant links
- the allow parameter matches URL patterns (e.g., allow=r'/job/\d+')
- unique=True avoids extracting duplicate links
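A minimal sketch putting the three parameters together; the XPath and URL pattern are placeholders carried over from the earlier examples, not taken from a real site:

```python
from scrapy.linkextractors import LinkExtractor

# Combined use of allow / deny / unique; adjust the placeholders
# to the target site's markup and URL scheme.
detail_links = LinkExtractor(
    allow=r'/job/\d+',                              # keep only job detail URLs
    deny=['/about/', '/contact/'],                  # drop static info pages
    restrict_xpaths='//div[@class="job-list"]//a',  # search only inside the list area
    unique=True                                     # de-duplicate links within a page
)
```

Inside a callback, `detail_links.extract_links(response)` returns Link objects whose `.url` can be handed to `response.follow` with the next callback.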
Pagination on recruitment sites usually takes one of two forms: traditional page=1 style links, or infinite scroll driven by XHR requests. For traditional pagination, a Rule can be configured like this:
```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Note: rules are only honoured by CrawlSpider subclasses
rules = (
    Rule(LinkExtractor(
        allow=r'page=\d+',
        restrict_xpaths='//div[@class="pagination"]'
    ), callback='parse_list', follow=True),
)
```
For infinite scrolling you need to analyse the XHR requests instead. Taking one job site as an example:
```python
def parse_list(self, response):
    # self.page is a page counter kept on the spider instance
    api_url = 'https://xxx.com/api/jobs'
    yield scrapy.FormRequest(
        url=api_url,
        formdata={'page': str(self.page)},
        callback=self.parse_api_response
    )
```
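The callback itself is not shown above; a minimal sketch of what parse_api_response might look like as a method on the same spider, assuming the API returns JSON with jobs, detail_url and has_more fields (all hypothetical names):

```python
def parse_api_response(self, response):
    # 'jobs', 'detail_url' and 'has_more' are assumed names for the API's JSON fields
    data = response.json()
    for job in data.get('jobs', []):
        yield response.follow(job['detail_url'], callback=self.parse_detail)
    if data.get('has_more'):
        self.page += 1
        yield from self.parse_list(response)  # issue the POST for the next page
```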
Unlike a relational database, MongoDB is well suited to nested documents. Our job document structure looks like this:
```json
{
  "_id": "5f8d1b2c3e4a5d6e7f8a9b0c",
  "basic_info": {
    "title": "Python开发工程师",
    "salary": "20k-30k"
  },
  "company": {
    "name": "某科技公司",
    "type": "民营"
  },
  "locations": [
    {"province": "北京", "district": "海淀区"},
    {"province": "上海", "district": "浦东新区"}
  ],
  "crawl_time": "2023-08-20T10:00:00Z"
}
```
This nested design has clear advantages: all the information about a posting, including multiple locations, travels together in one document, and nested fields remain directly queryable, as the sketch below shows.
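A quick pymongo sketch of working with the nested structure; the connection string, database and collection names are assumptions:

```python
from pymongo import MongoClient

# Assumed connection details and collection name
client = MongoClient('mongodb://localhost:27017')
jobs = client['job_spider']['jobs']

# Dot notation reaches into nested documents and arrays directly,
# so multi-location postings need no join table
beijing_jobs = jobs.find({'locations.province': '北京'})

# Nested fields can also be indexed
jobs.create_index('basic_info.salary')
```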
The first version of the pipeline inserted records one at a time:
```python
def process_item(self, item, spider):
    collection.insert_one(dict(item))
    return item
```
The optimized version batches the writes instead:
```python
from pymongo import MongoClient, InsertOne

class MongoPipeline:
    MAX_BUFFER_SIZE = 100

    def open_spider(self, spider):
        # Connection details are illustrative
        self.collection = MongoClient('mongodb://localhost:27017')['job_spider']['jobs']
        self.buffer = []

    def process_item(self, item, spider):
        self.buffer.append(InsertOne(dict(item)))
        if len(self.buffer) >= self.MAX_BUFFER_SIZE:
            self.flush_buffer()
        return item

    def flush_buffer(self):
        if self.buffer:
            self.collection.bulk_write(self.buffer)
            self.buffer.clear()

    def close_spider(self, spider):
        self.flush_buffer()  # write out whatever is still buffered
```
Measured on a run of 100,000 records, write time dropped from 210 seconds to 47 seconds.
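The pipeline only runs once it is registered in settings.py; the module path below assumes the project layout shown at the end of this article and the MongoPipeline name used in the sketch above:

```python
# settings.py
ITEM_PIPELINES = {
    'job_spider.pipelines.MongoPipeline': 300,
}
```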
Implement adaptive retries in middlewares.py:
```python
class RetryMiddleware:
    def process_response(self, request, response, spider):
        if response.status in [408, 429, 500]:
            retry_times = request.meta.get('retry_times', 0)
            if retry_times < 3:
                spider.logger.warning(f'Retrying {request.url} (attempt {retry_times + 1})')
                # Scrapy requests have no per-request delay attribute, so rather
                # than an explicit exponential backoff the retry is pushed to the
                # back of the queue; pacing comes from DOWNLOAD_DELAY / AutoThrottle.
                return request.replace(
                    dont_filter=True,
                    priority=request.priority - 10,
                    meta={**request.meta, 'retry_times': retry_times + 1},
                )
        return response
```
Don't stick to a single fixed User-Agent; the fake_useragent library is a good option:
```python
from fake_useragent import UserAgent

class RotateUserAgentMiddleware:
    def __init__(self):
        self.ua = UserAgent()  # build the UA pool once, not on every request

    def process_request(self, request, spider):
        request.headers['User-Agent'] = self.ua.random
        request.headers['Accept-Language'] = 'zh-CN,zh;q=0.9'
```
When an element is hard to locate, test it in the interactive shell first:
```bash
scrapy shell 'https://example.com/jobs'
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(restrict_xpaths='//div[@class="job-list"]')
>>> len(le.extract_links(response))  # check how many links are extracted
```
Add performance statistics in extensions.py:
```python
from collections import defaultdict
from scrapy import signals

class StatsMonitor:
    def __init__(self):
        self.level_counts = defaultdict(int)

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.stats_monitor = ext  # expose the instance so the spider can reach it
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        for level, count in self.level_counts.items():
            spider.logger.info(f'Level {level} URLs: {count}')
```
Then instrument the spider:
```python
def parse_detail(self, response):
    # crawler.extensions has no dict-style lookup, so go through the
    # attribute the extension attached to the crawler in from_crawler
    self.crawler.stats_monitor.level_counts['detail'] += 1
```
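None of the middlewares or the extension above are active until they are registered; a sketch of the corresponding settings.py entries, with module paths assumed from the project layout below:

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,  # disable the built-in retry
    'job_spider.middlewares.RetryMiddleware': 550,
    'job_spider.middlewares.RotateUserAgentMiddleware': 400,
}
EXTENSIONS = {
    'job_spider.extensions.StatsMonitor': 500,
}
```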
The final project layout should look like this:
```
job_spider/
├── scrapy.cfg
└── job_spider/
    ├── spiders/
    │   └── career.py
    ├── items.py
    ├── pipelines.py
    ├── middlewares.py
    ├── extensions.py
    ├── settings.py
    └── utils/
        ├── link_cleaner.py
        └── date_parser.py
```
Key files:

- items.py: defines the three-level data structures
- pipelines.py: holds the MongoDB bulk-write logic
- utils/: helper functions such as URL cleaning and date parsing

For production deployments, it's worth adding Airflow or Scrapyd for job scheduling. For crawls in the tens of millions of records, consider moving deduplication to a Redis-backed queue, as sketched below.
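One common way to do that is the scrapy-redis package, which swaps in a Redis-backed scheduler and dupefilter; a minimal sketch of the relevant settings (the Redis URL is a placeholder):

```python
# settings.py — scrapy-redis based deduplication and scheduling
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
SCHEDULER_PERSIST = True          # keep the request queue and dupe set across runs
REDIS_URL = 'redis://localhost:6379'
```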