Python requests in Practice: A Guide to Handling HTTP Requests in Web Crawlers - 代码聚汇网

故小里

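1. Web Crawler Basics: A Practical Guide to the requests Library

As a crawler developer, being able to send and handle HTTP requests is an essential skill. Python's requests library, with its clean and elegant API, is the go-to tool for talking to web servers. In this article I share the core usage and practical techniques of requests in real crawler development, distilled from my experience across several crawler projects.

requests is popular mainly because it wraps the low-level details of HTTP, so a few lines of code are enough for fairly complex network interactions. Compared with Python's built-in urllib, its API is friendlier and its error handling is more complete. In my experience, requests covers well over 90% of the request tasks in small and medium-sized crawler projects.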

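2. Core Request Methods and Parameter Configuration

2.1 Passing Parameters in GET Requests

GET is the most basic HTTP method and is normally used to retrieve resources from a server. In practice we often need to pass query parameters: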

import requests

# Basic query parameters
params = {'page': 1, 'limit': 20}
response = requests.get('https://api.example.com/posts', params=params)

# List values are expanded into repeated keys
params = {'tags': ['python', 'web']}  # sent as tags=python&tags=web

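Note: requests URL-encodes parameter values that contain special characters. If the server expects a non-standard encoding, build the query string yourself and put it directly in the URL. Values passed through params are always encoded, so a value you have already percent-encoded would be encoded a second time.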
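
To check what requests will actually put in the URL without sending anything, one option is to build a PreparedRequest and inspect its url attribute. This is a minimal sketch; build_url is a hypothetical helper name and the URLs and values are placeholders:

```python
import requests

def build_url(base_url, params):
    """Return the final URL requests would use, without sending the request."""
    return requests.Request('GET', base_url, params=params).prepare().url

# List values become repeated keys
list_url = build_url('https://api.example.com/posts', {'tags': ['python', 'web']})

# Reserved characters are percent-encoded; spaces become '+'
encoded_url = build_url('https://api.example.com/search', {'q': 'a b&c'})

print(list_url)
print(encoded_url)
```

Printing the prepared URL like this is a cheap way to compare your request against what a browser sends for the same query.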

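In one project I found that parameter order affected the cache hit rate: some CDNs match the full URL string, parameter order included. requests emits parameters in the order it receives them, so the fix is to keep that order stable. On Python 3.7+ a plain dict already preserves insertion order; collections.OrderedDict (or a list of tuples) makes the intent explicit and also works on older versions:

from collections import OrderedDict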

params = OrderedDict([('z', '1'), ('a', '2'), ('c', '3')])
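
If the parameters are assembled in different places and their insertion order cannot be trusted, an alternative is to sort them by key before building the URL. A sketch under that assumption; canonical_url is a hypothetical helper, not part of requests:

```python
import requests

def canonical_url(base_url, params):
    """Build a URL whose query parameters are always sorted by key.

    Hypothetical helper: sorting is one way to get the same URL string
    no matter how the params dict was assembled.
    """
    ordered = sorted(params.items())  # list of (key, value) tuples
    return requests.Request('GET', base_url, params=ordered).prepare().url

url_a = canonical_url('https://api.example.com/posts', {'z': '1', 'a': '2', 'c': '3'})
url_b = canonical_url('https://api.example.com/posts', {'c': '3', 'z': '1', 'a': '2'})
print(url_a)
print(url_a == url_b)
```

Whether sorted order or a fixed hand-written order is better depends on what the cache in front of the server keys on.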

2.2 Submitting Data with POST Requests

POST requests create or modify resources and can carry data in several formats:

# Form submission (application/x-www-form-urlencoded)
data = {'title': 'Hello', 'content': 'World'}
response = requests.post('https://api.example.com/posts', data=data)

# JSON submission (application/json)
import json
data = {'title': 'Hello', 'content': 'World'}
response = requests.post('https://api.example.com/posts', json=data)  # recommended

# Serializing the JSON string by hand
headers = {'Content-Type': 'application/json'}
response = requests.post('https://api.example.com/posts', 
                        data=json.dumps(data), 
                        headers=headers)

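Tip: the json parameter is more reliable than serializing by hand. It automatically:

Sets the correct Content-Type header

Encodes the body as UTF-8

It uses the standard JSON encoder, though, so types such as datetime are not handled for you. Convert them to strings (for example with isoformat()) before passing the payload.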
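
The difference between json= and data= can be checked offline by preparing both requests and looking at the headers and body. A minimal sketch with placeholder URL and fields; the created_at field is an assumption added to illustrate converting a datetime by hand:

```python
import json
from datetime import datetime, timezone

import requests

payload = {
    'title': 'Hello',
    # datetime is not JSON-serializable, so convert it to an ISO 8601 string first
    'created_at': datetime(2024, 1, 1, tzinfo=timezone.utc).isoformat(),
}

json_request = requests.Request(
    'POST', 'https://api.example.com/posts', json=payload).prepare()
form_request = requests.Request(
    'POST', 'https://api.example.com/posts', data={'title': 'Hello'}).prepare()

print(json_request.headers['Content-Type'])
print(form_request.headers['Content-Type'])
print(json.loads(json_request.body))
```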

2.3 Uploading Files

File upload is another common use of POST requests:

# Single-file upload (the with block closes the file afterwards)
with open('report.pdf', 'rb') as f:
    response = requests.post('https://api.example.com/upload', files={'file': f})

# Multi-file upload: repeat the field name, one tuple per file
with open('foo.png', 'rb') as foo, open('bar.png', 'rb') as bar:
    files = [
        ('images', ('foo.png', foo, 'image/png')),
        ('images', ('bar.png', bar, 'image/png')),
    ]
    response = requests.post('https://api.example.com/upload', files=files)
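
The files parameter accepts any file-like object, so content generated in memory can be uploaded without touching the disk. The sketch below prepares such a request and inspects the multipart encoding instead of sending it; the field name, filename, and URL are placeholders:

```python
import io

import requests

# In-memory stand-in for a real file
buffer = io.BytesIO(b'quarterly numbers')
files = {'file': ('report.txt', buffer, 'text/plain')}

prepared = requests.Request(
    'POST', 'https://api.example.com/upload', files=files).prepare()

content_type = prepared.headers['Content-Type']
print(content_type)        # multipart/form-data with a generated boundary
print(len(prepared.body))  # size of the encoded multipart body in bytes
```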

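3. Request Headers and Session Management

3.1 Best Practices for Custom Request Headers

Sensible request headers can noticeably improve a crawler's success rate: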

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://example.com',
    'X-Requested-With': 'XMLHttpRequest'  # mimic an AJAX request
}

response = requests.get('https://example.com/api', headers=headers)

I recommend wrapping commonly used headers in a shared configuration:

DEFAULT_HEADERS = {
    'User-Agent': '...',
    'Accept': '...',
    # other common headers
}

def create_headers(extra=None):
    headers = DEFAULT_HEADERS.copy()
    if extra:
        headers.update(extra)
    return headers

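3.2 Session Management Essentials

A Session object keeps cookies and configuration across requests:

s = requests.Session()

# Session-level configuration
s.headers.update({'User-Agent': 'MyCrawler/1.0'})
s.auth = ('username', 'password')

# Every request made through the session carries this configuration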
s.get('https://api.example.com/login')
s.post('https://api.example.com/data', json={'query': '...'})

Important: Session objects are not guaranteed to be thread-safe. In multi-threaded code, create a separate Session instance for each thread.
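
One common way to give each thread its own Session is threading.local. A minimal sketch; get_session is a hypothetical helper name and the User-Agent value is reused from the example above:

```python
import threading

import requests

_thread_local = threading.local()

def get_session():
    """Return a Session that belongs to the calling thread, creating it on first use."""
    session = getattr(_thread_local, 'session', None)
    if session is None:
        session = requests.Session()
        session.headers.update({'User-Agent': 'MyCrawler/1.0'})
        _thread_local.session = session
    return session
```

Worker functions then call get_session().get(...) instead of requests.get(...), which keeps connection reuse within a thread without sharing state across threads.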

4. Response Handling and Error Management

4.1 Parsing Response Content

requests offers several ways to access the response content:

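response = requests.get('https://example.com')

# Text content (decoded using response.encoding)
print(response.text)

# Binary content
with open('image.png', 'wb') as f:
    f.write(response.content)

# JSON response (parsed into Python objects)
data = response.json()

# Streaming a large download: stream=True keeps the body from being loaded into memory at once
with requests.get('https://example.com/big-file', stream=True) as response:
    for chunk in response.iter_content(chunk_size=8192):
        process_chunk(chunk)  # process_chunk is a placeholder for your own handler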
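
For large files, the streaming pattern is usually wrapped in a small function that writes chunks straight to disk. A sketch, assuming the server returns the file body directly; download_file and its parameters are hypothetical names:

```python
import requests

def download_file(url, dest_path, chunk_size=8192, timeout=(3.05, 30)):
    """Stream url into dest_path and return the number of bytes written."""
    written = 0
    # stream=True defers the body download until iter_content is consumed
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        with open(dest_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                written += len(chunk)
    return written
```

Using the response as a context manager releases the connection even if writing to disk fails partway through.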

4.2 Advanced Error Handling

Thorough error handling is the key to a stable crawler:

try:
    response = requests.get('https://example.com', timeout=5)
    response.raise_for_status()  # raise HTTPError for 4xx/5xx responses

    # Handle application-level errors
    data = response.json()
    if data.get('error'):
        raise ValueError(data['error'])

except requests.exceptions.Timeout:
    print("Request timed out")
except requests.exceptions.SSLError:
    print("SSL certificate error")
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
except ValueError as e:
    print(f"Content error: {e}")
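
The order of the except clauses matters because several requests exceptions inherit from more than one base class. The sketch below makes that explicit by mapping an exception to a coarse category; the category names and the RETRYABLE_STATUS set are illustrative assumptions, not something requests defines:

```python
import requests

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def classify_error(exc):
    """Map a requests exception to a coarse category (categories are illustrative)."""
    # Check Timeout before ConnectionError: ConnectTimeout inherits from both
    if isinstance(exc, requests.exceptions.Timeout):
        return 'timeout'
    # SSLError is a subclass of ConnectionError, so it must also come first
    if isinstance(exc, requests.exceptions.SSLError):
        return 'ssl'
    if isinstance(exc, requests.exceptions.ConnectionError):
        return 'connection'
    if isinstance(exc, requests.exceptions.HTTPError):
        status = exc.response.status_code if exc.response is not None else None
        return 'http_retryable' if status in RETRYABLE_STATUS else 'http_fatal'
    return 'other'
```

A crawler can use such a category to decide whether to retry, skip the URL, or stop the run.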

I usually wrap transient network errors in a retry decorator:

from functools import wraps
import time

import requests

def retry(max_attempts=3, delay=1):
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kwargs):
            attempts = 0
            while attempts < max_attempts:
                try:
                    return f(*args, **kwargs)
                except requests.exceptions.RequestException:
                    attempts += 1
                    if attempts == max_attempts:
                        raise
                    time.sleep(delay * attempts)
        return wrapper
    return decorator

@retry(max_attempts=5, delay=2)
def fetch_url(url):
    return requests.get(url, timeout=10)  # without a timeout a stalled request would never trigger a retry
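
The decorator above waits delay * attempts between tries, which is a linear backoff. A frequently used alternative is exponential backoff with jitter, so that many clients do not retry in lockstep. A sketch of just the delay schedule; backoff_delays and its parameters are hypothetical names:

```python
import random

def backoff_delays(max_attempts, base=1.0, cap=30.0, rng=random.random):
    """Yield the wait before each retry: exponential growth, capped, with full jitter."""
    for attempt in range(max_attempts - 1):  # no wait is needed after the final attempt
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling  # pick a random point between 0 and the ceiling

for wait in backoff_delays(max_attempts=4):
    print(f'would sleep {wait:.2f}s before retrying')
```

The rng parameter exists only so the schedule can be made deterministic in tests.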

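5. Advanced Configuration and Performance Tuning

5.1 Proxies and Connection Pools

Large-scale crawlers need to think about proxies and connection reuse:

# Proxy settings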
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

response = requests.get('http://example.com', proxies=proxies)

# Connection pool configuration
from requests.adapters import HTTPAdapter

s = requests.Session()
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=100, max_retries=3)
s.mount('http://', adapter)
s.mount('https://', adapter)
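
Proxies can also be set once on the Session instead of on every call. The factory below combines that with the pool configuration; make_session and its defaults are assumptions for illustration, and the proxy address is the placeholder used above:

```python
import requests
from requests.adapters import HTTPAdapter

def make_session(proxy_url=None, pool_maxsize=20):
    """Create a Session with a sized connection pool and an optional proxy."""
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=10, pool_maxsize=pool_maxsize)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    if proxy_url:
        # Route both schemes through the same proxy
        session.proxies.update({'http': proxy_url, 'https': proxy_url})
    return session

session = make_session(proxy_url='http://10.10.1.10:3128')
print(session.proxies)
```

When a session is shared by a pool of workers, setting pool_maxsize to at least the number of concurrent workers avoids connections being discarded and reopened.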

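5.2 Timeouts and Retry Strategy

Sensible timeouts keep a request from blocking for a long time:

# Separate connect timeout and read timeout
response = requests.get('http://example.com', timeout=(3.05, 27))

# Custom retry strategy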
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[408, 429, 500, 502, 503, 504]
)

adapter = HTTPAdapter(max_retries=retry_strategy)
s = requests.Session()
s.mount("https://", adapter)
s.mount("http://", adapter)
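
requests has no session-wide timeout setting, so a timeout has to be passed on every call. One workaround is a small HTTPAdapter subclass that fills in a default when the caller gives none. A sketch; TimeoutHTTPAdapter is a hypothetical class, not part of requests:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class TimeoutHTTPAdapter(HTTPAdapter):
    """HTTPAdapter that applies a default timeout when the caller gives none."""

    def __init__(self, *args, timeout=(3.05, 27), **kwargs):
        self.timeout = timeout
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        # Session.send passes timeout=None when the caller did not set one
        if kwargs.get('timeout') is None:
            kwargs['timeout'] = self.timeout
        return super().send(request, **kwargs)

retries = Retry(total=3, backoff_factor=1,
                status_forcelist=[408, 429, 500, 502, 503, 504])
adapter = TimeoutHTTPAdapter(max_retries=retries, timeout=(3.05, 27))

session = requests.Session()
session.mount('https://', adapter)
session.mount('http://', adapter)
```

An explicit timeout passed to session.get(...) still takes precedence over the adapter default.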

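6. Practical Experience and Common Problems

6.1 Performance Tips

Connection reuse: always reuse a Session object
Streaming downloads: use iter_content (with stream=True) for large files
Parallel requests: combine requests with concurrent.futures for concurrency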

from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholders; real entries must be full URLs including the scheme
urls = ['url1', 'url2', 'url3']

def fetch(url):
    return requests.get(url, timeout=10).text

with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, urls))
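
With executor.map, the first exception raised by a worker surfaces when its result is read and the remaining results are lost. A variant that records successes and failures separately might look like this; fetch_text and fetch_all are hypothetical helpers:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def fetch_text(url):
    """Default fetcher: GET a URL with a timeout and return the body text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

def fetch_all(urls, fetch=fetch_text, max_workers=5):
    """Fetch URLs concurrently; return (results, errors) dicts keyed by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep one failure from aborting the batch
                errors[url] = exc
    return results, errors
```

The fetch argument is injectable so that a session-based or rate-limited fetcher can be swapped in without changing the concurrency code.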

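6.2 Coping with Common Anti-Crawling Measures

User-Agent rotation: prepare several UAs and pick one at random
Rate control: add random delays between requests
Cookie handling: refresh the session periodically

import random
import time

import requests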

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
    # more UAs...
]

def random_delay():
    time.sleep(random.uniform(0.5, 1.5))

def make_request(url):
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    random_delay()
    return requests.get(url, headers=headers)
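
A fixed random sleep before every request also delays requests that were already far apart. A throttle that only sleeps for the remaining part of the interval is one refinement. A sketch; the Throttle class and its defaults are assumptions, and the clock and sleep parameters exist so it can be tested without real waiting:

```python
import random
import time

class Throttle:
    """Enforce a minimum, slightly randomized gap between consecutive requests."""

    def __init__(self, min_interval=1.0, jitter=0.5,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first one

    def wait(self):
        """Sleep just long enough to respect the interval; return seconds slept."""
        target_gap = self.min_interval + random.uniform(0, self.jitter)
        slept = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            if elapsed < target_gap:
                slept = target_gap - elapsed
                self._sleep(slept)
        self._last = self._clock()
        return slept
```

Calling throttle.wait() right before each requests.get(...) keeps the pace steady; a single instance is meant for one thread or one target host.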

6.3 Debugging Tips

Request logging:

import logging
logging.basicConfig(level=logging.DEBUG)

Inspect the request that was actually sent:

print(response.request.headers)  # headers that were sent
print(response.request.body)     # body that was sent
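
response.request is a PreparedRequest, and the same object can be built before anything is sent. The sketch below renders one as text for logging; format_request is a hypothetical helper and the URL and payload are placeholders:

```python
import requests

def format_request(prepared):
    """Render a PreparedRequest as readable text for logging."""
    lines = [f'{prepared.method} {prepared.url}']
    lines += [f'{name}: {value}' for name, value in prepared.headers.items()]
    body = prepared.body
    if body:
        if isinstance(body, bytes):
            body = body.decode('utf-8', errors='replace')
        elif not isinstance(body, str):
            body = repr(body)  # streamed or file-like bodies are not expanded
        lines += ['', body]
    return '\n'.join(lines)

session = requests.Session()
request = requests.Request('POST', 'https://api.example.com/data',
                           json={'query': 'python'})
prepared = session.prepare_request(request)  # merges in session-level headers
dump = format_request(prepared)
print(dump)
```

Preparing through the session rather than calling request.prepare() directly means the dump includes session-level headers and cookies.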

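Use a local debugging proxy:

proxies = {'http': 'http://127.0.0.1:8888', 'https': 'http://127.0.0.1:8888'}
# verify=False turns off certificate checks so the proxy can decrypt HTTPS; use it only while debugging
response = requests.get('http://example.com', proxies=proxies, verify=False)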

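requests is simple to use, but getting the most out of it in real crawler development requires a solid understanding of the details of HTTP. When you run into problems, I recommend reading the official requests documentation and source code, and using a packet capture tool to analyze what is actually sent over the wire.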