Before we get into hands-on crawling, we need to make sure the development environment is set up correctly. Many beginners hit all sorts of strange problems at this stage, so I have collected the key points for configuring a Python 3.12+ environment, current as of 2026.
Python 3.12 is currently the most stable choice. Compared with the earlier 3.7-3.11 series, it brings significant improvements in async IO and type hints.

Note: do not use the Python 2.x series. It is not only end-of-life, but many modern libraries no longer support it.
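If you are not sure which interpreter a script will run under, a quick runtime check helps. A minimal sketch that enforces the 3.12+ baseline assumed throughout this guide:

```python
import sys

# Fail fast if the interpreter is older than the 3.12 baseline used in this guide
if sys.version_info < (3, 12):
    raise RuntimeError(f"Python 3.12+ required, found {sys.version.split()[0]}")
```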
Install the required libraries with the following command:
```bash
pip install requests==2.32.0 beautifulsoup4==4.12.0 lxml==5.2.0
```
These versions are pinned so that the examples in this guide stay reproducible; newer releases of each library will generally work as well.

Common installation problems and their fixes:

- SSL or certificate errors while pip downloads packages can usually be bypassed by explicitly trusting the official hosts:

```bash
pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org <package>
```

- On macOS, if building lxml fails because the compiler toolchain is missing, install the command line tools first:

```bash
xcode-select --install
```

For the editor I recommend the VS Code + Python extension combo. Key settings:
```json
{
    "python.linting.pylintEnabled": false,
    "python.linting.flake8Enabled": true,
    "python.formatting.provider": "black"
}
```
A complete HTTP request consists of a few key parts: the request line (method, path, and protocol version), the request headers, and an optional request body.
The Requests library hides most of this plumbing, but understanding the underlying mechanics still matters. Let's look at the most basic GET request:
```python
import requests

response = requests.get(
    url="https://example.com/api/data",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    },
    params={"page": 1, "limit": 20},
    timeout=5
)
```
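Once the response comes back, check it before you parse anything. A short follow-up using standard Response attributes:

```python
# Raise an HTTPError for 4xx/5xx status codes instead of failing later on bad data
response.raise_for_status()

print(response.status_code)              # e.g. 200
print(response.headers["Content-Type"])  # e.g. application/json
data = response.json()                   # parse a JSON body (raises ValueError otherwise)
```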
Many beginners overlook the importance of the Session object. For repeated requests to the same site, a Session lets you reuse the underlying TCP connection via keep-alive, persist cookies across requests, and share default headers instead of repeating them.
The optimized code:
```python
import requests

with requests.Session() as session:
    session.headers.update({"User-Agent": "MyCrawler/1.0"})
    session.max_redirects = 3
    # Note: Session has no global timeout setting, so pass timeout per request

    # First request
    response1 = session.get("https://example.com/login", timeout=3)
    # The second request reuses the connection and any cookies set above
    response2 = session.get("https://example.com/dashboard", timeout=3)
```
By default, requests follows redirects automatically, but sometimes you need manual control:
```python
response = requests.get(
    "https://example.com",
    allow_redirects=False,  # disable automatic redirects
    stream=True             # useful for large file downloads
)
```
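With redirects disabled, a 3xx response is returned as-is and you can read the Location header yourself. A minimal sketch:

```python
# Inspect the redirect target manually instead of following it blindly
if 300 <= response.status_code < 400:
    next_url = response.headers.get("Location")
    print(f"Server wants to redirect to: {next_url}")
```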
Sensible timeout settings keep the program from hanging:
```python
try:
    response = requests.get(
        "https://example.com",
        timeout=(3.05, 27)  # 3.05s connect timeout, 27s read timeout
    )
except requests.exceptions.Timeout:
    print("Request timed out")
```
To route requests through a proxy, pass a proxies mapping:

```python
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
requests.get("http://example.org", proxies=proxies)
```
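If every request should go through the proxy, setting it once on a Session is less repetitive (the addresses below are the same placeholders as above):

```python
import requests

session = requests.Session()
session.proxies.update({
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
})
# requests also honors the HTTP_PROXY / HTTPS_PROXY environment variables by default
```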
BeautifulSoup supports several parsers; as of 2026 lxml remains the recommended choice:
| Parser | Speed | Memory use | Fault tolerance | Dependency |
|---|---|---|---|---|
| html.parser | Slow | Low | Moderate | None (stdlib) |
| lxml | Fast | Medium | Good | Requires lxml |
| html5lib | Slowest | High | Best | Requires html5lib |
Actual performance depends heavily on your documents and hardware, so it is worth measuring on your own data.
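A quick way to get concrete numbers is to time each parser on a sample page; sample.html here is a hypothetical local file of roughly 100 KB:

```python
import timeit
from bs4 import BeautifulSoup

# sample.html is a placeholder; substitute any saved page you want to benchmark
with open("sample.html", encoding="utf-8") as f:
    html_doc = f.read()

# html5lib must be installed separately for the third entry to run
for parser in ("html.parser", "lxml", "html5lib"):
    seconds = timeit.timeit(lambda: BeautifulSoup(html_doc, parser), number=10)
    print(f"{parser}: {seconds / 10:.3f}s per parse")
```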
Basic lookups:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "lxml")

# By tag name
soup.find_all("a")  # all <a> tags
soup.find("div")    # the first <div>

# By attribute
soup.find(id="content")
soup.select("div.item")  # CSS selector
```
```python
# Combined selectors
soup.select("div.content > p.intro")

# Attribute selectors
soup.select("a[href^='https']")  # href starting with https
soup.select("img[width='200']")

# Text matching (bs4 deprecated the old text= argument in favor of string=)
import re
soup.find_all(string=re.compile("Python"))
```
Pulling several fields out of repeated elements:

```python
for article in soup.select("article"):
    title = article.select_one("h2.title").get_text(strip=True)
    date = article.select_one("time")["datetime"]
    author = article.find("span", class_="author").text
```
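select_one returns None when nothing matches, so the loop above raises AttributeError on pages missing any of these elements. A defensive variant of the same loop:

```python
for article in soup.select("article"):
    title_el = article.select_one("h2.title")
    time_el = article.select_one("time")
    author_el = article.find("span", class_="author")
    if not (title_el and time_el and author_el):
        continue  # skip articles missing an expected field
    title = title_el.get_text(strip=True)
    date = time_el.get("datetime", "")
    author = author_el.text
```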
Converting relative links to absolute URLs:

```python
from urllib.parse import urljoin

base_url = "https://example.com"
for link in soup.find_all("a", href=True):  # href=True skips anchors without a link
    absolute_url = urljoin(base_url, link["href"])
```
Collecting structured records:

```python
data = []
for item in soup.select(".product"):
    data.append({
        "name": item.select_one(".name").text,
        "price": float(item.select_one(".price").text.replace("$", "")),
        "rating": int(item.select_one(".stars")["data-rating"])
    })
```
The most common anti-crawling techniques in 2026 include User-Agent and header inspection, request-rate limiting, login and cookie walls, Client Hints fingerprinting, and TLS fingerprinting. The snippets below address each in turn.
Make the request headers look like a normal browser:

```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
    "DNT": "1"
}
```
Throttle the request rate:

```python
import time
import random

for page in range(1, 10):
    time.sleep(random.uniform(1, 3))  # random delay between requests
    response = requests.get(f"https://example.com/page/{page}")
```
Carry cookies across a login wall:

```python
# First obtain the cookies
login_response = requests.post(
    "https://example.com/login",
    data={"username": "user", "password": "pass"}
)
# Then reuse the cookies
profile_response = requests.get(
    "https://example.com/profile",
    cookies=login_response.cookies
)
```
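In practice a Session does this bookkeeping for you: cookies set by the login response are sent automatically on every later request. The same flow with the placeholder URLs above:

```python
import requests

with requests.Session() as session:
    session.post(
        "https://example.com/login",
        data={"username": "user", "password": "pass"}
    )
    # Login cookies are resent automatically by the session
    profile_response = session.get("https://example.com/profile")
```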
Some sites also check Client Hints headers as part of fingerprinting:

```python
headers = {
    "User-Agent": "Mozilla/5.0...",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-CH-UA": '"Chromium";v="112", "Google Chrome";v="112"',
    "Sec-CH-UA-Platform": '"Windows"',  # Client Hints platform values are quoted strings
    "Sec-CH-UA-Mobile": "?0"
}
```
Servers with legacy TLS configurations may require a customized SSL context:
```python
import ssl
import requests
from urllib3.util.ssl_ import create_urllib3_context

ctx = create_urllib3_context()
ctx.options |= ssl.OP_LEGACY_SERVER_CONNECT  # allow legacy renegotiation (flag value 0x4)

# The context must actually be wired into the adapter, or it has no effect
class LegacySSLAdapter(requests.adapters.HTTPAdapter):
    def init_poolmanager(self, *args, **kwargs):
        kwargs["ssl_context"] = ctx
        return super().init_poolmanager(*args, **kwargs)

session = requests.Session()
session.mount("https://", LegacySSLAdapter(pool_connections=1, max_retries=3))
```
For the hands-on project we will crawl a typical web-novel site. The goals:

- collect the full novel list from the paginated listing pages
- fetch every chapter of each novel by following next-chapter links
- save each novel locally (JSON for now, EPUB generation stubbed out)

Site characteristics, as reflected in the code below:

- the listing lives under /list and is paginated via a page query parameter
- each list entry exposes a title, an author, and a link to the first chapter
- each chapter page carries its text in a .content container and links onward via .next-chapter
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import json
import time

BASE_URL = "https://novel.example.com"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

def get_novel_list():
    novels = []
    page = 1
    while True:
        url = f"{BASE_URL}/list?page={page}"
        response = requests.get(url, headers=HEADERS)
        soup = BeautifulSoup(response.text, "lxml")
        items = soup.select(".novel-item")
        if not items:
            break
        for item in items:
            novels.append({
                "title": item.select_one(".title").text.strip(),
                "author": item.select_one(".author").text.strip(),
                "link": urljoin(BASE_URL, item.select_one("a")["href"])
            })
        page += 1
        time.sleep(1)
    return novels

def get_chapter_content(chapter_url):
    response = requests.get(chapter_url, headers=HEADERS)
    soup = BeautifulSoup(response.text, "lxml")
    next_link = soup.select_one(".next-chapter")  # None on the final chapter
    return {
        "title": soup.select_one("h1").text,
        "content": "\n".join(p.text for p in soup.select(".content p")),
        "next": urljoin(BASE_URL, next_link["href"]) if next_link else None
    }

def save_as_epub(novel_data, filename):
    # EPUB generation logic goes here
    pass

if __name__ == "__main__":
    novels = get_novel_list()
    for novel in novels[:3]:  # crawl only the first 3 novels as a demo
        chapters = []
        current_url = novel["link"]
        while current_url:
            chapter = get_chapter_content(current_url)
            chapters.append(chapter)
            current_url = chapter.get("next")
            time.sleep(0.5)
        novel["chapters"] = chapters
        with open(f"{novel['title']}.json", "w", encoding="utf-8") as f:
            json.dump(novel, f, ensure_ascii=False, indent=2)
```
Before you start crawling, confirm that:

- the site's robots.txt allows access to the pages you plan to fetch
- the terms of service do not forbid automated access
- your request rate is throttled to something the server can comfortably handle
For performance, configure connection pooling on a Session:

```python
from requests.adapters import HTTPAdapter
import requests

session = requests.Session()
adapter = HTTPAdapter(
    pool_connections=10,
    pool_maxsize=50,
    max_retries=3
)
session.mount("http://", adapter)
session.mount("https://", adapter)
```
Although requests is a synchronous library, a thread pool can still raise throughput:
```python
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(url):
    return requests.get(url).text

urls = ["https://example.com/page/1", "https://example.com/page/2"]
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, urls))
```
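Each fetch above opens a fresh connection. Combining the thread pool with one Session per worker thread restores connection reuse; a sketch using the same urls list:

```python
import threading
import requests
from concurrent.futures import ThreadPoolExecutor

thread_local = threading.local()

def get_session():
    # One Session per worker thread, each with its own connection pool
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session

def fetch(url):
    return get_session().get(url, timeout=5).text

with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, urls))
```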
Use requests-cache to avoid refetching the same pages:
```python
import requests_cache

requests_cache.install_cache(
    "demo_cache",
    expire_after=3600,  # cache entries for one hour
    allowable_methods=["GET", "POST"]
)
```
When requests start failing, the usual suspects are network problems, server timeouts, and anti-bot blocking. Whatever the cause, wrap the call in exception handling so one failure does not kill the whole run, and log the error for debugging:
```python
try:
    response = requests.get(url, timeout=5)
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
    # retry logic goes here
```
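Rather than hand-rolling retries, urllib3's Retry class can handle transient failures with exponential backoff. A minimal sketch:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=3,                                     # up to 3 retries per request
    backoff_factor=1,                            # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # also retry on these status codes
)
session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry))
session.mount("https://", HTTPAdapter(max_retries=retry))

response = session.get(url, timeout=5)
```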
Validate URLs before requesting them:

```python
from urllib.parse import urlparse

def is_valid_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False
```
Never hard-code sensitive information:
```python
# Bad
password = "123456"

# Good
import os
from dotenv import load_dotenv

load_dotenv()
password = os.getenv("API_PASSWORD")
```
```python
# Limit the number of redirects (a Session setting, not a get() argument)
session = requests.Session()
session.max_redirects = 3
response = session.get(url, allow_redirects=True)

# Disable SSL verification (use sparingly; it exposes you to man-in-the-middle attacks)
response = requests.get(url, verify=False)
```
In real crawler development, the thing I see neglected most often is request pacing. Many developers focus entirely on extracting data and ignore how important it is to control request frequency. One practical trick is to adjust the delay dynamically based on server response time: if responses slow down, automatically increase the wait. This is not only more ethical, it also makes the crawler more stable in the long run.
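A minimal sketch of that adaptive pacing idea, with the 2-second threshold and the caps chosen arbitrarily for illustration:

```python
import time
import requests

delay = 1.0  # starting delay in seconds

for url in urls:
    start = time.monotonic()
    response = requests.get(url, timeout=10)
    elapsed = time.monotonic() - start

    # Back off when the server slows down; drift toward the baseline when it is fast
    if elapsed > 2.0:
        delay = min(delay * 2, 30.0)   # cap the backoff at 30s
    else:
        delay = max(delay * 0.9, 1.0)  # never go below the 1s baseline

    time.sleep(delay)
```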