In today's internet era, downloading files is one of the most common tasks in programming. Whether for web scraping, data collection, or everyday automation, a robust approach to file downloads is essential. With its concise syntax and rich library ecosystem, Python is an ideal choice for the job.
Requests is the de facto standard HTTP client library for Python. Compared with the standard library's urllib it has clear advantages: a simpler, more readable API, automatic handling of encodings and redirects, and connection pooling via sessions.
Installation is straightforward:

```bash
pip install requests
```

Note: in production it is best to pin the version so a library update cannot break compatibility, e.g. `pip install requests==2.28.1`.
When you click a download link in a browser, these key steps happen behind the scenes:

1. The browser sends an HTTP GET request to the server.
2. The server responds with headers (status code, Content-Type, Content-Length).
3. The response body begins streaming over the connection.
4. The browser writes the received bytes to disk.

Our Python code simulates exactly this process:
```python
response = requests.get(url, stream=True)  # corresponds to steps 1-3
with open(filename, 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):  # corresponds to step 4
        f.write(chunk)
```
A mistake beginners often make is grabbing the entire body at once via response.content:

```python
# Dangerous! A large file will exhaust memory
with open('file', 'wb') as f:
    f.write(response.content)
```
Downloading in chunks keeps memory usage flat: each chunk is written to disk and released before the next one arrives, so file size no longer matters.

The b in the file mode 'wb' stands for binary mode, which is essential when downloading non-text files: in text mode Python would translate line endings (e.g. \n → \r\n on Windows) and silently corrupt images, archives, and other binary data.

A robust downloader must handle the following failure cases:
```python
import os
import requests

try:
    response = requests.get(url, timeout=10, stream=True)
    response.raise_for_status()  # raise on HTTP 4xx/5xx errors
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)
except requests.exceptions.RequestException as e:
    print(f"Download failed: {e}")
    if os.path.exists(filename):  # clean up the partially downloaded file
        os.remove(filename)
```
Use the tqdm library for a clean progress bar:

```python
import requests
from tqdm import tqdm

def download_with_progress(url, filename):
    response = requests.get(url, stream=True)
    total_size = int(response.headers.get('content-length', 0))
    with open(filename, 'wb') as f, tqdm(
        desc=filename,
        total=total_size,
        unit='iB',
        unit_scale=True
    ) as bar:
        for chunk in response.iter_content(chunk_size=1024):
            size = f.write(chunk)
            bar.update(size)
```
Resuming is implemented by recording how many bytes are already on disk:

```python
import os
import requests

def resume_download(url, filename):
    start_byte = os.path.getsize(filename) if os.path.exists(filename) else 0
    headers = {'Range': f'bytes={start_byte}-'} if start_byte else {}
    response = requests.get(url, headers=headers, stream=True)
    mode = 'ab' if start_byte else 'wb'  # append or start fresh
    with open(filename, mode) as f:
        for chunk in response.iter_content(8192):
            f.write(chunk)
```
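One caveat the sketch above glosses over: a server that ignores the Range header replies 200 with the full body, and appending that would corrupt the file. The header/mode decision can also be factored into a small pure helper (the helper name is my own), which makes it trivial to unit-test without any network traffic:

```python
def resume_state(existing_bytes):
    """Return (headers, file mode) for resuming a download.

    A non-zero offset requests the remainder via a Range header and
    appends; zero starts fresh. Callers should still verify that the
    server answered 206 Partial Content before appending.
    """
    if existing_bytes:
        return {'Range': f'bytes={existing_bytes}-'}, 'ab'
    return {}, 'wb'
```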
For large files, download segments in parallel:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

def download_range(url, start, end, filename):
    headers = {'Range': f'bytes={start}-{end}'}
    response = requests.get(url, headers=headers, stream=True)
    with open(filename, 'r+b') as f:
        f.seek(start)
        for chunk in response.iter_content(8192):
            f.write(chunk)

def parallel_download(url, filename, workers=4):
    response = requests.head(url)
    total_size = int(response.headers['content-length'])
    chunk_size = total_size // workers
    offsets = [(i * chunk_size, (i + 1) * chunk_size - 1)
               for i in range(workers)]
    # last segment takes the remainder; Range end bytes are inclusive,
    # so the final byte index is total_size - 1
    offsets[-1] = (offsets[-1][0], total_size - 1)
    with open(filename, 'wb') as f:
        f.truncate(total_size)  # preallocate the file
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = []
        for start, end in offsets:
            futures.append(executor.submit(
                download_range, url, start, end, filename
            ))
        for future in futures:
            future.result()
```
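The offset arithmetic is worth isolating into a pure function so it can be verified without hitting the network at all (the function name is my own):

```python
def split_ranges(total_size, workers):
    """Split [0, total_size) into `workers` inclusive (start, end) byte ranges."""
    chunk = total_size // workers
    ranges = [(i * chunk, (i + 1) * chunk - 1) for i in range(workers)]
    # fold the division remainder into the last range
    ranges[-1] = (ranges[-1][0], total_size - 1)
    return ranges
```

With a helper like this, a quick assertion confirms that the ranges tile the file exactly, which is easy to get wrong off by one.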
Before downloading, validate that the URL is well formed:

```python
from urllib.parse import urlparse

def is_valid_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme in ('http', 'https'),
                    result.netloc])
    except ValueError:
        return False
```
Sanitize filenames that come from URLs or response headers:

```python
import re

def sanitize_filename(filename):
    return re.sub(r'[\\/*?:"<>|]', "", filename)[:255]
```
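On Windows, reserved device names such as CON or NUL are also invalid filenames even after illegal characters are stripped. A stricter sketch (the extended helper name is my own):

```python
import re

# Windows reserved device names, invalid as file names with or without extension
WINDOWS_RESERVED = {"CON", "PRN", "AUX", "NUL",
                    *(f"COM{i}" for i in range(1, 10)),
                    *(f"LPT{i}" for i in range(1, 10))}

def sanitize_filename_strict(filename):
    # strip characters that are illegal on Windows, then cap the length
    cleaned = re.sub(r'[\\/*?:"<>|]', "", filename)[:255]
    # prefix reserved device names so the result is always usable
    stem = cleaned.split(".")[0].upper()
    if stem in WINDOWS_RESERVED:
        cleaned = "_" + cleaned
    return cleaned
```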
Reuse a single Session so connections are pooled across downloads:

```python
import requests

session = requests.Session()
# every download goes through the same session
response = session.get(url)
```
TLS certificate verification can be adjusted when necessary:

```python
requests.get(url, verify=False)  # not recommended in production
# the better option is to point at a CA certificate bundle
requests.get(url, verify='/path/to/cert.pem')
```
Always set timeouts:

```python
# 5-second connect timeout, 30-second read timeout
requests.get(url, timeout=(5, 30))
```
Route traffic through a proxy when necessary:

```python
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080'
}
requests.get(url, proxies=proxies)
```
Use a queue to orchestrate batch downloads:

```python
import threading
from queue import Queue

download_queue = Queue()

def worker():
    while True:
        url, filename = download_queue.get()
        try:
            download(url, filename)  # any single-file download function, e.g. the robust one above
        finally:
            download_queue.task_done()

# start 4 worker threads
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

# enqueue download jobs
download_queue.put(("http://example.com/file1.zip", "file1.zip"))
download_queue.put(("http://example.com/file2.pdf", "file2.pdf"))

# wait for every job to finish
download_queue.join()
```
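Daemon threads like those above are killed abruptly when the program exits. If workers must finish cleanly, a common alternative is a sentinel-based shutdown; here is a generic sketch (all names are my own) that runs any handler over queued tasks and collects the results:

```python
import threading
from queue import Queue

def run_workers(tasks, handler, workers=4):
    """Process `tasks` (argument tuples) with `handler` across worker threads."""
    q = Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = q.get()
            if item is None:  # sentinel: shut this worker down
                q.task_done()
                break
            try:
                r = handler(*item)
                with lock:
                    results.append(r)
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for task in tasks:
        q.put(task)
    for _ in threads:
        q.put(None)  # one sentinel per worker
    q.join()
    for t in threads:
        t.join()
    return results
```

In a downloader, `handler` would be the single-file download function and each task a `(url, filename)` pair.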
Implement bandwidth throttling:

```python
import time
import requests

def throttled_download(url, filename, max_speed_kb=100):
    response = requests.get(url, stream=True)
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(1024):
            f.write(chunk)
            if max_speed_kb:  # crude throttle: sleep proportionally to chunk size
                time.sleep(len(chunk) / (max_speed_kb * 1024))
```
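The sleep above ignores the time already spent receiving each chunk, so the effective rate ends up below the cap. A slightly more accurate approach computes only the remaining delay; the pure part is easy to test in isolation (the helper name is my own):

```python
def sleep_needed(bytes_done, elapsed_s, max_bytes_per_s):
    """Seconds still to sleep so that bytes_done / total_time <= max_bytes_per_s."""
    target_time = bytes_done / max_bytes_per_s
    return max(0.0, target_time - elapsed_s)
```

In the loop you would track `elapsed_s` with `time.monotonic()` deltas and call `time.sleep(sleep_needed(...))` after each write; when the network is already slower than the cap, the helper returns 0 and no time is wasted sleeping.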
Add an MD5 integrity check:

```python
import hashlib

def verify_file(filename, expected_md5):
    hash_md5 = hashlib.md5()
    with open(filename, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest() == expected_md5
```
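Hashing can also happen while the file is being written, which avoids a second pass over a large file. A sketch (the function name is my own) that takes any iterable of byte chunks, such as `response.iter_content(...)`:

```python
import hashlib

def write_with_hash(chunks, path):
    """Write an iterable of byte chunks to `path`, returning the MD5 hex digest."""
    h = hashlib.md5()
    with open(path, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)
            h.update(chunk)  # hash each chunk as it is written
    return h.hexdigest()
```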
In real projects I usually package these features into a reusable download library, for example a Downloader class that ties all the advanced features together:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

class Downloader:
    def __init__(self, max_workers=4, timeout=30, chunk_size=8192):
        self.session = requests.Session()
        self.executor = ThreadPoolExecutor(max_workers)
        self.timeout = timeout
        self.chunk_size = chunk_size

    def download(self, url, filename, progress=False):
        # full-featured single-file download goes here
        pass

    def batch_download(self, url_list):
        # batch download implementation
        pass
```

This object-oriented design keeps the code maintainable and extensible: adding a feature such as automatic retries or rate limiting is just a matter of adding a method.
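Automatic retries, in fact, need no custom method at all: requests sessions accept a transport adapter configured with urllib3's Retry class. A sketch (the parameter values are illustrative, not recommendations):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_retrying_session(total=3, backoff_factor=0.5):
    """Build a Session that transparently retries failed requests."""
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,          # waits 0.5 s, 1 s, 2 s, ...
        status_forcelist=[500, 502, 503, 504],  # also retry on these HTTP statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session
```

A Downloader like the one above could simply build its `self.session` this way, and every download method would get retry behavior for free.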