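In data processing and backend service development, database query performance is often the system bottleneck. When several SQL queries have to run together, the traditional synchronous approach wastes a lot of time waiting on I/O. This article walks through three ways to speed up SQL queries in Python, with a focus on a high-performance asynchronous approach built on the asyncmy driver.
We compare three implementations: a thread pool that runs synchronous queries, a pseudo-async coroutine version, and a truly asynchronous version built on asyncmy.
These three approaches trace the path from traditional synchronous code to modern asynchronous programming, and their performance differs markedly. Each one is broken down in turn below.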
```python
from concurrent.futures import ThreadPoolExecutor
import asyncio
import time

import pandas as pd
from sqlalchemy import create_engine

# SQL statements to run
all_tables = [
    "select * from hdrx.bas_source",
    "select * from hdrx.bas_station",
    "select * from hdrx.bas_unit",
    "select * from hdrx.source_data_day_his",
] * 2  # repeat the list to increase the load


def runsql(ttt):
    """Run one SQL query synchronously and return a DataFrame."""
    # Note: this builds a new engine (and connection pool) on every call.
    engine = create_engine(
        "mysql+pymysql://root:root@127.0.0.1:3306/hdrx")
    return pd.read_sql_query(ttt, con=engine)


async def mini():
    """Dispatch the blocking queries to a thread pool."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=10) as executor:
        tasks = [
            loop.run_in_executor(executor, runsql, table)
            for table in all_tables
        ]
        return await asyncio.gather(*tasks, return_exceptions=True)


# Timing test
start = time.time()
results = asyncio.run(mini())
print(f"Thread pool approach took: {time.time()-start:.2f}s")
```
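The core idea of the thread pool approach is to hand each blocking query to a worker thread through `loop.run_in_executor` and collect the results with `asyncio.gather`.
The advantage of this approach is that `runsql` stays ordinary synchronous code, so the existing pymysql driver and `pd.read_sql_query` keep working unchanged.
But the drawbacks are also obvious: every in-flight query occupies a thread, and in the measurements later in this article this approach shows the highest CPU usage and relatively high memory usage.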
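Tip: choose `max_workers` based on the database server's connection limit and the machine's CPU core count. A value of 2 to 3 times the number of CPU cores is the usual recommendation.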
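The sizing rule in the tip can be sketched as a small helper. This is a minimal illustration, not a tuning recipe: the function name `pick_max_workers` and the parameters `reserved` and `per_core` are hypothetical, and the database connection limit is something you would have to look up yourself.

```python
import os


def pick_max_workers(db_max_connections: int, reserved: int = 5, per_core: int = 3) -> int:
    """Return a pool size capped by both CPU count and database headroom.

    db_max_connections: connection limit of the database server (assumed known).
    reserved: connections left free for other clients (hypothetical default).
    per_core: threads per CPU core, following the 2-3x rule of thumb.
    """
    cores = os.cpu_count() or 1
    by_cpu = cores * per_core
    by_db = max(1, db_max_connections - reserved)
    # Whichever limit is tighter wins, and the pool never drops below one thread.
    return max(1, min(by_cpu, by_db))
```

The result can be passed straight to `ThreadPoolExecutor(max_workers=...)`. Because each worker thread holds one database connection while its query runs, the database-side cap is often the one that matters.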
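```python
import asyncio
import time

import pandas as pd
from sqlalchemy import create_engine

# all_tables is the SQL list defined in the thread pool example


async def runsql(ttt):
    """Pseudo-async query: declared async, but everything inside blocks."""
    engine = create_engine(
        "mysql+pymysql://root:root@127.0.0.1:3306/hdrx")
    # read_sql_query is synchronous and holds the event loop until it returns
    return pd.read_sql_query(ttt, con=engine)


async def mini():
    """Schedule the coroutines with gather."""
    tasks = [runsql(table) for table in all_tables]
    return await asyncio.gather(*tasks, return_exceptions=True)


# Timing test
start = time.time()
results = asyncio.run(mini())
print(f"Coroutine approach took: {time.time()-start:.2f}s")
```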
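This version looks like it uses asyncio coroutines, but it has a serious problem: pymysql and `pd.read_sql_query` are blocking calls, so each coroutine holds the event loop until its query returns and the queries end up running one after another.
This implementation is a counter-example and is not recommended. It illustrates a common pitfall in asynchronous programming: using the async/await syntax does not by itself turn synchronous I/O into asynchronous I/O.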
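The pitfall can be reproduced without a database. The sketch below uses `time.sleep` as a stand-in for a blocking driver call and `asyncio.sleep` as a stand-in for a truly asynchronous one; all function names are made up for illustration.

```python
import asyncio
import time


async def blocking_task(delay: float) -> None:
    """Declared async, but time.sleep blocks the whole event loop."""
    time.sleep(delay)


async def cooperative_task(delay: float) -> None:
    """Yields to the event loop while waiting, so other tasks can run."""
    await asyncio.sleep(delay)


async def timed(task, n: int, delay: float) -> float:
    """Run n copies of task through gather and return the elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(task(delay) for _ in range(n)))
    return time.perf_counter() - start


def compare(n: int = 4, delay: float = 0.05):
    """Return (blocking_elapsed, cooperative_elapsed) for n tasks."""
    blocking = asyncio.run(timed(blocking_task, n, delay))
    cooperative = asyncio.run(timed(cooperative_task, n, delay))
    return blocking, cooperative
```

With the blocking version the elapsed time grows roughly as `n * delay`, while the cooperative version stays close to a single `delay`, which is the same pattern the pseudo-async query code exhibits.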
```python
import asyncio
import time

import pandas as pd
from sqlalchemy.ext.asyncio import create_async_engine
from sqlalchemy.sql import text

# Database configuration
DB_CONFIG = {
    "user": "root",
    "password": "root",
    "host": "127.0.0.1",
    "port": 3306,
    "database": "hdrx"
}

# all_tables is the SQL list defined in the thread pool example


async def runsql_async(ttt):
    """Truly asynchronous SQL query using the asyncmy driver."""
    async_engine = create_async_engine(
        f"mysql+asyncmy://{DB_CONFIG['user']}:{DB_CONFIG['password']}@{DB_CONFIG['host']}:{DB_CONFIG['port']}/{DB_CONFIG['database']}",
        echo=False
    )
    async with async_engine.connect() as conn:
        # The await yields to the event loop while MySQL does the work
        result = await conn.execute(text(ttt))
        df = pd.DataFrame(result.fetchall(), columns=result.keys())
    # Release the pooled connections held by this per-call engine
    await async_engine.dispose()
    return df


async def mini_async():
    """Schedule all queries concurrently."""
    tasks = [runsql_async(sql) for sql in all_tables]
    return await asyncio.gather(*tasks, return_exceptions=True)


# Timing test
start = time.time()
results = asyncio.run(mini_async())
print(f"Pure async approach took: {time.time()-start:.2f}s")
```
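The performance advantage of this approach comes mainly from overlapping the I/O waits of all queries on a single event loop instead of dedicating a thread to each one.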
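Results for 8 moderately complex queries in a local development environment (8-core CPU, MySQL 8.0):

| Approach | Average time (s) | CPU usage | Memory usage |
|---|---|---|---|
| Thread pool (10 threads) | 1.82 | 75% | Relatively high |
| Pseudo-async coroutines | 3.15 | 25% | Low |
| Pure async (asyncmy) | 0.97 | 35% | Lowest |

Note: asyncmy requires Python 3.7+ and MySQL 5.6+, so support for older environments is limited.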
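The table reports average times, so a measurement along those lines needs more than one run per approach. The harness below is a rough sketch of how such an average could be collected. It is not the script used for the numbers above, and `benchmark` is a hypothetical helper name.

```python
import asyncio
import statistics
import time


def benchmark(make_coro, rounds: int = 5):
    """Run make_coro() `rounds` times and return (mean, stdev) in seconds.

    make_coro is a zero-argument callable returning a fresh coroutine,
    for example the mini_async function itself. Each round uses its own
    event loop via asyncio.run.
    """
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        asyncio.run(make_coro())
        samples.append(time.perf_counter() - start)
    # stdev needs at least two samples
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return statistics.mean(samples), stdev
```

For example, `benchmark(mini_async, rounds=10)` would time the pure async version ten times. Reporting the spread next to the mean makes it easier to tell a real difference from run-to-run noise such as a cold database cache.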
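```python
from sqlalchemy.ext.asyncio import create_async_engine, AsyncEngine


def get_engine() -> AsyncEngine:
    """Build one shared engine; create_async_engine itself is synchronous."""
    return create_async_engine(
        "mysql+asyncmy://user:pass@host/db",
        pool_size=10,        # connections kept open in the pool
        max_overflow=5,      # extra connections allowed beyond pool_size
        pool_recycle=3600,   # recycle connections older than this (seconds)
        pool_pre_ping=True   # check that a connection is alive before using it
    )


# Create the engine once and reuse it for every query
engine = get_engine()
```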
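A pool of `pool_size=10` plus `max_overflow=5` serves at most 15 connections at a time, so launching far more queries than that through `gather` only makes the extra ones queue for a connection. One way to keep the number of in-flight queries within the pool is an `asyncio.Semaphore`. The helper below is a sketch, and `run_bounded` is a hypothetical name.

```python
import asyncio


async def run_bounded(coro_factories, limit: int):
    """Run zero-argument coroutine factories with at most `limit` in flight.

    Results come back in input order; exceptions are returned, not raised,
    matching the gather(..., return_exceptions=True) style used earlier.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(factory):
        # The coroutine is only created and awaited once a slot is free
        async with semaphore:
            return await factory()

    return await asyncio.gather(
        *(guarded(factory) for factory in coro_factories),
        return_exceptions=True,
    )
```

With the query function from this article, a call might look like `await run_bounded([lambda s=sql: runsql_async(s) for sql in all_tables], limit=15)`, where 15 matches `pool_size + max_overflow`.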
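```python
from sqlalchemy import text
from tenacity import retry, stop_after_attempt, wait_exponential

# engine: the shared AsyncEngine created in the connection pool example


@retry(
    stop=stop_after_attempt(3),                          # give up after 3 attempts
    wait=wait_exponential(multiplier=1, min=1, max=10)   # exponential backoff, 1s to 10s
)
async def safe_query(sql: str):
    """Run a query, retrying on failure with exponential backoff."""
    try:
        async with engine.connect() as conn:
            result = await conn.execute(text(sql))
            return result.fetchall()
    except Exception as e:
        print(f"Query failed: {e}")
        raise  # re-raise so tenacity sees the failure and retries
```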
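As written, the decorator retries on any exception, including errors such as a SQL syntax mistake that will fail again on every attempt. The standard-library sketch below shows the same retry-with-backoff idea with an explicit list of retryable exception types. It is not how tenacity is implemented, and `retry_async` and its parameters are hypothetical names.

```python
import asyncio


async def retry_async(factory, attempts: int = 3, base_delay: float = 1.0,
                      max_delay: float = 10.0, retry_on=(ConnectionError, TimeoutError)):
    """Await factory() up to `attempts` times, backing off between tries.

    factory: zero-argument callable returning a fresh coroutine each time.
    retry_on: exception types treated as transient; anything else is
              raised immediately without retrying.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await factory()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            # Delay doubles each time: base, 2*base, 4*base, capped at max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(delay)
```

Which exception types count as transient depends on the driver, so the default tuple here is only a placeholder. With tenacity itself, the same restriction is expressed through the `retry=retry_if_exception_type(...)` argument.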
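We recommend monitoring the following key metrics:
Symptom: the number of database connections keeps growing until it reaches the limit.
Solution: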
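```python
import asyncio

# runsql_async is the asynchronous query function defined earlier


async def query_with_timeout(sql: str, timeout: float = 5.0):
    """Run a query, giving up and returning None after `timeout` seconds."""
    try:
        return await asyncio.wait_for(runsql_async(sql), timeout=timeout)
    except asyncio.TimeoutError:
        print(f"Query timeout: {sql}")
        return None
```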
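When a whole batch of queries is launched through `gather`, giving each query its own deadline keeps one slow query from failing the rest. The sketch below wraps arbitrary coroutines that way, and `gather_with_timeout` is a hypothetical helper name.

```python
import asyncio


async def gather_with_timeout(coros, timeout: float):
    """Await coroutines concurrently, each with its own deadline.

    Results come back in input order; an entry that exceeded `timeout`
    is cancelled and reported as None.
    """
    async def guarded(coro):
        try:
            return await asyncio.wait_for(coro, timeout=timeout)
        except asyncio.TimeoutError:
            return None

    return await asyncio.gather(*(guarded(coro) for coro in coros))
```

Note that `wait_for` cancels the waiting coroutine on the client side. The statement may keep running on the database server for a while afterwards, so a server-side statement timeout is still worth considering for expensive queries.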
For queries that may return a large amount of data:
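```python
from sqlalchemy import text

# engine: the shared AsyncEngine created in the connection pool example
# process_chunk: placeholder for your own per-chunk handler


async def stream_large_result(sql: str, chunk_size: int = 1000):
    """Stream a large result set in chunks instead of loading it all at once."""
    async with engine.connect() as conn:
        async with conn.stream(text(sql)) as result:
            # Each chunk is a list of up to chunk_size rows
            async for chunk in result.partitions(chunk_size):
                process_chunk(chunk)
```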
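To make the chunking behaviour concrete without a database, the sketch below groups any async iterator into fixed-size lists, which is conceptually what iterating over `partitions(chunk_size)` gives you. It is a stand-in for illustration, not SQLAlchemy's implementation, and `chunked` and `fake_rows` are made-up names.

```python
import asyncio


async def chunked(rows, size: int):
    """Yield lists of up to `size` items from an async iterator."""
    chunk = []
    async for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        # The last chunk may be shorter than `size`
        yield chunk


async def fake_rows(n: int):
    """Stand-in for a streamed result: yields n integers one at a time."""
    for i in range(n):
        await asyncio.sleep(0)
        yield i


async def collect(n: int, size: int):
    """Gather all chunks into a list so the grouping can be inspected."""
    return [chunk async for chunk in chunked(fake_rows(n), size)]
```

Only one chunk is held in memory at a time, so peak memory depends on `chunk_size` rather than on the total number of rows, provided the per-chunk handler does not accumulate everything itself.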
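After adopting the pure async approach with asyncmy in a real project, our query throughput more than tripled while server resource consumption dropped by 60%. The gain is most noticeable in microservice architectures that have to handle many concurrent queries. For new Python projects, designing around the asynchronous approach from the start is the wiser choice.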