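1. Background and Core Challenges
When a PostgreSQL database runs into lock waits, quickly finding the statement that holds the lock is a thorny problem every DBA eventually faces. Last week our production environment hit a textbook lock pile-up: a single simple UPDATE blocked dozens of subsequent requests, and the whole business pipeline nearly ground to a halt. From that incident I distilled a complete lock-diagnosis methodology.
PostgreSQL's locking mechanism is elegantly designed but hard to troubleshoot. Unlike MySQL's SHOW PROCESSLIST, PostgreSQL requires you to combine several system views to reconstruct the full lock-wait chain. Worse, with the default configuration you may not even see the full SQL text of the session holding the lock; it is like finding fingerprints at a crime scene with no suspect file to match them against.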
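2. The Core Diagnostic Toolchain
2.1 Key System Views
-- Core lock information view
SELECT * FROM pg_locks WHERE pid = 12345;
-- Session activity view
SELECT * FROM pg_stat_activity WHERE pid = 12345;
-- Function returning the PIDs that block a given session (PostgreSQL 9.6+)
SELECT * FROM pg_blocking_pids(12345);
These two views and one function form the golden triangle of lock diagnosis. Keep the following in mind when using them:
- pg_locks only shows locks that are currently granted or currently being waited for; it keeps no history.
- The query column of pg_stat_activity may be truncated (limited by track_activity_query_size).
- A blocked process may already have timed out and exited, in which case the related information is gone.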
2.2 An Enhanced Diagnostic Query
This is my tuned combined query, which shows the complete lock-wait chain in one pass:
WITH lock_tree AS (
SELECT
blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.usename AS blocked_user,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_query,
blocking_activity.query AS blocking_query,
blocked_activity.application_name AS blocked_app,
blocking_activity.application_name AS blocking_app,
now() - blocked_activity.query_start AS blocked_duration,
now() - blocking_activity.query_start AS blocking_duration
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
)
SELECT * FROM lock_tree
ORDER BY blocking_duration DESC;
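Key improvements:
- Matches on every lock-identifying column, so it covers all lock types (row locks, table locks, transaction ID locks, and so on)
- Shows how long each side's current statement has been running, which helps judge severity
- Includes the application name, which makes it easier to trace the source
One caveat: the join does not check lock-mode compatibility, so it can also list sessions that hold a compatible lock on the same object. When in doubt, pg_blocking_pids() is authoritative.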
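Once the query returns rows, the sessions worth looking at first are the root blockers: those that block others but are not themselves waiting. A minimal Python sketch, assuming the (blocked_pid, blocking_pid) pairs have already been fetched from the query above; the function name and the sample PIDs are illustrative and not taken from the original incident:

```python
def find_root_blockers(edges):
    """Return PIDs that block other sessions but are not blocked themselves.

    `edges` is an iterable of (blocked_pid, blocking_pid) pairs.
    """
    edges = list(edges)  # allow generators; we iterate twice
    blocked = {blocked_pid for blocked_pid, _ in edges}
    blockers = {blocking_pid for _, blocking_pid in edges}
    return sorted(blockers - blocked)


# Hypothetical sample: 101 blocks 102, and 102 in turn blocks 103 and 104.
sample_edges = [(102, 101), (103, 102), (104, 102)]
print(find_root_blockers(sample_edges))  # [101]
```

Ending the root blocker's transaction usually releases the whole chain, which is why it is a better first target than any intermediate session.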
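3. The Diagnostic Workflow in Practice
3.1 Quickly Locating the Problem Session
When a lock alert comes in, I usually work through these steps:
- Confirm that sessions are waiting on locks:
SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock';
- List the sessions with the oldest statements, which are the likeliest blockers:
SELECT pid, query_start, state, wait_event_type, wait_event FROM pg_stat_activity WHERE backend_type = 'client backend' ORDER BY query_start ASC LIMIT 5;
- Inspect the lock-wait chain:
SELECT w.pid AS waiting_pid, w.query AS waiting_query, b.pid AS blocking_pid, b.query AS blocking_query FROM pg_stat_activity w JOIN pg_stat_activity b ON b.pid = ANY(pg_blocking_pids(w.pid));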
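3.2 Getting the Full SQL Text
pg_stat_activity.query often shows <insufficient privilege> or <command string not enabled>, or the text is cut off. The first means the current role is not allowed to read other users' sessions (connect as a superuser or a member of pg_read_all_stats); the second means track_activities is off. For truncated or missing text:
- Raise the stored query length (requires a restart):
# postgresql.conf
track_activity_query_size = 2048 # default 1024
- Look the statement up in pg_stat_statements (PostgreSQL 14+ with query IDs enabled; the text stored there is normalized, with constants replaced by placeholders):
SELECT s.query FROM pg_stat_statements s JOIN pg_stat_activity a ON a.query_id = s.queryid WHERE a.pid = <blocking_pid>;
- Search the server log (requires log_statement or log_lock_waits to be configured in advance):
grep "process <blocking_pid>" /var/log/postgresql/postgresql-14-main.log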
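4. Advanced Lock Analysis
4.1 Lock Modes in Depth
PostgreSQL has eight table-level lock modes, listed here from weakest to strongest:
| Lock mode | Conflicts with | Typical operations |
|---|---|---|
| ACCESS SHARE | ACCESS EXCLUSIVE | SELECT |
| ROW SHARE | EXCLUSIVE, ACCESS EXCLUSIVE | SELECT FOR UPDATE/SHARE |
| ROW EXCLUSIVE | SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE | UPDATE, DELETE, INSERT |
| SHARE UPDATE EXCLUSIVE | SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE | VACUUM (without FULL), ANALYZE, CREATE INDEX CONCURRENTLY |
| SHARE | ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE | CREATE INDEX |
| SHARE ROW EXCLUSIVE | ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE | CREATE TRIGGER, some forms of ALTER TABLE |
| EXCLUSIVE | ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE | REFRESH MATERIALIZED VIEW CONCURRENTLY |
| ACCESS EXCLUSIVE | All modes, including itself | DROP TABLE, TRUNCATE, VACUUM FULL, most forms of ALTER TABLE |
Understanding this matrix lets you predict which operations will block each other. For example, once you know that ALTER TABLE usually needs an ACCESS EXCLUSIVE lock, it is clear why it blocks every other operation on the table, plain SELECT included.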
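For scripts that need to reason about blocking, the conflict rules can be encoded as a lookup table. The sketch below transcribes the table-level conflict matrix from the PostgreSQL documentation ("Explicit Locking"); the names MODES, CONFLICTS and conflicts() are illustrative, not part of any PostgreSQL API:

```python
# The eight table-level lock modes, weakest to strongest.
MODES = [
    "ACCESS SHARE", "ROW SHARE", "ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE",
    "SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE", "ACCESS EXCLUSIVE",
]

# Each mode maps to the set of modes it conflicts with.
CONFLICTS = {
    "ACCESS SHARE": {"ACCESS EXCLUSIVE"},
    "ROW SHARE": {"EXCLUSIVE", "ACCESS EXCLUSIVE"},
    "ROW EXCLUSIVE": {"SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE",
                      "ACCESS EXCLUSIVE"},
    "SHARE UPDATE EXCLUSIVE": {"SHARE UPDATE EXCLUSIVE", "SHARE",
                               "SHARE ROW EXCLUSIVE", "EXCLUSIVE",
                               "ACCESS EXCLUSIVE"},
    "SHARE": {"ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE",
              "SHARE ROW EXCLUSIVE", "EXCLUSIVE", "ACCESS EXCLUSIVE"},
    "SHARE ROW EXCLUSIVE": {"ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE", "SHARE",
                            "SHARE ROW EXCLUSIVE", "EXCLUSIVE",
                            "ACCESS EXCLUSIVE"},
    "EXCLUSIVE": {"ROW SHARE", "ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE",
                  "SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE",
                  "ACCESS EXCLUSIVE"},
    "ACCESS EXCLUSIVE": set(MODES),
}


def conflicts(held, requested):
    """True if a request for `requested` must wait while `held` is held."""
    return requested in CONFLICTS[held]


print(conflicts("ACCESS SHARE", "ROW EXCLUSIVE"))     # False: SELECT does not block UPDATE
print(conflicts("ACCESS EXCLUSIVE", "ACCESS SHARE"))  # True: ALTER TABLE blocks SELECT
```

The relation is symmetric, so the argument order only matters for readability.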
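4.2 Detecting Transaction ID Wraparound Risk
A long-running transaction does not just block other operations; it also holds back VACUUM and can eventually lead to a transaction ID wraparound disaster:
SELECT pid, datname, usename, state, backend_xmin,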
age(backend_xmin) as xmin_age,
backend_xid, age(backend_xid) as xid_age,
query_start, query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL OR backend_xid IS NOT NULL
ORDER BY greatest(age(backend_xmin), age(backend_xid)) DESC;
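What to watch for:
- Transactions whose xmin_age exceeds 100 million need immediate attention
- A long-running VACUUM can itself become a source of blocking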
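The threshold check is easy to automate once the rows are in hand. A small sketch, assuming each row has been reduced to (pid, xmin_age, xid_age) from the query above; the function name and the sample values are hypothetical, and the 100 million default is simply the threshold suggested in this section:

```python
def flag_old_transactions(rows, threshold=100_000_000):
    """Return (pid, age) pairs whose XID age exceeds `threshold`, oldest first.

    `rows` is an iterable of (pid, xmin_age, xid_age); either age may be None,
    mirroring NULL backend_xmin / backend_xid in pg_stat_activity.
    """
    flagged = []
    for pid, xmin_age, xid_age in rows:
        # Use the larger of the two ages, ignoring missing values.
        age = max(a for a in (xmin_age, xid_age, 0) if a is not None)
        if age > threshold:
            flagged.append((pid, age))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample rows.
rows = [(201, 150_000_000, None), (202, 5_000, 4_000), (203, None, 320_000_000)]
print(flag_old_transactions(rows))  # [(203, 320000000), (201, 150000000)]
```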
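5. Automated Monitoring
5.1 Key Metrics
I recommend configuring the following metrics in your monitoring system:
- Number of sessions waiting on locks:
SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock';
- Age of the longest open transaction (an upper bound on how long a lock may have been held):
SELECT max(now() - xact_start) FROM pg_stat_activity WHERE state IN ('idle in transaction', 'active');
- Depth of the lock-wait chain:
WITH RECURSIVE lock_chains AS (
    -- depth 1: each waiting session paired with each of its direct blockers
    SELECT a.pid, b.blocker, 1 AS depth
    FROM pg_stat_activity a
    CROSS JOIN LATERAL unnest(pg_blocking_pids(a.pid)) AS b(blocker)
    UNION ALL
    -- keep following the chain while the blocker is itself blocked
    SELECT lc.pid, b.blocker, lc.depth + 1
    FROM lock_chains lc
    CROSS JOIN LATERAL unnest(pg_blocking_pids(lc.blocker)) AS b(blocker)
    WHERE lc.depth < 20 -- guard against cycles (a deadlock not yet resolved)
)
SELECT max(depth) AS max_depth FROM lock_chains;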
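For a monitoring script that already collects pg_blocking_pids() for each waiting session, the chain depth can also be computed on the client side. A minimal sketch, assuming the results have been loaded into a dict that maps a waiting PID to the list of PIDs blocking it; the function name and sample data are illustrative:

```python
def max_chain_depth(blocked_by):
    """Length of the longest waiter -> blocker chain.

    `blocked_by` maps a waiting PID to the list of PIDs blocking it.
    Sessions that are not waiting simply do not appear as keys.
    """
    def depth(pid, seen):
        if pid in seen:  # cycle, e.g. a deadlock that has not been resolved yet
            return 0
        blockers = blocked_by.get(pid, [])
        if not blockers:
            return 0
        return 1 + max(depth(b, seen | {pid}) for b in blockers)

    return max((depth(pid, frozenset()) for pid in blocked_by), default=0)


# Hypothetical sample: 103 and 104 wait on 102, which waits on 101.
print(max_chain_depth({102: [101], 103: [102], 104: [102]}))  # 2
```

A depth of 1 means simple contention; larger values indicate a pile-up behind a single root blocker.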
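5.2 Alert Handling Procedure
When a lock alert fires, I suggest the following procedure:
- Clean up stale sessions (save the output of the diagnostic queries first, because terminating a session destroys the evidence):
# Note: pg_cancel_backend only interrupts a running statement, so it has no effect on a session that is already idle in transaction; pg_terminate_backend is what actually ends such a session.
psql -c "SELECT pg_cancel_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND now() - xact_start > interval '10 minutes';"
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND now() - xact_start > interval '1 hour';"
- Generate the diagnostic report automatically:
psql -f lock_diagnosis.sql -o /tmp/lock_report_$(date +%Y%m%d).txt
- Email the DBA team, attaching the diagnostic report and key screenshots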
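6. Engineering Practices for Preventing Lock Problems
6.1 Application-Level Strategies
- Transaction design principles:
  - Keep transactions short and fast, ideally finishing within seconds
  - Avoid user interaction or network requests inside a transaction
  - Split large transactions into small batches
- Lock acquisition order:
  - Define a global table access order (for example, alphabetical by table name)
  - Use LOCK TABLE to declare lock requirements explicitly
- Retry mechanism (assumes lock_timeout is set or NOWAIT is used, so that a blocked statement fails with LockNotAvailable):
import time
from psycopg2 import errors

class LockTimeoutError(Exception):
    """Raised when the lock is still unavailable after all retries."""

def execute_with_retry(conn, query, params=None, max_retries=3):
    for attempt in range(max_retries):
        try:
            with conn.cursor() as cursor:
                cursor.execute(query, params)
            conn.commit()
            return
        except errors.LockNotAvailable:
            conn.rollback()           # the failed statement aborted the transaction
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    raise LockTimeoutError(f"Failed after {max_retries} retries")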
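The lock-ordering rule above can be enforced mechanically: if every transaction takes its table locks in the same sorted order, two transactions cannot each hold a lock the other is waiting for. A minimal sketch that only builds the statements; the helper name is hypothetical, it assumes unqualified table names, and the lock mode is passed through unchecked:

```python
def lock_statements(tables, mode="ROW EXCLUSIVE"):
    """Build LOCK TABLE statements in a fixed (alphabetical) order.

    `tables` holds unqualified table names; duplicates are dropped.
    """
    def quote(name):
        # Quote as an SQL identifier, doubling any embedded double quotes.
        return '"' + name.replace('"', '""') + '"'

    return [f"LOCK TABLE {quote(t)} IN {mode} MODE;" for t in sorted(set(tables))]


# The caller's order does not matter; the output order is always the same.
for stmt in lock_statements(["orders", "accounts", "orders"]):
    print(stmt)
# LOCK TABLE "accounts" IN ROW EXCLUSIVE MODE;
# LOCK TABLE "orders" IN ROW EXCLUSIVE MODE;
```

LOCK TABLE is only useful inside a transaction block, so the generated statements would be issued right after BEGIN.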
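6.2 Database Configuration
Suggested settings for the key parameters:
# Bound lock wait time
lock_timeout = 5s # default 0 means wait indefinitely
deadlock_timeout = 1s # default 1s
# Improve observability
log_lock_waits = on # log waits longer than deadlock_timeout
log_statement = 'all' # use with caution in production
# Control concurrency
max_connections = 100 # adjust to the hardware
statement_timeout = 30s # guard against long-running queries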
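7. Difficult Cases
7.1 Phantom Blocking
I once ran into a strange situation: blocking was reported, but no blocking process could be found. The cause turned out to be a replication slot serving a standby (the query below must run on the primary, since pg_current_wal_lsn() is not available during recovery):
SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
Solution (a standby normally reconnects after its walsender is terminated; a slot that is no longer needed should be removed with pg_drop_replication_slot instead):
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE backend_type = 'walsender';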
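7.2 Foreign Key "Lock Escalation"
A simple DELETE appeared to block an entire table. PostgreSQL does not escalate row locks to table locks; the real cause was a foreign key with no index on the referencing columns, so every deleted parent row triggered a sequential scan of the child table and the transaction held its locks far longer than expected:
-- Find foreign keys whose columns are not the leading columns of any index
SELECT c.conname, c.conrelid::regclass, c.confrelid::regclass
FROM pg_constraint c
WHERE c.contype = 'f'
  AND NOT EXISTS (
    SELECT 1 FROM pg_index i
    WHERE i.indrelid = c.conrelid
      AND (i.indkey::int2[])[0:cardinality(c.conkey) - 1] @> c.conkey
  );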
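Fix:
-- Create an index for every foreign key that lacks one.
-- Plain CREATE INDEX blocks writes to the table while it runs.
DO $$
DECLARE
    r RECORD;
BEGIN
    FOR r IN
        SELECT c.conrelid::regclass AS table_from,
               c.conname AS fk_name,
               pg_get_constraintdef(c.oid) AS fk_definition
        FROM pg_constraint c
        WHERE c.contype = 'f'
          AND NOT EXISTS (
            SELECT 1 FROM pg_index i
            WHERE i.indrelid = c.conrelid
              AND (i.indkey::int2[])[0:cardinality(c.conkey) - 1] @> c.conkey)
    LOOP
        -- Extract the column list from "FOREIGN KEY (cols) REFERENCES ..."
        EXECUTE format('CREATE INDEX ON %s (%s)',
                       r.table_from,
                       regexp_replace(r.fk_definition,
                                      'FOREIGN KEY \((.*)\) REFERENCES.*',
                                      '\1'));
    END LOOP;
END $$;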
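On a busy system it can be safer to generate the statements and review them before running anything. The sketch below mirrors the column extraction done by regexp_replace above, assuming the input string has the form that pg_get_constraintdef returns for a foreign key; the function name and the sample definition are illustrative:

```python
import re

# Captures the referencing column list from a foreign key definition.
FK_COLUMNS = re.compile(r"FOREIGN KEY \((.*?)\) REFERENCES")


def index_statement(table, fk_definition):
    """Build a CREATE INDEX statement for the referencing columns of one FK."""
    match = FK_COLUMNS.search(fk_definition)
    if match is None:
        raise ValueError(f"not a foreign key definition: {fk_definition!r}")
    return f"CREATE INDEX ON {table} ({match.group(1)});"


# Hypothetical sample definition.
definition = "FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE CASCADE"
print(index_statement("orders", definition))
# CREATE INDEX ON orders (customer_id);
```

Generating the statements separately also leaves room to switch to CREATE INDEX CONCURRENTLY, which cannot run inside a DO block because it is not allowed within a transaction.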