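1. Background and core challenges
When a PostgreSQL database hits a performance bottleneck or a deadlock, quickly finding the query that holds the lock is an essential DBA skill. When troubleshooting databases behind high-concurrency systems, I often see sessions that are blocked with no obvious way to trace the source. Unlike MySQL, where SHOW PROCESSLIST gives a quick overview, PostgreSQL has a more elaborate locking model, and you need to combine several system views to reconstruct the full lock-wait chain.
Last week our production environment hit a typical case: UPDATE statements from the order service timed out en masse, but standard monitoring showed only a large number of "waiting" sessions, while the real culprit was buried among a mass of active queries. Using the methods below, we eventually traced it to a reporting query that was missing an index and had been holding an ACCESS EXCLUSIVE lock for a long time.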
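2. Core system views
2.1 Key views for lock monitoring
PostgreSQL provides several views and catalogs that are central to lock diagnosis:
- pg_locks: a live snapshot of all locks currently held or awaited
- pg_stat_activity: the current state of every server process
- pg_class: catalog of database objects (tables, indexes, and so on)
- pg_database: catalog of databases
Their relationship resembles a forensic fingerprint database: pg_locks records the lock "fingerprints", pg_stat_activity records the holder's "identity", and pg_class and pg_database provide the map of the "crime scene".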
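2.2 Lock mode quick reference
PostgreSQL has eight table-level lock modes, listed here from least to most restrictive:
| Lock mode | Conflict level | Typical statements |
|---|---|---|
| ACCESS SHARE | Lowest | SELECT |
| ROW SHARE | | SELECT ... FOR UPDATE / FOR SHARE |
| ROW EXCLUSIVE | | INSERT / UPDATE / DELETE |
| SHARE UPDATE EXCLUSIVE | | VACUUM (without FULL), ANALYZE, CREATE INDEX CONCURRENTLY |
| SHARE | | CREATE INDEX (without CONCURRENTLY) |
| SHARE ROW EXCLUSIVE | | CREATE TRIGGER, some forms of ALTER TABLE |
| EXCLUSIVE | | REFRESH MATERIALIZED VIEW CONCURRENTLY |
| ACCESS EXCLUSIVE | Highest | DROP TABLE, TRUNCATE, VACUUM FULL, many forms of ALTER TABLE |
Key takeaway: lock conflicts usually arise between operations at different levels. ACCESS EXCLUSIVE, for example, conflicts with every lock mode, including ACCESS SHARE, so it blocks even plain SELECTs.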
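The conflict rules behind the table can be written down as a small lookup. The sketch below encodes the table-level conflict matrix from the PostgreSQL documentation as a Python dictionary; `MODES`, `CONFLICTS`, and the `conflicts` helper are illustrative names, not part of any PostgreSQL client library.

```python
# Table-level lock modes, ordered from least to most restrictive.
MODES = [
    "ACCESS SHARE",
    "ROW SHARE",
    "ROW EXCLUSIVE",
    "SHARE UPDATE EXCLUSIVE",
    "SHARE",
    "SHARE ROW EXCLUSIVE",
    "EXCLUSIVE",
    "ACCESS EXCLUSIVE",
]

# Each mode maps to the set of modes it cannot coexist with on the same table.
CONFLICTS = {
    "ACCESS SHARE": {"ACCESS EXCLUSIVE"},
    "ROW SHARE": {"EXCLUSIVE", "ACCESS EXCLUSIVE"},
    "ROW EXCLUSIVE": {
        "SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE", "ACCESS EXCLUSIVE",
    },
    "SHARE UPDATE EXCLUSIVE": {
        "SHARE UPDATE EXCLUSIVE", "SHARE", "SHARE ROW EXCLUSIVE",
        "EXCLUSIVE", "ACCESS EXCLUSIVE",
    },
    "SHARE": {
        "ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE", "SHARE ROW EXCLUSIVE",
        "EXCLUSIVE", "ACCESS EXCLUSIVE",
    },
    "SHARE ROW EXCLUSIVE": {
        "ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE", "SHARE",
        "SHARE ROW EXCLUSIVE", "EXCLUSIVE", "ACCESS EXCLUSIVE",
    },
    # EXCLUSIVE tolerates only ACCESS SHARE (plain reads).
    "EXCLUSIVE": set(MODES) - {"ACCESS SHARE"},
    # ACCESS EXCLUSIVE conflicts with everything, itself included.
    "ACCESS EXCLUSIVE": set(MODES),
}


def conflicts(held, requested):
    """Return True if a lock in mode `requested` must wait for one in mode `held`."""
    return requested in CONFLICTS[held]


if __name__ == "__main__":
    print(conflicts("ACCESS EXCLUSIVE", "ACCESS SHARE"))
    print(conflicts("ROW EXCLUSIVE", "ROW EXCLUSIVE"))
    print(conflicts("SHARE", "ROW EXCLUSIVE"))
```

Note that ROW EXCLUSIVE does not conflict with itself: two UPDATEs on the same table never block each other at the table level and only contend when they touch the same rows.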
3. Diagnostic SQL in practice
3.1 Basic diagnostic query
This query lists every blocking relationship together with the blocked and blocking statements:
```sql
-- Pair each ungranted lock request with other sessions' locks on the same target.
-- blocking_statement is the blocker's most recent statement; if that session is
-- idle in transaction, the statement shown may already have finished.
SELECT blocked_locks.pid                 AS blocked_pid,
       blocking_locks.pid                AS blocking_pid,
       blocked_activity.usename          AS blocked_user,
       blocking_activity.usename         AS blocking_user,
       blocked_activity.query            AS blocked_statement,
       blocking_activity.query           AS blocking_statement,
       blocked_activity.application_name AS blocked_app,
       blocking_activity.application_name AS blocking_app
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
 AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
 AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
 AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
 AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
 AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
 AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
 AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
 AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
```
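3.2 Reading the results
The output looks something like this:
| blocked_pid | blocking_pid | blocked_user | blocking_user | blocked_statement | blocking_statement |
|---|---|---|---|---|---|
| 7890 | 6543 | app_user | batch_user | UPDATE orders SET... | VACUUM FULL orders |
This reveals a classic scenario: a VACUUM FULL run by operations, which requires an ACCESS EXCLUSIVE lock, is blocking ordinary business UPDATEs. In this situation you should:
- Evaluate whether VACUUM FULL is really necessary (plain VACUUM is usually enough)
- Run such operations during off-peak hours
- If the goal is to remove index bloat, consider rebuilding the indexes with REINDEX CONCURRENTLY (PostgreSQL 12+) instead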
3.3 Enhanced diagnostic script
For more complex lock-wait chains, I often use this enhanced query:
```sql
WITH lock_tree AS (
    SELECT
        l1.pid AS holder,
        l2.pid AS waiter,
        a1.query AS holding_query,
        a2.query AS waiting_query,
        a1.application_name AS holder_app,
        a2.application_name AS waiter_app,
        -- Durations are measured from the start of each session's current statement
        now() - a1.query_start AS holding_duration,
        now() - a2.query_start AS waiting_duration
    FROM pg_locks l1
    JOIN pg_locks l2
      ON l1.locktype = l2.locktype
     AND l1.database IS NOT DISTINCT FROM l2.database
     AND l1.relation IS NOT DISTINCT FROM l2.relation
     AND l1.page IS NOT DISTINCT FROM l2.page
     AND l1.tuple IS NOT DISTINCT FROM l2.tuple
     AND l1.virtualxid IS NOT DISTINCT FROM l2.virtualxid
     AND l1.transactionid IS NOT DISTINCT FROM l2.transactionid
     AND l1.classid IS NOT DISTINCT FROM l2.classid
     AND l1.objid IS NOT DISTINCT FROM l2.objid
     AND l1.objsubid IS NOT DISTINCT FROM l2.objsubid
     AND l1.pid != l2.pid
     AND l1.granted      -- the holder must actually hold the lock
     AND NOT l2.granted  -- the waiter is still waiting for it
    JOIN pg_stat_activity a1 ON a1.pid = l1.pid
    JOIN pg_stat_activity a2 ON a2.pid = l2.pid
)
SELECT * FROM lock_tree
ORDER BY holding_duration DESC;
```
The advantages of this CTE query are:
- It shows how long the holder has been running (holding_duration, measured from the start of the holder's current statement), which makes long-term holders easy to spot
- It sorts by that duration in descending order, so the most likely culprit comes first
- It includes the application name, which helps trace the problem back to its source
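The holder/waiter pairs returned by a query like this form a graph, and the sessions worth looking at first are the ones that block others without waiting themselves. A minimal Python sketch of that reduction follows; the function names are hypothetical, the input is assumed to be a list of `(waiter_pid, holder_pid)` tuples taken from the query result, and the sample pids are made up for illustration.

```python
from collections import defaultdict


def root_blockers(edges):
    """Return pids that block someone but are not themselves waiting."""
    edges = list(edges)
    waiters = {waiter for waiter, _ in edges}
    holders = {holder for _, holder in edges}
    return sorted(holders - waiters)


def blocked_behind(edges, root):
    """Return every pid that waits on `root`, directly or through other waiters."""
    waiters_of = defaultdict(set)  # holder pid -> pids waiting on it
    for waiter, holder in edges:
        waiters_of[holder].add(waiter)

    seen, stack = set(), [root]
    while stack:
        pid = stack.pop()
        for waiter in waiters_of.get(pid, ()):
            if waiter not in seen:  # guards against cycles (deadlocks)
                seen.add(waiter)
                stack.append(waiter)
    return sorted(seen)


if __name__ == "__main__":
    # (waiter, holder) pairs: 7890 waits on 6543, and two more sessions queue behind 7890.
    sample = [(7890, 6543), (8001, 7890), (8002, 7890)]
    for root in root_blockers(sample):
        print(root, "blocks", blocked_behind(sample, root))
```

Terminating or fixing the root blocker usually releases the whole queue behind it, which is why it is more useful to rank sessions this way than to inspect each waiting pair separately.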
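4. Handling advanced scenarios
4.1 Lock waits caused by distributed transactions
With 2PC (two-phase commit), a prepared transaction can keep holding locks. Check for this with:
```sql
SELECT gid, prepared, owner, database, transaction
FROM pg_prepared_xacts;
```
To resolve it:
- Confirm the transaction state: `SELECT * FROM pg_prepared_xacts`
- Commit or roll back, using the gid reported by the view: `COMMIT PREPARED 'gid'` or `ROLLBACK PREPARED 'gid'`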
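4.2 Investigating row-level lock conflicts
When sessions are waiting on tuple-level locks, this query pinpoints the table and row involved. Entries with locktype 'tuple' show up as ungranted only while several sessions queue for the same row; a lone waiter normally waits on the holder's transactionid lock instead and does not appear here.
```sql
SELECT blocked_activity.pid  AS blocked_pid,
       blocking_activity.pid AS blocking_pid,
       blocked_activity.query  AS blocked_query,
       blocking_activity.query AS blocking_query,
       blocked_locks.relation::regclass AS locked_table,
       blocked_locks.page  AS locked_page,   -- block number of the contested row
       blocked_locks.tuple AS locked_tuple,  -- tuple number within that block
       pg_blocking_pids(blocked_activity.pid) AS blocking_pids  -- PostgreSQL 9.6+
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.locktype = 'tuple'
 AND blocking_locks.relation = blocked_locks.relation
 AND blocking_locks.page = blocked_locks.page
 AND blocking_locks.tuple = blocked_locks.tuple
 AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
```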
4.3 Automated monitoring
For production environments, I recommend setting up the following:
- Create a dedicated monitoring view:
```sql
CREATE VIEW lock_monitor AS
SELECT blocked_locks.pid  AS blocked_pid,
       blocking_locks.pid AS blocking_pid,
       blocked_activity.query  AS blocked_query,
       blocking_activity.query AS blocking_query,
       blocked_activity.application_name  AS blocked_app,
       blocking_activity.application_name AS blocking_app,
       now() - blocked_activity.query_start  AS blocked_duration,
       now() - blocking_activity.query_start AS blocking_duration
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
 AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
 AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
 AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
 AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
 AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
 AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
 AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
 AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
```
- Set up an alert rule (example Prometheus rule; the metric names depend on how your exporter is configured):
```yaml
- alert: LongRunningBlockingQueries
  expr: |
    pg_stat_activity_query_duration_seconds{state="active"} > 300
    and on(pid) pg_locks_granted == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Query blocked for more than 5 minutes"
    description: "PID {{ $labels.pid }} has been blocked for {{ humanize $value }} seconds"
```
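5. Performance tuning and prevention
5.1 Lock timeout configuration
A sensible lock wait timeout keeps one blocked statement from snowballing into a system-wide pile-up:
```sql
-- Session level
SET lock_timeout = '3s';
-- Database level (takes effect for new sessions)
ALTER DATABASE mydb SET lock_timeout = '3s';
-- Role level (takes effect for new sessions)
ALTER ROLE app_user SET lock_timeout = '5s';
```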
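With lock_timeout set, a statement that cannot get its lock in time fails with SQLSTATE 55P03 (lock_not_available) instead of waiting indefinitely, so the application has to handle that error. One common approach is a bounded retry with exponential backoff. The sketch below is driver-agnostic: `LockNotAvailable`, `run_with_retry`, and its parameters are hypothetical names standing in for whatever exception and call your database driver actually provides.

```python
import time


class LockNotAvailable(Exception):
    """Hypothetical stand-in for the driver error that carries SQLSTATE 55P03."""


def run_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on LockNotAvailable, back off and retry.

    The delay doubles after each failed attempt. The last failure is re-raised
    so the caller can decide what to do when the lock never becomes available.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except LockNotAvailable:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_update():
        # Simulates an UPDATE that hits lock_timeout twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise LockNotAvailable("canceling statement due to lock timeout")
        return "updated"

    print(run_with_retry(flaky_update, sleep=lambda seconds: None))
```

Keeping the number of attempts small matters here: unbounded retries would recreate the pile-up that the timeout was meant to prevent.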
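5.2 Transaction design best practices
- Keep transactions short:
  - Avoid time-consuming computation inside a transaction
  - Split large transactions into small batches
- Acquire locks in a consistent order:
  - Have all transactions access tables in the same order
  - Use an `ORDER BY` clause (for example with SELECT ... FOR UPDATE) so that rows are locked in a predictable order
- Use `SKIP LOCKED` (PostgreSQL 9.5+) for high-concurrency queue processing:
  ```sql
  SELECT * FROM jobs WHERE status = 'pending' LIMIT 1 FOR UPDATE SKIP LOCKED;
  ```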
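Two of the practices above, small batches and a consistent lock order, combine naturally: sort the keys first, then process them in fixed-size chunks so that every transaction works through rows in ascending key order. A minimal sketch, where `ordered_batches` and the batch size are illustrative choices rather than anything prescribed by PostgreSQL:

```python
def ordered_batches(ids, batch_size):
    """Yield sorted, de-duplicated ids in chunks of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ordered = sorted(set(ids))
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


if __name__ == "__main__":
    for batch in ordered_batches([42, 7, 19, 7, 3, 88], batch_size=2):
        # Each batch would run in its own short transaction, for example
        # locking its rows with SELECT ... ORDER BY id FOR UPDATE before
        # updating them (an assumption about the surrounding application).
        print(batch)
```

Because every worker sees the keys in the same order, two workers whose key sets overlap tend to queue behind each other rather than deadlock.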
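5.3 Index optimization
Missing indexes are a common trigger for lock problems, because a slow scan keeps the statement, and with it the transaction and its locks, alive longer:
```sql
-- Find queries that may be missing an index (requires the pg_stat_statements extension).
-- On PostgreSQL 13+ the column is named total_exec_time instead of total_time.
SELECT query, calls, total_time, rows,
       total_time / calls AS avg_time,
       rows / calls       AS avg_rows
FROM pg_stat_statements
WHERE query LIKE '%WHERE%'
ORDER BY (total_time / calls) * (rows / calls) DESC
LIMIT 20;
```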
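6. Troubleshooting case studies
6.1 Phantom locks
Symptom: the queries show a lock wait, but no blocking_pid can be found.
Solutions:
- Check `pg_prepared_xacts` for prepared transactions
- Look for sessions whose client has gone quiet inside an open transaction (idle in transaction) and that still hold locks:
```sql
SELECT pid, query, now() - xact_start AS xact_duration
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - xact_start > interval '5 minutes';
```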
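6.2 Sequence contention
Contention on a sequence during bulk inserts can degrade performance:
```sql
-- Inspect the current cache setting of each sequence (pg_sequences, PostgreSQL 10+)
SELECT schemaname, sequencename, cache_size, last_value
FROM pg_sequences;
-- Preallocate values per session with CACHE to reduce contention
-- (cached values that a session never uses are lost, so expect larger gaps)
ALTER SEQUENCE my_seq CACHE 100;
```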
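6.3 Lock amplification on partitioned tables
A lock taken on a partitioned parent table propagates to all of its child tables. Solutions:
- Use `ONLY` to limit the scope of the operation:
```sql
UPDATE ONLY parent_table SET ... WHERE ...;
```
- Consider switching to declarative partitioning (PostgreSQL 10+)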
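7. Recommended tools
- The lock monitoring panel in pgAdmin
- pg_top for real-time monitoring (similar to the top command)
- The pg_stat_planner extension (for analyzing lock-wait history)
- A custom monitoring script, for example:
```bash
#!/bin/bash
# Connection settings come from the usual PG* environment variables (PGHOST, PGDATABASE, ...).
while true; do
  # Write the current long-running blockers to a timestamped log file.
  psql -c "SELECT * FROM lock_monitor WHERE blocking_duration > interval '5 minutes'" \
       -o "/var/log/pg_locks_$(date +%s).log"
  sleep 60
done
```
In day-to-day operations I have found that roughly 70% of lock problems stem from poor transaction design or missing indexes. One e-commerce system, lacking an index on customer_id in its orders table, nearly went down entirely during a Black Friday promotion. After building the right combination of indexes, we brought the average lock wait time down from 1200 ms to 23 ms.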