3 a.m., and your phone erupts with a shrill alert: the core production service is throwing a flood of "Deadlock found when trying to get lock" errors. As the on-call engineer, what you need right now is not a lengthy lecture on theory but a battle plan that stops the bleeding immediately. This article walks through using MySQL's built-in tools to locate the deadlock, kill the offending connection, and restore service within five minutes.
When a deadlock hits, the typical symptoms include:

- application logs filling with "Deadlock found when trying to get lock; try restarting transaction" (MySQL error 1213)
- transactions rolled back unexpectedly (InnoDB automatically aborts the cheaper victim)
- latency spikes on queries touching the contended rows
Quick verification command:

```sql
SHOW ENGINE INNODB STATUS\G
```
Search the output for the section titled "LATEST DETECTED DEADLOCK" to see the details of the most recent deadlock.
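Only the most recent deadlock is kept in that section. If deadlocks arrive faster than you can inspect them, InnoDB can write every one to the error log via a dynamic variable that is safe to flip on during an incident:

```sql
-- Record every detected deadlock in the error log,
-- not just the latest one shown by SHOW ENGINE INNODB STATUS
SET GLOBAL innodb_print_all_deadlocks = ON;
```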
Quickly locate the blocking source via the information_schema database:
```sql
SELECT
    r.trx_id              waiting_trx_id,
    r.trx_mysql_thread_id waiting_thread,
    r.trx_query           waiting_query,
    b.trx_id              blocking_trx_id,
    b.trx_mysql_thread_id blocking_thread,
    b.trx_query           blocking_query
FROM information_schema.innodb_lock_waits w
INNER JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
INNER JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
```
Key fields:

| Field | Meaning | Handling priority |
|---|---|---|
| waiting_thread | Thread ID waiting for the lock | Medium |
| blocking_thread | Thread ID holding the lock | Highest |
| blocking_query | The SQL statement doing the blocking | Main focus of analysis |
Once you have blocking_thread, run:

```sql
KILL [blocking_thread];  -- substitute the numeric thread ID found above
```
Note: prioritize killing transactions that have been running for a long time (> 30 s), so you avoid disrupting healthy business transactions. The query below lists candidates.
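A quick way to find those long runners, sketched against information_schema.innodb_trx (tune the 30-second threshold to your own tolerance):

```sql
-- Transactions open for more than 30 seconds, oldest first,
-- so KILL targets can be chosen deliberately
SELECT trx_mysql_thread_id AS thread_id,
       trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS runtime_seconds,
       trx_query
FROM information_schema.innodb_trx
WHERE trx_started < NOW() - INTERVAL 30 SECOND
ORDER BY trx_started;
```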
MySQL exposes several monitoring tables for deadlock analysis:
| Table | Key columns | Purpose |
|---|---|---|
| INNODB_TRX | trx_id, trx_state, trx_started | State of every current transaction |
| INNODB_LOCKS | lock_id, lock_mode, lock_type | Locks currently held or requested |
| INNODB_LOCK_WAITS | requesting_trx_id, blocking_trx_id | Who is waiting on whom |
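One caveat: INNODB_LOCKS and INNODB_LOCK_WAITS were removed in MySQL 8.0, where their role is played by performance_schema.data_locks and data_lock_waits. On 5.7 and 8.0 alike, the sys schema ships a ready-made view that does the joins for you; a minimal sketch:

```sql
-- sys.innodb_lock_waits works on both 5.7 and 8.0 and even generates
-- the KILL statement for you (sql_kill_blocking_connection)
SELECT waiting_pid, waiting_query,
       blocking_pid, blocking_query,
       wait_age, sql_kill_blocking_connection
FROM sys.innodb_lock_waits;
```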
On 5.7, the following joined query is the recommended way to pinpoint the problem quickly:
```sql
SELECT
    r.trx_id              waiting_trx,
    r.trx_mysql_thread_id waiting_thread,
    r.trx_query           waiting_query,
    b.trx_id              blocking_trx,
    b.trx_mysql_thread_id blocking_thread,
    b.trx_query           blocking_query,
    TIMESTAMPDIFF(SECOND, b.trx_started, NOW()) blocking_time
FROM information_schema.innodb_lock_waits w
INNER JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
INNER JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
ORDER BY blocking_time DESC;
```
Set up a monitoring script that runs periodically (a one-minute cron job works well):
```bash
#!/bin/bash
# Caveat: the "LATEST DETECTED DEADLOCK" section persists until the next
# deadlock, so this detects "a deadlock has occurred since startup",
# not "a new one just happened".
DEADLOCK=$(mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -c "LATEST DETECTED DEADLOCK")
if [ "$DEADLOCK" -gt 0 ]; then
    # Fire the alert and preserve the evidence
    mysql -e "SHOW ENGINE INNODB STATUS\G" > "/tmp/deadlock_$(date +%s).log"
    send_alert "MySQL deadlock detected"   # send_alert: your alerting hook
fi
```
A related knob is innodb_lock_wait_timeout = 50 (the default, in seconds), which caps how long a statement waits for a row lock before giving up with a lock-wait-timeout error.

Common deadlock triggers and their remedies:
| Problem type | Symptom | Remedy |
|---|---|---|
| Gap lock conflicts | Range queries lock far more than the matching rows | Switch to exact-match queries or adjust the isolation level (see the sketch after this table) |
| Missing index | A full table scan locks every row it examines | Add a suitable index |
| Hot row contention | The same row is updated at high frequency | Funnel updates through a queue or batch them |
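For the gap-lock row above, the isolation-level route looks like this. READ COMMITTED takes far fewer gap locks than the default REPEATABLE READ, but it requires row-based binary logging and weakens repeatable-read semantics, so weigh the trade-off first:

```sql
-- Check the current level (the variable was named tx_isolation before 5.7.20)
SELECT @@transaction_isolation;
-- Fewer gap locks; note that existing sessions keep their old level
SET GLOBAL transaction_isolation = 'READ-COMMITTED';
```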
Add retry logic at the application layer:
```python
import random
import time

# DeadlockError and execute_update() stand in for your driver's deadlock
# exception (e.g. pymysql.err.OperationalError, errno 1213) and your DB write.
def safe_update():
    retries = 3
    while retries > 0:
        try:
            execute_update()
            break
        except DeadlockError:
            retries -= 1
            if retries == 0:
                raise  # surface the deadlock after three failed attempts
            # random jitter keeps competing clients from retrying in lockstep
            time.sleep(random.uniform(0.1, 0.5))
```
Extract the key information from the SHOW ENGINE INNODB STATUS output. The pattern to analyze looks like this:
```
*** (1) TRANSACTION:
TRANSACTION 12345, ACTIVE 10 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 111, OS thread handle 0x7f1a2c0b1700, query id 222 10.0.0.1 user updating
UPDATE table1 SET col1=val1 WHERE id=100
*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 333 page no 3 n bits 72 index `PRIMARY` of table `db1`.`table1` trx id 12345 lock_mode X locks rec but not gap
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 333 page no 4 n bits 72 index `idx_col2` of table `db1`.`table1` trx id 12345 lock_mode X locks rec but not gap waiting
```
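Before wiring up dashboards, the raw counters behind the metrics recommended next can be sampled straight from the server:

```sql
-- Row lock pressure at a glance: current waits, totals, and timings
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock%';
```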
The following monitoring metrics are worth deploying:

- innodb_deadlocks: deadlock counter
- innodb_row_lock_time_avg: average row lock wait time
- innodb_row_lock_waits: number of row lock waits

Prometheus configuration example:
```yaml
groups:
  - name: mysql_deadlocks
    rules:
      - alert: HighDeadlockRate
        expr: rate(mysql_global_status_innodb_deadlocks[1m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MySQL deadlock detected (instance {{ $labels.instance }})"
```
When we handled a deadlock problem on an e-commerce platform, we found that the nightly batch job updating user points kept triggering deadlocks. Analysis showed that each transaction updated both the user table and the points detail table, but different services updated the two tables in different orders. After standardizing on a "user table first, then points table" access order across all services, the deadlock rate dropped by 90%. The sketch below shows what that convention looks like in SQL.
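A minimal illustration of the fix, using hypothetical table names (users, points_detail): when every transaction locks the two tables in the same order, two transactions can no longer each hold one lock while waiting for the other's:

```sql
-- Convention (hypothetical schema): always touch `users` before `points_detail`
START TRANSACTION;
UPDATE users SET points = points + 10 WHERE user_id = 42;
INSERT INTO points_detail (user_id, delta) VALUES (42, 10);
COMMIT;
```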