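In distributed systems, the TCC (Try-Confirm-Cancel) transaction model is widely adopted for its flexibility. In production, however, two problems come up again and again: hanging and empty rollback. Handled badly, they waste resources at best and cause data inconsistency at worst.
Hanging is essentially an "orphaned resource" problem. After the Try phase has reserved resources, a network partition or a system crash prevents Confirm or Cancel from executing, and the resources stay locked indefinitely. This is especially common in distributed systems that span multiple data centers.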
Typical scenario:
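A minimal in-memory sketch of the scenario, assuming a hypothetical `InventoryAccount` participant (the class and its methods are illustrative and do not come from any TCC framework): Try freezes stock, the coordinator fails before sending Confirm or Cancel, and the frozen quantity stays locked until a Cancel finally arrives.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical participant used only for illustration.
class InventoryAccount {
    private int available;
    private final Map<String, Integer> frozenByXid = new HashMap<>();

    InventoryAccount(int available) {
        this.available = available;
    }

    // Try: move stock from "available" to "frozen" under the transaction id.
    boolean tryReserve(String xid, int qty) {
        if (qty <= 0 || qty > available || frozenByXid.containsKey(xid)) {
            return false;
        }
        available -= qty;
        frozenByXid.put(xid, qty);
        return true;
    }

    // Confirm: the frozen stock is consumed for good.
    void confirm(String xid) {
        frozenByXid.remove(xid);
    }

    // Cancel: the frozen stock goes back to "available".
    void cancel(String xid) {
        Integer qty = frozenByXid.remove(xid);
        if (qty != null) {
            available += qty;
        }
    }

    int available() {
        return available;
    }

    int frozen() {
        return frozenByXid.values().stream().mapToInt(Integer::intValue).sum();
    }
}
```

With 10 units available, `tryReserve("tx-1", 4)` leaves 6 available and 4 frozen. If neither `confirm("tx-1")` nor `cancel("tx-1")` ever arrives, those 4 units stay frozen, which is the hanging state described above.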
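Empty rollback is a "phantom call" problem. The transaction manager decides the Try request has timed out and triggers Cancel before the Try request has reached the service, so the Cancel operation has no business data to roll back.
Typical sequence: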
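The sequence can be reproduced with a small in-memory sketch. `CancelFirstParticipant` and its `Phase` enum are hypothetical names; the only point is that a Cancel arriving before its Try is recorded rather than rejected, and that the late Try is then refused.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical participant that tracks, per xid, which phase it has seen.
class CancelFirstParticipant {
    enum Phase { TRIED, CANCELLED, CANCELLED_EMPTY }

    private final Map<String, Phase> phases = new ConcurrentHashMap<>();

    // Try is accepted only if nothing has been recorded for this xid yet.
    boolean tryReserve(String xid) {
        return phases.putIfAbsent(xid, Phase.TRIED) == null;
    }

    // Cancel without a prior Try is recorded as an empty rollback.
    // Cancel after a Try moves TRIED to CANCELLED; repeated Cancels change nothing.
    Phase cancel(String xid) {
        return phases.merge(xid, Phase.CANCELLED_EMPTY,
                (old, ignored) -> old == Phase.TRIED ? Phase.CANCELLED : old);
    }
}
```

Calling `cancel("tx-1")` first yields `CANCELLED_EMPTY`, and a later `tryReserve("tx-1")` returns `false`, so the delayed Try cannot reserve resources that nobody will release.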
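The core idea is to record the state of each phase in a transaction log and to detect hanging transactions with a scheduled task. An enhanced implementation:

```java
import java.time.LocalDateTime;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.annotation.PostConstruct; // jakarta.annotation.PostConstruct on newer stacks
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class EnhancedTccHangingResolver {
    private static final Logger logger =
            LoggerFactory.getLogger(EnhancedTccHangingResolver.class);

    private final TransactionLogRepository logRepo;
    // Referenced but not declared in the original; the type name is an assumption.
    private final CompensationService compensationService;
    private final ScheduledExecutorService scheduler;

    public EnhancedTccHangingResolver(TransactionLogRepository logRepo,
                                      CompensationService compensationService,
                                      ScheduledExecutorService scheduler) {
        this.logRepo = logRepo;
        this.compensationService = compensationService;
        this.scheduler = scheduler;
    }

    // Extended transaction states
    enum TccStatus {
        TRYING(1),
        CONFIRMED(2),
        CANCELLED(3),
        HANGING(4),      // added state for hanging transactions
        COMPENSATED(5);

        private final int code;

        TccStatus(int code) {
            this.code = code;
        }
        // ...
    }

    @PostConstruct
    public void init() {
        // scan for hanging transactions every 5 minutes
        scheduler.scheduleAtFixedRate(this::scanHangingTransactions,
                5, 5, TimeUnit.MINUTES);
    }

    private void scanHangingTransactions() {
        try {
            // anything still TRYING 10 minutes after creation is treated as hanging
            LocalDateTime threshold = LocalDateTime.now().minusMinutes(10);
            List<TransactionLog> hangingLogs = logRepo.findByStatusAndCreateTimeBefore(
                    TccStatus.TRYING, threshold);
            hangingLogs.forEach(txLog -> {
                txLog.setStatus(TccStatus.HANGING);
                logRepo.save(txLog);
                // compensate asynchronously
                compensateAsync(txLog);
            });
        } catch (Exception e) {
            // an uncaught exception would suppress all further scheduled runs
            logger.error("Hanging-transaction scan failed", e);
        }
    }

    private void compensateAsync(TransactionLog txLog) {
        CompletableFuture.runAsync(() -> {
            try {
                CompensationResult result = compensationService.compensate(txLog);
                if (result.isSuccess()) {
                    txLog.setStatus(TccStatus.COMPENSATED);
                } else {
                    // stays HANGING; the scan above only selects TRYING rows,
                    // so a separate retry pass has to pick this one up again
                    txLog.setRetryCount(txLog.getRetryCount() + 1);
                }
                logRepo.save(txLog);
            } catch (Exception e) {
                logger.error("Compensation execution failed", e);
            }
        });
    }
}
```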
Key design points:
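One point the scanner leaves implicit is what happens when two instances scan at the same time: both could find the same `TRYING` record and compensate it twice. A minimal sketch of one way to guard against that, assuming a compare-and-set on the status. It uses an in-memory `ConcurrentHashMap`; with a relational store the equivalent would be a conditional `UPDATE ... WHERE status = 'TRYING'` followed by a check of the affected row count. `HangingClaimSketch` is a hypothetical name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the transaction log's status column.
class HangingClaimSketch {
    enum Status { TRYING, HANGING, COMPENSATED }

    private final Map<String, Status> statusByXid = new ConcurrentHashMap<>();

    void begin(String xid) {
        statusByXid.put(xid, Status.TRYING);
    }

    // Atomically swaps TRYING for HANGING. Only the caller that gets "true"
    // owns the compensation; every other scanner sees "false" and skips the xid.
    boolean claimForCompensation(String xid) {
        return statusByXid.replace(xid, Status.TRYING, Status.HANGING);
    }

    Status status(String xid) {
        return statusByXid.get(xid);
    }
}
```

The first `claimForCompensation("tx-1")` returns `true`, and any later or concurrent call for the same xid returns `false`.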
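Relying on database records alone can become a bottleneck under high concurrency. An approach that adds a Redis distributed lock:

```java
import java.util.concurrent.TimeUnit;
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;

public class RedisTccLockManager {
    private final RedissonClient redisson;
    private final TransactionLogDAO logDAO;

    public RedisTccLockManager(RedissonClient redisson, TransactionLogDAO logDAO) {
        this.redisson = redisson;
        this.logDAO = logDAO;
    }

    public void executeInLock(String xid, Runnable task) {
        RLock lock = redisson.getLock(buildLockKey(xid));
        try {
            // wait up to 3 s for the lock; the lease expires on its own after 30 s
            boolean acquired = lock.tryLock(3, 30, TimeUnit.SECONDS);
            if (!acquired) {
                throw new TccLockException("Timed out acquiring lock");
            }
            // double check, now that the lock is held
            if (logDAO.existsUnfinished(xid)) {
                throw new TccHangingRiskException("Unfinished transaction exists");
            }
            task.run();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new TccLockException("Interrupted while acquiring lock", e);
        } finally {
            // the lease may already have expired if the task ran longer than 30 s
            if (lock.isHeldByCurrentThread()) {
                lock.unlock();
            }
        }
    }

    private String buildLockKey(String xid) {
        return "tcc:lock:" + xid;
    }
}
```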
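Advantages of this approach:
The core of empty-rollback defense is a reliable state machine:

```java
import java.util.ConcurrentModificationException;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;

@Slf4j
@RequiredArgsConstructor
public class EmptyRollbackHandler {
    private final TransactionStateStore stateStore;
    // Referenced but not declared in the original; typed as TccMonitor here.
    private final TccMonitor monitor;

    public void processCancel(String xid, CancelAction action) {
        // Load the current state together with its optimistic-lock version.
        // Assumes the store returns an INIT record for an xid it has never seen;
        // a null result would need the same empty-rollback handling.
        TransactionState state = stateStore.getWithVersion(xid);
        switch (state.getStatus()) {
            case INIT:
                handleEmptyRollback(state);
                break;
            case TRY_SUCCESS:
                executeBusinessCancel(state, action);
                break;
            case CANCELLED:
            case CANCELLED_EMPTY:
                log.info("Idempotent return: xid={}", xid);
                return;
            default:
                throw new IllegalStateException("Illegal state: " + state.getStatus());
        }
    }

    private void handleEmptyRollback(TransactionState state) {
        state.setStatus(Status.CANCELLED_EMPTY);
        if (!stateStore.updateWithVersion(state)) {
            throw new ConcurrentModificationException("Concurrent state modification");
        }
        log.warn("Recorded empty rollback: xid={}", state.getXid());
        monitor.recordEmptyRollback(state.getXid());
    }

    private void executeBusinessCancel(TransactionState state, CancelAction action) {
        try {
            // The state is written after the action runs, so the action itself
            // must be idempotent in case two Cancels race.
            action.execute();
            state.setStatus(Status.CANCELLED);
            stateStore.update(state);
        } catch (Exception e) {
            state.setStatus(Status.CANCEL_FAILED);
            stateStore.update(state);
            throw e;
        }
    }
}
```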
State transition design points:
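The transitions the handler relies on can be written down as an explicit table. The sketch below is an assumption built from the status names used in `EmptyRollbackHandler`; `TxStatus` and `TxTransitions` are hypothetical names, Confirm-side states are left out, and allowing a retry from `CANCEL_FAILED` is a choice made for this sketch, not something the handler above implements.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Status names follow the handler above; the enum itself is hypothetical.
enum TxStatus { INIT, TRY_SUCCESS, CANCELLED, CANCELLED_EMPTY, CANCEL_FAILED }

final class TxTransitions {
    private static final Map<TxStatus, Set<TxStatus>> ALLOWED = new EnumMap<>(TxStatus.class);

    static {
        // From INIT, either the Try succeeds or a Cancel arrives first (empty rollback).
        ALLOWED.put(TxStatus.INIT, EnumSet.of(TxStatus.TRY_SUCCESS, TxStatus.CANCELLED_EMPTY));
        // After a successful Try, a Cancel either completes or fails.
        ALLOWED.put(TxStatus.TRY_SUCCESS, EnumSet.of(TxStatus.CANCELLED, TxStatus.CANCEL_FAILED));
        // Assumption: a failed Cancel may be retried.
        ALLOWED.put(TxStatus.CANCEL_FAILED, EnumSet.of(TxStatus.CANCELLED, TxStatus.CANCEL_FAILED));
        // Terminal states: nothing may follow, in particular no late Try.
        ALLOWED.put(TxStatus.CANCELLED, EnumSet.noneOf(TxStatus.class));
        ALLOWED.put(TxStatus.CANCELLED_EMPTY, EnumSet.noneOf(TxStatus.class));
    }

    private TxTransitions() {
    }

    static boolean canMove(TxStatus from, TxStatus to) {
        return ALLOWED.get(from).contains(to);
    }
}
```

Because `CANCELLED_EMPTY` is terminal, `canMove(TxStatus.CANCELLED_EMPTY, TxStatus.TRY_SUCCESS)` is `false`: once an empty rollback is recorded, a delayed Try is rejected instead of reserving resources.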
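For financial-grade scenarios, a WAL (Write-Ahead Log) pattern can be used:

```java
import org.springframework.transaction.annotation.Transactional;

public class WalTccService {
    private final TransactionWalDAO walDAO;
    // Referenced but not declared in the original; the type name is an assumption.
    private final CompensateService compensateService;

    public WalTccService(TransactionWalDAO walDAO, CompensateService compensateService) {
        this.walDAO = walDAO;
        this.compensateService = compensateService;
    }

    // Caveat: if the WAL writes join this transaction, the rollback triggered by
    // the rethrown exception also discards the WAL row. Running the WAL DAO in
    // its own transaction (for example REQUIRES_NEW) keeps the record.
    @Transactional
    public void tryPhase(String xid, TryAction action) {
        // write-ahead log entry, including a business snapshot
        TransactionWal wal = new TransactionWal();
        wal.setXid(xid);
        wal.setAction("TRY");
        wal.setBusinessSnapshot(action.getSnapshot());
        wal.setStatus(WalStatus.PREPARED);
        walDAO.insert(wal);
        try {
            action.execute();
            // update the log status
            wal.setStatus(WalStatus.COMMITTED);
            walDAO.update(wal);
        } catch (Exception e) {
            wal.setStatus(WalStatus.FAILED);
            walDAO.update(wal);
            throw e;
        }
    }

    public void cancelPhase(String xid) {
        TransactionWal wal = walDAO.selectByXid(xid);
        if (wal == null) {
            // empty rollback: no Try was ever logged
            // (recordEmptyRollback is not shown in the original)
            recordEmptyRollback(xid);
            return;
        }
        if (wal.getStatus() == WalStatus.ROLLBACKED) {
            return; // repeated Cancel, nothing left to do
        }
        // roll back from the snapshot
        BusinessSnapshot snapshot = wal.getBusinessSnapshot();
        compensateService.compensate(snapshot);
        wal.setStatus(WalStatus.ROLLBACKED);
        walDAO.update(wal);
    }
}
```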
Advantages of this approach:
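A minimal sketch of how the log could drive recovery after a crash, under the assumption that WAL rows are durable independently of the business transaction: rows still in `PREPARED` are the ones whose Try outcome is unknown and which need a decision. `WalEntry` and `WalRecoveryScan` are hypothetical, simplified stand-ins for `TransactionWal` and a recovery component; the status names follow `WalStatus` as used in `WalTccService`.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical view of one WAL row.
class WalEntry {
    enum Status { PREPARED, COMMITTED, FAILED, ROLLBACKED }

    final String xid;
    final Status status;

    WalEntry(String xid, Status status) {
        this.xid = xid;
        this.status = status;
    }
}

class WalRecoveryScan {
    // Returns the xids whose Try was logged but never marked COMMITTED or FAILED.
    static List<String> xidsNeedingRecovery(List<WalEntry> entries) {
        List<String> result = new ArrayList<>();
        for (WalEntry entry : entries) {
            if (entry.status == WalEntry.Status.PREPARED) {
                result.add(entry.xid);
            }
        }
        return result;
    }
}
```

What to do with each returned xid (re-check the business data, then confirm or compensate from the snapshot) depends on the participant and is left out here.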
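A complete architecture should include the following modules:

```
└── tcc-core
    ├── coordinator   # transaction coordinator
    ├── storage       # state storage
    │   ├── jdbc
    │   ├── redis
    │   └── zookeeper
    ├── lock          # distributed locks
    ├── recovery      # recovery mechanism
    ├── monitor       # monitoring
    └── spi           # extension interfaces
```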
Key interface definitions:

```java
import java.time.Duration;
import java.util.List;

public interface TccTransactionStore {
    // transaction records
    boolean createTransaction(TccTransaction transaction);
    boolean updateTransactionStatus(String xid, String status);
    TccTransaction getTransaction(String xid);

    // participant records
    boolean addParticipant(TccParticipant participant);
    List<TccParticipant> getParticipants(String xid);
}

public interface TccRecoveryStrategy {
    void recoverHangingTransactions(Duration timeout);
    void retryFailedOperations(int maxRetries);
}

public interface TccMonitor {
    void recordHanging(String xid);
    void recordEmptyRollback(String xid);
    void recordCompensationFailure(String xid);
}
```
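As an illustration of how small an implementation of one of these interfaces can be, here is a counting `TccMonitor`. It is a sketch under stated assumptions: `CountingTccMonitor` is a hypothetical name, the interface is repeated only so the block compiles on its own, and a real implementation would more likely export the counts to a metrics system.

```java
import java.util.concurrent.atomic.LongAdder;

// Repeated from the interface definitions above so that the sketch is self-contained.
interface TccMonitor {
    void recordHanging(String xid);
    void recordEmptyRollback(String xid);
    void recordCompensationFailure(String xid);
}

// Hypothetical in-memory implementation that only counts events.
class CountingTccMonitor implements TccMonitor {
    private final LongAdder hanging = new LongAdder();
    private final LongAdder emptyRollbacks = new LongAdder();
    private final LongAdder compensationFailures = new LongAdder();

    @Override
    public void recordHanging(String xid) {
        hanging.increment();
    }

    @Override
    public void recordEmptyRollback(String xid) {
        emptyRollbacks.increment();
    }

    @Override
    public void recordCompensationFailure(String xid) {
        compensationFailures.increment();
    }

    long hangingCount() {
        return hanging.sum();
    }

    long emptyRollbackCount() {
        return emptyRollbacks.sum();
    }

    long compensationFailureCount() {
        return compensationFailures.sum();
    }
}
```

`LongAdder` is used instead of a plain `long` so that concurrent recovery threads can record events without locking.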
Enhanced transaction table schema:

```sql
CREATE TABLE tcc_global_transaction (
    xid              VARCHAR(128) PRIMARY KEY,
    status           VARCHAR(32)  NOT NULL,
    application_id   VARCHAR(64)  NOT NULL,
    transaction_type VARCHAR(32)  NOT NULL,
    retried_count    INT DEFAULT 0,
    create_time      TIMESTAMP(3) NOT NULL,
    update_time      TIMESTAMP(3) NOT NULL,
    timeout          TIMESTAMP(3) NOT NULL,
    context          TEXT,
    INDEX idx_status_timeout (status, timeout),
    INDEX idx_app_create (application_id, create_time)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED;

CREATE TABLE tcc_branch_transaction (
    branch_id      VARCHAR(128) PRIMARY KEY,
    xid            VARCHAR(128) NOT NULL,
    resource_id    VARCHAR(256) NOT NULL,
    status         VARCHAR(32)  NOT NULL,
    operation_type VARCHAR(32)  NOT NULL,
    confirm_data   TEXT,
    cancel_data    TEXT,
    retried_count  INT DEFAULT 0,
    create_time    TIMESTAMP(3) NOT NULL,
    update_time    TIMESTAMP(3) NOT NULL,
    UNIQUE KEY uk_xid_resource (xid, resource_id),
    INDEX idx_xid_status (xid, status)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED;
```
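Design considerations:
A compensation job based on Quartz:

```java
import java.time.Duration;
import java.util.List;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// getStore(...) and executeCancel(...) are not shown in the original.
// findHangingTransactions, recordCompensationFailure, markParticipantCancelled and
// markTransactionCompensated go beyond the TccTransactionStore interface listed
// earlier and are assumed to be additions to it.
public class TccCompensationJob implements Job {
    private static final Logger logger = LoggerFactory.getLogger(TccCompensationJob.class);

    @Override
    public void execute(JobExecutionContext context) {
        TccTransactionStore store = getStore(context);
        List<TccTransaction> hangingTransactions =
                store.findHangingTransactions(getTimeout(context));
        hangingTransactions.forEach(tx -> {
            try {
                // the store is passed in; it is a local variable, not a field
                compensateTransaction(store, tx);
            } catch (Exception e) {
                logger.error("Failed to compensate transaction: {}", tx.getXid(), e);
                store.recordCompensationFailure(tx.getXid());
            }
        });
    }

    private void compensateTransaction(TccTransactionStore store, TccTransaction tx) {
        List<TccParticipant> participants = store.getParticipants(tx.getXid());
        for (TccParticipant p : participants) {
            // only participants whose Try succeeded hold resources to release
            if (p.getStatus() == ParticipantStatus.TRY_SUCCESS) {
                executeCancel(p);
                store.markParticipantCancelled(p.getBranchId());
            }
        }
        store.markTransactionCompensated(tx.getXid());
    }

    private Duration getTimeout(JobExecutionContext context) {
        return Duration.ofMinutes(
                context.getMergedJobDataMap().getInt("timeoutMinutes"));
    }
}
```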
Scheduling recommendations:
Example configuration parameters (Spring Boot):

```yaml
tcc:
  recovery:
    enabled: true
    initial-delay: 30s
    interval: 5m
    max-retries: 5
    backoff-multiplier: 2.0
  transaction:
    default-timeout: 30s
    max-timeout: 10m
    serialization: json
  storage:
    type: redis
    redis:
      key-prefix: "tcc:"
      expire-time: 7d
    jdbc:
      table-prefix: "tcc_"
  monitor:
    enabled: true
    metrics:
      enabled: true
    logging:
      level: warn
```
Notes on key parameters:
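The source does not spell out how `interval`, `backoff-multiplier` and `max-retries` combine. One plausible reading, shown here purely as an assumption, is exponential backoff: the first retry waits `interval`, and each further retry waits `backoff-multiplier` times longer, up to `max-retries` attempts. `RetryBackoff` is a hypothetical helper.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes the wait before each retry attempt.
class RetryBackoff {
    static List<Duration> delays(Duration interval, double multiplier, int maxRetries) {
        List<Duration> result = new ArrayList<>();
        double millis = interval.toMillis();
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            result.add(Duration.ofMillis(Math.round(millis)));
            millis *= multiplier; // each retry waits "multiplier" times longer
        }
        return result;
    }
}
```

With the values from the configuration (`interval: 5m`, `backoff-multiplier: 2.0`, `max-retries: 5`), `RetryBackoff.delays(Duration.ofMinutes(5), 2.0, 5)` gives waits of 5, 10, 20, 40 and 80 minutes.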
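Example core monitoring metrics:
| Category | Metrics | Alert threshold |
|---|---|---|
| Transaction volume | Created / succeeded / failed transactions | Failure rate > 1% |
| Hanging transactions | Hanging count / ratio | Count > 10 or ratio > 0.5% |
| Empty rollbacks | Empty rollback count / ratio | Ratio > 1% |
| Compensation effectiveness | Compensation success rate / average duration | Success rate < 95% |
| System performance | Average transaction latency / percentiles | P99 > 5 s |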
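The thresholds in the table can be read as simple predicates over raw counts. A minimal sketch for two of the rows, assuming the counts are collected over the same time window; `TccAlertRules` and its method names are hypothetical.

```java
// Hypothetical helper that encodes two rows of the threshold table.
class TccAlertRules {
    // Failure rate above 1% of all transactions in the window.
    static boolean failureRateTooHigh(long failed, long total) {
        return total > 0 && (double) failed / total > 0.01;
    }

    // More than 10 hanging transactions, or more than 0.5% of all transactions.
    static boolean hangingTooHigh(long hanging, long total) {
        return hanging > 10 || (total > 0 && (double) hanging / total > 0.005);
    }
}
```

For example, 6 hanging transactions out of 1,000 trip the ratio condition (0.6%) even though the absolute count is below 10, while 11 hanging transactions trip the count condition regardless of volume.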
Example Prometheus configuration:

```yaml
groups:
  - name: tcc_monitor
    rules:
      - record: tcc_transaction_failure_rate
        expr: rate(tcc_transaction_failed_total[5m]) / rate(tcc_transaction_total[5m])
      - alert: TccHighFailureRate
        expr: tcc_transaction_failure_rate > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "TCC transaction failure rate is too high"
          description: "Current failure rate: {{ $value }}"
```
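Practical performance tuning advice:
Log storage optimization:
Lock contention optimization:
Network tuning:
Resource isolation:
In one real project I ran into a typical case: a payment system saw a large number of hanging transactions during a promotion. Analysis showed that an undersized Redis connection pool was causing lock acquisition to time out. The fix was:
After the adjustment, under the same traffic, hanging transactions dropped from 300+ per day to single digits. The case shows that solving the hanging problem takes more than the right technical design; it also takes tuning targeted at the specific scenario.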