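In modern microservice architectures, service health monitoring is the cornerstone of system stability. As a Java developer who has spent years on the front line, I have lived through countless production incidents caused by missing monitoring. This article shares how to build a complete health-check and monitoring system on Spring Boot. The setup has run stably in our production environment for 3 years and handles more than 200 million health-check requests per day.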
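Spring Boot Actuator is the core component of this monitoring system, and its design philosophy is "convention over configuration". In practice, though, I find that many teams stop at basic usage and never tap its full potential. For example, an e-commerce platform once suffered a widespread outage when its Redis connection pool was exhausted, yet a single custom health check would have been enough to raise an early warning.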
In Spring Boot 2.4+, the Actuator endpoint exposure mechanism changed in important ways. The following configuration has been validated in production:
```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
      base-path: /internal-monitor  # avoid the default /actuator path
  endpoint:
    health:
      show-details: when_authorized
      probes:
        enabled: true  # enable Kubernetes readiness/liveness probe support
      group:
        readiness:
          include: db,redis,diskSpace
        liveness:
          include: ping
```
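Key configuration notes:
base-path override: keeps scanning tools from discovering the monitoring endpoints at the default location.
Warning: never use management.endpoints.web.exposure.include=* in production. I have personally been through a security incident in which exposed endpoints leaked configuration data.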
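To make the `readiness` group more concrete: a health group reports a single status derived from its members, and the most severe member wins. The following is a plain-Java sketch of that rule, assuming Spring Boot's documented default severity order (DOWN, OUT_OF_SERVICE, UP, UNKNOWN). The class and method names are illustrative and are not part of the Actuator API.

```java
import java.util.List;
import java.util.Map;

/** Illustrative only: mimics how a health group derives one status from its members. */
final class GroupStatusSketch {
    // Assumed severity order, most severe first.
    private static final List<String> ORDER = List.of("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    /** Returns the most severe status among the members; UNKNOWN for an empty group. */
    static String aggregate(Map<String, String> memberStatuses) {
        return memberStatuses.values().stream()
                .min((a, b) -> Integer.compare(rank(a), rank(b)))
                .orElse("UNKNOWN");
    }

    // In this sketch, statuses that are not in the list sort last.
    private static int rank(String status) {
        int i = ORDER.indexOf(status);
        return i < 0 ? ORDER.size() : i;
    }

    public static void main(String[] args) {
        // db and diskSpace are fine but redis is down, so the group as a whole is DOWN.
        System.out.println(aggregate(Map.of("db", "UP", "redis", "DOWN", "diskSpace", "UP")));
    }
}
```

This is why a readiness group should list only dependencies without which the instance truly cannot serve traffic: one failing member takes the whole group down.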
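Endpoint protection combined with Spring Security:
```java
import org.springframework.boot.actuate.autoconfigure.security.servlet.EndpointRequest;
import org.springframework.boot.autoconfigure.condition.ConditionalOnClass;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

import static org.springframework.security.config.Customizer.withDefaults;

@Configuration
@ConditionalOnClass(SecurityFilterChain.class)
public class ActuatorSecurityConfig {
    // Written against Spring Security 5.x (Spring Boot 2.x). In Spring Security 6,
    // securityMatcher and authorizeHttpRequests replace requestMatcher and authorizeRequests.
    @Bean
    SecurityFilterChain actuatorSecurityFilterChain(HttpSecurity http) throws Exception {
        http.requestMatcher(EndpointRequest.toAnyEndpoint())
            .authorizeRequests(requests ->
                requests.anyRequest().hasRole("MONITOR"))
            .httpBasic(withDefaults())
            // CSRF protection is not needed here: Prometheus scrapes with stateless HTTP Basic requests
            .csrf(csrf -> csrf.disable());
        return http.build();
    }
}
```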
The matching credentials must be declared in application.yml:
```yaml
spring:
  security:
    user:
      name: monitor_user
      password: ${MONITOR_PASSWORD:changeme}
      roles: MONITOR
```
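For completeness, here is a sketch of how a monitoring client could present these credentials, using only the JDK's `java.net.http` API (Java 11+). The class name and the base URL passed in are assumptions for illustration. The path combines the `/internal-monitor` base path from the configuration above with the `health` endpoint.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Illustrative client-side helper; not part of Spring Boot. */
final class MonitorClientSketch {
    /** Builds the HTTP Basic Authorization header value for the given credentials. */
    static String basicAuth(String user, String password) {
        String token = user + ":" + password;
        return "Basic " + Base64.getEncoder().encodeToString(token.getBytes(StandardCharsets.UTF_8));
    }

    /** Builds (but does not send) a GET request for the relocated health endpoint. */
    static HttpRequest healthRequest(String baseUrl, String user, String password) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/internal-monitor/health"))
                .header("Authorization", basicAuth(user, password))
                .GET()
                .build();
    }
}
```

In a real deployment the password would come from the same secret that feeds `MONITOR_PASSWORD`, not from a literal in code.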
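The basic database check only verifies that a connection can be established, whereas in production we need finer-grained monitoring:
```java
import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

@Component
public class DatabaseHealthIndicator implements HealthIndicator {
    private final DataSource dataSource;

    public DatabaseHealthIndicator(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public Health health() {
        try (Connection conn = dataSource.getConnection()) {
            Health.Builder builder = Health.up();
            // Report key connection-pool metrics when the pool is HikariCP
            if (dataSource instanceof HikariDataSource) {
                HikariPoolMXBean pool = ((HikariDataSource) dataSource).getHikariPoolMXBean();
                builder.withDetail("active", pool.getActiveConnections())
                       .withDetail("idle", pool.getIdleConnections())
                       .withDetail("wait", pool.getThreadsAwaitingConnection());
            }
            // Validate with a real SQL statement (DUAL exists on Oracle and MySQL; adjust for other databases)
            try (Statement stmt = conn.createStatement()) {
                stmt.executeQuery("SELECT 1 FROM DUAL");
                return builder.build();
            }
        } catch (SQLException e) {
            return Health.down(e)
                    .withDetail("error_code", e.getErrorCode())
                    .build();
        }
    }
}
```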
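This enhanced version reports connection-pool metrics (active, idle, and waiting threads), validates the connection with a real SQL query, and includes the SQL error code in the details when the check fails.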
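The pool figures are most useful when they drive a decision before the pool is fully exhausted. Below is a minimal sketch of such a rule in plain Java. The 80% warning threshold, the `Level` enum, and the method name are hypothetical choices for illustration. The `maxPoolSize` argument is assumed to come from the pool configuration, for example HikariCP's `getMaximumPoolSize()`.

```java
/** Illustrative rule for turning pool metrics into a pressure level. */
final class PoolPressureSketch {
    enum Level { OK, WARN, EXHAUSTED }

    // Hypothetical threshold: warn once 80% of the pool is in use.
    private static final double WARN_UTILIZATION = 0.8;

    static Level classify(int active, int waiting, int maxPoolSize) {
        if (maxPoolSize <= 0) {
            throw new IllegalArgumentException("maxPoolSize must be positive");
        }
        // Every connection is busy and threads are already queuing.
        if (waiting > 0 && active >= maxPoolSize) {
            return Level.EXHAUSTED;
        }
        // Either threads are queuing or utilization crossed the threshold.
        if (waiting > 0 || (double) active / maxPoolSize >= WARN_UTILIZATION) {
            return Level.WARN;
        }
        return Level.OK;
    }
}
```

A health indicator could map `WARN` to a custom status or simply add it as a detail, depending on whether a stressed pool should take the instance out of rotation.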
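Health checks for a Redis cluster have to handle a more complex scenario:
```java
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.data.redis.connection.RedisClusterConnection;
import org.springframework.data.redis.connection.RedisConnection;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.stereotype.Component;

import java.util.LinkedHashMap;
import java.util.Map;

@Component
public class RedisClusterHealthIndicator implements HealthIndicator {
    private final RedisConnectionFactory connectionFactory;

    public RedisClusterHealthIndicator(RedisConnectionFactory connectionFactory) {
        this.connectionFactory = connectionFactory;
    }

    @Override
    public Health health() {
        try (RedisConnection connection = connectionFactory.getConnection()) {
            Map<String, Object> details = new LinkedHashMap<>();
            // Collect cluster node information
            if (connection instanceof RedisClusterConnection) {
                RedisClusterConnection clusterConn = (RedisClusterConnection) connection;
                clusterConn.clusterGetNodes().forEach(node ->
                        details.put("node_" + node.getId(),
                                String.format("%s:%d %s",
                                        node.getHost(),
                                        node.getPort(),
                                        node.getLinkState())));
            }
            // Run a PING test
            String pingResult = connection.ping();
            if ("PONG".equals(pingResult)) {
                return Health.up().withDetails(details).build();
            }
            return Health.down().withDetails(details).build();
        } catch (Exception e) {
            return Health.down(e)
                    .withDetail("error", e.getMessage())
                    .build();
        }
    }
}
```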
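Large distributed systems need a customized metrics collection strategy:
```java
import io.micrometer.core.aop.TimedAspect;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.lang.management.ManagementFactory;

@Configuration
public class MetricsConfig {
    @Bean
    MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
        // Tag values must be non-null, so fall back when REGION is not set
        String region = System.getenv().getOrDefault("REGION", "unknown");
        return registry -> registry.config().commonTags(
                "application", "order-service",
                "region", region,
                "instance", ManagementFactory.getRuntimeMXBean().getName());
    }

    // Enables @Timed on Spring beans (requires Spring AOP on the classpath)
    @Bean
    TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry);
    }
}
```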
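Example of collecting key metrics:
```java
import io.micrometer.core.annotation.Timed;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class OrderService {
    private final Counter orderCounter;
    private final Timer serviceTime;

    public OrderService(MeterRegistry registry) {
        this.orderCounter = registry.counter("order.count");
        // Timer that also publishes the median, p95, and p99 latencies
        this.serviceTime = Timer.builder("order.process.time")
                .publishPercentiles(0.5, 0.95, 0.99)
                .register(registry);
    }

    @Timed(value = "order.process", longTask = true)
    public void processOrder(Order order) {
        Timer.Sample sample = Timer.start();
        try {
            // Business logic goes here
            orderCounter.increment();
        } finally {
            sample.stop(serviceTime);
        }
    }
}
```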
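To illustrate what percentile values such as 0.5, 0.95, and 0.99 mean for a set of recorded durations, here is a plain-Java sketch using the nearest-rank definition. This is a simplification for intuition only. Micrometer computes its published percentiles with its own approximation, and the class and method names below are illustrative.

```java
import java.util.Arrays;

/** Illustrative nearest-rank percentile over raw samples. */
final class PercentileSketch {
    /**
     * Returns the smallest sample such that at least a fraction p
     * of all samples are less than or equal to it.
     */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0 || p <= 0.0 || p > 1.0) {
            throw new IllegalArgumentException("need samples and 0 < p <= 1");
        }
        long[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length); // 1-based rank
        return sorted[rank - 1];
    }
}
```

With only a handful of samples the high percentiles collapse onto the maximum, which is one reason p99 figures are only meaningful on services with enough traffic.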
A production-grade prometheus.yml example:
```yaml
scrape_configs:
  - job_name: 'spring-boot'
    metrics_path: '/internal-monitor/prometheus'
    scrape_interval: 15s
    scrape_timeout: 5s
    honor_labels: true
    static_configs:
      - targets: ['service1:8080', 'service2:8080']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(http_server_requests_seconds_.*|tomcat_.*)'
        action: keep
```
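The effect of the `keep` rule can be checked offline. Prometheus anchors relabeling regexes to the whole value, which matches the behavior of `Matcher.matches()` in Java, so the sketch below applies the same expression to a list of metric names. The class name and the sample metric names are illustrative.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Illustrative check of which metric names survive the keep rule. */
final class KeepRuleSketch {
    // Same expression as in the scrape configuration.
    private static final Pattern KEEP =
            Pattern.compile("(http_server_requests_seconds_.*|tomcat_.*)");

    static List<String> kept(List<String> metricNames) {
        return metricNames.stream()
                .filter(name -> KEEP.matcher(name).matches()) // matches() requires a full match
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(kept(List.of(
                "http_server_requests_seconds_count",
                "tomcat_threads_busy_threads",
                "jvm_memory_used_bytes",
                "order_count_total")));
    }
}
```

Under this rule, names such as `jvm_memory_used_bytes` or `order_count_total` (the exported form of the `order.count` counter) are not kept, so the pattern may need extending if those metrics matter.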
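Recommended core dashboard panels:
```promql
# Service health status
# `up` carries only target labels (job, instance), not Micrometer's common tags
sum(up{job="spring-boot"}) by (instance)
```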
Example alert.rules:
```yaml
groups:
  - name: spring-boot-alerts
    rules:
      - alert: HighErrorRate
        # Micrometer exports request counts as http_server_requests_seconds_count, tagged by status
        expr: sum(rate(http_server_requests_seconds_count{status=~"5.."}[1m])) by (instance) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "5xx error rate is {{ $value }} requests per second"
```
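The `for: 2m` clause is what keeps a brief spike from paging anyone: the expression must hold on every evaluation for two minutes before the alert fires, and a single false evaluation resets the clock. The sketch below models that behavior in plain Java. The class, enum, and method names are illustrative and are not Prometheus code.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative model of an alert rule's `for` clause. */
final class ForClauseSketch {
    enum State { INACTIVE, PENDING, FIRING }

    private final Duration holdFor;
    private Instant activeSince; // null while the condition is false

    ForClauseSketch(Duration holdFor) {
        this.holdFor = holdFor;
    }

    /** Feeds one rule evaluation and returns the resulting alert state. */
    State evaluate(Instant now, boolean conditionTrue) {
        if (!conditionTrue) {
            activeSince = null; // any false evaluation resets the timer
            return State.INACTIVE;
        }
        if (activeSince == null) {
            activeSince = now; // first evaluation at which the condition held
        }
        boolean heldLongEnough = Duration.between(activeSince, now).compareTo(holdFor) >= 0;
        return heldLongEnough ? State.FIRING : State.PENDING;
    }
}
```

Choosing the duration is a trade-off: a longer `for` suppresses more noise but adds directly to detection time.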
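Caching the result of an expensive check keeps frequent probes from hammering the dependency:
```java
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;

// Not a @Component: wrap a concrete indicator and register the wrapper through a @Bean method
public class CachedHealthIndicator implements HealthIndicator {
    private final HealthIndicator delegate;
    private final long cacheDuration; // milliseconds
    private volatile Health cachedHealth;
    private volatile long lastChecked;

    public CachedHealthIndicator(HealthIndicator delegate, long cacheDuration) {
        this.delegate = delegate;
        this.cacheDuration = cacheDuration;
    }

    @Override
    public Health health() {
        if (isStale()) {
            synchronized (this) {
                // Double-checked so that only one thread refreshes the cache
                if (isStale()) {
                    cachedHealth = delegate.health();
                    lastChecked = System.currentTimeMillis();
                }
            }
        }
        return cachedHealth;
    }

    private boolean isStale() {
        return cachedHealth == null || System.currentTimeMillis() - lastChecked > cacheDuration;
    }
}
```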
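Time-based caching is awkward to unit test when it reads the system clock directly. One option, sketched here in plain Java with no Spring types, is to pass the clock in as a `LongSupplier`. `TtlCacheSketch` and its constructor parameters are hypothetical names for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Illustrative time-to-live cache with an injectable millisecond clock. */
final class TtlCacheSketch<T> {
    private final Supplier<T> loader;
    private final long ttlMillis;
    private final LongSupplier clock;
    private T value;
    private long loadedAt;
    private boolean loaded;

    TtlCacheSketch(Supplier<T> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value, reloading it when it is missing or older than the TTL. */
    synchronized T get() {
        long now = clock.getAsLong();
        if (!loaded || now - loadedAt > ttlMillis) {
            value = loader.get();
            loadedAt = now;
            loaded = true;
        }
        return value;
    }
}
```

In production the clock argument would simply be `System::currentTimeMillis`, while a test can advance a fake clock and count how often the loader runs.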
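Case: during a major sales promotion, health checks timed out frequently. Bounding the check with a timeout keeps a slow dependency from stalling the probe:
```java
// Members of a @Configuration class; uses java.util.concurrent plus Actuator's Health and HealthIndicator.
// Shared executor: creating a new one per check would leak threads.
private final ExecutorService healthCheckExecutor = Executors.newSingleThreadExecutor();

@Bean
public HealthIndicator creditServiceHealthIndicator() {
    return () -> {
        // doRealCheck() is the actual, potentially slow check defined elsewhere
        Future<Health> future = healthCheckExecutor.submit(() -> doRealCheck());
        try {
            return future.get(500, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow check so tasks do not pile up
            return Health.unknown()
                    .withDetail("reason", "check timeout")
                    .build();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Health.unknown().withDetail("reason", "interrupted").build();
        } catch (ExecutionException e) {
            return Health.down(e).build();
        }
    };
}
```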
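On Java 9 and later, the same timeout-with-fallback idea can also be expressed with `CompletableFuture.completeOnTimeout`. The following is a minimal plain-Java sketch and not the approach used above. `checkWithTimeout` and its parameters are illustrative names, and the check runs on the common fork-join pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Illustrative timeout wrapper built on CompletableFuture (Java 9+). */
final class TimeoutCheckSketch {
    /**
     * Runs the check asynchronously and returns its result, or the fallback
     * if the check fails or has not finished within timeoutMillis.
     */
    static <T> T checkWithTimeout(Supplier<T> check, T fallback, long timeoutMillis) {
        return CompletableFuture.supplyAsync(check)
                .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
                .exceptionally(ex -> fallback) // a failing check also yields the fallback
                .join();
    }
}
```

One caveat with this variant: the timed-out task is not interrupted and keeps running in the background, so a check that can hang for a long time is better served by an explicit executor and `Future.cancel(true)`.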
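This monitoring system runs stably across several of our core systems and has helped the team cut the mean time to detect failures from 15 minutes to under 30 seconds. Remember: a good monitoring system is not about piling up tools but about understanding the business scenario deeply and optimizing continuously.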