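In the wave of digital transformation, an operations monitoring system acts as the enterprise's "nervous system", sensing the state of business services in real time. This article walks step by step through building a complete, production-grade monitoring system from scratch, covering three core modules: data collection (Zabbix Agent), visualization (Grafana), and alerting (WeChat robot). Unlike the scattered official documentation, it follows a problem-driven, hands-on path. The configurations have been verified in a real environment and are tuned for Ubuntu 20.04 to avoid common pitfalls.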
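Before starting, a fresh Ubuntu 20.04 LTS system is recommended, with at least 4 CPU cores, 8 GB of RAM, and 100 GB of storage. First update the system and install the necessary tools:

```bash
sudo apt update && sudo apt upgrade -y
sudo apt install -y vim curl wget net-tools ufw
```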
Configure the firewall rules. By default the Zabbix agent listens on port 10050 and the Zabbix server on port 10051:

```bash
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 10050/tcp
sudo ufw allow 10051/tcp
sudo ufw enable
```
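Once the rules are active and the services are running, it can help to confirm from another machine that the expected ports actually accept connections. The following is a minimal sketch using only the Python standard library; the host address and the helper names `port_open` and `check_ports` are illustrative assumptions, not part of Zabbix or ufw.

```python
import socket
from typing import Dict, List


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host: str, ports: List[int]) -> Dict[int, bool]:
    """Probe each port once and map it to its reachability."""
    return {port: port_open(host, port) for port in ports}


if __name__ == "__main__":
    # 127.0.0.1 is a placeholder; use the monitoring server's address.
    for port, ok in check_ports("127.0.0.1", [22, 80, 10050, 10051]).items():
        print(f"{port}: {'open' if ok else 'closed or filtered'}")
```

A port reported as closed may simply mean the service behind it is not running yet, so the check is most useful after the later installation steps.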
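MySQL 8.0 is the recommended backend database for Zabbix 6.0. The following is a tuned installation and configuration flow:

```bash
sudo apt install -y mysql-server-8.0
```

Harden MySQL and create a dedicated database:

```sql
ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'YourSecurePassword123!';
CREATE DATABASE zabbix CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
CREATE USER 'zabbix'@'localhost' IDENTIFIED BY 'ZabbixDBPassword123!';
GRANT ALL PRIVILEGES ON zabbix.* TO 'zabbix'@'localhost';
-- Needed for the schema import while binary logging is on (the MySQL 8.0 default); can be set back to 0 afterwards
SET GLOBAL log_bin_trust_function_creators = 1;
FLUSH PRIVILEGES;
```

Note: in production, be sure to replace the sample passwords, and consider deploying the database on a dedicated server.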
Add the official repository and install the core components:

```bash
wget https://repo.zabbix.com/zabbix/6.0/ubuntu/pool/main/z/zabbix-release/zabbix-release_6.0-4+ubuntu20.04_all.deb
sudo dpkg -i zabbix-release_6.0-4+ubuntu20.04_all.deb
sudo apt update
sudo apt install -y zabbix-server-mysql zabbix-frontend-php zabbix-nginx-conf zabbix-sql-scripts zabbix-agent
```
Import the initial database schema (this may take 5 to 15 minutes depending on machine performance):

```bash
zcat /usr/share/zabbix-sql-scripts/mysql/server.sql.gz | mysql --default-character-set=utf8mb4 -uzabbix -p zabbix
```
Key tuning in the server configuration file (/etc/zabbix/zabbix_server.conf):

```properties
DBHost=localhost
DBName=zabbix
DBUser=zabbix
DBPassword=ZabbixDBPassword123!
StartPollers=20
StartPollersUnreachable=10
StartTrappers=15
StartPingers=10
CacheSize=512M
HistoryCacheSize=256M
TrendCacheSize=128M
```
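The three cache settings are allocated as shared memory, so it is worth checking that their sum fits comfortably within the host's RAM alongside MySQL. The sketch below parses `Key=Value` lines and totals the cache sizes. The helper names are illustrative, and it assumes sizes are written as a plain integer with an optional `K`, `M`, or `G` suffix, as in the values above.

```python
import re
from typing import Dict, Iterable

_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_conf(text: str) -> Dict[str, str]:
    """Parse Key=Value lines, skipping blank lines and # comments."""
    params: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def size_to_bytes(value: str) -> int:
    """Convert a size such as '512M' into a byte count."""
    match = re.fullmatch(r"(\d+)([KMG]?)", value.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def total_cache_bytes(
    params: Dict[str, str],
    keys: Iterable[str] = ("CacheSize", "HistoryCacheSize", "TrendCacheSize"),
) -> int:
    """Sum the configured cache sizes that are present in params."""
    return sum(size_to_bytes(params[k]) for k in keys if k in params)


if __name__ == "__main__":
    with open("/etc/zabbix/zabbix_server.conf", encoding="utf-8") as fh:
        total = total_cache_bytes(parse_conf(fh.read()))
    print(f"Configured caches: {total / 1024 ** 2:.0f} MiB")
```

With the values shown above, the three caches add up to 896 MiB, which leaves headroom on the recommended 8 GB host.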
Start the services and enable them at boot:

```bash
sudo systemctl restart zabbix-server zabbix-agent nginx php7.4-fpm
sudo systemctl enable zabbix-server zabbix-agent nginx php7.4-fpm
```
Install Grafana 10.x, pinned here to version 10.0.3:

```bash
sudo apt-get install -y apt-transport-https software-properties-common
# Import the signing key before adding the repository so the first update can verify it
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
sudo apt-get update
sudo apt-get install -y grafana=10.0.3
```
Configure an Nginx reverse proxy (/etc/nginx/sites-available/grafana):

```nginx
server {
    # Nginx must not listen on 3000, because Grafana itself already binds that port
    listen 80;
    server_name yourdomain.com;
    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
Install the Zabbix plugin for Grafana and configure the data source:

```bash
sudo grafana-cli plugins install alexanderzobnin-zabbix-app
sudo systemctl restart grafana-server
```
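After logging in to Grafana, enable the Zabbix app plugin, then add a Zabbix data source with the following parameters. If the Zabbix frontend is served from the web root rather than under /zabbix, drop that prefix from the URL.

| Setting | Value |
|---|---|
| Name | Zabbix-Production |
| URL | http://localhost/zabbix/api_jsonrpc.php |
| Access | Proxy |
| Auth | Basic auth enabled |
| Username | Admin |
| Password | zabbix |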
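Before saving the data source, the API URL can be checked independently of Grafana. The sketch below calls `apiinfo.version`, a Zabbix API method that requires no authentication, using only the standard library. The helper names `build_rpc` and `api_version` are illustrative, and the URL is the one from the table above, which is itself an assumption about where the frontend is served.

```python
import json
import urllib.request
from typing import Any, Dict


def build_rpc(method: str, params: Any, req_id: int = 1) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request body for the Zabbix API."""
    return {"jsonrpc": "2.0", "method": method, "params": params, "id": req_id}


def api_version(url: str, timeout: float = 5.0) -> str:
    """Return the Zabbix API version reported at url."""
    body = json.dumps(build_rpc("apiinfo.version", [])).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json-rpc"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["result"]


if __name__ == "__main__":
    print(api_version("http://localhost/zabbix/api_jsonrpc.php"))
```

An HTTP 404 from this call usually points at the path portion of the URL rather than at credentials, since no login is involved.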
When building a host overview dashboard, these core panels are recommended. An example panel JSON fragment:

```json
{
  "title": "CPU Utilization",
  "type": "timeseries",
  "datasource": "Zabbix-Production",
  "targets": [{
    "group": "{HOSTGROUP}",
    "host": "{HOSTNAME}",
    "item": "CPU utilization"
  }],
  "options": {
    "showThresholdLabels": true,
    "thresholdsStyle": "area"
  }
}
```
Use Ansible for batch deployment across hosts (inventory.ini):

```ini
[web_servers]
web1 ansible_host=192.168.1.101
web2 ansible_host=192.168.1.102

[db_servers]
db1 ansible_host=192.168.1.201
```
Deployment playbook (zabbix-agent.yml):

```yaml
---
- hosts: all
  become: yes
  tasks:
    # apt_repository has no key option, so the signing key gets its own task
    - name: Add Zabbix repository signing key
      apt_key:
        url: "https://repo.zabbix.com/zabbix-official-repo.key"
        state: present

    - name: Add Zabbix repository
      apt_repository:
        repo: "deb https://repo.zabbix.com/zabbix/6.0/ubuntu focal main"
        state: present
        # The module appends the .list extension automatically
        filename: zabbix

    - name: Install Zabbix Agent
      apt:
        name: zabbix-agent
        state: latest
        update_cache: yes

    - name: Configure zabbix_agentd.conf
      template:
        src: templates/zabbix_agentd.conf.j2
        dest: /etc/zabbix/zabbix_agentd.conf
      notify: Restart Zabbix Agent

  handlers:
    - name: Restart Zabbix Agent
      service:
        name: zabbix-agent
        state: restarted
```
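The playbook references `templates/zabbix_agentd.conf.j2` without showing its contents. As a rough illustration of what such a template typically produces, the sketch below renders a minimal agent configuration per host. It uses Python's `string.Template` instead of Jinja2, and the chosen parameters (`Server`, `ServerActive`, `Hostname`) and the server address are assumptions, not the original template.

```python
from string import Template

# Hypothetical stand-in for the Jinja2 template used by the playbook.
AGENT_CONF = Template(
    "Server=$server\n"
    "ServerActive=$server\n"
    "Hostname=$hostname\n"
)


def render_agent_conf(server: str, hostname: str) -> str:
    """Render a minimal zabbix_agentd.conf for one host."""
    return AGENT_CONF.substitute(server=server, hostname=hostname)


if __name__ == "__main__":
    # 192.168.1.10 is a placeholder for the Zabbix server address.
    for host in ("web1", "web2", "db1"):
        print(f"# {host}")
        print(render_agent_conf("192.168.1.10", host))
```

The `Hostname` value has to match the host name registered in the Zabbix frontend for active checks to be accepted, which is why it is rendered per host rather than hard-coded.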
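For special nodes that need Docker and GPU monitoring, use Agent2 for its plugin capabilities:

```bash
# nvidia-container-toolkit comes from NVIDIA's own apt repository, which must be configured first
sudo apt install -y zabbix-agent2 nvidia-container-toolkit
```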
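Key GPU monitoring settings (/etc/zabbix/zabbix_agent2.conf.d/gpu.conf). The parameter names are reproduced as given here; confirm them against the plugin documentation for your Agent2 version before relying on them:

```properties
Plugins.Nvidia.GPU.Discovery.Interval=1m
Plugins.Nvidia.GPU.Utilization.Interval=30s
Plugins.Nvidia.GPU.Temperature.Interval=1m
```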
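Docker container auto-discovery settings. `Plugins.Docker.Endpoint` is the documented socket setting; verify the remaining parameter names against your Agent2 version:

```properties
Plugins.Docker.Endpoint=unix:///var/run/docker.sock
Plugins.Docker.Discovery.Interval=5m
Plugins.Docker.Containers.Discovery=1
Plugins.Docker.Containers.Metrics=1
```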
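With the robot webhook URL at hand (of the form https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxx), create the alert script (/usr/lib/zabbix/alertscripts/wechat_alert.py):

```python
#!/usr/bin/env python3
"""Send a Zabbix alert to a WeChat robot webhook."""
import sys

import requests

# Replace with the robot webhook URL
webhook_url = "YOUR_WEBHOOK_URL"

# Expects the media type's script parameters to be the subject, then the message
subject = sys.argv[1]
message = sys.argv[2]

data = {
    "msgtype": "markdown",
    "markdown": {
        "content": f"**{subject}**\n{message}"
    }
}

# json= serializes the body and sets the Content-Type header;
# the timeout keeps a hung connection from blocking the alert
response = requests.post(webhook_url, json=data, timeout=10)
print(response.text)
```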
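Long alert messages can exceed the size the webhook accepts for markdown content. The sketch below builds the same payload but truncates the content to a byte budget without cutting a multi-byte character in half. The 4096-byte default is an assumption to verify against the robot's current documentation, and `truncate_utf8` and `build_payload` are illustrative helper names.

```python
from typing import Any, Dict


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text so that its UTF-8 encoding fits within max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a trailing partial character left by the byte slice.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def build_payload(subject: str, message: str, max_bytes: int = 4096) -> Dict[str, Any]:
    """Build the markdown payload, keeping the content within max_bytes."""
    content = truncate_utf8(f"**{subject}**\n{message}", max_bytes)
    return {"msgtype": "markdown", "markdown": {"content": content}}


if __name__ == "__main__":
    payload = build_payload("PROBLEM: web1", "CPU utilization is high " * 500)
    print(len(payload["markdown"]["content"].encode("utf-8")))
```

Truncating on bytes rather than characters matters here because alert text in Chinese takes three bytes per character in UTF-8.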
Set the script's permissions and ownership before testing it:

```bash
sudo chmod +x /usr/lib/zabbix/alertscripts/wechat_alert.py
sudo chown zabbix:zabbix /usr/lib/zabbix/alertscripts/wechat_alert.py
```
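Configure a tiered alerting policy in the Zabbix frontend:

Example of a tuned alert message template:

```text
[{TRIGGER.STATUS}] {HOST.NAME}
Severity: {TRIGGER.SEVERITY}
Problem: {TRIGGER.NAME}
Current value: {ITEM.VALUE}
Event time: {EVENT.TIME}
Duration: {EVENT.AGE}
```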
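To preview how a message template will read before wiring it into an action, the macros can be substituted locally with sample values. This is only a sketch of the idea: Zabbix expands macros itself on the server, and the `render` helper and the sample values below are illustrative.

```python
import re
from typing import Dict

_MACRO = re.compile(r"\{([A-Z0-9_.]+)\}")


def render(template: str, values: Dict[str, str]) -> str:
    """Replace {MACRO} placeholders with sample values, leaving unknown ones intact."""
    def substitute(match: "re.Match[str]") -> str:
        return values.get(match.group(1), match.group(0))

    return _MACRO.sub(substitute, template)


if __name__ == "__main__":
    template = "[{TRIGGER.STATUS}] {HOST.NAME}\nDuration: {EVENT.AGE}"
    sample = {"TRIGGER.STATUS": "PROBLEM", "HOST.NAME": "web1", "EVENT.AGE": "5m"}
    print(render(template, sample))
```

Leaving unknown macros untouched makes typos in macro names visible in the preview instead of silently producing empty fields.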
Data collection test (`zabbix_get` ships in the separate `zabbix-get` package):

```bash
zabbix_get -s 127.0.0.1 -k "system.cpu.load[all,avg1]"
```
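Where `zabbix_get` is not available, a passive check can be issued directly over TCP. The sketch below assumes the documented Zabbix header layout (the `ZBXD` signature, a flags byte of `0x01`, a 4-byte little-endian data length, and 4 reserved bytes) and that the agent accepts requests framed this way. The function names are illustrative, and compressed or large-packet flags are not handled.

```python
import socket
import struct

HEADER = struct.Struct("<4sBII")  # signature, flags, data length, reserved


def build_packet(key: str) -> bytes:
    """Frame an item key as a Zabbix protocol request."""
    data = key.encode("utf-8")
    return HEADER.pack(b"ZBXD", 0x01, len(data), 0) + data


def parse_packet(raw: bytes) -> str:
    """Extract the payload from a framed Zabbix protocol response."""
    signature, flags, length, _reserved = HEADER.unpack_from(raw)
    if signature != b"ZBXD" or flags != 0x01:
        raise ValueError("unexpected Zabbix header")
    return raw[HEADER.size:HEADER.size + length].decode("utf-8")


def query_agent(host: str, key: str, port: int = 10050, timeout: float = 5.0) -> str:
    """Send one passive check and return the agent's reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_packet(key))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_packet(b"".join(chunks))


if __name__ == "__main__":
    print(query_agent("127.0.0.1", "system.cpu.load[all,avg1]"))
```

As with `zabbix_get`, the agent only answers hosts listed in its `Server` parameter, so a dropped connection usually indicates an access restriction rather than a framing problem.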
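Alert trigger test (this writes a 1 GiB file to consume disk space; delete /tmp/test afterwards):

```bash
sudo dd if=/dev/zero of=/tmp/test bs=1M count=1024
```

Visualization check: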
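Common performance problems and their solutions:

| Symptom | Likely cause | Solution |
|---|---|---|
| High load on Zabbix Server | Too many items | Lengthen polling intervals and simplify triggers |
| Slow database responses | Missing suitable indexes | Check the indexes on the history/trends tables and add any that are missing |
| Delayed alerts | Poorly configured actions | Set a sensible alert aggregation policy |
| Slow Grafana loading | Query time range too wide | Limit the default time range of panels |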
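Suggested database optimization commands. The stock Zabbix schema normally already defines `history_uint_1` on `(itemid, clock)` and a primary key on `trends_uint`, so inspect the existing indexes first: `ADD INDEX` fails with a duplicate key name if the index is already present. `OPTIMIZE TABLE` rebuilds the tables, which is slow on large history data and best run in a maintenance window.

```sql
-- Inspect the existing indexes before adding any
SHOW INDEX FROM history_uint;
SHOW INDEX FROM trends_uint;
ALTER TABLE history_uint ADD INDEX history_uint_1 (itemid, clock);
ALTER TABLE trends_uint ADD INDEX trends_uint_1 (itemid, clock);
OPTIMIZE TABLE history, history_uint, trends, trends_uint;
```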
For day-to-day operations, it is advisable to maintain the following checklist:
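Quick commands for backing up the key configuration:

```bash
# Zabbix configuration backup
tar czvf /backup/zabbix_config_$(date +%F).tar.gz /etc/zabbix
# Database backup (--single-transaction takes a consistent InnoDB snapshot without locking tables)
mysqldump --single-transaction -u zabbix -p zabbix | gzip > /backup/zabbix_db_$(date +%F).sql.gz
```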
When a service misbehaves, check in this order:

1. `systemctl status zabbix-server`
2. `mysql -u zabbix -p -e "SHOW STATUS"`
3. `curl -I http://localhost/zabbix`
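The same ordered triage can be scripted so that it stops at the first failing layer. The sketch below is one possible shape under stated assumptions: the database step is replaced by a `systemctl is-active mysql` check because `mysql -p` prompts interactively, the service name `mysql` is assumed, and the command runner is injectable so the ordering logic can be exercised without touching the system.

```python
import subprocess
from typing import Callable, List, Optional, Sequence, Tuple

Check = Tuple[str, List[str]]

# Ordered from the service itself, to its database, to the web frontend.
CHECKS: List[Check] = [
    ("zabbix-server service", ["systemctl", "is-active", "--quiet", "zabbix-server"]),
    ("mysql service", ["systemctl", "is-active", "--quiet", "mysql"]),
    ("frontend HTTP", ["curl", "-fsI", "http://localhost/zabbix"]),
]


def run_command(argv: Sequence[str]) -> int:
    """Run a command quietly and return its exit code (127 if it is missing)."""
    try:
        completed = subprocess.run(
            list(argv), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except FileNotFoundError:
        return 127
    return completed.returncode


def first_failure(
    checks: Sequence[Check],
    runner: Callable[[Sequence[str]], int] = run_command,
) -> Optional[str]:
    """Return the name of the first failing check, or None if all pass."""
    for name, argv in checks:
        if runner(argv) != 0:
            return name
    return None


if __name__ == "__main__":
    failed = first_failure(CHECKS)
    print("all checks passed" if failed is None else f"first failure: {failed}")
```

Stopping at the first failure mirrors the manual order: a stopped database explains a failing frontend, so later checks add little once an earlier one fails.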