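## Installing Alertmanager

<div class="tab-container post_tab box-shadow-wrap-lg">
<ul class="nav no-padder b-b scroll-hide" role="tablist">
<li class='nav-item active' role="presentation"><a class='nav-link active' style="" data-toggle="tab" aria-controls='tabs-f489d159601c332045a6624a64d69939680' role="tab" data-target='#tabs-f489d159601c332045a6624a64d69939680'>Run in a container</a></li><li class='nav-item ' role="presentation"><a class='nav-link ' style="" data-toggle="tab" aria-controls='tabs-628a9faae94fcaef9083218fed4d35b441' role="tab" data-target='#tabs-628a9faae94fcaef9083218fed4d35b441'>Run on a server</a></li>
</ul>
<div class="tab-content no-border">
<div role="tabpanel" id='tabs-f489d159601c332045a6624a64d69939680' class="tab-pane fade active in">

- Create the configuration file `/usr/local/alertmanager/alertmanager.yml`

```yaml
global:
  resolve_timeout: 3h  # an alert that carries no end time is treated as resolved after 3 hours without an update
route:
  group_by: ['alertname', 'instance']  # alerts sharing these labels are grouped into one notification
  group_wait: 30s      # how long to wait before sending the first notification for a new group
  group_interval: 2m   # how long to wait before notifying about new alerts in an already-notified group
  repeat_interval: 8h  # how long to wait before re-sending a notification for alerts that are still firing
  receiver: 'feishu'   # alert receiver
receivers:
  - name: 'feishu'
    webhook_configs:
      - url: ''  # leave empty for now; fill in the template URL after deploying PrometheusAlert
```

### Enable TLS and Basic Auth

- Copy the certificate and key generated during the Prometheus installation to the Alertmanager configuration path `/usr/local/alertmanager/`
- Generate the password hash with htpasswd (if the command is missing, install the package with `apt install apache2-utils`)

```bash
htpasswd -nBC 12 '' | tr -d ':\n'
```

- Create the configuration file `/usr/local/alertmanager/web.yml`

```yaml
tls_server_config:
  cert_file: node_exporter.crt
  key_file: node_exporter.key
basic_auth_users:
  # qadawwfwaawdawddsazz is the username; the value after the colon is the password hash generated above
  qadawwfwaawdawddsazz: $2y$12$/igeeEzsefadawdawddasawdwadwadawdauya
```

### Start the container

```bash
# Arguments after the image name replace the image's default command,
# so the config and storage flags are repeated here. Without
# --web.config.file, web.yml is never loaded and TLS and Basic Auth stay off.
docker run --name alertmanager --restart=always -d \
  -p 9093:9093 -v /etc/localtime:/etc/localtime:ro \
  -v /usr/local/alertmanager:/etc/alertmanager \
  --user root \
  --network prometheus \
  prom/alertmanager:latest \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/alertmanager \
  --web.config.file=/etc/alertmanager/web.yml
```

</div><div role="tabpanel" id='tabs-628a9faae94fcaef9083218fed4d35b441' class="tab-pane fade ">

- Download the Alertmanager package from the [official site](https://prometheus.io/download/) and extract it to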
  `/usr/local/alertmanager`

```bash
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar zxvf alertmanager-0.27.0.linux-amd64.tar.gz
mv alertmanager-0.27.0.linux-amd64 /usr/local/alertmanager
```

### Create the configuration file

Create the configuration file `/usr/local/alertmanager/alertmanager.yml`

```yaml
global:
  resolve_timeout: 5m  # an alert that carries no end time is treated as resolved after 5 minutes without an update
route:
  group_by: ['instance']  # alerts in the same group are merged into one notification; here, alerts with the same instance are merged
  group_wait: 10s      # time window: all alerts of a new group arriving within it are merged into one notification
  group_interval: 30s  # each group sends at most one notification every 30 seconds
  repeat_interval: 1h  # how long to wait before re-sending a notification for alerts that are still firing
  receiver: 'web.hook.prometheusalert'  # alert receiver
receivers:
  - name: 'web.hook.prometheusalert'
    webhook_configs:
      - url: ''  # leave empty for now; fill in the template URL after deploying PrometheusAlert
```

### Enable TLS and Basic Auth

- Copy the certificate and key generated during the Prometheus installation to the Alertmanager path `/usr/local/alertmanager/`
- Generate the password hash with htpasswd (if the command is missing, install the package with `apt install apache2-utils`)

```bash
htpasswd -nBC 12 '' | tr -d ':\n'
```

- Create the configuration file `/usr/local/alertmanager/web.yml`

```yaml
tls_server_config:
  cert_file: node_exporter.crt
  key_file: node_exporter.key
basic_auth_users:
  # qadawwfwaawdawddsazz is the username; the value after the colon is the password hash generated above
  qadawwfwaawdawddsazz: $2y$12$/igeeEzsefadawdawddasawdwadwadawdauya
```

### Start as a service

Create the service unit (`vim /etc/systemd/system/alertmanager.service`) and adjust the startup parameters to your environment

```ini
[Unit]
Description=Prometheus alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/alertmanager/alertmanager \
    --config.file /usr/local/alertmanager/alertmanager.yml \
    --web.config.file /usr/local/alertmanager/web.yml \
    --web.external-url https://alertmanager.internal.qwerto.cc \
    --storage.path /usr/local/alertmanager/
ExecReload=/bin/kill -s HUP $MAINPID

[Install]
WantedBy=multi-user.target
```

Create the user and set ownership and permissions

```bash
useradd prometheus -M -s /sbin/nologin
chown prometheus:prometheus /usr/local/alertmanager -R
chmod 755 /usr/local/alertmanager -R
```

Enable the service at boot and start it

```bash
systemctl enable alertmanager
systemctl start alertmanager
systemctl status alertmanager
```

</div>
</div>
</div>

## Add a firewall rule

Alertmanager listens on port `9093` by default, so the firewall must allow this port. Exposing it directly to the public internet is not recommended; allow access only from the Prometheus server IP and from the IPs that need to reach the management page.

```bash
iptables -A INPUT -s 10.20.30.210 -p tcp --dport 9093 -j ACCEPT -m comment --comment "prometheus-alertmanager"
iptables-save >/etc/iptables/rules.v4
```

## Update the Prometheus server configuration to connect to Alertmanager

### Add the Alertmanager configuration on the Prometheus server

```yaml
# Alertmanager configuration
alerting:
  alertmanagers:
    - scheme: https
      tls_config:
        ca_file: node_exporter.crt  # certificate file name
        server_name: "qwerto.local"
        insecure_skip_verify: true  # skips verification of the server certificate
      basic_auth:
        username: qadawwfwaawdawddsazz  # username set for Basic Auth (plaintext)
        password: qwertyuiop123456789   # password set for Basic Auth (plaintext)
      static_configs:
        - targets: ["localhost:9093"]
```

### Add the rule file configuration on the Prometheus server

Edit the configuration file (`vim /usr/local/prometheus/prometheus.yml`) and add `alert_rules_node.yml` under `rule_files`, for example:

```yaml
rule_files:
  - "alert_rules_node.yml"
```

Create the file (`vim /usr/local/prometheus/alert_rules_node.yml`) and define the alerting rules in it. The following can serve as a reference:

```yaml
groups:
  - name: alert_node
    rules:
      # Node
      - alert: PrometheusTargetMissing
        expr: up{job=~"node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Prometheus target missing
          description: "A Prometheus target has disappeared.
            An exporter might have crashed.\n - ServerName: {{ $labels.server_name }}"

      # Disk
      - alert: HostOutOfDiskSpace
        expr: ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Host out of disk space
          description: "Disk is almost full (< 10% left)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
      - alert: HostDiskWillFillIn24Hours
        expr: ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Host disk will fill in 24 hours
          description: "Filesystem is predicted to run out of space within the next 24 hours at current write rate\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
      - alert: HostOutOfInodes
        expr: (node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Host out of inodes
          description: "Disk is almost running out of available inodes (< 10% left)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
      - alert: HostInodesWillFillIn24Hours
        expr: (node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and predict_linear(node_filesystem_files_free{fstype!="msdosfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly{fstype!="msdosfs"} == 0) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Host inodes will fill in 24 hours
          description: "Filesystem is predicted to run out of inodes within the next 24 hours at current write rate\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
      - alert: HostUnusualDiskReadRate
        expr: (sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Host unusual disk read rate
          description: "Disk is probably reading too much data (> 50 MB/s)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
      - alert: HostUnusualDiskWriteRate
        expr: (sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Host unusual disk write rate
          description: "Disk is probably writing too much data (> 50 MB/s)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
      - alert: HostUnusualDiskReadLatency
        expr: (rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Host unusual disk read latency
          description: "Disk latency is growing (read operations > 100ms)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
      - alert: HostUnusualDiskWriteLatency
        expr: (rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Host unusual disk write latency
          description: "Disk latency is growing (write operations > 100ms)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"

      # Memory
      - alert: HostOutOfMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Host out of memory
          description: "Node memory is filling up (< 10% left)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"

      # Network
      - alert: HostUnusualNetworkThroughputIn
        expr: (sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 8m
        labels:
          severity: warning
        annotations:
          summary: Host unusual network throughput in
          description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
      - alert: HostUnusualNetworkThroughputOut
        expr: (sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 8m
        labels:
          severity: warning
        annotations:
          summary: Host unusual network throughput out
          description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
      - alert: HostConntrackLimit
        expr: (node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 6m
        labels:
          severity: warning
        annotations:
          summary: Host conntrack limit
          description: "The number of conntrack entries is approaching the limit\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"

      # CPU
      - alert: HostHighCpuLoad
        expr: (sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))) > 0.8) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 8m
        labels:
          severity: warning
        annotations:
          summary: Host high CPU load
          description: "CPU load is > 80%\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
      - alert: HostCpuHighIowait
        expr: (avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 10) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Host CPU high iowait
          description: "CPU iowait > 10%. A high iowait means that you are disk or network bound.\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"

      # Systemd
      - alert: HostSystemdServiceCrashed
        expr: (node_systemd_unit_state{state="failed"} == 1) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 4m
        labels:
          severity: warning
        annotations:
          summary: Host systemd service crashed
          description: "systemd service crashed\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"

      # Clock
      - alert: HostClockSkew
        expr: ((node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Host clock skew
          description: "Clock skew detected. Clock is out of sync. Ensure NTP is configured correctly on this host.\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
      - alert: HostClockNotSynchronising
        expr: (min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16) * on(instance) group_left (nodename,server_name) node_uname_info
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Host clock not synchronising
          description: "Clock not synchronising. Ensure NTP is configured on this host.\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
```

Reload Prometheus for the changes to take effect.

## Testing

- Try to trigger an alert
- Open the Alertmanager management address and check that the corresponding alert appears

## References

[Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/)

[Prometheus: understanding the delays on alerting](https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html)

Last modified: 30 July 2024
© Reposting permitted with proper attribution
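
For the test step above, one way to trigger an alert without waiting for a rule to fire is to post a hand-made alert to Alertmanager's v2 API. The following is a minimal sketch, not the article's own procedure: the `AM_URL`, `AM_USER` and `AM_PASS` variables, the function names, and the `ManualTestAlert` labels are all made up for illustration, and `--insecure` is used only to mirror the `insecure_skip_verify: true` setting in the Prometheus configuration.

```bash
#!/usr/bin/env bash
# Hypothetical defaults; override them through the environment.
AM_URL="${AM_URL:-https://localhost:9093}"
AM_USER="${AM_USER:-your-username}"
AM_PASS="${AM_PASS:-your-plaintext-password}"

# Print a JSON array holding one test alert.
# The label and annotation values are invented for this test.
build_test_alert() {
  cat <<'EOF'
[
  {
    "labels": {
      "alertname": "ManualTestAlert",
      "instance": "test-instance",
      "severity": "warning"
    },
    "annotations": {
      "summary": "Manual test alert",
      "description": "Sent by hand to verify Alertmanager routing."
    }
  }
]
EOF
}

# POST the alert to Alertmanager using Basic Auth over HTTPS.
# --insecure skips certificate verification (assumption: self-signed cert).
send_test_alert() {
  build_test_alert | curl --silent --show-error --fail \
    --insecure \
    --user "${AM_USER}:${AM_PASS}" \
    --header 'Content-Type: application/json' \
    --data-binary @- \
    "${AM_URL}/api/v2/alerts"
}
```

After sourcing the script, calling `send_test_alert` with the plaintext Basic Auth password should make the alert show up in the Alertmanager management page, where it can be checked against the configured route and receiver.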