Installing Prometheus (Part 4) - Alertmanager

Author: Qwert
Published: March 4, 2024
Category: Ops Notes

## Installing Alertmanager

<div class="tab-container post_tab box-shadow-wrap-lg">
<ul class="nav no-padder b-b scroll-hide" role="tablist">
<li class='nav-item active' role="presentation"><a class='nav-link active' style="" data-toggle="tab" 
aria-controls='tabs-0d1ab589a6029c6bbfe3c32dda77c19f10' role="tab" data-target='#tabs-0d1ab589a6029c6bbfe3c32dda77c19f10'>Run in a container</a></li><li class='nav-item ' role="presentation"><a class='nav-link ' style="" data-toggle="tab" 
aria-controls='tabs-fff71489611f0055d19d8983c146a6f4821' role="tab" data-target='#tabs-fff71489611f0055d19d8983c146a6f4821'>Run on a server</a></li>
</ul>
<div class="tab-content no-border">
<div role="tabpanel" id='tabs-0d1ab589a6029c6bbfe3c32dda77c19f10' class="tab-pane fade active in">

- Create the configuration file `/usr/local/alertmanager/alertmanager.yml`

```yaml
global:
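  resolve_timeout: 3h  # treat an alert as resolved if no update is received for 3 hours
route:
  group_by: ['alertname', 'instance']  # alerts sharing these label values are grouped into one notification
  group_wait: 30s  # how long to wait before sending the first notification for a new group
  group_interval: 2m  # minimum time between notifications when new alerts join an existing group
  repeat_interval: 8h  # how long to wait before re-sending a notification for alerts that are still firing
  receiver: 'feishu'  # receiver for the alerts
receivers:
- name: 'feishu'
  webhook_configs:
  - url: ''  # leave empty for now; fill in the matching template URL after deploying PrometheusAlert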
```
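
The `route` timers above, together with Prometheus's own scrape and evaluation intervals and a rule's `for` duration, bound how long it takes before a first notification goes out. The sketch below adds them up as a rough worst case. The 15-second scrape and evaluation intervals and the `for: 1m` rule are assumptions for illustration, not values taken from this setup, and the sum ignores webhook delivery time.

```bash
# Rough worst-case delay (in seconds) between a condition becoming true
# and Alertmanager sending the first notification for a new group.
scrape_interval=15       # assumed Prometheus scrape_interval
evaluation_interval=15   # assumed Prometheus evaluation_interval
for_duration=60          # assumed alert rule with `for: 1m`
group_wait=30            # group_wait from the alertmanager.yml above

worst_case=$((scrape_interval + evaluation_interval + for_duration + group_wait))
echo "worst-case delay before the first notification: ${worst_case}s"
```

Lowering `group_wait` shortens this delay at the cost of merging fewer alerts into the first notification.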

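### Enable TLS and Basic Auth

- Copy the certificate and key generated during the Prometheus installation into the Alertmanager configuration directory `/usr/local/alertmanager/`
- Generate the password hash with htpasswd (if the command is missing, install it with `apt install apache2-utils`)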

```bash
htpasswd -nBC 12 '' | tr -d ':\n'
```

- Create the configuration file `/usr/local/alertmanager/web.yml`

```yaml
tls_server_config:
  cert_file: node_exporter.crt
  key_file: node_exporter.key
basic_auth_users:
  # qadawwfwaawdawddsazz is the username; the value after the colon is the password hash generated above
  qadawwfwaawdawddsazz: $2y$12$/igeeEzsefadawdawddasawdwadwadawdauya
```
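
For scripted setups, the interactive prompt in the htpasswd step can be avoided with the `-b` flag, which reads the password from the command line. This is a minimal sketch: the `bcrypt_hash` helper name and the sample password are hypothetical. A password passed as an argument may end up in shell history or the process list, so the interactive form is safer on shared machines.

```bash
# Hypothetical helper: print the bcrypt hash of the password given as $1.
# -n prints to stdout, -b takes the password from the argument,
# -B selects bcrypt, -C 12 sets the bcrypt cost.
bcrypt_hash() {
    htpasswd -nbBC 12 '' "$1" | tr -d ':\n'
}

# Only run the example when htpasswd is installed.
if command -v htpasswd >/dev/null 2>&1; then
    bcrypt_hash 'example-password'
    echo
fi
```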

### Start the container

```bash
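# Arguments after the image name replace the image's default command,
# so --config.file and --storage.path are restated next to --web.config.file.
docker run --name alertmanager --restart=always -d \
    -p 9093:9093 \
    -v /etc/localtime:/etc/localtime:ro \
    -v /usr/local/alertmanager:/etc/alertmanager \
    --user root \
    --network prometheus \
    prom/alertmanager:latest \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --web.config.file=/etc/alertmanager/web.yml \
    --storage.path=/alertmanager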
```

</div><div role="tabpanel" id='tabs-fff71489611f0055d19d8983c146a6f4821' class="tab-pane fade  ">

- Download the Alertmanager package from the [official site](https://prometheus.io/download/) and extract it to `/usr/local/alertmanager`

```bash
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar zxvf alertmanager-0.27.0.linux-amd64.tar.gz
mv alertmanager-0.27.0.linux-amd64 /usr/local/alertmanager
```

### Create the configuration file

Create the configuration file `/usr/local/alertmanager/alertmanager.yml`

```yaml
global:
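  resolve_timeout: 5m  # treat an alert as resolved if no update is received for 5 minutes
route:
  group_by: ['instance']  # grouping: alerts in the same group are merged into one notification; here, alerts with the same instance are merged
  group_wait: 10s  # time window: alerts of the same group arriving within it are merged into one notification
  group_interval: 30s  # each group sends at most one notification every 30 seconds
  repeat_interval: 1h  # interval for re-sending a notification for alerts that are still firing
  receiver: 'web.hook.prometheusalert'  # receiver for the alerts
receivers:
- name: 'web.hook.prometheusalert'
  webhook_configs:
  - url: ''  # leave empty for now; fill in the matching template URL after deploying PrometheusAlert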
```
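
Before starting the service, the file can be checked with `amtool`, which ships in the same release tarball as `alertmanager`. A minimal sketch, assuming the tarball was extracted to `/usr/local/alertmanager` as above; the `check_alertmanager_config` helper is hypothetical. While the webhook `url` is still empty, the check may well report an error, which is expected until the PrometheusAlert address is filled in.

```bash
# Hypothetical helper: validate an Alertmanager config file with amtool.
# Returns 2 when amtool is missing, otherwise amtool's own exit status.
check_alertmanager_config() {
    local config="${1:-/usr/local/alertmanager/alertmanager.yml}"
    local amtool="${AMTOOL:-/usr/local/alertmanager/amtool}"
    if [ ! -x "$amtool" ]; then
        echo "amtool not found at $amtool" >&2
        return 2
    fi
    "$amtool" check-config "$config"
}

# Usage:
#   check_alertmanager_config /usr/local/alertmanager/alertmanager.yml
```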

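### Enable TLS and Basic Auth

- Copy the certificate and key generated during the Prometheus installation into the Alertmanager directory `/usr/local/alertmanager/`
- Generate the password hash with htpasswd (if the command is missing, install it with `apt install apache2-utils`)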

```bash
htpasswd -nBC 12 '' | tr -d ':\n'
```

- Create the configuration file `/usr/local/alertmanager/web.yml`, with the same content as the `web.yml` shown in the container tab

### Start as a service

Create the service unit with `vim /etc/systemd/system/alertmanager.service` and adjust the startup flags to match your environment

```ini
[Unit]
Description=Prometheus alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/alertmanager/alertmanager \
    --config.file /usr/local/alertmanager/alertmanager.yml \
    --web.config.file /usr/local/alertmanager/web.yml \
    --web.external-url https://alertmanager.internal.qwerto.cc \
    --storage.path /usr/local/alertmanager/
ExecReload=/bin/kill -s HUP $MAINPID

[Install]
WantedBy=multi-user.target
```

Create the user and set ownership and permissions

```bash
useradd prometheus -M -s /sbin/nologin
chown prometheus:prometheus /usr/local/alertmanager -R
chmod 755 /usr/local/alertmanager -R
```

Enable the service at boot and start it

```bash
systemctl daemon-reload  # make sure systemd has picked up the new unit file
systemctl enable alertmanager
systemctl start alertmanager
systemctl status alertmanager
```

</div>
</div>
</div>

## Add firewall rules

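Alertmanager listens on port `9093` by default, so the firewall must allow this port. Exposing it directly to the public internet is not recommended; allow access only from the Prometheus server IP and from the IPs that need to reach the management UI.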

```bash
iptables -A INPUT -s 10.20.30.210 -p tcp --dport 9093 -j ACCEPT -m comment --comment "prometheus-alertmanager"
iptables-save >/etc/iptables/rules.v4
```
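
An `ACCEPT` rule by itself only restricts access if other traffic to the port is dropped, either by the chain's default policy or by an explicit rule. The sketch below prints (rather than applies) one `ACCEPT` rule per allowed source IP followed by a final `DROP` for the port, so the rules can be reviewed first. The `print_allow_rules` helper is hypothetical, and only `10.20.30.210` comes from the example above. For the container setup, ports published by Docker are usually filtered in the `DOCKER-USER` chain rather than `INPUT`, so treat this as a starting point.

```bash
# Hypothetical helper: print iptables rules that allow only the listed
# source IPs to reach a TCP port and drop everything else for that port.
print_allow_rules() {
    local port="$1"
    shift
    local ip
    for ip in "$@"; do
        echo "iptables -A INPUT -s ${ip} -p tcp --dport ${port} -j ACCEPT"
    done
    echo "iptables -A INPUT -p tcp --dport ${port} -j DROP"
}

# Review the output, then pipe it to `sh` as root to apply it.
print_allow_rules 9093 10.20.30.210
```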

## Update the Prometheus server configuration to connect to Alertmanager

### Add the Alertmanager configuration on the Prometheus server

```yaml
# Alertmanager configuration
alerting:
  alertmanagers:
    - scheme: https
      tls_config:
        ca_file: node_exporter.crt  # certificate file name
        server_name: "qwerto.local"
        insecure_skip_verify: true
      basic_auth:
        username: qadawwfwaawdawddsazz  # username configured for Basic Auth (plaintext)
        password: qwertyuiop123456789  # password configured for Basic Auth (plaintext)
      static_configs:
        - targets: ["localhost:9093"]
```

### Add the rule file configuration on the Prometheus server

Edit the configuration file with `vim /usr/local/prometheus/prometheus.yml` and add `alert_rules_node.yml` under `rule_files`, for example:

```yaml
rule_files:
  - "alert_rules_node.yml"
```

Create the file with `vim /usr/local/prometheus/alert_rules_node.yml` and define the alerting rules in it. The following can serve as a reference:

```yaml
groups:
- name: alert_node
  rules:
  # Node #
  - alert: PrometheusTargetMissing
    expr: up{job=~"node"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Prometheus target missing
      description: "A Prometheus target has disappeared. An exporter might be crashed.\n - ServerName: {{ $labels.server_name }}"

  # Disk #
  - alert: HostOutOfDiskSpace
    expr: ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host out of disk space
      description: "Disk is almost full (< 10% left)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
  - alert: HostDiskWillFillIn24Hours
    expr: ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Host disk will fill in 24 hours
      description: "Filesystem is predicted to run out of space within the next 24 hours at current write rate\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
  - alert: HostOutOfInodes
    expr: (node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host out of inodes
      description: "Disk is almost running out of available inodes (< 10% left)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
  - alert: HostInodesWillFillIn24Hours
    expr: (node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and predict_linear(node_filesystem_files_free{fstype!="msdosfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly{fstype!="msdosfs"} == 0) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Host inodes will fill in 24 hours
      description: "Filesystem is predicted to run out of inodes within the next 24 hours at current write rate\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
  - alert: HostUnusualDiskReadRate
    expr: (sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk read rate
      description: "Disk is probably reading too much data (> 50 MB/s)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
  - alert: HostUnusualDiskWriteRate
    expr: (sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk write rate
      description: "Disk is probably writing too much data (> 50 MB/s)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
  - alert: HostUnusualDiskReadLatency
    expr: (rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk read latency
      description: "Disk latency is growing (read operations > 100ms)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
  - alert: HostUnusualDiskWriteLatency
    expr: (rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk write latency
      description: "Disk latency is growing (write operations > 100ms)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"

  # Memory #
  - alert: HostOutOfMemory
    expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host out of memory
      description: "Node memory is filling up (< 10% left)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"

  # Network #
  - alert: HostUnusualNetworkThroughputIn
    expr: (sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 8m
    labels:
      severity: warning
    annotations:
      summary: Host unusual network throughput in
      description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
  - alert: HostUnusualNetworkThroughputOut
    expr: (sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 8m
    labels:
      severity: warning
    annotations:
      summary: Host unusual network throughput out
      description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
  - alert: HostConntrackLimit
    expr: (node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 6m
    labels:
      severity: warning
    annotations:
      summary: Host conntrack limit
      description: "The number of conntrack is approaching limit\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"

  # CPU #
  - alert: HostHighCpuLoad
    expr: (sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))) > 0.8) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 8m
    labels:
      severity: warning
    annotations:
      summary: Host high CPU load
      description: "CPU load is > 80%\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
  - alert: HostCpuHighIowait
    expr: (avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 10) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host CPU high iowait
      description: "CPU iowait > 10%. A high iowait means that you are disk or network bound.\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"

  # Systemd #
  - alert: HostSystemdServiceCrashed
    expr: (node_systemd_unit_state{state="failed"} == 1) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 4m
    labels:
      severity: warning
    annotations:
      summary: Host systemd service crashed
      description: "systemd service crashed\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"

  # Clock #
  - alert: HostClockSkew
    expr: ((node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host clock skew
      description: "Clock skew detected. Clock is out of sync. Ensure NTP is configured correctly on this host.\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
  - alert: HostClockNotSynchronising
    expr: (min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16) * on(instance) group_left (nodename,server_name) node_uname_info
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host clock not synchronising
      description: "Clock not synchronising. Ensure NTP is configured on this host.\n - VALUE: {{ $value }}\n - ServerName: {{ $labels.server_name }}"
```
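
Before reloading Prometheus, it is worth validating the rule file with `promtool`, which ships alongside the `prometheus` binary. A minimal sketch, assuming Prometheus was unpacked to `/usr/local/prometheus` (the directory used for `prometheus.yml` above); the `check_rules` helper is hypothetical.

```bash
# Hypothetical helper: validate a Prometheus rule file with promtool.
# Returns 2 when promtool is missing, otherwise promtool's own exit status.
check_rules() {
    local rules="${1:-/usr/local/prometheus/alert_rules_node.yml}"
    local promtool="${PROMTOOL:-/usr/local/prometheus/promtool}"
    if [ ! -x "$promtool" ]; then
        echo "promtool not found at $promtool" >&2
        return 2
    fi
    "$promtool" check rules "$rules"
}

# Usage:
#   check_rules /usr/local/prometheus/alert_rules_node.yml
```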

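Reload Prometheus for the changes to take effect

## Testing

- Try to trigger an alert
- Open the Alertmanager management URL and check whether the corresponding alert shows up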
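
One way to trigger a test alert without waiting for a real failure is to post a synthetic alert to Alertmanager's v2 API. A minimal sketch: the `build_test_alert` helper, the `TestAlert` name, the label values, and the `/tmp/test-alert.json` path are made up for illustration, and the `curl` lines in the comment assume the Basic Auth user and self-signed certificate configured earlier.

```bash
# Hypothetical helper: print a one-element JSON array describing a test alert.
build_test_alert() {
    local name="${1:-TestAlert}"
    printf '[{"labels":{"alertname":"%s","instance":"test-instance","severity":"warning"},' "$name"
    printf '"annotations":{"summary":"Synthetic alert for testing notifications"}}]\n'
}

build_test_alert

# To send it (curl prompts for the password; -k accepts the self-signed certificate):
#   build_test_alert > /tmp/test-alert.json
#   curl -k -u qadawwfwaawdawddsazz \
#       -H 'Content-Type: application/json' \
#       -d @/tmp/test-alert.json https://localhost:9093/api/v2/alerts
```

Because the payload carries no end time, the alert should stay active until `resolve_timeout` has elapsed.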

## References

[Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/)

[Prometheus: understanding the delays on alerting](https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html)

Last modified: July 30, 2024
