Docker 容器间通信不能用 localhost，网络详细用法看 https://www.yuque.com/charlesjohn/blog/tm8p80

启动 Consul

将以下内容写入 docker-compose.yml

version: "3"
services:
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /:/host
    restart: always
    command: "--web.listen-address=:9100 --path.rootfs=/host --path.procfs=/host/proc --path.sysfs=/host/sys"
    networks:
      - prom

  dingtalk:
    image: timonwong/prometheus-webhook-dingtalk:latest
    volumes:
      - "./alertmanager/config.yml:/etc/prometheus-webhook-dingtalk/config.yml"
    ports:
      - "8060:8060"
    networks:
      - prom
  alertmanager:
    depends_on:
      - dingtalk
    image: prom/alertmanager:latest
    volumes:
      - "./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml"

    ports:
      - "9093:9093"
      - "9094:9094"
    networks:
      - prom
  prometheus:
    depends_on:
      - alertmanager
      - consul
    image: prom/prometheus:latest
    volumes:
      - "./prometheus/:/etc/prometheus/"
    ports:
      - "9090:9090"
    networks:
      - prom
  grafana:
    depends_on:
      - prometheus
    image: grafana/grafana:latest
    volumes:
      - "./grafana/:/var/lib/grafana"
    ports:
      - "3000:3000"
    networks:
      - prom
  consul:
    image: consul
    ports:
      - "8500:8500"
    networks:
      - prom
networks:
  prom:
    driver: bridge

应用接入

添加依赖

<dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-registry-prometheus</artifactId>
            <scope>runtime</scope>
        </dependency>

修改配置

// 在需要监控的 Spring Boot 配置文件中加入如下配置，base-path 不能修改，不然 Consul 将无法检查健康状况
management:
  metrics:
    tags:
      application: ${spring.application.name}
    export:
      prometheus:
        enabled: true
  endpoints:
    web:
      exposure:
        include: prometheus
      base-path: /actuator/health

// 修改健康检查地址就能修改 base-path 了，修改方式如下，第 23 行
spring:
  application:
    name: 你的服务名
  cloud:
    consul:
      discovery:
        health-check-interval: 10s
        health-check-path: ${management.server.servlet.context-path}/actuator/health
        instance-id: ${spring.application.name}
        prefer-ip-address: true
        register: true
      enabled: true

配置监控

将以下内容写入 prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "*rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["prometheus:9090"]

  - job_name: "consul"
    metrics_path: "/actuator/health/prometheus"
    consul_sd_configs:
      - server: "consul:8500"
        services: []

  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "alertmanager"
    static_configs:
      - targets: ["alertmanager:9093"]

配置报警

将以下内容写入 alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.163.com:465" #邮箱smtp服务器代理，启用SSL发信, 端口一般是465
  smtp_from: "alert@163.com" #发送邮箱名称
  smtp_auth_username: "alert@163.com" #邮箱名称
  smtp_auth_password: "password" #邮箱密码或授权码
  smtp_require_tls: false

route:
  receiver: "default"
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 1h
  group_by: ["alertname"]

inhibit_rules:
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "instance"]

receivers:
  - name: "default"
    email_configs:
      - to: "receiver@163.com"
        send_resolved: true
    webhook_configs:
      - url: "http://dingtalk:8060/dingtalk/webhook/send"
        send_resolved: true

将以下内容写入 alertmanager/config.yml

targets:
  webhook:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxx
    mention:
      all: true

将以下内容写入 prometheus/alert-rules.yml

groups:
  - name: node-alert
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} down"
          description: "Instance: {{ $labels.instance }} 已经宕机 5分钟"
          value: "{{ $value }}"

      - alert: NodeCpuHigh
        expr: (1 - avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m]))) * 100 > 80
        for: 5m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} cpu使用率过高"
          description: "CPU 使用率超过 80%"
          value: "{{ $value }}"

      - alert: NodeCpuIowaitHigh
        expr: avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="iowait"}[5m])) * 100 > 50
        for: 5m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} cpu iowait 使用率过高"
          description: "CPU iowait 使用率超过 50%"
          value: "{{ $value }}"

      - alert: NodeLoad5High
        expr: node_load5 > (count by (instance) (node_cpu_seconds_total{job="node",mode='system'})) * 1.2
        for: 5m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} load(5m) 过高"
          description: "Load(5m) 过高，超出cpu核数 1.2倍"
          value: "{{ $value }}"

      - alert: NodeMemoryHigh
        expr: (1 - node_memory_MemAvailable_bytes{job="node"} / node_memory_MemTotal_bytes{job="node"}) * 100 > 90
        for: 5m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} memory 使用率过高"
          description: "Memory 使用率超过 90%"
          value: "{{ $value }}"

      - alert: NodeDiskRootHigh
        expr: (1 - node_filesystem_avail_bytes{job="node",fstype=~"ext.*|xfs",mountpoint ="/"} / node_filesystem_size_bytes{job="node",fstype=~"ext.*|xfs",mountpoint ="/"}) * 100 > 90
        for: 10m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} disk(/ 分区) 使用率过高"
          description: "Disk(/ 分区) 使用率超过 90%"
          value: "{{ $value }}"

      - alert: NodeDiskBootHigh
        expr: (1 - node_filesystem_avail_bytes{job="node",fstype=~"ext.*|xfs",mountpoint ="/boot"} / node_filesystem_size_bytes{job="node",fstype=~"ext.*|xfs",mountpoint ="/boot"}) * 100 > 80
        for: 10m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} disk(/boot 分区) 使用率过高"
          description: "Disk(/boot 分区) 使用率超过 80%"
          value: "{{ $value }}"

      - alert: NodeDiskReadHigh
        expr: irate(node_disk_read_bytes_total{job="node"}[5m]) > 20 * (1024 ^ 2)
        for: 5m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} disk 读取字节数 速率过高"
          description: "Disk 读取字节数 速率超过 20 MB/s"
          value: "{{ $value }}"

      - alert: NodeDiskWriteHigh
        expr: irate(node_disk_written_bytes_total{job="node"}[5m]) > 20 * (1024 ^ 2)
        for: 5m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} disk 写入字节数 速率过高"
          description: "Disk 写入字节数 速率超过 20 MB/s"
          value: "{{ $value }}"

      - alert: NodeDiskReadRateCountHigh
        expr: irate(node_disk_reads_completed_total{job="node"}[5m]) > 3000
        for: 5m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} disk iops 每秒读取速率过高"
          description: "Disk iops 每秒读取速率超过 3000 iops"
          value: "{{ $value }}"

      - alert: NodeDiskWriteRateCountHigh
        expr: irate(node_disk_writes_completed_total{job="node"}[5m]) > 3000
        for: 5m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} disk iops 每秒写入速率过高"
          description: "Disk iops 每秒写入速率超过 3000 iops"
          value: "{{ $value }}"

      - alert: NodeInodeRootUsedPercentHigh
        expr: (1 - node_filesystem_files_free{job="node",fstype=~"ext4|xfs",mountpoint="/"} / node_filesystem_files{job="node",fstype=~"ext4|xfs",mountpoint="/"}) * 100 > 80
        for: 10m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} disk(/ 分区) inode 使用率过高"
          description: "Disk (/ 分区) inode 使用率超过 80%"
          value: "{{ $value }}"

      - alert: NodeInodeBootUsedPercentHigh
        expr: (1 - node_filesystem_files_free{job="node",fstype=~"ext4|xfs",mountpoint="/boot"} / node_filesystem_files{job="node",fstype=~"ext4|xfs",mountpoint="/boot"}) * 100 > 80
        for: 10m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} disk(/boot 分区) inode 使用率过高"
          description: "Disk (/boot 分区) inode 使用率超过 80%"
          value: "{{ $value }}"

      - alert: NodeFilefdAllocatedPercentHigh
        expr: node_filefd_allocated{job="node"} / node_filefd_maximum{job="node"} * 100 > 80
        for: 10m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} filefd 打开百分比过高"
          description: "Filefd 打开百分比 超过 80%"
          value: "{{ $value }}"

      - alert: NodeNetworkNetinBitRateHigh
        expr: avg by (instance) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m]) * 8) > 20 * (1024 ^ 2) * 8
        for: 3m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} network 接收比特数 速率过高"
          description: "Network 接收比特数 速率超过 20MB/s"
          value: "{{ $value }}"

      - alert: NodeNetworkNetoutBitRateHigh
        expr: avg by (instance) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m]) * 8) > 20 * (1024 ^ 2) * 8
        for: 3m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} network 发送比特数 速率过高"
          description: "Network 发送比特数 速率超过 20MB/s"
          value: "{{ $value }}"

      - alert: NodeNetworkNetinPacketErrorRateHigh
        expr: avg by (instance) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m])) > 15
        for: 3m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} 接收错误包 速率过高"
          description: "Network 接收错误包 速率超过 15个/秒"
          value: "{{ $value }}"

      - alert: NodeNetworkNetoutPacketErrorRateHigh
        expr: avg by (instance) (irate(node_network_transmit_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m])) > 15
        for: 3m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} 发送错误包 速率过高"
          description: "Network 发送错误包 速率超过 15个/秒"
          value: "{{ $value }}"

      - alert: NodeProcessBlockedHigh
        expr: node_procs_blocked{job="node"} > 10
        for: 10m
        labels:
          severity: warning
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} 当前被阻塞的任务的数量过多"
          description: "Process 当前被阻塞的任务的数量超过 10个"
          value: "{{ $value }}"

      - alert: NodeTimeOffsetHigh
        expr: abs(node_timex_offset_seconds{job="node"}) > 3 * 60
        for: 2m
        labels:
          severity: info
          instance: "{{ $labels.instance }}"
        annotations:
          summary: "instance: {{ $labels.instance }} 时间偏差过大"
          description: "Time 节点的时间偏差超过 3m"
          value: "{{ $value }}"

整套跑起来

最后走起
docker-compose up -d

这时候如果报错了，就是权限问题
将 docker-compose.yml 第 50 行
- "/var/lib/grafana:/var/lib/grafana"
改成
- "./grafana/:/var/lib/grafana"
我这里已经提前修改了，防止你们报错，还得去删除容器重新起。

解决 Target 中出现 415 Unsupport Media Type 和 Prometheus is not enabled since its retention time is not positive 参考：
https://blog.csdn.net/u013360850/article/details/106158994

更多 consul 配置 https://www.cnblogs.com/qidakang/p/7837519.html

常见配置

配置 consul 目标

修改 prometheus/prometheus.yml 第 24 行
services: []
services 数组中存放的就是你要监控的服务，只能填注册到 Consul 的服务名。
比如：
services: ['user', 'security', 'gateway']

配置 context-path

如果你的 application.yml 配置了 servlet.context-path，比如下边这样

server.port=8090
server.servlet.context-path=/monitoring
management.endpoints.web.exposure.include=*

那么你得修改 prometheus/prometheus.yml 第 21 行
metrics_path: "/actuator/health/prometheus"
在 /actuator/health/prometheus 前边加上你配置的 context-path，修改完以后变成了
metrics_path: "/monitoring/actuator/health/prometheus"

配置 base-path

// 想要修改 base-path，你得同步修改 prometheus/prometheus.yml 第 21 行的 metrics_path，和 application.yml 中的 health-check-path

// 比如你 base-path 改成了
base-path: /health
// 那么你 health-check-path 就得改成
health-check-path: ${management.server.servlet.context-path}/health
// 那么你 metrics_path 就改成
metrics_path: "/health/prometheus" 这里就是 "/你的base-path/prometheus"

//其实你不用修改这么多配置，这只是麻烦的办法，作为一个高明的框架，可以在不侵入代码配置的请框架进行修改。参考如下文章的 relabel_configs，他可以操作你的 metrics_path，还可以对服务进行忽略，修改等等操作。但是你不要直接用他的配置文件，他那个只是演示，你要改的适应自己的业务和监控需求。
https://blog.csdn.net/aixiaoyang168/article/details/103022342
https://blog.csdn.net/poorCoder_/article/details/79120218

比如下边这篇文章，他说必须用 relabel_configs 修改 metrics_path，否则就会 404 其实是不准确的，是他的配置文件没配完，但是他的处理方式是比较聪明的，我是在配好文件以后才发现可以直接 relabel_configs 修改。
https://blog.csdn.net/sanduo112/article/details/90266200

如果有些服务明明地址配对了，我自己能访问，但是他就是访问不到呢？这肯定是因为 Docker 的网络问题，这里需要进行一下配置，参考这篇文章。
https://blog.csdn.net/u013360850/article/details/106159160

还有一些服务，你怎么配都不对，比如 Consul 内部数据复制使用了 8300 端口，也注册了进去。添加如下relabel_configs 忽略掉它。
    relabel_configs:
      - source_labels: [__meta_consul_service]
        regex: .*consul.*
        target_label: __meta_consul_service
        action: drop

Consul 从配置到启动