Posted 2024-02-18 | Updated 2024-02-18 | 31 minutes read (About 4642 words)

Prometheus


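Prometheus is an open-source, cloud-native monitoring system and time series database.

1. Prometheus Overview

Prometheus is a new-generation cloud-native monitoring system. More than 650 contributors have taken part in its development, and it has more than 120 third-party integrations.

1.1 Advantages of Prometheus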

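A multi-dimensional data model with flexible querying: each metric can carry multiple tags (labels), so monitoring data can be combined along any dimension. Queries are written in PromQL, and an HTTP query API makes it easy to display the data in GUI tools such as Grafana.
Local storage on the server node with no dependency on external storage: the built-in time series database can handle writes on the order of ten million samples per second. For long-term retention of large volumes of historical data, Prometheus can also be connected to third-party time series databases such as OpenTSDB.
An open metrics format: time series are collected over HTTP using a pull model, and only targets that expose data in the Prometheus format can be scraped and aggregated. Pushing time series to an intermediate gateway is also supported, which makes it possible to handle a wider range of monitoring scenarios.
Monitoring targets can be found through static file configuration or dynamic discovery, so collection starts automatically. Prometheus supports many service discovery mechanisms, including Kubernetes and Consul.
Easy to maintain: it can be started directly from a binary, and container images are provided.
Sharded collection and federated deployment are supported for monitoring large clusters.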

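1.2 Basic Components of Prometheus

Prometheus Server: the core component, responsible for collecting, storing, and querying monitoring data. The collected data is referred to as metrics.
Push Gateway: used as a relay when Prometheus Server cannot reach a target directly over the network. Hosts on an internal network push their monitoring data to the Push Gateway, and Prometheus Server then pulls from the Push Gateway in the same way it pulls from any other target.
Exporter: collects data and exposes it over HTTP. Prometheus Server scrapes the endpoint provided by the exporter to obtain the monitoring data.
Alertmanager: manages alerts and is mainly responsible for sending notifications. Grafana can now send alerts as well, so some setups use Grafana alerting instead.

[Figure: Prometheus architecture]

1.3 Prometheus Metric Types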

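Counter: works like a counter and only ever increases (unless the process restarts and the value resets to zero).
Gauge: a value that can go up or down, used to reflect the current state of an application.
Histogram: samples observations over a period of time (typically request durations or response sizes) and counts them in configurable buckets, along with the total count and sum of the observations. The data is usually displayed as a histogram.
Summary: also samples observations over a period of time (typically request durations or response sizes) and reports the total count and sum, plus quantiles calculated on the client side.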
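To make the difference between a counter and a gauge more concrete, here is a minimal sketch of how the two look in the text format that Prometheus scrapes. The function name emit_metric and both demo metrics are made up for illustration; only the "# HELP" / "# TYPE" line layout comes from the Prometheus exposition format.

```shell
#!/bin/bash
# Print one metric in the Prometheus text exposition format.
# Arguments: name, type (counter or gauge), help text, value.
emit_metric() {
    printf '# HELP %s %s\n' "$1" "$3"
    printf '# TYPE %s %s\n' "$1" "$2"
    printf '%s %s\n' "$1" "$4"
}

# Hypothetical metrics, for illustration only.
# A counter only grows; a gauge reports the current level.
emit_metric demo_requests_total counter "Requests served since start." 1027
emit_metric demo_queue_length gauge "Items currently waiting in the queue." 3
```

By convention counter names end in _total, which is also how node_cpu_seconds_total is named later in this post.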

2. Installing Prometheus

2.1 Installing Prometheus Server

Installing Prometheus is simple: download the release archive and unpack it.

wget https://github.com/prometheus/prometheus/releases/download/v2.26.0-rc.0/prometheus-2.26.0-rc.0.linux-amd64.tar.gz
tar -xf prometheus-2.26.0-rc.0.linux-amd64.tar.gz
mv prometheus-2.26.0-rc.0.linux-amd64 prometheus

The prometheus.yml configuration file

# Global configuration
global:
  scrape_interval:     15s # How often to scrape targets; the default is 1 minute
  evaluation_interval: 15s # How often to evaluate rules; here every 15 seconds, the default is 1 minute
  # scrape_timeout # Scrape timeout; the default is 10s

# Alerting configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Rule files, evaluated periodically according to 'evaluation_interval'
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# Scrape configurations
scrape_configs:
  # Job name
  - job_name: 'prometheus'
    # Targets to monitor
    static_configs:
    - targets: ['localhost:9090']

Create the Prometheus user and the data directory

groupadd prometheus
useradd -g prometheus -s /sbin/nologin prometheus
mkdir /root/prometheus/data
chown -R prometheus:prometheus /root/prometheus

Starting Prometheus is simple: just run the prometheus binary in the unpacked directory. To make it easier to manage, start it from a script or inside a screen session.

vim /root/prometheus/start.sh
#!/bin/bash
prometheus_dir=/root/prometheus
${prometheus_dir}/prometheus --config.file=${prometheus_dir}/prometheus.yml --storage.tsdb.path=${prometheus_dir}/data --storage.tsdb.retention.time=24h --web.enable-lifecycle --storage.tsdb.no-lockfile
# --config.file: path to the configuration file
# --storage.tsdb.path: directory where the TSDB stores its data
# --storage.tsdb.retention.time: how long to keep data
# --web.enable-lifecycle: enables the HTTP reload endpoint, similar to nginx reload
# --storage.tsdb.no-lockfile: needed if Prometheus is managed by a Kubernetes Deployment

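Start Prometheus, then open it in a browser at http://localhost:9090

# nohup (short for "no hang up") keeps a command running in the background after the terminal exits.
nohup sh start.sh > prometheus.log 2>&1 &

[Figure: Prometheus installation 1]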
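Since the start command passes --web.enable-lifecycle, a configuration change can be applied without restarting the process by sending an HTTP POST to the /-/reload endpoint. The following is a small sketch of that; the function names reload_url and reload_prometheus are made up here, and the default host and port are assumed from the configuration shown earlier.

```shell
#!/bin/bash
# Build the lifecycle reload URL for a Prometheus server.
# Arguments: host (default localhost), port (default 9090).
reload_url() {
    printf 'http://%s:%s/-/reload\n' "${1:-localhost}" "${2:-9090}"
}

# Ask Prometheus to re-read prometheus.yml without restarting it.
# Only works when the server was started with --web.enable-lifecycle.
reload_prometheus() {
    curl -s -X POST "$(reload_url "$@")"
}

reload_url    # prints http://localhost:9090/-/reload
```

After editing prometheus.yml, running reload_prometheus on the Prometheus host should be enough to pick up new scrape targets.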

Starting with screen

yum -y install screen
# Open a new screen session
screen
# Run Prometheus
/root/prometheus/prometheus --config.file=/root/prometheus/prometheus.yml --storage.tsdb.path=/root/prometheus/data --storage.tsdb.retention.time=24h --web.enable-lifecycle --storage.tsdb.no-lockfile
# Press Ctrl+A, then D, to detach and return to the original shell
# List the sessions running in the background
screen -ls
# Reattach to a session
screen -r <session-id>
# Terminate a session
screen -S <session-id> -X quit

2.2 Installing node exporter

In the Prometheus architecture, an exporter is the component that collects data and exposes it to Prometheus Server. The official node_exporter provides basic monitoring of the host system out of the box.

Download node exporter

wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
tar -xf node_exporter-1.1.2.linux-amd64.tar.gz
mv node_exporter-1.1.2.linux-amd64 node_exporter

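Add the monitored host to prometheus.yml

    static_configs:
    - targets: ['localhost:9090','localhost:9100']

Start the exporter in the background and restart Prometheus

screen /root/node_exporter/node_exporter

Use curl to fetch the keys (metric names) that were collected

curl http://localhost:9100/metrics
...

Query one of the keys in Prometheus to check that the host is being monitored

[Figure: node_exporter test]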
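The /metrics output is long, so it can help to reduce it to a list of distinct keys before choosing one to query. The sketch below does this on a small hand-written sample; the function name metric_names and the sample values are made up for illustration.

```shell
#!/bin/bash
# List the distinct metric names in exposition-format text read from stdin.
# Comment lines (# HELP / # TYPE) are skipped; labels and values are dropped.
metric_names() {
    grep -v '^#' | sed 's/[{ ].*//' | LC_ALL=C sort -u
}

# Hypothetical sample of what /metrics returns; real output is much longer.
sample='# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.21
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 56.7'

printf '%s\n' "$sample" | metric_names
```

Against a running exporter the same function can be used as: curl -s http://localhost:9100/metrics | metric_names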

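3. Using the Prometheus Command Line

3.1 Calculating CPU usage

[Figure: calculating CPU usage 1]

As the figure above shows, CPU time on Linux is split into many states, such as user and idle.

There are two rough formulas for CPU usage:

(sum of CPU time in all states except idle) / total CPU time
100% - (idle time / total CPU time)

Both leave two questions open:

How do we calculate CPU usage over a specific period, for example per minute?
In practice most CPUs have multiple cores, and node exporter reports data per core. How do we monitor all cores together?

Prometheus provides many functions, and increase and sum answer these two questions.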

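Take the CPU key, node_cpu_seconds_total
Filter out the idle time and the total time. Prometheus uses {} for filtering.

node_cpu_seconds_total{mode='idle'}

Use the increase function to get the increase over one minute

increase(node_cpu_seconds_total{mode='idle'}[1m])

Use the sum function to add up the values of all cores

sum(increase(node_cpu_seconds_total{mode='idle'}[1m]))

This raises another problem: sum adds everything together. It combines not only all the CPUs of one machine but also the CPUs of every machine, so the result is a single value for the whole cluster. Grouping with by(instance) solves this.

sum(increase(node_cpu_seconds_total{mode='idle'}[1m])) by(instance)

This gives the idle CPU time per host. Plugging it into the second formula above gives the CPU usage of each host over one minute.

(1 - ((sum(increase(node_cpu_seconds_total{mode='idle'}[1m])) by(instance)) / (sum(increase(node_cpu_seconds_total[1m])) by(instance)))) * 100

[Figure: calculating CPU usage 2]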
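The arithmetic behind the final expression can be checked by hand. The sketch below takes the idle and total CPU-second increases for one host over the same window and applies (1 - idle / total) * 100; the function name cpu_usage_percent and the sample numbers are made up for illustration.

```shell
#!/bin/bash
# Compute CPU usage (%) from the idle and total CPU-second increases
# observed over the same window: (1 - idle / total) * 100.
cpu_usage_percent() {
    awk -v idle="$1" -v total="$2" 'BEGIN { printf "%.2f\n", (1 - idle / total) * 100 }'
}

# Hypothetical 4-core host over 1 minute: 240 CPU-seconds in total,
# of which 180 were spent idle.
cpu_usage_percent 180 240    # prints 25.00
```

Note that the total is 240 rather than 60 because each of the four cores contributes 60 CPU-seconds per minute, which is why the per-core series have to be summed first.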

3.2 Calculating memory usage

Memory usage = (1 - available / total) * 100

(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

[Figure: calculating memory usage]
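The node_memory_* metrics come from /proc/meminfo, so the same figure can be cross-checked on the host itself. This sketch treats usage as the share of memory that is not available, (1 - MemAvailable / MemTotal) * 100; the function name mem_usage_percent and the sample values are made up for illustration.

```shell
#!/bin/bash
# Compute memory usage (%) from /proc/meminfo-style text on stdin:
# (1 - MemAvailable / MemTotal) * 100.
mem_usage_percent() {
    awk '/^MemTotal:/ { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", (1 - avail / total) * 100 }'
}

# Hypothetical values in kB.
printf 'MemTotal: 8000000 kB\nMemFree: 500000 kB\nMemAvailable: 2000000 kB\n' | mem_usage_percent
```

On a Linux host it can be run as: mem_usage_percent < /proc/meminfo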

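3.3 The rate function

The rate function is designed for counter metrics. Over the time window you specify, it returns the average per-second increase of the counter.

For example, suppose we want the receive rate of the ens33 network interface over one minute. If 1000 bytes were received in that minute, the average is 1000 bytes / 60 s ≈ 16.7 bytes/s.

rate(node_network_receive_bytes_total{device='ens33'}[1m])

With a five-minute window and 5000 bytes received, the result is 5000 bytes / (5 * 60 s) ≈ 16.7 bytes/s, the same value. The graphs differ, though: the first figure below uses a one-minute window and the second a five-minute window. Averaging over a longer window smooths out short spikes, so the five-minute curve is flatter.

[Figure: rate function 1]

[Figure: rate function 2]

rate and increase are similar, but rate returns the average per-second increase over the window, while increase returns the total increase over the window:

rate(1m): total increase / 60s
increase(1m): total increase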
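The relationship between the two can be worked through with two samples of a counter. This is a simplified sketch: the function names are made up, and the real Prometheus functions also extrapolate to the window boundaries and handle counter resets, which is ignored here.

```shell
#!/bin/bash
# Given two samples of a counter, print the total increase between them.
counter_increase() {
    awk -v first="$1" -v last="$2" 'BEGIN { print last - first }'
}

# Given two samples taken window seconds apart, print the average
# per-second increase.
counter_rate() {
    awk -v first="$1" -v last="$2" -v window="$3" 'BEGIN { printf "%.2f\n", (last - first) / window }'
}

counter_increase 5000 6000      # prints 1000
counter_rate 5000 6000 60       # prints 16.67
counter_rate 5000 10000 300     # prints 16.67
```

The last two calls mirror the one-minute and five-minute examples: the per-second value is the same even though the windows differ.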

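3.4 The sum function

The sum function adds up all the series it receives.

Suppose a cluster has 20 servers: 5 web servers, 10 db servers, and 5 servers for other services. Grouping sum by server role then gives three curves, each showing the total for one kind of server.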
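Grouped summation is easy to picture outside Prometheus. The sketch below assumes each host carries a role tag (a hypothetical label; the post does not say how the servers are labelled) and adds up the values per role, which is what a grouped sum such as sum(...) by (role) would do.

```shell
#!/bin/bash
# Sum values per group.
# Input lines: "<role> <value>"; output lines: "<role> <sum>", sorted by role.
sum_by_role() {
    awk '{ total[$1] += $2 } END { for (r in total) print r, total[r] }' | LC_ALL=C sort
}

# Hypothetical per-host values tagged with a role.
printf 'web 10\ndb 30\nweb 5\nother 2\ndb 20\n' | sum_by_role
```

Five input series collapse into three output lines, one per role, matching the three curves described above.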

3.5 The topk function

The topk function returns the series with the k highest values.

[Figure: topk function 1]
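As a rough analogy, topk behaves like sorting by value and keeping the first k rows. The sketch below does this for a few made-up hosts; the function name top_k and the sample values are illustrative only.

```shell
#!/bin/bash
# Keep the k lines with the highest value.
# Input lines: "<instance> <value>"; argument: k.
top_k() {
    LC_ALL=C sort -k2,2nr | head -n "$1"
}

# Hypothetical CPU usage per host.
printf 'host-a 12.5\nhost-b 80.1\nhost-c 43.0\nhost-d 67.9\n' | top_k 2
```

With k set to 2 this keeps host-b and host-d, the two busiest hosts in the sample.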

3.6 The count function

The count function counts the number of series that match a condition.

For example, to see how many hosts in the cluster have CPU usage above 80%:

count((1 - ((sum(increase(node_cpu_seconds_total{mode='idle'}[1m])) by(instance)) / (sum(increase(node_cpu_seconds_total[1m])) by(instance)))) * 100 > 80)
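The query first filters the per-host series with "> 80" and then counts what is left. A minimal sketch of the same two steps, with a made-up function name count_above and made-up sample values:

```shell
#!/bin/bash
# Count the input lines whose value is strictly above a threshold.
# Input lines: "<instance> <value>"; argument: threshold.
count_above() {
    awk -v t="$1" '$2 > t { n++ } END { print n + 0 }'
}

# Hypothetical CPU usage per host.
printf 'host-a 91.2\nhost-b 35.0\nhost-c 80.0\nhost-d 85.5\n' | count_above 80    # prints 2
```

One difference from PromQL: when no series passes the filter, count returns an empty result rather than 0, whereas this sketch prints 0.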

4. Push gateway

With Push gateway, the monitored side actively pushes its data to the gateway, and Prometheus then scrapes the gateway. With an exporter, by contrast, Prometheus pulls the data from the exporter directly.

4.1 Installing Push gateway

Download and install Push gateway

wget https://github.com/prometheus/pushgateway/releases/download/v1.4.0/pushgateway-1.4.0.linux-amd64.tar.gz
tar -xf pushgateway-1.4.0.linux-amd64.tar.gz
mv pushgateway-1.4.0.linux-amd64 pushgateway

Run Push gateway in the background

screen /root/pushgateway/pushgateway

Add the following to prometheus.yml

- job_name: 'pushgateway'
  static_configs:
  - targets: ['localhost:9091']

[Figure: adding pushgateway]

4.2 Writing a custom script

Push gateway does not collect any data itself, so you need to write your own scripts to collect it.

For example, a script that counts connections in the CONNECTED state:

Write the custom script

#!/bin/bash

# Get the name of the monitored host
instance_name=$(hostname -f | cut -d'.' -f1)

# Exit if the hostname is localhost
if [ "$instance_name" = "localhost" ]
then
        echo "Cannot monitor a host whose hostname is localhost"
        exit 1
fi

#---
# Count connections in the CONNECTED state

# Define a new key and count the matching netstat lines
label_tcp_connected="count_netstat_connected_connections"
count_netstat_connected_connections=$(netstat -an | grep -c 'CONNECTED')

# Push to pushgateway
echo "$label_tcp_connected $count_netstat_connected_connections" | curl --data-binary @- http://localhost:9091/metrics/job/pushgateway/instance/$instance_name
#---

# The script sends the key to pushgateway with an HTTP POST
# http://localhost:9091 is the pushgateway host to push to
# job/pushgateway sets the job label of the pushed metrics to pushgateway
# instance/$instance_name sets the instance label, shown as the host name after the push
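The push URL follows the pattern /metrics/job/<job>/instance/<instance>, and metrics pushed under a job/instance pair stay in the gateway until they are deleted. The sketch below wraps the push in small helpers and adds a delete call for cleaning up a group; the function names, the PUSHGATEWAY variable, and the host name web01 are made up for illustration.

```shell
#!/bin/bash
# Base address of the gateway (variable name is an assumption of this sketch).
PUSHGATEWAY="${PUSHGATEWAY:-http://localhost:9091}"

# Build the URL for a job/instance group: push_url <job> <instance>
push_url() {
    printf '%s/metrics/job/%s/instance/%s\n' "$PUSHGATEWAY" "$1" "$2"
}

# Push one sample: push_metric <job> <instance> <metric_name> <value>
push_metric() {
    printf '%s %s\n' "$3" "$4" | curl -s --data-binary @- "$(push_url "$1" "$2")"
}

# Remove every metric pushed under that job/instance group.
delete_group() {
    curl -s -X DELETE "$(push_url "$1" "$2")"
}

push_url pushgateway web01
```

Deleting the group is worth doing when a host is retired, since otherwise its last pushed values keep being scraped.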

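The script exits after a single run, so use crontab to run it repeatedly

crontab -e
# Run the script every minute
* * * * * sh /root/pushgateway/node_exporter_shell.sh
# cron cannot schedule below one minute; to run every 10s, also add entries that sleep first
* * * * * sleep 10; sh /root/pushgateway/node_exporter_shell.sh
* * * * * sleep 20; sh /root/pushgateway/node_exporter_shell.sh
* * * * * sleep 30; sh /root/pushgateway/node_exporter_shell.sh
* * * * * sleep 40; sh /root/pushgateway/node_exporter_shell.sh
* * * * * sleep 50; sh /root/pushgateway/node_exporter_shell.sh

[Figure: custom script]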

4.3 Collecting ping packet loss and latency

Add the following to node_exporter_shell.sh

#---
# Measure packet loss and latency when pinging a site

site_address="www.baidu.com"

# Run ping once and keep the output (the parsing assumes Linux iputils ping)
ping_output=$(ping -c3 "$site_address")

# Packet loss in percent, defined as a new key
label_ping_packet_loss="ping_packet_loss"
ping_packet_loss=$(echo "$ping_output" | grep -o '[0-9.]*% packet loss' | cut -d'%' -f1)

# Average round-trip time in milliseconds, defined as a new key
label_ping_time="ping_time"
ping_time=$(echo "$ping_output" | awk -F'/' '/^rtt/{print $5}')

# Push to pushgateway
echo "$label_ping_packet_loss $ping_packet_loss" | curl --data-binary @- http://localhost:9091/metrics/job/pushgateway/instance/$instance_name

echo "$label_ping_time $ping_time" | curl --data-binary @- http://localhost:9091/metrics/job/pushgateway/instance/$instance_name
#---
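Picking fields out of ping output by line number is fragile, because the number of reply lines changes when packets are lost. A sketch of parsing the summary by pattern instead is shown below. It assumes the summary format printed by Linux iputils ping; the function names and the captured sample are made up so the parsing can be tried offline.

```shell
#!/bin/bash
# Extract the packet loss percentage from ping output on stdin.
parse_ping_loss() {
    grep -o '[0-9.]*% packet loss' | cut -d'%' -f1
}

# Extract the average round-trip time (ms) from the "rtt min/avg/max/mdev" line.
parse_ping_avg_rtt() {
    awk -F'/' '/^rtt/ { print $5 }'
}

# Hypothetical captured output.
sample='--- www.baidu.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 10.112/10.534/11.020/0.372 ms'

printf '%s\n' "$sample" | parse_ping_loss       # prints 0
printf '%s\n' "$sample" | parse_ping_avg_rtt    # prints 10.534
```

The "time 2003ms" value on the summary line is the total elapsed time of the run, not the latency, which is why the average is read from the rtt line.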

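5. Using Grafana

Grafana is an open-source data visualization tool written in Go. It can be used for monitoring and statistics, and it includes alerting.

5.1 Installing Grafana

wget https://dl.grafana.com/oss/release/grafana-7.5.1-1.x86_64.rpm
yum -y install grafana-7.5.1-1.x86_64.rpm
systemctl start grafana-server
systemctl enable grafana-server

5.2 Setting up the data source

Grafana -> Configuration -> Data Sources -> Prometheus

[Figure: Grafana installation 1]

New dashboard

[Figure: Grafana installation 2]

Add a dashboard that monitors CPU and memory usage

[Figure: Grafana installation 3]

[Figure: Grafana installation 4]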

5.3 JSON backup and restore

Backup: dashboard -> Settings -> JSON Model, then save the contents as a JSON file

[Figure: JSON backup]

Restore: Create -> import

[Figure: JSON restore]

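5.4 Alerting with Grafana

Configure Grafana

# Install dependencies and the image rendering plugin
yum -y install libatk-bridge* libXss* libgtk*
grafana-cli plugins install grafana-image-renderer
# Edit the [smtp] section of the configuration
vim /etc/grafana/grafana.ini
enabled = true
host = smtp.163.com:25
# Mailbox that sends the alert emails
user = <your-account>@163.com
# Authorization code
password = <authorization-code>
skip_verify = true
from_address = <your-account>@163.com
from_name = Grafana
systemctl restart grafana-server

Create the alert rule

[Figure: Grafana alerting 1]

[Figure: Grafana alerting 2]

For a specific panel, set the threshold for sending email and so on. Here the alert fires once the value has stayed above the threshold for 5 minutes.

[Figure: Grafana alerting 3]

[Figure: Grafana alerting 4]

[Figure: Grafana alerting 5]

[Figure: Grafana alerting 6]

[Figure: Grafana alerting 7]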

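6. Prometheus + Grafana in Practice

6.1 Disk monitoring with the predict_linear function

Disk usage = ((total capacity - free capacity) / total capacity) * 100, which in Prometheus is written as

((node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes) * 100

Comparing with the output of df -m shows that the calculated value is correct

[Figure: calculating disk usage 1]

Prometheus provides a predict_linear function that can estimate how long it will be until a disk fills up. For example, if the free space has dropped sharply during the last hour, the disk may soon be full. predict_linear uses the last hour of data to predict the state a few hours ahead, so an alert can fire in advance.

# Use the last 1 hour of data to predict the free space on the root filesystem 4 hours from now; alert if the prediction is below 0
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600) < 0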
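predict_linear is based on simple linear regression: it fits a straight line through the samples in the window and extends that line into the future. The sketch below shows the idea with a least-squares fit in awk. It is a simplified illustration, not the Prometheus implementation: it predicts relative to the last sample rather than the evaluation time, and the function name and sample data are made up.

```shell
#!/bin/bash
# Fit a least-squares line through "<seconds> <value>" samples on stdin and
# print the value predicted ahead seconds after the last sample.
predict_linear_sketch() {
    awk -v ahead="$1" '
        { n++; sx += $1; sy += $2; sxx += $1 * $1; sxy += $1 * $2; last = $1 }
        END {
            slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
            intercept = (sy - slope * sx) / n
            printf "%.0f\n", intercept + slope * (last + ahead)
        }'
}

# Hypothetical free space in MiB, sampled every 20 minutes and shrinking
# by 100 MiB per sample (300 MiB per hour).
printf '0 1000\n1200 900\n2400 800\n3600 700\n' | predict_linear_sketch 14400    # prints -500
```

Losing 300 MiB per hour from 700 MiB gives 700 - 1200 = -500 MiB after four hours. The prediction is below zero, which is the condition the alert expression checks for.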

Add dashboards in Grafana for disk usage and predicted disk usage

[Figure: calculating disk usage 2]

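6.2 Monitoring disk IO

The formula is (bytes read + bytes written) / 1024 / 1024. The rate function gives the per-second growth of bytes read and written over one minute, so the result is in MiB per second. In Prometheus:

((rate(node_disk_read_bytes_total[1m]) + rate(node_disk_written_bytes_total[1m])) / 1024 / 1024) > 0

[Figure: monitoring IO status]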

6.3 Monitoring the number of connections in TCP WAIT states

Write the monitoring script on the monitored host

#!/bin/bash

# Get the name of the monitored host
instance_name=$(hostname -f | cut -d'.' -f1)

# Exit if the hostname is localhost
if [ "$instance_name" = "localhost" ]
then
        echo "Cannot monitor a host whose hostname is localhost"
        exit 1
fi

#---
# Count connections in a WAIT state

# Define a new key and count the matching netstat lines
label_tcp_wait="count_netstat_wait_connections"
count_netstat_wait_connections=$(netstat -an | grep -c 'WAIT')

# Push to pushgateway
echo "$label_tcp_wait $count_netstat_wait_connections" | curl --data-binary @- http://localhost:9091/metrics/job/pushgateway/instance/$instance_name
#---
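Matching on 'WAIT' lumps together every state whose name contains it, such as TIME_WAIT and CLOSE_WAIT. If the breakdown matters, the states can be counted separately, as in the sketch below. It assumes the six-column layout that netstat -an prints for TCP sockets on Linux; the function name and the sample lines are made up for illustration.

```shell
#!/bin/bash
# Count TCP sockets per WAIT state from netstat-style lines on stdin.
# Expected columns: proto, recv-q, send-q, local address, foreign address, state.
wait_states() {
    awk '$1 ~ /^tcp/ && $6 ~ /WAIT/ { count[$6]++ }
         END { for (s in count) print s, count[s] }' | LC_ALL=C sort
}

# Hypothetical netstat output.
sample='tcp 0 0 10.0.0.5:22 10.0.0.9:51000 ESTABLISHED
tcp 0 0 10.0.0.5:80 10.0.0.7:40112 TIME_WAIT
tcp 0 0 10.0.0.5:80 10.0.0.8:40230 TIME_WAIT
tcp 0 0 10.0.0.5:443 10.0.0.8:40231 CLOSE_WAIT'

printf '%s\n' "$sample" | wait_states
```

Each output line could then be pushed as its own key, for example one for TIME_WAIT and one for CLOSE_WAIT.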

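6.4 Monitoring file descriptor usage

On Linux, whenever a process opens a file, the system assigns it a unique integer file descriptor that identifies the file. Every process starts with three open file descriptors: standard input, standard output, and standard error (stdin, stdout, stderr), numbered 0, 1, and 2.

The following command shows the maximum number of file descriptors a process started from the current shell may open. The default is usually 1024. Note that the node_filefd metrics used below are system-wide totals rather than this per-process limit.

ulimit -n

File descriptor usage = (allocated file descriptors / maximum file descriptors) * 100, which in Prometheus is written as

(node_filefd_allocated / node_filefd_maximum) * 100

[Figure: monitoring file descriptor usage]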
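On Linux the system-wide file descriptor counts are available in /proc/sys/fs/file-nr, whose three fields are the allocated count, the unused count, and the maximum. The sketch below applies the same usage formula to that layout, which gives a way to cross-check the dashboard on the host itself; the function name and the sample numbers are made up for illustration.

```shell
#!/bin/bash
# Compute file descriptor usage (%) from file-nr style input on stdin:
# "<allocated> <unused> <maximum>".
fd_usage_percent() {
    awk '{ printf "%.2f\n", $1 / $3 * 100 }'
}

# Hypothetical values: 1600 allocated out of a maximum of 64000.
printf '1600 0 64000\n' | fd_usage_percent    # prints 2.50
```

On a Linux host it can be run as: fd_usage_percent < /proc/sys/fs/file-nr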

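6.5 Monitoring network latency and packet loss

So far we have tested with a plain ping to an IP address. The ICMP packets sent this way are very small and are only suitable for checking connectivity, so the following command is a better fit:

ping -q <ip-address> -s 500 -W 1000 -c 100
-q: quiet output; only the summary lines at the start and end are shown.
-s: the size of the packets to send.
-W: how long to wait for a reply.
-c: the number of requests to send before stopping.

[Figure: monitoring network latency and packet loss]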

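6.6 Alerting with PagerDuty

PagerDuty is a paid alerting and incident response service, often used by SREs and operations staff, and it integrates directly with common monitoring tools.

Create a new service

[Figure: PagerDuty alerting 1] [Figure: PagerDuty alerting 2]

In Grafana, create a new notification channel and set the dashboard to alert through PagerDuty

[Figure: PagerDuty alerting 3]

Configure the alert message

[Figure: PagerDuty alerting 4]

Check whether the alert was received

[Figure: PagerDuty alerting 5]

Once the problem is fixed, click Resolved

[Figure: PagerDuty alerting 6]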

Prometheus
https://cqmmm.github.io/2024/02/18/prometheus/
Author: Warner
Posted on: 2024-02-18
Updated on: 2024-02-18
Tags: #prometheus
