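Advanced Prometheus tutorial

This tutorial takes beginners and intermediate users through Prometheus from deployment to advanced features: installation, configuration, Exporters, PromQL queries, alerting rules, Grafana integration, and best practices. It assumes you already know the core concepts (time series, metric types, and so on). If you do not, read the official documentation or an introductory tutorial first.

Prometheus overview

Prometheus is an open-source monitoring and alerting system designed for cloud-native environments and widely used to monitor containerized applications, microservices, and infrastructure. It collects time series data by scraping HTTP endpoints, and it provides the PromQL query language and a flexible alerting mechanism. Its main strengths are:

Time series storage: stores and queries time series data efficiently.
Multi-dimensional data model: labels allow flexible filtering and aggregation.
Dynamic service discovery: finds monitoring targets automatically in environments such as Kubernetes and Consul.
Rich ecosystem: many Exporters, plus integration with tools such as Grafana.

This tutorial starts from scratch and works through these features step by step.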

Installing and deploying Prometheus

Installing on Linux

Download Prometheus
Download a release from the Prometheus website. The commands below use v2.47.0 as an example:

wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
tar xvfz prometheus-2.47.0.linux-amd64.tar.gz
cd prometheus-2.47.0.linux-amd64

Create the configuration file
Create a prometheus.yml file with a basic configuration:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Run Prometheus
Start it with the following command:

./prometheus --config.file=prometheus.yml

Verify the installation
Open http://localhost:9090 in a browser and check that the Prometheus web UI loads.

Set up a system service (optional)
To start Prometheus at boot, create a systemd unit file:

sudo nano /etc/systemd/system/prometheus.service

Add the following content:

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus/
Restart=always

[Install]
WantedBy=multi-user.target

Create the user, install the files, and set ownership:

sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp prometheus /usr/local/bin/
sudo cp prometheus.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus

Docker deployment

Pull the Prometheus image

docker pull prom/prometheus:v2.47.0

Create the configuration file
Save prometheus.yml in a local directory (for example /path/to/prometheus.yml).

Run the container

docker run -d -p 9090:9090 -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus:v2.47.0

Verify
Open http://localhost:9090.

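Kubernetes deployment

Install with Helm
Helm is a common way to deploy Prometheus on Kubernetes. With Helm installed, run:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack

Access Prometheus
By default, Prometheus is exposed through a Kubernetes Service. Use kubectl port-forward to reach it (the Service name below assumes the release name prometheus used above):

kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090

Customize the configuration
Edit the Helm values file (values.yaml) to adjust scrape targets, storage, and other settings.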

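Configuring Prometheus

Configuration file reference

The Prometheus configuration file prometheus.yml has the following core sections:

global: global settings such as the scrape interval and the rule evaluation interval. The external_labels entries are attached to series and alerts when Prometheus talks to external systems such as federation, remote storage, and Alertmanager.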

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    environment: production

scrape_configs: defines the targets to scrape.

scrape_configs:
  - job_name: 'myapp'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics
    scheme: http

rule_files: lists the files that contain alerting and recording rules.

rule_files:
  - 'rules/alert.rules.yml'

alerting: sets the Alertmanager addresses.

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

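Service discovery mechanisms

Prometheus supports several service discovery mechanisms for dynamic environments:

Static configuration
Lists target addresses directly; suitable for small or static environments: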

static_configs:
  - targets: ['host1:9100', 'host2:9100']

DNS service discovery
Discovers targets through DNS queries:

dns_sd_configs:
  - names: ['example.com']
    type: A
    port: 9100

Kubernetes service discovery
Discovers pods, services, and other objects in Kubernetes automatically:

kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: ['default']

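Consul, Zookeeper, and others
Targets are updated dynamically through third-party service discovery tools.

Collecting metrics with Exporters

An Exporter exposes metrics from a system that has no native Prometheus support, converting them into the Prometheus format.

Common Exporters

Node Exporter: monitors server CPU, memory, disk, and so on.
MySQL Exporter: monitors MySQL database performance.
Blackbox Exporter: probes endpoints over HTTP, TCP, and other protocols.
JMX Exporter: exposes JMX metrics from Java applications.

Deploying Node Exporter

Download Node Exporter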

wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64

Run Node Exporter

./node_exporter

Configure Prometheus to scrape it
Add the following to prometheus.yml:

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

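Verify
Open http://localhost:9100/metrics to see the exposed metrics.

Custom Exporters

For your own applications, use a Prometheus client library (available for Python, Go, Java, and other languages) to expose metrics.

Example: a custom Exporter in Python

Install the Python client library: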

pip install prometheus-client

Write the Python script:

from prometheus_client import start_http_server, Counter, Gauge
import time

# Define the metrics
request_count = Counter('app_requests_total', 'Total number of requests', ['endpoint'])
request_latency = Gauge('app_request_latency_seconds', 'Request latency', ['endpoint'])

def process_request(endpoint, latency):
    # Count the request and record its most recent latency
    request_count.labels(endpoint=endpoint).inc()
    request_latency.labels(endpoint=endpoint).set(latency)

if __name__ == '__main__':
    start_http_server(8000)  # Serve /metrics on port 8000
    while True:
        process_request('/api', 0.123)  # Simulate one request per second
        time.sleep(1)

Configure Prometheus to scrape it:

scrape_configs:
  - job_name: 'python_app'
    static_configs:
      - targets: ['localhost:8000']

Open http://localhost:8000/metrics to see the custom metrics.
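To make the output of the /metrics endpoint less abstract, here is a minimal sketch of the text exposition format that client libraries produce. It uses only the Python standard library. The names render_metric and escape_label_value are hypothetical helpers for illustration, not part of prometheus_client, and the sketch skips details such as escaping in HELP text and the extra series a real client library may add.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family as text.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative values only.
    print(
        render_metric(
            "app_requests_total",
            "Total number of requests",
            "counter",
            [({"endpoint": "/api"}, 42.0)],
        ),
        end="",
    )
```

Each sample line has the shape name{label="value"} value, preceded by HELP and TYPE comment lines for the metric family. This is the structure Prometheus parses on every scrape.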

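Advanced PromQL queries

PromQL is the Prometheus query language, used to select and analyze time series data.

Basic syntax

Instant vector selector: returns the most recent sample of each matching series.

http_requests_total

Range vector selector: returns the samples of each matching series over a time window.

http_requests_total[5m]

Rate: computes the average per-second increase of a counter over the window.

rate(http_requests_total[5m])

Label matching: filters series by label.

http_requests_total{job="myapp", method="GET"}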

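Common functions

rate(): average per-second rate of increase of a counter over the range.
sum(): sums values across series.
avg(): averages values across series.
topk(): returns the k series with the largest values.
increase(): total increase of a counter over the range.
irate(): per-second rate computed from the last two samples in the range; suited to graphing fast-moving counters.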
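The difference between rate(), increase(), and irate() is easier to see on concrete numbers. The following is a simplified sketch, not the Prometheus implementation: it handles counter resets but omits the extrapolation to the window boundaries that Prometheus performs. The function names and sample values are made up for illustration.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples; a drop counts as a reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the delta.
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples):
    """Average per-second increase over the time covered by the samples."""
    duration = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / duration


def simple_irate(samples):
    """Per-second increase computed from the last two samples only."""
    return simple_rate(samples[-2:])


if __name__ == "__main__":
    # One sample every 15 seconds; the counter resets between t=30 and t=45.
    samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]
    print(counter_increase(samples))        # 100.0
    print(round(simple_rate(samples), 3))   # 1.667
    print(simple_irate(samples))            # 2.0
```

rate() smooths over the whole window, while irate() reacts only to the most recent pair of samples, which is why irate() suits graphs of volatile counters and rate() suits alerting.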

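Complex query examples

Total request rate per job:

sum(rate(http_requests_total[5m])) by (job)

The five endpoints with the highest average latency (assuming http_request_duration_seconds is a histogram or summary, which exposes _sum and _count series):

topk(5, sum by (endpoint) (rate(http_request_duration_seconds_sum[5m])) / sum by (endpoint) (rate(http_request_duration_seconds_count[5m])))

Detecting anomalies, here a share of HTTP 500 responses above 5%:

sum(rate(http_requests_total{status="500"}[5m])) / sum(rate(http_requests_total[5m])) > 0.05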
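When one instant vector is divided by another, PromQL pairs series whose label sets are identical, which is why error ratios are normally computed on aggregated values. The sketch below imitates this one-to-one matching with plain dictionaries. The helper names and sample values are hypothetical, and real PromQL matching offers more options (on, ignoring, group_left) than shown here.

```python
def series(**labels):
    """Represent a label set as a hashable key."""
    return frozenset(labels.items())


def divide(left, right):
    """One-to-one vector division: only series with identical label sets are paired."""
    return {key: value / right[key] for key, value in left.items() if key in right}


def sum_all(vector):
    """Aggregate away every label, like sum() without a by clause."""
    return {series(): sum(vector.values())}


# Hypothetical per-second request rates, keyed by label set.
requests = {
    series(job="myapp", status="200"): 95.0,
    series(job="myapp", status="500"): 5.0,
}
errors = {key: value for key, value in requests.items() if ("status", "500") in key}

# Dividing the filtered vector by the unfiltered one pairs the status="500"
# series with itself, so the result is always 1.
naive = divide(errors, requests)

# Summing both sides first removes the status label and yields the real ratio.
ratio = divide(sum_all(errors), sum_all(requests))

if __name__ == "__main__":
    print(list(naive.values()))  # [1.0]
    print(ratio[series()])       # 0.05
```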

Configuring alerting and Alertmanager

Alerting rules

Create the rule file alert.rules.yml:

groups:
- name: example
  rules:
  - alert: HighErrorRate
    expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate on {{ $labels.instance }}"
      description: "{{ $labels.instance }} has a high error rate ({{ $value }})."

Reference it in prometheus.yml:

rule_files:
  - 'alert.rules.yml'
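The for clause in a rule delays the alert: the expression must keep returning a result for the whole duration before the alert moves from pending to firing. Below is a simplified model of that state machine, assuming evaluations happen at a fixed interval. The function name and inputs are illustrative, and other details of rule evaluation are left out.

```python
def alert_states(condition_results, for_evaluations):
    """Return the alert state after each rule evaluation.

    condition_results: one boolean per evaluation, True when the expression
        returned a result.
    for_evaluations: the `for` duration divided by the evaluation interval.
    """
    states = []
    active_for = None  # evaluation intervals elapsed since the condition became true
    for is_true in condition_results:
        if not is_true:
            # Any evaluation without a result resets the alert completely.
            active_for = None
            states.append("inactive")
            continue
        active_for = 0 if active_for is None else active_for + 1
        states.append("firing" if active_for >= for_evaluations else "pending")
    return states


if __name__ == "__main__":
    # Assumed setup: `for: 5m` with a 1m evaluation interval, so 5 intervals.
    results = [True] * 3 + [False] + [True] * 7
    for step, state in enumerate(alert_states(results, 5)):
        print(step, state)
```

In this run the first three evaluations are pending, the fourth resets the alert, and the alert fires only after the condition has then held for five further intervals. A single evaluation without a result is enough to restart the wait.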

Alertmanager configuration

Install Alertmanager:

wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvfz alertmanager-0.26.0.linux-amd64.tar.gz
cd alertmanager-0.26.0.linux-amd64

Create the Alertmanager configuration file alertmanager.yml:

global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'

route:
  receiver: 'email'
  group_by: ['alertname', 'instance']

receivers:
  - name: 'email'
    email_configs:
      - to: 'admin@example.com'

Run Alertmanager:

./alertmanager --config.file=alertmanager.yml

Configure Prometheus by adding the following to prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
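The group_by setting in the Alertmanager route decides which alerts are batched into one notification: alerts that share the same values for the listed labels fall into the same group. Here is a rough sketch of that bucketing, with hypothetical alert label sets and a helper name invented for this example. A missing label is treated as an empty value here, which is a simplifying assumption.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


if __name__ == "__main__":
    alerts = [
        {"alertname": "HighErrorRate", "instance": "a:8080", "severity": "critical"},
        {"alertname": "HighErrorRate", "instance": "a:8080", "severity": "warning"},
        {"alertname": "HighErrorRate", "instance": "b:8080", "severity": "critical"},
    ]
    for key, members in group_alerts(alerts, ["alertname", "instance"]).items():
        print(key, len(members))
```

With group_by set to alertname and instance, the two alerts for a:8080 end up in one notification and the alert for b:8080 in another. Grouping by alertname alone would merge all three.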

Notification channel integrations

Alertmanager supports several notification channels, including:

Email: configure an SMTP server.

Slack:

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts'

PagerDuty, Webhook, and others.

Integrating with Grafana

Installing Grafana

Download and install:

wget https://dl.grafana.com/oss/release/grafana_9.5.2_amd64.deb
sudo dpkg -i grafana_9.5.2_amd64.deb
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

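Access Grafana: open http://localhost:3000. The default username and password are admin/admin.

Configuring the Prometheus data source

Log in to Grafana and click "Configuration" > "Data Sources" > "Add data source".
Select "Prometheus" and enter the URL (for example http://localhost:9090).
Save and test the connection.

Creating a dashboard

Click "Create" > "Dashboard" > "Add new panel".
Enter a PromQL query, for example:

rate(http_requests_total[5m])

Choose the visualization type (for example a line chart), the title, and the time range.
Save the dashboard.

To get started quickly, import a community dashboard, such as the Node Exporter dashboard with ID 1860.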

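Advanced Prometheus features

Remote storage

Prometheus local storage suits small and medium deployments. For large deployments, a remote storage backend (such as Thanos or VictoriaMetrics) is recommended.

Configure remote write: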

remote_write:
  - url: 'http://remote-storage:8080/api/v1/write'

Configure remote read:

remote_read:
  - url: 'http://remote-storage:8080/api/v1/read'

Federation

In multi-cluster environments, federation aggregates data across Prometheus servers:

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]': ['{job=~".+"}']
    static_configs:
      - targets: ['prometheus2:9090']

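Federation combines data from multiple Prometheus servers. It suits large-scale monitoring such as cross-datacenter, multi-cloud, and hierarchical collection setups: a central Prometheus pulls selected data from the servers in each location so that it can be queried and analyzed in one place.

I. How federation works

A Prometheus server is not a distributed system: each instance runs standalone and scrapes only the targets it is configured with. At large scale, the capacity and performance of a single server become a bottleneck, or you may have several datacenters or clusters that must be operated independently. With federation, a central Prometheus pulls part of the data from each of these servers, forming a hierarchy.

Lower-level Prometheus: scrapes the targets in its own region (for example, one server per datacenter).
Upper-level Prometheus (federation server): periodically pulls selected data, such as aggregated metrics, from the lower-level servers and stores only what is needed for global views and alerts.

II. Main types of federation

1. Classic federation

The upper-level Prometheus uses scrape_configs to scrape the /federate HTTP endpoint of the lower-level servers.

The match[] parameter selects which series to pull, for example:

http://<lower-level-prometheus>:9090/federate?match[]={job="node"}

This suits cross-datacenter monitoring, hierarchical monitoring, and cases where only some aggregated data is of interest.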
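Because match[] can be repeated and its values contain characters such as braces and quotes, it helps to build the URL programmatically when testing a /federate endpoint by hand. Here is a small sketch using the standard library. federate_url is a hypothetical helper, and prometheus2:9090 is the example target from the configuration above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per series selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    url = federate_url(
        "http://prometheus2:9090",
        ['{job="node"}', '{__name__=~"job:.*"}'],
    )
    print(url)
    # Decoding the query string recovers the original selectors.
    print(parse_qs(urlsplit(url).query)["match[]"])
```

The printed URL can be passed to any HTTP client. The response is in the same text format as a regular /metrics endpoint, restricted to the selected series.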

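2. Remote read/write

Uses the Prometheus remote read/write protocol to send data to remote storage or to another Prometheus.
This suits large-scale long-term storage and historical analysis, but it is not federation in the classic sense.

III. Advantages and disadvantages of federation

Advantages

Layered and easy to scale: each region or cluster scrapes independently, and the central server aggregates only the data it needs, which reduces its load.
Limited data exposure: only selected metrics are pulled to the central server, so the full data set does not have to be exposed.
Multi-tenancy and partitioning: each team or business unit runs its own Prometheus, and the central server only aggregates.

Disadvantages

Limited query granularity: the central Prometheus can query only the data it has pulled, not the full detail held by the lower-level servers.
Configuration complexity: aggregation, selection, and label handling must be planned carefully.
Latency: federated data lags behind the source, so federation is a poor fit when all detailed metrics must be analyzed in real time.

IV. Use cases

Multi-datacenter and cross-region monitoring
Per-department or per-business-unit monitoring in large organizations
Hierarchical aggregation in cloud-native monitoring systems
Unified dashboards and alerts that need only some aggregated data

V. Configuration example

Suppose the lower-level Prometheus servers scrape all node_exporter data, and the upper-level server pulls the node job plus the aggregated series whose names start with job: (the usual prefix for job-level recording rules).

scrape_configs on the upper-level Prometheus: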

scrape_configs:
  - job_name: 'federate'
    metrics_path: '/federate'
    honor_labels: true
    static_configs:
      - targets:
        - 'lower-prometheus-1:9090'
        - 'lower-prometheus-2:9090'
    params:
      'match[]':
        - '{job="node"}'
        - '{__name__=~"job:.*"}'
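The selector {__name__=~"job:.*"} relies on PromQL regular expressions being fully anchored: the pattern must match the entire metric name, not just part of it. The sketch below mimics that with Python's re.fullmatch. Prometheus uses RE2 syntax rather than Python's, so treat this as an approximation that holds for simple patterns. The metric names are examples only.

```python
import re


def select_metric_names(names, pattern):
    """Return the names that the pattern matches in full, as PromQL's =~ does."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


if __name__ == "__main__":
    names = [
        "job:http_requests:rate5m",   # recording rule output, selected
        "node_cpu_seconds_total",     # raw metric, not selected
        "batchjob:queue_length:sum",  # contains "job:" but does not start with it
    ]
    print(select_metric_names(names, "job:.*"))  # ['job:http_requests:rate5m']
```

Because of the anchoring, the third name is not selected even though it contains job: in the middle. An unanchored search would have pulled it in.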

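Summary:
Federation brings data collected by separate Prometheus servers together on a central server and suits large-scale, hierarchical, cross-region monitoring. The central server pulls selected series through the /federate endpoint, which allows flexible layered data management and querying.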

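High availability

For high availability, run two or more identically configured Prometheus instances, and use external storage and an Alertmanager cluster.

Best practices

Tune the scrape interval: adjust scrape_interval to the load; scraping too often causes performance problems.
Keep labels under control: avoid high-cardinality labels (such as user IDs) to reduce storage pressure.
Use recording rules: precompute expensive PromQL expressions as recording rules to speed up queries.
Back up data: back up time series data regularly to guard against data loss.
Monitor Prometheus itself: use the prometheus_* metrics to watch Prometheus performance.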
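To see why high-cardinality labels and short scrape intervals matter, a back-of-the-envelope estimate helps. The sketch assumes the worst case in which every combination of label values occurs. The label counts are invented for illustration.

```python
from math import prod


def max_series(label_value_counts):
    """Upper bound on the series of one metric: the product of distinct values per label."""
    return prod(label_value_counts.values())


def samples_per_day(series_count, scrape_interval_seconds):
    """Samples ingested per day when every series is scraped once per interval."""
    return series_count * (86400 // scrape_interval_seconds)


if __name__ == "__main__":
    bounded = {"endpoint": 20, "method": 4, "status": 5}
    unbounded = {**bounded, "user_id": 10_000}
    print(max_series(bounded))                       # 400
    print(max_series(unbounded))                     # 4000000
    print(samples_per_day(max_series(bounded), 15))  # 2304000
```

Adding a single user_id label multiplies the possible series count by the number of users, while doubling scrape_interval only halves the sample volume. This is why label design usually matters more than interval tuning.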

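Troubleshooting and optimization

Scrape failures: check that the target's /metrics endpoint is reachable and confirm network connectivity.
High memory usage: reduce high-cardinality metrics and consider remote storage.
Slow queries: narrow label selectors, shorten time ranges, and precompute expensive expressions with recording rules. Swapping increase() for rate() does not help, because increase() is computed the same way and only scales the result by the range.
Lost alerts: check the Alertmanager logs and make sure the notification channels are configured correctly.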

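Learning resources

Official documentation: https://prometheus.io/docs/introduction/overview/
PromQL tutorial: https://prometheus.io/docs/prometheus/latest/querying/basics/
Grafana documentation: https://grafana.com/docs/
Thanos documentation: https://thanos.io/
Prometheus community: https://prometheus.io/community/

After working through this tutorial, you should be able to deploy Prometheus from scratch, configure Exporters, write complex PromQL queries, set up alerting, and integrate with Grafana. Continued practice and the community resources above will take you further.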
