Prometheus for Managers – PromQL

Welcome to the third part of our series on Prometheus for Managers. In the previous two parts, we explored the fundamentals of Prometheus, its role in system monitoring, and the various types of metrics it captures. Today, we embark on a deeper dive into PromQL, the powerful query language that unlocks the insights hidden within Prometheus data. This query language allows users to perform ad-hoc queries, filtering, aggregations, and analysis on the collected metrics.

In this article we will cover:

⛳ The main concepts relevant to effectively use PromQL
🧐 Deep dives on vectors, selectors & matchers, and functions

Let’s start this new journey! 😉

To use PromQL effectively, it helps to understand the concepts it is built on. The table of contents for this article:

  • 🔖 Time Series Data
  • 🔖 Metrics, Labels, and Vectors
    • 🔵 Labels
    • 🔵 Vectors
    • 🔵 Scalars
    • 🔵 Selectors & Matchers
  • 🔖 Functions
    • 🌟 Data Type Transformation Functions
    • 🌟 Aggregation Functions
    • 🌟 Combining Functions for Complex Queries
  • 📚 Learn more

🔖 Time Series Data: Time series data represents a sequence of measurements collected over a period of time. In Prometheus, each measurement is associated with a timestamp, allowing for temporal analysis. Sample → CPU utilisation for a server, sampled every 5 seconds (timestamp in Unix seconds, CPU utilisation as a fraction between 0 and 1)

1641061200,0.1234
1641061205,0.2345
1641061210,0.4567
1641061215,0.6789
1641061220,0.8901
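
To make the structure concrete, here is a minimal Python sketch that reads the sample above into a list of (timestamp, value) pairs. The names parse_series and RAW are illustrative only and are not part of Prometheus.

```python
from typing import List, Tuple

# A sample is a (unix_timestamp_seconds, value) pair;
# a time series is an ordered list of such samples.
Sample = Tuple[int, float]

RAW = """1641061200,0.1234
1641061205,0.2345
1641061210,0.4567
1641061215,0.6789
1641061220,0.8901"""


def parse_series(text: str) -> List[Sample]:
    """Turn 'timestamp,value' lines into a list of samples."""
    samples = []
    for line in text.strip().splitlines():
        timestamp, value = line.split(",")
        samples.append((int(timestamp), float(value)))
    return samples


series = parse_series(RAW)
# Gap in seconds between each pair of consecutive samples.
intervals = [later[0] - earlier[0] for earlier, later in zip(series, series[1:])]
```

Here intervals shows how far apart the measurements are, which is the first thing to check before choosing a time window for a query.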

🔖 Metrics, Labels, and Vectors: Metrics are the fundamental building blocks of time series data. They represent a specific measure, such as CPU usage, memory consumption, or request latency. Metrics are identified by their names and labels, which provide additional context and categorisation. Vectors, collections of metric values, represent the actual data points associated with a metric.

🔖 Functions for manipulating and analysing time-series data: PromQL offers a range of aggregation and transformation functions that enable users to analyse and manipulate time series data. These functions include calculations like sum, average, and minimum, as well as more advanced operations like quantiles and exponential smoothing.

🔖 Metrics, Labels, and Vectors

A deep dive about metrics was provided in the first article of this series. Now, let’s learn more about labels, vectors and scalars.

  1. Labels: Labels are metadata associated with metrics, providing additional context and categorisation. They are key-value pairs that help distinguish between similar metrics and facilitate filtering and grouping operations.
  2. Vectors: Vectors are collections of data points, each representing a measurement for a particular metric and timestamp. They form the backbone of time-series data in Prometheus.
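  3. Scalars: Scalars are simple numeric floating-point values with no labels attached, such as 0.5 or 100. They typically appear as constants or thresholds in expressions. The value of an individual sample is also a floating-point number, but it belongs to a labelled time series.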

🔵 Labels

Samples of labels:

CPU utilization for a server over a 5-minute interval:

{
  "metric": "node_cpu_seconds_total",
  "instance": "my-server",
  "cpu": "0",
  "job": "my-app",
  "namespace": "my-namespace"
}

Memory usage for a process over a 1-hour interval:

{
  "metric": "process_resident_memory_bytes",
  "pid": "1234",
  "app": "my-web-app",
  "namespace": "my-namespace"
}

HTTP request latency for an API over a 10-minute interval:

{
  "metric": "http_request_duration_seconds",
  "status": "200",
  "endpoint": "/api/v1/users",
  "job": "my-api-service",
  "namespace": "my-namespace"
}

In each of these examples, the metric entry identifies the metric being measured (internally, Prometheus stores the metric name in a reserved label called __name__). The other labels provide additional context and categorization for the metric, and their values are always strings. For example, the label instance identifies the specific server on which the metric was collected, the label cpu identifies the specific CPU core on which the metric was collected, the label job identifies the application or service that generated the metric, and the label namespace identifies the deployment environment where the metric was collected.

Labels are important for several reasons. First, they help to distinguish between time series that share a metric name. For example, node_cpu_seconds_total with cpu="0" and node_cpu_seconds_total with cpu="1" measure the same thing on two different cores. Because their labels differ, Prometheus stores them as two separate time series and can query or aggregate them separately.
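
The idea that a metric name plus its labels identifies one time series can be sketched in a few lines of Python. This is a simplified model, not Prometheus code; series_key is a hypothetical helper, while __name__ is the reserved label in which Prometheus keeps the metric name.

```python
from typing import Dict, FrozenSet, Tuple


def series_key(name: str, labels: Dict[str, str]) -> FrozenSet[Tuple[str, str]]:
    """Identity of a series: the metric name plus every label, in any order."""
    return frozenset({"__name__": name, **labels}.items())


core0 = series_key("node_cpu_seconds_total", {"instance": "my-server", "cpu": "0"})
core1 = series_key("node_cpu_seconds_total", {"instance": "my-server", "cpu": "1"})
# Same labels written in a different order: still the same series.
core0_again = series_key("node_cpu_seconds_total", {"cpu": "0", "instance": "my-server"})
```

Changing a single label value produces a different series, which is why labels with many possible values (user IDs, for instance) multiply the number of series that have to be stored.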

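Second, labels can be used to filter and group metrics. For example, assuming http_request_duration_seconds is exposed as a histogram, the query sum(rate(http_request_duration_seconds_sum{status=~"2.."}[1h])) / sum(rate(http_request_duration_seconds_count{status=~"2.."}[1h])) retrieves the average request duration for HTTP requests with status codes between 200 and 299 over the past hour. The matcher on the label status restricts the query to those requests. Because regular expressions in PromQL are fully anchored, 2.. matches exactly three characters starting with 2.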

Finally, labels can be used to identify trends and anomalies. For example, the query histogram_quantile(0.9, sum by (le) (rate(http_request_duration_seconds_bucket{endpoint="/api/v1/users"}[15m]))) retrieves the 90th percentile of HTTP request duration for the API endpoint /api/v1/users over the past 15 minutes. This query can be used to identify any sudden increases in request duration that may be indicative of a problem.

🔵 Vectors

PromQL distinguishes two types of vectors:

Type 1️⃣ – Instant Vector – A set of time series, each containing a single sample, all sharing the same timestamp.

In other words, an instant vector query in Prometheus retrieves the most recent value of each matching series at a particular point in time. This approach is valuable for obtaining real-time insights into current metric values or for conducting calculations that require a specific timestamp.

Sample:

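Retrieve the CPU time counters for core 0 of a server called “my-server” as they stood at a specific timestamp. The timestamp is not a label; it is supplied with the @ modifier (or as the evaluation time of the query). Since this metric is a cumulative counter, it needs to be wrapped in rate() to show how busy the core was.

node_cpu_seconds_total{instance="my-server", cpu="0"} @ 1641180400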
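
How an instant query picks one sample per series can be sketched as follows. This is a simplified Python model, not Prometheus code: instant_value and the sample data are hypothetical, boundary handling is simplified, and the 5-minute lookback reflects the Prometheus default.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, float]  # (unix_timestamp_seconds, value)

LOOKBACK_SECONDS = 300  # Prometheus looks back 5 minutes by default


def instant_value(
    samples: List[Sample], eval_time: int, lookback: int = LOOKBACK_SECONDS
) -> Optional[Sample]:
    """Return the newest sample at or before eval_time, or None if it is too old."""
    candidates = [s for s in samples if eval_time - lookback < s[0] <= eval_time]
    return max(candidates, key=lambda s: s[0]) if candidates else None


# Hypothetical counter samples for one CPU core, scraped every 15 seconds.
cpu0 = [
    (1641180370, 1200.5),
    (1641180385, 1201.1),
    (1641180400, 1201.9),
    (1641180415, 1202.4),
]

exact = instant_value(cpu0, 1641180400)
between_scrapes = instant_value(cpu0, 1641180410)
too_late = instant_value(cpu0, 1641181000)
```

Asking for a time between two scrapes returns the previous sample, and asking long after the last scrape returns nothing, which is how a series "disappears" from a dashboard when its target stops reporting.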

Type 2️⃣ – Range Vector – A set of time series, each containing all the samples recorded over a defined duration. Range vectors facilitate comprehensive time-based analysis and aggregation, enabling in-depth exploration of temporal trends and patterns.

Range vectors, written with square brackets in PromQL (for example [5m]), serve as the foundation for sophisticated time series analysis. In conjunction with aggregation and transformation functions, range vectors empower users to extract meaningful insights from temporal data.

Samples:

Retrieving the per-second rate of CPU time for a specific server over the past 5 minutes

rate(node_cpu_seconds_total{instance="my-server"}[5m])

Selecting the raw samples of the HTTP request counter for status codes between 200 and 399 over the past hour

http_requests_total{status=~"[23].."}[1h]

Identifying the peak memory usage for each process by process ID over the past 24 hours

max by (pid) (max_over_time(process_resident_memory_bytes[24h]))
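
A range selector returns every sample in the window, and a function such as max_over_time then reduces that window to one value per series. The sketch below models this in Python with hypothetical helper names and made-up data. The exact treatment of the window boundaries differs between Prometheus versions, so treat the boundary rule here as an assumption.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, float]  # (timestamp_seconds, value)


def range_samples(samples: List[Sample], eval_time: int, range_seconds: int) -> List[Sample]:
    """All samples inside the window that ends at eval_time."""
    return [s for s in samples if eval_time - range_seconds < s[0] <= eval_time]


def max_over_time(samples: List[Sample], eval_time: int, range_seconds: int) -> Optional[float]:
    """Largest value seen inside the window, or None if the window is empty."""
    window = range_samples(samples, eval_time, range_seconds)
    return max(value for _, value in window) if window else None


# Hypothetical memory readings for one process, one per minute.
memory = [(0, 100.0), (60, 180.0), (120, 150.0), (180, 120.0)]

last_two_minutes = range_samples(memory, 180, 120)
peak_short = max_over_time(memory, 180, 120)
peak_long = max_over_time(memory, 180, 240)
```

The longer window catches the earlier spike that the shorter one misses, which is the practical trade-off when picking a range duration.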

🔵 Scalars

Some samples of scalars:

CPU utilization for a server over a 5-minute interval:

  • 0.25
  • 0.45
  • 0.5
  • 0.65
  • 0.8

Memory usage for a process over a 1-hour interval:

  • 123456
  • 145678
  • 167890
  • 189012
  • 210123

HTTP request latency for an API over a 10-minute interval:

  • 0.112
  • 0.193
  • 0.274
  • 0.355
  • 0.436

In each of these examples, each value is a single measurement of a metric at a specific timestamp. The unit is typically indicated by the metric name. For example, the CPU utilisation values are fractions of one (0.25 means 25%), the memory usage values are in bytes (B), and the HTTP request latency values are in seconds (s).

🔵 Selectors & Matchers

PromQL offers a powerful set of features for selecting and filtering metrics. Selectors and matchers play a crucial role in this process, enabling users to precisely identify the metrics they want to analyze and exclude irrelevant data.

Selectors:

Selectors are used to specify which metric to query, such as node_cpu_seconds_total or http_request_duration_seconds. They can also narrow the result to the time series that carry specific label values. For instance, the selector http_request_duration_seconds{status="200"} selects only the time series for HTTP requests with a status code of 200.

Matchers:

Matchers are the conditions inside the curly braces of a selector. They filter time series based on label values, enabling users to narrow down the analysis to relevant data. For example, the matcher status=~"2.." selects only HTTP requests with status codes between 200 and 299. Regular expressions in PromQL are fully anchored, so the pattern must match the whole label value.

Matcher Types

  • =: Select labels that are exactly equal to the provided string. process_cpu_seconds_total{job="kube-state-metrics"}
  • !=: Select labels that are not equal to the provided string. process_cpu_seconds_total{job!="kube-state-metrics"}
  • =~: Select labels that regex-match the provided string. prometheus_http_requests_total{handler=~"/api/v1.*"}
  • !~: Select labels that do not regex-match the provided string. prometheus_http_requests_total{handler!~"/api/v1.*"}
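
The four matcher types can be modelled in Python as below. This is a sketch rather than Prometheus code: matches is a hypothetical helper, Python's re module stands in for the RE2 syntax Prometheus uses, and re.fullmatch mimics the fact that PromQL regular expressions must match the whole label value.

```python
import re
from typing import Dict


def matches(labels: Dict[str, str], name: str, op: str, pattern: str) -> bool:
    """Evaluate one label matcher against the labels of one series."""
    value = labels.get(name, "")  # a missing label behaves like an empty value
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        return re.fullmatch(pattern, value) is not None
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown matcher operator: {op}")


series = {"handler": "/api/v1/query", "code": "200"}

prefix_match = matches(series, "handler", "=~", "/api/v1.*")
three_digits = matches(series, "code", "=~", "2..")
partial_only = matches(series, "code", "=~", "2")
```

Note that partial_only is False: the pattern 2 matches only the first character, and a partial match is not enough.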

Combining Selectors and Matchers:

Selectors and matchers can be combined to create complex expressions that precisely identify the metrics to analyze. Several matchers inside the braces are separated by commas, and a series must satisfy all of them. For instance, the expression sum(http_request_duration_seconds{status="200", instance="my-server"}) adds up the current values of every matching series, that is, those with a status code of 200 that were collected from the instance named “my-server”.

Examples of Selectors and Matchers:

  • sum(increase(node_cpu_seconds_total{mode="idle"}[1h])) Calculates the total amount of CPU time spent idle over the past hour.
  • sum by (endpoint) (rate(http_request_duration_seconds_sum{status="200"}[15m])) / sum by (endpoint) (rate(http_request_duration_seconds_count{status="200"}[15m])) Calculates the average request duration for each API endpoint over the past 15 minutes, assuming the metric is a histogram.
  • histogram_quantile(0.9, sum by (le) (rate(http_request_duration_seconds_bucket{status="500"}[5m]))) Calculates the 90th percentile of HTTP request duration for 500 status codes over the past 5 minutes.

🔖 Functions for manipulating and analysing time-series data

PromQL offers a wide range of functions for manipulating and analyzing time-series data.

🌟 Data Type Transformation Functions:

PromQL provides functions and operators for transforming sample values, labels, and vectors. They are used to convert data between units, reshape labels, and calculate summary statistics.

Scalar Transformations

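  • round(): Rounds each sample value to the nearest integer, or to the nearest multiple of an optional second argument (for example, 0.01 to keep two decimal places).
  • Seconds to milliseconds: PromQL has no dedicated conversion function; multiply the expression by 1000.
  • Milliseconds to seconds: Likewise, divide the expression by 1000.
  • Human-readable byte sizes (e.g., 1 KB, 1 MB, 1 GB): PromQL has no function for this; formatting is handled by the visualisation layer, such as the unit setting of a dashboard panel.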

Label Transformations

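  • label_join(): Joins the values of several labels with a separator and writes the result to a destination label.
  • label_replace(): Matches a regular expression against the value of a source label and, if it matches, writes a replacement (which may reference capture groups) to a destination label.
  • Dropping a label: PromQL has no label_drop() function; aggregate with the without clause instead, for example sum without (pid) (process_resident_memory_bytes).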
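
The behaviour of label_join and label_replace can be approximated in Python as below. This is a rough model under stated assumptions: both helpers work on a plain dictionary of labels rather than a vector, Python's re module stands in for RE2, and only the $1-style capture group references are translated.

```python
import re
from typing import Dict


def label_join(labels: Dict[str, str], dst: str, separator: str, *src: str) -> Dict[str, str]:
    """Write the joined values of the source labels into the destination label."""
    joined = separator.join(labels.get(name, "") for name in src)
    return {**labels, dst: joined}


def label_replace(
    labels: Dict[str, str], dst: str, replacement: str, src: str, regex: str
) -> Dict[str, str]:
    """If regex matches the whole source value, set dst to the expanded replacement."""
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)  # no match: the labels are returned unchanged
    # Translate $1, $2, ... into the \g<1> form that Python expects.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return {**labels, dst: match.expand(template)}


labels = {"instance": "my-server:9100", "job": "my-app"}

with_host = label_replace(labels, "host", "$1", "instance", "(.*):.*")
with_id = label_join(labels, "id", "/", "job", "instance")
unchanged = label_replace(labels, "host", "$1", "instance", "no-match")
```

A typical use is the one shown for with_host: stripping the port from instance so that series from different exporters on the same machine can be compared.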

Vector Transformations

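  • sum(): Calculates the sum of all values in a vector.
  • avg(): Calculates the average of all values in a vector.
  • min(): Retrieves the minimum value in a vector.
  • max(): Retrieves the maximum value in a vector.
  • by: Not a function but a clause added to the operators above. It groups a vector by one or more labels and calculates the statistic for each group, as in sum by (endpoint) (...). The without clause does the opposite and groups by every label except the ones listed.
  • Filtering: PromQL has no filter() function. Series are filtered with label matchers in the selector, and values are filtered with comparison operators, as in process_resident_memory_bytes > 1e9.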
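
Grouping with by can be pictured as putting series into buckets keyed by the chosen labels and then summing each bucket. The Python sketch below illustrates this with a hypothetical sum_by helper and made-up values; it is a model of the idea, not of the Prometheus implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Series = Tuple[Dict[str, str], float]  # (labels, current value)


def sum_by(vector: List[Series], by: List[str]) -> Dict[Tuple[str, ...], float]:
    """Sum the values of all series that share the same values for the 'by' labels."""
    groups: Dict[Tuple[str, ...], float] = defaultdict(float)
    for labels, value in vector:
        key = tuple(labels.get(name, "") for name in by)
        groups[key] += value
    return dict(groups)


# Hypothetical request counts per endpoint and instance.
vector = [
    ({"endpoint": "/api/v1/users", "instance": "a"}, 3.0),
    ({"endpoint": "/api/v1/users", "instance": "b"}, 5.0),
    ({"endpoint": "/api/v1/orders", "instance": "a"}, 2.0),
]

per_endpoint = sum_by(vector, ["endpoint"])
overall = sum_by(vector, [])
```

Grouping by no labels at all collapses everything into a single total, which is what a plain sum(...) does.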

🌟 Aggregation Functions:

Aggregation functions are essential for summarizing and analyzing time-series data. They operate on vectors of metric values, extracting key metrics and statistics from the data.

Summarisation

  • sum(): Calculates the sum of all values in a vector.
  • avg(): Calculates the average of all values in a vector.
  • min(): Retrieves the minimum value in a vector.
  • max(): Retrieves the maximum value in a vector.
  • delta(): Calculates the difference between the first and last value of a gauge over a given time window.
  • rate(): Calculates the per-second rate at which a counter increases, averaged over a given time window.
  • irate(): Also calculates the per-second rate at which a counter increases, but only looks at the last two data points in the window. This makes irate well suited for graphing volatile and/or fast-moving counters.
  • increase(): Calculates how much a counter has increased over a given time window.
  • resets(): Returns the number of counter resets over a given time window.
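
The counter functions above can be approximated in Python as follows. This is a deliberately simplified sketch: the real functions also extrapolate to the edges of the time window, which is omitted here, and the sample data is made up. What it does show is the key idea that a drop in a counter is treated as a restart from zero.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp_seconds, counter value)


def increase(samples: List[Sample]) -> float:
    """Total growth of the counter across the samples, allowing for resets."""
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole new value counts.
        total += (current - previous) if current >= previous else current
    return total


def resets(samples: List[Sample]) -> int:
    """Number of times the counter value went down."""
    return sum(1 for (_, p), (_, c) in zip(samples, samples[1:]) if c < p)


def rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    duration = samples[-1][0] - samples[0][0]
    return increase(samples) / duration if duration > 0 else 0.0


def irate(samples: List[Sample]) -> float:
    """Per-second increase using only the last two samples."""
    return rate(samples[-2:])


# Hypothetical counter scraped every 15 seconds; the process restarts before t=45.
counter = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]

total_increase = increase(counter)
reset_count = resets(counter)
average_rate = rate(counter)
instant_rate = irate(counter)
```

The gap between average_rate and instant_rate shows why rate gives smoother graphs and alerts, while irate reacts faster to short spikes.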

Quantiles

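  • histogram_quantile(): Calculates a specified quantile (for example, 0.9 for the 90th percentile) from the buckets of a histogram.
  • quantile(): An aggregation operator that calculates a specified quantile (a number between 0 and 1) across the values of a vector.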
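
The way a quantile is estimated from histogram buckets can be sketched in Python as below. It assumes a classic histogram with cumulative bucket counts and linear interpolation inside the bucket that contains the target rank; edge cases and newer histogram types are left out, and the bucket data is invented for illustration.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile by interpolating inside the matching bucket."""
    if not 0.0 < q < 1.0:
        raise ValueError("this sketch only handles 0 < q < 1")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if len(ordered) < 2 or not math.isinf(ordered[-1][0]) or total <= 0:
        raise ValueError("need a non-empty histogram with a +Inf bucket")
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(ordered) if count >= rank)
    upper, cumulative = ordered[index]
    if math.isinf(upper):
        # The answer lies beyond the last finite bound; report that bound.
        return ordered[-2][0]
    lower = ordered[index - 1][0] if index > 0 else 0.0
    below = ordered[index - 1][1] if index > 0 else 0.0
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Hypothetical latency histogram: 100 requests, bounds in seconds.
latency = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]

p50 = histogram_quantile(0.5, latency)
p95 = histogram_quantile(0.95, latency)
p99 = histogram_quantile(0.99, latency)
```

Because the result is interpolated, its precision depends on how the bucket boundaries were chosen. Here p99 can only be reported as the last finite bound, since the two slowest requests fall in the open-ended bucket.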

Distributions

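  • Histograms: PromQL has no histogram() function. The application records the distribution as a histogram metric (the _bucket, _sum, and _count series), and histogram_quantile() summarises it at query time.
  • Spread of values: There is no entropy() function either. The spread of values in a vector can be described with stddev() and stdvar(), and count_values() counts how many series share each value.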

🌟 Combining Functions for Complex Queries:

PromQL allows users to combine multiple functions within a single query, building complex expressions that extract meaningful insights from time-series data. These combined expressions enable users to:

  • Filter and group metrics based on multiple conditions and labels
  • Analyze trends and patterns across different time periods and dimensions
  • Identify anomalies and potential performance issues

Here are some examples of how users can combine functions in PromQL:

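  • Calculate the average CPU usage for each CPU core across the past hour, as the fraction of time spent in any non-idle mode: sum by (cpu) (rate(node_cpu_seconds_total{instance="my-server", mode!="idle"}[1h]))
  • Identify the percentage of HTTP requests that took longer than 500 milliseconds over the past 10 minutes, assuming the histogram has a bucket boundary at 0.5 seconds: 100 * (1 - sum(rate(http_request_duration_seconds_bucket{status="200", le="0.5"}[10m])) / sum(rate(http_request_duration_seconds_count{status="200"}[10m])))
  • Calculate the 99th percentile of request latency for the API endpoint /api/v1/users over the past 15 minutes: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{endpoint="/api/v1/users"}[15m])))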
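
The arithmetic behind a "percentage of slow requests" figure can be checked with a small Python sketch. It assumes cumulative histogram buckets and a bucket boundary that sits exactly at the threshold; percent_slower_than and the bucket data are hypothetical.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound in seconds, cumulative request count)


def percent_slower_than(threshold: float, buckets: List[Bucket]) -> float:
    """Percentage of requests that took longer than the threshold."""
    counts = dict(buckets)
    if threshold not in counts:
        raise ValueError("the histogram needs a bucket boundary at the threshold")
    total = counts[math.inf]  # the +Inf bucket counts every request
    return 100.0 * (1.0 - counts[threshold] / total)


# Hypothetical 10-minute window: 100 requests, 90 of them within 0.5 seconds.
window = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]

slow_percent = percent_slower_than(0.5, window)
```

This is also why latency targets should be agreed before the histogram is configured: a threshold that does not line up with a bucket boundary can only be estimated.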

📚 Learn more
