title | description |
---|---|
Monitor data pipelines | Learn how to monitor your Fluent Bit data pipelines |
Fluent Bit includes features for monitoring the internals of your pipeline, along with integrations for Prometheus and Grafana, health checks, and connectors to external services:
- HTTP Server: JSON and Prometheus Exporter-style metrics
- Grafana Dashboards and Alerts
- Health Checks
- Telemetry Pipeline: hosted service to monitor and visualize your pipelines
Fluent Bit includes an HTTP server for querying internal information and monitoring metrics of each running plugin.
You can integrate the monitoring interface with Prometheus.
To get started, enable the HTTP server from the configuration file. The following configuration instructs Fluent Bit to start an HTTP server on TCP port 2020 and listen on all network interfaces:
[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020

[INPUT]
    Name cpu

[OUTPUT]
    Name  stdout
    Match *
Apply the configuration file:
bin/fluent-bit -c fluent-bit.conf
Fluent Bit starts and generates output in your terminal:
Fluent Bit v1.4.0
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2020/03/10 19:08:24] [ info] [engine] started
[2020/03/10 19:08:24] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
Use curl to gather information about the HTTP server. The following command pipes the curl output to the jq program, which prints human-readable JSON data to the terminal.
curl -s http://127.0.0.1:2020 | jq
{
"fluent-bit": {
"version": "0.13.0",
"edition": "Community",
"flags": [
"FLB_HAVE_TLS",
"FLB_HAVE_METRICS",
"FLB_HAVE_SQLDB",
"FLB_HAVE_TRACE",
"FLB_HAVE_HTTP_SERVER",
"FLB_HAVE_FLUSH_LIBCO",
"FLB_HAVE_SYSTEMD",
"FLB_HAVE_VALGRIND",
"FLB_HAVE_FORK",
"FLB_HAVE_PROXY_GO",
"FLB_HAVE_REGEX",
"FLB_HAVE_C_TLS",
"FLB_HAVE_SETJMP",
"FLB_HAVE_ACCEPT4",
"FLB_HAVE_INOTIFY"
]
}
}
Fluent Bit exposes the following endpoints for monitoring.
URI | Description | Data format |
---|---|---|
/ | Fluent Bit build information. | JSON |
/api/v1/uptime | Return uptime information in seconds. | JSON |
/api/v1/metrics | Display internal metrics per loaded plugin. | JSON |
/api/v1/metrics/prometheus | Display internal metrics per loaded plugin in Prometheus Server format. | Prometheus Text 0.0.4 |
/api/v1/storage | Get internal metrics of the storage layer and buffered data. This endpoint is enabled only if the storage.metrics property is enabled in the SERVICE section. | JSON |
/api/v1/health | Display the Fluent Bit health check result. | String |
/api/v2/metrics | Display internal metrics per loaded plugin. | cmetrics text format |
/api/v2/metrics/prometheus | Display internal metrics per loaded plugin in Prometheus Server format. | Prometheus Text 0.0.4 |
/api/v2/reload | Execute hot reloading or get the status of hot reloading. See the hot-reloading documentation. | JSON |
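The endpoints in the table can also be queried from a script. The following is a minimal Python sketch, assuming the HTTP server from the earlier configuration is reachable at 127.0.0.1:2020. The helper names `endpoint_url` and `fetch_json` are illustrative and aren't part of Fluent Bit.

```python
import json
import urllib.request

# Assumed address of the Fluent Bit HTTP server (matches the example configuration).
DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORT = 2020


def endpoint_url(path, host=DEFAULT_HOST, port=DEFAULT_PORT):
    """Build the full URL for a monitoring endpoint such as /api/v1/uptime."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


def fetch_json(path, host=DEFAULT_HOST, port=DEFAULT_PORT, timeout=5):
    """Fetch one of the JSON endpoints and return the decoded document.

    Only use this with endpoints whose data format is JSON.
    """
    url = endpoint_url(path, host, port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires a running Fluent Bit instance with HTTP_Server On):
#     fetch_json("/api/v1/uptime")
```

Endpoints that return text, such as the Prometheus ones, need to be read as plain text instead of being decoded as JSON.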
The following descriptions apply to v1 metric endpoints.
The following descriptions apply to metrics outputted in Prometheus format by the /api/v1/metrics/prometheus endpoint.
The following terms are key to understanding how Fluent Bit processes metrics:
- Record: a single message collected from a source, such as a single long line in a file.
- Chunk: log records ingested and stored by Fluent Bit input plugin instances. A batch of records in a chunk is tracked together as a single unit.

The Fluent Bit engine attempts to fit records into chunks of at most 2 MB, but the size can vary at runtime. Chunks are then sent to an output. An output plugin instance can successfully send the full chunk to the destination and mark it as successful, fail the chunk entirely if it encounters an unrecoverable error, or ask for the chunk to be retried.
Metric name | Labels | Description | Type | Unit |
---|---|---|---|---|
fluentbit_input_bytes_total | name: the name or alias for the input instance | The number of bytes of log records that this input instance has ingested successfully. | counter | bytes |
fluentbit_input_records_total | name: the name or alias for the input instance | The number of log records this input ingested successfully. | counter | records |
fluentbit_output_dropped_records_total | name: the name or alias for the output instance | The number of log records dropped by the output. These records hit an unrecoverable error or retries expired for their chunk. | counter | records |
fluentbit_output_errors_total | name: the name or alias for the output instance | The number of chunks with an error that's either unrecoverable or unable to retry. This metric represents the number of times a chunk failed, and doesn't correspond with the number of error messages visible in the Fluent Bit log output. | counter | chunks |
fluentbit_output_proc_bytes_total | name: the name or alias for the output instance | The number of bytes of log records that this output instance sent successfully. This metric represents the total byte size of all unique chunks sent by this output. If a record isn't sent because of an error, it doesn't count towards this metric. | counter | bytes |
fluentbit_output_proc_records_total | name: the name or alias for the output instance | The number of log records that this output instance sent successfully. This metric represents the total record count of all unique chunks sent by this output. If a record isn't sent successfully, it doesn't count towards this metric. | counter | records |
fluentbit_output_retried_records_total | name: the name or alias for the output instance | The number of log records that experienced a retry. This metric is calculated at the chunk level: the count increases when an entire chunk is marked for retry. An output plugin might perform multiple actions that generate many error messages when uploading a single chunk. | counter | records |
fluentbit_output_retries_failed_total | name: the name or alias for the output instance | The number of times that retries expired for a chunk. Each plugin configures a Retry_Limit, which applies to chunks. When the Retry_Limit is exceeded, the chunk is discarded and this metric is incremented. | counter | chunks |
fluentbit_output_retries_total | name: the name or alias for the output instance | The number of times this output instance requested a retry for a chunk. | counter | chunks |
fluentbit_uptime | | The number of seconds that Fluent Bit has been running. | counter | seconds |
process_start_time_seconds | | The Unix Epoch timestamp for when Fluent Bit started. | gauge | seconds |
The following descriptions apply to metrics outputted in JSON format by the /api/v1/storage endpoint.
Metric Key | Description | Unit |
---|---|---|
chunks.total_chunks | The total number of chunks of records that Fluent Bit is currently buffering. | chunks |
chunks.mem_chunks | The total number of chunks that are currently buffered in memory. Chunks can be both in memory and on the file system at the same time. | chunks |
chunks.fs_chunks | The total number of chunks saved to the file system. | chunks |
chunks.fs_chunks_up | The count of chunks that are both in the file system and in memory. | chunks |
chunks.fs_chunks_down | The count of chunks that are only in the file system. | chunks |
input_chunks.{plugin name}.status.overlimit | Indicates whether the input instance exceeded its configured Mem_Buf_Limit. | boolean |
input_chunks.{plugin name}.status.mem_size | The size of memory that this input is consuming to buffer logs in chunks. | bytes |
input_chunks.{plugin name}.status.mem_limit | The buffer memory limit (Mem_Buf_Limit) that applies to this input plugin. | bytes |
input_chunks.{plugin name}.chunks.total | The current total number of chunks owned by this input instance. | chunks |
input_chunks.{plugin name}.chunks.up | The current number of chunks that are in memory for this input. If file system storage is enabled, chunks that are "up" are also stored in the file system layer. | chunks |
input_chunks.{plugin name}.chunks.down | The current number of chunks that are "down" in the file system for this input. | chunks |
input_chunks.{plugin name}.chunks.busy | The number of chunks that are being processed or sent by outputs and aren't eligible to have new data appended. | chunks |
input_chunks.{plugin name}.chunks.busy_size | The sum of the byte size of each chunk that's currently marked as busy. | bytes |
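As an illustration of how these keys might be used, the following Python sketch lists the input instances whose `status.overlimit` value is true. It assumes the JSON document nests exactly as the dotted keys in the table suggest (`input_chunks` → plugin name → `status` → `overlimit`). The `sample` document is hand-written for this example and isn't real Fluent Bit output.

```python
def overlimit_inputs(storage):
    """Return the sorted names of inputs whose status.overlimit is truthy.

    `storage` is the decoded JSON document from /api/v1/storage. The nesting
    is an assumption derived from the dotted metric keys in the table.
    """
    inputs = storage.get("input_chunks", {})
    return sorted(
        name
        for name, info in inputs.items()
        if info.get("status", {}).get("overlimit")
    )


# Hypothetical document, written by hand for illustration only.
sample = {
    "input_chunks": {
        "cpu.0": {"status": {"overlimit": False}},
        "tail.1": {"status": {"overlimit": True}},
    }
}

print(overlimit_inputs(sample))
```

An input that stays over its limit for long periods is a candidate for a higher Mem_Buf_Limit or for file system buffering.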
The following descriptions apply to v2 metric endpoints.
The following descriptions apply to metrics outputted in Prometheus format by the /api/v2/metrics/prometheus or /api/v2/metrics endpoints.
The following terms are key to understanding how Fluent Bit processes metrics:
- Record: a single message collected from a source, such as a single long line in a file.
- Chunk: log records ingested and stored by Fluent Bit input plugin instances. A batch of records in a chunk is tracked together as a single unit.

The Fluent Bit engine attempts to fit records into chunks of at most 2 MB, but the size can vary at runtime. Chunks are then sent to an output. An output plugin instance can successfully send the full chunk to the destination and mark it as successful, fail the chunk entirely if it encounters an unrecoverable error, or ask for the chunk to be retried.
Metric Name | Labels | Description | Type | Unit |
---|---|---|---|---|
fluentbit_input_bytes_total | name: the name or alias for the input instance | The number of bytes of log records that this input instance has ingested successfully. | counter | bytes |
fluentbit_input_records_total | name: the name or alias for the input instance | The number of log records this input ingested successfully. | counter | records |
fluentbit_filter_bytes_total | name: the name or alias for the filter instance | The number of bytes of log records that this filter instance has ingested successfully. | counter | bytes |
fluentbit_filter_records_total | name: the name or alias for the filter instance | The number of log records this filter has ingested successfully. | counter | records |
fluentbit_filter_added_records_total | name: the name or alias for the filter instance | The number of log records added by the filter into the data pipeline. | counter | records |
fluentbit_filter_drop_records_total | name: the name or alias for the filter instance | The number of log records dropped by the filter and removed from the data pipeline. | counter | records |
fluentbit_output_dropped_records_total | name: the name or alias for the output instance | The number of log records dropped by the output. These records hit an unrecoverable error or retries expired for their chunk. | counter | records |
fluentbit_output_errors_total | name: the name or alias for the output instance | The number of chunks with an error that's either unrecoverable or unable to retry. This metric represents the number of times a chunk failed, and doesn't correspond with the number of error messages visible in the Fluent Bit log output. | counter | chunks |
fluentbit_output_proc_bytes_total | name: the name or alias for the output instance | The number of bytes of log records that this output instance sent successfully. This metric represents the total byte size of all unique chunks sent by this output. If a record isn't sent because of an error, it doesn't count towards this metric. | counter | bytes |
fluentbit_output_proc_records_total | name: the name or alias for the output instance | The number of log records that this output instance sent successfully. This metric represents the total record count of all unique chunks sent by this output. If a record isn't sent successfully, it doesn't count towards this metric. | counter | records |
fluentbit_output_retried_records_total | name: the name or alias for the output instance | The number of log records that experienced a retry. This metric is calculated at the chunk level: the count increases when an entire chunk is marked for retry. An output plugin might perform multiple actions that generate many error messages when uploading a single chunk. | counter | records |
fluentbit_output_retries_failed_total | name: the name or alias for the output instance | The number of times that retries expired for a chunk. Each plugin configures a Retry_Limit, which applies to chunks. When the Retry_Limit is exceeded, the chunk is discarded and this metric is incremented. | counter | chunks |
fluentbit_output_retries_total | name: the name or alias for the output instance | The number of times this output instance requested a retry for a chunk. | counter | chunks |
fluentbit_uptime | hostname: the hostname of the machine running Fluent Bit | The number of seconds that Fluent Bit has been running. | counter | seconds |
fluentbit_process_start_time_seconds | hostname: the hostname of the machine running Fluent Bit | The Unix Epoch timestamp for when Fluent Bit started. | gauge | seconds |
fluentbit_build_info | hostname: the hostname, version: the version of Fluent Bit, os: OS type | Build version information. The returned value originates from the Unix Epoch timestamp set when the configuration context is initialized. | gauge | seconds |
fluentbit_hot_reloaded_times | hostname: the hostname of the machine running Fluent Bit | The number of times Fluent Bit has been hot reloaded. | gauge | seconds |
The following are detailed descriptions for the metrics collected by the storage layer.
Metric Name | Labels | Description | Type | Unit |
---|---|---|---|---|
fluentbit_input_chunks.storage_chunks | None | The total number of chunks of records that Fluent Bit is currently buffering. | gauge | chunks |
fluentbit_storage_mem_chunk | None | The total number of chunks that are currently buffered in memory. Chunks can be both in memory and on the file system at the same time. | gauge | chunks |
fluentbit_storage_fs_chunks | None | The total number of chunks saved to the file system. | gauge | chunks |
fluentbit_storage_fs_chunks_up | None | The count of chunks that are both in the file system and in memory. | gauge | chunks |
fluentbit_storage_fs_chunks_down | None | The count of chunks that are only in the file system. | gauge | chunks |
fluentbit_storage_fs_chunks_busy | None | The total number of chunks that are in a busy state. | gauge | chunks |
fluentbit_storage_fs_chunks_busy_bytes | None | The total bytes of chunks that are in a busy state. | gauge | bytes |
fluentbit_input_storage_overlimit | name: the name or alias for the input instance | Indicates whether the input instance exceeded its configured Mem_Buf_Limit. | gauge | boolean |
fluentbit_input_storage_memory_bytes | name: the name or alias for the input instance | The size of memory that this input is consuming to buffer logs in chunks. | gauge | bytes |
fluentbit_input_storage_chunks | name: the name or alias for the input instance | The current total number of chunks owned by this input instance. | gauge | chunks |
fluentbit_input_storage_chunks_up | name: the name or alias for the input instance | The current number of chunks that are in memory for this input. If file system storage is enabled, chunks that are "up" are also stored in the file system layer. | gauge | chunks |
fluentbit_input_storage_chunks_down | name: the name or alias for the input instance | The current number of chunks that are "down" in the file system for this input. | gauge | chunks |
fluentbit_input_storage_chunks_busy | name: the name or alias for the input instance | The number of chunks that are being processed or sent by outputs and aren't eligible to have new data appended. | gauge | chunks |
fluentbit_input_storage_chunks_busy_bytes | name: the name or alias for the input instance | The sum of the byte size of each chunk that's currently marked as busy. | gauge | bytes |
fluentbit_output_upstream_total_connections | name: the name or alias for the output instance | The total connection count for each output plugin. | gauge | bytes |
fluentbit_output_upstream_busy_connections | name: the name or alias for the output instance | The count of connections in a busy state for each output plugin. | gauge | bytes |
Query the service uptime with the following command:
curl -s http://127.0.0.1:2020/api/v1/uptime | jq
The command prints output similar to the following:
{
"uptime_sec": 8950000,
"uptime_hr": "Fluent Bit has been running: 103 days, 14 hours, 6 minutes and 40 seconds"
}
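The relationship between the two fields is plain arithmetic. The following Python sketch, with the illustrative helper name `format_uptime`, breaks a value of `uptime_sec` into the days, hours, minutes, and seconds shown in `uptime_hr`. It isn't taken from the Fluent Bit source.

```python
def format_uptime(uptime_sec):
    """Split a number of seconds into days, hours, minutes, and seconds."""
    days, rem = divmod(uptime_sec, 86400)   # 86400 seconds per day
    hours, rem = divmod(rem, 3600)          # 3600 seconds per hour
    minutes, seconds = divmod(rem, 60)
    return f"{days} days, {hours} hours, {minutes} minutes and {seconds} seconds"


# 103 * 86400 + 14 * 3600 + 6 * 60 + 40 == 8950000
print(format_uptime(8950000))  # 103 days, 14 hours, 6 minutes and 40 seconds
```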
Query internal metrics in JSON format with the following command:
curl -s http://127.0.0.1:2020/api/v1/metrics | jq
The command prints output similar to the following:
{
"input": {
"cpu.0": {
"records": 8,
"bytes": 2536
}
},
"output": {
"stdout.0": {
"proc_records": 5,
"proc_bytes": 1585,
"errors": 0,
"retries": 0,
"retries_failed": 0
}
}
}
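A script can walk this structure to pick out the values it cares about. The following Python sketch uses the sample output above as its input. The helper names `flatten_metrics` and `outputs_with_failures` are illustrative, and the choice to flag outputs on `errors` or `retries_failed` is an assumption about what's worth alerting on.

```python
import json

# Sample document copied from the output shown above.
SAMPLE = """
{
  "input": {"cpu.0": {"records": 8, "bytes": 2536}},
  "output": {
    "stdout.0": {
      "proc_records": 5,
      "proc_bytes": 1585,
      "errors": 0,
      "retries": 0,
      "retries_failed": 0
    }
  }
}
"""


def flatten_metrics(doc):
    """Turn the nested document into (section, plugin, metric, value) tuples."""
    rows = []
    for section, plugins in doc.items():
        for plugin, metrics in plugins.items():
            for metric, value in metrics.items():
                rows.append((section, plugin, metric, value))
    return rows


def outputs_with_failures(doc):
    """Return the sorted names of outputs with errors or expired retries."""
    return sorted(
        name
        for name, metrics in doc.get("output", {}).items()
        if metrics.get("errors", 0) > 0 or metrics.get("retries_failed", 0) > 0
    )


doc = json.loads(SAMPLE)
for row in flatten_metrics(doc):
    print(row)
print(outputs_with_failures(doc))
```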
Query internal metrics in Prometheus Text 0.0.4 format:
curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus
This command returns the same metrics in Prometheus format instead of JSON:
fluentbit_input_records_total{name="cpu.0"} 57 1509150350542
fluentbit_input_bytes_total{name="cpu.0"} 18069 1509150350542
fluentbit_output_proc_records_total{name="stdout.0"} 54 1509150350542
fluentbit_output_proc_bytes_total{name="stdout.0"} 17118 1509150350542
fluentbit_output_errors_total{name="stdout.0"} 0 1509150350542
fluentbit_output_retries_total{name="stdout.0"} 0 1509150350542
fluentbit_output_retries_failed_total{name="stdout.0"} 0 1509150350542
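Each line follows the pattern `metric{labels} value timestamp`. The following Python sketch parses lines of that shape into dictionaries. It's a deliberately small parser for output like the sample above, not a complete implementation of the Prometheus text format: it skips comment lines and doesn't handle escaped quotes or closing braces inside label values.

```python
import re

# metric name, optional {labels}, value, optional integer timestamp
LINE_RE = re.compile(
    r'^(?P<metric>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<timestamp>-?\d+))?\s*$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_prometheus_text(text):
    """Parse simple Prometheus text lines into a list of sample dictionaries."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser can't read
        timestamp = match.group("timestamp")
        samples.append({
            "metric": match.group("metric"),
            "labels": dict(LABEL_RE.findall(match.group("labels") or "")),
            "value": float(match.group("value")),
            "timestamp_ms": int(timestamp) if timestamp else None,
        })
    return samples


sample = 'fluentbit_input_records_total{name="cpu.0"} 57 1509150350542'
print(parse_prometheus_text(sample))
```

For production use, a maintained Prometheus client library is a safer choice than a hand-written regular expression.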
By default, configured plugins get an internal name at runtime in the format _plugin_name.ID_. For monitoring purposes, this can be confusing if many plugins of the same type are configured. To tell them apart, each configured input or output section can set an alias, which is used as the parent name for the metric.
The following example sets an alias to the INPUT section of the configuration file, which is using the CPU input plugin:
[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020

[INPUT]
    Name  cpu
    Alias server1_cpu

[OUTPUT]
    Name  stdout
    Alias raw_output
    Match *
When querying the related metrics, the aliases are returned instead of the plugin name:
{
"input": {
"server1_cpu": {
"records": 8,
"bytes": 2536
}
},
"output": {
"raw_output": {
"proc_records": 5,
"proc_bytes": 1585,
"errors": 0,
"retries": 0,
"retries_failed": 0
}
}
}
You can create Grafana dashboards and alerts using Fluent Bit's exposed Prometheus-style metrics.
The provided example dashboard is heavily inspired by Banzai Cloud's logging operator dashboard with a few key differences, such as the use of the instance label, stacked graphs, and a focus on Fluent Bit metrics. See this blog post for more information.
Sample alerts are available here.
Fluent Bit supports four configuration options for setting up the health check.
Configuration name | Description | Default |
---|---|---|
Health_Check | Enables the health check feature. | Off |
HC_Errors_Count | The error count that marks Fluent Bit as unhealthy. This is a sum across all output plugins within a defined HC_Period. Example of an output error: [2022/02/16 10:44:10] [ warn] [engine] failed to flush chunk '1-1645008245.491540684.flb', retry in 7 seconds: task_id=0, input=forward.1 > output=cloudwatch_logs.3 (out_id=3) | 5 |
HC_Retry_Failure_Count | The retry failure count that marks Fluent Bit as unhealthy. This is a sum across all output plugins within a defined HC_Period. Example of a retry failure: [2022/02/16 20:11:36] [ warn] [engine] chunk '1-1645042288.260516436.flb' cannot be retried: task_id=0, input=tcp.3 > output=cloudwatch_logs.1 | 5 |
HC_Period | The time period, in seconds, over which errors and retry failures are counted. | 60 |
Not every error in the log counts toward the health check. Only specific errors and retry failures are counted, such as the examples in the preceding configuration table.
Within each HC_Period, if the number of errors exceeds HC_Errors_Count or the number of retry failures exceeds HC_Retry_Failure_Count, Fluent Bit is considered unhealthy. The health endpoint then returns HTTP status 500 and an error message. Otherwise, the endpoint returns HTTP status 200 and an ok message.
The equation to calculate this behavior is:
health status = (HC_Errors_Count > HC_Errors_Count config value) OR
(HC_Retry_Failure_Count > HC_Retry_Failure_Count config value) IN
the HC_Period interval
HC_Errors_Count and HC_Retry_Failure_Count apply only to output plugins. Each one is compared against the sum of errors or retry failures from all running output plugins.
The following configuration file example shows how to define these settings:
[SERVICE]
    HTTP_Server            On
    HTTP_Listen            0.0.0.0
    HTTP_PORT              2020
    Health_Check           On
    HC_Errors_Count        5
    HC_Retry_Failure_Count 5
    HC_Period              5

[INPUT]
    Name cpu

[OUTPUT]
    Name  stdout
    Match *
Use the following command to call the health endpoint:
curl -s http://127.0.0.1:2020/api/v1/health
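The same check can be scripted, for example as part of a custom probe. The following Python sketch maps the HTTP status of the health endpoint to a boolean, assuming the server address from the example configuration. The function name `check_health` and its `opener` parameter (used to substitute the HTTP call in tests) are illustrative.

```python
import urllib.error
import urllib.request

# Assumed address, taken from the example configuration.
HEALTH_URL = "http://127.0.0.1:2020/api/v1/health"


def check_health(url=HEALTH_URL, opener=urllib.request.urlopen, timeout=5):
    """Return True for HTTP 200 and False for an HTTP error status such as 500.

    Connection problems (urllib.error.URLError) aren't caught here, so the
    caller can distinguish "unhealthy" from "unreachable".
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        # urllib raises HTTPError for 4xx and 5xx responses, which includes
        # the 500 status returned when Fluent Bit is unhealthy.
        return False


# Example (requires a running Fluent Bit instance with Health_Check On):
#     print("healthy" if check_health() else "unhealthy")
```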
With the example config, the health status is determined by the following equation:
Health status = (HC_Errors_Count > 5) OR (HC_Retry_Failure_Count > 5) IN 5 seconds
- If this equation evaluates to TRUE, then Fluent Bit is unhealthy.
- If this equation evaluates to FALSE, then Fluent Bit is healthy.
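The equation can be written as a small Python function to make the strict "greater than" comparison explicit. This sketch only illustrates the rule as stated in this section. The function name `is_unhealthy` is hypothetical, and its inputs stand for the error and retry failure totals observed across all output plugins within one HC_Period.

```python
def is_unhealthy(errors, retry_failures, hc_errors_count=5, hc_retry_failure_count=5):
    """Apply the health equation for one HC_Period.

    `errors` and `retry_failures` are the sums across all output plugins
    within the period. The defaults match the example configuration.
    """
    return errors > hc_errors_count or retry_failures > hc_retry_failure_count


print(is_unhealthy(errors=5, retry_failures=5))  # False: neither value exceeds its threshold
print(is_unhealthy(errors=6, retry_failures=0))  # True: errors exceed HC_Errors_Count
```

Because the comparison is strict, a count equal to the configured value still evaluates as healthy.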
Telemetry Pipeline is a hosted service that allows you to monitor your Fluent Bit agents including data flow, metrics, and configurations.