
Metrics are measurements that provide insight into the usage and performance of Services. BentoML provides a set of default metrics for performance analysis, and you can also define custom metrics with Prometheus.

In this document, you will:

  • Learn about and configure the default metrics in BentoML

  • Create custom metrics for BentoML Services

  • Use Prometheus to scrape metrics

  • Create a Grafana dashboard to visualize metrics

Understand metrics

You can access metrics via the metrics endpoint of a BentoML Service. This endpoint is enabled by default and outputs metrics that Prometheus can scrape to monitor your Services continuously.

Default metrics

BentoML automatically collects a set of default metrics for each Service. These metrics are tracked across different dimensions to provide detailed visibility into Service operations:

| Type | Metric name | Dimensions |
|------|-------------|------------|
| Gauge | bentoml_service_request_in_progress | endpoint, runner_name, service_name, service_version |
| Counter | bentoml_service_request_total | endpoint, service_name, runner_name, service_version, http_response_code |
| Histogram | bentoml_service_request_duration_seconds_sum, bentoml_service_request_duration_seconds_count, bentoml_service_request_duration_seconds_bucket | endpoint, service_name, runner_name, service_version, http_response_code |
| Histogram | bentoml_service_adaptive_batch_size_sum, bentoml_service_adaptive_batch_size_count, bentoml_service_adaptive_batch_size_bucket | method_name, service_name, runner_name, service_version, worker_index |

  • request_in_progress: The number of requests that are currently being processed by a Service.

  • request_total: The total number of requests that a Service has processed.

  • request_duration_seconds: The time taken to process requests, including the total sum of request processing time, count of requests processed, and distribution across specified duration buckets.

  • adaptive_batch_size: The adaptive batch sizes used during Service execution, which is relevant for optimizing performance in batch processing scenarios. You need to enable adaptive batching to collect this metric.

Metric types

BentoML supports all metric types provided by Prometheus.

  • Gauge: A metric that represents a single numerical value that can arbitrarily go up and down.

  • Counter: A cumulative metric that only increases, useful for counting total requests.

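  • Histogram: Counts observations in configurable buckets and tracks the total count and sum of the observed values, allowing you to calculate averages and estimate percentiles.

  • Summary: Like Histogram, tracks a total count of observations and a sum of observed values, but does not use buckets. In Prometheus, a Summary can also report quantiles calculated on the client side; the Python client exposes only the count and sum.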

For more information, see the Prometheus documentation.
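
To make the Histogram semantics concrete, here is a minimal sketch in plain Python (not the prometheus_client implementation) of how a set of observations maps to the cumulative `_bucket` series, the `_sum`, and the `_count`. The function name `summarize_observations` and the sample durations are illustrative assumptions.

```python
import math


def summarize_observations(observations, bounds):
    """Return cumulative bucket counts, the sum, and the count of the observations."""
    # Always include +Inf so every observation lands in at least one bucket.
    upper_bounds = sorted(set(bounds) | {math.inf})
    buckets = {
        bound: sum(1 for value in observations if value <= bound)
        for bound in upper_bounds
    }
    return buckets, sum(observations), len(observations)


durations = [0.5, 1.5, 3.0, 7.0]  # Hypothetical request durations in seconds
buckets, total, count = summarize_observations(durations, [1, 2, 5])
print(buckets)        # {1: 1, 2: 2, 5: 3, inf: 4}
print(total / count)  # 3.0, the average duration
```

Each bucket counts every observation less than or equal to its upper bound, which is why the counts never decrease and the +Inf bucket always equals the total count.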


Dimensions tracked for the default BentoML metrics include:

  • endpoint: The specific API endpoint being accessed.

  • runner_name: The name of the running Service handling the request.

  • service_name: The name of the Bento Service handling the request.

  • service_version: The version of the Service.

  • http_response_code: The HTTP response code of the request.

  • worker_index: The worker instance that is running the inference.
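
Dimensions are what let you slice a metric after it has been scraped. As a rough sketch, the following plain-Python snippet groups `bentoml_service_request_total` samples by one dimension and derives an error ratio. The helper name `requests_by_dimension`, the label values, and the counts in the sample text are made up for illustration; in practice you would usually run the equivalent aggregation in Prometheus.

```python
import re
from collections import defaultdict

SAMPLE_LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def requests_by_dimension(metrics_text, dimension):
    """Sum bentoml_service_request_total samples grouped by one label."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if not match or match["name"] != "bentoml_service_request_total":
            continue
        labels = dict(LABEL_PAIR.findall(match["labels"]))
        totals[labels.get(dimension, "")] += float(match["value"])
    return dict(totals)


# Hypothetical scrape output; label values and counts are invented for this example.
SAMPLE = '''
bentoml_service_request_total{endpoint="/summarize",http_response_code="200",runner_name="Summarization",service_name="Summarization",service_version="not available"} 18.0
bentoml_service_request_total{endpoint="/summarize",http_response_code="500",runner_name="Summarization",service_name="Summarization",service_version="not available"} 2.0
'''

by_code = requests_by_dimension(SAMPLE, "http_response_code")
print(by_code)  # {'200': 18.0, '500': 2.0}
error_ratio = by_code.get("500", 0.0) / sum(by_code.values())
print(error_ratio)  # 0.1
```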

Configure default metrics

To customize how metrics are collected and reported in BentoML, use the metrics parameter within the @bentoml.service decorator:

import bentoml

@bentoml.service(metrics={
    "enabled": True,
    "namespace": "custom_namespace",
})
class MyService:
    ...  # Service implementation

  • enabled: Whether metrics collection is turned on. It is enabled by default, in which case you can access the metrics through the metrics endpoint of a BentoML Service.

  • namespace: The prefix prepended to metric names, following the naming convention of Prometheus. The default namespace is bentoml_service, which covers most use cases.

Customize the duration bucket size

You can customize the duration bucket size of request_duration_seconds in the following two ways:

  • Manual bucket definition. Specify explicit steps using buckets:

    import bentoml

    @bentoml.service(metrics={
        "enabled": True,
        "namespace": "bentoml_service",
        "duration": {
            "buckets": [0.1, 0.2, 0.5, 1, 2, 5, 10]
        }
    })
    class MyService:
        ...  # Service implementation

  • Exponential bucket generation. Automatically generate exponential buckets with any given min, max and factor values.

    • min: The lower bound of the smallest bucket in the histogram.

    • max: The upper bound of the largest bucket in the histogram.

    • factor: Determines the exponential growth rate of the bucket sizes. Each subsequent bucket boundary is calculated by multiplying the previous boundary by the factor.

    import bentoml

    @bentoml.service(metrics={
        "enabled": True,
        "namespace": "bentoml_service",
        "duration": {
            "min": 0.1,
            "max": 10,
            "factor": 1.2
        }
    })
    class MyService:
        ...  # Service implementation


  • duration.min, duration.max and duration.factor are mutually exclusive with duration.buckets.

  • duration.factor must be greater than 1 to ensure each subsequent bucket is larger than the previous one.

  • The buckets for the adaptive_batch_size Histogram are calculated based on the max_batch_size defined. The bucket sizes start at 1 and increase exponentially up to the max_batch_size with a factor of 2.

By default, BentoML uses the duration buckets provided by Prometheus.
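
The exponential scheme described above can be sketched as follows. This is an illustration of the idea, not BentoML's own code: the function name `exponential_buckets` and the choice to close the range with the configured maximum are assumptions.

```python
def exponential_buckets(start, factor, end):
    """Generate bucket upper bounds from start to end, growing by factor each step."""
    if start <= 0 or end <= start:
        raise ValueError("require 0 < start < end")
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    bounds = []
    bound = start
    while bound < end:
        bounds.append(bound)
        bound *= factor  # Each boundary is the previous one multiplied by factor
    bounds.append(end)  # Assumption: close the range with the configured maximum
    return bounds


print(exponential_buckets(1, 2, 32))   # [1, 2, 4, 8, 16, 32]
print(exponential_buckets(1, 2, 100))  # [1, 2, 4, 8, 16, 32, 64, 100]
```

The second call mirrors the adaptive_batch_size case with a hypothetical max_batch_size of 100: boundaries start at 1 and double until they reach the maximum. A factor closer to 1 produces more, finer-grained buckets over the same range.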

Create custom metrics

You can define custom Counter, Histogram, Summary, and Gauge metrics in your BentoML Service using the prometheus_client API.


Install the Prometheus Python client package.

pip install prometheus-client

Define custom metrics

To define custom metrics, use the metric classes from the prometheus_client module and set the following parameters as needed:

  • name: A unique string identifier for the metric.

  • documentation: A description of what the metric measures.

  • labelnames: A list of strings defining the labels to apply to the metric. Labels add dimensions to the metric, which are useful for querying and aggregation purposes. When you record a metric, you specify the labels in the format <metric_object>.labels(<label_name>='<label_value>').<metric_function>. Once you define a label for a metric, all instances of that metric must include that label with some value.

    The value of a label can also be dynamic, meaning it can change based on the context of the tracked metric. For example, you can use a label to log the version of model serving predictions, and this version label can change as you update the model.

  • buckets: A Histogram-specific parameter which defines the boundaries for Histogram buckets, useful for categorizing measurement ranges. The list should end with float('inf') to capture all values that exceed the highest defined boundary. See the Prometheus documentation on Histogram for more details.

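Histogram:

import time

import bentoml
from prometheus_client import Histogram

# Define Histogram metric
inference_duration_histogram = Histogram(
    name="inference_duration_seconds",  # Illustrative metric name
    documentation="Time taken for inference",
    labelnames=["endpoint"],
    buckets=(
        0.005, 0.01, 0.025, 0.05, 0.075,
        0.1, 0.25, 0.5, 0.75, 1.0,
        2.5, 5.0, 7.5, 10.0, float("inf"),
    ),
)

@bentoml.service
class HistogramService:
    def __init__(self) -> None:
        # Initialization code
        pass

    @bentoml.api
    def infer(self, text: str) -> str:
        start = time.perf_counter()
        # Implementation logic
        result = text  # Placeholder result
        # Track the metric: observe the elapsed time in seconds
        inference_duration_histogram.labels(endpoint="summarize").observe(time.perf_counter() - start)
        return result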

Counter:

import bentoml
from prometheus_client import Counter

# Define Counter metric
inference_requests_counter = Counter(
    name="inference_requests_total",  # Illustrative metric name
    documentation="Total number of inference requests",
    labelnames=["endpoint"],
)

@bentoml.service
class CounterService:
    def __init__(self) -> None:
        # Initialization code
        pass

    @bentoml.api
    def infer(self, text: str) -> str:
        # Track the metric
        inference_requests_counter.labels(endpoint='summarize').inc()  # Increment the counter by 1
        # Implementation logic
        return text  # Placeholder return value
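
Summary:

import bentoml
from prometheus_client import Summary

# Define Summary metric
response_size_summary = Summary(
    name="response_size_bytes",  # Illustrative metric name
    documentation="Response size in bytes",
    labelnames=["endpoint"],
)

@bentoml.service
class SummaryService:
    def __init__(self) -> None:
        # Initialization code
        pass

    @bentoml.api
    def infer(self, text: str) -> str:
        # Implementation logic
        result = text  # Placeholder result
        # Track the metric: observe the response size in bytes
        response_size_summary.labels(endpoint="summarize").observe(len(result.encode("utf-8")))
        return result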

Gauge:

import bentoml
from prometheus_client import Gauge

# Define Gauge metric
in_progress_gauge = Gauge(
    name="in_progress_requests",  # Illustrative metric name
    documentation="In-progress inference requests",
    labelnames=["endpoint"],
)

@bentoml.service
class GaugeService:
    def __init__(self) -> None:
        # Initialization code
        pass

    @bentoml.api
    def infer(self, text: str) -> str:
        # Track the metric
        in_progress_gauge.labels(endpoint='summarize').inc()  # Increment by 1
        try:
            # Implementation logic
            return text  # Placeholder return value
        finally:
            in_progress_gauge.labels(endpoint='summarize').dec()  # Decrement by 1

For more information on prometheus_client, see the Prometheus Python client library documentation.

An example with custom metrics

The following file contains a custom Histogram and a Counter metric to measure the inference time and track the total number of requests.

from __future__ import annotations
import bentoml
from prometheus_client import Histogram, Counter
from transformers import pipeline
import time

# Define the metrics
request_counter = Counter(
    name='summary_requests_total',
    documentation='Total number of summarization requests',
    labelnames=['status']
)

inference_time_histogram = Histogram(
    name='inference_time_seconds',
    documentation='Time taken for summarization inference',
    labelnames=['status'],
    buckets=(0.1, 0.2, 0.5, 1, 2, 5, 10, float('inf'))  # Example buckets
)

EXAMPLE_INPUT = "Breaking News: In an astonishing turn of events, the small town of Willow Creek has been taken by storm as local resident Jerry Thompson's cat, Whiskers, performed what witnesses are calling a 'miraculous and gravity-defying leap.' Eyewitnesses report that Whiskers, an otherwise unremarkable tabby cat, jumped a record-breaking 20 feet into the air to catch a fly. The event, which took place in Thompson's backyard, is now being investigated by scientists for potential breaches in the laws of physics. Local authorities are considering a town festival to celebrate what is being hailed as 'The Leap of the Century."

@bentoml.service(
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class Summarization:
    def __init__(self) -> None:
        self.pipeline = pipeline('summarization')

    @bentoml.api
    def summarize(self, text: str = EXAMPLE_INPUT) -> str:
        start_time = time.time()
        try:
            result = self.pipeline(text)
            summary_text = result[0]['summary_text']
            # Capture successful requests
            status = 'success'
        except Exception as e:
            # Capture failures
            summary_text = str(e)
            status = 'failure'
        finally:
            # Measure how long the inference took and update the histogram
            inference_time_histogram.labels(status=status).observe(time.time() - start_time)
            # Increment the request counter
            request_counter.labels(status=status).inc()

        return summary_text
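
If several endpoints need the same status and duration tracking, the pattern in the Service above could be factored into a decorator. The following is one possible sketch, not part of BentoML or prometheus_client: the names `track` and `record` are hypothetical, and a plain list stands in for real metrics so the snippet runs on its own.

```python
import time
from functools import wraps


def track(record):
    """Report the outcome ("success" or "failure") and duration of each call to record."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "failure"  # Stays "failure" if func raises
            try:
                result = func(*args, **kwargs)
                status = "success"
                return result
            finally:
                record(status, time.perf_counter() - start)
        return wrapper
    return decorator


calls = []  # Stand-in recorder; a real one would update Prometheus metrics


@track(lambda status, duration: calls.append(status))
def shout(text: str) -> str:
    if not text:
        raise ValueError("empty input")
    return text.upper()


shout("ok")
try:
    shout("")
except ValueError:
    pass
print(calls)  # ['success', 'failure']
```

With the metrics defined in the Service above, `record` could call `inference_time_histogram.labels(status=status).observe(duration)` and `request_counter.labels(status=status).inc()`. Unlike that Service, which returns the error message as the response, this sketch lets the exception propagate after recording it.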

Run this Service locally:

bentoml serve service:Summarization

Make sure you have sent some requests to the summarize endpoint, then view the custom metrics by running the following command. If your metrics use different names, replace inference_time_seconds and summary_requests_total accordingly.

curl -X 'GET' 'http://localhost:3000/metrics' -H 'accept: */*' | grep -E 'inference_time_seconds|summary_requests_total'

Expected output:

# HELP summary_requests_total Total number of summarization requests
# TYPE summary_requests_total counter
summary_requests_total{status="success"} 12.0
# HELP inference_time_seconds Time taken for summarization inference
# TYPE inference_time_seconds histogram
inference_time_seconds_sum{status="success"} 51.74311947822571
inference_time_seconds_bucket{le="0.1",status="success"} 0.0
inference_time_seconds_bucket{le="0.2",status="success"} 0.0
inference_time_seconds_bucket{le="0.5",status="success"} 0.0
inference_time_seconds_bucket{le="1.0",status="success"} 0.0
inference_time_seconds_bucket{le="2.0",status="success"} 0.0
inference_time_seconds_bucket{le="5.0",status="success"} 12.0
inference_time_seconds_bucket{le="10.0",status="success"} 12.0
inference_time_seconds_bucket{le="+Inf",status="success"} 12.0
inference_time_seconds_count{status="success"} 12.0
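
As a quick sanity check, you can derive the average inference time from the `_sum` and `_count` series. The sketch below parses two lines copied from the expected output above; the helper name `sample_value` is an assumption made for this example.

```python
def sample_value(metrics_text, series):
    """Return the value of the first sample line that starts with the given series."""
    for line in metrics_text.splitlines():
        if line.startswith(series):
            return float(line.rsplit(" ", 1)[1])
    raise KeyError(series)


# Two lines copied from the expected output
OUTPUT = '''
inference_time_seconds_sum{status="success"} 51.74311947822571
inference_time_seconds_count{status="success"} 12.0
'''

total = sample_value(OUTPUT, 'inference_time_seconds_sum{status="success"}')
count = sample_value(OUTPUT, 'inference_time_seconds_count{status="success"}')
print(round(total / count, 2))  # 4.31
```

An average of about 4.31 seconds is consistent with the bucket counts: all 12 requests fall into the bucket with an upper bound of 5.0 and none into the bucket with an upper bound of 2.0.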

Use Prometheus to scrape metrics

You can integrate Prometheus to scrape and visualize both default and custom metrics from your BentoML Service.

  1. Install Prometheus.

  2. Create a Prometheus configuration file to define scrape jobs. Here is an example that scrapes metrics every 5 seconds from a BentoML Service.

    global:
      scrape_interval: 5s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: prometheus
        metrics_path: "/metrics" # The metrics endpoint of the BentoML Service
        static_configs:
          - targets: ["localhost:3000"] # The address where the BentoML Service is running; adjust it if yours differs

  3. Make sure you have a BentoML Service running, then start Prometheus in a different terminal session using the configuration file you created:

    ./prometheus --config.file=/path/to/the/file/prometheus.yml

  4. Once Prometheus is running, access its web UI by visiting http://localhost:9090 in your web browser. This interface allows you to query and visualize metrics collected from your BentoML Service.

  5. Use PromQL expressions to query and visualize metrics. For example, to get the 99th percentile of request durations to the /encode endpoint over the last minute, use:

    histogram_quantile(0.99, rate(bentoml_service_request_duration_seconds_bucket{endpoint="/encode"}[1m]))

    [Figure: Prometheus UI for BentoML metrics]
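
To build intuition for what histogram_quantile returns, here is a simplified plain-Python sketch of estimating a quantile from cumulative bucket counts by linear interpolation. It is an approximation of the idea rather than Prometheus's implementation, which handles more edge cases; the function name `estimate_quantile` is an assumption. The bucket counts are taken from the custom histogram output shown earlier in this document. The PromQL query applies the same interpolation to per-second bucket rates instead of raw counts.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite boundary
                return lower_bound
            in_bucket = cumulative - lower_count
            # Assume observations are spread evenly within the bucket
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


BUCKETS = [
    (0.1, 0), (0.2, 0), (0.5, 0), (1.0, 0),
    (2.0, 0), (5.0, 12), (10.0, 12), (math.inf, 12),
]
print(round(estimate_quantile(0.99, BUCKETS), 2))  # 4.97
print(round(estimate_quantile(0.5, BUCKETS), 2))   # 3.5
```

Because the estimate is interpolated within a bucket, its accuracy depends on how closely the bucket boundaries match the real latency distribution, which is one reason to customize the duration buckets.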

Create a Grafana dashboard

Grafana is an analytics platform that allows you to create dynamic and informative dashboards to visualize BentoML metrics. Do the following to create a Grafana dashboard.

  1. Install Grafana.

  2. By default, Grafana runs on port 3000, which conflicts with BentoML’s default port. To avoid this, change Grafana’s default port. For example:

    sudo nano /etc/grafana/grafana.ini

    Find the [http] section and change http_port to a free port like 4000:

    ;http_port = 3000  # Change it to a port of your choice and uncomment the line by removing the semicolon
    http_port = 4000

  3. Save the file and restart Grafana to apply the change:

    sudo systemctl restart grafana-server

  4. Access the Grafana web UI at http://localhost:4000/ (use your own port). Log in with the default credentials (admin/admin).

  5. In the Grafana search box at the top, enter Data sources and add Prometheus as an available option. In Connection, set the URL to the address of your running Prometheus instance, such as http://localhost:9090. Save the configuration and test the connection to ensure Grafana can retrieve data from Prometheus.

    [Figure: Add Prometheus in Grafana]

  6. With Prometheus configured as a data source, you can create a new dashboard. Start by adding a panel and selecting a metric to visualize, such as bentoml_service_request_duration_seconds_bucket. Grafana offers a wide array of visualization options, from simple line graphs to more complex representations like heatmaps or gauges.

    [Figure: Grafana UI for BentoML metrics]

    For detailed instructions on dashboard creation and customization, read the Grafana documentation.