Getting Started

Run on Google Colab

Try out this quickstart guide interactively on Google Colab: Open in Colab.

Note that Docker containerization does not work in the Colab environment.

Run notebook locally

Install BentoML. This requires Python 3.6 or above; install it with the pip command:

pip install bentoml

When following the latest documentation instead of the stable release docs, you need to install the preview release of BentoML:

pip install --pre -U bentoml

Download and run the notebook in this quickstart guide:

# Download BentoML git repo
git clone
cd bentoml

# Install jupyter and other dependencies
pip install jupyter
pip install -r ./guides/quick-start/requirements.txt

# Run the notebook
jupyter notebook ./guides/quick-start/bentoml-quick-start-guide.ipynb

Alternatively, download the notebook (right-click and then “Save Link As”) to your notebook workspace.

To build a model server Docker image, you will also need Docker installed on your system; see the Docker documentation for installation instructions.

Hello World

Before starting, let’s prepare a trained model for serving with BentoML. Train a classifier model with Scikit-Learn on the Iris data set:

from sklearn import svm
from sklearn import datasets

# Load training data
iris = datasets.load_iris()
X, y =,

# Model Training
clf = svm.SVC(gamma='scale'), y)

Model serving with BentoML comes after a model is trained. The first step is creating a prediction service class, which defines the required models and the inference APIs containing the serving logic. Here is a minimal prediction service created for serving the iris classifier model trained above:

import pandas as pd

from bentoml import env, artifacts, api, BentoService
from bentoml.adapters import DataframeInput
from bentoml.frameworks.sklearn import SklearnModelArtifact

@env(infer_pip_packages=True)
@artifacts([SklearnModelArtifact('model')])
class IrisClassifier(BentoService):
    """
    A minimum prediction service exposing a Scikit-learn model
    """

    @api(input=DataframeInput(), batch=True)
    def predict(self, df: pd.DataFrame):
        """
        An inference API named `predict` with Dataframe input adapter, which codifies
        how HTTP requests or CSV files are converted to a pandas Dataframe object as the
        inference API function input
        """
        return self.artifacts.model.predict(df)

First, @artifacts(...) defines the trained models required to be packed with this prediction service. BentoML model artifacts are pre-built wrappers for persisting, loading, and running a trained model. This example uses SklearnModelArtifact for the scikit-learn framework. BentoML also provides artifact classes for other ML frameworks, including PytorchModelArtifact, KerasModelArtifact, and XgboostModelArtifact.

The @env decorator specifies the dependencies and environment settings required for this prediction service. It allows BentoML to reproduce the exact same environment when moving the model and related code to production. With the infer_pip_packages=True flag, BentoML will automatically find all the PyPI packages used by the prediction service code and pin their versions.

The @api decorator defines an inference API, which is the entry point for accessing the prediction service. input=DataframeInput() means this user-defined inference API callback function expects a pandas.DataFrame object as its input.

When the batch flag is set to True, the inference API is expected to accept a list of inputs and return a list of results. In the case of DataframeInput, each row of the DataFrame maps to one prediction request received from the client. BentoML converts HTTP JSON requests into a pandas.DataFrame object before passing it to the user-defined inference API function.

This design allows BentoML to group API requests into small batches while serving online traffic. Compared to a regular Flask or FastAPI based model server, this can significantly increase the overall throughput of the API server.
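The list-in, list-out contract can be sketched with a stand-in model (a stub for illustration, not the real SVC) to show how DataFrame rows map to individual requests:

```python
import pandas as pd

class StubModel:
    """Stand-in for the trained classifier, used only to illustrate batching."""
    def predict(self, df):
        return [0] * len(df)  # one result per input row

# Two client requests grouped into a single batched DataFrame
df = pd.DataFrame([[5.1, 3.5, 1.4, 0.2],
                   [6.2, 2.9, 4.3, 1.3]])
results = StubModel().predict(df)
print(len(results))  # one prediction per request row
```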

Besides DataframeInput, BentoML also supports API input types such as JsonInput, ImageInput, FileInput and more. DataframeInput and TfTensorInput only support inference APIs with batch=True, while other input adapters support either batched or single-item APIs.

Save prediction service for distribution

The following code packages the trained model with the prediction service class IrisClassifier defined above, and then saves the IrisClassifier instance to disk in the BentoML format for distribution and deployment:

# import the IrisClassifier class defined above
from iris_classifier import IrisClassifier

# Create an iris classifier service instance
iris_classifier_service = IrisClassifier()

# Pack the newly trained model artifact
iris_classifier_service.pack('model', clf)

# Save the prediction service to disk for model serving
saved_path =

BentoML stores all packaged model files under the ~/bentoml/repository/{service_name}/{service_version} directory by default. The BentoML packaged model format contains all the code, files, and configs required to run and deploy the model.
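As a small illustration of the path layout described above, the default bundle location can be constructed as follows (the version tag here is a hypothetical placeholder; BentoML generates the real one when is called):

```python
import os

# Default BentoML saved-bundle location, following the layout described above
service_name = "IrisClassifier"
service_version = "20200121141808_FE78B5"  # hypothetical example version tag
bundle_dir = os.path.expanduser(
    os.path.join("~", "bentoml", "repository", service_name, service_version)
)
print(bundle_dir)
```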

BentoML also comes with a model management component called YataiService, which provides a central hub for teams to manage and access packaged models via Web UI and API:

[Screenshots: BentoML YataiService Bento Repository page and Bento Details page]

Launch Yatai server locally with docker and view your local repository of BentoML packaged models:

docker run -v ~/bentoml:/root/bentoml \
    -p 3000:3000 -p 50051:50051 \
    bentoml/yatai-service:latest


The {saved_path} in the following commands refers to the return value of the call: the file path where the BentoService saved bundle is stored. BentoML keeps track of all the BentoService SavedBundles you’ve created locally; you can also find the saved_path of your BentoService in the output of the bentoml list -o wide, bentoml get IrisClassifier -o wide, and bentoml get IrisClassifier:latest commands.

A quick way of getting the saved_path from the command line is via the --print-location option:

saved_path=$(bentoml get IrisClassifier:latest --print-location --quiet)

Model Serving via REST API

To start a REST API model server locally with the IrisClassifier saved above, use the bentoml serve command followed by service name and version tag:

bentoml serve IrisClassifier:latest

Alternatively, use the saved path to load and serve the BentoML packaged model directly:

# Find the local path of the latest version IrisClassifier saved bundle
saved_path=$(bentoml get IrisClassifier:latest --print-location --quiet)

bentoml serve $saved_path

The IrisClassifier model is now served at localhost:5000. Use the curl command to send a prediction request:

curl -i \
  --header "Content-Type: application/json" \
  --request POST \
  --data '[[5.1, 3.5, 1.4, 0.2]]' \
  localhost:5000/predict

Or with Python and the requests library:

import requests
response ="http://localhost:5000/predict", json=[[5.1, 3.5, 1.4, 0.2]])

Note that the BentoML API server automatically converts the DataFrame JSON format into a pandas.DataFrame object before passing it to the user-defined inference API function.
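Conceptually, that conversion can be sketched in plain pandas (a sketch of the idea, assuming the JSON list-of-rows payload shown in the curl example above):

```python
import pandas as pd

# A JSON list-of-rows request body such as [[5.1, 3.5, 1.4, 0.2]]
# becomes a DataFrame with one row per prediction request
payload = [[5.1, 3.5, 1.4, 0.2]]
df = pd.DataFrame(payload)
print(df.shape)  # one row, four feature columns
```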

The BentoML API server also provides a simple web UI dashboard. Go to http://localhost:5000 in the browser and use the Web UI to send prediction request:

[Screenshot: BentoML API Server Web UI]

Launch inference job from CLI

The BentoML CLI supports loading and running a packaged model directly. With the DataframeInput adapter, the run command accepts input DataFrame data from a CLI argument or from local CSV or JSON files:

bentoml run IrisClassifier:latest predict --input='[[5.1, 3.5, 1.4, 0.2]]'

bentoml run IrisClassifier:latest predict --input='./iris_data.csv'

Containerize Model API Server

One common way of distributing this model API server for production deployment is via Docker containers, and BentoML provides a convenient way to do that.

If you already have docker configured, run the following command to build a docker container image for serving the IrisClassifier prediction service created above:

bentoml containerize IrisClassifier:latest -t iris-classifier

Start a container with the docker image built from the previous step:

docker run -p 5000:5000 iris-classifier:latest --workers=1 --enable-microbatch

If you need fine-grained control over how the docker image is built, BentoML provides a convenient way to containerize the model API server manually:

# 1. Find the SavedBundle directory with `bentoml get` command
saved_path=$(bentoml get IrisClassifier:latest --print-location --quiet)

# 2. Run `docker build` with the SavedBundle directory which contains a generated Dockerfile
docker build -t iris-classifier $saved_path

# 3. Run the generated docker image to start a docker container serving the model
docker run -p 5000:5000 iris-classifier --enable-microbatch --workers=1

This makes it possible to deploy BentoML bundled ML models with platforms such as Kubeflow, Knative, and Kubernetes, which provide advanced model deployment features such as auto-scaling, A/B testing, scale-to-zero, canary rollout, and multi-armed bandit.


Ensure Docker is installed before running the commands above; see the Docker documentation for installation instructions.

Deployment Options

If you are on a small team with limited engineering or DevOps resources, try out automated deployment with the BentoML CLI, which currently supports AWS Lambda, AWS SageMaker, and Azure Functions:

If the cloud platform you are working with is not on the list above, try out these step-by-step guides on manually deploying a BentoML packaged model to cloud platforms:

Lastly, if you have a DevOps or ML Engineering team that operates a Kubernetes or OpenShift cluster, use the following guides as references for implementing your deployment strategy:

Distribute BentoML packaged model as a PyPI library

The BentoService SavedBundle is pip-installable and can be directly distributed as a PyPI package if you plan to use the model in your Python applications. You can install it as a system-wide Python package with pip:

saved_path=$(bentoml get IrisClassifier:latest --print-location --quiet)

pip install $saved_path
# Your BentoML model class name will become the package name
import IrisClassifier

installed_svc = IrisClassifier.load()
installed_svc.predict([[5.1, 3.5, 1.4, 0.2]])

This also allows users to upload their BentoService to as a public Python package, or to their organization’s private PyPI index, to share with other developers.

cd $saved_path && python sdist upload


You will have to configure the .pypirc file before uploading to a PyPI index. You can find more information about distributing Python packages at:

Learning more about BentoML

Interested in learning more about BentoML? Check out the BentoML Core Concepts and best practices walkthrough, a must-read for anyone who is looking to adopt BentoML.

Be sure to join the BentoML Slack channel to hear about the latest development updates and be part of the roadmap discussions.