Getting Started

Run on Google Colab

Try out this quickstart guide interactively on Google Colab: Open in Colab.

Note that Docker containerization does not work in the Colab environment.

Run Notebook Locally

Install BentoML, which requires Python 3.6 or above, with the pip command:

pip install bentoml

When following the latest documentation instead of the stable release documentation, install the preview release of BentoML:

pip install --pre -U bentoml

Download and run the notebook in this quickstart guide:

# Download BentoML git repo
git clone
cd bentoml

# Install jupyter and other dependencies
pip install jupyter
pip install -r ./guides/quick-start/requirements.txt

# Run the notebook
jupyter notebook ./guides/quick-start/bentoml-quick-start-guide.ipynb

Alternatively, download the notebook (right-click and then “Save Link As”) to your notebook workspace.

To build a model server Docker image, you will also need to install Docker for your system. Read more about how to install Docker here.


Before getting started, let’s discuss what a BentoML project structure looks like. For most use cases, users can follow this minimal scaffold for deploying with BentoML to avoid potential errors (an example project structure can be found under guides/quick-start):

├──       # responsible for packing BentoService
├──      # BentoService definition
├──               # DL Model definitions
├──               # OPTIONAL: training scripts
└── requirements.txt


Users who already have a DL project usually have their own scripts already, and thus bento_deploy/ is not needed.


For PyTorch use cases, users should have the model class definition available under bento_deploy/ in order to deserialize the model correctly.

We then need to prepare a trained model before serving with BentoML. Train a classifier model with Scikit-Learn on the Iris data set:

from sklearn import svm
from sklearn import datasets

# Load training data
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Model Training
clf = svm.SVC(gamma='scale')
clf.fit(X, y)

Example: Hello World

Model serving with BentoML comes after a model is trained. The first step is creating a prediction service class, which defines the models required and the inference APIs that contain the serving logic. Here is a minimal prediction service created for serving the iris classifier model trained above, which is saved under

import pandas as pd

from bentoml import env, artifacts, api, BentoService
from bentoml.adapters import DataframeInput
from bentoml.frameworks.sklearn import SklearnModelArtifact

@env(infer_pip_packages=True)
@artifacts([SklearnModelArtifact('model')])
class IrisClassifier(BentoService):
    """
    A minimum prediction service exposing a Scikit-learn model
    """

    @api(input=DataframeInput(), batch=True)
    def predict(self, df: pd.DataFrame):
        """
        An inference API named `predict` with Dataframe input adapter, which codifies
        how HTTP requests or CSV files are converted to a pandas Dataframe object as the
        inference API function input
        """
        return self.artifacts.model.predict(df)

First, the @artifacts(...) decorator here defines the trained models to be packed with this prediction service. BentoML model artifacts are pre-built wrappers for persisting, loading, and running a trained model. This example uses the SklearnModelArtifact for the scikit-learn framework. BentoML also provides artifact classes for other ML frameworks, including PytorchModelArtifact, KerasModelArtifact, and XgboostModelArtifact.

The @env decorator specifies the dependencies and environment settings required for this prediction service. It allows BentoML to reproduce the exact same environment when moving the model and related code to production. With the infer_pip_packages=True flag, BentoML will automatically find all the PyPI packages that are used by the prediction service code and pin their versions.
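
To make the idea of version pinning concrete, here is a minimal sketch of how installed package versions can be looked up and written as pinned requirement lines. It only illustrates the concept and is not BentoML's actual implementation. The helper name pin_versions is hypothetical, and the sketch assumes Python 3.8 or above for importlib.metadata.

```python
from importlib import metadata


def pin_versions(package_names):
    """Return 'name==version' lines for the packages that are installed."""
    pinned = []
    for name in package_names:
        try:
            pinned.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Skip packages that are not installed in this environment
            continue
    return pinned


# Prints one pinned line for each of these packages that is installed
print("\n".join(pin_versions(["scikit-learn", "pandas"])))
```

Pinning exact versions in this way is what lets the same set of dependencies be reinstalled later in a different environment.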

The @api decorator defines an inference API, which is the entry point for accessing the prediction service. The input=DataframeInput() argument means that the user-defined inference API callback function expects a pandas.DataFrame object as its input.

When the batch flag is set to True, an inference API is expected to accept a list of inputs and return a list of results. In the case of DataframeInput, each row of the dataframe maps to one prediction request received from the client. BentoML converts HTTP JSON requests into a pandas.DataFrame object before passing it to the user-defined inference API function.

This design allows BentoML to group API requests into small batches while serving online traffic. Compared to a regular Flask or FastAPI-based model server, this can substantially increase the overall throughput of the API server.
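
The batch contract can be illustrated without BentoML at all. The sketch below is not how BentoML implements batching internally. It uses a hypothetical stand-in model (predict_batch) and a hypothetical serve_requests helper to show the idea: the rows of several pending requests are merged into one batch, the model is called once, and the results are handed back per request.

```python
from typing import List

Row = List[float]


def predict_batch(rows: List[Row]) -> List[int]:
    """Hypothetical stand-in for a model: one call handles many rows."""
    # Toy rule: label 0 when the third feature is below 2.5, else 1
    return [0 if row[2] < 2.5 else 1 for row in rows]


def serve_requests(pending_requests: List[List[Row]]) -> List[List[int]]:
    """Merge the rows of all pending requests, predict once, split results."""
    merged = [row for request in pending_requests for row in request]
    predictions = predict_batch(merged)  # single model invocation

    results, start = [], 0
    for request in pending_requests:
        end = start + len(request)
        results.append(predictions[start:end])
        start = end
    return results


pending = [[[5.1, 3.5, 1.4, 0.2]], [[6.7, 3.0, 5.2, 2.3], [4.9, 3.0, 1.4, 0.2]]]
print(serve_requests(pending))  # [[0], [1, 0]]
```

Because the model sees all three rows in one call, the per-call overhead is paid once rather than once per request, which is where the throughput gain comes from.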

Besides DataframeInput, BentoML also supports API input types such as JsonInput, ImageInput, FileInput, and more. DataframeInput and TfTensorInput only support inference APIs with batch=True, while other input adapters support either batch or single-item APIs.

Save prediction service for distribution

The following code packages the trained model with the prediction service class IrisClassifier defined above, and then saves the IrisClassifier instance to disk in the BentoML format for distribution and deployment, under


# import the IrisClassifier class defined above
from bento_service import IrisClassifier

# Create an iris classifier service instance
iris_classifier_service = IrisClassifier()

# Pack the newly trained model artifact
iris_classifier_service.pack('model', clf)

# Save the prediction service to disk for model serving
saved_path = iris_classifier_service.save()

BentoML stores all packaged model files under the ~/bentoml/repository/{service_name}/{service_version} directory by default. The BentoML packaged model format contains all the code, files, and configs required to run and deploy the model.
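
As a small illustration of that default layout, the sketch below builds the directory path from a service name and version. The helper name default_bundle_dir and the version string are hypothetical, used only for illustration. The authoritative location is the path reported by BentoML itself.

```python
from pathlib import Path


def default_bundle_dir(service_name: str, service_version: str) -> Path:
    """Build the default location described above for a saved bundle."""
    return Path.home() / "bentoml" / "repository" / service_name / service_version


# "20210101000000_ABCDEF" is a made-up version string used only for illustration
bundle_dir = default_bundle_dir("IrisClassifier", "20210101000000_ABCDEF")
print(bundle_dir)
print("exists on this machine:", bundle_dir.exists())
```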

BentoML also comes with a model management component called YataiService, which provides a central hub for teams to manage and access packaged models via Web UI and API:

[Screenshots: BentoML YataiService Bento Repository page; BentoML YataiService Bento Details page]

Launch Yatai server locally with docker and view your local repository of BentoML packaged models:

docker run \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v ~/bentoml:/bentoml \
  -p 3000:3000 \
  -p 50051:50051 \


The {saved_path} in the following commands refers to the value returned when the prediction service was saved above. It is the file path where the BentoService saved bundle is stored. BentoML keeps track locally of all the BentoService SavedBundles you’ve created; you can also find the saved_path of your BentoService from the output of the bentoml list -o wide, bentoml get IrisClassifier -o wide, and bentoml get IrisClassifier:latest commands.

A quick way of getting the saved_path from the command line is via the --print-location option:

saved_path=$(bentoml get IrisClassifier:latest --print-location --quiet)
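
The same lookup can be scripted from Python by invoking the CLI. This is a minimal sketch that assumes the bentoml command is available on the PATH. The helper names location_command and get_saved_path are hypothetical and simply wrap the command shown above.

```python
import subprocess
from typing import List


def location_command(name: str, version: str = "latest") -> List[str]:
    """Build the `bentoml get` command shown above as an argument list."""
    return ["bentoml", "get", f"{name}:{version}", "--print-location", "--quiet"]


def get_saved_path(name: str, version: str = "latest") -> str:
    """Run the command and return the printed path (requires the bentoml CLI)."""
    completed = subprocess.run(
        location_command(name, version),
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode stdout as str
    )
    return completed.stdout.strip()


print(location_command("IrisClassifier"))
```

Calling get_saved_path("IrisClassifier") only works once a service with that name has been saved locally.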

Model Serving via REST API

To start a REST API model server locally with the IrisClassifier saved above, use the bentoml serve command followed by service name and version tag:

bentoml serve IrisClassifier:latest

Alternatively, use the saved path to load and serve the BentoML packaged model directly:

# Find the local path of the latest version IrisClassifier saved bundle
saved_path=$(bentoml get IrisClassifier:latest --print-location --quiet)

bentoml serve $saved_path

The IrisClassifier model is now served at localhost:5000. Use the curl command to send a prediction request:

curl -i \
  --header "Content-Type: application/json" \
  --request POST \
  --data '[[5.1, 3.5, 1.4, 0.2]]' \

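Or with Python and the requests library:

import requests
# Endpoint assumed from the server address above and the `predict` API name
response = requests.post("http://localhost:5000/predict", json=[[5.1, 3.5, 1.4, 0.2]])
print(response.text)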

Note that BentoML API server automatically converts the Dataframe JSON format into a pandas.DataFrame object before sending it to the user-defined inference API function.
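
For environments without the requests library, an equivalent request can be prepared with the Python standard library alone. This is a sketch under assumptions: the endpoint URL is inferred from the server address above and the predict API name, and the helper name build_predict_request is hypothetical. The request is only built here, not sent, so the snippet runs without a live server.

```python
import json
import urllib.request

# Assumed endpoint: the local server address above plus the `predict` API name
PREDICT_URL = "http://localhost:5000/predict"


def build_predict_request(rows, url=PREDICT_URL):
    """Create (but do not send) a JSON POST request; each inner list is one row."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_predict_request([[5.1, 3.5, 1.4, 0.2]])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send it while the server is running:
# with urllib.request.urlopen(req) as response:
#     print(response.read().decode("utf-8"))
```

Each inner list in the JSON body becomes one row of the dataframe, so sending several inner lists requests several predictions in one call.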

The BentoML API server also provides a simple web UI dashboard. Go to http://localhost:5000 in the browser and use the Web UI to send a prediction request:

[Screenshot: BentoML API Server Web UI]

Launch inference job from CLI

The BentoML CLI supports loading and running a packaged model directly from the command line. With the DataframeInput adapter, the CLI command supports reading input Dataframe data directly from CLI arguments and local files:

bentoml run IrisClassifier:latest predict --input '[[5.1, 3.5, 1.4, 0.2]]'

bentoml run IrisClassifier:latest predict --input-file './iris_data.csv'
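
When the input rows come from a program rather than being typed by hand, the --input argument can be generated and quoted automatically. The sketch below is one way to do that, assuming Python 3.8 or above for shlex.join. The helper name run_command is hypothetical, and it only builds the command line without executing it.

```python
import json
import shlex


def run_command(bento_tag, api_name, rows):
    """Build a `bentoml run` command line with the rows passed as JSON."""
    args = ["bentoml", "run", bento_tag, api_name, "--input", json.dumps(rows)]
    # shlex.join quotes each argument so the line can be pasted into a POSIX shell
    return shlex.join(args)


print(run_command("IrisClassifier:latest", "predict", [[5.1, 3.5, 1.4, 0.2]]))
```

For the sample row used throughout this guide, the printed line matches the first command above.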

More details on running packaged models that use other input adapters can be found here: Offline Batch Serving

Containerize Model API Server

One common way of distributing this model API server for production deployment is via Docker containers, and BentoML provides a convenient way to do that.

If you already have docker configured, run the following command to build a docker container image for serving the IrisClassifier prediction service created above:

bentoml containerize IrisClassifier:latest -t iris-classifier

Start a container with the docker image built from the previous step:

docker run -p 5000:5000 iris-classifier:latest --workers=2

If you need fine-grained control over how the docker image is built, BentoML provides a convenient way to containerize the model API server manually:

# 1. Find the SavedBundle directory with `bentoml get` command
saved_path=$(bentoml get IrisClassifier:latest --print-location --quiet)

# 2. Run `docker build` with the SavedBundle directory which contains a generated Dockerfile
docker build -t iris-classifier $saved_path

# 3. Run the generated docker image to start a docker container serving the model
docker run -p 5000:5000 iris-classifier --workers=2

This makes it possible to deploy BentoML bundled ML models with platforms such as Kubeflow, Knative, and Kubernetes, which provide advanced model deployment features such as auto-scaling, A/B testing, scale-to-zero, canary rollout, and multi-armed bandit.


Ensure docker is installed before running the command above. Instructions on installing docker:

Other deployment options are documented in the BentoML Deployment Guide, including Kubernetes, AWS, Azure, Google Cloud, Heroku, and more.

Learning more about BentoML

Interested in learning more about BentoML? Check out the BentoML Core Concepts and best practices walkthrough, a must-read for anyone who is looking to adopt BentoML.

Be sure to join the BentoML Slack channel to hear about the latest development updates and be part of the roadmap discussions.