Deploying to KFServing

KFServing enables serverless inference on Kubernetes clusters for common machine learning frameworks such as TensorFlow, XGBoost, and scikit-learn. BentoServices can be deployed to KFServing easily and take advantage of the features KFServing offers.

This guide demonstrates how to serve a scikit-learn based iris classifier model with BentoML on a KFServing cluster. The same deployment steps also apply to models trained with other machine learning frameworks; see the BentoML examples for more.


Before starting this guide, make sure you have the following:

  • a Kubernetes cluster with KFServing installed

  • Docker installed and configured on your local machine, and a Docker Hub account to push images to

  • Python 3.6 or above and the required PyPI packages: bentoml and scikit-learn

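The local prerequisites can be checked from Python before you start. The sketch below is a hypothetical helper, not part of BentoML: it checks the interpreter version, whether the bentoml and sklearn (scikit-learn) modules are importable, and whether a docker executable is on the PATH. It does not inspect the KFServing cluster or your Docker Hub login.

```python
import importlib.util
import shutil
import sys


def check_local_prerequisites():
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper for illustration; it only inspects the local machine.
    """
    problems = []
    # The guide requires Python 3.6 or above.
    if sys.version_info < (3, 6):
        problems.append("Python 3.6+ required, found %d.%d" % sys.version_info[:2])
    # scikit-learn is imported as "sklearn"; find_spec returns None for a missing module.
    for module in ("bentoml", "sklearn"):
        if importlib.util.find_spec(module) is None:
            problems.append("missing Python package: " + module)
    # docker build and docker push run from the local shell, so the CLI must be on PATH.
    if shutil.which("docker") is None:
        problems.append("docker executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_local_prerequisites():
        print("-", problem)
```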
KFServing deployment with BentoML

Run the example project from the quick start guide to create the BentoML saved bundle for deployment:

git clone
pip install -r ./bentoml/guides/quick-start/requirements.txt
python ./bentoml/guides/quick-start/

Verify that the saved bundle was created:

$ bentoml get IrisClassifier:latest

# Sample output
{
  "name": "IrisClassifier",
  "version": "20200121141808_FE78B5",
  "uri": {
    "type": "LOCAL",
    "uri": "/Users/bozhaoyu/bentoml/repository/IrisClassifier/20200121141808_FE78B5"
  },
  "bentoServiceMetadata": {
    "name": "IrisClassifier",
    "version": "20200121141808_FE78B5",
    "createdAt": "2020-01-21T22:18:25.079723Z",
    "env": {
      "condaEnv": "name: bentoml-IrisClassifier\nchannels:\n- defaults\ndependencies:\n- python=3.7.3\n- pip\n",
      "pipDependencies": "bentoml==0.5.8\nscikit-learn",
      "pythonVersion": "3.7.3"
    },
    "artifacts": [
      {
        "name": "model",
        "artifactType": "SklearnModelArtifact"
      }
    ],
    "apis": [
      {
        "name": "predict",
        "InputType": "DataframeInput",
        "docs": "BentoService API"
      }
    ]
  }
}

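The fields of this output that the rest of the guide depends on can also be read programmatically. As a minimal sketch, assuming the output has the shape shown above, the following parses it with Python's json module; summarize_bundle is a hypothetical helper and SAMPLE is an abbreviated copy of the sample output. The uri.uri value is the local bundle directory, which contains the generated Dockerfile.

```python
import json

# Abbreviated copy of the sample `bentoml get` output above (env and docs omitted).
SAMPLE = """
{
  "name": "IrisClassifier",
  "version": "20200121141808_FE78B5",
  "uri": {
    "type": "LOCAL",
    "uri": "/Users/bozhaoyu/bentoml/repository/IrisClassifier/20200121141808_FE78B5"
  },
  "bentoServiceMetadata": {
    "artifacts": [{"name": "model", "artifactType": "SklearnModelArtifact"}],
    "apis": [{"name": "predict", "InputType": "DataframeInput"}]
  }
}
"""


def summarize_bundle(raw):
    """Pull the fields this guide relies on out of `bentoml get` JSON output."""
    info = json.loads(raw)
    meta = info["bentoServiceMetadata"]
    return {
        "tag": "%s:%s" % (info["name"], info["version"]),
        # Local directory of the saved bundle (contains the generated Dockerfile).
        "location": info["uri"]["uri"],
        # Names of the APIs the BentoService exposes, e.g. "predict".
        "apis": [api["name"] for api in meta["apis"]],
        "artifacts": [artifact["name"] for artifact in meta["artifacts"]],
    }


print(summarize_bundle(SAMPLE))
```

In practice you would pass the captured stdout of bentoml get IrisClassifier:latest instead of SAMPLE.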
The saved bundle can now be used to start a REST API server that hosts the BentoService and accepts test requests:

# Start BentoML API server:
bentoml serve IrisClassifier:latest
# In a separate terminal, send a test request:
curl -i \
  --header "Content-Type: application/json" \
  --request POST \
  --data '[[5.1, 3.5, 1.4, 0.2]]' \

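The same test request can be sent from Python with only the standard library. This is a sketch under assumptions: the server is assumed to listen on localhost port 5000 (the container port used later in this guide) and to serve the predict API at /predict; build_predict_request and predict are hypothetical helpers, so adjust the URL if your setup differs.

```python
import json
import urllib.request

# Assumption: local address of `bentoml serve`, with the API named "predict".
URL = "http://localhost:5000/predict"


def build_predict_request(rows, url=URL):
    """Build a POST request whose JSON body is a list of feature rows."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(rows, url=URL, timeout=10):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_predict_request(rows, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# One iris sample: sepal length, sepal width, petal length, petal width (cm).
req = build_predict_request([[5.1, 3.5, 1.4, 0.2]])
print(req.get_method(), req.full_url, req.data)
```

With the server from the previous step running, predict([[5.1, 3.5, 1.4, 0.2]]) sends the request and returns the decoded response.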
Deploy BentoService to KFServing

BentoML provides a convenient way to containerize the model API server with Docker:

  1. Find the SavedBundle directory with the bentoml get command

  2. Run docker build on the SavedBundle directory, which contains a generated Dockerfile

  3. Push the built image to Docker Hub so that the KFServing cluster can pull it

# Find the local path of the latest version IrisClassifier saved bundle
saved_path=$(bentoml get IrisClassifier:latest --print-location --quiet)

# Replace {docker_username} with your Docker Hub username
docker build -t {docker_username}/iris-classifier "$saved_path"
docker push {docker_username}/iris-classifier

Note: BentoML's REST interface differs from the TensorFlow V1 HTTP API that KFServing expects. As a result, requests are sent directly to the prediction service and bypass the top-level InferenceService.

Support for the KFServing V2 prediction protocol in BentoML is coming soon.

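To make the mismatch concrete: the test request earlier sends a bare JSON list of rows to BentoML's predict route, whereas the TensorFlow V1 HTTP API (which KFServing's V1 protocol follows) wraps the rows in an "instances" field, posts to /v1/models/<name>:predict, and returns results under a "predictions" field. The sketch below only converts between the two body shapes; the helper names are hypothetical and this is not an adapter shipped by BentoML or KFServing.

```python
import json


def bentoml_body(rows):
    """Body the BentoML predict API in this guide accepts: a bare list of rows."""
    return json.dumps(rows)


def v1_request_body(rows):
    """Body for a TensorFlow V1 / KFServing V1 predict call: {"instances": [...]}."""
    return json.dumps({"instances": rows})


def v1_predict_path(model_name):
    """Path a V1-protocol client would POST to."""
    return "/v1/models/%s:predict" % model_name


def predictions_from_v1_response(raw):
    """V1 responses carry results under the "predictions" key."""
    return json.loads(raw)["predictions"]


rows = [[5.1, 3.5, 1.4, 0.2]]
print(bentoml_body(rows))                  # [[5.1, 3.5, 1.4, 0.2]]
print(v1_request_body(rows))               # {"instances": [[5.1, 3.5, 1.4, 0.2]]}
print(v1_predict_path("iris-classifier"))  # /v1/models/iris-classifier:predict
```

Because the two shapes differ, a V1 client pointed at the BentoML container would send a body the predict API was not written for, which is why requests go straight to the prediction service.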
The following is an example YAML file that specifies the resources required to run an InferenceService in KFServing. Replace {docker_username} with your Docker Hub username and save the file as bentoml.yaml:

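apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: iris-classifier
spec:
  default:
    predictor:
      custom:
        container:
          image: {docker_username}/iris-classifier
          ports:
            - containerPort: 5000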

Use the kubectl apply command to deploy the InferenceService:

kubectl apply -f bentoml.yaml

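As an alternative to editing the {docker_username} placeholder by hand, the manifest can be generated. This minimal sketch assumes the KFServing v1alpha2 custom predictor layout (apiVersion serving.kubeflow.org/v1alpha2 with spec.default.predictor.custom.container); check which InferenceService API version your cluster serves before relying on it. It prints JSON, which kubectl apply -f also accepts, so the output can be redirected to a file such as bentoml.json. The function name is hypothetical.

```python
import json


def inference_service_manifest(docker_username, name="iris-classifier", port=5000):
    """Build an InferenceService manifest as a plain dict.

    Assumes KFServing's v1alpha2 custom predictor layout; newer releases use a
    different apiVersion and spec structure.
    """
    return {
        "apiVersion": "serving.kubeflow.org/v1alpha2",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "default": {
                "predictor": {
                    "custom": {
                        "container": {
                            # Image pushed to Docker Hub with docker push earlier.
                            "image": "%s/iris-classifier" % docker_username,
                            # Port the BentoML API server listens on in the container.
                            "ports": [{"containerPort": port}],
                        }
                    }
                }
            }
        },
    }


if __name__ == "__main__":
    # "my-user" is a placeholder; use your own Docker Hub username.
    print(json.dumps(inference_service_manifest("my-user"), indent=2))
```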
Run prediction

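# Assumption: the default Istio ingress gateway service; change it if your cluster uses another
INGRESS_GATEWAY=istio-ingressgateway
# Assumption: the Knative route of the predictor service; confirm the name with kubectl get route
MODEL_NAME=iris-classifier-predictor-default
CLUSTER_IP=$(kubectl -n istio-system get service $INGRESS_GATEWAY -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
SERVICE_HOSTNAME=$(kubectl get route ${MODEL_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)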

curl -v -H "Host: ${SERVICE_HOSTNAME}" \
  --header "Content-Type: application/json" \
  --request POST \
  --data '[[5.1, 3.5, 1.4, 0.2]]' \

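The curl call reaches the model through the Istio ingress gateway: the request is sent to the gateway's external IP, and the Host header tells the gateway which service should receive it. SERVICE_HOSTNAME is the host part of the route URL, which is what cut -d "/" -f 3 extracts from a URL of the form http://<host>/<path>. A minimal Python sketch of the same two steps follows; the helper names, route URL, and IP address are hypothetical, and the /predict path is an assumption based on the predict API name.

```python
import urllib.parse
import urllib.request


def service_hostname(route_url):
    """Host part of a route URL; same result as cut -d "/" -f 3 for http(s) URLs."""
    return urllib.parse.urlsplit(route_url).netloc


def build_gateway_request(cluster_ip, route_url, body, path="/predict"):
    """POST to the ingress gateway IP and let the Host header select the service.

    Assumption: the BentoML predict API is served at /predict.
    """
    return urllib.request.Request(
        "http://%s%s" % (cluster_ip, path),
        data=body.encode("utf-8"),
        headers={
            "Host": service_hostname(route_url),
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values; use the CLUSTER_IP and route URL from your own cluster.
req = build_gateway_request(
    "203.0.113.10",
    "http://iris-classifier-predictor-default.default.example.com",
    "[[5.1, 3.5, 1.4, 0.2]]",
)
print(req.full_url, req.get_header("Host"))
```

Once the InferenceService is ready, urllib.request.urlopen(req) sends the request; urllib keeps an explicitly set Host header instead of deriving one from the IP address.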
Delete deployment

kubectl delete -f bentoml.yaml