Preparing Models#
Save A Trained Model#
To serve a trained ML model with BentoML, the model instance first needs to be saved with the BentoML API. In most cases, this is just one line added to your model training pipeline, invoking a save_model call, as demonstrated in the tutorial:
saved_model = bentoml.sklearn.save_model("iris_clf", clf)
print(f"Model saved: {saved_model}")
# Model saved: Model(tag="iris_clf:2uo5fkgxj27exuqj")
See also
It is also possible to use pre-trained models directly with BentoML, without saving them to the model store first. Check out the Custom Runner example to learn more.
Tip
If you have an existing model saved to a file on disk, you will need to load the model in a Python session first and then use BentoML’s framework-specific save_model method to put it into the BentoML model store.
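For instance, here is a minimal sketch, assuming an existing scikit-learn model pickled at ./my_model.pkl (a hypothetical path and model name):
# a minimal sketch: load a pickled scikit-learn model from disk (hypothetical path),
# then save it into the BentoML model store
import pickle
import bentoml

with open("./my_model.pkl", "rb") as f:
    clf = pickle.load(f)

saved_model = bentoml.sklearn.save_model("my_model", clf)
print(f"Model saved: {saved_model}")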
We recommend always saving the model with BentoML as soon as it finishes training and validation. By putting the save_model call at the end of your training pipeline, all your finalized models can be managed in one place and are ready for inference.
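For instance, a training pipeline might end with a save_model call like this (a minimal sketch, assuming the scikit-learn iris classifier from the tutorial):
import bentoml
from sklearn import datasets, svm

# train and validate the model as usual
iris = datasets.load_iris()
clf = svm.SVC(gamma="scale")
clf.fit(iris.data, iris.target)

# final step of the pipeline: save the finalized model to the BentoML model store
saved_model = bentoml.sklearn.save_model("iris_clf", clf)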
Optionally, you may attach custom labels, metadata, or custom_objects to be saved alongside your model in the model store, e.g.:
bentoml.pytorch.save_model(
    "demo_mnist",  # model name in the local model store
    trained_model,  # model instance being saved
    labels={  # user-defined labels for managing models in Yatai
        "owner": "nlp_team",
        "stage": "dev",
    },
    metadata={  # user-defined additional metadata
        "acc": acc,
        "cv_stats": cv_stats,
        "dataset_version": "20210820",
    },
    custom_objects={  # save additional user-defined python objects
        "tokenizer": tokenizer_object,
    },
)
labels: user-defined labels for managing models, e.g. team=nlp, stage=dev.
metadata: user-defined metadata for storing model training context information or model evaluation metrics, e.g. dataset version, training parameters, model scores.
custom_objects: additional user-defined Python objects, e.g. a tokenizer instance, a preprocessor function, or a model configuration JSON. Custom objects will be serialized with cloudpickle.
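Custom objects can be retrieved later from the model store, for example (a minimal sketch, assuming the demo_mnist model saved above):
# retrieve the custom tokenizer object saved alongside the model
import bentoml

bento_model = bentoml.models.get("demo_mnist:latest")
tokenizer = bento_model.custom_objects["tokenizer"]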
Retrieve a saved model#
To load the model instance back into memory, use the framework-specific load_model method. For example:
import bentoml
from sklearn.base import BaseEstimator
model: BaseEstimator = bentoml.sklearn.load_model("iris_clf:latest")
Note
The load_model method is intended only for testing and advanced customizations. For general model serving use cases, use a Runner for running model inference. See the Using Model Runner section below to learn more.
For retrieving the model metadata or custom objects, use the get method:
import bentoml
bento_model: bentoml.Model = bentoml.models.get("iris_clf:latest")
print(bento_model.tag)
print(bento_model.path)
print(bento_model.custom_objects)
print(bento_model.info.metadata)
print(bento_model.info.labels)
my_runner: bentoml.Runner = bento_model.to_runner()
bentoml.models.get returns a bentoml.Model instance, which is a reference to a saved model entry in the BentoML model store. The bentoml.Model instance then provides access to the model info and the to_runner API for creating a Runner instance from the model.
Note
BentoML also provides a framework-specific get method under each framework module, e.g. bentoml.pytorch.get. It behaves exactly the same as bentoml.models.get, except that it also verifies that the model found was saved with the same framework.
Managing Models#
Saved models are stored in BentoML’s model store, which is a local file directory maintained by BentoML. Users can view and manage all saved models via the bentoml models CLI command:
> bentoml models list
Tag Module Size Creation Time Path
iris_clf:2uo5fkgxj27exuqj bentoml.sklearn 5.81 KiB 2022-05-19 08:36:52 ~/bentoml/models/iris_clf/2uo5fkgxj27exuqj
iris_clf:nb5vrfgwfgtjruqj bentoml.sklearn 5.80 KiB 2022-05-17 21:36:27 ~/bentoml/models/iris_clf/nb5vrfgwfgtjruqj
> bentoml models get iris_clf:latest
name: iris_clf
version: 2uo5fkgxj27exuqj
module: bentoml.sklearn
labels: {}
options: {}
metadata: {}
context:
  framework_name: sklearn
  framework_versions:
    scikit-learn: 1.1.0
  bentoml_version: 1.0.0
  python_version: 3.8.12
signatures:
  predict:
    batchable: false
api_version: v1
creation_time: '2022-05-19T08:36:52.456990+00:00'
> bentoml models delete iris_clf:latest -y
INFO [cli] Model(tag="iris_clf:2uo5fkgxj27exuqj") deleted
Model Import and Export#
Models saved with BentoML can be exported to a standalone archive file outside of the model store, for sharing models between teams or moving models between different build stages. For example:
> bentoml models export iris_clf:latest .
Model(tag="iris_clf:2uo5fkgxj27exuqj") exported to ./iris_clf-2uo5fkgxj27exuqj.bentomodel
> bentoml models import ./iris_clf-2uo5fkgxj27exuqj.bentomodel
Model(tag="iris_clf:2uo5fkgxj27exuqj") imported
Note
Models can be exported to or imported from AWS S3, GCS, FTP, Dropbox, etc. For example:
pip install fs-s3fs # Additional dependency required for working with s3
bentoml models export iris_clf:latest s3://my_bucket/my_prefix/
Push and Pull with Yatai#
Yatai provides a centralized Model repository that comes with flexible APIs and Web UI for managing all models (and Bentos) created by your team. It can be configured to store model files on cloud blob storage such as AWS S3, MinIO or GCS.
Once your team has Yatai set up, you can use the bentoml models push and bentoml models pull commands to move models to and from Yatai:
> bentoml models push iris_clf:latest
Successfully pushed model "iris_clf:2uo5fkgxj27exuqj"
> bentoml models pull iris_clf:latest
Successfully pulled model "iris_clf:2uo5fkgxj27exuqj"

Tip
Learn more about CLI usage from bentoml models --help.
Model Management API#
Besides the CLI commands, BentoML also provides equivalent Python APIs for managing models:
import bentoml
bento_model: bentoml.Model = bentoml.models.get("iris_clf:latest")
print(bento_model.path)
print(bento_model.info.metadata)
print(bento_model.info.labels)
bentoml.models.list returns a list of bentoml.Model:
import bentoml
models = bentoml.models.list()
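Each entry is a bentoml.Model reference; for example, one can iterate over the list and print the model tags (a small sketch continuing the snippet above):
# iterate over all models in the local model store and print their tags
for model in models:
    print(model.tag)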
To export and import models programmatically, use bentoml.models.export_model and bentoml.models.import_model:
import bentoml
bentoml.models.export_model('iris_clf:latest', '/path/to/folder/my_model.bentomodel')
bentoml.models.import_model('/path/to/folder/my_model.bentomodel')
Note
Models can be exported to or imported from AWS S3, GCS, FTP, Dropbox, etc. For example:
bentoml.models.import_model('s3://my_bucket/folder/my_model.bentomodel')
If your team has Yatai set up, you can also push local models to Yatai. It provides APIs and a Web UI for managing all models created by your team, and stores model files on cloud blob storage such as AWS S3, MinIO or GCS.
import bentoml
bentoml.models.push("iris_clf:latest")
bentoml.models.pull("iris_clf:latest")
To delete a model from the local model store:
import bentoml
bentoml.models.delete("iris_clf:latest")
Using Model Runner#
The way to run model inference in the context of a bentoml.Service is via a Runner. The Runner abstraction gives the BentoServer more flexibility in how to schedule the inference computation, how to dynamically batch inference calls, and how to better take advantage of all available hardware resources.
As demonstrated in the tutorial, a model runner can be created from a saved model via the to_runner API:
iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()
The runner instance can then be used for creating a bentoml.Service:
import numpy as np
import bentoml
from bentoml.io import NumpyNdarray

svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_series: np.ndarray) -> np.ndarray:
    result = iris_clf_runner.predict.run(input_series)
    return result
To test out the runner interface before writing the Service API callback function, you can create a local runner instance outside of a Service:
# Create a Runner instance:
iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()
# Initialize the runner in the current process; this is meant for development and testing only:
iris_clf_runner.init_local()
# This should yield the same result as the loaded model:
iris_clf_runner.predict.run([[5.9, 3., 5.1, 1.8]])
To learn more about Runner usage and its architecture, see Using Runners.
Model Signatures#
A model signature represents a method on a model object that can be called. This information is used when creating BentoML runners for this model.
In the example above, the iris_clf_runner.predict.run call will pass the function input through to the model’s predict method, running in a remote runner process.
For many other ML frameworks, the model object’s inference method may not be called predict. Users can customize it by specifying the model signature during save_model:
bentoml.pytorch.save_model(
    "demo_mnist",  # model name in the local model store
    trained_model,  # model instance being saved
    signatures={  # model signatures for runner inference
        "classify": {
            "batchable": False,
        }
    }
)
runner = bentoml.pytorch.get("demo_mnist:latest").to_runner()
runner.init_local()
runner.classify.run(MODEL_INPUT)
A special case here is Python’s magic method __call__. Similar to the Python language convention, a call to runner.run will be applied to the model’s __call__ method:
bentoml.pytorch.save_model(
    "demo_mnist",  # model name in the local model store
    trained_model,  # model instance being saved
    signatures={  # model signatures for runner inference
        "__call__": {
            "batchable": False,
        },
    }
)
runner = bentoml.pytorch.get("demo_mnist:latest").to_runner()
runner.init_local()
runner.run(MODEL_INPUT)
Batching#
For model inference calls that support taking a batch input, it is recommended to enable batching for the target model signature. In that case, runner#run calls made from multiple Service workers can be dynamically merged into a larger batch and run as one inference call in the runner worker. Here’s an example:
bentoml.pytorch.save_model(
    "demo_mnist",  # model name in the local model store
    trained_model,  # model instance being saved
    signatures={  # model signatures for runner inference
        "__call__": {
            "batchable": True,
            "batch_dim": 0,
        },
    }
)
runner = bentoml.pytorch.get("demo_mnist:latest").to_runner()
runner.init_local()
runner.run(MODEL_INPUT)
Tip
The runner interface is exactly the same, regardless of whether batchable was set to True or False.
The batch_dim parameter determines which dimension contains the multiple data entries passed to this run method. The default batch_dim, when left unspecified, is 0.
For example, suppose you have two inputs you want to run prediction on, [1, 2] and [3, 4]. If the array you would pass to the predict method is [[1, 2], [3, 4]], then the batch dimension is 0. If the array you would pass to the predict method is [[1, 3], [2, 4]], then the batch dimension is 1. For example:
import numpy as np
import bentoml

# Save two models whose `predict` method supports taking input batches,
# one batched on dimension 0 and the other on dimension 1:
bentoml.pytorch.save_model("demo0", model_0, signatures={
    "predict": {"batchable": True, "batch_dim": 0},
})
bentoml.pytorch.save_model("demo1", model_1, signatures={
    "predict": {"batchable": True, "batch_dim": 1},
})

# if the following calls are batched, the input to the actual
# model.predict method would be [[1, 2], [3, 4], [5, 6]]
runner0 = bentoml.pytorch.get("demo0:latest").to_runner()
runner0.init_local()
runner0.predict.run(np.array([[1, 2], [3, 4]]))
runner0.predict.run(np.array([[5, 6]]))

# if the following calls are batched, the input to the actual
# model.predict method would be [[1, 2, 5], [3, 4, 6]]
runner1 = bentoml.pytorch.get("demo1:latest").to_runner()
runner1.init_local()
runner1.predict.run(np.array([[1, 2], [3, 4]]))
runner1.predict.run(np.array([[5], [6]]))
Expert API
If there are multiple arguments to the run method and only one batch dimension is supplied, all arguments will use that batch dimension.
The batch dimension can also be a tuple of (input batch dimension, output batch dimension). For example, if the predict method should have its input batched along the first axis and its output batched along the zeroth axis, batch_dim can be set to (1, 0).
For online serving workloads, adaptive batching is a critical component that contributes to the overall performance. If throughput and latency are important to you, learn more about other Runner options and batching configurations in the Using Runners and Adaptive Batching doc.
Todo
Add example for using ModelOptions for setting runtime options