Core APIs


class bentoml.Service(name, runners=None)

The service definition is the manifestation of the Service Oriented Architecture and the core building block in BentoML, where users define the service runtime architecture and model serving logic.

A BentoML service is defined by instantiating this Service class. When creating a Service instance, users must provide a service name and the list of runners required by the Service. The instance can then be used to define InferenceAPIs via the api decorator.
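A minimal service definition might look like the following sketch. The model tag "iris_clf", the service name, and the API name are illustrative assumptions, and the example assumes a scikit-learn model was previously saved and that the framework-specific load_runner helper is available:

```python
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray

# Illustrative: assumes a model was previously saved under the tag "iris_clf"
iris_clf_runner = bentoml.sklearn.load_runner("iris_clf:latest")

# The Service name and its required runners are provided at instantiation
svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

# InferenceAPIs are defined via the api decorator
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_arr: np.ndarray) -> np.ndarray:
    # Delegate inference to the runner
    return iris_clf_runner.run(input_arr)
```

The runner list passed to the Service constructor tells BentoML which runners to provision when the service is deployed.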

bentoml.load(bento_identifier, working_dir=None, change_global_cwd=False)

Load a Service instance by the bento_identifier

A bento_identifier:str can be provided in three different forms:

  • Tag pointing to a Bento in local Bento store under BENTOML_HOME/bentos

  • File path to a Bento directory

  • “import_str” for loading a service instance from the working_dir

Example load from Bento usage:

    # load from local bento store
    load("fraud_detector:latest")
    load("fraud_detector:20210709_DE14C9")

    # load from bento directory
    load("~/bentoml/bentos/fraud_detector/20210709_DE14C9")

Example load from working directory by “import_str” usage:

    # When multiple services are defined in the same module
    load("fraud_detector:svc_a")
    load("fraud_detector:svc_b")

    # Find svc by Python module name or file path
    load("fraud_detector:svc")
    load("fraud_detector.py:svc")
    load("foo.bar.fraud_detector:svc")
    load("./def/abc/fraud_detector.py:svc")

    # When there's only one Service instance in the target module, the attributes
    # part in the svc_import_path can be omitted
    load("fraud_detector.py")
    load("fraud_detector")

bentoml.bentos.build(service, *, labels=None, description=None, include=None, exclude=None, additional_models=None, docker=None, python=None, conda=None, version=None, build_ctx=None, _bento_store=<simple_di.providers.SingletonFactory object>, _model_store=<simple_di.providers.SingletonFactory object>)

User-facing API for building a Bento. The available build options are identical to the keys of a valid ‘bentofile.yaml’ file.

This API will not respect any ‘bentofile.yaml’ files. Build options should instead be provided via function call parameters.

Parameters

  • service (str) – import string for finding the bentoml.Service instance build target

  • labels (Optional[Dict[str, str]]) – optional immutable labels for carrying contextual info

  • description (Optional[str]) – optional description string in markdown format

  • include (Optional[List[str]]) – list of file paths and patterns specifying files to include in the Bento; defaults to all files under build_ctx, except those excluded by the exclude parameter or a .bentoignore file in the given directory

  • exclude (Optional[List[str]]) – list of file paths and patterns to exclude from the final Bento archive

  • additional_models (Optional[List[Union[str, bentoml._internal.tag.Tag]]]) – list of model tags to pack in Bento, in addition to the models that are required by service’s runners. These models must be found in the given _model_store

  • docker (Optional[Dict[str, Any]]) – dictionary for configuring Bento’s containerization process, see details in bentoml._internal.bento.build_config.DockerOptions

  • python (Optional[Dict[str, Any]]) – dictionary for configuring Bento’s python dependencies, see details in bentoml._internal.bento.build_config.PythonOptions

  • conda (Optional[Dict[str, Any]]) – dictionary for configuring Bento’s conda dependencies, see details in bentoml._internal.bento.build_config.CondaOptions

  • version (Optional[str]) – Override the default auto-generated version string

  • build_ctx (Optional[str]) – Build context directory; when not provided, the current working directory is used

  • _bento_store (BentoStore) – save Bento created to this BentoStore

  • _model_store (ModelStore) – pull Models required from this ModelStore


Returns

a Bento instance representing the materialized Bento saved in the BentoStore

Return type

Bento
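A build invocation using these options might look like the following sketch. The service import string, labels, include/exclude patterns, and package list are all illustrative assumptions; "fraud_detector.py:svc" must point at a real bentoml.Service instance inside the build context:

```python
import bentoml.bentos

# All values below are illustrative assumptions, not fixed by BentoML
bento = bentoml.bentos.build(
    "fraud_detector.py:svc",              # import str for the Service build target
    labels={"owner": "ml-team", "stage": "dev"},
    description="Detects fraudulent transactions.",
    include=["*.py", "config/*.json"],    # patterns relative to build_ctx
    exclude=["tests/"],
    python={"packages": ["scikit-learn", "pandas"]},
    build_ctx=".",
)
print(bento.tag)
```

The returned Bento instance is already saved in the local BentoStore; its tag can be used with bentoml.load and other APIs.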

class bentoml.Runner(name)

Runner represents a unit of serving logic that can be scaled horizontally to maximize throughput. This Runner class is abstract; actual Runners are created by subclassing it and implementing the __init__, _setup, and _run_batch methods.

Runner instances expose run and run_batch methods, which are eventually piped to the Runner's _run_batch implementation. BentoML applies dynamic batching optimization to all Runners by default.

Why use Runner:

  • Runners allow BentoML to better leverage multiple threads or processes for higher hardware utilization (CPU, GPU, memory).

  • Runners enable higher concurrency in serving workloads, which minimizes latency: you may prefetch data while the model is executing, and parallelize data extraction, data transformation, or the execution of multiple runners.

  • Runners come with dynamic batching optimization, which groups run calls into batched executions when serving online; the batch size and wait time adapt to the workload. This can bring massive throughput improvements to an ML service.
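The adaptive batching idea in the last bullet can be sketched with a toy, pure-Python batcher. This is an illustration of the concept only, not BentoML's implementation; all names here are hypothetical:

```python
import queue
import threading
import time

class ToyBatcher:
    """Toy sketch: individual run() calls are queued and grouped into one
    batched call, up to max_batch_size items or max_wait seconds."""

    def __init__(self, run_batch, max_batch_size=8, max_wait=0.01):
        self._run_batch = run_batch          # callable: list of inputs -> list of outputs
        self._max_batch_size = max_batch_size
        self._max_wait = max_wait
        self._queue = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def run(self, item):
        # Each caller submits one item and blocks until its result is ready
        done = threading.Event()
        holder = {}
        self._queue.put((item, holder, done))
        done.wait()
        return holder["result"]

    def _loop(self):
        while True:
            batch = [self._queue.get()]      # block for the first item
            deadline = time.monotonic() + self._max_wait
            # Collect more items until the batch is full or the wait expires
            while len(batch) < self._max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(self._queue.get(timeout=timeout))
                except queue.Empty:
                    break
            # One batched execution serves every queued caller
            results = self._run_batch([item for item, _, _ in batch])
            for (_, holder, done), result in zip(batch, results):
                holder["result"] = result
                done.set()
```

For example, `ToyBatcher(lambda xs: [x * 2 for x in xs]).run(3)` returns 6, with the doubling applied in a batched call rather than per item.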

All _run_batch argument values must be one of the three types below: numpy.ndarray, pandas.DataFrame, List[PickleSerializable]

Acceptable return value types for _run_batch:

  • numpy.ndarray, pandas.DataFrame, pandas.Series, List[PickleSerializable]

  • Tuple of the types above, indicating multiple return values

Runner run accepts argument values of the following types:

  • numpy.ndarray => numpy.ndarray

  • pandas.DataFrame, pandas.Series => pandas.DataFrame

  • any => List[PickleSerializable]

Note: for pandas.DataFrame and List, the batch_axis must be 0
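Putting this together, a minimal Runner subclass might look like the following sketch. The class name and the doubling logic are hypothetical, used only to show where _setup and _run_batch fit:

```python
import bentoml
import numpy as np

# Hypothetical example; "DoublerRunner" is not part of BentoML's API
class DoublerRunner(bentoml.Runner):
    def _setup(self):
        # One-time initialization: load model weights, open sessions, etc.
        pass

    def _run_batch(self, input_arr: np.ndarray) -> np.ndarray:
        # Receives a whole batch (batch_axis=0) and returns a batch of results
        return input_arr * 2

runner = DoublerRunner("doubler")
# runner.run(...) and runner.run_batch(...) are eventually piped to _run_batch
```

Because dynamic batching is applied by default, individual run calls made while serving online may be grouped into a single _run_batch execution.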

class bentoml.SimpleRunner(name)

SimpleRunner is a special type of Runner that does not support dynamic batching. Instead of _run_batch in Runner, a _run method is expected to be defined in its subclasses.

A SimpleRunner only exposes the run method to its users.

SimpleRunner._run can accept arbitrary input types that are pickle-serializable.
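A minimal SimpleRunner subclass might look like the following sketch; the class name and pass-through logic are hypothetical:

```python
import bentoml

# Hypothetical example; "EchoRunner" is not part of BentoML's API
class EchoRunner(bentoml.SimpleRunner):
    def _run(self, payload):
        # Called once per run() invocation; no dynamic batching is applied
        return payload

runner = EchoRunner("echo")
```

SimpleRunner is the right choice when the underlying logic cannot benefit from, or does not support, batched execution.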


class bentoml.Tag(name, version=None)


class bentoml.Model(tag, model_fs, info, custom_objects=None, flushed=False)