ONNX

Users can now use ONNX with BentoML with the following APIs: load, save, and load_runner, as shown below:

import math
import os

import numpy as np
import torch
import torch.nn as nn

import bentoml

class ExtendedModel(nn.Module):
    def __init__(self, D_in, H, D_out):
        # In the constructor we instantiate two nn.Linear modules and assign them as
        # member variables.
        super(ExtendedModel, self).__init__()
        self.linear1 = nn.Linear(D_in, H)
        self.linear2 = nn.Linear(H, D_out)

    def forward(self, x, bias):
        # In the forward function we accept a Tensor of input data and an optional bias
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred + bias


N, D_in, H, D_out = 64, 1000, 100, 1
x = torch.randn(N, D_in)
model = ExtendedModel(D_in, H, D_out)

input_names = ["x", "bias"]
output_names = ["output1"]

tmpdir = "/tmp/model"
os.makedirs(tmpdir, exist_ok=True)
model_path = os.path.join(tmpdir, "test_torch.onnx")
torch.onnx.export(
    model,
    (x, torch.Tensor([1.0])),
    model_path,
    input_names=input_names,
    output_names=output_names,
)

# `save` an ONNX model to the BentoML modelstore:
tag = bentoml.onnx.save("onnx_model", model_path)

# two example bias values to compare below
bias1, bias2 = 0.0, 1.0

# retrieve metadata with `bentoml.models.get`:
metadata = bentoml.models.get(tag)

# `load` the given model back:
loaded = bentoml.onnx.load("onnx_model")

# Run the model under the `Runner` abstraction with `load_runner`:
r1 = bentoml.onnx.load_runner(tag)

r2 = bentoml.onnx.load_runner(tag)

res1 = r1.run_batch(x, np.array([bias1]).astype(np.float32))[0][0].item()
res2 = r2.run_batch(x, np.array([bias2]).astype(np.float32))[0][0].item()

# tensor to float may introduce larger errors, so we bump rel_tol
# from 1e-9 to 1e-6 just in case
assert math.isclose(res1 - res2, bias1 - bias2, rel_tol=1e-6)

Note

You can find more examples for ONNX in our gallery repo.

bentoml.onnx.save(name, model, *, labels=None, custom_objects=None, metadata=None)

Save a model instance to BentoML modelstore.

Parameters
  • name (str) – Name for given model instance. This should pass Python identifier check.

  • model (Union[onnx.ModelProto, path-like object]) – Instance of the model to be saved, or a path to the exported ONNX model file.

  • labels (Dict[str, str], optional, default to None) – user-defined labels for managing models, e.g. team=nlp, stage=dev; see the sketch after this list.

  • custom_objects (Dict[str, Any], optional, default to None) – user-defined additional python objects to be saved alongside the model, e.g. a tokenizer instance, preprocessor function, or model configuration json.

  • metadata (Dict[str, Any], optional, default to None) – Custom metadata for given model.

  • model_store (ModelStore, default to BentoMLContainer.model_store) – BentoML modelstore, provided by DI Container.
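
For instance, a minimal sketch of passing the optional arguments above; the label and metadata values here are illustrative, not required keys:

tag = bentoml.onnx.save(
    "onnx_model",
    model_path,
    labels={"team": "nlp", "stage": "dev"},  # illustrative labels
    metadata={"framework": "pytorch"},       # illustrative metadata
)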

Returns

A tag with the format name:version, where name is the user-defined model name and version is generated by BentoML.

Return type

Tag

Examples:

import os

import bentoml
import torch
import torch.nn as nn

class ExtendedModel(nn.Module):
    def __init__(self, D_in, H, D_out):
        # In the constructor we instantiate two nn.Linear modules and assign them as
        #  member variables.
        super(ExtendedModel, self).__init__()
        self.linear1 = nn.Linear(D_in, H)
        self.linear2 = nn.Linear(H, D_out)

    def forward(self, x, bias):
        # In the forward function we accept a Tensor of input data and an optional bias
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred + bias


N, D_in, H, D_out = 64, 1000, 100, 1
x = torch.randn(N, D_in)
model = ExtendedModel(D_in, H, D_out)

input_names = ["x", "bias"]
output_names = ["output1"]

tmpdir = "/tmp/model"
os.makedirs(tmpdir, exist_ok=True)
model_path = os.path.join(tmpdir, "test_torch.onnx")
torch.onnx.export(
    model,
    (x, torch.Tensor([1.0])),
    model_path,
    input_names=input_names,
    output_names=output_names,
)

tag = bentoml.onnx.save("onnx_model", model_path)
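
The returned tag can then be used to fetch or reload the model later; a short sketch, reusing the APIs shown in this document:

# retrieve the saved model, then load it back from the modelstore
model_info = bentoml.models.get(tag)
loaded = bentoml.onnx.load(tag)
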
bentoml.onnx.load(tag, backend='onnxruntime', providers=None, session_options=None, model_store=<simple_di.providers.SingletonFactory object>)

Load a model from BentoML local modelstore with given name.

Parameters
  • tag (Union[str, Tag]) – Tag of a saved model in BentoML local modelstore.

  • backend (str, optional, default to onnxruntime) – Different backend runtimes supported by ONNX. Currently only onnxruntime and onnxruntime-gpu are accepted.

  • providers (List[Union[str, Tuple[str, Dict[str, Any]]]], optional, default to None) – Execution providers specified by the user. By default BentoML will use onnxruntime.get_available_providers() when loading a model.

  • session_options (onnxruntime.SessionOptions, optional, default to None) – SessionOptions to configure the session for a given use case.

  • model_store (ModelStore, default to BentoMLContainer.model_store) – BentoML modelstore, provided by DI Container.

Returns

An instance of the ONNX model from the BentoML modelstore.

Return type

onnxruntime.InferenceSession

Examples:

import bentoml

model = bentoml.onnx.load(tag)
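
A sketch of loading with an explicit provider list and tuned session options; providers and session_options follow the parameters documented above, tag is assumed from the save example, and the thread count is an illustrative value:

import onnxruntime

opts = onnxruntime.SessionOptions()
opts.intra_op_num_threads = 2  # illustrative tuning value

model = bentoml.onnx.load(
    tag,
    providers=["CPUExecutionProvider"],
    session_options=opts,
)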
bentoml.onnx.load_runner(tag, *, backend='onnxruntime', gpu_device_id=-1, disable_copy_in_default_stream=False, providers=None, session_options=None, name=None)

Runner represents a unit of serving logic that can be scaled horizontally to maximize throughput. bentoml.onnx.load_runner implements a Runner class that wraps around an ONNX model, optimizing it for the BentoML runtime.

Parameters
  • tag (Union[str, Tag]) – Tag of a saved model in BentoML local modelstore.

  • gpu_device_id (int, optional, default to -1) – GPU device ID. Currently only CUDA is supported.

  • disable_copy_in_default_stream (bool, optional, default to False) – Whether to do copies in the default stream or use separate streams. Refer to Execution Providers for more information.

  • backend (str, optional, default to onnxruntime) – Different backend runtimes supported by ONNX. Currently only onnxruntime and onnxruntime-gpu are accepted.

  • providers (List[Union[str, Tuple[str, Dict[str, Any]]]], optional, default to None) – Execution providers specified by the user. By default BentoML will use CPUExecutionProvider when loading a model.

  • session_options (onnxruntime.SessionOptions, optional, default to None) – SessionOptions to configure the session for a given use case.

Returns

A Runner instance for the bentoml.onnx model

Return type

Runner

Examples:

runner = bentoml.onnx.load_runner(
    tag, backend="onnxruntime-gpu", gpu_device_id=0
)
runner.run_batch(data)
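
For CPU-only serving, a comparable sketch; x and tag are assumed from the export example earlier in this document, and the bias value is illustrative:

import numpy as np

cpu_runner = bentoml.onnx.load_runner(
    tag,
    backend="onnxruntime",
    providers=["CPUExecutionProvider"],
)
# inputs follow the model's input_names ("x" and "bias") from the export step
result = cpu_runner.run_batch(x, np.array([1.0]).astype(np.float32))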