Serving with gRPC#

time expected: 12 minutes

This guide will demonstrate advanced features that BentoML offers for you to get started with gRPC:

First-class support for custom gRPC Servicers, custom interceptors, and handlers.
Seamlessly adding gRPC support to an existing Bento.

This guide will also walk you through the tradeoffs of serving with gRPC, as well as recommendations on scenarios where gRPC might be a good fit.

Requirements: This guide assumes that you have basic knowledge of gRPC and protobuf. If you aren’t familiar with gRPC, you can start with the gRPC quick start guide.

Using gRPC in BentoML#

We will dive into some of the details of how gRPC is implemented in BentoML.

Protobuf definition#

Let’s take a quick look at protobuf definition of the BentoService:

service BentoService {
  rpc Call(Request) returns (Response) {}
}

The full protobuf definition for the current version (v1) is shown below, followed by the older v1alpha1 definition.

syntax = "proto3";

package bentoml.grpc.v1;

import "google/protobuf/struct.proto";
import "google/protobuf/wrappers.proto";

// cc_enable_arenas pre-allocate memory for given message to improve speed. (C++ only)
option cc_enable_arenas = true;
option go_package = "github.com/bentoml/bentoml/grpc/v1;service";
option java_multiple_files = true;
option java_outer_classname = "ServiceProto";
option java_package = "com.bentoml.grpc.v1";
option objc_class_prefix = "SVC";
option py_generic_services = true;

// a gRPC BentoServer.
service BentoService {
  // Call handles methodcaller of given API entrypoint.
  rpc Call(Request) returns (Response) {}
  // ServiceMetadata returns metadata of bentoml.Service.
  rpc ServiceMetadata(ServiceMetadataRequest) returns (ServiceMetadataResponse) {}
}

// ServiceMetadataRequest message doesn't take any arguments.
message ServiceMetadataRequest {}

// ServiceMetadataResponse returns metadata of bentoml.Service.
// Currently it includes name, version, apis, and docs.
message ServiceMetadataResponse {
  // DescriptorMetadata is a metadata of any given IODescriptor.
  message DescriptorMetadata {
    // descriptor_id describes the given ID of the descriptor, which matches with our OpenAPI definition.
    optional string descriptor_id = 1;

    // attributes is the kwargs of the given descriptor.
    google.protobuf.Struct attributes = 2;
  }
  // InferenceAPI is bentoml._internal.service.inference_api.InferenceAPI
  // that is exposed to gRPC client.
  // There is no way for reflection to get information of given @svc.api.
  message InferenceAPI {
    // name is the name of the API.
    string name = 1;
    // input is the input descriptor of the API.
    optional DescriptorMetadata input = 2;
    // output is the output descriptor of the API.
    optional DescriptorMetadata output = 3;
    // docs is the optional documentation of the API.
    optional string docs = 4;
  }
  // name is the service name.
  string name = 1;
  // apis holds a list of InferenceAPI of the service.
  repeated InferenceAPI apis = 2;
  // docs is the documentation of the service.
  string docs = 3;
}

// Request message for incoming Call.
message Request {
  // api_name defines the API entrypoint to call.
  // api_name is the name of the function defined in bentoml.Service.
  // Example:
  //
  //     @svc.api(input=NumpyNdarray(), output=File())
  //     def predict(input: NDArray[float]) -> bytes:
  //         ...
  //
  //     api_name is "predict" in this case.
  string api_name = 1;

  oneof content {
    // NDArray represents a n-dimensional array of arbitrary type.
    NDArray ndarray = 3;

    // DataFrame represents any tabular data type. We are using
    // DataFrame as a trivial representation for tabular type.
    DataFrame dataframe = 5;

    // Series portrays a series of values. This can be used for
    // representing Series types in tabular data.
    Series series = 6;

    // File represents for any arbitrary file type. This can be
    // plaintext, image, video, audio, etc.
    File file = 7;

    // Text represents a string inputs.
    google.protobuf.StringValue text = 8;

    // JSON is represented by using google.protobuf.Value.
    // see https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/struct.proto
    google.protobuf.Value json = 9;

    // Multipart represents a multipart message.
    // It comprises of a mapping from given type name to a subset of aforementioned types.
    Multipart multipart = 10;

    // serialized_bytes is for data serialized in BentoML's internal serialization format.
    bytes serialized_bytes = 2;
  }

  // Tensor is similar to ndarray but with a name.
  // We are reserving it for now for future use.
  // repeated Tensor tensors = 4;
  reserved 4, 11 to 13;
}

// Response message returned by Call.
message Response {
  oneof content {
    // NDArray represents a n-dimensional array of arbitrary type.
    NDArray ndarray = 1;

    // DataFrame represents any tabular data type. We are using
    // DataFrame as a trivial representation for tabular type.
    DataFrame dataframe = 3;

    // Series portrays a series of values. This can be used for
    // representing Series types in tabular data.
    Series series = 5;

    // File represents for any arbitrary file type. This can be
    // plaintext, image, video, audio, etc.
    File file = 6;

    // Text represents a string inputs.
    google.protobuf.StringValue text = 7;

    // JSON is represented by using google.protobuf.Value.
    // see https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/struct.proto
    google.protobuf.Value json = 8;

    // Multipart represents a multipart message.
    // It comprises of a mapping from given type name to a subset of aforementioned types.
    Multipart multipart = 9;

    // serialized_bytes is for data serialized in BentoML's internal serialization format.
    bytes serialized_bytes = 2;
  }
  // Tensor is similar to ndarray but with a name.
  // We are reserving it for now for future use.
  // repeated Tensor tensors = 4;
  reserved 4, 10 to 13;
}

// Part represents possible value types for multipart message.
// These are the same as the types in Request message.
message Part {
  oneof representation {
    // NDArray represents a n-dimensional array of arbitrary type.
    NDArray ndarray = 1;

    // DataFrame represents any tabular data type. We are using
    // DataFrame as a trivial representation for tabular type.
    DataFrame dataframe = 3;

    // Series portrays a series of values. This can be used for
    // representing Series types in tabular data.
    Series series = 5;

    // File represents for any arbitrary file type. This can be
    // plaintext, image, video, audio, etc.
    File file = 6;

    // Text represents a string inputs.
    google.protobuf.StringValue text = 7;

    // JSON is represented by using google.protobuf.Value.
    // see https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/struct.proto
    google.protobuf.Value json = 8;

    // serialized_bytes is for data serialized in BentoML's internal serialization format.
    bytes serialized_bytes = 4;
  }

  // Tensor is similar to ndarray but with a name.
  // We are reserving it for now for future use.
  // Tensor tensors = 4;
  reserved 2, 9 to 13;
}

// Multipart represents a multipart message.
// It comprises of a mapping from given type name to a subset of aforementioned types.
message Multipart {
  map<string, Part> fields = 1;
}

// File represents for any arbitrary file type. This can be
// plaintext, image, video, audio, etc.
message File {
  // optional file type, let it be csv, text, parquet, etc.
  // v1alpha1 uses 1 as FileType enum.
  optional string kind = 3;
  // contents of file as bytes.
  bytes content = 2;
}

// DataFrame represents any tabular data type. We are using
// DataFrame as a trivial representation for tabular type.
// This message carries given implementation of tabular data based on given orientation.
// TODO: support index, records, etc.
message DataFrame {
  // columns name
  repeated string column_names = 1;

  // columns orient.
  // { column ↠ { index ↠ value } }
  repeated Series columns = 2;
}

// Series portrays a series of values. This can be used for
// representing Series types in tabular data.
message Series {
  // A bool parameter value
  repeated bool bool_values = 1 [packed = true];

  // A float parameter value
  repeated float float_values = 2 [packed = true];

  // A int32 parameter value
  repeated int32 int32_values = 3 [packed = true];

  // A int64 parameter value
  repeated int64 int64_values = 6 [packed = true];

  // A string parameter value
  repeated string string_values = 5;

  // represents a double parameter value.
  repeated double double_values = 4 [packed = true];
}

// NDArray represents a n-dimensional array of arbitrary type.
message NDArray {
  // Represents data type of a given array.
  enum DType {
    // Represents a None type.
    DTYPE_UNSPECIFIED = 0;

    // Represents an float type.
    DTYPE_FLOAT = 1;

    // Represents an double type.
    DTYPE_DOUBLE = 2;

    // Represents a bool type.
    DTYPE_BOOL = 3;

    // Represents an int32 type.
    DTYPE_INT32 = 4;

    // Represents an int64 type.
    DTYPE_INT64 = 5;

    // Represents a uint32 type.
    DTYPE_UINT32 = 6;

    // Represents a uint64 type.
    DTYPE_UINT64 = 7;

    // Represents a string type.
    DTYPE_STRING = 8;
  }

  // DTYPE is the data type of given array
  DType dtype = 1;

  // shape is the shape of given array.
  repeated int32 shape = 2;

  // represents a string parameter value.
  repeated string string_values = 5;

  // represents a float parameter value.
  repeated float float_values = 3 [packed = true];

  // represents a double parameter value.
  repeated double double_values = 4 [packed = true];

  // represents a bool parameter value.
  repeated bool bool_values = 6 [packed = true];

  // represents a int32 parameter value.
  repeated int32 int32_values = 7 [packed = true];

  // represents a int64 parameter value.
  repeated int64 int64_values = 8 [packed = true];

  // represents a uint32 parameter value.
  repeated uint32 uint32_values = 9 [packed = true];

  // represents a uint64 parameter value.
  repeated uint64 uint64_values = 10 [packed = true];
}

v1alpha1

syntax = "proto3";

package bentoml.grpc.v1alpha1;

import "google/protobuf/struct.proto";
import "google/protobuf/wrappers.proto";

// cc_enable_arenas pre-allocate memory for given message to improve speed. (C++ only)
option cc_enable_arenas = true;
option go_package = "github.com/bentoml/bentoml/grpc/v1alpha1;service";
option java_multiple_files = true;
option java_outer_classname = "ServiceProto";
option java_package = "com.bentoml.grpc.v1alpha1";
option objc_class_prefix = "SVC";
option py_generic_services = true;

// a gRPC BentoServer.
service BentoService {
  // Call handles methodcaller of given API entrypoint.
  rpc Call(Request) returns (Response) {}
}

// Request message for incoming Call.
message Request {
  // api_name defines the API entrypoint to call.
  // api_name is the name of the function defined in bentoml.Service.
  // Example:
  //
  //     @svc.api(input=NumpyNdarray(), output=File())
  //     def predict(input: NDArray[float]) -> bytes:
  //         ...
  //
  //     api_name is "predict" in this case.
  string api_name = 1;

  oneof content {
    // NDArray represents a n-dimensional array of arbitrary type.
    NDArray ndarray = 3;

    // DataFrame represents any tabular data type. We are using
    // DataFrame as a trivial representation for tabular type.
    DataFrame dataframe = 5;

    // Series portrays a series of values. This can be used for
    // representing Series types in tabular data.
    Series series = 6;

    // File represents for any arbitrary file type. This can be
    // plaintext, image, video, audio, etc.
    File file = 7;

    // Text represents a string inputs.
    google.protobuf.StringValue text = 8;

    // JSON is represented by using google.protobuf.Value.
    // see https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/struct.proto
    google.protobuf.Value json = 9;

    // Multipart represents a multipart message.
    // It comprises of a mapping from given type name to a subset of aforementioned types.
    Multipart multipart = 10;

    // serialized_bytes is for data serialized in BentoML's internal serialization format.
    bytes serialized_bytes = 2;
  }

  // Tensor is similar to ndarray but with a name.
  // We are reserving it for now for future use.
  // repeated Tensor tensors = 4;
  reserved 4, 11 to 13;
}

// Response message returned by Call.
message Response {
  oneof content {
    // NDArray represents a n-dimensional array of arbitrary type.
    NDArray ndarray = 1;

    // DataFrame represents any tabular data type. We are using
    // DataFrame as a trivial representation for tabular type.
    DataFrame dataframe = 3;

    // Series portrays a series of values. This can be used for
    // representing Series types in tabular data.
    Series series = 5;

    // File represents for any arbitrary file type. This can be
    // plaintext, image, video, audio, etc.
    File file = 6;

    // Text represents a string inputs.
    google.protobuf.StringValue text = 7;

    // JSON is represented by using google.protobuf.Value.
    // see https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/struct.proto
    google.protobuf.Value json = 8;

    // Multipart represents a multipart message.
    // It comprises of a mapping from given type name to a subset of aforementioned types.
    Multipart multipart = 9;

    // serialized_bytes is for data serialized in BentoML's internal serialization format.
    bytes serialized_bytes = 2;
  }
  // Tensor is similar to ndarray but with a name.
  // We are reserving it for now for future use.
  // repeated Tensor tensors = 4;
  reserved 4, 10 to 13;
}

// Part represents possible value types for multipart message.
// These are the same as the types in Request message.
message Part {
  oneof representation {
    // NDArray represents a n-dimensional array of arbitrary type.
    NDArray ndarray = 1;

    // DataFrame represents any tabular data type. We are using
    // DataFrame as a trivial representation for tabular type.
    DataFrame dataframe = 3;

    // Series portrays a series of values. This can be used for
    // representing Series types in tabular data.
    Series series = 5;

    // File represents for any arbitrary file type. This can be
    // plaintext, image, video, audio, etc.
    File file = 6;

    // Text represents a string inputs.
    google.protobuf.StringValue text = 7;

    // JSON is represented by using google.protobuf.Value.
    // see https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/struct.proto
    google.protobuf.Value json = 8;

    // serialized_bytes is for data serialized in BentoML's internal serialization format.
    bytes serialized_bytes = 4;
  }

  // Tensor is similar to ndarray but with a name.
  // We are reserving it for now for future use.
  // Tensor tensors = 4;
  reserved 2, 9 to 13;
}

// Multipart represents a multipart message.
// It comprises of a mapping from given type name to a subset of aforementioned types.
message Multipart {
  map<string, Part> fields = 1;
}

// File represents for any arbitrary file type. This can be
// plaintext, image, video, audio, etc.
message File {
  // FileType represents possible file type to be handled by BentoML.
  // Currently, we only support plaintext (Text()), image (Image()), and file (File()).
  // TODO: support audio and video streaming file types.
  enum FileType {
    FILE_TYPE_UNSPECIFIED = 0;

    // file types
    FILE_TYPE_CSV = 1;
    FILE_TYPE_PLAINTEXT = 2;
    FILE_TYPE_JSON = 3;
    FILE_TYPE_BYTES = 4;
    FILE_TYPE_PDF = 5;

    // image types
    FILE_TYPE_PNG = 6;
    FILE_TYPE_JPEG = 7;
    FILE_TYPE_GIF = 8;
    FILE_TYPE_BMP = 9;
    FILE_TYPE_TIFF = 10;
    FILE_TYPE_WEBP = 11;
    FILE_TYPE_SVG = 12;
  }

  // optional type of file, let it be csv, text, parquet, etc.
  optional FileType kind = 1;

  // contents of file as bytes.
  bytes content = 2;
}

// DataFrame represents any tabular data type. We are using
// DataFrame as a trivial representation for tabular type.
// This message carries given implementation of tabular data based on given orientation.
// TODO: support index, records, etc.
message DataFrame {
  // columns name
  repeated string column_names = 1;

  // columns orient.
  // { column ↠ { index ↠ value } }
  repeated Series columns = 2;
}

// Series portrays a series of values. This can be used for
// representing Series types in tabular data.
message Series {
  // A bool parameter value
  repeated bool bool_values = 1 [packed = true];

  // A float parameter value
  repeated float float_values = 2 [packed = true];

  // A int32 parameter value
  repeated int32 int32_values = 3 [packed = true];

  // A int64 parameter value
  repeated int64 int64_values = 6 [packed = true];

  // A string parameter value
  repeated string string_values = 5;

  // represents a double parameter value.
  repeated double double_values = 4 [packed = true];
}

// NDArray represents a n-dimensional array of arbitrary type.
message NDArray {
  // Represents data type of a given array.
  enum DType {
    // Represents a None type.
    DTYPE_UNSPECIFIED = 0;

    // Represents an float type.
    DTYPE_FLOAT = 1;

    // Represents an double type.
    DTYPE_DOUBLE = 2;

    // Represents a bool type.
    DTYPE_BOOL = 3;

    // Represents an int32 type.
    DTYPE_INT32 = 4;

    // Represents an int64 type.
    DTYPE_INT64 = 5;

    // Represents a uint32 type.
    DTYPE_UINT32 = 6;

    // Represents a uint64 type.
    DTYPE_UINT64 = 7;

    // Represents a string type.
    DTYPE_STRING = 8;
  }

  // DTYPE is the data type of given array
  DType dtype = 1;

  // shape is the shape of given array.
  repeated int32 shape = 2;

  // represents a string parameter value.
  repeated string string_values = 5;

  // represents a float parameter value.
  repeated float float_values = 3 [packed = true];

  // represents a double parameter value.
  repeated double double_values = 4 [packed = true];

  // represents a bool parameter value.
  repeated bool bool_values = 6 [packed = true];

  // represents a int32 parameter value.
  repeated int32 int32_values = 7 [packed = true];

  // represents a int64 parameter value.
  repeated int64 int64_values = 8 [packed = true];

  // represents a uint32 parameter value.
  repeated uint32 uint32_values = 9 [packed = true];

  // represents a uint64 parameter value.
  repeated uint64 uint64_values = 10 [packed = true];
}

As you can see, BentoService defines a simple rpc Call that sends a Request message and returns a Response message.

A Request message takes in:

api_name: the name of the API function defined inside your BentoService.
oneof content: the field can be one of the following types:

Protobuf definition	IO Descriptor
Array representation via NDArray	bentoml.io.NumpyNdarray
Tabular data representation via DataFrame	bentoml.io.PandasDataFrame
Series representation via Series	bentoml.io.PandasSeries
File-like object via File	bentoml.io.File
`google.protobuf.StringValue`	bentoml.io.Text
`google.protobuf.Value`	bentoml.io.JSON
Complex payload via Multipart	bentoml.io.Multipart
Compact data format via serialized_bytes	(See below)

Note

Series is currently not yet supported.

The Response message will then return one of the aforementioned types as result.

Example: In the quickstart guide, we defined a classify API that takes in a bentoml.io.NumpyNdarray.

Therefore, our Request message would have the following structure:

Python

from bentoml.grpc.v1 import service_pb2 as pb

req = pb.Request(
    api_name="classify",
    ndarray=pb.NDArray(
        dtype=pb.NDArray.DTYPE_FLOAT, shape=(1, 4), float_values=[5.9, 3, 5.1, 1.8]
    ),
)

Go

package main

import (
	pb "github.com/bentoml/bentoml/grpc/v1"
)

var req = &pb.Request{
	ApiName: "classify",
	Content: &pb.Request_Ndarray{
		Ndarray: &pb.NDArray{
			Dtype:       *pb.NDArray_DTYPE_FLOAT.Enum(),
			Shape:       []int32{1, 4},
			FloatValues: []float32{3.5, 2.4, 7.8, 5.1},
		},
	},
}

C++

#include <vector>

#include "bentoml/grpc/v1/service.pb.h"

using bentoml::grpc::v1::BentoService;
using bentoml::grpc::v1::NDArray;
using bentoml::grpc::v1::Request;

std::vector<float> data = {3.5, 2.4, 7.8, 5.1};
std::vector<int> shape = {1, 4};

Request request;
request.set_api_name("classify");

NDArray *ndarray = request.mutable_ndarray();
ndarray->mutable_shape()->Assign(shape.begin(), shape.end());
ndarray->mutable_float_values()->Assign(data.begin(), data.end());

Java

import java.util.*;

// Shape and flattened values of the input array.
List<Integer> shape = Arrays.asList(1, 4);
List<Float> values = Arrays.asList(3.5f, 2.4f, 7.8f, 5.1f);

NDArray.Builder builder = NDArray.newBuilder()
    .addAllShape(shape)
    .addAllFloatValues(values)
    .setDtype(NDArray.DType.DTYPE_FLOAT);

// "classify" is the API name used throughout this example.
Request req = Request.newBuilder().setApiName("classify").setNdarray(builder).build();

Kotlin

val shape: List<Int> = listOf(1, 4)
val data: List<Float> = listOf(3.5f, 2.4f, 7.8f, 5.1f)

val ndarray = NDArray.newBuilder().addAllShape(shape).addAllFloatValues(data).build()
val req = Request.newBuilder().setApiName("classify").setNdarray(ndarray).build()

Node.js

const pb = require("./bentoml/grpc/v1/service_pb");

var ndarray = new pb.NDArray();
ndarray
  .setDtype(pb.NDArray.DType.DTYPE_FLOAT)
  .setShapeList([1, 4])
  .setFloatValuesList([3.5, 2.4, 7.8, 5.1]);
var req = new pb.Request();
req.setApiName("classify").setNdarray(ndarray);

Swift

import BentoServiceModel

var shape: [Int32] = [1, 4]
var data: [Float] = [3.5, 2.4, 7.8, 5.1]

let ndarray: Bentoml_Grpc_v1_NDArray = .with {
  $0.shape = shape
  $0.floatValues = data
  $0.dtype = Bentoml_Grpc_v1_NDArray.DType.float
}

let request: Bentoml_Grpc_v1_Request = .with {
  $0.apiName = "classify"
  $0.ndarray = ndarray
}

Array representation via `NDArray`#

Description: NDArray represents a flattened n-dimensional array of arbitrary type. It accepts the following fields:

dtype

The data type of the given input. This is an enum field that provides a one-to-one mapping from Protobuf data types to NumPy data types:

pb.NDArray.DType	numpy.dtype	Enum value
`DTYPE_UNSPECIFIED`	`None`	0
`DTYPE_FLOAT`	`np.float32`	1
`DTYPE_DOUBLE`	`np.double`	2
`DTYPE_BOOL`	`np.bool_`	3
`DTYPE_INT32`	`np.int32`	4
`DTYPE_INT64`	`np.int64`	5
`DTYPE_UINT32`	`np.uint32`	6
`DTYPE_UINT64`	`np.uint64`	7
`DTYPE_STRING`	`np.str_`	8

shape

A list of int32 that represents the shape of the flattened array. The bentoml.io.NumpyNdarray descriptor will then reshape the given payload into the expected shape.

Note that this value always takes precedence over the shape field in the bentoml.io.NumpyNdarray descriptor, meaning the array will be reshaped to this value first if given. Refer to bentoml.io.NumpyNdarray.from_proto() for implementation details.
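
To make the role of shape concrete, the following pure-Python sketch rebuilds a nested list from a flattened payload. It only mimics what a NumPy reshape does. The `reshape` helper is a hypothetical name used for illustration and is not BentoML's implementation.

```python
from math import prod


def reshape(flat, shape):
    """Rebuild a nested list from a flattened list and a non-empty target shape."""
    if prod(shape) != len(flat):
        raise ValueError(f"cannot reshape {len(flat)} values into shape {list(shape)}")
    if len(shape) == 1:
        return list(flat)
    # Each slice along the first axis holds prod(shape[1:]) values.
    step = prod(shape[1:])
    return [reshape(flat[i * step:(i + 1) * step], shape[1:]) for i in range(shape[0])]


# Flattened payload from the classify example, sent with shape (1, 4).
print(reshape([5.9, 3.0, 5.1, 1.8], [1, 4]))
```

A payload whose length does not match the product of the shape is rejected in this sketch, since there is no unambiguous way to lay the values out.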

string_values, float_values, double_values, bool_values, int32_values, int64_values, uint32_values, uint64_values

Each of the fields is a list of the corresponding data type. The list is a flattened array, and will be reconstructed together with the shape field into the original payload.

Each message should contain only ONE of the aforementioned fields per request.

The interaction among the above fields and dtype is as follows:

if dtype is not present in the message:
- If all of the fields are empty, we return a np.empty.
- Otherwise, we loop through all of the provided fields and allow only one field per message.
  
  If there is more than one field (e.g. string_values and float_values), we raise an error, as we don’t know how to deserialize the data.

otherwise:

We use the provided dtype-to-field map to get the data from the given message.

DType	field
`DTYPE_BOOL`	`bool_values`
`DTYPE_DOUBLE`	`double_values`
`DTYPE_FLOAT`	`float_values`
`DTYPE_INT32`	`int32_values`
`DTYPE_INT64`	`int64_values`
`DTYPE_STRING`	`string_values`
`DTYPE_UINT32`	`uint32_values`
`DTYPE_UINT64`	`uint64_values`

For example, if dtype is DTYPE_FLOAT, then the payload expects to have float_values field.
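
The selection rules above can be sketched in plain Python as follows. A dict stands in for the pb.NDArray message, and `DTYPE_TO_FIELD` and `pick_values` are hypothetical names chosen for this illustration. Refer to bentoml.io.NumpyNdarray.from_proto() for the actual behavior.

```python
# Mapping from the table above: DType name -> field that carries the values.
DTYPE_TO_FIELD = {
    "DTYPE_BOOL": "bool_values",
    "DTYPE_DOUBLE": "double_values",
    "DTYPE_FLOAT": "float_values",
    "DTYPE_INT32": "int32_values",
    "DTYPE_INT64": "int64_values",
    "DTYPE_STRING": "string_values",
    "DTYPE_UINT32": "uint32_values",
    "DTYPE_UINT64": "uint64_values",
}


def pick_values(message):
    """Select the value list of an NDArray-like dict following the rules above."""
    dtype = message.get("dtype", "DTYPE_UNSPECIFIED")
    if dtype != "DTYPE_UNSPECIFIED":
        # dtype is given: read the field that the dtype maps to.
        return list(message.get(DTYPE_TO_FIELD[dtype], []))
    # dtype is absent: at most one value field may be populated.
    populated = [field for field in DTYPE_TO_FIELD.values() if message.get(field)]
    if len(populated) > 1:
        raise ValueError(f"ambiguous payload, multiple fields set: {populated}")
    return list(message[populated[0]]) if populated else []


print(pick_values({"dtype": "DTYPE_FLOAT", "float_values": [5.4, 3.4, 1.5, 0.4]}))
print(pick_values({"int32_values": [1, 2, 3]}))
```

In this sketch an empty message yields an empty list, which plays the role of the empty array mentioned above.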

Python API

NumpyNdarray.from_sample(
   np.array([[5.4, 3.4, 1.5, 0.4]])
)

pb.NDArray

ndarray {
  dtype: DTYPE_FLOAT
  shape: 1
  shape: 4
  float_values: 5.4
  float_values: 3.4
  float_values: 1.5
  float_values: 0.4
}

API reference: bentoml.io.NumpyNdarray.from_proto()

Tabular data representation via `DataFrame`#

Description: DataFrame represents any tabular data type. Currently we only support the columns orientation since it is best for preserving the input order.

It accepts the following fields:

column_names

A list of strings that represents the column names of the given tabular data.

columns

A list of Series, where Series represents a series of arbitrary data type. The allowed fields for Series are similar to the ones in NDArray:
- one of [string_values, float_values, double_values, bool_values, int32_values, int64_values]

Python API

PandasDataFrame.from_sample(
    pd.DataFrame({
      "age": [3, 29],
      "height": [94, 170],
      "weight": [31, 115]
    }),
    orient="columns",
)

pb.DataFrame

dataframe {
  column_names: "age"
  column_names: "height"
  column_names: "weight"
  columns {
    int32_values: 3
    int32_values: 29
  }
  columns {
    int32_values: 94
    int32_values: 170
  }
  columns {
    int32_values: 31
    int32_values: 115
  }
}

API reference: bentoml.io.PandasDataFrame.from_proto()
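
As a rough illustration of the columns orientation, the sketch below pairs each entry of column_names with the Series at the same position. Plain dicts stand in for the protobuf messages, and `dataframe_to_columns` is a hypothetical helper, not a BentoML function.

```python
def dataframe_to_columns(message):
    """Convert a DataFrame-like dict into a {column name: values} mapping."""
    names = message["column_names"]
    columns = message["columns"]
    if len(names) != len(columns):
        raise ValueError(f"got {len(names)} column names for {len(columns)} columns")
    result = {}
    for name, series in zip(names, columns):
        # Each Series is expected to carry a single populated value field.
        populated = [values for values in series.values() if values]
        if len(populated) > 1:
            raise ValueError(f"column {name!r} sets more than one value field")
        result[name] = list(populated[0]) if populated else []
    return result


message = {
    "column_names": ["age", "height", "weight"],
    "columns": [
        {"int32_values": [3, 29]},
        {"int32_values": [94, 170]},
        {"int32_values": [31, 115]},
    ],
}
print(dataframe_to_columns(message))
```

The resulting mapping has the same layout as the dict passed to pd.DataFrame in the Python API example, with the column order taken from column_names.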

Series representation via `Series`#

Description: Series portrays a series of values. This can be used for representing Series types in tabular data.

It accepts the following fields:

string_values, float_values, double_values, bool_values, int32_values, int64_values

Similar to NumpyNdarray, each of the fields is a list of the corresponding data type. The list is a 1-D array and is then passed to pd.Series.

Each request should only contain ONE of the aforementioned fields.

The interaction among the above fields and dtype from PandasSeries is as follows:
- if dtype is not present in the descriptor:
  - If all of the fields are empty, we return an empty pd.Series.
  - Otherwise, we loop through all of the provided fields and allow only one field per message.
    
    If there is more than one field (e.g. string_values and float_values), we raise an error, as we don’t know how to deserialize the data.
- otherwise:
  - We use the provided dtype-to-field map to get the data from the given message.

Python API

PandasSeries.from_sample([5.4, 3.4, 1.5, 0.4])

pb.Series

series {
  float_values: 5.4
  float_values: 3.4
  float_values: 1.5
  float_values: 0.4
}

API reference: bentoml.io.PandasSeries.from_proto()

File-like object via `File`#

Description: File represents any arbitrary file type. This can be used to send in any file type, including images, videos, audio, etc.

Note

Currently both bentoml.io.File and bentoml.io.Image are using pb.File

It accepts the following fields:

content

A bytes field that represents the content of the file.
kind

An optional string field that represents the file type. If specified, an error is raised when it does not match the mime_type specified in bentoml.io.File.

Python API

File(mime_type="application/pdf")

pb.File

file {
  kind: "application/pdf"
  content: <bytes>
}

bentoml.io.Image will also be using pb.File.

Python API

Image(mime_type="image/png")

pb.File

file {
  kind: "image/png"
  content: <bytes>
}
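
The check on kind described above can be sketched as follows. The `check_kind` helper is hypothetical, and skipping the check when either side is unset is an assumption made for this illustration rather than documented BentoML behavior.

```python
def check_kind(kind, mime_type):
    """Return the effective MIME type, raising if kind and mime_type disagree."""
    # Assumption: the comparison only applies when both values are set.
    if kind is not None and mime_type is not None and kind != mime_type:
        raise ValueError(f"kind {kind!r} does not match mime_type {mime_type!r}")
    return kind if kind is not None else mime_type


print(check_kind("image/png", "image/png"))
print(check_kind(None, "application/pdf"))
```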

Complex payload via `Multipart`#

Description: Multipart represents a complex payload that can contain multiple different fields. It takes a fields attribute, which is a dictionary mapping each input name to its corresponding bentoml.io.IODescriptor.

Python API

Multipart(
   meta=Text(),
   arr=NumpyNdarray(
      dtype=np.float16,
      shape=[2,2]
   )
)

pb.Multipart

multipart {
   fields {
      key: "arr"
      value {
         ndarray {
            dtype: DTYPE_FLOAT
            shape: 2
            shape: 2
            float_values: 1.0
            float_values: 2.0
            float_values: 3.0
            float_values: 4.0
         }
      }
   }
   fields {
      key: "meta"
      value {
         text {
            value: "nlp"
         }
      }
   }
}

API reference: bentoml.io.Multipart.from_proto()

Compact data format via `serialized_bytes`#

The serialized_bytes field in both Request and Response is reserved for pre-established protocol encoding between client and server.

BentoML leverages the field to improve serialization performance between the BentoML client and server. Using the field directly is therefore not recommended.

Mounting Servicer#

gRPC service multiplexing enables us to mount additional custom servicers alongside BentoService and serve them on the same port.

service.py#

import bentoml

import route_guide_pb2
import route_guide_pb2_grpc
from servicer_impl import RouteGuideServicer

# iris_clf_runner is assumed to be defined as in the quickstart guide.
svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

# Fully qualified names of the mounted services, used for server reflection.
services_name = [
    v.full_name for v in route_guide_pb2.DESCRIPTOR.services_by_name.values()
]
svc.mount_grpc_servicer(
    RouteGuideServicer,
    add_servicer_fn=route_guide_pb2_grpc.add_RouteGuideServicer_to_server,
    service_names=services_name,
)

Serve your service with the bentoml serve-grpc command:

» bentoml serve-grpc service.py:svc --reload --enable-reflection

Now your RouteGuide service can also be accessed through localhost:3000.

Note

service_names is REQUIRED here, as this will be used for server reflection when --enable-reflection is passed to bentoml serve-grpc.

Mounting gRPC Interceptors#

Interceptors are a component of gRPC that allows us to intercept and interact with the proto message and service context either before or after the actual RPC call is sent or received by the client or server.

Interceptors are to gRPC what middleware is to HTTP. The most common use cases for interceptors are authentication, tracing, and access logs.

BentoML comes with a set of built-in async interceptors that provide support for access logs, OpenTelemetry, and Prometheus.

The following diagram demonstrates the flow of a gRPC request from client to server:

Since interceptors are executed in the order they are added, user interceptors will be executed after the built-in interceptors.
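
The ordering can be pictured with a small asyncio model. This is a simplified toy, not the grpc.aio.ServerInterceptor API: the real interface intercepts handler lookup through intercept_service, while the hypothetical `make_interceptor` and `build_chain` helpers below only show how interceptors added first end up wrapping those added later.

```python
import asyncio


def make_interceptor(name, log):
    """Build a toy interceptor that records when it runs around the next step."""

    async def intercept(continuation, request):
        log.append(f"{name}: before")
        response = await continuation(request)
        log.append(f"{name}: after")
        return response

    return intercept


def build_chain(interceptors, handler):
    """Wrap handler so that interceptors run in the order they were added."""
    for interceptor in reversed(interceptors):
        # Default arguments bind the current handler and interceptor.
        def wrapped(request, _next=handler, _interceptor=interceptor):
            return _interceptor(_next, request)

        handler = wrapped
    return handler


async def handler(request):
    return f"handled {request}"


log = []
# Built-in interceptors are registered first, user interceptors afterwards.
chain = build_chain(
    [make_interceptor("built-in", log), make_interceptor("user", log)],
    handler,
)
result = asyncio.run(chain("classify"))
print(log)
```

In this model the built-in interceptor sees the request first and the response last, with the user interceptor running inside it.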

User interceptors shouldn’t modify the existing headers and data of the incoming Request.

BentoML currently only supports async interceptors (via grpc.aio.ServerInterceptor, as opposed to grpc.ServerInterceptor). This is because the BentoML gRPC server is an async implementation of a gRPC server.

Note

If you are using grpc.ServerInterceptor, you will need to migrate it over to use the new grpc.aio.ServerInterceptor in order to use this feature.

Feel free to reach out to us at #support on Slack

To add your interceptors to an existing BentoService, use svc.add_grpc_interceptor:

service.py#

from custom_interceptor import CustomInterceptor

svc.add_grpc_interceptor(CustomInterceptor)

Note

add_grpc_interceptor also supports partial classes as well as interceptors that take multiple arguments:

multiple arguments

from metadata_interceptor import AppendMetadataInterceptor

svc.add_grpc_interceptor(AppendMetadataInterceptor, usage="NLP", accuracy_score=0.867)

partial method

from functools import partial

from metadata_interceptor import AppendMetadataInterceptor

svc.add_grpc_interceptor(partial(AppendMetadataInterceptor, usage="NLP", accuracy_score=0.867))
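
The two forms above are meant to configure the interceptor in the same way. The sketch below illustrates the idea with a stand-in class. It assumes that registration simply instantiates the given callable with the given keyword arguments. The `instantiate` helper and this minimal `AppendMetadataInterceptor` are hypothetical and do not reflect BentoML internals.

```python
from functools import partial


class AppendMetadataInterceptor:
    """Hypothetical interceptor that only stores its configuration."""

    def __init__(self, *, usage, accuracy_score):
        self.usage = usage
        self.accuracy_score = accuracy_score


def instantiate(interceptor_cls, **options):
    """Stand-in for the registration step: build the interceptor with its options."""
    return interceptor_cls(**options)


# Keyword arguments passed alongside the class.
with_kwargs = instantiate(AppendMetadataInterceptor, usage="NLP", accuracy_score=0.867)
# The same configuration bound ahead of time with functools.partial.
with_partial = instantiate(
    partial(AppendMetadataInterceptor, usage="NLP", accuracy_score=0.867)
)
print(vars(with_kwargs) == vars(with_partial))
```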

Recommendations#

gRPC is designed to be a high-performance framework for inter-service communication, which makes it a good fit for building microservices. The following are some recommendations we have for using gRPC for model serving:

Demystifying the misconception of gRPC vs. REST#

You might stumble upon articles comparing gRPC to REST, and you might get the impression that gRPC is a better choice than REST when building services. This is not entirely true.

gRPC is built on top of HTTP/2, which addresses some of the shortcomings of HTTP/1.1, such as head-of-line blocking and the limitations of HTTP pipelining. However, gRPC is not a replacement for REST, nor is it automatically the better choice for model serving. gRPC comes with its own set of trade-offs, such as:

Limited browser support: Browsers cannot call a gRPC service directly. You will end up using tools such as gRPCUI to interact with your service, or going through the hassle of implementing a gRPC client in your language of choice.
Binary protocol format: While Protobuf is efficient to send and receive over the wire, it is not human-readable. This means additional tooling is required for debugging and analyzing protobuf messages.
Knowledge gap: gRPC comes with its own concepts and learning curve, so teams must invest time in filling that knowledge gap before they can use gRPC effectively. This often adds friction and can reduce development agility.
Lack of support for additional content types: gRPC depends on protobuf, so its content types are restrictive compared with the out-of-the-box support offered by HTTP+REST.

Should I use gRPC instead of REST for model serving?#

Yes and no.

If your organization is already using gRPC for inter-service communication, using your Bento with gRPC is a no-brainer. You will be able to seamlessly integrate your Bento with your existing gRPC services without having to worry about the overhead of implementing grpc-gateway.

However, if your organization is not using gRPC, we recommend that you keep using REST for model serving. REST is a well-known and well-understood protocol, so there is no knowledge gap for your team, which improves developer agility and shortens time to market.

Performance tuning#

BentoML allows users to tune gRPC performance through the api_server.grpc section of bentoml_configuration.yaml.

A quick overview of the available configuration for gRPC:

bentoml_configuration.yaml#

api_server:
  grpc:
    host: 0.0.0.0
    port: 3000
    max_concurrent_streams: ~
    maximum_concurrent_rpcs: ~
    max_message_length: -1
    reflection:
      enabled: false
    metrics:
      host: 0.0.0.0
      port: 3001

`max_concurrent_streams`#

Definition: Maximum number of concurrent incoming streams to allow on an HTTP/2 connection.

By default, no limit is set. HTTP/2 connections typically limit the number of concurrent streams that can be open on a connection at one time.

`maximum_concurrent_rpcs`#

Definition: The maximum number of concurrent RPCs this server will service before returning RESOURCE_EXHAUSTED status.

By default, this is set to None to indicate no limit, which lets gRPC decide the limit.

`max_message_length`#

Definition: The maximum message length, in bytes, that the server can receive or send.

By default, this is set to -1 to indicate no limit. Setting a message size limit through this option is a way to prevent gRPC from consuming excessive resources. gRPC applies the limit per message, to both inbound and outbound messages.
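To make the three settings more concrete, here is a minimal sketch of how such a configuration could be translated into arguments for a grpc.aio server. The helper `build_server_kwargs` is hypothetical and the mapping is an assumption, not a description of BentoML's internals. The option keys `grpc.max_concurrent_streams`, `grpc.max_receive_message_length`, and `grpc.max_send_message_length` are standard gRPC channel arguments, and `maximum_concurrent_rpcs` is a keyword argument of `grpc.aio.server`.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_server_kwargs(grpc_config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: map the api_server.grpc settings to server arguments."""
    options: List[Tuple[str, int]] = []

    # A YAML `~` is read as None, meaning no explicit stream cap is passed.
    max_streams: Optional[int] = grpc_config.get("max_concurrent_streams")
    if max_streams is not None:
        options.append(("grpc.max_concurrent_streams", max_streams))

    # -1 means unlimited; the same value is applied to both directions.
    max_len: int = grpc_config.get("max_message_length", -1)
    options.append(("grpc.max_receive_message_length", max_len))
    options.append(("grpc.max_send_message_length", max_len))

    return {
        "options": options,
        # None leaves the number of concurrent RPCs unbounded.
        "maximum_concurrent_rpcs": grpc_config.get("maximum_concurrent_rpcs"),
    }


# Defaults taken from the configuration overview above.
defaults = {
    "max_concurrent_streams": None,
    "maximum_concurrent_rpcs": None,
    "max_message_length": -1,
}
server_kwargs = build_server_kwargs(defaults)
```

With grpcio installed, the resulting dictionary could be passed as `grpc.aio.server(**server_kwargs)`. The sketch itself does not import grpc.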

We also recommend reading the gRPC performance best practices guide.
