Ray
Ray is a popular open-source compute framework that makes it easy to scale Python workloads. BentoML integrates natively with Ray Serve, a library built to scale API services on Ray clusters, enabling users to deploy Bento applications to a Ray cluster without modifying code or configuration.
The central API of the Ray Serve integration in BentoML is bentoml.ray.deployment, which seamlessly converts a Bento into a Ray Serve Deployment. In its simplest form, only a Bento tag is required to create a Deployment.
import bentoml

# Convert the Bento into a Ray Serve Deployment
classifier = bentoml.ray.deployment('iris_classifier:latest')
Assuming the snippet above is saved as bento_ray.py, the Ray Serve Deployment can then be run locally or deployed to a Ray cluster using Ray Serve's run command.
serve run bento_ray:classifier
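Once the service is running, it can be queried over HTTP. A minimal sketch, assuming Ray Serve's default port 8000 and a classify endpoint exposed by the Bento (both are assumptions about your setup, not guarantees of the integration):

import requests

# Assumption: the API server listens on Ray Serve's default HTTP port (8000)
# and the Bento exposes a `classify` API that accepts a list of feature vectors.
response = requests.post(
    "http://127.0.0.1:8000/classify",
    json=[[5.1, 3.5, 1.4, 0.2]],
)
print(response.text)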
Scaling Resources and Autoscaling
The bentoml.ray.deployment API also supports configuring scaling resources and autoscaling behaviors. In addition to the Bento tag, service_deployment_config and runner_deployment_config arguments can be passed in to configure the Deployments of the API server and Runners, respectively.
Any parameter accepted by a Ray Serve Deployment can be specified in service_deployment_config and runner_deployment_config. In runner_deployment_config, the Runner name should be used as the key.
import bentoml

classifier = bentoml.ray.deployment(
    'iris_classifier:latest',
    # service_deployment_config: options for the API server Deployment
    {
        "route_prefix": "/classify",
        "num_replicas": 3,
        "ray_actor_options": {
            "num_cpus": 1
        }
    },
    # runner_deployment_config: options for each Runner Deployment, keyed by Runner name
    {
        "iris_clf": {
            "num_replicas": 1,
            "ray_actor_options": {
                "num_cpus": 5
            }
        }
    }
)
Note
Arguments in the service_deployment_config and runner_deployment_config dictionaries are passed through directly to the Ray Serve Deployment. Please refer to Ray Serve Scaling and Resource Allocation for the full list of supported arguments.
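Because these dictionaries are forwarded verbatim, Ray Serve's autoscaling options can be supplied the same way. A minimal sketch, assuming your Ray Serve version accepts an autoscaling_config with min_replicas and max_replicas (option names are Ray Serve's, not BentoML's; verify them against your Ray version):

import bentoml

classifier = bentoml.ray.deployment(
    'iris_classifier:latest',
    # Assumption: autoscaling_config is passed through unchanged to the
    # Ray Serve Deployment; replicas then scale between 1 and 5 with load.
    {
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 5
        }
    }
)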
Batching
Batching behaviors can be configured through the enable_batching and batching_config arguments. Using the Runner name as the key, both max_batch_size and batch_wait_timeout_s can be configured for each Runner independently through batching_config. Within each Runner entry, the configuration is further keyed by the Runner method name.
import bentoml

deploy = bentoml.ray.deployment(
    'iris_classifier:latest',
    enable_batching=True,
    # batching_config: keyed by Runner name, then by Runner method name
    batching_config={
        "iris_clf": {
            "predict": {
                "max_batch_size": 1024,
                "batch_wait_timeout_s": 0.2
            }
        }
    }
)
Note
Arguments in the batching_config dictionary are passed through directly to Ray Serve. Please refer to Ray Serve Batching for the full list of supported arguments.
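Batching happens on the server side, so its effect only shows under concurrent load: Ray Serve can group up to max_batch_size concurrent calls per Runner invocation, while batch_wait_timeout_s bounds how long a partial batch waits. A hypothetical load sketch (the endpoint URL and payload shape are assumptions carried over from the earlier example):

from concurrent.futures import ThreadPoolExecutor

import requests

def classify(features):
    # Assumption: the service runs locally on Ray Serve's default port and
    # exposes a /classify route, as in the earlier example.
    return requests.post("http://127.0.0.1:8000/classify", json=[features]).text

# Fire 64 requests concurrently so the server has batches to form.
samples = [[5.1, 3.5, 1.4, 0.2]] * 64
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(classify, samples))
print(len(results), "responses received")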
Reference
See the API references to learn more about the Ray Serve integration in BentoML.