Concurrency#
Concurrency refers to the ability of a BentoML Service to process multiple requests simultaneously. It directly impacts the throughput, latency, and overall performance of machine learning models in production. Proper concurrency management ensures that a Service can handle varying loads efficiently, maximizing resource utilization while maintaining quality of service (QoS).
This document explains how to configure concurrency for BentoML Services.
Configure concurrency#
To specify concurrency for a BentoML Service, use the concurrency field in traffic within the @bentoml.service decorator when setting configurations:
import bentoml

@bentoml.service(
    traffic={
        "concurrency": 1,  # An integer value
    }
)
class MyService:
    ...
Key points about concurrency in BentoML:
- concurrency was introduced in BentoML 1.2.8. It represents the ideal number of requests that a BentoML Service (namely, all workers in the Service) can process simultaneously. By default, BentoML does not impose a limit on concurrency to avoid bottlenecks.
- If your Service supports adaptive batching or continuous batching, set concurrency to match the batch size. This aligns processing capacity with batch requirements, optimizing throughput.
- If a Service spawns multiple workers to leverage the parallelism of the underlying hardware accelerators (for example, multi-device GPUs), set concurrency to the degree of parallelism the devices can support.
- For Services designed to handle one request at a time, set concurrency to 1, ensuring that requests are processed sequentially without overlap.
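The guidelines above can be sketched as a small helper that picks a concurrency value. This is only an illustration: suggest_concurrency and its parameters max_batch_size and parallel_slots are hypothetical names, not part of BentoML, and the resulting dictionary is simply the shape of the traffic setting shown earlier.

```python
from typing import Optional


def suggest_concurrency(
    max_batch_size: Optional[int] = None,
    parallel_slots: Optional[int] = None,
) -> int:
    """Hypothetical helper: choose a concurrency value from the guidelines above."""
    for value in (max_batch_size, parallel_slots):
        if value is not None and value < 1:
            raise ValueError("sizes must be positive integers")
    if max_batch_size is not None:
        # Batching Service: match the batch size.
        return max_batch_size
    if parallel_slots is not None:
        # Multi-worker or multi-device Service: match the supported parallelism.
        return parallel_slots
    # Service that handles one request at a time.
    return 1


# Assumed usage: this dict has the same shape as the traffic= argument
# passed to @bentoml.service in the example above.
traffic = {"concurrency": suggest_concurrency(max_batch_size=32)}
print(traffic)
```

The helper gives the batch size priority over device parallelism. That ordering is a choice made for this sketch, not a rule stated by BentoML.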
Concurrency and max concurrency#
When using the traffic field in the @bentoml.service decorator, you can configure both concurrency and max_concurrency. Both settings relate to how many requests a Service can handle simultaneously.
@bentoml.service(
    traffic={
        "concurrency": 5,
        "max_concurrency": 10,
    }
)
class MyService:
    ...
Note that they serve different purposes:
- concurrency: Indicates the ideal number of simultaneous requests that a Service is designed to handle efficiently. It’s a guideline for optimizing performance, particularly in terms of how batching or parallel processing is implemented. This means that the number of requests a Service instance processes simultaneously can still exceed the configured concurrency value.
- max_concurrency: Acts as a hard limit on the number of requests that can be processed simultaneously by a single instance of a Service. It’s used to prevent a Service from being overwhelmed by too many requests at once, which could degrade performance or lead to resource exhaustion. Requests that exceed the max_concurrency limit are rejected to maintain QoS and ensure that each request is handled within an acceptable time frame. Note that starting from BentoML 1.2.8, max_concurrency applies to the aggregate of all workers within a Service. In earlier versions, it applies on a per-worker basis.
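The difference between the soft target and the hard limit can be illustrated with a toy admission model. This is a minimal sketch in plain Python, not BentoML's implementation: the AdmissionSketch class and its methods are hypothetical, and it models only the behavior described above (in-flight requests may exceed concurrency, while requests beyond max_concurrency are rejected).

```python
class AdmissionSketch:
    """Toy model of a soft target (concurrency) and a hard cap (max_concurrency)."""

    def __init__(self, concurrency: int, max_concurrency: int) -> None:
        self.concurrency = concurrency
        self.max_concurrency = max_concurrency
        self.in_flight = 0

    def try_accept(self) -> bool:
        """Admit a request unless the hard cap has already been reached."""
        if self.in_flight >= self.max_concurrency:
            return False  # rejected: hard limit reached
        self.in_flight += 1
        return True

    def finish(self) -> None:
        """Mark one in-flight request as completed."""
        if self.in_flight > 0:
            self.in_flight -= 1

    def above_target(self) -> bool:
        """True when load exceeds the ideal (soft) concurrency."""
        return self.in_flight > self.concurrency


# Same values as the example above: concurrency=5, max_concurrency=10.
gate = AdmissionSketch(concurrency=5, max_concurrency=10)
results = [gate.try_accept() for _ in range(12)]
accepted = results.count(True)
rejected = results.count(False)
print(accepted, rejected, gate.above_target())
```

In this model, requests 6 through 10 are still admitted even though they push the load above the concurrency target. Only requests 11 and 12 are turned away by the hard cap.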
Concurrency-based autoscaling#
To use concurrency-based autoscaling on BentoCloud, see Autoscaling.