Batching is the term used for combining multiple inputs for submission to processing at the same time. The idea is that processing multiple messages together is faster than processing each message individually. In practice, many ML frameworks are optimized for processing multiple messages at a time, because that is how the underlying hardware works in many cases.
“While serving a TensorFlow model, batching individual model inference requests together can be important for performance. In particular, batching is necessary to unlock the high throughput promised by hardware accelerators such as GPUs.” – TensorFlow documentation
The current batching feature is implemented on the server side. This is advantageous compared to client-side batching because it simplifies the client's logic, and it is often more efficient because the server sees the full traffic volume.
As an optimization for a real-time service, batching relies on two main concepts.
Batching Window: The maximum time that the service will wait to build a batch before releasing it for processing. This is essentially the maximum added latency in a low-throughput system. It avoids the situation where, because few messages have been submitted (fewer than the max batch size), the batch would otherwise wait a long time to be processed.
Max Batch Size: The maximum size a batch can reach before it is released for processing. It caps the size of the batch, which optimizes for maximum throughput. The cap only applies within the maximum wait time before the batch is released.
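To make the interplay of these two concepts concrete, here is a minimal sketch of how a batcher could group requests. This is purely illustrative (the real server batches asynchronously as requests arrive); `form_batches` is a hypothetical helper, not a BentoML API.

```python
def form_batches(arrival_times, max_batch_size, batching_window):
    """Greedily group messages into batches.

    A batch is released when it reaches max_batch_size, or when the
    oldest message queued in it has waited longer than the batching
    window. Illustrative sketch only, operating on a precomputed list
    of arrival timestamps (in seconds).
    """
    batches = []
    current = []
    for t in arrival_times:
        if current and t - current[0] > batching_window:
            # The oldest queued message has exceeded the window: release.
            batches.append(current)
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            # Size cap reached: release immediately for throughput.
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches
```

With a 100 ms window and a max batch size of 4, three requests arriving close together form one batch, while a request arriving much later starts a new one.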
BentoML's adaptive batching builds on these two basic concepts: it adapts both the batching window and the max batch size to the incoming traffic patterns at the time. The dispatching mechanism regresses recent processing times, wait times, and batch sizes to optimize for the lowest latency.
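As a rough sketch of the regression step, the dispatcher can fit recent observations to a model such as latency ≈ a · batch_size + b, and use the fitted coefficients to decide whether waiting for a larger batch is worth the extra queueing time. The least-squares fit below is a simplified stand-in for BentoML's actual optimizer, and `fit_latency_model` is a hypothetical name:

```python
def fit_latency_model(batch_sizes, latencies):
    """Least-squares fit of latency ~= a * batch_size + b over recent
    (batch size, processing time) observations. A simplified stand-in
    for the regression the adaptive dispatcher performs.
    """
    n = len(batch_sizes)
    mean_x = sum(batch_sizes) / n
    mean_y = sum(latencies) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(batch_sizes, latencies))
    var = sum((x - mean_x) ** 2 for x in batch_sizes)
    a = cov / var          # marginal cost of one more item in the batch
    b = mean_y - a * mean_x  # fixed per-batch overhead
    return a, b
```

A large fixed overhead `b` relative to the marginal cost `a` favors larger batches; the dispatcher can weigh this against the wait time needed to accumulate them.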
The batching mechanism is located on the model runner. Each model runner receives inference requests and batches those requests based on optimal latency.
The load balancer distributes requests to each of the running API services, which in turn distribute the inference requests to the model runners. The distribution of requests to the model runners uses a random algorithm, which yields slightly more efficient batch sizes than round robin. Additional dispatch algorithms are planned for the future.
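The random dispatch described above can be sketched as follows; `dispatch` and the list-of-queues representation are illustrative assumptions, not BentoML's internal interface:

```python
import random

def dispatch(requests, runner_queues, rng=None):
    """Assign each incoming request to a runner queue uniformly at
    random (as opposed to round robin). `runner_queues` is a list of
    lists, one queue per model runner.
    """
    rng = rng or random.Random()
    for req in requests:
        rng.choice(runner_queues).append(req)
    return runner_queues
```

Unlike round robin, random dispatch produces uneven queue lengths at any instant, which lets some runners accumulate fuller (and therefore more efficient) batches.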
Running with Adaptive Batching
There are two ways adaptive batching runs, depending on how you've deployed BentoML.
In the standard BentoML library, each model runner is its own process. In this case, batching happens at the process level.
For a Yatai deployment into Kubernetes, each model runner is structured as its own Pod, and batching occurs at the Pod level.
The main configuration concern is the way each input is combined when batching occurs. We call this the "batch axis". When configuring a model runner for batching, the batch axis must be specified.
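To make the batch axis concrete, here is a small sketch (using NumPy directly, not BentoML itself) of what batching along axis 0 means: individual request inputs are concatenated along the batch axis before inference, and the output is split back along the same axis so each client receives only its own result.

```python
import numpy as np

# Two single-request inputs, each shaped (1, 4): one sample, four features.
req_a = np.array([[1.0, 2.0, 3.0, 4.0]])
req_b = np.array([[5.0, 6.0, 7.0, 8.0]])

# With the batch axis set to 0, the server concatenates requests along
# axis 0 into a single (2, 4) array before calling the model...
batch = np.concatenate([req_a, req_b], axis=0)

# ...and splits the result back along the same axis afterward, so each
# client gets only its own rows.
out_a, out_b = np.split(batch, 2, axis=0)
```

If a model instead expected features along the first dimension and samples along the second, the batch axis would need to be configured accordingly, since concatenating on the wrong axis would merge requests into a single malformed input.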
Topics covered:

- Why batching?
- How does dynamic batching work?
- Using dynamic batching
- Runnable input/output types and setting the batch dimension
- Custom Runnable data types
- Why the order of requests must be preserved
- Error handling
- Configuring batch parameters (max batch size, max latency)