Understanding BentoML adaptive micro batching

1. The overall architecture of BentoML’s micro-batching server

1.1 Why micro batching matters

While serving a TensorFlow model, batching individual model inference requests together can be important for performance. In particular, batching is necessary to unlock the high throughput promised by hardware accelerators such as GPUs.


In addition, under BentoML’s architecture, HTTP handling and data preprocessing also benefit from micro-batching.

1.2 Architecture & Data Flow

[Figure: data flow. User clients (user1 to user4) send HTTP requests to the bento batching service, where "marshal inbound" hands packed requests to the "cork" dispatcher. A load-balancing node (lb) then passes merged data to the input adapter and user defined handler of each of three bento model services.]

1.3 Parameters & Concepts of micro batching

  • inbound requests: requests from user clients

  • outbound requests: requests to upstream model servers

  • mb_max_batch_size: The maximum size of any batch. This parameter governs the throughput/latency tradeoff, and also prevents batches from growing so large that they exceed a resource constraint (e.g. the GPU memory needed to hold a batch’s data). Default: 1000.

  • mb_max_latency: The latency goal of your service, in milliseconds. Default: 300.

  • outbound semaphore: The semaphore represents the degree of parallelism, i.e. the maximum number of batches processed concurrently. It is set automatically when the bento service is launched, to the number of model server workers.

  • Estimated time: The estimated time for the model server to execute a batch, inferred from historical data and the size of the batch currently in the queue.
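
As an illustration of how an estimated time could be inferred from historical data and batch size, here is a minimal sketch that assumes execution time grows linearly with batch size. The class name LinearTimeEstimator, its methods, and the linear model itself are assumptions made for this example; they are not taken from BentoML's implementation.

```python
class LinearTimeEstimator:
    """Hypothetical estimator: execution time ~= slope * batch_size + intercept."""

    def __init__(self):
        # History of (batch_size, observed_ms) pairs from finished batches.
        self.samples = []

    def record(self, batch_size, observed_ms):
        self.samples.append((batch_size, observed_ms))

    def estimate(self, batch_size):
        """Return the estimated execution time in ms for a batch of this size."""
        if not self.samples:
            return 0.0  # no history yet, so no basis for an estimate
        n = len(self.samples)
        mean_x = sum(x for x, _ in self.samples) / n
        mean_y = sum(y for _, y in self.samples) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in self.samples)
        if var_x == 0:
            # Every sample has the same batch size: fall back to the mean time.
            return mean_y
        # Ordinary least-squares fit of a straight line through the history.
        slope = sum((x - mean_x) * (y - mean_y) for x, y in self.samples) / var_x
        intercept = mean_y - slope * mean_x
        return slope * batch_size + intercept
```

A dispatcher could call record() after every finished batch and estimate() with the current queue length before deciding whether to release.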

1.4 Sequence & How it works

Take a bento service with a single API and --workers=1 as an example:

[Figure: sequence diagram over time with three lanes: http client, batching service, bento model service. Requests 1, 2 and 3 reach the batching service and are sent to the model service as merged_request{1,2,3}. Request 4 arrives while that batch is being processed. The model service returns merged_response{1,2,3}, which the batching service sends back as responses 1, 2 and 3. Requests 5 and 6 then arrive, and requests 4, 5 and 6 are sent as merged_request{4,5,6}.]

To achieve optimal efficiency, the CORK dispatcher uses adaptive control to cork and release inbound requests. A release happens when:

  • one of the following conditions is met:

    • the time already waited plus the estimated time exceeds mb_max_latency, OR

    • it is not worth waiting for the next inbound request

  • AND the outbound semaphore is not locked

A large mb_max_latency does not mean that every request takes that long to answer. The algorithm chooses an adaptive wait time between 0 and mb_max_latency. Under heavy request pressure, however, more responses will take close to mb_max_latency.

In each release, the number of released requests is decided by the algorithm, but it never exceeds mb_max_batch_size.

If the outbound semaphore is still locked, requests may be canceled once they have waited for mb_max_latency.
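
The release rule described above can be sketched as a small predicate. This is only an illustration under stated assumptions: the function names, the argument names, and the worth_waiting flag (a stand-in for the dispatcher's own judgment about whether to wait for the next request) are hypothetical and not part of BentoML's API.

```python
def should_release(waited_ms, estimated_ms, semaphore_locked,
                   worth_waiting, mb_max_latency=300):
    """Decide whether the corked requests should be released now."""
    # A locked outbound semaphore blocks every release.
    if semaphore_locked:
        return False
    # Condition 1: waiting any longer would break the latency goal.
    deadline_reached = waited_ms + estimated_ms > mb_max_latency
    # Condition 2: the dispatcher judges that waiting is not worthwhile.
    # How that judgment is made is not specified here (assumed input).
    return deadline_reached or not worth_waiting


def release_count(queued, mb_max_batch_size=1000):
    """Number of requests to release: capped by mb_max_batch_size."""
    return min(queued, mb_max_batch_size)
```

For example, with the default mb_max_latency of 300, a queue that has waited 250 ms with an estimated execution time of 100 ms is released as soon as the semaphore is free.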

1.5 The main design decisions and tradeoffs

Throughput and latency are the main concerns for API servers. BentoML tunes batches automatically to achieve the following, in order of priority:

  • Honor the user-defined constraints mb_max_batch_size and mb_max_latency.

  • Maximize throughput.

  • Minimize average latency.

2. Parameter tuning best practices & recommendations

Unlike TensorFlow Serving, BentoML automatically adjusts the batch size and wait timeout to balance throughput against latency, and it responds to fluctuations in server load.

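import bentoml
from bentoml.adapters import DataframeInput  # input adapter chosen for illustration

class MovieReviewService(bentoml.BentoService):
    @bentoml.api(input=DataframeInput(),
                 mb_max_latency=300, mb_max_batch_size=1000)
    def predict(self, inputs): ...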

By default, mb_max_batch_size is 1000 and mb_max_latency is 300 (milliseconds).

  • If the GPU memory only allows inputs with a batch size of up to 100, set mb_max_batch_size to 100.

  • If the clients using your API have a request timeout of 200ms, set mb_max_latency to 200.

  • If you know that your model executes very slowly (for example, its latency is more than 100ms), then raising mb_max_latency to 10 * 100ms will help achieve higher throughput.
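
The three rules of thumb above can be combined in a small helper. This is a sketch only: the function suggest_batching_params and its arguments are hypothetical, and the precedence it applies when rules conflict (a client timeout caps the latency suggested for a slow model) is an assumption, not something the guidelines state.

```python
def suggest_batching_params(gpu_batch_limit=None, client_timeout_ms=None,
                            model_latency_ms=None):
    """Return suggested mb_max_batch_size / mb_max_latency values."""
    max_batch_size = 1000  # default from the text above
    max_latency = 300      # default from the text above, in milliseconds

    # Rule 1: never batch more than the GPU can hold.
    if gpu_batch_limit is not None:
        max_batch_size = gpu_batch_limit

    # Rule 3: for a slow model (more than 100 ms), allow about 10x its latency.
    if model_latency_ms is not None and model_latency_ms > 100:
        max_latency = 10 * model_latency_ms

    # Rule 2: stay within the client's request timeout.
    # Assumption: the client timeout wins if it conflicts with rule 3.
    if client_timeout_ms is not None:
        max_latency = min(max_latency, client_timeout_ms)

    return {"mb_max_batch_size": max_batch_size, "mb_max_latency": max_latency}
```

The returned values are meant to be passed to the API decorator by hand; nothing here talks to BentoML itself.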

3. How to implement batch mode for custom input adapters

TL;DR: Implement the method handle_batch_request(requests), following the existing input adapters.

The batching service currently works at the level of HTTP requests, which keeps it mostly transparent to developers. handle_batch_request differs from handle_request in only two ways:

  • the input parameter is a list of request objects

  • the return value should be a list of response objects
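
To make the list-in, list-out contract concrete, here is a minimal sketch. SimpleRequest, SimpleResponse and JsonBatchAdapter are hypothetical stand-ins written for this example; they are not BentoML classes, and a real adapter should follow the existing input adapters in the code base.

```python
import json
from typing import Callable, List, NamedTuple


class SimpleRequest(NamedTuple):
    """Hypothetical request object carrying a raw JSON body."""
    data: bytes


class SimpleResponse(NamedTuple):
    """Hypothetical response object."""
    status: int
    data: bytes


class JsonBatchAdapter:
    """Illustrative adapter: decodes JSON bodies and calls the handler once."""

    def __init__(self, func: Callable[[list], list]):
        # func is the user handler: a list of inputs in, a list of outputs out.
        self.func = func

    def handle_request(self, request: SimpleRequest) -> SimpleResponse:
        # A single request is just a batch of one.
        return self.handle_batch_request([request])[0]

    def handle_batch_request(self, requests: List[SimpleRequest]) -> List[SimpleResponse]:
        inputs = [json.loads(req.data) for req in requests]
        outputs = self.func(inputs)  # one handler call for the whole batch
        # One response per request, in the same order as the inputs.
        return [SimpleResponse(200, json.dumps(out).encode()) for out in outputs]
```

The design point is that the user handler runs once per batch, and the adapter is responsible for splitting the result back into one response per request.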

To get the most out of micro-batching, use the batch variant of each operation from the start. For example, each call to pd.read_csv or pd.read_json has a roughly constant cost of about 2ms, so code like this

def handle_batch_request(self, requests):
    dfs = []
    for req in requests:
        # ... one pd.read_csv call per request
    # ...

will take O(N) time, about N * 2ms for N requests. For this reason we implemented a nearly O(1) function that concatenates the CSV strings of the DataFrames, so that all DataFrames in a batch of requests can be loaded with a single call to pd.read_csv.
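
The idea of merging CSV payloads before parsing can be sketched with the standard library alone. The function concat_csv_bodies is a hypothetical illustration, not the function BentoML ships. It assumes that every request carries the same header row and that no field contains an embedded newline.

```python
def concat_csv_bodies(bodies):
    """Merge CSV payloads that share one header into a single CSV string.

    Returns (merged_csv, row_counts), where row_counts[i] is the number of
    data rows contributed by bodies[i], so results can be split up again.
    """
    header = None
    rows = []
    row_counts = []
    for body in bodies:
        # Drop blank lines; assumes no quoted field spans several lines.
        lines = [line for line in body.splitlines() if line.strip()]
        if not lines:
            row_counts.append(0)
            continue
        if header is None:
            header = lines[0]
        elif lines[0] != header:
            raise ValueError("all requests must share the same CSV header")
        rows.extend(lines[1:])
        row_counts.append(len(lines) - 1)
    if header is None:
        return "", row_counts
    merged = "\n".join([header] + rows) + "\n"
    return merged, row_counts
```

The merged string can then be parsed once, for example with pd.read_csv(io.StringIO(merged)), and row_counts used to slice the predictions back into one response per request.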

4. Comparison

4.1 TensorFlow Serving

TensorFlow Serving employs a similar approach to batch individual requests together, but its batch scheduling parameters are static. Assume your model has 1 ms latency. If you enable batching and configure it with batch_timeout_micros = 300 * 1000, then whenever a batch does not fill up before the timeout, its requests wait out the full timeout whether that is necessary or not, and their latency becomes about 300ms + 1ms.

You need to fine-tune these parameters through experiments before deployment. Once deployed, they no longer change.

"The best values to use for the batch scheduling parameters depend on your model, system and environment, as well as your throughput and latency goals. Choosing good values is best done via experiments. Here are some guidelines that may be helpful in selecting values to experiment with."

TensorFlow Serving Batching Guide


4.2 Clipper

Clipper applies a combination of TCP Nagle's algorithm and an AIMD algorithm. This approach is closer to BentoML's; the differences lie in the scheduling algorithm and the optimization goal.

"To automatically find the optimal maximum batch size for each model container we employ an additive-increase-multiplicative-decrease (AIMD) scheme."

Clipper: A Low-Latency Online Prediction Serving System

Clipper has an SLO (service level objective) parameter, similar to mb_max_latency. The optimization goal of its AIMD scheme is to maximize throughput within the bound of the SLO.

Therefore, in most cases, Clipper has higher latency than BentoML, which also means it can serve fewer users at the same time.
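
For readers unfamiliar with AIMD, the general scheme mentioned in the Clipper quote can be sketched as follows. This is a generic illustration of additive increase and multiplicative decrease, not Clipper's code; the function name aimd_step and the increase and backoff values are assumptions chosen for the example.

```python
def aimd_step(batch_size, observed_latency_ms, slo_ms, increase=2, backoff=0.9):
    """Return the next maximum batch size under a generic AIMD rule."""
    if observed_latency_ms <= slo_ms:
        # Additive increase: the last batch met the SLO, so probe a larger one.
        return batch_size + increase
    # Multiplicative decrease: the SLO was missed, so back off proportionally,
    # but never drop below a batch of one.
    return max(1, int(batch_size * backoff))
```

Because such a scheme keeps growing the batch until latency touches the SLO, it tends to operate near the latency bound, which is the behavior the comparison above refers to.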