Serving on GPU

It is widely recognized in both academia and industry that GPUs offer substantial speed and efficiency advantages over CPU-based platforms for both training and inference, as NVIDIA has demonstrated.

Almost every deep learning framework (TensorFlow, PyTorch, ONNX, etc.) supports GPUs. This guide demonstrates how to serve your BentoService with a GPU.


Prerequisites

  • GNU/Linux x86_64 with kernel version >=3.10 (run uname -a to check)

  • Docker >=19.03

  • NVIDIA GPU with compute capability >=3.0 (check yours on NVIDIA's website)

NVIDIA Drivers

Make sure you have installed the NVIDIA driver for your Linux distribution. The recommended way to install drivers is through your distribution's package manager, but other alternatives are also available.

For instructions on how to use your package manager to install drivers from the CUDA network repository, follow this guide.

NVIDIA Container Toolkit

See also

NVIDIA provides detailed instructions for installing both Docker CE and nvidia-docker. Refer to the nvidia-docker wiki for more information.


Arch Linux users can install nvidia-docker from the AUR.


A recent systemd cgroup re-architecture, described in issue #1447, completely breaks nvidia-docker. This issue is confirmed to be patched in future releases.

Debian-based OS

Disable the unified cgroup hierarchy by adding systemd.unified_cgroup_hierarchy=0 to GRUB_CMDLINE_LINUX_DEFAULT:

GRUB_CMDLINE_LINUX_DEFAULT="quiet systemd.unified_cgroup_hierarchy=0"

Other OS

Change #no-cgroups=false to no-cgroups=true in /etc/nvidia-container-runtime/config.toml.
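As a sketch, the edit can be automated with sed. The snippet below demonstrates it on a throwaway copy of the file; apply the same substitution to the real /etc/nvidia-container-runtime/config.toml with sudo.

```shell
# Demonstrate the edit on a temporary copy of the config file.
cfg=$(mktemp)
printf '[nvidia-container-cli]\n#no-cgroups=false\n' > "$cfg"
# Uncomment the setting and flip it to true.
sed -i 's/^#no-cgroups=false/no-cgroups=true/' "$cfg"
cat "$cfg"
```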


Then add the NVIDIA device files to your service definition:

# docker-compose.yaml
devices:
  - /dev/nvidia0:/dev/nvidia0
  - /dev/nvidiactl:/dev/nvidiactl
  - /dev/nvidia-modeset:/dev/nvidia-modeset
  - /dev/nvidia-uvm:/dev/nvidia-uvm
  - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools

Framework Support for GPU Inference

Jump to TensorFlow Implementation | PyTorch Implementation | ONNX Implementation


To keep this guide concise, the examples shown here are minimal demonstrations of how GPU inference works in each framework.

See also

Please refer to BentoML’s gallery for more detailed GPU serving use cases.



As of 0.13.0, inference across multiple GPUs is not yet supported (support is on our roadmap).
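In the meantime, you can pin a service to a single device with the standard CUDA_VISIBLE_DEVICES environment variable, which restricts which GPUs a CUDA process can see:

```shell
# Expose only GPU 0 to the process; CUDA enumerates devices from this list.
export CUDA_VISIBLE_DEVICES=0
echo "$CUDA_VISIBLE_DEVICES"
```

Set this in the shell (or container environment) that launches your BentoService.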


To check GPU usage, run nvidia-smi and verify that your BentoService process is using the GPU, e.g.:

# BentoService is running in another session
» nvidia-smi
Thu Jun 10 15:30:28 2021
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   49C    P8     6W /  N/A |    753MiB /  6078MiB |      0%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    0   N/A  N/A    179346      C   /opt/conda/bin/python             745MiB |


After each implementation, serve and containerize the service as follows:

# to serve our service locally
$ bentoml serve TensorflowService:latest
# containerize our saved service
$ bentoml containerize TensorflowService:latest -t tf_svc
# start our container and check for GPU usages:
$ docker run --gpus all ${DEVICE_ARGS} -p 3000:3000 tf_svc:latest --workers=2


See General workaround (Recommended) for $DEVICE_ARGS.
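As a sketch, if you disabled cgroups as described earlier, $DEVICE_ARGS typically passes the NVIDIA device nodes through to docker run via its --device flag. The exact set of devices depends on your system; the values below are illustrative:

```shell
# Hypothetical DEVICE_ARGS: pass the NVIDIA device nodes explicitly,
# matching the devices listed in the docker-compose snippet above.
DEVICE_ARGS="--device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-uvm"
echo "$DEVICE_ARGS"
```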

Docker Images Options

Users can build their own customized Docker images to serve their BentoService via @env(docker_base_image=""). Make sure your custom image includes Python and the CUDA libraries in order to run on GPU.

BentoML also provides three CUDA-enabled images with CUDA 11.3 and cuDNN 8.2.0 (refer to this support matrix for CUDA and cuDNN version matching).


See PyTorch’s notes on GPU serving.



If users want to utilize multiple GPUs while training, refer to TensorFlow’s distributed strategies.

TL;DR: TensorFlow code with a tf.keras model will run transparently on a single GPU without any changes. You can read more here.


It is NOT RECOMMENDED to manually set device placement unless you know what you are doing!

During training, if one chooses to manually set device placement for specific operations, e.g.:


# train my_model on GPU:1
with tf.device("/GPU:1"):
    ... # train code goes here.

then make sure you create your model on the correct device during inference to avoid potential errors:

# my_model was trained on GPU:1, with weights and tokenizer saved to file.
# Now we want to run the model on GPU:0.
with tf.device("/GPU:0"):
    my_inference_model = build_model() # build_model
    ... # inference code goes here.


TensorFlow exposes devices as /GPU:{device_id} (or /CPU:{device_id}), where device_id is the GPU/CPU index. This is useful if you have multiple CPUs/GPUs; for most use cases /GPU:0 will do the job.

You can list the available devices with:

import tensorflow as tf

tf.config.list_physical_devices("GPU") # or "CPU"

TensorFlow Implementation




Since PyTorch bundles the cuDNN and NCCL runtimes with its Python library, we recommend installing PyTorch with conda via BentoML’s @env instead of using the GPU images provided by BentoML:

@env(conda_dependencies=['pytorch', 'torchtext', 'cudatoolkit=11.1'], conda_channels=['pytorch', 'nvidia'])

PyTorch provides a more Pythonic way to select the device for a deep learning model. This works for both training and inference:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


PyTorch optionally accepts cuda:{device_id} or cpu:{device_id} to explicitly pin a specific device on machines with multiple GPUs or CPUs. For most use cases, “cuda” or “cpu” will allocate GPU resources and fall back to CPU for you.

However, make sure that in your BentoService definition every tensor needed for inference is cast to the same device as your model; see PyTorch Implementation.
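A minimal sketch of that pattern, with a stand-in linear model (the model and predict function are illustrative):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for a real trained network, moved to the chosen device.
model = torch.nn.Linear(4, 2).to(device)
model.eval()

def predict(batch):
    # Cast the input tensor to the same device as the model before inference.
    x = torch.as_tensor(batch, dtype=torch.float32).to(device)
    with torch.no_grad():
        return model(x).cpu().numpy()
```

The same .to(device) call works transparently whether the service lands on a GPU or a CPU-only host.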


All of the above applies to transformers, PyTorch Lightning, and other PyTorch-based deep learning frameworks.

PyTorch Implementation



Users only need to install onnxruntime-gpu to run their ONNX models on GPU. It automatically falls back to CPU if no GPU is found.


The ONNX use case depends on the base deep learning framework users choose to build their model with. This guide provides a PyTorch-to-ONNX example; contributions for other deep learning frameworks are welcome.

Users can check whether their InferenceSession is running on GPU with get_providers():

cuda = "CUDAExecutionProvider" in session.get_providers() # True if running on GPU

Some notes regarding building ONNX services:

  • As shown in the ONNX Implementation below, make sure you set up the correct inputs and outputs for your ONNX model to avoid errors.

  • Your input should be a NumPy array; see to_numpy() for an example.

ONNX Implementation