GPU Serving with BentoML

It is widely recognized in both academia and industry that GPUs offer significant speed and efficiency advantages over CPU-based platforms for both training and inference tasks, as shown by NVIDIA.

Almost every deep learning framework (Tensorflow, PyTorch, ONNX, etc.) supports GPUs. This guide demonstrates how to serve your BentoService with a GPU.

1. Prerequisite

  • GNU/Linux x86_64 with kernel version >=3.10. (uname -a to check)

  • Docker >=19.03

  • NVIDIA GPU that has compute capability >=3.0 (find yours from NVIDIA)

1.1 NVIDIA Drivers

Make sure you have installed the NVIDIA driver for your Linux distribution. The recommended way to install drivers is to use your distribution's package manager, but other alternatives are also available.

For instructions on how to use your package manager to install drivers from the CUDA network repository, follow this guide.

1.2 NVIDIA Container Toolkit

See also

NVIDIA provides detailed instructions for installing both Docker CE and nvidia-docker. Refer to the nvidia-docker wiki for more information.

Note

Arch users can install nvidia-docker from the AUR.

Warning

Recent changes to systemd's cgroup architecture, described in #1447, completely break nvidia-docker. This issue is confirmed to be patched in future releases.

Debian-based OS

Disable the unified cgroup hierarchy by adding systemd.unified_cgroup_hierarchy=0 to GRUB_CMDLINE_LINUX_DEFAULT:

GRUB_CMDLINE_LINUX_DEFAULT="quiet systemd.unified_cgroup_hierarchy=0"
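
After editing the GRUB configuration, apply the change and reboot (a sketch for Debian-based systems; the exact command may differ per distribution):

$ sudo update-grub
$ sudo reboot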

Other OSes

Change #no-cgroups=false to no-cgroups=true in /etc/nvidia-container-runtime/config.toml, as shown below.
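
The relevant part of the file then looks roughly like this (the exact surrounding section may differ between toolkit versions):

# /etc/nvidia-container-runtime/config.toml
[nvidia-container-cli]
...
no-cgroups = true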

docker-compose

Add the following:

# docker-compose.yaml

...
devices:
  - /dev/nvidia0:/dev/nvidia0
  - /dev/nvidiactl:/dev/nvidiactl
  - /dev/nvidia-modeset:/dev/nvidia-modeset
  - /dev/nvidia-uvm:/dev/nvidia-uvm
  - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools

2. Framework Support for GPU Inference with Implementation

Jump to Tensorflow Implementation | PyTorch Implementation | ONNX Implementation

Note

The examples shown here are merely demonstrations of how GPU inference works across the different frameworks, kept short to avoid bloating the guide.

See also

Please refer to BentoML's gallery for more detailed GPU serving use cases.

2.1 Preface

Warning

As of 0.13.0, multi-GPU inference is not supported. (However, support for this feature is on our future roadmap.)

Note

To check GPU usage, run nvidia-smi to see whether the BentoService is using the GPU, e.g.:

# BentoService is running in another session
$ nvidia-smi
Thu Jun  3 17:02:06 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   59C    P8     5W /  N/A |      6MiB /  6078MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1418      G   /opt/conda/venv/bin/python       5781MiB |
+-----------------------------------------------------------------------------+

Note

After each implementation below, you can serve and containerize the saved BentoService:

# serve our service locally
$ bentoml serve TensorflowService:latest
# containerize our saved service
$ bentoml containerize TensorflowService:latest -t tf_svc
# start our container and check GPU usage
$ docker run --gpus all ${DEVICE_ARGS} -p 5000:5000 tf_svc:latest --workers=2

Note

See General workaround (Recommended) for the definition of $DEVICE_ARGS; an illustrative example is shown below.
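
The referenced workaround is not reproduced in this guide; under the no-cgroups setup above, $DEVICE_ARGS typically expands to explicit --device flags, for example (an illustrative sketch, with a sample request matching the JSON inputs used in this guide):

$ export DEVICE_ARGS="--device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-modeset --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools"
# once the container is up, send a test request to the service's predict endpoint
$ curl -X POST -H "Content-Type: application/json" -d '{"text": "some input"}' http://localhost:5000/predict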

2.2 Docker Image Options

Users can build their own customized Docker image to serve their BentoService via @env(docker_base_image=""). Make sure that your custom Docker image has Python and the CUDA libraries installed in order to run on GPU.

BentoML also provides three CUDA-enabled images with CUDA 11.3 and cuDNN 8.2.0 (refer to this support matrix for CUDA and cuDNN version matching).
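
For example, a GPU-enabled base image can be selected like this sketch (MyGpuService is illustrative, and the tag below is the one used in the Tensorflow example further down; check BentoML's released images for other CUDA/Python combinations):

import bentoml
from bentoml.adapters import JsonInput

@bentoml.env(
    pip_packages=["tensorflow"],
    docker_base_image="bentoml/model-server:0.12.1-py38-gpu",  # CUDA-enabled base image
)
class MyGpuService(bentoml.BentoService):

    @bentoml.api(input=JsonInput())
    def predict(self, parsed_json):
        ...  # inference code goes here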

2.3 Tensorflow

Note

If users want to utilize multiple GPUs during training, refer to Tensorflow's distributed strategies.

TL;DR: Tensorflow code with a tf.keras model will run transparently on a single GPU without any changes. You can read more here.

Warning

It is NOT RECOMMENDED to manually set device placement unless you know what you are doing!

During training, if one chooses to manually set device placement for specific operations, e.g.:

tf.debugging.set_log_device_placement(True)

# train my_model on GPU:1
with tf.device("/GPU:1"):
    ... # train code goes here.

then make sure you re-create your model correctly at inference time to avoid any potential errors.

# my_model_gpu was trained on another device (e.g. /GPU:1 above); rebuild the model
# under the inference device and copy over its weights
with tf.device("/GPU:0"):
    my_inference_model = build_model() # build_model
    my_inference_model.set_weights(my_model_gpu.get_weights())
    ... # inference code goes here.

Note

Tensorflow exposes /GPU:{device_id}, where device_id is the GPU/CPU id. This is useful if you have a multi-CPU/GPU setup. For most use cases, /GPU:0 will do the job.

You can list the available devices with

tf.config.list_physical_devices("GPU") # or CPU

Tensorflow Implementation

Note

Refer to the Tensorflow gallery for the complete version.

# bento_svc.py
import bentoml
from bentoml.adapters import JsonInput
from bentoml.frameworks.keras import KerasModelArtifact
from bentoml.service.artifacts.common import PickleArtifact

@bentoml.env(
    pip_packages=['tensorflow', 'scikit-learn', 'pandas'],
    docker_base_image="bentoml/model-server:0.12.1-py38-gpu",
)
@bentoml.artifacts([KerasModelArtifact('model'), PickleArtifact('tokenizer')])
class TensorflowService(bentoml.BentoService):

    @bentoml.api(input=JsonInput())
    def predict(self, parsed_json):
        # preprocess the parsed JSON into model inputs with the packed tokenizer
        # (see the Tensorflow gallery example for the full preprocessing code)
        input_data = self.artifacts.tokenizer.texts_to_sequences([parsed_json.get("text")])
        return self.artifacts.model.predict(input_data)

# bento_packer.py
import tensorflow as tf

from bento_svc import TensorflowService

# OPTIONAL: remove the TF memory limit on our card
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_memory_growth(gpus[0], True)

model = load_model()
tokenizer = load_tokenizer()

bento_svc = TensorflowService()
bento_svc.pack('model', model)
bento_svc.pack('tokenizer', tokenizer)

saved_path = bento_svc.save()

2.4 PyTorch

Note

Since PyTorch bundles the cuDNN and NCCL runtimes with its Python library, the RECOMMENDED way to run your PyTorch service is to install PyTorch with conda via BentoML's @env:

@env(conda_dependencies=['pytorch', 'torchtext', 'cudatoolkit=11.1'], conda_channels=['pytorch', 'nvidia'])

PyTorch provides a more pythonic way to define the device for our deep learning model, which can be used across both training and inference tasks:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Note

PyTorch optionally accepts cuda:{device_id} or cpu:{device_id} to explicitly assign a device when the machine contains multiple GPUs or CPUs. For most use cases, "cuda" or "cpu" will dynamically allocate GPU resources and fall back to CPU for you.

However, make sure that in your BentoService definition every tensor needed for inference is cast to the same device as the model; see PyTorch Implementation.

Note

All of the above also applies to transformers, PyTorch Lightning, or any other PyTorch-based deep learning framework.

PyTorch Implementation

Note

Refer to the PyTorch gallery for the complete version.

# bento_svc.py

from bentoml import BentoService, api, artifacts, env
from bentoml.adapters import JsonInput, JsonOutput
from bentoml.frameworks.pytorch import PytorchModelArtifact
from bentoml.service.artifacts.pickle import PickleArtifact
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

@env(conda_dependencies=['pytorch', 'torchtext', 'cudatoolkit=11.1'], conda_channels=['pytorch', 'nvidia'])
@artifacts([PytorchModelArtifact("model"), PickleArtifact("tokenizer"), PickleArtifact("vocab")])
class PytorchService(BentoService):

    def classify_categories(self, sentence):
        text_pipeline, _ = get_pipeline(self.artifacts.tokenizer, self.artifacts.vocab)
        with torch.no_grad():
            # since we want to run our inference tasks with GPU, we need to cast
            # our text and offsets to GPU
            text = torch.tensor(text_pipeline(sentence)).to(device)
            offsets = torch.tensor([0]).to(device)
            output = self.artifacts.model(text, offsets=offsets)
            return output.argmax(1).item() + 1

    @api(input=JsonInput(), output=JsonOutput())
    def predict(self, parsed_json):
        label = self.classify_categories(parsed_json.get("text"))
        # self.label maps class indices to category names (defined in the full gallery example)
        return {'categories': self.label[label]}

# bento_packer.py

import torch

from bento_svc import PytorchService

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

tokenizer, vocab = get_tokenizer_vocab()
vocab_size, embedding_size, num_class = get_model_params(vocab)

# here we assign our inference model to the defined device
model = TextClassificationModel(vocab_size, embedding_size, num_class).to(device)
model.load_state_dict(torch.load("model/pytorch_model.pt", map_location=device))
model.eval()

bento_svc = PytorchService()

bento_svc.pack("model", model)
bento_svc.pack("tokenizer", tokenizer)
bento_svc.pack("vocab", vocab)
saved_path = bento_svc.save()

2.5 ONNX

Users only need to install onnxruntime-gpu in order to run their ONNX model on a GPU. It will automatically fall back to CPU if no GPU is found.
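
For local experimentation outside of a BentoService, that is simply:

$ pip install onnxruntime-gpu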

Note

The ONNX use case depends on the base deep learning framework users choose to build their model with. This guide demonstrates a PyTorch-to-ONNX use case; contributions are welcome for other deep learning frameworks.

Users can check whether their InferenceSession is running on GPU with get_providers():

cuda = "CUDA" in session.get_providers()[0] # True if you have a GPU

Some notes regarding building ONNX services:

  • as shown in the ONNX Implementation below, make sure that you set up correct inputs and outputs for your ONNX model to avoid any errors.

  • your inputs should be NumPy arrays; refer to to_numpy() below for an example.

ONNX Implementation

Note

Refer to the ONNX gallery for the complete version.

# bento_svc.py
import torch
from bentoml import BentoService, api, env, artifacts
from bentoml.adapters import JsonInput, JsonOutput
from bentoml.frameworks.onnx import OnnxModelArtifact
from bentoml.service.artifacts.pickle import PickleArtifact
from onnxruntime.capi.onnxruntime_pybind11_state import InvalidArgument

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def to_numpy(tensor):
    return tensor.detach().cpu().clone().numpy() if tensor.requires_grad else tensor.cpu().clone().numpy()


@env(infer_pip_packages=False, pip_packages=['onnxruntime-gpu'])
@artifacts(
    [OnnxModelArtifact('model', backend='onnxruntime-gpu'), PickleArtifact('tokenizer'), PickleArtifact('vocab')])
class OnnxService(BentoService):

    def classify_categories(self, sentence):
        text_pipeline, _ = get_pipeline(self.artifacts.tokenizer, self.artifacts.vocab)
        text = to_numpy(torch.tensor(text_pipeline(sentence)).to(device))
        tensor_name = self.artifacts.model.get_inputs()[0].name
        output_name = self.artifacts.model.get_outputs()[0].name
        onnx_inputs = {tensor_name: text}

        try:
            r = self.artifacts.model.run([output_name], onnx_inputs)[0]
            return r.argmax(1).item() + 1
        except (RuntimeError, InvalidArgument) as e:
            print(f"ERROR with shape: {onnx_inputs[tensor_name].shape} - {e}")

    @api(input=JsonInput(), output=JsonOutput())
    def predict(self, parsed_json):
        sentence = parsed_json.get('text')
        # self.label maps class indices to category names (defined in the full gallery example)
        return {'categories': self.label[self.classify_categories(sentence)]}

# bento_packer.py
import torch
from bento_svc import OnnxService

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer, vocab = get_tokenizer_vocab()
vocab_size, embedding_size, num_class = get_model_params(vocab)
model = TextClassificationModel(vocab_size, embedding_size, num_class).to(device)
model.load_state_dict(torch.load("model/pytorch_model.pt", map_location=device))
model.eval()

# A dummy input is required for ONNX export. Users have to make sure the dimensions of this
# input match the given model's inputs, e.g.:
#
# an AlexNet model takes 224x224 images, so its dummy input would have a static shape [3, 224, 224].
#
# However, our text categorization task accepts variable-length inputs, so
# our dummy input should have a dynamic shape [vocab_size].
#
# ONNX also only accepts torch.LongTensor or torch.cuda.LongTensor, so remember to cast correctly.
# We can handle dynamic axes (vocab_size in this case) with ``dynamic_axes=`` as shown below.

inp = torch.rand(vocab_size).long().to(device)

# onnx_model_path is the file the exported ONNX model will be written to
# (an example placeholder path; pick one that suits your project layout)
onnx_model_path = "model/pytorch_model.onnx"

torch.onnx.export(model, inp, onnx_model_path, export_params=True, opset_version=11, do_constant_folding=True,
                  input_names=["input"], output_names=["output"],
                  dynamic_axes={"input": {0: "vocab_size"}, "output": {0: "vocab_size"}})
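
# OPTIONAL (not part of the original gallery example): sanity-check the exported graph
# with the onnx checker before packing it into the BentoService
import onnx

onnx.checker.check_model(onnx.load(onnx_model_path))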

bento_svc = OnnxService()
bento_svc.pack("model", onnx_model_path)
bento_svc.pack("tokenizer", tokenizer)
bento_svc.pack("vocab", vocab)
saved_path = bento_svc.save()