GPU deployment#

Prerequisites#

  • yatai-deployment

Because GPU support is provided through the BentoDeployment CRD, it relies on yatai-deployment.

GPU Deployment with Kubernetes#

Yatai allows you to deploy bentos on Nvidia GPUs on demand. Make sure Nvidia GPUs are available in the cluster: consult your cluster provider for details, or see https://github.com/NVIDIA/k8s-device-plugin if you run Yatai in your own cluster. Once the “nvidia.com/gpu” resource is available in your cluster, Yatai is ready to serve GPU-based bentos.
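
As a quick sanity check, you can inspect what your nodes report for this resource. This is plain kubectl usage rather than a Yatai-specific command:

  # Show the nvidia.com/gpu entries under each node's Capacity/Allocatable
  kubectl describe nodes | grep -i "nvidia.com/gpu"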

Through the Web UI#

Steps to deploy a GPU-supported bento with Yatai:

  1. Select the “Deployments” tab of your Yatai Web UI and click the “Create” button to create a new deployment.
  2. Select the target bento.
  3. Scroll down to “Runners”, select the runner you want to accelerate with a GPU, and add a custom resource request with key nvidia.com/gpu and value 1 to request one GPU for each replica of this runner.

Note: Typically you don’t need to allocate GPUs to the bento service itself, since it cannot be accelerated by GPUs. Instead, allocate GPUs to the runners that perform the actual inference.

Through the CLI#

Apply the following YAML to create a BentoDeployment CR:

  apiVersion: serving.yatai.ai/v2alpha1
  kind: BentoDeployment
  metadata:
    name: my-bento-deployment
    namespace: my-namespace
  spec:
    bento: iris-1
    ingress:
      enabled: true
    envs:
    - name: foo
      value: bar
    resources:
      limits:
        cpu: 2000m
        memory: "1Gi"
      requests:
        cpu: 1000m
        memory: "500Mi"
    autoscaling:
      maxReplicas: 5
      minReplicas: 1
    runners:
    - name: runner1
      resources:
        limits:
          cpu: 2000m
          memory: "4Gi"
          custom:
            nvidia.com/gpu: 1
        requests:
          cpu: 1000m
          memory: "2Gi"
      autoscaling:
        maxReplicas: 2
        minReplicas: 1
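
A minimal sketch of applying and verifying the deployment with kubectl, assuming the manifest above is saved as my-bento-deployment.yaml (a filename used here only for illustration):

  # Create or update the BentoDeployment CR
  kubectl apply -f my-bento-deployment.yaml

  # Check that the runner pods are scheduled and running in the target namespace
  kubectl get pods -n my-namespace -o wide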

Fractional-GPU resource allocation#

Sometimes you may want to allocate only a fraction of a GPU to a runner: for example, given a GPU with 8GB of memory, you might allocate 4GB to one runner and 4GB to another. Yatai is designed with this in mind. However, the cluster must be configured to support this feature first.

For managed Kubernetes offerings, ask your cluster provider whether a fractional-GPU solution is available. For example, for AWS EKS, see https://aws.amazon.com/blogs/opensource/virtual-gpu-device-plugin-for-inference-workload-in-kubernetes/.

For a self-managed Kubernetes cluster, you can install an open-source solution such as https://github.com/elastic-ai/elastic-gpu.

Once set up, you can allocate a fraction of a GPU to a runner by replacing nvidia.com/gpu in the resource request with the resource name provided by your chosen solution, as in the sketch below.
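
As a rough sketch, the runner's resources section would then look something like the fragment below. Here example.com/vgpu is a placeholder for whatever resource name your fractional-GPU plugin registers, and the meaning of the value (memory slices, vGPU shares, etc.) depends on that plugin:

  runners:
  - name: runner1
    resources:
      limits:
        cpu: 2000m
        memory: "4Gi"
        custom:
          # Placeholder: replace with the resource name registered by your
          # fractional-GPU device plugin, used in place of nvidia.com/gpu
          example.com/vgpu: 1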