NVIDIA GPU Operator – Simplifying AI/ML Deployments on the Canonical Platform

Running AI workloads on Kubernetes is becoming increasingly popular. If your business does AI/ML on Kubernetes, chances are you already use tools like Kubeflow to reduce complexity, cost and deployment time. If not, you may be missing out!

With AI/ML among the hottest topics in tech, GPUs play a critical role. NVIDIA, a prominent player in the GPU space, is one of the top choices for most stakeholders in the field. NVIDIA took its commitment a step further with the launch of the open-source GPU Operator project at Mobile World Congress LA.

What is the GPU Operator?

A GPU is a high-performance compute resource, and several components must be installed on a node before application workloads can run on it. These include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin, the container runtime and others. The GPU Operator manages these components as resources in a Kubernetes cluster and automates the tasks needed to bootstrap GPU nodes.

Supported Platforms

The NVIDIA GPU Operator currently supports and has been validated with the following:

●     Pascal+ GPUs are supported (incl. Tesla V100 and T4)

●     Kubernetes v1.13+

  • Canonical’s Kubernetes <version> has been tested with and supports the NVIDIA GPU Operator. The GPU Operator works out of the box with Canonical’s Charmed Kubernetes and is supported from day one.

– Note: Helm may fail to initialize in Kubernetes v1.16. The Helm installation step below includes a workaround for this. More details can be found in the GitHub issue.

●     Helm 2

●     Ubuntu 18.04.3 LTS

The GPU Operator includes the following NVIDIA components:

●     Docker CE 19.03.2

●     NVIDIA Container Toolkit 1.0.5

●     NVIDIA Kubernetes Device Plugin 1.0.0-beta4

●     NVIDIA Tesla Driver 418.87.01

Set-Up

Prerequisites

The GPU Operator has a few prerequisites:

  • It requires a fresh configuration of nodes – nodes must not be pre-configured with NVIDIA components (driver, container runtime, device plugin).
  • The i2c_core and ipmi_msghandler kernel modules must be loaded.

The following command ensures these modules are loaded:

$ sudo modprobe -a i2c_core ipmi_msghandler

Modules loaded this way do not persist across reboots. To load them at every boot, add them to a config file as shown:

$ echo -e "i2c_core\nipmi_msghandler" | sudo tee /etc/modules-load.d/driver.conf
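
To confirm that both modules are present before continuing, a small check along these lines may help. This is a minimal sketch and not part of the GPU Operator: the function name missing_modules is hypothetical, and it only inspects a loaded-module list such as /proc/modules, so a module compiled directly into the kernel would be reported as missing even though it is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each required module that is absent from a
# loaded-module list. In /proc/modules the first field of every line is
# the module name, so only that field is compared (exact match).
missing_modules() {
  local list_file="$1"; shift
  local mod
  for mod in "$@"; do
    if ! awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$list_file"; then
      echo "$mod"
    fi
  done
}

# Usage on a node (no output means both modules are loaded):
# missing_modules /proc/modules i2c_core ipmi_msghandler
```

If the function prints a module name, rerun the modprobe command above and check again.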

  • Node Feature Discovery (NFD) is required on each node. By default, the NFD master and worker are deployed automatically.

If NFD is already running in the cluster before the operator is deployed, set nfd.enabled=false at the helm install step:

$ helm install --devel --set nfd.enabled=false nvidia/gpu-operator -n test-operator

See notes on NFD setup
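
One way to decide whether to pass nfd.enabled=false is to look for node labels in the feature.node.kubernetes.io/ namespace, which NFD applies to the nodes it has scanned. The sketch below assumes that the presence of such labels means NFD is already running. Labels can linger after NFD has been removed, so treat the result as a hint rather than proof. The function name nfd_enabled_value is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read node label text on stdin and print the value
# to use for nfd.enabled. Assumption: any feature.node.kubernetes.io/
# label means NFD is already deployed in the cluster.
nfd_enabled_value() {
  if grep -q 'feature\.node\.kubernetes\.io/'; then
    echo "false"
  else
    echo "true"
  fi
}

# Usage against a live cluster:
# kubectl get nodes --show-labels | nfd_enabled_value
```

The printed value can then be passed to the install step as --set nfd.enabled=<value>.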

Install Helm

$ curl -L https://git.io/get_helm.sh | bash

Create a service account for Helm

$ kubectl create serviceaccount -n kube-system tiller

$ kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller

Initialize Helm

$ helm init --service-account tiller --wait

Note that if Helm is already deployed in your cluster and you are adding a new node, run this instead:

$ helm init --client-only
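
Because helm init and Tiller exist only in Helm 2, it can be worth confirming the client's major version before relying on the steps above. The following is a minimal sketch, assuming the version string contains something like v2.16.1. The function name helm_major_version is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `helm version` output on stdin and print the
# major version number found in the first vMAJOR.MINOR string.
helm_major_version() {
  sed -n 's/.*v\([0-9][0-9]*\)\.[0-9].*/\1/p' | head -n 1
}

# Usage with a Helm 2 client (asks for the client version only, so Tiller
# does not need to be reachable):
# helm version --client --short | helm_major_version
```

A result other than 2 suggests the helm init commands in this post will not apply as written.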

 

Install the GPU Operator

Note that after running this command, NFD will be automatically deployed.

$ helm install --devel nvidia/gpu-operator -n test-operator --wait

$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/manifests/cr/sro_cr_sched_none.yaml

To check the gpu-operator version

$ helm ls
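
Once the operator is running, the NVIDIA device plugin is expected to advertise GPUs to the scheduler under the nvidia.com/gpu resource name. The sketch below lists the nodes that report at least one allocatable GPU. It assumes the two-column output of the kubectl command shown in the usage comment, and the function name gpu_nodes is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a "NAME GPU" table on stdin (header on the
# first line) and print the names of nodes whose GPU column is a
# positive integer. Nodes without GPUs show "<none>" and are skipped.
gpu_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Usage against a live cluster:
# kubectl get nodes \
#   -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#   | gpu_nodes
```

An empty result right after installation may only mean the driver and device plugin pods are still starting, so it is worth retrying after a few minutes.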

Running a Sample GPU Application

Create a TensorFlow notebook example

$ kubectl apply -f https://nvidia.github.io/gpu-operator/notebook-example.yml

Grab the token from the pod once it is created

$ kubectl get pod tf-notebook

$ kubectl logs tf-notebook

The log output includes a message telling you to use the following URL in your browser when you connect for the first time, to log in with a token:

http://localhost:8888/?token=MY_TOKEN

You can now access the notebook at http://localhost:30001/?token=MY_TOKEN, replacing MY_TOKEN with the token from the logs.
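
Copying the token by hand is error-prone, so it can be pulled out of the logs with a small filter. This is a minimal sketch, assuming the log contains a URL with a ?token= query parameter and that the token consists only of letters and digits. The function name extract_token is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read pod logs on stdin and print the first token
# found in a "?token=..." or "&token=..." query parameter.
extract_token() {
  sed -n 's/.*[?&]token=\([A-Za-z0-9]*\).*/\1/p' | head -n 1
}

# Usage: build the notebook URL on port 30001 from the pod logs.
# token="$(kubectl logs tf-notebook | extract_token)"
# echo "http://localhost:30001/?token=${token}"
```

If the filter prints nothing, the notebook may not have finished starting. Check the pod status and read the logs again.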

What’s next

NVIDIA and Canonical will continue partnering to improve the AI/ML space and enable innovators. One area of interest is extending the GPU Operator to MicroK8s. MicroK8s takes Kubernetes simplification one step further: a lightweight Kubernetes distribution with Kubeflow, GPUs, Helm and the GPU Operator all in one package. Get started in seconds!

Contributing

If you find a bug, have technical issues or would like to contribute to the NVIDIA GPU Operator, please visit the official GitHub page.

For issues with or contributions to Canonical’s Kubernetes, please visit the GitHub page. You can also reach out to us on Twitter @canonical @ubuntu.

Canonical and NVIDIA look forward to your valuable feedback!
