Digital Thought Disruption

Containerized Deep Learning: Running NVIDIA GPUs on VMware Tanzu & Nutanix Karbon

Table of Contents

  1. Introduction
  2. Why Containerized Deep Learning on Kubernetes?
  3. Overview: VMware Tanzu, Nutanix Karbon, and NVIDIA GPUs
  4. Prerequisites & Lab Setup
  5. Deploying NVIDIA GPU Support on VMware Tanzu
  6. Deploying NVIDIA GPU Support on Nutanix Karbon
  7. Example: Training a Deep Learning Model with NGC Containers
  8. Integrating DevOps: CI/CD, Harbor, and Monitoring
  9. Troubleshooting & Best Practices
  10. References

1. Introduction

Deep learning workloads are transforming enterprise AI. Running them efficiently at scale requires more than just powerful GPUs. Kubernetes has become the go-to platform for orchestrating modern, containerized machine learning environments. With Kubernetes, you get reproducibility, scalability, and deep integration with DevOps workflows.

This tutorial and proof-of-concept guide shows how to harness NVIDIA GPUs within VMware Tanzu and Nutanix Karbon Kubernetes platforms. Here, you will find practical configuration steps, workflow automation tips, and sample code. You will also see references to VMware’s official GPU enablement blogs and Nutanix Karbon GPU documentation for accuracy.


2. Why Containerized Deep Learning on Kubernetes?

Key benefits:

Enterprise challenge:
Most enterprise ML teams struggle to make GPU resources shareable, auditable, and easy to manage. VMware Tanzu and Nutanix Karbon provide battle-tested Kubernetes solutions. NVIDIA GPU Operator and NGC containers deliver full-stack GPU support.


3. Overview: VMware Tanzu, Nutanix Karbon, and NVIDIA GPUs

VMware Tanzu

Nutanix Karbon

NVIDIA GPUs


4. Prerequisites & Lab Setup

Hardware

Software

Network & Storage

Licensing and Compatibility


5. Deploying NVIDIA GPU Support on VMware Tanzu

Step 1: Enable GPU Passthrough (DirectPath I/O)

  1. In vSphere Client:
    • Right-click the host, go to Hardware, and select PCI Devices.
    • Choose the NVIDIA GPU and click Mark for Passthrough.
  2. Assign GPU to Tanzu worker nodes:
    • Edit the VM settings, select Add PCI Device, and choose the GPU.

Step 2: Install NVIDIA Drivers and Container Toolkit

Run these as root on each worker node:

# Install the NVIDIA driver (example for RHEL/CentOS)
yum install -y gcc make kernel-devel-$(uname -r)
bash NVIDIA-Linux-x86_64-*.run

# Install the NVIDIA Docker Toolkit (RHEL/CentOS)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | tee /etc/yum.repos.d/nvidia-docker.repo
yum install -y nvidia-docker2
systemctl restart docker

Verify installation:

nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

Step 3: Deploy NVIDIA GPU Operator (Helm)

kubectl create namespace gpu-operator
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install --wait --generate-name -n gpu-operator nvidia/gpu-operator
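
Once the Operator pods are running (check with kubectl get pods -n gpu-operator), a quick end-to-end check is a throwaway pod that requests a GPU and runs nvidia-smi. A minimal sketch; the CUDA image tag is an assumption, so use one that matches your installed driver:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

If scheduling and the driver stack are healthy, kubectl logs gpu-smoke-test prints the same nvidia-smi table you saw on the node.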

References:


6. Deploying NVIDIA GPU Support on Nutanix Karbon

Step 1: Enable GPU Passthrough in Nutanix Prism

  1. In Prism Element:
    • Go to VM, click Hardware, then Add GPU, and select the NVIDIA GPU.
  2. For each Karbon worker VM:
    • Attach the GPU and reboot.

Step 2: Install Drivers & Toolkit

Follow the same NVIDIA driver and toolkit steps as above. Adapt for Ubuntu or Debian as needed.
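
For Ubuntu-based worker images, the equivalent toolkit setup uses NVIDIA's apt repository. This is a sketch of the same legacy nvidia-docker2 flow shown above for RHEL; newer deployments may prefer the nvidia-container-toolkit package instead:

```shell
# Add NVIDIA's apt repository and install the Docker runtime hooks (Ubuntu/Debian)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)   # e.g. ubuntu20.04
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
```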

Step 3: Deploy NVIDIA GPU Operator

  1. Create a namespace:
     kubectl create namespace gpu-operator
  2. Install via Helm:
     helm repo add nvidia https://nvidia.github.io/gpu-operator
     helm repo update
     helm install gpu-operator nvidia/gpu-operator -n gpu-operator

Documentation Note:
Karbon documentation is less centralized than VMware’s, but Nutanix confirms GPU support and Kubernetes orchestration through AHV and Prism. Always validate your deployment using the latest Nutanix solution briefs and official docs.

References:


7. Example: Training a Deep Learning Model with NGC Containers

Let’s run a PyTorch training job on Kubernetes using the official NVIDIA NGC image.

Step 1: Pull Example YAML

ngc_pytorch_job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-gpu-job
spec:
  template:
    spec:
      containers:
        - name: pytorch
          image: nvcr.io/nvidia/pytorch:24.04-py3
          command: ["python", "-c", "import torch; print(torch.cuda.is_available()); print(torch.rand(2,2).cuda())"]
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never

Deploy:

kubectl apply -f ngc_pytorch_job.yaml
kubectl logs job/pytorch-gpu-job

You should see True for GPU availability, followed by a random tensor on the cuda device, confirming that the container can access the GPU.
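
Real training jobs usually need more than these defaults. PyTorch DataLoader workers use /dev/shm, which Kubernetes caps at 64 MiB unless you mount a memory-backed emptyDir, and multi-GPU runs simply raise the nvidia.com/gpu limit. A sketch of those two additions to the pod spec above (the volume name is illustrative):

    spec:
      containers:
        - name: pytorch
          image: nvcr.io/nvidia/pytorch:24.04-py3
          resources:
            limits:
              nvidia.com/gpu: 2          # request two GPUs for multi-GPU training
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm        # enlarge shared memory for DataLoader workers
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory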


8. Integrating DevOps: CI/CD, Harbor Registry, and Monitoring

CI/CD with GitLab or Jenkins

stages:
  - build
  - deploy

build:
  stage: build
  script:
    - docker build -t my-registry/ngc-custom:latest .
    - docker push my-registry/ngc-custom:latest

deploy:
  stage: deploy
  script:
    - kubectl apply -f k8s/job.yaml

Harbor Registry
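
A common pattern is to push CI-built, NGC-derived images to a private Harbor project using a robot account. A hedged sketch for the GitLab build job above; the registry URL, project path, and variable names are placeholders you would define in your CI settings:

build:
  stage: build
  script:
    - docker login "$HARBOR_URL" -u "$HARBOR_ROBOT_USER" -p "$HARBOR_ROBOT_TOKEN"
    - docker build -t "$HARBOR_URL/ml/ngc-custom:$CI_COMMIT_SHORT_SHA" .
    - docker push "$HARBOR_URL/ml/ngc-custom:$CI_COMMIT_SHORT_SHA"

Harbor's vulnerability scanning and project-level RBAC then apply to every image your pipelines produce.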

Monitoring: Prometheus & Grafana
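
The GPU Operator deploys NVIDIA's DCGM exporter, which publishes per-node GPU metrics (utilization, memory, temperature). If you run the Prometheus Operator, a ServiceMonitor can scrape it; this is a sketch assuming the default gpu-operator namespace, and the label and port name should be verified against the Service in your deployment:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter    # label on the exporter Service (verify in your cluster)
  endpoints:
    - port: gpu-metrics            # port name exposed by the exporter
      interval: 15s

NVIDIA also publishes a ready-made Grafana dashboard for DCGM metrics, which is a convenient starting point.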


9. Troubleshooting & Best Practices

Issue                    | Cause or Solution
GPU not detected in pod  | Check node label nvidia.com/gpu.present. Verify driver and Operator logs.
Pod pending (no GPU)     | Not enough GPU nodes or quota. Check kubectl describe node.
vGPU licensing errors    | Make sure the NVIDIA license server is reachable from all GPU-enabled VMs. Verify that your deployment’s vGPU licensing model matches your Nutanix AHV and Kubernetes configuration, especially in multi-tenant scenarios.
Compatibility mismatches | Review Nutanix’s latest compatibility matrices for vGPU and Kubernetes versions.
CUDA version mismatch    | Use a container CUDA version supported by the installed driver.
Image pull errors        | Check Harbor or NGC registry access. Verify the image tag is correct.

Pro Tips:


10. References

Disclaimer

The views expressed in this article are those of the author and do not represent the opinions of any vendor, my employer, or any affiliated organization. Always refer to the official vendor documentation before production deployment.
