Containerized Deep Learning: Running NVIDIA GPUs on VMware Tanzu & Nutanix Karbon

Table of Contents

  1. Introduction
  2. Why Containerized Deep Learning on Kubernetes?
  3. Overview: VMware Tanzu, Nutanix Karbon, and NVIDIA GPUs
  4. Prerequisites & Lab Setup
  5. Deploying NVIDIA GPU Support on VMware Tanzu
  6. Deploying NVIDIA GPU Support on Nutanix Karbon
  7. Example: Training a Deep Learning Model with NGC Containers
  8. Integrating DevOps: CI/CD, Harbor, and Monitoring
  9. Troubleshooting & Best Practices

1. Introduction

Deep learning workloads are transforming enterprise AI. Running them efficiently at scale requires more than just powerful GPUs. Kubernetes has become the go-to platform for orchestrating modern, containerized machine learning environments. With Kubernetes, you get reproducibility, scalability, and deep integration with DevOps workflows.

This tutorial and proof-of-concept guide shows how to harness NVIDIA GPUs within the VMware Tanzu and Nutanix Karbon Kubernetes platforms. Here, you will find practical configuration steps, workflow automation tips, and sample code. The steps are cross-checked against VMware's official GPU enablement blogs and Nutanix's Karbon GPU documentation.


2. Why Containerized Deep Learning on Kubernetes?

Key benefits:

  • Reproducibility: Environments, dependencies, and data pipelines are defined as code.
  • Portability: Move workloads seamlessly across development, test, and production clusters.
  • DevOps Automation: Integrate ML workflows with CI/CD, GitOps, and monitoring stacks.
  • Resource Efficiency: Dynamically schedule GPU workloads to maximize utilization.

Enterprise challenge:
Most enterprise ML teams struggle to make GPU resources shareable, auditable, and easy to manage. VMware Tanzu and Nutanix Karbon provide battle-tested Kubernetes solutions. NVIDIA GPU Operator and NGC containers deliver full-stack GPU support.


3. Overview: VMware Tanzu, Nutanix Karbon, and NVIDIA GPUs

VMware Tanzu

  • Tanzu Kubernetes Grid (TKG): Production-grade Kubernetes for vSphere.
  • GPU Support: vSphere DirectPath I/O for full-GPU passthrough, or NVIDIA vGPU for shared GPU profiles. Tanzu supports the NVIDIA GPU Operator.

Nutanix Karbon

  • Karbon: CNCF-certified Kubernetes, tightly integrated with Nutanix AHV and Prism.
  • GPU Support: Enables passthrough or vGPU with the Nutanix AHV stack. Supports the NVIDIA GPU Operator and orchestration of NGC containers.
  • Documentation Note: Karbon documentation is less centralized than VMware’s. However, Nutanix confirms robust GPU enablement and Kubernetes orchestration via AHV and Prism. For detailed, step-by-step guidance, refer to Nutanix official solution briefs and Nutanix.dev.

NVIDIA GPUs

  • NVIDIA Data Center GPUs: A100, H100, V100, T4, and others.
  • Key Software: NVIDIA drivers, CUDA toolkit, NVIDIA Container Toolkit, GPU Operator, and NGC (NVIDIA GPU Cloud) container registry.
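
The NGC images used later in this guide are pulled from nvcr.io. If your NGC access requires authentication, a typical login looks like this (the username is the literal string $oauthtoken, and the password is your NGC API key):

# Single quotes keep the shell from expanding $oauthtoken
docker login nvcr.io --username '$oauthtoken'
docker pull nvcr.io/nvidia/pytorch:24.04-py3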

4. Prerequisites & Lab Setup

Hardware

  • VMware or Nutanix cluster with NVIDIA GPUs, either passthrough or vGPU-capable.
  • Sufficient compute and memory for Kubernetes master and worker nodes.

Software

  • VMware vSphere 7 or later with Tanzu, or Nutanix AHV with Karbon enabled.
  • Kubernetes cluster (version 1.25 or newer recommended).
  • Docker or Containerd installed on all nodes.
  • Internet access for NGC and Harbor, unless mirrored locally.
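
A quick sanity check on an existing cluster, assuming kubectl and Helm are already configured for it:

kubectl get nodes -o wide
kubectl version
helm version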

Network & Storage

  • High-bandwidth, low-latency interconnects.
  • Persistent storage class, such as vSAN, Nutanix Files, or Nutanix Volumes.

Licensing and Compatibility

  • Access to NVIDIA vGPU licenses if using vGPU.
  • vGPU Licensing: Always verify vGPU licensing and compatibility for your deployment, especially for multi-tenant clusters. Licensing models and GPU sharing features may vary by environment, so confirm support using Nutanix and NVIDIA official documentation.

5. Deploying NVIDIA GPU Support on VMware Tanzu

Step 1: Enable GPU Passthrough (DirectPath I/O)

  1. In vSphere Client:
    • Right-click the host, go to Hardware, and select PCI Devices.
    • Choose the NVIDIA GPU and click Mark for Passthrough.
  2. Assign GPU to Tanzu worker nodes:
    • Edit the VM settings, select Add PCI Device, and choose the GPU.
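
Before installing drivers, it is worth confirming the passthrough device is actually visible from inside the worker node:

# Run on each GPU worker node
lspci | grep -i nvidia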

Step 2: Install NVIDIA Drivers and Container Toolkit

Run these as root on each worker node:

# Install NVIDIA driver (example for RHEL/CentOS)
yum install -y kernel-devel gcc make
bash NVIDIA-Linux-x86_64-*.run

# Install the NVIDIA Container Toolkit (nvidia-docker2 packaging)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
yum install -y nvidia-docker2
systemctl restart docker

Verify installation:

nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

Step 3: Deploy NVIDIA GPU Operator (Helm)

kubectl create namespace gpu-operator
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install --wait --generate-name -n gpu-operator nvidia/gpu-operator
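
Once the Operator pods settle, a quick way to confirm GPUs are schedulable (output will vary with your node names):

kubectl get pods -n gpu-operator
kubectl describe nodes | grep nvidia.com/gpu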


6. Deploying NVIDIA GPU Support on Nutanix Karbon

Step 1: Enable GPU Passthrough in Nutanix Prism

  1. In Prism Element:
    • Go to VM, click Hardware, then Add GPU, and select the NVIDIA GPU.
  2. For each Karbon worker VM:
    • Attach the GPU and reboot.

Step 2: Install Drivers & Toolkit

Follow the same NVIDIA driver and Container Toolkit steps as above, adapting for Ubuntu or Debian as needed; a sketch for Ubuntu follows below.
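
A minimal Ubuntu variant of the toolkit install, based on NVIDIA's nvidia-docker packaging (check NVIDIA's current Container Toolkit install guide, as packaging has evolved):

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker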

Step 3: Deploy NVIDIA GPU Operator

  1. Create a namespace:

kubectl create namespace gpu-operator

  2. Install via Helm:

helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator
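
To smoke-test GPU scheduling on Karbon, a minimal pod like the following should print nvidia-smi output (the pod name, file name, and CUDA image tag are illustrative; pick a tag that matches your driver):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

Save as gpu-smoke-test.yaml, then:

kubectl apply -f gpu-smoke-test.yaml
kubectl logs gpu-smoke-test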


7. Example: Training a Deep Learning Model with NGC Containers

Let’s run a PyTorch training job on Kubernetes using the official NVIDIA NGC image.

Step 1: Pull Example YAML

ngc_pytorch_job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-gpu-job
spec:
  template:
    spec:
      containers:
      - name: pytorch
        image: nvcr.io/nvidia/pytorch:24.04-py3
        command: ["python", "-c", "import torch; print(torch.cuda.is_available()); print(torch.rand(2,2).cuda())"]
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never

Deploy:

kubectl apply -f ngc_pytorch_job.yaml
kubectl logs job/pytorch-gpu-job

You should see True for GPU availability followed by a random tensor on the GPU, confirming that the container can access the GPU.
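
To wait for the job to finish and then clean up, a sketch using standard kubectl:

kubectl wait --for=condition=complete job/pytorch-gpu-job --timeout=300s
kubectl delete -f ngc_pytorch_job.yaml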


8. Integrating DevOps: CI/CD, Harbor Registry, and Monitoring

CI/CD with GitLab or Jenkins

  • Sample GitLab CI pipeline: build and push a custom NGC-compatible image when code changes, then deploy the ML workload as a Kubernetes Job.

stages:
  - build
  - deploy

build:
  stage: build
  script:
    - docker build -t my-registry/ngc-custom:latest .
    - docker push my-registry/ngc-custom:latest

deploy:
  stage: deploy
  script:
    - kubectl apply -f k8s/job.yaml

Harbor Registry

  • Store signed, scanned, and versioned ML images in Harbor.
  • Integrate with Tanzu or Karbon clusters for secure image pulls.
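
For private Harbor projects, image pulls typically need a registry secret referenced from the pod spec. A sketch (harbor.example.com and the credentials are placeholders):

kubectl create secret docker-registry harbor-creds \
  --docker-server=harbor.example.com \
  --docker-username=<user> \
  --docker-password=<password>

Then reference it in the Job or Pod spec under spec.imagePullSecrets (name: harbor-creds).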

Monitoring: Prometheus & Grafana

  • Deploy the NVIDIA DCGM Exporter for GPU metrics.
  • Visualize GPU utilization in Grafana dashboards.
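
If you deployed the GPU Operator, a DCGM exporter is typically already running. For a standalone install, NVIDIA publishes a Helm chart (repo URL per NVIDIA's dcgm-exporter documentation):

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter

Prometheus can then scrape the exporter's /metrics endpoint, and Grafana dashboards (for example NVIDIA's DCGM dashboard) visualize per-GPU utilization and memory.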

9. Troubleshooting & Best Practices

Common issues and how to address them:

  • GPU not detected in pod: Check the node label nvidia.com/gpu.present and review the driver and GPU Operator logs.
  • Pod pending (no GPU): Not enough GPU nodes or quota; check kubectl describe node.
  • vGPU licensing errors: Make sure the NVIDIA license server is reachable from all GPU-enabled VMs, and verify that your deployment's vGPU licensing model matches your Nutanix AHV and Kubernetes configuration, especially in multi-tenant scenarios.
  • Compatibility mismatches: Review Nutanix's latest compatibility matrices for vGPU and Kubernetes versions.
  • CUDA version mismatch: Use a container image whose CUDA version is supported by the installed driver.
  • Image pull errors: Check Harbor or NGC registry access and verify that the image tag is correct.

Pro Tips:

  • Use taints and tolerations to isolate GPU nodes (see the sketch below).
  • Automate driver and Operator deployment with Ansible or Terraform for large clusters.
  • Use kubectl top pod with metrics-server and DCGM for live GPU stats.
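
A minimal taint/toleration sketch (the node name and taint value are illustrative):

# Keep non-GPU workloads off the GPU node
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

# In the GPU workload's pod spec:
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule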

Disclaimer

The views expressed in this article are those of the author and do not represent the opinions of any vendor, my employer, or any affiliated organization. Always refer to the official vendor documentation before production deployment.
