Digital Thought Disruption

Containerized Deep Learning: Running NVIDIA GPUs on VMware Tanzu & Nutanix Karbon

Table of Contents

  1. Introduction
  2. Why Containerized Deep Learning on Kubernetes?
  3. Overview: VMware Tanzu, Nutanix Karbon, and NVIDIA GPUs
  4. Prerequisites & Lab Setup
  5. Deploying NVIDIA GPU Support on VMware Tanzu
  6. Deploying NVIDIA GPU Support on Nutanix Karbon
  7. Example: Training a Deep Learning Model with NGC Containers
  8. Integrating DevOps: CI/CD, Harbor, and Monitoring
  9. Troubleshooting & Best Practices
  10. References

1. Introduction

Deep learning workloads are transforming enterprise AI. Running them efficiently at scale requires more than just powerful GPUs. Kubernetes has become the go-to platform for orchestrating modern, containerized machine learning environments. With Kubernetes, you get reproducibility, scalability, and deep integration with DevOps workflows.

This tutorial and proof-of-concept guide shows how to harness NVIDIA GPUs within VMware Tanzu and Nutanix Karbon Kubernetes platforms. Here, you will find practical configuration steps, workflow automation tips, and sample code. You will also see references to VMware’s official GPU enablement blogs and Nutanix Karbon GPU documentation for accuracy.


2. Why Containerized Deep Learning on Kubernetes?

Key benefits:

Enterprise challenge:
Most enterprise ML teams struggle to make GPU resources shareable, auditable, and easy to manage. VMware Tanzu and Nutanix Karbon provide battle-tested Kubernetes solutions. NVIDIA GPU Operator and NGC containers deliver full-stack GPU support.


3. Overview: VMware Tanzu, Nutanix Karbon, and NVIDIA GPUs

VMware Tanzu

Nutanix Karbon

NVIDIA GPUs


4. Prerequisites & Lab Setup

Hardware

Software

Network & Storage

Licensing and Compatibility


5. Deploying NVIDIA GPU Support on VMware Tanzu

Step 1: Enable GPU Passthrough (DirectPath I/O)

  1. In vSphere Client:
    • Right-click the host, go to Hardware, and select PCI Devices.
    • Choose the NVIDIA GPU and click Mark for Passthrough.
  2. Assign GPU to Tanzu worker nodes:
    • Edit the VM settings, select Add PCI Device, and choose the GPU.

Step 2: Install NVIDIA Drivers and Container Toolkit

Run these as root on each worker node:

# Install the NVIDIA driver (example for RHEL/CentOS)
yum install -y gcc make kernel-devel-$(uname -r)
bash NVIDIA-Linux-x86_64-*.run

# Install the NVIDIA Docker Toolkit (RHEL/CentOS)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | tee /etc/yum.repos.d/nvidia-docker.repo
yum install -y nvidia-docker2
systemctl restart docker

Verify installation:

nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

Step 3: Deploy NVIDIA GPU Operator (Helm)

kubectl create namespace gpu-operator
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install --wait --generate-name -n gpu-operator nvidia/gpu-operator
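
Once the Operator pods are running (check with kubectl get pods -n gpu-operator), a quick end-to-end check is a throwaway pod that requests a GPU and runs nvidia-smi. A minimal sketch; the CUDA image tag is an assumption, so use one that matches your installed driver:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

If scheduling and the driver stack are healthy, kubectl logs gpu-smoke-test prints the same nvidia-smi table you saw on the node.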

References:


6. Deploying NVIDIA GPU Support on Nutanix Karbon

Step 1: Enable GPU Passthrough in Nutanix Prism

  1. In Prism Element:
    • Go to VM, click Hardware, then Add GPU, and select the NVIDIA GPU.
  2. For each Karbon worker VM:
    • Attach the GPU and reboot.

Step 2: Install Drivers & Toolkit

Follow the same NVIDIA driver and toolkit steps as above. Adapt for Ubuntu or Debian as needed.
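
For Ubuntu-based worker images, the equivalent toolkit setup uses NVIDIA's apt repository. This is a sketch of the same legacy nvidia-docker2 flow shown above for RHEL; newer deployments may prefer the nvidia-container-toolkit package instead:

```shell
# Add NVIDIA's apt repository and install the Docker runtime hooks (Ubuntu/Debian)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)   # e.g. ubuntu20.04
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
```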

Step 3: Deploy NVIDIA GPU Operator

  1. Create a namespace:
     kubectl create namespace gpu-operator
  2. Install via Helm:
     helm repo add nvidia https://nvidia.github.io/gpu-operator
     helm repo update
     helm install gpu-operator nvidia/gpu-operator -n gpu-operator

Documentation Note:
Karbon documentation is less centralized than VMware’s, but Nutanix confirms GPU support and Kubernetes orchestration through AHV and Prism. Always validate your deployment using the latest Nutanix solution briefs and official docs.

References:


7. Example: Training a Deep Learning Model with NGC Containers

Let’s run a PyTorch training job on Kubernetes using the official NVIDIA NGC image.

Step 1: Pull Example YAML

ngc_pytorch_job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-gpu-job
spec:
  template:
    spec:
      containers:
        - name: pytorch
          image: nvcr.io/nvidia/pytorch:24.04-py3
          command: ["python", "-c", "import torch; print(torch.cuda.is_available()); print(torch.rand(2,2).cuda())"]
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never

Deploy:

kubectl apply -f ngc_pytorch_job.yaml
kubectl logs job/pytorch-gpu-job

You should see True for GPU availability, followed by a random tensor on the cuda device, confirming that the container can access the GPU.
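
Real training jobs usually need more than these defaults. PyTorch DataLoader workers use /dev/shm, which Kubernetes caps at 64 MiB unless you mount a memory-backed emptyDir, and multi-GPU runs simply raise the nvidia.com/gpu limit. A sketch of those two additions to the pod spec above (the volume name is illustrative):

    spec:
      containers:
        - name: pytorch
          image: nvcr.io/nvidia/pytorch:24.04-py3
          resources:
            limits:
              nvidia.com/gpu: 2          # request two GPUs for multi-GPU training
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm        # enlarge shared memory for DataLoader workers
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory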


8. Integrating DevOps: CI/CD, Harbor Registry, and Monitoring

CI/CD with GitLab or Jenkins

stages:
  - build
  - deploy

build:
  stage: build
  script:
    - docker build -t my-registry/ngc-custom:latest .
    - docker push my-registry/ngc-custom:latest

deploy:
  stage: deploy
  script:
    - kubectl apply -f k8s/job.yaml

Harbor Registry
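
A common pattern is to push CI-built, NGC-derived images to a private Harbor project using a robot account. A hedged sketch for the GitLab build job above; the registry URL, project path, and variable names are placeholders you would define in your CI settings:

build:
  stage: build
  script:
    - docker login "$HARBOR_URL" -u "$HARBOR_ROBOT_USER" -p "$HARBOR_ROBOT_TOKEN"
    - docker build -t "$HARBOR_URL/ml/ngc-custom:$CI_COMMIT_SHORT_SHA" .
    - docker push "$HARBOR_URL/ml/ngc-custom:$CI_COMMIT_SHORT_SHA"

Harbor's vulnerability scanning and project-level RBAC then apply to every image your pipelines produce.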

Monitoring: Prometheus & Grafana
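
The GPU Operator deploys NVIDIA's DCGM exporter, which publishes per-node GPU metrics (utilization, memory, temperature). If you run the Prometheus Operator, a ServiceMonitor can scrape it; this is a sketch assuming the default gpu-operator namespace, and the label and port name should be verified against the Service in your deployment:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter    # label on the exporter Service (verify in your cluster)
  endpoints:
    - port: gpu-metrics            # port name exposed by the exporter
      interval: 15s

NVIDIA also publishes a ready-made Grafana dashboard for DCGM metrics, which is a convenient starting point.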


9. Troubleshooting & Best Practices

Issue                    | Cause or Solution
GPU not detected in pod  | Check node label nvidia.com/gpu.present. Verify driver and Operator logs.
Pod pending (no GPU)     | Not enough GPU nodes or quota. Check kubectl describe node.
vGPU licensing errors    | Make sure the NVIDIA license server is reachable from all GPU-enabled VMs. Verify that your deployment’s vGPU licensing model matches your Nutanix AHV and Kubernetes configuration, especially in multi-tenant scenarios.
Compatibility mismatches | Review Nutanix’s latest compatibility matrices for vGPU and Kubernetes versions.
CUDA version mismatch    | Use a container CUDA version supported by the installed driver.
Image pull errors        | Check Harbor or NGC registry access. Verify the image tag is correct.

Pro Tips:


10. References

Disclaimer

The views expressed in this article are those of the author and do not represent the opinions of any vendor, my employer, or any affiliated organization. Always refer to the official vendor documentation before production deployment.
