Files
wild-central-api/internal/setup/cluster-services/nvidia-device-plugin

NVIDIA Device Plugin

The NVIDIA Device Plugin for Kubernetes enables GPU scheduling and resource management on nodes with NVIDIA GPUs.

Overview

This service deploys the official NVIDIA Device Plugin as a DaemonSet that:

  • Discovers NVIDIA GPUs on worker nodes
  • Labels nodes with GPU product information (e.g., nvidia.com/gpu.product=GeForce-RTX-4090)
  • Advertises GPU resources (nvidia.com/gpu) to the Kubernetes scheduler
  • Enables pods to request GPU resources

Prerequisites

Before installing the NVIDIA Device Plugin, ensure that:

  1. NVIDIA drivers are installed (>= 384.81)
  2. nvidia-container-toolkit is installed (>= 1.7.0)
  3. nvidia-container-runtime is configured as the default low-level container runtime
  4. Worker nodes have NVIDIA GPUs
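A quick way to probe the first two prerequisites on a node is sketched below. This assumes nvidia-smi and nvidia-ctk are on the PATH when installed; each probe degrades to a notice rather than failing when the tool is absent.

```shell
# Sketch of a prerequisite check; each probe falls back to a notice
# when the tool is missing, so the script always completes.
check_gpu_prereqs() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # Driver version (prerequisite 1: >= 384.81)
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
  else
    echo "nvidia-smi not found: drivers missing or not on PATH"
  fi
  if command -v nvidia-ctk >/dev/null 2>&1; then
    # Toolkit version (prerequisite 2: >= 1.7.0)
    nvidia-ctk --version
  else
    echo "nvidia-ctk not found: nvidia-container-toolkit missing"
  fi
}
check_gpu_prereqs
```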

Talos Linux Requirements

For Talos Linux nodes, you need:

  • NVIDIA drivers extension in the Talos schematic
  • nvidia-container-toolkit extension
  • Proper container runtime configuration
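For reference, a Talos Image Factory schematic that bundles both extensions might look like this. This is a sketch: the exact extension names depend on the driver branch you need (e.g. production vs. LTS variants).

```yaml
# Talos schematic (sketch); extension names vary by driver branch.
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nonfree-kmod-nvidia-production
      - siderolabs/nvidia-container-toolkit-production
```

Feed the schematic to the Image Factory to build an installer image, then set the NVIDIA runtime as containerd's default via a machine config patch.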

Installation

# Configure and install the service
wild-cluster-services-configure nvidia-device-plugin
wild-cluster-install nvidia-device-plugin

Verification

After installation, verify the plugin is working:

# Check plugin pods are running
kubectl get pods -n kube-system | grep nvidia

# Verify GPU resources are advertised
kubectl get nodes -o json | jq '.items[].status.capacity | select(has("nvidia.com/gpu"))'

# Check GPU node labels
kubectl get nodes --show-labels | grep nvidia

Usage in Applications

Once installed, applications can request GPU resources. Note that nvidia.com/gpu is an extended resource: it cannot be overcommitted, and if both are specified, requests must equal limits.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      containers:
      - name: app
        image: nvidia/cuda:latest  # prefer pinning a specific CUDA tag
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
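As a quick smoke test, you can generate a one-shot pod that requests a GPU and runs nvidia-smi. The pod name and CUDA image tag below are illustrative assumptions, not part of this service's configuration.

```shell
# Generate a one-shot GPU smoke-test pod manifest (names and image
# tag are illustrative).
cat > gpu-smoke-test.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# Then, against the cluster:
#   kubectl apply -f gpu-smoke-test.yaml
#   kubectl logs gpu-smoke-test   # should list the GPU
```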

Troubleshooting

Plugin Not Starting

  • Verify NVIDIA drivers are installed on worker nodes
  • Check that nvidia-container-toolkit is properly configured
  • Ensure worker nodes are not tainted in a way that prevents scheduling
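The taint check above can be sketched as a one-liner; this assumes kubectl access and falls back to a notice when no cluster is reachable.

```shell
# List each node's taint keys (sketch; degrades gracefully when
# kubectl is missing or no cluster is reachable).
check_node_taints() {
  if command -v kubectl >/dev/null 2>&1; then
    kubectl get nodes -o custom-columns='NODE:.metadata.name,TAINTS:.spec.taints[*].key' 2>/dev/null \
      || echo "no cluster reachable"
  else
    echo "kubectl not installed"
  fi
}
check_node_taints
```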

No GPU Resources Advertised

  • Check plugin logs: kubectl logs -n kube-system -l name=nvidia-device-plugin-ds
  • Verify the NVIDIA runtime is the default container runtime
  • Ensure the driver detects the GPUs: check kernel logs on the node (e.g., talosctl dmesg on Talos) for NVRM/GPU messages
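On nodes with a shell, the default-runtime check can be sketched as below. This assumes the conventional containerd config path; on Talos there is no node shell, so inspect the machine config instead.

```shell
# Check whether containerd declares a default runtime (sketch;
# assumes the conventional config path).
check_default_runtime() {
  cfg=/etc/containerd/config.toml
  if [ -f "$cfg" ]; then
    grep -n 'default_runtime_name' "$cfg" \
      || echo "no default_runtime_name set in $cfg"
  else
    echo "$cfg not found on this host"
  fi
}
check_default_runtime
```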

Configuration

The plugin uses the following configuration:

  • Image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
  • Namespace: kube-system
  • Priority Class: system-node-critical
  • Tolerations: tolerates the nvidia.com/gpu taint so the DaemonSet can schedule on GPU nodes
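Rendered into the DaemonSet pod spec, the priority class and toleration typically take this shape (a sketch based on the upstream manifest layout):

```yaml
# Excerpt of the DaemonSet pod spec (sketch).
spec:
  priorityClassName: system-node-critical
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```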
