NVIDIA Device Plugin

The NVIDIA Device Plugin for Kubernetes enables GPU scheduling and resource management on nodes with NVIDIA GPUs.

Overview

This service deploys the official NVIDIA Device Plugin as a DaemonSet that:

Discovers NVIDIA GPUs on worker nodes
Labels nodes with GPU product information (e.g., nvidia.com/gpu.product=GeForce-RTX-4090)
Advertises GPU resources (nvidia.com/gpu) to the Kubernetes scheduler
Enables pods to request GPU resources

Prerequisites

Before installing the NVIDIA Device Plugin, ensure that:

NVIDIA drivers are installed (>= 384.81)
nvidia-container-toolkit is installed (>= 1.7.0)
nvidia-container-runtime is configured as the default container runtime
Worker nodes have NVIDIA GPUs
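
The minimum versions listed above can be compared from a shell. This is a minimal sketch, assuming a `sort` that supports `-V` (version sort) and, for the driver check, that `nvidia-smi` is on the node's PATH. The helper names `version_ge` and `check_driver` are hypothetical and not part of this service.

```shell
# version_ge A B: succeeds when version A is greater than or equal to B.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_driver: hypothetical wrapper, defined here but not run.
# Reads the driver version reported by nvidia-smi and compares it
# against the minimum from the prerequisites list.
check_driver() {
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    version_ge "$driver" 384.81
}
```

Running `check_driver && echo "driver OK"` on a GPU node reports whether the installed driver meets the minimum.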

Talos Linux Requirements

For Talos Linux nodes, you need:

NVIDIA drivers extension in the Talos schematic
nvidia-container-toolkit extension
The container runtime configured to use the NVIDIA runtime, as described under Prerequisites
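
The extensions can be checked from a workstation with talosctl. This is a minimal sketch, assuming talosctl is already configured for the cluster. The helper names are hypothetical, and because the driver extension's exact name depends on the schematic, it is passed in as an argument rather than hard-coded.

```shell
# has_extension NAME: succeeds when NAME appears in `talosctl get extensions`
# output supplied on stdin. Matching is a fixed-string search.
has_extension() {
    grep -qF -- "$1"
}

# check_talos_gpu_node NODE DRIVER_EXT: hypothetical wrapper, not run here.
# NODE is a node address; DRIVER_EXT is the driver extension name from
# your schematic (an assumption you must supply).
check_talos_gpu_node() {
    node=$1
    driver_ext=$2
    exts=$(talosctl -n "$node" get extensions) || return 1
    printf '%s\n' "$exts" | has_extension nvidia-container-toolkit || return 1
    printf '%s\n' "$exts" | has_extension "$driver_ext" || return 1
    # The driver exposes its version here once the kernel module is loaded.
    talosctl -n "$node" read /proc/driver/nvidia/version
}
```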

Installation

# Configure and install the service
wild-cluster-services-configure nvidia-device-plugin
wild-cluster-install nvidia-device-plugin

Verification

After installation, verify the plugin is working:

# Check plugin pods are running
kubectl get pods -n kube-system | grep nvidia

# Verify GPU resources are advertised
kubectl get nodes -o json | jq '.items[].status.capacity | select(has("nvidia.com/gpu"))'

# Check GPU node labels
kubectl get nodes --show-labels | grep nvidia
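
If jq is not available, the same check can be done with kubectl's custom-columns output and awk. This is a sketch under that assumption; the function names `list_gpu_capacity` and `filter_gpu_nodes` are hypothetical.

```shell
# list_gpu_capacity: print each node's allocatable nvidia.com/gpu count.
# The dot in the resource name is escaped so kubectl treats it as one key.
list_gpu_capacity() {
    kubectl get nodes --no-headers \
        -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# filter_gpu_nodes: keep only rows whose GPU column is a positive number.
# Nodes without the resource show "<none>" in custom-columns output.
filter_gpu_nodes() {
    awk '$2 ~ /^[0-9]+$/ && $2 > 0 { print $1, $2 }'
}
```

Running `list_gpu_capacity | filter_gpu_nodes` prints one line per node that advertises at least one GPU. Empty output suggests the plugin has not registered any devices yet.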

Usage in Applications

Once installed, applications can request GPU resources:

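apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
spec:
  # apps/v1 Deployments require a selector that matches the pod labels
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      containers:
      - name: app
        # Example tag; nvidia/cuda does not publish a "latest" tag, so pin one
        image: nvidia/cuda:12.4.1-base-ubuntu22.04
        # Keep the container running so the GPU stays allocated
        command: ["sleep", "infinity"]
        resources:
          # For extended resources such as GPUs, requests must equal limits
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1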
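
To confirm that a running workload actually sees its GPU, `nvidia-smi -L` can be run inside the container. This is a minimal sketch, assuming the container is still running and that the NVIDIA runtime makes `nvidia-smi` available inside it. The function names are hypothetical, and `gpu-app` is the Deployment name from the example above.

```shell
# count_visible_gpus: count GPU entries in `nvidia-smi -L` output on stdin.
# Each GPU is listed on its own line starting with "GPU <index>:".
count_visible_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true
}

# gpus_in_deployment [NAME]: hypothetical wrapper, not run here.
# Executes nvidia-smi in one pod of the Deployment and counts the GPUs.
gpus_in_deployment() {
    kubectl exec "deploy/${1:-gpu-app}" -- nvidia-smi -L | count_visible_gpus
}
```

With the example Deployment, `gpus_in_deployment gpu-app` should print the number of GPUs the container requested.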

Troubleshooting

Plugin Not Starting

Verify NVIDIA drivers are installed on worker nodes
Check that nvidia-container-toolkit is properly configured
Ensure worker nodes are not tainted in a way that prevents scheduling
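
The taint check can be sketched as follows, assuming the plugin tolerates only the nvidia.com/gpu taint as listed under Configuration. The function names are hypothetical. Not every taint blocks scheduling, so treat the output as a list of candidates to inspect rather than a definitive diagnosis.

```shell
# list_node_taints: print each node with its comma-separated taint keys.
# Nodes without taints show "<none>" in custom-columns output.
list_node_taints() {
    kubectl get nodes --no-headers \
        -o custom-columns='NODE:.metadata.name,TAINTS:.spec.taints[*].key'
}

# nodes_with_other_taints [KEY]: from list_node_taints output on stdin,
# print "node taint-key" for every taint key other than KEY
# (default: nvidia.com/gpu).
nodes_with_other_taints() {
    awk -v ok="${1:-nvidia.com/gpu}" '
        $2 != "<none>" {
            n = split($2, keys, ",")
            for (i = 1; i <= n; i++)
                if (keys[i] != ok) print $1, keys[i]
        }'
}
```

Running `list_node_taints | nodes_with_other_taints` lists taints that the plugin pods may not tolerate.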

No GPU Resources Advertised

Check plugin logs: kubectl logs -n kube-system -l name=nvidia-device-plugin-ds
Verify NVIDIA runtime is the default container runtime
Ensure GPUs are detected by the driver: check the node's kernel log (for example, dmesg) for NVIDIA driver messages

Configuration

The plugin uses the following configuration:

Image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
Namespace: kube-system
Priority Class: system-node-critical
Tolerations: Tolerates the nvidia.com/gpu taint, so the plugin can be scheduled on tainted GPU nodes
