Paul Payne 9687fad812 feat: Move cluster services to wild-directory as unified packages
Convert all 15 cluster services from embedded API format to
wild-directory packages using the unified manifest format:
- metallb, traefik, cert-manager, longhorn, snapshot-controller
- nfs, smtp, coredns, node-feature-discovery, nvidia-device-plugin
- externaldns, docker-registry, headlamp, crowdsec, utils

Changes:
- wild-manifest.yaml → manifest.yaml with is, defaultConfig, requires
- Eliminated configReferences and serviceConfig fields
- Flattened kustomize.template/ to package root
- Template vars use flat defaultConfig keys
- install.sh paths updated for apps/ layout
- Updated 9 app manifests: cloud.smtp.* → apps.smtp.* with requires
- Removed dead install: true field from 6 app manifests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-17 02:26:46 +00:00

NVIDIA Device Plugin

The NVIDIA Device Plugin for Kubernetes enables GPU scheduling and resource management on nodes with NVIDIA GPUs.

Overview

This service deploys the official NVIDIA Device Plugin as a DaemonSet that:

  • Discovers NVIDIA GPUs on worker nodes
  • Labels nodes with GPU product information (e.g., nvidia.com/gpu.product=GeForce-RTX-4090)
  • Advertises GPU resources (nvidia.com/gpu) to the Kubernetes scheduler
  • Enables pods to request GPU resources

Prerequisites

Before installing the NVIDIA Device Plugin, ensure that:

  1. NVIDIA drivers (>= 384.81) are installed on every GPU node
  2. nvidia-container-toolkit (>= 1.7.0) is installed on every GPU node
  3. nvidia-container-runtime is configured as the default low-level container runtime
  4. Worker nodes have NVIDIA GPUs
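
The version minimums above can be checked with a small helper. This is a minimal sketch, assuming a node with a regular shell and GNU `sort -V`; the function names `version_ge` and `check_min` are illustrative and not part of this package. Talos nodes have no shell, so the installed versions there come from the extensions in the schematic instead.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Uses version sort: if B sorts first (or ties), then A >= B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_min NAME INSTALLED MINIMUM: report whether INSTALLED meets MINIMUM.
check_min() {
  if version_ge "$2" "$3"; then
    echo "ok: $1 $2 (>= $3)"
  else
    echo "too old: $1 $2 (< $3)"
    return 1
  fi
}

# Usage on a GPU node with the driver utilities installed:
#   check_min driver "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" 384.81
```

The toolkit version can be passed to `check_min` the same way once it has been read from the node's package manager.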

Talos Linux Requirements

For Talos Linux nodes, you need:

  • NVIDIA drivers extension in the Talos schematic
  • nvidia-container-toolkit extension
  • Proper container runtime configuration

Installation

# Configure and install the service
wild-cluster-services-configure nvidia-device-plugin
wild-cluster-install nvidia-device-plugin

Verification

After installation, verify the plugin is working:

# Check plugin pods are running
kubectl get pods -n kube-system | grep nvidia

# Verify GPU resources are advertised
kubectl get nodes -o json | jq '.items[].status.capacity | select(has("nvidia.com/gpu"))'

# Check GPU node labels
kubectl get nodes --show-labels | grep nvidia
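
For a per-node summary without jq, the checks above can be condensed into one helper. This is a sketch, assuming `kubectl` and `awk` are available; the function name `list_gpu_nodes` is illustrative. It reads the allocatable `nvidia.com/gpu` count through `custom-columns`, where the dot inside the resource name has to be escaped.

```shell
# list_gpu_nodes: print "NODE GPUS" for every node that advertises at least one GPU.
# Nodes without the resource show "<none>" in the GPUS column and are filtered out.
list_gpu_nodes() {
  kubectl get nodes --no-headers \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' \
  | awk '$2 + 0 > 0 { print $1, $2 }'
}

# Usage (requires cluster access):
#   list_gpu_nodes
```

Empty output means no node is advertising GPUs yet, which points to the troubleshooting steps for missing GPU resources.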

Usage in Applications

Once installed, applications can request GPU resources:

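apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
spec:
  replicas: 1
  # apps/v1 Deployments require a selector that matches the pod template labels
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      containers:
      - name: app
        # Pin a specific CUDA tag (example shown); nvidia/cuda has no usable "latest" tag
        image: nvidia/cuda:12.4.1-base-ubuntu22.04
        resources:
          # For extended resources such as GPUs, requests (if set) must equal limits
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1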
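
A quick way to confirm that a GPU request is actually satisfied is a one-shot Pod that runs `nvidia-smi`. The following is a sketch only: the Pod name `gpu-smoke-test`, the function name, and the image tag are assumptions chosen for illustration, not something this package ships.

```shell
# gpu_smoke_test_manifest: print a one-shot Pod that runs nvidia-smi on one GPU.
# Pod name and image tag are illustrative assumptions.
gpu_smoke_test_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}

# Usage (requires cluster access):
#   gpu_smoke_test_manifest | kubectl apply -f -
#   kubectl logs gpu-smoke-test
#   kubectl delete pod gpu-smoke-test
```

If the Pod stays Pending, the scheduler found no node with a free `nvidia.com/gpu`; if it completes, its log should show the GPU that was assigned to the container.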

Troubleshooting

Plugin Not Starting

  • Verify NVIDIA drivers are installed on worker nodes
  • Check that nvidia-container-toolkit is properly configured
  • Ensure GPU worker nodes carry no taints that the plugin DaemonSet does not tolerate
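
The taint check can be scripted. This is a rough sketch, assuming the DaemonSet tolerates only the `nvidia.com/gpu` taint listed under Configuration; the function name `untolerated_taints` is illustrative. If the deployed manifest carries more tolerations, the extra keys it flags are harmless.

```shell
# untolerated_taints: print "NODE TAINT_KEY" for every taint key other than nvidia.com/gpu.
# kubectl joins multiple taint keys with commas; nodes without taints show "<none>".
untolerated_taints() {
  kubectl get nodes --no-headers \
    -o custom-columns='NODE:.metadata.name,TAINTS:.spec.taints[*].key' \
  | awk '$2 != "<none>" {
      n = split($2, keys, ",")
      for (i = 1; i <= n; i++)
        if (keys[i] != "nvidia.com/gpu") print $1, keys[i]
    }'
}

# Usage (requires cluster access):
#   untolerated_taints
```

Control plane nodes normally show up in this list with their role taint. That is expected; only entries for GPU worker nodes need attention.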

No GPU Resources Advertised

  • Check plugin logs: kubectl logs -n kube-system -l name=nvidia-device-plugin-ds
  • Verify NVIDIA runtime is the default container runtime
  • Ensure GPUs are detected by the driver: check node logs for GPU detection messages

Configuration

The plugin uses the following configuration:

  • Image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
  • Namespace: kube-system
  • Priority Class: system-node-critical
  • Tolerations: Tolerates the nvidia.com/gpu taint, so the DaemonSet can schedule onto tainted GPU nodes

References