Files

Paul Payne 9687fad812 feat: Move cluster services to wild-directory as unified packages

Convert all 15 cluster services from embedded API format to
wild-directory packages using the unified manifest format:
- metallb, traefik, cert-manager, longhorn, snapshot-controller
- nfs, smtp, coredns, node-feature-discovery, nvidia-device-plugin
- externaldns, docker-registry, headlamp, crowdsec, utils

Changes:
- wild-manifest.yaml → manifest.yaml with is, defaultConfig, requires
- Eliminated configReferences and serviceConfig fields
- Flattened kustomize.template/ to package root
- Template vars use flat defaultConfig keys
- install.sh paths updated for apps/ layout
- Updated 9 app manifests: cloud.smtp.* → apps.smtp.* with requires
- Removed dead install: true field from 6 app manifests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-05-17 02:26:46 +00:00

daemonset.yaml
install.sh
kustomization.yaml
manifest.yaml
README.md
runtimeclass.yaml

README.md

NVIDIA Device Plugin

The NVIDIA Device Plugin for Kubernetes enables GPU scheduling and resource management on nodes with NVIDIA GPUs.

Overview

This service deploys the official NVIDIA Device Plugin as a DaemonSet that:

Discovers NVIDIA GPUs on worker nodes
Labels nodes with GPU product information (e.g., nvidia.com/gpu.product=GeForce-RTX-4090)
Advertises GPU resources (nvidia.com/gpu) to the Kubernetes scheduler
Enables pods to request GPU resources

Prerequisites

Before installing the NVIDIA Device Plugin, ensure that:

NVIDIA drivers are installed (>= 384.81)
nvidia-container-toolkit is installed (>= 1.7.0)
nvidia-container-runtime is configured as the default container runtime
Worker nodes have NVIDIA GPUs
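
The version minimums above can be checked with a small shell helper. This is only a sketch and not part of the package: `version_ge` is a hypothetical helper name, and it assumes `sort -V` (version sort, as in GNU coreutils) is available. The commented example also assumes `nvidia-smi` is on the node's PATH, which is typically not true on Talos.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed when version A is greater than or equal to version B.
# Hypothetical helper; relies on `sort -V` (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example on a node with a shell (assumes nvidia-smi is installed):
#   driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   version_ge "$driver" 384.81 && echo "driver OK: $driver"
```

The same helper can compare the nvidia-container-toolkit version against 1.7.0 once you have extracted the version string.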

Talos Linux Requirements

For Talos Linux nodes, you need:

NVIDIA drivers extension in the Talos schematic
nvidia-container-toolkit extension
A container runtime configuration that uses the NVIDIA runtime (see Prerequisites)
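
Whether a Talos node meets these requirements can be inspected with `talosctl`. The following is a sketch that assumes `talosctl` is already configured for the cluster; `talos_nvidia_check` is a hypothetical function name and the node IP in the usage line is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: inspect a Talos node for the NVIDIA pieces listed above.
# Assumes talosctl is configured; the function name is hypothetical.
talos_nvidia_check() {
  local node="$1"
  # List installed system extensions (look for the NVIDIA driver and toolkit).
  talosctl -n "$node" get extensions
  # Show loaded kernel modules whose name starts with "nvidia".
  talosctl -n "$node" read /proc/modules | grep '^nvidia' || echo "no nvidia modules loaded"
  # Print the driver version reported by the kernel module.
  talosctl -n "$node" read /proc/driver/nvidia/version
}

# Usage (placeholder IP): talos_nvidia_check 192.168.1.50
```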

Installation

# Configure and install the service
wild-cluster-services-configure nvidia-device-plugin
wild-cluster-install nvidia-device-plugin

Verification

After installation, verify the plugin is working:

# Check plugin pods are running
kubectl get pods -n kube-system | grep nvidia

# Verify GPU resources are advertised
kubectl get nodes -o json | jq '.items[].status.capacity | select(has("nvidia.com/gpu"))'

# Check GPU node labels
kubectl get nodes --show-labels | grep nvidia
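
If `jq` is not available, the checks above can be approximated with kubectl alone. This sketch assumes the plugin pods carry the `name=nvidia-device-plugin-ds` label used in the Troubleshooting section; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: wait for the plugin pods, then summarize GPU resources per node.

# Block until the plugin pods are Ready (default timeout: 120s).
wait_for_plugin() {
  kubectl wait --for=condition=Ready pod -n kube-system \
    -l name=nvidia-device-plugin-ds --timeout="${1:-120s}"
}

# One row per node: name, GPU capacity, allocatable GPUs.
# Nodes without GPUs show <none>. Dots in the resource name are escaped.
gpu_node_summary() {
  kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU_CAPACITY:.status.capacity.nvidia\.com/gpu,GPU_ALLOCATABLE:.status.allocatable.nvidia\.com/gpu'
}

# Usage: wait_for_plugin 180s && gpu_node_summary
```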

Usage in Applications

Once installed, applications can request GPU resources:

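apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
spec:
  # apps/v1 requires a selector that matches the pod template labels (example label)
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      containers:
      - name: app
        # Pin a CUDA tag that suits your driver; this tag is only an example
        image: nvidia/cuda:12.4.1-base-ubuntu22.04
        resources:
          requests:
            nvidia.com/gpu: 1
          # GPUs are an extended resource: a limit is required and requests must equal it
          limits:
            nvidia.com/gpu: 1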
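
A quick way to confirm that a pod can actually reach a GPU is a one-shot pod that runs `nvidia-smi`. The sketch below is illustrative: the pod name `gpu-smoke-test`, the function name, and the image tag are assumptions, and it presumes the NVIDIA runtime is the default container runtime as the prerequisites state.

```shell
#!/usr/bin/env bash
# Sketch: print a one-shot Pod manifest that runs nvidia-smi on one GPU.
# Pod name and image tag are illustrative assumptions.
gpu_smoke_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}

# Usage:
#   gpu_smoke_manifest | kubectl apply -f -
#   kubectl logs gpu-smoke-test        # once the pod has completed
#   kubectl delete pod gpu-smoke-test
```

If the pod stays Pending, no node is advertising a free `nvidia.com/gpu` resource.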

Troubleshooting

Plugin Not Starting

Verify NVIDIA drivers are installed on worker nodes
Check that nvidia-container-toolkit is properly configured
Ensure worker nodes are not tainted in a way that prevents scheduling

No GPU Resources Advertised

Check plugin logs: kubectl logs -n kube-system -l name=nvidia-device-plugin-ds
Verify NVIDIA runtime is the default container runtime
Ensure GPUs are detected by the driver: check the node's kernel log for NVIDIA driver messages
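
The checks in this section can be gathered in one pass. This is a sketch using standard kubectl options; `gpu_plugin_diagnose` is a hypothetical function name, and the label selector is the one from the log command above.

```shell
#!/usr/bin/env bash
# Sketch: collect the troubleshooting data described above.
gpu_plugin_diagnose() {
  # Which nodes run the plugin pods, and what state are they in?
  kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o wide
  # Recent plugin logs from every matching pod.
  kubectl logs -n kube-system -l name=nvidia-device-plugin-ds --tail=50
  # Taint keys per node; a taint the plugin does not tolerate blocks scheduling.
  kubectl get nodes -o custom-columns='NODE:.metadata.name,TAINTS:.spec.taints[*].key'
}

# Usage: gpu_plugin_diagnose
```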

Configuration

The plugin uses the following configuration:

Image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
Namespace: kube-system
Priority Class: system-node-critical
Tolerations: Tolerates the nvidia.com/gpu taint, so the plugin can run on tainted GPU nodes
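
To compare a running deployment against these values, the pod specs can be queried directly. This sketch assumes the pods carry the `name=nvidia-device-plugin-ds` label and that the plugin is the first container in each pod; `plugin_config_summary` is a hypothetical function name.

```shell
#!/usr/bin/env bash
# Sketch: print pod name, image, and priority class for each plugin pod.
# Assumes the plugin is the first container in the pod spec.
plugin_config_summary() {
  kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\t"}{.spec.priorityClassName}{"\n"}{end}'
}

# Usage: plugin_config_summary
```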