NVIDIA Device Plugin
The NVIDIA Device Plugin for Kubernetes enables GPU scheduling and resource management on nodes with NVIDIA GPUs.
Overview
This service deploys the official NVIDIA Device Plugin as a DaemonSet that:
- Discovers NVIDIA GPUs on worker nodes
- Labels nodes with GPU product information (e.g., nvidia.com/gpu.product=GeForce-RTX-4090)
- Advertises GPU resources (nvidia.com/gpu) to the Kubernetes scheduler
- Enables pods to request GPU resources
Prerequisites
Before installing the NVIDIA Device Plugin, ensure that:
- NVIDIA Drivers are installed (>= 384.81)
- nvidia-container-toolkit is installed (>= 1.7.0)
- nvidia-container-runtime is configured as the default container runtime
- Worker nodes have NVIDIA GPUs
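The driver version requirement can be checked on a node with a short script. This is a minimal sketch, assuming a conventional Linux host where nvidia-smi is on the PATH and GNU sort is available (it does not apply to Talos, which has no host shell). The helper names version_ge and check_driver are hypothetical.

```shell
# Minimum driver version taken from the prerequisites list.
MIN_DRIVER="384.81"

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU sort's version ordering (-V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Reads the installed driver version from nvidia-smi and compares it
# against MIN_DRIVER. Only the first GPU's driver version is inspected.
check_driver() {
  installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if version_ge "$installed" "$MIN_DRIVER"; then
    echo "driver $installed OK"
  else
    echo "driver $installed is older than $MIN_DRIVER" >&2
    return 1
  fi
}
```

Source the snippet on the node and run check_driver; a non-zero exit status means the driver is missing or too old.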
Talos Linux Requirements
For Talos Linux nodes, you need:
- NVIDIA drivers extension in the Talos schematic
- nvidia-container-toolkit extension
- Proper container runtime configuration
Installation
# Configure and install the service
wild-cluster-services-configure nvidia-device-plugin
wild-cluster-install nvidia-device-plugin
Verification
After installation, verify the plugin is working:
# Check plugin pods are running
kubectl get pods -n kube-system | grep nvidia
# Verify GPU resources are advertised
kubectl get nodes -o json | jq '.items[].status.capacity | select(has("nvidia.com/gpu"))'
# Check GPU node labels
kubectl get nodes --show-labels | grep nvidia
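For a more compact view than the jq query above, the allocatable GPU count can be printed per node. This is a sketch only; the function name list_gpu_nodes is hypothetical, and it assumes kubectl is already pointed at the cluster.

```shell
# Print each node name next to its allocatable nvidia.com/gpu count.
# Nodes that do not advertise the resource show "<none>" in the GPUS column.
# The dot inside the resource name is escaped so that kubectl treats
# "nvidia.com/gpu" as a single key.
list_gpu_nodes() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Running list_gpu_nodes should show a number in the GPUS column for every node where the plugin has registered GPUs.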
Usage in Applications
Once installed, applications can request GPU resources:
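apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
spec:
  replicas: 1
  # A Deployment requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      containers:
        - name: app
          # Replace with your application's image; prefer a pinned tag
          image: nvidia/cuda:latest
          resources:
            # For extended resources such as GPUs, requests must equal limits
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1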
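Before deploying a full application, a one-shot pod that runs nvidia-smi can confirm that GPU requests are satisfied end to end. The following is a minimal sketch: the function name gpu_smoke_test_manifest and the pod name gpu-smoke-test are hypothetical, and it assumes the chosen image is CUDA-enabled so that the NVIDIA runtime makes nvidia-smi available inside the container.

```shell
# Print a Pod manifest that requests one GPU and runs nvidia-smi once.
# $1: container image (assumption: a CUDA-enabled image you can pull).
gpu_smoke_test_manifest() {
  image="${1:?usage: gpu_smoke_test_manifest IMAGE}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: ${image}
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}
```

Pipe the output into kubectl apply -f - with an image of your choice, then read the result with kubectl logs gpu-smoke-test. If the pod stays Pending, the scheduler could not find a node with a free nvidia.com/gpu resource.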
Troubleshooting
Plugin Not Starting
- Verify NVIDIA drivers are installed on worker nodes
- Check that nvidia-container-toolkit is properly configured
- Ensure worker nodes are not tainted in a way that prevents scheduling
No GPU Resources Advertised
- Check plugin logs: kubectl logs -n kube-system -l name=nvidia-device-plugin-ds
- Verify NVIDIA runtime is the default container runtime
- Ensure GPUs are detected by the driver: check node logs for GPU detection messages
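When narrowing down either problem, it helps to see which nodes actually run a plugin pod. The sketch below reuses the label selector from the log command above; the function name plugin_pods is hypothetical.

```shell
# List the device plugin pods together with the node each one runs on.
# A GPU node missing from this list points at a scheduling problem
# (taints, node selectors) rather than a driver problem.
plugin_pods() {
  kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o wide
}
```

A pod that is listed but not in the Running state is a candidate for the log check, while a GPU node with no pod at all suggests checking taints first.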
Configuration
The plugin uses the following configuration:
- Image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
- Namespace: kube-system
- Priority Class: system-node-critical
- Tolerations: Schedules on nodes with the nvidia.com/gpu taint
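Because the plugin tolerates the nvidia.com/gpu taint, GPU nodes can be reserved for GPU workloads by tainting them. This is a sketch under assumptions: the function name taint_gpu_node is hypothetical, and the taint value "present" and effect "NoSchedule" are illustrative choices, not settings confirmed by this service.

```shell
# Taint a node with the nvidia.com/gpu key so that pods without a
# matching toleration are not scheduled onto it.
# $1: node name.
# Assumption: value "present" and effect "NoSchedule" are examples only.
taint_gpu_node() {
  node="${1:?usage: taint_gpu_node NODE}"
  kubectl taint nodes "$node" nvidia.com/gpu=present:NoSchedule
}
```

After tainting a node this way, application pods will typically need their own toleration for the nvidia.com/gpu key in order to land on it.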