2.9 KiB
2.9 KiB
NVIDIA Device Plugin
The NVIDIA Device Plugin for Kubernetes enables GPU scheduling and resource management on nodes with NVIDIA GPUs.
Overview
This service deploys the official NVIDIA Device Plugin as a DaemonSet that:
- Discovers NVIDIA GPUs on worker nodes
- Labels nodes with GPU product information (e.g.,
nvidia.com/gpu.product=GeForce-RTX-4090) - Advertises GPU resources (
nvidia.com/gpu) to the Kubernetes scheduler - Enables pods to request GPU resources
Prerequisites
Before installing the NVIDIA Device Plugin, ensure that:
- NVIDIA Drivers are installed (>= 384.81)
- nvidia-container-toolkit is installed (>= 1.7.0)
- nvidia-container-runtime is configured as the default container runtime
- Worker nodes have NVIDIA GPUs
Talos Linux Requirements
For Talos Linux nodes, you need:
- NVIDIA drivers extension in the Talos schematic
- nvidia-container-toolkit extension
- Proper container runtime configuration
Installation
# Configure and install the service
wild-cluster-services-configure nvidia-device-plugin
wild-cluster-install nvidia-device-plugin
Verification
After installation, verify the plugin is working:
# Check plugin pods are running
kubectl get pods -n kube-system | grep nvidia
# Verify GPU resources are advertised
kubectl get nodes -o json | jq '.items[].status.capacity | select(has("nvidia.com/gpu"))'
# Check GPU node labels
kubectl get nodes --show-labels | grep nvidia
Usage in Applications
Once installed, applications can request GPU resources:
apiVersion: apps/v1
kind: Deployment
metadata:
name: gpu-app
spec:
template:
spec:
containers:
- name: app
image: nvidia/cuda:latest
resources:
requests:
nvidia.com/gpu: 1
limits:
nvidia.com/gpu: 1
Troubleshooting
Plugin Not Starting
- Verify NVIDIA drivers are installed on worker nodes
- Check that nvidia-container-toolkit is properly configured
- Ensure worker nodes are not tainted in a way that prevents scheduling
No GPU Resources Advertised
- Check plugin logs:
kubectl logs -n kube-system -l name=nvidia-device-plugin-ds - Verify NVIDIA runtime is the default container runtime
- Ensure GPUs are detected by the driver: check node logs for GPU detection messages
Configuration
The plugin uses the following configuration:
- Image:
nvcr.io/nvidia/k8s-device-plugin:v0.17.1 - Namespace:
kube-system - Priority Class:
system-node-critical - Tolerations: Schedules on nodes with
nvidia.com/gputaint