# NVIDIA Device Plugin

The NVIDIA Device Plugin for Kubernetes enables GPU scheduling and resource management on nodes with NVIDIA GPUs.

## Overview

This service deploys the official NVIDIA Device Plugin as a DaemonSet that:

- Discovers NVIDIA GPUs on worker nodes
- Labels nodes with GPU product information (e.g., `nvidia.com/gpu.product=GeForce-RTX-4090`)
- Advertises GPU resources (`nvidia.com/gpu`) to the Kubernetes scheduler
- Enables pods to request GPU resources
## Prerequisites

Before installing the NVIDIA Device Plugin, ensure that:

1. **NVIDIA Drivers** are installed (>= 384.81)
2. **nvidia-container-toolkit** is installed (>= 1.7.0)
3. **nvidia-container-runtime** is configured as the default container runtime
4. Worker nodes have NVIDIA GPUs
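The driver floor above can be checked directly on a node. A minimal sketch, assuming `nvidia-smi` is on the PATH and GNU `sort` is available; the `version_ge` helper name is ours, not part of any NVIDIA tooling:

```shell
# Minimum driver version required by the plugin (from the list above).
MIN_DRIVER="384.81"

# version_ge A B — true if dot-separated version A >= version B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# On a GPU node, query the installed driver version with:
#   driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
driver="550.54.14"   # example value; replace with the nvidia-smi query above

if version_ge "$driver" "$MIN_DRIVER"; then
  echo "driver $driver OK (>= $MIN_DRIVER)"
else
  echo "driver $driver too old (< $MIN_DRIVER)"
fi
```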
|
### Talos Linux Requirements

For Talos Linux nodes, you need:

- NVIDIA drivers extension in the Talos schematic
- nvidia-container-toolkit extension
- Proper container runtime configuration
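As a sketch of the Talos side, an Image Factory schematic bundling both extensions might look like the following. The extension names come from the Talos Image Factory's official extension list; verify the exact names and variants (e.g., `-production` vs. `-lts`) for your Talos release and GPU generation:

```yaml
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nonfree-kmod-nvidia-production
      - siderolabs/nvidia-container-toolkit-production
```

Submitting this schematic to the Image Factory yields an installer image that includes both the driver and the container toolkit.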
|
## Installation

```bash
# Configure and install the service
wild-cluster-services-configure nvidia-device-plugin
wild-cluster-install nvidia-device-plugin
```
|
|
|
|
## Verification

After installation, verify the plugin is working:

```bash
# Check that the plugin pods are running
kubectl get pods -n kube-system | grep nvidia

# Verify GPU resources are advertised
kubectl get nodes -o json | jq '.items[].status.capacity | select(has("nvidia.com/gpu"))'

# Check GPU node labels
kubectl get nodes --show-labels | grep nvidia
```
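The `jq` filter can also be extended to print each GPU node's name alongside its advertised count. The sketch below tries the filter offline against a fabricated two-node sample (values are illustrative only), so you can see what it selects before running it against a live cluster:

```shell
# Trimmed, fabricated sample of what `kubectl get nodes -o json` returns.
sample='{"items":[
  {"metadata":{"name":"gpu-node"},"status":{"capacity":{"cpu":"8","nvidia.com/gpu":"1"}}},
  {"metadata":{"name":"cpu-node"},"status":{"capacity":{"cpu":"8"}}}]}'

# Keep only nodes that advertise nvidia.com/gpu; print node name plus count.
echo "$sample" | jq -r '.items[]
  | select(.status.capacity["nvidia.com/gpu"] != null)
  | "\(.metadata.name): \(.status.capacity["nvidia.com/gpu"]) GPU(s)"'
# gpu-node: 1 GPU(s)
```

Against a real cluster, pipe `kubectl get nodes -o json` into the same filter.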
|
|
|
|
## Usage in Applications

Once installed, applications can request GPU resources:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
spec:
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      containers:
        - name: app
          image: nvidia/cuda:latest
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
```
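For a quick smoke test after installation, a one-shot pod that requests a GPU and prints `nvidia-smi` output is a common pattern. The pod name and CUDA image tag below are illustrative; pick a tag compatible with your installed driver:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # example tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

If scheduling and the runtime are both working, `kubectl logs gpu-smoke-test` shows the familiar `nvidia-smi` table.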
|
|
|
|
## Troubleshooting

### Plugin Not Starting

- Verify NVIDIA drivers are installed on worker nodes
- Check that nvidia-container-toolkit is properly configured
- Ensure worker nodes are not tainted in a way that prevents scheduling

### No GPU Resources Advertised

- Check plugin logs: `kubectl logs -n kube-system -l name=nvidia-device-plugin-ds`
- Verify the NVIDIA runtime is the default container runtime
- Ensure GPUs are detected by the driver: check node logs for GPU detection messages
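For the default-runtime check above, the relevant containerd setting is `default_runtime_name`. The sketch below runs the check against a fabricated config stanza so it works offline; on a real worker node you would point the same `grep` at `/etc/containerd/config.toml` (path may vary by distro):

```shell
# Fabricated sample of the containerd CRI stanza (illustration only).
cat > /tmp/containerd-sample.toml <<'EOF'
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
EOF

# Same check to run against /etc/containerd/config.toml on a worker node.
if grep -q 'default_runtime_name = "nvidia"' /tmp/containerd-sample.toml; then
  echo "nvidia is the default runtime"
else
  echo "default runtime is NOT nvidia"
fi
```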
|
## Configuration

The plugin uses the following configuration:

- **Image**: `nvcr.io/nvidia/k8s-device-plugin:v0.17.1`
- **Namespace**: `kube-system`
- **Priority Class**: `system-node-critical`
- **Tolerations**: Schedules on nodes with the `nvidia.com/gpu` taint
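For reference, the toleration that lets the DaemonSet schedule onto tainted GPU nodes conventionally takes the form below; this is a sketch of the common pattern, and the deployed manifest may differ:

```yaml
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```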
|
## References

- [Official NVIDIA Device Plugin Repository](https://github.com/NVIDIA/k8s-device-plugin)
- [Kubernetes GPU Scheduling Documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/)
- [NVIDIA Container Toolkit Documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/)