Passive: GPU Health Monitoring
- ✓GPUs falling off the bus monitoring (XID 79, NVML_ERROR_GPU_IS_LOST)
Reference: NVIDIA XID errors
dmesg | grep -i 'xid\|nvidia'
- ✓GPU and CPU memory ECC errors (SBE/DBE volatile and aggregate)
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv
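The XID check above can be made scriptable by pulling the XID number out of each kernel log line. This is a minimal sketch, not a definitive implementation: the function names `extract_xids` and `has_fallen_off_bus` are hypothetical, and the line shape shown in the comment is an assumption about how the NVIDIA driver logs XID events on the node.

```shell
# Extract "<pci-address> <xid>" pairs from kernel log lines on stdin.
# Assumed line shape (may vary by driver version):
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
extract_xids() {
  sed -n 's/.*NVRM: Xid (PCI:\([^)]*\)): \([0-9][0-9]*\).*/\1 \2/p'
}

# Exit 0 when any line on stdin reports XID 79, exit 1 otherwise.
has_fallen_off_bus() {
  extract_xids | awk '$2 == 79 { found = 1 } END { exit !found }'
}

# Usage on a live node (reading the kernel log may require root):
#   dmesg | extract_xids | sort | uniq -c
#   dmesg | has_fallen_off_bus && echo "XID 79 seen: GPU fell off the bus"
```

Both functions only read stdin, so they can be exercised against saved logs before being wired into an alerting pipeline.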
Passive: Network and Hardware Monitoring
- ✓PCIe errors via NVML and DCGM counters (replay/error thresholds)
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
- ✓Ethernet and InfiniBand link flaps (ethtool, ibdiagnet, ibportstate)
ibstat && ibdiagnet 2>/dev/null || echo 'ibdiagnet not installed'
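The PCIe link query above becomes easier to act on when the current link generation and width are compared against the maximums the GPU reports. The sketch below assumes the additional `pcie.link.gen.max` and `pcie.link.width.max` query fields and a hypothetical function name, `find_degraded_pcie_links`. It does not cover replay or error counters. The link generation may drop while a GPU is idle, so a mismatch is best confirmed under load.

```shell
# Read CSV rows "index, gen.current, gen.max, width.current, width.max"
# on stdin and print one line per GPU whose link is below its maximum.
find_degraded_pcie_links() {
  awk -F', *' '
    $2 + 0 < $3 + 0 || $4 + 0 < $5 + 0 {
      printf "GPU %s: gen %s/%s, width x%s/x%s\n", $1, $2, $3, $4, $5
    }'
}

# Usage on a live node:
#   nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max \
#     --format=csv,noheader | find_degraded_pcie_links
```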
Passive: Error Detection and Performance
- ✓Stalled NCCL/RCCL job detection (GPU_UTIL vs power consumption)
nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv -l 1
- ✓InfiniBand health (PKey consistency, link specs, error counters)
bash tests/k8s/security/ib-pkey-validation.sh
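The stalled-job check above compares GPU utilization with power draw: a GPU that reports high utilization while drawing little power may be spinning in a collective that never completes. A minimal sketch of that comparison follows. The function name `find_stalled_gpus` and both default thresholds (90% utilization, 150 W) are assumptions that would need tuning per GPU model.

```shell
# Read CSV rows "index, utilization.gpu, power.draw" (no header, no units)
# on stdin and print GPUs that look busy but draw little power.
# $1: minimum utilization in percent (assumed default: 90)
# $2: power ceiling in watts (assumed default: 150)
find_stalled_gpus() {
  min_util=${1:-90}
  max_power=${2:-150}
  awk -F', *' -v u="$min_util" -v p="$max_power" '
    $2 + 0 >= u + 0 && $3 + 0 < p + 0 {
      printf "GPU %s: util %s%%, power %s W (possible stall)\n", $1, $2, $3
    }'
}

# Usage on a live node:
#   nvidia-smi --query-gpu=index,utilization.gpu,power.draw \
#     --format=csv,noheader,nounits | find_stalled_gpus 90 150
```

A single sample can be misleading, so a real detector would likely require the condition to hold across several consecutive samples before alerting.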
Active: GPU Diagnostics and Performance
- ✓DtoH and HtoD bandwidth testing for PCIe performance validation
sbatch tests/slurm/compute/nvbandwidth/nvbandwidth.sbatch
- ✓gpu-burn/gpu-fryer for validating GPU under load
Reference: gpu-burn
Active: Communication and Network Testing
- ✓Local NCCL all-reduce tests for NVLink/NVSwitch/NVLS performance
Reference: nccl-tests
sbatch tests/slurm/networking/nccl/2node-run-nccl.sbatch
- ✓Local InfiniBand all-reduce test (with NCCL_P2P_DISABLE=1)
NCCL_P2P_DISABLE=1 sbatch tests/slurm/networking/nccl/4node-run-nccl.sbatch
- ✓Pairwise GPU ib_write_bw and ib_write_lat bidirectional tests
Reference: RDMA perftest
sbatch tests/slurm/networking/nccl/ib-perftest-cuda.sbatch
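The NCCL runs above only become a health check once their result is compared against an expected value. The sketch below reads the `# Avg bus bandwidth` summary line that nccl-tests prints at the end of a run and compares it with a minimum. The function names `avg_busbw` and `check_busbw` are hypothetical, and the sketch assumes the job output reaches the script unmodified (no per-rank prefixes added by the launcher).

```shell
# Print the average bus bandwidth (GB/s) from nccl-tests output on stdin.
avg_busbw() {
  awk -F':' '/^# Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# check_busbw <min_gbps>: exit 0 if the reported average meets the minimum,
# 1 if it falls short, 2 if no summary line was found.
check_busbw() {
  min=$1
  bw=$(avg_busbw)
  [ -n "$bw" ] || { echo "no bus bandwidth line found" >&2; return 2; }
  awk -v bw="$bw" -v min="$min" 'BEGIN { exit !(bw + 0 >= min + 0) }'
}

# Usage (the log file name is a placeholder; the threshold depends on
# the fabric and node count):
#   check_busbw 150 < nccl-output.log || echo "all-reduce bandwidth below expectation"
```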
Active: Hardware Validation and AI Workload Testing
- ✓NVIDIA TinyMeg2 for hardware correctness and SDC-free validation
- ✓Megatron or TorchTitan tests for TFLOP/s/GPU performance and loss convergence
Reference: TorchTitan
kubectl apply -f tests/k8s/training/kueue/00-torchtitan.yaml
Automation
- ✓Weekly scheduled active health checks on idle nodes
- ✓NCCL and scheduler topology health validation (Slurm topology.yaml/conf; K8s topology-aware scheduling, gang scheduling, bin packing)
scontrol show topology 2>/dev/null || kubectl get topologies.kueue.x-k8s.io,hypernodes -A 2>/dev/null
- ✓Kubernetes node health check and problem detector integration
Reference: Node Problem Detector
- ✓NVLink connectivity and error tracking (critical for NVL72)
nvidia-smi nvlink -s && nvidia-smi nvlink -e
- ✓Automated node draining and replacement for failed health checks
- ✓AI/ML-based prediction of failures
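Automated draining on a failed health check can be sketched as a small wrapper around any check command. The function names `run`, `drain_node` and `check_and_drain` and the `DRY_RUN` variable are hypothetical. The sketch covers draining only, not node replacement, and defaults to printing the drain command rather than executing it.

```shell
# Run a command, or just print it when DRY_RUN=1 (the default here).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# drain_node <slurm|k8s> <node> <reason>
drain_node() {
  case "$1" in
    slurm) run scontrol update "NodeName=$2" State=DRAIN "Reason=$3" ;;
    k8s)   run kubectl drain "$2" --ignore-daemonsets --delete-emptydir-data ;;
    *)     echo "unknown scheduler: $1" >&2; return 2 ;;
  esac
}

# check_and_drain <slurm|k8s> <node> <check command...>
# Runs the check; drains the node only if the check exits non-zero.
check_and_drain() {
  sched=$1; node=$2; shift 2
  if "$@" >/dev/null 2>&1; then
    echo "$node: healthy"
  else
    drain_node "$sched" "$node" "health check failed: $1"
  fi
}

# Usage (check script path is a placeholder):
#   DRY_RUN=0 check_and_drain slurm node01 /path/to/health-check.sh
```

Keeping the dry-run default means a misconfigured check prints the drain command instead of removing capacity from the cluster.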
General Expectations
- ✓Console, dashboard, CLI and/or API available to manage resources
- ✓24x7 support availability
- ✓Process for security fixes and upgrades exists, proactive notifications are clear
- ✓Integration with comprehensive monitoring and alerting systems