Health Checks

Proactive health monitoring catches issues before they impact workloads. It combines active diagnostic testing with passive continuous monitoring, and it automatically remediates common problems.

Passive: GPU Health Monitoring

  • DCGM background health checks enabled; see DCGM diagnostics
    dcgmi health -c -j
  • Monitoring for GPUs falling off the bus (XID 79, NVML_ERROR_GPU_IS_LOST); see NVIDIA XID errors
    dmesg | grep -i 'xid\|nvidia'
  • GPU and CPU memory ECC errors (SBE/DBE volatile and aggregate)
    nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv
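
The ECC query above returns raw counters; a passive monitor still has to decide which values matter. The following is a minimal sketch of that step, assuming the query is extended with the index field and run with --format=csv,noheader. The function name check_ecc and its exit-code convention are illustrative, not part of any NVIDIA tool.

```shell
#!/usr/bin/env bash
# check_ecc (hypothetical helper): reads "index, corrected, uncorrected"
# CSV rows on stdin and reports GPUs with uncorrected (DBE) volatile errors.
# Returns 0 when no GPU reports uncorrected errors, 1 otherwise.
check_ecc() {
  local status=0 idx sbe dbe
  while IFS=',' read -r idx sbe dbe || [ -n "$idx" ]; do
    # Strip the spaces nvidia-smi prints after each comma.
    idx="${idx// /}"; sbe="${sbe// /}"; dbe="${dbe// /}"
    [ -z "$idx" ] && continue
    # Skip non-numeric values such as "[N/A]" (ECC unsupported or disabled).
    case "$dbe" in ''|*[!0-9]*) continue ;; esac
    if [ "$dbe" -gt 0 ]; then
      echo "GPU $idx: $dbe uncorrected ECC errors"
      status=1
    fi
  done
  return "$status"
}
```

Possible usage on a GPU node:
    nvidia-smi --query-gpu=index,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv,noheader | check_ecc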

Passive: Network and Hardware Monitoring

  • PCIe errors via NVML and DCGM counters (replay/error thresholds)
    nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
  • Ethernet and InfiniBand link flaps (ethtool, ibdiagnet, ibportstate)
    ibstat; if command -v ibdiagnet >/dev/null; then ibdiagnet; else echo 'ibdiagnet not installed'; fi
  • GPU temperature monitoring (DCGM_FI_DEV_GPU_TEMP); see DCGM field IDs
    dcgmi dmon -e 150 -d 1000
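
The PCIe link query above only prints the current state; to turn it into a check, the current link width can be compared with the maximum the GPU supports. This is a rough sketch, assuming the query is run with index, pcie.link.width.current and pcie.link.width.max and --format=csv,noheader. The function name check_pcie_width is hypothetical. Link generation is deliberately not compared, because it may be reduced while a GPU is idle.

```shell
#!/usr/bin/env bash
# check_pcie_width (hypothetical helper): reads "index, width.current, width.max"
# CSV rows on stdin and reports GPUs whose PCIe link trained below full width.
# Returns 0 when every link is at full width, 1 otherwise.
check_pcie_width() {
  local status=0 idx cur max
  while IFS=',' read -r idx cur max || [ -n "$idx" ]; do
    # Strip the spaces nvidia-smi prints after each comma.
    idx="${idx// /}"; cur="${cur// /}"; max="${max// /}"
    [ -z "$idx" ] && continue
    if [ "$cur" != "$max" ]; then
      echo "GPU $idx: PCIe link width x$cur, expected x$max"
      status=1
    fi
  done
  return "$status"
}
```

Possible usage on a GPU node:
    nvidia-smi --query-gpu=index,pcie.link.width.current,pcie.link.width.max --format=csv,noheader | check_pcie_width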

Passive: Error Detection and Performance

  • Uncorrectable NVIDIA XID and SXID error code monitoring; see NVIDIA XID errors
    dmesg | grep -i 'xid\|sxid'
  • Stalled NCCL/RCCL job detection (GPU_UTIL vs power consumption)
    nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv -l 1
  • InfiniBand health (PKey consistency, link specs, error counters)
    bash tests/k8s/security/ib-pkey-validation.sh
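
The stalled-job item above rests on one observation: a GPU that reports high utilization while drawing little power is often spinning in a collective that never completes. Here is a minimal sketch of that comparison, assuming the nvidia-smi query is run with index, utilization.gpu and power.draw and --format=csv,noheader,nounits. The function name find_stalled_gpus and both default thresholds (95% utilization, 150 W) are placeholders that would need tuning per GPU model.

```shell
#!/usr/bin/env bash
# find_stalled_gpus (hypothetical helper): reads "index, util_pct, power_watts"
# CSV rows on stdin and flags GPUs with high utilization but low power draw.
# $1 = minimum utilization in percent (assumed default 95)
# $2 = maximum power in watts (assumed default 150)
# Returns 1 when at least one GPU looks stalled, 0 otherwise.
find_stalled_gpus() {
  local min_util="${1:-95}" max_power="${2:-150}"
  awk -F',' -v u="$min_util" -v p="$max_power" '
    # Only consider rows whose numeric fields are present (skips "[N/A]").
    $2 ~ /[0-9]/ && $3 ~ /[0-9]/ && $2 + 0 >= u && $3 + 0 < p {
      gsub(/ /, "", $1)
      printf "GPU %s: util %d%% at %.1f W, possible stall\n", $1, $2, $3
      found = 1
    }
    END { exit (found ? 1 : 0) }'
}
```

Possible usage on a GPU node:
    nvidia-smi --query-gpu=index,utilization.gpu,power.draw --format=csv,noheader,nounits | find_stalled_gpus 95 150

A single sample is noisy, so a real monitor would likely require the condition to hold across several consecutive samples before alerting.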

Active: GPU Diagnostics and Performance

  • NVIDIA DCGM diag level 3 with Extensive Testing (EUD); see DCGM diagnostics
    dcgmi diag -r 3
  • DtoH and HtoD bandwidth testing for PCIe performance validation
    sbatch tests/slurm/compute/nvbandwidth/nvbandwidth.sbatch
  • gpu-burn/gpu-fryer for validating GPUs under load; see gpu-burn

Active: Communication and Network Testing

  • Local NCCL all reduce tests for NVLink/NVSwitch/NVLS performance; see nccl-tests
    sbatch tests/slurm/networking/nccl/2node-run-nccl.sbatch
  • Local InfiniBand all reduce test (with NCCL_P2P_DISABLE=1)
    NCCL_P2P_DISABLE=1 sbatch tests/slurm/networking/nccl/4node-run-nccl.sbatch
  • Pairwise GPU ib_write_bw and ib_write_lat bidirectional tests; see RDMA perftest
    sbatch tests/slurm/networking/nccl/ib-perftest-cuda.sbatch
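
An active NCCL test is only useful as a health check if its result is compared with an expected value. The sketch below extracts the "Avg bus bandwidth" summary line that nccl-tests binaries print at the end of a run and compares it with a minimum. The function name check_nccl_busbw is hypothetical, and the threshold is an assumption that depends on the interconnect and message sizes under test.

```shell
#!/usr/bin/env bash
# check_nccl_busbw (hypothetical helper): reads nccl-tests output on stdin and
# compares the reported average bus bandwidth (GB/s) with a minimum.
# $1 = minimum acceptable bandwidth in GB/s (site-specific assumption)
# Returns 0 on pass, 1 on fail, 2 if no summary line was found.
check_nccl_busbw() {
  local min_bw="$1" bw
  # The summary line looks like: "# Avg bus bandwidth    : 187.234"
  bw="$(awk -F':' '/Avg bus bandwidth/ { gsub(/ /, "", $2); print $2 }' | tail -n 1)"
  if [ -z "$bw" ]; then
    echo "no 'Avg bus bandwidth' line found"
    return 2
  fi
  # Use awk for the comparison because the values are floating point.
  if awk -v bw="$bw" -v min="$min_bw" 'BEGIN { exit !(bw + 0 >= min + 0) }'; then
    echo "PASS: avg bus bandwidth $bw GB/s >= $min_bw GB/s"
  else
    echo "FAIL: avg bus bandwidth $bw GB/s < $min_bw GB/s"
    return 1
  fi
}
```

Possible usage, where nccl.out stands for whatever file the job wrote its output to:
    check_nccl_busbw 150 < nccl.out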

Active: Hardware Validation and AI Workload Testing

  • NVIDIA TinyMeg2 for hardware correctness and SDC-free validation
  • Megatron or TorchTitan tests for TFLOP/s/GPU performance and loss convergence; see TorchTitan
    kubectl apply -f tests/k8s/training/kueue/00-torchtitan.yaml

Automation

  • Weekly scheduled active health checks on idle nodes
  • NCCL and scheduler topology health validation (Slurm topology.yaml/conf; K8s topology-aware scheduling, gang scheduling, bin packing)
    scontrol show topology 2>/dev/null || kubectl get topologies.kueue.x-k8s.io,hypernodes -A 2>/dev/null
  • Kubernetes node health check and problem detector integration; see Node Problem Detector
  • NVLink connectivity and error tracking (critical for NVL72)
    nvidia-smi nvlink -s && nvidia-smi nvlink -e
  • Automated node draining and replacement for failed health checks
  • AI/ML-based prediction of failures
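
Automated draining can be sketched as a small dispatcher that maps a failed check to the scheduler's own drain command. The example below only prints the command, so it can be reviewed or logged before anything is executed. The function name drain_cmd, the scheduler labels and the default reason string are illustrative assumptions; the scontrol and kubectl invocations are standard, but real automation would also need rate limiting and a replacement workflow.

```shell
#!/usr/bin/env bash
# drain_cmd (hypothetical helper): prints the command that would drain a node.
# $1 = scheduler ("slurm" or "k8s"), $2 = node name, $3 = optional reason
# Returns 2 for an unknown scheduler.
drain_cmd() {
  local scheduler="$1" node="$2" reason="${3:-failed health check}"
  case "$scheduler" in
    slurm)
      # Slurm: mark the node DRAIN so running jobs finish but no new ones start.
      printf 'scontrol update NodeName=%s State=DRAIN Reason="%s"\n' "$node" "$reason"
      ;;
    k8s)
      # Kubernetes: drain cordons the node and evicts its pods.
      printf 'kubectl drain %s --ignore-daemonsets --delete-emptydir-data\n' "$node"
      ;;
    *)
      echo "unknown scheduler: $scheduler" >&2
      return 2
      ;;
  esac
}
```

Possible usage, with gpu-node-01 as a placeholder node name:
    drain_cmd slurm gpu-node-01 "ECC DBE"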

General Expectations

  • Console, dashboard, CLI and/or API available to manage resources
  • 24x7 support availability
  • A process for security fixes and upgrades exists, and proactive notifications are clear
  • Integration with comprehensive monitoring and alerting systems
