Name: ClusterMAX GPU Cloud Ratings
License: https://semianalysis.com

Passive: GPU Health Monitoring

✓
DCGM background health checks enabled [reference: DCGM diagnostics]
dcgmi health -c -j
✓
Monitoring for GPUs falling off the bus (XID 79, NVML_ERROR_GPU_IS_LOST) [reference: NVIDIA XID errors]
dmesg | grep -i 'xid\|nvidia'
✓
GPU and CPU memory ECC errors (SBE/DBE volatile and aggregate)
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv
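
The ECC query above only prints counters; something still has to decide what counts as a failure. The following is a minimal sketch of such a check, assuming the input is nvidia-smi CSV with the GPU index prepended and headers and units suppressed. The function name check_ecc and the default corrected-error threshold of 100 are hypothetical, not vendor guidance.

```shell
# Reads lines of "index, corrected, uncorrected" on stdin, e.g. from:
#   nvidia-smi --query-gpu=index,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total \
#              --format=csv,noheader,nounits
# Prints one status line per GPU and returns non-zero if any GPU reports an
# uncorrected (double-bit) error. $1 is an optional, hypothetical threshold
# for corrected (single-bit) errors.
check_ecc() {
  awk -F', *' -v max_sbe="${1:-100}" '
    # Non-numeric values such as "[N/A]" mean ECC is unsupported or disabled.
    $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ { printf "GPU %s: ECC not reported\n", $1; next }
    $3 + 0 > 0           { printf "GPU %s: FAIL uncorrected=%d\n", $1, $3; bad = 1; next }
    $2 + 0 > max_sbe + 0 { printf "GPU %s: WARN corrected=%d\n", $1, $2; next }
                         { printf "GPU %s: OK\n", $1 }
    END { exit bad }
  '
}
```

Because the function only reads stdin, it can be piped from the live nvidia-smi command or from a saved log, and its exit status can gate a scheduler prolog or a cron job.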

Passive: Network and Hardware Monitoring

✓
PCIe errors via NVML and DCGM counters (replay/error thresholds)
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
✓
Ethernet and InfiniBand link flaps (ethtool, ibdiagnet, ibportstate)
ibstat; if command -v ibdiagnet >/dev/null 2>&1; then ibdiagnet 2>/dev/null; else echo 'ibdiagnet not installed'; fi
✓
GPU temperature monitoring (DCGM_FI_DEV_GPU_TEMP) [reference: DCGM field IDs]
dcgmi dmon -e 150 -d 1000
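
The PCIe link query in this group reports the current generation and width, which is only meaningful when compared against the maximum the GPU supports. A minimal sketch of that comparison follows, assuming nvidia-smi CSV input with the index and the max fields added. The function name check_pcie_link is hypothetical. Note that the current link generation may be reduced while a GPU is idle, so this check is best run under load.

```shell
# Reads lines of "index, gen.current, gen.max, width.current, width.max", e.g. from:
#   nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max \
#              --format=csv,noheader
# Flags any GPU whose link runs below its maximum generation or width and
# returns non-zero if at least one link is degraded.
check_pcie_link() {
  awk -F', *' '
    $2 != $3 || $4 != $5 {
      printf "GPU %s: DEGRADED gen %s/%s width x%s/x%s\n", $1, $2, $3, $4, $5
      bad = 1; next
    }
    { printf "GPU %s: OK gen %s width x%s\n", $1, $2, $4 }
    END { exit bad }
  '
}
```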

✓
Uncorrectable NVIDIA XID and SXID error code monitoring [reference: NVIDIA XID errors]
dmesg | grep -i 'xid\|sxid'
✓
Stalled NCCL/RCCL job detection (GPU_UTIL vs power consumption)
nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv -l 1
✓
InfiniBand health (PKey consistency, link specs, error counters)
bash tests/k8s/security/ib-pkey-validation.sh
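
The stalled-job item above relies on a mismatch: a hung collective can leave GPUs reporting high utilization while drawing far less power than real work would. A minimal sketch of that heuristic follows, assuming nvidia-smi CSV input with the index added. The function name find_stalled_gpus and both default thresholds (90% utilization, 150 W) are hypothetical and would need tuning per GPU model.

```shell
# Reads lines of "index, utilization.gpu, power.draw" on stdin, e.g. from:
#   nvidia-smi --query-gpu=index,utilization.gpu,power.draw --format=csv,noheader,nounits
# Flags GPUs that look busy but draw little power.
#   $1 = minimum utilization in percent (hypothetical default: 90)
#   $2 = maximum power in watts still treated as "idle-like" (hypothetical default: 150)
# Returns non-zero if any GPU matches.
find_stalled_gpus() {
  awk -F', *' -v min_util="${1:-90}" -v max_watts="${2:-150}" '
    $2 + 0 >= min_util + 0 && $3 + 0 < max_watts + 0 {
      printf "GPU %s: possible stall (util=%s%%, power=%sW)\n", $1, $2, $3
      found = 1
    }
    END { exit found }
  '
}
```

A single sample can misfire during startup or checkpointing, so a real monitor would likely require the condition to hold across several consecutive samples before alerting.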

✓
NVIDIA DCGM diag level 3 with Extended Utility Diagnostics (EUD) [reference: DCGM diagnostics]
dcgmi diag -r 3
✓
DtoH and HtoD bandwidth testing for PCIe performance validation
sbatch tests/slurm/compute/nvbandwidth/nvbandwidth.sbatch
✓
gpu-burn/gpu-fryer for validating GPUs under load [reference: gpu-burn]
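
The gpu-burn item has no command attached, so here is a minimal sketch of how its result could be turned into a pass/fail signal. It assumes the run ends with a summary of per-GPU lines of the form "GPU N: OK" or "GPU N: FAULTY". That format, and the function name summarize_gpu_burn, are assumptions to verify against the installed build.

```shell
# Reads gpu-burn output on stdin and summarizes the per-GPU verdict lines.
# Assumed usage (binary name and duration argument in seconds):
#   ./gpu_burn 300 | summarize_gpu_burn
# Returns non-zero if any GPU is reported FAULTY or if no verdict lines are found.
summarize_gpu_burn() {
  awk '
    /GPU [0-9]+: (OK|FAULTY)/ {
      total++
      # The second-to-last field looks like "1:"; adding 0 keeps only the number.
      if ($NF == "FAULTY") { faulty++; printf "GPU %d: FAULTY\n", $(NF-1) + 0 }
    }
    END {
      printf "%d tested, %d faulty\n", total, faulty
      exit (faulty > 0 || total == 0)
    }
  '
}
```

Treating "no verdict lines" as a failure is a deliberate choice here: a burn run that crashed before printing its summary should not pass silently.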

✓
Local NCCL all reduce tests for NVLink/NVSwitch/NVLS performance [reference: nccl-tests]
sbatch tests/slurm/networking/nccl/2node-run-nccl.sbatch
✓
Local InfiniBand all reduce test (with NCCL_P2P_DISABLE=1)
NCCL_P2P_DISABLE=1 sbatch tests/slurm/networking/nccl/4node-run-nccl.sbatch
✓
Pairwise GPU ib_write_bw and ib_write_lat bidirectional tests [reference: RDMA perftest]
sbatch tests/slurm/networking/nccl/ib-perftest-cuda.sbatch
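
Pairwise testing means every node is measured against every other node, not just its neighbor. A minimal sketch of generating that schedule follows. The function name host_pairs is hypothetical, and how each pair is then launched (server side versus client side) is left to the sbatch script referenced above.

```shell
# Reads hostnames on stdin (whitespace or newline separated) and prints every
# unordered pair, one "hostA hostB" pair per line. Inside a Slurm allocation
# the input could come from:
#   scontrol show hostnames "$SLURM_JOB_NODELIST" | host_pairs
host_pairs() {
  awk '
    { for (i = 1; i <= NF; i++) hosts[++n] = $i }
    END {
      for (a = 1; a <= n; a++)
        for (b = a + 1; b <= n; b++)
          print hosts[a], hosts[b]
    }
  '
}
```

For n nodes this yields n(n-1)/2 pairs, so on large clusters a sampled or round-based schedule may be more practical than the full set.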

✓
NVIDIA TinyMeg2 for hardware correctness and SDC-free validation
✓
Megatron or TorchTitan tests for TFLOP/s/GPU performance and loss convergence [reference: TorchTitan]
kubectl apply -f tests/k8s/training/kueue/00-torchtitan.yaml
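
When a training framework reports only tokens per second, TFLOP/s per GPU can be estimated by hand. The sketch below uses the common approximation of 6 FLOPs per parameter per token for a dense transformer (forward plus backward, ignoring attention terms). That approximation, and the function name tflops_per_gpu, are assumptions rather than anything the tests above prescribe. Frameworks that print their own TFLOP/s figure may count differently.

```shell
# Rough TFLOP/s per GPU for a dense transformer training run.
#   $1 = parameter count
#   $2 = tokens per second across the whole job
#   $3 = number of GPUs
# Uses the approximation FLOPs per token = 6 * parameters.
tflops_per_gpu() {
  awk -v params="$1" -v tps="$2" -v gpus="$3" \
    'BEGIN { printf "%.1f\n", 6 * params * tps / gpus / 1e12 }'
}
```

For example, an 8-billion-parameter model at 500,000 tokens/s on 64 GPUs works out to 6 x 8e9 x 5e5 / 64 / 1e12 = 375.0 TFLOP/s per GPU; these input numbers are illustrative only.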

✓
Weekly scheduled active health checks on idle nodes
✓
NCCL and scheduler topology health validation (Slurm topology.yaml/conf; K8s topology-aware scheduling, gang scheduling, bin packing)
scontrol show topology 2>/dev/null || kubectl get topologies.kueue.x-k8s.io,hypernodes -A 2>/dev/null
✓
Kubernetes node health check and problem detector integration [reference: Node Problem Detector]
✓
NVLink connectivity and error tracking (critical for NVL72)
nvidia-smi nvlink -s && nvidia-smi nvlink -e
✓
Automated node draining and replacement for failed health checks
✓
AI/ML-based prediction of failures
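
Automated draining ultimately comes down to issuing one scheduler command per failed node. The following is a minimal dry-run sketch that prints the command instead of executing it, so it can be reviewed or logged first. The function name drain_commands and the "slurm"/"k8s" selector are hypothetical, and node replacement is out of scope here.

```shell
# Prints (does not run) the command that would drain a node.
#   $1 = node name
#   $2 = human-readable reason, e.g. the failed health check
#   $3 = scheduler: "slurm" (default) or "k8s"
drain_commands() {
  node="$1"; reason="$2"; scheduler="${3:-slurm}"
  case "$scheduler" in
    slurm)
      printf 'scontrol update NodeName=%s State=DRAIN Reason="%s"\n' "$node" "$reason" ;;
    k8s)
      # kubectl drain also cordons the node, so no separate cordon step is printed.
      printf 'kubectl drain %s --ignore-daemonsets --delete-emptydir-data\n' "$node" ;;
    *)
      echo "unknown scheduler: $scheduler" >&2; return 2 ;;
  esac
}
```

A health-check wrapper could call this when any check returns non-zero and pipe the output to a shell only after an operator or policy engine approves it.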

✓
Console, dashboard, CLI and/or API available to manage resources
✓
24x7 support availability
✓
Process for security fixes and upgrades exists, proactive notifications are clear
✓
Integration with comprehensive monitoring and alerting systems