Reliability

Uptime guarantees, fault tolerance mechanisms, and disaster recovery capabilities.

Key Requirements

  • Network stability assessment
  • Master Service Agreement (MSA) SLA evaluation (99%, 99.9% uptime, etc.)
  • Link flap monitoring and prevention
  • Detection of GPUs falling off the bus
  • PCIe error monitoring
  • Ethernet and InfiniBand event monitoring (link flaps)
  • Thermal monitoring (GPU temperature)
  • GPU and CPU memory stats (ECC error rate)
  • NVIDIA XID and SXID error code detection
  • InfiniBand health monitoring (link status, error counters, PKey consistency)
  • NCCL and Slurm topology health
  • Kubernetes node health checks
  • NVLink connectivity and error tracking (critical for NVL72)
  • Driver and core library version consistency across nodes
  • ECC error detection
  • Temperature monitoring and throttling alerts
  • Power monitoring and utilization tracking
  • NVIDIA XID/SXID error detection (through DCGM)
  • PCIe bus and power state health
  • IPMI exporter and fan speed monitoring
  • InfiniBand link status validation (ibstat)
  • Error counter monitoring (retries, dropped packets)
  • Partition Key (PKey) consistency across nodes
  • NCCL operation health tracking
  • Automated node draining and replacement
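To make the SLA percentages in the list above concrete, the short sketch below converts an availability target into a downtime budget. The function name and the choice of a 365-day year are illustrative assumptions, not part of any provider's contract terms; a real MSA may measure availability per month and exclude scheduled maintenance.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget, in hours, implied by an availability target.

    availability_pct is a percentage such as 99.9; period_hours is the
    length of the measurement window (hypothetical default: one year).
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Compare the targets mentioned in the requirements list.
    for target in (99.0, 99.9):
        hours = allowed_downtime_hours(target)
        print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions, a 99% target permits roughly 87.6 hours of downtime per year, while 99.9% permits roughly 8.76 hours, which is why the extra "nine" matters for long-running training jobs.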

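The version-consistency requirement in the list above can be illustrated with a minimal sketch. It assumes the driver or library version strings have already been collected from each node by some other means (for example a fleet inventory tool); the function name, node names, and version strings here are all hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def find_version_outliers(
    versions_by_node: Dict[str, str],
) -> Tuple[Optional[str], Dict[str, str]]:
    """Return the most common version and the nodes that deviate from it.

    versions_by_node maps a node name to the version string reported by
    that node. If two versions are equally common, the one seen first in
    the mapping is treated as the majority.
    """
    if not versions_by_node:
        return None, {}
    counts = Counter(versions_by_node.values())
    majority_version, _ = counts.most_common(1)[0]
    outliers = {
        node: version
        for node, version in versions_by_node.items()
        if version != majority_version
    }
    return majority_version, outliers


if __name__ == "__main__":
    # Hypothetical inventory; real data would come from the cluster itself.
    inventory = {
        "node-a": "1.2.3",
        "node-b": "1.2.3",
        "node-c": "1.2.1",
    }
    majority, outliers = find_version_outliers(inventory)
    print(f"majority version: {majority}")
    for node, version in outliers.items():
        print(f"{node} reports {version}; candidate for draining or reimaging")
```

Treating the majority version as the reference is a design choice for the sketch; a production check would more likely compare against an explicitly pinned version.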