Name: ClusterMAX GPU Cloud Ratings
License: https://semianalysis.com

Cluster Overview

✓
Number of nodes, pods, workloads running
kubectl get nodes && kubectl get pods --all-namespaces | wc -l
✓
Current node health status
kubectl get nodes -o wide
✓
Resource utilization (CPU, memory, disk, network) at cluster/node/namespace/pod levels
kubectl top nodes && kubectl top pods --all-namespaces
✓
Trends and summaries
✓
Workload stability (restarts, pending pods, scaling events)
✓
Control plane health (API server latency, etcd performance, controller manager and scheduler metrics)

Resource Management Integration

✓
Slurm integration with job stats by user, type, etc.Slurm Accounting
sacct --format=JobID,User,Partition,AllocGRES,State,Elapsed --starttime=now-7days
✓
Kubernetes integration with kube-state-metrics, node-exporter, dcgm-exporter, cAdvisordcgm-exporter
kubectl get pods -n monitoring -l app=dcgm-exporter
✓
Resource quotas with limits and usage of GPUs for users and groups
✓
Real-time GPU node availability status
✓
Scheduler resource contention monitoring

✓
kube-prometheus-stack for Kubernetes metrics (node state, resource usage, control plane, workload performance)kube-prometheus-stack
helm list -n monitoring
✓
NVIDIA DCGM integration for comprehensive GPU monitoring or AMD equivalentDCGM docs
dcgmi discovery -l
✓
KV cache usage for horizontal pod autoscaling with serving (gpu_cache_usage_perc)
✓
Integrated alert management and notification systems

✓
Node-level power draw with real-time power consumption monitoring
nvidia-smi --query-gpu=power.draw --format=csv
✓
Fan speed monitoring for cooling system performance and status
nvidia-smi --query-gpu=fan.speed --format=csv
✓
Temperature monitoring (CPU, RAM, NIC, transceiver, other critical components)
nvidia-smi --query-gpu=temperature.gpu --format=csv
✓
PCIe AER rates for Advanced Error Reporting on PCIe bus health
✓
dmesg logs monitoring via Promtail for system-level message capturePromtail docs
✓
GB200 NVL72 specialized monitoring requirements

✓
TFLOPs estimation via DCGM_FI_PROF_PIPE_TENSOR_ACTIVEDCGM field IDs
dcgmi dmon -e 1004 -d 1000
✓
SM Active monitoring via DCGM_FI_PROF_SM_ACTIVE
dcgmi dmon -e 1002 -d 1000
✓
SM Occupancy monitoring via DCGM_FI_PROF_SM_OCCUPANCY
dcgmi dmon -e 1003 -d 1000
✓
NCU profiling with NVIDIA Nsight Compute available without sudoNsight Compute
ncu --version

✓
NVLink/XGMI throughput for inter-GPU communication bandwidth
nvidia-smi nvlink -s
✓
PCIe throughput for host CPU to GPU data transfer rates
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
✓
InfiniBand/RoCEv2 throughput for high-speed network performance between nodesRDMA perftest
ibstat && perftest ib_write_bw

✓
Console, dashboard, CLI and/or API available to manage resources
✓
24x7 support availability
✓
Process for security fixes and upgrades exists, proactive notifications are clear
✓
Integration with active and passive health check systems