Name: ClusterMAX GPU Cloud Ratings
License: https://semianalysis.com

Access

✓
Kubeconfig available as a simple download, or via a properly configured login node (Kubeconfig docs)
kubectl cluster-info
✓
RBAC options with remote SSO provider integration (K8s RBAC docs)
✓
Helm access available without custom external authentication (Helm docs)
helm version && helm repo list
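
The verification commands in this checklist can be wrapped in a small pass/fail harness so that a full run produces one line per item. The sketch below is only an illustration: `run_check` is a hypothetical helper name, not part of any tool referenced here, and it treats a zero exit status as a pass, which is an assumption that may be too lenient for commands that succeed while reporting unhealthy pods.

```shell
# run_check NAME COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND quietly and prints PASS or FAIL for NAME.
# Assumption: exit status 0 means the checklist item is satisfied.
run_check() {
  name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    return 1
  fi
}

# Example invocations for the Access items (commented out so the block
# has no side effects when sourced on a machine without cluster access):
#   run_check "kubeconfig works" kubectl cluster-info
#   run_check "helm access"      sh -c 'helm version && helm repo list'
```

Compound commands such as `helm version && helm repo list` need to go through `sh -c`, since the helper executes a single command with its arguments.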

Configuration

✓
GPU Operator installed and configured (GPU Operator docs)
kubectl get pods -n gpu-operator
✓
Network Operator installed and configured (Network Operator docs)
kubectl get pods -n nvidia-network-operator
✓
MPI Operator deployed or easy to install (MPI Operator)
kubectl get crd | grep mpijob
✓
CSI provider with ReadWriteMany support (K8s PV Access Modes)
kubectl get storageclass
✓
Default StorageClass configured and functional, PVCs provision without hanging
kubectl get storageclass -o wide
✓
Host path storage option available for high performance and caching
✓
Ingress/egress to cluster properly configured
✓
MetalLB or external load balancer available, public IPs assignable (MetalLB docs)
kubectl get pods -n metallb-system
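
For the default StorageClass item, `kubectl get storageclass` marks the default class by printing `(default)` after its name. A minimal sketch of a filter that extracts that name from the table output follows. `default_storageclass` is a hypothetical helper, and it assumes the human-readable table layout, which is not a stable interface.

```shell
# default_storageclass
# Hypothetical helper: reads `kubectl get storageclass` table output on stdin
# and prints the name of every class marked "(default)".
# Assumption: the default marker appears as the second whitespace-separated
# field, directly after the class name.
default_storageclass() {
  awk 'NR > 1 && $2 == "(default)" { print $1 }'
}

# Example (commented out; requires cluster access):
#   kubectl get storageclass | default_storageclass
```

Empty output means no default is set. More than one line means several classes carry the `storageclass.kubernetes.io/is-default-class` annotation. Either case is worth resolving before testing PVC provisioning.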

✓
Compute performs as expected (GEMMs, MAMF, bandwidth, etc.)
kubectl apply -f tests/k8s/compute/mamf-finder/mamf-benchmark.yaml
✓
Storage performs as expected (fio, etc.)
kubectl apply -f tests/k8s/storage/fio-benchmark.yaml
✓
Network performs as expected with nccl-tests or rccl-tests (nccl-tests)
kubectl apply -f tests/k8s/networking/nccl/mpi-nccl-tests/nccl-test-2node.yaml
✓
TorchTitan or Megatron training job reaches expected 30-40% MFU (TorchTitan)
kubectl apply -f tests/k8s/training/kueue/00-torchtitan.yaml
✓
prime-rl/verifiers RL job reaches expected throughput (PRIME-RL)
✓
llm-d or SGLang OME prefill/decode disaggregated inference performance testing (llm-d)
kubectl apply -f tests/k8s/inference/llm-d/helmfile.yaml
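
To sanity-check the 30-40% MFU target against a training log, a rough estimate can be computed from throughput alone. The sketch below uses the common approximation of about 6 FLOPs per parameter per token for a dense transformer (forward plus backward pass, ignoring attention FLOPs), so it will not exactly match the MFU a framework such as TorchTitan reports. `mfu_percent` is a hypothetical helper, and every number in the example call is an illustrative input rather than a measured result.

```shell
# mfu_percent PARAMS TOKENS_PER_SEC NUM_GPUS PEAK_FLOPS_PER_GPU
# Hypothetical helper: estimates model FLOPs utilization as a percentage.
# Assumption: achieved FLOPs/s is approximated as 6 * PARAMS * TOKENS_PER_SEC,
# where TOKENS_PER_SEC is the aggregate throughput across all GPUs.
mfu_percent() {
  awk -v n="$1" -v tps="$2" -v gpus="$3" -v peak="$4" 'BEGIN {
    achieved = 6 * n * tps
    printf "%.1f\n", 100 * achieved / (gpus * peak)
  }'
}

# Example (illustrative inputs): an 8B-parameter model at 64,000 tokens/s
# on 8 GPUs, each with an assumed dense BF16 peak of 989 TFLOP/s.
#   mfu_percent 8000000000 64000 8 989000000000000    # prints 38.8
```

Substitute the peak figure from your accelerator's datasheet for the numeric precision you train in. Using a sparse or FP8 peak would understate MFU for a BF16 run.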

✓
GPU Operator up to date for drivers and container toolkit, including latest security patches (GPU Operator docs)
bash tests/k8s/security/gpu-operator-check.sh
✓
Network Operator up to date for InfiniBand or Spectrum-X RoCE (Network Operator docs)
✓
Documentation for passthrough of Broadcom or Pollara RoCE NICs
✓
Experience with MPI Operator and PyTorchJob from Kubeflow (MPI Operator)
✓
Experience with JobSet, Volcano, Kueue or other OSS training frameworks (Kueue docs)
✓
Experience with llm-d, SGLang OME, or other OSS inference frameworks (llm-d)
✓
ACS and other BIOS settings monitored on underlying hosts
✓
Node Problem Detector (NPD), draino or similar for automated drain/cordon + repair/replace (Node Problem Detector)
✓
kube-prometheus-stack and dcgmi deployed, with dmesg monitored for ECC errors, XIDs, and similar faults (kube-prometheus-stack)
kubectl get pods -n monitoring
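
As an illustration of the dmesg monitoring item, NVIDIA driver XID events appear in the kernel log as lines of the form `NVRM: Xid (PCI:<address>): <code>, ...`. The sketch below tallies XID codes from log text on stdin. `xid_summary` is a hypothetical helper, and it is a quick manual check rather than a substitute for DCGM-based alerting.

```shell
# xid_summary
# Hypothetical helper: reads kernel log text on stdin and prints one line per
# XID code seen, as "<code> <count>", sorted numerically by code.
# Assumption: XID lines follow the "NVRM: Xid (PCI:...): <code>" pattern.
xid_summary() {
  grep -oE 'NVRM: Xid \([^)]*\): [0-9]+' \
    | awk '{ count[$NF]++ } END { for (x in count) print x, count[x] }' \
    | sort -n
}

# Example (commented out; dmesg usually needs root on the GPU host):
#   dmesg | xid_summary
```

No output means no XID lines were found in the input. Any codes that do appear can be looked up in NVIDIA's XID documentation to decide whether a node should be cordoned.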

✓
Console, dashboard, CLI and/or API available to manage resources
✓
24x7 support availability
✓
Process for security fixes and upgrades exists, proactive notifications are clear
✓
Integration with comprehensive monitoring and alerting systems
✓
Integration with active and passive health check systems