Access
- ✓
- ✓ RBAC options with remote SSO provider integration (see K8s RBAC docs)
- ✓
Configuration
- ✓
- ✓ Network Operator installed and configured (see Network Operator docs)
kubectl get pods -n nvidia-network-operator
- ✓
- ✓
- ✓ Default StorageClass configured and functional; PVCs provision without hanging
kubectl get storageclass -o wide
- ✓ Host path storage option available for high performance and caching
- ✓ Ingress/egress to cluster properly configured
- ✓ MetalLB or external load balancer available, public IPs assignable (see MetalLB docs)
kubectl get pods -n metallb-system
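
The `kubectl get pods -n <namespace>` checks in this section can be scripted rather than read by eye. The following is a minimal sketch, not part of the original checklist: `unhealthy_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the name of every pod that is neither Completed nor fully ready.
unhealthy_pods() {
  awk '{
    # READY column looks like "1/1": ready containers / total containers.
    split($2, ready, "/")
    # Finished job pods report STATUS "Completed" and are not a problem.
    if ($3 == "Completed") next
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}
```

Example usage, assuming the namespaces named above: `kubectl get pods -n metallb-system --no-headers | unhealthy_pods`. Empty output means every pod in the namespace is running with all containers ready.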
Performance Testing
- ✓ Compute performs as expected (GEMMs, MAMF, bandwidth, etc.)
kubectl apply -f tests/k8s/compute/mamf-finder/mamf-benchmark.yaml
- ✓ Storage performs as expected (fio, etc.)
kubectl apply -f tests/k8s/storage/fio-benchmark.yaml
- ✓ Network performs as expected (nccl-tests or rccl-tests) (see nccl-tests)
kubectl apply -f tests/k8s/networking/nccl/mpi-nccl-tests/nccl-test-2node.yaml
- ✓ TorchTitan or Megatron training job reaches expected 30-40% MFU (see TorchTitan)
kubectl apply -f tests/k8s/training/kueue/00-torchtitan.yaml
- ✓ prime-rl/verifiers RL job reaches expected throughput (see PRIME-RL)
- ✓ llm-d or SGLang OME prefill/decode disaggregated inference performance testing (see llm-d)
kubectl apply -f tests/k8s/inference/llm-d/helmfile.yaml
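
To sanity-check the 30-40% MFU target in the training item above, a rough estimate can be computed from logged throughput. This sketch is an assumption, not the method used by TorchTitan or Megatron: it uses the common approximation of 6 FLOPs per parameter per token for a dense transformer (forward plus backward, attention FLOPs ignored). `mfu_percent` is a hypothetical helper, and the peak TFLOPS figure must come from the accelerator's datasheet for the precision in use.

```shell
# Hypothetical helper: estimate Model FLOPs Utilization (MFU) as a percentage.
# Args: <param_count> <tokens_per_sec_total> <num_gpus> <peak_tflops_per_gpu>
mfu_percent() {
  awk -v n="$1" -v tps="$2" -v gpus="$3" -v peak="$4" 'BEGIN {
    achieved  = 6 * n * tps          # FLOP/s, assuming ~6 FLOPs per param per token
    available = gpus * peak * 1e12   # FLOP/s the hardware could deliver at peak
    printf "%.1f\n", 100 * achieved / available
  }'
}
```

For example, `mfu_percent 1000000000 10000 1 200` describes a 1B-parameter model processing 10,000 tokens/s on one accelerator with a 200 TFLOPS peak. A framework's own reported MFU may differ because it can count attention FLOPs separately.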
Other
- ✓ GPU Operator up to date for drivers and container toolkit, including the latest security patches (see GPU Operator docs)
bash tests/k8s/security/gpu-operator-check.sh
- ✓ Network Operator up to date for InfiniBand or Spectrum-X RoCE (see Network Operator docs)
- ✓ Documentation for Broadcom or Pollara RoCE NIC passthrough
- ✓ Experience with MPI Operator and PyTorchJob from Kubeflow (see MPI Operator)
- ✓ Experience with JobSet, Volcano, Kueue or other OSS training frameworks (see Kueue docs)
- ✓ Experience with llm-d, SGLang OME, or other OSS inference frameworks (see llm-d)
- ✓ ACS and other BIOS settings monitored on underlying hosts
- ✓ Node Problem Detector (NPD), draino or similar for automated drain/cordon + repair/replace (see Node Problem Detector)
- ✓ kube-prometheus-stack, dcgmi for monitoring, dmesg for ECCs, XIDs, and similar errors (see kube-prometheus-stack)
kubectl get pods -n monitoring
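
As an illustration of the dmesg check for XIDs in the last item above, the sketch below tallies NVIDIA Xid codes from kernel log text. It is not part of the original checklist: `xid_counts` is a hypothetical helper, and it assumes the usual NVIDIA driver log format `NVRM: Xid (PCI:<address>): <code>, ...`. It does not cover AMD or other vendors.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints
# "<xid_code> <count>" for each NVIDIA Xid code seen, sorted by code.
xid_counts() {
  # Keep only the "NVRM: Xid (PCI:...): <code>" part of each matching line.
  grep -oE 'NVRM: Xid \([^)]*\): [0-9]+' |
    # The Xid code is the last field of each match; count occurrences per code.
    awk '{ count[$NF]++ } END { for (code in count) print code, count[code] }' |
    sort -n
}
```

Example usage on a node: `dmesg | xid_counts`. Empty output means no Xid events were found in the kernel ring buffer. Which codes warrant draining a node is a policy decision outside the scope of this sketch.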
General Expectations
- ✓ Console, dashboard, CLI and/or API available to manage resources
- ✓ 24x7 support availability
- ✓ Process for security fixes and upgrades exists, proactive notifications are clear
- ✓ Integration with comprehensive monitoring and alerting systems
- ✓ Integration with active and passive health check systems