Kubernetes

Kubernetes is the industry-standard container orchestration system. It is the de facto choice for inference and is growing in popularity for training.

Access

  • Kubeconfig available via simple download, or a properly configured login node (Kubeconfig docs)
    kubectl cluster-info
  • RBAC options with remote SSO provider integration (K8s RBAC docs)
  • Helm access available without custom external authentication (Helm docs)
    helm version && helm repo list
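
The RBAC item above can be spot-checked from the client side with `kubectl auth can-i`. The following is a minimal sketch, not part of the original checklist: the function name `check_rbac` and the verb/resource pairs are illustrative assumptions, so adjust them to what your workloads actually need.

```shell
# Prints one line per permission check for the current kubeconfig user.
# The verb/resource pairs are assumptions chosen for illustration.
check_rbac() {
  local ns="${1:-default}"
  local pair verb resource answer
  for pair in "create pods" "create jobs" "create persistentvolumeclaims"; do
    verb="${pair%% *}"
    resource="${pair#* }"
    # `kubectl auth can-i` prints "yes" or "no" and exits non-zero on "no".
    answer="$(kubectl auth can-i "$verb" "$resource" --namespace "$ns" 2>/dev/null)" || true
    printf '%-6s %-28s %s\n' "$verb" "$resource" "${answer:-unknown}"
  done
}
```

Run it as `check_rbac my-namespace` after logging in through the SSO flow; a "no" or "unknown" on any line suggests the role bindings need attention.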

Configuration

  • GPU Operator installed and configured (GPU Operator docs)
    kubectl get pods -n gpu-operator
  • Network Operator installed and configured (Network Operator docs)
    kubectl get pods -n nvidia-network-operator
  • MPI Operator deployed or easy to install (MPI Operator)
    kubectl get crd | grep mpijob
  • CSI provider with ReadWriteMany support (K8s PV Access Modes)
    kubectl get storageclass
  • Default StorageClass configured and functional; PVCs provision without hanging
    kubectl get storageclass -o wide
  • Host path storage option available for high performance and caching
  • Ingress/egress to cluster properly configured
  • MetalLB or external load balancer available, with assignable public IPs (MetalLB docs)
    kubectl get pods -n metallb-system
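
Listing StorageClasses shows what is configured, but not whether a claim actually binds. The following is a minimal sketch of a provisioning probe for the ReadWriteMany and "PVCs provision without hanging" items above. The function names, the claim name `pvc-probe`, the 1Gi size, and the 120-second timeout are all assumptions for illustration; `kubectl wait --for=jsonpath=...` requires kubectl 1.23 or newer.

```shell
# Prints a PVC manifest. Arguments: name, storage class, access mode, size.
pvc_manifest() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  storageClassName: $2
  accessModes: ["$3"]
  resources:
    requests:
      storage: $4
EOF
}

# Applies a ReadWriteMany claim against the given StorageClass, waits for it
# to reach Bound, then deletes it. Returns non-zero if the wait times out.
probe_pvc() {
  local name="pvc-probe" class="$1" status=0
  pvc_manifest "$name" "$class" ReadWriteMany 1Gi | kubectl apply -f - || return 1
  kubectl wait --for=jsonpath='{.status.phase}'=Bound "pvc/$name" --timeout=120s || status=$?
  kubectl delete pvc "$name" --wait=false
  return "$status"
}
```

Call it as `probe_pvc <storageclass-name>`. If the StorageClass uses `volumeBindingMode: WaitForFirstConsumer`, the claim stays Pending until a pod mounts it, so a timeout in that case does not by itself indicate a broken provisioner.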

Performance Testing

  • Compute performs as expected (GEMMs, MAMF, bandwidth, etc.)
    kubectl apply -f tests/k8s/compute/mamf-finder/mamf-benchmark.yaml
  • Storage performs as expected (fio, etc.)
    kubectl apply -f tests/k8s/storage/fio-benchmark.yaml
  • Network performs as expected with nccl-tests or rccl-tests (nccl-tests)
    kubectl apply -f tests/k8s/networking/nccl/mpi-nccl-tests/nccl-test-2node.yaml
  • TorchTitan or Megatron training job reaches the expected 30-40% MFU (TorchTitan)
    kubectl apply -f tests/k8s/training/kueue/00-torchtitan.yaml
  • prime-rl/verifiers RL job reaches expected throughput (PRIME-RL)
  • llm-d or SGLang OME prefill/decode disaggregated inference performance testing (llm-d)
    kubectl apply -f tests/k8s/inference/llm-d/helmfile.yaml
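
The 30-40% MFU target above can be sanity-checked by hand from a training job's reported throughput. The following is a minimal sketch, assuming the common approximation of 6 FLOPs per parameter per token for dense transformer training (forward plus backward, attention FLOPs ignored). The function name `mfu_percent` is hypothetical and is not part of the test suite referenced above.

```shell
# Prints MFU as a percentage.
# Arguments: parameter count, tokens/s across the whole job,
#            GPU count, peak TFLOPS per GPU (an assumption you supply).
mfu_percent() {
  awk -v n="$1" -v tps="$2" -v gpus="$3" -v peak="$4" 'BEGIN {
    achieved = 6 * n * tps            # approximate training FLOPs per second
    available = gpus * peak * 1e12    # aggregate peak FLOPs per second
    printf "%.1f\n", 100 * achieved / available
  }'
}
```

For example, `mfu_percent 8000000000 50000 8 989` (an illustrative 8B-parameter model at 50,000 tokens/s on 8 GPUs, assuming 989 TFLOPS of peak dense BF16 per GPU) prints 30.3, which sits at the low end of the expected range.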

Other

  • GPU Operator up to date for drivers and container toolkit, including the latest security patches (GPU Operator docs)
    bash tests/k8s/security/gpu-operator-check.sh
  • Network Operator up to date for InfiniBand or Spectrum-X RoCE (Network Operator docs)
  • Documentation for passthrough of Broadcom or Pollara RoCE NICs
  • Experience with MPI Operator and PyTorchJob from Kubeflow (MPI Operator)
  • Experience with JobSet, Volcano, Kueue or other OSS training frameworks (Kueue docs)
  • Experience with llm-d, SGLang OME, or other OSS inference frameworks (llm-d)
  • ACS and other BIOS settings monitored on underlying hosts
  • Node Problem Detector (NPD), draino or similar for automated drain/cordon + repair/replace (Node Problem Detector)
  • kube-prometheus-stack, dcgmi, and dmesg monitoring for ECC errors, XIDs, and similar faults (kube-prometheus-stack)
    kubectl get pods -n monitoring
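
As a complement to the monitoring stack, Xid events can be summarized directly from a node's kernel log. The following is a minimal sketch, assuming the NVIDIA driver's usual `NVRM: Xid (PCI:...): <code>, ...` log line format; the function name `summarize_xids` is hypothetical.

```shell
# Reads kernel log text on stdin and prints "<xid_code> <count>" for each
# NVIDIA Xid code found, sorted by code. Prints nothing if no Xids appear.
summarize_xids() {
  grep -oE 'NVRM: Xid \([^)]*\): [0-9]+' \
    | awk '{ count[$NF]++ } END { for (c in count) print c, count[c] }' \
    | sort -n
}
```

On a host, `sudo dmesg | summarize_xids` gives a quick per-code tally; empty output means no Xid lines were found in the current kernel ring buffer.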

General Expectations

  • Console, dashboard, CLI and/or API available to manage resources
  • 24x7 support availability
  • A process for security fixes and upgrades exists, and proactive notifications are clear
  • Integration with comprehensive monitoring and alerting systems
  • Integration with active and passive health check systems
