Cluster Overview
- ✓Number of nodes, pods, workloads running
  kubectl get nodes && kubectl get pods --all-namespaces --no-headers | wc -l
- ✓Current node health status
  kubectl get nodes -o wide
- ✓Resource utilization (CPU, memory, disk, network) at cluster/node/namespace/pod levels
  kubectl top nodes && kubectl top pods --all-namespaces
- ✓Trends and summaries
- ✓Workload stability (restarts, pending pods, scaling events)
- ✓Control plane health (API server latency, etcd performance, controller manager and scheduler metrics)
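The workload stability item can be spot-checked from the CLI. The following is a minimal sketch, assuming the default kubectl column order (NAMESPACE NAME READY STATUS RESTARTS AGE); summarize_pods is a hypothetical helper name, not part of kubectl.

```shell
# Summarize pod stability from pod listings read on stdin.
# Assumed input: output of `kubectl get pods --all-namespaces --no-headers`,
# with STATUS in column 4 and the restart count in column 5.
summarize_pods() {
  awk '
    $4 == "Pending" { pending++ }      # pods the scheduler has not placed yet
    $5 + 0 > 0      { restarted++ }    # pods with at least one container restart
    END { printf "total=%d pending=%d restarted=%d\n", NR, pending, restarted }
  '
}
```

Possible usage: kubectl get pods --all-namespaces --no-headers | summarize_pods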
Resource Management Integration
- ✓Slurm integration with job stats by user, type, etc. (ref: Slurm Accounting)
  sacct --allusers --format=JobID,User,Partition,AllocTRES,State,Elapsed --starttime=now-7days
- ✓Kubernetes integration with kube-state-metrics, node-exporter, dcgm-exporter, cAdvisor
  kubectl get pods -n monitoring -l app=dcgm-exporter
- ✓Resource quotas with limits and usage of GPUs for users and groups
- ✓Real-time GPU node availability status
- ✓Scheduler resource contention monitoring
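Job stats by user can be derived from sacct output with a small aggregation step. This is a sketch only, assuming pipe-delimited input with the fields User, State, ElapsedRaw in that order; jobs_by_user is a hypothetical helper name.

```shell
# Aggregate job count and total elapsed seconds per user.
# Assumed input on stdin: lines of the form User|State|ElapsedRaw,
# as produced by sacct with --parsable2 and --noheader.
jobs_by_user() {
  awk -F'|' '
    $1 != "" { jobs[$1]++; secs[$1] += $3 }
    END { for (u in jobs) printf "%s jobs=%d elapsed_s=%d\n", u, jobs[u], secs[u] }
  ' | sort   # awk array order is unspecified, so sort for stable output
}
```

Possible usage: sacct --allusers --allocations --noheader --parsable2 --format=User,State,ElapsedRaw --starttime=now-7days | jobs_by_user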
Monitoring Stack
- ✓kube-prometheus-stack for Kubernetes metrics (node state, resource usage, control plane, workload performance)
  helm list -n monitoring
- ✓NVIDIA DCGM integration for comprehensive GPU monitoring, or the AMD equivalent (ref: DCGM docs)
  dcgmi discovery -l
- ✓KV cache usage for horizontal pod autoscaling with serving (gpu_cache_usage_perc)
- ✓Integrated alert management and notification systems
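To illustrate how a KV cache metric can drive horizontal pod autoscaling, the sketch below applies the basic Kubernetes HPA scaling rule, desired = ceil(current_replicas * current_value / target_value). It is a simplification: the real controller also applies a tolerance band and min/max replica bounds. The function name and the floor of one replica are assumptions for illustration.

```shell
# Compute a desired replica count from a utilization-style metric.
# Args: current_replicas current_value target_value
# (e.g. current_value = average gpu_cache_usage_perc across serving pods)
desired_replicas() {
  awk -v r="$1" -v c="$2" -v t="$3" 'BEGIN {
    x = r * c / t          # raw scaling ratio applied to current replicas
    d = int(x)
    if (d < x) d++         # round up (ceil)
    if (d < 1) d = 1       # assumed floor of one replica
    print d
  }'
}
```

For example, 3 replicas at 0.75 average cache usage against a 0.5 target yields 5 replicas.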
Hardware Monitoring
- ✓Node-level power draw with real-time power consumption monitoring
  nvidia-smi --query-gpu=power.draw --format=csv
- ✓Fan speed monitoring for cooling system performance and status
  nvidia-smi --query-gpu=fan.speed --format=csv
- ✓Temperature monitoring (CPU, RAM, NIC, transceiver, other critical components)
  nvidia-smi --query-gpu=temperature.gpu --format=csv
- ✓PCIe AER rates for Advanced Error Reporting on PCIe bus health
- ✓dmesg logs monitoring via Promtail for system-level message capture (ref: Promtail docs)
- ✓GB200 NVL72 specialized monitoring requirements
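The GPU portion of the temperature and power checks can be turned into a simple threshold alert. This is a minimal sketch, assuming CSV input with the fields index, temperature.gpu, power.draw; hot_gpus and the 85 C limit in the usage line are hypothetical, not vendor recommendations.

```shell
# Print GPUs whose temperature exceeds a limit (degrees C).
# Arg: limit. Assumed input on stdin: "index, temperature.gpu, power.draw"
# rows from nvidia-smi with --format=csv,noheader,nounits.
hot_gpus() {
  awk -F', *' -v limit="$1" '
    $2 + 0 > limit + 0 { printf "gpu=%s temp_c=%s power_w=%s\n", $1, $2, $3 }
  '
}
```

Possible usage: nvidia-smi --query-gpu=index,temperature.gpu,power.draw --format=csv,noheader,nounits | hot_gpus 85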
Performance Monitoring
- ✓SM Active monitoring via DCGM_FI_PROF_SM_ACTIVE
  dcgmi dmon -e 1002 -d 1000
- ✓SM Occupancy monitoring via DCGM_FI_PROF_SM_OCCUPANCY
  dcgmi dmon -e 1003 -d 1000
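Samples from dcgmi dmon are easier to read as per-GPU averages. The sketch below assumes data rows of the form "GPU <id> <SMACT> ..." (header lines are skipped); the exact dmon layout varies between DCGM versions, so treat the column positions as an assumption. avg_sm_active is a hypothetical helper name.

```shell
# Average SM Active (field 1002) per GPU across all samples on stdin.
# Assumed row format: "GPU <id> <sm_active> [more fields...]".
avg_sm_active() {
  awk '
    $1 == "GPU" { sum[$2] += $3; n[$2]++ }   # ignore header and blank lines
    END { for (g in sum) printf "gpu=%s sm_active_avg=%.3f\n", g, sum[g] / n[g] }
  ' | sort   # awk array order is unspecified, so sort for stable output
}
```

Possible usage: dcgmi dmon -e 1002,1003 -d 1000 -c 30 | avg_sm_active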
Network Monitoring
- ✓NVLink/XGMI throughput for inter-GPU communication bandwidth
  nvidia-smi nvlink -s
- ✓PCIe throughput for host CPU to GPU data transfer rates
  nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
- ✓InfiniBand/RoCEv2 throughput for high-speed network performance between nodes (ref: RDMA perftest)
  ibstat && ib_write_bw    # perftest server side; on the client run: ib_write_bw <server_host>
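PCIe throughput depends on the negotiated link, so comparing current against maximum link generation and width is a cheap first check. This is a sketch under the assumption that the input has the fields index, pcie.link.gen.current, pcie.link.gen.max, pcie.link.width.current, pcie.link.width.max; degraded_links is a hypothetical helper name. Note that some GPUs lower the link generation at idle to save power, so a hit here is a prompt to re-check under load, not proof of a fault.

```shell
# Report GPUs whose PCIe link runs below its maximum generation or width.
# Assumed input on stdin: "index, gen.current, gen.max, width.current, width.max"
# rows from nvidia-smi with --format=csv,noheader.
degraded_links() {
  awk -F', *' '
    $2 + 0 < $3 + 0 || $4 + 0 < $5 + 0 {
      printf "gpu=%s gen=%s/%s width=%s/%s\n", $1, $2, $3, $4, $5
    }
  '
}
```

Possible usage: nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv,noheader | degraded_links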
Resource Management
- ✓User and group quotas
- ✓Node availability
- ✓Scheduler job history
General Expectations
- ✓Console, dashboard, CLI and/or API available to manage resources
- ✓24x7 support availability
- ✓Process for security fixes and upgrades exists, proactive notifications are clear
- ✓Integration with active and passive health check systems