Monitoring

A complete monitoring dashboard includes high-level and low-level views of the cluster, providing comprehensive visibility into system performance, resource utilization, and potential issues before they impact workloads.

Cluster Overview

  • Number of nodes, pods, workloads running
    kubectl get nodes && kubectl get pods --all-namespaces | wc -l
  • Current node health status
    kubectl get nodes -o wide
  • Resource utilization (CPU, memory, disk, network) at cluster/node/namespace/pod levels
    kubectl top nodes && kubectl top pods --all-namespaces
  • Trends and summaries
  • Workload stability (restarts, pending pods, scaling events)
  • Control plane health (API server latency, etcd performance, controller manager and scheduler metrics)

Resource Management Integration

  • Slurm integration with job stats by user, type, etc.Slurm Accounting
    sacct --format=JobID,User,Partition,AllocGRES,State,Elapsed --starttime=now-7days
  • Kubernetes integration with kube-state-metrics, node-exporter, dcgm-exporter, cAdvisordcgm-exporter
    kubectl get pods -n monitoring -l app=dcgm-exporter
  • Resource quotas with limits and usage of GPUs for users and groups
  • Real-time GPU node availability status
  • Scheduler resource contention monitoring

Monitoring Stack

  • kube-prometheus-stack for Kubernetes metrics (node state, resource usage, control plane, workload performance)kube-prometheus-stack
    helm list -n monitoring
  • NVIDIA DCGM integration for comprehensive GPU monitoring or AMD equivalentDCGM docs
    dcgmi discovery -l
  • KV cache usage for horizontal pod autoscaling with serving (gpu_cache_usage_perc)
  • Integrated alert management and notification systems

Hardware Monitoring

  • Node-level power draw with real-time power consumption monitoring
    nvidia-smi --query-gpu=power.draw --format=csv
  • Fan speed monitoring for cooling system performance and status
    nvidia-smi --query-gpu=fan.speed --format=csv
  • Temperature monitoring (CPU, RAM, NIC, transceiver, other critical components)
    nvidia-smi --query-gpu=temperature.gpu --format=csv
  • PCIe AER rates for Advanced Error Reporting on PCIe bus health
  • dmesg logs monitoring via Promtail for system-level message capturePromtail docs
  • GB200 NVL72 specialized monitoring requirements

Performance Monitoring

  • TFLOPs estimation via DCGM_FI_PROF_PIPE_TENSOR_ACTIVEDCGM field IDs
    dcgmi dmon -e 1004 -d 1000
  • SM Active monitoring via DCGM_FI_PROF_SM_ACTIVE
    dcgmi dmon -e 1002 -d 1000
  • SM Occupancy monitoring via DCGM_FI_PROF_SM_OCCUPANCY
    dcgmi dmon -e 1003 -d 1000
  • NCU profiling with NVIDIA Nsight Compute available without sudoNsight Compute
    ncu --version

Network Monitoring

  • NVLink/XGMI throughput for inter-GPU communication bandwidth
    nvidia-smi nvlink -s
  • PCIe throughput for host CPU to GPU data transfer rates
    nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
  • InfiniBand/RoCEv2 throughput for high-speed network performance between nodesRDMA perftest
    ibstat && perftest ib_write_bw

Resource Management

  • User and group quotas
  • Node availability
  • Scheduler job history

General Expectations

  • Console, dashboard, CLI and/or API available to manage resources
  • 24x7 support availability
  • Process for security fixes and upgrades exists, proactive notifications are clear
  • Integration with active and passive health check systems

All expectations