Access
- ✓Head node provisioned and accessible via a simple SSH command
- ✓Standard Slurm commands functional: `sinfo`, `squeue`, `scontrol`, `salloc`, `sbatch`, `srun`
  Docs: Slurm man pages
  Check: `sinfo -N -l && squeue -u $USER`
- ✓Shared filesystem configured with default home directory and reasonable quota
- ✓Essential packages available: `python`, `git`, `curl`, `wget`, `apt`, `vim`, `nano`
  Check: `which python3 git curl wget apt vim nano`
- ✓`sudo` access available on head node for package installation
- ✓Easy to add new users and groups via CLI or console
- ✓Easy to enforce RBAC on users and groups on cluster and storage
- ✓Integration with external IDPs (Okta, Google, Microsoft, GitHub via OIDC/OAuth 2.0)
- ✓Integration with `sacct` for job tracking and resource utilization by user
  Docs: sacct docs
  Check: `sacct --format=JobID,JobName,Partition,AllocGRES,State,Elapsed` (newer Slurm releases drop the `AllocGRES` field; use `AllocTRES` there)
- ✓Passwordless SSH connectivity between nodes enabled
  Check: `srun -N4 hostname`, then `ssh [node_name]`
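The command and package checks in this section can be scripted so that a missing tool is reported by name rather than spotted in `which` output. A minimal sketch, assuming a POSIX shell; `missing_tools` is a hypothetical helper name, not part of Slurm or any distribution:

```shell
#!/bin/sh
# Print each named tool that is NOT found on PATH.
# Returns 0 when every tool is present, 1 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}
```

On a head node this could be called as `missing_tools sinfo squeue scontrol salloc sbatch srun python3 git curl wget apt vim nano`; empty output and a zero exit status mean both items pass.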
Configuration
- ✓GPUDirect RDMA enabled on worker nodes via `dma_buf` (nvidia-open kernel module). The out-of-tree `nvidia_peermem` module is deprecated and does not pass on its own.
  Docs: GPUDirect RDMA docs
  Check: `srun bash -c 'grep -i "open kernel" /proc/driver/nvidia/version && find /sys/module/nvidia -maxdepth 4 -name dma_buf\*'`
- ✓`topology.conf` properly configured for system architecture
  Docs: Slurm `topology.conf`
  Check: `cat /etc/slurm/topology.conf`
- ✓NVCC compiler installed and accessible
  Check: `nvcc --version`
- ✓HPC-X or equivalent MPI implementation installed without hunting in `/opt`
  Docs: NVIDIA HPC-X
  Check: `module avail hpcx || ls /opt/hpcx*/`
- ✓NCCL properly installed, configured, and up to date
  Docs: NCCL docs
  Check: `dpkg -l | grep nccl || rpm -qa | grep nccl`
- ✓
- ✓Default CPUs per task, memory per CPU, and other settings are logical
  Check: `scontrol show config | grep -i 'DefCpu\|DefMem'`
Containers
- ✓
- ✓
- ✓Docker available on worker nodes with modern NVIDIA Container Toolkit
  Docs: NVIDIA Container Toolkit
  Check: `srun docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi`
- ✓
Networking and Collectives
- ✓NCCL auto-configuration: `NCCL_MIN_NCHANNELS`, `NCCL_PROTO`, `NCCL_ALGO` NOT in `/etc/nccl.conf`
  Docs: NCCL environment variables
  Check: `cat /etc/nccl.conf`
- ✓RoCEv2 configuration: `NCCL_IB_GID_INDEX=3` in `/etc/nccl.conf`
  Docs: NCCL_IB_GID_INDEX
  Check: `grep NCCL_IB_GID_INDEX /etc/nccl.conf`
- ✓NCCL tests run successfully at expected bandwidth
  Docs: nccl-tests
  Check: `sbatch tests/slurm/networking/nccl/4node-run-nccl.sbatch`
- ✓High bandwidth NICs/HCAs named correctly (`mlx5_0`, `mlx5_1`, etc.)
  Check: `ibstat | grep -E 'CA |Port |State|Rate'`
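The two `/etc/nccl.conf` items above can be audited in one pass instead of being read by eye. A minimal sketch, assuming a RoCEv2 fabric and a config file of plain `KEY=VALUE` lines with `#` comments; `check_nccl_conf` is a hypothetical helper name, not an NCCL tool:

```shell
#!/bin/sh
# Audit an NCCL config file against the checklist.
# Usage: check_nccl_conf [path]   (defaults to /etc/nccl.conf)
# Returns 0 on pass, 1 on a failed check, 2 if the file is unreadable.
check_nccl_conf() {
    conf="${1:-/etc/nccl.conf}"
    rc=0
    [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 2; }

    # Tuning knobs that should be left to NCCL's auto-configuration.
    for var in NCCL_MIN_NCHANNELS NCCL_PROTO NCCL_ALGO; do
        if grep -q -E "^[[:space:]]*${var}[[:space:]]*=" "$conf"; then
            echo "FAIL: $var is pinned in $conf"
            rc=1
        fi
    done

    # RoCEv2 item: the GID index is expected to be set to 3.
    if grep -q -E '^[[:space:]]*NCCL_IB_GID_INDEX[[:space:]]*=[[:space:]]*3[[:space:]]*$' "$conf"; then
        echo "OK: NCCL_IB_GID_INDEX=3"
    else
        echo "FAIL: NCCL_IB_GID_INDEX=3 not found in $conf"
        rc=1
    fi
    return "$rc"
}
```

Commented-out lines are ignored, so a leftover `# NCCL_ALGO=...` note does not trip the first check. To cover every worker rather than only the head node, the function would need to run on each node, for example from a script launched with `srun`.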
Monitoring and Health Checks
- ✓DCGM background health checks enabled and plugged into Slurm `HealthCheckProgram`
  Docs: DCGM docs
  Check: `scontrol show config | grep HealthCheckProgram`
- ✓Prolog and Epilog scripts lightweight (<30s to get on a node via `srun`)
  Docs: Slurm Prolog/Epilog
  Check: `time srun -N1 hostname`
- ✓SHARP support for enhanced NCCL performance
  Docs: NVIDIA SHARP docs
- ✓Dashboard includes Slurm job accounting data via `sacct`
  Docs: Slurm Accounting
- ✓Automatic remediation systems for failed nodes
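The 30-second Prolog/Epilog budget above can be turned into a pass/fail check rather than a number read off `time`. A minimal sketch, assuming whole-second resolution is enough; `within_budget` is a hypothetical helper name:

```shell
#!/bin/sh
# Run a command and fail if it takes as long as, or longer than, a budget in whole seconds.
# Usage: within_budget SECONDS command [args...]
# Returns 0 if under budget, 1 if over, 2 if the command itself failed.
within_budget() {
    budget="$1"; shift
    start=$(date +%s)
    "$@" >/dev/null || return 2
    elapsed=$(( $(date +%s) - start ))
    echo "elapsed=${elapsed}s budget=${budget}s"
    [ "$elapsed" -lt "$budget" ]
}
```

For the checklist item this would be invoked as `within_budget 30 srun -N1 hostname`. The measured time also includes scheduling delay, so it is best run while the target partition has idle nodes.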
General Expectations
- ✓Console, dashboard, CLI and/or API available to manage resources
- ✓24x7 support availability
- ✓Process for security fixes and upgrades exists, proactive notifications are clear
- ✓Integration with comprehensive monitoring and alerting systems
- ✓Integration with active and passive health check systems