Slurm

Slurm is an open-source job scheduler that has been the de facto standard for HPC for over 20 years. It is used on over 60% of TOP500 supercomputers and over 50% of AI training clusters.

Access

  • Head node provisioned and accessible via a simple SSH command
  • Standard Slurm commands functional: sinfo, squeue, scontrol, salloc, sbatch, srun (see Slurm man pages)
    sinfo -N -l && squeue -u $USER
  • Shared filesystem configured with default home directory and reasonable quota
  • Essential packages available: python, git, curl, wget, apt, vim, nano
    which python3 git curl wget apt vim nano
  • sudo access available on head node for package installation
  • Easy to add new users and groups via CLI or console
  • Easy to enforce RBAC on users and groups on cluster and storage
  • Integration with external IDPs (Okta, Google, Microsoft, GitHub via OIDC/OAuth 2.0)
  • Integration with sacct for job tracking and resource utilization by user (see sacct docs)
    sacct --format=JobID,JobName,Partition,AllocTRES,State,Elapsed
  • Passwordless SSH connectivity between nodes enabled
    srun -N4 hostname
    ssh [node_name]
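
The package check above can be made a little more explicit. Depending on the implementation, `which` may say nothing about a tool it cannot find, so a missing package is easy to overlook in the output. The sketch below is one possible approach, not part of the original checklist: `missing_tools` is a hypothetical helper name, and it prints only the tools that are absent from PATH.

```shell
# Print each required tool that is NOT on PATH, one per line.
# Prints nothing when every tool is present.
# (missing_tools is a hypothetical helper name, not a Slurm command.)
missing_tools() (
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
)

# Example with the package list from the checklist:
# missing_tools python3 git curl wget apt vim nano
```

An empty result means every listed tool resolved. Running the same function through `srun` on a worker node would show whether the worker image matches the head node.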

Configuration

  • GPUDirect RDMA enabled on worker nodes via dma_buf (nvidia-open kernel module). The out-of-tree nvidia_peermem module is deprecated and does not satisfy this check on its own (see GPUDirect RDMA docs)
    srun bash -c 'grep -i "open kernel" /proc/driver/nvidia/version && find /sys/module/nvidia -maxdepth 4 -name dma_buf\*'
  • topology.conf properly configured for system architecture (see Slurm topology.conf)
    cat /etc/slurm/topology.conf
  • NVCC compiler installed and accessible
    nvcc --version
  • HPC-X or an equivalent MPI implementation installed and discoverable without hunting through /opt (see NVIDIA HPC-X)
    module avail hpcx || ls /opt/hpcx*/
  • NCCL properly installed, configured, and up to date (see NCCL docs)
    dpkg -l | grep nccl || rpm -qa | grep nccl
  • Lmod installed and configured (see Lmod docs)
    module avail
  • Default CPUs per task, memory per CPU, and other settings are logical
    scontrol show config | grep -i 'DefCpu\|DefMem'

Containers

  • Support for the Pyxis plugin (see NVIDIA Pyxis)
    srun --help | grep -A4 -- '--container-image'
  • Support for Enroot (see NVIDIA Enroot)
    enroot version
  • Docker available on worker nodes with a modern NVIDIA Container Toolkit (see NVIDIA Container Toolkit)
    srun docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
  • Singularity and Apptainer container support (see Apptainer docs)
    apptainer version || singularity version
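
The `srun --help` check above only confirms that Pyxis is loaded. A minimal end-to-end sketch is to run one command inside an image through `--container-image`. The helper name `pyxis_smoke_test` and the default image `ubuntu:24.04` are illustrative assumptions, not part of the original checklist, so substitute an image your registry policy allows.

```shell
# Smoke-test Pyxis by running one command inside a container image.
# Usage: pyxis_smoke_test [IMAGE]
# The helper name and the default image are illustrative assumptions.
pyxis_smoke_test() (
    image=${1:-ubuntu:24.04}

    if ! command -v srun >/dev/null 2>&1; then
        echo "SKIP: srun not found on PATH"
        exit 2
    fi

    # --container-image is the srun option that Pyxis adds; Enroot pulls
    # and unpacks the image on the allocated node.
    srun -N1 --container-image="$image" grep PRETTY_NAME /etc/os-release
)

# Example:
# pyxis_smoke_test
```

If the job prints the distribution name of the image rather than that of the host, the Pyxis and Enroot path is working.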

Networking and Collectives

  • NCCL auto-configuration left intact: NCCL_MIN_NCHANNELS, NCCL_PROTO, and NCCL_ALGO are NOT set in /etc/nccl.conf (see NCCL environment variables)
    cat /etc/nccl.conf
  • RoCEv2 configuration: NCCL_IB_GID_INDEX=3 set in /etc/nccl.conf (see NCCL_IB_GID_INDEX)
    grep NCCL_IB_GID_INDEX /etc/nccl.conf
  • NCCL tests run successfully at expected bandwidth (see nccl-tests)
    sbatch tests/slurm/networking/nccl/4node-run-nccl.sbatch
  • High bandwidth NICs/HCAs named correctly (mlx5_0, mlx5_1, etc.)
    ibstat | grep -E 'CA |Port |State|Rate'
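
The two /etc/nccl.conf expectations above can be folded into a single pass/fail check. The sketch below assumes the file uses plain `NAME=value` lines with `#` comments and that the fabric is RoCEv2. `check_nccl_conf` is a hypothetical helper name, not an NCCL tool.

```shell
# Check an NCCL config file against the two expectations above.
# Usage: check_nccl_conf [FILE] [EXPECTED_GID_INDEX]
# Returns 0 when the file passes, 1 otherwise.
# (check_nccl_conf is a hypothetical helper; it assumes a RoCEv2 fabric.)
check_nccl_conf() (
    conf=${1:-/etc/nccl.conf}
    want_gid=${2:-3}
    status=0

    [ -r "$conf" ] || { echo "FAIL: cannot read $conf"; exit 1; }

    # Tuning variables that should be left to NCCL's auto-configuration.
    # Commented-out lines are ignored because the match is anchored.
    for var in NCCL_MIN_NCHANNELS NCCL_PROTO NCCL_ALGO; do
        if grep -Eq "^[[:space:]]*${var}[[:space:]]*=" "$conf"; then
            echo "FAIL: $var is pinned in $conf"
            status=1
        fi
    done

    # The RoCEv2 item expects the GID index to be set explicitly.
    if ! grep -Eq "^[[:space:]]*NCCL_IB_GID_INDEX[[:space:]]*=[[:space:]]*${want_gid}[[:space:]]*$" "$conf"; then
        echo "FAIL: NCCL_IB_GID_INDEX=$want_gid not found in $conf"
        status=1
    fi

    [ "$status" -eq 0 ] && echo "OK: $conf"
    exit "$status"
)

# Example, run on every worker node:
# srun -N4 bash -c "$(declare -f check_nccl_conf); check_nccl_conf"
```

The example invocation relies on bash's `declare -f` to ship the function body to the remote shell. The function itself sticks to POSIX shell features.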

Monitoring and Health Checks

  • DCGM background health checks enabled and plugged into Slurm HealthCheckProgram (see DCGM docs)
    scontrol show config | grep HealthCheckProgram
  • Prolog and Epilog scripts lightweight: under 30s to get on a node via srun (see Slurm Prolog/Epilog)
    time srun -N1 hostname
  • SHARP support for enhanced NCCL performance (see NVIDIA SHARP docs)
  • Dashboard includes Slurm job accounting data via sacct (see Slurm Accounting)
  • Automatic remediation systems for failed nodes
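
The Prolog/Epilog item above relies on reading the output of `time` by eye. As a rough sketch, the 30-second threshold can be turned into an exit status suitable for automation. `within_budget` is a hypothetical helper name, and it measures whole seconds only, which should be adequate for a threshold this coarse.

```shell
# Run a command and fail if it takes longer than BUDGET seconds.
# Usage: within_budget BUDGET_SECONDS COMMAND [ARGS...]
# (within_budget is a hypothetical helper; resolution is one second.)
within_budget() (
    budget=$1
    shift
    start=$(date +%s)
    "$@" >/dev/null || { echo "FAIL: command exited non-zero"; exit 1; }
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -gt "$budget" ]; then
        echo "FAIL: took ${elapsed}s (budget ${budget}s)"
        exit 1
    fi
    echo "OK: took ${elapsed}s (budget ${budget}s)"
)

# Example with the threshold from the checklist:
# within_budget 30 srun -N1 hostname
```

The measurement includes queue wait as well as Prolog time, so it is most meaningful on an idle partition.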

General Expectations

  • Console, dashboard, CLI and/or API available to manage resources
  • 24x7 support availability
  • Process for security fixes and upgrades exists, with clear proactive notifications
  • Integration with comprehensive monitoring and alerting systems
  • Integration with active and passive health check systems
