Name: ClusterMAX GPU Cloud Ratings
License: https://semianalysis.com

Access

✓
Head node provisioned and accessible via a simple SSH command
✓
Standard Slurm commands functional: sinfo, squeue, scontrol, salloc, sbatch, srun (Slurm man pages)
sinfo -N -l && squeue -u $USER
✓
Shared filesystem configured with default home directory and reasonable quota
✓
Essential packages available: python, git, curl, wget, apt, vim, nano
which python3 git curl wget apt vim nano
✓
sudo access available on head node for package installation
✓
Easy to add new users and groups via CLI or console
✓
Easy to enforce RBAC on users and groups on cluster and storage
✓
Integration with external IDPs (Okta, Google, Microsoft, GitHub via OIDC/OAuth 2.0)
✓
Integration with sacct for job tracking and resource utilization by user (sacct docs)
sacct --format=JobID,JobName,Partition,AllocTRES,State,Elapsed
✓
Passwordless SSH connectivity between nodes enabled
srun -N4 hostname
ssh [node_name]
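
The `which` check for essential packages prints only the tools it finds, so a missing one is easy to overlook. A minimal sketch that lists what is absent instead; the helper name `missing_tools` is hypothetical and not part of any cluster tooling:

```shell
# Print each named command that is NOT on PATH, one per line.
# missing_tools is an illustrative helper name (an assumption, not from the checklist).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

On a head node that passes the item, `missing_tools python3 git curl wget apt vim nano` should print nothing.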

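For the sacct job-tracking item, per-user utilization can be approximated by summing elapsed time per user. This is a sketch under assumptions: the helper name `elapsed_by_user` is hypothetical, it expects `user|seconds` lines on stdin, and it counts wall-clock seconds only, not GPU-hours.

```shell
# Sum elapsed seconds per user from "user|seconds" lines on stdin.
# elapsed_by_user is an illustrative helper name (an assumption).
elapsed_by_user() {
    awk -F'|' '$1 != "" { total[$1] += $2 }
               END { for (u in total) print u, total[u] }' | sort
}
```

One possible feed is `sacct -a -X -n -P -S now-7days --format=User,ElapsedRaw | elapsed_by_user`. Seeing other users' jobs with `-a` typically requires operator or administrator privileges.
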
Configuration

✓
GPUDirect RDMA enabled on worker nodes via dma_buf (nvidia-open kernel module). The out-of-tree nvidia_peermem module is deprecated and does not pass on its own. (GPUDirect RDMA docs)
srun bash -c 'grep -i "open kernel" /proc/driver/nvidia/version && find /sys/module/nvidia -maxdepth 4 -name dma_buf\*'
✓
topology.conf properly configured for the system architecture (Slurm topology.conf)
cat /etc/slurm/topology.conf
✓
NVCC compiler installed and accessible
nvcc --version
✓
HPC-X or equivalent MPI implementation installed and discoverable without hunting in /opt (NVIDIA HPC-X)
module avail hpcx || ls /opt/hpcx*/
✓
NCCL properly installed, configured, and up to date (NCCL docs)
dpkg -l | grep nccl || rpm -qa | grep nccl
✓
Lmod installed and configured (Lmod docs)
module avail
✓
Default CPUs per task, memory per CPU, and other settings are logical
scontrol show config | grep -i 'DefCpu\|DefMem'

✓
Support for the Pyxis plugin (NVIDIA Pyxis)
srun --help | grep -A4 -- '--container-image'
✓
Support for Enroot (NVIDIA Enroot)
enroot version
✓
Docker available on worker nodes with a modern NVIDIA Container Toolkit (NVIDIA Container Toolkit)
srun docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
✓
Singularity and Apptainer container support (Apptainer docs)
apptainer version || singularity version

✓
NCCL auto-configuration left intact: NCCL_MIN_NCHANNELS, NCCL_PROTO, and NCCL_ALGO are NOT set in /etc/nccl.conf (NCCL environment variables)
cat /etc/nccl.conf
✓
RoCEv2 configuration: NCCL_IB_GID_INDEX=3 in /etc/nccl.conf (NCCL_IB_GID_INDEX)
grep NCCL_IB_GID_INDEX /etc/nccl.conf
✓
NCCL tests run successfully at expected bandwidth (nccl-tests)
sbatch tests/slurm/networking/nccl/4node-run-nccl.sbatch
✓
High bandwidth NICs/HCAs named correctly (mlx5_0, mlx5_1, etc.)
ibstat | grep -E 'CA |Port |State|Rate'
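
Reading /etc/nccl.conf by eye works, but the auto-configuration item can also be checked mechanically. A minimal sketch, assuming the file uses plain `NAME=value` lines with `#` comments; the helper name `nccl_conf_overrides` is hypothetical:

```shell
# Return non-zero if the given nccl.conf pins tuning variables that NCCL
# should choose on its own. Offending lines are printed; commented-out
# lines are ignored.
# nccl_conf_overrides is an illustrative helper name (an assumption).
nccl_conf_overrides() {
    conf="${1:-/etc/nccl.conf}"
    [ -r "$conf" ] || return 0   # no readable file means nothing is pinned
    if grep -E '^[[:space:]]*(NCCL_MIN_NCHANNELS|NCCL_PROTO|NCCL_ALGO)[[:space:]]*=' "$conf"; then
        return 1
    fi
    return 0
}
```

Running `nccl_conf_overrides` with no argument checks /etc/nccl.conf. Any printed line is a setting that would override NCCL's own tuning.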

✓
DCGM background health checks enabled and plugged into Slurm HealthCheckProgram (DCGM docs)
scontrol show config | grep HealthCheckProgram
✓
Prolog and Epilog scripts lightweight (<30s to get on a node via srun) (Slurm Prolog/Epilog)
time srun -N1 hostname
✓
SHARP support for enhanced NCCL performance (NVIDIA SHARP docs)
✓
Dashboard includes Slurm job accounting data via sacct (Slurm Accounting)
✓
Automatic remediation systems for failed nodes
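
The 30-second Prolog/Epilog budget can be turned into a pass/fail check instead of reading `time` output. A rough sketch, assuming whole-second resolution from `date +%s` is good enough; the helper name `check_launch_time` is hypothetical:

```shell
# Run a command and succeed only if it finishes in fewer than BUDGET seconds.
# Returns 2 if the command itself fails, 1 if it exceeds the budget.
# check_launch_time is an illustrative helper name (an assumption).
check_launch_time() {
    budget="$1"; shift
    start=$(date +%s)
    "$@" >/dev/null || return 2
    end=$(date +%s)
    elapsed=$((end - start))
    echo "elapsed=${elapsed}s budget=${budget}s"
    [ "$elapsed" -lt "$budget" ]
}
```

On a cluster this would be called as `check_launch_time 30 srun -N1 hostname`. The measured time also includes scheduling delay, so a busy queue can cause a failure unrelated to Prolog or Epilog.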

✓
Console, dashboard, CLI and/or API available to manage resources
✓
24x7 support availability
✓
Process for security fixes and upgrades exists, proactive notifications are clear
✓
Integration with comprehensive monitoring and alerting systems
✓
Integration with active and passive health check systems