Runpod manages a significant fleet of over 20,000 GPUs, with users all over the world. However, its fundamental architectural choice to put every user inside a “pod” (container) severely limits its ability to serve large-scale training, inference, and enterprise workloads.

In our testing, the container-centric design prevents the use of standard HPC and MLOps tools, such as running Slurm with Pyxis or Enroot for containerized MPI jobs, performing active health checks on the underlying bare-metal infrastructure, or using Kubernetes.

In our testing of Runpod’s Slurm offering (still in beta), we initially used a cluster directly from another provider, FarmGPU, and gave feedback on a number of issues we found. The Runpod technical team was responsive, took the feedback, and committed to incorporating it in their next development cycle. A few weeks later, different Runpod team members insisted that we re-test with a different bare metal provider, directly from their console. While we appreciate their engagement, all the core issues we found in the first round of testing remained.

The default user is root, with no way to add additional users, enforce RBAC, or use an external IAM provider. The default home directory (~) is not on a shared filesystem, forcing users to navigate to a separate /workspace directory. More critically, the environment lacks essential tooling. We found no pre-installed MPI, and initial attempts to run MPI-based jobs using srun failed until we modified the hostfile to specify external container hostnames and routes, since these are not updated in DNS or mapped to standard IPs. Specifically, we had to export NCCL_SOCKET_IFNAME="ens1" because it was not pre-populated in /etc/nccl.conf; export HF_HOME=/workspace/.cache/huggingface because the default working directory is /root, not /workspace; run head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" ip addr show ens1 | grep "inet " | awk '{print $2}' | cut -d'/' -f1); and include --hostfile hostfile in mpirun commands. Standard clusters offer much simpler options for all of these. Even though we knew these workarounds going into the second round of testing, the Slurm offering remains poorly documented and is clearly still a beta feature.
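The workarounds above can be collected into a small setup script. This is a minimal sketch rather than anything Runpod documents: the interface name ens1 and the /workspace path are taken from our testing, while the function names extract_ipv4, setup_env, and resolve_head_node_ip are hypothetical helpers introduced here for illustration. The awk filter is a swapped-in, single-process equivalent of the grep, awk, and cut pipeline quoted above.

```shell
#!/usr/bin/env bash
# Sketch of the per-job setup described above.
# Assumptions: interface ens1 and /workspace come from the text;
# all function names are hypothetical.

# Read `ip addr show <iface>` output on stdin and print the first IPv4
# address without its prefix length (e.g. 10.0.0.5 from 10.0.0.5/24).
extract_ipv4() {
  awk '$1 == "inet" { split($2, parts, "/"); print parts[1]; exit }'
}

# Export the variables that were not pre-populated in the environment.
setup_env() {
  export NCCL_SOCKET_IFNAME="ens1"
  export HF_HOME="/workspace/.cache/huggingface"
}

# Resolve the IP of the given head node through Slurm.
# Only meaningful inside a Slurm allocation where srun is available.
resolve_head_node_ip() {
  local head_node="$1"
  srun --nodes=1 --ntasks=1 -w "$head_node" ip addr show ens1 | extract_ipv4
}
```

Inside an allocation, a job script might source this file, call setup_env, and then set head_node_ip=$(resolve_head_node_ip "$head_node") before launching mpirun with --hostfile hostfile, assuming head_node has already been set to the first hostname in the job.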

On monitoring and health checks, we expect it will continue to be difficult for Runpod to ensure the reliability and performance required for large-scale training. Runpod does not explicitly state which underlying hardware provider a pod will land on, aside from specifying a “region” and a binary “secure” or “community” cloud. Multiple Runpod customers have told us that, as a result, they effectively feel like they are spinning a roulette wheel to try to “get a good pod”. In other words, users waste time spinning pods up and down based on their perception of quality, because price-per-value information is not available to them in the console.

Source: selected European regions in the Runpod console

Overall, we expect Runpod will continue to serve a niche market that values its simplified, container-first approach, but it will struggle to make progress against our criteria without a fundamental change to its architecture.
