STN is second in our list of providers that would have landed in the Silver tier had our testing gone better. STN is similar to Cirrascale: it offers dedicated managed services for clusters that are built to order for individual customers. There is no “public” cloud experience, and frankly not much about this is “cloud”. But customers who want a high-touch experience can get it here. In our testing, the STN platform was undermined by significant configuration errors and reliability problems, landing STN in our Bronze tier.
Onboarding is entirely manual, requiring phone calls to review PDFs and set up accounts. We were given a 4-node B200 cluster with impressive hardware, including four network fabrics (RoCEv2 for interconnect and storage) and 25TB of VAST storage. However, this high-end hardware was let down by basic configuration mistakes. For example, we found seven local NVMe drives unmounted on each node. The Slurm environment was also missing key components for performance: there was no topology.conf, GPUDirect RDMA was disabled (nvidia_peermem not loaded), and MPI was not installed.
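The three Slurm and performance gaps above are cheap to test for on a node before any job is submitted. The following is a minimal sketch of such a preflight check, not STN's tooling or ours: the function names are hypothetical, `/etc/slurm` is an assumed configuration directory, and `mpirun` is an assumed launcher name that may differ per MPI distribution.

```python
import shutil
from pathlib import Path


def loaded_modules(proc_modules_text: str) -> set[str]:
    """Return kernel module names parsed from the text of /proc/modules."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def preflight(
    proc_modules_text: str,
    slurm_conf_dir: Path,
    mpi_launcher: str = "mpirun",  # assumption: launcher name varies by MPI build
) -> dict[str, bool]:
    """Run the three checks; True means the check passed."""
    return {
        # GPUDirect RDMA relies on the nvidia_peermem kernel module being loaded.
        "nvidia_peermem loaded": "nvidia_peermem" in loaded_modules(proc_modules_text),
        # Topology-aware scheduling in Slurm is driven by topology.conf.
        "topology.conf present": (slurm_conf_dir / "topology.conf").is_file(),
        # A quick proxy for "MPI is installed": is a launcher on PATH?
        "MPI launcher on PATH": shutil.which(mpi_launcher) is not None,
    }


if __name__ == "__main__":
    modules_file = Path("/proc/modules")
    text = modules_file.read_text() if modules_file.exists() else ""
    # Assumption: Slurm configuration lives in /etc/slurm on this system.
    for name, ok in preflight(text, Path("/etc/slurm")).items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}")
```

Run on each node, a script like this would have flagged all three issues at handover rather than during our first jobs.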
Unfortunately, STN’s biggest weakness was reliability. During testing, we saw two different nodes go into a “down” state, one of which stayed “down” for over two days. Because STN’s repair process is entirely manual, customers must spot and report failures themselves. Notably, dcgmi health -c is enabled on the nodes, but it is not plugged into Slurm as a HealthCheckProgram.
Checking in on our nodes in “down” state on different occasions
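To illustrate what plugging the DCGM health check into Slurm could look like, here is a minimal sketch of a script that slurm.conf's HealthCheckProgram parameter could point at. Several things here are assumptions rather than anything STN runs: the report is assumed to contain an "Overall Health:" line that reads "Healthy" when all checks pass, the short hostname is assumed to match the Slurm node name, and the function names are hypothetical.

```python
import socket
import subprocess


def is_healthy(report: str) -> bool:
    """Return True only if the report's summary line says 'Healthy'.

    Assumption: the report contains a line such as
    '| Overall Health:   Healthy   |'. Verify against your DCGM version.
    """
    for line in report.splitlines():
        if "Overall Health:" in line:
            value = line.split("Overall Health:", 1)[1].strip(" |\t")
            return value == "Healthy"
    return False  # no summary line found: fail closed rather than guess


def drain_command(node: str, reason: str) -> list[str]:
    """Build the scontrol invocation that drains a node with a reason."""
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]


def main() -> int:
    """Run the DCGM check and drain this node if it does not report healthy."""
    try:
        result = subprocess.run(
            ["dcgmi", "health", "-c"], capture_output=True, text=True, timeout=60
        )
        report = result.stdout
    except (OSError, subprocess.SubprocessError):
        report = ""  # dcgmi missing or hung: treated as unhealthy
    if is_healthy(report):
        return 0
    # Assumption: the short hostname matches the Slurm NodeName.
    node = socket.gethostname().split(".")[0]
    try:
        subprocess.run(drain_command(node, "dcgmi health check failed"), check=False)
    except OSError:
        pass  # scontrol unavailable; the non-zero exit code still signals failure
    return 1

# When installed as a script, finish with: raise SystemExit(main())
```

With HealthCheckProgram set to the script's path and HealthCheckInterval set to a period in seconds, slurmd would run the check periodically, so an unhealthy node is drained automatically instead of waiting for a customer to notice and report it.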
We suggest that in the future, STN focus on actual cluster reliability instead of reporting fake “Uptime SLA” metrics to Grafana.
Source: says that we have evaluated our own SLA and are approaching 100%
Finally, getting jobs to run was a struggle. It took weeks for STN engineers to modify the cluster to include hpcx, nccl, and nvcc, enable GPUDirect RDMA, and turn off ACS so that we could get a basic nccl-test and torchtitan training job to run on four nodes. We also ran into what looked like network traffic shaping over the WAN, which slowed down our downloads while making speedtest-cli results look great.
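Whether ACS is still enabled can be checked from the output of `lspci -vvv`, where each ACS-capable device reports an `ACSCtl:` line with flags such as `SrcValid+` (enabled) or `SrcValid-` (disabled). The sketch below parses that output and lists devices with any ACS control flag set. The function name is hypothetical, and the parsing assumes the usual layout of device headers at column zero with indented detail lines.

```python
import subprocess


def devices_with_acs_enabled(lspci_text: str) -> list[str]:
    """Return PCI addresses whose ACSCtl line has at least one '+' flag."""
    enabled = []
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Device header, e.g. '17:00.0 PCI bridge: ...'
            current = line.split()[0]
        elif "ACSCtl:" in line and current is not None:
            flags = line.split("ACSCtl:", 1)[1]
            if "+" in flags:
                enabled.append(current)
    return enabled


def read_lspci() -> str:
    """Capture verbose lspci output; capability details usually require root."""
    return subprocess.run(
        ["lspci", "-vvv"], capture_output=True, text=True, check=False
    ).stdout

# Example: print(devices_with_acs_enabled(read_lspci()))
```

An empty list suggests ACS is off on every device that exposes the capability. Any address returned is a candidate for the peer-to-peer slowdowns or hangs that ACS can cause in NCCL workloads.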
With all this said, in our conversations with customers, STN has demonstrated that it has the capability to do deep, custom work for them. Going forward, we suggest that STN automate much of its Slurm provisioning and health checks, and develop a comprehensive monitoring dashboard to improve reliability. Until then, we feel that STN remains a high-risk choice and a Bronze-tier provider.