Fluidstack is the only cloud to debut in our Gold tier this round, and it has the most distinctive business model of the group. Almost all of Fluidstack’s customer deployments involve a third-party datacenter provider. Fluidstack is effectively the hired gun that organizations turn to when they want bronze-tier datacenter infrastructure turned into a gold-tier customer experience. Google has also gone into the market to secure colocation demand with Fluidstack as the operator of TeraWulf and Cipher sites, potentially for its TPUs. We explore why GCP is willing to “backstop” these deals and why GCP needs Fluidstack here.

This hired-gun model is clear in deployments for customers such as Meta, Poolside, Black Forest Labs, and an unnamed customer running in a TeraWulf datacenter in Buffalo that got a massive financial backstop from Google.

Source: Fluidstack

Our hands-on experience with Fluidstack was a live demonstration of their value proposition: a highly collaborative, deeply technical partnership that rapidly improves the platform based on expert feedback. While the initial cluster had rough edges, the speed and precision with which the Fluidstack team addressed every issue was unparalleled.

Our initial Slurm cluster came with pyxis and MPI support integrated into srun, and initial two-node nccl-tests runs showed performance within the expected range for large message sizes. However, we immediately hit a significant usability issue: the prolog.d script was so bloated with health checks that it took over a minute to schedule an interactive run on a single node. The script was running full single-node NCCL tests for NVLink and InfiniBand, plus host-to-device bandwidth checks, every time a job started.

When we pointed this out, the team immediately acknowledged it and committed to a roadmap item that moves these active health checks into the background on idle nodes, which is standard practice among other top-tier providers.
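As a rough illustration of the background approach (not Fluidstack’s implementation), a small scheduler could pick idle nodes that have not been checked recently and run the heavy tests only there, leaving the prolog lightweight. The sketch below assumes node states come from the output of sinfo -h -N -o "%N %T"; the function names, the one-hour interval, and the node names are hypothetical.

```python
def parse_sinfo(output: str) -> dict[str, str]:
    """Parse `sinfo -h -N -o "%N %T"` output into {node_name: state}."""
    states: dict[str, str] = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            node, state = parts
            states[node] = state  # duplicate rows (multi-partition nodes) collapse
    return states


def nodes_due_for_check(
    states: dict[str, str],
    last_checked: dict[str, float],
    now: float,
    interval_s: float = 3600.0,  # hypothetical re-check interval
) -> list[str]:
    """Return idle nodes whose last active health check is older than interval_s.

    Nodes that were never checked are always due. Only the exact state
    "idle" qualifies, so allocated, mixed, or draining nodes are skipped.
    """
    due = []
    for node, state in sorted(states.items()):
        if state != "idle":
            continue
        if now - last_checked.get(node, float("-inf")) >= interval_s:
            due.append(node)
    return due
```

A cron job or daemon could call nodes_due_for_check on a timer and submit the NCCL and bandwidth tests only to the returned nodes, so interactive jobs never wait on them.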

Our report of the slow prolog kicked off a rapid-fire feedback loop that defined our testing period:

Performance Tuning: We noted that the Nvidia HPC-X toolkit, which is necessary for optimal NCCL performance at medium message sizes, was missing from the base image. While it was available within NGC containers, not all users leverage pyxis/enroot. Within 24 hours, the Fluidstack team had deployed HPC-X to the base image on our cluster and added it to their standard deployment pipeline for all customers.

Monitoring Dashboards: The Grafana dashboard was solid, but we identified missing graphs for NVLink Rx/Tx utilization and incorrect DCGM metrics for tensor core pipes (they were capturing SIMT units instead of tensor core-specific pipes like DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE). The team implemented the correct DCGM metrics the following day.

Security Posture: This was the most critical finding. We discovered the cluster was running a version of the nvidia-container-toolkit vulnerable to NVIDIAScape (CVE-2025-23266). The team patched the vulnerability on our cluster within minutes of us reporting it. While the immediate fix was impressive, our feedback focused on the larger operational need for automated dependency scanning and a proactive security process, such as enrolling in Nvidia’s security embargo program. This prompted a healthy discussion on their software supply chain security strategy.

Passive Health Checks: We found that DCGM’s background health checks were not enabled. By injecting PCIe replay errors (dcgmi test --inject --gpuid 0 -f 202), we confirmed that the node would not automatically drain. Our recommendation was to actively poll dcgmi health -c and configure Node Health Check (NHC) to drain nodes based on specific thresholds (e.g., >8 PCIe replays per minute or >100 NVLink CRC errors per second). The team immediately added this to their near-term roadmap.
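To make the threshold recommendation concrete, here is a minimal sketch of the rate logic, assuming two readings of cumulative error counters taken some seconds apart. The Sample type, the function name, and the way counters are collected are all hypothetical; only the two thresholds come from our recommendation above.

```python
from dataclasses import dataclass

PCIE_REPLAYS_PER_MIN = 8.0   # drain if exceeded
NVLINK_CRC_PER_SEC = 100.0   # drain if exceeded


@dataclass(frozen=True)
class Sample:
    t: float                 # timestamp in seconds
    pcie_replays: int        # cumulative PCIe replay counter
    nvlink_crc_errors: int   # cumulative NVLink CRC error counter


def drain_reasons(prev: Sample, curr: Sample) -> list[str]:
    """Return the reasons a node should drain; an empty list means healthy."""
    dt = curr.t - prev.t
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")

    # Clamp at zero so a counter reset (e.g. driver reload) is not a negative rate.
    replays = max(0, curr.pcie_replays - prev.pcie_replays)
    crc = max(0, curr.nvlink_crc_errors - prev.nvlink_crc_errors)

    replays_per_min = replays * 60 / dt
    crc_per_sec = crc / dt

    reasons = []
    if replays_per_min > PCIE_REPLAYS_PER_MIN:
        reasons.append(f"pcie_replays={replays_per_min:.1f}/min")
    if crc_per_sec > NVLINK_CRC_PER_SEC:
        reasons.append(f"nvlink_crc={crc_per_sec:.1f}/s")
    return reasons
```

On Slurm, a non-empty result could be passed as the reason when draining the node, for example with scontrol update NodeName=<node> State=DRAIN Reason=<reason>.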

Transitioning to Kubernetes was seamless, with a kubeconfig readily available from the UI. The cluster provided a solid foundation with standard components like Cilium for CNI, a CSI driver with ReadWriteMany support, node-problem-detector, kube-prometheus-stack, draino, a custom controller to turn off ACS, and the Nvidia Network Operator + GPU Operator, all managed via ArgoCD. This high-touch model typically involves shared Infrastructure-as-Code repos, ensuring customers get the exact tools they need; in our case, that meant adding cert-manager.

This test, however, resurfaced the most critical theme from our Slurm evaluation: software supply chain security. We discovered that the Nvidia GPU Operator chart was a minor version behind, leaving the cluster vulnerable to the same NVIDIAScape exploit. This highlighted a significant gap in their proactive security posture, particularly their absence from vendor embargo programs that provide advance notice of vulnerabilities. Once notified, the team coordinated a maintenance window and patched the vulnerability in under an hour. The finding also motivated them to formalize a security process that includes more frequent proactive updates, subscriptions to vulnerability databases, and steps toward joining Nvidia’s disclosure program.
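The kind of automated check we had in mind can be very small. The sketch below compares an installed version against the first fixed release for each component. The fixed versions listed (1.17.8 for nvidia-container-toolkit and 25.3.1 for the GPU Operator) are our understanding of Nvidia’s bulletin for CVE-2025-23266 and should be treated as assumptions to verify against the bulletin itself; the dictionary and function names are hypothetical.

```python
# Assumed first fixed releases for CVE-2025-23266; verify against NVIDIA's bulletin.
ASSUMED_FIRST_FIXED = {
    "nvidia-container-toolkit": "1.17.8",
    "gpu-operator": "25.3.1",
}


def parse_version(version: str) -> tuple[int, ...]:
    """Turn 'v25.3.0' or '1.17.7-1' into a tuple of ints for numeric comparison."""
    core = version.strip().lstrip("v").split("-")[0]  # drop 'v' prefix and package suffix
    return tuple(int(part) for part in core.split("."))


def is_vulnerable(component: str, installed: str) -> bool:
    """True if the installed version is older than the assumed first fixed release."""
    fixed = ASSUMED_FIRST_FIXED[component]
    return parse_version(installed) < parse_version(fixed)
```

Comparing integer tuples rather than strings matters here: as strings, "25.10.0" would sort before "25.3.1" and be flagged incorrectly.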

Overall, our experience with Fluidstack was strong. The platform was not perfect out of the box, but both Slurm and Kubernetes were in a perfectly usable state within hours of cluster handover, and the engineering team demonstrated an elite level of responsiveness and expertise. Issues that might take weeks or months to get addressed in a hyperscaler’s ticketing system were fixed in hours. If anyone demonstrated the “Forward Deployed Engineering” ethos during our testing, it was Fluidstack.
