ClusterMAX 2.0 Gold

Fluidstack

Strong performance across all categories with minor gaps. Generally wins deals at competitive pricing.

By Jordan Nanos, Daniel Nishball, Dylan Patel

Fluidstack Quick Stats

ClusterMAX Tier: Gold (4 / 5)
Source Rating Cycle: ClusterMAX 2.0
GPUs Offered: TPU
Slurm Support: Discussed in review
Kubernetes Support: Discussed in review
SOC 2 Mentioned: Not flagged
NCCL Benchmarks: In review
Last Updated: Nov 06, 2025


Fluidstack is the only cloud to debut in our Gold tier this round, and it has the most unusual business model of the group. Almost all of Fluidstack’s customer deployments involve a third-party datacenter provider. Fluidstack is effectively the hired gun that organizations go to in order to turn bronze-tier datacenter infrastructure into a gold-tier customer experience. Google has also gone into the market to secure colocation demand with Fluidstack as the operator of TeraWulf and Cipher sites, potentially for its TPUs. We explore why GCP is willing to “backstop” these deals and why GCP needs Fluidstack here.

This model is clearest with customers such as Meta, Poolside, Black Forest Labs, and an unnamed customer running in a TeraWulf datacenter in Buffalo that received a massive financial backstop from Google.

Source: Fluidstack

Our hands-on experience with Fluidstack was a live demonstration of their value proposition: a highly collaborative, deeply technical partnership that rapidly improves the platform based on expert feedback. While the initial cluster had rough edges, the speed and precision with which the Fluidstack team addressed every issue was unparalleled.

Our initial Slurm cluster came with pyxis and MPI support integrated into srun, and initial two-node nccl-tests showed performance within the expected range for large message sizes. However, we immediately hit a significant usability issue: the prolog.d script was so bloated with health checks that it took over a minute to schedule an interactive run on a single node. The script was running full single-node NCCL tests for NVLink and InfiniBand, plus host-to-device bandwidth checks, every time a job started.

When we pointed this out, the team immediately acknowledged it and committed to their roadmap plan of moving these active health checks off the job-start path so that they run in the background on idle nodes, which is standard practice among other top-tier providers.
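As a rough illustration of the "run active checks on idle nodes" pattern described above, the following is a minimal sketch of how a background job might pick its targets. It is not Fluidstack's implementation. The helper names `idle_nodes` and `list_idle_nodes` are hypothetical, and the sketch assumes `sinfo` is on the PATH and that only nodes in the plain `idle` state (no suffix such as `*` or `~`) should be tested.

```python
import subprocess


def idle_nodes(sinfo_output: str) -> list[str]:
    """Return unique node names whose compact Slurm state is exactly 'idle'.

    Expects one "<node> <state>" pair per line. States with a suffix
    (for example 'idle*' for a non-responding node) are skipped on purpose.
    """
    selected: dict[str, None] = {}
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        name, state = parts
        if state == "idle":
            selected[name] = None  # dict keeps first-seen order and dedupes
    return list(selected)


def list_idle_nodes() -> list[str]:
    """Query Slurm for idle nodes (hypothetical wrapper around sinfo)."""
    # -N: one line per node, -h: no header, %N: node name, %t: compact state
    result = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %t"],
        capture_output=True,
        text=True,
        check=True,
    )
    return idle_nodes(result.stdout)
```

A background service could call `list_idle_nodes()` on a timer and run the NCCL and bandwidth checks only on the nodes it returns, leaving the prolog with cheap sanity checks.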

This kicked off a rapid-fire feedback loop that defined our testing period:

Performance Tuning: We noted that the Nvidia HPC-X toolkit was missing from the base image, which is necessary for optimal NCCL performance at medium message sizes. While it was available within NGC containers, not all users leverage pyxis/enroot. Within 24 hours, the Fluidstack team had deployed HPC-X to the base image on our cluster and added it to their standard deployment pipeline for all customers.

Monitoring Dashboards: The Grafana dashboard was solid, but we identified missing graphs for NVLink Rx/Tx utilization and incorrect DCGM metrics for tensor core pipes (they were capturing SIMT units instead of tensor core-specific pipes like DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE). The team implemented the correct DCGM metrics the following day.
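One way to catch this kind of dashboard gap before a customer does is to check an exporter scrape for the metrics the dashboard depends on. The sketch below is only an illustration under stated assumptions: it assumes metrics arrive in the Prometheus text exposition format (as a DCGM exporter would serve them), and the `REQUIRED_METRICS` list is our own choice of DCGM profiling field names, not a list taken from Fluidstack's configuration. The function names are hypothetical.

```python
# Assumed set of metrics a dashboard like the one described would need.
REQUIRED_METRICS = (
    "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
    "DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE",
    "DCGM_FI_PROF_NVLINK_RX_BYTES",
    "DCGM_FI_PROF_NVLINK_TX_BYTES",
)


def exported_metric_names(scrape_text: str) -> set[str]:
    """Collect metric names from Prometheus text-format sample lines."""
    names: set[str] = set()
    for raw in scrape_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or whitespace (value).
        tokens = line.split("{", 1)[0].split()
        if tokens:
            names.add(tokens[0])
    return names


def missing_metrics(scrape_text: str, required=REQUIRED_METRICS) -> list[str]:
    """Return the required metric names that the scrape does not contain."""
    present = exported_metric_names(scrape_text)
    return [name for name in required if name not in present]
```

Running `missing_metrics` against a saved scrape in CI would flag a missing NVLink or tensor-pipe series as soon as the exporter configuration drifts.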

Security Posture: This was the most critical finding. We discovered the cluster was running a version of the nvidia-container-toolkit vulnerable to NVIDIAScape (CVE-2025-23266). The team patched the vulnerability on our cluster within minutes of us reporting it. While the immediate fix was impressive, our feedback focused on the larger operational need for automated dependency scanning and a proactive security process, such as enrolling in Nvidia’s security embargo program. This prompted a healthy discussion on their software supply chain security strategy.

Passive Health Checks: We found that DCGM’s background health checks were not enabled. By injecting PCIe replay errors (dcgmi test --inject --gpuid 0 -f 202), we confirmed that the node would not automatically drain. Our recommendation was to actively poll dcgmi health -c and configure Node Health Check (NHC) to drain nodes based on specific thresholds (e.g., >8 PCIe replays per minute or >100 NVLink CRC errors per second). The team immediately added this to their near-term roadmap.
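The threshold logic in that recommendation can be sketched as a small rate check over two counter samples. This is a minimal illustration, not the configuration we or Fluidstack deployed: the `CounterSample` type and `should_drain` function are hypothetical, the counters are assumed to be cumulative values polled from DCGM, and only the two example thresholds quoted above are encoded.

```python
from dataclasses import dataclass

# Example thresholds from the recommendation above.
PCIE_REPLAYS_PER_MIN = 8.0
NVLINK_CRC_PER_SEC = 100.0


@dataclass(frozen=True)
class CounterSample:
    """One poll of cumulative error counters for a GPU (hypothetical type)."""
    timestamp_s: float
    pcie_replays: int
    nvlink_crc_errors: int


def should_drain(prev: CounterSample, curr: CounterSample) -> tuple[bool, str]:
    """Decide whether error rates between two polls exceed a drain threshold."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")

    # Convert counter deltas to rates. A counter reset gives a negative
    # delta, which simply never exceeds a threshold.
    replays_per_min = (curr.pcie_replays - prev.pcie_replays) * 60.0 / elapsed
    crc_per_sec = (curr.nvlink_crc_errors - prev.nvlink_crc_errors) / elapsed

    if replays_per_min > PCIE_REPLAYS_PER_MIN:
        return True, f"PCIe replays at {replays_per_min:.1f}/min"
    if crc_per_sec > NVLINK_CRC_PER_SEC:
        return True, f"NVLink CRC errors at {crc_per_sec:.1f}/s"
    return False, "within thresholds"
```

The returned reason string is meant to be passed along as the drain reason, so an operator can see why the node left the pool.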

Transitioning to Kubernetes was seamless, with a kubeconfig readily available from the UI. The cluster provided a solid foundation with standard components like Cilium for CNI, a CSI driver with ReadWriteMany support, node-problem-detector, kube-prometheus-stack, draino, a custom controller to turn off ACS, and the Nvidia Network Operator + GPU Operator, all managed via ArgoCD. This high-touch model typically involves shared Infrastructure-as-Code repos, ensuring customers get the exact tools they need, such as adding cert-manager in our case.

This test, however, resurfaced the most critical theme from our Slurm evaluation: software supply chain security. We discovered that the Nvidia GPU Operator chart was a minor version behind, leaving the cluster vulnerable to the same NVIDIAScape exploit. This highlighted a significant gap in their proactive security posture, particularly their absence from vendor embargo programs that provide advance notice of vulnerabilities. Once notified, the team coordinated a maintenance window and patched the vulnerability in under an hour. The finding also motivated them to formalize a security process that includes more frequent proactive updates, subscriptions to vulnerability databases, and steps toward joining Nvidia’s disclosure program.
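The automated dependency scanning we asked for can start as something very small: compare installed component versions against the first fixed release for a known CVE. The sketch below is an assumption-laden illustration rather than a real scanner. The `ASSUMED_FIRST_FIXED` versions are our recollection of NVIDIA's bulletin for CVE-2025-23266 and must be confirmed against the bulletin before use; the function names and component keys are hypothetical.

```python
import re

# ASSUMPTION: first fixed releases for CVE-2025-23266 as we recall them.
# Verify against NVIDIA's security bulletin before relying on these values.
ASSUMED_FIRST_FIXED = {
    "nvidia-container-toolkit": (1, 17, 8),
    "gpu-operator": (25, 3, 1),
}


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first dotted numeric version from a string like 'v25.3.0'."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_vulnerable(component: str, installed: str) -> bool:
    """True if the installed version is older than the assumed fixed release."""
    return parse_version(installed) < ASSUMED_FIRST_FIXED[component]
```

For example, `is_vulnerable("gpu-operator", "v25.3.0")` returns `True` under these assumed values. A production process would pull the fixed versions from a vulnerability feed instead of a hard-coded table.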

Overall, our experience with Fluidstack was strong. The platform was not perfect out of the box, but Slurm and Kubernetes were both in a perfectly usable state within hours of cluster handover, and the engineering team demonstrated an elite level of responsiveness and expertise. Issues that might take weeks or months to get addressed in a hyperscaler’s ticketing system were fixed in hours. If there is anyone that demonstrated the “Forward Deployed Engineering” ethos during our testing, it was Fluidstack.

Fluidstack GPU Cloud FAQ

What tier is Fluidstack in ClusterMAX?

Fluidstack is rated Gold tier in the ClusterMAX 2.0 GPU cloud rating system by SemiAnalysis (with the ClusterMAX 2.1 Update applied April 2026). Gold is a top-tier rating in the ClusterMAX rating system. It denotes strong performance across all categories with minor gaps, and Gold providers generally win deals at competitive pricing.

Is Fluidstack SOC 2 Type II certified?

Fluidstack's ClusterMAX review does not flag a SOC 2 Type II attestation as confirmed. SemiAnalysis treats SOC 2 Type II as a baseline expectation for any GPU cloud serving enterprise or regulated AI workloads — see the ClusterMAX criteria page for the full security baseline.

Does Fluidstack support Slurm?

Yes. The Fluidstack review on ClusterMAX covers their Slurm offering — including whether it is managed, self-managed, or runs as Slurm-on-Kubernetes (SUNK, Soperator, or Slinky). See the Orchestration section of the review for the specific Slurm flavor offered and SemiAnalysis' hands-on experience.

Does Fluidstack support Kubernetes?

Yes. The Fluidstack review on ClusterMAX covers their Kubernetes offering — whether managed Kubernetes is provided, what control plane is used, and how GPU operator, networking, and storage integrate. See the Orchestration and Storage sections of the review for details.

What GPUs does Fluidstack offer?

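The quick stats for this review list only TPU, which refers to the Google TPUs that may be deployed at the TeraWulf and Cipher sites rather than to an NVIDIA or AMD GPU SKU. The SemiAnalysis hands-on testing ran on NVIDIA GPUs, but the review text does not name the SKU. Specific inventory, region availability, and on-demand vs reserved access are detailed in the Fluidstack ClusterMAX review.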

What is the NCCL all-reduce performance on Fluidstack?

The Fluidstack review on ClusterMAX includes hands-on NCCL all-reduce results from SemiAnalysis testing. NCCL bandwidth (in GB/s) is one of the most important indicators of training cluster health — see the Networking section of the review for the specific numbers and how they compare to the ClusterMAX cohort.

How does Fluidstack compare to CoreWeave?

CoreWeave is the only ClusterMAX Platinum provider, while Fluidstack is rated Gold. The Fluidstack review documents the specific gaps versus CoreWeave across the 10 ClusterMAX criteria (Security, Lifecycle, Orchestration, Storage, Networking, Reliability, Monitoring, Pricing, Partnerships, Availability). See the Fluidstack review body and the ClusterMAX /criteria page for the full comparison framework.

Is Fluidstack recommended for LLM training?

Fluidstack is in a ClusterMAX tier that SemiAnalysis directly recommends for production GPU workloads (Platinum / Gold / Silver / Bronze). The Fluidstack review details which workload profiles fit best — large-scale pretraining, fine-tuning, on-demand experimentation, or inference — based on hands-on cluster testing.
