You would think Google would set the standard. From JAX to the transformer, search to maps, Waymo to YouTube. We use Gmail with Gcal to book a Gmeet to get our work done. We have to pull CoreWeave containers from gcr to run on their Kubernetes cluster. Rumors abound about the TPU.
Since the first version of this article, Google has addressed some of the issues holding it back, most notably by deciding to go with standard CX-7 NICs for its H200 (a3-ultra) and B200 (a4) instances, as well as its GB200 NVL72 instances (a4x).
Our testing began by provisioning clusters for both slurm and Kubernetes (GKE). The managed slurm “Cluster Director” offering is still in preview, though at Google “preview” also means that key customers have had it for several months, and things work well. The architecture follows a standard managed service model where the slurmctld is handled by GCP, leaving users with access to the login and worker nodes. We appreciated the default setup, including scripts for testing network performance via nccl-tests and storage performance via FIO, pre-staged in a GCS bucket for immediate use. For storage, GCP recommends Filestore for home directories, which provides enterprise features like snapshots and backups, while managed Lustre is positioned for large-scale, high-performance scratch space. Provisioning our Lustre filesystem was straightforward but not instant, taking roughly 40 minutes to complete.
Interestingly, the cluster also demonstrated self-healing capabilities. When we intentionally deleted a worker node to clean things up and move from slurm to GKE testing, the Cluster Director service automatically recreated it within minutes to maintain the desired capacity. We had to delete the whole cluster from the Cluster Director screen to make the deletion stick. Interesting demo.
Source: Google Cloud. Deleting a SLURM Cluster on GKE. This UI sparks joy
However, the GKE-based solution is where GCP truly shines and feels years ahead of all Neocloud competition but CoreWeave. Using the “Cluster Toolkit,” the initial setup was streamlined. Most impressively, the cluster arrived with Kueue and JobSet pre-installed. This immediate, out-of-the-box support for modern, Kubernetes-native scheduling for batch workloads is a significant differentiator. While competitors are still building their own operators or relying on Slurm-on-Kubernetes projects, GCP provides a mature, fully integrated solution.
Out-of-the-box performance was decent. Running nccl-tests with a standard JobSet YAML, we immediately achieved the expected bandwidth for allgather, allreduce, and alltoall operations without any tuning. However, it is worth noting that our experience is not representative of what others are seeing at scale.
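For readers less familiar with how nccl-tests reports "bandwidth", the sketch below shows the conversion from measured time to bus bandwidth for the three collectives mentioned above. The correction factors follow the convention documented by the nccl-tests project; the function name and the example numbers are ours and purely illustrative, not measurements from our GCP runs.

```python
def bus_bandwidth_gbps(collective: str, size_bytes: float, time_s: float, n_ranks: int) -> float:
    """Convert a measured collective time into nccl-tests style bus bandwidth (GB/s).

    size_bytes is the total data size the benchmark reports for the operation.
    """
    # Per-collective correction factors, so results are comparable across rank counts.
    factors = {
        "allreduce": 2 * (n_ranks - 1) / n_ranks,
        "allgather": (n_ranks - 1) / n_ranks,
        "alltoall": (n_ranks - 1) / n_ranks,
    }
    if collective not in factors:
        raise ValueError(f"unsupported collective: {collective}")
    alg_bw = size_bytes / time_s / 1e9  # algorithm bandwidth in GB/s
    return alg_bw * factors[collective]


if __name__ == "__main__":
    # Illustrative numbers only: an 8 GB allreduce across 16 ranks finishing in 50 ms.
    print(round(bus_bandwidth_gbps("allreduce", 8e9, 0.05, 16), 1))
```

Bus bandwidth is the number to compare against the per-GPU NIC line rate, which is why it is the figure usually quoted when judging whether a cluster hits "expected" performance.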
Currently, on GCP gIB machines, which use Nvidia CX-7 NICs (such as the a3-ultra H200, a4 B200, and a4x GB200), users must use the gIB plugin to get good performance. This means adding container mounts and extra lines to an sbatch script or JobSet manifest, such as --container-mounts="/usr/local/gib", export NCCL_NET=gIB, source /usr/local/gib/scripts/set_nccl_env.sh, etc. This is a poor UX, and it leads even advanced users to see worse performance than they get on other providers. You effectively need a GCP engineer on hand to get the expected performance at scale, and it is still an open question for us whether alltoall collectives work as expected on this scale-out network.
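To make the extra steps concrete, here is a minimal sketch that renders an sbatch script containing the three gIB-specific lines quoted above. It assumes a pyxis-style srun (the --container-image and --container-mounts flags); the image name, node counts, and benchmark command are hypothetical placeholders, and the exact set of lines Google recommends may differ from this.

```python
import shlex

GIB_ROOT = "/usr/local/gib"  # host path named in the text above


def render_sbatch(image: str, nodes: int, gpus_per_node: int, command: str) -> str:
    """Render an sbatch script that wires the gIB plugin into a containerized job."""
    # Source Google's NCCL environment script inside the container, then run the workload.
    inner = f"source {GIB_ROOT}/scripts/set_nccl_env.sh && {command}"
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        "",
        "export NCCL_NET=gIB",
        "",
        "srun \\",
        f"  --container-image={shlex.quote(image)} \\",
        f'  --container-mounts="{GIB_ROOT}" \\',
        f"  bash -c {shlex.quote(inner)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical image and benchmark invocation, for illustration only.
    print(render_sbatch(
        image="example.com/team/nccl-tests:latest",
        nodes=2,
        gpus_per_node=8,
        command="all_reduce_perf -b 8 -e 8G -f 2 -g 1",
    ))
```

Every one of these lines is something a user can forget or get subtly wrong, which is the core of the UX complaint.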
Our suggestion to Google to improve this UX is to have Nvidia bake the gIB plugin binaries directly into all NGC container images, and include logic during container init to automatically select the gIB plugin when on compatible GCP machines. This would remove the need for users to manually mount it into their containers. There is a way to detect if running on a GCP machine with gIB, either through vendor and device IDs, or by checking /sys/bus/pci/devices/*. Google and Nvidia have said that they have started to look into this and have plans on how to improve it.
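As a rough illustration of what such container-init logic could look like, the sketch below scans sysfs for PCI vendor and device IDs. Everything specific here is an assumption on our part rather than Google's or Nvidia's actual detection method: we assume 0x15b3 (Mellanox) and 0x1021 identify a ConnectX-7 as seen by the guest, that the DMI product name contains "Google" on a GCE VM, and that the plugin lives under /usr/local/gib. NICs exposed as virtual functions may report a different device ID.

```python
from pathlib import Path

MELLANOX_VENDOR_ID = "0x15b3"  # assumption: PCI vendor ID for Mellanox/Nvidia NICs
CX7_DEVICE_ID = "0x1021"       # assumption: PCI device ID for ConnectX-7


def count_cx7_nics(sysfs_root: str = "/sys") -> int:
    """Count PCI devices whose vendor/device IDs match the assumed CX-7 IDs."""
    devices_dir = Path(sysfs_root, "bus/pci/devices")
    if not devices_dir.is_dir():
        return 0
    count = 0
    for dev in devices_dir.iterdir():
        try:
            vendor = (dev / "vendor").read_text().strip().lower()
            device = (dev / "device").read_text().strip().lower()
        except OSError:
            continue  # skip entries we cannot read
        if vendor == MELLANOX_VENDOR_ID and device == CX7_DEVICE_ID:
            count += 1
    return count


def is_gce_vm(sysfs_root: str = "/sys") -> bool:
    """Heuristic: check the DMI product name for a Google identifier."""
    try:
        name = Path(sysfs_root, "class/dmi/id/product_name").read_text()
    except OSError:
        return False
    return "Google" in name


def should_use_gib(sysfs_root: str = "/sys", gib_root: str = "/usr/local/gib") -> bool:
    """Select gIB only on a GCE VM with CX-7 NICs and the plugin directory present."""
    return (
        is_gce_vm(sysfs_root)
        and count_cx7_nics(sysfs_root) > 0
        and Path(gib_root).is_dir()
    )


if __name__ == "__main__":
    print("use gIB:", should_use_gib())
```

An entrypoint script could call something like should_use_gib() and export NCCL_NET=gIB only when it returns true, leaving behavior on other providers unchanged.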
On a more advanced networking front, GCP provides a crucial capability for large-scale training: NCCL straggler detection, powered by their CoMMA (Collective Monitoring and Management Agent). In distributed jobs with hundreds or thousands of GPUs, a single underperforming node or “straggler” can bottleneck the entire collective, and locating that straggler is a significant challenge. CoMMA attempts to address this with an eBPF-based agent that non-intrusively traces NCCL operations. By monitoring the progress of collectives like AllReduce, AllGather and AlltoAll, it claims it can identify the specific ranks that are lagging. When a straggler is detected, CoMMA emits a detailed JSON payload to Cloud Logging, identifying not only the slow ranks but also the ranks that are proceeding normally. Customer feedback about CoMMA has been mixed.
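The general idea of straggler detection can be sketched in a few lines. This is not CoMMA's algorithm or its log schema, neither of which we have seen in detail. It is a simplified illustration that assumes we already have, for one collective, each rank's arrival delay relative to the first rank, and it flags ranks that arrive much later than the median. The function name, the slack threshold, and the JSON field names are all hypothetical.

```python
import json
from statistics import median


def find_stragglers(arrival_delay_ms: dict[int, float], slack_ms: float = 50.0) -> dict:
    """Split ranks into slow and normal based on how late they joined a collective.

    arrival_delay_ms maps rank -> delay (ms) relative to the earliest rank.
    A rank is flagged when its delay exceeds the median delay by more than slack_ms.
    """
    typical = median(arrival_delay_ms.values())
    slow = sorted(r for r, d in arrival_delay_ms.items() if d > typical + slack_ms)
    normal = sorted(r for r in arrival_delay_ms if r not in slow)
    # Hypothetical payload shape, loosely mirroring "slow ranks plus normal ranks".
    return {"slow_ranks": slow, "normal_ranks": normal, "median_delay_ms": typical}


if __name__ == "__main__":
    delays = {0: 1.0, 1: 2.0, 2: 1.5, 3: 250.0}  # rank 3 arrives very late
    print(json.dumps(find_stragglers(delays)))
```

The hard part in production is not this comparison but collecting trustworthy per-rank timings without perturbing the job, which is where an eBPF-based tracer comes in.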
Storage performance was robust and capacity was flexible. GCP’s tooling automatically prepared an FIO benchmark job, which we ran to test I/O patterns for scratch writes, training data reads, and checkpointing, all of which delivered solid results for both the home directory and Lustre mounts. Google also has a marketplace that includes solutions like Weka, for customers who prefer to deploy a third-party filesystem. Of course, GCS is available on-demand too, where many enterprises already have their data stored for long-term retention.
Of course, no cloud experience is without its complexities. The primary hurdle we encountered was a classic cloud IAM footgun. When attempting to run a torchtitan training job, our pods were denied access to the dataset in a GCS bucket. This required diagnosing the node pool’s service account and running a series of gcloud and gsutil commands to grant the necessary permissions. While this is a common workflow for experienced GCP users, it’s a trade-off that working with a hyperscaler presents.
GCP’s focus on production AI workloads is evident in the GKE Inference Gateway now being GA. Our evaluation of the GKE Inference Gateway focused on two features: prefill-decode (PD) disaggregation and prefix-aware routing. We found that PD disaggregation, advertised with a potential 60% throughput improvement, is not integrated into the standard GKE Quickstart profiles or documentation. It currently exists as an “advanced optimization” that is a “constant work in progress.”
In contrast, GKE’s implementation of prefix-aware routing is mature and well-documented. Unlike common patterns that require a user-managed proxy to route requests to inference engines like vLLM or SGLang for KV cache reuse, GKE integrates this routing logic directly into its managed L7 load balancer. This design eliminates a user-managed component from the serving stack, reducing operational complexity. GKE provides a robust inference networking layer, but there is a clear distinction between its stable, integrated features like managed routing and its not-quite-documented capabilities like PD disaggregation with llm-d.
For monitoring, Google integrates DCGM metrics right into the main cluster dashboard. This is a great UX when compared to a separate Grafana instance, with things like authN and authZ being wired up automatically to the same intuitive console where the cluster was deployed. This also allows for some customization. We suggested adding a TFLOP estimator via DCGM_FI_PROF_PIPE_TENSOR_ACTIVE * peak_fp8_flops. For example, for H200, peak_fp8_flops would be 1979 TFLOPS.
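The estimator we suggested is a single multiplication, sketched below. It assumes DCGM_FI_PROF_PIPE_TENSOR_ACTIVE is reported as a fraction between 0 and 1 and uses the 1979 TFLOPS H200 figure from the text as the peak; the function and table names are ours. Because the tensor-pipe metric does not distinguish between precisions, treat the result as an upper-bound estimate for FP8 workloads rather than a measured FLOP count.

```python
# Peak dense FP8 throughput per GPU model, in TFLOPS (H200 value from the text).
PEAK_FP8_TFLOPS = {"H200": 1979.0}


def estimate_tflops(tensor_active: float, gpu: str = "H200") -> float:
    """Estimate achieved TFLOPS as tensor-pipe activity times peak FP8 throughput.

    tensor_active is the DCGM_FI_PROF_PIPE_TENSOR_ACTIVE reading, assumed in [0, 1].
    """
    if not 0.0 <= tensor_active <= 1.0:
        raise ValueError("tensor_active must be a fraction between 0 and 1")
    return tensor_active * PEAK_FP8_TFLOPS[gpu]


if __name__ == "__main__":
    # A GPU whose tensor pipe is active half the time.
    print(estimate_tflops(0.5))  # 989.5
```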
On health checks, Google’s slurm offering is missing a background health check program. Currently, they rely on a prolog health check that runs some dcgm tests but haven’t yet integrated it as a NodeHealthCheck program in slurm for monitoring purposes during a batch job. By contrast, GKE has an option for users to configure AutoRepair, and an API for “repair and replace” functions where users can request a replace. This is a strong reactive offering, but requires manual setup from the customer’s cluster admin, and does not get to the level of proactivity that Gold and Platinum tier Neoclouds exhibit with their health checks. We encourage Google to follow some of their competitors, and treat failures with the perspective that if a customer discovers it first, something is wrong.
The experience we have had working with Google’s engineering team is exceptional, but it comes at a steep price. Access to premium support generally requires a multi-million dollar compute contract and a 3% premium (on purchases of at least $1M), creating a high barrier to entry and a clear distinction when compared directly to Neoclouds.
Going forward, we have concerns about NVL72 rack-scale architectures in Google datacenters. Google and AWS have both gone with NVL36x2 instead of a true NVL72 rack due to power, cooling, networking, and reliability concerns. The result is supposed to be similar to NVL72, with the same scale-up domain of 72 GPUs as a standard NVL72 rack, but due to the cross-rack NVLink ACC cables it is a different topology.
Source: SemiAnalysis GB200 hardware arch
In practice, users of GCP or AWS NVL36x2 have been waiting weeks or months longer to get stable firmware and to get the rack to the point where they can run basic collectives.
Source: An NVL36x2 engineering build, via Google on Twitter
In conclusion, Google aims to set the bar and command a pricing premium, but wrinkles like the gIB workflow, the lack of a GA managed slurm service, reported issues with NVL72 rack-scale stability, and unclear SLAs and SLOs make the current pricing difficult to justify, especially for the legacy H100 instances that are still so popular amongst users. However, as the industry moves beyond H100s, Google’s roadmap is clearly strong. Once they roll out their B200 and GB200 instances at scale and push some roadmap items to GA, they will be in a powerful position to justify that premium. Google is on the fast track to Gold tier or higher.