# ClusterMAX Cloud Reviews — Full Content

> By SemiAnalysis (Jordan Nanos, Daniel Nishball, Dylan Patel)

This file contains the full text of all 85 GPU cloud reviews from ClusterMAX 2.0 + 2.1 Update (https://www.clustermax.ai/cloudreview). It is intended for consumption by large language models and AI assistants.

---

# CoreWeave (Platinum)

> CoreWeave earns a ClusterMAX 2.0 Platinum rating from SemiAnalysis. Since our last article, CoreWeave has made some significant announcements. They raised $1.5B in an IPO on the NASDAQ, trading under $CRWV, and their share price is up over 200% in 6 months. They have announced three expansions with…

- **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel
- **Tier**: Platinum
- **Published**: 2025-11-06 (Nov 06, 2025)
- **Last updated**: 2025-11-06 (Nov 06, 2025)
- **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard
- **URL**: https://www.clustermax.ai/cloudreview/coreweave
- **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/coreweave/llm.txt
- **Topics**: CoreWeave review, CoreWeave GPU cloud, CoreWeave ClusterMAX rating, CoreWeave Platinum, Platinum tier GPU cloud, GPU cloud review, neocloud review, CoreWeave GB200 NVL72, CoreWeave GB200, CoreWeave GB300, CoreWeave B200, CoreWeave H100, GB200 NVL72 cloud, GB200 cloud, GB300 cloud, B200 cloud, H100 cloud, InfiniBand, RoCE, Kubernetes, Slurm, NCCL, DCGM, bare metal, managed Slurm, ClusterMAX 2.0, SemiAnalysis

Since our last article, CoreWeave has made some significant announcements. They raised $1.5B in an IPO on the NASDAQ, trading under $CRWV, and their share price is up over 200% in 6 months. They have announced three expansions with OpenAI: $11.9B in March, $4B in May, and $6.5B in September, bringing the total commitment to $22.4B. These deals are targeting training. CoreWeave has also landed Meta as a new customer in September, signing a $14.2B deal over 6 years to 2031.
They have also announced an all-stock acquisition of Core Scientific in July for $9B, which represents a 1.3GW expansion of datacenter footprint. They also committed £1.5B to the UK, and $6B in Lancaster, Pennsylvania. Note: for a blow-by-blow tracker of CoreWeave bringing capacity online, readers can see our datacenter industry model.

They acquired Weights & Biases in May, and OpenPipe in September. We will discuss the W&B integration below. They also launched CoreWeave Ventures, a fund to invest in AI companies. We believe this is in direct response to strategic investment practices such as AWS Activate (up to $300k in credits), Microsoft for Startups (up to $150k in credits), Google for Startups Cloud Program (up to $350k in credits), and the NVIDIA Inception + NVentures program.

CoreWeave has announced some of the world’s first GB200 NVL72 and GB300 NVL72 deployments, as well as some RTX 6000 Pro Blackwell instances, setting the stage for future diversification with Rubin CPX.

In terms of our testing, CoreWeave’s clusters meet all criteria, and set the bar for other Neoclouds to follow. It is for this reason that we are aware of multiple examples where CoreWeave is able to command a higher price for managed slurm or Kubernetes clusters (by roughly 10-15%, per GPU-hr) vs their direct competition such as Nebius, Fluidstack, Crusoe, Lambda, and Together.ai. Indeed, CoreWeave’s pricing is closer to the pricing of the big 4 hyperscalers. The main challenge that we perceive for CoreWeave going forward is to continue innovating and differentiating vs their competition, so as to maintain this pricing power. We believe they will be successful doing so for the GB200 and GB300 NVL72 generation.

In this section we will talk about what is new at CoreWeave, and how they keep setting the bar for others in the industry to follow.

* Slurm-on-Kubernetes
* Bare Metal Provisioning
* Use of DPUs
* Security
* Monitoring and Health Checks
* W&B Integration
* Storage
* TCO
* Services
* Customers

#### Slurm-on-Kubernetes

At this point CoreWeave has consolidated to three offerings:

1. CoreWeave Managed SUNK (Slurm on Kubernetes)
2. CoreWeave Managed Kubernetes
3. CoreWeave Bare Metal without any managed scheduler

Since the original release of SUNK in October of 2023, uptake has been strong for new clusters, which has led to the deprecation of their slurm-on-bare metal service. All new customers that prefer slurm are pushed to SUNK. Earlier in this article, we discussed the slurm-on-Kubernetes trend at large, where SUNK is the most mature option vs Soperator and Slinky. CoreWeave developed SUNK from scratch, controls the roadmap, and has built deep integration with their underlying Kubernetes runtime CKS, as well as their monitoring dashboards (mainly Grafana, branded as CoreWeave Observe™), health check system, and provisioning system (branded as CoreWeave Mission Control).

In addition, CoreWeave has gone beyond open-source slurm to develop their own custom fork of the popular job scheduler. Specifically, in open source Slinky, there are memory leaks in the REST API of the slurm controller which lead to issues if, for example, some user is trying to queue 100,000 jobs. The way that slurm works at scale is through the concept of priorities. In other words, there is no way for a big pretraining job with top priority to be auto-scaled down, and there should always be spare nodes available to be added to this job in the event of an interruption. But while that job is running, users can tag smaller research/experiment slurm jobs as preemptible, and run them on the medium or low-priority partition. Functionally, this means that the slurm scheduler hands off information to kubernetes, which labels the kubernetes job to associate it to the partition.
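
The priority behaviour described above can be illustrated with a small, self-contained Python sketch. This is not SUNK or Slurm code: the `Job` class, the partition names, and the `schedule` function are hypothetical, and the greedy placement is a simplification of what a real scheduler does. It only shows the intended outcome, where the top-priority job keeps its nodes and low-priority sweeps absorb whatever is left.

```python
from dataclasses import dataclass

# Hypothetical partition priorities; a higher number wins.
PRIORITY = {"high": 2, "medium": 1, "low": 0}


@dataclass
class Job:
    name: str
    partition: str  # "high", "medium" or "low"
    nodes: int      # number of nodes requested


def schedule(jobs, total_nodes):
    """Greedy placement by partition priority.

    Returns (running, displaced) as lists of job names. Jobs are considered
    from highest to lowest priority; a job that no longer fits in the
    remaining nodes is displaced (preempted or left pending).
    """
    running, displaced = [], []
    free = total_nodes
    # sorted() is stable, so jobs in the same partition keep submission order.
    for job in sorted(jobs, key=lambda j: -PRIORITY[j.partition]):
        if job.nodes <= free:
            running.append(job.name)
            free -= job.nodes
        else:
            displaced.append(job.name)
    return running, displaced


jobs = [
    Job("sweep-lr", "low", 4),
    Job("pretrain", "high", 12),
    Job("sweep-data-mix", "low", 4),
]

# Healthy cluster with 20 nodes: the pretraining job and both sweeps run.
print(schedule(jobs, 20))
# Four nodes fail: pretraining keeps its 12 nodes and one sweep is displaced.
print(schedule(jobs, 16))
```

In this toy model the low-priority sweeps act as the buffer: when nodes disappear, the sweeps give up capacity before the pretraining job is touched.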
In effect, frontier labs have a giant backlog of low priority sweeps that can always take up extra nodes: trying out the latest learning rate, data mix etc. to see if it improves their research. As a result, CoreWeave re-wrote the logic in slurm’s REST API in Go, and is now using an RPC-based login pod controller for SUNK that is more performant at scale.

Interestingly, we are aware of direct licensing deals that CoreWeave has done with end users who want to run SUNK on managed Kubernetes clusters outside of CoreWeave. While we don’t believe SUNK licensing is a meaningful revenue driver for CoreWeave, it is an indication of the quality of the customer experience when using SUNK and a testament to their engineering effort.

#### Bare Metal Provisioning

A significant difference between CoreWeave and other cloud providers is their use of bare metal machines for both control plane and worker node services. Since basically all CoreWeave customers use whole 8-way machines in standard HGX (or 4-way machines in NVL72 racks), there is no need to virtualize a machine into multiple 1, 2, or 4-way GPU instances. However, other providers like Nebius and Crusoe who also don’t split up GPU machines continue to use kubevirt and cloud-hypervisor respectively in order to realize other benefits of VMs: shared block storage (resizing, quick provisioning, PXE boots, backup, clone, restore, etc.) and network isolation.

Since VMs are easier/quicker to spin up and down than bare metal machines (i.e. the underlying OS doesn’t need to change between tenants) CoreWeave has a challenge to address: how to quickly replace and repair a broken machine in a large cluster. To address this, CoreWeave has developed the Fleet Lifecycle Controller and Node Lifecycle Controller which are used by their FleetOps and CloudOps teams to provision machines through their service CoreWeave Mission Control.
This custom stack is actually using a CRD called a NodeSet on kubernetes that defines bare metal nodes as a kubernetes resource before they are even online. Furthermore, CoreWeave has developed custom operators, similar to a mix of a DaemonSet and ReplicaSet, for functions like idle checking and debugging. This customization extends to Protected Rolling Updates, which are aware of the slurm state and wait for nodes to drain before rolling out updated pods. Effectively, nodes that have been repaired after a hardware failure sit in a queue, waiting to be added to a logical cluster on the multi-tenant backend network fabric.

#### Use of DPUs

Due to the use of bare metal provisioning, CoreWeave must contend with the same challenges as AWS with Nitro or Azure with Boost. In other words, it must implement secure multi-tenant isolation for both the frontend and backend networks. It is important to note that these tenancy challenges exist even in the scenario where a customer has rented an entire datacenter in a wholesale bare metal manner. Tenants still hire and fire interns, consultants, partner companies, academic collaborators, and have general interest in implementing isolation between groups of users on the same underlying hardware.

As a result, CoreWeave uses Nvidia BlueField DPUs on every node to offload functions traditionally handled by a hypervisor on the host CPU (VPCs, encryption, network isolation, NAT gateways, etc.). Using a distributed NAT gateway and distributed storage gateway architecture eliminates a common central performance choke point. AI workloads are “bursty” since individual research jobs or autoscaling inference endpoints can randomly start pulling massive model weight files simultaneously. Switching from centralized to distributed gateway services on DPUs guarantees line-rate performance to WAN or storage.
On Kubernetes, this actually gets implemented via CRDs and custom controllers, which allows bare metal nodes to move between tenants while preserving network isolation and policy. The DPU becomes the enforced boundary, not the hypervisor. Effectively, the API for programming this layer is exposed to CoreWeave’s team via `kubectl get dpuconfigurations.nimbus.infra.coreweave.com`.

On the backend network, InfiniBand network isolation is implemented by changing PKeys (Partition Keys) per tenant, which is the hardware-enforced mechanism similar to an Ethernet VLAN. CoreWeave is one of the only providers to offer SHARP, which can significantly improve performance for certain collectives. However, for the highest security customers, CoreWeave does not allow multi-tenant InfiniBand with SHARP enabled.

[](https://substackcdn.com/image/fetch/$s_!Qhje!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64268219-bec7-4ebe-91be-5e7fbc0a293c_935x534.png)Source: CoreWeave

#### Security

When it comes to security, CoreWeave is the only Neocloud with a hyperscaler mentality. This means zero-trust policies, strict isolation between tenants, and continuous audits. For example, while some other Neoclouds provide direct access to server BMCs, CoreWeave treats the BMC, [which is a huge attack surface](https://developer.nvidia.com/blog/analyzing-baseboard-management-controllers-to-secure-data-center-infrastructure/), with extreme caution.

Host-to-BMC Access is Disabled: Both the KCS (Keyboard Controller Style) and NIC (RNDIS Ethernet over USB) interfaces from the host OS to the BMC are disabled. Effectively, the BMC management network is isolated to only communicate with the control plane (N/S side), with Layer 2 Isolation enforced to prevent lateral BMC-to-BMC movement (E/W side). For very large customers, BMC access is via a dynamic jumpbox setup that utilizes the dedicated BMC network.
An ACL on the DPU is updated by the Fleet Lifecycle Controller via NIMBUS and ensures they can only access BMC IPs in their specific tenant.

Another critical decision is avoiding scenarios where multiple customers run on the same machine. Effectively, customers can attack their own machine as much as they want, as seen in the container escape scenario, but if a node goes down for repair or is moved between tenants, it gets PXE booted to a clean state before another tenant can run on it. This is handled by the Fleet Lifecycle Controller (FLCC) and Node Lifecycle Controller (NLCC).

In terms of container escapes, future exploits are expected in upstream nightly branches from projects like pytorch, transformers, vllm, sglang, etc. and CoreWeave is in the process of switching to using ChainGuard images as the basis for all customer images. We’ve seen a common trend in this space following Broadcom’s acquisition of VMware, and the subsequent price increase to Bitnami.

Notably, CoreWeave’s container images have become a standard for many other neoclouds: from login pods to nccl-test examples, many other providers are building FROM the coreweave image, or just providing scripts that pull the image from CoreWeave’s gcr registry directly.

More on security:

* Defense-in-Depth and Risk Modeling: Security is built on a comprehensive threat model that drives a defense-in-depth strategy, informing a secure-by-default application development lifecycle and deep runtime and infrastructure hardening.
* Application and Code Security: The Application Security Maturity Framework mandates that all code changes undergo secret scanning, SAST, SCA, and DAST with strict remediation SLAs. Risk is blocked pre-production via pre-commit hooks, policy-as-code, CI/CD enforcement, Chainguard base images, and golden service templates.
* Production Infrastructure and Access: Teleport governs privileged access with customer approvals, RBAC, TPM-backed node joins, and session logging (including keystrokes/syscalls), actively eliminating traditional SSH.
* Fleet Integrity and Attestation: SPDM-based firmware attestation, Secure Boot, and Measured Boot validate fleet integrity from power-on, ensuring only cryptographically verified firmware runs and enabling remote attestation prior to workload scheduling.
* Data and Workload Encryption/Identity: SecVault PKI infrastructure supports encryption for data (object storage, databases, APIs), while mTLS adoption will bind endpoint identity to trusted firmware. Cross-cluster JWT authentication and SPIFFE integration secure workload-to-workload communication.
* Continuous Monitoring and Posture: Eclypsium and Wiz provide continuous security posture management, including firmware vulnerability scanning and cloud workload posture, while telemetry pipelines ensure policy compliance and deviation detection.
* Enterprise Identity and Data: Security mandates phishing-resistant MFA and device trust for all users, with Kolide enforcing device posture checks. Policy-as-code Okta rules govern access, and systems like Cyera, Proofpoint, and Netskope manage data governance and DLP controls.

The feedback here is that CoreWeave’s security posture is too limiting for a lot of power users. For example, by default, systemd is not available, and a lot of CPU and GPU profiling tooling is also unavailable due to the restrictive security policies. This has led to some strange design choices from end users, such as running background processes inside tmux shells instead of using systemd.

#### Monitoring and Health Checks

CoreWeave’s monitoring and health check systems are a key differentiator, and the primary line item to be quantified in TCO discussions with large customers who question the CoreWeave pricing premium.
In other words, users running at 10k+ GPU scale understand the impact that interruptions can have on their training jobs.

CoreWeave recognizes that standard Nvidia (DCGM) exporters do not expose all critical metrics, such as certain thermal sensors vital for diagnosing subtle hardware issues like failing thermal paste. CoreWeave developed proprietary exporters using the lower-level NVML library. This provides the necessary granularity for robust node-level health validation. In addition, they have also built exporters for the interconnect fabric.

To identify transient physical-layer problems like signal integrity degradation between compute trays and switches, CoreWeave developed a sophisticated correlation engine. We have noticed that other vendors do not run sustained multi-node jobs during burn-in and during active scheduled health checks, instead running single node jobs, collectives, and GPU stress tests ([GPU burn](https://github.com/wilicc/gpu-burn), [GPU fryer](https://github.com/huggingface/gpu-fryer), or [Multi-node ubergemm](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-multinode-diagnostics.html#supported-tests-for-multi-node-diagnostics)) individually. The point is that failures are often caused by the **simultaneous** thermal expansion and contraction of the GPUs and the interconnect. This is especially important for NVL72 rack-scale architectures.

By tracking events like simultaneous XID or SXID errors across the fabric, CoreWeave can automatically root-cause many failure types. For instance, if multiple compute trays connected to a single switch port report errors, the switch is flagged, while if an error follows a specific tray after it has been moved, the tray or its cabling is flagged. Simple intuition like this builds with experience, which CoreWeave has now been building for months with the GB200 and GB300 NVL72 rack-scale systems that have proved so challenging for others to operate.
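
The two correlation rules in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CoreWeave's correlation engine: the event format (a dict with `tray` and `switch` keys), the `root_cause` function, and the `min_trays` threshold are all hypothetical.

```python
from collections import defaultdict


def root_cause(events, min_trays=2):
    """Flag suspect switches and trays from a list of error events.

    Each event is a dict with two hypothetical keys:
      "tray"   - identifier of the compute tray that reported the error
      "switch" - switch (or switch port) the tray was attached to at that time
    Returns (flagged_switches, flagged_trays) as sorted lists.
    """
    trays_by_switch = defaultdict(set)
    switches_by_tray = defaultdict(set)
    for event in events:
        trays_by_switch[event["switch"]].add(event["tray"])
        switches_by_tray[event["tray"]].add(event["switch"])

    # Rule 1: several distinct trays report errors behind the same switch,
    # so the switch is the common factor.
    flagged_switches = {
        switch for switch, trays in trays_by_switch.items()
        if len(trays) >= min_trays
    }
    # Rule 2: the same tray reports errors behind more than one switch,
    # i.e. the error followed the tray after it was moved.
    flagged_trays = {
        tray for tray, switches in switches_by_tray.items()
        if len(switches) >= 2
    }
    return sorted(flagged_switches), sorted(flagged_trays)


events = [
    {"tray": "tray-01", "switch": "sw-A"},
    {"tray": "tray-02", "switch": "sw-A"},
    {"tray": "tray-07", "switch": "sw-B"},
    {"tray": "tray-07", "switch": "sw-C"},  # tray-07 was moved, error persists
]
print(root_cause(events))
```

A production system would also need time windows and error-type filtering so that unrelated events are not correlated; those are left out here to keep the two rules visible.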

[](https://substackcdn.com/image/fetch/$s_!Ok0z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ddb3f17-6379-4995-8c9e-d3bea778e4f9_936x636.png)Source: CoreWeave simulating some errors on our cluster

Meta’s Llama 3 paper is the clearest description of where reliability issues can manifest in a training campaign. Over a 54 day period when training Llama 3 405B, using 8 pods of 3,072 H100s, with 16k of 24k GPUs in use at a given time, there were 466 job interruptions (47 of which were planned upgrades), leaving 419 unexpected interruptions, and three instances of heavy manual intervention. This is an implied MTBF of 2,111 H100-days, and we can assume that “heavy manual intervention” means restarting the job from the last checkpoint. This example highlights that even small improvements in hardware stability and interconnect performance can shave days or even weeks off of a multi-month training schedule for a large model. Hardware failures also stand out as the #1 source of frustration amongst researchers that we talk to.

[](https://substackcdn.com/image/fetch/$s_!WKIC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d9eb732-36a5-4a60-8cbe-9b929f2a182b_936x549.png)Source: The Llama 3 Herd of Models

This section of this paper has also become infamous for two other reasons: a mention of the “diurnal 1-2% throughput variation based on time-of-day” (i.e. GPUs get hotter in the middle of the day and perform worse) and a comment on how “tens of thousands of GPUs may increase or decrease power consumption at the same time (…) can result in instant fluctuations of power consumption across the data center on the order of tens of megawatts, stretching the limits of the power grid” which resulted in the accidental upstream of the PYTORCH_NO_POWERPLANT_BLOWUP=1 environment variable at some point by a Meta engineer.

[](https://substackcdn.com/image/fetch/$s_!ETih!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03b0c4ca-8a5f-41c4-9d6d-76dab48edf85_930x308.png)Source: The Llama 3 Herd of Models

In summary, we continue to treat CoreWeave’s monitoring dashboard, passive health check approach, and suite of active health checks as the standard when working with other Neoclouds on their reliability challenges.

#### W&B Integration

Prior to the acquisition, Weights & Biases (W&B) had become the industry leader in user-facing metrics for job scheduling, but pricing was quite high and the company seemed to be innovating at a loss. Clearly, CoreWeave noticed a big overlap between their customers and W&B’s enterprise customers. Specifically, some of these customers have been called out for having absolutely massive logging infrastructure behind their W&B deployments, effectively abusing the system.

The obvious first step in W&B integration was to integrate infrastructure level metrics, like OOMEs or AERs, into the W&B dashboard so that users can see them. Generally, individual researchers don’t get access to the cluster-level grafana dashboard, but there is still lots of useful information that they can use, specifically from the underlying Nvidia DCGM.

#### Storage

CoreWeave’s storage offerings have matured over time to include a native Object Storage offering, “CAIOS” (CoreWeave AI Object Storage) and a local cache, “LOTA” (Local Object Transfer Accelerator). LOTA is a transparent, distributed cache that lives directly on the local NVMe of every GPU node. Public benchmarks show it reaching a sustained throughput of over 7GB/s per GPU on Blackwell. Effectively, with LOTA, the user doesn’t need to cp or rsync anything. They simply point their S3-compatible application to the cwlota.com endpoint instead of the primary CAIOS endpoint, cwobject.com.
LOTA then manages the caching of data onto the local NVMe in a distributed manner across the cluster.

According to list prices, for example, a 1,024 GPU cluster at $3 per GPU-hr for 1 year will cost $26.9M. However, for storage, 1PB of active data at $0.11/GB per month costs $1.3M per year, or 4.6% of the total BOM. In general, we rarely see storage costs exceed 5% of the total cluster cost. Over time, we expect hyperscalers like AWS, Azure, GCP, and Oracle to follow the current trend of reducing their price per GB per month on their Object Storage offerings (driven by competition from CloudFlare R2 in many cases), reduce or remove punitive egress fees, and reduce or remove costs associated with storage operations like data access.

#### Support

In our experience, and when talking directly to users of CoreWeave services, we get the impression that team members are empowered, excited to help customers, and proud to work at the company. Notably, all datacenter technicians are CoreWeave employees, go through standard company training, and have equity in the company. CoreWeave is one of the only Neoclouds that has not augmented their capacity with someone else’s GPUs, maintaining vertical integration for all their facilities.

In addition, CoreWeave’s “direct to expert” support model means that all customers get quick responses, at no additional cost. But this isn’t always the case! Recently, CoreWeave actually sent us a ClearFeed notification in Slack because their annual company offsite might result in some delayed support responses.

[](https://substackcdn.com/image/fetch/$s_!xqlJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd98fcd18-383d-40fe-b664-de6d617bab40_578x280.png)Source: CoreWeave

Last time, we recommended that CoreWeave work on a UI console flow for deploying their managed slurm solution, ideally with fewer than four button clicks.
CoreWeave has basically achieved this, though it does seem like the 30+ datacenters on screen are mostly greyed-out.

[](https://substackcdn.com/image/fetch/$s_!Kh7_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84800a5c-6b9a-4832-865f-9bcfb71e8f05_936x508.png)Source: CoreWeave

#### Total Cost of Ownership (TCO)

In summary, the result of CoreWeave’s execution for Slurm-on-Kubernetes, Bare Metal Provisioning, Use of DPUs, Security, Monitoring and Health Checks, W&B Integration, Storage, and Support provides CoreWeave an ability to argue for lower TCO vs their competition, and command a pricing premium on a per GPU-hr basis.

Our feedback to CoreWeave as they scale is to continue to work on the on-demand, self-serve cluster experience, continue to develop autoscaling features for supporting inference at scale, and to maintain their lead in reliability for the NVL72 rack-scale deployments.

From the customer perspective, a downside of CoreWeave continues to be that they do not offer on-demand instances or autoscaling, and rarely accept short-term rentals. This is different from Nebius and Crusoe, and limits the potential upside associated with high margin “spot instance” markets.

---

# Nebius (Gold)

> Nebius earns a ClusterMAX 2.0 Gold rating from SemiAnalysis. Since ClusterMAX 1.0, Nebius has continued to show up as the most direct competitor to CoreWeave in our customer conversations.
Nebius counts customers such as Shopify, Recraft, Mirage, Genesis Therapeutics, and most recently landed a…

- **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel
- **Tier**: Gold
- **Published**: 2025-11-06 (Nov 06, 2025)
- **Last updated**: 2025-11-06 (Nov 06, 2025)
- **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard
- **URL**: https://www.clustermax.ai/cloudreview/nebius
- **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/nebius/llm.txt
- **Topics**: Nebius review, Nebius GPU cloud, Nebius ClusterMAX rating, Nebius Gold, Gold tier GPU cloud, GPU cloud review, neocloud review, Nebius GB200 NVL72, Nebius GB200, Nebius GB300, Nebius B200, Nebius H200, GB200 NVL72 cloud, GB200 cloud, GB300 cloud, B200 cloud, H200 cloud, InfiniBand, RoCE, Kubernetes, Slurm, NCCL, bare metal, managed Slurm, ClusterMAX 2.0, SemiAnalysis

Since ClusterMAX 1.0, Nebius has continued to show up as the most direct competitor to CoreWeave in our customer conversations. Nebius counts customers such as Shopify, Recraft, Mirage, Genesis Therapeutics, and most recently landed a 5-year $17.4B deal (with expansion to $19.4B) with Microsoft for their Vineland, New Jersey datacenter. We [expect continued pull-ins from Nebius to Microsoft](https://semianalysis.com/core-research/openais-250b-azure-commitment-msft-needs-neoclouds-capacity/) and the healthy deal pipeline suggests more incremental capacity coming online over time.

Nebius differentiates financially due to its low cost of capital, with billions on its balance sheet, no debt, and a strong position as one of two publicly traded GPU-only Neoclouds that we track (the other being CoreWeave). Nebius differentiates technically with a virtualized approach to GPU infrastructure (built on experience from Yandex), and an AI-native approach driven by dogfooding with their internal AI team, which has resulted in multiple spinoff startups in the AI space.
While we are aware of their struggles in securing colocation deals and establishing a credit rating, their engineering prowess is evident.

The Nebius platform prioritizes flexibility, on-demand access, and a robust Kubernetes-native experience. This stands in contrast to CoreWeave’s bare metal, long-term reservation model and makes Nebius a compelling choice for autoscaling and spot instances, specifically used for experimentation and inference. Nebius continues to show up as a low-cost provider on various marketplaces and as underlying infrastructure, and is one of the only providers with realistic pricing right on the homepage of their website.

[](https://substackcdn.com/image/fetch/$s_!7803!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2786b950-f482-424a-85f6-0a53e8b6166c_935x516.png)Source: Nebius

As if that wasn’t enough, Nebius is also foraying into the inference endpoint market!

In this section, we will discuss our hands-on experience with Nebius’ platform:

* Slurm-on-Kubernetes
* Virtualization and Storage
* Monitoring and Health Checks
* On-Demand Instances

#### Slurm-on-Kubernetes

Since ClusterMAX 1.0, Nebius has officially launched their [managed Soperator service](https://nebius.com/blog/posts/introducing-managed-soperator) for a fully self-service Slurm-on-Kubernetes experience. We were able to test this out, and as expected we got a Slurm cluster that was completely set up and ready for use out of the box. This included pre-installed drivers, Docker, passwordless SSH between nodes, and expected performance on collectives (nccl-tests) and pytorch training jobs (torchtitan pretraining). These requirements may be considered basic, but are not to be taken for granted. Later in the article we will describe how difficult it is for other providers to install Slurm-on-Kubernetes (via Soperator or Slinky) with good defaults.
Notably, since Nebius uses its own open-source project, [Soperator](https://github.com/nebius/soperator), they completely control the roadmap and are vertically integrated from customer support issues in an sbatch script, down to the kubernetes orchestration, hardware, and datacenter troubleshooting layers. The control plane also looks nice:

[](https://substackcdn.com/image/fetch/$s_!WuiY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4cd7168-85d2-428f-bf86-1fb76e9503f0_937x660.png)Source: SemiAnalysis Nebius cluster

#### Virtualization and Storage

Unlike CoreWeave’s bare-metal-first strategy, Nebius has built its platform on layers of Kubernetes clusters managing each other, using KubeVirt all the way down. This means that customer workloads, even for full 8-GPU nodes, run inside virtual machines on a kubernetes cluster that they can access, which itself is managed by a kubernetes cluster that only Nebius can access. This design is similar to how GCP orchestrates compute.

The architecture allows Nebius to leverage the benefits of VMs, such as rapid provisioning and advanced storage features. For example, they use virtio-fs to attach a massive shared root filesystem, which presented as 197TB out of the box in our cluster, mounted at “/”, but obviously does not require 197TB of drives to be physically installed in the servers themselves.

[](https://substackcdn.com/image/fetch/$s_!HNOA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d49df2-690f-4d8a-b9e4-de7ed73d1cbc_518x148.png)Source: SemiAnalysis Nebius cluster

Nebius’s storage solution is built on YDB, which actually underpins their block, shared filesystem, and storage offerings. This approach to storage has reduced startup times (i.e.
pulling a container image or model file) for new machines with some example on-demand autoscaling workloads from over 10 minutes to around 2-3 minutes.

A common perception is that bare metal offers superior performance over VMs. When we raised this, the Nebius team was adamant that users should simply benchmark the platform. Their position is that because there is no virtualization layer for the InfiniBand fabric or the Nvidia GPUs themselves, performance should be identical to bare metal. Our initial testing seems to validate this claim, and third party results do too. For example, in the last round of MLPerf Inference v5.1 benchmarks, Nebius achieved top tier performance on Nvidia GB200 NVL72, HGX B200 and HGX H200 systems, for inference with Llama models.

An additional concern that customers have raised with VMs is the ability to easily enable low-level hardware counters for performance monitoring. The method for enabling hardware counters varies by virtualization platform. In VMware vSphere, you can enable virtualized CPU performance counters by editing the VM’s settings. This feature, known as vPMC, allows the guest OS to access the host’s Performance Monitoring Unit (PMU). Meanwhile on Windows Server and Windows 10/11 with Hyper-V, you can use PowerShell cmdlets like Set-VMProcessor to enable specific performance monitoring hardware features (e.g., pmu, pebs, lbr) for a stopped virtual machine.

However, on KubeVirt (which Nebius uses) via KVM/QEMU, the VMs inherit the capability to expose hardware counters from the underlying host. The process typically involves configuring the VM’s CRD on the underlying kubernetes cluster to enable virtual PMU from Intel. Hardware-level performance data like CPU cycle counts, cache misses, and branch mispredictions are available through there. You can typically enable this capability by activating a power metrics or PMU plugin, such as a Telegraf plugin, on the Kubernetes cluster.
For example, some users perform advanced performance tuning by using features like the Kubernetes CPU manager to pin vCPUs to host pCPUs for predictable latency on CPU-heavy workloads. Notably, for very large customers who insist on it and are doing a long-term rental, Nebius does have an option to provide bare metal clusters. #### Monitoring and Health Checks Initially, our access to monitoring was a simple Grafana dashboard available via an SSH port forward. These metrics and health checks were basic, but the team later released a series of updates that raised the bar significantly. Interestingly, since all of this is being integrated into Soperator, and Soperator is open source, we have been able to watch the roadmap come to life on the Soperator GitHub project: [](https://substackcdn.com/image/fetch/$s_!VtJL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82ef1e50-f573-4155-a180-32b5871ddb9b_935x494.png)Source: Nebius Soperator on GitHub There is no other Neocloud as open and transparent with their development as Nebius. Around the same time we were notified about the improvements to the dashboard and health checks, Nebius also released a blog post describing what they do for reliability in detail: on-site factory tests, node deployment tests, virtual platform tests, pre-provisioning cluster tests, and passive and active health checks. We believe this suite of burn-in tests, checks, and monitoring dashboards will improve cluster reliability and usability, especially as Nebius moves to adopt GB300 NVL72 rack-scale systems at scale, for customers such as Microsoft.
[](https://substackcdn.com/image/fetch/$s_!MAvP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6108afa3-2d8a-4fb8-9c11-1de2059a312a_936x309.png)Source: Nebius [](https://substackcdn.com/image/fetch/$s_!oqr3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e1c842-6ae7-4974-9958-4da08be8d043_936x226.png)Source: Nebius In the blog, Nebius describes a hypothetical example with 13 failures over 336 hours (14 days) on a 1,024 GPU cluster, resulting in a GPU-level MTBF of 26,446 GPU-hr, or 1,101 GPU-days. An apt comparison for this is the data from Meta’s Llama 3 paper, which claims 419 failures over 1,296 hours (54 days) on a 16,384 GPU cluster, resulting in a GPU-level MTBF of 50,677 GPU-hr, or 2,111 GPU-days. As a further comparison, we have heard from customers running similar-scale clusters (1k to 2k GPUs) from gold and silver tier providers that they experience as many as 5+ failures per day, for extended periods of time. This translates to a GPU-level MTBF of less than 10,000 GPU-hr, or less than 400 GPU-days. In our research, the numbers shared by Meta are very high, demonstrating the quality with which Meta runs their datacenters and Hopper generation GPUs. Meanwhile the hypothetical number from Nebius tracks as a reasonably good customer experience. Later in the blog, Nebius claims to have had a single 3,000 GPU cluster operate uninterrupted for 169,800 GPU hours, or 56.6 hours of stable operation. This would translate to an absurdly high GPU-level MTBF of 169,800 GPU-hours, or roughly 7,000 GPU-days. We are generally frustrated by providers who cherry-pick reliability data in this manner. We encourage customers to track this reliability data for themselves, especially if DataDog, New Relic, Splunk, or a custom Prometheus Alertmanager is set up and connected to a Slack channel for notifications on XID related errors.
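For readers who want to track this themselves, the GPU-level MTBF figures above follow from one formula: total GPU-hours of operation divided by the number of failures. The sketch below is a minimal illustration, not a tool used in our testing; the function names are our own, and the Meta example assumes a 16,384 GPU cluster, which is the count that reproduces the quoted 50,677 GPU-hr. Results for the other examples can differ slightly from the quoted figures depending on rounding and the exact inputs used.

```python
def gpu_mtbf_hours(num_gpus: int, wall_clock_hours: float, failures: int) -> float:
    """GPU-level MTBF in GPU-hours: GPU-hours of operation per observed failure."""
    if failures <= 0:
        raise ValueError("need at least one failure to estimate MTBF")
    return num_gpus * wall_clock_hours / failures


def to_gpu_days(gpu_hours: float) -> float:
    """Convert GPU-hours to GPU-days."""
    return gpu_hours / 24.0


# (num_gpus, wall_clock_hours, failures) for the examples discussed in the text.
# The 16,384 GPU count for Meta is an assumption; 2,048 GPUs is the upper end
# of the "1k to 2k GPUs" range quoted for gold and silver tier providers.
examples = {
    "Nebius hypothetical": (1024, 336, 13),
    "Meta Llama 3": (16384, 1296, 419),
    "5 failures per day on 2,048 GPUs": (2048, 24, 5),
}

for label, (gpus, hours, failures) in examples.items():
    mtbf = gpu_mtbf_hours(gpus, hours, failures)
    print(f"{label}: {mtbf:,.0f} GPU-hr, {to_gpu_days(mtbf):,.0f} GPU-days")
```

Feeding the same function with failure counts pulled from an alerting channel, for example XID alerts per week, gives a number that can be compared directly against a provider's claims.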
If you are tracking this data, and are willing to contribute it to anonymized and aggregated research, please get in touch: [clustermax@semianalysis.com](mailto:clustermax@semianalysis.com) [](https://substackcdn.com/image/fetch/$s_!9dMR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14f8d842-3d8f-423a-b777-552113266e08_854x468.png)Source: Nebius It is clear that Nebius is building battle scars when it comes to managed slurm clusters at the 1k+ GPU scale. #### On-Demand Instances A key differentiator for Nebius is their robust support for on-demand and autoscaling workloads. This is a direct result of their software-defined architecture. They offer pre-emptible instances, primarily for inference customers, which function similarly to spot instances on hyperscalers. This allows users to access capacity at a lower cost, with the understanding that the workload can be interrupted. We’ve seen public examples of this in action, such as Shopify’s work with SkyPilot and dstack sky on the Nebius platform, which highlights their strength in supporting dynamic, research-oriented workloads. This flexibility is a significant advantage for users who cannot commit to long-term contracts, and seems to be a major source of inbound customer qualification for Nebius. #### TCO Nebius presents a compelling and technologically distinct alternative in the GPU cloud market. Their deep investment in a Kubernetes-native, virtualized stack using KubeVirt and Soperator allows them to offer a degree of flexibility and on-demand access that is rare in the high-performance training space. While they may face headwinds in datacenter acquisition, their software stack is mature and performant. Our feedback to Nebius is to continue improving their monitoring and health-check visibility as they roll these updates out to customers, and to streamline the notification process for all tiers of users. 
Their ability to deliver bare-metal-equivalent performance through a VM-based architecture is a significant engineering achievement. For users whose needs revolve around research, autoscaling inference, and workloads that can benefit from a spot-like pre-emptible market, Nebius is an excellent choice that challenges the long-term reservation model of its competitors. Notably, some customers have been fixated on Nebius’ Russian roots, despite the fact that all of their staff are based outside of Russia, as opposed to making purchasing decisions on technical merits. We’re not sure how Nebius can address customers that hold this mindset. --- # Oracle (Gold) > Oracle earns a ClusterMAX 2.0 Gold rating from SemiAnalysis. Oracle just posted the most incredible quarterly earnings the market has ever seen. We were ahead of estimates, but still didn’t catch this in full. Specifically, Oracle signed four multi-billion-dollar contracts with three different… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Gold - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/oracle - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/oracle/llm.txt - **Topics**: Oracle review, Oracle GPU cloud, Oracle ClusterMAX rating, Oracle Gold, Gold tier GPU cloud, GPU cloud review, neocloud review, InfiniBand, RoCE, Kubernetes, Slurm, NCCL, bare metal, managed Slurm, ClusterMAX 2.0, SemiAnalysis Oracle just posted the most incredible quarterly earnings the market has ever seen. We were [ahead of estimates](https://semianalysis.com/core-research/orcl-preview-rpo-can-increase-120b-in-f1q26-and-above-200b-for-fy26-well-above-street-expectations/), but still didn’t catch this in full. 
Specifically, Oracle signed four multi-billion-dollar contracts with three different customers in Q1, including a $300Bn+ deal with OpenAI. A recent report from The Information raised AI server margin concerns about these multi-billion-dollar contracts, but we think lower margins during a ramp-up period make sense and [expect margins to significantly expand](https://semianalysis.com/core-research/core-weekly-insights-orcl-veco-colocation-chain/). [](https://substackcdn.com/image/fetch/$s_!ZcrH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44192941-3be5-40f8-b4ba-348096089482_936x671.png)Source: Oracle Oracle is in a unique position as the only hyperscaler (over 100 ADs across 45 active regions globally) without an in-house AGI Research program or significant venture capital investment (though Larry did invest $2B in xAI after a few texts from Elon), which has led them to land contracts with OpenAI, Meta, ByteDance, and Nvidia. They currently have over 60% of US Stargate according to public releases. We also track this in detail in our [Accelerator & HBM Model](https://semianalysis.com/accelerator-hbm-model/). Oracle also pivoted to wholesale bare metal early, taking advantage of their balance sheet while also maintaining a notable presence in the managed slurm and managed kubernetes market. In many cases, we have found other cloud providers giving us servers with IP addresses, locations, and other configuration information that makes it clear we’re actually running in an Oracle datacenter. Oracle’s default setup is typically provisioned through the console, using Terraform for automation behind the scenes. A notable point of friction for some users is the almost-mandatory use of Oracle Linux (version 8.10 by default), which is based on Red Hat Enterprise Linux (RHEL).
This operating system choice is contentious, as many AI workloads, particularly those in the open-source community, are first tested on Debian-based operating systems, specifically Ubuntu, for its broad compatibility and ease of use. [](https://substackcdn.com/image/fetch/$s_!jhz4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c28227d-728e-413f-bf31-a4b5ec9e5736_935x336.png)Source: SemiAnalysis Oracle Linux headache We attribute the Oracle Linux default to a historical beef with Canonical. This is surprising given that certified Ubuntu images have been available on Oracle Bare Metal Cloud Services since 2017 and are current up to version 24.04. In order to deploy a slurm or kubernetes cluster through the OCI console, users unfortunately need to use the OCI console. Visually, the console adheres to the [Redwood design system](https://redwood.oracle.com/), and uses their [JavaScript Extension Toolkit (JET)](https://www.oracle.com/webfolder/technetwork/jet/index.html), neither of which sparks joy. Oracle remains steadfast in its lifelong commitment to Java, even in the age of AI. After deploying a cluster, users who want to access a Grafana dashboard will need to navigate the maze of UI element options at their disposal. For those interested, the cheat code is: left hand burger menu > Developer Services > Stacks > Stack details > Application Information > (scroll down) > Grafana admin password. [](https://substackcdn.com/image/fetch/$s_!EOKU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ee8adbd-3c4e-4dee-a980-dad1e7f8f67f_935x502.png)Source: SemiAnalysis OCI console During testing of both slurm and Kubernetes, everything went smoothly. We were able to quickly achieve expected collective bandwidth on nccl-tests, and expected MFU on torchtitan.
Interestingly, upon first login to the slurm cluster we found a single node in a drain state. This was quickly replaced, but highlighted the difference in approaches to health checks and bare metal node provisioning when compared to CoreWeave. Oracle is actively working to improve platform reliability and user experience. Node Auto-Repair and Node Problem Detector integration is expected in Q4, with the goal of providing customers with a “doctor HPC” user experience via an official OCI binary. The team is also developing an Active Health Check mechanism called the Sustained Workflow Check, which involves running PyTorch and CUDA matmul linear regressions for 2-5 minutes to ensure sustained performance. This check is currently running on-demand, and in most cases an Oracle engineer works directly with customers to schedule the checks in a low-priority partition. Default behaviour is being developed to integrate this into both slurm and OKE. For managed Kubernetes, Oracle offers OKE (Oracle Container Engine for Kubernetes), with all resources being public by default. Users have the option to disable this default and utilize the Nvidia GPU Operator, although the default setup uses a custom operator. OKE provides GPU and GPU+RDMA pools as provisioning options, along with integrated storage options via a checkbox. The official instructions for setting up RDMA are publicly available, and nccl-test manifests show good out-of-the-box performance for allreduce and allgather operations. A key point of frustration in the OKE setup is the lack of a direct kubeconfig file. Users are instead required to SSH into the cluster to perform management functions. This is counter-intuitive for a publicly accessible, load-balanced service and can require a bastion proxy for proper external access to the cluster from a cluster admin or a user. 
One of the key benefits of kubernetes over slurm for users is the ability to develop code locally, and switch between different cluster contexts quickly, without ssh. In terms of networking, OKE’s default way of using an RDMA network in Kubernetes is to inject two fields, hostNetwork: true and dnsPolicy: ClusterFirstWithHostNet, into pod specs. On OKE, Oracle does not deploy the full GPU Operator but rather the device plugin only, with plans to add the full operator later. The Nvidia toolkit is installed and automatically updated on the nodes, ensuring the software stack is current. Performance testing on kubernetes using vllm benchmarks for pd disaggregation with llm-d showed strong results, and was easy to set up and integrate with provided LoadBalancer services via public IP. [](https://substackcdn.com/image/fetch/$s_!UM-s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F885c73c5-d50d-4a69-9e9b-e1909928916a_936x690.png)Source: Oracle HPC on OKE repo The initial testing phase suffered from a lack of integrated health checks. While slurm metrics were added later, the control plane initially lacked the necessary CLI features, which required some rollbacks to prevent customers from inadvertently terminating jobs. The introduction of a new “mgmt” CLI aims to address these operational complexities, and we agree with that direction. [](https://substackcdn.com/image/fetch/$s_!YcGx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8c94afe-05bc-48a1-a06f-130e5147e0de_936x306.png)Source: Oracle For Storage, Oracle offers a robust marketplace, with Weka and DDN being the primary partners. Weka is available both on their online on-demand marketplace (i.e. it can be built on-demand with bare-metal instances full of NVMe) and through direct deals. Oracle customers report that the shared support experience is stronger with Weka on Oracle than with either VAST or DDN.
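Returning to the RDMA networking default described above, the two injected fields can be illustrated on a plain pod manifest. This is a minimal sketch under stated assumptions: the helper name `inject_host_network`, the pod name, and the image reference are hypothetical, and this is not Oracle's actual injection mechanism. Only the two field names and values come from the text.

```python
import copy


def inject_host_network(pod_manifest: dict) -> dict:
    """Return a copy of a Pod manifest with the two RDMA-related fields set.

    Hypothetical helper for illustration; only the field names and values
    (hostNetwork, dnsPolicy) are taken from the review text.
    """
    patched = copy.deepcopy(pod_manifest)  # leave the caller's manifest untouched
    spec = patched.setdefault("spec", {})
    spec["hostNetwork"] = True  # pod shares the node's network namespace
    spec["dnsPolicy"] = "ClusterFirstWithHostNet"  # keep cluster DNS with host networking
    return patched


# Hypothetical pod used only to show the effect of the injection.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "nccl-test-worker"},
    "spec": {
        "containers": [
            {"name": "worker", "image": "example.invalid/nccl-tests:latest"}
        ]
    },
}

patched = inject_host_network(pod)
print(patched["spec"]["hostNetwork"], patched["spec"]["dnsPolicy"])
```

With hostNetwork enabled the pod shares the node's network namespace, which is why the DNS policy is switched at the same time so that in-cluster service names still resolve.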
For Networking, Oracle is like any other hyperscaler talking their book, trying to convince large customers that InfiniBand is not the only high-performance network solution, and their RoCE works well. They seem to be making progress in this regard. Overall, Oracle has made significant progress improving their managed cluster offerings, with improved monitoring dashboards and node lifecycle management, but there is still room to improve in terms of proactivity. There is still a chance that customers will discover a bad node before Oracle’s automated systems, and have to report the node for replacement manually. Oracle continues to be the most cost-effective of the four hyperscalers, and stands out as deploying new infrastructure the quickest, while also providing the best support. We expect Oracle to continue to grow both their wholesale bare metal and managed cluster business going forward, and we encourage Oracle to maintain its commitment to excellent customer support for all customers. --- # Azure (Gold) > Azure earns a ClusterMAX 2.0 Gold rating from SemiAnalysis. Azure maintains its ranking as a Gold-tier provider, considering that it will still be providing the bulk of OpenAI’s capacity through the end of 2026. 
Unfortunately, if you don’t work for OpenAI, Azure is not a significant player for… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Gold - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/azure - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/azure/llm.txt - **Topics**: Azure review, Azure GPU cloud, Azure ClusterMAX rating, Azure Gold, Gold tier GPU cloud, GPU cloud review, neocloud review, InfiniBand, RoCE, Kubernetes, Slurm, DCGM, bare metal, ClusterMAX 2.0, SemiAnalysis Azure maintains its ranking as a Gold-tier provider, considering that it will still be providing the bulk of OpenAI’s capacity through the end of 2026. Unfortunately, if you don’t work for OpenAI, Azure is not a significant player for managed clusters or on-demand VMs. Capacity is constrained across all regions globally for Hopper and Blackwell, and the CycleCloud slurm provisioning process is in need of an update, or at least some simplification. The OpenAI relationship has been [reaffirmed](https://openai.com/index/next-chapter-of-microsoft-openai-partnership/), with OpenAI and Microsoft locking in a long-horizon partnership and Microsoft getting a ~27% stake with model/IP rights through 2032. This gap between the wholesale bare metal experience for anchor tenants like OpenAI and the managed experience for the rest of the market becomes clear when we look at reliability and compare slurm (CycleCloud) with kubernetes (AKS). AKS reliability includes a fully managed Node Auto-Repair feature. This system automatically detects unhealthy nodes based on kubelet status conditions and attempts remediation through reboots or re-imaging. This philosophy extends to monitoring, where Azure Monitor for Containers provides integrated visibility into every layer of the cluster out-of-the-box.
In stark contrast, CycleCloud relies on the traditional HPC model via slurm’s HealthCheckProgram. However, CycleCloud does not provide a good default, like LBNL’s Node Health Check, or anything customized to Azure infrastructure. Instead, the full operational burden of health checks is placed on the user, who must write, test, and maintain custom scripts to monitor GPUs and the InfiniBand fabric. Beyond that, the integrated monitoring is limited to a high-level node status view in the UI, forcing users to implement their own solutions for any meaningful job-level or hardware-specific insights such as DCGM dashboards. As an example, the current documentation for deploying a CycleCloud cluster is split between older guides and a newer GitHub-centric approach. Users are required to configure login and scheduler nodes separately, as well as provision and manage their own MySQL database to handle slurm accounting (sacct). [](https://substackcdn.com/image/fetch/$s_!-gJX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38e571f6-0d68-463a-a971-5649830a1019_937x467.png)Source: Azure However, the comprehensive nature of a hyperscaler cloud platform also has some merits. Networking is straightforward, offering access options via NAT Gateway or bastion host. It also provides flexibility through support for custom images and integration with Azure Spot Virtual Machines for cost-effective bursting. Azure has a legacy in HPC that will feel familiar to users coming to a GPU cluster from an academic HPC background. [](https://substackcdn.com/image/fetch/$s_!k5IT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd8528d9-f71b-4de4-babd-ad473510f54e_937x467.png)Source: Azure On networking, Azure continues to lead the hyperscalers in performance, being the only one to deploy with InfiniBand and implement SHARP at scale.
Security is also rock solid. Microsoft in general holds a reputation for robust security and compliance practices, which has made it a trusted partner for federal government agencies and defense contractors. With that said, the dynamics of Microsoft’s relationship with its key customer, OpenAI, are shifting. Since Satya mentioned he’s “good for his $80B”, Stargate has turned into a $600B behemoth, much of which has been captured by Oracle. Google, xAI and Meta have followed suit, with Zuck committing to the same total spend of $600B over the next 5-7 years. The reality is that we are forecasting Azure to lose share in the market when considering the frontier labs’ compute requirements and existing commitments. This leaves Azure with the rest of the market, who generally demand strong managed cluster experiences for slurm or kubernetes and a streamlined support experience. To address this customer base, we believe that Azure must revamp its CycleCloud offering, simplifying the current cluster deployment and monitoring experience. Otherwise, Azure is at risk of being demoted to Silver due to its poor user experience for startups from Series A to AI unicorns. Compared to the fully managed, Kubernetes-native, and vertically integrated offerings from Neoclouds like CoreWeave, Nebius, and Oracle, as well as the aggressive capacity buildout and revised pricing we have seen from AWS and GCP, Azure has stiff competition. --- # Fluidstack (Gold) > Fluidstack earns a ClusterMAX 2.0 Gold rating from SemiAnalysis. Fluidstack is the only cloud to debut in our Gold tier this round, and certainly has the most unique business model. Almost all of Fluidstack’s customer deployments involve a third-party datacenter provider.
Fluidstack is effectively… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Gold - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/fluidstack - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/fluidstack/llm.txt - **Topics**: Fluidstack review, Fluidstack GPU cloud, Fluidstack ClusterMAX rating, Fluidstack Gold, Gold tier GPU cloud, GPU cloud review, neocloud review, Fluidstack TPU, TPU cloud, InfiniBand, RoCE, NVLink, Kubernetes, Slurm, NCCL, DCGM, ClusterMAX 2.0, SemiAnalysis Fluidstack is the only cloud to debut in our Gold tier this round, and certainly has the most unique business model. Almost all of Fluidstack’s customer deployments involve a third-party datacenter provider. Fluidstack is effectively the hired gun that organizations go to in order to turn bronze-tier datacenter infrastructure into a gold-tier customer experience. Google has also gone into the market to [secure colocation demand](https://semianalysis.com/core-research/core-weekly-insights-6/) with Fluidstack as the operator of TeraWulf and Cipher sites, potentially for their TPUs. We explore why GCP is willing to “backstop” these deals and why GCP needs Fluidstack [here](https://semianalysis.com/core-research/google-clouds-growth-will-surge-in-2026/). This is clear when it comes to customers such as Meta, Poolside, Black Forest Labs, and an unnamed customer running in a TeraWulf datacenter in Buffalo that got a massive financial backstop from Google. Source: Fluidstack Our hands-on experience with Fluidstack was a live demonstration of their value proposition: a highly collaborative, deeply technical partnership that rapidly improves the platform based on expert feedback.
While the initial cluster had rough edges, the speed and precision with which the Fluidstack team addressed every issue was unparalleled. Our initial slurm cluster came with pyxis and MPI support integrated into srun, and initial two-node nccl-tests showed performance within range for large message sizes. However, we immediately hit a significant usability issue: the prolog.d script was so bloated with health checks that it took over a minute to schedule an interactive run on a single node. The script was running full single-node NCCL tests for NVLink and InfiniBand, plus host-to-device bandwidth checks, every time a job started. When we pointed this out, the team immediately acknowledged it and committed to their roadmap of moving these active health checks to run on idle nodes in the background, which is the standard practice for other top-tier providers. This kicked off a rapid-fire feedback loop that defined our testing period: Performance Tuning: We noted that the Nvidia HPC-X toolkit was missing from the base image, which is necessary for optimal nccl performance at medium message sizes. While it was available within NGC containers, not all users leverage pyxis/enroot. Within 24 hours, the Fluidstack team had deployed HPC-X to the base image on our cluster and added it to their standard deployment pipeline for all customers. Monitoring Dashboards: The Grafana dashboard was solid, but we identified missing graphs for NVLink Rx/Tx utilization and incorrect DCGM metrics for tensor core pipes (they were capturing SIMT units instead of tensor core-specific pipes like DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE). The team implemented the correct DCGM metrics the following day. Security Posture: This was the most critical finding. We discovered the cluster was running a version of the nvidia-container-toolkit vulnerable to NVIDIAScape (CVE-2025-23266). The team patched the vulnerability on our cluster within minutes of us reporting it. 
While the immediate fix was impressive, our feedback focused on the larger operational need for automated dependency scanning and a proactive security process, such as enrolling in Nvidia’s security embargo program. This prompted a healthy discussion on their software supply chain security strategy. Passive Health Checks: We found that DCGM’s background health checks were not enabled. By injecting PCIe replay errors (dcgmi test --inject --gpuid 0 -f 202), we confirmed that the node would not automatically drain. Our recommendation was to actively poll dcgmi health -c and configure Node Health Check (NHC) to drain nodes based on specific thresholds (e.g., >8 PCIe replays per minute or >100 NVLink CRC errors per second). The team immediately added this to their near-term roadmap. Transitioning to Kubernetes was seamless, with a kubeconfig readily available from the UI. The cluster provided a solid foundation with standard components like Cilium for CNI, a CSI with ReadWriteMany support, node-problem-detector, kube-prometheus-stack, draino, a custom controller to turn off ACS, and the Nvidia Network Operator + GPU Operator, all managed via ArgoCD. This high-touch model typically involves shared Infrastructure-as-Code repos, ensuring customers get the exact tools they need, such as cert-manager in our case. This test, however, resurfaced the most critical theme from our slurm evaluation: software supply chain security. We discovered that the Nvidia GPU Operator chart was a minor version behind, leaving the cluster vulnerable to the same NVIDIAScape exploit. This highlighted a significant gap in their proactive security posture, particularly their absence from vendor embargo programs that provide advance notice of vulnerabilities.
Once notified, the team coordinated a maintenance window and patched the vulnerability in under an hour. The incident also motivated them to formalize a security process that includes more frequent proactive updates, subscriptions to vulnerability databases, and taking steps to join Nvidia’s disclosure program. Overall our experience with Fluidstack was strong. The platform was not perfect out-of-the-box, but slurm and kubernetes were both in a perfectly usable state within hours of cluster handover, and the engineering team demonstrated an elite level of responsiveness and expertise. Issues that might take weeks or months to get addressed in a hyperscaler’s ticketing system were fixed in hours. If there is anyone that demonstrated the “Forward Deployed Engineering” ethos during our testing, it was Fluidstack. --- # Crusoe (Gold) > Crusoe earns a ClusterMAX 2.0 Gold rating from SemiAnalysis. Since March, Crusoe has been hard at work expanding their datacenter footprint, while trying to keep the Neocloud business alive.
Crusoe has announced: A partnership with Oracle to develop Abilene, the flagship Stargate project for… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Gold - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/crusoe - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/crusoe/llm.txt - **Topics**: Crusoe review, Crusoe GPU cloud, Crusoe ClusterMAX rating, Crusoe Gold, Gold tier GPU cloud, GPU cloud review, neocloud review, Crusoe GB200 NVL72, Crusoe GB200, Crusoe B200, Crusoe MI355X, GB200 NVL72 cloud, GB200 cloud, B200 cloud, MI355X cloud, InfiniBand, NDR, RoCE, Kubernetes, Slurm, managed Slurm, ClusterMAX 2.0, SemiAnalysis Since March, Crusoe has been hard at work expanding their datacenter footprint, while trying to keep the Neocloud business alive. Crusoe has announced: * A partnership with Oracle to develop Abilene, the flagship Stargate project for OpenAI, at over 1.2 GW, worth $15B in joint venture funding. * An order of 29 LM2500XPRESS aeroderivative gas turbine packages from GE Vernova, enough for over 1GW of power. * A deployment of AMD MI355X GPUs, despite counting Nvidia as a key investor in their $600M Series D fundraise. * A 1.8GW datacenter in Wyoming, with a design to scale up to 10GW. * Expansion of their Iceland facility with atNorth, and a $175M credit facility for this project. * A smaller, 12MW facility in Norway with Polar, and planned expansion to 52MW. * A $750M credit facility from Brookfield. * A $225 million credit facility from Upper90. * “Prometheus,” a 150MW facility located in the Permian Basin in West Texas. Interestingly, Prometheus has debuted Crusoe’s Digital Flare Mitigation technology publicly for the first time.
During oil extraction, at sites like the Permian Basin, natural gas is a waste product that is typically flared. However, with DFM, Crusoe is able to install onsite mobile datacenter units that divert the waste gas to generators for the datacenter onsite. These DFM units have been announced as “Crusoe Spark”, and now include all the infrastructure required to host B200s. [](https://substackcdn.com/image/fetch/$s_!PGll!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac83abd3-c9ec-43a5-85de-186d8fabb3ae_936x526.jpeg)Source: Crusoe Spark launches (via Crusoe on YouTube) After all these announcements, Crusoe is left with a claimed 3.4GW of datacenter footprint, some of which is already showing up as revenue on their balance sheet. [](https://substackcdn.com/image/fetch/$s_!mCt9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F214c7c4a-8cc4-4368-a9d3-6c464fa3174a_937x397.png)Source: Crusoe.ai So yeah, lots going on. As for the actual customer experience: around six months ago, when we started testing slurm on Crusoe, they had just launched their fully managed slurm solution called “Auto Clusters”. The lifespan of this service offering has come and gone in the interim period, with the focus now being a Slurm-on-Kubernetes experience. Unfortunately, the new Slurm-on-Kubernetes experience is in its early days and is not usable out of the box. Starting up a cluster via the Crusoe CLI is simple, avoiding complicated Terraform scripts and some of the complexity of a web UI. [](https://substackcdn.com/image/fetch/$s_!FP6i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a435e9-0276-472d-b8d4-90d95a97ba9b_641x639.png)Source: Crusoe However, when a simple CLI approach is used, we expect reasonable defaults.
Crusoe claims to have developed their Slurm-on-Kubernetes offering in-house, while taking inspiration from Slinky. Unfortunately, the login pod was missing vim, nano, git, python, and sudo permissions. We gave some recommendations on how to take less inspiration from open-source Slinky and make the cluster usable out-of-the-box. The SonK offering also doesn’t support partitions, RBAC, or SSO integration, making it basically unusable for a research lab beyond the scale of about 10 researchers. In addition, when provisioning a kubernetes cluster without slurm for our testing, we had a lot of extras to set up. A Crusoe CMK cluster does not include a default ReadWriteMany StorageClass, making it impossible to deploy any workload with a persistent volume claim. We had to go through many extra configuration steps on the console to figure out how to configure this storage class. During our testing, we also encountered several performance and reliability issues on slurm, kubernetes, and on a standalone machine. We repeatedly saw NVML driver mismatch errors inside individual Docker containers, indicating potential image or driver management instability. We expect this is due to Crusoe’s use of [cloud-hypervisor](https://github.com/cloud-hypervisor/cloud-hypervisor), and insistence on building all their infrastructure, including GB200 NVL72, with VMs. On the networking side, while PKeys for InfiniBand partitioning were integrated, using them through the console was not intuitive. We have also had challenges with shared filesystems randomly unmounting, and with requirements to deploy OS drives and configure RAID settings manually (with the requisite footguns). In conversations with Crusoe users about reliability at scale, feedback has been hit-or-miss. Some have had good experiences, but everyone who tested clusters in Crusoe’s Iceland facility prior to March 2025 seems to have had a common experience: lots of link flaps and random filesystem unmounts.
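Returning to the out-of-the-box gaps on the login pod noted above, the "reasonable defaults" we expect can be expressed as a short smoke test that a provider or a user could run on first login. This is a minimal sketch, not Crusoe's tooling: the tool list mirrors the packages we found missing (with `python3` assumed as the interpreter name), and the function name `missing_tools` is our own.

```python
import shutil

# Tools we expected on a login pod out of the box. "python3" is an assumed
# binary name; adjust if the image ships a plain "python".
EXPECTED_TOOLS = ["vim", "nano", "git", "python3", "sudo"]


def missing_tools(tools=EXPECTED_TOOLS, which=shutil.which):
    """Return the subset of `tools` that cannot be found on PATH.

    `which` is injectable so the check can be exercised without a real
    login pod; by default it uses shutil.which.
    Note: this only checks that the binaries exist. It does not verify
    that the current user actually holds sudo permissions.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("login pod is missing: " + ", ".join(missing))
    else:
        print("all expected tools found")
```

A check like this is cheap enough to run as part of image validation before a cluster is handed to a customer.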
Crusoe ended up having to clean the 20,000 fiber ends using a “clicker” that were full of dust and other debris. Some people have said that the debris was volcanic ash. We found that in November 2023 the Icelandic Data Center ICE02 from atNorth started publishing status updates regarding increased seismic activity and volcanic uplift near Mt. Þorbjörn in the Reykjanes Peninsula. The datacenter is about 35km away from this volcano. [](https://substackcdn.com/image/fetch/$s_!2sao!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a52486d-933a-4cfc-9dde-54adfdb05413_937x694.png)Source: EDIS Global [](https://substackcdn.com/image/fetch/$s_!vy7B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37303b2d-250d-418d-b83a-d73f4edecf14_937x719.png)Source: [](https://substackcdn.com/image/fetch/$s_!R9Vb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0808ec5-423a-4805-a804-cfced30fe539_936x829.png)Source: Google Maps. Checking in to see how long it would take an atNorth datacenter technician to visit an active volcano on their lunch break It is our understanding that this datacenter Crusoe now calls home has continued to experience significant seismic activity and air quality concerns, leading to some more hits on YouTube videos like this one from Iceland. [](https://substackcdn.com/image/fetch/$s_!9YC8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb520d037-0aae-485d-960d-50bcf89ac500_937x652.png)Source: trueCable on YouTube Overall, Crusoe is clearly executing on an ambitious strategy, securing massive power capacity and datacenter real estate. They have already pivoted once from crypto mining to AI cloud, and seem to be in the process of another pivot from cloud provider to datacenter infrastructure provider. 
However, Crusoe is at risk of being downgraded to ClusterMAX Silver due to many of their top individual contributor engineers quitting, leaving the culture in their cloud division beginning to resemble big tech. There are too many middle managers across the organization, especially in engineering. This has caused incredibly slow moving releases, such as their AutoClusters feature, leaving us with concerns about the future of Crusoe’s public cloud offerings. Chase needs to do a rapid course correction if he doesn’t want to lose all of his 10x engineers and eventually lose their Neocloud business with it. --- # Together (Silver) > Together earns a ClusterMAX 2.0 Silver rating from SemiAnalysis. Together is a strong provider with a robust cluster offering for both slurm and kubernetes, but it is held back from the gold category due to reliability issues. When comparing offers, we hear from users that they generally expect a… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Silver - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/together - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/together/llm.txt - **Topics**: Together review, Together GPU cloud, Together ClusterMAX rating, Together Silver, Silver tier GPU cloud, GPU cloud review, neocloud review, InfiniBand, RoCE, Kubernetes, Slurm, NCCL, DCGM, ClusterMAX 2.0, SemiAnalysis Together is a strong provider with a robust cluster offering for both slurm and kubernetes, but it is held back from the gold category due to reliability issues. When comparing offers, we hear from users that they generally expect a lower price per GPU-hr from Together to justify the trade-off in reliability. 
Together is among the few providers about which we hear the most reliability complaints from users operating clusters of 64 GPUs or more. We expect this is due to their use of a broad mix of datacenter partners, which creates a “roll-of-the-dice” dynamic for performance and stability. Unfortunately, Together also does not offer to do 1-week POCs with most customers, unlike other Silver, Gold, and Platinum tier providers, which makes it difficult for buyers to know what sort of experience they will have on the cluster before making a multi-million dollar commitment. These are the reasons why Together went from a Gold tier provider to a Silver tier provider. Together’s multi-datacenter strategy seems to be driven by necessity. They have significant compute needs due to their serverless inference endpoint business, which is growing steadily. During our research for this article, we spoke with multiple Neoclouds that claim Together is one of their biggest customers. Competing in the serverless inference endpoint business does provide two key benefits: it creates a sales funnel to cross-sell GPU clusters to inference customers, and it allows Together to absorb the cost of idle cluster compute by running inference workloads on it. It also gives Together an opportunity to enjoy the fruits of their kernel team’s labour. TKC is an exceptional feature, and the impact of Tri Dao’s FlashAttention cannot be overstated. During the research for this article, we got to hear directly from Dan Fu about the TKC roadmap. We suspect that Dan is the only person in the industry with the title of “VP, Kernels”, and for good reason. TKC is consistently impressive, and it helps both customers and Together’s serverless inference endpoint business achieve improved performance and efficiency. Together’s model of offsetting costs from idle compute by running public and private serverless endpoints is now being copied by the likes of Nebius.
Why not make some extra money from idle compute? During testing, we got access to a classic Together slurm cluster, a TKE kubernetes cluster, and a soon-to-be-released Instant Cluster in preview. For slurm, the onboarding process was smooth. Just create an account on the console, upload ssh keys, and the Together engineering team sends you an onboarding document. One ssh command and the cluster works out of the box. Unfortunately, during testing we noticed that the cluster responded very slowly to terminal commands in a VSCode or Cursor remote SSH session. The standard terminal application was fine, and we could replicate the slowness from multiple locations, leading us to believe it was a problem with their datacenter provider. The Kubernetes onboarding experience was less polished. Instead of providing a kubeconfig file to download, we were expected to log in and access the cluster via ssh. As mentioned previously, this is atypical for kubernetes admins and users, who generally prefer to develop code locally and switch contexts on demand. In addition, we found that standard tools like Helm were not installed, and users do not get sudo permissions by default, requiring more manual setup. Together uses rancher k3s to provide these clusters, which is strange considering how much of the serverless endpoint runs on kubernetes. Together has several customers, including Hedra, Cartesia, and Krea, that are successfully running production inference on thousands of GPUs using these managed K8s clusters. However, at this time, Together does not have horizontal node autoscaling capabilities in these clusters. Whatever capacity you commit to is what you get. It is interesting to see the dynamic between the cluster business and the endpoint business in action: users can see it as Together competing against itself, or providing end users with choice.
[](https://substackcdn.com/image/fetch/$s_!Csqy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d4c4089-b55b-43bc-a687-9619706269bb_936x819.png)Source: Together. Trying to use our TKS cluster :( “Instant Clusters” is Together’s newest offering, designed to be fully managed via API, CLI, and a Terraform provider. This product allows users to dynamically provision clusters and add or remove nodes on demand, making it suitable for handling burst capacity and autoscaling. The architecture for Instant Clusters provides strong tenant isolation using a multi-layered approach similar to Nebius. First, a base Kubernetes cluster uses KubeVirt to create dedicated Virtual Machines (VMs) for a customer. Second, these VMs are used to form an isolated Kubernetes cluster dedicated to that customer. Third, slurm is installed into the customer’s dedicated K8s cluster using slurm-operator from Slinky. Overall, this architecture allows Together to offer flexible, on-demand Slurm environments on top of a modern, virtualized stack. Notably, in our testing, Together is the only provider to correctly configure Slinky out-of-the-box with sudo permissions, vim/nano, git, python, and other basic packages pre-installed. They have clearly already rolled out this offering to users, and we are excited for it to launch in full GA. On these clusters, Together provides 24/7 support from an on-call SRE team that is primarily US-based. For networking, they work directly with customers to configure firewall rules at the datacenter level and provide IP addresses as needed, including 1:1 NAT and public IPs assignable through services like MetalLB. The final, and most important, point of differentiation between Together and Gold tier providers is a proactive and automated approach to monitoring and reliability.
This has been a weak point for Together, and is difficult for them to work around given the broad set of datacenter and GPU infrastructure partners they contract with. During our review, we noted a bug in their Grafana monitoring dashboard that incorrectly reported InfiniBand bandwidth at a physically impossible 1.14 Tbit/s. To their credit, when we pointed this out, their team quickly identified the calculation error in their query and deployed a fix. For passive health checks, we expect checks to run continuously in the background to detect failures on live nodes. This is where the gap between their current implementation and a fully automated system is most clear. Together has implemented detection for many critical issues, including GPUs falling off the bus, PCIe errors, InfiniBand link flaps, high GPU thermals, and high ECC memory error rates. A baseline Kubernetes node health check is also in place. However, the most critical missing piece is automated remediation. While they can detect most of the issues above, the logic to automatically drain a faulty node is still on the roadmap for everything except GPUs falling off the bus in slurm. Other crucial features on the roadmap include detecting uncorrectable Nvidia XID errors, identifying stalled NCCL jobs, and implementing AI/ML-based predictive failure analysis. For active health checks, Together has implemented a comprehensive suite of tests for single-node validation. It includes Nvidia’s DCGM diagnostics (level 3), PCIe bandwidth tests, single-node NCCL and InfiniBand all-reduce tests to validate local interconnects, and GPU stress tests like GPUBurn. However, key multi-node and application-level tests are still on the roadmap.
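To make the gap between detection and automated remediation concrete, the sketch below shows the kind of policy that would close it: map each detected issue to a decision, and drain the node when any detected issue warrants it. This is a hypothetical illustration, not Together's implementation. The issue labels, the `DRAIN_ON` set, and the `remediation_command` helper are all our own assumptions; only the `scontrol update ... State=DRAIN` command is standard slurm.

```python
import shlex

# Hypothetical issue labels, loosely following the detections listed above.
DRAIN_ON = {
    "gpu_fell_off_bus",
    "pcie_error",
    "ib_link_flap",
    "gpu_thermal_high",
    "ecc_error_rate_high",
}


def remediation_command(node, detected_issues):
    """Return a slurm drain command for `node`, or None if no action is needed.

    Only issues in DRAIN_ON trigger a drain; anything else is left for a
    human to triage.
    """
    actionable = sorted(set(detected_issues) & DRAIN_ON)
    if not actionable:
        return None
    reason = "healthcheck:" + ",".join(actionable)
    return [
        "scontrol", "update",
        f"NodeName={node}",
        "State=DRAIN",
        f"Reason={reason}",
    ]


if __name__ == "__main__":
    cmd = remediation_command("gpu-node-17", ["ib_link_flap", "pcie_error"])
    # Print rather than execute: running it requires slurm admin rights.
    print(shlex.join(cmd))
```

Draining rather than downing the node lets running jobs finish while preventing new work from landing on suspect hardware, which is usually the safer default for a first automated action.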
The missing tests include pairwise ib_write tests to validate the InfiniBand fabric under load, hardware correctness validation with Nvidia’s TinyMeg2, and full-stack performance tests with models like Megatron to ensure TFLOPs and loss convergence match reference numbers. We have previously noted how important these tests are during burn-in and during cluster operation, as they stress both the GPUs and the interconnect at the same time, for an extended period of time, resulting in thermal expansion and contraction of the entire cluster, similar to normal operation. We encourage Together to prioritize implementing these active health checks, as we believe it will help them improve reliability, especially when working with datacenter partners that are not under their direct control. In summary, Together continues to operate on a solid foundation for managed clusters. They have a large and growing customer base for both their cluster and serverless inference endpoint products. Their active, single-node health checks are strong. However, the system is not yet complete. We believe that the gap between passively detecting node failures and proactively, automatically remediating them is a key reason for the reliability issues users experience today. --- # Lambda (Silver) > Lambda earns a ClusterMAX 2.0 Silver rating from SemiAnalysis. Lambda is another gold-tier candidate that unfortunately comes in at #1 in the wrong category: customer complaints.
Lambda started out in 2012 by building facial recognition software, then pivoted to reselling SuperMicro GPU… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Silver - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/lambda - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/lambda/llm.txt - **Topics**: Lambda review, Lambda GPU cloud, Lambda ClusterMAX rating, Lambda Silver, Silver tier GPU cloud, GPU cloud review, neocloud review, InfiniBand, RoCE, SOC 2, Kubernetes, Slurm, NCCL, DCGM, ClusterMAX 2.0, SemiAnalysis Lambda is another gold-tier candidate that unfortunately comes in at #1 in the wrong category: customer complaints. Lambda started out in 2012 by building facial recognition software, then pivoted to reselling SuperMicro GPU workstations, servers, and eventually on-premises clusters. Today, they appear to be 100% focused on their “Superintelligence Cloud” and working to shed their legacy on-prem server and workstation business. Their [recent announcement](https://lambda.ai/blog/lambda-announces-multibillion-dollar-agreement-with-microsoft-to-deploy-ai-infrastructure-powered-by-tens-of-thousands-of-nvidia-gpus) with Microsoft suggests they will be providing capacity worth multi-billions across 10s of thousands of Nvidia GPUs. A recurring theme when we talk to users is that Lambda seems to unfortunately be trying to do everything for everyone. While the company has deep experience in building dedicated HPC clusters, this has not yet translated into a polished, user-friendly cloud console, or cluster monitoring experience. Their product offerings feel conflicted with a new-mslurm, old-mslurm, new-mk8s, old-mk8s, private cloud, 1-Click Cluster, and on-demand Instances. 
Notably, 1-Click Clusters aren’t really one click, as you need to wait for approval. It’s more of a 1-Click-if-approved-and-paid-for-then-you-can-have-it-Cluster. [](https://substackcdn.com/image/fetch/$s_!DOw0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff30c7652-e758-40dd-b0a1-c15ef0d3e2fa_936x242.png)Source: Lambda Labs For users that want on-demand machines instantly, Lambda is generally considered to be the top-tier on-demand provider, with the largest fleet of GPUs available. However, in our recent experience, Lambda is in fact suffering from success in on-demand. We are generally met with greyed-out screens showing that capacity is sold out: [](https://substackcdn.com/image/fetch/$s_!MPHl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e1e4cd-aa7d-4ed1-b78b-f12886012569_935x486.png)Source: Lambda Labs: trying to get an on-demand GPU instance from Lambda Also, for a hot minute, Lambda appeared to be getting into the serverless inference API endpoint business, which would put them in direct competition with some of their largest customers. But that is no longer the case: [](https://substackcdn.com/image/fetch/$s_!G540!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9c6481a-22dc-4998-9964-47344c99dc46_936x598.png)Source: Lambda Labs Overall, we like the focus. Lambda has pivoted, and is now concentrating on their 1-Click-Cluster (1CC) business, with an emphasis on “big game hunting”. During our testing, we evaluated both their new (self-managed) and old (rancher-based) Kubernetes offerings, and their newly available slurm offering. None of these is UI or CLI driven, instead requiring a Lambda engineer to set up the cluster for you. Lambda’s Kubernetes product feels like an early-stage offering, marked by technical debt and a challenging user experience.
While the current product does not use Rancher, the public documentation still references it, causing initial confusion. The user experience for inference workloads is particularly lacking. Clusters do not come with a default public IP solution (like MetalLB or an external LoadBalancer). Setting up public-facing inference services is complex and not well-documented, requiring significant manual configuration. This reflects a platform that is developed to target training workloads, not inference. While documentation exists for a simple, single-GPU vLLM deployment, there are no examples for multi-GPU, multi-node, or auto-scaling inference workloads. For monitoring, Lambda uses a mix of open-source tools, including LeptonAI’s gpud for GPU device management and node-problem-detector for health checks, but these are not seamlessly integrated into the monitoring dashboards for the new or old mk8s products. Dashboards are easy to access, but they show no metrics until an agent is installed, and that agent is undocumented and, upon further inspection, still in development. For slurm, Lambda’s offering is a more recent addition, and the onboarding process was fraught with issues. The initial setup process was cumbersome: ssh keys were not correctly provisioned on the cluster, and the home directory was not shared across nodes by default, requiring data to be moved manually. New user account creation is a headache, requiring workarounds like unsetting environment variables (XDG_DATA_HOME) to function correctly. To their credit, once these initial hurdles were overcome, the cluster’s performance was strong. We observed expected allreduce, allgather and alltoall bandwidth on nccl-tests and were able to achieve full MFU on an example torchtitan training workload. Lambda also provides some useful, albeit hard to find, tooling.
For example, a welcome message (which was invisible in some SSH clients like Cursor or VSCode) contained custom instructions for a grafana-access command to quickly view performance metrics. Lambda’s approach to reliability on the slurm cluster included a custom dcgm-status script, which can be run on-demand: The script is also scheduled to run on a regular cadence in a low-priority, “preemptible” partition: [](https://substackcdn.com/image/fetch/$s_!7Eom!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F886e4633-5588-4da7-aed5-ed1427531a17_935x725.png)Source: our Lambda test cluster [](https://substackcdn.com/image/fetch/$s_!eIrW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa61413b0-5ee2-456e-b2a2-b13e50e1d551_935x224.png)Source: our Lambda test cluster We were impressed by Lambda’s commitment to developing comprehensive active and passive health checks, and believe that they are well on their way to improving reliability challenges, and building the battle scars necessary to run NVL72 rack-scale systems at scale. With that said, some of the access issues we encountered point to broader operational challenges at Lambda. Their cloud console (though not our cluster) experienced outages during our brief testing window. [](https://substackcdn.com/image/fetch/$s_!6UO2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7df56f4d-e298-4e9d-a53d-cba66c254179_937x557.png)Source: Lambda Labs [](https://substackcdn.com/image/fetch/$s_!ZdHp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74617ea2-6c23-4630-bf0f-e7a9b810ae31_937x496.png)Source: Lambda Labs Internally, there appears to be a general degree of disorganization. 
When asked about a true “cloud console” experience, Lambda acknowledged that the team’s background is primarily in traditional HPC cluster deployment, not building scalable, self-service cloud infrastructure. We encourage Lambda to truly focus on the cloud experience going forward as they simplify their portfolio and focus on their mslurm and mk8s offerings. On the positive side, Lambda is actively working on improving its platform based on our feedback. They have a compliance team addressing SOC 2 Type II requirements for individual sites, and are working to implement both SHARP and InfiniBand security keys for multi-tenant isolation, following recent Nvidia recommendations (and, likely, the onboarding of Nvidia as a customer with a $1.5B contract). Their storage offerings primarily focus on VAST, with future S3-compatible offerings currently in development. Overall, Lambda is a strong provider with deep hardware expertise, massive capacity, and big plans for the future. However, their public cloud product feels immature, and engaging with the team feels chaotic. We encourage Lambda to continue to work on translating their HPC hardware prowess into a stable, easy-to-use, and reliable cloud service. --- # Google Cloud (GCP) (Silver) > Google Cloud (GCP) earns a ClusterMAX 2.0 Silver rating from SemiAnalysis. You would think Google would set the standard. From jax to the transformer, search to maps, Waymo to YouTube. We use Gmail with Gcal to book a Gmeet to get our work done. 
We have to pull CoreWeave containers from gcr to run… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Silver - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/googlecloud - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/googlecloud/llm.txt - **Topics**: Google Cloud (GCP) review, Google Cloud (GCP) GPU cloud, Google Cloud (GCP) ClusterMAX rating, Google Cloud (GCP) Silver, Silver tier GPU cloud, GPU cloud review, neocloud review, Google Cloud (GCP) GB200 NVL72, Google Cloud (GCP) GB200, Google Cloud (GCP) B200, Google Cloud (GCP) H200, Google Cloud (GCP) H100, Google Cloud (GCP) TPU, GB200 NVL72 cloud, GB200 cloud, B200 cloud, H200 cloud, H100 cloud, TPU cloud, NDR, RoCE, NVLink, Kubernetes, Slurm, NCCL, DCGM, managed Slurm, ClusterMAX 2.0, SemiAnalysis You would think Google would set the standard. From jax to the transformer, search to maps, Waymo to YouTube. We use Gmail with Gcal to book a Gmeet to get our work done. We have to pull CoreWeave containers from gcr to run on their kubernetes cluster. Rumors abound about the TPU. Since the first version of this article, Google has addressed some issues holding them back, specifically making the decision to go with standard CX-7 NICs for their H200 (a3-ultra) and B200 (a4) instances, as well as their GB200 NVL72 instances (a4x). Our testing began by provisioning clusters for both slurm and Kubernetes (GKE). The managed slurm “Cluster Director” offering is still in preview, though at Google “preview” also means that key customers have had it for several months, and things work well. The architecture follows a standard managed service model where the slurmctld is handled by GCP, leaving users with access to the login and worker nodes.
We appreciated the default setup, including scripts for testing network performance via nccl-test and storage performance via FIO pre-staged in a GCS bucket for immediate use. For storage, GCP recommends Filestore for home directories, which provides enterprise features like snapshots and backups, while managed Lustre is positioned for large-scale, high-performance scratch space. Provisioning our Lustre filesystem was straightforward but not instant, taking roughly 40 minutes to complete. Interestingly, the cluster also demonstrated self-healing capabilities; when we intentionally deleted a worker node to clean things up and move from slurm to GKE testing, the Cluster Director service automatically recreated it in a matter of minutes to maintain the desired capacity. We had to delete the whole cluster from the cluster director screen to get it to take. Interesting demo. [](https://substackcdn.com/image/fetch/$s_!2zoS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ab2ffbc-3459-42f7-bbec-94bd13b1b201_935x524.png)Source: Google Cloud. Deleting a SLURM Cluster on GKE. This UI sparks joy However, the GKE-based solution is where GCP truly shines and feels years ahead of all Neocloud competition but CoreWeave. Using the “Cluster Toolkit,” the initial setup was streamlined. Most impressively, the cluster arrived with Kueue and JobSet pre-installed. This immediate, out-of-the-box support for modern, Kubernetes-native scheduling for batch workloads is a significant differentiator. While competitors are still building their own operators or relying on Slurm-on-Kubernetes projects, GCP provides a mature, fully integrated solution. Out-of-the-box performance was decent. Running nccl-tests with a standard JobSet YAML, we immediately achieved the expected bandwidth for allgather, allreduce, and alltoall operations without any tuning. 
However, it is worth noting that our experience is not representative of what others are seeing at scale. Currently, on GCP machines that use Nvidia CX-7 NICs (such as the a3-ultra H200, a4 B200, and a4x GB200), users must use the `gIB` plugin to get good performance. This means that users need to add additional container mounts and lines into an sbatch script or jobset manifest, such as `--container-mounts="/usr/local/gib"`, `export NCCL_NET=gIB`, `source /usr/local/gib/scripts/set_nccl_env.sh`, etc. This is a poor UX, leading to even advanced users seeing poor performance when compared directly to other providers. You effectively need to have a GCP engineer on hand to get the expected performance at scale, and it is still an open question for us whether alltoall collectives work as expected on this scale-out network. Our suggestion to Google to improve this UX is to have Nvidia bake the gIB plugin binaries directly into all NGC container images, and include logic during container init to automatically select the gIB plugin when on compatible GCP machines. This would remove the need for users to manually mount it into their containers. There is a way to detect if running on a GCP machine with gIB, either through vendor and device IDs, or by checking `/sys/bus/pci/devices/*`. Google and Nvidia have said that they have started to look into this and have plans on how to improve it. On a more advanced networking front, GCP provides a crucial capability for large-scale training: NCCL straggler detection, powered by their CoMMA (Collective Monitoring and Management Agent). In distributed jobs with hundreds or thousands of GPUs, a single underperforming node or “straggler” can bottleneck the entire collective. Diagnosing where the straggler is presents a significant challenge. CoMMA attempts to address this by using a sophisticated eBPF-based agent that non-intrusively traces NCCL operations.
By monitoring the progress of collectives like `AllReduce`, `AllGather`, and `AlltoAll`, CoMMA claims it can identify the specific ranks that are lagging. When a straggler is detected, CoMMA emits a detailed JSON payload to Cloud Logging, identifying not only the slow ranks but also the ranks that are proceeding normally. Customer feedback about CoMMA has been mixed. Storage performance was robust and capacity was flexible. GCP’s tooling automatically prepared an FIO benchmark job, which we ran to test I/O patterns for scratch writes, training data reads, and checkpointing, all of which delivered solid results for both the home directory and lustre mounts. Google also has a marketplace that includes solutions like Weka, in case customers prefer to deploy those. Of course, GCS is available on-demand too, where many enterprises already have their data stored for long-term retention. That said, no cloud experience is without its complexities. The primary hurdle we encountered was a classic cloud IAM footgun. When attempting to run a torchtitan training job, our pods were denied access to the dataset in a GCS bucket. This required diagnosing the node pool’s service account and running a series of gcloud and gsutil commands to grant the necessary permissions. While this is a common workflow for experienced GCP users, it’s a trade-off that working with a hyperscaler presents. GCP’s focus on production AI workloads is evident in the [GKE Inference Gateway](https://cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway) now being GA. Our evaluation of the GKE Inference Gateway focused on two features: prefill-decode (PD) disaggregation and prefix-aware routing. We found that PD disaggregation, advertised with a potential 60% throughput improvement, is not integrated into the standard GKE Quickstart profiles or documentation.
It currently exists as an “advanced optimization” that is a “constant work in progress.” In contrast, GKE’s implementation of prefix-aware routing is mature and well-documented. Unlike common patterns that require a user-managed proxy to route requests to inference engines like vLLM or SGLang for KV cache reuse, GKE integrates this routing logic directly into its managed L7 load balancer. This design eliminates a user-managed component from the serving stack, reducing operational complexity. GKE provides a robust inference networking layer, but there is a clear distinction between its stable, integrated features like managed routing and its not-quite-documented capabilities like PD disaggregation with llm-d. For monitoring, Google integrates DCGM metrics right into the main cluster dashboard. This is a great UX when compared to a separate grafana instance, with things like authN and authZ being wired up automatically to the same intuitive console where the cluster was deployed. This also allows for some customization. We suggested adding a TFLOP estimator via `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE * peak_fp8_flops`. For example, for H200, `peak_fp8_flops` would be 1979 TFLOPS. [](https://substackcdn.com/image/fetch/$s_!ZPjc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1c1f0c4-32da-479b-8692-765a865157b9_935x479.png)Source: Google Cloud On health checks, Google’s slurm offering is missing a background health check program. Currently, they rely on a prolog health check that runs some dcgm tests, but they haven’t yet integrated it as a NodeHealthCheck program in slurm for monitoring purposes during a batch job. By contrast, GKE has an option for users to configure AutoRepair, and an API for “repair and replace” functions where users can request a replace.
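The TFLOP estimator we suggested above amounts to a single multiplication. The following is a minimal sketch, assuming `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE` is reported as a fraction between 0 and 1 and using the 1979 TFLOPS H200 figure quoted above. The function name and lookup table are our own illustration, not part of Google's dashboard.

```python
# Peak FP8 throughput per GPU, in TFLOPS. Only the H200 value comes from the
# text above; extend the table with vendor datasheet numbers as needed.
PEAK_FP8_TFLOPS = {"H200": 1979.0}


def estimated_tflops(tensor_active, gpu="H200"):
    """Estimate achieved TFLOPS from DCGM_FI_PROF_PIPE_TENSOR_ACTIVE.

    `tensor_active` is assumed to be a fraction in [0, 1].
    """
    if not 0.0 <= tensor_active <= 1.0:
        raise ValueError("tensor_active must be between 0 and 1")
    return tensor_active * PEAK_FP8_TFLOPS[gpu]


if __name__ == "__main__":
    # 40% tensor pipe activity on an H200: 0.40 * 1979
    print(f"{estimated_tflops(0.40):.1f} TFLOPS")
```

The result is only a coarse estimate: the metric does not distinguish between precisions, so multiplying by the FP8 peak will overstate throughput for workloads running in BF16.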
GKE’s AutoRepair is a strong reactive offering, but it requires manual setup from the customer’s cluster admin, and does not get to the level of proactivity that Gold and Platinum tier Neoclouds exhibit with their health checks. We encourage Google to follow some of their competitors, and treat failures with the perspective that if a customer discovers it first, something is wrong. The experience we have had working with Google’s engineering team is exceptional, but it comes at a steep price. Access to premium support generally requires a multi-million dollar compute contract and a 3% premium (on purchases of at least $1M), creating a high barrier to entry and a clear distinction when compared directly to Neoclouds. Going forward, we have concerns about NVL72 rack-scale architectures in Google datacenters. Google and AWS have both gone with NVL36x2 instead of a true NVL72 rack due to power, cooling, networking, and reliability concerns. The result is supposed to be similar to NVL72, with the same scale-up domain of 72 GPUs as a standard NVL72 rack, but due to the cross-rack NVLink ACC cables it is a different topology. [](https://substackcdn.com/image/fetch/$s_!DF5M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eccc5d0-4ca5-458e-85f9-42dcef9b2797_936x1187.png)Source: SemiAnalysis GB200 hardware arch [](https://substackcdn.com/image/fetch/$s_!CFNe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73f31e58-83f7-4bae-aa39-7ee1bd8cbccb_936x572.png)Source: SemiAnalysis GB200 hardware arch In practice, however, users of GCP or AWS NVL36x2 have been waiting weeks or months longer to get stable firmware, and get the rack to a point of stability where they can run basic collectives.
[](https://substackcdn.com/image/fetch/$s_!Yp1z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb25cb704-5eb8-4a14-9a71-6ff33e17426b_668x1001.jpeg)Source: An NVL36x2 engineering build, via Google on Twitter In conclusion, Google aims to set the bar and command a pricing premium, but wrinkles like the gIB workflow, the lack of a GA managed slurm service, and reported issues with NVL72 rack-scale stability, as well as unclear SLAs + SLOs make the current pricing difficult to justify, especially for the legacy H100 instances that are still so popular amongst users. However, as the industry moves beyond H100s, Google’s roadmap is clearly strong. Once they roll out their B200 and GB200 instances at scale and push some roadmap items to GA, they will be in a powerful position to justify that premium. Google is on the fast track to the Gold-tier or higher. --- # Amazon Web Services (AWS) (Silver) > Amazon Web Services (AWS) earns a ClusterMAX 2.0 Silver rating from SemiAnalysis. Our experience with the world’s biggest cloud has been full of headache. AWS offers SageMaker Hyperpod Slurm and SageMaker Hyperpod EKS (kubernetes). We started with slurm. 
Interestingly, AWS and OpenAI signed a… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Silver - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/amazonwebservices - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/amazonwebservices/llm.txt - **Topics**: Amazon Web Services (AWS) review, Amazon Web Services (AWS) GPU cloud, Amazon Web Services (AWS) ClusterMAX rating, Amazon Web Services (AWS) Silver, Silver tier GPU cloud, GPU cloud review, neocloud review, Amazon Web Services (AWS) GB200 NVL72, Amazon Web Services (AWS) GB200, Amazon Web Services (AWS) GB300, Amazon Web Services (AWS) B200, Amazon Web Services (AWS) H200, GB200 NVL72 cloud, GB200 cloud, GB300 cloud, B200 cloud, H200 cloud, InfiniBand, NDR, RoCE, NVLink, Kubernetes, Slurm, NCCL, ClusterMAX 2.0, SemiAnalysis Our experience with the world’s biggest cloud has been full of headache. AWS offers SageMaker Hyperpod Slurm and SageMaker Hyperpod EKS (kubernetes). We started with slurm. Interestingly, AWS and OpenAI signed a multi-year deal for OpenAI to run core AI workloads on AWS EC2 UltraServers with NVIDIA GB200/GB300 worth $38B over 7 years, yet with no mention of EFA or HyperPod/Slurm in the announcement. Our initial setup process followed the primary documentation path for creating a slurm cluster through the SageMaker console. This path proved to be a dead end. The only successful method for provisioning a functional cluster was to abandon the standard documentation and instead use a CloudFormation stack from an official AWS workshop at http://catalog.workshops.aws/sagemaker-hyperpod .
This approach pre-provisions the entire required infrastructure stack, including the VPC, IAM roles, S3 bucket, and FSx for Lustre file system, before attempting to create the cluster itself. Effectively, the default console setup does not correctly configure the necessary dependencies. With that said, the process to get the CloudFormation scripts to work correctly requires navigating multiple documents to correct IAM policies (AmazonSageMakerClusterInstanceRolePolicy), request quotas of all sorts, and upload/run lifecycle scripts. Notably, these scripts are buried five directories deep in an unrelated GitHub repository, and are incredibly brittle. [](https://substackcdn.com/image/fetch/$s_!lmyP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cc47afa-5361-46ec-9d5b-3b30bd0aa740_937x421.png)Source: AWS [](https://substackcdn.com/image/fetch/$s_!gl8j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63f48c6d-8d31-4976-8fd1-698756843dc8_937x362.png)Source: Requesting and approving quota for ourselves on the AWS console The scripts have to be manually downloaded from the git repo, uploaded to an S3 bucket, and then added at the fourth step of configuring a cluster. When creating a VPC, IAM roles, S3 Bucket, and Lustre FSx, if you miss a step or need to upload a script to a different path, you have to restart the provisioning process. On our first try, we didn’t define enough controller nodes to handle our 4-node ml.p5en.48xlarge (H200) cluster. On our second try, 1 of the 4 nodes in the cluster didn’t mount the Lustre FSx properly, due to a race condition, and the whole cluster rolled back. On the third try, the size of the instance being requested for the controller node had been exhausted in the region/az, so we needed to roll back and try again.
Finally, on the fourth try, with a specific controller VM size (c5.xlarge instead of m5.4xlarge), and adding exactly one node at a time, we were able to provision the cluster properly. The provisioning process for a single cluster can take about two hours, as each node can take upwards of 30 minutes to deploy (if capacity is available). In total, we worked on provisioning this cluster for 14 straight hours, with intermittent calls from five different AWS engineers across various time zones. Notably, the race condition on Lustre FSx that requires adding one node at a time has been known about by AWS engineers for over a year and not fixed. We spoke to three separate AWS customers during our research that validated they have experienced the exact same issues when setting up a hyperpod slurm cluster. In addition, the standard, documented path for getting started with a single GPU instance does not actually produce a working GPU instance. Following the console guide results in a GPU instance provisioned without any Nvidia drivers installed, and a default root volume size of 8GB, which is insufficient to even install the required drivers manually. We believe this is a primary reason why various marketplaces reselling GPU compute in AWS datacenters such as lightning.ai and Qubrid have been able to maintain a business: the AWS UI is just so hard to use. On the HyperPod cluster, AWS (like other hyperscalers) removes public IPs in favor of a proprietary SSH wrapper script easy_ssh.sh . Unfortunately, this easy_ssh.sh is not easy: it requires an Access Token to be retrieved from the AWS console (tokens are cycled every 24 hours by default), and it uses the AWS SSM approach for access. This wastes time and is annoying, to say nothing of the process of managing users with add_users.sh or plugging the cluster into an IAM provider.
Uniquely, AWS is the only cloud where account managers pestered one of our team members relentlessly for payment on a capacity block that they had provided us directly for our testing. While this did get rectified, the experience speaks to the fact that the AWS organization is a behemoth, and customers need to push hard to get the left hand to speak to the right. Beyond our direct testing, independent feedback from multiple AWS users deploying hundreds of GPUs highlights additional issues: the need for a /16 CIDR to avoid IPv4 exhaustion (since 81 IPs are consumed per GPU instance), and a lack of IPv6 support on EKS. Other common footguns include HyperPod not using existing reservations automatically (another source of potential cluster recreation), and a different (but similarly frustrating) need to add nodes incrementally, in this case to avoid EFA errors. On health checks, AWS does have a relatively comprehensive approach compared to other hyperscalers. However, deep health checks can be excessively long (60-120 minutes) and are best disabled for faster scaling. Unfortunately, monitoring dashboards for slurm or Kubernetes cluster health, performance, and job stats are basically non-existent beyond standard, manual, open source tooling. [](https://substackcdn.com/image/fetch/$s_!DPK_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F441d976b-d94a-4dc9-9a26-b22bb93ebf2d_936x645.png)Source: Deep Health Checks on AWS Finally, on networking, not much has changed since our previous article. AWS remains steadfast in its commitment to EFA for all H200, B200, B300, GB200 NVL72, GB300 NVL72, and even future VR300 rack-scale architectures. Customers that see superior performance from InfiniBand and high-end RoCEv2 deployments generally dislike EFA performance and debugging.
However, AWS is steadfast in their commitment to EFA, going so far as to design future architectures where they will run PCIe connections between their compute trays and a separate “JBOK” (Just a Bunch of NICs) rack full of custom K2V6 EFA NICs. On GB200, their p6e platform uses NVL36x2 and runs into the same NVLink unreliability troubles as GCP, where the cross-rack NVLink ACC cables are causing major issues. [](https://substackcdn.com/image/fetch/$s_!VtHS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6e31a4-8405-45e5-8dd5-1688fcb33c7b_936x491.jpeg)Source: AWS AWS markets this disaggregated design as a strategic choice for resiliency, claiming it enables N+1 NIC redundancy and improves the mean time between failures (MTBF) of sensitive optics by moving them to a cooler, dedicated tray. However, the engineering reality suggests this move is less a choice and more a necessity driven by the thermal and spatial constraints of fitting multiple power-hungry K2V6 NICs inside a dense 1U compute sled. This architecture introduces non-trivial latency from PCIe Active Electrical Cable (AEC) retimers and feels like a complex workaround to us. That said, this JBOK design also enables a long-overdue shift to a rail-optimized network topology, which is critical for the performance of MoE models heavy on All-to-All collectives. But this obsession with reliability at the component level leads to a shockingly inefficient operational model at the system level. An entire GB200 rack (or logical rack, as AWS is going for NVL36x2, just like Google) is treated as a single failure domain called an “Ultraserver.” This means a single faulty compute sled requires draining workloads from all 18 nodes in the rack before any repair can be attempted. This is a stark contrast to the hot-swappable serviceability customers expect and receive from other GB200 NVL72 rack-scale providers.
In the worst case, this policy has brutal TCO implications as it demands entire “spare” racks to maintain capacity SLAs, a cost inevitably passed on to the customer via weak SLA penalties or higher prices. For users of EFA, debugging is also incredibly challenging. First, in a traditional HPC environment using InfiniBand or RoCEv2 (Converged Ethernet), engineers have a standard toolkit: ib_write_bw, ibping, ibv_devinfo, and ibdiagnet for direct testing of the physical layer. However, with EFA, your access ends at the EFA driver on the host. Second, since NCCL does not communicate with EFA directly, there are multiple layers of abstraction to contend with. The communication path is a complex chain of software shims: _NCCL → aws-ofi-nccl Plugin → Libfabric API → EFA Libfabric Provider → Custom ibverbs provider in RDMA Core Library → EFA Kernel Driver → AWS Hardware_ When a NCCL collective (like an AllReduce) hangs or performs poorly, the error message is often generic, like a timeout or a provider error. Pinpointing the source of the problem is a nightmare: is it a bug in NCCL itself? Is it an incompatibility or bug in the aws-ofi-nccl plugin? Is Libfabric misconfigured or hitting a corner case? Is the EFA provider encountering an issue with the SRD protocol (e.g., congestion, retransmissions)? Is there a physical hardware problem on the NICs, switches or cables? Without deep introspection tools for each of these layers, debugging becomes a process of managing support tickets with AWS. Third is the case of “gray failures”, where job performance degrades for inexplicable reasons. Is it congestion from other jobs on our cluster? Sub-optimal routing policies? A noisy neighbor tenant on the same global fabric? Multi-tenancy is always difficult to handle in networking, and a backend interconnect for GPU clusters is no different. Finally, the same usability issues with cluster setup can impact networking experience too.
Security Groups, IAM Permissions, and Cluster Placement Groups all need to be handled correctly to ensure a given user is getting proper performance. Many small things added together result in a big challenge for administrators. In general, we try to represent the customer experience, which repeatedly tells us that EFA does not perform well at scale. But AWS doesn’t care; they are not kowtowing to Nvidia and adopting CX-7 or CX-8 NICs. They have already sunk enough time and energy into this EFA NIC, and they’re going to make it work and save that 0.8% of TCO, dammit. Overall, Amazon SageMaker HyperPod is surprisingly difficult to use, especially considering that AWS is the leader in the cloud industry and brands itself on customer obsession. Official AWS documentation is hard to follow or incorrect, and the underlying platform suffers from issues of usability and performance at scale. For teams considering HyperPod, we recommend budgeting for significant engineering effort focused on cluster maintenance, including time to build custom automation that can work around AWS’s unique limitations. --- # Scaleway (Silver) > Scaleway earns a ClusterMAX 2.0 Silver rating from SemiAnalysis. Scaleway continues to carve out a niche as a premium, sovereign European cloud provider with a focus on large-scale AI training, particularly for startups and non-profits.
The company’s primary offering for high-end AI is centered… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Silver - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/scaleway - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/scaleway/llm.txt - **Topics**: Scaleway review, Scaleway GPU cloud, Scaleway ClusterMAX rating, Scaleway Silver, Silver tier GPU cloud, GPU cloud review, neocloud review, Scaleway GB200, Scaleway B200, Scaleway H100, GB200 cloud, B200 cloud, H100 cloud, InfiniBand, Spectrum-X, Kubernetes, Slurm, NCCL, DCGM, ClusterMAX 2.0, SemiAnalysis Scaleway continues to carve out a niche as a premium, sovereign European cloud provider with a focus on large-scale AI training, particularly for startups and non-profits. The company’s primary offering for high-end AI is centered entirely on slurm for training. Inference, and Kubernetes in general, are not a significant part of their go-forward strategy. A significant recent improvement is the deployment of a “copilot” Grafana instance, which includes essential DCGM and slurm exporters. This directly addresses a criticism we made in the first version of this article, regarding a lack of monitoring in their offering. Furthermore, Scaleway has enhanced reliability by implementing health checks through slurm prolog and epilog scripts, with monitoring data being actively managed to ensure cluster stability. [](https://substackcdn.com/image/fetch/$s_!r7fD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8b85c85-be81-4b4f-b116-deca7acebb4d_935x598.png)Source: Scaleway On the networking front, Scaleway continues to mainly offer H100 nodes with Nvidia’s Spectrum-X networking.
They generally provide customers an opportunity to reserve clusters for days or weeks to compare Spectrum-X performance against traditional InfiniBand, and have seen good results. Interestingly, during testing, Scaleway worked with a customer to develop a synthetic benchmark better than nccl-tests, which indicated a 20% performance improvement over InfiniBand. However, in a crucial real-world test case, that performance advantage was not realized. Looking ahead, Scaleway is still in the planning phase for Blackwell, and is targeting HGX B200/B300 baseboards only, rather than the fully integrated GB200 NVL systems. The company is also exploring AMD GPUs, but has yet to see significant customer traction in Europe. Overall, Scaleway’s business model has begun to reflect a “European premium” for sovereign, GDPR-compliant infrastructure. This is evident in their resource allocation, which requires customers to contract for an entire cluster for large-scale jobs, rather than allowing on-demand access to single 8-way GPU machines. This model targets well-funded, serious AI projects, including an ecosystem built around the Scaleway Startup Program. This program offers credits and support, aiming to onboard the next generation of European tech companies. We expect Scaleway to continue to operate in a solid niche, prioritizing dedicated, high-performance clusters for the European market with an associated premium. --- # Cirrascale (Silver) > Cirrascale earns a ClusterMAX 2.0 Silver rating from SemiAnalysis. Cirrascale occupies a unique, and somewhat confusing, position in the market, landing them in our Silver tier.
The company operates on a build-to-order basis for its cloud services, a model that feels more like a high-touch… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Silver - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/cirrascale - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/cirrascale/llm.txt - **Topics**: Cirrascale review, Cirrascale GPU cloud, Cirrascale ClusterMAX rating, Cirrascale Silver, Silver tier GPU cloud, GPU cloud review, neocloud review, Cirrascale B200, Cirrascale MI355X, B200 cloud, MI355X cloud, Kubernetes, Slurm, DCGM, ClusterMAX 2.0, SemiAnalysis Cirrascale occupies a unique, and somewhat confusing, position in the market, landing them in our Silver tier. The company operates on a build-to-order basis for its cloud services, a model that feels more like a high-touch colocation or system integration service than a conventional cloud offering. Their offerings include rent-to-own plans and a service where they help customers procure servers (e.g., from Supermicro), which the customer then owns, while Cirrascale provides hosting, setup, and RMA coordination for a fee. This is a fundamentally different approach from most other providers, possibly comparable to Lambda’s Private Cloud business, some of Fluidstack’s agreements, and STN’s managed services. Our interactions with the Cirrascale team have been challenging. The Cirrascale team feels that our criteria, particularly around software orchestration like Kubernetes, are not relevant to their customers. Their philosophy is to avoid any responsibility at the platform layer, meaning they provide bare-metal access and expect customers to bring and manage their own software stacks. 
In conversations with customers that use Cirrascale, however, we have heard that they don’t actually like this approach. Integration of a simple DCGM background health check into a Slurm environment that plugs into datacenter operations systems would allow for quicker diagnosis of problems, and more goodput during training runs. It may also save Cirrascale time and money when performing RMAs. While Cirrascale has thousands of GPUs deployed and a large backlog of customers for new B200 and AMD MI355X systems, their market position has also seen significant shifts. Notably, OpenAI, which once hosted all its owned servers with Cirrascale, migrated its entire infrastructure to Microsoft Azure. This move by a flagship AI lab away from the customer-owned/managed-hosting model to a hyperscaler is a telling indicator of the industry’s direction. In summary, Cirrascale serves a specific niche: organizations that want to own their hardware assets but outsource the complexities of datacenter operations. However, their hands-off approach to the software stack makes it difficult to recommend them for teams that expect a reliable, hands-off cluster. This model places a heavy operational burden on the customer, solidifying Cirrascale’s position in the Silver tier. --- # GCORE (Silver) > GCORE earns a ClusterMAX 2.0 Silver rating from SemiAnalysis. GCORE is a Luxembourg-based provider that was founded in 2014, originally focusing on gaming, CDN, and general purpose cloud. But now, AI.
GCORE offers GPUs across Europe, including datacenters in Luxembourg, Portugal, Germany, the… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Silver - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/gcore - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/gcore/llm.txt - **Topics**: GCORE review, GCORE GPU cloud, GCORE ClusterMAX rating, GCORE Silver, Silver tier GPU cloud, GPU cloud review, neocloud review, GCORE B200, B200 cloud, RoCE, Kubernetes, Slurm, NCCL, ClusterMAX 2.0, SemiAnalysis GCORE is a Luxembourg-based provider that was founded in 2014, originally focusing on gaming, CDN, and general purpose cloud. But now, AI. GCORE offers GPUs across Europe, including datacenters in Luxembourg, Portugal, Germany, the Netherlands, the UK, and the US (Virginia and California). They also have plans to go into the Nordics, partially via self-build, but also via an established partnership with Northern Data Group (also known as Taiga Cloud). It’s unclear where this partnership will go, as Northern Data apparently just had their offices raided over tax fraud allegations related to crypto mining operations in 2023. The GCORE platform is feature rich, with a nice balance of usability and strong underlying hardware performance. Unfortunately, only Kubernetes was available to test for us. We learned after the fact that their Slurm-on-Kubernetes offering, based on SOperator, is [buried in API documentation](https://gcore.com/docs/api-reference/cloud/managed-kubernetes/create-k8s-cluster#body-add-ons-slurm). We look forward to testing this in the future. 
[](https://substackcdn.com/image/fetch/$s_!HQ3I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb200cc78-4890-4c38-90ad-cb7e86ac6a8d_937x510.png)Source: The GCORE console (alt: what we wish our AWS console looked like) The onboarding process began with a series of manual hurdles where we quickly realized we were dealing with an enterprise-ready console modeled after the hyperscalers. After creating an account, we were required to request a quota increase to spin up a cluster. Interestingly, while at the hyperscalers we can approve these quota increases ourselves, with GCORE there was a nameless, faceless support team member making the decision for us. This actually resulted in us needing to make three separate attempts (over the course of two working days) to get quota approved for 2TiB of VAST Storage to go with our 2-node kubernetes cluster. [](https://substackcdn.com/image/fetch/$s_!ZEE8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c9e9284-6f8d-4ddd-b2fa-9f67e6a4e21d_937x495.png)Source: GCORE Forging ahead, we followed the required steps: creating three virtual networks and a VPC, then provisioning our Kubernetes cluster, which promptly became stuck in a “provisioning” state for over two hours before ultimately failing. Notably, GCORE takes networking seriously: routers, configurable networks, floating IPs, firewalls, and reserved IPs. This can make setup confusing for non-cloud native users. [](https://substackcdn.com/image/fetch/$s_!NQE1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae21b51-4c1b-4fe1-a0ea-7aae893b1734_937x505.png)Source: Setting up a router for our Kubernetes cluster on GCORE Our second attempt was more successful, at least on the surface.
The cluster spun up correctly, including a default ReadWriteMany StorageClass using the VAST quota we fought so hard for. Unfortunately, the cluster was delivered without the Nvidia GPU Operator or the Network Operator. This is a critical point for many general-purpose clouds that have kubernetes experience but miss some of the basics when they turn to serving the AI market. Some opinions (like having the GPU and Network Operator pre-installed) are worth enforcing in customer clusters. After confirming that performance on nccl-tests, a torchtitan pretraining job, and disaggregated prefill/decode inference endpoints via llm-d was working as expected, we turned to focus on monitoring. Unfortunately, this seems to be left completely up to the user. While gold and platinum tier providers handle the CNI, CSI, active/passive health checks (e.g. via node-problem-detector or custom controllers and CRDs), kube-prometheus-stack (i.e. on a Grafana dashboard), and Slurm-on-Kubernetes, GCORE leaves all of that up to the user. Overall, GCORE’s platform is strong, and one of the best purely self-service offerings we tried for kubernetes. The console includes all the enterprise goodies one would expect, and it makes sense why this enables them to sell into large enterprises with PCI DSS compliance and a global datacenter footprint. We encourage GCORE to develop an advanced cluster monitoring dashboard, implement active/passive health checks on the kubernetes layer, and consider developing a first-class Slurm-on-Kubernetes experience over time. --- # Firmus / Sustainable Metal Cloud (SMC) (Silver) > Firmus / Sustainable Metal Cloud (SMC) earns a ClusterMAX 2.0 Silver rating from SemiAnalysis. Firmus is an Australian company that was recently backed by a strategic investment from Nvidia at a $1.9B valuation: . Their current ambition is to build a “Stargate for the southern hemisphere,” with a specific focus on next-generation rack-scale systems like the GB300 NVL72 and VR.
Though we believe that the bulk of Firmus’s experience with immersion cooling is misguided, and now wasted, we also believe that this team is one of the few in the industry that has the engineering chops to monitor and maintain the physical layer of these DLC systems effectively. Our review of their current telemetry and failure prediction system for their immersion deployments demonstrates significant attention to detail, and a deep understanding of the physical stack, down to the signal quality and light levels in custom transceivers and optical cables. However, this experience at the lowest physical level can be undermined by a higher UX level that feels out-of-touch with customer requirements. Our testing began with a difficult wrinkle: cluster access is gated behind a mandatory VPN. This is a significant operational bottleneck for teams accustomed to standard cloud workflows with public IPs or streamlined SSH wrappers. While some security-conscious customers (such as international federal agencies for defense, intelligence, and research) may find this acceptable and even prefer isolation at Layer 2, 3, 5, or 7, the general public does not operate this way. The fact that Firmus had no alternative access method prepared was telling for us. Once connected, our slurm environment also had some configuration issues. The standard topology.conf file was not set for topology-aware scheduling, and a simple `srun -N1 --gpus-per-node=8 --pty bash` command took over a minute to execute due to an exceptionally long prolog. It seems that the Firmus team took some of our previous feedback around health checks to an extreme, filling up the prolog with unnecessary DCGM level 3 checks when level 1, 2, or just an epilog with HealthCheckProgram configured would suffice. To their credit, a pre-staged nccl-test script was provided and ran at expected bandwidth.
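For readers unfamiliar with the missing topology.conf, the sketch below shows roughly what topology-aware scheduling needs as input. It is an illustration under stated assumptions, not Firmus's configuration: the helper `render_topology_conf`, the switch names, and the hostnames are all hypothetical, and it assumes Slurm's tree topology plugin (`TopologyPlugin=topology/tree` in slurm.conf) with a simple two-tier leaf and spine layout.

```python
def render_topology_conf(leaf_to_nodes: dict[str, list[str]],
                         spine: str = "spine1") -> str:
    """Render a two-tier topology.conf for Slurm's topology/tree plugin.

    leaf_to_nodes maps each leaf switch name to the hostnames cabled to it.
    All leaves are assumed to hang off a single spine switch.
    """
    lines = []
    for leaf, nodes in sorted(leaf_to_nodes.items()):
        # One line per leaf switch, listing the nodes attached to it.
        lines.append(f"SwitchName={leaf} Nodes={','.join(nodes)}")
    # The spine line lists child switches rather than nodes.
    lines.append(f"SwitchName={spine} Switches={','.join(sorted(leaf_to_nodes))}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical four-node cluster spread across two leaf switches.
    example = {
        "leaf1": ["gpu001", "gpu002"],
        "leaf2": ["gpu003", "gpu004"],
    }
    print(render_topology_conf(example), end="")
```

With a file like this in place, the scheduler can prefer placing a multi-node job under a single leaf switch instead of spreading it across the fabric.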
As mentioned previously, the Firmus monitoring stack is unique, going beyond standard DCGM metrics and feeding ML models to predict component failures before they occur. A “link flap” is formally defined as five events in one hour, triggering automated diagnostics. Their internal validation suite is exhaustive, running regression tests on spare nodes that include P2P bandwidth tests, GDR copies, small-scale llama training runs, and NCCL tests to proactively identify GPUs, NVLink, or InfiniBand interconnects that are approaching failure. [](https://substackcdn.com/image/fetch/$s_!ECPy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a1344a-e46a-49c1-aeca-da6c023cb9f7_894x392.png)Source: Firmus Custom Monitoring Dashboard for Immersion Tanks [](https://substackcdn.com/image/fetch/$s_!Hz_s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff610ff4e-5907-4f66-aaa4-104c67fe15a0_937x454.png)Source: Firmus Customized Grafana Dashboard, showing relevant GPU Utilization Metrics during a training run This level of investment in monitoring at the physical layer is how Firmus plans to back up an aggressive “99.94% SLA”, aiming to differentiate itself from competitors by ensuring maximum goodput – something that we have also heard from top-tier providers like CoreWeave and Nebius. Their business model mirrors other major Nvidia clouds, with attractive prospective pricing for their upcoming rack-scale deployments, much of which is made possible by a low power cost in their massive expansion into Tasmania. We encourage Firmus to double-down on their focus on operational excellence from the physical layer to the orchestration layer (i.e. properly configured slurm and kubernetes clusters) without getting distracted by fancy PaaS and SaaS applications that the vendor-du-jour is pitching. 
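The “five events in one hour” link flap definition from the Firmus section above can be sketched as a sliding-window counter. This is a minimal illustration, not Firmus's implementation: the class name `LinkFlapDetector` is hypothetical, timestamps are assumed to arrive in non-decreasing order, and a sliding window is an assumption, since the review does not say whether the hour is sliding or fixed.

```python
from collections import deque


class LinkFlapDetector:
    """Flag a link as flapping once enough events land inside a time window."""

    def __init__(self, threshold: int = 5, window_seconds: float = 3600.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps of recent link events, oldest first

    def record(self, timestamp: float) -> bool:
        """Record one link up/down event; return True if the link is flapping.

        Timestamps are seconds and must be passed in non-decreasing order.
        """
        self._events.append(timestamp)
        # Drop events that have aged out of the window.
        while timestamp - self._events[0] >= self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.threshold


if __name__ == "__main__":
    detector = LinkFlapDetector()
    # Five events ten minutes apart all fall inside one hour.
    for t in (0, 600, 1200, 1800, 2400):
        flapping = detector.record(t)
    print("trigger diagnostics" if flapping else "link looks stable")
```

A real pipeline would keep one detector per port and feed it from switch or NIC telemetry; the trigger would then kick off the automated diagnostics the review describes.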
--- # GMO Cloud (Silver) > GMO Cloud earns a ClusterMAX 2.0 Silver rating from SemiAnalysis. GMO Cloud, part of the sprawling Japanese conglomerate GMO Internet Group, presents a highly opinionated approach targeting their domestic market. The offering is built on a foundation of security that for us is so stringent it… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Silver - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/gmocloud - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/gmocloud/llm.txt - **Topics**: GMO Cloud review, GMO Cloud GPU cloud, GMO Cloud ClusterMAX rating, GMO Cloud Silver, Silver tier GPU cloud, GPU cloud review, neocloud review, Spectrum-X, Kubernetes, Slurm, NCCL, DCGM, ClusterMAX 2.0, SemiAnalysis GMO Cloud, part of the sprawling Japanese conglomerate GMO Internet Group, presents a highly opinionated approach targeting their domestic market. The offering is built on a foundation of security that for us is so stringent it alters the user experience, while still providing solid performance. We focused on slurm as kubernetes is not available, and quickly found that sinfo and scontrol are completely disabled for end-users. This decision, presumably made in the name of security, caused us issues with pre-baked scripts that depend on scontrol show hostnames $SLURM_JOB_NODELIST and other basic convenience functions. It also resulted in us having to modify some of our standard debugging practices, since users are unable to inspect the cluster state or topology. Thankfully, GMO provides a convenience command “snodes”, and a custom script, “get_master_addr.sh” which got us running jobs at expected performance. 
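For scripts that depend on scontrol show hostnames $SLURM_JOB_NODELIST, one portable workaround when scontrol is unavailable is to expand the compact nodelist in the job itself. The sketch below is a simplified stand-in written for illustration. It is not GMO's get_master_addr.sh, the function name expand_nodelist is hypothetical, and it only handles one bracket group per name (for example gpu[01-03,07]).

```python
import os
import re


def expand_nodelist(nodelist):
    """Expand a compact Slurm nodelist such as 'gpu[01-03,07],login1'.

    Simplified: supports at most one [...] group per comma-separated entry.
    """
    hosts = []
    # Split on commas that are not inside a [...] group.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?[^,\[]*", nodelist):
        match = re.fullmatch(r"(.*?)\[([^\]]*)\](.*)", part)
        if not match:
            hosts.append(part)
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            if "-" in item:
                low, high = item.split("-")
                width = len(low)  # preserve zero padding, e.g. 01, 02
                for i in range(int(low), int(high) + 1):
                    hosts.append(f"{prefix}{i:0{width}d}{suffix}")
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts


if __name__ == "__main__":
    nodes = expand_nodelist(os.environ.get("SLURM_JOB_NODELIST", "gpu[01-03,07]"))
    # The first host is a common choice for the rendezvous (master) address.
    print("MASTER_ADDR candidate:", nodes[0])
    print("\n".join(nodes))
```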
[](https://substackcdn.com/image/fetch/$s_!1YkV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e14b4a-62b3-45c7-aee1-abea36895992_790x481.png)Source: SemiAnalysis using GMO convenience scripts In addition to these convenience scripts, a few other usability issues arose. GMO did not configure topology.conf, relying on a rationale that since they manually allocate customer clusters to servers that do not span across different Spectrum-X leaf switches, and organize everything by known hostnames, they are able to make topology awareness at the slurm level redundant. We think this points to a lack of experience running large customer clusters, and handling hardware failures in large multi-tenant environments. The theme of forcing non-standard workflows due to a focus on security continued with their containerization strategy. The environment lacks support for Pyxis and Enroot, effectively blocking teams that have standardized on Docker-based containers. Users are required to rebuild their entire workflow around Singularity, a relatively significant undertaking that creates another barrier to entry for new users. Unfortunately, this focus on security can also fall short at a basic level, creating a strange paradox. On one hand, simple command-line tools with no known exploits are locked down. On the other, we found outdated packages, such as nvidia-container-toolkit versions 1.16.2 and 1.17.4 on login and compute nodes respectively. While GMO acknowledges these are flagged by their internal vulnerability scanners and slated for an update, the presence of old software vulnerable to 9.0 Critical CVE’s running on our brand-new cluster contrasts sharply with the user-facing restrictions. Overall, GMO’s approach feels like security theater to us. On the positive side, the base environment is well-configured for HPC tasks. 
The nodes come pre-installed and configured with HPC-X, NCCL, and nvcc, making it dead simple to build nccl-tests from source and run them at full expected bandwidth. We were also able to run torchtitan jobs at expected MFU. In addition, the standard dcgmi health -c program is configured properly as a Slurm HealthCheckProgram, addressing our background health check expectations. Finally, the platform lacks key observability and reliability features. There is no monitoring dashboard, though GMO states a Grafana-based solution is planned for a future release. For now, users must rely on basic Slurm email notifications for job status, and we could not identify any proactive health check system, placing the burden of failure detection largely on the user. Overall, GMO has established a clear advantage within Japan, especially with the region’s dependence on Slurm and other traditional HPC technologies. Support is strong, and we expect that example customers like Turing and the AI Robot Association (AIRoA) trust the offering as GMO Cloud is a leader domestically. We recommend that GMO focus on usability over security theatre, improve monitoring options for users via a custom Grafana dashboard, improve passive and active health checks, and consider developing a kubernetes offering in the future. --- # Vultr (Silver) > Vultr earns a ClusterMAX 2.0 Silver rating from SemiAnalysis. To kick things off, Vultr set the record for this round of ClusterMAX by bringing 12 people onto our kickoff call. 
Vultr raised money last year at a $3.5B valuation, including an investment from AMD Ventures, and this past summer also… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Silver - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/vultr - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/vultr/llm.txt - **Topics**: Vultr review, Vultr GPU cloud, Vultr ClusterMAX rating, Vultr Silver, Silver tier GPU cloud, GPU cloud review, neocloud review, Vultr B200, Vultr MI355X, B200 cloud, MI355X cloud, Kubernetes, Slurm, NCCL, ClusterMAX 2.0, SemiAnalysis To kick things off, Vultr set the record for this round of ClusterMAX by bringing 12 people onto our kickoff call. Vultr raised money last year at a $3.5B valuation, including an investment from AMD Ventures, and this past summer also got $329M of debt financing. As a result, Vultr now offers AMD MI355X GPUs (backstopped by AMD) and an expanding fleet of NVIDIA GPUs (including HGX B200), across some of their 32 global regions. [](https://substackcdn.com/image/fetch/$s_!aQk6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F134631bf-ecf8-4abf-9216-b4445ce7626e_936x598.png) When we started our testing, the Vultr SLURM service seemed brand new, like a second class citizen in the console. This was clear when we logged in too. The cluster was missing pyxis, hpcx, topology.conf, the default login user was “root” (with no default workdir). Most importantly, there was no shared home filesystem. We recommended some basic fixes, and quickly got going with an “ubuntu” user, with a default workdir switched to a shared /mnt/vfs. 
[](https://substackcdn.com/image/fetch/$s_!-YDx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb36e8c64-9698-483e-9d32-c10253107baf_477x244.png) [](https://substackcdn.com/image/fetch/$s_!0hbp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5661a17e-509b-4500-bc40-1a8c7ea32f01_694x314.png) Eventually, we were able to get nccl-tests at expected bandwidth, and some basic torchtitan training runs going at expected MFU. When we were handed our kubernetes cluster, we unfortunately got versions of the NVIDIA GPU Operator and Network Operator that were over 1 year old, meaning they were subject to three separate “critical” level CVEs, such as NVIDIAScape from Wiz. We recommended an upgrade, and the team mentioned they were “writing the jira for it”. During testing, we had some intermittent link flaps that eventually went away on their own. Unfortunately, there was no proactive notification or remediation of this, due to a lack of a monitoring dashboard and any active or passive health checks on the cluster’s interconnect. [](https://substackcdn.com/image/fetch/$s_!A7zn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5750b6c8-290f-4cfe-98de-4292ed80da61_935x800.png) After eventually getting nccl-tests to run at full bandwidth on the kubernetes cluster, we engaged with the support team to troubleshoot a training job on the cluster. One of the team members, Enis, was familiar enough with KubeFlow to get it installed and configure an example torchtitan training job to work on their network. We were impressed! 
[](https://substackcdn.com/image/fetch/$s_!4SbE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbd41255-92a5-4296-b599-aac95d1f3c9d_937x272.png)Source: a beautiful sight After shifting to inference, we saw a strong showing from VKE. The Vultr Cloud Controller Manager runs as part of Vultr’s managed control plane (not visible in the cluster), and handles automatic provisioning of resources like a LoadBalancer public IP. Reasonable default helm charts were installed, and it was easy to configure new ones, thanks to a default ReadWriteMany StorageClass being configured. Following our feedback, Vultr has joined the NVIDIA embargo program to ensure they are notified ahead of time for future security vulnerabilities. Vultr’s outreach to AMD’s Product Security Office seems to have motivated AMD to develop a similar security embargo program on their own. We appreciate Vultr’s commitment to improvement and the direct engagement from their engineers. We recommend that they work on developing a monitoring dashboard, active and passive health checks, and continue building experience operating large GPU clusters. --- # Voltage Park (Silver) > Voltage Park earns a ClusterMAX 2.0 Silver rating from SemiAnalysis. Voltage Park is a story of turnaround and redemption. If we were to have done this review in 2023 or 2024, the story would have been much different. 
The current Voltage Park is who we are rating, and the current Voltage Park is a… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Silver - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/voltagepark - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/voltagepark/llm.txt - **Topics**: Voltage Park review, Voltage Park GPU cloud, Voltage Park ClusterMAX rating, Voltage Park Silver, Silver tier GPU cloud, GPU cloud review, neocloud review, Voltage Park H100, H100 cloud, RoCE, Kubernetes, Slurm, ClusterMAX 2.0, SemiAnalysis Voltage Park is a story of turnaround and redemption. If we were to have done this review in 2023 or 2024, the story would have been much different. The current Voltage Park is who we are rating, and the current Voltage Park is a reasonably less weak provider focused exclusively on H100 GPUs. As of our testing, their on-demand capacity appears to be regularly sold out. The company is shipping features at a rapid pace, including a recently launched SLURM service and OIDC integration for Kubernetes. Voltage Park offers one of the lowest prices in the industry. Our initial experience with slurm included a lot of provisioning challenges, with multiple attempts being required to spin up our test cluster. Once provisioned, we would be load balanced to different login nodes, with the now-classic SonK issue of not being able to run code: no git, vim, nano, or sudo permissions. However, Voltage Park is the only provider with these SonK issues that seemed to be aware of them, suggesting a kubectl exec command to access the login pod, instead of the original ssh via public IP. 
While this container-first approach let us work around the initial root permission issue, it still takes time to install software, and if your connection gets reset, all software installs go away. In other words, login pods are stateless. The Voltage Park engineering team committed to building a new container image for login pods that included necessary software to run slurm jobs and edit code, and they delivered just that in under 24 hours. We were impressed by the commitment to customer support. [](https://substackcdn.com/image/fetch/$s_!nh9-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fc1fb9-c6d2-42d6-8bb7-f0de9e07b378_937x442.png)Source: Spinning up a SonK cluster in Voltage Park, right from the console Inside the intended container environment, the setup is more robust. We found a correctly configured topology.conf for network-aware scheduling, SLURM prolog and epilog scripts in place, and a modern container toolkit with pyxis and enroot installed. Interconnect performance was strong, running collectives at expected bandwidth. We also saw good download speeds, and a reasonably fast shared filesystem. Operationally, we encountered two major points of concern. First, Voltage Park’s dashboard has a “Shutdown” function that is distinct from “Terminate”. “Shutdown” halts the instances but continues to bill for the reserved capacity, a nuance that is not made sufficiently clear in the UI, and one that we expect is a disaster waiting to happen. Notably, not a single other provider offers these distinct “Shutdown” and “Terminate” options, and even after discussing the purpose of the “Shutdown” button with the Voltage Park team, the intended use case is still very confusing to us. We recommend Second, their process for handling hardware failures in on-demand clusters is manual, requiring operator intervention to cycle nodes out of a user’s cluster. 
This is a far cry from the automated, resilient systems offered by top-tier providers. This is also demonstrated by a lack of up-to-date security patches. The cluster was also pre-installed with an nvidia container toolkit version (1.17.4) that was out-of-date by 9 months, and as discussed previously in this article, vulnerable to CVE-2025-23266 (NVIDIAScape) and CVE-2025-23267, with CVSS scores of 9.0 (“Critical”) and 8.5 (“High”) out of 10 respectively. In conclusion, we believe that Voltage Park now has a solid technical foundation to carry forward and recover from reputational issues. We are encouraged by the execution of the technical team, and look forward to seeing more improvements in the future. --- # Tensorwave (Silver) > Tensorwave earns a ClusterMAX 2.0 Silver rating from SemiAnalysis. Tensorwave is a provider that recently raised a $100M Series A from AMD Ventures. As a result, they have an exclusive focus on AMD hardware, including 8,192 MI325X GPUs in their Tucson, Arizona datacenter. Since we love all GPUs and… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Silver - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/tensorwave - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/tensorwave/llm.txt - **Topics**: Tensorwave review, Tensorwave GPU cloud, Tensorwave ClusterMAX rating, Tensorwave Silver, Silver tier GPU cloud, GPU cloud review, neocloud review, Tensorwave MI325X, MI325X cloud, RoCE, Kubernetes, Slurm, DCGM, ClusterMAX 2.0, SemiAnalysis Tensorwave is a provider that recently raised a $100M Series A from AMD Ventures. As a result, they have an exclusive focus on AMD hardware, including 8,192 MI325X GPUs in their Tucson, Arizona datacenter. 
Since we love all GPUs and love AMD, we have been working with Tensorwave for a long time as they graciously provide us access to GPUs for benchmarking that is well beyond the scope of ClusterMAX. We are grateful for this support. Our testing on Tensorwave’s SonK platform has shown it to be largely unstable. The onboarding process is confusing, relying on Rancher’s RKE2 open-source kubernetes distribution (formerly RKE government), Longhorn for storage, and a modified version of Slinky for SonK (to get it to support AMD GPUs properly). To login to the cluster we initially had to escalate to sudo just to run basic kubectl commands and get a “slurm-login” convenience script working. It took a significant amount of back and forth with the Tensorwave team to get a working kubeconfig (notably, this is now easy to download from the console). We also ran into issues with permissions and user groups, which did not seem to be properly synchronized between the jump box and the Slurm login nodes. This issue has also been fixed since our testing period, but it is clear that there is limited experience getting an RBAC-scoped cluster working with an external IAM provider. In addition, the Slurm login node was missing the (now classic) tools we expect: vim, nano, git and sudo permissions to run apt install. However, in Tensorwave’s case, it only took a few hours for the team to modify the base container image to include these tools. We were impressed by this turnaround time. 
[](https://substackcdn.com/image/fetch/$s_!bUi8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8feae912-04d3-41c6-bae6-0f0236850397_936x561.png)Source: our Tensorwave Console In addition to access, there was no topology-aware scheduling in place, health checks were not integrated with Slurm for auto-draining nodes that fail a health check, and the monitoring dashboard was missing critical information about GPU and system health that is unique to AMD’s [RDC package](https://github.com/ROCm/rocm-systems/tree/develop/projects/rdc). While NVIDIA providers get a simpler foundation building on DCGM, Tensorwave has had to build a lot of this from scratch, since they are AMD exclusive. Most important, however, is reliability. During our testing, we have experienced a number of reliability issues, including some outages that stretch over multiple hours or days. In a two-month period, we have experienced 7 distinct interruptions: hardware and firmware issues on GPU nodes, a redeployment of Kubernetes, SonK/slurm-login connection issues, maintenance on Weka storage, maintenance on switches and routers, and even a power outage. Notably, none of these issues are directly related to AMD GPUs; they stem from the rest of the cluster and the facilities around the GPUs. To their credit, the Tensorwave team is always very responsive to our feedback and quick to address issues we raise. We have also seen a general trend of reliability improving over time. Overall, the fact that we have to provide guidance on proper Slurm setup, monitoring, and health checks points to a general lack of experience running multi-tenant clusters at the scale of 8,192 MI325X GPUs or larger. We look forward to collaborating more with Tensorwave over time as they build out more AMD GPU capacity. --- # GMI (Bronze) > GMI earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. GMI is our top Bronze neocloud that is just not quite there yet. 
The company shows promise, with recent developments like achieving security compliance and implementing confidential computing capabilities for H100 and H200 nodes. However,… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/gmi - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/gmi/llm.txt - **Topics**: GMI review, GMI GPU cloud, GMI ClusterMAX rating, GMI Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, GMI H200, GMI H100, H200 cloud, H100 cloud, InfiniBand, Kubernetes, Slurm, GPUDirect, ClusterMAX 2.0, SemiAnalysis GMI is our top Bronze neocloud that is just not quite there yet. The company shows promise, with recent developments like achieving security compliance and implementing confidential computing capabilities for H100 and H200 nodes. However, in our testing the slurm cluster was frankly unusable. We did not get access to a self-service console or monitoring dashboard of any kind, and it took over a month from our initial request, and multiple follow-ups, to finally log in. On the cluster, slurmctld was running directly on a compute node, and the environment was missing basic tools like docker and the modules utility. More critically, the cluster was provisioned without a shared home directory across nodes despite having VAST with POSIX/NFS and S3 options in the environment. After negotiating to get a shared fs configured on the cluster, we found that the performance was terrible. Basic file-saving operations and carriage returns in the terminal would take multiple seconds to complete or respond. 
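Sluggish small-file behaviour of this kind can be quantified with a simple probe. The sketch below is an assumption-laden illustration rather than the methodology used in our testing. It times small synchronous writes in a target directory, and the function name, probe file names, and default sizes are hypothetical.

```python
import os
import statistics
import tempfile
import time


def small_write_latency(directory, trials=20, size=4096):
    """Time `trials` small write+fsync operations in `directory`.

    Returns the median and maximum latency in seconds. Probe files are
    removed after each trial.
    """
    payload = os.urandom(size)
    samples = []
    for i in range(trials):
        path = os.path.join(directory, f".latency_probe_{os.getpid()}_{i}")
        start = time.perf_counter()
        with open(path, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # force the write through to storage
        samples.append(time.perf_counter() - start)
        os.remove(path)
    return {"median_s": statistics.median(samples), "max_s": max(samples)}


if __name__ == "__main__":
    # Point this at the shared filesystem mount; a temp dir is used as a stand-in.
    with tempfile.TemporaryDirectory() as target:
        print(small_write_latency(target))
```

Running the same probe against local NVMe and the shared mount gives a quick relative comparison. Latencies in the range of seconds for a 4 KB write would be consistent with the behaviour described above.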
[](https://substackcdn.com/image/fetch/$s_!vQpv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dd5b79-c08f-4724-b66a-96d52b6935bd_726x545.png) One of our GMI nodes, with 1.9TB of shared storage, and 27.9TB of local storage, matching NVIDIA’s DGX specification perfectly On the positive side, the underlying hardware appears to be configured correctly for high-performance workloads. A check for nvidia_peermem confirmed that GPUDirect RDMA is enabled, and the team confirmed that their interconnect network is built on InfiniBand with PKeys for network segmentation. We also found no evidence of active or passive health checks, and no monitoring dashboards were provided to give visibility into cluster state or job performance. In the future, when we can confirm that the Slurm offering is working well, development of monitoring dashboards is complete, active and passive health checks are in place, and a comprehensive Kubernetes offering is available, we expect GMI to be an obvious candidate to move into the silver tier. --- # STN (Bronze) > STN earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. STN is second in our list of providers that should be in the silver tier if our testing went better. 
By comparison, STN is similar to Cirrascale, which is to say STN offers dedicated managed services for clusters that are built-to-order… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/stn - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/stn/llm.txt - **Topics**: STN review, STN GPU cloud, STN ClusterMAX rating, STN Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, STN B200, B200 cloud, RoCE, Slurm, NCCL, DCGM, GPUDirect, ClusterMAX 2.0, SemiAnalysis STN is second in our list of providers that should be in the silver tier if our testing went better. By comparison, STN is similar to Cirrascale, which is to say STN offers dedicated managed services for clusters that are built-to-order for individual customers. There is no “public” cloud experience, and frankly not much about this is “cloud”. But customers who want a high-touch experience can get it here. In our testing, the STN platform is undermined by significant configuration errors and reliability problems, landing STN in our Bronze tier. Onboarding is entirely manual, requiring phone calls to review PDFs and set up accounts. We were given a 4-node B200 cluster with impressive hardware, including four network fabrics (RoCEv2 for interconnect and storage) and 25TB of VAST. However, this high-end hardware was let down by basic configuration mistakes. For example, we found seven local NVMe drives unmounted on each node. The Slurm environment was also missing key components for performance: no topology.conf, GPUDirect RDMA was disabled (nvidia_peermem not loaded), and MPI was not installed. Unfortunately, STN’s biggest weakness was reliability. 
During testing, we saw two different nodes go into a “down” state, one of which stayed “down” for over two days. Since the STN repair process is entirely manual, it requires customers to spot and report failures themselves. Notably, dcgmi health -c is enabled on the nodes, but it is not plugged into Slurm as a HealthCheckProgram. [](https://substackcdn.com/image/fetch/$s_!1LeD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05aabfe-462f-4d90-a316-c7fb3a17450d_936x230.png) [](https://substackcdn.com/image/fetch/$s_!1sdk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d4badd1-4890-4f36-885c-ba0fd43a9958_936x191.png)Checking in on our nodes in “down” state on different occasions We suggest that in the future, STN focus on actual cluster reliability instead of reporting fake “Uptime SLA” metrics to Grafana. [](https://substackcdn.com/image/fetch/$s_!UvoQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60568609-cfe9-4750-81cc-fc0647b7fd5d_937x457.png)Source: says that we have evaluated our own SLA and are approaching 100% Finally, getting jobs to run was a struggle. It took weeks for STN engineers to modify the cluster to include hpcx, nccl, and nvcc, enable GPUDirect RDMA, and turn off ACS so that we could get a basic nccl-test and torchtitan training job to run on four nodes. We also ran into what looked like network traffic shaping over the WAN that slowed down our downloads, but made speedtest-cli look great. With all this said, in our conversations with customers, STN has demonstrated that they have the capability to do deep, custom work for their customers. Going forward we suggest that STN work to automate a lot of its Slurm provisioning and health checks, and develop a comprehensive monitoring dashboard to improve reliability. 
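When failure detection falls on the customer, even a small script that scans node states helps. The sketch below is a hypothetical helper, not part of any provider's tooling. It parses text in the two-column form produced by sinfo -N -h -o "%N %T" and reports nodes in a down, drained, or failed state. The function name and the sample text are illustrative assumptions.

```python
import re

# States treated as unhealthy in this sketch (an illustrative selection).
BAD_STATES = {"down", "drain", "drained", "draining", "fail", "failing"}


def find_unhealthy_nodes(sinfo_output):
    """Return {node: state} for nodes whose state is in BAD_STATES.

    Expects lines of the form '<node> <state>'. Trailing flag characters
    on the state (for example the '*' on 'down*') are ignored when matching.
    """
    unhealthy = {}
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        node, state = fields
        base_state = re.sub(r"[^a-z]+$", "", state.lower())
        if base_state in BAD_STATES:
            unhealthy[node] = state  # duplicates across partitions collapse here
    return unhealthy


if __name__ == "__main__":
    sample = "gpu01 idle\ngpu02 down*\ngpu03 drained\ngpu04 allocated\n"
    for node, state in find_unhealthy_nodes(sample).items():
        print(f"{node}: {state}")
```

A cron job or monitoring exporter could feed real sinfo output into this function and raise an alert, rather than waiting for a user to notice a missing node.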
Until then, we feel that STN remains a high-risk choice and a Bronze-tier provider. --- # Prime Intellect (Bronze) > Prime Intellect earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. Prime Intellect is our favourite non-neocloud startup that happens to be a neocloud too. Prime is most well known for their decentralized training runs (INTELLECT) and synthetic dataset generation (SYNTHETIC). They have also… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/primeintellect - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/primeintellect/llm.txt - **Topics**: Prime Intellect review, Prime Intellect GPU cloud, Prime Intellect ClusterMAX rating, Prime Intellect Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, Kubernetes, Slurm, DCGM, ClusterMAX 2.0, SemiAnalysis Prime Intellect is our favourite non-neocloud startup that happens to be a neocloud too. Prime is most well known for their decentralized training runs (INTELLECT) and synthetic dataset generation (SYNTHETIC). They have also debuted an environments hub, quickly becoming the go-to place for researchers interested in open source RL environments. We absolutely love Prime’s open source contributions: Verifiers (a library for creating RL environments), PCCL (a library for running collectives over TCP/IP, i.e. on the WAN), and PRIME-RL (a framework for asynchronous RL at scale). For our testing, we were provided with a 4-node SLURM cluster, a feature that was still in beta at the time. We gave initial feedback on some configuration issues: no shared home directory, no passwordless ssh, and no preinstalled MPI, lmod, container toolkit, pyxis, or enroot. 
Initial attempts to launch batch jobs failed due to InvalidAccount errors, and distributed PyTorch runs via torchrun hung on hostname resolution, suggesting some networking issues. We also saw a lack of health checks or dcgmi integration, and no monitoring dashboard available. [](https://substackcdn.com/image/fetch/$s_!TcPW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7540cdec-e850-4878-8291-af81ab0fd471_937x490.png)Spinning up a slurm cluster on prime After these findings, the prime team was all over it and completely overhauled the configuration. In less than a day, they added passwordless ssh, docker with nct, enroot, pyxis, the nvidia hpc sdk (nvcc, mpirun, hpcx), aliasing for python3, preinstalled uv, and a custom controller-hosted nfs mount for /home directories. Prime Intellect’s responsiveness was the biggest thing we took away from the engagement. We were able to go from sending feedback in slack to a working cluster in a matter of hours. In the future, we look forward to further validation of their slurm offering for topology-aware scheduling, automated health checks, monitoring dashboards, and large-scale I/O performance. We are aware of some large customers that have taken the plunge, running clusters at the 1k GPU scale with Prime outside of their public console and marketplace. We are also very excited for the launch of a kubernetes offering, coming soon. Overall, if the team at Prime keeps up their relentless pace of shipping new features, we expect them to quickly move higher in the ClusterMAX rankings. --- # Neysa (Bronze) > Neysa earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. Neysa is an emerging provider operating in the Indian market. 
They have recently signed an MoU with NTT Data and the Telangana government to build a 400MW, 25k GPU facility in Hyderabad, and currently operate a fleet of H100, H200, and… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/neysa - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/neysa/llm.txt - **Topics**: Neysa review, Neysa GPU cloud, Neysa ClusterMAX rating, Neysa Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, Neysa H200, Neysa H100, Neysa MI300X, H200 cloud, H100 cloud, MI300X cloud, RoCE, Kubernetes, Slurm, ClusterMAX 2.0, SemiAnalysis Neysa is an emerging provider operating in the Indian market. They have recently signed an MoU with NTT Data and the Telangana government to build a 400MW, 25k GPU facility in Hyderabad, and currently operate a fleet of H100, H200, and soon MI300X AMD GPUs. However, our testing revealed that their current platform has gaps in security and usability when compared to international competitors. The onboarding process raised security concerns for us. Access is managed via username and password-based SSH, with manual IP address filtering and a fragmented user account system. We had no way to create new users for others on the team to test, implying that it would be difficult to support RBAC with an external IAM provider. The SLURM environment itself also suffered from basic configuration errors. Jobs failed to run initially as no default partition was configured, requiring manual specification for every submission. In addition, there was no topology.conf configured. If Neysa is going to run a 25k GPU cluster in the future, topology-aware scheduling is going to be critical. 
Also, monitoring and health checks are effectively non-existent. The provided Grafana dashboard was non-functional during our testing and appeared to be missing some expected exporters for health checks or performance monitoring to work. On a more positive note, the software stack for containerized workloads is modern. We found an up-to-date NVIDIA container toolkit, and both pyxis and enroot were installed. At the time of testing, Neysa did not have a Kubernetes offering available for us to test. We look forward to testing it in the future. We expect Neysa to benefit from compliance with Indian regulations such as the DPDP, but we find it unlikely that they are able to expand beyond their domestic market at this time. We encourage Neysa to improve their default experience: a better security posture, user management, proactive support experience, default monitoring systems, and health checks. [](https://substackcdn.com/image/fetch/$s_!5Z1W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F304d92fb-1a08-449f-9af4-1fcf61e70ce6_937x442.png) --- # Hyperstack/NexGen (Bronze) > Hyperstack/NexGen earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. Hyperstack has a snappy, easy-to-use web portal, where we can quickly and easily spin up and spin down GPU VMs across three regions: Norway, Canada, and the US.
They also plug in to external marketplaces such as… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/hyperstacknexgen - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/hyperstacknexgen/llm.txt - **Topics**: Hyperstack/NexGen review, Hyperstack/NexGen GPU cloud, Hyperstack/NexGen ClusterMAX rating, Hyperstack/NexGen Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, Hyperstack/NexGen H200, Hyperstack/NexGen H100, H200 cloud, H100 cloud, Kubernetes, ClusterMAX 2.0, SemiAnalysis Hyperstack has a snappy, easy-to-use web portal, where we can quickly and easily spin up and spin down GPU VMs across three regions: Norway, Canada, and the US. They also plug in to external marketplaces such as PaleBlueDot.AI, where we were able to test them again. [](https://substackcdn.com/image/fetch/$s_!YCOW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb116b492-30b7-47e1-b2f3-4a11e289ca61_937x500.png) However, in our direct testing we also found that either their flagship Kubernetes service is broken, or we got them on a bad capacity day. After waiting for three hours, we got a vague “reconcile failed” error. The system gave us no logs or details. Luckily, our account was not charged for GPU time during this attempt (unlike some other providers on our list, more on that later). [](https://substackcdn.com/image/fetch/$s_!M5yv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c9eb707-93bd-4748-8e36-05ea73a8308c_937x498.png) On our second attempt, with different GPUs, more progress was made. 
After 4 hours stuck in the “Creating” stage, it did look like some machines were created and public IPs were allocated to the cluster. [](https://substackcdn.com/image/fetch/$s_!afJ1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444f9277-628c-48c2-a56d-c4d24e83f31a_937x497.png) [](https://substackcdn.com/image/fetch/$s_!U2L_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19b91477-1c1a-4bd6-9284-4c34decac75a_937x486.png)Our kubernetes cluster, waiting for something to happen We encourage Hyperstack to fix this core problem with their kubernetes deployment workflow so that customers can reliably use its H100 and H200 GPUs in clusters. --- # Atlas Cloud (Bronze) > Atlas Cloud earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. Atlas Cloud presents a somewhat confusing picture. While their website points users towards a model playground and serverless endpoint environment, the core business is a typical bare metal wholesale provider. Atlas operates under… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/atlascloud - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/atlascloud/llm.txt - **Topics**: Atlas Cloud review, Atlas Cloud GPU cloud, Atlas Cloud ClusterMAX rating, Atlas Cloud Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, bare metal, ClusterMAX 2.0, SemiAnalysis Atlas Cloud presents a somewhat confusing picture. While their website points users towards a model playground and serverless endpoint environment, the core business is a typical bare metal wholesale provider. 
Atlas operates under the umbrella of a holding company called VCV Digital, with a sister company in Tiger DC, which is currently building a new datacenter in South Carolina. VCV Digital also owns the crypto company One Blockchain. The company acknowledges that the focus of Atlas is in transition, having shifted focus from the US to Asia and now back again. In our testing, we were unable to spin up/down a GPU Pod for testing. [](https://substackcdn.com/image/fetch/$s_!MviJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8960789-7b48-4ce8-af9d-29dbb4da61f2_936x494.png) With that said, we expect Atlas to continue to operate in the bare metal wholesale market going forward. We encourage Atlas to review our criteria and inform the development of core security, user management, networking, and storage services. Eventually, we would also encourage them to consider developing advanced monitoring, health checks, orchestration software and support to expand upon the bare metal wholesale business in their TigerDC sites and beyond. --- # BuzzHPC (Bronze) > BuzzHPC earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. BuzzHPC is the AI division of HIVE Digital Technologies (fka HIVE Blockchain), a crypto miner focused on cool climates with green energy (Canada, Iceland, Sweden).
HIVE pivoted into the AI cloud market in 2022 when they acquired a… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/buzzhpc - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/buzzhpc/llm.txt - **Topics**: BuzzHPC review, BuzzHPC GPU cloud, BuzzHPC ClusterMAX rating, BuzzHPC Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, Slurm, NCCL, DCGM, GPUDirect, ClusterMAX 2.0, SemiAnalysis BuzzHPC is the AI division of HIVE Digital Technologies (fka HIVE Blockchain), a crypto miner focused on cool climates with green energy (Canada, Iceland, Sweden). HIVE pivoted into the AI cloud market in 2022 when they acquired a 50MW facility in New Brunswick, Canada from GPU Atlantic, aka gpu.one. [](https://substackcdn.com/image/fetch/$s_!jTgs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faebb0014-9b67-4425-a79e-4b4f91668cd3_936x511.png)Source: our sources In our testing, the Slurm cluster we got had almost everything wrong with it that we’ve seen in this testing, all at once. It was almost impressive. Here is a list: * initially, no control plane machine * initially, no NFS mount, and then the user’s default workdir was not on the shared filesystem * initially, no passwordless ssh between nodes * docker and the nvidia container toolkit not installed on the worker nodes * modules not installed, also no hpcx, nccl, nvcc * no pyxis or enroot * dcgmi background health checks not installed or enabled * no prolog or epilog configured, no active health checks * no monitoring dashboard To get around all of this, we ran a 2-node nccl test with the pytorch-bundled libnccl.
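For readers who want to sanity-check a similar run, the arithmetic behind an all-reduce bandwidth number can be sketched as follows. This is a minimal sketch, not our test harness: the bus-bandwidth correction factor 2(n-1)/n follows the convention documented by nccl-tests, while the message size, timing, and rank count in the example are hypothetical inputs rather than our measurements. The NIC count and speed (eight 400 Gb/s adapters per node) match what was reported for this cluster.

```python
def allreduce_bandwidth(message_bytes, seconds, n_ranks):
    """Return (algbw, busbw) in GB/s for one all-reduce, nccl-tests convention."""
    algbw = message_bytes / seconds / 1e9          # bytes moved per second, per rank
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks    # normalizes for the number of ranks
    return algbw, busbw


def node_nic_bandwidth_gbytes(nics_per_node, gbits_per_nic):
    """Aggregate NIC line rate of one node in GB/s (8 bits per byte)."""
    return nics_per_node * gbits_per_nic / 8


if __name__ == "__main__":
    # Hypothetical measurement: 8 GB message, 62.5 ms per iteration, 16 GPUs on 2 nodes.
    algbw, busbw = allreduce_bandwidth(8e9, 0.0625, 16)
    ceiling = node_nic_bandwidth_gbytes(8, 400)
    print(f"algbw={algbw:.1f} GB/s busbw={busbw:.1f} GB/s NIC ceiling={ceiling:.1f} GB/s")
```

Comparing the measured bus bandwidth against the per-node NIC ceiling is a quick way to tell a tuning problem from a configuration problem: a result an order of magnitude below the ceiling usually points at the latter.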
Unfortunately, we did not see the expected bandwidth (we saw about 10x lower than expected). This was weird, because ibstat showed 8x 400Gb CX-7 in the nodes. We quickly confirmed two causes: GPUDirect RDMA was not installed, and ACS was not turned off. [](https://substackcdn.com/image/fetch/$s_!b_t_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F855a2823-51fe-492e-aad4-1d47a4f19b6f_936x478.png)The BuzzHPC Console To their credit, the BuzzHPC team was responsive and worked with us over several days to resolve some of the issues we identified. They’ve also committed to building our feedback into their default slurm offering going forward. However, even after these fixes, the cluster does not meet the standards we expect for usability, monitoring and health checks. It seems that BuzzHPC’s platform is still actively in development. We look forward to seeing more from BuzzHPC in the future. --- # Shadeform (Bronze) > Shadeform earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. Shadeform operates as a marketplace for GPU compute rather than a direct provider, owning no GPUs for themselves.
Unlike other marketplaces, Shadeform has a lean team and only $2 million in funding, and focuses strictly on their… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/shadeform - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/shadeform/llm.txt - **Topics**: Shadeform review, Shadeform GPU cloud, Shadeform ClusterMAX rating, Shadeform Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, SOC2, HIPAA, Kubernetes, Slurm, ClusterMAX 2.0, SemiAnalysis Shadeform operates as a marketplace for GPU compute rather than a direct provider, owning no GPUs for themselves. Unlike other marketplaces, Shadeform has a lean team and only $2 million in funding, and focuses strictly on their software and brokering deals. Their platform offers a transparent view of available GPU instances, uniquely identifying the underlying provider for each machine, such as Verda (formerly Datacrunch), Lambda, Voltage Park, Hydra Host, Digital Ocean and Nebius. This transparency extends to surfacing compliance information like SOC2 Type II and HIPAA certifications. [](https://substackcdn.com/image/fetch/$s_!8VZx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F865d2258-5385-470a-a5a4-632ee5bde9b9_936x501.png)Source: Hovering over the Nebius icon to check its stack of compliance certifications on the Shadeform console In fact, as of our testing, Shadeform is providing access to GPUs from 23 different providers, the most we have seen on any marketplace. Interestingly, Shadeform’s primary business has transformed, with them now entering into the wholesale bare-metal market as a broker.
This means that a significant portion of their revenue comes via the negotiation and structuring of large-scale cluster deployments for their clients, particularly in the Asia-Pacific region (Taiwan, Japan, India). The Shadeform website has become a valuable discovery tool for this purpose, and is often the first place neoclouds have their GPUs appear publicly. Shadeform also appears to be the only partner currently in place for NVIDIA’s Brev offering, which we describe later in this article. With all that said, without a comprehensive Slurm or Kubernetes offering that we can test, no monitoring dashboards, and no way to do active/passive health checks on the underlying provider’s machines, we think it will be difficult for Shadeform to move beyond the bronze category. We look forward to testing cluster products and finding ways to evaluate the brokering services in the future. --- # Runpod (Bronze) > Runpod earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. Runpod manages a significant fleet of over 20,000 GPUs, with users all over the world. However, their fundamental architectural choice to put every user inside a “pod” (container) severely limits their ability to service large scale… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/runpod - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/runpod/llm.txt - **Topics**: Runpod review, Runpod GPU cloud, Runpod ClusterMAX rating, Runpod Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, Kubernetes, Slurm, NCCL, bare metal, ClusterMAX 2.0, SemiAnalysis Runpod manages a significant fleet of over 20,000 GPUs, with users all over the world. 
However, their fundamental architectural choice to put every user inside a “pod” (container) severely limits their ability to service large scale training, inference, and any enterprise workloads. In our testing, the container-centric design prevents the use of standard HPC and MLOps tools, such as running Slurm with Pyxis or Enroot for containerized MPI jobs, performing active health checks on the underlying bare-metal infrastructure, or using Kubernetes. In our testing of Runpod’s Slurm offering (still in Beta), we initially used a cluster directly from another provider, FarmGPU, and gave feedback on a number of issues we found. The Runpod technical team was responsive, took the feedback, and committed to actively incorporate this feedback in their next development cycle. A few weeks later, different Runpod team members insisted that we re-test with a different bare metal provider, directly from their console. While we appreciate their engagement, all the core issues we found on the first round of testing remained. The default user is root, with no way to add additional users, enforce RBAC, or use an external IAM provider. The default home directory (~) is not on a shared filesystem, forcing users to navigate to a separate /workspace directory. More critically, the environment lacks essential tooling. We found no pre-installed MPI, and initial attempts to run MPI-based jobs using srun failed due to a required hostfile modification, specifying external container hostnames and routes, since these are not updated in DNS or standard IPs. 
Specifically, we had to export `NCCL_SOCKET_IFNAME="ens1"` because it was not pre-populated in /etc/nccl.conf, export `HF_HOME=/workspace/.cache/huggingface` because /root is the default workdir, not /workspace, run `head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" ip addr show ens1 | grep "inet " | awk '{print $2}' | cut -d'/' -f1)`, and include `--hostfile hostfile` in mpirun commands, instead of the much simpler options available on standard clusters. Even with knowledge of these custom approaches going into the second round of testing, the workflow is still poorly documented and clearly a beta feature. On monitoring and health checks, we expect it will continue to be difficult for Runpod to ensure the reliability and performance required for large scale training. We have heard from multiple Runpod customers that, since Runpod does not explicitly state which underlying hardware provider you’re going to land on (aside from specifying a “region”, and a binary “secure” or “community” cloud), they effectively feel like they’re spinning a roulette wheel to try and “get a good pod”. In other words, users waste a bunch of time spinning up/down pods based on their perception of quality, because price-per-value information is not available to them in the console. [](https://substackcdn.com/image/fetch/$s_!IPz7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4b9378-bfa4-4d7f-9b7f-75536fd451c9_935x594.png)Source: looking at some European regions on the runpod console Overall, we expect Runpod will continue to serve a niche market that values its simplified, container-first approach, but it will struggle to make progress against our criteria without a fundamental change to their architecture. --- # Verda/DataCrunch (Bronze) > Verda/DataCrunch earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. Verda (formerly DataCrunch) is based in Finland, with datacenters in both Finland and Iceland.
When logging in, Verda provides a nice clean console, making provisioning quite straightforward. Their “Instant Clusters” feature… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/verdadatacrunch - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/verdadatacrunch/llm.txt - **Topics**: Verda/DataCrunch review, Verda/DataCrunch GPU cloud, Verda/DataCrunch ClusterMAX rating, Verda/DataCrunch Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, Verda/DataCrunch B200, B200 cloud, Kubernetes, Slurm, NCCL, DCGM, ClusterMAX 2.0, SemiAnalysis Verda (formerly DataCrunch) is based in Finland, with datacenters in both Finland and Iceland. When logging in, Verda provides a nice clean console, making provisioning quite straightforward. Their “Instant Clusters” feature was easy to use and spun up a slurm cluster in minutes. We were also impressed by the completeness of their Slurm implementation, which stands in stark contrast to many other providers on the bronze or even silver tier of this list. From this experience, it seems like they have battle-tested the offering with customers, despite it still being labelled as “Beta”. [](https://substackcdn.com/image/fetch/$s_!Otg1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3bd5227-745d-4b79-a84e-9a4f1629b40d_936x514.png)Source: nice, intuitive setup for spinning up a cluster. Specifically, the B200 cluster we got had everything we expected: pyxis, enroot, hpcx, nccl, nvcc, topology.conf, dcgmi health -c plugged into Slurm’s HealthCheckProgram. 
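As an illustration of what plugging `dcgmi health -c` into Slurm's `HealthCheckProgram` can look like, here is a minimal Python sketch. It is not Verda's implementation. It assumes the dcgmi report contains the text "Overall Health: Healthy" when all checks pass (verify this wording against your DCGM version), and it relies on the script itself draining the node via `scontrol`, since Slurm expects the health check program to take that action. The reason string and the use of the short hostname as the Slurm node name are also assumptions.

```python
import socket
import subprocess

# Assumed wording of a passing dcgmi report; confirm on your DCGM version.
HEALTHY_MARKER = "Overall Health: Healthy"


def is_healthy(report):
    """True if the dcgmi report text contains the assumed healthy marker."""
    return HEALTHY_MARKER in report


def drain_command(node, reason):
    """Build the scontrol call that drains a node and records why."""
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]


def main():
    try:
        result = subprocess.run(["dcgmi", "health", "-c"], capture_output=True, text=True)
        healthy = result.returncode == 0 and is_healthy(result.stdout)
    except OSError:
        healthy = False  # a missing dcgmi binary is treated as a failed check
    if not healthy:
        node = socket.gethostname().split(".")[0]  # assumes short hostname == node name
        try:
            subprocess.run(drain_command(node, "dcgmi_health_check_failed"), check=False)
        except OSError:
            pass  # scontrol not available on this machine
    return healthy


if __name__ == "__main__":
    main()
```

A script like this would be referenced from slurm.conf through `HealthCheckProgram` and scheduled with `HealthCheckInterval`; production setups typically add logging and more specific drain reasons.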
On monitoring, the Grafana dashboard included an interesting SSH command to retrieve the password, and was relatively well configured. Missing pieces related to job performance were minor, and we gave feedback on how to make some improvements beyond standard DCGM metrics, and display them in a meaningful way to users. The platform also still lacks any way to add users with RBAC enforced at the storage or slurm level. Overall, with working B200 instances available, and a comprehensive slurm install, our initial impression was that Verda had made significant improvements from our last round of testing. However, this solid software foundation is still undermined by significant issues on the business and operations side. We have heard about reliability issues from various Verda customers, both at the hardware level and with respect to their WAN connectivity. Specifically, Verda customers have told us that entire sites can go dark with no explanation. While things like this happen, the more serious issue is the response. Unfortunately, we have seen Verda charge their customers for GPU time even when instances are down or entire sites are inaccessible. To us, this is an offensive business practice. Our basic expectation for all cloud providers is to commit to their SLAs in written form, with penalties in the form of credits or deductions off a customer’s monthly bill in the event of a breach. Not upholding a written SLA undermines many of the technical benefits and attractive pricing that we have seen from Verda during our testing. > Note: since publishing this article we’ve discussed this issue in detail with Verda. Verda is committed to compensating any customers who experience downtime with at least 2x the cost of running any instances in the form of a credit. Customers contacting technical support via chat get an automatic message stating that all downtime will be compensated.
Verda typically issues refunds within 24 hours of the downtime occurrence during weekdays, and on the following Monday for downtimes occurring during weekends. For customers billed monthly, any downtime or defects are compensated by subtracting the corresponding amount from the monthly invoice. > > Frankly, we think this is an excellent response. > > In general, SemiAnalysis recommends that customers be sure to keep server logs, screenshots, and other information readily available if they are pursuing downtime claims from their provider. At this time we have only seen gold or platinum tier providers proactively issue credits without customers asking for them. Overall we recommend that Verda shore up their reliability challenges, finalize their slurm offering currently in beta, improve the monitoring dashboard, and continue development of their kubernetes offering. We look forward to seeing more from Verda in the future. --- # Digital Ocean (Bronze) > Digital Ocean earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. Digital Ocean is another case of a traditional cloud provider attempting to get into the GPU game. 
However, with standard pricing of $3.44 per H200-hr, no slurm and no kubernetes, we expect it will be difficult for them to… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/digitalocean - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/digitalocean/llm.txt - **Topics**: Digital Ocean review, Digital Ocean GPU cloud, Digital Ocean ClusterMAX rating, Digital Ocean Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, Digital Ocean H200, H200 cloud, Kubernetes, Slurm, ClusterMAX 2.0, SemiAnalysis Digital Ocean is another case of a traditional cloud provider attempting to get into the GPU game. However, with standard pricing of $3.44 per H200-hr, no slurm and no kubernetes, we expect it will be difficult for them to compete for business where the customer is not already locked into their ecosystem. [](https://substackcdn.com/image/fetch/$s_!-See!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87f4c53a-e6c3-4a3f-a990-607ebc7d5338_937x652.png)Trying to create a “GPU Droplet” While we were unfortunately unable to create a GPU instance directly on the Digital Ocean console, we were able to access a machine via PaleBlueDot, a marketplace discussed later in this article. The single machine showed reasonable performance, but without the ability to create clusters, and with no shared storage, high performance networking, monitoring, or health checks in place, it is difficult for us to recommend using Digital Ocean GPUs for anything more than a bare minimum developer machine. --- # IBM Cloud (Bronze) > IBM Cloud earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis.
IBM Cloud is our last general-purpose cloud getting into the GPU game. IBM falls victim to the traditional enterprise hubris that leads companies to be opinionated about things that the market has already decided on. Instead of… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/ibmcloud - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/ibmcloud/llm.txt - **Topics**: IBM Cloud review, IBM Cloud GPU cloud, IBM Cloud ClusterMAX rating, IBM Cloud Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, IBM Cloud H100, H100 cloud, RoCE, Kubernetes, Slurm, ClusterMAX 2.0, SemiAnalysis IBM Cloud is our last general-purpose cloud getting into the GPU game. IBM falls victim to the traditional enterprise hubris that leads companies to be opinionated about things that the market has already decided on. Instead of Slurm, IBM pushes you to use LSF. Instead of Weka or VAST, IBM pushes you to use Spectrum Scale (their GPFS). Instead of kubernetes, IBM pushes you to use OpenShift. This is all supposed to be to your benefit, Mr. Customer, because IBM knows better than you. Except for the fact that it is not, and they don’t. [Even the IBM AI Research division uses SLURM over LSF](https://github.com/foundation-model-stack/fms-fsdp/blob/main/scripts/train.slurm). 
Unfortunately, when we tried to line up testing with IBM, they went so far as to deactivate our account and block us from making new sign-ups. [](https://substackcdn.com/image/fetch/$s_!5PiB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fa58846-1f43-4d61-8698-dfefd7c5f257_937x528.png)IBM blocking us from testing their services :( [](https://substackcdn.com/image/fetch/$s_!n2RA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2919c64-8af9-4c2f-bce6-2bd5f893e130_893x282.png)More obscure errors Even when trying to circumvent this verification process, IBM’s Account Verification team (a different team than the Analyst Relations and Product Management team we were originally working with) called the cell phone number included in the account sign up process and pestered us with questions about what we were doing on the platform. “Research” was not good enough; we needed to explain exactly what we were trying to do with the new account.
[](https://substackcdn.com/image/fetch/$s_!Ujkf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19449cdd-1fbe-4fc1-b650-d0b5d7e63147_936x495.png)Learning that a GPU is “extra brain power” the CPU lacks [](https://substackcdn.com/image/fetch/$s_!D2Gg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb48203b-36be-4c02-81ab-ca910353a052_936x495.png) Though there are lots of promotions available via coupon code, in the default region of Frankfurt it is hard to justify $12.25 per H100 GPU, per hour… [](https://substackcdn.com/image/fetch/$s_!eiym!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c004f5f-29f3-4be3-ad88-a0a631ce92d4_936x495.png) With all that said, it did take us only 45 seconds to spin up a new machine, a little bit longer to assign a floating IP, and access it. NVIDIA drivers, docker, and the nvidia container toolkit are not pre-installed in the base image, causing a bit more headache before we could get started with testing. But it worked. We were about halfway through a simple download speed test using docker when IBM once again found our account and shut us down. We maintain IBM’s rating as a bronze tier provider until we are able to test their services in the future. --- # Hot Aisle (Bronze) > Hot Aisle earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. Hot Aisle is an AMD-exclusive neo cloud offering MI300X GPUs in 1-way VMs or 8-way bare metal nodes, on-demand, at a competitive price. Recently, they completed a SOC 2 Type I attestation and achieved HIPAA compliance. 
Interestingly,… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/hotaisle - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/hotaisle/llm.txt - **Topics**: Hot Aisle review, Hot Aisle GPU cloud, Hot Aisle ClusterMAX rating, Hot Aisle Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, Hot Aisle MI300X, Hot Aisle MI325X, Hot Aisle MI355X, MI300X cloud, MI325X cloud, MI355X cloud, SOC 2, HIPAA, Kubernetes, Slurm, bare metal, ClusterMAX 2.0, SemiAnalysis Hot Aisle is an AMD-exclusive neo cloud offering MI300X GPUs in 1-way VMs or 8-way bare metal nodes, on-demand, at a competitive price. Recently, they completed a SOC 2 Type I attestation and achieved HIPAA compliance. Interestingly, they have also released 2-way and 4-way VMs, with the AMD xGMI interconnect passed through and available. This is a unique and competitive offering for developer machines. We know of a few users in the open-source ecosystem that test on Hot Aisle due to flexibility and representative, real-world performance. [](https://substackcdn.com/image/fetch/$s_!vF5O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39aaf715-ee9d-4fbc-a80b-048d800abeca_936x866.png)AMD’s xGMI interconnect available between 4x GPUs At this time, Hot Aisle does not have shared storage, monitoring dashboards, health checks, modern security practices, RBAC, vertically integrated support, or the ability to run at scale (i.e. anything more than 2 or 4 machines at a time). They claim to have slurm or kubernetes on their website, but when we reached out for help it was not set up.
The first time we tried to test, Hot Aisle was unavailable and did not have any bare metal servers or virtual machines available for us to use, though we were able to grab some later on. [](https://substackcdn.com/image/fetch/$s_!EW8Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e32b293-7f4c-4e30-9ee1-26040902054b_935x318.png)Business is booming – only 3 GPUs available! [](https://substackcdn.com/image/fetch/$s_!B26d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2ab1165-f4a9-4bb7-b16d-12bad12369b7_936x459.png) It is difficult to understand how Hot Aisle will make progress beyond providing cheap MI300X while the market moves to MI325X and MI355X. While other providers have been fully set up and offering MI355X access to paying customers since September, Hot Aisle’s MI355X may not arrive until the end of this year, or even early next year. Focusing on individual developer machines instead of clusters is a niche, focusing on AMD GPUs instead of NVIDIA GPUs is a niche, and focusing on the MI300X instead of the MI325X or MI355X is a niche. So, a niche of a niche of a niche market. --- # Vast.ai (Bronze) > Vast.ai earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. Vast.ai (not to be confused with Vast Data, the storage provider) operates as a GPU marketplace, not a direct provider.
The platform claims SOC2 compliance is in place directly, and some other notable security information:… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/vastai - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/vastai/llm.txt - **Topics**: Vast.ai review, Vast.ai GPU cloud, Vast.ai ClusterMAX rating, Vast.ai Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, SOC2, Kubernetes, Slurm, managed Slurm, ClusterMAX 2.0, SemiAnalysis Vast.ai (not to be confused with Vast Data, the storage provider) operates as a GPU marketplace, not a direct provider. The platform claims SOC2 compliance is in place directly, and some other notable security information: . They also state that many underlying datacenter providers are ISO27001 compliant, a notable step up from an aggregator, but still less introspection on who the underlying provider is than we would like since generally only the location is described. Vast does provide ways for users to track datacenters by IDs, and toggle for a “secure cloud”, but does not expose who the underlying provider actually is outside of the country. Clusters are available on a per-request basis and we have not been able to try one out. Users can pay for their GPUs with stripe (credit card), coinbase, or crypto.com. Our testing experience confirms the platform’s architectural focus: it is almost exclusively designed for containerized workloads, heavily pushing users toward Jupyter notebook environments. Gaining basic SSH access required manual configuration steps, and once connected, it was clear we were operating inside a container, not a VM or bare-metal host. 
This container-only model immediately precludes standard multi-node orchestration like Slurm or native Kubernetes, though cluster-on-demand requests are available. [](https://substackcdn.com/image/fetch/$s_!rvPX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79f87294-7c31-4830-b15f-0d81d3f85441_936x585.png) Reliability and performance are unpredictable, which is characteristic of the aggregator model. Our first test instance was provisioned in Czechia, hosted by an unknown underlying provider (though an IP lookup suggests E-Infra or Zoner Cloud). While this instance was functional, the marketplace model means users are rolling the dice on many qualities with every deployment. Without the ability to test managed Slurm/Kubernetes, multi-node clusters, or review monitoring and health checks, Vast.ai remains a platform primarily built for individual developers and hobbyists. --- # CUDO Compute (Bronze) > CUDO Compute earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. CUDO Compute was founded in 2017 and like many others on our list began its journey as a crypto miner, albeit at a modest scale. 
CUDO now operates a global partner network of data centers, including a recently announced… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/cudocompute - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/cudocompute/llm.txt - **Topics**: CUDO Compute review, CUDO Compute GPU cloud, CUDO Compute ClusterMAX rating, CUDO Compute Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, RoCE, ISO 27001, Kubernetes, Slurm, bare metal, managed Slurm, ClusterMAX 2.0, SemiAnalysis CUDO Compute was founded in 2017 and like many others on our list began its journey as a crypto miner, albeit at a modest scale. CUDO now operates a global partner network of data centers, including a recently announced partnership with CanopyCloud.io to expand their datacenter network globally. Our hands-on experience started with their web console, which offers a highly configurable and project-based approach to organizing resources across global datacenters. Today, interconnected nodes are only available in Dallas, while 8-way bare metal servers are available in Paris, Stockholm, or Kristiansand, Norway. In total, GPU VMs are available in 6 of 10 global datacenters, with the other 4 providing CPU-only VMs. We decided to grab our first ever African GPU VM, via CUDO’s datacenter in Centurion, South Africa. [](https://substackcdn.com/image/fetch/$s_!kMyB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e504a74-35b5-4bc5-a6a9-bc0b30051572_936x510.png)Spinning up a VM on CUDO Compute Spinning up a virtual machine was straightforward, with the provisioning process taking under 4 minutes and the console providing easy ssh key management.
Usefully, we could configure a shared disk in the datacenter location we were using, meaning local data can be reused between VM spin-up and spin-down cycles. However, the 200GB disk we deployed is not a filesystem volume, and is not mounted and visible to the OS image by default. We would prefer a shared filesystem volume that could be mounted to multiple machines, a capability that requires similar underlying functionality on the server side to deliver. We also found it unfortunate that we were logging into the VM as a shared root user, instead of passing RBAC-enforced auth credentials from the console to the underlying VMs. Furthermore, the base Ubuntu image was not AI-ready out of the box. The driver version and nvidia container toolkit version provided were significantly out of date (meaning insecure). The OS image was also missing pip/pip3, and the python3 binary was not aliased to python, requiring extra steps to set up a basic virtual environment for development. Crucially, CUDO Compute does maintain ISO 27001 compliance with underlying datacenters, a key security attestation that many similar providers lack. Overall, CUDO Compute has a promising foundation with a flexible, easy-to-use console and global reach. However, the platform is not ready for large scale training and inference due to a lack of managed slurm or kubernetes services, shared file storage, monitoring dashboards, health checks, and any sort of proactive, enterprise support options. We recommend that CUDO focus on refining their base machine images for ease-of-use, consider deploying shared file storage, and continue building experience at the orchestration layer for slurm and kubernetes clusters in the future. --- # Lightning.ai (Bronze) > Lightning.ai earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. Lightning.ai (aka Lightning Cloud) is a broker for GPU machines in neoclouds and hyperscalers that provides useful MLOps features on top.
The founding story of Lightning Cloud begins with the development of PyTorch Lightning, an… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/lightningai - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/lightningai/llm.txt - **Topics**: Lightning.ai review, Lightning.ai GPU cloud, Lightning.ai ClusterMAX rating, Lightning.ai Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, Lightning.ai H200, H200 cloud, Kubernetes, Slurm, ClusterMAX 2.0, SemiAnalysis Lightning.ai (aka Lightning Cloud) is a broker for GPU machines in neoclouds and hyperscalers that provides useful MLOps features on top. The founding story of Lightning Cloud begins with the development of PyTorch Lightning, an open-source framework that organizes and simplifies boilerplate PyTorch code such as the training loop, logging, checkpointing, and distributed training. The lightning git repo seems to be the #1 way top-of-funnel sales start for Lightning Cloud. Fast forward to today, with LLMs on the rise, there is a split in the market. Older frameworks like NVIDIA NeMo use Lightning under the hood, while new frameworks that we use in our testing such as torchtitan, verifiers and Megatron-LM do not. The open source `pytorch-lightning` and `lightning` packages are still growing rapidly: [](https://substackcdn.com/image/fetch/$s_!Lt7_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c9e41f3-542b-4c43-9248-19ccd74462dd_1101x633.png)Source: Lightning.ai, data from pypi Functionally, the Lightning Cloud product offers a simple way to track who’s using what across multiple clouds. 
We had a chance to test the Lightning Studio, which provides access to GPUs in a browser (VSCode, Jupyter notebook) or remote SSH (VSCode, Cursor, Windsurf, etc). Users can also submit batch jobs and “mmt” (multi-machine training) jobs to individual machines or clusters that they get access to on demand. Our testing of clusters is coming soon. [](https://substackcdn.com/image/fetch/$s_!C0kk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8800c025-6159-4a42-a150-476d1fcf6641_937x494.png)Source: our lightning.ai homepage Notably, these multi-GPU studios, batch jobs, and mmt training jobs are restricted to users on a Pro, Teams or Enterprise Custom payment tier. Lightning is the only neocloud we have seen charging a per-seat price, and translating that into GPU-hrs behind the scenes on clusters that they manage for the customer. [](https://substackcdn.com/image/fetch/$s_!uxyB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecaa1389-1d7e-4182-8bcc-0736c661011b_1224x694.png)Source: lightning.ai/pricing Interestingly, there is an easy way to attach/detach GPUs to existing “studios” (i.e. notebooks or remote shells) and auto-sleep them if unused. This means that users only pay for what they use. Lightning also forecasts the wait times associated with spinning up a GPU from a given provider, such as AWS, Google, Lambda, Voltage Park, or Nebius. The worst wait time is for an 8x H200 machine in AWS, estimated at 3hrs. Unfortunately, despite what the website says, there are no GPUs available from NScale.
[](https://substackcdn.com/image/fetch/$s_!sPmD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d6ff237-f89c-463e-9deb-8eba5bad656a_937x494.png)Using a VSCode notebook in Lightning.ai Another piece that jumped out during testing is that notebooks have full CLI access including docker, meaning the notebook is running directly on a VM under the hood. This leaves users with full flexibility in the environment. Overall we have our doubts about the utility of remote developer environments where cluster access is abstracted away from users, especially at the high-end of the market. The largest buyers of GPU compute do not have a problem spinning up a notebook on kubernetes with a simple manifest.yaml, or accessing a single machine via `srun -N1 --gpus-per-node=8 --pty bash` in a slurm cluster. We find it hard to see a path forward for Lightning Cloud if the industry moves beyond the lightning framework and the GPU marketplace business continues to focus on taking a margin on top of expensive hyperscalers, with no third party compute. As for the ClusterMAX rating system, we look forward to testing Lightning Cloud’s mmt training, and kubernetes in the future. We encourage Lightning to consider building a slurm offering, adding monitoring dashboards for underlying cluster health that integrates with job logs and performance profiling, adding integration with active/passive health checks on clusters, and customization options for high performance storage and networking. --- # Qubrid (Bronze) > Qubrid earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. Qubrid enters our ratings in the Bronze tier. The provider offers individual GPU instances (VMs) and bare-metal server rentals through a clean web console, with hardware ranging from H100s to the latest B200s.
Our hands-on testing of… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/qubrid - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/qubrid/llm.txt - **Topics**: Qubrid review, Qubrid GPU cloud, Qubrid ClusterMAX rating, Qubrid Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, Qubrid B200, Qubrid H200, Qubrid H100, B200 cloud, H200 cloud, H100 cloud, ClusterMAX 2.0, SemiAnalysis Qubrid enters our ratings in the Bronze tier. The provider offers individual GPU instances (VMs) and bare-metal server rentals through a clean web console, with hardware ranging from H100s to the latest B200s. Our hands-on testing of their individual VM offering was a mixed bag. On the positive side, the user experience for provisioning a single machine is straightforward. SSH and Jupyter access are easy to configure, and our B200 instance was provisioned and accessible in around 8 minutes. [](https://substackcdn.com/image/fetch/$s_!WNak!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43dcf57e-250a-4d1d-b9b5-bcfb77fc714b_936x499.png)Source: the Qubrid console However, this basic usability is undermined by the fact that Qubrid charges users while the machine is stuck spinning up. We can only speculate that the reason for this is that Qubrid is running on AWS hardware in Ashburn, VA (us-east) as confirmed by a basic IP test when we logged in, though they do not make this clear to their customers. Readers can see in the screenshot above that our B200 instance was provisioned with the CUDA 12.4 toolkit by default. While the driver was newer (12.6), this older toolkit just obviously doesn’t work with Blackwell hardware (i.e. 
SM100) which requires CUDA 12.8 or above. Finally, Qubrid’s business model feels more like a traditional server host than a flexible neocloud. Their pricing requires strict minimum commitments (e.g., 1-week for H100, 1-month for H200, and 3-months for B200) and they openly advertise yearly server rentals, despite not owning the hardware. We encourage Qubrid to fix its billing practices, address usability issues, and be more upfront about whose hardware they’re selling to customers. --- # Latitude.sh (Bronze) > Latitude.sh earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. Latitude.sh presents itself as a straightforward provider with bare metal or virtual L40S and H100 machines mostly located in Dallas, Texas. Upon logging in, the console is clean and well-organized. We appreciate the ability to… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/latitudesh - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/latitudesh/llm.txt - **Topics**: Latitude.sh review, Latitude.sh GPU cloud, Latitude.sh ClusterMAX rating, Latitude.sh Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, Latitude.sh H100, Latitude.sh L40S, H100 cloud, L40S cloud, Kubernetes, Slurm, bare metal, ClusterMAX 2.0, SemiAnalysis Latitude.sh presents itself as a straightforward provider with bare metal or virtual L40S and H100 machines mostly located in Dallas, Texas. Upon logging in, the console is clean and well-organized. We appreciate the ability to organize resources by project and apply tags for environments like ‘dev’ or ‘pre-prod’. Provisioning options are clear, and we spun up a machine in seconds.
An interesting and somewhat unique feature is the ‘Cloud Gateway’ service, which leverages Megaport to establish private connections to major public clouds. This could be a compelling offering for customers pursuing hybrid or multi-cloud strategies. Unfortunately, we had a couple issues during testing, where an L40S VM reported an NVML driver/library mismatch error, and an H100 VM’s driver provisioning just didn’t work properly. [](https://substackcdn.com/image/fetch/$s_!0cGs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc9497ec-7fe4-417e-9380-92957a039809_935x205.png) This sort of instability is part of the danger of using virtual machines in general. However, once we provisioned some new VMs, an L40S and H100 instance performed as expected, with the GPU immediately recognized by nvidia-smi out of the box. With that said the only base OS image available is “Ubuntu 24 ML-in-a-Box”, except it includes an out-of-date pytorch version, python3 without the python3-venv package, no alias for python, and no docker or nvidia-container-toolkit pre-installed. Beyond the individual instance issues, Latitude has no Slurm or Kubernetes offerings, no integrated monitoring dashboards, no shared storage options, and no health checks. For individual developers or small teams, Latitude.sh might offer a compelling price point. But for organizations seeking a production-ready cluster, the platform falls short. --- # Denvr Dataworks (Bronze) > Denvr Dataworks earns a ClusterMAX 2.0 Bronze rating from SemiAnalysis. During previous rounds of testing, Denvr Dataworks was a strong cloud provider with a promising future despite a part-time obsession with immersion cooling. 
To set aside the claims around the viability of immersion cooling on… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Bronze - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/denvrdataworks - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/denvrdataworks/llm.txt - **Topics**: Denvr Dataworks review, Denvr Dataworks GPU cloud, Denvr Dataworks ClusterMAX rating, Denvr Dataworks Bronze, Bronze tier GPU cloud, GPU cloud review, neocloud review, Denvr Dataworks H100, H100 cloud, Kubernetes, Slurm, ClusterMAX 2.0, SemiAnalysis During previous rounds of testing, Denvr Dataworks was a strong cloud provider with a promising future despite a part-time obsession with immersion cooling. To set aside the claims around the viability of immersion cooling on the technical side, it is also clear that claims regarding the usage of fresh water for cooling in datacenters are way overblown. This is described in detail in [this report](https://eta-publications.lbl.gov/sites/default/files/2024-12/lbnl-2024-united-states-data-center-energy-usage-report.pdf?utm_source=substack&utm_medium=email) and [this article](https://andymasley.substack.com/p/the-ai-water-issue-is-fake), which estimate that datacenters in the US used less than 0.15% of the nation’s freshwater last year, depending on how you count it. * 50M gallons per day if counting only cooling * 200-275M gallons per day if counting power but not dam reservoir evaporation * 628M gallons per day if counting evaporation from reservoirs used for hydro power Compared to the roughly 2 billion gallons of water per day used for golf course irrigation, the 50 million gallons per day on liquid cooling is about 2.4% of the total. 
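The comparison above is simple division, and it can be checked in a few lines. The sketch below uses only the gallons-per-day figures quoted in the text. The golf figure is the text's rounded "roughly 2 billion", so the cooling-only share comes out at 2.5%, close to the quoted 2.4% (which presumably rests on an unrounded golf figure). The names `GOLF_IRRIGATION_GPD`, `DATACENTER_ESTIMATES_GPD`, and `share_percent` are ours, chosen for illustration.

```python
# All figures are gallons per day, as quoted in the text.
# The golf figure is the text's rounded "roughly 2 billion".
GOLF_IRRIGATION_GPD = 2_000_000_000

# Datacenter water-use estimates under the three counting methods listed above.
DATACENTER_ESTIMATES_GPD = {
    "cooling only": 50_000_000,
    "cooling + power, low end": 200_000_000,
    "cooling + power, high end": 275_000_000,
    "including hydro reservoir evaporation": 628_000_000,
}


def share_percent(part: float, whole: float) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


for label, gallons in DATACENTER_ESTIMATES_GPD.items():
    pct = share_percent(gallons, GOLF_IRRIGATION_GPD)
    print(f"{label}: {pct:.1f}% of golf course irrigation")
```

Even the most expansive counting method in the list stays well below the golf irrigation figure, which is the point the paragraph is making.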
Unfortunately, it seems that much of the original Denvr team has now left the company and GPUs are inaccessible to us when using their website directly. [](https://substackcdn.com/image/fetch/$s_!6rqe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67f8f0d9-1ee2-4927-930b-eaa7ae6dbfe3_937x496.png)Can’t spin up a VM without a VPC [](https://substackcdn.com/image/fetch/$s_!cRIh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1866d60-824a-494d-96e0-16a973b5210b_937x496.png)Can’t make a VPC without talking to a sysadmin However, Denvr’s hardware has not disappeared entirely. Instead, it appears Denvr has pivoted to wholesale-only, surfacing its capacity through aggregators and marketplaces. During our testing of the Dstack Sky platform, our job was provisioned on a Denvr Dataworks machine in Houston, Texas via vast.ai. After getting through the turducken of ssh tunnels and into the dstack orchestrated, vast.ai deployed, container running on the Denvr server, we were able to successfully run a multi-GPU (2x H100) RL training job on this hardware. When it works, it works. In the future we look forward to revisiting the Denvr platform and testing out slurm or kubernetes offerings. --- # IREN/Iris Energy (Underperforming) > IREN/Iris Energy earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. IREN is one of the most aggressive crypto mining companies trying to convert their facilities into a neocloud. 
Unlike their competitors such as TeraWulf, Core Scientific, and Cipher Mining, which have all realized… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/irenirisenergy - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/irenirisenergy/llm.txt - **Topics**: IREN/Iris Energy review, IREN/Iris Energy GPU cloud, IREN/Iris Energy ClusterMAX rating, IREN/Iris Energy Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, NCCL, ClusterMAX 2.0, SemiAnalysis IREN is one of the most aggressive crypto mining companies trying to convert their facilities into a neocloud. Unlike their competitors such as TeraWulf, Core Scientific, and Cipher Mining, which have all realized significant value from their existing investments by pursuing the powered shell datacenter infrastructure business (aka colocation), IREN is intent on doing things the hard way and building a neocloud all by themselves, with no relevant experience on their team, now with nearly 100K GPUs committed for current and future customers. We tested IREN in March 2025 and found the service to be severely lacking, with multiple basic configuration errors on the hardware such as ACS not being disabled and GPU Direct RDMA not being enabled. In that test, our two-node NCCL run on the AllReduce collective showed IREN machines reaching around 129.27GB/s at a 128MiB message size, whereas the NVIDIA reference numbers and our results for top-tier neoclouds are at or above 300GB/s busBW.
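For readers unfamiliar with the metric, the nccl-tests suite reports two bandwidths: algorithm bandwidth (bytes moved divided by elapsed time) and bus bandwidth, which for all-reduce scales the former by 2(n-1)/n for n ranks so that results are comparable across cluster sizes. The sketch below shows that conversion and the size of the gap described above. It assumes the 129.27GB/s figure is a bus bandwidth like the reference it is compared against, and the 16-rank example (two 8-GPU nodes) and the 1ms timing are our assumptions, not details stated in the text.

```python
def allreduce_busbw_gbps(size_bytes: int, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s for an all-reduce, following the nccl-tests convention.

    algbw = bytes / time; busbw = algbw * 2 * (n - 1) / n.
    """
    algbw = size_bytes / seconds / 1e9
    return algbw * 2 * (n_ranks - 1) / n_ranks


def fraction_of_reference(measured_gbps: float, reference_gbps: float) -> float:
    """How much of the reference bandwidth a measurement achieves (0.0 to 1.0+)."""
    return measured_gbps / reference_gbps


# Figures from the text: measured result versus the lower bound of the reference.
MEASURED_GBPS = 129.27
REFERENCE_GBPS = 300.0
print(f"achieved {fraction_of_reference(MEASURED_GBPS, REFERENCE_GBPS):.0%} of reference")

# Hypothetical timing: a 128MiB all-reduce finishing in 1 ms across 16 ranks.
print(f"{allreduce_busbw_gbps(128 * 2**20, 1e-3, 16):.1f} GB/s busbw")
```

A result below half of the reference, as here, usually points to a configuration problem in the data path, not a marginal tuning issue.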
It was later confirmed to us by IREN engineers that the root cause was their team not disabling the ACS setting on the system’s PCIe switch, which meant the GPU couldn’t talk directly to the NIC but instead had to go through the root complex of the CPU. While this should be a simple fix, and checks and remediation are easy to automate with software, we have not been able to verify that IREN has made any changes. For this round of testing, IREN has claimed they have no capacity available to test for over 3 months straight. Recently, IREN has had some success, signing a $9.7 billion offtake deal with Microsoft targeting a portion of their 750MW site in Childress, Texas. It is known within the industry that IREN offers below market rate prices compared to providers in the ClusterMAX Silver, Gold, or Platinum tiers. We think the reason is twofold: * Cheaper-than-average cost structure, through ownership of the datacenter and site selection centered on areas with cheap power costs (typical in the Bitcoin mining business) * Inferior service quality, relative to the market average For a deeper analysis of the economics or IREN’s publicly announced AI cloud contracts, [our AI Cloud TCO Model is the best tool](https://semianalysis.com/ai-cloud-tco-model/). It is trusted by many major GPU buyers, as well as their financial sponsors. --- # Hydra Host (Underperforming) > Hydra Host earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. Hydra Host is another marketplace/broker that operates without any security compliance attestation and as a result has lost some opportunities with Fortune 500 customers.
Their Brokkr platform has been recently redesigned… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/hydrahost - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/hydrahost/llm.txt - **Topics**: Hydra Host review, Hydra Host GPU cloud, Hydra Host ClusterMAX rating, Hydra Host Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, Hydra Host B200, Hydra Host H200, Hydra Host H100, Hydra Host A100, Hydra Host L40S, B200 cloud, H200 cloud, H100 cloud, A100 cloud, L40S cloud, SOC2, ClusterMAX 2.0, SemiAnalysis Hydra Host is another marketplace/broker that operates without any security compliance attestation and as a result has lost some opportunities with Fortune 500 customers. Their Brokkr platform has been recently redesigned, which makes it easy to access GPUs from a wide variety of datacenters, though a lot of information is missing if you want to know exactly which provider you are renting GPUs from. During our test, many of the GPUs listed on the Brokkr platform were “at capacity”, even including the A100 GPU: [](https://substackcdn.com/image/fetch/$s_!Rzez!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64637967-950e-410a-9333-d7481a20bd73_937x496.png) In total, when we tested, Hydra was “at capacity” for the A4000, 3090, 5090, A10, A6000, GH200, A100, and B200. Available on demand were 8x 4090, 8/7/5/4x L40S, 16x V100, 8x H100 in Vietnam, and 8x H200 in India, Washington, or Japan.
[](https://substackcdn.com/image/fetch/$s_!I42k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec1a57b-e3bb-4291-a8d9-a98d1164799c_937x490.png) Unfortunately, in order to actually get access to one of these servers, Hydra forces users to pre-pay for a weekly bill, and promises to “refund for the unused portion” rather than running a truly on-demand experience. We look forward to testing Hydra’s white glove cluster product during ClusterMAX 2.1 following their pending SOC2 Type II compliance attestation. --- # FarmGPU (Underperforming) > FarmGPU earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. We tested FarmGPU via Runpod only, as they currently do not have a way to access their cloud services directly. We described our experience in detail in that section, where after some back and forth with the team we were… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/farmgpu - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/farmgpu/llm.txt - **Topics**: FarmGPU review, FarmGPU GPU cloud, FarmGPU ClusterMAX rating, FarmGPU Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, Kubernetes, Slurm, NCCL, ClusterMAX 2.0, SemiAnalysis We tested FarmGPU via Runpod only, as they currently do not have a way to access their cloud services directly. We described our experience in detail in that section, where after some back and forth with the team we were eventually able to see expected performance on NCCL tests on their network using a Runpod-orchestrated slurm cluster with modifications. We have not been able to test kubernetes, or verify any monitoring and health checks in place. 
In general, we appreciate FarmGPU’s commitment to improvement over time, strong knowledge of storage drive performance, and contributions to OCP’s Neocloud workstream describing their experience working with the SONiC NOS on Celestica whitebox switches with Kubernetes. At this time FarmGPU does not have basic security attestation in place. We look forward to testing their services again in the future. --- # Whitefiber (Underperforming) > Whitefiber earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. Whitefiber is a wholly-owned subsidiary of Bit Digital, another company which has recently pivoted from crypto mining to Ethereum staking and AI. Whitefiber went public on the NASDAQ in August 2025, raising approximately… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/whitefiber - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/whitefiber/llm.txt - **Topics**: Whitefiber review, Whitefiber GPU cloud, Whitefiber ClusterMAX rating, Whitefiber Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, Whitefiber B200, Whitefiber H200, Whitefiber H100, B200 cloud, H200 cloud, H100 cloud, SOC2, SOC 2, Kubernetes, Slurm, ClusterMAX 2.0, SemiAnalysis Whitefiber is a wholly-owned subsidiary of Bit Digital, another company which has recently pivoted from crypto mining to Ethereum staking and AI. Whitefiber went public on the NASDAQ in August 2025, raising approximately $150M. The combined company ($WYFI and $BTBT) has a market cap approaching $1B. Bit Digital recently announced a North Carolina datacenter that can expand from 24MW to 99MW, a large augmentation to their existing footprint in Montreal, with the intention to host more AI customers. 
In the past the company has mentioned LOIs for 288MW, and they have publicly announced contracts for 4,096 H100, 1,040 H200, and 464 B200. You can read more about companies like this, and their expansion plans in our [datacenter model](https://semianalysis.com/datacenter-industry-model/). In our testing, Whitefiber’s networking was within the reference numbers if not slightly better for some message sizes. The other aspects of the service were disappointing and did not match the quality of their networking. We were given a Slinky cluster, with access to both the slurm and kubernetes layer. Unfortunately, neither of them worked. Initially, slurm was inaccessible from a remote machine, with lots of negotiation required to make it work. Meanwhile, at the kubernetes layer, nothing could be scheduled since slurm-bridge was not configured, and all GPU resources were taken by slinky. Eventually, we got a jump box working to get into the slurm cluster, and found the typical slinky footguns: no git, vim, nano, python, or sudo permissions to install software. Cluster dashboards were long and detailed but missing important metrics about jobs. No active or passive health checks could be found. We were given access to dashboards for clockwork and trainy, neither of which were explained or documented, and were therefore not useful in our testing. Whitefiber has clearly made investments in their AI cloud offering by hiring a large team of consultants, integrating a custom interconnect to replace NVIDIA, and buying every piece of software pitched to them. Unfortunately, they are held back even from the bronze category for lacking a basic security attestation from a third party auditor, such as SOC 2 Type I/II or ISO27001. We believe that if Whitefiber gets a basic SOC2 Type I compliance in place and fixes their SLURM/Kubernetes orchestration by ClusterMAX 2.1 or ClusterMAX 3, they would find themselves in at least the bronze tier.
--- # DeepInfra (Underperforming) > DeepInfra earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. DeepInfra is a recent entrant into the GPU cloud game, starting as an inference provider and now renting out some of the cheapest B200’s on the market. We expect that with a relatively lumpy business like inference growing… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/deepinfra - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/deepinfra/llm.txt - **Topics**: DeepInfra review, DeepInfra GPU cloud, DeepInfra ClusterMAX rating, DeepInfra Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, DeepInfra B200, B200 cloud, ClusterMAX 2.0, SemiAnalysis DeepInfra is a recent entrant into the GPU cloud game, starting as an inference provider and now renting out some of the cheapest B200’s on the market. We expect that with a relatively lumpy business like inference growing faster than can be forecasted on the compute side, DeepInfra is looking for customers to soak up some of their unused capacity. In other words, they are taking the opposite approach when compared to Nebius or GMI’s inference endpoint business that is expanding on an existing cloud business. Unfortunately, DeepInfra’s only current offering in the neocloud market, an 8xB200 instance, was out of capacity whenever we tried to test it out, and there is no security compliance attestation in place. 
[](https://substackcdn.com/image/fetch/$s_!wPUP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde6ea251-b133-43de-8321-5da1e4a34773_936x495.png) With attractive pricing and a talented engineering team, we hope to see more from DeepInfra in the neocloud market in the future. --- # Dstack Sky (Underperforming) > Dstack Sky earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. Dstack the company has a really interesting orchestrator and scheduler that replaces the need for slurm or kubernetes. We love the idea of moving beyond slurm and kubernetes, and have been hearing great reviews from dstack… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/dstacksky - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/dstacksky/llm.txt - **Topics**: Dstack Sky review, Dstack Sky GPU cloud, Dstack Sky ClusterMAX rating, Dstack Sky Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, Dstack Sky H100, H100 cloud, SOC 2, Kubernetes, Slurm, ClusterMAX 2.0, SemiAnalysis Dstack the company has a really interesting orchestrator and scheduler that replaces the need for slurm or kubernetes. We love the idea of moving beyond slurm and kubernetes, and have been hearing great reviews from dstack users about their experience. On the flipside, the dstack sky marketplace offering is not at the same level. Dstack sky is a cloud broker that works similarly to their GPU orchestration product by focusing on a CLI-driven approach to provisioning GPU resources. 
The offering allows users to create three types of resources: a dev environment (a GPU instance accessible via an IDE), a task (a batch job), or a service (a deployed model or web app). Under the hood, everything is powered by docker containers. As we have mentioned previously when reviewing other marketplaces and brokers, this creates an initial restriction for building developer environments that users must comply with to use the product. However, it is nice to see that dstack does not require users to build from their base image, instead allowing users to bring their own image of choice while dstack adds orchestration on top. We particularly enjoyed the convenience script that automatically edits a user’s local .ssh/config file to provide quick access to newly created systems. However, the abstraction comes with a significant lack of transparency. It’s unclear how the underlying GPU provider, or “Backend,” is chosen, and there is no apparent way to view the full list of providers or filter by price when you get “Offers” from the CLI. When requesting an H100, we had no way to distinguish between PCIe and SXM models. During testing we happened to receive 2x SXM GPUs, but this seems to be a matter of chance. [](https://substackcdn.com/image/fetch/$s_!WB6_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50859f48-9e02-459e-9496-8f13c395ff70_936x422.png)Getting offers for 2x H100 from providers via dstack Once connected, we found ourselves logged in as root, implying there are no RBAC or shared storage options on the cluster side. The machine we connected to provided a small 100GB root partition, and our connection speed was extremely slow, with seconds of lag for a carriage return to register on our CLI. This was likely due to our instance being provisioned from a provider in Thailand (“Internet Thailand Company Ltd.”). Storage performance, however, was good, taking only 6s to import torch. 
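Since the offer list gave no way to tell PCIe from SXM parts, the form factor can only be checked after login. A minimal sketch of such a check follows, assuming the GPU name strings reported by `nvidia-smi --query-gpu=name --format=csv,noheader` look like “NVIDIA H100 PCIe” or “NVIDIA H100 80GB HBM3”. The function names and the “HBM3 implies SXM” heuristic are our own illustrative assumptions, not part of dstack or any provider tooling.

```python
import subprocess
from typing import List


def classify_form_factor(gpu_name: str) -> str:
    """Guess the form factor from an nvidia-smi product name (heuristic)."""
    name = gpu_name.upper()
    if "PCIE" in name:
        return "PCIe"
    # Assumption: SXM H100 boards typically report "... 80GB HBM3".
    if "SXM" in name or "HBM3" in name:
        return "SXM"
    return "unknown"


def list_gpu_names() -> List[str]:
    """Return one product name per GPU visible on this machine."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

On a freshly provisioned instance, something like `[classify_form_factor(n) for n in list_gpu_names()]` would show what was actually delivered. An “unknown” result means the name did not match either pattern and should be inspected by hand.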
This experience highlights the multiple layers of indirection in the Dstack model. We pay Dstack for credits; Dstack then pays a provider like Vast.ai for an instance; Vast.ai in turn pays the end provider to run the container (possibly the provider in Thailand uses a datacenter operator under the hood, too). It’s unclear how many layers exist and who is ultimately responsible for hardware maintenance and security, a significant concern for any serious workload. [](https://substackcdn.com/image/fetch/$s_!aS3F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea80b64-1339-4fcb-b58a-419ac9a8b8f8_937x293.png)An RL eval job on 1x H100, using the `verifiers` repo With all this said, we were still able to use dstack to connect to a remote machine with 2x H100 inside VSCode in under 5 minutes, install required software in under 5 minutes, and run an RL rollout for a sample model eval. All in less than 30 minutes, paid for by the minute with existing credits. A nice experience for on-demand development that motivates us to reconsider the use of CLIs to spin up machines. When it works, it works. We look forward to testing dstack again in the future, and the company is planning on completing a basic security compliance attestation such as SOC 2 Type 1 soon. --- # PaleBlueDot (Underperforming) > PaleBlueDot earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. PaleBlueDot is one of the many marketplaces covered in the underperforming tier that is missing a basic security attestation. 
We had a good experience testing some of the five different clouds that are aggregated on the… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/palebluedot - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/palebluedot/llm.txt - **Topics**: PaleBlueDot review, PaleBlueDot GPU cloud, PaleBlueDot ClusterMAX rating, PaleBlueDot Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, Kubernetes, Slurm, ClusterMAX 2.0, SemiAnalysis PaleBlueDot is one of the many marketplaces covered in the underperforming tier that is missing a basic security attestation. We had a good experience testing some of the five different clouds that are aggregated on the PaleBlueDot marketplace. It was easy to spin up and connect to virtual machines, and we were only charged by the minute. We encourage PaleBlueDot to consider onboarding more providers in order to increase GPU availability and provide a true cluster experience via slurm or kubernetes orchestration and shared storage. --- # Hyperbolic (Underperforming) > Hyperbolic earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. Hyperbolic positions itself as a low cost GPU aggregator, however its “high-end” H100 and H200 offerings are explicitly listed as “beta” on the website. 
Users can pay for their GPUs with cryptocurrency, bank transfers, or… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/hyperbolic - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/hyperbolic/llm.txt - **Topics**: Hyperbolic review, Hyperbolic GPU cloud, Hyperbolic ClusterMAX rating, Hyperbolic Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, Hyperbolic H200, Hyperbolic H100, H200 cloud, H100 cloud, ClusterMAX 2.0, SemiAnalysis Hyperbolic positions itself as a low cost GPU aggregator, however its “high-end” H100 and H200 offerings are explicitly listed as “beta” on the website. Users can pay for their GPUs with cryptocurrency, bank transfers, or credit card payments. Our hands-on testing began with exceptionally fast provisioning time, with a new H100 instance spinning up and becoming accessible in under 25 seconds, the fastest we saw in all our research. Unfortunately this was immediately undermined by the state of the software on the node. The instance was pre-provisioned with an out-of-date PyTorch version (2.5.1). Also, essential tools like Docker were not pre-installed. [](https://substackcdn.com/image/fetch/$s_!cOeF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F310f7227-f93e-46e8-92bf-01b423cab3cc_935x536.png)Source: a brand new H100 node on Hyperbolic Platform reliability emerged as the most critical failure. Our testing was plagued by persistent connection drops, instances inexplicably falling into an “Unknown status,” and ultimately, a complete failure of the provisioning system that prevented us from creating or accessing any instances. 
[](https://substackcdn.com/image/fetch/$s_!gEww!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d43c154-df2f-4982-a966-87971c0ef20d_937x503.png)Source: unknown status… [](https://substackcdn.com/image/fetch/$s_!2XaJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3f881b9-e577-4392-b50d-8a042303b87b_937x503.png)Source: error in the Hyperbolic UI Overall, Hyperbolic has a promising business but lacks a basic security attestation, and had among the worst user experiences for an on-demand VM of all the marketplaces that we tested. It is unclear if that is on the side of Hyperbolic, or the underlying datacenter provider, but we think it illustrates the point about reliability challenges that prospective users of GPU brokers/platforms/marketplaces/aggregators will have to contend with. --- # Aethir (Underperforming) > Aethir earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. Aethir functions as an underlying infrastructure partner by creating a decentralized GPU compute infrastructure (DePIN). It aggregates globally distributed, idle GPU capacity and rents it out for both AI and cloud gaming. Its… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/aethir - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/aethir/llm.txt - **Topics**: Aethir review, Aethir GPU cloud, Aethir ClusterMAX rating, Aethir Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, ClusterMAX 2.0, SemiAnalysis Aethir functions as an underlying infrastructure partner by creating a decentralized GPU compute infrastructure (DePIN). 
It aggregates globally distributed, idle GPU capacity and rents it out for both AI and cloud gaming. Its model is built on its own cryptocurrency token (ATH), which is used by investors to “stake” GPUs and provide liquidity, effectively trading on the volatility of spot instance pricing. Unfortunately, Aethir is not truly a self-service experience, requiring prospective buyers to fill out a form in order to purchase GPU time on their platform. [](https://substackcdn.com/image/fetch/$s_!-Nmh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4014c43e-7359-40bd-b8f4-4ae83ef6841b_937x897.png) It seems that this decentralized network concept, while particularly useful for cloud gaming, has not taken off yet for AI. Aethir does not have any security compliance attestation in place for users to assess whether the company taking payments and managing the GPU access has access controls in place. --- # Akash Network (Underperforming) > Akash Network earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. Akash is a decentralized marketplace with a supposed 64 active providers on “Mainnet”. When we logged in, it seemed like there were many different consumer-grade GPUs available. 
We decided to give it a go and request 1x… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/akashnetwork - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/akashnetwork/llm.txt - **Topics**: Akash Network review, Akash Network GPU cloud, Akash Network ClusterMAX rating, Akash Network Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, Akash Network H200, Akash Network H100, H200 cloud, H100 cloud, ClusterMAX 2.0, SemiAnalysis Akash is a decentralized marketplace with a supposed 64 active providers on “Mainnet”. When we logged in, it seemed like there were many different consumer-grade GPUs available. We decided to give it a go and request 1x H200. [](https://substackcdn.com/image/fetch/$s_!6yGR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ff3a7ae-f4f5-472e-a6ea-f3030109a225_937x497.png)Requesting a 1x H200 deployment Unfortunately, we weren’t able to access any H100 or H200 on the platform. Interestingly, there is an option to request AMD MI100 GPUs, but when we tried to request them or different NVIDIA consumer GPUs (3080, 4090) we couldn’t get anything more than an apologetic loading screen: [](https://substackcdn.com/image/fetch/$s_!XHEv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a504cd5-04d6-4f51-b58f-3dfc27eb350a_937x497.png)Waiting for bids… After a few hours of waiting for a bid for a 4090, and none being found, we moved on. Overall, our experience with Akash leads us to think that it is basically not ready for any workloads. 
--- # Salad Cloud (Underperforming) > Salad Cloud earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. Salad Cloud is another decentralized marketplace that focuses primarily on consumer-grade gaming GPUs at reduced prices in their “Community Cloud” offering. In their “Secure Cloud” offering, there is no option for high-end… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/saladcloud - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/saladcloud/llm.txt - **Topics**: Salad Cloud review, Salad Cloud GPU cloud, Salad Cloud ClusterMAX rating, Salad Cloud Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, Salad Cloud B200, Salad Cloud H200, Salad Cloud H100, B200 cloud, H200 cloud, H100 cloud, SOC 2, ClusterMAX 2.0, SemiAnalysis Salad Cloud is another decentralized marketplace that focuses primarily on consumer-grade gaming GPUs at reduced prices in their “Community Cloud” offering. In their “Secure Cloud” offering, there is no option for high-end GPUs such as the SXM H100, H200, or B200. Since ClusterMAX 1.0, we do appreciate that Salad Cloud is now SOC 2 Type 1 certified, and does not charge their customers for cold-boot startup times. [](https://substackcdn.com/image/fetch/$s_!Cz-N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F733157c5-77e5-456f-aa9d-de55b0d04bc6_937x554.png)Source: security disclaimers on the Salad Cloud website --- # Clore (Underperforming) > Clore earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. Clore operates as a decentralized GPU marketplace, similar to Aethir and Akash. 
It connects individual hardware ‘hosts’ (hobbyists or small-scale crypto miners) with users seeking on-demand compute. Since the platform is built… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/clore - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/clore/llm.txt - **Topics**: Clore review, Clore GPU cloud, Clore ClusterMAX rating, Clore Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, ClusterMAX 2.0, SemiAnalysis Clore operates as a decentralized GPU marketplace, similar to Aethir and Akash. It connects individual hardware ‘hosts’ (hobbyists or small-scale crypto miners) with users seeking on-demand compute. Since the platform is built around its crypto token, users are encouraged to use crypto for payments and staking GPUs. During testing we had basic issues signing up for an account. We have general concerns about the validity of a platform like this with a lack of basic security, reliability, orchestration, storage, and networking features that users expect from their GPU clusters. --- # Mithril/ML Foundry (Underperforming) > Mithril/ML Foundry earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. 
Mithril (formerly ML Foundry, formerly Foundry) operates as a GPU aggregator, or what they term an “AI omnicloud.” Their core philosophy is that the primary problem in the GPU market is one of price discovery and… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/mithrilmlfoundry - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/mithrilmlfoundry/llm.txt - **Topics**: Mithril/ML Foundry review, Mithril/ML Foundry GPU cloud, Mithril/ML Foundry ClusterMAX rating, Mithril/ML Foundry Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, NDR, ClusterMAX 2.0, SemiAnalysis Mithril (formerly ML Foundry, formerly Foundry) operates as a GPU aggregator, or what they term an “AI omnicloud.” Their core philosophy is that the primary problem in the GPU market is one of price discovery and market inefficiency. Their solution is to create a “fluid market” through aggregation and abstraction, allowing costs to adjust dynamically to reflect increased supply. We completely disagree with both the premise and solution. The premise that the GPU market lacks price discovery is flawed and represents a fundamental misunderstanding of the market. In our experience, over 90% of the GPU cloud rental volume is done on long-term contracts between enterprises, with a standard 25% down and monthly payments through the end of the term. In other words, a typical B2B transaction. The reason for this, which has been detailed throughout this report, is that not all GPUs are deployed equally. GPU compute is not a commodity. 
Mithril, and other companies trying to aggressively financialize the GPU market as though it is crude oil or lumber, are solving for price per GPU-hr as the only variable. This is an important criterion, but it is just one of the 129 criteria that we use to assess a provider’s quality, and is often a poor proxy for the realized TCO of a cluster. By building abstractions on top of an aggregated and abstracted “roll-of-the-dice” marketplace of underlying providers, Mithril places the entire operational burden on the end user. As a provider, Mithril has no control over their customer’s support experience, orchestration software preferences, networking and storage performance, monitoring experience, or, most critically, the reliability and security posture of the cluster. With that said, even if GPU compute were a liquid, commoditized market, we would expect the winner to have open access to many GPU providers, real-time data feeds on availability, forecasts for upcoming supply, proxy information for realized quality from end users… but unfortunately we are left with this instead: [](https://substackcdn.com/image/fetch/$s_!y6Pc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c4b349d-c2a9-4275-b356-5d8946095494_937x554.png)Source: a 3-month wait, and counting. --- # GPU.net (Underperforming) > GPU.net earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. GPU.net is yet another marketplace, and one of the biggest by the numbers. 
The homepage proudly displays access to 42 providers, with 121k total GPUs available.… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/gpunet - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/gpunet/llm.txt - **Topics**: GPU.net review, GPU.net GPU cloud, GPU.net ClusterMAX rating, GPU.net Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, GPU.net H200, GPU.net H100, GPU.net L40S, H200 cloud, H100 cloud, L40S cloud, ClusterMAX 2.0, SemiAnalysis GPU.net is yet another marketplace, and one of the biggest by the numbers. The homepage proudly displays access to 42 providers, with 121k total GPUs available. [](https://substackcdn.com/image/fetch/$s_!rPP2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80cc70a2-0d6f-45e7-9ff9-5596ea565901_936x490.png) Pricing is reasonable for on-demand, at $2.15/hr per H100. Unfortunately, all H100 and H200 SXM appeared to be unavailable, showing a “Booking Error” when we tried to purchase them. Eventually, we got a 1x H100 80GB PCIe, and a 2x L40S machine to spin up in about 2 minutes each. [](https://substackcdn.com/image/fetch/$s_!lUag!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc64c3f51-785d-44d6-8edd-e8a31c02cef0_936x509.png) Logging into our H100 machine in North America, and our 2x L40S machine in Australia. One has GPU drivers, and one doesn’t. Interestingly, the machine without drivers had docker installed, and the machine with drivers did not. 
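The inconsistency between the two machines (drivers on one, docker on the other) is the kind of thing a quick preflight check catches right after first login. Below is a minimal sketch using only the Python standard library. The tool list and the helper names are our own illustrative choices, not something GPU.net provides.

```python
import shutil
from typing import Dict, Iterable, List

# Illustrative list of executables we would expect on a GPU dev machine.
DEFAULT_TOOLS = ("nvidia-smi", "docker", "git", "python3")


def check_tools(tools: Iterable[str] = DEFAULT_TOOLS) -> Dict[str, bool]:
    """Map each tool name to whether it is found on the PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


def missing_tools(report: Dict[str, bool]) -> List[str]:
    """Return the sorted names of tools that were not found."""
    return sorted(tool for tool, found in report.items() if not found)
```

Calling `missing_tools(check_tools())` on each new instance gives a short list to compare across machines. Note that finding `nvidia-smi` on the PATH only suggests the driver is installed; it does not prove the GPUs are healthy.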
We saw that the H100 PCIe machine (dubbed “North America” on the console) was created in Hyperstack/NexGen Cloud’s Montreal datacenter, and the 2x L40S machine (dubbed “Australia” on the console) was created in Sharon AI’s Melbourne, Australia datacenter. Overall, GPU.net is another crypto-focused decentralized marketplace that may give users the option to access multiple cloud providers. However, they struggle with reliability, a consistent user experience, and many of the basic expectations that we have for a neocloud such as security compliance and attestation. --- # Massed Compute (Underperforming) > Massed Compute earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. In the previous version of ClusterMAX, we commented on how Massed Compute, a reasonably well run bare metal compute provider, is unfortunately inundating the internet with AI-generated SEO junk articles with incorrect… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/massedcompute - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/massedcompute/llm.txt - **Topics**: Massed Compute review, Massed Compute GPU cloud, Massed Compute ClusterMAX rating, Massed Compute Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, Massed Compute H100, H100 cloud, bare metal, ClusterMAX 2.0, SemiAnalysis In the previous version of ClusterMAX, we commented on how Massed Compute, a reasonably well run bare metal compute provider, is unfortunately inundating the internet with AI-generated SEO junk articles with incorrect information. 
This is harmful to the community and simple to fix with **`<meta name="robots" content="noindex, nofollow">`** in the **`<head>`** section of the HTML for their chatbot webpage. [](https://substackcdn.com/image/fetch/$s_!auJk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F280aedc2-eb37-45aa-9dd2-a310d2aa0e83_935x520.png)Source: Massed Compute’s chatbot hallucinating a new H100, the “dual GPU” --- # Exabits (Underperforming) > Exabits earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. Exabits operates as a bare-metal provider, primarily surfacing its capacity through marketplaces rather than a direct-access cloud. The provider is associated with a cryptocurrency project which seems to provide funding for… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/exabits - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/exabits/llm.txt - **Topics**: Exabits review, Exabits GPU cloud, Exabits ClusterMAX rating, Exabits Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, Kubernetes, Slurm, ClusterMAX 2.0, SemiAnalysis Exabits operates as a bare-metal provider, primarily surfacing its capacity through marketplaces rather than a direct-access cloud. The provider is associated with a cryptocurrency project which seems to provide funding for GPUs through staking. We attempted to provision Exabits instances via the PaleBlueDot.ai marketplace, where it is listed alongside other providers such as Digital Ocean, Massed, and Nebius. During our testing window, unfortunately no Exabits capacity was available to provision from this platform. 
Lacking an accessible self-service platform or available marketplace instances, we were unable to evaluate Exabits against any core criteria, including orchestration (Slurm/Kubernetes), multi-node networking, security, or monitoring. --- # Sesterce (Underperforming) > Sesterce earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. Sesterce is a French cloud headquartered in Marseille, currently claiming access to 1GW of compute, 100k GPUs under management, and €750 million of investment. They started as a crypto miner, but have recently announced a… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/sesterce - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/sesterce/llm.txt - **Topics**: Sesterce review, Sesterce GPU cloud, Sesterce ClusterMAX rating, Sesterce Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, Sesterce B200, Sesterce H200, Sesterce H100, B200 cloud, H200 cloud, H100 cloud, SOC2, ISO 27001, Kubernetes, Slurm, bare metal, ClusterMAX 2.0, SemiAnalysis Sesterce is a French cloud headquartered in Marseille, currently claiming access to 1GW of compute, 100k GPUs under management, and €750 million of investment. They started as a crypto miner, but have recently announced a Europe-sovereign €52 billion investment plan across multiple datacenters. This plan involves an initial site in Valence Romans Agglo, which will have 40k GPUs, totaling €1.8 billion. They then plan to add two additional sites in Grand Est, totalling 600 MW of capacity and 500,000 GPUs by 2028, with additional scaling to 1.2 GW and over 1 million GPUs by 2030. Big plans. 
Back to the present day, Sesterce seems to have reasonable availability of NVIDIA GPUs across their “regions” in Helsinki, Kansas City, Des Moines, Salt Lake City, Dulles, New York, Calgary, Toronto, Mumbai, Osaka, Australia, Frankfurt, Amsterdam, Warsaw, Iceland, Norway, and Sweden. From this size and scale, it seems clear that individual VMs are being deployed in datacenters run by other providers. Pricing also reflects a scenario where there is some middleman getting paid. [](https://substackcdn.com/image/fetch/$s_!btir!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F603065b6-44e6-4116-98f2-9ebb5ad04a27_936x493.png) Users can easily create volumes in the Region where they are planning to launch a GPU machine, and re-use these volumes on additional machines in the future. Nice feature. [](https://substackcdn.com/image/fetch/$s_!z3WE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0347acb-bac2-4785-a262-1a813cff2a27_936x493.png) Users can also configure a specific docker image to pre-load in the machine’s cache. [](https://substackcdn.com/image/fetch/$s_!brEa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f27eba5-e5c4-4b92-abcb-176f3d89804a_936x493.png) Our 1x B200 VM spun up in the Helsinki Datacenter, and upon login we found that the machine belongs to DataCrunch (now called Verda, another ClusterMAX bronze provider). Clusters can be requested (but not spun up on-demand) 16 nodes at a time in Helsinki for B200, or 1-4 nodes at a time in Marseille for H100 or H200. Unfortunately, these clusters are just bare metal machines with a shared storage option. Sesterce does not offer Slurm or Kubernetes orchestration, monitoring dashboards for the clusters, or health checks. It also seems that RBAC with authentication from external IAM providers is not available. 
[](https://substackcdn.com/image/fetch/$s_!IsLO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8955655e-f408-40fd-839f-e9f0d5759772_755x532.png)Logging into a Sesterce machine, with a docker image already loaded in cache The machine also had NVIDIA drivers and docker pre-installed, with the NVIDIA container toolkit configured and on the latest version. Download speeds were solid from this Helsinki location, posting some of the fastest times to install pytorch, download an NGC pytorch container, and download a model from huggingface. Overall, our experience with Sesterce was solid. We expect that for certain users, paying the Sesterce premium for availability and a proper setup of an individual development machine offsets some of the headache that is experienced with other clouds. We encourage Sesterce to get into the game with on-demand slurm and kubernetes clusters given the solid foundation that exists from their public cloud services. Most importantly, we encourage Sesterce to make it clear which provider is actually running the machine on their underlying platform, and to publicly communicate a third-party attestation of simple security audits and compliance, such as SOC2 Type I or ISO 27001, in order to move into the ClusterMAX ratings. --- # E2E Networks (Underperforming) > E2E Networks earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. E2E Networks is a publicly listed Indian cloud infrastructure provider. 
The company is undergoing an ambitious expansion fueled by significant fundraising and its integral partnership in the government’s IndiaAI Mission.… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/e2enetworks - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/e2enetworks/llm.txt - **Topics**: E2E Networks review, E2E Networks GPU cloud, E2E Networks ClusterMAX rating, E2E Networks Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, NDR, RoCE, Slurm, ClusterMAX 2.0, SemiAnalysis E2E Networks is a publicly listed Indian cloud infrastructure provider. The company is undergoing an ambitious expansion fueled by significant fundraising and its integral partnership in the government’s IndiaAI Mission. The company operates datacenters in Delhi, Mumbai, and Chennai, though their AI platform “TIR” only runs in Delhi. However, this is our final review in the article and our worst overall experience. The testing process began with an aggressive KYC (Know Your Customer) procedure required before we could even log in. Our attempt to provision a training cluster, following their documentation, quickly ran into problems. The platform did offer an interesting selection of pre-configured software images, including NVIDIA NeMo, and options to create a shared file system for our slurm cluster. However, we ran into a complete lack of resource availability. 
[](https://substackcdn.com/image/fetch/$s_!3Avy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F415627f0-54bd-4749-8860-1c6130e88e0a_935x496.png)Creating a SLURM cluster [](https://substackcdn.com/image/fetch/$s_!imcj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb316922d-1332-4665-a379-996ba5e49532_619x233.png)Unable to create a shared fs The most serious issue occurred while we were stuck in this queue, waiting for our slurm cluster to be deployed. We watched as our credit balance was drained, and then went into the negatives. The entire time, we were unable to click delete on the cluster and remove our account from the queue. [](https://substackcdn.com/image/fetch/$s_!6Q2W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d778784-6cb0-429d-8c0f-3b5412b7252b_936x496.png)Owing $7,061.05 for a cluster stuck in “creating” stage Eventually, the E2E support team solved the issue by suspending our account completely. We view this business decision to charge customers for time while a cluster is spinning up but not usable as the most offensive business practice we have seen throughout all of our ClusterMAX testing. --- # OVHcloud (Underperforming) > OVHcloud earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. As one of Europe’s largest cloud providers, OVHcloud has a massive footprint and lots of experience. They should be well positioned to capture the sovereign AI market in the EEA. 
However, OVH is still defined by a legacy… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/ovhcloud - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/ovhcloud/llm.txt - **Topics**: OVHcloud review, OVHcloud GPU cloud, OVHcloud ClusterMAX rating, OVHcloud Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, OVHcloud H100, H100 cloud, ClusterMAX 2.0, SemiAnalysis As one of Europe’s largest cloud providers, OVHcloud has a massive footprint and lots of experience. They should be well positioned to capture the sovereign AI market in the EEA. However, OVH is still defined by a legacy IaaS/VPS model that lacks modern GPUs and, oh I don’t know, [that one time their Strasbourg datacenter burnt down](https://corporate.ovhcloud.com/en/newsroom/news/informations-site-strasbourg/). Come on, it’s still the first thing anyone thinks of… [](https://substackcdn.com/image/fetch/$s_!LcyS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a33147d-1e8c-48a4-974f-2c618f740271_937x709.png)Source: the most modern GPU available at OVH is an H100 PCIe (for £2.41 at that) --- # Dihuni (Underperforming) > Dihuni earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. Dihuni is a Virginia-based provider with a solid list of customers. Dihuni also has a long-standing collaboration with NEC, a major electronics company in Tokyo. 
To run their GPU Cloud, Dihuni has established a key partnership… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/dihuni - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/dihuni/llm.txt - **Topics**: Dihuni review, Dihuni GPU cloud, Dihuni ClusterMAX rating, Dihuni Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, ClusterMAX 2.0, SemiAnalysis Dihuni is a Virginia-based provider with a solid list of customers. Dihuni also has a long-standing collaboration with NEC, a major electronics company in Tokyo. To run their GPU Cloud, Dihuni has established a key partnership with Qubrid (covered previously in this article). We think this is an obvious trend: neoclouds like Qubrid beginning to sell their software to other providers like Dihuni that have existing datacenters and want an easy button to get started. --- # Akamai/Linode (Underperforming) > Akamai/Linode earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. Akamai is a CDN giant that made a $900M acquisition of Linode in March 2022 to build a cloud from a foundation of datacenters, networking, and free cash flow. 
The scene was set for ChatGPT to launch later that year.… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/akamailinode - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/akamailinode/llm.txt - **Topics**: Akamai/Linode review, Akamai/Linode GPU cloud, Akamai/Linode ClusterMAX rating, Akamai/Linode Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, Kubernetes, Slurm, managed Slurm, ClusterMAX 2.0, SemiAnalysis Akamai is a CDN giant that made a $900M acquisition of Linode in March 2022 to build a cloud from a foundation of datacenters, networking, and free cash flow. The scene was set for ChatGPT to launch later that year. Unfortunately, Akamai’s entry into the GPU cloud market has been a case study in missed opportunities. Akamai has completely ignored all high-end GPUs and instead focuses on the RTX 6000 Blackwell. Unsurprisingly, the platform has no managed Slurm cluster, and its Kubernetes engine is not optimized for GPU servers. Akamai’s strategy seems focused on single-node, small-scale inference or developer VMs, which is a crowded, low-margin market. For a company with their resources, this lack of ambition is a significant disappointment to us. --- # HETZNER (Underperforming) > HETZNER earns a ClusterMAX 2.0 Underperforming rating from SemiAnalysis. Hetzner is a popular low-cost provider based in Germany, operating facilities in Nuremberg, Falkenstein, Tuusula, and colos in Ashburn, Hillsboro, and Singapore. 
At this time the only GPUs available are the RTX 4000 and RTX… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Underperforming - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/hetzner - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/hetzner/llm.txt - **Topics**: HETZNER review, HETZNER GPU cloud, HETZNER ClusterMAX rating, HETZNER Underperforming, Underperforming tier GPU cloud, GPU cloud review, neocloud review, ClusterMAX 2.0, SemiAnalysis Hetzner is a popular low-cost provider based in Germany, operating facilities in Nuremberg, Falkenstein, Tuusula, and colos in Ashburn, Hillsboro, and Singapore. At this time the only GPUs available are the RTX 4000 and RTX 6000, likely due to some environmental limitations in their datacenters. Until Hetzner adds high-end datacenter GPUs for users to build clusters, they will remain in the underperforming category for our criteria. --- # NScale (Unavailable) > NScale earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. NScale is an NVIDIA-backed partner with a huge presence in Stargate Norway, which we have covered extensively in our Datacenter Model. They have announced multi-billion contracts with Microsoft, with the first in September and a follow-up announcement in October. These deployments are for NVIDIA’s GB300 rack scale solutions. Unfortunately we have been unable to gain access to any GPUs on the NScale platform, either directly or through publicly advertised partners like Lightning.ai. We regret the difficulties and look forward to testing NScale services in the future. --- # Core42/G42 (Unavailable) > Core42/G42 earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis (updated April 2026 in the ClusterMAX 2.1 Update). 
Core42 is the neocloud division of G42. G42 also operates MGX (an investment fund) and Khazna (a datacenter development company). All are based in the UAE. We have covered these… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2026-04-20 (Apr 20, 2026) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/core42g42 - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/core42g42/llm.txt - **Topics**: Core42/G42 review, Core42/G42 GPU cloud, Core42/G42 ClusterMAX rating, Core42/G42 Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, Core42/G42 MI300X, MI300X cloud, Kubernetes, Slurm, ClusterMAX 2.0, ClusterMAX 2.1, ClusterMAX 2.1 Update, SemiAnalysis Core42 is the neocloud division of G42. G42 also operates MGX (an investment fund) and Khazna (a datacenter development company). All are based in the UAE. We have covered these companies extensively in our Datacenter Model, and public articles such as: with recent updates showing that a tit-for-tat deal of a measly 10T USD has resulted in an export license being granted and NVIDIA resuming shipments of GPUs to the UAE. At this time we have not been able to access and test a cluster provided by Core42. We regret the difficulties and look forward to testing Core42 services in the future. ## ClusterMAX 2.1 Update (April 2026) Core42 is a division of G42 with a massive presence in the UAE and a growing presence in the US. With the backing of MGX, and the sister company Khazna Datacenters, all of whom are intimately involved with Stargate UAE, the group means business. Back on the US side, Core42 is also making moves. They have established small sites in San Jose, Grenoble, and 70MW of MI300X in Buffalo (via Terawulf). 
During our testing we were provided with both slurm and kubernetes clusters from that site, using AMD MI300X GPUs and, crucially, some Broadcom Thor-II NICs. This was the first cluster we'd gotten with Thor-II during ClusterMAX testing, and it was a battle. Every single container image we had previously tested on AMD clusters, and nearly every base image AMD publishes to its rocm repos (such as vllm, sglang, torchtitan, and MoRI), is built with AMD's own Pollara NICs. This meant downloading tarballs from Broadcom's driver search website, scp'ing the files over to the cluster nodes, and rebuilding containers from scratch. A headache, to say the least. Notably, the Core42 engineering team was ready to help the entire way. From troubleshooting these driver recipe issues to debugging slurm user errors on our side, it was a really strong showing of hands-on, proactive technical support. If Core42 launches some modern GPUs in the US, or starts relaxing the compliance restrictions that prevent us from testing in the UAE sites (or anyone from outside the UAE renting GPUs at those sites), we expect Core42 to quickly rise into the silver tier and beyond. --- # HUMAIN Compute (Unavailable) > HUMAIN Compute earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. HUMAIN Compute is the neocloud division of Saudi Arabia’s Public Investment Fund (PIF). The KSA PIF is also involved in other major technology and infrastructure projects as part of the nation’s “Vision 2030” plan. 
While… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/humaincompute - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/humaincompute/llm.txt - **Topics**: HUMAIN Compute review, HUMAIN Compute GPU cloud, HUMAIN Compute ClusterMAX rating, HUMAIN Compute Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, ClusterMAX 2.0, SemiAnalysis HUMAIN Compute is the neocloud division of Saudi Arabia’s Public Investment Fund (PIF). The KSA PIF is also involved in other major technology and infrastructure projects as part of the nation’s “Vision 2030” plan. While there have been significant announcements regarding their intent to build large-scale AI infrastructure, a publicly accessible, self-service platform for testing and review is not yet available. --- # Corvex (Unavailable) > Corvex earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. Corvex (formerly Klustr) currently operates H200 and B200, with expansion plans for GB200 NVL72 systems coming soon. During our testing period, Corvex was unfortunately sold out. 
We appreciate that in order to serve their… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/corvex - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/corvex/llm.txt - **Topics**: Corvex review, Corvex GPU cloud, Corvex ClusterMAX rating, Corvex Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, Corvex GB200 NVL72, Corvex GB200, Corvex B200, Corvex H200, GB200 NVL72 cloud, GB200 cloud, B200 cloud, H200 cloud, InfiniBand, RoCE, SOC 2, ISO 27001, Kubernetes, Slurm, DCGM, ClusterMAX 2.0, SemiAnalysis Corvex (formerly Klustr) currently operates H200 and B200, with expansion plans for GB200 NVL72 systems coming soon. During our testing period, Corvex was unfortunately sold out. We appreciate that in order to serve their customers, notably secure federal government agencies in the US, Corvex maintains strict compliance with SOC 2 Type II, ISO 27001/27017/27018, PCI-DSS, and FedRAMP certifications, along with InfiniBand and RoCEv2 isolation, SR-IOV safety controls, and automatic customer notification and CVE patch management. The platform advertises automated Slurm and Kubernetes provisioning, out-of-the-box topology configuration, and integrated Grafana monitoring tied into DCGM and job telemetry. We look forward to testing Corvex in the future. --- # Highrise (Unavailable) > Highrise earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. This is the AI brand for the crypto-mining company Hut 8. Despite massive announcements for over 10GW of power capacity, we have been unable to access any GPUs for review at Highrise. 
We look forward to testing their offerings in… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/highrise - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/highrise/llm.txt - **Topics**: Highrise review, Highrise GPU cloud, Highrise ClusterMAX rating, Highrise Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, ClusterMAX 2.0, SemiAnalysis This is the AI brand for the crypto-mining company Hut 8. Despite massive announcements for over 10GW of power capacity, we have been unable to access any GPUs for review at Highrise. We look forward to testing their offerings in the future. --- # BluSky (Unavailable) > BluSky earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. BluSky is a small scale startup with access to massive amounts of power through a relationship with RIOT, one of the largest crypto miners in the US. We look forward to testing their offerings in the future. - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/blusky - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/blusky/llm.txt - **Topics**: BluSky review, BluSky GPU cloud, BluSky ClusterMAX rating, BluSky Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, ClusterMAX 2.0, SemiAnalysis BluSky is a small scale startup with access to massive amounts of power through a relationship with RIOT, one of the largest crypto miners in the US. 
We look forward to testing their offerings in the future. --- # Andromeda (Unavailable) > Andromeda earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. Andromeda was created to serve as a VC-backed cluster for companies within the Nat Friedman and Daniel Gross (NFDG) portfolio, specifically recipients of their AI Grant (or AI Grant). The company also operates the popular… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/andromeda - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/andromeda/llm.txt - **Topics**: Andromeda review, Andromeda GPU cloud, Andromeda ClusterMAX rating, Andromeda Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, NDR, Kubernetes, Slurm, ClusterMAX 2.0, SemiAnalysis Andromeda was created to serve as a VC-backed cluster for companies within the Nat Friedman and Daniel Gross (NFDG) portfolio, specifically recipients of their [AI Grant](https://aigrant.com/) (or [AI Grant](https://aigrant.org/)). The company also operates the popular website [gpulist.ai](http://gpulist.ai). Their model now involves procuring capacity from a range of neoclouds on our list, on behalf of the startups. Effectively they are a “fractional SRE” or “fractional procurement team” that the startups can lean on. Users described to us that Andromeda can take these clusters and deploy their own orchestration layer, primarily via Slurm on Kubernetes, with light namespace isolation between tenants. It seems that the NFDG portfolio companies trust each other. With Nat and Daniel first going to SSI, and now joining Meta, it is uncertain what the future holds for Andromeda. 
We are excited by the concept of strategic VCs backing startups with compute capacity in addition to cash, allowing them to compete with startup programs from strategics like the hyperscalers, CoreWeave, and NVIDIA’s NVentures. We look forward to seeing more from Andromeda in the future. --- # Mistral (Unavailable) > Mistral earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. Mistral started as one of the leading AI companies in the world, with an affinity for open source models, magnet links, and le chat (apps, not cats). Now that they have procured compute, they have pivoted or added the capability… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/mistral - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/mistral/llm.txt - **Topics**: Mistral review, Mistral GPU cloud, Mistral ClusterMAX rating, Mistral Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, ClusterMAX 2.0, SemiAnalysis Mistral started as one of the leading AI companies in the world, with an affinity for open source models, magnet links, and le chat (apps, not cats). Now that they have procured compute, they have pivoted or added the capability of building a neocloud to their public offering, with a particular focus on sovereign AI projects in Europe. This summer, Mistral even had President Macron advertising their services on stage alongside Jensen. As Jensen said during this session “a country can outsource a lot of things, but outsourcing all of your intelligence makes no sense”. With all of this work well underway, we look forward to testing public offerings from Mistral in the future. 
--- # Firebird (Unavailable) > Firebird earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. Firebird is another provider that debuted this summer at GTC Paris, backed by $500M from the Armenian government with the intention to buy GB300 NVL72 rack scale systems and a focus on sovereign AI. They have support from Telecom… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/firebird - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/firebird/llm.txt - **Topics**: Firebird review, Firebird GPU cloud, Firebird ClusterMAX rating, Firebird Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, Firebird GB300, GB300 cloud, ClusterMAX 2.0, SemiAnalysis Firebird is another provider that debuted this summer at GTC Paris, backed by $500M from the Armenian government with the intention to buy GB300 NVL72 rack scale systems and a focus on sovereign AI. They have support from Telecom Armenia and Ireland’s Imagine Broadband. Though a public offering is not yet launched, we are excited to test it in the future. --- # TELUS (Unavailable) > TELUS earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. TELUS is a major Canadian telco that has deployed GPUs in datacenters but is still “launching soon” with its own Sovereign play. 
The platform is currently unavailable for testing, though we expect to include it in ClusterMAX 2.1,… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/telus - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/telus/llm.txt - **Topics**: TELUS review, TELUS GPU cloud, TELUS ClusterMAX rating, TELUS Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, Kubernetes, Slurm, ClusterMAX 2.0, SemiAnalysis TELUS is a major Canadian telco that has deployed GPUs in datacenters but is still “launching soon” with its own Sovereign play. The platform is currently unavailable for testing, though we expect to include it in ClusterMAX 2.1, coming soon. We are encouraged that from day one TELUS will offer both slurm and kubernetes clusters to their users. --- # Telenor (Unavailable) > Telenor earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. A major Norwegian telco that has announced Norway’s first “AI Factory” in partnership with NVIDIA. 
The platform is not yet launched and is unavailable for testing, despite initial announcements being made back in February 2025.… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/telenor - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/telenor/llm.txt - **Topics**: Telenor review, Telenor GPU cloud, Telenor ClusterMAX rating, Telenor Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, ClusterMAX 2.0, SemiAnalysis A major Norwegian telco that has announced Norway’s first “AI Factory” in partnership with NVIDIA. The platform is not yet launched and is unavailable for testing, despite initial announcements being made back in February 2025. Norway has one of the largest sovereign wealth funds in the world, a cool climate, and cheap green energy. We are excited to see what Telenor comes to market with in the future. --- # Alibaba Cloud (Unavailable) > Alibaba Cloud earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. While a major hyperscaler in Asia, its high-performance, modern GPU offerings (H100/B200) are not readily available for testing in most public regions, with a primary focus on the Chinese domestic market. 
Alibaba Cloud’s… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/alibabacloud - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/alibabacloud/llm.txt - **Topics**: Alibaba Cloud review, Alibaba Cloud GPU cloud, Alibaba Cloud ClusterMAX rating, Alibaba Cloud Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, Alibaba Cloud B200, Alibaba Cloud H100, B200 cloud, H100 cloud, Kubernetes, Slurm, managed Slurm, ClusterMAX 2.0, SemiAnalysis While a major hyperscaler in Asia, its high-performance, modern GPU offerings (H100/B200) are not readily available for testing in most public regions, with a primary focus on the Chinese domestic market. Alibaba Cloud’s managed Kubernetes service is called Container Service for Kubernetes (ACK). ACK provides a fully managed solution where Alibaba Cloud handles the control plane (master nodes), which are critical for the cluster’s operation. You only need to create and manage the worker nodes. This simplifies the deployment and maintenance of Kubernetes clusters. ACK supports various node types, including those with heterogeneous computing resources like GPUs, making it suitable for a wide range of workloads, especially for AI and machine learning. Alibaba Cloud also provides a managed Slurm solution as part of its Elastic High Performance Computing (E-HPC) service. E-HPC is a platform that simplifies the deployment and management of high-performance computing clusters. It also includes a Slurm on Kubernetes solution, which uses a dedicated operator to deploy and manage Slurm clusters within ACK. 
[](https://substackcdn.com/image/fetch/$s_!Fm86!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99935c55-61fa-4d7b-8554-4134875c70c4_937x500.png)Source: not much GPU selection in the Chinese availability zones In our [datacenter model](https://semianalysis.com/datacenter-industry-model/), we have covered Alibaba’s expansion globally across Thailand, Mexico, SK, Malaysia, and Philippines. There is also a new datacenter in Brazil being built, as ByteDance is looking to add $10B there, Huawei Cloud is looking to build a 4th datacenter there, and Tencent also plans expansion into South America. Notably, Didi, AliExpress, Shein and many Chinese automakers are expanding to Brazil too. Alibaba Cloud is getting in at the ground floor, helping build Brazil’s AI industry from the ground up. In the future we are very interested in testing the Alibaba Cloud experience. It seems clear that the software console is sophisticated, and proven at scale by some of the world’s largest AI companies. We will be tracking this neocloud’s posture with respect to geopolitics closely over time. --- # ARC Compute (Unavailable) > ARC Compute earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. ARC Compute has a track record of deploying HGX GPU servers to customers in Canada for H100, H200, B200 and B300 generations. 
Unfortunately ARC has had its fair share of legal troubles, from prosecuting two former employees… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/arccompute - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/arccompute/llm.txt - **Topics**: ARC Compute review, ARC Compute GPU cloud, ARC Compute ClusterMAX rating, ARC Compute Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, ARC Compute B200, ARC Compute H200, ARC Compute H100, B200 cloud, H200 cloud, H100 cloud, InfiniBand, ClusterMAX 2.0, SemiAnalysis ARC Compute has a track record of deploying HGX GPU servers to customers in Canada for H100, H200, B200 and B300 generations. Unfortunately ARC has had its fair share of legal troubles, from prosecuting two former employees caught stealing source code and soliciting clients, to allegedly violating US export controls on their server sales. On the website, ARC advertises H100 clusters with InfiniBand starting at $1.45/hr, an attractive price for anyone who can access them for testing. Unfortunately we have not been able to gain access to conduct our testing. --- # MegaSpeed (Unavailable) > MegaSpeed earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. The New York Times did some reporting on Megaspeed recently, but somehow did not make the connection to Alibaba as the primary customer. 
Megaspeed is headquartered in Singapore, and we have been tracking their massive buildout… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/megaspeed - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/megaspeed/llm.txt - **Topics**: MegaSpeed review, MegaSpeed GPU cloud, MegaSpeed ClusterMAX rating, MegaSpeed Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, ClusterMAX 2.0, SemiAnalysis The New York Times did some reporting on Megaspeed recently, but somehow did not make the connection to Alibaba as the primary customer. Megaspeed is headquartered in Singapore, and we have been tracking their massive buildout across Malaysia for a while in our [Datacenter Model](https://semianalysis.com/datacenter-industry-model/). We have been unable to access any Megaspeed cloud services for our testing. --- # Bitdeer (Unavailable) > Bitdeer earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis (updated April 2026 in the ClusterMAX 2.1 Update). Bitdeer is a Singapore-based company that spun out of Bitmain, the largest Chinese crypto mining company that also sells custom ASICs for mining. 
Bitdeer runs a lot of these ASICs,… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2026-04-20 (Apr 20, 2026) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/bitdeer - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/bitdeer/llm.txt - **Topics**: Bitdeer review, Bitdeer GPU cloud, Bitdeer ClusterMAX rating, Bitdeer Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, Bitdeer GB200 NVL72, Bitdeer GB200, Bitdeer B200, Bitdeer H100, GB200 NVL72 cloud, GB200 cloud, B200 cloud, H100 cloud, RoCE, NVLink, SOC2, ClusterMAX 2.0, ClusterMAX 2.1, ClusterMAX 2.1 Update, SemiAnalysis Bitdeer is a Singapore-based company that spun out of Bitmain, the largest Chinese crypto mining company that also sells custom ASICs for mining. Bitdeer runs a lot of these ASICs, and is now also a neocloud. Their platform includes sites around the world: Singapore, Malaysia, Indonesia, Iceland, the Netherlands, Canada and the US. [](https://substackcdn.com/image/fetch/$s_!SyDh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96bde92c-288b-48aa-af4e-fecc450e4184_936x485.png) When we logged in to test, there was supposed availability of H100 in Malaysia, US, Iceland and the Netherlands. There is also B200 and H100 in Singapore. Bitdeer’s website claims that they have achieved SOC2 Type I and ISO/IEC 27001:2022 compliance. This should easily put them on our list. However, Bitdeer has an aggressive KYC process in place that prevented us from renting a GPU during the test period. As a result we can’t verify anything about their on-demand GPU cloud platform. 
Interestingly, Bitdeer offers different prices for bandwidth on a per-VM basis, ranging from 1 Mb/s to 1024 Mb/s on a fixed or pay-per-use traffic basis. In heavy usage scenarios, this can nearly double the cost of a 1x B200 VM on their platform, from $4.69/hr to $8.27/hr. Beyond hands-on testing, Bitdeer has some big plans. Their site in Massillon, Ohio recently underwent a third-party feasibility assessment of its suitability for Tier 3 HPC/AI datacenters, which reported “largely positive results (…) due to the availability of land, power, fiber and water resources”. We find it interesting to see nothing about the existing buildings in this report. It seems that Bitdeer, like other crypto miners, is planning to knock down existing buildings and use the powered land for a brand new AI cloud building, finally selling that as powered shell or colo with a 10-15% margin on their costs. [](https://substackcdn.com/image/fetch/$s_!tCM8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ca6c597-11b7-46f8-a52e-2454732b20ac_936x706.png) Source: SemiAnalysis [Datacenter Model](https://semianalysis.com/datacenter-industry-model/) ## ClusterMAX 2.1 Update (April 2026) We conducted some initial testing with Bitdeer at their Malaysia site using 2 nodes of GB200 NVL72. We were short on time and could not get the IMEX domain configured correctly, so we could not confirm that NVLink was set up for intranode communication on the NVL72 domain. We did successfully run some training jobs and find our way around the console. With many more GPUs coming online this year, we are excited to see more from Bitdeer in terms of orchestration software, monitoring, reliability and support for the big clusters they have announced they're building. --- # Runsun (Unavailable) > Runsun earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. 
Runsun is a provider based in Singapore with GPUs deployed in Japan and the US, and plans to expand to South Korea, Australia and Europe soon. Runsun claims over 10,000 GPUs deployed as a bare metal service. At this time we have… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/runsun - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/runsun/llm.txt - **Topics**: Runsun review, Runsun GPU cloud, Runsun ClusterMAX rating, Runsun Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, bare metal, ClusterMAX 2.0, SemiAnalysis Runsun is a provider based in Singapore with GPUs deployed in Japan and the US, and plans to expand to South Korea, Australia and Europe soon. Runsun claims over 10,000 GPUs deployed as a bare metal service. At this time we have been unable to access anything for testing, but look forward to testing Runsun’s cloud services in the future. --- # FPT CLOUD (Unavailable) > FPT CLOUD earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis (updated April 2026 in the ClusterMAX 2.1 Update). A large regional provider in Southeast Asia with significant capacity in Vietnam and Japan and partnerships throughout the region. 
We were unable to gain access to a cluster in… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2026-04-20 (Apr 20, 2026) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/fptcloud - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/fptcloud/llm.txt - **Topics**: FPT CLOUD review, FPT CLOUD GPU cloud, FPT CLOUD ClusterMAX rating, FPT CLOUD Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, FPT CLOUD H200, FPT CLOUD H100, H200 cloud, H100 cloud, DCGM, ClusterMAX 2.0, ClusterMAX 2.1, ClusterMAX 2.1 Update, SemiAnalysis A large regional provider in Southeast Asia with significant capacity in Vietnam and Japan and partnerships throughout the region. We were unable to gain access to a cluster in time for publishing this article, but expect to include them in ClusterMAX 2.1 soon. Stay tuned. ## ClusterMAX 2.1 Update (April 2026) We got the chance to test FPT Smart Cloud back in November 2025. FPT is based in Vietnam and at the time had H100 and H200 available. They use Soperator from Nebius for orchestration, and the cluster was well configured. We noticed some poor performance on the VAST Storage. The monitoring experience was quite strong, with some of the best custom DCGM dashboards we have seen and Loki used for logging and analysis. Unfortunately, FPT is held back from the Silver tier due to some serious security issues. Our testing showed that PKeys and SAKey were not configured correctly, allowing us to see every other endpoint on the network (i.e. every other customer). --- # Backend (Unavailable) > Backend earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. Backend is a Korean neocloud with many software offerings that seem to simplify onboarding to clusters. 
We appreciate the flexibility with Backend apparently supporting customers running their software stack on-prem, in their… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/backend - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/backend/llm.txt - **Topics**: Backend review, Backend GPU cloud, Backend ClusterMAX rating, Backend Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, ClusterMAX 2.0, SemiAnalysis Backend is a Korean neocloud with many software offerings that seem to simplify onboarding to clusters. We appreciate the flexibility with Backend apparently supporting customers running their software stack on-prem, in their cloud, or on developer workstations. At this point it seems that Backend is missing a significant amount of compute capacity, and the public cloud service is still in beta, requiring an invitation to test it out. We look forward to testing Backend in the future. --- # Naver (Unavailable) > Naver earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. Naver is effectively the “Google of South Korea”, thanks to their success in search, blogging, forums, e-commerce, payments and more. 
Naver has just recently gone public with their plans to launch a neocloud, thanks in no small part… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/naver - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/naver/llm.txt - **Topics**: Naver review, Naver GPU cloud, Naver ClusterMAX rating, Naver Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, ClusterMAX 2.0, SemiAnalysis Naver is effectively the “Google of South Korea”, thanks to their success in search, blogging, forums, e-commerce, payments and more. Naver has just recently gone public with their plans to launch a neocloud, thanks in no small part to Jensen’s trip to Seoul and meetings with HBM manufacturers SK Hynix and Samsung. The new cloud project will include a total of some 260,000 GPUs being deployed, with 60,000 going to Naver for the public cloud offering, while the other 200,000 are split between Samsung, SK, Hyundai and Naver’s internal workloads. Notably, Naver already operates a cloud, has a large research organization putting out great work, and has the infrastructure in place to seed the Korean startup and academic ecosystem at large. We look forward to testing Naver’s GPU cloud offerings in the future. --- # Indosat (Unavailable) > Indosat earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. A major Indonesian telco that has signed an MOU with NVIDIA to become Indonesia’s first certified NVIDIA cloud partner. The platform is not yet launched and is unavailable for testing. 
- **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/indosat - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/indosat/llm.txt - **Topics**: Indosat review, Indosat GPU cloud, Indosat ClusterMAX rating, Indosat Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, ClusterMAX 2.0, SemiAnalysis A major Indonesian telco that has signed an MOU with NVIDIA to become Indonesia’s first certified NVIDIA cloud partner. The platform is not yet launched and is unavailable for testing. --- # SAKURA (Unavailable) > SAKURA earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. A major Japanese sovereign AI cloud with significant MOUs with KDDI and HPE for a large Blackwell cluster. Their public, self-service AI cloud platform is not yet available for testing. - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/sakura - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/sakura/llm.txt - **Topics**: SAKURA review, SAKURA GPU cloud, SAKURA ClusterMAX rating, SAKURA Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, ClusterMAX 2.0, SemiAnalysis A major Japanese sovereign AI cloud with significant MOUs with KDDI and HPE for a large Blackwell cluster. Their public, self-service AI cloud platform is not yet available for testing. --- # Yotta (Unavailable) > Yotta earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. 
Yotta’s Shakti Cloud (not to be confused with the datacenter industry conference “Yotta”, going by the same name) was the first mover in India’s sovereign AI push, and according to Jensen his “favourite cloud in Asia”. Yotta has… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/yotta - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/yotta/llm.txt - **Topics**: Yotta review, Yotta GPU cloud, Yotta ClusterMAX rating, Yotta Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, Yotta H100, H100 cloud, ClusterMAX 2.0, SemiAnalysis Yotta’s Shakti Cloud (not to be confused with the datacenter industry conference “Yotta”, going by the same name) was the first mover in India’s sovereign AI push, and according to Jensen his “favourite cloud in Asia”. Yotta has announced that over 16,000 H100 GPUs are deployed in Shakti, with Blackwell currently being deployed and more on the way. This maintains Yotta’s status as the operator with the most GPUs in India, at 32,768 total by the end of 2025. We regret the difficulties experienced so far in testing Yotta’s cloud services and look forward to including them for testing in ClusterMAX 2.1 coming soon. --- # Neev Cloud (Unavailable) > Neev Cloud earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. Neev Cloud is another Indian neocloud that unfortunately did not want us to test their offerings despite plans to deploy 40,000 GPUs by 2026 in their Indore location in Central India, backed by a $1.5B investment. 
An initial… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/neevcloud - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/neevcloud/llm.txt - **Topics**: Neev Cloud review, Neev Cloud GPU cloud, Neev Cloud ClusterMAX rating, Neev Cloud Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, Neev Cloud B200, Neev Cloud H200, B200 cloud, H200 cloud, InfiniBand, Kubernetes, Slurm, ClusterMAX 2.0, SemiAnalysis Neev Cloud is another Indian neocloud that unfortunately did not want us to test their offerings despite plans to deploy 40,000 GPUs by 2026 in their Indore location in Central India, backed by a $1.5B investment. An initial order in mid-2024 was placed for 8,000 GPUs from HPE, with planned expansion to Chennai, Mumbai, Hyderabad and Noida. The website claims 1,000 to 16,000 GPUs connected with InfiniBand, with H200, B200 and B300 all available for “pre-reservation”. A prominent picture of the Neev Cloud CEO shaking hands with Modi on their front page leads us to assume that Neev will be serving sovereign Indian customers primarily. However, it seems that Kubernetes and Slurm clusters are not available for testing. India is a large country with a vibrant tech ecosystem, but it is unclear how many neoclouds will make it going forward. --- # Evroc (Unavailable) > Evroc earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. Evroc is yet another European company that is building a neocloud with a focus on sustainability and sovereignty. 
Evroc is headquartered in Stockholm, with plans for datacenters in Arlandastad and Cannes, and partners in Paris,… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/evroc - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/evroc/llm.txt - **Topics**: Evroc review, Evroc GPU cloud, Evroc ClusterMAX rating, Evroc Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, Evroc GB300, GB300 cloud, ClusterMAX 2.0, SemiAnalysis Evroc is yet another European company that is building a neocloud with a focus on sustainability and sovereignty. Evroc is headquartered in Stockholm, with plans for datacenters in Arlandastad and Cannes, and partners in Paris, Stockholm and Frankfurt. The plans involve up to 10,000 GB300 NVL72 GPUs. At this point we have not been able to test any of Evroc’s cloud services, but look forward to testing the “world’s cleanest cloud” in the future. --- # greenai.cloud (Unavailable) > greenai.cloud earns a ClusterMAX 2.0 Unavailable rating from SemiAnalysis. GreenAI cloud is another Sweden-based cloud provider focused on sustainability and sovereignty. 
GreenAI specifically claims “CO2-Negative Computing”, and focuses primarily on serving security-conscious government agencies… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2025-11-06 (Nov 06, 2025) - **Last updated**: 2025-11-06 (Nov 06, 2025) - **Source**: ClusterMAX 2.0 — https://newsletter.semianalysis.com/p/clustermax-20-the-industry-standard - **URL**: https://www.clustermax.ai/cloudreview/greenaicloud - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/greenaicloud/llm.txt - **Topics**: greenai.cloud review, greenai.cloud GPU cloud, greenai.cloud ClusterMAX rating, greenai.cloud Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, greenai.cloud GB200 NVL72, greenai.cloud GB200, greenai.cloud B200, greenai.cloud H200, greenai.cloud H100, greenai.cloud A100, GB200 NVL72 cloud, GB200 cloud, B200 cloud, H200 cloud, H100 cloud, A100 cloud, ClusterMAX 2.0, SemiAnalysis GreenAI cloud is another Sweden-based cloud provider focused on sustainability and sovereignty. GreenAI specifically claims “CO2-Negative Computing”, and focuses primarily on serving security-conscious government agencies in defense, intelligence, health and science. Compliance with Schrems II is in place with a Level 4 secure facility. A100, H100, H200 and B200 GPUs are available, though no GB200 NVL72. Interestingly, a flash-in-the-pan partnership with Cerebras seems to have flamed out, with the greenai team emphasizing that they will never do business with the company again. We have not had the chance to test any of greenai’s cloud services but hope to have that chance in the future. --- # Radiant/Ori (Unavailable) > Radiant/Ori earns a ClusterMAX Unavailable rating from SemiAnalysis (introduced April 2026 in the ClusterMAX 2.1 Update). Radiant was announced recently after Brookfield acquired Ori, a Saudi Aramco-backed neocloud with H100s and H200s in London and Dallas. 
When we tested with Ori on two occasions… - **Authors**: Jordan Nanos, Daniel Nishball, Dylan Patel - **Tier**: Unavailable - **Published**: 2026-04-20 (Apr 20, 2026) - **Last updated**: 2026-04-20 (Apr 20, 2026) - **Source**: ClusterMAX 2.1 Update — https://newsletter.semianalysis.com/i/194395279/clustermax-21-update - **URL**: https://www.clustermax.ai/cloudreview/radiantori - **Single-cloud LLM file**: https://www.clustermax.ai/cloudreview/radiantori/llm.txt - **Topics**: Radiant/Ori review, Radiant/Ori GPU cloud, Radiant/Ori ClusterMAX rating, Radiant/Ori Unavailable, Unavailable tier GPU cloud, GPU cloud review, neocloud review, Radiant/Ori H200, Radiant/Ori H100, H200 cloud, H100 cloud, Kubernetes, Slurm, NCCL, DCGM, ClusterMAX 2.0, ClusterMAX 2.1, ClusterMAX 2.1 Update, SemiAnalysis Radiant was announced recently after Brookfield acquired Ori, a Saudi Aramco-backed neocloud with H100s and H200s in London and Dallas. When we tested with Ori on two occasions in the fall, we saw some quick progress but not enough to get to Silver. Ori fell victim to the exact same issues as FPT, with PKey and SAKey not configured correctly. In addition, during our first round of testing, we were unable to run nccl-tests at full bandwidth on Kubernetes due to an issue with the NetworkOperator picking up NICs that were intended for the frontend network but were named/configured incorrectly and so were used for NCCL. Finally, DCGMI health watches are not enabled by default, and there is no automated background health check program. Our testing of a simple hardware failure simulation showed that the system did not trigger any automated alerting or node replacement over an 18-hour window. The team is targeting Q2 2026 for the release of monitoring dashboards, and seems well on their way to having the funding they need to build Blackwell clusters with the comprehensive Slurm, Kubernetes, monitoring and reliability features that customers expect. ---