Azure maintains its ranking as a Gold-tier provider, considering that it will still be providing the bulk of OpenAI’s capacity through the end of 2026.

Unfortunately, if you don’t work for OpenAI, Azure is not a significant player for managed clusters or on-demand VMs. Capacity is constrained across all regions globally for Hopper and Blackwell, and the CycleCloud slurm provisioning process is in need of an update, or at least some simplification. This has been reaffirmed by OpenAI and Microsoft locking in a long-horizon partnership, with Microsoft getting a ~27% stake and model/IP rights through 2032.

This gap between the wholesale bare-metal experience for anchor tenants like OpenAI and the managed experience for the rest of the market becomes clear when we look at reliability and compare slurm (CycleCloud) with kubernetes (AKS).

AKS reliability includes a fully managed Node Auto-Repair feature. This system automatically detects unhealthy nodes based on kubelet status conditions and attempts remediation through reboots or re-imaging. This philosophy extends to monitoring, where Azure Monitor for Containers provides integrated visibility into every layer of the cluster out of the box.
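To make the auto-repair idea concrete, here is a minimal sketch of the kind of decision logic such a controller might apply to a node's kubelet-reported conditions. It is not AKS's implementation: the grace period, the reboot-then-reimage ordering, and the `escalate` outcome are all assumptions chosen for illustration. The input shape mirrors the `status.conditions` list on a Kubernetes Node object.

```python
from datetime import datetime, timedelta, timezone

# Assumption: how long a node may stay not-Ready before we act.
NOT_READY_GRACE = timedelta(minutes=10)

# Assumption: remediation steps, tried in order on repeated failures.
REMEDIATION_LADDER = ["reboot", "reimage"]


def is_unhealthy(conditions, now):
    """True if the Ready condition has been False/Unknown past the grace period."""
    for cond in conditions:
        if cond["type"] == "Ready":
            since = datetime.fromisoformat(
                cond["lastTransitionTime"].replace("Z", "+00:00")
            )
            return cond["status"] != "True" and now - since >= NOT_READY_GRACE
    # No Ready condition reported at all: treat the node as unhealthy.
    return True


def next_action(conditions, now, previous_attempts):
    """Pick the next remediation step, or 'none' if the node looks healthy."""
    if not is_unhealthy(conditions, now):
        return "none"
    if previous_attempts < len(REMEDIATION_LADDER):
        return REMEDIATION_LADDER[previous_attempts]
    return "escalate"  # hypothetical: hand off to a human or replace the node


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = [{"type": "Ready", "status": "Unknown",
          "lastTransitionTime": "2025-01-01T11:00:00Z"}]
print(next_action(stale, now, previous_attempts=0))
```

The point of the sketch is that on AKS this loop is run for the user, whereas on a slurm cluster the equivalent logic has to be written and maintained by the operator.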

In stark contrast, CycleCloud relies on the traditional HPC model via slurm’s HealthCheckProgram. However, CycleCloud does not ship a good default, such as LBNL’s Node Health Check (https://github.com/mej/nhc), or anything customized to Azure infrastructure. Instead, the full operational burden of health checks is placed on the user, who must write, test, and maintain custom scripts to monitor GPUs and the InfiniBand fabric. Beyond that, the integrated monitoring is limited to a high-level node status view in the UI, forcing users to implement their own solutions for any meaningful job-level or hardware-specific insights, such as DCGM dashboards.
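For a sense of what that burden looks like, below is a minimal sketch of a script a user might point slurm’s HealthCheckProgram at. It is an illustration, not a CycleCloud or Azure-provided check: the expected GPU count of 8, the choice of checks, and the assumption that the slurm node name equals the short hostname are all ours. It counts GPUs via `nvidia-smi`, reads InfiniBand port states from sysfs, and drains the node with `scontrol` if anything looks wrong.

```python
import glob
import socket
import subprocess

EXPECTED_GPUS = 8  # assumption: adjust to the VM size in use


def count_gpus(nvidia_smi_output):
    """Count lines from `nvidia-smi --query-gpu=index --format=csv,noheader`."""
    return sum(1 for line in nvidia_smi_output.splitlines() if line.strip())


def inactive_ib_ports(port_states):
    """Return ports whose sysfs state (e.g. '4: ACTIVE', '1: DOWN') is not ACTIVE."""
    return sorted(port for port, state in port_states.items()
                  if "ACTIVE" not in state)


def read_ib_port_states():
    """Read every InfiniBand port state file exposed under sysfs."""
    states = {}
    for path in glob.glob("/sys/class/infiniband/*/ports/*/state"):
        with open(path) as f:
            states[path] = f.read().strip()
    return states


def find_problems():
    """Run each check and collect human-readable failure reasons."""
    problems = []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30, check=True,
        ).stdout
        found = count_gpus(out)
        if found != EXPECTED_GPUS:
            problems.append(f"expected {EXPECTED_GPUS} GPUs, found {found}")
    except (OSError, subprocess.SubprocessError) as exc:
        problems.append(f"nvidia-smi failed: {exc}")
    bad_ports = inactive_ib_ports(read_ib_port_states())
    if bad_ports:
        problems.append(f"{len(bad_ports)} InfiniBand port(s) not ACTIVE")
    return problems


def main():
    """Drain this node if any check fails; return a shell-style exit code."""
    problems = find_problems()
    if not problems:
        return 0
    node = socket.gethostname().split(".")[0]  # assumption: matches NodeName
    reason = "healthcheck: " + "; ".join(problems)
    try:
        subprocess.run(["scontrol", "update", f"NodeName={node}",
                        "State=DRAIN", f"Reason={reason}"], check=False)
    except OSError:
        pass  # scontrol not on PATH; nothing more this sketch can do
    return 1
```

To use something like this, one would end the file with `sys.exit(main())` and reference it from `HealthCheckProgram` (with a `HealthCheckInterval`) in slurm.conf. A production check would go much further, covering ECC and XID errors, NCCL or bandwidth tests, and DCGM diagnostics, which is exactly the work CycleCloud leaves to the user.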

As an example, the current documentation for deploying a CycleCloud cluster is split between older guides and a newer GitHub-centric approach. Users are required to configure login and scheduler nodes separately, as well as provision and manage their own MySQL database to handle slurm accounting (sacct).

However, the comprehensive nature of a hyperscaler cloud platform also has some merits. Networking is straightforward, with access options via NAT Gateway or bastion host. The platform also provides flexibility through support for custom images and integration with Azure Spot Virtual Machines for cost-effective bursting. Azure has a legacy in HPC that will feel familiar to users coming to a GPU cluster from an academic HPC background.

On networking, Azure continues to lead the hyperscalers in performance, being the only one to deploy InfiniBand and implement SHARP at scale. Security is also rock solid: Microsoft in general holds a reputation for robust security and compliance practices, which has made it a trusted partner for federal government agencies and defense contractors.

With that said, the dynamics of Microsoft’s relationship with its key customer, OpenAI, are shifting. Since Satya mentioned he’s “good for his $80B”, Stargate has turned into a $600B behemoth, much of which has been captured by Oracle. Google, xAI and Meta have followed suit, with Zuck committing to the same total spend of $600B over the next 5-7 years.

The reality is that we are forecasting Azure to lose market share once the frontier labs’ compute requirements and existing commitments are taken into account. This leaves Azure with the rest of the market, which generally demands a strong managed cluster experience for slurm or kubernetes and a streamlined support experience.

To address this customer base, we believe that Azure must revamp its CycleCloud offering, simplifying the current cluster deployment and monitoring experience. Otherwise, Azure is at risk of being demoted to Silver due to its poor user experience for startups from Series A to AI unicorns. Azure faces stiff competition from the fully managed, Kubernetes-native, and vertically integrated offerings of Neoclouds like CoreWeave, Nebius, and Oracle, as well as from the aggressive capacity buildout and revised pricing we have seen from AWS and GCP.
