Oracle just posted the most incredible quarterly earnings the market has ever seen. We were ahead of estimates, but still didn’t catch this in full. Specifically, Oracle signed four multi-billion-dollar contracts with three different customers in Q1, including a $300Bn+ deal with OpenAI. A recent report from The Information raised concerns about AI server margins on these multi-billion-dollar contracts, but we think lower margins during a ramp-up period make sense, and we expect margins to expand significantly.

Source: Oracle

Oracle is in a unique position as the only hyperscaler (over 100 availability domains, or ADs, across 45 active regions globally) without an in-house AGI research program or significant venture capital investment (though Larry did invest $2B in xAI after a few texts from Elon). This has helped them land contracts with OpenAI, Meta, ByteDance, and Nvidia. According to public releases, they currently hold over 60% of US Stargate. We also track this in detail in our Accelerator & HBM Model.

Oracle also pivoted to wholesale bare metal early, taking advantage of their balance sheet while also maintaining a notable presence in the managed slurm and managed kubernetes market. In many cases, we have found other cloud providers giving us servers with IP addresses, locations, and other configuration information that make it clear we’re actually running in an Oracle datacenter.

Oracle’s default setup is typically provisioned through the console, using Terraform for automation behind the scenes. A notable point of friction for some users is the almost-mandatory use of Oracle Linux (version 8.10 by default), a Red Hat Enterprise Linux-compatible distribution. This operating system choice is contentious, as many AI workloads, particularly those in the open-source community, are first tested on Debian-based operating systems, specifically Ubuntu, for its broad compatibility and ease of use.

Source: SemiAnalysis Oracle Linux headache

We attribute the Oracle Linux default to a historical beef with Canonical. This is surprising given that certified Ubuntu images have been available on Oracle Bare Metal Cloud Services since 2017 and are current up to version 24.04.

In order to deploy a slurm or kubernetes cluster, users unfortunately need to use the OCI console. Visually, the console adheres to the Redwood design system and uses Oracle’s JavaScript Extension Toolkit (JET), neither of which sparks joy. Oracle remains steadfast in its lifelong commitment to Java, even in the age of AI. After deploying a cluster, users who want to access a Grafana dashboard will need to navigate the maze of UI element options at their disposal. For those interested, the cheat code is: left-hand burger menu > Developer Services > Stacks > Stack details > Application Information > (scroll down) > Grafana admin password.

Source: SemiAnalysis OCI console

During testing of both slurm and Kubernetes, everything went smoothly. We were able to quickly achieve expected collective bandwidth on nccl-tests, and expected MFU on torchtitan. Interestingly, upon first login to the slurm cluster we found a single node in a drain state. This was quickly replaced, but highlighted the difference in approaches to health checks and bare metal node provisioning when compared to CoreWeave.
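For readers unfamiliar with MFU (model FLOPs utilization), here is a rough sketch of how it is commonly estimated for dense transformer training. It assumes the widely used approximation of about 6 FLOPs per parameter per token, which ignores attention FLOPs. The function name and parameters are ours for illustration and are not part of torchtitan; the peak FLOPs figure should come from the accelerator's datasheet.

```python
def estimate_mfu(
    params: float,
    tokens_per_second: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Estimate model FLOPs utilization as a fraction between 0 and 1.

    Assumption: training a dense transformer costs roughly 6 FLOPs per
    parameter per token (forward plus backward), ignoring attention.
    """
    if num_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus and peak_flops_per_gpu must be positive")
    achieved_flops_per_second = 6.0 * params * tokens_per_second
    peak_flops_per_second = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second
```

"Expected MFU" in a test like ours then means the measured fraction lands in the range other clusters reach with the same model, precision, and parallelism settings.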

Oracle is actively working to improve platform reliability and user experience. Node Auto-Repair and Node Problem Detector integration is expected in Q4, with the goal of providing customers with a “doctor HPC” user experience via an official OCI binary. The team is also developing an Active Health Check mechanism called the Sustained Workflow Check, which involves running PyTorch and CUDA matmul linear regressions for 2-5 minutes to ensure sustained performance. This check currently runs on demand, and in most cases an Oracle engineer works directly with customers to schedule the checks in a low-priority partition. Default behaviour that integrates the check into both slurm and OKE is under development.
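To illustrate the idea behind a sustained-performance check, here is a minimal sketch. It is not Oracle's implementation: the function names, the windowing scheme, and the 90% threshold are all our assumptions. The workload is run repeatedly for a fixed duration, throughput is recorded per time window, and the check fails if any window drops well below the median (for example, due to thermal throttling).

```python
import statistics
import time
from typing import Callable, List, Tuple


def tiny_matmul_workload(n: int = 24) -> float:
    """Multiply two n x n matrices in pure Python; return the FLOPs performed.

    Stand-in only: on a GPU node this would be something like a large
    torch.matmul followed by a device synchronization.
    """
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    cols = list(zip(*b))
    _ = [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]
    return 2.0 * n ** 3


def sustained_check(
    workload: Callable[[], float],
    duration_s: float,
    window_s: float,
    min_ratio: float = 0.9,  # hypothetical threshold, not an Oracle default
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[bool, List[float]]:
    """Run `workload` for about `duration_s` seconds, measuring work/second per window.

    Returns (passed, per_window_rates). The check passes only if the slowest
    window achieved at least `min_ratio` of the median window rate.
    """
    if duration_s <= 0 or window_s <= 0:
        raise ValueError("duration_s and window_s must be positive")
    rates: List[float] = []
    deadline = clock() + duration_s
    while clock() < deadline:
        window_start = clock()
        work = 0.0
        while True:
            work += workload()
            elapsed = clock() - window_start
            if elapsed >= window_s:
                break
        rates.append(work / elapsed)
    passed = min(rates) >= min_ratio * statistics.median(rates)
    return passed, rates
```

A short local run would look like `passed, rates = sustained_check(tiny_matmul_workload, duration_s=2.0, window_s=0.5)`; a real check would use the 2-5 minute duration described above and a GPU workload.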

For managed Kubernetes, Oracle offers OKE (Oracle Container Engine for Kubernetes), with all resources being public by default. Users can disable this default, and can opt to use the Nvidia GPU Operator, although the default setup uses a custom operator. OKE provides GPU and GPU+RDMA pools as provisioning options, along with integrated storage options via a checkbox. The official instructions for setting up RDMA are publicly available, and the nccl-tests manifests show good out-of-the-box performance for allreduce and allgather operations.
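As background on how allreduce and allgather numbers are usually compared, the sketch below converts a measured time into algorithm bandwidth and bus bandwidth, following our understanding of the correction factors described in the nccl-tests performance notes: 2(n-1)/n for allreduce and (n-1)/n for allgather. The function name and the example sizes in the note below are ours.

```python
# Bus-bandwidth correction factors as a function of the number of ranks n.
BUSBW_FACTORS = {
    "allreduce": lambda n: 2.0 * (n - 1) / n,
    "allgather": lambda n: (n - 1) / n,
}


def bus_bandwidth_gb_per_s(
    collective: str, size_bytes: float, seconds: float, ranks: int
) -> float:
    """Return bus bandwidth in GB/s (1 GB = 1e9 bytes) for one collective call.

    Algorithm bandwidth is simply size / time; bus bandwidth rescales it so
    results are comparable across different rank counts.
    """
    if collective not in BUSBW_FACTORS:
        raise ValueError(f"unsupported collective: {collective}")
    if ranks < 2 or seconds <= 0:
        raise ValueError("need at least 2 ranks and a positive time")
    algbw = size_bytes / seconds / 1e9
    return algbw * BUSBW_FACTORS[collective](ranks)
```

For instance, an 8 GB allreduce across 8 ranks that completes in one second has an algorithm bandwidth of 8 GB/s and a bus bandwidth of 14 GB/s under these factors.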

A key point of frustration in the OKE setup is the lack of a direct kubeconfig file. Users are instead required to SSH into the cluster to perform management functions. This is counter-intuitive for a publicly accessible, load-balanced service, and can require a bastion proxy before a cluster admin or user gets proper external access to the cluster. One of the key benefits of kubernetes over slurm for users is the ability to develop code locally and switch between different cluster contexts quickly, without ssh.
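To make the context-switching point concrete, here is a small sketch of the structure a kubeconfig file carries, modeled as a Python dict. The cluster names and server URLs are hypothetical, and credentials are left empty; this only shows why having the file locally makes switching clusters a one-field change.

```python
import copy
from typing import Dict


def make_kubeconfig(servers: Dict[str, str], current: str) -> dict:
    """Build a kubeconfig-shaped dict from {context_name: api_server_url}.

    Credentials are intentionally omitted; each user entry is an empty stub.
    """
    if current not in servers:
        raise KeyError(current)
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [
            {"name": name, "cluster": {"server": url}}
            for name, url in servers.items()
        ],
        "users": [{"name": f"{name}-user", "user": {}} for name in servers],
        "contexts": [
            {"name": name, "context": {"cluster": name, "user": f"{name}-user"}}
            for name in servers
        ],
        "current-context": current,
    }


def switch_context(config: dict, name: str) -> dict:
    """Return a copy of `config` whose current-context points at `name`."""
    known = {ctx["name"] for ctx in config["contexts"]}
    if name not in known:
        raise KeyError(f"unknown context: {name}")
    updated = copy.deepcopy(config)
    updated["current-context"] = name
    return updated


# Hypothetical clusters, for illustration only.
example = make_kubeconfig(
    {
        "local-dev": "https://127.0.0.1:6443",
        "oke-gpu": "https://oke.example.invalid:6443",
    },
    current="local-dev",
)
```

With a real kubeconfig on a laptop, the same switch is `kubectl config use-context <name>`, which is the workflow that an SSH-only setup gets in the way of.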

In terms of networking, OKE’s default way to give pods access to the RDMA network is to inject two fields into pod specs: hostNetwork: true and dnsPolicy: ClusterFirstWithHostNet. On OKE, Oracle does not deploy the full GPU Operator but rather the device plugin only, with plans to add the full operator later. The Nvidia toolkit is installed and automatically updated on the nodes, keeping the software stack current. Performance testing on kubernetes using vllm benchmarks for prefill/decode (PD) disaggregation with llm-d showed strong results, and was easy to set up and integrate with the provided LoadBalancer services via public IP.
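As a minimal sketch of what that injection amounts to, the helper below sets the two fields on a pod spec represented as a Python dict. The helper names, the container name, and the image are hypothetical; only the two field names and values come from the setup described above.

```python
import copy


def enable_host_network(pod_spec: dict) -> dict:
    """Return a copy of a pod spec with the two host-networking fields set."""
    patched = copy.deepcopy(pod_spec)
    patched["hostNetwork"] = True
    patched["dnsPolicy"] = "ClusterFirstWithHostNet"
    return patched


def patch_manifest(manifest: dict) -> dict:
    """Apply the patch to a Pod, or to the pod template of a workload object.

    Assumes workload kinds (Job, Deployment, ...) keep their pod spec at
    spec.template.spec, while a bare Pod keeps it at spec.
    """
    patched = copy.deepcopy(manifest)
    if patched.get("kind") == "Pod":
        patched["spec"] = enable_host_network(patched["spec"])
    else:
        template = patched["spec"]["template"]
        template["spec"] = enable_host_network(template["spec"])
    return patched


# Hypothetical pod manifest, for illustration only.
example_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "trainer"},
    "spec": {
        "containers": [
            {"name": "trainer", "image": "example.invalid/trainer:latest"}
        ]
    },
}
patched_pod = patch_manifest(example_pod)
```

With hostNetwork enabled the pod shares the node's network namespace, and ClusterFirstWithHostNet keeps in-cluster DNS resolution working in that mode, which is why the two fields travel together.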

Source: Oracle HPC on OKE repo https://github.com/oracle-quickstart/oci-hpc-oke/blob/main/docs/running-pytorch-jobs-on-oke-using-hostnetwork-with-rdma.md

The initial testing phase suffered from a lack of integrated health checks. While slurm metrics were added later, the control plane initially lacked the necessary CLI features, which required some rollbacks to prevent customers from inadvertently terminating jobs. The introduction of a new “mgmt” CLI aims to address these operational complexities, a goal we agree with.

Source: Oracle

For Storage, Oracle offers a robust marketplace, with Weka and DDN being the primary partners. Weka is available both on their online on-demand marketplace (i.e. it can be built on-demand with bare-metal instances full of NVMe) and through direct deals. Oracle customers report that the shared support experience is stronger with Weka on Oracle than with either VAST or DDN. For Networking, Oracle is like any other hyperscaler talking their book, trying to convince large customers that InfiniBand is not the only high-performance network solution, and their RoCE works well. They seem to be making progress in this regard.

Overall, Oracle has made significant progress improving their managed cluster offerings, with improved monitoring dashboards and node lifecycle management, but there is still room to improve in terms of proactivity. There is still a chance that customers will discover a bad node before Oracle’s automated systems, and have to report the node for replacement manually.
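Until proactive detection improves, customers on slurm can at least spot drained nodes themselves. The sketch below parses node-oriented `sinfo` output, assuming it was produced with something like `sinfo -N -h -o "%N %t"` (one node name and one state per line). The function name and the sample node names are hypothetical, and the sample text is not output from our test cluster.

```python
from typing import List


def find_drained_nodes(sinfo_output: str) -> List[str]:
    """Return node names whose state indicates drain or draining.

    Matches the compact states ("drain", "drng") and the long forms
    ("drained", "draining"), with or without trailing flag characters
    such as "*". Nodes listed under several partitions are reported once.
    """
    drained: List[str] = []
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        node, state = parts[0], parts[1].lower()
        if state.startswith(("drain", "drng")) and node not in drained:
            drained.append(node)
    return drained


# Hypothetical sample output, for illustration only.
SAMPLE = """\
gpu-001 idle
gpu-002 drain*
gpu-003 alloc
gpu-004 drng
gpu-002 drain*
"""
drained_nodes = find_drained_nodes(SAMPLE)
```

Running a check like this on first login, and periodically afterwards, gives a quick list of nodes to report for replacement.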

Oracle continues to be the most cost-effective of the four hyperscalers, and stands out as deploying new infrastructure the quickest, while also providing the best support. We expect Oracle to continue to grow both their wholesale bare metal and managed cluster business going forward, and we encourage Oracle to maintain its commitment to excellent customer support for all customers.

All ClusterMAX™ 2.0 + 2.1 reviews