Since ClusterMAX 1.0, Nebius has continued to show up as the most direct competitor to CoreWeave in our customer conversations. Nebius counts Shopify, Recraft, Mirage, and Genesis Therapeutics among its customers, and most recently landed a 5-year $17.4B deal (with expansion to $19.4B) with Microsoft for its Vineland, New Jersey datacenter. We expect continued pull-ins from Nebius to Microsoft, and the healthy deal pipeline suggests more incremental capacity coming online over time.

Nebius differentiates financially through its low cost of capital, with billions on its balance sheet, no debt, and a strong position as one of two publicly traded GPU-only Neoclouds that we track (the other being CoreWeave). Nebius differentiates technically with a virtualized approach to GPU infrastructure (built on experience from Yandex) and an AI-native approach that comes from dogfooding the platform with their internal AI team, which has resulted in multiple spinoff startups in the AI space.

While we are aware of their struggles in securing colocation deals and establishing a credit rating, their engineering prowess is evident. The Nebius platform prioritizes flexibility, on-demand access, and a robust Kubernetes-native experience. This stands in contrast to CoreWeave’s bare metal, long-term reservation model and makes Nebius a compelling choice for autoscaling and spot instances, specifically used for experimentation and inference. Nebius continues to show up as a low-cost provider on various marketplaces, and as the infrastructure, and is one of the only providers with realistic pricing right on the homepage of their website.

Source: Nebius

As if that wasn’t enough, Nebius is also foraying into the inference endpoint market!

In this section, we will discuss our hands-on experience with Nebius’ platform:

Slurm-on-Kubernetes
Virtualization and Storage
Monitoring and Health Checks
On-Demand Instances

Slurm-on-Kubernetes

Since ClusterMAX 1.0, Nebius has officially launched their managed Soperator service for a fully self-service Slurm-on-Kubernetes experience.

We were able to test this out and, as expected, got a Slurm cluster that was completely set up and ready for use out of the box. This included pre-installed drivers, Docker, passwordless SSH between nodes, and expected performance on collectives (nccl-tests) and PyTorch training jobs (torchtitan pretraining).
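When checking "expected performance" on collectives, it helps to remember how nccl-tests turns a measured time into the bus bandwidth it reports. The sketch below follows the correction factors described in the nccl-tests performance notes. The function name and the example numbers are ours, for illustration only, and are not measurements from the Nebius cluster.

```python
# Minimal sketch: convert a collective's message size and time into
# algorithm bandwidth and bus bandwidth, as nccl-tests reports them.
# Function name and example inputs are hypothetical.

# Bus-bandwidth correction factors per collective, as a function of rank count n.
_BUSBW_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
    "reduce": lambda n: 1.0,
}


def bus_bandwidth_gb_s(size_bytes, time_s, n_ranks, collective="all_reduce"):
    """Return bus bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if time_s <= 0 or n_ranks < 1:
        raise ValueError("time_s must be positive and n_ranks at least 1")
    if collective not in _BUSBW_FACTORS:
        raise ValueError(f"unknown collective: {collective}")
    # Algorithm bandwidth: bytes moved per second from the caller's point of view.
    algbw = size_bytes / time_s / 1e9
    # Bus bandwidth rescales algbw so it is comparable to hardware link speed.
    return algbw * _BUSBW_FACTORS[collective](n_ranks)


if __name__ == "__main__":
    # Hypothetical example: an 8 GB all-reduce across 16 ranks taking 40 ms.
    print(bus_bandwidth_gb_s(8e9, 0.04, 16))
```

Because the bus bandwidth is normalized by rank count, it can be compared across cluster sizes and against the nominal link speed, which is why it is the number most people quote from nccl-tests.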

These requirements may be considered basic, but are not to be taken for granted. Later in the article we will describe how difficult it is for other providers to install Slurm-on-Kubernetes (via Soperator or Slinky) with good defaults. Notably, since Nebius uses its own open-source project, Soperator, they completely control the roadmap and are vertically integrated from customer support issues in an sbatch script, down to the Kubernetes orchestration, hardware, and datacenter troubleshooting layers.

The control plane also looks nice:

Source: SemiAnalysis Nebius cluster

Virtualization and Storage

Unlike CoreWeave’s bare-metal-first strategy, Nebius has built its platform on layers of Kubernetes clusters managing each other, using KubeVirt all the way down. This means that customer workloads, even for full 8-GPU nodes, run inside virtual machines on a Kubernetes cluster that they can access, which itself is managed by a Kubernetes cluster that only Nebius can access. This design is similar to how GCP orchestrates compute. The architecture allows Nebius to leverage the benefits of VMs, such as rapid provisioning and advanced storage features. For example, they use virtio-fs to attach a massive shared root filesystem, which presented as 197TB out of the box in our cluster, mounted at “/”, but obviously does not require 197TB of drives to be physically installed in the servers themselves.

Source: SemiAnalysis Nebius cluster

Nebius’s storage solution is built on YDB (https://ydb.tech/docs/en/contributor/distributed-storage), which underpins their block, shared filesystem, and storage offerings. This approach to storage has reduced startup times (i.e. pulling a container image or model file) for new machines in some example on-demand autoscaling workloads from over 10 minutes to around 2-3 minutes.

A common perception is that bare metal offers superior performance over VMs. When we raised this, the Nebius team was adamant that users should simply benchmark the platform. Their position is that because there is no virtualization layer for the InfiniBand fabric or the Nvidia GPUs themselves, performance should be identical to bare metal. Our initial testing seems to validate this claim, and third-party results do too. For example, in the last round of MLPerf Inference v5.1 benchmarks, Nebius achieved top-tier performance on Nvidia GB200 NVL72, HGX B200 and HGX H200 systems, for inference with Llama models.

https://nebius.com/blog/posts/bare-metal-class-performance-mlperf-inference

An additional concern that customers have raised with VMs is whether low-level hardware counters can easily be enabled for performance monitoring. The method for enabling hardware counters varies by virtualization platform.

In VMware vSphere, you can enable virtualized CPU performance counters by editing the VM’s settings. This feature, known as vPMC, allows the guest OS to access the host’s Performance Monitoring Unit (PMU). Meanwhile on Windows Server and Windows 10/11 with Hyper-V, you can use PowerShell cmdlets like Set-VMProcessor to enable specific performance monitoring hardware features (e.g., pmu, pebs, lbr) for a stopped virtual machine.

However, on KubeVirt (which Nebius uses) via KVM/QEMU, the VMs inherit the capability to expose hardware counters from the underlying host. The process typically involves configuring the VM’s CRD on the underlying Kubernetes cluster to enable the virtual PMU on Intel hosts. Hardware-level performance data such as CPU cycle counts, cache misses, and branch mispredictions is then available through it. You can typically enable this capability by activating a power metrics or PMU plugin, such as a Telegraf plugin, on the Kubernetes cluster. Beyond counters, some users also perform advanced performance tuning by using features like the Kubernetes CPU manager to pin vCPUs to host pCPUs for predictable latency on CPU-heavy workloads.
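From inside a Linux guest, a quick way to see whether a hardware PMU has actually been exposed is to look at what the kernel registered under sysfs. The sketch below is a heuristic under our own assumptions: it targets x86 guests, where the core PMU usually appears as `cpu` (or `cpu_core`/`cpu_atom` on hybrid Intel parts), and the function name and return shape are hypothetical. It is not a Nebius or KubeVirt tool.

```python
# Minimal sketch: check whether a hardware PMU appears to be visible
# inside a Linux guest. Heuristic only; assumes an x86 guest.
from pathlib import Path


def guest_pmu_status(sysfs_root="/sys", procfs_root="/proc"):
    """Return the hardware PMU devices found and the perf_event_paranoid level."""
    devices_dir = Path(sysfs_root) / "bus" / "event_source" / "devices"
    names = sorted(p.name for p in devices_dir.iterdir()) if devices_dir.is_dir() else []
    # Software-only guests still list entries such as "software" and "tracepoint",
    # so we look specifically for the core CPU PMU.
    hardware_pmus = [n for n in names if n == "cpu" or n.startswith("cpu_")]

    # perf_event_paranoid controls how much unprivileged users may measure.
    paranoid_file = Path(procfs_root) / "sys" / "kernel" / "perf_event_paranoid"
    paranoid = int(paranoid_file.read_text().strip()) if paranoid_file.is_file() else None

    return {"hardware_pmus": hardware_pmus, "perf_event_paranoid": paranoid}


if __name__ == "__main__":
    print(guest_pmu_status())
```

If `hardware_pmus` comes back empty, tools such as `perf` will generally fall back to software events only, which is a sign the virtual PMU was not enabled for that VM.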

Notably, for very large customers who insist on it and are doing a long-term rental, Nebius does have an option to provide bare metal clusters.

Monitoring and Health Checks

Initially, our access to monitoring was a simple Grafana dashboard available via an SSH port forward. These metrics and health checks were basic, but the team later released a series of updates that raised the bar significantly.

Interestingly, since all of this is being integrated into Soperator, and Soperator is open source, we have been able to watch the roadmap come to life on the Soperator GitHub project: https://github.com/orgs/nebius/projects/1

Source: Nebius Soperator on GitHub

There is no other Neocloud as open and transparent with their development as Nebius.

Around the same time we got notified about the improvements to the dashboard and health checks, Nebius also released a blog post describing what they do for reliability in detail: on-site factory tests, node deployment tests, virtual platform tests, pre-provisioning cluster tests, passive and active health checks. We believe this suite of burn-in tests, checks, and monitoring dashboards will improve cluster reliability and usability, especially as Nebius moves to adopting the GB300 NVL72 rack-scale systems at scale, for customers such as Microsoft.

Source: Nebius

Source: Nebius https://nebius.com/blog/posts/how-we-build-reliable-clusters

In the blog, Nebius describes a hypothetical example with 13 failures over 336 hours (14 days) on a 1,024 GPU cluster, resulting in a GPU-level MTBF of roughly 26,466 GPU-hr (1,024 × 336 / 13), or about 1,103 GPU-days.

An apt comparison is the data from Meta’s Llama 3 paper, which reports 419 failures over 1,296 hours (54 days) on a 16K (16,384) GPU cluster, resulting in a GPU-level MTBF of 50,677 GPU-hr, or 2,111 GPU-days.

As a further comparison, we have heard from customers running clusters of similar scale (1k to 2k GPUs) at gold and silver tier providers that they experience 5 or more failures per day for extended periods of time. This translates to a GPU-level MTBF of less than 10,000 GPU-hr, or roughly 400 GPU-days or less.

In our research, the numbers shared by Meta are very high, demonstrating the quality with which Meta runs their datacenters and Hopper generation GPUs. Meanwhile, the hypothetical number from Nebius tracks as a reasonably good customer experience.

Later in the blog, Nebius claims to have had a single 3,000 GPU cluster operate uninterrupted for 169,800 GPU-hours, or 56.6 hours of stable operation. This would translate to an absurdly high GPU-level MTBF of 169,800 GPU-hours, or roughly 7,000 GPU-days. We are generally frustrated by providers who cherry-pick reliability data in this manner.
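The GPU-level MTBF figures in this comparison all come from the same arithmetic: total GPU-hours in the observation window divided by the number of failures. A minimal sketch follows. The function names are ours, and the 16,384 GPU count for the Meta example is an assumption (the "16K" cluster size) chosen because it reproduces the quoted 50,677 GPU-hr.

```python
# Minimal sketch of the GPU-level MTBF arithmetic used in this section.
# Function names are hypothetical; inputs are the figures quoted in the text.


def gpu_mtbf_hours(num_gpus, wall_clock_hours, failures):
    """GPU-level MTBF in GPU-hours: total GPU-hours divided by failure count."""
    if failures <= 0:
        # With zero failures the MTBF is undefined. The observed GPU-hours
        # are only a lower bound, not a measurement.
        raise ValueError("need at least one failure to estimate MTBF")
    return num_gpus * wall_clock_hours / failures


def gpu_mtbf_days(num_gpus, wall_clock_hours, failures):
    """Same quantity expressed in GPU-days."""
    return gpu_mtbf_hours(num_gpus, wall_clock_hours, failures) / 24


if __name__ == "__main__":
    # Nebius hypothetical: 13 failures over 336 hours on 1,024 GPUs.
    print(gpu_mtbf_hours(1024, 336, 13))
    # Meta Llama 3: 419 failures over 1,296 hours, assuming 16,384 GPUs.
    print(gpu_mtbf_hours(16384, 1296, 419))
    # Customer anecdote: 5 failures per day on a 2,000 GPU cluster.
    print(gpu_mtbf_hours(2000, 24, 5))
```

Raising an error on zero failures is a deliberate choice: a short failure-free window, such as the 56.6-hour run above, tells you only that the MTBF is at least the GPU-hours observed, which is why such windows make poor headline numbers.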

We encourage customers to track this reliability data for themselves, especially if DataDog, New Relic, Splunk, or a custom Prometheus Alertmanager is set up and connected to a Slack channel for notifications on XID-related errors. If you are tracking this data, and are willing to contribute it to anonymized and aggregated research, please get in touch: [email protected]
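For teams without a full observability stack, even a small script over kernel logs gives a usable failure count. The sketch below tallies Nvidia Xid events per GPU from log lines, assuming the usual `NVRM: Xid (PCI:...): <code>` prefix that the driver writes to the kernel log. The function name and the sample lines are synthetic and for illustration only.

```python
# Minimal sketch: count Nvidia Xid events per (PCI address, Xid code)
# from kernel log lines. Sample lines below are synthetic.
import re
from collections import Counter

# Matches the prefix the Nvidia driver writes, e.g. "NVRM: Xid (PCI:0000:3b:00): 79"
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def count_xid_events(log_lines):
    """Return a Counter keyed by (pci_address, xid_code)."""
    counts = Counter()
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[  100.0] NVRM: Xid (PCI:0000:3b:00): 79, synthetic message text",
        "[  200.0] NVRM: Xid (PCI:0000:3b:00): 79, synthetic message text",
        "[  300.0] NVRM: Xid (PCI:0000:86:00): 48, synthetic message text",
        "[  400.0] unrelated kernel message",
    ]
    for (pci, code), n in sorted(count_xid_events(sample).items()):
        print(pci, code, n)
```

Feeding it the output of `dmesg` or `journalctl -k` per node, and dividing GPU-hours by the number of job-interrupting events, is enough to produce an MTBF figure comparable to the ones discussed above.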

Source: Nebius https://nebius.com/blog/posts/how-we-build-reliable-clusters

It is clear that Nebius is building battle scars when it comes to managed Slurm clusters at the 1k+ GPU scale.

On-Demand Instances

A key differentiator for Nebius is their robust support for on-demand and autoscaling workloads. This is a direct result of their software-defined architecture. They offer pre-emptible instances, primarily for inference customers, which function similarly to spot instances on hyperscalers. This allows users to access capacity at a lower cost, with the understanding that the workload can be interrupted.
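Workloads on pre-emptible capacity need to tolerate interruption, which in practice means checkpointing when a shutdown notice arrives. The sketch below shows one common pattern, assuming the interruption is delivered to the process as SIGTERM (as Kubernetes does when terminating a pod). We have not confirmed how Nebius signals pre-emption, and all names here are hypothetical.

```python
# Minimal sketch: stop cleanly and checkpoint when a termination signal arrives.
# Assumes pre-emption is delivered as SIGTERM; must run in the main thread.
import signal


class PreemptionGuard:
    """Records that a termination signal was received."""

    def __init__(self, signals=(signal.SIGTERM,)):
        self.stop_requested = False
        for sig in signals:
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the main loop does the actual checkpointing.
        self.stop_requested = True


def run_steps(total_steps, guard, work, save_checkpoint):
    """Run work(step) until done or until a stop is requested.

    Returns the index of the next step to run, so a restart can resume there.
    """
    for step in range(total_steps):
        if guard.stop_requested:
            save_checkpoint(step)
            return step
        work(step)
    return total_steps


if __name__ == "__main__":
    guard = PreemptionGuard()
    next_step = run_steps(
        total_steps=5,
        guard=guard,
        work=lambda step: print("running step", step),
        save_checkpoint=lambda step: print("checkpoint before step", step),
    )
    print("next step to run:", next_step)
```

Keeping the signal handler to a single flag assignment avoids doing I/O inside the handler; the loop checks the flag between steps, so the checkpoint is always taken at a consistent point.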

We’ve seen public examples of this in action, such as Shopify’s work with SkyPilot and dstack sky on the Nebius platform, which highlights their strength in supporting dynamic, research-oriented workloads. This flexibility is a significant advantage for users who cannot commit to long-term contracts, and seems to be a major source of inbound customer qualification for Nebius.

TCO

Nebius presents a compelling and technologically distinct alternative in the GPU cloud market. Their deep investment in a Kubernetes-native, virtualized stack using KubeVirt and Soperator allows them to offer a degree of flexibility and on-demand access that is rare in the high-performance training space. While they may face headwinds in datacenter acquisition, their software stack is mature and performant.

Our feedback to Nebius is to continue improving their monitoring and health-check visibility as they roll these updates out to customers, and to streamline the notification process for all tiers of users. Their ability to deliver bare-metal-equivalent performance through a VM-based architecture is a significant engineering achievement. For users whose needs revolve around research, autoscaling inference, and workloads that can benefit from a spot-like pre-emptible market, Nebius is an excellent choice that challenges the long-term reservation model of its competitors.

Notably, some customers have been fixated on Nebius’ Russian roots, despite the fact that all of their staff are based outside of Russia, as opposed to making purchasing decisions on technical merits. We’re not sure how Nebius can address customers that hold this mindset.
