Our experience with the world’s biggest cloud has been full of headaches. AWS offers SageMaker HyperPod Slurm and SageMaker HyperPod EKS (Kubernetes). We started with Slurm. Interestingly, AWS and OpenAI signed a multi-year deal, worth $38B over 7 years, for OpenAI to run core AI workloads on AWS EC2 UltraServers with NVIDIA GB200/GB300, yet the announcement made no mention of EFA or HyperPod/Slurm.

Our initial setup process followed the primary documentation path for creating a Slurm cluster through the SageMaker console. This path proved to be a dead end. The only successful method for provisioning a functional cluster was to abandon the standard documentation and instead use a CloudFormation stack from an official AWS workshop at http://catalog.workshops.aws/sagemaker-hyperpod . This approach pre-provisions the entire required infrastructure stack, including the VPC, IAM roles, S3 bucket, and FSx for Lustre file system, before attempting to create the cluster itself. Effectively, the default console setup does not correctly configure the necessary dependencies.

With that said, getting the CloudFormation scripts to work correctly requires navigating multiple documents to correct IAM policies (AmazonSageMakerClusterInstanceRolePolicy), request quotas of all sorts, and upload and run lifecycle scripts. Notably, these scripts are incredibly brittle and are buried five directories deep in an unrelated GitHub repository: https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py

Source: AWS

Source: Requesting and approving quota for ourselves on the AWS console

The scripts have to be manually downloaded from the git repo, uploaded to an S3 bucket, and then added at the fourth step of configuring a cluster. When creating a VPC, IAM roles, S3 Bucket, and Lustre FSx, if you miss a step or need to upload a script to a different path, you have to restart the provisioning process.

On our first try, we didn’t define enough controller nodes to handle our 4-node ml.p5en.48xlarge (H200) cluster. On our second try, 1 of the 4 nodes in the cluster didn’t mount the FSx for Lustre file system properly due to a race condition, and the whole cluster rolled back. On the third try, the instance size requested for the controller node had been exhausted in the region/AZ, so we needed to roll back and try again. Finally, on the fourth try, with a specific controller VM size (c5.xlarge instead of m5.4xlarge) and adding exactly one node at a time, we were able to provision the cluster properly. The provisioning process for a single cluster can take about two hours, as each node can take upwards of 30 minutes to deploy (if capacity is available).
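The one-node-at-a-time workaround can be automated. The following is a minimal sketch, not AWS’s tooling: `set_node_count` and `healthy_node_count` are hypothetical callables that you would wire to whatever you use to resize the instance group and to count nodes that have joined the cluster, and the 45-minute timeout is an assumption based on the roughly 30 minutes per node we observed.

```python
import time
from typing import Callable


def scale_up_one_at_a_time(
    target_nodes: int,
    set_node_count: Callable[[int], None],   # hypothetical: requests a new group size
    healthy_node_count: Callable[[], int],   # hypothetical: nodes that are up and mounted
    poll_seconds: float = 60.0,
    timeout_seconds: float = 45 * 60,        # assumed budget per node
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Grow a cluster one node at a time, waiting for each node before adding the next."""
    start = healthy_node_count()
    for desired in range(start + 1, target_nodes + 1):
        set_node_count(desired)
        waited = 0.0
        # Do not request the next node until this one is healthy, so that
        # concurrent node bring-up cannot race on the shared file system mount.
        while healthy_node_count() < desired:
            if waited >= timeout_seconds:
                raise TimeoutError(f"node {desired} not healthy after {waited:.0f}s")
            sleep(poll_seconds)
            waited += poll_seconds
    return healthy_node_count()
```

At roughly 30 minutes per node, four sequential additions account for the two hours of provisioning time described above.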

In total, we worked on provisioning this cluster for 14 straight hours, with intermittent calls from five different AWS engineers across various time zones. Notably, the FSx for Lustre race condition that forces adding one node at a time has been known to AWS engineers for over a year and has not been fixed. During our research, we spoke to three separate AWS customers who confirmed they have experienced the exact same issues when setting up a HyperPod Slurm cluster.

In addition, the standard, documented path for getting started with a single GPU instance does not actually produce a working GPU instance. Following the console guide results in a GPU instance provisioned without any Nvidia drivers installed and with a default root volume size of 8GB, which is insufficient to even install the required drivers manually. We believe this is a primary reason why various marketplaces reselling GPU compute in AWS datacenters, such as lightning.ai and Qubrid, have been able to maintain a business: the AWS UI is just so hard to use.

On the HyperPod cluster, AWS (like other hyperscalers) removes public IPs in favor of a proprietary SSH wrapper script, easy_ssh.sh: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-run-jobs-slurm-access-nodes.html. Unfortunately, easy_ssh.sh is not easy. It relies on AWS SSM for access and requires an access token to be retrieved from the AWS console, since tokens are cycled every 24 hours by default. This wastes time and is annoying, to say nothing of the process of managing users with add_users.sh or plugging the cluster into an IAM provider.
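For readers who have not used SSM-based access, the sketch below shows roughly what the wrapper has to assemble. It is a sketch under assumptions: the `sagemaker-cluster:<cluster-id>_<instance-group>-<instance-id>` target format reflects our reading of the HyperPod access documentation linked above, the IDs are placeholders, and the helper names are ours. `AWS-StartSSHSession` with `portNumber=%p` is the standard SSM document for proxying SSH.

```python
def ssm_target(cluster_id: str, instance_group: str, instance_id: str) -> str:
    """Build the SSM target string for a HyperPod node (format assumed from the docs)."""
    return f"sagemaker-cluster:{cluster_id}_{instance_group}-{instance_id}"


def proxy_command(target: str) -> str:
    """ProxyCommand value for ~/.ssh/config that tunnels SSH through an SSM session."""
    return (
        f"aws ssm start-session --target {target} "
        "--document-name AWS-StartSSHSession --parameters portNumber=%p"
    )


if __name__ == "__main__":
    # Placeholder identifiers, not real resources.
    target = ssm_target("abc123example", "controller-machine", "i-0123456789abcdef0")
    print(proxy_command(target))
```

Every connection still depends on valid AWS credentials on the client side, which is where the 24-hour token cycle bites.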

Uniquely, AWS is the only cloud where account managers pestered one of our team members relentlessly for payment on a capacity block that they had provided us directly for our testing. While this did get rectified, the experience speaks to the fact that the AWS organization is a behemoth, and customers need to push hard to get the left hand to speak to the right.

Beyond our direct testing, independent feedback from multiple AWS users deploying hundreds of GPUs highlights additional issues: the need for a /16 CIDR to avoid IPv4 exhaustion (since 81 IPs are consumed per GPU instance), and a lack of IPv6 support on EKS. Other recurring footguns include HyperPod not using existing reservations automatically, which is another potential cause of cluster recreation, and a different (but similarly frustrating) need to add nodes incrementally, in this case to avoid EFA errors.
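The /16 recommendation follows from simple arithmetic. The sketch below assumes a single subnet serving only GPU instances, uses the 81 IPs per instance figure reported to us, and subtracts the five addresses AWS reserves in every subnet. The CIDR ranges are illustrative.

```python
import ipaddress

IPS_PER_GPU_INSTANCE = 81    # figure reported by users, as cited above
AWS_RESERVED_PER_SUBNET = 5  # AWS reserves the first four and the last address


def max_gpu_instances(cidr: str) -> int:
    """Upper bound on GPU instances that fit in one subnet of the given size."""
    subnet = ipaddress.ip_network(cidr)
    usable = subnet.num_addresses - AWS_RESERVED_PER_SUBNET
    return max(usable, 0) // IPS_PER_GPU_INSTANCE


if __name__ == "__main__":
    for cidr in ("10.0.0.0/24", "10.0.0.0/20", "10.0.0.0/16"):
        print(cidr, max_gpu_instances(cidr))
```

Under these assumptions a /24 fits 3 instances, a /20 fits 50, and a /16 fits 809. Real deployments split the range across subnets and share it with controllers, storage endpoints, and other services, so the practical ceiling is lower.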

On health checks, AWS does have a relatively comprehensive approach compared to other hyperscalers. However, deep health checks can be excessively long (60-120 minutes) and are best disabled for faster scaling. Unfortunately, monitoring dashboards for Slurm or Kubernetes cluster health, performance, and job stats are basically non-existent beyond standard, manual, open source tooling.

Source: Deep Health Checks on AWS https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-resiliency-deep-health-checks.html

Finally, on networking, not much has changed since our previous article. AWS remains steadfast in its commitment to EFA for all H200, B200, B300, GB200 NVL72, GB300 NVL72, and even future VR300 rack-scale architectures, even though customers that see superior performance from InfiniBand and high-end RoCEv2 deployments generally dislike EFA performance and debugging. AWS is going so far as to design future architectures that run PCIe connections between the compute trays and a separate “JBOK” (Just a Bunch of NICs) rack full of custom K2V6 EFA NICs.

On GB200, their p6e platform uses NVL36x2 and runs into the same NVLink unreliability troubles as GCP, where the cross-rack NVLink ACC cables cause major issues.

Source: AWS

AWS markets this disaggregated design as a strategic choice for resiliency, claiming it enables N+1 NIC redundancy and improves the mean time between failures (MTBF) of sensitive optics by moving them to a cooler, dedicated tray. However, the engineering reality suggests this move is less a choice and more a necessity driven by the thermal and spatial constraints of fitting multiple power-hungry K2V6 NICs inside a dense 1U compute sled. This architecture introduces non-trivial latency from PCIe Active Electrical Cable (AEC) retimers and feels like a complex workaround to us.

However, this JBOK design also enables a long-overdue shift to a rail-optimized network topology, which is critical for the performance of MoE models heavy on All-to-All collectives. But this obsession with reliability at the component level leads to a shockingly inefficient operational model at the system level. An entire GB200 rack (or logical rack, as AWS is going for NVL36x2, just like Google) is treated as a single failure domain called an “UltraServer.” This means a single faulty compute sled requires draining workloads from all 18 nodes in the rack before any repair can be attempted. This is a stark contrast to the hot-swappable serviceability customers expect and receive from other GB200 NVL72 rack-scale providers. In the worst case, this policy has brutal TCO implications, as it demands entire “spare” racks to maintain capacity SLAs, a cost inevitably passed on to the customer via weak SLA penalties or higher prices.
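To put a rough number on the failure-domain difference, here is a back-of-the-envelope sketch. The 18 nodes per rack comes from the text above. The 10-rack fleet is a hypothetical size chosen for illustration, not a figure from AWS.

```python
from fractions import Fraction

NODES_PER_RACK = 18  # nodes drained together under the rack-level policy


def offline_fraction(total_racks: int, nodes_drained: int) -> Fraction:
    """Share of fleet capacity taken offline while one faulty sled is repaired."""
    return Fraction(nodes_drained, total_racks * NODES_PER_RACK)


if __name__ == "__main__":
    fleet_racks = 10  # hypothetical fleet size
    rack_level = offline_fraction(fleet_racks, NODES_PER_RACK)  # drain the whole rack
    sled_level = offline_fraction(fleet_racks, 1)               # hot-swap a single node
    print(f"rack-level drain: {float(rack_level):.1%} of capacity offline")
    print(f"sled-level swap:  {float(sled_level):.2%} of capacity offline")
```

For this hypothetical fleet, one bad sled removes a tenth of capacity under the rack-level policy versus 1/180 with per-sled servicing, an 18x difference regardless of fleet size.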

For users of EFA, debugging is also incredibly challenging. First, in a traditional HPC environment using InfiniBand or RoCEv2 (Converged Ethernet), engineers have a standard toolkit for direct testing of the physical layer: ib_write_bw, ibping, ibv_devinfo, and ibdiagnet. With EFA, however, your access ends at the EFA driver on the host. Second, since NCCL does not communicate with EFA directly, there are multiple layers of abstraction to contend with. The communication path is a complex chain of software shims:

NCCL → aws-ofi-nccl Plugin → Libfabric API → EFA Libfabric Provider → Custom ibverbs provider in RDMA Core Library → EFA Kernel Driver → AWS Hardware

When an NCCL collective (like an AllReduce) hangs or performs poorly, the error message is often generic, like a timeout or a provider error. Pinpointing the source of the problem is a nightmare: is it a bug in NCCL itself? Is it an incompatibility or bug in the aws-ofi-nccl plugin? Is Libfabric misconfigured or hitting a corner case? Is the EFA provider encountering an issue with the SRD protocol (e.g., congestion, retransmissions)? Is there a physical hardware problem on the NICs, switches or cables? Without deep introspection tools for each of these layers, debugging becomes a process of managing support tickets with AWS.
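What a user can do from the host is turn up logging on the upper layers and confirm the lower ones are present. The sketch below collects the standard knobs we are aware of: `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` for NCCL, `FI_LOG_LEVEL` and `FI_PROVIDER` for Libfabric, and the `fi_info`, `ibv_devinfo`, and `modinfo` commands for the provider, verbs device, and kernel driver. The function and constant names are ours, and none of this gives visibility past the driver, which is the core complaint.

```python
import os
from typing import Mapping, Optional

# Host-side commands, ordered from the Libfabric provider down to the kernel driver.
HOST_CHECKS = [
    ["fi_info", "-p", "efa"],  # is the EFA Libfabric provider discoverable?
    ["ibv_devinfo"],           # does rdma-core see an EFA verbs device?
    ["modinfo", "efa"],        # is the EFA kernel module installed, and which version?
]


def debug_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with verbose NCCL and Libfabric logging enabled."""
    env = dict(os.environ if base is None else base)
    env.update(
        {
            "NCCL_DEBUG": "INFO",             # NCCL-level logs, including which net plugin loaded
            "NCCL_DEBUG_SUBSYS": "INIT,NET",  # limit output to init and network subsystems
            "FI_LOG_LEVEL": "warn",           # Libfabric and provider warnings
            "FI_PROVIDER": "efa",             # fail loudly instead of silently using another provider
        }
    )
    return env


if __name__ == "__main__":
    for name in ("NCCL_DEBUG", "NCCL_DEBUG_SUBSYS", "FI_LOG_LEVEL", "FI_PROVIDER"):
        print(f"{name}={debug_env()[name]}")
    for command in HOST_CHECKS:
        print(" ".join(command))
```

Pass the returned dictionary as the `env` of the launcher process (for example via `subprocess.run`) and run the host checks on each node. Anything these do not explain still ends in a support ticket.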

Third is the case of “gray failures”, where job performance degrades for inexplicable reasons. Is it congestion from other jobs on our cluster? Sub-optimal routing policies? A noisy neighbor tenant on the same global fabric? Multi-tenancy is always difficult to handle in networking, and a backend interconnect for GPU clusters is no different.

Finally, the same usability issues with cluster setup can impact the networking experience too. Security Groups, IAM Permissions, and Cluster Placement Groups all need to be handled correctly to ensure a given user is getting proper performance. Many small things added together result in a big challenge for administrators.

In general, we try to represent the customer experience, and customers repeatedly tell us that EFA does not perform well at scale. But AWS doesn’t care. They are not kowtowing to Nvidia and adopting CX-7 or CX-8 NICs. They have already sunk enough time and energy into this EFA NIC, and they’re going to make it work and save that 0.8% of TCO, dammit.

Overall, Amazon SageMaker HyperPod is surprisingly difficult to use, especially considering that AWS is the leader in the cloud industry and brands itself on customer obsession. Official AWS documentation is hard to follow or incorrect, and the underlying platform suffers from issues of usability and performance at scale. For teams considering HyperPod, we recommend budgeting for significant engineering effort focused on cluster maintenance, including time to build custom automation that can work around AWS’s unique limitations.
