Lambda is another gold-tier candidate that unfortunately comes in at #1 in the wrong category: customer complaints. Lambda started out in 2012 by building facial recognition software, then pivoted to reselling SuperMicro GPU workstations, servers, and eventually on-premises clusters. Today, they appear to be 100% focused on their “Superintelligence Cloud” and working to shed their legacy on-prem server and workstation business. Their recent announcement with Microsoft suggests they will be providing multiple billions of dollars' worth of capacity across tens of thousands of Nvidia GPUs.
A recurring theme when we talk to users is that Lambda unfortunately seems to be trying to do everything for everyone. While the company has deep experience in building dedicated HPC clusters, this has not yet translated into a polished, user-friendly cloud console or cluster monitoring experience. Their product lineup feels conflicted, spanning new-mslurm, old-mslurm, new-mk8s, old-mk8s, private cloud, 1-Click Cluster, and on-demand Instances.
Notably, 1-Click Clusters aren’t really one click, as you need to wait for approval. It’s more of a 1-Click-if-approved-and-paid-for-then-you-can-have-it-Cluster.
For users that want on-demand machines instantly, Lambda is generally considered to be the top-tier on-demand provider, with the largest fleet of GPUs available. However, in our recent experience, Lambda is in fact suffering from success in on-demand. We are generally met with greyed-out screens showing that capacity is sold out:
Source: Lambda Labs: trying to get an on-demand GPU instance from Lambda
Also, for a hot minute, Lambda appeared to be getting into the serverless inference API endpoint business, which would put them in direct competition with some of their largest customers. But that is no longer the case.
Overall, we like the focus. Lambda has pivoted and is now concentrating on its 1-Click-Cluster (1CC) business and on “big game hunting”.
During our testing, we evaluated both their new (self-managed) and old (rancher-based) Kubernetes offerings, and their newly available slurm offering. None of these is UI or CLI driven; each requires a Lambda engineer to set up the cluster for you.
Lambda’s Kubernetes product feels like an early-stage offering, marked by technical debt and a challenging user experience. While the current product does not use Rancher, the public documentation still references it, causing initial confusion. The user experience for inference workloads is particularly lacking. Clusters do not come with a default public IP solution (like MetalLB or an external LoadBalancer). Setting up public-facing inference services is complex and not well-documented, requiring significant manual configuration. This reflects a platform that is developed to target training workloads, not inference. While documentation exists for a simple, single-GPU vLLM deployment, there are no examples for multi-GPU, multi-node, or auto-scaling inference workloads.
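To make the missing piece concrete, here is a minimal sketch of the kind of Service manifest a user would need to expose an inference server publicly. It is not taken from Lambda's documentation. The service name `vllm-server`, the `app: vllm` label, and port 8000 are illustrative assumptions. On a cluster without MetalLB or another load balancer controller, a Service of type `LoadBalancer` like this one is accepted but its external IP stays pending.

```python
import json


def loadbalancer_service(name, app_label, port=80, target_port=8000):
    """Build a Kubernetes Service manifest of type LoadBalancer.

    All names and ports are hypothetical placeholders, not Lambda defaults.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            # Needs a load balancer implementation (for example MetalLB)
            # in the cluster; otherwise no external IP is ever assigned.
            "type": "LoadBalancer",
            # Routes traffic to pods carrying this label.
            "selector": {"app": app_label},
            "ports": [
                {"protocol": "TCP", "port": port, "targetPort": target_port}
            ],
        },
    }


# kubectl accepts JSON as well as YAML, so this output can be saved to a file
# and applied with `kubectl apply -f`.
manifest = loadbalancer_service("vllm-server", "vllm")
print(json.dumps(manifest, indent=2))
```

The manifest itself is short. The manual work described above lies in installing and configuring whatever component actually hands out the public IP.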
For monitoring, Lambda uses a mix of open-source tools, including LeptonAI’s gpud for GPU device management and node-problem-detector for health checks, but these are not seamlessly integrated into the monitoring dashboards for either the new or old mk8s products. Dashboards are easy to access, but they do not show these metrics unless an agent is installed, and that agent is undocumented and, upon further inspection, still in development.
For slurm, Lambda’s offering is a more recent addition, and the onboarding process was fraught with issues. The initial setup process was cumbersome: ssh keys were not correctly provisioned on the cluster, and the home directory was not shared across nodes by default, so data had to be moved manually. New user account creation is a headache, requiring workarounds like unsetting environment variables (XDG_DATA_HOME) to function correctly.
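As an illustration of the environment-variable workaround, the sketch below runs a command with XDG_DATA_HOME removed from its environment. The helper names are hypothetical, and the command passed in is a placeholder: we are not reproducing the exact account-creation step that needed the workaround on our cluster.

```python
import os
import subprocess
import sys


def env_without(var, base=None):
    """Return a copy of the environment with one variable removed."""
    env = dict(os.environ if base is None else base)
    env.pop(var, None)  # no error if the variable was never set
    return env


def run_without_xdg_data_home(cmd):
    """Run cmd (a list of arguments) with XDG_DATA_HOME unset."""
    return subprocess.run(
        cmd,
        env=env_without("XDG_DATA_HOME"),
        capture_output=True,
        text=True,
        check=True,
    )


# Placeholder command: a child Python process reports whether it can still
# see the variable.
result = run_without_xdg_data_home(
    [sys.executable, "-c", "import os; print('XDG_DATA_HOME' in os.environ)"]
)
print(result.stdout.strip())
```

In an interactive shell the equivalent is simply `unset XDG_DATA_HOME` before running the affected command.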
To their credit, once these initial hurdles were overcome, the cluster’s performance was strong. We observed expected allreduce, allgather and alltoall bandwidth on nccl-tests and were able to achieve full MFU on an example torchtitan training workload. Lambda also provides some useful, albeit hard-to-find, tooling. For example, a welcome message (which was invisible in some SSH clients like Cursor or VSCode) contained custom instructions for a grafana-access command to quickly view performance metrics.
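For readers less familiar with these metrics, the sketch below shows how they are commonly computed. The bus bandwidth factors follow the convention documented for nccl-tests. The MFU function uses the common approximation of 6 FLOPs per parameter per token for a dense transformer, which is our assumption here and not necessarily the exact accounting torchtitan reports. The function names and the sample numbers are illustrative, not our measured results.

```python
def bus_bandwidth(alg_bw, n_ranks, collective):
    """Convert algorithm bandwidth to bus bandwidth, nccl-tests style.

    alg_bw is message size divided by elapsed time, in any unit (e.g. GB/s).
    """
    factors = {
        "allreduce": 2 * (n_ranks - 1) / n_ranks,
        "allgather": (n_ranks - 1) / n_ranks,
        "alltoall": (n_ranks - 1) / n_ranks,
    }
    return alg_bw * factors[collective]


def mfu(n_params, tokens_per_sec, n_gpus, peak_flops_per_gpu):
    """Model FLOPs utilization under the 6 * params * tokens approximation.

    Ignores attention FLOPs; peak_flops_per_gpu must match the precision
    actually used for training.
    """
    achieved_flops = 6 * n_params * tokens_per_sec
    return achieved_flops / (n_gpus * peak_flops_per_gpu)


# Illustrative inputs only.
print(bus_bandwidth(100.0, 8, "allreduce"))
print(mfu(8e9, 1000.0, 1, 1e15))
```

Bus bandwidth is the figure to compare against hardware link speeds, since it normalizes out the number of ranks taking part in the collective.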
Lambda’s approach to reliability on the slurm cluster included a custom dcgm-status script, which can be run on-demand:
The script is also scheduled to run on a regular cadence in a low-priority, “preemptible” partition:
Source: our Lambda test cluster
Source: our Lambda test cluster
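We do not have the contents of Lambda's dcgm-status script, so the following is only a sketch of the general pattern: submitting a GPU diagnostic as a slurm job to a low-priority partition so it runs when a node is otherwise idle. It uses `dcgmi diag` as a stand-in for the custom script. The function name, job name, and node name are hypothetical.

```python
import shlex


def build_health_check_cmd(node, partition="preemptible", level=1):
    """Build an sbatch command that runs a DCGM diagnostic on one node.

    This is a stand-in for Lambda's custom script, not a copy of it.
    """
    diag = f"dcgmi diag -r {level}"  # level 1 is the quick diagnostic
    return [
        "sbatch",
        f"--partition={partition}",  # low-priority partition
        f"--nodelist={node}",        # pin the check to a specific node
        "--nodes=1",
        "--job-name=dcgm-health",    # hypothetical job name
        f"--wrap={diag}",            # wrap the command in a batch script
    ]


# Print the command line; a cron job or slurm reservation could submit it
# on a regular cadence.
cmd = build_health_check_cmd("node001")
print(shlex.join(cmd))
```

Running the check in a preemptible partition means user jobs can displace it, so the health check does not take capacity away from paying workloads.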
We were impressed by Lambda’s commitment to developing comprehensive active and passive health checks, and believe that they are well on their way to addressing their reliability challenges and building the battle scars necessary to run NVL72 rack-scale systems at scale.
With that said, some of the access issues we encountered point to broader operational challenges at Lambda. Their cloud console (though not our cluster) experienced outages during our brief testing window.
Internally, there appears to be a general degree of disorganization. When asked about a true “cloud console” experience, Lambda acknowledged that the team’s background is primarily in traditional HPC cluster deployment, not building scalable, self-service cloud infrastructure. We encourage Lambda to truly focus on the cloud experience going forward as they simplify their portfolio and focus on their mslurm and mk8s offerings.
On the positive side, Lambda is actively working on improving its platform based on our feedback. They have a compliance team addressing SOC 2 Type II requirements for individual sites, and are working to implement both SHARP and InfiniBand security keys for multi-tenant isolation, following recent Nvidia recommendations (and, likely, the onboarding of Nvidia as a customer with a $1.5B contract). Their storage offerings primarily focus on VAST, with future S3-compatible offerings currently in development.
Overall, Lambda is a strong provider with deep hardware expertise, massive capacity, and big plans for the future. However, their public cloud product feels immature, and engaging with the team feels chaotic. We encourage Lambda to continue to work on translating their HPC hardware prowess into a stable, easy-to-use, and reliable cloud service.