Lightning.ai (aka Lightning Cloud) is a broker for GPU machines in neoclouds and hyperscalers that provides useful MLOps features on top. The founding story of Lightning Cloud begins with the development of PyTorch Lightning, an open-source framework that organizes and simplifies boilerplate PyTorch code such as the training loop, logging, checkpointing, and distributed training. The lightning git repo appears to be the main top-of-funnel source of sales leads for Lightning Cloud.
Fast forward to today: with LLMs on the rise, the market has split. Older frameworks like NVIDIA NeMo use Lightning under the hood, while the newer frameworks that we use in our testing, such as torchtitan, verifiers and Megatron-LM, do not. The open-source pytorch-lightning and lightning packages are still growing rapidly:
Source: Lightning.ai, data from pypi
Functionally, the Lightning Cloud product offers a simple way to track who’s using what across multiple clouds. We had a chance to test the Lightning Studio, which provides access to GPUs in a browser (VSCode, Jupyter notebook) or over remote SSH (VSCode, Cursor, Windsurf, etc.). Users can also submit batch jobs and “mmt” (multi-machine training) jobs to individual machines or clusters that they get access to on demand. Our testing of clusters is coming soon.
Source: our lightning.ai homepage
Notably, these multi-GPU studios, batch jobs, and mmt training jobs are restricted to users on a Pro, Teams or Enterprise Custom payment tier. Lightning is the only neocloud we have seen charging a per-seat price, and translating that into GPU-hrs behind the scenes on clusters that they manage for the customer.
Interestingly, there is an easy way to attach/detach GPUs to existing “studios” (i.e. notebooks or remote shells) and auto-sleep them if unused. This means that users only pay for what they use. Lightning also forecasts the wait times associated with spinning up a GPU from a given provider, such as AWS, Google, Lambda, Voltage Park, or Nebius. The worst wait time is for an 8x H200 machine in AWS, estimated at 3hrs. Unfortunately, despite what the website says, there are no GPUs available from NScale.
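To make the pay-for-what-you-use point concrete, here is a minimal sketch of how auto-sleep changes billed GPU-hours. It assumes a simple idle-timeout model, which is our assumption and not a description of Lightning's actual billing logic; the function name, the session durations, and the 30-minute timeout are all hypothetical.

```python
def billed_gpu_hours(sessions, idle_timeout_hours, auto_sleep):
    """Sum billed GPU-hours over a list of (active_hours, idle_hours) sessions.

    Assumed model (hypothetical, not Lightning's documented behavior):
    - Active time is always billed.
    - With auto-sleep, idle time is billed only until the idle timeout fires.
    - Without auto-sleep, all idle time is billed.
    """
    total = 0.0
    for active, idle in sessions:
        total += active
        total += min(idle, idle_timeout_hours) if auto_sleep else idle
    return total


# Hypothetical day: 2h of work then 14h idle, then 3h of work then 5h idle.
sessions = [(2, 14), (3, 5)]

with_sleep = billed_gpu_hours(sessions, idle_timeout_hours=0.5, auto_sleep=True)
without_sleep = billed_gpu_hours(sessions, idle_timeout_hours=0.5, auto_sleep=False)
print(f"billed with auto-sleep: {with_sleep} GPU-hrs")
print(f"billed without auto-sleep: {without_sleep} GPU-hrs")
```

Under these made-up numbers, the GPU is billed only for the five active hours plus two short idle windows, instead of the full 24 hours of wall-clock time.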
Using a VSCode notebook in Lightning.ai
Another piece that jumped out during testing is that notebooks have full CLI access including docker, meaning the notebook is running directly on a VM under the hood. This leaves users with full flexibility in the environment.
Overall, we have our doubts about the utility of remote developer environments where cluster access is abstracted away from users, especially at the high end of the market. The largest buyers of GPU compute do not have a problem spinning up a notebook on Kubernetes with a simple manifest.yaml, or accessing a single machine via srun -N1 --gpus-per-node=8 --pty bash in a Slurm cluster.
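For readers less familiar with the Kubernetes path mentioned above, the following is a rough sketch of what such a "simple manifest" might contain, generated from Python and printed as JSON (which kubectl accepts in place of YAML). The helper name, pod name, and container image are hypothetical placeholders, and the sketch assumes the cluster runs the NVIDIA device plugin so that the nvidia.com/gpu resource is schedulable.

```python
import json


def notebook_pod_manifest(name, image, gpus=8, port=8888):
    """Build a minimal Pod manifest for a GPU-backed Jupyter notebook.

    Assumptions: `image` already has JupyterLab and CUDA installed, and the
    cluster exposes GPUs through the NVIDIA device plugin.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "notebook",
                    "image": image,
                    "command": [
                        "jupyter", "lab",
                        "--ip=0.0.0.0", f"--port={port}", "--no-browser",
                    ],
                    "ports": [{"containerPort": port}],
                    # Request a full 8-GPU node, mirroring the srun example.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical image name; substitute an image from your own registry.
    manifest = notebook_pod_manifest(
        "gpu-notebook", "registry.example.com/jupyter-cuda:latest"
    )
    print(json.dumps(manifest, indent=2))
```

Saving the output to a file and running kubectl apply -f on it, followed by kubectl port-forward to the notebook port, would be one way to reach the notebook from a laptop.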
We find it hard to see a path forward for Lightning Cloud if the industry moves beyond the lightning framework and the GPU marketplace business continues to focus on taking a margin on top of expensive hyperscalers, with no third party compute. As for the ClusterMAX rating system, we look forward to testing Lightning Cloud’s mmt training and Kubernetes offerings in the future. We encourage Lightning to consider building a Slurm offering, adding monitoring dashboards for underlying cluster health that integrate with job logs and performance profiling, adding integration with active/passive health checks on clusters, and offering customization options for high performance storage and networking.