Together is a strong provider with a robust cluster offering for both Slurm and Kubernetes, but reliability issues hold it back from the Gold tier. When comparing offers, users tell us they generally expect a lower price per GPU-hr from Together to justify the trade-off in reliability. Together is among the few providers we hear the most reliability complaints about from users operating clusters of 64 GPUs or more. We expect this is due to their use of a broad mix of datacenter partners, which creates a “roll-of-the-dice” dynamic for performance and stability. Unfortunately, unlike other Silver, Gold, and Platinum tier providers, Together also does not offer 1-week POCs to most customers, which makes it difficult for buyers to know what sort of experience they will have on the cluster before making a multi-million dollar commitment.
These are the reasons why TogetherAI went from a Gold tier provider to a Silver tier provider.
Together’s multi-datacenter strategy seems to be driven by necessity. They have significant compute needs due to their serverless inference endpoint business, which is growing steadily. During our research for this article, we spoke with multiple Neoclouds that claim Together is one of their biggest customers. Competing in the serverless inference endpoint business provides two key benefits: it creates a sales funnel to cross-sell GPU clusters to inference customers, and it allows Together to absorb the cost of idle cluster compute by running inference workloads on it. It also gives Together an opportunity to enjoy the fruits of their kernel team’s labour. TKC is an exceptional feature, and the impact of Tri Dao’s FlashAttention cannot be overstated. During the research for this article, we heard directly from Dan Fu about the TKC roadmap. We suspect that Dan is the only person in the industry with the title of “VP, Kernels”, and for good reason. TKC is consistently impressive, and it helps both customers and Together’s serverless inference endpoint business achieve improved performance and efficiency. Together’s model of offsetting costs from idle compute by running public and private serverless endpoints is now being copied by the likes of Nebius. Why not make some extra money from idle compute?
During testing, we got access to a classic Together Slurm cluster, a TKE Kubernetes cluster, and a soon-to-be-released Instant Cluster in preview. For Slurm, the onboarding process was smooth: create an account on the console, upload SSH keys, and the Together engineering team sends you an onboarding document. One SSH command and the cluster works out of the box. Unfortunately, during testing we noticed that the cluster responded very slowly to terminal commands in a VSCode or Cursor remote SSH session. The standard terminal application was fine, and we could replicate the slowness from multiple locations, leading us to believe it was a problem with their datacenter provider.
The Kubernetes onboarding experience was less polished. Instead of providing a kubeconfig file to download, we were expected to log in and access the cluster via SSH. As mentioned previously, this is atypical for Kubernetes admins and users, who generally prefer to develop code locally and switch contexts on demand. In addition, we found that standard tools like Helm were not installed, and users do not get sudo permissions by default, requiring more manual setup. Together uses Rancher K3s to provide these clusters, which is strange considering how much of their serverless endpoint business runs on Kubernetes. Together has several customers, including Hedra, Cartesia, and Krea, that are successfully running production inference on thousands of GPUs using these managed K8s clusters. However, at this time, Together does not offer horizontal node autoscaling in these clusters. Whatever capacity you commit to is what you get. It is interesting to see the dynamic between the cluster business and the endpoint business in action: users can see it as Together competing against itself, or as providing end users with choice.
Source: Together. Trying to use our TKS cluster
“Instant Clusters” is Together’s newest offering, designed to be fully managed via API, CLI, and a Terraform provider. This product allows users to dynamically provision clusters and add or remove nodes on demand, making it suitable for handling burst capacity and autoscaling. The architecture for Instant Clusters provides strong tenant isolation using a multi-layered approach similar to Nebius. First, a base Kubernetes cluster uses KubeVirt to create dedicated Virtual Machines (VMs) for a customer. Second, these VMs are used to form an isolated Kubernetes cluster dedicated to that customer. Third, Slurm is installed into the customer’s dedicated K8s cluster using slurm-operator from Slinky. Overall, this architecture allows Together to offer flexible, on-demand Slurm environments on top of a modern, virtualized stack. Notably, in our testing, Together is the only provider to correctly configure Slinky out-of-the-box with sudo permissions, vim/nano, git, python, and other basic packages pre-installed. They have clearly already rolled out this offering to some users, and we are excited for it to launch in full GA.
On these clusters, Together provides 24/7 support from an on-call SRE team that is primarily US-based. For networking, they work directly with customers to configure firewall rules at the datacenter level and provide IP addresses as needed, including 1:1 NAT and public IPs assignable through services like MetalLB.
The final, and most important, difference between Together and Gold tier providers is a proactive and automated approach to monitoring and reliability. This has been a weak point for Together, and it is difficult for them to work around given the broad set of datacenter and GPU infrastructure partners they have contracts with.
During our review of their Grafana monitoring dashboard, we noted a bug that incorrectly reported InfiniBand bandwidth at a physically impossible 1.14 Tbit/s. To their credit, when we pointed this out, their team quickly identified the calculation error in their query and deployed a fix.
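A plausibility check on derived metrics can catch this class of dashboard bug before a user does. The following is a minimal sketch, not Together's actual query or fix: it assumes the panel is built from InfiniBand port data counters that count in units of 4 octets, and it assumes a 400 Gbit/s link rate. The function names, the link rate, and the tolerance are all hypothetical.

```python
# Hypothetical sanity check for a bandwidth panel; not Together's actual query.
# Assumption: samples come from InfiniBand port data counters, which count
# in units of 4 octets (32-bit words) rather than bytes.

WORD_BYTES = 4  # assumed counter unit: 4 octets per count


def counter_rate_gbps(prev_count, curr_count, interval_s):
    """Convert two counter samples into Gbit/s over the sampling interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_words = curr_count - prev_count
    return delta_words * WORD_BYTES * 8 / interval_s / 1e9


def is_plausible(rate_gbps, link_gbps=400.0, tolerance=0.05):
    """Reject rates that are negative or exceed the assumed link rate plus a margin."""
    return 0.0 <= rate_gbps <= link_gbps * (1.0 + tolerance)
```

With these assumptions, a reading of 1,140 Gbit/s fails `is_plausible`, which is the kind of guard that would surface a unit or aggregation mistake in a query.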
For passive health checks, we expect checks to run continuously in the background to detect failures on live nodes. This is where the gap between their current implementation and a fully automated system is most clear. Together has implemented detection for many critical issues, including GPUs falling off the bus, PCIe errors, InfiniBand link flaps, high GPU thermals, and high ECC memory error rates. A baseline Kubernetes node health check is also in place. However, the most critical missing piece is automated remediation. While they can detect most of the issues above, the logic to automatically drain a faulty node is still on the roadmap for everything except GPUs falling off the bus in Slurm. Other crucial features on the roadmap include detecting uncorrectable Nvidia XID errors, identifying stalled NCCL jobs, and implementing AI/ML-based predictive failure analysis.
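To make the detection-versus-remediation gap concrete, here is a minimal sketch of the step that turns detected findings into drain actions. It is an illustration under stated assumptions, not Together's implementation: the `Finding` type, the condition names, and the choice of which conditions trigger a drain are hypothetical. The emitted commands use the standard `scontrol update ... State=DRAIN` and `kubectl cordon` forms, and the function only returns them as strings rather than executing anything.

```python
from dataclasses import dataclass

# Hypothetical policy: which detected conditions should take a node out of
# scheduling. Condition names mirror the issues listed above; the policy
# itself is an illustrative assumption.
DRAIN_ON = {
    "gpu_fell_off_bus",
    "pcie_error",
    "ib_link_flap",
    "high_ecc_error_rate",
}


@dataclass(frozen=True)
class Finding:
    node: str
    condition: str


def plan_remediation(findings, scheduler="slurm"):
    """Return one drain command per faulty node; other findings are alert-only."""
    if scheduler not in ("slurm", "kubernetes"):
        raise ValueError(f"unknown scheduler: {scheduler}")
    commands = []
    drained = set()
    for finding in findings:
        if finding.condition not in DRAIN_ON or finding.node in drained:
            continue
        drained.add(finding.node)
        if scheduler == "slurm":
            # Running jobs finish; no new jobs are scheduled on the node.
            commands.append(
                f'scontrol update NodeName={finding.node} '
                f'State=DRAIN Reason="{finding.condition}"'
            )
        else:
            # Mark the node unschedulable without evicting running pods.
            commands.append(f"kubectl cordon {finding.node}")
    return commands
```

Returning commands instead of running them keeps the policy easy to test and review; a real system would also need rate limits so that a faulty detector cannot drain an entire cluster.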
For active health checks, Together has implemented a comprehensive suite of tests for single-node validation. It includes Nvidia’s DCGM diagnostics (level 3), PCIe bandwidth tests, single-node NCCL and InfiniBand all-reduce tests to validate local interconnects, and GPU stress tests like GPUBurn. However, key multi-node and application-level tests are still on the roadmap. These include pairwise ib_write tests to validate the InfiniBand fabric under load, hardware correctness validation with Nvidia’s TinyMeg2, and full-stack performance tests with models like Megatron to ensure TFLOPs and loss convergence match reference numbers. We have previously noted how important these tests are during burn-in and during cluster operation: they stress both the GPUs and the interconnect at the same time, for an extended period, resulting in thermal expansion and contraction of the entire cluster, similar to normal operation. We encourage Together to prioritize implementing these active health checks, as we believe it will help them improve reliability, especially when working with datacenter partners that are not under their direct control.
In summary, Together continues to operate on a solid foundation for managed clusters. They have a large and growing customer base for both their cluster and serverless inference endpoint products. Their active, single-node health checks are strong. However, the system is not yet complete. We believe that the gap between passively detecting node failures and automatically remediating them is a key reason for the reliability issues users experience today.
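As an illustration of what a pairwise fabric test involves, the sketch below builds a round-robin schedule in which every pair of nodes is tested exactly once and no node appears twice in the same round, so all pairs in a round can run concurrently. This is a generic circle-method schedule, not a description of Together's roadmap design; the function name is hypothetical, and each scheduled pair would then run a point-to-point bandwidth test of the kind mentioned above.

```python
def pairwise_rounds(nodes):
    """Circle-method schedule: each pair of nodes meets exactly once,
    and each node appears at most once per round."""
    ring = list(nodes)
    if len(ring) % 2:
        ring.append(None)  # placeholder so an odd node sits out each round
    n = len(ring)
    rounds = []
    for _ in range(n - 1):
        pairs = [(ring[i], ring[n - 1 - i]) for i in range(n // 2)]
        rounds.append([(a, b) for a, b in pairs if a is not None and b is not None])
        # Keep the first entry fixed and rotate the rest by one position.
        ring = [ring[0], ring[-1]] + ring[1:-1]
    return rounds
```

For N nodes this yields N-1 rounds when N is even and N rounds when N is odd, which is why full pairwise coverage is far slower than single-node checks and is usually reserved for burn-in or scheduled maintenance windows.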