Verda (formerly DataCrunch) is based in Finland, with datacenters in both Finland and Iceland. After login, Verda presents a clean console that makes provisioning straightforward. Their “Instant Clusters” feature was easy to use and spun up a Slurm cluster in minutes. We were also impressed by the completeness of their Slurm implementation, which stands in stark contrast to many other providers in the bronze or even silver tier of this list. From this experience, it seems they have battle-tested the offering with customers, despite it still being labelled “Beta”.
Source: nice, intuitive setup for spinning up a cluster.
Specifically, the B200 cluster we got had everything we expected: pyxis, enroot, hpcx, nccl, nvcc, topology.conf, and dcgmi health -c plugged into Slurm’s HealthCheckProgram.
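As a rough illustration of how a user might spot-check part of that list on a freshly provisioned node, here is a minimal Python sketch. It only looks for command-line tools on the PATH, and the set of names it checks (srun, enroot, nvcc, dcgmi) is our own assumption, not a checklist published by Verda. Items such as pyxis (a Slurm plugin), hpcx and nccl (libraries), and topology.conf (a Slurm configuration file) would need separate checks that are not shown here.

```python
import shutil

# Hypothetical checklist of command-line tools to look for on a node.
# This covers only binaries; plugins, libraries, and config files from the
# list above are out of scope for a simple PATH lookup.
EXPECTED_BINARIES = ("srun", "enroot", "nvcc", "dcgmi")


def missing_binaries(names=EXPECTED_BINARIES, which=shutil.which):
    """Return the names that cannot be found on the current PATH.

    `which` is injectable so the function can be exercised without the
    real tools installed.
    """
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    print("missing:", ", ".join(missing) if missing else "none")
```

Running this on a login or compute node prints any tools that were not found, which is a quick first pass before launching a real job.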
On monitoring, the Grafana dashboard included an interesting SSH command to retrieve the password, and was relatively well configured. Missing pieces related to job performance were minor, and we gave feedback on how to go beyond standard DCGM metrics and display them in a meaningful way to users. The platform also still lacks any way to add users with RBAC enforced at the storage or Slurm level.
Overall, with working B200 instances available and a comprehensive Slurm install, our initial impression was that Verda had made significant improvements since our last round of testing. However, this solid software foundation is still undermined by significant issues on the business and operations side.
We have heard about reliability issues from various Verda customers, both at the hardware level and with respect to their WAN connectivity. Specifically, Verda customers have told us that entire sites can go dark with no explanation. While things like this happen, the more serious issue is in the response. Unfortunately, we have seen Verda charge their customers for GPU time even when instances are down or entire sites are inaccessible. To us, this is an offensive business practice. Our basic expectation for all cloud providers is that they commit to their SLAs in writing, with penalties in the form of credits or deductions off a customer’s monthly bill in the event of a breach. Not upholding a written SLA undermines many of the technical benefits and attractive pricing that we have seen from Verda during our testing.
Note: since publishing this article, we’ve discussed this issue in detail with Verda. Verda is committed to compensating any customers who experience downtime with at least 2x the cost of running any instances, in the form of a credit. Customers contacting technical support via chat get an automatic message stating that all downtime will be compensated. Verda typically issues refunds within 24 hours of the downtime occurrence during weekdays, and on the following Monday for downtime occurring during weekends. For customers billed monthly, any downtime or defects are compensated by subtracting the corresponding amount from the monthly invoice.
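To make the 2x commitment concrete, here is a small Python sketch of the arithmetic. It assumes the credit is computed on the cost the affected instances would have accrued during the outage window; that reading, the function name, and the prices in the example are our own illustrative assumptions, not Verda's published formula. Because the commitment is "at least 2x", the result should be read as a floor.

```python
def minimum_downtime_credit(hourly_rate, downtime_hours, gpus=1, multiplier=2.0):
    """Lower bound on the credit for an outage.

    Assumes credit = multiplier x (hourly rate x hours down x GPUs affected).
    hourly_rate is a hypothetical per-GPU price in dollars.
    """
    if hourly_rate < 0 or downtime_hours < 0 or gpus < 0:
        raise ValueError("rate, hours, and GPU count must be non-negative")
    return multiplier * hourly_rate * downtime_hours * gpus


# Hypothetical example: an 8-GPU node at $4.00 per GPU-hour, down for 3 hours.
# Cost during the outage: 4.00 * 3 * 8 = $96, so the minimum credit is $192.
credit = minimum_downtime_credit(hourly_rate=4.00, downtime_hours=3, gpus=8)
print(f"minimum credit: ${credit:.2f}")
```

A customer could compare this figure against the credit that actually appears on their invoice when reconciling a downtime claim.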
Frankly, we think this is an excellent response.
In general, SemiAnalysis recommends that customers be sure to keep server logs, screenshots, and other information readily available if they are pursuing downtime claims from their provider. At this time we have only seen gold or platinum tier providers proactively issue credits without customers asking for them.
Overall, we recommend that Verda shore up their reliability, finalize their Slurm offering that is currently in beta, improve the monitoring dashboard, and continue development of their Kubernetes offering. We look forward to seeing more from Verda in the future.