BuzzHPC is the AI division of HIVE Digital Technologies (fka HIVE Blockchain), a crypto mining company focused on cool climates with green energy (Canada, Iceland, Sweden). HIVE pivoted into the AI cloud market in 2022 when it acquired a 50MW facility in New Brunswick, Canada from GPU Atlantic, aka gpu.one.


In our testing, the Slurm cluster we got had almost everything wrong with it that we’ve seen in this testing, all at once. It was almost impressive. Here is a list:

initially, no control plane machine
initially, no NFS mount, and then user’s default workdir was not on the shared filesystem
initially, no passwordless ssh between nodes
docker and the nvidia container toolkit not installed on the worker nodes
modules not installed, also no hpcx, nccl, nvcc
no pyxis or enroot
dcgmi background health checks not installed or enabled
no prolog or epilog configured, no active health checks
no monitoring dashboard
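Several items on this list come down to whether a given executable exists on a node. A minimal sketch of that kind of check follows. The mapping from checklist item to executable name is our illustrative assumption, not the procedure used in this testing, and it covers only the items that can be detected by looking for a binary on PATH.

```python
import shutil

# Hypothetical mapping: checklist item -> executables that would indicate it is present.
# The binary names are the ones these tools normally install; adjust for your image.
CHECKS = {
    "docker": ["docker"],
    "nvidia container toolkit": ["nvidia-ctk"],
    "nvcc": ["nvcc"],
    "enroot": ["enroot"],
    "dcgmi": ["dcgmi"],
}


def missing_tools(checks, which=shutil.which):
    """Return the checklist items for which no listed executable is found on PATH."""
    return sorted(
        name for name, exes in checks.items()
        if not any(which(exe) for exe in exes)
    )


if __name__ == "__main__":
    # Run this on a worker node (for example through srun) to see what is absent there.
    for item in missing_tools(CHECKS):
        print(f"missing: {item}")
```

The `which` parameter exists only so the lookup can be replaced in a test. Items such as the NFS mount, passwordless ssh, or prolog and epilog scripts need different checks and are left out here.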

To get around all of this, we ran a 2-node nccl test with the pytorch-bundled libnccl. Unfortunately, we did not see the expected bandwidth (our results were about 10x lower than expected). This was weird, because ibstat showed 8x 400Gb CX-7 in the nodes.
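To put the shortfall in rough numbers, the sketch below sums the link rates that ibstat reports and converts them to bytes per second. The sample text is a hand-written stand-in for ibstat output on a node with 8x 400Gb adapters, not a capture from this cluster, and the aggregate line rate is only an upper bound: the bandwidth a real nccl test is expected to reach sits somewhat below it.

```python
def aggregate_rate_gbps(ibstat_text):
    """Sum the 'Rate:' values (in Gb/s) of every port whose State is Active."""
    total = 0.0
    state = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        if line.startswith("Port ") and line.endswith(":"):
            state = None  # a new port section begins, forget the previous state
        elif line.startswith("State:"):
            state = line.split(":", 1)[1].strip()
        elif line.startswith("Rate:") and state == "Active":
            total += float(line.split(":", 1)[1].strip())
    return total


def gbps_to_gigabytes_per_s(gbps):
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8.0


# Hand-written sample in the shape of ibstat output (assumption, not real data).
SAMPLE = "\n".join(
    f"CA 'mlx5_{i}'\n\tPort 1:\n\t\tState: Active\n"
    f"\t\tPhysical state: LinkUp\n\t\tRate: 400"
    for i in range(8)
)

if __name__ == "__main__":
    line_rate = aggregate_rate_gbps(SAMPLE)            # 8 x 400 Gb/s
    per_node = gbps_to_gigabytes_per_s(line_rate)      # line rate in GB/s
    print(f"aggregate line rate: {line_rate:.0f} Gb/s = {per_node:.0f} GB/s per node")
    print(f"a 10x shortfall relative to line rate: about {per_node / 10:.0f} GB/s")
```

On a real node the input would come from running `ibstat` and passing its standard output to `aggregate_rate_gbps`.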

So, we quickly confirmed both that GPUDirect RDMA was not installed and that ACS was not turned off.
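Both conditions can be checked from standard system output. The sketch below is one way to do it, under our assumptions rather than a record of the commands used in this testing: it looks for the `nvidia_peermem` module (or the older `nv_peer_mem`) in the contents of `/proc/modules`, and it scans `lspci -vvv` output for `ACSCtl` lines with `SrcValid+`, the usual sign that ACS is active on a PCIe bridge. The sample strings in the test are hand-written.

```python
def peermem_loaded(proc_modules_text):
    """True if a GPUDirect RDMA peer-memory module appears in /proc/modules content.

    Caveat: newer stacks can also do GPUDirect RDMA through DMA-BUF without this
    module, so a False result is a hint to investigate, not proof.
    """
    names = ("nvidia_peermem", "nv_peer_mem")
    return any(
        line.split()[0] in names
        for line in proc_modules_text.splitlines()
        if line.strip()
    )


def devices_with_acs_enabled(lspci_vvv_text):
    """Return PCI addresses whose ACSCtl line in `lspci -vvv` output shows SrcValid+."""
    enabled = []
    current = None
    for line in lspci_vvv_text.splitlines():
        if line and not line[0].isspace():
            # Device header lines start in column 0, e.g. "0000:15:01.0 PCI bridge: ..."
            current = line.split()[0]
        elif "ACSCtl:" in line and "SrcValid+" in line and current:
            enabled.append(current)
    return enabled


if __name__ == "__main__":
    import subprocess

    with open("/proc/modules") as f:
        print("peermem module loaded:", peermem_loaded(f.read()))
    # lspci usually needs root to show capability details such as ACSCtl.
    out = subprocess.run(["lspci", "-vvv"], capture_output=True, text=True).stdout
    print("bridges with ACS enabled:", devices_with_acs_enabled(out))
```

An empty list from `devices_with_acs_enabled` is what you want to see on a bare-metal GPU node. Any address it returns is a bridge where peer-to-peer traffic may be redirected through the root complex.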

The BuzzHPC Console

To their credit, the BuzzHPC team was responsive and worked with us over several days to resolve some of the issues we identified. They’ve also committed to building our feedback into their default Slurm offering going forward.

However, even after these fixes, the cluster does not meet the standards we expect for usability, monitoring and health checks. It seems that BuzzHPC’s platform is still actively in development. We look forward to seeing more from BuzzHPC in the future.
