BuzzHPC is the AI division of HIVE Digital Technologies (fka HIVE Blockchain), a crypto miner focused on cool climates with green energy (Canada, Iceland, Sweden). HIVE pivoted into the AI cloud market in 2022 when it acquired a 50MW facility in New Brunswick, Canada from GPU Atlantic, aka gpu.one.
In our testing, the Slurm cluster we got had almost everything wrong with it that we’ve seen in this testing, all at once. It was almost impressive. Here is a list:
- initially, no control plane machine
- initially, no NFS mount, and then user’s default workdir was not on the shared filesystem
- initially, no passwordless ssh between nodes
- docker and the nvidia container toolkit not installed on the worker nodes
- modules not installed, also no hpcx, nccl, nvcc
- no pyxis or enroot
- dcgmi background health checks not installed or enabled
- no prolog or epilog configured, no active health checks
- no monitoring dashboard
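Several of the items above can be spotted in seconds by checking whether the relevant command-line tools are on a worker node's `PATH`. The following is a minimal sketch of such a probe, not the procedure used in this testing. The `EXPECTED_TOOLS` table and the `missing_tools` helper are hypothetical names introduced for illustration, and the tool list is an assumption you would adjust per site.

```python
import shutil

# Hypothetical checklist (an assumption, not an official list):
# CLI name -> what its presence suggests is installed.
EXPECTED_TOOLS = {
    "docker": "Docker engine",
    "nvidia-ctk": "NVIDIA Container Toolkit",
    "enroot": "enroot container runtime",
    "dcgmi": "DCGM command-line interface",
    "nvcc": "CUDA compiler",
    "ibstat": "InfiniBand diagnostics",
}


def missing_tools(expected=EXPECTED_TOOLS, which=shutil.which):
    """Return descriptions of expected tools that are not found on PATH."""
    return [desc for name, desc in expected.items() if which(name) is None]


if __name__ == "__main__":
    for desc in missing_tools():
        print(f"MISSING: {desc}")
```

A tool can be installed but absent from `PATH` (for example, `nvcc` exposed only through environment modules), so a miss here is a prompt to look closer rather than proof. pyxis is a Slurm plugin rather than a standalone binary, so it is not covered by this kind of check.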
To get around all of this, we ran a 2-node nccl test with the pytorch-bundled libnccl. Unfortunately, we did not see the expected bandwidth (we measured about 10x lower than expected). This was weird, because ibstat showed 8x 400Gb CX-7 in the nodes.
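For readers who want to reproduce this kind of check without nccl-tests installed, here is a minimal sketch of an all-reduce bandwidth test that relies only on PyTorch and its bundled NCCL. It is not the exact script used in this testing. The function names, message size, and iteration counts are assumptions chosen for illustration, and the bus-bandwidth formula follows the convention documented by nccl-tests for all-reduce.

```python
import os
import time


def allreduce_busbw_gbps(nbytes, seconds, world_size):
    """Bus bandwidth in GB/s, using the nccl-tests all-reduce convention."""
    algbw = nbytes / seconds / 1e9
    return algbw * 2 * (world_size - 1) / world_size


def run(nbytes=1 << 30, iters=20, warmup=5):
    # Imported here so the helper above is usable without PyTorch installed.
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")  # rendezvous info comes from torchrun
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    buf = torch.zeros(nbytes // 4, dtype=torch.float32, device="cuda")

    for _ in range(warmup):  # let NCCL set up its communicators and channels
        dist.all_reduce(buf)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    if dist.get_rank() == 0:
        bw = allreduce_busbw_gbps(nbytes, elapsed, dist.get_world_size())
        print(f"{nbytes} bytes: {bw:.1f} GB/s busbw")
    dist.destroy_process_group()


# Only run the benchmark when launched by torchrun, which sets RANK.
if __name__ == "__main__" and "RANK" in os.environ:
    run()
```

Assuming the file is saved as `allreduce_bw.py` (a hypothetical name), it could be launched on each of the two nodes with something like `torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 allreduce_bw.py`. As a rough ceiling, 8x 400Gb/s of NIC bandwidth per node works out to 400 GB/s at line rate, and real results land somewhat below that.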
So, we quickly confirmed both that GPUDirect RDMA was not installed and that ACS was not turned off.
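Both conditions can be checked from standard Linux tooling. The sketch below is one way to do it, assuming the usual `lsmod` and `lspci -vvv` output formats. The helper names are hypothetical, and treating `SrcValid+` on an `ACSCtl` line as "ACS enabled" is a simplifying assumption.

```python
import shutil
import subprocess

# Current and legacy kernel module names for GPUDirect RDMA peer memory.
PEERMEM_MODULES = ("nvidia_peermem", "nv_peer_mem")


def peermem_loaded(lsmod_text):
    """True if a GPUDirect RDMA peer-memory module appears in `lsmod` output."""
    loaded = {line.split()[0] for line in lsmod_text.splitlines() if line.split()}
    return any(mod in loaded for mod in PEERMEM_MODULES)


def acs_enabled_devices(lspci_text):
    """Return PCI addresses whose ACSCtl line shows SrcValid enabled ('+')."""
    hits, current = [], None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Device header line, e.g. "17:00.0 PCI bridge: ..."
            current = line.split()[0]
        elif "ACSCtl:" in line and "SrcValid+" in line and current:
            hits.append(current)
    return hits


# Only shell out when both tools exist on this machine.
if __name__ == "__main__" and shutil.which("lspci") and shutil.which("lsmod"):
    lsmod = subprocess.run(["lsmod"], capture_output=True, text=True).stdout
    lspci = subprocess.run(["lspci", "-vvv"], capture_output=True, text=True).stdout
    print("GPUDirect RDMA module loaded:", peermem_loaded(lsmod))
    print("Devices with ACS enabled:", acs_enabled_devices(lspci))
```

Reading the ACS capability usually requires running `lspci` as root; without it, the `ACSCtl` lines may be hidden and the list will come back empty. A missing peer-memory module is also not conclusive on stacks that use the dma-buf path instead, so treat both results as starting points for a conversation with the provider.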
To their credit, the BuzzHPC team was responsive and worked with us over several days to resolve some of the issues we identified. They’ve also committed to building our feedback into their default slurm offering going forward.
However, even after these fixes, the cluster does not meet the standards we expect for usability, monitoring and health checks. It seems that BuzzHPC’s platform is still actively in development. We look forward to seeing more from BuzzHPC in the future.