Standalone

Individual GPU machines provide direct hardware access for training, inference, data processing, and research workloads. Standalone systems offer maximum flexibility and performance without orchestration overhead.

Access

  • Console, dashboard, CLI and/or API available to manage resources
  • SSH key upload and management
  • Root/sudo access available on the system
  • API automation support for provisioning and management
  • Snapshot and image curation capabilities

Configuration

  • NVIDIA Container Toolkit with Docker (NVIDIA systems); see the NVIDIA Container Toolkit documentation
    docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
  • ROCm Container Toolkit with --gpus=all support (AMD systems); see the ROCm install guide
    docker run --rm --device=/dev/kfd --device=/dev/dri rocm/pytorch rocm-smi
  • Container performance optimization (import, execution, pull speeds)
  • PyTorch and AI framework readiness validation
    python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
  • Native drivers and CUDA/ROCm toolkits installed
    nvidia-smi && nvcc --version
  • Custom driver installation supported
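
The one-line framework check above only confirms that PyTorch can see a device. A slightly fuller readiness check might also run a small computation on the GPU. The following is a minimal sketch, assuming PyTorch may or may not be installed; the function name check_torch_gpu and the report keys are illustrative, not part of any provider tooling.

```python
import importlib.util


def check_torch_gpu(matrix_size: int = 1024) -> dict:
    """Report whether PyTorch can see and actually use a GPU."""
    report = {
        "torch_installed": False,
        "gpu_available": False,
        "device_count": 0,
        "matmul_ok": False,
    }
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed in this environment
    import torch

    report["torch_installed"] = True
    # ROCm builds of PyTorch expose AMD GPUs through the same torch.cuda API.
    report["gpu_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    if report["gpu_available"]:
        # Run a small matrix multiply on the default GPU to exercise the driver stack.
        a = torch.randn(matrix_size, matrix_size, device="cuda")
        b = torch.randn(matrix_size, matrix_size, device="cuda")
        c = a @ b
        torch.cuda.synchronize()
        report["matmul_ok"] = bool(torch.isfinite(c).all().item())
    return report


if __name__ == "__main__":
    print(check_torch_gpu())
```

A report with gpu_available true but matmul_ok false would suggest a driver or toolkit mismatch rather than a missing device.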

Hardware Support

  • NVIDIA: H100/H200/B200, A100, L40S, RTX PRO 6000
  • AMD: MI300X, MI325X, MI355X (when available)
  • Flexible billing granularity (second, minute, hour, day)
  • External storage and network egress cost transparency
  • Spot instances available
  • Multi-node interconnects supported
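
Billing granularity matters most for short or interrupted jobs. As a rough illustration, the sketch below computes what a job would cost if usage were rounded up to whole billing units. The round-up rule, the rate of 2.00 per GPU-hour, and the function name billed_cost are all assumptions for the example, not any provider's actual terms.

```python
import math

# Seconds per billing unit for each granularity named in the list above.
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def billed_cost(runtime_seconds: int, hourly_rate: float, granularity: str) -> float:
    """Cost when usage is rounded up to whole billing units (assumed policy)."""
    unit = UNIT_SECONDS[granularity]
    billed_units = math.ceil(runtime_seconds / unit)
    billed_seconds = billed_units * unit
    return round(billed_seconds * hourly_rate / 3600, 2)


if __name__ == "__main__":
    # Hypothetical job: 95 minutes on one GPU at an assumed 2.00 per GPU-hour.
    for granularity in UNIT_SECONDS:
        print(granularity, billed_cost(95 * 60, 2.00, granularity))
```

Under these assumptions the same 95-minute job costs 3.17 with per-second or per-minute billing, 4.00 with hourly billing, and 48.00 with daily billing.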

Monitoring and Health Checks

  • nvidia-smi, DCGM, or AMD equivalents with full metrics access; see the DCGM docs
    nvidia-smi && dcgmi discovery -l
  • GPU clocking, power management, and thermal monitoring
    nvidia-smi -q -d CLOCK,POWER,TEMPERATURE
  • Integration with health check systems
  • Integration with monitoring tools
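
To feed these metrics into a health check or monitoring tool, the nvidia-smi output usually has to be parsed first. The sketch below parses the CSV form produced by nvidia-smi --query-gpu with --format=csv,noheader,nounits. The choice of fields, the 85 °C threshold, and the function names are assumptions for illustration; suitable thresholds depend on the GPU model.

```python
import subprocess

QUERY_FIELDS = ["index", "temperature.gpu", "power.draw", "utilization.gpu"]


def parse_gpu_metrics(csv_text: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip malformed lines
        row = {}
        for field, value in zip(QUERY_FIELDS, values):
            try:
                row[field] = float(value)
            except ValueError:
                row[field] = None  # e.g. "[N/A]" when a metric is unsupported
        rows.append(row)
    return rows


def overheated(rows: list[dict], max_temp_c: float = 85.0) -> list[int]:
    """Return indices of GPUs above the (assumed) temperature threshold."""
    return [
        int(r["index"])
        for r in rows
        if r["index"] is not None
        and r["temperature.gpu"] is not None
        and r["temperature.gpu"] > max_temp_c
    ]


def read_gpu_metrics() -> list[dict]:
    """Run nvidia-smi on this machine and parse its output."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(QUERY_FIELDS)}",
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_metrics(out)
```

Keeping the parser separate from the subprocess call means the same logic can be exercised against captured output on a machine without GPUs.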

Networking and Storage

  • InfiniBand/RDMA configuration and testing; see RDMA perftest
    ibstat && ibv_devinfo
  • NVMe direct access with multiple filesystem support
  • Storage performance benchmarking
    fio --name=seqread --rw=read --bs=1M --size=1G --numjobs=4 --direct=1
  • Bandwidth testing and NCCL validation; see nccl-tests
  • Interconnect optimization
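
When reading nccl-tests results, it helps to know how the two reported bandwidth figures relate. For all-reduce, nccl-tests defines algorithm bandwidth as message size divided by time, and bus bandwidth as that value scaled by 2(n-1)/n for n ranks. The sketch below reproduces that arithmetic; the function name and the sample numbers are illustrative only.

```python
def allreduce_bandwidth(message_bytes: int, seconds: float, n_ranks: int) -> tuple[float, float]:
    """Return (algbw, busbw) in GB/s for one all-reduce, using the nccl-tests definitions."""
    algbw = message_bytes / seconds / 1e9
    # The all-reduce correction factor makes busbw comparable to link speed.
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


if __name__ == "__main__":
    # Hypothetical measurement: 1 GB reduced across 8 GPUs in 10 ms.
    algbw, busbw = allreduce_bandwidth(1_000_000_000, 0.01, 8)
    print(f"algbw={algbw:.1f} GB/s busbw={busbw:.1f} GB/s")
```

Bus bandwidth is the figure to compare against the interconnect's rated speed, since it is intended to be independent of the number of ranks.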

Security

  • BMC/IPMI restriction and security validation
  • Firewall management
  • Audit logging
  • Container Toolkit vulnerability assessment; see NVIDIA Container Toolkit CVE-2024-0132
  • CVE compliance verification
  • Isolation options: bare metal, VM-based, container-based, or hybrid
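
One simple form of the Container Toolkit assessment is a version comparison. The sketch below extracts the version from the text printed by nvidia-ctk --version and compares it with 1.16.2, which is assumed here to be the first release that fixes CVE-2024-0132; confirm the fixed version against the NVIDIA security bulletin before relying on it. The function names are illustrative, and passing this check says nothing about later advisories.

```python
import re

# Assumed first fixed release for CVE-2024-0132; confirm against the NVIDIA bulletin.
PATCHED = (1, 16, 2)


def parse_version(text: str) -> tuple:
    """Extract the first X.Y.Z version number from `nvidia-ctk --version` output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found")
    return tuple(int(part) for part in match.groups())


def is_patched(version_text: str) -> bool:
    """True if the reported version is at or above the assumed fixed release."""
    return parse_version(version_text) >= PATCHED


if __name__ == "__main__":
    sample = "NVIDIA Container Toolkit CLI version 1.16.1"
    print(parse_version(sample), is_patched(sample))
```

Comparing integer tuples rather than strings avoids ordering mistakes such as treating 1.9.0 as newer than 1.16.2.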

General Expectations

  • Console, dashboard, CLI and/or API available to manage resources
  • 24x7 support availability
  • A process exists for security fixes and upgrades, with clear, proactive notifications
  • Integration with comprehensive monitoring and alerting systems
  • Integration with active and passive health check systems
