Access
- ✓Console, dashboard, CLI and/or API available to manage resources
- ✓SSH key upload and management
- ✓Root/sudo access available on the system
- ✓API automation support for provisioning and management
- ✓Snapshot and image curation capabilities
Configuration
- ✓NVIDIA Container Toolkit with Docker (NVIDIA systems) (see: NVIDIA Container Toolkit)
  docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
- ✓ROCm Container Toolkit with --gpus=all support (AMD systems) (see: ROCm install guide)
  docker run --rm --device=/dev/kfd --device=/dev/dri rocm/pytorch rocm-smi
- ✓Container performance optimization (import, execution, pull speeds)
- ✓PyTorch and AI framework readiness validation
  python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
- ✓Native drivers and CUDA/ROCm toolkits installed
  nvidia-smi && nvcc --version
- ✓Custom driver installation supported
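The driver, toolkit, and PyTorch checks above can be gathered into one report. The following is a minimal sketch, assuming an NVIDIA system; the helper names (run_check, torch_check, readiness_report) are hypothetical and not part of any tool named in this checklist. Missing binaries are reported rather than treated as crashes, so the script also runs on a machine without a GPU.

```python
import shutil
import subprocess


def run_check(cmd):
    """Run one command; return (status, first line of its output)."""
    if shutil.which(cmd[0]) is None:
        return ("missing", "")
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return ("error", str(exc))
    status = "ok" if proc.returncode == 0 else "failed"
    lines = (proc.stdout or proc.stderr).strip().splitlines()
    return (status, lines[0] if lines else "")


def torch_check():
    """Same check as the python3 one-liner above, but tolerant of a missing torch."""
    try:
        import torch
    except ImportError:
        return ("missing", "")
    detail = (
        f"cuda_available={torch.cuda.is_available()} "
        f"devices={torch.cuda.device_count()}"
    )
    return ("ok", detail)


def readiness_report():
    """Collect the native-driver, CUDA toolkit, and framework checks."""
    return {
        "nvidia-smi": run_check(["nvidia-smi"]),
        "nvcc": run_check(["nvcc", "--version"]),
        "torch": torch_check(),
    }


if __name__ == "__main__":
    for name, (status, detail) in readiness_report().items():
        print(f"{name:12} {status:8} {detail}")
```

A status of "missing" for nvcc alongside "ok" for nvidia-smi would suggest the driver is installed but the CUDA toolkit is not.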
Hardware Support
- ✓NVIDIA: H100/H200/B200, A100, L40S, RTX 6000 Pro
- ✓AMD: MI300X, MI325X, MI355X (when available)
- ✓Flexible billing granularity (second, minute, hour, day)
- ✓External storage and network egress cost transparency
- ✓Spot instances available
- ✓Multi-node interconnects supported
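Billing granularity matters most for short or irregular jobs. The arithmetic below is a rough sketch, assuming the provider rounds usage up to the next whole billing unit (providers differ, so check the actual terms). The hourly rate and the function name billed_cost are hypothetical.

```python
import math

# Length of each billing unit in seconds.
GRANULARITY_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def billed_cost(runtime_seconds, hourly_rate, granularity):
    """Cost of a job when usage is rounded UP to whole billing units (assumption)."""
    unit = GRANULARITY_SECONDS[granularity]
    billed_seconds = math.ceil(runtime_seconds / unit) * unit
    return billed_seconds / 3600 * hourly_rate


if __name__ == "__main__":
    # Hypothetical job: 4,000 seconds (about 67 minutes) at 2.00 per GPU-hour.
    for granularity in GRANULARITY_SECONDS:
        print(granularity, round(billed_cost(4000, 2.0, granularity), 2))
```

Under these assumptions the same job is billed for 4,000 seconds with per-second billing, for two full hours with hourly billing, and for a full day with daily billing.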
Monitoring and Health Checks
- ✓nvidia-smi, DCGM, or AMD equivalents with full metrics access (see: DCGM docs)
  nvidia-smi && dcgmi discovery -l
- ✓GPU clocking, power management, and thermal monitoring
  nvidia-smi -q -d CLOCK,POWER,TEMPERATURE
- ✓Integration with health check systems
- ✓Integration with monitoring tools
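For feeding a health check or monitoring tool, the CSV query mode of nvidia-smi is usually easier to parse than the -q report. The sketch below assumes output from nvidia-smi --query-gpu=index,temperature.gpu,power.draw,clocks.sm --format=csv,noheader,nounits. The function names, the sample text, and the 80 °C threshold are illustrative, not measured values or vendor limits.

```python
import csv
import io

QUERY_FIELDS = ["index", "temperature.gpu", "power.draw", "clocks.sm"]
# Command whose output parse_gpu_metrics() expects.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=" + ",".join(QUERY_FIELDS),
    "--format=csv,noheader,nounits",
]


def _to_float(text):
    """Return a float, or None for non-numeric cells such as [N/A]."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_metrics(csv_text):
    """Turn the CSV text into one dict per GPU; malformed lines are skipped."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in record]
        if len(cells) != len(QUERY_FIELDS) or not cells[0].isdigit():
            continue
        row = {"index": int(cells[0])}
        for name, cell in zip(QUERY_FIELDS[1:], cells[1:]):
            row[name] = _to_float(cell)
        rows.append(row)
    return rows


def over_threshold(rows, max_temp_c):
    """Indices of GPUs whose reported temperature exceeds max_temp_c."""
    return [
        row["index"]
        for row in rows
        if row["temperature.gpu"] is not None and row["temperature.gpu"] > max_temp_c
    ]


if __name__ == "__main__":
    # Illustrative sample text, not captured from a real system.
    sample = "0, 45, 68.52, 1410\n1, 83, 310.10, 1755\n2, [N/A], [N/A], [N/A]\n"
    print(over_threshold(parse_gpu_metrics(sample), max_temp_c=80))
```

On a real host, the text would come from running QUERY_CMD with subprocess.run(..., capture_output=True, text=True) and passing its stdout to parse_gpu_metrics.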
Networking and Storage
- ✓NVMe direct access with multiple filesystem support
- ✓Storage performance benchmarking
  fio --name=seqread --rw=read --bs=1M --size=1G --numjobs=4 --direct=1
- ✓Bandwidth testing and NCCL validation (see: nccl-tests)
- ✓Interconnect optimization
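When reading nccl-tests results, it helps to separate algorithm bandwidth (message size divided by time) from bus bandwidth, which applies a per-collective correction factor so that results are comparable across rank counts. The sketch below follows the factors described in the nccl-tests performance notes; the function names and the example figures are illustrative assumptions, not measurements.

```python
# Correction factors from algorithm bandwidth to bus bandwidth, as a function
# of the number of ranks n, following the nccl-tests performance notes.
BUS_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
    "reduce": lambda n: 1.0,
}


def algorithm_bandwidth_gbs(size_bytes, time_seconds):
    """Algorithm bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return size_bytes / time_seconds / 1e9


def bus_bandwidth_gbs(size_bytes, time_seconds, n_ranks, collective="all_reduce"):
    """Bus bandwidth in GB/s for the given collective and rank count."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    factor = BUS_FACTORS[collective](n_ranks)
    return algorithm_bandwidth_gbs(size_bytes, time_seconds) * factor


if __name__ == "__main__":
    # Illustrative numbers: 1e9 bytes all-reduced in 10 ms across 8 ranks.
    print(bus_bandwidth_gbs(1e9, 0.01, 8, "all_reduce"))
```

Bus bandwidth is the figure to compare against the interconnect's rated speed, since it stays roughly constant as the number of ranks changes.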
Security
- ✓BMC/IPMI restriction and security validation
- ✓Firewall management
- ✓Audit logging
- ✓Container Toolkit vulnerability assessment (see: NVIDIA Container Toolkit CVE-2024-0132)
- ✓CVE compliance verification
- ✓Isolation options: bare metal, VM-based, container-based, or hybrid
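A first pass at the Container Toolkit assessment can be a version comparison. The sketch below assumes, based on NVIDIA's security bulletin for CVE-2024-0132, that NVIDIA Container Toolkit 1.16.2 is the first fixed release; confirm this against the bulletin before relying on it. It also assumes the installed version string contains a dotted major.minor.patch number. The function names are hypothetical, and the check covers only this one CVE.

```python
import re

# Assumption: first NVIDIA Container Toolkit release with the CVE-2024-0132 fix.
FIRST_FIXED = (1, 16, 2)


def parse_version(text):
    """Extract the first major.minor.patch triple found in text."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_patched(version_text):
    """True if the reported version is at or above the assumed fixed release."""
    return parse_version(version_text) >= FIRST_FIXED


if __name__ == "__main__":
    for candidate in ("1.16.1", "1.16.2", "1.9.0"):
        print(candidate, is_patched(candidate))
```

Comparing integer tuples rather than strings avoids ordering mistakes such as treating 1.9.0 as newer than 1.16.2.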
General Expectations
- ✓Console, dashboard, CLI and/or API available to manage resources
- ✓24x7 support availability
- ✓A process exists for security fixes and upgrades, with clear proactive notifications
- ✓Integration with comprehensive monitoring and alerting systems
- ✓Integration with active and passive health check systems