Lifecycle

Resource provisioning, scaling, and management capabilities including automation and self-service features.

Key Requirements

  • Transparent onboarding costs and delivery timelines that are met
  • No punitive data egress / offboarding fees
  • Easy onboarding with no preference between console/UI or API/Terraform/CLI, as long as whichever options are offered are well-documented and easy to use
  • Delivery date accuracy and meeting expectations
  • Knowledge of industry experts (e.g., Sylvain)
  • Provisioning of CPU head node for Slurm without explicit requestSlurm
  • Understanding of standard ML user expectations
  • Out-of-the-box GPUDirect RDMA (between NIC and GPU) setup
  • Out-of-the-box IB/RoCEv2 & NVIDIA drivers configuration
  • Performance optimization libraries (e.g., Together AI kernel collection)
  • Audit logs capturing resource actions (create/start/stop/delete), administrative actions (role changes, login success/failure), and billing events
  • Audit log entries include actor identity (user ID, email, IP address), action details, target resource, timestamp, and success/failure status
  • Audit logs queryable via API with filtering by resource type, project, user, and date range
  • Minimum 90-day audit log retention with export capability
  • Audit log access restricted to administrators with no additional usage charges

All evaluation criteria