Monitoring

Observability, alerting, and diagnostic capabilities for infrastructure and workloads.

Key Requirements

  • ncu (NVIDIA Nsight Compute CLI) profiling available to all users
  • Managed Grafana with detailed dashboards available out of the box
  • Automated active and passive health checks
  • Burn-in test documentation
  • TFLOPs estimation tracking
  • Comprehensive passive health check implementation
  • Automated active health check implementation
  • Automatic node draining for detected issues
  • Predictive failure detection: ML models that forecast component failures before they occur
  • Real-time system monitoring
  • Alerting capabilities
  • Diagnostic tools
  • Performance tracking
  • Resource utilization monitoring
  • sacct integration for job accounting and resource utilization (Slurm)
  • DCGM health checks plugged into the Slurm HealthCheckProgram (Slurm)
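
As an illustration of how the DCGM health check and automatic node draining items above could fit together, here is a minimal Python sketch of a script that Slurm might run as its HealthCheckProgram. It rests on several assumptions not stated in this document: that DCGM health watches are already enabled on the node, that `dcgmi health -c` prints a table with an "Overall Health" row, and that the Slurm node name matches the short hostname. The function names are hypothetical.

```python
import socket
import subprocess

# Assumption: a healthy node reports this value in the "Overall Health" row.
HEALTHY_MARKER = "Healthy"


def overall_health(report: str) -> str:
    """Return the value of the 'Overall Health' row, or 'Unknown' if absent."""
    for line in report.splitlines():
        # Table rows are assumed to look like "| Overall Health | Healthy |".
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) >= 2 and cells[0] == "Overall Health":
            return cells[1]
    return "Unknown"


def drain_command(node: str, reason: str) -> list[str]:
    """Build the scontrol invocation that drains a node with a reason."""
    return [
        "scontrol", "update",
        f"NodeName={node}",
        "State=DRAIN",
        f"Reason={reason}",
    ]


def main() -> int:
    """Run the DCGM check; drain this node if it is not reported healthy."""
    # Assumption: the Slurm node name equals the short hostname.
    node = socket.gethostname().split(".")[0]
    result = subprocess.run(
        ["dcgmi", "health", "-c"], capture_output=True, text=True
    )
    status = overall_health(result.stdout)
    if result.returncode == 0 and status == HEALTHY_MARKER:
        return 0
    # Drain rather than fail silently, so no new jobs land on the node.
    subprocess.run(drain_command(node, f"DCGM health: {status}"), check=False)
    return 1
```

A deployment would presumably call `sys.exit(main())` at the bottom of the script and point the `HealthCheckProgram` setting in slurm.conf at it, with `HealthCheckInterval` controlling how often it runs. Keeping the parsing and command construction in separate functions lets both be tested without a GPU or a Slurm controller.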

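The sacct integration item above could be approached along the following lines. This is a sketch only: the choice of fields and the function names are assumptions for illustration, and a real integration would likely export the parsed records to whatever metrics store backs the dashboards.

```python
import subprocess

# Assumption: these accounting fields are the ones of interest.
FIELDS = ["JobID", "State", "Elapsed", "AllocCPUS", "MaxRSS"]


def sacct_command(job_id: str) -> list[str]:
    """Build a sacct call that emits pipe-delimited rows without a header."""
    return [
        "sacct", "-j", job_id,
        "--format=" + ",".join(FIELDS),
        "--parsable2", "--noheader",
    ]


def parse_sacct(output: str) -> list[dict[str, str]]:
    """Turn pipe-delimited sacct output into one dict per job step."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        rows.append(dict(zip(FIELDS, line.split("|"))))
    return rows


def job_accounting(job_id: str) -> list[dict[str, str]]:
    """Query sacct for a job and return its parsed accounting records."""
    result = subprocess.run(
        sacct_command(job_id), capture_output=True, text=True, check=True
    )
    return parse_sacct(result.stdout)
```

Calling `job_accounting` with a job ID on a host where sacct is available would return one record per job step. Fields that sacct leaves blank for a step come back as empty strings.
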