Cluster TCO Calculator

Item	Gold-tier	Hyperscaler	Silver-tier
GPU ($/GPU-hr), Total GPUs: 4,096*	$5,426,381	$6,423,183	$5,898,240
Storage ($/GiB-mo)	$1,408	$1,856	-
Network ($/mo)	-	$82	-
CPU ($/vm-hr)	-	$3,318	-
Support (% uplift)	-	$321,422	-
Goodput (% uplift)	$111,349	$131,804	$202,015
Setup (one-time, $/yr)	-	$3,278,258	$3,015,787
Debugging ($/mo, $/yr)	-	$16,667	$16,667

Subtotals
Monthly (amortized)	$5,539,138/mo	$6,989,395/mo	$6,200,693/mo
36-month Total	$199,408,973	$251,618,208	$223,224,957
Relative to Gold-tier	1.00x	1.26x	1.12x

* The Total GPUs value shown on the GPU section header is the sum of the Qty column across every GPU instance row. The same number is used as the Cluster size for every downtime / SLA calculation, so the GPU quantities here are the only place to change cluster size.
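The subtotals can be reproduced from the line items. The sketch below is illustrative only: it assumes blank cells count as $0, that the Setup figure is a one-time cost spread evenly over the 36-month term, and that the Debugging figure is already monthly. The page does not state these rules; they are inferred because the sums then reconcile with the Monthly (amortized) row to within a dollar or two.

```python
TERM_MONTHS = 36  # term used by the "36-month Total" row

# Monthly line items (USD/month) copied from the table; blank cells omitted (treated as 0).
# "setup_one_time" is assumed to be amortized over TERM_MONTHS.
tiers = {
    "Gold-tier": {"monthly": [5_426_381, 1_408, 111_349], "setup_one_time": 0},
    "Hyperscaler": {
        "monthly": [6_423_183, 1_856, 82, 3_318, 321_422, 131_804, 16_667],
        "setup_one_time": 3_278_258,
    },
    "Silver-tier": {
        "monthly": [5_898_240, 202_015, 16_667],
        "setup_one_time": 3_015_787,
    },
}


def amortized_monthly(tier):
    """Sum the monthly items and add the one-time setup cost spread over the term."""
    return sum(tier["monthly"]) + tier["setup_one_time"] / TERM_MONTHS


totals = {name: amortized_monthly(tier) for name, tier in tiers.items()}
baseline = totals["Gold-tier"]

for name, monthly in totals.items():
    print(
        f"{name}: ${monthly:,.0f}/mo, "
        f"${monthly * TERM_MONTHS:,.0f} over {TERM_MONTHS} months, "
        f"{monthly / baseline:.2f}x Gold-tier"
    )
```

Small differences from the published totals are expected, since the inputs shown on the page are already rounded to whole dollars.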

Price Discrepancy (Total)

Cost composition as % of Gold-tier amortized monthly total ($5,539,138/mo = 100%)

[Chart. Categories: GPU (instances), Orchestration, Storage, Network, CPU, Support, Goodput, Setup, Debugging]

Hyperscaler vs Gold-tier: 1.26x (+$1,450,257/mo)

Silver-tier vs Gold-tier: 1.12x (+$661,555/mo)

Price Discrepancy (Attribution): Hyperscaler vs Gold-tier

What accounts for the +$1,450,257/mo difference? Each category's contribution is shown as a share of the total gap, with the gap taken as 100%.

Category	Share of gap
GPU (instances)	+28.5%
Orchestration	+40.3%
Storage	+0.0%
Network	+0.0%
CPU	+0.2%
Support	+22.2%
Goodput	+1.4%
Setup	+6.3%
Debugging	+1.1%
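As a rough check on the attribution, each share can be recomputed as the category's cost difference divided by the total monthly gap. This is a sketch under assumptions, not the calculator's own code: blank Gold-tier cells are treated as $0, Setup is divided by 36 months, and GPU and Orchestration are combined because the TCO table's GPU row appears to include the orchestration uplift.

```python
TERM_MONTHS = 36
GAP = 6_989_395 - 5_539_138  # Hyperscaler minus Gold-tier, USD/month

# (Gold-tier, Hyperscaler) monthly cost per category, taken from the TCO table.
# Assumptions: blanks are 0, Setup is amortized over 36 months, and the GPU row
# already contains Orchestration, so the two categories are reported together.
costs = {
    "GPU + Orchestration": (5_426_381, 6_423_183),
    "Storage": (1_408, 1_856),
    "Network": (0, 82),
    "CPU": (0, 3_318),
    "Support": (0, 321_422),
    "Goodput": (111_349, 131_804),
    "Setup": (0, 3_278_258 / TERM_MONTHS),
    "Debugging": (0, 16_667),
}

# Share of the gap, in percent, for each category.
shares = {
    name: 100 * (hyperscaler - gold) / GAP
    for name, (gold, hyperscaler) in costs.items()
}

for name, pct in shares.items():
    print(f"{name}: {pct:+.1f}%")
```

Under these assumptions the combined GPU and Orchestration share lands within about a tenth of a point of the two listed values added together (28.5% + 40.3%), and the remaining categories match the list after rounding.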

Price Discrepancy (Attribution): Silver-tier vs Gold-tier

What accounts for the +$661,555/mo difference? Each category's contribution is shown as a share of the total gap, with the gap taken as 100%.

Category	Share of gap
GPU (instances)	+71.0%
Orchestration	-
Storage	-0.2%
Network	-
CPU	-
Support	-
Goodput	+13.6%
Setup	+12.6%
Debugging	+2.5%

Goodput Expense Calculator

Each tier's selected strategy downtime % auto-syncs to the Goodput row of the Cluster TCO Calculator.

Parameter	Gold-tier	Hyperscaler	Silver-tier
Shared
Cluster size (GPUs)*	5,184	5,184	5,184
Avg job size (j_size) (GPUs)		4,096	4,096
Blast radius (b_radius) (GPUs)		64	64
Per Provider
GPU MTBF (GPU-hrs)
Checkpoint freq (t_chkpt) (mins)
Failover time (t_failover) (mins)
Idle spare GPUs (GPUs)
Resiliency
Repair/Replace
Time to identify failure (mins)
Time to repair node (t_repair) (hrs)
Time to init job (mins)
Network overhead (%)
Memory overhead (%)
Results
Cluster MTBF	4.8h	4.8h	2.9h
Interruptions/mo	149.3	149.3	248.8
Downtime loss	12.90%	12.90%	27.73%
Idle spare cost	0.62%	0.62%	0.62%
Performance overhead	0.00%	0.00%	0.00%
Total Goodput Loss	13.52%	13.52%	28.35%

* The Cluster size shown here mirrors the total GPU count on the TCO tab (sum of Qty across every GPU instance row). This is the value used in every downtime, SLA, and failure-count calculation below. Change cluster size by editing GPU quantities on the TCO tab.
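The Results rows are related by simple arithmetic. The sketch below is illustrative and rests on assumptions: the per-GPU MTBF inputs are not shown on the page, so the values 25,000 and 15,000 GPU-hours used here are back-solved guesses that happen to reproduce the displayed results, and a 720-hour (30-day) month is likewise assumed.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month
CLUSTER_SIZE = 5_184   # GPUs, from the Cluster size row


def cluster_mtbf_hours(gpu_mtbf_hours, cluster_size=CLUSTER_SIZE):
    """Expected wall-clock hours between failures anywhere in the cluster."""
    return gpu_mtbf_hours / cluster_size


def interruptions_per_month(gpu_mtbf_hours, cluster_size=CLUSTER_SIZE):
    """Expected number of failures per month across the whole cluster."""
    return HOURS_PER_MONTH / cluster_mtbf_hours(gpu_mtbf_hours, cluster_size)


def total_goodput_loss(downtime_pct, idle_spare_pct, overhead_pct):
    """Total Goodput Loss as the sum of the three loss rows (percent)."""
    return downtime_pct + idle_spare_pct + overhead_pct


# Hypothetical GPU MTBF inputs (GPU-hours); NOT taken from the page.
for label, gpu_mtbf in [("MTBF 25,000", 25_000), ("MTBF 15,000", 15_000)]:
    print(
        f"{label}: cluster MTBF {cluster_mtbf_hours(gpu_mtbf):.1f}h, "
        f"{interruptions_per_month(gpu_mtbf):.1f} interruptions/mo"
    )

# Loss rows copied from the Gold-tier column of the table.
print(f"Total goodput loss: {total_goodput_loss(12.90, 0.62, 0.00):.2f}%")
```

The point of the sketch is the scaling: cluster MTBF shrinks in proportion to cluster size, so a per-GPU MTBF measured in years still yields an interruption every few hours at this scale.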

Goodput Expense Formulae

The following formulae are used to calculate Goodput Expense under three scenarios.

Source: SemiAnalysis - How Much Do GPUs Really Cost? (2026)

G_chkpt-cold = [(t_id + t_chkpt/2) + t_init + t_repair] · j_size · #_failures · $_GPU-hr

G_chkpt-hot = {[(t_id + t_chkpt/2) + t_init] · j_size + t_repair · b_radius} · #_failures · $_GPU-hr

G_tolerant = [(t_id + t_failover) · j_size + t_repair · b_radius] · #_failures · $_GPU-hr

Where...

G_chkpt-cold = goodput expense when jobs restart from a checkpoint via a spare node that is "cold" (typically provider managed). In other words, the job waits until the repair or replacement happens. This is the worst-case scenario, since these kinds of repairs typically take hours or days.

G_chkpt-hot = goodput expense when jobs restart from a checkpoint via a spare node that is "hot" (typically customer managed, but can also come from top-tier providers). In other words, the jobs (depending on defined priorities) can restart immediately on idle nodes (customer managed), pre-empt lower-priority jobs (also customer managed), or restart on a node that gets brought into the cluster from a spare pool (provider managed). Of course, a provider-managed spare pool also depends on some capacity guarantee from the customer (i.e. if one of your machines fails and you report it for repair/replacement, there need to be spares available). Top-tier providers that are experienced in running multi-tenant clusters at 4k+ GPU scale tell us that they leave anywhere from 2-6% of their nodes in this spare pool to be used for hot-swaps.

G_tolerant = goodput expense when jobs are "fault tolerant", i.e. they can keep running in the event of a hardware issue. This scenario is well understood for single-node inference, where a framework such as llm-d, OME, or KServe will have the load balancer stop sending traffic to the failed node and resend any failed requests to the healthy nodes. The scenario is less well understood in training.

Individual terms are...

t_id = time to identify failure (provider's monitoring system, or customer to report)

t_chkpt = interval between checkpoints (customer configured); a failure loses half an interval of work on average, hence the t_chkpt/2 term

t_init = time to initialize training job

t_repair = time to repair or replace a failed node, i.e. MTTR

t_failover = time to failover to a hot spare node

b_radius = blast radius, e.g. 8-way HGX or 64-way in NVL72

j_size = average job size

#_failures = number of failures over the period being costed, which follows from MTBF (the lower the MTBF, the more failures)

$_GPU-hr = price per GPU hour
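The three formulae translate directly into code. The sketch below is a minimal illustration, assuming all times are expressed in hours; the class name, field names, and the example input values are hypothetical and are not taken from the calculator.

```python
from dataclasses import dataclass


@dataclass
class GoodputInputs:
    t_id: float          # hours to identify the failure
    t_chkpt: float       # hours between checkpoints
    t_init: float        # hours to initialize the job
    t_repair: float      # hours to repair or replace the node
    t_failover: float    # hours to fail over to a hot spare
    j_size: int          # GPUs in the average job
    b_radius: int        # GPUs taken out by one failure
    n_failures: float    # failures in the period (#_failures)
    price_gpu_hr: float  # $ per GPU-hour


def g_chkpt_cold(p):
    """Whole job idles through detection, lost work, re-init, and the repair."""
    lost_hours = (p.t_id + p.t_chkpt / 2) + p.t_init + p.t_repair
    return lost_hours * p.j_size * p.n_failures * p.price_gpu_hr


def g_chkpt_hot(p):
    """Whole job idles only until restart; just the blast radius waits for repair."""
    restart_hours = (p.t_id + p.t_chkpt / 2) + p.t_init
    gpu_hours = restart_hours * p.j_size + p.t_repair * p.b_radius
    return gpu_hours * p.n_failures * p.price_gpu_hr


def g_tolerant(p):
    """Job pauses for detection and failover; the blast radius waits for repair."""
    gpu_hours = (p.t_id + p.t_failover) * p.j_size + p.t_repair * p.b_radius
    return gpu_hours * p.n_failures * p.price_gpu_hr


# Hypothetical inputs, chosen only to show the relative ordering of the scenarios.
example = GoodputInputs(
    t_id=0.1, t_chkpt=0.5, t_init=0.25, t_repair=4.0, t_failover=0.05,
    j_size=4096, b_radius=64, n_failures=10, price_gpu_hr=2.0,
)

for fn in (g_chkpt_cold, g_chkpt_hot, g_tolerant):
    print(f"{fn.__name__}: ${fn(example):,.0f}")
```

With these made-up inputs the cold-spare case is several times more expensive than the hot-spare case, because t_repair is multiplied by the full job size rather than by the blast radius.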

MTBF Reference

Using Gold-tier config, Hot spare strategy

Source	GPU MTBF (GPU-hrs)	Downtime %
Maximum (Nebius blog)	169,800	1.90%
Round number (high)	64,000	5.04%
Meta paper claim	50,677	6.36%
Round number (mid)	32,000	10.08%
Nebius blog sample	26,446	12.19%
Round number (low)	16,000	20.15%
ClusterMAX 2.0 real	10,000	32.25%
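The rows in this table are close to inversely proportional: downtime % multiplied by GPU MTBF is roughly constant. That is what one would expect if the number of failures scales with 1/MTBF while the GPU-hours lost per failure stay fixed for a given configuration. The sketch below calibrates that constant from a single row and predicts the others; the function and variable names are hypothetical, and the proportional model is an inference from the numbers, not something the page states.

```python
# (source, GPU MTBF in GPU-hours, downtime %) copied from the reference table.
reference = [
    ("Maximum (Nebius blog)", 169_800, 1.90),
    ("Round number (high)", 64_000, 5.04),
    ("Meta paper claim", 50_677, 6.36),
    ("Round number (mid)", 32_000, 10.08),
    ("Nebius blog sample", 26_446, 12.19),
    ("Round number (low)", 16_000, 20.15),
    ("ClusterMAX 2.0 real", 10_000, 32.25),
]

# Calibrate on one row: implied GPU-hours lost per failure for this configuration.
anchor_mtbf, anchor_pct = 10_000, 32.25
loss_per_failure = anchor_mtbf * anchor_pct / 100


def downtime_pct(gpu_mtbf):
    """Predicted downtime %, assuming a fixed loss per failure (inferred model)."""
    return 100 * loss_per_failure / gpu_mtbf


for source, mtbf, listed in reference:
    print(f"{source}: listed {listed:.2f}%, predicted {downtime_pct(mtbf):.2f}%")
```

Under this reading, halving GPU MTBF doubles the downtime percentage, which is why the choice of MTBF source moves the result so much.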

References

[1] SemiAnalysis ClusterMAX - Independent evaluation and rating of 84 GPU cloud providers across 10 dimensions

[2] Meta: The Llama 3 Herd of Models - Training infrastructure details and failure analysis for large-scale GPU clusters

[3] Meta: Revisiting Reliability in Large-Scale ML Research Clusters - Comprehensive study of hardware failures and MTBF in production ML clusters

[4] Nebius: How we build reliable clusters for distributed AI workloads - Real-world MTBF data and cluster reliability engineering practices

[5] Crusoe: Minimizing Hardware Failures in Large GPU Clusters - Automated health checks and failure mitigation strategies

[6] AWS: Deep Health Checks with SageMaker HyperPod - AWS approach to automated cluster health monitoring and node replacement

[7] AWS: Capacity Block Pricing - Reserved GPU capacity pricing for ML workloads

[8] AWS: Premium Support Pricing - Tiered support pricing used in TCO support cost calculations
