


Cluster TCO Calculator



| Item (pricing basis) | Gold-tier | Hyperscaler | Silver-tier |
|---|---|---|---|
| GPU ($/GPU-hr), Total GPUs: 4,096* | $5,426,381 | $6,423,183 | $5,898,240 |
| Storage ($/GiB-mo) | $1,408 | $1,856 | |
| Network ($/mo) | | $82 | |
| CPU ($/vm-hr) | | $3,318 | |
| Support (% uplift) | | $321,422 | |
| Goodput (% uplift) | $111,349 | $131,804 | $202,015 |
| Setup (one-time) | | $3,278,258 | $3,015,787 |
| Debugging ($/mo) | | $16,667 | $16,667 |
| Subtotals | | | |
| Monthly (amortized) | $5,539,138/mo | $6,989,395/mo | $6,200,693/mo |
| 36-month Total | $199,408,973 | $251,618,208 | $223,224,957 |
| Relative to Gold-tier | 1.00x | 1.26x | 1.12x |
* The Total GPUs value shown on the GPU section header is the sum of the Qty column across every GPU instance row. This same number is used as the Cluster size for every downtime / SLA calculation, so editing GPU quantities here is the only place to change cluster size.
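The subtotals can be reproduced from the line items. The sketch below is illustrative only: the assignment of each value to a provider is an assumption (it is the assignment that reproduces the published subtotals), and the names `monthly`, `one_time`, and `amortized_monthly` are made up for this example. It also assumes one-time setup costs are spread evenly over the 36-month term.

```python
# Assumption: each value is assigned to the provider that makes the
# published subtotals come out right; the page itself does not label them.
MONTHS = 36  # contract term used for the "36-month Total" row

# Recurring monthly costs in dollars.
monthly = {
    "Gold-tier": {"gpu": 5_426_381, "storage": 1_408, "goodput": 111_349},
    "Hyperscaler": {
        "gpu": 6_423_183, "storage": 1_856, "network": 82, "cpu": 3_318,
        "support": 321_422, "goodput": 131_804, "debugging": 16_667,
    },
    "Silver-tier": {"gpu": 5_898_240, "goodput": 202_015, "debugging": 16_667},
}

# One-time setup costs in dollars, amortized over the term.
one_time = {"Gold-tier": 0, "Hyperscaler": 3_278_258, "Silver-tier": 3_015_787}


def amortized_monthly(provider: str) -> float:
    """Recurring monthly cost plus the one-time cost spread over MONTHS."""
    return sum(monthly[provider].values()) + one_time[provider] / MONTHS


totals = {p: amortized_monthly(p) for p in monthly}
baseline = totals["Gold-tier"]

for provider, total in totals.items():
    print(
        f"{provider:12s} ${total:,.0f}/mo  "
        f"${total * MONTHS:,.0f} over {MONTHS} mo  "
        f"{total / baseline:.2f}x"
    )
```

The results match the table to within a few dollars; the residual comes from the page rounding each displayed line item to whole dollars.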
Price Discrepancy (Total)
Cost composition as % of Gold-tier amortized monthly total ($5,539,138/mo = 100%)

Price Discrepancy (Attribution): Hyperscaler vs Gold-tier
What accounts for the +$1,450,257/mo difference? Each category is shown as its share of the total gap (the shares sum to 100%).

Price Discrepancy (Attribution): Silver-tier vs Gold-tier
What accounts for the +$661,555/mo difference? Each category is shown as its share of the total gap (the shares sum to 100%).
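A minimal sketch of how such an attribution could be computed: take the per-category difference against Gold-tier and divide by the total gap. The per-provider line items are an assumption (chosen because they reproduce the published subtotals), setup is assumed to be amortized over 36 months, and `attribute_gap` is a hypothetical helper, not the calculator's own code.

```python
MONTHS = 36  # assumed amortization period for one-time setup costs

# Amortized monthly cost per category, in dollars (assumed assignment).
gold = {"gpu": 5_426_381, "storage": 1_408, "goodput": 111_349}
hyperscaler = {
    "gpu": 6_423_183, "storage": 1_856, "network": 82, "cpu": 3_318,
    "support": 321_422, "goodput": 131_804, "debugging": 16_667,
    "setup": 3_278_258 / MONTHS,
}
silver = {
    "gpu": 5_898_240, "goodput": 202_015, "debugging": 16_667,
    "setup": 3_015_787 / MONTHS,
}


def attribute_gap(provider: dict, baseline: dict):
    """Return (total gap, {category: share of gap}) versus the baseline."""
    categories = sorted(set(provider) | set(baseline))
    deltas = {c: provider.get(c, 0) - baseline.get(c, 0) for c in categories}
    gap = sum(deltas.values())
    shares = {c: delta / gap for c, delta in deltas.items()}
    return gap, shares


gap_h, shares_h = attribute_gap(hyperscaler, gold)
gap_s, shares_s = attribute_gap(silver, gold)

print(f"Hyperscaler gap: +${gap_h:,.0f}/mo")
for category, share in sorted(shares_h.items(), key=lambda kv: -kv[1]):
    print(f"  {category:10s} {share:7.2%}")
```

A share can be negative when the comparison provider is cheaper in a category. Under these assumptions that happens for Silver-tier storage, which slightly offsets the rest of the gap.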

Goodput Expense Calculator



| Parameter | Gold-tier | Hyperscaler | Silver-tier |
|---|---|---|---|
| Shared | |||
| Cluster size (GPUs)* | 5,184 | 5,184 | 5,184 |
| Avg job size (j_size) (GPUs) | 4,096 | 4,096 | |
| Blast radius (b_radius) (GPUs) | 64 | 64 | |
| Per Provider | |||
| GPU MTBF (GPU-hrs) | |||
| Checkpoint freq (t_chkpt) (mins) | |||
| Failover time (t_failover) (mins) | |||
| Idle spare GPUs (GPUs) | |||
| Resiliency | |||
| Repair/Replace | |||
| Time to identify failure (mins) | |||
| Time to repair node (t_repair) (hrs) | |||
| Time to init job (mins) | |||
| Network overhead (%) | |||
| Memory overhead (%) | |||
| Results | |||
| Cluster MTBF | 4.8h | 4.8h | 2.9h |
| Interruptions/mo | 149.3 | 149.3 | 248.8 |
| Downtime loss | 12.90% | 12.90% | 27.73% |
| Idle spare cost | 0.62% | 0.62% | 0.62% |
| Performance overhead | 0.00% | 0.00% | 0.00% |
| Total Goodput Loss | 13.52% | 13.52% | 28.35% |
* The Cluster size shown here mirrors the total GPU count on the TCO tab (the sum of Qty across every GPU instance row). This is the value used in every downtime, SLA, and failure-count calculation below. Change cluster size by editing GPU quantities on the TCO tab.
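Several Results rows can be reproduced with simple arithmetic. The sketch below is an illustration under stated assumptions, not the calculator's actual code. The input values are back-solved from the displayed results because the page does not show them: a per-GPU MTBF of 25,000 GPU-hours (Gold-tier, Hyperscaler) or 15,000 GPU-hours (Silver-tier), 32 idle spare GPUs, and a 720-hour month. It also assumes GPU failures are independent, so cluster MTBF scales inversely with cluster size.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month


def cluster_mtbf_hours(gpu_mtbf_hours: float, cluster_gpus: int) -> float:
    """Mean time between failures anywhere in the cluster."""
    return gpu_mtbf_hours / cluster_gpus


def interruptions_per_month(gpu_mtbf_hours: float, cluster_gpus: int) -> float:
    """Expected number of failures per month across the cluster."""
    return HOURS_PER_MONTH / cluster_mtbf_hours(gpu_mtbf_hours, cluster_gpus)


def idle_spare_cost(idle_spare_gpus: int, cluster_gpus: int) -> float:
    """Fraction of the cluster held idle as spares."""
    return idle_spare_gpus / cluster_gpus


def total_goodput_loss(downtime: float, idle_spare: float, overhead: float) -> float:
    """Total Goodput Loss as the sum of the three Results rows above it."""
    return downtime + idle_spare + overhead


CLUSTER_GPUS = 5_184  # from the table

# Back-solved inputs (assumptions, see the paragraph above).
for name, gpu_mtbf in [("Gold-tier", 25_000), ("Silver-tier", 15_000)]:
    mtbf = cluster_mtbf_hours(gpu_mtbf, CLUSTER_GPUS)
    per_month = interruptions_per_month(gpu_mtbf, CLUSTER_GPUS)
    print(f"{name}: cluster MTBF {mtbf:.1f}h, {per_month:.1f} interruptions/mo")

print(f"Idle spare cost: {idle_spare_cost(32, CLUSTER_GPUS):.2%}")
```

The Downtime loss row is not reproduced here because it depends on the goodput formulae and on per-provider inputs that the page does not show.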
Goodput Expense Formulae
The following formulae are used to calculate Goodput Expense under three scenarios.
Source: SemiAnalysis - How Much Do GPUs Really Cost? (2026)
Where...
G_chkpt-cold = goodput expense when jobs restart from a checkpoint on a "cold" spare node (typically provider managed). In other words, the jobs wait until a repair or replacement completes. This is the worst-case scenario, since these kinds of repairs typically take hours or days.
G_chkpt-hot = goodput expense when jobs restart from a checkpoint on a "hot" spare node (typically customer managed, though top-tier providers can also offer this). Depending on the defined priorities, the jobs can restart immediately on idle nodes (customer managed), pre-empt lower-priority jobs (also customer managed), or restart on a node brought into the cluster from a spare pool (provider managed). A provider-managed spare pool also depends on some capacity guarantee from the customer: if one of your machines fails and you report it for repair or replacement, spares need to be available. Top-tier providers experienced in running multi-tenant clusters at 4k+ GPU scale tell us that they leave anywhere from 2-6% of their nodes in this spare pool for hot-swaps.
G_tolerant = goodput expense when jobs are "fault tolerant", i.e. they can keep running through a hardware issue. This scenario is well understood for single-node inference, where a framework such as llm-d, OME, or KServe has the load balancer stop sending traffic to the failed node and resend any failed requests to the healthy nodes. The scenario is less well understood for training.
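The fault-tolerant inference behaviour described above can be sketched as a toy round-robin router that drops a node on failure and retries the request elsewhere. This is a hypothetical illustration only: `Router`, `dispatch`, and the `send` callback are made-up names and do not reflect the APIs of llm-d, OME, or KServe.

```python
class Router:
    """Toy load balancer: round-robin over healthy nodes, retry on failure."""

    def __init__(self, nodes):
        self.healthy = list(nodes)
        self._next = 0  # index of the next node to try

    def dispatch(self, request, send):
        """Send `request` via `send(node, request)`; retry on another node
        if the chosen node raises ConnectionError."""
        while self.healthy:
            node = self.healthy[self._next % len(self.healthy)]
            try:
                result = send(node, request)
            except ConnectionError:
                # Stop routing traffic to the failed node, then retry.
                self.healthy.remove(node)
                continue
            self._next += 1
            return node, result
        raise RuntimeError("no healthy nodes left")


def flaky_send(node, request):
    """Hypothetical transport in which node 'b' has failed."""
    if node == "b":
        raise ConnectionError(f"{node} is down")
    return f"{request} served by {node}"


router = Router(["a", "b", "c"])
served_by = [router.dispatch(f"req{i}", flaky_send)[0] for i in range(3)]
print(served_by, router.healthy)
```

In this sketch only the request in flight on the failed node is retried; the other nodes keep serving, which is why the goodput expense in this scenario is small compared with restarting from a checkpoint.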
Individual terms are...



MTBF Reference
Using Gold-tier config, Hot spare strategy
References


