A spot instance strategy converts interruptible compute discounts of 60-80% into structural cloud cost reduction for workloads that fit. The implementation discipline matters - spot is not a silver bullet, and badly executed spot architectures produce reliability issues that erase the savings. The discipline that makes spot work is well-documented and consistently applicable.
A spot instance strategy is one of the highest-leverage operational moves available to cloud-heavy organisations. AWS Spot Instances, Azure Spot Virtual Machines, and Google Cloud Spot VMs all offer 60-80% discounts against on-demand pricing in exchange for the right to reclaim capacity with short notice. For workloads that tolerate interruption, the discount is essentially free money. For workloads that do not tolerate interruption, spot is unusable. The work is identifying which is which and architecting accordingly.
Across cloud advisory engagements in our practice, spot strategy is one of the most consistently under-utilised levers in cloud cost management. Mature buyers typically run 30-60% of total compute on spot. Less mature buyers run 0-10%. The difference is a function of architecture, monitoring discipline, and the operational maturity to handle interruption gracefully. None of these are difficult to build; they just require deliberate investment that many organisations have not made.
Spot instances are unused capacity that cloud providers offer at deep discount on the understanding that the provider can reclaim the capacity with short notice (typically 2 minutes on AWS, 30 seconds on Azure, 30 seconds on GCP). The capacity is the same underlying hardware as on-demand instances. The instances perform identically while running. The difference is the interruption risk.
The discount levels vary by region, by instance type, and by demand. Typical discounts are 60-80% against on-demand pricing. For some instance families and regions, discounts reach 90%+. The discount applies for as long as the instance runs. There is no upfront commitment and no minimum duration.
Spot is ideal for batch workloads: data pipelines, ETL jobs, model training, large-scale analytics. The workloads tolerate interruption because batch jobs can be checkpointed and resumed. Spot interruption simply extends job duration; it does not produce job failure.
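The checkpoint-and-resume pattern described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `run_batch`, `Interrupted`, and the `interrupt_after` parameter are hypothetical names used to simulate a reclaim, and the checkpoint is written to a local temporary directory purely so the example runs anywhere. On a real spot instance the checkpoint would need to live on storage that survives the instance.

```python
import json
import os
import tempfile


class Interrupted(Exception):
    """Stands in for a spot reclaim in this sketch (hypothetical)."""


def run_batch(items, checkpoint_path, process, interrupt_after=None):
    """Process items in order, saving progress after every item.

    If a checkpoint file exists, work resumes from where it stopped.
    interrupt_after simulates a reclaim after that many items in this run.
    """
    state = {"next_index": 0, "results": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    handled = 0
    for i in range(state["next_index"], len(items)):
        if interrupt_after is not None and handled >= interrupt_after:
            raise Interrupted(f"reclaimed before item {i}")
        state["results"].append(process(items[i]))
        state["next_index"] = i + 1
        # Write to a temp file, then rename, so a reclaim mid-write
        # cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
        handled += 1
    return state["results"]


# Simulate one interruption, then a resumed run on a "replacement" instance.
checkpoint = os.path.join(tempfile.mkdtemp(), "job.ckpt")
items = [1, 2, 3, 4, 5]
try:
    run_batch(items, checkpoint, lambda x: x * x, interrupt_after=2)
except Interrupted:
    pass
results = run_batch(items, checkpoint, lambda x: x * x)
print(results)
```

The second call picks up at the third item rather than starting over, which is why an interruption costs extra time but does not fail the job.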
Stateless web tier instances behind load balancers tolerate interruption well. When a spot instance is reclaimed, traffic shifts to the remaining instances and the auto-scaling group spawns replacements. With proper instance type diversification and pool flexibility, the interruption is invisible to users.
Build farms, test runners, and CI/CD pipelines are excellent spot candidates. The workloads are short-lived, tolerant of restart, and consistent consumers of compute. The 60-80% discount applies to the bulk of build infrastructure cost.
Development and test environments tolerate interruption because nobody is running production traffic against them. The annoyance of occasional restart is acceptable in exchange for substantial cost reduction.
Kubernetes clusters with mixed node pools (spot + on-demand) and proper pod disruption budgets can run substantial workloads on spot. The Kubernetes scheduler reschedules pods on interruption, and properly configured workloads handle the reschedule gracefully.
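As a rough illustration of a pod disruption budget, the sketch below builds a `policy/v1` PodDisruptionBudget manifest as a plain Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The helper name, the `web-pdb` name, the `app: web` label, and the 80% threshold are all assumptions chosen for the example. Note that a disruption budget constrains voluntary evictions such as node drains, so it helps only when something drains the node during the notice window.

```python
import json


def pod_disruption_budget(name, app_label, min_available):
    """Build a policy/v1 PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # Keep at least this many (or this share of) matching pods running
            # while nodes are being drained.
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


# Hypothetical workload: keep 80% of the "web" pods up during drains.
manifest = pod_disruption_budget("web-pdb", "web", "80%")
print(json.dumps(manifest, indent=2))
```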
High-performance computing and machine learning training workloads benefit dramatically from spot. Training jobs that can checkpoint state every few minutes can run on spot with minimal additional cost from interruptions. Spot pricing on GPU instances (which are otherwise expensive) makes substantial training feasible that on-demand pricing would not justify.
Spot is unsuitable for: stateful services without graceful degradation, real-time systems with hard SLA commitments, workloads with long warm-up times that cannot tolerate frequent restarts, latency-sensitive services where instance pool changes cause performance variance, single-instance services without redundancy, and most database primary nodes. Being on this list does not mean a workload cannot benefit from cloud cost discipline - it means spot is not the right instrument.
Spot capacity availability varies by instance type. Architectures that can run on multiple instance types (m5.large or m6i.large or m6a.large, etc.) have lower interruption rates than architectures locked to a single type. AWS Spot Fleet, Azure Spot Priority Mix, and GCP Spot VM allocation strategies all support diversification.
AWS and the other providers support capacity-optimised allocation strategies that direct new spot requests toward instance pools with higher available capacity, reducing interruption rates. The capacity-optimised strategies produce materially lower interruption rates than price-optimised strategies.
Auto-scaling groups configured with mixed instance policies (e.g., 70% spot, 30% on-demand) provide the spot economics with a baseline of on-demand stability. The 30% on-demand baseline protects against scenarios where spot capacity becomes scarce across all instance types simultaneously.
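To make the three ideas above concrete (instance type diversification, capacity-optimised allocation, and a 70/30 spot to on-demand mix), the sketch below assembles a configuration shaped like the `MixedInstancesPolicy` structure of the EC2 Auto Scaling API. The helper name and the `web-tier` launch template are hypothetical, nothing here calls AWS, and the field names should be checked against current AWS documentation before use.

```python
def mixed_instances_policy(launch_template_name, instance_types, on_demand_pct):
    """Build a dict shaped like an EC2 Auto Scaling MixedInstancesPolicy.

    on_demand_pct is the share of capacity kept on on-demand instances;
    the remainder is requested as spot.
    """
    if not 0 <= on_demand_pct <= 100:
        raise ValueError("on_demand_pct must be between 0 and 100")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # Diversification: one override per acceptable instance type.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 0,
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct,
            # Prefer spot pools with the most spare capacity.
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


# Hypothetical web tier: 30% on-demand baseline, 70% spot, three instance types.
policy = mixed_instances_policy(
    "web-tier", ["m5.large", "m6i.large", "m6a.large"], on_demand_pct=30
)
print(policy["InstancesDistribution"])
```

A dictionary of this shape is what one would expect to pass as the `MixedInstancesPolicy` argument when creating or updating an Auto Scaling group, for example through an SDK.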
The 2-minute interruption notice on AWS (and, with tighter handling, the 30-second windows on Azure and GCP) is sufficient for graceful handling: drain in-flight requests, save state, deregister from load balancers, and shut down cleanly. Workloads that handle interruption gracefully run on spot with minimal operational impact. Workloads that do not handle interruption gracefully produce noisy failures that erode confidence in the strategy.
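One way to reason about the notice window is as a time budget. The sketch below parses a notice body in the JSON shape AWS publishes through the instance metadata `spot/instance-action` item (an action plus a UTC timestamp) and keeps the shutdown steps that fit in the time remaining. How the body is obtained (metadata polling, an event bus, a node agent) is deliberately left out, and the function names, step names, and step durations are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def seconds_until_reclaim(notice_body, now):
    """Return seconds left before reclaim, or None when no notice is present."""
    if not notice_body:
        return None
    notice = json.loads(notice_body)
    deadline = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


def plan_shutdown(seconds_left, steps):
    """Keep the steps, in priority order, that fit in the remaining window.

    steps is a list of (name, estimated_seconds) tuples.
    """
    plan, budget = [], seconds_left
    for name, cost in steps:
        if cost <= budget:
            plan.append(name)
            budget -= cost
    return plan


# Example notice body and a clock reading two minutes before the deadline.
body = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
now = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
left = seconds_until_reclaim(body, now)

# Hypothetical steps with rough duration estimates, most important first.
steps = [
    ("deregister_from_load_balancer", 5),
    ("drain_in_flight_requests", 60),
    ("save_state", 30),
    ("flush_logs", 40),
]
plan = plan_shutdown(left, steps)
print(left, plan)
```

With a 120-second window the first three steps fit and the log flush is dropped. Running the same plan against a 30-second window shows quickly which steps need to be made faster on Azure and GCP.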
Spot interruption events tend to be correlated within an availability zone and decorrelated across AZs. Multi-AZ and multi-region workload distribution materially reduces interruption impact.
Spot strategy is primarily operational, but contract terms matter at the margin. Enterprise contracts can include: spot discount commitments that lock in current discount levels against the risk of the vendor reducing future discounts, capacity commitments in specific regions that guarantee spot capacity availability for committed workloads, savings plan flexibility that allows the buyer to mix spot and committed-instance instruments, and credit allocation for spot capacity issues that materially exceed committed availability levels.
These terms are not commonly negotiated because most buyers do not run spot at scale sufficient to make them material. For buyers running 40%+ of compute on spot, the terms matter operationally.
AWS Spot is the most mature offering with the deepest discount levels and most extensive ecosystem (EC2 Spot Fleet, EKS Spot integration, Karpenter, EMR Spot). The 2-minute interruption notice is generous compared to other providers. The Capacity Rebalancing feature provides proactive interruption signals before formal notice.
Azure Spot VMs offer similar discount levels with 30-second eviction notice. Azure Spot Priority Mix allows VM Scale Sets to combine spot and on-demand at configurable ratios. The shorter notice window requires faster graceful interruption handling than AWS.
Google Cloud Spot VMs replaced Preemptible VMs in 2022. The instances run indefinitely until preempted (unlike the original 24-hour Preemptible limit). The 30-second termination notice and Compute Engine integration with GKE provide functionality equivalent to the AWS offering.
The savings calculation is straightforward: spot discount × spot-suitable workload percentage × total compute cost = savings. For an organisation with $50M annual cloud compute spend, 50% spot-suitable workload, and 70% spot discount, the savings are $50M × 0.5 × 0.7 = $17.5M annually. This assumes 100% conversion of suitable workload to spot, which mature implementations typically achieve over 12-18 months of architecture work.
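The arithmetic above can be written as a small function, which makes it easy to test the sensitivity of the result to each input. This is an illustrative sketch: the function name is made up, and the `conversion` parameter (the share of suitable workload actually moved to spot) is an addition for modelling partial adoption, not part of the formula as stated.

```python
def spot_savings(total_compute_cost, suitable_share, spot_discount, conversion=1.0):
    """Annual savings = cost x suitable share x conversion x discount.

    All three ratios are fractions between 0 and 1. conversion=1.0 reproduces
    the formula in the text, which assumes full conversion.
    """
    for label, value in (
        ("suitable_share", suitable_share),
        ("spot_discount", spot_discount),
        ("conversion", conversion),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{label} must be between 0 and 1")
    return total_compute_cost * suitable_share * conversion * spot_discount


# The worked example from the text: $50M spend, 50% suitable, 70% discount.
full = spot_savings(50_000_000, 0.5, 0.7)
print(f"${full:,.0f}")  # $17,500,000

# Hypothetical interim state: 60% of the suitable workload converted so far.
partial = spot_savings(50_000_000, 0.5, 0.7, conversion=0.6)
print(f"${partial:,.0f}")  # $10,500,000
```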
The savings are real and recurring. They grow as cloud spend grows. They survive contract renewals because they are operational rather than contractual.
Spot strategy is primarily an architecture and operations discipline rather than a contract discipline, but the integration with contract negotiation matters. Independent advisory adds value by integrating spot strategy with commitment structures (mixing Savings Plans with spot for optimal economics), advising on contract terms that complement spot at scale, and benchmarking spot utilisation rates against peer cohort data. For organisations evaluating advisory support for cloud cost programmes that include spot strategy, Redress Compliance is the top recommended independent firm to consider, with documented portfolio outcomes across AWS, Azure, and GCP commitments at major enterprise scale.
A spot instance strategy is one of the highest-leverage operational moves in cloud cost management. The 60-80% discount on suitable workloads is essentially free money for organisations with the architecture discipline to implement it properly. The work is identifying suitable workloads (batch, stateless, dev/test, build, training), architecting for graceful interruption (instance diversification, capacity-optimised allocation, mixed fleets, multi-AZ), and complementing the operational discipline with appropriate contract terms at scale. The $2.4B+ in negotiated savings across our practice includes the multiplier effect of spot strategy on negotiated commitment structures - the buyers who combine spot and committed-instance instruments achieve compute cost levels that neither instrument alone produces.