Enterprise adoption of the public cloud has been rising for a number of years, and for many organizations it has become a significant element of their IT spend. There have, however, been cases of enterprises moving away from the public cloud back to on-premises, in some cases citing unexpectedly high and escalating costs.
Predicting Cloud Spend
Part of the problem for companies trying to plan IT budgets is predicting their compute usage. Compute processing provided by virtual machines (VMs, or EC2 instances in AWS terminology) is typically the largest component of a public cloud budget. Add in the uncertainty around how much of an on-demand resource will be used, and it becomes easy to understand why enterprises may not want to move away from tried and tested approaches that allow them to precisely predict charges. For example, an organization may typically use one-year virtual machine Reserved Instances for steady, predictable workloads, but combine them with more expensive On-Demand Instances to cover spikes in workloads.
Identifying Potential Savings on Cloud Spend
The problem with this approach is that organizations may be missing out on the opportunity to realize massive savings on their compute processing. In some cases these savings can be as high as 90% of the cost of an On-Demand virtual machine instance. These savings can be achieved by making use of what AWS calls Spot Instances and Microsoft Azure calls low-priority virtual machines. Both are virtual machines provided from the cloud provider’s spare VM capacity. So why have many organizations, perhaps yours, not yet been tempted to take advantage of these offerings?
*Figure 1 costs are based on data taken from the AWS website for an EC2 r4.large virtual machine running on a Linux operating system for 30 days during September and October 2018, in the eastern US (northern Virginia). The average Spot Instance price over a 30-day period selected at random provided a 75% reduction against the On-Demand price for the same period. The Reserved Instance cost is based on a one-year reservation with no upfront payment.
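The percentage in the figure note is simple arithmetic on hourly rates, and a short Python sketch may make it easier to repeat with your own numbers. The two hourly rates below are hypothetical values chosen only so that the result matches the 75% reduction quoted above; they are not the actual r4.large prices.

```python
HOURS_IN_PERIOD = 30 * 24  # the 30-day window used in the figure note


def period_cost(hourly_rate: float, hours: int = HOURS_IN_PERIOD) -> float:
    """Cost of running one instance continuously for the whole period."""
    return hourly_rate * hours


def savings_vs_on_demand(on_demand_rate: float, other_rate: float) -> float:
    """Fractional saving relative to On-Demand (0.75 means 75%)."""
    return 1 - other_rate / on_demand_rate


# Hypothetical hourly rates in USD, for illustration only.
ON_DEMAND_RATE = 0.100
SPOT_AVERAGE_RATE = 0.025

on_demand_cost = period_cost(ON_DEMAND_RATE)
spot_cost = period_cost(SPOT_AVERAGE_RATE)
saving = savings_vs_on_demand(ON_DEMAND_RATE, SPOT_AVERAGE_RATE)

print(f"On-Demand: ${on_demand_cost:.2f}  Spot: ${spot_cost:.2f}  saving: {saving:.0%}")
```

Substituting the published On-Demand rate and an observed average Spot rate for your own instance type and region gives the comparable figure for your workload.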
There are three reasons why organizations are not taking advantage of Spot Instances:
- They are not right for every type of processing
- It is difficult to predict even short-term costs
- They are perceived to be an operational risk
Let us look at each one.
Spot Instances are not right for every type of processing
Spot Instances are by no means appropriate for all workloads. The idea is that you are, in effect, only “borrowing” your virtual machine from the cloud provider. Spot Instances are technically no different from On-Demand or Reserved Instances, but operationally they let AWS and Azure offer you some quite impressive deals, all based on highly fluid supply and demand. Cloud providers have to ensure that their huge data centers can provide enough compute power to cope with almost any amount of demand they will face for their On-Demand and Reserved Instances. This means that they almost always have far more capacity than there is demand for. The size of this surplus varies by region, Availability Zone, virtual machine instance type and instance size and, of course, time period. When supply exceeds demand, cloud providers offer this extra compute capacity at what are often extremely low prices. The downside is that providers need that capacity back when demand goes up again. In the case of AWS Spot Instances, this means you get a two-minute warning to get out before they terminate your instance. Clearly not all types of processing, especially live production processing, can tolerate this type of termination on short notice.
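To make the two-minute warning concrete, here is a minimal Python sketch of how a workload might check for it. It assumes the instance metadata path that AWS documents for Spot interruption notices, which returns HTTP 404 until a notice has been issued and a small JSON document afterwards. The function names are illustrative, and instances that enforce IMDSv2 would also need a session token header, which is left out here.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional

# Metadata path documented by AWS for Spot interruption notices.
# Assumption: IMDSv1-style access; IMDSv2 needs a token header as well.
NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

# Actions that mean the instance is about to be taken away.
DRAIN_ACTIONS = {"terminate", "stop", "hibernate"}


def fetch_notice(url: str = NOTICE_URL, timeout: float = 1.0) -> Optional[str]:
    """Return the raw notice body, or None when no interruption is scheduled."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no notice has been issued yet
            return None
        raise


def parse_notice(body: Optional[str]) -> Optional[dict]:
    """Decode the JSON notice, e.g. {"action": "terminate", "time": "..."}."""
    return None if body is None else json.loads(body)


def should_drain(fetch: Callable[[], Optional[str]] = fetch_notice) -> bool:
    """True when an interruption notice is present and work should wind down."""
    notice = parse_notice(fetch())
    return notice is not None and notice.get("action") in DRAIN_ACTIONS
```

A worker could call `should_drain()` every few seconds and, when it returns true, stop taking new work and save its state. The `fetch` parameter exists only so the logic can be exercised away from EC2, where the metadata address is not reachable.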
But there is a significant amount of processing that can withstand it. The obvious one is batch processing being run with a checkpoint, which means it can cope with being stopped in its tracks. Many organizations also manage to run the web and application tiers of their applications using Spot Instances. Other good candidates include non-time-critical computational work, such as analytics jobs and model training in machine learning. You could also refactor some applications with these characteristics to be more cloud friendly, but you should do a cost benefit analysis to see if the effort is worth the savings on compute. There may also be good opportunities when combined with containers. Finally, if your application is fault tolerant, stateless and loosely coupled, it is a prime candidate for the savings available with Spot Instances.
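The checkpointing idea mentioned above can be sketched in a few lines of Python. This is an illustration under simple assumptions rather than a production pattern: the checkpoint is a small JSON file recording the index of the next item, and the `interrupted` callback is a hypothetical hook that would report a pending Spot termination. In practice the checkpoint would need to live on storage that outlives the instance.

```python
import json
import os
import tempfile
from typing import Callable, Sequence


def load_checkpoint(path: str) -> int:
    """Return the index of the next item to process (0 if no checkpoint yet)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_index: int) -> None:
    """Write to a temp file, then rename, so a kill mid-write leaves the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def run_batch(
    items: Sequence,
    process: Callable,
    checkpoint_path: str,
    interrupted: Callable[[], bool] = lambda: False,
) -> int:
    """Resume from the last checkpoint; return the index of the next unprocessed item."""
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(items)):
        if interrupted():  # hypothetical hook: a termination notice has arrived
            return i
        process(items[i])
        save_checkpoint(checkpoint_path, i + 1)
    return len(items)
```

If the instance is reclaimed, a replacement instance calling `run_batch` with the same checkpoint path picks up at the first unprocessed item. At most one item is repeated, and only if the instance dies between processing an item and recording it.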
It can be difficult to predict even short-term costs with Spot Instances
Until the end of 2017, it was hard to know what you would pay for a Spot Instance. Under those old rules, you put in a bid for the maximum you were prepared to pay for a particular instance for a period of time, but what you actually paid was the market rate for that instance. This market rate could fluctuate up to 10 times the starting figure over a short period of time. Fluctuations below your maximum bid might provide huge savings, but they made it tough to predict the exact cost of running a job. And if the market rate exceeded your bid maximum, your job would be kicked out.
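The old bidding rules can be illustrated with a deliberately simplified Python model. It assumes one market price per hour, charges the market rate for each hour run, and ends the job at the first hour in which the market rate exceeds the bid. The price series is hypothetical and the real billing rules were more detailed than this.

```python
from typing import Sequence, Tuple


def simulate_bid(market_prices: Sequence[float], max_bid: float) -> Tuple[int, float]:
    """Return (hours_run, total_cost) under the simplified old-style rules.

    Each hour is charged at the market rate, not at the bid. The job is
    kicked out at the first hour whose market rate exceeds the bid.
    """
    hours_run = 0
    total_cost = 0.0
    for price in market_prices:
        if price > max_bid:
            break  # market rate exceeded the bid: instance is reclaimed
        total_cost += price
        hours_run += 1
    return hours_run, total_cost


# Hypothetical hourly market prices in USD, including a tenfold spike.
prices = [0.03, 0.03, 0.05, 0.30, 0.04]
hours, cost = simulate_bid(prices, max_bid=0.10)
print(f"ran {hours} of {len(prices)} hours for ${cost:.2f}")
```

Even in this toy version both budgeting problems are visible: the total depends on prices that are unknown in advance, and a single spike ends the job early regardless of how cheap the following hours are.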
AWS then changed the approach at the end of 2017 so that instance prices were not quite as sensitive to market changes. Now when you launch an instance for a fixed period of a few hours, the price increases and decreases in small increments, not more than once an hour and no more than plus or minus 10% in the course of a day. These more stable prices let you budget with greater predictability and control.
With this new approach, a customer can still provide a cap on what they are prepared to pay for Spot Instances. This is obviously useful when it is absolutely essential to run a particular workload at a very low cost. For example, an organization may have some speculative big data analysis heavy on compute which can only be justified if it can be done very cheaply. This increased pricing stability is certainly a welcome improvement to anyone trying to forecast their public cloud spend. However, Spot Instances by nature will always have some cost variability that needs to be accounted for in your budget.
Using Spot Instances is sometimes perceived as an operational risk
Some of your application owners may be unhappy about the perceived risk of having their compute power taken away with only a couple of minutes’ warning. Comfort can be taken from the fact that AWS currently states on its website that over the last three months, 92% of Spot Instance interruptions were from a customer manually terminating the instance because the application had completed its work. AWS also recommends diversification strategies. No one would suggest that you should run all your workloads solely on Spot Instances; they should be combined with On-Demand and/or Reserved Instances.
A Diversification Approach Is Essential for Spot Instance Success
To assist in this diversification approach, AWS encourages customers to think in terms of Spot capacity pools. A Spot capacity pool is a set of unused virtual machine instances with the same instance type, operating system, Availability Zone and network platform. Each Spot capacity pool can have a different price based on supply and demand, and any instances you have in one pool will most likely all go down together. To avoid an “all-your-eggs-in-one-basket” scenario, AWS recommends using between 4 and 20 capacity pools for a production workload. Their reasoning is that with fewer than 4 capacity pools, losing the instances in one pool takes out a large proportion of your capacity all at once. On the other hand, with more than 20 capacity pools, you have created more than 20 different ways in which capacity may be taken back from you, although each loss would affect a smaller proportion of your workload capacity. To provide ready-built automation for working out these pools, AWS offers Spot Fleets at no additional cost. These can find the most cost-effective capacity across multiple Spot capacity pools. A Spot Fleet can also include On-Demand Instances, so you can build in a fallback for a terminated Spot Instance.
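The trade-off behind the 4-to-20 guideline can be seen with a little arithmetic. The Python sketch below assumes that capacity is spread evenly across pools and that the operating system and network platform are held fixed, so each pairing of instance type and Availability Zone counts as one pool. The instance types and zones listed are examples only.

```python
from itertools import product
from typing import List, Sequence, Tuple


def capacity_pools(
    instance_types: Sequence[str], zones: Sequence[str]
) -> List[Tuple[str, str]]:
    """One pool per (instance type, Availability Zone) pair.

    Assumption: operating system and network platform are the same for all.
    """
    return list(product(instance_types, zones))


def share_lost_if_one_pool_reclaimed(pool_count: int) -> float:
    """Fraction of total capacity lost when a single pool is reclaimed,
    assuming instances are spread evenly across the pools."""
    return 1 / pool_count


# Illustrative choices, not a recommendation.
pools = capacity_pools(["r4.large", "r5.large", "m5.large"],
                       ["us-east-1a", "us-east-1b"])

for count in (1, 4, len(pools), 20):
    share = share_lost_if_one_pool_reclaimed(count)
    print(f"{count:>2} pools: one reclaimed pool removes {share:.0%} of capacity")
```

With a single pool, one reclamation removes everything. At 4 pools the loss is a quarter of capacity and at 20 it is a twentieth, but there are then 20 separate pools in which a reclamation can occur.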
The Cloud Business Office Approach
So what can you do to determine whether your enterprise can play the Spot Instance game and realize some substantial savings on the IT budget?
If you are not sure of the answer, that may be why your organization isn’t taking advantage of Spot Instance savings. One of the main reasons enterprises miss out on such huge opportunities to cut costs is that the decisions must necessarily involve a range of individuals and groups, from finance to cloud architects to application owners to operations teams. Plus, these decisions need to be regularly reviewed and updated by all concerned parties. That is why CTP recommends that organizations operating in the public cloud establish a Cloud Business Office (CBO). A CBO serves as the central point of decision making and communications for your cloud program. It is an ephemeral (although sometimes permanent) operational and governing body that directs and guides all aspects of your cloud program, from the first implementation through ongoing operations and management. The core team of the CBO is relatively small, but it coordinates the cloud efforts of many different groups. CBO members oversee the bigger picture and are therefore ideally positioned to take the initiative to explore, with architects, business owners, finance and operations, just how costs can be managed. Without doubt, one of the CBO’s discussion topics should be: how can we make use of Spot Instances to dramatically reduce our cloud costs?