By now I am sure you have seen dozens of articles, thousands of tweets, and continued commentary on the S3 outage that AWS suffered on February 28th. It disrupted thousands of highly trafficked websites, while sites such as Netflix stayed up thanks to their design practices and their use of the redundancy features of the AWS platform. Cloud outages are uncommon, but they do occur, just as outages occur in data centers running traditional IT systems. What’s the difference? With AWS, Azure, and Google you can readily design and deploy robust services that survive these inevitable outages and fail over gracefully.
AWS, Google, and Azure have architected their environments and services in a loosely coupled model so that failures affect discrete services. This enables application architects to deploy applications that consume those discrete services and quickly migrate to alternative services and locations when failures occur. Avoiding monolithic systems is a key design principle in the cloud, and it should be applied to every deployment so that organizational availability goals are met regardless of the underlying service being consumed.
When planning availability for traditional data centers, recovery point objective (RPO) and recovery time objective (RTO) are common tools for clearly identifying the level of outage and data loss a company can sustain before incurring losses. RPO and RTO enable organizations to plan spend effectively against potential risks to business IT systems. Organizations with highly distributed user bases often also measure the accessibility base (AB) to understand impact. All of these measures must then be weighed against the financial cost (FC) of an outage to the organization. These concepts are just as applicable in the cloud as they are in the data center, yet they are discussed far less often as design points.
- Recovery Point Objective – The maximum window of acceptable data loss during a failure. Different applications have different tolerances for data loss: some are measured at the transaction level and demand 100% integrity for every transaction, while others can function on a subset of data and measure loss in terms of the accuracy of their output.
- Recovery Time Objective – The length of time an organization can tolerate an application being entirely inaccessible. This is often the time an automated recovery action needs to bring the application back online in an alternative location.
- Accessibility Base – The percentage of users who can access an application. In today’s highly distributed environments, applications are often inaccessible to only a portion of users due to networking or other connectivity problems; this measure captures how wide an outage is in terms of the user population affected.
- Financial Cost – Alongside the metrics for service availability and recovery, the cost of providing different levels of availability is key. Delivering very high availability is not a zero-cost activity, and that cost must be justified by confirming the business actually needs the targeted level of availability. Many organizations calculate the cost of application downtime, whether in idle employees or lost revenue, and ensure that the cost of redundancy does not exceed it; a simple version of that comparison is sketched after this list.
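To make the financial-cost tradeoff concrete, here is a minimal sketch of the downtime-versus-redundancy comparison described in the last bullet. Every figure in it is illustrative, not a benchmark:

```python
# Hypothetical cost model: weigh the cost of redundancy against the
# expected cost of downtime. All figures below are illustrative.

def expected_downtime_cost(outage_hours_per_year: float,
                           revenue_per_hour: float,
                           idle_labor_per_hour: float) -> float:
    """Expected annual cost of downtime: lost revenue plus idle employees."""
    return outage_hours_per_year * (revenue_per_hour + idle_labor_per_hour)

# Example: a single-region deployment expected to be down 8 hours/year.
single_region = expected_downtime_cost(8.0, 25_000, 5_000)  # $240,000

# A multi-region deployment cuts expected downtime to 1 hour/year,
# but adds an assumed $150,000/year in infrastructure and operations spend.
multi_region = expected_downtime_cost(1.0, 25_000, 5_000) + 150_000  # $180,000

# Redundancy is justified only when it costs less than the downtime it avoids.
print(f"single-region: ${single_region:,.0f}  multi-region: ${multi_region:,.0f}")
```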
Once these measures of availability are understood, it is important to leverage key design patterns and supporting capabilities within the public cloud to meet RPO, RTO, AB, and FC targets. These design considerations are application-level decision points, each with costs and tradeoffs that affect operations and availability:
- Use of Multiple Regions – All three major cloud providers have the concept of “regions”: physically separate data centers with independent power sources, networking, HVAC, and software deployment lifecycles. Regions are logically and physically isolated from one another to minimize disruption when one region experiences a significant failure. Applications with the most stringent RPO and RTO targets should span multiple regions, with automated methods for ensuring recovery and availability (a minimal failover sketch follows this list).
- Use of Multiple Availability Zones (AZ) – Within AWS and Google (which simply calls them zones), the concept of an availability zone also applies. AZs are separate data centers within the same geographical region; they protect against power and networking failures but still share regional risks such as severe weather or physical incidents. For applications whose RPO demands high transaction integrity, AZs provide low-latency connectivity for replicating and protecting transactions (see the multi-AZ sketch after this list).
- Data Replication – The method of data replication is critical to meeting RPO and RTO objectives. Many database-driven applications require transactions to commit before a workflow can begin, complete, or trigger other processes. Native cloud services such as AWS S3 and Azure Storage provide cross-region replication, but databases and other data stores built on block-attached storage may require third-party tools such as Attunity to ensure transaction-level integrity across AZs or regions. For services with a near-zero RPO, high-speed third-party data replication is a must (a replication configuration sketch follows this list).
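To illustrate the multi-region pattern from the first bullet above, here is a minimal client-side failover sketch. The endpoint URLs are hypothetical, and in production this decision typically lives in DNS (for example, Route 53 health checks) rather than in application code:

```python
import requests

# Hypothetical endpoints for the same application deployed in two regions.
ENDPOINTS = [
    "https://app.us-east-1.example.com/health",
    "https://app.us-west-2.example.com/health",
]

def first_healthy_endpoint(timeout: float = 2.0) -> str:
    """Return the first region endpoint that answers its health check.

    Probe the primary region, fail over to the next region on error or
    timeout. This is the same logic a DNS-based health check applies.
    """
    for url in ENDPOINTS:
        try:
            resp = requests.get(url, timeout=timeout)
            if resp.status_code == 200:
                return url
        except requests.RequestException:
            continue  # region unreachable; try the next one
    raise RuntimeError("no healthy region available")
```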
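For the multi-AZ pattern, a common approach on AWS is to let an Auto Scaling group spread instances across zones so the loss of a single data center does not take the application down. This sketch assumes a launch configuration named web-tier-lc already exists; the group name and AZ list are placeholders:

```python
import boto3

# Spread an Auto Scaling group across three AZs in a single region.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-tier",
    LaunchConfigurationName="web-tier-lc",  # assumed to exist already
    MinSize=3,
    MaxSize=9,
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
)
```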
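And for the native cross-region replication mentioned in the last bullet, here is a sketch of enabling S3 cross-region replication with boto3. The bucket names and IAM role ARN are placeholders, and both buckets must already have versioning enabled:

```python
import boto3

# Enable S3 cross-region replication on a versioned source bucket.
s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [{
            "ID": "replicate-everything",
            "Prefix": "",  # empty prefix: replicate all objects
            "Status": "Enabled",
            "Destination": {"Bucket": "arn:aws:s3:::destination-bucket"},
        }],
    },
)
```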
Regions, AZs, and data replication are tools that enable architects and application teams to build highly resilient applications aligned with the RPO and RTO needs of the business. Design for availability must happen at the application level: store all state in persistent data stores, and build modular applications that can continue to provide reduced functionality during a service outage, informing users of that reduced functionality through common error messages while gracefully moving between regions and AZs. A minimal sketch of this degradation pattern follows.
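In this sketch, the dependency client is a hypothetical stand-in that randomly fails to simulate an outage; the application answers with reduced functionality and a common user-facing message rather than a hard failure:

```python
import random

class RecommendationClient:
    """Stand-in for a real service client; fails to simulate an outage."""
    def fetch(self, user_id: str) -> list:
        if random.random() < 0.5:  # simulate the dependency being down
            raise ConnectionError("recommendation service unreachable")
        return [f"item-{n}" for n in range(3)]

recommendations = RecommendationClient()
SERVICE_UNAVAILABLE_MSG = "Recommendations are temporarily unavailable."

def get_recommendations(user_id: str) -> dict:
    """Return recommendations, degrading gracefully if the dependency fails."""
    try:
        return {"items": recommendations.fetch(user_id), "degraded": False}
    except ConnectionError:
        # Dependency is down: serve the page with reduced functionality
        # and a common, user-visible message instead of an error page.
        return {"items": [], "degraded": True,
                "message": SERVICE_UNAVAILABLE_MSG}
```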
Whether an application consumes IaaS or PaaS, the goal is the same: a solid user experience despite failures. However, IaaS and PaaS services have different levels of redundancy built into their designs. For example, AWS S3 is deployed at the region level, so users do not need to plan for S3 availability across AZs. This built-in availability varies by service and cloud provider, so understanding the differences is key.
As with traditional disaster recovery (DR), testing is key to ensuring that services and applications behave as expected during failure scenarios. Quarterly DR testing does not go away just because applications have migrated to the cloud; rather, it takes on new forms of continuous testing and incident investigation by application teams. Tools such as Netflix’s Chaos Monkey enable teams to inject failures into cloud environments to verify that applications respond as designed and that RPOs and RTOs can be met under a variety of failure scenarios. A minimal sketch of this style of failure injection follows.
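Chaos Monkey itself is a Netflix OSS project; as a minimal illustration of the same idea, this sketch terminates one random running instance from a pool marked with a hypothetical chaos-eligible tag. It should only ever run against test environments:

```python
import random
import boto3

# Chaos-style failure injection: pick one random running instance from a
# target pool (identified here by a hypothetical "chaos-eligible" tag),
# terminate it, then verify the application still meets its RPO/RTO.
ec2 = boto3.client("ec2", region_name="us-east-1")

reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-eligible", "Values": ["true"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    print(f"terminating {victim} to test failover behavior")
    ec2.terminate_instances(InstanceIds=[victim])
```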
In data center-centric deployments we regularly employ tools such as data replication, load balancers, redundant power, and UPSes to support expected service levels. These availability considerations have not gone away in the cloud; they have shifted to a service-consumption model in which applications are decoupled so that they can be deployed across AZs and regions. When a service or region fails, well-designed applications continue to serve traffic and meet organizational RPO and RTO objectives.