Backup and disaster recovery (DR) services and planning are not particularly exciting, but they will certainly lead to excitement if appropriate data security and robustness plans have not been prepared. When the topic of DR arises, many people (even those who should know better!) still think of tapes that have to be rotated and couriered to an offsite storage facility. It is true that there still are environments where this approach to backup and recovery is used, but clearly there now are far more efficient, reliable and cost-effective backup and DR options, especially when it comes to cloud-based storage.
Around this topic, a number of terms are often used interchangeably, when in fact they have separate and distinct meanings. So it is a good idea to review them.
An Introduction to Backup/Restore, Disaster Recovery and High Availability in the Cloud
Backup and Restore
This term is primarily focused on the data itself, not applications. Solutions tend to rely on robust object storage (e.g., Amazon S3 or Azure Blob Storage). They are often based on a layered approach of “warm” data (a smaller amount of data that can be restored quickly) plus “cold” data (larger amounts of archived data that are cheaper to store, but take longer and cost more to retrieve).
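As a rough illustration of the layered approach, the sketch below assigns a backup to a tier based on its age. The 30-day warm window and the name `choose_tier` are assumptions made for this example only; real retention windows depend on your restore-time and cost requirements.

```python
from datetime import date, timedelta

# Assumption for illustration: backups newer than this stay in the "warm" tier.
WARM_WINDOW_DAYS = 30


def choose_tier(backup_date: date, today: date,
                warm_window_days: int = WARM_WINDOW_DAYS) -> str:
    """Return "warm" for recent backups and "cold" for older ones."""
    if backup_date > today:
        raise ValueError("backup_date cannot be in the future")
    age = today - backup_date
    # Recent backups stay quickly restorable; older ones move to cheaper archive storage.
    return "warm" if age <= timedelta(days=warm_window_days) else "cold"


if __name__ == "__main__":
    today = date(2024, 6, 30)
    for taken in (date(2024, 6, 25), date(2024, 1, 15)):
        print(taken.isoformat(), "->", choose_tier(taken, today))
```

In practice this kind of rule is usually expressed as a lifecycle policy on the object store rather than in application code; the function only makes the decision explicit.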
Disaster Recovery
This term is application focused, requiring the consideration of both the data layer and application stack. Designs should prepare for a range of failure possibilities. Solutions vary widely depending on how quickly the application must return to service after a failure (Recovery Time Objective, or RTO), and how much application data loss, measured in time, can be tolerated (Recovery Point Objective, or RPO). Note that there is no “best” option, only options that are better suited for the specific requirements of each business case. These options include:
- Cold restoration, where no alternative failover resources are kept running. This requires a complete rebuild of the application and data environment (preferably via automation). This design is the cheapest, but takes the longest to execute.
- A pilot light system, where a minimal duplicate system is prepared and ready, but only the bare minimum services are up and running until a failure. This approach is the next most expensive, but is quicker to return to service than a cold restoration.
- A warm standby system, where a complete running environment is ready to take over, but the resources are downsized, and expanded to full production capacity only when needed. It is more expensive than either of the first two options, but is able to restore a minimal service level quickly.
- A hot standby system, where a complete, fully resourced duplicate environment is ready to take over for the primary at all times. This is the most expensive and most complex system to design, but enables the shortest possible RTO.
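The trade-off between these four options can be sketched as a simple selection: pick the cheapest design whose recovery time still meets the required RTO. The RTO and cost figures below are illustrative placeholders, not benchmarks, and the names `Strategy`, `STRATEGIES` and `cheapest_strategy` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto_minutes: float  # assumed value for illustration only
    relative_cost: float        # cold restoration = 1.0; assumed value


# Ordered from cheapest to most expensive, matching the list above.
STRATEGIES = [
    Strategy("cold restoration", 24 * 60, 1.0),
    Strategy("pilot light", 4 * 60, 2.0),
    Strategy("warm standby", 30, 3.5),
    Strategy("hot standby", 1, 5.0),
]


def cheapest_strategy(required_rto_minutes: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy that meets the RTO, or None if none does."""
    candidates = [s for s in STRATEGIES
                  if s.typical_rto_minutes <= required_rto_minutes]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.relative_cost)


if __name__ == "__main__":
    choice = cheapest_strategy(8 * 60)  # business can tolerate eight hours of downtime
    print(choice.name if choice else "no listed strategy meets this RTO")
```

A fuller model would also filter on RPO, since a design that restarts quickly may still lose more data than the business can accept.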
High Availability
This is not actually a backup term, but the concept does overlap significantly with disaster recovery. At its most basic, high availability is concerned with designing and maintaining business solutions that must support very small RTOs (typically measured in minutes or seconds), and RPOs as small as zero (i.e., operational data must be current at all times, even during failures). The line between hot standby designs and high availability implementations gets blurry as required RTOs and RPOs approach zero. Typically, a high availability solution will include live, redundant synchronization of data, as well as a separate implementation for longer term backup and archive. Consequently, these implementations can be very expensive.
Business Continuity
This term takes into account all the technical concerns described in the previous scenarios, but also addresses a full range of non-technical practices and processes, such as communications plans, regulatory notifications, natural disaster safety practices, etc. Business continuity can be a huge undertaking, and is typically a cross-departmental effort with many resources devoted to it. It is usually not a project managed by a pure-play technology vendor.
BUaaS (Backup as a Service)
This is the term for an offering which provides backup services similar to what is described above under Backup and Restore. In a typical scenario, BUaaS provides offsite (usually cloud-based) backup for on-premises or data center-housed systems, protecting critical data against loss. There are vendors who specifically provide this service, but complete solutions can also be architected using cloud-native services such as AWS Storage Gateway.
DRaaS (Disaster Recovery as a Service)
This is the term for an offering which follows most of the principles described for general disaster recovery. However, similar to BUaaS, DRaaS is focused on providing a cloud-based environment that becomes available after the failure of an on-prem or data center-based application environment.
Key Planning Points
As with strategizing for most essential IT capabilities, with DR it is critically important to define your business needs before selecting a vendor and/or designing a solution. Here are some important aspects to consider when planning a solution.
- Determine your RTO and RPO. Among key criteria, these are by far the most critical. No considerations will have more impact on your design (and its cost!). For example, hot standby solutions can be 5x the cost of cold recovery options, or even more. Because the cost of implementation of different RTO/RPO scenarios is frequently a factor in deciding the exact RTO/RPO standard, there is often a need for an iterative process to arrive at the exact requirements and architecture.
- Application data state is extremely important. Understanding the data sources for an application, including read/write patterns and any data synchronization requirements, is a fundamental requirement for a successful DR implementation. Cookie-cutter designs which do not take application data state into account are highly unlikely to succeed, so be sure to confirm that application data state is clearly understood and built into backup and DR architectures.
- Cloud-based backup or DR designs for on-premises architectures, while using principles similar to cloud-native environments, will have significantly different implementation architectures. This is because of latency issues, ingress and egress bandwidth, transfer costs and possibly differing data security requirements between on-prem and cloud. Be sure to anticipate these potential differences and make sure they are considered during architecture design and selection.
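To see how bandwidth interacts with RTO, a back-of-the-envelope estimate of restore time is often enough to rule a design in or out. The sketch below assumes decimal units (1 GB = 8,000 megabits) and an assumed 80% usable share of nominal bandwidth; `transfer_hours` and `fits_rto` are hypothetical helper names, and the estimate ignores latency, throttling and any rebuild time after the data arrives.

```python
def transfer_hours(data_gb: float, bandwidth_mbps: float,
                   efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move data_gb over a bandwidth_mbps link.

    efficiency is the assumed usable fraction of the nominal bandwidth.
    """
    if data_gb < 0 or bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid data size, bandwidth or efficiency")
    megabits = data_gb * 8 * 1000  # decimal units: 1 GB = 1000 MB = 8000 megabits
    seconds = megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


def fits_rto(data_gb: float, bandwidth_mbps: float, rto_hours: float,
             efficiency: float = 0.8) -> bool:
    """Check whether the data transfer alone can complete within the RTO."""
    return transfer_hours(data_gb, bandwidth_mbps, efficiency) <= rto_hours


if __name__ == "__main__":
    hours = transfer_hours(2000, 1000)  # 2 TB over a nominal 1 Gbps link
    print(f"Estimated transfer time: {hours:.1f} hours")
    print("Fits a 4-hour RTO:", fits_rto(2000, 1000, 4))
```

If the transfer time alone exceeds the RTO, the options are more bandwidth, less data to move (for example by keeping a warm copy closer to the recovery environment), or a revised RTO, which is the iterative process described above.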
Optimal data backup, restore and disaster recovery designs require significantly more effort and detail than what is described in this short article. However, keeping in mind these basic definitions and principles will go a long way toward ensuring you start your planning with critically important concepts, and end up with solutions that are cost-effective, performant and truly aligned with your business requirements.