In the wake of recent events on the East Coast of the United States, disaster recovery (DR) planning has reared its head again. Of course, it’s a bad time to think about disaster recovery right after an event with such a large impact. However, it’s even worse to never think about it.
Prior to working at Puppet Labs, I spent a lot of time on disaster recovery. For nearly two years, I led a team designing multi-site replication, creating reference architectures for availability and recovery, and selling our business partners on disaster recovery investments. This was for one of the top-performing business units at a Fortune 100 company, with seven- and eight-figure budgets for DR.
Disaster recovery is a huge proposition. It’s costly, time consuming, difficult to test correctly and often the first thing cut when doing budget reviews. DR planning is also never complete. You evolve. You change. Your plans need to as well.
The starting points for DR planning can be difficult to find. Infrastructure engineers often jump to technical solutions. Before you figure out the newest whiz-bang storage replication and failover technology, take a step back.
Identify your unit of business
What is the unit of business worth? If you can’t answer this question, the rest of this plan is built on a weak foundation. What is a unit of business? That could be a sale of a new TV, shipping a car part, having software downloaded, or even making sure grandma can see pictures of cats shared on the internet. In any case, you need to know what business process you’re striving to protect.
Identify the cost of downtime
Once you’ve defined a unit of business, find the costs. What does downtime cost for that service? Are there workarounds? For example, if I need to buy stamps, I can go to my local post office. If it is closed (unavailable), I can go to my local grocery store and buy them there. So while it’s a minor inconvenience that I couldn’t purchase stamps at the post office, I still end up with the product, and the USPS loses little in revenue or reputation.
Other processes can’t be worked around without significant costs. If a major credit card firm goes offline for minutes, millions of dollars are lost. People in the process of buying will just switch their payment to the next card in their wallet, thus causing a great loss of revenue (and probably interest).
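To make the comparison concrete, here is a minimal sketch of a downtime cost estimate. The `BusinessUnit` class, its field names, and every dollar figure are hypothetical, chosen only to illustrate the stamp and credit card examples above; real numbers have to come from the owners of the business process.

```python
from dataclasses import dataclass


@dataclass
class BusinessUnit:
    """A unit of business and what an outage costs (all figures hypothetical)."""
    name: str
    revenue_per_hour: float     # revenue normally earned per hour
    workaround_recovery: float  # fraction of revenue a workaround still captures (0.0 to 1.0)


def downtime_cost(unit: BusinessUnit, outage_hours: float) -> float:
    """Estimate revenue lost during an outage, net of any workaround."""
    if not 0.0 <= unit.workaround_recovery <= 1.0:
        raise ValueError("workaround_recovery must be between 0 and 1")
    lost_fraction = 1.0 - unit.workaround_recovery
    return unit.revenue_per_hour * outage_hours * lost_fraction


# Assumed numbers: most stamp buyers find another outlet, card users do not wait.
stamps = BusinessUnit("stamp sales", revenue_per_hour=200.0, workaround_recovery=0.9)
cards = BusinessUnit("card payments", revenue_per_hour=6_000_000.0, workaround_recovery=0.0)

print(round(downtime_cost(stamps, 2.0), 2))   # two-hour closure
print(round(downtime_cost(cards, 0.25), 2))   # fifteen-minute outage
```

This only counts direct revenue. Reputation damage, contractual penalties and recovery labor would be additional terms in a fuller model.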
Knowing the process, revenue impact and workaround options, you can start to talk in terms of DR technology planning. Enter the terms RTO and RPO.
RTO & RPO: Recovery Time & Recovery Point
RTO: Recovery Time Objective is the amount of time you are willing to take to recover services. This could be weeks, days, hours or seconds. Do not make assumptions on this. Ask the owners of the business process. Their expectations do not always require a superhero. If they do, they should have ample justification.
RPO: Recovery Point Objective is the state of the process (and probably data) you wish to recover to. This could be start of business day, start of fiscal quarter, the last hour, or the last transaction. Anything that happened after that point is lost.
By tweaking the RTO and RPO, you really get a sense of the investment you’ll have to make in DR. This is also where mistakes commonly pop up. I found myself explaining RTO/RPO in every meeting with anybody about DR.
When evaluating technical solutions for meeting RTO and RPO, costs can change quickly. For example, recovering from last night’s backup is a fairly cheap solution. The RTO could be 8 hours to restore a backup onto another system, assuming you’re not in a resource contention race on your backup infrastructure. In this case however, your RPO is last night’s backup. If that’s good enough, super, you’ve solved a DR problem for minimal investment.
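The backup example can be checked mechanically. The sketch below assumes one backup every 24 hours and the 8-hour restore mentioned above; the function names and the two sample sets of objectives are hypothetical. The key point is that with periodic backups the worst-case data loss is a full backup interval, because a failure can happen just before the next backup runs.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses one full interval of work."""
    return backup_interval


def meets_objectives(
    backup_interval: timedelta,
    restore_time: timedelta,
    rto: timedelta,
    rpo: timedelta,
) -> bool:
    """True if a periodic-backup strategy satisfies both objectives."""
    recovers_in_time = restore_time <= rto
    loses_little_enough = worst_case_data_loss(backup_interval) <= rpo
    return recovers_in_time and loses_little_enough


nightly = timedelta(hours=24)  # assumed: one backup per night
restore = timedelta(hours=8)   # the 8-hour restore from the text

# A process that tolerates a day of lost data and an 8-hour outage.
print(meets_objectives(nightly, restore, rto=timedelta(hours=8), rpo=timedelta(hours=24)))    # True
# A process that must be back in 1 hour with at most 15 minutes of lost data.
print(meets_objectives(nightly, restore, rto=timedelta(hours=1), rpo=timedelta(minutes=15)))  # False
```

The restore time here is a fixed input. In practice it depends on data volume and on contention for the backup infrastructure, so it is worth measuring rather than assuming.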
It’s quite likely that you have stricter RTOs and RPOs than an 8-hour restore from last night’s backup can deliver. This is where the recovery model needs to be built into the architecture. Horizontally scaling with shared-nothing architectures across sites is normally the holy grail for low (near-zero) RPO/RTO. If your applications can fit into this type of architecture, you should be using it. It can be expensive, as you have to have enough capacity to lose a site and still run.
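The capacity cost of "lose a site and still run" is simple arithmetic. This sketch assumes equally sized sites and load that redistributes evenly across the survivors, which real deployments may not satisfy; the function names and the load figure are hypothetical.

```python
def per_site_capacity(peak_load: float, sites: int, sites_lost: int = 1) -> float:
    """Capacity each site needs so the survivors can carry the full peak load."""
    surviving = sites - sites_lost
    if surviving < 1:
        raise ValueError("at least one site must survive")
    return peak_load / surviving


def overprovision_ratio(sites: int, sites_lost: int = 1) -> float:
    """Total provisioned capacity divided by peak load."""
    surviving = sites - sites_lost
    if surviving < 1:
        raise ValueError("at least one site must survive")
    return sites / surviving


peak = 900.0  # assumed peak load, e.g. requests per second

for n in (2, 3, 4):
    print(n, per_site_capacity(peak, n), round(overprovision_ratio(n), 2))
```

With two sites, each one must be able to carry the whole load, so you pay for double the capacity. Spreading across more sites lowers that overhead, at the price of more sites to operate.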
Other options might include virtualization recovery solutions, bringing up services in a public cloud, automation solutions (like Puppet), or shutting down non-business-critical systems.
Build your disaster recovery plan on the needs of the organization
Another thing to verify with your stakeholders is how they define disaster and recovery. Is it a disaster if one application is down? Is it only if the whole data center has issues? That is for you and your stakeholders to decide. As for recovery, does this mean everything is performing at 100% perfection? Could business happen, but at a slower rate? Would performance degradation be acceptable? Would having some subsystems offline be okay?
Disaster recovery takes a lot of careful planning and education. At this point, I’ve only scratched the surface of the inputs that can help make a successful disaster recovery plan. With this foundation, you can build out important components of a DR plan, from implementation to testing/validation, upgrades and changes in business processes. These topics may be covered more in-depth in future posts, but the details will be somewhat dependent on your infrastructure and organization. There’s no one-size-fits-all technical solution for disaster recovery. To identify the best technical solution for its situation, every organization needs to agree on what it wants recovered and what it’s willing to sacrifice for that recovery.