An effective disaster recovery plan assumes that the worst can, and will, happen. Organisations face a growing number of risks, from natural disasters and power and network failures, to human error, civil disturbance, public health emergencies, and cyber threats.
IT failures are only one of the risks CIOs need to plan for. And while disasters such as floods or fire rightly attract the board’s attention, it is cyber threats that have pushed disaster recovery (DR) planning back up the agenda.
Ransomware, in particular, has forced organisations to look again at strategies for backup and data protection.
A survey carried out in spring 2019 on behalf of Sungard AS, a disaster recovery and business continuity supplier, found that 53% of senior managers believed a cyber attack was the most likely thing that would disrupt their business. This came ahead of IT outages (36%) and network failures (24%).
Regardless of the cause, downtime is expensive. Sungard AS’s research found that the average cost to a business of unplanned downtime was just over £1.4m. It also found that as many as 70% of managers believe they need to spend more on business continuity. But no amount of spending will help unless it is backed by an effective plan.
Step 1: What doesn’t hurt you, makes you stronger: Business risk analysis
The first stage in any disaster recovery project should be to assess the risks facing the organisation. Managers should link risk assessments to a business impact analysis. Only by looking at risk and impact together can the board set the organisation’s priorities and decide on the type of protection measures needed.
Some risks will be so great, and the impact so high, that only a formalised business continuity plan will reduce them. For others, a staged recovery plan might be acceptable. And some might best be covered by insurance.
One example is planning for cyber threats, where businesses have invested in: perimeter security to ensure continuity; a backup and recovery plan to protect data, including against malware; and cyber insurance to cover the most serious incidents.
But a good disaster recovery plan goes further and considers threats ranging from disrupted access to buildings, which can be caused by something as mundane as a burst water main, to staffing disruption from public transport problems or even an outbreak of winter flu. Organisations also need to consider supply chain risks. A supplier is likely to have its own business continuity arrangements, but its priorities and recovery objectives might not match yours.
The key is not to try to protect against every threat, but to have the most comprehensive picture possible of the risks facing the business and an understanding of their likelihood, how deeply they affect the business, and how long it would take to recover from them.
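A risk register of this kind can be sketched in a few lines of Python. The 1-to-5 scales, the scoring rule (likelihood multiplied by impact) and the sample entries below are illustrative assumptions rather than figures from any survey. The point is only that likelihood, impact and recovery time are recorded side by side and then ranked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int       # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int           # assumed scale: 1 (minor) to 5 (severe)
    recovery_days: float  # rough estimate of time needed to recover

    @property
    def score(self) -> int:
        # Simple likelihood x impact score; other weightings are possible.
        return self.likelihood * self.impact


def rank_risks(risks):
    """Sort highest score first; ties go to the longer recovery time."""
    return sorted(risks, key=lambda r: (r.score, r.recovery_days), reverse=True)


# Hypothetical entries, loosely based on the threats discussed above.
register = [
    Risk("Ransomware attack", likelihood=4, impact=5, recovery_days=10),
    Risk("Burst water main blocks office access", likelihood=2, impact=3, recovery_days=2),
    Risk("Winter flu outbreak", likelihood=3, impact=2, recovery_days=5),
    Risk("Key supplier outage", likelihood=2, impact=4, recovery_days=7),
]

for risk in rank_risks(register):
    print(f"{risk.score:>2}  {risk.name} (about {risk.recovery_days} days to recover)")
```

A ranking like this is a starting point for discussion with the board, not a substitute for it: the numbers are only as good as the estimates behind them.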
Step 2: If it can go wrong, it will: Breaking down IT risks
IT failures remain a significant source of outages. Industry analyst IDC calculates that half of organisations would not survive an outage that takes down their central IT systems “for an extended time”. But it is not easy to predict which parts of a system could fail, or what the impact of a failure would be.
CIOs should take the same approach to IT risks as they do to environmental, human or infrastructure risks. Experts should examine the likelihood of failure across all components of core systems, whether these are on-premise, outsourced or in the cloud.
IT teams should not just look at hardware, but at the risks posed by data loss and data corruption, including through cyber attacks or malware, and of application unavailability. They should then be able to rank systems in terms of criticality, and how easily they can be restored or recovered.
Recent events in banking and the airline industry show how often a relatively inexpensive or simple component, such as part of a network, causes a much bigger problem.
A key part of the process is to identify these single points of failure. But CIOs also need to have a plan for an orderly recovery and, strange as it might sound, orderly failure. The IT part of a business recovery plan should cover the process for shutting down systems, for example in the event of power failure or cyber attack. That way, IT teams know how they should prioritise their own resources during and after an incident.
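One rough way to start looking for single points of failure is to list each component, how many independent instances of it exist, and which systems depend on it. The sketch below assumes a hand-maintained inventory; the component and system names are hypothetical, and a real estate would need a far richer dependency model.

```python
# Hypothetical inventory: component -> number of independent instances.
instances = {
    "core-switch": 1,
    "db-cluster": 2,
    "payment-gateway": 1,
    "web-servers": 4,
}

# Hypothetical dependency map: system -> components it cannot run without.
depends_on = {
    "online-banking": ["core-switch", "db-cluster", "web-servers"],
    "card-payments": ["core-switch", "payment-gateway"],
    "intranet": ["web-servers"],
}


def single_points_of_failure(instances, depends_on):
    """Map each non-redundant component to the systems that stop if it fails.

    A component missing from the inventory is treated as non-redundant,
    on the assumption that an unknown is safer flagged than ignored.
    """
    spofs = {}
    for system, components in depends_on.items():
        for component in components:
            if instances.get(component, 0) <= 1:
                spofs.setdefault(component, []).append(system)
    return spofs


for component, systems in single_points_of_failure(instances, depends_on).items():
    print(f"{component}: takes down {', '.join(systems)}")
```

In this made-up example, the single core switch is the inexpensive part that would take down two business systems at once.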
Step 3: Through the window: Setting recovery objectives
Business impact analyses and risk assessments will, in turn, set the key metrics for the recovery plan. This includes an understanding of acceptable periods of downtime, and their cost – something that can only be calculated in discussion with the business.
The IT team needs to set a recovery point objective (RPO) and recovery time objective (RTO) for each key system and, if needed, for each key component. The RTO is how long the business has to recover – the recovery window – and the RPO sets how far back the organisation may have to go in recovering data, and so how much recent data it can afford to lose.
These metrics will differ from system to system, and can sometimes be hard to reconcile with each other. An organisation with large volumes of valuable historic data might struggle with a short RTO, so data recovery will need to be tiered.
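As a minimal sketch, the two objectives can be checked against what the current backup arrangements actually deliver. It assumes that worst-case data loss is roughly equal to the interval between backups, and that the restore time is an estimate taken from testing. The system names and figures are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SystemPlan:
    name: str
    rpo: timedelta                # maximum tolerable data loss, expressed as time
    rto: timedelta                # maximum tolerable downtime
    backup_interval: timedelta    # how often a recoverable copy is taken
    estimated_restore: timedelta  # restore time, ideally measured in a test


def objective_gaps(plan):
    """Return which objectives ("RPO", "RTO") the current set-up would miss."""
    gaps = []
    # If a failure hits just before the next backup, everything since the
    # previous one is lost, so the interval must not exceed the RPO.
    if plan.backup_interval > plan.rpo:
        gaps.append("RPO")
    if plan.estimated_restore > plan.rto:
        gaps.append("RTO")
    return gaps


# Hypothetical systems with made-up figures.
orders = SystemPlan("order-processing", rpo=timedelta(minutes=15), rto=timedelta(hours=4),
                    backup_interval=timedelta(hours=1), estimated_restore=timedelta(hours=2))
archive = SystemPlan("historic-archive", rpo=timedelta(hours=24), rto=timedelta(hours=8),
                     backup_interval=timedelta(hours=24), estimated_restore=timedelta(hours=36))

for plan in (orders, archive):
    print(plan.name, objective_gaps(plan) or "meets both objectives")
```

Here the archive meets its RPO but not its RTO, which is the situation described above: a large volume of historic data that cannot be restored inside the window and so needs to be tiered.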
These, in turn, are tied into measures to ensure resilience and application availability. Industry experts point to ever-narrower recovery windows, as businesses – driven by their customers – become less and less tolerant of downtime.
Only a few businesses can afford “five nines” availability for all systems, so their disaster recovery plan is likely to consist of resilience, availability and business continuity measures, along with backup and recovery strategies and a degree of managed failure.
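To see why so few can afford it, an availability target can be converted into the downtime it allows. The short calculation below assumes a 365-day year and counts all downtime, planned or unplanned, against the target.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows {minutes:.1f} minutes of downtime a year")
```

On these assumptions, "five nines" leaves a little over five minutes of downtime a year, against more than three and a half days at 99%.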
This might include contingency plans ranging from staff working from home using cloud-based applications and mobile phones, through to access to high-end business continuity locations. Fortunately, cloud-to-cloud backup of application data and backup of on-premise data to the cloud are both helping businesses of all sizes to become more resilient.
Step 4: Command and control: The response strategy
IT is no use unless there are people available to operate it or use it. Disaster recovery is the archetypal “people, process and technology” challenge.
Unless the organisation is small enough – or the outage brief enough – to get by on cloud-based services and through remote working, the business will need to consider alternative working locations and how to move staff and technology there.
If the outage affects a datacentre and systems fail over to a secondary site, IT will need to work to restore the primary location or find a new one, as well as ensure that the failover site, now the only one running, is backed up too. Even seemingly trivial matters, such as issuing new ID cards, can become significant problems in an emergency.
The main way to contain a disaster, and to ensure effective recovery, is to maintain good communications. The business should, in advance, appoint a person to lead the disaster response. This need not be the CIO and probably should not be the CEO. This “gold commander”, to borrow a police term, does not have to be the person who wrote the DR plan, but does need to be familiar with it.
The disaster response team should include experts from outside IT, including HR, legal and corporate communications, as well as representatives from business operations. Crucially, the team should have a way to communicate in an emergency and, ideally, take part in any DR exercises.
Step 5: Fail to plan, plan to fail: Testing the response plan
Disaster recovery or business continuity exercises can be disruptive, but effective DR plans need to be tested, reviewed and updated. Almost the worst thing a business can do is invest in a DR plan, then leave it on a shelf.
“Firms may have written plans and procedures, but they may not be practical or widely known and aren’t actually then applied in a crisis,” says Samuel Ingrey, a disaster recovery specialist at PA Consulting.
“Firms need a clear decision-making structure and playbooks that have been agreed and refined through practice and testing, and easy-to-understand approaches like a gold, silver and bronze command structure. These are of more practical use to firms during a disaster than a detailed 100-page manual.”
It is only by testing that a firm will know whether the plan works, and whether it is resilient enough to perform under pressure. Simulation, and testing the communications systems, is the best way to expose any weaknesses. Teams can then feed insights gained from the testing phase back into the risk assessment and business impact analysis, fine-tuning the plan as they go.