Also known as the business continuity plan, and rather less well known as the "risk action plan," this is a document whose existence and table of contents are subject to audit - but which, like most data processing control artifacts, doesn't have to bear much resemblance to reality. In theory, of course, it does: after all, the primary control, the user service level agreement, specifies how long data processing has to bring a list of critical applications back on-line, and the risk action plan documents are supposed to describe just how those commitments will be met.
Unfortunately, all of the plans I've reviewed have had one thing in common: a lack of testing, or even testability, under realistic conditions.
In theory, a disaster recovery plan consists of a list of possible disaster scenarios together with a proven method (including staffing and technology) for overcoming the consequences of each one. Typically, therefore, such plans start with a hypothetical event that closes or wrecks the data center, and then focus on who does what, and where, to bring a carefully prioritised list of applications back up as quickly as possible.
In reality, of course, the disasters rarely fit the scenarios, the people listed as responsible for each action are rarely reachable, and the senior managers who get rousted out when the brown stuff hits the fan usually throw the best-laid plans into total chaos by overruling the rule book within minutes of arriving on site.
That disconnect between plans and reality is perfectly normal, and people usually just muddle through, but the abnormal can be even more fun. Two favourite stories:
Everything had been considered, all contingencies covered - except that when an unhappy employee spent $29.95 for a butane torch at Home Depot, disabled the halon system, and then sloshed around some gasoline to really get those rack mounts running hot, it turned out that the only copies the company had of its disaster recovery plans were stored on those servers - along with the readers and encryption keys for the back-up tapes carefully stored off-site.
Worse, the police closed the entire data center to all traffic for about ten days while they conducted their investigation, and the health department refused access for another week because of the chemicals released before and during the fire.
A few years later, a contractor's employee working on the tunnel system two floors below the data center is thought to have unknowingly punctured a gas line sometime before leaving work on a Friday night. The inevitable happened early Sunday morning - turning that Hitachi into just so much shredded metal and taking the disks and on-site tape vault with it to some otherwise unreachable digital heaven.
On Tuesday, messengers arriving at the central organisation's off-site storage facility to pick up Thursday's tapes were turned away - and by late Wednesday local management had got the message: the central agency had put itself in charge of certifying disaster recovery sites, had not certified the Hitachi partner providing standby processing support for the agency, and "quite properly" refused to release the tapes to an uncertified site.
The bottom line message here should be clear: a formal disaster recovery plan of the traditional "if this, then that" style only makes sense if you can count on being able to control both the timing and the nature of the disaster - and doesn't if you can't. In other words, the only things that are really predictable about data center recovery are that the plan won't apply to what actually happens, the recovery process will take longer and cost more than expected, and the whole thing will be far more chaotic and ad hoc than anyone ever wants to admit afterward.
So what do you do instead? That's tomorrow's topic, but here's the one-word answer: drill.