% fortune -ae paul murphy

Questioning IT

This is the 14th excerpt from the first book in the Defen series: The Board Member's IT Brief.

This section is concerned with things you should talk to your CIO about - informally, but with attention.

Topic one: disaster avoidance

Basically what you want to know is: what happens to the business if your primary data center, along with the people who work there, suffers a disaster? Floods, fires, terrorism, or Legionnaires' disease: which one it is makes little difference.

The right answers differ between the Unix and Mainframe/Windows CIOs, but both should be concerned about the same three things:

  1. avoidance rather than recovery;

  2. continuity in terms of staffing and service delivery; and,

  3. meeting legal and fiduciary responsibilities.

The most important thing to remember when you think about disaster preparedness is that an implemented information architecture includes staff - so systems redundancy planning has to include staff too.

If data center A gets blown up, flooded, or otherwise shut down, the people at data center B should be unaffected and fully able to take over the workload.

Thus an answer that nods to redundancy through some backup data center but relies on the same people who work at the primary site is, if your organization is any reasonable size at all - say 100 or more total staff - incomplete at best and more hope than plan at worst.

Mainframe and Client-Server Architectures

Both the original data processing environment and its modern descendant, the tightly locked down client-server operation, depend on management to co-ordinate the activities of large numbers of people.

As a result it's usually not practical to maintain two fully staffed data centers unless your organization's operations naturally lend themselves to this through size, geographic dispersion or other external circumstance.

Notice that many CIOs will tell you that hardware failure is a frequent occurrence against which their clustering technology provides full protection. This is true - having a server die in a room full of rack mounts stuffed with them should mean nothing - but clustering offers no continuity protection if the building blows up or the clustering expert gets hit by a bus.

If you do have two centers, great. With two, your CIO needs to:

  1. have practiced the transition;

  2. have put in place processes that ensure the availability of any needed data, applications, hardware, and licensing in both places;

  3. have a proven and practiced method in place for communicating replacement server and access information to partners and others with the right to access your systems; and,

  4. have cross trained staff between the two sites.
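
The second item in the list above is the kind of claim that can be checked mechanically rather than taken on faith. As a rough sketch only, assuming each site can export an inventory as a mapping from item name to version (the item names, the versions, and the `site_gaps` helper are all hypothetical, invented here for illustration), a comparison might look something like this:

```python
def site_gaps(primary, backup):
    """Compare two site inventories (item name -> version string).

    Returns the items missing from the backup site, and the items
    present at both sites but at different versions.
    """
    missing = sorted(name for name in primary if name not in backup)
    mismatched = {
        name: (primary[name], backup[name])
        for name in primary
        if name in backup and primary[name] != backup[name]
    }
    return {"missing": missing, "mismatched": mismatched}


# Hypothetical inventories, for illustration only.
primary_site = {"billing-app": "4.2.1", "db-schema": "57", "db-license": "site-wide"}
backup_site = {"billing-app": "4.2.0", "db-schema": "57"}

gaps = site_gaps(primary_site, backup_site)
print("Missing at backup site:", gaps["missing"])
print("Version mismatches:", gaps["mismatched"])
```

A report like this only covers what somebody remembered to put in the inventory, so it supplements a practiced transition and does not replace one.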

In the more common case, however, you cannot afford duplicate processes and face practical constraints on cross training or otherwise practicing disaster recovery. In this situation what you need starts with a contractually committed backup site and well developed plans with carefully defined checklists of core activities, responsibilities, backup communications channels, and - most importantly - a clearly defined succession in control for all major functions.

Your CIO should be aware, furthermore, that such plans essentially never survive the first hour or two after the disaster strikes. People don't act according to plan; communication almost always fails; the critical license or permission will turn out to pertain only to the destroyed machine; the designated successor to a staffer put out of action by the emergency will turn out to be on stress leave; data backups will turn out to be badly out of date or unreadable; and the custom applications carefully stored at the backup site will inevitably turn out to be missing some critical patches your business people rely on - and whose absence in the backup makes them inoperable with the updated database structure from the production site.
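
Two of the failures just listed, backups that are out of date and backups that are unreadable, can at least be looked for before the disaster. The following is a minimal sketch, not a description of any particular backup product: it assumes each backup is an ordinary file whose SHA-256 checksum was recorded when it was written, and the `check_backup` function and the 24-hour limit are assumptions made for this illustration.

```python
import hashlib
import os
import time


def check_backup(path, expected_sha256, max_age_hours=24.0, now=None):
    """Return a list of problems found with one backup file (empty if none)."""
    if not os.path.isfile(path):
        return ["missing: " + path]
    problems = []
    # Staleness: compare the file's modification time with the current time.
    now = time.time() if now is None else now
    age_hours = (now - os.path.getmtime(path)) / 3600.0
    if age_hours > max_age_hours:
        problems.append("stale: %.1f hours old" % age_hours)
    # Readability: read the whole file back and compare checksums.
    digest = hashlib.sha256()
    try:
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    except OSError as err:
        problems.append("unreadable: %s" % err)
        return problems
    if digest.hexdigest() != expected_sha256:
        problems.append("checksum mismatch")
    return problems
```

A matching checksum shows only that the file reads back unchanged. It says nothing about whether the restored data will work with the applications held at the backup site, which is why the only real test is an actual restore.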

Your CIO has to have a detailed disaster recovery plan, but the key issue here is realism - does your CIO recognize that Murphy's law goes into overdrive when a disaster happens: that tired people will make every possible mistake - usually twice - and that there's a near absolute inevitability about finding something - like a failed database patch - that stops every attempt to restore normal operations?

Systems Integrity is like a chain - it breaks at the weakest point

A pair of 2003 "Dear Member:" emails from CIPS (the Canadian Information Processing Society - an organization from which I had long since resigned in protest over their Windows-only website and addiction to hiding important email in floods of junk) illustrates many typical Microsoft environment management problems. The first one said:

In the early hours of May 8, 2003 there was a break in and entry at the CIPS National Office. Two servers and one computer system were stolen.

One server was used as our mail server and contained cips.ca email and mailing listserv addresses. The second server hosted section and provincial web sites as well as, membership reports that are used by the Sections and Provinces.

The second server also had a back up drive installed on it. The membership database is backed up overnight and the back up tape was in the server at the time it was stolen. The membership server that houses the membership system was not stolen.

The second one said (among other things):

After further review I am now in a position to verify with you that the on-line membership renewal process is a secured process. Any credit card information provided is encrypted. This is different from what was reported yesterday.

While the missing back-up tape is not readily accessible, members who selected the automatic annual renewal process potentially remain at risk in having their credit card numbers compromised. We will be attempting to contact these members directly.

My bet? That the transactions were only encrypted during communication, and that everything needed to access the SQL-Server database was stored on the desktop machine.

Your CIO should also be aware, and should therefore make you aware, that users will have bypassed systems controls in at least some areas - and every one of those will haunt the organization during a recovery effort.

Thus the CIO may think all key data is stored on his servers, but it won't be true. He may think he's aware of all legal commitments (things like loans secured by changing inventory) that require access to your information systems - but he'll be wrong about that too.

Nobody can do anything about problems like these until they come up, but someone who doesn't know that they will come up is dangerously naive.

---

Some notes:

  1. These excerpts don't include footnotes and most illustrations have been dropped as simply too hard to insert correctly. (The WordPress HTML "editor" as used here enables a limited HTML subset and is implemented to force frustrations like the CP/M line delimiters from MS-DOS.)

  2. The feedback I'm looking for is what you guys do best: call me on mistakes, add thoughts/corrections on stuff I've missed or gotten wrong, and generally help make the thing better.

  3. When I make changes suggested in the comments, I make those changes only in the original, not in the excerpts reproduced here.


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.