This is the 13th excerpt from the second book in the Defen series: BIT: Business Information Technology: Foundations, Infrastructure, and Culture
Note that the section this is taken from, on the evolution of the data processing culture, includes numerous illustrations and note tables omitted here.
Clearly Defined Line Management Structure with rigid role separation
At a minimum there should be:
SLA includes annually budgeted operations
The service level agreement is the contract between the data center and the user community. This is the peace treaty in the battle for resources and control between user groups and the data center. As such it governs expectations and is renegotiated annually as part of the budget process.
The SLA evolved out of user frustration with the constantly increasing demands for budget and control put forward by the automated data processing department as it struggled to deliver on the promised benefits of new and more powerful applications.
The SLA should be integrated with the overall systems governance process and be administered by a systems steering committee including members of the senior executive.
Clearly documented SDLC standards
Data centers that run only packaged applications tend to stagnate. The growth and service potential is in new development, new deployments, and the discharge of ever increasing corporate responsibilities.
Early System/360 adopters generally underestimated development complexities and limitations, and therefore tended to overpromise. As most projects failed while a few succeeded, the critical success factors for developers soon became clear. High among these was the use of a clearly enunciated and strongly enforced systems development lifecycle (SDLC) methodology.
Developers who obtained user sign-off at each stage of a project's lifetime and then incorporated the resulting expectations into service level agreements generally found that users who had been co-opted during project design accepted weaker results as successes and were less likely to rebel at budget increases.
The typical SDLC is defined in terms of steps leading to deliverables and sign-offs rather than working code or reviewable systems documentation. Many of these steps are inherently technical, but the focus is on the sign-offs and processes rather than the contents of each deliverable, thus decoupling the systems development management process from systems development and testing.
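The gating logic described above can be pictured with a minimal sketch. The stage names and the function `current_stage` are assumptions made for illustration; the source does not list specific SDLC stages.

```python
# Hypothetical stage names; the source does not list specific SDLC stages.
STAGES = ["requirements", "design", "build", "test", "deploy"]


def current_stage(signoffs):
    """Return the first stage lacking a recorded sign-off, or None if all are signed."""
    for stage in STAGES:
        if stage not in signoffs:
            return stage
    return None


# Example: users have signed off on the first two stages only.
signoffs = {"requirements", "design"}
print(current_stage(signoffs))
```

Note that the gate checks only whether a sign-off has been recorded. It never inspects the deliverable itself, which is the decoupling the paragraph describes.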
"Lights out" 24 x 7 operation
Automated, or "lights out," operation is normally presented as a means of saving costs: not having to run a night shift means not paying those salaries. In reality, people assigned operational functions during these shifts tend to be low cost, so savings are usually negligible on the scale of the overall data center budget.
The management value of lights out operation as a best practice derives from something else entirely: the fact that it is functionally impossible to achieve this without first implementing a series of related practices ranging from proper management of job scheduling, to accurate capacity planning, effective abend minimization, and automated report distribution.
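One of those prerequisites, job scheduling, lends itself to a small sketch: if every job declares what must finish before it, the run order can be derived without an operator deciding it at the console. The job names and the `PREREQUISITES` table below are hypothetical, and the example uses Python's standard `graphlib` module rather than any mainframe scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical nightly batch: each job maps to the jobs that must finish first.
PREREQUISITES = {
    "EXTRACT-ORDERS": set(),
    "POST-LEDGER": {"EXTRACT-ORDERS"},
    "PRINT-REPORTS": {"POST-LEDGER"},
    "BACKUP-TO-TAPE": {"POST-LEDGER"},
}


def run_order(prerequisites):
    """Return a job sequence in which every job follows its prerequisites."""
    return list(TopologicalSorter(prerequisites).static_order())


order = run_order(PREREQUISITES)
print(order)
```

A circular dependency makes `static_order()` raise `graphlib.CycleError`, which is the kind of scheduling defect that has to be eliminated before an unattended shift is possible.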
Use of Automated Tape Library
Use of an automated tape library coupled with vaulted third-party off-site storage for backups is a common best practice, mainly because it reduces both data loss and tape mount errors.
Disaster Recovery or Business Continuity Plan
A documented disaster recovery plan must exist.
The traditional first step in a mainframe disaster recovery planning effort is the classification of systems (meaning application groups) according to the severity of the impacts associated with processing failure. Thus most plans are ultimately predicated on the time frames within which processing is to resume for each of a set of jobs grouped under headers like Critical, Vital, Sensitive, or Non Critical.
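The classification step can be pictured as a simple lookup: each job carries a class, each class a target resumption window, and the recovery sequence follows from sorting on that window. The sketch below is illustrative only. The hour figures, the job names, and the function `recovery_order` are assumptions made for the example, not values from any actual plan.

```python
from dataclasses import dataclass

# Hypothetical resumption targets in hours for each class; a real plan
# would set these figures from its own impact assessment.
RESUME_WITHIN_HOURS = {
    "Critical": 24,
    "Vital": 72,
    "Sensitive": 168,
    "Non Critical": 720,
}


@dataclass(frozen=True)
class Job:
    name: str
    category: str


def recovery_order(jobs):
    """Return jobs sorted by target resumption window, shortest first."""
    return sorted(jobs, key=lambda j: (RESUME_WITHIN_HOURS[j.category], j.name))


# Hypothetical job list, deliberately out of order.
jobs = [
    Job("MONTHEND-REPORTS", "Non Critical"),
    Job("PAYROLL", "Critical"),
    Job("INVENTORY", "Vital"),
    Job("GENERAL-LEDGER", "Sensitive"),
]

for job in recovery_order(jobs):
    print(f"{job.name}: resume within {RESUME_WITHIN_HOURS[job.category]} hours")
```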
The more common recovery strategies are built around:
Hot site agreements with commercial service organizations under which the company regularly transfers tapes to the hot site operator and the site operator assures the company of access to physical and processing facilities for the duration of any emergency.
Hot site agreements come in multiple "temperatures," with a cold site, for example, offering little more than space and a physical facility, without any of the company's code preloaded or communications links pre-tested.
Internal systems duplication in which the company maintains two or more independent data centers and uses each as backup for the other.
Reciprocal agreements. These can be executed between organizations using similar gear and amount to mutual hot site agreements.
Disasters are extremely rare. When they do occur, weaknesses in the recovery plan are usually found in one or more of three main places:
As a result it is common in real processing disasters to find the data center director reporting full functionality at the interim site several days before users can resume normal operations.
Some notes:
Notice that getting the facts right is particularly important for BIT - and that the length of the thing plus the complexity of the terminology and ideas introduced suggest that any explanatory anecdotes anyone may want to contribute could be valuable.