Tuesday, September 22, 2009

Guardian: A fault-tolerant operating system environment

This week's chapter from Beautiful Architecture for cs527 was on the Guardian operating system and the T/16 machine. Both the operating system and the machine were engineered for reliability, trading away most other quality attributes to get it. The core idea was that everything should be duplicated in case one component goes down. The T/16 machines had at least two processors, two buses, (often) two disks, etc.

Each process would be duplicated on two processors. On one processor it would be active, and on the other it would be passive, waiting for the first one to die or give up control. In the reliability world there are basically three ways to recover in the face of failure: job replication, checkpointing, or attempting to repair the state of the execution. The last one is the least general (but the most used) and must be custom-fit to each problem, for example by using exceptions. For the Guardian operating system they chose application-controlled checkpointing to allow for recovery. Each program is therefore responsible for checkpointing its state at various intervals, and if a processor goes down the other one resumes from the last checkpoint. The biggest risk with this approach is an application that fails to checkpoint after an externally visible operation (giving the ATM customer money...). If that happens, the other processor performs the operation again. And what if a processor fails between a request for an IO operation and the point where the data is checkpointed?
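To make that risk concrete, here is a minimal sketch in Python of a process pair with application-controlled checkpointing. It is a toy model and not Guardian's actual mechanism: the names `ProcessPair`, `run`, `simulate`, and `crash_at` are all made up for illustration, and the "externally visible operation" is just appending an amount to a list.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessPair:
    """Hypothetical model: a primary and a backup sharing one checkpoint."""
    # Last checkpointed state; the backup resumes from here after a failure.
    checkpoint: dict = field(default_factory=lambda: {"next_request": 0})
    # Externally visible effects (e.g. cash handed out by an ATM).
    dispensed: list = field(default_factory=list)

    def run(self, requests, crash_at=None):
        """Process requests starting from the last checkpoint.

        If crash_at is a request index, the simulated processor fails right
        after performing that request but before checkpointing it.
        Returns True if all requests were handled, False on a crash.
        """
        state = dict(self.checkpoint)
        while state["next_request"] < len(requests):
            i = state["next_request"]
            self.dispensed.append(requests[i])   # externally visible operation
            if crash_at == i:
                return False                     # dies before the checkpoint
            state["next_request"] = i + 1
            self.checkpoint = dict(state)        # application-controlled checkpoint
        return True


def simulate(requests, crash_at=None):
    """Run the primary; if it fails, let the backup take over."""
    pair = ProcessPair()
    if not pair.run(requests, crash_at=crash_at):
        pair.run(requests)                       # backup resumes from checkpoint
    return pair.dispensed


print(simulate([20, 50, 100]))              # no failure
print(simulate([20, 50, 100], crash_at=1))  # failure after dispensing 50
```

In the second run the amount 50 shows up twice: the primary dispensed it and died before checkpointing, so the backup resumed from a state in which that request was still pending. Checkpointing before the operation instead would flip the problem, since a crash in between would then lose the request rather than repeat it.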

When reading the chapter I was thinking that the architecture presented was a long string of peculiarities and ad hoc add-ons that had become necessary over the decades, not a beautiful system with high conceptual integrity. However, when writing this post it occurs to me that its beauty lies in the way every aspect of it, and every decision made, reinforces its reliability. It is obvious that the architects really had duplicity in their blood, as the author points out.

The author gives improved commodity hardware and the burden of legacy code as two reasons why the system became obsolete in the nineties. This sounds reasonable to me, and staying popular for 15-20 years is no mean feat for special-purpose hardware.
