This week's chapter from Beautiful Architecture for cs527 was on the Guardian operating system and the T/16 machine. Both the operating system and the machine were engineered for reliability, trading away most other quality attributes. The core idea was that everything should be duplicated in case one copy goes down: the T/16 machines had at least two processors, two buses, and (often) two disks.
Each process is duplicated across two processors: on one it runs actively, while on the other it waits passively for the first to die or give up control. In the reliability world there are basically three ways to recover in the face of failure: job replication, checkpointing, or attempting to repair the state of the execution. The last is the least general (but the most used) and must be custom-fitted to each problem, for example through exceptions. For the Guardian operating system the designers chose application-controlled checkpointing. Each program is responsible for checkpointing its state at suitable intervals, and if a processor goes down the other one restarts from the last checkpoint. The biggest risk with this approach is an application that fails to checkpoint after an externally visible operation (giving the ATM customer money...). If that happens, the operation will be performed again by the backup processor. And what if a processor fails between a request for an I/O operation and the point where the data is checkpointed?
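To make that risk concrete, here is a minimal Python sketch of a process pair with application-controlled checkpointing. All of the names are hypothetical and this is not Guardian's actual interface; it only illustrates the ordering problem described above.

```python
# Hypothetical sketch of a process pair with application-controlled
# checkpointing. Not Guardian's API -- just the ordering problem.

class Backup:
    """Passive copy on the other processor; resumes from the last checkpoint."""
    def __init__(self):
        self.last_checkpoint = None

    def receive_checkpoint(self, state):
        self.last_checkpoint = dict(state)

    def take_over(self):
        # On primary failure, restart from the last checkpointed state.
        # Anything the primary did after that checkpoint will be redone.
        return dict(self.last_checkpoint or {})


class Primary:
    def __init__(self, backup):
        self.backup = backup
        self.state = {"handled_requests": []}

    def checkpoint(self):
        self.backup.receive_checkpoint(self.state)

    def dispense_cash(self, request_id, amount):
        print(f"dispensing {amount} for request {request_id}")  # externally visible!
        self.state["handled_requests"].append(request_id)
        # If the primary dies HERE, before the checkpoint below, the backup
        # takes over from a state that does not include request_id and the
        # customer is paid twice -- exactly the failure mode described above.
        self.checkpoint()
```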
When reading the chapter I was thinking that the architecture presented was a long string of peculiarities and ad hoc add-ons that had accumulated over the decades, not a beautiful system with high conceptual integrity. However, when writing this post it occurs to me that its beauty lies in the way every aspect of it, and every decision made, reinforces its reliability. It is obvious that the architects really had duplicity in their blood, as the author points out.
The author cites improved commodity hardware and the burden of legacy code as two reasons why the system became obsolete in the nineties. This sounds reasonable to me, and the fact that the system was popular for 15-20 years is no mean feat for special-purpose hardware.
Tuesday, September 22, 2009
Monday, September 21, 2009
Big Ball of Mud
A big ball of mud is a system characterized by a lack of structure and conceptual integrity and by many ad hoc fixes. The paper argues that this is a pattern and not an anti-pattern, as it is likely the most dominant architecture of today's systems. It also argues that a big ball of mud is not necessarily bad: it may be warranted when a system is not complex enough to need more architecture, when time to market is very tight, or in the early phases of development before the natural architecture has been uncovered.
I liked the many insightful quotes in this article about both architecture and organizations, and many of them felt uncomfortably familiar. One of these quotes was that "Sometimes freedom from choice ... is what we really want", and I'd like to reference Barry Schwartz's great Google tech talk "The Paradox of Choice", where he talks about why more choice can be bad.
The Throwaway Code pattern talks about throwaway prototypes and why they tend to stick around. The authors suggest writing such code in another language as a remedy. Another trick to ensure clients and managers don't ask for a prototype to be shipped, which I learned from a realtime instructor, is to make every prototype crash at the very end of the demo. This admittedly requires a lot of courage and is a bit sneaky, but it is nonetheless interesting and clearly communicates that something is not production code to those who are not initiated into the details behind the user interface.
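In code, the trick is as blunt as it sounds. A hypothetical Python sketch; the point is simply that the crash is deliberate and impossible to miss:

```python
def run_demo():
    # ... show off the prototype's happy path here ...
    print("demo finished")
    # Deliberate crash: this is THROWAWAY PROTOTYPE code and must not ship.
    # Anyone who asks to release it gets this reminder at the end of every demo.
    raise RuntimeError("PROTOTYPE ONLY -- not production code")

if __name__ == "__main__":
    run_demo()
```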
The final thing that came to mind when reading this article was that it argued a bit too strongly for reconstruction, although it did point out that reconstruction is not always the best choice. Admittedly, it is sometimes the only way forward, but the general tendency among programmers seems to lean towards reconstruction more often than I think is warranted. A new implementation will introduce new bugs, so in most cases I think refactoring is the better choice, as it builds on 'working' code.
Thursday, September 17, 2009
Layers
Layers is arguably the most common architectural pattern, and most of the software I have been exposed to has been organized more or less according to it. Broadly speaking it is a pretty obvious concept, but the pattern description does contain interesting discussions of tradeoffs. It highlights the most central pair of conflicting forces to consider with this pattern: the tension between clean separation into multiple layers (modularity) and efficiency. The efficiency concern comes from the fact that multiple layers and enforced interfaces usually carry an overhead. As one version of the controller part of the MVC song goes: "I wish I had a dime for every single time, I passed on a String". In some types of software, such as drivers and system software, this can be an important tradeoff.
One such case I have experienced was implementing an old industry-standard API on a modern and novel piece of hardware. There the software layers had to do a lot of costly reshuffling and trickery to conform to the specified API interface, so the required layering came with a big performance hit. However, in most kinds of software, such as most application software, this is not a big problem and one is free to select the best abstractions without worrying too much about their efficiency.
The chapter also mentioned defined interface objects to layers. I take this to mean facade classes, such as the one the authors used as the interface to their model in the Making Memories system. Furthermore, I found the comment on keeping the lower layers as slim as possible to be very true; it also follows the Unix adage of providing mechanism, not policy. Another interesting point was the observation that one should avoid defining components first and then putting them into layers. The authors argue that this will likely lead to a system where the layers don't really capture the inherent ordering principles of the abstractions. The layer organization will then not be intuitive and is unlikely to be respected by future maintainers, who will take shortcuts and violate the system's conceptual integrity.
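As I understand it, such an interface object boils down to a facade that is the only thing upper layers are allowed to touch. A rough Python sketch of the idea, with class and method names that are mine rather than the book's:

```python
# Hypothetical facade acting as the defined interface object of a lower
# layer: the presentation layer calls only DomainModel, never the
# modules hidden behind it.

class _SqlStore:                 # lower-layer detail, hidden from upper layers
    def load(self, key):
        return {"id": key, "caption": "..."}

class _CacheStore:               # lower-layer detail, hidden from upper layers
    def __init__(self):
        self._items = {}
    def get(self, key):
        return self._items.get(key)
    def put(self, key, value):
        self._items[key] = value

class DomainModel:
    """Facade: the single entry point the upper layer is allowed to use."""
    def __init__(self):
        self._cache = _CacheStore()
        self._store = _SqlStore()

    def photo(self, photo_id):
        cached = self._cache.get(photo_id)
        if cached is not None:
            return cached
        loaded = self._store.load(photo_id)
        self._cache.put(photo_id, loaded)
        return loaded
```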
Xen
Xen is a virtualization platform that allows one PC or server to run multiple operating systems simultaneously. This lets users do things like run different operating systems on the same machine or provide one operating system per application. The latter is good because it gives each application a clean execution environment, ensuring that different applications don't interact in mysterious ways and that one of them doesn't accidentally or intentionally bring the whole system down. It also makes the system more robust overall, as Xen is smaller than a full operating system and therefore less software (and fewer bugs) has to run at the highest privilege level.
The chapter on Xen in Beautiful Architecture presents it as an architecture built on distrust. Since each user is encapsulated in their own OS, they have fewer opportunities to interfere with other users or with the execution environment (Xen). In addition, clients are somewhat more secure since they run their own environment, although as An-Hoe Shih pointed out there are still security risks involved, as the virtualization platform itself might contain malicious code. However, that requires the service provider to be malicious, and the client is still more secure against other clients.
What interests me most about Xen is the way the architecture divides concerns onto separate processors for security. Each operating system runs in its own domain, with dom0 acting as the supervisor, and individual device drivers can even be farmed out into separate driver domains. The rationale for this is safety, and indeed processors aren't only for speed, but a side effect is a very scalable system as the number of cores per chip moves into the tens and hundreds.
It also makes me wonder whether we are finally getting ready to welcome microkernel operating systems. In the nineties many argued their case, but they never quite made it into the most popular systems (Mac OS X being the exception). The reason was probably that they never reached the speed of well-designed monolithic or hybrid kernels. But if we are ready to accept multi-process web browsers and whole virtualization-platform indirections for the sake of safety and encapsulation, then surely microkernels can't be that bad anymore. Besides, they too would have a good scaling story on future chips.
Tuesday, September 15, 2009
Pipes and Filters
One of this week's readings was the Pipes and Filters pattern from the POSA book. The chapter didn't really contain much new information for me, but I had not thought of it in terms of active and passive filters before. Other than that, I thought the chapter mixed the general idea with Unix concepts too much, which could confuse readers not familiar with them. Perhaps it would have been better to first discuss the generalized pattern and then give an example describing its use in Unix?
Pipes and filters, often called pipelines in the parallel world, is a very common pattern that is used in many places for different reasons. In Unix shells its primary use is to provide a facility for communication between small applications that do one thing and do it well. The intent there is chiefly flexibility, modularity and reusability: the applications can be combined with other applications to solve problems the original authors may never have thought of.
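The same idea translates directly into code: a pipeline is just a chain of small filters, each consuming the previous one's output. Below is a minimal sketch using Python generators as passive filters pulled by the downstream consumer; the filter names and the access.log file are made up for illustration.

```python
# Minimal pipes-and-filters sketch: each filter is a generator that reads
# from an upstream iterable and yields to the downstream one, roughly like
# `grep ERROR access.log | cut -d' ' -f1 | sort -u` in a Unix shell.

def read_lines(path):
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def keep_errors(lines):
    for line in lines:
        if "ERROR" in line:
            yield line

def first_field(lines):
    for line in lines:
        yield line.split(" ", 1)[0]

def unique_sorted(items):
    for item in sorted(set(items)):
        yield item

# Compose the pipeline; data is pulled lazily through the filters.
pipeline = unique_sorted(first_field(keep_errors(read_lines("access.log"))))
for value in pipeline:
    print(value)
```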
In the parallel world, pipelines are used as a means to expose parallelism in a problem. A task is broken into stages, and if many tasks have to be processed, or if one stage can start processing a task before the previous stage is finished, the stages can run concurrently. A third place where pipes and filters, and its general form often called data-flow networks, is extremely common is media processing systems. In a previous post I mentioned GStreamer, which is a popular pipes-and-filters framework for all sorts of media processing in the Linux world. Other media APIs that employ pipelines are OpenGL and OpenAL.
Typical systems implementing the latter two demonstrate yet another benefit of pipelines: once a task is broken into specialized pipeline stages, some stages can be executed more efficiently on specialized hardware. In the mobile world there are typically many specialized processors on a SoC (System on Chip), for graphics, audio, video, wireless and Bluetooth, and the most common way of using them is through pipelining. As the number of transistors on a chip, at least for the time being, still increases exponentially (even though clock speed doesn't), I suspect that more heterogeneity will become the norm on PC microprocessors as well. As such, we in the computer world will also get the specialized machinery that Eric G. mentions.