Thursday, October 22, 2009

Making reliable distributed systems in the presence of software errors - Chapter 4

Chapter 4 of Joe Armstrong's thesis discusses programming techniques that are commonly used in Erlang. These techniques have different goals: some aim at reliability, others at simplicity and modularity. He starts by arguing that, since concurrent programming is perceived as harder than sequential programming, we should separate the concurrent parts from the sequential parts by turning the concurrent parts into frameworks into which sequential code can be plugged. Furthermore, the concurrent parts should be written by expert programmers and the sequential parts by less experienced programmers.

Through an example he shows how this can be achieved by creating a server module that accepts requests from any number of clients and executes a sequential function for each request. This function is supplied to the server on startup, which means the function is independent of the server. In this example the server module handles all the concurrency by spawning processes and receiving messages, and the sequential function it is armed with handles the actual work. The two are thus separate and can be developed separately. In addition, one server can be used with several different functions.
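To make the split concrete for myself, here is a minimal sketch of the idea in Python rather than Erlang, with a thread and a queue standing in for an Erlang process and its mailbox. The names `start_server`, `call` and `name_handler` are my own inventions for illustration and are not taken from the thesis.

```python
import queue
import threading


def start_server(handler, state):
    """Start a generic server thread; returns the queue used as its mailbox."""
    mailbox = queue.Queue()

    def loop(state):
        while True:
            request, reply_to = mailbox.get()
            # The plugged-in function is purely sequential:
            # (request, state) -> (reply, new_state)
            reply, state = handler(request, state)
            reply_to.put(reply)

    threading.Thread(target=loop, args=(state,), daemon=True).start()
    return mailbox


def call(mailbox, request):
    """Client side: send a request and block until the reply arrives."""
    reply_to = queue.Queue(maxsize=1)
    mailbox.put((request, reply_to))
    return reply_to.get()


def name_handler(request, names):
    """Hypothetical sequential plug-in: a tiny name registry kept in a dict."""
    op, *args = request
    if op == "add":
        key, value = args
        return "ok", {**names, key: value}
    if op == "whereis":
        (key,) = args
        return names.get(key), names
    raise ValueError(f"unknown request: {request!r}")


names = start_server(name_handler, {})
print(call(names, ("add", "joe", "office")))
print(call(names, ("whereis", "joe")))
```

All the threading and message passing lives in `start_server` and `call`; `name_handler` knows nothing about concurrency and could be swapped for any other function with the same shape.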

Later this example is expanded with fault tolerance. Fault tolerance here means that the server can handle faults: it catches exceptions from the supplied function and notifies the client that it could not complete the operation. The server does not need to go down, but can continue operating in its previous state.

Note also that in the Erlang model one would likely have a supervisor looking out for the server, so if the server hits an error it cannot handle, it can safely die and let someone else deal with it. In this case that would be the supervisor, which would likely respond by creating a new server to replace the old one. This is the core of the reliability model of Erlang: handle what you can locally, but if you encounter an error (defined as an abnormality you do not know how to fix) then just die; someone else will fix it. The model also has the great property that it can handle hardware errors the same way as software errors, given that the supervisor runs on a different system, as the runtime will ensure "death messages" are delivered to the supervisor.

I must say that, having worked on a C system with high reliability requirements because it had to run in embedded devices, I really appreciate this simple error model. C does not even have exceptions, which means that functions often contain more error handling code than actual application code. This makes it hard to find the actual code among all the error checks and error propagation.
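The "catch the exception, tell the client, keep the old state" part can be sketched the same way. Again this is Python with a thread and a queue standing in for an Erlang process, and `start_server`, `call` and `counter_handler` are hypothetical names of mine. The supervisor is not modeled here; this only shows the local handling.

```python
import queue
import threading


def start_server(handler, state):
    """Generic server that survives failures in the plugged-in function."""
    mailbox = queue.Queue()

    def loop(state):
        while True:
            request, reply_to = mailbox.get()
            try:
                reply, new_state = handler(request, state)
            except Exception as exc:
                # The handler failed: tell the client and keep the old state.
                reply_to.put(("error", repr(exc)))
            else:
                # Only commit the new state when the handler succeeded.
                state = new_state
                reply_to.put(("ok", reply))

    threading.Thread(target=loop, args=(state,), daemon=True).start()
    return mailbox


def call(mailbox, request):
    """Client side: send a request and block until the reply arrives."""
    reply_to = queue.Queue(maxsize=1)
    mailbox.put((request, reply_to))
    return reply_to.get()


def counter_handler(request, count):
    """Hypothetical sequential plug-in: adds integers to a running total."""
    if not isinstance(request, int):
        raise TypeError("counter only accepts integers")
    new_count = count + request
    return new_count, new_count


counter = start_server(counter_handler, 0)
first = call(counter, 5)        # succeeds
failed = call(counter, "five")  # handler raises; server stays up
second = call(counter, 2)       # continues from the state before the failure
print(first, failed, second)
```

The sequential function contains no error propagation code at all; it just raises, and the framework decides what that means for the client and for the server's state.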

Sharing of resources in Erlang is different from shared memory programming. Instead of using locks and semaphores, one creates a server that controls access. To me this is a simpler model. An analogy: instead of having a door with an in-use sign on the room with the shared resource, and trusting everyone to respect the sign, one has a servant who will give you what you need through a window in the door. Of course this analogy doesn't quite hold for monitors, as everyone would lock the door after them, but you would still have to be sure that every path into the room has a door with the same lock, which is no easy task, and you would still have to deal with deadlocks.
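Here is a small sketch of the "servant behind the window" idea, once more in Python with a thread playing the role of the Erlang process. The `SeatServer` class and its seat pool are made up for illustration. The list of free seats is only ever touched by the server thread, so the client code contains no locks.

```python
import queue
import threading


class SeatServer:
    """Owns the pool of free seats; no other thread touches it directly."""

    def __init__(self, seats):
        self._mailbox = queue.Queue()
        threading.Thread(
            target=self._loop, args=(list(seats),), daemon=True
        ).start()

    def _loop(self, free):
        # Requests are handled one at a time, so `free` needs no lock.
        while True:
            op, arg, reply_to = self._mailbox.get()
            if op == "allocate":
                reply_to.put(free.pop() if free else None)
            elif op == "release":
                free.append(arg)
                reply_to.put("ok")

    def _call(self, op, arg=None):
        reply_to = queue.Queue(maxsize=1)
        self._mailbox.put((op, arg, reply_to))
        return reply_to.get()

    def allocate(self):
        """Return a free seat, or None if the pool is empty."""
        return self._call("allocate")

    def release(self, seat):
        """Hand a seat back to the pool."""
        return self._call("release", seat)


server = SeatServer(range(100))
results = queue.Queue()


def client():
    for _ in range(10):
        results.put(server.allocate())


threads = [threading.Thread(target=client) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

granted = [results.get() for _ in range(100)]
print(len(set(granted)))  # every client got a distinct seat
```

There is only one way into the room, namely the server's mailbox, so there is no second door that someone could forget to lock.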

Finally, Armstrong talks about intentional programming, which I believe in strongly. Code should make its intention obvious, and if this means writing a lot of functions then so be it.
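As a small illustration of what I mean, consider looking something up in a table. The helper names below (`must_get`, `find`, `has_key`) are hypothetical ones I made up; the point is only that each name tells the reader what the caller expects to happen.

```python
def must_get(table, key):
    """Intent: the key is required; a missing key is a programming error."""
    return table[key]  # raises KeyError if the key is absent


def find(table, key):
    """Intent: the key may be absent; the caller handles both outcomes."""
    if key in table:
        return ("found", table[key])
    return ("not_found", None)


def has_key(table, key):
    """Intent: only presence matters; the value is irrelevant."""
    return key in table


ports = {"http": 80, "https": 443}
print(must_get(ports, "http"))
print(find(ports, "gopher"))
print(has_key(ports, "https"))
```

All three could be written inline with one generic lookup, but then the reader has to reconstruct from the surrounding code whether a missing key was expected or not.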

Ben Britton asks whether we agree that concurrent code can't be written in a side-effect free manner. On the face of it I would disagree, pointing to systems such as Haskell, but I believe Armstrong is here distinguishing between parallel code and concurrent code, with concurrent code being code where individual entities modeling the real world cooperate to solve a problem. In such a system you would need central locations for resource sharing and communication, with responsibilities such as managing dictionaries and controlling physical devices. These would have side effects by design. If one is purely interested in evaluating a set of expressions then this is something else, which would benefit greatly from being side-effect free. Furthermore, Ben asks whether the abstractions described are only possible in Erlang, to which I would say definitely no. However, Erlang provides a language, runtime and philosophy that not only give you the tools to express these abstractions cleanly, but also invite you to do so.
