Tuesday, November 3, 2009

Making reliable distributed systems in the presence of software errors - Chapter 5

Chapter 5 of Joe Armstrong's thesis on fault-tolerant systems describes how fault tolerance works in Erlang and how the programmer can build an application to make it fault tolerant. By fault tolerance Armstrong means that a system can continue to complete tasks in the presence of faults; it does not imply that the applications are free of them.

The fault tolerance model described in the chapter is based on the idea of failing immediately if you cannot recover and then trying to do something simpler. The system is organized as a tree of supervisors that supervise the behavior processes performing the actual application functionality. I did not fully understand how having the application recover by doing something simpler would work in practice, though. However, another recovery mechanism is to restart the failed process, which makes sense. Perhaps it will succeed this time.
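To make the "restart, then try something simpler" idea concrete for myself, here is a minimal sketch of one possible reading. It is written in Python rather than Erlang, it is not the OTP supervisor API, and every name in it (`supervise`, `flaky_fetch`, `cached_fetch`, `always_broken`, `max_restarts`) is hypothetical and chosen only for illustration.

```python
def supervise(task, simpler_task, max_restarts=3):
    """Run task; on failure restart it, and finally fall back to simpler_task."""
    for _ in range(max_restarts):
        try:
            return task()
        except Exception:
            # The worker failed immediately; the supervisor simply starts it again.
            continue
    # Restarts are exhausted, so try to do something simpler instead.
    return simpler_task()


# Hypothetical workers, assumed only for this example.
calls = {"count": 0}


def flaky_fetch():
    """Fails on the first two calls, then succeeds (a transient fault)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient fault")
    return "fresh data"


def always_broken():
    """Never succeeds (a permanent fault)."""
    raise RuntimeError("permanent fault")


def cached_fetch():
    """The simpler task: serve something less ambitious but reliable."""
    return "stale cached data"


result_after_restart = supervise(flaky_fetch, cached_fetch)
result_after_fallback = supervise(always_broken, cached_fetch)
```

Under these assumptions the flaky worker succeeds on its third start, while the permanently broken one ends up being replaced by the cached answer. Whether this matches what Armstrong means by "something simpler" is my guess, not something the chapter spells out in this form.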

The chapter went on to discuss the difference between failures, errors and exceptions. I didn't fully get the exact distinction, but what I gathered from the text was that in the Erlang context exceptions are what you throw when the specification does not say what you should do. The programmer working at the lowest level should not start guessing, as this leads to an unstable system, but should instead just throw an exception. Always. An error comes in two forms: corrected errors and uncorrected errors. A corrected error is an error that you, or someone above you in the hierarchy, knows how to handle. It therefore does not lead to a failure. An uncorrected error, on the other hand, leads to a failure, which means a process must be restarted or we must try to do something simpler. This could be because of a software error, a hardware resource going down, or just because we don't know how to handle the case of a file being missing. In that case the developer is encouraged to exit explicitly, or to throw an exception if it is a spec error, which is likely the case if we don't know what to do when a file is missing. Let someone else handle it.
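The distinction between corrected and uncorrected errors, as I understood it, can be sketched roughly as follows. This is again Python used as an analogy, not Erlang, and all names (`UnspecifiedCase`, `load_settings`, `load_with_defaults`, `run_worker`) are hypothetical. The assumption is that the lowest level never guesses, a higher layer may know what to do, and anything nobody handles is reported as a failure at the process boundary.

```python
class UnspecifiedCase(Exception):
    """Raised when the specification does not say what to do."""


def load_settings(files, name):
    # Lowest level: do not invent a default for a missing file, just raise.
    if name not in files:
        raise UnspecifiedCase(f"no rule for missing file {name!r}")
    return files[name]


def load_with_defaults(files, name):
    # A layer higher up that knows how to handle the case: a corrected error.
    try:
        return load_settings(files, name)
    except UnspecifiedCase:
        return "defaults"


def run_worker(files, name):
    # Nobody inside the worker handles the case: an uncorrected error.
    # The boundary reports a failure, which a supervisor could then act on.
    try:
        return ("ok", load_settings(files, name))
    except UnspecifiedCase as exc:
        return ("failed", str(exc))


files = {"app.conf": "port=8080"}

corrected = load_with_defaults(files, "missing.conf")
uncorrected = run_worker(files, "missing.conf")
normal = run_worker(files, "app.conf")
```

Here `corrected` carries on with defaults and no failure is visible from outside, while `uncorrected` surfaces as a failed worker that someone else has to deal with.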

However, the chapter does say that a runtime error such as a divide-by-zero leads to an exception, but surely this will more often be a programmer error than a spec error? Or is the fact that this exception is not handled by the module a spec error?
