Thursday, November 5, 2009

Making reliable distributed systems in the presence of software errors - Chapter 6

Chapter 6 of Armstrong's thesis on building failure-tolerant software describes the process of building an application. It shows how programmers can extend a generic set of behaviors with their own code containing the business logic. The generic components, of which the server, event handler, and finite-state machine are described, are written by experts and have been heavily tested and widely reused. They deal with complicated issues such as concurrency, locks, and parallelism in general. The code the application developers need to write is thus for the most part only sequential stubs that are plugged into these components, which makes it far easier to get it right. The chapter also discusses how one can create supervisors that monitor and restart failing behaviors, and how all of this is put together into an application.
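To make the split between generic component and sequential stub concrete, here is a minimal sketch in Python rather than Erlang. It is not Armstrong's code or the OTP API: the names GenericServer, KeyValueCallbacks, init and handle_call are my own, chosen only to mirror the idea. The generic class owns the thread and the mailbox; the plugged-in callbacks are plain sequential functions from a request and a state to a reply and a new state.

```python
import queue
import threading


class GenericServer:
    """Generic part (hypothetical name): owns the thread, the mailbox and
    the request/reply protocol. Written once and reused."""

    def __init__(self, callbacks, init_arg):
        self._callbacks = callbacks
        self._state = callbacks.init(init_arg)
        self._mailbox = queue.Queue()
        # Only this thread ever touches the state, so callbacks need no locks.
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def call(self, request):
        """Send a request and block until the server thread replies."""
        reply_box = queue.Queue(maxsize=1)
        self._mailbox.put((request, reply_box))
        return reply_box.get(timeout=5)

    def _loop(self):
        # Crash handling is deliberately left out of this sketch.
        while True:
            request, reply_box = self._mailbox.get()
            reply, self._state = self._callbacks.handle_call(request, self._state)
            reply_box.put(reply)


class KeyValueCallbacks:
    """Application part (hypothetical name): purely sequential code."""

    @staticmethod
    def init(_arg):
        return {}

    @staticmethod
    def handle_call(request, state):
        op = request[0]
        if op == "store":
            _, key, value = request
            new_state = dict(state)
            new_state[key] = value
            return "ok", new_state
        if op == "lookup":
            _, key = request
            return state.get(key), state
        return ("error", "unknown_request"), state
```

A server would be started with GenericServer(KeyValueCallbacks, None) and used through call(("store", "a", 1)) and call(("lookup", "a")). All the concurrency lives in GenericServer, while the callback class could be tested without any threads at all.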

The chapter made some of the things about Erlang that have been blurry to me come together. Through Armstrong's demonstration of what happens when a failure occurs, in this case when a behavior crashes, I started to gain the first glimmers of comprehension about how Erlang systems can be so awe-inspiringly robust. It seems to me that there may be three parts to the answer. The first is the obvious one: through supervisors, an Erlang program gets multiple lines of defense, so a lot of things have to fail simultaneously to bring down a whole system.

The second seems to me to be the pragmatic focus on capturing, and never losing, logging information when something bad happens. Failures occur, but by catching them as soon as they happen, having a supervisor log the reason so that we are sure we don't lose it, and being able to hot-swap parts of the system, one can gradually weed the failures out. Over time this makes it less and less likely that enough things will go wrong at once to bring down a whole system.
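The supervisor idea from the last two paragraphs can be sketched in a few lines of Python. This is a simplified, single-threaded illustration under my own assumptions, not how Erlang supervisors are implemented: the names supervise, start_worker, max_restarts and RestartLimitExceeded are hypothetical, and a real supervisor watches separate processes and counts restarts within a time window. What it does show is the three steps described above: catch the crash, record the reason, restart the worker with fresh state, and give up to the next level only when restarts keep failing.

```python
import logging

logger = logging.getLogger("supervisor_sketch")


class RestartLimitExceeded(Exception):
    """Raised when the worker keeps crashing, so the next level up must decide."""


def supervise(start_worker, jobs, max_restarts=3):
    """Feed jobs to a worker, restarting it with fresh state after each crash.

    Returns the successful results and the log of (job, reason) crash entries.
    """
    crash_log = []
    results = []
    worker = start_worker()
    for job in jobs:
        try:
            results.append(worker(job))
        except Exception as exc:
            # Record the reason first, so it is never lost.
            crash_log.append((job, repr(exc)))
            logger.error("worker crashed on %r: %r", job, exc)
            if len(crash_log) > max_restarts:
                # Escalate: this line of defense has failed.
                raise RestartLimitExceeded(crash_log) from exc
            worker = start_worker()
    return results, crash_log


def start_parser():
    """Hypothetical worker: plain sequential code with no error handling."""
    def parse(text):
        return int(text)
    return parse
```

Calling supervise(start_parser, ["1", "x", "3"]) lets the worker crash on "x", keeps the reason in the crash log, and carries on with "3". The worker itself contains no defensive code, which is the point: recovery is the supervisor's job.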

The third reason, I think, is the failure-focused mentality that I am sure programmers develop by working in an environment that forces them to design their systems around failure recovery. If you are working in C or Java on a standard project, then as a developer you never really experience, or have to think much about, failures unless you go out of your way to write tests that expose them. And since you are not forced to think about it, you won't. In Erlang you have to build failure hierarchies and define explicit supervisors. This forces you to start thinking about these things, and you will therefore keep failures in mind even when writing the simple behaviors. Put another way, I am sure a ten-year Erlang programmer would write pretty fault-tolerant code even if he for some reason was forced to use C for a project, as he has been schooled to think about it. This is comparable to a ten-year object-oriented programmer being forced to work with C. He has been schooled in modularity and encapsulation and will tend to write his programs this way, even though he has to use C, which has poor support for it.
