Thursday, October 22, 2009

Some thoughts on Erlang for Parallel Computing

I like the Erlang model and appreciate the impressive results it has achieved in its domain, so I have been thinking about how it would work for other problems like the computational patterns. Applications in which several code paths are executed at the same time can be split into two groups.

First, you have those applications in which concurrency is a part of the problem specification. Two common examples are web servers and switching systems, where there are multiple concurrent users. A third example is GUI event loops, where the events are naturally concurrent, and a fourth is web browsers, where the many tabs naturally operate concurrently. In these systems the simultaneous execution is often referred to as concurrency. Furthermore, introducing simultaneous computation/actual concurrency into these systems often does not make them much more difficult to understand, and can sometimes even make them easier to understand. Just look at web development, where developers write code on a daily basis that executes concurrently without giving it a second thought. In these systems the introduction of concurrency can give many benefits such as performance, scalability, reliability, stability, security, modularity or even maintainability and simplicity. The key point is that concurrency is a natural part of the problem specification and does not need to be uncovered.

Then there are those problems where concurrency is not a part of the problem specification, but where there is still concurrency in the domain that can be exposed and used to enhance some quality attribute (as I have explained in an earlier blog, I prefer this term to non-functional requirement). The quality attributes in question are often performance and scalability, but can also be reliability, stability and security. In these systems the simultaneous execution is usually referred to as parallelism. Examples are solutions to the thirteen dwarfs/motifs/computational patterns, where parallelism is usually exploited for performance and scalability, or Google Chrome, which will spawn off processes, independent of the tab processes, to render certain risky parts of HTML in order to provide increased security. Since there is no natural concurrency in the problem specification of these problems, we have to find ways to restructure the problem to expose it. This is hard and usually leads to units of parallel execution that are closely coupled and that need to communicate often and in non-trivial ways.

Erlang and "Concurrency Oriented Programming" focuses on the first set of problems where the problems are inherently concurrent and programmers are therefore encouraged to map the concurrency in the solution 1:1 with the concurrency of the problem (people in the world or transactions in a switch). Some of the challenges are then to provide a system where processes are cheap and to provide separation between the different entities so that problems in one does not spread. These problems are nicely solved by Erlang through the use of structured protocol-driven message passing and green processes.

The second problem area is harder to solve, and it is this area that the OPL effort focuses on the most. Erlang could probably be useful in solving some of these problems, but for problems like the computational patterns I have more reservations. Setting aside, for a moment, the fact that these are problems in which performance is usually the goal, which raises questions about the efficiency of a high-level language like Erlang especially for supercomputing, I do not think the Erlang model is the ideal fit. The programming model of Erlang is one where isolated processes work independently and cooperate through explicit message passing. Admittedly, many of the models that we use in supercomputing today, like MPI and Charm++, share these characteristics, but they are often seen as complicated for programmers to use.

As I mentioned above, Armstrong argues for mapping the concurrency of our problem 1:1 with the concurrency of the world. I agree with this, which leads me to the conclusion that it would be beneficial if we could use a sequential programming model for the problems that are, or that we at least think of as, sequential. However, running them sequentially is not an option. We want to simulate tomorrow's weather in hours and not in days, so in the absence of increased sequential speed we must parallelize the computation. However, a "mostly" sequential programming model can still be parallelized given enough information to the compiler through parallel loops, etc. These preserve most of the sequential reasoning, but give us parallel execution. Just as we can choose to map concurrent problems onto one or a few processors through time sharing, we can map sequential programs onto several processors in parallel. One example of such a programming model is OpenMP, which is popular for shared-memory systems but not very suitable for distributed systems, mostly because it assumes shared memory. However, this is an active area of research and one can envision sequential programming models that expose primitives that let the programmer specify data locality, giving the compiler enough information to distribute the computations satisfactorily.
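To make the idea concrete, here is a minimal sketch in Java rather than OpenMP (which is a C/C++/Fortran extension): the second loop still reads like sequential code over an index, but a parallel stream is allowed to run the iterations on several cores. The array names are of course made up.

```java
import java.util.stream.IntStream;

public class ParallelLoopSketch {
    public static void main(String[] args) {
        double[] celsius = new double[1_000_000];

        // Plain sequential reasoning: each iteration is independent.
        for (int i = 0; i < celsius.length; i++) {
            celsius[i] = i * 0.001;
        }

        double[] fahrenheit = new double[celsius.length];

        // The "parallel loop" equivalent: the code still reads like a loop
        // over i, but the iterations may run on several cores.
        IntStream.range(0, celsius.length)
                 .parallel()
                 .forEach(i -> fahrenheit[i] = celsius[i] * 9.0 / 5.0 + 32.0);

        System.out.println(fahrenheit[100]);
    }
}
```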

Making reliable distributed systems in the presence of software errors - Chapter 4

Chapter 4 of Joe Armstrong's thesis discusses programming techniques that are commonly used in Erlang. These techniques have different goals; some are for reliability and others are aimed at simplicity and modularity. He starts by making the argument that since concurrent programming is perceived as harder than sequential programming, we should split the concurrent parts from the sequential parts by turning the concurrent parts into frameworks into which we can plug sequential code. Furthermore, the concurrent parts should be written by expert programmers and the sequential parts by less experienced programmers.

Through an example he shows how this can be achieved by creating a server module that accepts requests from any number of clients and for each request executes a sequential function. This function is supplied to the server at startup, which means the function is independent of the server. In this example the server module handles all the concurrency by spawning processes and receiving messages, while the sequential function it is armed with handles the actual work. The two are thus separate and can be developed separately. In addition, one server can be used with several different functions.

Later this example is expanded with fault tolerance. Fault tolerance here means that the server can handle faults, which is done by catching exceptions from the supplied function and then notifying the client that it could not complete the operation. The fault tolerance here is that the server does not need to go down, but can continue operating in its previous state. Note also that in the Erlang model one would likely have a supervisor looking out for the server, so if the server encounters an error it cannot handle, it can safely die, letting someone else handle the problem. In this case that would be the supervisor, which would likely respond by creating a new server to replace the old one. This is the core of the reliability model of Erlang: handle what you can locally, but if you encounter an error (defined as an abnormality you do not know how to fix) then just die; someone else will fix it. It also has the great property that the model handles hardware errors the same way as software errors, given that the supervisor runs on a different system, as the runtime will ensure "death messages" are delivered to the supervisor. Having worked on a C system with high reliability requirements, because it had to run in embedded devices, I really appreciate this simple error model. In C one does not even have exceptions, which means that functions often contain more error-handling code than actual application code. This makes it hard to find the actual logic in all the error checks and error propagation.
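To illustrate the division of labour, here is a rough Java analogue of such a server (all names are my own invention, not Armstrong's Erlang code): the sequential work function is supplied at startup, the server owns all the concurrency, and a failing handler only fails the one request rather than taking the server down.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch of a generic "server": it owns the concurrency,
// while the sequential work function is plugged in at startup.
public class GenericServer<Req, Rep> {
    private final Function<Req, Rep> handler;   // the sequential part, developed separately
    private final ExecutorService loop = Executors.newSingleThreadExecutor();

    public GenericServer(Function<Req, Rep> handler) {
        this.handler = handler;
    }

    public Future<Rep> call(Req request) {
        // Any exception thrown by the handler is delivered to that one client
        // via Future.get() (wrapped in an ExecutionException); the server
        // thread survives and keeps serving other clients.
        return loop.submit(() -> handler.apply(request));
    }

    public void shutdown() { loop.shutdown(); }

    public static void main(String[] args) throws Exception {
        GenericServer<String, Integer> server = new GenericServer<>(String::length);
        System.out.println(server.call("hello").get());   // prints 5
        server.shutdown();
    }
}
```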

Sharing of resources in Erlang is different from shared-memory programming. Instead of using locks and semaphores, one creates a server that controls access. This is to me a simpler model; an analogy would be that instead of having a door with an in-use sign to the room with the shared resource, and then trusting everyone to respect the sign, one has a servant who will hand you what you need through a window in the door. Of course this analogy doesn't quite hold for monitors, as there everyone would lock the door after them, but you would still have to be sure every path into the room has a door with the same lock, which is no easy task, as well as deal with deadlocks.

Finally, Armstrong talks about intentional programming, which I believe in strongly. Code should make its intention obvious, and if this means writing a lot of functions then so be it.

Ben Britton asks whether we agree that concurrent code can't be written in a side-effect-free manner. On the face of it I would disagree, pointing to systems such as Haskell, but I believe Armstrong is here distinguishing between parallel code and concurrent code, with concurrent code being code where individual entities modeling the real world cooperate to solve a problem. In such a system you would need central resource-sharing and communication locations with responsibilities such as managing dictionaries and controlling physical devices. These would have side effects by design. If one is purely interested in evaluating a set of expressions then that is something else, which would benefit greatly from being side-effect free. Furthermore, Ben asks whether the abstractions described are only possible in Erlang, to which I would say definitely no. However, Erlang provides a language, runtime and philosophy that not only give you the tools to express these abstractions in a clean manner, but also invite it.

Tuesday, October 20, 2009

Map Reduce and Event-Based Parallel Patterns

Map_Reduce
This pattern describes the very topical map-reduce algorithm. Map-reduce has been around for a long time and I believe it is supported natively in some functional programming languages (OCaml?) under the guise of map-fold. The pattern does a decent job of describing the components that I need (the map and reduce functions) as well as some aspects of implementing a map-reduce framework, but I think it could do with a coding example using a simple toy map-reduce library, in which a simple operation such as scaling down images and then merging them could be implemented.
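Since I am asking for one anyway, here is a hedged toy sketch of my own in Java (not from the pattern text): the "images" are just pixel counts, the map step scales each one down independently, and the reduce step merges the results.

```java
import java.util.List;

public class ToyMapReduce {
    public static void main(String[] args) {
        // Stand-in for a list of images: just their pixel counts.
        List<Integer> imageSizes = List.of(4_000_000, 2_250_000, 8_100_000);

        int merged = imageSizes.parallelStream()
                // map: "scale down" each image independently (here: quarter size)
                .map(size -> size / 4)
                // reduce: merge the results; the operator should be associative
                // for a parallel reduction to be correct
                .reduce(0, Integer::sum);

        System.out.println("total pixels after scaling: " + merged);
    }
}
```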

Also, the implementation section seemed to assume a master-worker paradigm underneath ("the software architecture is logically based on master-worker"), but although this is certainly one reasonable way to implement it, there is no reason why one shouldn't do static load balancing or work stealing instead in certain situations (I don't consider work stealing a sub-category of master-worker as OPL does). They do later suggest a round-robin (cyclic) schedule for some tasks, but I still think they put too much focus on master-worker, which is described elsewhere and is really an implementation detail of this pattern.

I also wonder whether filter/reduce should be discussed as a variant of this pattern. It is certainly used out there, but maybe it is a different pattern.

Finally, I question their precondition that the reduce function must be associative. Perhaps this should be changed to "should, for optimal performance, be associative", as one can certainly get great speedups from the map phase even if the reduce step has to be performed sequentially.

Event-based, Implicit Invocation
This pattern discusses the architecture of event-based systems. One very common example of this is GUI frameworks, where events that are mainly triggered by the user are passed on to event handlers that subscribe to that event type. This pattern contained more detail than the other OPL patterns we have read, and for that reason I thought it was one of the better ones. However, at one point they state that "load balancing is significantly simpler with a centralized medium". Does not a centralized medium make the load-balancing problem go away entirely for this pattern?

Later on, in the example section, they discuss GUI event-based systems. These systems have a centralized manager, and this is identified as the biggest impediment to scalability. However, I would claim that the biggest impediment is not the centralized manager, but the fact that the event handling is done as blocking functions. If any of these functions takes some time to complete it will block everything. This can also be hard to discover: if one uses file IO, for example, everything may work well until the GUI suddenly starts to lock up when files are accessed through a networked file system. I thought such issues, seeing as they are so common in this very important class of event-based systems, deserved to be mentioned more clearly.
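A minimal Swing sketch of the problem and the usual fix (file path and labels made up by me): doing the blocking IO directly in the handler would freeze the whole event loop, so the slow work is pushed onto a background worker and only the final update is posted back to the event thread.

```java
import javax.swing.*;
import java.nio.file.Files;
import java.nio.file.Path;

public class NonBlockingHandler {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("demo");
            JLabel status = new JLabel("idle");
            JButton load = new JButton("Load file");

            load.addActionListener(e ->
                // Wrong: calling Files.readString(...) right here would block
                // the event thread and freeze every other handler until the
                // (possibly networked) IO completes.
                new SwingWorker<String, Void>() {
                    @Override protected String doInBackground() throws Exception {
                        // slow IO happens off the event thread
                        return Files.readString(Path.of("/tmp/example.txt"));
                    }
                    @Override protected void done() {
                        status.setText("loaded");   // runs back on the event thread
                    }
                }.execute());

            frame.add(load, "North");
            frame.add(status, "South");
            frame.pack();
            frame.setVisible(true);
        });
    }
}
```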

Making reliable distributed systems in the presence of software errors - Chapter 2

Chapter 2 of Joe Armstrong's doctoral thesis, "Making reliable distributed systems in the presence of software errors", discusses the architectural model of a system that can achieve that goal, presenting Erlang as one such system. Processes are here the central concept in the system and in the Erlang language. The chapter argues that the only way to have true reliability in the presence of bugs is to encapsulate each module in its own process, so that errors can be contained and the process killed if an error is detected. Accordingly, modules should not attempt to correct errors, but instead kill themselves, leaving recovery to someone else (usually the runtime environment). This is a strong property that simplifies the recovery model, as one can perform recovery from code that has not itself encountered any errors. It also frees the programmer from programming defensively, as he/she can rely on any other process either succeeding at what it was supposed to do or crashing. One therefore does not have to write extensive error-checking and error-handling code for each call to another module.

The concept of a Concurrency Oriented Programming Language (COPL) is introduced, in which programs should model the world as one does in Object-Oriented programming. Unlike Object-Oriented programming, however, it recognizes that the world is concurrent, and programmers are encouraged to map the concurrency of the world one-to-one onto the concurrency of their programs. To achieve this, processes are made first-class citizens of the language, and the runtime environment uses green (lightweight) processes to make concurrency cheap, which gives the programmer the freedom to model all the concurrency of the problem domain.

Processes communicate through asynchronous message passing. This means that no state can be shared and that if another process needs some data, that data must be copied to it by sending a message. This enforces the use of the Separable parallel pattern, which tends to lead to more scalable programs as one need not serialize access to the data through mutual exclusion. This property, in addition to the fact that the messages are asynchronous, also means that the potential for deadlocks is (almost?) completely removed. Finally, programs that copy data (as one has to do in message-passing APIs such as MPI) have greater data locality and avoid false-sharing problems, both of which are important for performance.

In Erlang, concurrency is not intended for performance. The main goals of programming concurrently are reliability (through the process abstraction), scalability, security and even modularity.

Sunday, October 18, 2009

Rereading the classics

Last week I was a bit overloaded so this blog post comes a bit late, but better late than never. The last chapter of Beautiful Architecture, titled "Rereading the Classics", was a great read for me. The author was for the most part discussing object orientation and presented the programming language Smalltalk as an example of a system that has taken this to the extreme and made everything an object. I have never seen Smalltalk code before, so it made for an interesting case study, and I can definitely see how Smalltalk's legacy has shaped subsequent systems like Java and C#.

After having left the previous chapter, with its arguments for extending the type system through inheritance to solve most problems, with an uneasy feeling, I was glad to see this chapter's careful use of it. I have indeed read Meyers' great "Effective C++" and have seen the penguins' failed attempts to fly before, so I strongly agree that inheritance means is-a and is not a mechanism for composing code like the horrible COMPOSITE_PART of page 334. We learned to favor composition from our teacher this semester and the rest of the Gang of Four, and there is no going back.

Smalltalk has a lot to teach us about Object Orientation, and I find it very interesting, but not surprising, that much of the community that gave us design patterns and the guidelines for developing Object-Oriented systems that we use today came from the world of Smalltalk. I've mentioned this before, but things like this make me really wonder how much we shape our tools and how much our tools shape us. I certainly agree that language facilitates thought, and surely this must mean that pure languages facilitate pure thought.

The author closes with a turnaround that surprised me. After having talked up Smalltalk's strengths and OO purity, he makes the case for more pragmatic programming languages like C and C++ and even throws in a great quote by Stroustrup (although of course a Dane will prefer Kierkegaard!). He gives us some examples from the world of physical architecture where some of the most celebrated architectural masterpieces of the last century, like the Fallingwater house and the Villa Savoye, turned out to be completely unlivable for their inhabitants. Form must follow function, and architecture must be made to facilitate the people who will live in it. It is like Alexander's villages in the Swiss Alps, where the naturally evolved patterns the buildings follow allow them to be beautiful while at the same time being functional for their inhabitants down to the last detail.

So it is for languages too, and although I doubt Smalltalk developers would characterize their language as unlivable, neither has it had the same role on center stage as C++, Java and C#. However, like the beautiful functional programming languages with their purity and expressiveness, and like the T/16 and Erlang with their ubiquitous focus on reliability through replication, it has served to inspire us and has influenced our other programming languages as well as the way we use them. In some ways I believe the ideas that came out of the Smalltalk community have proved more powerful than any single programming language. At the risk of causing controversy, I would like to draw an analogy to the social democracies of Scandinavia and, indeed, to some degree most of present-day Europe. The ideas of Communism were very pure, but inherently flawed at their core. However, instead of implementing its antithesis, Capitalism, which is also very pure and, although much preferable, has its own problems, they landed on a Hegelian synthesis of the two ideas, pragmatically borrowing what they perceived as the strengths of both with a market economy and a strong safety net. I do not endorse this one way or the other, but it is interesting how radical, pure ideas blaze for a while, but are then subsumed into more pragmatic approaches and become the status quo that everyone takes for granted. I too, like Stroustrup, am inspired by the rationalists, but when everything is said and done I prefer to live in a world based on pragmatic empiricism.

Thursday, October 15, 2009

Pipes & Filters, Layered Systems and Iterative Refinement

One of the groups in OPL consists of the structural patterns. These are not intended to document parallelism, but to provide a good architectural base from which the parallelism can be exploited through the use of patterns at the lower layers. In the OPL authors' own words, our inability to architect parallel programs stems from our inability to architect programs in general. Following this train of thought, the first problem cannot be solved without also solving the second.

This week we have read three of the structural patterns, namely Layered Systems, Pipe-and-Filter and Iterative Refinement. The first of these, the Layered Systems pattern, looked a bit like a draft and I thought it was lacking in explanation. I liked the POSA one better, as it did a better job of explaining the concept and the problems faced when trying to apply it. This one read more like a laundry list of things that are important when designing layered systems; I do not disagree that these are important points, but I do not think I would get most of it if I did not already know the pattern.

The second pattern, on Pipe-and-Filter systems, was better, but I am wondering why they don't provide a diagram of a pipeline or similar for illustration. I also missed a discussion of problems like error handling in pipelines. Should we handle that on the side, thus introducing extra communication, or should we flush the pipeline using poison pills when we detect an error? Or perhaps we should ignore it, as one does in most media-application pipelines?
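For the second option, a small sketch of my own (not from the pattern text): two filters connected by a queue, where a sentinel "poison pill" is sent downstream to flush and shut down the pipeline.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PoisonPillPipeline {
    private static final String POISON = "__POISON__";   // sentinel value, name made up

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> pipe = new LinkedBlockingQueue<>();

        // Downstream filter: consume items until the poison pill arrives.
        Thread sink = new Thread(() -> {
            try {
                for (String item = pipe.take(); !item.equals(POISON); item = pipe.take()) {
                    System.out.println("filtered: " + item.toUpperCase());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sink.start();

        // Upstream filter: produce items, then flush the pipeline on error/end.
        for (String s : new String[] {"a", "b", "c"}) {
            pipe.put(s);
        }
        pipe.put(POISON);   // tells the downstream stage to drain and stop
        sink.join();
    }
}
```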

They also mention in the forces section that the number of filters should be small to reduce pipeline fill/drain time. Unless they mean pipeline latency (the time it takes for one item to reach the other end, which is higher if it has to be moved between more stages), I fail to see the reason. Pipeline fill time is usually seen as a problem because for an N-stage pipeline you have to process N items before reaching full parallelism (N). This is in contrast to task parallelism, where one can exploit all the parallelism immediately. However, a pipeline with N+M stages also only needs N items to reach a parallelism of N, and the fact that it can reach a larger amount of parallelism in more time doesn't really affect that. Also, say you have a pipeline of 6 stages with a processing time of 1 each. You then merge these into 3 stages with a processing time of 2 each. Admittedly you now only need 5 time units, as opposed to 6, to fill the whole pipeline, but the first case needs only 3 time units to occupy 3 processors while the merged case needs 5. The problem with pipeline fill time is that you have to wait for the pipeline to fill up to get parallelism, but I can't see how that problem is solved by reducing the number of stages. Is there a fault in my argument here?

The last pattern discussed in this blog post is Iterative Refinement. If I understood it correctly, this is a fairly common pattern that goes under a few other names such as BSP, Loosely Synchronous Computation and Compute/Communicate. It was presented as a general programming model by Leslie Valiant (their reference), and it has been argued that it is to message passing what structured programming is to goto statements. In "Go To Statement Considered Harmful" Dijkstra argues in favor of structured loops instead of goto statements because they allow you to pinpoint and reason about a point in a program's execution based on its point in the source code, the call stack and a set of loop indices. With goto statements, in contrast, you have to consider the whole previous execution trace of that function. With BSP programs you can also pinpoint a point in the program's execution using the same information as structured programming plus iteration indices, giving you the same benefits. In addition, it also has the effect of separating communication from computation, which greatly reduces program complexity. One common usage of this pattern is in grid computations (image filtering, weather calculations), where the computation is set up as a sequence of steps in which different parts of the grid are computed on different processors independently of each other. Each step is then followed by a communication phase involving border exchanges before a new step is started.
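A bare-bones sketch of that compute/communicate structure, assuming a barrier-based shared-memory setting (the grid itself and the actual border exchange are stubbed out):

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class BspSketch {
    public static void main(String[] args) {
        final int workers = 4, steps = 3;
        // The barrier action stands in for the communication phase
        // (border exchange) that separates the compute steps.
        CyclicBarrier barrier = new CyclicBarrier(workers,
                () -> System.out.println("-- exchange borders --"));

        for (int w = 0; w < workers; w++) {
            final int part = w;
            new Thread(() -> {
                try {
                    for (int step = 0; step < steps; step++) {
                        // compute phase: update this worker's part of the grid
                        System.out.println("worker " + part + " computed step " + step);
                        barrier.await();   // everyone waits, then communication runs
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}
```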

Finding and Reproducing Heisenbugs in Concurrent Programs

CHESS is a tool for finding concurrency bugs such as race conditions and deadlocks in multithreaded programs. It does this by hijacking the thread scheduler and driving the scheduling itself. This allows it to try different thread interleavings in a systematic and deterministic fashion, which means it does not need to rely on spawning a lot of threads to get sufficient thread interference and then praying that the operating system will schedule them in such a way that a race or deadlock is triggered.

In addition, CHESS will persist the interleavings it tries, which means that once a bug is triggered it can be reproduced reliably. The latter is extremely valuable, as it is sometimes the case that one manages to provoke a bug, perhaps in extensive nightly or weekend testing, only to have it slip out of our hands like water by not being able to reproduce it. The only choice we are then left with is to attempt to stare the code down hoping the bug will pop out, or live with the uneasy knowledge that there is an unknown bug in the code. The former might take a long time and often requires a really smart developer (which prevents him/her from working on creating the future bug-free code). The latter is often the only choice, but is less than ideal, since if we do not know what the bug is then we do not know what other failures it might trigger. We usually accept that some less severe bugs live in our bug trackers to be fixed "eventually" (the bug trackers never seem to quite reach zero), but if we don't know what the bugs are then we can't know how severe they are.

However, CHESS is far from covering all interleavings. The number of potential thread interleavings in a program is enormous and includes all interleavings of all non-atomic instructions (mutually excluded sections are treated atomically), so there is no practical way to test them all for any non-trivial program. CHESS therefore makes the simplification of running all threads on the same unit of execution and never preempting them. This effectively means that only instructions that block or yield a thread will cause another thread to be scheduled, which rules out many possible data races. One of the ways they alleviate this is by running a data-race detector on the first few iterations and then manually instrumenting the code to yield at the potential data races reported. This manual instrumentation then lets the CHESS scheduler try out different interleavings of accesses to the potentially racy resources, which means it can detect races there in subsequent iterations. In effect it takes the output of the race detector and removes all the false positives that these tools are prone to produce (due to intentional races in the code). In the paper they only briefly mention the fact that they have to do this, but to me it seems to be an essential point, as I strongly suspect they would otherwise miss a lot of races.

The above simplification means they have far fewer interleavings to test, but the number is still very large, so they further optimize by intelligently searching for good interleavings and scheduling these first. This seems to me a very sensible way to do it, and it means that the user can run the tool on a set of tests for a fairly short period to cover the most important interleavings, or run it for days or weeks to get increasingly more coverage, but with diminishing returns.

I also have to say that although the versions of libraries such as PLINQ and TPL that they used were likely the pre-release research versions and not production code (I think they are scheduled for the next .NET?), I am impressed by the number of bugs they found. Since we can only really measure testing by the number of bugs we find, and not by the number we haven't found, a tool that detects new concurrency bugs in code that has been stress-tested for weeks is definitely useful in cases where we need that level of reliability. For many applications, however, it might not be so bad to have a bug that only manifests once every quarter, so in those cases it might not be worth the extra effort.

Tuesday, October 13, 2009

Refactoring for Reentrancy

The paper "Refactoring for Reentrancy" presents a refactoring algorithm for converting global/static state into thread local storage (tls) variables in order to separate the execution of individual threads from each other. Reentrancy is a property of code that is usually used to mean that that particular piece of code will not have side effects that will interfere with other code that is run in parallel.

To begin with, their vocabulary confused me, as they talked about programs needing to be reentrant and not have global state to prevent them from interfering with other programs being run in parallel. The most common usage of the term program implies a separate process, which means it will have its own set of global variables and there will be no problem. However, I quickly figured out that their use of the term program was broader and included libraries (which can be called from multiple threads) as well as one thread of execution inside an application. With that definition their discussion makes much more sense.

As I mentioned, the goal of their algorithm is to replace static data with thread-local storage. Conceptually this is quite straightforward, but it introduces the need for getters and setters as well as issues related to the initialization of that data. The latter they handled by using lazy initialization, which seems to work well in most cases as long as the static fields were not initialized from mutable variables.
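Roughly the kind of rewrite this amounts to, sketched by hand in Java (the field names are invented): a mutable static field becomes a lazily initialized ThreadLocal hidden behind a getter and a setter.

```java
public class Counter {
    // Before: shared mutable static state (racy when used from many threads).
    // private static int requestCount = 0;

    // After: one copy per thread, initialized lazily on first access.
    private static final ThreadLocal<Integer> requestCount =
            ThreadLocal.withInitial(() -> 0);

    // Accesses go through getters/setters introduced by the refactoring.
    public static int getRequestCount() { return requestCount.get(); }
    public static void setRequestCount(int value) { requestCount.set(value); }

    public static void handleRequest() {
        setRequestCount(getRequestCount() + 1);   // no races: thread-private state
    }

    public static void main(String[] args) {
        handleRequest();
        System.out.println(getRequestCount());    // 1, independently in every thread
    }
}
```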

The point of reentrancy is of course to be able to execute code in parallel without race conditions. In its most common usage the term refers to functions that do not have side effects on global memory, and it is stricter than the term thread safety. Thread safety allows side effects as long as they do not contain races; for non-reentrant functions this can be guaranteed through the careful use of locking. However, thread safety does not mean functions will not interfere with each other, and the use of locks will often lead to the code becoming serialized, thereby removing the benefits of running it in parallel. Reentrancy in a function or a library is therefore a very attractive property, as it means the code can safely be used from different threads without introducing thread-synchronization overhead. In effect, reentrant code is embarrassingly parallel.

Of course it is not always possible to turn global variables into TLS variables. In many cases program correctness depends on threads being able to communicate through global data, and removing the global data will in these cases lead to incorrect programs. In addition, use of TLS data is not free, and accessing it is more expensive than accessing global data, which introduces overhead. However, I like their approach of first making the program correct by removing race conditions on global data, and then having the programmer optimize the code by moving TLS data to the heap or the stack and potentially also re-introducing shared data (global or on the heap), but this time with proper synchronization.

Finally, their algorithm did not let them turn static fields that were initialized using mutable data from another object into TLS data structures. This is because moving the initialization to the first use of the variable means that the other variable used to initialize it could have changed in the meantime. I felt they could have solved this case by introducing a new static final variable that is initialized to the load-time value of the mutable variable and then using this variable to initialize the TLS variable lazily on first use. That way all the threads' TLS variables would be initialized, and they would all be initialized to the same, and semantically correct, value. Can anyone see a hole in this technique?
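Here is what I have in mind, as a hypothetical Java sketch (all names made up): the static final snapshot is taken exactly where the original static initializer ran, and the per-thread value is then seeded lazily from that snapshot instead of from the mutable field.

```java
public class SnapshotInit {
    static int configuredLimit = readLimitFromSomewhere();   // mutable, may change later

    // Snapshot taken at class-load time, exactly when the original static
    // initializer would have run, so every thread later sees the same value.
    private static final int LIMIT_AT_LOAD_TIME = configuredLimit;

    // Lazily initialized per-thread state, seeded from the snapshot rather
    // than from the (possibly already changed) mutable field.
    private static final ThreadLocal<Integer> remaining =
            ThreadLocal.withInitial(() -> LIMIT_AT_LOAD_TIME);

    static int readLimitFromSomewhere() { return 10; }

    public static void main(String[] args) {
        configuredLimit = 99;                      // later mutation is ignored
        System.out.println(remaining.get());       // prints 10 in every thread
    }
}
```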

Thursday, October 8, 2009

Software Architecture: Object-Oriented versus Functional

This chapter of Beautiful Architecture compares the architectural benefits of functional and Object-Oriented languages. It is written by an open OO proponent and this shines through, even though he (at least claims to) make an honest effort to stay neutral. Most paragraphs take one functional-programming usage from the cited paper on functional programming and argue that it can be done as well or better using Object-Oriented approaches and types. In the end the author also argues for agents/closures/delegates, stating that they are a natural extension to Object-Oriented technology that we can and should steal from the functional world. This seems to be in line with my comment in an earlier blog: instead of functional programming marching to the forefront, functional languages presently seem to serve as a source of influence for imperative programming languages, which are subsuming many of their concepts. I'm sure it must be a bit frustrating for some long-time functional-programming purists to see their ideas being assimilated into their imperative rivals instead of seeing their languages win out, but that is another issue altogether.

Reading the parts about modularity and extensibility, I thought the author was relying too much on inheritance. Modularity is not expressed solely through types, but also through the composition of these. In fact, my biggest takeaway from the design patterns book was not a particular pattern but the notion of "favor composition". Inheritance leads to very strong bindings, which makes it hard if you ever need to change the hierarchy. In addition, an inheritance hierarchy will strongly embed one particular decomposition into your application, which lends itself to a stronger tyranny of the dominant decomposition. Since there are usually many possible decompositions of one domain, we should avoid baking any one of them into our applications more than we have to, in case we need to change it. The point of OO was to model the world, not to solve every problem through type hierarchies, and I do not think that always extending functionality by finding the place in the type hierarchy where the change fits and inserting it there through inheritance is a good idea. In addition, I do not really like multiple inheritance, as I feel it weakens the is-a intuition of inheritance and leads to very rigid design structures with tight coupling.

As I mentioned above, the final part talks about agents, and I found this the most interesting part of the chapter. They describe agents as a way to infuse a class with new functionality at runtime, explaining how it cleanly replaces Visitor, which is used in compilers, etc. I find this way of looking at it enlightening, as this is usually exactly what we want to do with task- and data-oriented libraries such as ParallelArray and ForkJoin: we want to infuse a task class with new functionality, namely the code we want it to run in a separate thread of execution. This is to me a stronger argument for such constructs than replacing Visitor, as I believe this usage pattern will often be needed in the future.
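A small Java illustration of that idea (my own, not the chapter's notation): the behaviour is handed to the generic task machinery as a closure at runtime, instead of being baked into a Task subclass or routed through a Visitor.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AgentStyleTasks {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // The "agent": behaviour passed into the generic task machinery at
        // runtime, rather than baked into a subclass of some Task type.
        Callable<Integer> work = () -> List.of(1, 2, 3, 4).stream()
                                           .mapToInt(Integer::intValue)
                                           .sum();

        Future<Integer> result = pool.submit(work);
        System.out.println("sum = " + result.get());   // prints sum = 10
        pool.shutdown();
    }
}
```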

To conclude, even though I disagree with the chapter on some finer points and feel that the author is fairly biased even though he makes a conscious effort not to be, I believe I agree with his final conclusion. To answer one of JLSjr's questions: for most problems I do prefer Object-Oriented technology's analogy to the real world over the mathematical foundation of functional programming. After all, most of our software is written to support real-world processes and not to perform mathematical calculations.

ReLooper: Refactoring for Loop Parallelism

ReLooper is another refactoring tool; it converts normal Java arrays into one of the new Java ParallelArray constructs (JSR166y). Using this tool one can select a Java array and ask to have it converted to a ParallelArray. The tool will then perform the necessary conversion (changing types and replacing calls to new with calls to the ParallelArray factory methods). It will also analyze the loops that the array is used in and try to convert them to an equivalent ParallelArray method that performs the same operations, but in parallel.

In order to maintain semantic equivalence, ReLooper performs a pretty impressive analysis of the loop contents to detect potential race conditions. This includes checking that no loop iteration reads or writes a variable that another loop iteration writes to (Bernstein's conditions), but it also verifies that there are no IO operations, as these can block one of the underlying worker threads, leading to poor performance or even causing the loops to become sequential again.
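A hand-written before/after sketch of the kind of loop this targets, using the JDK's own Arrays.parallelSetAll as a stand-in rather than the exact JSR166y ParallelArray calls; the comments note the two checks mentioned above.

```java
import java.util.Arrays;

public class LoopParallelization {
    public static void main(String[] args) {
        double[] masses = new double[1_000];

        // Before: an ordinary sequential loop over a plain array.
        for (int i = 0; i < masses.length; i++) {
            masses[i] = Math.sqrt(i + 1);
        }

        // After: an equivalent data-parallel form. This is only safe because
        // (a) iteration i touches no element that another iteration writes
        //     (Bernstein's conditions hold), and
        // (b) the loop body does no IO that could block a worker thread.
        Arrays.parallelSetAll(masses, i -> Math.sqrt(i + 1));

        System.out.println(masses[99]);
    }
}
```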

What I like best about this tool is the analysis part. In the examples given not many lines are changed (12-52 in their examples) and I could fairly easily make these small changes myself, but the fact that it can guarantee that I did not introduce a bug and that the other tens or hundreds of thousands of lines still work is a big time saver. If I were to do this myself for a big and complex application, I would probably spend 30 minutes converting the original array and hours or days convincing myself that everything still works. Moreover, the fact that they get so few false warnings while being conservative and treating objects as single blocks is impressive.

One thing that has previously come up is that an industrial-strength tool should also do splitting and merging of loops. Say, for instance, that the creation, moving and reduction of the particles in their example were done in the same loop; then this tool would need to implement that loop as a reduction loop (since cm is still mutated) and would miss the opportunity to use the more general apply and replaceWithGeneratedValue methods on the first two parts of the input. However, research, like development, is incremental, and as Professor Johnson nicely puts it: "When you see a flying elephant you don't complain about its style".

Another thing that again strikes me is the fairly large amount of boilerplate that is required to use something like parallel arrays in Java. One has to find the right class to implement amongst 132 classes and extend it, thereby creating a new class representing the content of the loop body. As one guy mentioned in an SE seminar where we discussed this paper: "The number 132 is probably the best argument for lambda expressions in Java". At least Java has anonymous classes; I have seen C++ TBB code and it is far uglier than this (and I consider myself a C++ guy). However, if C++ gets lambda expressions in the next version, or if one uses Apple's closure (blocks) extension, these kinds of libraries become much less intrusive.
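To make the boilerplate point concrete, here is the same trivial operation written both ways in Java, using a simplified stand-in interface rather than one of the 132 real ones (lambdas did eventually arrive in Java 8):

```java
import java.util.function.DoubleUnaryOperator;

public class BoilerplateContrast {
    public static void main(String[] args) {
        // Anonymous-class style: a whole class body just to wrap one expression.
        DoubleUnaryOperator scaleOld = new DoubleUnaryOperator() {
            @Override
            public double applyAsDouble(double x) {
                return x * 0.5;
            }
        };

        // Lambda style (Java 8+): the loop body, and nothing else.
        DoubleUnaryOperator scaleNew = x -> x * 0.5;

        System.out.println(scaleOld.applyAsDouble(8) + " " + scaleNew.applyAsDouble(8));
    }
}
```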

Finally, I would like to address Danny Dig's third question on this paper, which asks us to compare and contrast putting the programmer in the driver's seat as opposed to fully automating the transition to a parallel program (compilers and the like). I myself like to be in control of my code and the transformations that are applied to it. I do not trust automatic transformations that I cannot oversee and that I cannot be sure are still there after I perform a refactoring or fix a bug. Although the former is a somewhat psychological point, I believe many programmers feel this way, which makes it an important one. Furthermore, I find compiler parallelization to be brittle, and if I were working on a program that depended on it I would feel inclined to have fairly extensive performance regression tests attempting to verify that the parallel loop did not suddenly become sequential because a minor change confused the compiler. I therefore much prefer the approach of this paper, as it puts me in charge, lets me see and approve the changes to my code, and updates my code to reflect these changes. The latter is important to me as it means I can read the code to see how it will be run, and do not need to read compiler manuals or logs to know whether it will be parallel or not. I also really liked JLSjr's point about the educational value of such a tool. I'm sure that after performing a few of these interactive transformations the developer would be much more comfortable with having "evil" parallelism in his code.

Monday, October 5, 2009

When the bazaar sets out to build cathedrals

Since its beginning KDE has grown to become one of the two most influential desktops. It is built on top of Qt (pronounced cute by trolls) which is a very powerful platform independent framework and toolkit for creating GUI and other applications.

The chapter starts with an extended discussion of factors that contributed to KDE's success. I loved this part, which focused on the social and political aspects of an open source project, the types of issues that are encountered and how KDE solves them. The chapter highlights how KDE has avoided a management team and instead leaves the decisions in the hands of those who do the work. This is a concept I have heard referred to as do-ocracy, and I believe it is one of the most important aspects of an open source project. I doubt any large open source project lacking this feature has ever been very successful.

Another interesting comment in the chapter was how they described the act of reaching consensus as being as important as the consensus itself. This keeps group cohesion and facilitates group "gelling". They also identify (free) software communities as first and foremost social, not technical, structures. I think this is something corporations should learn from, as people are the biggest factor in software and a well-functioning group is far more useful than adopting a new and better tool, language or methodology (tm) - software is peopleware and the two can't be separated.

They go on to describe the background and architecture of the Akonadi project. Akonadi is a platform for central storage, management and graceful interchange of personal information such as emails, address book information, etc. In the beginning such information was contained in separate applications that quickly became outdated due to unanticipated requirements. At first they resisted restructuring their code and attempted to meet these new requirements by patching the existing applications. The result is described as a messy code base, and as with the messy metropolis the chaos seemed to flow out from the code and affect the social structure, leading to an unattractive working environment. This effect was described as Conway's law in reverse in previous readings. The chapter was a bit vague here, but it seemed to suggest that what eventually pulled them out of this state was that the software increasingly needed to communicate and cooperate, forcing them to do the same. The software had to come together, so they did too. There were probably other factors as well that the chapter doesn't mention for political reasons, but I find it interesting to wonder about how much what we make in turn affects the makers.

In the end they came to a crossroads and decided to move the applications they had put so much effort into to the sideline and instead focus their hard-won expertise on building the infrastructure for the next generation of PIM software. After some back and forth, their new architecture adopted a client/server approach where a separate process controls all the PIM information and provides services to applications. These applications can communicate with the PIM process either through the multithreaded Qt-based libakonadi library or through any other third-party library that understands the protocol. Since communication with the server goes through a standardized protocol, applications are not required to link against C++/Qt libraries, meaning the server can comfortably be used by applications such as Evolution, which is implemented in C. It therefore has the potential to reach its goal of being the one-stop shop for all PIM information. One thing I did find strange and unnecessary, however, was their decision to implement the server using Qt. I cannot see any reason why it could not have been implemented using the STL and pthreads, and it means that a non-KDE system, for instance a Maemo-based phone, has to import Qt to use it.

The chapter ended with a discussion of the ThreadWeaver framework, which is another task-based parallel library that uses a thread pool to run tasks. ThreadWeaver allows users to specify tasks to be run in parallel as well as dependencies between these tasks. The former allows for embarrassingly parallel workloads, while the latter lets it pipeline tasks, utilizing more of the inherent concurrency. Under the hood, ThreadWeaver contains a scheduler that schedules these tasks and provides load balancing. Furthermore, the user can give tasks priorities that the scheduler will respect, allowing them to tune the scheduling for increased throughput.
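ThreadWeaver's own API is C++/Qt, so here is only an analogous Java sketch of the idea of declaring tasks and the dependencies between them (the task contents are invented): independent tasks may run in parallel, and a dependent task is scheduled only after the tasks it depends on have completed.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DependentTasks {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Two independent tasks: eligible to run in parallel on the pool.
        CompletableFuture<String> download =
                CompletableFuture.supplyAsync(() -> "mail body", pool);
        CompletableFuture<String> loadAddressBook =
                CompletableFuture.supplyAsync(() -> "alice@example.org", pool);

        // A task that declares dependencies on both: it is scheduled only
        // once the tasks it depends on have completed.
        CompletableFuture<String> compose =
                download.thenCombineAsync(loadAddressBook,
                        (body, to) -> "to: " + to + "\n" + body, pool);

        System.out.println(compose.join());
        pool.shutdown();
    }
}
```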

Thursday, October 1, 2009

GNU Emacs: Creeping Featurism Is a Strength

The title of this week's chapter states that creeping featurism, which is widely regarded as a problem in many software projects, can also be a strength. Before reading this chapter I had not really had any experience with emacs. Sure, I have installed and opened it, but I quickly retreated to the safety of my beloved VIM due to the unfamiliar interface. I have observed that a higher percentage of the really good developers I have met use emacs than of developers in general. I don't know if this means anything, though, other than that it is an advanced tool for advanced people.

The chapter talks about the architecture of emacs, describing it as an MVC architecture. The internals of emacs consist of a set of character buffers that are its models. The buffer data are flat sets of characters, but they also support various features such as overlays and text properties. This part of emacs also contains a display engine that detects changes to the buffers and merges these into larger update operations that are communicated to the views. The views then take these "update events" and display them as efficiently as they can. This observer-like design is a powerful feature of emacs in that it frees the controllers from having to worry about how the text they produce or update will be displayed. This essentially breaks a many-to-many relationship and greatly reduces complexity.

The final part of this architecture is what is described as the core of emacs, namely its controller layer. This layer, consisting of very thick controllers, is I guess by far the biggest part of emacs, as it essentially contains all of its application functionality. Where the model is written in C, the controller layer is written in Emacs Lisp that is interpreted by emacs. As such, emacs is really an abstract machine for processing and displaying text that provides a set of facilities such as buffer management, etc., and Emacs Lisp is a domain-specific programming language with hooks into the underlying model, allowing it to consume and produce text and properties.

By developing such a framework for producing features they have created a very strong separation between the internals of emacs (its model and interpreter) and its feature controllers. This means people can add new features without introducing new complexity in the core much in the same way that people can write new applications for the x86 processor without adding complexity to it. This separation, strongly enforced by the interpreter, is the key to it not blowing up under the weight of all its features.

While reading the chapter I was thinking about how this relates to the chapter on adaptive object models. Both are about platforms that implement interpreters for special-purpose rules. In emacs' case the rules are functions written in a Turing-complete language that operates on text. In adaptive object models such as the ones described in that paper, I guess one tends to use less complex business rules than that, but the basic concept seems to me the same in that both provide abstract machines for implementing rules instead of implementing the rules themselves.

Finally, one thing I was curious about is how things like syntax highlighting are handled. One way would be to have the views format the text, but this would place a lot of application logic in the views. The way I guess they do it is to use the text properties to describe the different colors of the text and then have a set of controllers read the buffers and produce these properties. This of course would mean that the views must understand these properties, but a certain amount of view logic is unavoidable.

A Java Fork/Join Framework

The title of this paper was a bit misleading to me as I thought we'd be looking at a pthread-style framework. My first thought was that Java's OO-style threading model is after all a bit nicer than pthread's so why add fork and joins to the language? The paper turned out to be about a fork/join task framework, which makes much more sense and I think the word task should have been in the title.

Java Fork/Join seems like a nice little framework that supports a somewhat convenient way to specify tasks to be run in a thread pool. Tasks (sometimes called fibers) here, and most other places, refer to lightweight chunks of work that are scheduled onto a pool of threads. What makes these more lightweight than threads is that they allow us to reuse threads and we therefore don't need to pay for thread creation or the overhead in having too many threads. Thread pools such as these are not a new thing and programmers have been defining them for a long time for similar purposes, but a framework like this greatly reduces the overhead in rolling your own thread pool implementation.

What is in my opinion one of the greatest potential benefits of such thread-pool-style solutions is that they allow one central component to create the correct number of threads for the given architecture. That way the developer doesn't have to guess how many threads to use. I used the word guess on purpose in the last sentence because even if a developer knows the number of cores on the target machine, and there is only one target machine, he is very unlikely to know how many other threads are running on the system. He therefore cannot know whether 4 new threads are too few or too many for an 8-core CPU. The problem is that only the operating system has a global view of the number of threads in use, so this framework can't, and doesn't, take all of that into account. However, it would be nice if it at least detected the number of cores on the computer at runtime and gave you a number of threads reflecting this by default, so that applications can scale as they are deployed on new computers. One example of a parallel framework that does this for you is QtConcurrent from Nokia (formerly Trolltech).
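A one-liner sketch of the "at least detect the cores" idea (the pool here is a plain executor, not the paper's framework):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CoreSizedPool {
    public static void main(String[] args) {
        // Size the pool from the machine it actually runs on, instead of a
        // number guessed at development time (other processes stay invisible).
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        System.out.println("created a pool with " + cores + " threads");
        pool.shutdown();
    }
}
```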

As a side note, another framework I recently read about that really impressed me because of its simplicity and fitness for its particular use was Apple's Grand Central. Grand Central is a task-based framework with a thread pool and task stealing much like this one (although it does not use the fork/join pattern). The great strength of Grand Central is that it is integrated with the internals of the Darwin kernel (Mac OS X's core), which means it does know just how many threads it should create to get optimal parallelism without too much overhead and it can reuse these threads for all applications using it.

The Java Fork/Join framework's model of parallelism is, not surprisingly, fork/join parallelism. The central theme here is that a task can spawn a sub-task that will be a child of that task. One can later join with the task to wait for its completion and delete it. This means that at any point in time there is a tree of tasks, and if one looks at the lifetime of the tasks one will see a DAG. This is a style of parallel programming that lends itself especially well to recursive problems, but that is flexible enough to handle other kinds of problems by "emulating" other parallel programming models. One example of the latter is loop parallelism, which can be "emulated" by forking a set of tasks and then having them compute different parts of an array in parallel. This works well enough, but is more complex to program for these kinds of problems than expressing the loop parallelism directly with something like the ParallelArray framework (to be shipped with the next version of Java).
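The recursive-decomposition shape this encourages looks roughly like the following sketch, written against the fork/join API that eventually shipped with Java 7 (the array-summing task itself is my own toy example):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;
    private final long[] data;
    private final int from, to;

    SumTask(long[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {          // small enough: just do it sequentially
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) / 2;
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                           // spawn the left half as a child task
        long rightSum = right.compute();       // work on the right half ourselves
        return left.join() + rightSum;         // join: wait for the child's result
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = 1;
        long total = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(total);             // prints 1000000
    }
}
```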

The framework implements task queuing by giving each thread a task pool and then letting threads steal tasks from each other when they run out of work themselves. The way they implemented this was through shared deques, where a thread uses its own deque as a stack while other threads steal work from it by using the deque as a queue. At first I reacted to this because it is not fair for the last entered tasks to be executed first (in cases where each task has individually observable results), but the reason they did it is reasonable and enlightening. They basically put throughput before fairness and used the heuristic that the biggest tasks are created first. The tasks at the steal end of the deque will therefore likely be bigger, meaning a thread that steals one will probably not need to steal another for a while, thereby reducing synchronization.