Thursday, November 12, 2009

Structured Grids, Branch and Bound and Graphical Models

Structured Grid Computational Pattern
This is a very common pattern and one that is easy to get a mental image of. Structured grids model spaces in N dimensions as discrete points that interact. Their interaction can be regular or irregular (this pattern only talks about regular interactions) and in the regular case they are usually characterized by a fixed-size stencil.

We discussed it in class today and the biggest problem with this pattern is that it overlaps heavily with geometric decomposition. Indeed, it actually describes geometric decomposition as the essence of the task manager described in this pattern's solution. First, geometric decomposition is about decomposing and solving geometric problems in parallel, with the issues that brings, and is not just a task manager. Second, if geometric decomposition were the essence of a subset of this solution, then why have that pattern at all? A much more sensible strategy is for this pattern to discuss the problem/solution of performing structured grid computations, discuss stencils, and then provide pointers to patterns that can be used to parallelize it. This could of course be geometric decomposition, but also master worker, etc. In particular this means that border exchanges belong in geometric decomposition (which only mentions them for about one paragraph). Indeed, as a part of cs527 I have actually written an entire pattern about just border exchanges. The current draft, with all its flaws, can be found here: URL.

Also, border exchanges as described in this pattern are insufficient to implement their first example. Even if the ghost nodes extend three rows/columns out from their chunks, as they mention, they still have to deal with the (literal) corner cases, as their stencils will need cells from diagonally neighboring chunks. To see how this is usually handled, refer to the border exchange pattern above.

Actually, today's discussion made the tight relationship between geometric decomposition and border exchanges clearer to me and I am wondering whether it would make sense to attempt to extend my border exchange pattern into a geometric decomposition pattern, or perhaps to try to merge them. It would give geometric decomposition a very specific meaning, more so perhaps than the OPL authors intended. It would essentially cover data-parallel, iterative refinement algorithms with border exchanges, but for me that is what geometric decomposition is. In patterns I value preciseness more than completeness. The original design patterns had this property. They did not give a design framework vague enough to be complete, like many textbooks attempt. Instead they gave crisp, specific and often non-intuitive solutions to concrete problems, and for me it was this that gave them their greatest value. Besides, I suspect other data-parallel algorithms with other types of communication, like the tiled N-Body example given in yesterday's OPL session, are covered by the Data Parallel pattern.


Branch & Bound Pattern
This pattern was pretty good. It started out with a decent example in the SAT decision problem and gave a good explanation of the issues with branch and bound, by authors who seemed to know what they were talking about very well. One area of potential improvement would be to add a code example (I sketch what I have in mind below). The paper also seems to implicitly conclude that structure-based parallelization is the one that makes sense, but if that is the case then why talk about the other two? Potentially construction-based parallelization can give less synchronization overhead at the cost of more work, but I can't see a case where I would want operation-based parallelization. If there are such cases then the pattern should address those.
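To make that suggestion concrete, here is roughly the kind of example I would add: a minimal structure-based parallel branch and bound for a toy 0/1 knapsack instance. The problem, the (deliberately naive) bound function and all names are mine, not the pattern's. Each subtree of the decision tree becomes a task, and all tasks share a single atomically updated incumbent.

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveAction;
    import java.util.concurrent.atomic.AtomicInteger;

    // Structure-based parallel branch and bound: each subtree of the
    // binary decision tree (take item / skip item) is its own task.
    public class ParallelBranchAndBound {
        static final int[] VALUE  = {60, 100, 120, 80, 30};
        static final int[] WEIGHT = {10, 20, 30, 15, 5};
        static final int CAPACITY = 50;
        static final AtomicInteger best = new AtomicInteger(0); // shared incumbent

        static class Node extends RecursiveAction {
            final int item, value, weight;
            Node(int item, int value, int weight) {
                this.item = item; this.value = value; this.weight = weight;
            }
            // Naive optimistic bound: current value plus every remaining item.
            int upperBound() {
                int ub = value;
                for (int i = item; i < VALUE.length; i++) ub += VALUE[i];
                return ub;
            }
            protected void compute() {
                if (weight > CAPACITY) return;            // infeasible: prune
                best.accumulateAndGet(value, Math::max);  // update incumbent
                if (item == VALUE.length) return;         // leaf
                if (upperBound() <= best.get()) return;   // bounded: prune
                // Branch: one subtask takes the item, the other skips it.
                invokeAll(new Node(item + 1, value + VALUE[item], weight + WEIGHT[item]),
                          new Node(item + 1, value, weight));
            }
        }

        public static void main(String[] args) {
            new ForkJoinPool().invoke(new Node(0, 0, 0));
            System.out.println("best = " + best.get()); // 270 for this instance
        }
    }

The incumbent read in the bound test may be slightly stale, but that is benign: it only means we occasionally expand a subtree we could have pruned, never that we prune one we should have expanded.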

Another possible improvement that was suggested in class would be to start off with a chess or checkers example. The examples that are there are good, but I suspect many more developers will have some mental image of how a computer chess player searches for moves. It would at least be a more colorful and graphical first example.


Graphical Models
When first reading the title of this pattern I must admit I was thinking of 3D meshes. We discussed it in class and others seemed to have different first impressions, but none matched what it is actually about. Professor Johnson suggested that maybe it should be called Probabilistic Methods or Stochastic Methods, either of which would have been preferable.

Even after reading the pattern I have no idea where it would apply or how I would start applying it. It lacks an example and I happen not to know the Junction Tree Algorithm. An example would have gone a long way, but the pattern even lacked pointers on how to go about parallelizing it. Other computational patterns have a parallel issues section and some, like Circuits, are sprinkled with pointers to other patterns. This one had one line saying that "This search can be rather trivially parallelized", related to trying out randomized factorizations, which leads me to suspect that other parts of the pattern can be parallelized. If not, then wouldn't this just be a subset of branch and bound, or perhaps something simpler?

Shared Queues, Speculation and Circuits

Shared Queue
I thought this pattern overall was pretty good and it covered most of the issues of queues. It also had quite extensive examples (taken from the ppp book). A couple of things I would have liked are a discussion of notify vs. notifyAll and of the stampedes that can occur if too many threads are awakened at the same time, as well as a mention of the pthread primitive that provides wait/notify functionality (condition variables). On the other hand I liked how they gave a reference on how to implement notification with semaphores (I have seen this done once -- sometimes you just don't have access to libraries like full pthread for various reasons; see my sketch below). I also think they could have dropped the third item in their solution ("Consider using distributed queues"). The way it was described on page 8 it is better covered by the task queue pattern. This pattern, as I understand it, is about implementing a queue, not about using one or more queues to share tasks, which is what the other pattern is about.
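For reference, here is the semaphore-based construction I believe they are alluding to (my own minimal sketch, not their code): a bounded queue where one semaphore counts items, another counts free slots, and a third acts as the mutex.

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.concurrent.Semaphore;

    // A blocking queue built from semaphores instead of wait/notify.
    public class SemaphoreQueue<T> {
        private final Queue<T> queue = new ArrayDeque<>();
        private final Semaphore mutex = new Semaphore(1); // protects the deque
        private final Semaphore items = new Semaphore(0); // counts filled slots
        private final Semaphore slots;                    // counts free slots

        public SemaphoreQueue(int capacity) { slots = new Semaphore(capacity); }

        public void put(T item) throws InterruptedException {
            slots.acquire();          // wait for a free slot
            mutex.acquire();
            queue.add(item);
            mutex.release();
            items.release();          // wake at most one consumer
        }

        public T take() throws InterruptedException {
            items.acquire();          // wait for an item
            mutex.acquire();
            T item = queue.remove();
            mutex.release();
            slots.release();          // free a slot for producers
            return item;
        }
    }

Note that releasing a semaphore wakes at most one waiter, which neatly sidesteps the notifyAll stampede I mention above.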


Speculation
I thought this pattern was really interesting and I am glad I have read it, but it also perfectly illustrates to me Professor Johnson's point about starting out with an example. I did not know this pattern beforehand and was left guessing about what it could be used for and how it worked for six pages before I finally got to the example section. In the meantime I was fed a lot of seemingly interesting facts, but not having an example to attach them to left me very frustrated. The HTML example should have been presented after the first paragraph of the context, though its full solution could have been addressed later in the solution section.

Of course, like Professor Johnson points out, this is due to the OPL format so I think the solution has to be to change it or forever live with the tension of trying to force an example into the start of a format that invites you to put them late. My suggestion is to simply change the name of the example section to additional examples. Not only would this invite writers to have one example in the context/solution part of the pattern, but it would also invite more than one example. The additional examples section could even contain more tricky examples, which are nice to get after one has been guided through the solution using a simple and obvious one.

A couple of minor suggestions are to add regexes to the HTML lexer example in addition to the textual descriptions (page 6), which for me at least would have been much easier to read, and to add another paragraph explaining how to use speculation with speech recognition. On a positive note I really liked how they contrasted speculation with branch and bound in the related patterns section, describing how they differ. This gave me a better understanding of both patterns.


Circuits
At first I thought the pattern was a concurrent execution pattern, so I was very unhappy that it did not talk about parallelism in the problem and context sections. However, I am more content with it after I figured out it was a computational pattern.

The pattern deals with different techniques/algorithms for bit-level manipulation. What I cannot quite figure out is why it is called Circuits and not just Bit Manipulation. Sure, working with bits is working close to the hardware, but the name Circuits implies that the computational pattern is about modeling circuits or something similar. I had a look at the notes section for this pattern on the OPL site and it talked about gates and synthesis, which is what I originally expected and which compounded my confusion. Has there been a mixup here? Is this pattern perhaps incorrectly named?

Other than that, the bulk of the pattern is in the example section, which presents several algorithms for doing different bit manipulation tasks such as hashing, bit-counting and parity checking (two classic tricks of this kind are sketched below). Many of these can be parallelized and the pattern does a good job of giving pointers to other patterns that can be used to do so, and I think this is the right way to do it for computational patterns.
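As a taste of what the pattern catalogs, here are two such tricks (standard bit-twiddling folklore, not code from the pattern): parity by XOR-folding, and population count by summing bits in a divide-and-conquer fashion -- the same structure the pattern's parallelization pointers exploit.

    public class BitTricks {
        // Parity by XOR-folding: each shift XORs the top half of the
        // remaining bits into the bottom half, so the lowest bit ends
        // up holding the XOR (parity) of all 32 bits.
        static int parity(int v) {
            v ^= v >>> 16;
            v ^= v >>> 8;
            v ^= v >>> 4;
            v ^= v >>> 2;
            v ^= v >>> 1;
            return v & 1;
        }

        // Population count by summing bits in parallel: 2-bit sums,
        // then 4-bit sums, then byte sums, then one multiply to add
        // the four byte counts into the top byte.
        static int popCount(int v) {
            v = v - ((v >>> 1) & 0x55555555);
            v = (v & 0x33333333) + ((v >>> 2) & 0x33333333);
            v = (v + (v >>> 4)) & 0x0F0F0F0F;
            return (v * 0x01010101) >>> 24;
        }

        public static void main(String[] args) {
            System.out.println(parity(0b1011));   // 1 (three bits set)
            System.out.println(popCount(0b1011)); // 3
        }
    }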

Wednesday, November 11, 2009

Task Queue, Graph Partitioning and Loop Parallelism

I was ill and therefore not very productive at the start of this week, so this blog comes a bit late, but better late than never.

Task Queue
I was pretty happy with this pattern and felt it had enough examples for my taste. It did a decent job of describing both shared queues as well as a high-level overview of work stealing, although it was perhaps a bit light on the latter. The only thing I missed from the example section concerns the Java example. When implementing task queues in shared memory programs one usually uses a condition variable to put threads to sleep instead of spinning, waiting for the queue to return non-null values (a sketch of what I mean follows), but I guess this is mentioned in the shared queue pattern.
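For completeness, this is the blocking take I have in mind, using Java's built-in monitor as the condition variable (my sketch, not the pattern's code):

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Workers sleep on the queue's monitor until a task arrives,
    // instead of spinning on a null return.
    public class BlockingTaskQueue {
        private final Deque<Runnable> tasks = new ArrayDeque<>();

        public synchronized void put(Runnable task) {
            tasks.addLast(task);
            notifyAll();               // wake sleeping workers
        }

        public synchronized Runnable take() throws InterruptedException {
            while (tasks.isEmpty()) {  // loop guards against spurious wakeups
                wait();
            }
            return tasks.removeFirst();
        }
    }

With a single condition per monitor, notifyAll is the safe default, though notify would arguably suffice here since every waiter is waiting for the same thing.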

Graph Partitioning
I was pretty unhappy with this pattern, as it was hard to understand what they were talking about. Like Professor Johnson, I get the impression that the pattern is really about how to map tasks onto processing elements at startup time for load balancing and reduced communication. However, this was far from clear from the start of the paper, which quickly delved into the details. I suppose it could also be used to remap tasks at runtime if the graph changes. I read a pattern that sounded similar to me from paraplop09 a while back called "A Pattern Language for Topology Aware Mapping". This pattern language talked about how to map a graph of tasks and the communication between them efficiently onto a machine topology graph so as to effectively use the machine's resources: "Given an object communication graph and a processor topology graph, how can we map the first graph onto the second such that most messages travel a small number of hops (links) on the network". In that paper the intent was much clearer, and if these two patterns are talking about the same thing, I prefer that version.

Loop Parallelism
Loop parallelism is, besides fork-join, a pretty common pattern for parallelization on shared memory machines. I thought this paper did a fairly good job of describing it and went through a lot of different techniques one can use to prepare loops for parallelization. However, unless one is already familiar with these techniques one would probably not get them from reading this pattern, as the descriptions were very brief (they did contain examples though!).

There were a few minor issues that I thought were strange. For example, why don't the i and j loops carry dependencies in the example at the top of page 3? It looks to me like they do (data created in iteration i of the i loop requires data from iterations i-1 and i+1), unless they employ double buffering, in which case that should be mentioned.

Another problem I had was that they refer to the map-reduce pattern for a description of reduce. Surely a pattern on loop parallelism is the right place to include a description of both reduces and scans? Reduces are certainly used by map-reduce, but that is a very high-level pattern and I think a more detailed description fits here, along the lines of the sketch below.
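To illustrate what such a description could start from, here is a reduce in Java (my own sketch): a sequential loop whose accumulation variable carries a dependency, rewritten as a parallel reduction over an associative operator.

    import java.util.stream.IntStream;

    public class ParallelReduce {
        public static void main(String[] args) {
            int[] data = IntStream.rangeClosed(1, 1_000_000).toArray();

            // Sequential loop with a carried dependency on 'sum'...
            long sum = 0;
            for (int x : data) sum += x;

            // ...becomes a dependency-free parallel reduction. Because
            // '+' is associative, partial sums from different threads
            // can be combined in any order.
            long parallelSum = IntStream.of(data).parallel()
                                        .asLongStream()
                                        .reduce(0L, Long::sum);

            System.out.println(sum == parallelSum); // true
        }
    }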

Thursday, November 5, 2009

Making reliable distributed systems in the presence of software errors - Chapter 6

Chapter 6 of Armstrong's thesis on building failure-tolerant software describes the process of building an application. It shows how programmers can extend a generic set of behaviors with their own code containing the business logic. The generic components described (server, event handler and finite state machine) are written by experts and have been heavily tested and widely reused. They deal with complicated issues such as concurrency, locks and parallelism in general. The code the application developers need to write is thus for the most part only sequential stubs that are plugged into the above components, which makes it far easier to make them correct. The chapter also discusses how one can generate supervisors to monitor and restart erroneous behaviors and how one can put all of this together in an application.

The chapter made some of the things about Erlang that have been blurry to me come together. Through Armstrong's demonstration of what happens when a failure occurs, in this case a behavior crashing, I started to gain the first glimmers of comprehension of how their systems can be so awe-inspiringly robust. It seems to me that there may be three parts to the answer. The first is the obvious one: through supervisors an Erlang program gets multiple lines of defense, so that a lot of things have to fail simultaneously to bring down a whole system.

The second seems to me to be the pragmatic focus on getting, and not losing, logging information when something bad happens. Failures occur, but by catching them as soon as they happen, having a supervisor log the reason so that we are sure we don't lose it, and being able to hotswap parts of the system, one can gradually weed them out. This again over time makes it less and less likely that enough things will go wrong at once to completely bring down a whole system.

The third reason, I think, is the mentality of focusing on failures that I am sure programmers develop by working in an environment that forces you to design your systems around failure recovery. If you are working in C or Java on a standard project then you as a developer never really experience, or have to think much about, failures unless you go out of your way to write tests that expose them. And since you are not forced to think about it, you won't. In Erlang you have to build failure hierarchies and define explicit supervisors. This forces you to start thinking about these things, and you will therefore keep failures in mind even when writing the simple behaviors. Put another way, I am sure a ten-year Erlang programmer would write pretty fault-tolerant code even if he was for some reason forced to use C for a project, as he has been schooled to think about it. This is equivalent to a ten-year Object-Oriented programmer being forced to work with C. He has been schooled in modularity and encapsulation and will tend to write his programs this way, even though he has to use C, which has poor support for both.

Pipeline, Geometric Decomposition and Data Parallelism

Pipeline
Pipelining is an important pattern and is one of the best examples of clear-cut task parallelism (interestingly, one book actually placed pipelining in its own category disjoint from parallelism: either things operate in parallel or they operate in a pipeline). Even though pipelining tends not to expose a lot of parallelism when used in software (usually just a few parallel tasks), it is sometimes the only parallelism we can extract. And in other cases it can be composed with data or other task parallelism, which gives it the effect of multiplying the parallelism we would otherwise have in the system.

The pattern on pipelining started out with a lot of (interesting) facts about the usage of pipelining in hardware that an application developer trying to implement software pipelining won't care about. At this level in the pattern language we are not really concerned with vector processors, and certainly not with instruction pipelining, so these discussions distract from the goal. A better approach would be to remove these facts, or at least move them to the known uses section, and focus on how to implement multithreaded/multiprocess software pipelines in our applications. Besides, having the first example in assembly is pretty strange.

The universal forces section mentions throughput vs. latency. I have heard this consideration before when comparing different CPUs, and CPUs to GPUs. In hardware it makes sense, as the tradeoff is between adding transistors to perform more work on one instruction per cycle or adding transistors to perform work on multiple instructions per cycle (for example through pipelining). But I wonder how relevant it is for a software pipeline. If you disregard communication cost between pipeline stages, a given software task will take as long as it takes, no matter whether you have a deep or a shallow pipeline. It should not affect latency whether you have a two-stage pipeline performing n+n work or a single task performing 2n work. Of course this is altered by communication between pipeline stages, but is that what the author has in mind? Or am I missing something?

Finally, I did not like the example. Why use Matlab and not a language that is better known to the target audience, like C or Java? He says it is to avoid implementing fft, but fft could just have been hidden behind a function in the example. And the example code just shows how to crop out a circle to do blurring using fft and Matlab, not how to actually implement a pipeline. I would be much more satisfied with an example implemented using threads or message passing to show how to set up a pipeline (the parallel programming patterns book from which this pattern was adapted had the Java code for this. Why was it removed?). The first example could even be as simple as having each stage add a constant to a stream of numbers, with fft as a more advanced second example.
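Something like this is all I am asking for as a first example (my own sketch in plain Java: two stage threads connected by blocking queues, each stage adding a constant, with -1 as a poison pill):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class AddConstantPipeline {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Integer> q1 = new ArrayBlockingQueue<>(16);
            BlockingQueue<Integer> q2 = new ArrayBlockingQueue<>(16);

            Thread stage1 = new Thread(() -> pump(q1, q2, 10)); // adds 10
            Thread stage2 = new Thread(() -> {                  // adds 100, prints
                try {
                    for (int v; (v = q2.take()) != -1; )
                        System.out.println(v + 100);
                } catch (InterruptedException ignored) {}
            });
            stage1.start();
            stage2.start();

            for (int i = 1; i <= 5; i++) q1.put(i); // feed the pipeline
            q1.put(-1);                             // then poison it
            stage1.join();
            stage2.join();                          // prints 111..115
        }

        // Generic middle stage: read from 'in', add c, write to 'out'.
        static void pump(BlockingQueue<Integer> in, BlockingQueue<Integer> out, int c) {
            try {
                for (int v; (v = in.take()) != -1; ) out.put(v + c);
                out.put(-1);                        // forward the pill
            } catch (InterruptedException ignored) {}
        }
    }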


Geometric Decomposition
I think this pattern is pretty important and I also think that it is specific enough, as opposed to "Data Parallelism", which is rather vague. This description of it discussed the most important considerations when implementing geometric decomposition and it contains a lot of good information, but it definitely lacked an example with real code. Again, the Berkeley format suffers from having an explicit Example section, which discourages authors from interleaving the example with the solution description. And even the example in the example section only exemplifies how to do heat diffusion, with no code and little text showing how to do it in parallel. After reading this pattern I know about many things that we should consider when doing geometric decomposition, but I know nothing about how to actually begin implementing them.

It also boggles my mind that one can describe something so closely related to Cartesian coordinate systems and real-world phenomena without a single figure showing a grid of cells/points distributed amongst processors. But then again I am very much a graphical thinker. Indeed their only figure when describing how to solve a problem in parallel is of a heat equation...
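For the record, here is the kind of concrete starting point I was hoping for (my own sketch, constants arbitrary): a 1D heat diffusion grid geometrically decomposed into contiguous chunks, one thread per chunk. With a shared array the border exchange degenerates into a barrier: each thread just has to wait until its neighbors have finished the previous step before reading the cells bordering its chunk.

    import java.util.concurrent.CyclicBarrier;

    public class HeatDiffusion1D {
        static final int N = 1000, STEPS = 100, THREADS = 4;
        static double[] cur = new double[N], next = new double[N];

        public static void main(String[] args) throws InterruptedException {
            cur[N / 2] = 100.0;  // a heat spike in the middle
            // The barrier action runs once per step, between phases,
            // and swaps the double buffers.
            CyclicBarrier barrier = new CyclicBarrier(THREADS, () -> {
                double[] t = cur; cur = next; next = t;
            });

            Thread[] workers = new Thread[THREADS];
            for (int t = 0; t < THREADS; t++) {
                // Each thread owns a contiguous chunk of the interior.
                final int lo = 1 + t * (N - 2) / THREADS;
                final int hi = 1 + (t + 1) * (N - 2) / THREADS;
                workers[t] = new Thread(() -> {
                    try {
                        for (int s = 0; s < STEPS; s++) {
                            for (int i = lo; i < hi; i++)  // 3-point stencil
                                next[i] = cur[i] + 0.25 * (cur[i-1] - 2*cur[i] + cur[i+1]);
                            barrier.await();  // compute/communicate boundary
                        }
                    } catch (Exception e) { throw new RuntimeException(e); }
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();
            System.out.println(cur[N / 2]);
        }
    }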

What I don't understand is that the parallel programming patterns book contains these figures and the code, in addition to the issues described here. Why can't they just use those, since patterns like Discrete Event were taken almost directly from it? Is there some restriction on how much they can take, even though the author of this pattern is also an author of the book, or do they explicitly want to shorten it?

Data Parallelism
There is not really much to say about the pattern description, as it is only three sentences long. Like the Task Parallelism pattern, this pattern is too vague. In the geometric decomposition pattern they say that if the geometric chunks have no dependencies then it is an instance of task parallelism because it is embarrassingly parallel. Why isn't it data parallelism? We discussed this in class and the distinction is very vague, which makes me think task and data parallelism are not patterns, but two potentially intersecting categories. If they must be included in OPL then they should be two sub-categories of algorithm strategy. If they were, then I would put the following patterns in the data parallel group: Geometric Decomposition, Recursive Splitting (yes, really: you perform only one task repeatedly on the data structure, although the granularity of the data differs dynamically; "The splitting is not a part of the core computation") and at least parts of non-work-efficient parallelism such as Scans. Discrete Event, Pipelining and Speculation fall into the task parallel category.

Finally, I agree with Professor Johnson that we should replace them with the Embarrassingly Parallel pattern. But even embarrassingly parallel can be both task and data parallelism. Embarrassing task parallelism is to me when you have different tasks that do not need to communicate (MIMD). Embarrassing data parallelism is when you have one task that operates on multiple pieces of data (SIMD). Of course this gets a bit vague when considering that the data parallel operations could have branches, but as long as those branches branch based on the data I still think it is data parallelism. Whether there should, based on this, be one or two embarrassingly parallel patterns is not really that important, and I guess it depends on whether you are a lumper or a splitter, as Professor Johnson so nicely put it.

Tuesday, November 3, 2009

Making reliable distributed systems in the presence of software errors - Chapter 5

Chapter 5 of Joe Armstrong's thesis on fault-tolerant systems describes how the fault tolerance of Erlang works and how the programmer can build an application to make it fault tolerant. By fault tolerance Armstrong means systems that can continue to complete tasks in the presence of faults; it does not imply the applications are free of them.

The fault tolerance model described in the chapter is based on the concept of failing immediately if you cannot recover, and then trying to do something simpler. The system is organized as a tree of supervisors that supervise behavior processes, which perform the actual application functionality. I did not fully understand how having the application recover by doing something simpler would work in practice, though. However, another mechanism it has for recovering is to try to restart the failed process, which makes sense. Perhaps it will succeed this time.

The chapter went on to discuss the difference between failures, errors and exceptions. I didn't fully get the exact distinction, but what I gathered from the text was that in the Erlang context exceptions are what you throw when the specification does not say what you should do. The programmer working at the lowest level should not start trying to guess, as this leads to an unstable system, but should instead just throw an exception. Always. An error comes in two forms: corrected errors and uncorrected errors. A corrected error is an error that you, or someone above you in the hierarchy, knows how to handle. It therefore does not lead to a failure. An uncorrected error, on the other hand, leads to a failure, which means a process must be restarted or we must try to do something simpler. This could be because of a software error, a hardware resource going down or just because we don't know how to handle the case of a file being missing. In that case the developer is encouraged to explicitly exit, or to throw an exception if it is a spec error, which is likely the case if we don't know what to do when a file is missing. Let someone else handle it.

However, the chapter does say that a runtime error such as a divide-by-zero leads to an exception, but surely this will more often be a programmer error than a spec error? Or is the fact that this exception is not handled by the module a spec error?

Monday, November 2, 2009

Task Parallelism, Discrete Events and Recursive Splitting

Task Parallelism
Task parallelism is a wide category that encompasses a lot of different applications and is often used to describe everything that is not data parallelism. However, OPL (and the parallel patterns book) defines it somewhat more narrowly by excluding more specialized "task parallelism" such as pipelines, discrete events and divide-and-conquer/recursive splitting.

This pattern was pretty well fleshed out, with many interesting observations, but I missed a code example. I also found it a bit confusing that the solution section spent much more time describing value-at-risk, speech recognition and the cost of various operations on various architectures than it spent discussing how to find and exploit task parallelism in these problems. I also wonder why the value-at-risk example is split between the solution section and the examples section. Should the solution section perhaps talk about task parallelism in general terms (discussing issues such as embarrassingly parallel task parallelism, synchronization/communication (and how to reduce it), load balancing and how to find task parallelism) and leave the examples for the examples section?


Discrete Event Pattern
I like this pattern a lot. It does a good job of explaining the problem and discusses several important aspects of the solution, such as event ordering and deadlocks. It is also the pattern on which Apple's Grand Central is based, and it is very important in the world of desktop GUI application development. Grand Central is a system for generating tasks that are then placed in task queues. These queues are then processed by different processors, and the tasks can generate new tasks and put them in other queues. The tasks therefore work a lot like events, which allows the GUI thread to offload work to other processors and lets these processors place tasks/events back in the GUI thread's task pool. The latter is what makes this an event-based framework as opposed to a task-based framework, and it is very powerful, as it allows the application state and GUI widgets to be updated by the GUI thread, thereby avoiding mutual exclusion. I particularly love one GC example where a task is generated by the GUI thread to perform some statistical computations on a document (which could potentially be expensive). When these computations are done, the task places an event/task in the GUI thread's task queue to update the application state/GUI with the statistics it gathered, which is a beautiful solution.

The only thing I missed was a good code example, perhaps implementing the standard thread-based event queue with a switch statement to decide the action, as is often done. Something like the sketch below is what I have in mind.
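In this sketch (event kinds and payloads are hypothetical, mine not the pattern's), a single event thread takes events off a queue and dispatches them with a switch; background work posts its result back as a new event, so only the event thread ever touches application/GUI state. The STATS_READY case is exactly the Grand Central example above.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class EventLoop {
        enum Kind { BUTTON_CLICKED, STATS_READY, QUIT }
        record Event(Kind kind, Object payload) {}

        static final BlockingQueue<Event> queue = new LinkedBlockingQueue<>();

        public static void main(String[] args) throws InterruptedException {
            queue.put(new Event(Kind.BUTTON_CLICKED, null));
            for (boolean running = true; running; ) {
                Event e = queue.take();
                switch (e.kind()) {
                    case BUTTON_CLICKED ->
                        // Offload the expensive work; post the result back
                        // as an event instead of touching state from the worker.
                        new Thread(() -> {
                            long stats = expensiveStatistics();
                            queue.add(new Event(Kind.STATS_READY, stats));
                        }).start();
                    case STATS_READY -> {
                        // Safe: only the event thread updates application state.
                        System.out.println("stats: " + e.payload());
                        queue.add(new Event(Kind.QUIT, null));
                    }
                    case QUIT -> running = false;
                }
            }
        }

        static long expensiveStatistics() {
            long n = 0;
            for (int i = 0; i < 1_000_000; i++) n += i;
            return n;
        }
    }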


Recursive Splitting
This was a very well written pattern with tons of examples using different programming models. It also does a good job of describing the strength of recursive splitting, which lies in its flexibility to create many tasks, and I thought it was very interesting how they contrast it with geometric decomposition.

Thursday, October 22, 2009

Some thoughts on Erlang for Parallel Computing

I like the Erlang model and appreciate its impressive results in its domain, so I have been thinking about how it would work for other problems like the computational patterns. Modules of an application where several code paths are executed at the same time can be split into two groups.

First, you have those applications in which the concurrency is a part of the problem specification. Two common examples are web servers and switching systems, where there are multiple concurrent users. Another example is GUI event loops, where the events are naturally concurrent. A fourth example is web browsers, where the many tabs naturally operate concurrently. In these systems the simultaneous execution is often referred to as concurrency. Furthermore, introducing simultaneous computation/actual concurrency into these systems often does not make them that much more difficult to understand, and can sometimes even make them easier to understand. Just look at web development, where developers write code on a daily basis that executes concurrently, without giving it a second thought. In these systems the introduction of concurrency can give many benefits such as performance, scalability, reliability, stability, security, modularity or even maintainability and simplicity. The key point is that concurrency is a natural part of the problem specification and does not need to be uncovered.

Then there are those problems where concurrency is not a part of the problem specification, but where there is still concurrency in the domain that can be exposed and used to enhance some quality attribute (as I have explained in an earlier blog, I prefer this term to non-functional requirement). The quality attributes in question are often performance and scalability, but can also be reliability, stability and security. In these systems the simultaneous execution is usually referred to as parallelism. Examples are solutions to the thirteen dwarfs/motifs/computational patterns, where parallelism is usually exploited for performance and scalability, or Google Chrome, which will spawn off processes, independent of the tab processes, to render certain risky parts of HTML code in order to provide increased security. Since there is no natural concurrency in the problem specification of these problems, we have to find ways to restructure the problem to expose it. This is hard and usually leads to units of parallel execution that are closely coupled and that need to communicate often and in non-trivial ways.

Erlang and "Concurrency Oriented Programming" focuses on the first set of problems where the problems are inherently concurrent and programmers are therefore encouraged to map the concurrency in the solution 1:1 with the concurrency of the problem (people in the world or transactions in a switch). Some of the challenges are then to provide a system where processes are cheap and to provide separation between the different entities so that problems in one does not spread. These problems are nicely solved by Erlang through the use of structured protocol-driven message passing and green processes.

The second problem area is harder to solve, and it is this area that the OPL effort focuses on the most. Erlang could probably be useful in solving some of these problems, but for problems like the computational patterns I have more reservations. Setting aside, for a moment, the fact that these are problems in which performance is usually the goal, which raises questions about the efficiency of a high-level language like Erlang especially for supercomputing, I do not think the model of Erlang is the ideal fit. The programming model of Erlang is one where independent processes work independently and cooperate by explicit message passing. Admittedly, many of the models that we use in supercomputing today, like MPI and Charm++, share these characteristics, but these are often seen as complicated for programmers to use.

As I mentioned above, Armstrong argues for mapping the concurrency of our problem 1:1 with the concurrency of the world. I agree with this, which leads me to the conclusion that it would be beneficial if we could use a sequential programming model for the problems that are, or at least that we think of as, sequential. However, running them sequentially is not an option. We want to simulate tomorrow's weather in hours and not in days, so in the absence of increased sequential speed we must parallelize the computation. However, a "mostly" sequential programming model can still be parallelized given enough information to the compiler, through parallel loops, etc. These conserve most of the sequential reasoning, but give us parallel execution (see the small sketch below). Just as we can choose to map concurrent problems onto one or a few processors through time sharing, we can map sequential programs onto several processors in parallel. One example of such a programming model is OpenMP, which is popular for shared-memory systems but not very suitable for distributed systems, mostly because it is shared-memory. However, this is an active area of research, and one can envision sequential programming models that expose primitives that let the programmer specify data locality, giving the compiler enough information to distribute the computations satisfactorily.
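A small Java sketch of that flavor of model (parallel streams rather than OpenMP, since Java keeps coming up on this blog): the loop reads like sequential code, and a single annotation tells the runtime the iterations are independent and may be scheduled across processors.

    import java.util.stream.IntStream;

    public class ParallelLoop {
        public static void main(String[] args) {
            double[] a = new double[1_000_000];

            IntStream.range(0, a.length)
                     .parallel()                         // the only parallel annotation
                     .forEach(i -> a[i] = Math.sqrt(i)); // independent iterations

            System.out.println(a[999_999]);
        }
    }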

Making reliable distributed systems in the presence of software errors - Chapter 4

Chapter 4 of Joe Armstrong's thesis discusses programming techniques that are commonly used in Erlang. These techniques have different goals; some are for reliability and others are aimed at simplicity and modularity. He starts by making the argument that since concurrent programming is perceived as harder than sequential programming, we should split the concurrent parts from the sequential parts by making the concurrent parts frameworks into which we can plug sequential code. Furthermore, the parallel sections should be written by expert programmers and the sequential parts by less experienced programmers.

Through an example he shows how this can be achieved by creating a server module that accepts requests from any number of clients and for each request executes a sequential function. This function is supplied to the server on startup, which means the function is independent of the server. In this example the server module handles all the concurrency, spawning processes and receiving messages, and the sequential function it is armed with handles the actual work. These two are thus separate and can be developed separately. In addition, one server can be used with several different functions. A rough sketch of the idea follows.
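Here is a rough Java analogue of that structure (the original is of course Erlang processes and messages; all names here are mine): the generic server owns all the concurrency, and the business logic is a plain sequential function plugged in at construction time.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.SynchronousQueue;
    import java.util.function.Function;

    public class GenericServer<Req, Rep> {
        // A request paired with a private reply channel, like an
        // Erlang message carrying the sender's pid.
        private record Call<A, B>(A request, BlockingQueue<B> replyTo) {}

        private final BlockingQueue<Call<Req, Rep>> mailbox = new LinkedBlockingQueue<>();

        public GenericServer(Function<Req, Rep> handler) {
            Thread server = new Thread(() -> {
                while (true) {
                    try {
                        Call<Req, Rep> call = mailbox.take();
                        // The plugged-in sequential function does the work.
                        call.replyTo().put(handler.apply(call.request()));
                    } catch (InterruptedException e) { return; }
                }
            });
            server.setDaemon(true);
            server.start();
        }

        // Clients block until the server has processed their request.
        public Rep call(Req request) throws InterruptedException {
            BlockingQueue<Rep> replyTo = new SynchronousQueue<>();
            mailbox.add(new Call<>(request, replyTo));
            return replyTo.take();
        }

        public static void main(String[] args) throws InterruptedException {
            // The sequential part: a plain function, no concurrency in sight.
            GenericServer<Integer, Integer> doubler = new GenericServer<>(x -> 2 * x);
            System.out.println(doubler.call(21)); // 42
        }
    }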

Later this example is expanded with fault tolerance. Fault tolerance here means that it can handle faults, and this is done by catching exceptions from the supplied function and then notifying the client that it could not complete the operation. The fault tolerance here is that the server doesn't need to go down, but can continue operating in its previous state. Note also that in the Erlang model one would likely have a supervisor looking out for the server, so if the server encounters an error it cannot handle it can safely die, letting someone else handle the error. In this case that would be the supervisor, which would likely respond by creating a new server to replace the old one. This is the core of the reliability model of Erlang: handle what you can locally, but if you encounter errors (defined as an abnormality you do not know how to fix) then just die; someone else will fix it. It also has the great property that the model can handle HW errors the same as SW errors, given that the supervisor runs on a different system, as the runtime will ensure "death messages" are delivered to the supervisor. I must say, having worked on a C system that had high reliability requirements because it had to run in embedded devices, I really appreciate this simple error model. In C one does not even have exceptions, which means that functions often contain more error handling code than actual application code. This makes it hard to find the actual code in all the error checks and error propagation.

Sharing of resources in Erlang is different from shared memory programming. Instead of using locks and semaphores, one creates a server that controls access. This is to me a simpler model. An analogy would be that instead of having a door with an in-use sign to the room with the shared resource, and then trusting everyone to respect the sign, one has a servant that will give you what you need through a window in the door. Of course this analogy doesn't quite hold with monitors, as everyone would lock the door after them, but you would still have to be sure every path into the room has a door with the same lock, which is no easy task, as well as deal with deadlocks.

Finally, Armstrong talks about intentional programming, which I believe in strongly. Code should make its intention obvious, and if this means writing a lot of functions then so be it.

Ben Britton asks whether we agree that concurrent code can't be written in a side-effect-free manner. On the face of it I would disagree, pointing to systems such as Haskell, but I believe Armstrong is here distinguishing between parallel code and concurrent code, with concurrent code being code where individual entities modeling the real world cooperate to solve a problem. In such systems you would need central resource sharing and communication locations with responsibilities such as managing dictionaries and controlling physical devices. These would have side effects by design. If one is purely interested in solving a set of expressions then that is something else, which would draw great benefits from being side-effect free. Furthermore, Ben asks whether the abstractions described are only possible in Erlang, to which I would say definitely not. However, Erlang provides a language, runtime and philosophy that not only provide the tools to express these abstractions in a clean manner, but also invite it.

Tuesday, October 20, 2009

Map Reduce and Event-Based Parallel Patterns

Map Reduce
This pattern described the very topical map-reduce algorithm. Map-reduce has been around for a long time and I believe it is supported natively in some functional programming languages (OCaml?) under the guise of map-fold. The pattern does a decent job of describing the components that I need (the map and reduce functions) as well as some aspects of implementing a map-reduce framework, but I think it could do with a coding example using a simple toy map-reduce library, in which a simple operation, such as scaling down images and then merging them, could be implemented. A toy version of what I mean is sketched below.
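Something along these lines -- a one-method toy "library" applied to word counting (my own sketch, not code from the pattern):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.BinaryOperator;
    import java.util.function.Function;

    public class ToyMapReduce {
        // mapper emits key/value pairs per input; reducer combines two
        // values for the same key.
        static <I, K, V> Map<K, V> mapReduce(List<I> inputs,
                                             Function<I, Map<K, V>> mapper,
                                             BinaryOperator<V> reducer) {
            Map<K, V> result = new ConcurrentHashMap<>();
            inputs.parallelStream()                      // the "map" phase
                  .map(mapper)
                  .forEach(partial -> partial.forEach(   // the "reduce" phase
                      (k, v) -> result.merge(k, v, reducer)));
            return result;
        }

        public static void main(String[] args) {
            List<String> docs = List.of("to be or not to be", "be happy");
            Map<String, Integer> counts = mapReduce(
                docs,
                doc -> {
                    Map<String, Integer> m = new HashMap<>();
                    for (String w : doc.split(" ")) m.merge(w, 1, Integer::sum);
                    return m;
                },
                Integer::sum);
            System.out.println(counts); // be=3, to=2, or=1, not=1, happy=1
        }
    }

Note that the order in which partial results are merged is nondeterministic, which is exactly why the pattern wants the reduce function to be associative (more on that below).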

Also, the implementation section seemed to assume a master-worker paradigm underneath ("the software architecture is logically based on master-worker"), but although this is certainly one reasonable way to implement it, there is no reason why one shouldn't do static load balancing or work stealing instead in certain situations (I don't consider work stealing a sub-category of master-worker as OPL does). They do later suggest a round-robin (cyclic) schedule for some tasks, but I still think they put too much focus on master-worker, which is described elsewhere and is really an implementation detail of this pattern.

I also wonder whether filter/reduce should be discussed as a variant of this pattern. It is certainly used out there, but maybe it is a different pattern.

Finally, I question their precondition that the reduce function must be associative. Perhaps this should be changed to "should, for optimal performance, be", as one can certainly get great speedups from the map phase even if the reduce step has to be performed sequentially.

Event-based, Implicit Invocation
This pattern discusses the architecture of event-based systems. One very common example is GUI frameworks, where events, mainly triggered by the user, are passed on to event handlers that subscribe to that event type. This pattern contained more details than the other OPL patterns we have read, and for that reason I thought it was one of the better ones. However, at one point they state that "load balancing is significantly simpler with a centralized medium". Does a centralized medium not make the load balancing problem go away entirely for this pattern?

Later on, in the example section, they discuss GUI event-based systems. These systems have a centralized manager, and this is identified as the biggest impediment to scalability. However, I would claim that the biggest impediment is not the centralized manager, but the fact that the event handling is done in blocking functions. If any of these functions takes some time to complete it will block everything. This can also be hard to figure out: if one for example uses file IO then everything may work well, until the GUI suddenly starts to lock up when one starts accessing files through a networked file system. I thought such issues, seeing as they are so common in this very important class of event-based systems, deserved to be mentioned more clearly.

Making reliable distributed systems in the presence of software errors - Chapter 2

Chapter 2 of Joe Armstrong's doctoral thesis on "Making reliable distributed systems in the presence of software errors" discusses the architectural model of a system that can achieve this goal, presenting Erlang as one such system. Processes are here the central concept in the system and in the Erlang language. The chapter argues that the only way to have true reliability in the presence of bugs is to encapsulate each module in its own process so that errors can be contained and the process killed if an error is detected. As such, modules should not attempt to correct errors, but instead kill themselves, leaving recovery to someone else (usually the runtime environment). This is a strong property that simplifies the recovery model, as one can perform recovery from code that has not had any errors yet. It also frees the programmer from programming defensively, as he/she can rely on any other process either succeeding at what it was supposed to do or crashing. As such, one does not have to deal with extensive error checking and error handling code for each call to another module.

The concept of a Concurrency Oriented Programming Language (COPL) is introduced, in which programs should model the world like one does in Object-Oriented programming. However, unlike Object-Oriented programming, it recognizes that the world is concurrent, and programmers are encouraged to map the concurrency of the world one-to-one with the concurrency of their programs. To achieve this, concurrency is made a first-class citizen of the language, and the runtime environment uses green threads to make it cheap, which gives the programmer the freedom to model all the concurrency of the problem domain.

Processes communicate through asynchronous message passing. This means that no state can be shared, and that if another process needs some data then that data must be copied to that process by sending a message. This enforces the use of the Separable parallel pattern, which tends to lead to more scalable programs, as one need not serialize access to the data through mutual exclusion. This property, in addition to the fact that the messages are asynchronous, also means that the potential for deadlocks is (almost?) completely removed. Finally, programs that copy data (as one has to do in message passing APIs such as MPI) have greater data locality and avoid false sharing problems, which is important for performance.

In Erlang concurrency is not intended for performance. The main goals of programming concurrently are reliability (through the process abstraction), scalability, security and even modularity.

Sunday, October 18, 2009

Rereading the classics

Last week I was a bit overloaded, so this blog post comes a bit late, but better late than never. The last chapter of Beautiful Architecture, titled "Rereading the Classics", was a great read for me. The author was for the most part discussing object orientation and presented the programming language Smalltalk as an example of a system that had taken this to the extreme and made everything an object. I have never seen Smalltalk code before, so it made for an interesting case study, and I can definitely see how Smalltalk's legacy has shaped subsequent systems like Java and C#.

After having left the previous chapter, with its arguments for extending the type system through inheritance to solve most problems, with an uneasy feeling, I was glad to see this chapter's careful use of it. I have indeed read Meyers' great "Effective C++" and have seen the penguins' failed attempts to fly before, so I strongly agree that inheritance means is-a and is not a mechanism for composing code like the horrible COMPOSITE_PART of page 334. We learned to favor composition from our teacher this semester and the rest of the gang of four, and there is no going back.

Smalltalk has a lot to teach us about Object Orientation, and I find it very interesting, but not surprising, that much of the community that gave us design patterns and the guidelines for developing Object-Oriented systems that we use today came from the world of Smalltalk. I've mentioned this before, but things like this make me really wonder how much it is we that shape our tools and how much it is our tools that shape us. I certainly agree that language facilitates thought, and surely this must mean that pure languages facilitate pure thought.

The author closes with a, to me, surprising turnaround. After having talked up Smalltalk's strengths and OO purity, he makes the case for more pragmatic programming languages like C and C++, and even throws in a great quote by Stroustrup (although of course a Dane will prefer Kierkegaard!). He gives us some examples from the world of physical architecture where some of the most celebrated architectural masterpieces of the last century, like the Fallingwater house and the Villa Savoye, turned out to be completely unlivable for their inhabitants. Form must follow function, and architecture must be made to facilitate the people who will live in it. It is like Alexander's villages in the Swiss Alps, where the naturally evolved patterns the buildings follow allow them to be beautiful while at the same time being functional for their inhabitants down to the last detail.

So it is for languages too, and although I doubt Smalltalk developers would characterize it as unlivable, neither has it had the same role on center stage as C++, Java and C#. However, like beautiful functional programming languages with their purity and expressiveness, and like the T/16 and Erlang with their ubiquitous focus on reliability through replication, it has served to inspire us and has influenced our other programming languages as well as the way that we use them. In some ways I believe the ideas that came out of the Smalltalk community have proved more powerful than any single programming language. At the risk of causing controversy, I would like to draw an analogy to the socialist democracies of Scandinavia, and indeed to some degree most of present-day Europe. The ideas of Communism were very pure, but inherently flawed at their core. However, instead of implementing the antithesis of Communism, Capitalism, which is also very pure and, although much preferable, has its own problems, they landed at a Hegelian synthesis of the two ideas, pragmatically borrowing what they perceived as the strengths of both with a market economy and a strong safety net. I do not endorse this one way or the other, but it is interesting how the radical pure ideas blaze for a while, but are then subsumed into more pragmatic approaches and become the status quo that everyone takes for granted. I too, like Stroustrup, am inspired by the rationalists, but when everything is said and done I prefer to live in a world based on pragmatic empiricism.

Thursday, October 15, 2009

Pipes & Filters, Layered Systems and Iterative Refinement

One of the groups in OPL consists of the structural patterns. These are not intended to document parallelism, but to give a good architectural base from which parallelism can be exploited through the use of patterns at the lower layers. In the OPL authors' own words, our inability to architect parallel programs stems from our inability to architect programs in general. Following this train of thought, the first problem cannot be solved without also solving the second.

This week we have read three of the structural patterns, namely Layered Systems, Pipe-and-Filter and Iterative Refinement. The first of these, the Layered Systems pattern, looked a bit like a draft and I thought it was lacking in explanation. I liked the POSA one better, as it did a better job of explaining the concept and the problems faced when trying to apply it. This one read more like a laundry list of things that are important when designing layered systems, and while I do not disagree that these are important points, I do not think I would get most of it if I did not already know the pattern.

The second pattern, on Pipe-and-Filter systems, was better, but I am wondering why they don't provide a diagram of a pipeline or something similar for illustration. I also missed a discussion of problems like error handling in pipelines. Should we handle that on the side, thus introducing extra communication, or should we flush the pipeline using poisonous pills when we detect an error? Or perhaps we should ignore it, as one does in most media application pipelines?

They also mention in the forces section that the number of filters should be small to reduce pipeline fill/drain time. Unless they mean pipeline latency (the time it takes for one package to reach the other end, which is higher if it has to be moved between more stages), I fail to see the reason. Pipeline fill time is usually seen as a problem because for an N-stage pipeline you have to process N packages before reaching full parallelism (N). This is in contrast to task parallelism, where one can exploit all the parallelism immediately. However, a pipeline with N+M stages also only needs N packages to reach a parallelism of N, and the fact that it can reach a larger amount of parallelism in more time doesn't really affect that. Also, say you have a pipeline of 6 stages with a processing time of 1 each. You then merge these into 3 stages with a processing time of 2 each. Admittedly you now only need 5 time units, as opposed to 6, to fill the whole pipeline, but the first case would only need 3 steps to occupy 3 processors while the latter case needs 5 steps. The problem with pipeline fill time is that you have to wait for the pipeline to start filling up to get parallelism, and I can't see how that problem is solved by reducing the number of stages. Is there a fault in my argument here?

The last pattern discussed in this blog post is Iterative Refinement. If I understood it correctly, this is a fairly common pattern that goes under a few other names such as BSP, Loosely Synchronous Computations and Compute/Communicate. It was presented as a general programming model by Leslie Valiant (their reference) and it has been argued that it is to message passing what structured programming is to goto statements. In "Goto Considered Harmful" Dijkstra argues in favor of structured loops instead of goto statements because they allow you to pinpoint and think about a point in a program's execution based on its point in the source code, the call stack and a set of loop indices. In contrast, with goto statements you have to consider the whole previous execution trace of that function. With BSP programs you can also pinpoint a point in the program's execution using the same information as structured programming plus iteration indices, giving you the same benefits. In addition to this, it also has the effect of separating communication from computation, which greatly reduces program complexity. One common usage of this pattern is in grid computations (image filtering, weather calculations), where the computation is set up as a sequence of steps, with different parts of the grid computed on different processors independently of each other. Each step is then followed by a communication phase involving border exchanges before a new step is started.

Finding and Reproducing heisenbugs in Concurrent Programs

CHESS is a tool for finding concurrency bugs such as race conditions and deadlocks in multithreaded programs. It does this by hijacking the thread scheduler and driving the scheduling itself. This allows it to try different thread interleavings in a systematic and deterministic fashion, which means it does not need to rely on spawning a lot of threads to get sufficient thread interference and then pray that the operating system will schedule these in such a way that a race or deadlock is triggered.

In addition, CHESS will persist the interleavings it tries, which means that once a bug is triggered it can be reproduced reliably. The latter is extremely valuable, as it is sometimes the case that one manages to produce a bug, perhaps in extensive nightly or weekend testing, only to have it slip out of our hands like water by not being able to reproduce it. The only choice we are then left with is to attempt to stare the code down hoping the bug will pop out, or to live with the uneasy knowledge that there is an unknown bug in the code. The former might take a long time and often requires a really smart developer (which prevents him/her from working on creating the future bug-free code). The latter is often the only choice, but is less than ideal, since if we do not know what the bug is then we do not know what other failures it might trigger. We usually accept that some less severe bugs live in our bug trackers to be fixed "eventually" (the bug trackers never seem to quite reach zero), but if we don't know what the bugs are then we can't know how severe they are.

However, CHESS is far from covering all interleavings. The number of potential thread interleavings in a program is enormous and includes all interleavings of all non-atomic instructions (mutually excluded sections are treated atomically), so there is no practical way to test them all for any non-trivial program. CHESS therefore makes the simplification of running all threads on the same unit of execution and never preempting them. This effectively means that only instructions that block or yield a thread will cause another thread to be scheduled, which rules out many possible data races. One of the ways they alleviate this is by running a data-race detector on the first few iterations and then manually instrumenting the code to yield based on the potential data races reported. This manual instrumentation will then let the CHESS scheduler try out different interleavings of accesses to the potentially racy resources, which means it can detect races there in subsequent iterations. In effect it takes the output of the race detector and removes all the false positives that these tools are prone to produce (due to intentional races in the code). In the paper they only briefly mention the fact that they have to do this, but to me it seems to be an essential point, as I strongly suspect they would otherwise miss a lot of races.

The above simplification means they have far fewer interleavings to test, but the number is still very large, so they further optimize by intelligently searching for good interleavings and scheduling these first. This seems to me to be a very sensible way to do it, and it means that the user can run the tool on a set of tests for a fairly short period to cover the most important interleavings, or run it for days or weeks to get increasingly more coverage, but with diminishing returns.

I also have to say that although the versions of libraries such as PLINQ and TPL that they used were likely pre-release research versions and not production code (I think they are scheduled for the next .NET?), I am impressed by the number of bugs they found. Since we can only really measure testing by the number of bugs we find, and not by the number we haven't found, a tool that detects new concurrency bugs in code that has been stress tested for weeks is definitely useful in cases where we need that level of reliability. For many applications, however, it might not be so bad to have a bug that only manifests once every quarter, so in those cases it might not be worth the extra effort.

Tuesday, October 13, 2009

Refactoring for Reentrancy

The paper "Refactoring for Reentrancy" presents a refactoring algorithm for converting global/static state into thread local storage (tls) variables in order to separate the execution of individual threads from each other. Reentrancy is a property of code that is usually used to mean that that particular piece of code will not have side effects that will interfere with other code that is run in parallel.

To begin with, their vocabulary confused me, as they talked about programs needing to be reentrant and free of global state to prevent them from interfering with other programs run in parallel. The most common usage of the term program implies a separate process, which means it will have its own set of global variables and there will be no problem. However, I quickly figured out that their use of the term program was broader and included libraries (which can be called from multiple threads) as well as one thread of execution inside an application. With that definition their discussion makes much more sense.

As I mentioned, the goal of their algorithm is to replace static data with thread local storage. Conceptually this is quite straightforward, but it introduces the need for getters and setters as well as issues related to the initialization of that data. The latter they handle by using lazy initialization, which seems to work well in most cases as long as mutable variables are not used to initialize the static fields.
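
As a rough before/after sketch of what I understand the transformation to do (the names here are my own, not the paper's actual output):

    // Before: shared static state, racy when used from multiple threads.
    class IdGenerator {
        static int nextId = 0;
        static int getNextId() { return nextId++; }
    }

    // After: each thread gets its own lazily initialized copy behind an accessor.
    class IdGeneratorTls {
        private static final ThreadLocal<Integer> nextId =
            new ThreadLocal<Integer>() {
                @Override protected Integer initialValue() { return 0; } // lazy init
            };

        static int getNextId() {
            int id = nextId.get();
            nextId.set(id + 1);
            return id;
        }
    }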

The point of reentrancy is of course to be able to execute code in parallel without race conditions. In its most common usage the term refers to functions that do not have side effects on global memory, and it is stricter than the term thread safety. The property of thread safety allows side effects as long as they do not contain races. For non-reentrant functions this can be guaranteed through the careful use of locking. However, thread safety does not mean functions will not interfere with each other, and the use of locks will often lead to the code becoming serialized, thereby removing the benefits of running it in parallel. Reentrancy in a function or a library is therefore a very attractive property, as it means the code can safely be used from different threads and that this usage will not introduce thread synchronization overhead. In effect, reentrant code is embarrassingly parallel.

Of course it is not always possible to turn global variables into tls variables. In many cases program correctness depends on threads being able to communicate through global data, and removing the global data will in these cases lead to incorrect programs. In addition, tls data is not free: accessing it is more expensive than accessing global data, which introduces overhead. However, I like their approach of first making the program correct by removing race conditions on global data and then having the programmer optimize the code by moving tls data to the heap or the stack, potentially also re-introducing shared data (global or on the heap), but this time with proper synchronization.

Finally, their algorithm does not let them turn static fields that are initialized using mutable data from another object into tls data structures. This is because moving the initialization to the first usage of the variable means that the other variable used to initialize it could have changed by then. I felt they could have solved this case by introducing a new static final variable initialized to the load-time value of the mutable variable, and then using this variable to initialize the tls variable lazily on first usage. That way every thread's tls variable would be initialized, and they would all be initialized to the same, and semantically correct, value. Can anyone see a hole in this technique?
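
A sketch of the fix I have in mind (hypothetical names, of course):

    class Config {
        static int mutableDefault = loadDefault(); // may be mutated later

        // Snapshot taken at class-load time, before any thread can change it.
        private static final int DEFAULT_SNAPSHOT = mutableDefault;

        private static final ThreadLocal<Integer> value =
            new ThreadLocal<Integer>() {
                @Override protected Integer initialValue() {
                    return DEFAULT_SNAPSHOT; // same, load-time value for every thread
                }
            };

        static int loadDefault() { return 42; }
    }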

Thursday, October 8, 2009

Software Architecture: Object-Oriented versus Functional

This chapter of Beautiful Architecture compares the architectural benefits of functional and Object-Oriented languages. It is written by an open OO-proponent and this shines through, even though he (at least claims) to make an honest effort to stay neutral. Most paragraphs take one functional programming usage from the cited paper on functional programming and argue that it can be done as well or better using Object-Oriented approaches and types. In the end the author also argues for agents/closures/delegates, stating that they are a natural extension to Object-Oriented technology that we can and should steal from the functional world. This seems to be in line with my comment in an earlier blog. Instead of functional programming marching to the forefront, functional languages presently seem to serve as a source of influence for imperative programming languages, which are subsuming many of their concepts. I'm sure it must be a bit frustrating for some long-time functional programming purists to see their ideas being assimilated into their imperative rivals instead of seeing their languages win out, but that is another issue altogether.

Reading the parts about modularity and extensibility I thought the author was relying too much on inheritance. Modularity is not solely expressed through types, but also through the composition of these. In fact, my biggest takeaway from the design patterns book was not a particular pattern but the notion of "favor composition". Inheritance leads to very strong bindings, which makes it hard if you ever need to change the hierarchy. In addition, an inheritance hierarchy will strongly embed one particular decomposition into your application, which lends itself to a stronger tyranny of the dominant decomposition. Since there are usually many possible decompositions of one domain, we should avoid infusing any one of them into our applications more than we have to, in case we have to change it. The point of OO was to model the world, not to solve every problem through type hierarchies, and I do not think that always extending functionality by finding the place in the type hierarchy where the change fits and inserting it there through inheritance is a good idea. In addition, I do not really like multiple inheritance, as I feel it weakens the is-a intuition of inheritance and leads to very rigid design structures with tight coupling.

As I mentioned above, the final part talks about agents, and I found this part the most interesting of the chapter. They describe agents as a way to infuse a class with new functionality at runtime, explaining how this cleanly replaces visitor, which is used in compilers, etc. I find this way of looking at it enlightening, as this is usually exactly what we want to do with task and data oriented libraries such as ParallelArray and ForkJoin. We want to infuse a task class with new functionality, namely the code we want it to run in a separate thread of execution. This is to me a stronger argument for such structures than replacing visitor, as I believe this usage pattern will often be needed in the future.

To conclude, even though I disagree with the chapter on some finer points and feel that the author is fairly biased even though he makes a conscious effort not to be, I believe I agree with his final conclusion. To answer one of JLSjr's questions: I do prefer Object-Oriented technology's analogy to the real world over the mathematical foundation of functional programming for most problems. After all, most of our software is written to support real world processes and not to perform mathematical calculations.

ReLooper: Refactoring for Loop Parallelism

ReLooper is another refactoring tool, one that converts normal Java arrays into the new Java ParallelArray construct (JSR166y). Using this tool one can select a Java array and ask to have it converted to a ParallelArray. The tool will then perform the necessary conversion (changing types and replacing calls to new with calls to the ParallelArray factory methods). It will also analyze the loops the array is used in and try to convert them to an equivalent ParallelArray method that performs the same operations, but in parallel.

In order to maintain semantic equivalence, ReLooper performs a pretty impressive analysis of the loop content to detect potential race conditions. This includes checking that no loop iteration reads or writes a variable that another loop iteration writes to (Bernstein's conditions), but it also verifies that there are no IO operations, as these can block one of the underlying worker threads, leading to poor performance or even causing the loops to become sequential again.
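
To illustrate, here is roughly the kind of rewrite I understand ReLooper to perform. I wrote this by hand against my reading of the jsr166y/extra166y preview API, so the Particle class and the exact factory and interface names are my assumptions, not the tool's actual output:

    import extra166y.Ops;
    import extra166y.ParallelArray;

    public class Simulation {
        static class Particle { void move() { /* update position */ } }

        // Before: a sequential loop over a plain array. The iterations are
        // independent (no iteration touches anything another one writes).
        void stepSequential(Particle[] particles) {
            for (int i = 0; i < particles.length; i++) {
                particles[i].move();
            }
        }

        // After: the array becomes a ParallelArray and the loop body becomes
        // a Procedure object that a fork/join pool applies in parallel.
        void stepParallel(Particle[] particles) {
            ParallelArray<Particle> pa = ParallelArray.createUsingHandoff(
                    particles, ParallelArray.defaultExecutor());
            pa.apply(new Ops.Procedure<Particle>() {
                public void op(Particle p) { p.move(); }
            });
        }
    }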

What I like best about this tool is the analysis part. In the examples given not many lines are changed (12-52 in their examples) and I could fairly easily do these small changes myself, but the fact that it can guarantee that I did not introduce a bug and that the other tens or hundreds of thousands of lines still work is a big time saver. If I were to do this myself for a big and complex application I would probably spend 30 minutes converting the original array and hours or days convincing myself that everything still works. Moreover, the fact that they get so few false warnings while being conservative and treating objects as single blocks is impressive.

One thing that has previously come up is that an industrial-strength tool should also do splitting and merging of loops. Say, for instance, that the creation, moving and reduction of the particles in their example were done in the same loop; this tool would then have to implement that whole loop as a reduction loop (since cm is still mutated) and would miss the opportunity to use the more general apply and replaceWithGeneratedValue methods on the first two parts of the input. However, research, like development, is incremental, and as Professor Johnson nicely puts it: "When you see a flying elephant you don't complain about its style".
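
To make the splitting idea concrete, here is a hand-written sketch, simplified to an array of doubles and again assuming the preview API (I use replaceWithMapping for the in-place update), of the three stages such a split loop could be turned into:

    import extra166y.Ops;
    import extra166y.ParallelDoubleArray;

    public class SplitLoops {
        double step(int n) {
            ParallelDoubleArray a = ParallelDoubleArray.create(
                    n, ParallelDoubleArray.defaultExecutor());
            // 1. "Creation": fill the array in parallel.
            a.replaceWithGeneratedValue(new Ops.DoubleGenerator() {
                public double op() { return Math.random(); }
            });
            // 2. "Moving": update every element in place.
            a.replaceWithMapping(new Ops.DoubleOp() {
                public double op(double x) { return x + 1.0; }
            });
            // 3. "Reduction": combine all elements into one value (the cm analogue).
            return a.sum();
        }
    }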

Another thing that again strikes me is the fairly large amount of boilerplate required to use something like parallel arrays in Java. One has to find the right class to implement amongst 132 classes and extend it, thereby creating a new class representing the content of the loop body. As one guy mentioned in an SE seminar where we discussed this paper: "The number 132 is probably the best argument for lambda expressions in Java". At least Java has anonymous classes; I have seen C++ TBB code and it is far uglier than this (and I consider myself a C++ guy). However, if C++ gets lambda expressions in the next version, or if one uses Apple's closure extensions, these kinds of libraries become much less intrusive.

Finally, I would like to address Danny Dig's third question on this paper, which asks us to compare and contrast putting the programmer in the driver's seat as opposed to fully automating the transition to a parallel program (compilers and similar). I myself like to be in control of my code and the transformations that are applied to it. I do not trust automatic transformations that I cannot oversee and that I cannot be sure are still there after I perform a refactoring or fix a bug. Although the former is a somewhat psychological point, I believe many programmers feel this way, which makes it an important one. Furthermore, I find compiler parallelization to be brittle, and if I were working on a program that depended on it I would feel inclined to have fairly extensive performance regression tests attempting to verify that the parallel loop did not suddenly become sequential because a minor change confused the compiler. I therefore much prefer the approach of this paper, as it puts me in charge, lets me see and approve changes to my code, and updates my code to reflect these changes. The latter is important to me because it means I can read the code to see how it will be run and do not need to read compiler manuals or logs to know whether it will be parallel or not. I also really liked JLSjr's point about the educational value of such a tool. I'm sure that after performing a few of these interactive transformations the developer would be much more comfortable with having "evil" parallelism in his code.

Monday, October 5, 2009

When the bazaar sets out to build cathedrals

Since its beginning KDE has grown to become one of the two most influential desktops. It is built on top of Qt (pronounced "cute" by trolls), a very powerful platform-independent framework and toolkit for creating GUI and other applications.

The chapter starts with an extended discussion of the factors that contributed to KDE's success. I loved this part, which focused on the social and political aspects of an open source project, the types of issues that are encountered and how KDE solves them. The chapter highlights how KDE has avoided a management team and instead leaves the decisions in the hands of those who do the work. This is a concept I have heard referred to as do-ocracy, and I believe it is one of the most important aspects of an open source project. I doubt any large open source project lacking this feature has ever been very successful.

Another interesting comment in the chapter was how they described the act of reaching consensus as being as important as the consensus itself. This keeps group cohesion and facilitates group "gelling". They also identify (free) software communities as first and foremost social, not technical, structures. I think this is something corporations should learn from, as people are the biggest factor in software and a well-functioning group is far more useful than adopting a new and better tool, language or methodology (tm) - software is peopleware and the two can't be separated.

They go on to describe the background and architecture of the Akonadi project. Akonadi is a platform for central storage, management and graceful interchange of personal information such as emails, address book information, etc. In the beginning such information was contained in separate applications that quickly became outdated due to unanticipated requirements. At first they resisted restructuring their code and attempted to meet these new requirements by patching the existing applications. The result is described as a messy code base, and as with the messy metropolis the chaos seemed to flow out from the code and affect the social structure, leading to an unattractive working environment. This effect was described as Conway's law in reverse in previous readings. The chapter was a bit vague here, but it seemed to me to suggest that what eventually pulled them out of this state was that the software increasingly needed to communicate and cooperate, forcing them to do the same. The software had to come together, so they did too. There were probably other factors as well that the chapter doesn't mention for political reasons, but I find it interesting to wonder about how much what we make in turn affects the makers.

In the end they came to a crossroads and decided to move the applications they had put so much effort into to the sideline and instead focus their hard-won expertise on building the infrastructure for the next generation of PIM software. After some back and forth, their new architecture adopted a client/server approach where a separate process controls all the PIM information and provides services to applications. These applications can communicate with the PIM process either through the multithreaded Qt-based libakonadi library or through any other third-party library that understands the protocol. Since communication with the server goes through a standardized protocol, applications are not required to link against C++/Qt libraries, meaning the server can be comfortably used by applications such as Evolution, which is implemented in C. It therefore has the potential to reach its goal of being the one-stop shop for all PIM information. One thing I did find strange and unnecessary, however, was their decision to implement the server using Qt. I cannot see any reason why it could not have been implemented using STL and pthreads, and the choice means that a non-KDE system, for instance a Maemo-based phone, has to import Qt to use it.

The chapter ended with a discussion of the Threadweaver framework, which is another task-based parallel library that uses a thread pool to execute tasks. Threadweaver allows users to specify tasks to be run in parallel as well as dependencies between these tasks. The former allows for embarrassingly parallel workloads, while the latter lets it pipeline tasks, utilizing more of the inherent concurrency. Under the hood, Threadweaver contains a scheduler that schedules these tasks and provides load balancing. Furthermore, the user can give tasks priorities that the scheduler will respect, allowing them to tune the scheduling for increased throughput.

Thursday, October 1, 2009

GNU Emacs: Creeping Featurism Is a Strength

The title of this week's chapter states that creeping featurism, which is widely regarded as a problem with many software projects, can also be a strength. Before reading this chapter I had not really had any experience with emacs. Sure, I had installed and opened it, but I quickly retreated to the safety of my beloved VIM due to the unfamiliar interface. I have observed that a higher percentage of the really good developers I have met use emacs than developers in general. I don't know if this means anything, though, other than the fact that it is an advanced tool for advanced people.

The chapter talks about the architecture of emacs, describing it as an MVC architecture. The internals of emacs consist of a set of character buffers that are its models. The buffer data are flat sets of characters, but buffers also support various features such as overlays and text properties. This part of emacs also contains a display engine that detects changes to the buffers and merges these into larger update operations that are communicated to the views. The views then take these "update events" and display them as efficiently as they can. This observer-like design is a powerful feature of emacs in that it frees the controllers from having to worry about how the text they produce/update will be displayed. This essentially breaks a many-to-many relationship and greatly reduces complexity.

The final part of this architecture is what is described as the core of emacs, namely its controller layer. This layer, consisting of very thick controllers, is I guess the biggest part of emacs by far, as it essentially contains all of its application functionality. Where the model is written in C, the controller layer is written in emacs Lisp that is interpreted by emacs. As such, emacs is really an abstract machine for processing and displaying text that provides a set of functionality such as buffer management, etc., and emacs Lisp is a domain-specific programming language with hooks into the underlying model allowing it to consume and produce text and properties.

By developing such a framework for producing features they have created a very strong separation between the internals of emacs (its model and interpreter) and its feature controllers. This means people can add new features without introducing new complexity in the core much in the same way that people can write new applications for the x86 processor without adding complexity to it. This separation, strongly enforced by the interpreter, is the key to it not blowing up under the weight of all its features.

While reading the chapter I was thinking about how this relates to the chapter on adaptive object models. Both are about platforms that implement interpreters for special-purpose rules. In emacs's case the rules are functions written in a Turing-complete language that operates on text. In adaptive object models such as the ones described in that paper I guess one tends to use less complex business rules than that, but the basic concept seems to me the same in that both provide abstract machines for implementing rules instead of implementing the rules themselves.

Finally, one thing I was curious about is how things like syntax highlighting are handled. One way would be to have the views format the text, but this would place a lot of application logic in the views. The way I guess they do it is to use the text properties to describe the different colors of the text and then have a set of controllers read buffers and produce these properties. This of course would mean that the views must understand these properties, but a certain amount of view logic is unavoidable.

A Java Fork/Join Framework

The title of this paper was a bit misleading to me, as I thought we'd be looking at a pthread-style framework. My first thought was that Java's OO-style threading model is, after all, a bit nicer than pthread's, so why add fork and joins to the language? The paper turned out to be about a fork/join task framework, which makes much more sense, and I think the word task should have been in the title.

Java Fork/Join seems like a nice little framework that supports a somewhat convenient way to specify tasks to be run in a thread pool. Tasks (sometimes called fibers) here, and in most other places, refer to lightweight chunks of work that are scheduled onto a pool of threads. What makes these more lightweight than threads is that they allow us to reuse threads, so we don't need to pay for thread creation or the overhead of having too many threads. Thread pools such as these are not a new thing and programmers have been defining them for a long time for similar purposes, but a framework like this greatly reduces the overhead of rolling your own thread pool implementation.

In my opinion, one of the greatest potential benefits of such thread pool-style solutions is that they can allow one central component to create the correct number of threads for the given architecture. That way the developer doesn't have to guess how many threads to use. I used the word guess on purpose in the last sentence because even if a developer knows the number of cores on the target machine, and there is only one such machine, he is very unlikely to know how many other threads are running on the system. He therefore cannot know whether 4 new threads are too few or too many for an 8-core cpu. The problem is that only the operating system has a global view of the number of threads in use, so this framework can't, and doesn't, take all of that into account. However, it would be nice if it at least detected the number of cores on the computer at runtime and gave you a number of threads reflecting this by default, so that applications can scale as they are deployed on new computers. One example of a parallel framework that does this for you is QtConcurrent from Nokia (formerly Trolltech).
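
For what it's worth, querying the core count at runtime is a one-liner in Java, so a sketch of the default I am asking for could be as simple as:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizing {
        static ExecutorService defaultPool() {
            // Ask the JVM for the machine's core count at runtime so the
            // same binary scales when deployed on a bigger machine (this
            // still ignores what else is running, as discussed above).
            int cores = Runtime.getRuntime().availableProcessors();
            return Executors.newFixedThreadPool(cores);
        }
    }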

As a side note, another framework I recently read about that really impressed me because of its simplicity and fitness for its particular use was Apple's Grand Central. Grand Central is a task-based framework with a thread pool and task stealing much like this one (although it does not use the fork/join pattern). The great strength of Grand Central is that it is integrated with the internals of the Darwin kernel (Mac OS X's core), which means it does know just how many threads it should create to get optimal parallelism without too much overhead and it can reuse these threads for all applications using it.

The Java Fork/Join Framework's model of parallelism is, not surprisingly, fork/join parallelism. The central theme here is that a task can spawn a sub-task that will be a child of that task. One can later join the task to wait for its completion and delete it. This means that at any point in time there is a tree of tasks, and if one looks at the lifetime of the tasks one will see a DAG. This is a style of parallel programming that lends itself especially well to recursive problems, but it is flexible enough to handle other kinds of problems by "emulating" other parallel programming models. One example of the latter is loop parallelism, which can be "emulated" by forking a set of tasks and then having them compute different parts of an array in parallel. This works well enough, but it is more complex to program for these kinds of problems than if one can express loop parallelism directly using something like the ParallelArray framework (to be shipped with the next version of Java).
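
As a sketch of the recursive style (written against the java.util.concurrent descendant of this framework; the paper's own FJTask API differs slightly, using methods like fork() and coInvoke()):

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveAction;

    public class SquareAll extends RecursiveAction {
        private static final int THRESHOLD = 1000;
        private final double[] data;
        private final int lo, hi;

        SquareAll(double[] data, int lo, int hi) {
            this.data = data; this.lo = lo; this.hi = hi;
        }

        @Override protected void compute() {
            if (hi - lo <= THRESHOLD) {
                // Small enough: do the work sequentially.
                for (int i = lo; i < hi; i++) data[i] *= data[i];
            } else {
                // Fork two child tasks and join them both.
                int mid = (lo + hi) >>> 1;
                invokeAll(new SquareAll(data, lo, mid),
                          new SquareAll(data, mid, hi));
            }
        }

        public static void main(String[] args) {
            double[] data = new double[1000000];
            new ForkJoinPool().invoke(new SquareAll(data, 0, data.length));
        }
    }

Each recursive split adds a node to the task tree described above, and the joins collapse it back up.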

The framework implements task queuing by giving each thread a task pool and then letting threads steal tasks from each other when they run out of work themselves. This is implemented through shared deques, where a thread uses its own deque as a stack while another thread can steal work from it by using the deque as a queue. At first I reacted to this because it is not fair for the last tasks entered to be executed first (in cases where each task has individually observable results), but the reason they did it is reasonable and enlightening. They basically put throughput before fairness and used the heuristic that the biggest tasks would be created first. As such, the tasks at the tail of the deque will likely be bigger, meaning a thread that steals one will probably not need to steal another for a while, thereby reducing synchronization.
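
A toy illustration of this discipline (nothing like the framework's highly optimized deque implementation, just the access pattern):

    import java.util.concurrent.LinkedBlockingDeque;

    public class WorkStealingDeque<T> {
        private final LinkedBlockingDeque<T> tasks = new LinkedBlockingDeque<T>();

        // The owner pushes and pops at the head (LIFO), so it works on its
        // most recently created, smallest subtasks first.
        void pushLocal(T task) { tasks.addFirst(task); }
        T popLocal()           { return tasks.pollFirst(); }

        // A thief takes from the tail (FIFO): the oldest and heuristically
        // biggest task, so it won't have to steal again for a while.
        T steal()              { return tasks.pollLast(); }
    }

The real implementation avoids a single shared lock, but the owner-as-stack, thief-as-queue idea is the same.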

Tuesday, September 29, 2009

Random thoughts on parallelism, functional languages and new C/C++ constructs

In the CS527 class discussion today we were discussing parallelism and the topic of functional languages for parallelism came up. The question was posed whether functional languages are inherently easier to parallelize than imperative languages due to various constructs such as map/fold and functions without side effects and, if this is true, whether we should therefore switch to these languages.

I made the argument that we should not be quick to throw away all the investment we have made in imperative languages like C and Java, as well as all the knowledge we have gathered about using these languages over the years. Professor Johnson pointed out that the fact that functional languages have long been proclaimed the right way to program without ever really breaking through to the masses does not mean that they won't in the future. He gave an example from a talk he recently attended where the speaker showed various devices from 1984 (the year, not the book =) that we today typically think of as recent innovations. Examples included touch-screen watches and iphone-like devices, the point I believe being that inventions typically exist for a long time before their time finally comes and they are used in innovative ways.

I found the discussion, as well as JLSjr's and Professor Johnson's points, really interesting, so I thought about it quite a bit on my way home. It seems to me that what is currently happening is that our imperative languages are starting to soak up features from functional languages that make parallelism easier, instead of being outright replaced by them. Examples include the broad usage of map-reduce frameworks, a feature that has long been supported natively in functional languages, and the planned introduction of lambda expressions, or anonymous functions, in C++, primarily to make it easier to define kernels to execute in parallel.

A third example off the top of my head is the introduction of closure blocks as an extension to C and C-derivatives (C++ and Objective C) that came with the latest release of Mac OS X, Snow Leopard. The main rationale for these closures, a long-time functional programming mainstay, is to make it easy to specify small blocks of code that can form a task to be executed in parallel by their new Grand Central. Grand Central is the core of the new parallelism support in OS X; it is integrated with their kernel and basically uses a thread pool to execute user-specified tasks taken from FIFO queues. Without closures, creating such tasks would require a fair bit of plumbing (Java's new ParallelArray library defines more than one hundred classes that one must subclass to implement different types of task kernels). With closures, a lot of very common pieces of code can be parallelized by just adding a couple of lines of code and a couple of blocks. If you are interested in seeing how this works, an interesting overview of Grand Central and the usage of closure blocks in C can be found here.

Perhaps functional programming is here, not through some grand sweeping march where everyone repents their sins and switches to the "only computer languages that are beautiful", but by gradually influencing and changing our current bag of tools (like it has done in the past)? Does anyone have any thoughts on this? Am I reading too much into this, or are these constructs perhaps really a first step on the way to switching over completely?

Monday, September 28, 2009

Jikes RVM

Turtles all the way down.

Chapter 10 of our mainstay, Beautiful Architecture, describes another, to me somewhat surprising, use of Java. The Jikes RVM is a so-called metacircular virtual machine or runtime. It hosts Java bytecode, but what makes it metacircular is that it is also implemented in Java. My first thought was to wonder how this could be, because somewhere something must be running on the actual hardware. Surely it can't be turtles all the way down. I started to dream about fancy bootstrapping tricks, but the answer turned out to be straightforward, which I admit disappointed me a bit. Even though the Jikes RVM is written in Java, a language that is usually compiled to bytecode, in this case they compile it down to machine code.

The authors spend quite a bit of text-space arguing the virtues of metacircularity. I thought most of these arguments were actually arguments for using Java at all, not for using Java to implement a JVM in particular. The case for metacircularity, from what I could distill, boiled down to two arguments. The first is that the runtime/VM and the code being run will use the same conventions (calling, etc.), which removes some need for conversion. The second is that using the VM's JIT compiler to compile itself (there was some bootstrapping in there after all!) creates a virtuous cycle where good things happen, and I can definitely see the benefits of this. However, writing a compiler is an art of tradeoffs (or so I'm told), and surely some of the tradeoffs must be different for regular JIT code and for a compiler being compiled offline? I wonder whether they have situations where they must choose between bloating the compiler, making it less suitable as a JIT, or leaving out some costly optimizations, thereby making it less suitable for offline compilation of itself and the JVM. However, the project seems to be fairly successful, so I guess they managed to navigate these tradeoffs satisfactorily.

If I were to point out any disadvantages of the concept of metacircularity itself, I would argue that every tool has a job and every job has a tool. Not all languages are equally good at implementing all software, so a language that is, for example, highly productive for creating enterprise applications is not necessarily the most effective one for creating a compiler. On the contrary, a different language with different design decisions will often be more suited for this job. Note that I am not here considering whether Java is well suited to creating a VM in general. That is another discussion (to which, based on JPC, I would say the answer is probably no in most cases). In this paragraph I only considered whether the target language is inherently the best choice for implementing its own runtime.

Marcus Strautman posed an interesting question about whether a JVM rolling its own threading model is a good idea today. According to the chapter, Jikes RVM did this because of lacking multithreading support in the previous century. Although OS support for multithreading is much better now, I would not necessarily throw away green threads completely, as there are other benefits to them. The number of cores per processor is expected to be the main beneficiary of Moore's law in the years to come (as a side note, Moore's law is falsely, or at least prematurely, proclaimed dead, as the law is about transistors per CPU and not about speed). As we must expect core counts to increase in the future, we should at least try to create as many independent threads of execution as we can in software that needs to be scalable. However, native thread switching comes at a cost, which means that such software will be slower today! Green threads are usually cheaper, and in an environment with cheap threads the application programmer can afford to create more of them (programming time not considered). This means that an implementation with green threads on a dual-core processor could, for example, use two to three native threads in a thread-pool fashion to execute them, while on a quad-core it could use four to six threads, and so on, with no change to the application. Furthermore, with more application threads available, the runtime is freer to schedule them in clever ways to hide latency by exploiting techniques such as simultaneous multithreading (aka hyperthreading), etc.
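
A rough analogy of this M:N idea in plain Java (my own sketch, obviously not Jikes RVM code): expose many cheap logical threads of work and let a runtime multiplex them onto a few native threads sized for the actual machine.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class GreenThreadAnalogy {
        public static void main(String[] args) {
            // A few native "carrier" threads, sized for the actual machine...
            int nativeThreads = Runtime.getRuntime().availableProcessors();
            ExecutorService carriers = Executors.newFixedThreadPool(nativeThreads);
            // ...multiplexing many cheap logical threads of work. The same
            // code uses 2 carriers on a dual-core and 8 on an eight-core.
            for (int i = 0; i < 10000; i++) {
                carriers.submit(new Runnable() {
                    public void run() { /* a small, independent piece of work */ }
                });
            }
            carriers.shutdown();
        }
    }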

Our Pattern Language

Our Pattern Language is a pattern language developed by several parties, including Professor Johnson's group at UIUC, but hosted by Berkeley. The name is borrowed from Alexander's pattern language and is supposedly a placeholder for another, future name. It is interesting to see how much clearer this new paper is than earlier papers and versions of the OPL wiki; the pattern language definitely seems to be maturing.

OPL is a hierarchical set of patterns divided into five groups. One of the underpinnings of OPL is the arguable point that our inability to architect parallel software stems from our inability to architect software in general. The first group therefore contains general architecture patterns from which the parallel implementation should flow. The next two groups contain the computational problems in parallel computing and the algorithms used to implement parallelism. The lower groups contain the programming models and primitives that are necessary to implement these algorithms.

Another driving idea behind OPL is that research in parallel computing should be driven by the applications that will actually be written. The language therefore includes the 7 dwarfs of supercomputing and the rest of the 13 more general parallel motifs, recast as Computational patterns. These patterns break a little with the common way of framing patterns, namely as a solution to a recurring problem in context. Instead they are broad categories of solutions to a large number of possible problems. One way to get around this that I think works quite well is to treat them as pattern languages consisting of a number of more concrete solutions to problems encountered in the pattern language's domain. One can then discuss the forces that are relevant in the domain at large before presenting the common patterns and the tradeoffs in selecting one or the other. This is the approach taken in the paper "N-Body Pattern Language" by Danny Dig, Professor Johnson and Professor Snir.

One point about the computational patterns that I found enlightening is that they do not cover all problems solved with parallel programming. Many projects use threading or multiple processes for other reasons, such as avoiding hogging the GUI thread or waiting for and collecting hardware interrupts. Other examples mentioned in this course are Professor King's web browser, where parallelism is mainly for security, and Erlang/Guardian, where parallelism was originally mostly for reliability. However, the computational patterns are the problems that need an almost inexhaustible supply of processors and where we can always efficiently use more processors by increasing the problem size. The other problems mentioned above are inherently constrained in how much parallelism one can expose; the computational patterns are not. As such, if we can find just a few killer desktop applications that really need to solve one of these problems, then we have a reason to continue making chips with more processors. Otherwise, as Professor Patterson puts it, we are on the road to commoditization, and consumers will (rightly) demand cheaper instead of more powerful hardware. After all, we don't really want to spend 97 out of 100 cores detecting viruses, and there is only so much embarrassingly parallel concurrency available through the user running different programs.

The paper identifies four classes of developers and presents a vision that we will architect software in such a way that each kind of developer will only need to concern themselves with the problems at their layer. One example of this separation of concerns is that an application developer (the largest group of programmers by far) should only need to be "weakly aware of parallel programming". This is a nice vision and is the way it currently works in domains such as enterprise systems where parallelism is ubiquitous, but where most programmers do not spend time thinking about it. If we could achieve this for all the thirteen computational patterns and then find killer applications that need these types of computations then the problem would be largely solved and most developers would not even need much re-education!

However, I do think this will be harder than it was with web servers. The difference this time around is that the problems have much more complex communication patterns. With web servers, each client can be run in parallel in a different session. The sessions are for the most part independent of all other sessions and only communicate with each other through a database management system (into which much effort has gone to make it parallel). Most of the computational patterns, on the other hand, have more complex and intrusive communication patterns, making it harder to write frameworks that are applicable to many applications. But then again, frameworks don't really need to be that general.

Thursday, September 24, 2009

JPC: An x86 PC Emulator in Pure Java

JPC is an emulator for x86 machine code written in Java. Chapter 9 of Beautiful Architecture describes its rationale and development, as well as the challenges in creating an emulator in Java. The bulk of the chapter is organized around nine tips for creating fast Java code. An emulator in this context is a piece of software that reads instructions in a machine language and then executes these instructions. As such, it is basically not much more than a specialized interpreter for machine code. The benefit of emulators as opposed to virtual machines is that they read and execute machine code instead of running it directly on the host machine. They are therefore more flexible and can support running one type of machine code on a completely different machine. The JPC emulator in this chapter does exactly this, as it interprets x86 code while itself running on a JVM. The fact that the JVM often happens to be running on an x86 machine (giving the ironic emulation chain x86->Bytecode->x86) is not relevant for JPC, and indeed they do provide examples of JPC emulating x86 on other architectures. The downside of emulators as opposed to VMs, on the other hand, is that they are slower. Instead of just running the program instructions and trapping system calls, they have to read, interpret and run every instruction, which adds overhead.
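
To make "specialized interpreter for machine code" concrete, here is a minimal fetch-decode-execute loop for an invented two-instruction machine (illustrative only, nothing like JPC's actual code):

    public class TinyEmulator {
        static final int HALT = 0, INC = 1;

        static int run(int[] code) {
            int pc = 0, acc = 0;           // program counter and accumulator
            while (true) {
                int opcode = code[pc++];   // fetch
                switch (opcode) {          // decode
                    case INC:  acc++; break;                  // execute
                    case HALT: return acc;
                    default:   throw new IllegalStateException("bad opcode");
                }
            }
        }

        public static void main(String[] args) {
            System.out.println(run(new int[] { INC, INC, INC, HALT })); // prints 3
        }
    }

Every guest instruction costs a fetch, a decode and a dispatch on the host, which is exactly the overhead described above.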

The chapter begins and ends with the authors arguing why one would want to write an emulator in Java, since Java does add some overhead that native code doesn't have (even though the overhead shrinks as the technology matures). The two main benefits they describe that are specific to a Java emulator are safety and portability. They argue that since JPC runs inside the fairly secure JVM, such a system is more secure. This is based on the fact that there are several layers of security, so an attacker has to find bugs in more than one place to break all the way into the native environment. The second benefit, portability, stems from the fact that Java Virtual Machines are available for more platforms than the x86, which means these platforms can use JPC to run x86 code without having to port the emulator.

When reading this chapter I liked the way they balanced the opposing forces of simplicity and efficiency. Instead of just deciding that it has to go fast and optimizing everything, they tried to pinpoint the areas where optimization was actually needed while keeping the rest of the system as simple as possible. They seemed to pay for complexity in some parts with simplicity in others (maybe the average complexity of a system like this is a better measure than that of its most complex function). I also found their sometimes very obscure tricks fun to read. One thing I did find strange, however, was their comment on using the heuristic that "the typical machine will not exceed 2 GB of physical RAM". I thought we were past making such assumptions, and indeed this one seems to be becoming outdated already.

The moderator for this paper posed the question of whether JPC could be useful for architectures other than x86. I assume he meant this in the sense of running it on another architecture and not of porting it to emulate another one. I do think this could be very useful. The JVM is very popular on other architectures like the ARM. In fact, ARM devices with JVMs are probably far more common than x86 devices, since ARM processors are the best-selling 32-bit processors on the planet (they passed the 10 billion processor mark a year ago and around 1 billion are now shipped every quarter). Just look at the (likely) ARM-powered, Java-capable device in your pocket - the cellphone. That being said, a lot of useful software exists only for the x86 architecture, so an emulator capable of running this software on medium-sized, non-x86 devices such as smartphones, PDAs and some netbooks would be very useful.

Wednesday, September 23, 2009

The Adaptive Object-Model Architectural Style

The purpose of this architectural style/pattern is pretty straightforward, although I did have some problems understanding some of the details. The goal of the architect using this pattern is to allow the system to be extended with new "business rules" without having to change the source code to do so. The way one often models an application is to let concepts be abstract classes and then subclass these for each instantiation of that concept. The example given in the paper is that of a car, which can be subclassed as Prius, etc.

If I understood it correctly, the adaptive object-model pattern changes this so that instead of modeling the domain as classes, one creates a set of meta-concepts (classes) that the user can use to model the business rules by composing them in different ways (favor composition). In the extreme case these concepts would form a complete domain-specific language that could itself be Turing-complete. This way the application moves from being an implementation of business rules to being an interpreter of them. This flexibility comes at a cost in that it makes the application harder to understand for developers not used to this way of thinking. However, on the plus side, some of the work of understanding and creating business rules can be offloaded to domain experts, who are often not programmers.
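
Here is a minimal sketch of how I read the core idea (my own names, and far simpler than the paper's full structure of type objects, properties and strategies): instead of a compiled Car subclass per model, "Prius" becomes data interpreted at runtime.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // The meta-level "class": a description held as data, not as a Java class.
    class EntityType {
        final String name;
        final List<String> propertyNames = new ArrayList<String>();
        EntityType(String name) { this.name = name; }
    }

    // A domain object that points to its type instead of being an instance
    // of a compiled subclass.
    class Entity {
        final EntityType type;
        final Map<String, Object> properties = new HashMap<String, Object>();
        Entity(EntityType type) { this.type = type; }
    }

    public class AomDemo {
        public static void main(String[] args) {
            // This definition could come from a database or an XML file,
            // with no new Java class and no recompile.
            EntityType car = new EntityType("Car");
            car.propertyNames.add("model");

            Entity prius = new Entity(car);
            prius.properties.put("model", "Prius");
            System.out.println(prius.type.name + ": " + prius.properties.get("model"));
        }
    }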

In the spirit of sharing, I once developed an application that had some of these characteristics, although it was far simpler than the full architecture presented here. The application was a performance analysis tool that presented various groups of HW/SW counters and derived counters that had been sampled from a graphics processor and its SW infrastructure. When I started on this project the decision to rewrite the original tool had just been made, so we could start from scratch. The original tool would read a fixed set of counters from a binary file created by our graphics drivers. It would then, based on "business knowledge" of the different counters, create derived counters (e.g. turning <#something per frame> into <#something per second>, or computing <#counter1 + #counter2 / #counter3>). This was not a good idea, as it required the application to be changed if a base counter was added, if we wanted a new derived counter, or if counters were replaced as they would be when new processors came along. In the new version the application became an interpreter that read an XML file with an arbitrary number of counters in arbitrary groupings. These counters were then tagged with information describing how they should be processed and formatted by the application. This new design was agnostic of what it displayed, which proved very useful, as the HW and driver people could add new counters without involving my team. Furthermore, when the maintenance was later taken over by another location on another continent, this flexibility became even more valuable, as the team maintaining the application and the teams with counters to show could work independently with the XML file format as the interface.
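
In spirit, the new design boiled down to something like this (a much-simplified sketch with invented names; the real definitions lived in the XML file):

    import java.util.HashMap;
    import java.util.Map;

    public class CounterTool {
        // In the real tool a definition like this came from the XML file;
        // the application just interprets it, knowing nothing about the
        // individual counters.
        static double derive(Map<String, Double> samples, String base, double scale) {
            return samples.get(base) * scale;
        }

        public static void main(String[] args) {
            Map<String, Double> samples = new HashMap<String, Double>();
            samples.put("triangles per frame", 120000.0);
            double framesPerSecond = 60.0;
            // <#something per frame> -> <#something per second>
            System.out.println(derive(samples, "triangles per frame", framesPerSecond));
        }
    }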

Finally, as a digression and to try to better understand the pattern myself, I found it interesting to compare it to the evolution of graphics processors. These used to be fixed-function, which required the HW to be changed to support new lighting models (bump mapping, shadows) and other effects. This proved to be too constraining, and the fixed functionality was gradually loosened until it was finally replaced with programmable stages that allow users to define their own "rules" (shader programs) for how lighting, etc. should be calculated. The graphics processor then "interprets" these rules the same way adaptive object-model software interprets business rules in a database. And as we all know, in the case of graphics processors, the rules eventually became so flexible that graphics processors became useful in areas that had nothing to do with graphics.