Thursday, November 12, 2009

Structured Grids, Branch and Bound and Graphical Models

Structured Grid Computational Pattern
This is a very common pattern that is very easy to get a mental image of. Structured grids model spaces in N dimensions as discrete points that interact. Their interaction can be regular or irregular (this pattern only talk about regular interactions) and in the regular case they are usually characterized by a fixed-size stencil.

We discussed it in class today and the biggest problem with this pattern is that it overlaps heavily with geometric decomposition. Indeed it actually describes geometric decomposition as the essence of the task manager described in this pattern's solution. First, geometric decomposition is about decomposing and solving geometric problems in parallel with the issues that bring and is not just a task manager. Second, if geometric decomposition was the essence of a subset of this solution then why would one have that pattern at all. A much more sensible strategy is for this pattern to discuss the problem/solution of performing structured grid computations, discuss stencils and then provide pointers to patterns that can be used to parallelize it. This could of course be geometric decomposition, but also master worker, etc. In particular this means that border exchanges belongs in geometric decomposition (which only mention them for about one paragraph. Indeed, as a part of cs527 I have actually written an entire pattern about just border exchanges. The current draft, with all its flaws, can be found here: URL.

Also, border exchanges as described in this pattern are insufficient to implement their first example. Even if the ghost nodes extend three rows/colums out from their chunks as they mention they still have to deal with the (literal) corner cases as their stencils will need cells from diagonal chunks. To see how this is usually handled refer to the border exchange pattern above.

Actually, todays discussion made the tight relationship between geometric decomposition and border exchanges clearer to me and I am wondering whether it would make sense to attempt to extend my border exchange pattern to be a geometric decomposition pattern or perhaps to try to merge them. It would give geometric decomposition a very specific meaning, more so perhaps than the OPL authors intended. It would essentially cover data-parallel, iterative refinement algorithms with border exchanges, but for me that is what geometric decomposition is. In patterns I value preciseness more than completeness. The original design patterns had this property. They did not give a complete design framework that is sufficiently vague to be complete like many textbooks attempt. Instead they gave crisp, specific and often non-intuiative solutions to concrete problems and for me it was this that gave them their greatest value. Besides I suspect other data-parallel algorithms with other types of communication, like the tiled N-Body example given in yesterdays OPL session, are covered by the Data Parallel pattern.


Branch & Bound Pattern
This pattern was pretty good. It started out with a decent example in the SAT decision problem and gave a good explanation of issues with branch and bound by authors who seemed to know what they were talking about very well. One area of potential improvement would be to add a code example. The paper also seems to implicitly conclude that structure-based parallelization is the one that makes sense, but if that is the case then why talk about the other two. Potentially the construction-based parallelization can give less synchronization overhead at the cost of more work, but I can't see a case where I would want operation-based parallelization. If there are such cases then the pattern should address those.

Another possible improvement that was suggested in class would be to start off with a chess or checkers example. The examples that are there are good, but I suspect many more developers will have some mental image of how a computer chess player searches for moves. It would at least be a more colorful and graphical example as the first example.


Graphical Models
When first reading the title of this pattern I must admit I was thinking of 3D meshes. We discussed it in class and other seemed to have other first-impression, but none seemed to match what it actually is about. Professor Johnson suggested that maybe it should be called s Probabilistic Methods or Stochastic Methods, both of which would have been preferable.

Even after reading the pattern I have no idea about how where it would apply or how I would start applying it. It lacks an example and I happen not to know the Junction Tree Algorithm. An example would have gone a long way, but even the pattern lacked pointers on how to go about parallelizing it. Other computation patterns has a parallel issues section and some, like circuits, are sprinkled with pointers to other patterns. This one had one line that said that "This search can be rather trivially parallelized" related to trying out randomized factorizations, which leads me to suspect that other parts of the patter can be parallelized. If not then wouldn't this just be a subset of branch and bound or perhaps something simpler?

Shared Queues, Speculation and Circuits

Shared Queue
I thought this pattern overall was pretty good and it covered most of the issues of queues. It also had quite extensive examples (taken from the ppp book). A couple of things I would have liked is a discussion on the use of notify vs. notifyAll and the concept of stampedes if too many threads are awakened at the same time as well as a mention of the pthread primitive they mention that provide wait/notify functionality (condition variables). On the other hand I liked how they gave reference to how to implement notification with semaphores (I have seen this been done once -- sometimes you just don't have access to libraries like full pthread for various reasons). I also think they could have dropped the thirds item in their solution ("Consider using distributed queues"). The way it was described on page 8 it is better covered by the task queue pattern. This pattern, as I understand it, is about implementing a queue. Not on laying one or more queues to share tasks, which is what the other pattern is about.


Speculation
I thought this pattern was really interesting and I am glad I have read it, but it also perfectly illustrates to me Professor Johnson's point about starting out with an example. I did not know this pattern beforehand and was left guessing about what it could be used for and how it worked for six pages before I finally got to the example section. In the mean time I was fed a lot of seemingly interesting facts, but not having an example to attach them to left me very frustrated. The html example should have been presented after the first paragraph of the context, but its full solution could have been addressed later in the solution section.

Of course, like Professor Johnson points out, this is due to the OPL format so I think the solution has to be to change it or forever live with the tension of trying to force an example into the start of a format that invites you to put them late. My suggestion is to simply change the name of the example section to additional examples. Not only would this invite writers to have one example in the context/solution part of the pattern, but it would also invite more than one example. The additional examples section could even contain more tricky examples, which are nice to get after one has been guided through the solution using a simple and obvious one.

A couple of minor suggestions are to add regexes in the HTML lexer example in addition to the textual descriptions (page 6). For me at least this would have been much easier to read. Also, I would have liked the another paragraph explaining how to use speculation with speech recognition. On a positive note I really liked how they contrasted speculation with branch and bound in the related pattern section, describing how they differ. This gave me a better understanding of both patterns.


Circuits
At first I thought the pattern was a concurrent execution pattern so I was very unhappy that it did not talk about parallelism in the problem and context. However, I am more content with it after I figured out it was a computational pattern.

The pattern deals with different techniques/algorithms for bit-level manipulation. What I can not quite figure out is why it is called circuits and not just bit manipulation. Sure, working with bits is working close to the Hardware, but the name circuits implies that the computational pattern is about modeling circuits or something similar. I had a look at the notes section for the this pattern on the OPL site and it talked about gates and synthesis, which is what I originally expected and which compounded my confusion. Has there been a mixup here and is perhaps this pattern incorrectly named?

Other than that the bulk of the pattern is in the example section that presents several algorithms for doing different bit manipulation tasks such as hashing, bit-counting and parity checking. Many of these can be parallelized and the pattern does a good job of giving pointers to other patterns that can be used to do so and I think this is the right way to do it for computational patterns.

Wednesday, November 11, 2009

Task Queue, Graph Partitioning and Loop Parallelism

I was ill therefore not very productive at the start of this week so this blog comes a bit late, but better late than never.

Task Queue
I was pretty happy with this pattern and felt it had enough examples for my taste. It did a decent job of describing both shared queues as well as a high level overview of work stealing although it was perhaps a bit light on the latter. The only thing I missed from the example section was in the Java example. When implementing task queues in shared memory programs one usually use a condition variable to put threads to sleep instead of spinning waiting for the queue to return non-null values, but I guess this is mentioned in the shared queue pattern.

Graph Partitioning
I was pretty unhappy with this pattern as it was hard to understand what they were talking about. Like Professor Johnson I get the impression that the pattern is really about how to map tasks onto processing elements at startup time for load balancing and reduced communication. However, this was far from clear from the start of the paper which quickly delved into the details. I suppose it could also be used to remap tasks at runtime if the graph changes. I read a pattern that sounded similar to me from paraplop09 a while back called "A Pattern Language for Topology Aware Mapping.". This pattern language talked about how to map a graph of tasks and the communication between these efficiently onto a machine topology graph so as to effectively use the machine's resources: "Given an object communication graph and a processor topology graph, how can we map the first graph onto the second such that most messages travel a small number of hops (links) on the network" In that paper the intent was much clearer and if these two patterns are talking about the same thing I prefer that version.

Loop Parallelism
Loop parallelism is a pretty common pattern for parallelization on shared memory machines besides fork-join. I thought this paper did a fairly good job of describing it and went through a lot of different techniques one can use to prepare loops for parallelization. However, unless one are already familiar with these techniques then one would probably not get them after reading this pattern as the descriptions were very brief (they did contain examples though!)

There were a few minor issues that I thought were strange. For example, why doesn't the i and j loops carry dependencies in the example on the top of page 3? It looks to me like they do (data created in iteration i of the i loop requires data from iteration i-1 and i+1) unless they employ double buffering in which case that should be mentioned.

Another problem I had was that they refer to the map-reduce pattern for descriptions of reduce. Surely a pattern on loop parallelism is the right place to include a description of both reduce and scans? Reduces are certainly used by map-reduce, but that is a very high level pattern and I think a more detailed description fits here.

Thursday, November 5, 2009

Making reliable distributed systems in the presence of software errors - Chapter 6

Chapter 6 of Armstrong's thesis on building failure-tolerant software describes the process of building an application. It shows how programmers can extend a generic set of behaviors with their own code containing business logic. The generic components of which server, event handler and finite-state-machine are described, are written by experts and have been heavily tested and widely reused. They deal with complicated issues such as concurrency, locks and parallelism in general. The code the application developers need to write are thus for the most part only sequential stubs that are plugged into the above components, which makes it far easier to make these correct. The chapter also discusses how one can generate supervisors to monitor and restart erroneous behaviors and how one can put all of this together in an application.

The chapter made some of the things about Erlang that has been blurry to me come together. Through Armstrong's demonstration of what happens when a failure occurs, in this case a behavior crashes, I started to gain the first glimmers of comprehension about how their systems can be so awe-inspiring robust. It seems to me that there may be three parts to the answer. The first is the obvious one that through supervisors an Erlang programs get multiple lines of defense so that a lot of things have to fail simultaneously to bring down a whole system.

The second one seems to me to be the pragmatic focus on getting, and not loosing any, logging information when something bad happens. Failures occur, but by catching them as soon as they happen, having a supervisor log the reason so that we are sure we don't loose it and by being able to hotswap parts of the system one can gradually weed them out. This again over time makes it less and less likely that enough things will go wrong at once to completely bring down a whole system.

The third reason I think is the mentality of focusing on failures that I am sure programmers develop by working with an environment that forces you to design your systems around failure-recovery. If you are working in C or Java on a standard project then you as a developer never really experience, or have to think much about, failures unless you go out of your way to write tests that exposes them. And since you are not forced to think about it you won't. In Erlang you have to build failure hierarchies and define explicit supervisors. This forces you to start thinking about these things and you will therefore keep failures in mind also when writing the simple behaviors. Put another way, I am sure a ten year Erlang programmer would write pretty fault-tolerant code even if he for some reason was forced to use C for a project as he has been schooled to think about it. This is equivalent to a 10 year Object-Oriented programmer being forced to work with C. He has been schooled on modularity and encapsulation and will tend to write his programs this way, even though he has to use C which has poor support for it.

Pipeline, Geometric Decomposition and Data Parallelism

Pipeline
Pipelining is an important pattern and is one of the best examples of clear-cut task parallelism (interestingly one book actually placed pipelining its own category disjoint from parallelism - Either things operate in parallel or they operate in a pipeline). Even though pipelining tend not to expose a lot of parallelism when used in software (usually just a few parallel tasks) it is sometimes the only parallelism we can extract. And in other cases it can be composed with data or other task parallelism, which gives it the effect of multiplying the parallelism we would otherwise have in the system.

The pattern on pipelining started out with a lot of (interesting) facts about the usage of pipelining in hardware that an application developer trying to implement software pipelining in his application won't care about. At this level in the pattern language we are not really concerned with vector processors and certainly not with instruction pipelining and these discussions therefore distract from the goal. A better approach would be to remove these facts, or at least move them to the known uses section, and focus on how to implement multithreaded/multiprocess software pipelines in our applications. Besides, having the first example in assembly is pretty strange.

The universal forces section mention throughput vs latency. I have heard this consideration before when comparing different CPUs and CPUs to GPUs. In hardware it makes sense as the tradeoff is between adding transistors to perform more work on one instruction per cycle or adding transistors to perform work on multiple instructions per cycle (for example through pipelining). but I wonder how relevant it is for a software pipeline. If you disregard communication cost between pipeline stages a given software task will take as long as it takes, no matter if you have a deep or shallow pipeline. It should not affect latency if you have a two stage pipeline performing n+n work or if you have a single task performing 2n work. Of course this is altered by communication between pipeline stages, but is that what the author has in mind? Or am I missing something?

Finally, I did not like the example. Why use matlab and not languages that are more well known to the target audience like C or Java? He says it is to not implement fft, but fft could just have been hidden behind a function in the example. And the example code just shows how to crop out a circle to do blurring using fft and matalb, not how to actually implement a pipeline. I would be much more satisfied with an example implemented using threads or message passing to show how to set up a pipeline (the parallel programming patterns book from which this pattern was modified had the java code for this. Why was it removed?). The first example could even be as simple as having each stage add a constant to a stream of numbers, with fft as an advanced second example.


Geometric Decomposition
I think this pattern is pretty important and I also think that it is specific enough as opposed to "Data Parallelism" which is rather vague. This description of it discussed the most important considerations when implementing geometric decomposition and it contains a lot of good information, but it definitely lacked an example with real code. Again, the Berkeley format suffers from having an explicit Example section which discourages them from interleaving the example with the solution description. And even the example in the example section only exemplifies how to do heat diffusion, with no code and little text showing how to do this in parallel. After reading this pattern I know about many things that we should consider when doing geometric decomposition, but I know nothing about how to actually begin implementing them.

It also boggles my mind that one can describe something as closely related to Cartesian coordinate system and real world phenomena's without even a single figure showing a grid of cells/points that are distributed amongst processors. But then again I am very much a graphical thinker. Indeed their only figure when describing how to solve a problem in parallel is of a heat equation...

What I don't understand is that the parallel programming patterns book contains these figures and the code in addition to the issues described here. Why can't they just use those, since they patterns like Discrete Event was taken almost directly from it? Is there some restriction on how much they can take, even though the author of this pattern is also an author of the book, or do they explicitly want to shorten it?

Data Parallelism
There is not really much to say about the pattern description as it is only three sentences long. Like the Task parallelism pattern this pattern is too vague. In the geometric decomposition pattern they say that if the geometric chunks have no dependencies then it is an instance of task parallelism because it is embarrassingly parallel. Why isn't it data parallelism? We discussed this in class and the distinction is very vague, which makes me think task and data parallelism are not patterns, but two potentially intersection categories. If they must be included in OPL then they should be two sub-categories of algorithm strategy. If they were then I would put the following patterns in the data parallel group: Geometric Decomposition, Recursive Splitting (yes really - you perform only one task repeatedly to the data structure although the granularity of the data differ dynamically. "The splitting is not a part of the core computation") and at least parts of non-work-efficient parallelism such as Scans. Discrete event, pipelining and speculation falls into the task parallel category.

Finally, I agree with Professor Johnson in that we should replace them with the Embarrassingly Parallel pattern. But even embarrassingly parallel can be both task and data parallelism. Embarrassingly task parallelism is to me when you have different tasks that do not need to communicate (MIMD). Embarrassingly data parallelism is when you have one task that operate on multiple pieces of data (SIMD). Of course this gets a bit vague when considering that the data parallel operations could have branches, but as long as those branches branch based on the data I still think it is data parallelism. Whether there, based on this, should be one or two embarrassingly parallelism pattern is not really that important and I guess it depends on whether you are a lumper or a splitter as Professor Johnson so nicely put it.

Tuesday, November 3, 2009

Making reliable distributed systems in the presence of software errors - Chapter 5

Chapter 5 of Joe Armstrong's thesis on fault tolerant systems describes how the fault tolerance of Erlang works and how the programmer can build an application to make it fault fault tolerant. By fault tolerance Armstrong means systems that can continue to complete tasks in the presence of faults and it does not imply the applications are free of them.

The fault tolerance model described in the chapter is based on the concept of fail immediately if you can not recover and then try to do something simpler. The system is organized as a tree of supervisor that supervise behavior processes that perform the actual application functionality. I did not fully understand how having the application recover by doing something simpler would work in practice though. However, another mechanism it has for recovering is to try to restart the failed process which makes sense. Perhaps it will succeed this time.

The chapter went on to discuss the difference between failures, errors and exceptions. I didn't fully get the exact distinction, but what I gathered from the text was that in the Erlang context exceptions are what you throw when the specification does not say what you should do. The programmer working at the lowest lever should not start trying to guess as this leads to an unstable system, but should instead just throw an exception. Always. An error comes in two forms, corrected errors and uncorrected errors. A corrected error is an error that you, or someone above you in the hierarchy knows how to handle. It therefore does not lead to a fault. An uncorrected error on the other hand leads to a failure, which means a process must be restarted or we must try to do something simpler. This could be because of a software error, a hardware resource going down or just because we don't know how to handle the case of a file being missing. In that case the developer are encouraged to explicitly exit or throw an exception if it is a spec error which is likely the case if we don't know what to do when a file is missing. Let someone else handle it.

However, the chapter does say that a runtime error such as a divide-by-zero leads to an exception, but surely this will more often be a programmer error than a spec error? Or is the fact that this exception are not handled by the module a spec error?

Monday, November 2, 2009

Task Parallelism, Discrete Events and Recursive Splitting

Task Parallelism
Task parallelism is a wide category that encompass a lot of different applications and is often used to describe everything that is not data parallelism. However, OPL (and the parallel patterns book) define it somewhat narrower by excluding more specialized "task parallelism" such as pipelines, discrete events and divide-and-conquer/recursive splitting.

This pattern was pretty well fleshed out with many interesting observations, but I missed a code example. I also found it a bit confusing that the solution section spent much more time describing value-at-risk, speech recognition and the cost of various operations on various architectures than it spent discussing how to find and exploit task parallelism in these problems. I also wonder why the value-at-risk example are split between the solution section and the examples section? Should perhaps the solution section talk about task parallelism in general terms (discussing issues such as embarrassingly parallel task parallelism, synchronization/communication (and how to reduce it), load balancing and how to find task parallelism) and leave the examples for the examples section?


Discrete Event Pattern
I like this pattern a lot. It does a good job of explaining the problem and discuss several important aspects of the solution such as event ordering and deadlocks. It is also the pattern on which Apple's Grand Central is based and is very important in the world of desktop GUI application development. Grand Central is a system for generating tasks that are then placed in task queues. These queues are then processed by different processors and the tasks can generate new tasks and put them in other queue. The tasks therefore work a lot like events and allows the GUI thread to offload work to other processors and it lets these processors place tasks/events back in the GUI thread's task pool. The latter is what makes this an event-based framework as opposed to a task based framework and is very powerful as it allows the application state and GUI widgets to be updated by the GUI thread thereby avoiding mutual exclusion. I particularly love one GC example where a task is generated by the GUI thread to perform some statistical computations on a document (which could potentially be expensive). When these computations are done the task will then place an event/task to update the application state/GUI with the statistics it gathered in the GUI thread's task queue, which is a beautiful solution.

The only thing I missed was a good code example, perhaps implementing the standard thread-based event queues with a switch statement to decide the action that are often used.


Recursive Splitting
This was a very well written pattern with tons of examples using different programming models. It also does a good job of describing the strengths of recursive splitting which is lies in its flexibility to create many tasks and I thought it was very interesting how they contrast it to geometric decomposition.