Thursday, October 1, 2009

A Java Fork/Join Framework

The title of this paper was a bit misleading to me, as I thought we'd be looking at a pthread-style framework. My first thought was that Java's OO-style threading model is, after all, a bit nicer than pthreads', so why add fork and join to the language? The paper turned out to be about a fork/join task framework, which makes much more sense, and I think the word task should have been in the title.

Java Fork/Join seems like a nice little framework that offers a fairly convenient way to specify tasks to be run in a thread pool. Tasks (sometimes called fibers) here, and in most other places, refer to lightweight chunks of work that are scheduled onto a pool of threads. What makes them more lightweight than threads is that they let us reuse threads, so we don't pay for thread creation or for the overhead of having too many threads. Thread pools like these are nothing new, and programmers have long been writing their own for similar purposes, but a framework like this greatly reduces the effort of rolling your own thread pool implementation.
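To make the thread pool idea concrete, here is a minimal sketch using the standard `ExecutorService` API (not the paper's framework): a fixed pool of worker threads that are created once and then reused for every submitted task.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolDemo {
    public static void main(String[] args) throws Exception {
        // A fixed pool of 4 worker threads; submitted tasks are queued and
        // the threads are reused, so thread creation is paid for only once.
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Submit a lightweight task (a Callable) and wait for its result.
        Future<Integer> result = pool.submit(() -> 6 * 7);
        System.out.println(result.get()); // 42

        pool.shutdown();
    }
}
```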

What is, in my opinion, one of the greatest potential benefits of such thread pool-style solutions is that they allow one central component to create the correct number of threads for the given architecture. That way developers don't have to guess how many threads to use. I use the word guess on purpose, because even if a developer knows the number of cores on the target machine, and there is only one target machine, they are very unlikely to know how many other threads are running on the system. They therefore cannot know whether 4 new threads are too few or too many for an 8-core CPU. The problem is that only the operating system has a global view of the number of threads in use, so this framework can't, and doesn't, take all of that into account. However, it would be nice if it at least detected the number of cores at runtime and sized the pool accordingly by default, so that applications scale as they are deployed on new computers. One example of a parallel framework that does this for you is QtConcurrent from Nokia (formerly Trolltech).
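Java does at least expose the information needed to do this yourself: the JVM can report the number of logical processors at runtime, which is a reasonable default pool size in the absence of a global view of the system. A small sketch:

```java
public class CoreCount {
    public static void main(String[] args) {
        // Ask the JVM how many logical processors are available. This is
        // only a heuristic default: it says nothing about what other
        // threads or processes are currently competing for those cores.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("Default pool size: " + cores);
    }
}
```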

As a side note, another framework I recently read about that really impressed me with its simplicity and fitness for its particular use is Apple's Grand Central. Grand Central is a task-based framework with a thread pool and task stealing much like this one (although it does not use the fork/join pattern). Its great strength is that it is integrated with the internals of the Darwin kernel (Mac OS X's core), which means it does know just how many threads it should create to get optimal parallelism without too much overhead, and it can reuse these threads across all applications using it.

The Java Fork/Join Framework's model of parallelism is, not surprisingly, fork/join parallelism. The central theme is that a task can spawn a sub-task that becomes a child of that task. One can later join with the sub-task to wait for its completion and delete it. This means that at any point in time there is a tree of tasks, and if one looks at the lifetimes of the tasks one sees a DAG. This is a style of parallel programming that lends itself especially well to recursive problems, but it is flexible enough to handle other kinds of problems by "emulating" other parallel programming models. One example of the latter is loop parallelism, which can be "emulated" by forking a set of tasks and having them compute different parts of an array in parallel. This works well enough, but for these kinds of problems it is more complex to program than expressing loop parallelism directly with something like the ParallelArray framework (to be shipped with the next version of Java).
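The recursive divide-and-conquer style can be sketched as follows. Note this uses the `java.util.concurrent` classes that eventually shipped in the JDK rather than the paper's own `FJTask` API, so the names differ, but the fork/compute/join pattern is the same: split the work, fork one half as a child task, compute the other half yourself, then join and combine.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sum an array by recursively splitting it into halves.
public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1000;
    private final long[] data;
    private final int lo, hi;

    SumTask(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {           // small enough: compute directly
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) / 2;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                          // spawn a child task
        long rightSum = right.compute();      // work on the other half ourselves
        return rightSum + left.join();        // wait for the child, then combine
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        long total = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(total); // 49995000
    }
}
```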

The framework implemented task queuing by giving each thread its own task pool and letting threads steal tasks from each other when they run out of work. They implemented this through shared deques, where a thread uses its own deque as a stack while other threads steal work from it by using the deque as a queue. At first I reacted to this because it is not fair for the last-entered tasks to be executed first (in cases where each task has individually observable results), but the reason they did it is reasonable and enlightening. They put throughput before fairness and used the heuristic that the biggest tasks would be created first. As such, the tasks at the tail of the deque are likely to be bigger, meaning a thread that steals one will probably not need to steal another for a while, thereby reducing synchronization.
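The two access patterns on the shared deque can be illustrated with a small sketch. This is purely a hypothetical illustration using a standard `Deque`; the paper's actual implementation is a specialized array-based deque tuned so that only the stealing end needs expensive synchronization.

```java
import java.util.Deque;
import java.util.concurrent.ConcurrentLinkedDeque;

public class StealDemo {
    public static void main(String[] args) {
        Deque<String> deque = new ConcurrentLinkedDeque<>();

        // The owning thread pushes new tasks onto the head, so older
        // (heuristically larger) tasks accumulate at the tail.
        deque.push("big-task");     // created first, likely the largest
        deque.push("medium-task");
        deque.push("small-task");   // created last, likely the smallest

        // Owner uses the deque as a stack: it pops the newest task.
        String owner = deque.pop();       // "small-task"

        // A thief uses it as a queue: it takes the oldest task from
        // the tail, hopefully a big one that keeps it busy for a while.
        String thief = deque.pollLast();  // "big-task"

        System.out.println("owner ran " + owner + ", thief stole " + thief);
    }
}
```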
