Thursday, October 15, 2009

Finding and Reproducing Heisenbugs in Concurrent Programs

CHESS is a tool for finding concurrency bugs such as race conditions and deadlocks in multithreaded programs. It does this by hijacking the thread scheduler and driving the scheduling itself. This allows it to try different thread interleavings in a systematic and deterministic fashion, which means it does not need to rely on spawning a lot of threads to get sufficient thread interference and then pray the operating system will schedule them in just the way that triggers a race or deadlock.
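
To make the kind of bug concrete, here is a minimal sketch of my own (not from the paper) of a lost-update race that only manifests under one particular interleaving. A stress test can run it millions of times without failing, while a scheduler that controls every decision point can steer it straight into the bad ordering:

    // Intentional data race for illustration: the read-modify-write on
    // 'counter' is not atomic, so the two threads' loads and stores can
    // interleave in a way that loses one of the updates.
    #include <iostream>
    #include <thread>

    int counter = 0;  // shared and deliberately unsynchronized

    void increment() {
        int tmp = counter;  // read
        tmp = tmp + 1;      // modify
        counter = tmp;      // write -- may overwrite the other thread's update
    }

    int main() {
        std::thread t1(increment);
        std::thread t2(increment);
        t1.join();
        t2.join();
        // Almost always prints 2; under the unlucky interleaving it prints 1.
        std::cout << counter << '\n';
        return 0;
    }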

In addition, CHESS persists the interleavings it tries, which means that once a bug is triggered it can be reproduced reliably. The latter is extremely valuable, as it sometimes happens that one manages to trigger a bug, perhaps in extensive nightly or weekend testing, only to have it slip out of one's hands like water by not being able to reproduce it. The only choices we are then left with are to attempt to stare the code down hoping the bug will pop out, or to live with the uneasy knowledge that there is an unknown bug in the code. The former might take a long time and often requires a really smart developer (which keeps him/her from working on creating the future bug-free code). The latter is often the only choice, but it is less than ideal: if we do not know what the bug is, then we do not know what other failures it might trigger. We usually accept that some less severe bugs live on in our bug trackers to be fixed "eventually" (the bug trackers never seem to quite reach zero), but if we don't know what the bugs are, then we can't know how severe they are.
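
The reproducibility falls out of the design almost for free: once the tool controls every scheduling decision, a complete interleaving is nothing more than the sequence of those decisions. Here is a hedged sketch of the idea (the file format and function names are mine, not the actual CHESS implementation):

    // A persisted schedule is just a list of thread ids, one per
    // scheduling decision point.
    #include <fstream>
    #include <string>
    #include <vector>

    void save_schedule(const std::vector<int>& schedule, const std::string& path) {
        std::ofstream out(path);
        for (int tid : schedule) out << tid << '\n';
    }

    std::vector<int> load_schedule(const std::string& path) {
        std::ifstream in(path);
        std::vector<int> schedule;
        for (int tid; in >> tid; ) schedule.push_back(tid);
        return schedule;
    }

    int main() {
        std::vector<int> failing_run = {0, 0, 1, 0, 1, 1};  // hypothetical run
        save_schedule(failing_run, "repro.sched");
        // A later run hands control to thread schedule[i] at decision
        // point i, deterministically re-triggering the bug.
        std::vector<int> replay = load_schedule("repro.sched");
        return replay == failing_run ? 0 : 1;
    }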

However, CHESS is far from covering all interleavings. The number of potential thread interleavings in a program is enormous and includes all interleavings of all non-atomic instructions (mutually exclusive sections are treated as atomic), so there is no practical way to test them all for any non-trivial program. CHESS therefore makes the simplification of running all threads on the same unit of execution and never preempting them. This effectively means that only instructions that block or yield a thread will cause another thread to be scheduled, which rules out many possible data races. One of the ways they alleviate this is by running a data-race detector on the first few iterations and then manually instrumenting the code to yield at the potential data races reported. This manual instrumentation lets the CHESS scheduler try out different interleavings of the accesses to the potentially racy resources, which means it can detect races there in subsequent iterations. In effect it takes the output of the race detector and removes all the false positives that such tools are prone to produce (due to intentional races in the code). The paper only briefly mentions the fact that they have to do this, but to me it seems to be an essential point, as I strongly suspect they would otherwise miss a lot of races.
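
As a rough illustration of that instrumentation step (chess_yield is a stand-in name of mine; I don't know the real entry point), placing a yield at a flagged access hands the non-preemptive scheduler a decision point exactly where the suspected race lives, so both orders of the conflicting accesses get explored:

    #include <iostream>
    #include <thread>

    int shared_flag = 0;  // the access pair a race detector flagged

    // Under the tool this would return control to the CHESS scheduler;
    // here a plain OS yield stands in for it.
    void chess_yield() {
        std::this_thread::yield();
    }

    void writer() {
        chess_yield();    // decision point: the reader may run first...
        shared_flag = 1;  // ...or after, so both orders are exercised
    }

    void reader() {
        chess_yield();
        std::cout << shared_flag << '\n';  // 0 or 1 depending on the order
    }

    int main() {
        std::thread a(writer), b(reader);
        a.join();
        b.join();
        return 0;
    }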

The above simplification means they have far fewer interleavings to test, but the number is still very large, so they further optimize by intelligently searching for promising interleavings and scheduling those first. This seems to me a very sensible approach: the user can run the tool on a set of tests for a fairly short period to cover the most important interleavings, or run it for days or weeks to get increasingly more coverage, albeit with diminishing returns.
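
A minimal sketch of what such a prioritized search could look like, assuming the heuristic is to explore interleavings that require few context switches first (on the intuition that most bugs need only a handful of preemptions):

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // Enumerate every interleaving of 'a' steps of thread A and 'b' steps of B.
    void enumerate(int a, int b, std::string cur, std::vector<std::string>& out) {
        if (a == 0 && b == 0) { out.push_back(cur); return; }
        if (a > 0) enumerate(a - 1, b, cur + 'A', out);
        if (b > 0) enumerate(a, b - 1, cur + 'B', out);
    }

    // Cost of a schedule: the number of context switches it needs.
    int switches(const std::string& s) {
        int n = 0;
        for (size_t i = 1; i < s.size(); ++i) n += (s[i] != s[i - 1]);
        return n;
    }

    int main() {
        std::vector<std::string> schedules;
        enumerate(3, 3, "", schedules);
        std::stable_sort(schedules.begin(), schedules.end(),
                         [](const std::string& x, const std::string& y) {
                             return switches(x) < switches(y);
                         });
        // Cheapest schedules (fewest switches) come out first.
        for (const auto& s : schedules)
            std::cout << s << " (" << switches(s) << " switches)\n";
        return 0;
    }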

I also have to say that although the versions of libraries such as PLINQ and TPL that they used were likely pre-release research versions and not production code (I think they are scheduled for the next .NET?), I am impressed by the number of bugs they found. Since we can really only measure testing by the number of bugs we find, and not by the number we haven't found, a tool that detects new concurrency bugs in code that has been stress tested for weeks is definitely useful in cases where we need that level of reliability. For many applications, however, it might not be so bad to have a bug that only manifests once every quarter, so in those cases the extra effort might not be worth it.
