Statistical debugging identifies program behaviors that are highly
correlated with failures. Traditionally, this approach has been
applied to desktop software, where it has proven effective at
isolating the causes of several difficult classes of bugs, including
memory corruption, non-deterministic bugs, and bugs with multiple
temporally-distant triggers.
The domain of scientific computing offers a new target for this type
of debugging. Scientific code runs at massive scale, yielding
correspondingly massive quantities of statistical feedback data. Data collection can
scale well because it requires no communication between compute
nodes. Unfortunately, existing statistical debugging techniques
impose run-time overhead that, while modest and acceptable in desktop
software, is unsuitable for computationally-intensive code.
Additionally, the normal communication between nodes in parallel jobs
violates a key assumption of statistical independence in existing
statistical models.
We report on our experience bringing statistical debugging to the
domain of scientific computing. We present techniques that reduce the
run-time overhead of the required instrumentation by up to 25%
relative to prior work, along with challenges related to data
collection. We
also discuss case studies examining real bugs in ParaDiS and
BOUT++, as well as several manually-seeded bugs. We demonstrate that
the loss of statistical independence between runs is not a problem
in practice.