
    Scalable temporal order analysis for large scale debugging

    We present a scalable temporal order analysis technique that supports debugging of large-scale applications by classifying MPI tasks based on their logical program execution order. Our approach combines static analysis techniques with dynamic analysis to determine this temporal order scalably. It uses scalable stack trace analysis techniques to guide selection of critical program execution points in anomalous application runs. Our novel temporal ordering engine then leverages this information along with the application's static control structure to apply data flow analysis techniques to determine key application data such as loop control variables. We then use lightweight techniques to gather the dynamic data that determines the temporal order of the MPI tasks. Our evaluation, which extends the Stack Trace Analysis Tool (STAT), demonstrates that this temporal order analysis technique can isolate bugs in benchmark codes with injected faults as well as a real-world hang case with AMG2006.
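    As a hedged illustration of the idea (not STAT's actual implementation), the Python sketch below orders MPI tasks by hypothetical progress vectors built from loop control variables like those the data flow analysis identifies; tasks with the smallest vectors are furthest behind and thus prime suspects in a hang:

```python
# Illustrative sketch only (not STAT's actual implementation): ordering
# MPI tasks by logical execution progress. Assumes each task reports the
# current values of the loop control variables enclosing its execution
# point, outermost loop first.

def temporal_order(task_progress):
    """Group tasks with identical progress vectors, then sort the groups
    lexicographically so the least-progressed (possibly stuck) tasks
    come first.

    task_progress: dict mapping MPI rank -> tuple of loop counter values.
    """
    groups = {}
    for rank, progress in task_progress.items():
        groups.setdefault(progress, []).append(rank)
    return sorted(groups.items())

if __name__ == "__main__":
    # Rank 2 is behind ranks 0 and 1 in the outer loop: a debugging target.
    for progress, ranks in temporal_order({0: (12, 3), 1: (12, 3), 2: (9, 41)}):
        print(progress, "->", sorted(ranks))
```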

    BugDoc: Algorithms to Debug Computational Processes

    Data analysis for scientific experiments and enterprises, large-scale simulations, and machine learning tasks all entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous outputs, the pipeline may fail to execute or produce incorrect results. Inferring the root cause(s) of such failures is challenging, usually requiring time and much human thought, while still being error-prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our experimental data and processing software are available for use, reproducibility, and enhancement. (To appear in SIGMOD 2020.)
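    A minimal sketch of the underlying idea, using an assumed record format of (configuration, outcome) pairs rather than BugDoc's actual provenance model, might look like this in Python:

```python
# Illustrative sketch in the spirit of BugDoc (not the authors' actual
# algorithm): find minimal parameter-value combinations that occur in
# every failing pipeline run and in no passing run. The (config, ok)
# record layout is an assumption for this example.
from itertools import combinations

def root_causes(runs, max_size=2):
    """runs: list of (config_dict, succeeded) pairs."""
    failing = [frozenset(cfg.items()) for cfg, ok in runs if not ok]
    passing = [frozenset(cfg.items()) for cfg, ok in runs if ok]
    if not failing:
        return []
    shared = frozenset.intersection(*failing)  # pairs common to all failures
    causes = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(shared), size):
            if any(set(prior) <= set(combo) for prior in causes):
                continue  # a smaller explanation already covers this combo
            if not any(set(combo) <= p for p in passing):
                causes.append(combo)
    return causes

if __name__ == "__main__":
    runs = [
        ({"solver": "gmres", "mesh": "fine"},   False),
        ({"solver": "gmres", "mesh": "coarse"}, False),
        ({"solver": "cg",    "mesh": "fine"},   True),
    ]
    print(root_causes(runs))  # [(('solver', 'gmres'),)]
```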

    BugDoc: A System for Debugging Computational Pipelines

    Data analysis for scientific experiments and enterprises, large-scale simulations, and machine learning tasks all entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous outputs, the pipeline may fail to execute or produce incorrect results. Inferring the root cause(s) of such failures is challenging, usually requiring time and much human thought, while still being error-prone. We recently proposed a new approach that makes use of provenance to automatically and iteratively infer root causes and derive succinct explanations of failures; this approach was implemented in our prototype, BugDoc. In this demonstration, we will illustrate BugDoc's capabilities to debug pipelines using a few configuration instances.
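    To illustrate the flavor of iterative debugging with only a few runs, here is a hedged stand-in in Python: a plain delta-debugging-style bisection between one passing and one failing configuration (explicitly not BugDoc's published algorithm; `run_pipeline` is an assumed callback):

```python
# Hypothetical sketch of debugging with few runs: starting from one
# passing and one failing configuration, bisect the differing parameters
# (delta-debugging style, used here as a stand-in for BugDoc's own
# search). Assumes a single culprit parameter; run_pipeline(config) ->
# bool is an assumed callback, not part of any published interface.

def isolate(passing, failing, run_pipeline):
    diff = [k for k in failing if failing[k] != passing.get(k)]
    while len(diff) > 1:
        half = diff[: len(diff) // 2]
        trial = dict(passing)
        trial.update({k: failing[k] for k in half})
        # If flipping this half already breaks the pipeline, the culprit
        # is inside it; otherwise it must be in the remaining half.
        diff = half if not run_pipeline(trial) else diff[len(diff) // 2:]
    return diff  # single suspect parameter, under the one-culprit assumption
```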

    Lessons learned at 208K: Towards debugging millions of cores

    Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application – already, debugging the full BlueGene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To scale to such counts and beyond, tools must employ a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become a tool bottleneck. In this paper, we present challenges to petascale tool development, using the Stack Trace Analysis Tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an InfiniBand cluster and at up to 208K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petascale. We then present solutions to these challenges that have been implemented and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.
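    The end result of STAT's trace merging can be sketched in a few lines of Python (the real tool merges traces into a prefix tree over a scalable tree-based communication infrastructure; this flat grouping only illustrates the outcome):

```python
# Minimal sketch of STAT's central operation as described above: merging
# per-task stack traces into process equivalence classes. Tasks with
# identical call paths land in the same class, so a million-task job
# typically collapses to a handful of distinct behaviors.
from collections import defaultdict

def equivalence_classes(traces):
    """traces: dict mapping MPI rank -> tuple of frames, outermost first."""
    classes = defaultdict(set)
    for rank, frames in traces.items():
        classes[frames].add(rank)
    return classes

if __name__ == "__main__":
    for frames, ranks in equivalence_classes({
        0: ("main", "solve", "MPI_Allreduce"),
        1: ("main", "solve", "MPI_Allreduce"),
        2: ("main", "io_flush"),  # the outlier: a likely place to look
    }).items():
        print(" > ".join(frames), "ranks:", sorted(ranks))
```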

    Statistically Debugging Massively-Parallel Applications

    Statistical debugging identifies program behaviors that are highly correlated with failures. Traditionally, this approach has been applied to desktop software, on which it is effective in identifying the causes that underlie several difficult classes of bugs, including memory corruption, non-deterministic bugs, and bugs with multiple temporally distant triggers. The domain of scientific computing offers a new target for this type of debugging. Scientific code is run at massive scales, offering massive quantities of statistical feedback data. Data collection can scale well because it requires no communication between compute nodes. Unfortunately, existing statistical debugging techniques impose run-time overhead that is unsuitable for computationally intensive code, despite being modest and acceptable in desktop software. Additionally, the normal communication that occurs between nodes in parallel jobs violates a key assumption of statistical independence in existing statistical models. We report on our experience bringing statistical debugging to the domain of scientific computing. We present techniques to reduce the run-time overhead of the required instrumentation by up to 25% over prior work, along with challenges related to data collection. We also discuss case studies looking at real bugs in ParaDiS and BOUT++, as well as some manually seeded bugs. We demonstrate that the loss of statistical independence between runs is not a problem in practice.
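    One classic formulation of such failure-correlated scoring can be sketched in Python under the assumption of simple per-predicate counts (the paper's exact statistical model may differ):

```python
# A sketch of a classic statistical-debugging score (in the style of
# cooperative bug isolation; not necessarily this paper's exact model):
# rank each instrumented predicate P by how much observing P true raises
# the failure probability over merely reaching P's site at all.

def increase_scores(counts):
    """counts: predicate -> (f_true, s_true, f_obs, s_obs), the numbers of
    failing/successful runs where P was true and where P was observed."""
    scores = {}
    for pred, (f_true, s_true, f_obs, s_obs) in counts.items():
        if f_true + s_true == 0 or f_obs + s_obs == 0:
            continue  # never sampled; no evidence either way
        failure = f_true / (f_true + s_true)  # P(fail | P true)
        context = f_obs / (f_obs + s_obs)     # P(fail | P observed)
        scores[pred] = failure - context      # Increase(P)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # The null-pointer predicate is far more failure-predictive here.
    print(increase_scores({"ptr == NULL": (40, 2, 45, 400),
                           "i > n":       (5, 5, 50, 450)}))
```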