Search CORE

143 research outputs found

Final Report on Statistical Debugging for Petascale Environments

Author: Liblit B
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date: 18/01/2013

Crossref

UNT Digital Library

BugDoc: Algorithms to Debug Computational Processes

Author: Alvaro Peter
Attariyan Mona
Bergstra J.
Bergstra James
Chen Ang
Dolatnia Nima
Galhotra Sainyam
Godefroid Patrice
Holler Christian
Hutter F.
Johnson Brittany
Lee Kang Wook
Liblit Ben
Lourenço Raoni
Meliou Alexandra
Snoek Jasper
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 11/04/2020

Data analysis for scientific experiments and enterprises, large-scale simulations, and machine learning tasks all entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous outputs, the pipeline may fail to execute or produce incorrect results. Inferring the root cause(s) of such failures is challenging, usually requiring time and much human thought, while still being error-prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our experimental data and processing software are available for use, reproducibility, and enhancement.

Comment: To appear in SIGMOD 2020. arXiv admin note: text overlap with arXiv:2002.0464
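The idea of re-running a pipeline on new configuration instances to narrow down a root cause can be illustrated with a minimal sketch. This is not the algorithm from the paper; it is a simplified one-parameter-at-a-time swap, and the names `find_single_parameter_cause`, `run_pipeline`, and the example parameters are hypothetical, chosen only for illustration.

```python
def find_single_parameter_cause(run_pipeline, failing, passing):
    """Return (name, value) pairs whose failing value alone breaks a passing configuration.

    Assumption: both configurations are dicts with the same keys, and
    run_pipeline(config) returns True on success and False on failure.
    """
    suspects = []
    for name in failing:
        if failing[name] == passing[name]:
            continue  # identical values cannot explain the difference in outcome
        trial = dict(passing)          # start from the known-good configuration
        trial[name] = failing[name]    # swap in one value from the failing run
        if not run_pipeline(trial):    # a new run supplies fresh evidence
            suspects.append((name, failing[name]))
    return suspects


def run_pipeline(config):
    # Hypothetical pipeline: it fails whenever library version "2.0" is used.
    return config["library_version"] != "2.0"


failing = {"dataset": "B", "library_version": "2.0", "estimator": "tree"}
passing = {"dataset": "A", "library_version": "1.0", "estimator": "tree"}
suspects = find_single_parameter_cause(run_pipeline, failing, passing)
print(suspects)
```

A sketch like this only detects causes tied to a single parameter value; failures caused by combinations of values would need a search over more than one parameter at a time.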

arXiv.org e-Print Archive

Crossref

Scipedia

Open Repository and Bibliography - Luxembourg

Scalable temporal order analysis for large scale debugging

Author: Barton P. Miller
Ben Liblit
Bronis R. De Supinski
Dong H. Ahn
Gregory L. Lee
Ignacio Laguna
Martin Schulz
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2009

We present a scalable temporal order analysis technique that supports debugging of large-scale applications by classifying MPI tasks based on their logical program execution order. Our approach combines static analysis techniques with dynamic analysis to determine this temporal order scalably. It uses scalable stack trace analysis techniques to guide selection of critical program execution points in anomalous application runs. Our novel temporal ordering engine then leverages this information along with the application’s static control structure to apply data flow analysis techniques to determine key application data such as loop control variables. We then use lightweight techniques to gather the dynamic data that determines the temporal order of the MPI tasks. Our evaluation, which extends the Stack Trace Analysis Tool (STAT), demonstrates that this temporal order analysis technique can isolate bugs in benchmark codes with injected faults as well as a real-world hang case with AMG2006.

CiteSeerX

Crossref

Is non-parametric hypothesis testing model robust for statistical fault localization?

Author: Arumuga Nainar
Baudry
Cleve
Do
Gupta
Harrold
Hu
Jiang
Jones
Korel
Liblit
Liu
Peifeng Hu
Renieris
T.H. Tse
Tip
W.K. Chan
Wang
Weiser
Wong
Xinming Wang
Zeller
Zhang
Zhenyu Zhang
Zill
Publication venue: 'Elsevier BV'

Crossref

BugDoc: A System for Debugging Computational Pipelines

Author: Attariyan Mona
Bergstra J.
Bergstra James
Chen Ang
Dolatnia Nima
Johnson Brittany
Liblit Ben
Lourenço Raoni
Petrosjan Leon
Snoek Jasper
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 14/06/2020

Peer reviewed.

Data analysis for scientific experiments and enterprises, large-scale simulations, and machine learning tasks all entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous outputs, the pipeline may fail to execute or produce incorrect results. Inferring the root cause(s) of such failures is challenging, usually requiring time and much human thought, while still being error-prone. We recently proposed a new approach that makes use of provenance to automatically and iteratively infer root causes and derive succinct explanations of failures; such an approach was implemented in our prototype, BugDoc. In this demonstration, we will illustrate BugDoc's capabilities to debug pipelines using few configuration instances.

Crossref

Open Repository and Bibliography - Luxembourg

Lessons learned at 208K: Towards debugging millions of cores

Author: Barton P. Miller
Ben Liblit
Bronis R. De Supinski
Dong H. Ahn
Dorian C. Arnold
Gregory L. Lee
Martin Schulz
Matthew Legendre
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2008

Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application – already, debugging the full BlueGene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To scale to such counts and beyond, tools must employ a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become a tool bottleneck. In this paper, we present challenges to petascale tool development, using the Stack Trace Analysis Tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an InfiniBand cluster and results up to 208K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petascale. We then present solutions to these challenges that have been implemented and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.
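The notion of process equivalence classes mentioned in the abstract can be sketched as grouping tasks that report the same call path. This is a minimal illustration under assumed inputs, not STAT's implementation: the function name `group_by_stack_trace` and the sample traces are hypothetical, and a real tool would merge traces in a distributed fashion rather than in one dictionary.

```python
from collections import defaultdict


def group_by_stack_trace(traces):
    """Map each distinct call path to the sorted list of task ranks that reported it."""
    classes = defaultdict(list)
    for rank, frames in traces.items():
        # Tasks with an identical sequence of frames fall into the same class.
        classes[tuple(frames)].append(rank)
    return {path: sorted(ranks) for path, ranks in classes.items()}


# Hypothetical traces: rank -> call stack, outermost frame first.
traces = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "compute"],
    3: ["main", "solve", "MPI_Barrier"],
}

classes = group_by_stack_trace(traces)
for path, ranks in classes.items():
    print(" > ".join(path), ranks)
```

In this sample, the task that sits alone in its class (rank 2) is the natural first candidate to inspect, which is the kind of reduction that lets a developer attach a full debugger to one representative per class instead of to every process.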

CiteSeerX

Crossref