Dynamic tuning of parallel programs
Performance is one of the main reasons parallel programs are developed. Designing and programming a parallel application is a hard task: it requires the knowledge needed to detect performance bottlenecks and to change the application's source code to eliminate them. Current approaches to this analysis require a certain level of expertise on the developer's part in locating and understanding the performance details of the application's execution. For these reasons, we present KappaPi, an automatic performance analysis tool that aims to relieve developers of this hard task. The most important limitation of the KappaPi approach is the large amount of gathered information needed for the analysis. For this reason, we present a dynamic tuning system that takes measurements of the execution on-line. This new design focuses on improving the performance of parallel programs at runtime. I Workshop de Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI)
10381 Summary and Abstracts Collection -- Robust Query Processing
Dagstuhl seminar 10381 on robust query processing (held 19.09.10 -
24.09.10) brought together a diverse set of researchers and practitioners
with a broad range of expertise for the purpose of fostering discussion
and collaboration regarding causes, opportunities, and solutions for
achieving robust query processing.
The seminar strove to build a unified view across
the loosely-coupled system components responsible for
the various stages of database query processing.
Participants were chosen for their experience with database
query processing and, where possible, their prior work in academic
research or in product development towards robustness in database query
processing.
In order to pave the way to motivate, measure, and protect future advances
in robust query processing, seminar 10381 focused on developing tests
for measuring the robustness of query processing.
In these proceedings, we first review the seminar topics, goals,
and results, then present abstracts or notes of some of the seminar break-out
sessions.
We also include, as an appendix,
the robust query processing reading list that
was collected and distributed to participants before the seminar began,
as well as summaries of a few of those papers that were
contributed by some participants
ScalAna: Automating Scaling Loss Detection with Graph Analysis
Scaling a parallel program to modern supercomputers is challenging due to
inter-process communication, Amdahl's law, and resource contention. Performance
analysis tools for finding such scaling bottlenecks are based on either profiling or
tracing. Profiling incurs low overheads but does not capture detailed
dependencies needed for root-cause analysis. Tracing collects all information
at prohibitive overheads. In this work, we design ScalAna, which uses static
analysis techniques to achieve the best of both worlds - it enables the
analyzability of traces at a cost similar to profiling. ScalAna first leverages
static compiler techniques to build a Program Structure Graph, which records
the main computation and communication patterns as well as the program's
control structures. At runtime, we adopt lightweight techniques to collect
performance data according to the graph structure and generate a Program
Performance Graph. With this graph, we propose a novel approach, called
backtracking root cause detection, which can automatically and efficiently
detect the root cause of scaling loss. We evaluate ScalAna with real
applications. Results show that our approach can effectively locate the root
cause of scaling loss for real applications and incurs 1.73% overhead on
average for up to 2,048 processes. We achieve up to 11.11% performance
improvement by fixing the root causes detected by ScalAna on 2,048 processes.
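The abstract names Amdahl's law as one source of scaling loss. As a small worked illustration (not part of ScalAna), the sketch below evaluates the standard bound S(p) = 1 / ((1 - f) + f / p), where f is the parallelizable fraction of the run time and p the number of processes. The function name `amdahl_speedup` and the value f = 0.99 are assumptions chosen only for the example.

```python
def amdahl_speedup(parallel_fraction: float, processes: int) -> float:
    """Upper bound on speedup under Amdahl's law.

    parallel_fraction: share of the serial run time that can be parallelized (0..1).
    processes: number of processes the parallel part is spread over.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processes < 1:
        raise ValueError("processes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processes)


if __name__ == "__main__":
    # Hypothetical program with a 99% parallelizable run time.
    for p in (64, 512, 2048):
        print(p, round(amdahl_speedup(0.99, p), 1))
```

Even with 99% of the work parallelizable, the bound stays below 100 at 2,048 processes, which is why a small serial or contended region can dominate at scale.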
Dynamic multi-resource monitoring for predictive job scheduling.
Standard job schedulers rely either on the user's estimate or on a few approaches that use performance databases to keep information about job runtimes in order to predict future runs. Co-scheduling for improved resource utilization, however, requires more detailed information about behavior on multiple resources to make predictions about slowdowns. Thus, information about communication, I/O, and computation at the application level is needed but is hard for the user to estimate. Furthermore, dynamic adaptive resource allocation requires information about the different processes on different machine nodes. We present an intelligent monitoring tool, ScoPro, which provides such information. To make monitoring more feasible, ScoPro uses dynamic instrumentation techniques, which postpone insertion of instrumentation code until the application is executing. To keep intrusion low, we limit monitoring to short test phases. (Abstract shortened by UMI.) Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2005 .L586. Source: Masters Abstracts International, Volume: 44-03, page: 1407. Thesis (M.Sc.)--University of Windsor (Canada), 2005
Automated Testing and Debugging for Big Data Analytics
The prevalence of big data analytics in almost every large-scale software system has generated a substantial push to build data-intensive scalable computing (DISC) frameworks such as Google MapReduce and Apache Spark that can fully harness the power of existing data centers. However, frameworks once used by domain experts are now being leveraged by data scientists, business analysts, and researchers. This shift in user demographics calls for immediate advancements in the development, debugging, and testing practices of big data applications, which are falling behind compared to the DISC framework design and implementation. In practice, big data applications often fail as users are unable to test all behaviors emerging from interleaving dataflow operators, user-defined functions, and framework's code. "Testing based on a random sample" rarely guarantees the reliability and "trial and error" and "print" debugging methods are expensive and time-consuming. Thus, the current practice of developing a big data application must be improved and the tools built to enhance the developer's productivity must adapt to the distinct characteristics of data-intensive scalable computing. By synthesizing ideas from software engineering and database systems, our hypothesis is that we can design effective and scalable testing and debugging algorithms for big data analytics without compromising the performance and efficiency of the underlying DISC framework. To design such techniques, we investigate how we can build interactive and responsive debugging primitives that significantly reduce the debugging time, yet do not pose much performance overhead on big data applications. Furthermore, we investigate how we can leverage data provenance techniques from databases and fault-isolation algorithms from software engineering to pinpoint the minimal subset of failure-inducing inputs efficiently. 
To improve the reliability of big data analytics, we investigate how we can abstract the semantics of dataflow operators and use them in tandem with the semantics of user-defined functions to generate a minimum set of synthetic test inputs capable of revealing more defects than the entire input dataset.
To examine the first hypothesis, we introduce interactive, real-time debugging primitives for big data analytics through innovative and scalable debugging features such as simulated breakpoint, dynamic watchpoint, and crash culprit identification. Second, we design a new automated fault localization approach that combines insights from both the software engineering and database literature to bring delta debugging closer to reality in big data applications by leveraging data provenance and by constructing systems optimizations for debugging provenance queries. Lastly, we devise a new symbolic-execution-based white-box testing algorithm for big data applications that abstracts dataflow operators using logical specifications instead of modeling their implementations and combines them with the semantics of any arbitrary user-defined function. We instantiate the idea of an interactive debugging algorithm as BigDebug, the idea of an automated debugging algorithm as BigSift, and the idea of symbolic-execution-based testing as BigTest. Our investigation shows that the interactive debugging primitives can scale to terabytes: our record-level tracing incurs less than 25% overhead on average and provides up to 100% time saving compared to the baseline replay debugger. Second, we observe that by combining data provenance with delta debugging, we can identify the minimum faulty input in just under 30% of the original job execution time.
Lastly, we verify that by abstracting dataflow operators using logical specifications, we can efficiently generate the most concise test data suitable for local testing while revealing twice as many faults as prior approaches. Our investigations collectively demonstrate that developer productivity can be significantly improved through effective and scalable testing and debugging techniques for big data analytics, without impacting the DISC framework's performance. This dissertation affirms the feasibility of automated debugging and testing techniques for big data analytics---techniques that were previously considered infeasible for large-scale data processing
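The abstract refers to delta debugging for isolating a minimal failure-inducing input. The sketch below is a plain version of the classic `ddmin` procedure over a list of records, given only to illustrate that idea. It is not BigSift's algorithm, which the abstract says additionally relies on data provenance and system optimizations. The names `ddmin` and `fails`, and the toy failure condition in the usage example, are assumptions.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(records: Sequence[T], fails: Callable[[List[T]], bool]) -> List[T]:
    """Shrink `records` to a 1-minimal subset on which `fails` still returns True."""
    current = list(records)
    if not fails(current):
        raise ValueError("the full input must reproduce the failure")
    n = 2  # number of chunks the current input is split into
    while len(current) >= 2:
        size = len(current)
        # Split into n contiguous, non-empty chunks (n never exceeds size).
        chunks = [current[i * size // n:(i + 1) * size // n] for i in range(n)]
        reduced = False
        # Step 1: does any single chunk still fail?
        for chunk in chunks:
            if fails(chunk):
                current, n, reduced = chunk, 2, True
                break
        # Step 2: does the input still fail with one chunk removed?
        if not reduced:
            for i in range(n):
                complement = [r for j, c in enumerate(chunks) if j != i for r in c]
                if fails(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break
        # Step 3: refine the granularity, or stop at single-record chunks.
        if not reduced:
            if n >= size:
                break
            n = min(2 * n, size)
    return current


if __name__ == "__main__":
    # Toy failure: the job crashes only when records 3 and 7 are both present.
    print(ddmin(list(range(10)), lambda rs: 3 in rs and 7 in rs))
```

Each call to `fails` stands for re-running the job on a subset of the input, which is the cost a provenance-guided approach would aim to reduce.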
Understanding Performance Inefficiencies In Native And Managed Languages
Production software packages have become increasingly complex with millions of lines of code, sophisticated control and data flow, and references to a hierarchy of external libraries. This complexity often introduces performance inefficiencies across software stacks, making it practically impossible for users to pinpoint them manually. Performance profiling tools (a.k.a. profilers) abound in the tools community to aid software developers in understanding program behavior. Classical profiling techniques focus on identifying hotspots. The hotspot analysis is indispensable; however, it can hardly diagnose whether a resource is being used in a productive manner that contributes to the overall efficiency of a program. Consequently, a significant burden is on developers to make a judgment call on whether there is scope to optimize a hotspot. Derived metrics, e.g., cache miss ratio, offer slightly better intuition into hotspots but are still not panaceas. Hence, there is a need for profilers that investigate resource wastage instead of usage. To overcome the critical missing pieces in prior work and complement existing profilers, we propose novel fine- and coarse-grained profilers to pinpoint varieties of performance inefficiencies and provide optimization guidance for a wide range of software covering benchmarks, enterprise applications, and large-scale parallel applications running on supercomputers and data centers. Fine-grained profilers are indispensable to understand performance inefficiencies comprehensively. We propose a whole-program profiler called LoadSpy, which works on binary executables to detect and quantify wasteful memory operations in their context and scope. 
Our observation, which is justified by myriad case studies, is that wasteful memory operations are often an indicator of various forms of performance inefficiencies, such as suboptimal choices of algorithms or data structures, missed compiler optimizations, and developers’ inattention to performance. Guided by LoadSpy, we are able to optimize a large number of well-known benchmarks and real-world applications, yielding significant speedups. Despite deep performance insights offered by fine-grained profilers, the high overhead keeps them away from widespread adoption, particularly in production. By contrast, coarse-grained profilers introduce low overhead at the cost of poor performance insights. Hence, another research topic is how we benefit from both, that is, the combination of deep insights of fine-grained profilers and low overhead of coarse-grained ones. The first effort to do so is proposing a lightweight profiler called JXPerf. It abandons heavyweight instrumentation by combining hardware performance monitoring units and debug registers available in commodity CPUs to detect wasteful memory operations. Compared with LoadSpy, JXPerf reduces the runtime overhead from 10x to 7% on average. The lightweight nature makes it useful in production. Another effort is proposing a lightweight profiler called FVSampler, the first nonintrusive profiler to study function execution variance
Evaluating techniques for parallelization tuning in MPI, OmpSs and MPI/OmpSs
Parallel programming is used to partition a computational problem among multiple processing units and to define how they interact (communicate and synchronize) in order to guarantee the correct result. The performance achieved when executing the parallel program on a parallel architecture is usually far from optimal: computational imbalance and excessive interaction among processing units often cause lost cycles, reducing the efficiency of parallel computation.
In this thesis we propose techniques oriented to better exploit parallelism in parallel applications, with emphasis on techniques that increase asynchronism. Theoretically, this type of parallelization tuning promises multiple benefits. First, it should mitigate communication and synchronization delays, thus increasing the overall performance. Furthermore, parallelization tuning should expose additional parallelism and therefore increase the scalability of execution. Finally, increased asynchronism would provide higher tolerance to slower networks and external noise.
In the first part of this thesis, we study the potential for tuning MPI parallelism. More specifically, we explore automatic techniques to overlap communication and computation. We propose a speculative messaging technique that increases the overlap and requires no changes of the original MPI application. Our technique automatically identifies the application’s MPI activity and reinterprets that activity using optimally placed non-blocking MPI requests. We demonstrate that this overlapping technique increases the asynchronism of MPI messages, maximizing the overlap, and consequently leading to execution speedup and higher tolerance to bandwidth reduction. However, in the case of realistic scientific workloads, we show that the overlapping potential is significantly limited by the pattern by which each MPI process locally operates on MPI messages.
In the second part of this thesis, we study the potential for tuning hybrid MPI/OmpSs parallelism. We try to gain a better understanding of the parallelism of hybrid MPI/OmpSs applications in order to evaluate how these applications would execute on future machines and to predict the execution bottlenecks that are likely to emerge. We explore how MPI/OmpSs applications could scale on the parallel machine with hundreds of cores per node. Furthermore, we investigate how this high parallelism within each node would reflect on the network constraints. We especially focus on identifying critical code sections in MPI/OmpSs. We devised a technique that quickly evaluates, for a given MPI/OmpSs application and the selected target machine, which code section should be optimized in order to gain the highest performance benefits.
Also, this thesis studies techniques to quickly explore the potential OmpSs parallelism inherent in applications. We provide mechanisms to easily evaluate potential parallelism of any task decomposition. Furthermore, we describe an iterative trial-and-error approach to search for a task decomposition that will expose sufficient parallelism for a given target machine.
Finally, we explore the potential of automating the iterative approach by capturing the programmers' experience in an expert system that can autonomously lead the search process. Also, throughout the work on this thesis, we designed development tools that can be useful to other researchers in the field. The most advanced of these tools is Tareador, a tool to help port MPI applications to the MPI/OmpSs programming model. Tareador provides a simple interface to propose some decomposition of a code into OmpSs tasks. Tareador dynamically calculates data dependencies among the annotated tasks, and automatically estimates the potential OmpSs parallelization. Furthermore, Tareador gives additional hints on how to complete the process of porting the application to OmpSs. Tareador has already proved itself useful by being included in the academic classes on parallel programming at UPC.
Parallel programming consists of dividing a computational problem among multiple processing units and defining how they interact (communication and synchronization) to guarantee a correct result. The performance of a parallel program is usually far from optimal: computational load imbalance and excessive interaction among processing units often cause lost cycles, reducing the efficiency of parallel computation.
In this thesis we propose techniques aimed at better exploiting parallelism in parallel applications, with emphasis on techniques that increase asynchronism. In theory, these techniques promise multiple benefits. First, they should mitigate communication and synchronization delays and thus increase overall performance. In addition, parallelization tuning should expose additional parallelism, increasing the scalability of execution. Finally, increased asynchronism would provide greater tolerance to slow communication networks and external noise.
In the first part of the thesis, we study the potential for tuning parallelism through MPI. Specifically, we explore automatic techniques for overlapping communication with computation. We propose a speculative messaging technique that increases the overlap and requires no changes to the original MPI application. Our technique automatically identifies the application's MPI activity and reinterprets it using optimally placed non-blocking MPI requests. We demonstrate that this technique maximizes the overlap and, consequently, speeds up execution and allows greater tolerance to bandwidth reductions. Even so, for realistic scientific workloads, we show that the overlapping potential is significantly limited by the pattern by which each MPI process operates locally on message passing.
In the second part of this thesis, we explore the potential for tuning hybrid MPI/OmpSs parallelism. We try to gain a better understanding of the parallelism of hybrid MPI/OmpSs applications in order to evaluate how they would execute on future machines. We explore how MPI/OmpSs applications can scale on a parallel machine with hundreds of cores per node. In addition, we investigate how this per-node parallelism would be reflected in the constraints of the communication network. In particular, we concentrate on identifying critical code sections in MPI/OmpSs. We have devised a technique that quickly evaluates, for a given MPI/OmpSs application and the selected target machine, which code section should be optimized to obtain the greatest performance gain. We also study techniques to quickly explore the potential OmpSs parallelism inherent in applications. We provide mechanisms to easily evaluate the potential parallelism of any task decomposition. In addition, we describe an iterative approach to search for a task decomposition that will expose sufficient parallelism on the given target machine. Finally, we explore the potential for automating the iterative approach.
In the work presented in this thesis we have designed tools that can be useful to other researchers in this field. The most advanced is Tareador, a tool to help migrate applications to the MPI/OmpSs programming model. Tareador provides a simple interface for proposing a decomposition of the code into OmpSs tasks. Tareador also dynamically calculates the data dependencies among the annotated tasks and automatically estimates the OmpSs parallelization potential. Lastly, Tareador gives additional hints on how to complete the process of migrating to OmpSs. Tareador has already proven useful by being included in the programming classes at UPC
Using Performance Attributes for Managing Heterogeneous Memory in HPC Applications
The complexity of memory systems has increased considerably over the past decade. Supercomputers may now include several levels of heterogeneous and non-uniform memory, with significantly different properties in terms of performance, capacity, persistence, etc. Developers of scientific applications face a huge challenge: efficiently exploit the memory system to improve performance, but keep productivity high by using portable solutions. In this work, we present a new API and a method to manage the complexity of modern memory systems. Our portable and abstracted API is designed to identify memory kinds and describe hardware characteristics using metrics, for example bandwidth, latency and capacity. It allows runtime systems, parallel libraries, and scientific applications to select the appropriate memory by expressing their needs for each allocation without having to modify the code for each platform. Furthermore, we present a survey of existing ways to determine the sensitivity of application buffers using static code analysis, profiling and benchmarking. We show in a use case that combining these approaches with our API indeed enables a portable and productive method to match application requirements and hardware memory characteristics
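The idea of selecting memory by expressing a need per allocation can be sketched as follows. This is a hypothetical illustration only and not the API presented in the paper. The `MemoryKind` type, the `select_memory` function, and every number in `KINDS` are assumptions made up for the example, not measurements.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class MemoryKind:
    """Hypothetical description of one memory kind by its attributes."""
    name: str
    bandwidth_gbs: float  # higher is better
    latency_ns: float     # lower is better
    capacity_gb: float


def select_memory(kinds: Iterable[MemoryKind], size_gb: float,
                  prefer: str) -> Optional[MemoryKind]:
    """Pick the best kind for one allocation, given the attribute it cares about."""
    # Only kinds that can hold the buffer are candidates.
    candidates = [k for k in kinds if k.capacity_gb >= size_gb]
    if not candidates:
        return None
    if prefer == "bandwidth":
        return max(candidates, key=lambda k: k.bandwidth_gbs)
    if prefer == "latency":
        return min(candidates, key=lambda k: k.latency_ns)
    if prefer == "capacity":
        return max(candidates, key=lambda k: k.capacity_gb)
    raise ValueError(f"unknown attribute: {prefer}")


# Illustrative values only.
KINDS = [
    MemoryKind("HBM", bandwidth_gbs=400.0, latency_ns=120.0, capacity_gb=16.0),
    MemoryKind("DRAM", bandwidth_gbs=100.0, latency_ns=90.0, capacity_gb=192.0),
    MemoryKind("NVM", bandwidth_gbs=30.0, latency_ns=300.0, capacity_gb=1024.0),
]

if __name__ == "__main__":
    print(select_memory(KINDS, 8.0, "bandwidth").name)
    print(select_memory(KINDS, 64.0, "bandwidth").name)
```

The caller states only what the buffer is sensitive to, so the same request can resolve to a different memory kind on another platform without code changes.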