90 research outputs found

    Dynamic tuning of parallel programs

    Performance is one of the main reasons for developing parallel programs. Designing and programming a parallel application is a hard task: it requires the knowledge to detect performance bottlenecks and to make the corresponding changes to the application's source code to eliminate them. Current approaches to this analysis require a certain level of expertise on the developer's part in locating and understanding the performance details of the application's execution. For these reasons, we present KappaPi, an automatic performance analysis tool intended to relieve developers of this hard task. The most important limitation of the KappaPi approach is the large amount of gathered information needed for the analysis. For this reason, we present a dynamic tuning system that takes measurements of the execution on-line. This new design focuses on improving the performance of parallel programs at runtime. I Workshop de Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI)

    10381 Summary and Abstracts Collection -- Robust Query Processing

    Dagstuhl seminar 10381 on robust query processing (held 19.09.10 - 24.09.10) brought together a diverse set of researchers and practitioners with a broad range of expertise to foster discussion and collaboration on the causes of, opportunities for, and solutions to achieving robust query processing. The seminar strove to build a unified view across the loosely coupled system components responsible for the various stages of database query processing. Participants were chosen for their experience with database query processing and, where possible, their prior work in academic research or in product development towards robustness in database query processing. To pave the way to motivate, measure, and protect future advances in robust query processing, seminar 10381 focused on developing tests for measuring the robustness of query processing. In these proceedings, we first review the seminar topics, goals, and results, then present abstracts or notes of some of the seminar break-out sessions. We also include, as an appendix, the robust query processing reading list that was collected and distributed to participants before the seminar began, as well as summaries of a few of those papers, contributed by some of the participants.

    ScalAna: Automating Scaling Loss Detection with Graph Analysis

    Scaling a parallel program to modern supercomputers is challenging due to inter-process communication, Amdahl's law, and resource contention. Performance analysis tools for finding such scaling bottlenecks are based on either profiling or tracing. Profiling incurs low overhead but does not capture the detailed dependencies needed for root-cause analysis. Tracing collects all information, but at prohibitive overhead. In this work, we design ScalAna, which uses static analysis techniques to achieve the best of both worlds: it enables the analyzability of traces at a cost similar to profiling. ScalAna first leverages static compiler techniques to build a Program Structure Graph, which records the main computation and communication patterns as well as the program's control structures. At runtime, we adopt lightweight techniques to collect performance data according to the graph structure and generate a Program Performance Graph. With this graph, we propose a novel approach, called backtracking root cause detection, which can automatically and efficiently detect the root cause of scaling loss. We evaluate ScalAna with real applications. Results show that our approach can effectively locate the root cause of scaling loss for real applications and incurs 1.73% overhead on average for up to 2,048 processes. We achieve up to 11.11% performance improvement by fixing the root causes detected by ScalAna on 2,048 processes.
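The idea of walking backward through a performance graph to reach the origin of a scaling loss can be illustrated with a small sketch. This is not ScalAna's algorithm, which the abstract does not specify. The `Vertex` class, the `loss` metric (extra seconds relative to ideal scaling), the threshold, and the greedy "follow the worst predecessor" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """One node of a hypothetical performance graph."""
    name: str
    loss: float                                 # assumed metric: extra seconds vs. ideal scaling
    preds: list = field(default_factory=list)   # vertices this one waits on


def backtrack(start, threshold):
    """Follow dependence edges backward from `start`.

    At each step, move to the not-yet-visited predecessor with the largest
    loss above `threshold`. Stop when no predecessor qualifies. The last
    vertex on the returned path is the suspected root cause.
    """
    path = [start]
    seen = {start.name}
    current = start
    while True:
        candidates = [p for p in current.preds
                      if p.name not in seen and p.loss > threshold]
        if not candidates:
            return path
        current = max(candidates, key=lambda v: v.loss)
        seen.add(current.name)
        path.append(current)


# Hypothetical graph: a collective that waits on a slow compute loop.
read_input = Vertex("read_input", 0.1)
compute_loop = Vertex("compute_loop", 2.5, [read_input])
halo_exchange = Vertex("halo_exchange", 0.2)
allreduce = Vertex("MPI_Allreduce", 3.0, [compute_loop, halo_exchange])

path = backtrack(allreduce, threshold=0.5)
print(" <- ".join(v.name for v in path))
```

In this toy graph the waiting time observed in the collective is attributed to the compute loop it depends on, which is the kind of conclusion a profile of the collective alone would not give.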

    Dynamic multi-resource monitoring for predictive job scheduling.

    Standard job schedulers rely either on the user's estimate or on a few approaches that use performance databases to keep information about job runtimes and predict future runs. Co-scheduling for improved resource utilization, however, requires more detailed information about behavior on multiple resources in order to predict slowdowns. Thus, information about communication, I/O, and computation at the application level is needed, but it is hard for the user to estimate. Furthermore, dynamic adaptive resource allocation requires information about the different processes on different machine nodes. We present an intelligent monitoring tool, ScoPro, which provides such information. To make monitoring more feasible, ScoPro harnesses dynamic instrumentation techniques, which postpone the insertion of instrumentation code until the application is executing. To keep intrusion low, we limit monitoring to short test phases. (Abstract shortened by UMI.) Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2005 .L586. Source: Masters Abstracts International, Volume: 44-03, page: 1407. Thesis (M.Sc.)--University of Windsor (Canada), 2005.

    Understanding Performance Inefficiencies In Native And Managed Languages

    Production software packages have become increasingly complex with millions of lines of code, sophisticated control and data flow, and references to a hierarchy of external libraries. This complexity often introduces performance inefficiencies across software stacks, making it practically impossible for users to pinpoint them manually. Performance profiling tools (a.k.a. profilers) abound in the tools community to aid software developers in understanding program behavior. Classical profiling techniques focus on identifying hotspots. The hotspot analysis is indispensable; however, it can hardly diagnose whether a resource is being used in a productive manner that contributes to the overall efficiency of a program. Consequently, a significant burden is on developers to make a judgment call on whether there is scope to optimize a hotspot. Derived metrics, e.g., cache miss ratio, offer slightly better intuition into hotspots but are still not panaceas. Hence, there is a need for profilers that investigate resource wastage instead of usage. To overcome the critical missing pieces in prior work and complement existing profilers, we propose novel fine- and coarse-grained profilers to pinpoint varieties of performance inefficiencies and provide optimization guidance for a wide range of software covering benchmarks, enterprise applications, and large-scale parallel applications running on supercomputers and data centers. Fine-grained profilers are indispensable to understand performance inefficiencies comprehensively. We propose a whole-program profiler called LoadSpy, which works on binary executables to detect and quantify wasteful memory operations in their context and scope. 
Our observation, which is justified by myriad case studies, is that wasteful memory operations are often an indicator of various forms of performance inefficiencies, such as suboptimal choices of algorithms or data structures, missed compiler optimizations, and developers’ inattention to performance. Guided by LoadSpy, we are able to optimize a large number of well-known benchmarks and real-world applications, yielding significant speedups. Despite deep performance insights offered by fine-grained profilers, the high overhead keeps them away from widespread adoption, particularly in production. By contrast, coarse-grained profilers introduce low overhead at the cost of poor performance insights. Hence, another research topic is how we benefit from both, that is, the combination of deep insights of fine-grained profilers and low overhead of coarse-grained ones. The first effort to do so is proposing a lightweight profiler called JXPerf. It abandons heavyweight instrumentation by combining hardware performance monitoring units and debug registers available in commodity CPUs to detect wasteful memory operations. Compared with LoadSpy, JXPerf reduces the runtime overhead from 10x to 7% on average. The lightweight nature makes it useful in production. Another effort is proposing a lightweight profiler called FVSampler, the first nonintrusive profiler to study function execution variance
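A small sketch can make "wasteful memory operations" concrete. The abstract does not define the term, so this example assumes one plausible definition: a load is redundant when it reads the same value from the same address as the previous load of that address. The trace format and the function name `count_redundant_loads` are hypothetical, and a real binary-level profiler such as the one described would gather this information very differently.

```python
def count_redundant_loads(trace):
    """Count loads that return the value already seen at the same address.

    `trace` is an iterable of (address, value) pairs, one per load, in
    program order. Returns (redundant_loads, total_loads).
    """
    last_value = {}   # address -> value returned by the previous load
    redundant = 0
    total = 0
    for address, value in trace:
        total += 1
        if address in last_value and last_value[address] == value:
            redundant += 1
        last_value[address] = value
    return redundant, total


# Hypothetical trace: address 0x10 is re-read twice without changing.
trace = [(0x10, 7), (0x18, 3), (0x10, 7), (0x10, 7), (0x18, 4)]
redundant, total = count_redundant_loads(trace)
print(f"{redundant}/{total} loads were redundant")
```

A high redundant fraction at one code location would be a hint, under this assumed definition, that a value could be kept in a register or that a computation is being repeated.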

    Evaluating techniques for parallelization tuning in MPI, OmpSs and MPI/OmpSs

    Parallel programming is used to partition a computational problem among multiple processing units and to define how they interact (communicate and synchronize) in order to guarantee the correct result. The performance achieved when executing the parallel program on a parallel architecture is usually far from optimal: computational imbalance and excessive interaction among processing units often cause lost cycles, reducing the efficiency of parallel computation. In this thesis we propose techniques oriented to better exploit parallelism in parallel applications, with an emphasis on techniques that increase asynchronism. Theoretically, this type of parallelization tuning promises multiple benefits. First, it should mitigate communication and synchronization delays, thus increasing the overall performance. Furthermore, parallelization tuning should expose additional parallelism and therefore increase the scalability of execution. Finally, increased asynchronism would provide higher tolerance to slower networks and external noise. In the first part of this thesis, we study the potential for tuning MPI parallelism. More specifically, we explore automatic techniques to overlap communication and computation. We propose a speculative messaging technique that increases the overlap and requires no changes to the original MPI application. Our technique automatically identifies the application's MPI activity and reinterprets that activity using optimally placed non-blocking MPI requests. We demonstrate that this overlapping technique increases the asynchronism of MPI messages, maximizing the overlap, and consequently leading to execution speedup and higher tolerance to bandwidth reduction. However, in the case of realistic scientific workloads, we show that the overlapping potential is significantly limited by the pattern by which each MPI process locally operates on MPI messages. 
In the second part of this thesis, we study the potential for tuning hybrid MPI/OmpSs parallelism. We try to gain a better understanding of the parallelism of hybrid MPI/OmpSs applications in order to evaluate how these applications would execute on future machines and to predict the execution bottlenecks that are likely to emerge. We explore how MPI/OmpSs applications could scale on a parallel machine with hundreds of cores per node. Furthermore, we investigate how this high parallelism within each node would reflect on the network constraints. We especially focus on identifying critical code sections in MPI/OmpSs. We devised a technique that quickly evaluates, for a given MPI/OmpSs application and the selected target machine, which code section should be optimized in order to gain the highest performance benefits. This thesis also studies techniques to quickly explore the potential OmpSs parallelism inherent in applications. We provide mechanisms to easily evaluate the potential parallelism of any task decomposition. Furthermore, we describe an iterative trial-and-error approach to search for a task decomposition that will expose sufficient parallelism for a given target machine. Finally, we explore the potential of automating the iterative approach by capturing the programmers' experience in an expert system that can autonomously lead the search process. Throughout the work on this thesis, we also designed development tools that can be useful to other researchers in the field. The most advanced of these tools is Tareador, a tool to help port MPI applications to the MPI/OmpSs programming model. Tareador provides a simple interface to propose a decomposition of a code into OmpSs tasks. Tareador dynamically calculates data dependencies among the annotated tasks, and automatically estimates the potential OmpSs parallelization. Furthermore, Tareador gives additional hints on how to complete the process of porting the application to OmpSs. 
Tareador already proved itself useful, by being included in the academic classes on parallel programming at UPC.
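Evaluating the potential parallelism of a task decomposition, as the thesis abstract above describes, can be sketched with the standard work/critical-path ratio. This is not Tareador's method, which the abstract does not detail. The sketch assumes the task costs and dependencies are already known and acyclic, and the task names and the function `potential_parallelism` are hypothetical.

```python
from functools import lru_cache


def potential_parallelism(cost, deps):
    """Estimate parallelism as total work divided by critical-path length.

    `cost` maps each task to its duration. `deps` maps a task to the list
    of tasks it depends on (tasks without an entry have no dependencies).
    The dependency graph is assumed to be acyclic.
    """
    @lru_cache(maxsize=None)
    def finish(task):
        # Earliest finish time with unlimited cores: the task's own cost
        # plus the latest finish among its dependencies.
        return cost[task] + max((finish(d) for d in deps.get(task, [])),
                                default=0.0)

    total_work = sum(cost.values())
    critical_path = max(finish(t) for t in cost)
    return total_work / critical_path


# Hypothetical decomposition: three independent tasks between a setup
# task and a final reduction.
cost = {"init": 1.0, "a": 4.0, "b": 4.0, "c": 4.0, "reduce": 1.0}
deps = {"a": ["init"], "b": ["init"], "c": ["init"],
        "reduce": ["a", "b", "c"]}

print(f"potential parallelism: {potential_parallelism(cost, deps):.2f}")
```

A ratio well below the core count of the target machine suggests the decomposition is too coarse, which is the signal an iterative trial-and-error search over decompositions would act on.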

    Using Performance Attributes for Managing Heterogeneous Memory in HPC Applications

    The complexity of memory systems has increased considerably over the past decade. Supercomputers may now include several levels of heterogeneous and non-uniform memory, with significantly different properties in terms of performance, capacity, persistence, etc. Developers of scientific applications face a huge challenge: efficiently exploiting the memory system to improve performance while keeping productivity high by using portable solutions. In this work, we present a new API and a method to manage the complexity of modern memory systems. Our portable and abstracted API is designed to identify memory kinds and describe hardware characteristics using metrics such as bandwidth, latency, and capacity. It allows runtime systems, parallel libraries, and scientific applications to select the appropriate memory by expressing their needs for each allocation, without having to modify the code for each platform. Furthermore, we present a survey of existing ways to determine the sensitivity of application buffers using static code analysis, profiling, and benchmarking. We show in a use case that combining these approaches with our API indeed enables a portable and productive method to match application requirements and hardware memory characteristics.
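The selection step described in this abstract, where an allocation states its need and the memory kind is chosen from hardware metrics, can be sketched as follows. This is not the API the paper presents. The `MemoryKind` class, the `select_memory` function, the three memory kinds, and all the numbers are made-up assumptions used only to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryKind:
    """Hypothetical description of one memory kind by its metrics."""
    name: str
    bandwidth_gbs: float   # illustrative value, not a measurement
    latency_ns: float      # illustrative value, not a measurement
    capacity_gib: float    # illustrative value, not a measurement


def select_memory(kinds, size_gib, prefer):
    """Pick the best memory kind that can hold `size_gib`.

    `prefer` names the metric the buffer is sensitive to:
    "bandwidth", "latency", or "capacity".
    """
    fits = [k for k in kinds if k.capacity_gib >= size_gib]
    if not fits:
        raise MemoryError(f"no memory kind can hold {size_gib} GiB")
    if prefer == "bandwidth":
        return max(fits, key=lambda k: k.bandwidth_gbs)
    if prefer == "latency":
        return min(fits, key=lambda k: k.latency_ns)
    if prefer == "capacity":
        return max(fits, key=lambda k: k.capacity_gib)
    raise ValueError(f"unknown preference: {prefer!r}")


# Made-up platform description.
kinds = [
    MemoryKind("HBM", bandwidth_gbs=400.0, latency_ns=120.0, capacity_gib=16.0),
    MemoryKind("DRAM", bandwidth_gbs=100.0, latency_ns=90.0, capacity_gib=192.0),
    MemoryKind("NVM", bandwidth_gbs=30.0, latency_ns=300.0, capacity_gib=1024.0),
]

print(select_memory(kinds, size_gib=8, prefer="bandwidth").name)
print(select_memory(kinds, size_gib=64, prefer="bandwidth").name)
print(select_memory(kinds, size_gib=8, prefer="latency").name)
```

Because the caller states only the buffer size and the metric it cares about, the same call can be reused on a platform with a different list of memory kinds, which is the portability argument the abstract makes.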