17 research outputs found

    Intelligent instrumentation techniques to improve the traces information-volume ratio

    With ever more powerful machines being constantly deployed, it is crucial to manage computational resources efficiently. This matters both to the individual user, who expects fast results, and to the supercomputing center hosting the infrastructure, which is interested in maximizing its overall productivity. Nevertheless, the real sustained performance achieved by applications can be significantly lower than the theoretical peak performance of the machines. A key factor in bridging this performance gap is understanding how parallel computers behave. Performance analysis tools are essential not only to understand the behavior of parallel applications, but also to identify why performance expectations might not have been met, serving as guidelines for correcting the inefficiencies that caused poor performance and driving both software and hardware optimizations. However, detailed analysis of the behavior of a parallel application requires processing a large amount of data that also grows extremely fast. Current large-scale systems already comprise hundreds of thousands of cores, and upcoming exascale systems are expected to assemble more than a million processing elements. With such a number of hardware components, the traditional analysis methodologies, which consist of blindly collecting as much data as possible and then performing exhaustive lookups, are no longer applicable, because the volume of performance data generated becomes unmanageable to store, process and analyze. The evolution of the tools suggests that more complex approaches are needed, incorporating intelligence to perform the challenging and important task of detailed analysis competently. In this thesis, we address the problem of scalability of performance analysis tools in large-scale systems.
    In such scenarios, in-depth understanding of the interactions between all the system components is more compelling than ever for an effective use of the parallel resources. To this end, our work includes a thorough review of techniques that have been successfully applied to aid in the task of Big Data Analytics in fields like machine learning, data mining, signal processing and computer vision. We have leveraged these techniques to improve the analysis of large-scale parallel applications by automatically uncovering repetitive patterns, finding data correlations, detecting performance trends and extracting further useful analysis information. By combining them, we have minimized the volume of performance data captured from an execution while maximizing the benefit and insight gained from this data, and have proposed new and more effective methodologies for single and multi-experiment performance analysis.

    Experiences on the characterization of parallel applications in embedded systems with Extrae/Paraver

    Cutting-edge functionalities in embedded systems require the use of parallel architectures to meet their performance requirements. This introduces a new layer into the software stacks of embedded systems: the parallel programming model. Unfortunately, the tools used to analyze embedded systems fall short of characterizing the performance of parallel applications at the parallel programming model level and of correlating it with information about non-functional requirements such as real-time behavior, energy, memory usage, etc. HPC tools, like Extrae, are designed with that level of abstraction in mind, but their main focus is on performance evaluation. Overall, providing insightful information about the performance of parallel embedded applications at the parallel programming model level, and relating it to the non-functional requirements, is of paramount importance to fully exploit the performance capabilities of parallel embedded architectures. This paper contributes to the state of the art of analysis tools for embedded systems by: (1) analyzing the particular constraints of embedded systems compared to HPC systems (e.g., static setting, restricted memory, limited drivers) to support HPC analysis tools; (2) porting Extrae, a powerful tracing tool from the HPC domain, to the GR740 platform, a SoC used in the space domain; and (3) augmenting Extrae with new features needed to correlate the parallel execution with the following non-functional requirements: energy, temperature and memory usage. Finally, the paper shows the usefulness of Extrae for characterizing OpenMP applications and their non-functional requirements, evaluating different aspects of the applications running on the GR740. This work has been partially funded from the HP4S (High Performance Parallel Payload Processing for Space) project under the ESA-ESTEC ITI contract No. 4000124124/18/NL/CRS. Peer Reviewed. Postprint (author's final draft).

    Performance analysis of parallel Python applications

    Python is progressively consolidating itself within the HPC community with its simple syntax, large standard library, and powerful third-party libraries for scientific computing that are especially attractive to domain scientists. Although Python lowers the bar for accessing parallel computing, utilizing the capacities of HPC systems efficiently remains a challenging task. Yet, at the moment only a few supporting tools exist, and they provide merely basic information in the form of summarized profile data. In this paper, we present our efforts in developing event-based tracing support for Python within the performance monitor Extrae to provide detailed information and enable a profound performance analysis. We present concepts to record the complete communication behavior as well as to capture entry and exit of functions in Python to provide the corresponding application context. We evaluate our implementation in Extrae by analyzing the well-established electronic structure simulation package GPAW and demonstrate that the recorded traces provide information equivalent to that of traditional C or Fortran applications and therefore offer the same profound analysis capabilities for Python as well. Peer Reviewed. Postprint (published version).

    Detailed performance analysis using coarse grain sampling

    Performance evaluation tools enable analysts to shed light on how applications behave both from a general point of view and at concrete execution points, but cannot provide detailed information beyond the monitored regions of code. Having the ability to determine when and which data has to be collected is crucial for a successful analysis. This is particularly true for trace-based tools, which can easily incur either unmanageably large traces or a shortage of information. In order to mitigate the well-known resolution vs. usability trade-off, we present a procedure that obtains fine-grain performance information using coarse-grain sampling, projecting performance metrics scattered all over the execution into thoroughly detailed representative areas. This mechanism has been incorporated into the MPItrace tracing suite, greatly extending the amount of performance information gathered from statically instrumented points with further periodic samples collected beyond them. We have applied this solution to the analysis of two applications to introduce a novel performance analysis methodology based on the combination of instrumentation and sampling techniques. Peer Reviewed.
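
The projection of scattered samples into one representative area can be illustrated with a minimal sketch. It is not the MPItrace implementation: the function name `fold`, the tuple layouts and the sample data are assumptions made for illustration. Each instance of a repeated region is assumed to be delimited by instrumentation that records a start time, an end time and a cumulative counter value at both ends. Each periodic sample is rescaled to its relative position inside the instance it falls in.

```python
from bisect import bisect_right

def fold(instances, samples):
    """Project samples from many instances of a region onto one synthetic instance.

    instances: sorted, non-overlapping (t_start, t_end, c_start, c_end) tuples,
               assumed to come from instrumentation probes (hypothetical layout).
    samples:   (timestamp, cumulative_counter) tuples from periodic sampling.
    Returns (relative_time, relative_counter) pairs, both in [0, 1], sorted by time.
    """
    starts = [inst[0] for inst in instances]
    folded = []
    for t, c in samples:
        i = bisect_right(starts, t) - 1          # last instance starting at or before t
        if i < 0:
            continue                             # sample precedes every instance
        t0, t1, c0, c1 = instances[i]
        if t > t1 or t1 == t0 or c1 == c0:
            continue                             # outside the instance, or degenerate
        folded.append(((t - t0) / (t1 - t0), (c - c0) / (c1 - c0)))
    folded.sort()
    return folded

# Hypothetical data: three 10-unit instances, sampled only once every 7 units.
instances = [(0, 10, 0, 100), (10, 20, 100, 200), (20, 30, 200, 300)]
samples = [(7, 70), (14, 140), (21, 210), (28, 280)]
folded = fold(instances, samples)
```

With a sampling period close to the instance length, each instance receives only one or two samples, but the folded list holds four points spread over the synthetic instance. This is the sense in which coarse sampling can yield a finer-grained picture when the region repeats.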

    On the usefulness of object tracking techniques in performance analysis

    Understanding the behavior of a parallel application is crucial if we are to tune it to achieve its maximum performance. Yet the behavior the application exhibits may change over time and depend on the actual execution scenario: particular inputs and program settings, the number of processes used, or hardware-specific problems. So beyond the details of a single experiment a far more interesting question arises: how does the application behavior respond to changes in the execution conditions? In this paper, we demonstrate that object tracking concepts from computer vision have huge potential to be applied in the context of performance analysis. We leverage tracking techniques to analyze how the behavior of a parallel application evolves through multiple scenarios where the execution conditions change. This method provides comprehensible insights on the influence of different parameters on the application behavior, enabling us to identify the most relevant code regions and their performance trends. Copyright 2013 ACM. Peer Reviewed. Postprint (published version).
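
One way to picture tracking across experiments is a nearest-neighbour link between behavior groups found in two runs. The sketch below is an assumption-laden illustration, not the method of the paper: the function name `track`, the use of 2-D centroids in a normalized metric space, the greedy matching and the `max_shift` threshold are all hypothetical.

```python
import math

def track(prev, curr, max_shift):
    """Greedily link each group in `prev` to its nearest unused group in `curr`.

    prev, curr: dicts mapping a group name to a centroid (tuple of floats) in a
                normalized metric space (hypothetical representation).
    max_shift:  largest centroid displacement still accepted as the same group.
    """
    links = {}
    used = set()
    for name, p in prev.items():
        candidates = [(math.dist(p, c), cname)
                      for cname, c in curr.items() if cname not in used]
        if not candidates:
            continue
        distance, best = min(candidates)
        if distance <= max_shift:
            links[name] = best
            used.add(best)
    return links

# Hypothetical centroids from the same application run with 64 and 128 processes.
run_64 = {"A": (0.20, 0.80), "B": (0.70, 0.40)}
run_128 = {"X": (0.72, 0.35), "Y": (0.22, 0.70)}
links = track(run_64, run_128, max_shift=0.2)
```

Here group A is linked to Y and group B to X, so the displacement of each linked pair can be read as a performance trend between the two scenarios. A greedy match is the simplest choice and can mislink groups that cross paths.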

    Identifying code phases using piece-wise linear regressions

    Node-level performance is one of the factors that may prevent applications from reaching the supercomputers' peak performance. Studying node-level performance and attributing it to the source code yields valuable insight that can be used to improve application efficiency, although performing such a study may be an intimidating task due to the complexity and size of the applications. We present in this paper a mechanism that combines piece-wise linear regressions, coarse-grain sampling, and minimal instrumentation to detect performance phases in the computation regions even if their granularity is very fine. This mechanism then maps the performance of each phase onto the application's syntactical structure, displaying a correlation between performance and source code. We introduce a methodology on top of this mechanism to describe the node-level performance of parallel applications, even for applications seen for the first time. Finally, we demonstrate the methodology by describing optimized in-production applications and further improving their performance through small code transformations based on the hints discovered. © 2014 IEEE. Peer Reviewed. Postprint (published version).
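
The use of piece-wise linear regression for phase detection can be sketched on a toy example. The abstract does not specify the fitting procedure, so the two-segment exhaustive breakpoint search below is a simplification chosen for brevity. The helper names `linfit` and `best_breakpoint` and the data are hypothetical. The input is assumed to be an accumulated counter over time, so that each fitted slope corresponds to a rate, and a change of slope marks a phase boundary.

```python
def linfit(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, sum of squared errors)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse

def best_breakpoint(xs, ys, min_pts=2):
    """Try every split index and keep the one with the lowest total squared error.

    Returns (split_index, left_slope, right_slope), or None if there are too few points.
    """
    best = None
    for k in range(min_pts, len(xs) - min_pts + 1):
        left = linfit(xs[:k], ys[:k])
        right = linfit(xs[k:], ys[k:])
        total = left[2] + right[2]
        if best is None or total < best[0]:
            best = (total, k, left[0], right[0])
    return None if best is None else best[1:]

# Hypothetical accumulated counter: rate 1 for the first five points, rate 3 afterwards.
xs = list(range(10))
ys = [0, 1, 2, 3, 4, 8, 11, 14, 17, 20]
split, rate_a, rate_b = best_breakpoint(xs, ys)
```

On this data the search places the boundary at index 5 and recovers the two rates, 1 and 3. A real trace would need more than two segments and some tolerance to noise, which this sketch does not attempt.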

    On-line detection of large-scale parallel application's structure

    With larger and larger systems being constantly deployed, trace-based performance analysis of parallel applications has become a daunting task. Even if the amount of performance data gathered per single process is small, traces rapidly become unmanageable when merging together the information collected from all processes. In general, an efficient analysis of such a large volume of data is subject to a previous filtering step that directs the analyst's attention towards what is meaningful to understand the observed application behavior. Furthermore, the iterative nature of most scientific applications usually ends up producing repetitive information. Discarding irrelevant data aims at reducing both the size of traces and the time required to perform the analysis and deliver results. In this paper, we present an on-line analysis framework that relies on clustering techniques to intelligently select the most relevant information to understand how the application behaves, while keeping the trace volume at a reasonable size. Peer Reviewed. Postprint (published version).
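
How clustering can limit trace volume may be illustrated with a small sketch. The abstract does not name the clustering algorithm, so a greedy leader clustering is swapped in here purely for brevity. The function names, the two metrics per computation burst and the `eps` radius are hypothetical. Similar bursts are grouped, and only a fixed number of bursts per group is kept.

```python
import math

def leader_clusters(points, eps):
    """Greedy leader clustering: join the first leader within `eps`, else become a leader."""
    leaders = []
    labels = []
    for p in points:
        for idx, q in enumerate(leaders):
            if math.dist(p, q) <= eps:
                labels.append(idx)
                break
        else:
            leaders.append(p)                # no leader close enough: start a new cluster
            labels.append(len(leaders) - 1)
    return labels

def keep_representatives(labels, per_cluster=1):
    """Indices of the first `per_cluster` bursts seen in each cluster."""
    seen = {}
    kept = []
    for i, label in enumerate(labels):
        seen[label] = seen.get(label, 0) + 1
        if seen[label] <= per_cluster:
            kept.append(i)
    return kept

# Hypothetical bursts described by (normalized duration, normalized IPC).
bursts = [(0.10, 0.90), (0.80, 0.30), (0.11, 0.88),
          (0.79, 0.31), (0.12, 0.91), (0.81, 0.29)]
labels = leader_clusters(bursts, eps=0.1)
kept = keep_representatives(labels, per_cluster=1)
```

The six bursts fall into two groups and only the first burst of each is retained, which mirrors the idea that iterative applications produce repetitive information that need not all be stored.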

    Housing in England 1998/99 Report of the 1998/99 Survey of English Housing

    Jointly published with the National Assembly for Wales. SIGLE. Available from British Library Document Supply Centre, DSC:4335.152005(no 6) / BLDSC - British Library Document Supply Centre. GB. United Kingdom.

    Experimental and statistical studies of x-band transhorizon radio links over the sea

    SIGLE. Available from British Library Document Supply Centre, DSC:DX94770 / BLDSC - British Library Document Supply Centre. GB. United Kingdom.

    Unveiling internal evolution of parallel application computation phases

    As access to supercomputing resources becomes more and more commonplace, performance analysis tools are gaining importance as a means to narrow the gap between application performance and the supercomputers' peak performance. Performance analysis tools allow the analyst to understand the idiosyncrasies of an application in order to improve it. However, these tools require monitoring regions of the application to provide information to the analysts, leaving non-monitored regions of code unknown, which may result in a lack of understanding of important regions of the application. In this paper we describe an automated methodology that reports very detailed application insights and improves the analysis experience of trace-based performance tools. We apply this methodology to three production applications and provide suggestions on how to improve their performance. Our methodology uses computation burst clustering and a mechanism called folding. While clustering automatically detects application structure, folding combines instrumentation and sampling to augment the performance analysis details. Folding provides fine-grain performance information from coarse-grain sampling on iterative applications. Folding results closely resemble the performance data gathered from fine-grain sampling, with an absolute mean difference of less than 5%, without the overhead of fine-grain sampling. Peer Reviewed.