
    Fake Run-Time Selection of Template Arguments in C++

    C++ does not support run-time resolution of template type arguments. To circumvent this restriction, we can instantiate a template for all possible combinations of type arguments at compile time and then select the proper instance at run time by evaluating some provided conditions. However, for templates with multiple type parameters such a solution may easily result in branching-code bloat. We present a template metaprogramming algorithm called for_id that allows the user to select the proper template instance at run time with the theoretical minimum sustained complexity of the branching code. (Comment: Objects, Models, Components, Patterns, 50th International Conference, TOOLS 2012.)
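
    As a rough illustration of the dispatch problem the paper addresses, the sketch below (C++17) instantiates a small kernel for every combination of two type lists at compile time and selects one instance from a flat table at run time. The kernel, the type lists, and the integer-id selectors are illustrative assumptions; the paper's for_id algorithm generates the branching code itself rather than a lookup table, so this only shows the naive end of the design space.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

// Kernel parametrized by two type arguments (stand-in for real work).
template <typename A, typename B>
void kernel() { /* type-specific computation */ }

// Simple compile-time type list.
template <typename... Ts>
struct type_list {};

using list_a = type_list<int, float, double>;
using list_b = type_list<short, long>;

// One row of the dispatch table: kernel<A, B> for every B in the list.
template <typename A, typename... Bs>
constexpr std::array<void (*)(), sizeof...(Bs)> make_row(type_list<Bs...>) {
    return {{ &kernel<A, Bs>... }};
}

// Full table: one row per A in the list.
template <typename... As, typename BList>
constexpr auto make_table(type_list<As...>, BList b) {
    return std::array{ make_row<As>(b)... };
}

// Run-time selection: two integer ids pick the pre-instantiated kernel.
inline void dispatch(std::size_t ia, std::size_t ib) {
    static constexpr auto table = make_table(list_a{}, list_b{});
    if (ia >= table.size() || ib >= table[0].size())
        throw std::out_of_range("unknown type id");
    table[ia][ib]();
}

int main() { dispatch(1, 0); }  // calls kernel<float, short>
```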

    Development of a performance analysis environment for parallel pattern-based applications

    One of the challenges the scientific community faces nowadays is the parallel processing of data. Every day we produce ever more overwhelming amounts of data, to the point where data volumes grow exponentially and can no longer be processed as we would like. This processing matters mainly because of the need in science to make new discoveries, or simply to find new algorithms capable of solving experiments that are ever more complex and whose resolution was unfeasible years ago with the resources then available. As a consequence, the internal architecture of computers has changed in order to increase their computing capacity and thus cope with the need for massive data processing. The scientific community has therefore implemented different pattern-based parallel programming frameworks in order to run experiments in a faster and more efficient way. Unfortunately, using these programming paradigms is not a simple task, since it requires expertise and programming skill. This is further complicated when developers are not aware of the program's internal behaviour, which leads to unexpected results in certain parts of the code. Inevitably, the need arises for a set of tools that help this community analyze the performance and results of their experiments. Hence, this bachelor thesis presents the development of a performance analysis environment for parallel pattern-based applications as a solution to that problem. Specifically, this environment combines two commonly used techniques, profiling and tracing, which have been added to the GrPPI framework. In this way, users can obtain a general assessment of their applications' performance and act according to the results obtained.
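
    To make the profiling idea concrete, here is a minimal, generic sketch of a timing hook wrapped around a data-parallel stage. It uses only the standard library and is not GrPPI's actual interface; the region name and the crude two-thread stage are assumptions for illustration only.

```cpp
#include <chrono>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Minimal RAII profiling hook: records wall-clock time spent in a named region.
class scoped_timer {
public:
    explicit scoped_timer(std::string name)
        : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}
    ~scoped_timer() {
        auto end = std::chrono::steady_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start_).count();
        std::printf("[profile] %s: %lld us\n", name_.c_str(), static_cast<long long>(us));
    }
private:
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};

int main() {
    std::vector<int> data(1'000'000, 1);
    long long sum = 0;
    {
        scoped_timer timer("map-reduce stage");  // profile one pipeline stage
        long long partial[2] = {0, 0};
        std::vector<std::thread> workers;
        // Crude data-parallel stage: each thread sums one half of the input.
        for (int w = 0; w < 2; ++w)
            workers.emplace_back([&, w] {
                std::size_t lo = w * data.size() / 2, hi = (w + 1) * data.size() / 2;
                for (std::size_t i = lo; i < hi; ++i) partial[w] += data[i];
            });
        for (auto& t : workers) t.join();
        sum = partial[0] + partial[1];
    }
    std::printf("sum = %lld\n", sum);
}
```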

    Performance engineering of data-intensive applications

    Data-intensive programs deal with big chunks of data and often contain compute-intensive characteristics. Among various HPC application domains, big data analytics, machine learning and the more recent deep-learning models are well-known data-intensive applications. An efficient design of such applications demands extensive knowledge of the target hardware and software, particularly the memory/cache hierarchy and the data communication among threads/processes. Such a requirement makes code development an arduous task, as inappropriate data structures and algorithm design may result in superfluous runtime, let alone hardware incompatibilities while porting the code to other platforms. In this dissertation, we introduce a set of tools and methods for the performance engineering of parallel data-intensive programs. We start with performance profiling to gain insights on thread communications and relevant code optimizations. Then, narrowing our scope to deep-learning applications, we introduce our tools for enhancing the performance portability and scalability of convolutional neural networks (ConvNets) in the inference and training phases. Our first contribution is a novel performance-profiling method to unveil potential communication bottlenecks caused by data-access patterns and thread interactions. Our findings show that the data shared between a pair of threads should be reused within a reasonably short interval to preserve data locality, yet existing profilers neglect such intervals and mainly report the communication volume. We propose new hardware-independent metrics to characterize thread communication and provide suggestions for applying appropriate optimizations to a specific code region. Our experiments show that applying the relevant optimizations improves the performance of the Rodinia benchmarks by up to 56%. For the next contribution, we developed a framework for the automatic generation of efficient and performance-portable convolution kernels, including Winograd convolutions, for various GPU platforms. We employed a synergy of meta-programming, symbolic execution, and auto-tuning. The results demonstrate efficient kernels generated through an automated optimization pipeline with runtimes close to those of vendor deep-learning libraries, and the minimal programming effort required confirms the performance portability of our approach. Furthermore, our symbolic execution method exploits repetitive patterns in Winograd convolutions, enabling us to reduce the number of arithmetic operations by up to 62% without compromising numerical stability. Lastly, we investigate methods to scale the performance of ConvNets in the training and inference phases. Our specialized training platform, equipped with a novel topology-aware network pruning algorithm, enables rapid training, neural architecture search, and network compression. Thus, AI model training can easily be scaled to a multitude of compute nodes, leading to faster model design with lower operating costs. Furthermore, the network compression component scales a ConvNet model down by removing redundant layers, preparing the model for a leaner deployment. Altogether, this work demonstrates the necessity and the benefit of performance engineering and parallel programming methods in accelerating emerging data-intensive workloads. With the help of the proposed tools and techniques, we pinpoint data communication bottlenecks and achieve performance portability and scalability in data-intensive applications.
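
    As a toy illustration of the kind of hardware-independent communication metric described above, the sketch below records, per 64-byte block, which thread touched it last and how many accesses ago, so that inter-thread reuse intervals can be reported per thread pair alongside plain communication volume. The class names, block granularity, and reporting format are assumptions, not the thesis' actual profiler.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>
#include <unordered_map>
#include <utility>

// Toy communication profiler: remembers the last thread and access time per
// block, and accumulates events plus reuse intervals per thread pair.
class comm_profiler {
public:
    void record(int thread, std::uintptr_t addr) {
        const std::uint64_t now = ++clock_;
        auto& st = blocks_[addr >> 6];                // 64-byte block granularity
        if (st.last_thread >= 0 && st.last_thread != thread) {
            const std::pair<int, int> key = std::minmax(st.last_thread, thread);
            auto& e = pairs_[key];
            ++e.events;                                // communication volume
            e.total_interval += now - st.last_access;  // inter-thread reuse interval
        }
        st.last_thread = thread;
        st.last_access = now;
    }

    void report() const {
        for (const auto& [pair, e] : pairs_)
            std::printf("threads (%d,%d): %llu events, avg reuse interval %.1f\n",
                        pair.first, pair.second,
                        static_cast<unsigned long long>(e.events),
                        static_cast<double>(e.total_interval) / static_cast<double>(e.events));
    }

private:
    struct block_state { int last_thread = -1; std::uint64_t last_access = 0; };
    struct pair_stats  { std::uint64_t events = 0, total_interval = 0; };

    std::uint64_t clock_ = 0;
    std::unordered_map<std::uintptr_t, block_state> blocks_;
    std::map<std::pair<int, int>, pair_stats> pairs_;
};

int main() {
    comm_profiler p;
    int shared = 0;
    p.record(0, reinterpret_cast<std::uintptr_t>(&shared));  // thread 0 writes
    p.record(1, reinterpret_cast<std::uintptr_t>(&shared));  // thread 1 reads soon after
    p.report();
}
```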

    Generic Reloading for Languages Based on the Truffle Framework

    Reloading running programs is a well-researched and increasingly popular feature of programming language implementations. There are plenty of proposed solutions for various existing programming languages, but typically these solutions target a specific language and are not reusable. In this thesis, we explored how the Truffle language implementation framework could aid language creators in adding reloading capabilities to their languages. We created a reusable reloading core that different Truffle-based languages can hook into to support dynamic updates with a minimum amount of effort on their part. We demonstrate how the Truffle implementations of Python, Ruby and JavaScript can be made reloadable with the developed solution. With Truffle's just-in-time compiler enabled, our solution incurs close to zero overhead on steady-state performance. This approach significantly reduces the effort required to add dynamic-update support for existing and future languages.

    From Valid Measurements to Valid Mini-Apps

    In high-performance computing, performance analysis, tuning, and exploration are relevant throughout the life cycle of an application. State-of-the-art tools provide capable measurement infrastructure, but they lack automation of repetitive tasks, such as iterative measurement-overhead reduction, or tool support for challenging and time-consuming tasks, e.g., mini-app creation. In this thesis, we address this situation with (a) a comparative study on the overheads introduced by different tools, (b) the tool PIRA for automatic instrumentation refinement, and (c) a tool-supported approach for mini-app extraction. In particular, we present PIRA for automatic iterative performance measurement refinement. It performs whole-program analysis using both source-code and runtime information to heuristically determine where measurement hooks should be placed in the target application for a low-overhead assessment. At the moment, PIRA offers a runtime heuristic to identify compute-intensive parts, a performance-model heuristic to identify scalability limitations, and a load-imbalance detection heuristic. In our experiments, PIRA, compared to Score-P's built-in filtering, significantly reduces the runtime overhead in 13 out of 15 benchmark cases and typically introduces a slowdown of less than 10%. To provide PIRA with the required infrastructure, we develop MetaCG, an extendable lightweight whole-program call-graph library for C/C++. The library offers a compiler-agnostic call-graph (CG) representation, a Clang-based tool to construct a target's CG, and a tool to validate the MetaCG structure. In addition to its use in PIRA, we show that whole-program CG analysis reduces the number of allocations to track by the memory tracking sanitizer TypeART by up to a factor of 2,350. Finally, we combine the presented tools and develop a tool-supported approach to (a) identify and (b) extract relevant application regions into representative mini-apps. To this end, we present a novel Clang-based source-to-source translator and a type-safe checkpoint-restart (CPR) interface as a common interface to existing MPI-parallel CPR libraries. We evaluate the approach by extracting a mini-app of only 1,100 lines of code from an 8.5-million-line application. The mini-app is subsequently analyzed and maintains the significant characteristics of the original application's behavior. It is then used for tool-supported parallelization, which leads to a speed-up of 35%. The software presented in this thesis is available at https://github.com/tudasc.
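
    The thesis' CPR interface is not reproduced here; as a hypothetical illustration of what a type-safe checkpoint-restart facade can look like, the sketch below registers trivially copyable variables by name and saves or restores them through plain file I/O. The class name, the file format, and the trivially-copyable restriction are assumptions for the example; the actual interface targets existing MPI-parallel CPR libraries as back ends.

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical type-safe checkpoint/restart facade.
class checkpoint {
public:
    explicit checkpoint(std::string path) : path_(std::move(path)) {}

    // Register a variable under a name; the static_assert is what makes the
    // interface "type-safe" in this sketch.
    template <typename T>
    void add(const std::string& name, T& value) {
        static_assert(std::is_trivially_copyable_v<T>,
                      "only trivially copyable state is checkpointed here");
        fields_.push_back({name, reinterpret_cast<char*>(&value), sizeof(T)});
    }

    void save() const {
        std::ofstream out(path_, std::ios::binary);
        for (const auto& f : fields_) out.write(f.data, f.size);
    }

    bool restore() const {
        std::ifstream in(path_, std::ios::binary);
        if (!in) return false;                         // no checkpoint yet
        for (const auto& f : fields_) in.read(f.data, f.size);
        return static_cast<bool>(in);
    }

private:
    struct field { std::string name; char* data; std::size_t size; };
    std::string path_;
    std::vector<field> fields_;
};

int main() {
    double t = 0.0;
    int step = 0;
    checkpoint cp("state.ckpt");
    cp.add("time", t);
    cp.add("step", step);
    if (!cp.restore()) { t = 0.0; step = 0; }          // cold start
    for (; step < 10; ++step, t += 0.1) { /* simulated work */ }
    cp.save();                                         // checkpoint after the loop
    std::printf("t=%.1f step=%d\n", t, step);
}
```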

    Programming Persistent Memory

    Beginning and experienced programmers will use this comprehensive guide to persistent memory programming. You will understand how persistent memory brings together several new software/hardware requirements and offers great promise for better performance and faster application startup times, a huge leap forward in byte-addressable capacity compared with current DRAM offerings. This revolutionary new technology gives applications significant performance and capacity improvements over existing technologies. It requires a new way of thinking and developing, which makes it highly disruptive to the IT/computing industry. The full spectrum of industry sectors that will benefit from this technology includes, but is not limited to, in-memory and traditional databases, AI, analytics, HPC, virtualization, and big data. Programming Persistent Memory describes the technology and why it is exciting the industry. It covers the operating system and hardware requirements as well as how to create development environments using emulated or real persistent memory hardware. The book explains fundamental concepts; provides an introduction to persistent memory programming APIs for C, C++, JavaScript, and other languages; discusses RDMA with persistent memory; reviews security features; and presents many examples. Source code and examples that you can run on your own systems are included.
    What You'll Learn:
    - Understand what persistent memory is, what it does, and the value it brings to the industry
    - Become familiar with the operating system and hardware requirements to use persistent memory
    - Know the fundamentals of persistent memory programming: why it is different from current programming methods, and what developers need to keep in mind when programming for persistence
    - Look at persistent memory application development by example using the Persistent Memory Development Kit (PMDK)
    - Design and optimize data structures for persistent memory
    - Study how real-world applications are modified to leverage persistent memory
    - Utilize the tools available for persistent memory programming, application performance profiling, and debugging
    Who This Book Is For: C, C++, Java, and Python developers, but also software, cloud, and hardware architects across a broad spectrum of sectors, including cloud service providers, independent software vendors, high performance compute, artificial intelligence, data analytics, big data, etc.
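
    The book's examples are built around the Persistent Memory Development Kit (PMDK). As a small taste of its lowest-level library, the sketch below uses libpmem to map a file on a persistent-memory-aware file system, write a string, and flush it to the persistence domain. The mount path and pool size are placeholders, and error handling is kept to a minimum.

```cpp
#include <libpmem.h>
#include <cstdio>
#include <cstring>

int main() {
    size_t mapped_len = 0;
    int is_pmem = 0;
    constexpr size_t pool_size = 4u << 20;  // 4 MiB backing file

    // Map (and create if needed) a file on a DAX-capable file system.
    auto* addr = static_cast<char*>(pmem_map_file(
        "/mnt/pmem/hello", pool_size, PMEM_FILE_CREATE, 0666,
        &mapped_len, &is_pmem));
    if (addr == nullptr) {
        std::perror("pmem_map_file");
        return 1;
    }

    std::strcpy(addr, "hello persistent memory");

    if (is_pmem)
        pmem_persist(addr, std::strlen(addr) + 1);  // flush CPU caches to pmem
    else
        pmem_msync(addr, std::strlen(addr) + 1);    // fallback for non-pmem mappings

    pmem_unmap(addr, mapped_len);
    return 0;
}
```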