6 research outputs found

    Increasing SpMV Energy Efficiency Through Compression: A study of how format, input and platform properties affect the energy efficiency of Compressed Sparse eXtended

    This work is a continuation and augmentation of previous energy studies of Compressed Sparse eXtended (CSX), a framework for efficiently executing Sparse Matrix-Vector Multiplication (SpMV). CSX was developed by the CSLab at the National Technical University of Athens (NTUA) and utilizes compression to overcome a significant memory bottleneck inherent in SpMV, thus increasing the performance and energy efficiency of its execution. SpMV is notorious within scientific computing for its low performance. However, the problem is unavoidable, as SpMV appears in several scientific applications. In this work, CSX is tested as the SpMV kernel in a framework implementing the Conjugate Gradient Method (CG), an iterative algorithm for solving specific linear algebra problems. CSX is also evaluated against Compressed Sparse Row (CSR), a storage scheme widely used when executing SpMV. This work augments existing studies by evaluating properties of the formats themselves, of the matrices used as input and of the target platform, to learn how to maximize the benefits of CSX, as well as in which cases CSX does not prove beneficial. The work also compares the performance of SpMV execution on a stand-alone server known as the CARD-server to similar execution on the Vilje supercomputer, to evaluate how the differences between these two machines affect the results. Based on the results, it is shown that CSX should be used for matrices larger than the Last Level Cache (LLC) of the target machine and for matrices with high degrees of clustering in their values. The best energy-efficiency trade-offs are found at eight threads on dual-socket configurations, and this is shown to be related to the number of physical cores per CPU. Similarly, frequency throttling is shown to increase the energy efficiency of the execution only at high thread counts and at the cost of performance. Overall, CSX is shown to obtain higher energy efficiency than CSR for SpMV execution, given a suitable problem and run configuration. Thus, it is confirmed that CSX can be used to decrease the energy consumption of SpMV applications.
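    For context (not from the thesis itself), the memory bottleneck discussed above is easy to see in a plain CSR SpMV kernel: every nonzero value and its column index must be streamed from memory for a single multiply-add. A minimal C sketch of such a kernel, with assumed array-naming conventions, is shown below; CSX's contribution is to compress exactly this index/value stream.

        /* Minimal CSR sparse matrix-vector multiply, y = A*x.
         * Illustrative only: CSX replaces the explicit per-nonzero
         * column-index stream below with compressed substructure
         * descriptions to cut memory traffic. */
        #include <stddef.h>

        void spmv_csr(size_t nrows,
                      const size_t *row_ptr,  /* nrows+1 offsets into values/col_idx */
                      const size_t *col_idx,  /* column index of each nonzero */
                      const double *values,   /* nonzero values, row by row */
                      const double *x,        /* dense input vector */
                      double *y)              /* dense output vector */
        {
            for (size_t i = 0; i < nrows; i++) {
                double sum = 0.0;
                /* One pass over the nonzeros of row i: each value and index
                 * is read exactly once, so the kernel is bound by memory
                 * bandwidth rather than by arithmetic. */
                for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                    sum += values[k] * x[col_idx[k]];
                y[i] = sum;
            }
        }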

    PCL - The Performance Counter Library: A Common Interface to Access Hardware Performance Counters on Microprocessors

    A performance counter is the part of a microprocessor that measures and gathers performance-relevant events on the microprocessor. The number and type of available events differ significantly between existing microprocessors, because there is no commonly accepted specification and because each manufacturer sets different priorities when analyzing the performance of architectures and programs. Looking at the supported events on the different microprocessors, it can be observed that the functionality of these events differs from the requirements of an expert application programmer or a performance tool writer. PCL, the Performance Counter Library, establishes a common platform for performance measurements on a wide range of computer systems. With a common interface on all systems and a set of application-oriented events defined, the application programmer is able to do program optimization in a portable way, and the performance tool writer is able to rely on a common interface on different systems. PCL has functions to query the functionality, to start and stop counters, and to read the values of counters. PCL supports nested calls to PCL functions, thus allowing hierarchical performance measurements. Counting may be done either in system or in user mode. All interface functions are callable from C, C++, and Fortran.
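    The query/start/stop/read workflow described above maps naturally onto a small C program. The sketch below is not taken from the PCL documentation: the identifiers (PCLquery, PCLstart, PCLstop, the PCL_CYCLES and PCL_INSTR events, the PCL_MODE_USER mode and the PCL_SUCCESS return code) follow the naming the abstract implies, but their exact signatures are assumptions here and should be checked against the library's header before use.

        /* Assumed sketch of PCL's counter workflow: query, start, work, stop.
         * Function signatures and constants are NOT verified against pcl.h. */
        #include <stdio.h>

        #define NEVENTS 2

        extern void do_work(void);   /* the code region being measured (assumed) */

        int main(void)
        {
            int events[NEVENTS] = { PCL_CYCLES, PCL_INSTR };   /* assumed event names */
            long long icount[NEVENTS];   /* integer-valued results */
            double fcount[NEVENTS];      /* floating-point results (e.g. rates) */

            /* 1. Ask whether this machine can count the requested events in user mode. */
            if (PCLquery(events, NEVENTS, PCL_MODE_USER) != PCL_SUCCESS)
                return 1;

            /* 2. Start counting. */
            PCLstart(events, NEVENTS, PCL_MODE_USER);

            do_work();

            /* 3. Stop counting and collect both result arrays. */
            PCLstop(icount, fcount, NEVENTS);

            printf("cycles=%lld instructions=%lld\n", icount[0], icount[1]);
            return 0;
        }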

    Tracing of software benefiting from graphics acceleration

    SUMMARY: In programming, recent architectural changes such as multi-core processors have made the synchronization of the tasks executed on them more complex to analyze. To address this, tracing tools such as LTTng were implemented with the goal of providing process-analysis tooling while keeping in mind the challenges posed by multi-core systems. A second revolution in computing, graphics accelerators, then created another need for tracing. Graphics accelerator manufacturers provided analysis tools for their devices, which make it possible to analyze the execution of commands on the accelerators. This thesis addresses the lack of a unified tracing tool covering both the host system (the central processing unit (CPU)) and the execution of OpenCL compute kernels on the device (the graphics accelerator (GPU)). By unified, we mean the ability of a tracing tool to collect a kernel trace of the host on which a graphics accelerator is present, in addition to the execution trace of the accelerator itself. The main initial objective of this thesis was defined as follows: provide a tracing tool and the analysis methods that allow traces of the graphics accelerator and of the central processor to be acquired simultaneously. Beyond this main objective, secondary objectives added criteria on performance and on the visualization of the traces recorded by the solution presented here. The research background explored made it possible to establish initial hypotheses: that the Common Trace Format (CTF) appeared to allow traces to be recorded with low overhead, and that previous work would allow synchronization between the different time domains of the CPU and the GPU. The solution presented, OpenCL User Space Tracepoint (CLUST), is a library that replaces the symbols of the OpenCL GPGPU computing library. To use it, it must be loaded dynamically before launching the program to be traced. It then instruments all OpenCL functions with LTTng-UST tracepoints, making it possible to record the calls and to handle the asynchronous events that are common on GPUs. Since the performance of the library was among the initial objectives, an analysis of its different use cases demonstrates its low overhead: for workloads of reasonable size, an overhead of between 0.5% and 2% was measured. This result opens the door to several use cases. Indeed, given its low overhead, CLUST is not only a tool for acquiring traces to help program development; it can also serve as a permanent recorder in critical systems. LTTng's "flight recorder" mode writes a trace to disk only when required: adding data about the state of the GPU can prove to be a valuable asset when diagnosing a problem on a production server, all without significantly slowing down the system.
    ----------
    ABSTRACT: In the world of computing, programmers now have to face the complex challenges that multi-core processors have brought.
    To address this problem, tracing frameworks such as LTTng were implemented to provide tools to analyze multi-core systems without adding a major overhead to the system. Recently, Graphical Processing Units (GPUs) started a new revolution: General Purpose Graphical Processing Unit (GPGPU) computing. This allows programs to offload their parallel computation sections to the ultra-parallel architecture that GPUs offer. Unfortunately, the tracing tools provided by the GPU manufacturers did not interoperate with CPU tracing. We propose a solution, OpenCL User Space Tracepoint (CLUST), that enables tracing of OpenCL GPGPU computing as an extension to the LTTng kernel tracer. This unifies the CPU trace and the GPU trace in one efficient format that enables advanced trace viewing and analysis, includes both models in the analysis, and therefore provides more information to the programmer. The objectives of this thesis are to provide a low-overhead unified CPU-GPU tracing extension of LTTng, the algorithms required to perform trace domain synchronization between the CPU and GPU time source domains, and a visualization model for the unified traces. As foundation work, we determined that already existing GPU tracing techniques could integrate well with LTTng, and that previously presented trace synchronization algorithms could be used to synchronize the CPU trace with the GPU trace. We then demonstrate the low-overhead characteristics of the CLUST tracing library for typical applications under different use cases. The unified CPU-GPU tracing overhead is also measured to be insignificant (less than 2%) for a typical GPGPU application. Moreover, we use synchronization methods to determine the trace domain synchronization value between both traces. This solution is a more complete and robust implementation that provides the programmer with required tools that were never before implemented, in the hope of helping programmers develop more efficient OpenCL applications.
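    As a rough illustration (not CLUST's actual source code), the library-interposition pattern described above can be sketched in C: a shared object loaded with LD_PRELOAD redefines an OpenCL symbol, looks up the real entry point with dlsym(RTLD_NEXT, ...), and surrounds the call with LTTng-UST tracepoints. The tracepoint provider and event names used here (clust, kernel_enqueue_entry/exit) are invented for the example, and a real build would also need the corresponding tracepoint provider definition.

        /* Sketch of OpenCL symbol interposition with LTTng-UST tracepoints.
         * Build as a shared library and load it with LD_PRELOAD before the
         * traced program. Provider/event names are illustrative only, and a
         * separate tracepoint provider header/source would be required. */
        #define _GNU_SOURCE
        #include <dlfcn.h>
        #include <CL/cl.h>
        #include <lttng/tracepoint.h>

        typedef cl_int (*enqueue_fn)(cl_command_queue, cl_kernel, cl_uint,
                                     const size_t *, const size_t *,
                                     const size_t *, cl_uint,
                                     const cl_event *, cl_event *);

        cl_int clEnqueueNDRangeKernel(cl_command_queue queue, cl_kernel kernel,
                                      cl_uint work_dim,
                                      const size_t *global_work_offset,
                                      const size_t *global_work_size,
                                      const size_t *local_work_size,
                                      cl_uint num_events_in_wait_list,
                                      const cl_event *event_wait_list,
                                      cl_event *event)
        {
            /* Resolve the real OpenCL entry point hidden behind this wrapper. */
            static enqueue_fn real_enqueue;
            if (!real_enqueue)
                real_enqueue = (enqueue_fn)dlsym(RTLD_NEXT, "clEnqueueNDRangeKernel");

            tracepoint(clust, kernel_enqueue_entry);   /* hypothetical UST event */
            cl_int ret = real_enqueue(queue, kernel, work_dim, global_work_offset,
                                      global_work_size, local_work_size,
                                      num_events_in_wait_list, event_wait_list,
                                      event);
            tracepoint(clust, kernel_enqueue_exit);    /* hypothetical UST event */
            return ret;
        }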

    Language Agnostic Software Energy Kernel Framework

    Software efficiency has suffered in recent times, and code quality and optimization are often an afterthought. There is also no standard operating-system support or unified tooling for gathering fine-grained energy-consumption data about source code. Existing tools tackle the problem by measuring the entire process or application as a whole, so localizing the responsible part of the source code is a blind endeavour. It is also time-consuming and expensive to address such efficiency concerns during the development phase. In addition, recent hardware leaps have made it possible for non-performant software to run relatively fast without much regard for code efficiency; the downside is that the hardware compensates for poor code quality by using far more resources, increasing energy usage. In this thesis, we take an energy-centric view of running applications and devise tooling to assist the software developer when choosing libraries, frameworks, programming languages and critical architecture designs. We propose a standard, unified way of gathering energy-consumption data from the operating-system kernel and present two solutions: a kernel energy module and associated energy-reading libraries. The objective is to introspect processes and applications without massively altering source code. The idea is to probe into source code and gather energy data for comparison against different implementations, creating awareness amongst software developers. The tooling is designed to be application- and programming-language-agnostic, so that it can infer runtime metrics without strong assumptions about the underlying software. This makes it possible to cover virtually any scenario and to compare software models across different versions, environments and systems. The thesis also performs extensive machine-learning tests using different libraries and synthetic datasets to shed light on ML experiments and their energy consumption. Together, these approaches let developers make informed decisions about which parts to prioritize for improvement and achieve greener software.
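    The thesis's kernel energy module and reading libraries are not reproduced here, but the underlying before/after sampling idea can be sketched on Linux. The example below assumes an Intel machine that exposes RAPL package energy through the powercap sysfs interface at /sys/class/powercap/intel-rapl:0/energy_uj; the path, the stand-in workload and the neglect of counter wrap-around are all simplifications for illustration.

        /* Rough illustration of region-level energy sampling on Linux using
         * the powercap/RAPL sysfs interface (Intel package domain). This is
         * not the thesis's kernel module; path and availability vary. */
        #include <stdio.h>
        #include <stdint.h>

        static uint64_t read_energy_uj(const char *path)
        {
            unsigned long long uj = 0;
            FILE *f = fopen(path, "r");
            if (f) {
                if (fscanf(f, "%llu", &uj) != 1)
                    uj = 0;
                fclose(f);
            }
            return (uint64_t)uj;   /* microjoules consumed so far */
        }

        int main(void)
        {
            const char *dom = "/sys/class/powercap/intel-rapl:0/energy_uj";

            uint64_t before = read_energy_uj(dom);

            /* Region of interest: stand-in workload for the code being compared. */
            volatile double acc = 0.0;
            for (long i = 0; i < 100000000L; i++)
                acc += i * 0.5;

            uint64_t after = read_energy_uj(dom);

            /* Counter wrap-around is ignored in this sketch. */
            printf("package energy over region: %.3f J\n", (after - before) / 1e6);
            return 0;
        }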

    Development of extensions for an open source performance monitoring tool

    Bachelor's thesis (Trabajo de Fin de Grado) in Computer Science Engineering, Facultad de Informática UCM, Departamento de Arquitectura de Computadores y Automática, academic year 2019/2020. PMCTrack is an open-source performance monitoring tool for GNU/Linux that allows application performance to be monitored using the processor's hardware counters (PMCs, Performance Monitoring Counters). The tool collects relevant metrics about the execution of an application, such as the number of instructions per cycle or the branch misprediction rate. PMCTrack also provides additional hardware monitoring information that is not accessible through PMCs, such as power consumption or precise values of the memory bandwidth consumed by an application. The flexibility of its API for collecting performance metrics from different components of the operating system (OS), and the fact that it is implemented as a kernel module (which allows the tool's functionality to be extended without restarting the OS), are among PMCTrack's most relevant advantages. Despite these advantages, PMCTrack currently requires a kernel patch to work, and it manages the hardware counters directly rather than using the standard Linux subsystem for this task (perf events). The main objective of this project is to take the first steps towards allowing PMCTrack to work on unmodified Linux kernels in the future. To this end, the project creates a PMCTrack backend that uses perf events (whose kernel API has little documentation) and adapts the various kernel-space components to new versions of Linux. To evaluate the new support added to PMCTrack, an exhaustive validation of the new backend was carried out and the overhead introduced when reading hardware counters was analyzed, comparing the different tools and mechanisms available.
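    The perf events subsystem that the new backend targets can also be driven directly from user space, which gives a feel for what a single counter read involves. The following minimal C sketch (not PMCTrack code; the backend works against the in-kernel side of the same subsystem) counts retired instructions over a code region with perf_event_open.

        /* Minimal userspace use of Linux perf_events: count retired
         * instructions over a region of code. Illustrative only. */
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/types.h>
        #include <sys/ioctl.h>
        #include <sys/syscall.h>
        #include <linux/perf_event.h>

        static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                    int cpu, int group_fd, unsigned long flags)
        {
            return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
        }

        int main(void)
        {
            struct perf_event_attr attr;
            memset(&attr, 0, sizeof(attr));
            attr.type = PERF_TYPE_HARDWARE;
            attr.size = sizeof(attr);
            attr.config = PERF_COUNT_HW_INSTRUCTIONS;
            attr.disabled = 1;         /* start stopped, enable explicitly */
            attr.exclude_kernel = 1;   /* user-mode counting only */

            int fd = (int)perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
            if (fd < 0) { perror("perf_event_open"); return 1; }

            ioctl(fd, PERF_EVENT_IOC_RESET, 0);
            ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

            /* Region of interest. */
            volatile long x = 0;
            for (long i = 0; i < 1000000; i++) x += i;

            ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

            long long count = 0;
            if (read(fd, &count, sizeof(count)) == sizeof(count))
                printf("instructions retired: %lld\n", count);

            close(fd);
            return 0;
        }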

    Intelligent systems for efficiency and security

    As computing becomes ubiquitous and personalized, resources like energy, storage and time are becoming increasingly scarce and, at the same time, computing systems must deliver in multiple dimensions, such as high performance, quality of service, reliability, security and low power. Building such computers is hard, particularly when the operating environment is becoming more dynamic and systems are becoming heterogeneous and distributed. Unfortunately, computers today manage resources with many ad hoc heuristics that are suboptimal, unsafe, and cannot be composed across the computer's subsystems. Continuing this approach has severe consequences: underperforming systems, resource waste, information loss, and even life endangerment. This dissertation develops computing systems which, through intelligent adaptation, deliver efficiency along multiple dimensions. The key idea is to manage computers with principled methods from formal control. With these methods, the multiple subsystems of a computer sense their environment and configure themselves to meet system-wide goals. To achieve the goal of intelligent systems, this dissertation makes a series of contributions, each building on the previous. First, it introduces the use of formal MIMO (Multiple Input Multiple Output) control for processors, to simultaneously optimize many goals like performance, power, and temperature. Second, it develops the Yukta control system, which uses coordinated formal controllers in different layers of the stack (hardware and operating system). Third, it uses robust control to develop Tangram, a fast, globally coordinated and decentralized control framework for heterogeneous computers. Finally, it presents Maya, a defense against power side-channel attacks that uses formal control to reshape the power dissipated by a computer, confusing the attacker. The ideas in the dissertation have been demonstrated successfully with several prototypes, including one built with AMD (Advanced Micro Devices, Inc.) engineers. These designs significantly outperformed the state of the art. The research in this dissertation brought formal control closer to computer architecture and has been well received in both domains. It includes the first application of full-fledged MIMO control for processors, the first use of robust control in computer systems, and the first application of formal control for side-channel defense. It makes a significant stride towards intelligent systems that are efficient, secure and reliable.
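    The dissertation's controllers are full MIMO and robust designs; as a much simpler illustration of the sense-decide-actuate loop that formal control imposes on a subsystem, the toy single-input single-output sketch below uses an integral controller to steer a measured rate toward a target by adjusting one knob. The measurement and actuation functions are placeholders, not interfaces from the dissertation.

        /* Toy discrete-time integral controller: one sensor, one knob.
         * The dissertation's controllers are MIMO and robust; this sketch
         * only illustrates the basic sense-decide-actuate loop. */
        #include <stdio.h>

        /* Placeholder plant: the measured rate responds linearly to the knob. */
        static double knob = 0.5;                    /* e.g. a normalized power cap */
        static double measure_rate(void) { return 40.0 * knob; }
        static void apply_knob(double k) { knob = k < 0 ? 0 : (k > 1 ? 1 : k); }

        int main(void)
        {
            const double target = 30.0;              /* desired rate (e.g. frames/s) */
            const double gain   = 0.01;              /* integral gain, tuned offline */

            for (int step = 0; step < 20; step++) {
                double error = target - measure_rate();   /* sense */
                apply_knob(knob + gain * error);          /* decide and actuate */
                printf("step %2d: knob=%.3f rate=%.2f\n", step, knob, measure_rate());
            }
            return 0;
        }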