228 research outputs found

    A system’s approach to cache hierarchy-aware decomposition of data-parallel computations

    Get PDF
    Dissertação para obtenção do Grau de Mestre em Engenharia InformáticaThe architecture of nowadays’ processors is very complex, comprising several computational cores and an intricate hierarchy of cache memories. The latter, in particular, differ considerably between the many processors currently available in the market, resulting in a wide variety of configurations. Application development is typically oblivious of this complexity and diversity, taking only into consideration the number of available execution cores. This oblivion prevents such applications from fully harnessing the computing power available in these architectures. This problem has been recognized by the community, which has proposed languages and models to express and tune applications according to the underlying machine’s hierarchy. These, however, lack the desired abstraction level, forcing the programmer to have deep knowledge of computer architecture and parallel programming, in order to ensure performance portability across a wide range of architectures. Realizing these limitations, the goal of this thesis is to delegate these hierarchy-aware optimizations to the runtime system. Accordingly, the programmer’s responsibilities are confined to the definition of procedures for decomposing an application’s domain, into an arbitrary number of partitions. With this, the programmer has only to reason about the application’s data representation and manipulation. We prototyped our proposal on top of a Java parallel programming framework, and evaluated it from a performance perspective, against cache neglectful domain decompositions. The results demonstrate that our optimizations deliver significant speedups against decomposition strategies based solely on the number of execution cores, without requiring the programmer to reason about the machine’s hardware. These facts allow us to conclude that it is possible to obtain performance gains by transferring hierarchyaware optimizations concerns to the runtime system

    A Distributed Block-Split Gibbs Sampler with Hypergraph Structure for High-Dimensional Inverse Problems

    Full text link
    Sampling-based algorithms are classical approaches to perform Bayesian inference in inverse problems. They provide estimators with the associated credibility intervals to quantify the uncertainty on the estimators. Although these methods hardly scale to high dimensional problems, they have recently been paired with optimization techniques, such as proximal and splitting approaches, to address this issue. Such approaches pave the way to distributed samplers, splitting computations to make inference more scalable and faster. We introduce a distributed Split Gibbs sampler (SGS) to efficiently solve such problems involving distributions with multiple smooth and non-smooth functions composed with linear operators. The proposed approach leverages a recent approximate augmentation technique reminiscent of primal-dual optimization methods. It is further combined with a block-coordinate approach to split the primal and dual variables into blocks, leading to a distributed block-coordinate SGS. The resulting algorithm exploits the hypergraph structure of the involved linear operators to efficiently distribute the variables over multiple workers under controlled communication costs. It accommodates several distributed architectures, such as the Single Program Multiple Data and client-server architectures. Experiments on a large image deblurring problem show the performance of the proposed approach to produce high quality estimates with credibility intervals in a small amount of time. Codes to reproduce the experiments are available online

    SMARTS: Exploiting Temporal through Vertical Execution

    Get PDF
    Abstract In the solution of large-scale numerical problems, parallel computing is becoming simultaneously more important and more dificult. The complex organization of today's multiprocessors with several memory hierarchies has forced the scientific programmer to make a choice between simple but unscalable code and scalable but extremely complex code that does not port to other architectures. This paper describes how the SMARTS runtime system and the POOMA C++ class library for high-performance scientific computing work together to exploit data parallelism in scientific applications while hiding the details of managing parallelism and data locality from the user. We present innovative algorithms, based on the macro-dataflow model, for detecting data parallelism and efficiently executing dataparallel statements on shared-memory multiprocessors. We also describe how these algorithms can be implemented on clusters of SMPs

    Exploiting Locality and Parallelism with Hierarchically Tiled Arrays

    Get PDF
    The importance of tiles or blocks in mathematics and thus computer science cannot be overstated. From a high level point of view, they are the natural way to express many algorithms, both in iterative and recursive forms. Tiles or sub-tiles are used as basic units in the algorithm description. From a low level point of view, tiling, either as the unit maintained by the algorithm, or as a class of data layouts, is one of the most effective ways to exploit locality, which is a must to achieve good performance in current computers given the growing gap between memory and processor speed. Finally, tiles and operations on them are also basic to express data distribution and parallelism. Despite the importance of this concept, which makes inevitable its widespread usage, most languages do not support it directly. Programmers have to understand and manage the low-level details along with the introduction of tiling. This gives place to bloated potentially error-prone programs in which opportunities for performance are lost. On the other hand, the disparity between the algorithm and the actual implementation enlarges. This thesis illustrates the power of Hierarchically Tiled Arrays (HTAs), a data type which enables the easy manipulation of tiles in object-oriented languages. The objective is to evolve this data type in order to make the representation of all classes for algorithms with a high degree of parallelism and/or locality as natural as possible. We show in the thesis a set of tile operations which leads to a natural and easy implementation of different algorithms in parallel and in sequential with higher clarity and smaller size. In particular, two new language constructs dynamic partitioning and overlapped tiling are discussed in detail. They are extensions of the HTA data type to improve its capabilities to express algorithms with a high abstraction and free programmers from programming tedious low-level tasks. To prove the claims, two popular languages, C++ and MATLAB are extended with our HTA data type. In addition, several important dense linear algebra kernels, stencil computation kernels, as well as some benchmarks in NAS benchmark suite were implemented. We show that the HTA codes needs less programming effort with a negligible effect on performance

    Programming Abstractions for Data Locality

    Get PDF
    The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component, we are rapidly moving to an era that computing is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor their applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality. Unfortunately current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on the hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume all the processing elements are equidistant to each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data centric and allow to describe how to decompose and how to layout data in the memory.Fortunately, there are many emerging concepts such as constructs for tiling, data layout, array views, task and thread affinity, and topology aware communication libraries for managing data locality. There is an opportunity to identify commonalities in strategy to enable us to combine the best of these concepts to develop a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, which includes techniques that range from template libraries all the way to completely new languages to achieve this goal

    Optimization of high-throughput real-time processes in physics reconstruction

    Get PDF
    La presente tesis se ha desarrollado en colaboración entre la Universidad de Sevilla y la Organización Europea para la Investigación Nuclear, CERN. El detector LHCb es uno de los cuatro grandes detectores situados en el Gran Colisionador de Hadrones, LHC. En LHCb, se colisionan partículas a altas energías para comprender la diferencia existente entre la materia y la antimateria. Debido a la cantidad ingente de datos generada por el detector, es necesario realizar un filtrado de datos en tiempo real, fundamentado en los conocimientos actuales recogidos en el Modelo Estándar de física de partículas. El filtrado, también conocido como High Level Trigger, deberá procesar un throughput de 40 Tb/s de datos, y realizar un filtrado de aproximadamente 1 000:1, reduciendo el throughput a unos 40 Gb/s de salida, que se almacenan para posterior análisis. El proceso del High Level Trigger se subdivide a su vez en dos etapas: High Level Trigger 1 (HLT1) y High Level Trigger 2 (HLT2). El HLT1 transcurre en tiempo real, y realiza una reducción de datos de aproximadamente 30:1. El HLT1 consiste en una serie de procesos software que reconstruyen lo que ha sucedido en la colisión de partículas. En la reconstrucción del HLT1 únicamente se analizan las trayectorias de las partículas producidas fruto de la colisión, en un problema conocido como reconstrucción de trazas, para dictaminar el interés de las colisiones. Por contra, el proceso HLT2 es más fino, requiriendo más tiempo en realizarse y reconstruyendo todos los subdetectores que componen LHCb. Hacia 2020, el detector LHCb, así como todos los componentes del sistema de adquisici´on de datos, serán actualizados acorde a los últimos desarrollos técnicos. Como parte del sistema de adquisición de datos, los servidores que procesan HLT1 y HLT2 también sufrirán una actualización. Al mismo tiempo, el acelerador LHC será también actualizado, de manera que la cantidad de datos generada en cada cruce de grupo de partículas aumentare en aproxidamente 5 veces la actual. Debido a las actualizaciones tanto del acelerador como del detector, se prevé que la cantidad de datos que deberá procesar el HLT en su totalidad sea unas 40 veces mayor a la actual. La previsión de la escalabilidad del software actual a 2020 subestim´ó los recursos necesarios para hacer frente al incremento en throughput. Esto produjo que se pusiera en marcha un estudio de todos los algoritmos tanto del HLT1 como del HLT2, así como una actualización del código a nuevos estándares, para mejorar su rendimiento y ser capaz de procesar la cantidad de datos esperada. En esta tesis, se exploran varios algoritmos de la reconstrucción de LHCb. El problema de reconstrucción de trazas se analiza en profundidad y se proponen nuevos algoritmos para su resolución. Ya que los problemas analizados exhiben un paralelismo masivo, estos algoritmos se implementan en lenguajes especializados para tarjetas gráficas modernas (GPUs), dada su arquitectura inherentemente paralela. En este trabajo se dise ˜nan dos algoritmos de reconstrucción de trazas. Además, se diseñan adicionalmente cuatro algoritmos de decodificación y un algoritmo de clustering, problemas también encontrados en el HLT1. Por otra parte, se diseña un algoritmo para el filtrado de Kalman, que puede ser utilizado en ambas etapas. Los algoritmos desarrollados cumplen con los requisitos esperados por la colaboración LHCb para el año 2020. Para poder ejecutar los algoritmos eficientemente en tarjetas gráficas, se desarrolla un framework especializado para GPUs, que permite la ejecución paralela de secuencias de reconstrucción en GPUs. Combinando los algoritmos desarrollados con el framework, se completa una secuencia de ejecución que asienta las bases para un HLT1 ejecutable en GPU. Durante la investigación llevada a cabo en esta tesis, y gracias a los desarrollos arriba mencionados y a la colaboración de un pequeño equipo de personas coordinado por el autor, se completa un HLT1 ejecutable en GPUs. El rendimiento obtenido en GPUs, producto de esta tesis, permite hacer frente al reto de ejecutar una secuencia de reconstrucción en tiempo real, bajo las condiciones actualizadas de LHCb previstas para 2020. As´ı mismo, se completa por primera vez para cualquier experimento del LHC un High Level Trigger que se ejecuta únicamente en GPUs. Finalmente, se detallan varias posibles configuraciones para incluir tarjetas gr´aficas en el sistema de adquisición de datos de LHCb.The current thesis has been developed in collaboration between Universidad de Sevilla and the European Organization for Nuclear Research, CERN. The LHCb detector is one of four big detectors placed alongside the Large Hadron Collider, LHC. In LHCb, particles are collided at high energies in order to understand the difference between matter and antimatter. Due to the massive quantity of data generated by the detector, it is necessary to filter data in real-time. The filtering, also known as High Level Trigger, processes a throughput of 40 Tb/s of data and performs a selection of approximately 1 000:1. The throughput is thus reduced to roughly 40 Gb/s of data output, which is then stored for posterior analysis. The High Level Trigger process is subdivided into two stages: High Level Trigger 1 (HLT1) and High Level Trigger 2 (HLT2). HLT1 occurs in real-time, and yields a reduction of data of approximately 30:1. HLT1 consists in a series of software processes that reconstruct particle collisions. The HLT1 reconstruction only analyzes the trajectories of particles produced at the collision, solving a problem known as track reconstruction, that determines whether the collision data is kept or discarded. In contrast, HLT2 is a finer process, which requires more time to execute and reconstructs all subdetectors composing LHCb. Towards 2020, the LHCb detector and all the components composing the data acquisition system will be upgraded. As part of the data acquisition system, the servers that process HLT1 and HLT2 will also be upgraded. In addition, the LHC accelerator will also be updated, increasing the data generated in every bunch crossing by roughly 5 times. Due to the accelerator and detector upgrades, the amount of data that the HLT will require to process is expected to increase by 40 times. The foreseen scalability of the software through 2020 underestimated the required resources to face the increase in data throughput. As a consequence, studies of all algorithms composing HLT1 and HLT2 and code modernizations were carried out, in order to obtain a better performance and increase the processing capability of the foreseen hardware resources in the upgrade. In this thesis, several algorithms of the LHCb recontruction are explored. The track reconstruction problem is analyzed in depth, and new algorithms are proposed. Since the analyzed problems are massively parallel, these algorithms are implemented in specialized languages for modern graphics cards (GPUs), due to their inherently parallel architecture. From this work stem two algorithm designs. Furthermore, four additional decoding algorithms and a clustering algorithms have been designed and implemented, which are also part of HLT1. Apart from that, an parallel Kalman filter algorithm has been designed and implemented, which can be used in both HLT stages. The developed algorithms satisfy the requirements of the LHCb collaboration for the LHCb upgrade. In order to execute the algorithms efficiently on GPUs, a software framework specialized for GPUs is developed, which allows executing GPU reconstruction sequences in parallel. Combining the developed algorithms with the framework, an execution sequence is completed as the foundations of a GPU HLT1. During the research carried out in this thesis, the aforementioned developments and a small group of collaborators coordinated by the author lead to the completion of a full GPU HLT1 sequence. The performance obtained on GPUs allows executing a reconstruction sequence in real-time, under LHCb upgrade conditions. The developed GPU HLT1 constitutes the first GPU high level trigger ever developed for an LHC experiment. Finally, various possible realizations of the GPU HLT1 to integrate in a production GPU-equipped data acquisition system are detailed
    • …
    corecore