27 research outputs found

    SHAPES : Easy and high-level memory layouts

    Get PDF
    CPU speeds have vastly exceeded those of RAM. As such, developers who aim to achieve high performance on modern architectures will most likely need to consider how to use CPU caches effectively, hence they will need to consider how to place data in memory so as to exploit spatial locality and achieve high memory bandwidth. Performing such manual memory optimisations usually sacrifices readability, maintainability, memory safety, and object abstraction. This is further exacerbated in managed languages, such as Java and C#, where the runtime abstracts away the memory from the developer and such optimisations are, therefore, almost impossible. To that extent, we present in this thesis a language extension called SHAPES . SHAPES aims to offer developers more fine-grained control over the placement of data, without sacrificing memory safety or object abstraction, hence retaining the expressiveness and familiarity of OOP. SHAPES introduces the concepts of pools and layouts; programmers group related objects into pools, and specify how objects are laid out in these pools. Classes and types are annotated by pool parameters, which allow placement aspects to be changed orthogonally to how the business logic operates on the objects in the pool. These design decisions disentangle business logic and memory concerns. We provide a formal model of SHAPES , present its type and memory safety model, and its translation into a low-level language. We present our reasoning as to why we can expect SHAPES to be compiled in an efficient manner in terms of the runtime representation of objects and the access to their fields. Moreover, we present SHAPES -z, an implementation of SHAPES as an embeddable language, and shapeszc , the compiler for SHAPES -z. We provide our our design and implementation considerations for SHAPES -z and shapeszc . Finally, we evaluate the performance of SHAPES and SHAPES -z through case studies.Open Acces

    Decoupling algorithms from schedules for easy optimization of image processing pipelines

    Get PDF
    Using existing programming tools, writing high-performance image processing code requires sacrificing readability, portability, and modularity. We argue that this is a consequence of conflating what computations define the algorithm, with decisions about storage and the order of computation. We refer to these latter two concerns as the schedule, including choices of tiling, fusion, recomputation vs. storage, vectorization, and parallelism. We propose a representation for feed-forward imaging pipelines that separates the algorithm from its schedule, enabling high-performance without sacrificing code clarity. This decoupling simplifies the algorithm specification: images and intermediate buffers become functions over an infinite integer domain, with no explicit storage or boundary conditions. Imaging pipelines are compositions of functions. Programmers separately specify scheduling strategies for the various functions composing the algorithm, which allows them to efficiently explore different optimizations without changing the algorithmic code. We demonstrate the power of this representation by expressing a range of recent image processing applications in an embedded domain specific language called Halide, and compiling them for ARM, x86, and GPUs. Our compiler targets SIMD units, multiple cores, and complex memory hierarchies. We demonstrate that it can handle algorithms such as a camera raw pipeline, the bilateral grid, fast local Laplacian filtering, and image segmentation. The algorithms expressed in our language are both shorter and faster than state-of-the-art implementations.National Science Foundation (U.S.) (Grant 0964004)National Science Foundation (U.S.) (Grant 0964218)National Science Foundation (U.S.) (Grant 0832997)United States. Dept. of Energy (Award DE-SC0005288)Cognex CorporationAdobe System

    Generating renderers

    Get PDF
    Most production renderers developed for the film industry are huge pieces of software that are able to render extremely complex scenes. Unfortunately, they are implemented using the currently available programming models that are not well suited to modern computing hardware like CPUs with vector units or GPUs. Thus, they have to deal with the added complexity of expressing parallelism and using hardware features in those models. Since compilers cannot alone optimize and generate efficient programs for any type of hardware, because of the large optimization spaces and the complexity of the underlying compiler problems, programmers have to rely on compiler-specific hardware intrinsics or write non-portable code. The consequence of these limitations is that programmers resort to writing the same code twice when they need to port their algorithm on a different architecture, and that the code itself becomes difficult to maintain, as algorithmic details are buried under hardware details. Thankfully, there are solutions to this problem, taking the form of Domain-Specific Lan- guages. As their name suggests, these languages are tailored for one domain, and compilers can therefore use domain-specific knowledge to optimize algorithms and choose the best execution policy for a given target hardware. In this thesis, we opt for another way of encoding domain- specific knowledge: We implement a generic, high-level, and declarative rendering and traversal library in a functional language, and later refine it for a target machine by providing partial evaluation annotations. The partial evaluator then specializes the entire renderer according to the available knowledge of the scene: Shaders are specialized when their inputs are known, and in general, all redundant computations are eliminated. Our results show that the generated renderers are faster and more portable than renderers written with state-of-the-art competing libraries, and that in comparison, our rendering library requires less implementation effort.Die meisten in der Filmindustrie zum Einsatz kommenden Renderer sind riesige Softwaresysteme, die in der Lage sind, extrem aufwendige Szenen zu rendern. Leider sind diese mit den aktuell verfügbaren Programmiermodellen implementiert, welche nicht gut geeignet sind für moderne Rechenhardware wie CPUs mit Vektoreinheiten oder GPUs. Deshalb müssen Entwickler sich mit der zusätzlichen Komplexität auseinandersetzen, Parallelismus und Hardwarefunktionen in diesen Programmiermodellen auszudrücken. Da Compiler nicht selbständig optimieren und effiziente Programme für jeglichen Typ Hardware generieren können, wegen des großen Optimierungsraumes und der Komplexität des unterliegenden Kompilierungsproblems, müssen Programmierer auf Compiler-spezifische Hardware-“Intrinsics” zurückgreifen, oder nicht portierbaren Code schreiben. Die Konsequenzen dieser Limitierungen sind, dass Programmierer darauf zurückgreifen den gleichen Code zweimal zu schreiben, wenn sie ihre Algorithmen für eine andere Architektur portieren müssen, und dass der Code selbst schwer zu warten wird, da algorithmische Details unter Hardwaredetails verloren gehen. Glücklicherweise gibt es Lösungen für dieses Problem, in der Form von DSLs. Diese Sprachen sind maßgeschneidert für eine Domäne und Compiler können deshalb Domänenspezifisches Wissen nutzen, um Algorithmen zu optimieren und die beste Ausführungsstrategie für eine gegebene Zielhardware zu wählen. In dieser Dissertation wählen wir einen anderen Weg, Domänenspezifisches Wissen zu enkodieren: Wir implementieren eine generische, high-level und deklarative Rendering- und Traversierungsbibliothek in einer funktionalen Programmiersprache, und verfeinern sie später für eine Zielmaschine durch Bereitstellung von Annotationen für die partielle Auswertung. Der “Partial Evaluator” spezialisiert dann den kompletten Renderer, basierend auf dem verfügbaren Wissen über die Szene: Shader werden spezialisiert, wenn ihre Eingaben bekannt sind, und generell werden alle redundanten Berechnungen eliminiert. Unsere Ergebnisse zeigen, dass die generierten Renderer schneller und portierbarer sind, als Renderer geschrieben mit den aktuellen Techniken konkurrierender Bibliotheken und dass, im Vergleich, unsere Rendering Bibliothek weniger Implementierungsaufwand erfordert.This work was supported by the Federal Ministry of Education and Research (BMBF) as part of the Metacca and ProThOS projects as well as by the Intel Visual Computing Institute (IVCI) and Cluster of Excellence on Multimodal Computing and Interaction (MMCI) at Saarland University. Parts of it were also co-funded by the European Union(EU), as part of the Dreamspace project

    Optimization of high-throughput real-time processes in physics reconstruction

    Get PDF
    La presente tesis se ha desarrollado en colaboración entre la Universidad de Sevilla y la Organización Europea para la Investigación Nuclear, CERN. El detector LHCb es uno de los cuatro grandes detectores situados en el Gran Colisionador de Hadrones, LHC. En LHCb, se colisionan partículas a altas energías para comprender la diferencia existente entre la materia y la antimateria. Debido a la cantidad ingente de datos generada por el detector, es necesario realizar un filtrado de datos en tiempo real, fundamentado en los conocimientos actuales recogidos en el Modelo Estándar de física de partículas. El filtrado, también conocido como High Level Trigger, deberá procesar un throughput de 40 Tb/s de datos, y realizar un filtrado de aproximadamente 1 000:1, reduciendo el throughput a unos 40 Gb/s de salida, que se almacenan para posterior análisis. El proceso del High Level Trigger se subdivide a su vez en dos etapas: High Level Trigger 1 (HLT1) y High Level Trigger 2 (HLT2). El HLT1 transcurre en tiempo real, y realiza una reducción de datos de aproximadamente 30:1. El HLT1 consiste en una serie de procesos software que reconstruyen lo que ha sucedido en la colisión de partículas. En la reconstrucción del HLT1 únicamente se analizan las trayectorias de las partículas producidas fruto de la colisión, en un problema conocido como reconstrucción de trazas, para dictaminar el interés de las colisiones. Por contra, el proceso HLT2 es más fino, requiriendo más tiempo en realizarse y reconstruyendo todos los subdetectores que componen LHCb. Hacia 2020, el detector LHCb, así como todos los componentes del sistema de adquisici´on de datos, serán actualizados acorde a los últimos desarrollos técnicos. Como parte del sistema de adquisición de datos, los servidores que procesan HLT1 y HLT2 también sufrirán una actualización. Al mismo tiempo, el acelerador LHC será también actualizado, de manera que la cantidad de datos generada en cada cruce de grupo de partículas aumentare en aproxidamente 5 veces la actual. Debido a las actualizaciones tanto del acelerador como del detector, se prevé que la cantidad de datos que deberá procesar el HLT en su totalidad sea unas 40 veces mayor a la actual. La previsión de la escalabilidad del software actual a 2020 subestim´ó los recursos necesarios para hacer frente al incremento en throughput. Esto produjo que se pusiera en marcha un estudio de todos los algoritmos tanto del HLT1 como del HLT2, así como una actualización del código a nuevos estándares, para mejorar su rendimiento y ser capaz de procesar la cantidad de datos esperada. En esta tesis, se exploran varios algoritmos de la reconstrucción de LHCb. El problema de reconstrucción de trazas se analiza en profundidad y se proponen nuevos algoritmos para su resolución. Ya que los problemas analizados exhiben un paralelismo masivo, estos algoritmos se implementan en lenguajes especializados para tarjetas gráficas modernas (GPUs), dada su arquitectura inherentemente paralela. En este trabajo se dise ˜nan dos algoritmos de reconstrucción de trazas. Además, se diseñan adicionalmente cuatro algoritmos de decodificación y un algoritmo de clustering, problemas también encontrados en el HLT1. Por otra parte, se diseña un algoritmo para el filtrado de Kalman, que puede ser utilizado en ambas etapas. Los algoritmos desarrollados cumplen con los requisitos esperados por la colaboración LHCb para el año 2020. Para poder ejecutar los algoritmos eficientemente en tarjetas gráficas, se desarrolla un framework especializado para GPUs, que permite la ejecución paralela de secuencias de reconstrucción en GPUs. Combinando los algoritmos desarrollados con el framework, se completa una secuencia de ejecución que asienta las bases para un HLT1 ejecutable en GPU. Durante la investigación llevada a cabo en esta tesis, y gracias a los desarrollos arriba mencionados y a la colaboración de un pequeño equipo de personas coordinado por el autor, se completa un HLT1 ejecutable en GPUs. El rendimiento obtenido en GPUs, producto de esta tesis, permite hacer frente al reto de ejecutar una secuencia de reconstrucción en tiempo real, bajo las condiciones actualizadas de LHCb previstas para 2020. As´ı mismo, se completa por primera vez para cualquier experimento del LHC un High Level Trigger que se ejecuta únicamente en GPUs. Finalmente, se detallan varias posibles configuraciones para incluir tarjetas gr´aficas en el sistema de adquisición de datos de LHCb.The current thesis has been developed in collaboration between Universidad de Sevilla and the European Organization for Nuclear Research, CERN. The LHCb detector is one of four big detectors placed alongside the Large Hadron Collider, LHC. In LHCb, particles are collided at high energies in order to understand the difference between matter and antimatter. Due to the massive quantity of data generated by the detector, it is necessary to filter data in real-time. The filtering, also known as High Level Trigger, processes a throughput of 40 Tb/s of data and performs a selection of approximately 1 000:1. The throughput is thus reduced to roughly 40 Gb/s of data output, which is then stored for posterior analysis. The High Level Trigger process is subdivided into two stages: High Level Trigger 1 (HLT1) and High Level Trigger 2 (HLT2). HLT1 occurs in real-time, and yields a reduction of data of approximately 30:1. HLT1 consists in a series of software processes that reconstruct particle collisions. The HLT1 reconstruction only analyzes the trajectories of particles produced at the collision, solving a problem known as track reconstruction, that determines whether the collision data is kept or discarded. In contrast, HLT2 is a finer process, which requires more time to execute and reconstructs all subdetectors composing LHCb. Towards 2020, the LHCb detector and all the components composing the data acquisition system will be upgraded. As part of the data acquisition system, the servers that process HLT1 and HLT2 will also be upgraded. In addition, the LHC accelerator will also be updated, increasing the data generated in every bunch crossing by roughly 5 times. Due to the accelerator and detector upgrades, the amount of data that the HLT will require to process is expected to increase by 40 times. The foreseen scalability of the software through 2020 underestimated the required resources to face the increase in data throughput. As a consequence, studies of all algorithms composing HLT1 and HLT2 and code modernizations were carried out, in order to obtain a better performance and increase the processing capability of the foreseen hardware resources in the upgrade. In this thesis, several algorithms of the LHCb recontruction are explored. The track reconstruction problem is analyzed in depth, and new algorithms are proposed. Since the analyzed problems are massively parallel, these algorithms are implemented in specialized languages for modern graphics cards (GPUs), due to their inherently parallel architecture. From this work stem two algorithm designs. Furthermore, four additional decoding algorithms and a clustering algorithms have been designed and implemented, which are also part of HLT1. Apart from that, an parallel Kalman filter algorithm has been designed and implemented, which can be used in both HLT stages. The developed algorithms satisfy the requirements of the LHCb collaboration for the LHCb upgrade. In order to execute the algorithms efficiently on GPUs, a software framework specialized for GPUs is developed, which allows executing GPU reconstruction sequences in parallel. Combining the developed algorithms with the framework, an execution sequence is completed as the foundations of a GPU HLT1. During the research carried out in this thesis, the aforementioned developments and a small group of collaborators coordinated by the author lead to the completion of a full GPU HLT1 sequence. The performance obtained on GPUs allows executing a reconstruction sequence in real-time, under LHCb upgrade conditions. The developed GPU HLT1 constitutes the first GPU high level trigger ever developed for an LHC experiment. Finally, various possible realizations of the GPU HLT1 to integrate in a production GPU-equipped data acquisition system are detailed

    Data layout types : a type-based approach to automatic data layout transformations for improved SIMD vectorisation

    Get PDF
    The increasing complexity of modern hardware requires sophisticated programming techniques for programs to run efficiently. At the same time, increased power of modern hardware enables more advanced analyses to be included in compilers. This thesis focuses on one particular optimisation technique that improves utilisation of vector units. The foundation of this technique is the ability to chose memory mappings for data structures of a given program. Usually programming languages use a fixed layout for logical data structures in physical memory. Such a static mapping often has a negative effect on usability of vector units. In this thesis we consider a compiler for a programming language that allows every data structure in a program to have its own data layout. We make sure that data layouts across the program are sound, and most importantly we solve a problem of automatic data layout reconstruction. To consistently do this, we formulate this as a type inference problem, where type encodes a data layout for a given structure as well as implied program transformations. We prove that type-implied transformations preserve semantics of the original programs and we demonstrate significant performance improvements when targeting SIMD-capable architectures

    Doctor of Philosophy

    Get PDF
    dissertationCurrent scaling trends in transistor technology, in pursuit of larger component counts and improving power efficiency, are making the hardware increasingly less reliable. Due to extreme transistor miniaturization, it is becoming easier to flip a bit stored in memory elements built using these transistors. Given that soft errors can cause transient bit-flips in memory elements, caused due to alpha particles and cosmic rays striking those elements, soft errors have become one of the major impediments in system resilience as we move towards exascale computing. Soft errors escaping the hardware-layer may silently corrupt the runtime application data of a program, causing silent data corruption in the output. Also, given that soft errors are transient in nature, it is notoriously hard to trace back their origins. Therefore, techniques to enhance system resilience hinge on the availability of efficient error detectors that have high detection rates, low false positive rates, and lower computational overhead. It is equally important to have a flexible infrastructure capable of simulating realistic soft error models to promote an effective evaluation of newly developed error detectors. In this work, we present a set of techniques for efficiently detecting soft errors affecting control-flow, data, and structured address computations in an application. We evaluate the efficacy of the proposed techniques by evaluating them on a collection of benchmarks through fault-injection driven studies. As an important requirement, we also introduce two new LLVM-based fault injectors, KULFI and VULFI, which are geared towards scalar and vector architectures, respectively. Through this work, we aim to make contributions to the system resilience community by making our research tools (in the form of error detectors and fault injectors) publicly available

    Evaluating the performance of legacy applications on emerging parallel architectures

    Get PDF
    The gap between a supercomputer's theoretical maximum (\peak") oatingpoint performance and that actually achieved by applications has grown wider over time. Today, a typical scientific application achieves only 5{20% of any given machine's peak processing capability, and this gap leaves room for significant improvements in execution times. This problem is most pronounced for modern \accelerator" architectures { collections of hundreds of simple, low-clocked cores capable of executing the same instruction on dozens of pieces of data simultaneously. This is a significant change from the low number of high-clocked cores found in traditional CPUs, and effective utilisation of accelerators typically requires extensive code and algorithmic changes. In many cases, the best way in which to map a parallel workload to these new architectures is unclear. The principle focus of the work presented in this thesis is the evaluation of emerging parallel architectures (specifically, modern CPUs, GPUs and Intel MIC) for two benchmark codes { the LU benchmark from the NAS Parallel Benchmark Suite and Sandia's miniMD benchmark { which exhibit complex parallel behaviours that are representative of many scientific applications. Using combinations of low-level intrinsic functions, OpenMP, CUDA and MPI, we demonstrate performance improvements of up to 7x for these workloads. We also detail a code development methodology that permits application developers to target multiple architecture types without maintaining completely separate implementations for each platform. Using OpenCL, we develop performance portable implementations of the LU and miniMD benchmarks that are faster than the original codes, and at most 2x slower than versions highly-tuned for particular hardware. Finally, we demonstrate the importance of evaluating architectures at scale (as opposed to on single nodes) through performance modelling techniques, highlighting the problems associated with strong-scaling on emerging accelerator architectures
    corecore