333 research outputs found

    Accelerating sequential programs using FastFlow and self-offloading

    Full text link
    FastFlow is a programming environment specifically targeting cache-coherent shared-memory multi-cores. FastFlow is implemented as a stack of C++ template libraries built on top of lock-free (fence-free) synchronization mechanisms. In this paper we present a further evolution of FastFlow enabling programmers to offload part of their workload on a dynamically created software accelerator running on unused CPUs. The offloaded function can be easily derived from pre-existing sequential code. We emphasize in particular the effective trade-off between human productivity and execution efficiency of the approach.Comment: 17 pages + cove

    PiCo: A Domain-Specific Language for Data Analytics Pipelines

    Get PDF
    In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models—for which only informal (and often confusing) semantics is generally provided—all share a common under- lying model, namely, the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects about Big Data analytics tools from a high level perspective. This analysis can be considered as a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it is easier for a programmer or software designer to avoid mixing low level with high level aspects, as we are often used to see in state-of-the-art Big Data analytics frameworks. From the user-level perspective, we think that a clearer and simple semantics is preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model implemented as a Domain-Specific Language, that is on top of a stack of layers that build a prototypical framework for Big Data analytics. The contribution of this thesis is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm and we show how the analyzed tools fit in each level. Second, we propose a programming environment based on such layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, basically a DAG-composition of processing elements. This model is intended to give the user an unique interface for both stream and batch processing, hiding completely data management and focusing only on operations, which are represented by Pipeline stages. Our DSL will be built on top of the FastFlow library, exploiting both shared and distributed parallelism, and implemented in C++11/14 with the aim of porting C++ into the Big Data world

    Simplified vector-thread architectures for flexible and efficient data-parallel accelerators

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Cataloged from student submitted PDF version of thesis.Includes bibliographical references (p. 165-170).This thesis explores a new approach to building data-parallel accelerators that is based on simplifying the instruction set, microarchitecture, and programming methodology for a vector-thread architecture. The thesis begins by categorizing regular and irregular data-level parallelism (DLP), before presenting several architectural design patterns for data-parallel accelerators including the multiple-instruction multiple-data (MIMD) pattern, the vector single-instruction multiple-data (vector-SIMD) pattern, the single-instruction multiple-thread (SIMT) pattern, and the vector-thread (VT) pattern. Our recently proposed VT pattern includes many control threads that each manage their own array of microthreads. The control thread uses vector memory instructions to efficiently move data and vector fetch instructions to broadcast scalar instructions to all microthreads. These vector mechanisms are complemented by the ability for each microthread to direct its own control flow. In this thesis, I introduce various techniques for building simplified instances of the VT pattern. I propose unifying the VT control-thread and microthread scalar instruction sets to simplify the microarchitecture and programming methodology. I propose a new single-lane VT microarchitecture based on minimal changes to the vector-SIMD pattern.(cont.) Single-lane cores are simpler to implement than multi-lane cores and can achieve similar energy efficiency. This new microarchitecture uses control processor embedding to mitigate the area overhead of single-lane cores, and uses vector fragments to more efficiently handle both regular and irregular DLP as compared to previous VT architectures. I also propose an explicitly data-parallel VT programming methodology that is based on a slightly modified scalar compiler. This methodology is easier to use than assembly programming, yet simpler to implement than an automatically vectorizing compiler. To evaluate these ideas, we have begun implementing the Maven data-parallel accelerator. This thesis compares a simplified Maven VT core to MIMD, vector-SIMD, and SIMT cores. We have implemented these cores with an ASIC methodology, and I use the resulting gate-level models to evaluate the area, performance, and energy of several compiled microbenchmarks. This work is the first detailed quantitative comparison of the VT pattern to other patterns. My results suggest that future data-parallel accelerators based on simplified VT architectures should be able to combine the energy efficiency of vector-SIMD accelerators with the flexibility of MIMD accelerators.by Christopher Francis Batten.Ph.D

    Exploring manycore architectures for next-generation HPC systems through the MANGO approach

    Full text link
    [EN] The Horizon 2020 MANGO project aims at exploring deeply heterogeneous accelerators for use in High-Performance Computing systems running multiple applications with different Quality of Service (QoS) levels. The main goal of the project is to exploit customization to adapt computing resources to reach the desired QoS. For this purpose, it explores different but interrelated mechanisms across the architecture and system software. In particular, in this paper we focus on the runtime resource management, the thermal management, and support provided for parallel programming, as well as introducing three applications on which the project foreground will be validated.This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671668.Flich Cardo, J.; Agosta, G.; Ampletzer, P.; Atienza-Alonso, D.; Brandolese, C.; Cappe, E.; Cilardo, A.... (2018). Exploring manycore architectures for next-generation HPC systems through the MANGO approach. Microprocessors and Microsystems. 61:154-170. https://doi.org/10.1016/j.micpro.2018.05.011S1541706

    Many-core and heterogeneous architectures: programming models and compilation toolchains

    Get PDF
    1noL'abstract è presente nell'allegato / the abstract is in the attachmentopen677. INGEGNERIA INFORMATInopartially_openembargoed_20211002Barchi, Francesc

    Scheduling strategies for parallel patterns on heterogeneous architectures

    Get PDF
    To help shrink the programmability-performance efficiency gap, we discuss that adaptive runtime systems can be used to facilitate the management of heterogeneous architectures. A runtime system can provide a significant performance boost while reducing the energy consumption, because it is aware of processors’ architectures and application’s requirements. We analyse how applications map onto hardware by inspecting built-in processor counters, and therefore build models to describe the observed behaviour. In this thesis, we discuss how parallel patterns, such as parallel for loops and pipelines, can be decomposed and efficiently executed on heterogeneous plat- forms. We propose several scheduling strategies aiming at reducing execution time and energy consumption. We demonstrate how applications can be run faster by mapping the application level parallelism onto the hardware process- ing units that best fit the application requirements, and by selecting the right task size. First, we devise a load balancing technique, that targets heterogeneous CPU and multi-GPU architectures. It monitors the relative speed of each processing unit, and distributes the remaining workload based on these relative speeds. By making all processing units to finish at same time, we avoid unnecessary waits between processors. Along with this load balancing technique, we propose a performance-sensitive partitioner that adapts the amount of computation offloaded to the accelerator for better performance and utilisation. We also present an accurate performance model for streaming applications, such as face recognition or object tracking. This model targets pipelined applications, as a series of stages, and performs a scalability analysis of each stage by using coarse and medium grain parallelism. Additionally, it also considers executing the stage on the GPU or not. By applying the model, we always find the best pipeline configuration among all possible, and get substantial performance and energy savings. All experiments in this thesis have been performed by using state-of-the-art hardware accelerators and benchmarks of the field of HPC. Specifically, we use the Rodinia and SHOC benchmark suites, for the evaluation of the parallel for partitioner. Moreover, we use the the ViVid application, along with tracking and SRAD applications from Rodinia Benchmark Suite, all of them are good candidates of vision applications. Finally, we rely on Intel Threading Building Blocks, the core engine of our schedulers; the Intel OpenCL SDK and CUDA SDK to offload computations to the GPU accelerators and Intel PCM library to monitor energy consumption and cache memory metrics.During the last decade, power consumption and energy efficiency have become key aspects in processor design. Nowadays, the power consumption is the principal limitation for further scaling of chip multiprocessors design (CMPs). In general, the research community agrees that current chip multiprocessor technology trends will not scale performance without an increase of power budget. Hardware design innovations as the recent Heterogeneous Architectures and Near Threshold Computing are needed to cope with the performance-power barrier. As a result of this, there has been a shift away from chip multiprocessors to heterogeneous processor architectures. Recently, we have witnessed an explosion in the availability of this kind of architectures. Many hardware vendors have released a number of heterogeneous processors to overcome the aforementioned limitations. However, software also requires changes to allow further performance scaling on these architectures. With the advent of heterogeneous architectures, hardware manufactures have impose the burden of explicit accelerator management on software developers. In general, programmers are used to sequential programming, but writing high-performance programs for heterogeneous architectures is a complex task. Programming for this kind of platforms requires the understanding of new hardware concepts, orchestration of different parallelism levels, the explicit management of different memory spaces and synchronisations between processing units, and finally the usage of low-level programming models such as OpenCL or CUDA. Moreover, heterogeneous architectures suffer from performance portability, as one program can exhibit unequal performance on different devices

    Optimisation of computational fluid dynamics applications on multicore and manycore architectures

    Get PDF
    This thesis presents a number of optimisations used for mapping the underlying computational patterns of finite volume CFD applications onto the architectural features of modern multicore and manycore processors. Their effectiveness and impact is demonstrated in a block-structured and an unstructured code of representative size to industrial applications and across a variety of processor architectures that make up contemporary high-performance computing systems. The importance of vectorization and the ways through which this can be achieved is demonstrated in both structured and unstructured solvers together with the impact that the underlying data layout can have on performance. The utility of auto-tuning for ensuring performance portability across multiple architectures is demonstrated and used for selecting optimal parameters such as prefetch distances for software prefetching or tile sizes for strip mining/loop tiling. On the manycore architectures, running more than one thread per physical core is found to be crucial for good performance on processors with in-order core designs but not required on out-of-order architectures. For architectures with high-bandwidth memory packages, their exploitation, whether explicitly or implicitly, is shown to be imperative for best performance. The implementation of all of these optimisations led to application speed-ups ranging between 2.7X and 3X on the multicore CPUs and 5.7X to 24X on the manycore processors.Open Acces
    • …
    corecore