44 research outputs found

    Dataflow Computing with Polymorphic Registers

    Get PDF
    Heterogeneous systems are becoming increasingly popular for data processing. They improve performance of simple kernels applied to large amounts of data. However, sequential data loads may have negative impact. Data parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high speed, parallel access to performance-critical data. Furthermore, by PRF customization, specific data path features are exposed to the programmer in a very convenient way. PRFs allow additional control over the registers dimensions, and the number of elements which can be simultaneously accessed by computational units. This paper shows how PRFs can be integrated in dataflow computational platforms. In particular, starting from an annotated source code, we present a compiler-based methodology that automatically generates the customized PRFs and the enhanced computational kernels that efficiently exploit them

    The Case for Polymorphic Registers in Dataflow Computing

    Get PDF
    Heterogeneous systems are becoming increasingly popular, delivering high performance through hardware specialization. However, sequential data accesses may have a negative impact on performance. Data parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high-speed, parallel access to performance-critical data. This article shows how PRFs can be integrated into dataflow computational platforms. Our semi-automatic, compiler-based methodology generates customized PRFs and modifies the computational kernels to efficiently exploit them. We use a separable 2D convolution case study to evaluate the impact of memory latency and bandwidth on performance compared to a state-of-the-art NVIDIA Tesla C2050 GPU. We improve the throughput up to 56.17X and show that the PRF-augmented system outperforms the GPU for 9×9 or larger mask sizes, even in bandwidth-constrained systems

    Abstractions and performance optimisations for finite element methods

    Get PDF
    Finding numerical solutions to partial differential equations (PDEs) is an essential task in the discipline of scientific computing. In designing software tools for this task, one of the ultimate goals is to balance the needs for generality, ease to use and high performance. Domain-specific systems based on code generation techniques, such as Firedrake, attempt to address this problem with a design consisting of a hierarchy of abstractions, where the users can specify the mathematical problems via a high-level, descriptive interface, which is progressively lowered through the intermediate abstractions. Well-designed abstraction layers are essential to enable performing code transformations and optimisations robustly and efficiently, generating high-performance code without user intervention. This thesis discusses several topics on the design of the abstraction layers of Firedrake, and presents the benefit of its software architecture by providing examples of various optimising code transformations at the appropriate abstraction layers. In particular, we discuss the advantage of describing the local assembly stage of a finite element solver in an intermediate representation based on symbolic tensor algebra. We successfully lift specific loop optimisations, previously implemented by rewriting ASTs of the local assembly kernels, to this higher-level tensor language, improving the compilation speed and optimisation effectiveness. The global assembly phase involves the application of local assembly kernels on a collection of entities of an unstructured mesh. We redesign the abstraction to express the global assembly loop nests using tools and concepts based on the polyhedral model. This enables us to implement the cross-element vectorisation algorithm that delivers stable vectorisation performance on CPUs automatically. This abstraction also improves the portability of Firedrake, as we demonstrate targeting GPU devices transparently from the same software stack.Open Acces

    Investigating performance portability of a highly scalable particle-in-cell simulation code on various multi-core architectures

    Get PDF
    The alpaka library defines and implements an abstract hierarchical redundant parallelism model. This model exploits parallelism and memory hierarchies on a node at all levels available in current hardware. This allows to achieve portability of performant codes across various types of accelerators by ignoring specific unsupported levels and utilizing only the ones supported on a specific accelerator. All hardware types (multi- and many-core CPUs, GPUs and other accelerators) are treated and can be programmed in the same way. The C++ template interface provided allows for straightforward extension of the library to support other accelerators and specialization of its internals for optimization

    Synthesis Techniques for Semi-Custom Dynamically Reconfigurable Superscalar Processors

    Get PDF
    The accelerated adoption of reconfigurable computing foreshadows a computational paradigm shift, aimed at fulfilling the need of customizable yet high-performance flexible hardware. Reconfigurable computing fulfills this need by allowing the physical resources of a chip to be adapted to the computational requirements of a specific program, thus achieving higher levels of computing performance. This dissertation evaluates the area requirements for reconfigurable processing, an important yet often disregarded assessment for partial reconfiguration. Common reconfigurable computing approaches today attempt to create custom circuitry in static co-processor accelerators. We instead focused on a new approach that synthesized semi-custom general-purpose processor cores. Each superscalar processor core's execution units can be customized for a particular application, yet the processor retains its standard microprocessor interface. We analyzed the area consumption for these computational components by studying the synthesis requirements of different processor configurations. This area/performance assessment aids designers when constraining processing elements in a fixed-size area slot, a requirement for modern partial reconfiguration approaches. Our results provide a more deterministic evaluation of performance density, hence making the area cost analysis less ambiguous when optimizing dynamic systems for coarse-grained parallelism. The results obtained showed that even though performance density decreases with processor complexity, the additional area still provides a positive contribution to the aggregate parallel processing performance. This evaluation of parallel execution density contributes to ongoing efforts in the field of reconfigurable computing by providing a baseline for area/performance trade-offs for partial reconfiguration and multi-processor systems

    Programming Languages and Systems

    Get PDF
    This open access book constitutes the proceedings of the 31st European Symposium on Programming, ESOP 2022, which was held during April 5-7, 2022, in Munich, Germany, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022. The 21 regular papers presented in this volume were carefully reviewed and selected from 64 submissions. They deal with fundamental issues in the specification, design, analysis, and implementation of programming languages and systems

    Programming Languages and Systems

    Get PDF
    This open access book constitutes the proceedings of the 31st European Symposium on Programming, ESOP 2022, which was held during April 5-7, 2022, in Munich, Germany, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022. The 21 regular papers presented in this volume were carefully reviewed and selected from 64 submissions. They deal with fundamental issues in the specification, design, analysis, and implementation of programming languages and systems

    Programming Languages and Systems

    Get PDF
    This open access book constitutes the proceedings of the 29th European Symposium on Programming, ESOP 2020, which was planned to take place in Dublin, Ireland, in April 2020, as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2020. The actual ETAPS 2020 meeting was postponed due to the Corona pandemic. The papers deal with fundamental issues in the specification, design, analysis, and implementation of programming languages and systems
    corecore