
    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large-scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increased data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolving interface contention or increasing parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers to tap into the performance potential offered by specialized hardware architectures using HLS.
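    As a minimal sketch of the kind of transformation the paper catalogues, the example below pipelines a floating-point reduction by splitting the accumulator into independent partial sums. It assumes a Vitis-HLS-style toolchain (the pragmas are Vitis HLS directives); the kernel, lane count, and function names are illustrative and not taken from the paper's benchmarks.

```cpp
// Minimal sketch of one HLS transformation: pipelining a reduction by
// splitting the accumulator. Assumes a Vitis-HLS-style toolchain; kernel,
// lane count, and names are illustrative, not taken from the paper.
#include <cstddef>

constexpr std::size_t kLanes = 8;  // assumed number of partial sums

// Naive reduction: the loop-carried dependence on `acc` prevents the
// requested II=1 whenever the floating-point adder takes multiple cycles.
float reduce_naive(const float *in, std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
    acc += in[i];  // next iteration must wait for this result
  }
  return acc;
}

// Transformed reduction: kLanes independent accumulators stretch the
// dependence distance, so the pipelined loop can accept one element per cycle.
float reduce_pipelined(const float *in, std::size_t n) {
  float partial[kLanes];
#pragma HLS ARRAY_PARTITION variable=partial complete
  for (std::size_t l = 0; l < kLanes; ++l) {
#pragma HLS UNROLL
    partial[l] = 0.0f;
  }
  for (std::size_t i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
    partial[i % kLanes] += in[i];  // rotate among independent accumulators
  }
  float acc = 0.0f;
  for (std::size_t l = 0; l < kLanes; ++l) {
#pragma HLS UNROLL
    acc += partial[l];
  }
  return acc;
}
```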

    Stardust: Compiling Sparse Tensor Algebra to a Reconfigurable Dataflow Architecture

    We introduce Stardust, a compiler that compiles sparse tensor algebra to reconfigurable dataflow architectures (RDAs). Stardust introduces new user-provided data representation and scheduling language constructs for mapping to resource-constrained accelerated architectures. Stardust uses the information provided by these constructs to determine on-chip memory placement and to lower to the Capstan RDA through a parallel-patterns rewrite system that targets the Spatial programming model. The Stardust compiler is implemented as a new compilation path inside the TACO open-source system. Using cycle-accurate simulation, we demonstrate that Stardust can generate more Capstan tensor operations than its authors had implemented and that it results in 138× better performance than generated CPU kernels and 41× better performance than generated GPU kernels.
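    For context, the sketch below is a plain-C++ CSR sparse matrix-vector product, the kind of kernel a sparse tensor compiler derives from an index expression such as y(i) = A(i,j) * x(j). It is an illustrative reference only, not Stardust or TACO output, and the struct and function names are invented for this example.

```cpp
// Plain-C++ CSR sparse matrix-vector product: the kind of kernel a sparse
// tensor compiler derives from "y(i) = A(i,j) * x(j)". Illustrative only;
// not Stardust/TACO output.
#include <cstddef>
#include <vector>

struct CsrMatrix {
  std::size_t rows = 0;
  std::vector<std::size_t> row_ptr;  // size rows + 1
  std::vector<std::size_t> col_idx;  // size nnz
  std::vector<double> values;        // size nnz
};

std::vector<double> spmv(const CsrMatrix &A, const std::vector<double> &x) {
  std::vector<double> y(A.rows, 0.0);
  for (std::size_t i = 0; i < A.rows; ++i) {
    // Visit only the stored nonzeros of row i.
    for (std::size_t p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
      y[i] += A.values[p] * x[A.col_idx[p]];
    }
  }
  return y;
}
```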

    High performance communication on reconfigurable clusters

    High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor limiting the performance of HPC, but can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interaction, even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between the computation kernel and the network interface. Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware solution for communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications. We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm. This shows a significant advantage over a state-of-the-art collective routing algorithm. We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that our design is at least an order of magnitude faster than CPU and GPU implementations (achieving strong scaling for the target applications). Surprisingly, the FPGA cluster performance is similar to that of an ASIC-cluster. We also implement the 3D FFT on another multi-FPGA platform: the Microsoft Catapult II cloud. Its performance is also comparable or superior to that of CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics Simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the µs/day range with a commodity cloud.
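    As a point of reference for the 3D FFT workload, the sketch below computes a 3D FFT on one node as passes of 1D FFTs along each axis; once the cube is distributed across devices, the strided accesses along the second and third axes become global transposes (all-to-all traffic), which is the communication this dissertation attacks. It is a hedged CPU illustration (radix-2, power-of-two sizes, unoptimized), not the dissertation's FPGA design.

```cpp
// CPU reference for a 3D FFT computed as 1D FFT passes along each axis.
// Radix-2, power-of-two sizes, unoptimized; illustration only, not the
// dissertation's FPGA implementation.
#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;

// Recursive radix-2 Cooley-Tukey FFT of a power-of-two-length vector.
void fft1d(std::vector<cd> &a) {
  const std::size_t n = a.size();
  if (n <= 1) return;
  std::vector<cd> even(n / 2), odd(n / 2);
  for (std::size_t i = 0; i < n / 2; ++i) {
    even[i] = a[2 * i];
    odd[i] = a[2 * i + 1];
  }
  fft1d(even);
  fft1d(odd);
  const double pi = std::acos(-1.0);
  for (std::size_t k = 0; k < n / 2; ++k) {
    cd w = std::polar(1.0, -2.0 * pi * static_cast<double>(k) / n) * odd[k];
    a[k] = even[k] + w;
    a[k + n / 2] = even[k] - w;
  }
}

// 3D FFT of an n*n*n cube stored as data[(z*n + y)*n + x]: one pass of n*n
// 1D FFTs per axis. The strided gathers along y and z are what become
// transposes (all-to-all exchanges) once the cube is distributed.
void fft3d(std::vector<cd> &data, std::size_t n) {
  std::vector<cd> line(n);
  for (int axis = 0; axis < 3; ++axis) {
    for (std::size_t u = 0; u < n; ++u) {
      for (std::size_t v = 0; v < n; ++v) {
        for (std::size_t w = 0; w < n; ++w) {
          std::size_t x = (axis == 0) ? w : u;
          std::size_t y = (axis == 1) ? w : ((axis == 0) ? u : v);
          std::size_t z = (axis == 2) ? w : v;
          line[w] = data[(z * n + y) * n + x];
        }
        fft1d(line);
        for (std::size_t w = 0; w < n; ++w) {
          std::size_t x = (axis == 0) ? w : u;
          std::size_t y = (axis == 1) ? w : ((axis == 0) ? u : v);
          std::size_t z = (axis == 2) ? w : v;
          data[(z * n + y) * n + x] = line[w];
        }
      }
    }
  }
}
```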

    Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics

    Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound, designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon among domain experts. In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively-parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example. Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to generate systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming at fully exploiting the available CPU-FPGA bandwidth. We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25x more energy efficient than expert-crafted Intel CPU implementations.
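    As an illustration of the kind of memory-bound tensor operation such a flow targets, the sketch below is a plain-C++ Jacobi relaxation step (a four-neighbour stencil). It is not taken from the paper's CFD code; the point is only that each output element needs several neighbouring inputs, so performance hinges on data movement rather than arithmetic.

```cpp
// Plain-C++ Jacobi relaxation step (four-neighbour stencil) on an nx*ny
// grid: a typical memory-bound tensor operation. Illustration only; not
// taken from the paper's CFD code.
#include <cstddef>
#include <vector>

void jacobi_step(const std::vector<float> &in, std::vector<float> &out,
                 std::size_t nx, std::size_t ny) {
  for (std::size_t j = 1; j + 1 < ny; ++j) {
    for (std::size_t i = 1; i + 1 < nx; ++i) {
      // Each output point averages its four neighbours: few flops per byte
      // moved, so performance is set by memory bandwidth and data reuse.
      out[j * nx + i] = 0.25f * (in[j * nx + (i - 1)] + in[j * nx + (i + 1)] +
                                 in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
    }
  }
}
```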

    An Optimized Architecture for CGA Operations and Its Application to a Simulated Robotic Arm

    Conformal geometric algebra (CGA) is a new geometric computation tool that is attracting growing attention in many research fields, such as computer graphics, robotics, and computer vision. Regarding robotic applications, new approaches based on CGA have been proposed to efficiently solve problems such as the inverse kinematics and grasping of a robotic arm. The hardware acceleration of CGA operations is required to meet real-time performance requirements in embedded robotic platforms. In this paper, we present a novel embedded coprocessor for accelerating CGA operations in robotic tasks. Two robotic algorithms, namely, inverse kinematics and grasping of a human-arm-like kinematics chain, are used to prove the effectiveness of the proposed approach. The coprocessor natively supports the entire set of CGA operations, including both basic operations (products, sums/differences, and unary operations) and complex operations such as rigid body motions (reflections, rotations, translations, and dilations). The coprocessor prototype is implemented on the Xilinx ML510 development platform as a complete system-on-chip (SoC), integrating both a PowerPC processing core and a CGA coprocessing core on the same Xilinx Virtex-5 FPGA chip. Experimental results show speedups of 78x and 246x for inverse kinematics and grasping algorithms, respectively, with respect to execution on the PowerPC processor.
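    To give a flavour of the operations being accelerated, the sketch below applies a rotor "sandwich" product R v R~ to rotate a 3D vector, restricted to Euclidean 3D geometric algebra, where rotors are isomorphic to quaternions. The coprocessor in the paper operates on full 32-component conformal multivectors, so this is only a simplified software illustration.

```cpp
// Rotor "sandwich" product R v R~ restricted to Euclidean 3D geometric
// algebra, where a rotor (scalar + bivector) is isomorphic to a quaternion.
// Simplified illustration; the paper's coprocessor handles full
// 32-component conformal multivectors.
#include <array>
#include <cmath>

struct Rotor {
  double w, x, y, z;  // scalar part and bivector components (yz, zx, xy)
};

// Product of two rotors, written here as the quaternion product it maps to.
Rotor mul(const Rotor &a, const Rotor &b) {
  return {a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
          a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
          a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
          a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w};
}

// Rotate v by `angle` about the unit axis (ux, uy, uz) via the sandwich
// product, evaluated in its quaternion-isomorphic form q p q~.
std::array<double, 3> rotate(const std::array<double, 3> &v, double angle,
                             double ux, double uy, double uz) {
  const double c = std::cos(angle / 2.0), s = std::sin(angle / 2.0);
  Rotor r{c, s * ux, s * uy, s * uz};
  Rotor r_rev{c, -s * ux, -s * uy, -s * uz};  // reverse of r
  Rotor p{0.0, v[0], v[1], v[2]};             // the vector, embedded
  Rotor out = mul(mul(r, p), r_rev);
  return {out.x, out.y, out.z};
}
```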

    Heterogeneous Self-Reconfiguring Robotics: Ph.D. Thesis Proposal

    Self-reconfiguring robots are modular systems that can change shape, or reconfigure, to match structure to task. They comprise many small, discrete, often identical modules that connect together and that are minimally actuated. Global shape transformation is achieved by composing local motions. Systems with a single module type, known as homogeneous systems, gain fault tolerance, robustness and low production cost from module interchangeability. However, we are interested in heterogeneous systems, which include multiple types of modules such as those with sensors, batteries or wheels. We believe that heterogeneous systems offer the same benefits as homogeneous systems with the added ability to match not only structure to task, but also capability to task. Although significant results have been achieved in understanding homogeneous systems, research in heterogeneous systems is challenging as key algorithmic issues remain unexplored. We propose in this thesis to investigate questions in four main areas: 1) how to classify heterogeneous systems, 2) how to develop efficient heterogeneous reconfiguration algorithms with desired characteristics, 3) how to characterize the complexity of key algorithmic problems, and 4) how to apply these heterogeneous algorithms to perform useful new tasks in simulation and in the physical world. Our goal is to develop an algorithmic basis for heterogeneous systems. This has theoretical significance in that it addresses a major open problem in the field, and practical significance in providing self-reconfiguring robots with increased capabilities

    Scaling Simulations of Reconfigurable Meshes

    This dissertation deals with reconfigurable bus-based models, a new type of parallel machine that uses dynamically alterable connections between processors to allow efficient communication and to perform fast computations. We focus this work on the Reconfigurable Mesh (R-Mesh), one of the most widely studied reconfigurable models. We study the ability of the R-Mesh to adapt an algorithm instance of arbitrary size to run on a given smaller model size without significant loss of efficiency. A scaling simulation achieves this adaptation, and the simulation overhead expresses the efficiency of the simulation. We construct a scaling simulation for the Fusing-Restricted Reconfigurable Mesh (FR-Mesh), an important restriction of the R-Mesh. The overhead of this simulation depends only on the simulating machine size and not on the simulated machine size. The results of this scaling simulation extend to a variety of concurrent write rules and also translate to an improved scaling simulation of the R-Mesh itself. We present a bus linearization procedure that transforms an arbitrary non-linear bus configuration of an R-Mesh into an equivalent acyclic linear bus configuration implementable on a Linear Reconfigurable Mesh (LR-Mesh), a weaker version of the R-Mesh. This procedure gives the algorithm designer the liberty of using buses of arbitrary shape, while automatically translating the algorithm to run on a simpler platform. We illustrate our bus linearization method through two important applications. The first leads to a faster scaling simulation of the R-Mesh. The second application adapts algorithms designed for R-Meshes to run on models with pipelined optical buses. We also present a simulation of a Directional Reconfigurable Mesh (DR-Mesh) on an LR-Mesh. This simulation has a much better efficiency compared to previous work. In addition to the LR-Mesh, this simulation also runs on models that use pipelined optical buses
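    A toy software model of one reconfigurable-bus step may help: processors choose which links to fuse, fused links form buses (connected components), and a value written anywhere on a bus is readable everywhere on it. The sketch below computes the OR of the bits written on each bus; it collapses the R-Mesh's per-processor port partitions into a single fuse/don't-fuse choice per link, so it illustrates the model's flavour rather than its exact semantics.

```cpp
// Toy model of one reconfigurable-bus step: processors fuse links with
// neighbours, fused links form buses (connected components), and every
// processor reads the OR of the bits written on its bus. This simplifies
// the R-Mesh's port-partition semantics to one fuse bit per link.
#include <cstddef>
#include <numeric>
#include <vector>

struct UnionFind {
  std::vector<std::size_t> parent;
  explicit UnionFind(std::size_t n) : parent(n) {
    std::iota(parent.begin(), parent.end(), 0);
  }
  std::size_t find(std::size_t a) {
    while (parent[a] != a) a = parent[a] = parent[parent[a]];  // path halving
    return a;
  }
  void unite(std::size_t a, std::size_t b) { parent[find(a)] = find(b); }
};

// rows*cols processors; fuse_right[i] / fuse_down[i] say whether processor i
// fuses the link to its right / lower neighbour; wrote[i] is the bit it
// writes. Returns, per processor, the OR of all bits written on its bus.
std::vector<int> bus_or_step(std::size_t rows, std::size_t cols,
                             const std::vector<int> &fuse_right,
                             const std::vector<int> &fuse_down,
                             const std::vector<int> &wrote) {
  UnionFind uf(rows * cols);
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      const std::size_t i = r * cols + c;
      if (c + 1 < cols && fuse_right[i]) uf.unite(i, i + 1);
      if (r + 1 < rows && fuse_down[i]) uf.unite(i, i + cols);
    }
  }
  std::vector<int> bus_value(rows * cols, 0);
  for (std::size_t i = 0; i < rows * cols; ++i) {
    if (wrote[i]) bus_value[uf.find(i)] = 1;  // write onto the shared bus
  }
  std::vector<int> result(rows * cols, 0);
  for (std::size_t i = 0; i < rows * cols; ++i) {
    result[i] = bus_value[uf.find(i)];        // every processor reads its bus
  }
  return result;
}
```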

    Heterogeneous Self-Reconfiguring Robotics

    Self-reconfiguring (SR) robots are modular systems that can autonomously change shape, or reconfigure, for increased versatility and adaptability in unknown environments. In this thesis, we investigate planning and control for systems of non-identical modules, known as heterogeneous SR robots. Although previous approaches rely on module homogeneity as a critical property, we show that the planning complexity of fundamental algorithmic problems in the heterogeneous case is equivalent to that of systems with identical modules. Primarily, we study the problem of how to plan shape changes while considering the placement of specific modules within the structure. We characterize this key challenge in terms of the amount of free space available to the robot and develop a series of decentralized reconfiguration planning algorithms that assume progressively more severe free space constraints and support reconfiguration among obstacles. In addition, we compose our basic planning techniques in different ways to address problems in the related task domains of positioning modules according to function, locomotion among obstacles, self-repair, and recognizing the achievement of distributed goal-states. We also describe the design of a novel simulation environment, implementation results using this simulator, and experimental results in hardware using a planar SR system called the Crystal Robot. These results encourage development of heterogeneous systems. Our algorithms enhance the versatility and adaptability of SR robots by enabling them to use functionally specialized components to match capability, in addition to shape, to the task at hand
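    As a toy illustration of the placement constraint that heterogeneity adds (modules must end up where their capability is needed), the sketch below greedily assigns typed modules to typed goal cells by Manhattan distance. It is a stand-in example only and does not reproduce the decentralized planners developed in the thesis.

```cpp
// Toy illustration of heterogeneity in reconfiguration planning: typed
// modules must reach goal cells requiring their type. Greedy nearest-match
// by Manhattan distance; a stand-in, not one of the thesis's planners.
#include <cstddef>
#include <cstdlib>
#include <vector>

struct Cell {
  int x, y;
  int type;  // e.g. 0 = structural, 1 = sensor, 2 = battery
};

// Returns, for each module, the index of its assigned goal (-1 if no goal
// of the matching type remains).
std::vector<int> assign_modules(const std::vector<Cell> &modules,
                                const std::vector<Cell> &goals) {
  std::vector<int> assignment(modules.size(), -1);
  std::vector<bool> taken(goals.size(), false);
  for (std::size_t m = 0; m < modules.size(); ++m) {
    int best = -1;
    int best_dist = 0;
    for (std::size_t g = 0; g < goals.size(); ++g) {
      if (taken[g] || goals[g].type != modules[m].type) continue;
      const int d = std::abs(goals[g].x - modules[m].x) +
                    std::abs(goals[g].y - modules[m].y);
      if (best < 0 || d < best_dist) {
        best = static_cast<int>(g);
        best_dist = d;
      }
    }
    if (best >= 0) {
      assignment[m] = best;
      taken[static_cast<std::size_t>(best)] = true;
    }
  }
  return assignment;
}
```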