
    Algorithm Architecture Co-design for Dense and Sparse Matrix Computations

    abstract: With the end of Dennard scaling and Moore's law, architects have moved towards heterogeneous designs consisting of specialized cores to achieve higher performance and energy efficiency for a target application domain. Linear algebra is ubiquitous in scientific computing, machine learning, and statistics, and matrix computations are fundamental to these linear-algebra-based solutions. Designing multiple dense (or sparse) matrix computation routines on the same platform is challenging, and the difficulty is compounded by the large differences between dense and sparse matrix computations in their storage and access patterns, which make them hard to optimize on the same architecture. This thesis addresses this challenge and introduces a reconfigurable accelerator that supports both dense and sparse matrix computations efficiently. The reconfigurable architecture has been optimized to execute the following linear algebra routines: GEMV (Dense General Matrix Vector Multiplication), GEMM (Dense General Matrix Matrix Multiplication), TRSM (Triangular Matrix Solver), LU Decomposition, Matrix Inverse, SpMV (Sparse Matrix Vector Multiplication), and SpMM (Sparse Matrix Matrix Multiplication). It is a multicore architecture where each core consists of a 2D array of processing elements (PEs). The 4x4 PE array is scheduled to perform 4x4 matrix updates efficiently, and a sequence of such updates solves a larger problem inside a core. A novel partitioned block compressed sparse data structure (PBCSC/PBCSR) is used to perform sparse kernel updates. Scalable partitioning and mapping schemes are presented that map input matrices of any given size to the multicore architecture. Design trade-offs related to the PE array dimension, the size of local memory inside a core, and the bandwidth between on-chip memories and the cores are analyzed, and an optimal core configuration is derived from this analysis. Synthesis results using a 7nm PDK show that the proposed accelerator can achieve a performance of up to 32 GOPS using a single core.
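    The idea of decomposing a large (possibly sparse) problem into a stream of dense 4x4 block updates can be sketched in a few lines. The Python below is a minimal, illustrative model only, not the thesis's actual PBCSC/PBCSR implementation: it stores just the nonzero 4x4 blocks of a matrix and performs SpMV as a sequence of dense 4x4 updates, the unit of work the PE array is sized for.

```python
import numpy as np

B = 4  # block size matching the 4x4 PE array

def to_block_csr(A, B=4):
    """Illustrative block-compressed layout: keep only 4x4 blocks that
    contain at least one nonzero, with CSR-style block-row pointers and
    block-column indices (a stand-in for the thesis's PBCSC/PBCSR)."""
    m, n = A.shape
    blocks, row_ptr, col_idx = [], [0], []
    for bi in range(0, m, B):
        for bj in range(0, n, B):
            blk = A[bi:bi+B, bj:bj+B]
            if np.any(blk):
                blocks.append(blk.copy())
                col_idx.append(bj // B)
        row_ptr.append(len(blocks))
    return row_ptr, col_idx, blocks

def block_spmv(row_ptr, col_idx, blocks, x, m, B=4):
    """SpMV expressed as a sequence of dense 4x4 block updates,
    mirroring how a core would schedule work on the PE array."""
    y = np.zeros(m)
    for br in range(len(row_ptr) - 1):
        for k in range(row_ptr[br], row_ptr[br + 1]):
            bc = col_idx[k]
            y[br*B:(br+1)*B] += blocks[k] @ x[bc*B:(bc+1)*B]
    return y

# Quick check against a dense reference:
A = np.zeros((8, 8)); A[0:4, 4:8] = np.arange(16).reshape(4, 4)
rp, ci, bl = to_block_csr(A)
x = np.ones(8)
assert np.allclose(block_spmv(rp, ci, bl, x, 8), A @ x)
```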

    Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems

    In modern Commercial Off-The-Shelf (COTS) multicore systems, each core can generate many parallel memory requests at a time. The processing of these parallel requests in the DRAM controller greatly affects the memory interference delay experienced by tasks running on the platform. In this paper, we model a modern COTS multicore system with a non-blocking last-level cache (LLC) and a DRAM controller that prioritizes reads over writes. To minimize interference, we focus on LLC- and DRAM-bank-partitioned systems. Based on the model, we propose an analysis that computes a safe upper bound on the worst-case memory interference delay. We validated our analysis on a real COTS multicore platform with a set of carefully designed synthetic benchmarks as well as SPEC2006 benchmarks. Evaluation results show that our analysis more accurately captures the worst-case memory interference delay and provides safer upper bounds than a recently proposed analysis that significantly under-estimates the delay.
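    To make the flavor of such a bound concrete, here is a deliberately simplified, additive interference calculation in the style of request-driven analyses. It is an assumption-laden sketch, not the paper's equations: it assumes bank-partitioned DRAM, a fixed worst-case per-request service time, and that each interfering core can keep a bounded number of requests in flight (its MSHR count).

```python
def interference_bound(H, num_cores, L_service_ns, mshrs_per_core):
    """Upper-bound the extra delay (ns) a task with H LLC misses can
    suffer, assuming each of its requests may wait behind up to one
    batch of in-flight requests from every other core, and each such
    request occupies the shared DRAM resources for at most
    L_service_ns. (Illustrative model only.)"""
    interfering = (num_cores - 1) * mshrs_per_core
    return H * interfering * L_service_ns

# Example: 4 cores, 50 ns worst-case per-request service, 6 MSHRs/core.
# A task with 1000 LLC misses gets a bound of 1000 * 18 * 50 ns = 0.9 ms.
print(interference_bound(H=1000, num_cores=4, L_service_ns=50,
                         mshrs_per_core=6))
```

    The point the paper makes is precisely that analyses ignoring this per-core request parallelism (the `mshrs_per_core` factor above) can under-estimate the true worst-case delay.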

    AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

    CPU-FPGA heterogeneous architectures are attracting ever-increasing attention in an attempt to advance computational capabilities and energy efficiency in today's datacenters. These architectures provide programmers with the ability to reprogram the FPGAs for flexible acceleration of many workloads. Nonetheless, this advantage is often overshadowed by the poor programmability of FPGAs, whose programming is conventionally an RTL design practice. Although recent advances in high-level synthesis (HLS) significantly improve FPGA programmability, programmers still face the challenge of identifying the optimal design configuration in a tremendous design space. This paper aims to address this challenge and pave the path from software programs towards high-quality FPGA accelerators. Specifically, we first propose the composable, parallel and pipeline (CPP) microarchitecture as a template for accelerator designs. Such a well-defined template is able to support efficient accelerator designs for a broad class of computation kernels and, more importantly, drastically reduces the design space. Also, we introduce an analytical model to capture the performance and resource trade-offs among different design configurations of the CPP microarchitecture, which lays the foundation for fast design space exploration. On top of the CPP microarchitecture and its analytical model, we develop the AutoAccel framework to make the entire accelerator generation automated. AutoAccel accepts a software program as input and performs a series of code transformations, based on the result of the analytical-model-based design space exploration, to construct the desired CPP microarchitecture. Our experiments show that the AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72x for a broad class of computation kernels.
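    Analytical-model-based design space exploration of this kind can be illustrated with a toy enumerator. The cost model and parameter names below (`unroll`, `ii`, DSPs per PE) are assumptions for illustration, not AutoAccel's actual model: the sketch enumerates candidate configurations, prunes those that exceed a resource budget, and keeps the one with the lowest estimated cycle count.

```python
from itertools import product

def explore(n_iters, dsp_budget, dsps_per_pe=5):
    """Toy analytical-model DSE: pick the (unroll, initiation interval)
    pair with the lowest estimated cycle count that fits the DSP budget."""
    best = None
    for unroll, ii in product([1, 2, 4, 8, 16], [1, 2, 4]):
        dsps = unroll * dsps_per_pe
        if dsps > dsp_budget:
            continue  # prune configurations that exceed the resource budget
        # Estimated cycles: iterations divided across `unroll` parallel
        # PEs, each pipelined at initiation interval `ii`.
        cycles = (n_iters // unroll) * ii
        if best is None or cycles < best[0]:
            best = (cycles, unroll, ii, dsps)
    return best

# For 1M loop iterations and 64 DSPs, the model picks unroll=8, ii=1.
print(explore(n_iters=1_000_000, dsp_budget=64))
```

    Because the model is closed-form, the whole space can be swept in microseconds, which is what makes this kind of exploration fast compared to synthesizing each candidate design.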

    FASTCUDA: Open Source FPGA Accelerator & Hardware-Software Codesign Toolset for CUDA Kernels

    Using FPGAs as hardware accelerators that communicate with a central CPU is becoming common practice in the embedded design world, but there is no standard methodology and toolset to facilitate this path yet. On the other hand, languages such as CUDA and OpenCL provide standard development environments for Graphics Processing Unit (GPU) programming. FASTCUDA is a platform that provides the necessary software toolset, hardware architecture, and design methodology to efficiently adapt the CUDA approach into a new FPGA design flow. With FASTCUDA, the CUDA kernels of a CUDA-based application are partitioned into two groups with minimal user intervention: those that are compiled and executed in parallel as software, and those that are synthesized and implemented in hardware. A modern low-power FPGA can provide the processing power (via numerous embedded micro-CPUs) and the logic capacity for both the software and hardware implementations of the CUDA kernels. This paper describes the system requirements and the architectural decisions behind the FASTCUDA approach.
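    One plausible way to picture such a hardware/software split is a greedy, area-budgeted heuristic. The sketch below is hypothetical: the abstract does not specify FASTCUDA's partitioning algorithm, so the profit metric (estimated speedup per unit area) and all names here are illustrative assumptions.

```python
def partition_kernels(kernels, fpga_area_budget):
    """kernels: list of (name, est_hw_speedup, est_area).
    Greedily move the most profitable kernels to hardware until the
    FPGA logic budget is exhausted; the rest run as parallel software
    on the embedded micro-CPUs. (Hypothetical heuristic, not
    FASTCUDA's documented algorithm.)"""
    hw, sw, used = [], [], 0
    for name, speedup, area in sorted(kernels, key=lambda k: -k[1] / k[2]):
        if used + area <= fpga_area_budget:
            hw.append(name)
            used += area
        else:
            sw.append(name)
    return hw, sw

hw, sw = partition_kernels(
    [("saxpy", 8.0, 300), ("reduce", 3.0, 150), ("scan", 12.0, 700)],
    fpga_area_budget=800)
print("hardware:", hw, "software:", sw)  # hardware: saxpy, reduce
```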

    P2IP: A novel low-latency Programmable Pipeline Image Processor

    Get PDF
    This paper presents P2IP, a novel systolic Coarse-Grained Reconfigurable Architecture for real-time image and video processing. P2IP is a scalable architecture that combines the low-latency characteristic of systolic array architectures with a runtime-reconfigurable datapath. This reconfigurability enables P2IP to perform a wide range of image pre-processing tasks directly on a pixel stream. The versatility of P2IP is demonstrated through three image processing algorithms mapped onto the architecture and implemented on an FPGA-based platform. The obtained results show that P2IP can achieve up to 129 fps in Full HD 1080p and 32 fps in 4K 2160p, which makes it suitable for modern high-definition applications.
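    The notion of a reconfigurable datapath operating directly on a pixel stream can be modeled, very loosely, as a composable chain of per-pixel stages that is swapped out at runtime. This Python sketch is a behavioral analogy only (stage names and structure are assumptions, not the P2IP hardware):

```python
def make_pipeline(stages):
    """Compose per-pixel stages into one streaming function; pixels
    flow through the fixed stage chain one after another, loosely
    analogous to a systolic, low-latency datapath."""
    def run(pixel_stream):
        for p in pixel_stream:
            for stage in stages:
                p = stage(p)
            yield p
    return run

# "Reconfigure" by swapping the stage list at runtime:
invert    = lambda p: 255 - p
threshold = lambda p: 255 if p > 128 else 0
pipe = make_pipeline([invert, threshold])
print(list(pipe([10, 200, 130])))  # -> [255, 0, 0]
```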