3 research outputs found

    A Many-Core Overlay for High-Performance Embedded Computing on FPGAs

    Get PDF
    In this work, we propose a configurable many-core overlay for high-performance embedded computing. The size of internal memory, supported operations and number of ports can be configured independently for each core of the overlay. The overlay was evaluated with matrix multiplication, LU decomposition and Fast-Fourier Transform (FFT) on a ZYNQ-7020 FPGA platform. The results show that using a system-level many-core overlay avoids complex hardware design and still provides good performance results.Comment: Presented at First International Workshop on FPGAs for Software Programmers (FSP 2014) (arXiv:1408.4423

    Accelerating Linear Algebra and Machine Learning Kernels on a Massively Parallel Reconfigurable Architecture

    Get PDF
    abstract: This thesis presents efficient implementations of several linear algebra kernels, machine learning kernels and a neural network based recommender systems engine onto a massively parallel reconfigurable architecture, Transformer. The linear algebra kernels include Triangular Matrix Solver (TRSM), LU Decomposition (LUD), QR Decomposition (QRD), and Matrix Inversion. The machine learning kernels include an LSTM (Long Short Term Memory) cell, and a GRU (gated Recurrent Unit) cell used in recurrent neural networks. The neural network based recommender systems engine consists of multiple kernels including fully connected layers, embedding layer, 1-D batchnorm, Adam optimizer, etc. Transformer is a massively parallel reconfigurable multicore architecture designed at the University of Michigan. The Transformer configuration considered here is 4 tiles and 16 General Processing Elements (GPEs) per tile. It supports a two level cache hierarchy where the L1 and L2 caches can operate in shared (S) or private (P) modes. The architecture was modeled using Gem5 and cycle accurate simulations were done to evaluate the performance in terms of execution times, giga-operations per second per Watt (GOPS/W), and giga-floating-point-operations per second per Watt (GFLOPS/W). This thesis shows that for linear algebra kernels, each kernel achieves high performance for a certain cache mode and that this cache mode can change when the matrix size changes. For instance, for smaller matrix sizes, L1P, L2P cache mode is best for TRSM, while L1S, L2S is the best cache mode for LUD, and L1P, L2S is the best for QRD. For each kernel, the optimal cache mode changes when the matrix size is increased. For instance, for TRSM, the L1P, L2P cache mode is best for smaller matrix sizes (N=64,128,256,512N=64, 128, 256, 512) and it changes to L1S, L2P for larger matrix sizes (N=1024N=1024). For machine learning kernels, L1P, L2P is the best cache mode for all network parameter sizes. Gem5 simulations show that the peak performance for TRSM, LUD, QRD and Matrix Inverse in the 14nm node is 97.5, 59.4, 133.0 and 83.05 GFLOPS/W, respectively. For LSTM and GRU, the peak performance is 44.06 and 69.3 GFLOPS/W. The neural network based recommender system was implemented in L1S, L2S cache mode. It includes a forward pass and a backward pass and is significantly more complex in terms of both computational complexity and data movement. The most computationally intensive block is the fully connected layer followed by Adam optimizer. The overall performance of the recommender systems engine is 54.55 GFLOPS/W and 169.12 GOPS/W.Dissertation/ThesisMasters Thesis Electrical Engineering 201
    corecore