2,657 research outputs found

    FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture

    Full text link
    Neural Network (NN) accelerators with emerging ReRAM (resistive random access memory) technologies have been investigated as one of the promising solutions to address the \textit{memory wall} challenge, due to the unique capability of \textit{processing-in-memory} within ReRAM-crossbar-based processing elements (PEs). However, the high efficiency and high density advantages of ReRAM have not been fully utilized due to the huge communication demands among PEs and the overhead of peripheral circuits. In this paper, we propose a full system stack solution, composed of a reconfigurable architecture design, Field Programmable Synapse Array (FPSA) and its software system including neural synthesizer, temporal-to-spatial mapper, and placement & routing. We highly leverage the software system to make the hardware design compact and efficient. To satisfy the high-performance communication demand, we optimize it with a reconfigurable routing architecture and the placement & routing tool. To improve the computational density, we greatly simplify the PE circuit with the spiking schema and then adopt neural synthesizer to enable the high density computation-resources to support different kinds of NN operations. In addition, we provide spiking memory blocks (SMBs) and configurable logic blocks (CLBs) in hardware and leverage the temporal-to-spatial mapper to utilize them to balance the storage and computation requirements of NN. Owing to the end-to-end software system, we can efficiently deploy existing deep neural networks to FPSA. Evaluations show that, compared to one of state-of-the-art ReRAM-based NN accelerators, PRIME, the computational density of FPSA improves by 31x; for representative NNs, its inference performance can achieve up to 1000x speedup.Comment: Accepted by ASPLOS 201

    Modeling of a hardware VLSI placement system: Accelerating the Simulated Annealing algorithm

    Get PDF
    An essential step in the automation of electronic design is the placement of the physical components on the target semiconductor die. The placement step presents the opportunity to reduce costs in terms of wire length and performance degradation; however it is compute intensive and is NP-complete in terms of obtaining an optimal solution. As designs have grown in complexity and gate count, obtaining an optimal solution is not feasible due to time to market constraints or sheer compute effort required. Heuristic algorithms allow for efficient but sub-optimal designs to be produced with a reduction in processing time. A widely used algorithm is Simulated Annealing (SA). The goal of this work was to develop a model that would enable an analysis into the feasibility of developing a hardware accelerated placement system which uses SA at its core. The SA heuristic was analyzed for possible improvements in efficiency with focus given to targeting the system for hardware. A solution implementing parallel computing with specialized hardware configurations inside a field programmable gate array (FPGA) was investigated as having the possibility to improve the efficiency of the SA-based algorithm. All supporting subsystems were also described for a hardware accelerated model. A large speedup was analytically shown from both accelerating the critical path of the SA algorithm as well as novel methods of improving SA\u27s efficiency. As data throughput requirements were not included in this work, the results presented may be optimistic for an overall system speedup. However, the results clearly show that future work is warranted in studying the concept of a hardware accelerated placement system

    Timing Measurement Platform for Arbitrary Black-Box Circuits Based on Transition Probability

    No full text

    Achieving High Speed CFD simulations: Optimization, Parallelization, and FPGA Acceleration for the unstructured DLR TAU Code

    Get PDF
    Today, large scale parallel simulations are fundamental tools to handle complex problems. The number of processors in current computation platforms has been recently increased and therefore it is necessary to optimize the application performance and to enhance the scalability of massively-parallel systems. In addition, new heterogeneous architectures, combining conventional processors with specific hardware, like FPGAs, to accelerate the most time consuming functions are considered as a strong alternative to boost the performance. In this paper, the performance of the DLR TAU code is analyzed and optimized. The improvement of the code efficiency is addressed through three key activities: Optimization, parallelization and hardware acceleration. At first, a profiling analysis of the most time-consuming processes of the Reynolds Averaged Navier Stokes flow solver on a three-dimensional unstructured mesh is performed. Then, a study of the code scalability with new partitioning algorithms are tested to show the most suitable partitioning algorithms for the selected applications. Finally, a feasibility study on the application of FPGAs and GPUs for the hardware acceleration of CFD simulations is presented

    A Parallel Hardware Architecture For Quantum Annealing Algorithm Acceleration

    Get PDF
    Quantum Annealing (QA) is an emerging technique, derived from Simulated Annealing, providing metaheuristics for multivariable optimisation problems. Studies have shown that it can be applied to solve NP-hard problems with faster convergence and better quality of result than other traditional heuristics, with potential applications in a variety of fields, from transport logistics to circuit synthesis and optimisation. In this paper, we present a hardware architecture implementing a QA-based solver for the Multidimensional Knapsack Problem, designed to improve the performance of the algorithm by exploiting parallelised computation. We synthesised the architecture using as a target an Altera FPGA board and simulated the execution for solving a set of benchmarks available in the literature. Simulation results show that the proposed implementation is about 100 times faster than a single-thread general-purpose CPU without impact on the accuracy of the solution
    corecore