
    Run-Time-Reconfigurable Multi-Precision Floating-Point Matrix Multiplier Intellectual Property Core on FPGA

    In today's world, computationally demanding applications such as image processing, digital signal processing, graphics, and robotics require enormous computing power. These applications rely heavily on matrix operations, especially matrix multiplication, which demands considerable computation time and is complex to design in hardware. For such applications, field-programmable gate arrays can serve as low-cost hardware accelerators alongside a low-cost general-purpose processor, instead of a high-cost application-specific processor. In this work, we employ the efficient Strassen's algorithm for matrix multiplication and a highly efficient run-time-reconfigurable floating-point multiplier for multiplying the matrix elements. The run-time-reconfigurable floating-point multiplier is implemented with a custom floating-point format for variable-precision applications. A highly efficient combination of the Karatsuba and Urdhva Tiryagbhyam algorithms is used to implement the binary multiplier. By reconfiguring itself at run time, this design can effectively adjust its power and delay to different accuracy requirements.
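
    For orientation, the sketch below shows the classic Strassen recursion that the abstract refers to, written in plain NumPy. It is only a reference for the algorithm itself, not for the paper's FPGA multiplier; the power-of-two size restriction and the function name strassen are assumptions made for brevity.

    ```python
    import numpy as np

    def strassen(A, B):
        """Multiply two square matrices whose size is a power of two
        using Strassen's seven-multiplication recursion."""
        n = A.shape[0]
        if n == 1:
            return A * B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # Seven recursive block products instead of the eight of the naive method
        M1 = strassen(A11 + A22, B11 + B22)
        M2 = strassen(A21 + A22, B11)
        M3 = strassen(A11, B12 - B22)
        M4 = strassen(A22, B21 - B11)
        M5 = strassen(A11 + A12, B22)
        M6 = strassen(A21 - A11, B11 + B12)
        M7 = strassen(A12 - A22, B21 + B22)
        # Recombine the seven products into the four output blocks
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

    # Example: multiply two random 4x4 matrices
    C = strassen(np.random.rand(4, 4), np.random.rand(4, 4))
    ```

    Replacing the eight block products of the naive method with seven is what lowers the asymptotic cost from O(n^3) to roughly O(n^2.81), which is why a fast element multiplier pays off at every level of the recursion.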

    Criteria and Approaches for Virtualization on Modern FPGAs

    Modern field-programmable gate arrays (FPGAs) can deliver high performance across a wide range of applications, and their computational capacity is becoming abundant in personal computers. Despite this, FPGA virtualization is still an emerging research field. The challenges in this area come not only from technical difficulties but also from the ambiguous standards of virtualization. In this paper, we introduce novel criteria for FPGA virtualization and discuss several approaches to meeting those criteria. In addition, we present and describe in detail the specific FPGA virtualization architecture that we developed on an Intel Arria 10 FPGA. We evaluate our solution with a combination of applications and microbenchmarks. The results show that our virtualization solution can provide a full abstraction of the FPGA device from both the user and the developer perspectives while maintaining reasonable performance compared to a native FPGA.

    Polystore++: Accelerated Polystore System for Heterogeneous Workloads

    Modern real-time business analytics consist of heterogeneous workloads (e.g., database queries, graph processing, and machine learning). These analytics applications need programming environments that can capture all aspects of the constituent workloads, including the data models they work on and the movement of data across processing engines. Polystore systems suit such applications; however, these systems currently execute on CPUs, and the slowdown of Moore's Law means they cannot meet the performance and efficiency requirements of modern workloads. We envision Polystore++, an architecture that accelerates existing polystore systems using hardware accelerators (e.g., FPGAs, CGRAs, and GPUs). Polystore++ systems can achieve high performance at low power by identifying and offloading the components of a polystore system that are amenable to acceleration on specialized hardware. Building a Polystore++ system is challenging and introduces new research problems motivated by the use of hardware accelerators, such as optimizing and mapping query plans across heterogeneous computing units and exploiting hardware pipelining and parallelism to improve performance. In this paper, we discuss these challenges in detail and list possible approaches to address them.
    Comment: 11 pages, Accepted in ICDCS 201

    A configurable accelerator for manycores: the Explicitly Many-Processor Approach

    A new approach to designing processor accelerators is presented. A new computing model and a special kind of accelerator with a dynamic (end-user programmable) architecture are suggested. The new model considers a processor in which a newly introduced supervisor layer coordinates the work of the cores. Based on the parallelization information provided by the compiler, and with the help of the supervisor, the cores can outsource part of the job they receive to a neighbouring core. These changes substantially and advantageously modify the architecture and operation of computing systems: computing throughput increases drastically, the efficiency of the technological implementation (computing performance per logic gate) improves, the non-payload activity spent on operating system services decreases, the real-time behavior changes advantageously, and connecting accelerators to the processor becomes much simpler. Here, only some details of the architecture and operation of the processor are discussed; the rest is described elsewhere.
    Comment: 12 pages, 6 figures

    Renewing computing paradigms for more efficient parallelization of single-threads

    Computing is still based on the 70-year-old paradigms introduced by von Neumann. The need for more performant, more comfortable, and safer computing has forced the development and use of numerous tricks in both hardware and software. Until now, technology has enabled performance increases without changing the basic computing paradigms. The recent stalling of single-threaded performance, however, requires redesigning computing to keep delivering the expected performance. To do so, the computing paradigms themselves must be scrutinized. The limitations caused by an overly restrictive interpretation of the computing paradigms are demonstrated, an extended computing paradigm is introduced, ideas for changing elements of the computing stack are suggested, and some implementation details of both the hardware and the software are discussed. The resulting new computing stack offers considerably higher computing throughput, a simplified hardware architecture, drastically improved real-time behavior, and, in general, simpler and more efficient computing.
    Comment: 28 pages; 7 figures

    FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review

    Due to recent advances in digital technologies and the availability of credible data, deep learning, an area of artificial intelligence, has emerged and demonstrated its ability and effectiveness in solving complex learning problems not possible before. In particular, convolutional neural networks (CNNs) have demonstrated their effectiveness in image detection and recognition applications. However, they require intensive computation and memory bandwidth, so general-purpose CPUs fail to achieve the desired performance levels. Consequently, hardware accelerators that use application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and graphics processing units (GPUs) have been employed to improve the throughput of CNNs. More precisely, FPGAs have recently been adopted for accelerating the implementation of deep learning networks thanks to their ability to maximize parallelism and their energy efficiency. In this paper, we review recent techniques for accelerating deep learning networks on FPGAs. We highlight the key features employed by the various techniques to improve acceleration performance. In addition, we provide recommendations for enhancing the utilization of FPGAs for CNN acceleration. The techniques investigated in this paper represent the recent trends in FPGA-based accelerators for deep learning networks; thus, this review is expected to direct future advances on efficient hardware accelerators and to be useful for deep learning researchers.
    Comment: This article has been accepted for publication in IEEE Access (December 2018)

    Optical Hardware Accelerators using Nonlinear Dispersion Modes for Energy Efficient Computing

    This paper proposes a new class of hardware accelerators to alleviate bottlenecks in the acquisition, analytics, storage, and computation of information carried by wideband streaming signals.
    Comment: 12 figures

    Smart technologies for effective reconfiguration: the FASTER approach

    Current and future computing systems increasingly require that their functionality stay flexible after the system is operational, in order to cope with changing user requirements and improvements in system features, e.g., changing protocols and data-coding standards, evolving demands for support of different user applications, and newly emerging applications in communication, computing, and consumer electronics. Extending the functionality and lifetime of products therefore requires adding new functionality to track and satisfy customers' needs as well as market and technology trends. Many contemporary products incorporate, alongside the software part, hardware accelerators for reasons of performance and power efficiency. While adapting software is straightforward, adapting the hardware to changing requirements constitutes a challenging problem that requires delicate solutions. The FASTER (Facilitating Analysis and Synthesis Technologies for Effective Reconfiguration) project aims at introducing a complete methodology that allows designers to easily implement a system specification on a platform comprising a general-purpose processor combined with multiple accelerators running on an FPGA, taking a high-level description as input and fully exploiting, both at design time and at run time, the capabilities of partial dynamic reconfiguration. The goal is that, for selected application domains, the FASTER toolchain will reduce the design and verification time of complex reconfigurable systems, providing additional novel verification features that are not available in existing tool flows.

    GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks

    Generative Adversarial Networks (GANs) are among the most recent deep learning models; they generate synthetic data from limited genuine datasets. GANs are on the frontier because further extension of deep learning into many domains (e.g., medicine, robotics, content synthesis) requires massive sets of labeled data that are generally either unavailable or prohibitively costly to collect. Although GANs are gaining prominence in various fields, there are no accelerators for these new models. In fact, GANs leverage a new operator, called transposed convolution, that exposes unique challenges for hardware acceleration. This operator first inserts zeros within the multidimensional input and then convolves a kernel over this expanded array to add information to the embedded zeros. Even though this operator contains a convolution stage, the inserted zeros lead to underutilization of the compute resources when a conventional convolution accelerator is employed. We propose the GANAX architecture to alleviate the sources of inefficiency associated with accelerating GANs on conventional convolution accelerators, making the first GAN accelerator design possible. We propose a reorganization of the output computations that allocates compute rows with similar patterns of zeros to adjacent processing engines, which also avoids inconsequential multiply-adds on the zeros. This compulsory adjacency reclaims data reuse across neighboring processing engines, which would otherwise diminish because of the inserted zeros. The reordering breaks the full SIMD execution model that is prominent in convolution accelerators. Therefore, we propose a unified MIMD-SIMD design for GANAX that leverages repeated patterns in the computation to create distinct microprograms that execute concurrently in SIMD mode.
    Comment: Proceedings of the 45th International Symposium on Computer Architecture (ISCA), 201
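
    As a point of reference for the transposed-convolution behaviour described above, the sketch below expresses the operator exactly as the abstract describes it: zero-insertion followed by an ordinary convolution. It is plain NumPy written for illustration only; the function name, the stride default, and the 'full' padding choice are assumptions, and it says nothing about how GANAX itself schedules the work.

    ```python
    import numpy as np

    def transposed_conv2d(x, k, stride=2):
        """Transposed convolution, written the way the abstract describes it:
        insert zeros between the input elements, then run a regular convolution."""
        h, w = x.shape
        kh, kw = k.shape
        # Step 1: dilate the input with (stride - 1) zeros between neighbouring elements.
        up = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1), dtype=float)
        up[::stride, ::stride] = x
        # Step 2: ordinary 'full' convolution over the zero-expanded array.
        # Most of these multiply-adds hit inserted zeros, which is the compute
        # underutilization the abstract points at.
        pad = np.pad(up, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
        out = np.empty((up.shape[0] + kh - 1, up.shape[1] + kw - 1))
        kf = k[::-1, ::-1]  # convolution = correlation with the flipped kernel
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(pad[i:i + kh, j:j + kw] * kf)
        return out

    # Example: upsample a 3x3 feature map with a 3x3 kernel and stride 2
    y = transposed_conv2d(np.arange(9.0).reshape(3, 3), np.ones((3, 3)))
    ```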

    NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors

    © 2016 Cheung, Schultz and Luk. NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high-performance computing systems that uses customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation and deliver optimized performance, for example in the degree of parallelism to employ. The compilation process supports PyNN, a simulator-independent neural network description language, for configuring the processor. NeuroFlow supports a number of commonly used current- or conductance-based neuronal models, such as the integrate-and-fire and Izhikevich models, as well as the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve real-time performance for 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times over an 8-core processor, or 2.83 times over GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation.
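
    For readers unfamiliar with the neuron models named above, the following plain-NumPy sketch of the Izhikevich model (forward-Euler update, regular-spiking parameters) shows what a single simulated neuron amounts to. It is illustrative only and unrelated to NeuroFlow's FPGA implementation or its PyNN front end; the function name, time step, and input trace are assumptions.

    ```python
    import numpy as np

    def izhikevich(I, a=0.02, b=0.2, c=-65.0, d=8.0, dt=0.5):
        """Simulate one Izhikevich neuron driven by an input-current trace I
        (regular-spiking parameters, simple forward-Euler integration)."""
        v, u = c, b * c
        v_trace, spikes = [], []
        for step, i_ext in enumerate(I):
            if v >= 30.0:                 # spike: record it, then reset
                spikes.append(step * dt)
                v, u = c, u + d
            # Izhikevich (2003) membrane and recovery-variable updates
            v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + i_ext)
            u += dt * a * (b * v - u)
            v_trace.append(v)
        return np.array(v_trace), spikes

    # Example: 1000 steps with a constant input switched on after 100 steps
    I = np.concatenate([np.zeros(100), 10.0 * np.ones(900)])
    v, spike_times = izhikevich(I)
    ```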