973 research outputs found

    Reconfigurable acceleration of Recurrent Neural Networks

    Get PDF
    Recurrent Neural Networks (RNNs) have been successful in a wide range of applications involving temporal sequences such as natural language processing, speech recognition and video analysis. However, RNNs often require a significant amount of memory and computational resources. In addition, the recurrent nature and data dependencies in RNN computations can lead to system stall, resulting in low throughput and high latency. This work describes novel parallel hardware architectures for accelerating RNN inference using Field-Programmable Gate Array (FPGA) technology, which considers the data dependencies and high computational costs of RNNs. The first contribution of this thesis is a latency-hiding architecture that utilizes column-wise matrix-vector multiplication instead of the conventional row-wise operation to eliminate data dependencies and improve the throughput of RNN inference designs. This architecture is further enhanced by a configurable checkerboard tiling strategy which allows large dimensions of weight matrices, while supporting element-based parallelism and vector-based parallelism. The presented reconfigurable RNN designs show significant speedup over CPU, GPU, and other FPGA designs. The second contribution of this thesis is a weight reuse approach for large RNN models with weights stored in off-chip memory, running with a batch size of one. A novel blocking-batching strategy is proposed to optimize the throughput of large RNN designs on FPGAs by reusing the RNN weights. Performance analysis is also introduced to enable FPGA designs to achieve the best trade-off between area, power consumption and performance. Promising power efficiency improvement has been achieved in addition to speeding up over CPU and GPU designs. The third contribution of this thesis is a low latency design for RNNs based on a partially-folded hardware architecture. It also introduces a technique that balances initiation interval of multi-layer RNN inferences to increase hardware efficiency and throughput while reducing latency. The approach is evaluated on a variety of applications, including gravitational wave detection and Bayesian RNN-based ECG anomaly detection. To facilitate the use of this approach, we open source an RNN template which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools.Open Acces

    Timing analysis of synchronous data flow graphs

    Get PDF
    Consumer electronic systems are getting more and more complex. Consequently, their design is getting more complicated. Typical systems built today are made of different subsystems that work in parallel in order to meet the functional re- quirements of the demanded applications. The types of applications running on such systems usually have inherent timing constraints which should be realized by the system. The analysis of timing guarantees for parallel systems is not a straightforward task. One important category of applications in consumer electronic devices are multimedia applications such as an MP3 player and an MPEG decoder/encoder. Predictable design is the prominent way of simultaneously managing the design complexity of these systems and providing timing guarantees. Timing guarantees cannot be obtained without using analyzable models of computation. Data flow models proved to be a suitable means for modeling and analysis of multimedia applications. Synchronous Data Flow Graphs (SDFGs) is a data flow model of computation that is traditionally used in the domain of Digital Signal Processing (DSP) platforms. Owing to the structural similarity between DSP and multimedia applications, SDFGs are suitable for modeling multimedia applications as well. Besides, various performance metrics can be analyzed using SDFGs. In fact, the combination of expressivity and analysis potential makes SDFGs very interesting in the domain of multimedia applications. This thesis contributes to SDFG analysis. We propose necessary and sufficient conditions to analyze the integrity of SDFGs and we provide techniques to capture prominent performance metrics, namely, throughput and latency. These perfor- mance metrics together with the mentioned sanity checks (conditions) build an appropriate basis for the analysis of the timing behavior of modeled applications. An SDFG is a graph with actors as vertices and channels as edges. Actors represent basic parts of an application which need to be executed. Channels represent data dependencies between actors. Streaming applications essentially continue their execution indefinitely. Therefore, one of the key properties of an SDFG which models such an application is liveness, i.e., whether all actors can run infinitely often. For example, one is usually not interested in a system which completely or partially deadlocks. Another elementary requirement known as boundedness, is whether an implementation of an SDFG is feasible using a lim- ited amount of memory. Necessary and sufficient conditions for liveness and the different types of boundedness are given, as well as algorithms for checking those conditions. Throughput analysis of SDFGs is an important step for verifying throughput requirements of concurrent real-time applications, for instance within design-space exploration activities. In fact, the main reason that SDFGs are used for mod- eling multimedia applications is analysis of the worst-case throughput, as it is essential for providing timing guarantees. Analysis of SDFGs can be hard, since the worst-case complexity of analysis algorithms is often high. This is also true for throughput analysis. In particular, many algorithms involve a conversion to another kind of data flow graph, namely, a homogenous data flow graph, whose size can be exponentially larger than the size of the original graph and in practice often is much larger. The thesis presents a method for throughput analysis of SD- FGs which is based on explicit state-space exploration, avoiding the mentioned conversion. The method, despite its worst-case complexity, works well in practice, while existing methods often fail. Since the state-space exploration method is akin to the simulation of the graph, the result can be easily obtained as a byproduct in existing simulation tools. In various contexts, such as design-space exploration or run-time reconfigu- ration, many throughput computations are required for varying actor execution times. The computations need to be fast because typically very limited resources or time can be dedicated to the analysis. In this thesis, we present methods to compute throughput of an SDFG where execution times of actors can be param- eters. As a result, the throughput of these graphs is obtained in the form of a function of these parameters. Calculation of throughput for different actor exe- cution times is then merely an evaluation of this function for specific parameter values, which is much faster than the standard throughput analysis. Although throughput is a very useful performance indicator for concurrent real-time applications, another important metric is latency. Especially for appli- cations such as video conferencing, telephony and games, latency beyond a certain limit cannot be tolerated. The final contribution of this thesis is an algorithm to determine the minimal achievable latency, providing an execution scheme for executing an SDFG with this latency. In addition, a heuristic is proposed for optimizing latency under a throughput constraint. This heuristic gives optimal latency and throughput results in most cases

    ASAM : Automatic Architecture Synthesis and Application Mapping; dl. 3.2: Instruction set synthesis

    Get PDF
    No abstract

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code

    Efficient machine learning: models and accelerations

    Get PDF
    One of the key enablers of the recent unprecedented success of machine learning is the adoption of very large models. Modern machine learning models typically consist of multiple cascaded layers such as deep neural networks, and at least millions to hundreds of millions of parameters (i.e., weights) for the entire model. The larger-scale model tend to enable the extraction of more complex high-level features, and therefore, lead to a significant improvement of the overall accuracy. On the other side, the layered deep structure and large model sizes also demand to increase computational capability and memory requirements. In order to achieve higher scalability, performance, and energy efficiency for deep learning systems, two orthogonal research and development trends have attracted enormous interests. The first trend is the acceleration while the second is the model compression. The underlying goal of these two trends is the high quality of the models to provides accurate predictions. In this thesis, we address these two problems and utilize different computing paradigms to solve real-life deep learning problems. To explore in these two domains, this thesis first presents the cogent confabulation network for sentence completion problem. We use Chinese language as a case study to describe our exploration of the cogent confabulation based text recognition models. The exploration and optimization of the cogent confabulation based models have been conducted through various comparisons. The optimized network offered a better accuracy performance for the sentence completion. To accelerate the sentence completion problem in a multi-processing system, we propose a parallel framework for the confabulation recall algorithm. The parallel implementation reduce runtime, improve the recall accuracy by breaking the fixed evaluation order and introducing more generalization, and maintain a balanced progress in status update among all neurons. A lexicon scheduling algorithm is presented to further improve the model performance. As deep neural networks have been proven effective to solve many real-life applications, and they are deployed on low-power devices, we then investigated the acceleration for the neural network inference using a hardware-friendly computing paradigm, stochastic computing. It is an approximate computing paradigm which requires small hardware footprint and achieves high energy efficiency. Applying this stochastic computing to deep convolutional neural networks, we design the functional hardware blocks and optimize them jointly to minimize the accuracy loss due to the approximation. The synthesis results show that the proposed design achieves the remarkable low hardware cost and power/energy consumption. Modern neural networks usually imply a huge amount of parameters which cannot be fit into embedded devices. Compression of the deep learning models together with acceleration attracts our attention. We introduce the structured matrices based neural network to address this problem. Circulant matrix is one of the structured matrices, where a matrix can be represented using a single vector, so that the matrix is compressed. We further investigate a more flexible structure based on circulant matrix, called block-circulant matrix. It partitions a matrix into several smaller blocks and makes each submatrix is circulant. The compression ratio is controllable. With the help of Fourier Transform based equivalent computation, the inference of the deep neural network can be accelerated energy efficiently on the FPGAs. We also offer the optimization for the training algorithm for block circulant matrices based neural networks to obtain a high accuracy after compression
    corecore