Investigating Opportunities and Challenges in Modeling and Designing Scale-Out DNN Accelerators
The rapid growth of deep learning in practical applications such as speech recognition, computer vision, natural language processing, robotics, and many other fields has opened the door to new technological possibilities. Unfortunately, traditional hardware systems are being stretched to their limits to accommodate the intense workloads presented by state-of-the-art deep learning, at a time when transistor technology is no longer scaling. To meet the demand for greater computational power and more specialized computation, specialized hardware needs to be developed that provides better latency and bandwidth for a variety of demanding applications.
The trend in the semiconductor industry is to move towards heterogeneous Systems-on-Chip (SoCs), trading the generality seen in most CPU architectures today for application-specific performance. In most situations, hardware engineers are left to construct systems that serve the needs of various applications, often needing to predict the use cases of the system. As in any field, the ability to predict and act on future innovation trends in the industry is the difference between success and failure.
A novel simulator for the design of convolutional neural network accelerators, named SCALE-Sim (Systolic CNN Accelerator Simulator), is presented and described in detail. The simulator is available as an open-source repository and has two primary use cases in which computer architects can extract significant results. The first use case is for system designers who would like to integrate an existing DNN accelerator architecture into a larger SoC and would be interested in system-level characterization results. The second use case is for an accelerator architect who would like to use the tool to explore the accelerator design space by sweeping through design parameters.
FPGA Hardware Accelerators - Case Study on Design Methodologies and Trade-Offs
Previous research has shown that the performance of any computation is directly related to the architecture on which it is performed. As a result, the performance of compute-intensive applications can be improved using heterogeneous systems. These systems consist of various processor architectures such as CPUs, FPGAs, DSPs, and GPUs. Individual computations can be performed in parallel on different processor architectures within the heterogeneous system. Computations are performed by utilizing existing designs from implementation libraries. There is a lack of FPGA accelerators for use in these libraries, and as such, additional implementations need to be designed.
Different design methodologies for developing FPGA accelerators result in implementations that vary in performance, design time, and resource utilization. A particular method and supporting toolset may produce better results for one type of design than another.
The customary method for designing FPGA accelerators is to develop the system architecture from an algorithm and model it using a hardware description language (HDL). Another method is to convert directly from a software implementation to HDL. This process is known as high-level synthesis (HLS).
The advantages and disadvantages of these two techniques can be examined through a comparison of different linear algebra operations. Many linear algebra operations are parallel in nature, which makes them potentially good candidates for speedup through implementation on an FPGA. In particular, matrix multiplication is an excellent candidate for examination due to not only its parallelism but also its multitude of different algorithms. The goal of this research is to design different matrix multiplication accelerators and provide insight into the advantages and disadvantages of each design procedure.
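That parallelism is visible directly in the loop structure of the computation. The plain-Python sketch below is illustrative only (an actual accelerator would be written in HDL or HLS C/C++): each output tile of a blocked matrix multiplication is independent and could be mapped to its own processing element.

```python
# Blocked matrix multiplication: the (bi, bj) output tiles are mutually
# independent, which is exactly the parallelism an FPGA design exploits,
# whether hand-written in HDL or unrolled by an HLS tool.
def blocked_matmul(A, B, block=4):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for bi in range(0, n, block):           # each output tile is independent:
        for bj in range(0, m, block):       # a candidate for its own PE
            for bk in range(0, k, block):   # accumulate over the shared dim
                for i in range(bi, min(bi + block, n)):
                    for j in range(bj, min(bj + block, m)):
                        acc = C[i][j]
                        for p in range(bk, min(bk + block, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C
```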
Implementation of the OPU Instruction Set Architecture on the Microsemi Polarfire 300 Field-Programmable Gate Array
Deep learning is a fast-growing field with numerous promising applications that, unfortunately, demand large computing power for both training and inference tasks. To meet this demand, numerous hardware accelerators have been designed. Currently, however, these platforms are being developed independently from each other, and, as a result, there is a lack of compatibility between them. Notably, there is a need for standardization of the interface between hardware accelerators and software. UCLA's OPU is an ISA that aims to solve this issue. Contrary to general-purpose ISAs, OPU is designed to adequately express the computations involved in deep learning models, which allows for simple compilation and efficient cores. Prior to this work, only two fully-featured cores implementing the OPU ISA had been designed, both targeting Xilinx SRAM-based FPGAs. However, flash-based FPGAs can offer several advantages thanks to their different technology: they are more secure, more reliable, and can yield lower power consumption. Since all three of these characteristics are potentially highly valuable for deep learning accelerators, especially those embedded in edge devices, a new OPU core is developed here and mapped to a flash-based FPGA. More specifically, the potential of the MPF300 FPGA as a platform for the OPU ISA is evaluated. This represents the first OPU core implemented on an FPGA not manufactured by Xilinx. In addition, this design is also the first OPU core capable of operating on floating-point numbers, which simplifies the compilation of models. As such, this work contributes to the diversification of the catalog of available OPU cores, which increases the relevance of this ISA.

While prior work affirms that, on Xilinx FPGAs, 8-bit floating-point arithmetic is more area-efficient than 8-bit integer arithmetic, the opposite is found in this work for Microsemi FPGAs. As a consequence, it is established that the optimal way to perform large floating-point dot products on the MPF300 is to convert the operands to wider integers on the device, then complete the computations using integer arithmetic. In contrast to Xilinx FPGAs, 5-bit mantissas are preferred here over 4-bit mantissas. Additionally, due to the MPF300's lower ratio of LUTs to DSPs, the relative resource utilization is found to be significantly higher than in the existing implementations. This new OPU core is found to be on average 1.7 times more energy-efficient than the existing similarly-sized implementation of the OPU ISA. Furthermore, the new core is on average 2 times faster than the Nvidia Jetson Nano platform while consuming the same amount of power. These results further prove the relevance of the OPU ISA. In addition, this demonstrates that flash-based FPGAs, too, are a viable option for deep learning acceleration; the scarcity of these FPGAs in the relevant literature is thus not justified. Nevertheless, analysis of the core shows that the layout of modern FPGAs is in general suboptimal for machine learning acceleration. In particular, the placement of the device's hard resources tends to cause congestion that reduces performance. This suggests the need for the development of specialized FPGAs for this task.
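The float-to-integer dot-product strategy can be made concrete with a short sketch. The abstract fixes only the 5-bit mantissa, so the 1-sign/2-exponent/5-mantissa encoding with bias 1 assumed below is purely illustrative: each operand is decoded into a wider fixed-point integer, and the accumulation then proceeds entirely in integer arithmetic.

```python
# Sketch of the dot-product strategy above: decode small custom floats into
# wider fixed-point integers on the fly, then multiply-accumulate entirely in
# integer arithmetic. A 1-2-5 sign/exponent/mantissa split with bias 1 and no
# special values is assumed purely for illustration.
def decode_fp8(word, exp_bits=2, man_bits=5):
    """Decode an 8-bit float into an integer scaled by 2**-man_bits."""
    sign = -1 if (word >> (exp_bits + man_bits)) & 1 else 1
    exp = (word >> man_bits) & ((1 << exp_bits) - 1)
    man = word & ((1 << man_bits) - 1)
    # exp == 0 is subnormal (no implicit leading 1); otherwise prepend it.
    mag = man if exp == 0 else (man | (1 << man_bits)) << (exp - 1)
    return sign * mag

def int_dot(xs, ys):
    """Integer dot product of two vectors of encoded 8-bit floats."""
    return sum(decode_fp8(x) * decode_fp8(y) for x, y in zip(xs, ys))
```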
Approximation Opportunities in Edge Computing Hardware: A Systematic Literature Review
With the increasing popularity of the Internet of Things and massive Machine Type Communication technologies, the number of connected devices is rising. However, while these devices bring valuable benefits to our lives, bandwidth and latency constraints challenge Cloud processing of the data volumes they generate. A promising solution to these challenges is the combination of Edge and approximate computing techniques, which allows data to be processed nearer to the user. This paper aims to survey the potential benefits of the intersection of these paradigms. We provide a state-of-the-art review of circuit-level and architecture-level hardware techniques and popular applications. We also outline essential future research directions.
Reconfigurable acceleration of Recurrent Neural Networks
Recurrent Neural Networks (RNNs) have been successful in a wide range of applications involving temporal sequences, such as natural language processing, speech recognition, and video analysis. However, RNNs often require a significant amount of memory and computational resources. In addition, the recurrent nature and data dependencies of RNN computations can lead to system stalls, resulting in low throughput and high latency.
This work describes novel parallel hardware architectures for accelerating RNN inference using Field-Programmable Gate Array (FPGA) technology, taking into account the data dependencies and high computational cost of RNNs.
The first contribution of this thesis is a latency-hiding architecture that utilizes column-wise matrix-vector multiplication instead of the conventional row-wise operation to eliminate data dependencies and improve the throughput of RNN inference designs. This architecture is further enhanced by a configurable checkerboard tiling strategy which allows large dimensions of weight matrices, while supporting element-based parallelism and vector-based parallelism. The presented reconfigurable RNN designs show significant speedup over CPU, GPU, and other FPGA designs.
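The intuition behind the column-wise scheme can be seen in a small software model (plain Python, illustrative only; the actual designs are FPGA hardware). Both functions compute the same matrix-vector product, but the row-wise order updates a single accumulator back to back, creating the dependency chain that stalls a pipelined adder, while the column-wise order interleaves updates to independent output accumulators, hiding the adder latency:

```python
# Row-wise vs column-wise matrix-vector multiplication. Numerically identical,
# but the inner-loop dependency pattern differs: row-wise updates one
# accumulator back to back (a stall on a pipelined FPGA adder), while
# column-wise interleaves updates to different outputs, hiding adder latency.
def mvm_row_wise(W, x):
    return [sum(W[i][j] * x[j] for j in range(len(x)))  # one long chain per y[i]
            for i in range(len(W))]

def mvm_column_wise(W, x):
    y = [0] * len(W)
    for j, xj in enumerate(x):       # stream one column of W per step
        for i in range(len(W)):      # independent accumulators y[i]
            y[i] += W[i][j] * xj
    return y
```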
The second contribution of this thesis is a weight reuse approach for large RNN models with weights stored in off-chip memory, running with a batch size of one. A novel blocking-batching strategy is proposed to optimize the throughput of large RNN designs on FPGAs by reusing the RNN weights. Performance analysis is also introduced to enable FPGA designs to achieve the best trade-off between area, power consumption and performance. Promising power efficiency improvement has been achieved in addition to speeding up over CPU and GPU designs.
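A software model of the weight-reuse idea, under illustrative assumptions (the block shape and scheduling below are simplified stand-ins for the thesis's blocking-batching strategy): each weight block is fetched from off-chip memory once and applied to every queued input before the next block is loaded, amortizing the memory traffic that dominates batch-size-one inference of large RNNs.

```python
# Sketch of blocking-batching style weight reuse: each weight block is fetched
# from (simulated) off-chip memory once and applied to every queued input
# vector before the next block is loaded, amortizing memory traffic.
def blocked_mvm_batch(W, xs, block_rows=64):
    n_out = len(W)
    ys = [[0] * n_out for _ in xs]
    for r in range(0, n_out, block_rows):
        w_block = W[r:r + block_rows]        # one off-chip fetch
        for b, x in enumerate(xs):           # reuse it across the whole batch
            for i, row in enumerate(w_block):
                ys[b][r + i] = sum(wv * xv for wv, xv in zip(row, x))
    return ys
```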
The third contribution of this thesis is a low latency design for RNNs based on a partially-folded hardware architecture. It also introduces a technique that balances initiation interval of multi-layer RNN inferences to increase hardware efficiency and throughput while reducing latency. The approach is evaluated on a variety of applications, including gravitational wave detection and Bayesian RNN-based ECG anomaly detection.
To facilitate the use of this approach, we open-source an RNN template which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools.
Bridging the Gap Between Neural Networks and Neuromorphic Hardware with A Neural Network Compiler
Different from developing neural networks (NNs) for general-purpose processors, development for NN chips usually faces hardware-specific restrictions, such as limited precision of network signals and parameters, constrained computation scale, and limited types of non-linear functions.
This paper proposes a general methodology to address these challenges. We decouple the NN applications from the target hardware by introducing a compiler that can transform an existing trained, unrestricted NN into an equivalent network that meets the given hardware's constraints. We propose multiple techniques to make the transformation adaptable to different kinds of NN chips, and reliable under strict hardware constraints.
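Two representative transformations such a compiler might apply are sketched below in Python/NumPy. The fixed-point parameters and the hard-tanh substitution are illustrative assumptions; the paper's actual techniques are more involved.

```python
# Two illustrative compiler transformations: (1) quantize weights to a chip's
# limited signed fixed-point precision, and (2) replace an unsupported
# non-linearity (tanh) with a piecewise-linear "hard tanh" built only from
# supported primitives. Parameters are assumptions for the sketch.
import numpy as np

def quantize(w, bits=8, frac=6):
    """Round to signed fixed point with `frac` fractional bits, saturating."""
    scale = 1 << frac
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return np.clip(np.round(w * scale), lo, hi) / scale

def hard_tanh(x):
    """Piecewise-linear stand-in for tanh using only clip (min/max)."""
    return np.clip(x, -1.0, 1.0)

# A trained layer is rewritten so the deployed network uses only
# hardware-supported precision and activation functions.
W = np.random.randn(16, 16).astype(np.float32)
x = np.random.randn(16).astype(np.float32)
y = hard_tanh(quantize(W) @ quantize(x))
```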
We have built such a software tool that supports both spiking neural networks (SNNs) and traditional artificial neural networks (ANNs). We have demonstrated its effectiveness with a fabricated neuromorphic chip and a processing-in-memory (PIM) design. Tests show that the inference error caused by this solution is insignificant and that the transformation time is much shorter than the retraining time. Also, we have performed parameter-sensitivity evaluations to explore the trade-offs between network error and resource utilization for different transformation strategies, which could provide insights for the co-design optimization of neuromorphic hardware and software.
Rethinking FPGA Architectures for Deep Neural Network applications
The prominence of machine learning-powered solutions has instituted an unprecedented trend of integration into virtually all applications, with a broad range of deployment constraints from tiny embedded systems to large-scale warehouse computing machines. While recent research confirms the advantages of using contemporary FPGAs to deploy or accelerate machine learning applications, especially where latency and energy consumption are strictly limited, their architectures, optimised before the machine learning era, remain a barrier to overall efficiency and performance.
Recognizing this shortcoming, this thesis presents an architectural study aimed at solutions that unlock hidden potential in FPGA technology, primarily for machine learning algorithms. In particular, it shows how slight alterations to state-of-the-art architectures could significantly enhance FPGAs toward becoming more machine learning-friendly while maintaining near-promised performance for the rest of the applications. Finally, it presents a novel systematic approach to deriving new block architectures, guided by design limitations and the characteristics of machine learning algorithms, through benchmarking.
First, through three modifications to Xilinx DSP48E2 blocks, an enhanced digital signal processing (DSP) block for important computations in embedded deep neural network (DNN) accelerators is described. Then, two tiers of modifications to the FPGA logic cell architecture are explained that deliver a variety of performance and utilisation benefits with only minor area overheads. Finally, with the goal of exploring this new design space in a methodical manner, a problem formulation involving computing nested loops over multiply-accumulate (MAC) operations is first proposed. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then suggested, together with a family of new embedded blocks, called MLBlocks.
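The nested-loop MAC formulation at the core of that methodology can be sketched as follows, using a 1-D convolution as an illustrative instance (the thesis formulates the problem more generally):

```python
# Nested loops over MAC operations, instantiated for a 1-D convolution:
# every candidate compute block must efficiently cover loop nests of this
# shape, with benchmark-dependent bounds N, K, C.
def conv1d_mac(x, w, N, K, C):
    """y[n] = sum over c, k of x[c][n + k] * w[c][k]  (no padding)."""
    y = [0.0] * N
    for n in range(N):           # output positions
        for c in range(C):       # input channels
            for k in range(K):   # filter taps
                y[n] += x[c][n + k] * w[c][k]  # one MAC per innermost step
    return y
```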
A Survey on Approximate Multiplier Designs for Energy Efficiency: From Algorithms to Circuits
Given the stringent energy-efficiency requirements of Internet-of-Things edge devices, approximate multipliers, as a basic component of many processors and accelerators, have been constantly proposed and studied for decades, especially for error-resilient applications. The computation error and energy efficiency largely depend on how and where the approximation is introduced into a design. Thus, this article aims to provide a comprehensive review of approximation techniques in multiplier designs, ranging from algorithms and architectures to circuits. We have implemented representative approximate multiplier designs in each category to understand the impact of the design techniques on accuracy and efficiency. The designs can then be effectively deployed in high-level applications, such as machine learning, to gain energy efficiency at the cost of a slight accuracy loss.
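As a minimal illustration of the accuracy/efficiency trade-off such designs exploit (a generic textbook example, not one of the surveyed circuits), a truncation-based approximate multiplier discards the low bits of both operands before multiplying, shrinking the hardware partial-product array at the cost of a bounded relative error:

```python
# Truncation-based approximate multiplier: dropping the low t bits of each
# unsigned operand removes rows and columns of the partial-product array in
# hardware, trading a small, bounded error for area and energy savings.
def approx_mul_truncated(a, b, t=4):
    """Approximate unsigned multiply with low t bits of each operand dropped."""
    return ((a >> t) * (b >> t)) << (2 * t)

exact = 1234 * 5678
approx = approx_mul_truncated(1234, 5678)
print(exact, approx, abs(exact - approx) / exact)  # relative error < 1% here
```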