5,817 research outputs found
High accuracy computation with linear analog optical systems: a critical study
High accuracy optical processors based on the algorithm of digital multiplication by analog convolution (DMAC) are studied for ultimate performance limitations. Variations of optical processors that perform high accuracy vector-vector inner products are studied in abstract and with specific examples. It is concluded that the use of linear analog optical processors in performing digital computations with DMAC leads to impractical requirements for the accuracy of analog optical systems and the complexity of postprocessing electronics.
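The DMAC principle referred to above can be stated in a few lines of arithmetic: the product of two integers equals the convolution of their digit sequences, followed by carry resolution of the resulting mixed-radix number. The sketch below illustrates only this underlying identity in plain NumPy (the convolution stands in for the analog optical stage, the weighted sum for the digital postprocessing); it is not a model of the specific optical systems analyzed in the paper.

```python
# Sketch of the DMAC identity: multiplication = digit convolution + carry
# resolution. The analog optical stage is modelled by np.convolve; the
# digital post-processing by evaluating the mixed-radix output.
import numpy as np

def digits(x, base=2, width=8):
    """Least-significant-first digit sequence of a non-negative integer."""
    return [(x // base**i) % base for i in range(width)]

def dmac_multiply(a, b, base=2, width=8):
    # Convolve the digit sequences (the "analog" part). The outputs can
    # exceed the digit range, so the result is in a mixed radix.
    mixed = np.convolve(digits(a, base, width), digits(b, base, width))
    # Post-processing: evaluate the mixed-radix number, i.e. resolve carries.
    return sum(int(d) * base**i for i, d in enumerate(mixed))

assert dmac_multiply(173, 219) == 173 * 219
```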
Chipmunk: A Systolically Scalable 0.9 mm², 3.08 Gop/s/mW @ 1.2 mW Accelerator for Near-Sensor Recurrent Neural Network Inference
Recurrent neural networks (RNNs) are state-of-the-art in voice awareness/understanding and speech recognition. On-device computation of RNNs on low-power mobile and wearable devices would be key to applications such as zero-latency voice-based human-machine interfaces. Here we present Chipmunk, a small (<1 mm²) hardware accelerator for Long Short-Term Memory RNNs in UMC 65 nm technology, capable of operating at a measured peak efficiency of up to 3.08 Gop/s/mW at 1.24 mW peak power. To implement big RNN models without incurring huge memory transfer overhead, multiple Chipmunk engines can cooperate to form a single systolic array. In this way, the Chipmunk architecture in a 75-tile configuration can achieve real-time phoneme extraction on a demanding RNN topology proposed by Graves et al., consuming less than 13 mW of average power.
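The computation such an accelerator's datapath evaluates is the standard LSTM cell update. The NumPy sketch below shows one LSTM time step in the usual textbook formulation; it does not model Chipmunk's fixed-point formats, tiling, or systolic operation.

```python
# One LSTM time step (standard gate equations) -- reference formulation only,
# not Chipmunk's actual fixed-point datapath.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """x: input vector; h_prev/c_prev: previous hidden and cell state.
    W, U, b hold the stacked weights/biases of the i, f, o, g gates."""
    z = W @ x + U @ h_prev + b          # four gate pre-activations, stacked
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g              # new cell state
    h = o * np.tanh(c)                  # new hidden state
    return h, c

# Toy usage with random weights (hidden size 4, input size 3).
rng = np.random.default_rng(0)
n_h, n_x = 4, 3
W = rng.standard_normal((4 * n_h, n_x))
U = rng.standard_normal((4 * n_h, n_h))
b = np.zeros(4 * n_h)
h, c = lstm_step(rng.standard_normal(n_x), np.zeros(n_h), np.zeros(n_h), W, U, b)
```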
Two-dimensional DCT/IDCT architecture
A fully parallel architecture for the computation of a two-dimensional (2-D) discrete cosine transform (DCT), based on row-column decomposition, is presented. It uses the same one-dimensional (1-D) DCT unit for the row and column computations and (N² + N) registers to perform the transposition. It possesses features of regularity and modularity, and is thus well suited for VLSI implementation. It can be used for the computation of either the forward or the inverse 2-D DCT. Each 1-D DCT unit uses N fully parallel vector inner product (VIP) units. The design of the VIP units is based on a systematic design methodology using radix-2ⁿ arithmetic, which allows partitioning of the elements of each vector into small groups. Array multipliers without the final adder are used to produce the different partial product terms. This allows a more efficient use of 4:2 compressors for the accumulation of the products in the intermediate stages and reduces the number of accumulators from N to one. Using this procedure, the 2-D DCT architecture requires fewer than N² multipliers (in terms of area occupied) and only 2N adders. It can compute an N × N-point DCT at a rate of one complete transform per N cycles, after an appropriate initial delay.
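The row-column decomposition used by this architecture can be summarized in a short reference computation: apply a 1-D DCT to every row, transpose, and apply the same 1-D DCT again. The NumPy sketch below checks this identity; it is a mathematical reference, not a model of the paper's VIP-unit hardware.

```python
# Row-column decomposition of the 2-D DCT: two passes of one 1-D DCT unit
# with a transposition in between. Reference code only.
import numpy as np

def dct_matrix(N):
    """Orthonormal 1-D DCT-II matrix C, so that C @ x is the DCT of x."""
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    C = np.cos(np.pi * (2 * n + 1) * k / (2 * N)) * np.sqrt(2.0 / N)
    C[0, :] *= 1.0 / np.sqrt(2.0)
    return C

def dct2_row_column(X):
    """2-D DCT of an N x N block via two passes of the same 1-D DCT."""
    C = dct_matrix(X.shape[0])
    rows = X @ C.T            # 1-D DCT applied along every row
    return (rows.T @ C.T).T   # transpose, second 1-D pass, transpose back

# Check against the direct separable formula C @ X @ C.T.
X = np.random.default_rng(1).standard_normal((8, 8))
C = dct_matrix(8)
assert np.allclose(dct2_row_column(X), C @ X @ C.T)
```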
Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks
Fully realizing the potential of acceleration for Deep Neural Networks (DNNs) requires understanding and leveraging algorithmic properties. This paper builds upon the algorithmic insight that the bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. However, to prevent accuracy loss, the bitwidth varies significantly across DNNs and may even be adjusted for each layer. Thus, a fixed-bitwidth accelerator would either offer limited benefits, to accommodate the worst-case bitwidth requirements, or lead to a degradation in final accuracy. To alleviate these deficiencies, this work introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. We explore this dimension by designing Bit Fusion, a bit-flexible accelerator that comprises an array of bit-level processing elements that dynamically fuse to match the bitwidth of individual DNN layers. This flexibility in the architecture enables minimizing the computation and the communication at the finest granularity possible with no loss in accuracy. We evaluate the benefits of Bit Fusion using eight real-world feed-forward and recurrent DNNs. The proposed microarchitecture is implemented in Verilog and synthesized in 45 nm technology. Using the synthesis results and cycle-accurate simulation, we compare the benefits of Bit Fusion to two state-of-the-art DNN accelerators, Eyeriss and Stripes. In the same area, frequency, and process technology, Bit Fusion offers 3.9x speedup and 5.1x energy savings over Eyeriss. Compared to Stripes, Bit Fusion provides 2.6x speedup and 3.9x energy reduction at the 45 nm node when its area and frequency are set to those of Stripes. Scaling to the 16 nm GPU technology node, Bit Fusion almost matches the performance of a 250-Watt Titan Xp, which uses 8-bit vector instructions, while consuming merely 895 milliwatts of power.
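The arithmetic behind bit-level fusion/decomposition can be illustrated independently of the microarchitecture: a wide multiplication decomposes into shifted products of small (here 2-bit) digits, so lower-precision layers simply use fewer small multiplies. The Python sketch below shows only this decomposition identity; it is not a model of Bit Fusion's actual processing-element array.

```python
# Bit-level decomposition: an 8-bit x 8-bit multiply expressed as a sum of
# shifted 2-bit x 2-bit products -- the kind of composition a bit-flexible
# array of small processing elements can perform. Arithmetic sketch only.

def to_2bit_digits(x, n_digits):
    """Split x into 2-bit digits, least significant first."""
    return [(x >> (2 * i)) & 0b11 for i in range(n_digits)]

def fused_multiply(a, b, a_bits=8, b_bits=8):
    a_digits = to_2bit_digits(a, a_bits // 2)
    b_digits = to_2bit_digits(b, b_bits // 2)
    acc = 0
    for i, ad in enumerate(a_digits):        # each (i, j) pair is one tiny
        for j, bd in enumerate(b_digits):    # 2-bit x 2-bit multiply
            acc += (ad * bd) << (2 * (i + j))
    return acc

assert fused_multiply(173, 219) == 173 * 219
# Lower-precision layers use fewer digits, e.g. 4-bit operands:
assert fused_multiply(9, 13, a_bits=4, b_bits=4) == 9 * 13
```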
- …