CUTIE: Beyond PetaOp/s/W Ternary DNN Inference Acceleration with Better-than-Binary Energy Efficiency
We present a 3.1 POp/s/W fully digital hardware accelerator for ternary
neural networks. CUTIE, the Completely Unrolled Ternary Inference Engine,
focuses on minimizing non-computational energy and switching activity, so that little dynamic power is spent on storing intermediate results, whether locally or globally. This is achieved by 1) a data path architecture completely unrolled
in the feature map and filter dimensions to reduce switching activity by
favoring silencing over iterative computation and maximizing data re-use, 2)
targeting ternary neural networks, which, in contrast to binary NNs, allow for sparse weights that reduce switching activity, and 3) introducing an optimized training method for higher sparsity of the filter weights, resulting in a
further reduction of the switching activity. Compared with state-of-the-art
accelerators, CUTIE achieves equal or higher accuracy while reducing the overall core inference energy cost by a factor of 4.8x-21x.
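The sparsity mechanism this abstract relies on can be made concrete in a few lines. The following NumPy sketch is an illustrative reconstruction, not the authors' code: the function names and the `threshold` value are hypothetical. It shows how ternary quantization silences small weights to zero, so the corresponding multiply-accumulate lanes never have to toggle.

```python
import numpy as np

def ternarize(w, threshold=0.05):
    # Map real-valued weights to {-1, 0, +1}; magnitudes below
    # `threshold` are silenced to 0. Raising the threshold trades
    # accuracy for sparsity (fewer toggling MAC lanes). The value
    # 0.05 is illustrative, not taken from the paper.
    t = np.zeros(w.shape, dtype=np.int8)
    t[w > threshold] = 1
    t[w < -threshold] = -1
    return t

def ternary_mac(x, w):
    # Accumulate only where the ternary weight is non-zero. In a
    # completely unrolled datapath the zero-weight lanes simply
    # never switch; skipping them here stands in for that silencing.
    nz = w != 0
    return int(np.dot(x[nz], w[nz]))

rng = np.random.default_rng(0)
w = ternarize(rng.normal(scale=0.1, size=128))
x = rng.integers(-1, 2, size=128)  # ternary activations in {-1, 0, +1}
print(ternary_mac(x, w), f"weight sparsity: {np.mean(w == 0):.0%}")
```

In this toy model, a higher sparsity fraction directly means fewer active accumulations, which is the software analogue of the reduced switching activity the training method targets.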
ADC/DAC-Free Analog Acceleration of Deep Neural Networks with Frequency Transformation
Edge processing of deep neural networks (DNNs) is becoming increasingly important because it extracts valuable information directly at the data source, minimizing latency and energy consumption. Frequency-domain model compression, such as with the Walsh-Hadamard transform (WHT), has been identified as an efficient alternative. However, the benefits of frequency-domain processing are often offset by the increased number of multiply-accumulate (MAC) operations required. This paper proposes a novel approach to energy-efficient acceleration of frequency-domain neural
networks by utilizing analog-domain frequency-based tensor transformations. Our
approach offers unique opportunities to enhance computational efficiency, yielding several high-level advantages: a parallel array micro-architecture, ADC/DAC-free analog computations, and increased output
sparsity. Our approach achieves more compact cells by eliminating the need for
trainable parameters in the transformation matrix. Moreover, our novel array
micro-architecture enables adaptive stitching of cells column-wise and
row-wise, thereby facilitating perfect parallelism in computations.
Additionally, our scheme enables ADC/DAC-free computations by training against
highly quantized matrix-vector products, leveraging the parameter-free nature
of matrix multiplications. Another crucial aspect of our design is its ability
to handle signed-bit processing for frequency-based transformations. This leads
to increased output sparsity and reduced digitization workload. On a
1616 crossbars, for 8-bit input processing, the proposed approach
achieves the energy efficiency of 1602 tera operations per second per Watt
(TOPS/W) without early termination strategy and 5311 TOPS/W with early
termination strategy at VDD = 0.8 V
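To make the parameter-free claim concrete, here is a minimal NumPy sketch. It assumes the standard Sylvester construction of the Walsh-Hadamard matrix; the function `wht_quantized`, the `bits` parameter, and the scaling rule are hypothetical illustrations, not the paper's training recipe. It shows a transform matrix whose entries are only ±1 and the output sparsity that aggressive signed quantization produces.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of the n x n Walsh-Hadamard matrix
    # (n must be a power of two). Entries are only +1/-1, so the
    # transform is parameter-free: no trainable weights, and the
    # analog cells only need sign handling, not full multipliers.
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def wht_quantized(x, bits=1):
    # Apply the WHT, then crush the result to a `bits`-bit signed
    # code. Quantizing this aggressively (an illustrative choice,
    # not the paper's exact procedure) mimics training against
    # highly quantized matrix-vector products: many outputs
    # collapse to zero, cutting the digitization workload.
    y = hadamard(len(x)) @ x
    scale = max(np.max(np.abs(y)), 1) / (2 ** (bits - 1))
    return np.round(y / scale).astype(np.int8)

x = np.random.default_rng(1).integers(-128, 128, size=16)  # 8-bit input, 16-point tile
print(wht_quantized(x, bits=1))                            # values in {-1, 0, +1}
```

Because every matrix entry is ±1, the array stores no trainable weights, which is what permits the compact, stitchable cells described above.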
Advanced Information Processing Methods and Their Applications
This Special Issue has collected and presented breakthrough research on information processing methods and their applications. Particular attention is paid to the mathematical foundations of information processing methods, quantum computing, artificial intelligence, digital image processing, and the use of information technologies in medicine.