DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators through Analytical Modeling
DNN workloads can be scheduled onto DNN accelerators in many different ways:
from layer-by-layer scheduling to cross-layer depth-first scheduling (a.k.a.
layer fusion, or cascaded execution). This results in a very broad scheduling
space, with each schedule leading to varying hardware (HW) costs in terms of
energy and latency. To rapidly explore this vast space for a wide variety of
hardware architectures, analytical cost models are crucial to estimate
scheduling effects on the HW level. However, state-of-the-art cost models lack
support for exploring the complete depth-first scheduling space, for instance
focusing only on activations while ignoring weights, or modeling only
DRAM accesses while overlooking on-chip data movements. These limitations
prevent researchers from systematically and accurately understanding the
depth-first scheduling space.
After formalizing this design space, this work proposes a unified modeling
framework, DeFiNES, for layer-by-layer and depth-first scheduling to fill in
the gaps. DeFiNES enables analytically estimating the hardware cost for
possible schedules in terms of both energy and latency, while considering data
access at every memory level. This is done for each schedule and HW
architecture under study by optimally choosing the active part of the memory
hierarchy per unique combination of operand, layer, and feature map tile. The
hardware costs are estimated, taking into account both data computation and
data copy phases. The analytical cost model is validated against measured data
from a taped-out depth-first DNN accelerator, DepFiN, showing good modeling
accuracy at the end-to-end neural network level. A comparison with a generalized
state-of-the-art cost model demonstrates up to 10X better solutions found with DeFiNES.
Comment: Accepted by HPCA 2023
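A core quantity any depth-first cost model must track is how the output tile size grows into a required input tile size through a layer stack (the receptive-field effect the abstract's "feature map tile" combinations hinge on). The following is a minimal illustrative sketch of that back-propagation; the function name and `(kernel, stride)` layer encoding are assumptions for illustration, not DeFiNES's actual API.

```python
# Illustrative sketch: back-propagate an output tile size through a stack of
# conv layers to find the input tile each layer must consume. The (kernel,
# stride) tuple encoding is a simplifying assumption (no dilation or padding).
def input_tile_size(out_tile: int, layers: list[tuple[int, int]]) -> int:
    """layers: (kernel, stride) per layer, listed first-to-last."""
    size = out_tile
    for kernel, stride in reversed(layers):
        # Each layer needs (size - 1) strides plus one full kernel window.
        size = (size - 1) * stride + kernel
    return size

# A 4-wide output tile through three 3x3, stride-1 convs needs a 10-wide input tile.
print(input_tile_size(4, [(3, 1), (3, 1), (3, 1)]))  # -> 10
```

This growth is why depth-first schedules trade recomputation or tile-border caching against reduced off-chip traffic.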
TinyVers: A Tiny Versatile System-on-chip with State-Retentive eMRAM for ML Inference at the Extreme Edge
Extreme edge devices or Internet-of-thing nodes require both ultra-low power
always-on processing as well as the ability to do on-demand sampling and
processing. Moreover, support for IoT applications like voice recognition,
machine monitoring, etc., requires the ability to execute a wide range of ML
workloads. This brings challenges in hardware design to build flexible
processors operating in the ultra-low-power regime. This paper presents TinyVers, a
tiny versatile ultra-low power ML system-on-chip to enable enhanced
intelligence at the Extreme Edge. TinyVers exploits dataflow reconfiguration to
enable multi-modal support and aggressive on-chip power management for
duty-cycling to enable smart sensing applications. The SoC combines a RISC-V
host processor, a 17 TOPS/W dataflow reconfigurable ML accelerator, a 1.7 μW
deep sleep wake-up controller, and an eMRAM for boot code and ML
parameter retention. The SoC can perform up to 17.6 GOPS while achieving a
power consumption range from 1.7 μW to 20 mW. Multiple ML workloads aimed at
diverse applications are mapped on the SoC to showcase its flexibility and
efficiency. All the models achieve 1-2 TOPS/W of energy efficiency with power
consumption below 230 μW in continuous operation. In a duty-cycling use
case for machine monitoring, this power is reduced to below 10 μW.
Comment: Accepted in IEEE Journal of Solid-State Circuits
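The duty-cycling claim follows from simple weighted-average power arithmetic. The sketch below reproduces it with the abstract's sleep and active figures; the 3% duty cycle is an illustrative assumption, not a number from the paper.

```python
# Hedged arithmetic sketch: average power of a duty-cycled node that sleeps at
# the deep-sleep floor and periodically wakes for inference bursts.
def avg_power_uw(sleep_uw: float, active_uw: float, duty: float) -> float:
    """duty: fraction of time spent in the active (inference) state."""
    return sleep_uw * (1 - duty) + active_uw * duty

# With a 1.7 uW sleep floor and 230 uW active power (figures from the abstract),
# an assumed 3% duty cycle already lands below the quoted 10 uW.
print(avg_power_uw(1.7, 230.0, 0.03))
```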
Non-benzoquinone geldanamycin analogs trigger various forms of death in human breast cancer cells
White matter abnormalities in adolescents with generalized anxiety disorder: a diffusion tensor imaging study
Review and Benchmarking of Precision-Scalable Multiply-Accumulate Unit Architectures for Embedded Neural-Network Processing
status: Published online
The current trend for deep learning has come with an enormous computational need for billions of Multiply-Accumulate (MAC) operations per inference. Fortunately, reduced precision has demonstrated large benefits with low impact on accuracy, paving the way towards processing in mobile devices and IoT nodes. To this end, various precision-scalable MAC architectures optimized for neural networks have recently been proposed. Yet, it has been hard to comprehend their differences and make a fair judgment of their relative benefits, as they have been implemented with different technologies and performance targets. To overcome this, this work exhaustively reviews the state-of-the-art precision-scalable MAC architectures and unifies them in a new taxonomy. Subsequently, these different topologies are thoroughly benchmarked in a 28 nm commercial CMOS process, across a wide range of performance targets, and with precision ranging from 2 to 8 bits. Circuits are analyzed for each precision as well as jointly in practical use cases, highlighting the impact of architectures and scalability in terms of energy, throughput, area, and bandwidth, aiming to understand the key trends that reduce computation costs in neural-network processing.
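One common idea behind such architectures is sub-word parallelism: packing several low-precision operands into one wide word so a single multiplier yields multiple partial products. The sketch below demonstrates the packing principle in software; it is an illustrative unsigned example with assumed bit widths, not a model of any specific circuit from the review.

```python
# Illustrative sketch of sub-word parallel multiplication: two unsigned 4-bit
# activations share one wide multiply against a common weight. Guard bits keep
# the two partial products from overlapping. Widths are assumptions for the demo.
def packed_mac(a0: int, a1: int, w: int, bits: int = 4, guard: int = 8):
    shift = bits + guard                      # lane width: product must fit here
    packed = a0 | (a1 << shift)               # pack both activations in one word
    prod = packed * w                         # one multiply, two partial products
    mask = (1 << shift) - 1
    return prod & mask, (prod >> shift) & mask  # (a0*w, a1*w)

print(packed_mac(3, 5, 7))  # -> (21, 35)
```

Hardware versions gate and recombine sub-multiplier outputs instead of literally widening the datapath, which is where the energy/area trade-offs benchmarked above come from.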
SALSA: Simulated Annealing based Loop-Ordering Scheduler for DNN Accelerators
To meet the growing need for computational power for DNNs, multiple specialized hardware architectures have been proposed. Each DNN layer should be mapped onto the hardware with the most efficient schedule; however, SotA schedulers struggle to consistently provide optimal schedules in a reasonable time across all DNN-HW combinations. This paper proposes SALSA, a fast dual-engine scheduler that generates optimal execution schedules for both even and uneven mapping. We introduce a new strategy, combining exhaustive search with simulated annealing, to address the dynamic nature of the loop-ordering design-space size across layers. SALSA is extensively benchmarked against two SotA schedulers, LOMA [1] and Timeloop [2], on 5 different DNNs. On average, SALSA finds schedules with 11.9% and 7.6% lower energy while speeding up the search by 1.7× and 24× compared to LOMA and Timeloop, respectively.
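The simulated-annealing engine mentioned above can be sketched as a swap-based search over loop-order permutations. This is a generic textbook annealer with a caller-supplied cost function, offered only to make the mechanism concrete; SALSA's actual hardware cost model and engine-selection heuristics are not reproduced here.

```python
# Minimal simulated-annealing sketch over loop orderings. The cost callable is
# a stand-in for a hardware cost model; all parameters are illustrative defaults.
import math
import random

def anneal(loops, cost, steps=2000, t0=1.0, alpha=0.995, seed=0):
    rng = random.Random(seed)           # seeded for reproducibility
    order = list(loops)
    best = list(order)
    t = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)
        cand = list(order)
        cand[i], cand[j] = cand[j], cand[i]   # propose: swap two loop levels
        delta = cost(cand) - cost(order)
        # Accept improvements always; accept regressions with prob exp(-delta/t).
        if delta < 0 or rng.random() < math.exp(-delta / t):
            order = cand
            if cost(order) < cost(best):
                best = list(order)
        t *= alpha                      # geometric cooling schedule
    return best
```

Annealing suits layers whose loop-order space is too large to enumerate, while small spaces fall back to exhaustive search, which is the dual-engine idea described in the abstract.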
Sub-Word Parallel Precision-Scalable MAC Engines for Efficient Embedded DNN Inference
status: Published online