DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators through Analytical Modeling
DNN workloads can be scheduled onto DNN accelerators in many different ways:
from layer-by-layer scheduling to cross-layer depth-first scheduling (a.k.a.
layer fusion, or cascaded execution). This results in a very broad scheduling
space, with each schedule leading to varying hardware (HW) costs in terms of
energy and latency. To rapidly explore this vast space for a wide variety of
hardware architectures, analytical cost models are crucial to estimate
scheduling effects on the HW level. However, state-of-the-art cost models are
lacking support for exploring the complete depth-first scheduling space, for
instance focusing only on activations while ignoring weights, or modeling only
DRAM accesses while overlooking on-chip data movements. These limitations
prevent researchers from systematically and accurately understanding the
depth-first scheduling space.
After formalizing this design space, this work proposes a unified modeling
framework, DeFiNES, for layer-by-layer and depth-first scheduling to fill in
the gaps. DeFiNES enables analytically estimating the hardware cost for
possible schedules in terms of both energy and latency, while considering data
access at every memory level. This is done for each schedule and HW
architecture under study by optimally choosing the active part of the memory
hierarchy per unique combination of operand, layer, and feature map tile. The
hardware costs are estimated, taking into account both data computation and
data copy phases. The analytical cost model is validated against measured data
from a taped-out depth-first DNN accelerator, DepFiN, showing good modeling
accuracy at the end-to-end neural network level. A comparison with generalized
state-of-the-art demonstrates up to 10X better solutions found with DeFiNES.
Comment: Accepted by HPCA 202
SALSA: Simulated Annealing based Loop-Ordering Scheduler for DNN Accelerators
To meet the growing need for computational power for DNNs, multiple specialized hardware architectures have been proposed. Each DNN layer should be mapped onto the hardware with the most efficient schedule; however, SotA schedulers struggle to consistently provide optimum schedules in a reasonable time across all DNN-HW combinations. This paper proposes SALSA, a fast dual-engine scheduler to generate optimal execution schedules for both even and uneven mapping. We introduce a new strategy, combining exhaustive search with simulated annealing to address the dynamic nature of the loop ordering design space size across layers. SALSA is extensively benchmarked against two SotA schedulers, LOMA [1] and Timeloop [2], on 5 different DNNs. On average, SALSA finds schedules with 11.9% and 7.6% lower energy while speeding up the search by 1.7× and 24× compared to LOMA and Timeloop, respectively.
Clinical patterns of presentation and attenuated inflammatory response in octo- and nonagenarians with perforated gastroduodenal ulcers
Zebra dolomitization as a result of focused fluid flow in the Rocky Mountains Fold and Thrust Belt, Canada
Soluble IL-1 receptor 2 is associated with left ventricular remodelling in patients with ST-elevation myocardial infarction
Midrapidity antiproton-to-proton ratio in pp collisions at $\sqrt{s}$ = 0.9 and 7 TeV measured by the ALICE experiment
The ratio of the yields of antiprotons to protons in pp collisions has been measured by the ALICE experiment at $\sqrt{s}$ = 0.9 and 7 TeV during the initial running periods of the Large Hadron Collider. The measurement covers the transverse momentum interval $0.45 < p_t < 1.05$ GeV/c and rapidity $|y| < 0.5$. The ratio is measured to be $R_{|y|<0.5} = 0.957 \pm 0.006$ (stat) $\pm 0.0014$ (syst) at 0.9 TeV and $R_{|y|<0.5} = 0.991 \pm 0.005 \pm 0.014$ (syst) at 7 TeV, and it is independent of both rapidity and transverse momentum. The results are consistent with the conventional model of baryon-number transport and set stringent limits on any additional contributions to baryon-number transfer over very large rapidity intervals in pp collisions.
Centrality dependence of the charged-particle multiplicity density at mid-rapidity in Pb-Pb collisions at $\sqrt{s_{NN}}$ = 2.76 TeV
The centrality dependence of the charged-particle multiplicity density at mid-rapidity in Pb-Pb collisions at $\sqrt{s_{NN}}$ = 2.76 TeV is presented. The charged-particle density normalized per participating nucleon pair increases by about a factor 2 from peripheral (70-80%) to central (0-5%) collisions. The centrality dependence is found to be similar to that observed at lower collision energies. The data are compared with models based on different mechanisms for particle production in nuclear collisions.