Search CORE

123,354 research outputs found

Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators

Author: Bell Steven Emberton
Cao Kaidi
Gao Mingyu
Ha Heonjae
Horowitz Mark
Kozyrakis Christos
Liu Qiaoyi
Nayak Ankita
Pu Jing
Raina Priyanka
Setter Jeff Ou
Yang Xuan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 26/04/2020
Field of study

We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halide's scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support mapping replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because the loop blocking can ensure that most data references stay on-chip with good locality and the processing units have high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while keeping throughput constant, we achieve up to 4.2X energy improvement for Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.Comment: Published as a conference paper at ASPLOS 202

arXiv.org e-Print Archive

Crossref

Synchronous Counting and Computational Algorithm Design

Author: Dolev Danny
Heljanko Keijo
Järvisalo Matti
Korhonen Janne H.
Lenzen Christoph
Rybicki Joel
Suomela Jukka
Wieringa Siert
Publication venue: 'Elsevier BV'
Publication date: 05/01/2015
Field of study

Consider a complete communication network on

n

nodes, each of which is a state machine. In synchronous 2-counting, the nodes receive a common clock pulse and they have to agree on which pulses are "odd" and which are "even". We require that the solution is self-stabilising (reaching the correct operation from any initial state) and it tolerates

f

Byzantine failures (nodes that send arbitrary misinformation). Prior algorithms are expensive to implement in hardware: they require a source of random bits or a large number of states. This work consists of two parts. In the first part, we use computational techniques (often known as synthesis) to construct very compact deterministic algorithms for the first non-trivial case of

f = 1

. While no algorithm exists for

n < 4

, we show that as few as 3 states per node are sufficient for all values

n \ge 4

. Moreover, the problem cannot be solved with only 2 states per node for

n = 4

, but there is a 2-state solution for all values

n \ge 6

. In the second part, we develop and compare two different approaches for synthesising synchronous counting algorithms. Both approaches are based on casting the synthesis problem as a propositional satisfiability (SAT) problem and employing modern SAT-solvers. The difference lies in how to solve the SAT problem: either in a direct fashion, or incrementally within a counter-example guided abstraction refinement loop. Empirical results suggest that the former technique is more efficient if we want to synthesise time-optimal algorithms, while the latter technique discovers non-optimal algorithms more quickly.Comment: 35 pages, extended and revised versio

arXiv.org e-Print Archive

MPG.PuRe