15 research outputs found
An Evaluation of Edge TPU Accelerators for Convolutional Neural Networks
Edge TPUs are a domain of accelerators for low-power, edge devices and are
widely used in various Google products such as Coral and Pixel devices. In this
paper, we first discuss the major microarchitectural details of Edge TPUs.
Then, we extensively evaluate three classes of Edge TPUs, covering different
computing ecosystems, that are either currently deployed in Google products or
are in the product pipeline, across 423K unique convolutional neural networks.
Building upon this extensive study, we discuss critical and interpretable
microarchitectural insights about the studied classes of Edge TPUs. Mainly, we
discuss how Edge TPU accelerators perform across convolutional neural networks
with different structures. Finally, we present our ongoing efforts in
developing high-accuracy learned machine learning models to estimate the major
performance metrics of accelerators such as latency and energy consumption.
These learned models enable significantly faster (in the order of milliseconds)
evaluations of accelerators as an alternative to time-consuming cycle-accurate
simulators and establish an exciting opportunity for rapid hardware/software
co-design.
Comment: 11 pages, 15 figures, submitted to ISCA 202
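The learned performance models described above can be illustrated with a minimal sketch: fitting a regression model that maps per-network descriptors to latency. The feature set, the linear model class, and all coefficients below are illustrative assumptions, not the paper's actual models or data.

```python
import numpy as np

# Hypothetical per-network features: [MACs, params, depth, avg channel width].
# These descriptors and the linear model are illustrative assumptions only.
rng = np.random.default_rng(0)
features = rng.uniform(1.0, 100.0, size=(500, 4))

# Synthetic "measured" latency, roughly linear in MACs and parameter count,
# standing in for cycle-accurate simulator measurements.
latency = 3.0 * features[:, 0] + 0.5 * features[:, 1] + rng.normal(0.0, 0.1, 500)

# Fit a least-squares linear model: latency ~ features @ w + bias.
X = np.hstack([features, np.ones((500, 1))])   # append bias column
w, *_ = np.linalg.lstsq(X, latency, rcond=None)

predicted = X @ w                              # one matrix-vector product per query
```

Once fitted, a prediction costs a single matrix-vector product, which is why such learned estimators can answer design-space queries in milliseconds instead of running a cycle-accurate simulation per candidate accelerator.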
A Formal Approach to Memory Access Optimization: Data Layout, Reorganization, and Near-Data Processing
The memory system is a major bottleneck in achieving high performance and energy efficiency for various processing platforms. This thesis aims to improve the memory performance and energy efficiency of data-intensive applications through a two-pronged approach that combines a formal representation framework and a hardware substrate that can efficiently reorganize data in memory. The proposed formal framework enables representing and systematically manipulating data layout formats, address mapping schemes, and memory access patterns through permutations to exploit the locality and parallelism in memory.

Driven by the implications of the formal framework, this thesis presents the HAMLeT architecture for highly concurrent, energy-efficient, and low-overhead data reorganization performed completely in memory. Although data reorganization simply relocates data in memory, it is costly on conventional systems, mainly due to inefficient access patterns, limited data reuse, and round-trip data traversal throughout the memory hierarchy. HAMLeT pursues a near-data processing approach that exploits 3D-stacked DRAM technology. Integrated in the logic layer and interfaced directly to the local controllers, it takes advantage of the internal fine-grain parallelism, high bandwidth, and locality that are otherwise inaccessible. Its parallel streaming architecture can extract high throughput from stringent power, area, and thermal budgets.

The thesis evaluates the efficient data reorganization capability provided by HAMLeT through several fundamental use cases. First, it demonstrates software-transparent data reorganization performed in memory to improve memory access. A proposed hardware monitoring mechanism detects inefficient memory usage and issues a data reorganization to adopt an optimized data layout and address mapping for the observed memory access patterns.
This mechanism operates transparently and does not require any changes to the user software; HAMLeT handles the remapping and its side effects completely in hardware. Second, HAMLeT provides an efficient substrate to explicitly reorganize data in memory. This makes it possible to offload and accelerate common data reorganization routines found in high-performance computing libraries (e.g., matrix transpose, scatter/gather, permutation, pack/unpack). Third, explicitly performed data reorganization enables treating the data layout and address mapping as part of the algorithm design space. Exposing these memory characteristics to the algorithm design space creates opportunities for algorithm/architecture co-design. Co-optimized computation flow, memory accesses, and data layouts lead to new algorithms that are conventionally avoided.
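The framework's view of address mapping schemes as permutations can be sketched as a bit-level permutation of the physical address. The 4-bit example below is a toy illustration of the idea, not HAMLeT's actual mapping or the thesis's formal notation.

```python
def permute_bits(addr, perm):
    """Remap an address by a bit permutation.

    perm[i] gives the source bit position that moves to output bit i,
    modeling an address mapping scheme (e.g. swapping row and column
    bits to change which accesses land in the same DRAM row).
    """
    return sum(((addr >> src) & 1) << i for i, src in enumerate(perm))

# Toy 4-bit mapping: move the two low address bits to the high positions.
perm = [2, 3, 0, 1]
remapped = [permute_bits(a, perm) for a in range(4)]
# addresses 0,1,2,3 map to 0,4,8,12: consecutive addresses now stride apart
```

Choosing the permutation to match an observed access pattern is what lets a remapping turn strided traffic into sequential, row-buffer-friendly traffic.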
HAMLeT: Hardware Accelerated Memory Layout Transform within 3D-stacked DRAM
Memory layout transformations via data reorganization are very common operations, occurring as part of the computation or as a performance optimization in data-intensive applications. These operations incur inefficient memory access patterns and round-trip data movement through the memory hierarchy, failing to utilize the performance and energy-efficiency potential of the memory subsystem. This paper proposes a high-bandwidth and energy-efficient hardware accelerated memory layout transform (HAMLeT) system integrated within a 3D-stacked DRAM. HAMLeT uses low-overhead hardware that exploits the existing infrastructure in the logic layer of 3D-stacked DRAMs and requires no changes to the DRAM layers, yet it can fully exploit the locality and parallelism within the stack by implementing efficient layout transform algorithms. We analyze matrix layout transform operations (such as matrix transpose, matrix blocking, and 3D matrix rotation) and demonstrate that HAMLeT can achieve close to peak system utilization, offering up to an order of magnitude performance improvement over CPU and GPU memory subsystems that do not employ HAMLeT.
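For the matrix transpose use case, the conventional software counterpart is a blocked (tiled) transpose, which keeps both reads and writes within small tiles to improve locality. The sketch below (block size chosen arbitrarily) illustrates that baseline technique, not HAMLeT's in-DRAM algorithm.

```python
import numpy as np

def blocked_transpose(a, block=8):
    """Transpose via square tiles so that reads and writes both stay
    within a small block, improving cache/row-buffer locality over a
    naive element-by-element strided transpose."""
    n, m = a.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i in range(0, n, block):
        for j in range(0, m, block):
            # NumPy slicing clips at the edges, so non-multiple sizes work too.
            out[j:j + block, i:i + block] = a[i:i + block, j:j + block].T
    return out

a = np.arange(20 * 12).reshape(20, 12)   # dimensions not multiples of the block
t = blocked_transpose(a)
```

Even with tiling, a CPU transpose still moves every element through the full memory hierarchy twice, which is the round-trip cost the in-memory approach eliminates.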
FFTS with near-optimal memory access through block data layouts
Fast Fourier transform algorithms on large data sets achieve poor performance on various platforms because of inefficient strided memory access patterns. These access patterns need to be reshaped to achieve high-performance implementations. In this paper we formally restructure 1D, 2D and 3D FFTs targeting a generic machine model with a two-level memory hierarchy that requires block data transfers, and we derive algorithms with efficient memory access patterns using custom block data layouts. Using the Kronecker product formalism, we integrate our optimizations into the Spiral framework. Our evaluations demonstrate that Spiral-generated hardware designs achieve close to the theoretical peak performance of the targeted platform and offer significant speed-ups (up to 6.5x) over naive baseline algorithms.
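The kind of restructuring the Kronecker product formalism captures can be sketched with the classic four-step decomposition, which factors a large 1D FFT into small FFTs, twiddle multiplications, and an explicit data reorganization (the transpose). This NumPy sketch illustrates the factorization itself, not Spiral's generated hardware designs.

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """Compute a length n1*n2 FFT as n2-point column FFTs, a twiddle
    multiplication, n1-point row FFTs, and a final transpose. The
    transpose is exactly the data reorganization step that block data
    layouts aim to make memory-efficient."""
    n = n1 * n2
    a = x.reshape(n2, n1)                      # view the 1D input as an n2 x n1 matrix
    b = np.fft.fft(a, axis=0)                  # n1 column FFTs of length n2
    k2 = np.arange(n2).reshape(n2, 1)
    j1 = np.arange(n1).reshape(1, n1)
    b = b * np.exp(-2j * np.pi * k2 * j1 / n)  # twiddle factors
    c = np.fft.fft(b, axis=1)                  # n2 row FFTs of length n1
    return c.T.reshape(n)                      # transpose: the layout transform

x = np.random.default_rng(1).standard_normal(4096)
y = four_step_fft(x, 64, 64)                   # matches np.fft.fft(x)
```

Because the small row/column FFTs touch contiguous or fixed-stride data, all of the problematic strided traffic is concentrated in the single transpose, which a custom block data layout can then serve with block transfers.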