
    An Evaluation of Edge TPU Accelerators for Convolutional Neural Networks

    Edge TPUs are a family of accelerators for low-power edge devices and are widely used in various Google products such as Coral and Pixel devices. In this paper, we first discuss the major microarchitectural details of Edge TPUs. Then, we extensively evaluate three classes of Edge TPUs, covering different computing ecosystems, that are either currently deployed in Google products or are in the product pipeline, across 423K unique convolutional neural networks. Building upon this extensive study, we discuss critical and interpretable microarchitectural insights about the studied classes of Edge TPUs. Mainly, we discuss how Edge TPU accelerators perform across convolutional neural networks with different structures. Finally, we present our ongoing efforts in developing high-accuracy learned models to estimate the major performance metrics of accelerators, such as latency and energy consumption. These learned models enable significantly faster (on the order of milliseconds) evaluations of accelerators as an alternative to time-consuming cycle-accurate simulators and establish an exciting opportunity for rapid hardware/software co-design. Comment: 11 pages, 15 figures, submitted to ISCA 202

    In-Datacenter Performance Analysis of a Tensor Processing Unit

    Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU. Comment: 17 pages, 11 figures, 8 tables. To appear at the 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, June 24-28, 201
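
    The quoted peak of 92 TOPS can be sanity-checked from the MAC count. The sketch below is a back-of-the-envelope calculation, not taken from the abstract: it assumes a 700 MHz clock and counts each multiply-accumulate as two operations, neither of which the abstract states. The helper name `peak_tops` is hypothetical.

```python
# Rough check of the peak-throughput figure quoted in the abstract.
MAC_UNITS = 65_536    # from the abstract: number of 8-bit MAC units
OPS_PER_MAC = 2       # assumption: one multiply plus one add per MAC per cycle
CLOCK_HZ = 700e6      # assumption: 700 MHz clock, not stated in the abstract


def peak_tops(mac_units: int, ops_per_mac: int, clock_hz: float) -> float:
    """Return peak throughput in tera-operations per second."""
    return mac_units * ops_per_mac * clock_hz / 1e12


if __name__ == "__main__":
    # Prints the estimate to two decimals; it should land close to the quoted 92 TOPS.
    print(f"{peak_tops(MAC_UNITS, OPS_PER_MAC, CLOCK_HZ):.2f} TOPS")
```

    Under these assumptions the estimate rounds to the abstract's figure; a different clock rate or operation-counting convention would change the result proportionally.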

    Coded modulation with Low Density Parity Check codes

    Includes bibliographical references (leaves 78-80). Issued also on microfiche from Lange Micrographics. This thesis proposes the design of Low Density Parity Check (LDPC) codes for cases where coded modulation is used. We design these codes by extending the idea of Density Evolution (DE), which has been introduced as a powerful tool to analyze LDPC codes. We first discuss methods by which we can design these codes for higher-order constellations like 8 Phase Shift Keying (PSK) and 16 Quadrature Amplitude Modulation (QAM). We present simulation results that are within 0.22 dB and 0.4 dB of the constrained capacity of the 8 PSK and 16 QAM constellations, respectively, in an Additive White Gaussian Noise (AWGN) channel. In the second part, we investigate serial concatenation of LDPC codes and minimum shift keying (MSK) with iterative decoding. We show that the design of LDPC codes is crucially dependent on the realization of the MSK modulator. For MSK modulators with non-recursive continuous phase encoders (CPEs), codes that are optimal for BPSK remain optimal, whereas for MSK modulators with recursive CPEs the BPSK codes are not optimal. We show that for non-recursive CPEs, iterative demodulation and decoding is not required even though the CPE has memory. However, iterative demodulation is essential for recursive CPEs. For recursive CPEs, we design LDPC codes using density evolution and differential evolution by looking at the graph structure of the CPE and considering message passing between both these codes. The resulting codes provide significantly improved performance over the existing codes.