9 research outputs found

    Deploying Deep Neural Networks in Edge with Distribution

    Get PDF
    The widespread applicability of deep neural networks (DNNs) has led edge computing to emerge as a trend that extends our capabilities to several domains, such as robotics, autonomous technologies, and Internet-of-Things devices. Because of the tight resource constraints of such individual edge devices, computing accurate predictions while providing fast execution is a key challenge. Moreover, modern DNNs increasingly demand more computation power than their predecessors. As a result, the current approach is to offload the inference computations of DNNs to compute resources in the cloud. This approach not only raises privacy concerns but also relies on network infrastructure and data centers that are not scalable and do not guarantee fast execution. My key insight is that edge devices can break their individual resource constraints by distributing the computation of DNNs across collaborating peer edge devices. In my approach, edge devices cooperate to conduct single-batch inference in real time while exploiting several model-parallelism methods. Nonetheless, since communication is costly and current DNN models exhibit a single-chain dependency pattern, distributing and parallelizing the computations of current DNNs may not be an effective solution for edge domains. Therefore, to efficiently benefit from computing resources with low communication overhead, I propose new handcrafted, edge-tailored models that consist of several independent and narrow DNNs. Additionally, I explore an automated neural architecture search methodology and propose custom DNN architectures with low communication overhead and high parallelization opportunities. Finally, to increase reliability and decrease susceptibility to brief disconnections or the loss of a device, I propose a coded distributed computing recovery method that enables distributed DNN models on edge devices to tolerate failures without losing time-sensitive, real-time information. Ph.D.
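
    The central idea of the abstract, several independent, narrow DNNs that exchange data only once when their outputs are merged, can be sketched in a few lines of C++. This is an illustrative stand-in, not the dissertation's code: the layer shapes, the use of one thread per "device", and the concatenation merge are all assumptions.

    // A minimal sketch of the "independent, narrow DNNs" idea: k narrow
    // feed-forward networks run in parallel (one thread standing in for one
    // edge device) and communicate only once, when their outputs are merged.
    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    using Vec = std::vector<float>;
    using Mat = std::vector<Vec>;

    // Dense layer followed by ReLU.
    Vec layer(const Mat& w, const Vec& x) {
        Vec y(w.size(), 0.0f);
        for (std::size_t i = 0; i < w.size(); ++i) {
            for (std::size_t j = 0; j < x.size(); ++j) y[i] += w[i][j] * x[j];
            y[i] = std::max(y[i], 0.0f);
        }
        return y;
    }

    // One narrow network: a chain of layers with no cross-device dependencies.
    Vec narrow_dnn(const std::vector<Mat>& layers, Vec x) {
        for (const Mat& w : layers) x = layer(w, x);
        return x;
    }

    // Single-batch inference: each device evaluates its own narrow DNN on the
    // shared input; outputs are gathered once at the coordinating device.
    Vec distributed_inference(const std::vector<std::vector<Mat>>& per_device,
                              const Vec& input) {
        std::vector<Vec> outs(per_device.size());
        std::vector<std::thread> workers;
        for (std::size_t d = 0; d < per_device.size(); ++d)
            workers.emplace_back([&, d] { outs[d] = narrow_dnn(per_device[d], input); });
        for (auto& t : workers) t.join();
        Vec merged;  // the single communication step: concatenate k narrow outputs
        for (const Vec& o : outs) merged.insert(merged.end(), o.begin(), o.end());
        return merged;
    }

    The design point the abstract argues for shows up in the last function: because the k networks share no intermediate activations, the only communication is the final gather, unlike layer-wise model parallelism over a single-chain DNN.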

    Demystifying the Characteristics of 3D-Stacked Memories: A Case Study for Hybrid Memory Cube

    Full text link
    Three-dimensional (3D) stacking technology, which enables the integration of DRAM and logic dies, offers high bandwidth and low energy consumption. This technology also empowers new memory designs for executing tasks not traditionally associated with memories. A practical 3D-stacked memory is the Hybrid Memory Cube (HMC), which provides significant access bandwidth and low power consumption in a small area. Although several studies have taken advantage of the novel architecture of HMC, its characteristics in terms of latency and bandwidth, and their correlation with temperature and power consumption, have not been fully explored. This paper is the first, to the best of our knowledge, to characterize the thermal behavior of HMC in a real environment using the AC-510 accelerator and to identify temperature as a new limitation for this state-of-the-art design space. Moreover, beyond bandwidth studies, we deconstruct the factors that contribute to latency and reveal their sources for high- and low-load accesses. The results of this paper demonstrate essential behaviors and performance bottlenecks for future explorations of packet-switched and 3D-stacked memories.
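
    Low-load access latency of the kind this abstract deconstructs is typically measured with a pointer-chasing microbenchmark, where each load depends on the previous one so no two requests overlap. The sketch below is a generic stand-in, not the paper's AC-510 harness; the array size and step count are arbitrary assumptions.

    // Generic pointer-chasing probe for idle (low-load) memory latency.
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        const std::size_t n = std::size_t{1} << 26;   // 512 MiB of 8-byte indices
        std::vector<std::size_t> next(n);
        std::iota(next.begin(), next.end(), std::size_t{0});

        // Sattolo's algorithm builds a single-cycle random permutation, so the
        // chase visits every slot before repeating and defeats prefetching.
        std::mt19937_64 rng{42};
        for (std::size_t i = n - 1; i > 0; --i) {
            std::uniform_int_distribution<std::size_t> pick(0, i - 1);
            std::swap(next[i], next[pick(rng)]);
        }

        std::size_t idx = 0;
        const std::size_t steps = 10'000'000;
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < steps; ++i) idx = next[idx];  // dependent loads
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
        std::printf("average load-to-use latency: %.1f ns (sink=%zu)\n", ns, idx);
        return 0;
    }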

    Performance Implications of NoCs on 3D-Stacked Memories: Insights from the Hybrid Memory Cube

    Full text link
    Memories that exploit three-dimensional (3D) stacking technology, which integrates memory and logic dies in a single stack, are becoming popular. These memories, such as the Hybrid Memory Cube (HMC), utilize a network-on-chip (NoC) design to connect their internal structural organization. This novel use of a NoC, in addition to aiding processing-in-memory capabilities, enables numerous benefits such as high bandwidth and memory-level parallelism. However, the implications of NoCs for the characteristics of 3D-stacked memories, in terms of memory access latency and bandwidth, have not been fully explored. This paper addresses this knowledge gap by (i) characterizing an HMC prototype on the AC-510 accelerator board and revealing its access latency behaviors, and (ii) investigating the implications of such behaviors on system and software designs.
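
    High-load behavior, in contrast to the single-stream latency probe above, is usually characterized by issuing many independent access streams so the NoC and vaults operate concurrently. The sketch below is a generic host-side stand-in under assumed thread and buffer sizes, not the paper's measurement infrastructure.

    // Generic multi-stream probe for sustained read bandwidth under load.
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t n = std::size_t{1} << 27;    // 1 GiB of doubles
        std::vector<double> buf(n, 1.0);
        const int nthreads = 8;                        // assumed degree of parallelism
        std::vector<double> sums(nthreads, 0.0);

        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (int t = 0; t < nthreads; ++t)
            workers.emplace_back([&, t] {
                const std::size_t lo = n / nthreads * t;
                const std::size_t hi = n / nthreads * (t + 1);
                double s = 0.0;
                for (std::size_t i = lo; i < hi; ++i) s += buf[i];  // independent streams
                sums[t] = s;
            });
        for (auto& w : workers) w.join();
        auto t1 = std::chrono::steady_clock::now();

        const double secs = std::chrono::duration<double>(t1 - t0).count();
        std::printf("sustained read bandwidth: %.2f GiB/s (sink=%f)\n",
                    n * sizeof(double) / secs / (1u << 30), sums[0]);
        return 0;
    }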

    Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads

    Full text link
    Sparse matrices are key ingredients of several application domains, from scientific computing to machine learning. The primary challenge with sparse matrices has been efficiently storing and transferring data, for which many sparse formats have been proposed that largely eliminate the storage of zero entries. Such formats, essentially designed to optimize memory footprint, may not be as successful at enabling faster processing. In other words, although they allow faster data transfer and improve memory bandwidth utilization -- the classic challenge of sparse problems -- their decompression mechanism can potentially create a computation bottleneck. Not only does this challenge remain unresolved, it becomes more serious with the advent of domain-specific architectures (DSAs), as they aim to improve performance more aggressively. The performance implications of using various formats together with DSAs, however, have not been extensively studied by prior work. To fill this gap in knowledge, we characterize the impact of seven frequently used sparse formats on performance, based on a DSA for sparse matrix-vector multiplication (SpMV) implemented on an FPGA using high-level synthesis (HLS) tools, a growing and popular method for developing DSAs. Seeking a fair comparison, we tailor and optimize the HLS implementation of decompression for each format. We thoroughly explore diverse metrics, including decompression overhead, latency, balance ratio, throughput, memory bandwidth utilization, resource utilization, and power consumption, on a variety of real-world and synthetic sparse workloads.
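
    To make the decompression cost concrete, the sketch below shows SpMV over the familiar CSR (compressed sparse row) format; the indirections through row_ptr and col_idx are the per-format traversal work the paper tailors in HLS. CSR is chosen here only as the best-known example, and the code is a plain software rendering, not the paper's FPGA implementation.

    #include <cstddef>
    #include <vector>

    struct Csr {
        std::vector<std::size_t> row_ptr;  // row i spans [row_ptr[i], row_ptr[i+1])
        std::vector<std::size_t> col_idx;  // column of each stored nonzero
        std::vector<double> val;           // value of each stored nonzero
    };

    // y = A * x. The indirection through col_idx is the "decompression" step
    // whose cost differs from format to format.
    std::vector<double> spmv(const Csr& a, const std::vector<double>& x) {
        std::vector<double> y(a.row_ptr.size() - 1, 0.0);
        for (std::size_t i = 0; i + 1 < a.row_ptr.size(); ++i)
            for (std::size_t k = a.row_ptr[i]; k < a.row_ptr[i + 1]; ++k)
                y[i] += a.val[k] * x[a.col_idx[k]];
        return y;
    }

    A format that compresses harder than CSR (e.g., with run-length or bit-mask encoding) shrinks the memory traffic but adds decoding work inside the inner loop, which is exactly the compute-versus-bandwidth trade-off the paper measures.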

    ERIDANUS: Efficiently Running Inference of DNNs Using Systolic Arrays

    No full text

    Charon: Specialized near-memory processing architecture for clearing dead objects in memory

    No full text
    Garbage collection (GC) is a standard feature for high-productivity programming, saving a programmer from many nasty memory-related bugs. However, these productivity benefits come with a cost in terms of application throughput, worst-case latency, and energy consumption. Since the first introduction of GC by the Lisp programming language in the 1950s, a myriad of hardware and software techniques have been proposed to reduce this cost. While the idea of accelerating GC in hardware is appealing, its impact has been very limited due to narrow coverage, lack of flexibility, intrusive system changes, and significant hardware cost. Even with specialized hardware, GC performance is eventually limited by the memory bandwidth bottleneck. Fortunately, emerging 3D-stacked DRAM technologies shed new light on this decades-old problem by enabling efficient near-memory processing with ample memory bandwidth. Thus, we propose Charon, the first 3D-stacked-memory-based GC accelerator. Through a detailed performance analysis of the HotSpot JVM, we derive a set of key algorithmic primitives based on their GC time coverage and implementation complexity in hardware. We then devise a specialized processing unit to substantially improve their memory-level parallelism and throughput at a low hardware cost. Our evaluation of Charon with the full-production HotSpot JVM running two big data analytics frameworks, Spark and GraphChi, demonstrates a 3.29x geomean speedup and 60.7% energy savings for GC over the baseline 8-core out-of-order processor.
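
    As an illustration of why GC primitives are bound by memory bandwidth and memory-level parallelism, the sketch below shows a generic tracing mark phase: a worklist of dependent, cache-unfriendly pointer loads. The object layout is an illustrative assumption, not HotSpot's heap representation or Charon's offloaded primitive set.

    #include <vector>

    struct Object {
        bool marked = false;
        std::vector<Object*> refs;  // outgoing references
    };

    // Iterative mark phase from a set of roots. Each iteration issues loads to
    // effectively random heap addresses, which is why near-memory processing
    // with ample bandwidth helps this primitive far more than bigger caches.
    void mark(const std::vector<Object*>& roots) {
        std::vector<Object*> stack(roots.begin(), roots.end());
        while (!stack.empty()) {
            Object* o = stack.back();
            stack.pop_back();
            if (o == nullptr || o->marked) continue;
            o->marked = true;                              // reachable, hence live
            for (Object* r : o->refs) stack.push_back(r);  // random-access loads
        }
    }
    // Anything left unmarked afterward is a dead object and can be reclaimed.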