vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design
The most widely used machine learning frameworks require users to carefully
tune their memory usage so that the deep neural network (DNN) fits into the
DRAM capacity of a GPU. This restriction hampers a researcher's flexibility to
study different machine learning algorithms, forcing them to either use a less
desirable network architecture or parallelize the processing across multiple
GPUs. We propose a runtime memory manager that virtualizes the memory usage of
DNNs such that both GPU and CPU memory can simultaneously be utilized for
training larger DNNs. Our virtualized DNN (vDNN) reduces the average GPU memory
usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, a
significant reduction in memory requirements of DNNs. Similar experiments on
VGG-16, one of the deepest and most memory-hungry DNNs to date, demonstrate the
memory-efficiency of our proposal. vDNN enables VGG-16 with batch size 256
(requiring 28 GB of memory) to be trained on a single NVIDIA Titan X GPU card
containing 12 GB of memory, with 18% performance loss compared to a
hypothetical, oracular GPU with enough memory to hold the entire DNN.
Comment: Published as a conference paper at the 49th IEEE/ACM International
Symposium on Microarchitecture (MICRO-49), 201
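As a rough illustration of why moving data to CPU memory lowers GPU memory usage, the toy model below compares two residency policies. It is a minimal sketch, not the vDNN algorithm itself: it assumes that feature maps dominate memory, that a layer needs only its input and output maps on the GPU while it runs, and that all other maps can sit in CPU memory. The function names and per-layer sizes are hypothetical.

```python
def baseline_peak(feature_map_mb):
    """Peak GPU memory when every feature map stays resident on the GPU
    until it is reused during backward propagation."""
    return sum(feature_map_mb)


def offloaded_peak(feature_map_mb):
    """Peak GPU memory when only the input and output maps of the layer
    currently being computed are resident (assumption: all other maps
    have been copied to CPU memory)."""
    return max(a + b for a, b in zip(feature_map_mb, feature_map_mb[1:]))


# Hypothetical feature-map sizes in MB, one per layer boundary
# (illustrative values, not taken from the paper).
sizes = [600, 400, 400, 200, 100, 50]

base = baseline_peak(sizes)
virt = offloaded_peak(sizes)
print(f"baseline: {base} MB, offloaded: {virt} MB")
print(f"reduction: {100 * (base - virt) / base:.1f}%")
```

The model ignores transfer time, so it says nothing about the performance loss the abstract reports; it only shows how the resident set shrinks.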
Instruction manual: Photogrammetry as a non-contact measurement system in large scale structural testing
Photogrammetry is a non-contact measurement method used in large-scale structural experimentation to extract information about the overall geometry of the specimen as well as the XYZ motion of select points on the structure during testing. High-resolution still cameras capture several photographs of the specimen, which are then processed using photogrammetric software. This document focuses specifically on the application of PhotoModeler® as the image post-processing tool. It aims to provide guidance to researchers who would like to adopt photogrammetric techniques to acquire experimental test data, especially in cases where a high-density grid of displacement measurements is desired at a relatively low cost.
Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology
Processing-in-memory (PIM) has been explored for decades by computer
architects, yet it has never seen the light of day in real-world products due
to its high design overheads and lack of a killer application. With the
advent of critical memory-intensive workloads, several commercial PIM
technologies have been introduced to the market ranging from domain-specific
PIM architectures to more general-purpose PIM architectures. In this work, we
take a deep dive into UPMEM's commercial PIM technology, a general-purpose PIM-enabled
parallel architecture that is highly programmable. Our first key contribution
is the development of a flexible simulation framework for PIM. The simulator we
developed, PIMulator, enables the compilation of UPMEM-PIM source code
into machine-level instructions, which are subsequently consumed
by our cycle-level performance simulator. Using PIMulator, we demystify UPMEM's
PIM design through a detailed characterization study. Building on top of our
characterization, we conduct a series of case studies to pathfind important
architectural features that we deem will be critical for future PIM
architectures to suppor
Performance-efficient mechanisms for managing irregularity in throughput processors
Recent graphics processing units (GPUs) have emerged as a promising platform for general-purpose computing and have been shown to be very efficient in executing parallel applications with regular control and memory access behavior. Current GPU architectures primarily adopt the single-instruction multiple-thread (SIMT) programming model, which balances programmability and hardware efficiency. With SIMT, the programmer writes application code to be executed by scalar threads, and each thread supports conditional branches and fine-grained load/store instructions for ease of programming. At the same time, the hardware and software collaboratively group scalar threads for execution in a vectorized single-instruction multiple-data (SIMD) in-order pipeline, simplifying hardware design. As GPUs gain momentum in various application domains, these throughput processors will increasingly demand more efficient execution of irregular applications. Current GPUs, however, suffer from reduced thread-level parallelism, underutilization of compute resources, inefficient on-chip caching, and wasted off-chip memory bandwidth for highly irregular programs with divergent control and memory accesses. In this dissertation, I develop techniques that enable simple, robust, and highly effective performance optimizations for SIMT-based throughput processor architectures such that they can better manage irregularity. I first identify that previously suggested optimizations for the divergent control flow problem suffer from the following limitations: 1) serialized execution of diverging paths, 2) lack of robustness across regular/irregular codes, and 3) limited applicability. Based on these observations, I propose and evaluate three novel mechanisms that resolve the aforementioned issues, providing significant performance improvements while minimizing implementation overhead.
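The cost of serialized execution of diverging paths can be made concrete with a small model. The sketch below is an illustration under simplifying assumptions, not a mechanism from the dissertation: it assumes a single if/else branch, that the warp issues each path only if at least one thread takes it, and that inactive lanes do no useful work while the other path runs. The function name and instruction counts are hypothetical.

```python
def simd_utilization(takes_branch, then_cost, else_cost):
    """Fraction of SIMD lane-slots doing useful work when a warp
    serializes the two sides of one branch.

    takes_branch: one bool per thread in the warp.
    then_cost, else_cost: instructions on each path (hypothetical units).
    """
    lanes = len(takes_branch)
    n_then = sum(takes_branch)
    n_else = lanes - n_then
    # A path is issued only if at least one thread needs it.
    issued = (then_cost if n_then else 0) + (else_cost if n_else else 0)
    if issued == 0:
        return 1.0
    useful = n_then * then_cost + n_else * else_cost
    return useful / (lanes * issued)


# Regular code: every thread takes the same path.
print(simd_utilization([True] * 32, then_cost=10, else_cost=10))
# Irregular code: half of the warp diverges.
print(simd_utilization([True, False] * 16, then_cost=10, else_cost=10))
```

In this model a uniform warp keeps every lane busy, while an evenly split warp with equal-length paths wastes half of its lane-slots.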
In the second half of the dissertation, I observe that conventional coarse-grained memory hierarchy designs do not take into account the massively multi-threaded nature of GPUs, which leads to substantial waste in off-chip memory bandwidth utilization. I design and evaluate a locality-aware memory hierarchy for throughput processors, which retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy consumption are reduced for data with low spatial/temporal locality without wasting control overheads or prefetching potential for data with high spatial locality.
Electrical and Computer Engineerin
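The bandwidth argument for selective fine-grained access can be illustrated with a simple count of bytes moved. This is a minimal sketch under stated assumptions, not the dissertation's design: the 128-byte coarse and 32-byte fine block sizes, the function name, and the access patterns are all hypothetical, and each touched block is assumed to be fetched exactly once.

```python
def bytes_fetched(addresses, granularity):
    """Bytes moved from off-chip memory if every touched block of
    `granularity` bytes is fetched exactly once."""
    blocks = {addr // granularity for addr in addresses}
    return len(blocks) * granularity


COARSE = 128  # assumed coarse-grained block size in bytes
FINE = 32     # assumed fine-grained block size in bytes

# High spatial locality: 32 threads read consecutive 4-byte words.
sequential = [4 * i for i in range(32)]
# Low spatial locality: 32 threads read 4-byte words 128 bytes apart.
strided = [128 * i for i in range(32)]

for name, pattern in (("sequential", sequential), ("strided", strided)):
    print(name, bytes_fetched(pattern, COARSE), bytes_fetched(pattern, FINE))
```

Under these assumptions the sequential pattern moves the same number of bytes at either granularity, while the strided pattern moves four times fewer bytes with fine-grained blocks, which is the case an adaptive granularity targets.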