
    DESTINY: A Comprehensive Tool with 3D and Multi-Level Cell Memory Modeling Capability

    To enable the design of large-capacity memory structures, novel memory technologies, such as non-volatile memory (NVM), and novel fabrication approaches, e.g., 3D stacking and multi-level cell (MLC) design, have been explored. The existing modeling tools, however, cover only a few memory technologies, technology nodes, and fabrication approaches. We present DESTINY, a tool for modeling 2D/3D memories designed using SRAM, resistive RAM (ReRAM), spin-transfer torque RAM (STT-RAM), phase change RAM (PCM), and embedded DRAM (eDRAM), as well as 2D memories designed using spin-orbit torque RAM (SOT-RAM), domain wall memory (DWM), and Flash memory. In addition to single-level cell (SLC) designs for all of these memories, DESTINY also supports modeling MLC designs for NVMs. We have extensively validated DESTINY against commercial and research prototypes of these memories. DESTINY is very useful for performing design-space exploration across several dimensions, such as optimizing for a target (e.g., latency, area, or energy-delay product) for a given memory technology, or choosing the most suitable memory technology or fabrication method (i.e., 2D vs. 3D) for a given optimization target. We believe that DESTINY will boost studies of next-generation memory architectures used in systems ranging from mobile devices to extreme-scale supercomputers. The latest source code of DESTINY is available from the following git repository: https://bitbucket.org/sparsh_mittal/destiny_v2
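    As an illustration of the kind of design-space sweep described above, the Python sketch below iterates over a few technology and capacity points and invokes the tool once per point. The binary path and config keys shown are assumptions made for illustration only; the actual DESTINY configuration format is documented in the repository.

```python
import itertools
import subprocess
import tempfile

# Hypothetical sweep axes; extend with banks, ports, optimization targets, etc.
TECHNOLOGIES = ["SRAM", "STT-RAM", "ReRAM", "PCM", "eDRAM"]
CAPACITIES_MB = [4, 16, 64]

def run_destiny(tech, capacity_mb, stack_layers):
    """Write a throwaway config and invoke an assumed ./destiny binary."""
    cfg = "\n".join([
        f"-MemoryCellType: {tech}",        # assumed key names, for illustration
        f"-Capacity (MB): {capacity_mb}",
        f"-StackedDieCount: {stack_layers}",
        "-OptimizationTarget: EDP",
    ])
    with tempfile.NamedTemporaryFile("w", suffix=".cfg", delete=False) as f:
        f.write(cfg)
        cfg_path = f.name
    result = subprocess.run(["./destiny", cfg_path], capture_output=True, text=True)
    return result.stdout                   # parse the reported latency/energy here

for tech, cap in itertools.product(TECHNOLOGIES, CAPACITIES_MB):
    for layers in (1, 4):                  # 2D vs. 3D-stacked
        report = run_destiny(tech, cap, layers)
        # ... extract the read latency / leakage power lines and tabulate ...
```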

    An Automated and Controlled Numerical Precision Reduction Framework for GPUs

    Reducing the precision of floating-point values is an effective approach to achieving higher performance as well as higher energy efficiency. This is especially true for GPUs, since many of their common tasks are inherently insensitive to precision reduction. A substantially lower bitwidth can open up many novel microarchitectural optimizations, such as resource-efficient register files, functional units, and cache memory subsystems. However, to reduce the precision of floating-point values in a controlled manner, a connection has to be established between the application and the microarchitecture, since it is decided at the application level whether deviations from the exact answer are tolerable. This thesis proposes a GPU framework which establishes such a connection. The first part of the framework consists of a method for automatically selecting an appropriate precision for each floating-point value given the tolerable output deviation. The results show that by allowing a small, but acceptable, degradation of output quality, the number of bits needed to represent the floating-point values can be significantly reduced. The second part of the framework is a novel GPU register file organization together with a register allocation algorithm capable of leveraging the precision-reduced floats given by the first part of the framework. The register allocation algorithm uses the precision-reduced floats to lower the register footprint of each thread. This is of great importance for GPUs since, unlike traditional CPU architectures, GPUs hide latency by keeping a large number of threads in flight simultaneously. Also, to enable fast context switching, the state of all active threads is kept readily available in the register file. As the per-thread register footprint limits the number of active threads, it might impede latency hiding. Our evaluation shows that the increase in active threads translates into a significant performance improvement when using our proposed GPU register file organization, at a smaller cost than increasing the number of threads by using a larger register file.
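    A minimal sketch of the idea behind the first part (not the thesis's actual algorithm): progressively widen the mantissa used for a group of floating-point values until the application's output error falls within the user-given tolerance. The run_app callback and the error metric are placeholders.

```python
import numpy as np

def truncate_mantissa(x, bits):
    """Keep only `bits` mantissa bits of float32 values (simple truncation)."""
    as_int = np.asarray(x, dtype=np.float32).view(np.uint32)
    mask = np.uint32(0xFFFFFFFF) << np.uint32(23 - bits)   # zero the low mantissa bits
    return (as_int & mask).view(np.float32)

def select_precision(run_app, inputs, reference, tol, max_bits=23):
    """Return the smallest mantissa width whose output error stays within `tol`."""
    for bits in range(1, max_bits + 1):
        out = run_app(truncate_mantissa(inputs, bits))
        rel_err = np.max(np.abs(out - reference) / (np.abs(reference) + 1e-12))
        if rel_err <= tol:
            return bits
    return max_bits   # fall back to full float32 precision
```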

    Approximation and Compression Techniques to Enhance Performance of Graphics Processing Units

    A key challenge in modern computing systems is to access data fast enough to fully utilize the computing elements in the chip. In Graphics Processing Units (GPUs), performance is often constrained by register file size, memory bandwidth, and the capacity of the main memory. One important technique for alleviating this challenge is data compression. By reducing the amount of data that needs to be communicated or stored, memory resources crucial for performance can be utilized more efficiently. This thesis provides a set of approximation and compression techniques for GPUs, with the goal of efficiently utilizing the computational fabric and thereby increasing performance. The thesis shows that these techniques can substantially lower the amount of information the system has to process, and are thus important tools for meeting challenges in memory utilization. This thesis makes contributions within three areas: controlled floating-point precision reduction, lossless and lossy memory compression, and distributed training of neural networks. In the first area, the thesis shows that through automated and controlled floating-point approximation, the register file can be utilized more efficiently. This is achieved through a framework which establishes a cross-layer connection between the application and the microarchitecture layer, and a novel register file organization capable of leveraging low-precision floating-point values and narrow integers for increased capacity and performance. Within the area of compression, this thesis aims at increasing the effective bandwidth of GPUs by presenting a lossless and lossy memory compression algorithm to reduce the amount of transferred data. In contrast to state-of-the-art compression techniques such as Base-Delta-Immediate and Bitplane Compression, which use intra-block bases for compression, the proposed algorithm leverages multiple global base values to reach a higher compression ratio. The algorithm includes an optional approximation step for floating-point values which offers a higher compression ratio at a given, low error rate. Finally, within the area of distributed training of neural networks, this thesis proposes a subgraph approximation scheme for graph data which mitigates accuracy loss in a distributed setting. The scheme allows neural network models that use graphs as inputs to converge at single-machine accuracy, while minimizing synchronization overhead between the machines.
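    To make the global-base idea concrete, here is a toy Python sketch (not the thesis's algorithm): each 32-bit word in a block is encoded as an index into a small set of global base values plus a narrow delta, whereas an intra-block scheme like Base-Delta-Immediate would pick its base from within the block itself.

```python
import numpy as np

def compress_block(block, global_bases, delta_bits=8):
    """Encode 32-bit words as (base index, narrow signed delta) pairs, or None.

    The global bases would be chosen from value statistics across memory,
    rather than from within each block as in intra-block schemes.
    """
    lo, hi = -(1 << (delta_bits - 1)), (1 << (delta_bits - 1)) - 1
    encoded = []
    for word in block:
        for idx, base in enumerate(global_bases):
            delta = int(word) - int(base)
            if lo <= delta <= hi:
                encoded.append((idx, delta))   # a few bits instead of 32 per word
                break
        else:
            return None                        # no base fits: block stays uncompressed
    return encoded

block = np.array([0x1000, 0x1004, 0x7F20, 0x1008], dtype=np.uint32)
print(compress_block(block, global_bases=[0x1000, 0x7F00]))
```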

    Energy-Aware Data Movement In Non-Volatile Memory Hierarchies

    While technology scaling enables increased density for memory cells, the intrinsically high leakage power of conventional CMOS technology and the demand for reduced energy consumption inspire the use of emerging technology alternatives such as eDRAM and Non-Volatile Memory (NVM), including STT-MRAM, PCM, and RRAM. The use of emerging technologies in Last Level Cache (LLC) designs, which occupy a significant fraction of total die area in Chip Multi-Processors (CMPs), introduces new dimensions of vulnerability, energy consumption, and performance delivery. To be specific, a part of this research focuses on the eDRAM Bit Upset Vulnerability Factor (BUVF) to assess the vulnerable portion of the eDRAM refresh cycle, where the critical charge varies depending on the write voltage, storage capacitance, and bit-line capacitance. This dissertation broadens the study of LLC vulnerability assessment by investigating the impact of Process Variations (PV) on narrow resistive sensing margins in high-density NVM arrays, including on-chip cache and primary memory. Large-latency and power-hungry Sense Amplifiers (SAs) have been adopted to combat PV in the past. Herein, a novel approach is proposed to leverage the PV in NVM arrays using a Self-Organized Sub-bank (SOS) design. SOS engages the preferred SA alternative based on the intrinsic as-built behavior of the resistive sensing timing margin to reduce the latency and power consumption while maintaining acceptable access time. On the other hand, this dissertation investigates a novel technique to prioritize service to 1) Extensively Read-Reused Accessed (ERRA) blocks of the LLC that are silently dropped from higher levels of cache, and 2) the portion of the working set that may exhibit a distant re-reference interval in L2. In particular, we develop a lightweight Multi-level Access History Profiler to efficiently identify ERRA blocks by aggregating LLC block addresses tagged with identical Most Significant Bits into a single entry. Experimental results indicate that the proposed technique can reduce the L2 read miss ratio by 51.7% on average across PARSEC and SPEC2006 workloads. In addition, this dissertation broadens and applies advancements in theories of subspace recovery to pioneer computationally-aware in-situ operand reconstruction via the novel Logic In Interconnect (LI2) scheme. LI2 is developed, validated, and refined both theoretically and experimentally to realize a radically different approach to post-Moore's Law computing: leveraging the features of low-rank matrices to reconstruct data instead of fetching it from main memory, thereby reducing the energy/latency cost per data movement. We propose an LI2 enhancement to attain high performance delivery in the post-Moore's Law era by equipping the contemporary microarchitecture with a customized memory controller which orchestrates memory requests for low-rank matrices to a customized Fine-Grain Reconfigurable Accelerator (FGRA) for reconstruction, while other memory requests are serviced as before. The goal of LI2 is to avoid the high latency/energy required to traverse main memory arrays on an LLC miss by using in-situ construction of the requested data when dealing with low-rank matrices. Thus, LI2 exchanges a high volume of data transfers for a novel lightweight reconstruction method under specific conditions, using a cross-layer hardware/algorithm approach.
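    As a toy illustration of the MSB-aggregation idea behind the profiler (not the dissertation's actual design), the Python sketch below folds block addresses that share their most significant bits into one counter entry, keeping the table small while still exposing heavily read-reused regions. All table parameters are made up for illustration.

```python
from collections import defaultdict

class AccessHistoryProfiler:
    """Toy MSB-aggregated read-reuse profiler; widths and thresholds are assumptions."""

    def __init__(self, block_bits=6, msb_shift=12, hot_threshold=4):
        self.block_bits = block_bits        # e.g., 64-byte cache blocks
        self.msb_shift = msb_shift          # low block-address bits dropped when aggregating
        self.hot_threshold = hot_threshold  # reuse count above which a region is "hot"
        self.read_reuse = defaultdict(int)

    def _region(self, address):
        return (address >> self.block_bits) >> self.msb_shift

    def record_read(self, address):
        self.read_reuse[self._region(address)] += 1

    def is_hot(self, address):
        """Would this block be prioritized (e.g., treated as a read-reuse candidate)?"""
        return self.read_reuse[self._region(address)] >= self.hot_threshold
```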

    Streaming Architectures for Medical Image Reconstruction

    Non-invasive imaging modalities have recently seen increased use in clinical diagnostic procedures. Unfortunately, emerging computational imaging techniques, such as those found in 3D ultrasound and iterative magnetic resonance imaging (MRI), are severely limited by the high computational requirements and poor algorithmic efficiency on current parallel hardware, often leading to significant delays before a doctor or technician can review the image, which can negatively impact patients in need of fast, highly accurate diagnosis. To make matters worse, the high raw data bandwidth found in 3D ultrasound requires on-chip volume reconstruction with a tight power dissipation budget: dissipation of more than 5 W may burn the skin of the patient. The tight power constraints and high volume rates required by emerging applications demand orders-of-magnitude improvement over state-of-the-art systems in terms of both reconstruction time and energy efficiency. The goal of the research outlined in this dissertation is to reduce the time and energy required to perform medical image reconstruction through software/hardware co-design. By analyzing algorithms with a hardware-centric focus, we develop novel algorithmic improvements which simultaneously reduce computational requirements and map more efficiently to traditional hardware architectures. We then design and implement hardware accelerators which push the new algorithms to their full potential. In the first part of this dissertation, we characterize the performance bottlenecks of high-volume-rate 3D ultrasound imaging. By analyzing the 3D plane-wave ultrasound algorithm, we reduce computational and storage requirements in Delay Compression. Delay Compression recognizes additional symmetry in the planar transmission scheme found in 2D, 3D, and 3D-Separable plane-wave ultrasound implementations, enabling on-chip storage of the reconstruction constants for the first time and eliminating the most power-intensive component of the reconstruction process. We then design and implement Tetris, a streaming hardware accelerator for 3D-Separable plane-wave ultrasound. Tetris is enabled by the Tetris Reservation Station, a novel 2D register file that buffers incomplete voxels and eliminates the need for a traditional load-and-store memory interface. Utilizing a fully pipelined architecture, Tetris reconstructs volumes at physics-limited rates (i.e., limited by the physical propagation speed of sound through tissue). Next, we review a core component of several computational imaging modalities, the Non-uniform Fast Fourier Transform (NuFFT), focusing on its use in MRI reconstruction. We find that the non-uniform interpolation step therein requires over 99% of the reconstruction time due to poor spatial and temporal memory locality. While prior work has made great strides in improving the performance of the NuFFT, the most common algorithmic optimization severely limits the available parallelism, causing it to map poorly to the massively parallel processing available in modern GPUs and FPGAs. To this end, we create Slice-and-Dice, a processing model which enables efficient mapping of the NuFFT's most computationally intensive component onto traditional parallel architectures. We then demonstrate the full acceleration potential of Slice-and-Dice with Jigsaw, a custom hardware accelerator which performs the non-uniform interpolations found in the NuFFT in time approximately linear in the number of non-uniform samples, irrespective of sampling pattern, uniform grid size, or interpolation kernel width. The algorithms and architectures herein enable faster, more efficient medical image reconstruction, without sacrificing image quality. By decreasing the time and energy required for image reconstruction, our work opens the door for future exploration into higher-resolution imaging and emerging, computationally complex reconstruction algorithms which improve the speed and quality of patient diagnosis. (PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies; http://deepblue.lib.umich.edu/bitstream/2027.42/167986/1/westbl_1.pd)
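    To make the locality problem concrete, here is a naive Python sketch of the NuFFT's non-uniform interpolation (gridding) step; it shows the baseline scatter pattern that designs like Slice-and-Dice and Jigsaw reorganize, not their actual formulation.

```python
import numpy as np

def naive_gridding(samples, coords, grid_size, kernel, width=4):
    """Scatter each non-uniform k-space sample onto nearby uniform grid points.

    Samples arrive in arbitrary spatial order, so the grid writes have poor
    spatial/temporal locality: the behavior identified above as dominating
    NuFFT-based MRI reconstruction time.
    """
    grid = np.zeros((grid_size, grid_size), dtype=np.complex64)
    half = width // 2
    for value, (kx, ky) in zip(samples, coords):
        gx, gy = int(round(kx)), int(round(ky))
        for dx in range(-half, half + 1):
            for dy in range(-half, half + 1):
                x, y = (gx + dx) % grid_size, (gy + dy) % grid_size
                grid[x, y] += value * kernel(kx - (gx + dx), ky - (gy + dy))
    return grid

# Example kernel: a simple (unnormalized) Gaussian interpolation window.
gauss = lambda dx, dy: np.exp(-(dx * dx + dy * dy) / 2.0)
```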

    Doctor of Philosophy

    The internet-based information infrastructure that has powered the growth of modern personal/mobile computing is composed of powerful, warehouse-scale computers or datacenters. These heavily subscribed datacenters perform data-processing jobs under intense quality-of-service guarantees. Further, high-performance compute platforms are being used to model and analyze increasingly complex scientific problems and natural phenomena. To ensure that the high-performance needs of these machines are met, it is necessary to increase the efficiency of the memory system that supplies data to the processing cores. Many of the microarchitectural innovations that were designed to scale the memory wall (e.g., out-of-order instruction execution, on-chip caches) are being rendered less effective due to several emerging trends (e.g., increased emphasis on energy consumption, limited access locality). This motivates the optimization of the main memory system itself. The key to an efficient main memory system is the memory controller. In particular, the scheduling algorithm in the memory controller greatly influences its performance. This dissertation explores this hypothesis in several contexts. It develops tools to better understand memory scheduling and develops scheduling innovations for CPUs and GPUs. We propose novel memory scheduling techniques that are strongly aware of the access patterns of the clients as well as the microarchitecture of the memory device. Based on these, we present (i) a Dynamic Random Access Memory (DRAM) chip microarchitecture optimized for reducing write-induced slowdown, (ii) a memory scheduling algorithm that exploits these features, (iii) several memory scheduling algorithms to reduce the memory-related stall experienced by irregular General Purpose Graphics Processing Unit (GPGPU) applications, and (iv) the Utah Simulated Memory Module (USIMM), a detailed, validated simulator for DRAM main memory that we use for analyzing and proposing scheduling algorithms.
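    For context, the Python sketch below shows a baseline row-buffer-aware (FR-FCFS-style) scheduling decision of the kind a memory controller makes; it is not one of the dissertation's proposed schedulers, which additionally account for client access patterns and DRAM chip microarchitecture.

```python
from collections import deque

class RowHitFirstScheduler:
    """Baseline FR-FCFS-style policy: prefer row-buffer hits, then the oldest request.

    Real controllers also handle write drains, refresh, bank timing constraints,
    and fairness; those are omitted in this sketch.
    """
    def __init__(self):
        self.queue = deque()          # pending (bank, row) requests, oldest first
        self.open_row = {}            # bank -> currently open row

    def enqueue(self, bank, row):
        self.queue.append((bank, row))

    def next_request(self):
        if not self.queue:
            return None
        for req in list(self.queue):              # any request hitting an open row?
            bank, row = req
            if self.open_row.get(bank) == row:
                self.queue.remove(req)
                return req
        bank, row = self.queue.popleft()          # otherwise, service the oldest
        self.open_row[bank] = row                 # precharge + activate its row
        return bank, row
```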

    Study and development of innovative strategies for energy-efficient cross-layer design of digital VLSI systems based on Approximate Computing

    The increasing demand for high performance and energy efficiency in modern digital systems has led to research into new design approaches that are able to go beyond the established energy-performance tradeoff. In the scientific literature, the Approximate Computing paradigm has been particularly prolific. Many applications in the domains of signal processing, multimedia, computer vision, and machine learning are known to be particularly resilient to errors occurring in their input data and during computation, producing outputs that, although degraded, are still largely acceptable from the point of view of quality. The Approximate Computing design paradigm leverages the characteristics of this group of applications to develop circuits, architectures, and algorithms that, by relaxing design constraints, perform their computations in an approximate or inexact manner, reducing energy consumption. This PhD research aims to explore the design of hardware/software architectures based on Approximate Computing techniques, filling the gap in the literature regarding effective applicability and deriving a systematic methodology to characterize its benefits and tradeoffs. The main contributions of this work are:
    - the introduction of approximate memory management inside the Linux OS, allowing dynamic allocation and de-allocation of approximate memory at user level, as for normal exact memory;
    - the development of an emulation environment for platforms with approximate memory units, where faults are injected during simulation based on models that reproduce the effects on memory cells of circuit-level and architectural techniques for approximate memories (see the sketch below);
    - the implementation and analysis of the impact of approximate memory hardware on real applications: the H.264 video encoder, internally modified to allocate selected data buffers in approximate memory, and signal processing applications (digital filters) using approximate memory for input/output buffers and tap registers;
    - the development of a fully reconfigurable and combinatorial floating-point unit, which can work with reduced-precision formats.
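    As a toy illustration of the fault-injection idea referenced above (not the thesis's actual emulation environment), the Python sketch below flips bits of a buffer designated as approximate according to a simple uniform per-bit error probability; a real model would depend on the specific circuit or architectural technique, e.g., reduced refresh rate or lowered supply voltage.

```python
import numpy as np

def inject_faults(approx_buffer, bit_error_rate, rng=None):
    """Flip random bits of a uint8 buffer to emulate an approximate memory region.

    `bit_error_rate` stands in for whatever error model the underlying
    technique would impose; a uniform rate is assumed here for simplicity.
    """
    rng = rng or np.random.default_rng()
    flips = rng.random((approx_buffer.size, 8)) < bit_error_rate   # one draw per bit
    masks = (flips * (1 << np.arange(8))).sum(axis=1).astype(np.uint8)
    return approx_buffer ^ masks

# Example: a frame-sized buffer "allocated" in the emulated approximate region.
frame = np.full(64 * 1024, 128, dtype=np.uint8)
noisy = inject_faults(frame, bit_error_rate=1e-3)
```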