
    Doctor of Philosophy

    Dissertation
    The internet-based information infrastructure that has powered the growth of modern personal/mobile computing is composed of powerful, warehouse-scale computers or datacenters. These heavily subscribed datacenters perform data-processing jobs under strict quality-of-service guarantees. Further, high-performance compute platforms are being used to model and analyze increasingly complex scientific problems and natural phenomena. To ensure that the high-performance needs of these machines are met, it is necessary to increase the efficiency of the memory system that supplies data to the processing cores. Many of the microarchitectural innovations that were designed to scale the memory wall (e.g., out-of-order instruction execution, on-chip caches) are being rendered less effective by several emerging trends (e.g., increased emphasis on energy consumption, limited access locality). This motivates optimizing the main memory system itself. The key to an efficient main memory system is the memory controller; in particular, the scheduling algorithm in the memory controller greatly influences its performance. This dissertation explores this hypothesis in several contexts. It develops tools to better understand memory scheduling and develops scheduling innovations for CPUs and GPUs. We propose novel memory scheduling techniques that are strongly aware of the access patterns of the clients as well as the microarchitecture of the memory device. Based on these, we present (i) a Dynamic Random Access Memory (DRAM) chip microarchitecture optimized for reducing write-induced slowdown, (ii) a memory scheduling algorithm that exploits these features, (iii) several memory scheduling algorithms to reduce the memory-related stall experienced by irregular General Purpose Graphics Processing Unit (GPGPU) applications, and (iv) the Utah Simulated Memory Module (USIMM), a detailed, validated simulator for DRAM main memory that we use for analyzing and proposing scheduling algorithms.
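
    To make the scheduling discussion concrete, below is a minimal, illustrative Python sketch of an FR-FCFS-style read scheduler, the kind of baseline policy this line of work builds on. All names and data structures are hypothetical and are not taken from USIMM.

    from collections import deque
    from dataclasses import dataclass

    @dataclass
    class Request:
        arrival: int  # cycle at which the request entered the queue
        bank: int     # DRAM bank addressed by the request
        row: int      # DRAM row addressed by the request

    def pick_next(queue, open_rows):
        """FR-FCFS: prefer row-buffer hits; break ties by age."""
        hits = [r for r in queue if open_rows.get(r.bank) == r.row]
        pool = hits if hits else list(queue)
        return min(pool, key=lambda r: r.arrival)  # oldest eligible request

    # Tiny usage example: bank 1 already has row 9 open, so that request wins.
    queue = deque([Request(arrival=0, bank=0, row=5), Request(arrival=1, bank=1, row=9)])
    open_rows = {0: 3, 1: 9}
    print(pick_next(queue, open_rows))  # Request(arrival=1, bank=1, row=9)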

    Staged reads: mitigating the impact of DRAM writes on DRAM reads

    Journal Article
    Main memory latencies have always been a concern for system performance. Given that reads are on the critical path for CPU progress, reads must be prioritized over writes. However, writes must eventually be processed, and they often delay pending reads. In fact, a single channel in the main memory system offers almost no parallelism between reads and writes. This is because a single off-chip memory bus is shared by reads and writes, and the direction of the bus has to be explicitly turned around when switching from writes to reads. This is an expensive operation, and its cost is amortized by carrying out a burst of writes or reads every time the bus direction is switched. As a result, no reads can be processed while a memory channel is busy servicing writes. This paper proposes a novel mechanism to boost read-write parallelism and perform useful components of read operations even when the memory system is busy performing writes. If some of the banks are busy servicing writes, we start issuing reads to the other idle banks. The results of these reads are stored in a few registers near the memory chip's I/O pads. These results are returned immediately following the bus turnaround. The process is referred to as a Staged Read because it decouples a single read operation into two stages, with the first stage being performed in parallel with writes. This innovation can also be viewed as a form of prefetch that is internal to a memory chip. The proposed technique works best when there is bank imbalance in the write stream. We also introduce a write scheduling algorithm that artificially creates bank imbalance and allows useful read operations to be performed during the write drain. Across a suite of memory-intensive workloads, we show that Staged Reads can boost throughput by up to 33% (average 7%) with an average DRAM access latency improvement of 17%, while incurring a very small cost (0.25%) in terms of memory chip area. The throughput improvements are even greater when considering write-intensive workloads (average 11%) or future systems (average 12%).
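
    As a rough illustration of the idea (not the paper's implementation), the Python sketch below overlaps the first stage of reads to idle banks with a write drain and parks their data in a small hypothetical near-pad buffer until the bus turns around. All structures and field names are invented for the example.

    def drain_writes_with_staged_reads(writes, reads, buffer_slots=4):
        """Stage the bank-access portion of reads whose banks are not busy with writes."""
        write_banks = {w["bank"] for w in writes}
        staged = []
        # Stage 1: performed in parallel with the write drain.
        for r in reads:
            if len(staged) == buffer_slots:
                break                         # the near-pad buffer is small
            if r["bank"] not in write_banks:  # bank is idle during the drain
                staged.append(r)              # data now waits near the I/O pads
        remaining = [r for r in reads if r not in staged]
        # Stage 2: after the bus turnaround, staged data is returned in a quick
        # burst, then the remaining reads are serviced normally.
        return staged, remaining

    staged, rest = drain_writes_with_staged_reads(
        writes=[{"bank": 0}, {"bank": 0}, {"bank": 1}],
        reads=[{"bank": 2, "addr": 0x100}, {"bank": 0, "addr": 0x200}])
    print(len(staged), len(rest))  # 1 staged read (bank 2), 1 ordinary read (bank 0)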

    Leveraging heterogeneity in DRAM main memories to accelerate critical word access

    Pre-print
    The DRAM main memory system in modern servers is largely homogeneous. In recent years, DRAM manufacturers have produced chips with vastly differing latency and energy characteristics. This provides the opportunity to build a heterogeneous main memory system where different parts of the address space can yield different latencies and energy per access. The limited prior work in this area has explored smart placement of pages with high activities. In this paper, we propose a novel alternative to exploit DRAM heterogeneity. We observe that the critical word in a cache line can be easily recognized beforehand and placed in a low-latency region of the main memory. Other non-critical words of the cache line can be placed in a low-energy region. We design an architecture that has low complexity and that can accelerate the transfer of the critical word by tens of cycles. For our benchmark suite, we show an average performance improvement of 12.9% and an accompanying memory energy reduction of 15%.
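
    A minimal sketch of the placement idea follows, with purely hypothetical latency values: the word predicted to be critical is mapped to the low-latency region and the rest of the line to the low-energy region, so the core can resume as soon as the fast region responds.

    FAST_REGION_LATENCY = 20  # cycles; hypothetical low-latency DRAM region
    SLOW_REGION_LATENCY = 45  # cycles; hypothetical low-energy DRAM region

    def place_line(num_words, critical_index):
        """Assign each word of a cache line to a DRAM region."""
        return ["fast" if i == critical_index else "slow" for i in range(num_words)]

    def time_to_critical_word(placement, critical_index):
        """The core can resume execution once the critical word arrives."""
        region = placement[critical_index]
        return FAST_REGION_LATENCY if region == "fast" else SLOW_REGION_LATENCY

    placement = place_line(num_words=8, critical_index=2)
    print(time_to_critical_word(placement, 2))  # 20: served from the fast region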

    Rethinking DRAM design and organization for energy-constrained multicores

    Poster
    DRAM vendors have traditionally optimized for low cost and high performance, often making design decisions that incur energy penalties. For example, a single conventional DRAM access activates thousands of bitlines in many chips to return a single cache line to the CPU. The other bits may be accessed if there is high locality in the access stream. Otherwise, significant energy is wasted, especially when memory access locality is reduced as core, thread, and socket counts increase. Instead, we propose and analyze two optimizations that activate as little of the DRAM circuitry as possible, while incurring only modest performance penalties. The first approach, Selective Bitline Activation (SBA), is compatible with existing DRAM standards. SBA waits for both RAS and CAS signals to arrive before activating exactly those bitlines that provide the requested cache line. Our second proposal, Single Subarray Access (SSA), re-organizes the layout of DRAM subarrays and the mapping of data to subarrays so that the entire cache line is fetched from a single subarray. Since SSA reads an entire cache line from a single DRAM chip, we also examine the addition of DRAM checksums to increase error detection and correction capabilities. The approach is similar to existing methods used on disk drives. We then provide chipkill-level reliability through RAID techniques. The SSA design yields memory energy savings of 6X, while incurring an area cost of 4.5%, and even improving performance by up to 15%. Our chipkill solution has significantly lower capacity and energy overheads than other known chipkill solutions, while incurring only an 11% performance penalty compared to an SSA memory system without chipkill.
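
    A minimal sketch, under an assumed (hypothetical) geometry, of the kind of line-to-subarray data mapping SSA relies on: consecutive cache lines are steered to individual subarrays, so one access touches a single small subarray rather than a full row in every chip.

    LINE_BYTES = 64            # cache line size
    BANKS = 8                  # hypothetical geometry
    SUBARRAYS_PER_BANK = 64    # hypothetical geometry

    def ssa_map(phys_addr):
        """Map a physical address to the (bank, subarray) holding its whole cache line."""
        line = phys_addr // LINE_BYTES
        bank = line % BANKS
        subarray = (line // BANKS) % SUBARRAYS_PER_BANK
        return bank, subarray

    print(ssa_map(0x0000))  # (0, 0)
    print(ssa_map(0x4000))  # line 256 -> (0, 32): same bank, different subarray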

    DeLTA: GPU Performance Model for Deep Learning Applications with In-depth Memory System Traffic Analysis

    Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. In particular, convolution layers account for the majority of the execution time of CNN training, and GPUs are commonly used to accelerate these layer workloads. Optimizing GPU designs for efficient CNN training acceleration requires accurately modeling how performance improves when computing and memory resources are increased. We present DeLTA, the first analytical model that accurately estimates the traffic at each GPU memory hierarchy level while accounting for the complex reuse patterns of a parallel convolution algorithm. We demonstrate that our model is both accurate and robust for different CNNs and GPU architectures. We then show how this model can be used to carefully balance the scaling of different GPU resources for efficient CNN performance improvement.
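
    For intuition, here is a deliberately simplified Python sketch of a per-layer traffic estimate. It is not DeLTA's actual formulation: it only counts compulsory (first-touch) off-chip traffic for one convolution layer and ignores the cache-hierarchy reuse that DeLTA models in detail.

    def conv_layer_traffic(n, c, h, w, k, r, s, bytes_per_elem=2):
        """Compulsory DRAM bytes and arithmetic intensity for an N x C x H x W input,
        K filters of size C x R x S, stride 1, 'same' padding."""
        input_bytes  = n * c * h * w * bytes_per_elem
        weight_bytes = k * c * r * s * bytes_per_elem
        output_bytes = n * k * h * w * bytes_per_elem
        flops = 2 * n * k * h * w * c * r * s       # one multiply + one add per MAC
        bytes_total = input_bytes + weight_bytes + output_bytes
        return bytes_total, flops / bytes_total     # traffic, FLOPs per byte

    bytes_total, intensity = conv_layer_traffic(n=32, c=128, h=28, w=28, k=128, r=3, s=3)
    print(bytes_total / 1e6, intensity)  # MB of compulsory traffic, FLOPs per byte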