263 research outputs found

Software and hardware methods for memory access latency reduction on ILP processors

Author: Zhang Zhao
Publication venue: W&M ScholarWorks
Publication date: 01/01/2002

While microprocessors have doubled their speed every 18 months, performance improvement of memory systems has continued to lag behind. To address the speed gap between CPU and memory, a standard multi-level caching organization has been built for fast data accesses before the data have to be accessed in the DRAM core. The existence of these caches in a computer system, such as L1, L2, L3, and DRAM row buffers, does not mean that data locality will be automatically exploited. The effective use of the memory hierarchy mainly depends on how data are allocated and how memory accesses are scheduled. In this dissertation, we propose several novel software and hardware techniques to effectively exploit data locality and to significantly reduce memory access latency.
We first present a case study at the application level that restructures memory-intensive programs by utilizing program-specific knowledge. This study identifies the problem of bit-reversals, a set of data reordering operations used extensively in scientific computing programs such as FFT, whose special data access pattern can cause severe cache conflicts. We propose several software methods, including padding and blocking, to restructure the program to reduce those conflicts. Our methods outperform existing ones on both uniprocessor and multiprocessor systems.
The access latency to the DRAM core has become increasingly long relative to CPU speed, making memory accesses an execution bottleneck. To reduce the frequency of DRAM core accesses and thereby shorten the overall memory access latency, we have conducted three studies at this level of the memory hierarchy. First, motivated by our evaluation of the DRAM row buffer's performance role and our findings on the causes of its access conflicts, we propose a simple and effective memory interleaving scheme to reduce or even eliminate row buffer conflicts. Second, we propose a fine-grain priority scheduling scheme to reorder the sequence of data accesses on multi-channel memory systems, effectively exploiting the available bus bandwidth and access concurrency. In the final part of the dissertation, we first evaluate the design of cached DRAM and its organization alternatives associated with ILP processors. We then propose a new memory hierarchy integration that uses cached DRAM to construct a very large off-chip cache. We show that this structure outperforms a standard memory system with an off-level L3 cache for memory-intensive applications.
Memory access latency has become a major performance bottleneck for memory-intensive applications. As long as DRAM remains the most cost-effective technology for building main memory, the memory performance problem will continue to exist. The studies conducted in this dissertation attempt to address this important issue. Our proposed software and hardware schemes are effective and can be directly used in real-world memory system designs and implementations. Our studies also provide guidance for application programmers to understand memory performance implications, and for system architects to optimize memory hierarchies.
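For readers unfamiliar with the bit-reversal reordering mentioned in the abstract above, the following is a minimal Python sketch of the plain permutation only. It is not the dissertation's padding or blocking method; the function names `bit_reverse` and `bit_reversal_permute` are hypothetical, chosen for illustration. The sketch shows why the pattern is hard on caches: consecutive source indices scatter to destinations separated by large power-of-two strides.

```python
def bit_reverse(i: int, bits: int) -> int:
    """Return i with its lowest `bits` bits in reversed order."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)  # shift result left, append lowest bit of i
        i >>= 1
    return r


def bit_reversal_permute(x):
    """Naive bit-reversal reordering of a list whose length is a power of two.

    Illustrative only: element i is written to index bit_reverse(i), so
    neighbouring sources land far apart (e.g. indices 0 and 1 map to 0 and n/2).
    """
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    y = [None] * n
    for i, v in enumerate(x):
        y[bit_reverse(i, bits)] = v
    return y


if __name__ == "__main__":
    print(bit_reversal_permute(list(range(8))))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

With power-of-two array sizes, the destination stride of n/2 elements tends to map many writes onto the same cache sets, which is the kind of conflict that techniques such as padding and blocking aim to reduce.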

College of William & Mary: W&M Publish

GPGPU microbenchmarking for irregular application optimization

Author: Winans-Pruitt Dalton R.
Publication venue: Scholars Junction
Publication date: 09/08/2022

Irregular applications, such as unstructured mesh operations, do not easily map onto the typical GPU programming paradigms endorsed by GPU manufacturers, which mostly focus on maximizing concurrency for latency hiding. In this work, we show how alternative techniques focused on latency amortization can be used to control overall latency while requiring less concurrency. We used a custom-built microbenchmarking framework to test several GPU kernels and show how the GPU behaves under relevant workloads. We demonstrate that coalescing is not required for good performance; an uncoalesced access pattern can achieve high bandwidth, even over 80% of the theoretical global memory bandwidth in certain circumstances. We also make further observations on specific relevant behaviors of GPUs. We hope that this study opens the door for further investigation into techniques that can exploit latency amortization when latency hiding does not achieve sufficient performance.

Scholars Junction - Mississippi State University Institutional Repository

Energy Saving Techniques for Phase Change Memory (PCM)

Author: Mittal Sparsh
Publication date: 15/09/2013

In recent years, the energy consumption of computing systems has increased, and a large fraction of this energy is consumed in main memory. In response, researchers have proposed the use of non-volatile memory, such as phase change memory (PCM), which has low read latency and power and nearly zero leakage power. However, the write latency and power of PCM are very high, which, along with its limited write endurance, presents significant challenges to the widespread adoption of PCM. To address this, several architecture-level techniques have been proposed. In this report, we review several techniques to manage the power consumption of PCM. We also classify these techniques based on their characteristics to provide insights into them. The aim of this work is to encourage researchers to propose even better techniques for improving the energy efficiency of PCM-based main memory. Comment: Survey, phase change RAM (PCRAM

arXiv.org e-Print Archive

CiteSeerX

Introducing memory versatility to enhance memory system performance, energy efficiency and reliability

Author: Cao Yanan
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2016

The main memory system faces increasingly high pressure from advances in computation power scaling. Memory systems are now expected to have much higher capacity than before. However, DRAM devices have limited scalability, and higher capacity usually translates to proportionally higher hardware and power cost. Memory compression is a promising technology to avoid this. Previous memory compression works are generally based on a rigid data layout, which limits their performance. We thus propose Flexible Memory, which supports out-of-order memory block layout to lower compression-related overhead and improve performance. In addition, the cost of memory reliability also increases with capacity growth. Conventional error protection schemes utilize a Hamming-based SECDED code that comes with a 12.5% capacity and power overhead for the entire memory system. However, it is not necessary to protect the whole memory system, because some data is not critical or not sensitive to memory errors. The memory capacity and power used in protecting such data are almost wasted. Therefore, Selective Error Protection (SEP) is necessary to lower the cost of large-scale memory protection. A method to separate critical from non-critical data has been proposed before; however, a memory system design that supports the resulting partitioned memory is challenging and did not exist at the time. Therefore, we propose a memory system design that can maintain two or more partitions with different layouts in main memory at the same time. This design makes SEP schemes a complete, practical design. Even with selective error protection, supporting memory reliability still hurts the scaling of memory capacity. Fortunately, memory data has been shown to be very compressible. Most common applications are expected to free up enough space to store their own ECC code. For these applications, memory reliability incurs very low space and power overhead.
However, combining ECC and memory compression is not trivial. It is difficult to achieve a high percentage of coverage over the entire memory when the compressibility of different memory blocks varies a lot. We thus introduce Flexible ECC, which is based on Flexible Memory, to allow easier ECC code placement. When a block has more choices for where to store its ECC code, it is more likely to be covered by ECC. With Flexible ECC, a larger portion of memory can be covered by ECC codes whose storage overhead is lowered by memory compression.
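The 12.5% SECDED overhead quoted in the abstract above can be reproduced with a short calculation. The sketch below assumes a 64-bit data word, which the abstract does not state but which is the common (72,64) configuration; the function name `secded_check_bits` is hypothetical. A Hamming code over m data bits needs the smallest r with 2^r >= m + r + 1, and SECDED adds one overall parity bit.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming-based SECDED code over `data_bits` data bits.

    Finds the smallest r with 2**r >= data_bits + r + 1 (single-error
    correction), then adds one overall parity bit (double-error detection).
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


if __name__ == "__main__":
    m = 64  # assumed data word width, not taken from the abstract
    k = secded_check_bits(m)
    print(k, k / m)  # 8 check bits, 0.125 overhead
```

Under this assumption, each 64-bit word carries 8 check bits, so storage overhead is 8/64 = 12.5% relative to the data.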

Digital Repository @ Iowa State University (ISU)

Compiler-Decided Dynamic Memory Allocation for Scratch-Pad Based Embedded Systems

Author: Udayakumaran Sumesh
Publication date: 01/01/2006

In this research we propose a highly predictable, low-overhead, and yet dynamic memory allocation strategy for embedded systems with scratch-pad memory. A scratch-pad is a fast compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees compared with a cache and by its significantly lower overheads in energy consumption, area, and overall runtime, even with a simple allocation scheme. Scratch-pad allocation methods are primarily of two types. First, software-caching schemes emulate the workings of a hardware cache in software. Instructions are inserted before each load/store to check the software-maintained cache tags. Such methods incur large overheads in runtime, code size, energy consumption, and SRAM space for tags, and deliver poor real-time guarantees, just like hardware caches. A second category of algorithms partitions variables at compile-time into the two banks. However, a drawback of such static allocation schemes is that they do not account for dynamic program behavior. We propose a dynamic allocation methodology for global and stack data and program code that (i) accounts for changing program requirements at runtime, (ii) has no software-caching tags, (iii) requires no run-time checks, (iv) has extremely low overheads, and (v) yields 100% predictable memory access times. In this method, data that is about to be accessed frequently is copied into the scratch-pad using compiler-inserted code at fixed and infrequent points in the program. Earlier data is evicted if necessary. When compared to an existing static allocation scheme, results show that our scheme reduces runtime by up to 39.8% and energy by up to 31.3% on average for our benchmarks, depending on the SRAM size used. The actual gain depends on the SRAM size, but our results show that close to the maximum benefit in run-time and energy is achieved for a substantial range of small SRAM sizes commonly found in embedded systems.
Our comparison with a direct-mapped cache shows that our method performs roughly as well as a cached architecture in runtime and energy while delivering better real-time benefits.
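To make concrete the software-caching overhead that the abstract above argues against, here is a minimal Python sketch of a direct-mapped software cache in which every load performs a tag check. It illustrates the general idea only and is not the author's scheme or any specific prior system; the class `SoftwareCache` and its parameters are hypothetical.

```python
class SoftwareCache:
    """Toy direct-mapped software cache over a list acting as slow memory.

    Every load pays for a tag comparison, which models the per-access
    check that software-caching schemes insert before each load/store.
    """

    def __init__(self, backing, num_lines: int, line_size: int):
        self.backing = backing
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines    # software-maintained tags
        self.lines = [None] * num_lines   # cached copies of memory lines
        self.tag_checks = 0
        self.misses = 0

    def load(self, addr: int):
        line_addr = addr // self.line_size
        idx = line_addr % self.num_lines
        self.tag_checks += 1              # overhead paid on every access
        if self.tags[idx] != line_addr:   # miss: fetch line from slow memory
            self.misses += 1
            start = line_addr * self.line_size
            self.lines[idx] = self.backing[start:start + self.line_size]
            self.tags[idx] = line_addr
        return self.lines[idx][addr % self.line_size]


if __name__ == "__main__":
    cache = SoftwareCache(list(range(64)), num_lines=4, line_size=4)
    values = [cache.load(a) for a in range(8)]
    print(values, cache.tag_checks, cache.misses)
```

In this toy model, eight sequential loads incur eight tag checks even though only two of them miss. A compiler-directed scheme of the kind the abstract describes avoids the per-access check by copying data at fixed program points instead.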

CiteSeerX

Digital Repository at the University of Maryland

Limits of a decoupled out-of-order superscalar architecture

Author: Jones Graham P.
Publication venue: The University of Edinburgh
Publication date: 01/01/1999
Field of study

Edinburgh Research Archive