3 research outputs found

    Optimum Circuits for Bit Reversal

    Software and hardware methods for memory access latency reduction on ILP processors

    While microprocessors have doubled their speed every 18 months, the performance of memory systems has continued to lag behind. To address the speed gap between the CPU and memory, a standard multi-level caching organization provides fast data access before data must be fetched from the DRAM core. The mere existence of these caches, such as L1, L2, L3, and the DRAM row buffers, does not mean that data locality will be exploited automatically. Effective use of the memory hierarchy depends mainly on how data are allocated and how memory accesses are scheduled. In this dissertation, we propose several novel software and hardware techniques to exploit data locality effectively and to significantly reduce memory access latency.

    We first present a case study at the application level that restructures memory-intensive programs by exploiting program-specific knowledge. The study focuses on bit-reversals, a set of data reordering operations used extensively in scientific computing programs such as the FFT, whose special data access pattern can cause severe cache conflicts. We propose several software methods, including padding and blocking, to restructure the program and reduce those conflicts. Our methods outperform existing ones on both uniprocessor and multiprocessor systems.

    The access latency to the DRAM core has become increasingly long relative to CPU speed, making memory accesses an execution bottleneck. To reduce the frequency of DRAM core accesses and thereby shorten overall memory access latency, we have conducted three studies at this level of the memory hierarchy. First, motivated by our evaluation of the DRAM row buffer's performance role and our findings on the causes of its access conflicts, we propose a simple and effective memory interleaving scheme that reduces or even eliminates row buffer conflicts. Second, we propose a fine-grain priority scheduling scheme that reorders the sequence of data accesses on multi-channel memory systems, effectively exploiting the available bus bandwidth and access concurrency. In the final part of the dissertation, we first evaluate the design of cached DRAM and its organization alternatives for ILP processors, and then propose a new memory hierarchy that uses cached DRAM to construct a very large off-chip cache. We show that this structure outperforms a standard memory system with an off-chip L3 cache for memory-intensive applications.

    Memory access latency has become a major performance bottleneck for memory-intensive applications. As long as DRAM remains the most cost-effective technology for main memory, this performance problem will persist. The studies in this dissertation address this important issue. Our proposed software and hardware schemes are effective and practical, and can be applied directly in real-world memory system designs and implementations. Our studies also provide guidance for application programmers seeking to understand memory performance implications, and for system architects optimizing memory hierarchies.
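    The padding technique mentioned in the first study can be illustrated with a short sketch. The C code below is an illustration under assumed sizes, not the dissertation's actual implementation; the array names, the padding amount PAD, and the problem size are placeholders. It performs an out-of-place bit-reversal copy into a destination whose rows carry a few extra elements, so that the power-of-two-strided writes no longer all map to the same cache sets.

```c
/*
 * Illustrative sketch (not the dissertation's exact code): out-of-place
 * bit-reversal reordering with inter-row padding.  For N = 2^(2m), writing
 * b[rev(i)] = a[i] touches destination addresses a power-of-two stride
 * apart, which collide in a small number of cache sets.  Storing the
 * destination with PAD extra elements per "row" of 2^m elements shifts
 * successive rows into different sets.  All sizes here are assumptions.
 */
#define LOG2N  20                 /* N = 2^20 elements (even exponent assumed) */
#define N      (1u << LOG2N)
#define M      (LOG2N / 2)        /* split the index into two halves of M bits */
#define ROWLEN (1u << M)
#define PAD    8                  /* padding elements per row (tunable)        */

/* Reverse the low `bits` bits of x. */
static unsigned bitrev(unsigned x, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Copy a[] into a padded buffer b[] in bit-reversed order.
 * b must hold at least N + PAD*ROWLEN elements. */
void bitreversal_padded(const double *a, double *b)
{
    for (unsigned i = 0; i < N; i++) {
        unsigned j   = bitrev(i, LOG2N);      /* logical destination index   */
        unsigned row = j >> M;                /* high half selects the row   */
        unsigned col = j & (ROWLEN - 1);      /* low half selects the column */
        b[row * (ROWLEN + PAD) + col] = a[i]; /* padded, conflict-avoiding layout */
    }
}
```

    In practice PAD would be tuned to the cache line size and associativity of the target machine so that consecutive rows fall into different cache sets.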
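    The memory interleaving scheme for row-buffer conflicts can likewise be sketched. The abstract does not specify the exact mapping, so the snippet below shows a generic permutation-based (XOR) bank-index computation with hypothetical bit positions, contrasted with conventional modulo interleaving.

```c
/*
 * Illustrative sketch only: one common way to reduce DRAM row-buffer
 * conflicts is to permute the bank index by XORing it with higher-order
 * address bits, so that addresses that share the low-order page/row bits
 * no longer all fall into the same bank.  The bit positions below are
 * hypothetical and not taken from the dissertation.
 */
#include <stdint.h>

#define BANK_BITS   3   /* 8 banks (assumed)                      */
#define BANK_SHIFT  13  /* bank field starts above an 8 KB page   */
#define XOR_SHIFT   22  /* higher-order bits mixed into the bank  */

/* Conventional interleaving: bank = low-order page-frame bits. */
static inline uint32_t bank_conventional(uint64_t paddr)
{
    return (uint32_t)(paddr >> BANK_SHIFT) & ((1u << BANK_BITS) - 1);
}

/* Permutation-based interleaving: XOR in bits from higher up the address. */
static inline uint32_t bank_permuted(uint64_t paddr)
{
    uint32_t low  = (uint32_t)(paddr >> BANK_SHIFT) & ((1u << BANK_BITS) - 1);
    uint32_t high = (uint32_t)(paddr >> XOR_SHIFT)  & ((1u << BANK_BITS) - 1);
    return low ^ high;   /* spreads conflicting addresses across banks */
}
```

    With such a mapping, two addresses that differ only in the XORed high-order bits land in different banks, so both rows can remain open at the same time instead of repeatedly evicting each other from one row buffer.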

    Cache-optimal methods for bit-reversals

    Bit-reversals are representative and important data reordering operations in many scientific computations. Their performance degradation is caused mainly by cache conflict misses. Because bit-reversals are repeatedly used as fundamental subroutines in many scientific programs, cache-optimal methods and their implementations must be done carefully and precisely at the programming level in order to obtain the best performance. This type of performance programming for special programs, such as data reorderings, may significantly outperform optimization by an automatic tool such as a compiler. In this paper, we examine different methods that use blocking, buffering, and padding techniques for efficient implementations. We evaluate the merits and limits of each technique, as well as the application- and architecture-dependent conditions for developing cache-optimal methods. We present two contributions: (1) our integrated blocking methods, which match the cache associativity and TLB size and fully use the available registers, are cache-optimal and fast; (2) our padding methods outperform other software-oriented methods, and we believe they are the fastest in terms of minimizing both CPU and memory access cycles. Since the padding methods are almost independent of the hardware, they can be widely used on many uniprocessor workstations and SMP multiprocessors.
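    A rough sketch of the blocking-and-buffering idea follows; it is not the paper's exact integrated algorithm, and the field widths and sizes are assumptions. The index is split into three bit fields, and a small buffer that fits in the L1 cache is used so that both the source reads and the destination writes proceed with unit stride within each tile.

```c
/*
 * Illustrative blocked ("buffered") bit-reversal sketch.  The q-bit index
 * is split into fields a|b|c of t, q-2t, and t bits.  For each value of b,
 * a 2^t x 2^t tile is gathered from the source with unit-stride reads into
 * a cache-resident buffer, then scattered to the destination with
 * unit-stride writes.  QBITS and TBITS are placeholder choices.
 */
#define QBITS  20                     /* N = 2^20 elements                  */
#define TBITS  5                      /* tile bits: buffer is 2^(2t) elems  */
#define N      (1u << QBITS)
#define T      (1u << TBITS)
#define MBITS  (QBITS - 2 * TBITS)    /* middle field width                 */

/* Reverse the low `bits` bits of x. */
static unsigned bitrev(unsigned x, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* y[bitrev(i, QBITS)] = x[i] for all i, processed tile by tile. */
void bitreversal_blocked(const double *x, double *y)
{
    static double buf[T * T];         /* 8 KB tile buffer, fits in L1 */

    for (unsigned b = 0; b < (1u << MBITS); b++) {
        unsigned brev = bitrev(b, MBITS);

        /* Gather: the inner loop over c reads the source with unit stride. */
        for (unsigned a = 0; a < T; a++)
            for (unsigned c = 0; c < T; c++)
                buf[(bitrev(c, TBITS) << TBITS) | a] =
                    x[(a << (QBITS - TBITS)) | (b << TBITS) | c];

        /* Scatter: the inner loop over arev writes the destination with unit stride. */
        for (unsigned crev = 0; crev < T; crev++)
            for (unsigned arev = 0; arev < T; arev++)
                y[(crev << (QBITS - TBITS)) | (brev << TBITS) | arev] =
                    buf[(crev << TBITS) | bitrev(arev, TBITS)];
    }
}
```

    The tile size would be chosen to match the cache associativity and the TLB reach of the target machine, which is the kind of architecture-dependent tuning the abstract describes.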