5 research outputs found

    Reducing main memory access latency through SDRAM address mapping techniques and access reordering mechanisms

    Get PDF
    As the performance gap between microprocessors and memory continues to increase, main memory accesses result in long latencies which become a factor limiting system performance. Previous studies show that main memory access streams contain significant localities and SDRAM devices provide parallelism through multiple banks and channels. These locality and parallelism have not been exploited thoroughly by conventional memory controllers. In this thesis, SDRAM address mapping techniques and memory access reordering mechanisms are studied and applied to memory controller design with the goal of reducing observed main memory access latency. The proposed bit-reversal address mapping attempts to distribute main memory accesses evenly in the SDRAM address space to enable bank parallelism. As memory accesses to unique banks are interleaved, the access latencies are partially hidden and therefore reduced. With the consideration of cache conflict misses, bit-reversal address mapping is able to direct potential row conflicts to different banks, further improving the performance. The proposed burst scheduling is a novel access reordering mechanism, which creates bursts by clustering accesses directed to the same rows of the same banks. Subjected to a threshold, reads are allowed to preempt writes and qualified writes are piggybacked at the end of the bursts. A sophisticated access scheduler selects accesses based on priorities and interleaves accesses to maximize the SDRAM data bus utilization. Consequentially burst scheduling reduces row conflict rate, increasing and exploiting the available row locality. Using a revised SimpleScalar and M5 simulator, both techniques are evaluated and compared with existing academic and industrial solutions. With SPEC CPU2000 benchmarks, bit-reversal reduces the execution time by 14% on average over traditional page interleaving address mapping. Burst scheduling also achieves a 15% reduction in execution time over conventional bank in order scheduling. Working constructively together, bit-reversal and burst scheduling successfully achieve a 19% speedup across simulated benchmarks

    Buffer-On-Board Memory System

    Get PDF
    The design and implementation of the commodity memory architecture has resulted in significant limitations in a system's speed and capacity. To circumvent these limitations, designers and vendors have begun to place intermediate logic between the CPU and DRAM. This additional logic has two functions: to control the DRAM and to communicate with the CPU over a fast and narrow bus. The benefit provided by this logic is a reduction in pin-out to the memory system from the CPU and increased signal integrity seen by the DRAM, granting faster clock rates while increasing capacity. This new design is reminiscent of the FB-DIMM memory system yet makes key changes to its architecture including the use of existing DIMMs to reduce cost, a reduction in power (relative to FB-DIMM), and a more stable request latency. The problem is that the few vendors utilizing this design have the same general approach, yet the implementations vary greatly in their non-trivial details. A hardware verified simulation suite is developed to accurately model and evaluate the behavior of this buffer-on-board memory system. A study of this design space is performed to determine optimal use of the resources involved. This includes DRAM and bus organization, queue storage, and mapping schemes. Various constraints based on implementation costs are placed on simulated configurations to confirm that these optimizations apply to viable systems. Finally, full system simulations are performed with MARSSx86 to better understand how this memory system interacts with a CPU, cache, and operating system executing an application. Full system simulations uncover behaviors not present in simple limit-case simulations such as the impact of address and channel mapping schemes or the organization of ports and the associated buffers. When applying insights gleaned from these simulations, optimal performance can be achieved while still considering outside constraints (i.e., pin-out, power, and fabrication costs)

    ANALYTICAL MODEL FOR CHIP MULTIPROCESSOR MEMORY HIERARCHY DESIGN AND MAMAGEMENT

    Get PDF
    Continued advances in circuit integration technology has ushered in the era of chip multiprocessor (CMP) architectures as further scaling of the performance of conventional wide-issue superscalar processor architectures remains hard and costly. CMP architectures take advantageof Moore¡¯s Law by integrating more cores in a given chip area rather than a single fastyet larger core. They achieve higher performance with multithreaded workloads. However,CMP architectures pose many new memory hierarchy design and management problems thatmust be addressed. For example, how many cores and how much cache capacity must weintegrate in a single chip to obtain the best throughput possible? Which is more effective,allocating more cache capacity or memory bandwidth to a program?This thesis research develops simple yet powerful analytical models to study two newmemory hierarchy design and resource management problems for CMPs. First, we considerthe chip area allocation problem to maximize the chip throughput. Our model focuses onthe trade-off between the number of cores, cache capacity, and cache management strategies.We find that different cache management schemes demand different area allocation to coresand cache to achieve their maximum performance. Second, we analyze the effect of cachecapacity partitioning on the bandwidth requirement of a given program. Furthermore, ourmodel considers how bandwidth allocation to different co-scheduled programs will affect theindividual programs¡¯ performance. Since the CMP design space is large and simulating only one design point of the designspace under various workloads would be extremely time-consuming, the conventionalsimulation-based research approach quickly becomes ineffective. We anticipate that ouranalytical models will provide practical tools to CMP designers and correctly guide theirdesign efforts at an early design stage. Furthermore, our models will allow them to betterunderstand potentially complex interactions among key design parameters

    The bit-reversal SDRAM address mapping

    No full text
    The performance contributions of SDRAM address mapping techniques in the main memory of an embedded system are studied and examined. While spatial locality existing in the access stream increases SDRAM row hit rate, it also increases row conflicts. Mapping of the physical address bits into SDRAM column, row, bank and rank index impacts system performance significantly. A novel address mapping scheme, called bit-reversal, is described and experimentally compared against known methods. The bit-reversal address mapping increases SDRAM row hit rate from 43% to 66% by distributing conflicting memory accesses over independent SDRAM banks. Bit-reversal address mapping reduces the average memory access latency by 26%-29% over other methods, resulting in a 11.7%-13.5% reduction of total execution time. The configuration space of bit-reversal address mapping is explored. Finally, limited studies examining the impact of address mapping techniques in conjunction with SDRAM controller policy and virtual paging illustrate that mapping is better suited to virtual memory free embedded systems than desktop workstations incorporating paging mechanisms. Copyright 2005 ACM
    corecore