341 research outputs found

    Interval simulation: raising the level of abstraction in architectural simulation

    Get PDF
    Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multi-core processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by mapping the simulation models on FPGAs; these approaches achieve substantial simulation speedups while simulating performance in a cycle-accurate manner This paper proposes interval simulation which rakes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model estimates core-level performance by analyzing intervals, or the timing between two miss events (branch mispredictions and TLB/cache misses); the miss events are determined through simulation of the memory hierarchy, cache coherence protocol, interconnection network and branch predictor By raising the level of abstraction, interval simulation reduces both development time and evaluation time. Our experimental results using the SPEC CPU2000 and PARSEC benchmark suites and the MS multi-core simulator show good accuracy up to eight cores (average error of 4.6% and max error of 11% for the multi-threaded full-system workloads), while achieving a one order of magnitude simulation speedup compared to cycle-accurate simulation. Moreover interval simulation is easy to implement: our implementation of the mechanistic analytical model incurs only one thousand lines of code. Its high accuracy, fast simulation speed and ease-of-use make interval simulation a useful complement to the architect's toolbox for exploring system-level and high-level micro-architecture trade-offs

    Performance analysis of Intel Core 2 Duo processor

    Get PDF
    With the emergence of thread level parallelism as a more efficient method of improving processor performance, Chip Multiprocessor (CMP) technology is being more widely used in developing processor architectures. Also, the widening gap between CPU and memory speed has evoked the interest of researchers to understand performance of memory hierarchical architectures. As part of this research, performance characteristic studies were carried out on the Intel Core 2 Duo, a dual core power efficient processor, using a variety of new generation benchmarks. This study provides a detailed analysis of the memory hierarchy performance and the performance scalability between single and dual core processors. The behavior of SPEC CPU2006 benchmarks running on Intel Core 2 Duo processor is also explained. Lastly, the overall execution time and throughput measurement using both multi-programmed and multi-threaded workloads for the Intel Core 2 Duo processor is reported and compared to that of the Intel Pentium D and AMD Athlon 64X2 processors. Results showed that the Intel Core 2 Duo had the best performance for a variety of workloads due to its advanced micro-architectural features such as the shared L2 cache, fast cache to cache communication and smart memory access

    Reducing main memory access latency through SDRAM address mapping techniques and access reordering mechanisms

    Get PDF
    As the performance gap between microprocessors and memory continues to increase, main memory accesses result in long latencies which become a factor limiting system performance. Previous studies show that main memory access streams contain significant localities and SDRAM devices provide parallelism through multiple banks and channels. These locality and parallelism have not been exploited thoroughly by conventional memory controllers. In this thesis, SDRAM address mapping techniques and memory access reordering mechanisms are studied and applied to memory controller design with the goal of reducing observed main memory access latency. The proposed bit-reversal address mapping attempts to distribute main memory accesses evenly in the SDRAM address space to enable bank parallelism. As memory accesses to unique banks are interleaved, the access latencies are partially hidden and therefore reduced. With the consideration of cache conflict misses, bit-reversal address mapping is able to direct potential row conflicts to different banks, further improving the performance. The proposed burst scheduling is a novel access reordering mechanism, which creates bursts by clustering accesses directed to the same rows of the same banks. Subjected to a threshold, reads are allowed to preempt writes and qualified writes are piggybacked at the end of the bursts. A sophisticated access scheduler selects accesses based on priorities and interleaves accesses to maximize the SDRAM data bus utilization. Consequentially burst scheduling reduces row conflict rate, increasing and exploiting the available row locality. Using a revised SimpleScalar and M5 simulator, both techniques are evaluated and compared with existing academic and industrial solutions. With SPEC CPU2000 benchmarks, bit-reversal reduces the execution time by 14% on average over traditional page interleaving address mapping. Burst scheduling also achieves a 15% reduction in execution time over conventional bank in order scheduling. Working constructively together, bit-reversal and burst scheduling successfully achieve a 19% speedup across simulated benchmarks

    Code Positioning in LLVM

    Get PDF
    Given the increasing performance disparity between processor speeds and memory latency, making efficient use of cache memory is more important than ever to achieve good performance in memory-bound workloads. Many modern first-level caches store instructions separately from data, making code layout and code size an important factor in the cache behavior of a program. This work investigates two methods that attempt to improve code locality, namely procedure splitting and procedure positioning, previously investigated by Pettis and Hansen. They are implemented in the open-source compiler framework LLVM to evaluate their effect on the SPEC CPU2000 benchmark suite and a benchmark run of the PostgreSQL database system. We found that our implementation is highly situational, but can be beneficial, reducing execution time by up to 3% on suitable SPEC benchmarks and an increase of 3% in average transactions per second on PostgreSQL

    Benchmarking

    Full text link
    • …
    corecore