
    ANALYTICAL MODEL FOR CHIP MULTIPROCESSOR MEMORY HIERARCHY DESIGN AND MANAGEMENT

    Continued advances in circuit integration technology have ushered in the era of chip multiprocessor (CMP) architectures, as further scaling of the performance of conventional wide-issue superscalar processor architectures remains hard and costly. CMP architectures take advantage of Moore's Law by integrating more cores in a given chip area rather than a single fast yet larger core. They achieve higher performance with multithreaded workloads. However, CMP architectures pose many new memory hierarchy design and management problems that must be addressed. For example, how many cores and how much cache capacity must we integrate in a single chip to obtain the best throughput possible? Which is more effective, allocating more cache capacity or more memory bandwidth to a program?

    This thesis research develops simple yet powerful analytical models to study two new memory hierarchy design and resource management problems for CMPs. First, we consider the chip area allocation problem to maximize the chip throughput. Our model focuses on the trade-off between the number of cores, cache capacity, and cache management strategies. We find that different cache management schemes demand different area allocations to cores and cache to achieve their maximum performance. Second, we analyze the effect of cache capacity partitioning on the bandwidth requirement of a given program. Furthermore, our model considers how bandwidth allocation to different co-scheduled programs will affect the individual programs' performance.

    Since the CMP design space is large and simulating even one design point under various workloads would be extremely time-consuming, the conventional simulation-based research approach quickly becomes ineffective. We anticipate that our analytical models will provide practical tools to CMP designers and correctly guide their design efforts at an early design stage. Furthermore, our models will allow them to better understand potentially complex interactions among key design parameters.
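
    To make the area-allocation trade-off concrete, the sketch below sweeps the split of a fixed chip area between cores and a shared cache under a power-law miss-rate model. Every constant and the miss model itself are illustrative assumptions, not the thesis's actual model.

        # A minimal sketch (not the thesis's model) of the core-count vs.
        # cache-capacity trade-off under a fixed area budget, assuming a
        # power-law cache miss model and a simple miss-penalty CPI model.

        AREA_BUDGET = 400.0   # arbitrary area units shared by cores + cache
        CORE_AREA   = 10.0    # area of one core
        MB_AREA     = 4.0     # area of 1 MB of cache
        MISS_PENALTY = 200    # cycles per miss (assumed)
        ACCESSES_PER_INSTR = 0.3

        def miss_rate(cache_mb, m0=0.05, c0=1.0, alpha=0.5):
            # Power-law ("square-root") miss model: m = m0 * (C/C0)^-alpha
            return m0 * (cache_mb / c0) ** -alpha

        def throughput(n_cores):
            cache_mb = (AREA_BUDGET - n_cores * CORE_AREA) / MB_AREA
            if cache_mb <= 0:
                return 0.0
            # Shared cache: each core effectively sees 1/n of the capacity.
            m = miss_rate(cache_mb / n_cores)
            cpi = 1.0 + ACCESSES_PER_INSTR * m * MISS_PENALTY
            return n_cores / cpi  # aggregate instructions per cycle

        best = max(range(1, int(AREA_BUDGET // CORE_AREA)), key=throughput)
        print(f"best core count: {best}, throughput: {throughput(best):.2f} IPC")

    With these constants, throughput peaks at an intermediate core count: beyond it, the shrinking per-core share of the cache drives miss stalls up faster than the added cores contribute useful work.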

    Reuse Distance Analysis for Large-Scale Chip Multiprocessors

    Multicore Reuse Distance (RD) analysis is a powerful tool that can potentially provide detailed insight into a parallel program's memory behavior. Concurrent Reuse Distance (CRD) and Private-stack Reuse Distance (PRD) measure RD across thread-interleaved memory reference streams, addressing shared and private caches, respectively. Sensitivity to memory interleaving makes CRD and PRD profiles architecture dependent, preventing them from analyzing different processor configurations. However, such instability is minimal when all threads exhibit similar data-locality patterns. For loop-based parallel programs, the interleaved threads are symmetric: CRD and PRD profiles are stable across cache size scaling, and exhibit predictable, coherent movement across core count scaling. Hence, multicore RD analysis can provide accurate analysis for different processor configurations. Given the prevalence of parallel loops, RD analysis will be valuable to multicore designers.

    This dissertation uses RD analysis to analyze multicore cache performance for loop-based parallel programs. First, we study the impacts of core count scaling and problem size scaling on CRD and PRD profiles. Two application parameters with architectural implications are identified: Ccore and Cshare. Core count scaling impacts shared-cache performance significantly only below Ccore, and Cshare is the capacity at which shared caches begin to outperform private caches in terms of data locality. Then, we develop techniques, in particular employing reference groups, to predict the coherent movement of CRD and PRD profiles due to scaling, achieving accuracy of 80%-96%. Comparing our prediction techniques against profile sampling, we find that prediction achieves higher speedup and accuracy, especially when the design space is large. Moreover, we evaluate the accuracy of using CRD and PRD profile predictions to estimate multicore cache performance, in particular misses per kilo-instruction (MPKI). When combined with the existing problem scaling prediction, our techniques can predict shared LLC (private L2 cache) MPKI to within 12% (14%) of simulation across 1,728 (1,440) configurations using only 36 measured CRD (PRD) profiles.

    Lastly, we propose a new framework based on RD analysis to optimize multicore cache hierarchies. Our study not only reveals several new insights, but also demonstrates that RD analysis can help computer architects improve multicore designs.
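
    For readers new to the technique, the sketch below computes the classic single-stream reuse distance (LRU stack distance) histogram that CRD and PRD generalize to interleaved and per-thread stacks; the linear-scan stack is for clarity, whereas real profilers use tree-based structures.

        # A minimal single-stream reuse-distance profiler: the distance of a
        # reference is the number of distinct addresses touched since the
        # previous reference to the same address (-1 marks a first touch).
        from collections import Counter

        def reuse_distance_histogram(trace):
            stack = []            # LRU stack, most-recent address last
            hist = Counter()      # distance -> count
            for addr in trace:
                if addr in stack:
                    d = len(stack) - 1 - stack.index(addr)  # entries above addr
                    stack.remove(addr)
                else:
                    d = -1        # cold (infinite-distance) reference
                stack.append(addr)
                hist[d] += 1
            return hist

        print(reuse_distance_histogram(['a', 'b', 'c', 'a', 'b', 'a']))
        # Counter({-1: 3, 2: 2, 1: 1})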

    Scaling Distributed Cache Hierarchies through Computation and Data Co-Scheduling

    Cache hierarchies are increasingly non-uniform, so for systems to scale efficiently, data must be close to the threads that use it. Moreover, cache capacity is limited and contended among threads, introducing complex capacity/latency tradeoffs. Prior NUCA schemes have focused on managing data to reduce access latency, but have ignored thread placement; and applying prior NUMA thread placement schemes to NUCA is inefficient, as capacity, not bandwidth, is the main constraint. We present CDCS, a technique to jointly place threads and data in multicores with distributed shared caches. We develop novel monitoring hardware that enables fine-grained space allocation on large caches, and data movement support to allow frequent full-chip reconfigurations. On a 64-core system, CDCS outperforms an S-NUCA LLC by 46% on average (up to 76%) in weighted speedup and saves 36% of system energy. CDCS also outperforms state-of-the-art NUCA schemes under different thread scheduling policies.
    Funding: National Science Foundation (U.S.) (Grant CCF-1318384); Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science (Jacobs Presidential Fellowship); United States Defense Advanced Research Projects Agency (PERFECT Contract HR0011-13-2-0005).
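
    As a rough intuition for joint thread and data placement (a toy heuristic, not the CDCS algorithm), one could greedily place each thread on the free core nearest the centroid of the cache banks holding its data:

        # Toy joint placement on a mesh of tiles (one core + one cache bank
        # each): neediest thread first, nearest free core to its data.
        import itertools

        MESH = 4  # 4x4 mesh; all names and sizes are illustrative

        def dist(a, b):
            return abs(a[0] - b[0]) + abs(a[1] - b[1])   # Manhattan hops

        def place_threads(thread_banks):
            # thread_banks: {thread: [(x, y) of banks holding its data]}
            free = set(itertools.product(range(MESH), range(MESH)))
            placement = {}
            for t, banks in sorted(thread_banks.items(),
                                   key=lambda kv: -len(kv[1])):
                cx = sum(x for x, _ in banks) / len(banks)
                cy = sum(y for _, y in banks) / len(banks)
                core = min(free, key=lambda c: dist(c, (cx, cy)))
                placement[t] = core
                free.remove(core)
            return placement

        print(place_threads({'t0': [(0, 0), (0, 1)], 't1': [(3, 3)]}))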

    A fuzzy logic based dynamic reconfiguration scheme for optimal energy and throughput in symmetric chip multiprocessors

    Embedded systems architectures have traditionally been designed to achieve high throughput combined with minimum energy consumption. With the advent of reconfigurable architectures, it is now possible to support algorithms that find optimal solutions for an improved energy and throughput balance. As a result of ongoing research, several online and offline techniques and algorithms have been proposed for hardware adaptation. This paper presents a novel coarse-grained reconfigurable symmetric chip multiprocessor (SCMP) architecture managed by a fuzzy logic engine that balances performance and energy consumption. The architecture incorporates reconfigurable level 1 (L1) caches, power-gated cores, and adaptive on-chip network routers to minimize the leakage energy of inactive components. A coarse-grained architecture was selected as the focus of this study because it typically allows faster reconfiguration than fine-grained architectures, making it more feasible for runtime adaptation schemes. The presented architecture is analyzed using a set of OpenMP-based parallel benchmarks, and the results show significant improvements in performance while maintaining minimum energy consumption.
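
    The sketch below shows how a fuzzy engine of this general kind might turn a measured core utilization into a power-gating decision, using triangular membership functions and weighted-average defuzzification; the rules and thresholds are assumptions for illustration, not the paper's engine.

        # Minimal fuzzy controller sketch: fuzzify utilization, apply three
        # rules, defuzzify to a score in [-1, 1].

        def tri(x, a, b, c):
            # Triangular membership: 0 at a and c, peak 1 at b.
            if x <= a or x >= c:
                return 0.0
            return (x - a) / (b - a) if x < b else (c - x) / (c - b)

        def decide(utilization):
            # Fuzzify: how "low", "medium", "high" is the utilization?
            low  = tri(utilization, -0.4, 0.0, 0.5)
            med  = tri(utilization,  0.2, 0.5, 0.8)
            high = tri(utilization,  0.5, 1.0, 1.4)
            # Rules: low -> gate a core (-1), medium -> hold (0),
            #        high -> wake a core (+1); weighted-average defuzzify.
            return (-1.0 * low + 0.0 * med + 1.0 * high) / (low + med + high)

        for u in (0.1, 0.5, 0.9):
            print(u, round(decide(u), 2))   # -> -1.0, 0.0, 1.0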

    Automatic synthesis and optimization of chip multiprocessors

    Microprocessor technology has experienced enormous growth during the last decades. Rapid downscaling of CMOS technology has led to higher operating frequencies and performance densities, facing the fundamental issue of power dissipation. Chip Multiprocessors (CMPs) have become the latest paradigm to improve the power-performance efficiency of computing systems by exploiting the parallelism inherent in applications. Industrial and prototype implementations have already demonstrated the benefits achieved by CMPs with hundreds of cores.

    CMP architects are challenged to make many complex design decisions. Only a few of them are:
    - What should be the ratio between the core and cache areas on a chip?
    - Which core architectures to select?
    - How many cache levels should the memory subsystem have?
    - Which interconnect topologies provide efficient on-chip communication?
    These and many other aspects create a complex multidimensional space for architectural exploration. Design automation tools become essential to make the architectural exploration feasible under hard time-to-market constraints. The exploration methods have to be efficient and scalable to handle future-generation on-chip architectures with hundreds or thousands of cores.

    Furthermore, once a CMP has been fabricated, the need for efficient deployment of the many-core processor arises. Intelligent techniques for task mapping and scheduling onto CMPs are necessary to guarantee full usage of the benefits brought by the many-core technology. These techniques have to consider the peculiarities of modern architectures, such as the availability of enhanced power-saving techniques and the presence of complex memory hierarchies.

    This thesis has several objectives. The first objective is to elaborate methods for efficient analytical modeling and architectural design space exploration of CMPs. The efficiency is achieved by using analytical models instead of simulation, and by replacing exhaustive exploration with an intelligent search strategy. Additionally, these methods incorporate high-level models for physical planning. The related contributions are described in Chapters 3, 4 and 5 of the document.

    The second objective of this work is to propose a scalable algorithm for task mapping onto general-purpose CMPs with power management techniques, for efficient deployment of many-core systems. This contribution is explained in Chapter 6 of this document.

    Finally, the third objective of this thesis is to address the issues of on-chip interconnect design and exploration, by developing a model for simultaneous topology customization and deadlock-free routing in Networks-on-Chip. The developed methodology can be applied to various classes of on-chip systems, ranging from general-purpose chip multiprocessors to application-specific solutions. Chapter 7 describes the proposed model.

    The presented methods have been thoroughly tested experimentally, and the results are described in this dissertation. At the end of the document, several possible directions for future research are proposed.
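
    As one concrete instance of deadlock-free routing on an on-chip network, the sketch below implements standard dimension-ordered (XY) routing on a 2D mesh; it is deadlock-free because it never turns from Y back to X, ruling out cyclic channel dependences. The thesis's model synthesizes custom topologies and routing functions, which this toy example does not attempt.

        # Dimension-ordered (XY) routing on a 2D mesh: route fully in the X
        # dimension, then fully in Y. Coordinates are (x, y) tile positions.

        def xy_route(src, dst):
            """Return the list of (x, y) hops from src to dst, X first."""
            x, y = src
            path = [(x, y)]
            while x != dst[0]:                 # resolve X dimension first
                x += 1 if dst[0] > x else -1
                path.append((x, y))
            while y != dst[1]:                 # then resolve Y
                y += 1 if dst[1] > y else -1
                path.append((x, y))
            return path

        print(xy_route((0, 0), (2, 3)))
        # [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]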

    Architecting Energy Efficient Servers.

    This dissertation investigates how energy efficient servers can be architected using current and future technology. We leverage recent trends in packaging and device technology to deliver low power and high throughput. Specifically, at the package level, this dissertation looks at 3D stacking technology, which has emerged as a promising solution for achieving energy efficiency by delivering high throughput at a low cost, and shows how one would leverage this new technology in a datacenter. 3D stacking technology can be used to implement a simple, low-power, high-performance chip multiprocessor suitable for throughput processing. Our proposed architecture leveraging this technology, PicoServer, employs 3D technology to bond one die containing several simple, slow processing cores to multiple memory dies sufficient for a primary memory. The multiple memory dies are composed of DRAM. 3D stacking technology also enables wide, low-latency buses between processors and memory. These remove the need for an L2 cache, allowing its area to be re-allocated to additional simple cores. The additional cores allow the clock frequency to be lowered without impairing throughput. Lower clock frequency, along with the integration of non-volatile memory, in turn reduces power and means that thermal constraints, a concern with 3D stacking, are easily satisfied. The PicoServer architecture targets server applications, which exhibit a high degree of thread-level parallelism. An architecture targeted at efficient throughput is ideal for this application domain.

    At the memory device level, this dissertation investigates how system memory could be re-architected to reduce the rising power consumption of system memory and disk drives. Flash memory has emerged as a strong candidate to reduce system memory power while remaining more cost effective than conventional system memory. This dissertation discusses how Flash could be integrated at the system level and provides insights on the architectural support for Flash in servers. Our architecture uses a two-level disk cache composed of a relatively small DRAM, which includes a primary disk cache, and a Flash-based secondary disk cache. Further, based on our observations, we found that the Flash-based disk cache should be split into a read-optimized disk cache and a write-optimized disk cache.

    Ph.D.
    Computer Science & Engineering
    University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/57602/2/tkgil_1.pd
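
    To make the two-level disk cache concrete, here is a toy sketch of a small LRU DRAM cache in front of a larger LRU Flash cache; the capacities, the single unified Flash level (rather than the split read/write-optimized caches described above), and the promotion policy are simplifying assumptions, not the dissertation's design.

        # Toy two-level disk cache: DRAM primary backed by a Flash secondary.
        from collections import OrderedDict

        class LRUCache:
            def __init__(self, capacity):
                self.capacity, self.data = capacity, OrderedDict()
            def get(self, key):
                if key not in self.data:
                    return None
                self.data.move_to_end(key)       # refresh LRU position
                return self.data[key]
            def put(self, key, value):
                self.data[key] = value
                self.data.move_to_end(key)
                if len(self.data) > self.capacity:
                    self.data.popitem(last=False)  # evict LRU entry

        class TwoLevelDiskCache:
            def __init__(self):
                self.dram  = LRUCache(capacity=64)     # small, fast primary
                self.flash = LRUCache(capacity=1024)   # large, low-power secondary
            def read(self, block, disk):
                for level in (self.dram, self.flash):
                    v = level.get(block)
                    if v is not None:
                        self.dram.put(block, v)        # promote into DRAM
                        return v
                v = disk[block]                        # miss in both: go to disk
                self.dram.put(block, v)
                self.flash.put(block, v)
                return v

        cache = TwoLevelDiskCache()
        disk = {n: f"block-{n}" for n in range(10000)}
        print(cache.read(42, disk))   # miss: fetched from disk, cached in both
        print(cache.read(42, disk))   # hit in DRAM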