1,590 research outputs found

    Limits on Fundamental Limits to Computation

    Full text link
    An indispensable part of our lives, computing has also become essential to industries and governments. Steady improvements in computer hardware have been supported by periodic doubling of transistor densities in integrated circuits over the last fifty years. Such Moore scaling now requires increasingly heroic efforts, stimulating research in alternative hardware and stirring controversy. To help evaluate emerging technologies and enrich our understanding of integrated-circuit scaling, we review fundamental limits to computation: in manufacturing, energy, physical space, design and verification effort, and algorithms. To outline what is achievable in principle and in practice, we recall how some limits were circumvented, compare loose and tight limits. We also point out that engineering difficulties encountered by emerging technologies may indicate yet-unknown limits.Comment: 15 pages, 4 figures, 1 tabl

    Accelerator Memory Reuse in the Dark Silicon Era

    Get PDF
    Accelerators integrated on-die with General-Purpose CPUs (GP-CPUs) can yield significant performance and power improvements. Their extensive use, however, is ultimately limited by their area overhead; due to their high degree of specialization, the opportunity cost of investing die real estate on accelerators can become prohibitive, especially for general-purpose architectures. In this paper we present a novel technique aimed at mitigating this opportunity cost by allowing GP-CPU cores to reuse accelerator memory as a non-uniform cache architecture (NUCA) substrate. On a system with a last level-2 cache of 128kB, our technique achieves on average a 25% performance improvement when reusing four 512 kB accelerator memory blocks to form a level-3 cache. Making these blocks reusable as NUCA slices incurs on average in a 1.89% area overhead with respect to equally-sized ad hoc cache slice

    Architectural techniques to extend multi-core performance scaling

    Get PDF
    Multi-cores have successfully delivered performance improvements over the past decade; however, they now face problems on two fronts: power and off-chip memory bandwidth. Dennard\u27s scaling is effectively coming to an end which has lead to a gradual increase in chip power dissipation. In addition, sustaining off-chip memory bandwidth has become harder due to the limited space for pins on the die and greater current needed to drive the increasing load . My thesis focuses on techniques to address the power and off-chip memory bandwidth challenges in order to avoid the premature end of the multi-core era. ^ In the first part of my thesis, I focus on techniques to address the power problem. One option to cope with the power limit, as suggested by some recent papers, is to ensure that an increasing number of cores are kept powered down (i.e., dark silicon) due to lack of power; but this option imposes a low upper bound on performance. The alternative option of customizing the cores to improve power efficiency may incur increased effort for hardware design, verification and test, and degraded programmability. I propose a gentler evolutionary path for multi-cores, called successive frequency unscaling ( SFU), to cope with the slowing of Dennard\u27s scaling. SFU keeps powered significantly more cores (compared to the option of keeping them \u27dark\u27) running at clock frequencies on the extended Pareto frontier that are successively lowered every generation to stay within the power budget. ^ In the second part of my thesis, I focus on techniques to avert the limited off-chip memory bandwidth problem. Die-stacking of DRAM on a processor die promises to continue scaling the pin bandwidth to off-chip memory. While the die-stacked DRAM is expected to be used as a cache, storing any part of the tag in the DRAM itself erodes the bandwidth advantage of die-stacking. As such, the on-die space overhead of the large DRAM cache\u27s tag is a concern. A well-known compromise is to employ a small on-die tag cache (T)forthetagmetadatawhilethefulltagstaysintheDRAM.However,tagcachingfundamentallyrequiresexploitingpagelevelmetadatalocalitytoensureefficientuseofthe3DDRAMbandwidth.Plainsubblockingexploitsthislocalitybutincursholesinthecache(i.e.,diminishedDRAMcachecapacity),whereasdecoupledorganizationsavoidholesbutdestroythislocality.IproposeBandwidthEfficientTagAccess(BETA)DRAMcache(β ) for the tag metadata while the full tag stays in the DRAM. However, tag caching fundamentally requires exploiting page-level metadata locality to ensure efficient use of the 3-D DRAM bandwidth. Plain sub-blocking exploits this locality but incurs holes in the cache (i.e., diminished DRAM cache capacity), whereas decoupled organizations avoid holes but destroy this locality. I propose Bandwidth-Efficient Tag Access (BETA) DRAM cache (β ) which avoids holes while exploiting the locality through various metadata organizational techniques. Using simulations, I conclusively show that the primary concern in DRAM caches is bandwidth and not latency, and that due to β2˘7stagbandwidthefficiency,β\u27s tag bandwidth efficiency, β with a Tperforms15 performs 15% better than the best previous scheme with a similarly-sized T

    Resource Management for Multicores to Optimize Performance under Temperature and Aging Constraints

    Get PDF
    corecore