
    Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0

    Journal Article: A significant part of future microprocessor real estate will be dedicated to L2 or L3 caches. These on-chip caches will heavily impact processor performance, power dissipation, and thermal management strategies. A number of interconnect design considerations influence the power/performance/area characteristics of large caches, such as wire models (width/spacing/repeaters), signaling strategy (RC/differential/transmission), and router design. Yet, to date, no analytical tool takes all of these parameters into account to carry out a design space exploration for large caches and estimate an optimal organization. In this work, we implement two major extensions to the CACTI cache modeling tool that focus on interconnect design for a large cache. First, we add the ability to model different types of wires, such as RC-based wires with different power/delay characteristics and differential low-swing buses. Second, we add the ability to model Non-Uniform Cache Access (NUCA). We not only adopt state-of-the-art design space exploration strategies for NUCA, but also enhance this exploration by considering on-chip network contention and a wider spectrum of wiring and routing choices. We present a validation analysis of the new tool (to be released as CACTI 6.0) and a case study to showcase how the tool can improve architecture research methodologies.
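
    A minimal sketch of the kind of design-space sweep the abstract describes, assuming toy bank-access and mesh-network models; none of the constants below come from CACTI itself:

```python
# Toy NUCA design-space sweep in the spirit of CACTI 6.0; all latency and
# energy constants below are illustrative assumptions, not CACTI's models.
import math

CACHE_MB = 16
WIRE_TYPES = {
    # name: (delay ps/mm, energy pJ/mm), assumed values
    "global_rc": (60, 0.60),   # minimum-pitch repeated RC wire
    "fat_rc":    (35, 0.90),   # wider/faster RC wire, costlier in area/energy
    "low_swing": (110, 0.15),  # differential low-swing bus: slow but frugal
}

def evaluate(num_banks, wire):
    """Estimate average access time (ps) and energy (pJ) per access."""
    delay_mm, energy_mm = WIRE_TYPES[wire]
    bank_mb = CACHE_MB / num_banks
    bank_delay = 200 * math.sqrt(bank_mb)      # crude bank-access model
    avg_hops = (2 / 3) * math.sqrt(num_banks)  # average hop count on a mesh
    hop_mm, router_ps = 2.0, 100               # assumed link length, router delay
    time = bank_delay + avg_hops * (hop_mm * delay_mm + router_ps)
    energy = 5 * bank_mb + avg_hops * hop_mm * energy_mm
    return time, energy

def ed2(cfg):
    t, e = evaluate(*cfg)
    return e * t * t                           # minimize energy * delay^2

configs = [(b, w) for b in (4, 16, 64, 256) for w in WIRE_TYPES]
print("best (banks, wire):", min(configs, key=ed2))
```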

    Microarchitectural wire management for performance and power in partitioned architectures

    Journal Article: Future high-performance billion-transistor processors are likely to employ partitioned architectures to achieve high clock speeds, high parallelism, low design complexity, and low power. In such architectures, inter-partition communication over global wires has a significant impact on overall processor performance and power consumption. VLSI techniques allow a variety of wire implementations, but these wire properties have never previously been exposed to the microarchitecture. This paper advocates global wire management at the microarchitecture level and proposes a heterogeneous interconnect composed of wires with varying latency, bandwidth, and energy characteristics. We propose and evaluate microarchitectural techniques that can exploit such a heterogeneous interconnect to improve performance and reduce energy consumption. These techniques include a novel cache pipeline design, the identification of narrow bit-width operands, the classification of non-critical data, and the detection of interconnect load imbalance. For a dynamically scheduled partitioned architecture, our results demonstrate that the proposed innovations yield up to 11% reductions in overall processor ED², compared to a baseline processor that employs a homogeneous interconnect.
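
    A hedged sketch of the steering idea: the wire classes paraphrase the paper's latency-, bandwidth-, and power-optimized wires, while the function names, thresholds, and policy code here are illustrative assumptions, not the paper's implementation:

```python
# Illustrative steering of inter-partition transfers onto a heterogeneous
# interconnect; thresholds and names are assumptions of this sketch.
from dataclasses import dataclass

@dataclass
class Transfer:
    value: int       # operand sent between partitions
    critical: bool   # is a dependent instruction stalled waiting for it?

def pick_wires(t: Transfer, fast_wire_queue: int, queue_limit: int = 8) -> str:
    # Critical data goes on the scarce low-latency wires, unless they are
    # already congested (the interconnect load-imbalance check).
    if t.critical and fast_wire_queue < queue_limit:
        return "latency-optimized"
    # Narrow bit-width operands fit on the slower, energy-efficient wires.
    if max(t.value.bit_length(), 1) <= 16:
        return "power-optimized"
    return "bandwidth-optimized"

print(pick_wires(Transfer(value=42, critical=False), fast_wire_queue=3))
```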

    Power efficient resource scaling in partitioned architectures through dynamic heterogeneity

    Journal Article: The ever-increasing demand for high clock speeds and the desire to exploit abundant transistor budgets have resulted in alarming increases in processor power dissipation. Partitioned (or clustered) architectures have been proposed in recent years to address scalability concerns in future billion-transistor microprocessors. Our analysis shows that increasing processor resources in a clustered architecture results in a linear increase in power consumption while providing diminishing improvements in single-thread performance. To preserve high performance-to-power ratios, we claim that the power consumption of additional resources should be in proportion to the performance improvements they yield. Hence, in this paper, we propose the implementation of heterogeneous clusters that have varying delay and power characteristics. A cluster's performance and power characteristics are tuned by scaling its frequency, and novel policies dynamically assign frequencies to clusters while attempting either to meet a fixed power budget or to minimize a metric such as Energy×Delay² (ED²). By increasing resources in a power-efficient manner, we observe an 11% improvement in ED² and a 22.4% average reduction in peak temperature compared to a processor with homogeneous units. Our proposed processor model also provides strategies to handle thermal emergencies with relatively low impact on performance.
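
    A small sketch of the frequency-assignment policy under a power budget, using assumed analytical stand-ins for cluster power and delay rather than the paper's simulation infrastructure:

```python
# Sketch of dynamic frequency assignment for heterogeneous clusters: pick
# per-cluster frequencies that respect a power budget and minimize ED^2.
from itertools import product

FREQS = (1.0, 2.0, 4.0)   # per-cluster frequency choices in GHz (assumed)
NUM_CLUSTERS = 4
POWER_BUDGET = 40.0       # watts (assumed)

def power(freqs):
    return sum(0.5 * f ** 3 for f in freqs)   # dynamic power ~ f * V^2, V ~ f

def delay(freqs):
    return 1.0 / min(freqs)   # crude: the slowest cluster gates execution time

def ed2(freqs):
    d = delay(freqs)
    return power(freqs) * d ** 3   # E * D^2 = (P * D) * D^2

feasible = [f for f in product(FREQS, repeat=NUM_CLUSTERS)
            if power(f) <= POWER_BUDGET]
best = min(feasible, key=ed2)
print("frequencies (GHz):", best, " ED^2:", round(ed2(best), 3))
```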

    The effect of interconnect design on the performance of large L2 caches

    Journal Article: The ever-increasing sizes of on-chip caches and the growing domination of wire delay have changed the traditional design approach for the memory hierarchy. Many recent proposals advocate splitting the cache into a large number of banks and employing an on-chip network to allow fast access to nearby banks (an approach referred to as Non-Uniform Cache Architecture (NUCA)). While these proposals focus on optimizing the logical policies (placement, searching, and movement) associated with a cache design, their initial design choices do not account for the complexity of the network. With wire delay being the major performance-limiting factor in modern processors, components designed without considering wire parameters and network overhead will be sub-optimal with respect to both delay and power. The primary contributions of this work are:
    1. An extension of the current version of CACTI to include network overhead and find the optimal design point for large on-chip caches.
    2. An evaluation of novel techniques at the microarchitecture level that exploit special wires in the L2 cache network to improve performance.
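
    A back-of-the-envelope sketch, with assumed cycle counts, of why the network term can dominate the latency of a banked NUCA cache:

```python
# Total NUCA access latency is bank access plus per-hop router, link, and
# contention delays. All cycle counts below are assumed for illustration.
def access_cycles(hops, bank=8, router=3, link=2, contention=1):
    return bank + hops * (router + link + contention)

for hops in (0, 2, 4, 8):
    print(f"{hops} hops -> {access_cycles(hops)} cycles")
```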

    Leveraging wire properties at the microarchitecture level

    Journal Article: In future microprocessors, communication will emerge as a major bottleneck. The authors advocate composing future interconnects of some wires that minimize latency, some that maximize bandwidth, and some that minimize power. A microarchitecture aware of these wire characteristics can steer on-chip data transfers to the most appropriate wires, thus improving performance and saving energy.
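
    A minimal sketch, with assumed per-mm resistance and capacitance values, of the basic physics behind these wire choices: an unrepeated RC wire's delay grows quadratically with length, while inserting repeaters makes it roughly linear at an energy and area cost:

```python
# Elmore delay of a distributed RC line is ~0.38 * r * c * L^2; splitting the
# wire into repeated segments linearizes it. r, c, and the per-repeater
# delay below are illustrative assumptions.
R_MM, C_MM = 1000.0, 0.2e-12        # ohm/mm and F/mm (assumed)

def unrepeated_ps(length_mm):
    return 0.38 * R_MM * C_MM * length_mm ** 2 * 1e12

def repeated_ps(length_mm, seg_mm=1.0, repeater_ps=20.0):
    # split into seg_mm pieces, paying a fixed repeater delay per segment
    n = max(1, round(length_mm / seg_mm))
    return n * (unrepeated_ps(seg_mm) + repeater_ps)

for L in (2, 5, 10, 20):
    print(f"{L:2d} mm: unrepeated {unrepeated_ps(L):7.0f} ps, "
          f"repeated {repeated_ps(L):5.0f} ps")
```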