23 research outputs found

    Hi-Rise: A high-radix switch for 3D integration with single-cycle arbitration

    Abstract: This paper proposes a novel 3D switch, called 'Hi-Rise', that employs high-radix switches to efficiently route data across multiple stacked layers of dies. The proposed interconnect is hierarchical and composed of two switches per silicon layer and a set of dedicated layer-to-layer channels. However, a hierarchical 3D switch can lead to unfair arbitration across different layers. To address this, the paper proposes a unique class-based arbitration scheme that is fully integrated into the switching fabric and is easy to implement. It makes the 3D hierarchical switch's fairness comparable to that of a flat 2D switch with least recently granted arbitration. The 3D switch is evaluated for different radices, numbers of stacked layers, and 3D integration technologies. A 64-radix, 128-bit width, 4-layer Hi-Rise evaluated in a 32nm technology has a throughput of 10.65 Tbps for uniform random traffic. Compared to a 2D design, this corresponds to a 15% improvement in throughput, a 33% area reduction, a 20% latency reduction, and a 38% reduction in energy per transaction.
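    The abstract uses least recently granted (LRG) arbitration on a flat 2D switch as its fairness baseline. As a minimal sketch of that baseline only (not of Hi-Rise's class-based scheme, which the abstract does not detail), an LRG arbiter can be modeled as a priority list in which each winner drops to the back. The class name `LRGArbiter` and its methods are hypothetical, chosen for illustration.

```python
class LRGArbiter:
    """Illustrative least-recently-granted arbiter over n request ports."""

    def __init__(self, n_ports):
        # Index 0 holds the highest-priority port (least recently granted).
        self.order = list(range(n_ports))

    def grant(self, requests):
        """Grant one port from the iterable `requests`; return None if no port requests."""
        requesting = set(requests)
        for port in self.order:
            if port in requesting:
                # The winner moves to the back and becomes lowest priority.
                self.order.remove(port)
                self.order.append(port)
                return port
        return None


arb = LRGArbiter(4)
# Ports 0, 1 and 2 request on every cycle; port 3 stays idle.
grants = [arb.grant({0, 1, 2}) for _ in range(6)]
print(grants)  # [0, 1, 2, 0, 1, 2]
```

    Because a granted port always falls behind every other port, persistent requesters are served in rotation, and a port that has been idle wins as soon as it requests.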

    Hardware Acceleration for Similarity Measurement in Natural Language Processing

    Abstract: The continuation of Moore's law scaling, but in the absence of Dennard scaling, motivates an emphasis on energy-efficient accelerator-based designs for future applications. In natural language processing, the conventional approach to automatically analyze vast text collections, using scale-out processing, incurs high energy and hardware costs since the central compute-intensive step of similarity measurement often entails pair-wise, all-to-all comparisons. We propose a custom hardware accelerator for similarity measures that leverages data streaming, memory latency hiding, and parallel computation across variable-length threads. We evaluate our design through a combination of architectural simulation and RTL synthesis. When executing the dominant kernel in a semantic indexing application for documents, we demonstrate throughput gains of up to 42× and 58× lower energy per similarity computation compared to an optimized software implementation, while requiring less than 1.3% of the area of a conventional core.

    Bloom Filter Guided Transaction Scheduling

    Contention management is an important design component of a transactional memory system. Without effective contention management to ensure forward progress, a transactional memory system can experience live-lock, which is difficult to debug in parallel programs. Early work in contention management focused on heuristic managers that reacted to conflicts between transactions by picking the most appropriate transaction to abort. Reactive methods allow conflicts to happen repeatedly as they do not try to prevent future conflicts from happening. These shortcomings of reactive contention managers have led to proposals that approach contention management as a scheduling problem: proactive managers. Proactive techniques range from throttling execution in predicted periods of high contention to preventing groups of transactions that are predicted likely to conflict from running concurrently. We propose a novel transaction scheduling scheme called "Bloom Filter Guided Transaction Scheduling" (BFGTS), that uses a combination of simple hardware and Bloom filter heuristics to guide scheduling decisions and provide enhanced performance in high contention situations. We compare to two state-of-the-art transaction schedulers, "Adaptive Transaction Scheduling" and "Proactive Transaction Scheduling", and show that BFGTS attains up to a 4.6x and 1.7x improvement on high contention benchmarks respectively. Across all benchmarks it shows a 35% and 25% average performance improvement respectively.
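    The abstract does not spell out how BFGTS uses its Bloom filters, so the following is only a generic sketch of the underlying idea: summarize each transaction's accessed addresses in a small bit-vector signature and treat a non-empty bitwise intersection as a hint that two transactions may conflict. The class `BloomFilter`, its parameters, and the example addresses are assumptions made for illustration, not the paper's design.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter used as a transaction access-set signature."""

    def __init__(self, n_bits=256, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # bit vector stored as a Python int

    def _positions(self, item):
        # Derive n_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def may_overlap(self, other):
        # Conservative test: a shared item always yields shared set bits.
        return (self.bits & other.bits) != 0


# Hypothetical transactions: A writes two cache lines, B reads one of them.
tx_a, tx_b, tx_idle = BloomFilter(), BloomFilter(), BloomFilter()
for addr in (0x1000, 0x1040):
    tx_a.add(addr)
tx_b.add(0x1040)

# A scheduler could serialize A and B while letting the idle one run.
print(tx_a.may_overlap(tx_b), tx_a.may_overlap(tx_idle))
```

    The intersection test can report overlap for disjoint sets (a false positive) but never misses a genuinely shared address, which is the safe direction for a scheduler that serializes suspected conflicts.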

    Energy efficient near-threshold chip multi-processing

    Subthreshold circuit design has become a popular approach for building energy efficient digital circuits. One drawback is performance degradation due to the exponentially reduced driving current. This had limited subthreshold circuits to relatively low performance applications such as sensor networks. To retain the excellent energy efficiency while reducing performance loss, we propose to apply subthreshold and near-threshold techniques to chip multi-processors. We show that an architecture where several slower cores are clustered together with a shared faster L1 cache is optimal for energy efficiency, because processor cores and memory operate best at different supply and threshold voltages. In particular, SPLASH2 benchmarks show about a 53% energy improvement over the traditional CMP approach (about 70% over a single core machine).

    When Homogeneous becomes Heterogeneous -- Wearout Aware Task Scheduling for Streaming Applications

    Recent trends in process technology suggest the need to monitor transistor wear-out in future processes. Because of within-die variation and the different computations being run on each core in a multi-core chip, this wear-out causes further imbalance to initial core frequencies as time progresses. Furthermore, manufacturing defects mean that cache sizes can vary between cores, adding further imbalance to a system. If we allow different cores to independently control their operating frequency, we can achieve the best possible performance for their part of the die. Other parts of the system with slowly degrading performance can include interconnects and Flash-based file caches. In this paper we first explain how conventionally homogeneous multi-core processors can become heterogeneous over time. We discuss possible operating system based solutions to maximize the performance of a system as it wears out and present illustrative theoretical results based on linear programming. We demonstrate that for a class of streaming applications, an intelligent scheduling scheme recovers a significant amount of performance lost through wear-out. We advocate the need for multiple accurate performance measurements for effective scheduling in a wearout-aware multicore chip.

    Full-system analysis and characterization of interactive smartphone applications

    Abstract: Smartphones have recently overtaken PCs as the primary consumer computing device in terms of annual unit shipments. Given this rapid market growth, it is important that mobile system designers and computer architects analyze the characteristics of the interactive applications users have come to expect on these platforms. With the introduction of high-performance, low-power, general purpose CPUs in the latest smartphone models, users now expect PC-like performance and a rich user experience, including high-definition audio and video, high-quality multimedia, dynamic web content, responsive user interfaces, and 3D graphics. In this paper, we characterize the microarchitectural behavior of representative smartphone applications on a current-generation mobile platform to identify trends that might impact future designs. To this end, we measure a suite of widely available mobile applications for audio, video, and interactive gaming. To complete this suite we developed BBench, a new fully-automated benchmark to assess a web-browser's performance when rendering some of the most popular and complex sites on the web. We contrast these applications' characteristics with those of the SPEC CPU2006 benchmark suite. We demonstrate that real-world interactive smartphone applications differ markedly from the SPEC suite. Specifically, the instruction cache, instruction TLB, and branch predictor suffer from poor performance. We conjecture that this is due to the applications' reliance on numerous high level software abstractions (shared libraries and OS services). Similar trends have been observed for UI-intensive interactive applications on the desktop.

    The M5 simulator: Modeling networked systems

    TCP/IP networking is an increasingly important aspect of computer systems, but a lack of simulation tools limits architects' ability to explore new designs for network I/O. We have developed the M5 simulator specifically to enable research in this area. In addition to typical architecture simulator attributes, M5 provides features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically. Our experience in simulating network workloads revealed some unexpected interactions between TCP and the common simulation acceleration techniques of sampling and warm-up. We have successfully validated M5's simulated performance results against real machines, indicating that our models and methodology adequately capture the salient characteristics of these systems. M5's usefulness as a general-purpose architecture simulator and its liberal open-source license have led to its adoption by several other academic and commercial groups. Keywords: computer architecture, simulation, simulation software, interconnected systems