
23 research outputs found

Hi-Rise: A high-radix switch for 3D integration with single-cycle arbitration

Author: David Blaauw
Reetuparna Das
Ronald G Dreslinski
Supreet Jeloka
Trevor Mudge
Publication venue
Publication date: 01/01/2014
Field of study

This paper proposes a novel 3D switch, called 'Hi-Rise', that employs high-radix switches to efficiently route data across multiple stacked layers of dies. The proposed interconnect is hierarchical and composed of two switches per silicon layer and a set of dedicated layer-to-layer channels. However, a hierarchical 3D switch can lead to unfair arbitration across different layers. To address this, the paper proposes a unique class-based arbitration scheme that is fully integrated into the switching fabric and is easy to implement. It makes the 3D hierarchical switch's fairness comparable to that of a flat 2D switch with least-recently-granted arbitration. The 3D switch is evaluated for different radices, numbers of stacked layers, and 3D integration technologies. A 64-radix, 128-bit-wide, 4-layer Hi-Rise evaluated in a 32nm technology has a throughput of 10.65 Tbps for uniform random traffic. Compared to a 2D design, this corresponds to a 15% improvement in throughput, a 33% area reduction, a 20% latency reduction, and a 38% reduction in energy per transaction.
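
As an aside for readers unfamiliar with the baseline policy named in the abstract, the following is a minimal sketch of least-recently-granted arbitration for a single output port. It is not the paper's class-based scheme or its circuit; the class name `LRGArbiter` and its methods are hypothetical and chosen only for illustration.

```python
class LRGArbiter:
    """Illustrative least-recently-granted arbiter for one output port.

    Hypothetical sketch: the input granted longest ago has the highest
    priority; a winner drops to the lowest priority after its grant.
    """

    def __init__(self, num_inputs):
        # Front of the list = highest priority (granted longest ago).
        self.order = list(range(num_inputs))

    def grant(self, requests):
        """Return the winning input among `requests` (a set of ints), or None."""
        for port in self.order:
            if port in requests:
                # Move the winner to the back so it is now lowest priority.
                self.order.remove(port)
                self.order.append(port)
                return port
        return None


arbiter = LRGArbiter(4)
first = arbiter.grant({1, 3})   # input 1 wins, then drops behind input 3
second = arbiter.grant({1, 3})  # input 3 wins this time
```

Two inputs that request continuously alternate grants under this policy, which is the fairness property the abstract uses as its point of comparison.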

CiteSeerX

HETSIM: Simulating Large-Scale Heterogeneous Systems using a Trace-driven, Synchronization and Dependency-Aware Framework

Author: Murray Cole
Ronald G. Dreslinski
Siying Feng
Bjoern Franke
Kuba Kaszyk
Trevor Mudge
Michael F P O'Boyle
Subhankar Pal
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 19/11/2020
Field of study

Crossref

Edinburgh Research Explorer

Hardware Acceleration for Similarity Measurement in Natural Language Processing

Author: Jichuan Chang
Parthasarathy Ranganathan
Prateek Tandon
Ronald G Dreslinski
Thomas F Wenisch
Vahed Qazvinian
Publication venue
Publication date: 10/04/2020
Field of study

The continuation of Moore's law scaling, but in the absence of Dennard scaling, motivates an emphasis on energy-efficient accelerator-based designs for future applications. In natural language processing, the conventional approach to automatically analyzing vast text collections, using scale-out processing, incurs high energy and hardware costs, since the central compute-intensive step of similarity measurement often entails pair-wise, all-to-all comparisons. We propose a custom hardware accelerator for similarity measures that leverages data streaming, memory latency hiding, and parallel computation across variable-length threads. We evaluate our design through a combination of architectural simulation and RTL synthesis. When executing the dominant kernel in a semantic indexing application for documents, we demonstrate throughput gains of up to 42× and 58× lower energy per similarity computation compared to an optimized software implementation, while requiring less than 1.3% of the area of a conventional core.
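
To make the cost of "pair-wise, all-to-all comparisons" concrete, here is a small software sketch. It assumes cosine similarity over sparse term-weight vectors; the abstract does not say which similarity measure the accelerator implements, and the function names `cosine` and `all_pairs` are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def all_pairs(docs):
    """Score every unordered pair of documents: n * (n - 1) / 2 comparisons."""
    return {
        (i, j): cosine(docs[i], docs[j])
        for i, j in combinations(range(len(docs)), 2)
    }


docs = [{"chip": 1.0, "cache": 1.0}, {"chip": 1.0, "cache": 1.0}, {"text": 2.0}]
sims = all_pairs(docs)
```

The number of comparisons grows quadratically with the collection size, which is why this step dominates the workload the abstract describes.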

CiteSeerX

Bloom Filter Guided Transaction Scheduling

Author: Geoffrey Blake
Ronald G. Dreslinski
Trevor Mudge
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2011
Field of study

Contention management is an important design component of a transactional memory system. Without effective contention management to ensure forward progress, a transactional memory system can experience live-lock, which is difficult to debug in parallel programs. Early work in contention management focused on heuristic managers that reacted to conflicts between transactions by picking the most appropriate transaction to abort. Reactive methods allow conflicts to happen repeatedly, as they do not try to prevent future conflicts from happening. These shortcomings of reactive contention managers have led to proposals that approach contention management as a scheduling problem (proactive managers). Proactive techniques range from throttling execution in predicted periods of high contention to preventing groups of transactions that are predicted likely to conflict from running concurrently. We propose a novel transaction scheduling scheme called “Bloom Filter Guided Transaction Scheduling” (BFGTS), which uses a combination of simple hardware and Bloom filter heuristics to guide scheduling decisions and provide enhanced performance in high-contention situations. We compare it to two state-of-the-art transaction schedulers, “Adaptive Transaction Scheduling” and “Proactive Transaction Scheduling”, and show that BFGTS attains up to a 4.6x and 1.7x improvement on high-contention benchmarks, respectively. Across all benchmarks it shows a 35% and 25% average performance improvement, respectively.
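
For orientation, the sketch below shows the general idea of summarizing a transaction's accessed addresses in a Bloom filter and using a cheap bitwise test to guess whether two transactions might conflict. It is an assumption-laden illustration, not the BFGTS hardware or its heuristics; `BloomFilter`, `may_conflict`, and the sizing parameters are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter stored as a Python int bitmask (illustrative only)."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True for every added item; may also be True for others (false positive).
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def may_conflict(a, b):
    """Conservative overlap test: shared set bits suggest a possible conflict.

    Never misses a truly shared address, but can report false positives.
    Assumes both filters use the same num_bits and num_hashes.
    """
    return (a.bits & b.bits) != 0


tx_a, tx_b = BloomFilter(), BloomFilter()
for addr in (0x1000, 0x1040):
    tx_a.add(addr)
for addr in (0x1040, 0x2000):
    tx_b.add(addr)
predicted = may_conflict(tx_a, tx_b)  # True: both touched 0x1040
```

A scheduler could use such a prediction to avoid running the two transactions at the same time; how BFGTS actually forms and applies its predictions is described in the paper itself.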

CiteSeerX

Crossref

Energy efficient near-threshold chip multi-processing

Author: Bo Zhai
David Blaauw
Dennis Sylvester
Ronald G. Dreslinski
Trevor Mudge
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2007
Field of study

Subthreshold circuit design has become a popular approach for building energy-efficient digital circuits. One drawback is performance degradation due to the exponentially reduced driving current. This has limited subthreshold circuits to relatively low-performance applications such as sensor networks. To retain the excellent energy efficiency while reducing performance loss, we propose to apply subthreshold and near-threshold techniques to chip multi-processors. We show that an architecture where several slower cores are clustered together with a shared faster L1 cache is optimal for energy efficiency, because processor cores and memory operate best at different supply and threshold voltages. In particular, SPLASH2 benchmarks show about a 53% energy improvement over the traditional CMP approach (about 70% over a single-core machine).

CiteSeerX

Crossref

When Homogeneous becomes Heterogeneous -- Wearout Aware Task Scheduling for Streaming Applications

Author: David Blaauw
David Roberts
Dennis Sylvester
Eric Karl
Ronald G. Dreslinski
Trevor Mudge
Publication venue
Publication date: 01/01/2007
Field of study

Recent trends in process technology suggest the need to monitor transistor wear-out in future processes. Because of within-die variation and the different computations being run on each core in a multi-core chip, this wear-out causes further imbalance in initial core frequencies as time progresses. Furthermore, manufacturing defects mean that cache sizes can vary between cores, adding further imbalance to a system. If we allow different cores to independently control their operating frequency, we can achieve the best possible performance for their part of the die. Other parts of the system with slowly degrading performance can include interconnects and Flash-based file caches. In this paper we first explain how conventionally homogeneous multi-core processors can become heterogeneous over time. We discuss possible operating-system-based solutions to maximize the performance of a system as it wears out and present illustrative theoretical results based on linear programming. We demonstrate that for a class of streaming applications, an intelligent scheduling scheme recovers a significant amount of performance lost through wear-out. We advocate the need for multiple accurate performance measurements for effective scheduling in a wear-out-aware multi-core chip.

CiteSeerX

Full-system analysis and characterization of interactive smartphone applications

Author: Ali Saidi
Anthony Gutierrez
Chris Emmons
Nigel Paver
Ronald G Dreslinski
Thomas F Wenisch
Trevor Mudge
Publication venue
Publication date: 01/01/2011
Field of study

Smartphones have recently overtaken PCs as the primary consumer computing device in terms of annual unit shipments. Given this rapid market growth, it is important that mobile system designers and computer architects analyze the characteristics of the interactive applications users have come to expect on these platforms. With the introduction of high-performance, low-power, general-purpose CPUs in the latest smartphone models, users now expect PC-like performance and a rich user experience, including high-definition audio and video, high-quality multimedia, dynamic web content, responsive user interfaces, and 3D graphics. In this paper, we characterize the microarchitectural behavior of representative smartphone applications on a current-generation mobile platform to identify trends that might impact future designs. To this end, we measure a suite of widely available mobile applications for audio, video, and interactive gaming. To complete this suite we developed BBench, a new fully-automated benchmark to assess a web browser's performance when rendering some of the most popular and complex sites on the web. We contrast these applications' characteristics with those of the SPEC CPU2006 benchmark suite. We demonstrate that real-world interactive smartphone applications differ markedly from the SPEC suite. Specifically, the instruction cache, instruction TLB, and branch predictor suffer from poor performance. We conjecture that this is due to the applications' reliance on numerous high-level software abstractions (shared libraries and OS services). Similar trends have been observed for UI-intensive interactive applications on the desktop.

CiteSeerX

Full-system analysis and characterization of interactive smartphone applications

Author: Ali Saidi
Anthony Gutierrez
Chris Emmons
Nigel Paver
Ronald G. Dreslinski
Thomas F. Wenisch
Trevor Mudge
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2011
Field of study

Smartphones have recently overtaken PCs as the primary consumer computing device in terms of annual unit shipments. Given this rapid market growth, it is important that mobile system designers and computer architects analyze the characteristics of the interactive applications users have come to expect on these platforms. With the introduction of high-performance, low-power, general-purpose CPUs in the latest smartphone models, users now expect PC-like performance and a rich user experience, including high-definition audio and video, high-quality multimedia, dynamic web content, responsive user interfaces, and 3D graphics. In this paper, we characterize the microarchitectural behavior of representative smartphone applications on a current-generation mobile platform to identify trends that might impact future designs. To this end, we measure a suite of widely available mobile applications for audio, video, and interactive gaming. To complete this suite we developed BBench, a new fully-automated benchmark to assess a web browser’s performance when rendering some of the most popular and complex sites on the web. We contrast these applications’ characteristics with those of the SPEC CPU2006 benchmark suite. We demonstrate that real-world interactive smartphone applications differ markedly from the SPEC suite. Specifically, the instruction cache, instruction TLB, and branch predictor suffer from poor performance. We conjecture that this is due to the applications’ reliance on numerous high-level software abstractions (shared libraries and OS services). Similar trends have been observed for UI-intensive interactive applications on the desktop.

CiteSeerX

Crossref

The M5 simulator: Modeling networked systems

Author: Ali G. Saidi
Kevin T. Lim
Lisa R. Hsu
Nathan L. Binkert
Ronald G. Dreslinski
Steven K. Reinhardt
Publication venue
Publication date: 01/01/2006
Field of study

TCP/IP networking is an increasingly important aspect of computer systems, but a lack of simulation tools limits architects’ ability to explore new designs for network I/O. We have developed the M5 simulator specifically to enable research in this area. In addition to typical architecture simulator attributes, M5 provides features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically. Our experience in simulating network workloads revealed some unexpected interactions between TCP and the common simulation acceleration techniques of sampling and warm-up. We have successfully validated M5’s simulated performance results against real machines, indicating that our models and methodology adequately capture the salient characteristics of these systems. M5’s usefulness as a general-purpose architecture simulator and its liberal open-source license have led to its adoption by several other academic and commercial groups.

Keywords: computer architecture, simulation, simulation software, interconnected systems

CiteSeerX