Multithreaded architectures for manycore throughput processors by Johnson, Daniel
© 2013 Daniel R. Johnson
MULTITHREADED ARCHITECTURES FOR MANYCORE THROUGHPUT
PROCESSORS
BY
DANIEL R. JOHNSON
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2013
Urbana, Illinois
Doctoral Committee:
Professor Sanjay J. Patel, Chair
Associate Professor Steven S. Lumetta
Professor Wen-Mei W. Hwu
Professor Naresh R. Shanbhag
ABSTRACT
This dissertation describes work on the architecture of throughput-oriented
accelerator processors.
First, we examine the limitations of current accelerator processors and
identify an opportunity to enable high throughput while also providing a
more general-purpose programming model. To address this opportunity, we
present Rigel, a single-chip accelerator architecture with 1024 independent
processing cores targeted at a broad class of data- and task-parallel com-
putation. Enabled by the feasibility of large die sizes combined with in-
creasing transistor densities, we show that such an aggressive design can be
implemented in today’s process technology within acceptable area and power
limits. We discuss our motivation for such a design and evaluate the perfor-
mance scalability as well as power and area requirements. We also describe
the Rigel memory system, including the Task Centric Memory Model soft-
ware coherence protocol, the Cohesion hybrid memory model, and lazy
atomic operations.
We describe the Rigel toolflow, a set of tools we have developed for eval-
uating manycore accelerator architectures. The Rigel toolset includes an
architectural simulator, LLVM-based compiler, parallel benchmarks, RTL
models, and associated infrastructure scripts and toolflows. We have pre-
pared an open-source release of portions of the resulting toolset for the use
of the broader research community. Such a release will enable others to
perform further work in the area of accelerator design.
We present multi-level scheduling, a technique developed for throughput-
oriented graphics processing units (GPUs) designed to reduce complexity
and energy consumption. Modern GPUs employ a large number of hardware
threads to hide both long and short latencies. Supporting tens of thousands
of hardware threads requires a complex scheduler and a large register file
which is expensive to access in terms of energy and latency. With multi-level
scheduling, we divide threads into a smaller set of active threads to hide
short latencies and a larger set of pending threads for hiding long latencies to
main memory. By reducing the number of concurrently active threads, we
enable more efficient scheduler and register file structures.
Finally, we describe opportunities for applying similar hierarchical
multithreading techniques to MIMD accelerator designs such as Rigel. We extend
the original Rigel architecture with a new multithreaded microarchitecture.
We propose a novel multithreading paradigm that gives the architect a
flexible way to scale the number of threads to match the requirements
of targeted workloads. We show that this new multithreaded architecture can
be implemented efficiently while providing more flexibility to the architect.
for my wife, Melanie...
ACKNOWLEDGMENTS
Like all successful projects, the research presented in this dissertation bene-
fited from many contributions, both small and large, from a variety of people.
John Kelm was instrumental in the development of the Rigel project.
John’s willingness to plow ahead in the face of uncertainty was key to spin-
ning up our own set of tools from scratch. I’m thankful for the frequent
intellectually stimulating conversations about architecture, computers, eco-
nomics, all things technical, and more. I must also thank John for the beer.
Matt Johnson’s hacker expertise was an invaluable asset across the project,
from simulator to compiler to benchmarks.
Special thanks to all of the students and others who contributed to the
RTL portion of the Rigel project: Bill Tuohy lent his expert hand in many
ways to get a realistic RTL CAD flow set up, along with Jonathan Ashbrook.
Voytek Truty implemented the floating-point units in the Rigel RTL and
was a valuable team contributor to the Rigel RTL project. Voytek’s unique
sense of humor was always a bright spot for the two years we were officemates.
We will always remember Trogdor in the testflow. Simon Venshtain worked
on the RTL cache models, wrangled memory macros, and helped with the
test and verification flow. Jeremy (Chia-jen Chang) worked on turning our
RTL project into a taped-out test chip! Jeremy took the RTL and toolflow I
developed with Simon and Voytek and worked it into a tapeout-ready netlist
for fabrication.
I would like to thank the various faculty I’ve had a chance to collaborate
with and get to know through this project.
Sanjay, for instigating a large systems project and imposing no boundaries.
I am grateful for having had the opportunity to work on a large systems
project where nothing was taken as fixed, enabling me to dabble in architec-
ture, software engineering, parallel application development, compilers, and
RTL design and implementation.
Steve Lumetta, for always taking an interest in the work of students and
providing positive, insightful advice, no matter the topic.
Matt Frank for helping to get the Rigel project started on the right foot.
Finally, Wen-Mei and Naresh for serving on my committee.
I would like to thank all of the remaining members of the Rigel team
for their various contributions over time, including Stephen Kofsky, Neal
Crago, Wooil Kim, and our collective class project partners who contributed
to our work, especially those who acted as guinea pigs for our tools under
development. Over time, many students helped in some fashion to develop,
debug, and improve the various tools used in this research.
I thank my collaborators at NVIDIA Research, Mark Gebhart, David Tar-
jan, and Steve Keckler, for their contributions to the GPU-related work pre-
sented here. I would also like to thank NVIDIA for a generous Graduate
Fellowship award during my last year of graduate school.
I thank my family for their continued support and encouragement.
Finally, I would like to thank my wife Melanie for her love and support
throughout my graduate school career, and for managing to tolerate me
throughout the long process that is graduate school.
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 Motivation: Current Accelerator Limitations
1.2 Rigel: A Programmable Manycore Accelerator
1.3 Efficient, Flexible Multithreading
CHAPTER 2 RIGEL: A 1024-CORE SINGLE CHIP ACCELERATOR ARCHITECTURE
2.1 Motivation: Current Accelerator Limitations
2.2 Rigel Accelerator Architecture Overview
2.3 Caching and Memory Model
2.4 Programming Rigel: The Rigel Task Model
2.5 Evaluation: Area and Power
2.6 Evaluation: Scalability
2.7 VLSI Implementation and Test Chip
CHAPTER 3 RIGEL MEMORY SYSTEM
3.1 Parallel Application Characterization
3.2 Software-managed Coherence with the Task-Centric Memory Model
3.3 Hybrid Coherence with Cohesion
3.4 Lazy Atomic Operations
CHAPTER 4 AN EVALUATION INFRASTRUCTURE FOR MASSIVELY PARALLEL ACCELERATOR PROCESSORS
4.1 Toolflow Objectives
4.2 The Case for a Flexible, Integrated Evaluation Framework
4.3 Toolflow Components
4.4 Tool Chain Integration: IDEA
4.5 Timing Simulator
4.6 Cluster
4.7 Simulation Automation
4.8 RTL Infrastructure
4.9 The Rigel Core
4.10 Testing and Verification
4.11 Summary
CHAPTER 5 ENERGY-EFFICIENT THROUGHPUT ARCHITECTURES WITH MULTI-LEVEL SCHEDULING
5.1 Introduction
5.2 Background
5.3 Microarchitecture
5.4 Methodology
5.5 Evaluation
5.6 Conclusion
CHAPTER 6 HIERARCHICAL MULTITHREADING
6.1 Hierarchical Multithreaded Clusters
6.2 A Multithreaded Microarchitecture for Rigel
6.3 Hierarchical Multithreading
6.4 Thread Pools: Migratory Hierarchical Multithreading
6.5 Summary
CHAPTER 7 RELATED WORK
7.1 Throughput Processors
7.2 Coherence
7.3 Atomic Operations
7.4 Hierarchical Multithreading
7.5 Tools
CHAPTER 8 CONCLUSIONS
8.1 Rigel: Looking to the Future
8.2 Energy Efficient Multithreading for GPUs
8.3 Hierarchical Multithreading
REFERENCES
LIST OF ABBREVIATIONS
BSP Bulk Synchronous Parallel
CMP Chip Multiprocessor
CPU Central Processing Unit
DRAM Dynamic Random Access Memory
FIFO First In, First Out
FPU Floating Point Unit
GDDR Graphics Double Data Rate
GPU Graphics Processing Unit
GPGPU General-Purpose Graphics Processing Unit
ILP Instruction Level Parallelism
ISA Instruction Set Architecture
MIMD Multiple Instruction Multiple Data
RTL Register-Transfer Level
SIMD Single Instruction Multiple Data
SIMT Single Instruction Multiple Thread
SMP Symmetric Multiprocessor
SoC System on Chip
SPMD Single Program Multiple Data
SRAM Static Random Access Memory
VLIW Very Long Instruction Word
VLSI Very Large Scale Integration
CHAPTER 1
INTRODUCTION
An insatiable appetite for performance on compute intensive data-parallel
workloads in visual and scientific computing has driven the design of through-
put-oriented parallel compute accelerators. For this work, we consider pro-
grammable accelerators in contrast to fixed-function or hardwired application-
specific accelerator units. Such accelerators are designed to improve perfor-
mance for specific classes of workloads by exploiting characteristics of the
target domain and by limiting the generality of the programming model.
While general-purpose processors tend to employ additional transistor
resources to minimize latency (sec/operation), accelerators are designed to
maximize throughput (operations/sec). Contemporary examples of compute
accelerators include graphics processing units (GPUs) [1], Cell [2], and
Larrabee [3]. For
a more complete discussion of throughput-oriented CMP and accelerator ar-
chitectures, please see Chapter 7.
1.1 Motivation: Current Accelerator Limitations
Current accelerators generally expose restricted programming models which
yield high performance for data-parallel applications with rigidly structured
computation and memory access patterns, but present a more difficult tar-
get for less regular parallel applications. The throughput-oriented archi-
tectural design choices of accelerators often compromise the generality of
the programming model. For instance, accelerators commonly achieve high
throughput through the use of wide SIMD (single instruction, multiple data)
processing elements, as opposed to the MIMD (multiple instruction, multi-
ple data) model favored by general-purpose processors. For dense or regular
data-parallel computations, SIMD hardware reduces the cost of perform-
ing many computations by amortizing costs such as control and instruction
fetch across many processing elements, enabling an efficient hardware im-
plementation. However, for applications that do not naturally map to the
SIMD execution model, programmers must adapt their algorithms or suffer
reduced efficiency, limiting the scope of applications which can achieve the
hardware’s peak performance. The memory system is another area where
accelerators commonly make compromises in support of hardware efficiency
that limit programmability. Software-managed scratchpad memories yield
denser hardware, provide tighter access latency guarantees, and consume less
power than caches; however, they impose an additional burden on either the
programmer or software tools. Multiple address spaces also increase the burden of
development.
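The SIMD efficiency argument above can be illustrated with a small utilization model. This is a sketch under stated assumptions, not a description of any particular accelerator: the function name `simd_utilization`, the 8-lane width, and the cost model (every distinct control path is executed serially while the lanes that did not take it are masked off) are all hypothetical.

```python
from collections import Counter

def simd_utilization(lane_paths, path_cost):
    """Fraction of lane-cycles doing useful work when lanes diverge.

    lane_paths: list giving the control path taken by each SIMD lane.
    path_cost:  dict mapping each path name to its cost in cycles.
    Assumed model: the SIMD unit executes every distinct path serially,
    masking off the lanes that did not take it.
    """
    width = len(lane_paths)
    lanes_per_path = Counter(lane_paths)
    # Every distinct path occupies the whole unit for its full cost.
    total_cycles = sum(path_cost[p] for p in lanes_per_path)
    # Only the lanes that actually took a path do useful work during it.
    useful = sum(n * path_cost[p] for p, n in lanes_per_path.items())
    return useful / (width * total_cycles)

# Regular workload: all 8 lanes take the same path.
uniform = simd_utilization(["a"] * 8, {"a": 10, "b": 10})
# Irregular workload: half of the lanes take each path.
divergent = simd_utilization(["a"] * 4 + ["b"] * 4, {"a": 10, "b": 10})
```

In this model the regular workload keeps every lane busy, while the evenly divergent one leaves half of the lane-cycles idle. A MIMD design avoids that loss because each core follows its own path, at the cost of replicating fetch and control logic per core.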
1.2 Rigel: A Programmable Manycore Accelerator
The Rigel architecture was conceived in 2007 as an attempt to address some
of the shortcomings of parallel computation accelerators while pushing the
envelope on throughput-oriented designs. Broadly, the goals of the project
are to
• Demonstrate the feasibility of a single-chip, massively parallel MIMD
accelerator architecture with 1024 cores
• Achieve high computational density, or throughput ((operations/sec)/mm²)
• Determine how to organize such a device to be programmer-friendly
• Present a more general target to developers, increasing the scope of
parallel applications which can target the design
These goals drove our development of Rigel [4], a 1024-core single-chip
accelerator architecture that targets a broad class of data- and task-parallel
computation, especially visual computing workloads. With the Rigel de-
sign, we aim to strike a balance between raw performance and ease of pro-
grammability by adopting programming interface elements from general-
purpose processors. Rigel is composed of 1024 independent, hierarchi-
cally organized cores that use a fine-grained, dynamically scheduled single-
program, multiple-data (SPMD) execution model. Rigel adopts a single
global address space and a fully cached memory hierarchy. Parallel work is
expressed in a task-centric, bulk-synchronized manner using minimal hard-
ware support. Compared to existing accelerators, which contain domain-
specific hardware, specialized memories, and restrictive programming mod-
els, Rigel is more flexible and provides a more straightforward target for
a broader set of applications. The Rigel architecture is described in more
detail in Chapter 2.
The design of the Rigel memory system, particularly cache coherence,
shaped many other aspects of the architecture. We observed that data shar-
ing and communication patterns in parallel workloads can be leveraged in the
design of memory systems for future manycore accelerators. Based on these
insights, we developed both software and hardware mechanisms to manage
coherence on parallel accelerator processors. We developed the Task-Centric
Memory Model [5], a software protocol which works with hardware caches to
maintain a coherent, single-address-space view of memory without hardware
coherence support. We developed Lazy Atomic Operations as an extension
to this model to take advantage of incoherent hardware caches for improving
throughput on atomic operations. We developed Cohesion [6] as a bridge
enabling effective use of both hardware and software coherence mechanisms,
simplifying the integration of multiple memory models in heterogeneous or
accelerator-based systems.
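As a rough illustration of what managing coherence in software entails, the following toy model shows two private caches with no hardware coherence sharing one memory. It is not the Task-Centric Memory Model protocol itself; the class `IncoherentCache` and its `writeback` and `invalidate` methods are hypothetical stand-ins for explicit cache management operations issued by software.

```python
class IncoherentCache:
    """Private write-back cache with no hardware coherence (toy model)."""

    def __init__(self, memory):
        self.memory = memory   # shared backing store: dict addr -> value
        self.lines = {}        # addr -> (value, dirty)

    def load(self, addr):
        # A hit returns the cached copy even if memory has since changed.
        if addr not in self.lines:
            self.lines[addr] = (self.memory.get(addr, 0), False)
        return self.lines[addr][0]

    def store(self, addr, value):
        # Writes stay in the private cache until software writes them back.
        self.lines[addr] = (value, True)

    def writeback(self):
        """Software-issued flush: push dirty lines to shared memory."""
        for addr, (value, dirty) in list(self.lines.items()):
            if dirty:
                self.memory[addr] = value
                self.lines[addr] = (value, False)

    def invalidate(self):
        """Software-issued invalidate: drop possibly stale copies."""
        self.lines.clear()

memory = {0x100: 1}
producer, consumer = IncoherentCache(memory), IncoherentCache(memory)
before = consumer.load(0x100)   # consumer caches the old value
producer.store(0x100, 42)       # visible only in the producer's cache
stale = consumer.load(0x100)    # still the old value
producer.writeback()            # producer publishes its update
consumer.invalidate()           # consumer discards its stale copy
fresh = consumer.load(0x100)    # now observes the update
```

The sketch only shows why both a write-back on the producer side and an invalidate on the consumer side are needed before shared data can be read safely when the hardware provides no coherence.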
1.3 Efficient, Flexible Multithreading
Application characteristics and memory access behavior direct the level of
threading and required latency hiding capabilities for an architecture. In
throughput-oriented systems, multiple types of latencies must be hidden. A
variety of short-term latencies limit performance, including pipeline hazards,
functional unit latencies, cache access latency, resource contention, and lack
of ILP. Traditional techniques for hiding shorter latencies, such as out-of-
order processing to uncover ILP, incur a high cost in terms of area and power,
undesirable for throughput-oriented designs. Additionally, long-latency op-
erations including cache misses, global memory operations such as atomics,
and uncacheable memory accesses can introduce longer stalls.
A common solution for hiding latency and increasing throughput is hard-
ware multithreading. Multithreaded systems provide hardware threads to
hide both short and long latencies. These designs provide a relatively easy
path for programmers to achieve high throughput on their applications.
While adding threads to an architecture can hide more latency and
ease the burden on the programmer, hardware threads are not free. There is a
large marginal cost to each hardware thread. Each hardware thread context
requires additional resources, including register file storage and scheduling
entries. On modern GPUs, a substantial portion of the die area is dedicated
to register storage for thousands of hardware-resident threads [7]. Addition-
ally, the design complexity of the pipeline, the thread scheduler, and other
shared structures increases to handle the increased level of threading. Finally,
over-threading causes additional contention for shared resources, which can
hurt performance and throughput.
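The tradeoff between latency hiding and per-thread cost can be made concrete with an idealized model. The sketch below assumes each thread runs for a fixed number of cycles and then stalls for a fixed latency; the function names, the example latencies, and the default context size of 32 registers of 4 bytes are assumptions chosen for illustration.

```python
import math

def threads_to_hide(stall_cycles, run_cycles):
    """Threads needed so some thread is always ready (idealized model).

    While one thread stalls for `stall_cycles`, the others must supply
    `stall_cycles / run_cycles` turns of useful work.
    """
    return math.ceil(1 + stall_cycles / run_cycles)

def register_bytes(threads, regs_per_thread=32, bytes_per_reg=4):
    """Register file storage needed to keep `threads` contexts resident."""
    return threads * regs_per_thread * bytes_per_reg

# A short stall, such as a functional-unit latency (assumed values).
short = threads_to_hide(stall_cycles=4, run_cycles=2)
# A long stall, such as an access to main memory (assumed values).
long_ = threads_to_hide(stall_cycles=400, run_cycles=20)
```

Under these assumptions, 3 threads cover the short stall and 21 cover the long one, and `register_bytes(long_)` shows that the resident register state grows linearly with the thread count.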
1.3.1 Power Efficient Throughput-Oriented Architectures
While aggressive throughput-oriented processors are feasible today, power is
becoming an increasingly important limiter for high-performance designs [8]
such as Rigel and graphics processors. Architects of future throughput
processors must consider both power and performance. In Chapter 5, we
describe work on energy-efficient mechanisms for managing large numbers of
hardware-resident thread contexts in throughput processors.
We present multi-level scheduling, a technique developed for throughput-
oriented graphics processors (GPUs) designed to reduce complexity and en-
ergy consumption. Modern GPUs employ a large number of hardware threads
to hide both long and short latencies. Supporting tens of thousands of hard-
ware threads requires a complex scheduler and a large register file, which is
expensive to access in terms of energy and latency. With multi-level schedul-
ing, we divide threads into a smaller set of active threads to hide short
latencies and a larger set of pending threads for hiding long latencies to main
memory. By reducing the number of concurrently active threads, we enable
both a lower-complexity scheduler and more power-efficient register file
structures.
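The split between active and pending threads can be sketched as follows. This is a toy model of the idea only, not the microarchitecture evaluated in Chapter 5: the class `TwoLevelScheduler`, the round-robin issue policy, and the promotion order are assumptions.

```python
from collections import deque

class TwoLevelScheduler:
    """Toy two-level thread scheduler (illustrative only)."""

    def __init__(self, thread_ids, active_slots):
        ids = list(thread_ids)
        self.active = deque(ids[:active_slots])   # small set considered every cycle
        self.pending = deque(ids[active_slots:])  # larger set waiting for promotion
        self.waiting = {}                         # thread id -> cycles until data returns

    def issue(self):
        """Pick the next active thread in round-robin order."""
        if not self.active:
            return None
        tid = self.active.popleft()
        self.active.append(tid)
        return tid

    def long_latency_stall(self, tid, cycles):
        """Demote a thread that went to main memory; promote a pending one."""
        self.active.remove(tid)
        self.waiting[tid] = cycles
        if self.pending:
            self.active.append(self.pending.popleft())

    def tick(self):
        """Advance one cycle; threads whose data returned rejoin the pending set."""
        for tid in list(self.waiting):
            self.waiting[tid] -= 1
            if self.waiting[tid] == 0:
                del self.waiting[tid]
                self.pending.append(tid)

sched = TwoLevelScheduler(range(8), active_slots=2)
first = sched.issue()
sched.long_latency_stall(first, cycles=3)
for _ in range(3):
    sched.tick()
```

The point of the sketch is that `issue` only ever chooses among `active_slots` threads, so the selection logic and the register storage it reads can be sized for the small active set rather than for all resident threads.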
1.3.2 Hierarchical Multithreading on Rigel
There is a wide design space between the simple, single-threaded processors
originally employed by Rigel and the massively threaded processors em-
ployed by modern GPUs. The level of threading desired for a given system
is influenced by a variety of factors, especially for throughput-oriented accel-
erator processors. Key applications and application domains have a heavy
influence on the underlying architecture. For instance, graphics workloads
stream through large datasets while rendering each frame and are effectively
bandwidth limited, leading to large numbers of latency hiding threads in
modern GPUs.
While a particular application or application domain may be matched to
a specific degree of threading for a given design implementation, there is no
single correct degree of multithreading. For a throughput-oriented system,
the end goal is to select a design point that maximizes overall throughput.
We desire a flexible threading architecture for accelerators. We seek to take
advantage of the disjoint latency classes that influence multithreaded designs
to enable a more configurable multithreading paradigm. Our goal is a scalable
multithreading solution rather than the typical point solutions employed by
multithreaded designs today. It is desirable to consider an architecture that
allows the architect to dial a knob for selecting the degree of multithreading.
One knob adjusts the number of threads supported by execution pipelines,
and another knob selects the number of threads available to hide memory
latency.
In pursuit of this goal, we employ techniques similar to those we evaluated for
GPUs. We consider splitting thread contexts into two groups: L1 threads that
are actively executing on a given core, and hardware-resident L2 threads that
wait on memory requests and provide fast context switches.
We describe opportunities for applying similar hierarchical multithreading
techniques to MIMD accelerator designs such as Rigel. We extend the
original Rigel architecture with a new multithreaded microarchitecture. We
propose a novel multithreading paradigm that gives the architect a
flexible way to scale the number of threads to match the requirements of
targeted workloads. We show that this new multithreading paradigm can be
implemented efficiently while providing more flexibility to the architect.
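The two knobs described in this section can be written down as a small configuration record. This is only a sketch: the name `ThreadingConfig`, the assumption that both knobs are set per core, and the example values are hypothetical, and the actual organization is described in Chapter 6.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ThreadingConfig:
    l1_threads: int   # knob 1: contexts actively executing in the pipeline
    l2_threads: int   # knob 2: resident contexts waiting on memory

    @property
    def resident_threads(self):
        # The two groups are disjoint, so resident contexts are their sum.
        return self.l1_threads + self.l2_threads

def contexts_per_chip(cfg, cores=1024):
    """Total hardware thread contexts, assuming the knobs apply per core."""
    return cfg.resident_threads * cores

baseline = ThreadingConfig(l1_threads=1, l2_threads=0)   # single-threaded core
threaded = ThreadingConfig(l1_threads=2, l2_threads=6)   # assumed example point
```

Treating the two counts as independent parameters lets a design sweep vary pipeline-level threading and memory-latency threading separately instead of fixing one combined thread count.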
CHAPTER 2
RIGEL: A 1024-CORE SINGLE CHIP
ACCELERATOR ARCHITECTURE
In this chapter, we provide a description of the Rigel 1024-core accelerator
architecture. We describe previous work for throughput-oriented processing
designs. We detail the motivation for and objectives of the Rigel design. We
describe the cores and clustered core organization, the caches and memory
system, and Rigel’s task-based programming model.
This work has been most recently published in [9], and parts have been
previously published in [4, 10, 5, 11, 6, 12].
2.1 Motivation: Current Accelerator Limitations
Current accelerators generally expose restricted programming models which
yield high performance for data-parallel applications with rigidly structured
computation and memory access patterns, but present a more difficult tar-
get for less regular parallel applications. The throughput-oriented architec-
tural choices of accelerators often compromise the generality of the program-
ming model. For instance, accelerators commonly achieve high throughput
through the use of SIMD (single instruction, multiple data) processing ele-
ments as opposed to the MIMD (multiple instruction, multiple data) model.
For dense or regular data-parallel computations, SIMD hardware reduces
the cost of performing many computations by amortizing costs such as con-
trol and instruction fetch across many processing elements. However, when
applications do not naturally map to the SIMD execution model, program-
mers must adapt their algorithms or suffer reduced efficiency. SIMD then
limits the scope of applications which can achieve the hardware’s peak per-
formance. The memory system is another area where accelerators commonly
make compromises in support of hardware efficiency that limit programmabil-
ity. Software-managed scratchpad memories yield denser hardware, provide
tighter access latency guarantees, and consume less power than caches; how-
ever, they impose an additional burden on either the programmer or software
tools. Additionally, managing the multiple address spaces often associated
with accelerator memories requires copy operations and more explicit manual
memory management.
2.2 Rigel Accelerator Architecture Overview
In this section, we provide an overview of the Rigel accelerator architecture.
We describe our design objectives, the major components of our design, and
some possible alternative organizations.
2.2.1 Objectives
The Rigel architecture was conceived as an attempt to address some of the
shortcomings of parallel computation accelerators while pushing the envelope
on throughput-oriented designs.
Broadly, the goals of the Rigel project were to
• Determine the feasibility of a single-chip, massively parallel MIMD gen-
eralized computation accelerator
• Achieve high computation density, or throughput ((operations/sec)/mm²)
• Determine how to organize such a device to be programmer-friendly
• Present a more general target to developers, increasing the scope of
parallel applications which can target the design
• Address the limitations of existing accelerator architectures
2.2.2 Rigel: A Programmable Manycore Accelerator
Architecture
These objectives drove our development of Rigel [4], a 1024-core single-
chip accelerator architecture designed to efficiently target a wide class of
regular and irregular parallel applications, including data- and task-parallel
computation. With the Rigel design, we aim to strike a balance between
raw performance and ease of programmability by adopting programming
interface elements from general-purpose processors. A block diagram of the
Rigel accelerator architecture is shown in Figure 2.1. Rigel is composed of
1024 independent, hierarchically organized cores. Simple in-order cores
emphasize throughput over latency, but a MIMD execution model is chosen for
flexibility over a potentially denser SIMD model. Rigel has a fully cached,
single address space memory model with no chip-wide hardware-enforced
coherence in the baseline configuration. Work distribution is managed in
software in a bulk-synchronous fashion. Compared to existing accelerators,
which rely on domain-specific hardware, multiple special-purpose memories,
and limited programming models, Rigel is more flexible and provides a more
straightforward development target for a broader range of parallel
applications.
Figure 2.1: Block diagram of the Rigel processor (full chip view).
Tradeoffs are made in Rigel’s low-level programming interface between
generality and accelerator performance. The primary elements that we iden-
tify as important for supporting our objectives include the execution model,
the memory model, work distribution, synchronization, and locality man-
agement. The Rigel execution model omits complex ILP-oriented cores in
favor of simple, area-optimized in-order cores to improve full-chip through-
put. However, a more flexible MIMD model is chosen over denser SIMD
hardware. A degree of multithreading is beneficial in improving throughput
and is under investigation. Unlike many accelerators, Rigel presents a single
global address space similar to general purpose CMPs. Rigel supports
software work distribution in the form of task queues based on the common BSP
execution model. Rigel supports global synchronization through software
barriers and through atomic hardware primitives. Rigel supports a variety
of memory operations to aid in locality management, including prefetches,
local and global memory operations, and explicit cache management
instructions.

Table 2.1: Comparison of Rigel to other contemporary accelerator architectures
                          Rigel           GPU                     Cell               Larrabee
Vectors                   1× (MIMD)       32× (SIMT)              4× (SIMD)          16× (SIMD)
Memory                    Fully Cached    Special Purpose         DMA + Scratchpad   Fully Cached
Address Space             Single          Multiple                Multiple           Single
Thread Count (per core)   Some (1-4)      Heavy (10s-100s)        None               Some (4)
Core Count                1024            10s-100s                8-10               10s
Coherence                 HW/SW Hybrid    None                    None               HW
Work Distribution         Software        Hardware                Software           Software
Specialized Hardware      None            Significant (graphics)  None               Some (texture)

Table 2.1 compares the Rigel design point to several other notable ac-
celerator processors, and Table 2.2 summarizes our design parameters. The
various components of the Rigel accelerator architecture are discussed in
more detail below.
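The software work distribution described above, task queues combined with barrier-separated intervals, can be sketched with ordinary threads. This is an illustrative analogue rather than the Rigel task API: `run_interval`, the worker count, and the use of Python threads in place of cores are assumptions.

```python
import queue
import threading

def run_interval(tasks, worker_fn, num_workers=4):
    """Run one bulk-synchronous interval: drain a shared task queue, then barrier."""
    task_queue = queue.Queue()
    for task in tasks:
        task_queue.put(task)
    results = []
    results_lock = threading.Lock()
    barrier = threading.Barrier(num_workers)

    def worker():
        while True:
            try:
                task = task_queue.get_nowait()   # dynamically dequeue the next task
            except queue.Empty:
                break                            # no work left in this interval
            value = worker_fn(task)
            with results_lock:
                results.append(value)
        barrier.wait()   # no worker leaves the interval until all have finished

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Two intervals: the second consumes the results produced by the first.
squares = run_interval(range(8), lambda x: x * x)
total = run_interval([squares], sum, num_workers=2)
```

Because tasks are pulled from a shared queue, workers that finish early simply take more tasks, and the barrier guarantees that everything produced in one interval is complete before the next interval begins.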
2.2.3 Core
The fundamental processing element of Rigel is a dual-issue in-order core
optimized for area rather than ILP. Cores support a custom 32-bit RISC-like
instruction set. Each core has a standard integer pipeline, a fully pipelined
single-precision floating-point unit, a load-store pipeline, and a 32-entry,
32-bit general-purpose register file. A set of special-purpose registers
serves to provide configuration information such as unique core IDs. In
contrast to SIMD or SIMT machines, each pipeline has an independent front end
and instruction fetch unit, allowing all cores to simultaneously execute fully
independent instruction streams in a MIMD fashion. Each core has a small L1
instruction cache and L1 data cache. The baseline core microarchitecture is
illustrated in Figure 2.2.

Table 2.2: Simulated parameters for the Rigel architecture
Cores                 1024, 2-wide issue, in-order, 1.5 GHz
                      1 single-precision floating-point unit
                      1 integer unit
                      1 memory unit
L1 I-Cache            2 kB, 2-way associative
L1 D-Cache            1 kB, 2-way associative
L2 (Cluster) Cache    Unified, 1 per 8-core cluster, 64 kB, 16-way
L3 (Global) Cache     Globally shared, 4 MB total, 32 8-way banks
DRAM                  8 32-bit channels, GDDR5
DRAM Bandwidth        192 GB/s

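The parameters in Table 2.2 imply some useful aggregate figures, worked out below. The calculation assumes each core completes one single-precision floating-point operation per cycle; that issue rate, and the variable names, are assumptions for illustration rather than figures stated in the text.

```python
CORES = 1024
CORES_PER_CLUSTER = 8
CLOCK_HZ = 1.5e9
DRAM_BYTES_PER_SEC = 192e9
DRAM_CHANNELS = 8
L1_KB_PER_CORE = 2 + 1        # 2 kB I-cache plus 1 kB D-cache
L2_KB_PER_CLUSTER = 64

# Assumption: one single-precision FP operation per core per cycle.
FP_OPS_PER_CYCLE = 1

peak_fp_ops_per_sec = CORES * CLOCK_HZ * FP_OPS_PER_CYCLE
# DRAM bytes available per FP operation at peak rate.
bytes_per_fp_op = DRAM_BYTES_PER_SEC / peak_fp_ops_per_sec
bandwidth_per_channel = DRAM_BYTES_PER_SEC / DRAM_CHANNELS

clusters = CORES // CORES_PER_CLUSTER
total_l1_kb = CORES * L1_KB_PER_CORE
total_l2_kb = clusters * L2_KB_PER_CLUSTER
```

Under the stated assumption, the chip peaks at 1.536 × 10^12 floating-point operations per second with 0.125 bytes of DRAM bandwidth per operation, which is why the caches must service most memory traffic for the cores to stay busy.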
2.2.4 Cluster
Cores are organized into groups called clusters. A cluster contains a collection
of cores attached to a shared unified cluster cache. Figure 2.1 illustrates
the cluster on the left side. The baseline Rigel configuration contains eight
cores per cluster, with cores connected to the cluster cache via a shared
bus. The interconnect between cores and the cluster cache is a split-phase
bus, enabling simultaneous requests and responses. Clusters allow efficient
communication among their cores via the shared cluster cache. Core-private
caches are kept coherent across their containing cluster. Clusters also im-
plement local atomics, load-linked and store-conditional. The cores, cluster
cache, core-to-cluster-cache interconnect and the cluster-to-global intercon-
nect logic make up a single Rigel cluster.
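The load-linked and store-conditional pair mentioned above can be illustrated with a toy model of a shared cluster cache. The class `ClusterCache`, the per-core reservation table, and the rule that any write to an address clears every reservation on it are assumptions made for this sketch, not details of the Rigel implementation.

```python
class ClusterCache:
    """Toy shared cache supporting load-linked / store-conditional."""

    def __init__(self):
        self.data = {}
        self.links = {}   # core id -> address it holds a reservation on

    def load_linked(self, core, addr):
        # Record a reservation for this core and return the current value.
        self.links[core] = addr
        return self.data.get(addr, 0)

    def store(self, addr, value):
        self.data[addr] = value
        # Any write to the address breaks every reservation on it.
        self.links = {c: a for c, a in self.links.items() if a != addr}

    def store_conditional(self, core, addr, value):
        if self.links.get(core) != addr:
            return False   # reservation lost; the caller must retry
        self.store(addr, value)
        return True

def atomic_increment(cache, core, addr):
    """Retry loop that builds an atomic increment from LL/SC."""
    while True:
        old = cache.load_linked(core, addr)
        if cache.store_conditional(core, addr, old + 1):
            return old

cache = ClusterCache()
v0 = cache.load_linked(0, 0x40)                     # core 0 takes a reservation
atomic_increment(cache, 1, 0x40)                    # core 1 updates the location
failed = cache.store_conditional(0, 0x40, v0 + 1)   # core 0's reservation is gone
atomic_increment(cache, 0, 0x40)                    # the retry succeeds
```

The failed store-conditional is what prevents core 0 from overwriting core 1's update with a value computed from stale data.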
Figure 2.2: Block diagram of the Rigel pipeline. (Blocks shown: Fetch, Decode,
Exec 1 (Int), Exec 2, Agen, Mem, Mem2, CCRead, FPU 1 through FPU 4, WB,
RegFile, SPRF, FP Accumulator, Scoreboard, Bypass Network, L1 I-Cache,
L1 D-Cache, ClusterNet (Bus), and ClusterNet (Arb).)
2.2.5 Tile
Clusters are connected and grouped logically into a tile. The tile unit serves
primarily as a unit of VLSI replication. In the 1024-core baseline configura-
tion of Rigel, eight tiles of 16 clusters each are distributed across the chip.
Clusters within a tile share resources on a bi-directional tree-structured inter-
connect. A tree-structured interconnect is chosen as opposed to a mesh due
to the intended use pattern. Communication between cores takes place via
shared caches, not through explicit message passing. The interconnect serves
to connect cores to memory, not to enable arbitrary core-to-core communi-
cation or coherence traffic. Tiles are distributed across the chip, attached
to global cache banks via a multi-stage switch interconnect. Figure 2.1 illus-
trates the tile and top-level organization on the right side.
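The baseline hierarchy multiplies out to the full core count: 8 tiles × 16 clusters × 8 cores = 1024. The sketch below checks this and decomposes a flat core index into its position in the hierarchy. The linear numbering scheme and the names `CoreLocation` and `locate` are assumptions for illustration; the text does not specify how core IDs are assigned.

```python
from typing import NamedTuple

TILES = 8
CLUSTERS_PER_TILE = 16
CORES_PER_CLUSTER = 8
TOTAL_CORES = TILES * CLUSTERS_PER_TILE * CORES_PER_CLUSTER

class CoreLocation(NamedTuple):
    tile: int
    cluster: int   # cluster index within its tile
    core: int      # core index within its cluster

def locate(core_id):
    """Map a flat core ID to (tile, cluster, core), assuming linear numbering."""
    if not 0 <= core_id < TOTAL_CORES:
        raise ValueError(f"core_id {core_id} out of range")
    cluster_index, core = divmod(core_id, CORES_PER_CLUSTER)
    tile, cluster = divmod(cluster_index, CLUSTERS_PER_TILE)
    return CoreLocation(tile, cluster, core)

first = locate(0)
last = locate(TOTAL_CORES - 1)
```

With this numbering, two cores share a cluster cache exactly when their IDs agree after integer division by `CORES_PER_CLUSTER`.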
2.2.6 Global Cache
The global cache is Rigel’s last-level shared cache and provides buffering
for several high-bandwidth memory controllers. Our initial 1024-core design
includes 8 GDDR memory controllers and 32 global cache banks. The global
cache provides a point of coherence for memory accesses; each address may
be cached in only a single location in the last-level global cache. Shared
data made visible in the global cache is visible to all cores on the chip. By
default, global cache misses which access DRAM result in the returned data
being cached in the global cache. However, memory operations which bypass
caching in the global cache are optionally available. Additionally, global
atomic operations are performed at the global cache.
2.2.7 Alternate Organizations
While Rigel selects a particular set of design points, a number of promising
alternatives to the baseline Rigel architecture exist.
At the cluster level, a rich design space exists in the interconnection of
cores and caches as well as in core microarchitecture and degree of thread-
ing. We explore only a subset of this space. The cluster-level shared cache
interconnection was implemented as a split-phase shared bus in the baseline
Rigel design as cores are single threaded and core-level caches service a large
fraction of memory traffic, reducing required bandwidth to the cluster cache.
However, as more cores share the cluster cache, bandwidth demands grow.
Additionally, some application classes naturally demand more bandwidth or
are simply less amenable to caching. Notable variations on the cluster design
include a multi-banked cache interconnected to cores by a crossbar structure
as well as alternate core microarchitectures, including multithreading.
At the tile and chip level, the network-on-chip and last-level global cache
organization present a number of design alternatives. While we selected our
design point based on intended use cases, there are compelling arguments for
alternative designs. One notable alternate design point for the tile and global
cache is to include global cache banks paired with tiles, and interconnect tiles
as a mesh. Each tile would be associated with fixed memory controller and
global cache resources in a NUCA organization. Such a design may have
desirable VLSI replication properties due to regularity.
Exhaustive exploration of all design alternatives is beyond the scope of
this dissertation as well as the Rigel project.
2.3 Caching and Memory Model
All cores on the Rigel processor share a single global address space. Cores
within a cluster have the same view of memory via the shared cluster cache,
while cluster caches in our baseline architecture are not explicitly kept co-
herent with one another. The low-level hardware operations and software
model for maintaining coherence are discussed further in Chapter 3. The
global cache provides a point of coherence for when software needs to syn-
chronize or otherwise safely share data between separate clusters. Due to
the incoherent nature of the cache hierarchy, Rigel implements two classes
of memory operations: local and global.
2.3.1 Local Memory Operations
Local memory operations are the standard path to memory on the Rigel ar-
chitecture. Local operations are fully cached at the cluster cache, but are not
kept coherent by hardware between clusters. Local loads and stores generally
constitute the majority of memory operations, providing high bandwidth and
low latency access via the cache hierarchy. Local memory operations are used
for accessing read-only data, private data such as the stack, and data shared
with cores in the same cluster.
Local stores are not visible outside of the cluster until either an eviction or
explicit writeback occurs. Values evicted from the cluster cache are written
back to the last-level global cache, and cluster cache misses are serviced by
the global cache if the required data is present. By default, local loads that
initially miss in the on-chip caches are also cached in the global cache to
improve performance for read-shared data.
The cluster caches and the global cache are neither inclusive nor exclusive.
Local store operations are not guaranteed to be globally visible without ex-
plicit synchronization, and local loads may return inconsistent data values if
improperly used to access write-shared data without synchronization. Per-
word dirty bits are maintained for the cluster cache to mitigate the effects of
false sharing within cachelines.
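The following toy model illustrates why local stores stay invisible until a writeback and how per-word dirty bits avoid clobbering another cluster's words in the same line. It is a sketch under our own assumptions, not the Rigel hardware: the class names, the eight-word line size, and the merge-on-writeback behavior are all hypothetical.

```python
WORDS_PER_LINE = 8  # assumed line size in words, for illustration only


class GlobalCache:
    """Last-level cache modeled as line address -> list of words."""

    def __init__(self):
        self.lines = {}

    def read_line(self, addr):
        # Return a copy so each cluster cache holds its own version.
        return list(self.lines.setdefault(addr, [0] * WORDS_PER_LINE))

    def merge(self, addr, words, dirty):
        # Only words marked dirty overwrite the global copy.
        line = self.lines.setdefault(addr, [0] * WORDS_PER_LINE)
        for i, is_dirty in enumerate(dirty):
            if is_dirty:
                line[i] = words[i]


class ClusterCache:
    """Incoherent cluster cache with per-word dirty bits (hypothetical model)."""

    def __init__(self, gcache):
        self.gcache = gcache
        self.lines = {}  # addr -> (words, dirty bits)

    def _fill(self, addr):
        if addr not in self.lines:
            words = self.gcache.read_line(addr)
            self.lines[addr] = (words, [False] * WORDS_PER_LINE)
        return self.lines[addr]

    def local_load(self, addr, word):
        return self._fill(addr)[0][word]

    def local_store(self, addr, word, value):
        words, dirty = self._fill(addr)
        words[word] = value
        dirty[word] = True

    def writeback(self, addr):
        words, dirty = self.lines[addr]
        self.gcache.merge(addr, words, dirty)
        self.lines[addr] = (words, [False] * WORDS_PER_LINE)


g = GlobalCache()
a, b = ClusterCache(g), ClusterCache(g)
a.local_store(0x40, 0, 11)          # cluster A writes word 0
b.local_store(0x40, 1, 22)          # cluster B writes word 1 of the same line
seen_by_b = b.local_load(0x40, 0)   # B cannot see A's store yet
a.writeback(0x40)
b.writeback(0x40)                   # only B's dirty word is merged
merged = g.read_line(0x40)
```

Had the second writeback replaced the whole line, word 0 would have reverted to its old value; merging only dirty words preserves both updates.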
2.3.2 Global Memory Operations
Global loads, stores, and atomics on Rigel always bypass core-level and
cluster caches and complete at the last-level global cache. Memory locations
operated on solely by global memory operations are trivially kept coherent
across the chip because they may be cached in only a single location. Global
operations are key to supporting synchronization, fine-grained inter-cluster
communication and data sharing, system software features, and resource
management. The cost of global memory operations is high compared to
local operations due to increased latency, reduced read and write bandwidth,
and contention on the shared global interconnect.
Rigel also implements a set of atomic operations (arithmetic, bitwise,
min/max, exchange) that complete at the global cache.
2.3.3 Coherence
A key design consideration for a 1024-core accelerator processor is what the
memory system should look like, including coherence. In analyzing the data
sharing and communication patterns of visual computing workloads, we ob-
serve that such patterns can be leveraged in the design of memory systems
for future manycore accelerators. Based on these insights, we developed
both software and hardware mechanisms to manage coherence on parallel
accelerator processors. We developed the Task-Centric Memory Model [5],
a software protocol which works in concert with hardware caches to main-
tain a coherent, single-address-space view of memory without the need for
hardware coherence. We then developed Cohesion [6] as a mechanism to
support hybrid coherence with both hardware and software-managed cache
coherence features, enabling multiple memory models in heterogeneous or
accelerator-based systems.
In the baseline Rigel architecture, software must enforce coherence when
inter-cluster read-write sharing exists. Coherence enforcement may be ac-
complished by colocating sharers within a single coherent cluster, by using
global memory accesses for shared data, or by forcing the writer to explicitly
flush shared data before allowing the reader to access it. Explicit instructions
for actions such as flushing and eviction are provided for cache management.
Colocating sharers is not always possible, and using global accesses for all
shared data is undesirable for performance reasons, because global cache
and interconnect bandwidth is more limited. Instead, we develop a software
algorithm to maintain coherence in a coarse-grained manner.
The cache coherence mechanism on Rigel is not implemented in hard-
ware, but instead exploits the sharing patterns present in accelerator work-
loads to enforce coherence in software. The sharing patterns present in our
target workloads allow Rigel to leverage local caches for storing output
write data between barriers before lazily making modifications globally
visible. Most data sharing in accelerator workloads occurs not between barriers
but across barriers. Lazy updates are safe as long as the coherence actions
for output write data have completed by the time a barrier is reached.
Rigel enables software management of cache coherence in two ways. First,
it provides instructions for explicit cluster cache management, including
cache flush and invalidate operations. Explicit cluster cache flushes update
the value at the global cache, but do not update or invalidate cached
copies that may be held by other clusters. Second, broadcast invalidation
and broadcast update operations allow software to implement data synchro-
nization and wakeup operations that rely on invalidation or update-based
coherence in conventional cache coherent designs.
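A small sketch makes the flush semantics concrete: a flush publishes the new value at the global cache, but a reader that already holds a copy keeps seeing the old value until software invalidates it. The `ToyCluster` class and its methods are hypothetical stand-ins for the cache management instructions, not the actual instruction set.

```python
class ToyCluster:
    """A cluster cache modeled as a dict of address -> value (illustrative)."""

    def __init__(self, global_cache):
        self.g = global_cache
        self.cache = {}
        self.dirty = set()

    def load(self, addr):
        if addr not in self.cache:           # miss: fetch from the global cache
            self.cache[addr] = self.g.get(addr, 0)
        return self.cache[addr]

    def store(self, addr, value):
        self.cache[addr] = value
        self.dirty.add(addr)

    def flush(self, addr):
        # Updates the global cache only; other clusters' copies are untouched.
        if addr in self.dirty:
            self.g[addr] = self.cache[addr]
            self.dirty.discard(addr)

    def invalidate(self, addr):
        self.cache.pop(addr, None)
        self.dirty.discard(addr)


global_cache = {0x100: 1}
writer, reader = ToyCluster(global_cache), ToyCluster(global_cache)

first = reader.load(0x100)     # reader caches the old value
writer.store(0x100, 2)
writer.flush(0x100)            # new value reaches the global cache
stale = reader.load(0x100)     # reader still hits its old cached copy
reader.invalidate(0x100)       # software discards the stale copy
fresh = reader.load(0x100)     # miss, refetched from the global cache
```

In this model the reader observes the update only after its own invalidate, which is the pairing of writer-side flush and reader-side invalidation that software must arrange.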
The topic of coherence is explored in more detail in Chapter 3.
2.4 Programming Rigel: The Rigel Task Model
Rigel is not restricted to running software written in a particular hardware-
specific paradigm, but instead has the ability to run standard C code. We
target Rigel using the LLVM compiler framework and a custom backend.
Rigel applications are developed using the Rigel Task Model (RTM),
a simple bulk-synchronous parallel (BSP) [13], task-based work distribution
library that we have developed. We implement task management primarily
in software using hierarchical queues, enabling flexibility in work distribution
and scheduling policies, and using minimal specialized hardware in the form
of atomics, global memory accesses, and broadcasts.
Applications are written in RTM using a SPMD execution paradigm where
all cores execute a single shared binary, but with arbitrary control flow per
core. The programmer defines parallel work units, referred to as tasks, that
are managed via queues by the RTM runtime. All threads may both enqueue
and dequeue work at any time.
We define an interval as the time between two global synchronization barri-
ers. During an interval, worker threads can both produce and consume work
units. There is no specified execution ordering for tasks within an interval.
RTM task queues act as barriers when empty to provide global synchroniza-
tion points. When a worker thread attempts to dequeue new work and finds
an empty queue, the thread continues to poll for additional work. When all
threads have reached this state and no additional work remains, a barrier
has been reached. The last thread to enter the barrier notifies the remaining
threads. For Rigel, barriers represent a point at which any locally cached
non-private data should be flushed and made globally coherent before the
start of a new interval. Figures 2.3 and 2.4 illustrate the BSP model we
implement along with our hierarchical task queues.
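The queue-as-barrier behavior can be sketched with ordinary threads. This is a minimal illustration under our own assumptions rather than the RTM implementation: it uses a single queue instead of the hierarchy, replaces polling with a condition variable, and all names (`TaskQueue`, `worker`, `make_task`, `run_interval`) are hypothetical.

```python
import threading
from collections import deque


class TaskQueue:
    """Single shared queue that doubles as a barrier when it runs dry."""

    def __init__(self, num_workers):
        self.tasks = deque()
        self.num_workers = num_workers
        self.idle = 0
        self.interval_done = False
        self.cv = threading.Condition()

    def enqueue(self, task):
        with self.cv:
            self.tasks.append(task)
            self.cv.notify()

    def dequeue(self):
        """Return a task, or None once the barrier has been reached."""
        with self.cv:
            self.idle += 1
            while not self.tasks and not self.interval_done:
                if self.idle == self.num_workers:
                    # Last thread to arrive: no work left, nobody running.
                    self.interval_done = True
                    self.cv.notify_all()
                else:
                    self.cv.wait()
            if self.tasks:
                self.idle -= 1
                return self.tasks.popleft()
            return None


def worker(q, results):
    while True:
        task = q.dequeue()
        if task is None:
            break  # barrier: on Rigel, non-private data is flushed here
        task(q, results)


def make_task(n):
    # Each task records its id and may produce one more task.
    def task(q, results):
        results.append(n)
        if n > 0:
            q.enqueue(make_task(n - 1))
    return task


def run_interval(initial_tasks, num_workers=4):
    q = TaskQueue(num_workers)
    results = []
    for t in initial_tasks:
        q.enqueue(t)
    threads = [threading.Thread(target=worker, args=(q, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


done = run_interval([make_task(3), make_task(2)])
```

The interval ends only when every worker is waiting on an empty queue, so tasks enqueued by running workers are always executed before the barrier is declared.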
[Figure 2.3 labels: Task Queue Hierarchy; Global Task Queue; Local Task Queues; cores]
Figure 2.3: The Rigel Task Model consists of hierarchical task queues.
Depending on the configuration, cores may produce tasks into either local
or global queues. Groups of tasks are removed from the global queue and
placed into local queues for faster access and less contention.
[Figure 2.4 labels: Time; Interval; Barrier; Task Execution; Communication; Execution; Idle Time]
Figure 2.4: The BSP execution model of the Rigel Task Model. An
interval is defined as the time between two barriers.
Component                 Area (mm²)   Share
Clusters                  207          67%
  Cluster Cache SRAM      75           23%
  Logic (Core+Cache)      112          35%
  Register Files          20           6%
GCache                    30           10%
Other Logic               30           9%
Overhead                  53           17%
Figure 2.5: Area estimates for the Rigel design.
2.5 Evaluation: Area and Power
To demonstrate feasibility of Rigel on current process technology, we pro-
vide area and power estimates on a commercial 45 nm process. Our estimates
are derived from synthesized Verilog, compiled SRAM arrays, IP compo-
nents, and die plot analysis of other 45 nm designs. Core pipeline area is
estimated from a functionally correct synthesized RTL model, including a
single-precision floating-point unit. SRAM estimates are based on macros
generated by a commercial memory compiler. Memory controller estimates
are based on die plot analysis.
Figure 2.5 shows a breakdown of area estimates for the Rigel design.
Cluster caches are 64 kB each and global cache banks total 4 MB. “Other
Logic” encompasses interconnect as well as memory controller and global
cache controller logic. For a conservative estimate, we include a 20% charge
for additional overheads to account for some underestimation. The resulting
320 mm² is reasonable for implementation in current process technologies,
and leaves space for additional SRAM cache, double-precision floating-point
units, or more aggressive memory controllers.
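The area total can be reproduced from the component values in Figure 2.5. The sketch below assumes that the 20% overhead charge is applied to the sum of the cluster, global cache, and other logic areas; under that assumption it yields the 53 mm² overhead and 320 mm² total shown in the figure. Variable names are ours.

```python
# Component areas in mm^2, taken from Figure 2.5.
clusters = {
    "cluster cache SRAM": 75,
    "logic (core+cache)": 112,
    "register files": 20,
}
gcache = 30
other_logic = 30

cluster_area = sum(clusters.values())
subtotal = cluster_area + gcache + other_logic

# Assumption: the 20% charge is taken on the subtotal and rounded to 1 mm^2.
overhead = round(0.20 * subtotal)
total = subtotal + overhead

print(cluster_area, subtotal, overhead, total)  # 207 267 53 320
```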
Typical power consumption of the design with realistic activity factors for
all components at 1.2 GHz is expected to be in the range of 99 W, though
peak power consumption beyond 100 W is possible at high utilization. Our
estimate is based on power consumption data for compiled SRAMs, post-synthesis
power reports for logic, leakage, and clock tree of cluster components,
estimates for interconnect and I/O pin power, and a 20% charge for additional
power overhead. The figure is similar to modern GPUs from NVIDIA,
which consume around 150 W [14], while modern high-end CPUs can consume
nearly as much.
Table 2.3: Description of our data- and task-parallel workloads
Benchmark  Description
cg         Conjugate gradient linear solver
dmm        Blocked dense matrix multiplication
fft        2D complex-to-complex radix-2 fast Fourier transform
gjk        Gilbert-Johnson-Keerthi 3D collision detection
heat       2D 5-point iterative, out-of-place stencil computation
kmeans     K-means clustering
march      Marching cubes polygonization of 3D volumetric data
mri        Magnetic resonance image reconstruction (FHD matrix)
sobel      Sobel edge detection
stencil    3D 7-point iterative, out-of-place stencil computation
2.6 Evaluation: Scalability
We evaluate Rigel based on a variety of parallel applications and kernels
drawn from visual and scientific computing. Most benchmarks are written
in a bulk-synchronous style using RTM for dynamic work distribution, but
stencil statically allocates work to threads. While all of our applications ex-
hibit abundant data parallelism, the structure varies from dense (dmm,sobel)
to sparse (cg) to irregular task-parallel (gjk) and includes diverse communi-
cation patterns (kmeans,fft,heat,stencil). Table 2.3 describes our set of
benchmark codes.
Figure 2.6 shows the kernel scalability for a variety of parallel applications
up to 1024 cores. Across our selection of benchmarks, we observe an average
speedup of 84× (harmonic mean) at 1024 cores compared to one eight-core
cluster (128× speedup is ideal).
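For reference, the sketch below shows how a harmonic-mean speedup and the corresponding parallel efficiency are computed. The per-benchmark speedups in the dictionary are made-up placeholders, not our measured results; only the 84× mean and the 128× ideal come from the text.

```python
from statistics import harmonic_mean

# Hypothetical per-benchmark speedups at 1024 cores (NOT measured data).
example_speedups = {"bench_a": 120.0, "bench_b": 96.0, "bench_c": 60.0}

values = list(example_speedups.values())
hmean = harmonic_mean(values)
amean = sum(values) / len(values)

# Ideal speedup of 1024 cores over one eight-core cluster.
ideal = 1024 / 8

# Efficiency implied by the reported 84x harmonic-mean speedup.
reported_efficiency = 84 / ideal

print(f"harmonic {hmean:.1f}x, arithmetic {amean:.1f}x, "
      f"efficiency {reported_efficiency:.1%}")
```

The harmonic mean is never larger than the arithmetic mean and is pulled toward the slowest-scaling benchmarks, which makes it a conservative summary of scalability.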
2.7 VLSI Implementation and Test Chip
The original goal of the Rigel project was to design and develop an architecture
that was implementable in contemporary fabrication process technology
and to implement a VLSI test chip. Such a chip would have enabled
architectural research ideas, along with the full software and application
stack, to be explored on real hardware. Such a test platform was
envisioned to enable interaction with domain experts in a more productive
relationship than severely performance-limited simulation allows.
[Figure 2.6 plot: y-axis, speedup over 8-core cluster (ticks from 2X to 128X); x-axis, 1, 2, 4, and 8 tiles for each of cg, dmm, fft, gjk, heat, kmeans, march, mri, sobel, and stencil]
Figure 2.6: Benchmark scalability on Rigel with 1, 2, 4, and 8 128-core
tiles (128, 256, 512, and 1024 cores). Speedups are relative to a single
eight-core cluster. 128× represents linear scaling at 1024 cores. Benchmark
binaries and datasets are identical across all system sizes. Global cache
resources and memory controllers (and thus memory bandwidth) are scaled
with the number of tiles.
While the original goal of a full-scale Rigel test vehicle was not achieved,
our fledgling RTL implementation was taped out in much smaller form as
a 90 nm multicore processor. Figure 2.7 illustrates the fabricated die, both
bare and packaged, and provides a scale comparison with a dime.
Though this test chip does not enable the original grand vision for a mas-
sively parallel hardware platform for architecture and software research, it
represents a significant achievement for the small number of students who
enabled the project. It also provides validation of the underlying design and
implementation that were used as the foundation of the area and power estimates
for the full Rigel design. Unfortunately, the older 90 nm process,
coupled with a design tuned for a newer process, means that actual area and
power data from the test chip are not very applicable to the intended 45 nm
design point. To fit within the constrained footprint, much smaller SRAMs
were selected for caches, and an on-chip main memory array was added due
to lack of proper memory controllers.
Table 2.4 enumerates some key parameters of the test chip. The test chip
implemented a pair of reduced Rigel clusters with two cores each for a total
of four cores. Each core has independent L1 instruction and data caches.
Figure 2.7: Bare and packaged test chip die. The bare die is shown next to
a dime for size comparison.
Table 2.4: Rigel test chip characteristics
Process Technology TSMC 90 nm
Die Area 3×32 mm
Frequency 1˜25 MHz
Power Supply 1.0 V
Power Consumption (estimated) 100 to 230 mW
Cores 4
SRAM 16 kB
[Figure 2.8 data: pipelines 44% total (pipeline 0 through pipeline 3 at 11% each); I/O port 25%; main memory 19%; cluster cache 0 6%; cluster cache 1 6%]
Figure 2.8: Area breakdown data for the test chip implementation.
Each cluster has a shared L2 cluster cache. For the test chip, an on-chip
SRAM array serves as main memory. Figures 2.8 and 2.9 break down the
estimated area and power for the test chip based on results from synthesis,
place and route.
Figure 2.10 shows a wire plot of the die. Figure 2.11 shows a die plot of the
test chip and labels key large blocks, including
cores and caches. The chip was taped out in TSMC 90 nm technology in 2012,
and as of this writing awaits bringup testing.
[Figure 2.9 data: pipelines 56% total (pipeline 0 through pipeline 3 at 14% each); I/O port 19%; main memory 10%; clock tree 7%; cluster cache 0 4%; cluster cache 1 4%]
Figure 2.9: Test chip power estimate breakdown.
Figure 2.10: Test chip die and wire plot.
Figure 2.11: Test chip floorplan.
CHAPTER 3
RIGEL MEMORY SYSTEM
Memory system considerations played a key role in the development of the
Rigel architecture and programming model. The nature of the Rigel ac-
celerator architecture led to the development of several novel elements of the
memory system.
A high-performance accelerator requires efficient mechanisms for safely
sharing data. In cached architectures, programmers must be able to safely
share data between multiple caches in a reliable and predictable manner.
Traditionally, hardware cache coherence mechanisms enable such sharing.
While scalable multiprocessor hardware coherence schemes exist, they were
envisioned for machines with drastically different mixes of computation, com-
munication, and storage resources than large scale chip multiprocessors such
as Rigel.
First, we consider the coherence requirements for accelerator systems and
workloads. We examine a variety of parallel workloads and observe common
data sharing and communication patterns. We show that for a variety of
accelerator workloads, regular and clearly defined data sharing patterns exist.
Our target applications exhibit more structured data sharing patterns than
those applications targeted by traditional systems.
These observations motivate our initial design goal for Rigel, to achieve
both good performance and ease of programmability without hardware co-
herent caches. We find that we can leverage these characteristics to move
the burden of coherence management from hardware to software. To this
end, we introduce the Task-Centric Memory Model [5, 11], a software pro-
tocol that enables the programmer to maintain coherence with incoherent
hardware caches.
While we show that coherence is not strictly required for good perfor-
mance or ease of programmability for accelerators such as Rigel, hardware
cache coherence nonetheless retains attractive benefits. We developed Co-
hesion [6, 12], a hybrid memory model to bridge hardware and software
coherent models.
Finally, we describe lazy atomic operations, an extension to TCMM that
takes advantage of incoherent hardware caches for improved atomic operation
bandwidth.
This chapter is organized as follows. Section 3.1 describes observations
on parallel application characteristics, including parallelism structure and
sharing patterns. Section 3.2 describes the Task-Centric Memory Model.
Section 3.3 describes hybrid hardware-software coherence with Cohesion.
Section 3.4 describes lazy atomic operations.
3.1 Parallel Application Characterization
Accelerators impose different constraints on caches and coherence manage-
ment than traditional general purpose chip multiprocessors. An opportunity
exists, driven by the characteristics of parallel accelerator workloads, to ex-
ploit these differing constraints. The design of the Rigel memory system is
influenced by the sharing and communication patterns of parallel workloads
targeted by accelerators. We consider two sets of applications developed for
two different platforms, x86/pthreads and Rigel/RTM. We find that struc-
tured, coarse-grained sharing is common and that most sharing takes place
across global synchronization points (barriers). We also find that fine-grained
data sharing between barriers is relatively uncommon.
Figure 3.1 illustrates the types of sharing patterns studied and defines the
terms we use here: input, output, private, and conflict. Figure 3.2 illustrates
the sharing patterns in a set of parallel x86 workloads, while Figure 3.3 il-
lustrates similar patterns in several workloads running on Rigel.
3.1.1 Parallelism Structure
We observe that the programming styles adopted by developers for parallel
accelerator applications often share a bulk synchronous structure [13]. Paral-
lel applications are composed of a collection of concurrently executing, mostly
unordered tasks operating on mostly data-parallel work units. We define an
interval as the time period between two barriers. We find that tasks share
minimal data within intervals. Instead, at barriers, modified state is made
globally visible for the next interval of computation.
[Figure 3.1 labels: Barrier; input; output; private; conflicts]
Figure 3.1: We define an interval as the time between two barriers. Each
arrow represents an independent task. Input reads and output writes
communicate data across barriers. Conflict accesses share data between
two tasks in the same barrier interval. Reads and writes private to a task
include stack accesses and temporary data structures.
Updates made by a task within an interval can generally only be assumed
to be visible after the current interval has completed. Sharing modified data
within an interval requires explicit synchronization by the programmer. In
a barrier-synchronized, mostly-data-parallel, task-based shared-memory pro-
gramming model, coherence management is required to enable sharing; how-
ever, the mechanisms found in conventional CMP architectures to support
arbitrary sharing through cache coherence are of marginal utility.
We observe that popular programming models used for developing large-
scale data-parallel applications do not depend on the hardware support for
arbitrary sharing provided by conventional systems. However, a mechanism
for enabling some data, such as task queues or shared data structures, to be
shared is required. In addition, the common structure present in these parallel
applications is rooted in the programmer’s attempt to create scalable code
in a manner that is conceptually simple; thus there is minimal sharing.
3.1.2 Sharing Patterns
Our next observation is that emerging applications targeting accelerator systems
have common data sharing and synchronization characteristics that can
guide the design of future accelerator architectures. We provide analysis of a
set of parallel visual computing workloads from VISBench [15] and from the
Rigel kernel benchmark suite. VISBench consists of a set of full applications
that we run on the x86 platform. The Rigel benchmarks conjugate gradient
solver (CG), Sobel edge detection, k-means clustering, and dense matrix multiply
(DMM) were written by hand and optimized for the Rigel architecture.
GJK collision detection was ported from a freely available sequential version.
Heat is ported from the Cilk [16] benchmark with optimizations applied for
Rigel. The MRI benchmark is a port of the VISBench medical imaging code.
Figure 3.2: Read and write sharing between independent tasks in the
VISBench suite [15].
[Figure 3.3 plot: Write and Read bars on a 0% to 100% axis for CG, DMM, GJK, HEAT, KMEANS, MRI, and SOBEL; categories: Output, Conflict, Private, Input]
Figure 3.3: Characterization of memory accesses in task-based BSP
applications. Input reads and output writes communicate data across
barriers. The majority of memory accesses are to data that is private to a
task. Conflict accesses share data between two tasks in the same barrier
interval, requiring hardware coherence or synchronization mechanisms such
as atomic operations to maintain correctness, but are rare in our
applications.
Analysis of these workloads shows similar data sharing and synchronization
patterns. Specifically, we investigate the sharing patterns of our workloads
with respect to synchronization boundaries. Figure 3.1 illustrates the types
of sharing patterns studied and defines the terms we use here: input,
output, private, and conflict. Figure 3.2 shows the number of unique memory
references that are shared across intervals, marked as input and output, and
within an interval, marked as conflict, for VISBench applications. Figure 3.3
shows the same for the Rigel benchmark suite. Note that the results for MRI
differ due to register spilling on x86, resulting in more private reads and writes
(to the stack) on x86 compared to Rigel. We exclude work distribution
related sharing from results to highlight application-level characteristics.
Figures 3.2 and 3.3 show the frequency of non-private loads and stores,
which are data produced by one task and consumed by one or more other
tasks. Non-private accesses are further broken down by whether values are
shared between tasks within an interval, which we call conflict reads and
writes, or across intervals, which we call input reads and output writes. The
figures show that the majority of non-private loads are to data produced
before the current interval began (input reads). At the same time, both
conflict reads and writes to data shared within an interval are rare. Output
writes, which are writes by a task in the current interval consumed by one or
more tasks in the next interval, are more common in real applications than
true shared writes, which require intra-interval synchronization; even so,
they constitute a small fraction of overall execution. Also note that the
number of unique output writes is much smaller than the number of input
reads in the figure due to one-to-many sharing across intervals.
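The four access categories can be illustrated with a simplified classifier over a memory trace. This is a sketch of the definitions above, not the tool used to produce Figures 3.2 and 3.3; the trace format, the `classify` function, and the rule that an address touched by only one task is private are our own simplifying assumptions.

```python
from collections import defaultdict


def classify(trace):
    """Label each access as private, input, output, or conflict.

    trace: list of (interval, task, op, addr) with op in {"R", "W"}.
    Task ids are assumed to be unique across intervals.
    """
    tasks_by_addr = defaultdict(set)   # addr -> every task that touches it
    touch = defaultdict(set)           # (interval, addr) -> tasks touching
    writers = defaultdict(set)         # (interval, addr) -> tasks writing
    for interval, task, op, addr in trace:
        tasks_by_addr[addr].add(task)
        touch[(interval, addr)].add(task)
        if op == "W":
            writers[(interval, addr)].add(task)

    labels = []
    for interval, task, op, addr in trace:
        key = (interval, addr)
        if tasks_by_addr[addr] == {task}:
            labels.append("private")
        elif op == "R":
            # Another task writing in the same interval makes it a conflict.
            labels.append("conflict" if writers[key] - {task} else "input")
        else:
            # Another task touching it in the same interval is a conflict.
            labels.append("conflict" if touch[key] - {task} else "output")
    return labels


trace = [
    (0, "t0", "W", "x"),      # consumed by t2 in the next interval
    (0, "t0", "W", "tmp0"),   # only t0 ever touches it
    (0, "t1", "W", "hist"),   # t0 and t1 both update it in interval 0
    (0, "t0", "W", "hist"),
    (1, "t2", "R", "x"),      # produced before interval 1 began
]
labels = classify(trace)
```

For this trace the classifier labels the accesses output, private, conflict, conflict, and input, in that order.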
3.1.3 Accelerator Workload Characteristics
We observe five common characteristics in accelerator workloads:
1. Large amounts of immutable, read-shared data are present within an
interval. Examples of read-shared data from our workloads include
input images and video or scene and model description data.
2. Synchronization is coarse grained. This in turn motivates our investi-
gation of bulk coherence management at task boundaries. Indicative
of this pattern are output writes and corresponding input reads in Fig-
ure 3.2 and Figure 3.3, which demonstrate that modified data is often
read by a task after the interval in which the data was written has
ended.
3. There exist only small amounts of write-shared data within an interval,
which indicates that tasks are highly data-parallel with few data
dependences between tasks within an interval. The lack of data depen-
dences is illustrated in Figure 3.2 and Figure 3.3 as a lack of conflict
reads and writes. The conflicts that do exist are structured, such as
the histogramming operations in k-means and the reduction operations
in CG.
4. Fine-grained synchronization is present but rare. An example of such
synchronization is atomic updates to shared data structures. We ob-
serve that much of the fine-grained synchronization that we do find is
used for task management and not for application code.
5. When write sharing within an interval does exist, it is usually between
few sharers.
Collectively, these characteristics demonstrate that little coherence man-
agement is required within an interval, indicating the potential for pushing
coherence management into software to be logically performed at the end of
an interval. At the same time, mechanisms must be present to enable small
amounts of fine-grained synchronization and data sharing within an interval
and to support task management efficiently. Our findings further motivate
the use of shared caches that can amortize the costs associated with data ac-
cess to read-shared data, a prevalent access pattern in our target workloads.
3.2 Software-managed Coherence with the
Task-Centric Memory Model
3.2.1 The Task-Centric Memory Model
Adopting a structured programming model enables us to implement software-
managed cache coherence efficiently. We developed the Task-Centric Memory
Model (TCMM) [5] as a contract describing the software actions necessary to
ensure correctness in task-based BSP programs in the absence of hardware-
enforced coherence.
Figure 3.4 illustrates the state transitions for a cacheline in our protocol.
All blocks start in the clean state with no sharers or cached copies and
may transition to immutable (read-only), shared as globally coherent,
or private. State is implicit and must be tracked by the programmer.
Cached local memory operations may operate on private or immutable data,
whereas uncached global operations are required for globally coherent
data. Transitioning data between states requires first moving through the
clean state.
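Because the state is implicit, a debugging aid that tracks it explicitly can catch misuse. The sketch below is a hypothetical checker, not part of TCMM itself: the programmer declares the state of each block, and the checker enforces only the two rules stated above (which operations each state permits, and that transitions pass through clean). The exact edge set of Figure 3.4 is not reproduced.

```python
# Operations permitted in each state, following the rules in the text.
ALLOWED_OPS = {
    "clean": set(),
    "immutable": {"L.LD"},
    "private": {"L.LD", "L.ST"},
    "globally_coherent": {"G.LD", "G.ST"},
}


class TcmmTracker:
    """Hypothetical software checker for programmer-declared block states."""

    def __init__(self):
        self.state = {}  # block -> state name; missing means "clean"

    def transition(self, block, new_state):
        if new_state not in ALLOWED_OPS:
            raise ValueError(f"unknown state {new_state}")
        old = self.state.get(block, "clean")
        if old != "clean" and new_state != "clean":
            raise ValueError(
                f"{block}: {old} -> {new_state} must pass through clean")
        self.state[block] = new_state

    def access(self, block, op):
        state = self.state.get(block, "clean")
        if op not in ALLOWED_OPS[state]:
            raise ValueError(f"{block}: {op} not permitted in state {state}")


t = TcmmTracker()
t.transition("buf", "private")
t.access("buf", "L.ST")            # cached local store to private data

try:
    t.access("buf", "G.LD")        # global op on private data is flagged
    misuse_caught = False
except ValueError:
    misuse_caught = True

t.transition("buf", "clean")       # on Rigel: write back and invalidate first
t.transition("buf", "globally_coherent")
t.access("buf", "G.LD")
```

In a real program the return to clean corresponds to the writeback and invalidate actions; here it is only a bookkeeping step.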
Software cache coherence requires minimal hardware support in the form
of instructions for explicitly writing back and invalidating data in private
caches. We found that a small number of additional hardware mechanisms,
such as broadcast support to accelerate global barriers and global atomic
operations to facilitate infrequent intra-barrier sharing, greatly improved
scalability over a naïve design. With these relatively inexpensive mechanisms,
TCMM was able to achieve performance within a few percent of idealized
hardware coherence at 1024 cores. Future accelerators may improve upon
TCMM by automating coherence actions in the compiler and scheduling co-
herence actions to maximize cache locality.
3.2.2 Performance Evaluation
A simple implementation of TCMM that strictly adheres to the actions as
described potentially requires large numbers of writebacks and invalidates at
task boundaries. This flurry of traffic at task boundaries can cause congestion
in the memory system and interfere with tasks performing useful work.
At the same time, read-only data may be needlessly flushed, reducing the
effectiveness of caching for read-shared immutable data.
[Figure 3.4 labels: states Clean, Immutable, Globally Coherent, Private (Clean), and Private (Dirty); edge labels L.LD, L.ST, G.LD, G.ST, WB, INV, and ε]
Figure 3.4: State transitions for a cacheline under control of the
Task-Centric Memory Model protocol. G indicates global operations, L local
operations, WB writebacks, INV invalidates, ε no sharers, LD loads, and ST
stores.
[Figure 3.5 plot: y-axis, speedup vs. ideal HW coherence (0.0x to 1.0x); x-axis, cg, dmm, heat, kmeans, mri, sobel; series LILW, LIEW, EILW, EIEW]
Figure 3.5: Performance of TCMM for four policy combinations of
eager/lazy and invalidation/writeback relative to ideal (zero-cost) hardware
coherence.
Coherence actions are not strictly required to be issued at task boundaries;
they can be deferred until the end of an interval to improve
performance depending on task behavior. We define lazy actions to take place
when a barrier is reached, while eager actions take place immediately upon
task completion. We consider combinations of both eager and lazy writebacks
and invalidates. For different applications or classes of data, different policies
may be optimal.
Figure 3.5 compares performance of each policy selection for TCMM to an
idealized baseline of zero-cost hardware coherence. The results show eager in-
validate/eager writeback (EIEW), lazy invalidate/eager writeback (LIEW),
eager invalidate/lazy writeback (EILW), and lazy invalidate/lazy writeback
(LILW) relative to the optimistic baseline.
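The difference between eager and lazy actions can be sketched as a small behavioral model. This is only an illustration under stated assumptions, not the Rigel runtime: the `Policy` and `CoherenceScheduler` names, the integer line addresses, and the event log are all hypothetical and introduced here for clarity.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Policy:
    eager_invalidate: bool
    eager_writeback: bool


# The four policy combinations named in the text.
EIEW = Policy(eager_invalidate=True, eager_writeback=True)
LIEW = Policy(eager_invalidate=False, eager_writeback=True)
EILW = Policy(eager_invalidate=True, eager_writeback=False)
LILW = Policy(eager_invalidate=False, eager_writeback=False)


@dataclass
class CoherenceScheduler:
    """Hypothetical model: decides when each coherence action is issued."""
    policy: Policy
    pending: list = field(default_factory=list)  # actions deferred to the barrier
    issued: list = field(default_factory=list)   # log of (when, action, line)

    def task_complete(self, writeback_lines, invalidate_lines):
        actions = [("WB", ln, self.policy.eager_writeback) for ln in writeback_lines]
        actions += [("INV", ln, self.policy.eager_invalidate) for ln in invalidate_lines]
        for action, line, eager in actions:
            if eager:
                # Eager: issue immediately upon task completion.
                self.issued.append(("task_end", action, line))
            else:
                # Lazy: hold the action until the next barrier.
                self.pending.append((action, line))

    def barrier(self):
        for action, line in self.pending:
            self.issued.append(("barrier", action, line))
        self.pending.clear()
```

Under `LIEW`, for example, a completed task's writebacks appear in the log at `task_end`, while its invalidates appear only after `barrier()` is called.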
We see that different policy combinations may be optimal for each appli-
cation, highlighting the power of flexibility in software. Applications may
be tuned and select a policy optimized for their memory access patterns,
rather than relying upon a hardware default. Only MRI suffers a penalty of
greater than 10% vs. the idealized baseline. Eager writebacks overlap write
traffic with task execution, improving memory system utilization, as can be
seen with dmm. At the same time, lazy invalidation for read-shared data can
improve cache effectiveness for data that is read again by other tasks.
The TCMM protocol is described in more detail along with additional
performance analysis in [5].
3.3 Hybrid Coherence with Cohesion
Memory models in use today are either fully hardware-coherent or fully
software-coherent. In systems that include both models, the two models
are strictly separated by using disjoint address spaces or physical memories.
As systems-on-chip (SoCs) and other heterogeneous platforms become more
prevalent, the ability to seamlessly manage data across different memory
models will become increasingly important.
Software-managed cache coherence (SWcc) removes the area, power, and
interconnect traffic overhead of cache coherence for structured data sharing
patterns and allows experienced application developers to achieve high per-
formance. Hardware coherence (HWcc) avoids the instruction overhead of
software coherence, performs well with unstructured sharing patterns, and
provides correct data sharing with low programmer effort. To achieve the
combined benefits of these two models, we have developed Cohesion, a
hybrid memory model.
Figure 3.6 illustrates the high-level operation of Cohesion. Cohesion in-
cludes a hardware coherence implementation which tracks the entire address
space by default. The developer can selectively remove cache lines from the
hardware coherent domain at runtime and manage them using software to
improve performance. Because data can move back and forth between the
software coherent and hardware coherent domains at will, Cohesion can be
used to dynamically adapt to the sharing needs of applications and runtimes
and does not require multiple address spaces or explicit copy operations.
Cohesion can also enable the integration of multiple memory models in
heterogeneous or accelerator-based systems such as SoCs. We implement
software-managed coherence using TCMM and use an MSI-based hardware
coherence protocol, but any hardware and software protocols may be used
so long as the necessary state transitions are enforced.
Figure 3.7 illustrates the hardware structures that implement Cohesion.
A directory tracks shared words in the HWcc domain. The coarse-grain region
[Figure 3.6: diagram of the address space (0x100 to 0x1C0 in steps of 0x20) versus time (T=0 to T=4), marking SWcc cache lines, HWcc cache lines, and transitions between the SW-managed and HW-managed coherence protocols, alongside a TCMM state-transition diagram.]
Figure 3.6: Cohesion is a hybrid memory model for accelerators that
enables hardware and software-managed coherence to coexist, allowing data
to migrate between the two domains dynamically. Left: State transitions
for a cacheline under control of the Task-Centric Memory Model protocol.
G indicates global operations, L local operations, WB writebacks, INV
invalidates, ε no sharers, LD loads, and ST stores.
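[Figure 3.7: block diagram with three structures: Sparse Directory, Coarse-grain Region Table, and Fine-grain Region Table. Labels recovered from the figure: sets set0 to setn-1 and ways w0 to wm-1 with sharers, tag, and I/M/S fields; start_addr, size, and valid fields covering code segment, stack segment, and global data; base_addr and coherence bit vectors (1 bit/line), 16 MB table per 4 GB memory; annotations "one per L3 bank", "strided across L3 banks", and "global table".]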
Figure 3.7: Cohesion hardware architecture.
table tracks large regions of memory within the SWcc domain, such as code,
an incoherent heap, or private stack space. Finally, the fine-grain region table
tracks the rest of memory that may transition dynamically between the two
domains. Each L2 cache keeps a per-line incoherent bit to track lines which
are not under the control of hardware coherence.
When a request arrives, the directory is first queried to determine if the
line is under hardware management. If so, the access is handled as is typical
under the hardware coherence protocol. If the line is not present in the
directory, the region tables are queried next. If the address maps to the
coarse-grain region table, the access proceeds without intervention from the
hardware coherence protocol, and the incoherent bit is set on the response.
If the address does not map to either the directory or the coarse-grain region
table, the fine-grain region table is checked. The fine-grain region table is a
[Figure 3.8: three panels labeled (A), (B), and (C). Bar chart: average number of directory entries allocated (0K to 300K) under Cohesion and HWcc for CG, DMM, GJK, HEAT, KMEANS, MRI, SOBEL, STENCIL, and the mean, broken down into Code, Heap/Global, and Stack, with the maximum allocated marked. Two line charts: slowdown normalized to infinite entries (0.0x to 8.0x) versus directory entries per L3 cache bank (256 to 16384) for cg, dmm, gjk, heat, kmeans, mri, sobel, and stencil.]
Figure 3.8: Performance versus directory cache size for (A) HWcc alone and
(B) Cohesion, which amplifies effective directory size by moving some
data from HWcc to SWcc.
cached data structure that stores one state bit per cacheline which indicates
membership in either HWcc or SWcc. If the line is under SWcc, an incoherent
access is initiated. If the line is under HWcc, a directory entry is allocated
for hardware tracking and a coherent access is initiated.
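The lookup order described above (directory, then coarse-grain region table, then fine-grain region table) can be sketched as follows. This is a simplified behavioral model, not the hardware implementation: the class and method names are hypothetical, the tables are modeled as Python sets and lists, and the 32-byte line size is an assumption based on the 256-bit cache lines mentioned elsewhere in the text.

```python
from enum import Enum

LINE_BYTES = 32  # assumption: 256-bit cache lines


class Domain(Enum):
    HWCC = "hardware-coherent"
    SWCC = "software-coherent"


class CohesionModel:
    """Hypothetical model of the request-handling flow at the directory."""

    def __init__(self, coarse_regions, swcc_lines):
        self.directory = set()                 # lines that have a directory entry
        self.coarse_regions = coarse_regions   # list of (start_addr, size) under SWcc
        self.fine_swcc = set(swcc_lines)       # lines whose fine-grain bit says SWcc

    def lookup(self, addr):
        line = addr // LINE_BYTES
        # 1. Directory hit: the line is already under hardware management.
        if line in self.directory:
            return Domain.HWCC, "directory hit"
        # 2. Coarse-grain region table: large SWcc regions (code, stack, ...).
        for start, size in self.coarse_regions:
            if start <= addr < start + size:
                return Domain.SWCC, "coarse-grain region"
        # 3. Fine-grain region table: one state bit per cache line.
        if line in self.fine_swcc:
            return Domain.SWCC, "fine-grain table"
        # Otherwise the line is HWcc: allocate a directory entry to track it.
        self.directory.add(line)
        return Domain.HWCC, "directory entry allocated"
```

In this sketch, an SWcc result corresponds to an incoherent access with the incoherent bit set on the response, and an HWcc result corresponds to a coherent access handled by the hardware protocol.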
A developer may instruct the hardware coherence machinery to defer to
software management for a particular cache line by updating a software-
accessible table in memory. For instance, hardware coherence management
is inefficient when data is private or when a large amount of data can be man-
aged as a unit by software. Handling read-mostly and private data outside
the scope of the hardware coherence protocol can increase performance and
reduce the load on the coherence hardware, increasing the effective directory
size for data managed under hardware coherence, as seen in Figure 3.8.
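As a rough illustration of the software-accessible table, the sketch below models a fine-grain region table as a bit vector with one bit per cache line. The class name, the method names, and the bit polarity (1 meaning SWcc) are assumptions made for this example; the 32-byte line size follows from the 256-bit cache lines mentioned elsewhere in the text. A real transition must also enforce the required coherence state changes, which this model omits.

```python
LINE_BYTES = 32  # assumption: 256-bit cache lines


def table_bytes(memory_bytes, line_bytes=LINE_BYTES):
    """Size in bytes of a 1-bit-per-line table covering `memory_bytes`."""
    return memory_bytes // line_bytes // 8


class FineGrainTable:
    """Hypothetical in-memory table: bit = 1 means the line is under SWcc."""

    def __init__(self, memory_bytes):
        # All zeros: every line starts under hardware coherence.
        self.bits = bytearray(table_bytes(memory_bytes))

    def _locate(self, addr):
        line = addr // LINE_BYTES
        return line // 8, 1 << (line % 8)

    def set_swcc(self, addr):
        byte, mask = self._locate(addr)
        self.bits[byte] |= mask

    def set_hwcc(self, addr):
        byte, mask = self._locate(addr)
        self.bits[byte] &= ~mask & 0xFF

    def is_swcc(self, addr):
        byte, mask = self._locate(addr)
        return bool(self.bits[byte] & mask)
```

With these assumptions, `table_bytes(4 * 2**30)` evaluates to 16 MiB, which agrees with the 16 MB table per 4 GB of memory annotated in Figure 3.7.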
Ultimately, Cohesion allows explicit coherence management to be an op-
tional optimization opportunity, rather than necessary for correctness. Co-
hesion is explored in greater detail in [6, 12].
3.4 Lazy Atomic Operations
We have considered the need for cache coherence in accelerator architectures
and described efficient mechanisms for managing coherence in software with
incoherent hardware caches via the Task-Centric Memory Model. While the
TCMM work was focused on enforcing coherence, we now consider taking
advantage of incoherent hardware caches in a novel way to improve perfor-
mance. We introduce lazy atomic operations, a hardware mechanism targeted
at throughput processors that leverages incoherent hardware caches to trans-
parently improve throughput for order-independent atomic operations while
simultaneously improving the ease of programming. This work was originally
published in [17].
Both GPUs and CPUs implement atomic operations, popular in throughput-
oriented GPGPU computing with languages such as CUDA [18] and Open-
CL [19]. Atomic operations are important in parallel codes for massively
parallel accelerators such as Rigel or GPUs and have a wide variety of uses
in parallel applications. Algorithms in scientific computing, data mining, and
image processing make use of atomic updates, including k-means clustering,
principal component analysis (PCA), sorting, convex hull, histogramming,
two-point angular correlation function (TPACF), and even SQL databases.
In a parallel system, throughput of atomic operations is limited due to the
serializing nature of atomic operations and limited atomic hardware units.
Global atomics are especially easy to use for programmers, but suffer from
orders of magnitude lower throughput and higher latency than local opera-
tions. While local atomic operations may be available, using them can require
more complex multi-phase algorithm implementations or replication of data
structures for partial or temporary results.
There are a variety of ways to implement atomic updates. Figures 3.9, 3.10,
and 3.11 illustrate a progression of methods for performing atomic updates.
Figure 3.9 illustrates a software-based lock approach. Figure 3.10 illustrates
a hardware-accelerated approach using load-link and store-conditional pairs.
Figure 3.11 illustrates the advantages of global atomic operations over the
first two approaches.
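To make the progression concrete, the following back-of-the-envelope sketch estimates the time to apply a stream of serialized updates under each method. The pipeline formula is a simplifying assumption of this example, and the cycle counts are the lower bounds of the ranges annotated in Figures 3.9 to 3.11; contention and replay are ignored.

```python
def cycles_for_updates(n, latency, interval):
    """Assumed model: the first update completes after `latency` cycles,
    and each later update completes `interval` cycles after the previous one."""
    return latency + (n - 1) * interval


# (latency, interval) in cycles, taken from the lower bounds in the figures.
METHODS = {
    "lock-based": (100, 100),
    "LL/SC": (60, 60),
    "global atomics": (42, 1),
}


def estimate(n):
    return {name: cycles_for_updates(n, lat, ivl)
            for name, (lat, ivl) in METHODS.items()}
```

For 1000 updates, this model gives 100,000 cycles for locks, 60,000 cycles for LL/SC, and 1,041 cycles for global atomics, illustrating why dedicated atomic units help throughput far more than latency.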
3.4.1 Reductions
Reductions are an important part of many parallel applications [20]. Com-
monly, parallel reductions are implemented with a collection of partial results
that must be further reduced in a tree-structured fashion. Replication with
[Figure 3.9: diagram of lock-based updates between a core and the global cache: (1) acquire lock, (2) read data, (3) write data, (4) release lock. A separate lock protects the data. Annotations: 20 cycles; latency = 100-140 cycles; throughput = 1 per 100-140 cycles.]
Figure 3.9: Lock-based updates. First a lock must be acquired, followed by
a read-modify-write to the data of interest, and finally a release of the lock.
Depending on contention for locks and memory access latencies, this takes
on the order of 100s of cycles, limiting throughput to one update every few
hundred cycles.
[Figure 3.10: diagram of load-link/store-conditional (LL/SC) updates between a core and the global cache: (1) LL, (2) SC. A hardware-managed lock bit detects conflicts and forces replay. Annotations: 20 cycles; latency = 60-80 cycles; throughput = 1 per 60-80 cycles.]
Figure 3.10: Load-linked store-conditional (LL/SC) based updates. A
hardware-managed lock bit detects conflicting accesses to a locked address,
forcing a replay in software when a conflict occurs. In the uncontended case
(hopefully the common case), no replay is required. No lock is required,
reducing latency and improving throughput relative to the lock-based
approach.
[Figure 3.11: diagram of global atomics between a core and the global cache: (1) request, (2) op, (3) reply. Dedicated functional units at the global cache perform the operation. Operations to a given location are still serialized, and contention can limit throughput. Annotations: 20 cycles; 1 cycle; latency = 42 cycles; throughput = 1 per 1-2 cycles.]
Figure 3.11: Global atomics based updates. Requests are sent to be
handled remotely by a dedicated atomic unit, often at the last-level cache.
While the latency of an individual global atomic request may be similar to
that of a last-level cache access, the throughput of such atomic operations
can be as high as one operation per cycle. However, chip-level throughput
is ultimately limited by the number of atomic units available. Frequently,
as in the case of both Rigel and GPUs, this number will be much smaller
than the aggregate throughput of the execution units.
partial results increases storage overhead while reducing contention and se-
rialization at atomic units.
Many reduction operations are order-independent. That is, they are asso-
ciative ((x ∗ y) ∗ z = x ∗ (y ∗ z)) and commutative (x ∗ y = y ∗ x) where ∗
represents a binary operation. For instance, the result of a dot-product, a
common reduction used in linear algebra and 3D graphics, is independent of
the summation order. Similarly, histogram computation is concerned only
with the final output totals for each bin and not intermediate sums or the
ordering of individual updates. More generally, a class of associative algo-
rithms exhibits similar order-independent properties.
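The order-independence property can be checked with a short sketch. The helper name `partitioned_reduce` is hypothetical; it splits the input into several partial reductions and then combines the partials, mimicking a replicated reduction. Integers are used deliberately, because floating-point addition is only approximately associative.

```python
from functools import reduce
import operator


def partitioned_reduce(values, op, identity, num_partials):
    """Reduce `values` in `num_partials` interleaved groups, then combine."""
    partials = [reduce(op, values[i::num_partials], identity)
                for i in range(num_partials)]
    return reduce(op, partials, identity)


def is_order_independent_on(values, op, identity, num_partials):
    """Compare a partitioned reduction against a simple left-to-right one."""
    sequential = reduce(op, values, identity)
    return partitioned_reduce(values, op, identity, num_partials) == sequential
```

Addition, bitwise OR, and XOR give the same answer for any grouping or ordering, whereas an operation such as subtraction does not, so it is not a candidate for this style of reduction.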
3.4.2 Example: Histogram
Consider the example of computing a histogram in parallel. A histogram
computes a reduction operation across a set of data by incrementing a bin
counter for each input element based on the element’s value:
foreach element e in E {
    Histogram[e]++;
}
Summing the number of elements in each bin is associative and commu-
tative. Simple histogramming performs very little computation per data
element (essentially none), resulting in high bandwidth requirements for bin
updates as well as potential contention for bins, depending on the sparsity
of the data set. Each bin update must be performed atomically so that
no updates are lost.
A variety of different approaches can be considered for optimizing the
deceptively simple operation of histogramming, each applying to specific sets
of circumstances influenced by the underlying data. Contention for bins can
be reduced by computing a private partial reduction per thread and merging
the partial reductions in one or more subsequent software passes. If the
selection of bins is small, it may be possible for threads (or groups of threads)
to keep complete sets of partial sums, but this requires additional levels of
reduction to compute the final sums. If the number of bins is large, keeping
a dense copy of partial sums for each thread (or local group of threads)
is impractical. Keeping a hashed sparse subset of bins is possible, but is
not useful if individual bins are infrequently accessed. Multiple passes may
be made over the dataset to allow a subset of bins to be cached over each
pass. It is possible that ranges of the histogram bin domain may be divided
among tasks, where a task updates only the subset of bins (or bin copies) for
which it is responsible, but this can be an inefficient use of memory bandwidth.
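The first of these approaches, a private partial reduction per thread followed by a merge pass, can be sketched as below. Threads are modeled sequentially for clarity, the function names are hypothetical, and a dense copy of the bins is kept per thread, which is exactly the storage cost the text warns about when the number of bins is large.

```python
def partial_histograms(elements, num_bins, num_threads):
    """Each modeled thread accumulates into its own dense, private copy."""
    partials = [[0] * num_bins for _ in range(num_threads)]
    for i, e in enumerate(elements):
        # Element i is handled by thread i % num_threads; no atomics needed
        # because no other thread touches this copy.
        partials[i % num_threads][e] += 1
    return partials


def merge(partials):
    """Second software pass: sum the private copies bin by bin."""
    return [sum(column) for column in zip(*partials)]


def storage_counters(num_bins, num_threads):
    """Replication cost: one full set of bins per thread."""
    return num_bins * num_threads
```

The merged result matches a direct histogram of the data, at the price of `num_bins * num_threads` counters and an extra reduction pass.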
3.4.3 Lazy Atomic Operation Implementation
We propose the use of lazy atomic operations, which allow software to specify
an atomic update action to be applied at the last-level cache when a cache
line is removed (evicted or flushed) from a local cache. Lazy atomics allow
updates to shared locations to be distributed across local caches, reducing
contention and increasing throughput. All copies of a given cache line op-
erated on by lazy atomics are incoherent and reside at the same address,
reducing cache footprint and programmer effort relative to replication and
multi-pass approaches.
As described, lazy atomics make use of multiple, incoherent cached copies
of a single memory location. While throughput-oriented architectures such as
Rigel or contemporary GPUs are not cache coherent today, this mechanism
[Figure 3.12: block diagram. Groups of processing elements (PEs) each share a cache (Shared $); an interconnect links the shared caches to a banked last-level cache with ALUs. Cache line fields: V, Tag, Data[255:0], Dirty[7:0], and Atom[2:0], grouped as cache lines and tags plus atomics metadata.]
Figure 3.12: Block diagram of high-level manycore processor architecture
with global atomic units at last-level cache and augmented cache line state.
can be supported in coherent systems by simply removing such locations
from the control of the hardware coherence protocol. Figure 3.12 provides
an overview of the high-level organization of a manycore accelerator that
supports global atomic operations. Conceptually, this is similar to both
Rigel and GPUs.
To take advantage of lazy atomic operations, values with the appropriate
properties are identified by software (at the direction of either the program-
mer or a smart compiler) and marked with additional bits in the cache. For
simplicity of implementation, all words in a given cache line are tagged with
the same atomic update action. Our baseline design includes per-word dirty
bits, which track which words within a cache line have been modified and
allow global atomic units to skip unnecessary updates for clean
words.
We use a short bit-vector per cache line to indicate which, if any, up-
date action should be taken. We can support add, inc, min, max, and, or,
and xor with three additional bits per cache line, where 000 indicates no
atomic action. Additional compatible atomic operations can be supported
with additional bits. Alternatively, coarser-grained software configuration
could establish a range of addresses for which special properties exist, as
proposed in [21], similar to a page table entry.
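One possible encoding of the per-line atomic field is sketched below. Only the value 000 for "no atomic action" comes from the text; the remaining code assignments, the names, and the packed layout that places the Dirty[7:0] mask next to the Atom[2:0] field are assumptions made for illustration.

```python
from enum import IntEnum


class AtomOp(IntEnum):
    NONE = 0b000  # from the text: 000 indicates no atomic action
    ADD = 0b001   # the remaining code assignments are hypothetical
    INC = 0b010
    MIN = 0b011
    MAX = 0b100
    AND = 0b101
    OR = 0b110
    XOR = 0b111


def pack_metadata(dirty_mask, op):
    """Pack per-word dirty bits (bits 10:3) and the atomic op (bits 2:0)."""
    return ((dirty_mask & 0xFF) << 3) | int(op)


def unpack_metadata(meta):
    """Recover (dirty_mask, op) from a packed metadata word."""
    return (meta >> 3) & 0xFF, AtomOp(meta & 0b111)
```

Seven operations plus the "no action" value fill all eight encodings of a three-bit field, which is why supporting further operations would require additional bits.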
Figure 3.13 illustrates the control and data flow patterns for our imple-
mentation of lazy atomic operations. If a marked location is present in a
local cache when an atomic update is issued, the action is performed locally,
and the value updated locally in the cache. In Figure 3.12, this would be
the shared L2 cache. If the specified location is not cached locally, the line
is allocated in the cache but not filled with the latest global copy, ensuring
no duplicate updates. Instead, the newly allocated line is initialized with
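[Figure 3.13: diagram of lazy atomics among a core, its local cache, and the global cache with atomic functional units. Steps: (1) LLL.op/SC, (2) LLL.op/SC, performing the data operation locally; (3) writeback plus op; (4) atomic op at the global cache. Annotations: 2 cycles, ~18 cycles, 1 cycle.]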
Figure 3.13: Lazy atomic updates. For associative and commutative
operations, updates can be performed in local caches with higher aggregate
throughput. Results from local caches are combined transparently at
writeback (lazily) using global atomic units. Atomic throughput is thus
improved chipwide, while reducing demand for global atomic units and
maintaining a conceptually simple programming model.
a software-defined default, usually an identity element such as 0 for add or
increment. As an entire cache line is restricted to a single atomic action type,
the same identity element is used to initialize each element in the cache line.
By allocating an identity element, the local subset of updates can be accu-
mulated completely independently of any other conflicting updates taking
place concurrently within another cache. Figure 3.14 illustrates the logical
coherence state transitions of a cache line under lazy atomic control. This
protocol is an extended variant of the previously described TCMM software
coherence protocol.
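The allocate-with-identity behavior and the combining step at eviction can be sketched with a small behavioral model. This is an illustration under several assumptions, not the hardware design: class and method names are hypothetical, each line holds a single word, eviction is first-in first-out, and untouched global locations are treated as holding the identity element.

```python
import operator

OPS = {"add": operator.add, "max": max}
IDENTITY = {"add": 0, "max": float("-inf")}


class GlobalCache:
    """Models the last-level cache with its global atomic units."""

    def __init__(self):
        self.mem = {}

    def atomic(self, addr, op, value):
        current = self.mem.get(addr, IDENTITY[op])
        self.mem[addr] = OPS[op](current, value)


class LocalCache:
    """Models an incoherent local cache supporting lazy atomic updates."""

    def __init__(self, global_cache, capacity):
        self.global_cache = global_cache
        self.capacity = capacity
        self.lines = {}  # addr -> [op, value]; dict order gives FIFO eviction

    def lazy_update(self, addr, op, operand):
        if addr not in self.lines:
            if len(self.lines) >= self.capacity:
                self.evict(next(iter(self.lines)))
            # Allocate without fetching the global copy; start from the
            # identity so the global value is not counted twice.
            self.lines[addr] = [op, IDENTITY[op]]
        line = self.lines[addr]
        line[1] = OPS[op](line[1], operand)

    def evict(self, addr):
        op, value = self.lines.pop(addr)
        # Combine the locally accumulated value at the global atomic unit.
        self.global_cache.atomic(addr, op, value)

    def flush(self):
        for addr in list(self.lines):
            self.evict(addr)
```

Because each local copy accumulates from the identity element, the global result after all caches are flushed is the same regardless of which cache handled which update or when evictions occurred.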
Local atomics can be implemented with dedicated arithmetic units colo-
cated with shared caches, or performed on local core ALUs. If dedicated
atomic units are used, then a single instruction will induce an update to
cache state and property bits and perform the atomic action. Alternatively,
local processing element ALUs can be used for local atomics to reduce hard-
ware overhead. A short load-link, store-conditional pair can be used
to wrap the action for atomicity, cleanly handling the case when a line is
evicted from the cache between loads and stores.
For lazy atomics, we extend the traditional load-link operation to a
load-link-lazy operation which additionally both allocates the line and
sets the cache bits as necessary. Lazy atomic instructions specify the opera-
tion, the address to operate on, and an initialization value to use if a cached
copy is not present. Instructions are formatted as LLL.OP Rd, Raddr, Init.
[Figure 3.14: state diagram with states Invalid (Uncached); Valid, Clean, Atom [NULL]; Valid, Clean, Atom [OP]; and Valid, Dirty, Atom [OP]. Transition labels: LD, LLL.OP, STC, Eviction, Eviction + Global Atomic Update, Writeback + Global Atomic Update.]
Figure 3.14: State diagram for cache line transitions. This is an extended
version of the TCMM protocol originally described in [5].
To ensure all updates are eventually made globally visible, some actions
are necessary when synchronization takes place at the end of a computation
interval. Either software or hardware cache flush mechanisms must write
back any locally cached dirty copies of data with pending lazy atomic actions.
3.4.4 Programming with Lazy Atomics
Using lazy atomics for reduction-style operations can be mostly transparent
to the programmer. Code can be written as if global atomic operations
were used, requiring no data replication. To ensure that all updates are
made, dirty cache lines marked with atomic actions must be flushed before
final global synchronization. While this represents additional programmer
effort over using only global atomics, it is less effort than multi-level tree-
structured software combining, and the timing of such flush operations may
be tuned in software. Similar mechanisms are proposed in [5] for software
cache coherence.
// process elements in parallel
while ( e = GetElement() ) {
    reduce( ) = reduce( ) op function(e)
}
// flush (make coherent)
3.4.5 Limitations
A limited selection of atomic operations can be offered because an ALU must
be present at the last-level cache bank. This may rule out floating-point
atomic operations. The mechanism requires additional bits to be added to
each cache line. However, assuming per-word dirty bits are already present,
the marginal overhead per line is log2 of the number of supported atomic
operations, about three bits per line or roughly 1% storage overhead for 256-
bit cache lines. To reduce overhead, we limit the granularity of updates to
aligned words, and only a single type of atomic update can be specified per
cache line.
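The overhead estimate can be reproduced with a short calculation. The sketch assumes that one encoding is reserved for "no atomic action" in addition to the supported operations, which is this example's reading of the three-bit field described earlier; the function names are hypothetical.

```python
def atom_field_bits(num_ops):
    """Bits needed to encode `num_ops` operations plus a 'no action' value.

    For a positive integer n, n.bit_length() equals ceil(log2(n + 1)).
    """
    return num_ops.bit_length()


def storage_overhead(num_ops, line_bits=256):
    """Added metadata bits as a fraction of the data bits in a cache line."""
    return atom_field_bits(num_ops) / line_bits
```

With seven operations, `atom_field_bits(7)` is 3 and `storage_overhead(7)` is 3/256, or about 1.2%, consistent with the rough 1% figure quoted above.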
3.4.6 Evaluation
We evaluate an initial implementation of lazy atomic functionality on Rigel,
a fully-cached, MIMD manycore accelerator architecture [4] (Figure 3.15) and
also provide motivating GPU-based performance measurements (Figure 3.16)
for a simple histogram of random data, varying the number of bins. We note
that a variety of factors can influence performance. The histogram imple-
mentation we used can be considered naïve; however, improved performance
on simple code is an advantage in terms of programmability. The “No atom-
ics” and “local updates” categories disregard safe atomic memory accesses
and provide a performance bound estimate for lazy atomics on Rigel and
GPUs, respectively. On Rigel at 1024 cores, the strain on global atomic
bandwidth can be seen. However, local (lazy) atomics perform similarly to the
approximate upper bound of non-atomic operations.
In the case of GPUs, we can see that behavior varies with histogram bin
count. Different histogram bin counts are bottlenecked in different parts
of the system. GPUs are subject to degraded performance under diverged
memory accesses, and the random input data is unlikely to exhibit good
locality. For a range of bin counts, lazy atomics via local updates provide
a 3× to 6× performance advantage over global atomics. For this middle
range of bin counts, global atomic performance hits a cliff. For large numbers
of bins, local atomic updates are not effective, because data is no longer
reliably cached on chip. In this case, we are likely limited by off-chip memory
bandwidth, not global atomic bandwidth.
[Figure 3.15: bar chart. Y-axis: Speedup Over Global Atomics. X-axis: 64, 256, and 1024 bins at 128 cores and at 1024 cores. Series: Global Atomics, Local Atomics, No Atomics.]
Figure 3.15: Speedup on Rigel for histogram on (128, 1024) cores for (64,
256, 1024) bins and one million data elements (local are lazy atomics).
[Figure 3.16: bar chart. Y-axis: Speedup Over Global Atomics. X-axis: 64, 256, 1K, 4K, 16K, 64K, 256K, and 1M bins. Series: Global Atomics, Local Updates.]
Figure 3.16: Speedup on Fermi GTX 460 GPU of local over global
operations for various numbers of histogram bins. Atomics are to global
device memory, and local updates use nonatomic operations to cacheable
global memory to estimate the performance of lazy atomics. 64 million data
elements.
3.4.7 Limitations, Future Work, and Conclusion
The set of lazy atomic operations that can be implemented is limited by
hardware, as is the case with global atomic units. However, local atomic up-
dates can be arbitrary and possibly different from the lazy atomic operation.
We leave to future work whether there are any practical applications for this
functionality.
Finally, all local data copies must be flushed to ensure visibility. There are
several ways this can be implemented. A complete flush of the cache can be
implemented, but this is needlessly wasteful. A more selective cache flush can
be implemented in hardware, taking care to only flush lines requiring a lazy
atomic update. However, this adds additional hardware complexity. Finally,
software in the form of the compiler or application programmer can track
lazy atomic data and manage flushing manually. This approach is similar to
software coherence.
Lazy atomics have the potential to reduce the programming burden for ap-
plications and algorithms that make use of order-independent atomic updates
such as reductions. They offer the ease of use of global atomic operations,
while providing the same bandwidth as local atomic operations and data
replication.
CHAPTER 4
AN EVALUATION INFRASTRUCTURE
FOR MASSIVELY PARALLEL
ACCELERATOR PROCESSORS
As a contribution of this dissertation, we plan a release of several of the
software tools developed to make this research possible. While tools devel-
opment is not the primary goal of this dissertation, the release of these tools
represents a considerable undertaking and has the potential to be a uniquely
beneficial resource for the academic architecture research community.
A number of open source tools exist for simulating uniprocessors and
CMPs [22]. Tools such as GPGPUSim [23] allow simulation of restricted
SIMD GPUs, similar to CUDA and OpenCL capable GPUs from NVIDIA.
Ocelot [24] allows emulation for NVIDIA PTX-based GPUs.
While a number of tools exist, none are particularly well suited for the
detailed exploration of large-scale parallel accelerator systems. Full-system
simulators and those based on commercial designs tend to be limited by the
commercial architectures they model, with many levels of the software stack
effectively set in stone along with the instruction set architecture. Similarly,
off-the-shelf software is not sufficient for analyzing large-scale parallel accel-
erators such as Rigel. Finally, our tools include an RTL model allowing the
co-development of RTL and simulation for better power and area analysis,
critical for evaluation of throughput-oriented processors. To our knowledge,
there is no widely available toolset providing similar capabilities.
In this chapter, I describe our integrated evaluation framework for Rigel.
This infrastructure consists of an integrated and tightly coupled set of tools
spanning simulation, code generation, and hardware development. I describe
IDEA (Integrated Design Space Exploration for Accelerators), a unifying
component of these tools.
A current release of the Rigel toolflow, as of this writing, is available at
rigelproject.github.com.
4.1 Toolflow Objectives
The broader research goal of our project is to evaluate the potential of a 1024-
core MIMD architecture with a cached, shared address space that maximizes
throughput (FLOPS/mm² or FLOPS/W) while supporting a conventional programming
model. However, we wish to remain otherwise unconstrained by traditionally
limiting design considerations.
The goal of this section is not to evaluate specific design tradeoffs, but
to describe the tools we have developed for making these decisions. Our
research objectives led us to develop this framework.
We describe our evaluation framework for Rigel, a 1024-core single-chip
accelerator architecture designed for throughput on visual computing and
scientific workloads. The architecture is described in detail in Chapter 2, but
the salient points are summarized here. Each dual-issue, in-order core has
private L1 data and instruction caches. The cores are arranged into clusters
that share a unified L2 cache. The cluster acts as an eight-way SMP. The
full design has 1024 cores in 128 clusters. All cores share a unified last-level
global cache. We support a configurable number of memory controllers, and
the DRAM model is parameterized, allowing us to model a variety of DRAM
standards. We consider variations of the design that support various types
of hardware and software cache coherence.
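The organization summarized above can be captured in a small configuration sketch. The class and field names are hypothetical and are not taken from the Rigel toolflow; the memory controller count is left as a required argument because the text says only that it is configurable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RigelConfig:
    """Hypothetical description of the baseline organization."""
    num_memory_controllers: int   # configurable; no default is implied by the text
    clusters: int = 128           # clusters sharing the global cache
    cores_per_cluster: int = 8    # each cluster acts as an eight-way SMP
    issue_width: int = 2          # dual-issue, in-order cores

    @property
    def total_cores(self):
        return self.clusters * self.cores_per_cluster
```

With the default values, `total_cores` evaluates to 1024, matching the full design described in the text.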
4.2 The Case for a Flexible, Integrated Evaluation
Framework
Most designs separately approach the topics of architecture development,
hardware development, and software tool development. Instruction set ar-
chitectures (ISAs) are generally fixed and maintained across generations,
and software stacks are inflexible. The limited flow of information across
design boundaries hampers globally optimized design decisions. Prominent
researchers [25] have advocated re-evaluation of the traditional fixed infras-
tructure stack, including simulators, compilers, and ISAs.
In addition to performance, most processor designers today are concerned
with area and power. Architects are traditionally well equipped to evaluate
performance, but less so for area and power, often relying on previous designs
or analytical models. CPUs and GPUs are often designed and implemented
concurrently, with RTL used primarily for validation and verification, not
performance studies or for physical estimates (power, area), and can rely
upon past designs to inform future decisions. SoC designers can rely on
known or characterized IP and use standard processor ISAs, mitigating de-
sign risk.
When exploring a new design space, in our case 1000+ core CMPs, it is
difficult to foresee all potential challenges, and we encountered many non-
obvious design decision questions throughout the process. Ultimately, per-
formance in silicon implementation can be difficult to predict without actual
implementation details. We find it valuable to consider physical design con-
straints proactively, during the definition of the architecture, rather than
reactively during implementation. For these reasons, we choose to pursue a
custom integrated design flow employing measurement rather than modeling
whenever possible.
With an accurate RTL model, we can quantify design tradeoffs and their
impact on area, power, and frequency. Due to the targeted level of par-
allelism, design decisions impacting core efficiency are amplified 1000-fold,
making accurate estimates paramount. However, it is difficult to measure
the performance impact in RTL due to the required detail and scale of the
model. Therefore, we take the approach of measuring key components in
RTL but evaluate the impact on system-level performance in a cycle-accurate
full-system simulator, providing improved design visibility.
In most designs, an ISA is selected or developed early in the design process
and remains mostly fixed throughout the development cycle (and, in prac-
tice, for many product generations). Once an ISA is chosen, many parallel
portions of the design flow are developed that depend on this choice, includ-
ing performance simulation, RTL implementation, and software tools. ISA
modification can represent a daunting task late in the design cycle, having
costly, widespread consequences throughout the development stack. Fixing
the ISA is a practical consideration: it maintains binary compatibility with
software infrastructure that depends upon the ISA, such as compilers,
assemblers, and assembly-language system software. Many components of
the processor implementation may also be affected by ISA-level changes,
including the decoder, pipeline configuration, and execution units.
However, for designs targeting new design spaces, where entirely new
application codes and software stacks will be developed, the requirements
of a design may not initially be clear. Such attachment to an ISA may not
be necessary. The prevalence of just-in-time (JIT) compilation and the rise
of interpreted languages and virtual machines such as the JVM or the .NET
framework have somewhat reduced the reliance upon the underlying ISA. Devices such
as GPUs make use of JIT compilation for shaders, allowing them to make
substantial ISA-level changes between product generations.
For designers targeting new design spaces, an agile methodology for al-
lowing ISA design space exploration can be a powerful tool to aid in the
evaluation of power, area, and performance tradeoffs or simply extending
the design with additional features.
4.3 Toolflow Components
This section summarizes the components of our evaluation framework. Our
framework consists of three major components: an architectural timing sim-
ulator, an RTL implementation and tool flow, and a software stack for code
generation. Each of these components contains portions automatically gen-
erated by a component we call IDEA. Figure 4.1 illustrates the high-level
organization of the tools with IDEA, and Figure 4.2 illustrates the major
components of the toolflow.
4.3.1 Timing Simulator
Our simulator is an execution-driven cycle-accurate model of the Rigel ar-
chitecture. The simulator is structural, with an RTL analog for every simu-
lator component used to model timing, allowing for simplified validation at
interfaces with RTL. Though tempted to implement our simulator in Sys-
temC, we chose standard C++ for performance and flexibility, because our
simulator is essentially full-system. The simulator runs a custom parallel
runtime, but like other accelerators does not run a complete traditional op-
erating system. Because we are exploring a relatively new architectural niche,
1000-core processors, we opted not to use a more abstract high-level model
that might obscure important performance effects. For instance, many
research projects approximate DRAM behavior with a fixed latency and fixed
or infinite bandwidth. We find that our applications exhibit a 10× variation
in average latency for a 1024-core system using a detailed timing model, as
shown in Figure 4.3. This led us to develop a structural memory system
model based on real DRAM timing constraints. We also find that it is
difficult to abstract without a more complete understanding of the system being
developed. As such, an abstract model, though easier to develop and
providing faster simulation times, was not feasible for a design point
far-removed from existing architectures.
[Figure 4.1 diagram; recoverable block labels: Software Stack, Compiler; IDEA.]
Figure 4.1: IDEA: An integrated toolflow for accelerator design space
exploration.
[Figure 4.2 diagram; recoverable block labels: ISA, machine Specification;
IDEA Tool; RTL-Decode; Sim-Decode; Sim-SB; Sim-Exec; RTL Simulator;
LLVM Compiler; GNU Binutils; Test and Benchmark Code; RF Trace.]
Figure 4.2: Integrated toolflow components. A header file specifies various
ISA parameters. This file is consumed by IDEA, which produces
components of the simulator, RTL, and code generation tools.
[Figure 4.3 plot; y-axis: Average Latency (Cycles), 0 to 450; recoverable
bar label: Mean.]
Figure 4.3: Average DRAM latency for our applications varies by a factor
of 10, necessitating a detailed timing model for accurate performance
prediction.
4.3.2 RTL Model
In addition to performance, area and power are key criteria in our design
space of parallel architectures, where components are replicated many times
(for our cores, 1024 times). We developed a flexible RTL design flow in or-
der to evaluate the feasibility of implementing a 1024-core CMP in current
process technologies and allow comparison with existing GPU and CPU ar-
chitectures. Such an aggressive design target requires an emphasis be placed
on accurate area and power measurement and modeling for key components.
Our SystemVerilog core and cluster models are extensively parameterized, al-
lowing components to be interchanged or removed. We target a production-
quality 40 nm high-performance standard cell library.
The RTL flow is integrated with the simulator to perform performance val-
idation of the simulator and pre-silicon verification of the RTL. We use the
dynamic stream of register file writes from within the simulator for
comparison with the RTL model, a method which nicely handles microarchitectural
mismatches between the RTL and simulator.
RTL Design Space Exploration Flow
To enable comprehensive analysis of design tradeoffs impacting area, power,
and timing, we developed a flow for automated RTL-level design space explo-
ration. Provided with a set of values for RTL-level configuration parameters,
we generate a cross product of possible design points and pass these to simu-
lation and synthesis flows that leverage the Condor distributed computation
system. For each configuration, a variety of performance test kernels may be
run under simulation for functional and performance verification as well as
for extracting switching data for power analysis. Each design point may be
synthesized under a variety of conditions for clock speed, voltage, and more.
Post-synthesis netlists can be combined with switching data to produce power
estimates. This flow allows us to rapidly experiment with a variety of design
tradeoffs impacting frequency, area, and power that are not clearly visible
within the confines of the traditional timing simulator approach typically
employed by architects.
4.3.3 Code Generation and Software Stack
We use the LLVM tool suite [26] with a custom backend for compilation
and GNU Binutils to assemble, disassemble, and link Rigel binaries. The
decoupling of binary creation from the compiler and the autogeneration of
ISA-specific components of the compiler backend has allowed us to keep
our compiler updated with new releases of LLVM. Building a robust code
generation framework has allowed us to take advantage of new features and
tools provided by the LLVM development effort. Moreover, using a compiler
with a retargetable intermediate representation allows us to take advantage
of transformations and extensions that target LLVM, such as [27].
4.4 Tool Chain Integration: IDEA
A key unifying component of our tool flow is IDEA (Integrated Design Space
Exploration for Accelerators). IDEA is a utility application that takes a
set of configuration files defining and documenting the ISA, architectural
parameters including latency and complement of functional units, and RTL
parameters. IDEA removes the fixed ISA restriction placed on a traditional
design flow and allows ISA design space exploration as part of the design
process. Building this tool early on helped us avoid locked-in commitments.
IDEA produces Verilog and C++ files that are used by our simulator, compiler,
and RTL to generate various stages of the pipeline as well as the code
generation logic in the compiler. IDEA is a critical piece of the design that allows us to
rapidly change the RTL, compiler, and simulator while minimizing cross-
tool inconsistencies and incorrect output from the tools. The consistency
and single point of modification enables rapid modification of the ISA and
core-level microarchitectural parameters with little effort.
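To illustrate the single-point-of-modification idea, the following minimal sketch derives both a C++ enum and a set of Verilog localparams from one instruction list. The instruction names, the 6-bit opcode width, the sequential opcode assignment, and the output formats are all hypothetical; this is not IDEA's actual input or output format.

```python
# Hypothetical single-source ISA table. IDEA's real configuration format
# is not shown in the text, so every name here is an assumption.
INSTRUCTIONS = ["add", "sub", "ldw", "stw", "beq"]
OPCODE_BITS = 6  # assumed opcode field width


def assign_opcodes(mnemonics, bits):
    """Assign sequential opcodes, failing if the field is too narrow."""
    if len(mnemonics) > 2 ** bits:
        raise ValueError(
            f"{len(mnemonics)} instructions do not fit in {bits} opcode bits"
        )
    return {name: code for code, name in enumerate(mnemonics)}


def emit_cpp_enum(opcodes):
    """Emit a C++ enum that a simulator decoder could include."""
    body = "\n".join(f"  OP_{n.upper()} = {c}," for n, c in opcodes.items())
    return "enum Opcode {\n" + body + "\n};"


def emit_verilog_params(opcodes, bits):
    """Emit Verilog localparams that an RTL decoder could include."""
    return "\n".join(
        f"localparam OP_{n.upper()} = {bits}'d{c};" for n, c in opcodes.items()
    )


opcodes = assign_opcodes(INSTRUCTIONS, OPCODE_BITS)
print(emit_cpp_enum(opcodes))
print(emit_verilog_params(opcodes, OPCODE_BITS))
```

Because both outputs come from the same table, adding or reordering an instruction changes the simulator and RTL encodings together, which is the consistency property described above.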
We supply an instruction mnemonic, encoding type, description for ISA
document generation, latency, the functional unit that handles the op, and
RTL-specific information. For instructions, we provide a list of
functional units available for scheduling, the number and type of registers
used, and the encoding type used by the decoder. For functional units, we
describe latencies and microarchitectural configuration. We also include in-
formation necessary for generating an efficient RTL decoder including signals
generated by each instruction and functional unit such as carry out signals
or branch resolution information. IDEA consumes the configuration and
automatically generates an ISA that is compatible, assigning opcodes and en-
codings as required. An error is produced if the ISA specification cannot be
encoded in the provided bit space. In this case, the user is required to make
higher-level design decisions to open encoding space. For instance, in an ISA
that supports 16-bit immediates with 32-bit fixed length instructions and 32
registers (5-bit identifiers), at most 6 bits remain to specify opcode and
instruction encoding type. The outputs of IDEA are shared by our compiler,
our assembler, the RTL flow, and our simulator. The assembler and GNU
Binutils toolchain use the instruction encodings automatically generated by
IDEA.
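The bit-budget example above can be checked with a short calculation. The sketch below assumes the immediate format carries two register fields (for instance, one destination and one source); that assumption is ours, chosen because it reproduces the 6-bit figure. The function name is hypothetical.

```python
def remaining_opcode_bits(instr_bits, imm_bits, reg_fields, num_registers):
    """Bits left for opcode and encoding type after immediate and registers."""
    reg_id_bits = (num_registers - 1).bit_length()  # 32 registers -> 5 bits
    remaining = instr_bits - imm_bits - reg_fields * reg_id_bits
    if remaining < 0:
        # Mirrors the error case in the text: the format cannot be encoded.
        raise ValueError("fields do not fit in the instruction word")
    return remaining


# 32-bit instructions, 16-bit immediate, two 5-bit register fields (assumed).
print(remaining_opcode_bits(32, 16, 2, 32))  # 32 - 16 - 2*5 = 6
```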
4.4.1 Limitations
Though powerful, IDEA has limitations. It does not automatically generate
entire simulators, compilers, or RTL implementations. We are required to
specify the implementation and functionality of new features in the simu-
lator, compiler, and RTL. However, once a new feature class (for instance,
a new instruction format or new field type) is implemented within IDEA,
modifications using those features are easy to propagate. We are extending
IDEA to provide a consistent execution unit profile, with quantity, latency,
and pipeline restrictions across simulator, compiler, and RTL.
4.5 Timing Simulator
In this section, we describe Rigelsim, a cycle-accurate structural timing sim-
ulator for the Rigel architecture that is under continuing development.
Our simulator is an execution-driven cycle-accurate model of the Rigel ar-
chitecture. The simulator is a structural model, with an RTL analog for each
major simulator component used to model timing. This allows for simpli-
fied block and module-level verification with RTL. Though SystemC was
considered for our simulator, we chose standard C++ for performance and
ultimately flexibility. Rigelsim is essentially a full-system simulator, running
a custom parallel runtime, but like most other accelerators does not currently
run a complete operating system in the traditional sense.
Because we are exploring a relatively new architectural niche, 1000-core
processors, we opted against using an abstract high-level model that could
obscure important performance effects. For instance, many research projects
approximate DRAM behavior with a fixed-latency and either fixed or infinite
bandwidth. In contrast, we choose to fully model all timings and interactions
with fine granularity. We find that our applications exhibit a 10× variation in
average latency for a 1024-core system using a detailed timing model, which limits
the value of fixed-parameter models. We also find that it is more difficult to
develop abstractions without a more complete understanding of the system
being developed. Therefore, abstract models, though easier to develop and
enabling faster simulation times, were not desirable for a design point so
dissimilar from existing architectures.
The simulator is constructed as a hierarchy of interconnected class mod-
ules representing various parts of the design, including pipeline stages, cores,
arbiters, various caches, interconnects, and memory controllers. We describe
some of the major components of the simulator.
4.6 Cluster
The cluster is the primary design element of the Rigel architecture. Clus-
ters are replicated across the chip many dozens of times, making accurate
modeling of this portion of the design very important.
We have implemented a detailed timing model of the proposed Rigel clus-
ter architecture for performing an accurate design space exploration for clus-
ters of parallel processors. The RTL model provides the ability to do rapid
design space exploration at a low level.
We have worked to develop an accurate model of the cluster, and in par-
ticular the cache hierarchy for our proposed architecture, including timing
and resource contention at each level of the cluster cache hierarchy. This
facilitates an accurate evaluation of the design tradeoffs in the cache system.
We have also developed a generalized system of arbitration that can be used
for accurate contention modeling in the simulator.
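As an illustration of how an arbiter model exposes contention, the sketch below grants one requester per cycle in round-robin order. The text does not state which arbitration policies Rigelsim implements, so the round-robin policy and the class name are assumptions.

```python
class RoundRobinArbiter:
    """Grants one requester per call, rotating priority after each grant."""

    def __init__(self, num_requesters):
        self.num_requesters = num_requesters
        self.next_priority = 0  # requester that has highest priority next

    def arbitrate(self, requests):
        """requests: set of requester ids. Returns the granted id or None."""
        for offset in range(self.num_requesters):
            candidate = (self.next_priority + offset) % self.num_requesters
            if candidate in requests:
                self.next_priority = (candidate + 1) % self.num_requesters
                return candidate
        return None


# Three cores request the same shared resource in the same cycle.
arbiter = RoundRobinArbiter(4)
pending = {0, 2, 3}
grant_order = []
while pending:
    winner = arbiter.arbitrate(pending)
    grant_order.append(winner)
    pending.discard(winner)
print(grant_order)  # [0, 2, 3]
```

Each position in `grant_order` corresponds to one cycle, so requester 3 waits two cycles longer than it would without contention. A fixed-latency model would not capture this delay.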
4.7 Simulation Automation
We have developed various tools enabling efficient distribution of simulation
jobs running Rigelsim experiments. Users can specify a set of jobs to be run
on one of two parallel environments, the Trusted Illiac cluster or our group’s
set of desktop machines. The Illiac cluster manages job submissions via the
freely available SLURM resource manager. The desktop machines manage
work via the Condor job queuing system. Submitting work to either set of
machines entails constructing a configuration file which specifies a set of sim-
ulation parameters, benchmark codes, data inputs, and other configuration
details. The Rigel job submission tool automatically builds a cross prod-
uct of all relevant parameters and submits each job to the specified set of
machines.
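The cross-product expansion can be sketched as follows. The parameter names and values are hypothetical, and the sketch stops at building the job list; submission to SLURM or Condor is not shown.

```python
import itertools


def expand_jobs(sweep):
    """Return one job dictionary per combination of swept parameter values."""
    names = list(sweep)
    combos = itertools.product(*(sweep[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


# Hypothetical sweep; real configuration keys are not given in the text.
sweep = {
    "benchmark": ["dmm", "fft"],
    "clusters": [64, 128],
    "l2_kb": [32, 64, 128],
}
jobs = expand_jobs(sweep)
print(len(jobs))  # 2 * 2 * 3 = 12
print(jobs[0])
```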
4.8 RTL Infrastructure
Rigelsim provides us with performance results for various core, cache, and
interconnect configurations. To evaluate the area-performance tradeoffs that
influence parallel designs, we have undertaken the development of an RTL
model for use with the Rigel project.
In this section, we describe work on the development of RTL and related
infrastructure for the Rigel project. We describe the CAD toolflow, core
RTL and microarchitecture, core verification strategy, and cluster RTL de-
velopment.
For this work, we use a commercial suite of Synopsys CAD tools,
including VCS for RTL simulation, Design Compiler for synthesis, and IC Compiler
for place and route. We use commercial standard cell libraries for a
high-performance 40 nm process.
Memories are generated using commercial SRAM and Register File com-
pilers when possible. However, our toolset is limited to generating single-
and dual-ported structures. Multiported structures are synthesized from
standard cells when the required configurations are not available from the
memory compiler.
4.8.1 CSL ASIC Flow
The physical design flow is driven by the CSL ASIC flow, a powerful frame-
work for chip development developed by Jonathan Ashbrook in collaboration
with the Rigel team.
In addition to performance, area and power are key criteria in our design
space of parallel architectures, where components are replicated many times
(for our cores, 1024 times). We have worked to develop a flexible RTL design
flow in order to evaluate the feasibility of implementing a 1024-core CMP in
current process technologies and allow comparison with existing GPU and
CPU architectures. Such an aggressive design target requires an emphasis
be placed on accurate area and power measurement and modeling for key
components. Our SystemVerilog core and cluster models are extensively
parameterized, allowing components to be interchanged or removed. We
target a production-quality 40 nm high-performance standard cell library.
The RTL flow is integrated with the simulator to perform performance
validation of the simulator and pre-silicon verification of the RTL. We use the
dynamic stream of register file writes from within the simulator for
comparison with the RTL model, a method which nicely handles microarchitectural
mismatches between the RTL and simulator.
4.8.2 RTL Design Space Exploration Flow
To enable comprehensive analysis of design tradeoffs impacting area, power,
and timing, we developed a flow for automated RTL-level design space explo-
ration. Provided with a set of values for RTL-level configuration parameters,
we generate a cross product of possible design points and pass these to simu-
lation and synthesis flows that leverage the Condor distributed computation
system. For each configuration, a variety of performance test kernels may be
run under simulation for functional and performance verification as well as
for extracting switching data for power analysis. Each design point may be
synthesized under a variety of conditions for clock speed, voltage, and more.
Post-synthesis netlists can be combined with switching data to produce power
estimates. This flow allows us to rapidly experiment with a variety of design
tradeoffs impacting frequency, area, and power that are not clearly visible
within the confines of the traditional timing simulator approach typically
employed by architects.
4.9 The Rigel Core
As described in Chapter 2, the principal processing element of the Rigel ac-
celerator architecture is a simple dual-issue, in-order core. The Rigel core
has three separate pipelines: integer, floating-point, and memory. The inte-
ger pipeline also handles branch instructions. By default, the pipeline is fully
bypassed, with the exception that the integer and floating-point pipelines do
not bypass to each other. Bypassing can be enabled or disabled at synthesis
time.
The initial Rigel pipeline RTL has seven stages. The front-end has two
stages, fetch and decode. The decode stage also serves to schedule instruc-
tions based on operand availability, dependences, and hazards. Four stages
are provided for execution. The last stage handles register file writeback.
[Figure 4.4 diagram; recoverable block labels: Fetch; Decode; L1 I-Cache;
L1 D-Cache; RegFile; SPRF; Scoreboard; Bypass Network; Exec 1 (Int);
Exec 2; Agen; Mem; Mem2; CCRead; FPU 1; FPU 2; FPU 3; FPU 4;
FP Accumulator; ClusterNet (Bus); ClusterNet (Arb); WB.]
Figure 4.4: Pipeline block diagram of the Rigel core.
Simple branch prediction is provided in the form of a single-entry branch
target buffer and a static backward-taken, forward-not-taken (BTFN) predictor.
Figure 4.4 shows the typical configuration of the Rigel pipeline.
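A minimal sketch of the two prediction mechanisms follows. How the Rigel pipeline combines the branch target buffer with the static predictor is not described here, so the sketch models each in isolation; the names are hypothetical.

```python
def btfn_predict_taken(branch_pc, target_pc):
    """Static BTFN: backward branches (typically loops) are predicted taken."""
    return target_pc < branch_pc


class SingleEntryBTB:
    """Branch target buffer holding the target of one branch at a time."""

    def __init__(self):
        self.tag = None     # PC of the branch currently tracked
        self.target = None  # its most recently recorded target

    def lookup(self, pc):
        """Return the stored target on a tag match, otherwise None."""
        return self.target if self.tag == pc else None

    def update(self, pc, target):
        """Record a resolved branch, replacing any previous entry."""
        self.tag, self.target = pc, target


btb = SingleEntryBTB()
btb.update(0x100, 0x80)
print(btfn_predict_taken(0x100, 0x80))   # True: backward branch
print(btfn_predict_taken(0x100, 0x140))  # False: forward branch
print(btb.lookup(0x100) == 0x80)         # True: tag match
```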
4.10 Testing and Verification
We have developed a comprehensive framework for testing and verifying the
processor pipeline’s implementation. A series of test codes can be run on the
RTL and their outputs verified in a number of ways. Testing is automated
in a structured and configurable manner.
4.10.1 Assembly Codes
Tests written in assembly are simple to run and verify. Each assembly test
gets a directory in the testcode folder. Each directory holds the test code
and a test configuration file and, optionally, files for initial and final register
file state, register write trace, and memory dump state. An example test
code is written for each instruction supported in the RTL. A variety of other
assembly test patterns are available as well.
Initially, assembly tests were run on the RTL and their final register file
state was compared with that of the simulator running the same piece of
code. While this is sufficient for quick superficial functional checking, it fails
to capture intermediate errors which might occur while not impacting the
final register state. To address this issue, the Rigel RTL flow captures a
complete trace of register file writes from both the simulator and the RTL
models. Because the specific core microarchitectures in the two models are
not identical and vary during the development process, these traces allow
an implementation-agnostic method of verifying functionality. On a correct
execution, the ordering of register writes as well as the values written directly
correspond. Writes which occur sequentially in one model are also correct
when occurring simultaneously (in the same cycle) in the other model; this
can happen when one model extracts a better schedule due to various archi-
tectural factors such as memory latency or pipeline configuration differences.
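The trace comparison can be sketched as below. The tuple layout `(cycle, register, value)` is a hypothetical trace format, and the sketch assumes each model lists writes that share a cycle in program order.

```python
def strip_timing(trace):
    """Drop cycle numbers so only the order and values of writes remain."""
    return [(reg, value) for _cycle, reg, value in trace]


def first_mismatch(sim_trace, rtl_trace):
    """Return the index of the first differing write, or None if they agree."""
    sim, rtl = strip_timing(sim_trace), strip_timing(rtl_trace)
    for index, (sim_write, rtl_write) in enumerate(zip(sim, rtl)):
        if sim_write != rtl_write:
            return index
    if len(sim) != len(rtl):
        return min(len(sim), len(rtl))  # one trace ended early
    return None


# The simulator writes r1 and r2 in consecutive cycles; the RTL writes both
# in the same cycle. The traces still match because timing is ignored.
sim_trace = [(10, 1, 0xAA), (11, 2, 0xBB)]
rtl_trace = [(7, 1, 0xAA), (7, 2, 0xBB)]
print(first_mismatch(sim_trace, rtl_trace))  # None
```

Comparing the full write stream rather than only the final register state catches an intermediate wrong value even when a later write overwrites it.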
4.10.2 Compiled Code
More thorough testing and evaluation is possible by running code compiled
with the full Rigel toolchain. In order to support compiled codes, the RTL
model implements several features via behavioral emulation. For instance,
as the global cache and associated atomics are currently unimplemented in
RTL, these are emulated via testbench hooks.
4.11 Summary
The Rigel toolflow has been packaged as an open source release. This
infrastructure consists of an integrated and tightly coupled set of tools span-
ning simulation, code generation, and hardware development. We believe the
unique capabilities of the tools developed as part of this research have the
potential to benefit the broader architecture research community. A current
release, as of this writing, is available at rigelproject.github.com.
CHAPTER 5
ENERGY-EFFICIENT THROUGHPUT
ARCHITECTURES WITH MULTI-LEVEL
SCHEDULING
In contrast to early variants of the Rigel architecture, modern graphics pro-
cessing units (GPUs) use a large number of hardware threads to hide both
function unit and memory access latency. Extreme multithreading requires a
complicated thread scheduler as well as a large register file, which is expensive
to access both in terms of energy and latency. We develop two complemen-
tary techniques for reducing energy on massively threaded processors such
as GPUs. First, we examine register file caching to replace accesses to the
large main register file with accesses to a smaller structure containing the
immediate register working set of active threads. Second, we investigate a
two-level thread scheduler that maintains a small set of active threads to hide
ALU and local memory access latency and a larger set of pending threads to
hide main memory latency.
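The division of threads into an active set and a pending set can be sketched as follows. This is an illustrative model only: the round-robin issue order, the promotion policy, and all names are assumptions rather than the evaluated design.

```python
from collections import deque


class TwoLevelScheduler:
    """Keeps a small active set; other warps wait in a pending set."""

    def __init__(self, warps, active_slots):
        self.active_slots = active_slots
        self.active = deque(warps[:active_slots])
        self.ready_pending = deque(warps[active_slots:])  # ready, no slot yet
        self.blocked = set()  # pending warps waiting on main memory

    def _fill(self):
        """Promote ready pending warps into any free active slots."""
        while len(self.active) < self.active_slots and self.ready_pending:
            self.active.append(self.ready_pending.popleft())

    def issue(self):
        """Select the next active warp in round-robin order (assumed policy)."""
        if not self.active:
            return None
        warp = self.active[0]
        self.active.rotate(-1)
        return warp

    def long_latency_stall(self, warp):
        """Demote a warp that encountered a long-latency memory operation."""
        self.active.remove(warp)
        self.blocked.add(warp)
        self._fill()

    def memory_returned(self, warp):
        """Mark a blocked warp ready again; it rejoins when a slot frees up."""
        self.blocked.remove(warp)
        self.ready_pending.append(warp)
        self._fill()


sched = TwoLevelScheduler(list(range(32)), active_slots=8)
first = sched.issue()            # warp 0 issues
sched.long_latency_stall(first)  # warp 0 leaves the active set, warp 8 joins
print(sorted(sched.active))      # [1, 2, 3, 4, 5, 6, 7, 8]
```

The per-cycle selection in `issue` only ever examines the active set, which is the property that allows a smaller scheduler.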
Combined with register file caching, a two-level thread scheduler provides
a further reduction in energy by limiting the allocation of temporary reg-
ister cache resources to only the currently active subset of threads. The
combined microarchitecture effectively enables a two-dimensional reduction
in the short-term working set of registers. By reducing the amount of state
that must be kept accessible at high performance, we enable more efficient
implementations. Figure 5.1 illustrates this graphically.
We show that on average, across a variety of real-world graphics and com-
pute workloads, a register file cache with six entries per thread reduces the
number of reads and writes to the main register file by 50% and 59%, re-
spectively. We further show that the active thread count can be reduced
by a factor of four with minimal impact on performance, resulting in a 36%
reduction of register file energy.
The work presented in this chapter was performed in conjunction with
NVIDIA Research and Mark Gebhart of the University of Texas at Austin.
This work was previously published in [28] and [29]. Elements of this design
are patent-pending by NVIDIA under the patent application “Two-Level
Scheduler for Multi-Threaded Processing” [30].
[Figure 5.1 diagrams, each titled “Short-Term ‘Working Set’ Reduction”:
(a) ~32 scheduler entries by ~32 registers per entry;
(b) ~8 active scheduler entries by ~32 registers per entry;
(c) ~8 active scheduler entries by ~6 RF cache entries, annotated
“20x reduction in accessible state”.]
(a) Full working set.
(b) Reduction from two-level scheduler.
(c) Reduction from two-level scheduler and register file cache.
Figure 5.1: Graphical illustration of two-dimensional reduction in
short-term working set. The short-term working set is data that is required
to keep the processor utilized over a short window of time.
5.1 Introduction
Graphics processing units (GPUs) such as those produced by NVIDIA [7]
and AMD [31] are massively parallel programmable processors originally de-
signed to exploit the concurrency inherent in graphics workloads. Modern
GPUs contain hundreds of arithmetic units and tens of thousands of hard-
ware threads, and can achieve peak performance in excess of a teraFLOP.
As single-thread performance improvements have slowed, GPUs have become
more attractive targets for non-graphics workloads that demand high com-
putational performance.
Unlike CPUs that generally target single-thread performance, GPUs aim
for high throughput by employing extreme multithreading [32]. For example,
NVIDIA’s recent Fermi design has a capacity of over 20,000 threads inter-
leaved across 512 processing units [7]. Just holding the register context of
these threads requires substantial on-chip storage – 2 MB in total for the
maximally configured Fermi chip. Further, extreme multithreaded architec-
tures require a thread scheduler that can select a thread to execute each
cycle from a large hardware-resident pool. Accessing large register files and
scheduling among a large number of threads consumes precious energy that
could otherwise be spent performing useful computation. As existing and
future integrated systems become power limited, energy efficiency (even at a
cost to area) is critical to system performance.
In this work, we investigate two techniques to improve the energy efficiency
of the core datapath. Register file caching adds a small storage structure to
capture the working set of registers, reducing the number of accesses to the
larger main register file. While previous work has mainly examined register
file caching to reduce latency for large CPU register files [33], this chapter
examines architectures and policies for register file caching aimed at reducing
energy in throughput-oriented multithreaded architectures. Our analysis of
both graphics and compute workloads on GPUs indicates substantial register
locality. A small, 6-entry per-thread register file cache reduces the number
of reads and writes to the main register file by 50% and 43%, respectively.
Static liveness information can be used to elide writing dead values back from
the register file cache to the main register file, resulting in a total reduction
of write traffic to the main register file of 59%.
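The interaction between the register file cache and liveness information can be sketched with a small counting model. The FIFO replacement policy, the choice not to allocate on a read miss, and all names are assumptions made for illustration; the model also omits flushing the cache when a thread is descheduled.

```python
from collections import OrderedDict


class RegisterFileCache:
    """Per-thread register file cache that counts main register file traffic."""

    def __init__(self, entries=6):
        self.entries = entries
        self.cache = OrderedDict()  # register -> dead flag, oldest first
        self.mrf_reads = 0
        self.mrf_writes = 0

    def read(self, reg):
        """A read that misses the cache is served by the main register file."""
        if reg not in self.cache:
            self.mrf_reads += 1

    def write(self, reg):
        """Results are written to the cache; evictions may write back."""
        if reg in self.cache:
            del self.cache[reg]  # the overwritten value is never written back
        elif len(self.cache) == self.entries:
            _victim, dead = self.cache.popitem(last=False)  # evict oldest
            if not dead:
                self.mrf_writes += 1  # live value must reach the MRF
        self.cache[reg] = False

    def mark_dead(self, reg):
        """Static liveness hint: this value will not be read again."""
        if reg in self.cache:
            self.cache[reg] = True


rfc = RegisterFileCache(entries=6)
for reg in range(6):
    rfc.write(reg)
rfc.mark_dead(0)
rfc.write(6)  # evicts r0, which is dead: no MRF write
rfc.write(7)  # evicts r1, which is live: one MRF write
rfc.read(7)   # hit
rfc.read(1)   # miss: one MRF read
print(rfc.mrf_reads, rfc.mrf_writes)  # 1 1
```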
Multi-level scheduling partitions threads into two classes: (1) active threads
that are issuing instructions or waiting on relatively short latency operations,
and (2) pending threads that are waiting on long memory latencies. The
cycle-by-cycle instruction scheduler need only consider the smaller number
of active threads, enabling a simpler and more energy-efficient scheduler.
Our results show that a factor of four fewer threads can be active without
suffering a performance penalty. The combination of register file caching and
multi-level scheduling enables a register file cache that is 21× smaller than
the main register file, while capturing over half of all register accesses. This
approach reduces the energy required for the register file by 36% compared
to the baseline architecture without register file caching.
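The 21× figure can be reproduced from the baseline parameters given in Section 5.2.1 (32 warps of 32 threads, 32 registers per thread) and the approximate sizes shown in Figure 5.1 (8 active warps, 6 cache entries per thread). The short calculation below is our own check, assuming 4-byte registers.

```python
THREADS_PER_WARP = 32
BYTES_PER_REGISTER = 4


def storage_bytes(warps, registers_per_thread):
    """Register storage needed for the given warp count and registers each."""
    return warps * THREADS_PER_WARP * registers_per_thread * BYTES_PER_REGISTER


mrf_bytes = storage_bytes(warps=32, registers_per_thread=32)  # 131072 (128 kB)
rfc_bytes = storage_bytes(warps=8, registers_per_thread=6)    # 6144 (6 kB)
print(f"{mrf_bytes / rfc_bytes:.1f}x")  # 21.3x
```

The factor of four from the scheduler and the factor of 32/6 from the cache multiply, which is the two-dimensional reduction described earlier in the chapter.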
The remainder of this chapter is organized as follows. Section 5.2 provides
background on the design of contemporary GPUs and characterizes register
value reuse in compute and graphics workloads. Section 5.3 describes our pro-
posed microarchitectural enhancements, including both register file caching
and scheduling techniques. Section 5.4 describes our evaluation methodology.
Section 5.5 presents performance and power results. Section 5.6 discusses fu-
ture work and conclusions.
5.2 Background
While GPUs are becoming increasingly popular targets for computationally-
intensive non-graphics workloads, their design is primarily influenced by
triangle-based raster graphics. Graphics workloads have a large amount of
inherent parallelism that can be easily exploited by a parallel machine. Tex-
ture memory accesses are common operations in graphics workloads and tend
to be fine-grained and difficult to prefetch. Graphics workloads have large,
long-term (inter-frame) working sets that are not amenable to caching; there-
fore, texture cache units focus on conserving bandwidth rather than reducing
latency [32]. Because texture accesses are frequent and macroscopically un-
predictable, GPUs rely on massive multithreading to keep arithmetic units
utilized.
Figure 5.2: Baseline full-chip architecture representing a contemporary
GPU. This figure illustrates a typical contemporary full-chip GPU
organization. 16 SMs are connected via multi-stage interconnect to six
shared L2 cache banks. Each L2 cache bank has an associated
high-bandwidth GDDR memory controller.
Figure 5.2 illustrates the architecture of a contemporary GPU, similar in
nature to NVIDIA’s Fermi design. The figure represents a generic design
point similar to those discussed in the literature [34, 7, 35], but is not in-
tended to correspond directly to any existing industrial product. The GPU
consists of 16 streaming multiprocessors, 6 high-bandwidth DRAM chan-
nels, and an on-chip level-2 cache. A streaming multiprocessor (SM), shown
in Figure 5.3, contains 32 SIMT (single-instruction, multiple thread) lanes
that can collectively issue up to 32 instructions per cycle, one from each of
32 threads. Threads are organized into groups called warps, which execute
together using a common physical program counter. While each thread has
its own logical program counter and the hardware supports control-flow di-
vergence of threads within a warp, the streaming multiprocessor executes
most efficiently when all threads execute along a common control-flow path.
[Figure 5.3 diagram; recoverable block labels: Warp Scheduler; SIMT Lanes
(ALU, SFU, MEM, TEX); Main Register File, 32 banks; Shared Memory, 32 KB.]
Figure 5.3: A baseline streaming multiprocessor (SM). The SM consists of a
number of SIMT lanes, or datapaths, along with associated memories and
control logic. Each SM contains a fixed-size, explicitly addressed scratch
SRAM known as shared memory. Each SM includes a large,
high-bandwidth, multi-banked register file. Control logic for the SM,
including the warp scheduler, is shared across lanes.
Fermi supports 48 active warps for a total of 1,536 active threads per SM.
To accommodate this large set of threads, GPUs provide vast on-chip register
file resources. Fermi provides 128 kB of register file storage per streaming
multiprocessor, allowing an average of 21 registers per thread at full sched-
uler occupancy. The total register file capacity across the chip is 2 MB, sub-
stantially exceeding the size of the L2 cache. GPUs rely on heavily banked
register files in order to provide high bandwidth with simple register file de-
signs [36, 35]. Despite aggressive banking, these large register file resources
not only consume area and static power, but result in high per-access energy
due to their size and physical distance from execution units. Prior work
examining a previous-generation NVIDIA GTX280 GPU (which has 64 kB of
register file storage per SM) estimates that nearly 10% of total GPU power
is consumed by the register file [37]. Our own estimates show that the access
and wire energy required to read an instruction’s operands is twice that of
actually performing a fused multiply-add [38]. Because power-supply voltage
scaling has effectively come to an end [8], driving down per-instruction energy
overheads will be the primary way to improve future processor performance.
5.2.1 Baseline SM Architecture
In this work, we focus on the design of the SM (Figures 5.3 and 5.4). For
our baseline, we model a contemporary GPU streaming multiprocessor with
32 SIMT lanes. Our baseline architecture supports a 32-entry warp sched-
uler, for a maximum of 1024 threads per SM, with a warp issuing a single
instruction to each of the 32 lanes per cycle. We model single-issue, in-order
pipelines for each lane. Each SM provides 32 kB of local scratch storage
known as shared memory. Figure 5.4 provides a more detailed microarchi-
tectural illustration of a cluster of 4 SIMT lanes. A cluster is composed of
four ALUs, four register banks, a special function unit (SFU), a memory
unit (MEM), and a texture unit (TEX) shared between two clusters. Eight
clusters form a complete 32-wide SM.
A single-precision fused multiply-add requires three register inputs and one
register output per thread for a total register file bandwidth of 96 32-bit reads
and 32 32-bit writes per cycle per SM. The SM achieves this bandwidth by
subdividing the register file into multiple dual-ported banks (1 read and 1
Figure 5.4: 4-wide SIMT lane detail. This represents a possible
microarchitectural arrangement for a grouping of four SIMT lanes within
an SM. Four ALU datapaths are connected to register resources via an
operand distribution and buffering interconnection network. Shared
resources for special math functions (SFU), memory accesses (Mem), and
Texture (Tex) are connected similarly.
write per cycle). Each entry in the SM’s main register file (MRF) is 128 bits
wide, with 32 bits allocated to the same-named register for threads in each of
the four SIMT lanes in the cluster. Each bank contains 256 128-bit registers
for a total of 4 kB. The MRF consists of 32 banks for a total of 128 kB per
SM, allowing for an average of 32 registers per thread, more than Fermi’s 21
per thread. The trend over the last several generations of GPUs has been to
provision more registers per thread, and our traces make use of this larger
register set.
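The bandwidth and capacity figures above follow directly from the stated
parameters. The short sketch below only restates that arithmetic; the
variable names are ours, chosen for illustration.

```python
# Baseline SM register file arithmetic, using only figures stated in the text.
LANES_PER_SM = 32          # SIMT lanes
WARPS_PER_SM = 32          # warp scheduler entries
THREADS_PER_WARP = 32
BANKS = 32                 # MRF banks per SM
ENTRIES_PER_BANK = 256     # 128-bit entries per bank
ENTRY_BITS = 128
REGISTER_BYTES = 4         # one 32-bit register

# A fused multiply-add reads three operands and writes one result per lane.
reads_per_cycle = 3 * LANES_PER_SM
writes_per_cycle = 1 * LANES_PER_SM

bank_bytes = ENTRIES_PER_BANK * ENTRY_BITS // 8
mrf_bytes = BANKS * bank_bytes
threads = WARPS_PER_SM * THREADS_PER_WARP
regs_per_thread = mrf_bytes // (threads * REGISTER_BYTES)

print(reads_per_cycle, writes_per_cycle)        # 96 32
print(bank_bytes, mrf_bytes, regs_per_thread)   # 4096 131072 32
```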
The 128-bit registers are interleaved across the register file banks to in-
crease the likelihood that all of the operands for an instruction can be fetched
simultaneously. Instructions that require more than one register operand
from the same bank perform their reads over multiple cycles, eliminating the
possibility of a stall due to a bank conflict for a single instruction. Bank con-
flicts from instructions in different warps may occur when registers map to
the same bank. Our MRF design is over-provisioned in bandwidth to reduce
the effect of these rare conflicts. Bank conflicts can also be reduced signifi-
cantly via the compiler [39]. The operand buffering between the MRF and
the execution units represents interconnect and pipeline storage for operands
that may be fetched from the MRF on different cycles.
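To make the single-instruction case concrete, the following sketch counts
the cycles needed to read an instruction's source operands when each bank
has one read port. The text does not specify how registers map to banks, so
the skewed modulo mapping in bank_of is purely an assumption for
illustration, as are the function names.

```python
from collections import Counter

BANKS_PER_CLUSTER = 4  # four register banks per 4-lane cluster (from the text)

def bank_of(reg, warp):
    # Hypothetical interleaving: consecutive registers go to consecutive
    # banks, skewed by warp id. The source does not specify the mapping.
    return (reg + warp) % BANKS_PER_CLUSTER

def read_cycles(src_regs, warp):
    """Cycles to fetch all distinct source operands, one read port per bank."""
    per_bank = Counter(bank_of(r, warp) for r in set(src_regs))
    return max(per_bank.values(), default=0)

print(read_cycles([0, 1, 2], warp=0))  # three different banks: 1 cycle
print(read_cycles([0, 4, 8], warp=0))  # all three in bank 0: 3 cycles
print(read_cycles([1, 5, 2], warp=3))  # two operands share bank 0: 2 cycles
```

Under this mapping, an instruction whose operands fall in the same bank
simply takes additional read cycles, matching the multi-cycle operand fetch
described above.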
5.2.2 GPU Value Usage Characterization
Prior work in the context of CPUs has shown that a large fraction of register
values are consumed a small number of times, often within a few instructions
of being produced [40]. Our analysis of GPU workloads indicates that the
same trend holds. Figure 5.5 shows the number of times a value written to a
register is read for a set of real-world graphics and compute workloads. Up
to 70% of values are read only once, and only 10% of values are read more
than twice. HPC workloads show the highest level of register value reuse
with 40% of values being read more than once. Graphics workloads, labeled
Shader, show reuse characteristics similar to the remaining compute traces.
Figure 5.6 shows the lifetime of all dynamic values that are read only once.
Lifetime is defined as the number of instructions between the producer and
consumer (inclusive) in a thread. A value that is consumed directly after be-
ing produced has a lifetime of one. Up to 40% of all dynamic values are read
[Chart. Y-axis: Percent of All Values Produced (0% to 100%). Legend: Read 0
Times, Read 1 Time, Read 2 Times, Read >2 Times.]
Figure 5.5: Value usage: number of reads per register value.
only once and are read within three instructions of being produced. In gen-
eral, the HPC traces exhibit longer lifetimes than the other compute traces,
due in part to hand-scheduled optimizations in several HPC codes where
producers are hoisted significantly above consumers for improved memory
level parallelism. Graphics traces also exhibit a larger proportion of values
with longer lifetimes due to texture instructions, which the compiler hoists to
improve performance. These value usage characteristics motivate the deploy-
ment of a register file cache to capture short-lived values and dramatically
reduce accesses to the main register file.
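The two statistics used in this characterization can be gathered from a
per-thread instruction trace in a single pass. The sketch below is a
minimal illustration, not our trace tooling: the trace format (a
destination register and a list of source registers per instruction) and
the function name are hypothetical, and a register named twice by one
instruction is assumed to count as a single read.

```python
from collections import Counter

def characterize(trace):
    """trace: list of (dst_reg or None, [src_regs]) for one thread, in order.
    Returns (histogram of reads per value, histogram of lifetimes for
    values read exactly once)."""
    live = {}   # reg -> [producer_index, read_count, last_read_index]
    reads, lifetimes = Counter(), Counter()

    def retire(entry):
        produced, count, last_read = entry
        reads[count] += 1
        if count == 1:
            # Instructions from producer to consumer; 1 means back-to-back.
            lifetimes[last_read - produced] += 1

    for i, (dst, srcs) in enumerate(trace):
        for r in set(srcs):          # values live on entry are ignored
            if r in live:
                live[r][1] += 1
                live[r][2] = i
        if dst is not None:
            if dst in live:
                retire(live[dst])    # the previous value is overwritten
            live[dst] = [i, 0, None]
    for entry in live.values():
        retire(entry)
    return reads, lifetimes

trace = [
    ("r1", []),            # 0
    ("r2", []),            # 1
    ("r3", ["r1"]),        # 2: r1 read once, lifetime 2
    ("r4", ["r2", "r3"]),  # 3: r3 read once, lifetime 1
    ("r1", ["r4", "r2"]),  # 4: r4 read once, lifetime 1; r2 read twice
]
reads, lifetimes = characterize(trace)
print(dict(reads), dict(lifetimes))
```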
5.3 Microarchitecture
This section details our microarchitectural extensions to the GPU stream-
ing multiprocessor (SM) to improve energy efficiency, including register file
caching and a simplified thread scheduler.
[Chart. Y-axis: Percent of All Values Produced (0% to 100%). Legend: Value
Lifetime 1, Value Lifetime 2, Value Lifetime 3, Value Lifetime >3.]
Figure 5.6: Value usage: lifetime of values that are read only once.
5.3.1 Register File Cache
Section 5.2.2 shows that up to 40% of all dynamic register values are read only
once and within three instructions of being produced. Because these values
have such short lifetimes, writing them into the main register file wastes
energy. We propose a register file cache (RFC) to capture these short-lived
values. The RFC filters requests to the main register file (MRF) and provides
several benefits: (1) reduced MRF energy by reducing MRF accesses; (2)
reduced operand delivery energy, since the RFC can be physically closer to
the ALUs than the MRF; and (3) reduced MRF bandwidth requirements,
allowing for a more energy-efficient MRF. The majority of this section focuses
on (1) while we discuss (2) and (3) in Section 5.5.4.
The pipeline with register file caching adds a stage to check the register
file cache tags to determine if the operands are in the RFC. Operands not
found in the RFC are fetched over multiple cycles from the MRF as before.
Operands in the cache are fetched during the last stage allocated to MRF ac-
cess. The RFC is multiported and all operands present in the RFC are read
in a single cycle. We do not exploit the potential for reducing pipeline depth
when all operands can be found in the RFC, because this optimization has
a small effect on existing throughput-oriented architectures and workloads.
The tag-check stage does not affect back-to-back instruction latencies, but
adds a cycle to the branch resolution path. Our results show that this ad-
ditional stage does not reduce performance noticeably, as branches do not
dominate in the traces we evaluate.
RFC Allocation: Our baseline RFC design allocates the result of every
operation into the RFC. We explored an extension that additionally allocates
RFC entries for an instruction’s source operands. We found that this policy
results in 5% fewer MRF reads with a large RFC, but also pollutes the RFC,
resulting in 10% to 20% more MRF writes. Such a policy requires additional
RFC write ports, an expense not justified by our results.
RFC Replacement: Prior work on register file caches in the context
of CPUs has used either LRU replacement [33] or a combination of FIFO
and LRU replacement [41] to determine which value to evict when writing
a new value into the RFC. While our baseline RFC design uses a FIFO
replacement policy, our results show that using LRU replacement results in
only an additional 1% to 2% reduction in MRF accesses. Compared to prior
work on CPU register file caching, our RFC can only accommodate very few
entries per thread due to the large thread count of a throughput processor,
reducing the effectiveness of LRU replacement.
RFC Eviction: While the default policy writes all values evicted from
the RFC to the MRF, many of these values will not actually be read again.
In order to elide writebacks of dead values, we consider a combined hard-
ware/software RFC design. We extend our hardware-only RFC design with
compile-time generated static liveness information, which indicates the last
instruction that will read a particular register instance. This information
is passed to the hardware by an additional bit in the instruction encoding.
Registers that have been read for the last time are marked dead in the RFC,
and their values need not be written back to the MRF. This optimization
is conservative and never destroys valid data that could be used in the fu-
ture. Due to uncertain control flow in the application, some values that are
actually dead will be unnecessarily written back to the MRF.
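The allocation, replacement, and eviction policies above can be summarized
in a small per-thread model. This is a sketch under stated assumptions
rather than our simulator: the trace format and the function name are
hypothetical, an overwritten register is assumed to be dropped from the RFC
without a writeback, and values still live at the end of the trace are
conservatively counted as writebacks.

```python
from collections import OrderedDict

def simulate_rfc(trace, entries=6, use_liveness=True):
    """Per-thread RFC model (entries >= 1). Each trace item is
    (dst, srcs, last_use), where last_use is the set of source registers
    read for the last time by that instruction (static liveness).
    Returns (mrf_reads, mrf_writes)."""
    rfc = OrderedDict()   # reg -> dead flag; insertion order is FIFO order
    mrf_reads = mrf_writes = 0

    for dst, srcs, last_use in trace:
        for r in set(srcs):
            if r not in rfc:
                mrf_reads += 1      # miss: read the MRF, do not allocate
            elif use_liveness and r in last_use:
                rfc[r] = True       # last read: mark dead, skip writeback
        if dst is not None:
            rfc.pop(dst, None)      # overwritten value dropped (assumption)
            if len(rfc) == entries:
                _, dead = rfc.popitem(last=False)   # evict oldest (FIFO)
                if not dead:
                    mrf_writes += 1
            rfc[dst] = False        # every result is allocated in the RFC
    # Conservatively write back anything still live at the end of the trace.
    mrf_writes += sum(1 for dead in rfc.values() if not dead)
    return mrf_reads, mrf_writes

trace = [
    ("r1", [], set()),
    ("r2", ["r1"], {"r1"}),
    ("r3", ["r2"], set()),
    ("r4", ["r2", "r3"], {"r2", "r3"}),
    ("r5", ["r0", "r4"], {"r4"}),
]
print(simulate_rfc(trace, entries=2, use_liveness=False))  # (1, 5)
print(simulate_rfc(trace, entries=2, use_liveness=True))   # (1, 1)
```

Without an RFC, this toy trace would perform six MRF reads and five MRF
writes. The numbers illustrate the mechanism only and are not measurements.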
Our results show that six RFC entries per thread capture most of the
register locality while keeping the per-thread storage overhead moderate.
However, with six (4-byte) entries per thread, 32 threads per warp, and 32
warps per SM, the RFC would require 24 kB of additional storage per SM. While
this proposed baseline RFC is five times smaller than the MRF, its large size
limits the potential energy savings.
5.3.2 Two-Level Warp Scheduler
To reduce the storage requirements of the RFC, we introduce a two-level
warp scheduler. The warp scheduler, shown in Figure 5.7(a), is responsible
for keeping the SIMT cores supplied with work in the face of both pipeline
and memory latencies. To hide long latencies, GPUs allocate a large number
of hardware thread contexts for each set of SIMT cores. This large set of
concurrently executing warps in turn increases scheduler complexity, thus in-
creasing area and power requirements. Significant state must be maintained
for each warp in the scheduler, including buffered instructions for each warp.
In addition, performing scheduling among such a large set of candidate warps
necessitates complex selection logic and policies. The scheduler attempts to
hide two distinct sources of latency in the system: (1) long, often unpre-
dictable latencies, such as loads from DRAM or texture operations; and (2)
shorter, often fixed or bounded latencies due to ALU operations, branch reso-
lution, or accesses to the SM’s local shared memory. A large pool of available
warps is required to tolerate latencies in the first group, but a much smaller
pool of warps is sufficient to tolerate common short latencies. The latency
of arithmetic operations and shared memory accesses along with the amount
of per-thread ILP influences the number of threads required to saturate the
hardware. Reducing the set of warps available for selection on a given cycle
can reduce both the complexity and energy overhead of the scheduler. One
important consequence of reducing the number of concurrently active threads
is that it reduces the immediate-term working set of registers.
We propose a two-level warp scheduler that partitions warps into an active
set eligible for execution and an inactive pending set. The smaller set of
active warps hides common short latencies, while the larger pool of pending
warps is maintained to hide long-latency operations and provide fast thread
switching. Figure 5.7 illustrates a traditional single-level warp scheduler and
our proposed two-level warp scheduler. All hardware-resident warps have
entries in the outer level of the scheduler and are allocated MRF entries. The
outer scheduler contains a large set of entries where pending warps may wait
(a) Single-level (b) Two-level
Figure 5.7: Warp schedulers. (a) illustrates a typical single-level scheduler,
where all threads are kept ready and selectable. (b) illustrates our proposed
two-level scheduler. The same number of hardware-resident pending threads
are supported as in (a), but a much smaller set is maintained as active.
on long-latency operations to complete, with the number of pending entries
required primarily influenced by the memory latency to be hidden. The
inner level contains a much smaller set of active warps available for selection
each cycle and is sized such that it can cover shorter latencies due to ALU
operations, branch resolution, shared memory accesses, or cache hits. When
a warp encounters a stall-inducing event, that warp can be removed from
the active set but left pending in the outer scheduler. Introducing a second
level to the scheduler presents a variety of new scheduling considerations for
selection and replacement of warps from the active set.
Scheduling: For a two-level scheduler, we consider two common schedul-
ing techniques: round-robin and greedy. For round-robin, we select a new
ready warp from the active warp pool each cycle using a rotating priority.
For greedy, we continue to issue instructions from a single active warp for as
long as possible, without stalling, before selecting another ready warp. Our
single-level scheduler has the same options, but all 32 warps remain selectable
at all times. We evaluate the effectiveness of these policies in Section 5.5.
Replacement: A two-level scheduler must consider when to remove warps
from the active set. Only warps which are ready or will be ready soon should
be kept in the active set; otherwise, they should be replaced with ready
warps to avoid stalls. Replacement can be done preemptively or reactively,
and depending on the size of the active set and the latencies of key oper-
ations, different policies will be appropriate. We choose to suspend active
warps when they consume a value produced by a long-latency operation. In-
structions marked by the compiler as sourcing an operand produced by a
long-latency operation induce the warp to be suspended to the outer sched-
uler. We consider texture operations and global (cached) memory accesses
as long-latency. This preemptive policy speculates that the value will not
be ready immediately, a reasonable assumption on contemporary GPUs for
both texture requests and loads that may access DRAM. Alternatively, a
warp can be suspended after the number of cycles it is stalled exceeds some
threshold; however, because long memory and texture latencies are common,
we find this strategy reduces the effective size of the active warp set and
sacrifices opportunities to execute instructions. For stalls on shorter latency
computational operations or accesses to shared memory (local scratchpad),
warps retain their active scheduler slot. For different design points, longer
computational operations or shared memory accesses could be triggers for
eviction from the active set.
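The interaction between the two levels can be illustrated with a toy
cycle-level model. It is a sketch under strong assumptions, not our
simulator: one instruction issues per cycle, bandwidth limits are ignored,
and each warp is a non-empty string in a hypothetical one-character
encoding ("a" = independent ALU operation, "d" = ALU operation dependent on
the previous instruction, "L" = long-latency operation, "U" = first use of
that operation's result). Unlike the compiler-marked policy, the sketch
suspends a warp only if the long-latency result is still outstanding.

```python
from collections import deque

ALU_LAT, LONG_LAT = 8, 400   # cycles (Table 5.1)

def run(warps, active_slots):
    """Return the number of cycles needed to issue every instruction."""
    n = len(warps)
    pc, ready, wake = [0] * n, [0] * n, [0] * n
    active = list(range(min(active_slots, n)))   # inner level
    pending = deque(range(len(active), n))       # outer level
    current, cycle, issued = None, 0, 0
    total = sum(len(w) for w in warps)

    while issued < total:
        # Suspend active warps about to consume a value still in flight.
        for w in list(active):
            if warps[w][pc[w]] == "U" and wake[w] > cycle:
                active.remove(w)
                pending.append(w)
        # Outer level: refill free slots, round-robin over pending warps.
        for _ in range(len(pending)):
            if len(active) >= active_slots:
                break
            w = pending.popleft()
            if wake[w] <= cycle:
                active.append(w)
            else:
                pending.append(w)
        # Inner level, greedy: stay on the current warp while it can issue.
        candidates = [w for w in active if ready[w] <= cycle]
        if current not in candidates:
            current = candidates[0] if candidates else None
        if current is not None:
            w = current
            if warps[w][pc[w]] == "L":
                wake[w] = cycle + LONG_LAT
            pc[w] += 1
            issued += 1
            if pc[w] == len(warps[w]):
                active.remove(w)      # warp finished; free its active slot
                current = None
            else:
                dependent = warps[w][pc[w]] == "d"
                ready[w] = cycle + (ALU_LAT if dependent else 1)
        cycle += 1
    return cycle

print(run(["ad", "ad"], active_slots=1))   # 18: latency exposed serially
print(run(["ad", "ad"], active_slots=2))   # 10: second warp fills the gap
print(run(["LU", "LU"], active_slots=1))   # 402: long latencies overlap
```

In the last case a single active slot suffices because the first warp is
suspended to the pending set as soon as it reaches its dependent
instruction, which lets the second warp start its own long-latency
operation one cycle later.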
5.3.3 Combined Architecture
While register file caching and two-level scheduling are each beneficial in
isolation, combining them substantially increases the opportunity for energy
savings. Figure 5.8 shows our proposed architecture that takes advantage
of register file caching to reduce accesses to the MRF while employing a
two-level warp scheduler to reduce the required size of an effective RFC.
Figure 5.9 shows the detailed SM microarchitecture which places private
RFC banks adjacent to each ALU. Instructions targeting the private ALUs
are most common, so colocating RFC banks with each ALU provides the
greatest opportunity for energy reduction. Operands needed by the SFU,
MEM, or TEX units are transmitted from the RFC banks using the operand
routing switch.
To reduce the size of the RFC, entries are only allocated to active warps.
[Diagram labels: Warp Scheduler (Pending Warps, Select, Active Warps), SIMT
Lanes (ALU, SFU, MEM, TEX), RFC, MRF, Shared Memory.]
Figure 5.8: Modified GPU microarchitecture: High-level SM architecture.
MRF with 32 128-bit wide banks, multiported RFC (3R/1W ports per
lane).
Figure 5.9: Modified GPU microarchitecture: Detailed SM
microarchitecture. 4-lane cluster replicated eight times to form 32 wide
machine.
Completed instructions write their results to the RFC according to the poli-
cies discussed in Section 5.3.1. When a warp encounters a dependence on
a long-latency operation, the two-level scheduler suspends the warp and
evicts dirty RFC entries back to the MRF. To reduce writeback energy and
avoid polluting the RFC, we augment the allocation policy described in Sec-
tion 5.3.1 to bypass the results of long-latency operations around the RFC,
directly to the MRF. Allocating entries only for active warps and flushing
the RFC when a warp is swapped out increases the number of MRF accesses
but dramatically decreases the storage requirements of the RFC. Our results
show that combining register file caching with two-level scheduling produces
an RFC that (1) is 21 times smaller than the MRF, (2) eliminates more
than half of the reads and writes to the MRF, (3) has negligible impact on
performance, and (4) reduces register file energy by 36%.
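The storage ratios quoted here follow from simple arithmetic. The sketch
below assumes the design point with six RFC entries per thread and eight
active warps; the helper name is ours.

```python
BYTES_PER_ENTRY = 4          # one 32-bit register value
THREADS_PER_WARP = 32
LANES_PER_SM = 32
MRF_BYTES = 128 * 1024       # main register file per SM

def rfc_bytes(entries_per_thread, warps_with_entries):
    """RFC storage per SM when only the given warps hold RFC entries."""
    return (entries_per_thread * BYTES_PER_ENTRY
            * THREADS_PER_WARP * warps_with_entries)

single_level = rfc_bytes(6, 32)   # entries for all 32 resident warps
two_level = rfc_bytes(6, 8)       # entries for 8 active warps only

print(single_level, two_level)               # 24576 6144
print(round(MRF_BYTES / single_level, 1))    # 5.3
print(round(MRF_BYTES / two_level, 1))       # 21.3
print(two_level // LANES_PER_SM)             # 192 bytes per SIMT lane
```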
5.4 Methodology
As described in Section 5.2.1, we model a contemporary GPU SIMT pro-
cessor, similar in structure to the NVIDIA Fermi streaming multiprocessor
(SM). Table 5.1 summarizes the simulation parameters used for our SM de-
sign. Standard integer ALU and single-precision floating-point operations
have a latency of eight cycles and operate with full throughput across all
lanes. While contemporary NVIDIA GPUs have longer pipeline latencies for
standard operations [35], eight cycles is a reasonable assumption based on
AMD’s GPUs [42]. As with modern GPUs, various shared units operate with
a throughput of less than the full SM SIMT width. Our texture unit has a
throughput of four texture (TEX) instructions per cycle. Special operations,
such as transcendental functions, operate with an aggregate throughput of
eight operations per cycle.
Due to the memory access characteristics and programming style of the
workloads we investigate, we find that system throughput is relatively in-
sensitive to cache hit rates and typical DRAM access latency. Codes make
heavy use of shared memory or texture for memory accesses, using most
DRAM accesses to populate the local scratchpad memory. Combined with
the large available hardware thread count, the meager caches provided by
modern GPUs only minimally alter performance results, especially for graph-
Table 5.1: Simulation parameters

Parameter                        Value
Execution Model                  In-order
Execution Width                  32 wide SIMT
Register File Capacity           128 kB
Register Bank Capacity           4 kB
Shared Memory Capacity           32 kB
Shared Memory Bandwidth          32 bytes / cycle
SM External Memory Bandwidth     32 bytes / cycle
ALU Latency                      8 cycles
Special Function Latency         20 cycles
Shared Memory Latency            20 cycles
Texture Instruction Latency      400 cycles
DRAM Latency                     400 cycles
ics shader workloads. We find the performance difference between no caches
and perfect caches to be less than 10% for our workloads, so we model the
memory system as bandwidth constrained with a fixed latency of 400 cycles.
5.4.1 Workloads
We evaluate 210 real-world instruction traces, described in Table 5.2, taken
from a variety of sources. The traces are encoded in NVIDIA’s native ISA.
Due to the large number of traces we evaluate, we present the majority
of our results as category averages. Compute workloads, including high-
performance and scientific computing, image and video processing, and
simulation, comprise 55 of the traces. The remaining 155 traces represent
important shaders from 12 popular video games published in the last few years.
Shaders are short programs that perform programmable rendering opera-
tions, usually on a per-pixel or per-vertex basis, and operate across very
large datasets with millions of threads per frame.
5.4.2 Simulation Methodology
We employ a custom trace-based simulator that models the SM pipeline
and memory system described in Sections 5.3 and 5.4. When evaluating
register file caching, we simulate all threads in each trace. For two-level
Table 5.2: Trace characteristics

                                                        Avg. Dynamic   Avg.
Category          Examples                      Traces  Warp Insts.    Threads
Video Processing  H264 Encoder,                 19      60 million     99 K
                  Video Enhancement
Simulation        Molecular Dynamics,           11      691 million    415 K
                  Computational Graphics,
                  Path Finding
Image Processing  Image Blur, JPEG              7       49 million     329 K
HPC               DGEMM, SGEMM, FFT             18      44 million     129 K
Shader            12 Modern Video Games         155     5 million      13 K
Total                                           210
scheduling, we simulate execution time on a single SM for a subset of the
total threads available for each workload, selected in proportion to occurrence
in the overall workload. This strategy reduces simulation time while still
accurately representing the behavior of the trace.
5.4.3 Energy Model
We model the energy requirements of several 3-read port, 1-write port RFC
configurations using synthesized flip-flop arrays. We use Synopsys Design
Compiler with both clock-gating and power optimizations enabled and com-
mercial 40 nm high-performance standard cell libraries with a clock target of
1 GHz at 0.9 V. We estimate access energy by performing several thousand
reads and writes of uniform random data across all ports.
Table 5.3 shows the RFC read and write energy for four 32-bit values,
equivalent to one 128-bit MRF entry. We model the main register file (MRF)
as a collection of 32 banks, with each 4 kB bank a 128-bit wide dual-ported
(1 read, 1 write) SRAM. SRAMs are generated using a commercial memory
compiler and are characterized similarly to the RFC for read and write energy
at 1 GHz.
We model wire energy based on the methodology of [43] using the parameters
listed in Table 5.4, resulting in energy consumption of 1.9 pJ per mm for
a 32-bit word. From a Fermi die photo, we estimate the area of a single SM
to be 16 mm² and assume that operands must travel 1 mm from an MRF bank
Table 5.3: RFC area and read/write energy for 128-bit accesses

RFC Entries   4 Active Warps         6 Active Warps         8 Active Warps
per Thread    Area (µm²)  R/W (pJ)   Area (µm²)  R/W (pJ)   Area (µm²)  R/W (pJ)
4             5100        1.2/3.8    7400        1.2/4.4    9600        1.9/6.1
6             7400        1.2/4.4    10800       1.7/5.4    14300       2.2/6.7
8             9600        1.9/6.1    14300       2.2/6.7    18800       3.4/10.9
Table 5.4: Modeling parameters

Parameter                  Value
MRF Read/Write Energy      8/11 pJ
MRF Bank Area              38000 µm²
MRF Distance to ALUs       1 mm
Wire Capacitance           300 fF/mm
Voltage                    0.9 V
Wire Energy (32 bits)      1.9 pJ/mm
to the ALUs. Each RFC bank is private to a SIMT lane, greatly reducing dis-
tance from the RFC banks to the ALUs. The tags for the RFC are located
close to the scheduler to minimize the energy spent accessing them. Sec-
tion 5.5.4 evaluates the impact of wire energy. Overall, we found our energy
measurements to be consistent with previous studies [44] and CACTI [45]
after accounting for differences in design space and process technology.
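One way to relate the wire energy figure to the capacitance and voltage in
Table 5.4 is sketched below. The 0.25 activity factor (the fraction of bits
assumed to make a rising transition on a transfer of random data) is an
assumption made for this sketch and is not a parameter stated in the text.
The 0.2 mm RFC distance is one of the design points listed in Table 5.5.

```python
WIRE_CAP_F_PER_MM = 300e-15   # Table 5.4
VDD = 0.9                     # volts, Table 5.4
ACTIVITY = 0.25               # ASSUMPTION: fraction of bits that rise

def wire_energy_pj(bits, mm):
    """Energy in pJ to move `bits` over `mm` of wire, as activity * C * V^2."""
    return bits * ACTIVITY * WIRE_CAP_F_PER_MM * VDD ** 2 * mm * 1e12

MRF_READ_PJ = 8.0   # 128-bit MRF read, Table 5.4
RFC_READ_PJ = 2.2   # 128-bit RFC read, 6 entries, 8 active warps, Table 5.3

mrf_operand = MRF_READ_PJ + wire_energy_pj(128, 1.0)   # MRF 1 mm from ALUs
rfc_operand = RFC_READ_PJ + wire_energy_pj(128, 0.2)   # RFC 0.2 mm from ALUs

print(f"{wire_energy_pj(32, 1.0):.2f} pJ/mm per 32-bit word")
print(f"{mrf_operand:.2f} pJ per 128-bit MRF read including wires")
print(f"{rfc_operand:.2f} pJ per 128-bit RFC read including wires")
```

With these inputs the 32-bit wire energy comes to roughly 1.9 pJ per mm,
consistent with Table 5.4.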
5.5 Evaluation
This section demonstrates the effectiveness of register file caching and two-
level scheduling on GPU compute and graphics workloads. We first evalu-
ate the effectiveness of each technique individually and then show how the
combination reduces overall register file energy. As power consumption char-
acteristics are specific to particular technology and implementation choices,
we first present our results in a technology-independent metric (fraction of
MRF reads and writes avoided), and then present energy estimates for our
chosen design points.
5.5.1 Baseline Register File Cache
Figures 5.10(a) and 5.10(b) show the percentage of MRF read and write
traffic that can be avoided by the addition of the baseline RFC described
in Section 5.3.1. Even a single-entry RFC reduces MRF reads and writes,
with the knee of the curve at about six entries for each per-thread RFC. At
six RFC entries, this simple mechanism filters 45% to 75% of MRF reads
and 35% to 75% of MRF writes. RFC effectiveness is lowest on HPC traces,
where register values are reused more frequently and have longer average
lifetimes, a result of hand scheduling.
As discussed in Section 5.2.2, many register values are only read a single
time. Figure 5.10(c) shows the percentage of MRF writes avoided when
static liveness information is used to identify the last consumer of a register
value and avoid writing the value back to the MRF on eviction from the
RFC. Read traffic does not change, as liveness information is used only to
avoid writing back dead values. With six RFC entries per thread, the use of
liveness information increases the fraction of MRF write accesses avoided
by 10% to 15%. We present the remaining write traffic results assuming static liveness
information is used to avoid dead value writebacks.
Figure 5.12 plots the reduction in MRF traffic with a 6-entry RFC for
each individual compute trace. For these graphs, each point on the x-axis
represents a different trace from one of the sets of compute applications. The
traces are sorted on the x-axis by the amount of read traffic avoided. The
lines connecting the points serve only to clarify the two categories and do
not imply a parameterized data series. The effectiveness of the RFC is a
function of both the inherent data reuse in the algorithms and the compiler
generated schedule in the trace. Some optimizations such as hoisting improve
performance at the expense of reducing the effectiveness of the RFC. All
of the traces, except for a few hand-scheduled HPC codes, were scheduled
by a production NVIDIA compiler that does not optimize for our proposed
register file cache. To provide insight into shader behavior, Figure 5.11 shows
results for a 6-entry per thread cache for individual shader traces grouped
by games and sorted on the x-axis by MRF accesses avoided. Due to the
large number of traces, individual datapoints are hard to observe, but the
graphs demonstrate variability both within and across each game. Across
all shaders, a minimum of 35% of reads and 40% of writes are avoided,
[Plots. Y-axis: MRF Accesses Avoided (0% to 100%). X-axis: Number of Entries
per Thread (1 to 13). Series: Image, Video, Simulation, Shader, HPC.
Panels: (a) Read accesses avoided; (b) Write accesses avoided; (c) Write
accesses avoided (SW liveness).]
Figure 5.10: Reduction of MRF accesses by baseline register file cache.
[Plots. Y-axis: MRF Accesses Avoided (0% to 100%). Series: Game 1 through
Game 12. Panels: (a) Read accesses avoided; (b) Write accesses avoided
(with liveness info).]
Figure 5.11: Graphics per-trace reduction in MRF accesses with a 6-entry
RFC per thread (one point per trace).
[Plot. Y-axis: MRF Accesses Avoided (0% to 100%). X-axis groups: Video,
Simulation, Image, HPC. Series: Read Traffic Avoided, Write Traffic
Avoided.]
Figure 5.12: Per-trace reduction in MRF accesses with a 6-entry RFC per
thread (one point per trace).
illustrating the general effectiveness of this technique for these workloads.
While the remainder of our results are presented as averages, the same general
trends appear for other RFC configurations.
5.5.2 Two-Level Warp Scheduler
Next, we compare the performance of our two-level warp scheduler against
that of the typical, more complex single-level scheduler. Figure 5.13 shows
SM instructions per clock (IPC) for a scheduler with 32 total warps and a
range of active warps, denoted below each bar. Along with the arithmetic
mean, the graph shows standard deviation across traces for each scheduler
size. The scheduler uses a greedy policy in the inner level, issuing from a
single warp until it can no longer issue without a stall, and uses a round-
robin policy when replacing active warps with ready pending warps from
the outer level. The single-level scheduler (all 32 warps active) issues in
the same greedy fashion as the inner level. A two-level scheduler with eight
active warps achieves nearly identical performance to a scheduler with all
32 warps active, while a scheduler with six active warps incurs a 1% average
performance loss on compute traces and a 5% average performance loss
on graphics shaders.
Figure 5.14 shows a breakdown of both compute and shader traces for a
few key active scheduler sizes. The figure shows an all-warps-active scheduler
along with three smaller active scheduler sizes. A system with eight active
warps achieves nearly the same performance as a single-level warp scheduler,
whereas performance begins to deteriorate significantly with fewer than six
active warps. The effectiveness of six to eight active warps can be attributed
[Chart. Y-axis: IPC (per 32-lane SM), 0 to 32. X-axis groups: HPC, Image,
Simulation, Video, Shader, each with 2, 3, 4, 6, 8, 12, 16, 24, and 32
active warps.]
Figure 5.13: Average IPC with ±1 standard deviation for a range of active
warps.
in part to our pipeline parameters; an 8-cycle pipeline latency is completely
hidden with eight warps, while a modest amount of ILP allows six to perform
nearly as well. Some traces actually see higher performance with fewer active
warps when compared with a fully active warp scheduler; selecting among
a smaller set of warps until a long-latency stall occurs helps to spread out
long-latency memory or texture operations in time.
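A first-order model makes the relationship between active warps, pipeline
latency, and ILP explicit. The formula below is a common back-of-the-
envelope approximation that we assume here for illustration; it is not the
model used by our simulator, and the ilp parameter (average number of
independent instructions a warp can issue before waiting on a result) is
hypothetical.

```python
def issue_utilization(active_warps, alu_latency=8, ilp=1.0):
    """Approximate fraction of issue slots filled when only the ALU
    pipeline latency must be hidden (long-latency stalls are ignored)."""
    return min(1.0, active_warps * ilp / alu_latency)

for warps in (4, 6, 8):
    print(warps, issue_utilization(warps), issue_utilization(warps, ilp=1.5))
```

With no ILP, eight warps are needed to cover an 8-cycle latency, whereas
an average of 1.5 independent instructions per warp lets six warps fill
every issue slot in this approximation.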
For selection among active warps, we compared round-robin and greedy
policies. Round-robin performs worse as the active thread pool size is in-
creased beyond a certain size. This effect occurs because a fine-grained
round-robin interleaving tends to expose long-latency operations across mul-
tiple warps in a short window of time, leading to many simultaneously stalled
warps. For the SPMD code common to GPUs, round-robin scheduling of
active warps leads to consuming easily extracted parallel math operations
without overlapping memory accesses across warps. On the other hand, is-
suing greedily often allows a stall-inducing long-latency operation (memory
or texture) in one warp to be discovered before switching to a new warp,
overlapping the latency with other computation.
[Plots. Y-axis: IPC (per 32-lane SM), 12 to 32. X-axis: traces (sorted).
Series: 4 Active Warps, 6 Active Warps, 8 Active Warps, 32 Active Warps.
Panels: (a) two-level scheduling: compute; (b) two-level scheduling:
shader.]
Figure 5.14: IPC for various active warp set sizes (one point per trace) for
(a) compute and (b) shader traces.
5.5.3 Combined Architecture
Combining register file caching with two-level scheduling produces an effec-
tive combination for reducing accesses to a large register file structure. A
two-level scheduler dramatically reduces the size of the RFC by allocating
entries only to active warps, while still maintaining performance compara-
ble to a single-level scheduler. A consequence of two-level scheduling is that
when a warp is deactivated, its entries in the RFC must be flushed to the
MRF so that RFC resources can be reallocated to a newly activated warp.
Figure 5.15 shows the effectiveness of the RFC in combination with a two-
level scheduler as a function of RFC entries per thread. Compared to the
results shown in Figure 5.10, flushing the RFC entries for suspended warps
increases the number of MRF accesses by roughly 10%. This reduction in
RFC effectiveness is more than justified by the substantial (4× to 6×) re-
duction in RFC capacity requirements when allocating only for active warps.
We explore extending our baseline design by using static liveness
information to identify values that will not be read before the warp is
deactivated and to bypass them around the RFC, directly to the MRF.
Additionally, we use static liveness information to augment the FIFO RFC
replacement policy to first evict RFC values that will not be read before
the next long-latency operation. These optimizations provide a modest 1% to
2% reduction in MRF accesses. However, bypassing these values around the
RFC saves energy by avoiding RFC accesses and reduces the number of RFC
writebacks to the MRF by 30%.
5.5.4 Energy Savings
MRF Traffic Reduction: Figure 5.16 shows the energy consumed in regis-
ter file accesses for a range of RFC configurations, normalized to the baseline
design without an RFC. Each bar is annotated with the amount of storage
required for the RFC bank per SIMT lane. The RFC architectures include
two-level scheduling with four, six, or eight active warps and four, six, or
eight RFC entries per thread. The energy estimates for the RFC config-
urations are based on the results in Section 5.5.3 and include accesses to
the RFC, the MRF, and the RFC/MRF accesses required to flush RFC en-
tries on a thread swap. The addition of a register file cache reduces energy
consumption by 20% to 40% for a variety of design points. Generally, an
[Plots. Y-axis: MRF Accesses Avoided (0% to 100%). X-axis: Number of Entries
per Thread (1 to 13). Series: Image, Video, Simulation, Shader, HPC.
Panels: (a) MRF read accesses avoided; (b) MRF write accesses avoided (SW
liveness).]
Figure 5.15: Effectiveness of RFC when combined with two-level scheduler.
[Figure 5.16 plot: energy and performance normalized to baseline (0 to 1)
for the Compute and Shaders workloads with 4, 6, or 8 active warps and 4, 6,
or 8 entries per thread, plus a Performance line. Bar annotations give the
RFC storage per SIMT lane for 4, 6, and 8 entries per thread: 64, 96, and
128 bytes (4 active warps); 96, 144, and 192 bytes (6 active warps); 128,
192, and 256 bytes (8 active warps).]
Figure 5.16: Energy savings due to MRF traffic reduction. Bars show
register file access energy consumed relative to baseline without RFC (lower
is better); lines show performance (higher is better).
Table 5.5: Combined register file access and wire energy savings for 8-active
scheduler, 6-entry RFC configuration. Normalized to baseline register file
energy consumption.
                                      Normalized Register File Energy
MRF to ALU (mm)    RFC to ALU (mm)    Compute    Shaders
0                  0                  0.76       0.74
1                  1                  0.87       0.86
1                  0.2                0.65       0.63
1                  0                  0.63       0.60
RFC with six entries per thread provides the most energy savings for our
SM design point. A design with six RFC entries per thread and eight active
warps reduces the amount of energy spent on register file accesses by 25%
without a performance penalty. This design requires 192 bytes of storage for
the RFC bank per SIMT lane, for a total of 6,144 bytes per SM. Additional
energy can be saved at the expense of performance for some workloads. In
the 8-active, 6-entry configuration, if every access was to the RFC and the
MRF was never accessed, the maximum idealized energy savings would be
58%.
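The storage figures quoted above follow from simple multiplication. The sketch below reproduces them under two assumptions that are inferred from the quoted totals rather than stated in this section: 4-byte register entries and 32 SIMT lanes per SM. The function and constant names are illustrative only.

```python
def rfc_bytes_per_lane(active_warps, entries_per_thread, bytes_per_entry=4):
    """RFC storage for one SIMT lane.

    Each lane holds one thread from every active warp, and each of those
    threads owns entries_per_thread RFC entries. The 4-byte entry size is
    an assumption inferred from the per-lane sizes quoted in the text.
    """
    return active_warps * entries_per_thread * bytes_per_entry


LANES_PER_SM = 32  # assumption: inferred from 6,144 bytes / 192 bytes per lane

per_lane = rfc_bytes_per_lane(active_warps=8, entries_per_thread=6)
per_sm = per_lane * LANES_PER_SM
print(per_lane, per_sm)  # 192 6144
```

The same function gives 64 bytes per lane for four active warps with four entries per thread, the smallest configuration considered.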
Wire Energy: Our proposed design presents several opportunities to re-
duce the energy expended moving data between the MRF and the ALUs.
MRF banks are shared across four SIMT lanes, forcing operands to be dis-
tributed a greater distance to reach the ALUs compared with the per-lane
RFC banks. The operand buffering used to enable multi-cycle operand col-
lection represents a large piece of logic and interconnect that takes signif-
icant energy to traverse. Operands from the RFC are fetched in a single
cycle, avoiding multi-cycle operand buffering. Further, the RFC banks can
be located much closer to the ALUs, reducing wiring energy for ALU opera-
tions, while not adversely affecting the wiring energy to the shared units. To
evaluate the potential energy savings from operand delivery, we focus on the
expected reduction in wire distance using the energy parameters in Table 5.4.
While Figure 5.16 assumes zero wiring overhead for either the MRF or RFC,
the bottom three rows of Table 5.5 show the normalized register file energy
savings when accounting for both the reduction in MRF accesses and various
wiring distances between the RFC and ALUs, with 8 active warps and 6 RFC
entries per thread. Assuming the MRF is 1 mm from the ALUs, locating the
RFCs 0.2 mm from the ALUs boosts energy savings to 35% for the com-
pute traces and 37% for the shader traces. We expect additional savings to
come from the reduction in multi-cycle operand buffering and multiplexing
of values coming from the MRF, an effect not quantified here.
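As a small check on the percentages quoted above, the sketch below converts the normalized energies of Table 5.5 into savings. The dictionary layout and the helper name are illustrative choices; the values are copied from the table.

```python
# (MRF-to-ALU distance in mm, RFC-to-ALU distance in mm)
#   -> normalized register file energy for (compute, shaders), from Table 5.5
TABLE_5_5 = {
    (0, 0): (0.76, 0.74),
    (1, 1): (0.87, 0.86),
    (1, 0.2): (0.65, 0.63),
    (1, 0): (0.63, 0.60),
}


def savings_percent(normalized_energy):
    """Energy saved relative to the baseline, rounded to a whole percent."""
    return round((1.0 - normalized_energy) * 100)


compute, shaders = TABLE_5_5[(1, 0.2)]
print(savings_percent(compute), savings_percent(shaders))  # 35 37
```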
5.5.5 Discussion
To put our energy savings in context, we present a high-level GPU power
model. A modern GPU chip consumes roughly 130 W [37]. If we assume
that 1/3 of this power is spent on leakage, the chip consumes 87 W of dynamic
power. We assume that 30% of dynamic power is consumed by the memory
system and 70% of dynamic power is consumed by the SMs [46]. This gives
a dynamic power of 3.8 W per SM for a chip with 16 SMs. The register file
conservatively consumes 15% of the SM power [37], about 0.6 W per SM. The
register file system power is split between bank access power, wire transfer
power, and operand routing and buffering power. Our detailed evaluation
shows that our technique saves 36% of register file access and wire energy.
A detailed evaluation of operand buffering and routing power is beyond the
scope of this work. For the purpose of this high-level model, we assume that
our technique saves 36% of the full register file system power, about 0.2 W
per SM, for a chip wide savings of 3.3 W. This represents 5.4% of SM power
and 3.8% of chip-wide dynamic power. While 3.3 W may appear to be a
small portion of overall chip power, today’s GPUs are power limited and
improvements in energy efficiency can be directly translated to performance
increases. Making future high-performance integrated systems more energy
efficient will come from a series of architectural improvements rather than
from a single magic bullet. While each improvement may have a small effect
in isolation, collectively they can be significant.
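The high-level power model above is a chain of multiplications, which the following sketch makes explicit. All input values come from the text; the variable names are illustrative, and the model remains the same coarse estimate described above.

```python
CHIP_POWER_W = 130.0         # approximate total chip power [37]
LEAKAGE_FRACTION = 1 / 3     # assumed share of power lost to leakage
SM_DYNAMIC_FRACTION = 0.70   # share of dynamic power consumed by the SMs [46]
NUM_SMS = 16
RF_FRACTION_OF_SM = 0.15     # conservative register file share of SM power [37]
RF_SAVINGS = 0.36            # savings applied to the full register file system

dynamic_w = CHIP_POWER_W * (1 - LEAKAGE_FRACTION)   # about 87 W
all_sms_w = dynamic_w * SM_DYNAMIC_FRACTION         # dynamic power of all SMs
per_sm_w = all_sms_w / NUM_SMS                      # about 3.8 W
rf_per_sm_w = per_sm_w * RF_FRACTION_OF_SM          # about 0.6 W
saved_per_sm_w = rf_per_sm_w * RF_SAVINGS           # about 0.2 W
saved_chip_w = saved_per_sm_w * NUM_SMS             # about 3.3 W

share_of_sm_power = 100 * saved_chip_w / all_sms_w  # about 5.4%
share_of_dynamic = 100 * saved_chip_w / dynamic_w   # about 3.8%
```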
In addition to the energy savings from simplifying the frequently traversed
datapaths from operand storage to ALUs, the RFC and two-level scheduler
provide two additional opportunities for energy savings. First, an RFC with
6 entries per thread reduces the average MRF bandwidth required per in-
struction by half. We expect that this effect can be leveraged to build a more
energy-efficient MRF with less aggregate bandwidth, without sacrificing per-
formance. Second, the two-level scheduler greatly simplifies scheduling logic
relative to a complex all-warps-active scheduler, an energy-intensive compo-
nent of the SM that must make time-critical selections among a large number
of warps. By reducing the number of active warps, a two-level scheduler also
reduces storage requirements for buffered instructions and scheduling state.
Additionally, reduced ALU latencies can decrease the average short-term
latency that the inner scheduler must hide, allowing fewer active warps to
maintain throughput and further reducing RFC overhead. An ALU latency of four
cycles increases IPC for a 4-active scheduler by 5%, with smaller increases
for 6 and 8 active warps. Finally, prior work on a previous generation GPU has
shown that instruction fetch, decode, and scheduling consume up to 20% of
chip-level power [37], making the scheduler an attractive target for energy
reduction.
5.6 Conclusion
Modern GPU designs are constrained by various requirements, chief among
which is high performance on throughput-oriented workloads such as tradi-
tional graphics applications. This requirement leads to design points which
achieve performance through massive multithreading, necessitating very large
register files. We demonstrate that a combination of two techniques, regis-
ter file caching and two-level scheduling, can provide power savings while
maintaining performance for massively threaded GPU designs. Two-level
scheduling reduces the hardware required for warp scheduling and enables
smaller, more efficient register file caches. Coupled with two-level schedul-
ing, register file caching reduces traffic to the main register file by 40% to
80% with 6 RFC entries per thread. This reduction in traffic along with
more efficient operand delivery reduces energy consumed by the register file
by 36%, corresponding to a savings of 3.3 W of total chip power, without
affecting throughput.
Several opportunities exist to extend this work to further improve energy
efficiency. Rather than rely on a hardware register cache, allocation and
eviction could be handled entirely by software, eliminating the need for RFC
tag checks. The effectiveness of the RFC could be improved by applying code
transformations that are aware of the RFC. Software control in the form
of hints could also be applied to provide more flexible policies for activating
or deactivating warps in our two-level scheduler. As GPUs have emerged as
an important high-performance computational platform, improving energy
efficiency through all possible means including static analysis will be key to
future increases in performance.
CHAPTER 6
HIERARCHICAL MULTITHREADING
Multithreading is introduced to an architecture to improve throughput. There
is a wide design space between the simple, single-threaded processors orig-
inally employed by Rigel and the massively threaded processors employed
by modern GPUs. The level of threading desired for a given system is influ-
enced by a variety of factors, especially for throughput-oriented accelerator
processors. Key applications and application domains have a heavy influ-
ence on the underlying architecture. For instance, graphics workloads stream
through large datasets while rendering each frame and are effectively band-
width limited, leading to large numbers of latency hiding threads in modern
GPUs.
Most multithreaded systems fall into one of two distinct categories:
(1) simple, highly-threaded systems optimized for system throughput at
the expense of single-threaded performance. Examples include modern GPUs,
classic barrel threaded machines, and the Cray XMT. These systems require a
minimum number of threads to be able to issue instructions each cycle, and
performance suffers greatly if thread-level parallelism is not present. The
number of hardware threads selected for these machines may be influenced
by typical memory latencies.
(2) more complex systems that can sustain reasonable performance with
only one thread (or a small number of threads). Modern desktop and server
CPUs generally fit within this class, employing extra hardware threads simply
as a way to improve utilization in the face of limited single-thread ILP.
Some machines fall in between these two extremes. One example is Sun’s
(now Oracle’s) T-series of UltraSPARC server processors. While early
T-series parts had only limited support for ILP and relied on thread-level
parallelism for good performance, more recent parts have dedicated design
area to improving single-threaded performance.
The process for determining the optimal number of threads required for
a particular design varies. This process can be a simple matter of division
for systems with predictable latencies, or can require a complex analysis of
a multi-dimensional tradeoff space. While a particular application or appli-
cation domain may be matched to a specific degree of threading for a given
design implementation, there is no single correct degree of multithreading.
For a throughput-oriented system, the end goal is to select a design point
that maximizes overall throughput, often for a variety of target applications.
We desire a flexible threading architecture for accelerators. We seek to take
advantage of the disjoint latency classes that influence multithreaded designs
to enable a more configurable multithreading paradigm. Our goal is a scalable
multithreading solution rather than the typical point solutions employed by
multithreaded designs today. It is desirable to consider an architecture that
gives the architect knobs for selecting the degree of multithreading.
One knob can adjust the number of threads supported by execution pipelines,
and another knob can select the number of threads available to hide mem-
ory latency. In this way, we can decouple the degree of threading required
for maximizing pipeline utilization from the degree of threading required for
tolerating long-latency operations.
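For predictable latencies, each knob can be given a first-order setting by the kind of division mentioned earlier. The sketch below uses a common textbook model that is an assumption here, not a result of this work: each thread alternates a fixed number of busy issue cycles with a fixed stall. The function name and the example latencies are hypothetical.

```python
import math


def threads_to_hide_latency(stall_cycles, busy_cycles):
    """Threads needed so that some thread is always able to issue.

    First-order model (an assumption): every thread issues for busy_cycles,
    then waits stall_cycles, so one period of busy_cycles + stall_cycles
    must be covered by the busy phases of all threads.
    """
    return math.ceil((busy_cycles + stall_cycles) / busy_cycles)


# Knob 1: threads per pipeline, sized for a short (hypothetical) 4-cycle stall.
pipeline_threads = threads_to_hide_latency(stall_cycles=4, busy_cycles=1)

# Knob 2: total threads, sized for a long (hypothetical) 400-cycle memory
# stall that occurs once every 20 issue cycles.
memory_threads = threads_to_hide_latency(stall_cycles=400, busy_cycles=20)

print(pipeline_threads, memory_threads)  # 5 21
```

The two results differ by a large factor, which is the motivation for sizing the two thread pools independently.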
Multi-level scheduling partitions threads into two classes: (1) active threads
that are issuing instructions or waiting on relatively short latency operations,
and (2) pending threads that are waiting on long memory latencies. The
cycle-by-cycle instruction scheduler need only consider the smaller number
of active threads, enabling a simpler and more energy-efficient scheduler.
In this chapter, we explore opportunities for extending the multi-level
scheduling paradigm to a more general hierarchical multithreading organi-
zation for MIMD accelerator designs such as Rigel. For this work, we
primarily consider the architecture of the Rigel cluster. As described in
Chapter 2, the cluster consists of a collection of processor cores along with a
shared cluster cache. The cluster makes up a primary reusable design element
for the Rigel accelerator architecture, similar to a Streaming Multiprocessor
(SM) in modern GPU architectures.
First, we extend the original Rigel architecture with a new multithreaded
microarchitecture. Next, we propose a novel multithreading paradigm
that gives the architect a flexible way to scale the number of threads to match
the requirements of targeted workloads. Finally, we show that this new mul-
tithreading paradigm can be implemented efficiently while providing more
flexibility to the architect.
6.1 Hierarchical Multithreaded Clusters
We have developed a programmable accelerator processor with the Rigel ar-
chitecture. Separately, we have developed enhancements to another class
of throughput processors, GPUs. The application of similar techniques to
the Rigel architecture presents an opportunity to enhance the baseline
Rigel design.
While modern GPUs rely upon massive multithreading (32 or more threads
per execution unit), the baseline Rigel design relies upon a large number
of parallel cores to provide throughput. To keep design complexity low,
the baseline design implements a single thread per core. Multithreading is
a common technique employed to improve overall throughput and increase
resource utilization. However, simply increasing the thread count does not
come for free. Increased thread count can lead to competition for resources
and increase cache pressure. At the same time, much like GPUs, Rigel is
faced with hiding both short and long latencies.
We propose hierarchical multithreading, or HiMT, within the Rigel ar-
chitecture. Such a design will employ two-level scheduling similar to that
evaluated for GPUs. A small number of threads per core is appropriate for
keeping execution resources saturated and hiding short latencies for events
such as branch resolution and cache hits, while a larger pool of threads can
be maintained for long-stalled threads.
Figure 6.1 provides a high-level illustration of a hierarchically multithreaded
cluster organization.
6.1.1 Scheduling Policies
While a particular set of design decisions was appropriate in the case of
GPUs, a different set of tradeoffs is beneficial in the context of Rigel. In the
case of throughput-oriented designs, we take the position that both stalling
and speculative actions should be avoided in the case that alternate parallel
work is available. Branches, memory accesses, and pipeline stall conditions
present dynamic opportunities for tuning thread-level scheduling decisions
[Figure 6.1 diagram: a cluster with blocks labeled Contexts (RF state),
Schedulable Threads, Pipelines, Interconnection, and Multi-banked Cache.]
Figure 6.1: High-level organization of a hierarchically multithreaded
cluster. Each core has hardware entries for a limited number of threads it
can issue from, while a potentially larger set of hardware register thread
contexts are available.
to improve throughput. Additionally, static opportunities exist to improve
scheduling, including preemptive descheduling of threads when they issue
known long-latency operations such as explicit global cache reads or global
atomic operations.
6.1.2 Context Management
Supporting additional threads on Rigel requires a corresponding increase
in register file capacity for storing thread context. For the area-optimized
in-order cores employed by Rigel, the register file makes up a substantial
fraction of both the area and power. Moving from a single thread to four
threads per core leads to approximately a doubling in per-core area and ap-
proximately a 30% increase in total chip area when using an ASIC design
flow (the cost can be reduced if more efficient custom arrays are designed).
The marginal cost of each thread motivates our investigation of a novel or-
ganization for thread context management, similar in spirit to [47, 48]. We
propose that the secondary thread pool for holding long-latency threads be shared
among all cores within a cluster, allowing fast context switching and dynamic
[Figure 6.2 diagram: blocks labeled Thread Scheduler, Scheduler State, Fetch
Thread Select, Fetch, Instruction Buffering, Schedule, Issue Thread Select,
RF Read, Register Files, Scoreboard State, Integer, Floating Point, Memory,
LDST, L1 D$, Writeback, and Interface to Shared L2.]
Figure 6.2: A diagram of the baseline multithreaded architecture.
throughput-optimized load balancing within the cluster.
6.2 A Multithreaded Microarchitecture for Rigel
Earlier work on the Rigel architecture focused on single-threaded cores. In
this section, we describe a multithreaded MIMD architecture for Rigel. The
microarchitecture is influenced by the design goals for throughput processors:
• Speculation is limited or eliminated, because there is often known non-
speculative work to be done.
• We favor simplicity of implementation rather than maximizing ILP for
a given thread.
Figure 6.2 provides an illustration of the multithreaded microarchitecture
developed for Rigel.
6.2.1 Fetch
Each hardware thread fetches instructions into a FIFO instruction queue.
The FIFO instruction queues provide storage for prefetching of the instruc-
tion streams as well as decoupling the pipeline front-end from schedule and
issue. If a thread does not have an instruction ready to issue from its FIFO,
it will not be selected that cycle.
For this work, we do not consider branch prediction. Branch prediction
is primarily targeted at improving the performance of a given thread, while
throughput-oriented accelerators with many threads provide a source of non-
speculative instructions to issue during branch bubbles. We find branch
prediction to be generally incompatible with our design goals. Further, the
microarchitectures of interest for throughput-oriented designs tend to be sim-
pler, with shorter branch resolution times and thus a smaller penalty in terms
of idle cycles when no non-speculative work exists. Furthermore, branch pre-
dictors consume non-trivial amounts of area and power.
A thread stalls at fetch until a branch resolves, eliminating both speculative
instruction issue and fetch.
By default, the least-recently-used thread is selected for fetch, similar
to [49]. If this thread cannot fetch (for instance, if the FIFO queue is full,
or the thread is stalled on an unresolved branch), the second least-recently-
fetched thread is selected, and so on. In this manner, fetch attempts to keep
a supply of ready instructions available for each thread.
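The fetch selection rule above can be sketched as a walk over threads ordered from least to most recently fetched. This is a behavioral sketch only: the class, the function, and the FIFO depth of four are hypothetical and do not describe the actual Rigel RTL or simulator.

```python
from collections import deque


class FetchThread:
    """Per-thread fetch state: a bounded instruction FIFO and a branch flag."""

    def __init__(self, fifo_depth=4):   # depth is a hypothetical value
        self.fifo = deque()
        self.fifo_depth = fifo_depth
        self.branch_pending = False     # True while a branch is unresolved

    def can_fetch(self):
        return len(self.fifo) < self.fifo_depth and not self.branch_pending


def select_fetch_thread(lru_order, threads):
    """Pick the least recently fetched thread that is able to fetch.

    lru_order lists thread ids from least to most recently fetched and is
    updated in place. Returns None when no thread can fetch this cycle.
    """
    for tid in lru_order:
        if threads[tid].can_fetch():
            lru_order.remove(tid)
            lru_order.append(tid)       # tid is now the most recently fetched
            return tid
    return None


threads = {tid: FetchThread() for tid in range(3)}
order = [0, 1, 2]
threads[0].branch_pending = True              # thread 0 waits on a branch
first = select_fetch_thread(order, threads)   # skips thread 0, picks thread 1
second = select_fetch_thread(order, threads)  # skips thread 0, picks thread 2
```

Thread 0 stays at the head of the order while it is stalled, so it is the first candidate once its branch resolves.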
6.2.2 Scheduling
For this work, we select the LRU policy for scheduling as a more balanced
or fair approach. By default, each core issues from the least-recently-used
hardware thread that has a ready instruction. If the LRU thread cannot issue,
the second least-recently-used thread is selected, and so on. This prevents
long-term starvation of any given thread and ensures forward progress of all
threads.
One alternative instruction issue scheduling policy is GREEDY. Another
alternative is a variant of Round Robin, which is less fair than true LRU.
For GREEDY, instructions are issued from a single thread until a stall is
generated. To guarantee forward progress for every thread, a quantum should
be selected that limits how long a single thread may issue without
interruption.
There are several potential benefits of GREEDY over LRU. First, threads
executing similar code in data-parallel workloads can induce a prefetching
effect when data accesses are regular and colocated tasks touch the same
cache lines. However, prefetching can still occur across threads with LRU.
Second,
GREEDY allows several memory requests from the same thread to be issued
in sequence, potentially (though not necessarily) providing a better request
stream for the memory controller. Finally, and perhaps most importantly,
back-to-back or GREEDY issue from a given thread allows the hardware to take
advantage of ILP that exists within a single thread, reserving the next ready
thread in line for when an actual stall occurs.
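A minimal sketch of GREEDY issue with a quantum follows, assuming that the fallback when the current thread stalls or exhausts its quantum is the LRU order among ready threads. The fallback choice, the function name, and the quantum of two are illustrative assumptions rather than details taken from the design.

```python
def greedy_select(current, run_length, ready, lru_order, quantum):
    """Choose the thread to issue from this cycle under GREEDY.

    current:    thread that issued last cycle, or None
    run_length: consecutive issues already made by current
    ready:      set of thread ids that have a ready instruction
    lru_order:  thread ids, least recently issued first (updated in place)
    Returns (chosen thread or None, new run length).
    """
    if current in ready and run_length < quantum:
        pick, run_length = current, run_length + 1    # keep issuing greedily
    else:
        # Current thread stalled or used up its quantum: fall back to LRU.
        others = [t for t in lru_order if t in ready and t != current]
        if others:
            pick = others[0]
        elif current in ready:
            pick = current                            # nobody else is ready
        else:
            return None, 0                            # pipeline bubble
        run_length = 1
    lru_order.remove(pick)
    lru_order.append(pick)
    return pick, run_length


lru = [0, 1, 2]
current, run = None, 0
trace = []
for _ in range(6):
    current, run = greedy_select(current, run, {0, 1, 2}, lru, quantum=2)
    trace.append(current)
print(trace)  # [0, 0, 1, 1, 2, 2]
```

With all threads always ready, the quantum alone determines when issue moves to the next thread; with a quantum of one the policy degenerates to LRU.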
6.2.3 Bypassing
Bypass networks are frequently used in single-threaded ILP-oriented proces-
sors, but are less prevalent in throughput processors. Bypass networks can
consume significant area and power resources. Figure 6.3 shows the cost in
area of a full bypass network for a synthesized Rigel core microarchitecture
for a single-threaded core and a range of target clock frequencies. This data is
illustrative, as there are numerous pitfalls when comparing synthesized data
at different clock speeds for communication-heavy structures like bypass net-
works. Nonetheless, we see the cost is a non-trivial design consideration.
While the addition (or removal) of bypassing may change the performance
of particular design points, this design choice does not represent a fundamen-
tal component and the proposed multithreaded techniques can apply within
both bypassed and non-bypassed designs. For this work, we omit bypass
networks from the multithreaded microarchitecture.
6.2.4 Load-Store Unit
The original single-threaded in-order pipelines employed by Rigel imple-
mented a simple in-order load-store unit, stalling the pipeline when earlier
memory requests miss in the cache. However, with multiple threads execut-
ing on a single pipeline, simply stalling when a miss occurs is detrimental to
[Figure 6.3 plot, titled Cost of Bypass: area in 1000s of square microns
(axis from 50 to 110) versus synthesized clock frequency (0.8, 1, 1.2, 1.33,
and 1.4 GHz) for the Bypass and NoBypass designs, with annotations of ~10%
and ~20%.]
Figure 6.3: The area cost of a full bypass network on the Rigel core for
various synthesized clock targets.
performance and eliminates much of the benefit of multithreading.
While the memory orderings for each thread may be restricted for simplic-
ity, each pipeline is augmented with a load-store unit that allows memory
requests from different threads to be issued and serviced out of order with
respect to other threads.
Each thread may be assigned one or more entries in the load-store unit.
As shown in Figure 6.2, if a load misses in the L1 cache, the request is
forwarded to the load-store unit. The load-store unit sends a request to the
cluster cache and waits for a reply. In the case of a cluster cache hit, the
data is returned while the original instruction is still in the pipeline and the
instruction retires normally. If the response is delayed, the load-store entry
is maintained until a response is received. When the response is received,
the load-store unit writes back the result to the thread’s register file and
notifies the appropriate thread scheduler. The writeback may be performed
by either adding a dedicated write port for the load-store unit, or by stealing
a write cycle from the pipeline. Depending on the hardware parameters and
memory consistency model implemented, the thread may continue to issue
instructions while waiting on a memory response, or it may stall, yielding to
other threads.
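The miss-handling flow above can be sketched as a table of outstanding requests keyed by request id. This is a behavioral model only; the class and method names are hypothetical, and the default of one entry per thread is just one of the options the text allows.

```python
class LoadStoreUnit:
    """Tracks L1 misses so that threads complete out of order with respect
    to each other."""

    def __init__(self, entries_per_thread=1):
        self.entries_per_thread = entries_per_thread
        self.pending = {}        # request id -> (thread id, destination reg)
        self.next_id = 0

    def outstanding(self, tid):
        return sum(1 for t, _ in self.pending.values() if t == tid)

    def issue_miss(self, tid, dest_reg):
        """Allocate an entry for an L1 miss; None if the thread has none free."""
        if self.outstanding(tid) >= self.entries_per_thread:
            return None
        rid = self.next_id
        self.next_id += 1
        self.pending[rid] = (tid, dest_reg)
        return rid

    def on_response(self, rid, data, register_files, notify_scheduler):
        """Write the returned data to the thread's register file and wake it."""
        tid, dest_reg = self.pending.pop(rid)
        register_files[tid][dest_reg] = data
        notify_scheduler(tid)


lsu = LoadStoreUnit()
register_files = {0: {}, 1: {}}
woken = []
r0 = lsu.issue_miss(0, "r5")
r1 = lsu.issue_miss(1, "r7")
blocked = lsu.issue_miss(0, "r6")   # None: thread 0 already holds its entry
lsu.on_response(r1, 42, register_files, woken.append)  # thread 1 returns first
lsu.on_response(r0, 7, register_files, woken.append)
```

Responses are matched by request id, so the order in which the cluster cache replies does not need to match the order in which the misses were issued.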
6.2.5 ISA Extensions: Synchronization
In parallel systems, we require efficient synchronization mechanisms. Typical
solutions include both software mechanisms (slower, but flexible) and
hardware mechanisms such as the GPU __syncthreads() barrier.
The Rigel Task Model makes extensive use of software synchronization
in the form of barriers. These barriers are implemented efficiently by having
cores poll their local caches and receive broadcast updates. With this tech-
nique, polling traffic is localized to the private L1 cache after an initial cache
fill. While this works reasonably well for single-threaded cores (local polling
introduces no interference with other threads), multithreaded cores cannot
rely on polling for efficient barriers. Repeated polling on a multithreaded core
introduces noise into the instruction stream by allowing a thread to issue in-
structions that do not perform computational work. A variety of solutions for
software synchronization exist in the literature [50], but these are generally
tailored to systems with a different balance of latency and threading than the
manycore, highly threaded accelerator systems we consider here. In the case
of accelerators such as GPUs, the large number of thread contexts and the
potential for frequent, fine-grained synchronization limits the attractiveness
of techniques such as exponential backoff.
While no extension to the instruction set is strictly required to support
multithreading, tightly cooperative threads that synchronize regularly are a
unique requirement for parallel accelerator processors. We find that a few
simple ISA extensions provide an efficient solution to the local
synchronization problem at the Rigel cluster level.
We propose extending the Rigel instruction set with a more explicit
method for synchronizing multiple threads at the cluster level. We introduce
two new flexible software primitives: CBAR.SLEEP and CBAR.WAKEALL. The
CBAR.SLEEP and CBAR.WAKEALL instructions enable efficient synchronization
(cluster local barriers) for a large number of cooperating threads at the clus-
ter level. Most importantly, they allow threads to explicitly sleep rather than
spin when they no longer have useful work to do. Sleeping saves both power
and execution bandwidth relative to the previous spinning implementation.
Table 6.1 summarizes the operation of these new instructions.
The CBAR.SLEEP instruction places the thread executing it into a sleep
state on the thread scheduler. Sleeping threads are ineligible for execution
Table 6.1: ISA extensions for hardware synchronization support

Instruction     Operation
CBAR.SLEEP      Places the thread into a sleep state.
                Fails if this is the last thread awake (not sleeping)
                within the cluster.
                Returns zero into destination register if sleep was
                successful, non-zero if sleep failed.
CBAR.WAKEALL    Resets the sleep state for all threads within the cluster.
                Typically, the last thread awake per cluster will be
                responsible for waking all other sleeping threads
                to implement an efficient local barrier.
                There are no restrictions on which threads must be
                in a sleep state to call CBAR.WAKEALL.
selection. In the case of a multi-level scheduler, sleeping threads are not
eligible to occupy L1 scheduler entries. Upon executing CBAR.SLEEP, a thread
is suspended to the L2 scheduler until it is awoken. Upon successful sleep
initiation, a return code of zero is placed into the destination register specified
by the instruction.
If a thread executes CBAR.SLEEP and is the last remaining thread awake,
the instruction fails and does not place the thread into a sleep state in the
scheduler. Instead, the thread remains active and a non-zero return code is
placed in the specified destination register.
At wakeup after a CBAR.SLEEP instruction, execution resumes at the next
instruction in the program (i.e., the incremented program counter). The
following instruction should examine the result generated, similar in spirit to
a load-link/store-conditional sequence. In this way, it can be determined if
the calling thread was the last to sleep within the cluster, allowing it to take
action in this case. In the case of a hierarchical barrier implementation, the
last thread to sleep can perform the next-level barrier notification.
The CBAR.WAKEALL instruction resets the sleep state for all threads within
a cluster. In implementing a cluster-level barrier, the last thread to call
CBAR.SLEEP will fail and is responsible for issuing CBAR.WAKEALL at the
appropriate time.
However, there are no restrictions on which threads are in the sleep state
before CBAR.WAKEALL may be called. In this manner, this instruction pair may
be used to implement more complex yield or suspend operations for threads. A
variety of extensions or alternate designs can be envisioned. CBAR.WAKEALL
could be modified to take a parameter and wake a specific thread, wake a
subset of threads, or even select at random a requested number of suspended
threads to wake.
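The semantics of the two instructions, and their use as a cluster-local barrier, can be modeled in a few lines. The sketch below is a software model of the behavior summarized in Table 6.1, not an implementation of the hardware; the class and method names are hypothetical, and the failure code of 1 is an assumption, since only a non-zero value is specified.

```python
class ClusterSleepState:
    """Models the per-cluster sleep bits used by CBAR.SLEEP and CBAR.WAKEALL."""

    def __init__(self, num_threads):
        self.asleep = [False] * num_threads

    def cbar_sleep(self, tid):
        """Return 0 if tid went to sleep, non-zero if it is the last awake."""
        awake = self.asleep.count(False)
        if awake == 1 and not self.asleep[tid]:
            return 1                     # assumed failure code (non-zero)
        self.asleep[tid] = True
        return 0

    def cbar_wakeall(self):
        """Reset the sleep state of every thread in the cluster."""
        self.asleep = [False] * len(self.asleep)


cluster = ClusterSleepState(num_threads=4)

# Threads 2, 0, and 3 reach the barrier first and go to sleep.
early = [cluster.cbar_sleep(tid) for tid in (2, 0, 3)]

# Thread 1 arrives last: its sleep fails, so it knows the barrier is complete.
last = cluster.cbar_sleep(1)
sleeping_at_barrier = sum(cluster.asleep)
if last != 0:
    # A hierarchical barrier would notify the next level here before waking.
    cluster.cbar_wakeall()
```

The thread that observes the non-zero result is the natural place to perform any next-level barrier notification before releasing its peers.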
While beyond the scope of this dissertation, we note that a combination
of system software and hardware scheduler features can be used to find ad-
ditional work for sleeping threads if such a model is desired. In this case, all
threads may go to sleep and must receive external wakeup notifications. Such
a scheme could push tasks to sleeping threads once work becomes available,
but requires hardware/software support for identifying and waking sleeping
threads. We limit our design consideration to the case where sleeping threads
have no more useful work to do and are simply waiting for other threads to
reach a barrier.
6.3 Hierarchical Multithreading
In Chapter 5, we described extensions to modern graphics processing units
targeted at improving the efficiency of the thread scheduler. One of these
extensions was the multi-level scheduler. In this section, we extend the multi-
level scheduling paradigm to the Rigel MIMD accelerator architecture.
6.3.1 Architecture
We start with the baseline multithreaded architecture described in Section 6.2
and extend the microarchitecture with a hierarchical two-level thread sched-
uler. We divide the scheduler into two sets: the L1 scheduler holds active
threads, while the L2 scheduler holds pending threads. Figure 6.4 provides a
high-level block diagram of a hierarchically multithreaded cluster, while Fig-
ure 6.5 illustrates the microarchitectural modifications to the core pipeline
for hierarchical scheduling.
The L1 scheduler for each pipeline holds entries for all threads actively
executing on the pipeline. Threads in the L1 scheduler may issue instructions
at any time. The primary goal of the L1 thread scheduler is to keep the
pipeline saturated.
The L2 scheduler holds entries for all hardware-allocated thread contexts.
[Figure 6.4 diagram: eight pipelines connected through a register file
interconnect to per-core thread contexts (register files), with L1 (active)
threads and L2 (pending) threads per core indicated.]
Figure 6.4: Cluster-level organization for a HiMT cluster.
[Figure 6.5 diagram: blocks labeled Thread Scheduler, Scheduler State, L1
Scheduler Entries, L2 Scheduler Entries, Fetch Thread Select, Fetch,
Instruction Buffering, Schedule, Issue Thread Select, RF Read, Register
Files, SB State, Integer, Floating Point, Memory, LDST, L1 D$, Writeback,
and Interface to Shared L2.]
Figure 6.5: Microarchitecture modifications for a two-level thread
scheduler. The L1 scheduler holds state for threads that can currently
issue. Pipeline structures, such as instruction FIFOs, are sized for the
number of L1 threads. The L2 scheduler holds state for all hardware
resident threads. Every hardware thread has an associated register file and
L2 scheduler entry.
Threads which are currently not executing are resident only in the L2 sched-
uler. Threads may wait in the L2 scheduler for long-latency operations, such
as memory requests, to complete. Threads waiting on a CBAR.WAKEALL also
wait in the L2 scheduler. When a thread in the L1 scheduler stalls, or is oth-
erwise selected for replacement, a ready thread in the L2 scheduler may be
selected for execution. The primary goal of the L2 scheduler is to maintain
enough hardware-resident state to hide typical latencies for target workloads.
The design point where the L1 scheduler size and the L2 scheduler size
are the same essentially represents a classic multithreaded microarchitecture,
similar to the baseline, where all threads are of equivalent status.
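The interaction between the two levels can be sketched as follows. For simplicity, this model tracks only the L2-resident threads that are not currently in L1, and it promotes ready threads in FIFO order; both simplifications, along with all names, are assumptions made for illustration.

```python
class TwoLevelScheduler:
    """Behavioral sketch of L1 (active) and L2 (pending) thread sets."""

    def __init__(self, thread_ids, l1_size):
        self.l1_size = l1_size
        self.l1 = list(thread_ids[:l1_size])         # may issue at any time
        self.l2_ready = list(thread_ids[l1_size:])   # runnable, awaiting a slot
        self.l2_waiting = set()                      # blocked on a long latency

    def deactivate(self, tid):
        """An L1 thread stalls on a long-latency event and is replaced."""
        self.l1.remove(tid)
        self.l2_waiting.add(tid)
        self._refill()

    def wake(self, tid):
        """The long-latency event for tid completes."""
        if tid in self.l2_waiting:
            self.l2_waiting.remove(tid)
            self.l2_ready.append(tid)
            self._refill()

    def _refill(self):
        while len(self.l1) < self.l1_size and self.l2_ready:
            self.l1.append(self.l2_ready.pop(0))


sched = TwoLevelScheduler([0, 1, 2, 3], l1_size=2)
sched.deactivate(0)   # thread 2 is promoted in place of thread 0
sched.deactivate(1)   # thread 3 is promoted in place of thread 1
sched.wake(0)         # L1 is full, so thread 0 waits as ready in L2
sched.deactivate(2)   # thread 0 is promoted in place of thread 2
```

Setting l1_size equal to the total number of threads leaves nothing to promote, which corresponds to the classic multithreaded design point described above.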
6.3.2 Scheduling
In addition to selecting which thread (or threads) to issue from each cycle,
the hierarchical architecture needs to select which threads to activate (or
deactivate).
The L1 scheduler for each pipeline attempts to keep the pipeline full by
issuing an instruction each cycle. If a thread becomes unable to issue new
instructions, it becomes a candidate for replacement.
Potential replacement events include both short- and long-latency
operations that introduce bubbles or otherwise stall execution of the thread
for an extended period. The key events that induce an L1 thread replacement
are long-latency memory operations that are either uncached (globals and
global atomics) or that simply miss in the cache (local loads that miss in
the cluster cache). For similar architectures that implement long-latency or
shared functional units, such as the trigonometry or texture units on GPUs,
accesses to these units would make good candidates for L1 replacement. At
present, Rigel does not implement similar units.
Branches are a possible candidate event for L1 thread replacement. As
described in Chapter 5, branches were used for replacement in the GPU
first-level scheduler. However, Rigel has a shorter branch resolution period
than do GPUs. Additionally, we do not presume massive multithreading as we do
with GPUs, so changing threads for a short-latency operation is less
desirable given the different opportunity cost.
Finally, software-based hints or compiler-set bits (potentially based on
profiling) could be employed to explicitly indicate that a thread should be given
lower priority or suspended. For instance, Larrabee employed software-based
yielding when issuing long-latency texture requests [3]. While applicable to
the problem at hand, we leave evaluation of this class of technique to future
work.
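The activation and replacement policy described in this section can be sketched in a few lines of Python. This is an illustrative toy, not the Rigel simulator: the class name TwoLevelScheduler, the round-robin issue order, and the caller-supplied miss_latency are all assumptions made for the example.

```python
from collections import deque


class TwoLevelScheduler:
    """Toy model: a small L1 (active) set backed by an L2 (pending) pool."""

    def __init__(self, num_threads, l1_size):
        self.l1 = deque(range(l1_size))               # threads that may issue
        self.l2 = deque(range(l1_size, num_threads))  # resident but inactive
        # Cycle at which each thread's outstanding stall (if any) ends.
        self.ready_at = {t: 0 for t in range(num_threads)}

    def issue(self, cycle):
        """Pick one ready L1 thread in round-robin order, or None (a bubble)."""
        for _ in range(len(self.l1)):
            tid = self.l1[0]
            self.l1.rotate(-1)  # the next thread gets priority next time
            if self.ready_at[tid] <= cycle:
                return tid
        return None

    def long_latency_stall(self, tid, cycle, miss_latency):
        """Record a long-latency event for an L1 thread and try to replace it.

        Returns the L2 thread that was activated, or None if no L2 thread
        was ready, in which case tid simply waits in L1.
        """
        self.ready_at[tid] = cycle + miss_latency
        for cand in list(self.l2):
            if self.ready_at[cand] <= cycle:
                self.l2.remove(cand)
                self.l1.remove(tid)
                self.l1.append(cand)   # activate the ready L2 thread
                self.l2.append(tid)    # stalled thread waits in L2
                return cand
        return None
```

Setting l1_size equal to num_threads leaves the L2 pool empty, which corresponds to the single-level design point noted earlier, where all threads are of equivalent status.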
6.3.3 Performance Evaluation
Our goal for hierarchical multithreading is to provide performance
equivalent to standard multithreading while enabling a larger opportunity for
design optimization with a two-level scheduler. Figure 6.6 illustrates
performance for a two-level scheduler on a set of Rigel workloads for an
8-way multithreaded core. Performance is normalized to the baseline of a
standard single-level scheduler. All configurations have eight hardware
threads, while the number of concurrently active (L1) threads is limited as
shown. We observe an average performance regression of less than 3% with four
L1 threads enabled, likely within the margin of simulation error. We are thus
able to provide a higher degree of hardware multithreading at a lower level
of complexity. The core pipeline structures need only support a limited
number of threads, while we retain the benefit of a larger set of
hardware-resident threads.
We see that for some benchmarks, having fewer active L1 threads can
actually improve performance. There are several possible explanations for
this. Some threads may effectively prefetch data for other threads that are
currently disabled. Additionally, suspended L2 threads waiting and ready
to execute can quickly be activated when other threads stall, reducing the
possibility that all threads stall concurrently. Finally, for some applications,
more threads are simply not needed for high utilization; in this case, addi-
tional threads simply increase the potential for resource conflicts and cache
thrashing.
[Figure 6.6 plot: bar chart of normalized performance (y-axis, 0 to 1.2) for Stream, DMM, FFT, Collatz, CG, Heat, Stencil, and Average, with one bar each for 1, 2, 3, 4, 6, and 8 L1 threads.]
Figure 6.6: Performance for a two-level hierarchical multithreaded
scheduler.
[Figure 6.7 diagram: a unified thread context (register file) pool connects through a register file interconnect and an operand interconnect to the pipelines of the cluster; each pipeline holds its L1 (active) threads, and a unified L2 scheduler holds the L2 (pending) scheduler entries.]
Figure 6.7: Architecture block diagram for a cluster-level thread pool.
6.4 Thread Pools: Migratory Hierarchical Multithreading
We extend the hierarchical multithreading concept presented previously to
encompass a cluster-level thread pool.
Figure 6.7 illustrates a cluster augmented with a shared thread context
pool. In this organization, hardware thread contexts are grouped logically at
the cluster level into a shared pool of threads. Any thread context may be
assigned to any pipeline within the cluster. Persistent thread context state,
including registers and program counter, is stored at the cluster level, while
intermediate execution state is kept in local structures on the pipeline to
which the given thread context is assigned. In this way, the thread pool enables
a fast hardware context switch across the pipelines within a given cluster.
6.4.1 Register File Organization
A cluster architecture where hardware thread contexts may be bound dy-
namically to different pipelines requires a more complex register file struc-
ture. Each pipeline that can execute a given thread context needs to be
able to access the register storage for that thread. For the most general
case, where any thread context can be bound to any pipeline in the cluster,
a crossbar interconnect between pipelines and register files can provide the
required connectivity and bandwidth.
Figure 6.7 illustrates one way to organize the register file for a cluster-
level pool of hardware thread contexts. Each thread context is assigned its
own multi-ported register file. A crossbar interconnect allows each pipeline
to access any register file in the cluster-level pool. Because a given thread
context can only be assigned to a single pipeline at a time, the interconnect
is conflict free.
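To make the conflict-free argument concrete, the following sketch models the binding of thread contexts to pipelines and checks that no register file is requested by two pipelines in the same cycle. The ThreadPoolBinding class and its method names are hypothetical and chosen only for illustration; they do not describe the hardware control logic.

```python
class ThreadPoolBinding:
    """Tracks which pipeline, if any, each thread context is bound to."""

    def __init__(self, num_contexts, num_pipelines):
        self.num_pipelines = num_pipelines
        self.owner = [None] * num_contexts  # context id -> pipeline id or None

    def bind(self, ctx, pipe):
        """Bind a context to a pipeline; a context has at most one owner."""
        if not 0 <= pipe < self.num_pipelines:
            raise ValueError("no such pipeline")
        if self.owner[ctx] is not None:
            raise ValueError("context already bound to a pipeline")
        self.owner[ctx] = pipe

    def unbind(self, ctx):
        """Release a context so that another pipeline may bind it."""
        self.owner[ctx] = None

    def crossbar_requests(self, issuing):
        """Map each issuing pipeline to the register file it accesses.

        issuing maps pipeline id -> context id selected for issue this cycle.
        A pipeline may only issue from a context it owns, so two pipelines
        can never name the same register file.
        """
        requests = {}
        for pipe, ctx in issuing.items():
            if self.owner[ctx] != pipe:
                raise ValueError("pipeline does not own this context")
            requests[pipe] = ctx
        # Distinct pipelines always map to distinct register files.
        assert len(set(requests.values())) == len(requests)
        return requests
```

Because ownership is exclusive, the crossbar needs no arbitration between pipelines for a given register file; the only serialization point in this model is the bind and unbind step.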
A variety of design alternatives are possible for large, high-bandwidth
register files. GPUs commonly implement register files from a collection of
single-ported or pseudo-dual-ported SRAM macros. While these macros are
individually efficient, complex conflict resolution and buffering hardware is
required to manage bank conflicts. While multi-ported structures are more
costly than the alternatives, they provide a reasonable starting point along
with certain benefits, namely the absence of bank conflicts. This design
choice provides a reasonable upper bound on the expected cost overheads (in
terms of area and power) of this organization while allowing us to focus on
other aspects of the threading design space. Multi-ported structures also
highlight the upper bound for performance.
Previously proposed techniques for improving the efficiency of large register
file structures can be applied to further improve area and energy efficiency.
For instance, rather than separate multi-ported register structures, many
banks of single-ported SRAM can be used to implement register storage.
Conflicts across threads and banks can be handled either in hardware or in
software.
Table 6.2: Crossbar modeling parameters
Parameter                      Value
TSMC 40 nm wire pitch (M4)     140 nm
TSMC MUX4D4 dimensions         5.18 µm x 1.26 µm
Wire energy (32 bits)          1.9 pJ/mm
6.4.2 Cost: Area and Power
As with most architectural techniques that improve flexibility, there is some
tradeoff in cost. As a consequence of allowing a larger set of threads to be
assigned to each core pipeline, a more complex interconnection is required
between the register file and the execution units.
Additionally, we need not constrain ourselves purely within the bounds of
today’s manufacturing technology. While the proposed organization may be
relatively costly when implemented in today’s technology as described, this
design can be considered as a starting point for future work. Future technolo-
gies may be able to provide more efficient, higher bandwidth interconnects,
and the relative costs of gates, wires, and bits of storage continue to evolve
with time. Finally, more efficient architectural techniques may be discovered
to improve the efficiency of this initial architecture proposal.
Interconnect Area and Power Model
We model the area and power for the crossbar register interconnect simi-
lar to [51]. Table 6.2 summarizes the modeling parameters, and Figure 6.8
illustrates the crossbar organization.
For a fully connected crossbar and a 32-thread, 8-core configuration, we
estimate an interconnect overhead of about 0.14 mm², roughly a 5% area
overhead vs. a zero-cost interconnect in the baseline case without thread
pools.
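As a rough plausibility check on these figures, the short calculation below derives the MUX4D4 cell area from Table 6.2, counts the 4:1 multiplexers needed to build a wider multiplexer tree, and backs out the baseline cluster area implied by the quoted overhead. It is not the model of [51]: wiring area and control logic are ignored, the port counts passed to mux_area_um2 are left to the reader, and all helper names are hypothetical.

```python
import math

MUX4_W_UM, MUX4_H_UM = 5.18, 1.26       # MUX4D4 cell dimensions (Table 6.2)
MUX4_AREA_UM2 = MUX4_W_UM * MUX4_H_UM    # about 6.53 square microns


def mux4_count(num_inputs):
    """4:1 mux cells needed to select one of num_inputs signals.

    Each 4:1 mux removes three candidates, so a tree needs
    ceil((num_inputs - 1) / 3) cells.
    """
    return math.ceil((num_inputs - 1) / 3)


def mux_area_um2(num_inputs, num_outputs, bits):
    """Cell area of a mux-based crossbar, ignoring wires and control."""
    return num_outputs * bits * mux4_count(num_inputs) * MUX4_AREA_UM2


def implied_baseline_mm2(overhead_mm2, overhead_fraction):
    """Baseline area implied by an absolute and a relative overhead."""
    return overhead_mm2 / overhead_fraction


print(round(MUX4_AREA_UM2, 2))                     # 6.53
print(mux4_count(32))                              # 11
print(round(implied_baseline_mm2(0.14, 0.05), 2))  # 2.8
```

Under these assumptions, selecting one of 32 register files takes 11 MUX4 cells per output bit, and the quoted 0.14 mm² at 5% implies a baseline cluster area of roughly 2.8 mm².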
Figure 6.9 shows the relative area cost for various cluster components while
scaling the number of thread contexts per core. While this data is based on a
relatively simple model rather than detailed VLSI layout data, it nonetheless
serves to estimate first-order feasibility. The core area projected is optimistic:
core complexity is held constant as threading increases, with the additional
area contribution coming from additional register file resources. At the same
time, the interconnect area estimate is pessimistic: a fully connected crossbar
is assumed. In reality, a less connected, more restricted interconnect may
be sufficient. We observe that approximately 5% of cluster area is due to
interconnect for 32-thread clusters (4/core) and 7% for 64-thread clusters
(8/core).
[Figure 6.8 diagram: five ports (Port0 to Port4), each with a control input and an output; each output bit is selected from the corresponding input bit of every port.]
Figure 6.8: Crossbar model. We model a mux-based, all-to-all crossbar
structure as shown.
6.4.3 Cost Mitigation: Register Reuse
As explored in Chapter 5, one method for reducing cost in terms of power
is to provide local storage for short-term or intermediate results. This can
be done transparently via register file caching, or explicitly with hierarchical
register files. Hierarchical register files are appealing for their efficiency [52],
but require additional software support. For this work, we consider the more
transparent case of register file caches. Register caching can be considered
a reasonable upper bound on energy consumption and area overhead and
a lower bound on effectiveness, providing a good indication of the design’s
potential.
From Figure 6.10, we can see that a large fraction of register values are
consumed only a single time. This observation is consistent both with our
previous work on GPUs [28] and with the literature [40]. Writing these
single-use values back to a large register file structure incurs a large power
overhead. This overhead increases with the size of the register file, increases
with the complexity of the operand interconnect network, and increases as
the proximity of the register file to the execution units decreases. Reducing
register file accesses can be a substantial source of power savings, as
demonstrated in Chapter 5.
[Figure 6.9 plot: area in square microns (y-axis, 0 to 5,000,000) versus threads per core (x-axis: 1, 2, 3, 4, 6, 8), showing CCache Area, XBAR, and Total Core Area (8 cores).]
Figure 6.9: Crossbar area impact for various degrees of multithreading.
Figure 6.11 illustrates that values consumed only once often have a short
lifetime as well. Again, these observations are consistent both with our pre-
vious work with GPUs and with the literature. Most single-use register
values are consumed within a small interval of being produced, within a few
instructions.
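The two measurements behind Figures 6.10 and 6.11 can be illustrated with a small trace analysis. The sketch below is a toy under stated assumptions, not the instrumentation used to produce the figures: each instruction is a hypothetical (dest, sources) tuple, every source operand counts as one read, and the lifetime of a single-use value is the number of instructions between its production and its one read.

```python
from collections import Counter


def reuse_and_lifetime(trace):
    """Return (reads-per-value histogram, lifetime histogram of single-use values).

    trace: list of (dest, sources) tuples, where dest is a register name or
    None and sources is a tuple of register names.
    """
    live = {}             # register -> [producer index, read count, first read index]
    reuse = Counter()     # number of reads -> number of produced values
    lifetime = Counter()  # lifetime in instructions -> number of single-use values

    def retire(entry):
        produced, reads, first_read = entry
        reuse[reads] += 1
        if reads == 1:
            lifetime[first_read - produced] += 1

    for i, (dest, sources) in enumerate(trace):
        for src in sources:
            if src in live:   # reads of values defined before the trace are ignored
                entry = live[src]
                entry[1] += 1
                if entry[2] is None:
                    entry[2] = i
        if dest is not None:
            if dest in live:  # an overwrite ends the old value's life
                retire(live[dest])
            live[dest] = [i, 0, None]
    for entry in live.values():  # values still live when the trace ends
        retire(entry)
    return reuse, lifetime


example = [
    ("r1", ()),            # r1 produced
    ("r2", ("r1",)),       # r1 read once, by the next instruction
    ("r3", ("r2", "r2")),  # r2 read twice
    ("r1", ("r3",)),       # r3 read once; old r1 value overwritten
    (None, ("r1",)),       # new r1 value read once
]
reuse, lifetime = reuse_and_lifetime(example)
print(dict(reuse), dict(lifetime))
```

In the example trace, three of the four produced values are read exactly once, each by the instruction immediately following its producer, which is the pattern the figures report as dominant.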
These two observations indicate a substantial opportunity for cost reduction
in the register operand network. First, we can substantially reduce both
read and write traffic, and thus power, to the register files by keeping values
with limited lifetime and limited reuse near the execution units. Second, as
the required bandwidth to these register file structures is reduced, there is
opportunity to reduce both the complexity and area of the register operand
interconnect.
[Figure 6.10 plot: stacked bars (0% to 100%) for FFT, DMM, CG, Collatz, Heat, and Stencil, with legend categories 0, 1, 2, 3, 4, and 5+.]
Figure 6.10: Register value reuse for a selection of benchmarks. The
majority of values produced into registers are consumed only once.
[Figure 6.11 plot: stacked bars (0% to 100%) for FFT, DMM, CG, Collatz, Heat, and Stencil, with legend categories 1, 2, 3, 4, and 5+.]
Figure 6.11: Lifetime of values consumed only once. Lifetime is measured
as number of instructions between production and consumption. A lifetime
of 1 indicates a value is consumed by the next instruction. Most values are
consumed within a small interval of being produced.
[Figure 6.12 plot: RFC hit rate (y-axis, 0 to 1) for FFT, DMM, CG, Collatz, Heat, and Stencil, with one bar each for RFC sizes 1 through 8.]
Figure 6.12: Register file cache effectiveness for Rigel using a simple FIFO
policy. A small number of register file cache entries capture a large portion
of register file accesses, reducing read and write traffic across the RF
interconnect.
As in Chapter 5, we explore the use of operand caching to reduce register
file traffic. Figure 6.12 illustrates the effectiveness of register file caching
for Rigel using a simple FIFO policy. Even with a simple implementation
and no compiler support or hints, we see that a small number of temporary
registers can capture a large share of register file traffic. Four entries per
thread can service on average over 50% of register file traffic, and six entries
per thread can service over 70% of register file traffic. Better policies,
compiler-directed hints, and explicit compiler management all have the
potential to improve efficiency further.
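A FIFO register file cache of this kind is simple to model. The sketch below is a minimal illustration under stated assumptions, not the configuration evaluated in Figure 6.12: results are allocated into the cache when written, only reads are counted toward the hit rate, and the oldest entry is evicted when the cache is full. The function name and the (dest, sources) trace format are hypothetical.

```python
from collections import OrderedDict


def rfc_hit_rate(trace, entries):
    """Fraction of register reads served by a per-thread FIFO register file cache.

    trace: list of (dest, sources) tuples for one thread; dest may be None.
    entries: number of registers the cache can hold for this thread.
    """
    cache = OrderedDict()  # register -> None, ordered by insertion time
    hits = reads = 0
    for dest, sources in trace:
        for src in sources:
            reads += 1
            if src in cache:
                hits += 1  # FIFO: a hit does not refresh the entry's age
        if dest is not None and entries > 0:
            if dest in cache:
                del cache[dest]            # a rewrite counts as a fresh insertion
            elif len(cache) >= entries:
                cache.popitem(last=False)  # evict the oldest entry
            cache[dest] = None
    return hits / reads if reads else 0.0


example = [
    ("r1", ()),
    ("r2", ("r1",)),
    ("r3", ("r2",)),
    ("r4", ("r1", "r3")),
]
for size in (1, 2, 3):
    print(size, rfc_hit_rate(example, size))
```

Even a single entry captures the producer-to-next-consumer pattern in the example; the remaining miss is the longer-lived r1 value, which only a larger cache retains.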
One potential application of the cluster thread pool is active hardware load
balancing, or relocation of threads between cores. An appropriately configured
cluster-level thread-pool organization can allow a suspended thread in
the L2 scheduler to be resumed on any core within the cluster. Figure 6.13
shows the potential benefits for synthetic task data. In this example, a fixed
set of tasks is assigned to each core. However, the tasks are not uniform;
the fixed allocation results in unbalanced execution time, with performance
limited by the core with the most work. While this particular case can be
solved in software with better work distribution, it is illustrative of the
benefit that dynamic load balancing can afford. In the load-balanced case with
hardware thread pools, dynamically remapping tasks to under-utilized cores
improves overall system throughput.
[Figure 6.13 plot: normalized runtime (y-axis, 0 to 2.5) for the Fixed and Pool configurations.]
Figure 6.13: Synthetic data showing the impact of load imbalance for fixed
task allocation.
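The effect shown in Figure 6.13 can be reproduced qualitatively with a small model. The sketch below compares a fixed per-core task assignment against a shared pool from which whichever core becomes idle first takes the next task. The task lengths are invented for illustration, and the greedy pool policy is an assumption rather than a description of the hardware mechanism.

```python
import heapq


def fixed_runtime(per_core_tasks):
    """Runtime when each core runs only its own pre-assigned tasks."""
    return max(sum(tasks) for tasks in per_core_tasks)


def pooled_runtime(per_core_tasks):
    """Runtime when all tasks sit in one pool and idle cores pull from it."""
    pool = [t for tasks in per_core_tasks for t in tasks]
    finish = [0] * len(per_core_tasks)    # min-heap of per-core finish times
    heapq.heapify(finish)
    for t in pool:
        earliest = heapq.heappop(finish)  # the core that becomes idle first
        heapq.heappush(finish, earliest + t)
    return max(finish)


# Four cores with non-uniform work; the first core has the most.
tasks = [[4, 4, 4, 4], [1, 1, 1, 1], [2, 2, 2, 2], [1, 1, 1, 1]]
print(fixed_runtime(tasks), pooled_runtime(tasks))  # 16 8
```

With the fixed allocation, runtime is set by the most heavily loaded core (16 time units); with the pool, the same 32 units of work spread evenly over four cores and finish in 8.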
6.5 Summary
By extending the Rigel architecture with hierarchical multithreading, we
enable an efficient and scalable multithreading solution for accelerators. Hi-
erarchical multithreading allows accelerator architects to match the desired
degree of threading to their design and domain, while simultaneously enabling
a more efficient implementation with two-level scheduling. We effectively de-
couple the degree of threading from the base pipeline implementation, pro-
viding a flexible architectural paradigm.
We extend the hierarchical multithreading paradigm with the concept of
thread pools. Thread pools serve to further decouple the design space, en-
abling architects to select execution pipeline and total thread counts inde-
pendently. This approach also enables low-cost dynamic context switching
or load balancing across pipelines for shared threads.
While there is a price to be paid for this flexibility, we find that several
techniques can reduce cost. Cached or hierarchical register files can reduce
register file traffic, and thus power, and this in turn can be leveraged to
simplify the operand interconnect. This design serves as a starting point, an
initial architectural proposal that has potential for future enhancement.
CHAPTER 7
RELATED WORK
As a large systems-oriented project, Rigel touches on a broad set of topics,
and a variety of relevant prior work exists.
7.1 Throughput Processors
We consider two broad classes of throughput-oriented architectures:
general-purpose chip multiprocessors (CMPs) and specialized accelerators. CMPs
have been proposed as a way to use increasing transistor
densities to improve performance [53]. General-purpose CMP development
is driven by the need to increase performance while retaining support for
a vast ecosystem of existing multitasking operating systems, programming
models, applications, and development tools. Increasingly, CMPs are inte-
grating additional functionality such as memory controllers and peripheral
bus controllers on-die, converging to an SoC model. Accelerators are designed
to improve performance and reduce power for a specific class of workloads
by exploiting characteristics of the target domain and are optimized for a
narrower class of workloads and programming styles. While general-purpose
processors tend to employ additional transistors to decrease latency along
a single thread of execution, accelerators are architected to maximize ag-
gregate system throughput, often with increased latency for any particular
operation. For many systems, the distinction between CMP and accelerator
becomes somewhat blurred.
Although CMPs do not improve the performance of single threads, the
cores can collectively achieve higher throughput (per unit area) than a wide-
issue superscalar core on multithreaded workloads such as commercial server
workloads. Throughput-oriented CMPs such as Sun’s UltraSPARC T3 [54]
target server workloads with a moderate number of simple, highly
multithreaded cores. Such CMPs achieve relatively high throughput at the expense
of single-threaded performance. However, such designs are ultimately limited
by low memory bandwidth relative to accelerators, heavyweight hardware
cache coherence, and the per-core features and latency reduction techniques
required to meet the needs of general-purpose workloads.
Tilera’s most recent processors [55], based upon earlier work on Raw [56],
feature up to 100 cores in a tiled design with a mesh interconnect, optimized
for message passing and streaming applications. Intel has developed several
experimental mesh-based throughput processors, including the 1 TFLOP
80-core chip [57] and the 48-core Single-chip Cloud Computer. For both Tilera
and Intel, these mesh interconnects are intended for explicit point-to-point
message-passing-based programming. Intel’s prototype mesh designs do not
support cache coherence, while Tilera’s chips do.
Stream processors [58] are programmable accelerators targeted at media
and signal processing workloads with regular, predictable data access pat-
terns. Imagine [59] pioneered the stream processing concept, SPI [60] com-
mercialized it, and stream processing has influenced modern graphics pro-
cessor designs.
The most prominent class of programmable accelerators at present is
graphics processing units (GPUs). GPUs produced by NVIDIA [18, 1, 7] and
AMD [61, 62] are targeted primarily at the raster graphics pipeline, but have
exposed an increasingly flexible substrate for more general-purpose
data-parallel computations. For instance, compared with the first CUDA-capable
GPUs, NVIDIA’s latest Fermi GPUs include a cached memory hierarchy and
accelerated atomic operations, support execution of multiple concurrent
kernels, and reduce the performance penalty for memory gather operations.
Both NVIDIA and AMD GPUs utilize single instruction, multiple thread
(SIMT) style architectures, whereby a single instruction is simultaneously
executed across multiple pipelines with different data. A key aspect of SIMT
designs is their ability to allow control divergence for branching code, whereby
only a subset of a SIMT unit’s pipelines are active. Such designs enable
dense hardware with high peak throughput but work best for regular com-
putations with infrequent divergence. Both NVIDIA and AMD rely upon
programmer-managed scratchpad memories and explicit data transfers for
high performance, though modern variants do implement small non-coherent
caches. GPUs also rely upon high memory bandwidth and thousands of
hardware threads to hide memory latency and improve throughput.
The Cell processor [2], introduced with the PlayStation 3 and also used
in high-performance computing, uses a heterogeneous model with a
multithreaded PowerPC processor and up to eight synergistic processing elements
(SPEs) as coprocessors. SPEs use a programmer-managed scratchpad for
both instruction and data access and implement a SIMD instruction set.
Intel’s Larrabee [3] project approached the accelerator design point with a
fully programmable manycore x86 design, cached memory hierarchy, and
hardware cache coherence. Like Rigel, Larrabee was intended for more
general-purpose parallel programming. Intel AVX extensions [63], derived
from Larrabee, support very wide 256-bit SIMD vectors for parallel pro-
cessing. These wide vectors represent a vastly different design point than
the independent scalar cores of Rigel or the SIMT units of GPUs, requiring
additional programmer or compiler effort for packing and alignment.
More recently, NVIDIA’s proposed Echelon architecture [64] has advocated
a combination of throughput-oriented and latency-oriented cores along with
MIMD-capable execution lanes.
7.2 Coherence
A rich and diverse set of related work exists on the topic of cache coher-
ence [65]. Distributed shared memory (DSM) was developed as a scalable
way to provide the illusion of a single coherent address space across multiple
disjoint memories [66]. A variety of mechanisms such as [67] have been pro-
posed for reducing directory overhead on CMPs, and novel optical networks
have been proposed for 1000-core cache-coherent chips [68]. The Smart Mem-
ories project has examined programmable controllers that can implement a
variety of on-chip memory models including cache coherence [69], while Co-
hesion enables the management of multiple coherence domains simultane-
ously. A more complete treatment of related works on memory models and
cache coherence may be found in [5, 6, 70].
7.3 Atomic Operations
GPUs implement various atomic operations, including add, sub, min, max,
inc, dec, and bitwise operators [71]. Older NVIDIA GPUs perform atomic
operations only on global memory, while newer devices support lower latency,
higher bandwidth atomic operations on SM-local shared memory scratch-
pads. NVIDIA’s Fermi improved global atomic performance with additional
atomic functional units colocated with a globally shared L2 cache [7].
The NYU UltraComputer aimed to reduce congestion and improve perfor-
mance of fetch-and-op operations by implementing a combining network [72].
However, atomic operations from multiple threads must occur close in time
to be combined. Transactional memory provides the ability to make more
generalized atomic updates. An earlier proposal looked at similar hardware
support for parallel reductions in cache-coherent SMPs [21], whereas we
target single-chip accelerators like GPUs with hundreds to thousands of
threads and cores.
7.4 Hierarchical Multithreading
Previous work on ILP-oriented superscalar schedulers has proposed holding
instructions dependent on long-latency operations in separate waiting in-
struction buffers [73] for tolerating cache misses and using segmented [74] or
hierarchical [75] windows. Cyclone proposed a two-level scheduling queue for
superscalar architectures to tolerate events of different latencies [76].
A variety of multithreaded architectures have been proposed in the liter-
ature [77]. The Tera MTA [78] provided large numbers of hardware thread
contexts, similar to GPUs, for latency hiding. Multithreaded machines such
as Sun’s Niagara [49] select among smaller numbers of threads to hide laten-
cies and avoid stalls. AMD GPUs [79, 31] multiplex a small number of very
wide (64 thread) wavefronts issued over several cycles. Tune proposed bal-
anced multithreading, an extension to SMT for CPUs where additional vir-
tual thread contexts are presented to software to leverage memory-level par-
allelism, while fewer hardware thread contexts simplify the SMT pipeline im-
plementation [48]. The shared-thread multiprocessor [47] proposed a shared
pool of thread contexts that could be scheduled to any available core,
allowing dynamic load balancing and fast context switching. Later work [80]
examined another form of two-level scheduling within GPUs. Threads are
organized into fetch groups, which the scheduler rotates among in round-robin
fashion to prevent all threads from reaching the same stall-generating
instructions at the same time. Similar behavior was observed in our work.
Intel’s Larrabee [3] proposed software mechanisms, similar to traditional
software context switching, for suspending threads expected to become idle due
to texture requests. MIT’s Alewife [81] used coarse-grained multithreading,
performing thread context switches only when a thread relied upon a remote
memory access. Mechanisms for efficient context switching have been
proposed which recognize that only a subset of values are live across context
switches [82].
Prior work has found that a large number of register values are only used
once and within a short period of time from when they are produced [40].
Swensen and Patt show that a two-level register file hierarchy can provide
most of the performance benefit of a large register file on scientific codes [83].
Prior work has examined using register file caches in the context of CPUs [33,
41, 84, 85, 86, 87], much of which was focused on reducing the latency of
register file accesses. Rather than reduce latency, we aim to reduce the energy
spent in the register file system. Shioya et al. designed a register cache for
a limited-thread CPU that aims to simplify the design of the main register
file (MRF) to save
area and energy rather than reducing latency [88]. Other work in the context
of CPUs considers register file caches with tens of entries per thread [33].
Since GPUs have a large number of threads, the register file cache must have
a limited number of entries per thread to remain small. Zeng and Ghose
propose a register file cache that saves 38% of the register file access energy
in a CPU by reducing the number of ports required in the main register
file [41]. Each thread on a GPU is executed in-order, removing several of
the challenges faced by register file caching on a CPU, including preserving
register values for precise exceptions [89] and the interaction between register
renaming and register file caching. The ELM project considered a software
controlled operand register file, private to an ALU, with a small number of
entries to reduce energy on an embedded processor [44]. In ELM, entries
were not shared among a large number of threads and could be persistent for
extended periods of time. Past work has also relied on software providing
information to the hardware to increase the effectiveness of the register file
cache [84].
AMD GPUs use clause temporary registers to hold short-lived values dur-
ing a short sequence of instructions that does not contain a change in control-
flow [31]. The software explicitly controls allocation and eviction from these
registers and values are not preserved across clause boundaries. Addition-
ally, instructions can explicitly use the result of the previous instruction [79].
This technique can eliminate register accesses for values that are consumed
only once when the consuming instruction immediately follows the producing
instruction.
7.5 Tools
A variety of tools and techniques exist for evaluating system performance,
area, and power. We motivate our approach by examining the tradeoffs in
existing tools.
Numerous timing simulators are available, ranging from simple models to
complex full-system simulators. Many contemporary simulators are based on
superscalar out-of-order execution [90, 91] and enable high-fidelity modeling
of complex processor cores. However, simulators often employ high levels
of abstraction, obfuscating the relationships between simulated components
and their RTL counterparts; this obfuscation frustrates the use of the simula-
tor as a more flexible golden correctness and performance model for the VLSI
implementation and motivates a less-abstract simulator. Parallel simulators
for parallel machines such as Graphite [92] can achieve higher simulation
throughput than single-threaded simulators, but are generally more difficult
to extend and often employ a relaxed timing model, which may not be ac-
ceptable for performance validation of a new design. Simulators for modeling
large parallel machines have also employed abstract modeling techniques that
leverage the synchronization characteristics of their systems [93]. Such sim-
ulators are useful for evaluating programming models and runtimes, but are
less useful for pre-silicon validation and performance validation. GPGPU-
Sim [23] simulates a restricted SIMD processor. We find no existing simu-
lation infrastructure that both accurately models the scale of parallelism we
target and can evaluate arbitrary application and system code.
Emulation environments, such as FAST [94] or RAMP [95], execute parts of
the timing model on an FPGA to improve performance. While this approach
improves simulation speed, it is not meant to provide synthesizable RTL
models to aid in precise area or power estimation. While FPGA emulation
and timing simulation can provide guidance, their visibility is limited by their
constraints. For instance, the Godson-2 CPU experiences notable deviations
from both FPGA and simulator models [96].
Tools such as McPAT [97] present an integrated modeling framework for
power, area, and timing. However, McPAT relies upon an analytical model
rather than an RTL flow for its physical design feedback and is aimed at
higher-level or more abstract design space exploration than we target. Power
modeling has been shown to be of limited effectiveness without any detailed
implementation information [98].
Commercially, Tensilica provides a synthesizable embedded processor frame-
work with Xtensa [99, 100]. The Tensilica flow allows existing designs to be
extended and provides a compiler and GNU-based toolchain. However, the
Tensilica Xtensa toolset relies upon a fixed baseline ISA with extensions,
rather than a completely flexible ISA specification. OpenRISC [101] pro-
vides a simulator, software stack, and RTL, but these are not targeted at
flexible design space exploration.
Kumar argued that the cores, caches, and interconnect in future multicore
systems cannot be derived independently [102]. To achieve an efficient de-
sign in terms of area and/or power vs. performance, analysis that includes all
aspects of the design is needed rather than piece-by-piece analysis. Promi-
nent researchers [25] have advocated re-evaluation of the traditional fixed
infrastructure stack, including simulators, compilers, and ISAs.
CHAPTER 8
CONCLUSIONS
In this dissertation, we describe current limitations for accelerator processors
and motivate our design of the Rigel MIMD accelerator architecture. We
present Rigel, a 1024-core single-chip accelerator architecture targeted at
data- and task-parallel computation. We show that a baseline design is
implementable in contemporary process technology, within acceptable power
constraints, and provides performance scalability for a variety of benchmarks.
We develop microarchitectural enhancements for a modern
throughput-oriented graphics processing unit. These enhancements enable a reduction in
design complexity and energy consumption without impacting performance.
We extend the baseline Rigel architecture with a multithreaded
microarchitecture. We develop novel multithreading techniques for improving the
efficiency of multithreaded architectures while increasing design flexibility.
We have prepared an open-source release of the Rigel toolset to benefit
the broader research community. It is our hope that we can contribute to a
more open set of tools for collaboration within the field.
8.1 Rigel: Looking to the Future
When the Rigel accelerator architecture was conceived in 2007, we con-
cerned ourselves with the technology parameters and constraints of the time.
Though large dies with billions of transistors are possible in 45 nm technol-
ogy, our target of 1024 cores was extremely aggressive and led to sacrifices in
our design. Rigel was developed as a coprocessor for parallel computation,
an alternative to GPUs, rather than as a complete system. To conserve die
area, we restricted ourselves to 32-bit datapaths and single-precision
floating-point units. As process technology marches on, these limitations can be addressed,
though ultimately at the expense of throughput.
The Rigel architecture achieves its scale in part due to what it omits
compared to GPUs. While GPUs expend substantial die area on graphics-
specific hardware, such as texture units, Rigel omits these, repurposing
the die area for features such as caches and independent processing cores
(approximately 25% of NVIDIA’s Tesla die is devoted to streaming multiprocessors (SMs),
while roughly half of the die area is dedicated to graphics-related hardware such
as texture and raster operations [14]). However, the fraction of GPU die area
devoted to computation units continues to grow. Ultimately, Rigel makes choices that
aim to increase generality. Further work is merited to drive down the cost of
Rigel’s MIMD hardware in comparison with efficient SIMD hardware.
8.1.1 Power
We chose an aggressive design point for Rigel and show that it is feasible
in current process technology within a reasonable power budget, on par with
GPUs of similar peak throughput [14]. We made architectural choices gener-
ally favorable to power efficiency, including the use of small, non-speculative
in-order cores and moderate clock targets. While industry researchers have
advocated future thousand-core chips [103], power consumption for future
large-scale processors nevertheless remains a concern [8]. While some studies
portend challenges for scaling multicore processors in a power-constrained
world of “dark silicon,” others find that thousands of cores are reasonable
for highly parallel workloads such as raytracing. Future massively parallel
processors like Rigel will likely need to conserve power through a variety
of techniques at multiple levels of the technology stack, including process
technology, circuits, architecture, and software.
8.1.2 Off-chip Bandwidth
Though on-chip transistor and bandwidth budgets will likely increase along
with Moore’s law, off-chip memory bandwidth will increase more slowly [8],
becoming a scalability bottleneck for many applications. High-bandwidth
off-chip memory systems are also a major source of power consumption in
high-performance systems, requiring upwards of 30 W to meet the bandwidth
demands of modern GPU systems. Future accelerators will require a careful
consideration of data locality at all levels of the memory hierarchy to make
optimal use of limited off-chip bandwidth. Emerging technologies such as
optical off-chip interconnect or 3D die stacking with through-silicon vias may
provide additional bandwidth to future designs.
8.1.3 System Software Support
Like GPUs, Rigel was originally conceived as a coprocessor, complemen-
tary to a general purpose CPU. Both omit system-level support required
for features such as resource virtualization, multiprogramming, process iso-
lation, and resource management. Recent GPUs implement a form of virtual
memory and allow execution of multiple concurrent kernels in space, though
they generally cannot be time- or space-multiplexed efficiently among mul-
tiple applications. When required, unimplemented functionality must be
emulated at a significant cost by a host processor. Future accelerators will
need to integrate into systems as first-class entities, manage their own re-
sources when possible, and provide the safety and portability guarantees
that enhance programmer productivity and robustness on general-purpose
processors. Providing this additional functionality while maintaining the
performance characteristics of an accelerator is an important problem.
8.1.4 A Tale of Two Laws
Amdahl’s law states that overall system performance is ultimately limited
by the serial portion of a problem. As additional processing elements are
added, performance levels off. Gustafson’s law [104] is a counterpoint to
Amdahl’s, arguing that as more parallel resources become available, larger
problem sizes become feasible. Rather than becoming dominated by the
serial portion of a problem, the parallel portion expands to take advantage
of additional processing capability.
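
As a rough illustration of the contrast between the two laws, the following Python sketch evaluates both speedup expressions. The 5% serial fraction and the processor counts are illustrative assumptions rather than measurements from Rigel, and the function names are our own. Note that in Gustafson's formulation the serial fraction is measured on the parallel system, whereas in Amdahl's it is measured on a single processor.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Fixed problem size: speedup = 1 / (s + (1 - s) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def gustafson_speedup(serial_fraction: float, processors: int) -> float:
    """Scaled problem size: speedup = N - s * (N - 1)."""
    return processors - serial_fraction * (processors - 1)


if __name__ == "__main__":
    s = 0.05  # assumed serial fraction, for illustration only
    for n in (1, 16, 128, 1024):
        print(f"N={n:5d}  Amdahl={amdahl_speedup(s, n):8.2f}  "
              f"Gustafson={gustafson_speedup(s, n):8.2f}")
```

Under these assumptions, the fixed-size speedup saturates below 1/s = 20 regardless of core count, while the scaled speedup continues to grow nearly linearly with the number of processors.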
One possible future for designs such as Rigel is as a parallel computa-
tion fabric for heterogeneous systems. Sufficient die area is now available
such that multiple high-performance CPUs can be integrated onto the same
die as Rigel. This strategy allows a few latency-optimized cores to handle
operating systems, serial code, and latency-sensitive code while offloading
parallel work to a more efficient computational substrate. Evidence of
such an approach can already be seen in the embedded SoC space, where
consumer-oriented visual computing applications are driving demand for
increased performance. AMD’s Fusion and Intel’s on-die integration of GPUs
are also steps in this direction.
8.2 Energy Efficient Multithreading for GPUs
Modern GPU designs are constrained by various requirements, chief among
which is high performance on throughput-oriented workloads such as tradi-
tional graphics applications. This requirement leads to design points which
achieve performance through massive multithreading, necessitating very large
register files. We demonstrate that a combination of two techniques, regis-
ter file caching and two-level scheduling, can provide power savings while
maintaining performance for massively threaded GPU designs. Two-level
scheduling reduces the hardware required for warp scheduling and enables
smaller, more efficient register file caches (RFCs). Coupled with two-level
scheduling, register file caching reduces traffic to the main register file by 40% to
80% with 6 RFC entries per thread. This reduction in traffic along with
more efficient operand delivery reduces energy consumed by the register file
by 36%, corresponding to a savings of 3.3 W of total chip power, without
affecting throughput.
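
To make the mechanism concrete, the following Python sketch models a small per-thread register file cache (RFC) placed in front of the main register file (MRF). It is a toy model under stated assumptions, not the evaluated hardware design: the class name, the FIFO replacement policy, the allocate-on-write behavior, and the instruction trace are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class RegisterFileCache:
    """Toy per-thread RFC; FIFO replacement and write-allocation are assumed."""

    def __init__(self, entries: int = 6):
        self.entries = entries
        self.cache = OrderedDict()  # register id -> None, kept in FIFO order
        self.rfc_hits = 0           # reads served by the RFC
        self.mrf_reads = 0          # reads that fall through to the MRF
        self.mrf_writes = 0         # evicted values written back to the MRF

    def read(self, reg: int) -> None:
        if reg in self.cache:
            self.rfc_hits += 1
        else:
            self.mrf_reads += 1     # assumed: read misses do not allocate

    def write(self, reg: int) -> None:
        if reg in self.cache:
            return                  # overwrite in place, FIFO order unchanged
        if len(self.cache) >= self.entries:
            self.cache.popitem(last=False)  # evict the oldest entry
            self.mrf_writes += 1
        self.cache[reg] = None

    def read_traffic_reduction(self) -> float:
        total = self.rfc_hits + self.mrf_reads
        return self.rfc_hits / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical trace: (destination register, source registers).
    trace = [(1, []), (2, []), (3, [1, 2]), (4, [3, 1]), (5, [4, 3])]
    rfc = RegisterFileCache(entries=6)
    for dest, sources in trace:
        for reg in sources:
            rfc.read(reg)
        rfc.write(dest)
    print(f"MRF reads avoided: {rfc.read_traffic_reduction():.0%}")
```

The sketch captures only the traffic-filtering effect: values that are produced and consumed within a short window never reach the main register file. It does not model energy, banking, or the interaction with the scheduler.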
Several opportunities exist to extend this work to further improve energy
efficiency. Rather than rely on a hardware register cache, allocation and
eviction could be handled entirely by software, eliminating the need for RFC
tag checks. The effectiveness of the RFC could be improved by applying
RFC-aware code transformations. Software control in the form
of hints could also be applied to provide more flexible policies for activating
or deactivating warps in our two-level scheduler. As GPUs have emerged as
an important high-performance computational platform, improving energy
efficiency through all possible means including static analysis will be key to
future increases in performance.
8.3 Hierarchical Multithreading
By extending the Rigel architecture with hierarchical multithreading, we
enable an efficient and scalable multithreading solution for accelerators. Hi-
erarchical multithreading allows accelerator architects to match the desired
degree of threading to their design and domain, while simultaneously enabling
a more efficient implementation with two-level scheduling. We effectively de-
couple the degree of threading from the base pipeline implementation, pro-
viding a flexible architectural paradigm.
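
The decoupling described above can be sketched with a minimal two-level scheduler model in Python. This is an illustrative simplification, not the implemented microarchitecture: the class and method names are hypothetical, issue among active threads is assumed to be round-robin, and a pending thread is assumed to be ready by the time it is promoted.

```python
from collections import deque


class TwoLevelScheduler:
    """Toy model: a small active set issues; remaining threads wait as pending."""

    def __init__(self, thread_ids, active_slots: int):
        ids = list(thread_ids)
        self.active = deque(ids[:active_slots])   # threads eligible to issue
        self.pending = deque(ids[active_slots:])  # threads held in reserve

    def issue(self):
        """Pick the next active thread in round-robin order (None if empty)."""
        if not self.active:
            return None
        tid = self.active.popleft()
        self.active.append(tid)
        return tid

    def stall(self, tid) -> None:
        """Deactivate a thread on a long-latency event and promote another.

        The stalled thread joins the tail of the pending queue before the
        head is promoted, so with no other pending threads it is simply
        reactivated at the tail of the active set.
        """
        self.active.remove(tid)
        self.pending.append(tid)
        self.active.append(self.pending.popleft())


if __name__ == "__main__":
    sched = TwoLevelScheduler(range(8), active_slots=4)
    first = sched.issue()
    sched.stall(first)  # e.g. the thread missed in the cache
    print("active:", list(sched.active), "pending:", list(sched.pending))
```

In this model the total thread count (eight) and the number of active slots (four) are independent parameters, which mirrors the way the degree of threading is separated from the size of the structure that the issue logic must examine each cycle.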
We extend the hierarchical multithreading paradigm with the concept of
thread pools. Thread pools serve to further decouple the design space, en-
abling architects to select execution pipeline and total thread counts inde-
pendently. This approach also enables low-cost dynamic context switching
or load balancing across pipelines for shared threads.
While there is a price to be paid for this flexibility, we find that several
techniques can reduce cost. Cached or hierarchical register files can reduce
register file traffic, and thus power, and this in turn can be leveraged to
simplify the operand interconnect. This initial architectural proposal serves
as a starting point that has potential for future improvements.
REFERENCES
[1] NVIDIA, NVIDIA GeForce 8800 GPU Architecture Overview,
Santa Clara, CA, November 2006. [Online]. Available:
http://www.nvidia.com/object/IO_37100.html
[2] M. Gschwind, “Chip multiprocessing and the Cell broadband engine,”
in Proceedings of the 3rd Conference on Computing Frontiers, 2006,
pp. 1–8.
[3] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey,
S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski,
T. Juan, and P. Hanrahan, “Larrabee: A Many-core x86 Architecture
for Visual Computing,” in International Conference and Exhibition on
Computer Graphics and Interactive Techniques, August 2008, pp. 1–15.
[4] J. H. Kelm, D. R. Johnson, M. R. Johnson, N. C. Crago, W. Tuohy,
A. Mahesri, S. S. Lumetta, M. I. Frank, and S. J. Patel, “Rigel: An
architecture and scalable programming interface for a 1000-core accel-
erator,” in Proceedings of the International Symposium on Computer
Architecture, June 2009, pp. 140–151.
[5] J. H. Kelm, D. R. Johnson, S. S. Lumetta, M. I. Frank, and S. J. Patel,
“A task-centric memory model for scalable accelerator architectures,”
in Proceedings of the International Conference on Parallel Architectures
and Compilation Techniques, September 2009, pp. 77–87.
[6] J. H. Kelm, D. R. Johnson, W. Tuohy, S. S. Lumetta, and S. J. Patel,
“Cohesion: A hybrid memory model for accelerators,” in Proceedings
of the International Symposium on Computer Architecture, June 2010,
pp. 429–440.
[7] NVIDIA’s Next Generation CUDA Compute Architecture: Fermi,
NVIDIA, Santa Clara, CA, December 2009. [Online]. Available:
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[8] International Technology Roadmap for Semiconductors, 2009 Edition,
2009. [Online]. Available: http://itrs.net
[9] D. R. Johnson, M. Johnson, J. Kelm, W. Tuohy, S. Lumetta, and S. Pa-
tel, “Rigel: A 1,024-core single-chip accelerator architecture,” IEEE
Micro, vol. 31, no. 4, pp. 30–41, July-Aug. 2011.
[10] D. R. Johnson, J. H. Kelm, N. C. Crago, M. R. Johnson, W. Tuohy,
W. Truty, S. Kofsky, S. S. Lumetta, W. W. Hwu, M. I. Frank, and
S. J. Patel, “Rigel: A scalable architecture for 1000+ core accelera-
tors,” in Symposium on Application Accelerators in High Performance
Computing, July 2009.
[11] J. H. Kelm, D. R. Johnson, S. S. Lumetta, M. I. Frank, and S. J.
Patel, “A task-centric memory model for accelerator architectures,”
IEEE Micro, vol. 30, no. 1, pp. 2–12, January/February 2010.
[12] J. H. Kelm, D. R. Johnson, W. Tuohy, S. S. Lumetta, and S. J. Patel,
“Cohesion: An adaptive hybrid memory model for accelerators,” IEEE
Micro, vol. 31, no. 1, pp. 42–55, January/February 2011.
[13] L. G. Valiant, “A bridging model for parallel computation,” Commu-
nications of the ACM, vol. 33, no. 8, pp. 103–111, 1990.
[14] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, “NVIDIA
Tesla: A unified graphics and computing architecture,” IEEE Micro,
vol. 28, no. 2, pp. 39–55, 2008.
[15] A. Mahesri, D. Johnson, N. Crago, and S. J. Patel, “Tradeoffs in design-
ing accelerator architectures for visual computing,” in Proceedings of
the International Symposium on Microarchitecture, 2008, pp. 164–175.
[16] M. Frigo, C. E. Leiserson, and K. H. Randall, “The implementation of
the Cilk-5 multithreaded language,” in Proceedings of the ACM SIG-
PLAN ’98 Conference on Programming Language Design and Imple-
mentation, Montreal, Quebec, Canada, June 1998, pp. 212–223.
[17] D. R. Johnson, M. R. Johnson, and S. J. Patel, “Lazy atomic opera-
tions,” in Workshop on Applications for Multi and Many Core Proces-
sors, June 2011.
[18] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel
programming with CUDA,” Queue, vol. 6, no. 2, pp. 40–53, 2008.
[19] The OpenCL Specification, Khronos OpenCL Working Group,
September 2010. [Online]. Available: http://www.khronos.org/
registry/cl/specs/opencl-1.1.pdf
[20] X. Huo, V. Ravi, W. Ma, and G. Agrawal, “Approaches for parallelizing
reductions on modern GPUs,” in International Conference on High
Performance Computing, 2010, pp. 1–10.
[21] M. J. Garzarán, M. Prvulovic, Y. Zhang, J. Torrellas, A. Jula, H. Yu,
and L. Rauchwerger, “Architectural support for parallel reductions in
scalable shared-memory multiprocessors,” in International Conference
on Parallel Architectures and Compilation Techniques, Washington,
DC, 2001, pp. 243–.
[22] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu,
J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell,
M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simula-
tor,” Computer Architecture News, vol. 39, pp. 1–7, Aug. 2011.
[23] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, “Analyzing
CUDA workloads using a detailed GPU simulator,” in Performance
Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International
Symposium on, 26-28 2009, pp. 163–174.
[24] G. F. Diamos, A. R. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: a
dynamic optimization framework for bulk-synchronous applications in
heterogeneous systems,” in International Conference on Parallel Archi-
tectures and Compilation Techniques, New York, NY, 2010, pp. 353–
364.
[25] W. J. Dally, “Moving the needle, computer architecture research in
academe and industry,” in Proceedings of the 37th International Sym-
posium on Computer Architecture, 2010.
[26] C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong
program analysis & transformation,” in CGO ’04: Proceedings of the
International Symposium on Code Generation and Optimization, 2004,
p. 75.
[27] A. Kerr, G. Diamos, and S. Yalamanchili, “Modeling GPU-CPU work-
loads and systems,” in GPGPU ’10: Proceedings of the 3rd Workshop
on General-Purpose Computation on Graphics Processing Units, New
York, NY, 2010, pp. 31–42.
[28] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally,
E. Lindholm, and K. Skadron, “Energy-efficient mechanisms for managing
thread context in throughput processors,” in Proceedings of the
38th Annual International Symposium on Computer Architecture, New
York, NY, 2011, pp. 235–246.
[29] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally,
E. Lindholm, and K. Skadron, “A hierarchical thread scheduler and
register file for energy-efficient throughput processors,” ACM Transac-
tions on Computer Systems, vol. 30, no. 2, pp. 8:1–8:38, Apr. 2012.
[30] W. J. Dally, M. A. Gebhart, D. R. Johnson, S. W. Keckler, J. E. Lind-
holm, and D. Tarjan, “Two-level scheduler for multi-threaded process-
ing,” U.S. Patent Application 13 151 094, 29, 2012.
[31] AMD, “HD 6900 series instruction set architecture,” February 2011.
[Online]. Available: http://developer.amd.com/gpu/amdappsdk/
assets/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf
[32] K. Fatahalian and M. Houston, “A closer look at GPUs,” Communi-
cations of the ACM, vol. 51, no. 10, pp. 50–57, October 2008.
[33] J. Cruz, A. González, M. Valero, and N. P. Topham, “Multiple-banked
register file architectures,” in International Symposium on Computer
Architecture, June 2000, pp. 316–325.
[34] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt,
“Analyzing CUDA workloads using a detailed GPU simulator,” in In-
ternational Symposium on Performance Analysis of Systems and Soft-
ware, April 2009, pp. 163–174.
[35] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and
A. Moshovos, “Demystifying GPU microarchitecture through mi-
crobenchmarking,” in International Symposium on Performance Anal-
ysis of Systems and Software, March 2010, pp. 235–246.
[36] NVIDIA, “Compute unified device architecture programming
guide version 2.0,” http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.pdf,
June 2008.
[37] S. Hong and H. Kim, “An integrated GPU power and performance
model,” in International Symposium on Computer Architecture, June
2010, pp. 280–289.
[38] S. Galal and M. Horowitz, “Energy-efficient floating point unit design,”
IEEE Transactions on Computers, vol. 60, no. 7, pp. 913–922, July
2011.
[39] X. Zhuang and S. Pande, “Resolving register bank conflicts for a net-
work processor,” in International Conference on Parallel Architectures
and Compilation Techniques, September 2003, pp. 269–278.
[40] M. Franklin and G. S. Sohi, “Register traffic analysis for streamlin-
ing inter-operation communication in fine-grain parallel processors,”
in International Symposium on Microarchitecture, November 1992, pp.
236–245.
[41] H. Zeng and K. Ghose, “Register file caching for energy efficiency,”
in International Symposium on Low Power Electronics and Design,
October 2006, pp. 244–249.
[42] AMD, “ATI stream computing OpenCL programming guide,”
http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf, August 2010.
[43] P. Kogge et al., “Exascale computing study: Technology challenges
in achieving exascale systems,” University of Notre Dame, Tech. Rep.
TR-2008-13, September 2008.
[44] J. Balfour, R. Harting, and W. Dally, “Operand registers and explicit
operand forwarding,” IEEE Computer Architecture Letters, vol. 8, no. 2,
pp. 60–63, July 2009.
[45] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, “CACTI
6.0: A tool to model large caches,” HP Laboratories, Tech. Rep., April
2009.
[46] A. S. Leon, B. Langley, and J. L. Shin, “The UltraSPARC T1 proces-
sor: CMT reliability,” in IEEE Custom Integrated Circuits Conference,
September 2007, pp. 555–562.
[47] J. A. Brown and D. M. Tullsen, “The shared-thread multiprocessor,”
in Proceedings of the 22nd Annual International Conference on Super-
computing. New York, NY: ACM, 2008, pp. 73–82.
[48] E. Tune, R. Kumar, D. M. Tullsen, and B. Calder, “Balanced mul-
tithreading: increasing throughput via a low cost multithreading hi-
erarchy,” in International Symposium on Microarchitecture, December
2004, pp. 183–194.
[49] P. Kongetira, K. Aingaran, and K. Olukotun, “Niagara: A 32-way
multithreaded SPARC processor,” IEEE Micro, vol. 25, no. 2, pp. 21–
29, March 2005.
[50] J. M. Mellor-Crummey and M. L. Scott, “Algorithms for scalable syn-
chronization on shared-memory multiprocessors,” ACM Transactions
on Computer Systems, vol. 9, no. 1, pp. 21–65, 1991.
[51] M. Hayenga, D. Johnson, and M. Lipasti, “Pitfalls of ORION-based
simulation,” in Workshop on Duplicating, Deconstructing, and Debunk-
ing, June 2012.
[52] M. Gebhart, S. W. Keckler, and W. J. Dally, “A compile-time managed
multi-level register file hierarchy,” in Proceedings of the 44th Annual
IEEE/ACM International Symposium on Microarchitecture, 2011, pp.
465–476.
[53] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang,
“The case for a single-chip multiprocessor,” in ASPLOS, 1996, pp. 2–
11.
[54] J. Shin, K. Tam, D. Huang, B. Petrick, H. Pham, C. Hwang, H. Li,
A. Smith, T. Johnson, F. Schumacher, D. Greenhill, A. Leon, and
A. Strong, “A 40nm 16-core 128-thread CMT SPARC SoC processor,”
in International Solid-State Circuits Conference, February 2010, pp.
98–99.
[55] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung,
J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao,
C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks,
D. Khan, F. Montenegro, J. Stickney, and J. Zook, “TILE64 processor:
A 64-core SoC with mesh interconnect,” in International Solid-State
Circuits Conference, 2008, pp. 88–598.
[56] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Green-
wald, H. Hoffmann, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf,
M. Seneski, N. Shnidman, V. S. M. Frank, S. Amarasinghe, and
A. Agarwal, “The Raw microprocessor: A computational fabric for
software circuits and general purpose programs,” IEEE Micro, vol. 22,
pp. 25–35, 2002.
[57] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Fi-
nan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote,
and N. Borkar, “An 80-tile 1.28TFLOPS network-on-chip in 65nm CMOS,”
in International Solid-State Circuits Conference, 2007, pp. 98–589.
[58] U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Matt-
son, and J. D. Owens, “Programmable stream processors,” IEEE Com-
puter, vol. 36, no. 8, pp. 54–62, 2003.
[59] J. D. Owens, U. J. Kapasi, P. Mattson, B. Towles, B. Serebrin,
S. Rixner, and W. J. Dally, “Media processing applications on the
Imagine stream processor,” in Proceedings of the IEEE International
Conference on Computer Design, Sep. 2002, pp. 295–302.
[60] SPI, “Stream processing: Enabling the new generation of
easy to use, high-performance DSPs,” 2008. [Online]. Avail-
able: http://www.streamprocessors.com/streamprocessors/resources/
resource/White Paper Enabling The Next Gen.pdf
[61] AMD, “The future is fusion: The industry-changing im-
pact of accelerated computing,” 2008. [Online]. Available:
http://sites.amd.com/jp/Documents/AMD_fusion_Whitepaper.pdf
[62] ATI, ATI Radeon HD 5870 GPU Feature Summary, 2010. [Online].
Available: http://www.amd.com/us/products/desktop/graphics/ati-
radeon-hd-5000/hd-5870/Pages/ati-radeon-hd-5870-
specifications.aspx
[63] Intel, Advanced Vector Extensions Programming Reference, 2010.
[Online]. Available: http://software.intel.com/en-us/avx
[64] S. Keckler, W. Dally, B. Khailany, M. Garland, and D. Glasco, “GPUs
and the future of parallel computing,” IEEE Micro, vol. 31, no. 5, pp.
7–17, Sept.-Oct. 2011.
[65] D. J. Lilja, “Cache coherence in large-scale shared-memory multipro-
cessors: issues and comparisons,” ACM Comput. Surv., vol. 25, pp.
303–338, September 1993.
[66] J. Hennessy, M. Heinrich, and A. Gupta, “Cache-coherent distributed
shared memory: perspectives on its development and future chal-
lenges,” Proceedings of the IEEE, vol. 87, no. 3, pp. 418–429, Mar
1999.
[67] J. Zebchuk, V. Srinivasan, M. K. Qureshi, and A. Moshovos, “A tagless
coherence directory,” in Proceedings of the 42nd Annual IEEE/ACM
International Symposium on Microarchitecture. New York, NY: ACM,
2009, pp. 423–434.
[68] G. Kurian, J. E. Miller, J. Psota, J. Eastep, J. Liu, J. Michel, L. C.
Kimerling, and A. Agarwal, “ATAC: A 1000-core cache-coherent pro-
cessor with on-chip optical network,” in Proceedings of the 19th Inter-
national Conference on Parallel Architectures and Compilation Tech-
niques, 2010, pp. 477–488.
[69] A. Firoozshahian, A. Solomatnikov, O. Shacham, Z. Asgar, S. Richard-
son, C. Kozyrakis, and M. Horowitz, “A memory system design frame-
work: creating smart memories,” in Proceedings of the 36th Annual
International Symposium on Computer Architecture, 2009, pp. 406–
417.
[70] J. H. Kelm, M. R. Johnson, S. S. Lumetta, and S. J. Patel, “Way-
point: Scaling coherence to 1000-core architectures,” in Proceedings of
the International Conference on Parallel Architectures and Compila-
tion Techniques, September 2010, pp. 99–110.
[71] NVIDIA CUDA Programming Guide, 3rd ed., NVIDIA, Feb 2010.
[72] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph,
and M. Snir, “The NYU Ultracomputer—designing a MIMD, shared-
memory parallel machine,” in Proceedings of the 9th Annual Interna-
tional Symposium on Computer Architecture, 1982, pp. 239–254.
[73] A. R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg,
“A large, fast instruction window for tolerating cache misses,” in Inter-
national Symposium on Computer Architecture, May 2002, pp. 59–70.
[74] S. E. Raasch, N. L. Binkert, and S. K. Reinhardt, “A scalable instruc-
tion queue design using dependence chains,” in International Sympo-
sium on Computer Architecture, May 2002, pp. 318–329.
[75] E. Brekelbaum, J. Rupley, C. Wilkerson, and B. Black, “Hierarchical
scheduling windows,” in International Symposium on Microarchitec-
ture, November 2002, pp. 27–36.
[76] D. Ernst, A. Hamel, and T. Austin, “Cyclone: A broadcast-free dynamic
instruction scheduler with selective replay,” in International Symposium
on Computer Architecture, June 2003, pp. 253–263.
[77] T. Ungerer, B. Robič, and J. Šilc, “A survey of processors with explicit
multithreading,” ACM Computing Surveys, vol. 35, no. 1, pp. 29–63,
March 2003.
[78] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield,
and B. Smith, “The Tera computer system,” in International Confer-
ence on Supercomputing, June 1990, pp. 1–6.
[79] AMD, “R600 family instruction set architecture,” Jan-
uary 2009. [Online]. Available: http://developer.amd.com/
gpu_assets/R600_Instruction_Set_Architecture.pdf
[80] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu,
and Y. N. Patt, “Improving GPU performance via large warps and
two-level warp scheduling,” in International Symposium on Microar-
chitecture, New York, NY, 2011, pp. 308–317.
[81] A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz, “APRIL: A
processor architecture for multiprocessing,” in International Sympo-
sium on Computer Architecture, June 1990, pp. 104–114.
[82] P. R. Nuth and W. J. Dally, “A mechanism for efficient context switch-
ing,” in International Conference on Computer Design on VLSI in
Computer and Processors, October 1991, pp. 301–304.
[83] J. A. Swensen and Y. N. Patt, “Hierarchical registers for scientific
computers,” in International Conference on Supercomputing, Septem-
ber 1988, pp. 346–354.
[84] T. M. Jones, M. F. P. O’Boyle, J. Abella, A. González, and O. Ergin,
“Energy-efficient register caching with compiler assistance,” ACM
Transactions on Architecture and Code Optimization, vol. 6, no. 4, pp.
1–23, October 2009.
[85] P. R. Nuth and W. J. Dally, “The named-state register file: implemen-
tation and performance,” in International Symposium on High Perfor-
mance Computer Architecture, January 1995, pp. 4–13.
[86] E. Borch, E. Tune, S. Manne, and J. Emer, “Loose loops sink chips,” in
International Symposium on High Performance Computer Architecture,
February 2002, pp. 299–310.
[87] R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi, “Reducing
the complexity of the register file in dynamic superscalar processors,”
in International Symposium on Microarchitecture, December 2001, pp.
237–248.
[88] R. Shioya, K. Horio, M. Goshima, and S. Sakai, “Register cache system
not for latency reduction purpose,” in International Symposium on
Microarchitecture, December 2010, pp. 301–312.
[89] Z. Hu and M. Martonosi, “Reducing register file power consumption
by exploiting value lifetime characteristics,” Workshop on Complexity-
Effective Design, Vancouver, Canada, June 2000.
[90] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Holl-
berg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, “Simics: A
full system simulation platform,” IEEE Computer, vol. 35, pp. 50–58,
2002.
[91] J. Renau, B. Fraguela, J. Tuck, W. Liu, M. Prvulovic, L. Ceze,
S. Sarangi, P. Sack, K. Strauss, and P. Montesinos, “SESC simulator,”
January 2005. [Online]. Available: http://sesc.sourceforge.net
[92] J. E. Miller, H. Kasture, G. Kurian, C. G. III, N. Beckmann, C. Celio,
J. Eastep, and A. Agarwal, “Graphite: A distributed parallel simulator
for multicores,” in The 16th IEEE International Symposium on High-
Performance Computer Architecture (HPCA), January 2010, pp. 1–12.
[93] G. Zheng, G. Kakulapati, and L. Kale, “BigSim: a parallel simula-
tor for performance prediction of extremely large parallel machines,”
in Parallel and Distributed Processing Symposium, 2004. Proceedings.
18th International, 2004, p. 78.
[94] D. Chiou, D. Sunwoo, J. Kim, N. A. Patil, W. Reinhart, D. E. Johnson,
J. Keefe, and H. Angepat, “FPGA-accelerated simulation technologies
(FAST): Fast, full-system, cycle-accurate simulators,” in Proceedings
of the 40th Annual IEEE/ACM International Symposium on Microar-
chitecture, Washington, DC, 2007, pp. 249–261.
[95] Z. Tan, A. Waterman, H. Cook, S. Bird, K. Asanović, and D. Patterson,
“A case for FAME: FPGA architecture model execution,” in Proceed-
ings of the 37th International Symposium on Computer Architecture,
2010, pp. 290–301.
[96] W.-W. Hu and J. Wang, “Making effective decisions in computer ar-
chitects’ real-world: Lessons and experiences with Godson-2 processor
designs,” Journal of Computer Science and Technology, vol. 23, pp.
620–632, July 2008.
[97] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and
N. P. Jouppi, “McPAT: An integrated power, area, and timing mod-
eling framework for multicore and manycore architectures,” in Pro-
ceedings of the 42nd Annual IEEE/ACM International Symposium on
Microarchitecture, New York, NY, 2009, pp. 469–480.
[98] M. S. S. Govindan, S. W. Keckler, and D. Burger, “End-to-end vali-
dation of architectural power models,” in International Symposium on
Low Power Electronics and Design, New York, NY, 2009, pp. 383–388.
[99] Tensilica, Xtensa Processor Developer’s Toolkit. [Online]. Available:
http://www.tensilica.com/uploads/pdf/HWdev.pdf
[100] Tensilica, Xtensa Software Developer’s Toolkit. [Online]. Available:
http://www.tensilica.com/uploads/pdf/SWdev.pdf
[101] The OpenRISC Project, OpenCores. [Online]. Available:
http://opencores.org/or1k
[102] R. Kumar, V. Zyuban, and D. M. Tullsen, “Interconnections in multi-core
architectures: Understanding mechanisms, overheads and scaling,”
in Proceedings of the International Symposium on Computer Architec-
ture, 2005, pp. 408–419.
[103] S. Borkar, “Thousand core chips: A technology perspective,” in Pro-
ceedings of the 44th Annual Design Automation Conference, 2007, pp.
746–749.
[104] J. L. Gustafson, “Reevaluating Amdahl’s law,” Communications of the
ACM, vol. 31, no. 5, pp. 532–533, 1988.
