An evaluation framework for massively parallel accelerator processors by Johnson, Daniel R.
© 2011 Daniel R. Johnson
AN EVALUATION FRAMEWORK FOR MASSIVELY PARALLEL
ACCELERATOR PROCESSORS
BY
DANIEL R. JOHNSON
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2011
Urbana, Illinois
Adviser:
Associate Professor Sanjay J. Patel
ABSTRACT
In this thesis, I describe the evaluation framework for Rigel, a 1024-core
single-chip accelerator architecture designed for high throughput on visual
computing and scientific workloads. I present an integrated evaluation frame-
work for investigating co-designed architecture, compilers, programming mod-
els, and RTL implementation for massively parallel chip multiprocessors
(CMPs). The research objective of the Rigel project, designing a proto-
type thousand-core chip, led to the development of the framework presented
in this thesis. I describe the tools and techniques which enabled this work.
The goal of this thesis is not to evaluate specific design tradeoffs, but to
describe the tools we have developed for making these decisions. I motivate
and describe our integrated performance simulator, code generator, and RTL
implementation for evaluation of a novel 1024-core accelerator architecture.
I demonstrate the utility of a flexible hardware-software interface support-
ing an evolving ISA for architectural design space exploration. Although I
present experiences related to a particular design, the methods applied and
lessons learned are more broadly applicable. I summarize some of the pub-
lished work which this framework has enabled.
To my wife, Melanie
ACKNOWLEDGMENTS
I would like to acknowledge the team effort applied to the fantastic infras-
tructure and tools the Rigel team has developed over the past several years.
In particular, John H. Kelm for his bootstrapping of the initial software
toolchain, simulator, and his constant push to move things forward. Matt
Johnson, for his work on the simulator and likeminded desire to see things
done “proper.” Bill Tuohy, for lending his experience and support to
bootstrapping an RTL design flow. Voytek, Steve, and Simon, the master’s
students I’ve had the privilege of working with, who have helped pull together
floating point units, processor RTL, random scripting infrastructure, caches,
and parallel software. And, Sanjay for instigating the Rigel project to begin
with.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 Accelerators
1.2 The Case for a Flexible, Integrated Evaluation Framework
1.3 The Rigel Toolflow
CHAPTER 2 THE RIGEL ACCELERATOR ARCHITECTURE
2.1 Related Work: Throughput Processors
2.2 Motivation: Current Accelerator Limitations
2.3 Rigel Accelerator Architecture Overview
2.4 Caching and Memory Model
2.5 Programming Rigel: The Rigel Task Model
CHAPTER 3 THE RIGEL SIMULATOR
3.1 Timing Simulator
3.2 Cluster
3.3 Simulation Automation
CHAPTER 4 THE RIGEL RTL MODEL AND TOOLFLOW
4.1 RTL Infrastructure
4.2 The Rigel Core
4.3 Testing and Verification
4.4 The Rigel Cluster
4.5 GoldMine
CHAPTER 5 INTEGRATED TOOLFLOW WITH IDEA
5.1 Related Work: Existing Tools
5.2 Toolflow Objectives
5.3 The Case for a Flexible, Integrated Evaluation Framework
5.4 Toolflow Components
5.5 Tool Chain Integration: IDEA
CHAPTER 6 EVALUATION
6.1 Benchmark Codes
6.2 Data Sharing for Parallel Applications
6.3 Rigel Scalability
6.4 Area and Power Estimates
6.5 Cluster Design Space
6.6 Coherence Techniques
6.7 IDEA Case Studies
CHAPTER 7 CONCLUSION
7.1 Tools Release
REFERENCES
LIST OF TABLES
2.1 Comparison of Rigel to other contemporary accelerator architectures
2.2 Simulated parameters for the Rigel architecture.
3.1 Cluster Interconnect Organizations
6.1 Description of data- and task-parallel workloads.
6.2 Simulation parameters for the baseline architecture.
6.3 Power, area, and performance comparison of Rigel to accelerators
normalized to 40 nm.
6.4 Area comparison of commercial cores normalized to 40 nm
6.5 CACTI cache area estimates. Bus widths are measured in bits.
LIST OF FIGURES
1.1 IDEA: An integrated toolflow for accelerator design space exploration.
2.1 Block diagram of the Rigel processor.
2.2 Block diagram of the Rigel cluster.
2.3 Block diagram of the Rigel tile and top-level chip organization.
2.4 The Rigel Task Model consists of hierarchical task queues.
Depending on the configuration, cores may produce tasks
into either local or global queues. Groups of tasks are
removed from the global queue and placed into local queues
for faster access and less contention.
2.5 The BSP execution model of the Rigel Task Model. An
interval is defined as the time between two barriers.
4.1 Pipeline block diagram of the Rigel core.
4.2 Early place and route image for the Rigel cluster. Cluster
includes eight cores and eight 8 kB banks of SRAM.
5.1 IDEA: An integrated toolflow for accelerator design space exploration.
5.2 Integrated toolflow components. A header file specifies various
ISA parameters. This file is consumed by IDEA,
which produces components of the simulator, RTL, and
code generation tools.
5.3 Average DRAM latency for our applications varies by a
factor of 10, necessitating a detailed timing model for accurate
performance prediction.
6.1 Characterization of memory accesses in task-based BSP
applications. Input reads and output writes communicate
data across barriers. The majority of memory accesses
are to data that is private to a task. Conflict accesses
share data between two tasks in the same barrier interval,
requiring hardware coherence or special mechanisms like
atomic operations to maintain correctness, but are rare in
our applications.
6.2 Benchmark scalability on Rigel chips with 1, 2, 4, and 8
128-core tiles (128-1024 cores). Speedups are relative to
one eight-core cluster. 128× represents linear scaling at
1024 cores. Benchmark binaries and datasets are identical
across all system sizes; global cache capacity and memory
bandwidth are scaled with the number of tiles.
6.3 Area estimates for the Rigel design
6.4 Simulation cycles across five benchmarks and various configurations.
6.5 Cohesion is a hybrid memory model for accelerators that
enables hardware and software-managed coherence to coexist,
allows data to migrate between the two domains dynamically,
and captures the performance, efficiency, and
programmability benefits of both regimes.
6.6 Performance versus directory cache size for (A) HWcc alone
and (B) Cohesion, which amplifies effective directory size.
6.7 Area and power for core configurations with and without
bypassing, with 5 different L1 cache sizes and at 4 frequency
targets. Data are marked by clock period (ns).
6.8 Performance per watt vs. performance per mm² for 16-
and 32-register configurations.
6.9 Area cost of enabling bypass networks for a range of synthesis
clock targets.
LIST OF ABBREVIATIONS
ISA Instruction Set Architecture
CPU Central Processing Unit
GPU Graphics Processing Unit
MIMD Multiple Instruction Multiple Data
SIMD Single Instruction Multiple Data
SIMT Single Instruction Multiple Thread
VLIW Very Long Instruction Word
CMP Chip Multiprocessor
SoC System on Chip
FPU Floating Point Unit
DRAM Dynamic Random Access Memory
SRAM Static Random Access Memory
ILP Instruction Level Parallelism
BSP Bulk Synchronous Parallel
CHAPTER 1
INTRODUCTION
Computer system designers are challenged by an increasing reliance on paral-
lelism for system performance. As clock-rate scaling has slowed and devices
become power constrained, general purpose processors have shifted towards
multicore devices. Recently, graphics processing units (GPUs) have emerged
as programmable accelerators for data-parallel computation. Finally, the
ascendance of highly programmable mobile devices has moved the design of
systems-on-chip (SoC) to the forefront. Combined, these trends indicate new
opportunities for innovation in programmable high-performance chips while
presenting new challenges for design methodologies. Many such opportunities
require co-design of hardware and software at multiple levels of abstraction,
necessitating a departure from a traditional design methodology in which
hardware and software layers are decoupled from one another as much as
possible. Designers of parallel systems need an efficient way to explore mul-
tiple levels of the stack, from architecture to compilers to implementation,
in order to evaluate the impact of cross-cutting design decisions.
This thesis advocates a comprehensive, integrated methodology for investi-
gating problems spanning the areas of architecture, compilers, programming
models, and VLSI design. A flow that spans these levels by integrating
several traditional tools into a common framework effectively incorporates
design knowledge from across a project in a way that enables designers at
every level to gather more accurate information without waiting for feedback
from other parts of the team. Prominent researchers [1] have similarly ad-
vocated re-evaluation of the traditional fixed infrastructure stack, including
simulators, compilers, and ISAs. As a case study, I introduce a comprehen-
sive and flexible integrated design flow we developed to enable exploration
of a relatively new space in system design, massively parallel programmable
accelerator processors. This framework has been used to develop a novel
1024-core accelerator architecture, which I present along with case studies
from our experiences.
1.1 Accelerators
An insatiable appetite for performance on compute intensive data-parallel
workloads in visual and scientific computing has driven the design of mas-
sively parallel compute accelerators. Compute accelerators are designed to
improve performance and reduce power consumption for particular classes
of workloads. They do so by exploiting characteristics of their target
domain and limiting the generality of the programming model. While
general-purpose processors tend to employ additional transistor resources to
minimize latency (sec/operation), accelerators are designed to maximize
throughput (operations/sec). Contemporary accelerators include graphics
processing units (GPUs) [2], Cell [3], and Larrabee [4]. For a more complete
discussion of throughput-oriented CMP and accelerator architectures, see
Section 2.1.
1.1.1 Motivation: Current Accelerator Limitations
Current accelerators generally expose restricted programming models which
yield high performance for data-parallel applications with rigidly structured
computation and memory access patterns, but present a more difficult tar-
get for less regular parallel applications. The throughput-oriented archi-
tectural design choices of accelerators often compromise the generality of
the programming model. For instance, accelerators commonly achieve high
throughput through the use of SIMD processing elements. For dense or
regular data-parallel computations, SIMD enables an efficient hardware im-
plementation. However, for applications that do not naturally map to the
SIMD execution model, programmers must adapt their algorithms or suffer
reduced efficiency, limiting the scope of applications which can achieve good
performance. Software-managed scratchpad memories yield denser hardware,
tighter access latency guarantees, and consume less power than caches; how-
ever, they impose additional burden on either the programmer or software
tools. Multiple address spaces also increase the burden of development.
1.1.2 Rigel: A Programmable Manycore Accelerator
The Rigel architecture was conceived as an attempt to address some of the
shortcomings of parallel computation accelerators while pushing the envelope
on throughput-oriented designs. Broadly, the goals of the project are to:
• Determine the feasibility of a single-chip, massively parallel MIMD ac-
celerator architecture with 1024 cores
• Achieve high computation density, or throughput, in terms of
(operations/sec)/mm²
• Determine how to organize such a device to be programmer-friendly
• Present a more general target to developers, increasing the scope of
parallel applications which can target the design
These goals drove our development of Rigel [5], a 1024-core single-chip
accelerator architecture that targets a broad class of data- and task-parallel
computation. With the Rigel design, we aim to strike a balance between raw
performance and ease of programmability by adopting programming inter-
face elements from general-purpose processors. Rigel is composed of 1024
independent, hierarchically organized cores that use a fine-grained MIMD
execution model. Rigel presents a single global address space and a fully
cached memory hierarchy. Parallel work is expressed in a task-centric, bulk-
synchronized manner using minimal hardware support. Compared with exist-
ing accelerators, which contain domain-specific hardware, specialized mem-
ories, and restrictive programming models, Rigel is more flexible and aims
to provide a more straightforward target for a broader set of applications.
The Rigel architecture is described in more detail in Chapter 2.
1.2 The Case for a Flexible, Integrated Evaluation
Framework
In addition to performance, most processor designers today are concerned
with area and power. Architects are traditionally well-equipped to evaluate
performance, but less so for area and power, often relying on previous designs
or analytical models.
Figure 1.1: IDEA: An integrated toolflow for accelerator design space
exploration.
When exploring a new design space, in our case 1000+ core CMPs, it is
difficult to foresee all potential challenges, and we encountered many non-
obvious design decision questions throughout the process. Ultimately, per-
formance in silicon implementation can be difficult to predict without actual
implementation details. We find it valuable to consider physical design con-
straints proactively, during the definition of the architecture, rather than
reactively during implementation. For these reasons, we choose to pursue a
custom integrated design flow employing measurement rather than modeling
whenever possible. With an accurate RTL model, we can better quantify
design tradeoffs and their impact on area, power, and frequency.
For designers targeting new design spaces, an agile methodology that
enables ISA design space exploration can be a powerful aid in evaluating
power, area, and performance tradeoffs, or simply in extending the design
with additional features. An integrated evaluation framework allows
designers to consider the impact of design decisions that extend beyond
traditional boundaries.
1.3 The Rigel Toolflow
The development and evaluation of the Rigel architecture required a large
undertaking in tools development. Developing a clean-slate design, the Rigel team
set out to construct a complete simulation, code-generation, and hardware
development framework to enable our work.
Our collection of tools is tightly integrated, allowing us to analyze
performance, area, and power, evaluate ISA changes, and perform correctness and
performance validation. Figure 1.1 illustrates the primary components of our
integrated design flow. The design flow includes a cycle-accurate full-system
performance simulator, an RTL flow for detailed area and power analysis,
a customized LLVM [6] compiler backend, a set of standard C and parallel
runtime libraries, and a key unifying component called IDEA. IDEA is a
utility application that takes a set of configuration parameters defining the
ISA and machine model and produces Verilog and C++ source code used
by our simulator, compiler, and RTL, allowing us to rapidly change these
components while minimizing cross-tool inconsistencies.
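To illustrate the single-source idea behind IDEA, the following minimal Python sketch emits a C++ enum and matching Verilog `define lines from one opcode table. The opcode names, encodings, field width, and function names are hypothetical; they are not taken from the Rigel ISA or from IDEA itself. The sketch shows only why generating both artifacts from one specification helps keep the simulator and the RTL consistent.

```python
# Hypothetical single-source ISA table: mnemonic -> opcode encoding.
# These names and values are illustrative assumptions, not the Rigel ISA.
ISA_OPCODES = {"ADD": 0x01, "SUB": 0x02, "LDW": 0x10, "STW": 0x11}
OPCODE_BITS = 6  # assumed width of the opcode field


def check_isa(opcodes, bits):
    """Reject duplicate or out-of-range encodings before generating anything."""
    values = list(opcodes.values())
    if len(set(values)) != len(values):
        raise ValueError("duplicate opcode encoding")
    if any(v < 0 or v >= (1 << bits) for v in values):
        raise ValueError("opcode does not fit in the opcode field")


def emit_cpp_enum(opcodes):
    """Return C++ enum text that a simulator or compiler backend could include."""
    body = "\n".join(f"  OP_{name} = 0x{val:02x}," for name, val in opcodes.items())
    return "enum Opcode {\n" + body + "\n};\n"


def emit_verilog_defines(opcodes, bits):
    """Return Verilog `define constants that an RTL decoder could include."""
    return "".join(
        f"`define OP_{name} {bits}'h{val:02x}\n" for name, val in opcodes.items()
    )


check_isa(ISA_OPCODES, OPCODE_BITS)
cpp_text = emit_cpp_enum(ISA_OPCODES)
verilog_text = emit_verilog_defines(ISA_OPCODES, OPCODE_BITS)
print(cpp_text)
print(verilog_text)
```

Because both outputs are derived from the same table, changing an encoding in one place changes it in every generated artifact, which is the property the paragraph above attributes to IDEA.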
Contributions of this thesis include:
• descriptions of the parts of the Rigel toolflow used for evaluating fu-
ture large-scale chip multiprocessors (CMPs) including area, power,
and performance modeling with the design goal of silicon implementa-
tion
• a flexible methodology for ISA-level design space exploration
• a collection of examples and case studies demonstrating the utility of
the methodology, and a description of lessons learned, insights gained,
and pitfalls encountered while implementing and using our tools
• examples of our usage in published work
In this thesis, I describe our evaluation framework for Rigel, a 1024-core
single-chip accelerator architecture designed for throughput on scientific and
visual computing workloads. The research objectives of the Rigel project
led to the development of the framework presented in this thesis. Although
I present experiences related to a particular design, I note that the methods
applied and lessons learned are more broadly applicable.
CHAPTER 2
THE RIGEL ACCELERATOR
ARCHITECTURE
In this chapter, I provide a description of the Rigel 1024-core accelerator
architecture. I describe previous work for throughput-oriented processing
designs. I detail the motivation for and objectives of the Rigel design. I
describe the cores and clustered core organization, the caches and memory
system, and Rigel’s task-based programming model.
2.1 Related Work: Throughput Processors
We consider two broad classes of throughput-oriented architectures: general-
purpose chip multiprocessors (CMPs) and specialized accelerators. Chip mul-
tiprocessors (CMPs) have been proposed as a way to use increasing transistor
densities to improve performance [7]. General-purpose CMP development is
driven by the need to increase performance while retaining support for a vast
ecosystem of existing multitasking operating systems, programming models,
applications, and development tools. Increasingly, CMPs are integrating
additional functionality such as memory controllers and peripheral bus con-
trollers on-die, converging to an SoC model. Accelerators are designed to
improve performance and reduce power for a specific class of workloads by
exploiting characteristics of the target domain and are optimized for a nar-
rower class of workloads and programming styles. While general-purpose
processors tend to employ additional transistors to decrease latency along
a single thread of execution, accelerators are architected to maximize ag-
gregate system throughput, often with increased latency for any particular
operation. For many systems, the distinction between CMP and accelerator
becomes somewhat blurred.
Although CMPs do not improve the performance of single threads, the
cores can collectively achieve higher throughput (per unit area) than a wide-
issue superscalar core on multithreaded workloads such as commercial server
workloads. Throughput-oriented CMPs such as Sun’s UltraSPARC T3 [8] and
Niagara target server workloads with a moderate number of simple, highly
multithreaded cores. Such CMPs achieve relatively high throughput at the
expense of single-threaded performance. However, such designs are ulti-
mately limited by low memory bandwidth relative to accelerators, heavy-
weight hardware cache coherence, and the per-core features and latency re-
duction techniques required to meet the needs of general-purpose workloads.
Tilera’s [9] most recent processors, based upon earlier work on RAW [10],
feature up to 100 cores in a tiled design with a mesh interconnect, optimized
for message passing and streaming applications. Intel has developed several
experimental mesh-based throughput processors, including the 1TFLOP 80-
core chip [11] and the 48-core Single-chip Cloud Computer. For both Tilera
and Intel, these mesh interconnects are intended for explicit point-to-point
message-passing-based programming.
Stream processors [12] are programmable accelerators targeted at media
and signal processing workloads with regular, predictable data access pat-
terns. Imagine [13] pioneered the stream processing concept, and stream
processing has influenced modern graphics processor designs.
The most prominent class of programmable accelerators at present is graph-
ics processing units (GPUs). GPUs from NVIDIA [14, 2, 15] and AMD [16,
17] are targeted primarily at the raster graphics pipeline, but have exposed
an increasingly flexible substrate for more general purpose data-parallel com-
putations. Both NVIDIA and AMD GPUs utilize single instruction, multiple
thread (SIMT) style architectures, whereby a single instruction is simultane-
ously executed across multiple pipelines with different data. A key aspect of
SIMT designs is their ability to allow control divergence for branching code,
whereby only a subset of a SIMT unit’s pipelines are active. Such designs
enable dense hardware with high peak throughput but work best for regular
computations with infrequent divergence. Both NVIDIA and AMD rely upon
programmer-managed scratchpad memories and explicit data transfers for
high performance, though modern variants do implement small non-coherent
caches. GPUs also rely upon high memory bandwidth and thousands of
hardware threads to hide memory latency and improve throughput.
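The cost of control divergence described above can be made concrete with a small model. The Python sketch below is a deliberately simplified illustration, not a description of any vendor's hardware: it assumes a lockstep unit that issues each side of an if/else once for the whole unit whenever at least one lane needs it, and it ignores reconvergence details. The function names and the instruction counts are hypothetical; the 32-lane width follows the GPU entry in Table 2.1.

```python
def simt_issue_slots(lane_takes_branch, then_len, else_len):
    """Issue slots a lockstep SIMT unit spends on an if/else region.

    Each side of the branch is issued once for the whole unit if at least
    one lane needs it; lanes on the other side are masked off and idle.
    """
    any_then = any(lane_takes_branch)
    any_else = not all(lane_takes_branch)
    return (then_len if any_then else 0) + (else_len if any_else else 0)


def simt_utilization(lane_takes_branch, then_len, else_len):
    """Fraction of lane-slots that do useful work in the region."""
    lanes = len(lane_takes_branch)
    taken = sum(lane_takes_branch)
    useful = taken * then_len + (lanes - taken) * else_len
    total = lanes * simt_issue_slots(lane_takes_branch, then_len, else_len)
    return useful / total


uniform = [True] * 32                       # every lane takes the same path
diverged = [i % 2 == 0 for i in range(32)]  # lanes alternate between paths

print(simt_utilization(uniform, 10, 10))
print(simt_utilization(diverged, 10, 10))
```

Under these assumptions, a unit whose lanes split evenly across two equal-length paths does useful work in only half of its lane-slots, whereas independent MIMD cores would each fetch and execute only their own path.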
The Cell processor [3], introduced with the PlayStation 3 and also used
in high-performance computing, uses a heterogeneous model with a
multithreaded PowerPC processor and up to 8 synergistic processing elements
(SPEs) as coprocessors. SPEs use a programmer-managed scratchpad for
both instruction and data access and implement a SIMD instruction set.
Intel’s Larrabee [4] project approached the accelerator design point with
a fully programmable manycore x86 design, cached memory hierarchy, and
hardware cache coherence. Intel AVX extensions [18], derived from Larrabee,
support very wide 256-bit SIMD vectors for parallel processing. These wide
vectors represent a vastly different design point than the independent scalar
cores of Rigel or the SIMT units of GPUs, requiring additional programmer
or compiler effort for packing and alignment.
Kumar argued that the cores, caches, and interconnect in future multicore
systems cannot be derived independently [19]. To achieve an efficient design
in terms of area and/or power vs. performance, analysis that includes all
aspects of the design is needed rather than piece by piece analysis.
Current GPU architectures from NVIDIA and ATI already make use of over
100 arithmetic units. However, their architectures are heavily influenced by
their primary purpose, graphics. This can be seen in the G80’s SIMD/SPMD
execution model and throughput-oriented, latency-hiding design [20]. Similar
to our work, NVIDIA’s GPUs organize their processors into clusters; however,
their clusters are SIMD and cannot communicate with each other on-chip
in any fashion. The Cell architecture [21] contains a collection of smaller
processing units dubbed Synergistic Processing Elements (or SPEs). Unlike
the Cell’s collection of SPEs, Rigel’s clustered cores have access to shared
cache as opposed to scratch-pad memory to improve programmability. None
provide the broad fine-grained programmability that Rigel proposes, nor are
they equivalent in core count.
2.1.1 Parallel Programming Models
The initial Rigel Task Model (RTM) implementation was based on Carbon [22], a
hardware-based task queuing scheme and API. While Carbon focused
on hardware mechanisms for managing tasks, RTM moved this function into
software to improve flexibility. The Cilk runtime system provides a flexible,
dynamic work distribution runtime system based on dataflow [23].
CUDA [20] is a programming model developed by NVIDIA for writing
data-parallel programs that run on GPUs. CUDA as a programming model has
several interacting constructs for composing parallel programs on a shared-
memory system [14]. The programming model provides library APIs to con-
trol and manage grids of parallel execution specified by kernel functions. The
host portion of the code is written in a standard language and compiled for
the host platform, while the kernel code introduces constructs for expressing
SPMD parallelism. OpenCL [24] is a vendor-neutral standard for writing
data-parallel programs targeted at both GPUs and CPUs. Both CUDA and
OpenCL broadly fit the bulk synchronous parallel (BSP) model [25].
2.2 Motivation: Current Accelerator Limitations
Current accelerators generally expose restricted programming models which
yield high performance for data-parallel applications with rigidly structured
computation and memory access patterns, but present a more difficult tar-
get for less regular parallel applications. The throughput-oriented architec-
tural choices of accelerators often compromise the generality of the program-
ming model. For instance, accelerators commonly achieve high throughput
through the use of SIMD (single instruction, multiple data) processing ele-
ments as opposed to the MIMD (multiple instruction, multiple data) model.
For dense or regular data-parallel computations, SIMD hardware reduces
the cost of performing many computations by amortizing costs such as con-
trol and instruction fetch across many processing elements. However, when
applications do not naturally map to the SIMD execution model, program-
mers must adapt their algorithms or suffer reduced efficiency. SIMD then
limits the scope of applications which can achieve the hardware’s peak per-
formance. The memory system is another area where accelerators commonly
make compromises in support of hardware efficiency that limit programmabil-
ity. Software-managed scratchpad memories yield denser hardware, provide
tighter access latency guarantees, and consume less power than caches; how-
ever, they impose an additional burden on either the programmer or software
tools. Additionally, managing the multiple address spaces often associated
with accelerator memories requires copy operations and more explicit manual
memory management.
Figure 2.1: Block diagram of the Rigel processor.
2.3 Rigel Accelerator Architecture Overview
2.3.1 Objectives
The Rigel architecture was conceived as an attempt to address some of the
shortcomings of parallel computation accelerators while pushing the envelope
on throughput-oriented designs.
Broadly, the goals of the project are to:
• Determine the feasibility of a single-chip, massively parallel MIMD gen-
eralized computation accelerator
• Achieve high computation density, or throughput, in terms of
(operations/sec)/mm²
• Determine how to organize such a device to be programmer-friendly
• Present a more general target to developers, increasing the scope of
parallel applications which can target the design
• Address the limitations of existing accelerator architectures
2.3.2 Rigel: A Programmable Manycore Accelerator
These objectives drove our development of Rigel [5], a 1024-core single-chip
accelerator architecture designed to efficiently target a wide class of regular
and irregular parallel applications, including data- and task-parallel compu-
tation. With the Rigel design, we aim to strike a balance between raw
performance and ease of programmability by adopting programming inter-
face elements from general-purpose processors. Rigel is composed of 1024
independent, hierarchically organized cores. Simple in-order cores emphasize
throughput over latency, but a MIMD execution model is chosen for flexibil-
ity over a potentially denser SIMD model. Rigel has a fully cached, single
address space memory model with no chip-wide hardware-enforced coherence
in the baseline configuration. Work distribution is managed in software in
a bulk-synchronous fashion. Compared to existing accelerators, which rely
on domain-specific hardware, multiple special-purpose memories, and limited
programming models, Rigel is more flexible and provides a more straight-
forward development target for a broader range of parallel applications.
Tradeoffs are made in Rigel’s low-level programming interface between
generality and accelerator performance. The primary elements that we iden-
tify as important for supporting our objectives include: the execution model,
the memory model, work distribution, synchronization, and locality man-
agement. The Rigel execution model omits complex ILP-oriented cores in
favor of simple, area-optimized in-order cores to improve full-chip through-
put. However, a more flexible MIMD model is chosen over denser SIMD
hardware. A degree of multithreading is beneficial in improving throughput
and is under investigation. Unlike many accelerators, Rigel presents a single
global address space similar to general purpose CMPs. Rigel supports soft-
ware work distribution in the form of task queues based on the common BSP
execution model. Rigel supports global synchronization through software
barriers and through atomic hardware primitives. Rigel supports a variety
of memory operations to aid in locality management, including prefetches,
local and global memory operations, and explicit cache management instruc-
tions.
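The software work distribution described above can be sketched as follows. This is a sequential Python model written for illustration only: it is not the Rigel Task Model API, and the function name, the round-robin scheduling, and the group size are assumptions. It shows one BSP interval in which clusters refill a local queue with a group of tasks from a global queue and then dequeue locally, so that the global queue is touched once per group rather than once per task.

```python
from collections import deque


def run_interval(tasks, num_clusters, group_size):
    """Run one BSP interval: drain a global queue through per-cluster local queues.

    Returns the task results and the number of global-queue accesses.
    """
    global_queue = deque(tasks)
    local_queues = [deque() for _ in range(num_clusters)]
    results = []
    global_accesses = 0

    while global_queue or any(local_queues):
        for local in local_queues:
            if not local and global_queue:
                # Refill: move a group of tasks from the global to the local queue.
                global_accesses += 1
                for _ in range(min(group_size, len(global_queue))):
                    local.append(global_queue.popleft())
            if local:
                func, arg = local.popleft()  # local dequeue, no global access
                results.append(func(arg))
    # Falling out of the loop models the barrier that ends the interval:
    # every task enqueued for this interval has completed.
    return results, global_accesses


def square(x):
    return x * x


tasks = [(square, i) for i in range(16)]
results, accesses = run_interval(tasks, num_clusters=4, group_size=4)
print(sorted(results), accesses)
```

In this model, raising `group_size` reduces the number of global-queue accesses at the cost of coarser load balancing, which is the tradeoff a hierarchical queue is meant to expose.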
A block diagram of the Rigel accelerator architecture is shown in Fig-
ure 2.1. Table 2.1 compares the Rigel design point to several other notable
accelerator processors, and Table 2.2 summarizes our design parameters. The
various components of the Rigel accelerator architecture are discussed in
more detail below.
Table 2.1: Comparison of Rigel to other contemporary accelerator
architectures
Feature                  Rigel         GPU                     Cell              Larrabee
Vectors                  1x (MIMD)     32x (SIMT)              4x (SIMD)         16x (SIMD)
Memory                   Fully Cached  Special Purpose         DMA + Scratchpad  Fully Cached
Address Space            Single        Multiple                Multiple          Single
Thread Count (per core)  Some (1-4)    Heavy (10s-100s)        None              Some (4)
Core Count               1024          10s-100s                8-10              10s
Coherence                HW/SW Hybrid  None                    None              HW
Work Distribution        Software      Hardware                Software          Software
Specialized Hardware     None          Significant (graphics)  None              Some (texture)
Table 2.2: Simulated parameters for the Rigel architecture.
Cores       1024, 2-wide issue, in-order
DRAM        8 32-bit channels, GDDR5, 192 GB/s total
L1 ICache   2 kB, 2-way associative
L1 DCache   1 kB, 2-way associative
L2 Cache    Unified, 1 per 8-core cluster, 64 kB, 16-way associative
L3 Cache    Unified, globally shared, 4 MB total, 32 banks, 8-way associative
Figure 2.2: Block diagram of the Rigel cluster.
2.3.3 Core
The fundamental processing element of Rigel is a dual-issue in-order core
optimized for area rather than ILP. Cores support a custom 32-bit RISC-
like instruction set originally derived from MIPS. Each core has a standard
integer pipeline, a fully pipelined single-precision floating-point unit, a
load-store pipeline, and a 32-entry 32-bit general-purpose register file. A set
of special-purpose registers provides configuration information such as
unique core IDs. In contrast to SIMD or SIMT machines, each pipeline
has an independent fetch unit, allowing all cores to simultaneously execute
independent instruction streams in a MIMD fashion. Each core has a small
L1 instruction cache and L1 data cache. The baseline core microarchitecture
is explored in more detail in Section 4.2.
2.3.4 Cluster
Cores are organized into groups called clusters. A cluster contains a
collection of cores attached to a shared unified cluster cache. Figure 2.2
illustrates the cluster. The baseline Rigel configuration contains eight cores
per cluster, with cores connected to the cluster cache via a shared bus. The
interconnect between cores and the cluster cache is a split-phase bus, en-
abling simultaneous requests and responses. Clusters allow efficient commu-
nication among their cores via the shared cluster cache. Cores are coherent
within a cluster. Clusters also implement local atomics, load-linked and store-
conditional. The cores, cluster cache, core-to-cluster-cache interconnect and
the cluster-to-global interconnect logic make up a single Rigel cluster. A
variety of alternate cluster organizations are possible, and some are explored
in Section 6.5.
2.3.5 Tile
Clusters are connected and grouped logically into a tile. In the 1024-core
baseline configuration of Rigel, eight tiles of 16 clusters each are distributed
across the chip. Clusters within a tile share resources on a bi-directional tree-
structured interconnect. A tree-structured interconnect is chosen as opposed
to a mesh due to the intended use pattern. Communication between cores
takes place via shared caches, not through explicit message passing. The
interconnect serves to connect cores to memory, not to enable arbitrary core-
to-core communication or coherence traffic. Tiles are distributed across the
chip, attached to global cache banks via a multistage switch interconnect.
Figure 2.3 illustrates the tile and top-level organization.
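The organization described so far multiplies out to the 1024-core baseline:
8 tiles of 16 clusters, each with 8 cores. The short Python sketch below makes
that arithmetic explicit and decomposes a flat core number into tile, cluster,
and core coordinates. The constant names and the consecutive numbering scheme
are assumptions made for illustration only; the text does not specify how core
IDs are laid out.

```python
# Baseline Rigel organization as stated in the text.
CORES_PER_CLUSTER = 8
CLUSTERS_PER_TILE = 16
TILES_PER_CHIP = 8

TOTAL_CLUSTERS = TILES_PER_CHIP * CLUSTERS_PER_TILE   # 128 clusters
TOTAL_CORES = TOTAL_CLUSTERS * CORES_PER_CLUSTER      # 1024 cores


def locate_core(core_id):
    """Map a flat core number to (tile, cluster-in-tile, core-in-cluster).

    Assumes a hypothetical numbering in which cores are numbered
    consecutively within a cluster and clusters consecutively within a tile.
    """
    if not 0 <= core_id < TOTAL_CORES:
        raise ValueError("core_id out of range")
    cluster_id, core = divmod(core_id, CORES_PER_CLUSTER)
    tile, cluster = divmod(cluster_id, CLUSTERS_PER_TILE)
    return tile, cluster, core
```

Under this assumed numbering, the last core on the chip, number 1023, maps to
tile 7, cluster 15, core 7.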
2.3.6 Global Cache
The global cache is Rigel’s last-level shared cache and provides buffering
for several high-bandwidth memory controllers. Our initial 1024-core design
includes 8 GDDR memory controllers and 32 global cache banks. The global
cache provides a point of coherence for memory accesses; each address may
be cached in only a single location in the last-level global cache. Shared
data made visible in the global cache is visible to all cores on the chip. By
Figure 2.3: Block diagram of the Rigel tile and top-level chip organization.
default, global cache misses which access DRAM result in the returned data
being cached in the global cache. However, memory operations which bypass
caching in the global cache are optionally available. Additionally, global
atomic operations are performed at the global cache.
2.4 Caching and Memory Model
All cores on the Rigel processor share a single global address space. Cores
within a cluster have the same view of memory via the shared cluster cache,
while cluster caches in our baseline architecture are not explicitly kept co-
herent with one another. The low-level hardware operations and software
model for maintaining coherence are discussed further below. The global
cache provides a point of coherence for when software needs to synchronize
or otherwise safely share data between separate clusters. Due to the incoher-
ent nature of the cache hierarchy, Rigel implements two classes of memory
operations: local and global.
2.4.1 Local Memory Operations
Local memory operations are the standard path to memory on the Rigel ar-
chitecture. Local loads and stores are intended to constitute the majority of
memory operations, providing high bandwidth and low latency access. Local
operations are fully cached at the cluster cache, but are not kept coherent
between clusters by hardware. Local stores are not visible outside of the
cluster until either an eviction or explicit writeback occurs. Values evicted
from the cluster cache are written back to the last-level global cache, and
cluster cache misses are serviced by the global cache if the required data is
present. By default, local loads that initially miss in the on-chip caches are
also cached in the global cache to improve performance for read-shared data.
The cluster caches and the global cache are neither inclusive nor exclusive.
Local store operations are not guaranteed to be globally visible without ex-
plicit synchronization, and local loads may return inconsistent data values if
improperly used to access write-shared data without synchronization. Per-
word dirty bits are maintained for the cluster cache to mitigate false sharing.
Local memory operations are used for accessing read-only data, private data
such as the stack, and data shared with cores in the same cluster.
2.4.2 Global Memory Operations
Global loads, stores, and atomics on Rigel always bypass core-level and
cluster caches and complete at the global cache. Memory locations operated
on solely by global memory operations are trivially kept coherent across the
chip, as they may be cached in only a single location. Global operations are
key to supporting system software features such as resource management,
synchronization, and fine-grained inter-cluster communication. The cost of
global memory operations is high compared to local operations due to in-
creased latency, reduced read and write bandwidth, and contention on the
shared global interconnect. Rigel also implements a set of atomic opera-
tions (arithmetic, bitwise, min/max, exchange) that complete at the global
cache.
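The difference between local and global operations can be illustrated with a
toy functional model. The Python sketch below is not Rigelsim code: the class
and method names are hypothetical, caching is tracked per word rather than per
line, and timing, capacity, and evictions are ignored. It only shows why a
local store in one cluster is invisible to another cluster until software
writes it back and the reader discards its stale copy, whereas global
operations always meet at the global cache.

```python
class GlobalCache:
    """Last-level cache: the single point of coherence."""

    def __init__(self):
        self.data = {}


class ClusterCache:
    """Cluster cache that hardware does not keep coherent with other clusters."""

    def __init__(self, global_cache):
        self.gc = global_cache
        self.words = {}      # address -> locally cached value
        self.dirty = set()   # addresses written locally, not yet written back

    def local_load(self, addr):
        # A miss is serviced by the global cache; a hit may return stale data.
        if addr not in self.words:
            self.words[addr] = self.gc.data.get(addr, 0)
        return self.words[addr]

    def local_store(self, addr, value):
        # Visible only inside this cluster until written back.
        self.words[addr] = value
        self.dirty.add(addr)

    def writeback(self, addr):
        # Explicit flush: updates the global cache, not other clusters' copies.
        if addr in self.dirty:
            self.gc.data[addr] = self.words[addr]
            self.dirty.discard(addr)

    def invalidate(self, addr):
        # Drop the local copy so the next local load refetches it.
        self.words.pop(addr, None)
        self.dirty.discard(addr)

    def global_load(self, addr):
        # Bypasses the cluster cache and completes at the global cache.
        return self.gc.data.get(addr, 0)

    def global_store(self, addr, value):
        self.gc.data[addr] = value
```

In this model, if cluster B has already loaded an address and cluster A then
performs a local store to it, B keeps reading the old value even after A's
writeback; B observes the new value only after invalidating its own copy.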
2.4.3 Coherence
A key decision in designing a 1024-core chip is the organization of the memory
system, including coherence. In analyzing the data sharing
and communication patterns in visual computing workloads, we observe that
such patterns can be leveraged in the design of memory systems for future
manycore accelerators. Based on these insights, I contributed to development
of both software and hardware mechanisms to manage coherence on parallel
accelerator processors. We developed the Task-Centric Memory Model [26], a
software protocol which works in concert with hardware caches to maintain a
coherent, single-address-space view of memory without the need for hardware
coherence. We then developed Cohesion [27] as a mechanism to support
hybrid coherence with both hardware and software-managed cache coherence
features, enabling multiple memory models in heterogeneous or accelerator-
based systems.
In the baseline Rigel architecture, software is responsible for enforcing
coherence when inter-cluster read-write sharing exists. This may be done
by co-locating sharers on the same cluster where caches are coherent, by
using global memory accesses for shared data, or by explicitly flushing when
shared data is written before the reader accesses it. Explicit instructions
for actions such as flushing or eviction are provided for cache management.
Co-locating sharers is not always possible, and using global accesses for
all shared data is undesirable for performance reasons, as global cache and
interconnect bandwidth is more limited. Instead, we use a software algorithm
to maintain coherence in a coarse-grained manner.
The cache coherence mechanism on Rigel is not implemented in hard-
ware, but instead exploits the sharing patterns present in accelerator work-
loads to enforce coherence in software. The sharing patterns present in our
target workloads allow Rigel to leverage local caches for storing output
write data between barriers before lazily making modifications globally visi-
ble. Most data sharing on accelerator workloads occurs not between barriers
but across barriers. Lazy updates can be performed as long as coherence ac-
tions performed to write-output data are completed when a barrier is reached.
Rigel enables software management of cache coherence in two ways. First, it
provides instructions for explicit cluster cache management that include
cache flushes and invalidate operations. Explicit cluster cache flushes up-
[Figure 2.4 image labels: Task Queue Hierarchy; Global Task Queue; Local Task Queues; cores]
Figure 2.4: The Rigel Task Model consists of hierarchical task queues.
Depending on the configuration, cores may produce tasks into either local
or global queues. Groups of tasks are removed from the global queue and
placed into local queues for faster access and less contention.
date the value at the global cache, but do not update or invalidate cached
copies that may be held by other clusters. Second, broadcast invalidation
and broadcast update operations allow software to implement data synchro-
nization and wakeup operations that rely on invalidation or update-based
coherence in conventional cache coherent designs.
The topic of coherence is explored in more detail in Section 6.6.
2.5 Programming Rigel: The Rigel Task Model
Rigel is not restricted to running software written in a particular hardware-
specific paradigm, but instead has the ability to run standard C code. We
target Rigel using the LLVM compiler framework and a custom backend.
Rigel applications are developed using the Rigel Task Model (RTM),
a simple bulk-synchronous parallel (BSP), task-based work distribution li-
[Figure 2.5 image labels: Time; Interval; Task Execution; Idle Time; Barrier; Communication; Execution]
Figure 2.5: The BSP execution model of the Rigel Task Model. An
interval is defined as the time between two barriers.
brary that we have developed. We implement task management primarily in
software using hierarchical queues, enabling flexibility in work distribution
and scheduling policies, using minimal specialized hardware in the form of
atomics, global memory accesses, and broadcasts.
Applications are written in RTM using an SPMD execution model. All cores
execute a shared binary, but with arbitrary control flow per core. The pro-
grammer defines parallel work units, referred to as tasks, that are managed
via queues by the RTM runtime. We define an interval as the time between
two global synchronization barriers. During an interval, worker threads can
both produce and consume work units. There is no specified execution or-
dering for tasks within an interval. RTM task queues act as barriers when
empty to provide global synchronization points. When a worker thread at-
tempts to dequeue new work and finds an empty queue, the thread continues
to poll for additional work. When all threads have reached this state and
no additional work remains, a barrier has been reached. The last thread to
enter the barrier notifies the remaining threads. For Rigel, barriers repre-
sent a point at which any locally cached non-private data should be flushed
and made globally coherent before the start of a new interval. Figures 2.4
and 2.5 illustrate the BSP model we implement along with our hierarchical
task queues.
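The way an empty task queue doubles as a barrier can be sketched in a few lines
of Python. This is a sequential toy model, not the RTM library: the class and
function names are hypothetical, there is a single flat queue instead of the
hierarchical queues, and workers are stepped in turn rather than running as
real threads. Notification by the last thread and the cache flush at the
barrier are omitted. A worker that finds the queue empty is recorded as
polling, and the interval ends once every worker is polling and no work
remains.

```python
from collections import deque


class IntervalTaskQueue:
    """Toy single-level task queue whose empty state acts as a barrier."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.tasks = deque()
        self.polling = set()   # workers that last saw an empty queue

    def enqueue(self, task):
        self.tasks.append(task)

    def dequeue(self, worker):
        """Return a task, or None if the queue is empty and the worker polls."""
        if self.tasks:
            self.polling.discard(worker)
            return self.tasks.popleft()
        self.polling.add(worker)
        return None

    def barrier_reached(self):
        # All workers are polling and no additional work remains.
        return not self.tasks and len(self.polling) == self.num_workers


def run_interval(queue, execute):
    """Step workers in turn until the barrier; return the number of tasks run.

    `execute(task)` returns an iterable of newly produced tasks, since workers
    may both consume and produce work within an interval.
    """
    executed = 0
    while not queue.barrier_reached():
        for worker in range(queue.num_workers):
            task = queue.dequeue(worker)
            if task is not None:
                for child in execute(task):
                    queue.enqueue(child)
                executed += 1
    return executed
```

For example, if each task of depth d greater than zero produces two tasks of
depth d - 1, seeding the queue with one task of depth 2 runs seven tasks before
the barrier is detected.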
CHAPTER 3
THE RIGEL SIMULATOR
In this chapter, I describe Rigelsim, a cycle-accurate structural timing
simulator in continued development.
3.1 Timing Simulator
Our simulator is an execution-driven cycle-accurate model of the Rigel
architecture. The simulator is a structural model, with an RTL analog for each
major simulator component used to model timing. This allows for simpli-
fied block and module-level verification with RTL. Though SystemC was
considered for our simulator, we chose standard C++ for performance and
ultimately flexibility. Rigelsim is essentially a full-system simulator, running
a custom parallel runtime, but like most other accelerators does not currently
run a complete operating system in the traditional sense.
As we are exploring a relatively new architectural niche, 1000-core
processors, we opted against using an abstract high-level model that could
obscure important performance effects. For instance, many research projects
approximate DRAM behavior with a fixed latency and either fixed or infinite
bandwidth. In contrast, we choose to fully model all timings and interactions
with fine granularity. Using a detailed timing model, we find that our
applications exhibit a 10× variation in average latency for a 1024-core system,
which limits the value of fixed-parameter models. We also find that it is more difficult
to develop abstractions without a more complete understanding of the sys-
tem being developed. Therefore, abstract models, though easier to develop
and enabling faster simulation times, were not desirable for a design point so
dissimilar from existing architectures.
The simulator is constructed as a hierarchy of interconnected class mod-
ules representing various parts of the design, including pipeline stages, cores,
arbiters, various caches, interconnects, and memory controllers. I describe
some of the major components of the simulator.
3.1.1 Performance Model
One of my major focuses within the Rigel simulator has been on extend-
ing our simulation and performance modeling infrastructure with structural
detail for parallel clusters of processors. In order to evaluate specific design
tradeoffs with high confidence, a detailed simulation model is required. Many
simulators gloss over details such as contention in the memory system. As
the Rigel project intends to develop a high-performance VLSI ASIC imple-
mentation of Rigel, it is important for us to model such low-level details to
better understand hardware behavior.
As part of the effort to develop a high-quality performance modeling sim-
ulator, I implemented various extensions to the simulation infrastructure. I
have implemented a global cache in the simulator. Rigel aims to integrate
128 or more clusters per chip; with such a large number of clusters, con-
tention for global cache banks can potentially be very high, so clusters in
simulation must be provided bandwidth to the global cache accordingly. To
support such modeling of the global cache and interconnect, I have imple-
mented multi-cluster support in the simulator. Test codes may be run across
a collection of clusters with their cluster-cache misses being serviced by the
simplified global cache model and a GDDR model of main memory. The
number of accesses to the global cache each cycle may be constrained in the
simulator to model practical bandwidth limitations. Cache misses at all
levels are handled with a finite set of Miss Status Handling Registers (MSHRs)
that must be contended for. Rigelsim supports a variety of main memory
model classes, including the default that faithfully models GDDR timings,
precharging, row buffers, and more. The chip-level interconnect is modeled
structurally with messages passed between a series of interconnected routers
that fully model contention and arbitration.
3.2 Cluster
The cluster is the primary design element of the Rigel architecture. Clus-
ters are replicated across the chip many dozens of times, making accurate
modeling of this portion of the design very important.
I have extended Rigelsim, a detailed timing model of the proposed Rigel
architecture, for performing an accurate design-space exploration for clusters
of parallel processors. Early in development, the simulator abstracted away
the details of the cluster. The RTL model does not provide the ability to do
rapid design space exploration at a low level. I extended the existing Rigelsim
model to incorporate the effects of contention among architectural structures
and the latencies of our various levels of cache. These simulator extensions
provide an environment to better explore tradeoffs in our architecture.
I have worked to develop an accurate model of the cluster, and in par-
ticular the cache hierarchy for our proposed architecture, including timing
and resource contention at each level of the cluster cache hierarchy. This
facilitates an accurate evaluation of the design tradeoffs in the cache system.
I have also developed a generalized system of arbitration that can be used
for accurate contention modeling in the simulator.
3.2.1 Arbitration Modeling
In parallel designs such as Rigel, many shared resources exist. Contention for
these shared resources can play an important role in performance and should
be modeled accurately in the simulator for best results. Accordingly, we
have implemented a generic arbitration system that may be attached to any
shared resource or set of resources in the simulator. A variety of policies can
potentially be implemented, but currently the simple standby of round-robin
arbitration is used. Round-robin arbitration ensures that access is fair and
that no requester is starved. Currently, arbiters are used to control access to
shared cache structures at the cluster level as well as the global cache.
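A round-robin arbiter of the kind described above can be sketched as follows.
This is an illustrative Python model rather than the Rigelsim C++ class; the
class name, method name, and the choice to identify requesters by integer index
are assumptions. Priority rotates to the requester just after the most recent
winner, which is what gives the policy its fairness and freedom from
starvation.

```python
class RoundRobinArbiter:
    """Grants one requester per cycle, rotating priority after each grant."""

    def __init__(self, num_requesters):
        self.num_requesters = num_requesters
        self.next_priority = 0   # requester with highest priority this cycle

    def grant(self, requests):
        """Return the index of the granted requester, or None if none requested.

        `requests` is an iterable of requester indices asking for the
        shared resource in the current cycle.
        """
        pending = set(requests)
        for offset in range(self.num_requesters):
            candidate = (self.next_priority + offset) % self.num_requesters
            if candidate in pending:
                # The winner gets lowest priority on the next cycle.
                self.next_priority = (candidate + 1) % self.num_requesters
                return candidate
        return None
```

With four requesters all asserting every cycle, successive calls grant 0, 1, 2,
3 and then wrap around; if only requesters 0 and 2 assert, grants alternate
between them.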
3.2.2 Cluster Cache Interconnect
There is a broad spectrum of cluster-level cache interconnect possibilities.
The question posed here is: how should the interconnect for an 8-core clus-
Table 3.1: Cluster Interconnect Organizations
Baseline                      Alternate
8x8 Crossbar                  Shared Bus
Word-sized (32-bit) access    Cacheline-sized access (8 words)
256 bits/cycle                256 bits/cycle
8 Banks                       1 Bank
No L1D                        Cache line buffer
ter based on our area constraints be designed to minimize conflicts and con-
tention while not over-provisioning and wasting precious area? Two extremes
of this design space are presented in Table 3.1.
The early Rigel cluster design includes an 8x8 crossbar and no Level 1
Data cache. This straightforward organization connects each of the eight
cores via a full crossbar to an eight-banked cluster cache. In this way, all
cores are fully connected to each cache-bank with a single-word (32-bit wide)
link. Each cluster cache bank has a single read/write port for data access. An
arbiter grants access to at most one core per cache bank per cycle. Aggregate
bandwidth to the cluster cache is therefore a maximum of 256 bits/cycle.
The lack of L1 data caches allows a cluster cache to trivially guarantee
sequential consistency and maintain coherence, since there is no core-local
state. It also reduces on-chip data duplication and helps to keep area down, as
no cache coherence hardware is required.
To challenge the original assumptions of an early Rigel design, I have
considered alternative organizations; under consideration is a more limited,
but simpler, shared bus-based design. The motivation here is reduced area
and complexity compared with the baseline crossbar implementation. In
this organization, instead of having a 32-bit word connection, each core is
connected via a 256-bit (cacheline-sized) shared bus to a single monolithic
cluster cache bank. The cache has a single 256-bit wide data read/write port.
A bus arbiter grants access to at most one core per cycle. The aggregate
bandwidth to the cluster cache remains 256 bits/cycle; however, contention
increases, as eight cores now contend for only a single port.
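The two organizations have the same peak bandwidth but differ in how many cores
can be served in one cycle. The Python sketch below works through that
arithmetic. The function names are hypothetical, and selecting a bank from the
low-order bits of the word address is an assumption; the text does not state
how addresses map to banks.

```python
WORD_BITS = 32
WORDS_PER_LINE = 8
LINE_BITS = WORD_BITS * WORDS_PER_LINE   # 256-bit cache line


def peak_bits_per_cycle(num_banks, port_bits):
    """Aggregate bandwidth when every single-ported bank is busy."""
    return num_banks * port_bits


def crossbar_grants(word_addresses, num_banks=8):
    """Cores served in one cycle by the banked crossbar organization.

    Each bank has one port, so at most one request per distinct bank wins.
    Bank index = word address modulo num_banks (assumed interleaving).
    """
    return len({addr % num_banks for addr in word_addresses})


def bus_grants(word_addresses):
    """Cores served in one cycle by the shared bus: at most one."""
    return 1 if word_addresses else 0
```

Eight cores reading eight consecutive words are all served in one cycle by the
crossbar under this mapping, while eight requests that fall in the same bank
serialize just as they would on the bus. The bus compensates by moving a full
256-bit line per grant.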
To mitigate this additional contention and make use of the additional band-
width provided by the wide bus, each cacheline is read out and stored into
(or written from) a core-local line buffer. The line buffer is essentially a
cache line-sized L1 data cache; however, the purpose is not to provide a local
cache for the core but rather to cheaply reduce contention for the shared bus.
The line buffer makes it more difficult to provide sequential consistency and
cache coherence at the cluster level since writes to it must be kept coherent
with the cluster cache; however, this may be done at each core by snooping
accesses made on the shared bus. This increased complexity is compensated
for by a significant reduction in complexity and area for the cluster-level
cache interconnect.
The cluster-level cache organization is not restricted to the two scenarios
described; indeed, intermediate approaches with two or four multi-ported
banks are possible as well as smaller crossbars with cores sharing connections.
However, initial evaluations focus on the described organizations.
3.2.3 Cluster-Level Cache Organization
The cluster-level cache could be used either purely as a data cache or as a
cluster-level unified cache that stores instructions as well. A unified cluster
cache presents additional consumers and more contention, especially in the
case of a shared bus. However, it presents the potential to reduce latencies
for L1I cache misses and act as a temporal read-sharing location for L1I cache
misses that could occur near in time. At the same time, cores each have their
own L1I caches and keeping the cluster cache data-only could provide more
effective cache space for data and reduce on-chip replication of code.
3.3 Simulation Automation
I have developed various tools enabling efficient distribution of simulation
jobs running Rigelsim experiments. Users can specify a set of jobs to be run
on one of two parallel environments, the Trusted Illiac cluster or our group’s
set of desktop machines. The Illiac cluster manages job submissions via the
freely available SLURM resource manager. The desktop machines manage
work via the Condor job queuing system. Submitting work to either set of
machines entails constructing a configuration file which specifies a set of sim-
ulation parameters, benchmark codes, data inputs, and other configuration
details. The Rigel job submission tool automatically builds a cross prod-
uct of all relevant parameters and submits each job to the specified set of
machines.
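Building the cross product of parameters is straightforward to sketch. The
Python below is not the Rigel job submission tool: the function name, the
parameter names, and the values are hypothetical placeholders, and the
submission step to SLURM or Condor is left out. It only shows how a
configuration listing several values per parameter expands into one job
description per combination.

```python
from itertools import product


def expand_jobs(sweep):
    """Expand {parameter: [values, ...]} into one dict per combination."""
    names = list(sweep)
    value_lists = [sweep[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical sweep: parameter names and values are placeholders.
example_sweep = {
    "benchmark": ["bench_a", "bench_b"],
    "clusters": [16, 64, 128],
    "l2_kb": [32, 64],
}

jobs = expand_jobs(example_sweep)
```

The example sweep expands to 2 x 3 x 2 = 12 jobs, so the number of simulations
grows multiplicatively with each added parameter.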
CHAPTER 4
THE RIGEL RTL MODEL AND
TOOLFLOW
Rigelsim provides us with performance results for various core, cache, and
interconnect configurations. To evaluate the area-performance tradeoffs that
influence parallel designs, I have undertaken the development of an RTL
model for use with the Rigel project.
In this chapter, I describe my work on development of RTL and related
infrastructure for the Rigel project. I describe the CAD toolflow, core RTL
and microarchitecture, core verification strategy, and cluster RTL develop-
ment.
4.1 RTL Infrastructure
For this work, I use a commercial suite of Synopsys CAD tools, including VCS
for RTL simulation, DesignCompiler for synthesis, and ICCompiler for place
and route. I use commercial standard cell libraries for a high performance
40 nm process.
Memories are generated using commercial SRAM and Register File com-
pilers when possible. However, our toolset is limited to generating single-
and dual-ported structures. Multiported structures are synthesized from
standard cells when the required configurations are not available from the
memory compiler.
4.1.1 CSL ASIC Flow
The physical design flow is driven by the CSL ASIC flow, a powerful frame-
work for chip development developed by Jonathan Ashbrook in collaboration
with the Rigel team.
In addition to performance, area and power are key criteria in our design
space of parallel architectures, where components are replicated many times
(for our cores, 1024 times). I have worked to develop a flexible RTL design
flow in order to evaluate the feasibility of implementing a 1024-core CMP in
current process technologies and allow comparison with existing GPU and
CPU architectures. Such an aggressive design target requires an emphasis
be placed on accurate area and power measurement and modeling for key
components. Our SystemVerilog core and cluster models are extensively
parameterized, allowing components to be interchanged or removed. We
target a production-quality 40 nm high-performance standard cell library.
The RTL flow is integrated with the simulator to enable performance
validation of the simulator and pre-silicon verification of the RTL. We use the
dynamic stream of register file writes from within the simulator for
comparison with the RTL model, a method which nicely handles microarchitectural
mismatches between the RTL and simulator.
4.1.2 RTL Design Space Exploration Flow
To enable comprehensive analysis of design tradeoffs impacting area, power,
and timing, we developed a flow for automated RTL-level design-space explo-
ration. Provided with a set of values for RTL-level configuration parameters,
we generate a cross product of possible design points and pass these to simu-
lation and synthesis flows that leverage the Condor distributed computation
system. For each configuration, a variety of performance test kernels may be
run under simulation for functional and performance verification as well as
for extracting switching data for power analysis. Each design point may be
synthesized under a variety of conditions for clock speed, voltage, and more.
Post-synthesis netlists can be combined with switching data to produce power
estimates. This flow allows us to rapidly experiment with a variety of design
tradeoffs impacting frequency, area, and power that are not clearly visible
within the confines of the traditional timing simulator approach typically
employed by architects.
[Figure 4.1 image labels: Fetch; Decode; Exec 1 (Int); Exec 2; Agen; Mem; Mem2; CCRead; FPU 1 to FPU 4; WB; RegFile; SPRF; FP Accumulator; Scoreboard; Bypass Network; L1 I-Cache; L1 D-Cache; ClusterNet (Arb); ClusterNet (Bus)]
Figure 4.1: Pipeline block diagram of the Rigel core.
4.2 The Rigel Core
As described in Chapter 2, the principal processing element of the Rigel
accelerator architecture is a simple dual-issue, in-order core. The Rigel core
has three separate pipelines: integer, floating-point, and memory. The inte-
ger pipeline also handles branch instructions. By default, the pipeline is fully
bypassed, with the exception that the integer and floating-point pipelines do
not bypass to each other. Bypassing can be enabled or disabled at synthesis
time.
The initial Rigel pipeline nominally has seven stages. The front-end has
two stages, fetch and decode. The decode stage also serves to schedule
instructions based on operand availability, dependences, and hazards. Four
stages are provided for execution. The last stage handles register file
writeback. Simple branch prediction is provided in the form of a single-entry
branch target buffer and a static backward-taken, forward-not-taken (BTFN)
predictor.
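The two prediction mechanisms named above are simple enough to sketch in
Python. The code is illustrative, not derived from the Rigel RTL: the names are
hypothetical, and how the real pipeline combines the branch target buffer with
the static predictor is not described in the text, so the two are shown
independently.

```python
def btfn_predict_taken(branch_pc, target_pc):
    """Static BTFN direction prediction.

    Backward branches (target below the branch, typically loop back-edges)
    are predicted taken; forward branches are predicted not taken.
    """
    return target_pc < branch_pc


class SingleEntryBTB:
    """Branch target buffer that remembers only the most recent branch."""

    def __init__(self):
        self.branch_pc = None
        self.target_pc = None

    def lookup(self, fetch_pc):
        """Return the stored target if fetch_pc hits, otherwise None."""
        return self.target_pc if fetch_pc == self.branch_pc else None

    def update(self, branch_pc, target_pc):
        """Record a taken branch, replacing whatever was stored before."""
        self.branch_pc = branch_pc
        self.target_pc = target_pc
```

A single entry is enough to capture the back-edge of a tight inner loop, which
is a plausible motivation for such a small structure in an area-optimized core,
though the text does not state the rationale.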
Figure 4.1 shows the typical configuration of the Rigel pipeline.
4.3 Testing and Verification
I have developed a comprehensive framework for testing and verifying the
processor pipeline’s implementation. A series of test codes can be run on the
RTL and their outputs verified in a number of ways. Testing is automated
in a structured and configurable manner.
4.3.1 Assembly Codes
Tests written in assembly are simple to run and verify. Each assembly test
gets a directory in the testcode folder. Each directory holds the test code
and a test configuration file and optionally files for initial and final register
file state, register write trace, and memory dump state. An example test
code is written for each instruction supported in the RTL. A variety of other
assembly test patterns are available as well.
Initially, assembly tests were run on the RTL and their final register file
state was compared with that of the simulator running the same piece of
code. While this is sufficient for quick superficial functional checking, it fails
to capture intermediate errors which might occur.
To address this, the Rigel RTL flow captures a complete trace of register
file writes from both the simulator and the RTL models. As the specific core
microarchitectures in the two models are not identical and vary with time,
these traces allow an implementation-agnostic method of verifying function-
ality. On a correct execution, the ordering of register writes as well as the
values written directly correspond. Writes which occur sequentially in one
model are also correct when occurring simultaneously in the same cycle in
the other model. This can happen when one model extracts a better sched-
ule due to various architectural factors such as memory latency or pipeline
configuration differences.
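One way to implement such a comparison is sketched below in Python. It is an
assumption-laden illustration, not the actual Rigel checker: the trace format
(a chronological list of (cycle, register, value) tuples) and all names are
hypothetical. Writes are grouped by cycle, cycle numbers themselves are
ignored, and writes within one cycle are treated as unordered, so a pair of
writes that is sequential in one model may share a cycle in the other.

```python
from collections import Counter
from itertools import groupby


def group_by_cycle(trace):
    """Group a chronological [(cycle, reg, value), ...] trace by cycle."""
    return [Counter((reg, value) for _, reg, value in writes)
            for _, writes in groupby(trace, key=lambda write: write[0])]


def traces_match(sim_trace, rtl_trace):
    """True if both traces perform the same writes in a compatible order."""
    sim_groups = group_by_cycle(sim_trace)
    rtl_groups = group_by_cycle(rtl_trace)
    i = j = 0
    sim_pending, rtl_pending = Counter(), Counter()
    while True:
        # Refill whichever side has no unmatched writes left.
        if not sim_pending and i < len(sim_groups):
            sim_pending = sim_groups[i].copy()
            i += 1
        if not rtl_pending and j < len(rtl_groups):
            rtl_pending = rtl_groups[j].copy()
            j += 1
        if not sim_pending and not rtl_pending:
            return True            # both traces fully consumed
        common = sim_pending & rtl_pending
        if not common:
            return False           # missing, extra, or wrong-valued write
        sim_pending -= common
        rtl_pending -= common
        if sim_pending and rtl_pending:
            return False           # the two models disagree on write order
```

For instance, a simulator trace that writes r1 and then r2 in separate cycles
matches an RTL trace that writes both in the same cycle, but not one that
writes r2 in an earlier cycle than r1.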
4.3.2 Compiled Codes
More thorough testing and evaluation is possible by running code compiled
with the full Rigel toolchain. In order to support compiled codes, the RTL
model implements several features via behavioral emulation. For instance,
as the global cache and associated atomics are currently unimplemented in
RTL, these are emulated via testbench hooks.
4.4 The Rigel Cluster
The Rigel cluster is composed of a collection of cores along with a shared
cache and cluster-level atomic support. Discussion of the hardware design
space for Rigel’s caches is beyond the scope of this thesis. Figure 4.2 shows
an early placed and routed mapping of the Rigel cluster with 8 cores and 8
SRAM banks. The cluster and caches are under continued development in
RTL.
4.5 GoldMine
In addition to the Rigel project, the RTL for the Rigel core has been
used as a test subject for the GoldMine project [28]. GoldMine uses data
[Figure 4.2 image annotations: initial RTL place and route; initial area estimates under 2 mm^2 for an 8-core cluster with 64 kB shared cache; core (core logic, register file) at 45 nm: approximately 100,000 square microns, approximately 1.2 GHz]
Figure 4.2: Early place and route image for the Rigel cluster. Cluster
includes eight cores and eight 8 kB banks of SRAM.
mining techniques to automatically generate assertions about RTL modules.
Designer feedback combined with GoldMine allows automated generation of
verification and regression tests.
CHAPTER 5
INTEGRATED TOOLFLOW WITH IDEA
In this chapter, I describe our integrated evaluation framework for Rigel.
This infrastructure consists of an integrated and tightly coupled set of tools
spanning simulation, code generation, and hardware development. I describe
IDEA, a unifying component of these tools.
5.1 Related Work: Existing Tools
A variety of tools and techniques exist for evaluating system performance,
area, and power. We motivate our approach by examining the tradeoffs in
existing tools.
Numerous timing simulators are available, ranging from simple models to
complex full-system simulators. Many contemporary simulators are based on
superscalar out-of-order execution [29, 30] and enable high-fidelity modeling
of complex processor cores. However, simulators often employ high levels
of abstraction, obfuscating the relationships between simulated components
and their RTL counterparts; this obfuscation frustrates the use of the simula-
tor as a more flexible golden correctness and performance model for the VLSI
implementation and motivates a less abstract simulator. Parallel simulators
for parallel machines such as Graphite [31] can achieve higher simulation
throughput than single-threaded simulators, but are generally more difficult
to extend and often employ a relaxed timing model, which may not be ac-
ceptable for performance validation of a new design. Simulators for modeling
large parallel machines have also employed abstract modeling techniques that
leverage the synchronization characteristics of their systems [32]. Such sim-
ulators are useful for evaluating programming models and runtimes, but are
less useful for pre-silicon validation and performance validation. GPGPU-
Sim [33] simulates a restricted SIMD processor. We find no existing simulation
infrastructure that both accurately models the scale of parallelism we
target and can evaluate arbitrary application and system code.
Emulation environments, such as FAST [34], RAMP [35], or PROTOFLEX [36],
execute parts of the timing model on an FPGA to improve performance.
While this approach improves simulation speed, it is not meant to provide
synthesizable RTL models to aid in precise area or power estimation. While
FPGA emulation and timing simulation can provide guidance, their visibility
is limited by their constraints. For instance, the Godson-2 CPU experienced
notable deviations from both FPGA and simulator models [37].
Tools such as McPAT [38] present an integrated modeling framework for
power, area, and timing. However, McPAT relies upon an analytical model
rather than an RTL flow for its physical design feedback and is aimed at
higher-level or more abstract design space exploration than we target. Power
modeling has been shown to be of limited effectiveness without any detailed
implementation information [39].
Commercially, Tensilica provides a synthesizable embedded processor frame-
work with Xtensa [40, 41]. The Tensilica flow allows existing designs to be
extended and provides a compiler and GNU-based toolchain. However, the
Tensilica Xtensa toolset relies upon a fixed baseline ISA with extensions,
rather than a completely flexible ISA specification. OpenRISC [42] provides
a simulator, software stack, and RTL, but these are not targeted at flexible
design space exploration.
Kumar argued that the cores, caches, and interconnect in future multicore
systems cannot be derived independently [19]. To achieve an efficient design
in terms of area and/or power vs. performance, analysis that includes all
aspects of the design is needed rather than piece-by-piece analysis. Prominent
researchers [1] have advocated re-evaluation of the traditional fixed
infrastructure stack, including simulators, compilers, and ISAs.
5.2 Toolflow Objectives
The broader research goal of our project is to evaluate the potential of a
1024-core MIMD architecture with a cached, shared address space that maximizes
throughput (FLOPS/mm2 or FLOPS/W) while supporting a conventional programming
model. However, we wish to remain otherwise unconstrained by traditionally
limiting design considerations.
The goal of this thesis is not to evaluate specific design tradeoffs, but to
describe the tools we have developed for making these decisions. Our research
objectives lead us to develop the framework presented in this thesis.
5.2.1 Rigel: A Manycore Accelerator
I describe our evaluation framework for Rigel, a 1024-core single-chip ac-
celerator architecture designed for throughput on visual computing and sci-
entific workloads. The architecture is described in detail in Chapter 2, but
the salient points are summarized here. Each dual-issue, in-order core has
private L1 data and instruction caches. The cores are arranged into clusters
that share a unified L2 cache. The cluster acts as an 8-way SMP. The full de-
sign has 1024 cores in 128 clusters. All cores share a unified last-level global
cache. Our design allows for configurable numbers of memory controllers and
the DRAM model is parameterized allowing us to support a variety of DRAM
standards. We evaluate versions of the design that support various types of
hardware cache coherence and a form of software-managed coherence.
5.3 The Case for a Flexible, Integrated Evaluation
Framework
Most design efforts approach architecture development, hardware development,
and software tool development separately. ISAs are generally
fixed and maintained across generations, and software stacks are inflexible.
The limited flow of information across design boundaries hampers globally
optimized design decisions. Prominent researchers [1] have advocated re-
evaluation of the traditional fixed infrastructure stack, including simulators,
compilers, and ISAs.
In addition to performance, most processor designers today are concerned
with area and power. Architects are traditionally well-equipped to evaluate
performance, but less so for area and power, often relying on previous designs
or analytical models. CPUs and GPUs are often designed and implemented
concurrently, with RTL used primarily for validation and verification, not
performance studies or for physical estimates (power, area), and can rely
Figure 5.1: IDEA: An integrated toolflow for accelerator design space
exploration.
upon past designs to inform future decisions. SOC designers can rely on
known or characterized IP and use standard processor ISAs, mitigating de-
sign risk.
When exploring a new design space, in our case 1000+ core CMPs, it is
difficult to foresee all potential challenges and we encountered many non-
obvious design decision questions throughout the process. Ultimately, per-
formance in silicon implementation can be difficult to predict without actual
implementation details. We find it valuable to consider physical design con-
straints proactively, during the definition of the architecture, rather than
reactively during implementation. For these reasons, we choose to pursue a
custom integrated design flow employing measurement rather than modeling
whenever possible.
With an accurate RTL model, we can quantify design tradeoffs and their
impact on area, power, and frequency. Due to the targeted level of
parallelism, design decisions impacting core efficiency are amplified 1000-fold,
making accurate estimates paramount. However, it is difficult to measure
the performance impact in RTL due to the required detail and scale of the
model. Therefore, we take the approach of measuring key components in
RTL but evaluating the impact on system-level performance in a cycle-accurate
full-system simulator, providing improved design visibility.
In most designs, an ISA is selected or developed early in the design process
and remains mostly fixed throughout the development cycle (and, in prac-
tice, for many product generations). Once an ISA is chosen, many parallel
portions of the design flow are developed that depend on this choice, includ-
ing performance simulation, RTL implementation, and software tools. ISA
modification can represent a daunting task late in the design cycle, having
costly, widespread consequences throughout the development stack. Fixing
the ISA is a practical consideration: it maintains binary compatibility with
software infrastructure that depends upon the ISA, such as compilers,
assemblers, and assembly language system software. Many components of
the processor implementation may be affected by ISA-level changes, including
the decoder, pipeline configuration, and execution units.
However, for designs targeting new design spaces, where entirely new
application codes and software stacks will be developed, the requirements
of a design may not initially be clear. Such attachment to an ISA may not
be necessary. The prevalence of just-in-time (JIT) compilation and the rise
of interpreted languages and virtual machines such as the JVM or the .NET
framework have somewhat reduced the reliance upon the underlying ISA. Devices
such as GPUs make use of JIT compilation for shaders, allowing them to make
substantial ISA-level changes between product generations.
For designers targeting new design spaces, an agile methodology that
enables ISA design space exploration can be a powerful tool to aid in
evaluating power, area, and performance tradeoffs or in simply extending
the design with additional features.
5.4 Toolflow Components
This section summarizes the components of our evaluation framework. Our
framework consists of three major components: an architectural timing sim-
ulator, an RTL implementation and tool flow, and a software stack for code
generation. Each of these components contains portions automatically gen-
erated by a component we call IDEA. Figure 5.1 illustrates the high-level
organization of the tools with IDEA, and Figure 5.2 illustrates the major
components of the toolflow.
[Figure 5.2 block labels: ISA/machine specification; IDEA tool; RTL-Decode; Sim-Decode; Sim-SB; Sim-Exec; RTL simulator; LLVM compiler; GNU Binutils; test and benchmark code; RF trace.]
Figure 5.2: Integrated toolflow components. A header file specifies various
ISA parameters. This file is consumed by IDEA, which produces
components of the simulator, RTL, and code generation tools.
[Figure 5.3 plot: y-axis "Average Latency (Cycles)", ticks from 0 to 450; final category "Mean".]
Figure 5.3: Average DRAM latency for our applications varies by a factor
of 10, necessitating a detailed timing model for accurate performance
prediction.
5.4.1 Timing Simulator
Our simulator is an execution-driven, cycle-accurate model of the Rigel
architecture. The simulator is structural, with an RTL analog for every
simulator component used to model timing, allowing for simplified validation
at interfaces with RTL. Though tempted to implement our simulator in
SystemC, we chose standard C++ for performance and flexibility, as our
simulator is essentially full-system. The simulator runs a custom parallel
runtime but, like other accelerators, does not run a complete traditional
operating system. As we are exploring a relatively new architectural niche,
1000-core processors, we opted not to use a more abstract high-level model that
might obscure important performance effects. For instance, many research
projects approximate DRAM behavior with a fixed latency and fixed or
infinite bandwidth. We find that our applications exhibit a 10× variation in
average latency for a 1024-core system using a detailed timing model, as
shown in Figure 5.3. We also find that it is difficult to abstract without a
more complete understanding of the system being developed. As such, an
abstract model, though easier to develop and providing faster simulation times,
was not feasible for a design point far removed from existing architectures.
5.4.2 RTL Model
In addition to performance, area and power are key criteria in our design
space of parallel architectures, where components are replicated many times
(for our cores, 1024 times). We developed a flexible RTL design flow in or-
der to evaluate the feasibility of implementing a 1024-core CMP in current
process technologies and allow comparison with existing GPU and CPU ar-
chitectures. Such an aggressive design target requires an emphasis be placed
on accurate area and power measurement and modeling for key components.
Our SystemVerilog core and cluster models are extensively parameterized, al-
lowing components to be interchanged or removed. We target a production-
quality 40 nm high-performance standard cell library.
The RTL flow is integrated with the simulator to support performance
validation of the simulator and pre-silicon verification of the RTL. We use the
dynamic stream of register file writes from within the simulator for
comparison with the RTL model, a method which nicely handles microarchitectural
mismatches between the RTL and simulator.
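The comparison of register file write streams can be sketched as follows. This is a minimal illustration, not the Rigel tooling: the RegWrite record, its fields, and the first_mismatch helper are hypothetical names, and the sketch assumes each trace is simply an ordered list of (register, value) writes. Because only the order and content of writes are compared, and not the cycle on which they occur, the two models may differ in timing without triggering a mismatch.

```python
from itertools import zip_longest
from typing import Iterable, NamedTuple, Optional


class RegWrite(NamedTuple):
    """One register-file write (hypothetical trace record format)."""
    reg: int    # destination register index
    value: int  # value written


def first_mismatch(sim_trace: Iterable[RegWrite],
                   rtl_trace: Iterable[RegWrite]) -> Optional[int]:
    """Return the index of the first differing write, or None if the traces agree.

    Cycle numbers are deliberately absent from the records, so timing
    differences between the two models are ignored.
    """
    for index, (sim, rtl) in enumerate(zip_longest(sim_trace, rtl_trace)):
        if sim != rtl:  # also catches one trace ending before the other
            return index
    return None


sim = [RegWrite(1, 10), RegWrite(2, 20), RegWrite(3, 30)]
rtl_ok = [RegWrite(1, 10), RegWrite(2, 20), RegWrite(3, 30)]
rtl_bad = [RegWrite(1, 10), RegWrite(2, 99), RegWrite(3, 30)]

print(first_mismatch(sim, rtl_ok))   # None
print(first_mismatch(sim, rtl_bad))  # 1
```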
RTL Design Space Exploration Flow
To enable comprehensive analysis of design tradeoffs impacting area, power,
and timing, we developed a flow for automated RTL-level design-space explo-
ration. Provided with a set of values for RTL-level configuration parameters,
we generate a cross product of possible design points and pass these to simu-
lation and synthesis flows that leverage the Condor distributed computation
system. For each configuration, a variety of performance test kernels may be
run under simulation for functional and performance verification as well as
for extracting switching data for power analysis. Each design point may be
synthesized under a variety of conditions for clock speed, voltage, and more.
Post-synthesis netlists can be combined with switching data to produce power
estimates. This flow allows us to rapidly experiment with a variety of design
tradeoffs impacting frequency, area, and power that are not clearly visible
within the confines of the traditional timing simulator approach typically
employed by architects.
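Generating the cross product of design points can be sketched as below. The parameter names and values are invented for illustration and are not Rigel's actual configuration parameters; dispatching each point to simulation and synthesis jobs is omitted.

```python
from itertools import product

# Hypothetical RTL-level parameters and candidate values (illustrative only).
parameter_space = {
    "clock_mhz": [800, 1000, 1200],
    "regfile_read_ports": [2, 4],
    "fpu_pipeline_stages": [3, 4],
}


def design_points(space):
    """Yield one dict per combination of parameter values (the cross product)."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


points = list(design_points(parameter_space))
print(len(points))  # 3 * 2 * 2 = 12 configurations
print(points[0])
```

Each resulting dictionary would then describe one independent simulation or synthesis job.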
5.4.3 Code Generation and Software Stack
We use the LLVM tool suite [6] with a custom backend for compilation and
GNU Binutils to assemble, disassemble, and link Rigel binaries. The de-
coupling of binary creation from the compiler and the autogeneration of
ISA-specific components of the compiler backend has allowed us to keep our
compiler updated with new releases of LLVM. Building a robust code gener-
ation framework has allowed us to take advantage of new features and tools
provided by the LLVM development effort. Moreover, using a compiler with
a retargetable intermediate representation allows us to take advantage of
transformations and extensions that target LLVM, such as [43].
5.5 Tool Chain Integration: IDEA
A key unifying component of our tool flow is IDEA (Integrated Design-space
Exploration for Accelerators). IDEA is a utility application that takes a
set of configuration files defining and documenting the ISA, architectural
parameters including latency and complement of functional units, and RTL
parameters. IDEA removes the fixed ISA restriction placed on a traditional
design flow and allows ISA design space exploration as part of the design
process. Building this tool early on helped us avoid locked-in commitments.
IDEA produces Verilog and C++ files used by our simulator, compiler, and
RTL to generate various stages of the pipeline and the code generation logic
in the compiler. IDEA is a critical piece of the design that allows us to
rapidly change the RTL, compiler, and simulator while minimizing cross-tool
inconsistencies and incorrect output from the tools. The consistency and
single point of modification enable rapid modification of the ISA and
core-level microarchitectural parameters with little effort.
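The single-point-of-modification idea can be illustrated with a small generator that emits matching C++ and Verilog opcode definitions from one table. This is only a sketch of the general technique: the instruction list, the OP_ naming scheme, the 6-bit opcode width, and the output formats are assumptions made for the example and do not reflect IDEA's actual input or output.

```python
# Hypothetical instruction list; list order determines the opcode value.
ISA_SPEC = ["add", "mul", "ld"]


def emit_cpp_enum(spec):
    """Emit a C++ enum of opcodes, e.g. for a simulator or compiler backend."""
    body = "\n".join(
        f"    OP_{name.upper()} = {code}," for code, name in enumerate(spec)
    )
    return "enum Opcode {\n" + body + "\n};\n"


def emit_verilog_params(spec, width=6):
    """Emit Verilog localparams carrying the same opcode values for a decoder."""
    return "".join(
        f"localparam [{width - 1}:0] OP_{name.upper()} = {width}'d{code};\n"
        for code, name in enumerate(spec)
    )


cpp_text = emit_cpp_enum(ISA_SPEC)
verilog_text = emit_verilog_params(ISA_SPEC)
print(cpp_text)
print(verilog_text)
```

Since both outputs derive from the same list, reordering or adding an instruction changes the simulator and RTL definitions together.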
We supply an instruction mnemonic, encoding type, description for ISA
document generation, latency, the functional unit that handles the op, and
RTL-specific information. For instructions, we provide a list of
functional units available for scheduling, the number and type of registers
used, and the encoding type used by the decoder. For functional units, we
describe latencies and microarchitectural configuration. We also include
information necessary for generating an efficient RTL decoder, including signals
generated by each instruction and functional unit such as carry out signals
or branch resolution information. IDEA consumes the configuration and
automatically generates an ISA that is compatible, assigning opcodes and en-
codings as required. An error is produced if the ISA specification cannot be
encoded in the provided bit space. In this case, the user is required to make
higher-level design decisions to open encoding space. For instance, in an ISA
that supports 16-bit immediates with 32-bit fixed length instructions and 32
registers (5-bit identifiers), at most 6 bits remain to specify opcode and in-
struction encoding type. The outputs of IDEA are shared by our compiler,
our assembler, the RTL flow, and our simulator. The assembler and GNU
Binutils toolchain use the instruction encodings automatically generated by
IDEA.
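The encoding-space arithmetic in the example above can be checked with a short sketch. It assumes the 16-bit-immediate format carries two register operands, which is what leaves 6 bits; the function names and the sequential opcode assignment are hypothetical and stand in for whatever IDEA actually does.

```python
INSTRUCTION_BITS = 32
IMMEDIATE_BITS = 16
REGISTER_ID_BITS = 5   # 32 registers
REGISTER_OPERANDS = 2  # assumption: one destination and one source register


def opcode_bits(instr_bits, imm_bits, reg_bits, reg_operands):
    """Bits left for opcode and encoding type after operands are placed."""
    return instr_bits - imm_bits - reg_bits * reg_operands


def assign_opcodes(mnemonics, bits):
    """Assign sequential opcodes, failing if they do not fit in the bit space."""
    capacity = 1 << bits
    if len(mnemonics) > capacity:
        raise ValueError(
            f"{len(mnemonics)} instructions do not fit in {bits} bits "
            f"({capacity} encodings available)"
        )
    return {name: code for code, name in enumerate(mnemonics)}


bits = opcode_bits(INSTRUCTION_BITS, IMMEDIATE_BITS,
                   REGISTER_ID_BITS, REGISTER_OPERANDS)
print(bits)  # 32 - 16 - 2 * 5 = 6
print(assign_opcodes(["addi", "subi", "ldw"], bits))
```

With 6 bits, at most 64 distinct opcode and encoding-type combinations fit in this format; a larger specification would raise the error, mirroring the case where the designer must free encoding space.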
5.5.1 Limitations
Though powerful, IDEA has limitations. It does not automatically generate
entire simulators, compilers, or RTL implementations. We are required to
specify the implementation and functionality of new features in the simu-
lator, compiler, and RTL. However, once a new feature class (for instance,
a new instruction format or new field type) is implemented within IDEA,
modifications using those features are easy to propagate. We are extending
IDEA to provide a consistent execution unit profile, with quantity, latency,
and pipeline restrictions across simulator, compiler, and RTL.
CHAPTER 6
EVALUATION
This chapter describes some of the results generated by use of the previously
described set of tools developed for Rigel. Area and power estimates for
Rigel and its components are provided and compared with commercial de-
signs. We show that the achievable performance for a variety of accelerator
kernels enables Rigel to strike a good balance between a flexible program-
ming interface and high compute throughput. Several case study examples
using some or all of the tools are presented.
6.1 Benchmark Codes
We evaluate Rigel based on a variety of parallel applications and kernels
drawn from visual and scientific computing. Table 6.1 describes our set of
benchmark codes. Benchmarks are written using the Rigel Task Model for
work distribution.
6.2 Data Sharing for Parallel Applications
The design of Rigel’s memory system is informed by the sharing and com-
munication patterns of the parallel workloads targeted by accelerators. In
studying several such applications written for two different platforms (x86
and Rigel/RTM), we found that structured, coarse-grained sharing pat-
terns are common, that most sharing takes place across global synchroniza-
tion points (barriers), and that fine-grained data sharing between barriers is
uncommon.
Figure 6.1 illustrates the sharing patterns in several of our workloads. This
data was collected by instrumenting Rigelsim and our applications to track
Table 6.1: Description of data- and task-parallel workloads.
Benchmark   Description
cg          Conjugate Gradient linear solver
convolve    2D kernel convolution
dmm         Blocked dense matrix multiplication
fft         2D complex-to-complex radix-2 Fast Fourier Transform
gjk         Gilbert-Johnson-Keerthi 3D collision detection
heat        2D 5-point iterative, out-of-place stencil computation
kmeans      K-means Clustering
march       Marching Cubes polygonization of 3D volumetric data
mri         Magnetic Resonance Image reconstruction (FHD matrix)
sobel       Sobel edge detection
stencil     3D 7-point iterative, out-of-place stencil computation
sva         Scaled Vector Add
Table 6.2: Simulation parameters for the baseline architecture.
Parameter       Value   Unit
Cores           1024    –
Memory BW       192     GB/s
DRAM Channels   8       –
L1I Size        2       KB
L1D Size        1       KB
L2 Size         64      KB
L2 Size         8       MB
and categorize all memory accesses during execution of the parallel portion
of the kernel.
6.3 Rigel Scalability
Figure 6.2 illustrates kernel scalability for a variety of parallel applications
up to 1024 cores. Across our selection of benchmarks, we observe very good
scaling. We observe an average speedup of 84× for 1024 cores compared to
one eight-core cluster (harmonic mean, 128× is ideal speedup). Table 6.2
lists relevant simulation parameters for the Rigel design.
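The relationship between the reported mean and ideal scaling can be worked through briefly. The ideal speedup and the 84× figure come from the text; the two-element speedup list is made up purely to show how a harmonic mean weights poorly scaling benchmarks and is not measured data.

```python
from statistics import harmonic_mean

CORES = 1024
CORES_PER_CLUSTER = 8
IDEAL_SPEEDUP = CORES / CORES_PER_CLUSTER  # 128x relative to one cluster
REPORTED_MEAN_SPEEDUP = 84                 # harmonic mean quoted in the text

parallel_efficiency = REPORTED_MEAN_SPEEDUP / IDEAL_SPEEDUP
print(f"{parallel_efficiency:.1%}")  # 65.6%

# Illustrative values only (not measurements): one benchmark scaling
# perfectly and one scaling at half the ideal rate.
illustrative = [128.0, 64.0]
print(harmonic_mean(illustrative))            # about 85.3, pulled toward the slower case
print(sum(illustrative) / len(illustrative))  # 96.0 arithmetic mean for comparison
```

The harmonic mean is never larger than the arithmetic mean, so it gives a more conservative summary when a few kernels scale poorly.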
[Figure 6.1 plot: stacked bars from 0% to 100%; separate Write and Read bars for CG, DMM, GJK, HEAT, KMEANS, MRI, and SOBEL; categories: Output, Conflict, Private, Input.]
Figure 6.1: Characterization of memory accesses in task-based BSP
applications. Input reads and output writes communicate data across
barriers. The majority of memory accesses are to data that is private to a
task. Conflict accesses share data between two tasks in the same barrier
interval, requiring hardware coherence or special mechanisms like atomic
operations to maintain correctness, but are rare in our applications.
[Figure 6.2 plot: y-axis "Speedup Over 8-core Cluster", ticks from 2X to 128X; bars for 1, 2, 4, and 8 tiles for each of cg, dmm, fft, gjk, heat, kmeans, march, mri, sobel, and stencil.]
Figure 6.2: Benchmark scalability on Rigel chips with 1, 2, 4, and 8
128-core tiles (128-1024 cores). Speedups are relative to one eight-core
cluster. 128× represents linear scaling at 1024 cores. Benchmark binaries
and datasets are identical across all system sizes; global cache capacity and
memory bandwidth are scaled with the number of tiles.
[Figure 6.3 chart, "Feasibility: Area and Power":]
Component                           Area (mm2)   Share
Clusters (total)                    207          67%
  Logic (core + cluster cache)      112          35%
  Cluster cache SRAM                75           23%
  Register files                    20           6%
Global cache                        30           10%
Other logic                         30           9%
Overhead                            53           17%
•Targeting 45nm process @ 1.2 GHz, 1024 cores
•RTL synthesis results + memory compiler + datasheets
•Can build this today: 320 mm2 die area, <100W average power
•Estimated FLOPS/W and FLOPS/mm2 match or exceed GPUs
Figure 6.3: Area estimates for the Rigel design
6.4 Area and Power Estimates
To demonstrate the feasibility of Rigel on current process technology, I
provide area and power estimates on a commercial 40 nm process. Our
estimates are derived from synthesized Verilog, compiled SRAM arrays, IP
components, and die plot analysis of other 40 nm and 45 nm designs.
6.4.1 Area
A goal of programmable accelerators is to provide higher performance com-
pared to a general-purpose solution by maximizing compute density. With
an initial RTL implementation of the Rigel cluster, I provide an area and
power estimate on 40 nm technology to understand the impact of our choices
on compute density. Our estimates are derived from synthesized Verilog and
include SRAM arrays from a memory compiler and IP components for parts
of the processor pipeline. For large blocks, such as memory controllers and
global cache banks, I use die plot analysis of other 40 nm and 45 nm de-
signs to approximate the area that these components will consume for Rigel.
Figure 6.3 shows a breakdown of preliminary area estimates for the Rigel
design. Cluster caches are 64 kB each (for a total of 8 MB) and global cache
banks are 128 kB each (for a total of 4 MB) and are constructed from a
selection of dual-ported (one-read/one-write) SRAM arrays chosen for optimal
area. Cluster logic includes estimates for core area, including FPU, and the
cluster cache controller. “Other logic” includes interconnect switches as well
as memory and global cache controller logic. Register files have four read and
two write ports and are synthesized from flip-flops. Our initial area estimate
totals 266 mm2. For a more conservative estimate, I include a 20% charge
for additional area overheads. The resulting 320 mm2 is reasonable for im-
plementation in current process technologies, and leaves space for additional
SRAM cache or more aggressive memory controllers.
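The area and cache-capacity figures quoted in this section can be cross-checked with simple arithmetic. The sketch uses only numbers stated in the text; the variable names are illustrative, and the global cache bank count is derived rather than stated.

```python
# Area estimate with the conservative overhead charge.
INITIAL_AREA_MM2 = 266
OVERHEAD_FRACTION = 0.20
total_area_mm2 = INITIAL_AREA_MM2 * (1 + OVERHEAD_FRACTION)
print(round(total_area_mm2, 1))  # 319.2, quoted as 320 mm2 in the text

# Cache capacity totals.
CLUSTERS = 128
CLUSTER_CACHE_KB = 64
cluster_cache_total_kb = CLUSTERS * CLUSTER_CACHE_KB
print(cluster_cache_total_kb // 1024)  # 8 (MB)

GLOBAL_CACHE_TOTAL_KB = 4 * 1024
GLOBAL_BANK_KB = 128
global_banks = GLOBAL_CACHE_TOTAL_KB // GLOBAL_BANK_KB
print(global_banks)  # 32 banks implied by 4 MB of 128 kB banks
```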
6.4.2 Power
Typical power consumption of the design with realistic activity factors for
all components at 1.2 GHz is expected to be in the range of 99 W. Our
estimate is based on power consumption data for compiled SRAMs; post-
synthesis power reports for logic, leakage, and clock tree of core and cluster
components; and estimates for interconnect and I/O pin power. A 20%
charge for additional power overhead is included. Peak power consumption
beyond 100 W is possible for Rigel. The figure is similar to that of modern GPUs
from NVIDIA, which consume around 150 W [44], and modern high-end CPUs
such as Intel’s 8-core Xeon processor, which can consume 130 W [45].
6.4.3 Comparison
In Table 6.3, I compare our initial area and power estimates to those of
comparable accelerators scaled to match the process generation of the Rigel
implementation. The numbers provided are meant to lend context for our
estimates and are subject to parameter variation, such as clock speed. Rigel’s
expected power consumption to area ratio is comparable to that of the 45 nm
Cell. The estimates indicate that a Rigel design could potentially surpass
accelerators such as GPUs in compute density; this is partially due to the
lack of specialized graphics hardware. GPUs also spend a substantial portion
of their area budget on graphics-related hardware for texture, framebuffer,
and raster operations that take considerable area, but do not improve the
performance of general-purpose computation. GPUs also incorporate high
levels of multi-threading which increase utilization, but reduce peak compute
throughput. Rigel recovers this area and puts it towards additional compute
Table 6.3: Power, area, and performance comparison of Rigel to
accelerators normalized to 40 nm.
Architecture      Power (W/mm2)   Perf. (GOPS/mm2)   Machine Balance (GBPS_PEAK/GOPS_PEAK)
CellBE            .3              1.8                .13
Intel Quad-core   .5              .4                 .25
NVIDIA GTX280     .3–.4           3.3                .14
ATI R700          .55–.9          6.4                .1
Rigel             .3              8                  .05
and cache resources. As expected, Rigel and other accelerators hold a
significant advantage in compute density compared to general-purpose CPUs,
such as those from Intel [45] and Sun [46].
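As a rough aid to reading Table 6.3, dividing the performance density by the power density gives an implied GOPS/W for each design. This ratio is derived here for illustration and is not reported in the text; rows given as a range in the table are carried through as a range.

```python
# (power low W/mm2, power high W/mm2, performance GOPS/mm2), from Table 6.3.
TABLE_6_3 = {
    "CellBE": (0.3, 0.3, 1.8),
    "Intel Quad-core": (0.5, 0.5, 0.4),
    "NVIDIA GTX280": (0.3, 0.4, 3.3),
    "ATI R700": (0.55, 0.9, 6.4),
    "Rigel": (0.3, 0.3, 8.0),
}


def gops_per_watt(power_low, power_high, perf):
    """Return the (min, max) GOPS/W implied by a power-density range."""
    return perf / power_high, perf / power_low


implied = {name: gops_per_watt(*row) for name, row in TABLE_6_3.items()}
for name, (low, high) in implied.items():
    print(f"{name:16s} {low:5.1f} to {high:5.1f} GOPS/W")
```

These values inherit all the caveats of the table, including normalization to 40 nm and variation in clock speed.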
6.4.4 Area Estimate Validation
The Rigel core area estimates are comparable to those of other simple core
designs. Tensilica cores with 8 kB SRAM scaled to 40 nm are 0.06-0.18
mm2 [47], approximating a cluster area of 0.5 to 1.6 mm2. Higher perfor-
mance MIPS soft cores consume 0.42 mm2 scaled to 40 nm, and if used to
build 8-core clusters, would approximately occupy 3.5 mm2 [48]. Neither
matches the features of a Rigel core exactly, but both contain additional
features that are not required, such as debug ports, MMU components, and
peripheral bus interfaces. Frequency and area depend on many parameters,
including enabled features, synthesis, process, and cell libraries.
Our area estimates were validated by comparing to existing designs such
as the MIPS 74K [49] and Tensilica 570T [50] and 108Mini [51]. These
commercial cores contain additional logic for features that our cores will lack,
such as bus interface units, debug ports, and FPUs that fully support IEEE
754 double precision. They are also targeted towards synthesis as a discrete
processor and configurability as opposed to a fixed design synthesized as a
group of eight stripped-down cores.
Table 6.4 shows area data for existing designs normalized to 40 nm via
ideal scaling. From the range of values shown, we can see that the developed
Table 6.4: Area comparison of commercial cores normalized to 40 nm
Architecture           Area (mm2)
2Wide In-Order Model   .21 (w/cache)
Tensilica 108Mini      .07 to .125
Tensilica 570T         .175
Table 6.5: CACTI cache area estimates. Bus widths are measured in bits.
Architecture   Address bus   Data bus   Area (mm2)
8-way banked   32            32         0.433
unbanked       32            256        0.329
model falls in the range of reasonable values for tiny cores. It is also clear
that hitting Rigel’s target area may be difficult and will require a spartan
core design and judicious use of area.
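Normalization by ideal scaling is commonly taken to mean that area shrinks with the square of the feature size. The sketch below assumes that interpretation; the function name and the 65 nm example value are hypothetical and are not data from the cited designs.

```python
def scale_area(area_mm2, source_nm, target_nm):
    """Scale an area between process nodes assuming ideal (quadratic) scaling."""
    return area_mm2 * (target_nm / source_nm) ** 2


# Hypothetical example: a 1.0 mm2 block reported at 65 nm, normalized to 40 nm.
scaled = scale_area(1.0, 65, 40)
print(round(scaled, 3))  # 0.379
```

Real designs rarely scale ideally, since SRAM, analog blocks, and wiring shrink less than logic, so normalized values such as these are best read as approximations.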
6.4.5 Cluster Cache Estimates
When considering tradeoffs at the cluster cache level, I chose to explore two
configurations (see Table 6.5 and Table 3.1). I used CACTI to get an estimate
of the sizes of our caches. Our two cache configurations were both 64 kB,
8-way set associative caches with 32 byte blocks. One was banked 8 ways
and the other was left as a single monolithic cache. The 8-way banked cache
was modeled with one 32-bit read/write port per bank and the single banked
cache was modeled with one 256-bit (full cache line) read/write port.
From the CACTI results for our two cache configurations, I find that for
this particular cache size and organization, the 8-way banked cache takes
about 32 percent more area than the single bank configuration. The bus-
based cache design saves about 0.1 mm2 of area, a huge win given the area
budget for a cluster. Table 6.5 shows the actual CACTI results for each of
the two cache configurations.
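The comparison drawn from Table 6.5 reduces to two small calculations, shown here with the table's values; the variable names are illustrative.

```python
BANKED_AREA_MM2 = 0.433    # 8-way banked, 32-bit port per bank
UNBANKED_AREA_MM2 = 0.329  # single bank, 256-bit port

saved_mm2 = BANKED_AREA_MM2 - UNBANKED_AREA_MM2
extra_fraction = saved_mm2 / UNBANKED_AREA_MM2

print(f"{saved_mm2:.3f} mm2 saved")             # 0.104 mm2 saved
print(f"{extra_fraction:.1%} more when banked")  # 31.6% more when banked
```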
Another area consideration is the wiring overhead in the core-to-cache
interconnection network. The multi-banked cache would be connected with
an 8-by-8 crossbar and the single-banked cache would be connected with a
shared bus. In the crossbar network, 32-bit data and 32-bit address lines
would need to be routed between the cores and the cache banks. With a shared
bus, the bus would need to contain 256-bit data and 32-bit address lines. With
enough metal layers, some of the interconnection area can be routed over
the cluster caches to mitigate the interconnect area overhead. However,
Kumar et al. [19] demonstrate that the area overhead for an 8-by-8 interconnect
is significantly greater than that of an 8-core shared bus, which provides another
argument for the lower hardware cost of a shared bus. In addition,
arbitration logic for the shared bus will be simpler as there are fewer resources to
contend for but the same number of accessors.
6.5 Cluster Design Space
In parallel designs such as Rigel, many shared resources exist. Contention
for these shared resources can play an important role in performance and
should be modeled accurately in the simulator for best results. Accordingly,
I have implemented a generic arbitration system that may be attached to
any shared resource or set of resources in the simulator. A variety of policies
can potentially be implemented, but currently the simple standby of round-
robin arbitration is used. Round-robin arbitration ensures that access is fair
and that no requester is starved. Currently, arbiters are used to control
access to shared cache structures at the cluster level as well as the global
cache. As a sensitivity study, I present the results in Figure 6.4 to illustrate
performance deltas observed when modeling and not modeling contention
within the simulator. Clearly, to make design tradeoffs at a low level of
detail, contention must be modeled.
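Round-robin arbitration of the kind described above can be sketched in a few lines. This is a generic illustration rather than the simulator's code: the class name, the one-grant-per-cycle interface, and the boolean request vector are assumptions.

```python
class RoundRobinArbiter:
    """Grant one requester per call, rotating priority after each grant."""

    def __init__(self, num_requesters):
        self.num_requesters = num_requesters
        # Start so that requester 0 has the highest priority on the first call.
        self.last_grant = num_requesters - 1

    def grant(self, requests):
        """Return the index of the granted requester, or None if none request.

        `requests` is a sequence of booleans, one per requester.
        """
        for offset in range(1, self.num_requesters + 1):
            candidate = (self.last_grant + offset) % self.num_requesters
            if requests[candidate]:
                self.last_grant = candidate
                return candidate
        return None


arbiter = RoundRobinArbiter(4)
grants = [arbiter.grant([True, True, True, True]) for _ in range(5)]
print(grants)  # [0, 1, 2, 3, 0]
```

Because the search always starts just after the most recent winner, a continuously requesting core waits at most one grant per other requester, which is the fairness and starvation-freedom property noted in the text.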
Figure 6.4 shows some performance results generated by our simulator for
several kernels of interest. These results are from a much earlier point in
the development of the Rigel project. The “NoCon” cases represent a best-
case performance level when there is no contention for the cluster cache. As
expected, there is a performance penalty when contention is modeled for all
configurations. The best performance is achieved by the use of a crossbar
interconnect combined with an L1 Line buffer; however, this is not one of
our target configurations as it increases complexity of the cache subsystem
and does not achieve a reduced area. The shared bus paired with an L1 line
buffer provides performance similar to, and indeed better than, the simple
Rigel baseline of a crossbar. Some of this benefit is due to the reduced load-
to-use penalty that a small core-local Line Buffer provides, while some is due
Figure 6.4: Simulation cycles across five benchmarks and various
configurations.
to reduced contention for shared ports. On codes that can better exploit
the line buffer with dense sequential accesses, such as dense matrix multiply
(DMM), the line buffer provides a large benefit. On other codes with less
regular accesses, such as 2D convolution, the benefit of the line buffer is
smaller but still enough to surpass the baseline crossbar. It is interesting to
note the exceptionally poor scaling of the SVA kernel; this is not surprising,
as this kernel is severely bandwidth bound. Achieving similar performance to
a crossbar while saving nearly 30 percent in cache area and nearly 10 percent
in cluster area is compelling for the area-oriented cluster design required by
Rigel.
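Why dense sequential accesses benefit most from a line buffer can be seen with a toy model. The sketch assumes 32-byte lines (matching the cache block size given earlier), 4-byte word accesses, and LRU replacement; the class and the access patterns are illustrative and do not reproduce the simulated configurations.

```python
from collections import OrderedDict

LINE_BYTES = 32  # matches the 32-byte cache blocks described in the text


class LineBuffer:
    """Tiny fully associative buffer of cache lines with LRU replacement."""

    def __init__(self, num_lines=1):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # line number -> True, oldest first

    def access(self, address):
        """Return True on a hit; on a miss, fill the line and return False."""
        line = address // LINE_BYTES
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return True
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[line] = True
        return False


def hit_rate(addresses, num_lines=1):
    buffer = LineBuffer(num_lines)
    hits = sum(buffer.access(address) for address in addresses)
    return hits / len(addresses)


sequential = [4 * i for i in range(64)]  # consecutive 4-byte words
strided = [32 * i for i in range(64)]    # one word from each line

print(hit_rate(sequential))  # 0.875
print(hit_rate(strided))     # 0.0
```

With a single line, seven of every eight sequential word accesses hit in the buffer and never reach the shared cluster cache port, while a line-sized stride gains nothing.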
6.5.1 Cluster-Level Cache Organization
Several other parameters of the cluster-level cache are under consideration.
The actual sizing of the cluster cache is a parameter of interest, but sizing
caches based on benchmark behavior could be a dangerous course of action; care
must be taken not to undersize caches based on results from overly optimistic
benchmark data.
The cluster-level cache could be used either purely as a data cache or as
a cluster-level unified cache that stores instructions as well. A unified cluster
cache presents additional consumers and more contention, especially in the
case of a shared bus. However, it presents the potential to reduce latencies
for L1I cache misses and act as a temporal read-sharing location for L1I cache
misses that could occur near in time. At the same time, cores each have their
own L1I caches and keeping the cluster cache data-only could provide more
effective cache space for data and reduce on-chip replication of code.
The results show that caching instructions in the cluster cache has a
relatively small impact on performance. Some benchmarks show a slight
increase in performance, likely because their small code kernel
resides almost entirely in L1. Others show slight dips due to the increased
cost of an L1I miss. Unfortunately, our current set of test codes is
relatively compact and may not ultimately be representative of realistic use of
the L1I cache. Further study of this issue will be required as we expand our
set of test kernels and applications to include more complex pieces of code.
The additional contention for the cluster cache with more complex pieces of
code may reveal deeper performance pathologies, but none are clear given
our current test codes.
We conducted a simple study on sizing the L1 line buffer, which essentially
converts it into a very small, fully associative cache (1-8 lines).
Preliminary numbers (see Figure 6.4) show significant performance gains for
even small increases up to 8 lines; this is not surprising, as the L1 allows data
to be accessed more quickly even without conflicts. However, we need to evaluate
these results further in the context of area overheads and design complexity
for the Rigel design. Implementing small line buffers or L1 caches increases
the design complexity that is required to maintain coherence at the cluster
level and also complicates the consistency model. However, these preliminary
results along with other findings on the cluster cache interconnect challenge
the early design assumptions of the Rigel design.
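
As a rough illustration of why a few extra lines can matter, the following Python sketch models the line buffer as a tiny fully associative cache and measures its hit rate on a synthetic address trace. The LRU replacement policy, the 32-byte line size, the function name, and the traces are all assumptions made for this sketch; none of them are taken from the Rigel simulator.

```python
from collections import OrderedDict

def line_buffer_hit_rate(addresses, num_lines, line_bytes=32):
    """Hit rate of a fully associative buffer with LRU replacement (assumed policy)."""
    addresses = list(addresses)
    if not addresses:
        return 0.0
    buf = OrderedDict()  # keys are line tags, ordered from least to most recently used
    hits = 0
    for addr in addresses:
        tag = addr // line_bytes
        if tag in buf:
            hits += 1
            buf.move_to_end(tag)         # refresh LRU position on a hit
        else:
            if len(buf) >= num_lines:
                buf.popitem(last=False)  # evict the least recently used line
            buf[tag] = True
    return hits / len(addresses)

# Synthetic traces: one sequential word stream, and two interleaved streams.
sequential = range(0, 256, 4)
interleaved = [a for i in range(0, 256, 4) for a in (i, 4096 + i)]

print(line_buffer_hit_rate(sequential, 1))
print(line_buffer_hit_rate(interleaved, 1))
print(line_buffer_hit_rate(interleaved, 2))
```

On these synthetic traces, a single line captures one sequential stream, while two interleaved streams evict each other until a second line is available.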
6.6 Coherence Techniques
A high-performance accelerator requires efficient mechanisms for sharing data
and maintaining coherence between multiple private caches; while scalable
multiprocessor hardware coherence schemes exist [52], they were designed for
machines with a very different mix of computation, communication, and stor-
age resources than accelerators. Indeed, modern general-purpose CMPs and
multi-socket systems generally use much simpler protocols which work well
for small systems but are cost-prohibitive for a 1024-core accelerator [53, 54].
Additionally, our target applications exhibit data sharing patterns that are
more structured than those targeted by traditional distributed machines and
CMPs. While our initial design goal for Rigel was to achieve good perfor-
mance and programmability without hardware cache coherence (HWcc), we
have since examined ways to attain some of the benefits of hardware coher-
ence with reduced overhead by leveraging the sharing characteristics of our
target workloads.
6.6.1 Software-managed Coherence with the Task-centric
Memory Model
Adopting a structured programming model enables us to implement software-
managed cache coherence (SWcc) efficiently. We developed the Task-Centric
Memory Model (TCMM) [26] as a contract describing the software actions
necessary to ensure correctness in task-based BSP programs in the absence
of hardware-enforced coherence. All blocks start in the clean state with
no sharers or cached copies and may transition to immutable (read-only),
shared as globally coherent, or private (either clean or dirty). State
is implicit and must be tracked by the programmer. Cached local memory
operations may operate on private or immutable data, whereas uncached
global operations are required for globally coherent data. Transitioning
data between states requires first moving through the clean state. SWcc re-
quires minimal hardware support in the form of instructions for explicitly
writing back and invalidating data in private caches. The TCMM protocol is
described in detail in [26]. We found that a small number of additional hard-
ware mechanisms, such as broadcast support to accelerate global barriers and
global atomic operations to facilitate infrequent intra-barrier sharing, greatly
improved scalability over a naïve design. With these relatively inexpensive
mechanisms, SWcc was able to achieve performance on par with idealized
hardware coherence at 1024 cores. Only one benchmark suffered a greater
than 10% performance loss compared with the idealized coherent model. Fu-
ture accelerators can improve upon TCMM by automating coherence actions
in the compiler and scheduling coherence actions to optimize cache behavior.
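
The state rules described above can be sketched as a small checker. This is only an illustrative model under stated assumptions, not the TCMM specification from [26]: the names `BlockState`, `can_transition`, `access_allowed`, and `actions_to_clean` are hypothetical, and the mapping from states to write-back and invalidate actions is my reading of the text rather than a documented interface.

```python
from enum import Enum

class BlockState(Enum):
    CLEAN = "clean"                # initial state: no sharers or cached copies
    IMMUTABLE = "immutable"        # read-only
    GLOBAL = "globally coherent"   # accessed with uncached global operations
    PRIVATE_CLEAN = "private clean"
    PRIVATE_DIRTY = "private dirty"

PRIVATE = {BlockState.PRIVATE_CLEAN, BlockState.PRIVATE_DIRTY}

def can_transition(src, dst):
    """Changing sharing state requires passing through CLEAN."""
    if src is dst:
        return True
    # Assumption: a local store dirtying a private block is not a sharing-state change.
    if src is BlockState.PRIVATE_CLEAN and dst is BlockState.PRIVATE_DIRTY:
        return True
    return BlockState.CLEAN in (src, dst)

def access_allowed(state, op):
    """Cached local ops for private/immutable data, uncached global ops otherwise."""
    if state in PRIVATE:
        return op in ("local_load", "local_store")
    if state is BlockState.IMMUTABLE:
        return op == "local_load"
    if state is BlockState.GLOBAL:
        return op in ("global_load", "global_store")
    return False  # assumption: a CLEAN block is first moved to another state

def actions_to_clean(state):
    """Assumed software actions needed to return a block to CLEAN."""
    if state is BlockState.PRIVATE_DIRTY:
        return ["writeback", "invalidate"]
    if state in (BlockState.PRIVATE_CLEAN, BlockState.IMMUTABLE):
        return ["invalidate"]
    return []  # GLOBAL data is uncached; CLEAN needs nothing

print(can_transition(BlockState.IMMUTABLE, BlockState.GLOBAL))
print(actions_to_clean(BlockState.PRIVATE_DIRTY))
```

Because the state is implicit in TCMM, a checker of this kind would live in a debugging tool or simulator rather than in hardware.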
6.6.2 Hybrid Coherence with Cohesion
Memory models in use today are either fully hardware-coherent or fully
software-coherent. In systems that include both models, the two models
are strictly separated by using disjoint address spaces or physical memories.
As systems-on-chip (SoCs) and heterogeneous accelerator platforms become
more prevalent, the ability to seamlessly manage data across different mem-
ory models will become increasingly important.
Software-managed cache coherence removes the area, power, and intercon-
nect traffic overhead of cache coherence for structured data sharing patterns
Figure 6.5: Cohesion is a hybrid memory model for accelerators that
enables hardware and software-managed coherence to coexist, allows data
to migrate between the two domains dynamically, and captures the
performance, efficiency, and programmability benefits of both regimes.
(The figure plots cache lines in the address space, 0x100 through 0x1C0,
against time, T=0 through T=4, marking each line as SWcc or HWcc and
showing transitions between the software-managed and hardware-managed
coherence protocols.)
and allows experienced application developers to achieve high performance.
Hardware coherence avoids the instruction overhead of software coherence,
performs well with unstructured sharing patterns, and provides correct data
sharing with low programmer effort. To achieve the combined benefits of
these two models, we have developed Cohesion, a hybrid memory model.
Cohesion includes a hardware coherence implementation which tracks the
entire address space by default. The developer can selectively remove cache
lines from the HWcc domain at runtime and manage them using software
to improve performance. Because data can move back and forth between
the SWcc and HWcc domains at will, Cohesion can be used to dynamically
adapt to the sharing needs of applications and runtimes and does not require
multiple address spaces or explicit copy operations. Figure 6.5 illustrates
the high-level operation of Cohesion. We implement software-managed
coherence using TCMM and use an MSI-based hardware coherence protocol,
but any hardware and software protocols may be used if the necessary state
transitions are enforced.
A developer may instruct the hardware coherence machinery to defer to
software management for a particular cache line by updating a software-
accessible table in memory. For instance, hardware coherence management
is inefficient when data is private or when a large amount of data can be
handled as a unit by software. Handling read-mostly and private data outside
the scope of the hardware coherence protocol can increase performance and
reduce the load on the coherence hardware, increasing the effective directory
size for data under HWcc as seen in Figure 6.6. Ultimately, Cohesion allows
Figure 6.6: Performance versus directory cache size for (A) HWcc alone and
(B) Cohesion, which amplifies effective directory size.
(Recoverable panel content: a bar chart of the average number of directory
entries allocated under Cohesion and HWcc for CG, DMM, GJK, HEAT, KMEANS,
MRI, SOBEL, STENCIL, and the mean, broken down into code, heap/global, and
stack, with the maximum allocated marked; and line plots of slowdown
normalized to infinite entries versus directory entries per L3 cache bank,
from 256 to 16384, for the same benchmarks.)
explicit coherence management to be an optional optimization opportunity,
rather than necessary for correctness.
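
One way to picture the software-accessible table is as a bitmap with one bit per cache line, where a cleared bit leaves the line under hardware coherence. The sketch below is a hedged illustration only: the class name, the bitmap layout, and the 32-byte line size are assumptions, and it omits the coherence actions that a real domain transition would require.

```python
LINE_BYTES = 32  # assumed cache-line size for this sketch

class CoherenceDomainTable:
    """Hypothetical per-line table: bit = 0 means HWcc (the default), 1 means SWcc."""

    def __init__(self, mem_bytes):
        n_lines = mem_bytes // LINE_BYTES
        self.bits = bytearray((n_lines + 7) // 8)  # all zero: everything starts in HWcc

    def _locate(self, addr):
        line = addr // LINE_BYTES
        return line // 8, 1 << (line % 8)

    def to_swcc(self, addr):
        """Remove the line containing addr from the hardware coherence domain."""
        byte, mask = self._locate(addr)
        self.bits[byte] |= mask

    def to_hwcc(self, addr):
        """Return the line containing addr to the hardware coherence domain."""
        byte, mask = self._locate(addr)
        self.bits[byte] &= ~mask & 0xFF

    def is_swcc(self, addr):
        byte, mask = self._locate(addr)
        return bool(self.bits[byte] & mask)

table = CoherenceDomainTable(mem_bytes=4096)
table.to_swcc(0x120)
print(table.is_swcc(0x13F), table.is_swcc(0x140))
```

A lookup of this form is what would let the coherence hardware decide, per line, whether to allocate a directory entry or defer to software.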
6.7 IDEA Case Studies
To illustrate the utility of our framework and its integration with IDEA, I
present the following real-world case studies. I present quantitative data
where appropriate and rely on qualitative description otherwise. The intent
is not to argue for or against particular design decisions, but rather to
consider what types of questions designers
in the 1000-core design space might pose and how we are able to evaluate
them in our framework.
6.7.1 Case Study: FMADD
Designers are often faced with a seemingly simple question: What is the
utility of augmenting the ISA with a new instruction? For a parallel system
such as Rigel, this is more than a question of core-level performance; the
marginal cost of a core-level design decision is multiplied 1024× across the
chip. A full evaluation may involve a design space exploration, examination
of performance tradeoffs, and area or power analysis. Within our integrated
toolflow, we examine a simple ISA extension, the addition of a floating-point
fused multiply-add (FMADD) instruction.
An FMADD performs the simple operation R = A*B + C. Even for such a
simple operation, numerous design choices present themselves. Should the
instruction have four operands, or only three and reuse a source for the
destination? Should a special accumulator or accumulator register file be
used to improve implementation performance or provide additional register
file bandwidth? To evaluate this tradeoff, we can measure speedup for a
matrix multiplication, normalized to a standard MUL+ADD implementation,
for various unrollings of the inner loop. FMADD uses the standard register file
and FMACC uses a separate accumulator file. Combined with area and power
data, also generated through our flow, this performance data allows us to
comprehensively evaluate the merits of FMADD configurations.
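
To make the comparison concrete, the following sketch contrasts a dot-product inner loop written with separate MUL and ADD steps against one written with a single FMADD step that reuses the accumulator as its destination. It is a counting model under simple assumptions: the function names are hypothetical, only arithmetic instructions are counted, and Python rounds the product and the sum separately, so the single rounding of a true fused operation is not modeled.

```python
def dot_mul_add(a, b):
    """Inner product using separate MUL and ADD; returns (result, arithmetic ops issued)."""
    acc, issued = 0.0, 0
    for x, y in zip(a, b):
        t = x * y        # MUL
        acc = acc + t    # ADD
        issued += 2
    return acc, issued

def dot_fmadd(a, b):
    """Inner product using one FMADD per element (acc = x*y + acc)."""
    acc, issued = 0.0, 0
    for x, y in zip(a, b):
        acc = x * y + acc  # FMADD, destination reuses the addend register
        issued += 1
    return acc, issued

row = [1.0, 2.0, 3.0]
col = [4.0, 5.0, 6.0]
print(dot_mul_add(row, col))
print(dot_fmadd(row, col))
```

The instruction count halves in this model; whether that translates into speedup depends on issue width, register file bandwidth, and loop unrolling, which is what the measurements described above are meant to capture.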
6.7.2 Case Study: Instruction Encoding
Our flexible IDEA tool has enabled us to quickly mold our ISA to our needs.
We have found IDEA useful for adding new instructions, adding new instruc-
tion encoding types, and optimizing encoding for hardware implementation
efficiency.
Long Jump
Most ISAs provide instructions with immediate operands. Depending on
the ISA, immediate operand sizes vary greatly from a few bits to full word-
sized values. These immediate value sizes are fixed; if the software needs
additional bits, register based instructions are required. This can lead to
minor headaches for software or software developers. During development,
we ran into a situation where we desired greatly increased range in the jump
offset in order to reduce complexity in code generation. IDEA enabled us
to quickly re-encode the instruction set, extracting the additional encoding
space necessary to enable longer 26-bit immediate values for jumps. The new
field sizing was accepted in the simulator and RTL via their auto-generated
decoder modules where existing decode packet types were updated, and sim-
ilarly rebuilding LLVM enabled longer offsets to be used during compilation.
Figure 6.7: Area and power for core configurations with and without
bypassing, with 5 different L1 cache sizes and at 4 frequency targets. Data
are marked by clock period (ns).
(Axes: power in mW/core versus area/core in square microns; the clock
periods marked are 1.25, 1.33, 1.5, and 2 ns.)
While a simple example, this illustrates the types of problems such an ap-
proach can solve.
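
A short sketch shows what a 26-bit jump immediate buys in reach. The layout here is an assumption made for illustration: a 32-bit instruction word with a 6-bit opcode in the upper bits and a signed 26-bit offset below it. The opcode values and function names are hypothetical and do not come from the Rigel ISA definition.

```python
IMM_BITS = 26
OPCODE_BITS = 6                      # assumed: the rest of a 32-bit word
IMM_MASK = (1 << IMM_BITS) - 1
IMM_MIN = -(1 << (IMM_BITS - 1))     # most negative signed offset
IMM_MAX = (1 << (IMM_BITS - 1)) - 1  # most positive signed offset

def encode_jump(opcode, offset):
    """Pack an opcode and a signed 26-bit offset into one 32-bit word."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode does not fit in the opcode field")
    if not IMM_MIN <= offset <= IMM_MAX:
        raise ValueError("offset does not fit in a signed 26-bit immediate")
    return (opcode << IMM_BITS) | (offset & IMM_MASK)

def decode_jump(word):
    """Unpack the opcode and sign-extend the 26-bit offset."""
    opcode = word >> IMM_BITS
    imm = word & IMM_MASK
    if imm & (1 << (IMM_BITS - 1)):
        imm -= 1 << IMM_BITS
    return opcode, imm

word = encode_jump(0x2A, -5)
print(hex(word), decode_jump(word))
print(IMM_MIN, IMM_MAX)
```

In a generated flow, field widths such as `IMM_BITS` would come from a single ISA description, so that the simulator, RTL decoder, and compiler backend all pick up a change together.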
Additional examples
We have taken advantage of our flexible ISA framework on numerous occa-
sions to address practical issues. We found it necessary to reconsider the
number of architectural registers we provide, reducing this from 64 to 32.
Reducing the number of registers allowed space in the encoding for three
register IDs and a 16-bit immediate; large immediates reduce the number of
trampolines needed to implement branches in large-footprint codes. Enabled
by our integrated flow, this change was implemented across the entire design
quickly and efficiently.
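
The encoding space recovered by this change follows from simple arithmetic, sketched below. The helper names are hypothetical; the only inputs taken from the text are the register counts (64 and 32) and the three register IDs per instruction.

```python
def reg_id_bits(num_regs):
    """Bits needed to name one of num_regs architectural registers."""
    return (num_regs - 1).bit_length()

def freed_bits(old_regs, new_regs, ids_per_instr=3):
    """Encoding bits recovered per instruction when the register count shrinks."""
    return ids_per_instr * (reg_id_bits(old_regs) - reg_id_bits(new_regs))

print(reg_id_bits(64), reg_id_bits(32))  # 6 5
print(freed_bits(64, 32))                # 3
```

Each register ID shrinks by one bit, so a three-register format recovers three bits that can be given to the immediate field.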
We modified instruction formats to open additional encoding space and to
improve hardware decoder efficiency; these changes are transparently picked
up in the simulator and software stack as well as the RTL frontend. To
improve RTL decoder efficiency and ease of implementation, several one-hot
encoded bit fields were added.
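
The decoder benefit of a one-hot field can be sketched as follows. The instruction classes and function names are hypothetical and chosen only for illustration; the point is that a binary-encoded field needs a full-width comparison to identify a class, whereas a one-hot field needs a test of a single bit, at the cost of more encoding bits.

```python
CLASSES = ["alu", "mem", "branch", "fp"]  # hypothetical instruction classes

def binary_field(cls):
    """Binary encoding: log2(len(CLASSES)) bits, here 2."""
    return CLASSES.index(cls)

def onehot_field(cls):
    """One-hot encoding: one bit per class, here 4."""
    return 1 << CLASSES.index(cls)

def is_class_binary(field, cls):
    """Decoder must compare every bit of the field against a constant."""
    return field == CLASSES.index(cls)

def is_class_onehot(field, cls):
    """Decoder inspects exactly one bit (one wire in hardware)."""
    return bool((field >> CLASSES.index(cls)) & 1)

print(bin(binary_field("branch")), bin(onehot_field("branch")))
print(is_class_onehot(onehot_field("branch"), "branch"))
```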
We have augmented our ISA with various new memory operations over
time and experimented with non-standard encoding types for adding accu-
mulator registers.
6.7.3 Case Study: Core Microarchitecture Tradeoffs
Our cycle-accurate full-system simulator allows us to precisely evaluate ar-
chitectural tradeoffs for performance, and our configurable RTL model allows
us to evaluate the corresponding impact on area, power, and frequency. This
analysis is key to maximizing compute density; for our parallel accelerator,
performance can be traded for reduced area or power. The space of mi-
croarchitectural features to consider is vast; I examine a small subset for
illustrative purposes.
In this example, I examine the impact on power and area of cache sizing,
result bypassing, and synthesis clock target. Figure 6.7 presents a scatter
plot of area and power consumption for a variety of design points with a range
of cache sizes and with bypassing both enabled and disabled. From the figure, we
observe clusters of data points for each selection of cache size, where power
increases with clock frequency. Larger caches consume additional area and
power, but have the potential to improve performance.
6.7.4 Case Study: Register File Sizing
We consider a question that spans the entire system stack from compiler to
architecture to implementation: How many registers should we support? To
answer this in the context of throughput-oriented systems, we should con-
sider individual core performance in addition to area and timing impact. We
present a sample evaluation of 16 and 32-entry register files. Figure 6.8 illus-
trates the perf-per-area and perf-per-watt tradeoff for this design question,
normalized to a default configuration. The results include modifications to
the compiler, performance from simulation, and area and power from RTL.
For the data presented in this example, the larger register file tends to im-
prove both performance per area and performance per watt, even though the
larger register file itself consumes additional area and power. It enables more
efficient utilization of the hardware.
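
The normalization behind such a plot can be sketched in a few lines. The dictionary keys, the function name, and especially the numbers are placeholders chosen only to exercise the arithmetic; they are not measurements from our flow. The sketch assumes performance is taken as the inverse of simulated cycle count at a fixed clock.

```python
def normalized_efficiency(candidate, baseline):
    """Perf/area and perf/watt of a candidate, normalized to a baseline configuration."""
    speedup = baseline["cycles"] / candidate["cycles"]  # performance from simulation
    rel_area = candidate["area"] / baseline["area"]     # area from RTL synthesis
    rel_power = candidate["power"] / baseline["power"]  # power from RTL synthesis
    return {
        "perf_per_area": speedup / rel_area,
        "perf_per_watt": speedup / rel_power,
    }

# Placeholder values only, not thesis results.
baseline = {"cycles": 1000, "area": 100.0, "power": 10.0}
candidate = {"cycles": 800, "area": 110.0, "power": 11.0}
print(normalized_efficiency(candidate, baseline))
```

A point above 1.0 on both axes means the candidate's performance gain outweighs its added area and power.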
Figure 6.8: Performance per watt vs. performance per mm² for 16- and
32-register configurations.
(Axes: normalized perf/watt versus normalized perf/area, each from 0 to
3.5; series: RF16 and RF32.)
6.7.5 Case Study: Bypass
Figure 6.9 shows the area cost of operand bypassing for a simple two-wide
in-order pipeline. Area numbers represent the core pipeline and register file
without caches. Due to limited SRAM performance, such clock speeds are
not attainable when including the caches. As clock speeds increase, the
incremental cost of bypassing increases from 10% to 20%. Complex bypass
networks introduce substantial additional wiring for a simple in-
order core. Depending on the performance impact, overall global area cost,
and other available latency-hiding features such as multithreading, bypassing
may or may not be worth the cost.
Figure 6.9: Area cost of enabling bypass networks for a range of synthesis
clock targets.
(Axes: area in 1000s of square microns versus synthesized clock frequency,
0.8 to 1.4 GHz; series: Bypass and NoBypass, with the gap annotated as
~10% and ~20%.)
CHAPTER 7
CONCLUSION
In this thesis, I have presented an integrated design flow for evaluating a new
design space, massively parallel chip multiprocessors. The Rigel toolchain
has successfully allowed exploration and innovation in the space of 1000+
core accelerator processors. I have shown some of the ways that this toolflow
has been applied in practice. This powerful set of tools has enabled research
leading to several related publications.
With the transistor count afforded by modern manufacturing processes and
large available die sizes, a 1024-core processor is feasible in today’s technol-
ogy. Such a design is not only possible, but competitive with contemporary
accelerator processors while providing a more general programming model
and support for a wider variety of data- and task-parallel workloads.
We faced many challenges while developing our toolflow, including long
simulation times, a need for accurate memory system modeling due to the
strong dependence of application performance on memory bandwidth and
latency, the criticality of accurate area and power measurement for cores,
and a rapidly evolving ISA and benchmark suite.
Through several case studies I demonstrate the practical utility of our
techniques and illustrate how they can be applied to future parallel archi-
tectures. I find that enabling a flexible ISA is a highly desirable design goal
for exploring accelerator architectures, requiring an integrated approach that
allows for the rapid introduction and removal of instructions from the code
generator, simulation infrastructure, and RTL implementation.
7.1 Tools Release
It is my intention to prepare, package, and release the Rigel toolset at
some point in the future so that the community at large can benefit from the
development effort of the Rigel team. Development on the RTL, simulation,
and compilation toolchain continues.
REFERENCES
[1] W. J. Dally, “Moving the needle, computer architecture research
in academe and industry,” in Proceedings of the 37th International
Symposium on Computer architecture, 2010. [Online]. Available:
http://doi.acm.org/10.1145/1815961.1815963
[2] NVIDIA, NVIDIA GeForce 8800 GPU Architec-
ture Overview, November 2006. [Online]. Available:
http://www.nvidia.com/object/IO 37100.html
[3] M. Gschwind, “Chip multiprocessing and the Cell broadband engine,”
in Proceedings of the 3rd Conference on Computing Frontiers, 2006, pp.
1–8.
[4] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey,
S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski,
T. Juan, and P. Hanrahan, “Larrabee: A many-core x86 architecture for
visual computing,” ACM Transactions on Graphics, vol. 27, pp. 1–15,
2008.
[5] J. H. Kelm, D. R. Johnson, M. R. Johnson, N. C. Crago, W. Tuohy,
A. Mahesri, S. S. Lumetta, M. I. Frank, and S. J. Patel, “Rigel: An
architecture and scalable programming interface for a 1000-core accel-
erator,” in Proceedings of the International Symposium on Computer
Architecture, June 2009, pp. 140–151.
[6] C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong
program analysis & transformation,” in CGO ’04: Proceedings of the
International Symposium on Code Generation and Optimization, 2004,
p. 75.
[7] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang,
“The case for a single-chip multiprocessor,” in Proceedings of the sev-
enth international conference on Architectural support for programming
languages and operating systems, 1996, pp. 2–11.
[8] J. Shin, K. Tam, D. Huang, B. Petrick, H. Pham, C. Hwang, H. Li,
A. Smith, T. Johnson, F. Schumacher, D. Greenhill, A. Leon, and
A. Strong, “A 40nm 16-core 128-thread CMT SPARC SoC processor,” in
IEEE International Solid-State Circuits Conference 2010 (ISSCC 2010),
Digest of Technical Papers, February 2010, pp. 98–99.
[9] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung,
J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao,
C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks,
D. Khan, F. Montenegro, J. Stickney, and J. Zook, “Tile64 - processor:
A 64-core SoC with mesh interconnect,” in Solid-State Circuits Confer-
ence, 2008. ISSCC 2008. Digest of Technical Papers. IEEE Interna-
tional, 2008, pp. 88–598.
[10] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Green-
wald, H. Hoffmann, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf,
M. Seneski, N. Shnidman, V. S. M. Frank, S. Amarasinghe, and A. Agar-
wal, “The Raw microprocessor: A computational fabric for software cir-
cuits and general purpose programs,” IEEE Micro, vol. 22, pp. 25–35,
2002.
[11] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Fi-
nan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote,
and N. Borkar, “An 80-tile 1.28TFLOPS network-on-chip in 65nm CMOS,” in
Solid-State Circuits Conference, 2007. ISSCC 2007. Digest of Technical
Papers. IEEE International, 2007, pp. 98–589.
[12] U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Mattson,
and J. D. Owens, “Programmable stream processors,” IEEE Computer,
vol. 36, no. 8, pp. 54–62, 2003.
[13] J. D. Owens, U. J. Kapasi, P. Mattson, B. Towles, B. Serebrin, S. Rixner,
and W. J. Dally, “Media processing applications on the Imagine stream
processor,” in Proceedings of the IEEE International Conference on
Computer Design, Sep. 2002, pp. 295–302.
[14] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel
programming with CUDA,” Queue, vol. 6, no. 2, pp. 40–53, 2008.
[15] NVIDIA, NVIDIA’s Next Generation CUDA Compute Ar-
chitecture: Fermi, 1st ed., December 2009. [Online]. Available:
http://www.nvidia.com/content/PDF/fermi white papers/NVIDIA Fer-
mi Compute Architecture Whitepaper.pdf
[16] AMD, “The future is fusion: The industry-changing im-
pact of accelerated computing,” 2008. [Online]. Available:
http://sites.amd.com/jp/Documents/AMD fusion Whitepaper.pdf
[17] ATI, ATI Radeon HD 5870 GPU Feature Summary, 2010. [Online].
Available: http://www.amd.com/us/products/desktop/graphics/ati-
radeon-hd-5000/hd-5870/Pages/ati-radeon-hd-5870-specifications.aspx
[18] Intel, Advanced Vector Extensions Programming Reference, 2010.
[Online]. Available: http://software.intel.com/en-us/avx
[19] R. Kumar, V. Zyuban, and D. M. Tullsen, “Interconnections in multi-
core architectures: Understanding mechanisms, overheads and scaling,”
in Proceedings of the International Symposium on Computer Architec-
ture, ser. ISCA ’05, 2005, pp. 408–419.
[20] NVIDIA, CUDA Programming Guide 1.0, 2007. [Online]. Available:
http://developer.nvidia.com/ob-ject/cuda.html
[21] M. Gschwind, H. Hofstee, B. Flachs, M. Hopkin, Y. Watanabe, and
T. Yamazaki, “Synergistic processing in Cell’s multicore architecture,”
IEEE Micro, vol. 26, no. 2, pp. 10–24, 2006.
[22] S. Kumar, C. J. Hughes, and A. Nguyen, “Carbon: architectural support
for fine-grained parallelism on chip multiprocessors,” in Proceedings of
the International Symposium on Computer Architecture, June 2007, pp.
162–173.
[23] M. Frigo, C. E. Leiserson, and K. H. Randall, “The implementation of
the Cilk-5 multithreaded language,” in Proceedings of the ACM SIG-
PLAN ’98 Conference on Programming Language Design and Imple-
mentation, Montreal, Quebec, Canada, June 1998, pp. 212–223.
[24] The OpenCL Specification, 1st ed., Khronos OpenCL
Working Group, September 2010. [Online]. Available:
http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
[25] L. G. Valiant, “A bridging model for parallel computation,” Communi-
cations of the ACM, vol. 33, no. 8, pp. 103–111, 1990.
[26] J. H. Kelm, D. R. Johnson, S. S. Lumetta, M. I. Frank, and S. J. Patel,
“A task-centric memory model for scalable accelerator architectures,”
in Proceedings of the International Conference on Parallel Architectures
and Compilation Techniques, September 2009, pp. 77–87.
[27] J. H. Kelm, D. R. Johnson, W. Tuohy, S. S. Lumetta, and S. J. Patel,
“Cohesion: A hybrid memory model for accelerators,” in Proceedings of
the International Symposium on Computer Architecture, June 2010, pp.
429–440.
[28] S. Vasudevan, D. Sheridan, S. Patel, D. Tcheng, B. Tuohy, and D. John-
son, “Goldmine: Automatic assertion generation using data mining and
static analysis,” in Design, Automation Test in Europe Conference Ex-
hibition (DATE), 2010, March 2010, pp. 626–629.
[29] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hollberg,
J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, “Simics: A full
system simulation platform,” IEEE Computer, vol. 35, pp. 50–58, 2002.
[30] J. Renau, B. Fraguela, J. Tuck, W. Liu, M. Prvulovic, L. Ceze,
S. Sarangi, P. Sack, K. Strauss, and P. Montesinos, “SESC simulator,”
January 2005. [Online]. Available: http://sesc.sourceforge.net
[31] J. E. Miller, H. Kasture, G. Kurian, C. G. III, N. Beckmann, C. Celio,
J. Eastep, and A. Agarwal, “Graphite: A distributed parallel simulator
for multicores,” in The 16th IEEE International Symposium on High-
Performance Computer Architecture (HPCA), January 2010, pp. 1–12.
[32] G. Zheng, G. Kakulapati, and L. Kale, “BigSim: a parallel simulator for
performance prediction of extremely large parallel machines,” in Parallel
and Distributed Processing Symposium, 2004. Proceedings. 18th Inter-
national, 2004, p. 78.
[33] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, “Analyzing
CUDA workloads using a detailed GPU simulator,” in Performance Anal-
ysis of Systems and Software, 2009. ISPASS 2009. IEEE International
Symposium on, 26-28 2009, pp. 163–174.
[34] D. Chiou, D. Sunwoo, J. Kim, N. A. Patil, W. Reinhart, D. E. John-
son, J. Keefe, and H. Angepat, “FPGA-accelerated simulation technolo-
gies (FAST): Fast, full-system, cycle-accurate simulators,” in MICRO 40:
Proceedings of the 40th Annual IEEE/ACM International Symposium
on Microarchitecture. Washington, DC, USA: IEEE Computer Society,
2007, pp. 249–261.
[35] Z. Tan, A. Waterman, H. Cook, S. Bird, K. Asanović, and D. Patterson,
“A case for FAME: FPGA architecture model execution,” in Proceedings
of the 37th International Symposium on Computer Architecture, 2010,
pp. 290–301.
[36] E. S. Chung, E. Nurvitadhi, J. C. Hoe, B. Falsafi, and K. Mai, “A
complexity-effective architecture for accelerating full-system multipro-
cessor simulations using FPGAs,” in FPGA ’08: Proceedings of the 16th
international ACM/SIGDA symposium on Field programmable gate ar-
rays. New York, NY, USA: ACM, 2008, pp. 77–86.
[37] W.-W. Hu and J. Wang, “Making effective decisions in computer ar-
chitects’ real-world: lessons and experiences with Godson-2 processor
designs,” J. Comput. Sci. Technol., vol. 23, pp. 620–632, July 2008.
[38] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and
N. P. Jouppi, “McPAT: an integrated power, area, and timing modeling
framework for multicore and manycore architectures,” in MICRO 42:
Proceedings of the 42nd Annual IEEE/ACM International Symposium
on Microarchitecture, New York, NY, USA, 2009, pp. 469–480.
[39] M. S. S. Govindan, S. W. Keckler, and D. Burger, “End-to-end val-
idation of architectural power models,” in Proceedings of the 14th
ACM/IEEE international symposium on Low power electronics and de-
sign, ser. ISLPED ’09. New York, NY, USA: ACM, 2009, pp. 383–388.
[40] Tensilica, Xtensa Processor Developer’s Toolkit. [Online]. Available:
http://www.tensilica.com/uploads/pdf/HWdev.pdf
[41] Tensilica, Xtensa Software Developer’s Toolkit. [Online]. Available:
http://www.tensilica.com/uploads/pdf/SWdev.pdf
[42] OpenCores, OpenRISC. [Online]. Available:
http://opencores.org/project,or1k
[43] A. Kerr, G. Diamos, and S. Yalamanchili, “Modeling GPU-CPU work-
loads and systems,” in GPGPU ’10: Proceedings of the 3rd Workshop
on General-Purpose Computation on Graphics Processing Units. New
York, NY, USA: ACM, 2010, pp. 31–42.
[44] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, “NVIDIA Tesla:
A unified graphics and computing architecture,” IEEE Micro, vol. 28,
no. 2, pp. 39–55, 2008.
[45] S. Rusu, S. Tam, H. Muljono, J. Stinson, D. Ayers, J. Chang, R. Varada,
M. Ratta, and S. Kottapalli, “A 45nm 8-core enterprise Xeon processor,”
in IEEE International Solid-State Circuits Conference 2009 (ISSCC
2009), Digest of Technical Papers, February 2009, pp. 56–57.
[46] M. Tremblay and S. Chaudhry, “A third-generation 65nm 16-core 32-
thread plus 32-scout-thread CMT SPARC processor,” February 2008,
pp. 82–83.
[47] Tensilica, 570T Static-Superscalar CPU Core PRODUCT BRIEF, 2007.
[48] MIPS, MIPS32 24K Family of Synthesizable Processor Cores, 2009.
[49] MIPS Technologies, MIPS32 74K. [Online]. Avail-
able: http://www.mips.com/products/cores/32-bit-cores/mips32-
74k/index.cfm
[50] Tensilica, Tensilica Diamond 570T. [Online]. Available:
http://www.tensilica.com/diamond/di 570t.htm
[51] Tensilica, Tensilica Diamond 108Mini. [Online]. Available:
http://www.tensilica.com/diamond/di 108mini.htm
[52] J. Laudon and D. Lenoski, “The SGI Origin: A ccNUMA highly scalable
server,” SIGARCH Computer Architecture News, vol. 25, no. 2, pp. 241–
251, 1997.
[53] J. Dorsey, S. Searles, M. Ciraula, S. Johnson, N. Bujanos, D. Wu,
M. Braganza, S. Meyers, E. Fang, and R. Kumar, “An integrated quad-
core Opteron processor,” in IEEE International Solid-State Circuits Con-
ference, 2007. (ISSCC 2007), Digest of Technical Papers, February 2007,
pp. 102–103.
[54] Intel, An Introduction to the Intel QuickPath In-
terconnect, 1st ed., January 2009. [Online]. Available:
http://www.intel.com/technology/quickpath/introduction.pdf
