Architectural Implications of Automatic Parallelization With HELIX-RC by Brownell, Kevin Matthew
Citation: Brownell, Kevin Matthew. 2015. Architectural Implications of
Automatic Parallelization With HELIX-RC. Doctoral dissertation,
Harvard University, Graduate School of Arts & Sciences.
Citable link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:23845453
Terms of Use: This article was downloaded from Harvard University’s DASH
repository, and is made available under the terms and conditions
applicable to Other Posted Material, as set forth at
http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA
Architectural Implications of Automatic
Parallelization with HELIX-RC
a dissertation presented
by
Kevin Matthew Brownell
to
The School of Engineering and Applied Sciences
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the subject of
Engineering Sciences
Harvard University
Cambridge, Massachusetts
September 2015
©2015 – Kevin Matthew Brownell
all rights reserved.
Thesis advisor: Professor David M. Brooks, Gu-Yeon Wei
Kevin Matthew Brownell
Architectural Implications of Automatic Parallelization with
HELIX-RC
Abstract
As classic Dennard process scaling fades into the past, power density concerns have driven modern
CPU designs to de-emphasize the pursuit of single-thread performance, focusing instead on increasing
the number of cores in a chip. Computing throughput on a modern chip continues to improve,
since multiple programs can run in parallel, but the performance of single programs improves only
incrementally. Many compilers have been designed to automatically parallelize sequentially written
programs by leveraging multiple cores for the same task, thereby enabling continued single-thread
performance gains. One such compiler is HELIX, which can increase the performance of a mixture
of SPECfp and SPECint benchmarks by 2× on a 6-core Nehalem CPU.
Previous approaches to automatically parallelize irregular programs have focused on removing
apparent dependences through thread-level speculation, which limits the type of code that can be
targeted. In contrast, this dissertation increases the amount of code that can be parallelized by
addressing the specific communication demands of that code. The dissertation proposes a
special-purpose extension of the cache hierarchy, called ring cache, to greatly reduce the perceived
communication latency between cores running an automatically parallelized program. This co-design of
ring cache and the HELIX compiler, called HELIX-RC, increases the speedup of 10 SPEC benchmarks
running on 16 simulated in-order cores from an average of 2× to an average of over 8×.
Speedups are slightly reduced to 7× on out-of-order cores, which extract instruction-level
parallelism on their own. A fully synthesized Verilog implementation of ring cache is evaluated and is
shown to consume less than 25 mW of power with an area of less than 0.275 square millimeters.
This dissertation includes a study comparing single program per core multiprogramming and
HELIX-RC. Counterintuitively, some HELIX-RC parallelized benchmarks not only surpass simple
multiprogramming in terms of single program performance, but can also beat multiprogramming in
terms of total multicore throughput by reducing the effective per-core working set of a program.
With communication bottlenecks removed by ring cache, automatic parallelization with HELIX-RC
restores a decade of lost single-thread performance improvements.
Contents
0 Introduction
0.1 Performance Scaling Hits a Speed Bump
0.2 Extracting Multicore Performance
0.2.1 Single-Program Parallelism
0.2.2 Multiple-Program Parallelism
0.3 Core Utilization Remains Low
0.4 Automatic Parallelization Can Improve Utilization
0.5 Contribution of the Dissertation
0.6 Organization of the Dissertation
1 Prior Parallelization of Irregular Workloads Limited by Loop Size
1.1 Thread Extraction Techniques
1.1.1 Cyclic Multithreading
1.1.2 Pipelined Multithreading
1.2 Speculation and Additional Hardware For Increasing Performance
1.2.1 Software Speculation
1.2.2 Hardware Speculation
1.2.3 Custom Architectures
1.3 Automatic Parallelization of Irregular Programs Must Handle Small Loops
1.3.1 Hardware Requirements for Parallelizing Small Loops
2 Existing Hardware Cannot Handle Requirements of Small Loops
2.1 Cache Coherence Protocols
2.2 Scalar Operand Networks
2.2.1 Tile Processor STN
2.2.2 TRIPS OPN
2.3 User-Controlled On-Chip Networks
2.3.1 The Cell Processor Ring Network
2.3.2 Tile Processor UDN
2.4 Other Hardware for Accelerating Communication
2.4.1 Multiscalar’s Distributed Register File
2.5 Conclusion
3 Automatic Parallelization of Irregular Programs with HELIX-RC
3.1 Background and Opportunities
3.1.1 Limits of Compiler-only Improvements
3.1.2 Opportunity
3.2 The HELIX-RC Solution
3.2.1 Approach
3.2.2 Decoupling Communication From Computation
3.3 Compiler
3.4 Architecture Enhancements
3.4.1 Ring Cache Architecture
3.4.2 Memory Hierarchy Integration
3.5 Evaluation
3.5.1 Experimental Setup
3.5.2 Speedup Analysis
3.5.3 Sensitivity to Architectural Parameters
3.5.4 Analysis of Overhead
3.6 Conclusion
4 Ring Cache Detail and Implementation
4.1 Memory Hierarchy Integration
4.1.1 Request and Reply Networks
4.1.2 Reducing Remote Loads
4.2 Signal Buffer Implementation
4.2.1 Synchronization Epochs
4.2.2 Signal Buffer Architecture
4.2.3 Signal Buffer Optimization
4.3 Ring Cache Synthesis Evaluation
4.3.1 Signal Buffer Parameter Sweeps
4.4 Conclusion
5 Future Directions for HELIX-RC
5.1 HELIX-RC With Out-of-Order Cores
5.1.1 Out-of-Order Execution
5.1.2 Speedup Degradation in Out-of-Order Cores
5.2 HELIX-RC vs. Multiprogram Parallelism
5.2.1 Experimental Setup
5.2.2 Evaluation
5.3 Potential HELIX-RC Research Opportunities
5.3.1 Compiler Engineering Improvements
5.3.2 Compiler Sweeps
5.3.3 Multiple-Loop Execution Model
6 Conclusion
Appendix A Ring Cache Technical Report
A.1 Introduction
A.2 Background
A.2.1 HELIX Execution Model
A.2.2 Parallel Code
A.2.3 Decoupling Data Communication
A.2.4 Decoupling Signal Forwarding
A.3 Ring Cache Overview
A.3.1 Core–Node Interaction
A.3.2 Node to Node Connection
A.3.3 Memory Hierarchy Integration
A.4 Ring Cache Implementation
A.5 Datapath Overview
A.6 External Interfaces
A.6.1 Core Interface
A.6.2 L1 Cache Interface
A.7 Network Interfaces
A.7.1 Credit Based Flow Control
A.7.2 Buffer Module
A.8 Memory Flushing
A.9 Storing Shared Data and Signals - The Forwarding Network
A.9.1 Network Bundle
A.9.2 Bundleizer Module
A.9.3 Stopper Module
A.10 Loading Shared Data - The Request/Reply Networks
A.10.1 Request and Reply Networks
A.10.2 Load Unit Module
A.11 Ring Cache Memory
A.11.1 Memory Module
A.11.2 Array Module
A.11.3 Bloom Filter Module
A.12 Signal Buffer
A.12.1 Synchronization Epochs
A.12.2 Signal Buffer Module
A.12.3 Signal Tracker Module
A.12.4 Core Tracker Module
A.12.5 Signal Buffer Optimizations
A.12.6 Previous Implementations
A.13 OS/Multiprogramming Considerations
A.14 Synthesis Results
A.14.1 Reference Design
A.14.2 Signal Buffer Parameter Sweeps
Appendix B Ring Cache Verilog Code
B.1 defines.v
B.2 ring_cache.v
B.3 buffer.v
B.4 bundleizer.v
B.5 stopper.v
B.6 load_unit.v
B.7 memory.v
B.8 priority_encoder.v
B.9 array.v
B.10 bloom_filter.v
B.11 hash.v
B.12 signal_buffer.v
B.13 signal_buffer_signal_tracker.v
B.14 signal_buffer_core_tracker.v
References
Listing of figures
1 Historical clock frequency scaling trend.
2 Historical single-threaded performance scaling.
3 Number of cores on a single die.
1.1 A candidate loop for DOACROSS parallelization
1.2 Decomposition of loop iteration by DOACROSS
1.3 A loop schedule for DOACROSS parallelization
1.4 A loop schedule for DOACROSS parallelization with high communication latency
1.5 A loop schedule for DOACROSS parallelization with a small parallel region
1.6 Decomposition of loop iteration by HELIX
1.7 A loop schedule for HELIX parallelization with a small parallel region
1.8 Decomposition of loop iteration by DSWP
1.9 A loop schedule for DSWP parallelization with unbalanced stages
1.10 A loop schedule for DSWP parallelization with balanced stages
2.1 Decomposition of loop iteration by HELIX with synchronization instructions
2.2 HELIX communication penalty with reactive data transfer
2.3 HELIX communication penalty with hypothetical proactive data transfer
2.4 Multiscalar Distributed Register File
3.1 Augmenting the HELIX compiler does not improve irregular program performance
3.2 Accuracy of dependence analysis for small hot loops in irregular benchmarks
3.3 Short loop iterations in SPECint 2000
3.4 Predictability of variables reduces register communication in small hot loops
3.5 Distribution of required communication distance between 16 cores
3.6 Most shared data is consumed by multiple cores
3.7 Example of decoupled data and signal communication.
3.8 Ring cache architecture overview
3.9 HELIX-RC triples the speedup obtained by HCCv2
3.10 Breakdown of benefits of decoupling communication from computation
3.11 Code generated assuming the existence of ring cache slows down on normal hardware
3.12 Speedup sensitivity to core count and ring cache parameters
3.13 Breakdown of overheads that prevent HELIX-RC from achieving ideal speedup
4.1 Ring cache must be carefully integrated with the normal cache hierarchy
4.2 An empty sequential segment is protected only by a light wait
4.3 Cores constrained to a single epoch have reduced performance
4.4 Cores that can decouple by an additional epoch have higher performance
4.5 Signal Buffer architecture
4.6 Synthesized ring cache power and area
4.7 Total ring node area as total signal ID capacity is swept from 8 to 512.
4.8 Increasing signal bandwidth increases signal buffer and network buffer sizes.
4.9 Increased decoupling increases ring cache area
4.10 Speedups plateau at two epochs of decoupling
4.11 Ring cache area reduces when there are fewer cores in the system
5.1 Single-threaded SPECint 2000 performance on different core types
5.2 HELIX-RC SPEC CPU2000 speedups on different core types
5.3 Overall HELIX-RC performance always increases for higher performance cores
5.4 Performance bottlenecks on a single sequential segment
5.5 Program latency vs. multicore throughput for 183.equake
5.6 Program latency vs. multicore throughput for 179.art
5.7 Program latency vs. multicore throughput for 188.ammp
5.8 Program latency vs. multicore throughput for 197.parser
5.9 Program latency vs. multicore throughput for 164.gzip
5.10 HELIX’s memory dependence analysis encounters diminishing returns
5.11 Splitting sequential segments improves HELIX-RC speedups up to a certain point
A.1 HELIX execution model
A.2 A HELIX parallelized loop
A.3 Reactive communication produces worse performance than proactive communication
A.4 Shared data is often accessed by an unpredictable number of cores
A.5 Decoupling data and synchronization communication is vital for speedups
A.6 Sequential forwarding chains limit HELIX-style parallelization
A.7 Breaking sequential forwarding chains improves parallel performance
A.8 Ring cache architecture overview
A.9 Schematic of top level ring cache module.
A.10 A ring node has direct connections to its local core and its local L1 cache.
A.11 Load timing diagram for ring cache hit
A.12 Load from ring node to L1 cache
A.13 A ring node is connected with its neighbor ring node by three different networks
A.14 Schematic of the buffer module
A.15 Control FSM for buffer module
A.16 Ring cache flush timing diagram
A.17 Forwarding network bundle
A.18 Schematic of bundleizer module
A.19 Schematic of stopper module
A.20 Incorrect ring cache memory hierarchy integration
A.21 Request/reply network bundles
A.22 Schematic of load unit module
A.23 Load unit FSM for local loads
A.24 Load unit FSM for remote loads
A.25 Cache bits and owner bits for a ring cache memory address
A.26 Schematic of memory module
A.27 Memory module FSM for loads
A.28 Memory module FSM for stores and flushes
A.29 Schematic of array module
A.30 Schematic of bloom filter module
A.31 Empty sequential segments are protected by modified light waits
A.32 One bit of signal buffering allows cores to decouple by only one epoch
A.33 Two bits of signal buffering allows cores to decouple by two epochs
A.34 Schematic of signal buffer
A.35 Signal entry bits
A.36 Schematic of signal tracker module
A.37 Schematic of core tracker module
A.38 Core tracker FSM
A.39 Core tracker module initialization
A.40 Power and area for a single ring node
A.41 Total ring node area as total signal ID capacity is swept from 8 to 512.
A.42 Sensitivity of signal bandwidth on speedup
A.43 Increasing signal bandwidth increases the signal buffer and network buffer sizes.
A.44 Decoupling synchronization from one to two epochs increases area significantly
A.45 Decoupling synchronization up to two epochs increases speedups
A.46 HELIX-RC scales relatively well on a small number of cores.
A.47 Signal buffer area is linear with number of supported cores
Listing of tables
3.1 Characteristics of parallelized benchmarks.
4.1 Ring cache parameters for the reference design
4.2 Synthesis results for a single reference ring node.
5.1 Working set sizes for SPECint 2000.
A.1 Ring Cache parameters for the reference design.
A.2 Synthesis results for a single reference ring node.
Previous Work
Portions of this dissertation have appeared in:
Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones, Gu-Yeon
Wei, and David Brooks. “HELIX-RC: An architecture-compiler co-design
for automatic parallelization of irregular programs.” In Proceedings of the
41st Annual International Symposium on Computer Architecture, pp. 217-228.
IEEE Press, 2014.
and
Kevin Brownell. “Ring Cache Technical Report.”
Portions of this dissertation are in submission to:
ACM Transactions on Architecture and Code Optimization (TACO)
0 Introduction
In 1965, Gordon Moore observed that the number of transistors per unit area on an
integrated circuit was increasing by a factor of two year after year [42]. Although the timeframe
of his original forecast was not entirely correct, Moore’s Law heralded the general trend of regular
doublings of transistor density as the semiconductor industry strived to integrate more and more
transistors. After nearly 50 years of process technology improvements, the number of transistors in a
chip has increased from hundreds in the 1960s to well over a billion today. Along with the explosion
in the transistor count came a seemingly relentless increase in computing performance. A portion of
this increase was due to CPU architecture improvements, but a larger portion was the result of faster
and faster transistors [5].
Figure 1: Due to a breakdown of Dennard scaling, the growth rate of nominal CPU clock frequencies dramatically
decreased about a decade ago. Rising power density made further increases infeasible. Historical data from the
Stanford CPUDB project [17].
[Plot: clock frequency (MHz, log scale from 1 to 10000) versus year, 1992–2016, annotated with trends of
~30% yearly growth and ~3% yearly growth.]
For decades, smaller transistor sizes provided what seemed to be a free lunch. As feature sizes
decreased by a predictable factor, so too did capacitance. By decreasing the supply and threshold
voltages by the same factor, the speed of transistors could also be increased and their power
consumption decreased. Through this process, known as Dennard scaling [18], the power density of a chip
remained constant as clock frequencies steadily increased. This resulted in reliable single-threaded
performance increases: every new process technology meant that the previous year’s programs now
ran faster.
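The scaling argument above can be checked numerically. The following is a minimal sketch, assuming the standard dynamic power relation P = C·V²·f (activity factor omitted) and an ideal scaling factor k applied to one device. The `Transistor` class and both function names are hypothetical illustrations, not taken from the dissertation. The second function holds voltage fixed, an assumed simplification of what happens once voltages stop scaling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transistor:
    """Idealized device in arbitrary units (illustrative only)."""
    capacitance: float
    voltage: float
    frequency: float
    area: float

    @property
    def dynamic_power(self) -> float:
        # Standard dynamic power relation, activity factor omitted.
        return self.capacitance * self.voltage ** 2 * self.frequency

    @property
    def power_density(self) -> float:
        return self.dynamic_power / self.area


def dennard_scale(t: Transistor, k: float) -> Transistor:
    """Shrink linear dimensions by 1/k with voltage scaled by the same factor."""
    return Transistor(
        capacitance=t.capacitance / k,
        voltage=t.voltage / k,
        frequency=t.frequency * k,
        area=t.area / k ** 2,
    )


def fixed_voltage_scale(t: Transistor, k: float) -> Transistor:
    """Same shrink and frequency increase, but voltage no longer scales (assumption)."""
    return Transistor(
        capacitance=t.capacitance / k,
        voltage=t.voltage,
        frequency=t.frequency * k,
        area=t.area / k ** 2,
    )


if __name__ == "__main__":
    base = Transistor(capacitance=1.0, voltage=1.0, frequency=1.0, area=1.0)
    ideal = dennard_scale(base, 2.0)
    stalled = fixed_voltage_scale(base, 2.0)
    print("ideal power density:", ideal.power_density)
    print("fixed-voltage power density:", stalled.power_density)
```

Under these assumptions, ideal scaling leaves power density unchanged while the clock frequency rises by k, whereas holding voltage fixed makes power density grow by k².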
0.1 Performance Scaling Hits a Speed Bump
Unfortunately, in the early 2000s, Dennard scaling began to break down. Due to increasing amounts
of leakage current, the previously steadily decreasing threshold voltage began to plateau [36].
Consequently, clock speeds could not continue to increase without the power density of a chip
increasing as well. Given limitations in the ability to cool a chip beyond a certain power ceiling,
the industry had no choice but to significantly reduce the aggressiveness of clock frequency
increases. Figure 1 shows the dramatic slowdown in clock frequency gains that resulted from the
breakdown of Dennard scaling. Nominal CPU frequencies plateaued around 2004, in the 3–4 GHz range.
Figure 2: The “power wall” resulted in sharply decreased historical single-threaded performance gains. CPU
performance has been normalized across multiple generations of the SPECint benchmark suite. Only CPUs from the
database with SPECint numbers are plotted. Historical data from the Stanford CPUDB project [17].
[Plot: normalized CPU performance (log scale from 1 to 1000) versus year, 1992–2016, annotated with trends of
~46% yearly growth and ~19% yearly growth.]
Hand in hand with stalls in clock frequency gains, single-thread performance gains stalled as well.
Figure 2 shows normalized single-threaded performance around this time period for a large variety
of CPUs. The performance data, taken from the Stanford CPUDB project [17], has been normalized
across multiple generations of the SPECint benchmark suite. Prior to 2004, when clock frequencies
were still increasing, overall single-threaded CPU performance increased by nearly 46%
per year. After Dennard scaling broke down, performance still increased, but at the much slower
rate of 19% per year. Had the original 46% trend continued past 2004, CPU performance would
be 5–10× higher today. With this “power wall” blocking single-threaded performance gains, the
industry decided to instead use its still-growing transistor budget to integrate multiple identical
general-purpose cores on a single die. Figure 3 shows the dramatic increase in the number of cores
starting in 2004. Since power ultimately grows with the cube of clock frequency, multiple
cores clocked at lower frequencies may still fit within a fixed power budget while at the same time
providing higher theoretical computing performance. This higher performance can only be realized,
however, if the multiple cores can all be utilized simultaneously, at least for some fraction of the
time.
Figure 3: Facing the power wall, industry transitioned to placing multiple cores on a single die. Historical data from
the Stanford CPUDB project [17], representing only general purpose CPUs.
[Plot: number of cores per die (1 to 64) versus year, 1992–2016.]
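Two quantitative claims in the preceding paragraph can be reproduced with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 8 to 11 year window is an assumed span between 2004 and the time of writing, and the multicore model assumes that per-core performance is linear in clock frequency, that power grows with the cube of frequency, and that the workload keeps every core busy.

```python
def performance_gap(years: float, old_rate: float = 0.46, new_rate: float = 0.19) -> float:
    """Ratio between performance compounding at old_rate and at new_rate."""
    return ((1.0 + old_rate) / (1.0 + new_rate)) ** years


def multicore_tradeoff(cores: int, freq_scale: float) -> tuple[float, float]:
    """Power and throughput relative to one core at full frequency.

    Assumes power is proportional to frequency cubed and throughput is
    proportional to cores * frequency (a fully utilized, idealized chip).
    """
    relative_power = cores * freq_scale ** 3
    relative_throughput = cores * freq_scale
    return relative_power, relative_throughput


if __name__ == "__main__":
    for years in (8, 11):
        print(f"gap after {years} years: {performance_gap(years):.2f}x")
    power, throughput = multicore_tradeoff(cores=4, freq_scale=0.6)
    print(f"4 cores at 60% frequency: power {power:.3f}, throughput {throughput:.2f}")
```

With these assumptions, the compounded gap is about 5.1× after 8 years and about 9.5× after 11 years, in line with the 5–10× range quoted above. Four cores at 60% of the original frequency use less power than the single full-speed core while offering 2.4 times its theoretical throughput, which holds only when all four cores are kept busy.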
0.2 Extracting Multicore Performance
Broadly, there are two primary ways to extract performance from a general-purpose multicore
processor. First, a single program can be decomposed into different execution threads to exploit
thread-level parallelism (TLP) to gain performance on multiple cores. Alternatively, multiple programs can
independently use different cores at the same time. Although these techniques are usually easily
applicable to simple and regular programs, they are often lacking for irregular workloads that contain
complex data and control flows.
0.2.1 Single-Program Parallelism
Although parallel computing had already long existed, the introduction of multiple cores in a single
chip opened the door for finer-grained parallel computing, which previously had been limited to
workloads that could tolerate the long communication latencies between different chips and
machines. Decomposing a program into multiple threads, each of which can run on a different core,
can significantly improve the performance of a single program running on a multicore chip. An
increasing variety of modern tools and programming models have been introduced to facilitate
multithreaded programming. Depending on workload complexity and available programmer time as well
as programmer ability, some programs are easier to split into threads than others. In general,
multithreaded programs are much more difficult to create, maintain, and debug than single-threaded
programs. Additionally, it is often difficult for programmers to create balanced amounts of work for
the threads, so that the realized performance increase from multithreading is often far less than the
theoretical performance increase. As a result, programmers often feel that it is not worth the effort
needed to make a program parallel, and they tend instead to rely entirely on the slower single-thread
performance scaling to gain performance.
0.2.2 Multiple-Program Parallelism
An easier way to extract multicore performance is through the use of multiple-program parallelism:
instead of trying to parallelize a single program, multiple programs can be run on different cores at
the same time. Even though single-threaded performance does not increase, the total throughput
does, so multicore computing resources are not wasted. Multiple-program parallelism also has the
benefit of scaling relatively well as long as the multiple programs do not interact destructively. As
more cores are added to a chip, the total throughput may increase by a predictable amount. Un-
fortunately, the large amount of shared resources on a multicore processor (shared caches, DRAM
bandwidth, on-chip network bandwidth, etc.) can result in less than ideal throughput scaling.
0.3 Core Utilization Remains Low
Both single-program and multiple-program parallelism have been insufficient for keeping multicore
utilization high. For datacenter-scale computing, core utilization is usually well below 50% on av-
erage [3]. One reason for the low utilization is the desire to isolate latency-sensitive applications, so
that processors are intentionally underprovisioned to ensure that multiple programs don’t overly
contend for shared resources [39]. Another reason is the desire to ensure that spare computing ca-
pacity is available if demand increases. Either way, the result is that cores sit idle.
In the mobile realm (e.g., phones and tablets), core utilization is dramatically lower than one
would expect, considering the ever-increasing number of cores generation after generation. Studies
have shown that although popular applications tend to have some TLP, most mobile applications
use less than two cores on average [20]. Additionally, typical mobile device interactions generally
encourage use of only a single application at a time, so multiple-program parallelism is also lim-
ited. The theoretical performance from having up to 8 cores on a single mobile device is thus largely
wasted.
0.4 Automatic Parallelization Can Improve Utilization
Given the difficulty of manually extracting TLP from sequential code, automatic parallelization of-
fers a promising route for increasing multicore utilization. Not only can automatic thread extraction
make use of an increasing number of cores; it can also increase single-program performance. Histor-
ically, a variety of techniques sought to parallelize programs across multiple chips and/or multiple
machines by automatically extracting threads [16, 28]. Each of these extracted threads would run on
a different processor/machine, and they would communicate when necessary for synchronization
or sharing data. While these techniques realized some success, they were as a rule only applicable
to workloads that had minimal communication or synchronization requirements, generally those
with very regular control and data flow. Due to the large latency between chips and machines, if
a program required frequent or irregular communication, the time spent communicating would
dominate the total execution time.
As the multicore era took hold, there was renewed interest in leveraging automatic paralleliza-
tion to regain lost single-thread performance. With multiple cores close together on a die, com-
munication costs decrease and inter-core bandwidth increases, making previously unscalable tech-
niques more realistic for a larger variety of workloads. A growing number of compiler techniques
to extract threads have proved to be feasible for previously unparallelizable irregular programs,
most of these techniques variations of either cyclic-multithreading or pipelined-multithreading
parallelism [12, 46, 48, 60]. Efforts have also been made to extract parallelism by combining com-
piler techniques with custom hardware [25, 38, 54, 53, 56], with some success. Despite this revital-
ized interest, however, there is still much room for improvement—specifically, there is a need for
a technique that 1) is broadly applicable to a large number of irregular programs, 2) produces high
speedups on those programs, and 3) doesn’t require large changes to existing general-purpose multi-
core architectures.
0.5 Contribution of the Dissertation
In this dissertation, I first examine a recent compiler technique for automatic parallelization called
HELIX [12], detailing its intrinsic performance limitations and bottlenecks. In order to boost the
performance of HELIX, I propose a co-design comprising an improved version of HELIX and a
light-weight hardware extension. This co-design, called HELIX-RC, boosts the speedup of sequen-
tially written irregular code—that is, code that contains complex data and control flows—from 2
to 6.85, which buys back a large portion of the single-threaded performance gains lost over the last
10 years. The speedup improvements stem from the ability of the co-designed compiler and hard-
ware to extract parallelism from loops with much higher communication requirements than prior
initiatives have been able to address. Moreover, by efficiently utilizing non-core resources, HELIX-
RC can achieve higher multicore throughput even in cases where multiple-program parallelism is
already abundant. The additional hardware component, ring cache, is easily integrated into existing
commodity multicore architectures at minimal power and area costs. The architectural implications
and the implementation of ring cache are evaluated in detail.
0.6 Organization of the Dissertation
The rest of this dissertation is organized as follows. First, Chapter 1 details relevant historical and
modern automatic parallelization techniques, with an emphasis on their limitations with respect to
parallelizing irregular workloads. The characteristics of the hardware support needed to boost the
performance of these workloads are described. Next, Chapter 2 presents existing hardware mecha-
nisms for inter-core communication and explains why existing hardware fails to address the com-
munication needs of irregular programs. Chapter 3 details and evaluates the proposed compiler-
architecture co-design, which is a combination of the HELIX compiler and some novel hardware,
the ring cache. Chapter 4 presents selected implementation details for ring cache, in addition to
synthesis results from a cycle-accurate Verilog model of the hardware. The full ring cache imple-
mentation report appears in Appendix A. Finally, Chapter 5 examines some architectural tradeoffs
regarding HELIX-RC, including its potential effect on different core architectures and its use for
tradeoffs between program execution time and overall multicore processor throughput, along with
possible future compiler extensions to further increase performance.
1 Prior Parallelization of Irregular Workloads Limited by Loop Size
While some computing problems often translate to either inherently parallel or easy-to-parallelize
numerical programs, sequentially designed, irregular programs with complicated control (e.g., ex-
ecution paths) and data flows (e.g., aliasing) are much more common but difficult to analyze pre-
cisely. For years, many attempts have been made to accelerate single-thread performance beyond
what has been provided by traditional process scaling and architectural improvements. Although
the conventional wisdom is that irregular programs cannot make good use of multiple cores, re-
search in the past decade has made steady progress towards extracting TLP from complex, sequen-
tially designed programs such as the integer benchmarks from the SPEC CPU suites.
Some of this past research has focused primarily on compiler techniques to automatically ex-
tract parallel threads from sequentially written code. Other work combines compiler techniques
with special-purpose hardware in an attempt to overcome some of the limitations of the compiler-
only strategies. In general, these strategies are most successful on so-called regular (or numerical)
workloads—those with predictable control flow and data access. For irregular workloads, these
techniques tend not to be so successful.
The two primary approaches for automatic thread extraction are cyclic multithreading and
pipelined multithreading. Both operate by transforming sequentially written loops into multiple
threads that run on different cores or, historically, on different machines. Inter-thread communi-
cation is used to satisfy any required synchronization or data dependence forwarding. Cyclic mul-
tithreading assigns different loop iterations to different threads. Any loop-carried dependence (i.e.,
a dependence between different loop iterations) is communicated between threads from older it-
erations to younger iterations, forming a cycle between the threads. In contrast, pipelined multi-
threading forms a pipeline between threads, rather than a cycle. A loop iteration is split into mul-
tiple stages (e.g., the first half of every iteration belongs to one stage and the second half of every
iteration belongs to another stage), with each stage assigned to a different thread. Thus, unlike cyclic
multithreading, every thread runs a portion of every iteration in pipelined multithreading. Although
many variations of these techniques exist, the core transformation of loops into thread cycles or
pipelines remains roughly the same.
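The difference between the two models comes down to which thread owns a given piece of an iteration. The following is a minimal sketch of that mapping, assuming round-robin assignment of iterations for cyclic multithreading and one thread per stage for pipelined multithreading. The function names are illustrative and are not taken from any of the cited systems.

```c
#include <assert.h>

/* Cyclic multithreading: a thread owns whole iterations, assigned
 * round-robin, so the stage within the iteration does not matter. */
int cmt_thread_for(int iteration, int stage, int num_threads) {
    (void)stage;
    assert(iteration >= 0 && num_threads > 0);
    return iteration % num_threads;
}

/* Pipelined multithreading: a thread owns one stage of every iteration,
 * so the iteration number does not matter. */
int pmt_thread_for(int iteration, int stage, int num_stages) {
    (void)iteration;
    assert(stage >= 0 && stage < num_stages);
    return stage;
}
```

With four threads, cmt_thread_for maps iterations 0 and 4 to thread 0 regardless of stage, whereas pmt_thread_for maps stage 1 of every iteration to thread 1.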
For both categories of thread extraction, loops with larger iterations generally contain larger
amounts of code that can run in parallel, with less required communication. Smaller loops tend
to be more tightly coupled—and with such short loop bodies, the cost of performing any kind of
communication can quickly dwarf any benefit of parallelization. Since small loops require at least
some amount of communication, both cyclic multithreading and pipelined multithreading tend to
perform poorly for them, even though the loops may contain large amounts of potential parallelism.
For this reason, most state-of-the-art parallelization techniques target relatively large loops.
Unfortunately, complex control and data flows in irregular programs—both exacerbated by am-
biguous pointers and ambiguous indirect calls—make accurate data dependence analysis difficult.
In addition to actual dependences that require communication between threads, a compiler must
conservatively handle apparent dependences that are never realized at runtime. Additionally, larger
loops are harder to analyze, due to the increased lexical scope and the number of variables and memory
locations being considered. If all of the apparent dependences need to be synchronized, performance will suffer
greatly.
A common way to handle a large number of apparent dependences is through speculation [35,
38, 56], which avoids the need for accurate data dependence analysis by speculating that some appar-
ent dependences are not realized. However, thread-level speculation (TLS) suffers from the over-
head needed to support misspeculation and therefore is primarily limited to targeting relatively large
loops in order to amortize penalties.
A potential alternative strategy to existing parallelization solutions is to target small loops instead,
as these are much easier to analyze via state-of-the-art control and data flow analysis, which
significantly improves accuracy. Furthermore, this ease of analysis enables transformations that simply
recompute shared variables in order to remove a large fraction of actual dependences. This strategy
increases TLP and reduces core-to-core communication. Such optimizations do not readily translate
to TLS because the complexity of TLS-targeted code typically spans multiple procedures in larger
loops.
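As one illustration of recomputing a shared variable, consider an index that is advanced by a fixed stride on every iteration. This is a hedged sketch: the choice of an induction variable, the function names, and the doubling operation are assumptions made for the example, and the text does not say which variables the compiler actually recomputes.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: idx is produced by one iteration and consumed by the
 * next, so it is an actual loop-carried dependence. */
void scale_carried(const int *in, int *out, size_t n, size_t stride) {
    assert(in != NULL && out != NULL);
    size_t idx = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = 2 * in[idx];
        idx += stride;              /* value forwarded to the next iteration */
    }
}

/* Transformed form: every iteration recomputes idx from its own
 * iteration number, so nothing has to be forwarded between iterations. */
void scale_recomputed(const int *in, int *out, size_t n, size_t stride) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t idx = i * stride;    /* private to this iteration */
        out[i] = 2 * in[idx];
    }
}
```

Both functions write the same output, but only the first requires a value to travel from one iteration to the next, which is the communication the transformation is meant to remove.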
In the remainder of this chapter, I first discuss some of the primary compiler techniques for au-
tomatic thread extraction and detail their strengths and drawbacks, especially with regard to acceler-
ating irregular programs. Next, I discuss refinements of these techniques that attempt to overcome
some of the limitations of compiler-only solutions, such as compiler–architecture co-designs and
those that use TLS. Then I explore an opportunity to increase performance even further by target-
ing small loops, an untapped source of parallelism that has so far been left on the table. This op-
portunity is only realizable, however, if the significant communication requirements of small-loop
parallelization are fulfilled.
1.1 Thread Extraction Techniques
There are two primary models for extracting parallel threads from sequentially written loops. The
two approaches, cyclic multithreading and pipelined multithreading, underpin most automatic
parallelization techniques. In the following subsections, I describe the general transformation of
sequential code to parallel threads, as well as the primary performance bottlenecks and potential
pitfalls of each technique.
1.1.1 Cyclic Multithreading
Cyclic multithreading was one of the first parallel processing paradigms to be introduced, in 1966 [4].
In general, cyclic multithreading (CMT) operates by assigning loop iterations to different threads,
which are then executed on different cores or processors. Once a core i completes iteration i, it next
executes iteration i + n, where n is the number of cores in the system (e.g., on a 4-core system, core
0 would execute iteration 0, then iteration 4, and so on). For simple loops that have no loop-carried
dependences (a degenerate case of CMT, often called DOALL), the threads can more or less run
independently. Unfortunately, other than relatively trivial or basic number-crunching scientific
applications, the vast majority of programs contain loops with control and data dependences. For
these nontrivial loops, the DOACROSS [16] strategy, which partitions iterations into a sequential
portion and a parallel portion, was developed. The sequential portion contains any loop-carried de-
pendences and must be executed in loop iteration order, in effect forming a cycle between threads
as older iterations feed data to younger iterations. The parallel portion can be executed completely
Node* node = root;
int mySum = node->data;
for (int i = 0; i < 8; i++) {
    node = node->next;
    mySum = mySum + node->data;
    work(mySum, node->data);
}
Figure 1.1: A candidate loop for DOACROSS parallelization.
Sequential Code (S):
    node = node->next;
    mySum = mySum + node->data;
Parallel Code (P):
    work(mySum, node->data);
Figure 1.2: A loop iteration is decomposed into sequential and parallel portions for DOACROSS parallelization.
independently. More recently, a generalization of DOACROSS called HELIX [12] has further split
the sequential portion of each iteration into multiple smaller sequential segments. This potentially
enhances performance by enabling parallelism between different sequential segments.
DOACROSS
To illustrate the DOACROSS transformation, consider the code example in Figure 1.1. This loop
contains two loop-carried dependences. First, the node pointer for the linked list is updated by every
iteration as the linked list is traversed. Second, the mySum variable contains the running sum of
the data located at each node. Let us assume that the subsequent work function is completely inde-
pendent and does not access any memory or registers shared between iterations. Figure 1.2 shows a
transformed version of an iteration of this loop. DOACROSS places the two loop-carried depen-
dences into a sequential region, which must be executed in loop iteration order (enforced by syn-
chronization instructions; not shown), and the independent work function (which relies only on
values produced by the current iteration of the loop) into a parallel region. Execution of the paral-
lel region can overlap with the sequential region of any younger iteration and the parallel region of
any other iteration. An execution timeline for eight iterations of this loop on four cores is shown
in Figure 1.3. The values of loop-carried dependences flow unidirectionally from older iterations to
younger iterations—from core 0 to core 1, 2, and 3, and then back to core 0, forming a cycle. Note
that because the parallel region has a long execution time relative to the sequential portion, there is a
significant performance gain from the large amount of code that can execute in parallel.
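The legality of this decomposition can be checked with a small single-threaded emulation. The sketch below is not a parallel implementation: it runs all sequential regions in iteration order, records the values each one hands to its parallel region, and then runs the parallel regions in reverse order to show that their order does not matter. The work function here is a hypothetical stand-in that returns a value so the results can be compared, and the helper names are assumptions for the example.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Node { int data; struct Node *next; } Node;

/* Hypothetical stand-in for work(): depends only on its arguments. */
static int work(int sum, int data) { return sum * 31 + data; }

/* The loop of Figure 1.1, recording the result of each call to work(). */
void original_loop(Node *root, int iters, int *result) {
    Node *node = root;
    int mySum = node->data;
    for (int i = 0; i < iters; i++) {
        node = node->next;
        mySum = mySum + node->data;
        result[i] = work(mySum, node->data);
    }
}

/* Values one sequential region forwards to its own parallel region. */
typedef struct { int sum; int data; } Forwarded;

/* Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int decomposed_loop(Node *root, int iters, int *result) {
    assert(iters > 0);
    Forwarded *fwd = malloc((size_t)iters * sizeof *fwd);
    if (fwd == NULL) return -1;

    /* Sequential regions: strictly in loop iteration order. */
    Node *node = root;
    int mySum = node->data;
    for (int i = 0; i < iters; i++) {
        node = node->next;
        mySum = mySum + node->data;
        fwd[i].sum = mySum;
        fwd[i].data = node->data;
    }

    /* Parallel regions: any order is legal; reverse order is used here. */
    for (int i = iters - 1; i >= 0; i--) {
        result[i] = work(fwd[i].sum, fwd[i].data);
    }

    free(fwd);
    return 0;
}
```

As in Figure 1.1, a loop of 8 iterations needs a list of at least 9 nodes, since the root is read before the first iteration advances the pointer.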
There are two primary bottlenecks that can severely limit the performance of DOACROSS. Con-
sider Figure 1.4, where the time it takes for data to transfer between cores is tripled, leading to cores
stalling as they wait for loop-carried data to transfer. Execution time in this scenario is much longer
due to the tight coupling between cores intrinsic to CMT, making communication latency a signifi-
cant factor for DOACROSS. The second primary bottleneck is the size of the sequential portion of
the iteration relative to the parallel portion. If there is a large number of loop-carried dependences
compared to the size of the independent code, there will be limited opportunities for overlap be-
tween threads, which in turn reduces performance. Figure 1.5 depicts an execution timeline with a
much shorter work function, resulting in significant stalls. This highlights the importance of having
a larger parallel-to-sequential code ratio.
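Both bottlenecks can be reproduced with a simple timing model. The sketch below is an idealized model, not a measurement: it assumes round-robin iteration assignment, a fixed duration for the sequential and parallel regions, a fixed core-to-core latency, and a sequential region that precedes the parallel region as in Figure 1.1. The function name, parameters, and time units are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Completion time of a DOACROSS loop under the idealized model above.
 * seq, par: duration of the sequential and parallel regions.
 * lat: time for loop-carried data to travel between cores.
 * Returns -1 if the scratch buffer cannot be allocated. */
long doacross_time(int iters, int cores, long seq, long par, long lat) {
    assert(iters > 0 && cores > 0);
    long *core_free = calloc((size_t)cores, sizeof *core_free);
    if (core_free == NULL) return -1;

    long prev_seq_end = 0;  /* when the previous iteration finished its sequential region */
    long finish = 0;
    for (int i = 0; i < iters; i++) {
        int c = i % cores;
        long start = core_free[c];
        /* Stall until the loop-carried data from iteration i-1 arrives. */
        if (i > 0 && prev_seq_end + lat > start) start = prev_seq_end + lat;
        long seq_end = start + seq;
        core_free[c] = seq_end + par;   /* the parallel region follows immediately */
        prev_seq_end = seq_end;
        if (core_free[c] > finish) finish = core_free[c];
    }
    free(core_free);
    return finish;
}
```

For eight iterations on four cores with seq = 1 and par = 6, the model gives 21 time units at lat = 1 and 35 at lat = 3, against 56 for sequential execution. Shrinking the parallel region to par = 1 at lat = 1 gives 16 units, which is no faster than the 16 units of sequential execution.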
Unfortunately, irregular workloads not only have a large number of actual loop-carried depen-
dences (relative to regular workloads), but are also susceptible to a large number of apparent depen-
dences (i.e., dependences that do not manifest at runtime), which bloat the size of the sequential
portion of the loop iteration. In order to achieve good performance, DOACROSS must be able to
keep the sequential portion small and must have very fast inter-thread communication.
[Figure: execution timeline (vertical axis: program execution time) for Core 0 through Core 3, showing the sequential code (S) and parallel code (P) of each iteration and the communication between cores.]
Figure 1.3: Four cores execute eight iterations of a DOACROSS loop, with data flowing from older iterations to
younger iterations. Since the communication delay is small and the parallel region is large, all four cores have high
utilization.
[Figure: execution timeline (vertical axis: program execution time) for Core 0 through Core 3, showing sequential code (S), parallel code (P), communication, and periods marked "Stall on data".]
Figure 1.4: With slightly higher communication latency, communication stalls hurt DOACROSS performance.
HELIX
HELIX [12], an evolution of DOACROSS, addresses the sequential portion bottleneck by split-
ting the sequential portion into multiple sequential segments (it may also create multiple parallel
segments). Although each sequential segment still needs to run in loop iteration order, different
segments can execute simultaneously, to exploit parallelism among them. Figure 1.6 shows how
HELIX decomposes the original loop body shown in Figure 1.1. Note that there are now two different
independent sequential segments, s0 and s1. Segment zero relies only on the previous iteration’s
[Figure: execution timeline (vertical axis: program execution time) for Core 0 through Core 3, showing sequential code (S), short parallel code (P), communication, and periods marked "Stall on data".]
Figure 1.5: If the parallel region of a DOACROSS loop is short, there can potentially be severe performance
degradation.
value for node, and segment one relies only on the previous iteration’s value of mySum (it uses the
current iteration’s value of node). Even when the parallel region is small, as was the case in Figure 1.5,
HELIX can improve performance by allowing s0 and s1 to overlap their execution, as shown in
Figure 1.7. However, as with DOACROSS, HELIX’s speedups are limited by communication latency.
Despite this remaining bottleneck, HELIX was able to achieve a speedup of 2.25 for a mixture of
SPEC CPU 2000 integer and floating point benchmarks.
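The benefit of splitting the sequential portion can be seen in an idealized timing model. The sketch below assumes round-robin iteration assignment, fixed segment durations, a fixed latency charged to every segment handoff between consecutive iterations, and a parallel region that runs after the last sequential segment. A single segment reproduces DOACROSS behavior. The function name, parameters, and time units are assumptions made for the example and are not part of HELIX itself.

```c
#include <assert.h>
#include <stdlib.h>

/* Completion time of a loop whose iterations consist of nsegs sequential
 * segments followed by a parallel region of duration par.
 * Segment k of iteration i may start only after segment k of iteration
 * i-1 has finished and its data has travelled for lat time units.
 * Returns -1 if a scratch buffer cannot be allocated. */
long helix_time(int iters, int cores, const long *seg, int nsegs,
                long par, long lat) {
    assert(iters > 0 && cores > 0 && nsegs > 0 && seg != NULL);
    long *core_free = calloc((size_t)cores, sizeof *core_free);
    long *seg_end = calloc((size_t)nsegs, sizeof *seg_end);
    if (core_free == NULL || seg_end == NULL) {
        free(core_free);
        free(seg_end);
        return -1;
    }

    long finish = 0;
    for (int i = 0; i < iters; i++) {
        int c = i % cores;
        long t = core_free[c];
        for (int k = 0; k < nsegs; k++) {
            /* Wait for the same segment of the previous iteration. */
            if (i > 0 && seg_end[k] + lat > t) t = seg_end[k] + lat;
            t += seg[k];
            seg_end[k] = t;
        }
        core_free[c] = t + par;
        if (core_free[c] > finish) finish = core_free[c];
    }
    free(core_free);
    free(seg_end);
    return finish;
}
```

For eight iterations on four cores with par = 1 and lat = 1, one sequential segment of length 2 finishes at time 24, whereas two segments of length 1 each finish at time 17, because segment s0 of one iteration overlaps with segment s1 of the previous iteration.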
1.1.2 Pipelined Multithreading
In contrast to CMT, pipelined multithreading (PMT) splits loop iterations into different stages,
each of which is then assigned to a single thread. The data dependences that need communicat-
ing in this case are not loop-carried, as with CMT—instead, they are intra-iteration dependences.
Sequential Code (S0):
    node = node->next;
Sequential Code (S1):
    mySum = mySum + node->data;
Parallel Code (P):
    work(mySum, node->data);
Figure 1.6: A loop iteration is decomposed into sequential and parallel portions for HELIX parallelization.
DOACROSS would have created only one sequential portion.
[Figure: execution timeline (vertical axis: program execution time) for Core 0 through Core 3, showing sequential segments S0 and S1, parallel code (P), communication, and periods marked "Stall on data".]
Figure 1.7: Even with a short parallel region, HELIX can reduce stalls by executing different sequential segments in
parallel.
Stage A:
    node = node->next;
    mySum = mySum + node->data;
Stage B:
    work(mySum, node->data);
Figure 1.8: A loop iteration is decomposed into two stages for DSWP parallelization.
These stages create a pipeline where data flows from the first stage of the pipeline to the last. Unlike
threads in CMT, each thread in PMT executes a portion of every iteration, so no cycle is formed,
and the different threads are not so tightly coupled. The preeminent example of this technique
is known as decoupled software pipelining, or DSWP [46].
DSWP
For the code example in Figure 1.1, DSWP may create two different pipeline stages, as shown in
Figure 1.8. The first stage, A, encompasses the updates to the node and mySum variables. The second
stage, B, includes just the work function. Figure 1.9 depicts an execution timeline with this organization
for three iterations on a 2-core system. Since there are only two stages of the pipeline, only two
cores can be used. Core 0 repeatedly executes stage A and then communicates the values of mySum
and node to core 1, which repeatedly executes the work function on the incoming data.
Since PMT creates a pipeline, it is less sensitive than CMT to communication latency, which
only affects the pipeline fill time. However, it has two other primary bottlenecks. First, as can be
seen in Figure 1.9, if different stages of the pipeline are not well balanced, performance is limited
by the longest stage and core utilization is very low. Figure 1.10 shows the much higher core utiliza-
tion when the pipeline stages are balanced. The second major bottleneck is bandwidth: if many
small pipeline stages are created, there might not be adequate inter-thread bandwidth (or buffering)
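[Figure: execution timeline (vertical axis: program execution time) for Core 0 and Core 1, showing stage A on Core 0, stage B on Core 1, and the communication between them.]
Figure 1.9: Two cores executing three iterations of a loop parallelized by DSWP suffer low utilization due to pipeline
stage imbalance.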
[Figure: execution timeline (vertical axis: program execution time) for Core 0 and Core 1, showing balanced stages A and B and the communication between them.]
Figure 1.10: With balanced stages, DSWP offers a parallelization scheme that is robust to communication latency,
but potentially sensitive to inter-core bandwidth.
to keep every stage of the pipeline busy. Irregular workloads with complex input-dependent con-
trol and data flows exacerbate this problem, as a compiler will then have more trouble predicting
pipeline balance and bandwidth requirements at compile time. Depending on the complexity of the
code and the amount of potential memory aliasing, there may be a very small number of possible
options for pipeline stages. DSWP favors larger loops, since smaller loops may be more difficult to
split into many balanced stages. In its original conception, DSWP achieved a program speedup of
approximately 1.1 on a variety of benchmarks, including some from SPEC CPU 2000 [46].
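The latency tolerance and the sensitivity to stage balance can both be seen in an idealized pipeline timing model. The sketch below assumes one thread per stage, fixed stage durations, a fixed latency between adjacent stages, and unbounded buffering between stages, so bandwidth limits are deliberately not modeled. The function name, parameters, and time units are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Completion time of a pipelined loop under the idealized model above.
 * stage[k] is the duration of stage k for one iteration.
 * Returns -1 if the scratch buffer cannot be allocated. */
long pipeline_time(int iters, const long *stage, int nstages, long lat) {
    assert(iters > 0 && nstages > 0 && stage != NULL);
    long *stage_end = calloc((size_t)nstages, sizeof *stage_end);
    if (stage_end == NULL) return -1;

    for (int i = 0; i < iters; i++) {
        long upstream_end = 0;  /* when the previous stage finished this iteration */
        for (int k = 0; k < nstages; k++) {
            /* The thread for stage k is busy until it finishes iteration i-1. */
            long start = stage_end[k];
            /* The data from stage k-1 must also have arrived. */
            if (k > 0 && upstream_end + lat > start) start = upstream_end + lat;
            stage_end[k] = start + stage[k];
            upstream_end = stage_end[k];
        }
    }
    long finish = stage_end[nstages - 1];
    free(stage_end);
    return finish;
}
```

For three iterations with balanced stages of length 2, raising the latency from 1 to 3 moves the finish time from 9 to 11, so the extra latency is paid once rather than once per iteration. With imbalanced stages of length 1 and 4 at latency 1, the finish time is 14 against 15 for sequential execution, since the longer stage dominates.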
1.2 Speculation and Additional Hardware For Increasing Performance
We have seen that the fundamental CMT and PMT automatic parallelization approaches contain
some intrinsic bottlenecks that limit their success in parallelizing irregular programs. For CMT,
communication latency is the primary problem, whereas for PMT, communication bandwidth and
pipeline balance are the primary problems. These limitations are exacerbated by a potentially large
number of apparent dependences that a compiler must conservatively satisfy but that don’t actually
manifest at runtime. Additionally, since irregular programs tend to have unpredictable control and
data flows from iteration to iteration, there are a large number of dependences that only occasionally
manifest but that nevertheless must be handled. Although TLS helps in cases where dependences do
not manifest, misspeculation costs harm performance when dependences do manifest. Only loops
with a high ratio of apparent to actual dependences can reliably achieve high speedups. Sometimes
the compiler can prove that certain dependences will always manifest at runtime, and hardware can
be used to accelerate the communication of these. However, especially for irregular programs, these
are the minority of dependences, since memory references are often ambiguous.
Over the years, a large body of work has been devoted to improving irregular program perfor-
mance beyond CMT and PMT: TLS techniques help reduce the performance impact of apparent
dependences [25, 31, 35, 38, 41, 56, 61, 66, 67], and hardware support for decreasing communication
costs helps reduce the performance impact of actual dependences [50, 53, 54]. Other work com-
bines CMT and PMT to improve speedups [27, 49]. More radical co-designs of the compiler and
the computer architecture holistically extract parallelism differently than vanilla CMT/PMT while
reducing the impact of both apparent and actual dependences [51, 53, 54].
The rest of this section describes some recent improvements for automatically parallelizing programs.
I will first discuss notable software-only speculation approaches, and then influential techniques
that rely on hardware TLS. Finally, I detail an example of a significant compiler–architecture
co-design for extracting parallelism. Although these techniques improve speedups compared to
vanilla CMT and PMT, they are still limited by the amount of communication they can either ac-
celerate (with special-purpose hardware) or remove (via speculation), as well as by the overheads
incurred in facilitating that communication.
1.2.1 Software Speculation
Software-based TLS can potentially enable improved parallel performance on today’s multicore pro-
cessors without any hardware changes. Unfortunately, the overhead of tracking dependences and
rollback information in software is prohibitive, thus limiting the type of loops that can be speculated
upon and therefore limiting the performance of irregular workloads. I discuss two such software
techniques that parallelize loops that are speculatively assumed to be DOALL and that attempt to
mitigate the impact of apparent dependences the compiler cannot safely eliminate. While other re-
search has applied software TLS to DOACROSS and DSWP, it does the parallelization and modifies
source code manually, in lieu of a fully automatic compiler-based approach [48].
STMLite
STMLite is a low-cost software transactional memory implementation [41] that improved upon
previous implementations by reducing the amount of locking and checking overhead normally en-
countered with software transactional memory (STM). STMLite accomplishes this by relaxing some
restrictions ordinarily encountered with software transactional memories and exploiting the sim-
pler nature of the DOALL loops being parallelized and the fact that only one loop is running at a
time. Various benchmarks, including some from SPECfp, were automatically parallelized by select-
ing loops that the compiler believed (via a profiling pass) to be DOALL, although it couldn’t nec-
essarily prove this. Loop iterations were distributed to different threads while speculating that no
dependences would be realized at runtime. If it turned out that a dependence was realized, STMLite
performed a rollback and recovery, re-executing the code in proper sequential order.
Even with this STM implementation optimized for parallel loops, the speedup on the relatively
regular SPECfp benchmarks was limited to 2.2. Ambiguous memory locations accessed in these
benchmarks create potential dependences, so they must be placed in a transaction. The authors
concluded that STM has limited usefulness unless the number of speculated locations is very small,
since the overheads of speculation dwarf any performance improvement.
DOALL for Clusters
Extending speculative DOALL for a cluster of machines yields impressive, automatically extracted
speedups of 43.8 [35]. As with STMLite, loops that appear to be DOALL are parallelized by dis-
tributing their iterations to different threads, speculating that few or no dependences will mani-
fest. Without using any speculation, due to the limitations of a static analysis, even for relatively
straightforward DOALL loops, speedups are drastically lower, around 4.5. This highlights the
performance impact that apparent dependences can cause, despite never (or rarely) being realized at
runtime. However, this boost in speedup is only possible for very regular benchmarks with loops
that have few dependences that need to be speculated. Even very small misspeculation rates of <1%
can completely dwarf any parallel performance, as misspeculation recovery is very costly. For other
benchmarks, the required communication bandwidth to facilitate misspeculation checking greatly
limits performance.
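A back-of-the-envelope model shows why a small misspeculation rate can erase the gain. The sketch below is not taken from [35]: it assumes that iterations are spread evenly over the workers, that each misspeculated iteration adds a fixed recovery cost measured in units of one iteration's sequential time, and that recovery is fully serialized. The function name and all numbers used with it are hypothetical.

```c
#include <assert.h>

/* Estimated speedup of speculative DOALL under the assumptions above.
 * workers: number of parallel workers.
 * misspec_rate: fraction of iterations that misspeculate (0 to 1).
 * recovery_cost: cost of one recovery, in sequential iteration times. */
double spec_doall_speedup(double workers, double misspec_rate,
                          double recovery_cost) {
    assert(workers >= 1.0);
    assert(misspec_rate >= 0.0 && misspec_rate <= 1.0);
    assert(recovery_cost >= 0.0);
    /* Average time per iteration, normalized to sequential execution. */
    double time_per_iteration = 1.0 / workers + misspec_rate * recovery_cost;
    return 1.0 / time_per_iteration;
}
```

With 128 hypothetical workers and no misspeculation the model returns a speedup of about 128. With a 1% misspeculation rate and a hypothetical recovery cost of 1000 iteration times, it returns roughly 0.1, which is a slowdown.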
1.2.2 Hardware Speculation
To overcome the limitations of software TLS, major effort has been devoted to augmenting tradi-
tional processors with the required hardware for efficiently parallelizing loops. Some of this hard-
ware accelerates the major functions related to speculation: dependence tracking and misspeculation
rollback. This section details some of the most influential efforts to create a compiler–processor co-
design that leverages hardware speculation support to speed up automatically parallelized irregular
programs.
HYDRA
The Hydra CMP [25, 26] was one of many early attempts to add dedicated thread-level speculation
hardware to a multicore processor. Speedups were obtained by running loop iterations in paral-
lel and speculating that loop-carried dependences would not manifest. Although Hydra primarily
targeted loop-level parallelism in the same way as cyclic multithreading, it also targeted function
parallelism. Function parallelism was extracted by speculatively assuming that the return value of a
function was predictable and that any side effects were not immediately relevant. One thread exe-
cuted the function, while another continued with code past the function. If an assumption made
about the return value or side effects turned out to be incorrect, execution was repeated with the
correct ordering.
The custom hardware for speculation performed a number of different functions. Overall, it
needed to make sure that a) any thread that read a shared value would receive the most recent value
written by any older thread (but not by a younger thread), b) any speculative reads that turned out
to be speculated incorrectly would cause execution to roll back, and c) any speculative state would
be buffered until it was no longer speculative, at which point it could commit in correct program
order. Additional bits were required in the L1 caches to track loads that might be misspeculated and
also for facilitating memory renaming, which ensured that threads wouldn’t read values written by
younger threads. Buffers in the shared L2 cache held any speculative writes until they were safe to
retire. These buffers also superseded any accesses to the L2 cache, so that threads could read shared
values written from older threads that had not yet committed. Crucial to Hydra’s operation was a
write-through cache system that allowed all cores to snoop on memory accesses so that they could
detect when a misspeculation had occurred and also to ensure, through cache invalidation, that they
always loaded the most recently written shared data from older threads alone.
Although loop/function selection was manual, the parallelization process was automatic. Per-
formance improvements were reasonable, ranging from around 1.5 for irregular programs to more
than 3 for simpler programs. Later manual SPECint parallelization efforts using Hydra yielded
loop/region speedups between 1.24 and 2.1, depending on the benchmark and the loop/re-
gion [47]. The authors noted that the large amount of irregularity in these integer benchmarks
made even manual parallelization difficult. They also acknowledged that some regions that might
otherwise be good for parallelization had too few instructions to amortize the TLS overheads, while
others had such unpredictable iteration lengths that load imbalances often resulted as short threads
stalled waiting for long threads. In general, the irregularity of these benchmarks inhibits successful
parallelization.
STAMPede
Like Hydra, STAMPede [56] is a multicore processor design that seeks to accelerate the performance
of automatically parallelized loops by augmenting the hardware with TLS support. As with previ-
ous TLS solutions, STAMPede is primarily targeted at loop iteration parallelism in cases where the
loop iterations contain mostly apparent dependences. Rather than relying on a write-through cache
organization with large speculative buffers, as Hydra does, STAMPede uses a more typical write-
back cache scheme with some additional cache coherence messages. Special “epoch” timestamps
per thread in addition to extra tracking bits per cache line allow the hardware to detect when an up-
date/invalidation of a speculatively loaded cache line is written by a logically older thread, indicating
a misspeculation. By passing around a commit token, threads are able to determine at what point
they are no longer speculative and can therefore commit their writes to the rest of the system.
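The detection rule can be sketched in a few lines. In this Python sketch the names `CacheLine`, `is_violation`, and `may_commit` are hypothetical, epochs are plain integers where a smaller number means a logically older thread, and the commit token is modeled as the epoch number currently allowed to commit. None of this reflects STAMPede's actual coherence messages.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    speculatively_loaded: bool = False  # extra tracking bit per line


def is_violation(line, local_epoch, writer_epoch):
    """An invalidation signals misspeculation only if this thread already
    loaded the line speculatively and the writer is logically older."""
    return line.speculatively_loaded and writer_epoch < local_epoch


def may_commit(local_epoch, token_epoch):
    """A thread stops being speculative once the commit token reaches it."""
    return local_epoch == token_epoch


line = CacheLine(speculatively_loaded=True)
older_writer = is_violation(line, local_epoch=7, writer_epoch=6)
younger_writer = is_violation(line, local_epoch=7, writer_epoch=8)
untouched_line = is_violation(CacheLine(), local_epoch=7, writer_epoch=6)
```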
STAMPede explicitly handles instances where a compiler can prove with certainty that an actual
dependence always exists and must be communicated, which is often the case for dependences in-
volving registers. Instead of repeatedly misspeculating, as other approaches might do, STAMPede
forwards the shared data between threads by placing it in a special region of the stack, and it enforces
sequential ordering of thread access with explicit wait and signal synchronization instructions.
Except for one benchmark, STAMPede achieved a speedup of only 1.21× or lower across a range
of irregular benchmarks. A major reason for this low program speedup was that the compiler was
unable to find enough promising loops to parallelize, and as a result the overall program coverage
was very low, 28.8% on average.
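Amdahl's law makes the effect of low coverage concrete. The short sketch below uses the 28.8% average coverage reported above; the function name `program_speedup` and the 4× region speedup are illustrative assumptions.

```python
def program_speedup(coverage, region_speedup):
    """Amdahl's law: only the covered fraction of execution is accelerated."""
    return 1.0 / ((1.0 - coverage) + coverage / region_speedup)


coverage = 0.288                                   # average coverage from the text
bound = program_speedup(coverage, float("inf"))    # covered regions take zero time
with_4x = program_speedup(coverage, 4.0)           # assumed 4x region speedup
print(f"upper bound: {bound:.2f}x, with 4x regions: {with_4x:.2f}x")
```

Even if every parallelized loop ran infinitely fast, this coverage would cap whole-program speedup at 1/0.712, or roughly 1.4×.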
POSH
The POSH [38] compiler leverages profiling and a number of different heuristics to extract paral-
lelism on top of assumed TLS hardware. Like other approaches, POSH leverages speculation to
remove apparent dependences and thus incurs potential misspeculation costs if it speculates incor-
rectly. Similarly to STAMPede, it targets not only loop iteration parallelism, but also function-level
parallelism. Additionally, POSH exploits parallelism between loops/functions and code imme-
diately following the loops/functions. A profiling run prunes tasks (loops or functions) that are
predicted to have low amounts of parallelism—without such profiling, POSH is unable to extract
much speedup at all.
Overall, POSH extracts a 1.3× speedup for SPECint 2000 benchmarks. The profiler determines
that the most promising loop iterations / function regions in these irregular programs are on the
small side—the majority of committed tasks execute fewer than 500 instructions. However, the
amount of misspeculation is relatively large—more than 50% of the dynamic tasks are misspeculated
and subsequently re-executed. This is not surprising, given the irregular memory access patterns
within small loops in SPECint benchmarks. There is something of a silver lining to the large amount
of misspeculation: doomed speculative tasks often inadvertently prefetch shared data, which pre-
vents the relatively large load stall that would otherwise happen when the corresponding re-executed
task attempted to load it from the L2. Without this effect, POSH’s speedups were significantly
lower. The observation that loading actual dependences is a potentially large performance bottle-
neck in part motivates a more careful look at a hardware solution to accelerate communication of
these dependences.
1.2.3 Custom Architectures
In addition to techniques like TLS, other research has attempted to extract parallelism through even
larger overhauls of traditional processor designs. Two such influential projects are Multiscalar [54]
and the TRIPS [53] lineage, which includes the TFlex [34] and T3 [51] designs. Through careful
compiler–architecture co-design, these projects were able to increase the amount of extracted paral-
lelism at the cost of a larger design effort and more drastic changes to traditional architecture.
Multiscalar
Multiscalar processors were designed to extract parallelism from single-threaded programs through
aggressive speculative execution on multiple processing units. The associated compiler assigns large
blocks of instructions (called tasks) to different processing units. A task is an abstracted notion of a
contiguous region of dynamic instructions—it can range anywhere from a small number of instruc-
tions to an entire basic block, a loop iteration, or an entire function call. These tasks may execute
speculatively, with additional hardware used to detect misspeculation and initiate re-execution as
needed.
A significant difference from some of the previously discussed TLS techniques is that Multiscalar
has a notion of a single logical register file, which accelerates inter-task communication. This regis-
ter file is physically separated, with a portion of it contained within each processing unit. When the
compiler determines that a particular register value may be shared between tasks, it orchestrates the
forwarding of the last write to that register from the producer task to the register file of any consum-
ing tasks over a unidirectional ring network. That way, when a consuming task tries to read from
that register, if the value has already been produced, it is already available locally (otherwise the hard-
ware will force it to stall). This contrasts with other TLS implementations, which generally rely on
reactive memory systems to begin transfer of data when a consumer requests it rather than as soon
as it is produced. This acceleration of inter-task communication of dependences plays an important
role in boosting the performance achieved by Multiscalar, as it allows the selection of relatively small
tasks [63]. However, this register communication is latency sensitive. On an 8-core design, moving
from a 1-cycle latency between cores to a 2-cycle latency decreases Multiscalar performance on some
benchmarks by 4–5% [6]. Regrettably, the authors did not sweep communication latency beyond
2 cycles, but this performance drop-off after just one additional cycle of latency serves to highlight
how crucial it is to accelerate inter-task dependence communication.
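To see why per-hop latency matters, consider when a forwarded register value reaches each consumer on a unidirectional ring. The sketch below is a simplified model, not Multiscalar's actual timing: `ring_arrival_times` and its parameters are hypothetical names, and it assumes that the value is forwarded as soon as it is produced and that each hop costs a fixed number of cycles.

```python
def ring_arrival_times(producer, consumers, num_cores, hop_latency, produce_time=0):
    """Cycle at which a value written on `producer` reaches each consumer,
    travelling one way around a ring of `num_cores` cores."""
    arrivals = {}
    for core in consumers:
        hops = (core - producer) % num_cores   # unidirectional distance
        arrivals[core] = produce_time + hops * hop_latency
    return arrivals


one_cycle = ring_arrival_times(producer=6, consumers=[7, 1], num_cores=8, hop_latency=1)
two_cycle = ring_arrival_times(producer=6, consumers=[7, 1], num_cores=8, hop_latency=2)
```

A consumer that reads the register before its arrival cycle stalls, so in this model doubling the hop latency doubles the wait of every consumer that is ready early.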
TRIPS
The general philosophy of the TRIPS line of research is a redesign of conventional ISAs to reshape
computation to fit a data-flow model, which can then be used to target different types of parallelism.
Instructions directly encode the consumers of the value they produce, by specifying the instruction
that will consume the value rather than an explicit register. In one mode of operation, the compiler
decomposes execution of a single program into multiple blocks of many instructions each, which
are mapped onto available processing elements and executed speculatively. A routing network ac-
celerates the communication of register values between instructions in different blocks, somewhat
similarly to Multiscalar. The most recent incarnation of TRIPS achieves good speedups for SPECint
2000, over 3× [51] that of an Intel Atom core, although also with higher energy consumption.
Like Multiscalar, TRIPS represents a large architectural redesign, relies heavily on aggressive specula-
tion, and can only accelerate communication of dependences through registers rather than memory.
It also shares a similar communication latency sensitivity: when the number of cycles per hop in the
routing network was increased from 1 to 2, performance dropped by 20% [22].
1.3 Automatic Parallelization of Irregular Programs Must Handle Small Loops
In the previous sections, I presented an overview of significant prior work in the field of automatic
parallelization of single-threaded programs, with an emphasis on their success when targeting irregu-
lar workloads. A number of general observations follow from this discussion.
Although the basic approaches of cyclic multithreading and pipelined multithreading can be
successful for simple programs, they falter in the case of more complex programs. The need to com-
municate dependences between threads—both actual and apparent—can greatly reduce the amount
of extracted parallelism. Thread-level speculation can remove the communication associated with
apparent dependences, but the overheads of misspeculation and re-execution greatly limit which
loop/code regions can be targeted for parallelization. Small loops, in particular, have too many de-
pendences (often the result of ambiguous memory accesses), and misspeculation happens too fre-
quently to amortize the costs of TLS approaches. However, without targeting small loops / code
regions, it is often difficult to find anything worth parallelizing, so program speedup can be limited
by low coverage when TLS is applied.
Much of the discussed research acknowledges the importance of accelerating the communication
of dependences when possible. To avoid misspeculation penalties, known actual dependences are
often handled with explicit communication, either through known memory locations protected
by synchronization instructions or through proactive data forwarding with dedicated hardware.
Dedicated communication hardware allows Multiscalar and TRIPS to target smaller code regions
than other approaches, since data transfer latency for these dependences is reduced. However, many
dependences are ambiguous and may or may not manifest, so only a small subset of the depen-
dences can be accelerated. In a similar vein, the authors of POSH observed that fortuitous accidental
prefetching of shared data noticeably improved their speedups. These studies highlight the need for
fast communication of dependences, since even with dedicated hardware, program performance is
heavily dependent on communication latency.
Overall, this collective body of work points to the reality that in order to achieve good speedup
for irregular workloads, the challenging communication demands of small loops must be met head
on. However, since speculation cannot be relied upon, due to its overheads, the communication of
both apparent and actual dependences must be accelerated. A hardware design capable of facilitat-
ing this communication will potentially unlock previously unseen performance for these programs.
1.3.1 Hardware Requirements for Parallelizing Small Loops
Here are the properties that hardware support must have if small loops are to be profitably paral-
lelized:
1. Communication must be very fast. Even a slightly higher latency in Multiscalar’s register
forwarding ring or TRIPS’s operand routing network noticeably hurts performance.
2. TLS cannot be used, as the overheads are too high for many loops. Therefore, apparent de-
pendences must also be satisfied/communicated, not just actual dependences.
3. Since all dependences must be handled, dependences involving ambiguous memory refer-
ences must be accelerated, not only those involving registers. Consequently, a statically un-
known amount of shared data, with a statically unknown number of producers and con-
sumers, must be communicated.
In the following chapter, we examine some existing hardware communication mechanisms and
explain why they do not meet all the requirements for accelerating small loops. In Chapter 3, we
present a novel compiler–architecture co-design that is able to meet all of the challenging properties
required for improving the performance of small loops. Our investigation is based on the HELIX
automatic parallelization technique, as this represents the state of the art for compiler-only cyclic-
multithreading parallelism. Pipelined multithreading is not as well suited for small loops, due to the
difficulty of finding many balanced pipeline stages in small amounts of code (the static size of the
control-flow graph limits the size of each pipeline stage).
2
Existing Hardware Cannot Handle
Requirements of Small Loops
Chapter 1 detailed prior work on automatic parallelization of irregular programs. The observation
that small loops must be targeted to increase the extracted performance of such workloads emerged
from that discussion. In particular, the discussion motivated the goal of constructing dedicated
hardware to accelerate the communication of inter-thread dependences. This hardware 1) must
perform very fast communication, 2) must not rely on thread-level speculation to remove apparent
inter-thread dependences but must communicate them instead, and 3) must be able to accelerate the
communication of dependences involving memory (as a result of potentially ambiguous pointers),
not just easily detected dependences involving registers—which entails dealing with an unknown
number of producers and consumers of shared data.
This chapter will examine existing hardware mechanisms for inter-thread communication and
show that none of them fulfill the hardware requirements needed for accelerating communication
for irregular programs. First, I discuss cache coherence protocols (the most common inter-thread
communication mechanisms) and show them to be inadequate since they transfer data reactively,
that is, only when shared data is requested. Next, I explore special-purpose hardware that can facil-
itate proactive data communication, such as scalar operand networks and software-controlled on-
chip networks, but these are still found to be lacking with respect to a number of the requirements.
Finally, I examine prior work on custom hardware specifically designed for communicating
inter-thread dependences. Although these custom pieces of hardware improve on traditional
hardware, they are still lacking. The upshot of this discussion is that all of these existing mechanisms
are insufficient for the goal of boosting the performance of irregular programs for HELIX/cyclic-
multithreading automatic parallelization, which motivates the creation of the custom compiler–
architecture co-design detailed in Chapter 3 to accomplish this task.
2.1 Cache Coherence Protocols
The traditional way to communicate data in a commodity multicore processor is through shared
memory. Due to the typical presence of per-core private caches, complex cache coherence protocols
are necessary to create the simplified illusion of a single cache hierarchy. In reality, considerable
work is required to maintain coherence, that is, to ensure that conflicting copies of the same logical
memory location do not exist on the chip. Generally, these protocols transfer data reactively. When a core
attempts to load a piece of shared data that is not in its local cache, the coherence protocol deter-
mines where it is located in the system (either in another core’s local cache, in a shared cache, or in
main memory) and then fetches it. Depending on the particular computer architecture and cache
coherence implementation, this may involve two or three trips either to the last-level cache (as in
some commodity multicore chips) or over an on-chip network to a specific “home” core (as in some
many-core chips, such as Tilera [1] or Intel MIC [15])—leading to tens or hundreds of cycles of la-
tency, even with a fast communication fabric. However, these multiple trips are necessary to load
the most up-to-date value and to invalidate any outstanding cached data, as dictated by the protocol.
Countless different cache coherence protocols have been implemented over the years [55], the exact
details of which are well beyond the scope of this dissertation. For our purposes, we will assume a
cache coherence protocol typical of Intel’s or AMD’s commodity multicore processors.
In the case of HELIX (and cyclic multithreading generally), shared data flows between different
cores in a cycle, from older loop iterations to younger loop iterations. Within sequential regions
of code, a core will often write a piece of shared data. The core running the next loop iteration will
then load that piece of shared data when it is safe to do so. Unfortunately, the shared data is almost
always located in the previous core’s private cache, so it must be transferred locally by the coherence
mechanism before it can be used. This creates a constant, pathological communication delay in the
already latency-sensitive HELIX style of parallelization.
The following example of a HELIX parallelized loop, shown in Figure 2.1, will serve to high-
light this pathological latency. A sequential segment of the loop body reads the shared location X,
performs some computation, and then writes the result back to location X. The parallel portion
of the loop performs some unrelated, independent calculation. At the beginning of the sequential
segment, HELIX inserts a wait operation. At the end of each sequential segment, HELIX inserts
a signal operation. A wait operation prevents a core from entering a sequential segment until
a corresponding signal has been received from the previous iteration of the loop, enforcing the
sequential ordering of the sequential segment.
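The structure of such a loop can be mimicked in software. The following Python sketch is only an analogy: it uses threads in place of cores and `threading.Event` objects in place of HELIX's wait and signal operations, and the names `run_helix_loop`, `f`, and the placeholder body of `B()` are assumptions rather than anything HELIX generates.

```python
import threading


def run_helix_loop(iterations, cores, f, x0):
    """Run loop iterations round-robin on `cores` threads, HELIX style."""
    shared = {"X": x0}
    order = []  # iteration order in which the sequential segment was entered
    # signals[i] is set once iteration i - 1 has left its sequential segment.
    signals = [threading.Event() for _ in range(iterations + 1)]
    signals[0].set()  # the first iteration has no predecessor
    parallel_out = [None] * iterations

    def core(core_id):
        for i in range(core_id, iterations, cores):
            signals[i].wait()            # Wait
            x = shared["X"]              # Load X
            shared["X"] = f(x)           # X = f(X); Store X
            order.append(i)
            signals[i + 1].set()         # Signal
            parallel_out[i] = i * i      # B(): independent parallel work

    threads = [threading.Thread(target=core, args=(c,)) for c in range(cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["X"], order, parallel_out


final_x, order, squares = run_helix_loop(8, cores=2, f=lambda x: 3 * x + 1, x0=1)
```

Because each iteration waits for its predecessor's signal before touching `X`, the sequential segments run in loop iteration order and the final value of `X` matches a purely sequential run, while the `B()` work of different iterations may overlap.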
[Figure 2.1 (diagram): one loop iteration. Sequential segment: Wait, Load X, X = f(X), Store X, Signal. Parallel code: B(). Then start next iteration.]
Figure 2.1: A loop iteration is decomposed into sequential and parallel portions for HELIX parallelization. The
sequential portion accesses a shared memory location that must be communicated between threads.
Figure 2.2 shows an execution timeline for a two-core system using this example loop. At the
start of execution, core 0 has entered the sequential segment, while core 1 waits to enter. During
the execution of the sequential segment, core 0 stores a value to the address of variable X, whose
cache line will be loaded into its L1 cache. The core then leaves the sequential segment by issuing a
signal to unblock core 1. After some communication latency,* core 1 receives the signal and enters
the sequential segment. Subsequently, core 1 issues a load to the address of variable X. Because the
recently written value of X resides in core 0’s L1 cache, there is a cache coherence delay until core 1
receives the data. Since sequential segments are executed in loop iteration order, the data transfer
latency significantly increases the critical path execution time.
The cost is a direct result of the reactive nature of cache coherence protocols: the data is only
moved when it is requested. This produces a coupling effect between the communication of the
shared data and the usage of the shared data. Although prefetching of data could alleviate this
coupling effect, the ambiguous pointer memory accesses found within irregular programs are too
unpredictable to be reliably prefetched. In contrast, a hypothetical proactive communication
mechanism would significantly reduce the stall suffered by core 1, as seen in Figure 2.3.
*The astute reader may wonder why the signal does not suffer the same reactive latency as the data.
Depending on the implementation, it might. In the original HELIX paper [12], signals were accelerated through
clever prefetching techniques that unfortunately cannot be applied to shared data, due to the unpredictability
of their access in irregular programs.
[Figure 2.2 (timeline diagram): execution of core 0 and core 1 over time, showing sequential segments, parallel code, and signal and data communication. The load of X on core 1 misses locally and must be fetched from core 0's L1.]
Figure 2.2: Typical cache coherence protocols reactively transfer previously written/cached data only when another
core attempts to load it. For a two-core system running a HELIX parallelized loop, this leads to a long stall for core 1
as it waits for the data to transfer from core 0’s private L1 cache.
[Figure 2.3 (timeline diagram): the same two-core timeline, but the new value of X is pushed to core 1's L1 cache proactively, so X is available locally and loading it is very fast.]
Figure 2.3: With a hypothetical proactive data transfer mechanism, core 1 finds the shared data has already arrived
when it attempts to fetch it. Core 1 can execute the sequential segment without stalling.
In general, cache coherence protocols target relatively small amounts of data that are shared in-
frequently between cores, so tens or hundreds of cycles of latency are not a big deal. In contrast,
inter-thread communication for small loops requires frequent time-critical data sharing between
cores, which makes cache coherence protocols insufficient. A different communication mechanism
is still needed.
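The difference between reactive and proactive transfer can be quantified with a toy timing model. The sketch below is an assumption-laden simplification, not a simulation of any real protocol: the function `helix_finish_time`, its parameters, and the cycle counts are all hypothetical, iterations are assigned to cores round-robin, and each iteration consists of one sequential segment followed by parallel work.

```python
def helix_finish_time(iterations, cores, seq, par, signal_latency, data_latency, proactive):
    """Finish time (in cycles) of a HELIX-style loop under a toy timing model."""
    core_free = [0] * cores   # cycle at which each core can start its next iteration
    signal_arrival = 0        # cycle at which the previous iteration's signal arrives
    data_arrival = 0          # cycle at which proactively pushed data arrives
    finish = 0
    for i in range(iterations):
        c = i % cores
        enter = max(core_free[c], signal_arrival)        # wait for the signal
        if i == 0 or cores == 1:
            loaded = enter                               # X is already local
        elif proactive:
            loaded = max(enter, data_arrival)            # pushed when it was stored
        else:
            loaded = enter + data_latency                # fetched only on demand
        seq_end = loaded + seq                           # X = f(X); Store X
        signal_arrival = seq_end + signal_latency
        data_arrival = seq_end + data_latency
        core_free[c] = seq_end + par                     # B() runs after the signal
        finish = max(finish, core_free[c])
    return finish


params = dict(iterations=4, cores=2, seq=10, par=100, signal_latency=5, data_latency=40)
reactive = helix_finish_time(**params, proactive=False)
proactive = helix_finish_time(**params, proactive=True)
```

With these assumed numbers the reactive variant finishes at cycle 315 and the proactive variant at cycle 270. In the model, the proactive push helps most when the consuming core is still busy with parallel work while the data is in flight, because the transfer latency is then hidden entirely.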
2.2 Scalar Operand Networks
In contrast to the reactive communication of cache coherence protocols, scalar operand networks are
designed to proactively transfer scalar data from the producers of data to the known consumers of
that data. Although one can think of the ALU bypass paths in a single processor as an operand net-
work [59], modern scalar operand networks generally connect larger grids of cores [64] or ALUs [21]
with a point-to-point network. Because of the streaming/data-flow types of workloads that fit nat-
urally on this type of interconnect, scalar operand networks tend to have very low latency between
nodes in the network. The low latency serves to mitigate the cost of operand transport between at
least somewhat coupled producers and consumers of data. This section briefly reviews the most no-
table scalar operand networks and describes why they are not suited for the communication needs of
small loops, despite transferring data proactively and with low latency.
2.2.1 Tile Processor STN
Processors in the Tile line (née RAW processors [58]) consist of many simple cores (e.g., 64) arranged
in a grid and connected by several mesh on-chip networks. In some of the Tile processors,
one of these networks is a scalar operand network that the designers call the Static Network (STN) [64].
The design goal of this network was to tightly integrate the operand transport network within the
core itself, because of the desire for low latency. From the core’s point of view, transmission over the
STN is accomplished by writing to a specific network-mapped register and receiving from the net-
work is accomplished by reading a specific register. As a result of this tight integration, the latency
for reading or writing to the STN is very low.
Routes for particular flows of operands within the STN are set up ahead of time between pairs
of cores, in typical circuit-switched fashion. This paradigm favors predictable, long data flows, as in
a streaming model of computation. Circuit-switched routing allows for headerless in-order routing
and very low transfer latency for operands between the correct producers and consumers of data,
with only one cycle spent for each intermediate router along the route.
Although very effective for a certain class of workloads, even the ultra-fast STN is not well suited
for the communication needs of small loops. Because knowledge of the exact producers and con-
sumers of shared data is required, only the simplest provably true register-to-register dependences
can be handled by the STN. In addition, ambiguous pointer accesses in small loops make it virtually
impossible to determine which threads on which cores will produce or consume any particular piece
of data, making the STN useless for these types of dependences.
2.2.2 TRIPS OPN
As previously described in Section 1.2.3, the TRIPS custom architecture facilitates an aggressively
speculative, data-flow execution model. The TRIPS scalar operand network (which the designers
call OPN) interconnects execution, register, and memory resources with a low-latency, point-to-
point mesh, single cycle per hop network [22, 64]. Different blocks of instructions run on different
execution units, and necessary inter-block register values are routed over the OPN as soon as they
are produced. Unlike the routing in the Tile STN, the routing of operands in the TRIPS OPN is
dynamic. At runtime, the hardware dynamically routes the operand to wherever the consuming
instruction has been assigned.
However, despite the additional flexibility of dynamic routing, the OPN suffers the same critical
drawbacks as the STN: the determination of which instructions produce and consume a particular
data element must happen in advance, which limits the OPN to handling register-to-register depen-
dences and makes it inapplicable to apparent dependences. Instead, TRIPS handles any apparen-
t/memory dependences with thread-level speculation, which is not usable for small loops. The OPN
can hypothetically handle register dependences that are not always communicated, since space is re-
served statically at the consuming execution unit for any value that may arrive, which ensures that
the OPN will not back up and stall. However, the need to statically reserve space at the consumer
makes the OPN unable to handle dependences based on memory, due to the statically unknown
number of locations and consumers.
2.3 User-Controlled On-Chip Networks
In addition to or instead of compiler-controlled scalar operand networks, some chips implement
user-controllable on-chip networks to support fast, proactive point-to-point communication pat-
terns between processing elements; these include RAW [58] / Tilera [64], the Cell processor [37],
and Intel SCC [62]. Many different network topologies have been studied in academia, but simple
mesh and ring topologies with simple routing dominate the networks of chips that have seen actual
real-world use [30]. These networks were designed to scale arbitrary, proactive inter-core communi-
cation beyond the number of cores supportable by a single shared cache. I present two of the most
widely used† dynamic user-controllable on-chip networks and show that although their increased
flexibility makes them better candidates than scalar operand networks for handling the communica-
tion demands of parallelized small loops, they are still unsuitable for the task.
2.3.1 The Cell Processor Ring Network
The Cell processor [37] combined a high performance core with several simpler in-order co-processors
called synergistic processor elements (SPEs). The SPEs were specialized for high-throughput float-
ing point and integer arithmetic. Each SPE contained a private scratchpad memory. A very high-
bandwidth ring network connected all of the SPEs with the high performance core and the mem-
ory controller. Instead of using reactive cache coherence to share data between SPEs, explicit user-
controlled direct memory access (DMA) requests transfer data from the SPEs to main memory or
directly between the private memories of the SPEs.
†Since these types of on-chip networks are not widely used, this is not saying much.
Unfortunately, the Cell network fails our first requirement of very fast communication. The
latency to transfer even a single byte is >100 clock cycles, far higher than desired. Since the network
was designed for high-bandwidth bulk transfers of data, it was assumed that the cost of initiating a
DMA transfer could be amortized over many bytes.
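A quick calculation shows why amortization works for bulk data but not for single values. The sketch assumes a fixed startup cost of 100 cycles (the lower bound quoted above) and a hypothetical sustained rate of 8 bytes per cycle; the function name `cycles_per_byte` and both default parameters are illustrative, not measured Cell figures.

```python
def cycles_per_byte(num_bytes, startup_cycles=100, bytes_per_cycle=8):
    """Average cost per byte of one transfer with a fixed startup latency."""
    return (startup_cycles + num_bytes / bytes_per_cycle) / num_bytes


single_value = cycles_per_byte(8)      # one 8-byte shared value
bulk = cycles_per_byte(16 * 1024)      # a 16 KiB bulk transfer
```

Under these assumptions a single 8-byte value costs more than 12 cycles per byte, while the bulk transfer costs a small fraction of a cycle per byte. Small loops communicate individual values, so they pay the full startup latency every time.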
2.3.2 Tile Processor UDN
The Tile Processor architecture contains several on-chip networks, one of which (the UDN) is a
point-to-point mesh network dedicated to user-controlled, dynamic message passing between dif-
ferent cores [64]. Unlike some of the other networks in the Tile architecture that facilitate a more
traditional reactive cache coherence protocol, the UDN allows direct communication between dif-
ferent threads on different cores. Programmers have the flexibility to send whatever data they choose
in whatever fashion they choose, with the only restriction being the hardware resources that are ex-
posed by the architecture. Three different messaging paradigms are supported. Buffered channel
and message passing APIs allow flexible, simple usage models with logically unlimited buffering.
This makes them easy to use, and programmers do not need to be too concerned about network
deadlock. However, their relatively high latency makes them completely unsuitable for the needs of
small loops. Raw channels, on the other hand, provide very close access to the hardware, enabling
very low-latency sending and receiving of data, at the cost of very limited buffering and the potential
need for additional user-defined flow control to avoid deadlock.
The ability to use raw channels to send arbitrary data proactively, with potentially single-digit
latencies between cores, provides an interesting option for the communication needs of small loops.
These channels would make it relatively easy to communicate register-to-register dependences be-
tween known producers and consumers. Since the network transmits packets in order, they could
also be safely used to send synchronization signals that could indicate to a consumer core when it is
safe to read a particular dependent value.
However, it is very difficult to see how raw channels could be adapted to handle any dependences—
whether they involve memory or registers—when the exact consumers are unknown (or nonex-
istent). Since consuming cores would not know which core to request produced data from, each
core would need to proactively communicate any produced data that could potentially be shared to
every other core in the system—and because the UDN doesn’t have any built-in broadcast mecha-
nism, each core would need to explicitly send the dependent data to every other core, creating a huge
amount of network traffic. Moreover, all the other cores would need to
constantly inspect their raw channel output buffers and transfer any received items into local storage
in order to prevent the network from deadlocking. Without some hardware mechanism to facilitate
the broadcast and automatic storing of data sent through the raw channels, the cores in the system
would need to expend a large amount of time merely managing the transmission and reception of
this data, even though much of it will never be consumed. All of the work to ensure correct loop-
iteration-order sequential access to the shared data would also need to be done by each core through
software, an unreasonable task. Although the flexible network fabric seems promising, additional
hardware would be needed to handle the required communication demands of small loops.
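The traffic cost of emulating broadcast with unicast messages is easy to estimate. In the sketch below, the function name `unicast_messages` and the example workload (16 cores, 4 potentially shared values per iteration, 1000 iterations) are assumptions chosen only to illustrate the scaling.

```python
def unicast_messages(cores, shared_values_per_iteration, iterations):
    """Messages needed when every produced value is sent to every other core."""
    return shared_values_per_iteration * iterations * (cores - 1)


total = unicast_messages(cores=16, shared_values_per_iteration=4, iterations=1000)
```

For this assumed workload the loop would inject 60,000 messages, and every receiving core would have to drain its share in software, whether or not it ever consumes the values.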
2.4 Other Hardware for Accelerating Communication
Other variants of interconnections have been designed to handle particular communication patterns
for automatic parallelization. Previous work on DSWP, for instance, uses a series of relatively sim-
ple queues to pipeline dependent data between threads [50]. This so-called synchronization array
is limited to cases where there is a one-to-one mapping between the production of values and the
consumption of values (i.e., known actual register-to-register dependences), so similarly to scalar
operand networks, it is insufficient for our purposes. There is one remaining hardware communication
acceleration mechanism that comes close to fulfilling the requirements for small loop communication
and that serves as strong motivation for the hardware solution proposed in Chapter 3.
[Figure 2.4 (diagram): three cores, each with a register file and control masks.]
Figure 2.4: The Multiscalar distributed register file proactively distributes shared registers around the ring of cores.
When a running task attempts to read a shared register, control masks ensure that the value is only accessed if the
most recently updated value is present; otherwise, the core stalls until it arrives.
2.4.1 Multiscalar’s Distributed Register File
We revisit the distributed register file communication mechanism of the Multiscalar processor that
was briefly described in Section 1.2.3, but now in the specific context of accelerating data movement
rather than the overall automatic parallelization approach. Recall that the Multiscalar processor
enabled automatic parallelization by speculatively executing dynamic blocks of instructions called
tasks in parallel on multiple cores. When misspeculation occurred (e.g., dependent data was read
before it was written), the execution of a task was restarted.
Thread-level speculation helped Multiscalar eliminate the communication cost of apparent inter-task
dependences involving memory. To accelerate the communication of register-to-register dependences,
a distributed register file was used, as shown in Figure 2.4. Every core contained a portion
of this register file, with all of the portions connected by a unidirectional ring network. At compile
time, any task that potentially wrote to registers that were shared amongst tasks would have the cor-
responding bits of those registers set in one of its per-core control bitmasks. When a shared register
value was written by some task, it was automatically forwarded around the ring by the hardware.
The register value continued propagating until it updated every core’s register file or another task
stopped its propagation. A task would stop the propagation of a register if it knew that it might it-
self write that register, so that younger tasks wouldn’t prematurely receive an incorrect value. For
a register that would only sometimes be updated by a task (depending on control flow), the task
would propagate the unmodified register value as soon as it knew it would not be modifying the
value, thus guaranteeing that younger tasks that were expecting an update to that register would
remain unblocked.
Through the propagation of control bitmasks from past tasks to future tasks, each task was able
to determine which registers would be written by its predecessor tasks and therefore which regis-
ter values it might need to consume. If a task attempted to read one of these shared register values
but it hadn’t yet been received, the core would stall until it arrived. If the register had already been
received, the correct, newly produced value would be present locally, so it could be accessed very
quickly. If a task never attempted to read a particular potentially shared register (i.e., if the register
dependence was apparent rather than actual), then it would continue execution uninterrupted.
In sum, this distributed register file allowed Multiscalar to accelerate all register communication,
for both apparent and actual dependences, even when it was unknown whether a task would actually
produce or consume a particular register. This nice property was enabled by the proactive forward-
ing of register values between cores and the intelligent hardware control that orchestrated stalling
cores when data was not yet received, stopping propagation of a register if a core might potentially
update it, and gracefully handling the case where a register did not actually need to be updated. If a
register dependence manifested, the data was made available for consumption only a few cycles after
it was produced. If a register dependence did not manifest, cores did not stall unnecessarily.
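The forwarding rule described above can be sketched in a few lines of Python. This is an illustrative model only, not Multiscalar's hardware: the function name `forward_register`, the `may_write` sets (a stand-in for the control bitmasks), and the simplification of one task per core are all assumptions made for the example.

```python
def forward_register(regfiles, may_write, producer, reg, value):
    """Forward `reg` around a unidirectional ring, starting at `producer`.

    regfiles[c]  : dict modelling the register file portion on core c
    may_write[c] : set of registers the task on core c might itself write
                   (hypothetical stand-in for a control bitmask)
    Propagation stops at the first task that might write `reg`, so that
    younger tasks do not receive a value that may still be overwritten.
    """
    n = len(regfiles)
    regfiles[producer][reg] = value
    core = (producer + 1) % n
    while core != producer:
        regfiles[core][reg] = value      # this core receives the value
        if reg in may_write[core]:
            break                        # this task may redefine reg: stop here
        core = (core + 1) % n
    return regfiles


if __name__ == "__main__":
    regs = [dict() for _ in range(4)]
    masks = [set(), set(), {"r1"}, set()]   # the task on core 2 might write r1
    forward_register(regs, masks, producer=0, reg="r1", value=42)
    print([r.get("r1") for r in regs])
```

In this sketch the task on core 2 receives the value it may need as an input but does not pass it on, so core 3 keeps waiting until core 2 either writes the register or releases the unmodified value.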
Unfortunately, despite these nice properties, the Achilles heel of the Multiscalar register file is that
it can only be applied to register-to-register dependences, where the registers that may be communicated
are statically known. It is impossible to map ambiguous memory accesses to registers, as the
addresses and even the number of accessed memory locations are unknown statically. Dynamically
mapping memory to registers would be entirely infeasible, as each core would need to simultaneously
perform the same exact mapping. Additionally, the number of shared locations could quickly
overwhelm the limited size register file. As such, Multiscalar still relies on thread-level speculation to
handle actual and apparent dependences through memory.
2.5 Conclusion
In this chapter, we have examined a number of hardware mechanisms for accelerating core-to-core
communication. In general, these mechanisms fail the criteria for small loop communication by
virtue of 1) having too high a latency, 2) only being able to accelerate statically known register-to-
register inter-thread dependences, and/or 3) needing to perform too much communication man-
agement in software. Scalar operand networks exemplify property 2), traditional cache coherence
protocols exemplify property 1), and user-controlled on-chip networks exemplify properties 2) and
3). On the other hand, the Multiscalar distributed register file comes close to fulfilling the require-
ments for small loop communication, since it is able to quickly accelerate actual register dependences
as well as to gracefully avoid a performance penalty from accelerating apparent or only sometimes
seen register dependences. Hardware mechanisms make most of this communication automatic,
without intervention from the core. However, the Multiscalar distributed register file fails to acceler-
ate dependences involving memory.
Despite its shortcomings, the Multiscalar register file serves as partial inspiration for the hard-
ware solution I propose in Chapter 3, as it shares a similar ring structure. However, instead of a dis-
tributed register file, my HELIX-RC compiler–architecture co-design combines a distributed shared
cache with intelligent synchronization buffering to fully meet the communication demands of small
loop parallelization.
3 Automatic Parallelization of Irregular Programs with HELIX-RC
Chapter 1 discussed the drawbacks of existing automatic parallelization techniques with regard to
irregular non-numerical workloads. Most prior work has attempted to mitigate the high commu-
nication demands resulting from parallelizing these workloads by relying on thread-level specula-
tion (TLS). TLS limits the regions of code that can be parallelized. In particular, TLS overheads
overwhelm any performance improvement when small loop iterations are parallelized. Thus, TLS
cannot be used for small loops. Yet, to extract maximal parallel performance, small loops must be
parallelized.
Targeting small loops presents its own set of challenges. Even after extensive code analysis and op-
timizations, small hot loops will retain actual dependences (in addition to a small number of appar-
ent dependences), typically to share dynamically allocated data. Moreover, since the loop iterations
of small loops tend to be short in duration, they require frequent, memory-mediated communica-
tion. To run these iterations in parallel, low-latency core-to-core communication is needed for mem-
ory traffic. Moreover, ambiguous dependences owing to pointers make it difficult to determine not
only the specific shared memory addresses, but also the total amount of shared data. In Chapter 2,
we described potential hardware mechanisms that could be used for communication, but found that
no existing solution can fulfill the requirements for this kind of communication pattern. The Multiscalar
register file was the closest, as it was able to proactively communicate both actual and apparent
register-to-register dependences; however, it was unable to accelerate memory dependences without
relying on TLS.
To meet the communication demands for short loops, we present HELIX-RC, a co-designed
architecture–compiler parallelization framework for chip multiprocessors. The compiler identifies
which data must be shared between cores, and the architecture proactively circulates this data along
with synchronization signals among the cores rather than waiting for a request. This proactive communication
circulates shared data as early as possible, thus decoupling communication
from computation. HELIX-RC builds on the HCCv1 compiler, developed for the first iteration
of HELIX [12, 13], which automatically generates parallel code for commodity multicore processors.
Because performance improvements from HCCv1 saturate at four cores, due to communication latency,
I propose the ring cache as an architectural enhancement that facilitates low-latency core-to-core
communication to satisfy inter-thread memory dependences, relying on guarantees provided by the
co-designed HCCv3 compiler to keep it lightweight.
HELIX-RC automatically parallelizes irregular programs with unmatched performance improve-
ments. Across a range of SPECint 2000 benchmarks, decoupling communication and computation
enables a threefold improvement in performance over HCCv1, on a simulated multicore processor
consisting of 16 Atom-like, in-order cores with a ring cache that has 1KB of memory per node (32×
smaller than the L1 data cache). The proposed system offers an average speedup of 6.85× over unparallelized
code running on a single core. Detailed evaluations show that even with a conservative
ring cache configuration, HELIX-RC is able to achieve 95% of the possible speedup with unlimited
resources (i.e., unbounded bandwidth, instantaneous intercore communication, and unconstrained
size).
The remainder of this chapter further describes the motivation for HELIX-RC and the results
of implementing it. I first review the limitations of compiler-only improvements and identify co-
design opportunities for improving the thread-level parallelism (TLP) of loop iterations. Next, I
explore the speedups obtained by decoupling communication from computation with compiler
support. After describing the overall HELIX-RC approach, I delve more deeply into both the com-
piler and the hardware enhancement. Finally, I use a detailed simulation framework to evaluate the
performance of HELIX-RC and analyze its sensitivity to architectural parameters.
3.1 Background and Opportunities
3.1.1 Limits of Compiler-only Improvements
To understand what limits the performance of parallel code extracted from irregular programs, I
began with HCCv1 [12, 13], a state-of-the-art parallelizing compiler.
HCCv1 This first-generation compiler automatically generates parallel threads from sequential
programs by distributing successive loop iterations across adjacent cores within a single multicore
processor, similar to conventional DOACROSS parallelism [16]. Since there are data dependences
between loop iterations (i.e., loop-carried dependences), some segments of a loop’s body, called
sequential segments, must execute in iteration order on the separate cores to preserve the semantics
of the sequential code. Synchronization operations mark the beginning and end of each sequential
segment.
[Figure 3.1 plot: program speedup (y-axis, 0 to 16) under HCCv1 and HCCv2 for irregular programs (164.gzip, 175.vpr, 197.parser, 300.twolf, 181.mcf, 256.bzip2, INT geomean) and numerical programs (183.equake, 179.art, 188.ammp, 177.mesa, FP geomean), plus the overall geomean.]
Figure 3.1: Improving the HCCv1 compiler alone does not improve performance for SPECint 2000 benchmarks.
HCCv1 includes a large set of code optimizations (e.g., code scheduling, method inlining, loop
unrolling), most of which are specifically tuned to extract TLP. Despite this, performance improve-
ments obtained by the original HCCv1 compiler saturate at four cores, due to high core-to-core com-
munication latency.
HCCv2 I first improved the code analysis and transformation. Specifically, I increased the accu-
racy of both data dependence and induction variable analysis, and I added other transformations to
extract more parallelism (e.g., scalar expansion, scalar renaming, parallel reductions, and loop splitting
[2]). I call this improved compiler HCCv2.
Figure 3.1 compares speedups for HCCv1 and HCCv2 based on simulations of parallel code generated
by each when targeting a 16-core processor with an optimistic 10-cycle core-to-core communication
latency.* The engineering improvements of HCCv2 significantly increased speedups over
HCCv1 for numerical programs (SPECfp 2000), from 2.4× to 11×. HCCv2 successfully parallelized
the numerical programs because the data dependence analysis is highly accurate for loops at almost
any level of the loop nesting hierarchy. Furthermore, the improved compiler removed the remaining
actual dependences among registers (e.g., via parallel reduction) to generate loops with long iterations
that can run in parallel on different cores.
*Details of this experiment are presented in Section 3.5.
Unfortunately, irregular programs (SPECint) are not as amenable to the compiler improvements
and saw little to no benefit from HCCv2. Because core-to-core communication in conventional
systems is expensive, the compiler must parallelize large loops (the larger the loop with loop-carried
dependences, the less frequently cores synchronize), which limits the accuracy of the dependence
analysis and thereby limits TLP extraction. This is why HELIX-RC focuses on small (hot) loops
to parallelize this class of programs. My hypothesis is that modest architectural enhancements co-
designed with a compiler that targets small loops can successfully parallelize irregular programs.
3.1.2 Opportunity
There is an opportunity to aggressively parallelize irregular programs based on the following in-
sights: (i) small loops are easier to analyze with high accuracy, (ii) predictable computation means
that most of the required communication updates shared memory locations, (iii) we can efficiently
satisfy the communication demands of actual dependences for small loops with low-latency, core-to-
core communication, and (iv) proactive communication efficiently hides communication latencies.
Accurate data dependence analysis is possible for small loops. The accuracy of
data dependence analysis increases for smaller loops because (i) there is less code—and therefore
less complexity—to analyze, and (ii) the number of possible aliases for a pointer in the code scales
down with code size. In other words, we can avoid the conservative pointer aliasing assumptions
that lower accuracy for large loops.
To evaluate the accuracy of the data dependence analysis for small loops using modern compilers,
I started with a state-of-the-art analysis called VLLPA [23]. Figure 3.2 shows that the initial accuracy
of this analysis (i.e., the average number of actual data dependences compared to all dependences
identified for the set of loops HELIX-RC ended up selecting for parallelization) was 48%. To im-
prove the accuracy, I extended VLLPA (i) to be fully flow-sensitive [14], i.e., to track the values of
both registers and memory locations according to their position in the code, (ii) to be path-based,
i.e., to name runtime locations according to how they are accessed from program variables [19], (iii)
to exploit data type and type casting information to conservatively eliminate incompatible aliases,
and (iv) to exploit standard library-call semantics. Figure 3.2 shows that these extensions increased
the accuracy of the analysis for small loops to 81%. As a result, most of the loop-carried data depen-
dences identified by the compiler are actual and therefore require core-to-core communication. The
remaining 19% can always be handled by speculation.
Most required communication is for updating shared memory locations. Shar-
ing data among loop iterations requires core-to-core communication to propagate new values when
loop iterations run on different cores. However, if new values are predictable (e.g., incrementing a
shared variable at every iteration), communication can be avoided. I extended the variable analysis in
HCCv1 to capture the following predictable variables: (i) induction variables for which the update
function is a polynomial up to the second order, (ii) accumulative, maximum, and minimum vari-
ables, (iii) variables set but not used until after the loop, and (iv) variables set in every iteration, even
when the updated value is not constant. If a variable falls into any of these categories, each core can
independently recompute its correct value.
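As a small illustration of why a predictable variable needs no communication, the sketch below takes a second-order induction variable and shows that each iteration can compute its value from the iteration index alone. The update form `x += p + q*k` and the names `sequential` and `recomputed` are assumptions chosen for the example; they are not taken from the HCC implementation.

```python
def sequential(x0, p, q, num_iters):
    """Reference version: x is carried from one iteration to the next."""
    values, x = [], x0
    for k in range(num_iters):
        values.append(x)        # value observed by iteration k
        x += p + q * k          # loop-carried update (second order in k)
    return values


def recomputed(x0, p, q, k):
    """Value observed by iteration k, derived from k alone.

    x_k = x0 + sum_{j<k} (p + q*j) = x0 + p*k + q*k*(k-1)/2,
    so a core running iteration k needs nothing from other cores.
    """
    return x0 + p * k + q * k * (k - 1) // 2


if __name__ == "__main__":
    carried = sequential(3, 2, 5, 8)
    independent = [recomputed(3, 2, 5, k) for k in range(8)]
    print(carried == independent)
```

Because `k*(k-1)` is always even, the integer division is exact, and the two versions agree for every iteration.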
Exploiting the predictability of variables (again for small loops in irregular programs) allows
the compiler to remove a large fraction of the communication required to share registers. Figure 3.4
compares a naive solution that propagates new values for all loop-carried data dependences (100%)
versus a solution that exploits variable predictability. By recomputing variables, the majority of the
remaining communication is for shared memory locations rather than registers.
[Figure 3.2 bar: analysis accuracy on a 0% to 100% scale; VLLPA alone reaches 48%, and adding flow sensitivity, path-based naming, data type information, and library-call semantics reaches 81%.]
Figure 3.2: Various improvements to the VLLPA data dependence analysis boost accuracy significantly for small hot
loops in SPECint 2000.
[Figure 3.3 plot: cumulative percentage of loop iterations (y-axis, 0 to 100) versus clock cycles (x-axis, ticks at 0, 25, 75, 107, 190, 260), with measured cache coherence latencies marked for Atom, Nehalem, Sandy Bridge, Ivy Bridge, and Haswell.]
Figure 3.3: The most promising loops to parallelize (as determined by the compiler) usually have very short iterations,
especially compared to a typical cache coherence latency.
Communication for small hot loops must be fast. While the simplicity of small loops
allows for easy analysis, small loops have short iterations—typically less than 100 clock cycles. Be-
cause these short iterations require (at least) some communication to run in parallel, efficient parallel
execution demands a low-latency core-to-core communication mechanism.
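To make the latency argument concrete, here is a deliberately simple back-of-the-envelope model, not the simulator used in this work. It assumes every iteration contains one sequential segment and that successive iterations are chained through one core-to-core transfer; the function name and the 5-cycle segment length are hypothetical.

```python
def doacross_speedup(iter_cycles, seq_cycles, latency_cycles, num_cores):
    """Upper bound on speedup under a chained-synchronization model.

    Each iteration takes `iter_cycles` when run alone. Iterations are spread
    over `num_cores`, but consecutive iterations are serialized through a
    sequential segment of `seq_cycles` plus one transfer of `latency_cycles`.
    """
    chain = seq_cycles + latency_cycles              # critical path per iteration
    per_iteration = max(iter_cycles / num_cores, chain)
    return iter_cycles / per_iteration


if __name__ == "__main__":
    # 25-cycle iterations with a 75-cycle transfer: parallel code runs slower.
    print(doacross_speedup(25, 5, 75, 16))
    # Same loop with a hypothetical 2-cycle transfer.
    print(doacross_speedup(25, 5, 2, 16))
```

Under these assumptions, a loop whose iterations take 25 cycles cannot benefit from a 75-cycle transfer on its critical path regardless of the core count, which is the motivation for a much faster communication mechanism.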
To better understand this need for fast communication, Figure 3.3 plots a cumulative distribution
of average iteration execution times on a single Atom-like core (described in Section 3.5) for the set
of hot loops from SPECint 2000 benchmarks chosen for parallelization by HELIX-RC. The shaded
portion of the plot shows that more than half of the loop iterations complete within 25 clock cycles.
The plot also delineates the measured core-to-core round trip communication latencies for several
modern multicore processors. Even for the shortest-latency machine, Ivy Bridge, 75 cycles is much
too long for the majority of these short loops. Of course, a conventional region-extending transformation
such as loop unrolling could lengthen the duration of these inner loops, but this would also
increase the lengths of sequential segments, reducing exploitable parallelism.
[Figure 3.4 chart: communication split into memory (Mem) and register shares, without and with re-computation of predictable variables; scale marks at 15% and 100%.]
Figure 3.4: Predictability of variables reduces register communication. The remaining required communication is
mostly through memory.
[Figure 3.5 chart: share of transfers by hop distance: 1 hop 12%, 2 hops 22%, 3 hops 39%, 4 hops 12%, 5 hops 9%, 6 or more hops 6%.]
Figure 3.5: Most required data transfers for small loops are between non-adjacent cores in a hypothetical 16-core
system connected by a ring network.
Proactive communication achieves low latency by decoupling communication
from computation. A compiler must conservatively assume that dependences exist between all
iterations for most of the loop-carried dependences in irregular programs. Because of the complexity
of the control and data flows in such programs, a compiler cannot easily infer the distance between
a loop iteration that generates data and the ones that consume it. For conventional synchronization
approaches [12, 44, 45, 60, 65, 66], this assumption of dependences between all subsequent iterations
leads to sequential chains that severely limit the performance sought by running loop iterations
in parallel.† These sequential chains, which include both communication and computation, have
two sources of inefficiency. First, adjacent-core synchronization often turns out not to be necessary
for every link of these chains. Second, when data forwarding is initiated lazily (at request time), it
blocks computation while waiting for data transfers between cores.
[Figure 3.6 chart: share of shared values by number of consuming cores: 1 core 16%, 2 cores 8%, 3 cores 21%, 4 cores 12%, 5 cores 34%, 6 or more cores 9%.]
Figure 3.6: Most shared data for small loops is consumed by multiple cores in a hypothetical 16-core system.
Finally, for loops parallelized by HELIX-RC, most communication is not between successive
loop iterations. Hence, because HELIX-RC distributes successive loop iterations to adjacent cores,
most communication is not between adjacent cores. Figure 3.5 charts the distribution of undirected
distances between data-producing cores and their first consumer core on a platform with 16 cores
organized in a ring. Only 12% of those transfers are between adjacent cores. Moreover, Figure 3.6
shows that most (84%) of the shared values from these loops are consumed by multiple cores. Since
consumers of shared values are not known at compile time, HELIX-RC implements a mechanism
that proactively broadcasts data and signals to all other cores. Such proactive communication, which
does not block computation, is the cornerstone of the HELIX-RC approach.
†Others have called this chain a critical forwarding path [57, 66].
3.2 The HELIX-RC Solution
The goal of HELIX-RC is to decouple all communication required to efficiently run iterations of
small hot loops in parallel. This is realized by decoupling value forwarding from value generation and
by decoupling signal transmission from synchronization. I will now explain how HELIX-RC achieves
such decoupling.
3.2.1 Approach
HELIX-RC is a co-design of compiler (HCCv3) and architectural (ring cache) enhancements. HCCv3
distinguishes parallel code (i.e., code outside any sequential segment) from sequential code (i.e., code
within sequential segments) by using two instructions that extend the instruction set. The ring
cache is a ring network that connects ring nodes attached to each core in the processor to operate
during sequential segments as a distributed first-level cache that precedes the private L1 caches. The
hardware support can be simple and efficient because it relies on compiler-guaranteed properties of
the code. The following paragraphs summarize the main components of HELIX-RC.
ISA A pair of instructions—wait and signal—are introduced to mark the beginning and end
of a sequential segment. Each of these instructions has an integer value as a parameter that identifies
the particular sequential segment. The wait instruction blocks execution of the core that issued it
(e.g., wait 3) until all other cores have finished executing the corresponding sequential segment,
which they signify by executing the appropriate signal instruction (e.g., signal 3).
Compiler HCCv3 takes sequential programs and parallelizes loops that are most likely to speed
up performance when their iterations execute in parallel. Only one loop at a time runs in parallel,
and its successive iterations run on cores organized as a unidirectional ring.
To satisfy loop-carried data dependences, HCCv3 keeps the execution of sequential segments in
iteration order by inserting wait and signal instructions to delimit the entry and exit points of
these segments. In this way, HCCv3 guarantees that accesses to a variable or another memory loca-
tion that might need to be shared between cores are always within sequential segments. Moreover,
shared variables (normally allocated to registers in sequential code) are mapped to specially allocated
memory locations. Hence, accesses to these variables within sequential segments occur via memory
operations.
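The effect of delimiting a sequential segment with wait and signal can be imitated in software. The sketch below uses Python threads in place of cores and a condition variable in place of the hardware signals; the class `SegmentSync`, the function `run_loop`, and the loop body are hypothetical and only illustrate the ordering guarantee, not the generated code.

```python
import threading


class SegmentSync:
    """Software stand-in for wait/signal on a single sequential segment."""

    def __init__(self):
        self._cond = threading.Condition()
        self._signalled = set()
        self._frontier = 0          # every iteration below this has signalled

    def wait(self, iteration):
        # Block until all predecessor iterations have signalled.
        with self._cond:
            self._cond.wait_for(lambda: self._frontier >= iteration)

    def signal(self, iteration):
        with self._cond:
            self._signalled.add(iteration)
            while self._frontier in self._signalled:
                self._signalled.remove(self._frontier)
                self._frontier += 1
            self._cond.notify_all()


def run_loop(num_iters, num_cores):
    sync = SegmentSync()
    shared = {"a": 0}               # shared variable, kept in memory
    order = []

    def worker(core):
        for i in range(core, num_iters, num_cores):   # round-robin iterations
            private = i * i                            # parallel code
            sync.wait(i)                               # wait 1
            shared["a"] += private                     # sequential segment
            order.append(i)
            sync.signal(i)                             # signal 1

    threads = [threading.Thread(target=worker, args=(c,)) for c in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["a"], order
```

Calling `run_loop(12, 4)` returns the same sum as the sequential loop, and the recorded order of sequential-segment entries is the iteration order, even though the parallel code of different iterations may overlap.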
Core A core forwards all memory accesses within sequential segments to its local ring node. All
other memory accesses (not within a sequential segment) go through its private L1 cache. To deter-
mine whether the executing code is part of a sequential segment, a core simply counts the number
of executed wait and signal instructions. If more waits have been executed than matching sig-
nals, then the executing code belongs to a sequential segment.
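The counting rule can be sketched as follows. The `Core` class, its `execute` method, and the tuple-based trace are assumptions for illustration; the sketch also keeps a single pair of counters rather than tracking each segment identifier separately.

```python
class Core:
    """Routes memory accesses by counting executed wait and signal instructions."""

    def __init__(self):
        self.waits = 0
        self.signals = 0
        self.trace = []             # (operation, address, destination)

    def in_sequential_segment(self):
        return self.waits > self.signals

    def execute(self, op, arg=None):
        if op == "wait":
            self.waits += 1
        elif op == "signal":
            self.signals += 1
        elif op in ("load", "store"):
            target = "ring" if self.in_sequential_segment() else "L1"
            self.trace.append((op, arg, target))
        else:
            raise ValueError(f"unknown operation: {op}")


if __name__ == "__main__":
    core = Core()
    program = [("load", "x"), ("wait", 1), ("load", "a"),
               ("store", "a"), ("signal", 1), ("store", "y")]
    for op, arg in program:
        core.execute(op, arg)
    for entry in core.trace:
        print(entry)
```

Only the two accesses between `wait 1` and `signal 1` are sent to the ring node; the accesses before and after go to the private L1 cache.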
Memory The ring cache is a connected ring of nodes, one per core. Each ring node has a cache
array that satisfies both loads and stores received from its attached core.
HELIX-RC does not require other changes to the existing memory hierarchy because the ring
cache orchestrates interactions with it. To avoid any changes to conventional cache coherence pro-
tocols, the ring cache permanently maps each memory address to a unique ring node. All accesses
from the distributed ring cache to the next cache level (L1) go through the associated node for a cor-
responding address.
3.2.2 Decoupling Communication From Computation
Having introduced the main components of HELIX-RC, we now describe how they interact to
efficiently decouple communication from computation.
Shared data communication HELIX-RC decouples communication of variables and other
shared data locations from computation by propagating new shared data through the ring cache as
soon as it is generated. Once a ring node receives a store, it records the new value and proactively
forwards its address and value to an adjacent node in the ring cache, all without interrupting the
execution of the attached core. The value then propagates from node to node through the rest of
the ring without interrupting the computation of any core—thus decoupling communication from
computation.
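A minimal sketch of this push-based propagation is shown below. It is a toy model under stated assumptions: one hop per call to `tick`, unbounded storage per node, and no interaction with the L1 caches. The class name `RingCache` and its methods are hypothetical and do not describe the actual ring node design.

```python
class RingCache:
    """Toy unidirectional ring: a store returns at once and travels hop by hop."""

    def __init__(self, num_nodes):
        self.n = num_nodes
        self.caches = [dict() for _ in range(num_nodes)]
        self.inbox = [[] for _ in range(num_nodes)]   # messages arriving at each node

    def store(self, node, addr, value):
        # The attached core continues immediately; forwarding happens in tick().
        self.caches[node][addr] = value
        nxt = (node + 1) % self.n
        if nxt != node:
            self.inbox[nxt].append((addr, value, node))

    def tick(self):
        # Advance every in-flight message by one hop.
        arrivals = self.inbox
        self.inbox = [[] for _ in range(self.n)]
        for node, messages in enumerate(arrivals):
            for addr, value, origin in messages:
                self.caches[node][addr] = value
                nxt = (node + 1) % self.n
                if nxt != origin:                     # stop after a full lap
                    self.inbox[nxt].append((addr, value, origin))

    def load(self, node, addr):
        # None means the value has not arrived yet (the real core would stall).
        return self.caches[node].get(addr)
```

For a four-node ring, a value stored at node 0 is visible at node 1 after one `tick` and at every node after three, without any consumer having requested it.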
Synchronization Given the difficulty of determining which iteration depends on which in ir-
regular programs, compilers typically make the conservative assumption that an iteration depends
on all of its predecessor iterations. Therefore, a core cannot execute sequential code until it is un-
blocked by its predecessor [12, 44, 57]. Moreover, an iteration unblocks its successor only if both it
and its predecessors have executed this sequential segment or they are not going to. This execution
model leads to a chain of signal propagation across loop iterations that includes unnecessary syn-
chronization: even if an iteration is not going to execute sequential code, it still needs to synchronize
with its predecessor before unblocking its successor.
HELIX-RC removes these synchronization overheads by enabling an iteration to detect the readi-
ness of all predecessor iterations, not just one. Therefore, once an iteration forgoes executing the
sequential segment, it immediately notifies its successor without waiting for its predecessor. Unfor-
tunately, while HELIX-RC removes unnecessary synchronization, it increases the number of signals
that can simultaneously be in flight.
HELIX-RC relies on the new signal instruction to handle synchronization signals efficiently.
Synchronization between a producer and a consumer involves (i) the producer generating a signal,
(ii) the consumer requesting that signal, and (iii) signal transmission between the two.
On a conventional multicore processor, which relies on a pull-based memory hierarchy for communication,
signal transmission is inherently lazy, and signal requests and transmissions get serialized.
In contrast, in HELIX-RC, signal instructs the ring cache to proactively forward a signal to
all other nodes in the ring without interrupting any of the cores, thereby decoupling signal transmission
from synchronization.
[Figure 3.7 diagram: (a) Parallel code: a loop with sequential segment 1 in which one path executes wait 1; a=load; a = a+1 (instruction 1); store a; signal 1; and the other path executes only wait 1; signal 1; the legend distinguishes parallel code, sequential code, the sequential chain, and data forwarding. (b) Coupled communication: a timeline for cores 0 to 2 showing signal, wait, and load events with signal stalls and data stalls. (c) Decoupled communication: the same timeline without the stalls.]
Figure 3.7: Example illustrating benefits of decoupling communication from computation. On the left, a representative
loop from 175.vpr toggles between a necessary/unnecessary sequential segment. If communication is coupled,
signal and data stalls slow down computation.
Code example Given the importance of these decoupling mechanisms to fully realize performance
benefits, let’s explore how HELIX-RC implements them using a concrete example. The code
in Figure 3.7(a), abstracted for clarity, represents a small hot loop from 175.vpr of SPECint 2000
that is responsible for 55% of that program’s total execution time. The loop contains a sequential
segment with two possible execution paths. The left path contains an actual dependence in which
instances of instruction 1 in an iteration use values from previous iterations. The right path does
not depend on prior data. Because the compiler cannot predict the execution path of a particular
iteration (due to complex control flow), it must assume that in any given iteration, instruction 1
depends on the previous iteration. Therefore, it must synchronize all successive iterations by insert-
ing wait and signal instructions on every execution path. Figure 3.7(b) highlights this sequential
chain in red. Now assume that only iterations 0 and 2, running on cores 0 and 2, respectively, execute
instruction 1. In this case, the sequential chain is unnecessarily long because of the superfluous
wait in iteration 1. Each iteration waits (via the wait instruction) for the signal generated by the
signal instruction of the previous iteration. Also, iterations that update a (iterations 0 and 2) must
load previous values first (using a regular load). Hence, two sets of stalls slow down the chain. First,
iteration 1 performs unnecessary synchronization (signal stalls), because it only contains parallel
code. Second, lazy forwarding of the shared data leads to data stalls, because the transfer only begins
when requested, at a load, rather than when generated, at a store.
HELIX-RC proactively communicates data and synchronization signals between cores, which
leads to the more efficient scenario shown in Figure 3.7(c). The sequential chain now includes only
the delay required to satisfy the dependence, that is, communication updating a shared value. As a
side note, a TLS-based solution suffers in this scenario. Only a complex speculation approach might
be able to capture the run-time behavior of this dependence (i.e., speculating when the program will
execute which path), because 175.vpr frequently toggles between the two paths. Moreover, spec-
ulation cannot avoid the data-forwarding overhead when the program executes the branch with
instruction 1, adding communication delay to the already critical sequential chain.
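The two scenarios of Figure 3.7 can be compared with a toy timing model. Everything in it is an assumption made for illustration: all cores reach the sequential segment at time zero, the coupled scheme pays one signal latency per adjacent-core link plus a data latency at each consuming load, and the decoupled scheme pays only the hop latency between a producer and the next consumer. The function names and the cycle counts are hypothetical.

```python
def coupled_chain(executes, compute, signal_latency, data_latency):
    """Length of the sequential chain when signals and data are pulled lazily.

    executes[i] is True when iteration i runs the dependent path.
    """
    t = 0
    produced = False
    for i, runs in enumerate(executes):
        if i > 0:
            t += signal_latency          # signal stall on every link of the chain
        if runs:
            if produced:
                t += data_latency        # data stall: transfer starts at the load
            t += compute
            produced = True
    return t


def decoupled_chain(executes, compute, hop_latency):
    """Length of the chain when data and signals are pushed as soon as produced."""
    t = 0
    last = None
    for i, runs in enumerate(executes):
        if not runs:
            continue                     # non-executing iterations add no delay
        if last is not None:
            t += hop_latency * (i - last)   # value travels while cores compute
        t += compute
        last = i
    return t


if __name__ == "__main__":
    path = [True, False, True]           # only iterations 0 and 2 update the value
    print(coupled_chain(path, compute=2, signal_latency=10, data_latency=20))
    print(decoupled_chain(path, compute=2, hop_latency=10))
```

With the same per-hop cost in both cases, the coupled chain is longer because it includes the link through iteration 1 and the on-demand data transfer, whereas the decoupled chain contains only the propagation of the updated value.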
3.3 Compiler
The decoupled execution model of HELIX-RC described so far is possible given the tight co-design
of the compiler and architecture. In this section, we focus on compiler-guaranteed code properties
that enable a lightweight ring cache design, and we follow up with code optimizations that make use
of the ring cache.
Guaranteed code properties
• Only one loop at a time can run in parallel. Apart from a dedicated core responsible for ex-
ecuting code outside parallel loops, each core is either executing an iteration of the current
loop or waiting for the start of the next one.
• Successive loop iterations are distributed to threads in a round-robin manner. Since each
thread is pinned to a predefined core and cores are organized in a unidirectional ring, succes-
sive iterations form a logical ring.
• Communication between cores executing a parallelized loop occurs only within sequential
segments.
• Different sequential segments always access different shared data. HCCv3 only generates
multiple sequential segments when there is no intersection of shared data. Consequently,
instances of distinct sequential segments may run in parallel.
• At most two signals per sequential segment emitted by a given core can be in flight at any
time. Hence, only two signals per segment need to be tracked by the ring cache.
This last property eliminates unnecessary wait instructions while keeping the architectural en-
hancement simple. Eliminating waits allows a core to execute a later loop iteration than its succes-
sor (significantly boosting parallelism). Future iterations, however, produce signals that must be
buffered. The last code property prevents a core from getting more than one “lap” ahead of its suc-
cessor. Therefore, when buffering signals, each ring cache node only needs to recognize two types—
those from the past and those from the future.
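The round-robin distribution and the one-lap bound described above can be illustrated with a small Python sketch. The function names and the distance-based classification are assumptions made for illustration; they are not taken from HCCv3 or from the ring cache hardware.

```python
def core_for_iteration(iteration: int, num_cores: int) -> int:
    """Round-robin mapping: iteration i runs on core i mod N."""
    return iteration % num_cores


def classify_signal(sender_iteration: int, receiver_iteration: int,
                    num_cores: int) -> str:
    """Classify a buffered signal as 'past' or 'future'.

    Assumption: a sender is never more than one lap (N iterations)
    ahead of the receiver, so two classes are enough.
    """
    distance = sender_iteration - receiver_iteration
    if distance > num_cores:
        raise ValueError("sender is more than one lap ahead")
    return "future" if distance > 0 else "past"
```

With 4 cores, a signal from iteration 3 is in the past for a core running iteration 5, while a signal from iteration 7 is in the future and must be buffered.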
Code optimizations In addition to the optimizations of HCCv2, HCCv3 includes optimiza-
tions that are essential for the best performance of irregular programs on a ring-cache-enhanced
architecture: aggressive splitting of sequential segments into smaller code blocks, identification and
selection of small hot loops, and elimination of unnecessary wait instructions.
Sizing sequential segments poses a tradeoff. Additional segments created by splitting can run in
parallel with others, but extra segments entail extra synchronization, which adds communication
overhead. Thanks to decoupling, HCCv3 can split more aggressively than HCCv2 to significantly
increase TLP. Note that segments cannot be split indefinitely—each shared location must belong to
only one segment.
To identify small hot loops that are most likely to speed up when their iterations run in parallel,
HCCv3 includes a profiler to capture the behavior of the ring cache. Whereas HCCv1 relies on an an-
alytical performance model to select the loops to parallelize, HCCv3 profiles loops on representative
inputs. During profiling, instrumentation code emulates execution with the ring cache, resulting
in an estimate of the time saved by parallelization. Finally, HCCv3 uses a loop-nesting graph, anno-
tated with the profiling results, to choose the most promising loops.
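One plausible way to combine the profile with the loop-nesting graph is sketched below. Because only one loop runs in parallel at a time, selecting a loop excludes the loops nested inside it, so the sketch keeps, for each subtree, whichever saves more time: the loop itself or the best choices among its children. The `Loop` class, its field names, and the selection rule are assumptions for illustration; the text does not spell out the actual HCCv3 algorithm.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Loop:
    name: str
    saved_cycles: float  # profiled estimate of time saved if parallelized
    children: List["Loop"] = field(default_factory=list)


def select_loops(loop: Loop) -> Tuple[float, List[str]]:
    """Return (total saved cycles, chosen loop names) for this subtree."""
    child_total = 0.0
    child_choice: List[str] = []
    for child in loop.children:
        saved, names = select_loops(child)
        child_total += saved
        child_choice += names
    # Parallelizing this loop excludes every loop nested inside it.
    if loop.saved_cycles >= child_total:
        if loop.saved_cycles <= 0:
            return 0.0, []  # nothing in this subtree is worth parallelizing
        return loop.saved_cycles, [loop.name]
    return child_total, child_choice
```

For an outer loop estimated to save 100 cycles whose two inner loops save 80 and 50 cycles, the sketch picks the two inner loops, since together they save more than the outer loop alone.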
3.4 Architecture Enhancements
Adding a ring cache to a multicore architecture enables the proactive circulation of data and signals
that boost parallelization. This section describes the design of the ring cache and its constituent
ring nodes. The design is guided by the following objectives:
Low-latency communication HELIX-RC relies on fast communication between cores in
a multicore processor for synchronization and for data sharing between loop iterations. Since low-
latency communication is possible between physically adjacent cores in modern processors, the ring
cache implements a simple unidirectional ring network.
Caching shared values A compiler cannot easily guarantee whether and when shared data
generated by a loop iteration will be consumed by other cores running subsequent iterations. Hence,
the ring cache must cache shared data. Keeping shared data on local ring nodes provides quick access
for the associated cores. As with data, it is also important to buffer signals in each ring node for im-
mediate consumption.
Figure 3.8: Ring cache architecture overview. From left to right: overall system; single core slice; ring node internal
structure.
Easy integration The ring cache is a minimally invasive extension to existing multicore sys-
tems, easy to adopt and integrate. It does not require modifications to the existing memory hierar-
chy or to cache coherence protocols.
With these objectives in mind, we now describe the internals of the ring cache and its interaction
with the rest of the architecture.
3.4.1 Ring Cache Architecture
The ring cache architecture relies on the following properties of the compiled code: (i) parallelized
loop iterations execute in separate threads on separate cores, arranged in a logical ring, and (ii) data
shared between iterations moves between cores from current to future iterations. These properties
imply that the data involved in timing-critical dependences that potentially limit overall perfor-
mance are both produced and consumed in the same order as loop iterations. Furthermore, a ring
network topology captures this data flow, as sketched in Figure 3.8. The following paragraphs de-
scribe the structure and purpose of each ring cache component.
Ring node structure The internal structure of a per-core ring node is shown in the right half
of Figure 3.8. Parts of this structure resemble a simple network router. Unidirectional links connect
a node to its two neighbors to form the ring backbone. Bidirectional connections to the core and
the private L1 cache allow injection of data into and extraction of data from the ring. There are three
separate sets of data links and buffers. A primary set forwards data and signals between cores. Two
other sets manage infrequent traffic for integration with the rest of the memory hierarchy (see Sec-
tion 3.4.2). Separating these three traffic types simplifies the design and avoids deadlock. Finally, sig-
nals move in lockstep with forwarded data to ensure that a shared memory location is not accessed
before the data arrives.
In addition to these router-like elements, a ring node also contains structures more common to
caches. A set-associative cache array stores all data values (and their tags) received by the ring node,
whether from a predecessor node or from its associated core. The line size of this cache array is kept
at one machine word. While the small line is contrary to typical cache designs, it ensures there will be
no false data sharing by independent values from the same line.
The final structural component of the ring node is the signal buffer, which stores signals until
they are consumed.
Node-to-node connection The main purpose of the ring cache is to proactively provide
many-to-many core communication in a scalable and low-latency manner. In the unidirectional
ring formed by the ring nodes, data propagates by value circulation. Once a ring node receives an
(address, value) pair, either from its predecessor or from its associated core, it stores a local copy in
its cache array and propagates the same pair to its successor node. The pair eventually propagates
through the entire ring (stopping after a full cycle), so that any core can consume the data value from
its local ring node, as needed.
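Value circulation can be mimicked with a few lines of Python. This is a minimal sketch that assumes one hop per cycle and no contention; the function name and the dictionary-based cache arrays are illustrative and do not reflect the simulator or the hardware design.

```python
def circulate(num_nodes: int, origin: int, address: int, value: int):
    """Propagate one (address, value) pair around a unidirectional ring.

    Returns the per-node cache contents and the cycle at which each node
    stored the pair, assuming one hop per cycle and no contention.
    """
    caches = [dict() for _ in range(num_nodes)]
    arrival = [None] * num_nodes
    node = origin
    for cycle in range(num_nodes):      # stop after one full trip
        caches[node][address] = value   # keep a local copy
        arrival[node] = cycle
        node = (node + 1) % num_nodes   # forward to the successor
    return caches, arrival
```

After the loop, every node holds a local copy of the pair, so any core can read the value from its own ring node without issuing a request.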
This value circulation mechanism allows the ring cache to communicate between cores more
quickly than reactive systems (such as most coherent cache hierarchies). In a reactive system, data
transfer only begins when the receiver requests the shared data, which adds transfer latency to an
already latency-critical code path. In contrast, a proactive scheme overlaps transfer latencies with
computation to lower the receiver’s perceived latency.
The ring cache prioritizes the common case where data generated within sequential segments
must propagate to all other nodes as quickly as possible. Assuming no contention over the network
and single-cycle node-to-node latency, the design shown in Figure 3.8 allows us to bound the latency
for a full trip around the ring to N clock cycles, where N is the number of cores. Each ring node
prioritizes data received from the ring and stalls injection from its local core.
In order to eliminate buffering delays within the nodes that are not due to L1 traffic, the number
of write ports in each node’s cache array must match the link bandwidth between two nodes. While
this may seem like an onerous design constraint for the cache array, Section 3.5.3 shows that just one
write port is sufficient to reap more than 99% of the ideal-case benefits.
To ensure correctness under network contention, the ring cache is sometimes forced to stall all
messages (data and signals) traveling along the ring. The only events that can cause contention and
stalls are ring cache misses and evictions, which may then necessitate fetching data from a remote L1
cache. While these ring stalls are necessary to guarantee correctness, they are infrequent.
The ring cache relies on credit-based flow control [30] and is deadlock free. Each ring node has at
least two buffers attached to the incoming links to guarantee forward progress. The network main-
tains the invariant that there is always at least one empty buffer somewhere in the ring per set of
links. That is why a node only injects new data from its associated core into the ring when there is
no data from a predecessor node to forward.
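The injection rule and the credit mechanism can be sketched as follows. This is a generic illustration of credit-based flow control with ring traffic prioritized over local injection; the class, its method names, and the two-slot default are assumptions, not the reference design.

```python
from collections import deque


class RingNode:
    def __init__(self, buffer_slots: int = 2):
        self.incoming = deque()      # messages received from the predecessor
        self.local = deque()         # messages the core wants to inject
        self.credits = buffer_slots  # free buffer slots at the successor

    def select_outgoing(self):
        """Pick the message to send this cycle, or None if nothing can be sent."""
        if self.credits == 0:
            return None              # successor has no free buffer: stall
        if self.incoming:
            queue = self.incoming    # ring traffic has priority
        elif self.local:
            queue = self.local       # inject only when nothing is waiting
        else:
            return None
        self.credits -= 1            # one successor slot is now in use
        return queue.popleft()

    def return_credit(self):
        """Called when the successor frees a buffer slot."""
        self.credits += 1
```

A node with both a forwarded message and a local store pending sends the forwarded message first, and it sends nothing at all until the successor returns a credit.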
Node–core integration Ring nodes are connected to their respective cores as the closest level
in the cache hierarchy (Figure 3.8). The core’s interface to the ring cache is through regular loads and
stores for memory accesses in sequential segments.
As previously discussed, wait and signal instructions delineate code within a sequential seg-
ment. A thread that needs to enter a sequential segment first executes a wait, which is only returned
by the associated ring node when matching signals have been received from all other cores executing
prior loop iterations. The signal buffer within the ring node enforces this. Specialized core logic de-
tects the start of the sequential segment and routes memory operations to the ring cache.‡ Finally,
execution of the corresponding signal marks the end of the sequential segment.
The wait and signal instructions require special treatment in out-of-order cores. Since they
may have system-wide side effects, these instructions must issue non-speculatively from the core’s
store queue, and regular loads and stores cannot be reordered around them. My implementation
reuses logic from load–store queues for memory disambiguation and holds a lightweight local fence
in the load queue until the wait returns to the senior store queue. This is not a concern for in-order
cores.
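The ordering that wait and signal impose on a sequential segment can be emulated in software. The sketch below uses Python threads and events in place of cores and ring nodes, and it simplifies the protocol to a chain in which each iteration waits only for its immediate predecessor; both choices are assumptions made for illustration.

```python
import threading


def run_loop(num_iterations: int):
    """Run iterations as threads; the sequential segment executes in iteration order."""
    # signals[i] is set when iteration i has left the sequential segment.
    signals = [threading.Event() for _ in range(num_iterations)]
    shared = []  # data touched only inside the sequential segment

    def iteration(i: int):
        local = i * i              # parallel part: no shared data
        if i > 0:
            signals[i - 1].wait()  # wait: predecessor has signaled
        shared.append(local)       # sequential segment
        signals[i].set()           # signal: release the successor

    threads = [threading.Thread(target=iteration, args=(i,))
               for i in range(num_iterations)]
    for t in reversed(threads):    # start order does not matter
        t.start()
    for t in threads:
        t.join()
    return shared
```

Even though the threads start in reverse order, the shared list is always filled in iteration order, because each segment runs only after the previous iteration has signaled.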
3.4.2 Memory Hierarchy Integration
The ring cache is a level within the cache hierarchy and as such must not break any consistency
guarantees that the hierarchy normally provides. Consistency between the ring cache and the
conventional memory hierarchy results from the following invariants: (i) shared memory can only be
accessed within sequential segments through the ring cache (compiler enforced); (ii) only a uniquely
assigned owner node can read or write a particular shared memory location through the L1 cache on
a ring cache miss (ring cache enforced); and (iii) the cache coherence protocol preserves the order of
stores to a memory location through a particular L1 cache.§
Sequential consistency To preserve the semantics of a parallelized single-threaded program,
memory operations on shared values require sequential consistency. The ring cache meets this
requirement by leveraging the unidirectional data flow guaranteed by the compiler. Sequential
consistency must be preserved when ring cache values reach lower-level caches, but the consistency model
provided by conventional memory hierarchies is weaker. I resolve this difference by introducing a
single serialization point per memory location, namely, a unique owner node responsible for all
interactions with the rest of the memory hierarchy. When a shared value is moved between the ring
cache and L1 caches (owing to occasional ring cache load misses and evictions), only its owner node
can perform the required L1 cache accesses. This solution preserves existing consistency models with
minimal impact on performance.
‡This feature may add one multiplexer delay to the critical delay path from the core to the L1 cache.
§Most cache coherence protocols (including Intel, AMD, and ARM implementations) provide this
minimum guarantee.
Cache flush Finally, to guarantee coherence between parallelized loops and serial code between
loop invocations, each ring node flushes the dirty values of memory locations it owns to its core’s L1
once a parallel loop has finished execution. This is equivalent to executing a distributed fence at the
end of loops. In a multiprogram scenario, signal buffers must also be flushed/restored at program
context switches.
3.5 Evaluation
As a result of the compiler being co-designed with the architecture, HELIX-RC more than triples
the performance of parallelized code compared to a compiler-only solution (i.e., HCCv2). This sec-
tion investigates HELIX-RC’s performance benefits and their sensitivity to ring cache parameters. I
confirm that the majority of speedups come from decoupling all types of communication and syn-
chronization. I conclude by analyzing the execution model’s remaining overheads.
3.5.1 Experimental Setup
I ran experiments on two sets of architectures. The first relies on a conventional memory hierarchy
to share data among the cores. The second relies on the ring cache.
Table 3.1: Characteristics of parallelized benchmarks.

                            Parallel loop coverage
Benchmark      Phases    HELIX-RC    HCCv2    HCCv1
Integer benchmarks
164.gzip         12       98.2%      42.3%    42.3%
175.vpr          28       99%        55.1%    55.1%
197.parser       19       98.7%      60.2%    60.2%
300.twolf        18       99%        62.4%    62.4%
181.mcf          19       99%        65.3%    65.3%
256.bzip2        23       99%        72.3%    72.1%
Floating point benchmarks
183.equake        7       99%        99%      77.1%
179.art          11       99%        99%      84.1%
188.ammp         23       99%        99%      60.2%
177.mesa          8       99%        99%      64.3%
Simulated conventional hardware Unless otherwise noted, I simulated a multicore in-order
x86 processor by adding multiple-core support to the XIOSim simulator. The single-core
XIOSim models have been extensively validated against an Intel® Atom™ processor [32]. I used
XIOSim because it is a publicly available simulator that is able to simulate fine-grained
microarchitectural events with high precision.
The simulated cache hierarchy has two levels: a per-core 32KB, 8-way associative L1 cache and
a shared 8MB 16-bank L2 cache. I varied the core count from 1 to 16, but did not vary the amount
of L2 cache with the number of cores, keeping it at 8MB for all configurations. Scaling the cache
size would have made it difficult to distinguish the benefits of parallelizing a workload from the
benefits of fitting its working set into the larger cache, causing misleading results. Finally, I used
DRAMSim2 [52] for cycle-accurate simulation of memory controllers and DRAM.
I extended XIOSim with a cache coherence protocol that assumes an optimistic cache-to-cache
latency of 10 clock cycles. This 10-cycle latency is optimistically low even compared to research
prototypes of low-latency coherence [40]. In fact, it is the minimum that is reasonably possible with a
4×4 2D mesh network. (Running microbenchmarks in my testbed, I measured 75 cycles on Intel Ivy
Bridge, 95 cycles on Intel Sandy Bridge, and 110 cycles on Intel Nehalem.) I only use this low-latency
model to simulate conventional hardware, and I later (Section 3.5.2) show that low latency alone is
not enough to compensate for the lazy nature of the cache coherence protocol.
Figure 3.9: HELIX-RC triples the speedup obtained by HCCv2. Speedups are relative to sequential program
execution. Because the integer benchmarks had much more communication that needed to be accelerated, they were
helped much more than the floating point benchmarks. [Bar chart: program speedup (0 to 16) of HCCv2 and
HELIX-RC for each benchmark, grouped into irregular programs (with INT geomean) and numerical programs
(with FP geomean), plus the overall geomean.]
Simulated ring cache I extended XIOSim to simulate the ring cache described in Section 3.4.
Unless otherwise noted, the simulated ring cache has the following configuration: a 1KB 8-way
associative array size, a one-word data bandwidth, a five-signal bandwidth, a single-cycle adjacent core
latency, and two cycles of core-to-ring-node injection latency to minimally impact the already
delay-critical path from the core to the L1 cache. I used a simple bitmask as the hash function to distribute
memory addresses to their owner nodes. To avoid triggering the cache coherence protocol, all words
of a cache line have the same owner. Lastly, XIOSim simulates changes made to the core to route
memory accesses either to the attached ring node or to the private L1.
Figure 3.10: Decoupling register, synchronization, and memory communication is vital for maximizing speedups.
[Bar chart: program speedup (0 to 16) for the SPECint benchmarks and their geomean under five configurations:
HCCv2; decoupled reg. communication; decoupled reg. comm. and synch.; decoupled reg. and memory comm.;
HELIX-RC (decoupled all communication). Annotations mark the benefit of decoupling memory communication
and the benefits of decoupling synchronization.]
Benchmarks I used 10 of the 15 C benchmarks from the SPEC CPU2000 suite: 4 floating point
(SPECfp 2000) and 6 integer (SPECint 2000) benchmarks. For engineering reasons, the data de-
pendence analysis that HCCv3 relies on [23] requires either too much memory or too much time to
handle the other benchmarks. This limitation is orthogonal to the results described below.
Compiler I extended the ILDJIT compilation framework [8], version 1.1, to use LLVM 3.0 for
backend machine code generation. I generated both single- and multi-threaded versions of the
benchmarks. The single-threaded programs are the unmodified versions of the benchmarks,
optimized (O3) and generated by LLVM. This code outperforms GCC 4.8.1 by 8% on average and
underperforms ICC 14.0.0 by 1.9%.¶ The multi-threaded programs were generated by HCCv3 and
HCCv2 to run on ring-cache-enhanced and conventional architectures, respectively. Both compilers
produce code automatically and do not require any human intervention. During compilation, they
use SPEC training inputs to select the loops to parallelize.
Measuring performance I computed the speedups relative to sequential simulation. Both
single- and multi-threaded runs use reference inputs. To make simulation feasible, I simulated mul-
tiple phases of 100M instructions as identified by SimPoint [24].
3.5.2 Speedup Analysis
In the 16-core processor evaluation system, HELIX-RC boosts the performance of sequentially-
designed programs (SPECint 2000), which are assumed not to be amenable to parallelization. Fig-
ure 3.9 shows that HELIX-RC raises the geometric mean of speedups for these benchmarks from
2.2 for HCCv2 without ring cache to 6.85.
HELIX-RC not only maintains the performance increases of HCCv2 (compared to HCCv1) on
numerical programs (SPECfp 2000), but also increases the geometric mean of speedups for SPECfp
2000 benchmarks from 11.4‖ to almost 12.
Next, I turn to explaining where the speedups come from.
Communication Speedups obtained by HELIX-RC come from decoupling both synchronization
and data communication from computation in loop iterations, which significantly reduces
communication overhead, allows the compiler to split sequential segments into smaller blocks,
and cuts down the critical path of the generated parallel code. Figure 3.10 compares the speedups
gained by multiple combinations of decoupling synchronization, register-based communication,
and memory-based communication. As expected, fast register transfers alone do not provide much
speedup since most in-register dependences can be satisfied by recomputing the shared variables
involved (Section 3.1). Instead, most of the speedups come from decoupling communication for both
synchronization and memory-carried actual dependences. To the best of my knowledge, HELIX-RC
is the only solution that accelerates all three types of transfers for actual dependences. The
Multiscalar register file (Section 2.4.1) comes closest by decoupling register communication and
synchronization, but as the figure shows, decoupling communication and synchronization for memory
provides significant additional benefits.
¶As an aside, automatic parallelization features of ICC led to a geomean slowdown of 2.6% across the
SPECint 2000 benchmarks, suggesting ICC cannot parallelize irregular programs.
‖These speedups are possible even with the cache coherence latency of conventional processors (e.g., 75
cycles).
Figure 3.11: While code generated by HCCv3 speeds up with a ring cache (R), it slows down on conventional
hardware (C). [Stacked bar chart: % execution time relative to sequential execution, split into communication and
computation, for each SPECint benchmark and the INT geomean; one bar exceeds the axis and is labeled 510.]
In order to assess the impact of decoupling communication from computation in the SPECint
2000 benchmarks, I executed the parallel code generated by HCCv3—assuming a decoupling archi-
tecture like a ring cache—on a simulated conventional system that does not decouple. The loops
selected under the assumption that a fast communication mechanism is present do require frequent
communication (every 24 instructions on average). Figure 3.11 shows that such code, when run on a
conventional multicore processor (left bars), performs no better than sequential execution (100%),
even with the optimistic 10-cycle core-to-core latency. These results further underscore the impor-
tance of selecting loops based on the core-to-core latency of the architecture.
Sequential segments While more splitting offers higher TLP (more sequential segments can
run in parallel), more splitting also requires more synchronization at run time. Hence, the high syn-
chronization cost for conventional multicore processors discourages aggressive splitting of sequen-
tial segments.** In contrast, the ring cache enables aggressive splitting to maximize TLP.
To analyze the relationship between splitting and TLP, I computed the number of instructions
that execute concurrently for the following two scenarios: (i) conservative splitting constrained by
a contemporary multicore processor with a high synchronization penalty (100 cycles) and (ii) ag-
gressive splitting for HELIX-RC with low-latency communication (<10 cycles) provided by the ring
cache. In order to compute the TLP independent of both the communication overhead and core
pipeline advantages, I used a simple abstracted model of a multicore system that has no communi-
cation cost and is able to execute one instruction at a time. Using the same set of loops chosen by
HELIX-RC and shown in Figure 3.9, TLP increased from 6.4 to 14.2 instructions per cycle with ag-
gressive splitting. Moreover, the average number of instructions per sequential segment dropped
from 8.5 to 3.2 instructions.
**This is the rationale behind DOACROSS parallelization [16].
Figure 3.12: Sensitivity to core count and ring cache parameters. Only SPECint benchmarks are shown. [Four bar
charts of program speedup (0 to 16) per benchmark: (a) Core count: 2, 4, 8, and 16 cores. (b) Adjacent node link
latency: 1, 4, 8, 16, and 32 cycles. (c) Signal bandwidth: 1, 2, and 4 signals, and unbounded. (d) Node memory size:
256 B, 1 KB, 32 KB, and unbounded.]
Figure 3.13: Breakdown of overheads that prevent HELIX-RC from achieving ideal speedup.

              Dependence  Commun-   Low trip  Iteration           Wait/signal   Additional    HELIX-RC
Benchmark     waiting     ication   count     imbalance   Memory  instructions  instructions  speedup
177.mesa        29.3%       0.9%      3.7%      58.4%      7.3%      0.0%          0.3%        15.1x
188.ammp        64.1%       8.0%      6.3%       7.4%      8.9%      2.2%          3.1%        12.5x
179.art          0.2%       0.0%     47.7%      24.8%     16.1%      0.0%         11.3%        10.5x
183.equake       0.2%       0.0%      9.1%       1.5%     87.7%      0.0%          1.5%        10.1x
256.bzip2        3.4%       3.4%     51.6%       0.1%      1.1%     19.7%         20.7%        12.0x
181.mcf         37.7%      10.4%      5.5%       1.2%      3.2%     20.9%         21.2%         8.7x
300.twolf        0.1%       0.2%     41.8%       1.4%     31.8%      0.0%         24.6%         7.6x
197.parser      31.3%      24.3%     15.3%       5.0%      0.3%     11.6%         12.2%         7.3x
175.vpr         11.9%       0.4%     74.2%      12.4%      0.0%      0.5%          0.5%         6.1x
164.gzip        40.8%       8.1%      9.6%       4.5%      0.0%     18.1%         18.8%         3.0x

Coverage Despite all the loop-level speedups made possible by decoupling communication
and aggressively splitting sequential segments, Amdahl’s law states that program coverage dictates
the overall speedup of a program. Prior parallelization techniques have avoided selecting loops
with small bodies because communication would slow down execution on conventional
processors [12, 56]. Since HELIX-RC does not suffer from this problem, the compiler can freely select
small hot loops to cover almost the entirety of the original program. Table 3.1 shows that HELIX-RC
achieves >98% coverage for all of the benchmarks evaluated.
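Amdahl’s law, cited in the Coverage discussion, can be made concrete with a short calculation. The sketch below plugs the two coverage values listed for 164.gzip in Table 3.1 into the standard formula; the 16x speedup of the covered loops is an idealized assumption (perfect scaling on 16 cores), not a measured result.

```python
def amdahl_speedup(coverage: float, loop_speedup: float) -> float:
    """Overall speedup when a fraction `coverage` of run time speeds up by `loop_speedup`."""
    return 1.0 / ((1.0 - coverage) + coverage / loop_speedup)


# Coverage values for 164.gzip from Table 3.1; the 16x loop speedup is an
# idealized assumption, so these are upper bounds rather than results.
bound_low_coverage = amdahl_speedup(0.423, 16.0)   # 42.3% coverage
bound_high_coverage = amdahl_speedup(0.982, 16.0)  # 98.2% coverage
print(round(bound_low_coverage, 2), round(bound_high_coverage, 2))
```

Under these assumptions the bound rises from roughly 1.66 at 42.3% coverage to roughly 12.6 at 98.2% coverage, which is why covering small hot loops matters so much.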
3.5.3 Sensitivity to Architectural Parameters
The speedup results presented so far assume the default configuration (in Section 3.5.1) for the ring
cache. I will now investigate the impact of different architectural parameters on speedup. In the next
set of experiments I sweep one ring cache parameter at a time while keeping all others constant at the
default configuration.
Core count Figure 3.12a shows that HELIX-RC efficiently scales parallel performance with the
core count, from 2 to 16.
Link latency Figure 3.12b shows the speedups obtained versus the minimum communication
latency between adjacent ring nodes. As expected, HELIX-RC performance degrades for longer
latencies for most of the benchmarks. It is important to note that current technologies can satisfy
single-cycle adjacent core latencies, as confirmed by commercial designs [64] and CACTI [43] wire
models of interconnect lengths for dimensions in modern multicore processors.
Link bandwidth A ring cache uses separate dedicated wires for data and signals to simplify the
design. The simulations confirm that a minimum data bandwidth of one machine word (hence, a
single write port) sufficiently sustains more than 99.9% of the performance obtained by a data link
with unbounded bandwidth for all benchmarks. In contrast, reducing the signal bandwidth can
degrade performance, as shown in Figure 3.12c, due to synchronization stalls. However, the physical
overhead of adding additional signals (up to 4) is negligible.
Memory size Figure 3.12d shows the impact of memory size. The finite-size cases assume LRU
replacement. Reducing the cache array size within the ring node only impacts 197.parser, which has
the largest ring cache working set.
3.5.4 Analysis of Overhead
To identify areas for improvement, I have categorized every overhead cycle (preventing ideal speedup)
based on a set of simulator statistics and the methodology presented by Burger and colleagues [7].
Figure 3.13 shows the results of this categorization for HELIX-RC, again implemented on a 16-core
processor.
Most importantly, the small fraction of communication overheads suggests that HELIX-RC suc-
cessfully eliminates the core-to-core latency for data transfer in most benchmarks. For several bench-
marks, notably 175.vpr, 300.twolf, 256.bzip2, and 179.art, the major source of overhead is the low
number of iterations per parallelized loop (low trip count). While many hot loops are frequently in-
voked, the low iteration count (ranging from 8 to 20) leads to idle cores. Other benchmarks, such as
164.gzip, 197.parser, 181.mcf, and 188.ammp, suffer from dependence waiting due to large sequential
segments. Finally, HCCv3 must sometimes add a large number of wait and signal instructions
(i.e., many sequential segments) to increase TLP, as can be seen for 164.gzip, 197.parser, 181.mcf, and
256.bzip2.
3.6 Conclusion
Decoupling communication from computation makes irregular programs easier to parallelize auto-
matically by compiling loop iterations as parallel threads. While numerical programs can often be
parallelized by compilation alone, irregular programs greatly benefit from a combined compiler–
architecture approach. The HELIX-RC prototype shows that a minimally invasive architecture ex-
tension co-designed with a parallelizing compiler can liberate enough parallelism to make good use
of 16 cores for irregular benchmarks commonly thought not to be parallelizable. Unlike the previous
approaches described in Chapters 1 and 2, HELIX-RC decouples all inter-thread memory, register,
and synchronization communication from the core computation.
In the next chapter, I delve into some of the important implementation details for ring cache.
4 Ring Cache Detail and Implementation
In Chapter 3, most of the HELIX-RC speedup results were generated using a C/C++-based x86
simulator called XIOSim [32], with the ring cache similarly modeled in C++. While every effort was
made to accurately model the ring cache at a cycle-accurate level, high-level languages are not the best
fit for expressing the operation of hardware. In order to fully specify the implementation of the ring
cache, I created and tested a synthesizable Verilog reference design. Fully detailed explanations of the
implementation and the design decisions appear in the Appendix to this dissertation.
In this chapter, I delve into some of the implementation details. First, I discuss the ring cache’s
integration with the normal cache hierarchy in more detail, along with a very important memory
consistency issue that must be handled properly. Then I describe the signal buffer, which plays
a crucial role in reducing HELIX-RC synchronization costs. I introduce the concept of
synchronization epochs and describe how the signal buffer can be designed to allow cores executing parallel
code to decouple their execution by an arbitrary number of iterations. Finally, I describe the power
and area consumed by the synthesized ring cache design and specifically examine some of the major
tradeoffs related to the signal buffer.
4.1 Memory Hierarchy Integration
One of the most important contributions of ring cache is that it transforms a reactive cache coher-
ence protocol into a proactive system of communication. In most cases, when a core attempts to
access shared data, it will find that the data is present in its local ring node memory, and the com-
munication cost is simply the time it takes to access the ring node. In contrast, if the normal cache
hierarchy were used, it would take dozens of cycles from the time the data was requested until it
arrived locally.
A distinguishing factor of ring cache relative to other fast communication mechanisms (Mul-
tiscalar register file [54], scalar operand networks [59]) is that the number of shared pieces of data
is not known at compile time, nor is the number of consumers of any particular shared piece of
data. Since other solutions for fast communication of dependences rely on statically known num-
bers of shared elements, they are not suitable for HELIX. Instead, a cache structure that can handle
unknown numbers of elements is needed. This means that ring cache must be able to handle load
misses and cache evictions, since there is no guarantee that all of a shared piece of data will fit in
the ring cache. For this reason, the normal cache hierarchy must be relied upon to support the ring
cache.
However, the memory consistency guarantees of the normal cache hierarchy differ from those
of the ring cache, and a naive integration between the two could raise a significant consistency issue.
Consider an implementation where a ring node simply writes any evicted value back to its local L1.
In the case of a subsequent ring cache load miss, the ring node fetches the data from its L1. While
this might seem like a reasonable idea at first glance, it gives rise to race conditions that can violate
correctness. Figure 4.1 depicts the timeline of a three-core system suffering from such a race
condition. For simplicity, assume a ring node memory size of just a single word. Sometime in the past, the
value 32 was written to address A0, which was propagated on the data/signal forwarding network
and stored in every ring node memory. During cycle 1, core 0 and core 1 execute two different
sequential segments. Core 0 stores the value 4 to address A1, and core 1 stores the value 64 to address
A0. Both stores enter the data/signal forwarding network. In the next cycle, core 0’s store has
triggered an eviction in its memory, and A0, with a value of 32, begins to be written back to core 0’s L1
cache. In the same cycle, core 1’s ring node memory updates address A0 with the value 64.
Additionally, both stores propagate one more hop on the forwarding network. In cycle 3, the store to A1
triggers an eviction of A0 in core 1. Now core 1 begins to write A0 back to its own L1. Unlike core 0,
however, core 1 writes back the newly updated value of 64. This results in a race to update A0 with
either the old or the new value. Many (perhaps all) modern architectures do not make any
guarantees about which value will be recorded first.
Figure 4.1: Allowing any core to load from and write back to the normal cache hierarchy results in a race condition
that may violate correctness! [Timeline, cycles 0 to 3, of the ring node memories of cores 0 to 2. Notation: "AX = Y"
means ring cache memory currently stores Y in address X; "AX <- Y" is a forwarding network bundle that stores Y
in address X. In cycle 3, core 0 is writing 32 to A0, but core 1 is writing 64 to A0, so a memory race results.]
I handle this consistency issue by enforcing the constraint that for any unique memory address,
there is a single owner core that is solely responsible for any loads or stores to that address between
the ring cache and the normal cache hierarchy. The owner core is determined based on certain bits of
the address, depending on how many cores there are in the system.
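The ownership rule can be illustrated with a minimal sketch. The text only says that "certain bits" of the address select the owner, so the choice of the low-order word-address bits, the 4-byte word size, the power-of-two core count, and the name `owner_core` are all assumptions made for illustration.

```python
WORD_BYTES = 4  # assumption: 32-bit data words


def owner_core(address: int, num_cores: int) -> int:
    """Return the (hypothetical) owner core of a byte address.

    Assumes num_cores is a power of two so that the owner can be read
    directly from the low-order bits of the word address.
    """
    if num_cores <= 0 or num_cores & (num_cores - 1):
        raise ValueError("num_cores must be a power of two")
    word_index = address // WORD_BYTES
    return word_index & (num_cores - 1)
```

Because the mapping depends only on the address, every ring node computes the same owner without any communication.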
4.1.1 Request and Reply Networks
In the case of a ring cache load miss, the request network is responsible for requesting the data from
the owner ring node, which subsequently performs an L1 lookup and returns the result on the re-
ply network. In the case of a ring cache eviction, if the ring node owns the evicted address, it writes
it back to its L1. If the evicting ring node is not the owner, it simply discards the data without per-
forming any write back. These networks implement a reactive data transfer mechanism that, if used
frequently, completely eliminates all of the benefits of having the ring cache. For this reason, they
are not tuned for performance: if they are used more than rarely, performance will tank regardless.
Like the data/signal forwarding network, the request/reply networks are implemented as
unidirectional rings that hop from core to core around the system, one clock cycle per
hop. While it might seem that a more highly connected topology could have been used, since strict
in-order data flow is not required for loads as it is for stores and signals, there are two reasons why I
stayed with unidirectional rings. First, unidirectional rings are easy to reason about in terms of data
flow and deadlock avoidance. Second, and more importantly, care must be taken so that a remote
load on the request network doesn’t pass a store to the same address on the forwarding network, or
an incorrect value could be returned.
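The eviction and load-miss rules above can be sketched as follows. This is a toy model under stated assumptions, not the RTL: each L1 is a plain dictionary, the owner mapping uses low-order word-address bits (the thesis does not say which bits are used), and the function names are hypothetical.

```python
def owner_core(address, num_cores):
    # Assumption: the owner is chosen from low-order word-address bits.
    return (address // 4) % num_cores


def on_eviction(node_id, address, value, local_l1, num_cores):
    """Handle a ring cache eviction at ring node `node_id`.

    Only the owner writes the value back to its L1; every other node
    discards its copy without any write back.
    """
    if owner_core(address, num_cores) == node_id:
        local_l1[address] = value
        return "writeback"
    return "discard"


def on_load_miss(node_id, address, l1_caches, num_cores):
    """Handle a ring cache load miss at ring node `node_id`.

    The owner performs the L1 lookup. A non-owner reaches the owner over
    the request network and receives the value over the reply network.
    """
    owner = owner_core(address, num_cores)
    value = l1_caches[owner].get(address)
    path = "local_l1" if owner == node_id else "request_reply"
    return value, path
```

Since a single node performs every write back for a given address, two nodes can no longer race to update the same L1 location as in Figure 4.1.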
4.1.2 Reducing Remote Loads
The number of remote loads can be reduced by relaxing the constraint that only the owner core of
an address can interact with the normal cache hierarchy. Instead, only the owner core of an address
can interact with the normal cache hierarchy if the address has previously been written to the ring cache
but subsequently evicted. The race condition previously shown in Figure 4.1 can only occur if an
address was at some point present in the ring cache (in that example, address A0). If a particular core is
trying to load A0 and knows for sure that A0 was never written to the ring cache before this loop
invocation, it can conclude that it is impossible that it is currently being evicted from any other node.
Given the definition of a sequential segment, if a core is loading A0, no other core in the system can
be loading or storing it. It is therefore safe for the core to load it from the normal cache hierarchy,
even if it isn’t the owner. We use a Bloom filter in each ring node to track whether an address has
been written before this loop invocation. If a core attempts to perform a load and the load misses in its
ring node memory, the Bloom filter is consulted. If the address is present in the Bloom filter, the
core knows it must make a remote load request unless it is the owner of the address. If the address is
not present in the Bloom filter, the core can load from its own L1, despite not being the owner.
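The load path described above can be sketched in a few lines. The Bloom filter size, the number of hash functions, the use of SHA-256 as a hash, the owner mapping, and all names here are assumptions chosen for illustration; the thesis does not specify them.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over integer addresses (sizes are arbitrary)."""

    def __init__(self, num_bits=256, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit vector packed into one Python int

    def _positions(self, address):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{address}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.num_bits

    def add(self, address):
        for pos in self._positions(address):
            self.bits |= 1 << pos

    def might_contain(self, address):
        # False means "definitely never added"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(address))


def load_path(node_id, address, ring_memory, written, num_cores):
    """Decide where a load is serviced from, following Section 4.1.2."""
    if address in ring_memory:
        return "ring_hit"
    if not written.might_contain(address):
        return "local_l1"  # never written to the ring cache: no eviction race
    owner = (address // 4) % num_cores  # assumption: low-order word bits
    return "local_l1" if owner == node_id else "remote_load"
```

A false positive in the filter only costs an unnecessary remote load; it never allows an unsafe local L1 access, which is why a Bloom filter is a safe fit for this check.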
4.2 Signal Buffer Implementation
The signal buffer contributes a significant portion of the improved speedups that ring cache pro-
vides for HELIX. The signal buffer produces speedups both by pushing the signal tracking logic
to hardware instead of software and by decoupling signal forwarding from synchronization, which
helps break the sequential forwarding chains of synchronization that are intrinsic to HELIX and
DOACROSS-style parallelization. The amount of hardware resources dedicated to the signal buffer
can increase speedups along two dimensions. First, adding more available signal IDs allows the com-
piler to more aggressively parallelize sequential code into smaller sequential segments, potentially
increasing parallelism amongst segments. Second, adding more bits for buffering received and sent
signals allows cores to increase the number of iteration epochs they can decouple from one another
during execution, which reduces the core idle time that normally results from sequential forwarding
chains. I begin the signal buffer discussion by explaining the concept of epochs in this context, as
well as how the signal buffer facilitates synchronization decoupling by allowing cores to skip light
waits. Then I discuss some optimizations the compiler can make to reduce the number of synchronization
instructions that need to be sent to the signal buffer.
4.2.1 Synchronization Epochs
Section 3.1.2 described at a high level how decoupling signal forwarding from synchronization allows
HELIX-RC to break sequential forwarding chains and increase speedups. In this section, I describe
in detail the operation of the signal buffer, how it breaks these chains, and how it allows cores to
decouple in units of epochs.
Figure 4.2: A modified loop body where empty sequential segments start with light wait instructions instead of
ordinary wait instructions. [Figure: loop body consisting of parallel code A(), an if on COND that selects one of two
versions of Sequential Segment 1 ("Wait ID1; Load X; X = f(X); Store X; Signal ID1" or the empty "Light Wait ID1;
Signal ID1" on the right branch), parallel code B(), and the start of the next iteration.]

I define an epoch to be a set of N iterations, where N is the number of cores executing a parallel
loop. Consider a 3-core system. For a parallel loop with 9 total iterations, core 0 executes iterations
0, 3, and 6; core 1 executes iterations 1, 4, and 7; and core 2 executes iterations 2, 5, and 8. An epoch is
therefore 3 iterations long. The start of an epoch, however, does not need to occur at the beginning
of a sequence of iterations. For example, an epoch could span from iteration 1 to iteration 3, not just
from 0 to 2.
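The epoch definition can be made concrete with a short sketch. It assumes the round-robin assignment of iterations to cores implied by the 3-core example; the function names are hypothetical.

```python
def iterations_for_core(core, num_cores, total_iterations):
    """Iterations executed by `core` under round-robin assignment."""
    return list(range(core, total_iterations, num_cores))


def epoch(start_iteration, num_cores):
    """An epoch: N consecutive iterations starting anywhere."""
    return list(range(start_iteration, start_iteration + num_cores))


def cores_in_epoch(start_iteration, num_cores):
    """Cores that execute the iterations of an epoch, in iteration order."""
    return [i % num_cores for i in epoch(start_iteration, num_cores)]
```

For example, `epoch(1, 3)` covers iterations 1, 2, and 3, and `cores_in_epoch(1, 3)` shows that every core still contributes exactly one iteration even though the epoch does not start at core 0.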
I further define the concept of a synchronization epoch, which is an epoch whose bounds are not
at iteration boundaries. Instead, they exist just before the next sequential segment in an iteration.
Take the example of an epoch of iterations from 1 to 3. The corresponding synchronization epoch
would begin precisely at the next sequential segment encountered by the core executing iteration 1,
and would end just before the next sequential segment encountered by the core executing iteration
3. In higher-level terms, a synchronization epoch is the furthest distance any pair of cores can drift
apart if they are constrained by a sequential forwarding chain. In the case where all sequential seg-
ments contain dependences that need to be satisfied, no two cores can ever separate by more than
a synchronization epoch, even if signal buffering is available, since all sequential segments must be
executed in loop iteration order. However, if there are sequential segments that are empty and there
is sufficient signal buffering available, cores can drift apart (“decouple”) by multiple synchronization
epochs, which reduces core idle time. I define wait instructions that mark the boundaries of such
empty sequential segments as light waits that do not need to be synchronized under certain circum-
stances.
An example will be helpful for understanding these two concepts. Figure 4.2 depicts a HELIX
parallelized loop, a modified version of Figure 3.7(a), with a light wait instead of a normal wait
instruction on the right branch of the sequential segment, indicating that the segment lacks any de-
pendence. To keep things simple, assume we have a two-core system—the same conclusions will
hold for a chip with more cores. Each core only needs to receive signals from the other core to unblock
a particular sequential segment. First, we consider a scenario where a core uses a single bit
for each other core in the system to track signals received from that core for a particular sequential
segment. Since we have only two cores and only one sequential segment, this means that each
core only needs a single bit to track signals. When a core receives a signal, it sets the signal bit. When
a core finishes executing the sequential segment, it clears the bit, thereby “consuming” the signal.
Figure 4.3 depicts the execution of four iterations of the example loop. The time it takes to trans-
mit a signal is exaggerated to better illustrate the impact of signal buffering. Note that since core 0 is
executing the first iteration of a loop, its signal bit is preset, since there is no iteration -1 to receive a
signal from. For simplicity, data communication is not shown. First, core 0 and core 1 start execut-
ing the parallel portions of iterations 0 and 1. Just before reaching the sequential segment, core 0 is
stalled on a DRAM access. Meanwhile, core 1 reaches the light wait instruction at the start of
the sequential segment. Core 1 executing iteration 1 hasn’t received the signal from core 0 executing
iteration 0, so it may not enter the sequential segment. However, since it took the right branch of
the if, the sequential segment starts with a light wait rather than a normal wait. Core 1 therefore
knows that it doesn’t contain a dependence to synchronize and would prefer to continue with exe-
cution, even though it is blocked. In this situation, core 0 and core 1 are said to both be within the
same synchronization epoch. Once the DRAM access returns, core 0 executes the segment, clears its
signal received bit, and sends the signal to unblock core 1. Core 1 sets the signal received bit, which
grants it access to the sequential segment, before quickly sending its own signal, which clears the re-
ceived bit. Core 0 soon begins executing iteration 2, and core 1 begins executing iteration 3, where
the same dynamic repeats itself, although without the DRAM stall. Even though core 1 never needs
to access shared data, it is nonetheless constrained by the sequential segment. When HELIX does
not have access to a ring cache, cores will always be constrained to a single synchronization epoch.
This can be seen by observing that throughout, iteration 0 only overlaps with iteration 1, which in
turn only overlaps with iteration 0 and iteration 2.
Figure 4.3: A single state bit for synchronizing signals constrains cores to operate within a single synchronization
epoch. In a two-core system, this implies that cores cannot move apart by more than two iterations. [Figure:
execution timeline of core 0 (iterations 0 and 2) and core 1 (iterations 1 and 3), each with a one-bit signal tracker.
Core 0 suffers a DRAM stall before its sequential segment; core 1 stalls at its light wait because the signal has not
been received, and cannot proceed even though there is no dependence.]

Imagine a scenario where core 1, knowing that the sequential segment doesn’t contain a dependence,
bypasses the light wait instruction altogether. It would send the corresponding signal,
which would ordinarily clear core 1’s bit, and set core 0’s bit. However, core 1’s bit is already cleared
and core 0’s bit is already set. In this case, the information that a signal was consumed and that a
signal was received is lost. Core 1, upon receiving the signal from core 0 (executing iteration 0), will set
its bit and therefore enter the sequential segment of iteration 3, even though that signal was meant
for the segment it skipped in iteration 1. This potentially results in accessing shared data out of
iteration order, a correctness violation. Likewise, core 0 has lost the knowledge that it received a signal
from core 1, iteration 1, and therefore cannot enter the sequential segment in iteration 2. Since we
would like core 1 to be able to skip the empty sequential segment, we add an additional bit to our
signal tracking. Instead of 2 states (received signal or not), there are now 4. These new states allow
cores to record whether they’ve skipped a sequential segment (state -1) and therefore need to receive
two more signals to enter the next non-light sequential segment, and also whether they’ve received
an extra signal (state 2) and therefore are free to enter the sequential segment two more times.
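The four-state tracker can be modeled as a small signed counter. The sketch below is an interpretation rather than the hardware design: it assumes the state is "signals received minus segments consumed", that a light wait may be skipped whenever the decrement would not underflow the counter, and that consumption happens at the wait rather than at the following signal. The class and method names are hypothetical.

```python
class CoreTracker:
    """Signed counter tracking signals from one other core."""

    def __init__(self, epoch_bound=2, preset=0):
        # Two states per epoch of decoupling, e.g. -1..2 for a bound of 2.
        self.min_state = -(epoch_bound - 1)
        self.max_state = epoch_bound
        self.state = preset

    def receive_signal(self):
        if self.state >= self.max_state:
            raise OverflowError("sender ran further ahead than the buffer allows")
        self.state += 1

    def try_wait(self):
        """Normal wait: enter the segment only if a signal is buffered."""
        if self.state >= 1:
            self.state -= 1  # the signal is consumed
            return True
        return False  # the core must stall

    def try_light_wait(self):
        """Light wait: skip ahead unless the counter would underflow."""
        if self.state > self.min_state:
            self.state -= 1  # records a consumed or skipped segment
            return True
        return False  # the core must stall
```

With `epoch_bound=1` the counter has only states 0 and 1, so a light wait at state 0 blocks exactly like a normal wait, which matches the single-bit behavior of Figure 4.3.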
Figure 4.4 depicts a new execution timeline for when this additional signal-buffering hardware
is added. This time, core 1 is able to skip the sequential segment and race ahead to iteration 3 with-
out violating correctness. Core 1 takes the right branch of the if and, even though the sequential
segment is once again empty, must block. However, because of the extra signal-buffering capability,
the cores have now been able to drift apart by an additional synchronization epoch, allowing core 1
to begin executing iteration 3 before iteration 0 has executed the sequential segment. For every two
additional states that are added to track signals, cores can drift apart yet another synchronization
epoch. Of course, cores can only decouple if they encounter sequential segments that are empty—
otherwise, they need to access shared data and the sequential segments still need to be executed in
loop iteration order. The benchmarks contain enough empty segments, however, that decoupling
cores reduces idle time and increases speedups—significantly so for some benchmarks, as we saw in
Figure 3.10. In our simplified example, the execution time only improves slightly. In more realistic
scenarios, with more cores and more heterogeneity in execution, having cores execute as far ahead
as possible and send as many signals as soon as possible increases overall performance even more.
In practice, even given unconstrained resources, the maximum that cores drift apart in the complex
benchmarks is usually limited to two synchronization epochs, so only four states for signal buffering
are required to gain most of the benefit. I call the number of synchronization epochs by which the
signal buffer allows cores to decouple the epoch bound.

Figure 4.4: Using two signal-buffering state bits improves performance by allowing cores to decouple by an
additional synchronization epoch. [Figure: execution timeline of core 0 (iterations 0 and 2) and core 1 (iterations 1
and 3), each with a signal tracker taking values from -1 to 2. Annotations: "No signal stall since no dependence to
satisfy! Send signal immediately"; "Already received two signals, proceed immediately"; "Can't unblock yet, need to
receive two signals since the previous light wait was skipped"; "Iter 3 now overlaps with Iter 0! Cores have
decoupled by an epoch".]
4.2.2 Signal Buffer Architecture
Although the previous example had only two cores, the signal buffer must independently track sig-
nals from every other core in the system, whatever the number, not just a single core. Figure 4.5
shows the general hierarchy of the signal buffer. Each signal buffer within a single ring node con-
tains signal tracker modules equal in number to the maximum number of signal IDs that the com-
piler can create. Within each signal tracker module, there are core tracker modules, one for each
core in the system other than the core that the signal buffer is part of. Each of these core tracker
modules contains a counter large enough to represent the total number of states required for the
desired amount of synchronization decoupling, with two states per desired epoch of decoupling. As
mentioned earlier, if the epoch bound is 2, four states are required to buffer signals and so a two-bit
counter is used.
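The storage implied by this hierarchy can be estimated with a short calculation. This is a back-of-the-envelope sketch: the function names are hypothetical, and it counts only the state counters, not the surrounding multiplexers and control logic.

```python
def counter_bits(epoch_bound):
    """Bits needed for a counter with two states per epoch of decoupling."""
    states = 2 * epoch_bound
    return (states - 1).bit_length()


def signal_buffer_bits(num_signal_ids, trackers_per_signal, epoch_bound):
    """Total counter bits in one ring node's signal buffer."""
    return num_signal_ids * trackers_per_signal * counter_bits(epoch_bound)
```

With 128 signal IDs and an epoch bound of 2, tracking the 15 other cores of a 16-core system gives `signal_buffer_bits(128, 15, 2) == 3840` bits, while counting 16 trackers per signal gives 4096 bits, the figure quoted in the area discussion of Section 4.3.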
Figure 4.5: The signal buffer of a single ring node contains signal tracker modules for the total number of possible
signal IDs that the compiler might generate. Within each signal tracker, there are core tracker modules, one for each
other core in the system that the owner of this signal buffer might receive signals from. Within these modules, state
registers track how many signals have been received relative to how many have been sent by the owner core. The
number of necessary states is dictated by the desired amount of synchronization decoupling. [Figure: a Signal Buffer
Module holds N Signal Tracker Modules (signal IDs 0 to N-1, N = max signal IDs); each Signal Tracker Module holds
M Core Tracker Modules (core IDs 0 to M-1, M = number of other cores = total cores in system - 1); each Core
Tracker Module holds a state register sized by the epoch bound.]

4.2.3 Signal Buffer Optimization
There are a few possible signal buffer optimizations that may be appropriate to use with the
reference design. The first of these is only applicable when the epoch bound is 1, that is, when cores are not
able to decouple and must always stay within the same synchronization epoch, even in the presence
of sequential segments that lack dependences. This scenario is similar to when HELIX is running
on a traditional multicore, albeit still with faster signal propagation speed. In this configuration,
every sequential segment is run in loop iteration order, without exception. Light waits don’t exist,
since there is no situation where they can be skipped. The (numCores - 1) core
tracker modules per signal tracker module are no longer necessary, since receiving a signal for a
particular sequential segment ID from a particular core implies that every previous loop iteration from
every other core has already executed the segment. This allows the signal buffer to reduce the number
of core tracker modules from (numCores - 1) per signal tracker module to just 1 per signal tracker
module. This potentially reduces the total amount of signal buffer area by roughly a factor of the number
of cores. The area savings come at a cost: cores cannot decouple across synchronization
epochs and so speedups are reduced. Nevertheless, if the signal buffer area is prohibitively large, limiting
the epoch bound to 1 is a possible solution. Also, signals no longer need to circulate around the entire
ring; they only need to travel to the subsequent core in the ring, which reduces signal bandwidth
requirements.
The second optimization is applicable only when the epoch bound is 2, which implies cores can
decouple by an additional synchronization epoch. If the compiler can guarantee that there is at least
one non-light wait instruction per loop iteration (as would be the case if a particular dependence
always needed to be satisfied), then light waits can be removed entirely, leaving only the
corresponding signal behind. This non-light wait instruction must belong to the same sequential
segment in every iteration. A core can then send the corresponding signal straight away, without
relying on the light wait to prevent underflow. This is because the existence of at least one
unavoidable non-light wait per iteration naturally limits synchronization decoupling to less than two
synchronization epochs. This optimization is a trade-off: light wait instructions can be removed,
which reduces the number of instructions executed, and the signal buffer no longer needs
to contain the logic to examine light waits. However, the compiler may now have to artificially
generate a non-light wait instruction that could otherwise be light. In practice, I found that
this downside doesn’t happen often. This optimization was used in the HELIX-RC evaluation in
Chapter 3.
The third optimization applies when the epoch bound is 3. As with the previous optimization,
if the compiler can make a guarantee about non-light waits, then light waits don’t need to be
executed at all. However, in this case, the compiler only needs to guarantee that at least one
sequential segment executes a non-light wait per iteration. Unlike the previous case, it needn’t be the same
sequential segment in every iteration. This prevents our six-state core tracker counters from ever
underflowing. For our particular benchmarks, increasing the epoch bound beyond 2 did not have any
effect on performance, so this optimization may not be useful.
4.3 Ring Cache Synthesis Evaluation
Table 4.1: Ring cache parameters for the reference design.
Parameter                                      Value
Number of Supported Cores                      16
Address Width                                  32 bits
Data Width                                     32 bits
Cache Associativity                            direct mapped
Cache Data Storage                             1024 KB
Total Number of Signal IDs                     128
Signal Bandwidth                               5 signals per cycle
Signal Buffer epoch bound                      2
Store Bandwidth                                1 per cycle
Data/Signal Forwarding Network Total Wires     129 bits
Remote Load Request Network Total Wires        37 bits
Remote Load Reply Network Total Wires          37 bits
Assumed Network Link Latency                   0.5 ns
This section presents some preliminary area, power, and timing results for the ring cache. Ta-
ble 4.1 shows the parameters for the reference design. The values were chosen to roughly match the
simulated ring cache in Chapter 3. The most noticeable exception is the ring cache memory, which
was 8-way set associative in Chapter 3 rather than direct mapped. Later simulations showed that
the difference between 8-way associativity and direct mapping was minor, so for ease and clarity of
implementation, I used the latter. It is important to note that many of these parameters were se-
lected to be just large enough so as not to be a bottleneck for the six SPECint 2000 benchmarks that
were evaluated. Other programs may have vastly different requirements, so these particular values
should not be overly relied upon for a final implementation.
The reference configuration was exhaustively tested with test vectors from our team’s x86 cycle-level
C++ simulator, XIOSIM, with the modeled ring cache configured the same as the reference
design. Vectors were collected for every SimPoint phase of all 10 of the SPEC benchmarks that were
evaluated in Chapter 3. Specifically, for every simulated cycle, the values of all inputs and outputs
corresponding to the ring cache interface between the attached core and the L1 cache were collected.
Verilog simulations were performed with 16 ring nodes that were excited with these test vectors.
At every cycle, the outputs of the ring nodes were compared to the known correct outputs from
XIOSIM. Every phase passed the testbench.
Many of the parameters, such as number of supported cores, signal bandwidth, signal buffer con-
figuration, and total number of signal IDs, have a large impact on the area/performance of the ring
cache. After generating initial results for this reference design, some of these important parame-
ters were swept. First, the reference design was synthesized using Synopsys Design Compiler with a
40nm process technology. The synthesis tool was steered to optimize for critical path delay. I used
RTL-level activity factors from the SPECint test runs to more accurately predict power in the syn-
thesized design. Table 4.2 summarizes the post-synthesis results for a single ring node. Although a
0.5 ns link latency was assumed for all of the inter-node links, I stress that the power and area numbers
here do not account for any link area/power and strictly represent only Design Compiler’s post-synthesis
estimates. Using RTL simulation activity factors instead of Design Compiler’s default
reduces the power estimate from nearly 100 mW to 19.22 mW. Although the SPECint benchmarks
use the ring cache relatively regularly, it is still only accessed in sequential segments. Since
there are still significant portions of parallel code, the ring cache is often dormant.
Table 4.2: Synthesis results for a single reference ring node.
Metric           Value
Area             0.272 sq mm
Dynamic Power    19.22 mW
Leakage Power    3.3 mW
Max Frequency    1.11 GHz
Critical Path
The post-synthesis timing report exposes two primary critical paths in the ring cache design. The
first path starts in the network receive buffers for the primary data/signal network, for a bundle
of stores/signals that may be sent to the subsequent ring node during this cycle. From there, the
path continues through the logic that decides 1) whether a new store/signal from the core can be
added to the aforementioned network bundle and 2) whether any circulating stores/signals have
completed a full cycle and must therefore be removed. Finally, the critical path extends over the
outgoing network link to the next core. Since link propagation accounts for 0.5 ns of the path, the
routing logic accounts for only a small portion of the total path. It is beneficial that link propagation
can happen in parallel with writing to the memory, since the next longest paths involve the memory.
Specifically, they start at the data/signal network receive buffers as before. But then the paths change
and instead go to the memory array to perform a tag lookup and prepare a cache array write for the
next clock edge.
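As a rough check on these numbers, and assuming that the clock period at the maximum frequency in Table 4.2 equals the delay of the first critical path, the time left for the receive buffers and routing logic can be estimated as follows. The symbols are introduced here purely for illustration.

```latex
T_{\text{clk}} = \frac{1}{f_{\max}} = \frac{1}{1.11\ \text{GHz}} \approx 0.90\ \text{ns},
\qquad
T_{\text{logic}} \approx T_{\text{clk}} - T_{\text{link}}
               = 0.90\ \text{ns} - 0.5\ \text{ns} = 0.40\ \text{ns}
```

Under that assumption, the assumed link latency alone accounts for a little over half of the path.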
Area
Figure 4.6a depicts the area usage in the reference design. The cache array is marginally smaller than
the signal buffer. Although perhaps unexpected, this follows from the fact that the reference design
uses a total of 128 signal IDs. Since each signal ID requires 2 storage bits per core, the total
number of registers per signal buffer is 2 * 16 * 128 = 4096 bits. The area of the memory is somewhat
larger than it could otherwise be, since the reference design uses a register array in lieu of an SRAM
to minimize access latency. If the area is prohibitively large, it could be very beneficial to experiment
with using an SRAM (which would necessitate adjusting the FSMs in the memory and array modules)
and reducing the number of signal IDs.

Figure 4.6: Power and area for a single ring node. The forwarding network includes all logic to route data/signals
between nodes. The request/reply networks include all of the logic necessary to integrate the ring cache with the
rest of the normal memory hierarchy.
(a) Ring Node Area: Cache Array 47.4%, Signal Buffer 48.7%, Other Memory Module Logic 2.0%,
Forwarding Network 1.5%, Request/Reply Networks 0.5%.
(b) Ring Node Dynamic Power: Cache Array 9.7%, Signal Buffer 87.9%, Other Memory Module Logic 1.0%,
Forwarding Network 1.0%, Request/Reply Networks 0.4%.
Power
Figure 4.6b shows the dynamic power breakdown. Although the cache array constitutes a large
portion of the area, it constitutes a relatively smaller proportion of the power consumption. The
signal bandwidth (5 signals per cycle) was set at this high level relative to the store bandwidth (1 store
per cycle) because the large number of empty sequential segments tends to produce far more signals
than shared data. As a result, the signal buffer is utilized far more frequently than the cache array.
4.3.1 Signal Buffer Parameter Sweeps
The signal buffer has several parameters that potentially have a large impact on system performance
and ring node area. These include the total number of possible signal IDs, the amount of signal
bandwidth, the amount of allowed synchronization epoch decoupling, and the number of cores in
the system.
Figure 4.7: Total ring node area as total signal ID capacity is swept from 8 to 512. [Bar chart: normalized ring node
area (axis from 0.0 to 2.5) for 8, 16, 32, 64, 128, 256, and 512 signals.]
Number Of Signal IDs
The compiler has control over the maximum number of sequential segments it will produce in any
given loop. Maximum performance can be achieved when the compiler has total flexibility to cre-
ate as many sequential segments as it would like. If restricted, the compiler must combine multiple
sequential segments into one, which may have an impact on performance. However, the number
of signal IDs has a linear effect on the amount of signal buffer area, as each ring node must contain
signal tracker modules for each possible unique signal ID. In Chapter 3, the number of signal IDs
was unrestricted, and a maximum of approximately 128 signal IDs was required for most loops. Due
to current limitations in the compiler, the maximum number of signal IDs cannot be swept at this
time. I leave this analysis for future work. However, although I can’t examine this thoroughly, the
intuition I have gathered from examining the compiled code suggests that significantly fewer than
128 signals would be required to capture most or all of the performance. Some of the sequential seg-
ments are purely for edge cases (exceptions, error handling) that do not occur in normal operation
and are rarely if ever synchronized.
However, I was able to sweep the maximum number of signal IDs in the signal buffer to see how
the area changes. Figure 4.7 depicts the area of a ring node, normalized to the reference design, as the
number of signals is swept from 8 to 512. Given that the signal buffer was a significant fraction of
the total ring node area in the reference design, it is no surprise that the total ring node area increases
dramatically for the largest signal capacities.

Figure 4.8: Increasing signal bandwidth increases signal buffer and network buffer sizes. [Bar chart: normalized ring
node area (axis from 0.0 to 1.0) for signal bandwidths of 1, 2, 3, 4, and 5 signals per cycle.]
Signal Bandwidth
Figure 3.12c from Chapter 3 shows the importance of high signal bandwidth for achieving good
speedups on SPECint. Although it doesn’t have as drastic an effect on area as the number of signal
IDs, increasing the signal bandwidth does increase the area of the signal buffer. Removing the additional
multiplexers and logic needed to process multiple incoming signals results in a 28% decrease in total ring node area
between the reference design and one with a signal bandwidth of 1 per cycle, as seen in Figure 4.8.
Since the signal buffer is approximately 50% of the design area, this corresponds to about a 55% decrease
in signal buffer area. Although the data/signal network buffers are also halved, their overall
impact is very slight, given their size. Since the critical path involves these network buffers, the max-
imum achievable frequency increases slightly, by around 5%, as signal bandwidth decreases from 5
signals per cycle to 1 signal per cycle.
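The relation between the two percentages above can be checked with a rough proportion. This assumes, for illustration only, that the signal buffer is exactly half of the reference area A_ref and that nearly all of the saving comes from the signal buffer; the symbols are not from the original text.

```latex
\frac{\Delta A_{\text{total}}}{A_{\text{signal buffer}}}
  \approx \frac{0.28\, A_{\text{ref}}}{0.50\, A_{\text{ref}}} = 0.56
```

This upper estimate is consistent with the quoted figure of about 55%, since a small part of the 28% saving comes from the network buffers rather than the signal buffer.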
Amount of Synchronization Decoupling
The epoch bound parameter in the signal buffer dictates how many synchronization epochs can be
decoupled by cores. In Chapter 3, this parameter was essentially set to 2, with the optimization
described in Section 4.2. Increasing this parameter has the potential to increase speedup, but will also
increase the area consumed by the signal buffer, as more bits are needed to track the state of each
signal. Figure 4.9 shows the area impact for three values of the epoch bound: 1, 2, and 3. Note the
large impact of moving from 1 to 2. This happens because at an epoch bound value of 1, the signal
buffer can be simplified by having each core only track received signals from its immediate
predecessor and only send signals to its immediate successor. This optimization is possible because at an
epoch bound value of 1, cores are unable to decouple. This implies executing all sequential segments
strictly in loop iteration order, which removes the reason for having a core broadcast a signal to
every other core, as receiving a signal from the immediate predecessor guarantees that all previous
cores have already executed their older iterations. However, an epoch bound of 2 has a drastic positive
performance impact, as seen in the simulation results in Figure 4.10 (this consists of some of the same
data as Figure 3.10). In contrast, moving to an epoch bound of 3 has absolutely no benefit: there are
only a very few times in all of the combined SimPoint phases where any benchmarks decouple by
that amount. Unless other programs’ characteristics vary significantly from those of SPECint, it seems
pointless to use any epoch bound value other than 2.
Figure 4.9: Decoupling synchronization from one to two epochs increases area significantly, but also
increases performance. (Bar chart: normalized ring node area for epoch bounds of 1, 2, and 3.)
Figure 4.10: Decoupling synchronization up to two epochs increases speedups, but decoupling any further
has no effect. (Bar chart: program speedup for 164.gzip, 175.vpr, 197.parser, 300.twolf, 181.mcf,
256.bzip2, and the INT geomean at epoch bounds of 1, 2, and 3.)
Figure 4.11: Signal buffer size varies linearly with the number of supported cores, so decreasing the
number of cores has a large impact on the ring node area. (Bar chart: normalized ring node area for 16, 8,
4, and 2 supported cores.)
Number of Supported Cores
The signal buffer needs to track received signals from every other core in the design. Consequently,
the amount of state in the signal buffer varies linearly with the number of cores in the system, in
much the same way that the total number of signal IDs does. Figure 4.11 shows the drastic decrease
in ring node size when the number of cores is reduced from 16 to 2. The size of the signal buffer
decreases by a factor of 8. Intuitively, as the number of cores decreases, so does the achievable
speedup, as I previously showed in Figure 3.12a.
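The linear scaling can be illustrated with a toy model of signal buffer state. The sketch assumes one fixed-width state entry per (supported core, signal ID) pair, which reproduces the factor of 8 quoted above when going from 16 to 2 cores. The entry width and the default of 64 signal IDs are hypothetical placeholders, not values taken from the reference design.

```python
def signal_buffer_state_bits(num_cores, num_signal_ids=64, bits_per_entry=2):
    """Toy estimate: one state entry per (core, signal ID) pair.

    num_signal_ids and bits_per_entry are hypothetical defaults chosen
    only for illustration.
    """
    return num_cores * num_signal_ids * bits_per_entry

# Ratio of signal buffer state between a 16-core and a 2-core design.
ratio = signal_buffer_state_bits(16) / signal_buffer_state_bits(2)
```

Under this model the state also grows linearly with the number of signal IDs, which matches the comparison made in the text.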
4.4 Conclusion
A number of important engineering decisions must be made when designing the ring cache, many
of them highly dependent on the characteristics of the workloads being parallelized. Appendix A
contains the full details, schematics, control FSMs, and more for all of my ring cache design deci-
sions.
5 Future Directions for HELIX-RC
With the primary communication bottleneck of HELIX solved with the ring cache, there are a variety of
possibilities for future studies. In this chapter, I present some initial results from two such studies,
in addition to proposing additional work that could be performed with some modification to the
compiler. First, since the original evaluation of HELIX-RC involved only in-order cores, I examine
how much speedup is lost when switching to out-of-order cores. I conclude that although
the out-of-order cores reduce the amount of TLP that HELIX-RC can extract, performance does
not suffer greatly on more complex cores. Second, I compare traditional multiprogramming
parallelism with HELIX-RC automatic single program parallelism. Intuitively, one would suspect that
HELIX-RC parallelized programs would make worse use of cores than multiple independent program
copies, since HELIX-RC speedups don’t scale linearly with the number of cores. However, I
show that, counterintuitively, HELIX-RC can make better use of multiple cores than merely running
multiple copies of a program, since the burden on shared chip resources is reduced.
Figure 5.1: Single-threaded SPECint 2000 performance increases by around 2× when moving from in-order to
out-of-order cores. (Bar chart: relative single-thread performance of each benchmark, with INT, FP, and
overall geomeans, on 2-way in-order (IO), 2-way out-of-order (OOO), and 4-way OOO cores.)
5.1 HELIX-RC With Out-of-Order Cores
HELIX-RC successfully extracts TLP for in-order cores. Out-of-order cores also automatically ex-
tract parallelism, in the form of finer-grained instruction-level parallelism (ILP). It is important
to evaluate whether the ILP provided by modern complex cores eats away at TLP, reducing the
speedups obtained by HELIX-RC. I compared the speedups on a 2-way and a 4-way out-of-order
core to the speedup of the 2-way in-order Atom core. Although speedup relative to the single-threaded
execution on a particular core type is somewhat reduced (from 8.5× on the in-order core
to 7× on the 4-way out-of-order core), the overall performance of the parallelized code always
improves as the core becomes more capable.
In this section I present the characteristics of the single- and multi-threaded SPEC 2000
benchmarks on different core types. I describe why out-of-order speedups decrease, and I detail possible
solutions for restoring performance.
Figure 5.2: HELIX-RC speedup decreases from 8.5× to 7× on more complex cores, compared to their
single-threaded executions. (Bar chart: program speedup of each benchmark, with INT, FP, and overall
geomeans, on 2-way IO, 2-way OOO, and 4-way OOO cores.)
5.1.1 Out-of-Order Execution
For the purpose of comparison, I consider three different core types: a 2-way in-order Intel Atom, a
2-way out-of-order core, and a 4-way out-of-order core. The architectures of the out-of-order cores
roughly correspond to a modern Intel Atom and an Intel Nehalem core, respectively. Since we are
primarily concerned with the consequences of changing architecture, the core frequencies remain
fixed for all three core types. The same 8MB last-level cache from the default 16-core configuration
is used, as is the 4 memory controller DDR3 memory configuration.
Figure 5.1 shows the relative single-threaded performance of the SPECint 2000 benchmarks I
used for the three core types. The performance increase from the 2-way in-order to the 2-way out-of-
order core is approximately 60%. Increasing the out-of-order commit width from 2 to 4 boosts the
performance by an additional 45%. The floating point benchmarks generally see more of a perfor-
mance improvement, as their memory accesses and control flow are more regular and predictable.
Figure 5.3: Absolute parallelized program performance always increases for HELIX-RC when moving from a
simple 2-way in-order core to a 4-way out-of-order core: as much as 2.6× for 188.ammp and as little as
1.2× for 183.equake. (Bar chart: parallelized performance of each benchmark relative to the 2-way IO
core, with INT, FP, and overall geomeans, on 2-way IO, 2-way OOO, and 4-way OOO cores.)
Although the out-of-order cores can extract more ILP for the same workloads, they also have a
reduced HELIX-RC speedup for most of the benchmarks on 16 cores. Figure 5.2 shows that generally,
as the core architecture becomes more capable, the speedup of both the integer and floating point
benchmarks decreases (only 256.bzip2 and 175.vpr have negligible speedup differences between the
in-order and out-of-order cores). The speedups here are relative to the single-threaded execution on
the same core type. For example, on the 2-way in-order core, the parallelized version of 197.parser
is approximately 7× faster than the single-threaded execution on the same core, whereas the
parallelized version of the code is only 5.9× faster on the 4-way out-of-order core than the single-threaded
execution on the same core.
While the HELIX-RC speedups change across core type, overall performance increases when a
more capable core is used. Figure 5.3 shows the performance of HELIX-RC parallelized code on
the three different core types relative to the performance on the 2-way in-order core. Regarding the
absolute performance of the multithreaded code, the 4-way out-of-order core always has higher
performance than the less capable cores. In some cases, the multithreaded performance scales almost
as well as the single-threaded performance (around 2×). However, in some cases, the performance
is barely higher than that on the 2-way in-order core; for example, there is only a 20% improvement
for 183.equake. While it is encouraging that the significant number of transformations performed
by HCCv3 never reduces the extractable ILP for more complex cores, it is important to understand
why the multithreaded performance scales worse than single-threaded performance and whether
anything can be done to improve it.
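The relationship between per-core-type speedup and absolute performance can be made concrete with the aggregate numbers quoted in this section. The sketch below is only a back-of-the-envelope check: it combines the approximate single-thread gains from Section 5.1.1 (about 60%, then a further 45%) with the 8.5× and 7× speedups, which may not be averaged over exactly the same set of benchmarks.

```python
def absolute_parallel_gain(single_thread_gain, speedup_new, speedup_base):
    """Parallel performance on a new core relative to parallel performance
    on the baseline core, where each speedup is measured against
    single-threaded execution on its own core type."""
    return single_thread_gain * speedup_new / speedup_base

# 4-way out-of-order vs. 2-way in-order, using approximate figures from the text.
single_thread_gain_4way = 1.60 * 1.45      # roughly 2.3x faster single-threaded
gain_4way = absolute_parallel_gain(single_thread_gain_4way, 7.0, 8.5)
```

Under these assumptions the parallelized code is a little under 2× faster on the 4-way out-of-order core, even though its speedup over that core's own single-threaded run is smaller.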
5.1.2 Speedup Degradation in Out-of-Order Cores
There are several reasons why out-of-order cores have lower speedups than in-order ones. Next,
I will describe the sources of reduced performance that account for all of the speedup gaps between
the in-order and out-of-order cores. This analysis is useful for guiding parallelization techniques
that might target even more complex cores in the future. The sources of reduced performance are as
follows:
• low ILP sequential segments limit the overall multithreaded performance (164.gzip);
• there is a suboptimal selection of which loops to parallelize (197.parser);
• small loop invocations fail to extract ILP (300.twolf);
• there are memory-bound regions of both single- and multithreaded versions of the code,
resulting in Amdahl-bound performance improvement (181.mcf); and
• L1 spatial locality and memory predictability are destroyed in the multithreaded code by
HELIX-RC’s distribution of loop iterations across cores (177.mesa, 183.equake, 179.art,
188.ammp).
Figure 5.4: A single sequential segment is the speedup bottleneck for a loop in 164.gzip (16 iterations of an
actual execution trace shown). Out-of-order cores are unable to extract as much ILP from the read-after-write
chain in the sequential segment, compared to normal code. Even though execution time decreases, speedups are
still lower than those obtained with in-order cores. (Timeline panels for the 2-way in-order, 2-way
out-of-order, and 4-way out-of-order cores, one bar per core ID 0 to 7; legend: parallel code, sequential
segment, waiting for signal, unblocking signals.)
Low ILP in a Sequential Segment
To obtain high speedups, parallelized loops need sufficient parallel code or overlapping sequential
segments such that cores are rarely waiting for computation on other cores to finish. While signifi-
cant portions of computation can be overlapped in many loops, this cannot be done for some loops;
this results in the dependence waiting overhead shown earlier in Figure 3.13, which severely limits the
speedup of 164.gzip. The loops that limit the performance of 164.gzip on the in-order core are also
responsible for the even worse speedup on the out-of-order cores. Most of these 164.gzip loops have
the characteristic that there is only a single relevant sequential segment followed by a small amount
of parallel code per iteration. The execution time of these loops is set by the time it takes to execute
the sequential segment and communicate the signal to the next core, since there is not enough par-
allel code to cover the waiting time. Figure 5.4 plots the execution time of 16 iterations of one such
loop on each of the three core types. There are 8 vertical bars per core type, one for each simulated
core. Each bar represents the state of a core over time. At any point in time, a core can be executing
a sequential segment, executing parallel code, or waiting to enter a sequential segment. Figure 5.4
shows that each core has to wait to enter the sequential segment each time it is executed. This
produces an overall speedup of around 3× on 8 or more in-order cores.
This sequential segment bottleneck for 164.gzip persists regardless of the core type; the reason for
the speedup discrepancy across core types is that out-of-order execution accelerates vanilla
single-threaded execution more than a sequential segment. For the single-threaded execution of the loop
shown in Figure 5.4, the 2-way and 4-way out-of-order cores have a speedup of 1.5× and 2×,
respectively, over the in-order core. In contrast, the sequential segment has only a 1.4× and a 1.5× speedup
for these cores, respectively, and therefore the parallel version of the loop only speeds up by those
factors. This is because ILP, which is what out-of-order cores take advantage of, is lower in a
sequential segment than in the corresponding original sequential code, as a result of how the HCCv3
compiler generates sequential segments. A sequential segment is often the critical path of execution.
Consequently, HCCv3 inserts only the code required to unblock execution of the next core. This re-
sults in code that is a chain or tree of read-after-write–dependent instructions—the minimal amount
required to calculate the loop-carried dependence. Notice that the latency of executing this depen-
dent code is not drastically improved by a more complex core, since highly dependent code does not
benefit as much from multiple issue or out-of-order execution. In contrast, the single-threaded
version of the code is not repeatedly blocked by wait instructions that delimit a sequential segment and
therefore does not pay the penalty of executing the dependent code after a standstill. The latency
of the dependent code is no longer as important, since there are no subsequent cores to unblock, so
these instructions are executed concurrently with other instructions in the iteration. This character-
istic of sequential segments is intrinsic to the HELIX-RC execution model, since they represent the
critical path of execution.
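For a loop whose iteration time is dominated by a single sequential segment, the effect on speedup follows directly from the two acceleration factors quoted above. The sketch below assumes the parallel loop time scales only with the sequential segment latency (signal latency and parallel code are ignored), which is a simplification of the trace in Figure 5.4.

```python
def ooo_loop_speedup(in_order_speedup, single_thread_accel, seq_segment_accel):
    """Speedup of a sequential-segment-bound loop on a faster core.

    The single-threaded loop and the parallel loop both get faster, but by
    different factors, so the speedup changes by the ratio of the two.
    """
    return in_order_speedup * seq_segment_accel / single_thread_accel

# Figures from the text: ~3x speedup on in-order cores; single-threaded code
# accelerates by 1.5x / 2x, the sequential segment by only 1.4x / 1.5x.
speedup_2way_ooo = ooo_loop_speedup(3.0, 1.5, 1.4)
speedup_4way_ooo = ooo_loop_speedup(3.0, 2.0, 1.5)
```

With these inputs the loop speedup drops from about 3× to about 2.8× and 2.25× on the 2-way and 4-way out-of-order cores, even though the parallel loop itself runs faster in absolute terms.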
Suboptimal Loop Selection
The HCCv3 compiler selects the most promising loops to parallelize by examining their estimated
performance if parallelized. To estimate their parallel performance, HCCv3 includes a profiler to
capture the behavior of the ring cache; it profiles loops on representative inputs. During profiling,
instrumentation code emulates execution with the ring cache, resulting in an estimate of the time
saved by parallelization. Finally, HCCv3 uses a loop nesting graph annotated with the profiling re-
sults to choose the most promising loops.
The emulation significantly simplifies the ring cache and the rest of the platform to enable the
compiler to profile code with low overhead to limit the compilation time. This simplification results
in estimation errors that are higher for out-of-order cores; I observed that some loops of 197.parser
for which the compiler predicts marginal speedups are actually slower on the simulated 16-core plat-
form. When loop selection is improved to eliminate these misestimated loops, speedup increases—
in particular for 197.parser, and especially on the out-of-order cores.
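The selection step can be pictured as a choice over the loop nesting graph. The following is a minimal sketch and not HCCv3's actual algorithm: it assumes the nesting graph is a tree, that a loop and the loops nested inside it are not parallelized at the same time, and that each loop carries a profiled estimate of cycles saved (negative when parallelization is predicted to slow the loop down). The `Loop` class and all numbers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Loop:
    name: str
    saved_cycles: float                      # profiled estimate; may be <= 0
    children: List["Loop"] = field(default_factory=list)

def select_loops(loop: Loop) -> Tuple[float, List[str]]:
    """Return (total estimated savings, chosen loop names) for this subtree.

    Either this loop is parallelized, or the best choices among its nested
    loops are used, whichever is estimated to save more cycles.
    """
    child_total, child_choice = 0.0, []
    for child in loop.children:
        saved, names = select_loops(child)
        child_total += saved
        child_choice += names
    if loop.saved_cycles > child_total and loop.saved_cycles > 0:
        return loop.saved_cycles, [loop.name]
    return child_total, child_choice

# Hypothetical nest: parallelizing two inner loops beats the outer loop,
# and a loop with a negative estimate is never chosen.
nest = Loop("outer", 100.0, [Loop("inner_a", 80.0), Loop("inner_b", 50.0),
                             Loop("inner_c", -10.0)])
total_saved, chosen = select_loops(nest)
```

In this picture, a misestimate of the kind described for 197.parser corresponds to a positive `saved_cycles` value for a loop that is actually slower when parallelized, so the loop is selected when it should not be.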
Short Loop Invocations
HELIX-RC’s execution model requires between-loop synchronizations—which entail overhead—to
be performed before and after a parallelized loop is executed. Before a parallelized loop is executed,
the initial memory address where the parallelized code is stored needs to be propagated to all cores.
To do this, HELIX-RC relies on a special wait/signal pair that is executed just before jumping to
a parallelized loop. After a parallelized loop is executed, there is a memory barrier, because values
produced in a loop may be read outside the loop. This barrier is implemented with a sequence of
waits and signals that instruct each core to flush its local ring node memory before informing the
master core that it is safe to resume execution of the code outside the parallelized loop that has
just been executed.
As long as the execution time of each loop invocation is large compared to these synchroniza-
tions, their overhead can be ignored. But this is not the case for 300.twolf. Aggressive loop splitting,
which the compiler performed to improve the amount of parallelism extracted, also resulted in short
loop invocations for 300.twolf. Between-loop synchronizations therefore became significant, nearly
10% of the total cycles for one phase of 300.twolf. Additionally, there is so little work to do per core
that the out-of-order cores are unable to extract meaningful ILP. In the single-threaded case, the
more complex cores are able to extract significant ILP between loop invocations; hence, the relative
speedup on the more complex cores decreases.
Unfortunately, even when the compiler takes this overhead into account while compiling the par-
allel code, it makes identical loop selection decisions since it is unable to find any loops with better
performance. A possible solution to this problem is to reduce the between-loop synchronization
overhead by, for example, adding speculation support to allow cores to start executing the code out-
side a parallelized loop sooner.
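A simple cost model shows why short invocations are hurt by this fixed cost. The cycle counts below are hypothetical, chosen only so that the short-invocation case lands at the roughly 10% overhead reported for one phase of 300.twolf.

```python
def sync_overhead_fraction(work_cycles_per_invocation, sync_cycles_per_invocation):
    """Fraction of total cycles spent in between-loop synchronization,
    assuming every invocation pays the same fixed entry/exit cost."""
    total = work_cycles_per_invocation + sync_cycles_per_invocation
    return sync_cycles_per_invocation / total

# Hypothetical: 200 cycles of entry/exit synchronization per invocation.
long_invocation = sync_overhead_fraction(100_000, 200)   # overhead negligible
short_invocation = sync_overhead_fraction(1_800, 200)    # overhead is 10%
```

Because the synchronization cost is paid once per invocation, splitting a loop into many short invocations raises the overhead fraction even when the total amount of work is unchanged.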
Memory-Bound Workload
181.mcf is relatively memory bound compared to the other SPEC benchmarks [29], as it traverses
memory in an unpredictable fashion. Although nearly all of the per-loop speedups on the out-
of-order cores are equal to those on the in-order cores, the overall speedups on the former still de-
crease. This is due to the several phases of 181.mcf that are memory bound to the point that even
the single-threaded execution time is nearly identical on all of the examined core types. The result
is an Amdahl-bound performance improvement—even though the out-of-order cores are able to
accelerate parts of the benchmark, the memory-bound phases limit the overall speedup. Moreover,
since the memory-bound region now accounts for a larger portion of the out-of-order cores’ execu-
tion time, speedup decreases compared to the in-order core speedup. Since this is a characteristic of
the benchmark, there is not much the compiler can do to compensate.
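The Amdahl-style argument can be sketched numerically. The example assumes a hypothetical benchmark whose in-order single-threaded time is 40% memory bound, that the memory-bound part is neither parallelized nor accelerated by the more complex core, and that the remaining part achieves the same per-loop speedup on both core types. None of these numbers are measurements of 181.mcf.

```python
def helix_speedup(mem_fraction, loop_speedup, core_accel=1.0):
    """Whole-program speedup over single-threaded execution on the same core.

    mem_fraction: share of in-order single-threaded time that is memory bound
                  (assumed unaffected by core type and by parallelization).
    loop_speedup: per-loop speedup of the remaining, compute-bound part.
    core_accel:   how much faster the core runs the compute-bound part.
    """
    compute = (1.0 - mem_fraction) / core_accel
    single_threaded = mem_fraction + compute
    parallel = mem_fraction + compute / loop_speedup
    return single_threaded / parallel

in_order = helix_speedup(0.4, 8.0)                       # about 2.1
out_of_order = helix_speedup(0.4, 8.0, core_accel=2.0)   # about 1.6
```

The per-loop speedup is identical in both calls, yet the overall speedup falls on the faster core because the memory-bound share of its execution time is larger.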
Disrupted Memory Access Predictability and Spatial Locality
The limitation in the speedup for the floating point benchmarks is a result of their worse L1 spatial
locality and fewer L1 prefetch successes. This was the case for all of the examined core types and is
represented in the memory bar in Figure 3.13. This was also observed for HELIX-UP [10], where
significant L1 locality was lost by HELIX loop parallelization. The source of this lost locality is the
distribution of iterations across cores. Imagine an array whose ith element is accessed on the ith
iteration of a loop. In the single-threaded case, the L1 will have good spatial locality, as some itera-
tions will be L1 hits by virtue of accessing the same cache line as a previous iteration. With HELIX,
however, a core does not execute subsequent iterations—if core 0 executes iteration 0 on a 16-core
system, the next iteration it will execute will be iteration 16. Depending on the size and the type of
the array, the (i+16)th element will likely not be on the same cache line as the ith element, leading to
a cache miss. In addition, the larger strides between accesses make it difficult for the prefetcher to
detect subsequent accesses, as it is now more likely that subsequent accesses from a single core will
cross page boundaries. The simulated prefetcher operates on physical addresses and cannot prefetch
across a page boundary. A stride that was large to begin with in the single-threaded case will become 16×
as large in the case of HELIX and therefore that much more likely to cross a page boundary.
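The array example can be checked with a small simulation. It assumes 64-byte cache lines, 4-byte elements, an array that starts on a line boundary, and round-robin assignment of iterations to cores (iteration i runs on core i mod N). It counts only whether a core's access falls on the same line as that core's previous access, ignoring cache capacity and prefetching.

```python
def same_line_hit_rate(num_iterations, num_cores, elem_bytes=4, line_bytes=64):
    """Fraction of accesses that touch the same cache line as the previous
    access made by the same core, with iteration i accessing element i."""
    last_line = {}                     # core id -> last cache line touched
    hits = 0
    for i in range(num_iterations):
        core = i % num_cores           # round-robin iteration distribution
        line = (i * elem_bytes) // line_bytes
        if last_line.get(core) == line:
            hits += 1
        last_line[core] = line
    return hits / num_iterations

single_threaded = same_line_hit_rate(1600, 1)    # 15 of every 16 accesses hit
helix_16_cores = same_line_hit_rate(1600, 16)    # 64-byte stride: no hits
```

With these parameters each core's stride equals the line size once iterations are spread over 16 cores, so every access by a core lands on a new line. Larger elements or fewer cores would change the exact fractions but not the direction of the effect.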
The lost locality affects parallel performance for all of the examined core types. However, the
effect of the lost locality is to make the benchmarks more memory bound. As was the case with
181.mcf, the more memory bound the benchmarks become, the less utility there is from the more
complex cores. Unlike the case of 181.mcf, however, the increase in memory boundedness only
affects the parallel version of the code, so the difference in speedup among the three core types is more
significant. The more complex cores can execute the single-threaded versions of the code much faster,
since locality is not lost, and the prefetcher can take advantage of the fairly regular access patterns in
these benchmarks. This makes the difference between the single-threaded performance and parallel
performance on the out-of-order cores more pronounced, resulting in the lower speedups.
While HELIX-RC’s loop transformations only hurt out-of-order speedup somewhat, they may
hurt it more for even more complex cores. The compiler may need to be modified to take differ-
ent architectures into better account. Future work must look more carefully at the implications of
different architectures, to ensure that HELIX-RC is robust across them.
5.2 HELIX-RC vs. Multiprogram Parallelism
HELIX-RC increases program performance by using otherwise idle cores. Another way to increase
utilization of idle cores is to run multiple programs concurrently. One might suspect that since
HELIX-RC does not scale perfectly linearly with the number of cores, it cannot compete with
pure multiple-program parallelism in terms of extracting raw computing throughput. I will show
that, counterintuitively, the automatically parallelized version of a benchmark often makes far better
use of core resources, not only in terms of single-program performance but also in terms of overall
throughput.
Assuming resources are not shared between programs, 16 copies of a program on 16 cores should
achieve a throughput that is 16 times that of a single copy. But on modern processors, multiple
resources are shared between cores, such as DRAM bandwidth, last-level cache (LLC) storage,
prefetching hardware, and on-chip network bandwidth. All of these resources can degrade ideal
performance if overburdened. Looking at LLC contention, for example, a single 16-threaded
HELIX-RC process will only have the working set footprint of one single-threaded process, whereas
16 copies of the program will have a working set footprint that is 16× larger. In such situations,
HELIX-RC excels by greatly reducing the burden on any limited resources whose use scales with
the number of running programs.
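The footprint argument amounts to comparing the combined working sets with the LLC capacity. The sketch assumes that all threads of one HELIX-RC process share a single working set and that capacity is the only constraint. The 4 MB working set is the approximate value given for 179.art in Section 5.2.1; the function names are illustrative.

```python
def combined_footprint_mb(working_set_mb, num_processes):
    """Total LLC demand when each process brings its own working set."""
    return working_set_mb * num_processes

def fits_in_llc(working_set_mb, num_processes, llc_mb):
    """True if the combined working sets fit within the LLC capacity."""
    return combined_footprint_mb(working_set_mb, num_processes) <= llc_mb

# 16 cores can host 1, 2, 4, or 8 HELIX-RC processes, or 16 single-threaded ones.
configs = [1, 2, 4, 8, 16]
fitting_8mb = [n for n in configs if fits_in_llc(4.0, n, 8.0)]
fitting_32mb = [n for n in configs if fits_in_llc(4.0, n, 32.0)]
```

For a 4 MB working set, only the one- and two-process configurations fit in an 8 MB LLC, and 16 single-threaded copies do not fit even in 32 MB.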
This section shows that running one or more multithreaded HELIX-RC processes often im-
proves both program latency (i.e., execution time) and throughput if shared hardware resource
contention is considered, compared to running multiple single-threaded program copies. Alter-
natively, for a fixed throughput target, running HELIX-RC processes can reduce the amount of
shared resources that are required. In other cases, where HELIX-RC speedup is not high enough to
compete with multiple programs on throughput, HELIX-RC compiled programs still present a
tradeoff between program latency and throughput that can be exploited for situations that may require a
balance between single program performance and total system throughput.
5.2.1 Experimental Setup
In the SPEC 2000 workloads running on the default 16-core system, I found that the LLC is the
only shared resource that degrades the performance of multiprogram execution. The DRAM band-
width, on the other hand, is more than sufficient for the SPEC 2000 benchmarks, even when 16
copies of a benchmark are running. As a result, the simulation experiments consider only con-
tention for the LLC.
Benchmark Classification
Depending on the working set size, a benchmark may or may not improve throughput by increasing
the number of processes (either HELIX-RC processes or vanilla single-threaded processes). There-
fore, I classify the benchmarks used into different categories in Table 5.1 based on working set size,
as obtained from [29] and the present study’s simulation results. First, 183.equake has a very large
working set of approximately 32 MB, comparable to the LLC sizes of modern server processors.
Next, 179.art and 181.mcf have large working sets (~4MB), but not so large as to completely fill a
modern LLC. The rest of the benchmarks have decreasing working sets ranging from 1MB down
to less than 1/8th of a MB. For the sake of space and simulation time, I will explore one benchmark
from each of the five categories.
Table 5.1: Working set sizes for SPEC 2000.
Category     Working Set Per Process   Benchmarks
Very large   32 MB                     183.equake
Large        3-4 MB                    179.art, 181.mcf
Medium       0.5-1.0 MB                188.ammp, 300.twolf, 175.vpr
Small        < 0.5 MB                  197.parser, 256.bzip2
Negligible   < 0.15 MB                 164.gzip, 177.mesa
Simulation Results
Simulations were run on the default 16 in-order core, 4 DDR3 memory channel system. The LLC
size was swept from 4MB to 32 MB. The original default size, 8 MB, is typical of modern desktop
processors, while the largest size, 32 MB, is typical of modern high-end server processors. Only one
benchmark at a time was considered.
In addition to considering a single 16-threaded HELIX-RC process versus 16 single-threaded
processes, we can also evaluate the performance of running multiple copies of 8-, 4-, or 2-threaded
HELIX-RC processes. Consider the HELIX-RC speedups for different core counts in Figure 3.12a,
which shows better scaling for lower core counts. For example, on two cores, most of the SPECint
benchmarks achieve nearly a 2× speedup. On four cores, most benchmarks achieve over a 3× speedup.
As the core count continues to increase, the amount of speedup gained becomes less linear with the
number of cores. This presents an opportunity to get a closer-to-ideal throughput improvement by
running multiple copies of fewer HELIX-RC threads, at the expense of decreasing the speedup (i.e.,
increasing the latency) of any individual copy.
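The tradeoff can be expressed with two ratios. In the sketch below, throughput and latency are normalized to one single-threaded process, contention for shared resources is ignored, and the per-thread-count speedups are hypothetical values that loosely follow the trend described above (nearly 2× on two cores, over 3× on four). They are not data from Figure 3.12a.

```python
TOTAL_CORES = 16

# Hypothetical HELIX-RC speedups by threads per process (1 = single-threaded).
speedup_by_threads = {1: 1.0, 2: 1.9, 4: 3.2, 8: 5.0, 16: 7.0}

def throughput_and_latency(threads_per_process):
    """Return (normalized throughput, normalized latency) when the 16 cores
    are split evenly among identical processes, ignoring contention."""
    copies = TOTAL_CORES // threads_per_process
    speedup = speedup_by_threads[threads_per_process]
    return copies * speedup, 1.0 / speedup

tradeoff = {t: throughput_and_latency(t) for t in speedup_by_threads}
```

Without contention, fewer threads per process always gives higher throughput and higher latency in this model. The experiments in this section ask how that ordering changes once the LLC is shared.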
When multiple HELIX-RC processes are run at the same time, the 16 cores in the system are
statically allocated such that each process has an exclusive set of cores. The ring cache was modified to
include address space IDs for memory locations and signals. The ring topology was left unchanged:
a single unidirectional ring connects all the cores. This increases the perceived core-to-core
communication latency when multiple HELIX-RC processes are running simultaneously. The total system
throughput and process latency (i.e., the maximum execution time of any process) were collected.
The throughput and latency are normalized to the case of a single thread running on the default
system (8 MB LLC).
Figure 5.5: 183.equake has a 32 MB working set, which results in better performance when fewer processes are
used, even on large cache sizes. Only a single 16-thread HELIX-RC process fits in the LLC. (Left panel:
normalized throughput versus LLC size in MB; right panel: normalized throughput versus normalized process
latency for 4, 8, 16, and 32 MB LLCs. Series: 1 HELIX 16-threaded, 2 HELIX 8-threaded, 4 HELIX 4-threaded,
8 HELIX 2-threaded, and 16 single-threaded processes.)
5.2.2 Evaluation
For benchmarks with large working sets, running multiple HELIX-RC programs sequentially in-
creases program throughput and lowers program latency compared to running multiple sequential
programs concurrently. This is because the much smaller working set of one multithreaded process
can make better use of a shared LLC, even if the HELIX-RC speedup does not scale linearly with
the number of cores. Finally, HELIX-RC enables different throughput/program latency tradeoff
points for benchmarks with smaller working sets. These counterintuitive results show that auto-
matically parallelized programs are often a superior option even in cases where abundant multiple-
program parallelism is available.
Figure 5.6: 179.art’s relatively large 4 MB working set means that only 8 processes can fit comfortably at the
largest cache size. Depending on the cache size, different HELIX-RC configurations are ideal. (Panels:
normalized throughput versus LLC size in MB, and normalized throughput versus normalized process latency,
for 1, 2, 4, and 8 HELIX-RC processes and 16 single-threaded processes.)
Figure 5.7: 188.ammp’s medium working set means that the pure multiprogramming case (16 single-threaded
processes) outperforms HELIX-RC at large cache sizes. (Panels: normalized throughput versus LLC size in MB,
and normalized throughput versus normalized process latency, for 1, 2, 4, and 8 HELIX-RC processes and 16
single-threaded processes.)
Large Working Set
Figure 5.5 shows the results for the benchmark with the largest working set, 183.equake. A single
HELIX-RC 16-thread process obtains lower program latency and higher overall system throughput than
any other option. The working set of a single 183.equake is approximately 32 MB, the largest eval-
uated cache size. For this reason, performance is flat across all configurations, except for a single
process running on 32 MB, where it just begins to fit in the LLC. All other configurations are lim-
ited by the LLC size, so performance decreases as the number of processes increases, even though
HELIX-RC scales better with the core count for those configurations.
The next largest working set benchmarks are 179.art and 181.mcf (4 MB); I chose the former to be
representative. Figure 5.6 shows that different configurations are better in terms of both throughput
and latency for different cache sizes. As the LLC size increases from 4 MB to 32 MB, the optimal
configuration moves from one HELIX-RC process using every core to four HELIX-RC processes
using four cores each. A larger number of processes begins to introduce too much contention for
the LLC.
Figure 5.8: HELIX-RC gives 197.parser a number of reasonable latency/throughput tradeoff possibilities.
(Panels: normalized throughput versus LLC size in MB, and normalized throughput versus normalized process
latency, for 1, 2, 4, and 8 HELIX-RC processes and 16 single-threaded processes.)
Medium Working Set
I use 188.ammp to represent medium-sized working set benchmarks. Figure 5.7 is the first example
where the non-HELIX-RC multiprogramming case is superior to every HELIX-RC configuration,
but only at the largest cache sizes. At an LLC size of 8 MB, a single HELIX-RC process matches the
throughput of 16 single-threaded copies, but with the benefit of much better program latency.
Small Working Set
Figure 5.8 shows 197.parser, and Figure 5.9 shows 164.gzip. For the former, only the 16-process
configuration is ever limited by cache size, whereas for the latter, none of the configurations are so
limited. For 164.gzip, HELIX-RC never outperforms the vanilla multiprogram case, even though multiple
smaller HELIX-RC processes can extract more throughput than one 16-core HELIX-RC process.
This is primarily the result of 164.gzip’s poor scaling beyond two cores: the compiler is unable to
extract enough parallelism to make good use of 16 cores. Even though HELIX-RC never excels in
terms of throughput, the various configurations still provide multiple points where there is a
tradeoff between program latency and throughput in scenarios where single program performance
matters. Similarly, maximum throughput is extracted from 197.parser, but only when the LLC exceeds
16 MB. Unlike 164.gzip, 197.parser scales better with the number of HELIX-RC threads, so a
number of competitive throughput/latency points are available. Additionally, 8 two-threaded
HELIX-RC processes with a 4 MB LLC can nearly match the throughput of 16 single-threaded processes
with a 16 MB cache, showing that HELIX-RC can enable better performance on more
resource-constrained systems.
Figure 5.9: The worst-scaling HELIX-RC benchmark, and also the most CPU bound, 164.gzip achieves better
throughput as the number of threads increases, at the expense of some latency. (Panels: normalized
throughput versus LLC size in MB, and normalized throughput versus normalized process latency, for 1, 2, 4,
and 8 HELIX-RC processes and 16 single-threaded processes.)
HELIX-RC provides an interesting opportunity for a tradeoff between program performance
and throughput in the presence of shared resources. This initial study shows that relieving pres-
sure on the LLC can produce a significant benefit. This is even more notable since the SPEC bench-
marks that are used are many years old; more recent programs may have even larger working sets and
might stress other shared resources, such as DRAM. Further work evaluating modern programs on
modern platforms is needed to fully understand the role that HELIX-RC could play in optimizing
throughput and performance in different scenarios (e.g., desktop, mobile, server) and under differ-
ent realistic resource constraints.
5.3 Potential HELIX-RC Research Opportunities
In this section, I briefly present some other promising lines of research with HELIX-RC.
5.3.1 Compiler Engineering Improvements
Due in part to the need for a very accurate memory dependence analysis, it could take one or more
days to compile a single benchmark. This slow compilation speed obviously makes it difficult to try
out new ideas. A thorough analysis of the time consumed by each compilation step could be useful
to target compiler speed improvements. One reason the memory dependence analysis is so slow is
that the algorithm iterates until it converges on a solution. I suspect (but do not know) that a large
portion of the memory dependence analysis execution time is spent improving the quality (i.e., the
accuracy) of the analysis by very little, thus taking a disproportionate amount of time. Large gains in
compilation speed could occur if just a small bit of accuracy were sacrificed. Figure 5.10 depicts the
expected tradeoff in dependence accuracy as a function of time spent on the analysis. Most of the
benefit may be gained in a short amount of time. It will be worth looking into how much
HELIX-RC speedups are affected when the analysis time is reduced.

Figure 5.10: The memory dependence analysis that HELIX relies on takes a long time to converge on a solution. I
suspect that the quality of the analysis may plateau relatively quickly, potentially allowing compilation time to be
reduced by ending the analysis early on. (Sketch: memory dependence analysis accuracy versus time spent on the
analysis, with a large initial accuracy increase followed by a plateau marked "Good enough accuracy?")

5.3.2 Compiler Sweeps
The evaluation of HELIX-RC in Chapter 3 depended on various heuristics to set many of the
compilation parameters. Many of these parameters have non-obvious “sweet spots,” and without
in-depth analysis, it is difficult to tell what their larger system effects are. For instance, the compiler
split sequential segments into multiple smaller sequential segments considering only the potential
parallelism among segments, and not the side effects of the extra wait and signal instructions that
were necessarily added to the code. Figure 5.11 shows a potential relationship between the number
of segments and overall speedups. It will not necessarily be easy for the compiler to determine
statically how much splitting it should do. Additionally, many transformations (such as loop unrolling,
loop splitting, method inlining) are performed early in the compilation process to make the code
more amenable to parallelization (e.g., by simplifying subsequent analysis/transformations).
However, as in the case of splitting sequential segments, the exact effects of these transformations are
difficult to predict. Excessive loop unrolling and method inlining, for example, can bloat the code size
and therefore increase instruction cache misses. Heuristics were used for these transformations, in part
due to the extremely long compilation time as discussed in the previous section. It was therefore not
feasible to sweep these parameters to understand their effects. If compilation speed is improved, there
will be an opportunity to take a methodical look at the various compilation parameters. By sweeping
them and exploring different heuristics, the compilation process can be made more intelligent,
which should improve HELIX-RC speedups.

Figure 5.11: By splitting the sequential region of a loop iteration into a large number of sequential segments, HELIX
potentially improves performance by allowing different segments to be executed in parallel. However, at some point
the overhead of the additional synchronization signals will overwhelm any benefit and will also require a larger signal
buffer hardware structure. (Sketch: HELIX-RC speedup versus increasing number of sequential segments, starting at
DOACROSS = 1 segment. Speedup first rises sharply because of segment overlap, then declines once the additional
synchronization instructions dominate any marginal benefit.)

5.3.3 Multiple-Loop Execution Model
In the evaluation section of Chapter 3, I showed that HELIX-RC speedups scale relatively well with
core count. However, speedups scaled more linearly at low core counts than at high core counts.
This is part of the reason why HELIX-RC was able to provide a better performance/throughput
tradeoff than normal multiprogramming in Section 5.2. However, this was
in the context of multiple HELIX-RC processes. Ideally, a single HELIX-RC process would be able
to run multiple loops in parallel, each using only a subset of the available cores, thus boosting single-
threaded performance more than using all the cores on a single loop. However, it is very difficult
to determine which loops can run in parallel. Although the memory dependence analysis within
a small loop can be very accurate, the accuracy of such an analysis between small loops is probably
much lower. Because the scope of the analysis is larger, there will likely be a large increase in appar-
ent dependences. Therefore, it is very likely that speculation will be required to run multiple loops
in parallel.
The hardware necessary for multiple loops may be difficult to create. Since values may need to
be forwarded between executing loops, a single ring network would not be sufficient. Instead, some
other topology, such as a 2D mesh network, could be used in a circuit-switched fashion. Multiple
arbitrary groups of rings could be set up and torn down dynamically with every loop invocation.
Special circuits between the rings could be used for inter-loop dependences. However, if speculation
is necessary, this will require significant hardware changes, as in TRIPS, Multiscalar, and Hydra, for
example.
6
Conclusion
Despite much effort expended in the past, automatic parallelization of irregular workloads remained
out of reach. This dissertation has proposed a compiler–architecture co-design that unlocks the
previously inaccessible parallelism possible for small loops. This robust solution, HELIX-RC,
produces speedups of 6.85× for hard-to-parallelize programs. Moreover, little additional hardware is
required, and there is a very well-defined, simple interface between it and a core’s existing memory
hierarchy. Through the addition of this hardware, automatically parallelized irregular programs can
make much better use of multiple cores, in terms of performance and throughput, than even the
“easy” parallelism provided by multiprogramming.
A
Ring Cache Technical Report
A.1 Introduction
In ISCA 2014 [9], we proposed HELIX-RC, a compiler–architecture co-design for automatic
parallelization of irregular programs. The combination of an automatically parallelizing compiler and
a custom-designed piece of hardware logic demonstrated a nearly 6.85× speedup on unmodified,
highly irregular SPECint 2000 benchmarks. The success of HELIX-RC emanates from a new piece
of hardware, the ring cache, which helps overcome the primary bottlenecks in HELIX-style
parallelization: data communication latency and sequential forwarding synchronization chains. Without
ring cache, the HELIX compiler was limited to producing around a 2× speedup on commodity
multicore chips [11].
In the HELIX-RC paper, the technique was evaluated on a C/C++ based x86 simulator called
XIOSIM [33], with the ring cache similarly modeled in C++. While every effort was made to model
the ring cache at a cycle-accurate level, high-level languages are not the best fit for expressing hard-
ware operations. To increase confidence in the ring cache, this report presents a fully tested and
synthesizable Verilog reference design, as well as fully detailed explanations of its implementation
and our design decisions. Due to the constrained length of conference proceedings, the HELIX-
RC paper only explored the high-level details of the ring cache. In contrast, the goal of the present
report is to fully flesh out any missed details so that our results can be replicated, and so that other
researchers, with the aid of our Verilog, can explore and evaluate the ring cache in FPGAs or even
silicon.
Before proceeding, the reader should be familiar with the HELIX-RC paper, in order to have
an appreciation of HELIX-style parallelization and at least a high-level overview of how ring cache
operates. The rest of this report is organized as follows. First, a brief summary of HELIX-style par-
allelization is presented. Next, a high-level overview of ring cache and its components is given, to
establish terminology and to place more detailed explanations in context. Then implementation de-
tails and design decisions for these same core components are presented, with datapath schematics,
control FSMs, and timing examples provided where appropriate. Finally, we describe some
preliminary synthesis results for our reference design. The full Verilog implementation is included as a
second appendix.
A.2 Background
While it is assumed that the reader of this report has previously read the HELIX-RC paper [9], this
section provides a limited background for HELIX and ring cache. This review of certain aspects
of HELIX-RC serves as a reference point for descriptions and design decisions discussed later in
the implementation section of this report. For the full background for HELIX, the original CGO
paper [11] and subsequent ISCA paper [9] should be consulted. In this section, we describe the basic
HELIX execution model and the method through which HELIX extracts parallelism from single-
threaded code. Further, we highlight the main performance bottlenecks that HELIX suffers on a
commodity multicore system, mostly data communication latency and limitations resulting from
sequential forwarding synchronization chains. Finally, we describe how ring cache alleviates these
bottlenecks by decoupling data forwarding from data generation and decoupling signal forwarding
from synchronization. These concepts set the framework for understanding the motivation behind
many of our ring cache design decisions.
A.2.1 HELIX Execution Model
HELIX achieves a speedup of single-threaded code by automatically parallelizing loops that run on
multiple cores of a single chip. The compiler performs extensive memory dependence analysis and
code transformations to determine which loops it should parallelize in a program, and it chooses
these loops at compile time. HELIX executes loops in parallel by assigning loop iterations to cores,
with subsequent iterations of a loop assigned to subsequent cores. Specifically, in an N-core system,
core i first executes iteration i mod N, then iteration N + (i mod N), then iteration 2N + (i mod N),
and so forth. This mapping produces a unidirectional flow of shared data and synchronization
signals from past iterations to future iterations in a logical ring of cores. Figure A.1 depicts an execu-
tion timeline for HELIX parallelized code. During a program’s execution, a single core is assigned to
be the master core, which executes all code outside of the chosen parallelized loops. When this core
reaches the start of a parallelized loop, it indicates to all other participating cores that a loop is about
to begin. All cores, including the master core, then jump to the parallel loop. When the loop has fin-
ished executing, a memory barrier is executed to ensure that any memory addresses written during
the loop are properly visible to every other core. After this memory barrier has been executed, the
master core is free to resume executing code between parallel loops, with the knowledge that it is safe
to read or write any memory address.

Figure A.1: A single core executes the code between parallelized loops, and before executing a parallelized loop, it
also instructs other cores to jump to the same loop. (Timeline for three cores: the master core, core 0, executes
single-threaded code while cores 1 and 2 idle; at a new parallel loop all three cores jump to the loop, with core 0
executing iterations 0, 3, 6, ..., core 1 executing iterations 1, 4, 7, ..., and core 2 executing iterations 2, 5, 8, ...;
after every core finishes the loop and the memory barrier executes, the master resumes single-threaded code.)

Figure A.2: A simplified example of the static code of a loop in 175.vpr containing a single sequential segment. The
right branch of the if-statement must still be synchronized, even though no shared data is accessed. (Loop body:
A(), then an if on COND. The left branch executes Wait ID1, Load X, X = f(X), Store X, Signal ID1; the right branch
executes only Wait ID1, Signal ID1. Both branches belong to sequential segment 1; B() is parallel code.)

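The round-robin assignment of iterations to cores described in this section can be illustrated with a few lines of Python. This is only a sketch for intuition, not part of the HELIX compiler or runtime; the function names `iterations_for_core` and `core_for_iteration` are hypothetical.

```python
def iterations_for_core(core, num_cores, num_iterations):
    """Iterations executed by `core` under round-robin assignment.

    Core i runs iterations i, N + i, 2N + i, ... (hypothetical helper).
    """
    return list(range(core % num_cores, num_iterations, num_cores))


def core_for_iteration(iteration, num_cores):
    """Core on which a given loop iteration runs (hypothetical helper)."""
    return iteration % num_cores


if __name__ == "__main__":
    # Three cores, as in the timeline of Figure A.1.
    for core in range(3):
        print(core, iterations_for_core(core, 3, 9))
```

With three cores, core 0 receives iterations 0, 3, 6, which matches the timeline in Figure A.1. Because iteration k + 1 always runs on the core that follows the one running iteration k, shared data and signals flow in a single direction around the logical ring.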
A.2.2 Parallel Code
Figure A.2 represents the static code of a loop after parallelization. This loop is roughly representa-
tive of a loop in 175.vpr, one of the benchmarks we evaluated. After in-depth memory dependence
analysis, HELIX marks the regions of a loop body that access data that is shared across loop
iterations (or, at least, data it cannot prove is not shared). We call these regions sequential segments. At
the beginning of each sequential segment, HELIX inserts a wait operation that contains a partic-
ular sequential segment ID. At the end of each sequential segment, HELIX inserts a signal
operation with the same ID. A wait operation prevents a core from entering a sequential segment
with that specific segment ID until the corresponding signal has been received from the previous
iteration of the loop, which is running on a different core. Therefore, each segment is executed in
loop iteration order. This creates a sequential chain of synchronization, as each core must explicitly
unblock the iteration running on the next core.
Once a core enters a sequential segment, the loads and stores it performs may involve shared data.
It is unknown at compile time which specific addresses will be shared and which other iterations will
access those addresses; it is only known that any accesses in that segment might be to shared data.
Only code that might access shared data is placed inside a sequential segment—if the compiler de-
termines that some code will never access shared data, it is not placed within a sequential segment
and therefore can run in parallel across all iterations. Figure A.2 depicts a loop body with only one
sequential segment. If the compiler is 100% confident that certain accesses are independent from
others, it will split the sequential segment into multiple smaller segments, which can run in parallel
with one another. Even though shared data may be accessed in these multiple segments, the com-
piler has determined that each segment will access unique sets of shared data.
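The wait/signal discipline of a single sequential segment can be mimicked in software with threads. The sketch below is an analogy only: it assumes one `threading.Event` per loop iteration as a stand-in for the signal with a given segment ID, and the names `run_loop` and `sequential_reference` are hypothetical. It does not reflect how HELIX or ring cache actually implement signals.

```python
import threading


def sequential_reference(num_iterations):
    """Result of running the loop body in plain sequential order."""
    x = 0
    for it in range(num_iterations):
        x = x * 2 + it  # order-dependent update, stands in for X = f(X)
    return x


def run_loop(num_cores, num_iterations):
    """Run the loop on `num_cores` threads with one sequential segment."""
    # Assumption: one Event per iteration models "signal ID1 received".
    left_segment = [threading.Event() for _ in range(num_iterations)]
    shared = {"x": 0}
    order = []

    def worker(core):
        # Round-robin assignment: core i runs iterations i, N + i, ...
        for it in range(core, num_iterations, num_cores):
            # Parallel code A() would run here, unsynchronized.
            if it > 0:
                left_segment[it - 1].wait()      # wait ID1
            shared["x"] = shared["x"] * 2 + it   # load X; X = f(X); store X
            order.append(it)
            left_segment[it].set()               # signal ID1
            # Parallel code B() would run here, unsynchronized.

    threads = [threading.Thread(target=worker, args=(c,)) for c in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order, shared["x"]


if __name__ == "__main__":
    order, x = run_loop(3, 12)
    print(order, x == sequential_reference(12))
```

Because each iteration waits for the signal of its predecessor, the segment bodies execute in loop iteration order even though the iterations are spread over several threads, so the final value of the shared variable equals that of the sequential loop.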
A.2.3 Decoupling Data Communication
To illustrate the data communication latency encountered by HELIX on a traditional multicore
chip, Figure A.3 (left) shows an execution timeline for a two-core system. At the start of execution,
core 0 has entered the sequential segment, while core 1 waits to enter. During the execution of the
sequential segment, core 0 stores a value to the address of variable X, whose cache line will be loaded
into its L1 cache. The core then leaves the sequential segment by issuing a signal to unblock core
1. After some communication latency, core 1 receives the signal and enters the sequential segment.
Subsequently, core 1 issues a load to the address of variable X. Because the recently written value of
X resides in core 0’s L1 cache, there is a cache coherence delay (75–210 cycles on modern Intel CPUs)
before core 1 receives the data. Since sequential segments are executed in loop iteration order, the
data transfer latency significantly increases the loop’s critical execution path. This data
communication cost greatly restricts the loops that the HELIX compiler can target, limiting it to select loops
that have few dependences. The cost is a direct result of the reactive nature of cache coherence
protocols: data is only moved when it is requested. This produces a coupling effect between the
communication of the shared data and its usage.

Figure A.3: Left: a reactive data communication mechanism results in core 1 stalling on a load. Right: the proactive
ring cache reduces the stall by sending the data as soon as it is produced. (Two-core execution timelines. On the left,
the new value of X is stored in core 0's L1 cache, so the load of X on core 1 misses locally and incurs a significant
remote-load latency. On the right, the data is pushed to core 1, so X is already available locally with no fetch penalty.)

Figure A.4: It is very hard to predict which core will need a particular piece of shared data (left). Different pieces of
shared data may also have very different numbers of consumers (right). As a result, it is very hard for a core to know
which addresses it should prefetch, and when.
Distance between producer and consumer (hops): 1: 12%, 2: 22%, 3: 39%, 4: 12%, 5: 9%, 6+: 6%.
Number of consumers (cores): 1: 16%, 2: 8%, 3: 21%, 4: 12%, 5: 34%, 6+: 9%.

One reasonable solution would be to aggressively prefetch data. In fact, the original HELIX
work [11] used Intel Hyperthreading to prefetch signals, thus reducing their perceived communi-
cation latency. When compiling for a traditional multicore processor, HELIX implements waits
and signals with vanilla loads and stores, respectively, to special memory locations. A special hy-
perthread running on each core constantly executes the load corresponding to the wait instruction
of the next sequential segment to be executed, so the signal from the predecessor core is effectively
pulled locally soon after it is generated, potentially before the subsequent core actually needs it.
Without such prefetching, the signal in Figure A.3 (left) would experience a reactive transfer delay
similar to the data load.

Figure A.5: Decoupling data and signal communication with a ring cache drastically improves speedups over vanilla
HELIX on a traditional multicore processor. (Bar chart of program speedup for 164.gzip, 175.vpr, 197.parser,
300.twolf, 181.mcf, 256.bzip2, and the INT geomean, with three bars each: HELIX without decoupling, decoupled
data communication, and decoupled data and signal communication.)

Prefetching of signals is only possible, however, because the sequence of sequential segments is
very predictable. In contrast, shared data is not predictable: variable X in one iteration of the loop
may be different than in the subsequent iteration. A good example of such behavior is a traversal of
a linked list/tree/graph, where each iteration of a loop follows one or more pointers before perform-
ing some work on the corresponding node(s). The memory addresses resulting from traversing such
a structure are very unpredictable and there would therefore be no feasible way to prefetch the nodes
at those addresses. Moreover, not only is the exact location of the shared data unpredictable, but the
number of different cores that access a piece of shared data is also unpredictable. Figure A.4 shows
the number of consumers of a given piece of shared data and the distance between the consumers in
a logical ring for the SPECint 2000 benchmarks we evaluated in [9]. Because of this unpredictabil-
ity, proper prefetching is a very difficult problem to solve.
Instead, HELIX-RC ameliorates data communication costs through the addition of ring cache.
Rather than reactive data communication, the ring cache facilitates proactive communication. As
soon as a piece of potentially shared data is produced, rather than being stored in the traditional
cache hierarchy, it is proactively distributed to the ring cache in every core throughout the system.
As a result, when core 1 tries to access the shared data, it is already present locally and thus there is no
data communication cost, as depicted in Figure A.3 (right). Unlike in typical cache coherence proto-
cols, the communication of the data has been decoupled from its consumption. Additionally, since
loads, stores, and signals sent to the ring cache do not need to incur any cache coherence protocol
overhead, they can execute and propagate very quickly, with low single-digit latencies. Figure A.5
shows that decoupling data communication more than doubles HELIX speedups compared
to using only a traditional cache coherence protocol. Part of this improvement is due to the
proactive nature of the ring cache, and part of it is due to the very fast communication.
A.2.4 Decoupling Signal Forwarding
While decoupling data transfer reduces the perceived data communication cost, there is another
decoupling opportunity that ring cache exploits. Consider Figure A.2 once again. The sequential
segment in this loop only sometimes accesses shared data, depending on the preceding outcome of
the “if” evaluation—we call the right branch of the “if” an empty sequential segment. Traditionally,
on a normal multicore processor, HELIX must still synchronize this sequential segment despite the
fact that it does not always access shared data. This requirement of executing sequential segments
in loop iteration order creates a sequential forwarding synchronization chain, even if it is unneces-
sary. In the three-core system of Figure A.6, core 1 unnecessarily waits for core 0 to finish executing
sequential segment 1 and send the corresponding signal before entering the sequential segment itself.
Ring cache breaks this synchronization chain by performing signal buffering. It does this by having
each core record how many sequential segments of a particular segment ID it has executed, relative
to every other core. This allows each core to correctly decide whether it must execute the wait
instruction of any particular empty sequential segment or whether it can skip it and continue with
execution.

Figure A.6: Core 1 needs to stall on the wait instruction even though the sequential segment it is executing lacks a
shared data access. This creates a sequential forwarding chain between the three cores.

Figure A.7: Core 0 simultaneously sends signals to both core 1 and core 2. Core 1 skips its wait instruction
because the sequential segment it is executing is empty and lacks any shared data accesses. By virtue of allowing early
signal transmission when synchronization is unnecessary (hence decoupling signal transmission from synchronization),
core 2 is unblocked sooner than it otherwise would be.

This does have the downside, however, of requiring cores to send signals not only to their neigh-
bor in the ring (the subsequent iteration), but to every other core in the ring. Likewise, all cores
must now track whether they’ve received signals from every other core, rather than just from their
predecessor (the previous iteration). This makes decoupling signal transmission prohibitively expen-
sive without ring cache, since the additional signals between cores would result in a large increase in
cache coherence traffic. With ring cache’s hardware signal-buffering capability, however, the ben-
efit outweighs the cost. Some loops benefit significantly, especially if the branch outcome before
such a segment changes frequently, as it does in the corresponding loop in 175.vpr. Figure A.7 de-
picts the result of breaking the synchronization chain. Despite the additional required signals, core
2 is unblocked sooner than it otherwise would be, getting a head start on executing the sequential
segment. To better illustrate the decoupling effect, this figure assumes that core 0 can send a signal
to both core 1 and core 2 in approximately the same amount of time—and in the ring cache, this
is close to being true, since the latency between cores is only one cycle, but the initial time to get
from the core to the ring cache may be longer. The important point is that there is no longer a sig-
nal chain connecting core 0 to core 1 to core 2. Instead, the presence of the ring cache and the empty
sequential segment allows core 0 and core 1 to send signals as soon as they correctly can, rather than
being tightly coupled in a forwarding chain. Figure A.5 summarizes the performance improvements
HELIX-RC obtains from decoupling both data and signal communication: decoupling both of
these increases HELIX speedups by nearly 3× compared to systems that lack ring cache. Both
decoupling bars use fast proactive communication, and therefore the contribution from decoupling signal
forwarding is purely due to breaking sequential chains, not from fast signal propagation, which is
already included in the second bar.
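The bookkeeping behind signal buffering can be sketched in software. The model below is a deliberate simplification under stated assumptions: it keeps unbounded, absolute per-sender counters, whereas the hardware signal buffer is finite and tracks counts relative to other cores, and it omits both the skipping of empty segments and the buffer-capacity check. The class `SignalBuffer` and its methods are hypothetical names, not the interface of the Verilog reference design.

```python
from collections import defaultdict


class SignalBuffer:
    """Signals received by one core, per segment ID and per sender core."""

    def __init__(self, core, num_cores):
        self.core = core
        self.num_cores = num_cores
        # received[segment_id][sender] = number of signals seen so far.
        self.received = defaultdict(lambda: [0] * num_cores)

    def on_signal(self, segment_id, sender):
        """Record one signal for `segment_id` sent by core `sender`."""
        self.received[segment_id][sender] += 1

    def required_from(self, sender, iteration):
        """Number of iterations earlier than `iteration` that ran on `sender`.

        Assumes round-robin assignment: iteration m runs on core m mod N.
        """
        if sender >= iteration:
            return 0
        return (iteration - 1 - sender) // self.num_cores + 1

    def can_enter(self, segment_id, iteration):
        """True once every earlier iteration on every other core has signaled."""
        counts = self.received[segment_id]
        return all(
            counts[sender] >= self.required_from(sender, iteration)
            for sender in range(self.num_cores)
            if sender != self.core
        )


if __name__ == "__main__":
    # Core 2 runs iteration 2 in a three-core system, as in Figure A.7.
    buf = SignalBuffer(core=2, num_cores=3)
    buf.on_signal(segment_id=1, sender=1)  # core 1 skipped its wait
    buf.on_signal(segment_id=1, sender=0)  # core 0 left the segment
    print(buf.can_enter(segment_id=1, iteration=2))
```

In this model, the order in which the two signals arrive does not matter. Core 2 may enter the segment as soon as both have been recorded, which is the effect of breaking the forwarding chain.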

Figure A.8: Ring cache architecture overview. From left to right: overall system; single core slice; ring node internal
structure. (Labeled elements include the core, its DL1 cache, and the ring node; loads and stores/signals from the core;
L1 cache reads/writes and remote L1 request/reply; and, inside the ring node, the cache array with read and write
ports, the signal buffer for signals 1 through S, link buffers, and data/signal and credit links with their control.)

A.3 Ring Cache Overview
To establish terminology and a high-level understanding of ring cache, this section presents an
overview of its architecture and design. With this basic background, implementation details and
design decisions will be more easily understood. This section is divided into three parts: first, the
interaction between the core and the ring cache is detailed, followed by the connections between
ring nodes, and finally the integration between the ring cache and the rest of the memory hierarchy.
In subsequent sections, each of the presented blocks and functions of ring cache will be thoroughly
examined at the hardware level.
Figure A.8, from the original HELIX-RC paper, depicts the general structure of a multicore sys-
tem with ring cache. The leftmost diagram shows the flow of data and signals between cores. Each
core has an attached ring node, and these are organized in a ring. The middle diagram shows the
contents of a ring node, and the rightmost diagram depicts a high-level view of the ring nodes’ in-
ternal structure. The ring nodes contain a core interface, a cache array, a signal buffer, a connection
to the local L1 cache, and three unidirectional ring connections between nodes. The following sub-
sections will detail at a high level how these different elements are used during a typical core interac-
tion.
A.3.1 Core–Node Interaction
Every core in a HELIX-RC enabled chip has a connection to and from a private ring node. This
connection is used exclusively when a core is executing a sequential segment. If a core is executing
loop code outside of a sequential segment (or the loop lacks sequential segments altogether), then the normal cache
hierarchy is used, and the ring node connection is irrelevant.
When a core approaches a sequential segment, it first executes a wait instruction, which has an
associated sequential segment ID. When ring cache is present in a system, a wait instruction is not
just a special load instruction but a unique instruction used only by the ring cache. When the wait
instruction is executed, it is sent to the ring node, along with its ID. The ring node checks its local
signal buffer to determine whether enough signals have been received from every other participating
core to allow entry into the sequential segment. To enter a particular sequential segment where
shared data is accessed (as opposed to an empty one, as discussed in Section A.2.4), a core must have
received signals with the correct ID from every other previous iteration of the loop (and therefore
every other core). If a sequential segment is empty, a core can enter it as long as sending the associ-
ated signal won’t overwhelm any other core’s signal-buffering capability. If a core cannot yet enter a
sequential segment, the wait instruction is held by the ring node, and the core stalls. Once enough
signals have been received, the ring node releases the wait instruction back to the core, which may
then proceed to execute the sequential segment.
Once inside a sequential segment, any loads or stores executed by the core must be sent to the
ring node for processing. When a load instruction is presented to the ring node, it first searches its
local ring cache array for the value. If it is not present, the ring node can utilize the request network
and the reply network to find the correct value from other ring nodes. In some circumstances, such
as in the case of a cold miss, the value will be loaded from the traditional cache hierarchy, subject to
certain rules to maintain memory consistency, as detailed in the following two subsections. Since a
ring node is guaranteed to service a load either locally or through the request/reply networks or the
traditional cache hierarchy, all loads appear to hit in the ring cache from the core’s point of view, al-
beit with potentially unpredictable latency. In the common case of a local ring node hit, the correct
value is returned in one clock cycle.
Usually, after entering a segment and issuing a load instruction, a core will perform some arith-
metic processing on the data before storing an updated value to the ring node. The address and
value of the store are handed to the ring node, which immediately writes this value to the local ring
cache array. Additionally, the ring node inserts the address, value, and originating core ID into a
bundle that is sent over the forwarding network. The forwarding network propagates this informa-
tion to every other ring node in the ring, one hop per cycle, and each of these in turn stores the value
in its local ring cache array. The originating core ID is used to stop propagation of the store once it
has circled the entire ring. As a result, every ring node contains the newly updated shared value.
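The store path can be illustrated with a small Python simulation. It is a minimal sketch under simplifying assumptions: the function name propagate_store and the dictionary-based cache arrays are hypothetical, buffering and evictions are ignored, and the cycle count covers only the hops needed to reach the other nodes.

```python
def propagate_store(num_nodes, origin, addr, value):
    """Forward one store around a unidirectional ring, one hop per cycle.

    Returns each node's local ring cache array (a dict) and the number of
    hops the store took to reach every other node.
    """
    arrays = [dict() for _ in range(num_nodes)]
    arrays[origin][addr] = value  # the originating node writes immediately
    bundle = {"addr": addr, "value": value, "origin": origin}

    cycles = 0
    node = (origin + 1) % num_nodes
    # The originating core ID marks where propagation must stop.
    while node != bundle["origin"]:
        arrays[node][bundle["addr"]] = bundle["value"]
        cycles += 1
        node = (node + 1) % num_nodes
    return arrays, cycles


arrays, cycles = propagate_store(num_nodes=4, origin=1, addr=0x100, value=42)
```

After the call, every entry of arrays holds the stored value at address 0x100, mirroring the statement that every ring node ends up with the updated shared value.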
Within this sequential segment, the executing core is the only one who can possibly read or write
any shared address belonging to the segment, as guaranteed by the compiler. In order to leave the
sequential segment and inform other cores that they may now enter the segment and access said
data, the core injects a signal instruction to the ring node. Upon entering the ring node, the signal
updates the local signal buffer to record the fact that the core is leaving a sequential segment (and
therefore can “forget” the signals that granted it entry). Simultaneously, the signal is also added to a
bundle and propagated throughout the forwarding network, just like stores. It continues propagat-
ing around the ring until it reaches the predecessor of the core that executed the signal, updating all
the ring node signal buffers along the way. After the core injects the signal, it has left the sequential
segment, and all future loads and stores go to the traditional cache hierarchy, until the next sequen-
tial segment is encountered.
Since stores and signals have global effects, the core can only send them to the ring node in pro-
gram order once they are no longer speculative. Additionally, loads and stores cannot be reordered
around either wait or signal instructions. In our simulations, we reuse logic from the load–store
queues for memory disambiguation to block items in the load queue until the wait instruction has
returned from the ring node.
A.3.2 Node to Node Connection
The ring cache implements three different networks: the forwarding network, to propagate signals
and data throughout the ring, the request network, to handle loads that missed in a local ring node
and are therefore sent to other cores, and the reply network, which returns values requested over
the request network. The latter two networks are used infrequently but are necessary for memory
consistency. If they are used frequently, performance will suffer, since they essentially revert
the ring cache’s proactive data communication mechanism to a reactive communication mechanism,
resulting in a performance penalty similar to that in a traditional cache coherence protocol.
Each network has its own network receive buffers and credit-based flow control. The separation
of the different traffic types into three different networks simplifies deadlock prevention, as there
are easily understood interactions between the networks. Even with just a single network, deadlock
avoidance is a significant concern, owing to the ring topology of the ring cache. To prevent deadlock
within a single network, eventual forward progress must be guaranteed. The primary way we en-
sure this is by using a minimum of two slots per buffer for each network and enforcing the invariant
that a new item may not be injected into the network if there is already an item in the
corresponding node’s receive buffers. This guarantees that there is always an open buffer somewhere in the
network, and as long as there is always an open buffer somewhere in each network, some item will
always be able to advance. In turn, if some item can always advance, items will eventually proceed
to the point where they can leave the network. In the case of the forwarding network, for example,
items disappear from the network once they have traversed the ring, so as long as items can con-
tinue circulating, they will eventually leave the network and free up slots for other stores/signals.
Although the forwarding network can stall due to evictions from the ring cache array, it will always
resume after the eviction to the L1 is complete.
To avoid deadlock, there must also not be any circular dependency between the three networks.
The forwarding network never interacts with the other two networks, so it cannot create such a
dependence. The request network receives input from a ring node’s corresponding core only when it
issues a load that misses locally. Items exit the request network at a remote node, where they wait to
access the ring cache array. After performing the load (and potentially an L1 request), they enter the
reply network, which returns them to their originating core. Since there is a unidirectional flow of
data from core to request network to reply network to originating core, and a single core may only
be executing one load at a time, the networks cannot deadlock.
In the forwarding network, signals and stores are packaged in bundles and move around the ring
in lockstep. For correctness, signals may never pass the stores, since they could accidentally allow a
core access to shared data before the data has arrived. By sufficiently provisioning the bandwidth of
the ring cache array and signal buffer, bundles can traverse a ring node in a single cycle, since they
access the memory/signal buffer in parallel with link traversal. Traversing the entire ring therefore
takes the same number of cycles as the number of cores.
The items in the request network are addresses that a core has requested to load and the core ID
that requested it. The items in the reply network are the requested data and the ID of the core that
originally requested it. Items in these networks can similarly hop between ring nodes in a single
cycle. The cost of a remote load is therefore approximately the same number of cycles as the number
of cores, plus the time it takes to exit the request network, perform the load to a cache array / L1
cache, and enter the reply network.
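These latency estimates can be written as two small Python helpers. This is only a back-of-the-envelope sketch: the function names and the service_cycles parameter (the combined time to exit the request network, perform the load, and enter the reply network) are hypothetical, and contention is ignored.

```python
def ring_traversal_cycles(num_cores):
    """Cycles for a bundle to traverse the whole ring at one hop per cycle."""
    return num_cores


def remote_load_cycles(num_cores, service_cycles):
    """Approximate cost of a remote load: one trip around the ring plus service time."""
    return ring_traversal_cycles(num_cores) + service_cycles


# Example: a 16-core ring where the remote node needs 3 cycles to service the load.
estimate = remote_load_cycles(num_cores=16, service_cycles=3)
```

The estimate grows linearly with the number of cores, which is one reason the request and reply networks are meant to be used rarely.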
A.3.3 Memory Hierarchy Integration
The ring cache can be thought of as another layer of the memory hierarchy that sits just below the L1
but is exclusively used for shared data in sequential segments. As such, it must maintain any mem-
ory consistency guarantees in its interactions with the existing cache hierarchy. This is accomplished
through three different invariants. First, shared memory can only be accessed within sequential
segments through the ring cache (this is by the definition of a sequential segment, and therefore en-
forced by the compiler). Second, only a single “owner” core can access a particular shared memory
location through the L1 cache on a ring cache miss. Finally, the existing cache coherence
mechanism must guarantee total store ordering (as Intel’s does), which means that the memory will process
stores to a particular address from a particular cache in the order that it receives them.
Interactions between the ring cache and L1 can occur when data is evicted from the ring cache and
when a load misses in the ring cache. In the case of evictions, the evicted data will be written back
to the L1 only if the owner of a particular memory address is the one performing the eviction—all
other cores just overwrite the data with whatever store triggered the eviction. Likewise, when a load
misses in the ring cache, a request is sent to the request network to fetch the data from the owner
core’s ring cache array and, if it is not present there, from the owner core’s L1 cache. It is crucial that
cache lines are assigned separate owners, or different cores will risk updating the same cache line
at the same time. By ensuring that all loads and stores to particular cache lines go through a single
core’s L1 cache, races to update a particular cache line are avoided, and sequential, in-order memory
accesses to that cache line are guaranteed. Having different cache lines have different owners also
ensures that the existing cache coherence mechanism won’t “ping pong” a cache line containing
shared data back and forth between different cores, incurring a performance penalty.
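The ownership rule for evictions can be sketched as follows. The text does not say how owners are assigned, so the line-interleaved mapping in owner_of, the 64-byte line size, and the function names are assumptions made purely for illustration.

```python
LINE_BYTES = 64  # assumed cache line size; not specified in the text


def owner_of(addr, num_cores, line_bytes=LINE_BYTES):
    """Map an address to an owner core by interleaving cache lines (assumed policy)."""
    return (addr // line_bytes) % num_cores


def on_eviction(core_id, addr, value, l1, num_cores):
    """Write evicted data back to this core's L1 only if this core owns the line."""
    if owner_of(addr, num_cores) == core_id:
        l1[addr] = value
        return True
    return False  # non-owners simply drop their evicted copy


l1_core0 = {}
l1_core1 = {}
wrote_owner = on_eviction(0, 0x1000, 7, l1_core0, num_cores=4)  # core 0 owns this line
wrote_other = on_eviction(1, 0x1000, 7, l1_core1, num_cores=4)  # core 1 does not
```

Because every address in a cache line maps to the same owner, all L1 traffic for that line flows through a single core, which is the property the invariants above rely on.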
Finally, when a loop invocation is ending, all cores must update their L1s with any shared data
that they are the owner of in their local ring cache arrays, since after the loop is over, the core execut-
ing the code between loops can access any memory location. This is effectively a distributed memory
barrier.
A.4 Ring Cache Implementation
The remainder of this report discusses the implementation details of our reference design, includ-
ing datapath schematics and control FSMs when appropriate. The previous section has established
some base terminology and a high level description of the functionality of ring cache. In the follow-
ing sections, the exact details and in-depth explanations for different functional blocks are described.
The description is split into several sections, corresponding to major aspects of ring cache. First, an
overall view of the ring cache datapath is presented. Second, the precise interfaces between cores and
ring nodes, ring nodes and L1 caches, and ring nodes and other ring nodes are discussed. Hypothet-
ically, those interfaces are all that is needed to integrate our reference design into a system. Then, the
end-of-loop ring cache array flush is described. Next, we present details of the three
networks that enable data and signal communication between ring nodes. Finally, the primary com-
ponents that form the core functionality of ring cache – the memory module and the signal buffer
module – are described in detail. The exact manner in which the memory handles incoming loads,
stores, evictions, L1 accesses, and end of loop flushes is included in this description, as well as an op-
timization to reduce unnecessary L1 accesses. We also present a fleshed out explanation of how the
signal buffer decouples signal forwarding from synchronization, and how the hardware design facili-
tates this decoupling.
The schematics in the following sections are meant to highlight the most crucial parts of the im-
plementation, but may not contain every last detail of the control logic, for example. The Verilog
implementation included in the appendix of this report contains all of the precise details, bitwidths,
etc., and likely should be consulted along with the written description to resolve any ambiguity.
A.5 Datapath Overview
Figure A.9 presents the top level schematic of a ring node. The vast majority of the datapath and
control is contained within other major modules. This schematic is meant to serve as a reference for
the different top level module connectivity – future sections describe each module in detail. The
minor logic that is present (credit registers primarily) will be described in Section A.7. Clock and
reset routing is not shown to reduce clutter, but all modules that have state have clk and reset inputs
on their respective block. Any blocks without clk and reset inputs are entirely combinational. The
major blocks:
Receive Buffers These buffers capture circulating bundles from the three different networks
(forwarding, request, and reply). Each buffer contains at least two slots internally, as required to
prevent network deadlock and to cover buffer turnaround time.
Load Unit This module arbitrates between loads injected by the core and loads circulating in the
request network. It passes the selected load operation to the memory module. After the memory
responds to a load, the load unit processes it appropriately, depending on whether it hit or missed
locally. If it hit, and the load originated from the local core, the loaded data is returned to the core.
If it hit, and the load originated from the request network, the response is injected into the reply
network. If the load missed, a remote load is injected into the request network. The load unit is also
responsible for forwarding existing items from the request and reply network receive buffers to the
outgoing links, where they are propagated to the next ring node.
Bundleizer The bundleizer arbitrates between stores/signals injected by the core, and stores/sig-
nals already circulating on the forwarding network. The stores/signals output from the bundleizer
are sent in parallel to the memory module (for stores), the signal buffer (for signals), and stopper
module (for both). This module ensures that stores and signals circulate in-order by “bundling”
them together into a single structure, forcing them to propagate around the ring in lockstep.
Stopper The stopper module removes stores/signals from the forwarding network if they have
completed propagating around the entire ring. Any items left in the network bundle after this prun-
ing are sent over the forwarding network links to the next ring node.
Memory The memory module has two logical ports, one for loads and one for stores.
Loads are initiated from the load unit, and stores from the bundleizer. Load results are returned to
the load unit. When necessary, the memory module writes evicted data to the local L1 cache, and
loads required data from the L1 cache. When instructed by the signal buffer, the memory module
flushes all stored data to the L1.
Signal Buffer The signal buffer reads the signals output from the bundleizer, and records them
internally. When the core executes a wait instruction, the signal buffer is responsible for deciding
whether it can be released, or if the core must stall. In the case of a special flush-related wait instruc-
tion, the signal buffer also instructs the memory module to begin flushing after the wait is released.
A.6 External Interfaces
This section documents all of the necessary interfaces between a ring node and its core, in addition
to a ring node and its L1 cache. Figure A.10 depicts the signals that connect a ring node, a core, and
an L1 cache.
A.6.1 Core Interface
[Figure A.9 schematic omitted. Blocks shown: receive buffers for the forwarding, request, and reply networks; load unit; bundleizer; stopper; memory; signal buffer; and the forward, reply, and request credit registers. When possible, all input signals are placed above the module name, all outputs below.]
Figure A.9: Schematic of top level ring cache module.
Figure A.10: A ring node has direct connections to its local core and its local L1 cache.
There are five main inputs between a core and its ring node, in addition to clk and reset. They are
related to the particular instruction that the core is presenting to the ring node for execution. There
are three outputs from the ring node related to the completion of the instruction. A core can only
present one instruction at a time to the ring node, and it may not remove it or otherwise present
a subsequent instruction unless the ring node indicates completion of the original instruction by
raising the one bit coreCommandProcessed signal high. All inputs (coreCommandValid, coreCom-
mandType, coreCommandId, coreCommandAddr, coreCommandData, and reset) are assumed to
be the outputs of registers – that is, they arrive at the ring node input ports immediately after the
rising edge of the clock, and do not change during a clock period. All outputs from the ring node
(coreCommandProcessed, coreResult, and coreLoadIsHit) may transition towards the end of the clock
cycle, somewhat before the rising edge. The core must be prepared to act on these outputs before
the clock edge, by either choosing to hold the instruction for the next cycle, or by preparing to set a
new instruction. If coreCommandProcessed is raised high, the core must remove the instruction by
deasserting coreCommandValid, or the ring node will execute the instruction twice. The ring node
outputs are not stable after the clock edge. This dynamic means that the critical path of execution
in the ring node may extend into the core logic to set the next instruction. Cores can buffer instruc-
tions headed to the ring node, subject to the following constraints:
1. Stores and signals must be presented to the ring node in program order, non-speculatively, as
they have global effects in the ring cache.
2. Loads and stores can not be reordered around wait and signal instructions – otherwise,
shared data may be read or written outside of the sequential segment.
3. Once an instruction is presented to the ring node (by raising coreCommandValid high) it
must stay there until released by coreCommandProcessed.
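The hold-until-processed handshake can be modeled at cycle granularity in Python. This is a hedged sketch, not the reference Verilog: RingNodeStub and run_core are hypothetical names, and the stub completes every instruction after a fixed latency instead of modeling real contention.

```python
class RingNodeStub:
    """Stand-in ring node that completes each instruction after a fixed delay."""

    def __init__(self, latency):
        self.latency = latency
        self.countdown = None
        self.executed = []

    def cycle(self, valid, instr):
        """Evaluate one clock cycle; returns the coreCommandProcessed value."""
        if not valid:
            self.countdown = None
            return False
        if self.countdown is None:
            self.countdown = self.latency
        self.countdown -= 1
        if self.countdown == 0:
            self.executed.append(instr)
            self.countdown = None
            return True
        return False


def run_core(program, node, max_cycles=100):
    """Present instructions one at a time, holding each until it is processed."""
    pending = list(program)
    cycles = 0
    while pending and cycles < max_cycles:
        processed = node.cycle(valid=True, instr=pending[0])
        cycles += 1
        if processed:
            pending.pop(0)  # present the next instruction at the next clock edge
    return cycles


node = RingNodeStub(latency=1)
program = ["load", "store", "load", "store"]
total_cycles = run_core(program, node)
```

Removing the instruction as soon as coreCommandProcessed is seen is what keeps the stub from executing anything twice, and holding it otherwise satisfies the third constraint.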
Inputs
coreCommandValid This one bit input should be raised high when the other instruction re-
lated inputs from the core to the ring node are valid. It instructs the ring node to execute the
presented instruction. This input must remain high until the instruction has completed execution,
indicated by coreCommandProcessed being set high by the ring node.
coreCommandType This two bit input encodes the four possible instruction types (wait,
signal, load, store). Different instruction types use different sets of the other core inputs, where noted
below.
coreCommandId Signal and wait instructions have an associated segment ID. The bit width of
this input is dependent on the total number of signals the signal buffer can handle (log2(max_signals)).
coreCommandAddr Load and store instructions set this input to the address to be
loaded/stored. In our design, all addresses are 32 bits and must be 4-byte word aligned.
coreCommandData Store instructions set this input to the data to be stored. Wait instructions
use the 0th bit to indicate whether the sequential segment they are protecting is empty of shared
data accesses or not. This input is 32-bits.
reset This input should be raised high whenever all ring node state needs to be reinitialized –
just before a loop invocation starts (or just after a loop invocation ends). Any data stored in the ring
cache array should already be flushed to the normal cache hierarchy before this signal is asserted, by
executing the special flush signals and waits detailed in Section A.8. During the flush, each ring node
writes back only a portion of its stored data, for performance and correctness reasons. So despite the
flush, each ring node still needs to invalidate its entire cache array when reset is raised.
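The field widths and alignment rule for the core inputs can be captured in a short Python sketch. The numeric values in COMMAND_TYPES are a hypothetical encoding (the text only states that two bits select among four types), and rounding the ID width up for non-power-of-two values of max_signals is likewise an assumption.

```python
# Hypothetical encoding; only the two-bit width comes from the text.
COMMAND_TYPES = {"wait": 0, "signal": 1, "load": 2, "store": 3}


def id_width(max_signals):
    """Bits needed for coreCommandId: log2(max_signals), rounded up, at least 1."""
    return max(1, (max_signals - 1).bit_length())


def check_address(addr):
    """Reject addresses that do not fit in 32 bits or are not 4-byte word aligned."""
    if not 0 <= addr < 2**32:
        raise ValueError("address must fit in 32 bits")
    if addr % 4 != 0:
        raise ValueError("address must be 4-byte word aligned")
    return addr


width_for_16_signals = id_width(16)
aligned = check_address(0x1000)
```

A core-side model could call check_address before presenting a load or store, so that misaligned accesses are caught before they reach the ring node.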
Outputs
coreCommandProcessed This 1 bit signal will be raised high by the ring node when the cur-
rently presented instruction from the core is finished executing. For stores and signals, this implies
that they have been added to a bundle and are currently being stored to the memory and signal
buffer, and beginning propagation around the ring. For loads, this implies that the loaded data is
present in coreResult and the hit status is set correctly in coreLoadIsHit. For waits, a
coreCommandProcessed set high indicates that it is safe for the core to enter the sequential segment with the ID
that the core set in coreCommandId. In all cases, when this signal is set high, the core must prepare
to deassert coreCommandValid (either that or prepare to set the inputs to another valid instruction)
at the next rising edge of the clock.
coreResult This 32-bit output contains the data requested from the executing load instruction.
coreLoadIsHit In the reference implementation of ring cache, a load will always hit, either by
fetching the data from the local ring cache, the local L1, or a remote L1. So from the point of view
of the core, every load hits. This 1-bit output could be used to indicate that a load suffered a miss
locally, if such information were useful to the core for any reason.
[Timing diagram omitted; signals shown: clk, coreCommandValid, coreCommandType, coreCommandAddr, coreCommandData, coreCommandProcessed.]
Figure A.11: A load that hit in the local ring node completes in one cycle. Another load takes slightly longer.
Instruction Latencies
Depending on current contention for resources, the core may need to hold an instruction at the ring
node inputs for several or even hundreds of cycles before it is processed. In the best case, any of the
four instruction types can finish executing at the edge of the clock immediately following the inputs
being set (e.g., 10 signals could finish executing in 10 cycles, if not blocked for any reason). There
are no restrictions on which instructions can follow which – regardless of type, instructions that
complete within one cycle can be executed back to back indefinitely (e.g., a sequence of load, store, load,
store can finish in 4 cycles if there aren’t any resource conflicts or cache array misses).
Loads Loads, if they hit in the local ring node, may return to the core by the next rising edge of
the clock. If they miss in the local node, they may need to access the traditional cache hierarchy (per-
haps after traversing the request network), so may not return for tens or hundreds of cycles, depend-
ing on where they hit in the hierarchy. Figure A.11 shows a timing diagram of a load that finishes in
one cycle followed immediately by a load that finishes in 3 cycles. We assume the ring cache mem-
ory can give the appearance of combinational reads, either by using an array of registers or a double
clocked SRAM, as will be elaborated on when discussing the memory module in Section A.11.
[Timing diagram omitted; signals shown: clk, cacheLoadValid, cacheLoadAddr, cacheLoadAccepted, cacheLoadComplete, cacheLoadData.]
Figure A.12: A load from a ring node to its L1 takes a few cycles, as the cache must first accept the load into its in-
ternal queues. After processing the load, a cacheLoadComplete signal informs the ring node that the loaded data is
ready.
Stores and Signals Store and signal instructions usually return by the next rising edge of the
clock, but may be blocked if the forwarding network is already at capacity or if the memory store
port is blocked waiting for a value to be evicted. Generally, stores and signals will not block for more
than several cycles.
Waits Wait instructions are only released when enough signals have been received to grant the
core entrance into a particular sequential segment. In the case of loops with large loop bodies and
low parallelism, this could be many hundreds or thousands of cycles.
A.6.2 L1 Cache Interface
Every ring node also contains a direct interface to and from its attached core’s L1 cache. This inter-
face consists of dedicated wires for both (1) stores resulting from ring cache evictions and (2) loads
for requested shared data for which the ring node is the owner. The former input and output wires
are prefaced with writeback and the latter with cacheLoad. Similarly named wires (*Valid, *Accepted,
*Complete) have roughly the same semantics for both stores and loads. As with the core inter-
face, all signals from the cache are assumed to be set immediately after a positive clock edge, and held
steady until the next positive clock edge. Signals from the ring node to the cache arrive sometime
near the end of the clock cycle, and the cache interface must be prepared to adjust its outputs for the
next cycle.
The reference ring cache implementation assumes certain properties about the interface in order
to match the cache model in our cycle-level C++ simulator – it may need to be altered if the follow-
ing guarantees can not be met:
1. All requests from the ring node, once accepted by the cache into its internal queues, must be
completed in-order. The relative ordering of loads and stores must be maintained.
2. The cache must check for loads or stores to the same address and properly handle read-after-
writes. If both a load and store are presented to the cache on the same cycle, the store must
always be processed first, even if neither has been accepted yet.
3. Once a cache confirms that a store is completed, then the store must be visible to the rest of
the cache hierarchy in the chip (i.e., the stored value is present in at least the local L1 cache
array).
The first two of these restrictions exist to ensure that a load to a very recently evicted value doesn’t
leap ahead of it when accessing the cache, thus potentially reading a stale value from the L1. The ring
node does not check that loads or stores to the L1 have the same address, instead relying on the L1
to perform these checks. The ring node currently only implements a single element eviction buffer,
so if a load misses in the ring cache, the evicted store is guaranteed to already be presented to the L1
(though not necessarily accepted yet). Since a larger eviction buffer would prevent the memory from
having to throttle stores, it may be desirable. However, if a larger eviction buffer is used, any loads
that miss in the ring cache must make sure to check the eviction buffer before accessing the L1.
The third of these restrictions exists to properly implement the ring cache flush at the end of
every loop invocation (see Section A.8). Subsequent code outside of the loop invocation, either
between loops or within future loops, may potentially access any memory location from previous
loop invocations, so the flushed data must be confirmed to be in the normal cache hierarchy before
the ring node can inform the rest of the cores that it is safe to continue.
Figure A.12 depicts a timing diagram for an example load request from a ring node to an L1 cache.
Outputs
writebackValid, cacheLoadValid These one bit values are set high to indicate that the ring
node is either storing or loading a value to/from the L1. They indicate to the L1 that the other out-
puts from the ring node are valid and may be captured if possible. These signals must remain high
until the cache raises the corresponding writebackAccepted or cacheLoadAccepted signal to
indicate that the store/load has been queued by the cache.
writebackAddr, cacheLoadAddr These 32-bit word aligned addresses correspond to the
address that the ring node wants to either store or load to/from the L1. If writebackValid or
cacheLoadValid is raised high, then writebackAddr or cacheLoadAddr should also be set correctly before the
clock edge hits.
writebackData This 32-bit value is the data that the ring node is storing to the L1.
Inputs
writebackAccepted, cacheLoadAccepted These one bit values are set high by the cache if
and only if writebackValid or cacheLoadValid were set high, and the cache has latched the store/load
from the ring node into its own internal queues. They may be raised high as early as the next clock
edge after the valid signals are raised high. The ring node, upon seeing an *Accepted signal raised,
must lower the corresponding *Valid signal by the subsequent clock edge, or the cache may queue
the request twice.
writebackComplete This one bit value is set high by the cache when a previously accepted
store request is confirmed to be stored in the L1 cache array. This signal is held high for one cycle
per item that is confirmed completed. Since many stores may be accepted into internal queues be-
fore any have completed, this signal may need to remain high for several cycles in a row as stores are
processed, as is often the case during a ring cache memory flush.
cacheLoadComplete Unlike with writebacks, the ring node can have only one outstanding
load to the L1 at a time. When this signal is raised high, the L1 is indicating that the outstanding load
has completed, with the loaded data present in cacheLoadData. This signal and the loaded data are
only valid for one clock cycle.
cacheLoadData This is the 32-bit data loaded by the L1 pursuant to a ring node load request. It
is valid for only one clock cycle.
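The load side of this interface (valid, accepted, complete) can be modeled with a small cycle-level Python sketch. L1Stub and ring_node_load are hypothetical names, the accept and completion delays are arbitrary illustration values, and writebacks are not modeled.

```python
class L1Stub:
    """Stand-in L1 cache: accepts a load, then completes it a few cycles later."""

    def __init__(self, memory, accept_delay=1, complete_delay=2):
        self.memory = memory
        self.accept_delay = accept_delay
        self.complete_delay = complete_delay
        self.wait = 0         # cycles the current request has been presented
        self.inflight = None  # [addr, cycles_remaining] once accepted

    def cycle(self, load_valid, load_addr):
        """Returns (cacheLoadAccepted, cacheLoadComplete, cacheLoadData)."""
        accepted = complete = False
        data = None
        if self.inflight is not None:
            self.inflight[1] -= 1
            if self.inflight[1] == 0:
                complete, data = True, self.memory[self.inflight[0]]
                self.inflight = None
        if load_valid and self.inflight is None and not complete:
            self.wait += 1
            if self.wait >= self.accept_delay:
                accepted = True
                self.inflight = [load_addr, self.complete_delay]
                self.wait = 0
        return accepted, complete, data


def ring_node_load(l1, addr, max_cycles=50):
    """Drive one load through the handshake; returns (data, cycles taken)."""
    valid = True
    for cycle in range(1, max_cycles + 1):
        accepted, complete, data = l1.cycle(valid, addr)
        if accepted:
            valid = False  # lower cacheLoadValid so the load is not queued twice
        if complete:
            return data, cycle  # capture the data now; it is valid for this cycle only
    raise TimeoutError("load did not complete")


l1 = L1Stub(memory={0x1000: 0xCAFE})
data, cycles = ring_node_load(l1, 0x1000)
```

The ring node side lowers its valid flag as soon as the request is accepted and captures the data in the same cycle that completion is signaled, matching the one-cycle validity described above.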
A.7 Network Interfaces
Every ring node is connected to three unidirectional ring networks. Figure A.13 depicts the network
connections from the point of view of a single ring node – incoming links on the left, and outgo-
ing links on the right, connect all the cores in the system into a logical ring. We use the term bundle
to describe an element which is propagated by the networks. A bundle can be thought of as a net-
work packet with a single flit, though, unlike a traditional network packet, it can contain individual
elements from different origins being carried to different final destinations, which all move in lock-
step together. Credits, used for flow control, move in the opposite direction of data bundles. Unlike
bundles, credits only propagate to the immediate predecessor core, and do not circulate around the
entire ring.
[Figure A.13 diagram: a center ring node between its left and right neighbor ring nodes. For each of the three networks (Forward, Reply, Request), the left side carries left<Network>ArrivingBundle, left<Network>BundleValid, and left<Network>OutgoingCredit, and the right side carries right<Network>DepartingBundle, right<Network>BundleValid, and right<Network>IncomingCredit.]
Figure A.13: A ring node is connected with its neighbor ring node by three different networks. The forwarding network handles propagation of stores and signals, and the request and reply networks handle loads that need to access a remote ring node or L1 cache. Each network has two arriving (departing) signals for the data payload and a valid bit, as well as a departing (arriving) signal for network credits. All signals are named from the point of view of the center node.
The highest bandwidth and most important of these networks, the forwarding network, is re-
sponsible for proactively forwarding stored shared data and signals between every ring node. The
details of this network will be described in Section A.9. Two other networks, the request and reply
networks, have lower bandwidth and performance characteristics, as they are uncommonly used and
only necessary for memory consistency. The reason for and details of these two networks will be
described in Section A.10.
The remainder of this section will instead focus on the generic network elements – credit based
flow control, buffer sizing, deadlock prevention, and the implementation of the buffer module.
A.7.1 Credit Based Flow Control
All three of our networks use traditional credit based flow control. Since any of the three networks
can stall on a variety of conditions, back-pressure between ring nodes is necessary to avoid over-
flowing the network receive buffers seen on the left side of Figure A.9. Generally, credit based flow
control operates by having each network participant maintain counters for the number of avail-
able buffer entries at all possible downstream nodes. In the case of ring cache, each ring node only
has one downstream node, so it keeps three credit counters in total, one for each of the three networks.
Anytime a ring node sends a bundle to a downstream node it decrements its count of available
downstream buffer entries. Whenever a node removes a bundle from its buffers (either because it
has propagated to the subsequent ring node, or has finished circulating around the ring and has
been removed from the network entirely), it raises the credit line to its predecessor high, indicat-
ing that a buffer entry has become free. The predecessor, seeing the credit line raised, increments its
count of available downstream buffer entries. The credit registers can be seen in Figure A.9. The
credit counts are incremented when a credit is received, and decremented when a bundle leaves a
ring node.
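The credit bookkeeping described above can be sketched in a few lines of Python. This is a minimal behavioral model, not the hardware implementation; the class names and the default of two downstream entries per buffer are assumptions made for illustration.

```python
class CreditCounter:
    """Sender-side count of free receive-buffer entries at the downstream node."""

    def __init__(self, downstream_entries):
        self.credits = downstream_entries

    def can_send(self):
        return self.credits > 0

    def on_bundle_sent(self):
        # A bundle leaving this node occupies one downstream buffer entry.
        if self.credits == 0:
            raise RuntimeError("no downstream buffer entry available")
        self.credits -= 1

    def on_credit_received(self):
        # The downstream node raised its credit line: one entry was freed.
        self.credits += 1


class RingNodeCredits:
    """One credit counter per network, since each node has one downstream node."""

    NETWORKS = ("forward", "request", "reply")

    def __init__(self, entries_per_buffer=2):
        self.counters = {name: CreditCounter(entries_per_buffer)
                         for name in self.NETWORKS}


node = RingNodeCredits()
forward = node.counters["forward"]
forward.on_bundle_sent()
forward.on_bundle_sent()
print(forward.can_send())  # prints "False"
forward.on_credit_received()
print(forward.can_send())  # prints "True"
```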
Buffer Sizing
Credit based flow control potentially introduces an unwelcome latency to bundles propagating in
the networks. Consider the case where each node contains only one buffer entry for each network.
Further consider that only one ring node, node i, is actively enqueuing items to the forwarding net-
work. After enqueuing one item, it decrements its credit counter, since the downstream node’s
(node i+1) buffer entries are now full. On the next cycle, the downstream node propagates the item
to its next downstream node (i + 2), and sends a credit back to node i. Meanwhile, node i was un-
able to enqueue an additional item, since it has yet to receive the credit from node i+1. Once it does,
some cycles later, it increments its credit counter, and is now free to enqueue an additional item.
This delay is known in the network community as the buffer turnaround time, and is a direct conse-
quence of the delay in propagating credits back upstream. Generally, a network should have at least
enough buffers to prevent stalls of a single enqueuing node, under no other contention. In the case
of our ring cache reference design, it takes only one cycle to propagate both data items and credits
between cores. As a result, only two buffer entries are needed to allow a single enqueuing core to
send network items cycle after cycle without stalling to wait for credits. A larger number of buffer
entries could potentially increase performance further, but we did not investigate this possibility.
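The effect of buffer turnaround time can be illustrated with a small cycle-level Python sketch. It assumes a single enqueuing node, no other contention, and a downstream node that forwards each item (and returns a credit) as soon as the item arrives; the function name and parameters are illustrative and do not come from the reference design.

```python
def items_sent(num_buffers, cycles, data_latency=1, credit_latency=1):
    """Count items one node can enqueue in `cycles` cycles under credit flow control."""
    credits = num_buffers
    returning = {}  # cycle number -> credits arriving back at the sender
    sent = 0
    for cycle in range(cycles):
        credits += returning.pop(cycle, 0)
        if credits > 0:
            credits -= 1
            sent += 1
            # The credit comes back after the item reaches the downstream
            # node and the credit travels back upstream.
            due = cycle + data_latency + credit_latency
            returning[due] = returning.get(due, 0) + 1
    return sent


print(items_sent(num_buffers=1, cycles=10))  # prints "5"
print(items_sent(num_buffers=2, cycles=10))  # prints "10"
```

Under these assumptions, one buffer entry lets the node send only every other cycle, while two entries cover the two-cycle round trip and sustain one item per cycle.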
Preventing Deadlock
It is important for any network to provide deadlock free routing, since a deadlocked state may be
impossible to recover from. The ring cache networks only contain single-flit packets, which are simpler
to handle than multi-flit packets. However, ring cache still needs to be particularly careful, as the
ring structure lends itself to deadlock. We prevent deadlock by enforcing the following, which apply
individually to each of the three networks:
1. If there is at least one network buffer entry free in any ring node, a network will always eventually
be able to make forward progress.
2. A new item (either from the core or from a different network) may not be enqueued into a
network if it will consume the last available network buffer entry.
The first of these is guaranteed by the structure of the networks – while the networks can always
be blocked on various events (memory evictions, stalls for L1 cache loads, etc), there are no events
that can block indefinitely, nor any stall events that have a circular dependence on any other event.
The core attached to a ring node isn’t involved with dequeuing things from the networks – merely
by reaching their final destination, network items will remove themselves without any core inter-
vention. So as long as there is a single buffer entry somewhere in a network, some network item will
always eventually be able to move forward one node, and make forward progress. Eventually, even a
very highly contended network will drain.
This forward progress is only guaranteed if the second condition is enforced – a new item being
added to a network may never take the last available buffer entry. At first this seems like a simple
thing to enforce – if there are any items in the receive buffers, don’t allow the core to enqueue a new
item. Items already in the network must always have priority over newly enqueued items. However,
this is not enough. In addition to prioritizing items already in the network, there must be enough
buffering at each node, where “enough” is dependent on the link latency between nodes. In the
case of our design, where link latency between nodes is only one cycle, we can get away with two
buffer entries, as previously mentioned. To understand this, consider the case where every core tries
to enqueue items every cycle. After one cycle, every node will have sent an item to its downstream
node. Now, since every node has an item in its receive buffer, none of them will enqueue any new items
to that network. If, during propagation, the network stalls for any reason, items may start to back
up behind the stalled item. Nodes downstream of the stalled item may still enqueue more items while
their receive buffers are empty, knowing that as long as they stop as soon
as they receive an item, they will never take up the last buffer entry in a network. Consider the case,
however, where instead of one cycle latency between cores, there is a ten cycle latency between cores,
but still with only two buffer entries. All of our ring nodes once again attempt to enqueue as many
items as possible. Every node will enqueue two items before stalling for lack of credits. Since the
inter-node latency is ten cycles, no node knows if other nodes have enqueued any items yet. After
the ten cycles elapse, every node has two items in its receive buffer, violating our invariant that
there must always be a free buffer entry somewhere in the network and creating a deadlock where
nothing can make forward progress. To fix this, we need to have at least (inter-node cycle latency +
1) buffer entries per receive buffer. If that were the case in our example, and each node had 11 entries,
then each node would only enqueue 10 items before receiving the first one sent from its predecessor.
After receiving that first one from its predecessor, a node would stop enqueuing new items. Since
there are 11 entries per buffer, this keeps at least one entry free, and our invariant is maintained.
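The sizing argument above can be checked with a short Python sketch. It is a simplified arithmetic model of the worst-case burst, in which every node injects one item per cycle until it either runs out of credits or receives the first item from its predecessor; the function names are hypothetical.

```python
def min_entries_for_deadlock_freedom(link_latency_cycles):
    """Rule from the text: at least (inter-node cycle latency + 1) entries."""
    return link_latency_cycles + 1


def free_entries_after_worst_case_burst(link_latency_cycles, entries_per_buffer):
    """Free entries left in each receive buffer after every node injects greedily."""
    # A node stops injecting when its credits run out or when the first item
    # from its predecessor arrives, which takes link_latency_cycles cycles.
    injected = min(link_latency_cycles, entries_per_buffer)
    return entries_per_buffer - injected


print(free_entries_after_worst_case_burst(1, 2))    # prints "1"
print(free_entries_after_worst_case_burst(10, 2))   # prints "0"
print(free_entries_after_worst_case_burst(10, 11))  # prints "1"
```

A result of zero corresponds to the deadlocked configuration in the ten-cycle example; sizing each buffer with min_entries_for_deadlock_freedom keeps at least one entry free.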
A.7.2 Buffer Module
Figure A.14 depicts the datapath schematic of a buffer with two entries. The module is not
parameterizable in its number of entries, so if more buffer entries are needed (e.g., because a higher link
latency requires more buffer entries to prevent deadlock), it must be redesigned.
Parameters
ENTRY_WIDTH This parameter represents the bit width of the stored buffer entries. Depend-
ing on the network and the configured signal bandwidth this will change.
Inputs
arrivingEntry A receive buffer receives a bundle directly from the network link. It has a bit-
width of ENTRY_WIDTH.
[Figure A.14 diagram: Receive Buffer (buffer.v), parameter ENTRY_WIDTH = E. Ports: clk, reset, arrivingEntry, inputValid, releaseEntry, outgoingCredit, departingEntry, validDepartingEntry, peakA, peakB. State registers: entryA and entryB (E bits each); validA, validB, and current (1 bit each). Mux control signals are a function of current, validA, validB, validDepartingEntry, inputValid, and reset. Reset and clock routing not shown.]
Figure A.14: Datapath of the buffer module. Arriving items are stored in one of two buffer entries (A or B). The oldest entry is output to the ring node, which frees the buffer entry by raising releaseEntry. Credits are sent upstream to notify the predecessor core that a buffer entry is now available for use.
[Figure A.15 diagram: control FSM with five states; transitions between states are labeled inputValid and releaseEntry.]
State  Buffer   validA  validB  current  departingEntry  outgoingCredit
0      {}       0       0       0        invalid         0
1      {A}      1       0       0        A               releaseEntry
2      {A, B}   1       1       0        A               releaseEntry
3      {B}      0       1       1        B               releaseEntry
4      {B, A}   1       1       1        B               releaseEntry
Figure A.15: Control FSM, state bits, and an output of the buffer module. There can either be no valid buffer entries, only A, only B, or both A and B. The current bit tracks which of A or B is the oldest entry, and therefore first to be released. A credit is sent upstream whenever a buffer entry is released.
inputValid This one-bit input is raised high when arrivingEntry is a valid network bundle.
releaseEntry This one bit control signal from the ring node informs the receive buffer that the
entry at the head of the queue is safe to free, since it is being consumed by the ring node.
Outputs
departingEntry This ENTRY_WIDTH sized output is the entry at the head of the buffer.
validDepartingEntry This one bit value indicates that departingEntry is currently valid.
peakA, peakB These ENTRY_WIDTH sized outputs are the contents of both buffer entries.
These are used by the request network to snoop on the forwarding network, to make sure there
aren’t any loads passing stores to the same address.
outgoingCredit This is a one-bit signal sent to the predecessor ring node that is set high when
a buffer is freed.
Datapath
The datapath consists of two registers, one for each entry in our two entry buffer, entryA and en-
tryB. There are also three other state bits, validA, validB, and current, which indicate which entries
are valid, and which is the oldest.
Control
The control FSM and state bits are depicted in Figure A.15. Control is relatively straightforward. If
the buffer is empty, any newly arriving entry is placed in entryA. If entryA already contains an entry,
the new entry is stored in entryB. It is impossible, due to the credit flow control, for an entry to be received when
both buffer entries are full. At all times, the oldest of the entries is output to the ring node. When
releaseEntry is raised, the oldest of the two entries is freed, and a credit sent to the predecessor core.
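A behavioral Python sketch of the two-entry buffer is given here for illustration; it is not a translation of buffer.v. It tracks the same three state bits (validA, validB, current) and assumes that, when a release and an arrival occur in the same cycle, the release is applied first. That ordering, and the class and method names, are assumptions.

```python
class TwoEntryBuffer:
    """Behavioral model of a two-entry receive buffer (illustrative only)."""

    def __init__(self):
        self.entry_a = None
        self.entry_b = None
        self.valid_a = False
        self.valid_b = False
        self.current = 0  # 0: A is the oldest entry, 1: B is the oldest entry

    def state_bits(self):
        return (int(self.valid_a), int(self.valid_b), self.current)

    def valid_departing_entry(self):
        return self.valid_a if self.current == 0 else self.valid_b

    def departing_entry(self):
        return self.entry_a if self.current == 0 else self.entry_b

    def clock(self, arriving_entry=None, input_valid=False, release_entry=False):
        """Advance one cycle; return True if a credit is sent upstream."""
        outgoing_credit = False
        if release_entry and self.valid_departing_entry():
            if self.current == 0:
                self.valid_a = False
                self.current = 1 if self.valid_b else 0
            else:
                self.valid_b = False
                self.current = 0
            outgoing_credit = True
        if input_valid:
            if not self.valid_a:
                self.entry_a, self.valid_a = arriving_entry, True
            elif not self.valid_b:
                self.entry_b, self.valid_b = arriving_entry, True
            else:
                # Credit based flow control should make this unreachable.
                raise RuntimeError("entry received while both entries are full")
        return outgoing_credit


buf = TwoEntryBuffer()
buf.clock("x", input_valid=True)   # state 1: {A}
buf.clock("y", input_valid=True)   # state 2: {A, B}
buf.clock(release_entry=True)      # state 3: {B}
print(buf.state_bits(), buf.departing_entry())  # prints "(0, 1, 1) y"
```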
A.8 Memory Flushing
An important aspect of the HELIX execution model is that there is a memory barrier at the end of
every loop invocation, as described in Section A.2. This barrier exists because code outside of a par-
allel loop may access recently written values from code within the parallel loop. If the core executing
the code outside the loop is different from the core that wrote a particular value inside the loop, an
incorrect value may be read if there isn’t a memory barrier. On a traditional multicore, this barrier
can be performed merely by executing the appropriate x86 instruction. On a chip where ring cache
is present, however, the contents of every ring node memory need to be flushed to the normal cache
hierarchy. This section describes the mechanics of the memory flush at a system level – the individ-
ual sections on the signal buffer (Section A.12), the memory module (Section A.11), and the cache
interface (Section A.6) touch on the various aspects relevant to those modules and interfaces.
[Figure A.16 diagram: program execution timeline for Cores 0, 1, and 2. Each core executes its parallel loop iterations (0, 3, 6...; 1, 4, 7...; 2, 5, 8...), executes SigStartFlush and WaitStartFlush, waits for signals, begins flushing to the L1 once the signals from the other cores are received, has WaitStartFlush released, executes SigEndLoop and WaitEndLoop, and exits the loop once WaitEndLoop is released.]
Figure A.16: The HELIX required end of loop memory barrier is implemented using two special pairs of wait/signal instructions. First, all cores wait to receive a SigStartFlush from every other core before beginning to flush their local ring node memory. After flushing addresses “owned” by the node to the L1 cache, each core sends a SigEndLoop, and waits for every other core to do the same before leaving the loop.
For performance reasons, we desire that the ring cache memory flush happens in parallel, with all
ring nodes participating at the same time. Later, in Section A.11 we describe the concept of a unique
“owner” for every memory address. A core that “owns” a particular memory address, as determined
by inspecting the address bits, is responsible for all transfers from/to that address between the ring
cache memory and the local L1 cache. During an end of loop flush, all cores can write back all of their
“owned” addresses in parallel. A special bit-array is used so each ring node knows exactly which
indexes in its cache array need to be loaded and written back to the L1, to avoid wasting cycles
fetching non-owned addresses. On average, each core will write back approximately (totalNumWordsInArray / numCores)
words to its L1. The memory flush hardware will be described in
Section A.11.
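The division of flush work among owners can be sketched as follows. The text only says that ownership is determined by inspecting the address bits, so the specific mapping used here (low-order bits of the word address, modulo the number of cores) is an assumption chosen for illustration, as are the function names.

```python
def owner_of(address, num_cores):
    """Hypothetical owner mapping: low-order bits of the 32-bit word address."""
    word_address = address >> 2  # addresses are word aligned
    return word_address % num_cores


def flush_plan(cached_addresses, num_cores):
    """Group cached addresses by owning core; each core writes back only its own."""
    plan = {core: [] for core in range(num_cores)}
    for address in cached_addresses:
        plan[owner_of(address, num_cores)].append(address)
    return plan


# 64 consecutive words shared among 16 cores: 64 / 16 = 4 words per core.
addresses = [4 * i for i in range(64)]
plan = flush_plan(addresses, num_cores=16)
print(sorted(len(words) for words in plan.values()) == [4] * 16)  # prints "True"
```

With an even spread of addresses, each core writes back roughly totalNumWordsInArray / numCores words, which matches the average stated above.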
At a high level, the flush occurs through the insertion of two special pairs of signal and wait instructions.
The function of the first pair is to make each core wait until every other core has finished
its last iteration of the executing parallel loop. This forces all stores currently in the forwarding
network to finish propagating, so they don’t accidentally write to a ring node memory after the flush
starts. Since signals cannot pass stores in the forwarding network, it is guaranteed that after a core
has received the first of these special signals from a particular core, it will not receive any more stores
from that core. After a core has received this signal from every other core, it performs its ring node
flush. After the flush finishes, a second special signal/wait pair forces every core to wait until every
other core has finished flushing before leaving the loop. At that point, it is guaranteed that all data
previously stored in the ring cache is now properly stored in the normal cache hierarchy.
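The two-phase structure of the flush can be illustrated with a small timing sketch in Python. It deliberately ignores the time signals take to travel around the ring, so each phase reduces to a barrier across all cores; the function name and its inputs are hypothetical.

```python
def flush_timeline(last_iteration_done, flush_cycles):
    """Return (flush_start, per-core flush completion times, loop_exit).

    last_iteration_done[c]: cycle at which core c finishes its last iteration
                            and sends the first special signal.
    flush_cycles[c]:        cycles core c needs to write back its owned words.
    Signal propagation latency is ignored (simplifying assumption).
    """
    if len(last_iteration_done) != len(flush_cycles):
        raise ValueError("one entry per core is required in both lists")
    # Phase 1: no core flushes until every core has sent its first signal.
    flush_start = max(last_iteration_done)
    flush_done = [flush_start + cycles for cycles in flush_cycles]
    # Phase 2: no core leaves the loop until every core has finished flushing.
    loop_exit = max(flush_done)
    return flush_start, flush_done, loop_exit


print(flush_timeline([10, 12, 11], [3, 5, 4]))  # prints "(12, [15, 17, 16], 17)"
```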
Figure A.16 depicts a timeline of a ring cache flush. At the end of every parallel loop, HELIX inserts
two pairs of special signal and wait instructions. They must be inserted only after the very last
instructions of the last iteration executed by a particular core. The highest two sequential segment
IDs must be reserved for these two special signals, since the signal buffer tracks them differently
from normal signals. Since they are only executed once at the end of the loop, their state in the signal
buffer must be initialized appropriately, as discussed in Section A.12.2. Besides using the reserved
sequential segment IDs, however, they are encoded just as normal signal instructions. First, every
core executes the SigStartFlush signal, and then waits on WaitStartFlush. Every core must receive the
SigStartFlush signal from every other core before proceeding. Once a core has received this signal
from every other core, it doesn’t release the wait instruction quite yet. The special handling of WaitStartFlush
by the signal buffer, instead of releasing the wait, triggers the ring cache memory flush to
commence. After every “owned” address has been confirmed written to the L1 cache, the memory
module triggers the release of WaitStartFlush. Subsequently, the core may execute a SigEndLoop and
then wait on WaitEndLoop. Once a core has received a SigEndLoop from every other core, it knows
that every ring node has completed its flush, and that it is safe to leave the parallel loop.
[Figure A.17 diagram: a Signal Entry holds a valid bit, a sender core ID, and a sequential segment ID. A Store Entry holds a valid bit, a sender core ID, a store address, and store data. A Forwarding Network Bundle holds Signal Entry 0 through Signal Entry N and one Store Entry.]
Figure A.17: A forwarding network bundle contains one or more entries for signals, and one entry for a store. Both entry types include the ID of the core that sent them to the ring cache, and a valid bit. Signal entries also include a sequential segment ID, whereas store entries include a 32-bit address and data value to be stored. Our reference design uses 128 sequential segment IDs (7 bits to represent each ID), 16 cores (4 bits to represent), and 5 signal slots per bundle. The store entry therefore has 1 + 4 + 32 + 32 = 69 bits, and each signal entry has 1 + 4 + 7 = 12 bits. The total network bundle is 69 + (5 * 12) = 129 bits.
A.9 Storing Shared Data and Signals - The Forwarding Network
The forwarding network is a key piece of the ring cache. It is responsible for the proactive communication
of shared data and signals. In contrast to cache coherence mechanisms which deliver data
reactively, only when requested, the forwarding network distributes every store and signal to every
other core in the system very soon after they are produced. The forwarding network captures newly
executed stores and signals from sequential segments and forwards them throughout the unidirectional
logical ring of cores that makes up a HELIX-RC system. When a core sends a store or signal
to ring cache, it is sent to a module called the bundleizer. This module also reads bundles out of the
forwarding network receive buffers (described in Section A.7). If there is room in the network bundle
for a new store or signal, the bundleizer grabs it from the core. This module passes output to the
rest of the ring node only if the ring node memory is ready to accept a store instruction, and there
are available downstream receive buffers in the subsequent ring node. The output is sent to three
places. First, if there is a valid store in the bundle, it is sent to the ring node memory to be stored.
Second, any signals in the bundle are sent to the signal buffer to be recorded. Third, the entire bundle
is sent to the stopper module. The stopper module is responsible for removing any stores or
signals from the bundle that have circulated around the entire ring (e.g., a signal inserted by core 0
will be removed from the bundle by the stopper module of core 15, after passing through core 1, 2,
etc.). If there are any stores or signals remaining in the bundle after the stopper has pruned it, the
remainder of the clock cycle is used to send the bundle over the forwarding network link to the subsequent
core in the ring. In this way, the stores/signals are written to the memory / signal buffer
in parallel with network link transmission – once the bundleizer outputs a bundle, it is guaranteed
access to the memory, signal buffer, and network link in parallel.
[Figure A.18 diagram: Bundelizer (bundelizer.v), parameters CORE_ID = C and SIGNAL_BANDWIDTH = 3. Ports: outboundLinkReady, memoryReady, coreCommandValid, coreCommandType, coreCommandId, coreCommandAddr, coreCommandData, leftValidDepartingBundle, leftDepartingBundle, leftReleaseBundle, coreInputServiced, outputBundle, outputBundleValid. Annotations: a store from the network is prioritized over a new store from the core; existing signals in the bundle are selected before a new core signal, which is inserted in the first available empty slot; the bundle is valid if there are any valid stores or signals, either from the core or already in the bundle; core input is serviced only if the existing bundle isn't at capacity. Some mux control signals are omitted for clarity.]
Figure A.18: The bundleizer arbitrates between bundles already in the forwarding network, and newly inserted stores and signals from the core. This schematic assumes a signal bandwidth of 3 signals per cycle.
[Figure A.19 diagram: Stopper (stopper.v), parameters CORE_ID = C and SIGNAL_BANDWIDTH = 3. Ports: inputValid, inputBundle, outputValid, outputBundle. Annotations: if the origin core ID of a signal or store is (C + 1) % NumCores, then it has finished propagating around the ring, and is removed from the bundle; outputValid is the OR reduction checking that at least one valid store or signal remains.]
Figure A.19: The stopper module removes entries from a bundle if they have finished circulating around the entire ring. It assumes a signal bandwidth of 3 signals per cycle.
In the remainder of this section, we will first describe what a bundle is. Then, we will present
schematics for the bundleizer and stopper modules.
A.9.1 Network Bundle
The atomic unit that is circulated around the forwarding network is called a bundle. Figure A.17
depicts a bundle, which consists of one or more signal entries and a single store entry. When a core
presents a store or signal to the ring node, the bundleizer adds it to one of the available signal or
store slots. Since ring cache can only process one store per cycle, it doesn’t make sense to provide
more than one store slot. The signal buffer, on the other hand, can process multiple signals per cycle.
Higher signal bandwidth improves performance for a couple of the SPEC CINT 2000 benchmarks,
so it is desirable to package multiple signals in a bundle.
Once added to a bundle, individual stores and signals all move in lockstep together – it is impos-
sible for an individual store or signal to “move ahead” a bundle or be “left behind” in a bundle. This
is a requirement of the HELIX execution model – if signals were able to pass stores, they may allow
a core entry into a sequential segment before the shared data they are protecting has arrived. For this
reason, propagation of the entire bundle must stall if the ring node memory is not ready to process a
new store (e.g., if it is currently evicting a value to the L1).
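The bundle sizes quoted for the reference design (Figure A.17) can be reproduced with a short Python sketch, which also shows one possible way to pack a signal entry into an integer. The field ordering inside an entry (valid bit in the most significant position, then core ID, then segment ID) is an assumption; the source only lists which fields each entry contains.

```python
SEGMENT_ID_BITS = 7   # 128 sequential segment IDs
CORE_ID_BITS = 4      # 16 cores
WORD_BITS = 32        # 32-bit store address and 32-bit store data
SIGNAL_SLOTS = 5      # signal slots per bundle in the reference design

SIGNAL_ENTRY_BITS = 1 + CORE_ID_BITS + SEGMENT_ID_BITS
STORE_ENTRY_BITS = 1 + CORE_ID_BITS + WORD_BITS + WORD_BITS
BUNDLE_BITS = STORE_ENTRY_BITS + SIGNAL_SLOTS * SIGNAL_ENTRY_BITS


def pack_signal(valid, core_id, segment_id):
    """Pack a signal entry as {valid, core_id, segment_id} (assumed ordering)."""
    if not 0 <= core_id < (1 << CORE_ID_BITS):
        raise ValueError("core_id out of range")
    if not 0 <= segment_id < (1 << SEGMENT_ID_BITS):
        raise ValueError("segment_id out of range")
    valid_shift = CORE_ID_BITS + SEGMENT_ID_BITS
    return (int(valid) << valid_shift) | (core_id << SEGMENT_ID_BITS) | segment_id


def unpack_signal(bits):
    """Inverse of pack_signal; returns (valid, core_id, segment_id)."""
    segment_id = bits & ((1 << SEGMENT_ID_BITS) - 1)
    core_id = (bits >> SEGMENT_ID_BITS) & ((1 << CORE_ID_BITS) - 1)
    valid = bool((bits >> (CORE_ID_BITS + SEGMENT_ID_BITS)) & 1)
    return valid, core_id, segment_id


print(SIGNAL_ENTRY_BITS, STORE_ENTRY_BITS, BUNDLE_BITS)  # prints "12 69 129"
```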
A.9.2 Bundleizer Module
The bundleizer module is entirely combinational. It performs its duties sometime early in the clock
cycle. This module is responsible for:
1. Adding new signals and stores from the local core to an existing bundle in the network if
there is an empty slot of the appropriate type.
2. Preventing a core from enqueuing a new signal or store into the network if there isn’t a slot.
3. Creating a new bundle for a signal or store from the local core if there aren’t any bundles
currently in the network receive buffer.
4. Deciding if a bundle on this cycle can be sent to the memory, signal buffer, and propagated to
the subsequent ring node.
The first and second of these items uphold the requirements for preventing network deadlock in
Section A.7.1. By always deferring to bundles already in the network, a newly inserted store or signal
from the core is unable to consume the last available buffer entry in the network.
Parameters
CORE_ID The bundleizer module must know the local core ID of its attached core. The ID gets
added to any newly initialized store or signal entry when the local core injects a new signal or store
into the ring node.
SIGNAL_BANDWIDTH The number of signal slots in the network bundle must be known.
The bundleizer needs to know which signal slots are invalid so it can add a new signal from the core
to the bundle.
Inputs
All core inputs to the ring node (coreCommandId, coreCommandData, coreCommandAddr, coreCom-
mandValid, coreCommandType) related to a new instruction are forwarded to the bundleizer. See
Section A.6 for a description of these signals. In short, they represent all of the information neces-
sary to enqueue a new store or signal from the core to a bundle.
leftDepartingBundle This input carries the oldest bundle currently in the forwarding network
receive buffers. This bundle only leaves the receive buffer if instructed by the bundleizer; by
default it is only presented to the bundleizer for inspection. The size of this input depends on the
configured signal bandwidth and how many bits are required to represent a sequential segment ID,
as shown in Figure A.17.
leftValidDepartingBundle This one bit input is held high if the leftDepartingBundle input
is valid.
outboundLinkReady This input from the ring node is held high if the number of forwarding
network credits is greater than 0, indicating that the ring node downstream in the network has at
least one buffer entry available to be used.
memoryReady This one bit input is held high if the local ring cache memory can accept a new
store instruction this cycle.
Outputs
outputBundle This output is a full network bundle. The only difference from the leftDe-
partingBundle input is that one new store or signal from the core may have been added if there was
space. It is the same bitwidth as leftDepartingBundle.
outputBundleValid This one bit output merely indicates whether the outputBundle signal is
valid.
coreInputServiced This one bit output is raised high if and only if a new store/signal presented
by the core was accepted into a network bundle. By raising it, the bundleizer is letting the
core know that the store or signal is finished executing, since it will be written this cycle. As a result
the core is free to execute another instruction.
leftReleaseBundle This one bit output is raised high if the bundleizer processed the bundle
at the head of the receive buffer (contained within leftDepartingBundle). Upon raising it, the receive
buffer knows it should release the buffer entry and send a credit to the upstream ring node.
Datapath
Figure A.18 depicts a schematic of the bundleizer module, with a signal bandwidth of 3 signals per
cycle. If both outboundLinkReady and memoryReady are high, then the bundleizer can potentially
send a bundle to the outputBundle output port. If there is already a bundle in the network, in
leftDepartingBundle, the bundleizer inspects it to determine whether there are any empty slots to
insert a new signal or store from the core inputs. If there aren’t any core inputs, the bundle is simply
passed to the output port. If there is an available store slot (and the core is trying to insert a store),
the core ID of the ring node is appended to the coreCommandData and coreCommandAddr and
the result is added to the bundle, which is then passed to the output port. If there are any available
signal slots (and the core is trying to insert a signal), the first available slot in the bundle is taken. The
mux control signals not pictured on the schematic are set by the valid bits of the signal entries in the
bundle, which determine whether there is a slot available for the new core signal, and if so, which
slot it should take. If there isn’t a valid bundle already in the receive buffer, then a new bundle is
constructed with only the new core store or signal. The bundleizer indicates to the core with the
coreInputServiced output if the store or signal was accepted. It indicates to the receive buffer with the
leftReleaseBundle signal whether the input bundle was passed on to the output, and therefore the
entry can be released from the receive buffer.
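The arbitration described in this datapath can be sketched as a pure function in Python, mirroring the fact that the module is combinational. This is an illustrative model and not a translation of the Verilog: the dictionary-based bundle representation, the tuple form of the core command, and the function names are all assumptions.

```python
import copy


def empty_bundle(signal_bandwidth):
    # Hypothetical representation: None marks an invalid (empty) slot.
    return {"store": None, "signals": [None] * signal_bandwidth}


def bundleize(left_bundle, core_command, outbound_link_ready, memory_ready,
              core_id, signal_bandwidth=3):
    """Return (output_bundle, core_input_serviced, left_release_bundle).

    left_bundle:  bundle at the head of the receive buffer, or None if invalid.
    core_command: None, ("store", (addr, data)), or ("signal", segment_id).
    output_bundle is None when outputBundleValid would be low.
    """
    if not (outbound_link_ready and memory_ready):
        return None, False, False  # nothing may move this cycle
    if left_bundle is not None:
        bundle = copy.deepcopy(left_bundle)
    else:
        bundle = empty_bundle(signal_bandwidth)
    serviced = False
    if core_command is not None:
        kind, payload = core_command
        if kind == "store" and bundle["store"] is None:
            addr, data = payload
            bundle["store"] = (core_id, addr, data)
            serviced = True
        elif kind == "signal" and None in bundle["signals"]:
            first_free = bundle["signals"].index(None)
            bundle["signals"][first_free] = (core_id, payload)
            serviced = True
    has_entries = (bundle["store"] is not None
                   or any(s is not None for s in bundle["signals"]))
    if not has_entries:
        return None, False, False
    return bundle, serviced, left_bundle is not None


in_network = {"store": (1, 0x80, 9), "signals": [(1, 5), None, None]}
out, serviced, release = bundleize(in_network, ("signal", 6), True, True, core_id=2)
print(out["signals"], serviced, release)  # prints "[(1, 5), (2, 6), None] True True"
```

Because entries already in the bundle are never displaced, a core store that finds the store slot occupied is simply not serviced that cycle and must be presented again.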
Control
The bundleizer is stateless. All control signals are determined from the valid bits of the input bundle
or coreCommandValid.
A.9.3 Stopper Module
The stopper module removes any stores or signals from the network bundle that have completed
circulating around the ring, while allowing stores and signals that haven’t completed circulating to
continue over the forwarding network link to the subsequent core. Essentially, it checks the origin
core ID of each store and signal slot, and invalidates the entry in the outgoing bundle if it is ready to
be removed.
Parameters
CORE_ID The stopper module must know the local core ID of its attached core, so it can check
which of the items in the bundle have completed circulation around the ring.
SIGNAL_BANDWIDTH The number of signal slots in the network bundle must be known, so
the stopper module knows how many signal slots it needs to check.
Inputs
inputBundle The input bundle is the output bundle from the bundleizer.
inputValid This one bit input is high if the inputBundle signal contains any valid store or signal
entries.
Outputs
outputBundle This bundle is the same as the inputBundle, with any store or signal entries in-
validated if their origin core ID belongs to the subsequent core in the ring.
outputValid This one bit output is high if the outputBundle signal contains any valid store or
signal entries.
Datapath
This module merely inspects the origin core ID of each signal and store entry in the bundle. If
(origin core ID - 1) mod (number of cores) equals the core ID of this node, that is, if the entry
originated at the subsequent core in the ring, the stopper zeros out all of the bits in that entry.
Whatever signal or store was there ceases to exist in the forwarding network. If all slots in the
bundle were invalidated this cycle, then outputValid is set low, and the entire bundle also ceases to
exist. The schematic is shown in Figure A.19.
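The stopper check can be illustrated with a short behavioral sketch in Python. This is not the Verilog of the reference design: the Entry type, the stopper function, and the use of None for an invalid slot are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Entry:
    """A valid store or signal slot (hypothetical model of one bundle slot)."""
    origin_core: int
    payload: int


def stopper(bundle: List[Optional[Entry]], core_id: int,
            num_cores: int) -> Tuple[List[Optional[Entry]], bool]:
    """Invalidate entries whose origin is the next core in the ring.

    None models an invalid (zeroed) slot. Returns the outgoing bundle and
    the outputValid bit.
    """
    out = []
    for entry in bundle:
        done = entry is not None and (entry.origin_core - 1) % num_cores == core_id
        out.append(None if done else entry)
    output_valid = any(e is not None for e in out)
    return out, output_valid
```

For example, in a 4-core ring the stopper at core 3 removes entries that originated at core 0, since forwarding them once more would deliver them back to their source.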
A.10 Loading Shared Data - The Request/Reply Networks
One of the most important contributions of ring cache is that it transforms a reactive cache coher-
ence protocol into a proactive system of communication. In most cases, when a core attempts to
access shared data, it will find that it is present in its local ring node memory. The communication
cost is only the time it takes to access the ring node. If the normal cache hierarchy were used, it would
take dozens of cycles from the time data was requested until the data arrived locally.
One of the distinguishing factors of ring cache over other fast communication mechanisms (Mul-
tiscalar register file [54], scalar operand networks [59]) is that the number of shared pieces of data
is not known at compile time, nor the number of consumers of any particular shared piece of data.
Since other solutions rely on a statically known number of shared elements, they are not suitable for
HELIX. Instead, a cache structure that can handle an unknown number of elements is needed. That
means ring cache must be able to handle load misses and cache evictions, since there is no guarantee
that all of the shared data will fit in the ring cache. For this reason, the normal cache hierarchy must
be relied upon to support the ring cache.
[Figure A.20 timeline, cycles 0 to 3, showing the ring node memories of core 0, core 1, and core 2
and the forwarding network links between them. Legend: "AX = Y" denotes ring cache memory in which
Y is currently stored in address X; "AX <- Y" denotes a forwarding network bundle storing Y in
address X. In cycle 3, core 0 is writing 32 to A0, but core 1 is writing 64 to A0, so a memory race
results.]
Figure A.20: Allowing any core to write back and load to the normal cache hierarchy results in a race condition that
may violate correctness!
However, the memory consistency guarantees of the normal cache hierarchy differ from those of
ring cache. A naive integration of the two raises a significant consistency issue. Consider an
implementation where a ring node simply writes back any evicted value to its local L1. In the case of
a subsequent ring cache load miss, the ring node fetches the data from its L1. While this might seem
like a reasonable idea at first glance, it gives rise to race conditions that violate correctness.
Figure A.20 depicts the timeline of a three-core system that suffers from such a race condition.
For simplicity we assume a ring node memory size of just a single word. Sometime in the past, the
value of 32 is written to address A0, which is propagated on the forwarding network and stored in
every ring node memory. In cycle 1, core 0 and core 1 execute two different sequential segments.
Core 0 stores the value of 4 to address A1, and core 1 stores the value of 64 to address A0. Both stores
enter the forwarding network. On the next cycle, core 0’s store has triggered an eviction in its mem-
ory, and A0, with a value of 32, begins to be written back to core 0’s L1 cache. In the same cycle, core
1’s ring node memory updates address A0 with the value 64. Additionally, both stores propagate
one more hop on the forwarding network. In cycle 3, the store to A1 triggers an eviction of A0 in
core 1. Now, core 1 begins to write back A0 to its own L1. Unlike core 0, however, core 1 writes
back the newly updated value of 64. This results in a race to update A0 with either the old or new
value. Many (or perhaps all) modern architectures do not make any guarantees about which value
will be recorded first.
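The outcome of this race can be made concrete with a toy model. The sketch below assumes, purely for illustration, that the coherent view of A0 in the normal cache hierarchy behaves like a single dictionary in which the last write-back to arrive wins; the function and constant names are hypothetical.

```python
def apply_writebacks(arrival_order):
    """Apply (address, value) write-backs to a toy L1 model in arrival order."""
    l1 = {}
    for address, value in arrival_order:
        l1[address] = value  # the last write-back to arrive wins
    return l1


# The two racing write-backs from the Figure A.20 scenario.
CORE0_WRITEBACK = ("A0", 32)  # stale value evicted by core 0
CORE1_WRITEBACK = ("A0", 64)  # updated value evicted by core 1

core0_first = apply_writebacks([CORE0_WRITEBACK, CORE1_WRITEBACK])
core1_first = apply_writebacks([CORE1_WRITEBACK, CORE0_WRITEBACK])
```

If core 0's write-back arrives first, A0 ends up holding the new value 64; if it arrives last, the stale value 32 overwrites it. Since the hardware does not guarantee the order, either result is possible.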
We handle this consistency issue by enforcing the constraint that for any unique memory address,
there is a single owner core that is solely responsible for any loads or stores to that address between
the ring cache and the normal cache hierarchy. The owner core is determined based on the address;
details of how this is done can be found in Section A.11. In the case of a ring cache load miss, the
request network is responsible for requesting the data from the owner ring node, which subsequently
performs an L1 lookup and returns the result on the reply network. In the case of a ring cache
eviction, if the ring node owns the evicted address, it writes it back to its L1. If the evicting ring node
is not the owner, it simply discards the data without performing any writeback.
[Figure A.21 bundle layouts. Request network bundle fields: Load Address, Valid, Requester Core ID.
Reply network bundle fields: Loaded Data, Valid, Requester Core ID.]
Figure A.21: Unlike the forwarding network, the request and reply networks only contain one load (request or result)
per bundle. The core ID indicates which ring node requested the remote load, which is where the loaded data will
be returned to. For a 16 core system, with 32-bit addresses and data, the total number of bits per request or reply
bundle is 1 + 4 + 32 = 37 bits.
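The single-owner rule for evictions and load misses can be summarized in a small decision sketch. The function names and the returned action strings are hypothetical labels, and the sketch covers only the basic rule stated above, before any optimization.

```python
def on_ring_cache_eviction(node_id: int, owner_id: int) -> str:
    """Action a ring node takes when it evicts an address."""
    # Only the owner may write the evicted value back to its L1.
    return "write_back_to_l1" if node_id == owner_id else "discard"


def on_ring_cache_load_miss(node_id: int, owner_id: int) -> str:
    """Action a ring node takes when a core load misses in its memory."""
    # Only the owner may fetch from its L1; everyone else asks the owner.
    return "load_from_local_l1" if node_id == owner_id else "send_remote_load_request"
```

Because every interaction with the normal cache hierarchy for a given address goes through one node, two nodes can never race to write back different values for the same address.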
The load unit module of ring cache handles all load-related interactions between the core, the
request network, and the reply network. Loads from the core enter the load unit, which arbitrates
between the core and the request network for access to the memory module. We first discuss the
request and reply networks at a high level before describing how the core and the two networks
interact within the load unit.
A.10.1 Request and Reply Networks
The request and reply networks, as mentioned, exist only to fulfill memory consistency
requirements. They implement a reactive data transfer mechanism that, if used frequently, completely
eliminates the benefits of having the ring cache. For this reason they are not tuned for
performance: unless they are used rarely, performance will suffer regardless. Like the forwarding
network, the request and reply networks are unidirectional rings that advance one hop, from one
core to the next, per clock cycle. While it might seem like a higher connectivity
topology could have been used, since there isn’t a requirement for strict in-order data flow of
loads like there is for stores and signals, there are two reasons why we stuck with unidirectional rings.
First, unidirectional rings are easy to reason about in terms of data flow and deadlock avoidance.
Second, and more importantly, care must be taken such that a remote load on the request network
doesn’t pass a store to the same address on the forwarding network, or an incorrect value could be
returned. Both of these reasons will be discussed in the following subsections.
Unlike the forwarding network, which contains multiple signals (and a store) per network bun-
dle, the request and reply networks contain only a single remote load request/reply. Figure A.21
depicts the structure of each network bundle – for the request network, the core ID of the request-
ing core is included, along with the address to be loaded. In the reply bundle, the requesting core
ID remains, but the address to be loaded is replaced with the actual loaded data. If the load unit
decides a ring node cannot service a load locally (i.e., it is not the owner of the address), it adds its
core ID and the address to a request network bundle, and injects it into the network. At every hop
around the ring, the local ring node load unit inspects the request network bundle to see if it owns
that particular memory address. If so, it removes it from the request network, and first performs the
load operation on its ring node memory. If it misses in the ring node memory, the value is subse-
quently loaded from its L1. The loaded data is then packaged in a reply network bundle and injected
into the reply network. At each hop around the reply network, each ring node inspects the core ID
in the bundle to see if the reply is destined for it. Once it reaches the core that originally performed
the load, it is removed from the reply network, and the result handed to the core. This behavior
results in a remote load latency of at least 16 cycles in a 16-core system just for network transmission:
since the rings are unidirectional, the total number of hops to the owner and back to the original core
will always equal the total number of cores. There are additional cycles of latency for entering/exiting
the network, performing the load to the ring node memory, potentially accessing the L1, etc.
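The claim that the round trip always takes as many hops as there are cores follows from modular arithmetic on the ring. A minimal sketch, with hypothetical function names and one hop per cycle as described above:

```python
def hops(src: int, dst: int, num_cores: int) -> int:
    """Hops from src to dst on a unidirectional ring of num_cores nodes."""
    return (dst - src) % num_cores


def round_trip_hops(requester: int, owner: int, num_cores: int) -> int:
    """Request-network hops to the owner plus reply-network hops back."""
    return hops(requester, owner, num_cores) + hops(owner, requester, num_cores)
```

For any requester and any distinct owner, the two distances sum to the ring size, so the network transmission latency of a remote load does not depend on where the owner sits.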
Reducing Remote Loads
The number of remote loads can be reduced by relaxing our constraint that only the owner core
of an address can interact with the normal cache hierarchy. Instead, only the owner core of an
address can interact with the normal cache hierarchy if the address has previously been written to ring
cache, but subsequently evicted. The race condition previously shown in Figure A.20 can only occur
if an address was at some point present in the ring cache (in that example, A0). If a particular core is
trying to load A0, but it knows for sure that A0 has not yet been written to the ring cache during this
loop invocation, it can conclude that A0 cannot currently be undergoing eviction from any other node.
Given the definition of a sequential segment, if a core is loading A0, no other core in the system may
be loading or storing it. It is therefore safe for the core to load it from the normal cache hierarchy,
even if it isn’t the owner. We use a bloom filter in each ring node to track whether an address has
been written yet during this loop invocation. If a core attempts to perform a load, and it misses in its
ring node memory, the bloom filter is consulted. If the address is present in the bloom filter, the core
knows it must make a remote load request unless it is the owner of the address. If it is not present in
the bloom filter, the core can load from its own L1, despite not being the owner. More details of the
bloom filter can be found in Section A.11.
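The relaxed rule amounts to a three-way decision on every core load. The sketch below is illustrative only: TinyBloomFilter, its size, and its two hash functions are assumptions made for this example and are not the filter described in Section A.11, and load_source is a hypothetical name.

```python
class TinyBloomFilter:
    """A toy bloom filter over word addresses (size and hashes are assumed)."""

    def __init__(self, num_bits: int = 64):
        self.num_bits = num_bits
        self.bits = 0

    def _positions(self, address: int):
        word = address >> 2  # drop the byte offset within a 32-bit word
        return (word % self.num_bits, (word * 7 + 3) % self.num_bits)

    def add(self, address: int) -> None:
        for pos in self._positions(address):
            self.bits |= 1 << pos

    def might_contain(self, address: int) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(address))


def load_source(hit_local: bool, is_owner: bool,
                written: TinyBloomFilter, address: int) -> str:
    """Where a core load is serviced from under the relaxed ownership rule."""
    if hit_local:
        return "ring_node_memory"
    if is_owner or not written.might_contain(address):
        return "local_l1"
    return "remote_load"
```

A bloom filter can report an address as present when it was never added, but never the reverse. In this scheme a false positive only causes an unnecessary remote load, so it costs performance but not correctness.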
Deadlock Avoidance
Our general approach to deadlock avoidance was previously explained in Section A.7.1. As mentioned, one of
the reasons for choosing unidirectional rings for the request/reply networks was that we could
apply the same policy that the forwarding network implements to avoid deadlock. The basic
approach is to make sure a new item being injected into the network can never take the last available
buffer entry anywhere in that network. That way, forward progress is always guaranteed. In our
current reference design, each core can only make one remote load request at a time, so the number of
outstanding remote loads is at most the total number of cores in the system. Since each
network buffer contains two entries, they can never fill up, in either the request or reply network. Since
a possible optimization is to allow multiple outstanding remote loads at a time, we also enforce the
invariant that a new item being injected into either network has lower priority than items already
in the network. As long as there are sufficient buffer entries considering the link latency (again, as
detailed in Section A.7.1), this prevents deadlock even if cores can make multiple remote requests at a
time.
In the case of the request network, the only source of new items is when a core injects a remote
load. Items exit this network when they arrive at the owner core of the requested address. There,
they wait until they can access the local ring node memory, perform the load, and then wait to enter
the reply network. Until they can enter the reply network, backpressure is applied on any other
items needing to exit the request network, creating a unidirectional dependence between the request
network and reply network. Once in the reply network, the item propagates until it reaches the
original requesting core, where it exits the network. As long as the reply network makes forward
progress, as guaranteed by keeping at least one buffer entry empty at all times, the request network
will also be able to make forward progress. Since the reply network doesn’t depend on the request
network for progress, there is not a circular dependence, so deadlock is avoided.
Ordering Constraints
The request/reply networks have fewer ordering constraints than the forwarding network.
The forwarding network enforces a strict ordering between signals and stores, since a signal
passing a store can produce incorrect behavior. Loads in the request/reply networks don’t have a
similar constraint – the loads could hypothetically be reordered, though we don’t do that. The one
very important constraint, however, is that a load in the request network must never pass a store in
the forwarding network to the same address. While exceptionally rare (it doesn’t occur a single time
in any phase of any of our benchmarks), it is hypothetically possible. Imagine the case where core 0
is within a sequential segment, and performs a store to address X (owned by, let’s say, core 8), which
begins propagating in the forwarding network. Immediately after, it performs a store to address Y,
which conflicts in the local ring node memory, so evicts address X. Right after that, core 0 attempts
to load address X. Since address X was previously stored in the ring cache, but was just evicted, and
core 0 is not the owner, it must make a remote load request. Imagine that the store to address X
is stalled from forwarding at core 2 due to the eviction of an unrelated address. The remote load
request by core 0 could hypothetically pass the stalled store to address X at core 2, and proceed to
core 3, 4, and eventually 8. At core 8, the load will either load a stale value from the local ring node
memory, or a stale value from the L1. Either way, by passing the store in the forwarding network, an
incorrect value is loaded. For this reason, a ring node prevents the request network from propagat-
ing if any of the buffer entries in its forwarding network contain a store to the same address that is
being remotely loaded. In our example, the remote load in the request network would stall at core 2
until the forwarding network continued propagating. Once the store to address X was processed by
core 8 (i.e., left the forwarding network receive buffer), the remote load would be allowed to access
core 8’s ring node memory. The reply network, on the other hand, does not need to perform any
checks of this nature, since it only delivers already loaded values back to the requester.
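The stall condition in this example reduces to an address comparison against the stores waiting in the local forwarding network buffers. A minimal sketch, assuming each buffer entry is modeled as the address of its store or None when the entry holds no store; the function name is hypothetical.

```python
from typing import Iterable, Optional


def request_may_propagate(load_address: int,
                          forwarding_store_addresses: Iterable[Optional[int]]) -> bool:
    """True if a remote load may move on past this ring node.

    The load stalls while any local forwarding network buffer entry holds
    a store to the same address.
    """
    return all(addr != load_address
               for addr in forwarding_store_addresses
               if addr is not None)
```

In the scenario above, the check at core 2 fails while the store to X is still buffered there, so the remote load for X waits behind it, while a remote load to an unrelated address is free to proceed.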
A.10.2 Load Unit Module
The load unit module is responsible for arbitrating between loads from the request network and the
core, which both want to access the local ring node memory. It is responsible for sending the load to
the actual ring node memory module. Additionally, it contains the logic to both inject and remove
items from the request and reply network. It also returns loaded values from the ring node memory
or reply network to the core. In some ways, it serves a purpose similar to that of the bundleizer
module for the forwarding network.
[Figure A.22 schematic: Load Unit (load_unit.v), parameter CORE_ID = C. The datapath contains two
registers, remoteLoadWaiting (sizeof(requestNetworkBundle) bits), which latches the load bundle
departing the request network and is cleared when it enters the reply network, and remoteLoadData
(32 bits), which latches the load from memory destined for the reply network. On the outgoing request
link, forwarding an existing request bundle always has priority over sending a new load from the core.]
Figure A.22: A simplified schematic of the load unit module datapath.
Figure A.22 depicts a datapath schematic for the load unit, which includes connections to/from
the core, from the request and reply network receive buffers, and to the outgoing request and reply
network links. For simplicity the control logic is not shown. The control FSM and Verilog should
be consulted for control behavior.
Parameters
CORE_ID The load unit needs to know the ring node’s ID, so it can 1) append it to the remote
loads it adds to the request network and 2) detect when a remote load in the request network needs
to exit the network to access its ring node memory.
Inputs
All core inputs to the ring node related to a load instruction (coreCommandAddr, coreCommandValid,
and coreCommandType) are forwarded to the load unit. See Section A.6 for a description of these
signals. In short, they represent all of the information necessary to enqueue a new load instruction
to the ring node.
reset The one-bit reset signal shouldn’t be necessary, as all outstanding state should be zeroed
appropriately when remote loads complete execution.
peakA, peakB These signals carry the addresses from the entries currently in the forwarding net-
work receive buffers. They are inspected to make sure loads in the request network don’t pass stores
to the same address.
leftRequestDepartingBundle, leftReplyDepartingBundle These inputs carry the
oldest bundle currently in the request or reply network receive buffers, respectively. These bundles
only leave the receive buffer if instructed by the load unit. The structure of these signals is shown in
Figure A.21.
leftRequestValidDepartingBundle, leftReplyValidDepartingBundle These one
bit inputs are held high if the corresponding leftXDepartingBundle input is valid.
outboundRequestLinkReady, outboundReplyLinkReady These inputs from the ring
node are held high if the number of request/reply network credits is greater than 0, indicating that
the ring node downstream has at least one buffer entry available for the corresponding network.
requestCompleteLoad This one bit signal from the memory module indicates that the load
sent to it has completed executing, with the resulting data (in the case of a hit) and hit status stored
to dataOutLoad and requestHitLoad. A load may complete on the same cycle as it was sent to the
memory, as our reference design implements the memory with a combinational register array.
requestHitLoad This one bit signal from the memory module is high if the recently com-
pleted load either hit in the local ring node memory, or was properly loaded from the node’s L1 in-
terface (if it either was the address owner, or the address was not present in the bloom filter).
dataOutLoad This 32-bit signal contains the loaded data from a load sent to the memory mod-
ule.
Outputs
leftRequestReleaseBundle This one bit output is raised high if the load unit is either 1)
forwarding the bundle from leftRequestDepartingBundle to the outgoing request network link or
2) removing the remote load from the request network to process locally. In both cases, the buffer
entry from the request network receive buffer is consequently invalidated.
leftReplyReleaseBundle This one bit output is raised high if the load unit is either 1) for-
warding the bundle from leftReplyDepartingBundle to the outgoing reply network link or 2) re-
moving the reply from the reply network, since this node is the one who initiated the remote load
to begin with. In both cases, the buffer slot from the reply network receive buffer is consequently
invalidated.
rightRequestDeparting, rightReplyDeparting These signals correspond to the outgo-
ing network links of the request and reply networks. They are each sized according to the network
bundles in Figure A.21.
rightRequestValid, rightReplyValid These one-bit outputs are raised high if the load unit
is sending a network bundle on the respective network links, rightRequestDeparting and rightReply-
Departing.
addressToLoad This 32-bit signal holds the address of the load that the load unit is sending to
the memory module.
addressToLoadValid This 1-bit signal is high if addressToLoad is a valid address that should be
loaded.
coreLoadResult This 32-bit signal is the loaded data value that the load unit is returning to
the core. The data may have originated from the local memory module, or from the reply network
following a remote load.
coreLoadHit This 1-bit signal is always high, since from the point of view of the core (in the
current implementation), loads always hit, just with a variable latency.
coreLoadProcessed This 1-bit signal is raised high when coreLoadResult and coreLoadHit are
valid, and the load instruction the core is waiting on has been fetched from either the local memory
module or from the request/reply networks.
Datapath
Loads from core Loads directly from the core interface arrive near the beginning of a clock cy-
cle on coreCommandValid, coreCommandAddr and coreCommandType. The load unit makes sure
the core’s new command is a load. If it is, and if there isn’t currently a load from the request network
accessing or waiting to access the memory module (that is, memServicingRemoteLoad is low), the
core’s requested load address is sent to the addressToLoad and addressToLoadValid outputs. The
memory module attempts to load the address from the local ring node memory. If it hits, or if it
loads from its L1, the result is returned to the load unit in dataOutLoad, with a high value on
requestHitLoad and requestCompleteLoad. The result is routed back to the core on coreLoadResult,
with coreLoadProcessed raised high to let the core know the load has finished. If it misses
locally and is unable to access the L1, requestCompleteLoad goes high, but requestHitLoad stays low.
This triggers a request network bundle being constructed with the coreCommandAddr. The load unit
waits until the request network link is ready and the local request network receive buffer is empty,
and then sends the network bundle over the link, on rightRequestDeparting. This request travels to
the adjacent node by the end of the cycle. It continues throughout the network until it arrives at the
owner node of the address, where it accesses the L1, fetches the requested value, and returns it on
the reply network. On every cycle, the output of the reply network receive buffer, leftReplyDepart-
ingBundle, is inspected. If the core ID in the reply network bundle matches the local core ID, the
bundle is removed from the reply network, and its data payload returned to the core on
coreLoadResult, with a high value on coreLoadProcessed. The core is now free to send a new instruction
to the ring node.
Loads from request network The load unit also processes loads from the request network.
On every cycle, the requested address in leftRequestDepartingBundle is inspected to see if it is owned
by the local core. If it is, it is removed from the request network, and the whole bundle is latched
in remoteLoadWaiting. Once the memory module is available (i.e., memServicingCoreLoad is low),
the load is sent to the memory via addressToLoad. The result, either from the ring node memory or
from the L1, is eventually returned on dataOutLoad, which is latched in remoteLoadData. There,
the data waits until the reply network link is ready and the local reply network receive buffer is
empty. Once this is true, a reply network bundle with the fetched data from remoteLoadData, and
the core ID from remoteLoadWaiting are packaged into a bundle and sent over the rightReplyDe-
parting network link, where they travel to the subsequent core by the end of the cycle. Meanwhile,
remoteLoadWaiting and remoteLoadData are cleared. Eventually, the reply bundle returns back to
the original requesting core.
Request/reply network forwarding On every cycle, the load unit also checks if the bun-
dles at the front of the request and reply buffers, in leftRequestDepartingBundle and leftReplyDe-
partingBundle, need to continue propagating to the next ring node. If their corresponding out-
boundXLinkReady inputs are high, and they are not supposed to exit the network at this node,
they are simply forwarded to the corresponding output links, rightRequestDeparting or rightRe-
plyDeparting. The request network bundle also checks the forwarding receive buffer output (the
peak inputs) to make sure its remote load address doesn’t match any of the store addresses being
forwarded. It stalls if any addresses match.
Control
The control FSM is split into one for core loads, in Figure A.23, and one for remote loads, in
Figure A.24. The FSMs are only partially independent: since the memory module can only handle
one load at a time, the core or the request network may need to wait if the other is already
accessing it. Additionally, the request network load (in remoteLoadWaiting) always has priority over a
new core load, since it blocks all items behind it in the request network, and its requester has already
suffered significant latency waiting for it. If the load unit is merely forwarding bundles over the
request and reply networks, rather than injecting or removing bundles, then these FSMs do not apply.
Consult the Verilog for exact control signal logic.
State                       pendingCoreLoad   coreLoadEnqueuingRequestNetwork   coreLoadWaitingForReply
Ready / Normal Loads        0                 0                                 0
Ring Cache Load Pending     1                 0                                 0
Enqueue Request Network     0                 1                                 0
Waiting For Reply Network   0                 0                                 1
memServicingCoreLoad and memServicingRemoteLoad are mutually exclusive, since the ring cache can
only process one load at a time. Transition label recovered from the diagram: ~requestCompleteLoad.
Figure A.23: The FSM and state bits for core loads to the load unit.
State                                pendingRemoteLoad   remoteLoadWaiting   remoteLoadEnqueuingReplyNetwork
Ready                                0                   0                   0
Remote Load Waiting For Ring Cache   0                   1                   0
Ring Cache Load Pending              1                   0                   0
Enqueue Reply Network                0                   0                   1
memServicingCoreLoad and memServicingRemoteLoad are mutually exclusive, since the ring cache can
only process one load at a time. Unlike core loads, remote loads always register as hits, because even
if they miss in the ring cache, they will always access the attached L1 of the remote ring node.
Figure A.24: The FSM and state bits for request network loads to the load unit.
Loads from core
Ready/Normal Loads In this state, the core may initiate a new load if the memory is not
busy servicing another load. If the load hits in the local ring node memory, as indicated by requestHit-
Load and requestCompleteLoad going high by the end of the cycle, then it returns the loaded result
to the core, and stays in the Ready state. If the load doesn’t return immediately, because the L1 is
being accessed (either because the core is the address owner, or the address is not in the bloom filter),
then it transitions to the Ring Cache Load Pending state. If the load returns immediately, but with
requestHitLoad indicating that it missed in the local ring node memory and couldn’t be fetched
from L1, it transitions to the Enqueue Request Network state.
Ring Cache Load Pending This state waits for a core load to return from the memory
module. If it is in this state, then the load is accessing the local L1. Once it eventually returns as a
hit, it transitions back to the Ready state.
Enqueue Request Network In this state, a core load has already attempted to access the
local ring node memory, but the load missed, and was unable to be fetched from L1. Here, the FSM
waits until it is safe to enqueue the remote load into the request network. For deadlock avoidance,
releaseCoreLoadToRequestNetwork goes high only if the request network link is ready (i.e., has at
least one credit), and there isn’t an item currently in the request network receive buffer (i.e.,
leftRequestValidDepartingBundle is low). Even if leftRequestValidDepartingBundle is high, however,
releaseCoreLoadToRequestNetwork can also go high if the item in the request network receive buffer
is exiting the request network to the load unit this cycle. Once the remote load is injected onto the
request network links, the FSM transitions to the Waiting For Reply Network state.
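The injection condition described for this state can be written as a single boolean expression. The sketch below is a behavioral approximation, not the Verilog: the parameter names mirror the signals named in the text and in Figure A.22, and the requirement that the FSM be in the Enqueue Request Network state is assumed to be checked by the caller.

```python
def release_core_load_to_request_network(outbound_request_link_ready: bool,
                                         left_request_valid_departing_bundle: bool,
                                         left_request_leave_network: bool) -> bool:
    """Whether a pending core load may be injected into the request network.

    Injection needs a credit on the outgoing link, and the local receive
    buffer must either be empty or have its front item exiting the network
    at this node in the same cycle.
    """
    buffer_not_competing = (not left_request_valid_departing_bundle
                            or left_request_leave_network)
    return outbound_request_link_ready and buffer_not_competing
```

This gives bundles already in the network strict priority for the outgoing link, which is what keeps a newly injected item from consuming buffer space that in-flight traffic needs to make progress.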
Waiting for Reply Network The core remains in this state until its remote load has been
serviced by a remote core, and the loaded data returned over the reply network. This could be a po-
tentially very long time, depending on network contention, and whether the load hit in the remote
ring node’s memory, or if it had to go to its L1. The incoming reply network link is inspected ev-
ery cycle, and when the core ID in the reply network bundle matches the local core ID, the data is
returned to the core, and the FSM transitions back to the Ready state.
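The four core-load states and their transitions can be sketched as a next-state function. This is an illustrative model of the prose above, not the control logic of the reference design: the state names follow Figure A.23, while the input flags (load_issued, complete, hit, injected, reply_for_me) are hypothetical simplifications of the actual control signals.

```python
from enum import Enum


class CoreLoadState(Enum):
    READY = "Ready / Normal Loads"
    RING_CACHE_LOAD_PENDING = "Ring Cache Load Pending"
    ENQUEUE_REQUEST_NETWORK = "Enqueue Request Network"
    WAITING_FOR_REPLY_NETWORK = "Waiting For Reply Network"


def core_next_state(state, *, load_issued=False, complete=False, hit=False,
                    injected=False, reply_for_me=False):
    """Next state of the core-load FSM for one clock cycle."""
    S = CoreLoadState
    if state is S.READY:
        if not load_issued:
            return S.READY
        if complete and hit:
            return S.READY                      # hit in ring node memory
        if complete:
            return S.ENQUEUE_REQUEST_NETWORK    # missed, L1 not allowed
        return S.RING_CACHE_LOAD_PENDING        # L1 access in flight
    if state is S.RING_CACHE_LOAD_PENDING:
        return S.READY if complete else state
    if state is S.ENQUEUE_REQUEST_NETWORK:
        return S.WAITING_FOR_REPLY_NETWORK if injected else state
    return S.READY if reply_for_me else state   # WAITING_FOR_REPLY_NETWORK
```

A load that hits locally never leaves the Ready state, which reflects the intended common case; the other three states exist only for L1 accesses and remote loads.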
Loads from request network
Ready In this state, the load unit waits until there is a bundle in the request network receive
buffer whose load address is owned by the local ring node. Once there is, it is removed from the
request network and stored in the remoteLoadWaiting register. The FSM transitions into the Remote
Load Waiting For Ring Cache state.
Remote Load Waiting For Ring Cache In this state, the latched remote load waits until
the ring node memory is free. As with loads from the core, the load address is sent to the memory.
If the data returns immediately (requestCompleteLoad goes high before the end of the cycle), then
the load hit in the local ring node memory. The data is stored in remoteLoadData, and the FSM
transitions to the Enqueue Reply Network state. If the load doesn’t return by the end of the cycle,
it must have missed in the local ring node memory, and is accessing the L1. Unlike with core loads,
these remote loads that miss in the ring node memory will always access the L1, since they were re-
moved from the network specifically because they arrived at the address owner node. In this case,
the FSM transitions to the Ring Cache Load Pending state.
Ring Cache Load Pending This state waits for a remote load to return from the memory
module. If it is in this state, then the load is accessing the local L1. Once it eventually returns as a hit,
it transitions to the Enqueue Reply Network state.
Enqueue Reply Network In this state, the load from the request network has obtained its
value, either from the ring node memory or the local L1. It waits until it is safe to enter the reply net-
work. It is safe (as far as deadlock avoidance is concerned) to do so when the reply network link is
ready (i.e., has at least 1 credit) and the reply network receive buffers are empty (i.e., leftReplyValid-
DepartingBundle is low). It is also safe to do so if the item at the front of the reply network receive
buffer is exiting the network this cycle. A reply network bundle is constructed with the loaded data
as a payload, and is sent over the reply network links to the subsequent core. The FSM returns to
the Ready state.
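The injection condition described above can be sketched as a small predicate. This is only an illustrative model, not the Verilog logic: the function and argument names are hypothetical, and it assumes that a link credit is required in both of the "safe" cases, which the text does not state explicitly for the second case.

```python
def can_inject_reply(link_credits, rx_buffer_valid, front_exits_this_cycle):
    """Return True if a locally generated reply may enter the reply network.

    link_credits: credits available on the outgoing reply link.
    rx_buffer_valid: True if the reply receive buffer holds a bundle
        (the text's leftReplyValidDepartingBundle being high).
    front_exits_this_cycle: True if the bundle at the front of the receive
        buffer leaves the network at this core in the current cycle.
    """
    link_ready = link_credits >= 1
    # Nothing already in the network needs the outgoing link this cycle.
    nothing_to_forward = (not rx_buffer_valid) or front_exits_this_cycle
    # Assumption: a credit is needed regardless of which case applies.
    return link_ready and nothing_to_forward
```

Under this reading, a waiting bundle that must be forwarded always wins over a new injection, which matches the stated priority of propagation over injection.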
Request/reply network forwarding As previously mentioned, the request and reply networks
can propagate any bundles already in the network as long as the outgoing network links are
ready (i.e., have at least one credit). Additionally, the request network can only propagate a remote load if its
address doesn’t match any address being stored by a bundle in the local
forwarding network buffers. For deadlock avoidance, propagating existing network bundles always
has priority over injecting new items.
[Figure A.25 shows two views of a 32-bit memory address. Ring cache view: bits 31:10 are the ring cache tag, bits 9:2 are the ring cache index, and bits 1:0 are 00. Ownership view: bits 5:0 are the L1 line offset and bits 9:6 are the owner core ID.]
Figure A.25: For a 1 KB direct-mapped cache with a line size of a single 32-bit word, there are 8 index bits. For 16
cores, 4 bits of the address are required to determine which core owns it. Since owners must be assigned on L1
cache line granularity, the bits above the L1 line offset are used to determine ownership. For a typical 64 byte L1 line
size, the ownership bits completely overlap with the ring cache array index bits, which results in simpler hardware to
track which indexes a particular ring node needs to flush.
A.11 Ring Cache Memory
The memory module of ring cache facilitates fast access to the shared data contained within. It
caches data written by cores within sequential segments, and provides the cores fast access to the
same data when it is read within sequential segments. The memory has two logical ports: one dedicated
to loads, and one dedicated to stores. For optimal performance, the actual storage within
the array module is currently implemented with a register array, which enables a combinational load,
a combinational tag read, and an edge-triggered write to occur in a single cycle. While it could be
made to be set-associative (as in the original HELIX-RC paper), our reference implementation uses
a direct-mapped structure (performance was only very slightly impacted for one benchmark), which
makes certain other optimizations more straightforward. The cache line is a single word, as dictated
by the compiler to avoid false sharing. Each ring node memory module also interacts with the rest of
the cache hierarchy for any memory addresses that it is the owner of (for load misses and evictions).
A bloom filter is used to increase the number of load misses that a ring node memory can fetch from
its local L1, rather than requesting from a remote core’s L1. At the end of every loop invocation, the
memory is responsible for flushing its contents back to the L1, as mentioned in Section A.8.
Before describing the contents of the memory module in detail, we first describe the concept of
an owner core, in addition to the different read and write modes of the ring cache memory. Then we
go on to describe the memory module and its submodules, the array module and the bloom filter
module.
Address Ownership
For correct execution, access to shared data within sequential segments must be sequentially consis-
tent. This property is guaranteed by the compiler’s placement of wait and signal instructions. These
instructions, in concert with the ring cache hardware, guarantee that cores can only read or write
shared values in correct loop iteration order, within the appropriate sequential segments.
The ring cache memory size is limited, so at some point it must interact with the normal cache
hierarchy. The core may try to load addresses that were previously evicted or haven’t been written to
the ring cache yet – perhaps those written by a previous parallel loop, or by code outside any loop.
Additionally, conflicts in the ring cache memory will result in evicted values needing to be written
back to the L1. The latter presents a problem, since it in effect means that any ring node could write
any particular address to the L1 at any point in time, even outside of a sequential segment. This is
even more concerning since the address being evicted may also be currently circulating throughout
the forwarding network. If this is the case, two cores (one preceding the circulating store and one
following it) may write back two different values for the same address, and a race ensues. This scenario
is described in more depth in Section A.10.
For this reason, we establish the notion of unique core ownership of a memory address. Fig-
ure A.25 depicts how the owner core is derived from a 32-bit address, in addition to the ring cache
array index and tag, for our reference design of 1 KB of storage, an assumed L1 line size of 64 bytes,
and 16 cores. In order to avoid false sharing of L1 cache lines, which would incur cache coherence
communication costs, different cache lines are assigned different owner cores. Owner cores are selected
by using the log2(numCores) bits immediately above the L1 line offset. Notice that the owner
core ID bits of an address fall entirely within the ring cache index bits. This implies that, for our
reference configuration, certain ring cache array indexes always map to the same owner core. This
coincidence makes it easier to tell which values a ring node will need to write back to the L1 at flush
time. Since there are only 256 sets in 1 KB of memory, only 16 bits per core are required (256 sets / 16
cores) to keep track of which of these indexes currently holds an owned address. If the owner core
ID bits did not fall within the ring cache index bits this way, 256 bits would be required per core.
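As a rough illustration of the address split described above, the following sketch derives the tag, array index, and owner core from a 32-bit byte address for the reference configuration (256 single-word sets, 64-byte L1 lines, 16 cores). The function and constant names are hypothetical and are not taken from the Verilog sources.

```python
WORD_OFFSET_BITS = 2   # ring cache line is one 32-bit word
LINE_OFFSET_BITS = 6   # assumed 64-byte L1 line
NUM_SETS = 256         # 1 KB of storage / 4 bytes per set
NUM_CORES = 16

INDEX_BITS = NUM_SETS.bit_length() - 1  # 8 for the reference design


def split_address(addr):
    """Return (tag, index, owner) for a 32-bit byte address."""
    index = (addr >> WORD_OFFSET_BITS) & (NUM_SETS - 1)   # bits 9:2
    tag = addr >> (WORD_OFFSET_BITS + INDEX_BITS)         # bits 31:10
    owner = (addr >> LINE_OFFSET_BITS) & (NUM_CORES - 1)  # bits 9:6
    return tag, index, owner


def owned_indexes(core_id):
    """Array indexes that can ever hold an address owned by core_id."""
    shift = LINE_OFFSET_BITS - WORD_OFFSET_BITS  # owner bits sit at index[7:4]
    return [i for i in range(NUM_SETS) if (i >> shift) == core_id]
```

Because the owner bits are the upper four bits of the index, `owned_indexes` returns exactly 16 indexes for each core, which is why only 16 tracking bits per core are needed in this configuration.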
Reads
Reads have two modes of operation. In normal mode, incoming addresses access the ring cache
memory array. Since the array is implemented with registers instead of SRAM, the result of the
load is known combinationally, before the end of the cycle. If the load was a hit, the result is re-
turned to the requester, the load unit module. However, if the load misses in the ring cache, and the
attached core is the owner of the address, then the memory module fetches it from the L1 instead.
It then returns the result to the load unit, indicating a hit. If the load misses in the ring cache, but
the attached core is not the owner, the load unit is informed that the load missed, and subsequently
initiates a remote request through the request/reply networks to another core’s L1 to fetch the value.
Many of these remote requests can be avoided if it is known for certain that the address has not yet
been written to the ring cache during this loop. If we know that this is the case, it is impossible
that the address is currently being written back by some remote ring node. It is therefore
safe in this case for a ring node that is not the owner of the address to load it directly from its L1. By
adding all written addresses to the bloom filter module, a load that misses in the ring cache memory
is able to determine whether it can just access the local L1, instead of suffering a costly trip around
the request/reply networks. Since bloom filters only have false positives, and no false negatives, a
poorly sized bloom filter can result in extra remote requests, but never incorrect behavior.
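The read behavior described in this subsection reduces to a three-way decision, sketched below. The function name and the returned labels are hypothetical; the second element of the result mirrors what the text later calls requestHitLoad.

```python
def resolve_load(rc_hit, is_owner, in_bloom_filter):
    """Decide where a load is serviced.

    Returns (source, request_hit), where source is one of
    "ring_cache", "local_l1", or "remote_request".
    """
    if rc_hit:
        return "ring_cache", True
    # A miss may use the local L1 if this node owns the address, or if the
    # address has provably never been written to the ring cache this loop.
    if is_owner or not in_bloom_filter:
        return "local_l1", True
    # Otherwise the load must travel over the request/reply networks.
    return "remote_request", False
```

A bloom filter false positive only changes `in_bloom_filter` from False to True, which moves a load from the local L1 case to the remote case. That costs time but cannot return stale data.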
Writes
Writes have three modes of operation. For typical stores, incoming addresses perform the tag lookup
combinationally, and the write itself at the clock edge. Simultaneously, the ownerBitset is updated
to reflect whether the array index it was written to now contains an owned address. The address is
also inserted into the bloom filter. If the array index that was written previously held a valid
value for a different address, that value is evicted. If the attached core is the owner of the evicted
address, the value is written to the local L1 interface. If the
core is not the owner, the value is simply discarded. The memory module doesn’t allow any more
writes (which stalls the forwarding network) until the L1 has confirmed the value has been written
to the normal cache hierarchy. Having only one outstanding eviction simplifies the logic, and allows
the ring cache to rely on the L1 cache to satisfy any loads to recently evicted values, as detailed in Section
A.6.2. If the eviction buffer were made larger, then any loads that miss in the ring cache memory
would have to search it to make sure they aren’t bypassing stores to the address they are loading from.
At the end of every loop invocation, the signal buffer instructs the memory module to flush its
contents to the L1. The memory module inspects its ownerBitset to determine which ring cache ar-
ray indexes contain owned addresses, and writes them back to the L1 one by one, clearing the owner-
Bitset as it goes. After the L1 confirms that every flushed value has been written to the normal cache
hierarchy, the memory module informs the core that the flush has completed.
A.11.1 Memory Module
Figure A.26 depicts a simplified schematic of the memory module. It contains the bloom filter and
array submodules. The reference implementation uses a 1 KB direct-mapped cache, since its performance
is not noticeably lower than that of a set-associative cache. For simplicity, certain control
logic is not shown, and some control signals were left unrouted. For fully commented control sig-
nals and output logic, please see the Verilog implementation of the module.
[Figure A.26 (memory.v): datapath overview of the memory module. Parameters: CORE_ID = C, NUM_SETS = S, NUM_INDEX_BITS = D = log2(S), NUM_TAG_BITS = 32 - NUM_INDEX_BITS - 2. The diagram connects the array and bloom filter submodules to the S-bit ownerBitset register and its priority encoder (nextIndexToFlush), the pendingEvictAddr, pendingEvictData, and pendingL1LoadAddr registers, and the control block (STATE_READY, STATE_FLUSH). Some control signals are not routed, for clarity.]
Figure A.26: A simplified schematic of the memory module datapath.
Parameters
CORE_ID The memory module must be aware of the core ID so it can determine if it is the
owner of an address.
NUM_SETS In our reference design we use 256 sets. With a line size of a single 4 byte word, this
makes the total cache size 1 KB.
NUM_INDEX_BITS, NUM_TAG_BITS An address is split into index and tag bits, as shown
previously in Figure A.25.
Inputs
reset The reset signal is used to clear all state after the end of a loop invocation. In particular, the
memory array and bloom filter bits must be cleared for proper operation.
addressLoad This 32-bit address input comes from the load unit module, and it is the load to be
performed this cycle. It is stable early in the clock cycle.
inputValidLoad If high, addressLoad is valid, and the load should be performed.
addressStore, dataStore A 32-bit address in which to store the 32-bit data value. Originates
from the output of the bundleizer module. Should be stable early in the clock cycle.
inputValidStore If high, dataStore should be written to addressStore this cycle.
startFlush This one-bit output from the signal buffer module is raised high when it has re-
ceived a flush signal from every other core. It causes the memory to transition from normal opera-
tion into the flush state.
writebackAccepted, writebackComplete These one-bit signals, related to writebacks to the
L1, are directly connected to the cache interface. See Section A.6.2 for details.
cacheLoadAccepted, cacheLoadComplete These one-bit signals are related to loads from the L1
cache. See Section A.6.2 for details.
cacheLoadData This 32-bit signal is the data returned by a load to the L1. See Section A.6.2 for
details and timing.
Outputs
writeReady This one-bit signal is raised high if the memory module can accept a store this cycle
(i.e., it is not stalled performing a writeback). Since it is based on state bits, it is stable at the beginning
of a clock cycle. The bundleizer module uses this signal to know if a circulating bundle in the
forwarding network can enqueue its store and continue propagating. The forwarding network does
not need a write completion confirmation from the memory to continue propagating, only the
ready signal to initiate a new store.
requestCompleteLoad This one-bit output indicates to the load unit that a load has com-
pleted. In the case of an immediate ring cache hit, it may be raised as soon as the end of the clock
cycle that inputValidLoad was raised high. It may be delayed several or tens of cycles if a miss in the
ring cache resulted in an L1 load.
requestHitLoad This one-bit signal is raised high if the load hits in the ring cache, or if the
memory module loaded the value from the L1. It is only held low (indicating a miss) if the load cannot
be satisfied by this ring node, and instead must be sent over the request/reply networks. This
is only the case if the ring node is not the address owner, and the address has been previously written
to the ring cache (i.e., the address is present in the bloom filter). This signal is valid only when
requestCompleteLoad is raised.
dataOutLoad This 32-bit signal contains the loaded data, and is valid only in the case that re-
questHitLoad is high.
writebackAddr, writebackData These 32-bit address and data wires are the values to be
written to the L1 cache. See Section A.6.2.
writebackValid A one-bit signal that is raised high if the memory module is performing a
writeback to L1. See Section A.6.2.
cacheLoadAddr A 32-bit address that the memory module wants to load from the L1. See Sec-
tion A.6.2.
cacheLoadValid This one-bit signal is raised high if cacheLoadAddr is a valid load to be per-
formed. See Section A.6.2.
Datapath
Loads Loads enter the memory module through the addressLoad and inputValidLoad inputs
near the beginning of a cycle, originating from the load unit. There, they access a dedicated port of
the memory array to load the data. Note that port1 of the memory array has the write signals set to
0. In our current implementation the port is dedicated to reads, but this doesn’t necessarily need to
be the case. In parallel with the array load, the bloom filter is checked to see if this memory address
has been previously written to the ring cache. If the load hits in the ring cache memory, the result-
ing data is returned on the dataOutLoad output, with requestHitLoad and requestCompleteLoad
being set high. In this case these signals are valid at the end of the same clock cycle the load began,
since our memory array performs combinational reads. If the load misses in the ring cache, it either
is requested from the L1 cache interface, or returned as a miss with requestHitLoad set low and
requestCompleteLoad set high, which indicates to the load unit that the load needs to be sent out over
the request/reply networks. The load can be serviced by the local L1 under either of two conditions: if the
owner of the address to be loaded is the local core, or if the hashTableMiss output of the bloom filter
is high, indicating that this address has not been written to the ring cache previously. If the load
is sent to the L1, the cacheLoadAddr and cacheLoadValid outputs are set appropriately. Once the
cache responds, potentially many cycles later, requestHitLoad and requestCompleteLoad are set high,
and the loaded data is passed on to dataOutLoad.
Stores Stores are sent to the memory module by the bundleizer module on the addressStore,
dataStore, and inputValidStore inputs. These signals are routed to port2 of the memory array where
they perform a tag lookup combinationally, and write the new value from dataStore into the ar-
ray on the clock edge. In parallel, addressStore is routed to the addrToSet input of the bloom filter,
where it is hashed and added to the filter bitset. Also in parallel, the index of the address is used to
write the corresponding bit in the ownerBitset. If the address is owned by this core, the bit whose
position matches the array index of the address is set to 1, otherwise to 0. In our reference design,
since the bits of the address used to determine the owner core overlap completely with the cache
array index bits, and we use a direct-mapped cache, only a limited subset of bits in the ownerBitset
could ever be set – the rest are optimized away by synthesis.
If the store to the memory array triggered an eviction, the evicted address/value may need to be
written to the L1 cache. If this ring node is not the owner of the evicted address, no action is taken.
If it is the owner, the address/value is sent to the L1 cache writeback interface. While a value is being
written back, the writeReady output is held low, to prevent any more stores from being performed
until the L1 cache has confirmed the write, which could take several cycles or more. After completing
the writeback, writeReady returns high to allow a new store to be performed.
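The store path described above can be sketched behaviorally as follows. This is a simplified model with hypothetical names: it assumes word-aligned addresses, stores whole addresses instead of tags, and leaves out the bloom filter insertion and the writeReady stall.

```python
class StorePathModel:
    """Direct-mapped store path with an ownerBitset, for one ring node."""

    def __init__(self, core_id, num_sets=256, num_cores=16, line_offset_bits=6):
        self.core_id = core_id
        self.num_sets = num_sets
        self.num_cores = num_cores
        self.line_offset_bits = line_offset_bits
        self.entries = [None] * num_sets  # None (invalid) or (address, data)
        self.owner_bitset = 0

    def owner_of(self, addr):
        return (addr >> self.line_offset_bits) & (self.num_cores - 1)

    def store(self, addr, data):
        """Write data; return (address, data) to write back to L1, or None."""
        index = (addr >> 2) & (self.num_sets - 1)
        old = self.entries[index]
        writeback = None
        # An eviction happens only when a valid entry for another address
        # is overwritten; it reaches the L1 only if this node owns it.
        if old is not None and old[0] != addr:
            if self.owner_of(old[0]) == self.core_id:
                writeback = old
        self.entries[index] = (addr, data)
        # Set the bit for owned addresses, clear it otherwise.
        if self.owner_of(addr) == self.core_id:
            self.owner_bitset |= 1 << index
        else:
            self.owner_bitset &= ~(1 << index)
        return writeback
```

In the hardware, a non-None return value corresponds to the period in which writeReady is held low until the L1 confirms the write.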
Flush When the memory module receives a high value on the startFlush input, it transitions to
the flush state. In this state, the ownerBitset is consulted to determine which indexes of the cache
array must be written back to the L1. The ownerBitset array is fed into a priority encoder, which
gives a binary representation of the least significant bit set in the bitset. This represents an index
of the cache array that must be written back to the L1, since it is owned by this core. The memory
module fetches this data value from the array, and initiates a writeback to the L1, like for an eviction.
Unlike for normal evictions, the flush mechanism doesn’t wait for the writeback to complete, only
that it is accepted by the cache interface. Meanwhile, the ownerBitset clears the bit corresponding
to the index that was just written back, and uses the output of the priority encoder to fetch the next
array index to flush. Once the ownerBitset is equal to 0, finishedFlush is raised to indicate to the rest
of the ring node that the flush is done.
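The flush walk can be sketched in software as repeatedly selecting and clearing the least significant set bit of the ownerBitset. This is a minimal model with hypothetical function names; the hardware does the selection with a priority encoder, and the writeback handshake is omitted here.

```python
def lowest_set_bit_index(bitset):
    """Software stand-in for the priority encoder (bitset must be nonzero)."""
    return (bitset & -bitset).bit_length() - 1


def flush_order(owner_bitset):
    """Return the array indexes in the order the flush would write them back."""
    order = []
    while owner_bitset != 0:
        index = lowest_set_bit_index(owner_bitset)
        order.append(index)
        owner_bitset &= ~(1 << index)  # clear the bit just written back
    return order
```

For example, `flush_order(0b101001)` visits indexes 0, 3, and 5, after which the bitset is zero and the flush walk is finished.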
Control
Figure A.27 depicts the FSM for any loads to the memory module, whereas Figure A.28 depicts
the FSM for stores, writebacks, and flushes. For loads, the control states are fairly simple. In the
Ready/Normal state, loads that hit in the ring cache are performed at the rate of one cycle per load.
If a load misses in the local ring cache, and it is either owned by this core or is not present in the
bloom filter, an L1 load begins by setting the cacheLoadValid and cacheLoadAddr outputs. If nei-
ther condition is true, the load immediately returns as a miss. Once the load is accepted by the cache
interface, we transition to another state where we wait for the cache to return the loaded value. Once
it does, the loaded data is returned and normal loads resume.
A similar dynamic happens for evictions from the ring cache, except with the writeback interface
instead of the load interface. For stores, the Ready/Normal state processes one store per cycle as long
as there aren’t any evicted values to write back. If there are, the store port transitions to initiate the
L1 writeback, where the evicted address and value are presented to the cache interface. Once they are
accepted by the interface, we transition to a waiting state, where we remain until the cache confirms
the store has completed.
[Figure A.27: FSM diagram with three states: Ready / Normal Loads, Initiate L1 Load, and L1 Accepted Load. State bits:]
State          readState         pendingL1LoadValid   pendingL1LoadAccepted
Ready          `STATE_READY      0                    0
Initiate L1    `STATE_L1_LOAD    1                    0
L1 Accepted    `STATE_L1_LOAD    1                    1
issueL1Load = (Miss in RC memory) AND (Load Address is owned by the local ring node OR Load Address is not in Bloom Filter)
Figure A.27: The FSM and state bits for memory module loads.
For flushes, the logic is slightly more complex. When the startFlush signal is received, the port
transitions to the start flush state. If there aren’t any bits set in the ownerBitset, then the node has
nothing to flush, so it transitions back to the Ready state. If there are bits set in the ownerBitset, then the
corresponding array indexes are first fetched, then written to the cache interface, while the bit is
cleared in the ownerBitset. Unlike with normal evictions, we don’t wait until stores are confirmed,
only until they are accepted by the cache. A count of the number of writes that have yet to be confirmed
is saved in the evictionsAwaitingL1Confirmation register, which is incremented for every
initiated writeback, and decremented whenever cacheLoadComplete goes high for a cycle. After
the ownerBitset is completely empty, we idle until evictionsAwaitingL1Confirmation has reduced to 0,
and then transition back to normal operation.
[Figure A.28: FSM diagram for stores and flushes. Store states: Ready / Normal Stores, Initiate L1 WB, L1 Accepted WB. Flush states: Start Flush, Initiate Next Flush WB, L1 Accepted Flush WB, Awaiting L1 Complete, Flush Done. Edge labels: startFlush, ownerBitset != 0, ~ownerBitset. State bits for stores:]
State             writeState                   pendingEvictValid
Ready             `STATE_READY                 0
Initiate L1 WB    `STATE_PENDING_EVICTION      1
L1 Accepted WB    `STATE_PENDING_EVICTION      0
initiateNewWriteback = (A normal store triggers an eviction AND the evicted address is owned by the local ring node) OR (Flush is active and there are still more data items to writeback)
[State bits for flushes:]
State                     writeState      pendingEvictValid   flushWalkDone   evictionsAwaitingL1Confirmation
Ready                     `STATE_READY    0                   0               0
Start Flush               `STATE_FLUSH    0                   0               0
Initiate Next Flush WB    `STATE_FLUSH    1                   0               Previous value + 1 - cacheLoadComplete
L1 Accepted Flush WB      `STATE_FLUSH    0                   0               Previous value - cacheLoadComplete
Awaiting L1 Complete      `STATE_FLUSH    0                   1               Previous value - cacheLoadComplete
Flush Done                `STATE_FLUSH    0                   1               0
evictionsAwaitingL1Confirmation increments every time a new flushed item is accepted by the L1 cache, and decrements whenever the L1 confirms that it has completed a store.
Figure A.28: The FSM and state bits for memory module stores and flushes.
[Figure A.29 (array.v): schematic of the array submodule. Parameters: NUM_SETS = S, NUM_INDEX_BITS = D = log2(S), NUM_TAG_BITS = 32 - NUM_INDEX_BITS - 2, SET_SIZE = 1 + NUM_TAG_BITS + DATA_WIDTH. The register array holds S registers of SET_SIZE bits each; a cache entry consists of a valid bit, a tag, and data. Reads are combinational and writes are edge triggered. An eviction is signaled if the access missed in the array, was a write, and the written register was valid. Port2 has identical logic and is not shown.]
Figure A.29: A schematic of the array submodule. The tag and data storage of ring cache is implemented with an
array of registers to enable combinational loads and tag lookups.
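To illustrate why the flush counts outstanding writebacks instead of waiting for each one, the following toy cycle model tracks the confirmation counter. It rests on assumptions that are not in the text: exactly one writeback is accepted per cycle while owned entries remain, and every writeback is confirmed a fixed `l1_latency` cycles after it is accepted. The function name is hypothetical.

```python
def simulate_flush(owner_bitset, l1_latency):
    """Return the number of cycles until the flush can finish."""
    outstanding = 0       # models evictionsAwaitingL1Confirmation
    confirm_cycles = []   # cycles at which L1 confirmations arrive
    cycle = 0
    while owner_bitset != 0 or outstanding != 0:
        # Confirmations arriving this cycle decrement the counter.
        outstanding -= confirm_cycles.count(cycle)
        if owner_bitset != 0:
            owner_bitset &= owner_bitset - 1   # clear lowest set bit
            outstanding += 1                   # writeback accepted
            confirm_cycles.append(cycle + l1_latency)
        cycle += 1
    return cycle
```

In this model, flushing n owned entries takes about n + l1_latency cycles, whereas waiting for each confirmation before issuing the next writeback would take roughly n * l1_latency cycles.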
A.11.2 Array Module
Figure A.29 depicts the array submodule. It provides an interface with two read/write ports to an array of
registers. This array of registers serves as the primary storage for the ring cache memory. Each register
in the array contains a valid bit, a cache tag, and a data value. Since ideally we want ring cache to
be able to process a load and a store simultaneously in one cycle, the combinational reads provided
by the register array allow for a tag lookup and store in the same cycle, as well as a load that does not
require a clock edge. If it is found that this performance is not needed, an SRAM could be substi-
tuted in this module. Some control logic in the array module and memory module will need to be
tweaked to account for loads requiring a clock edge, and stores requiring two cycles.
The array module treats both port1 and port2 equally, even though port1 is dedicated for loads,
and port2 is dedicated for stores and the flush. Synthesis will remove any unnecessary extra logic.
Parameters
NUM_SETS This is the number of sets in the register array / cache. Since the current implemen-
tation just uses a direct-mapped structure, the total cache capacity is the number of sets multiplied
by the size of a word, 32 bits.
NUM_INDEX_BITS The number of index bits needed to address NUM_SETS cache entries.
NUM_TAG_BITS The number of tag bits required to differentiate addresses in the cache.
SET_SIZE The number of bits in a cache entry is 1 (for a valid bit) plus NUM_TAG_BITS plus 32
bits for data.
Inputs
Since port1 and port2 of the array have identical logic, descriptions will be provided generically.
reset The one-bit reset signal must be raised between loop invocations to clear out every register
in the array.
portXValid This one-bit signal indicates whether the other inputs are valid.
portXAddress The 32-bit address to be loaded from or stored to the array.
portXWriteEnable If this is low, perform a load. Otherwise, perform a tag lookup and a store.
portXWriteData A 32-bit data value to be stored in the array. Only valid if portXWriteEnable
is high.
Outputs
Since port1 and port2 of the array have identical logic, descriptions will be provided generically.
portXComplete This one-bit signal is raised high if the outputs of the module are valid. In our
current implementation, since the array can be read combinationally, all accesses are complete by the
end of the cycle that they become valid inputs.
portXHit This one-bit output is raised high if an access to the array resulted in a tag match.
portXDataOut In the case of a hit, this 32-bit output contains the accessed data from the array.
portXExistingData This 32-bit output contains the data that was contained in a cache entry
before a write occurred. It is used to write back a value that has been evicted, or during a memory
flush.
portXExistingAddr This 32-bit output contains the address that was previously stored to a
cache entry before a write occurred. Along with portXExistingData, it is used to write back an
evicted/flushed piece of data to the L1 cache.
portXEviction This one-bit signal is raised high if a store missed in the array, but the index it
accessed previously contained valid data, which must now be written back to L1.
Datapath
The array module contains typical cache logic. First, the address on the portXAddress input port
is split into index and tag bits, depending on the size of the register array. The index bits are used
to access one of the registers in the array, which contains the appropriate cache entry. Meanwhile,
if portXWriteEnable is high, the data to be written from portXWriteData is packaged into a cache
entry by combining a valid bit, tag bits from the address, and the data itself. Combinationally, the
proper register in the array is read, either in service of a load or a tag lookup for a store. The result-
ing read entry is split into a valid bit, tag, and data. If the valid bit is not set, then either a load has
missed, or a store has nothing to evict. If the valid bit is set and the tag bits of the requested address
and the read cache entry match, then either the load or store has hit. If the request was a store, no
eviction is required. The loaded data is placed on portXDataOut and portXHit is set appropriately.
If the tag bits don’t match, then either a load has missed, or in the case of a store, the previously
stored value must be evicted. The previously stored value will be found on portXExistingData,
along with its reconstructed address on portXExistingAddr. At the clock edge, if portXWriteEn-
able is high, then the constructed write entry is stored to the register indicated by the input address’
index bits.
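The array datapath described above can be approximated by the following single-port sketch. It is a software model with hypothetical names, not the Verilog module: the write at the end of the function stands in for the clock-edge update, and the returned dictionary keys loosely mirror the portX outputs.

```python
NUM_SETS = 256
INDEX_BITS = 8  # log2(NUM_SETS)


def new_array():
    """Every entry starts invalid: (valid, tag, data)."""
    return [(False, 0, 0)] * NUM_SETS


def access(array, addr, write_enable=False, write_data=0):
    """Perform one load (write_enable False) or one tag lookup plus store."""
    index = (addr >> 2) & (NUM_SETS - 1)
    tag = addr >> (2 + INDEX_BITS)
    valid, old_tag, old_data = array[index]       # combinational read
    hit = valid and old_tag == tag
    eviction = write_enable and valid and not hit
    # Rebuild the previously stored address from its tag and the index.
    existing_addr = (old_tag << (2 + INDEX_BITS)) | (index << 2)
    result = {
        "hit": hit,
        "data_out": old_data,
        "eviction": eviction,
        "existing_data": old_data,
        "existing_addr": existing_addr,
    }
    if write_enable:
        array[index] = (True, tag, write_data)    # models the clock-edge write
    return result
```

A store to an address whose index is occupied by a different valid tag reports an eviction, and `existing_addr` together with `existing_data` give the value that may need to go back to the L1.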
Control
Other than writing the appropriate register on the clock edge, there isn’t any control to speak of.
A.11.3 Bloom Filter Module
Bloom filters consist of a bit array and one or more hash functions. They are used as a resource effi-
cient way to track membership of a set, but without the cost of storing every member of a set. Items
to be inserted into the bloom filter are hashed by one or more different hash functions. The result-
ing hashes are used to set bits in the bit array. To check if an item is already in the set, it is once again
hashed by the same hash functions, and the corresponding bits are inspected to see if they had been
previously set. This organization means that an item may be incorrectly reported as being present
in the set (a false positive), but an item that is present is never reported as absent (there are no
false negatives). By sizing the bloom filter bit array large enough, and by providing a sufficient
number of hash functions, false positives can be minimized.
[Figure A.30 (bloom_filter.v): schematic of the bloom filter submodule. Parameters: BITS_IN_TABLE = B, INDEX_BITS = D = log2(B). Each hash block computes out = (addr * hashSeed)[31:32-D], with hash seeds 32'h2ABF7209 and 32'h1A8FCEE7. The hash outputs drive D-to-B decoders whose outputs are ORed into the B-bit tableBitSet register when inserting a new address, and B-to-1 bit select muxes when checking an address.]
Figure A.30: A schematic of the bloom filter submodule. As shown here, two hash functions are used per address
being checked/set.
The bloom filter plays a very important performance role for the ring cache. The bloom filter
module hashes every address that is being stored into ring cache, and provides an interface to check
whether an address being loaded has previously been stored. If an address being loaded has previ-
ously been stored in the ring cache, but subsequently evicted, a remote load must be sent over the
request/reply networks to fetch it from the L1 of the owner core of that address. If an address being
loaded was never previously stored in the ring cache, it may load from the L1 of the local core that is
requesting it. For some loops in some benchmarks, the number of remote loads is very high without
the bloom filter. Some of those remote loads are the result of cold cache load misses. Others result
from the compiler, which must be conservative in the face of memory aliases – if it can’t prove an
address is not shared, it must assume it is shared. This causes a number of loads to the ring cache for
data that isn’t actually shared, that will never actually be stored in the ring cache. The bloom filter,
assuming the false positive rate is not too high, removes a significant portion of these unnecessary
remote loads.
For hash functions we implemented multiply-and-shift hashing. Each 32-bit address is multiplied
by a randomly chosen hash seed, with only the least significant 32 bits of the product kept. These
32 bits are then shifted right until only the number of bits required to index into the bloom
filter bitset (INDEX_BITS) remain. The bit at the resulting index is then read or set, depending
on whether we are checking for membership or inserting an item.
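The multiply-and-shift step can be illustrated with a minimal Python sketch. This is a software model under stated assumptions, not the Verilog reference design: the function name multiply_shift_hash is hypothetical, and the bit selection follows the (addr * hashSeed)[31:32-D] expression from Figure A.30.

```python
MASK32 = 0xFFFFFFFF  # keep only the least significant 32 bits of the product


def multiply_shift_hash(addr: int, hash_seed: int, index_bits: int) -> int:
    """Hypothetical model of one hash block: (addr * hashSeed)[31:32-D]."""
    if not 1 <= index_bits <= 32:
        raise ValueError("index_bits must be between 1 and 32")
    product = (addr * hash_seed) & MASK32
    # Shift right until only the top index_bits bits of the 32-bit product remain.
    return product >> (32 - index_bits)
```

With INDEX_BITS = 8 the result always lies in the range 0 to 255, so it can index a 256-bit bitset directly.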
We experimented to determine how large a bloom filter was required to achieve good perfor-
mance. In short, we found that a one hash function bloom filter with 256 bits of storage was more
than enough to reduce false positives to an acceptably low level for all of our evaluated benchmarks.
Moreover, using just the cache array index bits of the address (8 in our design) was almost as good
as using a proper multiply-and-shift hash function. Since the extra hardware for the multipliers
increases the critical path of the ring cache, we do not include them in the design. Since this result may
be very specific to our particular SPEC 2000 CINT benchmarks, we still include support for the
full bloom filter hardware, in the event that it is necessary for other programs. For our 6 SPEC INT
benchmarks, three of them (164.gzip, 175.vpr, and 197.parser) had a significant number of remote
loads without a bloom filter. Compared to using no bloom filter at all, the simple 1-hash bloom filter
reduced the number of remote loads by 97%, 92%, and 60%, respectively. Without it, speedups were
reduced by 68%, 48%, and 22%. False positives accounted for a negligible increase in remote loads.
The bloom filter must be cleared between loop invocations, or it will quickly fill up and have
a very high false positive rate. Since there is a memory barrier between every loop invocation, it is
safe to clear the bloom filter at that point, since every value previously stored in the ring cache will
already be flushed to the normal cache hierarchy.
Parameters
BITS_IN_TABLE The size of the bloom filter potentially has a large impact on false positive rates
when checking for set membership.
INDEX_BITS Log2(BITS_IN_TABLE) is the number of bits needed to address the bitset. It also
corresponds to the number of bits the hash functions should output.
Inputs
reset This one-bit input must be set high for at least one cycle between loop invocations to clear
the bloom filter of all state.
addrToCheck This 32-bit input is the address that is being checked for set membership in the
bloom filter.
addrToSet This 32-bit input is the address that is being inserted into the bloom filter.
addrToSetValid This one-bit input is set high if addrToSet is a valid address to insert into the
bloom filter.
Outputs
hashTableMiss The single one-bit output from the bloom filter indicates whether addrToCheck was
present in the set or not. It is set high when the address has not been seen before.
Datapath
A schematic of the bloom filter is shown in Figure A.30, with a 512 bit bit-array and two hash func-
tions. Since a ring node may process a load in parallel with a store, we may want to both insert an
item into the set while simultaneously checking a different address (HELIX guarantees that the ad-
dress being stored and the address being loaded are never the same, so that situation does not need
to be explicitly considered). This requires four hash function operations, two for the addrToCheck
and two identical ones for addrToSet. For the address being checked, the bit positions output from
the hash functions are read combinationally from the bitset. If one or more of the bit positions were
not already set to 1, then this address has not been seen before, and hashTableMiss is set high. For
the address being set, the calculated bit positions are used to set the two corresponding bits to 1 in the
bitset at the clock edge.
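As a rough behavioral illustration of this datapath, the following Python sketch models one clock cycle of the bloom filter. It is not the Verilog reference design. The class and method names are hypothetical, passing addr_to_set=None stands in for addrToSetValid being low, and giving reset priority over an insert in the same cycle is an assumption. The default size and seeds are taken from Figure A.30.

```python
MASK32 = 0xFFFFFFFF


class BloomFilterModel:
    """Hypothetical cycle-level model of the bloom filter submodule."""

    def __init__(self, bits_in_table=512, seeds=(0x2ABF7209, 0x1A8FCEE7)):
        if bits_in_table < 2 or bits_in_table & (bits_in_table - 1):
            raise ValueError("bits_in_table must be a power of two")
        self.index_bits = bits_in_table.bit_length() - 1  # INDEX_BITS = log2(B)
        self.seeds = tuple(seeds)
        self.bitset = 0  # one Python int used as the B-bit table

    def _indices(self, addr):
        # One multiply-and-shift hash per seed.
        return [((addr * seed) & MASK32) >> (32 - self.index_bits)
                for seed in self.seeds]

    def cycle(self, addr_to_check, addr_to_set=None, reset=False):
        """Return hashTableMiss for addr_to_check, then apply the clock edge."""
        # Combinational read: sees the bitset as it was before the clock edge.
        miss = any(not (self.bitset >> i) & 1
                   for i in self._indices(addr_to_check))
        # Clock edge: clear the table, or set the bits for the inserted address.
        if reset:
            self.bitset = 0
        elif addr_to_set is not None:
            for i in self._indices(addr_to_set):
                self.bitset |= 1 << i
        return miss
```

Because the check is evaluated before the update, an address inserted in a given cycle only becomes visible to checks in later cycles, which mirrors the combinational read and clocked write described above.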
Control
There is no control for this module.
A.12 Signal Buffer
The signal buffer contributes a significant fraction of the improved speedups that ring cache pro-
vides HELIX. It produces speedups both by pushing the signal tracking logic to hardware instead
of software, and by decoupling signal forwarding from synchronization, which helps break the se-
quential forwarding chains of synchronization which are intrinsic to HELIX and DOACROSS
style parallelization. The amount of hardware resources dedicated to the signal buffer can increase
speedups along two dimensions. First, adding more available signal IDs allows the compiler to more
aggressively parallelize sequential code into smaller sequential segments, potentially increasing par-
allelism amongst segments. Second, adding more bits for buffering received and sent signals allows
cores to increase the number of iteration epochs they can decouple from each other during execution,
which reduces core idle time that normally results from sequential forwarding chains. We start the
signal buffer discussion by first explaining the concept of epochs in this context, and how the signal
buffer facilitates synchronization decoupling by allowing cores to skip light waits. Next, a datapath
and control FSM of the signal buffer reference design is presented. The hardware implementation
of two important parameters – number of signal IDs and amount of signal buffering – is described.
Then, we discuss some optimizations the compiler can make to reduce the number of synchronization
instructions that need to be sent to the signal buffer. Finally, we discuss how the signal handling
used in our previous publications – a real multicore in the original HELIX paper [11], and
our simulated ring cache enabled multicore in HELIX-RC [9] – map to our generalized reference
design. We follow up in Section A.14 with area and timing results for the signal buffer.
A.12.1 Synchronization Epochs
Previously, Section A.2 described at a high level how decoupling signal forwarding from synchronization
allows HELIX-RC to break sequential forwarding chains and increase speedups. In this
section, we describe in detail the operation of the signal buffer, how it breaks these chains, and how
it allows cores to decouple in units of epochs.

Figure A.31: A modified loop body where empty sequential segments start with light wait instructions
instead of ordinary wait instructions. [Loop body shown: A(), then IF COND; one branch holds
sequential segment 1 (Wait ID1; Load X; X = f(X); Store X; Signal ID1), the right branch holds an
empty segment (Light Wait ID1; Signal ID1); B() follows, then the next iteration starts.]

Figure A.32: A single bit of state for synchronizing signals constrains cores to operating within a
single synchronization epoch. In a two core system, this implies cores can not move apart by more
than 2 iterations. [Execution timeline for core 0 and core 1 over iterations 0 to 3.]

Figure A.33: Using two bits of signal buffering state improves performance by allowing cores to
decouple an additional synchronization epoch. [Execution timeline for core 0 and core 1 over
iterations 0 to 3.]

We define an epoch to be a set of N consecutive iterations, where N is the number of cores executing
a parallel loop. Consider a 3-core system. For a parallel loop with 9 total iterations, core 0 executes
iterations 0, 3, 6; core 1 executes iterations 1, 4, 7; core 2 executes iterations 2, 5, 8. An epoch,
therefore, is 3 iterations long. An epoch, however, does not need to start at an iteration executed
by core 0. For example, an epoch could span from iteration 1 to iteration 3, not just from 0 to 2.
We further define the concept of a synchronization epoch, which is an epoch whose bounds are not
at iteration boundaries. Instead, they exist just before the next sequential segment in an iteration.
Take our example of an epoch of iterations from 1 to 3. The corresponding synchronization epoch
would begin precisely at the next sequential segment encountered by the core executing iteration 1,
and would end just before the next sequential segment encountered by the core executing iteration
3. In higher level terms, a synchronization epoch is the furthest distance any pair of cores can drift
apart if they are constrained by a sequential forwarding chain. In the case where all sequential seg-
ments contain dependences that need to be satisfied, no two cores can ever separate by more than a
synchronization epoch, even if signal buffering is available, since all sequential segments must be ex-
ecuted in loop iteration order. However, if there are sequential segments that are empty and there
is sufficient signal buffering available, cores can drift apart (“decouple”) by multiple synchroniza-
tion epochs, which reduces core idle time. We define wait instructions that mark the boundaries of
such empty sequential segments as light waits, which do not need to be synchronized under certain
circumstances.
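The iteration-to-core assignment and the notion of an epoch in the 3-core example above can be sketched in Python as follows. The helper names are hypothetical and serve only to illustrate the definitions. The round-robin assignment is the one used in the example.

```python
def core_for_iteration(iteration: int, num_cores: int) -> int:
    """Round-robin assignment: iteration i runs on core i mod N."""
    return iteration % num_cores


def iterations_for_core(core: int, num_cores: int, total_iterations: int) -> list:
    """All iterations of the loop that a given core executes."""
    return list(range(core, total_iterations, num_cores))


def epoch_starting_at(first_iteration: int, num_cores: int) -> list:
    """An epoch is N consecutive iterations; it may start at any iteration."""
    return list(range(first_iteration, first_iteration + num_cores))
```

Whatever iteration an epoch starts at, it contains exactly one iteration from each core, which is why an epoch is a natural unit for measuring how far cores have drifted apart.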
An example is helpful to understand these two concepts. Figure A.31 presents a modified version
of Figure A.2, except this time with a light wait instead of a normal wait instruction on the right
branch of the sequential segment, indicating that the segment lacks any dependence. To keep things
simple, we assume only a two core system – the same conclusions will hold true for a chip with more
cores. Each core only needs to receive signals from the other core to unblock a particular sequential
segment. First, we consider a setup where only a single bit per core is used to track signals received
for a particular sequential segment for each other core in the system. Since we have only two cores,
and only one sequential segment, this means each core only needs a single bit total to track signals.
When a core receives a signal, it sets the signal bit. When a core finishes executing the sequential
segment, it clears the bit, therefore “consuming” the signal.
Figure A.32 depicts the execution of four iterations of the example loop. The time it takes to
transmit a signal is exaggerated to better illustrate the impact of signal buffering. Note that since
core 0 is executing the first iteration of a loop, its signal bit is preset, since there is no iteration -1 to
receive a signal from. Data communication is not shown for simplicity. First, core 0 and core 1 start
executing the parallel portions of iterations 0 and 1. Just before reaching the sequential segment, core
0 is stalled on a DRAM access. Meanwhile, core 1 reaches the light wait instruction at the start of the
sequential segment. Core 1, executing iteration 1, hasn’t received the signal from core 0, executing
iteration 0, yet, so may not enter the sequential segment. However, since it took the right branch
of the if, the sequential segment starts with a light wait rather than a normal wait. Core 1 therefore
knows that it doesn’t contain a dependence to synchronize, so would prefer to continue on with
execution, even though it is blocked. In this situation, core 0 and core 1 are said to both be within
the same synchronization epoch. Once the DRAM access returns, core 0 executes the segment, clears
its signal received bit, and sends the signal to unblock core 1. Core 1 sets the signal received bit, which
grants it access to the sequential segment, before quickly sending its own signal, which clears the
received bit. Core 0 soon begins executing iteration 2, and core 1 begins executing iteration 3, where
the same dynamic repeats itself, though without the DRAM stall. Even though core 1 never needs
to access shared data, it is nonetheless constrained by the sequential segment. When HELIX does
not have access to a ring cache, cores will always be constrained to a single synchronization epoch.
This can be seen by observing that iteration 0 only ever overlaps with iteration 1, which in turn only
overlaps with iteration 0 and iteration 2.
Imagine a scenario where core 1, knowing that the sequential segment doesn’t contain a depen-
dence, bypasses the light wait instruction anyway. It would send the corresponding signal, which
would ordinarily clear core 1’s bit, and set core 0’s bit. However, core 1’s bit is already cleared, and
core 0’s bit is already set. In this case, the information that a signal was consumed and that a signal
was received was lost. Core 1, upon receiving the signal from core 0 (executing iteration 0), will set its
bit and therefore enter the sequential segment of iteration 3, even though that signal was meant for
the segment it skipped in iteration 1. This potentially results in accessing shared data not in iteration
order, a correctness violation. Likewise, core 0 has lost the knowledge that it received a signal from
core 1 iteration 1, so may not enter the sequential segment in iteration 2. Since we would like core
1 to be able to skip the empty sequential segment, we add an additional bit to our signal tracking.
Instead of 2 states (received signal or not), there are now 4. These new states allow cores to record
whether they’ve skipped a sequential segment (state -1), and therefore need to receive two more sig-
nals to enter the next non-light sequential segment, and whether they’ve received an extra signal
(state 2), and therefore are free to enter the sequential segment two more times.
Figure A.33 depicts a new execution timeline when this additional signal buffering hardware is
added. This time, core 1 is able to skip the sequential segment and race ahead to iteration 3 without
violating correctness. In iteration 3, core 1 again takes the right branch of the if, but even though the
sequential segment is once again empty, it must block. Nonetheless, because of the extra signal
buffering capability, the cores have been able to drift apart by an additional synchronization epoch,
letting core 1 start executing iteration 3 before core 0 has executed the sequential segment of
iteration 0. For every two additional states
that are added to track signals, cores can drift apart yet another synchronization epoch. Of course,
cores can only decouple if they encounter sequential segments that are empty – otherwise, since they
need to access shared data, the sequential segments still need to be executed in loop iteration order.
Our benchmarks contain enough empty segments, though, that decoupling cores reduces idle time
and increases speedups, significantly for some benchmarks, as we saw in Figure A.5. In our simplified
example, the execution time only improves slightly. In more realistic scenarios, with more cores
and more heterogeneity in execution, having cores execute as far ahead and send as many signals as
soon as possible increases overall performance even more. In practice, even given unconstrained re-
sources, the maximum that cores drift apart in our complex benchmarks is usually limited to two
synchronization epochs, so only 4 states for signal buffering are required for most of the benefit.
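The four-state signal tracking described in this section can be modeled with a small Python sketch. This is an illustrative software model, not the hardware design. The class and method names are hypothetical. Generalizing the state range to -(E - 1) through E for an epoch bound E is an assumption that matches the four states -1, 0, 1, 2 when E = 2, and raising an error on an out-of-range transition is likewise an assumption about behavior the text does not specify.

```python
class SignalCounter:
    """Hypothetical model of the signal state kept for one other core."""

    def __init__(self, epoch_bound: int = 2, initial: int = 0) -> None:
        self.min_state = -(epoch_bound - 1)  # -1 when epoch_bound is 2
        self.max_state = epoch_bound         # 2 when epoch_bound is 2
        if not self.min_state <= initial <= self.max_state:
            raise ValueError("initial state out of range")
        self.value = initial

    def _move(self, delta: int) -> None:
        new_value = self.value + delta
        if not self.min_state <= new_value <= self.max_state:
            raise OverflowError("cores drifted further apart than the epoch bound")
        self.value = new_value

    def sender_signal(self) -> None:
        """A signal arrived from the other core."""
        self._move(+1)

    def receiver_signal(self) -> None:
        """This core sent its own signal, consuming one received signal."""
        self._move(-1)

    def can_release(self, light: bool) -> bool:
        """Light waits need any state above the minimum; normal waits need >= 1."""
        if light:
            return self.value > self.min_state
        return self.value >= 1
```

Replaying core 1 from Figure A.33: starting in state 0, the light wait of iteration 1 is released, core 1 sends its signal and drops to state -1, and the light wait of iteration 3 then blocks until the signal from core 0 arrives.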
A.12.2 Signal Buffer Module
The following sections describe the datapath and control implementation of the reference design of
the signal buffer. This implementation is parameterized and more general than the specific one used
by HELIX-RC. After explaining the generalized design, Section A.12.6 will describe how the signal
buffer from HELIX-RC [9] maps to the general design, and the specific values of the parameters.
First an overview of the datapath will be presented. Then, the control for initialization and general
operation will be described. Figures A.34, A.36, and A.37 present datapath schematics of the signal
buffer, split into three different hierarchical levels – the signal buffer module, the signal tracker mod-
ule, and the core tracker module. Each module will be described separately. Since the core tracker
module contains most of the business logic, the section dedicated to that module will contain most
of the control discussion. Note that the datapath schematics are meant to be an overview, and the
Verilog reference design should be consulted for exact bitwidths, etc.
This outermost module contains the entire signal buffer, shown in Figure A.34. Signals and waits
arrive from the core and forwarding network. Signals are recorded internally, and wait instructions
are released only when release criteria are met (i.e., when enough signals have been received).
Parameters The signal buffer is impacted by three system parameters. First, the signal bandwidth
parameter indicates how many signals the signal buffer must be able to process in a single
clock cycle, which has a noticeable effect on signal buffer area. Second, the epoch bound parameter
indicates how many synchronization epochs two cores can drift apart in execution, which increases
the number of bits needed to track signals. Finally, the num signals parameter indicates how many
unique signal IDs the signal buffer tracks. The compiler must know how many signal IDs are
available prior to compilation, as this limits the total number of sequential segments it can create in
any given loop. The top two signal IDs are reserved for two special signals related to the ring cache
flush that happens at the end of every loop invocation.

Figure A.34: A signal buffer module contains signal tracker submodules, one for each sequential
segment that must be tracked (128 in our reference design). It processes up to signal bandwidth
signals per cycle, as well as checking if one wait instruction per cycle can be released. [Schematic
of signal_buffer.v; parameters EPOCH_BOUND = E, RECEIVER_CORE = R, NUM_SIGNALS = N. Signal
tracker modules with SEGMENT_ID 0 to N - 3 handle ordinary signals; those with SEGMENT_ID N - 2
and N - 1 handle the flush signals.]

Figure A.35: A signal entry contains a valid bit, a core ID and a segment ID. In our reference
implementation with 16 cores and 128 total sequential segment IDs, this totals 1 + 4 + 7 = 12 bits
per signal. [Fields shown: Valid, Sender Core ID, Sequential Segment ID.]

Figure A.36: A signal tracker module handles all waits and signals to a single sequential segment ID.
It contains core tracker submodules, one for each other core besides the receiver core of this signal
buffer (therefore 15 submodules on a 16 core system). It determines whether a sequential segment is
safe to enter by the receiver core. [Schematic of signal_buffer_signal_tracker.v; parameters
EPOCH_BOUND = E, RECEIVER_CORE = R, SEGMENT ID = SEG_ID.]

Figure A.37: A core tracker module handles all waits and signals to a single sequential segment ID.
It contains a counter for tracking number of signals sent by a single sender core, as well as sent
from the receiver core. It determines whether it is safe for the receiver core to execute a wait
instruction and enter a sequential segment, as far as the single sender core is concerned. [Schematic
of signal_buffer_core_tracker.v, drawn for a ring cache signal bandwidth of 3; parameters
EPOCH_BOUND = E, RECEIVER_CORE = R, SENDER_CORE = S, IS_FLUSH_SIGNAL = F. The counter register is
log2(E*2) bits wide and records the logical states ..., -1, 0, 1, 2, ... as counter bit values
0, 1, 2, 3, ..., depending on EPOCH_BOUND. A normal wait can release when the counter is >= E; a
light wait can release when the counter is > 0. The initial value is a function f(R, S, F).]

Figure A.38: The state machine of the core tracker module counts the difference between signals
received from the sender core minus signals sent from the receiver core. Depending on the epoch
bound parameter and the counter value, different types of wait instructions can be released. In this
example, epoch bound is 2. Flush related signals require only recording the reception of a single
signal. [Normal signals: state -1, don't release any waits; state 0, release only light waits;
states 1 and 2, release all waits. Flush related signals: state 0, don't release any waits; state 1,
release all waits. SigReceiver: a signal sent by the owner of the signal buffer with a particular
segment ID. SigSender: a signal sent by a core other than the owner with a particular segment ID.]

Figure A.39: For the start of every loop invocation, every counter in the signal buffer must be
initialized properly, accounting for the fact that the first epoch of iterations only need to be
unblocked by cores with a lower core ID. Flush related signals are always initialized to 0.
[Counter initial states for normal signals on a 3-core system:
Signal buffer owner Core 0: sender Core 1, State 1; sender Core 2, State 1.
Signal buffer owner Core 1: sender Core 0, State 0; sender Core 2, State 1.
Signal buffer owner Core 2: sender Core 0, State 0; sender Core 1, State 0.
Flush related signals, owner Core X: sender Core Y, State 0; sender Core Z, State 0.]

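The counter initialization summarized in Figure A.39 can be written as a short Python sketch. The function name is hypothetical, and the rule "1 if the sender core ID is greater than the receiver core ID, otherwise 0" is inferred from the 3-core table in that figure rather than taken from the Verilog reference design. The returned values are logical counter states, not raw register bits.

```python
def initial_counter_state(receiver_core: int, sender_core: int,
                          is_flush_signal: bool = False) -> int:
    """Hypothetical model of the 'Init value f(R, S, F)' block."""
    if receiver_core == sender_core:
        raise ValueError("a receiver core does not track itself")
    if is_flush_signal:
        # Flush related signals always start at 0.
        return 0
    # In the first epoch, only cores with a lower ID need to unblock this core,
    # so trackers for higher-numbered sender cores start with a signal preset.
    return 1 if sender_core > receiver_core else 0
```

For a two core system this reproduces the earlier example in which the signal bit of core 0 is preset, since there is no iteration -1 to receive a signal from.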
Inputs At the highest level, the signal buffer has several different inputs pertinent to its primary
operation. First, it receives one or more signals on the incomingSignals input, up to the maximum
depending on the signal bandwidth parameter. Figure A.35 shows the bit layout of a signal – mul-
tiple of these are combined together on the incomingSignals bus. These signals may originate from
other cores (sender cores), or from the core which contains this signal buffer (which we call the
receiver core). Each incoming signal contains the core ID of whoever sent it, in addition to the
sequential segment ID that it corresponds to. It is expected that these signals arrive sometime near
the beginning of the clock period, and are all serviced by the next clock edge, at which point they
are removed as inputs. The signal buffer, therefore, must be able to handle signal bandwidth num-
ber of signals per cycle. In addition to signal inputs, the signal buffer also has three inputs related
to wait instructions. Incoming wait instructions are only issued by the receiver core. The first of
these, incomingWaitValid, indicates whether a wait instruction is being executed by the core. The
second, incomingWaitID, contains the sequential segment ID that is protected by this particular
wait instruction. Finally, an additional input bit, incomingWaitLight, indicates whether this wait
instruction is protecting an empty sequential segment, and therefore is subject to different release
criteria.
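As an illustration of the signal entry format from Figure A.35, the Python sketch below packs and unpacks 12-bit entries and splits an incomingSignals bus into its entries. The field widths follow the reference configuration (16 cores, 128 segment IDs). The function names, the ordering of fields within an entry, and the placement of slot 0 in the low bits of the bus are assumptions made for the example, so the Verilog reference design should be consulted for the actual layout.

```python
CORE_ID_BITS = 4       # log2(16 cores)
SEGMENT_ID_BITS = 7    # log2(128 sequential segment IDs)
ENTRY_BITS = 1 + CORE_ID_BITS + SEGMENT_ID_BITS  # valid + core ID + segment ID


def pack_signal(valid: bool, sender_core: int, segment_id: int) -> int:
    """Pack one signal entry (assumed order: valid, core ID, segment ID)."""
    if not 0 <= sender_core < (1 << CORE_ID_BITS):
        raise ValueError("sender_core out of range")
    if not 0 <= segment_id < (1 << SEGMENT_ID_BITS):
        raise ValueError("segment_id out of range")
    return ((int(valid) << (CORE_ID_BITS + SEGMENT_ID_BITS))
            | (sender_core << SEGMENT_ID_BITS)
            | segment_id)


def unpack_signal(entry: int) -> tuple:
    """Return (valid, sender_core, segment_id) for one packed entry."""
    segment_id = entry & ((1 << SEGMENT_ID_BITS) - 1)
    sender_core = (entry >> SEGMENT_ID_BITS) & ((1 << CORE_ID_BITS) - 1)
    valid = bool((entry >> (CORE_ID_BITS + SEGMENT_ID_BITS)) & 1)
    return valid, sender_core, segment_id


def split_bus(bus: int, signal_bandwidth: int) -> list:
    """Break the combined bus into signal_bandwidth entries, slot 0 lowest."""
    mask = (1 << ENTRY_BITS) - 1
    return [unpack_signal((bus >> (slot * ENTRY_BITS)) & mask)
            for slot in range(signal_bandwidth)]
```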
Outputs There is one main output, waitReleased, which is set high if and only if the receiver
core is executing a wait instruction that the signal buffer has determined is safe to release, thereby
allowing the core to enter a sequential segment. A secondary output, waitReleasedToStartFlush, is
dedicated for a special set of signal tracking bits related to the flush operation. These signal tracking
bits are initialized differently than a normal signal, and releasing their accompanying wait has a spe-
cial semantic to the receiver core, so they require a special output from the signal buffer module. The
waitReleased outputs will be held high as long as the wait instruction release criteria are met, and the
wait instruction related inputs are still valid.
Datapath Within the signal buffer, there is a submodule instantiation per signal ID that needs
to be tracked. The incomingSignals are routed to each of the signal tracker submodules, which have
been instantiated with a signal ID to track, the epoch bound parameter, and the ID of the receiver
core. The incomingWaitValid signal is passed on to a submodule if and only if the incomingWaitID
matches the signal ID of the particular submodule. Therefore, at most one of the submodules will
receive a high input on its own incomingWaitValid input. The incomingWaitLight input is also
passed on to each of the signal tracker submodules. A signal tracker that receives a high
incomingWaitValid signal will raise its waitReleased output if sufficient signals have been received
from every other core, thereby allowing the receiver core entrance into the sequential segment.
Every signal tracker submodule’s waitReleased signals are ORed together to set the signal buffer’s
waitReleased output. Since only one wait instruction can be executed at a time, and it is only
passed on to one signal tracker submodule, only one submodule will raise its waitReleased
signal. There are also two special trackers dedicated to the top two possible signal IDs, which are
used to facilitate the end of loop flush. These two trackers are initialized slightly differently than a
normal signal tracker, and only require the epoch bound parameter to be set to 1.
Control The outermost signal buffer module lacks any control; it simply routes inputs and outputs
to and from the signal tracker modules.
A.12.3 Signal Tracker Module
This module is responsible for all signal tracking and wait releasing pertinent to a single
sequential segment ID, and is shown in Figure A.36.
Parameters The signal tracker module contains three parameters set by the parent signal buffer
module. The first of these is the epoch bound parameter as described in the signal buffer module
section. Next, the module keeps track of its receiver core ID. The third parameter, segment id is the
sequential segment ID that this module is responsible for.
Inputs The inputs are the same as those of the signal buffer module, with the exception that in-
comingWaitValid is only set high by the parent module if the wait ID matches the sequential seg-
ment ID that this module instantiation is responsible for (segment id). The timing of all of the input
signals is identical to the parent module.
Outputs The single output is waitReleased, which is set high if and only if the receiver core is
executing a wait instruction whose ID matches segment id, and enough signals have been received that
the core may enter the sequential segment. This latter criterion is met only when the waitReleased
output of every core tracker module is set high. The output is held high as long as the wait related
inputs are valid and the release criteria are met.
Datapath The signal tracker module contains N-1 instantiations of the core tracker module,
where N is the total number of cores in the chip. There are only N-1 of these modules since the
receiver core doesn’t track itself. Each of the core tracker submodules are assigned a core ID corre-
sponding to each possible sender core. A special parameter is set if the submodule corresponds to a
special flush signal. At this level, the signal tracker module is only concerned about incoming signals
that match the ID of the signal tracker itself. Some minor logic is used to set an array of valid bits for
the incoming signals, where the valid bit corresponding to an incoming signal is only set if its signal
ID matches that of the signal tracker. The IDs of the cores that sent the signals are also broken out
from the incomingSignals bus. The reason that all signals with a matching ID are sent to the sub-
modules, rather than just those that match the signal ID and the sender core ID, is that signals sent
from the actual receiver core must also update every core tracker submodule. The signal valid bits
and sender core IDs are sent to all of the core tracker submodules, in addition to the incomingWait-
Valid and incomingWaitLight inputs.
Control This module lacks any control, and primarily just routes inputs/outputs to/from the
core tracker modules.
A.12.4 Core Tracker Module
This module, shown in Figure A.37, is responsible for tracking signals sent by a single sender core,
as well as those sent by the receiver core, for a single sequential segment ID.
Parameters As with the parent and grandparent module, the core tracker module contains an
epoch bound parameter. It also contains a receiver core ID parameter, and a sender core ID parameter.
There is a fourth parameter, special flush signal, to indicate that this module is for tracking signals
to one of the reserved signal IDs pertaining to the end of loop flush. For such a special signal, the
epoch bound parameter is always set to 1, since the special signals are only sent once at the end of a
loop invocation, and therefore don’t have epochs to track.
Inputs The incomingSignals bus from the parent module has been split into an incomingSignalsValid
array of bits and an incomingSignalsCoreIds bus. The former indicates which of the signals
received by the signal buffer matched the segment ID of the signal tracker module which contains this
core tracker module. The latter indicates which core IDs sent which of the incoming signals, one per
signal bandwidth supported by the system. There are also incomingWaitValid and incomingWait-
Light inputs that are identical to those of the parent module. Input timing is identical to the parent
module.
Outputs The output of a core tracker, waitReleased, is set high if the receiver core has received
enough signals from the sender core to enter the sequential segment. It is held high for as long as the
wait related inputs are valid, and the release criteria are met.
Datapath The core tracker module is where the bulk of the signal buffer work is performed.
Within each core tracker module, there is a counter, where the number of states that the counter
must represent corresponds to two times the number of synchronization epochs that we allow
cores to decouple by. There can be at most two valid incoming signals for each core tracker – one
sent by the sender core, and one sent by the actual receiver core. Since signals are sent and propagated
around ring cache in-order, and without passing each other, it is impossible to receive two copies of
a signal from a single core with the same signal ID. Depending on which of these two possible sig-
nals are received by the core tracker, the internal counter is either incremented, decremented, or held
to the same value. Depending on the value of the counter, and whether incomingWaitValid and/or
incomingWaitLight is high, the only output, waitReleased, is raised high. This indicates that as far
as this sender core is concerned, the receiver core is free to execute the sequential segment associated
with this particular wait instruction.
Control Now that the datapath and module hierarchy are established, we describe the control logic
governing how the core tracker counters are updated, in addition to the conditions under which
waits and light waits are released. As previously described, a particular core tracker module can
receive at most two signals for a particular signal ID on a given cycle – one from the receiver core
of the signal buffer, and one from the sender core corresponding to the particular instantiation of
the core tracker module. We call these two signals SigReceiver and SigSender. Processing a SigSender
means that the sender core has exited the corresponding sequential segment. Processing a SigReceiver
means that the receiver core has just exited the sequential segment. The counter in each core
tracker module keeps track of the difference between the number of SigSenders and SigReceivers
processed. Figure A.38 depicts the FSM for how these two signal types get processed. The epoch bound
is set to 2 in this implementation, so the number of counter states required is 4. Whenever a core
tracker processes a SigSender, the counter is incremented, indicating that the sender core has
executed a particular sequential segment. Whenever a core tracker processes a SigReceiver, the
counter is decremented, indicating that this receiver core has consumed one of the received signals
by finishing execution of the sequential segment. If incomingWaitValid is raised high but
incomingWaitLight is low, the core tracker sets waitReleased high if and only if the value of the
counter is greater than or equal to 1. If both incomingWaitValid and incomingWaitLight are raised
high, then waitReleased is set high only if the counter value is greater than the minimum possible
(-1 in this configuration). Note that although we refer to states ranging from -1...2, in hardware
the counter register will range from 0...3. We refer to the logical states to better map to the idea
of cores being in different relative epochs.
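The counter update and release rules described above can be captured in a small behavioral model.
The following Python sketch is illustrative only and is not the Verilog from Appendix B. The class
name CoreTracker, its method names, and the generalization of the state range to
-(epoch bound - 1) through epoch bound are assumptions made for this example (the text gives only
the -1...2 range for an epoch bound of 2). It also assumes that nothing is released unless
incomingWaitValid is high.

```python
class CoreTracker:
    """Behavioral sketch of one core tracker counter, using logical states."""

    def __init__(self, epoch_bound=2, initial_state=0):
        # Assumption: 2 * epoch_bound states, from -(epoch_bound - 1) up to
        # epoch_bound, which gives -1..2 for an epoch bound of 2.
        self.min_state = -(epoch_bound - 1)
        self.max_state = epoch_bound
        self.state = initial_state

    def step(self, sig_sender=False, sig_receiver=False):
        """Process the signals seen by this tracker in one cycle."""
        if sig_sender and not sig_receiver:
            self.state += 1   # sender core exited the sequential segment
        elif sig_receiver and not sig_sender:
            self.state -= 1   # receiver core consumed one received signal
        # Both signals or neither: the counter holds its value.
        assert self.min_state <= self.state <= self.max_state

    def wait_released(self, wait_valid, wait_light=False):
        """Return the value of waitReleased for the current counter state."""
        if not wait_valid:
            return False      # assumption: no release without a valid wait
        if wait_light:
            return self.state > self.min_state   # must not underflow
        return self.state >= 1                   # normal wait needs a signal

    def hardware_value(self):
        """Map the logical state onto the 0-based hardware register value."""
        return self.state - self.min_state


if __name__ == "__main__":
    tracker = CoreTracker(epoch_bound=2, initial_state=0)
    print(tracker.wait_released(True), tracker.wait_released(True, True))
```

In this sketch a tracker in state 0 blocks a normal wait but releases a light wait, and a tracker
in state -1 blocks both, matching the two thresholds in the text.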
The natural question is, what do the different counter values represent? What is the semantic of
the different states? Why are light waits released at a different threshold than normal waits?
Consider the case where only normal wait instructions are being executed (i.e., every sequential
segment contains dependences that must be satisfied in loop iteration order). In this situation, as
discussed in Section A.12.1, cores cannot decouple by more than one synchronization epoch if the
dependences are to be properly satisfied. State 0 then refers to a receiver core that has not yet
received a signal for a particular segment in this synchronization epoch – this core is further
along in execution than the corresponding sender core. The receiver core may be executing parallel
code, or may be stuck on a wait instruction, waiting to receive a signal. In either case, since the
upcoming segment contains a dependence, the receiver core cannot execute it until it receives a
signal from every other core. State 1 in this situation refers to a receiver core that has received
a signal for a particular segment, but has not yet executed that sequential segment (and therefore
has not consumed the signal) – this core is further behind in execution than the sender core. When
this core executes a wait instruction, it knows it has received a signal from the corresponding
sender core, which implies that the sender core has exited the sequential segment. The receiver core
executes the segment and then sends a SigReceiver, which reduces the counter to state 0. As long as
only non-light wait instructions are executed, the counter can only be in state 0 or state 1, as was
the case in Figure A.32.
However, if light wait instructions are executed, it implies that a core is executing a segment that
does not actually need to be synchronized. Imagine a receiver core whose counter is currently in
state 0. Now it encounters a light wait instruction, which it can skip and send the SigReceiver signal
immediately. This reduces its state to -1, indicating that not only was it already ahead of the corre-
sponding sender core, but now it is ahead by an additional synchronization epoch. It now needs to
receive two signals from the sender core in order to unblock the next non-light wait instruction it
encounters. The sender core, upon receiving the signal from the receiver core, increments its state
from 1 to 2, indicating that it has two outstanding signals to consume, as was the case in Figure A.33.
Notice that the FSM has no transitions representing either underflow or overflow of the counter. In
our previous example, what would happen if the receiver core were to race ahead and execute the
sequential segment from yet another synchronization epoch ahead? Absent any back pressure, it would
skip the light wait instruction and send the signal, underflowing its counter; the signal would
then be received by the sender core, overflowing its counter. This is the reason that light wait
instructions must still be able to block a core, and why a light wait may be released by the signal
buffer if and only if the counter is not about to underflow. If a receiver core is prohibited from
underflowing its counter, then no sender core can ever overflow. This is a result of the fact that
counter states are symmetric – a state of -1 indicates a core is ahead by one synchronization
epoch, and a state of 2 indicates it is behind by one. If a core is prevented from moving forward,
past the hardware limit, to an epoch too far in the future, then no core can ever fall behind by an
additional epoch past the hardware limit. If the epoch bound parameter is increased, then cores can
decouple even further without blocking due to hardware limits.
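The symmetry argument can be checked with a small experiment. The Python sketch below is a
simplified model rather than the hardware: it assumes two cores, only light waits, and
instantaneous signal delivery, and the function name simulate is hypothetical. Each core keeps one
counter for the other core, and a signal is sent only when the sender's own counter is above the
minimum state.

```python
import random

EPOCH_BOUND = 2
MIN_STATE = -(EPOCH_BOUND - 1)   # -1: this core is an extra epoch ahead
MAX_STATE = EPOCH_BOUND          #  2: this core is an extra epoch behind


def simulate(steps, seed=0):
    """Randomly interleave two cores that execute only light waits."""
    rng = random.Random(seed)
    # counter[c] is the counter core c keeps for the other core.
    # Core 0 runs the older iteration, so it starts in state 1.
    counter = [1, 0]
    seen = set(counter)
    for _ in range(steps):
        core = rng.choice((0, 1))
        other = 1 - core
        # A light wait is released only if the counter would not underflow.
        if counter[core] > MIN_STATE:
            counter[core] -= 1    # SigReceiver processed by the sending core
            counter[other] += 1   # the same signal is a SigSender elsewhere
        seen.update(counter)
        # The two counters mirror each other, so their sum never changes.
        assert counter[0] + counter[1] == 1
    return min(seen), max(seen)


if __name__ == "__main__":
    print(simulate(10_000))
```

Because the two counters always sum to 1 in this model, keeping either one at or above -1 keeps
the other at or below 2, which is the overflow guarantee stated in the text.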
Figure A.38 depicted the control for a single core tracker module. Remember that to release a
wait instruction from a receiver core, all (numCores - 1) core tracker modules must agree to release
the wait. For a normal wait, this indicates that all sender cores have executed the sequential segment
from iterations older than the receiver. For a light wait, this indicates that no signal buffer counters
will overflow or underflow if the light wait is released.
If a core tracker module is designated special by the special flush signal parameter, the control is
slightly modified. Since these core tracker modules are merely tracking whether every core has exe-
cuted a particular special signal just once, any SigReceivers are ignored, since epochs are not relevant
(epoch bound is set to 1).
Initialization The state counters are initialized at the beginning of every loop in a particular
way, depending on the receiver core ID and the sender core ID. For the first synchronization epoch
of a loop, each core expects a different number of signals from the other cores. Consider a 3 core
system where core 0 always runs iteration 0 of a loop. Figure A.39 depicts the state of each core's
signal buffer just as a new loop begins. Since iteration 0 is the first in a loop, core 0 does not
need to be unblocked by any other core to enter any sequential segment. Therefore, all of the
counters for every signal tracker in core 0's signal buffer are initialized to state 1 at the
beginning of every loop, indicating that all segments can be entered without receiving any signals.
Core 1, on the other hand, when executing iteration 1, needs to receive signals from iteration 0 to
enter any segment. Consequently, the core tracker modules corresponding to core 0, for every signal
tracker in core 1's signal buffer, are initialized to state 0, indicating that signals must be
received from core 0 before entering any segments. The counters corresponding to core 2 in core 1's
signal buffer, on the other hand, must still be initialized to state 1, since core 2 never executes
an iteration older than the one core 1 is executing. Finally, every counter in core 2's signal
buffer is initialized to state 0, since core 2 must receive signals from both core 0 and core 1
before entering any segments. For any core tracker modules contained within a signal tracker module
corresponding to a special flush signal, all counters are initialized to state 0, since we want
every core to wait for every other core before releasing the corresponding wait instruction.
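The initialization rule for the 3 core example can be written as a short function. This Python
sketch is a hedged illustration: the function names are hypothetical, and the first_core parameter
generalizes the example (where core 0 always runs iteration 0) under the assumption that cores are
assigned iterations in ring order starting from that core.

```python
def initial_state(receiver, sender, first_core, num_cores, special_flush=False):
    """Initial logical state of the counter that `receiver` keeps for `sender`."""
    if special_flush:
        return 0   # every core must wait for every other core
    # Position of each core in iteration order for the first epoch (assumed).
    receiver_pos = (receiver - first_core) % num_cores
    sender_pos = (sender - first_core) % num_cores
    # A sender running an older iteration must signal first (state 0);
    # otherwise the segment can be entered without a signal (state 1).
    return 0 if sender_pos < receiver_pos else 1


def initial_signal_buffer(receiver, first_core, num_cores, special_flush=False):
    """Counters for all (num_cores - 1) core trackers of one signal tracker."""
    return {
        sender: initial_state(receiver, sender, first_core, num_cores, special_flush)
        for sender in range(num_cores)
        if sender != receiver
    }


def normal_wait_released(counters):
    """A normal wait is released only when every core tracker agrees."""
    return all(state >= 1 for state in counters.values())


if __name__ == "__main__":
    for core in range(3):
        print(core, initial_signal_buffer(core, first_core=0, num_cores=3))
```

For the 3 core case this reproduces the states described above: core 0 starts with every counter
at 1 and can enter a segment immediately, while cores 1 and 2 must first receive signals.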
A.12.5 Signal Buffer Optimizations
There are a few possible signal buffer optimizations that may be appropriate to use with the
reference design. The first of these is only applicable when epoch bound is equal to 1 – that is,
cores are not able to decouple, and must always stay within the same synchronization epoch, even in
the presence of sequential segments that lack dependences. This is a similar scenario to when HELIX
is running on a traditional multicore, albeit still with faster signal propagation speed. In this
configuration, every sequential segment is run in loop iteration order, without exception. Light
waits don't exist, since there is no situation where they can be skipped. Having (numCores - 1) core
tracker modules per signal tracker module is no longer necessary, since receiving a signal for a
particular sequential segment ID from a particular core implies that every previous loop iteration
from every other core has already executed the segment. This allows the signal buffer to reduce the
number of core tracker modules from (numCores - 1) per signal tracker module to just 1 per signal
tracker module, which potentially reduces the total signal buffer area by a factor of the number of
cores. The area savings come at a cost – cores cannot decouple across synchronization epochs, so
speedups are reduced, as we saw previously in Figure A.5. However, if the signal buffer area is
prohibitively large, limiting epoch bound to 1 is a possible solution. In addition, signals no
longer need to circulate around the entire ring; they only need to travel to the subsequent core in
the ring, which reduces signal bandwidth requirements.
The second optimization is applicable only when epoch bound is equal to 2, which implies cores can
decouple by an additional synchronization epoch. If the compiler can guarantee that there is at
least one non-light wait instruction per loop iteration (as would be the case if a particular
dependence always needed to be satisfied), then light waits can be removed entirely, leaving only
the corresponding signal behind. This non-light wait instruction must belong to the same sequential
segment every iteration. A core can then send the corresponding signal straight away without relying
on the light wait to prevent underflow. This is because the existence of at least one unavoidable
non-light wait per iteration naturally limits synchronization decoupling to less than two
synchronization epochs. This optimization is a trade-off – light wait instructions can be removed,
which reduces the number of instructions executed, and the signal buffer no longer needs to contain
the logic to examine light waits. However, the compiler may now have to artificially make a wait
instruction non-light which otherwise could be light. In practice, we found that this downside
rarely arises.
The third optimization applies when epoch bound is equal to 3. As with the previous optimization,
if the compiler can make a guarantee about non-light waits, then light waits don't need to be
executed at all. In this situation, however, the compiler only needs to guarantee that at least one
sequential segment executes a non-light wait per iteration. Unlike the previous case, it doesn't
need to be the same sequential segment every iteration. This prevents our 6-state core tracker
counters from ever underflowing. For our particular benchmarks, increasing the epoch bound past 2
did not have any effect on performance, so this optimization may not be useful.
A.12.6 Previous Implementations
Original HELIX-RC Signal Buffer
The previously presented design is a generalized reference design, and not the exact design we used
in the HELIX-RC paper. In that work, we used a specialized implementation which was designed before
the generalized design was realized. Instead of a state counter and the notion of epoch bounds, we
used two distinct bits per signal per sender core, called past and future. The second optimization
from Section A.12.5 was used, so light wait instructions didn't need to be executed at all. Like the
generalized design, bits were set when a receiver core received signals from any sender core. If the
past bit was not yet set, it was set. If the past bit was already set, the future bit was set
instead. Unlike the generalized design, signals sent by the receiver core were not recorded locally
in any way. Instead, a wait instruction was released when at least one bit was set, which then
cleared the bit (future bit first). If a signal was received when both bits were set, the future bit
was cleared – in doing so, the receiver core was inferring that an epoch had elapsed. This scheme
had the downside that, in addition to being harder to reason about, wait instructions altered ring
cache state, so they could not be issued speculatively. It otherwise had the same performance,
effective FSM, and decoupling ability as the generalized signal buffer design. There is no reason to
use the original design over the generalized design. For the HELIX-RC work, signal bandwidth was set
to 5 signals per cycle (anything less reduced performance for 164.gzip). The number of signals that
needed to be tracked by the buffer was larger than 100, since the compiler could not limit the
number of sequential segments at that point in time.
Normal Multicore Signal Handling
On a real multicore, HELIX handled signal tracking by allocating a special private region of memory
per core. Signals were implemented with stores, and waits with loads. In effect, this maps to a
signal buffer with an epoch bound of 1 and the first optimization in Section A.12.5. It would be
unreasonable to emulate a signal buffer with a higher epoch bound, since instead of each core only
needing to write signals to a single other core's private signal tracking memory, it would need to
both read from and write to every other core's. The read would be necessary to either increment or
decrement the state, depending on the current value stored there. This explosion in cache coherence
traffic would quickly dwarf any performance improvement.
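The store and load scheme used on a conventional multicore can be illustrated with threads standing
in for cores. This Python sketch is an assumption-laden analogy, not HELIX's actual runtime: the
names run_loop and core_thread, the single sequential segment, and the round-robin assignment of
iterations to cores are all chosen for the example. Each core spins on loads of its own private
counter (the wait) and performs one store to its successor's counter (the signal), which
corresponds to an epoch bound of 1.

```python
import threading
import time

NUM_CORES = 4
NUM_ITERS = 20


def run_loop():
    """Run one sequential segment per iteration, in loop iteration order."""
    # signals[c] models core c's private signal memory for the segment.
    # Only the core running the previous iteration ever stores to it.
    signals = [0] * NUM_CORES
    signals[0] = 1      # iteration 0 has no older iteration to wait for
    order = []          # iterations, in the order they ran the segment

    def core_thread(core):
        consumed = 0
        for iteration in range(core, NUM_ITERS, NUM_CORES):
            while signals[core] <= consumed:      # wait: repeated loads
                time.sleep(0)                     # let other threads run
            consumed += 1
            order.append(iteration)               # sequential segment body
            signals[(core + 1) % NUM_CORES] += 1  # signal: store to successor

    threads = [threading.Thread(target=core_thread, args=(c,))
               for c in range(NUM_CORES)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return order


if __name__ == "__main__":
    print(run_loop())
```

Each private counter has a single writer, which is why only one remote store per signal is needed
in this configuration; tracking more than one epoch would require cores to read and update
counters owned by every other core.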
A.13 OS/Multiprogramming Considerations
Our reference design and all of our previous studies have assumed that only a single HELIX process
can run at any single point in time. Currently, there is no direct support for context switching. The
only time during execution when it is safe to context switch is between parallel loop invocations, af-
ter the ring node memory has been flushed. At that point in time, all states within the ring cache are
default states. The only thing that needs to be done is to raise the reset signal, which reinitializes
the signal buffer and invalidates the cache array. That procedure is the same regardless of whether
the next loop to be executed is from the same process or not.
If a ring cache needs to be able to handle context switches, there are a few things to consider.
First, the contents of the ring node memory need to either be tagged with address space IDs or
flushed to the normal cache hierarchy during every context switch. The same mechanism for the
normal ring cache flush, as described in Section A.8, can be used to accomplish this at any point in
time, as long as a core does not execute any instructions after the special flush signals. All of the sig-
nal tracking bits in the signal buffer must be saved as well. This amounts to 32 bits per signal ID per
core, assuming an epoch bound of 2 and 16 cores in the system.
In the current implementation, a safe way to bring the ring cache to a halt in preparation for a
context switch is as follows. First, all instructions that a core has presented to the ring node inter-
face, besides waits, must be held there until the ring node indicates they have completed. This may
be hundreds of cycles in the worst case, as a number of remote loads may be pending. Waits should
be removed from the ring node interface immediately, since they have no side effects on any state
and will never return if their corresponding signal is not already in flight. After this last store/sig-
nal/load has finished (or a wait instruction has been removed), the core should not execute any
more instructions from the loop. Instead, the flush mechanism of Section A.8 should be initiated by
executing the special wait and signal instructions. This will guarantee that all outstanding (non-
flush) store/signal activity has stopped before initiating the flush of the ring node memory in each
core. It is also now safe to back up the contents of the signal buffer, through some yet to be designed
mechanism. Once the flush of both the memory and the signal buffer has completed, the core’s re-
set signal should go high to finish clearing the memory contents and the bloom filter. To resume
execution, the signal buffer contents of each ring node need to be restored and a synchronization
barrier executed. Now each core can resume where it left off.
A.14 Synthesis Results
A.14.1 Reference Design
This section presents some preliminary area, power, and timing results for the ring cache. Table A.1
shows the parameters for our reference design. The values were chosen to roughly match the simu-
lated ring cache from the HELIX-RC paper. The most noticeable exception is the ring cache mem-
ory, which was 8-way set-associative rather than direct mapped in the HELIX-RC paper. Later sim-
ulations showed that the difference between 8-way and direct-mapped memory was minor, so for
ease and clarity of implementation, the latter was used. It is important to understand that many of
these parameters were selected to be just big enough so as not to be a bottleneck for the 6 SPECint
2000 benchmarks we evaluated. Other programs may have vastly different requirements, so these
particular values should not be overly relied upon for a final implementation.
Table A.1: Ring Cache parameters for the reference design.
Parameter                         Value
Number of Supported Cores         16
Address Width                     32 bits
Data Width                        32 bits
Cache Associativity               direct mapped
Cache Data Storage                1024 KB
Total Number of Signal IDs        128
Signal Bandwidth                  5 signals per cycle
Signal Buffer Epoch bound         2
Store Bandwidth                   1 per cycle
Forwarding Network Total Wires    129 bits
Request Network Total Wires       37 bits
Reply Network Total Wires         37 bits
Bloom Filter Configuration        256 bit-array, 1 hash function
Assumed Network Link Latency      0.5 ns
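The network wire counts in Table A.1 follow from the field widths defined in defines.v (Section
B.1). The short Python calculation below reproduces them as a cross-check; the widths are taken
from that file, while the lowercase variable names are introduced here only for illustration.

```python
# Field widths from defines.v (Section B.1).
ADDR_WIDTH = 32
DATA_WIDTH = 32
ID_WIDTH = 7           # 128 signal IDs
CORE_ID_WIDTH = 4      # 16 cores
SIGNAL_BANDWIDTH = 5   # signals per cycle
VALID_BIT = 1

# Each entry carries a valid bit and the ID of the core that originated it.
signal_entry = VALID_BIT + CORE_ID_WIDTH + ID_WIDTH
store_entry = VALID_BIT + CORE_ID_WIDTH + ADDR_WIDTH + DATA_WIDTH

# The forwarding network carries SIGNAL_BANDWIDTH signals plus one store.
forward_bundle = signal_entry * SIGNAL_BANDWIDTH + store_entry
request_bundle = VALID_BIT + CORE_ID_WIDTH + ADDR_WIDTH
reply_bundle = VALID_BIT + CORE_ID_WIDTH + DATA_WIDTH

print(forward_bundle, request_bundle, reply_bundle)  # prints: 129 37 37
```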
The reference configuration was exhaustively tested with test vectors from our X86 cycle-level
C++ simulator, XIOSIM [33], with the modeled ring cache configured the same as the reference de-
sign. Vectors were collected for every Simpoint phase of all 10 of the SPEC benchmarks we evaluated
in our previous HELIX-RC paper. Specifically, for every simulated cycle, the value of all inputs and
outputs corresponding to the ring cache interface (Section A.6) were collected. Verilog simulations
were performed with 16 ring nodes that were excited with these test vectors. At every cycle, the out-
puts of the ring nodes were compared to the known correct outputs from XIOSIM. Every phase
passed the testbench.
Many of the parameters, such as the number of supported cores, signal bandwidth, signal buffer
configuration, and total number of signal IDs, have a large impact on the area/performance of ring
cache. After generating initial results for this reference design, some of these important parameters
were swept. First, the reference design was synthesized using Synopsys Design Compiler and a 40nm
process technology. The synthesis tool was steered to optimize for critical path delay. We used RTL-
level activity factors from our SPECint test runs to more accurately predict power in the synthesized
design. Table A.2 summarizes the post-synthesis results for a single ring node. Although we assumed
a 0.5 ns link latency for all of our inter-node links, we stress that the power and area numbers here
do not account for any link area/power and strictly represent only Design Compiler’s post-synthesis
estimates. Using RTL simulation activity factors instead of Design Compiler’s default, the power is
reduced from nearly 100 mW to 19.22 mW because although the SPECint benchmarks use the ring
cache relatively regularly, it is still only accessed in sequential segments. Since there are still signifi-
cant portions of parallel code, the ring cache is often dormant.
Table A.2: Synthesis results for a single reference ring node.
Metric            Value
Area              0.272 sq mm
Dynamic Power     19.22 mW
Leakage Power     3.3 mW
Max Frequency     1.11 GHz
Critical Path The post-synthesis timing report exposes two primary critical paths in the ring
cache design. The first path starts in the forwarding network receive buffers. It continues through
the bundleizer module, where stores/signals from the core may be added to the network bundle.
Then it travels through the stopper module, where stores/signals may be removed from the bundle.
Finally, it exits the stopper module and begins propagating over the network link to the subsequent
ring node. Since link propagation accounts for 0.5 ns of the path, the other bundleizer and stopper
logic accounts for only a small portion of the total path. It is beneficial that link propagation can
happen in parallel with writing to the memory, since the next longest paths involve the memory.
Specifically, they start at the forwarding network receive buffer and travel through the bundleizer as
before. Then the path changes and instead goes to the memory array to perform a tag lookup and
prepare a memory write for the next clock edge.
Component                     (a) Ring Node Area   (b) Ring Node Dynamic Power
Cache Array                   47.4%                9.7%
Signal Buffer                 48.7%                87.9%
Other Memory Module Logic     2.0%                 1.0%
Forwarding Network            1.5%                 1.0%
Request/Reply Networks        0.5%                 0.4%
Figure A.40: Power and area for a single ring node. The forwarding network includes the
corresponding network buffer, bundleizer, and stopper modules. The request/reply networks include
the network buffers and the load unit.
Area Figure A.40a depicts the area usage in the reference design. The cache array is marginally
smaller than the signal buffer. Although perhaps unexpected, this follows from the fact that the
reference design uses 128 total signal IDs. Since each signal ID requires 2 storage bits per core,
the total number of register bits per signal buffer is 2 * 16 * 128 = 4096. The area of the memory
is somewhat larger than it could otherwise be, since the reference design uses a register array in
lieu of an SRAM to minimize access latency. If the area is prohibitively large, experimenting with
an SRAM (which would necessitate adjusting the FSMs in the memory and array modules) and reducing
the number of signal IDs could provide a large benefit.
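The signal buffer storage estimate can be parameterized to see how it scales. The Python sketch
below is a rough model under stated assumptions: the function names are hypothetical, each counter
is assumed to need just enough bits for 2 * (epoch bound) states, and, following the arithmetic in
the text, one counter is counted per supported core per signal ID.

```python
def counter_bits(epoch_bound):
    """Bits needed to encode the 2 * epoch_bound counter states (assumed)."""
    return (2 * epoch_bound - 1).bit_length()


def signal_buffer_bits(num_cores, num_signals, epoch_bound):
    """Register bits per signal buffer: bits per counter * cores * signal IDs."""
    return counter_bits(epoch_bound) * num_cores * num_signals


if __name__ == "__main__":
    # Reference design: 16 cores, 128 signal IDs, epoch bound of 2.
    print(signal_buffer_bits(16, 128, 2))   # prints: 4096
    # Bits per signal ID, as quoted for context switching in Section A.13.
    print(signal_buffer_bits(16, 1, 2))     # prints: 32
```

Under these assumptions the storage grows linearly with both the number of cores and the number of
signal IDs, which matches the sweeps in Section A.14.2.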
Power Figure A.40b shows the dynamic power breakdown. Although the cache array constitutes a large
fraction of the area, it constitutes a smaller proportion of the power consumption. The signal
bandwidth (5 signals per cycle) was set at a high level relative to store bandwidth (1 store per
cycle) because the large number of empty sequential segments tends to produce far more signals
than shared data. As a result, the signal buffer is utilized far more often than the cache array.
[Bar chart: normalized ring node area (axis from 0.0 to 2.5) for 8, 16, 32, 64, 128, 256, and 512
signals.]
Figure A.41: Total ring node area as total signal ID capacity is swept from 8 to 512.
A.14.2 Signal Buffer Parameter Sweeps
The signal buffer has several parameters that potentially have a large impact on system performance
and ring node area. These include the total number of possible signal IDs, the amount of signal
bandwidth, the amount of allowed synchronization epoch decoupling, and the number of cores in
the system. These properties were discussed in Section A.12.
Number Of Signal IDs
The compiler has control over the maximum number of sequential segments it will produce in any
given loop. Maximum performance can be achieved when the compiler has total flexibility to cre-
ate as many sequential segments as it would like. If restricted, the compiler must combine multiple
sequential segments into one, which may have an impact on performance. However, the number
of signal IDs has a linear effect on the size of the signal buffer area, as each ring node must contain
signal tracker modules for each possible unique signal ID. In our HELIX-RC paper, the number of
signal IDs was unrestricted, which required a maximum of approximately 128 signal IDs for most
loops. Due to limitations in the compiler, we are currently unable to sweep maximum signal IDs
and we therefore leave this analysis for future work. However, although we can't examine it
thoroughly, the intuition we have gathered from examining the compiled code suggests that
significantly fewer than 128 signals would be required to capture most or all of the performance.
Some of the sequential segments are purely for edge cases (exceptions, error handling) that do not
occur in normal operation and are rarely, if ever, synchronized.
[Bar chart: speedup (axis from 0 to 16) for 164.gzip, 175.vpr, 197.parser, 300.twolf, 181.mcf, and
256.bzip2 with unbounded, 4, 2, and 1 signals per cycle.]
Figure A.42: Some benchmarks are very sensitive to signal bandwidth, with 5 signals per cycle
required to maximize speedups.
We can, however, sweep the maximum number of signal IDs in the signal buffer to see how the
area changes. Figure A.41 depicts the area of a ring node, normalized to our reference design, as
the number of signals is swept from 8 to 512. Given that the signal buffer was a significant
fraction of the total ring node area in the reference design, it is no surprise that the total ring
node area increases dramatically for the largest signal capacities.
Signal Bandwidth
Figure A.42 from our HELIX-RC paper shows the importance of high signal bandwidth for achieving
good speedups on SPECint. Although it doesn't have as drastic an effect on area as the number of
signal IDs, increasing the signal bandwidth does increase the area of the signal buffer. Removing
the additional multiplexers and logic needed to process multiple incoming signals results in a 28%
decrease in total ring node area when moving from our reference design to one with a bandwidth of
1 signal per cycle, as seen in Figure A.43. Since the signal buffer is approximately 50% of the
design area, this corresponds to about a 55% decrease in signal buffer area. Although the
forwarding network buffers are also halved, their overall impact is very slight, given their size.
Since the critical path involves the forwarding network buffers and the bundleizer module, the
maximum achievable frequency increases slightly as signal bandwidth decreases, by around 5% from
5 signals per cycle to 1 signal per cycle.
[Bar chart: normalized ring node area (axis from 0.0 to 1.0) for 1, 2, 3, 4, and 5 signals per
cycle.]
Figure A.43: Increasing signal bandwidth increases the signal buffer and network buffer sizes.
[Bar chart: normalized ring node area (axis from 0.0 to 1.2) for epoch bound values of 1, 2, and 3.]
Figure A.44: Decoupling synchronization for one or two epochs increases area significantly, but
also increases performance.
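The conversion from the 28% total area reduction to the approximate signal buffer reduction for
the signal bandwidth sweep is a single division. The Python sketch below works it through under
the simplifying assumption that essentially all of the saved area comes from the signal buffer;
the function name is hypothetical.

```python
def signal_buffer_reduction(total_reduction, signal_buffer_share):
    """Fraction of the signal buffer removed, if it supplies all of the savings."""
    return total_reduction / signal_buffer_share


if __name__ == "__main__":
    # Using the rounded 50% share quoted in the text.
    print(round(signal_buffer_reduction(0.28, 0.50), 2))    # prints: 0.56
    # Using the 48.7% share from the Figure A.40 area breakdown.
    print(round(signal_buffer_reduction(0.28, 0.487), 3))   # prints: 0.575
```

Both values are consistent with the roughly 55% decrease stated for the signal buffer.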
Amount of Synchronization Decoupling
The epoch bound parameter in the signal buffer dictates how many synchronization epochs cores can
decouple. In the HELIX-RC paper, this parameter was essentially set to 2, with a slight
optimization (see Section A.12.6). Increasing this parameter has the potential to increase speedup,
but will also increase the area consumed by the signal buffer, as more bits are needed to track the
state of each signal. Figure A.44 shows the area impact for three values of epoch bound: 1, 2, and
3. Note the large impact of moving from 1 to 2. This is because at a value of 1, the signal buffer
can be simplified by having each core track only received signals from its immediate predecessor
and send signals only to its immediate successor. This optimization is possible because at a value
of 1, cores are unable to decouple. This implies executing all sequential segments strictly in loop
iteration order, which removes the point of having a core broadcast a signal to every other core,
since receiving a signal from the immediate predecessor guarantees that all older iterations have
already executed the segment. However, an epoch bound of 2 has a drastic performance impact, as seen
in our simulation results in Figure A.45 (this consists of some of the same data as Figure A.5). In
contrast, moving to an epoch bound of 3 has absolutely no benefit: there are only a very few times
in all of the combined SimPoint phases where any benchmarks decouple by that amount. Unless other
program characteristics are significantly different from those of SPECint, it seems pointless to
use any epoch bound value other than 2.
[Bar chart: program speedup (axis from 0 to 16) for 164.gzip, 175.vpr, 197.parser, 300.twolf,
181.mcf, 256.bzip2, and INT Geomean with epoch bound values of 1, 2, and 3.]
Figure A.45: Decoupling synchronization up to two epochs increases speedups, but beyond that there
is no effect.
[Bar chart: program speedup (axis from 0 to 16) with 16, 8, 4, and 2 cores for 164.gzip, 175.vpr,
197.parser, 300.twolf, 181.mcf, 256.bzip2, INT Geomean, 183.equake, 179.art, 188.ammp, 177.mesa,
FP Geomean, and Geomean.]
Figure A.46: HELIX-RC scales relatively well on a small number of cores.
[Bar chart: normalized ring node area (axis from 0.0 to 1.0) for 16, 8, 4, and 2 supported cores.]
Figure A.47: Signal buffer size is linear with the number of supported cores, so decreasing the
number of cores has a large impact on ring node area.
Number of Supported Cores
The signal buffer needs to track received signals from every other core in the design. Consequently,
the size of the state in the signal buffer varies linearly with the number of cores in the system, much
in the same way that the total number of signal IDs does. Figure A.47 shows the drastic decrease
in ring node size when the number of cores is reduced from 16 to 2. The size of the signal buffer
decreases by a factor of 8. Intuitively, as the number of cores decreases, so does the achievable
speedup, as we previously discussed in the HELIX-RC paper and as shown in Figure A.46.
B Ring Cache Verilog Code
B.1 defines.v
`ifndef DEFINE_V
`define DEFINE_V
`define ADDR_WIDTH 32
`define DATA_WIDTH 32
`define ID_WIDTH 7
`define NUM_SIGNALS 128 // Must be exactly 2^id_width
`define SIGNAL_BANDWIDTH 5
`define CORE_ID_WIDTH 4
`define NUM_CORES 16 // Must be exactly 2^core_id_width
`define TYPE_WIDTH 2
`define TYPE_LOAD 2'd0
`define TYPE_STORE 2'd1
`define TYPE_WAIT 2'd2
`define TYPE_SIGNAL 2'd3
`define SIGNAL_ENTRY_WIDTH (1 + `CORE_ID_WIDTH + `ID_WIDTH) // 1 is valid bit, CORE_ID is the origin core
`define STORE_ENTRY_WIDTH (1 + `CORE_ID_WIDTH + `ADDR_WIDTH + `DATA_WIDTH) // ditto
`define REMOTE_LOAD_REQUEST_ENTRY_WIDTH (1 + `CORE_ID_WIDTH + `ADDR_WIDTH)
`define REMOTE_LOAD_REPLY_ENTRY_WIDTH (1 + `CORE_ID_WIDTH + `DATA_WIDTH)
`define FORWARD_NETWORK_BUNDLE_WIDTH ((`SIGNAL_ENTRY_WIDTH * `SIGNAL_BANDWIDTH) + `STORE_ENTRY_WIDTH)
`define REQUEST_NETWORK_BUNDLE_WIDTH (`REMOTE_LOAD_REQUEST_ENTRY_WIDTH)
`define REPLY_NETWORK_BUNDLE_WIDTH (`REMOTE_LOAD_REPLY_ENTRY_WIDTH)
// Assume L1 line size is 64 bytes, used for address ownership calculation.
// Words on the same cache line must have the same owner
`define L1_LINE_SIZE_LOG_2 6
// Functions
`define CLOG2(x) \
    (x <= 2) ? 1 : \
    (x <= 4) ? 2 : \
    (x <= 8) ? 3 : \
    (x <= 16) ? 4 : \
    (x <= 32) ? 5 : \
    (x <= 64) ? 6 : \
    (x <= 128) ? 7 : \
    (x <= 256) ? 8 : \
    (x <= 512) ? 9 : \
    (x <= 1024) ? 10 : \
    -1
// Home core hash function needs to respect the 64-byte L1 cache line size,
// or incorrect behavior results.
`define HOME_CORE(x) ((x >> `L1_LINE_SIZE_LOG_2) % `NUM_CORES)
`endif
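To make the derived widths and the two helper macros concrete, the following Python sketch re-evaluates them with the constants from defines.v. It is an illustrative model rather than part of the design: the Python names are chosen here for readability, and the fallback of `clog2` for inputs above 1024 is assumed to be -1.

```python
# Constants copied from defines.v.
ADDR_WIDTH = 32
DATA_WIDTH = 32
ID_WIDTH = 7
SIGNAL_BANDWIDTH = 5
CORE_ID_WIDTH = 4
NUM_CORES = 16
L1_LINE_SIZE_LOG_2 = 6

# Derived entry and bundle widths, each including one valid bit.
SIGNAL_ENTRY_WIDTH = 1 + CORE_ID_WIDTH + ID_WIDTH
STORE_ENTRY_WIDTH = 1 + CORE_ID_WIDTH + ADDR_WIDTH + DATA_WIDTH
REQUEST_BUNDLE_WIDTH = 1 + CORE_ID_WIDTH + ADDR_WIDTH
REPLY_BUNDLE_WIDTH = 1 + CORE_ID_WIDTH + DATA_WIDTH
FORWARD_BUNDLE_WIDTH = SIGNAL_ENTRY_WIDTH * SIGNAL_BANDWIDTH + STORE_ENTRY_WIDTH


def clog2(x: int) -> int:
    """Model of the CLOG2 macro: a chained comparison against powers of two."""
    for bound, result in ((2, 1), (4, 2), (8, 3), (16, 4), (32, 5),
                          (64, 6), (128, 7), (256, 8), (512, 9), (1024, 10)):
        if x <= bound:
            return result
    return -1  # assumed fallback for out-of-range inputs


def home_core(addr: int) -> int:
    """Model of HOME_CORE: drop the line offset, then wrap around the cores."""
    return (addr >> L1_LINE_SIZE_LOG_2) % NUM_CORES


print(SIGNAL_ENTRY_WIDTH, STORE_ENTRY_WIDTH, FORWARD_BUNDLE_WIDTH)
print(home_core(0x00), home_core(0x3F), home_core(0x40))
```

Addresses 0x00 and 0x3F fall in the same 64-byte line and therefore map to the same home core, while 0x40 starts the next line and maps to the next core. This is the property the comment in defines.v asks the hash function to preserve.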
B.2 ring_cache.v
`include "defines.v"
`include "memory.v"
`include "signal_buffer.v"
`include "bundleizer.v"
`include "stopper.v"
`include "buffer.v"
`include "load_unit.v"
/*
Assume that ring cache inputs from the core and L1 are outputs of registers from outside
the module.
*/
module ring_cache(
    input wire clk,
    input wire reset,
    input wire flush,
    input wire [`CORE_ID_WIDTH-1:0] firstIterationCoreId,
    // Inputs and outputs to the owner core
    input wire coreCommandValid,
    input wire [`TYPE_WIDTH-1:0] coreCommandType,
    input wire [`ID_WIDTH-1:0] coreCommandId,
    input wire [`ADDR_WIDTH-1:0] coreCommandAddr,
    input wire [`DATA_WIDTH-1:0] coreCommandData,
    output wire coreCommandProcessed,
    output wire [`DATA_WIDTH-1:0] coreResult,
    output wire coreLoadIsHit,
    // Inputs and outputs from/to left neighbor
    input wire leftForwardBundleValid,
    input wire [`FORWARD_NETWORK_BUNDLE_WIDTH-1:0] leftForwardArrivingBundle,
    output wire leftForwardOutgoingCredit,
    input wire leftRequestBundleValid,
    input wire [`REQUEST_NETWORK_BUNDLE_WIDTH-1:0] leftRequestArrivingBundle,
    output wire leftRequestOutgoingCredit,
    input wire leftReplyBundleValid,
    input wire [`REPLY_NETWORK_BUNDLE_WIDTH-1:0] leftReplyArrivingBundle,
    output wire leftReplyOutgoingCredit,
    // Inputs and outputs from/to right neighbor
    input wire rightForwardIncomingCredit,
    output wire rightForwardBundleValid,
    output wire [`FORWARD_NETWORK_BUNDLE_WIDTH-1:0] rightForwardDepartingBundle,
    input wire rightRequestIncomingCredit,
    output wire rightRequestBundleValid,
    output wire [`REQUEST_NETWORK_BUNDLE_WIDTH-1:0] rightRequestDepartingBundle,
    input wire rightReplyIncomingCredit,
    output wire rightReplyBundleValid,
    output wire [`REPLY_NETWORK_BUNDLE_WIDTH-1:0] rightReplyDepartingBundle,
    // Inputs and outputs from/to L1 cache
    input wire writebackAccepted,
    input wire writebackComplete,
    output wire writebackValid,
    output wire [`ADDR_WIDTH-1:0] writebackAddr,
    output wire [`ADDR_WIDTH-1:0] writebackData,
    // Ports for read interaction with L1 cache
    input wire cacheLoadAccepted,
    input wire cacheLoadComplete,
    input wire [`DATA_WIDTH-1:0] cacheLoadResult,
    output wire cacheLoadValid,
    output wire [`ADDR_WIDTH-1:0] cacheLoadAddr
);
parameter CORE_ID = 0;
// Combinational inputs to memory
wire [`ADDR_WIDTH-1:0] addr_r;
wire input_valid_r;
wire [`ADDR_WIDTH-1:0] addressStore;
wire [`DATA_WIDTH-1:0] dataStore;
wire inputValidStore;
// Outputs from memory
wire memoryReadReady;
wire memoryWriteReady;
wire [`DATA_WIDTH-1:0] dataOutLoad;
wire requestCompleteLoad;
wire requestHitLoad;
wire requestCompleteStore;
// Combinational outputs signal buffer, memory
wire coreWaitReleased;
wire coreWaitReleasedToStartFlush;
wire memoryFinishedFlush;
// Inputs to ReceiveBuffers. Releases the current oldest buffer/entry/bundle.
wire leftForwardReleaseBundle;
wire leftRequestReleaseBundle;
wire leftReplyReleaseBundle;
// Outputs from forwardReceiveBuffer
wire [`FORWARD_NETWORK_BUNDLE_WIDTH-1:0] leftForwardDepartingBundle;
wire leftForwardValidDepartingBundle;
wire [`FORWARD_NETWORK_BUNDLE_WIDTH-1:0] requestBufferPeekA;
wire [`FORWARD_NETWORK_BUNDLE_WIDTH-1:0] requestBufferPeekB;
// Outputs from requestReceiveBuffer
wire [`REQUEST_NETWORK_BUNDLE_WIDTH-1:0] leftRequestDepartingBundle;
wire leftRequestValidDepartingBundle;
// Outputs from replyReceiveBuffer
wire [`REPLY_NETWORK_BUNDLE_WIDTH-1:0] leftReplyDepartingBundle;
wire leftReplyValidDepartingBundle;
// Outputs from bundleizer and stopper
wire coreInputServicedByBundle;
wire [`FORWARD_NETWORK_BUNDLE_WIDTH-1:0] updatedBundle;
wire updatedBundleValid;
wire [`FORWARD_NETWORK_BUNDLE_WIDTH-1:0] prunedBundle;
wire prunedBundleValid;
// Breakouts from updated bundle, to perform store to memory, and signal to signal buffer
wire [`ADDR_WIDTH-1:0] updatedStoreAddress;
wire [`DATA_WIDTH-1:0] updatedStoreData;
wire [`CORE_ID_WIDTH-1:0] updatedStoreSenderCore;
wire updatedStoreValid;
wire [(`SIGNAL_BANDWIDTH*`SIGNAL_ENTRY_WIDTH)-1:0] updatedSignals;
// Input to load unit, don't let request network pass forward network
wire anyAddrMatch;
// Outputs from load unit, send to either: memory, core, request network, or reply network
wire [`ADDR_WIDTH-1:0] addressToLoad;
wire addressToLoadValid;
wire coreLoadProcessed;
wire [`DATA_WIDTH-1:0] coreLoadResult;
wire coreLoadHit;
wire rightRequestValid;
wire [`REQUEST_NETWORK_BUNDLE_WIDTH-1:0] rightRequestDeparting;
wire rightReplyValid;
wire [`REPLY_NETWORK_BUNDLE_WIDTH-1:0] rightReplyDeparting;
// Outbound links and their credits
wire outboundForwardLinkReady;
reg [1:0] outboundForwardLinkCredits;
wire outboundRequestLinkReady;
reg [1:0] outboundRequestLinkCredits;
wire outboundReplyLinkReady;
reg [1:0] outboundReplyLinkCredits;
// Buffers for the three networks, all have the same basic interface.
// Arriving items from the corresponding links are stored in the buffer.
// Existing items are released if they are consumed by the ring cache node.
// A credit is sent to the ring cache node on the left if a buffer is consumed,
// to inform it of this fact.
assign outboundReplyLinkReady = (outboundReplyLinkCredits == 2'b00) ? 1'b0 : 1'b1;
receive_buffer #(.CORE_ID(CORE_ID), .ENTRY_WIDTH(`REPLY_NETWORK_BUNDLE_WIDTH))
replyReceiveBuffer(
    .reset(reset | flush),
    .clk(clk),
    .inputValid(leftReplyBundleValid),
    .arrivingEntry(leftReplyArrivingBundle),
    .outgoingCredit(leftReplyOutgoingCredit),
    .releaseEntry(leftReplyReleaseBundle),
    .departingEntry(leftReplyDepartingBundle),
    .validDepartingEntry(leftReplyValidDepartingBundle)
);
assign outboundRequestLinkReady = (outboundRequestLinkCredits == 2'b00) ? 1'b0 : 1'b1;
receive_buffer #(.CORE_ID(CORE_ID), .ENTRY_WIDTH(`REQUEST_NETWORK_BUNDLE_WIDTH))
requestReceiveBuffer(
    .reset(reset | flush),
    .clk(clk),
    .inputValid(leftRequestBundleValid),
    .arrivingEntry(leftRequestArrivingBundle),
    .outgoingCredit(leftRequestOutgoingCredit),
    .releaseEntry(leftRequestReleaseBundle),
    .departingEntry(leftRequestDepartingBundle),
    .validDepartingEntry(leftRequestValidDepartingBundle)
);
assign outboundForwardLinkReady = (outboundForwardLinkCredits == 2'b00) ? 1'b0 : 1'b1;
receive_buffer #(.CORE_ID(CORE_ID), .ENTRY_WIDTH(`FORWARD_NETWORK_BUNDLE_WIDTH))
forwardReceiveBuffer(
    .reset(reset | flush),
    .clk(clk),
    .inputValid(leftForwardBundleValid),
    .arrivingEntry(leftForwardArrivingBundle),
    .outgoingCredit(leftForwardOutgoingCredit),
    .releaseEntry(leftForwardReleaseBundle),
    .departingEntry(leftForwardDepartingBundle),
    .validDepartingEntry(leftForwardValidDepartingBundle),
    .peekA(requestBufferPeekA),
    .peekB(requestBufferPeekB)
);
bundleizer #(.CORE_ID(CORE_ID))
bundleLogic(
    // Inputs from core
    .coreCommandAddr(coreCommandAddr),
    .coreCommandData(coreCommandData),
    .coreCommandType(coreCommandType),
    .coreCommandValid(coreCommandValid),
    .coreCommandId(coreCommandId),
    // Status of memory and link, needed to determine whether to release a buffer
    .memoryReady(memoryWriteReady),
    .outboundLinkReady(outboundForwardLinkReady),
    // Outputs, inform core and forward network that something was consumed
    .coreInputServiced(coreInputServicedByBundle),
    .leftReleaseBundle(leftForwardReleaseBundle),
    // Inputs from buffer
    .leftDepartingBundle(leftForwardDepartingBundle),
    .leftValidDepartingBundle(leftForwardValidDepartingBundle),
    // Outputs to stopper
    .outputBundle(updatedBundle),
    .outputValid(updatedBundleValid)
);
// Process bundle coming from the bundleizer, pass any valid store to the memory,
// any valid signals to the signal buffer
// Valid bit for store
assign updatedStoreValid = updatedBundleValid & updatedBundle[`STORE_ENTRY_WIDTH-1];
assign updatedStoreSenderCore = updatedStoreValid == 1'b1 ?
    updatedBundle[`STORE_ENTRY_WIDTH-2:`ADDR_WIDTH+`DATA_WIDTH] :
    {`CORE_ID_WIDTH{1'b1}};
assign updatedStoreAddress = updatedStoreValid == 1'b1 ?
    updatedBundle[`STORE_ENTRY_WIDTH-2-`CORE_ID_WIDTH:`DATA_WIDTH] :
    {`ADDR_WIDTH{1'b0}};
assign updatedStoreData = updatedStoreValid == 1'b1 ?
    updatedBundle[`DATA_WIDTH-1:0] :
    {`DATA_WIDTH{1'b0}};
assign updatedSignals = updatedBundleValid == 1'b1 ?
    updatedBundle[`FORWARD_NETWORK_BUNDLE_WIDTH-1:`STORE_ENTRY_WIDTH] :
    {((`SIGNAL_BANDWIDTH*`SIGNAL_ENTRY_WIDTH)){1'b0}};
// Handle writes to the memory, could be core or left receive buffer. Bundleizer has already chosen.
assign addressStore = updatedStoreValid ? updatedStoreAddress : {`ADDR_WIDTH{1'b0}};
assign dataStore = updatedStoreValid ? updatedStoreData : {`DATA_WIDTH{1'b0}};
assign inputValidStore = updatedStoreValid;
// Remove any stores/signals from the bundle that have reached their final core. This happens
// after they are sent to memory/signal buffer, but before placed on the link.
stopper #(.CORE_ID(CORE_ID))
endLogic(
    .inputValid(updatedBundleValid),
    .inputBundle(updatedBundle),
    .outputBundle(prunedBundle),
    .outputValid(prunedBundleValid)
);
memory #(.CORE_ID(CORE_ID)) ringCacheMemory(
    .reset(reset | flush),
    .clk(clk),
    // Outputs indicating load/store readiness
    .readReady(memoryReadReady),
    .writeReady(memoryWriteReady),
    // Input load address from load unit
    // (ultimately from core or request network)
    .inputValidLoad(input_valid_r),
    .addressLoad(addr_r),
    // Outputs, sent to load unit
    .dataOutLoad(dataOutLoad),
    .requestCompleteLoad(requestCompleteLoad),
    .requestHitLoad(requestHitLoad),
    // Input store address/data from output of bundleizer
    .inputValidStore(inputValidStore),
    .addressStore(addressStore),
    .dataStore(dataStore),
    // Output
    .requestCompleteStore(requestCompleteStore),
    // Flush related input and output, triggering start and indicating finish
    .startFlush(coreWaitReleasedToStartFlush),
    .finishedFlush(memoryFinishedFlush),
    // Input and outputs to L1 cache
    .writebackValid(writebackValid),
    .writebackAddr(writebackAddr),
    .writebackData(writebackData),
    .writebackAccepted(writebackAccepted),
    .writebackComplete(writebackComplete),
    .cacheLoadAccepted(cacheLoadAccepted),
    .cacheLoadComplete(cacheLoadComplete),
    .cacheLoadResult(cacheLoadResult),
    .cacheLoadValid(cacheLoadValid),
    .cacheLoadAddr(cacheLoadAddr)
);
signal_buffer #(.EPOCH_BOUND(2), .RECEIVER_CORE(CORE_ID))
signal_buffer(
    .clk(clk),
    .reset(reset | flush),
    .firstIterationCoreId(firstIterationCoreId),
    // Signals from the bundleizer, send to signal buffer to record
    .incomingSignals(updatedSignals),
    // Incoming wait instruction from the core, output whether it is released or not.
    .incomingWaitValid(coreCommandValid == 1'b1 && coreCommandType == `TYPE_WAIT),
    .incomingWaitLight(coreCommandData[0]),
    .incomingWaitId(coreCommandId),
    .waitReleased(coreWaitReleased),
    .waitReleasedToStartFlush(coreWaitReleasedToStartFlush)
);
load_unit #(.CORE_ID(CORE_ID))
load_unit(
    .clk(clk),
    .reset(reset),
    // Incoming core command
    .coreCommandAddr(coreCommandAddr),
    .coreCommandType(coreCommandType),
    .coreCommandValid(coreCommandValid),
    // Incoming item on the request network
    .outboundRequestLinkReady(outboundRequestLinkReady),
    .leftRequestDepartingBundle(leftRequestDepartingBundle),
    .leftRequestValidDepartingBundle(leftRequestValidDepartingBundle),
    // Incoming item on the reply network
    .outboundReplyLinkReady(outboundReplyLinkReady),
    .leftReplyDepartingBundle(leftReplyDepartingBundle),
    .leftReplyValidDepartingBundle(leftReplyValidDepartingBundle),
    // Outputs from the memory - read hit status, and whether it is finished
    .requestCompleteLoad(requestCompleteLoad),
    .requestHitLoad(requestHitLoad),
    .dataOutLoad(dataOutLoad),
    // Check items in forwarding network buffers to make sure none
    // match anything in the request network buffers
    .peekA(requestBufferPeekA),
    .peekB(requestBufferPeekB),
    // Outputs from load unit, send to memory
    .addressToLoadValid(addressToLoadValid),
    .addressToLoad(addressToLoad),
    // Outputs from load unit, use to respond to core input
    .coreLoadProcessed(coreLoadProcessed),
    .coreLoadResult(coreLoadResult),
    .coreLoadHit(coreLoadHit),
    // Outputs from load unit, use to release a buffer from the request network,
    // and/or send a new one on the outgoing link
    .leftRequestReleaseBundle(leftRequestReleaseBundle),
    .rightRequestValid(rightRequestValid),
    .rightRequestDeparting(rightRequestDeparting),
    // Outputs from load unit, use to release a buffer from the reply network,
    // and/or send a new one on the outgoing link
    .leftReplyReleaseBundle(leftReplyReleaseBundle),
    .rightReplyValid(rightReplyValid),
    .rightReplyDeparting(rightReplyDeparting)
);
// Connect load unit module to memory. Not really necessary, but makes it more explicit.
assign input_valid_r = addressToLoadValid;
assign addr_r = addressToLoadValid ? addressToLoad : {`ADDR_WIDTH{1'b0}};
/*********
 Outputs
*********/
// Output pruned forward bundles (stores and signals) onto outgoing link
assign rightForwardBundleValid = prunedBundleValid;
assign rightForwardDepartingBundle = prunedBundle;
// Output proper request bundle (either from core or already in network, as decided by load unit)
assign rightRequestBundleValid = rightRequestValid;
assign rightRequestDepartingBundle = rightRequestDeparting;
// Output proper reply bundle (either from core or already in network, as decided by load unit)
assign rightReplyBundleValid = rightReplyValid;
assign rightReplyDepartingBundle = rightReplyDeparting;
// Return a hit if a load hit locally, or when returning from reply network.
// Currently, from the core's point of view, ALL loads are (eventually) hits.
// Load unit calculates this value.
assign coreLoadIsHit = coreLoadHit;
// Result is from local load or reply network. Output from load unit.
assign coreResult = coreLoadResult;
// Core command completed if a local load hit, a store was injected to the forwarding network,
// a normal wait was released, the special flush wait was released,
// or a remote load initiated by this core came back on the reply network.
assign coreCommandProcessed = coreLoadProcessed | coreInputServicedByBundle |
    coreWaitReleased | (memoryFinishedFlush & coreWaitReleasedToStartFlush);
always @(posedge clk) begin
    if (reset == 1'b0) begin
        // New credit total = old credits + possible incoming credit, - something sent
        outboundForwardLinkCredits <= outboundForwardLinkCredits + rightForwardIncomingCredit -
            (rightForwardBundleValid ? 1'b1 : 1'b0);
        outboundRequestLinkCredits <= outboundRequestLinkCredits + rightRequestIncomingCredit -
            (rightRequestBundleValid ? 1'b1 : 1'b0);
        outboundReplyLinkCredits <= outboundReplyLinkCredits + rightReplyIncomingCredit -
            (rightReplyBundleValid ? 1'b1 : 1'b0);
    end
    else begin
        outboundForwardLinkCredits <= 2'd2;
        outboundRequestLinkCredits <= 2'd2;
        outboundReplyLinkCredits <= 2'd2;
    end
end
endmodule
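The credit registers at the end of ring_cache.v implement credit-based flow control on each outbound link. The Python sketch below models one such link as a behavioral illustration: the class and method names are hypothetical, and the check that nothing is sent without a credit is an assumption about how the surrounding logic uses the ready signal, not something this module enforces on its own.

```python
class CreditedLink:
    """Behavioral model of one outbound link's 2-bit credit counter."""

    INITIAL_CREDITS = 2  # reset value 2'd2: one credit per downstream buffer slot

    def __init__(self) -> None:
        self.credits = self.INITIAL_CREDITS

    @property
    def ready(self) -> bool:
        """Mirrors outbound*LinkReady: the link is usable while credits remain."""
        return self.credits != 0

    def tick(self, incoming_credit: bool, bundle_sent: bool) -> None:
        """One clock edge: add a returned credit, subtract a bundle that was sent."""
        if bundle_sent and not self.ready:
            # Assumption: senders are gated on `ready`, so this never happens.
            raise ValueError("bundle sent on a link with no credits")
        # The mask models the 2-bit register width.
        self.credits = (self.credits + int(incoming_credit) - int(bundle_sent)) & 0b11


link = CreditedLink()
link.tick(incoming_credit=False, bundle_sent=True)   # first downstream slot filled
link.tick(incoming_credit=False, bundle_sent=True)   # second downstream slot filled
print(link.ready)                                    # link is now blocked
link.tick(incoming_credit=True, bundle_sent=False)   # right neighbor freed a slot
print(link.ready)
```

Starting with two credits matches the two slots in each receive buffer, so a node can have at most two bundles outstanding toward its right neighbor on each network.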
B.3 buffer.v
`include "defines.v"
// Generic buffer used for three different pseudo networks.
// Contains two slots for holding "bundles"/packets/elements.
module receive_buffer #(parameter ENTRY_WIDTH = (`FORWARD_NETWORK_BUNDLE_WIDTH), CORE_ID = 0) (
    input wire clk,
    input wire reset,
    // Inputs from left ring node.
    // If valid, we are positive that there is an empty buffer slot,
    // otherwise it would never have been sent.
    input wire inputValid,
    input wire [ENTRY_WIDTH-1:0] arrivingEntry,
    // Outputs to left ring node.
    // Raise this to high for one cycle to communicate to the left node
    // that one of the buffers has become free.
    output wire outgoingCredit,
    // Inputs from local ring node.
    // This controls whether a waiting entry can be freed.
    input wire releaseEntry,
    // Outputs to local ring node.
    // If a valid departing entry, the ring cache might consume it by raising
    // the releaseEntry wire, informing the buffer to free it internally.
    output wire [ENTRY_WIDTH-1:0] departingEntry,
    output wire validDepartingEntry,
    // Peek at buffers for address checking.
    // Used only by the request network to peek at the forwarding network, to make sure a remote load
    // isn't passing by a store to the same address.
    output wire [ENTRY_WIDTH-1:0] peekA,
    output wire [ENTRY_WIDTH-1:0] peekB
);
// Two slots in the buffer
reg [ENTRY_WIDTH-1:0] entryA;
reg [ENTRY_WIDTH-1:0] entryB;
// Valid bits not necessary depending on the data format of the entry, but we use them anyway.
reg validA;
reg validB;
reg current; // 0 means entryA is next to depart, 1 is B
// Combinational next values
reg [ENTRY_WIDTH-1:0] nextEntryA;
reg [ENTRY_WIDTH-1:0] nextEntryB;
reg nextValidA;
reg nextValidB;
reg nextCurrent;
// Reset and next state logic
always @(posedge clk) begin
    if (reset == 1'b1) begin
        entryA <= {ENTRY_WIDTH{1'b0}};
        entryB <= {ENTRY_WIDTH{1'b0}};
        validA <= 1'b0;
        validB <= 1'b0;
        current <= 1'b0;
    end
    else begin
        entryA <= nextEntryA;
        entryB <= nextEntryB;
        validA <= nextValidA;
        validB <= nextValidB;
        current <= nextCurrent;
    end
end
// Assign the peek outputs to the two buffers.
assign peekA = entryA;
assign peekB = entryB;
// Combinational logic for output of an entry.
// Check the "current" reg to determine which of the two buffers refers to
// the oldest valid entry.
assign validDepartingEntry = ((current == 1'b0) && (validA == 1'b1)) ||
    ((current == 1'b1) && (validB == 1'b1)) ? 1'b1 : 1'b0;
assign departingEntry = ((current == 1'b0) && (validA == 1'b1)) ? entryA :
    ((current == 1'b1) && (validB == 1'b1)) ? entryB :
    {ENTRY_WIDTH{1'b0}};
// Release a credit to the left node if we have confirmed from the local node that
// a buffer is being consumed this cycle.
// The decision as to whether a buffer is being consumed is made early in the cycle,
// so we assume this signal can traverse to the adjacent core within a cycle.
// If this isn't realistic, it will need to be latched here first.
assign outgoingCredit = validDepartingEntry & releaseEntry ? 1'b1 : 1'b0;
// Muxes for next state values.
// Mux inputs are whether an entry is leaving the buffer, and whether an entry is arriving.
// 00: Nothing is changing, keep old values.
// 01: Nothing is leaving buffer, but something is arriving, so find proper slot for it.
// 10: Something is leaving, but not arriving. Make sure to output the oldest valid entry.
// 11: Something is arriving and leaving. Make sure to update all registers properly.
// In general - when something is arriving, it goes into A if it is available, B otherwise.
// When something is leaving, it leaves first from A if current == 0, B if current == 1.
// When something is both leaving and arriving, the same rules apply. It is impossible for
// something to be arriving if both buffers are already full.
wire validEntryBeingReleased;
assign validEntryBeingReleased = validDepartingEntry & releaseEntry;
// Mux for nextEntryA
always @(*) begin
    case ({validEntryBeingReleased, inputValid})
        2'b00: nextEntryA = entryA;
        2'b01: nextEntryA = ~(validA | validB) ? arrivingEntry :
                            validA ? entryA :
                            validB ? arrivingEntry :
                            {ENTRY_WIDTH{1'b0}};
        2'b10: nextEntryA = current ? entryA : {ENTRY_WIDTH{1'b0}};
        2'b11: nextEntryA = current ? arrivingEntry : {ENTRY_WIDTH{1'b0}};
    endcase
end
// Mux for nextValidA
always @(*) begin
    case ({validEntryBeingReleased, inputValid})
        2'b00: nextValidA = validA;
        2'b01: nextValidA = 1'b1; // if something arriving, A always selected first
        2'b10: nextValidA = current ? validA : 1'b0;
        2'b11: nextValidA = current;
    endcase
end
// Mux for entryB. Similar to A, but switching on inverse of 'current' reg
always @(*) begin
    case ({validEntryBeingReleased, inputValid})
        2'b00: nextEntryB = entryB;
        2'b01: nextEntryB = ~(validA | validB) ? entryB :
                            validA ? arrivingEntry :
                            validB ? entryB :
                            {ENTRY_WIDTH{1'b0}};
        2'b10: nextEntryB = current ? {ENTRY_WIDTH{1'b0}} : entryB;
        2'b11: nextEntryB = current ? {ENTRY_WIDTH{1'b0}} : arrivingEntry;
    endcase
end
// Mux for nextValidB
always @(*) begin
    case ({validEntryBeingReleased, inputValid})
        2'b00: nextValidB = validB;
        2'b01: nextValidB = validA ? 1'b1 : validB;
        2'b10: nextValidB = current ? 1'b0 : validB;
        2'b11: nextValidB = ~current;
    endcase
end
// Mux for nextCurrent
always @(*) begin
    case ({validEntryBeingReleased, inputValid})
        2'b00: nextCurrent = current;
        2'b01: nextCurrent = ~(validA | validB) ? 1'b0 : current; // If nothing valid, A gets selected
        2'b10: nextCurrent = ~current; // Something leaving, toggle.
        2'b11: nextCurrent = ~current;
    endcase
end
endmodule
B.4 bundleizer.v
`include "defines.v"
// This module arbitrates between items already in the forwarding network
// (coming from the receive buffer module), and stores/signals injected by
// the core attached to this node. Signals/stores circulate in the forwarding
// network together in a bundle. A bundle consists of one or more stores and one or more
// signals. Once items are in a bundle together, they must stay in the bundle together,
// until they have reached their final destination. This is for correctness, as HELIX
// assumes that items injected to the ring cache are processed in order. If a newer signal
// passed an older store, a core could be erroneously unblocked before the correct data has arrived.
// A core can only add stores/signals to the bundle if there is an appropriate empty slot (store or signal);
// otherwise it must wait for a bundle with an empty slot.
// If forwarding is blocked for any reason (pending eviction in the RC memory),
// then the entire bundle, stores and signals included, must wait.
// Once a bundle passes through this module, it is sent to the RC memory and signal buffer for processing.
// It is only sent to the output ports if it can proceed to the memory and outgoing link this cycle.
// The signal buffer can not block, so is always ready for inputs.
// A bundle has a fixed number of slots for signals, corresponding to the available signal bandwidth in the
// signal buffer, and a fixed number of slots for stores, corresponding to the available data bandwidth in the
// ring cache memory.
// There are four different scenarios to handle:
//   Nothing leaving receive buffer, no core store/signal being injected
//   Bundle leaving receive buffer, no core store/signal being injected, so pass it on unchanged
//   Nothing leaving receive buffer, core store/signal being injected,
//     so create new bundle with only core store/signal
//   Bundle leaving receive buffer, core store/signal being injected, add it only if a free spot
module bundleizer (
    // Inputs from core, could be either a store or a signal
    input wire coreCommandValid,
    input wire [`TYPE_WIDTH-1:0] coreCommandType,
    input wire [`ID_WIDTH-1:0] coreCommandId,
    input [`ADDR_WIDTH-1:0] coreCommandAddr,
    input [`DATA_WIDTH-1:0] coreCommandData,
    // Inputs from forwarding network receive buffer, indicating if it
    // holds a valid bundle, and if so, its contents.
    input wire leftValidDepartingBundle,
    input wire [`FORWARD_NETWORK_BUNDLE_WIDTH-1:0] leftDepartingBundle,
    // Readiness inputs from the RC memory and link to the adjacent RC node.
    input wire memoryReady,
    input wire outboundLinkReady,
    // Output to the receive buffer to tell it that it can release the current bundle
    output wire leftReleaseBundle,
    // The updated bundle being sent to the RC memory, signal buffer, and outgoing link
    output wire [`FORWARD_NETWORK_BUNDLE_WIDTH-1:0] outputBundle,
    output wire outputValid,
    // An output to tell the core that its new item has been added to a bundle
    output wire coreInputServiced
);
    parameter CORE_ID = 0;
    // Helper control wires, indicating whether the incoming bundle/core inputs are present.
    wire coreHasStore;
    wire coreHasSignal;
    wire leftHasStore;
    wire leftHasSignal;
    // Wires controlling which of the inputs we will pass on to the output bundle.
    wire serviceCoreSignal;
    wire serviceCoreStore;
    wire serviceLeftStore;
    wire [`SIGNAL_BANDWIDTH-1:0] leftValidSignals; // Which signal slots are already taken, one bit per signal.
    wire leftFullSignals; // Is the left bundle entirely full.
    wire [`SIGNAL_BANDWIDTH-1:0] serviceLeftSignal; // Which signals from the incoming bundle we will pass on.
    // Determine which signals from left bundle are valid, in a signal bandwidth agnostic way.
    genvar i;
    generate
        for (i = 0; i < `SIGNAL_BANDWIDTH; i = i + 1) begin : validSignals
            assign leftValidSignals[i] = leftValidDepartingBundle &
                (leftDepartingBundle[(`SIGNAL_ENTRY_WIDTH*(i+1)) + `STORE_ENTRY_WIDTH - 1]);
        end
    endgenerate
    // Valid bit for store in incoming bundle.
    assign leftHasStore = leftValidDepartingBundle & leftDepartingBundle[`STORE_ENTRY_WIDTH-1];
    // OR all signal valid bits to determine if any signals are valid.
    assign leftHasSignal = |leftValidSignals;
    // AND all signal valid bits to determine whether the bundle has any empty space for a new signal.
    assign leftFullSignals = &leftValidSignals;
    assign coreHasStore = (coreCommandValid == 1'b1 && coreCommandType == `TYPE_STORE) ? 1'b1 : 1'b0;
    assign coreHasSignal = (coreCommandValid == 1'b1 && coreCommandType == `TYPE_SIGNAL) ? 1'b1 : 1'b0;
    assign serviceCoreSignal = (coreHasSignal == 1'b1 && leftFullSignals == 1'b0) ? 1'b1 : 1'b0;
    assign serviceCoreStore = (coreHasStore == 1'b1 && leftHasStore == 1'b0) ? 1'b1 : 1'b0;
    assign serviceLeftSignal = leftValidSignals; // All valid signals get serviced before any core signal.
    assign serviceLeftStore = (leftHasStore == 1'b1) ? 1'b1 : 1'b0;
    // Outputs from bundleizer, informing core and/or receive buffer that they can release
    // their current input/bundle. Only set high if it is guaranteed the bundle will send
    // (i.e. memory is ready and outbound link is ready).
    assign coreInputServiced = (memoryReady == 1'b0 || outboundLinkReady == 1'b0) ? 1'b0 :
                               serviceCoreSignal == 1'b1 || serviceCoreStore == 1'b1 ? 1'b1 :
                               1'b0;
    // As long as the memory and link are ready, and left has any valid bundle, it can be sent.
    // Output this so the receive buffer can release the bundle
    assign leftReleaseBundle = (memoryReady == 1'b0 || outboundLinkReady == 1'b0) ? 1'b0 :
                               (leftHasSignal == 1'b1 || leftHasStore == 1'b1) ? 1'b1 :
                               1'b0;
    // Use above calculated values to select proper stores/signals to send to outbound bundle
    wire [`STORE_ENTRY_WIDTH-1:0] coreStore;
    wire [`SIGNAL_ENTRY_WIDTH-1:0] coreSignal;
    wire [`STORE_ENTRY_WIDTH-1:0] leftStore;
    wire [`SIGNAL_ENTRY_WIDTH-1:0] leftSignal;
    wire [`STORE_ENTRY_WIDTH-1:0] chosenStore;
    wire [(`SIGNAL_BANDWIDTH*`SIGNAL_ENTRY_WIDTH)-1:0] chosenSignals;
    wire [`CORE_ID_WIDTH-1:0] coreId;
    assign coreId = CORE_ID[`CORE_ID_WIDTH-1:0];
    // Construct potential new entries for core/left store and signal.
    // Core id is part of the entry so that the entry can be removed from the bundle
    // when it has reached core #(coreId-1 % NUM_CORES).
    assign coreStore = {1'b1, coreId, coreCommandAddr, coreCommandData};
    assign coreSignal = {1'b1, coreId, coreCommandId};
    assign leftStore = leftDepartingBundle[`STORE_ENTRY_WIDTH-1:0];
    // Find the rightmost 1 bit in emptySlots, use that to add new signal to bundle
    wire [`SIGNAL_BANDWIDTH-1:0] emptySlots = ~leftValidSignals;
    wire [`SIGNAL_BANDWIDTH-1:0] rightmostEmptySlot = emptySlots & (-emptySlots);
    // Assign signals to output bundle in signal bandwidth agnostic way.
    // For the ith signal entry in bundle, check if it is the selected slot for a new core signal.
    // Otherwise, check to make sure the signal in the bundle is valid, and if so, pass it on to the output.
    generate
        for (i = 0; i < `SIGNAL_BANDWIDTH; i = i + 1) begin : chooseSignals
            assign chosenSignals[(`SIGNAL_ENTRY_WIDTH*(i+1))-1 : (`SIGNAL_ENTRY_WIDTH*(i))] =
                (serviceCoreSignal == 1'b1 && rightmostEmptySlot[i] == 1'b1) ?
                    coreSignal :
                serviceLeftSignal[i] == 1'b1 ?
                    leftDepartingBundle[(`SIGNAL_ENTRY_WIDTH*(i+1)) + `STORE_ENTRY_WIDTH - 1 :
                                        (`SIGNAL_ENTRY_WIDTH*(i)) + `STORE_ENTRY_WIDTH] :
                    {`SIGNAL_ENTRY_WIDTH{1'b0}};
        end
    endgenerate
    // Choose which of left or core get to be sent to outgoing bundle
    assign chosenStore = serviceCoreStore == 1'b1 ? coreStore :
                         serviceLeftStore == 1'b1 ? leftStore :
                         {`STORE_ENTRY_WIDTH{1'b0}};
    // Output is only valid if both the memory and link are available, and there is something in the bundle
    assign outputValid = (coreInputServiced == 1'b1 || leftReleaseBundle == 1'b1) ? 1'b1 : 1'b0;
    // Output selected signals and stores as the new bundle
    assign outputBundle = (outputValid == 1'b1) ?
                          {chosenSignals, chosenStore} :
                          {`FORWARD_NETWORK_BUNDLE_WIDTH{1'b0}};
endmodule
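
The bundleizer's comments say it picks the slot for a newly injected core signal by finding the rightmost 1 bit in the vector of empty slots. One common way to isolate that bit is to AND the vector with its own two's-complement negation. The following is only a minimal standalone sketch of that idiom: the module name lowest_free_slot, the parameter WIDTH, and the port names are hypothetical and are not part of the listed sources.

```verilog
// Hypothetical helper for illustration only; not part of the original sources.
module lowest_free_slot #(
    parameter WIDTH = 4                  // assumed number of signal slots
) (
    input  wire [WIDTH-1:0] validSlots,  // one bit per occupied slot
    output wire [WIDTH-1:0] chosenSlot,  // one-hot lowest free slot, all zeros if none
    output wire             full         // high when every slot is occupied
);
    wire [WIDTH-1:0] emptySlots = ~validSlots;
    // Negation inverts every bit above the lowest set bit and keeps that bit,
    // so the AND leaves only the lowest set bit of emptySlots.
    assign chosenSlot = emptySlots & (-emptySlots);
    assign full       = &validSlots;
endmodule
```

For example, with validSlots = 4'b0101 the empty vector is 4'b1010 and the chosen slot is 4'b0010. When every slot is occupied the empty vector is zero, so no slot is chosen.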
B.5 stopper.v
`include "defines.v"
// This module removes stores and signals that are at their terminal location from the circulating bundle.
module stopper (
    // The input corresponds to the output from the bundleizer module.
    // These items have already been sent to the RC memory and signal buffer,
    // and have been cleared to access the outgoing link if appropriate.
    input wire inputValid,
    input wire [`FORWARD_NETWORK_BUNDLE_WIDTH-1:0] inputBundle,
    // Any items not at their final destination (i.e., haven't traversed the entire ring yet)
    // are allowed to pass on to the outgoing link.
    output wire [`FORWARD_NETWORK_BUNDLE_WIDTH-1:0] outputBundle,
    output wire outputValid
);
    parameter CORE_ID = 0;
    wire [`CORE_ID_WIDTH-1:0] coreId;
    assign coreId = CORE_ID[`CORE_ID_WIDTH-1:0];
    // Only pass store along if this isn't its final destination. Any stores that have traversed the ring
    // are removed from the bundle.
    wire [`CORE_ID_WIDTH-1:0] bundleStoreTerminatingCoreId;
    wire [`STORE_ENTRY_WIDTH-1:0] bundleStore;
    wire [`STORE_ENTRY_WIDTH-1:0] chosenStore;
    assign bundleStoreTerminatingCoreId = inputBundle[`STORE_ENTRY_WIDTH-2:`ADDR_WIDTH+`DATA_WIDTH] - 1'b1;
    assign bundleStore = inputBundle[`STORE_ENTRY_WIDTH-1:0];
    assign chosenStore = (bundleStoreTerminatingCoreId == coreId) ? {`STORE_ENTRY_WIDTH{1'b0}} : bundleStore;
    // Only pass signals along if this isn't their final destination. Slightly ugly due to need to handle
    // different signal bandwidths. Any signals that have traversed the ring
    // are removed from the bundle.
    wire [`CORE_ID_WIDTH-1:0] bundleSignalTerminatingCoreId [0:`SIGNAL_BANDWIDTH-1];
    wire [`SIGNAL_ENTRY_WIDTH-1:0] bundleSignals [0:`SIGNAL_BANDWIDTH-1];
    wire [(`SIGNAL_BANDWIDTH*`SIGNAL_ENTRY_WIDTH)-1:0] chosenSignals;
    genvar i;
    generate
        for (i = 0; i < `SIGNAL_BANDWIDTH; i = i + 1) begin : choseSignals
            assign bundleSignalTerminatingCoreId[i] =
                inputBundle[(`SIGNAL_ENTRY_WIDTH*(i+1)) + `STORE_ENTRY_WIDTH - 2 :
                            (`SIGNAL_ENTRY_WIDTH*(i)) + `ID_WIDTH + `STORE_ENTRY_WIDTH] - 1'b1;
            assign bundleSignals[i] =
                inputBundle[(`SIGNAL_ENTRY_WIDTH*(i+1)) + `STORE_ENTRY_WIDTH - 1 :
                            (`SIGNAL_ENTRY_WIDTH*(i)) + `STORE_ENTRY_WIDTH];
            assign chosenSignals[(`SIGNAL_ENTRY_WIDTH*(i+1))-1 : (`SIGNAL_ENTRY_WIDTH*(i))] =
                (bundleSignalTerminatingCoreId[i] == coreId) ? {`SIGNAL_ENTRY_WIDTH{1'b0}} :
                bundleSignals[i];
        end
    endgenerate
    // Output is valid if input is valid, and the bundle has any remaining valid stores/signals
    assign outputValid = (inputValid == 1'b1) ? (|chosenStore) | (|chosenSignals) : 1'b0;
    assign outputBundle = {chosenSignals, chosenStore};
endmodule
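
The stopper decides that an entry has traversed the whole ring by subtracting one from the originating core ID stored in the entry and comparing the result with this node's own ID. The following is a minimal sketch of that comparison in isolation. The module name terminal_check and its port names are hypothetical, and the widths are plain parameters rather than the macros from defines.v. It also assumes the number of cores equals 2**CORE_ID_WIDTH, so that the wrapping subtraction matches the modulo in the bundleizer's comment; for other core counts the wraparound would need explicit handling.

```verilog
// Hypothetical helper for illustration only; not part of the original sources.
module terminal_check #(
    parameter CORE_ID_WIDTH = 4,  // assumed ID width; core count assumed to be 2**CORE_ID_WIDTH
    parameter CORE_ID       = 0   // ID of the node this check is attached to
) (
    input  wire [CORE_ID_WIDTH-1:0] originCoreId,  // ID of the core that injected the entry
    output wire                     atTerminal     // high when the entry should be removed here
);
    // The last node an entry visits is the one just before its origin on the ring.
    // Unsigned subtraction wraps, so origin 0 terminates at the highest core ID.
    wire [CORE_ID_WIDTH-1:0] terminatingCoreId = originCoreId - 1'b1;
    assign atTerminal = (terminatingCoreId == CORE_ID[CORE_ID_WIDTH-1:0]);
endmodule
```

With CORE_ID_WIDTH = 4 and CORE_ID = 15, an entry injected by core 0 gives terminatingCoreId = 4'b1111, so atTerminal is high at node 15.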
B.6 load_unit.v
`include "defines.v"
// This module handles all of the logic to handle loads in the ring cache.
// Loads have two sources: from the local core, or from a remote core.
// This unit arbitrates between who gets the load port of the memory,
// makes the request, and processes the result.
// If a core performs a load, and it misses, the result must be sent over the request network
// to find the home core for the address.
// Once a remote load, read from the request network, accesses the local memory, the reply
// is sent back over the reply network. Once the reply returns to the originating core,
// this module outputs the loaded value back to the core.
module load_unit (
    input wire clk,
    input wire reset,
    // Inputs and outputs to the owner core
    input wire coreCommandValid,
    input wire [`TYPE_WIDTH-1:0] coreCommandType,
    input wire [`ADDR_WIDTH-1:0] coreCommandAddr,
    // Inputs from the request and reply networks
    input wire outboundRequestLinkReady,
    input wire leftRequestValidDepartingBundle,
    input wire [`REQUEST_NETWORK_BUNDLE_WIDTH-1:0] leftRequestDepartingBundle,
    input wire outboundReplyLinkReady,
    input wire leftReplyValidDepartingBundle,
    input wire [`REPLY_NETWORK_BUNDLE_WIDTH-1:0] leftReplyDepartingBundle,
    // Result of the last memory load operation
    input wire requestCompleteLoad,
    input wire requestHitLoad,
    input wire [`DATA_WIDTH-1:0] dataOutLoad,
    // Peeks into the forwarding network receive buffers
    input wire [`FORWARD_NETWORK_BUNDLE_WIDTH-1:0] peekA,
    input wire [`FORWARD_NETWORK_BUNDLE_WIDTH-1:0] peekB,
    // The selected address to next load from the memory
    output wire addressToLoadValid,
    output wire [`ADDR_WIDTH-1:0] addressToLoad,
    // Outputs to the core with the result of its loads
    output wire coreLoadProcessed,
    output wire [`DATA_WIDTH-1:0] coreLoadResult,
    output wire coreLoadHit,
    // Outputs to send a new request/reply network item, and/or to release the current one
    output wire leftRequestReleaseBundle,
    output wire rightRequestValid,
    output wire [`REQUEST_NETWORK_BUNDLE_WIDTH-1:0] rightRequestDeparting,
    output wire leftReplyReleaseBundle,
    output wire rightReplyValid,
    output wire [`REPLY_NETWORK_BUNDLE_WIDTH-1:0] rightReplyDeparting
);
    parameter CORE_ID = 0;
    // Track what the read port is doing
    wire memServicingRemoteLoad;
    wire memServicingCoreLoad;
    reg pendingCoreLoad;
    reg pendingRemoteLoad;
    // Indicator bit for core load waiting to enter request network
    reg coreLoadEnqueuingRequestNetwork;
    reg coreLoadWaitingForReply;
    // Indicator bit for remote load waiting to enter reply network
    reg remoteLoadEnqueuingReplyNetwork;
    // For remote loads accessing this ring cache memory
    reg remoteLoadWaitingValid;
    reg [`REQUEST_NETWORK_BUNDLE_WIDTH-1:0] remoteLoadWaiting;
    // For remote loads waiting to access reply network
    reg [`DATA_WIDTH-1:0] remoteLoadData;
    // Combinational values related to request network
    wire coreLoadEnqueuingRequestNetworkNext;
    wire remoteLoadEnqueuingReplyNetworkNext;
    wire requestNetCanProceed;
    wire releaseCoreLoadToRequestNetwork;
    wire leftRequestLeaveNetwork;
    wire [`ADDR_WIDTH-1:0] leftRequestBundleHomeCore;
    wire leftRequestBundleFoundHome;
    // Combinational values related to the reply network
    wire leftReplyBundleReturnedToOrigin;
    wire releaseRemoteLoadToReplyNetwork;
    // Memory is servicing a load from the core if it already was, or if there's a new load waiting,
    // and not a load in the request network waiting.
    assign memServicingCoreLoad = pendingCoreLoad |
        (coreCommandValid == 1'b1 && coreCommandType == `TYPE_LOAD &&
         coreLoadEnqueuingRequestNetwork == 1'b0 && coreLoadWaitingForReply == 1'b0 &&
         pendingRemoteLoad == 1'b0 && remoteLoadWaitingValid == 1'b0);
    // Memory is servicing a load from the remote load network if it already was,
    // or if there is a new remote load waiting, and not one occupying the send buffer.
    assign memServicingRemoteLoad = pendingRemoteLoad |
        (~pendingCoreLoad & remoteLoadWaitingValid & ~remoteLoadEnqueuingReplyNetwork);
    /*****************
    Outputs to memory
    ****************/
    // Memory input is valid if we are servicing any load.
    assign addressToLoadValid = memServicingCoreLoad | memServicingRemoteLoad;
    // Set the read address based on whether we are servicing the core or the request network.
    assign addressToLoad = memServicingCoreLoad ? coreCommandAddr :
                           memServicingRemoteLoad ? remoteLoadWaiting[`ADDR_WIDTH-1:0] :
                           {`ADDR_WIDTH{1'b0}};
    /*********************
    Request network logic
    *********************/
    // Request network must not pass forwarding network if it contains any stores that match the
    // address of the request being forwarded. It is very very unlikely that this would occur,
    // but is necessary for correctness. Peek into receive buffer for forwarding network,
    // compare to candidate for leaving the request network.
    wire peekAValid;
    wire peekBValid;
    wire [`ADDR_WIDTH-1:0] peekAAddr;
    wire [`ADDR_WIDTH-1:0] peekBAddr;
    wire peeksMatchA;
    wire peeksMatchB;
    wire anyAddrMatch;
    assign peekAValid = peekA[`STORE_ENTRY_WIDTH-1];
    assign peekBValid = peekB[`STORE_ENTRY_WIDTH-1];
    assign peekAAddr = peekA[`DATA_WIDTH+`ADDR_WIDTH-1:`DATA_WIDTH];
    assign peekBAddr = peekB[`DATA_WIDTH+`ADDR_WIDTH-1:`DATA_WIDTH];
    assign peeksMatchA = peekAValid == 1'b1 && leftRequestValidDepartingBundle == 1'b1 &&
                         peekAAddr == leftRequestDepartingBundle[`ADDR_WIDTH-1:0] ? 1'b1 : 1'b0;
    assign peeksMatchB = peekBValid == 1'b1 && leftRequestValidDepartingBundle == 1'b1 &&
                         peekBAddr == leftRequestDepartingBundle[`ADDR_WIDTH-1:0] ? 1'b1 : 1'b0;
    assign anyAddrMatch = peeksMatchA | peeksMatchB;
    assign requestNetCanProceed = (~anyAddrMatch);
    // Check if arriving request needs to access this core's ring cache node
    assign leftRequestBundleHomeCore = `HOME_CORE(leftRequestDepartingBundle[`ADDR_WIDTH-1:0]);
    assign leftRequestBundleFoundHome = leftRequestBundleHomeCore[`CORE_ID_WIDTH-1:0] ==
                                        CORE_ID[`CORE_ID_WIDTH-1:0] ? 1'b1 : 1'b0;
    // Check if it is valid for item in request network to exit and access local node
    assign leftRequestLeaveNetwork = leftRequestValidDepartingBundle &
                                     leftRequestBundleFoundHome & ~remoteLoadWaitingValid;
    // Core load can enter request network if there is one waiting, and the outgoing link is ready,
    // and the request network doesn't have something to forward, or if it does,
    // this node is its final destination. An item in the network always has priority.
    assign releaseCoreLoadToRequestNetwork = coreLoadEnqueuingRequestNetwork & outboundRequestLinkReady &
        (~leftRequestValidDepartingBundle |
         (leftRequestLeaveNetwork & requestNetCanProceed));
    /**************************
    Outputs to Request network
    **************************/
    // Release item from request network buffer if it is accessing this node,
    // or if it is being passed on to next node on the right
    assign leftRequestReleaseBundle = requestNetCanProceed &
        (leftRequestLeaveNetwork |
         (leftRequestValidDepartingBundle & ~leftRequestBundleFoundHome &
          outboundRequestLinkReady
         )
        );
    // Send request network item to the right if there is one to pass on,
    // or a new one injected from the core.
    assign rightRequestValid = (leftRequestReleaseBundle & (~leftRequestLeaveNetwork)) |
                               releaseCoreLoadToRequestNetwork;
    assign rightRequestDeparting = leftRequestReleaseBundle & ~leftRequestLeaveNetwork ?
                                       leftRequestDepartingBundle :
                                   releaseCoreLoadToRequestNetwork ?
                                       {1'b1, CORE_ID[`CORE_ID_WIDTH-1:0], coreCommandAddr} :
                                       {`REQUEST_NETWORK_BUNDLE_WIDTH{1'b0}};
    /********************
    Reply network logic
    ********************/
    // Check if the oldest item in the reply network buffer originated from this core
    assign leftReplyBundleReturnedToOrigin = (leftReplyValidDepartingBundle) &
        (leftReplyDepartingBundle[`REPLY_NETWORK_BUNDLE_WIDTH-2:`DATA_WIDTH] ==
         CORE_ID[`CORE_ID_WIDTH-1:0] ? 1'b1 : 1'b0);
    // If there is a remote load done processing, and there isn't an item in the reply network,
    // release the remote load to the reply network.
    // It can also be released if the current item in the reply network is exiting at this node.
    assign releaseRemoteLoadToReplyNetwork = outboundReplyLinkReady & remoteLoadEnqueuingReplyNetwork &
        ((leftReplyValidDepartingBundle & leftReplyBundleReturnedToOrigin)
         | (~leftReplyValidDepartingBundle)
        );
    /*************************
    Outputs to reply network
    *************************/
    // Release item from reply network buffer if it is accessing this node,
    // or if it is being passed on to next node on the right
    assign leftReplyReleaseBundle = leftReplyValidDepartingBundle &
        ((~leftReplyBundleReturnedToOrigin & outboundReplyLinkReady) |
         leftReplyBundleReturnedToOrigin);
    // Send reply network item to the right if there is one to pass on,
    // or a recently finished remote load waiting to enter the reply network.
    assign rightReplyValid = (leftReplyReleaseBundle & (~leftReplyBundleReturnedToOrigin)) |
                             releaseRemoteLoadToReplyNetwork;
    assign rightReplyDeparting = leftReplyReleaseBundle & ~leftReplyBundleReturnedToOrigin ?
                                     leftReplyDepartingBundle :
                                 releaseRemoteLoadToReplyNetwork ?
                                     {remoteLoadWaiting[`REQUEST_NETWORK_BUNDLE_WIDTH-1:`ADDR_WIDTH], remoteLoadData} :
                                     {`REPLY_NETWORK_BUNDLE_WIDTH{1'b0}};
    /*****************
    Outputs to core
    ****************/
    assign coreLoadProcessed = (memServicingCoreLoad & requestCompleteLoad & requestHitLoad) |
                               leftReplyBundleReturnedToOrigin;
    assign coreLoadHit = (memServicingCoreLoad & requestCompleteLoad & requestHitLoad) |
                         leftReplyBundleReturnedToOrigin;
    assign coreLoadResult = (memServicingCoreLoad & requestCompleteLoad & requestHitLoad) ? dataOutLoad :
                            leftReplyBundleReturnedToOrigin ? leftReplyDepartingBundle[`DATA_WIDTH-1:0] :
                            {`DATA_WIDTH{1'b0}};
    /******************
    Next state values
    ******************/
    // Enqueue core load on request network if it missed locally
    assign coreLoadEnqueuingRequestNetworkNext = memServicingCoreLoad & ~requestHitLoad & requestCompleteLoad;
    // After remote load is complete, enqueue to reply network
    assign remoteLoadEnqueuingReplyNetworkNext = memServicingRemoteLoad & requestCompleteLoad;
    // State transitions
    always @(posedge clk) begin
        if (reset == 1'b0) begin
            // Remote loads started from this core
            if (coreLoadEnqueuingRequestNetworkNext == 1'b1) begin
                coreLoadEnqueuingRequestNetwork <= 1'b1;
            end
            else if (releaseCoreLoadToRequestNetwork == 1'b1) begin
                coreLoadEnqueuingRequestNetwork <= 1'b0;
                coreLoadWaitingForReply <= 1'b1;
            end
            else if (leftReplyBundleReturnedToOrigin == 1'b1) begin
                coreLoadWaitingForReply <= 1'b0;
            end
            // Remote loads from other cores
273
i f ( l e f t R e q u e s t L e a v eN e t w o r k == 1 ’ b 1 ) b e g in
r emo t eLo adWa i t i n gV a l i d <= 1 ’ b 1 ;
r emot eLoadWa i t ing <= l e f t R e q u e s t D e p a r t i n g B u n d l e ;
end
e l s e i f ( r emot eLoadEnqueu ingRep l yNe tworkNex t == 1 ’ b 1 ) b e g in
r emot eLoadEnqueu ingRep l yNe twork <= 1 ’ b 1 ;
r emoteLoadData <= da t aOutLoad ;
end
e l s e i f ( r e l e a s eR emo t eLo adToRep l yNe two rk == 1 ’ b 1 ) b e g in
r emo t eLo adWa i t i n gV a l i d <= 1 ’ b0 ;
r emot eLoadWa i t ing <= {‘REQUEST_NETWORK_BUNDLE_WIDTH { 1 ’ b0 } } ;
r emot eLoadEnqueu ingRep l yNe twork <= 1 ’ b0 ;
r emoteLoadData <= { ‘DATA_WIDTH { 1 ’ b0 } } ;
end
/ / S t a t e o f r e ad p o r t . I f r e q u e s t d idn ’ t c om p l e t e , r e c o r d t h i s f a c t .
pend ingCoreLoad <= memServ i c ingCoreLoad & ~ r e q u e s tComp l e t eLo a d ; ;
pendingRemoteLoad <= memServ i c ingRemoteLoad & ~ r e q u e s tComp l e t eLo a d ;
end
e l s e b e g i n
co r eLo adEnqu eu ingRequ e s tNe two rk <= 1 ’ b0 ;
c o r eLo a dWa i t i n g F o rR e p l y <= 1 ’ b0 ;
r emo t eLo adWa i t i n gV a l i d <= 1 ’ b0 ;
r emot eLoadWa i t ing <= {‘REQUEST_NETWORK_BUNDLE_WIDTH { 1 ’ b0 } } ;
r emoteLoadData <= { ‘DATA_WIDTH { 1 ’ b0 } } ;
r emot eLoadEnqueu ingRep l yNe twork <= 1 ’ b0 ;
p end ingCoreLoad <= 1 ’ b0 ;
pendingRemoteLoad <= 1 ’ b0 ;
end
end
endmodule
B.7 memory.v
`include "defines.v"
`include "priority_encoder.v"
`include "bloom_filter.v"
`include "array.v"
/* This module wraps the array module, and processes loads and stores for the ring cache.
*/
`define STATE_FLUSHING 2'd0
`define STATE_PENDING_EVICTION 2'd1
`define STATE_READY 2'd2
`define STATE_L1_LOAD 2'd1
module memory (
    input wire reset,
    input wire clk,
    output wire readReady,
    output wire writeReady,
    // Inputs for loads, either from core or request network
    input wire inputValidLoad,
    input wire [`ADDR_WIDTH-1:0] addressLoad,
    // Combinational outputs
    output wire [`DATA_WIDTH-1:0] dataOutLoad,
    output wire requestCompleteLoad,
    output wire requestHitLoad,
    // Inputs for stores
    input wire inputValidStore,
    input wire [`ADDR_WIDTH-1:0] addressStore,
    input wire [`DATA_WIDTH-1:0] dataStore,
    // Inputs/outputs for flush
    input wire startFlush,
    output wire finishedFlush,
    // Combinational outputs
    output wire requestCompleteStore,
    // Ports for writeback interaction with L1 cache
    input wire writebackAccepted,
    input wire writebackComplete,
    output wire writebackValid,
    output wire [`ADDR_WIDTH-1:0] writebackAddr,
    output wire [`ADDR_WIDTH-1:0] writebackData,
    // Ports for read interaction with L1 cache
    input wire cacheLoadAccepted,
    input wire cacheLoadComplete,
    input wire [`DATA_WIDTH-1:0] cacheLoadResult,
    output wire cacheLoadValid,
    output wire [`ADDR_WIDTH-1:0] cacheLoadAddr
);
parameter CORE_ID = 0;
parameter NUM_ENTRIES = 256;
parameter ASSOC = 1; // must be 1 for now
localparam NUM_SETS = (NUM_ENTRIES / ASSOC);
localparam NUM_INDEX_BITS = `CLOG2(NUM_SETS);
localparam NUM_TAG_BITS = `ADDR_WIDTH - 2 - NUM_INDEX_BITS;
localparam WAY_ENTRY_SIZE = (ASSOC * (1 + NUM_TAG_BITS + `DATA_WIDTH));
// Combinational intermediate values dependent on request, state, etc.
wire initiateNewWriteback;
// State bits
reg [1:0] readState;
reg [1:0] writeState;
// Combinational next state
reg [1:0] nextReadState;
reg [1:0] nextWriteState;
/*
Eviction/flush related:
*/
// Maximum number of items pending eviction <= number of entries in the data array, likely far, far fewer
reg [NUM_INDEX_BITS-1:0] evictionsAwaitingL1Confirmation;
//
reg pendingEvictValid;
reg [`DATA_WIDTH-1:0] pendingEvictData;
reg [`ADDR_WIDTH-1:0] pendingEvictAddr;
// Flush related state and combinational logic
reg flushWalkDone;
reg [NUM_SETS-1:0] ownerBitset;
wire [NUM_SETS-1:0] nextIndexBitToFlush;
wire [NUM_INDEX_BITS-1:0] nextIndexToFlush;
/*
L1 load related:
*/
// State
reg pendingL1LoadValid;
reg pendingL1LoadAccepted;
reg [`ADDR_WIDTH-1:0] pendingL1LoadAddr;
// Track which values have already been seen, so unnecessary request network accesses are avoided.
wire hashTableMiss;
/*
Memory array inputs/outputs
*/
wire port1Hit;
wire [`DATA_WIDTH-1:0] port1DataOut;
wire port2Valid;
wire port2WriteEnable;
wire [`ADDR_WIDTH-1:0] port2Address;
wire port2Eviction;
wire [`ADDR_WIDTH-1:0] port2ExistingAddr;
wire [`DATA_WIDTH-1:0] port2ExistingData;
wire port2Hit;
wire [`DATA_WIDTH-1:0] port2DataOut;
/*
Misc. memory address properties
*/
// For reads, to know if the L1 should be accessed instead of request network
wire [`ADDR_WIDTH-1:0] port1AddressHomeCore;
wire port1AddressHomeCoreMatches;
// For store evictions, to know whether to write back the evicted value to the L1
wire [`ADDR_WIDTH-1:0] port2ExistingAddressHomeCore;
wire port2ExistingAddressHomeCoreMatches;
// For stores, to record which stored values are 'owned' by this node
wire [`ADDR_WIDTH-1:0] port2AddressHomeCore;
wire port2AddressHomeCoreMatches;
// The actual memory array and logic.
array #(.NUM_ENTRIES(NUM_ENTRIES), .ASSOC(ASSOC)) array
(
    .clk(clk),
    .reset(reset),
    .port1Valid(inputValidLoad),
    .port1Address(addressLoad),
    .port1WriteData({`DATA_WIDTH{1'b0}}),
    .port1WriteEnable(1'b0),
    .port2Valid(port2Valid),
    .port2Address(port2Address),
    .port2WriteData(dataStore),
    .port2WriteEnable(port2WriteEnable),
    .port1DataOut(port1DataOut),
    .port1Hit(port1Hit),
    .port1Complete(),
    .port1Eviction(),
    .port1ExistingData(),
    .port1ExistingAddr(),
    .port2DataOut(port2DataOut),
    .port2Hit(port2Hit),
    .port2Complete(),
    .port2Eviction(port2Eviction),
    .port2ExistingData(port2ExistingData),
    .port2ExistingAddr(port2ExistingAddr)
);
// Hashtable/bloom filter tracks which addresses have
// already been seen by this node. If an address has
// never been seen before, and it is being loaded and misses,
// it can be loaded directly from the L1 even if this isn't the home core
// for that address.
bloom_filter hashLookup
(
    .addrToCheck(addressLoad),
    .addrToSet(addressStore),
    .addrToSetValid(inputValidStore & writeReady),
    .reset(reset),
    .clk(clk),
    .hashTableMiss(hashTableMiss)
);
// Flush helper. Uses the ownerBitset to track which stored addresses
// this node owns. This makes the flush more efficient by only loading and writing back
// the proper entries in the memory array. Currently assumes direct mapped cache.
assign nextIndexBitToFlush = ownerBitset;
priority_encoder #(.NUM_SETS(NUM_SETS), .NUM_INDEX_BITS(NUM_INDEX_BITS)) ownerEncoder
(
    .ownerBitset(ownerBitset),
    .nextIndexToFlush(nextIndexToFlush)
);
/****************
Store port logic
*****************/
wire [NUM_INDEX_BITS-1:0] addrStoreIndex;
assign writeReady = (writeState == `STATE_READY);
assign port2Valid = (inputValidStore & writeReady) | (writeState == `STATE_FLUSHING);
assign port2WriteEnable = (inputValidStore & writeReady);
assign port2Address = (writeState != `STATE_FLUSHING) ? addressStore :
                      {{(`ADDR_WIDTH-NUM_INDEX_BITS-2){1'b0}}, nextIndexToFlush, 2'b0};
assign addrStoreIndex = port2Address[NUM_INDEX_BITS+2-1:2];
assign port2ExistingAddressHomeCore = `HOME_CORE(port2ExistingAddr);
assign port2ExistingAddressHomeCoreMatches = port2ExistingAddressHomeCore[`CORE_ID_WIDTH-1:0] ==
                                             CORE_ID[`CORE_ID_WIDTH-1:0] ? 1'b1 : 1'b0;
assign port2AddressHomeCore = `HOME_CORE(port2Address);
assign port2AddressHomeCoreMatches = port2AddressHomeCore[`CORE_ID_WIDTH-1:0] ==
                                     CORE_ID[`CORE_ID_WIDTH-1:0] ? 1'b1 : 1'b0;
// If we are flushing, or just processing an eviction, initiate the writeback to the cache hierarchy.
// Only if this core is the home core for an address.
assign initiateNewWriteback = (writeState == `STATE_FLUSHING && flushWalkDone == 1'b0 &&
                               port2ExistingAddressHomeCoreMatches == 1'b1 && nextIndexBitToFlush != {NUM_SETS{1'b0}}) ? 1'b1 :
                              writeReady == 1'b1 && port2Eviction == 1'b1 && port2ExistingAddressHomeCoreMatches == 1'b1 ? 1'b1 :
                              1'b0;
/******************
Store port outputs
*******************/
// Assume that stores are always consumed the cycle they arrive.
// From the core's point of view this is true, since they are sent
// in parallel with the memory write anyway.
assign requestCompleteStore = inputValidStore & writeReady;
assign finishedFlush = (writeState == `STATE_FLUSHING) && (flushWalkDone == 1'b1) &&
                       (evictionsAwaitingL1Confirmation == 1'b0) && (pendingEvictValid == 1'b0);
// Write back to L1 if the tag lookup indicated that there needs to be an eviction,
// and the evicted address's home core is this core. Initiate the writeback the same cycle as the tag lookup,
// though it might not be accepted for writeback until a subsequent cycle.
assign writebackValid = (pendingEvictValid == 1'b1) ? 1'b1 :
                        initiateNewWriteback == 1'b1 ? 1'b1 :
                        1'b0;
assign writebackAddr = (pendingEvictValid == 1'b1) ? pendingEvictAddr :
                       initiateNewWriteback == 1'b1 ? port2ExistingAddr :
                       {`ADDR_WIDTH{1'b0}};
assign writebackData = (pendingEvictValid == 1'b1) ? pendingEvictData :
                       initiateNewWriteback == 1'b1 ? port2ExistingData :
                       {`DATA_WIDTH{1'b0}};
/***************************
Store port state transitions
****************************/
integer c;
always @(posedge clk) begin
    // Reset all state between loop invocations
    if (reset == 1'b1) begin
        for (c = 0; c < NUM_SETS; c = c + 1) begin
            ownerBitset[c] <= 1'b0;
        end
        writeState <= `STATE_READY;
        pendingEvictValid <= 1'b0;
        pendingEvictData <= {`DATA_WIDTH{1'b0}};
        pendingEvictAddr <= {`ADDR_WIDTH{1'b0}};
        evictionsAwaitingL1Confirmation <= {NUM_INDEX_BITS{1'b0}};
        flushWalkDone <= 1'b1;
    end
    else begin
        // If processing a new write...
        if (inputValidStore == 1'b1 && writeReady == 1'b1) begin
            ownerBitset[addrStoreIndex] <= port2AddressHomeCoreMatches;
            // Num outstanding evictions is this potential one,
            // minus any that might have come back this cycle
            evictionsAwaitingL1Confirmation <= initiateNewWriteback;
            // If something needed to be evicted, save the data and addr.
            if (initiateNewWriteback == 1'b1) begin
                pendingEvictData <= writebackData;
                pendingEvictAddr <= writebackAddr;
                pendingEvictValid <= writebackValid;
                writeState <= `STATE_PENDING_EVICTION;
            end
            else begin
                writeState <= `STATE_READY;
            end
        end
        else if (writeState == `STATE_PENDING_EVICTION) begin
            // Num outstanding evictions is minus any that might have come back this cycle
            evictionsAwaitingL1Confirmation <= evictionsAwaitingL1Confirmation - writebackComplete;
            if (writebackAccepted == 1'b1) begin
                pendingEvictData <= {`DATA_WIDTH{1'b0}};
                pendingEvictAddr <= {`ADDR_WIDTH{1'b0}};
                pendingEvictValid <= 1'b0;
            end
            if (writeReady == 1'b1) begin
                writeState <= `STATE_READY;
            end
        end
        else if (writeState == `STATE_READY && startFlush == 1'b1) begin
            writeState <= `STATE_FLUSHING;
            flushWalkDone <= 1'b0;
        end
        else if (writeState == `STATE_FLUSHING) begin
            evictionsAwaitingL1Confirmation <= evictionsAwaitingL1Confirmation +
                (initiateNewWriteback & (~pendingEvictValid)) - writebackComplete;
            // If there is nothing left to flush, set the walk finished flag.
            if (pendingEvictValid == 1'b0 && nextIndexBitToFlush == {NUM_SETS{1'b0}}) begin
                flushWalkDone <= 1'b1;
            end
            // If we haven't finished flushing, potentially writeback a new address,
            // as long as an existing one isn't pending acceptance.
            else if (pendingEvictValid == 1'b0 && flushWalkDone == 1'b0) begin
                // Invalidate ownership array
                ownerBitset[addrStoreIndex] <= 1'b0;
                // In the current implementation, the node always owns this address,
                // so this if will always be entered, unless the flush is ending.
                if (initiateNewWriteback == 1'b1) begin
                    pendingEvictData <= writebackData;
                    pendingEvictAddr <= writebackAddr;
                    pendingEvictValid <= writebackValid;
                end
            end
            else if (writebackAccepted == 1'b1) begin
                pendingEvictData <= {`DATA_WIDTH{1'b0}};
                pendingEvictAddr <= {`ADDR_WIDTH{1'b0}};
                pendingEvictValid <= 1'b0;
            end
            else if (finishedFlush == 1'b1) begin
                writeState <= `STATE_READY;
            end
        end
    end
end
/***************
Load port logic
****************/
wire rcHit;
wire rcMiss;
wire issueL1Load;
wire l1LoadFinished;
assign readReady = (readState == `STATE_READY);
assign port1AddressHomeCore = `HOME_CORE(addressLoad);
assign port1AddressHomeCoreMatches = port1AddressHomeCore[`CORE_ID_WIDTH-1:0] ==
                                     CORE_ID[`CORE_ID_WIDTH-1:0] ? 1'b1 : 1'b0;
assign rcHit = (readState == `STATE_READY) && (inputValidLoad == 1'b1) && (port1Hit == 1'b1) ? 1'b1 : 1'b0;
assign rcMiss = (readState == `STATE_READY) && (inputValidLoad == 1'b1) && (port1Hit == 1'b0) ? 1'b1 : 1'b0;
// Only issue L1 load if this node is the address owner, or if it has never been stored before
assign issueL1Load = rcMiss & (port1AddressHomeCoreMatches | hashTableMiss);
assign l1LoadFinished = readState == `STATE_L1_LOAD && pendingL1LoadAccepted == 1'b1 &&
                        cacheLoadComplete == 1'b1;
/************************
Load port output values
*************************/
assign cacheLoadValid = (pendingL1LoadValid == 1'b1 && pendingL1LoadAccepted == 1'b0) ? 1'b1 :
                        (pendingL1LoadValid == 1'b1 && pendingL1LoadAccepted == 1'b1) ? 1'b0 :
                        (issueL1Load == 1'b1) ? 1'b1 :
                        1'b0;
assign cacheLoadAddr = (pendingL1LoadValid == 1'b1 && pendingL1LoadAccepted == 1'b0) ? pendingL1LoadAddr :
                       (pendingL1LoadValid == 1'b1 && pendingL1LoadAccepted == 1'b1) ? {`ADDR_WIDTH{1'b0}} :
                       (issueL1Load == 1'b1) ? addressLoad :
                       {`ADDR_WIDTH{1'b0}};
assign dataOutLoad = rcHit ? port1DataOut :
                     l1LoadFinished ? cacheLoadResult :
                     {`DATA_WIDTH{1'b0}};
// The core considers a load as a hit if it was found locally, or loaded locally from its own L1.
// A miss implies the address needs to be fetched over the request network.
assign requestHitLoad = rcHit | l1LoadFinished;
// Inform the core that the load has completed, either found locally, loaded from L1 locally, or missed.
assign requestCompleteLoad = rcHit | l1LoadFinished |
                             (rcMiss & (~(port1AddressHomeCoreMatches | hashTableMiss)));
/***************************
Load port state transitions
****************************/
always @(posedge clk) begin
    if (reset == 1'b1) begin
        readState <= `STATE_READY;
        pendingL1LoadValid <= 1'b0;
        pendingL1LoadAddr <= {`ADDR_WIDTH{1'b0}};
        pendingL1LoadAccepted <= 1'b0;
    end
    else begin
        if (readState == `STATE_READY) begin
            if (issueL1Load == 1'b1) begin
                pendingL1LoadValid <= cacheLoadValid;
                pendingL1LoadAddr <= cacheLoadAddr;
                pendingL1LoadAccepted <= 1'b0;
                readState <= `STATE_L1_LOAD;
            end
        end
        else if (readState == `STATE_L1_LOAD) begin
            if (cacheLoadAccepted == 1'b1) begin
                pendingL1LoadAccepted <= 1'b1;
            end
            else if (l1LoadFinished) begin
                pendingL1LoadValid <= 1'b0;
                pendingL1LoadAddr <= {`ADDR_WIDTH{1'b0}};
                pendingL1LoadAccepted <= 1'b0;
                readState <= `STATE_READY;
            end
        end
    end
end
endmodule
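The load port of memory.v resolves every valid load in the ready state into one of three outcomes: a ring cache hit, a local L1 load, or a miss forwarded to the request network. The following is a minimal sketch of that decision in isolation, not part of the original design. The module name, the `outcome` encoding, and the shortened input names are hypothetical, and the sketch ignores `readState` for brevity.

```verilog
// Hypothetical illustration of the load-outcome decision made in memory.v.
// Input names mirror the listing's signals where noted; the outcome codes
// are assumptions made only for this sketch.
module load_outcome_sketch (
    input  wire       inputValidLoad,
    input  wire       ringCacheHit,     // port1Hit in the listing
    input  wire       homeCoreMatches,  // port1AddressHomeCoreMatches in the listing
    input  wire       hashTableMiss,    // address never stored to the ring cache
    output reg  [1:0] outcome
);
    localparam OUT_IDLE   = 2'd0;  // no load presented
    localparam OUT_HIT    = 2'd1;  // served from the ring cache array
    localparam OUT_L1     = 2'd2;  // fetched from this core's L1
    localparam OUT_REMOTE = 2'd3;  // must travel over the request network

    always @(*) begin
        if (!inputValidLoad)
            outcome = OUT_IDLE;
        else if (ringCacheHit)
            outcome = OUT_HIT;
        else if (homeCoreMatches | hashTableMiss)
            outcome = OUT_L1;      // corresponds to issueL1Load
        else
            outcome = OUT_REMOTE;  // load completes as a miss
    end
endmodule
```

In the listing, only the last case completes the load with `requestHitLoad` low, which is what prompts the enclosing node to enqueue the load on the request network.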
B.8 priority_encoder.v
// Priority encoder, takes the rightmost set bit of the input and outputs
// the binary representation of its index
// e.g.
// 4'b0101 -> 2'b00
// 4'b1110 -> 2'b01
// 4'b0100 -> 2'b10
// 4'b1000 -> 2'b11
// Used to reduce the number of memory words that need to be loaded
// to flush to the normal cache hierarchy from the ring cache.
module priority_encoder #(parameter NUM_SETS = 2, parameter NUM_INDEX_BITS = 1) (
    input wire [NUM_SETS-1:0] ownerBitset,
    output wire [NUM_INDEX_BITS-1:0] nextIndexToFlush
);
assign nextIndexToFlush = (ownerBitset[0]) ? 8'd0 :
                          (ownerBitset[1]) ? 8'd1 :
                          .
                          .
                          .
                          (ownerBitset[253]) ? 8'd253 :
                          (ownerBitset[254]) ? 8'd254 : 8'd255;
endmodule
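The listing above enumerates one ternary per bit and fixes the width at 8 bits, so it only fits 256 sets. A loop-based variant that follows the NUM_SETS and NUM_INDEX_BITS parameters might look like the sketch below. This is an illustrative alternative, not the dissertation's code: the module name `priority_encoder_loop` and the testbench are hypothetical, and the default output (the highest index when no bit is set) is chosen to mirror the fall-through of the chain above.

```verilog
// Hypothetical loop-based priority encoder: outputs the index of the
// lowest set bit of ownerBitset.
module priority_encoder_loop #(parameter NUM_SETS = 4, parameter NUM_INDEX_BITS = 2) (
    input  wire [NUM_SETS-1:0]       ownerBitset,
    output reg  [NUM_INDEX_BITS-1:0] nextIndexToFlush
);
    integer i;
    always @(*) begin
        // Default mirrors the listing's final fall-through (highest index).
        nextIndexToFlush = NUM_SETS - 1;
        // Scan from the top bit down; the lowest set bit is assigned last
        // and therefore determines the result.
        for (i = NUM_SETS - 1; i >= 0; i = i - 1) begin
            if (ownerBitset[i])
                nextIndexToFlush = i[NUM_INDEX_BITS-1:0];
        end
    end
endmodule

// Small self-checking testbench using the four examples from the comment
// at the top of priority_encoder.v.
module priority_encoder_loop_tb;
    reg  [3:0] bits;
    wire [1:0] idx;

    priority_encoder_loop #(.NUM_SETS(4), .NUM_INDEX_BITS(2)) dut (
        .ownerBitset(bits),
        .nextIndexToFlush(idx)
    );

    initial begin
        bits = 4'b0101; #1; if (idx !== 2'b00) $display("FAIL: 0101");
        bits = 4'b1110; #1; if (idx !== 2'b01) $display("FAIL: 1110");
        bits = 4'b0100; #1; if (idx !== 2'b10) $display("FAIL: 0100");
        bits = 4'b1000; #1; if (idx !== 2'b11) $display("FAIL: 1000");
        $display("checks finished");
        $finish;
    end
endmodule
```

Synthesis tools generally unroll such a loop into the same priority chain, so the choice between the two forms is mostly one of maintainability.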
B.9 array.v
`include "defines.v"
// This module provides an interface to a memory array.
// It currently only supports a direct mapped configuration, and models the memory
// array with an array of registers, which supports two combinational reads and two
// edge triggered writes per cycle. We currently use port 1 for ring cache loads,
// so the write on port 1 is unused. On port 2, we use the combinational read to
// do a tag lookup, and the write to do the store itself.
// An SRAM implementation of this would require being double clocked to get the
// read timing we desire. Given the small size of the array, just using
// registers might be acceptable.
// We add two special outputs, existingData and existingAddr. These are the
// addresses/data that were already present in the array at a particular
// index. If the access was a hit, the existingAddr will match portXAddress.
// If the access was a miss, it will be the value of the address/data
// previously stored there. This is used for writing back evictions.
module array (
    input wire reset,
    input wire clk,
    // Port 1 inputs
    input wire port1Valid,
    input wire [`ADDR_WIDTH-1:0] port1Address,
    input wire [`DATA_WIDTH-1:0] port1WriteData,
    input wire port1WriteEnable,
    // Port 1 outputs
    output wire [`DATA_WIDTH-1:0] port1DataOut,
    output wire port1Hit,
    output wire port1Complete,
    output wire port1Eviction,
    output wire [`DATA_WIDTH-1:0] port1ExistingData,
    output wire [`ADDR_WIDTH-1:0] port1ExistingAddr,
    // Port 2 inputs
    input wire port2Valid,
    input wire [`ADDR_WIDTH-1:0] port2Address,
    input wire [`DATA_WIDTH-1:0] port2WriteData,
    input wire port2WriteEnable,
    // Port 2 outputs
    output wire [`DATA_WIDTH-1:0] port2DataOut,
    output wire port2Hit,
    output wire port2Complete,
    output wire port2Eviction,
    output wire [`DATA_WIDTH-1:0] port2ExistingData,
    output wire [`ADDR_WIDTH-1:0] port2ExistingAddr
);
parameter NUM_ENTRIES = 256;
parameter ASSOC = 1; // must be 1 for now
localparam NUM_SETS = (NUM_ENTRIES / ASSOC);
localparam NUM_INDEX_BITS = `CLOG2(NUM_SETS);
localparam NUM_TAG_BITS = `ADDR_WIDTH - 2 - NUM_INDEX_BITS;
localparam WAY_ENTRY_SIZE = (ASSOC * (1 + NUM_TAG_BITS + `DATA_WIDTH));
// Data Array
reg [WAY_ENTRY_SIZE-1:0] array [0:NUM_SETS-1];
// Input addr split into tag and index for both ports
wire [NUM_TAG_BITS-1:0] port1AddrTag;
wire [NUM_INDEX_BITS-1:0] port1AddrIndex;
wire [NUM_TAG_BITS-1:0] port2AddrTag;
wire [NUM_INDEX_BITS-1:0] port2AddrIndex;
// Result of read split into entry, line, valid bit, tag, and data for both ports
wire [1+NUM_TAG_BITS+`DATA_WIDTH-1:0] port1ReadEntry;
wire [WAY_ENTRY_SIZE-1:0] port1ReadLine;
wire port1ReadValid;
wire [NUM_TAG_BITS-1:0] port1ReadTag;
wire [`DATA_WIDTH-1:0] port1ReadData;
wire [1+NUM_TAG_BITS+`DATA_WIDTH-1:0] port2ReadEntry;
wire [WAY_ENTRY_SIZE-1:0] port2ReadLine;
wire port2ReadValid;
wire [NUM_TAG_BITS-1:0] port2ReadTag;
wire [`DATA_WIDTH-1:0] port2ReadData;
// The address that was actually stored in memory. Same as the input address
// if the memory lookup was a hit.
wire [`ADDR_WIDTH-1:0] port1ReadAddr;
wire [`ADDR_WIDTH-1:0] port2ReadAddr;
// Does the loaded value's tag match the input address
wire port1ReadTagsMatch;
wire port2ReadTagsMatch;
// Potential write split into entry and line. Currently,
// port 1 never does writes, but keep this anyway.
wire [1+NUM_TAG_BITS+`DATA_WIDTH-1:0] port1WriteEntry;
wire [WAY_ENTRY_SIZE-1:0] port1WriteLine;
// Potential write split into entry and line
wire [1+NUM_TAG_BITS+`DATA_WIDTH-1:0] port2WriteEntry;
wire [WAY_ENTRY_SIZE-1:0] port2WriteLine;
// Properties of input address
assign port1AddrIndex = port1Address[`ADDR_WIDTH-NUM_TAG_BITS-1:2];
assign port1AddrTag = port1Address[`ADDR_WIDTH-1:`ADDR_WIDTH-NUM_TAG_BITS];
// Result of read from memory array
assign port1ReadLine = array[port1AddrIndex];
assign port1ReadEntry = port1ReadLine[1+NUM_TAG_BITS+`DATA_WIDTH-1:0];
assign port1ReadValid = port1ReadEntry[WAY_ENTRY_SIZE-1];
assign port1ReadTag = port1ReadEntry[WAY_ENTRY_SIZE-2:`DATA_WIDTH];
assign port1ReadData = port1ReadEntry[`DATA_WIDTH-1:0];
// The address that was actually present in the assigned set
assign port1ReadAddr = {port1ReadTag, port1AddrIndex, 2'b00};
assign port1ReadTagsMatch = (port1ReadTag == port1AddrTag) ? 1'b1 : 1'b0;
// Potential write line to write to array
assign port1WriteEntry = {1'b1, port1AddrTag, port1WriteData};
assign port1WriteLine = port1WriteEntry;
// Properties of input address
assign port2AddrIndex = port2Address[`ADDR_WIDTH-NUM_TAG_BITS-1:2];
assign port2AddrTag = port2Address[`ADDR_WIDTH-1:`ADDR_WIDTH-NUM_TAG_BITS];
// Result of read from memory array
assign port2ReadLine = array[port2AddrIndex];
assign port2ReadEntry = port2ReadLine[1+NUM_TAG_BITS+`DATA_WIDTH-1:0];
assign port2ReadValid = port2ReadEntry[WAY_ENTRY_SIZE-1];
assign port2ReadTag = port2ReadEntry[WAY_ENTRY_SIZE-2:`DATA_WIDTH];
assign port2ReadData = port2ReadEntry[`DATA_WIDTH-1:0];
// The address that was actually present in the assigned set for this store
assign port2ReadAddr = {port2ReadTag, port2AddrIndex, 2'b00};
assign port2ReadTagsMatch = (port2ReadTag == port2AddrTag) ? 1'b1 : 1'b0;
// Potential write line to write to array
assign port2WriteEntry = {1'b1, port2AddrTag, port2WriteData};
assign port2WriteLine = port2WriteEntry;
/ / O u t p u t s
a s s i g n p o r t 1 H i t = p o r t 1 V a l i d & p o r t 1 R e a d V a l i d & po r t 1R e a dT a g sMa t c h ;
a s s i g n p o r t 1D a t aOu t = p o r t 1 H i t ? p o r t 1 R e a dD a t a : { ‘DATA_WIDTH { 1 ’ b0 } } ;
a s s i g n p o r t 1 C omp l e t e = p o r t 1 V a l i d ;
a s s i g n p o r t 1 E v i c t i o n = p o r t 1 V a l i d & p o r t 1 R e a d V a l i d & p o r t 1W r i t e E n a b l e & ( ~ p o r t 1 H i t ) ;
a s s i g n p o r t 1 E x i s t i n g D a t a = p o r t 1 V a l i d & p o r t 1 R e a d V a l i d ? p o r t 1 R e a dD a t a : { ‘DATA_WIDTH { 1 ’ b0 } } ;
a s s i g n p o r t 1 E x i s t i n g A d d r = p o r t 1 V a l i d & p o r t 1 R e a d V a l i d ? p o r t 1R e a dAdd r : { ‘ADDR_WIDTH { 1 ’ b0 } } ;
a s s i g n p o r t 2H i t = p o r t 2 V a l i d & p o r t 2 R e a d V a l i d & po r t 2Re adTag sMa t ch ;
a s s i g n po r t 2Da t aOu t = p o r t 2H i t ? p o r t 2R e a dDa t a : { ‘DATA_WIDTH { 1 ’ b0 } } ;
a s s i g n po r t 2Comp l e t e = p o r t 2 V a l i d ;
a s s i g n p o r t 2 E v i c t i o n = p o r t 2 V a l i d & p o r t 2 R e a d V a l i d & p o r t 2W r i t e E n a b l e & ( ~ p o r t 2H i t ) ;
a s s i g n p o r t 2 E x i s t i n gD a t a = p o r t 2 V a l i d & p o r t 2 R e a d V a l i d ? p o r t 2R e a dDa t a : { ‘DATA_WIDTH { 1 ’ b0 } } ;
a s s i g n p o r t 2 E x i s t i n gA d d r = p o r t 2 V a l i d & p o r t 2 R e a d V a l i d ? p o r t 2Re adAdd r : { ‘ADDR_WIDTH { 1 ’ b0 } } ;
    integer c;
    always @(posedge clk) begin
        if (reset == 1'b1) begin
            for (c = 0; c < NUM_SETS; c = c + 1) begin
                array[c] <= {WAY_ENTRY_SIZE{1'b0}};
            end
        end
        else begin
            if (port1Valid & port1WriteEnable) begin
                array[port1AddrIndex] <= port1WriteEntry;
            end
            if (port2Valid & port2WriteEnable) begin
                array[port2AddrIndex] <= port2WriteEntry;
            end
        end
    end
endmodule
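The tag/index slicing and the hit/eviction outputs of the cache way can be checked with a small software model. The Python sketch below mirrors the bit slices used in the `assign` statements. The widths (`ADDR_WIDTH = 32`, `NUM_TAG_BITS = 22`) and all function names are assumptions made for illustration; they are not taken from `defines.v` or from the listing.

```python
ADDR_WIDTH = 32    # assumed value of the `ADDR_WIDTH macro
NUM_TAG_BITS = 22  # assumed; defined elsewhere in the real module
INDEX_BITS = ADDR_WIDTH - NUM_TAG_BITS - 2  # bits [ADDR_WIDTH-NUM_TAG_BITS-1:2]


def split_address(addr):
    """Return (tag, index), sliced the same way as portNAddrTag / portNAddrIndex."""
    index = (addr >> 2) & ((1 << INDEX_BITS) - 1)
    tag = (addr >> (ADDR_WIDTH - NUM_TAG_BITS)) & ((1 << NUM_TAG_BITS) - 1)
    return tag, index


def rebuild_address(tag, index):
    """Model of {tag, index, 2'b00}: the word-aligned address held in a set."""
    return (tag << (ADDR_WIDTH - NUM_TAG_BITS)) | (index << 2)


def port_outputs(valid, write_enable, read_valid, read_tag, addr):
    """Model of the hit and eviction outputs for one port (hypothetical helper)."""
    tag, _ = split_address(addr)
    hit = valid and read_valid and read_tag == tag
    eviction = valid and read_valid and write_enable and not hit
    return hit, eviction
```

Under these assumed widths, splitting an address and rebuilding it yields the same address with its two low (byte-offset) bits cleared, which is what `portNReadAddr` reports for an existing entry.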
B.10 bloom_filter.v
`include "hash.v"
// This module provides one or more hash functions to check if
// a memory address has already been stored to the ring cache.
// It supports two simultaneous address inputs: one for
// an address being loaded from the ring cache, for which
// we want to know whether it has been seen already or not, and one
// for an address being written to the ring cache, for which we want
// to record this fact in the hash table.
// Seeds for the hash functions are random 32-bit numbers.
module bloom_filter (
    input wire reset,
    input wire clk,
    input wire [`ADDR_WIDTH-1:0] addrToCheck, // Return result based on this address
    input wire [`ADDR_WIDTH-1:0] addrToSet,   // Set the corresponding bits in the hash table for
    input wire addrToSetValid,                // this address.
    // The only output is whether the addrToCheck was absent
    // from the hash table, which is all we care about.
    output wire hashTableMiss
);
    parameter BITS_IN_TABLE = 512;
    localparam INDEX_BITS = `CLOG2(BITS_IN_TABLE);
    // Bit array to record result of one or more hash functions.
    // Each hash function will set a bit in this table. When checking for
    // membership, the corresponding bits for each hash function are checked,
    // and if both are set, then the item has already been added.
    reg [BITS_IN_TABLE-1:0] tableBitSet;
    wire [INDEX_BITS-1:0] checkIndex1;
    wire [INDEX_BITS-1:0] checkIndex2;
    wire [INDEX_BITS-1:0] setIndex1;
    wire [INDEX_BITS-1:0] setIndex2;
    wire hash1IsSet;
    wire hash2IsSet;
    hash #(.INDEX_BITS(INDEX_BITS)) hashCheck1 (
        .addr(addrToCheck),
        .hashSeed(32'h2abf7209),
        .out(checkIndex1)
    );
    hash #(.INDEX_BITS(INDEX_BITS)) hashCheck2 (
        .addr(addrToCheck),
        .hashSeed(32'h1a8fcee7),
        .out(checkIndex2)
    );
    hash #(.INDEX_BITS(INDEX_BITS)) hashSet1 (
        .addr(addrToSet),
        .hashSeed(32'h2abf7209),
        .out(setIndex1)
    );
    hash #(.INDEX_BITS(INDEX_BITS)) hashSet2 (
        .addr(addrToSet),
        .hashSeed(32'h1a8fcee7),
        .out(setIndex2)
    );
    // Check if both hash bits were already set.
    assign hash1IsSet = tableBitSet[checkIndex1];
    assign hash2IsSet = tableBitSet[checkIndex2];
    assign hashTableMiss = ~(hash1IsSet == 1'b1 && hash2IsSet == 1'b1);
    always @(posedge clk) begin
        if (reset == 1'b1) begin
            tableBitSet <= {BITS_IN_TABLE{1'b0}};
        end
        else begin
            if (addrToSetValid == 1'b1) begin
                tableBitSet[setIndex1] <= 1'b1;
                tableBitSet[setIndex2] <= 1'b1;
            end
        end
    end
endmodule
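The behavior of this two-hash Bloom filter can be illustrated with a short software model. The Python sketch below is an illustration and not part of the RTL: the class and method names are hypothetical, the address width of 32 bits is an assumption, and the clocked update is collapsed into a plain method call. The two seeds and the 512-bit table size are the ones in the listing.

```python
SEEDS = (0x2ABF7209, 0x1A8FCEE7)  # the two seeds used in the listing


class BloomFilterModel:
    """Behavioral model of bloom_filter.v (hypothetical helper class)."""

    def __init__(self, bits_in_table=512, addr_width=32):
        self.addr_width = addr_width                     # assumed `ADDR_WIDTH
        self.index_bits = (bits_in_table - 1).bit_length()  # CLOG2(BITS_IN_TABLE)
        self.table = 0                                   # tableBitSet after reset

    def _indices(self, addr):
        mask = (1 << self.addr_width) - 1
        # Multiply, keep the low addr_width bits, then take the top index_bits.
        return [(((addr & mask) * seed) & mask) >> (self.addr_width - self.index_bits)
                for seed in SEEDS]

    def miss(self, addr):
        """Model of hashTableMiss: true unless both hashed bits are set."""
        return not all((self.table >> i) & 1 for i in self._indices(addr))

    def set(self, addr):
        """Model of a cycle with addrToSetValid high."""
        for i in self._indices(addr):
            self.table |= 1 << i

    def reset(self):
        self.table = 0
```

As with any Bloom filter, a reported miss is definite, while a reported hit may be a false positive caused by other addresses having set the same two bits.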
B.11 hash.v
// This module provides a multiply and shift hash function for 32-bit addresses.
// INDEX_BITS is the number of bits of output, i.e. the address size of the
// bit set where the result will be recorded/checked.
module hash #(parameter INDEX_BITS = 9) (
    input [`ADDR_WIDTH-1:0] addr,      // Address to hash
    input [`ADDR_WIDTH-1:0] hashSeed,  // Random 32-bit hash seed.
    output [INDEX_BITS-1:0] out
);
    wire [(`ADDR_WIDTH*2)-1:0] multResultHash;
    wire [`ADDR_WIDTH-1:0] truncatedResult;
    wire [INDEX_BITS-1:0] shiftedResult;
    assign multResultHash = addr * hashSeed;
    assign truncatedResult = multResultHash[`ADDR_WIDTH-1:0];
    assign shiftedResult = truncatedResult[`ADDR_WIDTH-1:`ADDR_WIDTH-INDEX_BITS];
    assign out = shiftedResult;
endmodule
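The multiply-and-shift scheme above can be written in a few lines of software, which is convenient for cross-checking hash indices against a simulation. The following Python sketch assumes a 32-bit `ADDR_WIDTH`; the function name is hypothetical.

```python
def multiply_shift_hash(addr, seed, index_bits=9, addr_width=32):
    """Model of hash.v: multiply, truncate to addr_width bits, keep the top index_bits."""
    mask = (1 << addr_width) - 1
    product = (addr & mask) * (seed & mask)   # multResultHash
    truncated = product & mask                # truncatedResult
    return truncated >> (addr_width - index_bits)  # shiftedResult
```

For example, hashing address 1 simply returns the top nine bits of the seed, so `multiply_shift_hash(1, 0x2ABF7209)` evaluates to 85.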
B.12 signal_buffer.v
`include "defines.v"
`include "signal_buffer_signal_tracker.v"
// This module implements all of the logic for the signal buffer.
// The compiler must know how many signal ids are available in the hardware
// before compilation. The more signal ids available, the better the potential performance.
// Each signal id has a corresponding bit set to track which cores have sent that signal
// (and potentially how many times within a sliding window of iterations they have sent that signal).
// On every cycle, it can record up to `SIGNAL_BANDWIDTH signals, to different signal ids.
// It can also check if all the signals corresponding to a wait instruction have
// been received, so that the injecting core can enter a sequential segment.
// Signals from other cores as well as from the core attached to this RC node pass through the
// signal buffer.
// If the final signal to unblock a wait is received on the same cycle
// a core queries whether said wait can be unblocked, the release signal
// is raised that very cycle.
// The signal buffer has two special signal bit sets corresponding to two special wait/signal
// pairs that control the RC memory flush at the end of a loop invocation. After the first
// pair is unblocked (signal id `NUM_SIGNALS-1), the core may begin to flush. After the memory
// finishes flushing, the next signal is sent (signal id `NUM_SIGNALS-2) to inform all
// other cores of that fact. When the wait instruction corresponding to this final signal
// is released, all cores know that all other cores have finished their flush. The compiler
// must insert these special signals at the end of every loop invocation.
module signal_buffer (
    input wire clk,
    input wire reset,
    // Used only to match up with C++ simulations for testing,
    // since simulated phases might start on arbitrary iterations.
    // It controls which signals each core records as having already been received.
    input wire [`CORE_ID_WIDTH-1:0] firstIterationCoreId,
    // Up to `SIGNAL_BANDWIDTH number of incoming signals recorded each cycle.
    input wire [(`SIGNAL_BANDWIDTH*`SIGNAL_ENTRY_WIDTH)-1:0] incomingSignals,
    // A wait ID the core is currently executing, and is checking whether
    // the signals corresponding to this ID have been received from each core.
    // Also whether the wait is 'light' and potentially skippable if
    // releasing it doesn't over/underflow the bit sets.
    input wire incomingWaitValid,
    input wire [`ID_WIDTH-1:0] incomingWaitId,
    input wire incomingWaitLight,
    // Output to tell the core the wait it is executing can be released,
    // since the proper signals have already been received.
    // A special wait release output is to tell the core it must flush the RC memory before
    // releasing the wait instruction.
    output wire waitReleasedToStartFlush,
    output wire waitReleased
);
    // See technical report for description.
    // Corresponds to the size of the sliding window of iterations signals can
    // potentially be received from.
    parameter EPOCH_BOUND = 2;
    parameter RECEIVER_CORE = 0; // The core id of this RC node's core.
    // Only go up to NUM_SIGNALS-2, since special flush signals are the top two.
    wire [`NUM_SIGNALS-2-1:0] waitReleasedFromCore;
    // Special wait release signal when the wait corresponding to the flush is released.
    wire waitReleasedFromStartFlush;
    // Special wait release signal for when the final wait of the loop is released.
    wire waitReleasedFromFinishLoop;
    // Only one wait can be checked per cycle,
    // so just OR the result of all signal releases together to find out if the core should proceed.
    assign waitReleased = (|waitReleasedFromCore) | waitReleasedFromFinishLoop;
    // Need a special output for the flush wait,
    // so ring cache can hang on to it until the flush has started then finished
    assign waitReleasedToStartFlush = waitReleasedFromStartFlush;
    // Create one submodule for each possible signal.
    // The compiler can limit the produced code to only having a specific
    // number of signals. Each submodule has N counters, where N = number of simulated cores.
    // Note the -2 in the loop end condition,
    // the highest two signal ids are reserved for the flush signals.
    // Also note the incoming wait instruction to check is only valid if the ID matches the submodule.
    genvar i;
    generate
        for (i = 0; i < `NUM_SIGNALS-2; i = i + 1) begin : signalBitSets
            signal_tracker #(.EPOCH_BOUND(EPOCH_BOUND), .RECEIVER_CORE(RECEIVER_CORE), .SEGMENT_ID(i))
            signalBits (
                .clk(clk),
                .reset(reset),
                .firstIterationCoreId(firstIterationCoreId),
                .incomingSignals(incomingSignals),
                .incomingWaitValid(incomingWaitValid && (incomingWaitId == i[`ID_WIDTH-1:0])),
                .incomingWaitLight(incomingWaitLight),
                .waitReleased(waitReleasedFromCore[i])
            );
        end
    endgenerate
    // Special flush/finish bits only need epoch bound 1,
    // since they by design don't allow cores to enter different epochs
    signal_tracker #(.EPOCH_BOUND(1), .RECEIVER_CORE(RECEIVER_CORE), .SEGMENT_ID((`NUM_SIGNALS-2)))
    signalFinishLoopBits (
        .clk(clk),
        .reset(reset),
        .firstIterationCoreId(firstIterationCoreId),
        .incomingSignals(incomingSignals),
        .incomingWaitValid(incomingWaitValid && (incomingWaitId == (`NUM_SIGNALS-2))),
        .incomingWaitLight(1'b0),
        .waitReleased(waitReleasedFromFinishLoop)
    );
    signal_tracker #(.EPOCH_BOUND(1), .RECEIVER_CORE(RECEIVER_CORE), .SEGMENT_ID((`NUM_SIGNALS-1)))
    signalStartFlushBits (
        .clk(clk),
        .reset(reset),
        .firstIterationCoreId(firstIterationCoreId),
        .incomingSignals(incomingSignals),
        .incomingWaitValid(incomingWaitValid && incomingWaitId == (`NUM_SIGNALS-1)),
        .incomingWaitLight(1'b0),
        .waitReleased(waitReleasedFromStartFlush)
    );
endmodule
B.13 signal_buffer_signal_tracker.v
`include "defines.v"
`include "signal_buffer_core_tracker.v"
// This module represents all the bits needed to track a single signal. There are # of cores submodules, since
// we need to track received signals for this id from all cores.
module signal_tracker (
    input wire clk,
    input wire reset,
    // Used only to match up with C++ simulations for testing,
    // since simulated phases might start on arbitrary iterations
    input wire [`CORE_ID_WIDTH-1:0] firstIterationCoreId,
    // Up to `SIGNAL_BANDWIDTH incoming signals to record this cycle.
    input wire [(`SIGNAL_BANDWIDTH*`SIGNAL_ENTRY_WIDTH)-1:0] incomingSignals,
    // If this input is high, see if enough signals have been received to let this core
    // enter the sequential segment corresponding to the SEGMENT_ID of this submodule.
    input wire incomingWaitValid,
    input wire incomingWaitLight,
    // If incomingWaitValid is high, raise this output high only if it is safe
    // to enter the sequential segment.
    output wire waitReleased
);
    parameter EPOCH_BOUND = 2;   // Controls the size of the sliding window for signal tracking.
    parameter RECEIVER_CORE = 0; // The core this signal buffer is attached to.
    parameter SEGMENT_ID = 0;    // The signal being tracked by these bit sets.
    // Break out the different signal entry fields for the `SIGNAL_BANDWIDTH incoming signals,
    // to better check which are appropriate for this module.
    wire [`SIGNAL_ENTRY_WIDTH-1:0] breakoutSignals [0:`SIGNAL_BANDWIDTH-1];
    wire signalsValidBit [0:`SIGNAL_BANDWIDTH-1];
    wire [`ID_WIDTH-1:0] signalsIds [0:`SIGNAL_BANDWIDTH-1];
    wire [(`CORE_ID_WIDTH*`SIGNAL_BANDWIDTH)-1:0] signalsOriginCoreIds;
    wire [`SIGNAL_BANDWIDTH-1:0] signalsMatchThisModule;
    // Check the bit sets for every core to see if a possible wait being
    // executed by the receiver core can be released.
    // Can only release if all cores have sent the proper signals,
    // so AND the wait release value from each core.
    wire [`NUM_CORES-1:0] waitReleasedFromCore;
    assign waitReleased = &waitReleasedFromCore;
    // Generate valid bits for the incoming signals,
    // to mark which are appropriate to record for this signal ID,
    // and having been sent from which core.
    genvar i;
    generate
        for (i = 0; i < `SIGNAL_BANDWIDTH; i = i + 1) begin : validSignals
            assign breakoutSignals[i] = incomingSignals[(`SIGNAL_ENTRY_WIDTH*(i+1))-1 :
                                                        (`SIGNAL_ENTRY_WIDTH*(i))];
            assign signalsValidBit[i] = breakoutSignals[i][`SIGNAL_ENTRY_WIDTH-1];
            assign signalsOriginCoreIds[(`CORE_ID_WIDTH*(i+1))-1 : (`CORE_ID_WIDTH*i)] =
                signalsValidBit[i] == 1'b1 ?
                breakoutSignals[i][`SIGNAL_ENTRY_WIDTH-2:`ID_WIDTH] : {`CORE_ID_WIDTH{1'b0}};
            assign signalsIds[i] = signalsValidBit[i] == 1'b1 ?
                breakoutSignals[i][`ID_WIDTH-1:0] : {`ID_WIDTH{1'b0}};
            // Only pass signals to submodule if the signal ids match the signal ID of this module
            assign signalsMatchThisModule[i] = signalsValidBit[i] == 1'b1 &&
                signalsIds[i] == SEGMENT_ID[`ID_WIDTH-1:0];
        end
    endgenerate
    // Generate signal tracking bits for each possible sending core.
    // This generate also creates tracking bits for the receiver core, even though that is completely unnecessary.
    // Synthesis optimizes it away.
    // As a parameter this submodule is assigned a sending core (SENDER_CORE).
    // Only signals corresponding to SEGMENT_ID are set to valid.
    // A special parameter corresponds to whether SEGMENT_ID is one of the special flush signals,
    // which are initialized slightly differently.
    generate
        for (i = 0; i < `NUM_CORES; i = i + 1) begin : coreBits
            core_tracker #(.EPOCH_BOUND(EPOCH_BOUND), .RECEIVER_CORE(RECEIVER_CORE),
                           .SENDER_CORE(i[`CORE_ID_WIDTH-1:0]),
                           .IS_FLUSH_SIGNAL((SEGMENT_ID == (`NUM_SIGNALS-1)) || (SEGMENT_ID == (`NUM_SIGNALS-2))))
            coreBits (
                .clk(clk),
                .reset(reset),
                .firstIterationCoreId(firstIterationCoreId),
                .incomingSignalsValid(signalsMatchThisModule),
                .incomingSignalsCoreIds(signalsOriginCoreIds),
                .incomingWaitValid(incomingWaitValid),
                .incomingWaitLight(incomingWaitLight),
                .waitReleased(waitReleasedFromCore[i])
            );
        end
    endgenerate
endmodule
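The field layout that `signal_tracker` extracts from each packed signal entry (valid bit on top, then the origin core id, then the signal id) can be sketched in software as follows. The widths `CORE_ID_WIDTH = 4` and `ID_WIDTH = 4`, the relation `ENTRY_WIDTH = 1 + CORE_ID_WIDTH + ID_WIDTH`, and the helper names are assumptions for illustration; the actual values live in `defines.v`, which is not reproduced here.

```python
CORE_ID_WIDTH = 4  # assumed value of `CORE_ID_WIDTH
ID_WIDTH = 4       # assumed value of `ID_WIDTH
ENTRY_WIDTH = 1 + CORE_ID_WIDTH + ID_WIDTH  # assumed: valid bit, core id, signal id


def pack_entry(valid, core_id, signal_id):
    """Build one signal entry (hypothetical helper for testing)."""
    return (int(valid) << (ENTRY_WIDTH - 1)) | (core_id << ID_WIDTH) | signal_id


def unpack_entries(bus, bandwidth):
    """Split the packed incomingSignals bus into (valid, core_id, signal_id) tuples."""
    entries = []
    for i in range(bandwidth):
        entry = (bus >> (ENTRY_WIDTH * i)) & ((1 << ENTRY_WIDTH) - 1)
        valid = bool(entry >> (ENTRY_WIDTH - 1))
        # Invalid entries are forced to zero, as in the listing.
        core_id = (entry >> ID_WIDTH) & ((1 << CORE_ID_WIDTH) - 1) if valid else 0
        signal_id = entry & ((1 << ID_WIDTH) - 1) if valid else 0
        entries.append((valid, core_id, signal_id))
    return entries


def signals_matching(bus, bandwidth, segment_id):
    """Model of signalsMatchThisModule for a tracker with the given SEGMENT_ID."""
    return [valid and signal_id == segment_id
            for valid, _, signal_id in unpack_entries(bus, bandwidth)]
```

Each tracker therefore sees the same bus but records only the valid entries whose signal id equals its own `SEGMENT_ID`.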
B.14 signal_buffer_core_tracker.v
`include "defines.v"
// Corresponds to a single bit set, belonging to a single core, a single signal id,
// and representing the signals received from a single core.
// Any value >= EPOCH_BOUND means a wait instruction for that signal ID can proceed.
// A counter fulfilling this restriction implies that more signals have been received
// for this id from a particular sending core than have been sent by the receiver core,
// therefore the receiver core is safe to enter the sequential
// segment corresponding to the signal id.
// The compiler needs to guarantee, either through properties of the generated code,
// or by inserting light wait instructions, that the counter bits will never overflow or underflow.
// Whenever a core (that isn't the receiver) sends a signal corresponding to this module,
// the counter is incremented.
// When the receiver core of these bits sends a signal, the counter is decremented.
// Note that the tech report refers to states "-1", "0", "1", etc.
// The actual counting registers begin counting at 0,
// so state "-1" for an epoch_bound of 2 is recorded as 0 in the counter,
// state 0 recorded as 1, state 1 as 2, etc.
module core_tracker (
    input wire clk,
    input wire reset,
    // The core that ran the first iteration of the loop. Is usually/always core 0.
    input wire [`CORE_ID_WIDTH-1:0] firstIterationCoreId,
    // For each of the `SIGNAL_BANDWIDTH received signals, which core sent it and
    // whether it is valid. This info is used to decide which signals
    // should be recorded by this module.
    input wire [(`CORE_ID_WIDTH*`SIGNAL_BANDWIDTH)-1:0] incomingSignalsCoreIds,
    input wire [`SIGNAL_BANDWIDTH-1:0] incomingSignalsValid,
    // Whether we want to know if enough signals corresponding to the signal
    // id and sending core have been received such that we can release
    // the receiver core into the corresponding sequential segment.
    input wire incomingWaitValid,
    input wire incomingWaitLight,
    // Combinational output as to whether a wait instruction is clear to proceed.
    output reg waitReleased
);
    // For a value of 2, cores can drift 2 epochs of iterations apart.
    parameter EPOCH_BOUND = 2;
    // Which ring node this signal buffer belongs to.
    parameter RECEIVER_CORE = 0;
    // What sending core this set of signal tracking bits refers to.
    // If == RECEIVER_CORE, this module is optimized away.
    parameter SENDER_CORE = 0;
    // Special flush signals are initialized and treated slightly differently.
    parameter IS_FLUSH_SIGNAL = 0;
    // The initial value of the counters when the loop begins. Corresponds to state "0"
    localparam INITIAL_VALUE = EPOCH_BOUND - 1;
    // The value of the counters at which a normal wait instruction can proceed. Corresponds to state "1"
    localparam RELEASE_THRESH = EPOCH_BOUND;
    // State bits
    // EPOCH_BOUND*2 is the total number of required states, so 2 -> 2 bits
    reg [(`CLOG2(EPOCH_BOUND*2))-1:0] counter;
    // Combinational next counter value.
    reg [(`CLOG2(EPOCH_BOUND*2))-1:0] nextCounter;
    // Need to preset certain bits in signal buffer because on the first trip of iterations,
    // signals are only expected from a subset of cores.
    // Most of the complexity here is from starting a loop not on the first iteration (that is, on core 0).
    // This is for phase simulation purposes.
    // For a 'real' implementation, firstIterationCore would always equal core 0, so this logic simplifies.
    always @(posedge clk) begin
        if (reset == 1'b1) begin
            // For flush signals, set all cores below the release threshold, since we want to wait for all cores.
            if (IS_FLUSH_SIGNAL == 1'b1) begin
                counter <= INITIAL_VALUE;
            end
            // The receiver core of these bits doesn't actually need to track its own signals,
            // but it's easier to assume that there are num bit sets == num cores,
            // so just preset the bits to always release waits. Synthesis optimizes this away.
            else if (RECEIVER_CORE == SENDER_CORE) begin
                counter <= (EPOCH_BOUND*2)-1;
            end
            // Initialize all counters if not starting the first iteration.
            else if (RECEIVER_CORE != firstIterationCoreId) begin
                // If the core runs iterations in the same epoch as the first iteration core:
                if (RECEIVER_CORE > firstIterationCoreId) begin
                    if (SENDER_CORE < firstIterationCoreId || SENDER_CORE > RECEIVER_CORE) begin
                        counter <= RELEASE_THRESH;
                    end
                    else begin
                        counter <= INITIAL_VALUE;
                    end
                end
                // If this core is before the first iteration core
                // (that is, only starts running iterations from the second trip of iterations)
                else begin
                    if (SENDER_CORE < firstIterationCoreId && SENDER_CORE > RECEIVER_CORE) begin
                        counter <= RELEASE_THRESH;
                    end
                    else begin
                        counter <= INITIAL_VALUE;
                    end
                end
            end
            // If the owning core is the first core of the iteration,
            // it skips all of the first iteration wait instructions
            else begin
                counter <= RELEASE_THRESH;
            end
        end
        // If not resetting
        else begin
            counter <= nextCounter;
        end
    end
    // Record received signal logic
    wire [`SIGNAL_BANDWIDTH-1:0] matchesReceiverCoreId;
    wire [`SIGNAL_BANDWIDTH-1:0] matchesBitsCoreId;
    wire anyMatchReceiverCoreId;
    wire anyMatchBitsCoreId;
    // For each possible incoming signal, check if it's valid
    // (which means that it corresponds with the signal id corresponding to this module).
    // This has been set by the parent module.
    // Also check if the incoming signal either matches the sending core corresponding to this module,
    // or if the receiver core of this whole signal buffer has sent the signal.
    genvar i;
    generate
        for (i = 0; i < `SIGNAL_BANDWIDTH; i = i + 1) begin : matchingSignals
            assign matchesReceiverCoreId[i] = incomingSignalsValid[i] == 1'b1 &&
                incomingSignalsCoreIds[(`CORE_ID_WIDTH*(i+1))-1 : (`CORE_ID_WIDTH*i)]
                == RECEIVER_CORE ? 1'b1 : 1'b0;
            assign matchesBitsCoreId[i] = incomingSignalsValid[i] == 1'b1 &&
                incomingSignalsCoreIds[(`CORE_ID_WIDTH*(i+1))-1 : (`CORE_ID_WIDTH*i)] ==
                SENDER_CORE ? 1'b1 : 1'b0;
        end
    endgenerate
    // For a particular set of bits corresponding to a signal id, and a sending core,
    // we can only have two valid input signals. This is a property of the HELIX execution model.
    // One possible valid input corresponds to a signal sent by the sending core assigned to this bit set,
    // and the other by the core that physically contains the bit sets.
    // In the former case, we increment the bit counter to indicate that a signal was
    // received from another core. In the latter case, we decrement the counter to indicate that the
    // physical receiver core of the bits has just sent a signal. Or if both cases are true, do both.
    // If the received flush signal was sent by the same core, pretend it doesn't match
    // and just increment the counter, for special flush signal semantics.
    assign anyMatchReceiverCoreId = (|matchesReceiverCoreId) == 1'b1 && (IS_FLUSH_SIGNAL == 1'b0) ? 1'b1 : 1'b0;
    assign anyMatchBitsCoreId = |matchesBitsCoreId;
    always @(*) begin
        case ({anyMatchReceiverCoreId, anyMatchBitsCoreId})
            2'b00: nextCounter = counter;
            2'b01: nextCounter = counter + 1;
            2'b10: nextCounter = counter - 1;
            // For this case there are two possibilities. One, SENDER_CORE==RECEIVER_CORE,
            // in which case we don't want to change the bits anyway.
            // When that comparison is false, then it means we've received two opposing signals,
            // so we also don't want to change the bits.
            2'b11: nextCounter = counter; // +1 and -1, so net effect of 0
        endcase
    end
302
// Wait release logic. Release the wait based on the next counter value.
always @(*) begin
    if (incomingWaitValid == 1'b1 && counter >= RELEASE_THRESH) begin
        waitReleased = 1'b1;
    end
    // If wait is 'light', can release as long as counter wouldn't potentially
    // underflow once the corresponding sequential segment executes, and the receiver core sends this signal.
    // If the counter was 0, and we released this wait, it will underflow once the
    // next receiver signal is sent.
    else if (incomingWaitValid == 1'b1 && counter > {(`CLOG2(EPOCH_BOUND * 2)){1'b0}} &&
             incomingWaitLight == 1'b1) begin
        waitReleased = 1'b1;
    end
    else begin
        waitReleased = 1'b0;
    end
end
endmodule
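
The counter update and wait release rules above can be checked against a small software model. The following is a minimal behavioral sketch in Python, not part of the RTL. The function names `next_counter` and `wait_released`, and the threshold value 2 used in the usage note, are hypothetical and chosen only for illustration. The model uses unbounded Python integers, so it does not reproduce the wrap-around of the fixed-width hardware counter.

```python
def next_counter(counter, matches_receiver_core, matches_bits_core, is_flush_signal):
    """Model of the combinational nextCounter logic (hypothetical helper).

    matches_receiver_core / matches_bits_core are lists of booleans that
    stand in for the match vectors; any() plays the role of the Verilog
    reduction-OR. A flush signal suppresses the receiver-core match.
    """
    any_match_receiver = any(matches_receiver_core) and not is_flush_signal
    any_match_bits = any(matches_bits_core)

    if any_match_bits and not any_match_receiver:
        return counter + 1   # case 2'b01
    if any_match_receiver and not any_match_bits:
        return counter - 1   # case 2'b10
    return counter           # cases 2'b00 and 2'b11: no net change


def wait_released(counter, wait_valid, wait_light, release_thresh):
    """Model of the waitReleased logic (hypothetical helper).

    release_thresh stands in for RELEASE_THRESH, whose actual value is
    not shown in this listing.
    """
    if not wait_valid:
        return False
    if counter >= release_thresh:
        return True
    # A 'light' wait may release early, but only while the counter is nonzero.
    return bool(wait_light and counter > 0)
```

For example, assuming a threshold of 2, `wait_released(1, True, True, 2)` releases a light wait while `wait_released(0, True, True, 2)` does not, which matches the underflow concern described in the comments. A flush signal that matches both core IDs takes the increment path, since the receiver-core match is masked.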
References
[1] (2013). Tile Processor Architecture Overview for the TILEPro Series. Tilera Corporation.
[2] Allen, R. & Kennedy, K. (2002). Optimizing compilers for modern architectures. Morgan
Kaufmann.
[3] Barroso, L. A., Clidaras, J., & Hölzle, U. (2013). The datacenter as a computer: An introduction
to the design of warehouse-scale machines. Synthesis lectures on computer architecture,
8(3), 1–154.
[4] Bernstein, A. (1966). Analysis of programs for parallel processing.
[5] Borkar, S. & Chien, A. A. (2011). The future of microprocessors. Commun. ACM, 54(5),
67–77.
[6] Breach, S. E., Vijaykumar, T. N., & Sohi, G. S. (1994). The anatomy of the register file in
a multiscalar processor. In Proceedings of the 27th Annual International Symposium on
Microarchitecture, MICRO 27 (pp. 181–190). New York, NY, USA: ACM.
[7] Burger, D., Goodman, J. R., & Kägi, A. (1996). Memory bandwidth limitations of future
microprocessors. In ISCA.
[8] Campanoni, S., Agosta, G., Reghizzi, S. C., & Biagio, A. D. (2010). A Highly Flexible, Paral-
lel Virtual Machine: Design and Experience of ILDJIT. In Software: Practice and Experience.
[9] Campanoni, S., Brownell, K., Kanev, S., Jones, T., Wei, G.-Y., & Brooks, D. (2014). Helix-rc:
An architecture-compiler co-design for automatic parallelization of irregular programs. In
Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on (pp.
217–228).
[10] Campanoni, S., Holloway, G., Wei, G.-Y., & Brooks, D. (2015). Helix-up: Relaxing pro-
gram semantics to unleash parallelization. In Proceedings of the 13th Annual IEEE/ACM
International Symposium on Code Generation and Optimization, CGO ’15 (pp. 235–245).
Washington, DC, USA: IEEE Computer Society.
[11] Campanoni, S., Jones, T., Holloway, G., Reddi, V. J., Wei, G.-Y., & Brooks, D. (2012a). Helix:
Automatic parallelization of irregular programs for chip multiprocessing. In Proceedings of
the Tenth International Symposium on Code Generation and Optimization, CGO ’12 (pp.
84–93). New York, NY, USA: ACM.
[12] Campanoni, S., Jones, T. M., Holloway, G., Janapa Reddi, V., Wei, G.-Y., & Brooks, D.
(2012b). HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing.
In CGO.
[13] Campanoni, S., Jones, T. M., Holloway, G., Wei, G.-Y., & Brooks, D. (2012c). HELIX: Making
the Extraction of Thread-Level Parallelism Mainstream. In IEEE Micro.
[14] Chatterjee, R., Ryder, B. G., & Landi, W. A. (1999). Relevant Context Inference. In POPL.
[15] Chrysos, G. (2012). Knights corner, intel’s first many integrated core (mic) architecture product.
In Hot Chips, volume 24.
[16] Cytron, R. (1986). DOACROSS: Beyond vectorization for multiprocessors. In ICPP.
[17] Danowitz, A., Kelley, K., Mao, J., Stevenson, J. P., & Horowitz, M. (2012). Cpu db: Recording
microprocessor history. Queue, 10(4), 10:10–10:27.
[18] Dennard, R. H., Rideout, V., Bassous, E., & Leblanc, A. (1974). Design of ion-implanted
mosfet’s with very small physical dimensions. Solid-State Circuits, IEEE Journal of, 9(5),
256–268.
[19] Deutsch, A. (1992). A storeless model of aliasing and its abstractions using finite representa-
tions of right-regular equivalence relations. In ICCL.
[20] Gao, C., Gutierrez, A., Dreslinski, R. G., Mudge, T., Flautner, K., & Blake, G. (2014). A
study of thread level parallelism on mobile devices. In Performance Analysis of Systems and
Software (ISPASS), 2014 IEEE International Symposium on (pp. 126–127).: IEEE.
[21] Gratz, P., Kim, C., Sankaralingam, K., Hanson, H., Shivakumar, P., Keckler, S. W., & Burger,
D. (2007a). On-Chip Interconnection Networks of the TRIPS Chip. In IEEE Micro.
[22] Gratz, P., Sankaralingam, K., Hanson, H., Shivakumar, P., McDonald, R., Keckler, S., &
Burger, D. (2007b). Implementation and evaluation of a dynamically routed processor
operand network. In Networks-on-Chip, 2007. NOCS 2007. First International Symposium
on (pp. 7–17).
[23] Guo, B., Bridges, M. J., Triantafyllis, S., Ottoni, G., Raman, E., & August, D. I. (2005). Prac-
tical and accurate low-level pointer analysis. In CGO.
[24] Hamerly, G., Perelman, E., & Calder, B. (2004). How to use simpoint to pick simulation
points. In ACM SIGMETRICS Performance Evaluation Review.
[25] Hammond, L., Hubbert, B. A., Siu, M., Prabhu, M. K., Chen, M. K., & Olukotun, K.
(2000). The Stanford Hydra CMP. In IEEE Micro.
[26] Hammond, L., Willey, M., & Olukotun, K. (1998). Data speculation support for a chip
multiprocessor. In Proceedings of the Eighth International Conference on Architectural
Support for Programming Languages and Operating Systems, ASPLOS VIII (pp. 58–69). New
York, NY, USA: ACM.
[27] Huang, J., Raman, A., Jablin, T. B., Zhang, Y., Hung, T.-H., & August, D. I. (2010). Decou-
pled software pipelining creates parallelization opportunities. In CGO.
[28] Hurson, A. R., Lim, J. T., Kavi, K. M., & Lee, B. (1997). Parallelization of doall and doacross
loops - a survey. In Advances in Computers, volume 45.
[29] Jaleel, A. (2007). Memory characterization of workloads using instrumentation-driven simu-
lation – a pin-based memory characterization of the spec cpu2000 and spec cpu2006 bench-
mark suites. In Technical Report, VSSAD, Intel Corporation.
[30] Jerger, N. E. & Peh, L.-S. (2009). On-Chip Networks. Synthesis Lectures on Computer
Architecture. Morgan & Claypool.
[31] Johnson, T. A., Eigenmann, R., & Vijaykumar, T. N. (2007). Speculative thread decomposi-
tion through empirical optimization. In PPoPP.
[32] Kanev, S., Wei, G.-Y., & Brooks, D. (2012a). XIOSim: power-performance modeling of
mobile x86 cores. In ISLPED.
[33] Kanev, S., Wei, G.-Y., & Brooks, D. (2012b). Xiosim: Power-performance modeling of mo-
bile x86 cores. In Proceedings of the 2012 ACM/IEEE International Symposium on Low
Power Electronics and Design, ISLPED ’12 (pp. 267–272). New York, NY, USA: ACM.
[34] Kim, C., Sethumadhavan, S., Govindan, M. S., Ranganathan, N., Gulati, D., Burger, D.,
& Keckler, S. W. (2007). Composable lightweight processors. In Microarchitecture, 2007.
MICRO 2007. 40th Annual IEEE/ACM International Symposium on (pp. 381–394).: IEEE.
[35] Kim, H., Johnson, N. P., Lee, J. W., Mahlke, S. A., & August, D. I. (2012). Automatic specu-
lative doall for clusters. In Proceedings of the Tenth International Symposium on Code Gener-
ation and Optimization, CGO ’12 (pp. 94–103). New York, NY, USA: ACM.
[36] Kim, N., Austin, T., Blaauw, D., Mudge, T., Flautner, K., Hu, J., Irwin, M., Kandemir, M., &
Narayanan, V. (2003). Leakage current: Moore’s law meets static power. Computer, 36(12),
68–75.
[37] Kistler, M., Perrone, M., & Petrini, F. (2006). Cell multiprocessor communication network:
Built for speed. 26(3).
[38] Liu, W., Tuck, J., Ceze, L., Ahn, W., Strauss, K., Renau, J., & Torrellas, J. (2006). POSH: A
TLS compiler that exploits program structure. In PPoPP.
[39] Mars, J., Tang, L., Hundt, R., Skadron, K., & Soffa, M. L. (2011). Bubble-up: Increasing
utilization in modern warehouse scale computers via sensible co-locations. In Proceedings
of the 44th annual IEEE/ACM International Symposium on Microarchitecture (pp. 248–
259).: ACM.
[40] Martin, M. M. K. (2003). Token coherence. PhD thesis, University of Wisconsin-Madison.
[41] Mehrara, M., Hao, J., Hsu, P.-C., & Mahlke, S. (2009). Parallelizing sequential applications
on commodity hardware using a low-cost software transactional memory. In Proceedings of
the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation,
PLDI ’09 (pp. 166–176). New York, NY, USA: ACM.
[42] Moore, G. (1965). Cramming more components onto integrated circuits. Electronics Maga-
zine.
[43] Muralimanohar, N., Balasubramonian, R., & Jouppi, N. P. (2009). CACTI 6.0: A tool to
model large caches. Technical Report 85, HP Laboratories.
[44] Nicolau, A., Li, G., & Kejariwal, A. (2009a). Techniques for efficient placement of synchro-
nization primitives. In PPoPP.
[45] Nicolau, A., Li, G., Veidenbaum, A. V., & Kejariwal, A. (2009b). Synchronization optimiza-
tions for efficient execution on multi-cores. In ICS.
[46] Ottoni, G., Rangan, R., Stoler, A., & August, D. I. (2005). Automatic thread extraction with
decoupled software pipelining. In MICRO.
[47] Prabhu, M. K. & Olukotun, K. (2005). Exposing speculative thread parallelism in spec2000.
In Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, PPoPP ’05 (pp. 142–152). New York, NY, USA: ACM.
[48] Raman, A., Kim, H., Mason, T. R., Jablin, T. B., & August, D. I. (2010). Speculative parallelization
using software multi-threaded transactions. In ASPLOS.
[49] Raman, E., Ottoni, G., Raman, A., Bridges, M. J., & August, D. I. (2008). Parallel-stage
decoupled software pipelining. In CGO.
[50] Rangan, R. et al. (2004). Decoupled software pipelining with the synchronization array. In
PACT.
[51] Robatmili, B., Li, D., Esmaeilzadeh, H., Govindan, S., Smith, A., Putnam, A., Burger, D., &
Keckler, S. W. (2013). How to Implement Effective Prediction and Forwarding for Fusable
Dynamic Multicore Architectures. In HPCA.
[52] Rosenfeld, P., Cooper-Balis, E., & Jacob, B. (2011). DRAMSim2: A Cycle Accurate Memory
System Simulator. In IEEE Computer Architecture Letters.
[53] Sankaralingam, K., Nagarajan, R., Liu, H., Kim, C., Huh, J., Ranganathan, N., Burger, D.,
Keckler, S. W., McDonald, R. G., & Moore, C. R. (2004). TRIPS: A polymorphous architecture
for exploiting ILP, TLP, and DLP. In ACM TACO.
[54] Sohi, G. S., Breach, S. E., & Vijaykumar, T. N. (1995). Multiscalar processors. In ISCA.
[55] Sorin, D. J., Hill, M. D., & Wood, D. A. (2011). A primer on memory consistency and cache
coherence. Synthesis Lectures on Computer Architecture, 6(3), 1–212.
[56] Steffan, J. G., Colohan, C., Zhai, A., & Mowry, T. C. (2005). The STAMPede approach to
thread-level speculation. In ACM Transactions on Computer Systems.
[57] Steffan, J. G., Colohan, C. B., Zhai, A., & Mowry, T. C. (2002). Improving value communication
for thread-level speculation. In HPCA.
[58] Taylor, M. B., Kim, J., Miller, J., Wentzlaff, D., Ghodrat, F., Greenwald, B., Hoffman, H.,
Johnson, P., Lee, J.-W., Lee, W., Ma, A., Saraf, A., Seneski, M., Shnidman, N., Strumpen, V.,
Frank, M., Amarasinghe, S., & Agarwal, A. (2002). The RAW microprocessor: A computational
fabric for software circuits and general-purpose programs. In IEEE Micro.
[59] Taylor, M. B., Lee, W., Amarasinghe, S. P., & Agarwal, A. (2005). Scalar Operand Networks.
In IEEE Transactions on Parallel Distributed Systems.
[60] Tournavitis, G., Wang, Z., Franke, B., & O’Boyle, M. F. P. (2009). Towards a holistic ap-
proach to auto-parallelization. In PLDI.
[61] Vachharajani, N. et al. (2007). Speculative decoupled software pipelining. In PACT.
[62] Van der Wijngaart, R. F., Mattson, T. G., & Haas, W. (2011). Light-weight communications
on intel’s single-chip cloud computer processor. ACM SIGOPS Operating Systems Review,
45(1), 73–83.
[63] Vijaykumar, T. & Sohi, G. S. (1998). Task selection for a multiscalar processor. In Proceedings
of the 31st annual ACM/IEEE international symposium on Microarchitecture (pp. 81–92).:
IEEE Computer Society Press.
[64] Wentzlaff, D., Griffin, P., Hoffmann, H., Bao, L., Edwards, B., Ramey, C., Mattina, M.,
Miao, C.-C., Brown, III, J. F., & Agarwal, A. (2007). On-chip interconnection architecture
of the tile processor. In IEEE Micro.
[65] Zhai, A., Colohan, C. B., Steffan, J. G., & Mowry, T. C. (2002). Compiler optimization of
scalar value communication between speculative threads. In ASPLOS.
[66] Zhai, A., Steffan, J. G., Colohan, C. B., & Mowry, T. C. (2008). Compiler and hardware
support for reducing the synchronization of speculative threads. In ACM TACO.
[67] Zhong, H., Mehrara, M., Lieberman, S., & Mahlke, S. (2008). Uncovering hidden loop level
parallelism in sequential applications. In HPCA.
