Exploring alternatives to hardware support for fine-grain synchronization by Ahrens, Benjamin
© 2012 Benjamin J. Ahrens
EXPLORING ALTERNATIVES TO HARDWARE SUPPORT FOR
FINE-GRAIN SYNCHRONIZATION
BY
BENJAMIN J. AHRENS
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2012
Urbana, Illinois
Adviser:
Professor Josep Torrellas
ABSTRACT
As we prepare for the extreme-scale era of computing, communication over-
head and synchronization between cores will soon become extremely impor-
tant. In this work we study three different methods of support for fine-
grain synchronization. Fine-grain synchronization allows a task to be broken
up into very small units, improving load balancing and reducing lock con-
tention. The different methods include hardware support for full/empty bits,
compare-and-swap (CAS) emulation of full/empty bits, and dual CAS oper-
ations. Roger Golliver’s single CAS implementation is a novel method which
chooses a bit-pattern to represent an “empty” full/empty bit state. The pri-
mary concerns are hardware overhead, efficiency of the synchronization, and
energy wasted while spinning. We have tested the methods on a set of four
different highly parallel algorithms on up to 32 cores. Our results show that
fine-grain synchronization can have significant performance benefits, and em-
ulation through CAS can do just as well as hardware-supported full/empty
bits in many cases. We had much difficulty finding suitable algorithms that
use fine-grain synchronization in a meaningful way, and among those that did
use fine-grain synchronization meaningfully, there were limited cases where
hardware support had a significant advantage over emulation through CAS.
Given these results, we find it difficult to justify including full/empty bits in
an extreme-scale design.
To my parents, for their love and support
ACKNOWLEDGMENTS
I thank Professor Josep Torrellas for his patience throughout my work on this
project. Thank you to Roger Golliver for the idea of using single compare-
and-swap for fine-grain synchronization emulation. Thank you to John Feo
for providing the benchmarks which provided the basis for verifying this
work. Finally, thank you to Aditya Agrawal for your time in explaining the
quirks of SESC.
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 MOTIVATION
CHAPTER 3 RELATED WORK
CHAPTER 4 IMPLEMENTATION
  4.1 Hardware Support for Full/Empty Bits
  4.2 Emulation with Compare-and-Swap
    4.2.1 Using Compare-and-Swap
    4.2.2 Choosing an Empty Value
  4.3 Emulation with Dual CAS
    4.3.1 Reasoning
    4.3.2 Implementation Details
CHAPTER 5 BENCHMARKS
  5.1 Giga-Updates per Second
  5.2 Matching
  5.3 RAlign - Diagonal
  5.4 RAlign - Row-by-Row
  5.5 Strongly Connected Components
CHAPTER 6 EVALUATION
  6.1 Giga-Updates per Second
  6.2 Matching
  6.3 RAlign
  6.4 Strongly Connected Components
CHAPTER 7 FUTURE WORK
CHAPTER 8 CONCLUSION
REFERENCES
CHAPTER 1
INTRODUCTION
With the stagnation of single-threaded performance, adding more and more
cores to chip multiprocessors has become the dominant mechanism for
squeezing more performance out of computers. Extreme-scale machines are on
the horizon [1], and with extreme scale come many difficult issues. We must
leverage the enormous parallelism afforded by hundreds to thousands of cores
on a chip, and the cores must be extremely simple. They also must be able
to communicate efficiently, as contention over networks of thousands of cores
will be very high. More importantly, communication will be expensive over
such an expansive network. We can no longer strive for performance at
all costs. Rather, energy per computation is becoming a primary concern.
Therefore, we want to make sure that transactions over the network are al-
ways doing useful work.
While multiprogramming and embarrassingly parallel applications easily
take advantage of more cores, the scalability of barrier and lock-based code
is more questionable. These centralized mechanisms do not scale well past
tens of cores due to contention and load balancing issues. Other solutions
for scalability can be quite undesirable from a programmer’s perspective.
Lock contention can be solved by adding more detailed and complex lock-
ing schemes, but it often complicates the code and increases the memory
footprint of the program. Barriers can be improved through better load bal-
ancing, but with the indeterminate nature of memory accesses over complex
fabrics, this can never really be guaranteed. Fine-grain synchronization
takes scalable parallel programming to its logical conclusion by associating
synchronization with individual pieces of data.
Hardware support for fine-grain synchronization through full/empty bits
has been the de facto implementation of fine-grain synchronization for some
time. Each word in memory is tagged with a single bit representing the status
of the word. If the word’s status is “empty,” the data is not ready. The empty
state can be thought of as the locked state in normal synchronization terms.
If the word’s status is marked “full,” then the data is ready for consumption,
or unlocked. Full/empty bits in memory obviate the need to allocate and
manage locks and barriers.
We define several typical operations which can be done with full/empty
bits. ReadFE is a load which does not complete until the location is marked
full (F), and once the read completes, the location is marked empty (E).
WriteEF is a store which does not complete until the location is marked
empty, and once the store completes, the location is marked full. Variations
such as ReadFF (waits until full and does not update full/empty bit) or
WriteXF (non-blocking write) are also used.
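The semantics of these operations can be summarized with a small single-threaded reference model. The sketch below is illustrative only: the type fe_word and the fe_try_* names are hypothetical, the model is not atomic, and each function returns false at the point where real hardware would hold the access until the full/empty bit changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one tagged memory word: a value plus its full/empty bit. */
typedef struct {
    uint64_t value;
    bool     full;
} fe_word;

/* ReadFE: succeeds only if the word is full, and leaves it empty. */
bool fe_try_read_fe(fe_word *w, uint64_t *out) {
    assert(w != NULL && out != NULL);
    if (!w->full) return false;      /* hardware would wait here */
    *out = w->value;
    w->full = false;
    return true;
}

/* ReadFF: succeeds only if the word is full, and leaves the bit unchanged. */
bool fe_try_read_ff(const fe_word *w, uint64_t *out) {
    assert(w != NULL && out != NULL);
    if (!w->full) return false;      /* hardware would wait here */
    *out = w->value;
    return true;
}

/* WriteEF: succeeds only if the word is empty, and leaves it full. */
bool fe_try_write_ef(fe_word *w, uint64_t v) {
    assert(w != NULL);
    if (w->full) return false;       /* hardware would wait here */
    w->value = v;
    w->full = true;
    return true;
}

/* WriteXF: never waits; stores and marks the word full whatever its prior state. */
void fe_write_xf(fe_word *w, uint64_t v) {
    assert(w != NULL);
    w->value = v;
    w->full = true;
}
```

A producer/consumer pair alternates fe_try_write_ef and fe_try_read_fe on the same word, so the word itself carries the hand-off state and no separate lock is allocated.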
Full/empty bits have four main benefits for scalability and efficiency. First,
full/empty bits reduce the memory overhead of locks by allowing the lock and
the data to be combined into a single memory location. Second, by combining
data and synchronization, false dependences brought on by synchronization
are eliminated, since the only memory location which is accessed and locked
is the data location itself. Third, full/empty bit operations are very efficient.
Since synchronization and data access are done at the same time, they can be
accomplished in a single instruction if there is hardware support for it. This
combining of accesses allows for frequent low-latency synchronization in ex-
tremely parallel applications. Finally, hardware support for full/empty bits
includes support for queueing synchronization requests which fail. Queue-
ing requests ensures that each synchronized access only results in a single
transaction over the network.
Unfortunately, support for full/empty bits in hardware is not free, and
therefore we would like to evaluate the purported benefits in order to de-
termine its usefulness in enabling efficient parallel algorithms. We utilize a
pair of software alternatives to full/empty bits in hardware and evaluate how
these alternatives approximate the efficiency benefits which are claimed for
hardware full/empty bits. We also investigate the ease of modifying algo-
rithms to take advantage of these benefits.
CHAPTER 2
MOTIVATION
The canonical argument for fine-grain synchronization revolves around the
current use of barriers between phases in applications in order to ensure that
all computations are completed before moving on. While barriers are
simple to program, they can create dependencies among threads that may not
have existed in the first place. One thread might have been de-scheduled or
run into a few cache misses, causing it to fall behind. As the number of
cores increases, the probability of this imbalance occurring only increases.
This imbalance causes all the other threads to wait on a slow thread even
though they may not depend on the data which it generates. Not only does
the increasing parallelism increase the probability of imbalance, but it also
increases the number of cores affected by the imbalance. Fine-grain synchro-
nization removes these false dependencies, leaving behind only the true data
dependencies between producers and consumers. Fine-grain synchronization
effectively de-centralizes the synchronization in the program, opening it up
to the potential for large speedups as the number of processors grows. While
performance increases are desirable, this elimination of false dependencies
also reduces wasted work.
This simple mechanism has a price though. Some designs call for full/empty
bits on a per-byte granularity, creating a 12.5% tax in memory. In addition
to full/empty bit overhead, there must also be a mechanism in memory and
possibly caches to queue accesses which fail, as this is the key to limiting
network transactions on failed synchronizations. Memory buffers will not
only buffer outstanding requests to the memory modules, but they will also
have to buffer requests that are waiting on a full/empty bit to flip. Wait-
ing for a full/empty bit flip would require memory accesses to potentially
search a content-addressable memory of pending fine-grain synchronization
requests for every fine-grain synchronization access. These overheads may be
too much to handle as we increasingly push to put more cores on a chip.
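The 12.5% figure follows directly from attaching one tag bit to every 8-bit byte. As a minimal sketch of the arithmetic (the function name fe_overhead_percent is ours, introduced only for illustration):

```c
#include <assert.h>

/* Extra storage, as a percentage of the data size, when tag_bits of
 * synchronization state are attached to every data_bits of data. */
double fe_overhead_percent(unsigned tag_bits, unsigned data_bits) {
    assert(data_bits > 0);
    return 100.0 * (double)tag_bits / (double)data_bits;
}
```

Under this formula, fe_overhead_percent(1, 8) gives 12.5, while a coarser design with one bit per 64-bit word would cost about 1.56%. The four access-state bits per 64-bit word of the Tera processors described in Chapter 3 correspond to 6.25%. None of these figures include the queueing structures discussed above.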
It is also important to remember that not all applications can make use of
these full/empty bits. Some programs are already very well balanced, or do
not require large amounts of synchronization. In these cases, the full/empty
bits go unused. Because they reside in memory, they impose a constant
overhead while not contributing to the execution. We can have the best of both
worlds by making use of a software implementation of full/empty bits: rel-
atively high performance fine-grain synchronization, and zero memory over-
head when the mechanism is not in use.
CHAPTER 3
RELATED WORK
Full/empty bits have been implemented in hardware in many different
systems, including the Denelcor HEP [2], Monsoon [3], the Tera MTA family
of processors [4], Alewife [5], and Eldorado (Cray XMT) [6]. Of these, Cray
is the best known for doing so. The Tera family of processors tagged each
64-bit data location in memory with four bits related to access state, one
of them being a full/empty bit. Since the Tera does not have caches, every
memory access has a very high latency. Full/empty bits allow the Tera to
avoid spinning between the processors and the main memory, and they also
limit the number of memory accesses to synchronize. Unfortunately, all these
systems have implemented full/empty bits for every single memory location,
regardless of whether it is being used or not. Given that area is a major
concern for extreme-scale systems, hardware full/empty bits may not be the
ideal candidate for fine-grain synchronization.
Some improvements to full/empty bit support in hardware have been
proposed. Zhu et al. [7] observe that at any instant in time, only a small
portion of memory is actually participating in synchronizing memory oper-
ations. Therefore, they designed the Synchronization State Buffer (SSB),
which is a small cache attached to the main memory controller which keeps
track of locations which are actually using the synchronization hardware. In
their experiments, they use an 8-way SSB with 16 entries for each on-chip
memory bank and 1024 entries for off-chip memory. While this arrangement
limits the hardware overhead, any additional hardware in an extreme-scale
machine must demonstrate its usefulness in a wide range of use cases. Our
experiments attempt to address this question of usefulness.
One purely hardware solution to the issue of increased contention at locks
and barriers is speculative synchronization [8]. This method allows threads
to execute past active barriers, locks, and flags, assuming that there will be
no dependencies or collisions when skipping the synchronization. If there
is a violation, the thread must roll back and re-execute. While speculative
synchronization can provide substantial performance benefits, re-execution
can waste quite a bit of energy. As we move towards extreme-scale machines
where collisions become more likely, the energy cost of re-execution may be
untenable. The same argument applies to transactional memory mechanisms:
while they have potential for performance benefits, any mechanism which
requires an operation to be completed multiple times makes little sense
in a paradigm where minimal energy is the primary concern. Effort is much
better utilized making sure that the cores are kept busy doing useful work.
Queue on Lock Bit (QOLB) [9] proposes a hardware mechanism which
helps with lock arbitration by queuing waiters and passing the lock directly
from releaser to the next waiter with minimal messages over the network. It
also supports non-blocking accesses for performing synchronous prefetching,
a property not shared with full/empty bits. It supports both local spin-
ning and collocation, where the data is allowed to reside on the same cache
line as the lock, increasing efficiency. While these properties are similar to
those of full/empty bits in hardware without the requirement for tagging
every memory location, QOLB still requires extensive hardware modifications.
These modifications include a requirement for non-blocking synchronization
instructions, direct transfer of a lock from node to node, storage of queue
state information, and support for “shadow lines” in the caches. All of this
is required to support a mechanism which only applies to locks, and only
has significant advantages in the case of high contention. Even then, the
authors also propose a software-only implementation which works on com-
modity hardware and demonstrates most of the benefits of the mechanism.
The Mellor-Crummey and Scott (MCS) lock [10] is a software-only method
for queueing requests to access a lock, which requires allocating nodes
and adding them to a linked list. Since each requester spins on its own node,
it is difficult to do collocation. This is because the data protected by the
lock only exists in one location that is probably not on the same cache line
as the locally allocated lock. MCS locks do not compare well with hardware
full/empty bits because they only address high contention cases and not the
common case of successful locking.
CHAPTER 4
IMPLEMENTATION
The following three methods for fine-grain synchronization were implemented
and tested in the SuperESCalar (SESC) cycle-accurate architectural simula-
tor.
4.1 Hardware Support for Full/Empty Bits
We have implemented a mechanism for hardware support where a full/empty
load (or store) does not return until the data is ready (or the location is
empty). This implementation represents the minimum-transaction imple-
mentation of fine-grain synchronization. In order to obtain the theoretical
best result, we will de-schedule the waiting process until the data is ready, and
instantly reschedule it when the data is ready without any time or energy
overhead. This instant scheduling creates a baseline for the lowest power,
highest performance implementation possible, as the only energy consumed
when the core is queued and spinning is leakage energy.
The implementation requires tracking waiting threads for each location.
When the consumer sees that the data is not ready, it will suspend its execu-
tion and write its process ID into an entry shared by all the cores. When the
producer finally produces the data, it first checks the queued requests to see
if there are any processes to wake up. If there are any, it performs a wake-up
operation on the process. If there are both ReadFE and ReadFF operations
in the queue, ReadFF operations are woken up until the first ReadFE is
reached. This implementation may be more complex than a real hardware
implementation, but we choose to be optimistic about the capabilities in
order to make our comparisons to emulation more conservative.
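The wake-up policy for queued readers can be sketched as follows. This is not the simulator code: the waiter structure and the function wake_on_fill are hypothetical, and the sketch assumes that the first ReadFE in the queue is itself woken and that, because it empties the location, all waiters behind it remain queued.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { WAIT_READ_FF, WAIT_READ_FE } wait_kind;

typedef struct {
    int       pid;    /* process ID written by the suspended consumer */
    wait_kind kind;   /* which operation the consumer is blocked on */
} waiter;

/* Called when a producer fills the location. Scans the FIFO queue from its
 * head, records the process IDs to wake in woken_pids, and returns how many
 * were woken. ReadFF waiters are woken in order; the first ReadFE waiter is
 * woken as well (assumption) and ends the scan, since it empties the word. */
size_t wake_on_fill(const waiter *queue, size_t n, int *woken_pids) {
    assert(n == 0 || (queue != NULL && woken_pids != NULL));
    size_t woken = 0;
    for (size_t i = 0; i < n; ++i) {
        woken_pids[woken++] = queue[i].pid;
        if (queue[i].kind == WAIT_READ_FE)
            break;    /* word is empty again; the remaining waiters stay queued */
    }
    return woken;
}
```

For a queue holding ReadFF, ReadFF, ReadFE, ReadFF in that order, the first three waiters are woken and the last one stays suspended until the next producer arrives.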
4.2 Emulation with Compare-and-Swap
The implementation for emulation through compare-and-swap (CAS) was
obtained through a communication with Roger Golliver, a collaborator. He
observed that when data is not available for consumption, the memory loca-
tion for fine-grain synchronization is typically not in use. Therefore, we can
store a pre-determined value to that location to indicate that the location is
“empty.” Any other value in the location means that the location is “full.”
The designers of the Alewife multiprocessor found that in their suite of ap-
plications, only 7% of fine-grained synchronization operations failed [5], and
in our algorithms, we often found far lower rates of synchronization failure.
Therefore, the implementation was designed to be fast in the common case.
4.2.1 Using Compare-and-Swap
We attempt to complete a ReadFE operation in one operation over the net-
work by doing a contrast-and-swap, which writes the empty bit pattern and
reads the data only if the data is not equal to the empty value. Note that
this is the exact opposite of a compare-and-swap, as the operation is only
performed if the two values are not equal. If the old value was not the empty
value, then we succeeded. Alternatively, we could do an atomic exchange,
which always writes the empty value to the location. Note that the time to
complete this atomic operation and value comparison may be very competi-
tive with the hardware implementation of ReadFE.
In the case where the old value is the empty value, we must wait until the
value is produced. Currently we have two simple implementations for spin-
ning. “Smart” spinning simply reads the location until it becomes full, taking
advantage of cache coherency to spin locally and then doing the compare-
and-swap. “Simple” spinning does compare-and-swaps repeatedly until the
operation succeeds. This method does not take advantage of cache coherency
and thus better approximates a system which does not have a mechanism
for local spinning. We use this method for the “bypass” experiments which
bypass the cache in order to better approximate an extreme-scale machine.
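As a rough illustration of "smart" spinning on commodity hardware, the sketch below uses C11 atomics. It is an approximation under stated assumptions: C11 has no contrast-and-swap, so the attempt uses the atomic exchange alternative mentioned above, and FE_EMPTY is an arbitrarily chosen reserved pattern rather than the value used in our experiments.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define FE_EMPTY UINT32_MAX   /* assumed reserved "empty" bit pattern */

/* ReadFE with "smart" spinning: wait with plain loads, which are served from
 * the local cache until the producer's write invalidates the line, and only
 * then issue the atomic operation that travels over the interconnect. */
uint32_t read_fe_smart(_Atomic uint32_t *addr) {
    assert(addr != NULL);
    for (;;) {
        while (atomic_load_explicit(addr, memory_order_relaxed) == FE_EMPTY) {
            /* local spin: no atomic read-modify-write traffic */
        }
        uint32_t old = atomic_exchange(addr, FE_EMPTY);
        if (old != FE_EMPTY)
            return old;   /* we consumed the value and left the location empty */
        /* another consumer emptied the location first; resume local spinning */
    }
}
```

The "simple" variant omits the inner load loop and retries the atomic operation directly, so every failed attempt is a transaction on the interconnect.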
Performing a WriteEF operation on the synchronized location works in
much the same way. We start by doing a compare-and-swap, only swapping
if the location is empty. We must wait until the location is empty because
this state signals that the consumer has used the data which was previously
stored. Again, only spinning causes additional network traffic, and we have
smart and simple methods of spinning for writes as well. The pseudocode
for the simple versions of the ReadFE and WriteEF operations is
ReadFE_simple(addr){
    do{
        /* writes EMPTY only if the current value differs from EMPTY */
        res = ContrastAndSwap(addr, EMPTY/*contrast*/, EMPTY/*new*/);
    }while(res==EMPTY);
    return res;
}
WriteEF_simple(addr,data){
    do{
        /* writes data only if the current value equals EMPTY */
        res = CompareAndSwap(addr, EMPTY/*expected*/, data/*new*/);
    }while(res!=EMPTY);
}
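For readers who wish to experiment outside the simulator, the two simple operations can be approximated with C11 atomics. This is a sketch rather than the code used in our experiments: the names and the FE_EMPTY value are assumptions, and since C11 offers no contrast-and-swap, the read side uses atomic exchange, which rewrites EMPTY over EMPTY when the location is not ready.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FE_EMPTY UINT32_MAX   /* assumed reserved "empty" bit pattern */

/* One ReadFE attempt: swap in EMPTY; success if the old value was not EMPTY. */
bool try_read_fe(_Atomic uint32_t *addr, uint32_t *out) {
    assert(addr != NULL && out != NULL);
    uint32_t old = atomic_exchange(addr, FE_EMPTY);
    if (old == FE_EMPTY)
        return false;         /* location was empty; nothing was consumed */
    *out = old;
    return true;
}

/* One WriteEF attempt: store data only if the location currently holds EMPTY. */
bool try_write_ef(_Atomic uint32_t *addr, uint32_t data) {
    assert(addr != NULL && data != FE_EMPTY);   /* data must never equal EMPTY */
    uint32_t expected = FE_EMPTY;
    return atomic_compare_exchange_strong(addr, &expected, data);
}

/* "Simple" spinning: retry the atomic operation until it succeeds. */
uint32_t read_fe_simple(_Atomic uint32_t *addr) {
    uint32_t value;
    while (!try_read_fe(addr, &value)) { }
    return value;
}

void write_ef_simple(_Atomic uint32_t *addr, uint32_t data) {
    while (!try_write_ef(addr, data)) { }
}
```

The non-blocking try_ forms are not part of the thesis interface; they are split out here only so that the success condition of each operation is visible on its own.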
The implementation of atomics in SESC makes our changes relatively sim-
ple, though the performance and power accuracy is less concrete. SESC is
based on the Microprocessor without Interlocked Pipeline Stages (MIPS)
instruction set architecture (ISA). Unfortunately, this ISA does not provide
single-instruction atomic read-modify-write operations such as
compare-and-swap. It only supports the load-linked and store-conditional
instructions. While these operations are simple to implement, they are not
ideal for modeling compare-and-swap because they can fail when there is
contention for the location, and they require many instructions for an oper-
ation like compare-and-swap. Ideally, we would add instructions like atomic
compare-and-swap to the instruction set, but in the interest of time, we have
followed the lead of previous researchers who implemented atomic operations
using “magic” functions. The simulator performs the comparison operation
instantly, and the instruction goes down the pipeline as if it were a store.
While the simulator does not capture the costs of locking the location while
performing the operation and the time required to actually perform the oper-
ation on the data, it does accurately model the interconnect overhead of the
operation. Since the interconnect becomes a major bottleneck as the number
of cores scales, we deem this implementation sufficient for our purposes.
4.2.2 Choosing an Empty Value
Since the implementation is in software, we can customize the empty value
for each application. If the location is used for storing pointers, then we could
simply store an illegal address or a NULL into the location to indicate that
it is empty. For floating point data, we could store the value representing
not-a-number (NaN). In the case where the data values could be any bit
pattern, a backup plan would have to be implemented. One possibility is
that the producer may compare the data to the empty value before writing,
and set a flag if they are equal. The consumer can then check for a signal if
it spins for a large amount of time while waiting for a location to be set to
full. The drawback to this method is that writes would have to do this check
in all cases, thus slowing down all synchronizing actions. An alternative is to
set an upper bound on the amount of time spent spinning, and if it is met,
re-run the application with a different “empty” bit pattern. We assume that
collisions between data and empty values will not occur in this work.
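For floating-point data, one way to realize the NaN idea is to reserve a single NaN encoding and to compare raw bit patterns, because a NaN never compares equal to anything under ordinary floating-point comparison. The sketch below assumes IEEE 754 binary64 doubles; the particular payload in FE_EMPTY_BITS and all function names are hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed empty pattern: one specific quiet-NaN encoding (exponent all ones,
 * top mantissa bit set, arbitrary nonzero payload). */
#define FE_EMPTY_BITS UINT64_C(0x7FF8DEADBEEF0001)

/* Reinterpret a double as its 64-bit pattern, and back, without aliasing issues. */
uint64_t double_to_bits(double d) {
    assert(sizeof(double) == sizeof(uint64_t));
    uint64_t b;
    memcpy(&b, &d, sizeof b);
    return b;
}

double bits_to_double(uint64_t b) {
    double d;
    memcpy(&d, &b, sizeof d);
    return d;
}

/* The empty test must be done on bits: (x == x) is false for every NaN. */
bool is_empty_bits(uint64_t b) {
    return b == FE_EMPTY_BITS;
}

/* Producer-side guard from the text: true if d cannot be stored as data
 * because its encoding collides with the reserved empty pattern. */
bool collides_with_empty(double d) {
    return is_empty_bits(double_to_bits(d));
}
```

Since only one NaN encoding is reserved, other NaN values produced by a computation remain storable as data as long as their payload differs from the reserved one. The guard costs one comparison per write, which is the slowdown noted above.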
4.3 Emulation with Dual CAS
4.3.1 Reasoning
An alternative to the single atomic CAS method is dual atomic CAS mem-
ory operations. As mentioned, the problem with single CAS is that the
programmer must be able to guarantee that one data value will never be
used. We also found an application (RAlign) where the programmer may
want to store data despite the memory being marked empty. While that
particular algorithm still performed as well as the dual CAS implementation
when the array was duplicated and single CAS was used, there may be other
algorithms where this is not the case.
With dual CAS, we allocate one memory location for the data, and a
separate location for the full/empty bit. Since the full/empty bit has been
moved to a dedicated location, we do not need to use a reserved empty bit
pattern. This dedicated full/empty bit location is useful for programs where
guarantees about the value of the data cannot be made. It also allows for
more complex uses of full/empty bits, such as producing data for multiple
consumers by setting the full/empty bit to the number of consumers and
decrementing it as each consumer arrives. Dual CAS could also be used for
interesting queueing techniques. Perhaps a consumer could store a pointer to
itself when the producer is running behind, allowing the producer to imme-
diately notify the consumer when the data is ready. This work focuses on the
power and performance of a vanilla full/empty bit usage of dual CAS oper-
ations, as our main concern is to demonstrate whether dual CAS operations
can approximate the benefits of hardware full/empty bit support.
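The multiple-consumer idea can be illustrated in software with a counter in place of the full/empty bit. The sketch below is not our dual CAS implementation: it uses two ordinary C11 accesses ordered by acquire/release semantics rather than one atomic dual operation, and it assumes a single producer and consumers that each read a published value exactly once. All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    _Atomic uint32_t remaining;   /* 0 = empty; k > 0 = full for k more consumers */
    uint32_t         data;        /* written only while remaining == 0 */
} counted_cell;

/* Producer: publish value for a given number of consumers if the cell is empty. */
bool try_publish(counted_cell *c, uint32_t value, uint32_t consumers) {
    assert(c != NULL && consumers > 0);
    if (atomic_load_explicit(&c->remaining, memory_order_acquire) != 0)
        return false;             /* some consumer has not arrived yet */
    c->data = value;
    atomic_store_explicit(&c->remaining, consumers, memory_order_release);
    return true;
}

/* Consumer: read the value and decrement the count; the last one empties the cell. */
bool try_consume(counted_cell *c, uint32_t *out) {
    assert(c != NULL && out != NULL);
    if (atomic_load_explicit(&c->remaining, memory_order_acquire) == 0)
        return false;             /* nothing published */
    *out = c->data;
    atomic_fetch_sub_explicit(&c->remaining, 1, memory_order_acq_rel);
    return true;
}
```

The counter reaches zero only after every consumer has read the data, which is the point at which a WriteEF by the producer may proceed.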
The cost of dual CAS, of course, is that two memory locations must be
accessed at once. This requirement of two memory locations has a few differ-
ent implications. First, by allocating a memory location for the full/empty
bit, we lose the advantage of full/empty bits not wasting memory on syn-
chronization. This extra memory overhead means that the traffic over the
interconnect may increase, since each synchronization operation will involve
two memory locations. This overhead could be mitigated by constraining the
full/empty bits to reside on the same cache line as the data, but the memory
footprint still increases. There is also more energy consumed by passing two
memory requests down the pipeline and keeping track of all of the operands.
These software mechanisms should be compared to the hardware implementa-
tion of full/empty bits in order to get an idea of the trade-offs in performance
and power that we must pay in order to avoid extensive hardware support.
4.3.2 Implementation Details
Dual atomics introduce a set of unique requirements into the memory sub-
system which were not required by any other type of operation. We have
basically implemented two-request transactions within the caches. The suc-
cess of this type of transaction depends on the memory subsystem ensuring
deadlock-free operation as well as atomicity.
The dual atomic operation is split into two memory operations — the
full/empty word and the data word. To keep things simple, each word is
32 bits. Atomicity is ensured through the locking of cache lines. When the
first memory operation completes, we must lock that cache line. Locking
occurs at the L2 cache, which is the lowest-level private cache owned by each
core. The memory hierarchy will be discussed in further detail in Chapter 6.
Locking means that any read or invalidate destined for this line is queued
until the dual operation is completed. We have not currently implemented
any mechanism to account for the power implications of queueing requests,
as we expect them to be infrequent, but the delay in completing transactions
is modeled.
The reason for this locking is to ensure that other processors never see
one of the two new values without the other. For example, if a core has stale data
in its cache and reads that a processor has set the full/empty bit before it
sees the transaction updating the data, it may return bad data. Our current
implementation does not begin the second request until the first request has
completed. It should be possible to begin both accesses at the same time,
and this is an optimization that we will look into in our future work. In fact,
since most requests will have the full/empty bits and the data locations on
the same cache line, we should be able to combine the requests into a single
request which returns twice the amount of data. This could save energy
related to queueing stores as well as remove the requirement for locking
cache lines.
The other implementation challenge involves preventing deadlocks. One
concern is a situation where one core locks the full/empty bit for a dual
atomic transaction, and in the meantime, another core locks the data lo-
cation corresponding to that full/empty location. In this case, the second
memory access for each dual operation will stall in the other core, waiting
for the cache line to unlock. In order to prevent this, we enforce an arbitrary
ordering on the transaction such that the full/empty bit is always accessed
first. As long as the user never creates a situation where a data location and
a full/empty location share a cache line while the corresponding full/empty
and data locations share a different cache line, we should be able to avoid
deadlock.
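The problematic layout can be checked mechanically. The sketch below is a hypothetical helper, not part of the simulator: given the addresses of two full/empty and data pairs and a cache-line size supplied by the caller, it reports whether the first access of each dual operation would lock the line that the other operation needs for its second access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uintptr_t fe_addr;     /* address of the full/empty word (accessed first) */
    uintptr_t data_addr;   /* address of the data word (accessed second) */
} fe_pair;

static uintptr_t line_of(uintptr_t addr, uintptr_t line_bytes) {
    return addr / line_bytes;
}

/* True if pairs a and b are laid out crosswise: a's full/empty word shares a
 * line with b's data, b's full/empty word shares a line with a's data, and
 * the two lines differ. Each core then holds the line the other one needs. */
bool layout_can_deadlock(fe_pair a, fe_pair b, uintptr_t line_bytes) {
    assert(line_bytes > 0);
    uintptr_t a_fe   = line_of(a.fe_addr, line_bytes);
    uintptr_t a_data = line_of(a.data_addr, line_bytes);
    uintptr_t b_fe   = line_of(b.fe_addr, line_bytes);
    uintptr_t b_data = line_of(b.data_addr, line_bytes);
    return a_fe == b_data && b_fe == a_data && a_fe != a_data;
}
```

Placing each full/empty word on the same cache line as its own data word makes this condition impossible, since a_fe and a_data are then always equal.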
CHAPTER 5
BENCHMARKS
We compiled four different benchmarks in order to test these different meth-
ods for fine-grain synchronization. We believe that they cover most of the
trade-offs and typical uses of fine-grain synchronization. The Giga-Updates
per Second (GUPS) algorithm was written from scratch, and the rest were
ported from Cray XMT code.
5.1 Giga-Updates per Second
The Giga-Updates per Second (GUPS) benchmark is used as a baseline to
show the basic properties of the different methods we have implemented for
fine-grain synchronization. Each core does a fixed number of random accesses
to an array, where the access involves a ReadFE, an update to the data, and
a WriteEF. Here is the pseudocode for a single core running GUPS:
for ( i = 0 ; i < NUM_UPDATES ; ++i ){
    rn = get_rand();
    data = ReadFE(gups_array[rn % SIZE]);
    data++;
    WriteEF(gups_array[rn % SIZE], data);
}
If the array is very large (SIZE >> NUM_THREADS), there are few
collisions and the performance of the synchronizing action itself can be
accurately observed. If the array is small (SIZE << NUM_THREADS), then most
accesses result in collisions where a core must spin while waiting for another
core to finish its update. A smaller array allows us to observe the
performance of the different methods for fine-grain synchronization while under
contention. The variable NUM_UPDATES is large enough that start-up and
cache effects are eliminated.
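A self-contained version of the per-core loop, using single CAS emulation on C11 atomics, might look as follows. It is a sketch under several assumptions: FE_EMPTY is an arbitrary reserved pattern, a small xorshift generator stands in for get_rand(), and the counters are assumed never to grow large enough to reach the empty pattern.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define FE_EMPTY UINT32_MAX   /* assumed reserved "empty" bit pattern */

/* ReadFE by atomic exchange, spinning until a non-empty value is obtained. */
static uint32_t fe_read(_Atomic uint32_t *addr) {
    uint32_t old;
    do {
        old = atomic_exchange(addr, FE_EMPTY);
    } while (old == FE_EMPTY);
    return old;
}

/* WriteEF by compare-and-swap, spinning until the location is empty. */
static void fe_write(_Atomic uint32_t *addr, uint32_t data) {
    uint32_t expected = FE_EMPTY;
    while (!atomic_compare_exchange_strong(addr, &expected, data))
        expected = FE_EMPTY;   /* a failed CAS overwrote expected; reset it */
}

/* xorshift32: hypothetical stand-in for get_rand(); the state must be nonzero. */
static uint32_t next_rand(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* The work done by one core: num_updates synchronized increments. */
void gups_worker(_Atomic uint32_t *array, size_t size,
                 size_t num_updates, uint32_t seed) {
    assert(array != NULL && size > 0 && seed != 0);
    uint32_t rng = seed;
    for (size_t i = 0; i < num_updates; ++i) {
        size_t idx = next_rand(&rng) % size;
        uint32_t data = fe_read(&array[idx]);   /* location is now empty */
        data++;
        fe_write(&array[idx], data);            /* location is full again */
    }
}
```

Because every increment is bracketed by a ReadFE and a WriteEF, the sum of all array entries after the run equals the total number of updates performed, which gives a simple correctness check.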
Figure 5.1: The red edges form one possible matching. Note that since this
algorithm is greedy, it may not always make the optimal choice.
5.2 Matching
The Matching code attempts to find the maximum-weight matching for an
undirected graph. A matching is a set of edges where each vertex is incident
on at most one edge in the matching. The algorithm works by accessing all
of the vertices in parallel and trying to find the optimal “suitor” edge. See
Figure 5.1 for an example matching.
In this algorithm, a core looks at one vertex at a time, updating the suitor
of its neighbor if it shares an edge with greater weight than the neighbor’s
current suitor. This update is protected from collisions by doing a ReadFE
operation on the suitor and weight array entries, updating the suitor and
weight, and finishing with a WriteEF operation on both. If the suitor replaces
an old suitor, then the algorithm evaluates potential suitors for the old suitor
until it finds a suitor that has not been matched before. It then moves on
to the next vertex.
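The protected update can be sketched as below. This is a simplification of the benchmark and not the ported Cray XMT code: only the weight entry carries the full/empty state and the suitor entry is modified while the weight is held empty, whereas the benchmark synchronizes on both entries. The names, the integer weights, and the FE_EMPTY and NO_SUITOR values are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FE_EMPTY  UINT32_MAX   /* assumed reserved pattern; never a real weight */
#define NO_SUITOR UINT32_MAX   /* marks a vertex that has no suitor yet */

typedef struct {
    _Atomic uint32_t weight;   /* weight of the current suitor's edge; 0 if none */
    uint32_t         suitor;   /* touched only while weight is held empty */
} vertex_slot;

/* Offers vertex u, over an edge of weight w, as suitor for the vertex that
 * owns slot. Returns true if u was accepted; *displaced then holds the old
 * suitor (or NO_SUITOR), for which the caller must find a new partner. */
bool offer_suitor(vertex_slot *slot, uint32_t u, uint32_t w, uint32_t *displaced) {
    assert(slot != NULL && displaced != NULL && w != FE_EMPTY);
    uint32_t cur;
    do {                                        /* ReadFE on the weight entry */
        cur = atomic_exchange(&slot->weight, FE_EMPTY);
    } while (cur == FE_EMPTY);

    bool accepted = w > cur;
    *displaced = NO_SUITOR;
    if (accepted) {
        *displaced = slot->suitor;
        slot->suitor = u;
        cur = w;
    }
    atomic_store(&slot->weight, cur);           /* WriteEF: we hold it empty */
    return accepted;
}
```

The plain store at the end is sufficient because only the thread that emptied the entry can refill it; competing ReadFE attempts merely rewrite the empty pattern.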
Fine-grain synchronization operations are only used as efficient locks around
the suitor and weight updates in this algorithm. We only do this locking op-
eration when we find a weight that is greater than the current weight, or on
the first update. Assuming the threads are scheduled so that the nodes con-
nected to the heaviest edges are processed first, few nodes will have multiple
updates. Couple this with the extremely low chances that the same node will
be updated by two different threads at the same time, and this algorithm
has almost no spinning or contention.
Figure 5.2: Diagonal algorithm - synchronize after each diagonal completes.
5.3 RAlign - Diagonal
The RAlign algorithm is a wavefront algorithm where data passes from the
bottom-left corner of the image to the top-right corner. Each pixel is updated
based on the value from the current iteration for pixels below and to the left
of the current pixel, and the value from the previous iteration for pixels up
and to the right.
The baseline, or diagonal, algorithm slices the image into diagonals (Fig-
ure 5.2) where each block within the diagonal can run in parallel. Each
diagonal contains square blocks which have a configurable width in order to
exploit locality and avoid synchronization. Communication is synchronized
by a barrier which waits for all the blocks in a diagonal to complete. Since
threads are not allowed to pass the barrier until all other threads complete,
this algorithm is an example of a case where barriers limit parallelism, be-
cause a thread could be working on the next diagonal as long as its immediate
neighbor has completed.
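The structure of the diagonal schedule can be made concrete by indexing blocks as (r, c), with r counted from the bottom and c from the left (an indexing convention assumed here for illustration). A block then depends on blocks (r-1, c) and (r, c-1) of the current iteration, both of which lie on the previous diagonal d = r + c - 1, so all blocks with the same r + c are independent. The helper name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of blocks on diagonal d of a grid of rows x cols blocks, where
 * block (r, c) belongs to diagonal d = r + c. This is the amount of
 * parallelism available between two consecutive barriers. */
size_t blocks_on_diagonal(size_t rows, size_t cols, size_t d) {
    assert(rows > 0 && cols > 0);
    size_t count = 0;
    for (size_t r = 0; r < rows; ++r) {
        if (d >= r && d - r < cols)   /* column c = d - r lies inside the grid */
            ++count;
    }
    return count;
}
```

For a grid of 3 x 4 blocks, the diagonals hold 1, 2, 3, 3, 2, and 1 blocks in turn. The first and last diagonals therefore keep only a single thread busy, and every thread waits at each barrier even when its own neighbors finished long before.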
5.4 RAlign - Row-by-Row
A slightly different algorithm is proposed for use with fine-grain synchroniza-
tion constructs. We will call it row-by-row. In this algorithm, there are no
barriers. Each thread is assigned a row in the image, and it works across
the image from left to right. The only synchronization reads the latest value
from the pixel below, as seen in Figure 5.3. The pixel to the left does not
need to be synchronized because it is produced by the consuming thread.
Figure 5.3: Row-by-Row algorithm - synchronize when reading from the
pixel below.
This algorithm creates much more synchronization, but it is only from pro-
ducer to consumer, so there should be better load balancing. This algorithm
can be blocked in order to reduce synchronization at the cost of increased
start-up time. In this case, the thread is assigned a block of contiguous rows,
and synchronization is only required on the bottom row of the block. The
thread works up a column at a time in order to make sure the neighboring
consumer can begin working as soon as possible.
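A minimal sketch of the unblocked row-by-row scheme is shown below, with one thread per row. It is an illustration under assumptions: `threading.Event` stands in for the full/empty bit (unlike ReadFE, waiting on it does not empty it again), and `combine` is a placeholder for the real pixel update.

```python
import threading


def row_by_row(width, height, combine):
    """Compute a grid where cell (r, c) depends on (r-1, c) and (r, c-1).

    One thread per row; row 0 is the bottom row. `combine(below, left)`
    is a placeholder for the real pixel update.
    """
    value = [[0] * width for _ in range(height)]
    # One flag per pixel; set() marks it "full". Unlike ReadFE, wait()
    # does not empty the flag again.
    full = [[threading.Event() for _ in range(width)] for _ in range(height)]

    def worker(r):
        for c in range(width):
            below = 0
            if r > 0:
                full[r - 1][c].wait()        # the only synchronized read
                below = value[r - 1][c]
            # The left neighbor was written by this same thread.
            left = value[r][c - 1] if c > 0 else 0
            value[r][c] = combine(below, left)
            full[r][c].set()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(height)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return value
```

Each thread waits only on its producer, the row directly below, so no thread is held up by unrelated rows.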
This is an example of an algorithm where single CAS emulation is limited,
because the pixel holds valid data before it is filled with new data each
iteration. Since the empty state for emulation requires that the memory
location contain a specific bit pattern, we cannot keep the old value in a
location that is marked empty. We work around this problem by allocating
separate arrays for the pixel values from the previous iteration and from the
current iteration, so that new values are written to a separate location and
the old values are not overwritten.
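The limitation can be made concrete with a small sketch of single CAS emulation. Everything here is an assumption for illustration: the reserved pattern `EMPTY`, the class `CasWord` (which simulates a hardware compare-and-swap with a lock), and the helper names. Each helper makes a single attempt; a caller would retry, which is the spinning discussed in the text.

```python
import threading

EMPTY = 0xFFFFFFFF  # hypothetical reserved bit pattern meaning "empty"


class CasWord:
    """A 32-bit word with compare-and-swap, simulated with a lock."""

    def __init__(self, value=EMPTY):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` if the word equals `expected`; return the old value."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old


def try_read_fe(word):
    """One ReadFE attempt: return the data and leave the word empty, or None."""
    seen = word.load()
    if seen == EMPTY:
        return None
    # The CAS is followed by a comparison of the old and expected values.
    return seen if word.compare_and_swap(seen, EMPTY) == seen else None


def try_write_ef(word, data):
    """One WriteEF attempt: succeed only if the word was empty."""
    assert data != EMPTY, "the reserved pattern cannot be stored as data"
    return word.compare_and_swap(EMPTY, data) == EMPTY


def publish_new_pixel(old_pixels, new_pixels, i, update):
    """Write the new value to a separate array so the old value survives."""
    return try_write_ef(new_pixels[i], update(old_pixels[i]))
```

Because an empty word must hold `EMPTY`, it has no room for the previous value, which is why `publish_new_pixel` reads from one array and writes to another.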
It does not make sense to block the image into squares with fine-grain
synchronization for several reasons. First, if a thread is assigned a square
or rectangle, the upper right pixel has two consumers. When there are two
consumers, neither can use a ReadFE operation, which leaves the pixel empty,
because a consumer does not know whether it is the last consumer. If we
instead leave the array full as we
execute the algorithm, we will have to re-initialize the array to empty between
iterations. Second, rows have spatial locality, so we might as well have one
thread work on the entire row. Finally, if a thread works on a block, the
thread must choose which direction to work so that the thread can allow
neighbors to begin executing. Either way the thread chooses, neighbors in
the other direction will be unable to make progress.
5.5 Strongly Connected Components
The Strongly Connected Components (SCC) algorithm is a recursive algo-
rithm which makes use of a parallel job queue for distributing jobs to the
cores. A strongly connected component of a directed graph is a maximal
subset of vertices containing a directed path from each vertex to all others
in the set. The algorithm works by recursively breaking the vertices into
independent groups which can be worked on in parallel. These three groups
are descendants, predecessors, and the remainder. Among a set of vertices,
if a vertex V is a part of a strongly connected component, then all the ver-
tices in that component will be descendants of that vertex V. Predecessors
of V are vertices which have an edge directed at V, but are not descendants
of V. The remainder is the unconnected vertices. Since we are looking for
maximal SCCs, no vertex can exist in two different SCCs, thus allowing us
to work on these three sets in parallel.
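The decomposition can be sketched in a few lines. This is a serial illustration under assumptions, not the thesis code: it takes descendants and predecessors to be the vertices reachable from, and able to reach, the root; it picks the smallest vertex as root for reproducibility rather than a random one; and it uses an explicit work list where a parallel version would hand the three groups to a job queue. All function names are hypothetical.

```python
from collections import deque


def reachable(adj, root, allowed):
    """Breadth-first search from `root`, restricted to `allowed` vertices."""
    seen = {root}
    frontier = deque([root])
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, ()):
            if v in allowed and v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen


def forward_backward_scc(vertices, edges):
    """Return the strongly connected components as a list of sets."""
    forward, backward = {}, {}
    for u, v in edges:
        forward.setdefault(u, []).append(v)
        backward.setdefault(v, []).append(u)

    components = []
    work = [set(vertices)]          # each entry is an independent job
    while work:
        group = work.pop()
        if not group:
            continue
        root = min(group)           # deterministic stand-in for a random root
        descendants = reachable(forward, root, group)
        predecessors = reachable(backward, root, group)
        scc = descendants & predecessors
        components.append(scc)
        # The three remaining groups share no SCC, so they could be
        # processed in parallel.
        work.append(descendants - scc)
        work.append(predecessors - scc)
        work.append(group - descendants - predecessors)
    return components
```

For example, on a graph with the cycle 0, 1, 2, the cycle 3, 4, an edge from 2 to 3, and an edge from 5 to 0, the sketch finds the components {0, 1, 2}, {3, 4}, and {5}.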
The algorithm begins by selecting a random root and then doing a
breadth-first search (BFS) to determine descendants and predecessors of that
node. If a node is both a descendant and a predecessor of the root, then the
two vertices are part of the same SCC. The algorithm is called recursively on
the nodes which are not part of an SCC so far. There are opportunities for
parallelizing the BFS, subgraph initialization, and recursive calling.
Originally, the code only used fine-grain synchronization for calculating the
period of the subgraphs. The period is the greatest common divisor of the
lengths of the cycles in the graph. Like the
Matching code, the period calculation rarely collides and results in minimal
spinning. We added code for a work queue, since the original code used the
XMT runtime, which automatically performs the function of distributing jobs
to threads. The work queue allows for low overhead distribution of jobs. We
used full/empty bit operations to synchronize the enqueues and dequeues.
Since threads can spend a significant amount of time contending for jobs and
waiting for jobs to arrive, this presents a great opportunity to analyze the
performance of the fine-grain synchronization implementations in a situation
where synchronization does not always succeed the first time. When threads
arrive at the queue looking for work, they do a ReadFE operation on a
“QueueEmpty” variable. As long as the queue is empty, this variable should
be left empty. This variable makes sure that the threads spin while they are
waiting for the queue to be filled. The variable is marked full whenever a
thread finishes dequeuing and the queue is not empty, or if a thread enqueues
and the queue was previously empty.
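The role of the "QueueEmpty" variable can be sketched as below. This is an illustration under assumptions: `FullEmptyFlag` models the variable with a condition variable (so waiting threads sleep rather than spin), a separate lock guards the queue storage, and the class and method names are hypothetical.

```python
import threading
from collections import deque


class FullEmptyFlag:
    """A one-bit full/empty variable; starts empty."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False

    def read_fe(self):
        """Wait until full, then leave the flag empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False

    def mark_full(self):
        with self._cond:
            self._full = True
            self._cond.notify()


class WorkQueue:
    def __init__(self):
        self._jobs = deque()
        self._jobs_lock = threading.Lock()   # assumption: guards the deque
        self._has_work = FullEmptyFlag()     # plays the role of "QueueEmpty"

    def enqueue(self, job):
        with self._jobs_lock:
            was_empty = not self._jobs
            self._jobs.append(job)
            if was_empty:
                self._has_work.mark_full()

    def dequeue(self):
        self._has_work.read_fe()             # waits here while the queue is empty
        with self._jobs_lock:
            job = self._jobs.popleft()
            if self._jobs:
                self._has_work.mark_full()   # let the next waiting thread in
        return job
```

The flag is marked full in exactly the two situations the text lists: after a dequeue that leaves jobs behind, and after an enqueue into a previously empty queue.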
CHAPTER 6
EVALUATION
Our work makes use of the SuperESCalar (SESC) cycle-accurate architec-
tural simulator. We currently have the capability to use up to 32 cores.
The cores have a simple single-issue in-order pipeline, following the trend in
extreme-scale work to simplify cores and use more of them. At present, we
are focusing on a cache coherent architecture for simplicity, as most kernels
are written with this basic assumption. The cache hierarchy has three levels,
as seen in Figure 6.1. The L1 is a write-through split cache, with a 32KB
4-way associative data cache and 32KB 2-way associative instruction cache.
The L2 is a write-back 256KB 8-way associative cache which is also private
to the core. Finally, the L3 is a directory cache with 1MB per core that is
shared by all cores. All cache lines are 32B and contain 8 words. The full/empty
bits are associated with 32-bit memory locations. A software-managed cache
or message passing architecture might make more sense for an extreme-scale
architecture. Porting kernels to these mechanisms is a target of our future
work.
We were able to run experiments on four separate kernels: GUPS, Matching,
RAlign, and Strongly Connected Components. The GUPS kernel was writ-
ten from scratch, while the others were originally written for the Cray XMT
and were ported to use OpenMP. The Strongly Connected Components
code was modified to use pthreads and a custom work queue in order to test
the usage of fine-grain synchronization under contention.
We ran all experiments with two different models for synchronized memory
locations. The baseline model relies on cache coherency and uses the
smart spinning method. We also ran experiments with bypass mode enabled
for synchronized memory locations. The bypass experiments force these ac-
cesses to bypass the caches in order to approximate a system which is too
large to support cache coherency. These experiments use the simple spinning
method, because there is no ability to spin locally. We will show results from
Figure 6.1: SESC cache hierarchy.
the cache coherent executions unless otherwise noted. Bypass results will be
shown when they differ significantly from the coherent results.
6.1 Giga-Updates per Second
The GUPS kernel, as described in Chapter 5, was useful for testing our im-
plementations and demonstrating the basic relationships between these three
methods for fine-grain synchronization. The first experiment, as seen in Fig-
ures 6.2 and 6.3, allocates an array of 16K words which is accessed by all
threads concurrently. Each thread does an atomic ReadFE, update, and
WriteEF on a random word in the array. Each thread does 8192 random
updates in order to make sure that all start-up and caching effects are elimi-
nated. With 16K possible locations and only up to 32 cores executing at once,
we do not expect many collisions. Even if there is a collision, the amount of
time spent spinning is insignificant because the cores are only doing a sin-
gle operation while the location is empty. Therefore, these graphs show the
differences in latency and energy when most fine-grain synchronization oper-
ations succeed. As expected, the hardware implementation is slightly faster
because the accesses can be accomplished in a single operation. The single
CAS implementation requires a compare-and-swap operation followed by a
comparison between the old value and the expected value, so it is slightly
slower. The dual CAS implementation is slightly slower than the single CAS
because it requires more operands and creates two stores instead of one for
Figure 6.2: GUPS large array performance without bypassing.
Figure 6.3: GUPS large array energy without bypassing.
each CAS. Energy-wise, again there is not much difference between single
CAS and hardware support. The extra stores explain the extra energy from
dual operations. However, the overhead is not significantly higher because the
full/empty bit is allocated next to the data word. This allocation ensures that
the core only fetches a single cache line per update. Note that increasing the
number of cores does not reveal any noticeable differences in behavior between
these methods, since this application does not have many collisions, and
behavior under contention is mostly what separates hardware full/empty bits
from emulation.
In our second experiment with the GUPS kernel as seen in Figures 6.4 and
6.5, we reduced the size of the array to four locations. With up to 32 threads
Figure 6.4: GUPS small array performance without bypassing.
Figure 6.5: GUPS small array energy without bypassing.
attempting to update only four random words, we expect as many as eight
cores, on average, to be contending for each location at any given time. With all this
contention, we expect much more spinning as threads wait for locations to
be marked full. Here we can compare how the different methods perform
when the data to be accessed is highly contended. This time, the hardware
implementation is significantly faster than the other methods because the
same number of network transactions is used in the contended case as in the
uncontended case. The CAS methods pass the cache line from core to core
as they attempt to do the ReadFE operation. These accesses waste energy
and create contention in the network. As expected, the hardware implemen-
tation is much more energy efficient than the other methods because of their
Figure 6.6: GUPS small array performance with bypassing.
wasteful network accesses. When comparing the performance as the number
of cores increases, we see that the hardware implementation scales more
gracefully, since only required transactions enter the network in the hardware
implementation.
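A back-of-the-envelope model shows how different the two GUPS configurations are. The sketch below is a simplification and not a measurement from the thesis: it assumes that at a given instant every thread targets one word chosen uniformly and independently, and the function names are hypothetical.

```python
def collision_probability(num_locations, num_threads):
    """Chance that one thread's target is also targeted by another thread.

    Simplified model: at a given instant every thread targets one word
    chosen uniformly and independently.
    """
    others = num_threads - 1
    return 1.0 - (1.0 - 1.0 / num_locations) ** others


def mean_threads_per_location(num_locations, num_threads):
    """Average number of threads targeting each location."""
    return num_threads / num_locations
```

Under this model, with 32 threads the 16K-word array gives a collision probability of roughly 0.2%, while the four-word array gives more than 99.9% and an average of eight threads per location.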
Finally, we ran the four-location experiment with cache bypassing enabled
(see Figures 6.6 and 6.7) in order to approximate a more realistic environment
for extreme-scale computing. Notice that the dual CAS operations
fare poorly in these experiments. This is likely because our simulator does
not do any write combining, and therefore each access generates two accesses
to main memory. Normally the caches do a decent job of partially hiding
this overhead in the non-bypass experiments. By comparing Figures 6.6 and
6.4, we see that the emulation methods perform even worse than in the non-
bypass case. In the non-bypass case, threads only attempt to write when
they see the location become empty, meaning the contention is only at lock
release. In the bypass case, the threads are constantly writing to memory in
an attempt to access the location. This writing results in more contention
over the network. It is important to remember that this experiment rep-
resents the worst-case performance for emulation methods, since practically
the entire runtime is spent spinning.
Figure 6.7: GUPS small array energy with bypassing.
6.2 Matching
The Matching kernel represents a good example of fine-grain synchronization
methods providing almost equivalent performance when synchronization is
uncontended. This experiment used 8, 16, and 32 cores on a recursive-matrix
(R-MAT) graph with 2^17 nodes. The weight and suitor of each node are
synchronized. The “none” method uses a pthread mutex to lock each location.
In Figure 6.8, we can see that this algorithm involves almost no waiting. The only
differentiator between the different methods for fine-grain synchronization
in this benchmark is the latency of the locking operation, since spinning is
basically non-existent. The fine-grain synchronization operations provide a
25% to 30% performance benefit compared to pthread mutexes.
One of the purported advantages of fine-grain synchronization is that it
can aid in scalability. This is not always the case, as in this algorithm where
the fine-grain synchronization operations are only used to replace locks. The
algorithm must be fundamentally changed to take advantage of more efficient
synchronization in order to see scalability benefits. We ran an experiment on
a smaller graph (Figure 6.9) with only 2^12 nodes in order to see this property.
As can be seen, once we move to 32 cores, the run-time does not improve for
any of the methods.
Figure 6.8: Matching performance without bypassing with a large graph.
Figure 6.9: Matching performance without bypassing with a smaller graph
demonstrating scalability issues.
Figure 6.10: RAlign performance on both row-by-row and diagonal
algorithms without bypassing.
6.3 RAlign
The RAlign algorithm is an excellent example of how fine-grain synchroniza-
tion operations can be used to fundamentally change an algorithm to take
advantage of more efficient synchronization. We ran two different implemen-
tations of the algorithm. The fine-grain synchronization methods use the
row-by-row implementation, where pixel data is passed from row to row at
pixel granularity. The baseline uses a diagonal blocked algorithm which
synchronizes by using a barrier after each diagonal of blocks in the
image. Figures 6.10 and 6.11 show the results for a 512 x 512 pixel image
which is operated on by 8, 16, and 32 threads. In the fine-grain synchronization
runs, each thread is assigned eight rows at a time in order to balance the benefits
of efficient synchronization with locality. The diagonal algorithm uses 16 x
16 pixel blocks. Note that only one diagonal uses all 32 cores on this 512
x 512 image. If we try to make the blocks smaller, the overhead of barrier
synchronization begins to dominate and makes the performance even worse.
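The claim that only one diagonal can occupy all 32 cores follows from simple counting, sketched below. The utilization figure is an idealized model and not a thesis result: it assumes every block takes the same time and that barriers are free, and the function names are hypothetical.

```python
def diagonal_sizes(image_side, block_side):
    """Number of blocks on each anti-diagonal of a square blocked image."""
    n = image_side // block_side
    return [min(d + 1, 2 * n - 1 - d) for d in range(2 * n - 1)]


def barrier_steps(sizes, cores):
    """Block-times needed if each diagonal ends with a barrier."""
    return sum(-(-s // cores) for s in sizes)  # ceiling division


def utilization(sizes, cores):
    """Fraction of core-time spent on blocks rather than waiting."""
    return sum(sizes) / (barrier_steps(sizes, cores) * cores)
```

For a 512 x 512 image with 16 x 16 blocks there are 32 blocks per side, 63 diagonals, and 1024 blocks in total, and exactly one diagonal holds 32 blocks. Under the stated assumptions the ideal utilization is roughly 82% on 8 cores, 68% on 16 cores, and 51% on 32 cores, before any barrier cost is counted.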
Looking at performance, all fine-grain synchronization methods do equally
well, while the diagonal algorithm fails to scale to 32 cores. Single CAS is
slightly worse than dual CAS because single CAS has to use an “old” and
“new” data array, as explained in Chapter 5. Fortunately this increase in
the memory footprint does not have a significant impact on the energy or
performance of the single CAS method.
Spinning is kept at a minimum for the row-by-row implementation because
Figure 6.11: RAlign energy on both row-by-row and diagonal algorithms
without bypassing.
the algorithm is very regular. The producer-consumer relationship means that
each core must wait for the core below it to complete the first column of its
set of rows, and once this is complete, the consumer is sufficiently behind the
producer such that it almost never has to wait for the producer to complete.
Also, once the first column is complete, all cores can be executing across the
image. In the diagonal algorithm, there is a longer period of time at the
beginning and end of the algorithm where cores are waiting for the diagonal
to become large enough for all to participate.
6.4 Strongly Connected Components
The Strongly Connected Components algorithm is a recursive algorithm
which generates jobs that can be executed in parallel by the cores. We ran
experiments with 512 strongly connected components and 4096 total nodes in
order to have reasonable run-times. Typically, cores spend between 10% and
30% of their time spinning in this algorithm, which means that it can be used
to effectively show differences in spinning efficiency in a real algorithm. We
implemented a back-off mechanism (marked by “_bo” in the figures) along
with a pause instruction to be used by the emulation methods. This back-
off mechanism allows us to see how well emulation methods can attempt
to approximate the queueing advantages of hardware support for fine-grain
synchronization.
Looking at performance in Figures 6.12 and 6.13, hardware full/empty bits
seem to have a slight advantage, except in a few cases where emulation
matches or slightly surpasses it. The take-away from this observation is that
the performance is not greatly affected by variations in spinning efficiency.
In fact, sometimes emulation is slightly faster than hardware support, indi-
cating that outside effects such as variation in which cores grab which jobs
from the queue may have a larger effect on performance than the method of
synchronization.
Figure 6.12: SCC performance without bypassing.
Figure 6.13: SCC performance with bypassing.
The interesting results are in energy consumption, as seen in Figures 6.14
Figure 6.14: SCC energy without bypassing.
Figure 6.15: SCC energy with bypassing.
and 6.15. Without a back-off mechanism, the emulation methods use signif-
icantly more energy than the hardware method. Fortunately, by observing
the typical spinning times and developing a heuristic for gradually increasing
the amount of time a core pauses before attempting to load again, we were
able to get the energy consumption much closer to that of hardware support.
This result indicates that while hardware support does have the advantage of
minimal transactions without any extra work, with a little bit of observation
and software work, we were able to do almost as well as a hardware queueing
mechanism with emulation. When comparing bypass and non-bypass runs,
we again see that dual operations do worse in a bypass environment because
the caches cannot mask their multiple accesses as well.
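A back-off loop of the kind described above might look like the following sketch. It is an assumption-laden illustration: the thesis does not give its heuristic, so the doubling factor, the initial delay, and the cap are placeholders, and `try_once` and `pause` are hypothetical callables supplied by the caller.

```python
def spin_with_backoff(try_once, pause, initial_delay=1, factor=2, max_delay=64):
    """Retry `try_once` until it succeeds, pausing longer after each failure.

    `try_once()` returns (succeeded, value). `pause(delay)` idles the core
    for `delay` units; both are supplied by the caller. The default
    constants are placeholders, not values from the thesis.
    """
    delay = initial_delay
    attempts = 0
    while True:
        attempts += 1
        succeeded, value = try_once()
        if succeeded:
            return value, attempts
        pause(delay)
        delay = min(delay * factor, max_delay)
```

In an emulated ReadFE, `try_once` would be a single CAS attempt and `pause` would issue pause instructions or sleep; fewer attempts mean fewer network transactions, which is where the energy saving comes from.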
CHAPTER 7
FUTURE WORK
The main issue we would like to resolve is finding algorithms which take
advantage of fine-grain synchronization in a meaningful way: algorithms whose
structure actually changes in order to use the efficient synchronization to improve
parallelization or to make synchronization more energy efficient. GUPS was
able to demonstrate this potential, but RAlign was the only algorithm able
to demonstrate how a program could fundamentally change in order to take
advantage of fine-grain synchronization. It would be desirable to find more
examples of meaningful usage of fine-grain synchronization, since most of the
uses simply replaced locks in order to gain a small latency advantage. We
would also like to find an algorithm which has varying amounts of spinning
on variables that are more fine-grained than the work queue locking which
we observed in the Strongly Connected Components code.
Dual operations show promise as a more robust alternative to single CAS
operations, but we would also like to investigate implementing alternatives
to simple full/empty bits with dual operations. Since dual operations al-
locate an entire memory location to synchronization, we could potentially
use dual operations to implement an efficient barrier, or orchestrate multiple
consumers by setting the location to the number of times that the data can
be consumed.
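The multiple-consumer idea could take a shape like the sketch below. This is speculative, like the proposal itself: the class `CountedLocation`, its methods, and the use of a lock to simulate a dual-word atomic operation are all assumptions, and the sketch does not stop one consumer from reading the same value twice.

```python
import threading


class CountedLocation:
    """A data word paired with a counter word of remaining reads.

    count == 0 plays the role of "empty". The lock simulates a dual-word
    atomic operation; all names are hypothetical.
    """

    def __init__(self):
        self._data = None
        self._count = 0
        self._atomic = threading.Lock()

    def try_produce(self, data, num_consumers):
        """Fill the location if it is empty; allow `num_consumers` reads."""
        with self._atomic:
            if self._count != 0:
                return False
            self._data = data
            self._count = num_consumers
            return True

    def try_consume(self):
        """Read the data and use up one of the remaining reads, or None."""
        with self._atomic:
            if self._count == 0:
                return None
            self._count -= 1
            return self._data
```

The last consumer to read leaves the counter at zero, so the location becomes empty without any consumer needing to know that it was the last one.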
We would also like to perform these experiments on a simulator which
supports more cores. With the extreme-scale era approaching, we must be
able to simulate hundreds to thousands of cores in order to be more confident
about our results.
CHAPTER 8
CONCLUSION
This study of fine-grain synchronization has helped us better understand
its benefits and how hardware support for
full/empty bits compares to methods of emulation. We found that fine-
grain synchronization can provide significant performance benefits simply by
replacing locks with ReadFE and WriteEF operations. Fine-grain synchro-
nization can have even larger performance benefits in the case of RAlign,
where the algorithm is able to use more synchronization because of the effi-
cient fine-grain synchronization, thus allowing for more parallelization. Single
CAS emulation of full/empty bits can match the performance of hardware
support for full/empty bits in algorithms which do not require much spin-
ning. In the case of RAlign, where the synchronized location must be able
to hold data when it is empty, the single CAS method was able to do just as
well as the hardware-supported run by allocating an extra array for previous
data. Therefore, this algorithm was not dependent on the property of fine-
grain synchronization which allows it to combine data with synchronization
in order to see benefits.
For us to recommend hardware support for fine-grain synchronization, we
must find algorithms which have a very specific set of requirements. First, the
algorithm must require a large amount of spinning, and the spinning must be
frequent and unpredictable enough that an exponential back-off mechanism
cannot sufficiently replicate a queued request mechanism. If the algorithm
does not require much spinning, the chances of seeing benefits for hardware
support are even smaller. Perhaps an algorithm which has a high degree of
locality and extremely frequent synchronization may be dependent enough
on latency of access that hardware support would make sense. This is highly
unlikely considering that the single CAS method only requires extra operands
and a comparison operation. The final case where hardware support may be
beneficial is if the footprint of synchronization is an issue. The only case
where we may have to use up additional memory footprint is if an empty
value cannot be determined for single CAS operations, or if the memory
location must hold valid data when it is marked empty. Since large graph
algorithms already have poor locality, this was not an issue in the GUPS,
Matching, and SCC algorithms. The RAlign code, which had better locality
but needed to store data when the location was empty, did not show any ill
effects from the increased footprint.
We believe that hardware support for fine-grain synchronization only makes
sense under very specific circumstances. Even though the idea has been
around for decades, good algorithms for taking advantage of fine-grain syn-
chronization were few and far between. Maybe this barrier to entry will begin
to erode as algorithms are re-written to take advantage of more parallelism,
but this process has already been going on for quite some time. Implement-
ing fine-grain synchronization with single and dual CAS has shown similar
benefits to hardware support without the significant hardware overhead.
REFERENCES
[1] S. Borkar, “The exascale challenge,” in VLSI Design Automation and
Test (VLSI-DAT), 2010 International Symposium on, April 2010, pp.
2–3.
[2] B. J. Smith, “Architecture and applications of the HEP multiprocessor
computer system,” in Real-Time Signal Processing IV, 1982, pp. 241–
248.
[3] G. Papadopoulos and D. Culler, “Monsoon: An explicit token-store
architecture,” ACM SIGARCH Computer Architecture News, vol. 18,
no. 3a, 1990, pp. 82–91.
[4] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield,
and B. Smith, “The Tera computer system,” in Proceedings of the 4th
International Conference on Supercomputing. ACM, 1990, pp. 1–6.
[5] A. Agarwal, R. Bianchini, D. Chaiken, D. Kranz, J. Kubiatowicz,
B. Lim, K. Mackenzie, and D. Yeung, “The MIT Alewife machine: Ar-
chitecture and performance,” in Proceedings of the 22nd Annual Inter-
national Symposium on Computer Architecture, 1995, pp. 2–13.
[6] J. Feo, D. Harper, S. Kahan, and P. Konecny, “Eldorado,” in Proceedings
of the 2nd Conference on Computing Frontiers. ACM, 2005, p. 34.
[7] W. Zhu, V. Sreedhar, Z. Hu, and G. Gao, “Synchronization state buffer:
Supporting efficient fine-grain synchronization on many-core architec-
tures,” in Proceedings of the 34th Annual International Symposium on
Computer Architecture. ACM, 2007, p. 45.
[8] J. Martinez and J. Torrellas, “Speculative synchronization: Applying
thread-level speculation to explicitly parallel applications,” in Proceed-
ings of the 10th International Conference on Architectural Support for
Programming Languages and Operating Systems. ACM, 2002, pp. 18–
29.
[9] A. Kagi, D. Burger, and J. Goodman, “Efficient synchronization: Let
them eat QOLB,” in Proceedings of the 24th Annual International Sym-
posium on Computer Architecture. ACM, 1997, pp. 170–180.
[10] J. M. Mellor-Crummey and M. L. Scott, “Synchronization without
contention,” SIGPLAN Not., vol. 26, no. 4, Apr. 1991, pp. 269–278.
[Online]. Available: http://doi.acm.org/10.1145/106973.106999
