Energy-Efficient Hardware-Accelerated Synchronization for
  Shared-L1-Memory Multiprocessor Clusters by Glaser, Florian et al.
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. XX, NO. XX, MONTH 2020 1
Energy-Efficient Hardware-Accelerated
Synchronization for Shared-L1-Memory
Multiprocessor Clusters
Florian Glaser, Student Member, IEEE, Giuseppe Tagliavini, Member, IEEE, Davide Rossi, Member, IEEE,
Germain Haugou, Qiuting Huang, Fellow, IEEE, and Luca Benini, Fellow, IEEE
Abstract—The steeply growing performance demands for highly power- and energy-constrained processing systems such as
end-nodes of the internet-of-things (IoT) have led to parallel near-threshold computing (NTC), joining the energy-efficiency benefits of
low-voltage operation with the performance typical of parallel systems. Shared-L1-memory multiprocessor clusters are a promising
architecture, delivering performance in the order of GOPS and over 100 GOPS/W of energy-efficiency. However, this level of
computational efficiency can only be reached by maximizing the effective utilization of the processing elements (PEs) available in the
clusters. Along with this effort, the optimization of PE-to-PE synchronization and communication is a critical factor for performance. In
this work, we describe a light-weight hardware-accelerated synchronization and communication unit (SCU) for tightly-coupled clusters
of processors. We detail the architecture, which enables fine-grain per-PE power management, and its integration into an eight-core
cluster of RISC-V processors. To validate the effectiveness of the proposed solution, we implemented the eight-core cluster in
advanced 22 nm FDX technology and evaluated performance and energy-efficiency with tunable microbenchmarks and a set of real-life
applications and kernels. The proposed solution allows synchronization-free regions as small as 42 cycles, over 41× smaller than the
baseline implementation based on fast test-and-set access to L1 memory when constraining the microbenchmarks to 10%
synchronization overhead. When evaluated on the real-life DSP-applications, the proposed SCU improves performance by up to 92%
and 23% on average and energy efficiency by up to 98% and 39% on average.
Index Terms—Energy-efficient embedded parallel computing, fine-grain parallelism, tightly memory-coupled multiprocessors.
1 INTRODUCTION
After being established as the architectural standard for general-purpose and high-performance computing over a
decade ago [1], [2], the paradigm of chip multiprocessors (CMPs)
has also been adopted in the embedded computing domain
[3], [4]. While the main driving force for the former domain
is prohibitive heat dissipation as a result of ever-rising clock
frequencies of up to several GHz, the turn to parallel processing
in the latter domain is propelled by the trend toward high
computational performance without losing the energy efficiency of the
simple, low-performance microcontrollers used so far. A key
contributor to the increase in performance demand in embedded
devices is the rise of internet-of-things (IoT) end-nodes that need
to flexibly handle multiple sensor data streams [5], [6] (e.g.,
from low-resolution cameras or microphone arrays) and perform
complex computations on them to reduce the bandwidth over
energy-intensive wireless data links.
As the straightforward replacement of microcontroller cores
with more powerful core variants featuring multiple-issue,
multiple-data pipeline stages, and higher operating frequencies
naturally jeopardizes energy efficiency [7], researchers are turning
to parallel near-threshold computing (NTC). Reducing the supply
voltage of the underlying CMOS circuits to their optimal energy
point (OEP) [8], usually located slightly above their threshold
voltage, enables improvements of energy efficiency by up to one
order of magnitude [9]. The gain in energy efficiency, however,
comes with a significant loss in performance of around 10× [9]
due to the reduced maximum operating frequency directly linked
to voltage scaling. To overcome this performance limitation, we
resort to parallel NTC [10], an approach coupling the high energy
efficiency of near-threshold operation with the performance typical
of tightly coupled clusters of processing elements (PEs) that
can, when utilized in parallel, recover from the supply-scaling
induced performance losses, delivering up to GOPS in advanced
technology nodes. However, parallel NTC can only achieve its
fundamental goal of increasing energy efficiency with parallel
workloads and when the utilization of computational resources is
well balanced or, more generally, when the underlying hardware
can effectively exploit the parallelism present in applications.
If this is not the case, performance loss must be recovered
by increasing the operating frequency (and the supply voltage) to
achieve the given target, thereby reducing the energy efficiency of
the system. Moreover, in sequential portions of applications, par-
allel hardware resources such as PEs and part of the interconnect
towards the shared memory consume power without contributing
to performance. Hence, in these regions, all idle components must
be aggressively power-managed at a fine-grain level.
The described requirements highlight the need for communica-
tion, synchronization, and power management support in parallel
clusters. Communication mechanisms allow PEs to communicate
with each other to exchange intermediate results and orchestrate
parallel execution. In this work, we focus on shared-memory
multiprocessors that typically rely on data-parallel computational
models. For this class of systems, data exchange is trivial,
restricting the communication aspects to pointer exchange and
data validity signaling. However, wait-and-notify primitives are
required by every application that has any form of data dependency
between threads (i.e., that cannot be vectorized). Consequently, the
support for synchronization mechanisms remains mandatory also
for this type of parallel processing systems.
arXiv:2004.06662v1 [cs.AR] 14 Apr 2020
The most straightforward way to enable functionally correct
implementations of every kind of multi-PE synchronization is
to provide atomic accesses to the shared memory (or a part
thereof) and use spinlocks on protected shared variables. While
this approach is universal, flexible, and requires small hardware
overhead, it constitutes a form of busy-waiting as every contestant
repeatedly accesses the shared variables until all get exclusive
access (at least once, depending on the synchronization primitive
that is implemented). This concept is not only prohibitive from
an energy efficiency viewpoint due to the wasted energy for
every failing lock-acquire attempt but is also considerably dis-
advantageous from a performance point-of-view as the concurrent
attempts can put high loads on processor-to-memory interconnect
systems and cause contentions on the shared memory system. Ad-
ditionally, as every contestant has to acquire the lock sequentially,
the cost of synchronization primitives in terms of cycles is even
in the best case lower bounded by the product of memory access
latency and the number of involved PEs, therefore growing with
the number of contestants.
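The busy-waiting behavior described above can be illustrated with a minimal sketch of a test-and-set spinlock, written here with C11 atomics. The type and function names (tas_lock_t, tas_lock_acquire, and so on) are hypothetical and chosen only for illustration; this is not the baseline implementation evaluated later in the paper. The counter returned by tas_lock_acquire makes the wasted work visible: every failed attempt is an access to shared memory that does not contribute to progress.

```c
#include <stdatomic.h>

/* Hypothetical spinlock type; the names are illustrative only. */
typedef struct {
    atomic_flag locked;
} tas_lock_t;

#define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Single test-and-set attempt: returns 1 if the lock was obtained. */
static int tas_lock_try(tas_lock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Busy-wait until the test-and-set succeeds. Returns the number of
 * failed attempts, each of which is a wasted shared-memory access. */
static unsigned tas_lock_acquire(tas_lock_t *l)
{
    unsigned failed = 0;
    while (!tas_lock_try(l))
        failed++;
    return failed;
}

/* Hand the lock back so that the next contestant can obtain it. */
static void tas_lock_release(tas_lock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Because only one contestant can hold the lock at a time, N contestants that each need it once pass through tas_lock_acquire one after another, which is the source of the linear scaling mentioned above.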
Basic interrupt and power management (PM) support allows
avoiding busy-waiting with the help of software synchronization
primitives that only require contestants to become active
after the shared atomic variables have been updated or when spinlock
ownership changes. However, the context switches required for interrupt
handling naturally incur software overhead, and the issues associated
with concurrent lock-acquire attempts as well as the cost and
scaling of synchronization primitives remain unaffected. As a
result, the handling of a single synchronization point can take
over a hundred cycles even with fewer than ten involved PEs, as our
experiments in Sec. 6.3 show. With the constraint to gain from
parallelization, either in terms of energy or performance or both,
this cost figure causes a lower bound for the average length of
periods that PEs can work independently from each other, often
referred to as the synchronization-free region. In this work, we use
the term parallel section interchangeably, as it is often found in
the context of parallel programming models.
The fundamental principle in parallel NTC of utilizing all
available computational resources as equally as possible conflicts
heavily with a minimum required synchronization-free region
length: If parallelizing an application only pays off for parallel
sections of thousands of cycles or more due to the associated
synchronization overhead, energy-efficient execution of all ap-
plications that exhibit finer-grain inter-thread dependencies is
thwarted even though the overall system architecture would be
perfectly suitable.
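The relation between synchronization cost and the minimum worthwhile synchronization-free region can be made concrete with a small helper. The sketch below assumes that overhead is defined as the share of synchronization cycles in the total, i.e., t_sync / (t_sync + t_sfr); this definition and the function name min_sfr_cycles are assumptions made for illustration, and the exact metric used in the evaluation may differ.

```c
#include <assert.h>

/* Smallest synchronization-free region (in cycles) for which the
 * synchronization cost stays within a given share of the total time.
 * Assumed definition: overhead = t_sync / (t_sync + t_sfr), hence
 * t_sfr >= t_sync * (100 - p) / p for an overhead budget of p percent. */
static unsigned long min_sfr_cycles(unsigned long sync_cycles,
                                    unsigned overhead_percent)
{
    assert(overhead_percent > 0 && overhead_percent <= 100);
    unsigned long num = sync_cycles * (100UL - overhead_percent);
    /* Integer division, rounded up. */
    return (num + overhead_percent - 1) / overhead_percent;
}
```

Under this assumed definition, a synchronization point that costs 100 cycles (an illustrative input) with a 10% budget gives min_sfr_cycles(100, 10) = 900 cycles, so every cycle saved per synchronization point lowers the smallest profitable parallel section by nine cycles.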
To overcome the explained challenges and limitations, we
propose a light-weight hardware-supported solution that aims at
drastically reducing the synchronization overhead in terms of
cycles and – more importantly – energy, thereby making fine-grain
parallelization for the targeted shared-memory NTC processing
clusters affordable. All signaling is done by restfully waiting on
events, i.e., halting and resuming execution at synchronization
points without the need to change the software context. We detail
the hardware architecture of the synchronization and commu-
nication unit (SCU), the foundation of the proposed solution
that centrally manages event generation, and is instantiated as a
single-cycle accessible shared peripheral. It provides both general-
purpose signaling and an easily extensible set of commonly
used synchronization primitives. In this work, we focus on the
barrier and mutex primitives, as they are required for the parallel
and critical section constructs that are fundamental in parallel
programming frameworks such as OpenMP [11].
For cases where a completely balanced utilization of all PEs is
not possible, we propose fine-grain PM in the form of clock gating,
fused into the SCU, that allows saving energy during idle periods
as short as tens of cycles. To demonstrate the capabilities of the
solution, we integrate the SCU into an eight-core heterogeneous
cluster as an example of the targeted systems and show how it can be
efficiently used from a software point of view by extending the ISA
of the digital signal processing (DSP)-enabled RISC-V cores with
only a single instruction. We illustrate both the opportunities and
relevance of lowering synchronization overhead for parallel NTC
through synthetic benchmarks and a set of DSP kernels that are
typical for the targeted system type. By analyzing the performance
and energy of the whole system in both cases, we demonstrate not
only the theoretically possible but also the practical gains of the
proposed solution. As a competitive baseline, we use software
implementations of the primitives based on atomic L1 memory
access in the form of test-and-set (TAS); for a fair comparison, the
baseline implementations also include variants that employ event-
based restful waiting. To achieve reliable results in the context of
fine-grain parallelization, we carry out all experiments on a gate-
level, fabrication-ready implementation of the multicore cluster in
a 22 nm process, allowing us to obtain cycle-exact performance
numbers as well as to measure energy with an accuracy close to
that of physical system realizations. The results obtained in our
evaluation show that the SCU allows synchronization-free regions
as small as 42 cycles, which is more than 41× smaller than the
implementation based on fast TAS when constraining the syn-
thetic benchmarks to 10% synchronization overhead. Moreover,
when evaluated on real-life kernels, the proposed SCU improves
performance by up to 92% and energy efficiency by up to 98%.
The remainder of this paper is organized as follows. Sec. 2
provides a comprehensive overview of the prior art related to
synchronization in embedded multiprocessors. Sec. 3 discusses
the relevance of fine-grain synchronization in the context of the
targeted system type. The architecture of both the proposed SCU
and the hosting multiprocessor is explained in Sec. 4, followed by
the thereby enabled concept of aggressively reducing the overhead
for synchronization primitives in Sec. 5. The baseline method, as
well as the experimental setup and methodology which we used,
can be found in Sec. 6, followed by the experimental results for
both the microbenchmarks and the range of DSP-applications.
Sec. 7 summarizes and concludes the work.
2 RELATED WORK
The shortcomings associated with straightforward synchronization
support based on atomic memory access have been broadly
recognized by the research community; a variety of works
therefore propose, similar to our approach, hardware-accelerated
solutions to improve performance [12], [13], [14], [15], [16], [17],
energy efficiency [18], [19], or both [20], [21]. Reviews of the
performance and characteristics of software-based solutions for
shared-memory multiprocessors can be found in [22], [23], [24],
[25], [26] and date back as early as the 1990s. Multiple works
[25], [27], [28], [29] propose to dynamically adjust the speed
of individual PEs at runtime to equalize their execution speed
instead of power-managing them. This approach, however, incurs
significant control and circuit complexity due to the required
asynchronous clocks and severely increases the latency between
PEs and memory as crossing clock domain boundaries already
takes several cycles. Consequently, it only pays off if the costs
for entering and leaving low-power modes are in the order of
thousands of cycles, as the authors of [25] assume. Our approach
instead aims at using a single synchronous clock, implementing a
simple variant of PM that is suitable for very short idle periods,
equalizing workload with fine-grain parallelization, and ultimately
gaining from system-wide frequency and supply scaling.
The prior art in hardware-accelerated synchronization for embedded,
synchronously clocked systems covers a wide range of
target architectures and implementation concepts; it is, however, to
a large extent not suitable for the microcontroller-class, cacheless
shared-memory type of system that we target. What follows is a
review of the corresponding references, structured by key aspects
that illustrate the causes for the stated mismatch.
Synchronization-free region size: As explained previously
in Sec. 1, the ability to handle typical synchronization tasks in
the order of tens of cycles and below is of major importance
for energy-efficient parallelization in the context of the targeted
processing clusters. Except for [17], no explicit statement is
made about the targeted synchronization-free region (SFR) size.
The authors of [17] report the speedup for SFRs as small as
ten cycles; however, they leave many implementation details open.
For all other references, the SFR size for which the respective
results are reported must be implicitly determined by analyzing
the employed benchmarks. Multiple references [19], [20], [21]
employ rather dated parallel benchmarking suites such as STAMP
[30] or SPLASH-2 [31], which feature SFRs of tens of thousands of cycles
and are, therefore – in addition to prohibitive data set sizes –
not suitable for the microcontroller-type clusters we target. The
order of magnitude for the SFR size of the mentioned bench-
marks was determined through experiments on a general-purpose
desktop computer; the misfit of, e.g., the SPLASH-2 suite, is
also mentioned in [14]. As a suitable alternative, the authors
of [14] propose the usage of a subset of the Livermore Loops
[32], a collection of sequential DSP-kernels that, however, can be
parallelized with reasonable effort as the original code is annotated
with data hazards and the like. The parallel versions of loops 2, 3,
and 6 are provided in [14] and used in [13], [14], [15]; we include
loops 2 and 6 in our set of applications; loop 3 is omitted as it
is a fully vectorizable matrix multiplication, and synchronization is
consequently not required within the kernel. Our analysis in Sec. 6
verifies the fitting SFR sizes of loops 2 and 6; they are as small
as 104 cycles. Whenever provided, we compare the execution time
for both loops achieved with our solution to the references in
question and observe better performance even when considering
systems with higher PE counts than ours.
In addition to the conclusions drawn from the used bench-
marks, a closer analysis of the employed bus systems can also
help to estimate the smallest supported SFR: For example, the
synchronization-operation buffer (SB) proposed in [20], managing
all synchronization constructs locally at the shared memory, is
only reachable for the PEs through a network-on-chip (NoC). A
latency of at least 30 cycles for NoC transactions is stated; for
requesting and getting notified about the availability of a lock, at
least two transactions are required, pushing the affordable SFR
size far above the range of approximately one hundred cycles and
below that we target.
The authors of [16], [18], [21] either explicitly state task-level
parallelism or use benchmarks that feature software pipelining
(of independent algorithm parts) and therefore do not target
synchronization at loop level of any granularity.
Due to the similarities in the targeted systems, it can be
assumed that [33] aims at similarly sized synchronization-free
regions as ours, although not explicitly stated. Unfortunately,
results are only reported as number and types of accesses to the
synchronization hardware for various synchronization primitives.
The barrier and mutex primitives that we discuss in detail are
estimated to take two bus transactions or six cycles (measured at
the memory bus); we reduce the number of bus transactions to
one and the latency to four cycles; additionally, we include the
feedback of primitive-specific information.
Synchronization hardware complexity: Keeping overall cir-
cuit area and complexity as small as possible is of crucial
importance for systems that ought to be employed for parallel
NTC, as explained in Sec. 1. Naturally, this also holds true for
any synchronization-managing unit that must both affect overall
area only insignificantly and also not introduce complex (possibly
latency-critical) circuitry due to the associated dynamic power,
thereby diminishing the savings obtained through the acceleration
of synchronization tasks.
The cache-like SB presented in [20] does not meet these
requirements; considerable circuit area is required for each entry to
store all required information as well as for the logic that performs
single-cycle hit-detection. The need to check every memory access
for a matching address causes the activity of the SB to be much
higher than the frequency of synchronization points in a given
application would require.
The concepts presented in [18] and [21] are based on snoop
devices at the memory or system bus; [21] is of particular
interest as it distributes the synchronization-management over one
controller per PE that each hosts a locking queue for the respective
variables of interest. This concept may sound very appealing
from a hardware complexity point of view; however, an important
aspect not covered in [21] changes the picture significantly: In the
usual (and desirable) case where the system bus of a CMP can
handle multiple transactions at once, each local synchronization
controller must have global visibility of the bus and be able to
check, in parallel, the maximum number of concurrent transactions
against any of its monitored locking variables. Even a single such
device cannot be built in a slim way; replicating
it for every PE makes the situation worse.
Although no analysis is provided, the complexity of the barrier
filter modules proposed in [14] can be estimated to be slightly
higher than for our proposed hardware barrier modules (based
on the available knowledge about the amount of information that
each barrier filter needs to store). We reduce all address-related
housekeeping overhead and restrictions by assigning each barrier
module a fixed address in the global peripheral address space of
the system and by providing PE-parallel access.
This work is most comparable to the hardware synchronizer
(HWS) proposed in [33] that is connected as a shared peripheral,
enabling synchronization primitives through appropriate program-
ming of an array of atomic counters and compare registers.
We reduce the hardware complexity and remove the burden of
configuring and mapping the atomic counters to synchronization
primitives by providing native and low-cost hardware support for
such while maintaining general-purpose signaling.
The concepts presented in [16], [17] principally match ours
as shared, dedicated registers are used to represent the state of
synchronization primitives at low hardware cost. Sadly, many
implementation and integration aspects are left undiscussed, and
either only locks [16] or only barriers [17] are natively supported.
While a hardware lock can be employed to realize a barrier, the
cycle cost compared to a hardware barrier is clearly prohibitive
for the small SFR sizes that we target. While our solution also
features general-purpose atomic PE-to-PE signaling, the most
critical feature is native hardware support for the most commonly
used synchronization primitives (in the targeted systems and
programming models) as well as concurrent access to them at
very small hardware overhead.
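To illustrate why a lock alone is a costly building block for a barrier, the following sketch builds a barrier from a lock-protected arrival counter using C11 atomics. It is a generic textbook-style construction under assumed names (lock_barrier_t and its functions are hypothetical), not the implementation of [16] or of the baseline used in this work. The atomic flag stands in for whatever lock is available, in hardware or as a TAS variable.

```c
#include <stdatomic.h>

/* Hypothetical barrier built on top of a single lock. */
typedef struct {
    atomic_flag lock;        /* stands in for a hardware or TAS lock   */
    unsigned    arrived;     /* arrival counter, protected by the lock */
    unsigned    num_pes;     /* number of participating PEs            */
    atomic_uint generation;  /* incremented each time the barrier opens */
} lock_barrier_t;

static void lock_barrier_init(lock_barrier_t *b, unsigned num_pes)
{
    atomic_flag_clear(&b->lock);
    b->arrived = 0;
    b->num_pes = num_pes;
    atomic_init(&b->generation, 0);
}

/* Register the caller's arrival and return the generation it must wait
 * on. Every PE has to obtain the lock once, one after another. */
static unsigned lock_barrier_arrive(lock_barrier_t *b)
{
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ;  /* spin until the lock is free */
    unsigned gen = atomic_load_explicit(&b->generation, memory_order_relaxed);
    if (++b->arrived == b->num_pes) {
        b->arrived = 0;  /* last arriver re-arms and opens the barrier */
        atomic_store_explicit(&b->generation, gen + 1, memory_order_release);
    }
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
    return gen;
}

/* True once the barrier of generation gen has been opened. */
static int lock_barrier_is_open(lock_barrier_t *b, unsigned gen)
{
    return atomic_load_explicit(&b->generation, memory_order_acquire) != gen;
}

/* Full barrier: arrive, then poll until all PEs have arrived. */
static void lock_barrier_wait(lock_barrier_t *b)
{
    unsigned gen = lock_barrier_arrive(b);
    while (!lock_barrier_is_open(b, gen))
        ;
}
```

Each barrier crossing requires every PE to pass through the critical section in lock_barrier_arrive and then to poll the generation counter, so the cost grows with the PE count, whereas a native barrier can collect all arrivals concurrently.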
Target system type: The vast majority of prior art assumes
processing systems with data caches; those either affect the re-
spectively proposed memory-mapped synchronization concept or
are even actively modified and used for synchronization [14]. The
exceptions [16], [17], [21], [33] propose synchronization solutions
that sit closer to the PEs than any data cache or are too vague in the
description of the targeted systems to decide about the existence
of caches.
The implications of the targeted system on the complexity of
the synchronization hardware are well illustrated in, e.g., [14],
[15], [20] that all target high-end architectures with much higher
operating frequencies than we do, at the cost of sacrificing energy
efficiency. To ensure scalability beyond a few PEs as well as the
high operating frequencies, all shared memory and peripherals
are only reachable through high-latency buses or networks. As
any synchronization-managing unit must conceptually have global
observability of synchronization requests, it must either be put in
front of the shared memory and keep track of (i.e., store) the
sequential accesses to it [14], [20], or, more suitably, dedicated
message-passing networks must be added [15]; both alternatives
imply undesirably complex hardware. In the latter case, the primary
focus is on latency and bandwidth of PE-to-PE transfers while the
synchronization functionality is conveniently added but would not
require the underlying high-bandwidth message-passing hardware.
The low-latency and concurrent access to shared memory in our
targeted clusters allow us to limit the synchronization hardware to
a signaling role and use the shared memory for any data exchange
that is larger than a single word without performance penalties.
On the other side of the spectrum, popular shared-memory
parallel architectures are graphics processing units (GPUs), em-
ploying the single-instruction multiple-thread (SIMT) execution
model, a combination of single-instruction multiple-data (SIMD)
and multithreading. General-purpose GPUs (GP-GPUs) are hi-
erarchical architectures composed of multiple clusters (called
multiprocessors) that, in turn, are made up of multiple PEs. Each
PE executes a group of threads (called a warp) in lockstep,
according to the order of instructions issued by a dispatcher,
which is shared among all of them [34]. While in traditional
GP-GPUs only global synchronization primitives and hardware
support were available, which only allowed synchronizing all threads
running on a multiprocessor, recent architectures such as Nvidia
Volta allow synchronizing the threads within a warp, improving
the synchronization efficiency for kernels with smaller granularity
[35]. However, the SIMT nature of GP-GPUs forces them to
sequentialize divergent threads and threads with critical sections,
jeopardizing performance and energy efficiency, which makes
them very inflexible and definitely not suitable for the application
domain targeted in this work.
A hybrid SIMD/MIMD (multiple-instruction, multiple-data)
approach has been proposed in [36]. The architecture combines
a tightly coupled cluster of processors featuring a shared in-
struction memory with broadcast capability and a counter-based
hardware synchronization mechanism, dynamically managing the
lockstep execution of cores during data-dependent program flows.
Although this approach achieves a 60% energy reduction, its
applicability is restricted to data-parallel code sequences, and it
is intrusive from the software viewpoint, as it requires explicitly
managing counters and lockstep execution. On the other hand, the
approach proposed in this work is fully flexible and provides easily
usable primitives supported by parallel programming models such
as OpenMP [11].
PM and signaling mechanism: With the exception of [13],
[14], [20], where no explicit statements are made, all references
implement or at least suggest idle waiting for PEs that are blocked
at synchronization points. The majority of the works with a focus
on idle waiting proposes interrupt-based mechanisms [16], [17],
[18], [33] without further consideration for the associated context
switching overhead. Event-based signaling is supported in [33];
however, no details are given on how PM and idle-waiting are
realized. The authors of [15], [19], [37] follow our approach
of power-managing idle cores through clock gating; power and
area of the section-monitoring pool units in [19] are, however,
not suitable for our microcontroller-class target systems. The
architecture of the system proposed in [37] is similar to ours;
multiple PEs, operating in lockstep, are connected to a multi-
banked memory through a single-cycle crossbar. However, the
targeted lockstep operation of the PEs greatly limits the generality
of how the system can be used. Furthermore, no details about the
architecture and integration of the hardware synchronization unit
(SU), that is central to the design, are provided.
In contrast, [14], [21] propose to stall PEs that cannot continue
by means of absent replies to requests on their data or instruction
ports. As this approach allows handling the aforementioned check
and decision for continuation in one operation as well as PE-
externally and centrally, we follow it in favor of reducing the time
devoted to synchronization. The solution proposed in [33] is the
only one that combines low cycle-overhead synchronization with
PM; unfortunately, an analysis of the achievable gains in terms
of system performance and energy efficiency is not provided.
We combine the concepts of stalling cores that are blocked
at synchronization points and event-based signaling, and tightly
couple those with per-core fine-grain clock gating to allow the
handling of synchronization primitives in fewer than ten active PE
cycles. Large parts of the required hardware are reused to also
provide general-purpose interrupt support for, e.g., the handling of
data exceptions.
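The combination of stalling on a withheld reply and per-core clock gating can be pictured with a toy behavioral model. The sketch below is purely illustrative: the structure barrier_model_t, its fields, and the function names are assumptions and do not describe the actual SCU architecture, which is detailed in Sec. 4. Each PE issues a single read to a barrier address; the reply is withheld (and the PE's clock gated) until all participating PEs have issued their read.

```c
#include <stdint.h>

#define NUM_PES 8u  /* matches the eight-core cluster; the model is hypothetical */

typedef struct {
    uint32_t waiting;      /* bit i set: PE i has an unanswered barrier read */
    uint32_t target_mask;  /* PEs that participate in the barrier            */
} barrier_model_t;

/* PE `pe` issues its single read to the barrier address.
 * Returns the bit mask of PEs whose reads are answered as a result, or 0
 * if the reply is withheld and the PE remains stalled. */
static uint32_t barrier_model_read(barrier_model_t *m, unsigned pe)
{
    m->waiting |= (uint32_t)1 << pe;
    if ((m->waiting & m->target_mask) != m->target_mask)
        return 0;                  /* not all PEs arrived: withhold reply */
    uint32_t released = m->waiting;
    m->waiting = 0;                /* re-arm for the next barrier */
    return released;
}

/* In this model, a PE's clock is gated exactly while its read is pending. */
static int barrier_model_clock_enabled(const barrier_model_t *m, unsigned pe)
{
    return !((m->waiting >> pe) & 1u);
}
```

In this model, the check and the decision to continue happen in one operation outside the PE: no polling loop and no interrupt handler run on the waiting cores, which is the property the text attributes to the stalling approach.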
Implementation stage: The majority of the previous works
employs behavioral models of the individual system components
(PEs, interconnect, memories, synchronization hardware) written
in higher-level languages and instruction- or transaction-level
simulators such as, e.g., GRAPES in [20], MPARM in [18], [19],
or M5 in [13], [21]. While this approach enables the simulation
of complex and large-scale architectures in reasonable times, it
has the drawback of reduced accuracy for the figures of interest,
performance (or execution time) and power, when compared to
cycle-exact simulations based on synthesizable modules captured
in a hardware description language (HDL) or gate-level imple-
mentations. The loss in accuracy may be acceptable for evaluating
the performance of synchronization solutions for task-level paral-
lelism (where the main goal is to reduce or eliminate polling over
high-performance interconnect systems) or the support of transac-
tional memory (TM) with applications that spend 50% and more
of their execution time within critical sections [19]. For our goal of
enabling fine-grain parallelism with few tens or hundreds of cycles
between synchronization points, however, cycle-level accuracy is
required to reliably evaluate the effects of different solutions with
an increasing degree of hardware support (from atomic memory
access to full synchronization primitives). For example, slight
differences in the arrival instants of PEs at synchronization points
due to, e.g., cache misses or small workload imbalances can have
a massive impact on the subsequently caused contention during
lock acquire trials, as we demonstrate in Sec. 6.
To the best of the authors’ knowledge, [33] is the only prior
work that reports gate-level (and even silicon) implementations
of the whole processing system, including the synchronization
hardware. Unfortunately, only latencies for various synchroniza-
tion primitives in terms of number and type of memory bus
transactions are given in addition to area and power figures for
the synchronization hardware. While absolute and relative circuit
complexity of the added hardware is comparable to our proposed
solution, its power consumption of tens of milliwatts is in the
range of our targeted total system consumption and, therefore,
prohibitive [38]. We follow the same approach of a memory-mapped
shared peripheral with comparable circuit area but further
reduce the cost of synchronization primitives, state them in terms
of cycles and energy, and quantify the achieved system-wide
energy savings while maintaining the overall power envelope of a
few tens of milliwatts. We consequently use a cycle-exact
register-transfer level (RTL) implementation of the whole system to
measure execution time and a post-layout, fabrication-ready physical
implementation in a current 22 nm CMOS process as a basis for
the most important figure of merit, the total system energy with
and without our proposed solution for fine-grain parallelism. An
important rationale for the post-layout implementation stage is
that it considers the clock distribution network which typically
consumes a significant share of the overall dynamic power of
synchronous digital circuits, yet usually gets neglected in energy
analyses. Furthermore, the efficacy of fine-grain PM in the form of
silencing parts of the clock network, a central part of our concept,
can only be shown in this way.
3 BACKGROUND
This work aims at accelerating synchronization tasks in the context
of tightly memory-coupled multicore processing clusters. This type
of system, such as Rigel [39], STHORM [40], or PULP [10], is
designed to cope with computation-intensive, data-centered pro-
cessing tasks with a minimal amount of hardware complexity to
allow the usage in very energy-efficient, yet performant IoT end-
nodes. The clusters are built around a set of RISC-type PEs that
have low-latency access to a shared scratchpad memory (SPM)
(single cycle in our cluster variant). As a main consequence, this
design does not require the adoption of area-intensive data caches,
thus avoiding any coherency-ensuring overhead. Consequently,
this type of cluster is also called cacheless shared-memory multi-
processor.
Upper-bounding the core count to 16 in a single cluster
allows the usage of rather simple, but fast (in terms of latency)
interconnect and bus systems, while still providing computational
performance of several GOPS. Other architectures like GP-GPUs
scale the number of PEs up to 32 with a two-cycle latency shared-
memory, prioritizing performance over efficiency. This design
choice stands in contrast to the target of this work, where we
propose parallelism as a way to improve the energy-efficiency of a
low-power computing system. Multiple instances of clusters, each
connected to a higher level of shared memory, enable systems with
performance demands that cannot be satisfied with a single cluster.
In this way, the principles for NTC that we outlined in Sec. 1 are
respected: As we confirm in Sec. 6, PEs and memory consume the
lion's share of both dynamic and static power. As the performance
of a single cluster already satisfies the requirements of several
real-life applications (e.g., [41], [42]), we set the focus of this
work to single-cluster energy-efficient synchronization and leave
the hierarchical extension to multiple clusters for future work.
The SPM must be realized with either multi-banked SRAMs
or standard cell memory (SCM) to enable PE-concurrent and
low-latency access to it. Furthermore, the low-latency constraint
bounds the size of the SPM to at most a few hundred
kilobytes. Otherwise, the PE-to-SPM timing path, which defines
the maximum operating frequency of a cluster, gets too long and
results in degradation of the overall cluster performance. On the
one hand, increasing the number of SPM banks (i.e., scaling up
memory size horizontally) causes a drastic increase in interconnect
latency [43]; on the other hand, scaling up memory vertically
while maintaining the number of banks increases the delay in
address decoding for row selection.
3.1 Relevance of fine-grain synchronization support
The aforementioned memory constraints limit the size of the
working set that can be present in the tightly coupled memory
at a given time. To avoid accesses to outer memory levels and
preserve performance, applications with an extensive working set
must employ techniques such as data tiling to exploit data locality
[44]; a direct memory access unit (DMA) is additionally required
to enable a double-buffering scheme (from/to a larger background
memory with higher latency) and to overlap memory transfers with
computation phases on the PEs. The orchestration of tiling intro-
duces additional dimensions to the iteration space of the original
algorithm, which map to supplementary inner loops iterating on
smaller bounds (i.e., the tile size); consequently, synchronization
moves at a finer level of granularity than the original algorithm,
and the number of synchronization points increases by the number
of tiles [45].
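The growth in synchronization points can be made concrete with a small sketch of a double-buffered tile loop. The step names (`dma_load`, `wait_dma`, `compute`, `barrier`) and the exact ordering are assumptions chosen for illustration, not the scheme of any specific runtime; the point is only that every tile adds its own synchronization steps.

```python
def double_buffer_schedule(num_tiles):
    """Return the ordered steps of a double-buffered loop over `num_tiles` tiles.

    Each step is a tuple (action, tile, buffer). Buffers 0 and 1 alternate:
    while the PEs compute on one buffer, the DMA fills the other one.
    """
    steps = []
    if num_tiles <= 0:
        return steps
    steps.append(("dma_load", 0, 0))  # prefetch the first tile
    for tile in range(num_tiles):
        buf = tile % 2
        steps.append(("wait_dma", tile, buf))  # sync point: tile data has arrived
        if tile + 1 < num_tiles:
            # Start fetching the next tile; this overlaps with the compute step.
            steps.append(("dma_load", tile + 1, 1 - buf))
        steps.append(("compute", tile, buf))
        steps.append(("barrier", tile, buf))  # sync point: all PEs finished the tile
    return steps


def count_sync_points(steps):
    """Count the steps at which PEs have to synchronize."""
    return sum(1 for action, _, _ in steps if action in ("wait_dma", "barrier"))
```

In this sketch the number of synchronization steps grows linearly with the number of tiles, so halving the tile size to fit a smaller SPM doubles the synchronization work for the same data set.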
Another important aspect that contributes to the importance
of fine-grain synchronization support is to allow efficient paral-
lelization of kernels that inherently exhibit small SFRs. We list
multiple examples of such kernels in Sec. 6, which can only
be efficiently executed in a parallel fashion when the system
supports the handling of typical synchronization tasks in roughly
ten cycles. The adoption of task-level parallelism, in combination
with software pipelining, can work around these issues; how-
ever, this methodology poses a significant limitation in terms of
programming flexibility and achievable computation latency. It
furthermore requires the constant availability of tasks that can be
independently scheduled and completely occupy the idle PEs.
3.2 Requirements for synchronization hardware
A beneficial adoption of the parallel NTC concept requires adher-
ing to a set of design principles for the cluster architecture. As
the primary guideline, the overhead to provide PEs with access
to shared resources (such as memories or peripherals) and to
communicate with each other must be kept as small as possible
since a significant portion of the overall energy at the OEP is
spent through static power consumption, being directly linked to
circuit area. Similarly, complex synchronization hardware would
cause a significant amount of extra active power that could
eventually diminish any energy savings gained from accelerated
computations.
[Figure 1: block diagram of the cluster. Cores 0 to n fetch from a shared instruction cache; a data demux per core selects between the L1 logarithmic interconnect (word-interleaved L1 data memory banks, test-and-set), the peripheral interconnect (timer, SCU, CNN accelerator), and a private demux link to the SCU. A DMA (L1 ↔ L2) and the 64-bit AXI cluster bus connect to the rest of the system. Legend: master-slave bus, private demux link, event source (E).]
Fig. 1. Multicore cluster, incorporating the proposed synchronization and
communication unit (SCU).
As even simple interconnect systems or caches can quickly
become comparable to or even exceed the area of the PEs [38],
consequently affecting the OEP adversely, special care has to
be taken when designing these building blocks and the clus-
ters. As a result of this constraint, features such as multi-level
data caches with the attached burden of coherency management,
memory management units (MMUs), nested vectorized interrupt
support, or network-like communication systems are unaffordable.
The absence of these blocks, in turn, prohibits the usage in a
control-centric OS-like fashion with virtual memory support but
favors the employment of the clusters as programmable many-core
accelerators (PMCAs) to execute computation-centric kernels with
regular program flow and physical memory addressing [46].
To not lose generality, the changes required in the host system
should be minimal (e.g., no profound modifications or extensions
to the PE data path) and, wherever possible, existing infrastructure
reused. As the central synchronization hardware is best aware
of the set of PEs that is waiting at synchronization points, it is
the best place to implement fine-grain PM on a per-PE basis.
To consistently power-manage every idle system
component, one of the most important parallel NTC principles,
the synchronization hardware itself must also be designed in an
appropriate way so that it does not, e.g., waste active power during
phases where all PEs are busy and none is involved in any
synchronization action.
A solution that adheres to all stated requirements increases the
energy efficiency of a cluster in two ways, where both essentially
stem from reducing the execution time for a given task: First,
the energy spent gets reduced proportionally to execution time
as long as the power of the accelerated system is similar to that
of the baseline system. Second, the reduction in execution time
allows lowering the operating point (voltage and frequency) of the
system, moving it closer to the OEP for a given performance or
latency target.
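Both effects can be written down compactly. The following is a first-order formulation added for illustration, not a result of this work: the symbols E, P, T, N_cyc, and the subscripts "base" and "SCU" are introduced here, and the dynamic-power relation is the standard first-order CMOS model.

```latex
% E: energy, P: average cluster power, T: execution time of a fixed task
E = P \cdot T
\quad\Rightarrow\quad
\frac{E_{\mathrm{SCU}}}{E_{\mathrm{base}}}
  = \frac{P_{\mathrm{SCU}}}{P_{\mathrm{base}}} \cdot
    \frac{T_{\mathrm{SCU}}}{T_{\mathrm{base}}}
  \approx \frac{T_{\mathrm{SCU}}}{T_{\mathrm{base}}}
  \quad \text{if } P_{\mathrm{SCU}} \approx P_{\mathrm{base}}

% Fewer cycles N_cyc for the same latency target T_target permit a lower
% clock frequency f, which in turn permits a lower supply voltage V_DD:
f_{\mathrm{req}} = \frac{N_{\mathrm{cyc}}}{T_{\mathrm{target}}},
\qquad
P_{\mathrm{dyn}} \approx \alpha \, C \, V_{\mathrm{DD}}^{2} \, f
```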
4 ARCHITECTURE
This section starts with an introduction to the high-level architecture
of the hosting multiprocessor cluster before explaining the
details of the SCU architecture, its integration into the cluster as
well as analyzing its scalability.
[Figure 2: block diagram of the SCU with the notifier, barrier, mutex, and event FIFO extensions and one base unit per core (event buffer, event mask, IRQ mask, and an FSM with the states active, sleep, and irq). Legend: NC = number of cores, NB = number of barriers, NMx = number of mutexes; line styles distinguish regular signals, private demux bus, shared crossbar bus, and event lines.]
Fig. 2. Simplified overall architecture of the SCU, including the available
extensions, and detailed architecture of the base units. Signals on the
bottom connect to the cores, signals on the left to the cluster, and higher
hierarchies.
4.1 Multiprocessor Cluster
As a basis for our proposed synchronization solution, we use
the open-source multiprocessor cluster of the PULP project [38],
matching the targeted deeply embedded, data-cache-less, tightly
memory-coupled system type. The cluster is designed around a
configurable number (typically up to 16) of low-cost in-order
RISC-V microcontrollers. To greatly accelerate the execution of
the targeted DSP-centric processing loads, they feature several
extensions to the base instruction set [47] from which a wide
range of applications benefit. Specialized PEs such as neural
network accelerators can additionally be included in the cluster
to cope with more specific tasks that require very high processing
throughput [48].
All PEs share a single-cycle accessible L1 tightly-coupled
data memory (TCDM), composed of word-interleaved single-port
SRAM macros. In order to reduce the number of contentions
between PEs, a banking factor of two is used (i.e., the number
of banks is twice the number of PEs). Data transfers between
the size-limited TCDM and the L2 memory with larger capacity are
facilitated by a tightly-coupled DMA connected to the L1 memory
like any other PE. Access routing and arbitration between all PEs
and the TCDM banks are handled by a low-latency logarithmic
interconnect (LINT), allowing each TCDM bank to be accessed
by a PE in every cycle. All the cores fetch the instructions from a
hybrid private-shared instruction cache which has – in addition to
the DMA – access to the 64-bit AXI cluster bus that connects to
the rest of the communication and memory system of the SoC.
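The effect of word interleaving with a banking factor of two can be sketched as a simple address-to-bank mapping. The sketch below assumes 32-bit words and byte addresses given as offsets from the TCDM base; the function names are hypothetical and the actual address decoding in the LINT may differ.

```python
WORD_BYTES = 4  # assumption: 32-bit data words


def tcdm_bank(byte_offset, num_cores=8, banking_factor=2):
    """Map a byte offset into the TCDM to a bank index (word-interleaved)."""
    num_banks = num_cores * banking_factor
    return (byte_offset // WORD_BYTES) % num_banks


def bank_conflicts(byte_offsets, num_cores=8, banking_factor=2):
    """Count requests that cannot be served in the same cycle.

    `byte_offsets` holds one access per PE for a single cycle; each bank can
    serve one request per cycle, so every additional request to an already
    requested bank has to wait.
    """
    banks = [tcdm_bank(a, num_cores, banking_factor) for a in byte_offsets]
    return len(banks) - len(set(banks))
```

Consecutive words land in consecutive banks, so cores that stream through neighboring words do not collide, whereas two accesses that are a multiple of `num_banks` words apart hit the same bank.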
In addition to the TCDM interconnect, the cluster features a
peripheral subsystem with dedicated LINT. It allows not only the
specialized PEs to be programmed and controlled by means of
memory-mapped configuration ports, but also to connect further
peripherals such as timers and the SCU that is proposed in this
work. A master port to the cluster bus allows the RISC-V cores to
access all cluster-external address space.
Test-and-Set Atomic Memory Access
Besides routing and arbitrating requests, the LINT provides basic
and universal atomic memory access to the whole TCDM address
space in the form of TAS. Atomic accesses are signaled by
setting an address bit that is outside of the L1 address space;
the LINT checks it upon read-access. The currently stored value
is returned to the requesting core (or to the elected one in case
of multiple contending requests) and -1 written back to memory
in the next cycle before any other core gets its request granted.
Synchronization primitives based on this feature are used as a
strong baseline (TAS transactions take just three cycles) against
which we compare our proposed solution.
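The TAS behavior described above can be modeled in a few lines. This is a functional sketch only: the position of the alias bit (`TAS_BIT`), the 32-bit word size, and the convention that a lock word of 0 means "free" are assumptions made for the example, and the model ignores timing and arbitration.

```python
WORD_BYTES = 4      # assumption: 32-bit words
TAS_BIT = 1 << 20   # hypothetical address bit outside the modeled L1 range


class TasMemory:
    """Toy word memory whose reads become test-and-set when TAS_BIT is set."""

    def __init__(self, num_words):
        assert num_words * WORD_BYTES <= TAS_BIT
        self.words = [0] * num_words

    def read(self, addr):
        index = (addr & ~TAS_BIT) // WORD_BYTES
        value = self.words[index]
        if addr & TAS_BIT:
            # The old value is returned and -1 is written back before any
            # other request to this word is granted.
            self.words[index] = -1
        return value

    def write(self, addr, value):
        self.words[(addr & ~TAS_BIT) // WORD_BYTES] = value


def try_lock(mem, addr):
    """Return True if the lock word at `addr` was free (0) and is now taken."""
    return mem.read(addr | TAS_BIT) == 0


def unlock(mem, addr):
    """Release the lock word by storing 0 with a regular write."""
    mem.write(addr, 0)
```

A spin-lock in the SW baseline would call `try_lock` in a loop until it succeeds, which is exactly the continuous polling that idle waiting is meant to avoid.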
4.2 SCU Base Unit
Fig. 2 depicts a high-level overview of the SCU architecture, with
a deeper focus on the SCU base unit. The base unit is instantiated
once per RISC-V core and provides the fundamental functionality
of the SCU, i.e., event and wait-state management, as well as fine-
grain PM through direct control of the clock-enable signal of the
corresponding core. The design is based on 32 level-sensitive event
lines (per core) that are connected to associated event sources. In
a typical usage scenario, a limited number of the event sources
are located outside of the SCU (e.g., specialized PEs or cluster-
internal peripherals), while the remaining ones are responsible for
core-to-core signaling and generated within the SCU by so-called
SCU extensions.
Event lines are stored into a register called event buffer,
which is maskable through the event mask register. Basic interrupt
support is also provided to handle exceptions and other irregular
events; an interrupt mask register allows selective enabling and
disabling of event lines to trigger hardware interrupts. The central
finite state machine (FSM) orchestrates all control flow and
includes the three states, active, sleep, and interrupt-handling. The
main inputs used to evaluate state transitions are pending events
or interrupts, the core busy-status as well as sleep and buffer-clear
requests.
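The interplay of event buffer, event mask, and interrupt mask reduces to a few bitwise operations. The following sketch assumes the 32 event lines mentioned above; the class and method names are hypothetical and the FSM itself is left out.

```python
EVENT_LINES = 32
ALL_LINES = (1 << EVENT_LINES) - 1


class BaseUnitRegisters:
    """Bit-level model of the event buffer and its two masks (no FSM)."""

    def __init__(self, event_mask=0, irq_mask=0):
        self.event_buffer = 0
        self.event_mask = event_mask & ALL_LINES
        self.irq_mask = irq_mask & ALL_LINES

    def capture(self, event_lines):
        """Store asserted event lines into the event buffer."""
        self.event_buffer |= event_lines & ALL_LINES

    def pending_events(self):
        """Buffered events that may wake up the core."""
        return self.event_buffer & self.event_mask

    def pending_irqs(self):
        """Buffered events that are allowed to trigger an interrupt."""
        return self.event_buffer & self.irq_mask

    def clear(self, bits):
        """Clear the given bits of the event buffer."""
        self.event_buffer &= ~bits & ALL_LINES
```

Events on lines that are enabled in neither mask stay in the buffer without affecting the core until they are cleared or a mask is changed.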
4.3 SCU Extensions
SCU extensions are responsible for core-to-core signaling;
generally, they generate core-specific events that allow a subset of cores to
continue execution. All extensions have trigger and configuration
signals connected to each base unit; as with the base units,
their associated functionality is available through memory-mapped
addresses. The four available types of extensions are depicted in
Fig. 2 and detailed in the following.
Notifier: This extension provides general-purpose, any-to-any
matrix-style core-to-core signaling. Each core can trigger one of
the eight notifier events for any subset of cores (including itself).
For write-triggered events, the write data are used as a target-
core mask; for read-triggered events, a dedicated register in each
SCU base unit holds the target mask. An all-zero value causes a
broadcast notifier to all cores in both cases. This extension is used
in the TAS-based variants of the synchronization primitives that
we profile and use in Sec. 6.3 and Sec. 6.4, respectively.
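The target-mask rule of the notifier, including the all-zero broadcast case, can be stated as a one-line function. The name `notifier_targets` and the default of eight cores are assumptions for this sketch.

```python
def notifier_targets(mask, num_cores=8):
    """Return the bit mask of cores that receive the notifier event.

    Bit i of `mask` selects core i; an all-zero mask means broadcast.
    """
    all_cores = (1 << num_cores) - 1
    mask &= all_cores
    return all_cores if mask == 0 else mask
```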
Barrier: Allows a configurable target subset of cores to
continue execution only after a (possibly different) worker subset
has reached a specific point in the program. The extension contains
a status register that keeps track of each core that has already
arrived at the barrier; this is signaled by reading or writing from
or to specific addresses. Depending on the core that caused the
access, the matching bit in the status register is set. Once the
status register matches the configured worker subset, an event
is generated for all cores that are activated in the target subset,
allowing those to uninterruptedly idle-wait at the barrier until their
condition for continuation is met.
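A functional sketch of the barrier bookkeeping follows. The class name and interface are hypothetical, and resetting the status register once the barrier completes is an assumption made so that the model can be reused; timing is not modeled.

```python
class BarrierExtension:
    """Functional model of one barrier instance."""

    def __init__(self, worker_mask, target_mask):
        self.worker_mask = worker_mask  # cores that must arrive
        self.target_mask = target_mask  # cores that are woken up
        self.status = 0                 # cores that have arrived so far

    def arrive(self, core_id):
        """Register the arrival of `core_id`.

        Returns the mask of cores that receive the barrier event, or 0 if
        the worker subset is not yet complete.
        """
        self.status |= 1 << core_id
        if (self.status & self.worker_mask) == self.worker_mask:
            self.status = 0  # assumption: the barrier re-arms itself
            return self.target_mask
        return 0
```

With `worker_mask=0b0111` and `target_mask=0b1111`, core 3 can wait at the barrier without contributing to it and is released together with the three workers.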
Mutex: Represents an object that can only be owned or
locked by one core at a time and, therefore, directly supports
synchronization primitives that require mutual exclusivity such as
mutually exclusive code sections. Try-locks are, similar to the
barrier extension, signaled by reading from a specific address. The
mutex extension keeps track of all pending lock requests and elects
one core by sending an event to only that core. The elected core
must write to the same address once it releases the mutex, causing
the extension to wake up another waiting core (if there is any).
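The mutex bookkeeping can be sketched in the same style. The election policy is not specified here, so the sketch simply elects the waiting core with the lowest index; that choice, like the class and method names, is an assumption.

```python
class MutexExtension:
    """Functional model of one mutex instance."""

    def __init__(self):
        self.owner = None  # core that currently holds the mutex
        self.pending = 0   # bit mask of cores with an open try-lock

    def _elect(self):
        """Hand the mutex to one waiting core if it is free."""
        if self.owner is not None or self.pending == 0:
            return None
        lowest = self.pending & -self.pending   # isolate the lowest set bit
        self.pending &= ~lowest
        self.owner = lowest.bit_length() - 1    # assumption: lowest index wins
        return self.owner

    def try_lock(self, core_id):
        """Register a try-lock; return the core that gets an event, if any."""
        self.pending |= 1 << core_id
        return self._elect()

    def unlock(self, core_id):
        """Release the mutex; return the next elected core, if any."""
        if core_id != self.owner:
            raise ValueError("only the owner may unlock the mutex")
        self.owner = None
        return self._elect()
```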
Event FIFO: To be able to react to (relatively slow) cluster-
external event sources as well, the event FIFO extension is in-
cluded in the SCU. It allows handling of up to 256 cluster-external
event sources that can be triggered by, e.g., chip-level peripherals
or higher-level control cores, as can be found in modern system-
on-chips (SoCs). The external events are sequentially received
over a simple request/grant asynchronous 8-bit event bus and
stored into the FIFO. As long as there is at least one event present,
an event line associated with the FIFO is asserted that is connected
to all SCU base units. In a typical use-case, the event line triggers
an interrupt handler on one core that then pops the events from the
FIFO and processes them.
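The behavior of the event FIFO as seen by the cores can be sketched as follows. The FIFO depth is not stated above, so `depth` is a hypothetical parameter, and the request/grant handshake is reduced to the return value of `push`.

```python
from collections import deque


class EventFifo:
    """Functional model of the FIFO for cluster-external events."""

    def __init__(self, depth=4):  # hypothetical depth
        self.depth = depth
        self._fifo = deque()

    def push(self, event_id):
        """Offer an external event; return True if it was accepted."""
        if not 0 <= event_id < 256:
            raise ValueError("event IDs are 8 bits wide")
        if len(self._fifo) >= self.depth:
            return False  # no grant: the sender has to retry
        self._fifo.append(event_id)
        return True

    @property
    def event_line(self):
        """Asserted as long as at least one event is stored."""
        return len(self._fifo) > 0

    def pop(self):
        """Remove and return the oldest stored event ID."""
        return self._fifo.popleft()
```

An interrupt handler would typically call `pop` in a loop until `event_line` is deasserted.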
For the targeted parallel programming models such as
OpenMP [11], the barrier and mutex extensions are the most im-
portant ones as they provide hardware support for the fundamental
parallel sections and critical sections programming primitives.
The number of barrier and mutex extension instances, NB and
NMx, can be independently set at design time to, e.g., support
every team-building variant. As every core can only wait at one
barrier or try-lock one mutex at a time, the corresponding core-
specific events of all instances are combined into a single event
per extension type and core.
4.4 SCU Integration
Being a part of the peripheral subsystem that is explained in
Sec. 4.1, the SCU is connected as an additional shared, memory-
mapped peripheral to the corresponding LINT. However, this
single-port solution has the major drawbacks of non-deterministic
core-to-SCU access latency and sequentialized accesses whenever
more than one core wants to access the SCU in a given cycle.
Since the limitations (in terms of performance, energy-efficiency,
and scalability) of synchronization primitives that are realized with
classic atomic memory access largely result from sequential access
to shared variables, parallel access to the SCU base units and
extensions responsible for core-to-core signaling is paramount.
We, therefore, use additional, private one-to-one buses be-
tween each core and its corresponding SCU base unit, shown in
orange in Fig. 1, and Fig. 2. A demux at the data port of each
core selects between the L1 TCDM and peripheral LINTs and
the private SCU link. The one-to-one correspondence between
the cores and SCU base units allows aliasing their address space,
thereby simplifying synchronization primitives by removing
core-id-dependent address calculations. Like the paths through the LINTs,
the private core-SCU links are purely combinational and therefore
allow for single-cycle access. As we demonstrate in Sec. 6.3, the
fully-parallel access to the SCU can even result in constant cycle
cost for certain synchronization primitives, independently from
the number of involved cores – a very favorable scaling property
compared to classical atomic-memory based approaches.
[Figure 3: two plots over the number of cores NC (2, 4, 8, 16). a) Unit area in kGE on a logarithmic axis for the base unit, barrier unit, external event FIFO, and mutex unit. b) Total SCU area in kGE, broken down into base units, barrier units, external event FIFO, mutex unit, and interconnect.]
Fig. 3. Scaling of the circuit area (in gate-equivalents (GE)) for the base
unit and the available extensions (a) and for the overall SCU (b).
In order to retain a global address space (for, e.g., debugging
purposes), all base units are as well accessible from every core
and from outside the cluster through the peripheral LINT. All
power-managing functionality of the SCU base units (resulting in
a core idle-waiting for an event) is not implemented for this access
method as it would disturb the inter-core control flow.
4.5 SCU Scalability
Fig. 3 shows both the total SCU area as well as the area of the
individual sub-units and extensions in relation to the number of
cores NC. For the total area, a typical configuration with the
number of barrier extensions NB =NC/2, and the number of mutex
extensions NMx = 1, is shown. The plots show post-synthesis num-
bers; we used the same 22 nm CMOS process as for the multicore
cluster, which hosts the SCU. Design synthesis was done in the
slow-slow process corner, at 0.72 V supply voltage, a temperature
of 125 °C, and with a 500 MHz timing constraint.¹ We restrict NC
to a maximum of 16, matching the typical scalability limits of the
targeted cluster-based architecture. An analysis of the slopes in the
double-logarithmic sub-unit area plot in Fig. 3a) reveals a mildly
super-linear scaling for the barrier extensions and sub-linear or
constant scaling for the others. Overall SCU area favorably scales
sub-linearly up to the typically used configuration of NC = 8 and
mildly super-linearly if NC is further increased to the maximum
configuration. The area contribution of the SCU base units and
barrier extensions dominates in all configurations; however, the
share of the SCU-internal interconnect logic that routes all
NC+1 slave ports to the respectively connected sub-units also
becomes significant for the two largest configurations.
5 SINGLE-INSTRUCTION SYNCHRONIZATION
With our goal of aggressively reducing synchronization overhead
in mind, we are proposing a scheme that allows handling common
synchronization tasks with the execution of a single instruction in
each involved RISC-V core. To achieve this, we extensively lever-
age the dedicated link between each core and the corresponding
SCU base unit, the associated aliased address space of 1 Kibit, and
the possibility to stall a core by not granting accesses made over
the private links.
1. Even though our multicore clusters are usually constrained to slower clock
frequencies (see Sec. 6.2), we chose this constraint to also verify the suitability
of the SCU for systems that target slightly higher clock speeds.
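[Figure 4: timing diagram around an elw instruction (program counter in the instruction decode stage at -x, -1, elw, +1, +2), showing core clock and core busy, the private core-SCU link signals request, grant, resp. valid, and resp. data (carrying the event buffer content, ebuf), and the SCU base unit signals event in, event buffer, and ext. trigger.]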
Fig. 4. Interfacing of the SCU with a RISC-V core and corresponding
timing. Shaded intervals correspond to transitions to (left) and from
(right) sleep state, respectively.
A fundamental aspect of our proposed solution is that whether
a core can continue at a synchronization point is always signaled
through events that are generated inside the SCU by one of the
extensions. Each involved core idle-waits for the appropriate event
to occur; the corresponding event line has to be activated in the
event mask. Waiting is universally initiated by executing the elw
instruction that we added to the extensible RISC-V instruction set
architecture (ISA). The mnemonic stands for event-load-word; the
instruction is identical to the regular load-word (lw) of the base
ISA with the exception of an altered opcode such that the core
controller can distinguish them. Whenever a core executes elw
with an address that requests waiting for an event, the SCU will
block the resulting transaction on the private link by not asserting
the grant signal (given that no events are currently registered in the
event buffer). This process is depicted in the left shaded part of
Fig. 4, which shows the details of a private core-SCU link and the
most important signals of the corresponding SCU base unit. Due to
the in-order nature of the employed cores, the stall at the data port
propagates through the core pipeline. The elw opcode causes the
core controller to release the busy signal once any prior
multi-cycle instructions have been executed; subsequently, the SCU
power-manages the requesting core by lowering its clock-enable
signal. Depending on the address of the discussed read transaction,
an extension in the SCU gets simultaneously triggered (e.g., try-
lock a mutex, set the status bit in a barrier, send a notifier event).
The extension triggering is controlled by the FSM in the SCU
base unit to ensure that per elw-transaction triggering happens
only once.
The right shaded part of Fig. 4 shows the process of a core
waking up and continuing execution, initiated by an incoming
event. The event is present in the event buffer in the following
cycle, causing the SCU both to re-enable the core clock and assert
the grant for the still-pending read request. Another cycle later, the
response channel of the private link is used to deliver additional
information to the requesting core: Often, the content of the event
buffer is sent such that in the case of multiple activated event
lines, the core can immediately evaluate the reason for the return
from sleep. More interestingly, however, the response channel can
also be used to pass extension-specific data. In the example of
the mutex extension, it allows the unlocking core, done with a
write transaction, to intrinsically pass a 32-bit message to the core
that locks the mutex next. Once the response data is consumed,
the event buffer can – again controlled through the address of the
elw – automatically be cleared, freeing cores from yet another
common task, especially for the usual case of waiting for a single
event line only.
Fig. 4 shows the process of entering and leaving wait state
with an address that results in both triggering an extension and
automatically clearing the buffer, additionally highlighting the
small number of only six cycles of active core clock for handling
a synchronization point (excluding the possibly required address
calculation for elw). For cases where an active event occurs
before or during a wait request (e.g., when the last core arrives at
a barrier), the grant is immediately given, and no power-managing
is done to not waste any cycles. The required changes in the
core to support the described, powerful mechanism are limited
to decoding the elw instruction to release the busy signal, which
would otherwise remain asserted on a pipeline stall due to the
pending load at the data port.
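The semantics of a single elw transaction, as seen from the SCU base unit, can be summarized in a short functional model. Everything about the interface is an assumption made for illustration: `incoming` supplies one word of newly asserted event lines per model step, `trigger` stands for the extension that the address selects, the response is taken to be the masked event buffer, and auto-clear is assumed to clear exactly the returned bits. The step count is not a cycle-accurate figure.

```python
def elw(event_buffer, event_mask, incoming, trigger=None, auto_clear=True):
    """Model one elw transaction.

    Returns (response, new_event_buffer, slept_steps).
    """
    if trigger is not None:
        trigger()  # the extension is triggered exactly once per transaction
    slept = 0
    events = iter(incoming)
    # No grant while no masked event is buffered: the core stays clock-gated.
    while (event_buffer & event_mask) == 0:
        try:
            event_buffer |= next(events)
        except StopIteration:
            raise RuntimeError("no masked event arrived") from None
        slept += 1
    response = event_buffer & event_mask  # delivered on the response channel
    if auto_clear:
        event_buffer &= ~response
    return response, event_buffer, slept
```

If a masked event is already buffered when elw is issued, the loop body never runs and the transaction is granted without any sleep step, matching the case of the last core arriving at a barrier.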
5.1 Fused Interrupt Handling
The targeted type of clusters is primarily meant for executing
kernels with a regular program flow, the synchronization of which
can be purely handled with events and idle waiting. Still, interrupts
are often required to, e.g., handle data exceptions or react to
other spontaneous, irregular, but important events that require an
immediate change of program flow. A dedicated FSM state and
an additional mask register in each SCU base unit are employed
to fulfill said requirement; the event buffer is shared between both
masks for increased area and energy efficiency. The few cases
where a core needs to be sensitive to the same event source both as
an interrupt and event trigger can be handled with a combination
of an interrupt handler and a self-triggering notifier event. Two
dedicated request/identifier pairs connect each core and the
corresponding SCU base unit for both requesting and clearing interrupts,
respectively. The SCU arbitrates one of the pending interrupts to
the core, which, in turn, acknowledges the processing of interrupt
identifiers upon entering the respective handler. Similar to the
auto-clearing capability when waking up through events, the bit
corresponding to the called interrupt handler is cleared in the event
buffer to reduce management overhead in the handlers.
Should an active event occur during an interrupt handler, regu-
lar program flow is immediately continued after its termination.
In the other, usual case, the FSM transitions to sleep again and
awaits further incoming events and interrupts. After termination
of the interrupt handler, the elw instruction responsible for the
original wait state is re-executed, allowing the SCU to detect said
termination and power-manage the core again. In such cases, the
FSM takes care of inhibiting erroneous extension re-triggering
upon the repeated sleep request after interrupt handling.
6 EXPERIMENTAL RESULTS
To demonstrate the effectiveness of the proposed event-based,
hardware-supported synchronization concept, we present two
types of experimental results in this section. We first show the
theoretically achievable improvements through synthetic bench-
marks where all experiment parameters can be controlled. We
subsequently analyze the performance and energy-efficiency
improvements that are observed when executing a range of
applications that the targeted class of deeply embedded CMPs is
typically used for. As the leading principle and motivation for
this work is to reduce the energy that the cluster consumes for
a given workload or task, we report not only the total cluster
energy but also power and execution time in all cases to provide
insight into how the energy reduction is achieved. We additionally
provide power breakdowns into the main contributors to highlight
the importance of fine-grain power management in the form of
clock gating. Finally, an analysis of both the amount of total and
active cycles spent on synchronization shows how the proposed
solution drastically reduces synchronization-related overhead.
6.1 Baseline
As a baseline, we use purely software-based implementations
of synchronization primitives that employ spin-locks on TAS-protected
variables in the L1 TCDM with the help of the TAS
feature of the logarithmic interconnect. Most modern
multiprocessor systems, however, feature hardware support for idle
waiting, which is commonly used to avoid active and continuous
polling of synchronization variables. This removes a very
significant amount of core activity and memory accesses and,
therefore, wasted energy and memory bandwidth.
To take account of this, we include a second TAS-based
solution in our comparisons where cores that do not succeed in
acquiring a synchronization variable (e.g., to update the barrier
status stored in the variable) are put to sleep with the help of
the SCU (by idle waiting on an event as described in Sec. 5).
Whenever the current owner updates or releases the variable, it
also uses an SCU notifier broadcast event to wake up all the
remaining, sleeping cores, which will then again try to acquire
the synchronization variable. While the described synchronization
mechanism can also be realized with similar solutions for idle
waiting and notifying cores that may be found in comparable
systems, it – in our case – already benefits from the low latencies
for notifiers and idle state handling that are enabled through the
SCU. In the following, the purely spin-lock based implementations
of synchronization primitives will be referred to as SW and the
idle-waiting extended versions as TAS.
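The structure of the TAS variant can be illustrated with a thread-based analogy. In this sketch, Python threads stand in for cores, a flag with test-and-set semantics stands in for the synchronization variable, and a condition variable stands in for idle waiting plus the notifier broadcast. The class name and the convention that 0 means "free" are assumptions; the code mirrors the control flow only and says nothing about cycle counts.

```python
import threading


class IdleWaitLock:
    """Lock that sleeps on a failed test-and-set instead of spinning."""

    def __init__(self):
        self._flag = 0                           # 0: free, -1: taken
        self._guard = threading.Lock()           # models the atomicity of TAS
        self._wakeup = threading.Condition(self._guard)

    def _test_and_set(self):
        old = self._flag
        self._flag = -1
        return old

    def acquire(self):
        with self._wakeup:
            while self._test_and_set() != 0:
                # Acquisition failed: idle-wait until the owner broadcasts.
                self._wakeup.wait()

    def release(self):
        with self._wakeup:
            self._flag = 0
            self._wakeup.notify_all()            # wake all sleepers; they retry
```

Every release wakes all waiting threads although only one of them can succeed, which mirrors the retry behavior described above.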
6.2 Experimental Setup and Methodology
All experiments were carried out on an eight-core implementation
of the multicore cluster of Fig. 1. It features 64 kByte of L1 TCDM
and 8 kByte of shared instruction cache; the SCU contains
four barrier extensions and one mutex extension. For all cycle-based results,
an RTL description of the cluster and cycle-exact simulations
were used. In order to obtain more detailed insight than total
execution time only, a range of non-synthesizable observation
tools both in the RTL description of the RISC-V cores as well
as in the testbench is employed. Per-core performance counters
record the number of executed instructions, stalls at both data and
instruction ports, and the like; an instruction tracer allows detailed
analysis of the executed applications and benchmarks. Overall
execution time is measured with the help of a cluster-global
timer that is part of the cluster peripherals and also present in
the fabricated application-specific integrated circuit (ASIC). The
timer is activated only during periods where the actual benchmark
(synthetic or application) is executed to exclude, e.g., initialization
and boot periods. The enable signal of the timer is monitored in the
testbench and the timestamps of rising and falling edges recorded,
allowing us to identify the relevant portions of the trace files.
Physical implementation
To obtain power (and ultimately energy) results, the RTL model of
the cluster was synthesized in a 22 nm CMOS process and a placed
and routed physical implementation created; both steps were done
with a 350 MHz timing constraint and in the slow-slow process
corner at 0.72 V supply voltage and a temperature of 125 °C².
The resulting fabrication-ready and functionally verified module³
measures 0.8 mm×1.4 mm with pre-placed SRAM macros for the
L1 TCDM; the SCU accounts for less than 2% of the total circuit
area. Both the synthetic benchmarks and the applications were run
on the resulting gate-level netlist and activity files recorded during
the benchmarking periods for every electrical net in the cluster.
Again, the enable signal of the cluster-global timer is used to
start and stop the activity recording. The subsequent hierarchical
power analysis was done in the typical-typical process corner at
0.8 V supply voltage and 25 °C, allowing us to report not only
the total power and energy but also the respective breakdowns to
analyze the contributions of the individual cluster building blocks.
The reported power and energy results correspond to a cluster
operating frequency of 350 MHz.
6.3 Synthetic Benchmarks
We start our analysis by quantifying the cost in terms of cycles
and energy for executing barriers and critical sections, the two
synchronization primitives that are most commonly used in the
targeted parallel programming models. We compare the hardware
variants featured in the SCU with the purely spin-lock based as
well as with the idle-wait extended baseline variants as described
in Sec. 6.1. To highlight the favorable scaling behavior (with
respect to the number of participating cores) of the SCU, we
provide the quantification for two, four, and eight cores even
though the cluster is mainly designed for execution on all eight
cores. To remove edge effects such as instruction cache misses
from the results, the involved cores execute a loop that contains
the respective primitive 32 times; the loop is run eight times, and
the resulting cycle counts are averaged. A typical use case for
critical sections in the targeted type of system is placement at the
end of an SFR, where all worker cores perform small control tasks
such as updating a shared variable. Consequently, the critical section
is usually very short (up to ten cycles), a circumstance that we
account for in our experiments.
The synthetic benchmarks were compiled with an extended
version of the riscv-gcc 7.1.1 toolchain that supports the elw
instruction, using the -O3 flag. To compute not only the absolute
cost figures that are reported in Tbl. 1 but also the relative energy
overhead, we additionally measure the power during the execution
of 512 nop instructions on a varying number of cores. While the
choice of the nop instruction may intuitively not be a suitable
representation of actual processing loads, the resulting relation
between SFR size and overhead still is a very reasonable estimate
for the behavior that results when executing actual applications, as
our analysis in Sec. 6.4 shows.
Barriers
When considering the pure primitive cost, the SCU barrier requires
between 7.8× (2 cores, SW) and 29× (8 cores, SW and TAS)
fewer cycles, as can be seen in Tbl. 1. The gap widens when
considering energy where the reduction ranges between 10× (2
cores, SW and TAS) and 38× (8 cores, TAS) or 41× (8 cores,
SW), respectively. With higher core counts, the TAS version
shows slightly lower energy compared to the SW version thanks
to the idle-wait behavior. For the SCU variant, not surprisingly,
the fully parallel access to the barrier extension makes the cycle
cost independent from the number of cores and incurs very little
additional energy when increasing the number of participants. As
a result, the SCU-supported barrier is especially favorable when a
task is parallelized on all eight cores, which is the intended way
of using the cluster.
Footnote 2: Timing was verified with all permutations of the slow-slow/fast-fast
process corners, 0.72 V/0.88 V supply voltage, and temperatures of -40 °C/125 °C.
Footnote 3: Multiple ASICs containing very similar clusters are silicon proven in
various technology nodes [38], [49], [50].
Fig. 5a) and d) illustrate the raw barrier cost in relation to a
preceding SFR of varying size by showing the relative overhead
for executing the barrier in terms of cycles and energy. While
significant overhead reductions can be observed with SFRs of up
to around 1000 cycles and eight active cores, the graphs reveal
another even more important characteristic of the SCU barrier:
With a typical constraint of allowing up to 10% of synchronization
overhead, the SCU drastically reduces the smallest allowable
SFR. The cycle-related relative minimum SFR reductions are
(mathematically) identical to those for the raw primitive-cost;
the energy-related reductions show only insignificant differences
compared to the corresponding raw cost ratios. Besides the relative
overhead reductions, the absolute size of the smallest allowable
SFR is important: with the SCU barrier, it is below 100 cycles for
both cycle and energy overhead and for all core counts, and
therefore meets the requirement for fine-grain synchronization
stated in Sec. 2. This is in stark contrast to the overheads resulting
from the TAS and SW variants, where both cycle- and energy-based
SFR sizes must be at least several hundred cycles when
considering two or four participating cores. The energy-related
minimum SFR with all eight cores participating, representing the
most important case, is 1622 cycles (TAS) and 1771 cycles
(SW), roughly 40× larger than the corresponding SFR
size when employing the SCU barrier (42 cycles); this poses a
strong limitation on the range of applications that can be efficiently
parallelized on the targeted architecture.
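The link between raw primitive cost and the smallest allowable SFR can be sketched as follows. This is a minimal illustration, assuming that the relative overhead is defined as cost / (SFR + cost); this definition and the function names are assumptions, not taken from the text. The sketch uses the cycle costs from Tbl. 1, so the values it yields are cycle-based and are not the energy-based figures quoted above.

```python
def relative_overhead(sfr_cycles, primitive_cycles):
    """Share of synchronization cycles in one SFR-plus-primitive period.

    Assumed definition: cost / (SFR + cost).
    """
    return primitive_cycles / (sfr_cycles + primitive_cycles)


def min_sfr(primitive_cycles, max_overhead=0.10):
    """Smallest SFR size (cycles) that keeps the overhead at or below max_overhead."""
    return primitive_cycles * (1.0 - max_overhead) / max_overhead


# Barrier cycle costs for eight cores, taken from Tbl. 1.
barrier_cycles = {"SCU": 6, "TAS": 176, "SW": 176}

for name, cost in barrier_cycles.items():
    print(name, round(min_sfr(cost)))

# Under this definition the minimum SFR is proportional to the cost, so the
# ratio of minimum SFR sizes equals the ratio of the raw primitive costs.
ratio = min_sfr(barrier_cycles["SW"]) / min_sfr(barrier_cycles["SCU"])
print(round(ratio, 1))
```

Because the minimum SFR is a fixed multiple of the primitive cost for a given overhead budget, any reduction in primitive cost carries over unchanged to the minimum SFR size.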
Critical Sections
Compared to barriers, the critical (or mutually exclusive) section
synchronization primitive can be implemented more easily with
basic atomic memory access. The ability to enter the critical
section can be managed with a single TAS-protected variable
that needs to be tested upon entering and written with the test
value by the owning core upon exiting. For the TAS-variant of
this primitive, we link each access to the synchronization variable
to the usage of a notifier event to avoid constant testing of the
variable by all cores that are waiting to enter the critical section:
TABLE 1
Cost of synchronization primitives in terms of cycles and energy.

                        cycles             energy [nJ]
NC (core count)        2     4     8      2     4     8
Barrier
  SCU                  6     6     6      0.1   0.1   0.1
  TAS                  52    91    176    0.8   1.7   4.3
  SW                   47    87    176    0.8   1.8   4.7
5-cycle crit. sect.
  SCU                  12    23    44     0.2   0.3   0.6
  TAS                  25    39    69     0.4   0.7   1.6
  SW                   12    25    72     0.2   0.5   1.6
10-cycle crit. sect.
  SCU                  13    24    50     0.2   0.3   0.7
  TAS                  26    50    89     0.4   0.9   2.1
  SW                   13    26    55     0.2   0.6   1.5
[Figure 5, panels a) to f): relative cycle overhead (top row) and relative energy overhead (bottom row) vs. SFR size [cycles] for the barrier (a, d), the 5-cycle critical section (b, e), and the 10-cycle critical section (c, f); curves for the SW, TAS, and SCU variants at 2, 4, and 8 cores; the 10% overhead level is marked. Panel a) carries application markers for Livermore6, Dijkstra, PCA, FANN-A, MFCC, Livermore2, FFT, DWT, and AES.]
Fig. 5. Relative overhead in terms of cycles (a)-c), top) and energy (d)-f), bottom) vs. SFR size for the three realizations of barriers (a), d), left) and
critical sections with two lengths (b), c), e), f), right). The markers in a) indicate the relative share of active synchronization cycles for the range
of DSP applications discussed in Sec. 6.4. For critical sections, the graph lines corresponding to four cores have been omitted to improve graph
readability. The scaling behavior in terms of overhead vs. core count is strictly monotonic (see raw costs in Tbl. 1).
Any core that fails to enter will idle-wait for said event. The core
that currently executes the critical section triggers the event upon
exiting, causing all queued cores to quickly wake up and re-test
the TAS-variable, with all but the elected one immediately going
back to sleep afterward.
In the SCU-based implementation, we simply execute elw
with an address mapped to the mutex extension, which elects one
core for which continuation is enabled through the generation of
a core-specific event. All others idle-wait at the mutex load until
they are elected. Similar to the variants based on TAS-variables,
a write to the mutex by the previously elected core upon leaving
the critical section unlocks the mutex and triggers the election of
the next core to enter the section alongside the appropriate event
generation. The distinction between locking and unlocking the
mutex is made by the access type (read/write), which allows the same
address to be used for both operations and further reduces the software
overhead for the synchronization primitive.
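The read-to-lock, write-to-unlock protocol described above can be sketched as a small behavioral model. This is an illustration only: the class and method names are hypothetical, and the first-come-first-served election order is an assumption, since the text does not state the election policy of the mutex extension.

```python
from collections import deque


class MutexExtensionModel:
    """Behavioral sketch: a read requests the lock, a write releases it."""

    def __init__(self):
        self.owner = None        # core currently inside the critical section
        self.waiting = deque()   # cores idle-waiting at their elw load

    def read(self, core_id):
        """Model of elw on the mutex address; True means the core may continue."""
        if self.owner is None:
            self.owner = core_id          # elected at once, event generated
            return True
        self.waiting.append(core_id)      # core sleeps until it is elected
        return False

    def write(self, core_id):
        """Model of the store to the same address; returns the next elected core."""
        if core_id != self.owner:
            raise ValueError("only the owning core may unlock")
        # Assumed policy: elect the longest-waiting core, if there is one.
        self.owner = self.waiting.popleft() if self.waiting else None
        return self.owner


mutex = MutexExtensionModel()
entry_order = []
for core in range(4):                     # four cores request the lock
    if mutex.read(core):
        entry_order.append(core)
while mutex.owner is not None:            # each owner leaves in turn
    elected = mutex.write(mutex.owner)
    if elected is not None:
        entry_order.append(elected)
print(entry_order)
```

In this model, a waiting core never re-tests a shared variable: it is woken exactly once, when it is elected, which mirrors the single-instruction entry described in the text.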
As with barriers, we provide both the raw primitive cost (Tbl. 1)
and the relative overheads in terms of performance and energy
(Fig. 5), each for two different critical section sizes. The latter is
necessary since the wait behavior of cores that have yet to enter
the critical section differs greatly between the implementations:
For the SW variant, waiting cores test the synchronization
variable not only when another core exits the critical section but
constantly. Consequently, the duration of the critical section has an
impact on the resulting parasitic energy. We calculate the overhead
as the difference between the measured cycle count and energy and
the ideal ones. The ideal number of cycles is $T_\mathrm{ideal} = N_C T_\mathrm{crit}$ and
the ideal energy is $E_\mathrm{ideal} = T_\mathrm{ideal} P_\mathrm{comp,1}$, with $T_\mathrm{crit}$ denoting the length
of the critical section, $P_\mathrm{comp,1}$ the single-core cluster power, and $N_C$
the number of cores that need to execute the critical section.
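The overhead calculation can be written out as a short sketch. The function name and all numeric inputs below are hypothetical and serve only to illustrate the arithmetic; the single-core power is expressed as energy per cycle so that the units stay consistent.

```python
def critical_section_overhead(n_cores, t_crit, p_comp_1,
                              measured_cycles, measured_energy):
    """Return (cycle overhead, energy overhead) relative to the ideal case.

    t_crit is given in cycles, p_comp_1 in energy per cycle.
    """
    t_ideal = n_cores * t_crit       # T_ideal = N_C * T_crit
    e_ideal = t_ideal * p_comp_1     # E_ideal = T_ideal * P_comp,1
    return measured_cycles - t_ideal, measured_energy - e_ideal


# Hypothetical example: eight cores, 5-cycle section, 0.01 nJ per cycle,
# 84 measured cycles, and 1.0 nJ measured energy.
cycle_overhead, energy_overhead = critical_section_overhead(
    n_cores=8, t_crit=5, p_comp_1=0.01,
    measured_cycles=84, measured_energy=1.0)
print(cycle_overhead, round(energy_overhead, 3))
```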
In relation to the barrier results, the differences between the
SCU and the TAS-based variants are considerably smaller: As can
be seen in the right half of Fig. 5, for two cores the minimum SFR
size for 10% overhead is at most reduced by 2.5× from 232 cycles
to 91 cycles when comparing the energy overhead of the TAS
and SCU variants. The differences in the relative cycle overhead
are smaller or even non-existent. The picture changes, however,
when considering eight participating cores: While the cycle-
related differences remain small, the energy-related gap widens.
The smallest SFR for 10% relative energy overhead is reduced by
at least 2.3× (10-cycle critical section, SW to SCU) and up to
3.3× (10-cycle critical section, TAS to SCU). Still, compared to
the barrier, the savings achievable with the SCU mutex extension
are one order of magnitude lower. The reason for this behavior
is twofold: First, a mutex is a much simpler synchronization
primitive than a barrier, and second, a TAS-protected variable
inherently allows for very efficient implementations. Still, the
avoidance of any TCDM accesses when using the SCU results
in consistently lower power and energy for all core counts and
critical section lengths.
Counterintuitively, the TAS variant performs worse than the
straightforward SW version, in terms of both cycles and energy,
for almost all core counts and critical section lengths (Tbl. 1 shows
near parity only for the 5-cycle section on eight cores). This circumstance
can be explained by analyzing the software footprint of each
implementation variant: With fully inlined functions for entering
and leaving critical sections, leaving always requires the execution
of a single instruction for the SW and SCU variants and two
for the TAS variant. For entering, however, only the SCU variant
guarantees a single instruction for all cores. The naive SW variant
requires two instructions per locking attempt; the TAS variant can
match this count only if the first attempt is successful. For each
additional attempt, five instructions need to be executed to handle
the idle-wait functionality. In summary, the TAS variant can
reduce the number of lock attempts; however, each attempt is
more expensive. For cases where re-election takes place after
TABLE 2
Main properties and results for the range of benchmarked DSP applications. Active cycles reflect core-active cycles, i.e., cycles where the core
clock is active, averaged over all eight cores.

Name        Domain              Barrier      SFR size  Energy  Execution cycles         Synchronization cycles            IPC
                                count  type  [cycles]  [µJ]    total   active (stddev)  total            active
DWT         Signal processing   10     SCU   1.1k      0.7     11.3k   10.8k (155)      0.6k (5.2%)      84 (0.8%)        5.01
                                       TAS   1.1k      0.8     12.9k   12.7k (63)       1.5k (11.7%)     1.3k (10.0%)     4.65
                                       SW    1.1k      0.8     12.9k   12.9k (0)        1.6k (12.6%)     1.6k (12.7%)     4.56
Dijkstra    Graph search        238    SCU   122       2.0     33.7k   30.6k (2.9k)     4.6k (13.7%)     1.53k (5.0%)     4.72
                                       TAS   156       4.0     71.3k   69.1k (0.7k)     34.1k (47.9%)    32.0k (46.3%)    4.48
                                       SW    130       4.0     64.9k   64.9k (0)        34.0k (52.3%)    34.0k (52.3%)    4.09
AES         Cryptography        4      SCU   10.2k     2.8     41.2k   40.9k (188)      339 (0.8%)       34 (0.1%)        5.84
                                       TAS   10.2k     2.8     41.6k   41.5k (123)      732 (1.8%)       547 (1.3%)       5.82
                                       SW    10.2k     2.9     41.6k   41.6k (0)        719 (1.7%)       719 (1.7%)       5.80
Livermore6  Linear recurrence   127    SCU   104       1.1     24.5k   14.0k (6.8k)     11.3k (46.1%)    760 (7.7%)       6.00
                                       TAS   104       1.7     32.3k   28.1k (3.4k)     19.1k (59.0%)    14.9k (55.0%)    5.25
                                       SW    105       2.1     32.8k   32.8k (0)        19.6k (59.5%)    19.6k (59.5%)    4.74
Livermore2  Gradient descent    12     SCU   744       0.6     9.2k    9.0k (46)        0.3k (2.8%)      71 (0.8%)        6.67
                                       TAS   789       0.7     11.3k   11.2k (17)       1.8k (16.1%)     1.7k (15.4%)     5.94
                                       SW    788       0.8     11.3k   11.3k (0)        1.8k (16.1%)     1.8k (16.1%)     5.84
FFT         Frequency analysis  4      SCU   1.5k      0.5     6.1k    6.0k (73)        203 (3.3%)       39 (0.7%)        5.53
                                       TAS   1.4k      0.5     6.4k    6.3k (23)        606 (9.5%)       540 (8.6%)       5.40
                                       SW    1.4k      0.5     6.4k    6.4k (0)         670 (10.5%)      670 (10.5%)      5.34
FANN-A      Machine learning    160    SCU   519       6.9     92.4k   84.0k (2.2k)     9.3k (10.1%)     982 (1.2%)       6.72
                                       TAS   483       7.7     103.0k  100.3k (0.9k)    25.7k (25.0%)    23.0k (23.0%)    6.48
                                       SW    482       7.9     103.8k  103.8k (0)       26.7k (25.8%)    26.7k (25.8%)    6.24
MFCC        Audio processing    693    SCU   718       36.1    0.53M   0.50M (14.8k)    33.1k (6.2%)     4.64k (0.9%)     6.84
                                       TAS   714       41.5    0.64M   0.60M (10.5k)    142.3k (22.3%)   106.7k (17.8%)   6.28
                                       SW    709       43.5    0.63M   0.63M (0)        142.3k (22.4%)   142.3k (22.4%)   6.05
PCA         Data analysis       2305   SCU   375       75.0    2.48M   0.88M (0.6M)     1.62M (65.2%)    20.55k (2.9%)    4.47
                                       TAS   388       89.6    2.66M   1.20M (0.6M)     1.76M (66.3%)    0.30M (29.6%)    4.08
                                       SW    381       148.3   2.73M   2.73M (0)        1.85M (67.8%)    1.85M (67.8%)    3.45
roughly ten cycles, this can thus lead to an overall increase in
both cycles and energy used for the primitive that outweighs the
energy saved by cores that sleep for very short periods only.
Hence, the critical section lengths used in our experiments are
simply too short for the TAS variant to show a benefit over the SW
one; without the SCU, programmers have to choose the optimal
implementation depending on Tcrit. Additionally, the usage of
nop instructions during the critical section hides a disadvantage
of the SW implementation that would show with real applications:
The repeated polling of the synchronization variable by all cores that
have yet to enter the critical section puts a significant load on both the
TCDM and the associated interconnect, which would slow down the
execution of any critical section that contains TCDM accesses.
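The instruction counts given above for entering and leaving a critical section can be collected in a small cost model. This is a sketch based only on the counts stated in the text; the function names and the example attempt counts are hypothetical.

```python
def lock_instructions(variant, attempts):
    """Instructions one core executes to enter the critical section."""
    if attempts < 1:
        raise ValueError("at least one attempt is needed to enter")
    if variant == "SCU":
        return 1                        # one elw, regardless of waiting time
    if variant == "SW":
        return 2 * attempts             # two instructions per locking attempt
    if variant == "TAS":
        return 2 + 5 * (attempts - 1)   # idle-wait handling on every retry
    raise ValueError("unknown variant: " + variant)


def unlock_instructions(variant):
    """Instructions one core executes to leave the critical section."""
    return 2 if variant == "TAS" else 1


def total_instructions(variant, attempts):
    return lock_instructions(variant, attempts) + unlock_instructions(variant)


# Hypothetical attempt counts: the TAS core needs fewer attempts than the
# polling SW core, yet still executes more instructions.
print(total_instructions("SW", 3))
print(total_instructions("TAS", 2))
print(total_instructions("SCU", 1))
```

With these counts, a TAS core that succeeds on its second attempt executes nine instructions, while an SW core that needs three attempts executes seven, which illustrates why short critical sections favor the SW variant.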
6.4 DSP Applications
After exploring the theoretically achievable improvements with
dummy code between synchronization points, we ran actual
DSP-centric applications on the multicore cluster, each with
the three different implementations of synchronization primitives.
These applications are, in turn, used in real-world
use cases such as [41], [42]. Compilation was done
in the same way as with the synthetic benchmarks from
Sec. 6.3; additionally, the combination of the GCC flags -flto
and -fno-tree-loop-distribute-patterns that yields
the best performance was determined and applied for each
application individually. To obtain accurate results, each
application was run seven times on the RTL model, where the first
two iterations are used to warm the instruction cache and are not
counted towards any results. All cycle-based results are calculated
from the averaged outputs of the observation tools over the last five
iterations. For calculating power results, the applications were run
in the same manner on the post-layout model with signal activity
being recorded during a cache-hot iteration. As with the synthetic
experiments, all results reflect running the cluster at 350 MHz.
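The averaging scheme (seven runs, the first two used for cache warm-up) amounts to the following sketch. The function name and the cycle counts are hypothetical and are not measurements from this work.

```python
def cache_hot_average(per_iteration_cycles, warmup=2):
    """Average the cycle counts of all iterations after the warm-up ones."""
    measured = per_iteration_cycles[warmup:]
    if not measured:
        raise ValueError("no iterations left after warm-up")
    return sum(measured) / len(measured)


# Hypothetical cycle counts for seven runs of one application; the first
# two runs are slower because the instruction cache is still cold.
runs = [7000, 6200, 6100, 6100, 6100, 6100, 6100]
print(cache_hot_average(runs))
```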
A short description of each application and its synchronization
behavior is as follows: DWT: 512-element 1D Haar real-valued
32-bit fixed point discrete wavelet transform (DWT); one barrier
after the initial variable and pointer setup phase and after each
DWT step. Dijkstra: Dijkstra’s minimum distance algorithm for
a graph with 121 nodes; for each node, the minimum distance to
node zero is calculated. Two barriers per node that ensure each
core is done with its part of the graph before deciding on the
minimum distance for each node. AES: One round of encryption
and one round of decryption of 1 kByte of data using the Advanced
Encryption Standard (AES) in counter mode. Barriers are only
used before and after the two phases of the algorithm, as it can be
fully vectorized. Livermore6: General linear recurrence equation
from the Livermore Loops [32]; the transformed, parallelizable
version of the algorithm proposed in [14] was used with a 128-
bit single-precision input vector. A barrier must be passed on
each iteration of the outer loop as there are data dependencies
between the iterations. Livermore2: Excerpt from an incomplete
Cholesky-Conjugate gradient descent that processes an 8 kByte
single-precision input vector. The algorithm reduces the part of the
[Figure 6, panels a) and b): bar charts of the normalized performance improvement (a) and the normalized energy efficiency improvement (b) w.r.t. the SW barrier, y-axis from -20% to 100%, for DWT, Dijkstra, AES, Livermore6, Livermore2, FFT, FANN-A, MFCC, and PCA; series: SCU and TAS.]
Fig. 6. Normalized performance (a) and energy improvements (b) for
the range of DSP-applications and the SCU and TAS barrier implemen-
tations relative to the SW baseline.
vector that is processed in each iteration by a factor of two and,
therefore, only requires 12 outer loop iterations after each of which
a barrier is required. FFT: 512-point complex-valued single-
precision radix-8 fast Fourier transform (FFT) with precomputed
twiddle factors. Barriers are only required between each radix-8
butterfly step (two with the input size at hand) and at the end of the
algorithm to arrange the output values in the correct order. FANN-
A: Hand gesture recognition from [51], based on a 32-bit fixed-
point fully-connected fast artificial neural network (FANN) with
five layers, 691 neurons, and over 0.4 MByte of weights. Barriers
are required both after processing each layer (outer loop) as well
as after each fully-parallel inner loop iteration in which each core
calculates a neuron value. The barrier at the inner loop is required
to manage the loading of the currently required weight values
into the TCDM by the DMA in the background as the TCDM
is far too small to fit all values at once. MFCC: Calculation of
the Mel-frequency cepstrum (MFC) (inverse FFT of the logarithm
of the power spectrum) of a 20,000-element 16-bit fixed-point
vector. An outer loop runs over frames of four bytes with a barrier
after each iteration. For each frame, nine processing steps with a
barrier in between each are carried out. With the exception of the
forward FFT to compute the power spectrum, all processing steps
are fully vectorized and free from synchronization points. PCA:
32-bit fixed-point principal component analysis (PCA) based on
Householder rotations on a dataset composed of 23 channels and
256 observations; the algorithm is distributed over five processing
steps (data normalization, Householder reduction to bidiagonal
form, accumulation of the right-hand transformation, diagonal-
ization, final computation of principal components) with a barrier
in between each. Four of the processing steps contain numerous
barriers due to data dependencies and short sequential sections for
combining intermediate results from preceding parallel sections;
the diagonalization part of the algorithm is largely sequential.
The applications were selected with a focus on covering both
a wide range of domains and the relevant parameter space: As
Tbl. 2 shows, barrier count, the number of total cycles as well
as total energy all range over four orders of magnitude. The
range of average SFR sizes is bounded from below at around
100 cycles (Dijkstra), a size for which Fig. 5a) and d) show that
synchronization overheads achieved with the SCU barrier are still
well below the acceptable margin of 10%. At the upper end of
the spectrum, applications with SFR sizes of one thousand cycles
and more are also included (AES, FFT), representing the range
of SFR sizes where Fig. 5a) and d) indicate only small overhead
reductions when comparing the SCU barrier to the TAS and SW
baselines.
Both discussed Livermore Loops were mainly chosen for
benchmarking to allow for quantitative comparison to the literature.
We could identify only two cases where systems from the literature
performed better; in both cases, these systems have at least twice the
core count of our cluster, and they perform better only in comparison to the
TAS or SW type primitives (which resulted in very similar
cycle counts for both Livermore loops): For Livermore6 with data
size 256, the 16-core CMP from [14] performs 6% better; for
Livermore2 with a 2048-element vector, the 128-core system of
[13] achieves 14% lower cycle count. For Livermore2 and all
other vector sizes used in [13], [14], we achieved performance
improvements between 7% and 38% with the TAS and SW
barrier variants and between 26% and 6.2× with the SCU barrier.
For Livermore6, the improvements in comparison to [14] range
between 17% and 55% with the TAS and SW barriers and between
13% and 4.9× when using SCU barriers.
In relation to the 7-core system used in [15], performance
for Livermore2 was improved by over 5× with TAS and SW
barriers and by over 8.9× with the SCU barrier. For Livermore6,
we observe an improvement of 2.7× with the TAS and SW type
primitive and 3× when using the SCU. For both benchmarks, [15]
uses a 1024-element vector.
Calculation of Synchronization Overhead
As the main goal of this work is to boost energy efficiency by dras-
tically reducing the synchronization-related overhead, we provide
for each application and synchronization primitive implementation
both the number of total and active cycles that cores use to
execute synchronization primitives. The cycle counts have been
determined with a profiling script that parses the trace files of
each core and application. For SCU-type synchronization primitives,
the detection is done by searching for the elw instruction
with a matching physical address. Any preceding instructions that
are used to calculate the address are also counted towards
the synchronization cycles. In the case of the TAS and SW
variants, two detection methods have to be used: By analyzing
the disassembly of each application, the address range(s) of
synchronization functions are extracted. If this step succeeds, the
traces can subsequently be scanned for time periods where
a core executes instructions within a relevant address range. For
many applications, however, this method fails due to the fact that
the compiler inlines synchronization functions. Consequently, the
inlined functions must be detected by matching the disassem-
bly against patterns that unambiguously identify synchronization
primitives. This method requires much more careful analysis as the
functions can be spread across multiple non-contiguous address
ranges with linking jump or branch instructions. Furthermore,
multiple entry- and exit points to and from the primitives may
exist. The output of the described analysis methods is, in any
case, a list of synchronization periods where each entry contains
a begin and end cycle number. Combining these timestamps with
the benchmarking intervals allows us to calculate both the total
and active number of synchronization cycles for each core and
benchmark iteration, the average of which is shown in Tbl. 2.
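The last step, combining the list of synchronization periods with the benchmarking intervals, can be sketched as an interval intersection. This is an illustration, not the profiling script used in this work: the function name and the example timestamps are hypothetical, periods are treated as half-open [begin, end) cycle ranges, and the periods within each list are assumed not to overlap one another.

```python
def overlap_cycles(periods_a, periods_b):
    """Total number of cycles covered by both lists of (begin, end) periods."""
    total = 0
    for a_begin, a_end in periods_a:
        for b_begin, b_end in periods_b:
            # Length of the intersection of the two ranges, or zero if disjoint.
            total += max(0, min(a_end, b_end) - max(a_begin, b_begin))
    return total


# Hypothetical data for one core: three synchronization periods and one
# benchmarking interval derived from the timer enable signal.
sync_periods = [(90, 120), (200, 260), (400, 450)]
benchmark_intervals = [(100, 420)]

total_sync = overlap_cycles(sync_periods, benchmark_intervals)
print(total_sync)
```

The same helper could be applied once more with a list of clock-active periods to obtain the active synchronization cycles; how the active periods are extracted from the traces is not covered by this sketch.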
[Figure 7: stacked bar chart of cluster power [mW] (axis from 0 to 25) for the SCU, TAS, and SW variants of DWT, Dijkstra, AES, Livermore6, Livermore2, FFT, FANN-A, MFCC, and PCA, broken down into cores, I-cache, TCDM, interconnect, peripherals, and other.]
Fig. 7. Total cluster power and breakdown into the main contributors.
It is important to note that the total number of synchronization
cycles naturally includes core wait periods that are mostly
caused by workload imbalance. Therefore, the number of active
synchronization cycles is a much better measure of the actual
synchronization overhead. For the SCU and TAS variants, which
feature idle-waiting, it is substantially lower than the total count
in applications that exhibit significant workload imbalance
(Livermore6, PCA).
Discussion of Results
Tbl. 2 lists the most important properties of each application
alongside the cycle-based and energy results. In order to provide
full insight into the components contributing to energy, Fig. 7
shows both total power and the corresponding breakdown into the
shares associated with the main cluster components. Finally, Fig. 6
highlights the normalized improvements in terms of cycles and
energy that we were able to achieve with each application when
employing the SCU- and TAS-based synchronization primitives in
relation to the SW baseline. Over the range of the benchmarked
applications, the SCU achieves relative performance improve-
ments between 1% and 92%, with an average of 23%. While the
lower and upper bounds for relative energy improvement are very
similar, amounting to 2% and 98%, respectively, power reductions
with the SCU favorably result in a greater average improvement
of 39%.
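The bounds quoted above can be reproduced from the entries of Tbl. 2. The sketch below assumes that the relative improvement is defined as baseline / value - 1 with the SW variant as the baseline; this definition is an assumption that happens to match the quoted figures, and the function name is hypothetical.

```python
def relative_improvement(baseline, value):
    """Improvement of `value` over `baseline` for a lower-is-better metric."""
    return baseline / value - 1.0


# Total execution cycles (SW, SCU) from Tbl. 2.
aes_perf = relative_improvement(41.6e3, 41.2e3)
dijkstra_perf = relative_improvement(64.9e3, 33.7e3)

# Energy in microjoules (SW, SCU) from Tbl. 2.
pca_energy = relative_improvement(148.3, 75.0)

print(f"AES performance:      {aes_perf:.1%}")
print(f"Dijkstra performance: {dijkstra_perf:.1%}")
print(f"PCA energy:           {pca_energy:.1%}")
```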
When relating the average SFR size from Tbl. 2 with the
normalized improvements from Fig. 6, one can see that the SFR
size is a strong indicator whether the type of synchronization
implementation influences overall performance and energy or
not. Consequently, the biggest improvements are achieved with
applications that exhibit SFRs of few hundreds of cycles (Dijkstra,
Livermore6, PCA) and the lowest with SFRs sizes of thousand
or several thousands of cycles (AES, FFT). It can be noticed
that the results of the synthetic benchmarks in Sec. 6.3, shown
in Fig. 5, can provide a rough estimate of the achievable savings:
The relative overhead of active synchronization cycles (averaged
over all cores) for each application and barrier variant is marked
in Fig. 5a. Instead of the number of total synchronization cycles,
the amount of active synchronization cycles is used since, in
the synthetic benchmarks, all cores arrive at almost the same
time at a barrier while in real applications, core-to-core workload
imbalances cause a much higher variation of the arrival instances.
For the SCU barrier, the overhead predicted by the synthetic
benchmarks closely matches the actual application-related one
for all applications; for the TAS and SW barriers, however, the
synthetic experiments mostly predict overheads that are signifi-
cantly too high. This can be explained with the already mentioned
workload imbalances; the spread-out arrival instances also reduce
core-concurrent access to the TAS-variable that protects the barrier
status from hazardous modification. As a consequence, fewer
cycles are wasted due to contention while accessing the said
variable. The fully-parallel access to the SCU barrier extension,
on the other hand, causes the barrier durations to be completely
independent of the distribution of the arrival instances, leading to
a greatly improved (cycle) overhead predictability.
An important observation is that, for
most applications, the SCU does not reduce power but either leaves
it almost unaffected (see DWT, AES, FANN-A) or even slightly
increases it compared to the TAS primitive variant, which
also features idle-waiting (see Dijkstra, Livermore2, FFT, MFCC).
As Fig. 7 shows, the increase in the latter case is due to higher
power consumption in the cores, which is a consequence of the
reduction of synchronization cycles and the relatively higher share of
(usually) energy-intensive processing cycles. There are, however,
two exceptions to this behavior, Livermore6 and PCA, where total
power is reduced by 15% and 38% when using the TAS barrier,
or, respectively, 29% and 44% with the SCU variant. In these
cases, application-inherent workload imbalances indicated through
the standard deviation of active execution cycles (over cores) in
Tbl. 2, result in large differences for the times that individual
cores wait at barriers. Avoiding active spinning on synchronization
variables during the resulting prolonged wait periods with both the
TAS and SCU barriers reduces the power of the involved components
(cores, interconnect, TCDM) and, since they consume
the lion's share of overall power, also that of the whole cluster very
significantly. This circumstance also shows when comparing the
normalized cycle and energy efficiency improvements in Fig. 6,
where the gains in energy efficiency are very similar to those
for performance except for the two applications in question; in
those cases, the discussed power reduction results in much greater
improvements for energy efficiency.
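The power reductions discussed above can be cross-checked against Tbl. 2 with a short calculation. The sketch assumes that the energy values in Tbl. 2 correspond to the total execution cycles at the 350 MHz operating frequency, so that average power is energy divided by runtime; the function name is hypothetical.

```python
F_CLK_HZ = 350e6  # cluster operating frequency used for all results


def average_power_mw(energy_uj, total_cycles):
    """Average power in mW from energy in microjoules and a cycle count."""
    runtime_s = total_cycles / F_CLK_HZ
    return energy_uj * 1e-6 / runtime_s * 1e3


# (energy [uJ], total execution cycles) for PCA, taken from Tbl. 2.
pca = {"SCU": (75.0, 2.48e6), "TAS": (89.6, 2.66e6), "SW": (148.3, 2.73e6)}

power = {name: average_power_mw(e, c) for name, (e, c) in pca.items()}
reduction_scu = 1.0 - power["SCU"] / power["SW"]
reduction_tas = 1.0 - power["TAS"] / power["SW"]

print(f"SCU vs. SW: {reduction_scu:.0%}")
print(f"TAS vs. SW: {reduction_tas:.0%}")
```

For PCA, this yields reductions of about 44% (SCU) and 38% (TAS) relative to the SW baseline, in line with the figures stated in the text.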
This work focuses on the optimization of the core-to-core
communication and synchronization on ultra-low-power clusters
of processors in the IoT domain, leveraging parallelism to im-
prove energy-efficiency of computations rather than performance
only. In different contexts, such as high-end devices, having
power/performance scalable systems, able to scale-up to 100s or
1000s of cores, is a desirable feature. However, state-of-the-art
parallel computing systems, such as GP-GPUs, feature a clear
trade-off between performance and efficiency, both from the point
of view of the parallelism available in embedded applications as
well as from a physical implementation perspective (as discussed
in Sec. 3). In the context of PULP-based systems, where energy-
efficiency cannot be traded off against performance, scalability is
still an open problem, and we plan to explore this scenario as
future work.
7 CONCLUSION
We proposed a light-weight hardware-supported synchronization
concept for embedded CMPs that aggressively reduces synchro-
nization overhead, both in terms of execution time and energy,
the latter being a crucial metric for most embedded systems. In
addition to showing energy cost reductions for synchronization
primitives of up to 38× and resulting minimum SFR sizes of as
little as tens of cycles, we demonstrated the importance of energy-
efficient synchronization on a range of typical applications that
covers four orders of magnitude of execution time and SFR size.
The proposed solution improves both performance and energy
efficiency in all cases and has a beneficial impact of up to 92% for
performance and 98% for energy efficiency for applications with
SFR sizes of around one hundred cycles. In the future, we plan
to explore hierarchical architectures composed of multiple tightly-
coupled clusters, with the target of scaling up the performance of
PULP systems with no compromises on energy-efficiency.
REFERENCES
[1] D. Genbrugge and L. Eeckhout, “Chip multiprocessor design space
exploration through statistical simulation,” IEEE Trans. on Computers,
vol. 58, no. 12, pp. 1668–1681, Dec. 2009.
[2] D. Geer, “Chip makers turn to multicore processors,” IEEE Computer,
vol. 38, no. 5, pp. 11–13, May 2005.
[3] D. Bertozzi et al., “NoC synthesis flow for customized domain specific
multiprocessor systems-on-chip,” IEEE Trans. on Parallel and Dis-
tributed Systems, vol. 16, no. 2, pp. 113–129, Feb. 2005.
[4] T. M. Conte and M. Levy, “Embedded multicore processors and systems,”
IEEE Micro, vol. 29, no. 03, pp. 7–9, May 2009.
[5] R. Khan, S. U. Khan, R. Zaheer, and S. Khan, “Future Internet: The Inter-
net of Things Architecture, Possible Applications and Key Challenges,”
Proc. Int. Conf. on Frontiers of Information Technology (FIT), Dec. 2012.
[6] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, “Internet of things
(IoT): A vision, architectural elements, and future directions,” Future
Generation Computer Systems, vol. 29, no. 7, pp. 1645–1660, Sept. 2013.
[7] O. Azizi, A. Mahesri, B. Lee, S. Patel, and M. Horowitz, “Energy-
performance tradeoffs in processor architecture and circuit design: A
marginal cost analysis,” in ACM SIGARCH Computer Architecture News,
vol. 38, June 2010, pp. 26–36.
[8] S. Salamin, H. Amrouch, and J. Henkel, “Selecting the optimal energy
point in near-threshold computing,” Design, Automation & Test in Europe
Conf. & Exhibition (DATE), pp. 1670–1675, March 2019.
[9] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and
T. Mudge, “Near-threshold computing: Reclaiming Moore’s Law through
energy efficient integrated circuits,” Proc. of the IEEE, vol. 98, no. 2, pp.
253–266, Feb. 2010.
[10] D. Rossi et al., “Energy-efficient near-threshold parallel computing: The
PULPv2 cluster,” IEEE Micro, vol. 37, no. 5, pp. 20–31, Sept. 2017.
[11] OpenMP Architecture Review Board, The OpenMP API specification for
parallel programming, https://www.openmp.org.
[12] T. D. Matteis, F. Luporini, G. Mencagli, and M. Vanneschi, “Evaluation
of architectural supports for fine-grained synchronization mechanisms,”
IASTED Int. Conf. on Parallel and Distributed Computing and Networks
(PDCN), pp. 576–585, March 2013.
[13] J. Sartori and R. Kumar, “Low-overhead, high-speed multi-core barrier
synchronization,” Int. Conf. on High-Performance Embedded Architec-
tures and Compilers (HiPEAC), pp. 18–34, 2010.
[14] J. Sampson, R. Gonzalez, J. Collard, N. P. Jouppi, M. Schlansker,
and B. Calder, “Exploiting fine-grained data parallelism with chip
multiprocessors and fast barriers,” IEEE Micro, pp. 235–246, Dec. 2006.
[15] H. Xiao, T. Isshiki, D. Li, H. Kunieda, Y. Nakase, and S. Kimura,
“Optimized communication and synchronization for embedded multipro-
cessors using ASIP methodology,” Information and Media Technologies,
vol. 7, no. 4, pp. 1331–1345, Jan. 2012.
[16] B. E. Saglam and V. J. Mooney, “System-on-a-chip processor synchro-
nization support in hardware,” Design, Automation & Test in Europe
Conf. & Exhibition (DATE), pp. 633–639, March 2001.
[17] C. J. Beckmann and C. D. Polychronopoulos, “Fast barrier synchroniza-
tion hardware,” Proc. of ACM Int. Conf. on Supercomputing, pp. 180–
189, Nov. 1990.
[18] C. Ferri, A. Viescas, T. Moreshet, R. I. Bahar, and M. Herlihy, “Energy
efficient synchronization techniques for embedded architectures,” Proc.
of ACM Great Lakes Symp. on VLSI (GLSVLSI), pp. 435–440, May 2008.
[19] S. H. Kim et al., “C-Lock: Energy efficient synchronization for embedded
multicore systems,” IEEE Trans. on Computers, vol. 63, no. 8, pp. 1962–
1974, Aug. 2014.
[20] M. Monchiero, G. Palermo, C. Silvano, and O. Villa, “Efficient synchro-
nization for embedded on-chip multiprocessors,” IEEE Trans. Very Large
Scale Integ. (VLSI) Syst., vol. 14, no. 10, pp. 1049–1062, Oct. 2006.
[21] C. Yu and P. Petrov, “Low-cost and energy-efficient distributed synchro-
nization for embedded multiprocessors,” IEEE Trans. Very Large Scale
Integ. (VLSI) Syst., vol. 18, no. 8, pp. 1257–1261, Aug. 2010.
[22] T. E. Anderson, “The performance of spin lock alternatives for shared-
memory multiprocessors,” IEEE Trans. on Parallel and Distributed
Systems, vol. 1, no. 1, pp. 6–16, Jan. 1990.
[23] A. Kagi, D. Burger, and J. R. Goodman, “Efficient synchronization: Let
them eat QOLB,” Int. Symp. on Computer Architecture (ISCA), pp. 170–
180, June 1997.
[24] J. M. Mellor-Crummey and M. L. Scott, “Algorithms for scalable
synchronization on shared-memory multiprocessors,” ACM Trans. on
Computer Systems (TOCS), vol. 9, no. 1, pp. 21–65, Feb. 1991.
[25] C. Ferri, R. I. Bahar, M. Loghi, and M. Poncino, “Energy-optimal
synchronization primitives for single-chip multi-processors,” Proc. of
ACM Great Lakes Symp. on VLSI (GLSVLSI), pp. 141–144, 2009.
[26] O. Golubeva, M. Loghi, and M. Poncino, “On the energy efficiency of
synchronization primitives for shared-memory single-chip multiproces-
sors,” Proc. of ACM Great Lakes Symp. on VLSI (GLSVLSI), pp. 489–
492, March 2007.
[27] T. Tsai, L. Fan, Y. Chen, and T. Yao, “Triple Speed: Energy-aware real-
time task synchronization in homogeneous multi-core systems,” IEEE
Trans. on Computers, vol. 65, no. 4, pp. 1297–1309, April 2016.
[28] M. Loghi, M. Poncino, and L. Benini, “Synchronization-driven dynamic
speed scaling for MPSoCs,” Proc. of Int. Symp. on Low Power Electronics
and Design (ISLPED), pp. 346–349, Oct. 2006.
[29] C. Liu, A. Sivasubramaniam, M. Kandemir, and M. J. Irwin, “Exploiting
barriers to optimize power consumption of CMPs,” Proc. of Int. Parallel
and Distributed Processing Symp. (IPDPS), April 2005.
[30] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun,
“STAMP: Stanford transactional applications for multi-processing,”
IEEE Int. Symp. on Workload Characterization (IISWC), pp. 35–46, Sept.
2008.
[31] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The SPLASH-
2 programs: Characterization and methodological considerations,” Int.
Symp. on Computer Architecture (ISCA), pp. 24–36, June 1995.
[32] J. T. Feo, “An analysis of the computational and parallel complexity of
the Livermore Loops,” Parallel Computing, vol. 7, no. 2, pp. 163–185,
June 1988.
[33] F. Thabet, Y. Lhuillier, C. Andriamisaina, J.-M. Philippe, and R. David,
“An efficient and flexible hardware support for accelerating synchro-
nization operations on the STHORM many-core architecture,” Design,
Automation & Test in Europe Conf. & Exhibition (DATE), pp. 531–534,
2013.
[34] J. L. Hennessy and D. A. Patterson, Computer Architecture, Fifth Edition:
A Quantitative Approach, 5th ed. San Francisco, CA, USA: Morgan
Kaufmann Publishers Inc., 2011.
[35] J. Choquette, O. Giroux, and D. Foley, “Volta: Performance and pro-
grammability,” IEEE Micro, vol. 38, no. 2, pp. 42–52, 2018.
[36] R. Braojos, A. Dogan, I. Beretta, G. Ansaloni, and D. Atienza,
“Hardware/software approach for code synchronization in low-power
multi-core sensor nodes,” Design, Automation & Test in Europe Conf. &
Exhibition (DATE), pp. 1–6, March 2014.
[37] R. Braojos et al., “A synchronization-based hybrid-memory multi-core
architecture for energy-efficient biomedical signal processing,” IEEE
Trans. on Computers, vol. 66, no. 4, pp. 575–585, April 2017.
[38] A. Pullini, D. Rossi, I. Loi, G. Tagliavini, and L. Benini, “Mr. Wolf:
An energy-precision scalable parallel ultra low power SoC for IoT edge
processing,” IEEE J. Solid-State Circuits, pp. 1–11, 2019.
[39] J. H. Kelm et al., “Rigel: An architecture and scalable programming in-
terface for a 1000-core accelerator,” Int. Symp. on Computer Architecture
(ISCA), vol. 37, no. 3, pp. 140–151, June 2009.
[40] L. Benini, E. Flamand, D. Fuin, and D. Melpignano, “P2012: Building
an ecosystem for a scalable, modular and high-efficiency embedded
computing accelerator,” Design, Automation & Test in Europe Conf. &
Exhibition (DATE), pp. 983–987, March 2012.
[41] D. Palossi et al., “A 64-mW DNN-based visual navigation engine for
autonomous nano-drones,” IEEE Internet of Things J., vol. 6, no. 5, pp.
8357–8371, Oct. 2019.
[42] V. Kartsch et al., “BioWolf: A sub-10-mW 8-channel advanced brain-
computer interface platform with a nine-core processor and BLE con-
nectivity,” IEEE Trans. on Biomed. Circuits and Systems, vol. 13, no. 5,
pp. 893–906, Oct. 2019.
[43] A. Rahimi, I. Loi, M. R. Kakoee, and L. Benini, “A fully-synthesizable
single-cycle interconnection network for shared-L1 processor clusters,”
Design, Automation & Test in Europe Conf. & Exhibition (DATE), March
2011.
[44] G. Tagliavini, G. Haugou, and L. Benini, “Optimizing memory band-
width in OpenVX graph execution on embedded many-core accelera-
tors,” Proceedings of the 2014 Conference on Design and Architectures
for Signal and Image Processing, pp. 1–8, 2014.
[45] G. Tagliavini, G. Haugou, A. Marongiu, and L. Benini, “Enabling
OpenVX support in mw-scale parallel accelerators,” IEEE Int. Conf. on
Compilers, Architectures, and Synthesis of Embedded Systems (CASES),
pp. 1–10, Oct. 2016.
[46] P. Vogel, A. Marongiu, and L. Benini, “Lightweight virtual memory
support for zero-copy sharing of pointer-rich data structures in het-
erogeneous embedded SoCs,” IEEE Trans. on Parallel and Distributed
Systems, vol. 28, no. 7, pp. 1947–1959, July 2017.
[47] M. Gautschi et al., “Near-threshold RISC-V core with DSP extensions
for scalable IoT endpoint devices,” IEEE Trans. Very Large Scale Integ.
(VLSI) Syst., vol. 25, no. 10, pp. 2700–2713, Oct. 2017.
[48] F. Conti et al., “An IoT endpoint System-on-Chip for secure and energy-
efficient near-sensor analytics,” IEEE Trans. Circuits Syst. – I: Reg.
Papers, vol. 64, no. 9, pp. 2481–2494, Sept. 2017.
[49] E. Flamand et al., “GAP-8: A RISC-V SoC for AI at the edge of the
IoT,” IEEE Int. Conf. on Application-specific Syst., Architectures and
Processors (ASAP), July 2018.
[50] P. Schönle et al., “A multi-sensor and parallel processing SoC for
miniaturized medical instrumentation,” IEEE J. Solid-State Circuits, vol. 53,
no. 7, pp. 2076–2087, July 2018.
[51] X. Wang, M. Magno, L. Cavigelli, and L. Benini, “FANN-on-MCU:
An open-source toolkit for energy-efficient neural network inference
at the edge of the internet of things,” 2019. [Online]. Available:
https://arxiv.org/abs/1911.03314
Florian Glaser received the M.Sc. degree in
electrical engineering from ETH Zurich, Switzer-
land, in 2015, where he is currently pursuing the
Ph.D. degree at the Integrated Systems Labora-
tory. His current research interests include low-
power integrated circuits with a special focus
on energy-efficient synchronization of multicore
clusters and mixed-signal systems-on-chip for
miniaturized biomedical instrumentation.
Giuseppe Tagliavini received the Ph.D. degree
in electronic engineering from the University of
Bologna, Bologna, Italy, in 2017. He is currently
a Post-Doctoral Researcher with the Department
of Electrical, Electronic, and Information Engi-
neering, University of Bologna. He has coau-
thored over 20 papers in international confer-
ences and journals. His research interests in-
clude parallel programming models for embed-
ded systems, run-time optimization for multicore
and many-core accelerators, and design of soft-
ware stacks for emerging computing architectures.
Davide Rossi received the Ph.D. degree from
the University of Bologna, Bologna, Italy, in
2012. He has been a Post-Doctoral Researcher
with the Department of Electrical, Electronic
and Information Engineering Guglielmo Marconi,
University of Bologna, since 2015, where he is
currently an Assistant Professor. His research
interests focus on energy-efficient digital archi-
tectures. In this field, he has published more
than 80 papers in international peer-reviewed
conferences and journals.
Germain Haugou received the Engineering De-
gree in telecommunication from the University of
Grenoble, in 2004. He was with ST Microelec-
tronics as a Research Engineer, for ten years.
He is currently with ETH Zurich, Switzerland,
as a Research Assistant. His research interests
include virtual platforms, run-time systems, com-
pilers, and programming models for many-core
embedded architectures.
Qiuting Huang received the Ph.D. degree in
applied sciences from the Katholieke Universiteit
Leuven, Leuven, Belgium, in 1987.
Between 1987 and 1992, he was a lecturer at
the University of East Anglia, Norwich, UK. Since
January 1993, he has been with the Integrated
Systems Laboratory, ETH Zurich, Switzerland,
where he is Professor of Electronics. In 2007,
he was also appointed as a part-time Cheung
Kong Seminar Professor by the Chinese Ministry
of Education and the Cheung Kong Foundation
and has been affiliated with the South East University, Nanjing, China.
His research interests span RF, analog, mixed analog-digital as well
as digital application specific integrated circuits and systems, with an
emphasis on wireless communications and biomedical applications in
recent years.
Dr. Huang currently serves as vice chair of the steering committee, as
well as a subcommittee chair of the technical program committee of the
European Solid-State Circuits Conference (ESSCIRC). He also served
on the technical program and executive committees of the International
Solid-State Circuits Conference (ISSCC) between 2000 and 2010.
Luca Benini holds the chair of the Digital Circuits
and Systems Group at the Integrated Systems
Laboratory, ETH Zurich and is Full Professor
at the Università di Bologna. He has been
visiting professor at Stanford University, IMEC,
and EPFL. He served as chief architect in STMi-
croelectronics France.
Dr. Benini’s research interests are in energy-
efficient parallel computing systems, smart
sensing micro-systems and machine learning
hardware. He has published more than 1000
peer-reviewed papers and five books.
He is a Fellow of the IEEE, of the ACM and a member of the
Academia Europaea. He is the recipient of the 2016 IEEE CAS Mac
Van Valkenburg award and of the 2019 IEEE TCAD Donald O. Pederson
Best Paper Award.
