Playing with fire: transactional memory revisited for error-resilient and energy-efficient MPSoC execution by Papagiannopoulou, Dimitra et al.
Boston University
OpenBU http://open.bu.edu
Electrical and Computer Engineering BU Open Access Articles
2015-05-22
Version Accepted manuscript
Citation (published version): Dimitra Papagiannopoulou, Andrea Marongiu, Tali Moreshet, Luca
Benini, Maurice Herlihy, Iris Bahar. 2015. "Playing with Fire:
Transactional Memory Revisited for Error-Resilient and
Energy-Efficient MPSoC Execution." Great Lakes Symposium on VLSI -
GLSVLSI. https://doi.org/10.1145/2742060.2742090
https://hdl.handle.net/2144/40254
Playing with Fire: Transactional Memory Revisited for
Error-Resilient and Energy-Efficient MPSoC Execution
Dimitra Papagiannopoulou
Brown University
dimitra_papagiannopoulou@brown.edu
Andrea Marongiu
ETH Zurich
mandrea@iis.ee.ethz.ch
Tali Moreshet
Boston University
talim@bu.edu
Luca Benini
ETH Zurich
lbenini@iis.ee.ethz.ch
Maurice Herlihy
Brown University
mph@cs.brown.edu
Iris Bahar
Brown University
iris_bahar@brown.edu
ABSTRACT
As silicon integration technology pushes toward atomic di-
mensions, errors due to static and dynamic variability are
an increasing concern. To avoid such errors, designers often
turn to “guardband” restrictions on the operating frequency
and voltage. If guardbands are too conservative, they limit
performance and waste energy, but less conservative guard-
bands risk moving the system closer to its Critical Operat-
ing Point (COP), a frequency-voltage pair that, if surpassed,
causes massive instruction failures. In this paper, we propose
a novel scheme that dynamically adjusts to an evolving COP
and operates at highly reduced margins, while
guaranteeing forward progress. Specifically, our scheme dy-
namically monitors the platform and adaptively adjusts to
the COP among multiple cores, using lightweight check-
pointing and roll-back mechanisms adopted from Hardware
Transactional Memory (HTM) for error recovery. Experi-
ments demonstrate that our technique is particularly effec-
tive in saving energy while also offering safe execution guar-
antees. To the best of our knowledge, this work is the first
to describe a full-fledged HTM implementation for error-
resilient and energy-efficient MPSoC execution.
1. INTRODUCTION
Scaling of physical dimensions in semiconductor devices
has opened the way to heterogeneous embedded SoCs inte-
grating host processors and many-core accelerators in the
same chip [5], but at a price of ever-increasing static and
dynamic hardware variability [2]. Spatial die-to-die and
within-die static variations ultimately induce performance
and power mismatches between the cores in a many-core ar-
ray, introducing heterogeneity in a nominally homogeneous
system (formally identical processing resources). Dynamic
variations depend on the operating conditions of the chip,
and include aging, supply voltage drops and temperature
fluctuations. The most common consequence of variations is
path delay uncertainty. Circuit designers typically use con-
servative guardbands on the operating frequency or voltage
to ensure safe system operation, with the obvious consequent
loss of operational efficiency. When the guardbands are re-
duced, or when the system is aggressively operated far from
a safe point, the delay uncertainty manifests itself either as
an intermittent timing error [7] [4] or a critical operating
point (COP) [17]. Timing errors may ultimately cause er-
roneous instructions with wrong outputs being stored or,
worse, incorrect control flow. COP defines a voltage and
frequency pair at which a core is error-free. If the voltage is
decreased below (or the frequency is increased beyond) the
COP, the core will face a massive number of errors [17]. The
COP effect is highly pronounced in well-optimized designs
[11] [15] due to so-called “path walls”.
Circuit level error detection and correction (EDAC) tech-
niques [7] [4] can transparently detect and correct timing
errors, with the side-effect of an increased execution time
and energy. Indeed, techniques such as multiple-issue in-
struction replay [4] to correct an errant instruction incur
the cost of flushing the pipeline and executing N+1 replicas
of the instruction (N being the number of pipeline stages).
In addition, while EDAC techniques are suitable for han-
dling sporadic errors, they are obviously not a good solution
for the “all-or-nothing” effect of the COP. In principle the
COP can be determined for a particular chip after its pro-
duction, and the most efficient yet safe voltage/frequency
pair for the chip could be configured at that time. However,
due to static and dynamic variations, the COP may actually
change over space and time. As a result, the “safe” operating
point may i) differ from one core to another (forcing the
entire chip to be tuned conservatively to meet the requirements
of the most critical core) and ii) suddenly become unsafe
due to aging, temperature fluctuations or voltage drops.
In this paper, we propose an integrated HW/SW scheme
that addresses both types of variation phenomena and, in
particular, dynamically adjusts to an evolving COP, thus
enabling the system to operate at highly reduced margins
without sacrificing performance, while at the same time
guaranteeing forward progress at reduced energy levels. More
specifically, our approach dynamically monitors the platform
and adaptively adjusts to the COP among multiple cores,
using lightweight checkpointing and roll-back mechanisms
adapted from Hardware Transactional Memory (HTM) for
error recovery. In particular, we support two distinct types
of recovery mechanisms: non-critical and critical. Non-critical
recovery is required whenever an error takes place
in the datapath (e.g., the multiplier). In this case the
consequence of an error is that incorrect data may be stored in
memory. Critical recovery is required when an error takes
place in the control part of the processor pipeline (e.g.,
instruction fetch/decode). This type of error breaks the
original control flow of the program and prevents any
software-based solution from taking control.
Figure 1: Target platform high level view. (Block diagram: a host system and a cluster-based programmable many-core accelerator (PMCA) attached to the system interconnect, with a DRAM memory controller; each shared-memory cluster is equipped with critical path monitors (EDS).)
We assume the platform is initially configured to operate
at a safe, reference operating voltage (i.e., with safe margins
to hide all variability effects). Every time a new transaction
is started, our technique optimistically lowers the voltage in
small steps, individually on each core. If sporadic or non-
critical errors take place, the HTM-inspired techniques in-
tervene and ensure correct program behavior and progress.
If systematic or critical errors take place, then the system re-
verts to the previous stable operating point. If over time the
COP changes, the technique is re-activated and the system
is re-calibrated. Using cycle-accurate simulation models, an-
notated with power/performance numbers extracted from a
silicon implementation of the target platform, we show that
the proposed technique can achieve up to 40% energy savings
with respect to using conservative voltage margins, over dif-
ferent benchmarks. To the best of our knowledge, this work
is the first to i) describe a full-fledged HTM implementation
for error-resilient and energy-efficient MPSoC execution; ii)
provide realistic energy saving measurements using real-life
benchmarks.
The rest of the paper is organized as follows. In Section 2
we provide details of the target platform, and in Section 3 we
describe the proposed techniques. The experimental setup
and results are discussed in Section 4, and we compare to
related work in Section 5. Section 6 concludes the paper.
2. TARGET ARCHITECTURE
Our HW/SW design is driven to a large extent by the
target architecture (Fig. 1). We envision a general-purpose
host processor, coupled with a programmable many-core ac-
celerator (PMCA) composed of several tens of simple cores,
where critical computation kernels of an application can be
offloaded to improve overall performance/Watt. We assume
that the host core is operated with safe margins. Our work
focuses on the PMCA, and in particular on a design that
leverages a multi-cluster configuration to overcome scalability
limitations [5] [10] [18]. Our goal is to improve energy
efficiency by operating the PMCA “dangerously close” to the
COP, while exploiting the HTM to avoid failures.
In this multi-cluster configuration, simple processing el-
ements are grouped into clusters sharing high-performance
local interconnect and memory. Several clusters are repli-
cated and interconnected through a scalable medium such
as a network-on-chip (NoC), while within a cluster a lim-
ited number of simple processors (typically 4 to 16) share
an L1 tightly-coupled data memory (TCDM). The TCDM
is configured as a shared multi-banked scratchpad memory (see footnote 1)
that enables concurrent accesses to different memory banks.
Simultaneous accesses to the same bank are serialized in a
round-robin fashion. Accesses to memory external to the
cluster go through a network interface. The basic synchro-
nization mechanism is provided via standard test-and-set
registers.
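As an illustration of the banked TCDM access pattern described above, the following C sketch maps a byte address to a bank and picks a winner among cores contending for the same bank. The word-level interleaving, the 4-byte word size, and the function names `bank_of` and `rr_grant` are assumptions made for illustration; the text only states that different banks can be accessed concurrently and that same-bank accesses are serialized round-robin. The core and bank counts are those of the cluster evaluated in this paper.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BANKS  16u  /* bank count of the evaluated cluster */
#define NUM_CORES  8u   /* core count of the evaluated cluster */
#define WORD_BYTES 4u   /* assumed word size */

/* Assumed word-level interleaving: consecutive words fall in consecutive banks. */
unsigned bank_of(uint32_t byte_addr)
{
    return (byte_addr / WORD_BYTES) % NUM_BANKS;
}

/*
 * Round-robin arbitration for one bank.
 * requests: bit c is set when core c wants this bank in the current cycle.
 * last:     core that was granted most recently.
 * Returns the granted core, or -1 when nobody is requesting.
 */
int rr_grant(uint32_t requests, unsigned last)
{
    assert(last < NUM_CORES);
    for (unsigned i = 1; i <= NUM_CORES; i++) {
        unsigned c = (last + i) % NUM_CORES;  /* search starts after 'last' */
        if (requests & (1u << c))
            return (int)c;
    }
    return -1;
}
```

Under this assumed mapping, two cores that touch neighbouring words hit different banks and proceed in parallel, whereas two cores that touch addresses 64 bytes apart collide on the same bank and are served in turn.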
On top of this baseline cluster we design our HTM ex-
tensions for error-tolerance. More specifically, we revisit
existing checkpointing and rollback mechanisms that have
been employed for HTM, to now be used as a lightweight
mechanism for fast and efficient error recovery.
The architecture implemented in this work consists of a
single computation cluster, featuring 8 cores with private
I$ (1 KByte) and 16 TCDM banks (256KB), plus external
(main) L2 memory (2MB). The TCDM is implemented using
two different technologies: 6-transistor SRAM and Standard
Cell Memory (SCM). SCM achieves lower density (∼3X)
than SRAM, but can reliably operate at the same voltage
ranges as the rest of the logic. SRAM requires higher
voltages to operate reliably, and thus consumes more energy
(∼4X) [13]. We use SCM to implement storage that needs
to always be reliable (e.g., to implement function calls, for
control-flow data and for instruction cache), while program
data is stored in SRAM and our HTM techniques are used
to recover from errors.
All the base performance/energy/area numbers used in
this work are derived from a silicon implementation of the
platform in 28nm STMicroelectronics UTB FD-SOI technol-
ogy, and integrated in a cycle-accurate SystemC simulator
(see Section 4). The cluster is able to operate over a wide
range of frequencies (from 20MHz @ 0.5V up to 450MHz @
1.2V). In this work the target frequency is 200MHz, with
a nominal voltage of 0.84V. Due to process variation the
required Vdd for a safe operating condition may actually vary
among cores (we observe up to 0.04V increase). Different
sources of dynamic variations also increase the minimum
voltage level required for safe operation. The baseline plat-
form considers safe margins to compensate for all sources of
variability, and is thus conservatively operated at a reference
voltage of 1V.
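To give a feel for how much headroom the 1V guardband leaves, a back-of-the-envelope estimate can be made from the textbook relation between dynamic energy and supply voltage at a fixed frequency. This is only a first-order sketch: it ignores leakage, temperature effects and the overhead of any recovery mechanism, and it is not the energy model used for the results in this paper.

```latex
\[
E_{\mathrm{dyn}} \propto C\,V_{dd}^{2}
\quad\Longrightarrow\quad
\frac{E_{\mathrm{dyn}}(0.84\,\mathrm{V})}{E_{\mathrm{dyn}}(1.0\,\mathrm{V})}
  = \left(\frac{0.84}{1.0}\right)^{2} = 0.7056,
\qquad
\frac{E_{\mathrm{dyn}}(0.88\,\mathrm{V})}{E_{\mathrm{dyn}}(1.0\,\mathrm{V})}
  = \left(\frac{0.88}{1.0}\right)^{2} = 0.7744 .
\]
```

Under these assumptions, a core that could run at its nominal 0.84V instead of the 1V reference would spend roughly 29% less dynamic energy, and a core needing 0.88V roughly 23% less.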
Any errors caused by dynamic variation need to be de-
tected at runtime. We assume each core is equipped with
error-detection circuitry such as error-detection sequential
(EDS) [4]. In particular, since our techniques try to opti-
mistically lower the voltage and adapt to an evolving COP,
we need to always be able to recover from two types of errors:
1. Non-critical errors are those that originate from
timing delays along the datapath (e.g., multiplier) and
ultimately lead to writing a bad value to memory.
2. Critical errors are those that occur in the control
part of the processor pipeline (instruction fetch/de-
code) and ultimately lead to catastrophic failures.
Footnote 1: Not a data cache. Coherency is managed via explicit copies.
Our simulation models were augmented to have both types
of paths monitored via EDS.
3. IMPLEMENTATION
Our proposed scheme borrows key concepts from Hard-
ware Transactional Memory (HTM) to provide a mecha-
nism for error recovery. Traditionally, HTM requires two
key components: i) some form of bookkeeping (for keeping
track of read/write data conflicts), and ii) data versioning
(for keeping track of speculative and non-speculative ver-
sions of data in case it is necessary to rollback and recover
from a data conflict). For our purposes, we only need to
implement data versioning and rollback in order to recover
from variability-induced errors. Our design uses the TCDM
memory to hold both speculative and non-speculative data.
Data logs are distributed across the TCDM memory banks,
so that each bank is responsible for handling recovery only
for its associated data.
3.1 Checkpointing and Rollback
We protect all the parallel parts of the program from er-
rors by enclosing them within transactions (see Sec. 3.4).
At the beginning of each transaction (Transaction Start) we
save the internal state of the core (i.e. program counter,
stack pointer, internal registers, stack contents) to be able
to roll back in case of errors. Error resolution can be eager
or lazy, meaning that we can resolve the error by aborting
the transaction and rolling back right away or wait until the
end of the transactional region to do so. For our design, we
consider both variants. For non-critical errors we follow lazy
error resolution since we want to avoid the cost of frequent
error checking. We fine tune our transactional regions’ sizes
to be small enough so that if errors start occurring, it won’t
be long before they get detected and the core’s voltage is
adjusted to safer levels. In the lazy error recovery scheme,
when a transaction completes execution (Transaction End)
we check whether (non-critical) errors have been encoun-
tered by the core executing the transaction. If no errors
are detected the transaction commits, the checkpointing in-
formation is discarded and speculative changes to the data
become permanent. If errors are detected the transaction
aborts, and a rollback mechanism restores the internal core
state. In addition, data are restored to their original values
and speculative copies are discarded. For critical errors an
interrupt is generated by path monitors, and error resolution
employs an eager scheme (see Sec. 3.3).
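The checkpoint-and-lazy-resolution behaviour described in this subsection can be sketched in C as follows. All names (`core_t`, `transaction_start`, `report_noncritical_error`, `transaction_end`) and the register count are hypothetical, and the stack contents that the text says are also saved are left out for brevity. The sketch only models the core state; data restoration is the job of the data versioning mechanism.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_REGS 16  /* assumed register-file size */

/* Architectural state saved at Transaction Start (stack contents omitted). */
typedef struct {
    uint32_t pc;
    uint32_t sp;
    uint32_t regs[NUM_REGS];
} core_state_t;

typedef struct {
    core_state_t live;        /* state the core is currently running with */
    core_state_t checkpoint;  /* copy taken at Transaction Start */
    bool error_seen;          /* sticky flag set by the error monitors */
} core_t;

/* Transaction Start: snapshot the state and clear the sticky error flag. */
void transaction_start(core_t *c)
{
    assert(c != NULL);
    c->checkpoint = c->live;
    c->error_seen = false;
}

/* Lazy scheme: a non-critical error is only recorded, not acted upon yet. */
void report_noncritical_error(core_t *c)
{
    assert(c != NULL);
    c->error_seen = true;
}

/*
 * Transaction End: returns true on commit, false on abort.
 * On abort the core state is rolled back so the transaction can be retried.
 */
bool transaction_end(core_t *c)
{
    assert(c != NULL);
    if (!c->error_seen)
        return true;            /* commit: the checkpoint is simply dropped */
    c->live = c->checkpoint;    /* abort: restore PC, SP and registers */
    c->error_seen = false;
    return false;
}
```

Because the error flag is inspected only in `transaction_end`, the cost of checking is paid once per transaction, which is the motivation the text gives for keeping transactional regions small.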
3.2 Data Versioning
Data Versioning can also be either eager or lazy. Lazy
data versioning keeps the original data in place and buffers
the speculative data updates in different locations (allow-
ing for fast error recovery). Eager data versioning makes
speculative changes in place and stores back-up copies of
the original data in separate places (allowing for fast com-
mits but slow abort handling). We expect that our scheme
will incur relatively low error rates; therefore aborts due to
errors will be infrequent and we can choose an eager data
versioning mechanism.
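The trade-off between the two versioning styles can be made concrete with a toy model. In the C sketch below, the 8-word memory, the type `versioned_mem_t` and all function names are assumptions for illustration only; in particular, reads inside a lazy transaction would have to consult the buffer, which is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MEM_WORDS 8   /* toy memory size (assumption) */

typedef struct {
    uint32_t mem[MEM_WORDS];     /* the "real" memory locations */
    uint32_t shadow[MEM_WORDS];  /* eager: old values; lazy: new values */
    bool     touched[MEM_WORDS]; /* which words have a shadow entry */
} versioned_mem_t;

/* Eager: update in place, keep the original value aside on first write. */
void eager_write(versioned_mem_t *m, unsigned i, uint32_t v)
{
    assert(i < MEM_WORDS);
    if (!m->touched[i]) {
        m->shadow[i] = m->mem[i];
        m->touched[i] = true;
    }
    m->mem[i] = v;
}

/* Eager commit is cheap: memory is already up to date, drop the backups. */
void eager_commit(versioned_mem_t *m)
{
    memset(m->touched, 0, sizeof m->touched);
}

/* Eager abort is the slow path: copy every backup to its address. */
void eager_abort(versioned_mem_t *m)
{
    for (unsigned i = 0; i < MEM_WORDS; i++)
        if (m->touched[i])
            m->mem[i] = m->shadow[i];
    memset(m->touched, 0, sizeof m->touched);
}

/* Lazy: leave memory untouched and buffer the new value. */
void lazy_write(versioned_mem_t *m, unsigned i, uint32_t v)
{
    assert(i < MEM_WORDS);
    m->shadow[i] = v;
    m->touched[i] = true;
}

/* Lazy commit is the slow path: publish every buffered value. */
void lazy_commit(versioned_mem_t *m)
{
    for (unsigned i = 0; i < MEM_WORDS; i++)
        if (m->touched[i])
            m->mem[i] = m->shadow[i];
    memset(m->touched, 0, sizeof m->touched);
}

/* Lazy abort is cheap: memory was never modified, drop the buffer. */
void lazy_abort(versioned_mem_t *m)
{
    memset(m->touched, 0, sizeof m->touched);
}
```

The copy loop sits in `eager_abort` and in `lazy_commit`: whichever event is expected to be rare should carry it, which is why a low expected error rate favours the eager variant.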
In this work, we propose a distributed per-address log data
versioning scheme that is simple, fast, and significantly more
space efficient than state-of-the-art transactional memory
approaches for embedded systems [16]. Figure 2 depicts
how the distributed per-address log design works. In this
design, distributed per-address logs are used to save backups
of the original values of data that are written during
transactions, so that they can be recovered in case of errors.
Figure 2: Distributed per-address log scheme for M banks and N cores. (Diagram of the Tightly Coupled Data Memory (TCDM): each bank j, from BANK 0 to BANK M-1, holds its DATA, one log per core (Core 0's log to Core N-1's log), and its own Data Versioning Module, DVM j.)
Since memory is distributed across multiple memory banks
that accept and serve access requests in parallel, having a
central control logic to manage the distributed logs would
not be efficient. For this reason, we divide the transactional
handling and log managing responsibilities across multiple
control modules, one for each bank of the TCDM, that we
call Data Versioning Modules (DVM). Each bank’s DVM is
a control block that monitors transactional accesses to the
bank and manages the cores’ logs that reside in that bank. It
is also responsible for restoring the log data of the cores that
abort their transactions and cleaning the logs of the cores
that commit their transactions. All banks’ DVMs work in
parallel and independently of each other. At every bank of
the TCDM, we keep a fixed-size log space for each core in
the system. Each core’s log holds the addresses that belong
to that bank and are written transactionally by that core.
In this way, we keep a log space only for the addresses of
the bank that are actually written transactionally. At the
same time, with this distributed log design we avoid cross-
bank data exchange when saving and restoring the log, since
each address’s log falls within the same bank. Thus the log
saving and restoration process is triggered internally by the
DVM of each bank and it does not require interaction with
the DVMs of other banks.
When a core writes transactionally to an address of a
bank, its log is traversed to check whether it already holds
an entry for that address. If not, a new log entry is cre-
ated to store the original data of the address. Note that
the data only need to be logged the first time the address is
written within a specific transaction. Therefore, the log size
depends on the write footprint of each transaction. Since the
log of each core is distributed among all the TCDM banks,
we expect that the log writes will also be divided among
the banks. The size of each core’s log space per bank is a
parameter in our design, so it can be easily adjusted to the
needs of different application domains. In case of an overflow,
our technique resorts to software-managed logging into
the main L2 memory. The capability of tuning the transac-
tions’ granularity is intuitively key to reducing the number of
overflows. Using the technique described in Section 3.4, we
found that 1KB total log size per core (64B in each TCDM
bank) is adequate for our target applications. Overall, the
logs for all the cores occupy roughly 3% of the total TCDM
space.
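The log-space figures quoted above can be checked with a short calculation, using the configuration given in Section 2 (8 cores, 16 TCDM banks, 256KB of TCDM):

```latex
\[
\underbrace{16\ \text{banks} \times 64\,\mathrm{B}}_{\text{log per core}} = 1024\,\mathrm{B} = 1\,\mathrm{KB},
\qquad
8\ \text{cores} \times 1\,\mathrm{KB} = 8\,\mathrm{KB},
\qquad
\frac{8\,\mathrm{KB}}{256\,\mathrm{KB}} = 3.125\% \approx 3\% .
\]
```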
Figure 3: Control flow of an error-resilient transaction. (Flowchart: Start_Transaction; if the COP has not been found, Lower_Voltage; Execute Transaction. a) LAZY: when the transaction ends, check whether errors were detected; if not, the transaction commits (Clean_Logs, discard checkpoint); if so, the transaction aborts (Restore_Logs, restore state) and Increase_Voltage is applied. b) EAGER: an interrupt (critical error detected) invokes the ISR; the flowchart also contains an "Apply FBB" step.)
If an error is detected and a transaction must abort, each
bank’s log is traversed to restore the original data back to its
proper address. If a transaction commits, the logs associated
with that transaction are all discarded and the speculative
data now becomes non-speculative.
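The per-bank logging, restoration and cleanup steps can be sketched in C as below. This is a software model under stated assumptions, not the hardware design: the bank, core and log sizes are shrunk for readability, and the entry layout (`log_entry_t`) and the function `dvm_write` are hypothetical. The names `restore_logs` and `clean_logs` follow the labels used in Figure 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_BANKS   4   /* small values for illustration (paper: 16 banks) */
#define NUM_CORES   2   /* (paper: 8 cores) */
#define BANK_WORDS  8   /* words of data per bank (assumption) */
#define LOG_ENTRIES 4   /* log entries per core per bank (assumption) */

typedef struct { unsigned index; uint32_t old_value; } log_entry_t;

typedef struct {
    uint32_t    data[BANK_WORDS];
    log_entry_t log[NUM_CORES][LOG_ENTRIES];
    unsigned    log_len[NUM_CORES];
} bank_t;

bank_t tcdm[NUM_BANKS];

/*
 * Transactional write handled by the bank's DVM. The old value is logged only
 * the first time this core writes the word. Returns false on log overflow, in
 * which case the caller would fall back to software logging (not shown).
 */
bool dvm_write(unsigned bank, unsigned core, unsigned index, uint32_t value)
{
    assert(bank < NUM_BANKS && core < NUM_CORES && index < BANK_WORDS);
    bank_t *b = &tcdm[bank];
    bool logged = false;
    for (unsigned i = 0; i < b->log_len[core]; i++)
        if (b->log[core][i].index == index)
            logged = true;
    if (!logged) {
        if (b->log_len[core] == LOG_ENTRIES)
            return false;
        b->log[core][b->log_len[core]++] = (log_entry_t){ index, b->data[index] };
    }
    b->data[index] = value;
    return true;
}

/* Abort: every bank restores the aborting core's logged values, then empties the log. */
void restore_logs(unsigned core)
{
    assert(core < NUM_CORES);
    for (unsigned k = 0; k < NUM_BANKS; k++) {  /* the DVMs do this in parallel */
        bank_t *b = &tcdm[k];
        for (unsigned i = 0; i < b->log_len[core]; i++)
            b->data[b->log[core][i].index] = b->log[core][i].old_value;
        b->log_len[core] = 0;
    }
}

/* Commit: speculative data is already in place, so the logs are just emptied. */
void clean_logs(unsigned core)
{
    assert(core < NUM_CORES);
    for (unsigned k = 0; k < NUM_BANKS; k++)
        tcdm[k].log_len[core] = 0;
}
```

Each log entry refers to a word of the same bank, so neither function ever moves data between banks; the loop over banks stands in for the independent DVMs working side by side.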
3.3 Error-Resilient Transactions
The flowchart in Figure 3 describes the semantics of our
error-resilient transactions (ERT). We start with all plat-
form components set at the safe reference voltage level (1.0V).
Each time a core encounters a new transaction it saves its
internal state and current stack and checks whether the
self-calibration procedure was previously completed and the
COP for this core is known. If the COP is still unknown,
the executing core optimistically lowers its voltage level by
a pre-defined step (0.02V). If the COP has already been
reached, then no voltage adjustment is made.
If the transaction end is reached without errors being detected,
the transaction commits. A clean_logs() process is
activated at each bank’s DVM to clean up the saved log of
the committing core in the respective bank. Note that all
these processes are triggered simultaneously by the DVMs
of all memory banks. If errors are detected, then the
transaction aborts. A restore_logs() process is activated
simultaneously at each bank’s DVM to restore all the saved log
values of the aborted core. The internal state of the aborted
core is restored, its voltage is adjusted back to the previous
safe level with increase_voltage() (a +0.02V voltage increase
beyond the recently found COP) and the core is ready to retry
the transaction. From this point on, the voltage level is no
longer reduced when starting a new transaction (see footnote 2).
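The self-calibration loop described in this section can be summarized in a small C sketch. Voltages are kept in millivolts to avoid floating-point comparisons. The type `ert_core_t`, the functions `ert_init`, `ert_begin`, `ert_end` and `run_transactions`, and the toy error model (a transaction fails exactly when the supply is below a fixed threshold) are all assumptions for illustration; the sketch does not distinguish sporadic from systematic errors and does not model critical errors.

```c
#include <assert.h>
#include <stdbool.h>

#define V_REF_MV  1000  /* safe reference voltage: 1.0V */
#define V_STEP_MV 20    /* calibration step: 0.02V */

typedef struct {
    int  vdd_mv;      /* current supply voltage of this core, in millivolts */
    bool cop_found;   /* true once calibration has hit the COP */
} ert_core_t;

void ert_init(ert_core_t *c)
{
    c->vdd_mv = V_REF_MV;
    c->cop_found = false;
}

/* Transaction start: keep probing downwards until the COP has been located. */
void ert_begin(ert_core_t *c)
{
    assert(c->vdd_mv > V_STEP_MV);
    if (!c->cop_found)
        c->vdd_mv -= V_STEP_MV;   /* corresponds to Lower_Voltage */
}

/*
 * Transaction end. errors_detected comes from the error monitors.
 * Returns true on commit; on abort it steps the voltage back up, freezes
 * further lowering, and returns false so the caller retries.
 */
bool ert_end(ert_core_t *c, bool errors_detected)
{
    if (!errors_detected)
        return true;              /* commit: clean logs, discard checkpoint */
    c->vdd_mv += V_STEP_MV;       /* corresponds to Increase_Voltage */
    c->cop_found = true;          /* abort: logs and state restored elsewhere */
    return false;
}

/* Toy error model (assumption): a transaction fails iff Vdd < min_safe_mv. */
int run_transactions(ert_core_t *c, int count, int min_safe_mv)
{
    int aborts = 0;
    for (int t = 0; t < count; t++) {
        bool committed = false;
        while (!committed) {
            ert_begin(c);
            committed = ert_end(c, c->vdd_mv < min_safe_mv);
            if (!committed)
                aborts++;
        }
    }
    return aborts;
}
```

With a threshold of 880mV, for example, the model steps down from 1000mV over six transactions, aborts once at 860mV, returns to 880mV and stays there, so the calibration costs a single retried transaction.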
3.4 Programming model
Similar to prior approaches [1] [8], we have chosen to in-
tegrate transactional memory into OpenMP, a widespread
and easy-to-use programming model. An OpenMP program
starts on a single thread of execution (the master). Once
the parallel directive is encountered, additional threads are
created, and execute the code enclosed within the syntactic
boundaries of the construct. The work is parallelized among
threads using worksharing directives. For illustration pur-
poses we describe here one of the most used among such
directives: dynamic loops. Figure 4 shows a code snippet
with a #pragma omp for directive, used to distribute loop
iterations among threads. The schedule(dynamic, CHUNK)
clause is used to specify that iterations should be grouped in
smaller sets of size CHUNK, and distributed in a dynamic
(first come, first served) fashion.
Footnote 2: In case a temperature reduction is detected, the voltage can be further decreased, as the COP has “moved” downwards.
OpenMP code:
    #pragma omp for schedule(dynamic, CHUNK)
    for (i = LB; i < UB; i++)
    { /* LOOP_BODY */ }
Transformed code:
    int start, end, work_left;
    work_left = loop_dynamic_start(LB, UB, 1, CHUNK, &start, &end);
    while (work_left)
    {
      /* ERROR-RESILIENT TRANSACTION */
      ...
      /* TRANSACTION BODY */
      for (i = start; i < end; i++)
      { /* LOOP_BODY */ }
      ...
      work_left = loop_dynamic_next(&start, &end);
    }
Figure 4: Transformed OpenMP dynamic loop
The bottom part of Figure 4 shows how this is achieved
once the code is transformed by an OpenMP compiler. Runtime
library calls are inserted to interact with an iteration
scheduler. First, the scheduler is initialized (loop_dynamic_start)
passing as parameters the original loop bounds
(LB, UB), stride and CHUNK. If there are iterations available
the function returns a positive integer (stored in work_left)
and initializes the input parameters start and end
with lower and upper boundaries for the current chunk of
iterations. The original loop body is then executed for these
iteration instances and a new call to the runtime library
(loop_dynamic_next) repeats the process until there are no
iterations left.
This mechanism can be easily augmented to wrap each
CHUNK of loop iterations within an error-resilient transac-
tion (ERT). Thus, transaction granularity at the application
level may be adjusted by modifying the CHUNK parameter
or with OpenMP loop scheduling clauses. This is important
for performance as well as energy efficiency since transaction
granularity can impact error-rate in our context.
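A minimal sketch of how a chunk of iterations might be wrapped in an ERT is shown below. The scheduler functions reuse the names from Figure 4 but are single-threaded stand-ins written for this example, not the real runtime. The hooks `ert_begin` and `ert_end`, the fault-injection counter, and the use of a saved copy of the accumulator in place of the checkpoint and data logs are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* --- Single-threaded stand-ins for the runtime calls named in Figure 4 --- */
static int sched_next, sched_ub, sched_chunk;

int loop_dynamic_next(int *start, int *end)
{
    if (sched_next >= sched_ub)
        return 0;                           /* no iterations left */
    *start = sched_next;
    *end = (sched_next + sched_chunk < sched_ub) ? sched_next + sched_chunk
                                                 : sched_ub;
    sched_next = *end;
    return 1;
}

int loop_dynamic_start(int lb, int ub, int stride, int chunk, int *start, int *end)
{
    assert(stride == 1 && chunk > 0);       /* only what this sketch supports */
    sched_next = lb;
    sched_ub = ub;
    sched_chunk = chunk;
    return loop_dynamic_next(start, end);
}

/* --- Hypothetical ERT hooks --- */
static long saved_sum;        /* stands in for the checkpoint and data logs */
static int  faults_to_inject; /* number of transactions forced to abort */

void ert_begin(const long *sum) { saved_sum = *sum; }

/* Returns true on commit; on abort rolls *sum back so the chunk is retried. */
bool ert_end(long *sum)
{
    if (faults_to_inject > 0) {
        faults_to_inject--;
        *sum = saved_sum;
        return false;
    }
    return true;
}

/* Transformed loop: each CHUNK of iterations is one error-resilient transaction. */
long sum_range(int lb, int ub, int chunk, int faults)
{
    long sum = 0;
    int i, start, end, work_left;
    faults_to_inject = faults;
    work_left = loop_dynamic_start(lb, ub, 1, chunk, &start, &end);
    while (work_left) {
        do {
            ert_begin(&sum);
            for (i = start; i < end; i++)
                sum += i;                   /* LOOP_BODY */
        } while (!ert_end(&sum));           /* retry the chunk after an abort */
        work_left = loop_dynamic_next(&start, &end);
    }
    return sum;
}
```

In this model a larger `chunk` means fewer transaction boundaries but more work thrown away per abort, which is the granularity trade-off the text refers to.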
The same scheme can be easily applied to other OpenMP
constructs (sections, task, etc.). Moreover, to ensure ro-
bust execution at every point in program execution, we silently
define ERT boundaries wherever an OpenMP construct is
encountered. The sequential execution in the master thread
is also wrapped in an ERT. Additional ERTs can be manu-
ally outlined in the code if necessary.
4. EXPERIMENTAL RESULTS
The proposed architecture has been modeled in Virtual
SoC [3], a SystemC-based cycle-accurate virtual platform
for heterogeneous System-On-Chip simulation, with back-
annotated energy numbers for every system component. The
performance, energy, and area numbers are derived from
an implementation of the platform in STMicroelectronics
28nm UTB FD-SOI technology. This approach couples the
advantages of very accurate power models with the simula-
tion speed of the SystemC models. On average, the virtual
platform shows a maximum error in timing accuracy below
6% with respect to a complete RTL simulation of the
same benchmark. We have conducted our evaluations using
real-life benchmarks from the computer vision domain:
Rotate (image rotation), Strassen (matrix multiplication),
Fast (corner detection) and Mahalanobis-Distance (cluster
analysis and classification).
Figure 5: Energy consumption at various operating conditions (voltage, temperature). Steady voltage (SV) versus transactional memory (TM). (Three bar charts, at 25°C, −40°C and 125°C, each showing normalized energy from 0.0 to 1.0 for ROTATE, STRASSEN, FAST, MD and the average, for SV and TM at 1V, 0.98V and 0.96V. Annotated average savings of TM over SV at 1V, 0.98V and 0.96V: 43%, 41%, 38% at −40°C; 30%, 25%, 22% at 25°C; 17%, 12%, 6% at 125°C.)
4.1 Overhead characterization
As a first experiment, we measured the overhead of the
proposed HW/SW support for error-resilience in terms of
energy and execution delay. The energy overhead for our
technique is quite modest: on average, only 1.7% across all
benchmarks and never more than 5% of the total system
energy. Similarly, execution time overhead is a reasonable
6.6% maximum.
Further detailing the analysis for distributed logs, on aver-
age, transactional writes took 0.7 to 1.5 extra cycles to com-
plete and increased total execution time by only 0.5%. The
log restoration time clearly depends on the data footprint
of the target application: the higher the number of writes
within a transaction, the bigger the size of the logs. For
each address that needs to be restored, 2 cycles are spent,
one for reading the original value from the log space and one
for writing it back to its original address. The worst case
restoration time per core in isolation is 32 cycles, which may
be up to 8 times slower in the unlikely event that all cores
were rolling back at the same time. Log restoration time
never accounted for more than 3% of the total benchmark
execution time. The area overhead of our proposed scheme
is quite small. In particular, our distributed per-address log-
ging scheme is space-efficient; the total log space occupies
only 3% of the TCDM space. Moreover, based on [4], the
area overhead introduced by EDS is nominal.
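The restoration-time figures above follow from the 2-cycle cost per logged address. Writing n for the number of addresses a core has to restore in the most heavily used bank, and reading the 32-cycle worst case as n = 16 (our inference from the numbers given, not something stated explicitly in the text):

```latex
\[
t_{\mathrm{restore}} = 2n\ \text{cycles},
\qquad
n = 16 \;\Rightarrow\; t_{\mathrm{restore}} = 32\ \text{cycles},
\qquad
t_{\mathrm{restore}}^{\text{all 8 cores}} \le 8 \times 32 = 256\ \text{cycles}.
\]
```

The last bound corresponds to the "up to 8 times slower" case in which all 8 cores roll back simultaneously and their accesses to a bank are serialized.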
4.2 Energy characterization
Next, we conducted a set of measurements to assess the
energy saving capabilities of our technique. The effect of
static within die variations is modeled in our platform by
considering different nominal voltages for the target fre-
quency (200MHz) among cores, with a maximum variation
of 0.04V (0.84V to 0.88V). To explore how the lowest safe
voltage level changes due to temperature variations, we ran
our experiments at three different temperature corners (25°C,
−40°C and 125°C).
We compare our transactional memory inspired techniques
(TM) to a conservative steady-voltage (SV) technique, which
uses voltage margins (guardbands) to absorb the effects of
static and dynamic variations. From the measurements on
silicon we observe that in the worst case, a 0.96V operating
voltage would be sufficient to compensate for static and tem-
perature variations. In practice, to also account for other
sources of dynamic variations (e.g., aging, voltage drops)
even more conservative voltage margins would be necessary.
Thus, for each temperature corner we consider three refer-
ence voltage levels for the SV configurations (1.0V, 0.98V,
0.96V) that remain unchanged throughout execution.
Figure 5 shows for each temperature corner the total sys-
tem energy consumption of each configuration, normalized
to the baseline SV configuration at reference voltage 1.0V.
For each application we show two groups of three bars. The
three leftmost bars correspond to the SV technique, for the
three reference voltage levels. The three rightmost bars cor-
respond to our TM technique, starting at the three different
reference voltages.
Our technique achieves significant energy savings com-
pared to conservative execution at a steady reference volt-
age, for each temperature corner. Intuitively, we observe
that at lower temperatures the energy improvement is sig-
nificantly better compared to higher temperature corners;
at lower temperatures the COP moves toward lower voltage
levels, leading to larger energy savings. For example, when
using a reference voltage of 1.0V and operating at −40°C, on
average our technique can save 43% of the energy consumed
by the conservative SV technique. Even when our reference
voltage is lower, our TM configurations still achieve better
energy savings (e.g., at −40°C, on average TM-0.98V is 41%
more energy efficient than SV-0.98V and TM-0.96V is 38%
more efficient than SV-0.96V). At ambient temperatures,
energy savings diminish, but are still quite substantial (i.e.,
30%, 25%, and 22% on average relative to the 3 different
reference voltages). Even at 125°C energy savings can still
be realized (i.e., 17%, 12%, and 6% relative to the 3 respective
reference voltages). Overall, our results show a robust,
versatile, and cost-effective technique for saving energy while
guaranteeing safe execution.
5. RELATED WORK
Many circuit-level error detection and correction techniques
continuously monitor path delay variations [7, 4]. When
an error is detected, a recovery technique is enabled that
prevents the erroneous instruction from corrupting the ar-
chitectural state. Examples of recovery techniques include
instruction replay at half clock frequency and multiple-issue
instruction replay at the same frequency [4]. While ensuring
correct system behavior, these techniques impose substan-
tial error recovery costs for many-core chips operating at
near-threshold voltage [9] to save power.
Software techniques are often more effective at providing
energy-efficient robustness to errors, by exposing variability
at lower levels of the software stack. Early approaches focus
on coarse-grained tasks [12] [6], lack generality (as they call
for custom programming methodologies) or are only suit-
able for a specific class of approximate computing programs,
in addition to imposing high recovery cycle overhead. A
more recent approach based on OpenMP extensions [19] has
shown good potential for reducing the recovery cost incurred
by HW-based error-correction techniques. Our approach has
some key differences. First, it can deal both with sporadic
timing errors (like [19]) and with systematic, COP-like error
models. To the best of our knowledge our approach is the
first to combine SW and HW techniques for dealing with
COP. Second, [19] requires error detection and correction
(multiple-issue instruction replay) in the HW, as the SW
technique alone cannot guarantee complete reliability.
Other works utilized transactional memory for error recov-
ery. The authors of [20] proposed transaction encoding, a
software implementation that combines encoded processing
for error detection and TM for error recovery. While this
design uses TM for checkpointing and rollback as we do,
it offers a pure software solution, uses encoded processing
for error detection, and does not address energy-efficiency.
FaulTM-multi [22] is an HTM-based fault detection and
recovery scheme for multi-threaded applications, with
relatively low performance overhead and good error coverage.
However, it does not target reducing energy consumption,
which is central to our implementation. The authors of
[21] studied how combining different error detection mech-
anisms and TM could potentially improve energy efficiency,
but they did not provide an actual implementation. To the
best of our knowledge, our work is the first to provide a full-
fledged HTM implementation for error resilient execution
that specifically targets energy savings.
There are various ways to implement eager data version-
ing. In [14], the authors use per-thread software transaction
logs with a stack-based structure, stored in cacheable virtual
memory, to hold the original value/address pairs. Since our
target architecture has a distributed shared memory space
rather than private L1 caches, these logs would result in
excessive cross-bank data exchanges and long communication
delays. Alternatively, the authors of [16] chose to
create a mirroring address to hold original data for each ad-
dress in the memory space. While simple, this design is not
space efficient since memory space must be doubled in order
to hold the mirrors of all the addresses. Our distributed per-
address logging scheme is considerably more space-efficient
(only 3% of the TCDM space is reserved for logs).
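
To make the eager versioning trade-off concrete, the following is a minimal software sketch of an undo log, written in Python purely for illustration. The class and method names (UndoLogMemory, begin, write, commit, abort) are hypothetical and do not correspond to the hardware described in this paper or to the designs in [14] or [16]. The sketch models a single log rather than logs distributed across memory banks. Writes update memory in place and save the original value, so a commit only discards the log, while an abort restores the saved values in reverse order.

```python
class UndoLogMemory:
    """Toy word-addressable memory with eager versioning (illustrative only)."""

    def __init__(self, size):
        self.mem = [0] * size
        self.log = []        # (address, original value) pairs, oldest first
        self.in_tx = False

    def begin(self):
        """Start a transaction with an empty undo log."""
        self.log.clear()
        self.in_tx = True

    def write(self, addr, value):
        if self.in_tx:
            # Eager versioning: save the old value before updating in place.
            self.log.append((addr, self.mem[addr]))
        self.mem[addr] = value

    def commit(self):
        """New values are already in place, so committing just drops the log."""
        self.log.clear()
        self.in_tx = False

    def abort(self):
        """Undo writes newest-first so the oldest saved value is restored last."""
        while self.log:
            addr, old = self.log.pop()
            self.mem[addr] = old
        self.in_tx = False
```

In this sketch only the addresses written inside a transaction consume log entries, whereas a mirroring design reserves a second copy of every address regardless of how many are written.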
6. CONCLUSIONS AND FUTURE WORK
In this work, we presented a novel HW/SW scheme adapted
from hardware transactional memory that dynamically adjusts
the operating voltage to an evolving COP in order to
operate at highly reduced margins. Our lightweight scheme
is integrated into the OpenMP model, making it easy to
program and easy to adjust transaction granularity. Ex-
perimental results demonstrate that our technique is par-
ticularly effective at saving energy while also offering safe
execution guarantees. Based on our findings, we conclude
that playing with fire (i.e., operating dangerously close to
the COP) instead of using conservative guardbands pays off
when our lightweight HTM mechanism is used. To the
best of our knowledge, this is the first full-fledged imple-
mentation of HTM for error resilient execution that targets
reducing energy consumption. Future work will consider a
broader range of voltage adjustment strategies in response
to COP variations.
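
The interplay between rollback and voltage adjustment summarized above can be sketched as a toy control loop. This is an illustrative model under stated assumptions, not the policy evaluated in this paper: the function name run_with_rollback, the list of candidate voltages, and the below_cop predicate standing in for hardware error detection are all hypothetical. For simplicity the sketch checkpoints by copying state rather than by eager versioning.

```python
def run_with_rollback(work, state, voltages, below_cop):
    """Retry `work` at increasing voltages until it completes without error.

    work:      function taking a dict and returning the updated dict.
    state:     dict holding the pre-transaction state (never modified here).
    voltages:  candidate supply voltages, lowest first (hypothetical values).
    below_cop: predicate voltage -> bool; True models a detected error
               (an assumption standing in for hardware error detection).
    Returns (new_state, voltage_used, retries).
    """
    for retries, v in enumerate(voltages):
        # Transaction begin: work on a private copy so `state` is the checkpoint.
        candidate = work(dict(state))
        if not below_cop(v):
            # No error detected at this voltage: commit the speculative result.
            return candidate, v, retries
        # Error detected: discard `candidate` (rollback) and retry at the
        # next higher voltage in the list.
    raise RuntimeError("no candidate voltage was error-free")
```

Forward progress in this model depends on the candidate list containing at least one voltage at which no error is detected; otherwise the function raises instead of returning.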
7. ACKNOWLEDGMENTS
This work was supported in part by NSF grants CNS-
1319095, CNS-1319495 and CNS-1301924. The authors would
also like to thank Davide Rossi for his help with the error
modeling.
8. REFERENCES
[1] W. Baek, et al., The OpenTM transactional application
programming interface. PACT, p. 376–387, 2007.
[2] S. Borkar, et al., Parameter variations and impact on
circuits and microarchitecture. DAC, p. 338–342, June 2003.
[3] D. Bortolotti, et al., VirtualSoC: A full-system simulation
environment for massively parallel heterogeneous
system-on-chip. IPDPS, p. 2182–2187, 2013.
[4] K. Bowman, et al., A 45nm resilient microprocessor core for
dynamic variation tolerance. JSSC, 46(1):194–208, Jan 2011.
[5] D. Melpignano, et al., Platform 2012, a many-core computing
accelerator for embedded SoCs: Performance evaluation of
visual analytics applications. DAC, p. 1137–1142, 2012.
[6] S. Dighe, et al., Within-die variation-aware
dynamic-voltage-frequency-scaling with optimal core
allocation and thread hopping for the 80-core teraflops
processor. JSSC, 46(1):184–193, Jan 2011.
[7] D. Ernst, et al., Razor: A low-power pipeline based on
circuit-level timing speculation. MICRO, p. 7–, 2003.
[8] C. Ferri, et al., SoC-TM: Integrated HW/SW support for
transactional memory programming on embedded MPSoCs.
CODES, p. 39–48, Taiwan, Oct 2011.
[9] M. Kakoee, et al., Variation-tolerant architecture for ultra
low power shared-L1 processor clusters. TCAS II,
59(12):927–931, Dec 2012.
[10] Kalray. MPPA 256 - Programmable Manycore Processor.
www.kalray.eu/products/mppa-manycore/mppa-256/.
[11] V. B. Kleeberger, et al., Workload- and instruction-aware
timing analysis: The missing link between technology and
system-level resilience. DAC, p. 49:1–49:6, 2014.
[12] L. Leem, et al., ERSA: Error resilient system architecture
for probabilistic applications. DATE, p. 1560–1565, 2010.
[13] P. Meinerzhagen, et al., Benchmarking of Standard-Cell
Based Memories in the Sub-VT Domain in 65-nm CMOS
Technology. JETCAS, 2011.
[14] K. E. Moore, et al., LogTM: Log-based transactional
memory. HPCA, p. 254–265, 2006.
[15] S. Narayanan, et al., Testing the critical operating point
(COP) hypothesis using FPGA emulation of timing errors in
over-scaled soft-processors. SELSE, 2009.
[16] D. Papagiannopoulou, et al., Speculative synchronization
for coherence-free embedded NUMA architectures. SAMOS,
p. 99–106, July 2014.
[17] J. Patel. CMOS process variations: A critical operation
point hypothesis. web.stanford.edu/class/ee380/
Abstracts/080402-jhpatel.pdf, 2008.
[18] Plurality Ltd. The hypercore architecture, white paper.
Technical Report version 1.7, April 2010.
[19] A. Rahimi, et al., Improving resilience to timing errors by
exposing variability effects to software in tightly-coupled
processor clusters. JETCAS, 4(2):216–229, 2014.
[20] J.-T. Wamhoff, et al., Transactional encoding for tolerating
transient hardware errors. SSS, volume 8255 of LNCS, p.
1–16. Springer Intl. Pub., 2013.
[21] G. Yalcin, et al., Combining error detection and
transactional memory for energy-efficient computing below
safe operation margins. PDP 2014, p. 248–255, Feb 2014.
[22] G. Yalcin, et al., Fault tolerance for multi-threaded
applications by leveraging hardware transactional memory.
Computing Frontiers, p. 4:1–4:9, 2013.
