Brigham Young University

BYU ScholarsArchive
Faculty Publications
2005-06-24

Correlation of Fault-Injection to Proton Accelerator Persistent
Cross Section Measurements
Keith S. Morgan
keith.morgan@byu.net

Michael J. Wirthlin
wirthlin@ee.byu.edu


BYU ScholarsArchive Citation
Morgan, Keith S. and Wirthlin, Michael J., "Correlation of Fault-Injection to Proton Accelerator Persistent
Cross Section Measurements" (2005). Faculty Publications. 369.
https://scholarsarchive.byu.edu/facpub/369

This Peer-Reviewed Article is brought to you for free and open access by BYU ScholarsArchive. It has been
accepted for inclusion in Faculty Publications by an authorized administrator of BYU ScholarsArchive. For more
information, please contact ellen_amatangelo@byu.edu.

LA-UR-05-5329
Approved for public release; distribution is unlimited.

Title: Correlation of Fault-Injection to Proton Accelerator Persistent Cross Section Measurements
Author(s): Keith Morgan, Michael Wirthlin
Submitted to: BYU D-Space website

Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by the University of California for the U.S.
Department of Energy under contract W-7405-ENG-36. By acceptance of this article, the publisher recognizes that the U.S. Government
retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or to allow others to do so, for U.S.
Government purposes. Los Alamos National Laboratory requests that the publisher identify this article as work performed under the
auspices of the U.S. Department of Energy. Los Alamos National Laboratory strongly supports academic freedom and a researcher’s right to
publish; as an institution, however, the Laboratory does not endorse the viewpoint of a publication or guarantee its technical correctness.
Form 836 (8/00)

Correlation of Fault-Injection to Proton Accelerator Persistent
Cross Section Measurements
Keith Morgan and Michael Wirthlin
Configurable Computing Laboratory
Brigham Young University
morgan@ee.byu.edu, wirthlin@ee.byu.edu
June 24, 2005

1 Introduction

Field Programmable Gate Arrays (FPGAs) are an attractive solution for space system electronics. Unfortunately, FPGAs are susceptible to radiation-induced single-event upsets (SEUs). As such, the FPGA Reliability Studies research group (http://reliability.ee.byu.edu) at Brigham Young University has studied ways to effectively measure the static, dynamic and persistent cross sections of an FPGA design, each of which is characterized in some way by how the part reacts to an SEU. One such method is to actually irradiate an FPGA and monitor how it reacts to SEUs. A cheaper, more efficient solution is to use fault-injection to emulate SEUs.
In order to validate the use of fault-injection to measure the persistent cross section of an FPGA design, we measured the persistent cross section of several designs using proton irradiation at Crocker Nuclear Laboratory in Davis, CA. Our goal was to show a high correlation between accelerator and simulation data to prove that fault-injection is a reliable alternative to proton irradiation. This document is a detailed description of the correlation process.
In this document we will outline the data collection and correlation processes. First, we will describe the hardware and circuitry used to detect and report dynamic output errors (OEs) and SEUs. Next, we will detail the software used to time and log functional errors and SEUs. We will also describe the limitations of our accelerator testing. Finally, we will describe how we post-processed the logs and the algorithms used to correlate the data.

2 Data Collection Hardware

We used a SLAAC1-V configurable computing board for our hardware test-fixture. Figure 1 is a block diagram of the SLAAC1-V board. It has four FPGAs on a PCI card. There are three Xilinx XCV1000s available for user configuration, labeled X0 through X2. The fourth, smaller FPGA, labeled XVPI, acts as a configuration controller. For our tests we used X1 for the circuit design under test (DUT) and X2 for an identical copy of X1 to be used as a “golden” or control circuit. X0 housed the logic to compare outputs from X1 and X2. XVPI contained the circuitry for configuration programming and for SEU monitoring of the X1 configuration bitstream.

2.1 Output Error Detection

FPGA X0 continuously monitored the stream of outputs from the X1 DUT and X2 golden circuits for discrepancies. On the first cycle on which a mismatch occurred, X0 issued an interrupt via the PCI interface to indicate that an output error (OE) had occurred. The interrupt remained high until it was reset externally from software. X0 also stored several user registers. One register, R0, held the number of cycles on which an OE occurred since the last register reset.
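
The compare logic described above can be modeled in a few lines of software. The following is only a behavioral sketch, not the actual X0 hardware design: the class name `OutputComparator` and its method names are hypothetical, and the outputs are reduced to one integer per clock cycle.

```python
from dataclasses import dataclass


@dataclass
class OutputComparator:
    """Behavioral model of the X0 compare logic (all names are hypothetical)."""

    interrupt: bool = False  # sticky OE interrupt line
    r0: int = 0              # cycles with a mismatch since the last R0 reset

    def clock(self, dut_out: int, golden_out: int) -> None:
        """Compare one cycle of DUT (X1) and golden (X2) outputs."""
        if dut_out != golden_out:
            self.r0 += 1
            self.interrupt = True  # stays high until software clears it

    def clear_interrupt(self) -> None:
        """Software-side reset of the interrupt line."""
        self.interrupt = False

    def reset_r0(self) -> None:
        """Software-side reset of the mismatch counter."""
        self.r0 = 0


x0 = OutputComparator()
for dut, gold in [(1, 1), (2, 3), (4, 4), (5, 7)]:
    x0.clock(dut, gold)
print(x0.interrupt, x0.r0)  # True 2
```

Keeping the interrupt sticky while R0 counts independently mirrors the text: the interrupt marks the first mismatch, and R0 tells software how many mismatching cycles followed.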

Figure 1:

2.2 SEU Detection

FPGA XVPI continuously monitored the configuration bitstream for faults. It used an external memory to house a “golden” copy of the bitstream for comparison. Mismatches, i.e. faults, were logged in a FIFO by configuration bitstream offset. If, after a read of the entire bitstream, it had logged one or more faults, an interrupt was issued via the PCI interface. The interrupt remained high until it was reset externally from software. Note that XVPI did not correct faults. The bitstream remained in a corrupted state until software issued data and commands to perform partial reconfiguration.
It is important to note that, on average, a read of the entire bitstream took 22 milliseconds. Because of this readback latency, OEs induced by an SEU were actually logged and timestamped before the SEU which could have caused them.

3 Data Collection Software

Software running on the host PC which housed the SLAAC1-V PCI card continuously waited for and logged interrupts issued by the test-fixture hardware. The software also monitored the X0 user registers and sent commands for partial reconfiguration of the bitstream. The software contained two independent threads to perform these functions. One thread reacted to the OE interrupt and one responded to the bitstream fault interrupt.

3.1 Output Error Thread

The OE monitoring thread continuously pended on the OE interrupt. Upon an OE interrupt, the thread gained context. It then followed the sequence of events on the timeline in Figure 2. Figure 3 is a more detailed flow diagram of the operations performed by the thread. First, the thread disabled the OE interrupt, logged a timestamp from a generator with millisecond accuracy and immediately went back to sleep for tf milliseconds. After tf elapsed, context returned to the thread. At this point the thread reset the register R0, which held the number of cycles on which an error occurred since the last register reset. It then went back to sleep for tm milliseconds. After tm elapsed, context returned to the thread again. At this point the thread read the contents of R0. If R0 was non-zero, or in other words, if OEs occurred during tm, the log was annotated with a flag to indicate that the original OE (represented by the timestamp) was persistent. Otherwise a flag was added to the log to indicate that a non-persistent OE occurred. After a persistent output error (POE), the thread issued a global reset of X1 and X2. Finally it re-enabled the OE interrupt and went back to sleep to wait for the next interrupt.
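
The sequence of steps performed by the OE thread can be sketched as a single handler function. This is a minimal illustration under stated assumptions, not the original host software: `FakeBoard` and its methods are hypothetical stand-ins for the SLAAC1-V driver, the `sleep` and `now_ms` parameters exist only so the example runs without real delays, and the tf and tm values used in the demo are placeholders because the source does not give them.

```python
import time
from typing import Callable, List, Tuple


class FakeBoard:
    """Stand-in for the real driver; every method name here is hypothetical."""

    def __init__(self, errors_during_tm: int) -> None:
        self.errors_during_tm = errors_during_tm  # mismatches seen during tm
        self.r0 = 0
        self.oe_irq_enabled = True
        self.global_resets = 0

    def reset_r0(self) -> None:
        self.r0 = 0

    def read_r0(self) -> int:
        # Model the mismatching cycles that accumulated while we slept for tm.
        self.r0 += self.errors_during_tm
        return self.r0

    def global_reset(self) -> None:
        self.global_resets += 1


def handle_oe_interrupt(
    board: FakeBoard,
    tf_ms: float,
    tm_ms: float,
    log: List[Tuple[int, bool]],
    sleep: Callable[[float], None] = time.sleep,
    now_ms: Callable[[], int] = lambda: int(time.time() * 1000),
) -> bool:
    """Run one OE trial and return True if the OE was persistent."""
    board.oe_irq_enabled = False        # disable the OE interrupt
    stamp = now_ms()                    # millisecond timestamp of the OE
    sleep(tf_ms / 1000.0)               # wait out the flush time tf
    board.reset_r0()                    # start counting from zero
    sleep(tm_ms / 1000.0)               # observe the outputs for tm
    persistent = board.read_r0() != 0   # any mismatch during tm means a POE
    log.append((stamp, persistent))
    if persistent:
        board.global_reset()            # reset X1 and X2 after a POE
    board.oe_irq_enabled = True         # re-enable and wait for the next OE
    return persistent


log: List[Tuple[int, bool]] = []
no_sleep = lambda seconds: None         # skip real delays in this demo
handle_oe_interrupt(FakeBoard(0), 5, 5, log, sleep=no_sleep, now_ms=lambda: 1000)
handle_oe_interrupt(FakeBoard(3), 5, 5, log, sleep=no_sleep, now_ms=lambda: 2000)
print(log)  # [(1000, False), (2000, True)]
```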

3.2 Bitstream Fault Thread

The software also included a second thread which continuously pended on the bitstream fault interrupt. This thread remained inactive until a bitstream fault interrupt was issued, at which time it gained context. The thread then disabled the bitstream fault interrupt and logged a timestamp from the same generator used by the OE thread. Next, it read each entry from the FIFO in XVPI and repaired the bitstream at the reported location. Each bitstream offset was also added to the log next to the timestamp. Finally the bitstream fault interrupt was re-enabled and the thread went back to sleep to wait for the next interrupt.
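
The golden-copy comparison of Section 2.2 and the repair loop of this thread can be illustrated together. The sketch below is a simplified software model under several assumptions: the bitstream is treated as a flat byte array, offsets are counted least-significant bit first, and the repair is a direct bit copy from the golden image rather than the partial reconfiguration used on the real board. The function names `scan_bitstream` and `service_fault_interrupt` are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple


def scan_bitstream(readback: bytes, golden: bytes) -> Deque[int]:
    """Return a FIFO of bit offsets where the readback differs from golden."""
    fifo: Deque[int] = deque()
    for byte_index, (read_byte, gold_byte) in enumerate(zip(readback, golden)):
        diff = read_byte ^ gold_byte
        for bit in range(8):
            if diff & (1 << bit):
                fifo.append(byte_index * 8 + bit)
    return fifo


def service_fault_interrupt(
    bitstream: bytearray,
    golden: bytes,
    fifo: Deque[int],
    stamp_ms: int,
    log: List[Tuple[int, List[int]]],
) -> None:
    """Drain the FIFO, repair each reported bit, and log timestamp + offsets."""
    offsets: List[int] = []
    while fifo:
        offset = fifo.popleft()
        byte_index, bit = divmod(offset, 8)
        mask = 1 << bit
        # Copy the single golden bit back into the corrupted byte.
        bitstream[byte_index] = (bitstream[byte_index] & ~mask & 0xFF) | (
            golden[byte_index] & mask
        )
        offsets.append(offset)
    log.append((stamp_ms, offsets))


golden = bytes([0b00001010, 0xFF])
device = bytearray(golden)
device[0] ^= 0b00000100  # upset bit 2 of byte 0
device[1] ^= 0b10000000  # upset bit 7 of byte 1

fault_log: List[Tuple[int, List[int]]] = []
fifo = scan_bitstream(bytes(device), golden)
service_fault_interrupt(device, golden, fifo, 1234, fault_log)
print(fault_log)                 # [(1234, [2, 15])]
print(bytes(device) == golden)   # True
```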

Figure 3:

4 Testing Limitations

Several factors affected our ability to test for and consequently predict POEs. Proton flux variations and user flip-flop upsets both impacted our testing.
Proton flux at an accelerator is difficult to modulate precisely, so proton testing is very dynamic by nature. Even slight variations in flux make it difficult to control the actual time between events. Consequently it is nearly impossible to realize completely independent trials. Figure 2 depicts the timeline of events for a single trial. If one or more additional SEUs occur during time tf, the trial is no longer independent. For example, if one of the additional SEUs during tf induces functional errors, the original configuration bit upset, marked on the timeline by the diamond, could falsely be tagged as persistent.
Upsets of user flip-flops are the other factor which affected the accuracy of our testing. Injection of faults into user flip-flops during dynamic operation is difficult within the Virtex architecture. Fortunately, the ratio of flip-flops to configuration memory latches is small; therefore it is a generally accepted practice to ignore flip-flops during fault-injection tests. However, protons do cause SEUs within flip-flops at an accelerator. This leads to seemingly unexplainable POEs not predicted by the fault-injection tool.

5 Data Correlation Software

Our ultimate goal in performing radiation experiments was to validate the use of fault-injection as an accurate method to measure cross section. To do that we had to correlate the data we collected using proton irradiation with the data we collected using fault-injection.
The first step in our correlation efforts was to parse the data we collected during proton irradiation and fault-injection and store it in a suitable data structure. Figure 5 is a UML diagram of the classes used to store our data. All events derive from the Entry class, which simply stores a timestamp. Each event is either a record of an OE or a bitstream fault. The only additional information an OE Entry stores is the real-time decision made about the persistence of that particular error. Each bitstream fault Entry holds an array of one or more configuration upset events. Each configuration upset event in turn holds an address or offset to the bit which was corrupted at the accelerator. In addition, each upset event record stores the sensitive and persistent probabilities (as percentages) predicted by fault-injection for that bit.

Figure 4:

After the parser creates the data structure for each event, all Entries are placed in an array in chronological order by timestamp. Figures 7, 10 and 12 are typical timelines corresponding to the events stored in an Entry array. The reader will recall from Section 2.2 that, due to the timing constraints on the SEU detection hardware, an OE event comes before the SEU which would have caused it.
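
The class hierarchy described in this section can be sketched with Python dataclasses. This is an illustrative reconstruction from the prose only, not the authors' code or the UML diagram itself: the field names are hypothetical, and the probabilities are stored as fractions in [0, 1] rather than percentages for convenience.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    """Base record: every logged event carries a timestamp."""
    timestamp_ms: int


@dataclass
class OEEntry(Entry):
    """Output error, plus the real-time persistence decision."""
    persistent: bool = False


@dataclass
class UpsetEvent:
    """One corrupted configuration bit and its fault-injection predictions."""
    offset: int
    sensitive_prob: float    # assumed to be a fraction in [0, 1]
    persistent_prob: float   # assumed to be a fraction in [0, 1]


@dataclass
class BitstreamFaultEntry(Entry):
    """Bitstream fault report holding one or more configuration upsets."""
    upsets: List[UpsetEvent] = field(default_factory=list)


def build_timeline(entries: List[Entry]) -> List[Entry]:
    """Return all entries in chronological order by timestamp."""
    return sorted(entries, key=lambda entry: entry.timestamp_ms)


timeline = build_timeline([
    BitstreamFaultEntry(122, [UpsetEvent(offset=40512, sensitive_prob=1.0,
                                         persistent_prob=0.25)]),
    OEEntry(100, persistent=True),
])
print([type(entry).__name__ for entry in timeline])
# ['OEEntry', 'BitstreamFaultEntry']
```

Note that the OE lands ahead of the bitstream fault in the sorted array, which is the ordering the readback latency produces in the real logs.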

6 Sensitivity Correlation

The next step in the correlation process was to re-validate that we could accurately predict the dynamic cross section. Correlation of OEs to the sensitive upsets which caused them turns out to be a rather trivial process. For each OE reported at the accelerator we simply wanted to know if it was predicted by our fault-injection tool. Figure 6 is a flow diagram of the logic used to evaluate each OE. If, after an OE event, a configuration upset was reported with a non-zero sensitive probability within a window w, then we said that the fault-injection tool correctly predicted that OE. OEs not near a configuration upset with a non-zero sensitive probability we termed unpredicted. Some of these unpredicted OEs can be attributed to upsets of configuration bits which the fault-injection tool incorrectly identified as non-sensitive. The remaining unpredicted OEs likely were caused by SEUs within user flip-flops.
Figure 7 is a hypothetical snapshot in time of events recorded at the accelerator. The OE depicted in this graphic would have been classified as predicted if the upset which occurred during the time w had a non-zero sensitive probability.

Figure 5:

It is important to note how we selected the size of the window w. To make this selection we evaluated the delta time between an OE and the first upset reported after it chronologically. Figure 8 is a histogram of the deltas for a particular design. The resulting distribution is normal. As expected, the mean is approximately 22 milliseconds, which corresponds to the average bitstream read time mentioned in Section 2.2. Excluding the outliers (which correspond to OEs caused by an upset of a user flip-flop), the edge of the distribution is approximately 45 milliseconds. We used this worst-case fault reporting time as the size of our window w so as to ensure that we included all possible upsets which could have caused an OE.
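
The sensitivity check and the delta measurement used to choose w can be sketched as follows. This is a simplified illustration rather than the original correlation software: events are reduced to plain tuples, the function names are hypothetical, the timestamps in the demo are invented, and only the 45 millisecond default comes from the text.

```python
from typing import List, Optional, Tuple

# (timestamp_ms, sensitive probability as a fraction in [0, 1])
Upset = Tuple[int, float]


def first_upset_delta(oe_time_ms: int, upsets: List[Upset]) -> Optional[int]:
    """Delta between an OE and the first upset reported after it, if any."""
    later = [t_up for t_up, _ in upsets if t_up >= oe_time_ms]
    return min(later) - oe_time_ms if later else None


def classify_oes(
    oe_times_ms: List[int], upsets: List[Upset], window_ms: int = 45
) -> Tuple[List[int], List[int]]:
    """Split OEs into (predicted, unpredicted) using a fixed window w."""
    predicted: List[int] = []
    unpredicted: List[int] = []
    for t_oe in oe_times_ms:
        # Predicted if a sensitive upset was reported within w after the OE.
        hit = any(
            0 <= t_up - t_oe <= window_ms and prob > 0 for t_up, prob in upsets
        )
        (predicted if hit else unpredicted).append(t_oe)
    return predicted, unpredicted


upsets = [(122, 0.8), (530, 0.0), (910, 1.0)]  # invented example data
oes = [100, 500, 800]

print([first_upset_delta(t, upsets) for t in oes])  # [22, 30, 110]
print(classify_oes(oes, upsets))                    # ([100], [500, 800])
```

In the demo, the OE at 500 ms is unpredicted because its only nearby upset has zero sensitive probability, and the OE at 800 ms is unpredicted because the next upset falls outside the window.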

7 Persistence Correlation

After we re-validated our ability to accurately predict the dynamic cross section, we then used several different approaches to validate our persistent cross section predictions. Correlation of POEs to the persistent upsets which caused them is a much more difficult process than sensitivity correlation. In this section we will describe the inherent limitations in persistence correlation and the different algorithms we used.

7.1 Persistence Correlation Limitations

probability

7.2 Detection Algorithm 1

The easiest and most intuitive approach to persistence correlation is to mimic the methodology for correlating sensitivity. For each POE reported at the accelerator we simply wanted to know if it was predicted by our fault-injection tool. Figure 9 is a flow diagram of the logic used to evaluate each POE. If, after a POE event, a configuration upset was reported with a non-zero persistent probability within a window w, then we said that the fault-injection tool correctly predicted that POE. POEs not near a configuration upset with a non-zero persistent probability we termed unpredicted. Some of these POEs can be attributed to upsets of configuration bits which the fault-injection tool incorrectly identified as non-persistent. The remaining unpredicted POEs likely were caused by SEUs within user flip-flops.

Figure 6:

Figure 7:

Figure 10 is a hypothetical snapshot in time of events recorded at the accelerator. The POE depicted in this graphic would have been classified as predicted if the upset which occurred during the time w had a non-zero persistent probability.
Again it is important to note how we selected the size of the window w. For this and the following algorithms we used the same approach. Each POE also had associated with it in the log the actual flush time tf for that particular trial. This time varied due to operating system overhead for context switching. Due to the dynamic length of the time tf, we used a different w for each POE. So as to ensure that all upsets which could possibly have induced the POE in question were included, we took the worst-case bitstream fault reporting time described in Section 6 and added to it the real flush time tf.
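
The per-POE window and the detection check can be sketched as follows. This is a minimal sketch assuming tuple-based events; the function names are hypothetical and the flush times and timestamps in the demo are invented for illustration. Only the 45 millisecond worst-case reporting time is taken from Section 6.

```python
from typing import List, Tuple

WORST_CASE_REPORT_MS = 45  # worst-case fault reporting time from Section 6

Poe = Tuple[int, int]      # (timestamp_ms, logged flush time tf in ms)
Upset = Tuple[int, float]  # (timestamp_ms, persistent probability in [0, 1])


def poe_window_ms(flush_ms: int) -> int:
    """Window w for one POE: worst-case report time plus its real tf."""
    return WORST_CASE_REPORT_MS + flush_ms


def classify_poes(
    poes: List[Poe], upsets: List[Upset]
) -> Tuple[List[int], List[int]]:
    """Split POE timestamps into (predicted, unpredicted)."""
    predicted: List[int] = []
    unpredicted: List[int] = []
    for t_poe, flush_ms in poes:
        w = poe_window_ms(flush_ms)
        # Predicted if a persistent upset was reported within w after the POE.
        hit = any(0 <= t_up - t_poe <= w and prob > 0 for t_up, prob in upsets)
        (predicted if hit else unpredicted).append(t_poe)
    return predicted, unpredicted


poes = [(1000, 100), (2000, 120)]                  # invented example data
upsets = [(1130, 0.6), (2100, 0.0), (2200, 0.9)]

print(poe_window_ms(100))           # 145
print(classify_poes(poes, upsets))  # ([1000], [2000])
```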

Figure 9:

Figure 10:

7.3 Prediction Algorithm

The second approach we used looked at configuration upsets rather than POEs. Here we simply wanted to know whether each configuration upset, with a non-zero persistent probability as predicted by fault-injection, actually caused a POE. Due to the non-binary probability distribution of persistence we used an approach which weighted the occurrence of a configuration upset by its persistent probability. Figure 11 is a flow diagram of the logic used to evaluate each persistent upset. We kept a sum of the persistent probability of each persistent upset. The sum equals a weighted prediction of the number of POE events the fault-injection tool predicted based on which configuration bits were upset. We also counted how many upsets had a POE within a window w before the upset. The count equals the actual number of POE events seen. A percent error can be calculated from the standard equation (predicted − actual)/predicted.
Figure 12 is a hypothetical snapshot in time of events recorded at the accelerator. Each upset has been labeled according to its fault-injection persistent probability. The persistent upset depicted in this graphic had a POE within a window w chronologically before it and therefore would have contributed to the count of actual POEs. The upset’s persistent probability would have also contributed to the sum defining the weighted prediction of POEs.

Figure 2:

Figure 8:

Figure 11:

Figure 12:

7.4 Detection Algorithm 2

The final approach we used looked at every event within a window w after a POE. This more detailed analysis allowed us to better explain why each POE event occurred. We placed each POE in one of five categories: matched, anomalous, one or more sensitive upsets, one or more non-sensitive upsets and nothing in window. POEs with one or more persistent upsets within the window w were placed in the matched category. We called errors with more than three upsets within their window anomalous and declared each such error the result of an invalid trial. From the remaining POEs, we placed those with at least one sensitive upset in the one or more sensitive category. After that we placed those with at least one upset (which by process of elimination could only be non-sensitive) in the one or more non-sensitive category. And finally, the remaining errors (which by process of elimination could have no upsets in their window) were placed in the nothing in window category.
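
Both the weighted prediction of Section 7.3 and the five-way classification above can be sketched in a few lines. This is an illustrative sketch under several assumptions, not the authors' tool: the names are hypothetical, a single fixed window is used in `weighted_prediction` for brevity where the text uses a per-POE window, and the precedence of matched over anomalous follows the order in which the text lists the categories, which the source does not state explicitly.

```python
from enum import Enum
from typing import List, Tuple


class Category(Enum):
    MATCHED = "matched"
    ANOMALOUS = "anomalous"
    SENSITIVE = "one or more sensitive upsets"
    NON_SENSITIVE = "one or more non-sensitive upsets"
    NOTHING = "nothing in window"


def categorize_poe(window_upsets: List[Tuple[float, float]]) -> Category:
    """Classify a POE from the (sensitive, persistent) probabilities of the
    upsets found inside its window w."""
    if any(persistent > 0 for _, persistent in window_upsets):
        return Category.MATCHED
    if len(window_upsets) > 3:
        return Category.ANOMALOUS       # treated as an invalid trial
    if any(sensitive > 0 for sensitive, _ in window_upsets):
        return Category.SENSITIVE
    if window_upsets:
        return Category.NON_SENSITIVE
    return Category.NOTHING


def weighted_prediction(
    upsets: List[Tuple[int, float]], poe_times_ms: List[int], window_ms: int
) -> Tuple[float, int, float]:
    """Return (predicted, actual, percent error as a fraction).

    upsets holds (timestamp_ms, persistent probability in [0, 1]).
    """
    persistent = [(t, p) for t, p in upsets if p > 0]
    predicted = sum(p for _, p in persistent)
    # An upset counts as "actual" if a POE occurred within w before it.
    actual = sum(
        1
        for t_up, _ in persistent
        if any(0 <= t_up - t_poe <= window_ms for t_poe in poe_times_ms)
    )
    if predicted == 0:
        raise ValueError("no persistent upsets; percent error is undefined")
    return predicted, actual, (predicted - actual) / predicted


print(categorize_poe([(1.0, 0.5)]).value)   # matched
print(categorize_poe([(0.3, 0.0)]).value)   # one or more sensitive upsets
print(categorize_poe([]).value)             # nothing in window

upsets = [(100, 0.5), (300, 1.0), (500, 0.5), (700, 0.0)]  # invented data
print(weighted_prediction(upsets, [290], 60))  # (2.0, 1, 0.5)
```

In the demo, three upsets have non-zero persistent probability, so the weighted prediction is 0.5 + 1.0 + 0.5 = 2.0 POEs, while only the upset at 300 ms has a POE within the window before it, giving an error of (2.0 − 1)/2.0 = 0.5.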

