A Post-Silicon Trace Analysis Approach for System-on-Chip Protocol Debug by Cao, Yuting et al.
A Post-Silicon Trace Analysis Approach for
System-on-Chip Protocol Debug
Yuting Cao, Hao Zheng
Computer Science & Engineering
U. South Florida
Tampa, FL 33620
{cao2, haozheng}@usf.edu
Sandip Ray, Jin Yang
Strategic CAD Lab
Intel
Hillsboro, OR
{sandip.ray, jin.yang}@intel.com
ABSTRACT
Reconstructing system-level behavior from silicon traces is a critical problem in post-silicon validation of System-on-Chip designs. Current industrial practice in this area is primarily manual, depending on the collaborative insights of architects, designers, and validators. This paper presents a trace analysis approach that exploits architectural models of the system-level protocols to reconstruct design behavior from partially observed silicon traces in the presence of ambiguous and noisy data. The output of the approach is the set of all potential interpretations of a system's internal executions, abstracted to the system-level protocols. To support the trace analysis approach, a companion trace signal selection framework guided by system-level protocols is also presented, and its impact on the complexity and accuracy of the analysis approach is discussed. The approach and the framework have been evaluated on a multi-core system-on-chip prototype that implements a set of common industrial system-level protocols.
1. INTRODUCTION
Post-silicon validation makes use of pre-production silicon integrated circuits (ICs) to ensure that the fabricated system works as desired under actual operating conditions with real software. It is a critical component of the design validation life-cycle for modern microprocessors and system-on-chip (SoC) designs. Unfortunately, it is also highly complex, is performed under aggressive schedules, and accounts for more than 50% of the overall design validation cost [12].

An SoC design is often composed of a large number of pre-designed hardware or software blocks (often referred to as "intellectual properties" or "IPs") that coordinate through complex protocols to implement system-level behavior [6]. An execution trace of a system typically involves activities from the CPU, audio controller, display controller, wireless radio antenna, etc., reflecting the interleaved execution of a potentially large number of communication protocols. As SoCs integrate more IPs, the interactions among the IPs become increasingly complex. Moreover, modern interconnects are highly concurrent, allowing multiple transactions to be processed simultaneously for scalability and performance, and such interconnects are an important source of design errors. On the other hand, observability limitations allow only a small number of participating signals to be traced during silicon execution. Furthermore, electrical perturbations cause silicon data to be noisy, lossy, and ambiguous. It is therefore non-trivial during post-silicon debug to identify all participating protocols and pinpoint the interleavings that result in an observed trace.
Previous work [15] proposed a method for correlating silicon traces with system-level protocol specifications. The idea was to reconstruct, from a partially observed silicon trace, protocol execution scenarios that provide abstract views of system internal executions to facilitate post-silicon SoC debug. While that work showed promising results, it has a number of deficiencies that preclude its applicability in practice. First, there was no way to qualify or rank the quality of the protocol execution scenarios generated by the reconstruction procedure. Under poor observability conditions, the algorithm could generate hundreds or thousands of potential protocol execution scenarios consistent with a partially observed trace. Without a metric to rank the quality of these reconstructions, the debugger is faced with the unenviable task of wading through these potential scenarios to infer what may actually have happened in a specific silicon execution. Second, based on past experience, interleavings of different protocol executions are a major source of functional bugs. Since the method developed in [15] does not capture orderings among different protocol executions, the results obtained with that method offer little help for bug localization and root causing.
This paper addresses the above deficiencies by introducing an optimized trace analysis approach. Central to this approach is a new formulation of protocol execution scenarios that captures ordering relations among protocol executions. Quantitative metrics are also developed to measure the quality of the results derived by the analysis approach and its efficiency. Because trace signal selection can greatly affect the complexity and accuracy of the trace analysis, a companion trace signal selection framework is proposed. This framework is communication-centric and guided by system-level protocols. Its objective is to enable the trace analysis to produce high-quality interpretations of observed silicon traces efficiently. Various trace signal selection strategies are evaluated and analyzed based on their impact on the trace analysis approach, applied to a non-trivial multi-core SoC model that implements a number of common industrial system-level protocols.
2. FLOW SPECIFICATION
An SoC model, shown in Figure 1, is used to illustrate and evaluate the work described in this paper. It consists of two CPUs (CPU_X), each with a private data cache (Cache_X), a graphics engine (GFX), a power management unit (PMU), a system memory, and three peripheral blocks: an audio control unit (Audio), a UART controller (UART), and a USB controller (USB). All these blocks are connected through an interconnect fabric (Bus).

Figure 1: The block diagram of the simple SoC model (blocks: CPU 0, CPU 1, Cache 0, Cache 1, PMU, GFX, Bus, Audio, USB, UART, Memory).

arXiv:2005.02550v1 [cs.AR] 6 May 2020
System operations are realized by executions performed in various blocks that are coordinated by system-level protocols. These protocols are typically specified in architecture documents as message flow diagrams; in this paper, the words "protocol" and "flow" are used interchangeably. As in [15], system flows are formalized using Labeled Petri nets (LPNs). Figure 2 shows the LPN of a memory write protocol initiated by CPU_X, where X ∈ {0, 1} and X′ = 1 − X.
An LPN is a tuple (P, T, E, L, s0), where P is a finite set of places, T is a finite set of transitions, E is a finite set of events, and L : T → E is a labeling function that maps each transition t ∈ T to an event e ∈ E. For each transition t ∈ T, its preset, denoted •t ⊆ P, is the set of places with an arc into t, and its postset, denoted t• ⊆ P, is the set of places that t has an arc into. A state s ⊆ P of an LPN is the subset of places marked with tokens. Two special states are associated with each LPN: the initial state s0 ⊆ P, which is the set of initially marked places, and the end state s_end, which is the set of places that have no outgoing transitions.

A transition t can be executed in a state s if •t ⊆ s. Executing t emits its labeled event and leads to a new state s′ = (s − •t) ∪ t•. Therefore, executing an LPN produces a sequence of events. The execution of an LPN completes when its end state s_end is reached.
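The firing rule can be illustrated with a minimal Python sketch. It is only an illustration of the definitions above, not the tooling of [15]: the `LPN` class, its method names, and the choice of frozensets of place names for states are assumptions made here. The usage lines encode just t1 and t2 of Figure 2 for X = 0, as a hypothetical truncated net whose end state is taken to be {p3}.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

State = FrozenSet[str]
Event = Tuple[str, str, str]  # (src, dest, cmd)


@dataclass
class LPN:
    """A labeled Petri net (P, T, E, L, s0) together with its end state."""
    pre: Dict[str, State]     # transition -> preset (input places)
    post: Dict[str, State]    # transition -> postset (output places)
    label: Dict[str, Event]   # labeling function L : T -> E
    s0: State                 # initially marked places
    s_end: State              # end state: the execution is complete here

    def enabled(self, s: State, t: str) -> bool:
        # t can be executed in s iff its preset is contained in s.
        return self.pre[t] <= s

    def fire(self, s: State, t: str) -> Tuple[Event, State]:
        # Emit L(t) and move to s' = (s - preset(t)) | postset(t).
        if not self.enabled(s, t):
            raise ValueError(f"{t} is not enabled in {sorted(s)}")
        return self.label[t], (s - self.pre[t]) | self.post[t]

    def completed(self, s: State) -> bool:
        return s == self.s_end


# Hypothetical encoding of t1 and t2 of Figure 2 for X = 0 (truncated net).
f1 = LPN(
    pre={"t1": frozenset({"p1"}), "t2": frozenset({"p2"})},
    post={"t1": frozenset({"p2"}), "t2": frozenset({"p3"})},
    label={"t1": ("CPU_0", "Cache_0", "wr_req"),
           "t2": ("Cache_0", "Cache_1", "snp_wr_req")},
    s0=frozenset({"p1"}),
    s_end=frozenset({"p3"}),
)
event, s1 = f1.fire(f1.s0, "t1")
print(event, sorted(s1))  # ('CPU_0', 'Cache_0', 'wr_req') ['p2']
```

Representing states as immutable sets keeps the update s′ = (s − •t) ∪ t• a one-line set expression and lets states be stored in other sets later.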
For example, in Figure 2, t1 can be executed in s0 = {p1}. Event (CPU_X : Cache_X : wr_req) is emitted when t1 is executed, and the LPN state becomes {p2}. The end state is s_end = {p9}.
A flow specification may also contain multiple branches describing different ways in which a system can execute the flow. For example, the flow shown in Figure 2 has three branches, covering the cases where the cache (snoop) operation hits or misses.
3. POST-SILICON TRACE ANALYSIS
3.1 Previous Work
This section recaps the previous approach in [15]. The objective of the trace analysis is to reconstruct design internal behavior with respect to given system-level flow specifications F from a partially observed silicon trace on a small number of hardware signals. The off-chip analysis includes two broad phases: (1) trace abstraction, which maps a silicon trace into a sequence of flow events, i.e., higher-level architectural constructs such as messages and operations, and (2) trace interpretation, which infers possible flow execution scenarios that are compliant with the abstracted event sequence.

Figure 2: LPN formalization of a CPU write protocol. Each LPN transition is labeled with an event (src, dest, cmd), where cmd is a command sent from a source component src to a destination component dest. The places without outgoing edges are terminals, which indicate termination of the protocols represented by the LPNs. [Places p1 to p9; transitions recovered from the diagram: t1: (CPU_X : Cache_X : wr_req); t2: (Cache_X : Cache_X′ : snp_wr_req); t3: (Cache_X′ : Cache_X : snp_wr_resp); t4: (Cache_X : Bus : wr_req); t5: (Bus : Mem : rd_req); t6: (Mem : Bus : rd_resp); t7: (Bus : Cache_X : wr_resp); t8, t9, t10: (Cache_X : CPU_X : wr_resp).]
To illustrate the basic idea, consider the system flow in Figure 2, which we call F1. Suppose that the following flow event trace is abstracted from a silicon trace observed while executing a design that implements F1.
t1 t2 t1 t2 t3 t3 t4 t4 t5 t6 t5 t6 . . .
Here the flow events are referred to by their transition names in the LPN. The first four events result in the following flow execution scenario:
{(F1,1, {p3}), (F1,2, {p3})}. (1)
A flow execution scenario is defined as a set of flow instances and their respective current states after some events are processed [15]. It can be viewed as an abstraction of system states with respect to system flows. The above execution scenario indicates that the first four events result from executing two instances of F1 from their initial states to the states shown. The first t3 may result from executing either F1,1 or F1,2, but exactly which one is unknown due to limited observability. Both cases are considered, and the two execution scenarios below are
derived from interpreting t3.
{(F1,1, {p4}), (F1,2, {p3})}
{(F1,1, {p3}), (F1,2, {p4})}. (2)
After the second t3 is handled, the two scenarios above reduce to the single scenario below.
{(F1,1, {p4}), (F1,2, {p4})}.
After the remaining six events are handled, the following execution scenario is derived.
{(F1,1, {p7}), (F1,2, {p7})}
As another example, suppose now that a buggy design generates the flow trace below.
t1 t2 t1 t2 t3 t3 t4 t4 t5 t6 t5 t11 . . .
This sequence is almost the same as the previous one, except that the last event is t11 : (Cache_X:CPU_X:rd_resp) instead of t6 : (Mem:Bus:rd_resp). Event t11 is used in a different flow specification describing a CPU memory read protocol. Analyzing the trace up to the event right before t11 leads to the execution scenarios below.
{(F1,1, {p7}), (F1,2, {p6})}
{(F1,1, {p6}), (F1,2, {p7})}. (3)
However, t11 cannot result from executing either flow instance in either scenario, which indicates that the design implementation does not comply with the given flow specification. Such an event is referred to as inconsistent. In this case, the algorithm halts and returns t11 together with the execution scenarios shown in (3) for the debugger to examine further.
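Both examples can be replayed with a short Python sketch of the scenario bookkeeping. It rests on simplifying assumptions that are ours, not those of [15]: F1 is reduced to the chain p1 → t1 → p2 → … → t8 → p9 (the branch transitions t9 and t10 are left out because their source places are not given in the text), every abstracted event is a single transition name, and a new instance is named after the trace index at which it starts, so F1@0 and F1@2 play the roles of F1,1 and F1,2. The helper names `step` and `interpret` are hypothetical.

```python
# F1 reduced to a chain (assumption): transition -> (source place, target place).
# The branch transitions t9 and t10 of Figure 2 are omitted.
F1 = {f"t{k}": (f"p{k}", f"p{k + 1}") for k in range(1, 9)}
START = "p1"


def step(scenario, event, h):
    """All scenarios consistent with `event` at trace index h.

    A scenario is a frozenset of (instance name, current place) pairs;
    as in [15], no ordering information is kept.
    """
    if event not in F1:
        return set()  # events of other flows (e.g. t11) are never accepted
    src, dst = F1[event]
    results = set()
    # The event may advance any existing instance currently in `src` ...
    for inst, place in scenario:
        if place == src:
            results.add((scenario - {(inst, place)}) | {(inst, dst)})
    # ... or, if it leaves the initial place, start a new instance of F1.
    if src == START:
        results.add(scenario | {(f"F1@{h}", dst)})
    return results


def interpret(trace):
    """Return (scenarios, -1), or (last scenarios, index of the inconsistent event)."""
    scenarios = {frozenset()}
    for h, event in enumerate(trace):
        nxt = set()
        for scen in scenarios:
            nxt |= step(scen, event, h)
        if not nxt:
            return scenarios, h
        scenarios = nxt
    return scenarios, -1


good = "t1 t2 t1 t2 t3 t3 t4 t4 t5 t6 t5 t6".split()
bad = "t1 t2 t1 t2 t3 t3 t4 t4 t5 t6 t5 t11".split()
final, at = interpret(good)
print(len(final), at)  # 1 -1   (both instances end in p7)
last, at_bad = interpret(bad)
print(len(last), at_bad)  # 2 11   (t11 is inconsistent; cf. scenarios (3))
```

Because scenarios are stored in a set, the two interpretations of the first t3 that converge after the second t3 collapse into one entry automatically.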
3.2 Flow Execution Scenarios
The trace analysis approach in [15] does not capture the orderings among flow instances in execution scenarios. However, from a debugger's point of view, communication protocols can be related. For example, a firmware loading protocol always happens before a firmware execution protocol. If a firmware execution protocol is found to happen before a firmware loading protocol, that possibly indicates an error in the system implementing these protocols. Such properties cannot be checked by the previous approach.

To address this problem, this paper presents a new definition of flow execution scenarios as
{(Fi,j, si,j, start i,j, end i,j) | Fi ∈ F}
where start i,j and end i,j are two indices representing the relative times at which Fi,j is initiated and completed. The ordering relations can be derived by comparing these start and end indices. For example, given two flow instances in an execution scenario, (Fu,v, su,v, start u,v, end u,v) and (Fx,y, sx,y, start x,y, end x,y), Fu,v is initiated before Fx,y if start u,v < start x,y, and Fx,y is initiated after Fu,v is completed if end u,v < start x,y. These ordering relations provide more accurate information for understanding system execution under limited observability. Section 3.3 explains how start i,j and end i,j are decided during the trace analysis.
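The two ordering relations can be written as small predicates over the new scenario entries. In this sketch the `FlowInstance` class and the predicate names are hypothetical; `end = -1` marks an instance that has not completed, following the −1 that Algorithm 2 assigns to newly created instances, which is why the completion check has to come before the index comparison. The firmware flows in the usage lines are illustrative placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowInstance:
    """One scenario entry (F_{i,j}, s_{i,j}, start_{i,j}, end_{i,j})."""
    flow: str
    j: int
    state: frozenset
    start: int
    end: int = -1  # -1: not completed yet (as assigned in Algorithm 2)

    @property
    def completed(self) -> bool:
        return self.end >= 0


def initiated_before(a: FlowInstance, b: FlowInstance) -> bool:
    """a is initiated before b iff start(a) < start(b)."""
    return a.start < b.start


def initiated_after_completion(b: FlowInstance, a: FlowInstance) -> bool:
    """b is initiated after a completes iff a has completed and end(a) < start(b)."""
    return a.completed and a.end < b.start


# Hypothetical firmware flows: loading should finish before execution starts.
load = FlowInstance("FW_LOAD", 0, frozenset({"done"}), start=3, end=10)
run = FlowInstance("FW_EXEC", 1, frozenset({"busy"}), start=7)
if not initiated_after_completion(run, load):
    print("FW_EXEC started before FW_LOAD completed")
```

Without the `completed` guard, an unfinished instance with `end = -1` would satisfy `end < start` for every other instance and would wrongly appear to precede it.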
To support the new definition of flow execution scenarios, the trace abstraction, which maps an observed silicon trace to a linear sequence of flow events as in [15], is also generalized. An SoC design can be viewed as a group of IP blocks networked by an on-chip interconnect fabric. These blocks communicate with each other through communication links, each of which implements a protocol, such as ARM AXI, over a set of wires. The approach presented in this paper is communication-centric in that it works on silicon traces of selected wires on selected communication links. Suppose that there are n communication links, indexed 0 to n − 1, and some wires from each link are selected for observation. A silicon trace is assumed to be a sequence α0, α1, . . ., where each αi is a vector defined as
αi = 〈α0,i, . . . , αn−1,i〉
and αk,i is the state on link k in step i.

If all wires of a link are observable, then a state on that link can be uniquely mapped to a flow event of that link. Under limited observability, a state on a link is typically mapped to a set of flow events. Therefore, a silicon trace is abstracted to a sequence ~E0, ~E1, . . ., where
~Ei = 〈E0,i, . . . , En−1,i〉 (4)
is a vector of sets of flow events abstracted from αi, and each Ek,i in ~Ei is the set of flow events abstracted from state αk,i in αi. No temporal ordering is assumed among the events in ~Ei. On the other hand, for two events ei ∈ ~Ei and ej ∈ ~Ej with i < j, ei happens before ej.
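One way to realize this abstraction is sketched below, under assumptions that are not stated in the paper: each flow event on a link is identified by a fixed value pattern on the link's wires, and a traced link state stands for every event whose pattern agrees with the observed wires. The wire encodings, event strings, and function names are hypothetical; a real interconnect protocol such as AXI would need its own decoding.

```python
from typing import Dict, List, Sequence, Set

Encoding = Dict[str, Sequence[int]]  # flow event -> value on each wire of the link


def abstract_link_state(observed: Dict[int, int], encoding: Encoding) -> Set[str]:
    """E_{k,i}: every flow event whose wire pattern agrees with the traced wires.

    `observed` maps the index of each traced wire to its value; wires that are
    not traced are simply absent, so they do not constrain the result.
    """
    return {event for event, bits in encoding.items()
            if all(bits[w] == v for w, v in observed.items())}


def abstract_step(alpha: List[Dict[int, int]], links: List[Encoding]) -> List[Set[str]]:
    """Map alpha_i = <alpha_{0,i}, ...> to the vector E_i of event sets, one per link."""
    return [abstract_link_state(a, enc) for a, enc in zip(alpha, links)]


# Hypothetical 2-wire encoding of three events on the Cache_0/Bus link.
link0 = {
    "Cache_0:Bus:wr_req": (1, 0),
    "Cache_0:Bus:rd_req": (1, 1),
    "Bus:Cache_0:wr_resp": (0, 1),
}
print(sorted(abstract_link_state({0: 1}, link0)))        # wire 1 untraced: 2 events
print(sorted(abstract_link_state({0: 1, 1: 1}, link0)))  # both wires traced: 1 event
```

Dropping a wire from `observed` can only enlarge the returned set, which mirrors how reduced observability turns one link state into several candidate flow events.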
Based on the level of information captured, this paper classifies flow execution scenarios as follows.
• Type-1 execution scenarios capture the number of instances of each flow specification initiated in a silicon trace, and the relative order of their initiations.
• Type-2 execution scenarios, on top of what is captured by Type-1 scenarios, capture the completion of each flow instance. This additional information can be used to identify potential problems if any flow instance is not completed. Furthermore, Type-2 execution scenarios capture the relative orderings among all flow instances as described above.
• Type-3 execution scenarios, on top of what is captured by Type-2 scenarios, capture information on the execution paths followed by individual flow instances. This information allows debuggers to examine in detail how each flow instance is executed.
These different execution scenarios can be used to provide different views of system execution, from coarse-grained to more detailed ones, at different stages of debug.
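As an illustration, the coarser views can be obtained by projecting information away from a detailed scenario. The sketch assumes each instance is recorded with its flow name, its current LPN state (kept here as a stand-in for the Type-3 path detail), and its start and end indices; the names `Inst`, `type1_view`, and `type2_view`, and the flow F2 in the usage lines, are hypothetical.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Tuple


class Inst(NamedTuple):
    flow: str          # name of the flow specification F_i
    state: frozenset   # current LPN state, retained only in the Type-3 view
    start: int         # index at which the instance was initiated
    end: int           # index at which it completed, -1 if it has not


def type1_view(scen: List[Inst]) -> Tuple[Dict[str, int], List[str]]:
    """Instances per flow, and flows listed in the order they were initiated."""
    counts = dict(Counter(i.flow for i in scen))
    order = [i.flow for i in sorted(scen, key=lambda i: i.start)]
    return counts, order


def type2_view(scen: List[Inst]):
    """Type-1 view plus (flow, start, end) per instance: completion and ordering."""
    counts, order = type1_view(scen)
    timing = sorted(((i.flow, i.start, i.end) for i in scen), key=lambda x: x[1])
    return counts, order, timing


# A hypothetical Type-3 scenario with two instances of F1 and one of F2.
scen = [Inst("F1", frozenset({"p9"}), 2, 9),
        Inst("F2", frozenset({"q3"}), 1, -1),
        Inst("F1", frozenset({"p7"}), 0, -1)]
print(type1_view(scen))  # ({'F1': 2, 'F2': 1}, ['F1', 'F2', 'F1'])
```

In the Type-2 view, entries whose end index is −1 are the uncompleted instances a debugger may want to inspect first.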
3.3 Algorithms
Algorithm 1 shows the top-level procedure for detecting internal flow executions from a partially observed silicon trace and checking their compliance with the given flow specifications. It takes as inputs F, a set of system-level flow specifications, and a signal trace ρ, which is assumed to be a sequence of states on a set of observable trace signals, each state uniquely indexed starting from 0.

This algorithm scans trace ρ starting from index h, initialized to 0, extracts all possible flow events from ρ at index h as described in Section 3.2 (line 6), and uses each extracted flow event to update the execution scenarios detected so far (line 11). The algorithm terminates when one of two conditions holds. If an inconsistency is encountered, the set of detected partial execution scenarios is returned along with two indices h and i (line 16). Index h provides temporal information on when the inconsistency occurs, while i provides spatial information on the communication link on which the inconsistent event is transmitted. If no inconsistency is found, the set of all execution scenarios compliant with the observed trace is returned (line 18) once index h exceeds the length of the trace.

Algorithm 1: Check-Compliance(F, ρ)
1 /* F: a set of flow specifications */
2 /* ρ: a partially observed silicon trace */
3 M ← {∅}
4 h ← 0
5 while h ≤ |ρ| do
6   ~E ← abstract(ρ, h)
7   foreach Ei of ~E do
8     inconsistent ← true
9     foreach e ∈ Ei do
10      foreach scen ∈ M do
11        Scens ← analysis(F, scen, e, h)
12        if Scens ≠ ∅ then
13          inconsistent ← false
14          M ← (M − scen) ∪ Scens
15    if inconsistent = true then
16      return (M, h, i)
17  h ← h + 1
18 return (M, −1, −1)
Algorithm 2 takes the specification F, an execution scenario scen, a flow event e, and the index h of the trace where e is extracted, and produces a set of execution scenarios R consistent with e. This algorithm performs two tasks. In the first task (lines 5-12), the algorithm checks every flow instance to decide whether it can accept e. If such an instance is found (line 7), it is updated with the new state resulting from e (line 9). Furthermore, if e causes the flow instance to complete, its index end i,j is set to h (lines 10-11), indicating that the instance completed due to event e at step h of the trace. In the second task, all possibilities where e can initiate a new flow instance are considered (lines 14-20). If a new instance can be initiated, its start i,j is set to h, indicating that the instance was initiated by an event at step h of the trace.
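The two algorithms can be combined into the following Python sketch. It is one reading of Algorithms 1 and 2 under stated assumptions, not the authors' implementation: every flow is a chain whose state is a single place, a flow specification is a hypothetical tuple (transitions, s0, s_end) keyed by event name, the trace is already abstracted into per-step lists of event sets, an empty event set means no activity on that link, and, unlike a literal reading of line 14, scenarios that cannot explain any candidate event are dropped. Instance identifiers follow the Fi,h convention of Algorithm 2, so two instances of the same flow initiated at the same step would collide.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# Assumed spec format: flow name -> (transitions, s0, s_end), where transitions
# maps an event name to (source place, target place); states are single places.
Spec = Tuple[Dict[str, Tuple[str, str]], str, str]
Entry = Tuple[str, int, str, int, int]  # (F_i, instance h, state, start, end)
Scenario = FrozenSet[Entry]


def accept(spec: Spec, state: str, e: str) -> Optional[str]:
    """Successor state if event e can fire from `state`, otherwise None."""
    move = spec[0].get(e)
    return move[1] if move is not None and move[0] == state else None


def analysis(F: Dict[str, Spec], scen: Scenario, e: str, h: int) -> Set[Scenario]:
    """Algorithm 2: all scenarios obtained from scen by interpreting e at index h."""
    R: Set[Scenario] = set()
    for entry in scen:  # task 1: e advances an existing instance
        name, j, s, start, end = entry
        s2 = accept(F[name], s, e)
        if s2 is not None:
            new_end = h if s2 == F[name][2] else end  # completed at step h
            R.add((scen - {entry}) | {(name, j, s2, start, new_end)})
    for name, spec in F.items():  # task 2: e initiates a new instance F_{i,h}
        s2 = accept(spec, spec[1], e)
        if s2 is not None:
            R.add(scen | {(name, h, s2, h, -1)})
    return R


def check_compliance(F: Dict[str, Spec], trace: List[List[Set[str]]]):
    """Algorithm 1 on an abstracted trace; trace[h][i] is E_{i,h} for link i."""
    M: Set[Scenario] = {frozenset()}
    for h, E in enumerate(trace):
        for i, E_i in enumerate(E):
            if not E_i:
                continue  # assumption: empty set = no activity on link i
            nxt = {r for scen in M for e in E_i for r in analysis(F, scen, e, h)}
            if not nxt:
                return M, h, i  # inconsistent event on link i at step h
            M = nxt
    return M, -1, -1


# Usage: F1 reduced to the chain p1 -> ... -> p9 of Figure 2, one observed link.
F = {"F1": ({f"t{k}": (f"p{k}", f"p{k + 1}") for k in range(1, 9)}, "p1", "p9")}
buggy = [[{e}] for e in "t1 t2 t1 t2 t3 t3 t4 t4 t5 t6 t5 t11".split()]
M, h, i = check_compliance(F, buggy)
print(len(M), h, i)  # 2 11 0
```

On the buggy trace from Section 3.1, the sketch stops at step 11 on link 0 and returns the two scenarios of (3), now annotated with start indices 0 and 2 and end index −1.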
3.4 On the Complexity and Accuracy
Due to limited observability, reconstructing system-level executions from an observed silicon trace is an imprecise process. The large number of execution scenarios typically derived during the analysis requires large amounts of runtime and memory to process and store, making the analysis less efficient. This is referred to as the complexity problem of the trace analysis. If a large number of execution scenarios remains after the analysis is done, the results are difficult to understand and thus less helpful for debugging. Clearly, a single flow execution scenario derived at the end of the trace analysis provides much more precise information for debug than ten candidate flow execution scenarios. This is referred to as the accuracy problem of the trace analysis.
Algorithm 2: Analysis(F, scen, e, h)
1 /* scen = {(Fi,j, si,j, starti,j, endi,j)} */
2 /* e is a flow event abstracted from the silicon trace at index h */
3 R ← ∅
4 /* Check if e can change the state of any existing flow instance of scen */
5 foreach (Fi,j, si,j, starti,j, endi,j) ∈ scen do
6   s′i,j ← accept(Fi,j, si,j, e)
7   if s′i,j ≠ ∅ then
8     Let scen′ be a copy of scen
9     Replace si,j of scen′ with s′i,j
10    if s′i,j = Fi.s_end then
11      Update endi,j of scen′ with h
12    R ← R ∪ {scen′}
13 /* Check if e can extend scen by initiating new flow instances */
14 foreach Fi ∈ F do
15   create a new instance Fi,h
16   s′i,h ← accept(Fi,h, Fi.s0, e)
17   if s′i,h ≠ ∅ then
18     Let scen′ be a copy of scen
19     scen′ ← scen′ ∪ {(Fi,h, s′i,h, h, −1)}
20     R ← R ∪ {scen′}
21 return R
The contributing factors to the complexity and accuracy
problems are explained below.
1. A signal event mapped to a set of flow events. Due to limited observability, a signal event of an observed silicon trace is often interpreted as a number of different flow events, which typically leads to the derivation of a number of different execution scenarios. This situation is exacerbated by the fact that silicon traces are often very long, which can lead to excessively large numbers of possible execution scenarios during or at the end of the analysis.
2. A flow event mapped to different temporal flow instances. Temporal flow instances are the flow instances activated by the same component, e.g., read/write flows activated by CPU_0. If several temporal instances of some flows are activated by a component, mapping flow events to those flow instances can be ambiguous. For example, suppose that an execution scenario includes two instances of the flow shown in Figure 2 activated by CPU_0, one in state {p2} and the other in state {p8}. An instance of flow event (Cache_0 : CPU_0 : wr_resp) can be mapped to either flow instance, leading to two new execution scenarios derived from the current one.
3. A flow event mapped to flow instances activated by different components. This situation can happen when flow instances that share some common events are activated by different components. For example, suppose an execution scenario has two instances of the flow shown in Figure 2, one activated by CPU_0 and the other by CPU_1, and both are in state {p6}. A flow event (Mem : Bus : rd_resp) can be mapped to either of these two instances, leading to two new execution scenarios derived from the current one.
The above issues can be mitigated by good trace signal selection, discussed in the following section. To evaluate the impact of different trace signal selections on the complexity and accuracy of the trace analysis, this paper introduces two quantitative metrics. The complexity is measured by the peak count of flow execution scenarios encountered during the analysis, i.e., the largest size of M encountered during the execution of Algorithm 1. The accuracy is measured by the final count of flow execution scenarios derived at the end of the analysis, i.e., the size of M returned on either line 16 or line 18 of Algorithm 1.
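Both metrics can be computed by recording |M| after every step. The fragment below also reproduces factor 3 above: the write flows of CPU_0 and CPU_1 (named W0 and W1 here, hypothetically) both wait in {p6}, and the shared event (Mem : Bus : rd_resp) can advance either one. Only the p6 → p7 → p8 part of Figure 2 is modeled and inconsistency handling is omitted; both are simplifications for illustration, and the function names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical fragment (p6 -> p7 -> p8) of Figure 2 instantiated for CPU_0 (W0)
# and CPU_1 (W1): event label -> (source place, target place).
FLOWS: Dict[str, Dict[str, Tuple[str, str]]] = {
    "W0": {"Mem:Bus:rd_resp": ("p6", "p7"), "Bus:Cache_0:wr_resp": ("p7", "p8")},
    "W1": {"Mem:Bus:rd_resp": ("p6", "p7"), "Bus:Cache_1:wr_resp": ("p7", "p8")},
}
Scenario = FrozenSet[Tuple[str, str]]  # (instance, current place) pairs


def successors(scen: Scenario, event: str) -> Set[Scenario]:
    """Every way `event` can advance one instance of `scen`."""
    out: Set[Scenario] = set()
    for inst, place in scen:
        move = FLOWS[inst].get(event)
        if move is not None and move[0] == place:
            out.add((scen - {(inst, place)}) | {(inst, move[1])})
    return out


def run_with_metrics(scen0: Scenario, events: List[str]):
    """Return (M, complexity, accuracy): the peak |M| seen and the final |M|."""
    M = {scen0}
    peak = len(M)
    for e in events:
        M = {s for scen in M for s in successors(scen, e)}
        peak = max(peak, len(M))
    return M, peak, len(M)


start = frozenset({("W0", "p6"), ("W1", "p6")})
_, peak1, final1 = run_with_metrics(start, ["Mem:Bus:rd_resp"])
print(peak1, final1)  # 2 2: the shared event is ambiguous
_, peak2, final2 = run_with_metrics(start, ["Mem:Bus:rd_resp"] * 2)
print(peak2, final2)  # 2 1: the second response resolves the ambiguity
```

The gap between the two runs shows why the peak and the final count are reported separately: an ambiguity can inflate the intermediate state space and still disappear by the end of the trace.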
4. TRACE SIGNAL SELECTION
Trace signal selection is a critical step in post-silicon debug. It involves two different efforts: pre-silicon and post-silicon. During pre-silicon selection, a few thousand signals among a vast number of internal signals are tapped for observation. All necessary signals must be selected at this stage; otherwise, an expensive redesign along with a silicon re-spin is required. During post-silicon debug, a small subset of those tapped signals is routed to the chip interface for tracing during system execution.
Previous work such as [4] is typically applied to gate-level design models, and the quality of the results is evaluated by the commonly used state restoration ratio. However, it is difficult to scale those methods to large and complex SoC designs. More importantly, signals selected at the gate level are often irrelevant to system-level functionalities. One attempt raises the abstraction level of trace signal selection to the register transfer level (RTL), guided by assertions [11]; however, that work does not consider system-level functionalities either. In [3], a system-level protocol guided approach is proposed. It is similar to our work in that both are based on system-level protocols. However, the selection techniques developed in [3] are simple and of little relevance to understanding silicon traces at the system level, and the evaluation was performed on an abstract transaction-level model.
This section introduces the framework shown in Figure 3 for trace signal selection guided by system-level protocols. Due to the page limit, this paper considers only pre-silicon trace signal selection. Pre-silicon selection needs to support all types of execution scenarios. Because Type-3 scenarios supersede Type-1 and Type-2 scenarios, it is sufficient to consider only Type-3 scenarios.
4.1 System Level Selection
During system level selection, different subsets of flow events are selected for observation from the given flow specifications. These results are then passed to the finer-grained bit level selection.
To support Type-3 scenarios, the start and end events of
all flow specifications must be selected. If a flow specifica-
tion has multiple branches, additional events may need to
be selected so that the branch followed by a flow instance
during system execution can be captured. Figure 4 shows
two examples of different branching structures for flows.
• In Figure 4(a), each branch ends with a unique event. There is no need to select additional events, because observing the different end events clearly identifies the branch followed during system execution.
• In Figure 4(b), branches split and then join, and the flow ends with a common event. In this case, a unique event needs to be selected for each branch.
Figure 3: A framework of trace signal selection. Flow specifications F feed the System-Level Selection, which produces sets of flow events. The Bit-Level Selection combines these sets with the RTL model to produce trace signals.
Figure 4: Examples of flow structures, panels (a) and (b).
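The branching rule above can be checked mechanically. The sketch below assumes that a branch is identified by the subset of selected events it produces. Under that assumption, it tests whether a selection tells every pair of branches apart. The event names and the two toy structures are hypothetical stand-ins for Figure 4.

```python
from itertools import combinations


def distinguishes_all(branches, selected):
    """True if each pair of branches yields a different set of selected events."""
    observed = [frozenset(b) & frozenset(selected) for b in branches]
    return all(x != y for x, y in combinations(observed, 2))


# Hypothetical toy structures: "s" is the start event.
fig_a = [{"s", "x", "e1"}, {"s", "y", "e2"}]  # (a): distinct end events
fig_b = [{"s", "x", "e"}, {"s", "y", "e"}]    # (b): branches join at end event "e"

print(distinguishes_all(fig_a, {"s", "e1", "e2"}))     # True
print(distinguishes_all(fig_b, {"s", "e"}))            # False
print(distinguishes_all(fig_b, {"s", "e", "x", "y"}))  # True
```

In case (a), the start and end events already separate the branches. In case (b), branch events have to be added to the selection.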
The flow shown in Figure 2 has three branches with a structure similar to Figure 4(b). Its start and end events are {t1, t8, t9, t10}, where t8, t9, and t10 actually refer to the same event. There is no choice for the right branch, because t10 must be selected. To distinguish the left two branches from the right one, either t2 or t3 needs to be selected. Similarly, one of the events in {t4, t5, t6, t7} needs to be selected to distinguish the left branch from the middle one. Therefore, all possible event selections for that flow are
{t1, t8, t9, t10} × {t2, t3} × {t4, t5, t6, t7},
where the start and end events {t1, t8, t9, t10} are always included as a whole.
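If the start and end events are read as a fixed mandatory set, the product above yields 2 × 4 = 8 candidate selections. The following is a minimal enumeration with Python's `itertools.product`, using the event names of Figure 2. The constant and function names are illustrative.

```python
from itertools import product

MANDATORY = ("t1", "t8", "t9", "t10")              # start/end events, always selected
CHOICES = (("t2", "t3"), ("t4", "t5", "t6", "t7"))  # pick one event from each group


def event_selections(mandatory, choices):
    """Return every candidate event selection as a frozenset."""
    return [frozenset(mandatory) | frozenset(picks) for picks in product(*choices)]


selections = event_selections(MANDATORY, CHOICES)
print(len(selections))  # 8
```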
The number of possible event selections may be large. Among them, two types of events have notable characteristics. Unique events are unique to specific flows, and shared events are shared by multiple flows. This section considers their impacts on the complexity and accuracy of the trace analysis and on signal selection. For the trace analysis, only issues #2 and #3 in Section 3.4 are considered here. Issue #1 is relevant to the bit level selection.
It is important to select events that can be mapped to the smallest number of flow instances during the trace analysis, in order to reduce the complexity and improve the accuracy. Consider the flow shown in Figure 2. Events t2 and t3 are used only in that flow, so during the trace analysis they are mapped only to instances of that particular flow. Issue #2 is addressed because instances of different flows initiated by the same component are excluded for those events. Issue #3 is addressed because instances of any
flows initiated by different components are excluded for them as well. Events t4 and t7 have a similar characteristic.
In contrast, events t5 and t6 are used in many different flows of different components, such as the read/write flows of CPU_0 or CPU_1. During the trace analysis, if multiple instances of such flows are active, it is impossible to know which of them generated those events. The analysis algorithm therefore has to map those events to those flow instances in all possible ways. This can severely degrade both the complexity and the accuracy of the trace analysis.
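The sketch below makes the contrast concrete. It counts how many ways the observed events could be assigned to active flow instances. The flow table and instances are hypothetical. The product of per-event candidate counts is only an upper bound, because it ignores any ordering or consistency pruning.

```python
from math import prod

# Hypothetical flow table: flow name -> events the flow can produce.
FLOW_EVENTS = {
    "fig2_flow": {"t1", "t2", "t3", "t4", "t5", "t6", "t7", "t10"},
    "cpu0_wr": {"t5", "t6"},
    "cpu1_rd": {"t5", "t6"},
    "cpu1_wr": {"t5", "t6"},
}


def candidate_count(event, active_flows):
    """Number of active flow instances whose flow can produce `event`."""
    return sum(1 for flow in active_flows if event in FLOW_EVENTS[flow])


def mapping_upper_bound(observed, active_flows):
    """Product of per-event candidate counts (no ordering pruning)."""
    return prod(candidate_count(e, active_flows) for e in observed)


active = ["fig2_flow", "cpu0_wr", "cpu1_rd", "cpu1_wr"]  # one instance each
print(mapping_upper_bound(["t2", "t3"], active))  # 1: unique events
print(mapping_upper_bound(["t5", "t6"], active))  # 16: shared events
```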
In terms of trace signal selection, the two types of events lead to different results. If unique events are selected, the total number of selected events can be large. As a result, many trace signals are needed to observe them. In contrast, selecting shared events can reduce the total number of events and hence the number of trace signals. The negative impacts of selecting shared events can also be mitigated if certain implementation details are available, as discussed in the next section.
4.2 Bit Level Selection
The bit level selection takes two inputs: the set of event selections produced in the previous step, and an RTL model that implements the system flow specifications. It performs two tasks:
1. Evaluate the quality of each event selection with respect to the three issues discussed in Section 3.4;
2. Choose one selection, and generate a set of candidate trace signals that implement the selected events.
The ultimate goal of the bit level selection is to produce a reduced set of candidate trace signals optimized for the trace analysis approach. Since the bit level selection depends on implementation specifics, this section can only discuss some general guidelines and tradeoffs. Note that flow specifications are typically independent of memory address and data information. Therefore, the address and data bits included in event implementations can generally be ignored.
Signals that implement the Cmd field of flow events are selected based on their respective distinguishing power. Let E be a set of flow events and W a set of signals that implement E. The distinguishing power of a subset Wi ⊆ W is defined by how E is partitioned with respect to Wi: two events fall into the same block if their encodings agree on every signal in Wi. A finer partition means higher distinguishing power. For example, suppose two flow events on link (cpu0, DCache0) are implemented by eight signals b7 … b0 with the following encodings.
(cpu0 : DCache0 : wr_req) 0100 0000
(cpu0 : DCache0 : rd_req) 1000 0000
Under these encodings, signals b5 … b0 have zero distinguishing power, since they induce a single block. b7 and b6 have equal power, so selecting either one is sufficient. Selecting signals with high distinguishing power helps to address issue #1 as discussed in Section 3.4.
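The definition of distinguishing power can be turned into a short partition computation. The sketch below groups events by their values on a chosen set of signal indices. It uses the two encodings above, with the leftmost character as b7. The function and dictionary names are illustrative.

```python
from collections import defaultdict


def partition(encodings, bits):
    """Partition events by their values on the signals in `bits`.

    encodings maps an event to its bit string written b7 ... b0
    (leftmost character is the highest-numbered signal); spaces are ignored.
    """
    blocks = defaultdict(set)
    for event, code in encodings.items():
        code = code.replace(" ", "")
        top = len(code) - 1
        key = tuple(code[top - i] for i in sorted(bits))
        blocks[key].add(event)
    return list(blocks.values())


ENCODINGS = {
    "cpu0:DCache0:wr_req": "0100 0000",
    "cpu0:DCache0:rd_req": "1000 0000",
}

for name, bits in [("b7", {7}), ("b6", {6}), ("b5..b0", range(6))]:
    print(name, len(partition(ENCODINGS, bits)))
# b7 2
# b6 2
# b5..b0 1
```

A larger number of blocks corresponds to a finer partition and therefore higher distinguishing power.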
RTL models may contain additional implementation information that can help to address issues #2 and #3. For example, memory operations may be executed out of order. In this case, CPUs usually assign unique sequence IDs to flow instances to maintain the data and control dependencies of the original programs. If sequence IDs are available, selecting the signals implementing them can help address issue #2.
If the on-chip interconnect needs to handle events from different components in a system, the events are usually assigned tags that identify their originating components. Whether tags are traced affects how events should be selected. Refer to Figure 2 for the following discussion.
1. If unique events such as t4 or t7 are selected, observing tags is not needed.
2. If shared events such as t5 or t6 are selected, tags are selected along with them.
For option 2, tags help map events to the flow instances with the same tags during the trace analysis, thus addressing issue #3. Additional signals are selected for tags. Even so, the total number of events may be smaller if the shared events are used in many different flows, which results in fewer signals for observation overall.
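The sketch below shows how traced Tag and Sid fields narrow the candidate flow instances for an observed event. It assumes that Tag names the originating component and that Sid is unique only within one component. The classes, flow names and instances are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class FlowInstance:
    flow: str        # flow specification name (hypothetical)
    component: str   # initiating component, e.g. "CPU_0"
    sid: int         # sequence ID assigned by that component


@dataclass(frozen=True)
class ObservedEvent:
    name: str
    tag: Optional[str] = None  # originating component, if Tag is traced
    sid: Optional[int] = None  # sequence ID, if Sid is traced


def candidates(ev: ObservedEvent, active: List[FlowInstance],
               flow_events: Dict[str, Set[str]]) -> List[FlowInstance]:
    """Active instances that could have produced `ev`, narrowed by traced fields."""
    result = []
    for inst in active:
        if ev.name not in flow_events[inst.flow]:
            continue  # this flow never produces the event
        if ev.tag is not None and ev.tag != inst.component:
            continue  # Tag excludes other components (issue #3)
        if ev.sid is not None and ev.sid != inst.sid:
            continue  # Sid excludes other instances (issue #2)
        result.append(inst)
    return result


FLOWS = {"rd": {"t5", "t6"}, "wr": {"t5", "t6"}}
ACTIVE = [FlowInstance("rd", "CPU_0", 1), FlowInstance("wr", "CPU_0", 2),
          FlowInstance("rd", "CPU_1", 1)]
print(len(candidates(ObservedEvent("t5"), ACTIVE, FLOWS)))                      # 3
print(len(candidates(ObservedEvent("t5", sid=1), ACTIVE, FLOWS)))               # 2
print(len(candidates(ObservedEvent("t5", tag="CPU_0", sid=1), ACTIVE, FLOWS)))  # 1
```

Sid alone leaves one candidate per component that reuses the same sequence number. Adding Tag resolves the remaining ambiguity.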
Implementation information can also allow different events to be selected. Refer to Figure 2, whose flow contains two branching places, p2 and p4. When a flow instance reaches p2, the next branch depends on whether the cache operation hits or misses. Similarly, the branch taken at p4 depends on whether the cache snoop operation hits or misses. If these two status signals are available and included for observation, there is no need to select branch events. Observing the start/end events plus those status signals is sufficient to identify the branch followed by a flow instance during system execution.
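The following sketch infers the branch from the two status signals. From the Figure 2 discussion, p2 separates the right branch from the left two, and p4 separates the left branch from the middle one. Which hit/miss outcome leads to which branch is not stated in the text. The mapping below is therefore a hypothetical assumption.

```python
from typing import Optional

# ASSUMPTION (hypothetical): a cache hit at p2 takes the right branch; on a
# miss the instance reaches p4, where a snoop hit takes the middle branch and
# a snoop miss the left branch. Figure 2 may assign the outcomes differently.
BRANCH_AT_P2_HIT = "right"
BRANCH_AT_P4 = {True: "middle", False: "left"}


def infer_branch(p2_cache_hit: bool, p4_snoop_hit: Optional[bool] = None) -> str:
    """Identify the branch of one flow instance from the two status signals."""
    if p2_cache_hit:
        return BRANCH_AT_P2_HIT
    if p4_snoop_hit is None:
        raise ValueError("a miss at p2 needs the p4 snoop status")
    return BRANCH_AT_P4[p4_snoop_hit]


print(infer_branch(True))          # right
print(infer_branch(False, True))   # middle
print(infer_branch(False, False))  # left
```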
5. EXPERIMENTAL RESULTS
To the best of our knowledge, this work is the first to present a systematic approach to post-silicon trace analysis guided by system level protocols. We were unable to find any similar previous work against which ours could be evaluated and compared. The closest work to ours is [15]. However, our work is more general and is developed with practical considerations in mind. Additionally, the work in [15] is discussed and evaluated on an abstract transaction-level model, while our approach is evaluated on an RTL model.
5.1 The Model
The ideas and techniques presented in this paper are evaluated on the multi-core SoC prototype shown in Figure 1. The prototype implements a number of common industrial system-level protocols, including cache coherence and power management. It is a cycle- and pin-accurate RTL model written in VHDL. This model is simple compared to real SoC designs. Even so, it is much more sophisticated than the gate-level benchmark suites typically considered as targets for post-silicon analysis [10, 4, 5].
Since the proposed trace analysis approach is communication-centric, the focus of this model is the implementation of system-level protocols. The CPUs are treated as a test environment where software programs are simulated in VHDL to trigger various protocols. Therefore, there is no instruction cache, as no instructions are involved when the CPUs are simulated. The peripheral blocks (GFX, PMU, Audio, etc.) are also described as abstract models that generate events to initiate flows or to respond to incoming requests.
More details on some of the system-level protocols implemented in our model can be found in [3]. They include downstream read/write protocols for each CPU, upstream read/write protocols for the peripheral blocks, and system power management protocols, all abstracted from real industrial protocols. These system-level protocols are supported by inter-block communication protocols based on ARM AXI4-Lite [1]. A total of 16 flows are implemented for this prototype.
A flow event is generated by a source and consumed by a destination through messages transmitted over the link between them. In our model, each message is organized as follows.
〈Val(1), Cmd(8), Tag(8), Sid(8), Addr(32), Data(32)〉
The meanings of the message fields are given below. The number following each field indicates its width in bits. Note that not all fields are used on all links. The model has over four thousand single-bit signals.
Val indicates validity of a message.
Cmd carries operations to be performed by the target block.
Tag is used by Bus to identify the original sources of mes-
sages from different blocks that go to the same desti-
nation, e.g. memory wr_req from Bus in response to
wr_req from both CPUs.
Sid is a unique number generated by a component to represent sequencing information of flows initiated by the same component.
Addr carries the memory address at the target block where Cmd is applied.
Data carries data to a target or from a source. Its width can vary depending on the link over which a message is sent. On the links between Cache and Bus, the width is equal to the size of the cache block, which is 64 bytes. For all the other links, the width is 32 bits.
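The message layout can be modeled as a fixed-width bit vector. The sketch below packs and unpacks the six fields in the stated order, with Data widened to 512 bits (64 bytes) on Cache-Bus links. Several details are assumptions for illustration and are not part of the prototype: Val sits in the most significant position, unused fields default to zero, and the helper names are invented here.

```python
# Field order and widths (bits) follow the message layout above. Placing Val
# in the most significant bits is an assumption made for this sketch.
FIELDS = (("val", 1), ("cmd", 8), ("tag", 8), ("sid", 8),
          ("addr", 32), ("data", 32))


def _width(name, width, data_width):
    """Data width varies by link; all other widths are fixed."""
    return data_width if name == "data" else width


def pack(values, data_width=32):
    """Pack a dict of field values into one integer; missing fields are 0."""
    word = 0
    for name, width in FIELDS:
        width = _width(name, width, data_width)
        v = values.get(name, 0)
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name}={v} does not fit in {width} bits")
        word = (word << width) | v
    return word


def unpack(word, data_width=32):
    """Inverse of pack: extract fields starting from the least significant end."""
    values = {}
    for name, width in reversed(FIELDS):
        width = _width(name, width, data_width)
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


msg = {"val": 1, "cmd": 0x40, "tag": 2, "sid": 7, "addr": 0x1000, "data": 0xDEADBEEF}
print(unpack(pack(msg)) == msg)                                  # True
print(unpack(pack(msg, data_width=512), data_width=512) == msg)  # True (Cache-Bus link)
```

With Addr and Data ignored at the bit level, only the Val, Cmd, Tag and Sid portions of each message remain candidates for tracing.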
5.2 Experiment Setup
Test Environment The prototype is simulated in a random test environment where CPUs, GFX, and other peripheral blocks are programmed to randomly select a flow to initiate in each clock cycle. The contents of Cmd, Addr, and Data in each activated flow are set randomly. Additionally, CPUs can activate power management protocols non-deterministically. Each of these blocks activates a total of 100 flow instances during the entire simulation.
Trace Signal Selection In the experiments, different
selections of trace signals are produced as discussed in sec-
tion 4, and their impacts on the complexity and accuracy
of the trace analysis approach are evaluated. The list below
explains the selections at the system level while information
on the bit level selection is given in Table 1.
S1 All events of all flow specifications and all signals implementing each event are selected. This selection offers full observation and provides a baseline for comparison with the other selections.
S2 The start and end events of all protocols are selected.
Furthermore, for each branch in each flow, one unique
event is selected.
S3 The start and end events of all protocols are selected.
Furthermore, for each branch in each flow, a highly
shared event is selected.
S4 The start and end events of all protocols are selected.
Instead of selecting events for branches in each flow,
signals whose states control the flow branching are se-
lected.
At the bit level, the Addr and Data fields are not consid-
ered. On the other hand, the Val bit is always selected so
that valid messages can be identified from observed traces.
For selections S2, S3, and S4, experiments are performed to
evaluate all combinations of Cmd, Tag and Sid fields.
5.3 Result Analysis
In Table 1, a ✓ means that all signals implementing a particular field for all selected events in selection SX are traced. Otherwise, none of those signals are traced. The # Bits row shows the total number of single-bit signals traced for each selection. As discussed in Section 4, system-level selection may choose events unique to particular flows or events shared by multiple flows. The table shows that selecting shared events (S3) leads to fewer trace signals than selecting unique events (S2). S4 selects the status signals controlling flow branching instead of any branch events, and it leads to the smallest trace signal selection.
The table clearly shows that not selecting Cmd or Sid severely impacts the trace analysis, as explained in issues #1 and #2 in Section 3.4. In contrast, not selecting Tag has negative but less severe impacts: the trace analysis still finishes, although it takes more time and memory. Next, compare the results obtained by selecting Cmd and Sid but not Tag under S2 through S4. The results with S4 are much better than those with S2 or S3. No branch events are selected for S4, so issues #2 and #3 are avoided. Combined with the benefit of fewer trace signals, this makes S4 appear to be the best option. However, not selecting any branch events may make flow execution difficult to understand if a branch is long and a system execution fails to reach the end of that branch.
In the above discussion, the selections of Cmd, Tag, and Sid are applied to all events resulting from the system-level selection. A finer selection can reduce trace signals further if unique events and shared events are considered separately. For unique events, the sources where they are generated are known from the flows, so their Tags need not be traced. Shared events may result from flow instances initiated by different components, so tracing their Tags is necessary. On the other hand, tracing or not tracing Cmds has little impact on the trace analysis. These points are supported by the results in the columns under “U S”. Under S2, compare the results under “U S” against those with all three fields selected. The runtime performance and the complexity and accuracy of the trace analysis are similar, while the finer selection traces fewer signals. Compared with the results obtained with only Cmd and Sid selected, the complexity under “U S” drops significantly. The same conclusions can be drawn for S3 and S4.
From the above discussion, signals implementing Cmd and Sid should be traced whenever possible, and as many signals implementing Tag as the trace budget allows should be traced to reduce the complexity of the trace analysis even further. If Tag or Sid is not part of the design, we recommend adding DFx circuitry to trace such information. In the above experiments, the final execution scenarios under the different signal selections, where available, contain the correct number of initiated flow instances and correctly capture the orderings among the flow instances generated by the test environment.

Table 1: Runtime results of the trace analysis with different trace signal selections. Runtime is in seconds and memory usage is in MB. − indicates that results are not available because the 10-minute time limit was exceeded.

Selection   # Bits   # scen (Final)   # scen (Max)   Time (s)   Mem (MB)
S1          870      1                1              1.628      0.516
S2 / 1      545      1                1              1.475      1.10
S2 / 2      401      −                > 1M           600        > 2 GB
S2 / 3      401      1                5184           3.679      4.2
S2 / 4      401      −                110K           600        66
S2 / US     401      1                1              1.464      1.124
S3 / 1      495      1                1              1.444      1.11
S3 / 2      367      −                > 4M           600        > 5 GB
S3 / 3      367      1                5184           3.812      4.2
S3 / 4      367      −                221K           600        101
S3 / US     367      1                1              1.426      1.1
S4 / 1      378      1                1              1.430      0.504
S4 / 2      258      −                > 1M           1.411      > 2 GB
S4 / 3      258      1                8              1.424      0.58
S4 / 4      258      −                > 8M           600        > 5 GB
S4 / US     258      1                1              1.419      1.116
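The trace-signal savings of the finer selection can be read directly off Table 1. A small calculation, assuming (as the discussion above suggests) that the 545-, 495-, and 378-bit configurations select all three fields and the 401-, 367-, and 258-bit “US” configurations use the finer unique/shared selection:

```python
# Trace-signal widths (# Bits) from Table 1: all three fields vs. the "US" selection.
BITS = {"S2": (545, 401), "S3": (495, 367), "S4": (378, 258)}


def reduction(full: int, finer: int) -> float:
    """Fraction of trace bits saved by the finer selection."""
    return (full - finer) / full


for selection, (full, finer) in BITS.items():
    print(f"{selection}: {full} -> {finer} bits ({reduction(full, finer):.1%} fewer)")
# S2: 545 -> 401 bits (26.4% fewer)
# S3: 495 -> 367 bits (25.9% fewer)
# S4: 378 -> 258 bits (31.7% fewer)
```

Roughly a quarter to a third of the trace bits are saved while, per Table 1, the final scenario count and runtime stay essentially unchanged.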
6. RELATED WORK
Our work is closely related to communication-centric and transaction-based debug. Early pioneering work by Goossens et al. [9, 14, 8] advocates focusing on observing activities on the interconnect network among IP blocks and mapping these activities to transactions for better correlation between computation and communication. A similar transaction-based debug approach is presented by Gharehbaghi and Fujita [7], who propose an automated extraction of transaction-level state machines from high-level design models. From an observed failure trace, their approach tries to derive a set of feasible transaction traces that lead to the observed failure state. However, it requires manual input and may fail to derive such traces. Singerman et al. [13] deploy a central repository of system events and simple transactions defined by architects and IP designers; it spans a wide spectrum of post-silicon validation, including DFx instrumentation, test generation, coverage, and debug. Abarbanel et al. [2] propose a model at a higher level of abstraction, called flows, which are used to specify more sophisticated cross-IP transactions such as power management and security, and to facilitate reuse of the architectural analysis effort in checking HW/SW implementations.
7. CONCLUSION
An improved trace analysis approach for post-silicon debug is presented, in which observed raw silicon traces are interpreted with respect to system flow specifications. In this approach, a new formulation of flow execution scenarios is described that can capture and represent more diverse information among flows. A trace signal selection framework is also described to support the proposed trace analysis approach. Experiments on a non-trivial SoC prototype provide insights into how different signal selections affect the complexity, efficiency, and accuracy of the trace analysis. In the future, we plan to perform a more extensive and in-depth study of trace signal selection guided by system flow specifications.
8. REFERENCES
[1] AMBA AXI and ACE protocol specification. http://www.arm.com.
[2] Y. Abarbanel, E. Singerman, and M. Y. Vardi.
Validation of soc firmware-hardware flows: Challenges
and solution directions. In Proceedings of DAC’14,
pages 2:1–2:4, 2014.
[3] M. Amrein. System-level trace signal selection for
post-silicon debug using linear programming. Master’s
thesis, Univ. of Illinois Urbana-Champaign, May 2015.
[4] K. Basu and P. Mishra. Efficient trace signal selection for post silicon validation and debug. In VLSI Design, pages 352–357. IEEE, 2011.
[5] D. Chatterjee, C. McCarter, and V. Bertacco.
Simulation-based signal selection for state restoration
in silicon debug. In ICCAD, pages 595–601. IEEE, 2011.
[6] H. D. Foster. Trends in functional verification: A 2014
industry study. In DAC, pages 48:1–48:6, 2015.
[7] A. M. Gharehbaghi and M. Fujita. Transaction-based
post-silicon debug of many-core system-on-chips. In
ISQED, pages 702–708, 2012.
[8] K. Goossens, B. Vermeulen, and A. B. Nejad. A high-level
debug environment for communication-centric debug. In
Proceedings of DATE’09, pages 202–207, 2009.
[9] K. Goossens, B. Vermeulen, R. v. Steeden, and
M. Bennebroek. Transaction-based communication-centric
debug. In Proceedings of NOCS’07, pages 95–106, 2007.
[10] H. F. Ko and N. Nicolici. Algorithms for state restoration
and trace-signal selection for data acquisition in silicon
debug. IEEE TCAD, 28(2):285–297, 2009.
[11] S. Ma, D. Pal, R. Jiang, S. Ray, and S. Vasudevan. Can’t see the forest for the trees: State restoration’s limitations in post-silicon trace signal selection. In Proceedings of ICCAD’15, pages 1–8, Piscataway, NJ, USA, 2015. IEEE Press.
[12] P. Patra. On the cusp of a validation wall. IEEE Des. Test,
24(2):193–196, Mar. 2007.
[13] E. Singerman, Y. Abarbanel, and S. Baartmans.
Transaction based pre-to-post silicon validation. In
Proceedings of DAC’11, pages 564–568, 2011.
[14] B. Vermeulen and K. Goossens. A NoC monitoring infrastructure for communication-centric debug of embedded multi-processor SoCs. In VLSI-DAT ’09, pages 183–186, 2009.
[15] H. Zheng, Y. Cao, S. Ray, and J. Yang. Protocol-guided
analysis of post-silicon traces under limited observability.
In Proceedings of ISQED’16, pages 301–306, March 2016.
APPENDIX
A. CPU READ/WRITE DOWNSTREAM PROTOCOL
X = {0, 1}
X' = 1 − X
Target = {Memory, USB, UART, AUDIO, GFX}
CMD = {read, write}
t1 : (CPU_X : Cache_X : CMD req)
t2 : (Cache_X : Cache_X' : snp CMD req)
t3 : (Cache_X' : Cache_X : snp CMD resp)
t4 : (Cache_X : Bus : CMD req)
t5 : (Bus : Target : CMD req)
t6 : (Target : Bus : CMD resp)
t7 : (Bus : Cache_X : CMD resp)
t8 : (Cache_X : CPU_X : CMD resp)
t9 : (Cache_X : CPU_X : CMD resp)
t10 : (Cache_X : CPU_X : CMD resp)
Figure 5: LPN formalization of a CPU read/write protocol (places p1–p9).
Figure 6: MSQ of CPU downstream read/write protocol.
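The parametric transitions of the downstream protocol can be instantiated mechanically for each binding of X, CMD, and Target. A minimal sketch, assuming the main path t1 to t8 of Figure 5 is linear and encoding each transition label as a (source, destination, message) tuple; `TEMPLATE` and `instantiate` are illustrative names, not part of the authors' tool.

```python
from itertools import product

X_VALUES = (0, 1)
CMDS = ("read", "write")
TARGETS = ("Memory", "USB", "UART", "AUDIO", "GFX")

# Main path t1..t8 of Figure 5 as (source, destination, message) templates.
# The branch transitions t9 and t10 are left out: where they attach in the
# net is not recoverable from the text.
TEMPLATE = [
    ("CPU_{x}", "Cache_{x}", "{cmd} req"),          # t1
    ("Cache_{x}", "Cache_{xp}", "snp {cmd} req"),   # t2: snoop the other cache
    ("Cache_{xp}", "Cache_{x}", "snp {cmd} resp"),  # t3
    ("Cache_{x}", "Bus", "{cmd} req"),              # t4
    ("Bus", "{target}", "{cmd} req"),               # t5
    ("{target}", "Bus", "{cmd} resp"),              # t6
    ("Bus", "Cache_{x}", "{cmd} resp"),             # t7
    ("Cache_{x}", "CPU_{x}", "{cmd} resp"),         # t8
]


def instantiate(x, cmd, target):
    """Bind the flow parameters and return the concrete message sequence."""
    xp = 1 - x  # X' = 1 - X
    return [
        tuple(field.format(x=x, xp=xp, cmd=cmd, target=target) for field in step)
        for step in TEMPLATE
    ]


INSTANCES = {
    (x, cmd, target): instantiate(x, cmd, target)
    for x, cmd, target in product(X_VALUES, CMDS, TARGETS)
}

print(len(INSTANCES))                    # 20 parameter bindings
print(instantiate(0, "read", "USB")[1])  # ('Cache_0', 'Cache_1', 'snp read req')
```

Keeping the flow as a template and binding parameters on demand avoids writing out 20 near-identical nets by hand.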
B. UPSTREAM READ/WRITE PROTOCOL
Initiator = {GFX, USB, AUDIO, UART}
Target = {Memory, USB, UART, AUDIO, GFX}
Note that a peripheral cannot initiate a read flow to read itself.
t1 : (Initiator : Bus : rd req)
t2 : (Bus : Cache_0 : snp req)
t3 : (Cache_0 : Cache_1 : snp req)
t4 : (Cache_1 : Cache_0 : snp resp)
t5 : (Cache_0 : Bus : snp resp)
t6 : (Bus : Target : rd/wt req)
t7 : (Target : Bus : rd/wt resp)
t8 : (Bus : Initiator : rd resp)
Figure 7: LPN formalization of an upstream read protocol (places p1–p9).
Figure 8: MSQ of upstream read protocol.
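The Initiator and Target sets, together with the rule that a peripheral never reads itself, fix how many distinct upstream read flow instances the analysis has to consider. A short enumeration sketch; `flow_bindings` is a hypothetical helper name.

```python
from itertools import product

READ_INITIATORS = ("GFX", "USB", "AUDIO", "UART")
TARGETS = ("Memory", "USB", "UART", "AUDIO", "GFX")


def flow_bindings(initiators, targets):
    """Legal (Initiator, Target) pairs: a peripheral never targets itself."""
    return [(i, t) for i, t in product(initiators, targets) if i != t]


READ_PAIRS = flow_bindings(READ_INITIATORS, TARGETS)
print(len(READ_PAIRS))  # 4 * 5 - 4 self-reads = 16
print(READ_PAIRS[:2])   # [('GFX', 'Memory'), ('GFX', 'USB')]
```

The same helper applies to the write flow with its smaller Initiator set.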
Initiator = {GFX, AUDIO}
Target = {Memory, USB, UART, AUDIO, GFX}
Note that a peripheral cannot initiate a write flow to write itself.
t1 : (Initiator : Bus : wr req)
t2 : (Bus : Cache_0 : snp wr req)
t3 : (Cache_0 : Cache_1 : snp wr req)
t4 : (Cache_1 : Cache_0 : snp wr resp)
t5 : (Cache_0 : Bus : snp wr resp)
t6 : (Bus : Target : wr req)
t7 : (Target : Bus : wr resp)
t8 : (Bus : Initiator : wr resp)
Figure 9: LPN formalization of an upstream write protocol (places p1–p9).
Figure 10: MSQ of upstream write protocol.
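Under limited observability only the selected events appear in a trace, so a flow instance is consistent with the trace if the observed messages occur in the order of its LPN path, with unobserved transitions firing silently. A minimal sketch for the upstream write protocol, assuming the main path t1 to t8 of Figure 9 is linear and ignoring any branches; `write_path` and `consistent` are hypothetical names.

```python
# Main path t1..t8 of Figure 9 as (source, destination, message) templates.
WRITE_TEMPLATE = [
    ("{i}", "Bus", "wr req"),              # t1
    ("Bus", "Cache_0", "snp wr req"),      # t2
    ("Cache_0", "Cache_1", "snp wr req"),  # t3
    ("Cache_1", "Cache_0", "snp wr resp"), # t4
    ("Cache_0", "Bus", "snp wr resp"),     # t5
    ("Bus", "{t}", "wr req"),              # t6
    ("{t}", "Bus", "wr resp"),             # t7
    ("Bus", "{i}", "wr resp"),             # t8
]


def write_path(initiator, target):
    """Concrete message sequence of one upstream write flow instance."""
    if initiator == target:
        raise ValueError("a peripheral cannot initiate a write flow to itself")
    return [tuple(f.format(i=initiator, t=target) for f in step) for step in WRITE_TEMPLATE]


def consistent(observed, path):
    """True if the observed messages appear in path order (gaps allowed)."""
    remaining = iter(path)
    # any() consumes the iterator up to and including each match,
    # so a later message can never be matched before an earlier one.
    return all(any(msg == label for label in remaining) for msg in observed)


path = write_path("GFX", "Memory")
print(consistent([path[0], path[5], path[7]], path))  # True
print(consistent([path[5], path[0]], path))           # False
```

The subsequence test is the linear special case of firing an LPN whose unobserved transitions are treated as silent.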
C. CPU WRITE BACK PROTOCOL
X = {1, 0}
Target = {Memory, USB, UART, AUDIO, GFX}
t1 : (Cache_X : Bus : wb req)
t2 : (Bus : Target : wb req)
t3 : (Target : Bus : wb resp)
Figure 11: LPN formalization of a CPU write back protocol (places p1–p4).
Figure 12: MSQ of CPU write back protocol.
D. CPU POWER ON/OFF PROTOCOL
CMD = {pwr on, pwr off}
X = {1, 0}
Target = {Memory, USB, UART, AUDIO, GFX}
t1 : (CPU_X : Cache_X : CMD req)
t2 : (Cache_X : Bus : CMD req)
t3 : (Bus : PWR : CMD req)
t4 : (PWR : Target : CMD req)
t5 : (Target : PWR : CMD resp)
t6 : (PWR : Bus : CMD resp)
t7 : (Bus : Cache_X : CMD resp)
t8 : (Cache_X : CPU_X : CMD resp)
Figure 13: LPN formalization of a CPU power on/off protocol (places p1–p9).
Figure 14: MSQ of power on/off protocol.
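Events such as (Bus : PWR : CMD req) carry the same label for both CPUs, so an untagged trace of concurrent power flows admits several interpretations. A brute-force sketch, assuming two concurrent power-on flows (X = 0 and X = 1) to the same target GFX, both chosen for illustration; it counts ways to attribute each observed message to a flow instance and is exponential, so it only illustrates the ambiguity and is not the paper's analysis algorithm. All function names are hypothetical.

```python
from itertools import product

# Transitions t1..t8 of Figure 13 as (source, destination, message) templates.
POWER_TEMPLATE = [
    ("CPU_{x}", "Cache_{x}", "{cmd} req"),
    ("Cache_{x}", "Bus", "{cmd} req"),
    ("Bus", "PWR", "{cmd} req"),
    ("PWR", "{t}", "{cmd} req"),
    ("{t}", "PWR", "{cmd} resp"),
    ("PWR", "Bus", "{cmd} resp"),
    ("Bus", "Cache_{x}", "{cmd} resp"),
    ("Cache_{x}", "CPU_{x}", "{cmd} resp"),
]


def power_path(x, cmd, target):
    return [tuple(f.format(x=x, cmd=cmd, t=target) for f in step) for step in POWER_TEMPLATE]


def in_order(msgs, path):
    """True if msgs occur in path order, skipping unobserved transitions."""
    remaining = iter(path)
    return all(any(m == label for label in remaining) for m in msgs)


def count_interpretations(observed, paths):
    """Count attributions of observed messages to instances that respect each path."""
    count = 0
    for owners in product(range(len(paths)), repeat=len(observed)):
        if all(
            in_order([m for m, o in zip(observed, owners) if o == k], path)
            for k, path in enumerate(paths)
        ):
            count += 1
    return count


paths = [power_path(0, "pwr on", "GFX"), power_path(1, "pwr on", "GFX")]
bus_req = ("Bus", "PWR", "pwr on req")  # same label for both instances
pwr_req = ("PWR", "GFX", "pwr on req")
print(count_interpretations([bus_req, bus_req, pwr_req, pwr_req], paths))          # 4
print(count_interpretations([paths[0][0], paths[1][0], bus_req, bus_req], paths))  # 2
```

Observing the CPU-side events, whose labels are unique to each instance, halves the count here; on a toy scale this mirrors why tracing Tag and Sid sharply reduces the number of execution scenarios in Table 1.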
