Protocol Guided Trace Analysis for Post-Silicon Debug Under Limited Observability by Yuting Cao
University of South Florida
Scholar Commons
Graduate Theses and Dissertations Graduate School
10-18-2016
Protocol Guided Trace Analysis for Post-Silicon
Debug Under Limited Observability
Yuting Cao
University of South Florida, cao2@mail.usf.edu
Scholar Commons Citation
Cao, Yuting, "Protocol Guided Trace Analysis for Post-Silicon Debug Under Limited Observability" (2016). Graduate Theses and
Dissertations.
http://scholarcommons.usf.edu/etd/6475
Protocol Guided Trace Analysis for Post-Silicon Debug Under Limited Observability
by
Yuting Cao
A thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science in Computer Science
Department of Computer Science and Engineering
College of Engineering
University of South Florida
Major Professor: Hao Zheng, Ph.D.
Swaroop Ghosh, Ph.D.
Srinivas Katkoori, Ph.D.
Date of Approval:
August 2, 2016
Keywords: silicon, validation, signal selection
Copyright © 2016, Yuting Cao
ACKNOWLEDGMENTS
Foremost, I would like to express my sincere gratitude to my adviser Prof. Hao Zheng
for the continuous support of my Master study and research, for his patience, motivation,
enthusiasm, and immense knowledge. His guidance helped me in all the time of research and
writing of this thesis. I could not have imagined having a better adviser and mentor for my
Master study.
Besides my adviser, I would also like to thank the rest of my thesis committee, Dr. Swaroop
Ghosh and Dr. Srinivas Katkoori, for their encouragement and insightful comments.
Finally, I must express my very profound gratitude to my parents and to my boyfriend
Jae-won Jang for providing me with unfailing support and continuous encouragement through-
out my years of study and through the process of research and the thesis writing. This
accomplishment would not have been possible without them. Thank you!
This research is partially supported by a grant from the Intel Corporation.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
1.1 Pre- and Post-silicon Validation
1.2 Post-silicon Debug Challenges and Techniques
1.2.1 Scan Based Debug Techniques
1.2.2 Trace Based Debug Techniques
1.3 Motivation
1.4 Contributions
1.5 Related Work
1.6 Thesis Organization
CHAPTER 2 BACKGROUND
2.1 Representations of SoC Protocols
2.2 Labeled Petri-Nets
CHAPTER 3 FLOW GUIDED TRACE INTERPRETATION
3.1 Post-silicon Trace Analysis
3.2 Flow Execution Scenarios
3.3 Flow Guided Trace Interpretation Algorithm
3.4 Illustration
CHAPTER 4 TRACE ANALYSIS UNDER PARTIAL OBSERVABILITY
4.1 Mapping Individual Signal Events to Flow Events
4.2 Mapping Sequences of Signal Events to Flow Events
4.3 Generalized Trace Analysis Algorithm
4.4 Difficulties and Solutions
4.5 Trace Signal Selection
4.6 Interactive Trace Interpretation
CHAPTER 5 CASE STUDIES
5.1 A Transaction-Level Model of a Simple SoC in GEM5
5.2 A Cycle-Accurate RTL Model for a Simple SoC
5.2.1 Model Implementation
5.2.2 Experimental Results
5.2.3 Debugging Experience
5.2.3.1 Bug One: Duplicated Messages
5.2.3.2 Bug Two: Incorrect Command
5.2.3.3 Bug Three: Incomplete Protocol Specification
CHAPTER 6 CONCLUSION AND FUTURE WORK
6.1 Conclusion
6.2 Future Work
LIST OF REFERENCES
APPENDICES
Appendix A Protocol Specifications in Message Sequence Chart Provided by GEM5
Appendix B Protocol Specification in LPNs Provided by GEM5
Appendix C Protocol Specification in Message Sequence Charts for the RTL Model
Appendix D Protocol Specification in LPNs for the RTL Model
Appendix E Copyright Permissions
LIST OF TABLES
Table 5.1 Runtime Results of Trace analysis.
Table 5.2 The number of flow instances derived by the trace analysis with the full observability.
Table 5.3 The number of flow instances derived by the trace analysis with certain monitors disabled.
Table 5.4 Signals explanations for Interface (1)
Table 5.5 Signals explanations for Interface (2), (3) and (4)
Table 5.6 Signals explanations for Interface (5) and (6)
LIST OF FIGURES
Figure 1.1 Major steps in the IC design flow [1]
Figure 1.2 Silicon debug vs. time-to-market
Figure 1.3 Scan-based debug [2]
Figure 1.4 Structure of an embedded logic analyzer [2]
Figure 2.1 A graphical representation of a SoC firmware load protocol [3].
Figure 2.2 Protocol in Figure 2.1 represented in graphical live sequence chart
Figure 2.3 LPN formalization of protocol in Figure 2.1
Figure 5.1 The TML model structure.
Figure 5.2 The RTL model structure.
Figure 5.3 Format definition of messages
Figure 5.4 Structures of link 1 in Figure 5.2
Figure 5.5 Structures of link (2), (3) and (4) in Figure 5.2
Figure 5.6 Structures of link 5 and 6 in Figure 5.2
Figure 5.7 Peterson’s Algorithm on two CPUs [4]
Figure 5.8 The flow specifications for CPU write operations
Figure 5.9 The flow specifications for CPU read operations
Figure 5.10 Two instances of write flow.
Figure A.1 Flow sequence chart of write operation when requested data is not included in Dcache.
Figure A.2 Flow sequence chart of write operation when XCache has the exclusive right of requested data.
Figure A.3 Flow sequence chart of write operation when requested data is shared by another component.
Figure A.4 Flow sequence chart of read operation when XCache has the exclusive right of requested data.
Figure A.5 Flow sequence chart of read operation when requested data is shared by another component.
Figure A.6 Flow sequence chart of read operation when requested data is not present in the Cache.
Figure B.1 Flow specification of a cache coherent write operation initiated from CPU1 to instruction cache.
Figure B.2 Flow specification of a cache coherent read operation initiated from CPU1 to instruction cache.
Figure B.3 Flow specification of a cache coherent read operation initiated from CPU1 to data cache.
Figure C.1 CPU write when cache has exclusive right of the requested data.
Figure C.2 CPU write when data only exist in the other CPU’s cache
Figure C.3 CPU write when requested data only reside in Memory
Figure C.4 Cache send write back request to Memory
Figure C.5 CPU read when cache has exclusive right of the requested data.
Figure C.6 CPU read when data only exist in the other CPU’s cache
Figure C.7 CPU read when requested data only reside in Memory
Figure D.1 Flow specification of a cache write back operation initiated from Cache1.
Figure D.2 Flow specification of a cache coherent write operation initiated from CPU1 to Cache.
Figure D.3 Flow specification of a cache coherent read operation initiated from CPU1 to Cache.
ABSTRACT
This thesis considers the problem of reconstructing system level behavior of an SoC
design from a partially observed signal trace. Solving this problem is a critical activity in
post-silicon validation, and currently depends primarily on human creativity and insights. In
this thesis, we provide algorithms to automatically infer system level flows from incomplete,
ambiguous, and noisy trace data. This thesis also demonstrates the approach on two case
studies, a multicore SoC model developed within the GEM5 environment, and a
cycle accurate register transfer level model of a similar SoC design.
CHAPTER 1
INTRODUCTION
This chapter briefly reviews concepts of pre- and post-silicon validation, and discusses
current challenges in post-silicon validation and possible solutions. This chapter also reviews
related work in post-silicon validation and motivates the trace based analysis
algorithm proposed in this thesis.
1.1 Pre- and Post-silicon Validation
Integrated circuits (ICs) are implemented from a system specification that describes
system behaviors and functionalities. In general, IC designers first convert the specification
into a transaction level model. This model describes the system behavior at a very high
abstraction level and thus allows the designers to explore the architecture design. After that,
designers refine the transaction level model to a register transfer level (RTL) model. The
RTL model captures the cycle accurate behaviors and the interconnections to inputs and
outputs of the digital circuits on chip [1]. Once the RTL model is verified, it is synthesized
and fabricated on silicon. A more detailed design flow is shown in Figure 1.1. To ensure
the correctness of the design, validation is conducted on the system design in each step of
the design process. As the design goes through each step, it becomes more detailed, and
it becomes more difficult to fix bugs in the later steps of the design flow. Consequently,
bugs must be found as early as possible to reduce debugging cost and time.
From "Protocol-guided analysis of post-silicon traces under limited observability", by Hao Zheng, Yuting
Cao, S. Ray, and J. Yang, 2016, ISQED, Copyright 2016 by IEEE. Reprinted with permission [5].
Figure 1.1. Major steps in the IC design flow [1]
As Moore’s law continues, IC designs are becoming more complex. As a result, the
number of system bugs keeps growing, the types of bugs become more diverse, and they are
harder to root cause. Recent studies have shown that validation in modern IC development
process takes up to 70% of design time and is still increasing [6]. Here we define validation
as an activity of ensuring that a product satisfies its specifications, works in target systems
and meets user expectations [7].
Pre-silicon validation aims to verify the design before it is implemented on a silicon chip.
It is a very important research topic because the cost of debugging and refining the design
is relatively low compared to changing the design after it is fabricated on silicon.
Pre-silicon validation techniques include simulation and emulation using field-programmable
gate arrays (FPGAs). A simulator is a software program that can show the behavior of a
design model running in a test environment. To achieve a satisfying verification coverage of
the system, a large number of test vectors need to be generated and applied during the test.
However, as the circuit size increases, the number of possible behaviors grows exponentially,
thus more test vectors are needed. Moreover, there is only a limited amount of verification
time allowed, therefore only a limited number of test vectors can be applied. As a result,
only small portions of the design can be tested with acceptable coverage using simulation
during pre-silicon validation [8]. Another drawback of simulation is its speed: because
simulation is done in software, it is multiple orders of magnitude slower than the actual
circuit. Emulation is another technique commonly used for pre-silicon validation. It
verifies the system by implementing a design on FPGAs, and runs
up to 3 orders of magnitude faster than simulation. However, this speed is still relatively
slow compared to actual circuit speed. Due to the above limitations, it is impossible for
pre-silicon verification to verify the system with high coverage, and thus it cannot guarantee
that the first silicon is error-free.
Post-silicon validation makes use of pre-production silicon ICs to ensure that the fabri-
cated design works as desired under actual operating conditions with real software. Since the
silicon executes at target clock speed, post-silicon executions are billions of times faster than
simulation, and several orders of magnitude faster than emulation. This makes it possible
to explore deep design states which cannot be exercised in pre-silicon verification, and to
identify errors missed during pre-silicon validation and debug.
1.2 Post-silicon Debug Challenges and Techniques
Post-silicon debug is very labor-intensive and may take months to finish. As shown in
Figure 1.2, it has become the most time-consuming part (on average 35%) of the circuit
development process. This is because, as the ITRS roadmap states, the time to locate the root
cause of a problem grows exponentially with the advances in process technology that produce
larger, denser, and more complex designs [9].
Figure 1.2. Silicon debug vs. time-to-market
During post-silicon debug, debuggers first need to obtain the internal signal information
from silicon. A promising technique to gather internal signal information is to insert
a design-for-debug (DfD) component into the circuit design. Two of the most commonly
used techniques are scan chains and trace buffers. More detailed definitions and debug
technologies based on these two techniques are discussed in the following subsections.
1.2.1 Scan Based Debug Techniques
The scan based techniques reuse the internal scan chains that are placed in the circuit
under debug (CUD). Scan chains were originally designed to increase the controllability and
observability of the system during manufacturing test, using the functional pins as scan pins
to load multiple scan chains concurrently and thereby reduce test time [2].
For post-silicon validation purposes, these scan chains are concatenated as shown in
Figure 1.3, where internal states are loaded and unloaded through a serial interface. During the
post-silicon debug process, when an internal state of the system is needed, debuggers can
stop the system and enable the scan chains to capture and offload the internal state elements
(a scan dump). After a scan dump is finished, the system can be resumed from where it was stopped.
Figure 1.3. Scan-based debug [2]
During post-silicon validation, scan chains can be very useful when the system is
deterministic, allowing the CUD to be stopped and resumed from any state of interest. However,
modern systems often include multiple clock domains for power efficiency. Therefore, when
the CUD is stopped, it is very hard to obtain a coherent system state across all clock domains.
Moreover, it is hard to decide when to stop the CUD, as often there is little knowledge about the
cause of a bug. For these reasons, scan chains are not practical for complicated systems.
1.2.2 Trace Based Debug Techniques
The limitations of scan chain based techniques can be addressed by using embedded
logic analyzers (ELAs) with a trace buffer. Figure 1.4 shows an example structure of an
ELA. An ELA has four components: a control unit, a trigger unit, a sample unit
(trace buffer), and an offload unit. The control unit is in charge of all the other units inside
an ELA. The trigger unit monitors a set of trigger signals to detect certain trigger events,
thus activating the sample unit to start the data acquisition. The sample unit contains a
trace buffer to record data on selected signals, and the offload unit outputs the data through
low-bandwidth device pins.
Figure 1.4. Structure of an embedded logic analyzer [2]
The amount of data that can be acquired by a trace buffer is limited by the two factors
below:
• Trace buffer width limits the number of observable trace signals.
• Trace buffer depth limits the number of samples of the observable trace signals that can be
stored.
Compared with scan chains, trace buffer based techniques allow temporal observability, mak-
ing trace analysis possible even when the location of the bug is not known. However, because
of the limitation of the trace buffer width, the number of signals that can be observed is lim-
ited. This problem can be mitigated using trace information filtering [10] and compression
techniques [11].
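To make the two limits concrete, the following is a minimal Python sketch of a sample unit, not a model of any particular ELA. The class name TraceBuffer, the signal names, and the policy of overwriting the oldest sample once the buffer is full are all assumptions made for illustration.

```python
from collections import deque


class TraceBuffer:
    """Toy sample unit (hypothetical): width bounds signals, depth bounds samples."""

    def __init__(self, width, depth, traced_signals):
        # Width limit: no more signals can be traced than the buffer is wide.
        if len(traced_signals) > width:
            raise ValueError("more trace signals selected than the buffer width allows")
        self.traced_signals = list(traced_signals)
        # Depth limit: a bounded deque discards the oldest sample when full
        # (an assumed overwrite policy, chosen only for this sketch).
        self.samples = deque(maxlen=depth)

    def sample(self, signal_values):
        """Record one cycle, keeping only the selected trace signals."""
        self.samples.append(tuple(signal_values[s] for s in self.traced_signals))


# Three signals exist in this toy design, but only two fit in the buffer.
buf = TraceBuffer(width=2, depth=3, traced_signals=["req", "ack"])
for cycle in range(5):
    buf.sample({"req": cycle % 2, "ack": (cycle + 1) % 2, "data": cycle})

stored = list(buf.samples)  # only the last 3 cycles of req/ack survive
```

In this sketch the untraced signal data is never visible, and the first two cycles are lost, which mirrors the limited observability that the trace analysis has to cope with.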
1.3 Motivation
Post-silicon debug is a critical component of the design validation life-cycle for modern
microprocessors and SoC designs. Unfortunately, it is also a highly complex component, per-
formed under aggressive schedules and accounting for more than 35% of the overall design
validation cost. Consequently, it is crucial to develop methods and techniques for streamlin-
ing and automating post-silicon validation activities.
A key component of post-silicon validation of SoC designs is to correlate traces from
silicon execution with system level protocols. An SoC design is typically composed of a
large number of pre-designed hardware or software blocks (often referred to as “intellectual
properties” or “IPs”) that coordinate through complex protocols to implement the system
level behavior. Any execution trace of the system involves a large number of interleaved
instances of these protocols. For example, consider a smartphone executing a usage scenario
where the end-user browses the Web while listening to music and sending and receiving
occasional text messages. Typical post-silicon validation use-cases involve exercising such
scenarios.
Due to observability limitations, only a small number of participating signals can be
actually traced during silicon execution. Furthermore, due to electrical perturbations, silicon
data can be noisy, lossy, and ambiguous. Consequently, it is non-trivial to identify all
participating protocols and their interleavings that result in the observed traces.
With the increasing complexity of modern SoC designs, debugging protocols
inside IP blocks by themselves is no longer enough. The complexity of the SoC
increasingly resides in the interactions between the IP blocks. Debug must be conducted at a
higher abstraction level where the computation threads and communication threads inter-
act. Therefore, communications between the IP blocks are the natural focus for system level
debug [12].
1.4 Contributions
In this thesis, we present an approach to reconstruct system level behavior from silicon
traces from system execution. This approach is based on a formalization of system level
protocols via labeled Petri-Nets, which are capable of describing sequencing, concurrency,
and choices over system events. Given a collection of system level communication protocols
and a trace on a limited set of hardware signals with missing, noisy, and ambiguous values,
this approach infers the protocol instances and their interleavings being exercised by the
trace.
The proposed approach can give debuggers an overview of the system level behaviors,
which can be used to check whether the system performs the desired behaviors. Moreover, by
checking whether a trace is compliant with the system specifications using this approach, debuggers
can decide whether those specifications are implemented correctly. When an error occurs, our
proposed method can provide useful system level information to help root cause the
problem.
1.5 Related Work
Our work in this thesis is closely related to communication-centric and transaction based
debug. An early pioneering work is described in [12], which advocates the focus on observ-
ing activities on the interconnect network among IP blocks, and mapping these activities
to transactions for better correlation between computations and communications. There-
fore, the communication transactions, as a result of software execution, provide an interface
between computation and communication, and facilitate system level debug. This work is
extended in [13, 14]. However, this line of work is focused on the network-on-chip (NoC)
architecture for interconnect using the run/stop debug control method.
A similar transaction-based debug approach is presented in [15]. Furthermore, it pro-
poses an automated extraction of state machines at transaction level from high level design
models. From an observed failure trace, it performs backtracking on this transaction level
state machine to derive a set of transaction traces that lead to the observed failure state.
In the subsequent step, bounded model checking with the constraints on the internal vari-
ables is used to refine the set of transaction traces to remove the infeasible traces. This
approach requires user inputs to identify impossible transaction sequences, and may not find
the states causing the failure if the transaction traces leading to the observed failure state
are long. Backtracking from the observed failure state requires pre-image computation, which
can be computationally expensive. A transaction-based online debug approach is proposed
in [16] to address these issues. This approach utilizes a transaction debug pattern specifi-
cation language [17] to define properties that transactions should meet. These transaction
properties are checked at runtime by programming debug units in the on-chip debug infras-
tructure, and the system can be stopped shortly after a violation is detected for any one of
those properties. In this sense, it can be viewed as the hardware assertion approaches in [18]
elevated to the transaction level.
In [19], a coherent workflow is described where the result from the pre-silicon validation
stage can be carried over to the post-silicon stage to improve efficiency and productivity of
post-silicon debug. This workflow is centered on a repository of system events and simple
transactions defined by architects and IP designers. It spans across a wide spectrum of the
post-silicon validation including DFx instrumentation, test generation, coverage, and debug.
The DFx instruments are automatically inserted into the design RTL code driven by the
defined transactions. This instrumentation is optimized for making a large set of events and
transactions observable. Test generation is also optimized to generate only the necessary
but sufficient tests to allow all defined transactions to be exercised. Moreover, coverage for
post-silicon validation is now defined at the abstract level of events and transactions rather
than the raw signals, and thus can be evaluated more efficiently. In [20], a model at an even
higher-level of abstraction, flows, is proposed. Flows are used to specify more sophisticated
cross-IP transactions such as power management, security, etc., and to facilitate reuse of the
efforts of the architectural analysis to check HW/SW implementations.
1.6 Thesis Organization
This thesis is organized as follows. We present background information in Chapter 2.
After that, our proposed method and detailed algorithm are explained in Chapter 3 and
Chapter 4. To demonstrate the correctness and importance of this method, two case studies
are constructed and explained in Chapter 5. Chapter 6 summarizes the thesis, and points
out some future directions. All the flow specifications used in the case studies are given in
the appendices.
CHAPTER 2
BACKGROUND
This chapter introduces several representations of SoC protocols, explains the basic
concepts of Labeled Petri-Nets, and explains why we choose them.
2.1 Representations of SoC Protocols
In engineering, there are two approaches to representing system protocols:
informal and formal. An informal representation is human friendly and uses
common graphical notation for better understandability and easier communication with the
client. A formal representation, on the other hand, is designed to be machine friendly. It
is usually built on rigorous mathematical notation and proofs to support more automated
verification.
System development usually requires protocols in both formal and informal formats.
At the beginning of the product development cycle, system designers create the specification
in graphical (informal) form, which is easy to understand while still following a standard
graphical notation. After the design is finalized, specifications in a formal format are developed
for verification purposes. This usually requires manual translation of an informal description
into a formal description, which consumes a large amount of time and effort, as modern systems
involve a massive number of complicated specifications. A tool that translates
live sequence charts into colored Petri nets is introduced in [21]; it can be used to speed up
the translation process.
An SoC design involves integration of a number of IPs that communicate through complex
protocols. Such system level protocols are typically specified in architecture documents as
message flow diagrams. In this thesis, we use the words “protocol” and “flow” interchange-
ably.
Figure 2.1. A graphical representation of a SoC firmware load protocol [3].
Figure 2.1 shows a protocol example that authenticates and loads a firmware during
system boot for firmware upgrade in Business Process Model and Notation (BPMN). BPMN
is a standard for business process modeling that provides a graphical notation for specifying
business processes in a Business Process Diagram (BPD), based on a flowcharting technique
very similar to activity diagrams from Unified Modeling Language (UML) [1].
To start this protocol, the Driver resets the Device, copies the needed firmware to a place
in System Memory (SM), and notifies the Device to load it. With the location of the firmware
provided by the Driver, the Device retrieves the firmware into Isolated Memory (IM) and sends the
message Auth req to the Crypto-engine (CE), providing the location of the copied firmware
and asking for authentication. After verifying the signature of the firmware in IM, the CE replies
with a PASS/FAIL status (sts). Upon receiving sts = PASS, the Device
sends a report message to the Driver and an acknowledgement message to the CE, and then jumps to
the firmware from Local Memory (LM).
The BPMN format used here is very detailed, and it is mostly used in the business
field. A graphical format more commonly used in computer engineering is the sequence
diagram, which represents the lifeline of each process and the interactions between them.
Commonly used sequence diagrams include UML sequence diagrams, message sequence
diagrams, and live sequence charts [21].
Figure 2.2. Protocol in Figure 2.1 represented in graphical live sequence chart
Figure 2.2 shows the live sequence chart representation of the protocol in Figure 2.1. In
this chart, we can clearly see the relative timing and content of the communications between
components. Unlike the BPMN format in Figure 2.1, sequence diagrams are more abstract, as
the internal activities of components are not shown.
2.2 Labeled Petri-Nets
This thesis focuses on algorithmic analysis of system behavior. It therefore needs a
formal representation with rigorous semantics and with methods and tools for analysis. The
representation selected in this thesis is the Labeled Petri-Net (LPN).
An LPN is a formalization of state transition system behavior that is capable of describing
sequencing, concurrency, and choices. Compared with sequence diagrams, an LPN is more
machine friendly, and can be analyzed using mathematical techniques and tools.
Formally, an LPN is a tuple (P, T, s0, E, L) where
• P is a finite set of places,
• T is a finite set of transitions,
• s0 ⊆ P is the initial marking,
• E is a finite set of events,
• L : T → E is a labeling function that maps each transition t ∈ T to an event e ∈ E.
For each transition t ∈ T, its preset, denoted as •t ⊆ P, is the set of places with an arc
into t, and its postset, denoted as t• ⊆ P, is the set of places with an arc from t. A marking
of an LPN is a set of places marked with tokens, and it is also referred to as a state of the LPN.
The initial marking s0, the set of initially marked places, is also the initial state of the LPN.
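The definition above can be written down as a small data structure. The following Python sketch is one possible encoding, not the implementation used in this thesis. The tuple (P, T, s0, E, L) does not name the arcs explicitly, so the arcs field is an assumption added here so that presets and postsets can be computed; the class name and the two-transition example net are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LPN:
    places: frozenset           # P
    transitions: frozenset      # T
    initial_marking: frozenset  # s0, a subset of P
    events: frozenset           # E
    label: dict                 # L: maps each transition to an event
    arcs: frozenset             # assumed: (place, transition) or (transition, place) pairs

    def preset(self, t):
        """Places with an arc into transition t."""
        return frozenset(x for (x, y) in self.arcs if y == t and x in self.places)

    def postset(self, t):
        """Places with an arc from transition t."""
        return frozenset(y for (x, y) in self.arcs if x == t and y in self.places)


# Hypothetical net: p1 -> t1 -> p2 -> t2 -> p3
net = LPN(
    places=frozenset({"p1", "p2", "p3"}),
    transitions=frozenset({"t1", "t2"}),
    initial_marking=frozenset({"p1"}),
    events=frozenset({"e1", "e2"}),
    label={"t1": "e1", "t2": "e2"},
    arcs=frozenset({("p1", "t1"), ("t1", "p2"), ("p2", "t2"), ("t2", "p3")}),
)
```

Markings are represented as sets of places, matching the definition of a state as the set of marked places.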
The communication protocol shown in Figure 2.1 is represented by the LPN shown in
Figure 2.3. This format, compared to sequence diagrams, is even more abstract. It removes
all the structure information of a system, and represents only communication activities
among the components.
Figure 2.3. LPN formalization of protocol in Figure 2.1
In this and the following figures for LPNs, the labeled circles denote places, and the
labeled boxes denote transitions. Each transition is labeled with its name and the associated
event. Each event has the form (src, dest, cmd, addr), where cmd is a command sent from
a source component src to a destination component dest, and addr is the address related
to the command, which can serve as a unique id of the request in some situations. The
protocol presented in Figure 2.3 uses the format (src, dest, cmd); here addr is omitted
because the commands do not have any related address. In the original protocol specification,
the places without outgoing edges are terminals, which indicate termination of the protocols
represented by the LPNs. The initial marking is s0 = {p1}. In this LPN model, only the
communication portion of the protocol specification is represented, while the computation
portion is ignored.
The operational semantics of an LPN is defined by transition executions. A transition can
be executed once it is enabled. A transition t ∈ T is enabled in a state s if every place in its
preset is included in s, i.e. •t ⊆ s. The set of enabled transitions in state si is denoted as
enabled(si). Execution of t ∈ enabled(s) results in a new state s′ such that
s′ = (s − •t) ∪ t•.
Let s′ = t(s) denote the new state s′ after t is executed in s. When t is executed, its
label e = L(t) is emitted. Therefore, a sequence of transition executions
t0 t1 t2 . . . ti . . .
results in a sequence of events
e0 e1 e2 . . . ei . . . , such that ∀i ≥ 0, ti ∈ enabled(si) ∧ si+1 = ti(si) ∧ ei = L(ti).
Therefore, information exchanges among components in a design can be modeled by
sequences of LPN transition executions.
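The enabling and execution rules above translate almost literally into set operations. The sketch below assumes markings are Python frozensets and that presets, postsets, and labels are given as dictionaries; the function names `enabled`, `fire`, and `run`, and the small net with a fork, are hypothetical illustrations rather than part of the thesis.

```python
def enabled(preset, s):
    """Transitions whose preset is contained in marking s."""
    return {t for t, pre in preset.items() if pre <= s}

def fire(preset, postset, s, t):
    """Compute s' = (s - preset(t)) | postset(t); t must be enabled in s."""
    if not preset[t] <= s:
        raise ValueError(f"{t} is not enabled in {sorted(s)}")
    return (s - preset[t]) | postset[t]

def run(preset, postset, label, s0, ts):
    """Execute transitions ts from s0; return (final marking, emitted events)."""
    s, events = frozenset(s0), []
    for t in ts:
        s = fire(preset, postset, s, t)
        events.append(label[t])
    return s, events

# Hypothetical net: t1 moves p1 -> p2, t2 forks p2 -> {p3, p4}, t3 moves p3 -> p5.
preset = {"t1": frozenset({"p1"}), "t2": frozenset({"p2"}), "t3": frozenset({"p3"})}
postset = {"t1": frozenset({"p2"}), "t2": frozenset({"p3", "p4"}), "t3": frozenset({"p5"})}
label = {"t1": "e_req", "t2": "e_grant", "t3": "e_done"}

final, emitted = run(preset, postset, label, {"p1"}, ["t1", "t2", "t3"])
print(sorted(final), emitted)
```

After t2 fires, both p3 and p4 are marked, which is how an LPN expresses concurrency: t3 consumes only p3 and leaves p4 untouched.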
CHAPTER 3
FLOW GUIDED TRACE INTERPRETATION
In this chapter, we describe a trace analysis method in which the observed signal traces
are interpreted at the level of system protocol specifications. In general, the trace analysis
offers debuggers a structured view of the communications among the IP blocks during
execution of the System Under Debug (SUD) by deriving, from the observed signal traces,
the types and numbers of system flows activated during SUD execution.
We formalize the trace interpretation problem in terms of labeled Petri-Nets, and discuss
algorithms to address the problem. For pedagogical reasons, here we assume full observability
of all hardware signals involved in the flow events. In the next chapter we extend the approach
to consider partial observability.
3.1 Post-silicon Trace Analysis
In a typical validation setting, the SUD is executed in a test environment until it is
terminated by the test environment or the system crashes due to a failure. During the
execution, a trace on a small number of observable signals is streamed off the chip for
debugging. The off-chip analysis includes two broad phases:
• trace abstraction that translates signal traces into flow traces
• trace interpretation that maps flow traces into flow execution scenarios
From ”Protocol-guided analysis of post-silicon traces under limited observability”, by Hao Zheng, Yuting
Cao, S.Ray and J.Yang, 2016, ISQED, Copyright 2016 by IEEE. Reprinted with permission [5].
Trace abstraction maps a signal trace into higher-level architectural constructs, e.g., mes-
sages, operations, etc. A message such as Authorization request may be implemented
in hardware through a Boolean or temporal combination of specific hardware signals in the
NoC fabric between Device and CE, e.g., as a sequence containing a header, a specific value
of a sequence of data words, etc. We refer to such architectural constructs as protocol events
or flow events. Note that due to limited observability, it may not be possible to map events
on a given set of (observed) hardware signals uniquely to a flow event. Finally, a signal trace
may result from several instances of the same protocol executing concurrently, e.g., a
firmware authentication protocol may be invoked while another instance of the protocol has
not yet completed.
Trace interpretation entails mapping a sequence of flow events created during trace
abstraction to system-level protocols in order to identify the set of protocol instances (and their
interleavings) responsible for creating the observed behavior. Trace interpretation takes
a finite trace of flow events resulting from trace abstraction and a set ~F of system flows
specified as LPNs, and generates a set of possible system flow execution scenarios, which are
defined in the next section. A flow execution scenario indicates, at a certain point of SUD
execution, which types of flows are activated, how many instances of each flow are activated,
and the current state of each instance.
The observed traces may help to identify problems in the protocols, e.g. an interleaving of
some protocol executions may lead to an unexpected message being sent or cause the system
to crash. More commonly, one finds a bug in the implementation of the protocol, i.e.,
a trace inconsistent with any possible interleaving of the protocol executions. Identifying
these problems involves significant human expertise, and can often take days to weeks of
effort. The trace analysis method and algorithm presented in this chapter are intended to
address that hurdle.
3.2 Flow Execution Scenarios
Let ~F denote the set of system flows specified as LPNs. A flow execution scenario is defined
as a set {(Fi,j, si,j)} where Fi,j is the jth instance of flow Fi ∈ ~F, and si,j is a state of Fi,j.
A flow execution scenario indicates which protocols are activated, how many instances of each
protocol are activated, and their corresponding current states. It represents a system state
during system execution, abstracted on the system flow specifications. From a debugger's point
of view, communication protocols can be related. For example, a firmware loading protocol
always happens before a firmware execution protocol. If a firmware execution protocol
happens before the firmware loading protocol, that possibly indicates an error in the system
implementing these protocols. This information can be used as an assertion during the
debug process. For this purpose, a flow execution scenario also represents the partial order
relations that define the relative orderings between the initiation and termination of different
flow instances. These relations can provide helpful information for more efficient debug.
Since we assume full observability, we view an observed trace ρ = e1e2 . . . en as a sequence
of flow events. Let
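\[
\mathit{accept}(F_{i,j}, s_{i,j}, e) =
\begin{cases}
s'_{i,j} & \text{if } \exists t,\ t \in \mathit{enabled}(s_{i,j}) \wedge L(t) = e \wedge s'_{i,j} = t(s_{i,j}) \\
\emptyset & \text{otherwise}
\end{cases}
\]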
be a function to decide if event e can be admitted by flow instance Fi,j in state si,j. The
function returns the corresponding new state if event e can be admitted, otherwise it returns
∅. This function is used in the trace analysis algorithm later in this chapter.
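As an illustration, the accept function can be sketched as follows. This is a minimal sketch under stated assumptions: a flow is a dictionary with hypothetical keys `preset`, `postset`, and `label`, markings are frozensets, and `None` stands in for the ∅ result so that it cannot be confused with an empty marking. If several enabled transitions carried the same label, this sketch would return the successor of the first one found; the definition above implicitly assumes at most one.

```python
def accept(flow, state, event):
    """Return the successor marking if `event` labels a transition enabled
    in `state`; return None (the empty-set result in the text) otherwise."""
    for t, pre in flow["preset"].items():
        if pre <= state and flow["label"][t] == event:
            return (state - pre) | flow["postset"][t]
    return None

# Hypothetical request/response flow: p1 -[t1]-> p2 -[t2]-> p3
flow = {
    "preset": {"t1": frozenset({"p1"}), "t2": frozenset({"p2"})},
    "postset": {"t1": frozenset({"p2"}), "t2": frozenset({"p3"})},
    "label": {"t1": ("cpu", "mem", "req"), "t2": ("mem", "cpu", "resp")},
}

s1 = accept(flow, frozenset({"p1"}), ("cpu", "mem", "req"))         # admitted
rejected = accept(flow, frozenset({"p1"}), ("mem", "cpu", "resp"))  # not enabled yet
print(s1, rejected)
```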
Given an observed trace ρ, the goal of trace interpretation is to construct a set of can-
didate flow execution scenarios whose execution can create the sequence of events in ρ. In
other words, ρ is the result of executing the flow instances in those execution scenarios by
following the corresponding LPN operational semantics starting from their initial states.
If every event in ρ is successfully mapped to some flow instance, we can say that ρ
is compliant with the given protocol specifications. When this happens, the algorithm
returns a set of flow execution scenarios. On the other hand, inconsistent events may also
be encountered. An event eh is inconsistent if for each flow execution scenario scen, the
following two conditions hold.
1. For each (Fi,j, si,j) ∈ scen, accept(Fi,j, si,j, eh) = ∅, and
2. For each Fi ∈ ~F, accept(Fi, initi, eh) = ∅, where initi is the initial state of Fi.
An inconsistent event eh is one that is produced by SUD execution but cannot be mapped
to any flow instance, no matter how the trace prior to eh is interpreted. Inconsistent
events may indicate possible causes of observed system failures. When the analysis algorithm
finds an inconsistent event, it returns the set of partially derived execution scenarios
along with the discovered inconsistent event eh.
3.3 Flow Guided Trace Interpretation Algorithm
Given an observed flow trace ρ and the set ~F of system protocol specifications,
Algorithm 1 describes a basic procedure for computing a set of compliant flow execution
scenarios. The algorithm operates by keeping track (in variable Scen) of a set of candidate flow
execution scenarios compliant with each prefix of ρ. At each iteration, for each event eh in
the observed trace, the algorithm updates Scen by either updating the state of a member of
scen or by initiating a new flow instance, for each scen ∈ Scen, with respect to eh in every
possible way. If eh cannot be accepted by any existing or new flow instance in any scenario
in Scen, then eh is inconsistent with all existing execution scenarios, and the algorithm
reports that trace ρ is inconsistent with Scen.
Given a trace of flow events ρ = e1e2 . . . en, the trace interpretation algorithm starts with
a set Scen that contains a single empty flow execution scenario. Then, for each eh where
1 ≤ h ≤ n, starting from h = 1, and for each scen ∈ Scen, the following two steps are performed.
• Step 1 For each (Fi,j, si,j) ∈ scen, if accept(Fi,j, si,j, eh) = s′i,j ≠ ∅, create a new scenario
scen′ = (scen − {(Fi,j, si,j)}) ∪ {(Fi,j, s′i,j)}, which is added into Scen′.
• Step 2 For each Fi ∈ ~F, create a new instance Fi,j+1. If accept(Fi,j+1, initi,j+1, eh) =
s′i,j+1 ≠ ∅, create a new scenario scen′ = scen ∪ {(Fi,j+1, s′i,j+1)}, which is added into Scen′.
Create an empty scenario scen
Scen = {scen}
foreach h, 1 ≤ h ≤ n do
    found ← false
    Scen′ = ∅
    foreach scen ∈ Scen do
        foreach (Fi,j, si,j) ∈ scen do
            s′i,j ← accept(Fi,j, si,j, eh)
            if s′i,j ≠ ∅ then
                Let scen′ be a copy of scen
                scen′ ← (scen′ − {(Fi,j, si,j)}) ∪ {(Fi,j, s′i,j)}
                Scen′ ← {scen′} ∪ Scen′
                found ← true
            end
        end
        foreach Fi ∈ ~F do
            create a new instance Fi,j+1
            s′i,j+1 ← accept(Fi,j+1, initi,j+1, eh)
            if s′i,j+1 ≠ ∅ then
                Let scen′ be a copy of scen
                scen′ ← scen′ ∪ {(Fi,j+1, s′i,j+1)}
                Scen′ ← {scen′} ∪ Scen′
                found ← true
            end
        end
    end
    if found == false then
        return {Scen, eh}
    end
    Scen = Scen′
end
return {Scen, ε}
Algorithm 1: Check-Compliance(~F, ρ)
After eh is processed, Scen = Scen′, and the above two steps repeat for the next event
eh+1.
Based on the above discussion, the trace interpretation algorithm generates two possible
results:
• {Scen, ε} when ρ is compliant with the flow specification ~F, where Scen is a set of
flow execution scenarios, each of which is derived from the observed trace, and ε is an
empty event indicating non-existence of inconsistent events.
• {Scen, eh} when an inconsistent event occurs, where Scen is a set of partially derived
scenarios and eh is the corresponding inconsistent event. This result provides valuable
information for debuggers to root cause system failures.
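The two steps and the two possible results can be sketched in Python as follows. This is an illustrative sketch, not the implementation evaluated in this thesis. It assumes scenarios are frozensets of (flow name, instance index, marking) triples and uses `None` in place of ε. The arcs of the example flow F1 are an assumption inferred from the markings listed in Section 3.4 (t1: p1→p2, t2: p2→p3, t3: p3→{p4, p5}, t4: p4→p6, t5: p5→p7), and transition names double as event labels, as in that section.

```python
def accept(flow, state, event):
    """Successor marking if an enabled transition is labeled `event`, else None."""
    for t, pre in flow["pre"].items():
        if pre <= state and flow["label"][t] == event:
            return (state - pre) | flow["post"][t]
    return None

def check_compliance(flows, trace):
    """Return (scenarios, None) if the trace is compliant, otherwise
    (partially derived scenarios, first inconsistent event)."""
    scens = {frozenset()}                        # a single empty scenario
    for e in trace:
        nxt = set()
        for scen in scens:
            # Step 1: let an existing flow instance accept the event.
            for inst in scen:
                name, j, s = inst
                s2 = accept(flows[name], s, e)
                if s2 is not None:
                    nxt.add((scen - {inst}) | {(name, j, s2)})
            # Step 2: let a fresh instance of some flow accept the event.
            for name, flow in flows.items():
                s2 = accept(flow, flow["init"], e)
                if s2 is not None:
                    j = 1 + sum(1 for n, _, _ in scen if n == name)
                    nxt.add(scen | {(name, j, s2)})
        if not nxt:                              # no scenario admits e
            return scens, e
        scens = nxt
    return scens, None

fs = frozenset
F1 = {  # assumed reconstruction of the flow used in Section 3.4
    "init": fs({"p1"}),
    "pre": {"t1": fs({"p1"}), "t2": fs({"p2"}), "t3": fs({"p3"}),
            "t4": fs({"p4"}), "t5": fs({"p5"})},
    "post": {"t1": fs({"p2"}), "t2": fs({"p3"}), "t3": fs({"p4", "p5"}),
             "t4": fs({"p6"}), "t5": fs({"p7"})},
    "label": {t: t for t in ("t1", "t2", "t3", "t4", "t5")},
}

ok_scens, ok_bad = check_compliance({"F1": F1}, "t1 t2 t1 t2 t3 t3 t4 t5 t5 t4".split())
err_scens, err_bad = check_compliance({"F1": F1}, "t1 t2 t1 t2 t3 t3 t4 t5 t5 t3".split())
print(len(ok_scens), ok_bad, len(err_scens), err_bad)
```

Because the candidate scenarios are kept in a set, interpretations that lead to the same scenario are merged automatically, which is what reduces several scenarios to one in the illustration of the next section.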
3.4 Illustration
To illustrate the basic idea of the trace analysis algorithm, consider the system flow shown
in Figure 2.3, and let F1 denote this flow. Suppose that the following flow trace is abstracted
from an observed signal trace.
t1 t2 t1 t2 t3 t3 t4 t5 t5 t4 . . . (3.1)
This trace is interpreted from the first event to the last in order to derive all possible
flow execution scenarios. Here transition names in the LPN are used to represent the flow
events in the trace. At the beginning, event t1 is processed. According to the flow
specification F1, one instance of flow F1, F1,1, is activated by the SUD, as
accept(F1,1, init1, t1) = {p2}, where init1 = {p1} is the initial state of F1. The flow execution
scenario after interpreting the first event t1 is {(F1,1, {p2})}.
Next, the second event t2 is interpreted. This event is accepted by F1,1, as
accept(F1,1, {p2}, t2) = {p3}. The next event t1 activates another instance of flow F1, F1,2,
and the event t2 after that can be accepted by F1,2, resulting in the following flow execution
scenario:
{(F1,1, {p3}), (F1,2, {p3})}.
The fifth event t3 can be accepted by both F1,1 and F1,2. Therefore, two execution
scenarios can be derived, as shown below.
{(F1,1, {p4, p5}), (F1,2, {p3})}
{(F1,1, {p3}), (F1,2, {p4, p5})}.
After handling the following event t3, the above two execution scenarios are reduced to the
one shown below.
{(F1,1, {p4, p5}), (F1,2, {p4, p5})}.
After processing the next event t4, the two execution scenarios below can be derived:
{(F1,1, {p6, p5}), (F1,2, {p4, p5})}
{(F1,1, {p4, p5}), (F1,2, {p6, p5})}.
Next, processing the following event t5 leads to the execution scenarios below, derived from
those shown above:
{(F1,1, {p6, p7}), (F1,2, {p4, p5})}
{(F1,1, {p4, p7}), (F1,2, {p6, p5})}
{(F1,1, {p6, p5}), (F1,2, {p4, p7})}
{(F1,1, {p4, p5}), (F1,2, {p6, p7})}.
Similarly, next event t5 reduces the execution scenarios above to the following ones:
{(F1,1, {p6, p7}), (F1,2, {p4, p7})}
{(F1,1, {p4, p7}), (F1,2, {p6, p7})}. (3.2)
Eventually, after handling the last event t4 the execution scenario below is derived.
{(F1,1, {p6, p7}), (F1,2, {p6, p7})}
In this example, all flow events are successfully mapped and every flow instance reaches its
end state. The result shows that two instances of the firmware loading flow are activated
during the system run and finish correctly. Although no error occurs during the analysis,
debuggers can use this result to check whether the numbers of flow instances are correct
compared with the expected data extracted from verified simulation. This process involves
checking the types of protocol specifications activated and the number of flow instances of each
protocol. Moreover, depending on the correlation between protocols, together with the recorded
order of each flow instance's start and finish times, the debugger can judge whether the system
functions correctly.
Now suppose that the system generates a trace identical to the previous one in (3.1), except
that the last event is t3 instead of t4. The new trace is shown below:
t1 t2 t1 t2 t3 t3 t4 t5 t5 t3 . . .
The same execution scenarios as in (3.2) are derived after the first nine events are handled:
{(F1,1, {p6, p7}), (F1,2, {p4, p7})}
{(F1,1, {p4, p7}), (F1,2, {p6, p7})}.
However, neither of these two existing scenarios can accept t3. Furthermore, no new
flow instance can be created such that t3 is accepted in its initial state. Therefore, t3
is regarded as an inconsistent event.
When an inconsistent event happens, debuggers can make use of the current partially
derived scenarios and the inconsistent flow event to guess possible causes and the potential
problematic components in the system. Based on this information, debuggers can select a
new set of observable signals in order to better visualize the activities around the suspicious
components in a new SUD execution. The new observed traces can help debuggers better
understand the problem, and may eventually lead to locating the root cause of the problem.
CHAPTER 4
TRACE ANALYSIS UNDER PARTIAL OBSERVABILITY
In an SUD where the given system flow specification is implemented, a flow event is assumed
to be implemented as an event or a sequence of events on a set of hardware signals. Therefore,
a mapping function that translates a sequence of signal events into flow events is needed.
However, in post-silicon debug of a design that contains millions of gates, it is impossible
to observe every signal. As a result, the algorithm must consider the case where the signal
traces are produced under partial observability.
In this chapter, the trace analysis method presented in the previous chapter is extended
to deal with signal traces under partial observability. Hereafter,
the term flow traces is used to refer to traces of flow events, and signal traces refers to traces
of signal events observed from system execution.
4.1 Mapping Individual Signal Events to Flow Events
A signal event is defined as a state on or an assignment to a set of signals. In general,
a signal trace of partial observability is a sequence of signal events such that the values of
non-observable signals are unknown. In this case, all possible values of those signals are
considered for every signal event during trace analysis. Thus we can say that one partially
observed signal trace can be mapped to a set of fully observed signal traces.
Consider the following example for mapping individual signal events to flow events. Sup-
pose there are three flow events: e1, e2, and e3, which are implemented in hardware by the
signal events shown in the list below. We use Boolean expressions to represent signal events
for the discussion.
e1 : abc
e2 : ābc
e3 : ab̄c
In addition, suppose that only signals b and c are observable, and that we obtain the following
trace:
ρ = bc bc b̄c
Since a is not observable, both possible assignments to a need to be considered when these
signal events are mapped to flow events.
The first and second signal events, bc, can each be mapped to two possible signal events, one
for each value of a: abc and ābc. The signal event abc can be mapped to e1, while ābc
can be mapped to e2. Therefore, the signal event bc with the value of a unknown can be mapped
to {e1, e2}.
Similarly, the third signal event b̄c can be mapped to ab̄c and āb̄c. In this
case, ab̄c is mapped to e3. On the other hand, āb̄c cannot be mapped to any flow event;
therefore, this interpretation of signal a is invalid and is ignored.
Based on the above discussion, the signal trace ρ is abstracted to four possible flow
traces: {e1, e2} × {e1, e2} × {e3}.
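The example above can be reproduced with a few lines of Python. In this sketch, a fully observed signal event is a dictionary assigning 0 or 1 to a, b, and c, and a partially observed event simply omits the unobserved signals; the names `FLOW_EVENTS`, `candidates`, and `flow_traces` are hypothetical.

```python
from itertools import product

# Flow events as full signal assignments (1 = true, 0 = complemented),
# taken from the list above: e1 = abc, e2 = (not a)bc, e3 = a(not b)c.
FLOW_EVENTS = {
    "e1": {"a": 1, "b": 1, "c": 1},
    "e2": {"a": 0, "b": 1, "c": 1},
    "e3": {"a": 1, "b": 0, "c": 1},
}

def candidates(observed):
    """Flow events whose full assignment agrees with every observed signal."""
    return {e for e, full in FLOW_EVENTS.items()
            if all(full[sig] == val for sig, val in observed.items())}

def flow_traces(signal_trace):
    """All flow traces consistent with a partially observed signal trace."""
    per_step = [sorted(candidates(obs)) for obs in signal_trace]
    return [list(tr) for tr in product(*per_step)]

# Only b and c are observed: rho = bc bc (not b)c
rho = [{"b": 1, "c": 1}, {"b": 1, "c": 1}, {"b": 0, "c": 1}]
traces = flow_traces(rho)
for tr in traces:
    print(" ".join(tr))
```

An assignment of the unobserved signal that matches no flow event, such as a = 0 for the third signal event, simply contributes no candidate.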
4.2 Mapping Sequences of Signal Events to Flow Events
Next, we consider the more general case where a flow event is implemented by a sequence
of signal events to model a transaction that takes a number of signal events to accomplish.
For example, a flow event that represents a message sent from component A to component
B following a handshake protocol consists of two steps: (1) component A sets the valid bit
to 1 together with the command, and (2) component B sets the acknowledgement signal to
1.
1 Result = ∅
2 pref = ε
3 foreach i ∈ 0 . . . min(Max − 1, |ρ| − 1 − h) do
4     pref = ρ[h, h + i]
5     foreach (e, σ) ∈ Flow Map do
6         if |pref| == |σ| then
7             if σ ⇒ pref then
8                 Result = Result ∪ {(e, h + i + 1)}
9             end
10         end
11     end
12 end
13 return Result
Algorithm 2: Map(ρ, h, Flow Map)
Mapping a sequence of signal events to a flow event is described more precisely by the function
Map(ρ, h, Flow Map), whose pseudocode is shown in Algorithm 2. This function takes the
following inputs: a signal trace ρ, the index h of the next signal event in ρ, and the mapping
table Flow Map between flow events and signal events. It returns a set of pairs (e, h′) where
h′ is the position of the next signal event to be considered and e is a flow event mapped from
the segment of ρ starting from the signal event at index h to the signal event at h′ − 1. Index
i used in line 3 indicates the distance of the last event of the prefix pref relative to the
starting event at index h, and Max is the length of the longest sequence of signal events that
implements a flow event as defined in Flow Map.
Once the mapping function is called, all segments of ρ of increasing length from 1 to
Max or |ρ| − h (when the last signal event of ρ is reached) from the signal event at index
h (expressed as pref = ρ[h, h + i]) are considered. All possible pref s are compared with
σ in each instance (e, σ) in Flow Map where e is a flow event and σ is the corresponding
sequence of fully observed signal events. Due to the limitations of post-silicon validation, the
signal events in pref are under partial observability, and for each partially observed signal
event, a set of fully observed signal events can be obtained by considering all possible values
of the unobservable signals. Therefore, a single pref can represent a set of sequences of fully
observed signal events. To compare pref with σ, we write σ ⇒ pref to denote a
successful mapping, i.e., σ is included in the set of sequences of fully observed signal events
represented by pref. More specifically, for |σ| = |pref|, σ ⇒ pref is defined below.
∀i ∈ [0 . . . |σ| − 1], σ[i] ⇒ pref[i]
To illustrate this algorithm, we assume that two flow events are implemented by two
sequences of signal events as defined in the Flow Map below.
Flow Map
e4 : abc ābc
e5 : abc abc abc ābc
Again, assume that a is not observable, and suppose that an observed trace ρ on signals
b and c is obtained as shown in (4.1).
ρ = bc bc bc bc (4.1)
The given mapping Flow Map between the flow events and the signal events shows that
the length of a signal trace for a flow event is either 2 or 4; hence the value of Max
is 4 in this example. The function takes ρ, Flow Map, and h with value 0 as input.
Starting with the first sequence of signal events, pref = ρ[h, h + 0] = bc cannot be
matched to e4 or e5. Next, consider the sequence ρ[h, h + 1] = bc bc. By looking up the mapping
table Flow Map, this sequence can be mapped to e4, since its corresponding signal event
sequence σ = abc ābc is included in the set of sequences of fully observed signal events
represented by ρ[h, h + 1]. As a result, (e4, 2) is added to Result. In the next step, the sequence
pref = bc bc bc cannot be mapped to any flow event. Finally, a new pref = bc bc bc bc is
generated and is mapped to e5; therefore (e5, 4) is added to Result. Subsequently, the
function terminates and returns the set of pairs:
{(e4, 2), (e5, 4)}
As the pair (e5, 4) reaches the end of the signal trace ρ, there are no more signal events to
be considered. For (e4, 2), the mapping function is applied to the same ρ and Flow Map
with index h changed to 2, and it returns the set of pairs {(e4, 4)}. After combining
these results, two flow traces are derived from ρ, as shown below.
{e4 e4, e5}
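The walk-through above can be checked with a short sketch of the mapping function. It follows the structure of Algorithm 2 but is only an illustration: the names `implies`, `map_events`, and `abstract` are hypothetical, signal events are dictionaries as in a typical software model, and `abstract` is an added helper (not part of Algorithm 2) that chains calls to the mapping function to enumerate complete flow traces.

```python
def implies(full, partial):
    """sigma[i] => pref[i]: the full assignment agrees with every observed signal."""
    return all(full[sig] == val for sig, val in partial.items())

def map_events(rho, h, flow_map):
    """Return {(e, h')} where flow event e matches rho[h:h'] and h' is the next index."""
    max_len = max(len(sigma) for sigma in flow_map.values())
    result = set()
    # i ranges over 0 .. min(Max - 1, |rho| - 1 - h), as in line 3 of Algorithm 2.
    for i in range(min(max_len, len(rho) - h)):
        pref = rho[h:h + i + 1]
        for e, sigma in flow_map.items():
            if len(sigma) == len(pref) and all(map(implies, sigma, pref)):
                result.add((e, h + i + 1))
    return result

def abstract(rho, flow_map, h=0):
    """Enumerate every flow trace that covers rho[h:] completely."""
    if h == len(rho):
        return [[]]
    return [[e] + rest
            for e, h2 in sorted(map_events(rho, h, flow_map))
            for rest in abstract(rho, flow_map, h2)]

abc = {"a": 1, "b": 1, "c": 1}
nabc = {"a": 0, "b": 1, "c": 1}                  # (not a) b c
FLOW_MAP = {"e4": [abc, nabc], "e5": [abc, abc, abc, nabc]}
rho = [{"b": 1, "c": 1}] * 4                     # bc bc bc bc, a unobserved

first = map_events(rho, 0, FLOW_MAP)
second = map_events(rho, 2, FLOW_MAP)
print(sorted(first), sorted(second), abstract(rho, FLOW_MAP))
```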
4.3 Generalized Trace Analysis Algorithm
As shown in the previous section, more than one flow trace can be derived from a signal
trace observed under partial observability. Each of the derived flow traces can be very long and
thus requires a lot of effort to interpret. Moreover, the number of flow traces can grow
exponentially as the system complexity increases while the number of observed signals remains
limited. For these reasons, the analysis time for the large number of flow traces can be
impractical. To address this issue, our proposed work combines trace abstraction with trace
interpretation into a new generalized algorithm, presented next in Algorithm 3.
This generalized algorithm takes two inputs: ~F, a set of system protocol
specifications, and a signal trace ρ. Instead of abstracting a set of flow traces from the signal
trace ρ and applying Algorithm 1 to each flow trace, the generalized algorithm applies the
analysis each time a new flow event is abstracted from the signal trace ρ. This method can
reduce the analysis time significantly: for flow traces that contain an inconsistent event,
the algorithm stops as soon as the inconsistent event is encountered,
so no further abstraction is needed.
Create an empty scenario scen
Scens = {({scen}, 0)}
Scens final = ∅
Scens d = ∅
while Scens ≠ ∅ do
    get (Scen, h) ∈ Scens
    flag = false
    K ← Map(ρ, h, Flow Map)
    foreach (e, h′) ∈ K do
        Scen′ ← Flow Analysis(~F, Scen, e)
        /* if Scen′ = ∅, e is inconsistent with Scen */
        if Scen′ ≠ ∅ then
            flag = true
            if h′ = |ρ| then
                Scens final = Scens final ∪ {(Scen′, h′)}
            else
                Scens = Scens ∪ {(Scen′, h′)}
            end
        end
    end
    if flag == false then
        Scens d = Scens d ∪ {(Scen, h)}
    end
    Scens = Scens − {(Scen, h)}
end
if Scens final = ∅ then
    return Scens d
else
    return Scens final
end
Algorithm 3: Generalized-Check-Compliance(~F, ρ)
Scen′ = ∅
foreach scen ∈ Scen do
    foreach (Fi,j, si,j) ∈ scen do
        s′i,j ← accept(Fi,j, si,j, e)
        if s′i,j ≠ ∅ then
            Let scen′ be a copy of scen
            scen′ ← (scen′ − {(Fi,j, si,j)}) ∪ {(Fi,j, s′i,j)}
            Scen′ ← {scen′} ∪ Scen′
        end
    end
    foreach Fi ∈ ~F do
        create a new instance Fi,j+1
        s′i,j+1 ← accept(Fi,j+1, initi,j+1, e)
        if s′i,j+1 ≠ ∅ then
            Let scen′ be a copy of scen
            scen′ ← scen′ ∪ {(Fi,j+1, s′i,j+1)}
            Scen′ ← {scen′} ∪ Scen′
        end
    end
end
return Scen′
Algorithm 4: Flow Analysis(~F, Scen, e)
Inside the algorithm, a variable Scens is created to hold a set of pairs (Scen, h), where
Scen is a set of flow execution scenarios extracted from the segment of signal trace ρ
from index 0 to index h − 1. The algorithm goes through each pair (Scen, h) and applies
the mapping function from the previous section to ρ, h, and Flow Map to produce a set of pairs
(e, h′), where e is a flow event and h′ is the location of the next signal event to be considered.
In the next step, function Flow Analysis(~F, Scen, e) (shown in Algorithm 4) is applied to
each e and Scen, together with the flow specification set ~F, to produce an updated scenario
set Scen′. The updated Scen′, together with h′ as the location of the next signal event to be
considered, is added to Scens. When h′ = |ρ|, meaning that the end of the signal trace ρ has
been reached, (Scen′, h′) is added to Scens final instead of Scens, so that
Scens final holds the set of pairs (Scen, |ρ|) where Scen is a set of flow execution scenarios
extracted from the whole signal trace ρ. For debugging purposes, another variable Scens d
is created to hold the pairs (Scen, h) for which one of the following two conditions holds:
1. The set K returned by function Map(ρ, h, Flow Map) is empty, meaning that
no flow events can be mapped from the current signal events, or
2. The returned set K is not empty, but the value of flag is false, meaning
that none of the flow events in K is consistent with the current flow execution scenario
set Scen.
When all pairs (Scen, h) in Scens have been considered, the algorithm terminates. Depending
on whether Scens final is empty, one of the following is returned:
1. When Scens final is not empty, there exists at least one flow execution
scenario that is extracted from signal trace ρ, and Scens final is returned.
2. When Scens final is empty, every candidate sequence of flow events is inconsistent with
its scenario set, and Scens d is returned for debugging purposes.
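The interplay between the mapping function and the flow analysis can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the implementation evaluated in this thesis: the worklist is a Python list, the initial scenario set contains one empty scenario, and the single flow F (p1 −e4→ p2 −e5→ p3) is hypothetical. The Flow Map and the signal trace are those of Section 4.2.

```python
def implies(full, partial):
    return all(full[k] == v for k, v in partial.items())

def map_events(rho, h, flow_map):
    max_len = max(len(sig) for sig in flow_map.values())
    return {(e, h + i + 1)
            for i in range(min(max_len, len(rho) - h))
            for e, sig in flow_map.items()
            if len(sig) == i + 1 and all(map(implies, sig, rho[h:h + i + 1]))}

def accept(flow, state, event):
    for t, pre in flow["pre"].items():
        if pre <= state and flow["label"][t] == event:
            return (state - pre) | flow["post"][t]
    return None

def flow_analysis(flows, scen_set, e):
    """Update every scenario in scen_set with flow event e (cf. Algorithm 4)."""
    out = set()
    for scen in scen_set:
        for inst in scen:                        # existing instances
            name, j, s = inst
            s2 = accept(flows[name], s, e)
            if s2 is not None:
                out.add((scen - {inst}) | {(name, j, s2)})
        for name, flow in flows.items():         # new instances
            s2 = accept(flow, flow["init"], e)
            if s2 is not None:
                j = 1 + sum(1 for n, _, _ in scen if n == name)
                out.add(scen | {(name, j, s2)})
    return frozenset(out)

def generalized_check(flows, rho, flow_map):
    work = [(frozenset({frozenset()}), 0)]       # pairs (Scen, h)
    final, dead = [], []
    while work:
        scen_set, h = work.pop()
        progressed = False
        for e, h2 in map_events(rho, h, flow_map):
            nxt = flow_analysis(flows, scen_set, e)
            if nxt:                              # e is consistent with scen_set
                progressed = True
                (final if h2 == len(rho) else work).append((nxt, h2))
        if not progressed:
            dead.append((scen_set, h))           # kept for debugging
    return final if final else dead

fs = frozenset
F = {"init": fs({"p1"}),                         # hypothetical flow
     "pre": {"t1": fs({"p1"}), "t2": fs({"p2"})},
     "post": {"t1": fs({"p2"}), "t2": fs({"p3"})},
     "label": {"t1": "e4", "t2": "e5"}}
abc, nabc = {"a": 1, "b": 1, "c": 1}, {"a": 0, "b": 1, "c": 1}
FLOW_MAP = {"e4": [abc, nabc], "e5": [abc, abc, abc, nabc]}
rho = [{"b": 1, "c": 1}] * 4

result = generalized_check({"F": F}, rho, FLOW_MAP)
print(result)
```

In this run, the interpretation of the whole trace as e5 is discarded as soon as it is abstracted, because no instance of F can start with e5; only the interpretation e4 e4 survives, with two instances of F in state {p2}.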
4.4 Difficulties and Solutions
In the previous section, we combine the trace abstraction method with flow analysis
algorithm into a new algorithm to reduce the abstraction time. However, the algorithm’s
practical applicability can still be restricted due to the enormous number of potential flow
execution scenarios generated under the partial observability. Note that this is not a limi-
tation of the algorithm; if the observability of critical events is poor, there simply are too
many flow execution scenarios that can produce the observed trace. For example, a read
protocol and write protocol following the same sequence of steps can look exactly the same
when the command signal is not observable.
Nevertheless, we need to address the issue to make trace interpretation (whether
automatic or not) practicable. There are two potential approaches: (1) a better selection of
signals to enhance post-silicon trace observability, and (2) use of the debugger's insights during
the analysis. In the next two sections, we briefly describe the impact of signal selection and
how the debugger's insights into a system's architecture can help to address the complexity
issue in trace interpretation.
4.5 Trace Signal Selection
Trace signal selection in this research is done manually by the system debugger. This
process is labor-intensive and relies heavily on the debugger's insights. During our experiments,
different sets of signals were selected to test the algorithm, and the best set of signals was
selected by comparing the numbers of different scenarios returned after analyzing the signal
traces. Even though the debugger's knowledge is of great help in the debugging
process, this method cannot guarantee the quality of the selected signals. For bugs whose causes
are not clear, it is nearly impossible to select the related signals during the design phase.
Therefore, we need at least some trace signals that are selected in an automated manner,
without the debugger's insights, to achieve effective bug detection.
Most current signal selection algorithms operate at a low level and use the State
Restoration Ratio (SRR) as their metric. SRR evaluates a signal selection algorithm by counting
the number of design states that can be reconstructed from the observed signals. As explained
in [22], this causes an issue because SRR treats all signals equally and thus always favors large
arrays; it may not be very helpful for restoring useful system states in our case. We need a
tool that restores as many system-level states as possible and thus leads to fewer scenarios.
In [22], researchers proposed an interesting algorithm based on Google's PageRank algorithm.
This algorithm ranks each signal based on its connectivity to other instances and chooses
the most valuable signals. However, this algorithm does not support system-level assertions,
which may hinder its ability to choose the best signals for restoring system-level state in our
case. In [23], researchers described a different method of selecting signals at the system level
using a linear programming formulation. This method focuses on communications between IPs and
tries to maximize the coverage of each protocol's messages. Applying this method to our
system and evaluating its performance is part of our future work.
4.6 Interactive Trace Interpretation
Post-silicon validation is performed by debuggers with deep knowledge about the system’s
architecture and micro-architecture, and the test environment. Two key insights are (1) the
maximal number of instances of a flow activated in the test environment, and (2) the mutual
relationship between two flows. For example, the test environment may not allow multiple
instances of firmware authentication to operate concurrently, or a flow involving audio and
Web browsing to initiate until the flows participating in boot are completed. Our framework
permits incorporating such insights as constraints in trace analysis; flow execution scenarios
that violate these constraints are ignored. These insights can lead to two advantages. First,
they help to reduce the potentially large number of partial scenarios generated during the
trace interpretation step, thus making the analysis more efficient. Second, they allow the
debugger to quickly filter out uninteresting combinations of flows and focus on interesting
interleavings.
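One possible way to encode the two kinds of insights as scenario filters is sketched below. This is an assumption-laden illustration rather than the constraint mechanism of our framework: scenarios are sets of (flow name, instance index, marking) triples, and the parameter names `max_active`, `must_finish_before`, and `terminal`, as well as the flows boot and audio and their markings, are hypothetical. The ordering check looks only at the current scenario, so it is a simplified snapshot test that does not track when instances were started.

```python
from collections import Counter

def violates(scen, max_active, must_finish_before, terminal):
    """True if a scenario breaks one of the debugger-supplied constraints."""
    # Instances that have not yet reached the terminal marking of their flow.
    active = Counter(n for n, _, s in scen if s != terminal[n])
    present = {n for n, _, _ in scen}
    # Insight (1): bound on concurrently active instances of a flow.
    if any(active[f] > limit for f, limit in max_active.items()):
        return True
    # Insight (2): `later` must not appear while `first` is still active.
    return any(active[first] > 0 and later in present
               for first, later in must_finish_before)

def prune(scens, max_active, must_finish_before, terminal):
    """Keep only the scenarios that satisfy all constraints."""
    return {s for s in scens
            if not violates(s, max_active, must_finish_before, terminal)}

fs = frozenset
terminal = {"boot": fs({"done"}), "audio": fs({"done"})}
allowed = fs({("boot", 1, fs({"done"})), ("audio", 1, fs({"play"}))})
scens = {
    allowed,
    fs({("boot", 1, fs({"load"})), ("audio", 1, fs({"play"}))}),  # audio during boot
    fs({("boot", 1, fs({"load"})), ("boot", 2, fs({"load"}))}),   # two active boots
}
kept = prune(scens, max_active={"boot": 1},
             must_finish_before=[("boot", "audio")], terminal=terminal)
print(len(kept))
```

Applying such a filter after each flow event, rather than only at the end, is what keeps the number of partial scenarios small during interpretation.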
If precise knowledge of the system (micro-)architecture is hard to come by, this
approach remains flexible, as it allows a debugger to analyze the observed traces in
a trial-and-error manner. For instance, the debugger might initially make very restrictive
assumptions about how the SUD executes a flow specification, and these assumptions can
potentially lead to an empty set of flow execution scenarios. Depending on which of these
assumptions are triggered during the trace interpretation step, the debugger can study them
more carefully and relax some or all of them for the next run of analysis. This
iteration can be repeated as many times as necessary until results deemed meaningful
are produced.
Alternatively, if all derived execution scenarios seem to be plausible, the implication that
a debugger may draw from this result is that the failure may be independent of the flows
being observed. Therefore, the testing environment can be adjusted in order for a different
part or different behavior of the SUD to be observed. This idea, closely related to trace
signal selection, is critical for post-silicon validation, and a detailed discussion can only be
presented in a separate paper.
CHAPTER 5
CASE STUDIES
This chapter demonstrates the usage of our proposed algorithm on two different models.
In the first section, the trace analysis algorithm is applied to a transaction level model of a
simple SoC built within the GEM5 environment. In the next section, a more detailed RTL
model of a similar SoC is constructed to further test the efficiency and effectiveness of our
algorithm.
5.1 A Transaction-Level Model of a Simple SoC in GEM5
To determine the efficiency of the trace analysis method for a realistic example, a trans-
action level model of a SoC is constructed using the GEM5 environment [24]. The GEM5
simulator is a modular platform for computer-system architecture research, encompassing
different system level architectures as well as different processor micro-architectures. This
SoC model, as shown in Figure 5.1, consists of two ARM Cortex-A9 cores, each of which
contains two separate 16KB data and instruction caches. The caches are connected to a
1GB memory through a bus model. The bus works as a system agent that is in charge
of routing messages to maximize the system parallelism. Inside of the SoC, components
communicate with each other by sending or receiving various request and response messages
through links. To observe the communications occurring inside this model
during execution, monitors are attached to the links connecting the components. These monitors
record the messages flowing through the links they are attached to, and store them in
output trace files.
Figure 5.1. The TLM model structure.
From "Protocol-guided analysis of post-silicon traces under limited observability", by Hao Zheng, Yuting
Cao, S. Ray and J. Yang, 2016, ISQED, Copyright 2016 by IEEE. Reprinted with permission [5].
In this SoC design, there are nine interfaces in total, and a monitor is attached to each
interface. By combining the information from all nine communication monitors, the
trace of the communication activities inside the SoC can be obtained. As a virtualized SoC
platform, GEM5 has three types of communication transactions supported by its interface:
timing, atomic and functional. Different types of transactions use different timing models. For
example, a timing transaction takes a sequence of steps, each of which has a delay. These
steps can be interleaved with steps of other transactions. This type of transaction is the most
detailed one, and it reflects the simulator’s best effort for modeling realistic timing including
the modeling of queuing delay and resource contention [25]. Next, atomic transactions are
faster compared to timing transactions since an atomic transaction works like a function and
does not simulate any delay. When an atomic transaction starts, all of its operations will be
executed without interruption. Therefore, it is impossible to interleave atomic transactions.
This type of transaction is used for fast forwarding and warming up caches, and returns
an approximate time to complete the request without any resource contention or queuing
delay [25]. Finally, functional transactions are used for loading binaries, examining and
changing variables in the simulated system, and allowing a remote debugger to be attached
to the simulator. Functional transactions are not used for our simple SoC. In our
experiment, we implement the communications of our simple SoC with timing transactions to
create a system with the most realistic timing.
For this model, we consider the flow specifications describing the cache coherence protocols
supported in GEM5 that are used to build the model in Figure 5.1. The specifications
of the GEM5 cache coherence protocols can be found in [25], and their corresponding LPN
descriptions can be found in Appendix A. These flow specifications describe data/instruction
read operations and data write operations initiated from CPUs. Each CPU has three protocols
implemented: data read, instruction read, and write (see Table 5.2). Since there are two CPUs,
there are six flows implemented in the model.
We write a simple concurrent program with two threads, one for each CPU, to exercise
the flows. The thread assigned to CPU1 reads one file three times and then makes some
modifications to the content of the file. CPU2 repeats the same process as CPU1; however,
unlike CPU1, CPU2 modifies the file first and then performs the read operations.
After this model is executed with the simple concurrent program, the trace analysis is
applied to traces with different levels of observability collected from this model. The runtime
results are shown in Table 5.1. The F-Obs. column shows the results from analyzing the trace
with full observability, while the three P-Obs. columns show the results from analyzing traces
under different partial observability assumptions.
In the first experiment, full observability is assumed. After the SoC model finishes
executing the program, a total of 343581 messages are collected in the trace file. Not
all of the messages are relevant to the flow specifications, as many are used by GEM5 to
initialize its simulation environment. After removing those irrelevant messages, the number
of messages in the trace file is reduced to 121138.
Table 5.1. Runtime Results of Trace analysis.
            F-Obs.   P-Obs. No Amb.   P-Obs. Amb. 1   P-Obs. Amb. 2
Time (s)    3        2.78             896             < 1
Mem (MB)    12       10               420             9
Table 5.2. The number of flow instances derived by the trace analysis with the full observability.
Flows                     #Instances
CPU1 Data Read            17582
CPU1 Instruction Read     4002
CPU1 Write                3370
CPU2 Data Read            17386
CPU2 Instruction Read     3955
CPU2 Write                3308
The time taken to remove the irrelevant messages from the trace is negligible. The total
runtime and the peak memory taken by the trace analysis algorithm on the reduced trace
are 3 seconds and 12MB, respectively. Only one flow execution scenario is extracted, and
Table 5.2 shows the number of flow instances contained in that scenario for the six flows
describing cache coherence operations initiated from both CPUs.
In the second experiment, partial observability is introduced by disabling the four monitors
attached to the links between the two CPUs and their caches. The trace
is then generated by the remaining five monitors from the SoC model executing the same program.
The new trace contains 15089 messages. As before, only one flow execution scenario
is extracted, and the numbers of flow instances contained in that execution scenario
are shown in Table 5.3. The numbers of flow instances drop
significantly compared to the results extracted from the trace with full observability
shown in Table 5.2. This difference arises because some communications that occur in the
system when executing the program involve the CPUs and their corresponding caches only,
and the traffic on the links between the CPUs and their corresponding caches is not observable.
Therefore, the instances of the flow specifications characterizing these communications
cannot be observed from the trace. In other words, all extracted flow instances in Table 5.3
characterize the communications that pass through the Bus in the system model. The runtime
and memory usage, shown in the P-Obs. No Amb. column in Table 5.1, are similar to those for
analyzing the trace with full observability.
Table 5.3. The number of flow instances derived by the trace analysis with certain monitors disabled.
Flows                     #Instances
CPU1 Data Read            829
CPU1 Instruction Read     169
CPU1 Write                82
CPU2 Data Read            803
CPU2 Instruction Read     190
CPU2 Write                83
In the third experiment, observability is restricted further. As before,
only the five links involving the Bus are observed. In addition, an assumption
is made that all messages passing through the same link are indistinguishable due to the limited
observability. The monitors are modified such that whenever a message is captured on one
of the links, the monitor dumps the set of messages that can pass through that link into the trace file.
Therefore, each line of the trace file corresponds to a set of messages. After applying the
trace analysis to this trace, a total of 13944 flow execution scenarios are extracted. This large
number, compared to the results from the first two experiments, is due to the ambiguous
interpretation of the messages with limited observability.
The whole experiment takes about 15 minutes and 420 MB of memory to finish, as shown
in the P-Obs. Amb. 1 column in Table 5.1. This is significantly higher than the numbers for analyzing
traces where there is no ambiguity in the observed messages. The reason is that a trace
of ambiguous messages is in fact a set of traces of messages with full observability, which
leads to large numbers of execution scenarios either during or at the end of the analysis. In
this experiment, the peak number of execution scenarios encountered during the analysis
process is 70384, many of which are invalid and eventually removed. Controlling
the number of intermediate execution scenarios found during the trace analysis is therefore
critical for the analysis to be tractable. Here, insights from validators could help, but are not
used in this experiment.
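The observation that a trace of ambiguous messages stands for a set of fully observed traces can be illustrated with a short sketch. It assumes, purely for illustration, that each trace entry is a tuple of candidate messages; the function names are hypothetical. The number of concrete traces is the product of the sizes of the candidate sets, which is why it grows so quickly.

```python
from itertools import product

def concrete_traces(ambiguous_trace):
    """Expand a trace of candidate-message tuples into every concrete trace."""
    return [list(t) for t in product(*ambiguous_trace)]

def count_concrete_traces(ambiguous_trace):
    """Number of concrete traces, computed without enumerating them."""
    n = 1
    for candidates in ambiguous_trace:
        n *= len(candidates)
    return n
```

With three trace entries that each admit two candidate messages, the sketch yields eight concrete traces; if every entry admits only one candidate, exactly one trace remains.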
As shown above, the ambiguous interpretation of messages can lead to large numbers
of intermediate and final execution scenarios. This not only causes the trace analysis to be
more time consuming, but also makes it difficult to gain an insightful understanding from
the derived execution scenarios. Careful selection of what to observe may have a big impact on
the results of the trace analysis. In this last experiment, we relax the assumption made in the
previous experiment such that the messages passing each link are partitioned into two groups:
one for read operations and one for write operations. Similar to the assumption made in
the previous experiment, messages in the same group are assumed to be non-distinguishable.
The monitors are modified accordingly such that they output all messages in the same group
into the trace file if an event from that group is captured. After the trace analysis on this
new partially observed trace is finished, only one execution scenario is derived where the
distribution of the numbers of flow instances is the same as those shown in Table 5.3. The
peak number of execution scenarios encountered during the trace analysis is 4. The total
runtime and memory usage are negligible as shown in the last column in Table 5.1. Compared
to the results from the previous experiment, the precision and the performance of the trace
analysis are improved dramatically as a result of careful selection of observable messages.
One problem we encountered during the experiment is that it is unclear how GEM5 supports
shared-memory multi-threaded program execution; each CPU has its own separate,
non-interleaving address space. Therefore, no memory is shared between the two caches in this
test. Furthermore, GEM5 does not support true concurrency: when there are
two threads running on the CPUs, GEM5 alternates the instruction executions between the
two CPUs. As a result, no matter how many times the multi-threaded program runs,
the observed behaviors are always deterministic. To simulate asynchronous concurrency
with interleaving semantics, the two simple threads are instrumented with pseudo-blocking
commands, one placed before each statement. A pseudo-blocking command includes
a random number generator that returns either 0 or 1 and a loop that only exits when the
returned random number is 0. To address the above limitations, we develop a more detailed
cycle-accurate RTL model of the SoC for more experiments. Details of this new model are
explained in the next section.
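A pseudo-blocking command as described above can be sketched in a few lines. The actual instrumentation used in the GEM5 experiment is not shown in this thesis, so the Python form and the names `pseudo_block` and `instrumented` are assumptions made only for illustration.

```python
import random

def pseudo_block(rng):
    """Spin until the generator returns 0; report how many extra draws it took."""
    spins = 0
    while rng.randint(0, 1) != 0:   # randint(0, 1) returns either 0 or 1
        spins += 1
    return spins

def instrumented(statements, rng):
    """Run each statement (a zero-argument callable) after a pseudo-blocking command."""
    delays = []
    for stmt in statements:
        delays.append(pseudo_block(rng))
        stmt()
    return delays
```

Because the delay before each statement is random, two instrumented threads no longer advance in lockstep, which approximates the interleaving semantics mentioned above.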
5.2 A Cycle-Accurate RTL Model for a Simple SoC
The model in the previous section is built at a very high abstraction level, and the trace
abstraction method introduced in Chapter 4 is not fully used. In this section, we construct
a cycle-accurate RTL model for a similar SoC to allow more detailed trace signal selection.
For this model, the trace information is collected at the bit level; thus, an extra translation
step is required. This model is cycle accurate, and simulates the SoC behavior more
accurately than the TLM model.
5.2.1 Model Implementation
Due to the novelty of our work, we could not find an existing model with well-documented
flow specifications. As a result, we have to build this model from scratch. Here the flow
specifications from GEM5 are slightly modified and implemented in the RTL model as shown
in Figure 5.2. This model consists of two CPU models, each with its own 4KB Data Cache.
The Data Caches are connected to a 256KB Memory through a bus model. Currently, the
CPUs are treated as a test environment where software programs are simulated to trigger
various flows, and they do not execute instructions as normal CPUs do. Therefore, there
is no need to include instruction caches. This model implements three system-level protocols
for each CPU: read, write, and write back. The read and write protocols are the
same as those provided by GEM5 with only minor modifications; the write back protocol is a
new protocol and is invoked when a Cache needs to flush dirty cache lines back to the Memory.
Details of these three protocols are shown in Appendix B. This model is simulated using
GHDL [26], an open-source simulator for the VHDL language.
Figure 5.2. The RTL model structure.
To collect communication activities of the system, values of the signals selected for
observation are collected and written to a trace file in an internal format on each rising
clock edge. The GHDL simulator itself also outputs a trace file in the VCD (value change
dump) format, in which value changes on all signals are recorded. A VCD file can
be opened by waveform viewers, which provide a graphical representation of the value changes
of the recorded signals. Our trace analysis algorithm can take both formats of trace
files as input.
There are six types of interfaces shown in Figure 5.2, each of which is implemented by
a number of signals to ensure the correct communication behaviors of the SoC. Two types
of messages are used by these interfaces for communications, and their formats are shown
in Figure 5.3. Message format 1 consists of 51 bits and is
used for interfaces (1), (2), (3), (4) and (5). As shown in Figure 5.3, message format 1
consists of four fields: val, cmd, addr, and data. Field val indicates the validity of the message.
The next two bits, 49-48, form the cmd field, which defines the command of a message. These
two bits can represent four different commands. Currently our model supports only two
commands: 01 represents the read command and 10 represents the write command. The
next 16 bits represent the addr field, which contains the memory address targeted by
a memory operation. The last 32 bits represent the data field, which stores the value
of the related memory data. Message format 2 consists of 52 bits and includes
five fields: id, val, cmd, addr, and data, as shown in Figure 5.3. The val, cmd,
addr, and data fields are the same as those for message format 1. The field id is a new field and is
only used for interface (6), where the id bit indicates where the request originated. For
example, if a request is initiated from CPU1, the id bit is 0; otherwise it is 1.
Figure 5.3. Format definition of messages
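The two message formats can be made concrete with a small packing and unpacking sketch. The bit positions are partly inferred: the text gives bits 49-48 for cmd, so val is taken to be bit 50, addr bits 47-32 and data bits 31-0; placing id at bit 51 is an assumption based only on the order in which the fields are listed. The function names are hypothetical and do not come from the RTL model.

```python
VAL_BIT, CMD_SHIFT, ADDR_SHIFT, ID_BIT = 50, 48, 32, 51  # ID_BIT is an assumption
CMD_READ, CMD_WRITE = 0b01, 0b10

def pack_format1(val, cmd, addr, data):
    """Build a 51-bit format-1 message from its four fields."""
    if val not in (0, 1) or not 0 <= cmd < 4:
        raise ValueError("val is 1 bit and cmd is 2 bits")
    if not 0 <= addr < (1 << 16) or not 0 <= data < (1 << 32):
        raise ValueError("addr is 16 bits and data is 32 bits")
    return (val << VAL_BIT) | (cmd << CMD_SHIFT) | (addr << ADDR_SHIFT) | data

def unpack_format1(word):
    """Split a format-1 message into its fields."""
    return {
        "val": (word >> VAL_BIT) & 0x1,
        "cmd": (word >> CMD_SHIFT) & 0x3,
        "addr": (word >> ADDR_SHIFT) & 0xFFFF,
        "data": word & 0xFFFFFFFF,
    }

def pack_format2(id_bit, val, cmd, addr, data):
    """Format 2 is format 1 with an extra id bit on top (52 bits in total)."""
    if id_bit not in (0, 1):
        raise ValueError("id is 1 bit")
    return (id_bit << ID_BIT) | pack_format1(val, cmd, addr, data)

def unpack_format2(word):
    """Split a format-2 message into its five fields."""
    fields = unpack_format1(word & ((1 << ID_BIT) - 1))
    fields["id"] = (word >> ID_BIT) & 0x1
    return fields
```

A decoder of this kind is the sort of translation step needed to turn bit-level signal values into the (source, destination, command) form used at the flow level.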
The communications between components are implemented using handshake protocols
to ensure correct data transfer. To implement non-blocking communications, we use a
first-in first-out (FIFO) buffer inside each component for each incoming signal carrying messages.
The model shown in Figure 5.2 contains six different interfaces, each in charge of different
operations. To start with, interface (1), as shown in Figure 5.4, is responsible for communications
between a CPU and its Cache. There are three signals included in this interface, each of
which is explained in Table 5.4. This interface is activated when a CPU wants to read a
memory address or write data to a memory address. Before sending a request, the CPU first
checks whether the FIFO inside its Cache is available for a new request, which is indicated by
the value of signal full_c. Once full_c is 0, indicating an open slot in the FIFO,
the CPU can send a request on cpu_req to the Cache. The Cache sends the corresponding
response on cpu_res to the CPU once it obtains the requested data.
Figure 5.4. Structures of link 1 in Figure 5.2
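The full-flag handshake described above can be sketched behaviorally as follows. This is not the VHDL of the model: the class `BoundedFifo`, the helper `try_send`, and the FIFO depth are hypothetical, since the thesis does not state the buffer depth.

```python
from collections import deque

class BoundedFifo:
    """Receiver-side FIFO whose `full` property plays the role of a full_* signal."""
    def __init__(self, depth):
        self.depth = depth          # hypothetical; the actual depth is not specified
        self.slots = deque()

    @property
    def full(self):
        """1 when no slot is free, 0 otherwise."""
        return 1 if len(self.slots) >= self.depth else 0

    def push(self, msg):
        if self.full:
            raise RuntimeError("sender must not transmit while full is 1")
        self.slots.append(msg)

    def pop(self):
        """Receiver consumes the oldest message, freeing one slot."""
        return self.slots.popleft()

def try_send(fifo, msg):
    """Sender side: transmit only when the full flag is 0, otherwise hold the message."""
    if fifo.full == 0:
        fifo.push(msg)
        return True
    return False
```

In this sketch a sender that sees `full == 1` simply retries later, which mirrors the CPU waiting for full_c to become 0 before driving cpu_req.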
There are three interfaces between a Cache and the Bus, and their structures are shown
in Figure 5.5. The signals included in these three interfaces are explained in Table 5.5. In
Figure 5.5, interface (2) is used to support snoop operations, and is composed of the signals
in green. This interface is activated when the Bus receives a cache request from a Cache.
Before requesting the data from the Memory, the Bus sends a snoop request to the other
Cache to check whether it has the requested data. To send a snoop request, the Bus first needs to
check the value of signal full_srq to see if the FIFO in the other Cache is available for a new
request. Similarly, the Cache checks the value of signal full_srs before sending a snoop
response. Detailed flow specifications for this interface are shown in Figure C.2, Figure C.3,
Figure C.6 and Figure C.7 in Appendix B.
Table 5.4. Signal explanations for Interface (1)
Signal Name   Width   Definition
cpu_req       51      CPU request from CPU to Cache
cpu_res       51      CPU response from Cache to CPU
full_c        1       Indicates whether the cpu_req FIFO inside a Cache is full
Figure 5.5. Structures of link (2), (3) and (4) in Figure 5.2
Interface (3), with signals in blue, supports write back operations. This interface is
activated when a modified memory cell in a Cache needs to be flushed out and written back
to the Memory. The write back flow specification is shown in Figure C.4 in Appendix B.
Interface (4) supports cache request and bus response operations, and is presented in
orange. Both operations have their own FIFOs; thus, cache_req and bus_res
can each be sent only when the corresponding FIFO’s full indicator has value 0.
There are two interfaces between the Bus and the Memory as shown in Figure 5.6, and
definitions of signals for these two interfaces are explained in Table 5.6. Interface (5) is
designed for handling memory requests from the Bus, and it uses the signals in red. When the
Bus needs to request data from the Memory, it waits for full_m to be 0, indicating that the
FIFO inside the Memory is available for new requests. Once the Memory finds the requested
data, it waits until there is an empty slot in the Bus’s FIFO (full_mrs is 0), and sends the
corresponding response message on mem_res.
Figure 5.6. Structures of link 5 and 6 in Figure 5.2
Interface (6) handles write back requests on the signals in purple. This operation is
initiated by the Bus sending a write back request message on wb_req, and then the Memory
sends back the acknowledgement signal on wb_ack once the data is written into the Memory.
Table 5.5. Signal explanations for Interface (2), (3) and (4)
Interface   Signal Name   Width   Definition
2           snp_hit       1       Snoop hit signal
2           snp_res       51      Snoop response sent from a Cache to the Bus
2           full_srs      1       Indicates whether the snp_res FIFO inside the Bus is full
2           snp_req       51      Snoop request sent from the Bus to a Cache
2           full_srq      1       Indicates whether the snp_req FIFO inside a Cache is full
3           wb_req        51      Write back request from a Cache to the Bus
3           full_wb       1       Indicates whether the wb_req FIFO inside the Bus is full
4           full_creq     1       Indicates whether the cache_req FIFO inside the Bus is full
4           cache_req     51      Cache request sent from a Cache to the Bus
4           bus_res       51      Bus response from the Bus to a Cache
4           full_brs      1       Indicates whether the bus_res FIFO inside a Cache is full
Table 5.6. Signal explanations for Interface (5) and (6)
Interface   Signal Name   Width   Definition
5           full_m        1       Indicates whether the mem_req FIFO inside the Memory is full
5           mem_req       52      Memory request from the Bus to the Memory
5           full_mrs      1       Indicates whether the mem_res FIFO inside the Bus is full
5           mem_res       52      Memory response from the Memory to the Bus
6           wb_req        51      Write back request from the Bus to the Memory
6           wb_ack        1       Acknowledgement of a write operation by the Memory
5.2.2 Experimental Results
• Test 1
To test the correctness of the cache coherence protocol, we hard code a test generator
for each CPU. In every clock cycle, the test generator inside each CPU randomly
generates a memory read or write operation. In order to activate the cache coherence
protocol, only the three most significant bits of the 16 address bits are randomly generated,
while the other address bits are pre-defined. By limiting the addresses to a
certain range, it is more likely that one CPU will request data that exists in the cache
of the other CPU. During the development process of the RTL model, this test case is
executed on the CPU multiple times to test whether the system works correctly with
no inconsistent messages occurring. During this test, two types of bugs are found and
fixed using the proposed trace analysis algorithm; detailed discussions of these
two bugs are given in the next section.
• Test 2
We hard code a simple program that performs Peterson’s Algorithm with two threads,
one for each CPU. The pseudocode is shown in Figure 5.7. This algorithm contains
four shared variables: flag0, flag1, turn, and shared. A CPU that wants to enter
the critical section has to wait until the flag of the other CPU or turn gets the desired
value. Whenever a CPU enters the critical section, it increments the variable shared
by one. By running this algorithm N times, the final value of shared should be 2N, as
both CPUs increment variable shared N times. During this test, one bug was found
and fixed. Details of this bug case are discussed in the next section.
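The address restriction used in Test 1 can be sketched as follows. The real test generator is hard coded in the RTL model; the Python form, the name `random_request`, and the particular value chosen for the fixed low address bits are assumptions made for illustration only.

```python
import random

FIXED_LOW_BITS = 0x0ABC  # hypothetical pre-defined value for the 13 low address bits

def random_request(rng):
    """One request per clock cycle: random command, random top three address bits."""
    cmd = rng.choice(("rd", "wt"))
    top3 = rng.randrange(8)                  # only address bits 15..13 vary
    addr = (top3 << 13) | FIXED_LOW_BITS     # the remaining 13 bits are fixed
    return cmd, addr
```

Since only eight distinct addresses can ever be produced, requests from the two CPUs collide often, which is what makes activation of the coherence protocol likely.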
5.2.3 Debugging Experience
This section discusses a number of bug cases resulting from the SoC model executing Test
1 and Test 2, and summarizes the experience of using the proposed trace analysis method
to help root-cause and locate those bugs. Each of those bug cases is obtained from traces
generated under partial observability. The observable signals include 2 bits of the command
signal and 3 bits of the address signal related to the memory request, together with the source
and destination of the signal.
    bool flag[2] = {false, false}
    int turn
    int shared = 0
    CPU1:
        flag[0] = true
        turn = 1
        while flag[1] ∧ turn == 1 do
            // busy wait
        end
        // enters critical section
        shared++
        flag[0] = false
        // leaves critical section
    CPU2:
        flag[1] = true
        turn = 0
        while flag[0] ∧ turn == 0 do
            // busy wait
        end
        // enters critical section
        shared++
        flag[1] = false
        // leaves critical section
Figure 5.7. Peterson’s Algorithm on two CPUs [4]
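The expected outcome of Test 2 (a final value of 2N for shared) can be checked with a small interleaving simulation of Peterson’s Algorithm. This is a software sketch, not the RTL test program: each CPU is modelled as a Python generator, every `yield` marks a point where the other CPU may run, and a seeded random scheduler picks which CPU takes the next step. The names `peterson_cpu` and `run` are hypothetical.

```python
import random

def peterson_cpu(me, mem, rounds, in_cs):
    """One CPU running Peterson's Algorithm `rounds` times on shared state `mem`."""
    other = 1 - me
    for _ in range(rounds):
        mem["flag"][me] = True
        yield
        mem["turn"] = other
        yield
        while mem["flag"][other] and mem["turn"] == other:
            yield                       # busy wait
        in_cs[me] = True                # enters critical section
        assert not in_cs[other], "mutual exclusion violated"
        tmp = mem["shared"]             # read ...
        yield
        mem["shared"] = tmp + 1         # ... then write, so a race would lose updates
        in_cs[me] = False
        mem["flag"][me] = False         # leaves critical section
        yield

def run(rounds, seed):
    """Interleave the two CPUs at random and return the final value of shared."""
    rng = random.Random(seed)
    mem = {"flag": [False, False], "turn": 0, "shared": 0}
    in_cs = [False, False]
    cpus = [peterson_cpu(0, mem, rounds, in_cs),
            peterson_cpu(1, mem, rounds, in_cs)]
    while cpus:
        cpu = rng.choice(cpus)
        try:
            next(cpu)
        except StopIteration:
            cpus.remove(cpu)
    return mem["shared"]
```

Under this model the increment is deliberately split into a read and a write, so any interleaving that let both CPUs into the critical section would show up either as a failed assertion or as a final value below 2N.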
5.2.3.1 Bug One: Duplicated Messages
In the initial test, each CPU was designed to generate only one memory read
request followed by one write request. When the obtained trace was analyzed using the
trace analysis algorithm, the analysis halted and returned a set of partial execution
scenarios and one inconsistent message. This error happened no matter how many times the
model was simulated. Two different inconsistent messages were found and are shown below.
(bus, mem, rd)
(bus, mem, wt)
Figure 5.8. The flow specifications for CPU write operations
Figure 5.9. The flow specifications for CPU read operations
These two inconsistent messages belong to two different flow specifications, the write and read
flow specifications, whose message sequence charts are shown in Figure 5.8 and Figure 5.9,
respectively. The only similarity between these two inconsistent messages is that they are both
sent from the Bus. To better locate the cause of this problem, we checked the system-level
flow event that happened just before the occurrence of the inconsistent message. The flow
events translated from the signal events show that the inconsistent message is always exactly
the same as its previous message. We suspect that the Bus may be processing the same request
twice or sending the same response twice.
By looking into the Bus’s structure, we discovered that the request sent from a Cache to the
Bus is not reset after it is accepted by the Bus; hence the Bus always processes the same
request from a Cache twice, and generates two identical requests to the Memory asking for the
same memory data. After modifying the Cache component to reset the request to the Bus one
clock cycle after it is sent, the bug is fixed.
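The symptom that led to this bug, an inconsistent message that is identical to the flow event immediately before it, is easy to check mechanically. The following sketch assumes flow events are represented as (source, destination, command) tuples; the helper name is hypothetical and the check is not part of the analysis algorithm itself.

```python
def consecutive_duplicates(flow_events):
    """Return the indices i at which flow_events[i] repeats flow_events[i - 1]."""
    return [i for i in range(1, len(flow_events))
            if flow_events[i] == flow_events[i - 1]]
```

For a trace such as `[("cache0", "bus", "rd"), ("bus", "mem", "rd"), ("bus", "mem", "rd")]`, the helper reports index 2, pointing at the repeated request from the Bus to the Memory.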
Figure 5.10. Two instances of write flow.
5.2.3.2 Bug Two: Incorrect Command
After Bug One was fixed, we executed Test 1 on the CPUs and analyzed the trace
data collected from the system. During the debugging process, the trace analysis algorithm
halted and returned an inconsistent message. When running the system multiple times and
applying the analysis algorithm to the resulting traces, the inconsistent message (bus, mem, rd)
kept occurring.
We take one specific case and try to find the possible reason. In this case, when the
inconsistent message occurs, only two write flow instances are active, and their corresponding
states are shown in Figure 5.10. The first write instance is shown with red arrows
and is in state 2 (after accepting message (cache0, bus, wt)); the second write instance is
shown with blue arrows and is in state 4 (after accepting message (cache1, bus, snp)).
Here we construct the set of messages that can be accepted by the current scenario:
• (bus, cache1, snp), which can be accepted by write instance one at state 2, and
• (bus, mem, wt), which can be accepted by write instance two at state 4.
By comparing these two expected messages with the inconsistent message, we see that the
inconsistent message (bus, mem, rd) has the same source and destination as the expected
message (bus, mem, wt). We therefore guess that the command may have been changed for write
flow instance two, and that this change may be caused by the inconsistent message’s source, the Bus.
By checking the internal structure of the Bus, we could not find any possible causes, so we
checked the message preceding the inconsistent message, which is (cache1, bus, snp). This
message is sent from the Cache; thus we checked the Cache component for errors. Inside the
Cache, we found that the command bits in the snoop response are always 01 (read command)
no matter what the request’s command bits are. After changing the process that
handles snoop requests, the problem was solved.
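The comparison between the inconsistent message and the messages expected by the current scenario can be sketched as a field-by-field match. The tuple layout (source, destination, command) follows the messages quoted above; the helper names `rank_expected` and `differing_fields` are hypothetical.

```python
def rank_expected(inconsistent, expected):
    """Order expected messages by how many fields they share with the offender."""
    def score(msg):
        return sum(a == b for a, b in zip(msg, inconsistent))
    return sorted(expected, key=score, reverse=True)

def differing_fields(inconsistent, candidate, names=("src", "dst", "cmd")):
    """Name the fields in which the offender deviates from a candidate message."""
    return [n for n, a, b in zip(names, inconsistent, candidate) if a != b]
```

Applied to the case above, `rank_expected(("bus", "mem", "rd"), [("bus", "cache1", "snp"), ("bus", "mem", "wt")])` places `("bus", "mem", "wt")` first, and `differing_fields` reports only the command field, which matches the manual reasoning that the command was corrupted.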
5.2.3.3 Bug Three: Incomplete Protocol Specification
When executing Peterson’s Algorithm in Test 2 and applying our analysis algorithm
to the observed trace, no inconsistent message was reported and only one flow execution
scenario was produced. However, by checking the numbers of activated flow specifications,
we found that the flow execution scenario does not include an instance of the cache coherence
protocol. In order for Peterson’s Algorithm to work, each CPU needs to request data that
are shared by the other CPU. Therefore, at least one instance of the cache coherence protocol
needs to be activated.
Moreover, the simulation kept running and never stopped. As a result, the number of read
flow instances kept increasing while the numbers of other flow instances remained the same. This
narrows the root cause to the only while loop in the program, which keeps checking the values of
flag and turn to see if a CPU can enter the critical section. This implies that the shared
variables flag and turn might not hold the correct values.
As there are no inconsistent messages, it is unlikely that the protocol specifications
are wrongly implemented, so we make an initial guess that the protocol itself may contain
some loopholes. After a thorough check of the protocol, together with our understanding of
Peterson’s Algorithm, we discovered that when two CPUs request data from the same
address at the same time, both CPUs get the data from the Memory and think they have
the exclusive right to that data, causing inconsistency of the shared data.
This problem is solved by adding another register inside the Bus to guarantee that when
both CPUs ask for data from the same address, only one of them gets the data from the
Memory and both CPUs understand that this data is shared. For example, when CPU0
requests an address that CPU1 has just requested but has not finished with yet, CPU0 waits until
CPU1 finishes its flow and then requests the data from CPU0’s cache. After fixing this
problem, we reran the model and found a fair number of cache coherence protocol instances
activated, and the simulation ended after a short amount of time.
However, when each CPU ran Peterson’s Algorithm three times, the final value of
shared was 5, compared to the expected 6. Because this problem resides in the actual value
of the data, which is not considered at all in our algorithm, our approach cannot provide
much useful information, and thus it is very hard to root-cause the bug. Our initial guess was
that both CPUs may have entered the critical section at the same time, causing the wrong result
value of variable shared. To check that, we need to know the order of instructions of both
CPUs. We therefore propose, as future research, to record the order of each flow’s start and
finishing times; by analyzing this order, we may be able to find out whether both CPUs enter the
critical section together.
This bug was fixed by going through the code of each component thoroughly, which eventually
narrowed the problem down to the Cache: when new data needs to be written to a cache line that
already contains some data with the same cache line index, the old data is never written back to
the Memory. After fixing the write back process inside the Cache, the simulation returns
the correct value for shared.
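The corrected behavior can be illustrated with a toy write-back cache. This sketch is not the VHDL Cache component: it assumes a direct-mapped organization with a modulo index and one word per line, neither of which is stated in this thesis, and the class name is hypothetical. The point is only that a dirty line must be written back before its slot is reused.

```python
class DirectMappedCache:
    """Toy write-back cache; `memory` is a dict standing in for the Memory."""
    def __init__(self, num_lines, memory):
        self.num_lines = num_lines
        self.memory = memory
        self.lines = {}                     # index -> (addr, data, dirty)

    def write(self, addr, data):
        index = addr % self.num_lines       # assumed indexing scheme
        old = self.lines.get(index)
        if old is not None and old[0] != addr and old[2]:
            # Conflict on a dirty line: write the old data back before replacing it.
            self.memory[old[0]] = old[1]
        self.lines[index] = (addr, data, True)

    def read(self, addr):
        index = addr % self.num_lines
        line = self.lines.get(index)
        if line is not None and line[0] == addr:
            return line[1]                  # hit
        if line is not None and line[2]:
            self.memory[line[0]] = line[1]  # evict the dirty line first
        data = self.memory.get(addr, 0)
        self.lines[index] = (addr, data, False)
        return data
```

Dropping the write back in `write` reproduces the kind of failure described above: a later read of the evicted address would return a stale value from the Memory.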
CHAPTER 6
CONCLUSION AND FUTURE WORK
This chapter summarizes the work done in this thesis and discusses possible future directions.
6.1 Conclusion
This thesis presents a trace analysis method for post-silicon validation by interpreting
observed hardware signal traces at the level of system flow specifications. The proposed trace
analysis method includes a trace abstraction function that takes a single signal event or a
sequence of signal events, and produces a set of corresponding flow events by looking up the
mapping table between flow events and signal events. Each time a new flow event is produced, the
proposed trace analysis algorithm applies it to the current flow execution
scenarios, and produces a new set of scenarios. Once the whole signal trace is processed, the
final set of flow execution scenarios is returned by our proposed method.
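The overall loop summarized above can be sketched in a heavily simplified form. The sketch assumes each flow is a linear sequence of flow events rather than an LPN, represents a scenario as a pair (active instances, completed flows), and treats the mapping table as a dictionary from signal events to candidate flow events. All names are hypothetical, and the sketch omits the constraints and optimizations of the actual algorithm.

```python
def _place(active, done, name, pos, flows):
    """Record that an instance of flow `name` has consumed `pos` events so far."""
    if pos == len(flows[name]):
        return active, tuple(sorted(done + (name,)))
    return tuple(sorted(active + ((name, pos),))), done

def step(scenarios, event, flows):
    """Extend every scenario with one flow event."""
    result = set()
    for active, done in scenarios:
        for i, (name, pos) in enumerate(active):
            if flows[name][pos] == event:           # advance an active instance
                rest = active[:i] + active[i + 1:]
                result.add(_place(rest, done, name, pos + 1, flows))
        for name, seq in flows.items():
            if seq[0] == event:                     # start a new instance
                result.add(_place(active, done, name, 1, flows))
    return result

def analyze(signal_trace, mapping, flows):
    """Return (scenarios, None), or (partial scenarios, offending signal event)."""
    scenarios = {((), ())}
    for signal_event in signal_trace:
        nxt = set()
        for flow_event in mapping[signal_event]:    # trace abstraction by lookup
            nxt |= step(scenarios, flow_event, flows)
        if not nxt:
            return scenarios, signal_event          # inconsistent: nothing accepts it
        scenarios = nxt
    return scenarios, None
```

With `flows = {"rd": ("a", "b"), "wt": ("c", "d")}` and a mapping in which one signal event abstracts to both "a" and "c", a single observation already yields two scenarios, which is the source of the scenario growth seen under limited observability.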
For post-silicon debug, the flow execution scenarios produced by our method can provide
some more structured information on system operations, which is more understandable to
system validators. This information, combined with debugger’s insight, can greatly help to
locate design defects more easily, and also provides a measurement of validation coverage.
Moreover, this proposed method returns a set of flow events that cannot be mapped to
any scenarios, which can be used to decide whether the protocol specifications are correctly
implemented by the SUD.
To evaluate the efficiency of the trace analysis method on a realistic example, a transaction
level model of an SoC design was constructed and tested with our method. The results
are encouraging: analyzing the traces produced by the model takes only a short time and a
small amount of system memory. Next, a cycle accurate RTL model of a
similar SoC design was constructed to further test the effectiveness of our method. A number
of bug cases are discussed to show that the trace analysis method can provide useful
information for finding implementation errors.
6.2 Future Work
One beneficial direction for future work is to record the order in which flow instances
begin and finish in each flow execution scenario. This can greatly help the root-causing
process, because assertions about order relations can be added to the analysis, which
allows order-related errors to be found before they cause an actual problem. The order relation
assertions can be derived from the debugger’s insights.
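As a rough illustration of what such an order assertion might look like, the sketch below checks "instance A must finish before instance B begins" against a recorded log. The log format, the instance identifiers, and the function names are all assumptions made for this example; the thesis does not define them.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical log entry: (position in trace, flow instance id, "begin" or "finish").
LogEntry = Tuple[int, str, str]
Span = Tuple[Optional[int], Optional[int]]


def intervals(log: List[LogEntry]) -> Dict[str, Span]:
    """Collect the begin and finish position of every flow instance in the log."""
    spans: Dict[str, Span] = {}
    for pos, instance, kind in log:
        begin, finish = spans.get(instance, (None, None))
        if kind == "begin":
            begin = pos
        elif kind == "finish":
            finish = pos
        spans[instance] = (begin, finish)
    return spans


def violated(log: List[LogEntry],
             must_precede: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the pairs (a, b) where b began before a had finished."""
    spans = intervals(log)
    bad: List[Tuple[str, str]] = []
    for a, b in must_precede:
        a_finish = spans.get(a, (None, None))[1]
        b_begin = spans.get(b, (None, None))[0]
        # Only instances of b that actually began can violate the assertion.
        if b_begin is not None and (a_finish is None or a_finish > b_begin):
            bad.append((a, b))
    return bad
```

For example, if a write back instance begins at position 0 and finishes at position 2, while a read instance begins at position 1, then the assertion that the write back precedes the read is reported as violated.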
Due to limited observability, our approach may derive a large number of different
flow execution scenarios for a given signal trace. A better selection of signals can
greatly reduce the number of final scenarios and thus allow a better interpretation of the system
behavior. In our experiments, the signals were selected manually based
on the debugger’s insights. This method can be time consuming and may not scale as
the system grows larger and more complicated. To obtain a better trace signal selection,
and thus a more precise interpretation of the observed trace, one of our future research
directions is to develop an automated signal selection algorithm.
Insights from system validators can also help eliminate false scenarios that arise from
partial observability. An interesting future direction is to formalize the validators’
insights using temporal logic on flows so that validators can express their intent more
precisely and concisely.
The trace analysis approach presented in this thesis needs to be iterated, with different
observations selected in different iterations, in order to eliminate false scenarios and to
root cause system failures as quickly as possible. Stitching together the signal traces of different
observations to achieve this goal will also be pursued in the future.
LIST OF REFERENCES
[1] Stephen A. White. Process modeling notations and workflow patterns. Workflow
Handbook, 2004:265–294, 2004.
[2] Nicola Nicolici and Ho Fai Ko. Design-for-debug for post-silicon validation: Can high-
level descriptions help? In High Level Design Validation and Test Workshop, 2009.
HLDVT 2009. IEEE International, pages 172–175. IEEE, 2009.
[3] S. Krstic, Jin Yang, D.W. Palmer, R.B. Osborne, and E. Talmor. Security of soc
firmware load protocols. In Hardware-Oriented Security and Trust (HOST), 2014 IEEE
International Symposium on, pages 70–75, May 2014.
[4] Gary L. Peterson. Myths about the mutual exclusion problem. Information Processing
Letters, 12(3):115–116, 1981.
[5] Hao Zheng, Yuting Cao, S. Ray, and J. Yang. Protocol-guided analysis of post-silicon
traces under limited observability. In 2016 17th International Symposium on Quality
Electronic Design (ISQED), pages 301–306, March 2016.
[6] Harry D Foster. Why the design productivity gap never happened. In Proceedings of
the International Conference on Computer-Aided Design, pages 581–584. IEEE Press,
2013.
[7] P. Patra. On the cusp of a validation wall. IEEE Design & Test of Computers,
24(2):193–196, March 2007.
[8] Xiao Liu and Qiang Xu. Trace-Based Post-Silicon Validation for VLSI Circuits.
Springer, 2014.
[9] Miron Abramovici, Paul Bradley, Kumar Dwarakanath, Peter Levin, Gerard Memmi,
and Dave Miller. A reconfigurable design-for-debug infrastructure for socs. In Proceedings
of the 43rd Annual Design Automation Conference, DAC ’06, pages 7–12, New
York, NY, USA, 2006. ACM.
[10] Miron Abramovici, Paul Bradley, Kumar Dwarakanath, Peter Levin, Gerard Memmi,
and Dave Miller. A reconfigurable design-for-debug infrastructure for socs. In Proceedings
of the 43rd Annual Design Automation Conference, pages 7–12. ACM, 2006.
[11] Ehab Anis and Nicola Nicolici. Interactive presentation: Low cost debug architecture
using lossy compression for silicon debug. In Proceedings of the conference on Design,
automation and test in Europe, pages 225–230. EDA Consortium, 2007.
[12] Kees Goossens, Bart Vermeulen, Remco van Steeden, and Martijn Bennebroek.
Transaction-based communication-centric debug. In Proceedings of the First International
Symposium on Networks-on-Chip, NOCS ’07, pages 95–106, Washington, DC,
USA, 2007. IEEE Computer Society.
[13] Bart Vermeulen and Kees Goossens. A network-on-chip monitoring infrastructure for
communication-centric debug of embedded multi-processor socs. In VLSI Design,
Automation and Test, 2009. VLSI-DAT ’09. International Symposium on,
pages 183–186, 2009.
[14] Kees Goossens, Bart Vermeulen, and Ashkan Beyranvand Nejad. A high-level debug
environment for communication-centric debug. In Proceedings of the Conference on
Design, Automation and Test in Europe, DATE ’09, pages 202–207, 3001 Leuven,
Belgium, 2009. European Design and Automation Association.
[15] Amir Masoud Gharehbaghi and Masahiro Fujita. Transaction-based post-silicon debug
of many-core system-on-chips. In ISQED, pages 702–708, 2012.
[16] Mehdi Dehbashi and Görschwin Fey. Transaction-based online debug for noc-based
multiprocessor socs. In Proceedings of the 2014 22nd Euromicro International Conference
on Parallel, Distributed, and Network-Based Processing, PDP ’14, pages 400–404,
Washington, DC, USA, 2014. IEEE Computer Society.
[17] Amir Masoud Gharehbaghi and Masahiro Fujita. Transaction-based debugging of
system-on-chips with patterns. In Proceedings of the 2009 IEEE International
Conference on Computer Design, ICCD’09, pages 186–192, Piscataway, NJ, USA, 2009.
IEEE Press.
[18] Marc Boule, Jean-Samuel Chenard, and Zeljko Zilic. Assertion checkers in verification,
silicon debug and in-field diagnosis. In Proceedings of the 8th International Symposium
on Quality Electronic Design, ISQED ’07, pages 613–620, Washington, DC, USA, 2007.
IEEE Computer Society.
[19] Eli Singerman, Yael Abarbanel, and Sean Baartmans. Transaction based pre-to-post
silicon validation. In Proceedings of the 48th Design Automation Conference, DAC ’11,
pages 564–568, New York, NY, USA, 2011. ACM.
[20] Yael Abarbanel, Eli Singerman, and Moshe Y. Vardi. Validation of soc firmware-
hardware flows: Challenges and solution directions. In Proceedings of the 51st
Annual Design Automation Conference, DAC ’14,
pages 2:1–2:4, New York, NY, USA, 2014. ACM.
[21] Binsan Khadka. Transformation of live sequence charts to colored petri nets (lsctocpn).
Master’s thesis, University of Massachusetts Dartmouth, January 2007.
[22] Sai Ma, Debjit Pal, Rui Jiang, Sandip Ray, and Shobha Vasudevan. Can’t see the
forest for the trees: State restoration’s limitations in post-silicon trace signal selection.
In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design,
ICCAD ’15, pages 1–8, Piscataway, NJ, USA, 2015. IEEE Press.
[23] Matthew Amrein. System-level trace signal selection for post-silicon debug using linear
programming. Master’s thesis, University of Illinois, May 2015.
[24] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi,
Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti,
Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A.
Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, August 2011.
[25] The gem5 simulator: A modular platform for computer-system architecture research.
http://www.gem5.org/docs/html/gem5MemorySystem.html.
[26] Ghdl. http://ghdl.free.fr. Accessed: 2016-07-07.
APPENDICES
Appendix A Protocol Specifications in Message Sequence Chart Provided by
GEM5
Figure A.1. Flow sequence chart of write operation when requested data is not included in
Dcache.
Figure A.2. Flow sequence chart of write operation when XCache has the exclusive right of
requested data.
Figure A.3. Flow sequence chart of write operation when requested data is shared by another
component.
Figure A.4. Flow sequence chart of read operation when XCache has the exclusive right of
requested data.
Figure A.5. Flow sequence chart of read operation when requested data is shared by another
component.
Figure A.6. Flow sequence chart of read operation when requested data is not present in the
Cache.
Appendix B Protocol Specification in LPNs Provided by GEM5
msg0 : ( CPU1, icache1 , writeReq)
msg1 : ( dcache1, Bus , readExreq )
msg2 : ( Bus, dcache2 , readExreq)
msg3 : ( dcache2, cpu2 , readExreq)
msg4 : ( Bus, icache2 , readExreq)
msg5 : ( icache2, cpu2 , readExreq)
msg6 : ( Bus, icache1 , readExreq)
msg7 : ( dcache1, cpu1 , readExreq)
msg8 : ( Bus, Memory , readExreq)
msg9 : ( true )
msg10 : ( Memory, Bus, readExres)
msg11 : ( icache2, Bus , readExres)
msg12 : ( Bus, dcache1, readExres)
msg13 : ( icache1, CPU1 , writeRes)
msg14 : ( icache1, CPU1 , writeRes)
msg15 : ( dcache1, Bus, UpgradeReq)
msg16 : ( Bus, icache2 , UpgradeReq)
msg17 : ( Bus, Memory , UpgradeReq)
msg18 : ( icache2, Bus , UpgradeRes)
msg19 : ( Bus, dcache1 , UpgradeRes)
msg20 : ( icache1, CPU1 , writeRes)
Figure B.1. Flow specification of a cache coherent write operation initiated from CPU1 to
instruction cache.
msg0 : ( CPU1, icache1 , ReadReq)
msg1 : ( dcache1, Bus , StoreCondreq )
msg2 : ( Bus, icache2 , StoreCondreq)
msg3 : ( icache2, cpu2 , StoreCondreq)
msg4 : ( Bus, dcache2 , StoreCondreq)
msg5 : ( dcache2, cpu2 , StoreCondreq)
msg6 : ( Bus, dcache1 , StoreCondreq)
msg7 : ( icache1, cpu1 , StoreCondreq)
msg8 : ( Bus, Memory , StoreCondreq)
msg9 : ( true )
msg10 : ( Memory, Bus , ReadRes)
msg11 : ( icache2, Bus , ReadRes)
msg12 : ( Bus, dcache1 , ReadRes)
msg13 : ( icache1, CPU1 , ReadRes)
msg14 : ( icache1, CPU1 , ReadRes)
Figure B.2. Flow specification of a cache coherent read operation initiated from CPU1 to
instruction cache.
msg0 : ( CPU1, dcache1 , ReadReq)
msg1 : ( dcache1, CPU1 , ReadRes)
msg2 : ( icache1, Bus , LoadLockedreq )
msg3 : ( Bus, dcache2 , LoadLockedreq)
msg4 : ( dcache2, cpu2 , LoadLockedreq)
msg5 : ( Bus, icache2 , LoadLockedreq)
msg6 : ( icache2, cpu2 , LoadLockedreq)
msg7 : ( Bus, dcache1 , LoadLockedreq)
msg8 : ( icache1, cpu1 , LoadLockedreq)
msg9 : ( Bus, Memory , LoadLockedreq)
msg10 : ( true )
msg11 : ( Memory, Bus , ReadRes)
msg12 : ( icache2, Bus , ReadRes)
msg13 : ( Bus, icache1 , ReadRes)
msg14 : ( dcache1, CPU1 , ReadRes)
Figure B.3. Flow specification of a cache coherent read operation initiated from CPU1 to
data cache.
Appendix C Protocol Specification in Message Sequence Charts for the RTL
Model
Figure C.1. CPU write when cache has exclusive right of the requested data.
Figure C.2. CPU write when data only exists in the other CPU’s cache.
Figure C.3. CPU write when requested data only resides in Memory.
Figure C.4. Cache sends write back request to Memory.
Figure C.5. CPU read when cache has exclusive right of the requested data.
Figure C.6. CPU read when data only exists in the other CPU’s cache.
Figure C.7. CPU read when requested data only resides in Memory.
The read and write protocols in the RTL model are very similar to those used in the GEM5
simulator. However, the command names used here are different.
Appendix D Protocol Specification in LPNs for the RTL Model
There are three protocols in total: the read, write, and write back protocols.
All write operations are implemented in the protocol presented in Figure D.2. When a
request activates the cache coherence protocol, as in Figure C.2, the flow ends in state17. The rest
end in state9.
All read operations are implemented in the protocol presented in Figure D.3. The specification
in Figure C.6 ends in state17. The remaining specifications, which do not activate the cache
coherence protocol, end in state9.
msg1 : ( Cache1, Bus , wb )
msg2 : ( Bus, Memory , wb)
msg3 : ( Memory, Bus , wb)
Figure D.1. Flow specification of a cache write back operation initiated from Cache1.
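The message labels of Figure D.1 can be written down as plain data, which is one way to build the mapping table used by the analysis. The sketch below assumes the three messages occur in the linear order msg1, msg2, msg3; the figure's LPN structure is not reproduced here, and the `Msg` type and `matched_prefix` helper are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple


class Msg(NamedTuple):
    """A message label (source, destination, command), as listed in Figure D.1."""
    src: str
    dest: str
    cmd: str


# Write back flow from Figure D.1, assumed here to be a linear sequence.
WRITE_BACK: List[Msg] = [
    Msg("Cache1", "Bus", "wb"),     # msg1
    Msg("Bus", "Memory", "wb"),     # msg2
    Msg("Memory", "Bus", "wb"),     # msg3
]


def matched_prefix(flow: List[Msg], observed: List[Msg]) -> int:
    """Count how many leading messages of `flow` appear, in order, in `observed`.

    Observed messages that are not the next expected message are skipped, so
    messages from other interleaved flows do not break the match.
    """
    pos = 0
    for msg in observed:
        if pos < len(flow) and msg == flow[pos]:
            pos += 1
    return pos
```

A return value equal to `len(WRITE_BACK)` means the observed messages contain a complete write back in the assumed order; a smaller value indicates how far the flow progressed.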
msg1 : ( CPU1, Cache1 , wt)
msg2 : ( Cache1, CPU1 , wt )
msg3 : ( Bus, Cache2 , snp)
msg4 : ( Cache2, Bus , snp)
msg5 : ( Bus, Memory , wt)
msg6 : ( Memory, Bus , wt)
msg7 : ( Bus, Cache1 , wt)
msg8 : ( Bus, Cache1 , wt)
msg9 : ( Cache1, CPU1 , wt)
msg10 : ( Cache1, CPU1 , wt)
msg11 : ( Cache1, CPU1 , wt)
Figure D.2. Flow specification of a cache coherent write operation initiated from CPU1 to
Cache.
msg1 : ( CPU1, Cache1 , rd)
msg2 : ( Cache1, Bus , rd )
msg3 : ( Bus, Cache2 , snp)
msg4 : ( Cache2, Bus , snp)
msg5 : ( Bus, Memory , rd)
msg6 : ( Memory, Bus , rd)
msg7 : ( Bus, Cache1 , rd)
msg8 : ( Bus, Cache1 , rd)
msg9 : ( Cache1, CPU1 , rd)
msg10 : ( Cache1, CPU1 , rd)
msg11 : ( Cache1, CPU1 , rd)
Figure D.3. Flow specification of a cache coherent read operation initiated from CPU1 to
Cache.
Appendix E Copyright Permissions
The permission below is for the use of material in Chapters 1, 2, 3, 4, 5 and 6.
Title: Protocol-guided analysis of post-silicon traces under limited observability
Conference Proceedings: Quality Electronic Design (ISQED), 2016 17th International Symposium on
Author: Hao Zheng
Publisher: IEEE
Date: March 2016
Copyright © 2016, IEEE
Thesis / Dissertation Reuse
The IEEE does not require individuals working on a thesis to obtain a formal reuse license,
however, you may print out this statement to be used as a permission grant:
Requirements to be followed when using any portion (e.g., figure, graph, table, or textual material) of
an IEEE copyrighted paper in a thesis:
1) In the case of textual material (e.g., using short quotes or referring to the work within these papers)
users must give full credit to the original source (author, paper, publication) followed by the IEEE
copyright line © 2011 IEEE.
2) In the case of illustrations or tabular material, we require that the copyright line © [Year of original
publication] IEEE appear prominently with each reprinted figure and/or table.
3) If a substantial portion of the original paper is to be used, and if you are not the senior author, also
obtain the senior author’s approval.
Requirements to be followed when using an entire IEEE copyrighted paper in a thesis:
1) The following IEEE copyright/ credit notice should be placed prominently in the references: © [year
of original publication] IEEE. Reprinted, with permission, from [author names, paper title, IEEE
publication title, and month/year of publication]
2) Only the accepted version of an IEEE copyrighted paper can be used when posting the paper or your
thesis on-line.
3) In placing the thesis on the author's university website, please display the following message in a
prominent place on the website: In reference to IEEE copyrighted material which is used with
permission in this thesis, the IEEE does not endorse any of [university/educational entity's name goes
here]'s products or services. Internal or personal use of this material is permitted. If interested in
reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for
creating new collective works for resale or redistribution, please go to
http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from
RightsLink.
If applicable, University Microfilms and/or ProQuest Library, or the Archives of Canada may supply
single copies of the dissertation.
