As the complexity of embedded systems rapidly increases, the use of traditional analysis and debug methods encounters significant challenges in monitoring, analyzing, and debugging the complex interactions of various software and hardware components. This situation is further exacerbated for in-situ debugging and verification in which traditional debug and trace interfaces that require physical access are unavailable, infeasible, or cost prohibitive. In this article, we present a system-level observation framework that provides minimally intrusive methods for dynamically monitoring and analyzing deeply integrated hardware and software components within embedded systems. The system-level observation framework monitors hardware and software events by inserting additional logic for detecting designer-specified events within hardware cores to observe complex interaction across hardware and software boundaries at runtime, and provides visibility for monitoring complex execution behavior of software applications without affecting the system execution.
INTRODUCTION
As system complexity continues to increase, the integration of software and hardware components within embedded systems presents key challenges in monitoring and analyzing complex hardware and software interactions. The deep integration of software and hardware components within embedded systems often prevents the use of traditional analysis methods to monitor and analyze the internal state of these components. This situation prevents the use of a logic analyzer to observe interactions within the embedded system, and monitoring the erroneous behavior may itself affect system correctness.
Existing debugging methods that require the system execution to be halted are intrusive, either requiring significant hardware resources or leading to system perturbations that can change the execution behavior, and pose considerable challenges for in-situ analysis. For example, JTAG scan chains allow all registers within an SOC design to be monitored or controlled at runtime. However, in order to access these registers, the system execution must be halted. This perturbs the system execution such that observing the desired behavior may no longer be possible. Therefore, for in-situ analysis of monitored events, such intrusive methods are often infeasible, and when utilized may lead to system failure due to timing constraint violations, such as missed execution deadlines.
Intrusive debugging methods pose considerable challenges in real-time systems, for which hard execution constraints are critical to system correctness. If a task within the system does not complete its execution within the required time, the task can be considered to have failed. Whereas failure to meet hard execution deadlines may result in complete system failure, failure to meet soft execution deadlines may lead to undesired system behavior that can in turn cause system failure. Thus, soft deadlines must still be met to achieve the desired system goals. Hence, new debugging and verification methods are needed to provide in-situ analysis methods capable of monitoring software and hardware interactions without perturbing the system execution.
To overcome the challenges of traditional JTAG interfaces, numerous approaches have focused on trace-based methods for logging system events in both hardware and software components using dedicated trace and debug ports. For example, ARM's CoreSight [ARM 2013] and Embedded Trace Macrocell [ARM 2011] can be synthesized within an SOC design to provide system-level trace capabilities using a dedicated trace port, without interruption of the system execution. ARM's system-level trace supports configurable triggers, data filters, and trace compression methods. However, the trace methods are still limited in the amount of data that can be traced, stored, and transferred off chip in real time.
In this article, we present an event-driven system-level observation framework (SOF) providing low-overhead methods for observing and analyzing complex interactions across hardware and software boundaries at runtime. The SOF provides in-situ support for controlling event probes within software and configuring hardware components using blocking, nonblocking, and cascading configurations. For serializing and reporting rapidly occurring events, the SOF provides three types of a priority-based event streaming interface. The contributions in this article are: (1) a configurable, nonintrusive framework for monitoring designer-specified hardware and software events; (2) advanced observation methods for analyzing complex system events using blocking, nonblocking, and cascading event probe specifications; (3) a high-throughput pipelined, priority-based event streaming interface for serializing and analyzing monitored events at runtime; (4) area-efficient priority-based event streaming interfaces for efficiently reporting monitored events at runtime; and (5) a software sorting algorithm for efficiently sorting the event stream to provide a time-ordered stream of observed events.
The rest of this article is organized as follows. In Section 2, we highlight related work. In Section 3, we present an overview of the system-level observation framework. In Section 4, we present three types of event probes to support the observation of various event occurrences. In Section 5, we present a priority-based event streaming interface to serialize and report rapidly occurring events, and discuss three alternative implementations, including: (1) a high-throughput pipelined, priority-based event streaming interface; (2) an area-efficient round-robin priority event streaming interface and a software sorting algorithm; and (3) a priority-level-based event streaming interface and a software sorting algorithm. In Section 6, we present experimental results demonstrating the capabilities of the SOF, highlighting the throughput, area, and latency trade-offs for the three priority-based event streaming interfaces. Finally, in Section 7, we conclude and highlight future work.
PREVIOUS WORK
In this section, we provide an extensive overview of related work on runtime trace and debug methods for hardware and software components. Table I provides a summary and classification of related work, highlighting the collection method, target components, analysis method, storage, intrusiveness, and runtime configurability for each approach. The collection method defines how an approach collects data within the target system using trace-based, scan-based, or event-driven alternatives. The target highlights those components within the SOC that the approach seeks to monitor categorized as hardware or software. The analysis method indicates how and where the observed information is analyzed, including in-situ on chip, in-situ off chip, or offline. The storage defines where the collected information is stored within the system, including on-chip buffers, off-chip memory, none, or user defined. The intrusiveness of an approach is defined as how the approach affects the execution of the system categorized as non-intrusive, minimally intrusive, or intrusive. A non-intrusive approach is one that in no way affects or perturbs the system execution. In contrast, an intrusive approach exhibits considerable impact on the system execution to the extent that it can affect both the correctness of the system execution and the validity of the information. A minimally intrusive approach is one that may impact the system, but the impact is either minor or can be controlled at runtime to minimize or eliminate the negative effects of the monitoring method. Lastly, the runtime configurability indicates whether an approach can be configured at runtime to select which signals or events to monitor.
A software debugger allows an engineer to debug a software design and examine its state by halting the execution of software at a particular point to observe the state of the processor's internal registers and system memory. A software breakpoint works by inserting a special instruction in the software design to be debugged. When the instruction is executed, it invokes the debugger's exception handler. Similar tools exist for hardware designs [Camera et al. 2005; Yang et al. 2004], but given the inherently parallel execution of hardware cores, it is difficult in practice to match the utility of a software debugger, for two reasons. First, software is fundamentally linear. While high-level programming languages may obscure the fact, at the machine interface, software is a linear sequence of instructions. Second, the regularity of the load-store computer architecture means that intermediate results usually return to the memory system. Furthermore, a debugger has a high utility only when testing a subsystem in isolation. As the number of subsystems that the debugger does not control increases, the utility of the debugger decreases dramatically. Halting one subsystem is of low value if the rest of the system (for example, sensors, actuators, and physical processes) continues to operate. Debugging within real-time systems presents additional challenges, as proper operation is dependent on meeting tight timing constraints that can be easily perturbed during debugging.
In the context of real-time systems, previous work has focused on reducing the overhead of traditional software debuggers. When debugging a single task within a multitasked system, stopping all tasks during debugging is extremely intrusive and can lead to incorrect behavior and even system failure. To minimize this intrusion, an extension of the GNU debugger (GDB) enables a nonstop mode in which only a single task is stopped during debugging and all other tasks can execute normally [Sidwell et al. 2008]. This allows a user to control tasks explicitly in ways that are not possible in all-stop mode, in which all tasks stop during debugging. Stollon [2005a, 2005b] present a debug methodology that incorporates distributed on-chip instrumentation (OCI) components, allowing designers to configure how to trace (for instance, by defining the trace width and depth) and when to trace (such as by defining triggers that start the trace process). The distributed OCI components can then be connected by a dedicated bus to an on-chip analyzer that can control and process the trace data before making this data accessible off chip through a JTAG interface. The authors further propose a HyperJTAG interface that combines existing JTAG interfaces for processors within a multicore system with the distributed OCI components available through a single IP interface. Similarly, Vermeulen and Goel [2002] present a silicon debugging strategy for multiple clock domain systems that uses a JTAG port. The proposed on-chip debug infrastructure and debugging software provide support for both real-time and time-intrusive monitoring. In order to support real-time non-intrusive monitoring, an on-chip memory is utilized to trace specific internal signals, which can later be accessed through the JTAG port.
However, due to limitations in the availability of on-chip memory, the duration and number of signals that can be monitored in real time are limited. Abramovici et al. [2006] present a distributed reconfigurable fabric of multiplexers enabling designers to select a subset of signals to monitor. The selected signals are processed by a debug monitor that can directly forward the captured signals or perform basic processing on these signals; for example, the debug monitor can process signals to report only anomalous ones. Each configurable multiplexer and debug monitor operates within a single clock domain. The outputs from the distributed debug monitors are collected together by a trace component that records the signals into an on-chip memory that can then be accessed by a JTAG port. The proposed distributed debug offers the advantage of being able to limit the amount of data that is traced, supporting multiple clock domains and eliminating the need to route all probes to a single trace component, which can often lead to an unroutable design during the synthesis process. Peterson and Savaria [2004] present a debugging and verification environment that can monitor multiple internal signals within a hardware design and provide a real-time trace for the signals via a dedicated debug port. The debug port provides a reconfigurable data filter that can be rapidly reconfigured to select the subset of signals that are currently traced. This provides a balance between the inputs and outputs necessary for the debug port and the number of internal probes that can be supported. The trace data is transmitted to an assertion checker implemented within an off-chip FPGA that can be utilized to verify correct execution of the device, typically by verifying properties defined within an assertion language, such as PSL.
and Nicolici [2008, 2009] propose a system-level debug architecture targeted for post-silicon validation that utilizes configurable event triggers, a network of trace buffers, and a configurable communication framework for efficiently storing data samples within the available trace buffers. The event triggers enable designers to specify conditions that will start tracing of specific signals. The trigger conditions can be configured by designers through a set of configurable comparators. This approach enables designers to change event triggers at runtime. The proposed debug architecture includes a network of trace buffers for handling the simultaneous tracing of multiple data signals. The trace buffer architecture controls how traced signals are stored among available trace buffers according to designer-specified priorities. While this debug methodology provides support for runtime configuration, the proposed approach does not focus on enabling runtime in-situ analysis of traced events and data. Liu and Xu [2007] propose a methodology for creating an area-efficient trace interconnection fabric. Given the set of signals that need to be traced, a custom interconnection fabric is created in which multiplexers are utilized to trace mutually exclusive signals (that is, signals that are unlikely to occur simultaneously) and a custom crossbar network is utilized to trace concurrently accessible signals. Using either designer-specified identification of the types of signals to trace or analysis of the circuit structure, a customized interconnection fabric can be generated.
Rather than incorporate a scan chain directly within an RTL design, Yang et al. [2004] proposed an alternative automated method for selecting and extracting a subset of internal signals to be monitored. The selected signals are monitored by a separate FPGA-based test platform that provides scan-chain access to these signals for use within a co-simulation environment. The proposed methodology has the advantage of being able to utilize the original testbench developed for the RTL design while relying on automated tools to extract the desired signals for post-silicon co-simulation and verification. Watterson and Heffernan [2007] propose an online software monitoring method with the specific focus on developing a minimally intrusive method so as not to affect the execution of the application. Within the proposed method, events from the processor can be generated by minimally intrusive software instrumentations that report an event to a dedicated on-chip monitoring core. The monitoring core can then process the events and report the required data to the external environment.
The Owl framework [Shultz et al. 2005] is a distributed approach that incorporates monitoring modules within the specific parts of the system to be monitored. The distributed monitoring modules communicate the profile data to a specific location in main memory or to a separate memory dedicated for profiling. However, because the monitors will need to transmit this data via the system bus, the proposed approach can be intrusive due to bus contention. As the intended target is a system realized within an FPGA, the monitoring modules themselves are reconfigurable. This allows a designer to change how the monitoring process is implemented at runtime. For example, a designer can reconfigure the monitoring modules to alter the frequency at which the profile data is written to memory to reduce the profiling traffic on the system bus.
The MAMon monitoring system [El Shobaki and Lindh 2001] proposes a methodology for monitoring hardware- and software-based events within SOC designs by incorporating dedicated logic within hardware components to detect occurrences of specific events being monitored. A probe unit is utilized to capture and log all occurrences of these events within an external memory. Events within the MAMon system are defined as conditional expressions that are evaluated during each clock cycle. However, a host workstation is required to view and analyze the event log with capabilities for filtering and searching the monitored events. Backasch et al. [2013] presented a runtime verification approach to observe multicore SOC designs and verify designer-specified system properties. The approach utilizes the hidICE (hidden ICE) emulator [Hochberger and Weiss 2008] that transfers trace data to external analysis tools. The behavior and instructions carried out by the target SOC design can be precisely reconstructed and emulated by the hidICE emulator. The hidICE emulator enables observability of multicore SOC activities (e.g., bus control events, bus reads, interrupts, processor power-state changes) to capture a real-time and concurrent trace of processors and hardware cores in a shared bus multicore SOC. This framework utilizes a combination of on-chip analysis to extract the synchronization events needed between the target SOC and emulator, and off-chip analysis within the emulator and host device for analysis of system execution.
SYSTEM-LEVEL OBSERVATION
System-level observation methods provide the capabilities for monitoring and analyzing rapidly occurring events and in-situ support for configuring and controlling event probes within software and hardware components. Figure 1 presents an overview of the SOF integrated within a multiprocessor system-on-a-chip (SOC) design. The SOF consists of a software observation interface (SWOI) connected to the trace port of each processor core and a hardware observation interface (HWOI) connected to each hardware IP core to be observed. Each observation interface (OI) consists of one or more event probes (EPs), a timestamp counter, a configuration register for each EP, a priority-based event stream controller, and a small FIFO for buffering events within the event stream. To avoid affecting the execution of the main system, the SOF utilizes an auxiliary lightweight processor for the system observation engine (SOEngine) that executes the runtime observation software.
An EP is the basic element for monitoring events within software and hardware executions. The OI supports blocking, nonblocking, and cascading EPs to provide runtime support for defining and controlling the EPs. Blocking and nonblocking EPs have different semantic behaviors that affect how EPs are reported when the event stream FIFOs are full. A blocking EP will not detect another occurrence of an event until the observation software has processed the current event occurrence. In contrast, a nonblocking EP will continue to detect all occurrences of the event without requiring a reset from the observation software. In other words, a nonblocking EP will continue to observe new event occurrences. Thus, when the event stream FIFOs are full, a blocking EP reports the first event occurrence and a nonblocking EP reports the last event occurrence.
To observe the behavior of software executing on microprocessors within an SOC, the SOF leverages information provided by the processor's trace ports. Figure 2 demonstrates the interface between the trace signals of a processor and the SWOI. Importantly, the events that can be observed within the SWOI are limited to the information that is provided by the processor's trace port. For MicroBlaze processors [Xilinx 2009b], trace_pc and trace_instruction are 32-bit signals, trace_pid_reg is an 8-bit signal, and trace_valid_instr, trace_jump_taken, trace_dcache_req, trace_dcache_hit, trace_icache_req, and trace_icache_hit are 1-bit signals.
As the signals provided by the trace interface are determined by the processor manufacturer, the designer need not define the events to be monitored from scratch. Instead, a set of predefined configurable software events is provided. The system designer can select which software event probes, and how many instances of each, to incorporate within the SWOI. The following provides an overview of the software event probes.
Program Counter (PC) Event Probe. The program counter (PC) event probe is a software EP that detects the occurrence of a configurable program counter value. This allows the PC value of the EP to be configured by the system observation software. Again, as it may be necessary to monitor more than one PC value, a designer can specify how many PC EPs to incorporate within the SWOI.
Instruction Opcode Event Probe. This is a software EP that detects the occurrence of a configurable instruction opcode. This software EP provides the means by which the system observation software could perform detailed profiling of the execution behavior of specific instruction types. As with the PC EP, a designer can specify how many instruction opcode EPs to incorporate within the SWOI.
Branch-Taken Event Probe. This is a software EP that detects the occurrence of a jump or branch instruction being taken. In other words, this EP will be triggered when a branch or jump instruction occurs and the next PC value is not the next sequential program counter value. For the MicroBlaze processor, this event is a direct connection to the trace interface's trace_jump_taken signal. As this EP is independent of any specific instruction, only one instance of this probe is required within the SWOI.
Context Switch Event Probe. The context switch event probe is a software EP that detects the occurrence of a context switch. To observe the occurrence of a context switch, the SWOI internally stores and monitors the PID of the task currently being executed. When utilizing Xilinx's xilkernel [Xilinx 2009c], the current PID is provided by the trace interface signal trace_pid_reg. If the current PID differs from the previously stored PID, the context switch EP will be triggered.
Instruction/Data Cache Hit Probe. This is a software EP that detects instruction or data cache hits. For the MicroBlaze processor, this event is a direct connection to the trace interface's trace_icache_hit or trace_dcache_hit signal.
Instruction/Data Cache Miss Probe. This is a software EP that detects instruction or data cache misses. To detect the occurrence of a cache miss, the SWOI monitors both the cache request trace signal (such as trace_icache_req) and the cache hit trace signal (for example, trace_icache_hit).
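The context switch and cache miss probes described above can be illustrated with a small software model. This is only a sketch under stated assumptions: it assumes the cache hit signal is valid in the same cycle as the corresponding request, and that the stored PID is updated on every cycle. The names `cache_miss` and `ContextSwitchProbe` are hypothetical and are not part of the SOF.

```python
def cache_miss(req, hit):
    """A miss is modeled as a cache request that is not answered by a hit
    (assumption: req and hit are sampled in the same cycle)."""
    return bool(req) and not bool(hit)


class ContextSwitchProbe:
    """Hypothetical model of the context switch EP: it stores the last PID
    seen on the trace port and triggers when the PID changes."""

    def __init__(self, initial_pid=0):
        self.stored_pid = initial_pid & 0xFF  # the PID trace signal is 8 bits wide

    def clock(self, trace_pid_reg):
        pid = trace_pid_reg & 0xFF
        triggered = pid != self.stored_pid
        self.stored_pid = pid
        return triggered


# Illustrative trace: (trace_pid_reg, trace_icache_req, trace_icache_hit) per cycle.
trace = [
    (1, 1, 1),
    (1, 1, 0),
    (2, 0, 0),
    (2, 1, 0),
    (1, 1, 1),
]

probe = ContextSwitchProbe(initial_pid=1)
switch_cycles = []
miss_cycles = []
for cycle, (pid, req, hit) in enumerate(trace):
    if probe.clock(pid):
        switch_cycles.append(cycle)
    if cache_miss(req, hit):
        miss_cycles.append(cycle)
```

In this trace the PID changes at cycles 2 and 4, and instruction cache misses occur at cycles 1 and 3; cycle 2 has no request and therefore no miss.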
EVENT PROBE TYPES
Each EP contains an EP controller implemented as a state machine consisting of three states, namely EP_NE, EP_EV, and EP_BL, as shown in Figure 3. Initially, the EP controller waits in the no-event state EP_NE until the desired event is observed, defined by the logical expression epN_cond. When the desired event is observed, the EP controller will capture the current timestamp value and probe data, and then generate the event probe data ep_dataN, consisting of an event probe address, data, and timestamp. The EP controller will then transition to the event state EP_EV. The subsequent behavior of the EP depends on its configuration. The EP can be configured as a blocking event probe, a nonblocking event probe, or a cascading event probe.
All EPs can be configured at runtime using software APIs implemented within the SOEngine. These APIs allow a user to configure and control the EPs. All configuration and control commands are transmitted through the FSL link [Xilinx 2009a] of the SOEngine, and each command begins with an initial configuration word
<R, CPC, CAS, BL, EM, DM, TM, PRID, EPID, OID>,
where:
- R is the 1-bit reset flag;
- CPC is the 1-bit custom probe configuration;
- CAS is the 1-bit cascading event probe configuration flag;
- BL is the 1-bit blocking probe configuration flag;
- EM is the 1-bit event mask;
- DM is the 1-bit data mask;
- TM is the 1-bit timestamp mask;
- PRID is the 5-bit prior event ID;
- EPID is the 5-bit event probe ID; and
- OID is the 8-bit observation interface ID.
Each SWOI and HWOI is assigned an 8-bit observation interface ID to uniquely identify each interface and, within these observation interfaces, each EP is assigned a 5-bit event probe ID. This allows all EPs to be uniquely identified within the observation software using 13 bits.
If the reset flag R is set, the blocking event probe specified by the EPID and OID will be reset. For cascading probes, all EPs within a sequence of cascading event probes will be reset simultaneously. This ensures correct observation of the cascading sequence of events being detected. For instance, consider cascading event probes using EP0, EP1, and EP2. If EP0, EP1, and EP2 were reset sequentially across multiple cycles, EP0 could be observed again before EP1 or EP2 is reset. This could then result in an incorrect observation, such as an incorrect latency measurement, of the cascading probe sequence.
The custom probe configuration (CPC) is utilized to configure probe-specific configuration data. For example, a program counter event probe requires configuration data to specify the address of the instruction being monitored. If the CPC bit is set within the configuration command, an additional configuration word containing the probe-specific configuration data will be transmitted. For example, to configure a nonblocking program counter event probe EP3 within software observation interface OI2 to detect the occurrence of the address 0x44001264, the following command would be utilized.
The event mask EM is used to enable and disable individual event probes. When the event mask bit is set, the detection of the corresponding probe will be disabled. The data mask DM and timestamp mask TM are used to control the capture of probe-specific data and timestamps when reporting the event occurrences.
The blocking, BL, and cascading, CAS, configuration bits control the type of event probe. When the CAS bit is set, the prior event ID, PRID, is used to specify the source of the prior event that must be detected for the current event probe to be triggered. For instance, to configure event probe EP1 within the interface OI2 as a blocking, cascading event probe with a prior event probe EP3, the following command would be utilized.
<0, 0, 1, 1, 0, 0, 0, 3, 1, 2>
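The configuration word format can be sketched in software as follows. This is a minimal sketch, assuming the fields are packed most-significant-first in the order listed (R in the top bit, OID in the low 8 bits of a 25-bit word); the text does not specify the bit layout. The helpers `pack_config`, `unpack_config`, and `global_ep_id` are hypothetical names and do not represent the SOEngine API. Fields that are not given are assumed to be zero.

```python
# Field names and widths as listed in the text; the bit order is an assumption.
FIELDS = [("R", 1), ("CPC", 1), ("CAS", 1), ("BL", 1), ("EM", 1),
          ("DM", 1), ("TM", 1), ("PRID", 5), ("EPID", 5), ("OID", 8)]


def pack_config(**values):
    """Pack the named fields into one integer, first field in the top bits."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bit(s)")
        word = (word << width) | value
    return word


def unpack_config(word):
    """Inverse of pack_config: split an integer back into its fields."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


def global_ep_id(oid, epid):
    """13-bit identifier built from the 8-bit OID and the 5-bit EPID
    (concatenation order is an assumption)."""
    return (oid << 5) | epid


# The blocking, cascading command <0, 0, 1, 1, 0, 0, 0, 3, 1, 2> from the text.
cascade_cmd = pack_config(CAS=1, BL=1, PRID=3, EPID=1, OID=2)

# A nonblocking PC probe EP3 in OI2: CPC is set, so a second word carries the
# address. DM, TM, and PRID are assumed to be zero here.
pc_cmd = [pack_config(CPC=1, EPID=3, OID=2), 0x44001264]
```

Under this assumed layout, `cascade_cmd` is `0x606102`, and `unpack_config(cascade_cmd)` recovers the ten fields of the command shown above.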
Blocking and Nonblocking Event Probe
A blocking event probe will not detect another occurrence of an event until the observation software resets the event probe. Thus, the EP controller will transition to the blocking state EP_BL, and wait until a reset signal epN_rst is asserted for the event probe. In contrast, a nonblocking event probe will continue to detect and report all occurrences of the event without requiring a reset from the observation software. For nonblocking (and non-cascading) probes, the EP controller will immediately transition back to the EP_NE state.
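The difference between blocking and nonblocking probes can be illustrated with a small model of the three-state EP controller (states written here as EP_NE, EP_EV, and EP_BL). This is a sketch, not the hardware implementation: it assumes the controller spends exactly one clock cycle in the event state, and the class `EventProbe` and helper `run` are hypothetical names.

```python
from enum import Enum


class State(Enum):
    EP_NE = "no event"
    EP_EV = "event"
    EP_BL = "blocked"


class EventProbe:
    """Hypothetical cycle-level model of a non-cascading EP controller."""

    def __init__(self, blocking):
        self.blocking = blocking
        self.state = State.EP_NE
        self.reported = []  # timestamps captured by the probe

    def clock(self, timestamp, cond, rst=False):
        if self.state is State.EP_NE:
            if cond:
                self.reported.append(timestamp)  # capture the timestamp
                self.state = State.EP_EV
        elif self.state is State.EP_EV:
            # Blocking probes wait for a reset; nonblocking probes re-arm.
            self.state = State.EP_BL if self.blocking else State.EP_NE
        elif self.state is State.EP_BL:
            if rst:
                self.state = State.EP_NE


def run(probe, cond_trace, reset_cycles=()):
    """Clock the probe once per entry of cond_trace; the index is the timestamp."""
    for timestamp, cond in enumerate(cond_trace):
        probe.clock(timestamp, cond, rst=timestamp in reset_cycles)
    return probe.reported


cond_trace = [1, 0, 1, 0, 1, 0, 0, 1]
blocking_report = run(EventProbe(blocking=True), cond_trace, reset_cycles={5})
nonblocking_report = run(EventProbe(blocking=False), cond_trace)
```

With the same condition trace, the blocking probe reports timestamps 0 and 7 (the occurrences at 2 and 4 arrive before the reset at cycle 5), while the nonblocking probe reports 0, 2, 4, and 7.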
Cascading Event Probe
Using the SOF, a designer can implement many different analysis methods in response to the observed system events. A common analysis for many designs is measuring the latency between two events. To monitor this latency, two event probes can be created to observe the occurrence of each event. When these events occur, the system observation software can read the timestamp for both events, compute the latency, and reset the EPs to observe the next occurrence. However, this behavior may lead to an incorrect latency calculation, as the time at which the EPs are reset can influence the correctness of the calculation. Figure 4 presents an example behavior for two event probes EP0 and EP1. At time x0, a reset signal is asserted for EP0 and EP1, causing the EPs to return to the EP_NE state. When the condition ep1_cond for EP1 is observed at time x1, EP1 will transition to the EP_EV state, assert the corresponding event flag, and capture the current timestamp. When the condition ep0_cond for EP0 is observed at time x2, EP0 will transition to the EP_EV state, assert the corresponding event flag, and capture the current timestamp. Calculating the latency from the occurrence of EP0 to the occurrence of EP1 is impossible, as EP1 was captured before EP0 and the captured occurrences therefore do not represent the correct temporal relation.
To support correct and predictable latency measurements within the SOF, and any observation requiring a cause-and-effect relationship, we further support a cascading event probe (CEP), presented in Figure 3. The CEP allows an EP to be dependent on both the event's condition and the occurrence of a prior event. For each CEP, a prior_ev_en signal is utilized to configure the EP as dependent on a prior event. If enabled, a prior_ev_id signal is utilized to specify the prior event. Note that prior events are currently constrained to the same HWOI or SWOI. Figure 5 presents the behavior for two event probes, EP0 and EP1, in which EP1 is configured as a CEP with EP0 as the prior event. At time x0, a reset signal is asserted for EP0 and EP1, causing the EPs to return to the EP_NE state. At time x1, although ep1_cond is asserted, EP1 will remain in the EP_NE state, as the prior event EP0 has not yet occurred. At time x2, the condition ep0_cond for EP0 is observed. EP0 will transition to the EP_EV state, assert the corresponding event flag, and capture the current timestamp. Subsequently, at time x3, the condition ep1_cond and the prior occurrence of EP0 will be observed, and EP1 will transition to the EP_EV state, assert the corresponding event flag, and capture the current timestamp. For all occurrences of the sequence of events EP0 and EP1, the correct latency can be calculated within the system observation software. Cascading event probes can be configured as blocking or nonblocking. A cascading, nonblocking EP will continue to detect and report occurrences of the cascading event without requiring an explicit reset from the observation software. As a cascading event probe is dependent on one or more prior events, all prior events should be reset in order to detect the same sequence of events. Thus, the EP controller will first transition to the EP_BL state, where it waits for a reset signal.
Within the HWOI and SWOI interface, a cascading event probe reset logic component detects the occurrence of the last nonblocking event probe within the cascading event chain and simultaneously resets all EPs, returning all EPs within the chain to the EP_NE state. A cascading, blocking event probe will remain in the EP_BL state until explicitly reset by the observation software. As with the nonblocking event probe, the cascading event probe reset logic will reset all probes within the cascading chain when this reset is received.
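The effect of cascading on a latency measurement can be sketched with a simple model of two probes, EP0 and EP1, observed after a common reset. This is an illustrative sketch only: the function `capture` is a hypothetical name, list indices stand in for timestamps, and it assumes the prior event must have occurred in a strictly earlier cycle for the cascading probe to trigger.

```python
def capture(ep0_cond, ep1_cond, cascade):
    """Return (ts0, ts1), the first timestamps captured by EP0 and EP1 after
    a reset. If cascade is True, EP1 only triggers once EP0 has occurred."""
    ts0 = ts1 = None
    for t, (c0, c1) in enumerate(zip(ep0_cond, ep1_cond)):
        # EP1 is evaluated first so that EP0 must occur in an earlier cycle.
        if ts1 is None and c1 and (not cascade or ts0 is not None):
            ts1 = t
        if ts0 is None and c0:
            ts0 = t
    return ts0, ts1


# ep1_cond is asserted at cycles 1 and 5; ep0_cond is asserted at cycle 3.
ep0_cond = [0, 0, 0, 1, 0, 0, 0]
ep1_cond = [0, 1, 0, 0, 0, 1, 0]

plain = capture(ep0_cond, ep1_cond, cascade=False)
cascaded = capture(ep0_cond, ep1_cond, cascade=True)

plain_latency = plain[1] - plain[0]
cascaded_latency = cascaded[1] - cascaded[0]
```

Without cascading, EP1 captures the early occurrence at cycle 1 and the computed latency is negative, which is meaningless. With cascading, EP1 ignores the occurrence at cycle 1 and captures cycle 5, giving a latency of 2 cycles after EP0 at cycle 3.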
PRIORITY-BASED EVENT STREAM CONTROLLER (PESC)
The SOF utilizes a priority-based event stream controller (PESC) to serialize and report multiple observed events, as shown in Figure 6 . Within each SWOI and HWOI, the PESC serializes and stores observed events within a small FIFO. The system-level observation controller (SOCntrl) serializes and stores monitored events across multiple SWOIs and HWOIs using the same priority-based event stream control mechanism. The observed events are finally reported to the runtime observation software using a dedicated interface to an isolated processor executing the observation software to analyze the event stream in-situ. All EPs can be configured using software APIs implemented within the runtime observation software. We present three types of PESCs:
(1) an in-order pipelined PESC (IO-PESC); (2) a round-robin PESC (RR-PESC); and (3) a priority-level-based PESC (PL-PESC).

In-Order Priority Event Stream Controller (IO-PESC)
The in-order pipelined, priority-based event stream controller (IO-PESC) is incorporated within the SWOIs, HWOIs, and SOCntrl to serialize and report observed events in order based on the events' occurrence [Lee and Lysecky 2013]. A pipelined binary tree structure of IO-PESC components with log₂ N stages (where N is the number of probes within the SWOI or HWOI) is utilized to forward observed events from the EPs to the SOEngine. Figure 7 presents a pipelined binary tree structure of IO-PESC components. An individual IO-PESC component compares two input events and selects the one with the highest priority to forward to the next stage during each clock cycle. When the forwarded event is read by the following pipeline stage, the IO-PESC will again compare the two current event inputs to determine the next event to forward. When an observed event reaches the final stage of the pipeline within a specific observation interface, the observed event is written to a FIFO. The output of the FIFOs from individual OIs is then connected to the SOCntrl, which uses the same IO-PESC components to control the order in which events from different OIs are reported to the runtime observation software.
In the current SOF implementation, an IO-PESC reports events in order based on the events' timestamps. A lower timestamp indicates the event was observed earlier and thus needs to be reported first. In the case where two events have the same timestamp, the event probe with a lower ID is given priority. Figure 8 demonstrates the cycle-by-cycle execution behavior of the pipelined in-order priority controller for an example system consisting of four nonblocking, non-cascading EPs. The IO-PESC requires a two-stage pipelined tree structure. For illustrative purposes, the EPs are set up such that they will continually trigger once reset. Thus, all four EPs are initially triggered simultaneously and will have the same timestamp at clock cycle x 0 .
In the first stage, the IO-PESC compares timestamps for EP 0 and EP 1 and for EP 2 and EP 3 . As all events currently have the same timestamp, those events with the lowest IDs-EP 0 and EP 2 , respectively-in each comparison will be forwarded to the next stage. In the second stage, the IO-PESC compares the events EP 0 and EP 2 , outputting EP 0 as the first observed event. Whenever an EP is initially read from the EP controller or the reported event is forwarded to the next pipelined stage, a reset/read signal is asserted that allows the EP to detect another event, or allows the previous stage of the pipeline to select the next event. Note that not all pipeline stages may have a valid event at all times (indicated as an X in the figure) .
This pipelined binary tree structure achieves an overall throughput of one observed event per clock cycle. However, this binary tree structure requires significant area overhead when many different events need to be monitored. For instance, to monitor 32 different events, 31 IO-PESC components are required within 5 stages. The area required by the EPs and the binary tree structure increases linearly in relation to the number of EPs. 
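The comparison rule and the size of the tree can be sketched in Python as follows. This is an illustrative model only: the names `io_pesc_select` and `tree_size` and the `(timestamp, ep_id)` tuple representation are assumptions, and the pipelining itself is not modeled.

```python
import math


def io_pesc_select(a, b):
    """Return the event to forward: lower timestamp first, then lower EP ID.

    Events are (timestamp, ep_id) tuples; None marks an input with no valid event.
    """
    if a is None:
        return b
    if b is None:
        return a
    return min(a, b)  # tuple comparison orders by timestamp, then by ep_id


def tree_size(num_eps):
    """Components and stages of a full binary tree over num_eps probes (a power of two)."""
    return num_eps - 1, int(math.log2(num_eps))


# Four probes firing with the same timestamp, as in the four-EP example
first = io_pesc_select((0, 0), (0, 1))
second = io_pesc_select((0, 2), (0, 3))
winner = io_pesc_select(first, second)
components, stages = tree_size(32)
```

With equal timestamps the first stage forwards EP 0 and EP 2 and the second stage outputs EP 0, and `tree_size(32)` gives the 31 components in 5 stages quoted above.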
Round-Robin Priority Event Stream Controller (RR-PESC)
Instead of utilizing a pipelined binary tree structure to directly sort observed events as they are reported to the SOCntrl, we present a round-robin priority-based event stream controller (RR-PESC). The RR-PESC is an area-efficient event stream ordering technique that significantly reduces area requirements compared to the IO-PESC [Lee and Lysecky 2014] .
The RR-PESC is incorporated within the OIs and the SOCntrl to serialize and report observed events to the runtime observation software. Within the OIs, the RR-PESC compares all input events and selects the one to report according to a round-robin priority control scheme. The selected event is written to the output FIFO. The outputs of the FIFOs of the individual OIs are then connected to the SOCntrl, which uses the same RR-PESC mechanism to control the order in which events from different OIs are reported to the runtime observation software.
We consider a set of N EPs in an OI, EP i = {EP 0, EP 1, ..., EP N−1}. The EPs are assumed independent, that is, no correlation exists between the EPs. Every clock cycle, the RR-PESC checks whether any EP has detected its event. When observed events exist, the RR-PESC selects one of them in a cyclic fashion. The RR-PESC starts its search from an index j, which is initially 0. The EPs within the OI are searched, starting from index j, to find EP i such that EP i is observed and i is the smallest index greater than or equal to j. After outputting EP i, the RR-PESC updates j to i+1. If no observed EP has an index greater than or equal to j, the RR-PESC wraps around and continues the search from index 0.
Figure 9 presents an example of the RR-PESC behavior for a set of five EPs. The RR-PESC will initially use a search index of j = 0. At time t, EP 2 is first observed. The RR-PESC will output EP 2 and update the search index j to 3. At time t+2, EP 4 is observed. The RR-PESC will output EP 4 and update the search index j to 0. Later, at time t+8, EP 1 is observed and the RR-PESC will output EP 1, updating the search index j to 2. Then, both EP 0 and EP 3 are observed at time t+11. Because the search index is currently 2, the RR-PESC will first output EP 3, followed by EP 0.
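The search rule can be written compactly in Python. This is a software sketch of the selection logic only; the function name `rr_pesc_select` and the list-of-flags representation of pending events are assumptions. The loop replays the five-EP example from the text.

```python
def rr_pesc_select(pending, j):
    """Pick the first pending EP at or after index j, wrapping around to 0.

    pending: list of booleans, one per EP. Returns (selected index, next search index),
    or (None, j) when nothing is pending.
    """
    n = len(pending)
    for offset in range(n):
        i = (j + offset) % n
        if pending[i]:
            return i, (i + 1) % n
    return None, j


j = 0
outputs = []
# Sets of EPs observed together, in time order: t, t+2, t+8, t+11
for observed in [{2}, {4}, {1}, {0, 3}]:
    pending = [i in observed for i in range(5)]
    while any(pending):
        i, j = rr_pesc_select(pending, j)
        pending[i] = False
        outputs.append(i)
```

The resulting output order is EP 2, EP 4, EP 1, EP 3, EP 0, matching the walk-through, with EP 3 reported before EP 0 because the search index was 2 when both were observed.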
Using the RR-PESC, events may be output out of timestamp order, even though a timestamp-sorted event stream is what is useful for system monitoring and analysis. Figure 10 shows an example scenario, for a set of four EPs, in which an EP observed later is output before an EP observed earlier. At time t+2, both EP 1 and EP 3 are observed at the same time. As the search index is currently 3, the RR-PESC will output EP 3 and update the search index j to 0. At time t+3, EP 0 is observed and, because the search index is currently 0, EP 0 will be output. It is not until time t+5 that EP 1 is output. As highlighted in the dashed box of the figure, EP 1 was observed before EP 0. However, EP 0 was output before EP 1, as highlighted in the solid box.
Thus, using a round-robin priority control scheme, the RR-PESC cannot guarantee that events are output according to their observation time. However, an upper bound on the difference between the event observance and the final output can be determined. To find this worst-case event output time, assume all EPs observe their respective events at the same time. Further, assume that the search index j is 0. In this case, the RR-PESC will output an observed event in the following order: EP 0 → EP 1 · · · → EP N−2 → EP N−1 . Event EP N−1 will wait for N event outputs before being output. Thus the worst-case event output time for N EPs is N event outputs. Note that, while events can be output at a maximum rate of one event per clock cycle, the observation software controls the effective event output rate. Hence the worst-case event output time is defined in terms of event outputs, and not clock cycles. This worst-case event output time implies that, in a sequence of N event outputs, the timestamp for event EP N−1 must be greater than or equal to EP 0 .
While the output of the RR-PESC is not sorted by timestamp, the reported events are nearly sorted, and the upper bound on the difference between event observance and event reporting can be leveraged to implement efficient software-based algorithms that sort the events according to their timestamps. For in-situ analysis of system behavior, we present the immediate sort/output algorithm, which stores incoming events in a buffer whose size is twice the number of enabled EPs (numEPs). Because the maximum difference between an event observation and event reporting is N event outputs, after sorting the buffer, the first half of the buffer can be output with a guarantee that no incoming event will have a timestamp less than those of the output events.
Algorithm 1 shows the pseudocode for the immediate sort/output algorithm. Its goal is to sort events as they are read from the event stream and written to the buffer. The procedure starts by reading an event from the SOCntrl. The event is immediately inserted into the buffer in sorted order using an insertion sort. The time complexity of this operation is O(n), where n is the number of events in the buffer. After inserting the new event, the algorithm determines which events can be immediately output. If the difference in timestamp between the earliest event and any other event in the buffer is greater than numEPs, the earliest event is output from the buffer. As the buffer is sorted, this process only needs to compare the first and last events in the buffer, and it outputs events as long as this condition holds. Furthermore, if the buffer reaches its maximum capacity of 2 * numEPs, the first half of the buffer is immediately output.
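A minimal Python sketch of the procedure as described in the paragraph above follows. It is not a transcription of Algorithm 1: the class name `ImmediateSorter`, the `(timestamp, ep_id)` event tuples, and the use of `bisect.insort` for the sorted insertion are assumptions made for illustration.

```python
import bisect


class ImmediateSorter:
    """Sorts a nearly sorted event stream using a buffer of at most 2 * num_eps events."""

    def __init__(self, num_eps):
        self.num_eps = num_eps
        self.buffer = []  # kept sorted by (timestamp, ep_id)

    def push(self, timestamp, ep_id):
        """Insert one event and return the events that can now be output, in order."""
        bisect.insort(self.buffer, (timestamp, ep_id))
        ready = []
        # Output the earliest event while it trails the latest one by more than num_eps
        while len(self.buffer) > 1 and self.buffer[-1][0] - self.buffer[0][0] > self.num_eps:
            ready.append(self.buffer.pop(0))
        # At full capacity, flush the first half of the buffer
        if len(self.buffer) >= 2 * self.num_eps:
            ready.extend(self.buffer[: self.num_eps])
            del self.buffer[: self.num_eps]
        return ready


sorter = ImmediateSorter(num_eps=2)
out = []
for ts, ep in [(5, 1), (5, 0), (6, 1), (6, 0), (20, 0)]:
    out.extend(sorter.push(ts, ep))
```

Note that with a single enabled EP the buffer holds two entries, so nothing is output until a second event arrives; this is the behavior the latency results for the single-probe scenario rely on.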
Priority-Level-Based Event Stream Controller (PL-PESC)
To further reduce the area requirement compared to the RR-PESC, the SOF supports a priority-level event stream controller (PL-PESC) that reports events based on a priority level (PL) assigned to EPs and OIs. The PL-PESC allows designers to specify the priorities of different system components or events. To apply these priorities, the PL-PESC utilizes a concept similar to fixed-priority preemptive scheduling, which executes the highest-priority task that is currently ready to execute. When observed events exist, the PL-PESC will select the observed event with the highest priority.
The priorities of observed events can be configured in two different ways.
-EP → PL: the ID of the EP is utilized to determine the EP's priority level
-OI → PL: the ID of the OI is utilized to determine the EP's priority level
Figure 11 shows an example of the PL-PESC operation using an EP → PL priority assignment. Observed events from enabled EPs are stored in the PLB with the same ID as the EP. For example, using an EP → PL priority assignment, all EPs with an ID of 0 will be mapped to the same PLB, namely PLB 0. This is further extended to the SOCntrl for selecting events between OIs. Figure 12 shows an example of the PL-PESC using an OI → PL priority assignment. Observed events from enabled EPs are stored in the PLB with the same ID as the originating OI. Using an OI → PL priority assignment, all EPs within the same OI will be mapped to the same PLB. For example, all EPs within OI 0 will be mapped to PLB 0.
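The two priority assignments can be illustrated with a short Python sketch. The function names, the `(oi_id, ep_id)` event tuples, and the mode strings are hypothetical, and the sketch assumes that a lower priority-level number means a higher priority, which this section does not state explicitly.

```python
def priority_level(oi_id, ep_id, mode):
    """Map an observed event to a priority level (and hence a PLB index)."""
    if mode == "EP->PL":
        return ep_id
    if mode == "OI->PL":
        return oi_id
    raise ValueError(f"unknown priority assignment: {mode}")


def pl_pesc_select(pending, mode):
    """Return the pending (oi_id, ep_id) event with the highest priority.

    Assumption: level 0 is the highest priority; ties keep the first event listed.
    """
    return min(pending, key=lambda ev: priority_level(ev[0], ev[1], mode))


pending = [(1, 0), (0, 1)]  # EP 0 of OI 1 and EP 1 of OI 0, observed together
by_ep = pl_pesc_select(pending, "EP->PL")
by_oi = pl_pesc_select(pending, "OI->PL")
```

Under EP → PL the event from EP 0 is reported first regardless of its OI, while under OI → PL the event from OI 0 wins.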
The priority-level buffer (PLB) is a buffer to temporarily store observed events that need to be sorted. The size of each PLB is based on the number of events assigned to a PL, the rate of event occurrence, and the number of higher-priority events. We utilize an extension of the response-time analysis [Audsley et al. 1993 ] to determine the individual PLB size requirements and the total event stream buffer size (TESBS).
The TESBS comprises the total buffers required across all PLBs. The TESBS is calculated as
where l is a priority level and N, the total number of priority levels.
The priority-level buffer size (PLBS) for each PLB is comprised of the total event buffers required for a specific priority level l. The PLBS for priority level l (PLBS l ) is calculated as
where j represents all EPs mapped to PLB l given the current priority assignment, and EBR is the event buffer requirement for a specific EP, given by
where j is an EP mapped to PLB l , k represents all EPs assigned to PLB l , WCRT k represents the worst-case report time for all EPs k, and MFR j represents the maximum firing rate for EP j.
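One way to write the three definitions above in LaTeX is sketched below. The two summations follow directly from the text. The ceiling in the EBR expression, the index range 0 to N−1, and the reading of MFR j as a minimum interval in cycles between occurrences of EP j are assumptions, chosen to agree with the statement that the maximum WCRT over a PLB bounds how long its events must be held.

```latex
\begin{align}
\mathrm{TESBS} &= \sum_{l=0}^{N-1} \mathrm{PLBS}_l \\
\mathrm{PLBS}_l &= \sum_{j \in \mathrm{PLB}_l} \mathrm{EBR}_j \\
\mathrm{EBR}_j &= \left\lceil \frac{\max_{k \in \mathrm{PLB}_l} \mathrm{WCRT}_k}{\mathrm{MFR}_j} \right\rceil
\end{align}
```

Under this reading, a probe whose events recur far less often than the longest report time in its PLB needs a single buffer entry.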
In order to ensure a PLB can be efficiently sorted, the PLB must store events for a sufficient duration to ensure that an occurrence of the lowest-priority event mapped to the PLB can be observed and inserted into the sorted position according to the EP's timestamp. The worst-case report time (WCRT) for an EP is the maximum time between the event observation and the event being reported to the SOEngine. Hence, the maximum WCRT for all events mapped to a PLB defines an upper bound on the period of time that events within the PLB may need to be sorted. Given this upper bound, the EBR for each EP can be determined by considering the EP's maximum firing rate (MFR). The MFR for an EP is the maximum frequency at which the associated event is expected to be observed. For example, consider an event defined as the execution of a periodic task within a multitasked application. The MFR for this event is equal to the period of the task. Within the SOF, we assume a designer can specify the MFR for all EPs.
Finally, the WCRT of an event probe EP i is calculated as
where RT i is the report time (RT) in cycles from the event observation to the reporting of the event to the SOEngine without interference from higher-priority EPs, and RTI j is the interference in cycles for each instance in which a higher-priority EP is reported. The WCRT equation can be solved iteratively starting with WCRT i = RT i. For our SOF implementation, the RT i is 2 and the RTI j is 1 for all events. Thus a simplified equation for WCRT i is
While the output of the PL-PESC is not sorted by timestamp, the reported events in each PLB are nearly sorted. After sorting each PLB, events of the PLB can be output with a guarantee that any incoming event will not have a timestamp less than the output events. To sort each PLB, the immediate sort/output algorithm is utilized.
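The iterative WCRT solution mentioned earlier can be sketched in Python. The recurrence used here, WCRT_i = RT_i + Σ ⌈WCRT_i / MFR_j⌉ · RTI_j over higher-priority probes j, is the classical response-time form and is an assumption rather than a formula taken from the text. The function name `wcrt`, the convention that probes 0..i−1 have higher priority than probe i, and the iteration limit are likewise illustrative.

```python
import math


def wcrt(i, mfr, rt=2, rti=1, limit=10_000):
    """Fixed-point iteration for the worst-case report time of probe i, in cycles.

    mfr[j] is taken as the minimum interval, in cycles, between occurrences of EP j.
    Probes with a lower index are assumed to have higher priority.
    """
    w = rt  # start from the interference-free report time
    for _ in range(limit):
        nxt = rt + sum(math.ceil(w / mfr[j]) * rti for j in range(i))
        if nxt == w:
            return w
        w = nxt
    raise RuntimeError("WCRT iteration did not converge")


mfr = [2, 2]  # two probes, each recurring every 2 cycles
wcrts = [wcrt(i, mfr) for i in range(len(mfr))]
```

For two probes that each recur every 2 cycles, the iteration settles at 2 and 4 cycles, which agrees with the WCRTs quoted for the CEP scenario in the buffer size analysis.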
The MFR for events can be utilized to estimate the expected event stream throughput (EST) for the enabled event probes, as.
Given a set of enabled EPs, a designer can analyze whether the selected event stream controller can support the expected EST. This analysis helps to ensure that the enabled probes can be observed without overflowing the buffers within the PESCs and the software buffers within the observation software. However, in the case where the firing rate of an event is greater than the estimated MFR, the observation of some events may be missed. The specific event occurrences that will be missed depend on whether the EPs are configured as blocking or nonblocking.
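If the EST is taken to be the sum of the per-probe maximum event rates, it can be estimated as below. This reading is an assumption, as are the function name `expected_event_stream_throughput`, the choice of events per second as the unit, and the expression of each MFR as a minimum interval in clock cycles.

```python
def expected_event_stream_throughput(mfr_cycles, clock_hz):
    """Estimate the EST in events per second.

    mfr_cycles: minimum interval between occurrences of each enabled EP, in cycles.
    clock_hz: system clock frequency in Hz.
    """
    return sum(clock_hz / interval for interval in mfr_cycles)


# Two hypothetical probes on a 125 MHz clock, firing every 120 ms and every 310 ms
est = expected_event_stream_throughput([15e6, 38.75e6], 125e6)
```

A designer could compare such an estimate against the measured effective throughput of the chosen PESC to judge whether the enabled probes can be sustained.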
Additionally, the MFR for events can also be utilized to estimate the overflow of a FIFO for the enabled event probes in an OI. We utilize an extension of the utilization test [Liu and Layland 1973] to determine the FIFO utilization (FUT) in an OI, as
where N is the number of enabled EPs in OI j and i is the ID of enabled EPs. The FUT analysis determines whether the number of enabled EPs can be observed without overflowing the FIFOs within the OIs. If the utilization test fails, the OI may miss events. Again, the specific event occurrence that will be missed depends on how the EPs are configured.
EXPERIMENTAL RESULTS
To evaluate and demonstrate the system-level observation framework, we consider an FPGA-based prototype of an SOC design consisting of a 125 MHz MicroBlaze processor and several hardware IP cores, presented in Figure 13. In addition to hardware cores implementing basic system functionality (such as timers, interrupt controllers, memory controllers, and UARTs), the system design includes three additional cores accelerating specific operations: (1) a 13-tap FIR filter; (2) a Sobel edge detection (SED) core processing 640×480 grayscale images; and (3) a TFT controller for displaying the resulting images using a DVI display output. We implemented a real-time application consisting of five periodically executing tasks from the SNU benchmark suite [SNU 2010], namely binary search (bs), FFT using the Cooley-Tukey algorithm (fft1), integer-based forward discrete cosine transform from the JPEG image encoding standard (jfdctint), matrix multiplication (matmul), and matrix inversion (minver). The execution periods for individual tasks range from 120ms to 310ms. Xilinx xilkernel 4.00a was utilized as the operating system and configured for priority-based scheduling. To observe the start and end times for all application tasks, 10 configurable software event probes configured as nonblocking were implemented within the SWOI for the MicroBlaze processor. Similarly, six configurable hardware event probes configured as nonblocking were implemented within the HWOIs for the FIR, SED, and TFT cores. The system was synthesized using Xilinx Platform Studio (XPS) 11.5 targeting a Virtex-5 FPGA (XC5VLX110T).
We consider four monitoring scenarios: (1) a constant event probe (CEP) scenario; (2) a fixed-frequency event probe (FFEP) scenario; (3) a real-time system case study (RTCS) scenario; and (4) a system-level case study (SLCS) scenario.
The CEP scenario is an artificial observation scenario, consisting of two event probes that are observed every clock cycle, used to evaluate the maximum event stream throughput. The FFEP, RTCS, and SLCS scenarios are more realistic observation scenarios. The FFEP scenario is utilized to measure the latency of a single event probe whose event occurs at a fixed period of 39.36μs. The RTCS scenario is designed to observe the start time and end time of each periodic task execution for the five software tasks considered. The SLCS scenario is designed to observe the start and end times of the five software tasks of the RTCS scenario and of the three additional cores, across multiple HWOIs and SWOIs.
Figure 14 presents the area required in lookup tables (LUTs) and flip-flops (FFs) for the IO-PESC, the RR-PESC, and the PL-PESC as a function of the number of event probes, ranging from 16 to 128. The area requirements increase from 5517 to 43785 total LUTs and FFs for the IO-PESC, from 1164 to 13581 for the RR-PESC, and from 1068 to 13019 for the PL-PESC. For 16 EPs, the RR-PESC requires 78.9% less area than the IO-PESC, and for 128 EPs, the RR-PESC requires 68.98% less area than the IO-PESC. The area for the IO-PESC is primarily attributed to the binary tree structure of N−1 event ordering components. This ranges from 15 IO-PESC components within 4 stages to 127 IO-PESC components within 7 stages. As the number of EPs increases, the number of event ordering components increases linearly. In contrast, the PL-PESC has a simpler structure than the RR-PESC because it does not need to keep a previous search location. For 16 EPs, the PL-PESC requires 80.64% less area than the IO-PESC and 8.25% less area than the RR-PESC, and for 128 EPs, the PL-PESC requires 70.27% less area than the IO-PESC and 4.14% less area than the RR-PESC.
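The reductions quoted above follow directly from the LUT and FF totals. A quick check in Python, using only the totals given in the text (the helper name `reduction_pct` is illustrative):

```python
def reduction_pct(baseline, value):
    """Percentage by which `value` is smaller than `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Total LUTs + FFs at 16 and 128 EPs, as reported in the text
area = {"IO": (5517, 43785), "RR": (1164, 13581), "PL": (1068, 13019)}

rr_vs_io = [round(reduction_pct(io, rr), 2) for io, rr in zip(area["IO"], area["RR"])]
pl_vs_io = [round(reduction_pct(io, pl), 2) for io, pl in zip(area["IO"], area["PL"])]
pl_vs_rr = [round(reduction_pct(rr, pl), 2) for rr, pl in zip(area["RR"], area["PL"])]
```

The three lists reproduce the percentages in the paragraph: 78.9 and 68.98 for the RR-PESC against the IO-PESC, 80.64 and 70.27 for the PL-PESC against the IO-PESC, and 8.25 and 4.14 for the PL-PESC against the RR-PESC.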
Table II reports the area required for the main system components of the FPGA-based SOC prototype and for the SOF components, reported in LUTs, FFs, block RAMs (BRAMs), and DSP48Es. The SWOI and SOEngine use 29% of the area required for the SOF. Whereas each HWOI is customized for designer-specified events for a particular hardware core, the SWOI supports all possible software event probe types. Note that the area required for the SWOI could be substantially reduced by customizing the set of supported event probes, for instance, by only supporting PC event probes. The SOEngine is implemented as a secondary MicroBlaze processor with 29KB of memory to implement the observation software and software event buffers; this area could be reduced by utilizing an application-specific processor customized for system-level observation tasks. Lastly, the area required for the SOCntrl is primarily attributed to the PESC.
Table III reports the latency of the IO-PESC, the RR-PESC with an immediate sort/output algorithm, and the PL-PESC with an immediate sort/output algorithm. For the CEP scenario, the average latencies of the IO-PESC, the RR-PESC with an immediate sort/output, and the PL-PESC with an immediate sort/output are 1.43ms, 2.46ms, and 2.29ms, respectively. The difference in latency between the IO-PESC and the RR-PESC with an immediate sort/output is 1.03ms, and between the IO-PESC and the PL-PESC with an immediate sort/output is 0.86ms. Because the CEP scenario consists of constantly observed events, this difference represents the latency that can be attributed directly to the sorting algorithms. Additionally, the difference in latency between the RR-PESC and the PL-PESC reflects their selection mechanisms: the RR-PESC depends on the previous search index, whereas the PL-PESC depends on the number of priority levels. For the FFEP scenario, the average latency of the RR-PESC with an immediate sort/output is 42.6μs.
When only one event probe is enabled, the latency of the RR-PESC is dependent on the period of the event observations, as the immediate sort/output algorithm requires a second event to be input into the buffer before an event can be output. In contrast, the IO-PESC achieves an average latency of only 1.48μs, because its binary tree structure is not affected by the number of enabled EPs or the frequency of the event observations. While the PL-PESC utilizes the same immediate sort/output algorithm, its buffer size is determined by the priority level. For the FFEP scenario, the buffer size of the immediate sort/output algorithm is two for the RR-PESC and one for the PL-PESC. Whenever a new event arrives, the buffer for the PL-PESC is therefore full, and the PL-PESC with the immediate sort/output algorithm reports the event promptly. As a result, the average latency of the PL-PESC with an immediate sort/output is 2.1μs.
Area Results
For the RTCS scenario, the average latencies of the RR-PESC with an immediate sort/output and the PL-PESC with an immediate sort/output are 22.97ms and 3.25μs, respectively. Again, the buffer size of the RR-PESC is twice the number of enabled EPs, and at least two events must be received before the immediate sort/output algorithm can determine whether an event can be output. The RTCS scenario presents higher latency than the CEP and FFEP scenarios, which can be attributed to the periodic execution rates of the tasks. As the RTCS scenario monitors the tasks' start and end times, the periodic rate of the fastest executing task will affect the overall latency. In the RTCS scenario, the period of the highest-priority task is 120ms. For the immediate sort/output algorithm of the RR-PESC, the maximum latency should be equal to this period, which is evidenced by the maximum measured latency of 121.04ms. For the PL-PESC, the buffer size determined by the priority level is one, so the latency is affected by neither the buffer size nor the execution rates of the monitored tasks; the maximum latency is 4.13μs.
The SLCS scenario observes the five periodic tasks of the RTCS scenario and the start time and end time of three hardware IP cores. For the SLCS scenario, the average latencies of the RR-PESC and PL-PESC with an immediate sort/output are 6.98ms and 14.14ms, respectively. In the SLCS scenario, while the buffer size of PLB 2 to PLB 9 is one, the buffer size of PLB 0 and PLB 1 is four. Because these two PLBs hold more than one event, the immediate sort/output algorithm of the PL-PESC is affected by the frequency of event occurrence, like that of the RR-PESC. Additionally, the RR-PESC has a smaller average latency than the PL-PESC because, in the PL-PESC, events from lower-priority SWOIs/HWOIs are delayed by those from higher-priority SWOIs/HWOIs, whereas the RR-PESC follows the round-robin priority control scheme.
For the IO-PESC, the latency between an event observation and the final output of this event within a timestamp-sorted event stream is proportional to the total number of EPs enabled within the system. However, the latency of the RR-PESC is dependent on both the number of enabled EPs and the latency of the software sorting algorithm. Similarly, the latency of the PL-PESC is dependent on both the total number of priority levels and the latency of the software sorting algorithm.
Throughput Analysis
Table IV reports throughput for all four scenarios. To measure the effective maximum event throughput of the presented methods, we measure the total number of events that can be processed by the runtime observation software within a fixed time interval. While the IO-PESC provides a maximum throughput of one event per clock cycle, the maximum effective throughput for the IO-PESC, including the delay for the runtime software to read events from the event stream, is 400,790 events per second. The RR-PESC with immediate sort/output and the PL-PESC with immediate sort/output are capable of sorting and reporting 228,268 and 243,390 events per second, respectively. The IO-PESC, which is not affected by the number of enabled EPs or the frequency of event occurrence, achieves greater throughput than the RR-PESC and the PL-PESC. Overall, the PL-PESC achieves greater throughput than the RR-PESC. However, the throughput of the RR-PESC is greater than that of the PL-PESC in the SLCS scenario, because SWOIs/HWOIs with lower priorities must wait for the output of SWOIs/HWOIs with higher priorities.
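The relative throughput cost of software sorting in the maximum-throughput case can be checked from the three figures above. The helper name `decrease_pct` is illustrative; the numbers are those reported in the text.

```python
def decrease_pct(baseline, value):
    """Percentage decrease of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline


io_eps = 400_790  # IO-PESC, events per second
rr_eps = 228_268  # RR-PESC with immediate sort/output
pl_eps = 243_390  # PL-PESC with immediate sort/output

rr_decrease = decrease_pct(io_eps, rr_eps)
pl_decrease = decrease_pct(io_eps, pl_eps)
```

Rounded to whole percentages, the decreases are 43% for the RR-PESC and 39% for the PL-PESC.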
For the SLCS scenario, the PL-PESC has lower throughput than the RR-PESC, due to the fact that the PL-PESC uses separate buffers for each PL while the RR-PESC uses a single buffer. Using the immediate sort/output algorithm, the overhead of sorting the individual buffers can be affected by the frequency of specific events. Additionally, the observation software incurs a slight overhead for determining into which PLB each incoming event should be inserted.
In comparison with trace-based approaches, including those supporting event triggers and filtering, the event-driven SOF utilizes lower trace bandwidth. The advantage is that the SOF can observe more events within the system for longer time windows, which is important for analyzing or debugging systems. Generally, each signal that needs to be monitored can be characterized by an update rate and an event rate. The update rate characterizes the average rate at which a signal is updated. A lower update rate would enable trace methods to reduce the bandwidth using filtering and compression techniques. The event rate characterizes the average rate at which designer-specified events occur. If a designer desires to monitor all updates to a signal, the event rate will be equal to the update rate. Otherwise, it will be lower. In typical cases, the event rate will be 100-1000x lower than the update rate. The update rate and event rate can be utilized to determine the required bandwidth and trace window. For example, consider an SOC with a 100 MHz clock and 32KB of on-chip memory for trace/observation data. Further, consider a single 32-bit signal with an update rate of 1 MHz and an event rate of 10 kHz. The required bandwidth for a trace-based approach using both filtering and trace compression is approximately 4000KB/s, and the trace window is only 8ms. Using the SOF, the trace bandwidth can be reduced to 90KB/s with a trace window of 400ms.
Event Stream Buffer Size Analysis
Table V reports the event stream buffer size (ESBS) of the RR-PESC with immediate sort/output and the PL-PESC with immediate sort/output for all four scenarios. Because the ESBS of the RR-PESC with immediate sort/output is always twice the number of enabled EPs, the ESBSs for the four scenarios are 4, 2, 20, and 32, respectively. For the PL-PESC with immediate sort/output, the ESBS is calculated based on the WCRTs and MFRs for each scenario. The CEP scenario consists of two EPs, namely EP 0 and EP 1, with WCRTs of 2 and 4 cycles, respectively, and an MFR of 2 for both EPs. The total buffer requirement, or ESBS, for this scenario is 3. The FFEP scenario consists of only a single EP, requiring only a single buffer. The RTCS scenario consists of 10 EPs in which the WCRT for the EPs ranges from 2 to 11 cycles and the MFRs range from 15 × 10^6 cycles (once every 120ms) to 38.75 × 10^6 cycles (once every 310ms). Due to the short WCRTs and long MFRs, the buffer requirement for each EP is only 1, resulting in an ESBS of 10. Table VI summarizes the OI IDs, EP IDs, MFRs, priority-level mapping for all EPs, and EBRs for all EPs within the SLCS scenario. WCRTs range from 2 to 17 cycles, with MFRs from 2.1 × 10^6 cycles (once every 16.8ms) to 58.75 × 10^6 cycles (once every 470ms). The resulting ESBS for the SLCS scenario is 16. Overall, the PL-PESC requires a buffer up to 50% smaller than that of the RR-PESC.
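The buffer sizes above can be tabulated with a few lines of Python. The number of enabled EPs per scenario is inferred here from the RR-PESC rule of twice the number of enabled EPs, and the PL-PESC values are the ones reported in the text; the variable names are illustrative.

```python
# Enabled EPs per scenario, inferred from the RR-PESC buffer sizes (2 x numEPs)
num_eps = {"CEP": 2, "FFEP": 1, "RTCS": 10, "SLCS": 16}

rr_esbs = {name: 2 * n for name, n in num_eps.items()}
pl_esbs = {"CEP": 3, "FFEP": 1, "RTCS": 10, "SLCS": 16}  # values reported in the text

# Percentage by which the PL-PESC buffer is smaller than the RR-PESC buffer
saving_pct = {
    name: 100.0 * (rr_esbs[name] - pl_esbs[name]) / rr_esbs[name] for name in rr_esbs
}
```

The saving is 50% for the FFEP, RTCS, and SLCS scenarios and 25% for the CEP scenario, where the second probe needs two buffer entries.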
CONCLUSIONS AND FUTURE WORK
We presented a system-level observation framework capable of dynamically observing and analyzing rapidly occurring hardware and software events at runtime. Using a prototype SOC design, we demonstrated the capabilities of this approach in terms of area requirement, latency, and throughput. For all four scenarios, the RR-PESC with the immediate sort/output algorithm requires 68.98% less area than the IO-PESC, and the PL-PESC with the immediate sort/output algorithm requires 70.27% less area than the IO-PESC. While the average throughput of the RR-PESC and PL-PESC with immediate sort/output is decreased by 43% and 39%, respectively, in the CEP scenario, for common observation situations such as the FFEP, RTCS, and SLCS scenarios, the average throughput is decreased by only 1.4% and 0.78%, 0.07% and 0.007%, and 0.04% and 1.42%, respectively. This decrease in throughput is primarily attributed to the operations for sorting and outputting events in the buffer.
Future work includes designing automation tools that automatically generate event probes from system requirements, such as those specified in assertion-based testing using a property specification language. Automation tools can significantly reduce development time and costs by eliminating the need for designers to manually create HDL code. Future work will also investigate how to extend the SOF to accurately measure timestamps across multiple clock domains within the SOC. Finally, future work includes synthesizing the SOF to an ASIC technology in order to more directly compare its area, energy, throughput, and debugging efficiency with those of existing trace-based methods.
