Abstract-The complexity of embedded systems presents key challenges in observing and analyzing complex hardware and software behavior. System-level observation methods provide in-situ monitoring and analysis of complex system behavior across hardware and software boundaries. Previous system-level observation methods utilized efficient pipelined hardware architectures that provide high throughput for reporting observed events but require significant area resources. Alternatively, an area-efficient round-robin priority-based event stream ordering technique significantly reduces area resources but requires tradeoffs in event stream throughput. We present a hardware-based event stream ordering technique capable of providing high throughput and flexibility in area requirements. This hardware-based event stream ordering technique reduces area requirements by 73.6% with a maximum reduction in event stream throughput of 4.6%.
I. INTRODUCTION

RAPIDLY increasing complexity of embedded systems prevents traditional analysis and debug methods from observing and analyzing complex hardware and software interactions. Existing debugging methods that require the system execution to be halted are often intrusive and pose considerable challenges for in-situ analysis. For example, JTAG scan chains allow all registers within an SOC design to be monitored or controlled at runtime. However, access to those registers incurs significant overhead, as the system must be halted in order to access the scan chain.
Numerous approaches have focused on trace-based methods for logging system events in both hardware and software components using dedicated trace and debug ports. For example, ARM's CoreSight [2] and Embedded Trace Macrocell [3] can be synthesized within an SOC design to provide system-level trace capabilities using a dedicated trace port. However, system-level trace methods are often limited in the number of events that can be traced and stored in real time, or limited by the bandwidth of the trace port in reporting data to external test equipment.
To overcome the limitations of system trace ports, on-chip trace architectures have been proposed that incorporate methods for selecting specific signals to trace, controlling when those signals are traced, and using on-chip memory for buffering rapidly occurring events. Abramovici et al. [1] utilize a reconfigurable fabric of multiplexers that feeds selected trace signals into on-chip buffers that can be accessed using a traditional JTAG port. Leatherman and Stollen [5], [6] present a methodology that allows designers to configure the trace width and depth, as well as configure hardware triggers that start the trace. Ko et al. [4] propose a system-level debug architecture targeted for postsilicon validation that utilizes configurable event triggers, a network of trace buffers, and a configurable communication framework for efficiently storing data samples within the available trace buffers. Vermeulen and Goel [12] utilize on-chip buffers to trace and consolidate signals across multiple clock domains. Peterson et al. [10] present a trace method that utilizes configurable data filters within the trace port to select trace signals that can be reported in real time.
Other research has focused on the efficient design of the interconnect fabric used for tracing signals. Liu and Xu [9] propose a methodology for creating an area efficient trace interconnection fabric. Given the set of signals that need to be traced, a custom interconnection fabric is created in which multiplexers are utilized to trace mutually exclusive signals and a custom crossbar network is utilized to trace concurrently accessible signals.
Event-based monitoring is an alternative to trace-based methods that offers the advantage of reducing the amount of information that is reported and stored during monitoring. We previously presented an event-driven, system-level observation framework (SOF) [7], [8] to observe complex interaction across hardware and software boundaries at runtime. For monitoring and reporting rapidly occurring events, the SOF used an in-order pipelined, priority-based event stream controller (IO-PESC) that provides visibility for analyzing complex execution behavior of hardware and software components without affecting system execution. The pipelined architecture provides high throughput but requires significant area. Alternatively, a round-robin priority-based event stream controller (RR-PESC) can be utilized in conjunction with software reordering methods. While the RR-PESC and software reordering provide significant area reduction compared to the IO-PESC, the overhead of software reordering results in lower event stream throughput.
In this letter, we present a hardware event probe sorter (HEPS) capable of rapidly reordering the incoming event stream using the area-efficient RR-PESC. The HEPS achieves high event stream throughput, with performance close to the IO-PESC. When the number of events being actively monitored is less than the total number of event probes within the system, the RR-PESC with HEPS achieves significant area savings compared to the IO-PESC design.
1943-0663 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
II. SYSTEM-LEVEL OBSERVATION FRAMEWORK
System-level observation methods provide the capabilities for monitoring and analyzing rapidly occurring events, and in-situ support for configuring and controlling event probes (EPs) within software and hardware components. Fig. 1 presents an overview of the SOF integrated within a multiprocessor system-on-a-chip (SOC) design. The SOF consists of a software observation interface (SWOI) connected to the trace port of each processor core and a hardware observation interface (HWOI) connected to each hardware core to be observed. Each observation interface consists of one or more event probes, a timestamp counter, a configuration register for each EP, a priority-based event stream controller, and a FIFO controller for buffering events within the event stream. To avoid affecting the execution of the main system, the SOF utilizes an auxiliary lightweight processor for the system observation engine (SOEngine) that executes the runtime observation software.
III. ONLINE EVENT STREAM PROCESSING
The SOF utilizes a priority-based event stream controller (PESC) to serialize and report multiple observed events. Within each SWOI and HWOI, the PESC serializes and stores observed events within a FIFO. The system observation controller (SOController) serializes and stores monitored events across multiple SWOIs/HWOIs using the same priority-based event stream control mechanism. We previously developed two types of PESCs: 1) an in-order pipelined PESC (IO-PESC); and 2) a round-robin PESC (RR-PESC). The observed events are finally reported to the runtime observation software using a dedicated interface to an isolated processor executing the observation software to analyze the event stream in-situ.
A. In-Order Pipelined PESC
The in-order pipelined, priority-based event stream controller (IO-PESC) serializes and reports observed events in order based on the events' occurrence [7]. The IO-PESC has a pipelined binary tree structure to directly sort events as they are reported. The pipelined binary tree structure provides an in-order throughput of one event per clock cycle. However, the binary tree structure requires significant area overhead when many events need to be observed. To monitor N different events, N-1 IO-PESC components are required within the pipeline stages. The area overhead is primarily attributed to the EPs' timestamp registers and the pipelined PESC structure, both of which increase linearly with the number of EPs. Furthermore, while a system may have thousands of observable events, one may only need to monitor a subset of those events in-situ. However, the area required for the IO-PESC is dependent on the number of event probes within the system, not the number of events being monitored.
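The N-1 component count follows from the binary tree shape: a tree that merges N leaf inputs pairwise has N-1 internal nodes. The short Python sketch below illustrates that arithmetic. The function name is hypothetical, and the stage count (the tree depth, rounded up to the next power of two) is an assumption about a balanced tree rather than a figure stated in this letter.

```python
def io_pesc_tree_size(num_eps):
    """Return (components, stages) for a balanced binary merge tree.

    components: N-1 two-input nodes are needed to merge N event probes.
    stages:     ceil(log2(N)) tree levels; ASSUMED here to correspond to
                pipeline stages, which the text does not state explicitly.
    """
    if num_eps < 1:
        raise ValueError("at least one event probe is required")
    components = num_eps - 1
    stages = (num_eps - 1).bit_length()  # integer form of ceil(log2(N))
    return components, stages


if __name__ == "__main__":
    print(io_pesc_tree_size(1024))
```

The component count depends only on the total number of EPs, which is why the IO-PESC area does not shrink when only a subset of events is monitored.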
B. Round-Robin PESC
Alternatively, the round-robin priority-based event stream controller (RR-PESC) is an area-efficient event stream ordering technique that significantly reduces area requirements [8]. The RR-PESC sequentially reports observed events based on the EP's ID, where the EP with the next larger ID will be output next. As such, the RR-PESC cannot guarantee that observed events are output according to their occurrence time, an ordering that is useful for system monitoring and analysis. For analysis in which the ordering of events is required, it is necessary to reorder, i.e., sort, the incoming event stream.
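To illustrate why ID-based reporting can break timestamp order, the following Python sketch models a simplified round-robin scan. It is a behavioral approximation only: the function name and the per-EP queues are hypothetical, and the model ignores cycle timing and skips idle EPs at no cost.

```python
def round_robin_report(pending, num_eps):
    """Report pending events by visiting EP IDs 0..num_eps-1 cyclically.

    pending: dict mapping EP ID -> list of timestamps (oldest first).
    Returns a list of (ep_id, timestamp) in the order they are reported.
    """
    # Copy the queues so the caller's data is not modified.
    queues = [list(pending.get(ep_id, [])) for ep_id in range(num_eps)]
    remaining = sum(len(q) for q in queues)
    reported = []
    ptr = 0
    while remaining:
        if queues[ptr]:
            reported.append((ptr, queues[ptr].pop(0)))
            remaining -= 1
        ptr = (ptr + 1) % num_eps  # the next larger ID is served next
    return reported


if __name__ == "__main__":
    # EP 0 observed its event last (timestamp 5) but is reported first.
    print(round_robin_report({0: [5], 1: [3], 2: [4]}, num_eps=3))
```

In this example the reported timestamps are 5, 3, 4, so a consumer that needs occurrence order must sort the stream afterward.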
The round-robin priority scheme imposes an upper bound, equal to the number of enabled EPs, on the time difference between the event occurrence and the output of the event from the SOF. Using this upper bound, an efficient immediate sort/output algorithm can be utilized to reorder the nearly sorted event stream. The immediate sort/output algorithm utilizes a small buffer, equal to twice the number of enabled EPs, to store incoming events. If the difference in timestamp between any two events in the buffer is greater than the number of enabled EPs, the event is output from the buffer. While the reordering algorithm is effective, the overhead of sorting the data in software reduces the effective event stream throughput by 45% in the worst case [8].
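A minimal Python sketch of this idea is given below. It is not the implementation from [8]: the function name is hypothetical, a heap stands in for the fixed-size buffer (whose capacity of twice the number of enabled EPs is not enforced here), and the release rule is interpreted as "output the oldest buffered event once the newest timestamp seen exceeds it by more than the number of enabled EPs."

```python
import heapq


def immediate_sort_output(stream, enabled_eps):
    """Reorder a nearly sorted stream of (timestamp, ep_id) events.

    Relies on the stated bound: an event is reported at most
    `enabled_eps` time units after it occurred, so once the newest
    timestamp is more than `enabled_eps` ahead of the oldest buffered
    event, no earlier event can still arrive.
    """
    heap = []
    newest = None
    for event in stream:
        heapq.heappush(heap, event)
        ts = event[0]
        newest = ts if newest is None else max(newest, ts)
        # Release every buffered event that is now safe to output.
        while heap and newest - heap[0][0] > enabled_eps:
            yield heapq.heappop(heap)
    # End of stream: drain the remaining events in sorted order.
    while heap:
        yield heapq.heappop(heap)


if __name__ == "__main__":
    rr_stream = [(5, 0), (3, 1), (4, 2), (9, 0), (8, 1)]
    print(list(immediate_sort_output(rr_stream, enabled_eps=3)))
```

Because each event passes through a software priority queue, the per-event cost grows with the buffer size, which is consistent with the throughput penalty reported for software reordering.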
C. Hardware Event Probe Sorter
To improve throughput compared to the RR-PESC with software reordering while reducing area compared to the IO-PESC, we present a hardware-based event probe sorter (HEPS). The HEPS is a dedicated hardware component for reordering the event stream when using the RR-PESC. The HEPS is connected between the event stream output from the SOController and the input to the SOEngine, as shown in Fig. 1. Fig. 2 presents an overview of the HEPS, which consists of event registers implementing an event buffer, insertion comparators (InsComp) for comparing the incoming event's timestamp to previous events within the event buffer, insertion controllers (InsCntrl) for inserting the incoming event in the correct position within the event buffer, and an output ready (OutReady) component for controlling when an event can be safely output from the event buffer. The goal of the HEPS is to accept a new incoming event every clock cycle and, if the buffer is not full, insert it into the buffer while keeping the buffer in sorted order.
The interface of the HEPS consists of an input epin for the incoming event stream, an input epin_rdy indicating if the incoming event is valid, an input epout_rd indicating if an event is read, an input enabled_ep indicating the current number of enabled EPs, an output epout for the event output, and an output epout_rdy indicating if an event is ready to be output from the HEPS.
The HEPS' event buffer is comprised of registers that store an event's ID, timestamp, and optional event data. Each event register also includes one bit indicating if the register contains a valid event occurrence, or conversely is empty. The HEPS' event buffer will keep incoming events within sorted order such that new events can be efficiently inserted into the correct position within the event buffer.
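As a rough software analogue of one event register, the Python sketch below groups the fields named above. The class and field names are hypothetical, and bit widths are omitted because the text does not specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EventRegister:
    """Software stand-in for one HEPS event register (names assumed)."""
    valid: bool = False         # one bit: register holds an event or is empty
    ep_id: int = 0              # ID of the event probe that fired
    timestamp: int = 0          # occurrence time from the timestamp counter
    data: Optional[int] = None  # optional event data


def make_event_buffer(depth):
    """Create an empty event buffer with `depth` registers."""
    return [EventRegister() for _ in range(depth)]


if __name__ == "__main__":
    buffer = make_event_buffer(4)
    buffer[0] = EventRegister(valid=True, ep_id=2, timestamp=17)
    print(buffer[0], sum(r.valid for r in buffer))
```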
For each event register, an insertion comparator (InsComp) compares the timestamp and valid bit in the register with the timestamp of the event stream input, epin. If the timestamp for epin is greater than the timestamp within a register, and the register's entry is valid, then the InsComp will output 0. Otherwise, the InsComp will output 1. By comparing the InsComp output for adjacent registers, the location to insert the incoming event can be determined. If an event register's InsComp output is 1 and the previous register's InsComp output is 0, then the incoming event should be inserted into that register, which is detected using an XOR gate for each event register. If all InsComp outputs are 1, then the incoming event should be inserted into the first register, meaning the incoming event has the lowest timestamp. Alternatively, if all InsComp outputs are 0, then the incoming event has the highest timestamp and all registers in the buffer are full. In this case, the insertion of the incoming event will be delayed until there is space within the buffer. The insertion of the incoming event into the current register will be made only when the event stored into that register is either shifted to another register or is output from the HEPS, which is handled by the insertion controller.
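The comparator and XOR rules above can be sketched in Python as follows. The function names are hypothetical, registers are modeled as (valid, timestamp) pairs, and the sketch assumes the buffer is sorted in ascending timestamp order with valid entries first, so that Reg_0 holds the lowest timestamp. The XOR for the first register is taken against a constant 0, which is an assumption about how the "all outputs are 1" case is detected.

```python
def ins_comp(reg_valid, reg_ts, epin_ts):
    """InsComp: 0 if the register is valid and epin is newer, else 1."""
    return 0 if (reg_valid and epin_ts > reg_ts) else 1


def insertion_point(registers, epin_ts):
    """Locate the register that should receive the incoming event.

    registers: list of (valid, timestamp), sorted, valid entries first.
    Returns (index, comps); index is None when every InsComp output is 0,
    i.e. the buffer is full and epin has the highest timestamp.
    """
    comps = [ins_comp(valid, ts, epin_ts) for valid, ts in registers]
    prev = 0  # assumed constant input for the first register's XOR
    for index, comp in enumerate(comps):
        if comp ^ prev:  # 0 -> 1 transition marks the insertion point
            return index, comps
        prev = comp
    return None, comps


if __name__ == "__main__":
    regs = [(True, 3), (True, 7), (False, 0), (False, 0)]
    print(insertion_point(regs, epin_ts=5))
    print(insertion_point(regs, epin_ts=1))
    print(insertion_point([(True, 1), (True, 2)], epin_ts=9))
```

Because the buffer is sorted, the InsComp outputs form a run of 0s followed by a run of 1s, so at most one XOR output is 1 and the insertion point is unambiguous.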
Each event register has an InsCntrl that determines which event will be inserted into the register. Whenever an event is being read out from the HEPS, the contents of the event registers will be shifted by one position. If the XOR gate output for an event register is 1, then the incoming event will be stored into that event register. For all registers before the insertion point, the InsCntrl will select the event from the next register to be stored into the current event register, effectively shifting those events one position forward. For all registers after the insertion point, the InsCntrl will store the previously held event. If an event is not being read from the HEPS and the buffer is not full, the incoming event will similarly be inserted at the insertion point. For all registers after the insertion point, the InsCntrl will select the event from the previous register to be stored into the current event register, effectively shifting those events one position backward.
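The net effect of one clock cycle on the event buffer can be approximated with the behavioral Python sketch below. It does not model the per-register multiplexers: the function name is hypothetical, the buffer is a sorted list of (timestamp, ep_id) tuples, and a simultaneous read and insert is resolved by removing Reg_0 first and then inserting, which is an assumption about the combined shift rather than a description of the actual wiring.

```python
def heps_cycle(buffer, capacity, epin=None, epout_rd=False):
    """Advance a sorted event buffer by one clock cycle.

    buffer:   list of (timestamp, ep_id), ascending by timestamp.
    epin:     incoming event or None when epin_rdy is deasserted.
    epout_rd: True when the event in Reg_0 is read this cycle.
    Returns (new_buffer, epout, accepted).
    """
    buf = list(buffer)
    epout = None
    if epout_rd and buf:
        epout = buf.pop(0)  # remaining events shift one position forward
    accepted = False
    if epin is not None and len(buf) < capacity:
        # Insert before the first event whose timestamp is not smaller;
        # later events shift one position backward.
        index = next(
            (i for i, (ts, _) in enumerate(buf) if ts >= epin[0]),
            len(buf),
        )
        buf.insert(index, epin)
        accepted = True
    return buf, epout, accepted


if __name__ == "__main__":
    state = [(3, 1), (7, 0)]
    state, out, ok = heps_cycle(state, capacity=3, epin=(5, 2))
    print(state, out, ok)
    state, out, ok = heps_cycle(state, capacity=3, epin=(4, 1), epout_rd=True)
    print(state, out, ok)
```

When the buffer is full and no event is read, the incoming event is not accepted, which mirrors the delayed insertion described for the all-zero InsComp case.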
The output ready component (OutReady) determines if the HEPS is ready to output an event from the buffer. An event can be safely output under three conditions: 1) if the number of valid EPs is equal to or greater than the number of enabled EPs; 2) if the difference between the timestamp of epin and the timestamp in Reg_0 is greater than the number of enabled EPs; and 3) if the buffer is full.
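One possible reading of the OutReady logic is sketched in Python below. Several points are assumptions: the function and parameter names are hypothetical, any one of the three conditions is taken to be sufficient, the first condition is interpreted as counting valid event registers, and timestamp counter wrap-around is ignored.

```python
def out_ready(valid_count, reg0_ts, epin_ts, enabled_ep, capacity):
    """Return True if the event in Reg_0 may be output.

    valid_count: number of valid event registers (assumed interpretation).
    reg0_ts:     timestamp held in Reg_0, the oldest buffered event.
    epin_ts:     timestamp of the incoming event, or None if no input.
    enabled_ep:  current number of enabled EPs.
    capacity:    total number of event registers.
    """
    if valid_count == 0:
        return False  # nothing to output
    enough_events = valid_count >= enabled_ep
    old_enough = epin_ts is not None and (epin_ts - reg0_ts) > enabled_ep
    buffer_full = valid_count == capacity
    return enough_events or old_enough or buffer_full


if __name__ == "__main__":
    print(out_ready(1, reg0_ts=10, epin_ts=20, enabled_ep=4, capacity=8))
    print(out_ready(1, reg0_ts=10, epin_ts=12, enabled_ep=4, capacity=8))
```

The second condition corresponds to the round-robin bound: once the incoming timestamp is more than the number of enabled EPs ahead of Reg_0, no event older than Reg_0 can still arrive.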
IV. EXPERIMENTAL RESULTS
To evaluate and demonstrate the capabilities of the RR-PESC with HEPS, we compared area requirements, throughput, and latency between the IO-PESC and the RR-PESC with HEPS. We implemented the system-level observation framework using VHDL and utilized an FPGA-based prototype of a SOC design consisting of a 125 MHz MicroBlaze processor with basic system peripherals. The system was synthesized using Xilinx Platform Studio (XPS) 11.5 targeting a Virtex-5 FPGA (XC5VLX110T).
A. Area Results
To evaluate the area requirements, we consider a system with a total of 1024 EPs, but where a subset of those EPs needs to be monitored at any time. Fig. 3 presents the area required in lookup tables (LUTs) and flip-flops (FFs) for the RR-PESC with HEPS as a function of the number of enabled EPs. The figure also presents the area requirements for the IO-PESC, for which the area requirements are solely a function of the total EPs within the system. When observing all EPs within the system, the RR-PESC+HEPS requires 39.7% more area compared to the IO-PESC design. However, when the number of enabled EPs is less than the total number of EPs, the RR-PESC+HEPS implementation achieves significant area savings. With half of the EPs enabled, the RR-PESC+HEPS achieves an area reduction of 18.5%. And, with a quarter of the EPs enabled, the RR-PESC+HEPS achieves an area reduction of 44.3%.
B. Event Stream Latency and Throughput Analysis
We consider three monitoring scenarios: 1) a fixed frequency event probe (FFEP) scenario; 2) a constant event probe (CEP) scenario; and 3) a real-time system case study (RTCS) scenario. The FFEP scenario is utilized to measure the latency of a single event probe with a fixed frequency of occurrence of s. The CEP scenario consists of two event probes that are constantly observed every clock cycle. Finally, the RTCS scenario implements a real-time application consisting of five periodic tasks from the SNU benchmark suite [11]. The execution periods for individual tasks range from 120 ms to 310 ms. Xilinx xilkernel 4.00a was utilized as the operating system, configured for priority-based scheduling. Within the RTCS scenario, 10 EPs are configured to observe the start time and end time of each periodic task execution for all tasks.
Table I reports the latency of the IO-PESC and the RR-PESC with HEPS. In the FFEP and the RTCS scenarios, the two approaches have similar latency, with the RR-PESC with HEPS increasing the average latency by 6.49% and 6.46% compared to the IO-PESC, respectively. The increase in latency for the RR-PESC with HEPS is primarily due to the delay of buffering all incoming events within the HEPS. For the CEP scenario, the RR-PESC with HEPS incurs an increase in the average latency of 8.77%.
To evaluate the throughput of the presented methods, we measured the total number of events that can be processed at runtime by the observation software within a fixed time interval. While the IO-PESC provides a maximum throughput of one event per clock cycle, the maximum effective throughput for the IO-PESC, including the delay for the software to read the events from the event stream, is 400 791 events per second. In comparison, the RR-PESC with HEPS achieves a throughput of 382 521 events per second, which is only a 4.6% decrease.
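The reported throughput decrease can be checked directly from the two measured rates. The short Python sketch below reproduces that arithmetic; the function name is hypothetical.

```python
def throughput_decrease_percent(baseline, measured):
    """Relative decrease of `measured` with respect to `baseline`, in %."""
    return (baseline - measured) / baseline * 100.0


if __name__ == "__main__":
    io_pesc = 400_791   # events per second, IO-PESC
    rr_heps = 382_521   # events per second, RR-PESC with HEPS
    print(round(throughput_decrease_percent(io_pesc, rr_heps), 1))
```

The difference of 18 270 events per second relative to 400 791 rounds to the 4.6% figure quoted above.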
V. CONCLUSION
We presented a hardware-based event stream ordering technique that dynamically sorts rapidly occurring system events at runtime. The resulting system-level observation framework achieves high throughput but with lower area requirements for systems in which the number of enabled EPs is less than the total number of EPs. Although in the worst case, the event stream throughput is decreased by 4.6%, for common observation scenarios the event stream throughput is decreased by less than 1%.
