This paper presents moviSim, a simulation platform for architecture exploration of bus-based heterogeneous multi-processor systems-on-chip (MPSoC). The trade-off between accurate simulation results and simulation time is achieved through the cycle-count-accurate approach. Its main attributes are flexibility, integration with the targeted tool chain, and increased tracing and analysis capability. The wide range of implemented metrics (program execution time, executed instructions, stalled cycles, bus logging, register and memory port detection, power consumption, function, data and code line profiling, cache metrics such as miss/hit ratio, and the number of memory/subsystem reads/writes performed by a master) allows enhanced architectural exploration of complex MPSoCs on which large software applications are running. Thanks to easy integration with debugging tools, the source code targeting the hardware platform can be readily verified and analyzed with the proposed simulation platform.
INTRODUCTION
To remain competitive, SoC designers must keep pace with growing customer demand for a variety of functions and support for multiple standards, while facing an overwhelming choice of architectural solutions. This is compounded by increasing pressure to shorten time-to-market and reduce costs. Tools, and system-on-a-chip (SoC) simulators in particular, provide important support in this direction. Four features are required of such a tool: flexibility (i.e. ease of configuring the desired architecture from discrete elements, and component reuse through the inclusion of off-the-shelf components), accuracy (i.e. the level of abstraction used for component modeling), tracing capability (for software (SW) and hardware (HW) architecture analysis), and simulation time. Two of these are critical for the produced results: simulation time and accuracy. Maintaining a high level of accuracy by employing register transfer level (RTL) modeling would result in higher simulation time, along with difficult design, slow verification, and poor scalability. Relaxing this constraint, on the other hand, would result in a loss of accuracy. Cycle-accurate models eliminate unnecessary HW component description while trying to preserve timing accuracy. A trade-off between simulation time and accuracy, cycle-count-accurate (CCA) simulation, has gained attention in recent work [1]-[3], [5]-[7]. It eliminates timing and internal component modeling details that would slow down the simulation, while still targeting correct component behavior with respect to its interface (i.e. the transaction timings are preserved). This approach is employed in our work as well. Regarding the first feature, flexibility, it is worth emphasizing its importance in the context of hybrid multi-core HW platform solutions. At the heart of this requirement lies the current trend in SoC design: component/design reuse.
Furthermore, as highlighted in [2], different usage scenarios for the simulation platform would require mixed abstraction levels. Another aspect is the integration of existing IP core descriptions by embedding them within a wrapper. This makes it possible to enrich the component pool at significantly reduced development and validation cost and effort. Validating the correct behavior of the simulation environment requires significant effort, and is critical for industry products.
The work presented in this paper addresses the aforementioned critical features in moviSim, a multi-core SoC simulator designed for architecture exploration and software development support. This simulator has been designed to target embedded architectures, which exhibit greater diversity than general-purpose computer systems. Its intended use is to support the process of architecture exploration for the development of the next-generation Movidius MPSoC HW platform. This paper is organized as follows: Section 2 includes a brief review of related work, Section 3 focuses on the proposed simulation platform, and Section 4 describes the trace capability that aids architecture exploration and software development, followed by a case study (the H.264 decoding application) and conclusions.
RELATED WORK
The highest confidence in simulation results is given by RTL modeling with rigorous component detailing, which requires significant simulation time. Regarding bus and memory modeling, several approaches have been developed, such as the CCA-TL2 modeling for buses used in [8] and the CCA memory models proposed in [5]. Our approach is similar to these, with two main differences. First, the proposed simulator does not require existing RTL descriptions, as it uses configurable components, with the possibility of integrating or developing new ones. Second, it relies on XML-based component configuration (instead of automatic model generation) for tuning different system parameters.
Regarding multi-processor system simulation, several approaches have been proposed, such as the ones presented in [12]-[15]. Although they achieve good simulation speeds, their modeling capability is rather limited, for the following reasons:
1. Simulation is restricted to one processor core; furthermore, they do not provide means for including other types;
2. Limited bus modeling capability; hence no accurate means to analyze bus traffic is provided;
3. Lack of implementations of inter-processor synchronization and communication mechanisms;
4. Except for [12], no method for integrating simulation models of hardware interfaces and accelerators is offered.
The proposed simulator incorporates these features.
CYCLE-COUNT-ACCURATE MODELING
Simulation Platform
The simulator core is a dynamic simulation environment, designed to facilitate the integration of different hardware modules so that discrete system components can be aggregated into a complete platform. The resulting platforms can be used in conjunction with user applications or benchmarks to evaluate platform performance. The simulation engine relies on the following components: the XML processor (XMP), the Debugger Interface module, and the Simulation Runtime Engine Module (SREM) (see Fig. 1).
The hardware architecture is configured by means of an XML file. Several XML elements have been defined:
-ARCHITECTURE: the top-level, mandatory element; all other elements are nested hierarchically within a single <architecture> tag.
-BUS: used to define a generic bus component.
-BRIDGE: interconnect element for communication between buses.
-MODULE: the simulator provides several sub-classes of module elements: PROCESSOR, which is used for user-defined processor cores specified via DLL libraries; and MODULE, which can be used to select several sub-types, such as user-defined peripheral modules specified via DLL libraries and specific elements like SHAVE, CACHE, ICB, JTAG, MEM, CMX_MEM, T.
-SIGNAL: is used to simulate connections between modules that do not go through the bus (e.g. an interrupt line, the reset input etc.). A signal can be connected to one or more destination modules, but it is mandatory that it has a unique source module.
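As an illustration of how a description built from these elements might be consumed, the following Python sketch parses a small architecture file and checks the rule that a signal has a unique source. The lowercase tag spelling and the attribute names (`name`, `type`, `bus`, `src`, `dst`) are assumptions made for this example and are not taken from the actual moviSim schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical architecture description; tag and attribute names are
# illustrative assumptions, not the actual moviSim schema.
EXAMPLE_XML = """
<architecture name="demo">
  <bus name="main_bus"/>
  <module type="MEM" name="ddr" bus="main_bus"/>
  <module type="SHAVE" name="shave0" bus="main_bus"/>
  <signal name="irq0" src="shave0" dst="ddr"/>
</architecture>
"""

def load_architecture(xml_text):
    """Return (buses, modules, signals) extracted from an architecture description."""
    root = ET.fromstring(xml_text)
    if root.tag != "architecture":
        raise ValueError("top-level element must be <architecture>")
    buses = [b.get("name") for b in root.iter("bus")]
    modules = {m.get("name"): m.get("type") for m in root.iter("module")}
    signals = {}
    for s in root.iter("signal"):
        name = s.get("name")
        # A signal may fan out to several destinations but needs one source.
        if name in signals and signals[name]["src"] != s.get("src"):
            raise ValueError(f"signal {name!r} has more than one source")
        entry = signals.setdefault(name, {"src": s.get("src"), "dst": []})
        entry["dst"].append(s.get("dst"))
    return buses, modules, signals

buses, modules, signals = load_architecture(EXAMPLE_XML)
```

In this sketch a signal with several destinations is written as repeated `<signal>` entries sharing one name and one source; a real schema may well express fan-out differently.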
XMP is responsible for parsing the XML file and extracting the specific tags for each type of system component, then initializing the system components and building the MPSoC system. SREM is designed to trigger the execution of each system element during a simulation cycle. An important feature favoring design reuse is the possibility of nesting architectures, as depicted in Fig. 2. These architectures can communicate through EBI interface models. Furthermore, the simulator supports multiple clock domains. The only restriction is that all clock signals must be derived from the system clock.
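A minimal sketch of such a cycle-driven loop is given below. It assumes that each component exposes a `tick()` method and that a derived clock is expressed as an integer divider of the system clock; the class name, method name and divider mechanism are assumptions for illustration and do not describe the actual SREM implementation.

```python
class Component:
    """Toy hardware module that counts how often it is triggered."""

    def __init__(self, name, clock_divider=1):
        self.name = name
        # Assumed model: derived clock = system clock / clock_divider.
        self.clock_divider = clock_divider
        self.ticks = 0

    def tick(self, cycle):
        self.ticks += 1

def run(components, system_cycles):
    """Trigger each component on the system cycles where its clock has an edge."""
    for cycle in range(system_cycles):
        for component in components:
            if cycle % component.clock_divider == 0:
                component.tick(cycle)

cpu = Component("cpu", clock_divider=1)
uart = Component("uart", clock_divider=4)
run([cpu, uart], system_cycles=16)
```

Here the `cpu` component runs on every system cycle, while `uart` is triggered on every fourth one.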
Component Library
moviSim offers an extensive range of configurable modules which can be used to construct target systems for performance-portability evaluation. These are configurable hardware modules (HM) that are common in embedded architectures; they are reviewed briefly in the following subsections. For further details please refer to [16], where the modeling issues of HW components (e.g. simulator interface, XML configuration parameters) are presented.
Interconnect elements
Bus component. The model is inspired by the AMBA/AXI bus protocol [10]. For flexibility we have designed a generic bus that ...
Bridge component. We have implemented a generic bridge that is capable of transferring transactions from one bus to another. The modeled bridge is unidirectional, so two such bridges are required to achieve bidirectional communication between two buses. The unidirectional link between the two buses is configured in the XML file through the master interface (which describes a master connection to a specific bus) and the slave interface (which describes a slave connection to the specified bus). This component is also responsible for transaction resizing (i.e. splitting transactions or, when possible, merging them) in order to connect two buses of different sizes.
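The splitting half of transaction resizing can be sketched as follows. The sketch assumes that a transaction is described by a byte address and a byte length, that the narrower bus carries at most `bus_width` bytes per transaction, and that pieces are cut at `bus_width`-aligned boundaries. These names and the alignment policy are assumptions of the example, not details of the moviSim bridge.

```python
def split_transaction(address, length, bus_width):
    """Split one transfer into (address, length) pieces that fit the narrower bus.

    Pieces are cut at bus_width-aligned boundaries (an assumption of this sketch).
    """
    if bus_width <= 0:
        raise ValueError("bus_width must be positive")
    pieces = []
    end = address + length
    while address < end:
        # Next aligned boundary strictly above the current address.
        boundary = (address // bus_width + 1) * bus_width
        chunk_end = min(boundary, end)
        pieces.append((address, chunk_end - address))
        address = chunk_end
    return pieces

# A 16-byte transfer starting at an unaligned address, forwarded to an 8-byte bus.
pieces = split_transaction(address=0x1004, length=16, bus_width=8)
```

The unaligned start produces a short first piece and a short last piece, with one full-width piece in between.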
Processor modeling

Memory elements
Three types of HMs have been designed:
-CMX: combinational memory made of several SRAM tiles, which can be reorganized according to users' needs. Several tiles make up a slice. The slice size can be configured through XML description. Support for port-clash detection has been implemented.
-Cache model: a generic cache with XML parameter configuration for: size, line size, set-associativity, hit latency, write-through enable, replacement policy (LRU, MRU, LFU, SLRU, AR), and cache bypass enable (the cache acts like a bridge between the master and the slave interface). Selective bypass, in which only a portion of the address space is bypassed, is also supported.
-Memory: generic memory HM with XML configuration options for: size, read latency (in simulator cycles), write latency (simulator cycles), read ports, write ports, port clash policy (RD_BEFORE_WR, WR_BEFORE_RD), enable parity checking bit.
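To make the cache parameters listed above concrete, the following Python sketch models a set-associative cache with LRU replacement and counts hits and misses. It is a functional toy under stated assumptions: the class and parameter names are hypothetical, only the LRU policy is covered, and latency, write-through and bypass are left out.

```python
from collections import OrderedDict

class LruCache:
    """Toy set-associative cache with LRU replacement (tags only, no data)."""

    def __init__(self, size, line_size, ways):
        self.line_size = line_size
        self.ways = ways
        self.num_sets = size // (line_size * ways)
        # One ordered dict per set; insertion order tracks recency of use.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Record one access and return True on a hit, False on a miss."""
        line = address // self.line_size
        index = line % self.num_sets
        tag = line // self.num_sets
        cache_set = self.sets[index]
        if tag in cache_set:
            cache_set.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return True
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)   # evict the least recently used line
        cache_set[tag] = True
        self.misses += 1
        return False

cache = LruCache(size=64, line_size=16, ways=2)  # 2 sets of 2 ways
for addr in [0x00, 0x04, 0x20, 0x40, 0x00]:
    cache.access(addr)
miss_ratio = cache.misses / (cache.hits + cache.misses)
```

All five addresses map to the same set, so the third distinct line evicts the first one and the final access to `0x00` misses again.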
Our modeling approach is similar to the one described in [5], with one significant difference. A potential drawback of [5] is that developing RTL memory models requires significant effort, and such models may be unavailable. Thus, in our approach we try to provide a generic timed model, which can be tuned from the XML description according to the usage scenario. The memory timing behavior needs to be correct with respect to the bus interface. A typical memory read (RD) request passes through the following phases:
-The RD request is received by the HM on the slave interface and added to a request queue for servicing.
-Every cycle the request queue is inspected, and if RD ports are sufficient (i.e. the memory is not busy servicing another request) the memory access takes place.
-A counter is attached to the data (i.e. the read latency in simulation cycles).
-Each simulation cycle the counter is decremented (mimicking the memory access delay).
-When the counter reaches 0, the data response is pushed to the read response queue; each cycle the slave interface sends the corresponding memory response (i.e. the queue front).
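The read phases above can be sketched in Python as a small timed memory model. The class, method and field names are assumptions made for this example, and the ordering of the phases within one cycle (start, decrement, respond) is one possible choice rather than the documented moviSim behavior.

```python
from collections import deque

class TimedMemory:
    """Toy timed memory: request queue, latency counters, response queue."""

    def __init__(self, read_latency, read_ports=1):
        if read_latency < 1:
            raise ValueError("read_latency must be at least one cycle")
        self.read_latency = read_latency
        self.read_ports = read_ports
        self.data = {}              # address -> value; missing means uninitialized
        self.requests = deque()     # RD requests waiting for a free port
        self.in_flight = []         # [remaining_cycles, address, value]
        self.responses = deque()    # completed reads awaiting the slave interface

    def request_read(self, address):
        self.requests.append(address)

    def tick(self):
        """Advance one cycle; return the response sent this cycle, or None."""
        # Start accesses while RD ports are free.
        while self.requests and len(self.in_flight) < self.read_ports:
            address = self.requests.popleft()
            self.in_flight.append(
                [self.read_latency, address, self.data.get(address)])
        # Decrement the latency counters to mimic the access delay.
        for entry in self.in_flight:
            entry[0] -= 1
        # Move finished accesses to the response queue.
        for _, address, value in [e for e in self.in_flight if e[0] <= 0]:
            self.responses.append((address, value))
        self.in_flight = [e for e in self.in_flight if e[0] > 0]
        # The slave interface sends the front of the response queue.
        return self.responses.popleft() if self.responses else None

mem = TimedMemory(read_latency=2, read_ports=1)
mem.data[0x10] = 0xAB
mem.request_read(0x10)
mem.request_read(0x14)      # never written, so the value comes back as None
results = [mem.tick() for _ in range(4)]
```

With a single read port, the second request only starts once the first has completed, so the two responses appear on the second and fourth cycles.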
In order to aid software developers, the HM allows the detection of access to uninitialized memory locations.
Miscellaneous
Several hardware accelerators and communication interfaces have been modeled and can be included in different simulation architectures, such as UART, JTAG, I2C, EBI, H264 encoder and decoder accelerators, LCD interface, camera interface, DMA controller, etc. The behavior of these peripherals can be easily divided into computation and timing (with respect to the behavior observed from the bus interface). In the case of the LCD and camera interfaces, the outputs or inputs of these modules are represented by data files containing the stream of frames that emulates the activity of the display or digital camera. The H264 encoder and decoder modules implement only the CABAC/CAVLC entropy encoding/decoding.
A bus traffic generator has also been implemented [16]. This module has no hardware counterpart. It can substitute the behavior (from the bus perspective) of hardware modules on a specific bus, replacing a master (initiating transactions), a slave (receiving transactions), or a master-slave HM.
It relies on the fact that peripherals such as the LCD and camera interfaces can easily be divided into computation and timing (with respect to the behavior observed from the bus interfaces). Hence, a traffic generator component using the HM bus logs and the HM masterId (needed for filtering) can be used as a substitute. It extracts the bus transactions for the corresponding HM and delivers them on the bus during the corresponding simulation cycle.
Synchronization points may be provided (i.e. wait for the RD response from memory before issuing the next WR request, and adjust the timestamps of the following transactions). The traffic generator comes in two configurations: random mode and pattern mode. In random mode, it generates random transactions on the specified bus, so that bus capabilities can be analyzed. In pattern mode, it reduces simulation overhead by mimicking the activity of HMs with respect to their bus behavior.
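Pattern mode can be illustrated with the following Python sketch, which filters a bus log by master identifier and replays the matching transactions at their recorded cycles. The log record layout (`cycle`, `master_id`, `kind`, `address`) is an assumption made for this example and does not reflect the actual moviSim bus log format; synchronization points are not modeled.

```python
def build_pattern(bus_log, master_id):
    """Keep only the transactions logged for master_id, ordered by cycle."""
    matching = [t for t in bus_log if t["master_id"] == master_id]
    return sorted(matching, key=lambda t: t["cycle"])

def replay(pattern, total_cycles):
    """Issue each logged transaction during its recorded simulation cycle."""
    issued = []
    next_index = 0
    for cycle in range(total_cycles):
        while next_index < len(pattern) and pattern[next_index]["cycle"] <= cycle:
            t = pattern[next_index]
            issued.append((cycle, t["kind"], t["address"]))
            next_index += 1
    return issued

# Hypothetical log containing traffic from two masters.
bus_log = [
    {"cycle": 3, "master_id": 1, "kind": "WR", "address": 0x100},
    {"cycle": 1, "master_id": 2, "kind": "RD", "address": 0x200},
    {"cycle": 1, "master_id": 1, "kind": "RD", "address": 0x104},
]
issued = replay(build_pattern(bus_log, master_id=1), total_cycles=5)
```

Only the two transactions of master 1 are replayed, in cycle order, while the traffic of master 2 is filtered out.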
Interfacing External Processing Elements
Different HW/SW co-simulation campaigns have different goals and thus different modeling accuracy requirements. Furthermore, reusing already implemented simulation models helps reduce the time needed to prepare a simulation platform. Hence, special attention has been given to providing support for integrating existing models into our simulator: CA SystemC descriptions such as [10] or C-based CA descriptions such as [9]. In order to be used, an external model needs to conform to specific requirements. It must be encapsulated in an external dynamic library that implements a standard interface, so that the simulator is able to initialize the library and dynamically create model objects. This interface consists of the following functions: initLibrary and createObject. Furthermore, the HMs are required to have the appropriate master and/or slave interfaces in order to connect properly to a bus, and to extend an abstract interface that dictates the runtime methods.
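The shape of this interface can be sketched in Python as follows. Only the function names initLibrary and createObject come from the text; their signatures, the `HardwareModule` base class, its `tick()` method and the registry mechanism are assumptions made for illustration, and a Python class stands in for the external dynamic library.

```python
from abc import ABC, abstractmethod

class HardwareModule(ABC):
    """Assumed abstract runtime interface that every external model extends."""

    @abstractmethod
    def tick(self, cycle):
        """Execute one simulation cycle."""

class CounterModel(HardwareModule):
    """Example external model: counts the cycles it has been run for."""

    def __init__(self, name):
        self.name = name
        self.count = 0

    def tick(self, cycle):
        self.count += 1

class ExampleLibrary:
    """Stands in for an external dynamic library (hypothetical layout)."""

    def __init__(self):
        self.registry = {}

    def initLibrary(self):
        # Register the model types this library can instantiate.
        self.registry = {"counter": CounterModel}
        return True

    def createObject(self, type_name, instance_name):
        # Create a model object by type name, as requested by the simulator.
        return self.registry[type_name](instance_name)

library = ExampleLibrary()
library.initLibrary()
model = library.createObject("counter", "cnt0")
for cycle in range(3):
    model.tick(cycle)
```

The simulator side only needs the two entry points and the abstract interface, which is what makes wrapped third-party models interchangeable with native ones in this sketch.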
As a proof of concept we have designed appropriate wrappers for the ARM core described in [9] and for the SystemC description of the PowerPC750 core from [10]. The interfacing effort was increased by the need to isolate the processor core. Our approach is similar to that of [13].
DEBUG & ARCHITECTURE EXPLORATION SUPPORT
Tool chain Integration & Validation
moviSim is a configurable simulator designed for heterogeneous multiprocessor systems. Its purpose is to simulate architectures, allowing them to execute real software applications rather than execution traces. The simulator offers the following features: loading and execution of code for the processors in the system; debugging facilities through a client-server interface, which connects the simulator to a debugger; profiling facilities (bus traffic analysis, memory access analysis, execution time/waiting time analysis, cache activity analysis); and data logging for later inspection.
It can run in two modes: as an independent application, or as a separate application communicating with the moviDebug debugger in client-server mode using a TCP/IP socket for commands and messages (see Fig. 3). The debugging facilities (e.g. debugger scripts) are the same for both the simulator and the actual hardware. The same code targets both the simulator and the hardware and produces the same outputs.
Validation requires significant effort and plays a key role in the acceptance of a tool. Many teams in real-world design report an effort of up to 90% for validation. The simulator has been validated against an existing hardware platform (the Movidius SABRE HW platform).
Trace Information for Design Exploration
Two types of profiling are supported by the Movidius toolchain: simulator based (i.e. different counters are logged in internal structures, and these may be enabled, reset, displayed or logged into files) and hardware based (i.e. based on HW performance counters which can be controlled using moviDebug commands). The first method has the following advantages: good resolution, and a rich range of profiling information that can be retrieved in a single run of the target software. Its drawbacks are simulation overhead and reduced accuracy with respect to the HW. HW support for profiling is non-intrusive: the code is executed without being interrupted by any breakpoint or code instrumentation. The disadvantages of the latter are: multiple runs are needed to profile multiple parameters, HW support is lacking for some of the targeted parameters, and code instrumentation by moviDebug may be required when smaller code blocks need to be profiled. This may lead to inappropriate profiling of code that is time sensitive or contains synchronization.
The profiling information includes: execution cycles, stall cycles, average ILP, power consumption, number of memory RDs/WRs, number of RDs/WRs done by the Movidius DSP core DMA, and number of RDs/WRs to each SHAVE register. For the power estimates, the hardware design team has provided measurements obtained from the physical device and from synthesis tools. In addition, cache metrics (miss/hit ratio, etc.) and the number of memory/subsystem reads/writes performed by a master are logged. Bus logging and Movidius disassembly core logging are also supported. Further information that can be extracted using the Movidius debugger for the SHAVE core with .mof (Movidius object file) files covers functions, data and code labels (number of times each label has been reached/accessed), and assembly lines (number of times each line has been executed).
A CASE STUDY
The architecture that we have used for testing is described in Fig. 4.
CONCLUSIONS
This paper presents moviSim, a configurable simulator designed for the simulation of heterogeneous multiprocessor systems. Its purpose is to simulate HW architectures, allowing them to execute real software applications rather than execution traces. The trade-off between accurate simulation results and simulation time is achieved through cycle-count-accurate modeling. It supports architecture exploration and software development through rich tracing and analysis capability, including logging of processing cores and bus HMs. Component reuse, through wrapper design and encapsulation of an existing HM description, and architecture reuse, through nested architectures, are both supported. In addition, a wide variety of ready-to-use generic components that can be customized to the targeted platform is provided. We have designed and simulated a HW platform with a variable number of Movidius DSP cores running an H.264 decoding application on 720p frames. Future improvements consist of adding support for distributed computation to the simulation engine.
ACKNOWLEDGMENTS
This work has been partially supported by the EU Falx Daciae -SUIM 499/11844, POS CCE O2.1.1 research program.
REFERENCES
[ 
