Abstract - The increasing complexity and short product cycles drive developers of mobile systems to analyse the performance of systems before hardware prototypes are available. Therefore, it is necessary to predict application runtimes with the help of simulations of system models. Miscellaneous components and factors of mobile devices affect the performance, e.g. caches and buses. In order to predict the performance of new system designs already during early stages of development, models of the timing behaviour are necessary. We have developed a modular timing simulator for models of typical mobile systems which can be used to predict the runtime of applications on future systems. Since UML is the de-facto standard for software modelling and widely used, we use UML to specify the hardware of the system. In this way, the gap between hardware and software modelling may be closed, and the performance analyses of application and system design are tied closer together. The UML system model consists of an architecture model and an instruction behaviour description. The architecture model describes the components of the system and the connections between them, and the behavioural model specifies the timing of the processor instructions. These models are used to simulate different configurations of an ARM9 system. Traces from one configuration are used to predict the performance of another configuration. Predictions for an ARM11 system with parallel pipeline units are made.
[...] pipeline are specified. UML diagrams are used to model the system architecture and timing behaviour. UML is a graphical modelling language and the de-facto standard in the software development area. The presented hardware modelling with UML requires specialised concepts which are supported by the UML profile SysML [1]. The MARTE profile [2] is used to enhance the elements of the model with additional semantics.
From these UML descriptions, simulator configurations are derived. For this purpose, we have developed a modular timing simulator for assembler instruction traces. For each component of a mobile system there is a corresponding simulation module. This modular approach allows for a flexible configuration of a large range of systems. Instruction traces of existing applications captured on existing hardware are used as input for the simulator to predict the runtime of these applications on the modelled systems. Due to this trace-based approach, no functional simulation is necessary and the simulator focuses only on the timing of instructions and components. The simulator is designed to be as independent as possible from the instruction set.
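The modular, trace-driven idea described above can be illustrated with a minimal sketch. It is not the simulator presented in this paper: the class names, the latencies, the direct-mapped cache organisation, and the trace format (mnemonic plus optional data address) are all assumptions chosen for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Protocol, Tuple


class Module(Protocol):
    """Hypothetical interface shared by all simulation modules."""

    def access(self, address: int) -> int:
        """Return the number of cycles needed to serve the access."""
        ...


@dataclass
class Memory:
    latency: int  # assumed mean access latency in cycles

    def access(self, address: int) -> int:
        return self.latency


@dataclass
class Bus:
    target: Module
    latency: int  # assumed fixed transfer latency in cycles

    def access(self, address: int) -> int:
        return self.latency + self.target.access(address)


@dataclass
class DirectMappedCache:
    """Simplified cache module (direct-mapped for brevity)."""

    target: Module
    lines: int
    line_size: int
    hit_latency: int = 1
    tags: Dict[int, int] = field(default_factory=dict)

    def access(self, address: int) -> int:
        block = address // self.line_size
        index = block % self.lines
        if self.tags.get(index) == block:
            return self.hit_latency  # hit: served by the cache alone
        self.tags[index] = block     # miss: fetch the line via the next module
        return self.hit_latency + self.target.access(address)


Trace = List[Tuple[str, Optional[int]]]


def simulate(trace: Trace, data_path: Module, base_cycles: int = 1) -> int:
    """Sum the cycles of a trace; no functional execution takes place."""
    total = 0
    for _mnemonic, address in trace:
        total += base_cycles
        if address is not None:
            total += data_path.access(address)
    return total
```

Because every module exposes the same `access` method, the same trace can be replayed against different configurations, for example `Bus(Memory(10), 2)` alone or wrapped in a `DirectMappedCache`, without touching the trace itself.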
[...] evaluation of the methodology and the timing simulator. Simulation predictions of system models are compared with real-world measurements. Section VI concludes the paper.
II. RELATED WORK
Due to the complexity of processors and microarchitectures, simulations are used to predict their performance [3]. In the context of processor simulation two approaches exist, i.e. trace-driven simulation and execution-driven simulation. Trace-driven simulation uses captured or synthetically generated trace files as input and simulates their timing behaviour on a modelled system. This approach is an old technique and widely used [3], [4]. Execution-driven simulation uses software programs as input and simulates their functional execution. SimpleScalar [5], [6] is an example of this approach. The execution-driven approach suffers from the drawback of a fixed instruction set and the necessity to port operating systems and drivers to the simulation framework. Programs like Qemu [7] emulate the functionality of a processor, but lack a timing model for the processor and the architecture. The advantage of the trace-driven approach is that every traceable program can be used as input without the implementation of special system calls or the need to adapt the operating system. The drawback is that branch prediction in the pipeline cannot always be modelled, because the input traces often contain only the executed instructions. However, this typically has no significant influence on the simulation accuracy [3]. The ChARM tool [8] for ARM-based systems follows the trace-driven approach, but the simulated processors are no longer up to date and the simulated instruction set is not configurable.
We also follow the trace-driven approach and developed a modular timing simulator for ARM-based systems. The simulator supports user-defined architectures and instruction sets. Up-to-date processors of the ARM family are modelled and used to simulate traces gathered at the hardware level. Thus, effects of the operating system and drivers are automatically included, which is important for accurate system simulations [9], [10]. Execution-driven approaches and emulators may be used as an alternative to real hardware to generate input traces for the timing simulator.
The system and the processor details are modelled with UML. Software performance engineering (SPE) methods [11] use annotated UML diagrams to model the system and software under study [12], [13], [14]. Since UML does not allow for the modelling of non-functional aspects, many authors apply the UML Profile for Schedulability, Performance, and Time Specification (SPT) [15] to enhance the diagrams with the necessary semantics [13], [14]. The UML Profile for Modeling and Analysis of Real-Time and Embedded systems (MARTE) [2] is the successor of the SPT profile, allows for a detailed modelling of performance aspects, and supports UML 2. Composite structure diagrams of UML 2 are used in SPE methods to model system architectures without processor details [13] [...] types of the SysML profile [1]. SysML is a subset of UML 2 with some extensions to allow for a detailed system modelling.
The authors of [17] analysed the performance of a video codec on ARM systems and determined components which affect the performance. The proposed modelling methodology and the timing simulator consider these components.
Using a subset of benchmark suites is often applied to analyse system architectures [18], [19]. Therefore [...] used in the evaluation.
III. ARM PROCESSOR FUNDAMENTALS
Figure 1 depicts a schematic overview of components typically used in mobile systems like cellphones. A processor consisting of a pipeline and registers is connected to caches. These caches are connected via a system bus to the main memory and other peripherals. A writebuffer is placed in between the data cache and the bus, so that the processor is not delayed by write accesses to the memory.
[Figure 1: processor with pipeline and registers, instruction and data caches, writebuffer, bus, memory, peripherals]
The main memory is often built from one or more RAM modules. These modules have different latencies for read and write access and support a special burst mode when successive data is addressed. Since the RAM modules have to internally address the requested data first, the read and write access latencies depend on the distance between the requested data addresses and the internal data cell structure.
ARM processors are widely used in mobile devices, thus this paper focuses on this processor family. The ARM processors are Reduced Instruction Set Computer (RISC) based and employ modern concepts like pipelines and separate instruction and data caches (Harvard architecture). The advantage of the pipeline concept is that all pipeline stages work in parallel and thus multiple instructions can be processed during one pipeline cycle. There are two types of pipeline cycles. Arithmetical and logical calculation steps, reading registers etc. last one internal cycle (I-cycle), whose duration depends on the clock speed of the processor. Requesting instructions or data from the memory depends on the latencies of the bus and the memory modules. This duration is referred to as memory cycle (M-cycle) in the following. The theoretical maximum parallelism of the pipeline stages cannot always be achieved due to interlocks and stalls. An instruction in the pipeline may need the result of a predecessor instruction for its own calculation. Such a situation [...]
This section presents the UML modelling methodology for mobile systems that applies the SysML profile for UML and the MARTE profile. The SysML profile enhances the semantics of UML to allow for the specification of systems consisting of hardware and software. The MARTE profile is used to enhance the model elements with the additionally needed semantics of the components.
The architecture model describes the system components which affect the performance, e.g. caches, buses, and the instruction set. It defines the architecture of the system consisting of these components and the communication paths between them. Block definition diagrams (BDD) of SysML are used to define the components of the system model. Block definition diagrams are the SysML counterpart of UML class diagrams. Internal block diagrams (IBD) of SysML are used to model the internals of the components and their interconnections. They are similar to composite structure diagrams in UML.
The pipeline behaviour and the timing behaviour of the processor instructions are modelled in the instructions behaviour [...]
[...] Instances of these blocks are used as parts in internal block diagrams. Figure 4 depicts an internal block diagram describing the architecture of an ARM9 system with the aforementioned components (cf. fig. 1 and ??). The instruction cache is directly connected to the bus, whereas the data cache is connected via a writebuffer to the bus. The bus is connected to the main memory. All parts in this internal block diagram are annotated with stereotypes. The memory part is annotated with «HwMemory», and latencies for different data accesses depending on the distance between the requested memory addresses (cf. section III) are specified in the annotation. The stereotype «HwBus» describes the properties of the bus part. The tags bandwidth, clock, and schedPolicy are used as defined in the profile and specify the bandwidth, the clock speed, and the arbitration scheme of the bus. In order to specify the burst capabilities of a bus, we extended this stereotype by the tag burst, which gives the supported burst lengths. The «HwWritebuffer» stereotype provides tags to specify the properties of the buffer. The tag addressBuffer specifies how many non-successive write requests can be buffered.
The tag buffersize specifies the maximum number of data which can be buffered for the requests. The «HwCache» [...]
In order to describe the internal properties of the processor part, this part is refined by another internal block diagram. The internal components are pipeline stages and the register bank. The «HwRegisterbank» stereotype contains all necessary information to specify the register bank properties, i.e. the number of registers and the register size. The refined processor block consists of pipeline stages which are connected with each other to specify possible instruction paths. Pipeline stage parts are annotated with the «HwPipelinestage» stereotype, which may contain the tags defaultCycle and branchExecuteStage. The defaultCycle tag is used to ease the specification of the timing behaviour of instructions that will be described in section B. The tag branchExecuteStage is set for stages which determine whether a branch is taken or not. If a branch was mispredicted, the pipeline is flushed to remove already fetched instructions from the predecessor stages. Pipeline stages which access the caches or the memory need appropriate ports which connect them to the cache or memory ports of the processor block (cf. fig. 4). Stages whose processing requires register values need a connection to the register bank.
Figure 5 depicts the five-stage pipeline of an ARM9 processor (cf. fig. 2(a)) [...]. In this way, the path of instruction objects through the pipeline is defined.
In case of a pipeline with parallel pipeline stages, e.g. the ARM11 pipeline, instructions may take different paths (cf. fig. 2(b)). For example, an add (addition) is processed by the ALU stage whereas a mul (multiplication) is processed by the MAC stage, but both stages have the same preceding stage. Therefore, it is necessary to specify the paths depending on the instruction types. Specialised instruction blocks are introduced which are employed to type the atomic flow ports of the pipeline stages. In this way, the paths for the different instruction types are defined.
Figure 6 presents specialised instruction blocks for the instruction set of the ARM11 which correspond to the parallel stages, i.e. ALUInstruction, MACInstruction, and LSInstruction. The actual instructions are specialisations of these blocks and are reused in the behavioural model (cf. sec. B). Figure 7 depicts an extract of the internal block diagram defining the pipeline of an ARM11 system. The flow of instruction objects through the pipeline is restricted by the typed flow ports and the item flow of the connectors between the ports.
The block definition diagram of all involved components, the internal block diagram of the system layout and its refinements, i.e. the pipeline description, and the instruction inter- [...]
[...] // r6 = r6 + r4
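The way typed flow ports restrict instruction paths can be sketched in a few lines. The following is only an illustration under stated assumptions: the stage names, the connector layout, and the mapping of ldr to LSInstruction are hypothetical simplifications and do not reproduce the ARM11 model of Figure 7; only the three instruction block names and the add/mul examples are taken from the text.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

# Instruction block names as used in the text.
ALL_KINDS: FrozenSet[str] = frozenset(
    {"ALUInstruction", "MACInstruction", "LSInstruction"}
)


@dataclass(frozen=True)
class Stage:
    """Pipeline stage part; fields loosely mirror the «HwPipelinestage» tags."""

    name: str
    accepts: FrozenSet[str] = ALL_KINDS  # instruction types the input port is typed with
    default_cycle: str = "I"             # stands in for the defaultCycle tag
    branch_execute_stage: bool = False   # stands in for branchExecuteStage


# Hypothetical, strongly simplified pipeline with three parallel stages.
STAGES: Dict[str, Stage] = {
    s.name: s
    for s in [
        Stage("fetch", default_cycle="M"),
        Stage("decode"),
        Stage("alu", accepts=frozenset({"ALUInstruction"})),
        Stage("mac", accepts=frozenset({"MACInstruction"})),
        Stage("ls", accepts=frozenset({"LSInstruction"})),
        Stage("writeback"),
    ]
}

# Connectors between stage ports (assumed to be acyclic).
CONNECTORS: Dict[str, List[str]] = {
    "fetch": ["decode"],
    "decode": ["alu", "mac", "ls"],
    "alu": ["writeback"],
    "mac": ["writeback"],
    "ls": ["writeback"],
}

# add and mul follow the text; ldr is an assumed example of a load/store instruction.
INSTRUCTION_KIND: Dict[str, str] = {
    "add": "ALUInstruction",
    "mul": "MACInstruction",
    "ldr": "LSInstruction",
}


def find_path(mnemonic: str, start: str = "fetch") -> List[str]:
    """Follow the connectors whose target stage accepts the instruction type."""
    kind = INSTRUCTION_KIND[mnemonic]
    path = [start]
    while True:
        candidates = [
            nxt for nxt in CONNECTORS.get(path[-1], [])
            if kind in STAGES[nxt].accepts
        ]
        if not candidates:
            return path
        if len(candidates) > 1:
            raise ValueError(f"ambiguous path for {mnemonic} after {path[-1]}")
        path.append(candidates[0])
```

Calling `find_path("add")` and `find_path("mul")` yields two paths that share the fetch and decode stages and diverge afterwards, which is the situation the typed ports are meant to capture.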
UML sequence diagrams are used to model the pipeline behaviour and the timing behaviour of instructions. The MARTE profile is enhanced by new stereotypes to specify the model in the desired level of detail. The instructions' behaviour is modelled cycle-accurately, and register access is modelled, too.
Optionally, the instruction add supports shift flags to left/right shift (multiply/divide by a power of two) the value of the second register before the addition. The power is specified as a third parameter.
Figure 8 shows the specification of the pipeline behaviour of the instruction add in an ARM9 processor. Instances of the pipeline stage block corresponding to the stages modelled in the internal block diagram of the pipeline (cf. fig. 5) are shown on top. The register bank is part of the diagram, because the instruction has to access registers during processing. Since the instruction is fetched by the fetch stage from the memory, a message to the memory element is modelled. The «HwCycle» annotation and its tag cycleType specify that the instruction consumes a memory cycle in this stage. If the tag cycleType is omitted, the default cycle type defined in the internal block diagram of the pipeline is taken (cf. fig. 5).
The architecture model and the instructions behaviour model specify the architecture and the behavioural aspects of mobile systems. UML profiles, i.e. SysML and MARTE, are employed in the modelling. SysML helps in specifying the architecture, and the MARTE profile enhances the model elements with semantic meaning.
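The interplay of the cycleType tag and the per-stage default cycle described above can be sketched as follows. This is a simplified illustration, not the simulator's implementation: the stage names beyond fetch, decode, and execute, the tick counts, and the function name are assumptions, and the sketch sums the cycles of a single instruction in isolation, ignoring overlap between instructions, interlocks, and stalls.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class CycleCosts:
    i_cycle: int  # assumed duration of an internal cycle in clock ticks
    m_cycle: int  # assumed duration of a memory cycle in clock ticks


# One entry per message in the sequence diagram: (stage, cycleType or None).
Steps = List[Tuple[str, Optional[str]]]


def instruction_ticks(
    steps: Steps, default_cycles: Dict[str, str], costs: CycleCosts
) -> int:
    """Sum the ticks of all steps, using the stage default if cycleType is omitted."""
    total = 0
    for stage, cycle_type in steps:
        effective = cycle_type if cycle_type is not None else default_cycles[stage]
        if effective == "M":
            total += costs.m_cycle
        elif effective == "I":
            total += costs.i_cycle
        else:
            raise ValueError(f"unknown cycle type {effective!r} in stage {stage}")
    return total


# Hypothetical per-stage defaults of a five-stage pipeline.
DEFAULT_CYCLES: Dict[str, str] = {
    "fetch": "I",
    "decode": "I",
    "execute": "I",
    "memory": "I",
    "writeback": "I",
}

# add: only the fetch step is annotated explicitly with a memory cycle.
ADD_STEPS: Steps = [
    ("fetch", "M"),
    ("decode", None),
    ("execute", None),
    ("memory", None),
    ("writeback", None),
]
```

Keeping the defaults on the stages means that an instruction description only has to annotate the steps that deviate from them, which is the purpose the text assigns to the defaultCycle tag.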
[...] In the decode stage an I-cycle is consumed, which is modelled by an annotated message to the decode stage. In the execute stage exclusive access to the first register of the instruction [...]
V. EVALUATION
This section presents the evaluation of the simulation framework and the applicability of the proposed modelling methodology. Traces from an ARM9 system are gathered and used as input for two system models. First, a system with direct connection between the processor and the bus is analysed. Second, instruction and data caches are placed in between the processor and the bus. These system architectures are modelled in UML [...] [25]. The OMAP board contains an ARM926EJ-S MPU [22] with a clock frequency of 192 MHz. The MPU provides a 16 kB instruction cache as well as an 8 kB data cache. Both caches use a block size of 32 bytes and are four-way associative. The AMBA system bus [26] connects the processor, caches, and peripherals like the main memory. A 32 MByte SD-RAM is used as main memory. The operating system used is a Linux branch from Montavista [27]. The Lauterbach tool is able to trace the executed instructions, the instruction addresses, and the data addresses in case of load and store instructions. The measured timings are cycle-accurate. Since neither the specification of the board nor the specification of the used RAM gives exact information about the memory access latencies, we measured the latency of load and store accesses. Table I [...]
[...] used to simulate the runtime on the two aforementioned systems, i.e. a system without caches and a system with instruction and data caches.
Figure 9 [...] the relative error decreases with an increasing block size of the memcopy function. An explanation for this is the aforementioned fact of varying RAM latencies, which is not modelled in detail in our simulator and just simulated in an abstract way by using mean access latencies. The influence of this imprecise main memory modelling is larger for small traces, because only a few memory accesses occur.
Figure 10 presents a comparison of measurements and simulation results as described above for the system with instruction and data caches. The mean relative error of these simulations is larger than for the system without caches, but the relative errors of the predictions decrease with an increasing size of the memcopy traces. This can be explained by the fact that in this configuration (with caches) significantly less mem- [...]
[...] memcopy function. This is due to the much larger runtime so that interrupts of the Linux kernel do not significantly influence the measurements. The relative error (right y-axis) of the simulation results is plotted for each input. The mean value of the relative error is around 3.5%. The simulation results are closer to the measured real-world figures than in the memcopy analysis. This underlines the observation that the relative error decreases with the number of instructions in the trace due to the abstract modelling of the main memory.
Figure 12 shows the runtime predictions of the jpeg encoding algorithm for a system without caches. The relative error [...]
The results for the jpeg decoding algorithm are similar to the results of the jpeg encoding. Measurements and simulations of six input images result in a mean relative error of less than 1% in case of the system without caches and an error of around 4% for the system with caches. Sample measurements and simulations for the ABS algorithm and the Dijkstra algorithm are also in this range. The relative error for ABS encoding is less than 2%, for ABS decoding around 4%, and the Dijkstra runtime is simulated with less than 1% deviation.
Due to the lack of cycle-accurate measurements for an
