Towards real-time HW/SW co-simulation with operating system support by He, Zhengting
Copyright
by
Zhengting He
2007
The Dissertation Committee for Zhengting He certifies that this is the approved version of the following dissertation:
Towards Real-Time HW/SW Co-Simulation with Operating System Support
Committee:
Aloysius K. Mok, Supervisor
Vijay K. Garg, Supervisor
Baxter F. Womack
James C. Browne
Anirudh Devgan
Joydeep Ghosh
Towards Real-Time HW/SW Co-Simulation with Operating System Support
by
Zhengting He, B.S.E.E.; M.S.E.E.
Dissertation
Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
The University of Texas at Austin
May 2007
To my family
Aknowledgments
I would like to thank my parents Zhongqiu Wu and Deren Ding. Without their continuous encouragement, this dissertation would not have been possible. I would also like to express my gratitude to my wife Ling Wu, who has constantly supported me throughout the ups and downs of my Ph.D. studies. I am so lucky to have them in my life. I would like to thank my baby boy Ray He, who has supported his dad in his own special way, using his charming smile. Because of him, I made up my mind to finish my dissertation so that one day he can be proud of his dad.

I would like to give special thanks to my supervisors, Prof. Aloysius K. Mok and Prof. Vijay K. Garg, not only for their excellent academic advice, but also for all kinds of support in my life. I joined the Real-Time System Lab at the University of Texas at Austin in 2001. I can still vividly remember the first time I walked into Prof. Mok's office and he introduced me to the area of real-time systems. Step by step, he has guided me to focus my research on HW/SW co-simulation with real-time operating system support, which not only is a truly interesting and emerging research problem, but also matches my technical background perfectly. I want to thank Prof. Garg for allowing me to continue my dissertation work while working at Texas Instruments Inc. from 2004. Without his trust and help, I would not have had the chance to finish my Ph.D. in the United States. Furthermore, their integrity and ethical behavior will greatly influence the rest of my life. I am blessed to have them as my Ph.D. supervisors.
I wish to thank my committee members, Prof. Baxter F. Womack, Prof. James C. Browne, Prof. Joydeep Ghosh and Dr. Anirudh Devgan, for their constructive feedback on my dissertation. I'm honored to have them on my committee.

Last but not least, I would like to thank all current and former members of the Real-Time System Group for their wonderful friendship, especially Xiang Feng, Deji Chen, Wing-Chi Pong, Jianping Song, Jianliang Yi and Weirong Wang, for all the joys we had together.
Zhengting He
The University of Texas at Austin
May 2007
Towards Real-Time HW/SW Co-Simulation with Operating System Support

Publication No.

Zhengting He, Ph.D.
The University of Texas at Austin, 2007

Supervisors: Aloysius K. Mok and Vijay K. Garg

A trend in the consumer electronics market is the demand for new applications that have a lot of similarities to older applications but impose more challenging and special-purpose performance requirements. In the digital signal processing (DSP) industry, this clearly reflects a transition from the design regime of general DSP to the application-specific DSP. From the design perspective, it means that the DSP core remains unchanged but more and more hardware (HW) accelerators, DMAs and bus architectures need to be integrated into the chip. A key in effecting this transition is the engineering capability to make sure that the design specification "matches" the application before detailed design starts. Therefore, application software (SW) needs to be developed in parallel with HW to verify the design specification at the system level. Enabling development and simulation of SW before the actual HW is available also reduces the time-to-market period, which is another important benefit.

HW/SW co-simulation for design specification refinement imposes many challenging requirements on the simulation platform. The simulation components (simcoms)
modeling the real HW (rhw) modules to be designed and the application SW need to be integrated to carry out the simulation at system level. Simulation results need to be accurate. Simulation speed should allow fast design space exploration and ease the debugging of complex application SW. HW and SW problems should be isolated cleanly since HW and SW engineers often do not have enough expertise in one another's domains. The simulator should be cost-effective. These requirements often conflict with one another. For example, achieving high simulation accuracy typically requires the simulation to be carried out at a low level, which implies that the simulation speed is slow. A simulator allowing integration of simcoms and application SW for simulation is very expensive and thus only very few engineers can use it. In many cases, simcoms and application SW are not constructed in the same programming language. Interfacing them is not a trivial problem and often impacts the simulation speed severely. Using a single simulator requires the engineers to understand both HW and SW details, which violates the requirement of HW/SW problem isolation. The bottom line is that no single simulator can fulfill all these requirements at the same time.

This dissertation describes three simulation tools for different usages. The first one models and simulates the real-time operating system (RTOS) together with the application SW. It is motivated by the fact that with the appearance of high performance DSPs, more and more tasks will be implemented as SW on a single DSP managed by an RTOS. Selecting the "right" RTOS before the SW is developed is very important. The tool is implemented based on SystemC and is configurable to support modeling and timed simulation of most popular embedded RTOSes. Timing fidelity is achieved by using delay annotation. The OS timing information is derived from published benchmark data. Application timing information can be profiled or estimated from similar legacy applications.
The optimized conservative approach is taken to synchronize simcoms. Compared to other research work, an important contribution of this tool is an online algorithm for predicting the timestamp of the
next event based on the realistic assumption that multiple tasks execute concurrently on a processor, managed by a static or dynamic priority driven scheduler. The simulation speed is more than 3 orders of magnitude faster than a commercial instruction set simulator (ISS) with comparable accuracy. The tool is used to assist in the generation of an initial design specification.

The second tool is a system dataflow simulator (SDFS) and is used by the HW engineers to refine the HW specifications. It models the application by a parameter-driven conditional dataflow graph (CDFG) at the transaction level and the HW by a configurable HW graph at the cycle-accurate level. SDFS takes the application CDFG and HW graph as the input and carries out the simulation to catch the detailed HW activities, e.g., bus arbitration. It only requires the HW engineers to understand the application at the CDFG level. To carry out the system simulation at such a low level, many commercial simulators need to couple an ISS for application SW with an RTL simulator for simcoms, a combination that is typically 6 orders of magnitude slower than the rhw speed. The simulation error of SDFS is within 5% in most cases and the worst case error is within 13%, which is comparable to the ISS+RTL approach. But the simulation speed is only 4 orders of magnitude slower than the rhw speed. Compared to other similar research work that also models the system at CDFG level, SDFS can achieve higher simulation accuracy because of the following advantages: 1) it does not need a fixed application trace as input and thus is flexible enough to cover many simulation scenarios; 2) it does not assume a fixed cost for each functional block and thus is able to estimate the system performance under actual execution conditions; and 3) it is able to model the pipelined architecture common in modern DSPs.
The proposed simulator is cost-effective since it is implemented in the SystemC language and can be executed on most PCs and workstations.

The third tool is a real-time simulation platform (RTSP) implemented on legacy DSPs. To the best of our knowledge, this is the first simulator that truly enables the application SW to be developed in parallel with HW by offering the
same SW development environment as if the rhw was available. To simulate the behavior of a rhw module, a corresponding simcom is constructed running on a legacy DSP. The success of this simulation strategy hinges on a novel way to apply the concept of Real-Time Virtual Machines to simulation. Each legacy DSP employs a two level scheduler to enforce that each simcom carries out the simulation at a fixed fraction of the rhw speed, so that any job that would finish at time t on the rhw will finish no later than t scaled by the corresponding slowdown factor plus Δ, where Δ is a constant bound. Such a feature eliminates expensive synchronization between the simcoms. RTSP is proven to perform simulations faithfully and also is shown experimentally to be effective for real industry applications. For a rhw whose timing behavior can be accurately modeled by the SW behavior model, the simulation error is shown to be less than 5%. For very complicated rhw whose timing cannot be accurately captured by the behavior model, the simulation accuracy was shown to be excellent for the average case. The simulation speed is quite fast. For the selected audio and video applications, simulation is only 10X and 30X slower than rhw execution, respectively. The RTSP platform is practically zero-cost since legacy EVM boards can be reused for the purpose of simulation.

RTSP and SDFS can be used to complement each other. RTSP carries out the simulation at a higher level than SDFS and usually cannot capture activities on buses at every cycle. The information collected from SDFS determines the appropriate rate settings for simcoms to compensate for the resource competition. RTSP allows SW engineers to optimize the algorithm and suggest improvements to HW architecture. Suggested changes are fed to SDFS for refining the design specification.
Contents
Aknowledgments vAbstrat viiList of Tables xvList of Figures xviChapter 1 Introdution 11.1 Introdution to HW/SW Co-Design of DSP . . . . . . . . . . . . . . 11.2 Survey of HW/SW Co-Design . . . . . . . . . . . . . . . . . . . . . . 51.2.1 Complete Co-Design Environment . . . . . . . . . . . . . . . 51.2.2 Model of Computations and Languages . . . . . . . . . . . . 71.2.3 HW/SW Partition . . . . . . . . . . . . . . . . . . . . . . . . 91.2.4 Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111.2.5 Platform Based Design for SoC . . . . . . . . . . . . . . . . . 121.2.6 SW Toolkit Generation . . . . . . . . . . . . . . . . . . . . . 141.2.7 Open Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 161.3 Introdution to HW/SW Co-Simulation . . . . . . . . . . . . . . . . 161.3.1 Dierent Simulation Abstrat in Level . . . . . . . . . . . . . 171.3.2 Heterogeneous vs. Homogeneous Simulator . . . . . . . . . . 19xi
1.3.3 Synhronization Overhead between Simulation Components . 211.3.4 RTOS Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 241.4 HW/SW Co-Simulation Requirements . . . . . . . . . . . . . . . . . 251.4.1 Requirements of Generating an Initial Speiation . . . . . . 251.4.2 Requirements of Speiation Renement . . . . . . . . . . . 261.5 Summary of Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . 28Chapter 2 RTOS Modeling 332.1 Introdution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362.3 RTOS Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392.3.1 RTOS State Mahine . . . . . . . . . . . . . . . . . . . . . . . 392.3.2 OS Sub-Module Design . . . . . . . . . . . . . . . . . . . . . 412.4 Simulation Timing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472.5 Simulation Synhronization . . . . . . . . . . . . . . . . . . . . . . . 502.5.1 Event Timestamp Predition . . . . . . . . . . . . . . . . . . 502.5.2 Synhronization Protool . . . . . . . . . . . . . . . . . . . . 542.6 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562.7 Conlusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57Chapter 3 System Dataow Simulator 593.1 Introdution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613.3 Tool Desription . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633.3.1 Appliation Dataow Graph . . . . . . . . . . . . . . . . . . . 633.3.2 Appliation Conditional-Flow Graph . . . . . . . . . . . . . . 643.3.3 HW Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673.3.4 Appliation CDFG & HW Parameters . . . . . . . . . . . . . 683.3.5 Mapping Appliation DFG to HW Graph . . . . . . . . . . . 71xii




List of Tables
2.1 Simulation Result of H.263 Decoder
3.1 Modeling TI DM642
3.2 Constructing CFG for H.263 Decoder
3.3 Decoding a P GOB by H.263 Decoder
3.4 Cache Activities for Decoding P GOB
3.5 TI DM642 Parameters
4.1 Audio Simulation Result
4.2 Video Simulation Result
List of Figures
1.1 HW/SW Co-Design Proess . . . . . . . . . . . . . . . . . . . . . . . 21.2 fun, T , and simom . . . . . . . . . . . . . . . . . . . . . . . . . . 292.1 Basi RTOS State Mahine . . . . . . . . . . . . . . . . . . . . . . . 392.2 IO Module Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442.3 Devie Strut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452.4 IoStrut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472.5 Clok Advane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482.6 Synhronization Overhead . . . . . . . . . . . . . . . . . . . . . . . . 512.7 OS-Wide Event Time Predition . . . . . . . . . . . . . . . . . . . . 522.8 Synhronization Protool . . . . . . . . . . . . . . . . . . . . . . . . 552.9 H.263 Deoder System . . . . . . . . . . . . . . . . . . . . . . . . . . 563.1 Appliation DFG Example . . . . . . . . . . . . . . . . . . . . . . . . 633.2 Appliation CFG example . . . . . . . . . . . . . . . . . . . . . . . . 653.3 HW Graph Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 673.4 Modeling Pipeline/Nonpipeline DSP . . . . . . . . . . . . . . . . . . 703.5 Deoding P GOB, Modeling Cahe . . . . . . . . . . . . . . . . . . . 763.6 H.263 Deoder on DM642-600MHz . . . . . . . . . . . . . . . . . . . 773.7 MPEG2 Deoder on DM642-600MHz . . . . . . . . . . . . . . . . . . 783.8 H.263 Deoder on DM642-720MHz . . . . . . . . . . . . . . . . . . . 79xvi









[Figure 1.1 block labels: HW design, interface design, SW design]












Figure 1.1: HW/SW Co-Design Process

Embedded systems cover a broad area and different systems have very different characteristics. This dissertation focuses on an important application area: digital signal processor (DSP) based multimedia systems. A trend in the consumer electronics market is the demand for new applications that have a lot of similarities to older applications but impose more challenging and special-purpose performance requirements [98]. In the DSP industry, this clearly reflects a transition from the design regime of general DSP to the application-specific DSP. The typical design strategy is to leave the DSP core unchanged but redesign the on-chip HW accelerators, the DMAs and bus architectures for the new application [97].
A typial HW/SW o-design ow for a new DSP is depited in Figure 1.1.It is divided into 2 phases, the speiation phase and the implementationphase. The former phase starts with dening an initial system speiation. Somedeisions, suh as whether to selet a xed point or a oating point ore, whihreal-time operating system (RTOS) to deploy et. are made and xed at this stage.Other deisions like dening the memory arhiteture and HW aelerators et. willbe rened in later stages.The HW/SW partitioning stage omputes a good mapping of the systemspeiation to a set of the real HW modules (rhws) to be designed, i.e., DSPore, DMA, HW aelerators et. Some funtionalities are realized diretly by theHW modules and an initial HW module level speiation is generated. Otherfuntionalities are implemented as SW being exeuted on the DSP ore and similarlythe initial SW module level speiation is generated. The mapping is based on theestimation of the ost metris on these modules. Typial HW ost metris areexeution time, hip area, power onsumption and testability. SW ost metris mayinlude exeution time and the amount of required program and data memory.After the HW/SW partition is done, a simulation omponent (simom) isonstruted for eah rhw at the desirable abstrat level to model its funtionality.An iterative simulation proess is used to estimate the system performane until asatisfatory design speiation is found. For examples, simoms are onstrutedusing the behavior model at the beginning and rened gradually to better reet thetiming harateristis. Typially the register-transfer-level (RTL) implementationfor rhws is not started in the speiation phase. Appliation SW an be mod-eled and o-simulated with the simoms in a unied environment to ahieve fastsimulation speed. The implementation an be started if legay SW an be reusedin whih ase an appropriate o-simulation platform is needed to integrate the SWwith the simoms. 
The interfae is the SW driver to realize ommuniation be-tween the rhw and appliation SW. The ommuniation behavior, i.e., synhronous3
vs. asynhronous data transfer, also needs to be modeled and simulated togetherwith all simoms and appliation SW for determining the most appropriate driverstruture. Implementation of the upper driver layers that are well isolated from therhw an be started in the speiation phase.The other iterative proess, the implementation phase, starts after the systemspeiation has been generated. The implementation of the rhws begins from theRTL level using VHDL or Verilog HDL language and is veried by the orrespondingsimulators. Then it is synthesized to gate level and veried on emulation HW suhas FPGA. Eventually the physial layout is ompleted. Interfae implementationan be made on the prototyping HW. Dierent platforms an be used to supportHW/SW o-veriation in this phase. Some platforms are SW based and some areimplemented on emulation HW. Suh simulators an ahieve very aurate resultsbut the simulation speed is slow. Long setup time, diÆulty of debugging andextremely high ost forbid the development of any omplex appliation SW [169℄.In pratie, HW/SW o-veriation in the implementation phase is often optional.If any error that signiantly aets system funtionality or performane is foundduring the implementation phase, the design proess has to swith bak to thespeiation phase to re-generate a orret speiation. Suh a rollbak an imposea huge ost and thus generating a orret speiation that \mathes" the targetedappliation is ritial.The rest of this hapter is organized as follows. Setion 1.2 gives a litera-ture survey of the HW/SW o-design area. Setion 1.3 introdues some bakgroundknowledge on HW/SW o-simulation. Setion 1.4 desribes various simulation re-quirements in the speiation phase and eluidates the problems to be solved.Setion 1.5 summarizes the work done in this dissertation to solve these problems.
1.2 Survey of HW/SW Co-Design

Research on HW/SW co-design has been under way for about 15 years. Co-design is a design methodology supporting the concurrent development of HW and SW in order to achieve system functionality and performance goals [46]. It tries to increase the predictability of embedded system design by providing analysis methods that tell designers if a system meets its performance, power, and size goals and synthesis methods that let researchers and designers rapidly evaluate many potential design methodologies [168]. The co-design process for embedded systems includes many tasks such as specification, modeling, validation, and implementation. The field is fragmented because most efforts are applied to specific design problems [130]. The rest of this section gives an overview of HW/SW co-design from various perspectives. One thing to note is that co-simulation has always been considered an important topic in the co-design area. A survey of research on co-simulation is given in Section 1.3.

1.2.1 Complete Co-Design Environment

An important notion in co-design approaches is that there must not be a continuity problem. That is, the steps from model to the synthesis should all be in the design process [46]. Many tools have been proposed to offer a complete environment covering system specification, modeling, synthesis, validation and implementation.

• Ptolemy II [104] supports heterogeneous modeling, simulation and design of concurrent systems. It allows a large variety of models of computation (MoCs) to be combined hierarchically [68]. The most important MoCs are continuous time, discrete event, finite state machine (FSM) and periodic triggered actors [99]. The focus of Ptolemy II is on simulation. It is often used in combination with other tools, e.g. Polis.
• Polis [5] is a design environment for control-dominated embedded systems. It is based on the co-design finite state machine (CFSM) and is simulated in the discrete event domain of Ptolemy. The entire system is a group of communicating CFSMs plus an instruction set simulator (ISS) [139] [125]. The system specification language is Esterel [48], but a graphical specification can also be given. The SW is automatically generated. HW is synthesized as well, but off-the-shelf IP cannot be integrated [46]. Automatic partitioning is not supported in Polis.

  The Cadence Virtual Component Co-Design (VCC) [23] toolkit has been built on top of Polis. VCC allows better IP reuse than Polis.

• COSYMA [71] [72] is an older design method. It covers the entire design flow from specification to synthesis. The target architecture is assumed to consist of a standard RISC processor, a fast RAM for program and data with single clock cycle access time and automatically generated application specific co-processors. The modeling language is Cx, which is a C extension with support for parallel processes and timing constraints.

• VULCAN [85] is another older tool. The specification language used is called HardwareC. Although its syntax is C-like, its semantics are those of a HW design language and thus rather low-level. Initially, a system is specified as a complete HW solution. When the timing and resource constraints are specified, an iterative automatic partitioning approach moves suitable parts to SW running on a general purpose processor (GPP).

• SpecC [77] is a new language based on C. It includes a methodology for system design that allows a systematic design space exploration (DSE), called specify-explore-refine [76]. This methodology does not tend to support complex target platforms [179]. The modeling approach used in SpecC is similar to ones used
in SW compilers, essentially a syntax graph.

• STATEMATE [92] is a set of tools, with a heavy graphical orientation, intended for the specification, analysis, design, and documentation of large and complex reactive systems. The modeling language is StateChart [126]. The main novelty of STATEMATE is in the fact that it "understands" the entire descriptions perfectly, to the point of being able to analyze them for crucial dynamic properties, to carry out rigorous executions and simulations of the described system, and to create running code automatically.

Not all approaches described above are "industry-ready". Ptolemy II and Polis are probably the most famous existing tools. However, both of them are only able to handle small systems in the control domain. Polis does not support IP reuse, which prevents it from being accepted in practice. Ptolemy II does not support sufficient SW and HW synthesis [46]. SpecC does take IP integration into account. However, it pays a price in a lower abstraction level than, for example, Polis, and the partition is not automated [46]. The level of abstraction in both VULCAN and COSYMA is also low. The architecture assumption is too simple to keep pace with the complexity of current embedded systems. Research work on both of them has been stopped.

FPGAs have become an advantageous alternative to ASICs in low volume applications, due to their reduced cost [127]. Their design flow is similar to the ASIC design flow. Complete design environments offered by industry include Xilinx [26], Altera [1] and Mentor Graphics [9].

1.2.2 Model of Computations and Languages

Modeling is the process of conceptualizing and refining the specifications, and producing a HW and SW model [130]. A model of computation (MoC) describes components in a system and how they communicate and execute [126].
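As a rough illustration of what an MoC fixes, the sketch below implements a tiny discrete-event kernel, one of the MoCs named earlier in this section. It is a minimal sketch under stated assumptions and is not taken from any of the surveyed tools: the function run_discrete_event, the producer and consumer components, and the delays of 2 and 10 time units are all hypothetical.

```python
import heapq


def run_discrete_event(initial_events, handlers, until):
    """Process (time, target, payload) events in timestamp order.

    handlers maps a component name to a function that receives
    (time, payload) and returns a list of new (time, target, payload) events.
    Events with a timestamp greater than `until` are not processed.
    """
    queue = list(initial_events)
    heapq.heapify(queue)
    trace = []
    while queue:
        time, target, payload = heapq.heappop(queue)
        if time > until:
            break
        trace.append((time, target, payload))
        for event in handlers[target](time, payload):
            heapq.heappush(queue, event)
    return trace


# Hypothetical components: the producer emits a sample every 10 time units
# and each sample reaches the consumer 2 time units later.
def producer(time, payload):
    return [(time + 2, "consumer", payload), (time + 10, "producer", payload + 1)]


def consumer(time, payload):
    return []  # the consumer only observes samples


trace = run_discrete_event(
    [(0, "producer", 0)],
    {"producer": producer, "consumer": consumer},
    until=20,
)
for entry in trace:
    print(entry)
```

The components only describe what they do when an event arrives; the kernel alone decides the order of execution and the notion of time, which is the part of the design that the MoC defines.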
The number of MoCs used in the modeling language of a co-design method can be used to discriminate between two types of approaches: homogeneous modeling and heterogeneous modeling.

• Homogeneous modeling contains multiple MoCs in a single language. It is the most ambitious approach and has the advantage that the decision of what to implement in HW and what in SW can be postponed. Many researchers doubt whether such an approach will work, because it is very difficult to combine various MoCs in a single language [54]. Examples of homogeneous methods are Polis [5], COSYMA [72] and SpecC [77]. The partition decision is supported by the homogeneous method itself. However, no real tools support this completely. An integral co-design approach is only possible for some specific algorithms or a small set of applications [46].

• Heterogeneous modeling uses different languages to model HW and SW components. The choice between HW and SW implementation is made before the modeling phase starts, and each component is modeled in a language specifically targeted towards that type of computation [130]. Simulation is used to validate the design and integration of the various languages (co-simulation). A typical example is Ptolemy II [104].

All co-design systems are based on an MoC or combine a few of them. For a specific MoC, the following three properties are important for system design.

• Modeling of time, which can be continuous time, discrete time, partially ordered time (discrete event), or no explicit notion of time [54].

• Orientation, which can be state-oriented, reflecting the control sequence, or activity-oriented, reflecting the functionality. Examples of state-oriented MoCs are StateCharts [126] and CFSM [125]. Important activity-oriented MoCs are Kahn Process Network and Petri Nets [122].
 Main appliation whih an either be dataow oriented or ontrol-ow ori-ented.[122℄ proposed the tagged-signal model as a framework for omparing har-ateristis from dierent MoCs. It is lear that the hoie of modeling languageis fundamental to a o-design methodology. Extension of C/C++, i.e. SpeC andSystemC reeived a lot of interests reently. Although partition still annot be au-tomated, their benets are: imposing a short learning urve to many engineers,allowing synthesis to many target arhiteture at RTL level, and supporting IPreuse.1.2.3 HW/SW PartitionPartition is the task of alloating system funtions to a set of HW and SW resoures.In most pratial ases, sheduling also needs to be onsidered as an integral part ofthe partition proess. These two topis address where and when the system funtionsare implemented, respetively. The ost funtion of a partitioning problem needs tobe evaluated using estimates of the resulting HW and SW. However, the abstrationlevel at whih partitioning is arried out is so high that only rough estimates areavailable, whih makes the partition problem hard [130℄.Various formulations to the partition problem an be ompared on the basisof the 1) arhitetural assumptions, 2) partitioning goals, and 3) solution strategy[130℄. Arhitetural and appliation assumption: [71℄ assumes that the o-proessingHW is operated under diret ontrol and in sequential with the GPP while [85℄[55℄ assume that o-proessing is onurrent with SW exeution. [141℄ [166℄[43℄ [138℄ [50℄ assume the arhiteture to be a set of symmetri proessors.[57℄ [84℄ assume that the ommuniation interfae between the GPP and o-proessor is memory mapped I/O while [171℄ assumes it is bus oriented. [136℄9
assumes that the ommuniation operations between eah pair of omponentsare expliitly speied. Partition objetive: Most algorithms try to maximize the overall speedup,i.e. [55℄ [161℄ [71℄ [106℄ [107℄. [145℄ tries to minimize the overall ost. [62℄provided a apaity onstrained partition algorithm. [85℄ [109℄ are examplesof performane onstrained algorithms. [43℄ minimizes the synhronizationoverhead between HW and SW omponents. Partition strategy : [40℄ [136℄ [78℄ [144℄ provided mixed integer and integerprogramming algorithms. Many heuristi algorithms are published to partitiona big set of funtions to a omplex arhiteture, i.e. [119℄ [164℄ [55℄. [70℄ [71℄are simulated annealing approahes. [63℄ used heuristi based task lustering,alloation, and sheduling approah. [67℄ is a geneti heuristi algorithm. In[45℄ [107℄, sheduling is applied before partition while [69℄ applies partitionbefore sheduling.A ommon problem in many existing partition tool is that some importantparameters are rarely available, i.e. [33℄, or are heavily dependent on the designerexperiene on adequate number of referene implementations, i.e. [34℄. [49℄ providedan aÆnity-driven partition tool in whih a o-analysis step is performed to derive theharateristis for eah funtion. The harateristis help measure if the funtionis suited for GPP or ASIC. The problems are two folds: 1) the aÆnity metris arenot aurate enough; and 2) system funtionality needs to be realized as the inputto the tool whih is ontraditory to the prinipal of HW/SW o-design.The sheduling task an be deomposed to 3 subtasks: HW sheduling, in-strution sheduling and SW task sheduling. HW sheduling is usually stati. The sheduling algorithm an be integer linearprogram, i.e. [65℄ [78℄ [88℄ or list sheduling, i.e. [149℄. Another approah is10
alled fore-direted sheduling in whih operations are sheduled into timeslots, subjet to time-window onstraints indued by preedene and latenyonstraints, i.e. [140℄. Online sheduling suh as [118℄ is needed when somedelay is unknown and synhronization is neessary. Instrution sheduling in ompiler onsists of front end to optimize operationson an intermediate form and a bak end for ode generation. When onsideringGPPs, instrution seletion and register alloation are often ahieved by dy-nami programming algorithms whih also generate order of instrutions [30℄.When onsidering retargetable ompilers for appliation spei proessors,the bak end is more omplex that instrution seletion, register alloationand sheduling are tightly oupled with ode generation [80℄. Task sheduling in a general OS primarily addresses inreasing proessor uti-lization and reduing response time. Sheduling algorithms are often rootedon simple proedures suh as shortest job rst or round-robin [44℄. On theother hand, sheduling in RTOS primarily addresses the satisfation of timingonstraints. The algorithm an be stati or dynami in whih the feasibilitytest is at run time [124℄ [147℄.1.2.4 SynthesisThe partitioning and synthesis tasks are losely interrelated [130℄. This setionreviews some important work on synthesis other than partition.Systems today are often modeled in C/C++ in the beginning and then trans-lated by design engineers. Advantages in using the C/C++ language are numerous,i.e. large user base, modeling in a high level of abstration, possibly automati gen-eration of SW and testbenhes; fast exeution et. [37℄. However, C/C++ does notallow expliitly delare parallel exeution and arbitrary word size required by theatual HW [148℄. 11
To synthesize the HW using C/C++, two main techniques are deployed. The first one is called the superset approach, which uses C/C++ directly as the input to behavioral synthesis. Constructs and additional features are added to define parallelism, data types, hierarchy, timing, and communication. These coding techniques may not be familiar or convenient to potential users, and hence are undesirable [148]. A famous example of the superset approach is SpecC [77]. The other approach is called the subset approach, which translates a subset of C/C++ into HDL that can eventually be synthesized using the already available commercial tools [4]. SystemC [81] belongs to the subset approach.
1.2.5 Platform Based Design for SoC
The industry has begun to embrace new design and reuse methodologies that are collectively referred to as system-on-chip (SoC) design to reduce the development cost and cycle. Due to the complexity of HW and SW, their reuse is often key to commercial profitability [130]. Typically a single but configurable architecture needs to serve many different customers in one market segment [152]. For these reasons, the concept of platform based design was introduced, where new designs can be quickly created from an original platform to amortize costs over many design derivatives.
SoC is now a driver for the development and use of industry-wide standards. For example, the AMBA bus [14], the OCP bus interface [10], IP exchange formats and documentation, IP protection and tracking [24], and the IEEE 1500 standard for test wrappers [17] have been developed and standardized.
The SoC design emphasis is on reusable IP design, integration, and verification.
• Reusable IP design: Typically it may require five times as much work to generate a reusable IP block compared to a single-use block. The core creator should be responsible for delivering 1) the design-for-test (DFT) HW inside
the core; 2) the test patterns of the core; and 3) the validation of those test patterns. The validation work may take up to 50% of the total design time [110].
• Integration: The integration process involves connecting the IP blocks to the communication network, implementing DFT techniques and using methodologies to verify and validate the overall system-level design [146].
A typical SoC today consists of many cores operating at different clock frequencies [152]. Each core can be viewed as a separate synchronous island, and the interfaces between these islands are provided by the chip's global interconnect. This approach is commonly referred to as globally asynchronous, locally synchronous (GALS) design [155] [134]. Synthesizing designs from multi-timed, transactional descriptions is an important topic of ongoing research. [154] [56] have exploited pipelining to implement a general set of interfaces between timing domains without a fixed relationship. Optimized interfaces, e.g. [52] [129], can be implemented when special timing relationships are known.
• Verification of an SoC poses many more challenges than traditional board level verification. The final test is to be applied at the input/output pins of the SoC. However, the core may be embedded deep in the SoC and thus its I/O pins may not be directly accessible from the external pins [152].
While design sizes have grown exponentially over time in accordance with Moore's Law, theoretical verification complexity has been growing double-exponentially, because the number of states that must, in theory, be verified is exponential in the size of the design [8]. Most research activity, therefore, has focused on formal verification, which provides a 100% proof that a design meets its specifications [61] without using test vectors. It is already indispensable industrial practice for RTL-to-gate equivalence checking [103] [169] and
miroproessor veriation [42℄ [153℄.The key researh topis of formal veriation are ompositional model hekingwhih deomposes the veriation of an entire system omposed of severalbloks into multiple, smaller veriation tasks on the individual bloks [60℄;and assume-guarantee reasoning whih emphasizes veriation of ores underassumptions about the behavior of the rest of the system [142℄.Sine formal methods generally require a large memory spae, semi-formalanalysis is often hosen as a trade-o by mixing the formal tehniques with thesimulation-based approahes.. Typial example is assertion-based veriation[27℄ [28℄ [39℄.Testbenh automation in simulation based veriation approah is still impor-tant when the design size is very large. In many pratial ases, it is muheasier to reate large amount of test vetors than to desribe the ompletespeiation for the design in a formal model [169℄.IP-wise emulation using FPGAs has quite limited usage due to its high ostand debugging diÆulty. Therefor, major industry and aademia interest ishow to get easier debugging environment with lower investment.1.2.6 SW Toolkit GenerationThe trend toward smaller mask-level geometries leads to higher integration andhigher ost of fabriation, hene of amortizing HW design over large produtionvolumes. This suggests the idea of using SW as means of dierentiating produtsbased on the same HW platform [130℄.Rapid SW prototyping is often being overlooked in many existing HW/SWo-design methodologies. On one hand, it means that \slak" margins for HW,i.e CPU and memory resoures, should not be overly restritive to make the SWdiÆult design and implement [64℄. One the other hand, it implies that there is a14
ritial need for support of an integrated SW development environment inludingompilers, simulators and debuggers [90℄.The interest in the arhitetural desription language (ADL) based designmethodology for embedded SoC optimization and exploration has been tremendous.ADL is designed to speify arhitetural templates for SoC inluding omponents(proessors, memories et.), the interonnet, and the funtionality of eah om-ponent. The benets of using the ADL not only inlude the ability to performformal veriation and onsisteny heking, but also to automatially generate theSW toolkit from a single speiation. A typial SW toolkit inlude instrutionlevel parallelizing (ILP) ompilers, yle-aurate ISSes, assemblers/disassemblers,prolers, debuggers et.ADLs an be ategorized into 1) behavior entri ADLs, suh as nML [93℄and ISDL [86℄; 2) struture entri ADLs suh as MIMOLA [36℄ and COACH [31℄;and mixed level ADLs suh as LISA [181℄, EXPRESSION [89℄ and FlexWare [140℄.A promising approah to automati ompiler generation is the \retargetableompile" approah. A ompiler is lassied as retargetable if it an be adaptedto generate ode for dierent target proessors with signiant reuse of the om-piler soure ode. Reent approahes to retargetable ompilers have foused on de-veloping optimizations/transformations that are \retargetable" and apturing themahine spei information needed by suh optimizations in the ADL [90℄. Typi-al approahes to generate retargetable ompilers inlude 1) arhiteture-template-based, i.e. [105℄ [13℄; 2) expliit-behavioral-information-based, i.e.[91℄ [121℄; and 3)behavioral-information-extration/generation-based, i.e. [123℄ [89℄.Simulation of the proessor system an be performed at various abstrationlevels. ISS is the highest level of abstration. At lower-levels of abstration are theyle-aurate and phase-aurate simulation models [90℄. 
Retargetable simulators can be categorized into 1) interpretation based, e.g. [111] [87]; 2) compilation based, e.g. [181]; and 3) structure-centric ADL based. Interpretation based simulators are
easy to implement and maintain at the price of slow speed (2K - 20K instructions/s). Compilation based simulators translate each target instruction directly into one or more host instructions at compile time and thus can be 3 orders of magnitude faster than interpretation based ones [180].
1.2.7 Open Problems
Researchers are still working on the following problems [168].
• Define and redefine MoCs for jointly describing HW and SW systems.
• System-level performance analysis is a complex problem that analysts must study under a variety of operating conditions suitable for various application types.
• Evaluate algorithms for DSE.
• Analyze new classes of architectures; develop new methods for performance analysis and code generation with the aim of making VLIW-based architectures more usable.
• Evaluate the effort of networks on chips (NoCs).
• Make the co-design tool friendly for SW development.
1.3 Introduction to HW/SW Co-Simulation
Verification of system functionality and timing is one of the most important and difficult aspects of embedded system design. With the increasing complexity of modern embedded systems, formal methods have only limited usage in the verification of IP functionality. As a result, HW/SW co-simulation is the typical approach used for system level verification nowadays. Recently the research on HW/SW co-simulation
mainly fous on four topis: 1) mixed-simulation of omponents in dierent abstratlevels, 2) integration of multiple simulators implemented in dierent languages intothe same environment, 3) simulation speedup by reduing synhronization overheadbetween simoms, and 4) RTOS modeling.1.3.1 Dierent Simulation Abstrat in LevelFrom simulation abstrat in level perspetive, simulation an be roughly lassiedinto the following four ategories: Gate level simulation, also being alled event driven simulation or phase-aurate simulation, is the most aurate sine every ative signal is alulatedfor every devie during the lok yle as it propagates. Eah signal is simu-lated for its value and its time ourrene. It is exellent for timing analysisof HW iruit and verifying rae onditions. Typially suh simulators areimplemented on emulation HW, i.e. FPGA or a quikturn mahine [11℄. AllHW simoms are synthesized to the emulation HW. The simulation speed ofsuh HW based platform is typially 3 4 orders of magnitude slower than therhw speed, whih is aeptable. However, the number of suh multi-millionmahine is very limited. Transferring the input/output from/to the mahineand programming it to enhane the debugging apability typially require ad-ditional HW support and an be very diÆult. Therefore, suh simulators areintended for HW veriation at unit level but not for performane estimationat system level. Cyle-aurate simulation only alulates the state of the signals at lokedges and is often implemented by SW. Typial examples are RTL simulatorsoupled with ISSes. Compared to the gate level simulator base on emulationHW, this type of simulator is muh slower but osts muh less. Due to the SWimplementation nature, virtually any signal an be aptured and debugged. It17
can be used for complex design and system level testing.
• Transaction level simulation uses SW function calls to model the communication between simcoms in a system. For example, a transaction level model (TLM) would represent a burst read or write transaction using a single function call, with an object representing the burst request and another object representing the burst response [160]. It is possible to create TLMs that are fully cycle-accurate; however, in many cases TLM is applied to speed up simulation by forgoing full cycle accuracy [81].
• Dataflow simulation represents signals as streams of values without a notion of time. Functional blocks are linked by signals and executed when signals are present at the input. The scheduler in the simulator determines the order of block executions. It is a high level abstraction used only in the early design stage for checking the correctness of the algorithms.
From the SW perspective, its access to the HW can be simulated at one of the following three abstraction levels [175].
• ISS level, in which the simulation cannot be started until all the SW code is ready to be compiled and linked.
• Device driver level model, in which the target OS has been selected and the device driver functions access the abstract memory model when they are called by SW tasks or OS code.
• OS level model, in which the target OS is yet to be selected or implemented. The OS is modeled and only SW tasks are available.
In reality, many simulators need to support mixed-level HW and SW models for various needs, e.g. ISS for SW simulation + TLM for HW simulation. The mixed-level co-simulation problem is how to manage many different abstraction levels of
SW and HW models [175]. There are 12 combinations in total of HW and SW abstraction levels, and most existing solutions cover a subset of them. For example, in [176] the HW is TLM and the SW is the device driver level model. In [178], the HW is TLM and the SW is the OS level model.
Generally speaking, there are two types of approaches to solving the mixed-level co-simulation problem. The first one is to support simulation of all the abstraction levels of the HW interface, e.g. [156]. Its advantages are twofold: 1) a communication protocol simulated at a high level can be validated against the simulation results obtained at low levels; and 2) interconnecting simcoms with different abstraction levels of HW interface does not need additional adaptation. However, such simulators are difficult to implement and maintain. As a result, a more popular approach is to design wrappers that adapt different abstraction levels, e.g. [54] [177] [41] [170]. [135] aims at automatically generating simulation wrappers while minimizing the number of required library components.
Raising the abstraction level of simulation helps speed up simulation. However, the gain is typically lower than expected, because the component that dominates the simulation runtime changes as we raise the abstraction level of SW and HW simulation. For ISS SW simulation + cycle-accurate HW simulation, the HW simulation dominates the total runtime and the speed can be as low as 1K cycles/s. By raising the abstraction level of HW simulation to TLM level, SW simulation may dominate the runtime and the typical speed is about 100K cycles/s. A speed of more than 10M cycles/s can be obtained by raising the SW simulation to device driver or OS level and the HW simulation to TLM level [175].
1.3.2 Heterogeneous vs. Homogeneous Simulator
From the implementation language perspective, simulators can be classified as homogeneous and heterogeneous.
• A heterogeneous simulator provides interfaces between simcoms written in different languages so that they can be simulated together. For example, VHDL supports linking procedures written in C to the simulator through its foreign language kernel [22]. Verilog supports invocation of C functions through the so-called programming language interface (PLI) [38]. There are primarily two reasons that make the simulation speed of a heterogeneous simulator prohibitively slow for fast DSE in the specification phase. Firstly, a heterogeneous simulator often requires the simulation to be carried out at cycle-accurate level. Secondly, the communication between simcoms written in different languages is expensive. [82] presents a C+VHDL simulator for DSP design which requires about $1.5 \times 10^{6}$ host cycles to simulate a target cycle.
• A homogeneous simulator intends to use a single language for system specification and even implementation. For example, SystemC [6] provides a subset of C++ semantics and libraries combined with a simulation engine to support co-simulation from dataflow to cycle-accurate level. Some commercial tools are also available to synthesize simcoms specified in SystemC into VHDL simcoms [4]. Similarly, the N2C package from CoWare [32] adds the necessary clocking to the C language for system modeling. Compared to the heterogeneous simulator, the homogeneous one is more flexible in modeling simcoms at various abstraction levels and thus can achieve faster simulation speed. [98] presented a SystemC based simulator which models the application SW with a conditional dataflow graph and carries out the simulation at cycle-accurate level. It requires about $4.4 \times 10^{4}$ host cycles to simulate a target cycle. The transaction level simulator in [96] only requires about 2 host cycles to simulate a target cycle.
The interfacing issue still exists in the homogeneous environment when multiple simulators need to be integrated together even if they are written in the same language, e.g. ISS + SystemC.
Various approahes have been published20
to address this issue. [73] extends the SystemC kernel to integrate the ISS. Specifically, ports and processes for the ISS are added into the SystemC kernel and the scheduler is also modified. The drawback is that simcoms are not properly synchronized and thus timing fidelity cannot always be guaranteed. [102] also proposed an approach integrating an ISS, focusing on data type conversion. The abstract data types of the C++ environment are first mapped to a binary representation by a bitmapping layer. The resulting bit-streams are transferred to a protocol layer, cut into slices according to the respective data bus width and forwarded to the external simulator. The experiment shows that the simulator is cycle-accurate and achieves a simulation speed of $4.5 \times 10^{6}$ cycles/s. However, the HW being simulated is a single processor core, so the synchronization problem between simcoms does not exist.
The heterogeneous approach can naturally support reuse and integration of existing intellectual property (IP) blocks and thus is often applied in the implementation phase. The homogeneous approach is more applicable in the specification phase because it bridges the gap between HW and SW description languages [41] and makes co-simulation easier and more efficient. It is expected that these two approaches will coexist in the foreseeable future.
1.3.3 Synchronization Overhead between Simulation Components
The increased number of modules being integrated into a system implies that the simulator needs to make multiple simcoms progress concurrently and synchronize with each other. The conservative approach synchronizes simcoms at every clock step and thus the causal order of event processing will not be violated. Such an approach is only appropriate for simulation of a "busy" system at gate or cycle-accurate level, when HW activities at almost every clock cycle need to be captured. Otherwise, the simulation speed can be severely impacted by the unnecessary synchronization overhead.
From the algorithm perspective, research on synchronization overhead reduction can be classified into two categories: the optimistic approach and the optimized conservative approach. A rich body of literature on optimistic simulation algorithms has been published, among which Time Warp is the best known [74]. In Time Warp, a causality error is detected whenever an event message is received that contains a timestamp smaller than that of the process's clock. The event causing rollback is called a straggler. Recovery is accomplished by undoing the effects of all events that have been processed prematurely. Rolling back the state is accomplished by periodically saving the process's state and restoring an old state on rollback. Messages sent by the rolled-back computation are cancelled by sending corresponding anti-messages. Depending on when an anti-message is sent, the cancellation mechanism can be aggressive or lazy. In aggressive cancellation, whenever a process rolls back to time t, anti-messages are sent immediately for all positive messages sent after t. In lazy cancellation, processes do not immediately send an anti-message upon rollback. Instead, they wait to see if the re-execution of the computation regenerates the same message; if the same message is created, there is no need to send an anti-message. Depending on the application being simulated, lazy cancellation may improve or degrade performance. Critics argue that no proof yet exists that Time Warp is stable, although it is in most practical cases. [75] demonstrated that state saving overhead can seriously degrade the performance of many Time Warp programs, even if the state vector is only a few thousand bytes. Another problem associated with optimistic simulation algorithms is that they are hard to implement and thus usually not available in commercial tools.
The optimized conservative approach tries to reduce synchronization frequency by predicting the synchronization points.
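The difference between lock-step synchronization and synchronization at predicted points can be illustrated with a minimal sketch. It is not taken from any of the cited tools: the `Simcom` class, the two example components, and the assumption that every interaction time is known in advance are all hypothetical simplifications introduced here for illustration.

```python
class Simcom:
    """A simulation component with a list of predicted interaction times.

    The interaction times (in target clock cycles) stand in for the output of
    a prediction step; here they are simply given up front (an assumption).
    """

    def __init__(self, name, interaction_times):
        self.name = name
        self.pending = sorted(interaction_times)
        self.local_time = 0

    def next_interaction(self):
        """Earliest predicted time at which this simcom touches shared state."""
        return self.pending[0] if self.pending else None

    def run_until(self, t):
        """Advance the local clock to t without intermediate synchronization."""
        self.local_time = t
        while self.pending and self.pending[0] <= t:
            self.pending.pop(0)


def lockstep_sync_count(horizon):
    """Conservative lock-step: one synchronization per clock cycle."""
    return horizon


def predicted_sync_points(simcoms, horizon):
    """Synchronize only at the earliest predicted interaction of any simcom."""
    sync_points = []
    while True:
        upcoming = [s.next_interaction() for s in simcoms]
        upcoming = [t for t in upcoming if t is not None and t <= horizon]
        if not upcoming:
            break
        t = min(upcoming)
        for s in simcoms:
            s.run_until(t)  # no simcom runs past the next possible interaction
        sync_points.append(t)
    for s in simcoms:
        s.run_until(horizon)
    return sync_points


# Hypothetical example: two simcoms simulated for 1000 target cycles.
cpu = Simcom("cpu", [120, 480, 900])
dma = Simcom("dma", [480, 700])
HORIZON = 1000

points = predicted_sync_points([cpu, dma], HORIZON)
print("lock-step synchronizations:", lockstep_sync_count(HORIZON))
print("predicted synchronization points:", points)
```

In this toy setting causality is preserved because no simcom advances beyond the earliest time at which any simcom may interact with the others, while the number of synchronizations drops from one per cycle to one per predicted interaction. With imperfect prediction the same loop would trade accuracy for speed.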
The tehnique is also alled looka-head whih refers to the ability to predit what will happen, or equally important,what will not happen in the future, based on knowledge of the appliation and eventsthat have already been proessed [74℄. There are two main tehniques for synhro-22
nization predition. The rst is ompiler analysis of the SW being simulated [101℄[108℄, i.e. analyzing assembly ode and using dynami runtime algorithm to handlebranh and loops [58℄. The speedup an be 6x to 40x ompared to the onservativeapproah, depending on the appliation being simulated. The other one onsidersthe exeution semantis of the system being simulated and takes advantage of it[174℄ [159℄, of whih the most famous tehnique is alled virtual synhronization[170℄ [172℄ [113℄ [112℄. The basi idea of the virtual synhronization tehnique isto separate global time management of simulation from loal simulators utilizingalgorithm model. When a loal simulator produes output samples, the time dier-enes between output samples are reorded from the simulator. The atual globaltime of output samples is omputed by the simulation bakplane [114℄. The limita-tion is that stati osts are assumed for loal and shared memory aesses and nobloking is allowed for a simom. [114℄ ombines the trae-driven simulation withvirtual synhronization to solve the stati ost problem. It onsists of two partsrunning onurrently so that required memory spae is redued. In the rst part,it aptures the exeution traes from proessing omponents ignoring the globaltime management. In the seond part, it reonstruts the global time informationfor yle-aurate simulation behavior using trae-driven o-simulation. The simu-lation speed is in the range of 420K - 1500K yles/s. The main problem is thattrae-driven simulation does not have the dynami sheduling apability whih vi-olates the simulation delity. Moreover, deriving appliation trae is not a trivialtask. Some researh work tries to redue the ost per synhronization from theimplementation perspetive. [41℄ provided a SystemC based simulator in whih theforeign ISS is integrated into the same SystemC engine with other simoms so thatexpensive inter-proess ommuniation (IPC) is eliminated. 
Similarly, [174] eliminates IPC by integrating the VHDL based simcoms with the C based application SW using the VHDL-C interface. [101] [100] proposed the idea of dynamically altering the level of communication detail during co-simulation. The communication bandwidth is reduced at times when the details are not required. [113] [112] [173] indicated that the synchronization overhead is affected more by the number of messages exchanged than by the message size, and thus the overhead can be reduced by grouping messages. For example, all signals associated with the same event can be grouped, and a sequence of events can be grouped into a higher level message. Experiments show that the simulation speedup due to message grouping ranges from 30% to 70%.
1.3.4 RTOS Modeling
With the increasing complexity of embedded devices, it is common to see in a modern embedded system several SW programs running concurrently on a processor managed by an RTOS. Embedded SW typically has real-time constraints to satisfy in addition to functionality requirements. It is important to be able to validate all these properties together as early as possible in the design cycle, and in the context of running the embedded SW on top of an RTOS. Research on RTOS modeling and simulation can be categorized into the following three directions:
• System call translation, which redirects the system calls to the host OS. It is often used with an ISS to verify the SW functionality [150].
• Native OS, which compiles the target OS code and executes it on the simhw, e.g. VxSim [25]. It does not support DSE.
• Virtual OS, which simulates the functionality and timing of a real OS and is the most popular approach. Commercial tools include SoCOS [66] and CarbonKernel [3]. [79] presented an RTOS model based on SpecC, but the timing accuracy is not satisfactory. [131] presented an RTOS model on top of SystemC. Its focus is only on modeling the scheduling algorithm and the process
ommuniation mehanism. [29℄ [59℄ and [47℄ proposed the approah om-bining a SytemC based RTOS model with a bus funtional model for HW.Timing is ahieved by inserting delay annotations into the SW ode; however,none of them shows how to easily get the delay information. The arhiteturalassumption is a single-ore-only system and the synhronization preditionability is not available. [29℄ [59℄ did not say how to model the I/O behavior.[47℄ only models the bloking I/O for devie drivers. For [29℄ [59℄, both sim-ulation speeds are about 10x slower than the rhw speed with less than 10%error. [47℄ is 3 orders of magnitude faster than an ISS with less than 14%simulation error.1.4 HW/SW Co-Simulation Requirements1.4.1 Requirements of Generating an Initial SpeiationTo help generate an initial system speiation, the following requirements need tobe fullled by the simulator.1. RTOS modeling needs to be supported by the simulator. The appearaneof high performane DSPs has made it possible to implement more and morefuntionalities by SW. It is ommon to see in the modern embedded systemseveral SW programs running onurrently on a proessor managed by anRTOS. Embedded SW typially has real-time onstraints to satisfy in additionto funtionality requirements. It is important to validate these properties inthe ontext of running the SW on top of an RTOS [96℄. Traditionally, designersperform DSE by implementing the appliation SW to the target arhiteturewhih onsumes a lot of time and is often error-prone [176℄. RTOS modelingis a better tehnique that tries to model existing RTOSes within a singleenvironment and aptures the abstrated RTOS behaviors at the system level.25
The model needs to be configurable and easy to use so that different candidates can be evaluated with little or no effect on the application SW.
2. Speed is one of the most critical requirements to enable fast DSE. It firstly implies that the target HW, application SW and RTOS should be modeled at the transaction level. Cycle-accurate or gate level simulation is not only too slow, but may also not be available at this stage. Secondly, simulation synchronization overhead needs to be reduced. In general, the optimized conservative approach is considered to be more effective where the timestamps of the synchronization points can be approximately predicted with an acceptable sacrifice in simulation accuracy. Because both the simcom and SW transaction level models can be described in the same language, e.g. SystemC or C, a homogeneous simulation environment is preferred due to its lower synchronization cost.
3. Simulation accuracy can be traded off for fast DSE in this stage. The most important criterion is to tell, at a high level, which RTOS or what kind of HW architecture is the most appropriate for the target application, since these decisions have to be fixed. Exactly how much better they are compared to other choices is less important, since more accurate numbers can be obtained later using a simulator with higher accuracy when the specification is refined.
1.4.2 Requirements of Specification Refinement
Specification refinement imposes more challenging requirements on the simulator.
1. Simulation accuracy is important to find the most appropriate system architecture. It is typically required to carry out the simulation at the cycle-accurate level. Performance estimation not only needs to be conducted at unit level to tell how many cycles each functional block takes to execute on a rhw; more
importantly, system wide performance estimation is necessary to take into account the changes in the execution condition caused by races and different functional compositions [98].
2. Speed is still important not only for fast DSE, but also because it directly impacts the effort of debugging complex application SW [97].
3. HW/SW problem isolation is a practical concern since modern embedded systems have become complex enough that HW and SW engineers often do not have enough expertise in each other's domain. It is desirable that changes made in one domain should have little or no impact on the other domain.
4. SW development friendliness is another practical concern, motivated by the fact that the application nowadays often needs to be developed by many SW engineers in parallel. Before the rhw is available, a low cost simulation platform is required to be readily available for each SW engineer. Ideally the SW development environment offered by the simulator should be the same as that on the rhw.
Some of these requirements contradict each other. For example, cycle-accurate level simulation implies that the simulation speed is slow. Simulating both simcoms and application SW at a low level forces the engineers to understand both HW and SW details, which violates the requirement of HW/SW problem isolation. A simulator allowing integration of both cycle-accurate simcoms and application SW is typically very expensive and gives an unfamiliar look to SW engineers, and thus is not able to truly support HW/SW co-development. The conclusion is that it is not possible for a single simulator to fulfill all these requirements at the same time.
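The tension between accuracy and speed can be made concrete with the approximate throughput figures quoted in Section 1.3.1. The following minimal sketch estimates the host time needed to simulate a fixed interval of target execution at each abstraction level. The 100 MHz target clock, the one-second interval, and all identifiers are assumptions introduced here for illustration only; the last throughput figure is a lower bound ("more than 10M cycles/s").

```python
# Approximate simulation throughputs quoted in Section 1.3.1, expressed as
# target cycles simulated per second of host time.
THROUGHPUT_CYCLES_PER_S = {
    "ISS SW + cycle-accurate HW": 1_000,
    "ISS SW + TLM HW": 100_000,
    "driver/OS-level SW + TLM HW": 10_000_000,  # quoted as "more than 10M"
}

TARGET_CLOCK_HZ = 100_000_000  # assumed target clock (hypothetical)
TARGET_INTERVAL_S = 1.0        # assumed amount of target time to simulate


def wall_clock_seconds(throughput, clock_hz=TARGET_CLOCK_HZ,
                       interval_s=TARGET_INTERVAL_S):
    """Host time needed to simulate interval_s seconds of target execution."""
    target_cycles = clock_hz * interval_s
    return target_cycles / throughput


estimates = {
    name: wall_clock_seconds(tp)
    for name, tp in THROUGHPUT_CYCLES_PER_S.items()
}
for name, seconds in estimates.items():
    print(f"{name}: {seconds:,.0f} s of host time")
```

Under these assumptions, one second of target execution costs on the order of a day of host time at the cycle-accurate level but only about ten seconds at the highest abstraction level, which is why no single configuration serves both early DSE and detailed refinement.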















simhw1 simhw2Figure 1.2: fun, T , and simomAn RTOS modeling tool based on SystemC thread primitive (SC THREAD)and simulation engine [6℄ is provided to assist in the generation of an initial spei-ation. Any simom written by C/C++ an also be integrated. The OS model isongurable to support modeling and timed simulation of most existing embeddedRTOSes, i.e. Linux [128℄, DSP/BIOS [15℄ et. To model a spei RTOS, a useronly needs to: 1) set a few parameters whih will be used to ongure the RTOSstate mahine, 2) provide some OS timing information suh as sheduler exeu-tion delay, ontext swith delay et. so that delay annotation an be automatiallygenerated into the model, and 3) \plug in" neessary peripheral driver module(s).The OS timing information an be measured by experiment [167℄, or estimated bySW estimation tehniques [35℄ [101℄ [117℄ [120℄ if the OS soure ode is available.Appliation program timing an adopt a similar strategy and an be isolated fromOS timing so that hanges to either one will not aet the other unless the HWarhiteture is hanged. The optimized onservative approah is taken to reduethe synhronization overhead. A predition algorithm is presented to estimate thetimestamp of the next OS-wide event for any RTOS with a stati or dynami prior-ity driven sheduler. Compared to [58℄, the assumption of the predition algorithmis more realisti. Compared to [79℄, our tool ahieves higher simulation auray.Compared to [131℄ [29℄ [59℄ [47℄, the advantages of the proposed tool are many folds.29
For example, OS timing information is easy to obtain; simulation speed is faster; and the capability is more comprehensive in terms of handling simcom synchronization and modeling I/O behavior. Although [59] achieves higher accuracy (5% vs. 15%), our simulation speed is more than 10 times faster.
Two simulators are provided for specification refinement. The first one is a system dataflow simulator (SDFS), primarily used by the HW engineers. It models the application by a parameter-driven conditional dataflow graph (CDFG) in which each node is a func modeled at transaction level. The HW to be designed (H) is modeled by a configurable HW graph in which each node is a simcom modeling the corresponding rhw at cycle-accurate level. SDFS takes the application CDFG and the HW graph as input and carries out the simulation to catch the detailed HW activities, e.g. bus arbitration. HW engineers are only required to understand the application's CDFG. The parameter-driven feature gives the performance estimation flexibility, e.g. checking the impact of the SDRAM frequency is only a matter of changing one parameter. Modeling the application by the CDFG does not necessarily mean that the application can only be modeled at a high level with limited simulation precision. SDFS is designed to allow hierarchical modeling so that any func in the CDFG can be expanded into several finer-scale funcs, and thus more details can be incorporated to improve the simulation accuracy. SDFS is readily available because it is implemented in the SystemC language and can be executed on most PCs and workstations. Compared to trace-driven simulators such as [158] [53] [143], SDFS provides more convincing results with the capability of simulating resource competition scenarios and predicting performance using parameters obtained from legacy applications and HW. Compared to [163], which models the application by a DFG only, SDFS achieves higher accuracy by incorporating the application CFG.
Compared to [151], which proposed a simple parametric model of execution cost, SDFS has the capability to model pipelined architectures and instruction level parallelism. Compared to [51] [116], which proposed static abstract
models to obtain a lower/upper bound on the system performance, SDFS provides simulation results covering typical resource usage by the application, which is more appropriate information for driving the architecture design of multimedia systems.

The second one is a real-time simulation platform (RTSP) provided to SW engineers to truly enable HW/SW co-development and co-simulation. To the best of the author's knowledge, it is the only simulator offering the same SW development environment as if the rhw were available, while achieving accurate simulation results and fast simulation speed. It is implemented on legacy DSPs as the simhw and can be considered zero cost, since many legacy evaluation version module (EVM) boards can be reused for simulation purposes. To simulate the behavior of an rhw to be designed, the corresponding simcom is constructed to run on a legacy DSP with an appropriate share rate. Each legacy DSP has a novel two-level scheduler which makes each simcom progress properly, so that the simulation is carried out at a fixed proportion of the rhw speed. Any job that would finish at time t on the rhw finishes no later than the correspondingly scaled time plus Δ during simulation, where Δ is a bound. Such a feature eliminates expensive synchronization between simcoms while still keeping simulation fidelity. RTSP is implemented to allow an RTOS model to be "plugged in" for any simcom if necessary. Compared to [113] [172], which proposed the virtual synchronization technique under the assumption that the output of the simulation SW depends only on the event order and not on the arrival time of each event, RTSP does not need such a strong assumption, since the scheduler is designed to force each simcom to progress properly. [137] also tries to create a SW-development-friendly simulation platform. It requires two host computers. The application SW is simulated on the first one, and all the simcoms are constructed in SystemC and VHDL and simulated on the other one.
The ommuniation between the two hosts is soket-based so thatsimulating a register-read an take 3:7 ms. The synhronization problem betweenHW and SW is not addressed. Compared to [137℄, RTSP ahieves higher simulationauray by deploying the legay DSP as the simhw and handling synhronization31




2.1 IntrodutionWith the appearing of programmable high performane DSPs, more and more sys-tem funtionalities that used to be realizable by HW are implemented by SW beauseof its programmable exibility and lower development ost. On the other hand, withthe inreasing omplexity of embedded devies, it is ommon to see in the modernembedded system several SW programs running onurrently on a proessor man-aged by an RTOS. Embedded SW typially has real-time onstraints to satisfy inaddition to funtionality requirements. It is important to be able to validate allthese properties together as early as possible in the design yle, and in the ontextof running the embedded SW on top of an RTOS. Traditionally, designers performDSE by manually porting the appliation SW to the target arhiteture whih on-sumes a lot of time and is often error-prone [176℄. An alternative is RTOS modeling.RTOS modeling is a tehnique that tries to model existing RTOSes within a sin-gle environment and aptures the abstrated RTOS behaviors at the system level.Ideally, RTOS modeling should fulll the following requirements:1. The simulation of the model should be fast enough. It also implies that it33
should model the target RTOS at the transaction level. Cycle-accurate or gate level simulation is not only too slow for simulating a meaningful amount of work, but may also not be available at the early design stage.
2. The simulation result of the model should be reasonably accurate for DSE.
3. The model can be easily integrated with other simcoms under the same simulation framework.
4. The model is configurable and easy to use. Different candidate RTOSes can be evaluated with little or no change to the application programs.

An RTOS modeling tool is presented to assist in the generation of the initial specification [96]. It is implemented on top of SystemC [6] by employing its thread primitive (SC_THREAD) and simulation engine. The model is configurable to support modeling and timed simulation of most existing embedded RTOSes, e.g. Linux [128] and DSP/BIOS [15]. To model a specific RTOS, a user only needs to: 1) set a few parameters which will be used to configure the RTOS state machine, 2) provide some OS timing information, such as the scheduler execution delay and the context switch delay, so that delay annotations can be automatically generated into the model, and 3) "plug in" the necessary peripheral driver module(s).

We believe that timed OS simulation by delay annotation is an appropriate solution for DSE at the early design stage compared to the traditional approach of using an ISS, since the former is much faster and able to achieve reasonable timing fidelity. In [150], it was reported that most ISSes need 50-100 host instructions to simulate one target instruction. To make matters worse, most ISSes do not have RTOS support. Another drawback of an ISS is that it cannot provide accurate timing estimates for application programs at the early design stage, since in most cases the application programs have not been implemented or optimized at that time.
The OS timing information can be measured by experiment [167], or estimated by SW estimation techniques [35] [101] [117] [120] if the OS source code is available. In the presented model, a simple yet effective approach is demonstrated to derive the information from benchmark data provided by the OS vendor. The benchmark results are readily available and quite accurate. Application program timing can adopt a similar strategy and can be isolated from OS timing, so that changes to either one will not affect the other unless the HW architecture is changed.

One of the problems associated with delay annotation is that the simulation clock advances in macro steps. As a result, the OS being modeled may advance past a timestamp at which it is supposed to handle an input event (interrupt). Bounding the size of the macro clock step to a fixed small value and checking for input events more frequently [176] does not completely solve the problem. A more adaptive solution is needed. The need for adaptiveness in choosing the step size is particularly important for RTOS simulation, since one of its important goals is to identify the interrupt response latency. Without solving the step size problem, it is not possible to decide whether a delay is truly caused by the modeled OS or by a large clock step advance. To the best of our knowledge, this problem has not been addressed comprehensively so far. We solve the problem by splitting the clock step δ whenever t + δ > t_in, where t is the current simulation clock and t_in is the timestamp at which the next input event will happen; the task that advances the clock is only allowed to proceed to t_in and then blocks. By the time the input event is handled and the task is resumed, the remaining amount (t + δ) - t_in is added to the simulation clock in a similar manner.
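To make the clock-splitting rule concrete, the following is a minimal single-threaded C++ sketch, not the tool's actual implementation. It assumes that the predicted input-event times are known in advance, sorted in ascending order, and that every predicted event actually arrives. The names OsClock, intrTimes and handledAt are hypothetical and introduced only for illustration, and the blocking and resumption of the calling task is reduced to recording the time at which each input event is handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of an OS clock driven by delay annotations.
struct OsClock {
    std::uint64_t clock = 0;               // current simulated time t
    std::vector<std::uint64_t> intrTimes;  // predicted input-event times t_in (ascending)
    std::size_t next = 0;                  // index of the next pending input event
    std::vector<std::uint64_t> handledAt;  // times at which input events were handled

    // Advance the clock by delta without stepping past a pending input event.
    void inc_clock(std::uint64_t delta) {
        while (next < intrTimes.size() && clock + delta > intrTimes[next]) {
            const std::uint64_t tin = intrTimes[next];
            delta -= (tin - clock);      // remaining amount (t + delta) - t_in
            clock = tin;                 // proceed only up to t_in
            handledAt.push_back(clock);  // stand-in for blocking and handling the event
            ++next;
        }
        clock += delta;                  // no further event inside the step
    }
};
```

For example, with predicted input events at 30 and 45, calling inc_clock(20) followed by inc_clock(40) handles the two events at exactly 30 and 45 and leaves the clock at 60, instead of jumping from 20 to 60 in a single macro step.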
Details are given in setion 2.4.HW/SW o-simulation often suers from high synhronization overhead,whih gets more severe for o-simulation of interrupt-based systems where inter-rupt is used as the ommuniation protool between simoms [174℄. In our work,we adopt the optimized onservative approah where eah sender simom informsthe reeiver beforehand when it will send an event. The reeiver annot advane its35
lok over this timestamp. A lot of researh has been done to estimate the times-tamp of the next output event [101℄ [173℄. However, most of them are based onthe unrealisti assumption that a SW task exeutes on dediated HW whih is notappliable to the general ase of multiple tasks running onurrently on a proessor.[172℄ presented some results on priority-driven multi-tasking system. However, theirwork is subjet to two assumptions: 1) the task priority is stati; and 2) the shed-uler is alled only when the timer interrupt ours. In general, an RTOS performsre-sheduling whenever neessary and not only at timer interrupts. In our work, analgorithm is derived to estimate the timestamp of the next OS-wide output eventfor any RTOS with a stati or dynami priority driven sheduler.The rest of the hapter is organized as follows. Setion 2.2 gives a surveyto the related work. The overall OS modeling framework is desribed in setion2.3. Setion 2.4 explains how to derive the timing information and delay annotationinterfae. Setion 2.5 desribes the synhronization protool. Some experimentalresults are shown in setion 2.6. Setion 2.7 summarizes the hapter.2.2 Related WorkResearh on RTOS modeling and simulation an be ategorized into the followingthree diretions: System Call Translation, where any system alls from the appliation beingsimulated are re-direted into the host OS and exeuted. This approah isoften used by those ISSes without OS simulation ability to verify the fun-tionality of appliation SW [150℄ and provides limited timing information.Clearly it is not able to assist in RTOS seletion. Native OS, whih ompiles the target OS ode and exeutes it on the simhw.For example, WindRiver Systems provides VxSim as a simulation model for36
its RTOS [25]. It is also able to verify the functionality of application SW. One of its drawbacks is that it does not support modeling of the HW on which the target RTOS runs, and thus the timing of the target RTOS is not accurate. Another drawback is that modeling other RTOSes becomes impossible.
- Virtual OS, which simulates the functionality and timing of a real OS. Ideally the virtual OS should be configurable to model any existing RTOS. OS timing can either be achieved by delay annotations inserted into the virtual OS source code, or calculated by an aggregate timing model. For instance, the scheduling delay can be calculated as a function of the number of ready tasks. Commercial tools such as SoCOS [66] and CarbonKernel [3] are available. The main problem of these tools is that they are not easy to use. To model a specific OS, a user often needs to manually "personalize" the virtual OS by inserting appropriate delay annotations. Gerstlauer in [79] presented an RTOS model based on SpecC. The primitives in SpecC are used so that the implementation is simple. The application code can be synthesized and compiled to run on the target RTOS by replacing system calls to the OS model with those to the target RTOS. However, the timing accuracy is not satisfactory. [131] presented an RTOS model on top of SystemC. Its focus is only on modeling the scheduling algorithm and the process communication mechanism. [29] [59] and [47] proposed the approach of combining a SystemC based RTOS model with a bus functional model for HW. Timing is achieved by inserting delay annotations into the SW code; however, none of them shows how to easily get the delay information. The synchronization prediction ability is not available, and thus all of them can only simulate a processor-only system. [29] [59] did not say how to model the I/O behavior. [47] only models the blocking I/O for device drivers. For [29] [59], both simulation speeds are about 10x slower than the rhw speed with less than 10% error.
[47] is 3 orders of magnitude faster than an ISS with
less than 14% simulation error.

Yoo in [176] proposed the idea of automatic generation of an OS model. The designer can choose a set of OS services from a library. The service code is compiled and linked with a micro kernel to generate an RTOS that is executable on the target processor. Delay annotations are automatically inserted in the target OS source code. The main problem associated with this approach is that it can only generate an RTOS from the library written by the author, and is not able to model and simulate any other commercial RTOS.

Our approach is similar to the virtual OS concept. Compared to [66] [3], our tool is easier to use since delay annotations can be inserted at the appropriate places automatically. Compared to [131], our tool requires few changes to the application simulation code other than the delay annotation insertion. To model a specific RTOS, the user only needs to set a few parameters for OS state machine generation, and provide OS timing information which is often readily available from the OS and IC vendors. In [131] the application code has to explicitly wait for messages from the OS model, which makes the simulation code significantly different from the real code. Compared to [79], our model is also easy to implement by using the SystemC simulation engine and its thread primitives. The timing information derived from benchmark results is accurate enough to assist in the selection of an RTOS for generation of an initial specification. Compared to [131] [29] [59] [47], the advantages of the proposed tool are manifold. For example, OS timing information is easy to obtain; simulation speed is faster; and the capability is more comprehensive in terms of handling simcom synchronization and modeling I/O behavior. Although [59] achieves higher accuracy (5% vs. 15%), our simulation speed is more than 10 times faster.























Figure 2.1: Basic RTOS State Machine

A generic RTOS state machine, shown in Fig. 2.1, is implemented on top of SystemC. The state machine can be configured to model different RTOSes. There are seven states in the RTOS state machine, as explained below.
1. power-off: the processor has not started running. Before powering on an RTOS, the user needs to connect the necessary peripherals, such as the timer, to the OS, and also specify a set of parameters to configure the specific state machine for the RTOS being modeled.
2. curr_run_in_user_mode: the processor is executing a task in user mode. In case the processor does not differentiate between kernel and user space, the state simply means that a task is executing application code before making a system call into the kernel.
3. curr_run_in_kernel_mode: the processor is executing a task in kernel mode.
4. sync_for_in_intr: the RTOS blocks to wait for an external event (interrupt). This state does not exist when the RTOS executes on the rhw; it exists only for simulation synchronization purposes. For optimized conservative simulation, a simcom needs to enter this state when it reaches a clock value at which an external event may arrive.
5. handle_intr: the RTOS is handling an input event.
6. schedule: the RTOS scheduler is selecting a ready task.
7. idle: there is no ready task in the RTOS.

Transitions (3) and (8) represent the case where the OS was waiting to receive an event (interrupt) but it did not actually arrive. Note that in practice it is hard to predict the exact timestamp of the next event because of the dynamic behavior of SW. In our model, an event sender tries to predict, and notify the receiver of, the earliest possible timestamp at which it will issue an event. When the receiver reaches that time, it waits for the event. If it is sure that the event will not occur because the previous prediction was too conservative, it exits from the synchronization state and continues. The details of the synchronization protocol will be explained in section 2.5.

To model a specific RTOS, a set of parameters needs to be set to configure the state machine. The main parameters are: 1) scheduling policy, 2) kernel preemption points (when to make a re-scheduling decision), 3) whether or not to support threads, 4) whether or not to support interrupt threads, 5) IPC mechanism, and 6) whether or not to have memory protection, etc.

For example, embedded Linux implements threads in the kernel and uses threads instead of processes as scheduling entities. It distinguishes three classes of threads: real-time FIFO threads, which have the highest priority and are not preemptable, real-time round robin threads, which have the same priority as real-time
FIFO threads but are preemptable, and timesharing threads, which have the lowest priority and are scheduled by a priority aging policy [165]. Interrupt threads are not supported. An interrupt is handled by jumping to the interrupt service routine (ISR). A re-scheduling decision is not made after returning from an ISR unless it is a timer interrupt. Other cases where re-scheduling happens are: the current thread blocks; and the current thread forks a child thread, which blocks the parent thread and then starts. Linux also supports a rich set of IPC mechanisms such as signals, System V IPC, and Unix domain sockets.

Take TI DSP/BIOS as another example [15]. It is a single address space OS without virtual memory protection. Processes and threads are not differentiated. It has three classes of threads: HW interrupt threads (ISTs), which have the highest priority and are not preemptable, SW ISTs, which have lower priority than HW ISTs and are preemptable, and regular threads, which have the lowest priority and are also preemptable. The difference between SW ISTs and regular threads is that SW ISTs run on a common stack while each regular thread has its own stack. The scheduling policy is priority driven without aging. Since ISTs are supported, an ISR only does a minimum amount of work, and then resumes the corresponding HW/SW IST to do the rest of the work. After returning from an ISR, a re-scheduling decision is always made so that an IST can be started immediately. A re-scheduling decision is also made when any thread with a higher priority than the current one is resumed. For instance, if the current thread releases a lock, which resumes a thread with a higher priority, a re-scheduling decision is made. The IPCs supported by DSP/BIOS are memory sharing, pipe and mailbox.

2.3.2 OS Sub-Module Design

We adopted the object oriented approach to design the following components in the OS modeling tool: 1) task management, 2) OS kernel, 3) IO module, 4) IPC, 5) timing, and 6) event handling and synchronization protocol.
5) and 6) will be
explained in setion 2.5.Task ManagementThe task management interfae onsists of standard routines for task reation, i.e. task reate(), task termination, i.e. task exit() and task kill(), task suspension, i.e. task yield() and task sleep(), task ativation, i.e. task resume(). task priority hange, i.e. task set prio().Eah task is implemented as a SystemC thread (SC THREAD) with a s eventobjet so that ontext swithing an be easily implemented by alling its s eventmethods: wait() and notify(). To work around the limitation that the SystemCsimulation engine does not support dynami thread reation, a stati thread pool isreated before the OS is powered on. These threads blok on their own s event atthe beginning. When a task reate() all is made, a thread objet is retrieved fromthe pool and a funtion pointer to the atual thread funtion is saved in the taskobjet. By the time the thread is sheduled to exeute, it unbloks from its s eventand alls the thread funtion to start.If the OS being modeled has memory protetion mehanism, only threadsbelonging to the same proess an share the resoure. Otherwise, threads andproesses are not dierentiated.OS KernelThe kernel mainly onsists of the task sheduler and synhronization objets. Wehave implemented a exible priority driven sheduler whih supports a mix of pre-emptive/nonpreemptive sheduling poliies with/without priority aging. When a42
task is created, three fields in the thread object are specified: priority, aging, and preemptable. The scheduler schedules the tasks based on these fields. For example, to create a DSP/BIOS HW IST, a user can set priority = 0, which is the highest priority, aging = no and preemptable = no. Associated with each priority is a task queue. Scheduling of tasks with the same priority can be either FIFO or Round Robin, depending on the configuration specified by the user.

Right now the only synchronization object supported by our tool is the semaphore. Other synchronization objects such as mutexes and locks can be derived from the semaphore. A semaphore has the following interface to the user:
- sem_init(int count), which creates a semaphore with the specified count.
- sem_inc(), which increases the count. A blocked thread may be resumed if count > 0. Depending on the configuration specified, the order in which threads are resumed can be FIFO or priority based.
- sem_dec(), which decreases the count. The calling thread may be blocked if count ≤ 0.
- sem_kill(), which destroys the semaphore object.

Another interface in the kernel module is the power_on() method, which initializes and starts the OS.

IO Module

Modeling IO is a difficult problem because different devices can work in various ways, and each OS has its own IO structure and handles IO in its own way. Modeling individual IO devices is not our concern. Instead, our focus is on creating a flexible IO framework so that (1) device models and drivers can be easily plugged in without major modifications, and (2) different IO structures and IO handling mechanisms from different OSes can be easily modeled and simulated in our framework.
Figure 2.2: IO Module Layer (block labels: OS service, IO in kernel, kernel, customized IO modelling)

The overall IO module has two layers, as shown in Fig. 2.2. The black block represents the models and drivers of each individual device. The grey block residing in the kernel is the IO interface between applications and device drivers. A driver can request an OS service, e.g. blocking the calling thread, via the IO layer in the kernel, or request the service directly. A typical example of the former case is DSP/BIOS, whose IO manager (IOM) resides in the kernel and manages the state of all tasks requesting IO. Linux is an example of the latter case, where drivers manage the state of the calling thread by themselves.

The IO interface to application programs consists of io_open, io_close, io_read, io_write, and io_ctrl.

Each device in the system has a DevStruct struct, as shown in Fig. 2.3. It is shared by both layers. wQueue and rQueue (lines 5-6) are queues holding the pending write/read requests. drvBuf (line 8) is the buffer allocated by the kernel to the device. drvStruct (line 10) is a pointer to driver-specific information. kernelCallback (line 13) is a function callable by the driver to inform the kernel that an IO request has been finished. It is used by DSP/BIOS. Lines 17-24 are pointers to the corresponding driver functions, to which the IO layer in the kernel passes all the requests from applications. The interface between the driver and the kernel is clean but flexible. To model a specific device, the designer only needs to implement these functions at the desired abstraction level. HWIThread and SWIThread are pointers to the HW and SW IST respectively, if they are used.
1 strut DevStrut f2 har name[20℄;3 int mode; // devie open mode4 int state; // state of the devie5 Queue *wQueue; // queue of pending write requests6 Queue *rQueue; // queue of pending read requests7 Sem *sem; // semaphore of the devie8 har *drvBuf; // driver buer and length9 int drvBufLen;10 void *drvStrut; // ustomized driver struture11 OS *os; // pointer to os12 PE *devie; // pointer to devie ommuniate with13 FunPtr kernelCallbak; // driver to kernel allbak14 int intrId; // interrupt line used15 Interrupt Data *intrData; // interrupt data16 /* pointers to driver funtions */17 int (*init)( strut DevStrut *dev );18 int (*open)( DevStrut *dev );19 int (*lose)( DevStrut *dev );20 void (*write) ( DevStrut *dev, IoStrut *ios );21 void (*read)( DevStrut *dev, IoStrut *ios );22 void (*io trl)( DevStrut *dev, IoStrut *ios);23 void (*destroy)( DevStrut *dev );24 void (*isr)( DevStrut *dev );25 Task* HWIThread; // HW interrupt thread26 Task* SWIThread; // SW interrupt thread27 g Figure 2.3: Devie Strut
Dierent drivers often have dierent request formats. Simulating the exatrequest format for eah devie is not only unneessary, but also inonvenient forOS modeling. We provide a unied request format IoStrut as shown in Fig. 2.4.It is broad enough to fulll most drivers' requirement. When sending a request toa spei devie, the neessary information from IoStrut is retrieve by the driver.When the request is ompleted, the return information from the driver is plaed inIoStrut.Line 4 - 9 is the buer information for the request. For an OS with virtualmemory system, an IO request is typially assoiated with both the buer in userspae and kernel spae. md (line 11) is the ommand of the request. ommID (line12) is used to by multiple simoms sharing the same ommuniation link to identifyeah other. For instane, if two simoms ommuniate to eah other via Ethernet,it an be used to save the IP address.Our IO module supports three types of IO operation mode: 1) synhronousbloking, 2) synhronous non-bloking, and 3) asynhronous. The rst two modesare ommon and supported by most OSes. The third one is speial yet proved tobe eÆient both by DSP/BIOS and Linux 2.6. In mode 3) when an appliationthread issues an IO request, it also provides an appliation allbak pointer (line 13in Fig. 2.4). If the request annot be ompleted immediately, it will be queued tothe driver and the appliation thread an ontinue doing other work. By the timethe request is ompleted, the driver thread, i.e. HW IST, alls bak to kernel whihin turn alls the appliation allbak to notify that the request is ompleted. Insidethe appliation allbak, a new request an be issued.IPCA rih set of IPCs exist in today's OSes. Modeling eah IPC type in its exat form isnot only tedious, but also unneessary. Instead, we abstrat them into the followingthree lasses whih are able to over most IPCs.46
1 strut DevStrut f2 Task *task; // pointer to alling thread3 DevStrut *dev; // pointer to the devie4 har *uBuf; // user buer and length5 int uBufLen;6 int uDataLen; // data length in user buf7 har* kBuf; // kernel buf and length8 int kBufLen;9 int kDataLen; // data length in kernel buf10 int ioMode;11 int md; // request ommand12 ID ommID; // ommuniation id13 FunPtr appCallbak; // kernel to app. allbak14 g Figure 2.4: IoStrut Memory sharing whih is the most widely used IPC form in embedded systemsine most embedded systems have limited memory and a single address spae.Memory sharing is easy to implement and also saves the expensive memoryresoure. Message opying whih is also a ommonly used IPC. A message queue pro-teted by a semaphore is reated. Signal whih is used by Linux based embedded systems to quikly inform aproess to take the desired ation. The default ations supported by us arePROCESS KILL, and PROCESS BLOCK.2.4 Simulation TimingInterfae in lok( Æ ) is provided to allow the OS lok advane Æ units in disretesteps. Fig. 2.5 shows how the simulation lok is inremented without going beyondthe timestamp when the next interrupt will our. If Æ is too large, the lok is onlyadvaned to intrArrTime (line 4). The OS will handle the interrupt if it atually47
1 os.in lok( Æ ) f2 while( lok+Æ > intrArrTime ) f3 Æ := Æ - (intrArrTime-lok);4 lok := intrArrTime;5 if( intr arrive() )6 handle intr(); // thread may be preempted7 g8 lok = lok + Æ;9 g Figure 2.5: Clok Advaneshows up. A re-sheduling deision may be made after handling the interrupt, whihmay preempt the thread alling in lok(). When the thread is resumed later, theremaining amount of Æ will be aumulated to the OS lok similarly.Delay annotations are inserted in appliation and OS ode to all in lok().Our approah has two advantages ompared to the work in [176℄ [79℄. Firstly,simulation an aurately measure interrupt response lateny aused by the OS sineinterrupts are handled at the exat time when they are supposed to be. Seondly,the number of delay annotations an be redued to alleviate user's work. In [176℄[79℄ sine the timing auray is diretly aeted by the lok inrease steps anddelay annotation frequeny, user often needs to manually insert delay annotationsin the soure ode as muh as possible to ahieve the desired auray.The timing information an be obtained in various ways. A ommon strategyis to perform ompiler analysis on the soure ode. [101℄ [117℄ ompile the soureode to java byte ode, and an estimation is made from the java byte ode based onthe spei features of the target proessor. [35℄ [120℄ diretly ompile the soureprogram to the target instrutions to do the timing analysis. Suh tehniques areappliable to appliation timing estimation but not to OS modeling sine the OSsoure ode is often not available. Wang in [167℄ proposed an experimental methodto measure the OS timing information, i.e. ontext swith overhead, timer jitter,sheduling ost et. Suh an approah is not appliable unless both the OS binary48
and the target HW are available. We propose to derive the OS timing information from benchmark results. The calculation is simple, and the benchmarks are readily available from OS vendors and/or research publications. The timing accuracy is determined by the benchmark data. Since most OS kernels are expected to execute in on-chip memory of embedded processors and to interface directly with the HW, their benchmarks, compared to those of application programs, can be trusted.

Take DSP/BIOS as an example; TI has measured the following timing benchmarks on its DSP processors [16]:
- HW interrupt, which includes the interrupt latency, interrupt enable/disable costs, interrupt prolog/epilog costs, etc.
- SW interrupt, which includes SW interrupt enable/disable costs, and the cost of resuming a SW IST.
- Task, which includes all kinds of task management costs, e.g. creating a task with/without context switching.
- Sem and Lock, which includes the costs of operations on semaphores and locks.
- Memory, which includes the costs of dynamic memory allocation and de-allocation.
- Pipe and Mailbox, which are the costs of IPC operations.
- Queue, which are the costs of all kinds of queue operations.

Other information can be derived from the benchmark results. For instance, suppose it has been measured on the TMS320C64x DSP (on which the I-cache is disabled and the kernel executes in on-chip memory) that creating a task with and without a context switch takes 840 and 744 cycles, respectively; then the context switch cost can be calculated as 840 - 744 = 96 cycles. For another instance, it has also

























Figure 2.6: Synhronization Overheadan interrupt to T2 at lok = 35, and T2 would send an interrupt to T1 at lok = 40.Due to the dynami behavior of SW programs, we assume that we an only preditthat T1 would send the interrupt at either lok = 25 or 35, and T2 would send theinterrupt at either lok = 30 or 40.Now also assuming both simoms are shared by other tasks, and both T1 andT2 are preempted at 20 and resumed at 60, the problem is illustrated in Fig. 2.6.1. T1 is preempted at 20 on simom1. The OS on simom1 stops to wait at 30sine T2 on simom2 may send an interrupt to it at that time.2. T2 exeutes and is also preempted at 20. The OS on simom2 has to stop at35 beause when T1 will be resumed is unknown, and a safe way is to assumeT1 may resume immediately at 30 on simom1 whih makes it possibly sendan interrupt at 35 after exeuting for another 5 yles.3. simom1 exeutes other tasks until lok = 45 due to the same reason asexplained in step 2.4. simom2 advanes until lok = 50 due to the same reason as step 2.51
5. simcom1 advances to 60, for the same reason as in step 2. When it reaches time 60 and T1 is resumed, it becomes known that the nearest time at which T1 may send an interrupt is 65.
6. simcom2 advances to 60 and resumes T2. Then T2 executes until reaching 65. It becomes known that the nearest time at which T2 may send an interrupt is 70.
7. T1 runs to 65 and it becomes known that T1 is taking the longer path and will send an interrupt at 75. simcom1 continues executing T1 until reaching 70.
8. Similarly, simcom2 executes T2 until 75. It becomes known that T2 will send the interrupt at 80.
9. T1 runs to 75 and sends the interrupt. It continues until reaching 80.
10. T2 receives the interrupt at 75. Then it advances to 80 and sends its interrupt to T1.

As we can see, the synchronization operations during the interval [20, 60] are due to the fact that it is unknown how long T1 and T2 will be preempted. If such preemption intervals are also predictable, those synchronization actions are not necessary.

 1 SimTime nextOutEvtTime := MAX_SIMTIME;
 2 SimTime preemptInterval := 0;
 3 SimTime outEvtTime;
 4 for( i = 0; i < MAX_TASK_PRIORITY; i++ ) {
 5   Task tsk := prioTaskQueue[i];
 6   if( tsk == NULL )
 7     continue;
 8   outEvtTime := clock + preemptInterval + tsk.nextEvt;
 9   preemptInterval := preemptInterval + tsk.nextPause;
10   if( outEvtTime < nextOutEvtTime )
11     nextOutEvtTime := outEvtTime;
12 }
Figure 2.7: OS-Wide Event Time Prediction
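The loop of Fig. 2.7 can be written as compilable C++ along the following lines. This is a sketch under the same restriction as the figure (a preemptive priority scheduler with at most one task per priority level, index 0 being the highest priority). The type PrioSlot and the function predict_next_out_event are hypothetical names, and the predictions are assumed to be small enough that the additions do not overflow.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

using SimTime = std::uint64_t;

// One priority level: either empty or holding a single ready task's predictions.
struct PrioSlot {
    bool    ready;      // false if no task exists at this priority level
    SimTime nextEvt;    // relative time to the task's next output event
    SimTime nextPause;  // relative time to the task's next possibly blocking call
};

// Returns the earliest absolute time at which any task may emit an output event,
// or the maximum SimTime value if no task is ready.
SimTime predict_next_out_event(SimTime clock, const std::vector<PrioSlot>& byPrio) {
    SimTime nextOutEvtTime = std::numeric_limits<SimTime>::max();
    SimTime preemptInterval = 0;  // time consumed by all higher-priority tasks so far
    for (const PrioSlot& slot : byPrio) {
        if (!slot.ready) continue;
        // This task cannot run before the higher-priority tasks have paused.
        const SimTime outEvtTime = clock + preemptInterval + slot.nextEvt;
        preemptInterval += slot.nextPause;
        if (outEvtTime < nextOutEvtTime) nextOutEvtTime = outEvtTime;
    }
    return nextOutEvtTime;
}
```

For instance, at clock = 100, a priority-0 task with nextEvt = 50 and nextPause = 10 together with a priority-2 task with nextEvt = 5 and nextPause = 20 yields 115: the lower priority task may send its event at 100 + 10 + 5, which is earlier than the 150 predicted for the higher priority task.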
We have designed an algorithm to predict the timestamp of the next OS-wide event for the more complicated but realistic case where multiple tasks execute concurrently on a processor, managed by an RTOS. We assume that the RTOS scheduling policy is priority driven, which is true for most RTOSes. Tasks can be preemptive or non-preemptive. Application programs are required to report two time predictions to the kernel: nextEvt, which is the nearest time when the program will issue a system call that causes an output event, and nextPause, which is the nearest time when it will issue a system call that may make it block, e.g. a blocking I/O call or a task_sleep() call. Those predictions are made by analyzing the application program, assuming it uses the processor exclusively. The predicted results are relative values, i.e. if the clock is t when the prediction is made, and nextEvt = δ, then the nearest time when the application sends out an event is t + δ.

We provide an interface reg_prediction(nextEvt, nextPause) to allow predictions to be inserted dynamically in application programs. The user needs to identify all the paths leading to system calls which either send out events or block the caller. In case the system call is a blocking I/O call, nextEvt will be the same as nextPause. With the help of an advanced research compiler that can perform the analysis automatically [2], we believe that the speedup gain in the simulation is worth the amount of additional work.

Owing to the page limitation, we only show the prediction algorithm in Fig. 2.7 for the case where the OS scheduling policy is a preemptive priority scheduler and each priority level has at most one task. Other cases can be derived similarly.

For a task with priority i, preemptInterval is the interval during which it may be preempted by higher priority tasks, each of which will execute for at least its predicted nextPause amount of time (line 9). Thus, the earliest time that it can send out an event is clock + preemptInterval + tsk.nextEvt (line 8).
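The loop of Fig. 2.7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tool's actual implementation: the names `Task` and `predict_next_out_event` are illustrative, `float("inf")` stands in for MAX_SIMTIME, and index 0 of the queue is assumed to hold the highest priority task.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

MAX_SIMTIME = float("inf")  # assumed stand-in for the MAX_SIMTIME sentinel


@dataclass
class Task:
    next_evt: int    # relative time until the task may emit an output event
    next_pause: int  # relative time until the task may block


def predict_next_out_event(clock: int,
                           prio_task_queue: Sequence[Optional[Task]]):
    """Return the earliest absolute time at which any task may emit an event.

    prio_task_queue[i] is the single task at priority level i, or None if
    the level is empty. Index 0 is assumed to be the highest priority.
    """
    next_out_evt_time = MAX_SIMTIME
    preempt_interval = 0
    for tsk in prio_task_queue:
        if tsk is None:
            continue
        # Earliest output event of this task, delayed by all higher
        # priority tasks running until their predicted pause (line 8).
        out_evt_time = clock + preempt_interval + tsk.next_evt
        # This task in turn delays every lower priority task (line 9).
        preempt_interval += tsk.next_pause
        next_out_evt_time = min(next_out_evt_time, out_evt_time)
    return next_out_evt_time
```

For example, with the clock at 100, a high priority task reporting (nextEvt = 30, nextPause = 10) and a low priority task reporting (nextEvt = 5, nextPause = 50), the sketch returns 115: the low priority task can only run after the high priority one has paused, yet it still emits the earliest event.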
The timestamp of the next OS-wide event is the smallest among them. Suppose it is calculated that the next event will be sent out by a task T with priority i; the prediction needs to






























Figure 2.9: H.263 Decoder System

                 Worst Frame    Best Frame    Average
    Sync. read   39.27 ms       6.64 ms       31.37 ms
    Async. read  30.21 ms       4.78 ms       23.07 ms

Table 2.1: Simulation Result of H.263 Decoder

2.6 Experiment

We used an H.263 baseline decoder application from TI [7] to test the effectiveness of our tool. This application runs on a DM642 [21] EVM board which mainly consists of a 600MHz TMS320C64 DSP and 32 MB external SDRAM. The TMS320C64 DSP has a 128-Kbit L1 instruction cache, a 128-Kbit L1 data cache and a 2-Mbit internal memory. The system block diagram is shown in Fig. 2.9. A 4CIF format input stream is stored in the external SDRAM. After a frame is decoded, it is placed in the frame buffer, which will be transferred to a monitor for display through a video port. The RTOS is TI DSP/BIOS. The data placed in internal memory are 1) all VLD tables, 2) zigzag index table, 3) TCOEF length table, 4) reconstructed MB buffer, 5) reference MB buffer, and 6) IDCT buffer. All application and OS code are placed in external SDRAM initially but reside in the instruction cache during runtime. All data block transfers and instruction cache fills use DMA. Excluding the H.263 application code, the implementation of the RTOS model and the other simcom models for the related rhw, e.g. the DMA, consists of 4200+ lines of C++ code.
For each frame we measured the time interval from the point it starts to be decoded to the point when the decoded frame is copied into the frame buffer. The DSP processor transfers the decoded frames to the frame buffer via DMA in asynchronous mode as explained in section 2.3.2. We tested two cases in which the DSP reads the input stream in either synchronous blocking mode or asynchronous mode. The results are shown in Table 2.1.

As expected, the simulation results show that the asynchronous input read mode is much more efficient than the synchronous blocking mode since I/O actions proceed in parallel with the CPU operations. For the same application, TI has reported that the DM642 is able to decode an H.263 baseline 4CIF stream at about 20 ms/frame on average. Our simulation results are a little inferior but close to the TI benchmark. The reasons are threefold: 1) inaccuracy of the delay annotation model, 2) inaccuracy caused by the behavior model of the DMA and cache, and 3) the data cache is disabled.

The simulation speed of our model is much faster than an ISS. On our workstation with an Intel Celeron 1.33GHz CPU and 256 MBytes RAM, the simulation of $2 \times 10^9$ DSP cycles takes 7.1 seconds, which is about $2.817 \times 10^8$ cycles/s. In contrast, TI CCS simulates $7.630 \times 10^4$ cycles/s on the same machine. The speedup is more than 3 orders of magnitude in this example.

2.7 Conclusion

An RTOS modeling tool built on top of SystemC [6] is presented to assist in the generation of an initial specification. It is configurable to support modeling and timed simulation of most popular embedded RTOSes. Timing is achieved by delay annotations. The OS timing information can be derived from benchmark data provided by the OS vendor or published research results. Since OS code directly interfaces with the HW and typically executes in on-chip memory, the benchmark data can




3.1 Introduction

As industry moves from the general DSP era to the application-specific DSP era, new embedded applications often adopt many similar functions from the older ones, but in different combinations and in system architectures that impose more challenging timeliness requirements. In particular, the current design trend is to keep the DSP core unchanged but to re-design the on-chip bus structure and the peripherals, increase the HW frequency and/or add more HW accelerators for the new application. Before the design process starts, it is critical to make sure that the design specification "matches" the application requirements. As stated in section 1.4.2, a single simulator cannot meet all the requirements of specification refinement. The System Dataflow Simulator (SDFS) presented here is intended primarily to be used by HW engineers to help generate an accurate HW specification for the targeted application. It models the application by a parameter-driven conditional dataflow graph (CDFG) in which each node is a function block (func) modeled at transaction level. The HW to be designed (H) is modeled by a configurable HW graph in which each node is a simcom modeling the corresponding
rhw at cycle-accurate level. HW engineers are required to understand the application CDFG and the HW graph in more detail to carry out the simulation. The following are the properties of SDFS.

• System Level Performance Estimation
  Assuming that the DSP core and compiler remain unchanged, the functions adopted from the old applications consume the same number of cycle(s) per datum under the same execution condition. SDFS focuses on performance estimation at the system level, taking into account the changes in the execution condition caused by races and different functional compositions.

• Parameter-Driven
  The execution cost of each func in the application CDFG is modeled by user-configurable parameters, which allows highly flexible performance estimation. For instance, to see the impact of the burst-read size on system performance, a user only needs to change one parameter and check the simulation result.

• Hierarchy Modeling
  The CDFG approach does not necessarily mean that the application can only be modeled at a high level with limited simulation precision. The hierarchical modeling feature allows any func in the CDFG to be expanded into several finer-scale funcs, and thus more details can be incorporated to improve the simulation accuracy.

• Low Level Simulation
  The simulation is carried out at the cycle-accurate level in order to capture the detailed activities on the HW, e.g. bus arbitration.

• Flexible Probing
  SDFS allows the insertion of various probes in the application CDFG and HW graph to collect information of interest.

The experimental results show that SDFS is not only able to accurately estimate the system performance, but also helps identify the system bottleneck and find the optimal SW implementation solution. Such information helps the designer to optimize the system architecture although the tool does not generate an optimal architecture specification by itself.

The rest of the chapter is organized as follows. Section 3.2 gives a survey of the related work. The details of the tool are described in section 3.3. Section 3.4 gives an example showing how to use the tool. Some experimental results are shown in section 3.5. The conclusion and future work are summarized in section 3.6.

3.2 Related Work

A number of simulators for IP-based systems are found in [158] [53] [143]. Usually a trace file abstracting the application behavior is derived from the application and fed into the simulator. A task is assumed to consist of multiple functions, each of which has a known execution cost. The common drawbacks of these simulators are: 1) A fixed application trace does not provide convincing estimation results. For example, video clips with different motion activities consume very different numbers of DSP cycles for encoding/decoding. Moreover, a minor change to the application behavior requires the derivation of a new trace, which is time-consuming. 2) The performance prediction precision is limited because the function execution costs are assumed to be fixed even though the execution conditions may change. For these simulators, the execution cost needs to be profiled for each function in the same application running on the same HW platform in order to estimate the performance of an application on the HW platform. Ideally, feeding the execution costs to the simulator should yield very accurate results without surprises, since the function execution conditions
remain the same. However, none of the mentioned work has reported any results that pertain to changes to the HW architecture and/or application behavior, i.e. other than raising the DSP core frequency, everything else remains the same.

[163] proposed a simulator focusing on estimating performance changes caused by different bus structures. The results are not accurate because of the following limitations: 1) it models the application by a dataflow graph only, which does not support conditional branches and thus cannot truly capture the application behavior; 2) the activities on HW, e.g. SDRAM accesses, are simulated at the macro block level, which only captures the performance of burst accesses but not that of sporadic ones. [151] proposed a parametric model in which the execution cost of a func is the summation of three portions: core processor time, instruction fetch time, and an optional memory access time. This approach is not able to model the pipeline architecture of modern DSPs, in which instruction fetching and memory access are performed in parallel with instruction execution and multiple instructions can be executed concurrently on different core units.

[51] [116] proposed static abstract models that try to obtain a lower/upper bound on the system performance. Such information does not reflect the typical resource usage by the application and thus is not appropriate for driving the architecture design of multimedia systems. Besides, not all application behaviors can fit into their models and the static queuing model does not always abstract the actual application's behavior. Another problem is that some resource competition conditions cannot be captured, e.g. when both the DSP core and the DMA are accessing the internal memory at the same time.

SDFS solves the problems mentioned above and thus is able to assist in specification refinement with high confidence.














































Figure 3.1: Application DFG Example

An application dataflow graph (DFG) is a triple ⟨func, DE, buf⟩. An example is shown in Fig. 3.1.

• Each func is a functional block in the application, shown as a rectangle. A func may execute multiple times at run time; each execution instance is called a job. A job may consist of multiple phases, each of which involves reading and processing input data, and produces the output data at the end. We call each phase a sub-job.

• Each DE, shown as an arrow, is a data edge and denotes the data flow in the direction of the arrow.

• Each buf, shown as a circle, denotes the virtual buffer used by funcs to store data.

The DE connections in an application DFG conform to one of the following three patterns, as shown in Fig. 3.1.





































































Figure 3.2: Application CFG example

• IdxCE evaluates the condition based on the source func's job index (Idx). Its parameters are (period, residualStart, residualEnd), which are all integers and

    $0 \le residualStart \le residualEnd \le period$    (3.1)

  The condition becomes true when

    $Idx \bmod period \in [residualStart, residualEnd]$    (3.2)

  Take CE6 in Fig. 3.2-(a) as an example. Since period = N and residualStart = residualEnd = 0, the condition is true only if $Idx = k \cdot N$ ($k \in \mathbb{N}$), which means that func3 is executed once every N times after func4 is finished. Correspondingly, CE19 means that func4 repeats executing itself in the other cases.
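The IdxCE test of Eqs. 3.1 and 3.2 is small enough to sketch directly. The following Python sketch is illustrative only: the class and method names are assumptions, the extra check that period is positive is added so that the modulo is defined, and the parameters chosen for the complementary edge (standing in for CE19) are a guess at what "the other cases" means, since the source does not list them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdxCE:
    """Condition edge driven by the source func's job index (hypothetical)."""
    period: int
    residual_start: int
    residual_end: int

    def __post_init__(self):
        # Eq. 3.1, plus period >= 1 so that idx % period is defined.
        if self.period < 1:
            raise ValueError("period must be positive")
        if not (0 <= self.residual_start <= self.residual_end <= self.period):
            raise ValueError("need 0 <= residualStart <= residualEnd <= period")

    def is_true(self, idx: int) -> bool:
        # Eq. 3.2: the residue of the job index must fall in the window.
        return self.residual_start <= idx % self.period <= self.residual_end


N = 4
ce6 = IdxCE(period=N, residual_start=0, residual_end=0)       # every N-th job
ce19 = IdxCE(period=N, residual_start=1, residual_end=N - 1)  # assumed complement

fires_ce6 = [idx for idx in range(9) if ce6.is_true(idx)]
```

With N = 4, `fires_ce6` contains the job indices 0, 4 and 8, and for every index exactly one of the two edges is true, which matches the description that func4 either hands over to func3 or repeats itself.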














































Figure 3.3: HW Graph Example

A HW graph is a triple ⟨simcom, bus, strdev⟩. An example is shown in Fig. 3.3.

• A simcom, shown as a rectangle, models a rhw which has the ability to read/write and process data. Typical examples are a DSP or a DMA.

• A bus, shown as a bidirectional edge, is a link along which data can be transported in either direction.

• A strdev, shown as a circle, is the storage device used for storing the application data. Typical examples are SDRAM and internal memory (IRAM).

A bus can either connect a simcom with a strdev or connect two simcoms. The former represents the case where the simcom can read/write data from/to the strdev. The latter represents the case where two simcoms can communicate synchronously via the bus. A strdev can have multiple buses connected to it. Access requests from those buses are served in first-in-first-out (FIFO) order.
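The FIFO service rule for a strdev shared by several buses can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the request tuple layout, the function name `serve_fifo`, and the simplification that each request occupies the strdev for a fixed number of cycles. The point is only to show how a later request picks up blocking time when the device is still busy.

```python
from typing import List, Tuple

# (arrival_cycle, bus_name, service_cycles); layout is hypothetical
Request = Tuple[int, str, int]


def serve_fifo(requests: List[Request]) -> List[Tuple[str, int, int, int]]:
    """Serve requests in arrival order on one strdev.

    Returns one (bus_name, start, finish, blocking) tuple per request,
    where blocking is the time spent waiting for the device.
    """
    schedule = []
    free_at = 0  # cycle at which the strdev becomes idle
    # sorted() is stable, so requests arriving in the same cycle keep
    # their submission order.
    for arrival, bus, cost in sorted(requests, key=lambda r: r[0]):
        start = max(arrival, free_at)
        finish = start + cost
        schedule.append((bus, start, finish, start - arrival))
        free_at = finish
    return schedule


schedule = serve_fifo([(0, "dsp_bus", 4), (1, "dma_bus", 3), (10, "dsp_bus", 2)])
```

In this run the DMA request arriving at cycle 1 waits 3 cycles for the DSP access to finish, while the request at cycle 10 finds the device idle and is not blocked.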
3.3.4 Application CDFG & HW Parameters

The application CDFG and the HW graph represent the application's dataflow behavior and HW architecture respectively, while their parameters specify the system performance requirements. The key parameters are listed below.

func Parameters

• property can either be PERIODIC to indicate that a func is executed periodically, or REACTIVE, in which case a job is fired if at least one of its condition patterns is true.

• initCond specifies whether the first job of the func can be fired or not at the beginning of the simulation, regardless of its condition pattern.

• priority: we assume each simcom has a scheduler that schedules its funcs based on their priorities. For a DSP, it can mimic the RTOS scheduler. For a DMA, it can be the internal HW arbitrator.

• numSubjob specifies the number of sub-jobs for each job.

• inDataPerSubjob and outDataPerSubjob specify a func's I/O behavior. A job consists of numSubjob sub-jobs, each of which reads inDataPerSubjob amount of data from each input DE, processes the data, and writes outDataPerSubjob amount of data to each output DE.

• costPerByte specifies the average number of cycle(s) required to process each byte for this func.

• pipeline specifies whether the I/O operation can be executed in parallel with the data processing operation.
DE Parameters

• dataWidth specifies the unit of each data transfer on the DE.

• readType defines the read operation property on the DE. It can either be LOOKUP or CONSUME. The latter decrements the amount of data in the buf for each read operation while the former does not.

• writeType defines the write operation property on the DE. It can either be OVERWRITE or PRODUCE. The latter increments the amount of data in the buf for each write operation while the former does not.

simcom Parameters

• frequency is the clock frequency (in MHz) of the rhw being modeled by the simcom.

• engineNum specifies the maximum number of jobs that can be executed in parallel on the rhw. For a DSP core, it is set to 1. For a DMA device, it is set to the number of parallel channels being supported.

• syncCost specifies the number of cycles to handle an incoming synchronous datum/signal.

strdev Parameters

• frequency is the clock frequency (in MHz) of the storage device being modeled.

• busWidth is the width of the bus connecting with the strdev. By changing this parameter, the simulation result reflects the impact of the bus width on system performance.

• cyPerRead and cyPerBurstRead specify the cost of a single and burst read operation, respectively.
 yPerWrite and yPerBurstWrite speify the ost of a single and burst writeoperation, respetively.pipeline = false;for ( int i = 0; i < numSubjob; i++ ) fread ( inDataPerSubjob, i ); // read for subjobiproess ( inDataPerSubjob, i ); // proess for subjobiwrite( outDataPerSubjob, i ); // write for subjobig (a) Non-Pipeline DSPpipeline = true;read ( inDataPerSubjob, 0 ); // read for subjob0proess( inDataPerSubjob, 0 ); // proess for subjob0jj read ( inDataPerSubjob, 1 ); // and read for subjob1for ( int i = 1; i < numSubjob-1; i++ ) fwrite( outDataPerSubjob, i-1 ); // write for subjobi 1jj read ( inDataPerSubjob, i+1 ); // and read for subjobi+1jj proess ( inDataPerSubjob, i ); // and proess for subjobig proess( inDataPerSubjob, numSubjob-1 );jj write( outDataPerSubjob, numSubjob-2 );write( outDataPerSubjob, numSubjob-1 );(b) Pipeline DSP, (numSubjob > 1)Figure 3.4: Modeling Pipeline/Nonpipeline DSPThe pseudo ode in Fig. 3.4-(a) shows how to model the ost of a funexeution on a DSP with non-pipeline arhiteture. The fun has an input DE andan output DE. For eah sub-job, the fun reads inDataPerSubjob data, proessthem and writes outDataPerSubjob data. The total ost is the summation of allthe read, proess and write osts.The pseudo ode in Fig. 3.4-(b) shows how to model the exeution ost forthe same fun on a DSP with pipeline arhiteture. Inside the \for" loop, the readoperation for subjobi+1, the proessing operation for subjobi and the write operation70
for subjob_{i-1} are parallelized to model the pipelining feature. The cost of each loop iteration can be represented as $\max\{t_r, t_p, t_w\}$, where $t_r$, $t_p$ and $t_w$ are the costs of the read, process and write operations, respectively.

The cost of each processing operation can be calculated as

    $t_p = costPerByte \times dataWidth \times inDataPerSubjob$    (3.6)

The cost of each read and write operation depends on the strdev being accessed. A sequence of read/write operations to the same strdev is assumed to be a burst access. The following equations model $t_r$ and $t_w$.

    $n_r = \lfloor dataWidth / busWidth \rfloor \times inDataPerSubjob$
    $t_r = t_{rblk} + cyPerRead + cyPerBurstRead \times (n_r - 1)$    (3.7)

    $n_w = \lfloor dataWidth / busWidth \rfloor \times outDataPerSubjob$
    $t_w = t_{wblk} + cyPerWrite + cyPerBurstWrite \times (n_w - 1)$    (3.8)

where $t_{rblk}$ ($t_{wblk}$) represents the read (write) blocking time due to resource competition; its actual value is captured during simulation.

3.3.5 Mapping Application DFG to HW Graph

The following rules define how to map an application DFG to a HW graph.

• $func \xrightarrow{DE} buf \Rightarrow simcom \xrightarrow{bus} strdev$, in which case func, DE and buf are mapped to simcom, bus and strdev respectively.

• $buf \xrightarrow{DE} func \Rightarrow strdev \xrightarrow{bus} simcom$, in which case buf, DE and func are mapped to strdev, bus and simcom respectively.

• $func_1 \xrightarrow{DE} func_2 \Rightarrow simcom_1 \xrightarrow{bus} simcom_2$, in which case func1 and func2 are mapped to simcom1 and simcom2 respectively, and they communicate synchronously with each other via the bus.
 fun1  !DE fun2 ) simom, in whih ase both fun1 and fun2 are mappedto the same simom and ommuniate with eah other by some SWmehanismso that DE does not need to be mapped to any physial bus.Multiple funs an be mapped to the same simom. They are sheduledbased on their priorities. Multiple bufs an be mapped to the same strdev and theyare assumed not to overlap. A bus an also have more than one DEs mapped to it.In ase multiple DEs ompete for the same bus at the same time, the arbitration isdone as follows: If the bus onnets a simom and a strdev, whih implies that the multiplefuns onneting to the ompeting DE are mapped to the same simom andtry to transfer data over the bus, then the arbitration is done by the simombased on the priorities of the funs. If the bus onnets two simoms and the ompeting funs are on the samesimom, the arbitration is done in the same way as above. This senario hap-pens when multiple funs exeute simultaneously on the simom, i.e. severalDMA hannels are aessing the same bus. If the bus onnets two simoms and the ompeting funs are on dierentsimoms, the data/signal transation is done based on the synhronous om-muniation protool between the two simoms.3.3.6 Simulator ImplementationSDFS is implemented on top of SystemC 2.0 [6℄. Any windows/Linux PC or work-station an be used as simhw. It onsists of the following types of modules: fun,DE, CE, buf , simom, bus, strdev, bakplane and probepoint.The main responsibility of a simom is to dispath the fun(s) mapped to itand to manage its own lok. When a job is to transfer data, it noties its simom72
whih will then interat with the appropriate DE to handle the request. The DEsplits the request if the data unit is larger than the bus width or the length is largerthan the allowed burst size. The request(s) will be submitted to the bus to whih theDE is mapped and nally reahes the orresponding strdev. When the operationis nished, the simom updates its lok. When a data proessing operation isnished by a sub-job, it also noties the simom to update the lok.The bakplane synhronizes all the simoms during the simulation so thatall the events are proessed in a ausal order. Although [174℄ [96℄ demonstrated thatthe optimized onservative approah is eetive in reduing synhronization over-head for high level simulation, we adopt the onservative approah in our tool due tothe following two reasons. 1) For video appliations, the amount of data movementis relatively large. Sine all data movements need to be simulated at yle auratelevel, the event predition overhead outweighs the benet to be gained. 2) The ap-pliation is modeled at transation level with several parameters and simulating thebehavior of eah fun requires a small number of CPU instrutions. Therefore, theoverall simulation speed is still fast enough even though the onservative approahis taken.HW and SW probepoints are implemented to ollet simulation information.By inserting a pair of SW probepoints in the fun boundaries of the appliationCFG, a user an get the best/worst/average exeution time between the two points.A HW probepoint an be inserted into a simom to ompute its utilization fator.Suh type of information helps to identify HW bottlenek(s) and to nd the optimalSW implementation solution.3.4 ExampleThe following two steps must be ompleted before a simulation an start: 1) re-ating and onguring the appliation CDFG and HW graph; and 2) mapping the73
simom1 simom2 simom3 strdev1 strdev2DSP DMA USB IRAM SDRAM(a) Modeling DM642 with Cahe Being Ignoredsimom4 simom5 strdev3 strdev4L1P ahe L1D ahe L1P ahe L1D aheontroller ontroller memory memory(b) Modeling DM642 with L1P and L1D being modeledTable 3.1: Modeling TI DM642appliation DFG to the HW graph.Creating and onguring the HW graph an be done by the HW engineersbased on the proposed HW arhiteture. For example, Fig. 3.3-(a) and Tab. 3.1-(a)show how to model a TI DM642 DSP [21℄ based system with ahes being ignored.Fig. 3.3-(b) and Tab. 3.1-(b) show how to model the L1 program and data ahe.Creating and onguring the appliation CDFG an be made either by HWor SW engineers based on the desired appliation behavior. For instane, Fig. 3.2-(a) and Fig. 3.1 show the CFG and the DFG modeling a TI H.263 deoder [7℄. Thedeoder reeives the input stream from USB and sends the deoded frame to theframe buer in SDRAM. Eah fun and its mapping are desribed in Tab. 3.2.fun mapping desriptionfun0 USB Stream soure generationfun1 DSP Reeiving streamfun2 DMA Moving data from SDRAM to IRAMfun3 DSP Parsing frame headfun4 DSP Deoding I group of bloks (GOB)fun5 DSP Deoding P GOBfun6 DMA Moving deoded GOB to SDRAMbuf1 SDRAM Stream buer storing input databuf2 IRAM Streaming buer in IRAMbuf3 IRAM Streaming buer without frame headbuf4 IRAM Frame buer storing deoded framebuf5 SDRAM Frame buer storing deoded frameTable 3.2: Construting CFG for H.263 Deoder74















































related to cache.

Figure 3.5: Decoding P GOB, Modeling Cache

    func       mapping   description
    func5_7    L1P       Move code of func5_2, func5_3, func5_4, func5_5 and func5_6 from SDRAM to L1P memory
    func5_8    L1D       Move GOB blocks of coded data from IRAM to L1D memory
    func5_9    L1D       Move 1 reference block from IRAM to L1D memory
    func5_10   L1D       Move 1 reference block from IRAM to L1D memory

Table 3.4: Cache Activities for Decoding P GOB

3.5 Experiment

    core frequency (DSP+cache)    600MHz
    on-chip HW (IRAM+DMA)         300MHz
    bus between DSP and cache     256 bits
    other buses                   64 bits

Table 3.5: TI DM642 Parameters

In the first experiment, we tried to simulate the TI H.263 decoder performance on the TI DM642 processor. All the HW parameters are configured based on the DM642 datasheet [21]. All the func parameters are obtained by benchmarking the decoder running on DM642. The purpose of this experiment is not to estimate the performance of a new application on a new rhw but to validate our simulator as















































was done in [158] [53] [143]. It is expected that the simulation result should be close to the profiling results. The benchmark obtained is the best-case data when both program and data are in L1 cache. funcs modeling cache activities have been inserted into the application CDFG at appropriate places, as in Fig. 3.5. The video format is progressive mode with D1 resolution. The HW architecture is the same as Fig. 3.3-(b) and the main parameters are shown in Tab. 3.5. Since different streams with different scenes and motion activities can require drastically different decoding time per frame, we chose to compare the decoding time per MB, which depends much less on the input stream. We note that such a comparison actually imposes a much bigger challenge because we have to check whether the estimation results match the real performance on a per MB basis instead of on a per frame basis, for which the decoding time of 1350 MBs is summed.

Figure 3.6: H.263 Decoder on DM642-600MHz

From Fig. 3.6 to Fig. 3.9, the Y-axis is the number of DSP clock cycles,







































and the X-axis is the index of the specific case in an experiment. In each grouped bar, the green one represents the simulation result, while the yellow one represents the actual execution result. For example, case 1 (x=1) in Fig. 3.6 shows the best performance to decode a MB in an I frame for an H.263 decoder running on a 600MHz DM642 chip, and the error between the simulation and actual result is 2.13%. Fig. 3.6 shows that the simulation results match well with the real performance not only in the best cases, but also in the average and worst cases. Since all the non-best cases are caused by resource competition, e.g. two simcoms accessing IRAM at the same time, it is reassuring to see that the simulator is able to capture the impact of these race conditions.

Figure 3.7: MPEG2 Decoder on DM642-600MHz

In the second experiment, we tried to estimate the performance of a TI MPEG-2 decoder running on the same HW. Compared to the H.263 decoder, this one








































has a different CDFG but adopts the same set of key functions when decoding a progressive mode stream. The benchmark data obtained from the first experiment are reused to configure the func parameters. The objective of this experiment is to demonstrate the estimation accuracy for a new application running on the old rhw.

Figure 3.8: H.263 Decoder on DM642-720MHz

In the third experiment, we raised the DSP frequency and tried to estimate the performance of the H.263 decoder again. The frequency for the DSP core and on-chip HW is raised to 720MHz and 360MHz, respectively. The SDRAM frequency is still 133MHz. The purpose is to demonstrate the estimation accuracy for an old application running on a new rhw.

In the fourth experiment, we tried to estimate the performance of the MPEG-2 decoder running on the DSP at the higher frequency. The purpose of this experiment is to demonstrate the estimation capability of the tool for a new application running on a new rhw.







































Figure 3.9: MPEG2 Decoder on DM642-720MHz

Fig. 3.7, 3.8 and 3.9 show that the simulation results still match well with the real performance for the best and average cases when the application CDFG and/or HW are changed. The worst case estimation error has gone up to 13%, which is mainly caused by the inaccuracy of the cache model.

The simulator is also able to suggest better SW implementation solutions. For example, we found that it is much more efficient to process a GOB at a time than to process MBs one by one. The main reason is that the computationally intensive functions, e.g. the IDCT shown in Fig. 3.5, can be called repeatedly during GOB processing to reduce the program cache miss rate. Our simulation results show that the performance difference for the average cases can be as large as 294%.

The simulation speed of SDFS is sufficiently fast for experiments that are quite typical in practical design chores. On a Dell D600 laptop (1.5GHz Pentium4 CPU + 512MB DDR SDRAM), it takes about 130ms to simulate decoding 1 MB,
whih equally means that it takes about 4:28 104 host yles to simulate a targetyle. Compared to [82℄ whih ouples an ISS with a VHDL simulator and takesabout 1:5  106 host yles to simulate a target yle, SDFS is about 2 orders ofmagnitude faster but ahieves omparable auray. Sine SDFS does not rely onany spei trae le as input, we found that simulating 10 frames is usually enoughto estimate the system performane. Even for resolutions as large as D1, it onlytakes 3-4 minutes.3.6 ConlusionSDFS is presented for system wide performane estimation of multimedia applia-tions. It arries out the simulation at a suÆiently low level to ath the detailedativities on the HW. Eah blok is ompletely parameter-driven so that the systemarhitet does not need to write any ode but to fous on building the appliationCDFG and HW graph as the simulator input. We demonstrate the usefulness ofSDFS with real appliation examples from industry. SDFS takes about 4:28  104host yles to simulate a target yle when the simulation is arried on a Dell D600laptop (1:5GHz Pentium4 CPU + 512MB DDR SDRAM). Compared to [82℄ whihouples an ISS with a VHDL simulator and takes about 1:5  106 host yles tosimulate a target yle, SDFS is about 2 orders of magnitude faster but ahievesomparable auray. Sine SDFS does not depend on any spei video streamas input, simulating 10 frames is quite suÆient for arhiteture evaluation. Evenfor resolutions as large as D1, a simulation only takes 3-4 minutes. The simulatorhas demonstrated reliable performane estimation apability for use in system-leveltradeo analysis. The error of the best and average ase estimation result is within6% for our experiments; and the error of worst ase result is within 13%. Thisapplies even to hanges in appliation CDFG and in HW.The fun parameters in the appliation CDFG an be obtained by proling81




4.1 Introduction

Many embedded systems combine digital signal processing (DSP) functions with SW-implemented controllers. Because of the high cost of HW development, it is critical to reuse as much of the HW core as possible, and to meet real-time performance requirements by redesigning the on-chip HW accelerators, DMAs and bus architectures for the specific application of interest. To realize this strategy, tools are required to validate design decisions at the system level, and this is made difficult by the practical need to develop the application SW in parallel with the HW. The research challenge is to invent a simulation platform for the application SW that can provide sufficient precision to guarantee performance, is sufficiently fast, and is compatible with the SW development environment. From the industry perspective, it is extremely difficult to convince SW engineers to change their own development environment [137].

The real-time simulation platform (RTSP) presented here is implemented on legacy DSPs. To the best of our knowledge, it is the first simulator that truly enables HW/SW co-development by offering the same SW development environment as if


















Figure 4.1: Simulation Problem

accumulated to make it unbounded and thus simulation fidelity is violated.

Emitting an event too early is also a problem, but it can be solved by buffering early events until the time they can be delivered. In this chapter, we address the problem of how to make a simcom "catch up" if some jobs start late. A two-level scheduler is designed for each legacy DSP to bound the delay. The simulation speed is intentionally slowed by a factor of $S$, so that every $simcom_i$ with rate $\alpha_i$ uses the supply at the actual rate $\alpha_i / S$. However, the chance for $simcom_i$ to get the supply is $K_i \alpha_i / S$, where $K_i \in \mathbb{N}$. The exact values of $S$ and $K_i$ are determined by the application's characteristics. The first level scheduler is a real-time periodic scheduler assigning quantums. The second level scheduler determines when a simcom is allowed to use the assigned quantum. The result is that when a simcom's job is started late because of a delayed event, it can still "catch up" before finishing the job.
An audio and a video application are selected as real DSP industry applications for the experiments. The results show that the scheduling overhead is negligible, the simulation speed is fast, and the simulation is accurate.

The rest of the chapter is organized as follows. Section 4.2 gives a summary of related work. Section 4.3 explains the scheduling algorithm, focusing on the second level scheduler. Section 4.4 explains how to find appropriate values of $K_i$ for $simcom_i$ and of $S$. Section 4.5 provides a heuristic algorithm to assign simcoms to the available simhws. Some important implementation features are described in Section 4.6. Section 4.7 shows the experiment results and Section 4.8 draws the conclusion.

4.2 Related Work

The methodology presented in this chapter is motivated by [133], in which a real-time virtual processor prototype is implemented based on an enhanced Linux kernel. An application on a virtual processor is guaranteed to get its share and is cleanly isolated from other processes. The RTSP presented in this chapter is implemented on TI DSPs based on DSP/BIOS [15].

Although theoretically many real-time periodic schedulers can be used as the first level scheduler, it is desirable to have the most "fair" scheduler so that each simcom has a fair chance to get quantums. The chairman assignment algorithm [162] proposed by Tijdeman has been proved to be optimal in the sense that, for all $t$, the deviation between the actual and normal supply for a simcom is bounded and the bound is the smallest among all algorithms. Stoica in [157] proposed an efficient floating point implementation scheme. In this chapter, we propose a fixed point implementation, which is friendlier to DSP platforms.

Numerous techniques have been published for performance estimation. Static abstract modeling approaches are proposed in [151] [116] to support early architecture-level DSE. [143] presented a behavior model implemented using SystemC [6].
Eah blok is assumed to onsist of multiple setions eah of whih has a statiexeution ost. [53℄ proposed an estimation approah by analyzing the appliationsoure ode and generating a performane prole. Suh approahes are not ableto estimate the performane by simulating the atual SW to be implemented. Inontrast, the performane data of RTSP is obtained by diretly exeuting the SWon legay DSPs with similar arhiteture as the rhw, and thus the result is moreaurate. Coupling an ISS with a RTL simulator modeling the rhw, i.e. [115℄ [82℄,or using the prototyping HW suh as [58℄, allows exeuting the atual SW ode toget aurate simulation result. However, onstruting suh high-delity platforms islabor intensive and expensive. More importantly, they are typially very slow andhave limited debugging apability whih prohibits any omplex SW to be developedon them [169℄. In ontrast, RTSP exploits legay HW and therefore is muh heaperand by applying the behavior model to eliminate ISS, RTSP is also fast.To the best of the author's knowledge, [137℄ is the only published worktrying to reate a simulation engine that is ompatible with the SW developmentenvironment in HW/SW o-design. It requires two host omputers. The appliationSW is simulated on the rst one. All the simoms are onstruted in SystemC andVHDL and simulated on the other one. The ommuniation between the two hostsis soket-based so that simulating a simple register-read an take as long as 3:7 ms.The synhronization problem between HW and SW is not addressed. Compared to[137℄, RTSP ahieves higher simulation auray by deploying the legay DSPs asthe simhw and handling synhronization between simoms properly. RTSP alsoahieves muh higher simulation speed by utilizing Serial RapidIO (SRIO) [12℄ asthe muh more eÆient ommuniation link between simhws. The bandwidth ofSRIO is 20Gb/s bidiretionally.[98℄ presented a tool to simulate video appliation's dataow in low levelwithout ISS. 
It can be used together with RTSP so that the two complement each other. The information collected from RTSP can be fed to this tool to refine the HW design specification, which
in turn may suggest changes to the behavior models and parameters in RTSP. [172] [96] showed the effectiveness of combining an RTOS model with the application SW to improve simulation accuracy. RTSP is implemented to allow integration of these models.

Simulation speed can be severely degraded by frequent synchronization between simcoms. [108] proposed the optimized conservative approach, in which the synchronization point is predicted by SW analysis. [96] provides an algorithm for the multi-task context. The effectiveness of such approaches directly depends on knowledge of the timing characteristics of the simulation SW, and the insertion of prediction code is error-prone. [43] tried to reduce the IPC frequency by resorting to compile-time/run-time scheduling. None of these approaches removes unnecessary synchronization completely. [113] [172] introduced a technique called virtual synchronization, which combines the event-driven scheduler with a data-driven model to remove unnecessary synchronization under the assumption that the output of the simulation SW depends only on event ordering and not on the arrival time of each event. RTSP does not need this assumption since the scheduler is designed to force the simcoms to progress together modulo some bounded jitter.

4.3 Algorithm

4.3.1 Annotations

The following notations assist in describing the scheduling algorithm.

Notations 4.3.1 $R \cdot S$ is the ratio between the speed of $H$ and that of the simulation, where $R$ is the minimal achievable ratio determined by $S_H$ without considering bounding the delay, and $S$ is the slowdown factor used to bound the delay.

Notations 4.3.2 $\alpha_i$ and $K_i$ are $simcom_i$'s scheduling parameters. $simcom_i$ progresses at rate $\alpha_i / S$ but receives supply from its simhw at rate $\alpha_i K_i / S$ ($K_i \in \mathbb{N}$ and
$K_i \ge 1$).

Notations 4.3.3 $Q$ is the scheduling quantum determined by the scheduler on a simhw. $Q_i$ denotes the quantum for $simcom_i$. $\forall\, simcom_i \in simhw_j$ and $simcom_k \in simhw_j$ ($i \ne k$), $Q_i = Q_k$.

Notations 4.3.4 $L_i$ is the difference between the actual and normal supply of $simcom_i$ and is determined by the first level scheduler on $simhw_j$ ($simcom_i \in simhw_j$). $\forall\, t$, it is assumed that

$$|s_i(t) - t \cdot K_i \cdot \alpha_i / S| \le L_i \cdot Q_i \qquad (4.1)$$

where $s_i(t)$ and $t \cdot K_i \alpha_i / S$ are the actual and normal supply functions of $simcom_i$. $\forall\, simcom_i \in simhw_j$ and $simcom_k \in simhw_j$ ($i \ne k$), $L_i = L_k$.

Notations 4.3.5 $V_i(t)$ is the virtual time of $simcom_i$. $\forall\, t$ in the actual simulation, $V_i(t)$ is the time $simcom_i$ would have reached in an ideal simulation.

Notations 4.3.6 $\Gamma_i^C$, $\gamma_i^k = \{p_i^k, E_i^k, t\varepsilon_i^k, \langle ts_i^k, Vs_i^k \rangle, \langle te_i^k, Ve_i^k \rangle\}$:
$\forall\, simcom_i$, $\Gamma_i^C$ is the set of jobs to be simulated. The number of jobs is $|\Gamma_i^C|$ and $\gamma_i^k$ is the $k$th job. At the end of $\gamma_i^k$, $simcom_i$ sends an event to itself to trigger $\gamma_i^{k+1}$ if $k < |\Gamma_i^C|$. Event(s) may also be sent to other simcoms to trigger their job(s).
• $ts_i^k$ ($te_i^k$): actual start (finishing) time of $\gamma_i^k$.
• $Vs_i^k$ ($Ve_i^k$): virtual start (finishing) time of $\gamma_i^k$.
• $p_i^k$: length of $\gamma_i^k$.
• $E_i^k$: the set of event(s) $simcom_i$ needs to receive before $\gamma_i^k$ starts. $\varepsilon_j^l \in E_i^k$ means that the event is sent by $simcom_j$ at the end of $\gamma_j^l$.
• $t\varepsilon_i^k$: the time at which the last event in $E_i^k$ is received during simulation.
Notations 4.3.7 $\Psi_i$ is the set of simcom(s) that send event(s) to $simcom_i$ to trigger its job(s). That is, $\Psi_i = \{E_i^k \mid k = 1 \ldots |\Gamma_i^C|\}$.

Notations 4.3.8 $\Delta_i = \frac{2 L_i Q_i S}{K_i \alpha_i}$ is the bound to be established for $simcom_i$ such that $\forall\, \gamma_i^k$, $te_i^k - Ve_i^k \le \Delta_i$.

Notations 4.3.9 $state_i(t)$ is the state of $simcom_i$ at time $t$ during simulation. It can either be $exe_i^k$, meaning that $simcom_i$ is simulating $\gamma_i^k$, or $idle_i^k$, meaning that $simcom_i$ is waiting for $\gamma_i^k$ to start.

Notations 4.3.10 $t \rightarrow i$ / $t \nrightarrow i$ denotes that a $Q$ with interval $[t, t + Q_i]$ is / is not assigned to $simcom_i$.

Notations 4.3.11 $t \Rightarrow i$ / $t \nRightarrow i$ denotes that $simcom_i$ is / is not allowed to use the $Q$ assigned in $[t, t + Q_i]$.

Notations 4.3.12 $u_i[t_0, t_1]$ denotes the interval used by $simcom_i$ within a quantum $Q$ allocated to it, where $Q$ spans $[t_0, t_0 + Q_i]$. Note that $simcom_i$ may actually use only a portion of $Q$.

4.3.2 Scheduling Algorithm

We assume that it is always possible to find $R$ so that $\forall\, simcom_i$, $\frac{speed(rhw_i)}{speed(simcom_i)} = R$ by assigning $simcom_i$ an appropriate rate $\alpha_i$. $R$ determines the fastest achievable simulation speed without considering bounding the delay. The following example shows how to find $\alpha_i$ and $R$.

Example 4.3.1 $|H| = 3$, $|S_H| = 2$, $simcom_1 \in simhw_1$, $simcom_2 \in simhw_1$, and $simcom_3 \in simhw_2$.

$\frac{speed(rhw_1)}{speed(simcom_1)} = 2.5$; $\frac{speed(rhw_2)}{speed(simcom_2)} = 2$; $\frac{speed(rhw_3)}{speed(simcom_3)} = 4$;






Figure 4.2: Idea to Bound Delay. Panel (a): delay propagation for $S = 1$ (horizontal axis: time $t$ of the actual simulation), with $p_1^1 = p_2^1 = 1/3$ and $\alpha_1 = \alpha_2 = 1/3$.

Example 4.3.2 If the simulation is not slowed down ($S = 1$), $simcom_1$ would finish $\gamma_1^1$ and trigger $\gamma_2^1$ at $Ve_1^1 = R \cdot p_1^1 = 1$ in the ideal simulation. $simcom_2$ would finish $\gamma_2^1$ and
trigger $\gamma_1^2$ at $Ve_2^1 = 1 + R \cdot p_2^1 = 2$. In the actual simulation, $simcom_1$ and $simcom_2$ can only receive one $Q$ of every three because $\alpha_1 = \alpha_2 = 1/3$. It could happen that $simcom_2$ receives the first $Q$ but cannot use it, and $simcom_1$ receives the second one and finishes $\gamma_1^1$ at $te_1^1 = 4/3$. Suppose that $simcom_2$ receives its next $Q$ at $t = 5$; then $\gamma_2^1$ will be finished at $te_2^1 = 16/3$. Clearly the delay is accumulated.

Fig. 4.2-(b) shows the case when the simulation is slowed by a factor of 3 with both $K_1$ and $K_2$ set to 3. Now $Ve_1^1 = 3$ and $Ve_2^1 = 6$. In the actual simulation, both simcoms still receive one $Q$ of every three from the first level scheduler since $K_1 \alpha_1 / S = K_2 \alpha_2 / S = 1/3$, but the second level scheduler allows a simcom to use it only if a job can be simulated at that moment. For example, $simcom_2$ does not use the given $Q$ at $t = 0$ but uses the one at $t = 5$, when $\gamma_2^1$ can be started. The delay is bounded in this case because $te_1^1 < Ve_1^1$ and $te_2^1 < Ve_2^1$.

The first level scheduler can be any scheduler that guarantees Equ. (4.1). Tijdeman's chairman assignment algorithm is chosen since it has been proved to achieve the minimal $L$ value among all algorithms. Fig. 4.3 summarizes the algorithm. For every $simhw_j$, it is activated at every $t = m \cdot Q$ ($m \in \mathbb{N}$), assigning the next $Q$ to an eligible simcom if there is any.

Fig. 4.4 shows the second level scheduler, which consists of three portions: 1) pre-scheduler, 2) post-scheduler, and 3) event-handler. The pre-scheduler determines whether $simcom_i$ is allowed to use the given $Q$ when $t_0 \rightarrow i$. case1.1, case1.2 and case1.3 imply that every $simcom_i$ is allowed to use a given $Q$ only if it is in the exe state. The pre-scheduler also implies that every $\gamma_i^k$ always starts at the beginning of a given $Q$.

The post-scheduler updates $simcom_i$'s state and virtual time at $t_0 + Q_i$ when $t_0 \Rightarrow i$ and $u_i[t_0, t_1]$. It may use part of the $Q$ to finish the current job (case2.1), or use the whole $Q$ and then be forced to yield (case2.2).
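As an illustration of the first level scheduler described above, the following minimal Python sketch applies the eligibility test, the deadline $d_i(t)$ and the lag update of the chairman assignment rule to a list of rates. The class name `FirstLevelScheduler`, the method `assign_next_quantum` and the use of exact fractions are assumptions made for this sketch, not part of RTSP; each entry of `rates` stands for $K_i \alpha_i / S$ and the rates are assumed to sum to at most 1.

```python
from fractions import Fraction


class FirstLevelScheduler:
    """Sketch of the quantum-assignment rule of the first level scheduler."""

    def __init__(self, rates):
        # rates[i] plays the role of K_i * alpha_i / S for simcom_i (hypothetical input).
        self.rates = [Fraction(r) for r in rates]
        n = len(self.rates)
        # L = 1 - 0.5 / (n - 1) when n > 1, otherwise 0.5.
        self.L = 1 - Fraction(1, 2) / (n - 1) if n > 1 else Fraction(1, 2)
        self.lag = [Fraction(0)] * n

    def assign_next_quantum(self):
        """Return the index of the simcom that receives the next Q, or None."""
        candidates = []
        for i, rate in enumerate(self.rates):
            if self.lag[i] + self.L + rate - 1 >= 0:        # eligibility test
                deadline = (self.L - self.lag[i]) / rate     # d_i(t)
                candidates.append((deadline, i))
        # Every simcom accrues its nominal share for this quantum.
        self.lag = [lag + rate for lag, rate in zip(self.lag, self.rates)]
        if not candidates:
            return None
        _, chosen = min(candidates)    # ties are broken by the lowest index
        self.lag[chosen] -= 1          # one full quantum was supplied
        return chosen


# Three simcoms, each entitled to one quantum out of every three.
scheduler = FirstLevelScheduler([Fraction(1, 3)] * 3)
order = []
max_abs_lag = Fraction(0)
for _ in range(30):
    order.append(scheduler.assign_next_quantum())
    max_abs_lag = max(max_abs_lag, *(abs(x) for x in scheduler.lag))
```

In this small case the quanta are handed out round-robin and the lag of every simcom stays within $L = 0.75$ quanta.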
$n = |simhw_j|$ (number of simcoms on $simhw_j$)
initially ($t = 0$): $\forall\, simcom_i \in simhw_j$,
    $L_i := 1 - 0.5/(n - 1)$ if $n > 1$; $0.5$ otherwise
    $lag_i(0) := 0$;
scheduling at $t = m \cdot Q$ ($m \in \mathbb{N}$):
    $\forall\, simcom_i \in simhw_j$,
        define $simcom_i$ is eligible if $lag_i(t) + L_i + \frac{K_i \alpha_i}{S} - 1 \ge 0$
        define $d_i(t) := [L_i - lag_i(t)] \cdot \frac{S}{K_i \alpha_i}$ if $simcom_i$ is eligible
        $lag_i(t) := lag_i(t) + \frac{K_i \alpha_i}{S}$;
    if $\exists\, simcom_k$ such that $d_k(t)$ is minimal among all eligible simcoms
        $t \rightarrow k$; (assign the next $Q$ to $simcom_k$)
        $lag_k(t) := lag_k(t) - 1$;
    end if
Figure 4.3: Tijdeman's Algorithm

The event-handler determines the supposed start time of $\gamma_i^k$, which is $Vs_i^k$ for $simcom_i$. A variable $V\varepsilon_i^k$ is kept for $\gamma_i^k$ to record the time when the latest event is supposed to be received among all the event(s) that trigger $\gamma_i^k$. When an event $\varepsilon_j^l \in E_i^k$ is sent to $simcom_i$ by $simcom_j$ at the end of $\gamma_j^l$ ($t_1$), the time when it is supposed to be received ($V_j(t_1)$) is compared with $V\varepsilon_i^k$. $V\varepsilon_i^k$ is updated if $V_j(t_1) > V\varepsilon_i^k$. When all the events in $E_i^k$ have been received, $Vs_i^k$ is known as $V\varepsilon_i^k$.

4.3.3 Correctness Proof of the Scheduler

The objective of this section is to prove that $\exists\, S$ and $K_i$ for $simcom_i$ such that $\forall\, \gamma_i^k$, $te_i^k - Ve_i^k$ is bounded. The proof can be divided into the following steps. First we formally define the meaning of "catch up" if a job is started late. Theorem 4.3.1, corollary 4.3.1 and 4.3.2 prove the longest time $simcom_i$ needs to wait to start receiving the next $Q$. Then lemma 4.3.1 and 4.3.2 prove the biggest possible delay for any $\gamma_i^k$ to start, using theorem 4.3.1, corollary 4.3.1 and 4.3.2. Given a bounded start delay and length of $\gamma_i^k$, lemma 4.3.3, 4.3.4 and 4.3.5 prove that $simcom_i$ can catch
pre-sheduler : when t0 ! iase1:1: if statei(t0) = exeki & Vi(t0) < t0 +Qit0  i;end ase1:1ase1:2: if statei(t0) = idleki , 9m 2 N that t0  m  SQiiKi andV sk+1i 2 [m  SQiiKi ; (m+ 1)  SQiiKi )statei(t0) := exeki ; Vi(t) := V ski ; tski := t0; t0  i;end ase1:2ase1:3: if neither ase1:1 nor ase1:2 appliest0  i;end ase1:3end pre-shedulerpost-sheduler : at t0 +Qi when t0 ! i, ui[t0; t1℄ and statei = exekiase2:1: if t1 < tekiVi(t1) := Vi(t0) + (t1   t0)  Siend ase2:1ase2:2: if t1 = tekistatei(t1) := idlek+1i ;Vi(t1) := Vi(t0) + (t1   t0)  Siend ase2:2end post-shedulerevent-handler : when "lj 2 Eki is reeived at t1dene V "ki := 0 initiallyif Vj(t1) > V "ki , V "ki := Vj(t1)if all events in Eki have been reeived, V ski := V "ki ;end event-shedulerFigure 4.4: Seond Level Sheduler
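up $\gamma_i^k$. Lemma 4.3.6 proves that when $\gamma_i^k$ has been caught up by $simcom_i$, its finish time will be delayed at most by $\Delta_i = \frac{2 L_i Q_i S}{\alpha_i K_i}$. Finally theorem 4.3.2 proves that $\exists\, S$ and $K_i$ for each $simcom_i$ such that $\forall\, \gamma_i^k$, $te_i^k - Ve_i^k \le \Delta_i$, using the previously proved results.

Definition 4.3.1 "catch up": $\forall\, \gamma_i^k$ to be simulated, $simcom_i$ is defined to have caught up (when $\gamma_i^k$ was started late) if $\exists\, t \in [ts_i^k, te_i^k]$ such that $t = V_i(t)$.

Theorem 4.3.1 Tijdeman's algorithm guarantees that $\forall\, simcom_i$, $\forall\, t$, $|lag_i(t)| = |t \cdot \frac{K_i \alpha_i}{S} - s_i(t)| \le L_i \cdot Q_i$, where $L_i$ is defined by Equ. (4.2) and $L_i$ is the minimal among all algorithms.

$$L_i = \begin{cases} 1 - \frac{0.5}{n - 1} & \text{if } n > 1 \\ 0.5 & \text{otherwise} \end{cases} \qquad (4.2)$$

where $n$ is the number of simcom(s) on $simhw_j$ including $simcom_i$.

Proof: The proof can be found in [162].

Corollary 4.3.1 $\forall\, simcom_i$, if $t_0 \rightarrow i$, $t_1 \rightarrow i$, $s_i(t_1) - s_i(t_0) = n \cdot Q_i$ ($n \in \mathbb{N}$), then $t_1 - t_0 \le (2 L_i + n) \cdot \frac{Q_i S}{\alpha_i K_i} = \Delta_i + n \cdot \frac{Q_i S}{\alpha_i K_i}$.

Proof: Theorem 4.3.1 $\Rightarrow$
$s_i(t_0) - t_0 \cdot \frac{K_i \alpha_i}{S} \le L_i \cdot Q_i$ (a)
$\frac{K_i \alpha_i}{S} \cdot t_1 - s_i(t_1) \le L_i \cdot Q_i$ (b)
(a) + (b) $\Rightarrow t_1 - t_0 \le (2 L_i + n) \cdot \frac{Q_i S}{\alpha_i K_i}$

Corollary 4.3.2 $\forall\, simcom_i$, if $t_0 \rightarrow i$, $t_2 \rightarrow i$, $t_1 \in [t_0, t_0 + Q_i)$, $s_i(t_2) - s_i(t_0) = n \cdot Q_i$ ($n \ge 1$, $n \in \mathbb{N}$), then $t_2 - t_1 \le \Delta_i + Q_i + (n - 1) \cdot \frac{Q_i S}{\alpha_i K_i}$.

Proof: Corollary 4.3.1 $\Rightarrow t_2 - (t_0 + Q_i) \le \Delta_i + (n - 1) \cdot \frac{Q_i S}{\alpha_i K_i}$
$t_1 \ge t_0 \Rightarrow t_2 - t_1 \le \Delta_i + Q_i + (n - 1) \cdot \frac{Q_i S}{\alpha_i K_i}$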
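Lemma 4.3.1 shows the biggest delay for $\gamma_i^k$ to start if no triggering event is received late.

Lemma 4.3.1 $\forall\, \gamma_i^k$, if $t\varepsilon_i^k \le Vs_i^k$, then $ts_i^k - Vs_i^k \le \Delta_i + Q_i$.

Proof: At $t\varepsilon_i^k$, $state_i(t\varepsilon_i^k) = idle_i^k$. $\gamma_i^k$ starts later, when case1.2 in Fig. 4.4 applies. That is, $\exists\, m \in \mathbb{N}$ such that $ts_i^k \ge m \cdot \frac{S Q_i}{\alpha_i K_i}$ and $m \cdot \frac{S Q_i}{\alpha_i K_i} \le Vs_i^k < (m + 1) \cdot \frac{S Q_i}{\alpha_i K_i}$.
The proof covers the following two conditions.
1) If $\exists\, t_0$ such that $t_0 \rightarrow i$ and $m \cdot \frac{S Q_i}{\alpha_i K_i} \in [t_0, t_0 + Q_i]$:
Corollary 4.3.2 $\Rightarrow \exists\, t_1$, $t_1 \rightarrow i$, $s_i(t_1) - s_i(t_0) = Q_i$ and $t_1 \le \Delta_i + Q_i + m \cdot \frac{S Q_i}{\alpha_i K_i}$.
Since case1.2 can be applied at $t_1$, $ts_i^k = t_1 \le \Delta_i + Q_i + m \cdot \frac{S Q_i}{\alpha_i K_i}$.
2) If $\nexists\, t_0$ such that $t_0 \rightarrow i$ and $m \cdot \frac{S Q_i}{\alpha_i K_i} \in [t_0, t_0 + Q_i]$:
Corollary 4.3.1 $\Rightarrow \exists\, t_1 > t_0$, $t_1 \rightarrow i$, $s_i(t_1) = s_i(t_0)$ and $t_1 \le \Delta_i + m \cdot \frac{S Q_i}{\alpha_i K_i}$.
Since case1.2 can be applied at $t_1$, $ts_i^k = t_1 \le \Delta_i + m \cdot \frac{S Q_i}{\alpha_i K_i}$.
$Vs_i^k \ge m \cdot \frac{S Q_i}{\alpha_i K_i} \Rightarrow ts_i^k - Vs_i^k \le \Delta_i + Q_i$.

Lemma 4.3.2 shows the start delay bound for $\gamma_i^k$ if the biggest delay of all the triggering event(s) is $d$.

Lemma 4.3.2 $\forall\, \gamma_i^k$, if $t\varepsilon_i^k > Vs_i^k$ and $t\varepsilon_i^k - Vs_i^k = d$, then $ts_i^k - Vs_i^k \le d + \Delta_i + Q_i$.

Proof: It can be proved similarly to lemma 4.3.1.

Lemma 4.3.3 derives the condition for $\gamma_i^k$ to catch up when it is started late by $d$ and the length is $\le Q_i$.

Lemma 4.3.3 $\forall\, \gamma_i^k$, if $ts_i^k - Vs_i^k = d > 0$, $p_i^k \le Q_i$ and $S \ge \alpha_i \cdot (\frac{d}{p_i^k} + 1)$, then $\exists\, t_1 \in (ts_i^k, te_i^k]$, $V_i(t_1) = t_1$.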
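Proof: $p_i^k \le Q_i \Rightarrow te_i^k = ts_i^k + p_i^k = Vs_i^k + d + p_i^k$
$S \ge \alpha_i \cdot (\frac{d}{p_i^k} + 1) \Rightarrow p_i^k \cdot \frac{S}{\alpha_i} \ge p_i^k + d$
$Ve_i^k = Vs_i^k + p_i^k \cdot \frac{S}{\alpha_i} \Rightarrow Ve_i^k \ge Vs_i^k + p_i^k + d = te_i^k$
$ts_i^k > Vs_i^k$, $te_i^k \le Ve_i^k \Rightarrow \exists\, t_1 \in (ts_i^k, te_i^k]$, $V_i(t_1) = t_1$

Lemma 4.3.4 derives the condition for $\gamma_i^k$ to catch up when it is started late by $d$ and the length is $> Q_i$.

Lemma 4.3.4 $\forall\, \gamma_i^k$, if $ts_i^k - Vs_i^k = d > 0$, $p_i^k > Q_i$ and $S \cdot \frac{n + r}{\alpha_i} - \frac{S}{K_i} \cdot \frac{2 L_i + n - 1}{\alpha_i} - r - 1 - \frac{d}{Q_i} \ge 0$, where $n = \lfloor \frac{p_i^k}{Q_i} \rfloor$, $r = \frac{p_i^k}{Q_i} - n$, then $\exists\, t_1 \in (ts_i^k, te_i^k]$, $V_i(t_1) = t_1$.

Proof: Corollary 4.3.2 $\Rightarrow te_i^k \le ts_i^k + \Delta_i + (n - 1) \cdot \frac{S Q_i}{\alpha_i K_i} + Q_i + r \cdot Q_i$
$= Vs_i^k + d + (2 L_i + n - 1) \cdot \frac{S Q_i}{\alpha_i K_i} + (1 + r) \cdot Q_i$ (a)
$Ve_i^k = Vs_i^k + (n + r) \cdot \frac{S Q_i}{\alpha_i}$ (b)
$\frac{(b) - (a)}{Q_i} = S \cdot \frac{n + r}{\alpha_i} - \frac{S}{K_i} \cdot \frac{2 L_i + n - 1}{\alpha_i} - 1 - r - \frac{d}{Q_i}$ (c)
(c) $\ge 0 \Rightarrow Ve_i^k \ge te_i^k$
$ts_i^k > Vs_i^k$, $te_i^k \le Ve_i^k \Rightarrow \exists\, t_1 \in (ts_i^k, te_i^k]$, $V_i(t_1) = t_1$

In reality, $p_i^k$ can probably only be estimated, so that $r$ in lemma 4.3.4 is unknown. Assuming $r = 0$, lemma 4.3.5 can be applied.

Lemma 4.3.5 $\forall\, \gamma_i^k$, if $ts_i^k - Vs_i^k = d > 0$, $p_i^k > Q_i$ and $S \cdot \frac{n}{\alpha_i} - \frac{S}{K_i} \cdot \frac{2 L_i + n - 1}{\alpha_i} - 1 - \frac{d}{Q_i} \ge 0$, where $n = \lfloor \frac{p_i^k}{Q_i} \rfloor$, then $\exists\, t_1 \in (ts_i^k, te_i^k]$, $V_i(t_1) = t_1$.

Proof: The proof is similar to that of lemma 4.3.4.

Lemma 4.3.6 proves that $\forall\, \gamma_i^k$, if it has been caught up (even though it started late), its finish delay is $\le \Delta_i$.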
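Lemma 4.3.6 $\forall\, \gamma_i^k$, if $\exists\, t_0 \in [ts_i^k, te_i^k]$, $V_i(t_0) = t_0$, then $te_i^k \le Ve_i^k + \Delta_i$.

Proof: Assuming $t_0$ is the latest time in $[ts_i^k, te_i^k]$ at which $V_i(t_0) = t_0$, that is, $\nexists\, t_1 \in (t_0, te_i^k]$, $V_i(t_1) = t_1$, the proof is divided into the following three conditions.

1) If $\nexists\, t_0'$, $t_0' \rightarrow i$, $t_0 \in [t_0', t_0' + Q_i]$:
case1.1 is applied at $t_0 \Rightarrow \forall\, t \in (t_0, te_i^k]$, if $t \rightarrow i$, then $t \Rightarrow i$. It means that $simcom_i$ will use every given $Q$ after $t_0$ until $\gamma_i^k$ is finished.
Assuming $s_i(te_i^k) - s_i(t_0) = (n + r) \cdot Q_i$ ($n \in \mathbb{N}$, $r \in \mathbb{R}$, $n = \lfloor \frac{s_i(te_i^k) - s_i(t_0)}{Q_i} \rfloor$), corollary 4.3.1 $\Rightarrow te_i^k \le t_0 + r \cdot Q_i + n \cdot \frac{Q_i S}{\alpha_i K_i} + \Delta_i$
$Ve_i^k = V_i(t_0) + (n + r) \cdot \frac{Q_i S}{\alpha_i} = t_0 + (n + r) \cdot \frac{Q_i S}{\alpha_i}$
$te_i^k - Ve_i^k \le \Delta_i + [\frac{n S}{\alpha_i} \cdot (\frac{1}{K_i} - 1) + r \cdot (1 - \frac{S}{\alpha_i})] \cdot Q_i \le \Delta_i$

2) If $\exists\, t_0'$, $t_0' \rightarrow i$, $t_0 \in [t_0', t_0' + Q_i]$ and $t_0' \Rightarrow i$:
$V_i(t_0) = t_0 \Rightarrow V_i(t_0' + Q_i) \ge t_0' + Q_i$
$\nexists\, t \in (t_0, te_i^k]$ such that $V_i(t) = t \Rightarrow \forall\, t \in (t_0' + Q_i, te_i^k]$, $V_i(t) \ge t \Rightarrow te_i^k \le Ve_i^k$

3) If $\exists\, t_0'$, $t_0' \rightarrow i$, $t_0 \in [t_0', t_0' + Q_i]$ but $t_0' \nRightarrow i$:
case1.1 cannot be applied at $t_0' \Rightarrow V_i(t_0') \ge t_0' + Q_i$
$V_i(t_0) = V_i(t_0')$, $t_0 = V_i(t_0) \Rightarrow V_i(t_0') = t_0 \ge t_0' + Q_i$
$t_0 \in [t_0', t_0' + Q_i] \Rightarrow t_0 \le t_0' + Q_i \Rightarrow t_0 = t_0' + Q_i$
case1.1 can be applied at $t_0 \Rightarrow \forall\, t \in (t_0, te_i^k]$, if $t \rightarrow i$, then $t \Rightarrow i$.
Assuming $s_i(te_i^k) - s_i(t_0) = (n + r) \cdot Q_i$ ($n \in \mathbb{N}$, $r \in \mathbb{R}$, $n = \lfloor \frac{s_i(te_i^k) - s_i(t_0)}{Q_i} \rfloor$), corollary 4.3.1 $\Rightarrow te_i^k \le t_0 + r \cdot Q_i + n \cdot \frac{Q_i S}{\alpha_i K_i} + \Delta_i$
$Ve_i^k = V_i(t_0) + (n + r) \cdot \frac{Q_i S}{\alpha_i} = t_0 + (n + r) \cdot \frac{Q_i S}{\alpha_i}$
$te_i^k - Ve_i^k \le \Delta_i + [\frac{n S}{\alpha_i} \cdot (\frac{1}{K_i} - 1) + r \cdot (1 - \frac{S}{\alpha_i})] \cdot Q_i \le \Delta_i$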
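Theorem 4.3.2 $\forall\, simcom_i$, define

$$P_i = \min\{p_i^k \mid 1 \le k \le |\Gamma_i^C|\} \qquad (4.3)$$

$$G_i = \begin{cases} 2 \cdot \alpha_i \cdot Q_i, & \text{if } P_i > Q_i \\ \alpha_i \cdot (P_i + Q_i), & \text{otherwise} \end{cases} \qquad (4.4)$$

$$A_i = \begin{cases} \lfloor P_i / Q_i \rfloor \cdot Q_i, & \text{if } P_i > Q_i \\ P_i, & \text{otherwise} \end{cases} \qquad (4.5)$$

$$f_i(K_i) = \begin{cases} \frac{(\lfloor P_i / Q_i \rfloor + 4 L_i - 1) \cdot Q_i}{K_i}, & \text{if } P_i > Q_i \\ \frac{2 L_i Q_i}{K_i}, & \text{otherwise} \end{cases} \qquad (4.6)$$

$\forall\, simhw_j \in S_H$, define

$$S_j = \sum_{simcom_i \in simhw_j} K_i \cdot \alpha_i \qquad (4.7)$$

If $\forall\, simcom_i$, Equ. (4.8) and (4.9) are true, then $\forall\, \gamma_i^k$, $te_i^k \le Ve_i^k + \Delta_i$.

$$S \ge \frac{G_i}{A_i - f_i(K_i) - \alpha_i \cdot \max\{\frac{2 L_j Q_j}{K_j \alpha_j} \mid simcom_j \in \Psi_i\}} \qquad (4.8)$$

$$S \ge \max\{S_j \mid simhw_j \in S_H\} \qquad (4.9)$$

Proof: define

$$D_i = \Delta_i + Q_i + \max\{\Delta_j \mid simcom_j \in \Psi_i\} \qquad (4.10)$$

From lemma 4.3.3, 4.3.5 and 4.3.6, we can prove that $\forall\, \gamma_i^k$, if $ts_i^k - Vs_i^k \le D_i$, then theorem 4.3.2 is true.

Next we prove that $\forall\, \gamma_i^k$ and $\forall\, \varepsilon_j^l \in E_i^k$ ($E_i^k \ne \emptyset$), if $te_j^l \le Ve_j^l + \Delta_j$, then $ts_i^k \le Vs_i^k + D_i$. The proof is divided into the following two conditions.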
1) If $t\varepsilon_i^k \le Vs_i^k$:
lemma 4.3.1 $\Rightarrow ts_i^k - Vs_i^k \le \Delta_i + Q_i \le D_i$.
2) If $t\varepsilon_i^k > Vs_i^k$:
$\exists\, \varepsilon_m^l \in E_i^k$ such that $t\varepsilon_i^k = te_m^l \le Ve_m^l + \Delta_m$
$Ve_m^l \le Vs_i^k$ and $\Delta_m \le \max\{\Delta_j \mid simcom_j \in \Psi_i\} \Rightarrow t\varepsilon_i^k \le Vs_i^k + \max\{\Delta_j \mid simcom_j \in \Psi_i\}$
Lemma 4.3.2 $\Rightarrow ts_i^k \le t\varepsilon_i^k + \Delta_i + Q_i \Rightarrow ts_i^k \le Vs_i^k + \Delta_i + Q_i + \max\{\Delta_j \mid simcom_j \in \Psi_i\} \Rightarrow ts_i^k \le Vs_i^k + D_i$.

Finally we prove that $\forall\, \gamma_i^1$ with $E_i^1 = \emptyset$, $te_i^1 \le Ve_i^1 + \Delta_i$. $E_i^1 = \emptyset \Rightarrow ts_i^1 = Vs_i^1 = 0$. Lemma 4.3.6 $\Rightarrow te_i^1 \le Ve_i^1 + \Delta_i$.

$\forall\, \gamma_i^k$, Equ. (4.10) defines its biggest start delay, in which $\Delta_i + Q_i$ is caused by the scheduler for $simcom_i$ and $\max\{\Delta_j \mid simcom_j \in \Psi_i\}$ is introduced by the simcoms which send event(s) to $simcom_i$ to trigger $\gamma_i^k$.

4.4 Find $K_i$ and $S$

The proof of theorem 4.3.2 shows that $\forall\, \gamma_i^k$, its start time can be delayed at most by $D_i$ as defined by Equ. (4.10) and its finish time can be delayed at most by $\Delta_i$, as long as Equ. (4.8) and (4.9) hold for $simcom_i$. This section solves the problem of finding $S$ and an appropriate $K_i$ for each $simcom_i$. It is desirable to find the minimal $S$ since a bigger $S$ means a slower simulation speed. If the number of simcoms to be considered is small and no $\alpha_i$ is too small, exhaustive searching can be performed by increasing $K_i$ until Equ. (4.8) and (4.9) are met. Fig. 4.5, 4.6 and 4.7 provide a 3-step heuristic algorithm to quickly find a sub-optimal solution.

$\forall\, simcom_i$, function $\delta_i(j, K_j)$ is defined as the part of the job start delay introduced by the triggering event(s) sent by $simcom_j$, scaled by $\alpha_i / S$.
8 simomi, dene funtionÆi(j;Kj) = i  2Lj QjKj j , if simomj 2 ii(i) = j if Æi(j;Kj) = maxfÆi(l;Kl)jsimoml 2 iginitially:8 simomi, Ki := 1;8 simhwj, Sj :=Pi; (simomi 2 simhwj)step1:for i = 1 : jH jj := i(i); Ki := maxfdfi(1)Ai e; d Æi(j;Kj)Ai eg;end for Figure 4.5: Find Ki and S: Step1Funtion i(i) is the simom whih aused the maximum delay. Step1 nds theinitial value of Ki to start searhing from for eah simomi. It makes sure thatAi  fi(Ki) and Ai  Æi(j;Kj) (j = i(i)). Otherwise the denominator of Equ.(4.8) is always negative and Equ. (4.8) an never hold.The objetive of Step2 is to make the denominator of Equ. (4.8) positivefor eah simom by inreasing the K parameters. For a partiular simomi, theoptimal solution might inrease both Ki and Kl (l = i(i)) by some amount buttheir exat values are unknown. The algorithm omputes di (dl) whih is the amountto inrease to make the denominator > 0 by only inreasing Ki (Kl), and heksinreasing whih one will potentially make S inrease less. If inreasing Ki (Kl) isbetter, Ki (Kl) is inreased by ddi! e (ddl! e) where ! is the adjustable step size totrade o searh speed against auray. Step2 exits if 8 simomi, the denominatorof Equ. (4.8) is > 0.Step3 nds the nal solution. It is similar to Step2. The main dierene isthat dfi and dfl in Step3 are obtained by solving the seond-order equations whiledi and dl in Step2 are obtained by solving the linear equations.
for $i = 1 : |H|$
start2:
    $l := \phi_i(i)$;
    if $A_i - f_i(K_i) - \delta_i(l, K_l) < 0$
        $d_i := \lceil K_i \cdot [\frac{f_i(K_i)}{A_i - \delta_i(l, K_l)} - 1] \rceil$; $s_i := S_j + d_i \cdot \alpha_i$; ($simcom_i \in simhw_j$)
        $d_l := \lceil K_l \cdot [\frac{\delta_i(l, K_l)}{A_i - f_i(K_i)} - 1] \rceil$; $s_l := S_k + d_l \cdot \alpha_l$; ($simcom_l \in simhw_k$)
        if $s_i > s_l$
            $d_i := \lceil d_i / \omega \rceil$; $K_i := K_i + d_i$; $S_j := S_j + d_i \cdot \alpha_i$;
        else
            $d_l := \lceil d_l / \omega \rceil$; $K_l := K_l + d_l$; $S_k := S_k + d_l \cdot \alpha_l$;
        end if
        goto start2;
    end if
end for
Figure 4.6: Find $K_i$ and $S$: Step2

4.5 Assign simcoms to $S_H$

4.5.1 Introduction

This section addresses the problem of assigning all the simcoms to the set of available legacy DSPs ($S_H$). If the numbers of simcoms and simhws are small enough, an exhaustive search can be used to find the optimal assignment. Otherwise, the heuristic algorithm in [95] can be applied to quickly find a suboptimal solution which is close to the optimal solution in most cases.

$\forall\, simhw_j$, the actual simulation time spent on it includes the following three portions: 1) computations made by all simcom(s) assigned to it, 2) scheduling overhead and 3) blocking cost due to serialized access to the SRIO link as a common resource. It can be expressed by Equ. (4.11)-(4.14), where $t_{j1}$, $t_{j2}$ and $t_{j3}$ correspond to portions 1), 2) and 3) respectively.

$$t_j = t_{j1} + t_{j2} + t_{j3} \qquad (4.11)$$
for $i = 1 : |H|$
start3:
    $l := \phi_i(i)$; $S := \max\{S_j \mid simhw_j \in S_H\}$
    if $S < \frac{G_i}{A_i - f_i(K_i) - \delta_i(l, K_l)}$
        find $df_i$ in $S_j + df_i \cdot \alpha_i = \frac{G_i}{A_i - f_i(K_i + df_i) - \delta_i(l, K_l)}$;
        $d_i := \lceil df_i \rceil$; $s_i := S_j + d_i \cdot \alpha_i$; ($simcom_i \in simhw_j$)
        find $df_l$ in $S_k + df_l \cdot \alpha_l = \frac{G_i}{A_i - f_i(K_i) - \delta_i(l, K_l + df_l)}$;
        $d_l := \lceil df_l \rceil$; $s_l := S_k + d_l \cdot \alpha_l$; ($simcom_l \in simhw_k$)
        if $s_i > s_l$
            $d_i := \lceil d_i / \omega \rceil$; $K_i := K_i + d_i$; $S_j := S_j + d_i \cdot \alpha_i$;
        else
            $d_l := \lceil d_l / \omega \rceil$; $K_l := K_l + d_l$; $S_k := S_k + d_l \cdot \alpha_l$;
        end if
        goto start3;
    end if
end for
Figure 4.7: Find $K_i$ and $S$: Step3

$$t_{j1} = \sum_{simcom_i \in simhw_j} \sum_{k=1}^{|\Gamma_i^C|} p_i^k \qquad (4.12)$$

$$t_{j2} = s \cdot \sum_{simcom_i \in simhw_j} \lfloor \frac{t_{end}}{Q_j} \rfloor \qquad (4.13)$$

$$t_{j3} = q \cdot \sum_{simcom_i \in simhw_j} |\Gamma_i^C| \cdot [0.5 \cdot (|S_H| - 1) + 1] \qquad (4.14)$$

Equ. (4.13) and Equ. (4.14) need some explanation. In Equ. (4.13), $\lfloor \frac{t_{end}}{Q_j} \rfloor$ is the total number of scheduler executions made during the simulation. The cost of each execution is proportional to the number of simcoms on $simhw_j$ ($\sum_{simcom_i \in simhw_j} 1$), which is true for most practical scheduler implementations [167]. Using $s$ to denote the unit cost when there is only one simcom on $simhw_j$, the total scheduling overhead can be expressed by $t_{j2}$. In Equ. (4.14), $q$ represents the cost each time a simcom exclusively utilizes the SRIO link to send event(s) to other simcoms when a job is done. Without considering the communication details
between all the simoms, q Psimomi2simhwj jC i j represents the total number of timesthat simhwj will use the SRIO link. Assuming the probability of obtaining the link isequal for all ompeting simhws, it an be shown that eah time the average blokingdelay for simhwj to get the link an be approximated by q [0:5 (jSH j  1)℄ [95℄.Therefore, tj3 represents the total ost that simhwj waits for and uses the SRIOlink. The objetive funtion for the optimal assignment is expressed as Equ. (4.15),whih balanes the atual simulation ost among all simhws.min simhwjmax2SH tj (4.15)4.5.2 Algorithm Finding Suboptimal Assignment SolutionSolving Equ. (4.15) is a binary integer programming problem. Norman in [138℄showed that mapping parallel algorithms onto parallel arhitetures is an NP-hardproblem in all but restrited ases. To obtain the optimal solution, an exhaustivesearh takes O(jSH jjH j) iterations. In ase the number of simoms and simhws arenot large, suh an exhaustive searh method an be taken. Otherwise, the heuristialgorithm needs to be applied to nd the suboptimal solutions lose to the optimalwithin signiantly redued time. Broadly speaking, most heuristi task assignmentalgorithms an be lassied into three ategories: Loal searh algorithm, Genetiapproahes and Greedy algorithms. Our algorithm is similar to the greedy algorithmwhih starts from an empty solution set, repeatedly hooses a simom based onertain seletion strategy and assigns it to a simhw to obtain a partial solution. Thealgorithm ontinues until all the simoms have been assigned. The key dierenebetween our algorithm and the greedy one is that our algorithm keeps a boundednumber of partial solutions after assigning a simom. By adjusting the bound, ithas the exibility to trade o between the goodness of the obtained solution andthe searh time. 104















Figure 4.8: Assigning 3 simcoms to 3 simhws

$$|A_{\langle n,n,0 \rangle}| = |B_{\langle n-1,n-1,0 \rangle}| = 1 \qquad (4.17)$$

$$|A_{\langle n,n,k \rangle}| = |B_{\langle n-1,n-1,k \rangle}| + |B_{\langle n-1,n-1,k-1 \rangle}| \cdot (n - k) \quad (1 \le k \le n - 1) \qquad (4.18)$$

Equ. (4.17) is easy to understand: to assign $n$ simcoms to $n$ simhws, each of which has at least one simcom, there is only one possibility, namely that each simhw gets exactly one simcom.

To see how Equ. (4.18) is obtained, consider the first right-hand-side term $|B_{\langle n-1,n-1,k \rangle}|$, which is the number of $\langle n-1, n-1, k \rangle$ partial solutions being kept. Each of them becomes an $\langle n, n, k \rangle$ partial solution if $simcom_n$ is assigned to $simhw_n$. Now consider the term $|B_{\langle n-1,n-1,k-1 \rangle}|$, which is the number of $\langle n-1, n-1, k-1 \rangle$ partial solutions kept. Each of them can become an $\langle n, n, k \rangle$ partial solution if $simcom_n$ is assigned to any of the $(n - k)$ busy simhws. Fig. 4.8 illustrates the idea when $n = 3$.
2) If $|S_H| < n$,
there is no backup simhw available, and the problem is to assign $simcom_n$ to any of the existing $|S_H|$ simhws. Similarly, we can show that

$$|A_{\langle n,|S_H|,k \rangle}| = |B_{\langle n-1,|S_H|,k \rangle}| \cdot (|S_H| - k) + |B_{\langle n-1,|S_H|,k+1 \rangle}|, \quad 0 \le k \le |S_H| - 1 \qquad (4.19)$$

The explanation above implies that the search space grows exponentially with $|S_H|$ and $|H|$. To bound the search space, $|B_{\langle n,m,k \rangle}|$ is set by Equ. (4.20). That is, at most $C$ of the most promising $\langle n, m, k \rangle$ partial solutions are kept in each step. By adjusting $C$, our algorithm has the flexibility to trade off the goodness of the solution against the search time.

$$|B_{\langle n,m,k \rangle}| = \min\{|A_{\langle n,m,k \rangle}|, C\} \qquad (4.20)$$

Simcoms are selected in descending order of their effects on the objective function. The effect of a $simcom_i$ is defined in Equ. (4.21), where the first part of the effect is the total computation time and the second part is approximately the time spent utilizing the SRIO link. The larger the effect is, the more its assignment will affect the value of Equ. (4.15).

$$\sum_{k=1}^{|\Gamma_i^C|} (p_i^k + q) \qquad (4.21)$$

The algorithm is summarized as follows.
1. Select 2 simcoms and calculate $A_{\langle 2,2,0 \rangle}$ and $A_{\langle 2,2,1 \rangle}$.
2. If all $|H|$ simcoms have been assigned, return the best solution. Otherwise go to step 3.
3. Select the next simcom, denoted as $simcom_n$.
(a) If there is a backup simhw, bring it in. $\forall\, k$ ($0 \le k \le n - 1$), compute $A_{\langle n,n,k \rangle}$ and keep the most promising $|B_{\langle n,n,k \rangle}|$ partial solutions.
(b) Otherwise $\forall\, k$ ($0 \le k \le |S_H| - 1$), compute $A_{\langle n,|S_H|,k \rangle}$ and keep the most promising $|B_{\langle n,|S_H|,k \rangle}|$ partial solutions.
4. Go to step 2.

The number of search operations in step 3 is in the order of $O(|S_H|^2 \cdot C)$, as implied by Equ. (4.18) and Equ. (4.19). Thus, the total number of search operations to find the suboptimal assignment solution is in the order of $O(|H| \cdot |S_H|^2 \cdot C)$.

4.6 Implementation

The two-level scheduler itself incurs simulation overhead and needs to be implemented as efficiently as possible. Since in most cases fixed point DSPs will be used as simhw, the implementation should avoid floating point calculation and the division operation.

To implement the original Tijdeman's algorithm shown in Fig. 4.3, $\frac{K_i \alpha_i}{S}$ for each $simcom_i$ is converted to $q_i$, which is in Q12 format. $\bar{q}_i$ is the inverse of $q_i$, also converted to Q12 format. The implementation is shown in Fig. 4.9.

$q_i := \frac{K_i \alpha_i}{S} \cdot 2^{b_i}$; ($2^{11} \le q_i < 2^{12}$)
$\bar{q}_i := \frac{S}{K_i \alpha_i} \cdot 2^{24 - b_i}$;
$simcom_i$ is eligible if $lag_i(t) + (L_i - 1) \cdot 2^{b_i} + q_i \ge 0$
$d_i(t) := [L_i \cdot 2^{b_i} - lag_i(t)] \cdot \bar{q}_i$ for eligible $simcom_i$
if $t \rightarrow i$
    $lag_i(t) := lag_i(t) + q_i - 2^{b_i}$;
else
    $lag_i(t) := lag_i(t) + q_i$;
end if
Figure 4.9: Tijdeman's Algorithm Implementation

$\forall\, simcom_i$, $V_i(t)$ needs to be updated by the second level scheduler as shown in Fig. 4.4. Assuming $\frac{S}{\alpha_i} < 2^{16}$, it is converted to $\lambda_i$ in Q16 format. After $simcom_i$ uses a $Q$ in $[t_0, t_1]$ ($u_i[t_0, t_1]$), $V_i(t)$ can be updated as in Equ. (4.22).
$\lambda_i := \frac{S}{\alpha_i} \cdot 2^{\beta_i}$; ($2^{15} \le \lambda_i < 2^{16}$, $\beta_i \in \mathbb{N}$)

$$V_i(t) := V_i(t) + (t_1 - t_0) \cdot \lambda_i \cdot 2^{-\beta_i} \qquad (4.22)$$

Code needs to be inserted into the application SW to collect and update specific performance information. Simulation accuracy can be severely impacted if the time spent on such code is charged to the application execution time, especially when detailed performance information is necessary. RTSP is implemented to allow the application to notify it when to start and stop charging time usage, so that the performance-collection time can be excluded.

Each simcom needs to provide the following two callback functions to the simulation scheduler: preschedule() and postschedule(). When a simcom is allowed to use an assigned $Q$, the simulation scheduler calls its preschedule(), in which the simcom can simply be resumed or an RTOS model can be executed. After it has used the $Q$, its postschedule() is called. This is the place where the simcom can update its state other than stopping itself.

The legacy DSP used as simhw typically only supports 32-bit operations, while it is often necessary to maintain 64-bit time information. For example, if the clock frequency is 1 GHz, a 32-bit timer is only able to last 4.3 s without rolling over. RTSP is implemented to support maintaining 64-bit time information on the 32-bit architecture.

Theorem 4.3.2 yields the delay bound for the worst case in the sense that $\forall\, \gamma_i^k$, 1) its length is always equal to the minimal length; 2) its start time is always delayed by $D_i$ as shown in Equ. (4.10); and 3) its finish time is always delayed by $\Delta_i$. In reality such a worst case will probably not always happen. Therefore, the user can reduce the $K$s and $S$ to speed up the simulation while still maintaining the fidelity. RTSP is implemented to report warnings when the delay bound is violated because of the reduced $K$ and $S$ values.
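The fixed-point virtual time update of Equ. (4.22) can be illustrated with the short Python sketch below. It is only a sketch under assumptions: the helper names `to_q16_scale` and `advance_virtual_time`, the variable names `factor` and `beta`, and the sample ratio are hypothetical. The scaling step (which involves a floating point value) is assumed to run once offline, while the per-quantum update uses only an integer multiply and a shift.

```python
def to_q16_scale(ratio):
    """Return (factor, beta) with factor = floor(ratio * 2**beta) in [2**15, 2**16)."""
    if not 0 < ratio < 2 ** 16:
        raise ValueError("ratio must lie in (0, 2**16)")
    beta = 0
    # Scale up until one more doubling would leave the 16-bit range.
    while ratio * 2 ** (beta + 1) < 2 ** 16:
        beta += 1
    return int(ratio * 2 ** beta), beta


def advance_virtual_time(v, t0, t1, factor, beta):
    """Integer-only update V := V + (t1 - t0) * factor * 2**-beta."""
    return v + (((t1 - t0) * factor) >> beta)


# Hypothetical example: S = 6 and alpha_i = 0.75, so S / alpha_i = 8.
factor, beta = to_q16_scale(6 / 0.75)
v = advance_virtual_time(0, 0, 100, factor, beta)   # 100 time units used
```

Keeping the factor in the upper half of the 16-bit range preserves as many significant bits as possible before the shift truncates the product.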































(b) PING/PONG buffering scheme
Figure 4.10: Audio Application

The application processes a block of input samples at a time. The program resides in the L1P cache. A processing buffer is allocated in L1 memory, which holds all the data to be processed at this time. A circular buffer is allocated in the SDRAM, which contains the delay samples to be processed and mixed with the current input block. It is updated in FIFO order in the sense that the latest delay samples always replace the oldest ones.
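The FIFO update of the circular delay buffer can be sketched as follows. This is a minimal illustration, not the application code: the class name DelayLine, the method push_block and the use of a Python list in place of the SDRAM buffer are assumptions, and blocks are assumed to be no longer than the buffer.

```python
class DelayLine:
    """Fixed-size circular buffer of delay samples, updated in FIFO order."""

    def __init__(self, size):
        self.samples = [0] * size      # stands in for the buffer in SDRAM
        self.head = 0                  # index of the oldest sample

    def push_block(self, block):
        """Return the oldest len(block) samples and overwrite them with block."""
        n = len(self.samples)
        oldest = []
        for x in block:
            oldest.append(self.samples[self.head])
            self.samples[self.head] = x
            self.head = (self.head + 1) % n
        return oldest
```

Each call hands back the delay samples that would be mixed with the current input block, while the new samples take their place.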




































Figure 4.11: Simulation Hardware

cycles to be emulated on C6455, while on C6727 a native floating-point operation takes one cycle only.

$\frac{speed(simcom_1)}{speed(C6727)} = \frac{1000 / (74\% + 70 \times 26\%)}{250} = \frac{1}{4.74}$ (4.23)

Finding the speed ratio for $simcom_2$ is more complicated. For FIFOW, the dmax is configured to transfer 4 samples at a time. Reading each sample from L1 memory and writing it to SDRAM take 1 and 4 DSP cycles, respectively. Managing the 4-sample transfer takes 6 cycles. These numbers can be obtained from the HW design specification. For $simcom_2$, reading a sample from L2 memory and writing it to SDRAM take 6 and 44 DSP cycles, respectively. Managing the transfer takes about 420 cycles. These numbers are from benchmarks. Therefore, the ratio for FIFOW can be calculated by Equ. (4.24).

$\frac{speed(simcom_{2,FIFOW})}{speed(dmax_{FIFOW})} = \frac{1000}{250} \times \frac{(1 + 4) + 6/4}{(6 + 44) + 420/4} = \frac{1}{5.96}$ (4.24)

The calculation for FIFOR is similar, but the difference is that dmax transfers 8 samples at a time, which makes the speed ratio 4.46 as shown in Equ. (4.25). To get a constant ratio between dmax and $simcom_2$, 5.96 is chosen. An appropriate number of delay cycles is inserted after each FIFOR to make its ratio equal to 5.96.

$\frac{speed(simcom_{2,FIFOR})}{speed(dmax_{FIFOR})} = \frac{1000}{250} \times \frac{(1 + 4) + 6/8}{(6 + 44) + 420/8} = \frac{1}{4.46}$ (4.25)
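The arithmetic behind Equ. (4.23) to (4.25) can be reproduced with a few lines of code. The sketch below is only a check of the numbers quoted in the text: the helper names emulation_ratio and transfer_ratio and their parameter names are assumptions, and the 74%/26% split is read as the share of operations that run natively versus those that must be emulated at 70 cycles each.

```python
def emulation_ratio(host_mhz, target_mhz, native_fraction, emulated_cycles):
    """Speed of a simcom relative to the real core when a fraction of the
    operations must be emulated (the structure of Equ. (4.23))."""
    avg_cycles = native_fraction + emulated_cycles * (1.0 - native_fraction)
    return (host_mhz / avg_cycles) / target_mhz


def transfer_ratio(host_mhz, target_mhz, hw_cycles, hw_mgmt,
                   sim_cycles, sim_mgmt, samples):
    """Speed of a simulated transfer relative to the real one
    (the structure of Equ. (4.24) and (4.25))."""
    hw_cost = hw_cycles + hw_mgmt / samples        # real HW cycles per sample
    sim_cost = sim_cycles + sim_mgmt / samples     # simulation cycles per sample
    return (host_mhz / target_mhz) * hw_cost / sim_cost


# Slowdown factors, i.e. the denominators 4.74, 5.96 and 4.46 in the text.
slowdown_simcom1 = 1 / emulation_ratio(1000, 250, 0.74, 70)
slowdown_fifow = 1 / transfer_ratio(1000, 250, 1 + 4, 6, 6 + 44, 420, 4)
slowdown_fifor = 1 / transfer_ratio(1000, 250, 1 + 4, 6, 6 + 44, 420, 8)
```

The FIFOR slowdown comes out smaller than the FIFOW one, which is why delay cycles are added to FIFOR to bring both to the same ratio.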
Given Equ. (4.23) and (4.24), the minimal speed ratio between rhw execution and simulation is 5.96, that is, $R = 5.96$. $\alpha_1$ and $\alpha_2$ are computed by Equ. (4.26).

$\alpha_1 = \frac{4.74}{5.96} = 0.80$; $\alpha_2 = 1.0$ (4.26)

Before calculating $K_1$, $K_2$ and $S$, the minimal job length ($P_1$ and $P_2$) for each simcom and the Q size need to be known. $P_1$ is profiled to be about 175 µs. Given the fact that FIFOR and FIFOW transfer 14 and 26 blocks respectively each time, with the block size being 32, $P_2$ can be estimated using Equ. (4.27).

$P_2 = FIFOW + FIFOR = [(6 + 44) + \frac{420}{4}] \times (14 + 26) \times 32 = 198.4\,\mu s$ (4.27)

The Q size of the operating system (DSP/BIOS) [15] on C6455 is 1 ms. $S$, $K_1$ and $K_2$ are calculated to be 16.0, 20 and 16 respectively, using the search algorithm described in Section 4.4. To speed up the simulation by reducing Q to 100 µs, the timer on C6455 is re-programmed. $S$, $K_1$ and $K_2$ are reduced to 5.6, 7 and 5 respectively. They can be further reduced because of the pessimistic assumption of Theorem 4.3.2. For example, if Q = 100 µs, the simulation shows that using $K_1 = 2$, $K_2 = 1$ and $S = 1.6$ can still keep the delay in bound.

The simulation result is shown in Tab. 4.1. The total overhead, including both the first- and second-level schedulers, is less than 1 µs on average. The result for $simcom_1$ (process) is accurate for both cases. When Q = 100 µs, the average transfer time for both FIFOR and FIFOW is still accurate, but the minimal and maximal times show errors of 20% to 50%. This is expected, since the behavior model of $simcom_2$ cannot match the actual HW from the timing perspective, i.e. the 420 cycles shown in Equ. (4.24) for transfer management is the average, not the min/max number. When Q = 1 ms, the average times for both FIFOR and FIFOW shown by simulation are larger than the rhw execution numbers. The reason can be explained as follows. For the rhw execution scenario, FIFOW of PING/PONG round overlaps
only partially with FIFOR of PONG/PING round, as shown in Fig. 4.10-(b). During simulation, when Q is large, FIFOR and FIFOW are always started at the same time when a new Q can be used by $simcom_2$, which makes FIFOW overlap completely with FIFOR. More overlap increases the time for both transfers. A small Q size not only can speed up the simulation but also improves accuracy in this case.

                 Q = 100 µs, $\theta$ = 9.6     Q = 1 ms, $\theta$ = 96
                 K1 = 2, K2 = 1             K1 = 20, K2 = 16
(µs)             min     max     avg        min     max     avg
FIFOR            62.0    123     92.2       104     124     117
FIFOW            21.0    70.7    33.6       21.5    65.1    51.0
process          34.0    41.3    37.7       32.9    38.9    36.4
schedule         0.466   1.746   0.855      0.472   1.728   0.853

C6727 execution (µs)    FIFOR   FIFOW   process
min                     90.1    30.5    36.8
max                     106     46.0    46.8
avg                     93.4    34.0    36.9

Table 4.1: Audio Simulation Result

4.7.2 Video Application

The second application is an MPEG2 decoder running on the TMS320DM642 processor [21], which has an architecture almost identical to that of C6455, except that DM642 is clocked at 600 MHz. Each video frame is decoded MB by MB. The reference frame resides in SDRAM. When decoding the current MB, a reference MB needs to be transferred from SDRAM to on-chip L2 memory for fast processing. After being decoded, the current frame becomes the new reference frame and is moved to SDRAM.

The simulation platform is the same. The decoding algorithm is simulated on DSP1 as $simcom_1$ and the DMA operation is simulated on DSP2 as $simcom_2$. The
referene frame resides in SDRAM of DSP2. To get a referene MB for simom1,the MB is moved from SDRAM to L2 memory of DSP2 by simom2, and thentransferred to L2 memory of DSP1 via SRIO. Dataow of updating the refereneframe is in reverse.Determining the speed ratio between simom1 and DM642 ore is straight-forward as shown in Equ. (4.28).speed(simom1)speed(DM642) = 1000600 = 10:6 (4.28)The ratio between simom2 and DM642 DMA is shown in Equ. (4.29).Transferring a MB is a two dimensional transfer of 1616 bytes. For DM642 DMA,the transfer management takes 20 yles for eah row. 4 bytes are aessed at atime to optimize the performane. Eah aess to L2 memory and SDRAM takes 2and 33 yles, respetively. For simom2, the management overhead for eah MB isabout 1120 yles. Similarly, 4 bytes are transferred a time. Aessing time to L2memory and SDRAM takes 6 and 44 yles respetively.speed(simom2)speed(DM642DMA) = 1000600  320 + 16  2  (2 + 33)1120 + 16  2  (6 + 44) = 11:14 (4.29)Given Equ. (4.28) and (4.29), R = 1:14, 1 = 0:53 and 2 = 1:0.The number of MBs to be deoded/transferred at a time is determined by thepre-enoded input video stream. The worst ase is one, whih makes P2 = 2:72sgiven by Equ. (4.29). P1 an only be estimated and it is typially smaller than P2.In this ase, we assume it is 2s.For Q = 100s, S, K1 and K2 are alulated to be 134, 253 and 134 re-spetively. For Q = 1ms, S, K1 and K2 are alulated to be 1322, 2503 and 1322respetively. Sine deoding/transferring one MB at a time rarely happens, simula-tion an be arried faster by reduing K1, K2 and S. The simulation result usingthe atual parameters are shown in Tab. 4.2.115
                 Q = 100 µs, $\theta$ = 30      Q = 1 ms, $\theta$ = 180
                 K1 = 50, K2 = 25           K1 = 300, K2 = 150
                 min     max     avg        min     max     avg
DMA (µs)         2.34    36.4    4.87       2.34    34.0    4.83
decode (ms)      11.7    25.2    22.0       11.5    24.7    21.8
schedule (µs)    0.394   5.338   0.933      0.412   4.788   0.922

DM642 execution  DMA (µs)   decode (ms)
min              2.36       11.3
max              35.5       24.6
avg              4.80       21.5

Table 4.2: Video Simulation Result

The simulation accuracy of this application is higher than that of the audio application. The simulation error of the decoder is < 4% due to the architectural similarity between DM642 and C6455. The simulation error of the DMA is < 5% even for the min/max case. This is because the DMA behavior is simple in this application and the SW behavior model is able to match the rhw well from the timing perspective.

4.7.3 simcom Assignment Result
























Figure 4.12: simom Assignment Result: 4 simhws116





















Figure 4.13: simom Assignment Searh Time: 4 simhwsIn this setion, we present some experimental results on the simom assign-ment algorithm. A random set of test ases are generated. jH j is in the range of[20; 128℄. Constant C is set in the range of [2; 500℄. 8 simomi, eah of its job lengthpki is a uniform distribution in [5; 300℄ and jC i j is set to 10. q and s are both set to1. 8 simhwj, the sheduler is set to be exeuted for 15 times. That is, b tendQj  = 15.























Figure 4.14: simom Assignment Result: 8 simhwsFig. 4.12 and 4.14 plot the number of simoms vs. the suboptimal solutionfound when the number of simhws is set to 4 and 8, respetively. Fig. 4.13 and117






















Figure 4.15: simom Assignment Searh Time: 8 simhws4.15 show the searh time measured on the Toshiba Satellite laptop with a 1:33GHzPentium III Cerelon CPU and 256 MB memory. The results demonstrate that thesuboptimal solutions found by our algorithm are lose to the optimal solutions inmost ases when C is set to 500, while the searh time is negligible.4.8 ConlusionTo the best of our knowledge, RTSP is the rst simulator that truly enables theappliation SW to be developed in parallel with HW. It oers the SW engineer theidential development environment as if the real HW (rhw) was available. Beauselegay DSP ores are used in the simulation, the appliation SW an be diretlyompiled and run on RTSP without worrying about instrution set ompatibility.The simulation omponents (simoms) are sheduled to progress in unison, modulobounded jitter. This eliminates unneessary synhronization between the simoms.A two-level sheduler is used with low overhead. The rst level sheduler is a real-time periodi sheduler that assigns quantums (Qs) to the simom(s). The seondlevel sheduler bounds the ompletion delay for eah job being simulated to ahievetiming auray. RTSP is proved to perform simulations faithfully and also is showed118




A trend in the consumer electronics market is the demand for new applications that have a lot of similarities to older applications, but the new ones impose more challenging and special-purpose performance requirements. In the DSP industry, this clearly reflects a transition from the design regime of general DSP to the application-specific DSP. From the design perspective, this means that the DSP core remains unchanged but more and more HW accelerators, DMAs and bus architectures need to be integrated into the chip. A key in effecting this transition is the engineering capability to make sure the design specification "matches" the application before detailed design starts. Therefore, application SW needs to be developed in parallel with HW to verify the design specification at the system level. Enabling development and simulation of SW before the actual HW is available also reduces the time-to-market period, which is another important benefit.

HW/SW co-simulation for design specification refinement imposes many challenging requirements on the simulation platform. HW and SW need to be integrated to carry out the simulation at the system level. Simulation results need to be accurate. Simulation speed should allow fast DSE and ease the debugging of complex application SW. HW and SW problems should be isolated cleanly, since in practice HW
and SW engineers often do not have enough expertise in one another's domains. The simulator should be cost-effective. These are often conflicting requirements and cannot be met by a single all-purpose simulator. Instead, this dissertation proposes three simulators for different usages.

An RTOS tool is presented to model the RTOS with application SW to help generate an initial specification. It is motivated by the fact that more and more tasks are chosen to be implemented as SW on a single DSP managed by an RTOS. Selecting the "right" RTOS before the SW is developed is very important. The proposed tool is implemented based on SystemC and is configurable to support modeling and timed simulation of most popular embedded RTOSes. Timing fidelity is achieved by using delay annotations. The OS timing information is derived from published benchmark data. Application timing information can be profiled or estimated from similar legacy applications. The optimized conservative approach is taken to synchronize simcoms. Compared to other research work, an important contribution of this tool is an online algorithm for predicting the timestamp of the next event based on the realistic assumption that multiple tasks execute concurrently on a processor, managed by a static or dynamic priority-driven scheduler. The simulation speed is more than 3 orders of magnitude faster than the commercial ISS with comparable accuracy.

The second tool is a system dataflow simulator (SDFS) to help HW engineers refine the HW specifications. It models the application by a parameter-driven conditional dataflow graph (CDFG) at the transaction level and the HW by a configurable HW graph at the cycle-accurate level. SDFS takes the application CDFG and HW graph as the input and carries out the simulation to catch the detailed HW activities. It only requires the HW engineers to understand the application SW at the CDFG level. 
To carry out the system simulation at such a low level, many commercial simulators need to couple an ISS for the application SW with an RTL simulator for the simcoms to model the rhws, which is typically 6 orders of magnitude slower than the rhw speed. The simulation error of SDFS is within 5% in most cases and the worst-case error is within 13%, which is comparable to the ISS+RTL approach. But our simulation speed is only 4 orders of magnitude slower than the rhw execution. Compared with other similar research work that also models the system at the CDFG level, SDFS can achieve higher simulation accuracy because of the following advantages: 1) it does not need a fixed application trace as input and thus is flexible enough to cover many simulation scenarios; 2) it does not assume a fixed cost for each block and thus is able to estimate the system performance under actual execution conditions; and 3) it is able to model the pipelined architecture common in modern DSPs. SDFS is cost-effective since it is implemented in the SystemC language and can be executed on almost any PC or workstation.

The third tool is a real-time simulation platform (RTSP) implemented on legacy DSPs. To the best of our knowledge, this is the first simulator that truly enables the application SW to be developed in parallel with HW by offering the same SW development environment as if the rhw were available. To simulate the behavior of a rhw module, a corresponding simcom is constructed running on a legacy DSP. The success of this simulation strategy hinges on a novel way to apply the concept of Real-Time Virtual Machines to simulation. Each legacy DSP employs a two-level scheduler to enforce the policy that each simcom carries out the simulation at a proportional speed ($1/\theta$) of the rhw, so that any job that would finish at time $t$ on the rhw will finish no later than $\theta \cdot t + \triangle$, where $\triangle$ is a constant bound. Such a feature eliminates expensive synchronization between the simcoms. RTSP is proved to perform simulations faithfully and also is shown to be effective by application to real industry applications. For a rhw whose timing behavior can be accurately modeled by the SW behavior model, the simulation error is shown to be < 5%. 
For very complicated rhw whose timing cannot be accurately captured by the behavior model, the simulation accuracy was shown to be excellent for the average case. The simulation speed is quite fast. For the selected audio and video applications,




AronymsASIC: appliation spei integrated iruitCAD: omputer-aided designCE: ondition edgeCFG: onditional-ow graphDE: data edgeCFSM: o-design nite state mahineDFG: dataow graphDFT: design-for-testDSE: design spae explorationEDA: eletroni design automationEVM: evaluation version moduleFIFO: rst-in-rst-outFIFOR: FIFO readFIFOW: FIFO writeFPGA: eld programmable gate arrayFSM: nite state mahineGOB: group of bloksGP: general purposeGPP: general purpose proessorHDL: hardware desription languageHW: hardware




• $\mathcal{H}$ is the set of real HW modules to be designed. The number of elements in $\mathcal{H}$ is denoted as $|\mathcal{H}|$ and $rhw_i$ is the $i$th element in $\mathcal{H}$. (notation 1.5.1)
• $\forall\, rhw_i \in \mathcal{H}$, $simcom_i$ is a simulation component modeling its behavior and timing characteristics at the specified abstract level. (notation 1.5.2)
• $\mathcal{SH}$ is the set of simulation HW on which simulation is being carried out. The number of elements in $\mathcal{SH}$ is denoted as $|\mathcal{SH}|$ and $simhw_i$ is the $i$th element in $\mathcal{SH}$. $\forall\, simcom_j$, it is simulated on a simhw in $\mathcal{SH}$. If $simcom_j$ is simulated on $simhw_i$, it is denoted as $simcom_j \in simhw_i$. (notation 1.5.3)
• $t_{end}$ is the time that simulation ends. (notation 1.5.4)
• func is a function block realizing certain functionality. It will be implemented on a rhw and simulated on the corresponding simcom. (notation 1.5.5)
• $T$ is a task which consists of one or multiple funcs and has its own execution context. (notation 1.5.6)
• $\forall$ func, its execution instance is called a job. $\forall\, simcom_i$, $\mathcal{C}_i$ is the set of job(s) to be simulated on it during $[0, t_{end}]$. The number of job(s) is denoted
as $|\mathcal{C}_i|$ and the $k$th job is denoted as $\tau_i^k$. (notation 1.5.7)
• $\mathbb{R}$ and $\mathbb{N}$ represent the real number set and the non-negative integer set, respectively. (notation 1.5.8)

B.1 Notations in RTSP

• $\theta = R \times S$ is the ratio between the speed of $\mathcal{H}$ and simulation, where $R$ is the minimal achievable ratio determined by $\mathcal{SH}$ without considering bounding delay, and $S$ is the slowdown factor to bound delay. (notation 4.3.1)
• $\alpha_i$ and $K_i$ are the scheduling parameters of $simcom_i$. It progresses at rate $\frac{\alpha_i}{S}$ but receives supply from its simhw at rate $\frac{\alpha_i K_i}{S}$. $K_i \in \mathbb{N}$ and $K_i \ge 1$. (notation 4.3.2)
• $Q$ is the scheduling quantum determined by a simhw. $Q_i$ denotes the quantum for $simcom_i$. (notation 4.3.3)
• $L_i$ measures the difference between the actual and normal supply of $simcom_i$. (notation 4.3.4)
• $V_i(t)$ is $simcom_i$'s virtual time. (notation 4.3.5)
• $\tau_i^k = \{p_i^k, E_i^k, t\varepsilon_i^k, (ts_i^k, Vs_i^k), (te_i^k, Ve_i^k)\}$ (notation 4.3.6)
  $\tau_i^k$ is the $k$th job on $simcom_i$.
  – $ts_i^k$ ($te_i^k$): actual start (finishing) time of $\tau_i^k$.
  – $Vs_i^k$ ($Ve_i^k$): virtual start (finishing) time of $\tau_i^k$.
  – $p_i^k$: length of $\tau_i^k$.
  – $E_i^k$: the set of event(s) $simcom_i$ needs to receive before $\tau_i^k$ starts. $\varepsilon_j^l \in E_i^k$ means that the event is sent by $simcom_j$ at the end of $\tau_j^l$.
  – $t\varepsilon_i^k$: time that the last event in $E_i^k$ is received during simulation.
•  i is the set of simcom(s) that will send event(s) to $simcom_i$. (notation 4.3.7)
• $\triangle_i = \frac{2 L_i Q_i S}{K_i \alpha_i}$ is the delay bound for $simcom_i$. (notation 4.3.8)
• $state_i(t)$ is the state of $simcom_i$ at time $t$. It can either be $exec_i^k$, meaning that $simcom_i$ is simulating $\tau_i^k$, or $idle_i^k$, meaning that $simcom_i$ is waiting for $\tau_i^k$ to start. (notation 4.3.9)
• $t \to i$ / $t \not\to i$ represents that a Q with interval $[t, t + Q_i]$ is / is not assigned to $simcom_i$. (notation 4.3.10)
• t  i / t  i means that $simcom_i$ is / is not allowed to use the Q assigned at $t$. (notation 4.3.11)
• $u_i[t_0, t_1]$ indicates that $simcom_i$ actually used a Q in $[t_0, t_1]$. (notation 4.3.12)
• A partial solution is called a $\langle n, m, k \rangle$ partial solution if $n$ simcoms have been assigned to $m$ simhws, among which $k$ simhws are not assigned any simcom. We call those $k$ simhws idle simhws. The other $(m - k)$ simhws are called busy simhws. (notation 4.5.1)
• The most promising $\langle n, m, k \rangle$ partial solution among all $\langle n, m, i \rangle$ partial solutions is defined as the one which achieves the minimum of Equ. (4.15), only considering the $n$ simcoms assigned and the $m$ simhws. (notation 4.5.2)
• $\mathcal{B}_{\langle n,m,k \rangle}$ is a set containing all the promising $\langle n, m, k \rangle$ partial solutions being kept. Its size is denoted as $|\mathcal{B}_{\langle n,m,k \rangle}|$. (notation 4.5.3)
• $\mathcal{A}_{\langle n,m,k \rangle}$ is a set containing all the $\langle n, m, k \rangle$ partial solutions derived from the partial solutions kept in the previous search step. Its size is denoted as $|\mathcal{A}_{\langle n,m,k \rangle}|$. (notation 4.5.4)
Bibliography
[1] "Altera," http://www.altera.com.
[2] "Broadway compiler project," http://www.cs.utexas.edu/users/less/broadway.html.
[3] "CarbonKernel," http://www.carbonkernel.org.
[4] "Cocentric system compiler," http://www.synopsys.com.
[5] "A framework for hardware-software co-design of embedded systems," http://embedded.eecs.berkeley.edu/Research/hsc/abstract.html.
[6] "Functional specification for SystemC 2.0," http://www.systemc.org.
[7] "H.263 decoder: TMS320C6000 implementation," http://www.ti.com.
[8] "International technology roadmap for semiconductors," http://www.public.itrs.net.
[9] "Mentor graphics," http://www.mentor.com.
[10] "Open core protocol specification 2.1," http://www.ocpip.org.
[11] "quickturn," http://www.cadence.com/quickturn/.
[12] "Serial RapidIO," http://www.rapidio.org.
[13] "Tensilica Incorporated," http://www.tensilica.com.
[14] "AMBA specification (rev 2.0)," ARM Ltd., 1999.
[15] "DSP/BIOS kernel technical overview," http://www.ti.com.
[16] "DSP/BIOS timing benchmarks for code composer studio 2.2," http://www.ti.com.
[17] "IEEE 1500 standard," http://www.grouper.ieee.org/groups/1500.
[18] "TMS320C6455 fixed-point digital signal processor," http://www.ti.com.
[19] "TMS320C64x image/video processing library programmer's reference," http://www.ti.com.
[20] "TMS320C6727, TMS320C6726, TMS320C6722 floating-point digital signal processors," http://www.ti.com.
[21] "TMS320DM642 video/imaging fixed-point digital signal processor," http://www.ti.com.
[22] "VHDL 93 references," http://www.vhdl-online.de/ref93/.
[23] "Virtual component codesign," Cadence Design Systems Inc.
[24] "Virtual socket interface alliance," http://www.vsi.org.
[25] "VxWorks 5.4," http://www.wrs.com/products/html/vxwks54.html.
[26] "Xilinx," http://www.xilinx.com.
[27] "Open verification library assertion monitor reference manual," AeUm, 2002.
[28] "Superlog design assertion subset," Co-design Automation Inc., Apr. 2002.
[29] H.M. AbdElSalam, S. Kobayashi, K. Sakanushi, Y. Takeuchi, and M. Imai, "Towards a higher level of abstraction in hardware/software co-simulation," in 24th International Conference on Distributed Computing Systems Workshops, pp. 824-830, 2004.
[30] A. Aho, R. Sethi, and J. Ullman, Compilers: Principles, Techniques and Tools, MA: Addison-Wesley, 1988.
[31] H. Akaboshi, A Study on Design Support for Computer Architecture Design, Ph.D. thesis, Department of Information Systems, Kyushu University, 1996.
[32] Guido Arnout, "C for system level design," in Design, Automation and Test in Europe Conference and Exhibition, pp. 384-386, 1999.
[33] J. Axelsson, "Towards system-level analysis and synthesis of distributed real-time systems," in 5th International Conference on Information Systems Analysis and Synthesis, pp. 40-46, 1999.
[34] A. Baghdadi, N.E. Zergainoh, W.O. Cesario, and A. A. Jerraya, "Combining a performance estimation methodology with a hardware/software codesign flow supporting multiprocessor systems," IEEE Trans. Software Engineering, vol. 28, no. 9, pp. 822-831, 2002.
[35] Jwahar R. Bammi, Wido Kruijtzer, and Luciano Lavagno, "Software performance estimation strategies in a system-level design tool," in 8th International Workshop on Hardware/Software Codesign, pp. 82-86, May 2000.
[36] S. Bashford, U. Bieker, B. Harking, R. Leupers, P. Marwedel, A. Neumann, and D. Voggenauer, "The MIMOLA language version 4.1," University of Dortmund, 1994.
[37] Matthias Bauer, Wolfgang Ecker, Michael Gasteir, and Manfred Glesner, "Evaluation of sequential VHDL and C for system description and specification," in VIUF, 1996.
[38] David Becker, Raj K. Singh, and Stephen G. Tell, "An engineering environment for hardware/software co-simulation," in 29th Design Automation Conference, June 1992.
[39] Ilan Beer, Shoham Ben-David, Cindy Eisne, and Amer Landvn, "Rule base: An industry-oriented formal verification tool," in Design Automation Conference, pp. 655-660, June 1996.
[40] A. Bender, "Design of an optimal loosely coupled heterogeneous multiprocessor system," in Proc. EDTC, pp. 275-281, 1996.
[41] L. Benini, D. Bertozzi, D. Bruni, N. Drago, F. Fummi, and M. Poncino, "Legacy SystemC co-simulation of multi-processor systems-on-chip," in IEEE International Conference on Computer Design, pp. 494-499, 2002.
[42] B. Bentley, "High level validation of next generation microprocessors," in IEEE International Workshop on High-Level Design Validation and Test, pp. 31-35, 2002.
[43] S. S. Bhattacharyya, S. Sriram, and E.A. Lee, "Optimizing synchronization in multiprocessor DSP systems," IEEE Transaction on Signal Processing, vol. 45, no. 6, pp. 1605-1618, Jun. 1997.
[44] Lubomir F. Bic and Alan C. Shaw, Operating Systems Principles, Prentice Hall, 2002.
[45] N. Binh, M. Imai, A. Shiomi, and N. Hikichi, "A HW/SW partitioning algorithm for designing pipelined asip's with least gate counts," in Proc. DAC, pp. 527-532, 1996.
[46] G. Bosman, A Survey of Co-Design Ideas and Methodologies, Ph.D. thesis, Vrije University, Amsterdam, Netherland, 2002.
[47] A. Bouchhima, S. Yoo, and A. Jerraya, "Fast and accurate timed execution of high level embedded software using HW/SW interface simulation model," in Asia and South Pacific Design Automation Conference, pp. 469-474, Jan. 2004.
[48] F. Boussinot and R. De Simone, "The ESTEREL language," Proceedings of the IEEE, vol. 79, no. 9, pp. 1293-1304, Sept. 1991.
[49] C. Brandolese, W. Fornaciari, L. Pomante, F. Salice, and D. Sciuto, "Affinity-driven system design exploration for heterogeneous multiprocessor soc," Computers, vol. 55, no. 5, pp. 508-519, May 2006.
[50] L. Carro, M. Kreutz, F. R. Wagner, and M. Oyamada, "system synthesis for multiprocessor embedded applications," in DATEC, pp. 687-702, 2000.
[51] Noureddine Chabini, Imed Eddine Bennour, El Mostapha Aboulhamid, and Yvon Savaria, "A static method for system performance estimation," in 10th International Conference on Microelectronics, pp. 111-114, Dec. 1998.
[52] A. Chakraborty and M. Greenstreet, "Efficient self-timed interfaces for crossing clock domains," in IEEE International Symposium on Asynchronous Circuits and Systems, pp. 78-88, May 2003.
[53] Nelson Yen-Chung Chang, Kun-Bin Lee, and Chien-Wei Jen, "Trace-path analysis and performance estimation for multimedia application embedded system," in International Symposium on Circuits and Systems, 2004.
[54] W.-T. Chang, S. Ha, and E. A. Lee, "Heterogeneous simulation - mixing discrete-event models with dataflow," VLSI Signal Processing, vol. 15, pp. 127-144, 1997.
[55] Karam S. Chatha and Ranga Vemuri, "Hardware-software partitioning and pipelined scheduling of transformative applications," IEEE Trans. on VLSI, pp. 193-208, 2002.
[56] T. Chelcea and S. Nowick, "Robust interfaces for mixed-timing systems," IEEE Trans. on VLSI Systems, vol. 12, no. 8, pp. 857-873, Aug. 2004.
[57] M. Chiodo, D. Engels, P. Giusto, A. Jurecska, H. Hsieh, L. Lavagno, K. Suzuki, and A. Sangiovanni, "A case study in computer-aided co-design of embedded controllers," Design Automation for Embedded Systems, vol. 1, no. 2, pp. 51-67, Jan. 1996.
[58] Moo-Kyoung Chung and Chong-Min Kyung, "Enhancing performance of HW/SW cosimulation and coemulation by reducing communication overhead," IEEE Transactions on Computers, vol. 55, no. 2, pp. 125-136, Feb. 2006.
[59] Moo-Kyoung Chung, Sangjun Yang, Sang-Hoon Lee, and Chong-Min Kyung, "System-level HW/SW co-simulation framework for multiprocessor and multithread SoC," in IEEE International Symposium on VLSI Design, Automation and Test, pp. 177-180, Apr. 2005.
[60] E. Clarke, D. Long, and K. McMillan, "Compositional model checking," in 4th Annual Symposium on Logic in Computer Science, pp. 353-362, 1989.
[61] E. M. Clarke, O. Grumberg, and D. A. Peled, Model Checking, Cambridge, MA: MIT Press, 2000.
[62] J. Cong and Y. Ding, "Combinational logic synthesis for LUT based field programmable gate arrays," ACM Trans. Design Automation for Electronic Systems, vol. 1, no. 2, pp. 145-204, Apr. 1996.
[63] B. P. Dave and N. K. Jha, "COHRA: Hardware/software co-synthesis of hierarchical heterogeneous distributed embedded systems," IEEE Trans. Computer-Aided Design, vol. 17, Oct. 1998.
[64] J. A. Debardelaben and V. K. Madisetti, "Hardware/software codesign for signal processing systems - a survey and new results," in the 29th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1316-1320, Nov. 1995.
[65] G. DeMicheli, Synthesis and Optimization of Digital Circuits, New York: McGraw-Hill, 1994.
[66] D. Desmet, D. Verkest, and H. De Man, "Operating system based software generation for system-on-chip," in Design Automation Conf., pp. 396-401, Jun. 2000.
[67] R. P. Dick and N. K. Jha, "Mogac: A multiobjective genetic algorithm for hardware/software cosynthesis of distributed embedded systems," IEEE Trans. Computer-Aided Design, vol. 17, Oct. 1998.
[68] J. Eker, J. W. Janneck, E. A. Lee, J. Liu, X. Liu, J. Ludvig, S. Neuendorffer, S. Sachs, and Y. Xiong, "Taming heterogeneity - the ptolemy approach," Proceedings of the IEEE, 2002.
[69] P. Eles, "VHDL system-level specification and partitioning in a hardware/software co-synthesis environment," in International Workshop on Hardware/Software Codesign, pp. 22-24, Sept. 1994.
[70] P. Eles, Z. Peng, K. Kuchcinski, and A. Doboli, "System level hardware/software partitioning based on simulated annealing and tabu search," Design Automation for Embedded Systems, vol. 2, no. 2, pp. 5-32, 1997.
[71] R. Ernst, J. Henkel, and T. Benner, "Hardware/software cosynthesis for microcontrollers," IEEE Design and Test of Computers, pp. 64-75, Dec. 1993.
[72] R. Ernst, J. Henkel, Th. Benner, W. Ye, U. Holtmann, D. Herrmann, and M. Trawny, "The COSYMA environment for hardware-software cosynthesis of small embedded systems," Microprocessors and Microsystems, May 1996.
[73] L. Formaggio, F. Fummi, and G. Pravadelli, "A timing-accurate HW/SW cosimulation of an ISS with SystemC," in International Conference on Hardware/Software Codesign and System Synthesis, pp. 152-157, 2004.
[74] Richard M. Fujimoto, "Parallel discrete event simulation," in Winter Simulation Conference, pp. 19-28, 1989.
[75] R. M. Fujimoto, "Time warp on a shared memory multiprocessor," in International Conference on Parallel Processing, 1989.
[76] D. Gajski, R. Domer, and J. Zhu, "IP-centric methodology and design with the SpecC language," 1998.
[77] D. D. Gajski, J. Zhu, R. Domer, A. Gerstlauer, and S. Zhao, SpecC: Specification Language and Methodology, Kluwer Academic Publishers, 2000.
[78] C. Gebotys and M. Elmasry, Optimal VLSI Architectural Synthesis, Amsterdam: Kluwer, 1992.
[79] Andreas Gerstlauer, Haobo Yu, and Daniel D. Gajski, "RTOS modeling for system level design," in Design, Automation and Test in Europe Conference and Exhibition, pp. 130-135, 2003.
[80] G. Goossens, P. G. Paulin, J. Van Praet, D. Lanneer, W. Guerts, A. Kifli, and C. Liem, "Embedded software in real-time signal processing systems: Design technologies," Proc. IEEE, pp. 436-454, 2001.
[81] Thorsten Grotker, Stan Liao, Grant Martin, and Stuart Swan, System Design with SystemC, Kluwer Academic Publishers, 2002.
[82] Lisa Guerra, Joachim Fitzner, Dipankar Talukdar, Chris Schlager, Bassam Tabbara, and Vojin Zivojnovic, "Cycle and phase accurate DSP modeling and integration for HW/SW co-verification," in Design Automation Conference, pp. 964-969, 1999.
[83] Pallav Gupta, "Hardware-software codesign," IEEE Potentials, vol. 20, no. 5, pp. 31-32, Dec. 2001.
[84] R. Gupta, C. Coelho, and G. DeMicheli, "Program implementation schemes for hardware-software systems," IEEE Computer, pp. 48-55, Jan. 1994.
[85] R. K. Gupta and G. De Micheli, "Hardware-software cosynthesis for digital systems," IEEE Design and Test of Computers, vol. 10, no. 3, pp. 29-41, Sept. 1993.
[86] G. Hadjiyiannis, S. Hanono, and S. Devadas, "ISDL: An instruction set description language for retargetability," in 34th Design Automation Conference, pp. 299-302, 1997.
[87] G. Hadjiyiannis, P. Russo, and S. Devadas, "A methodology for accurate performance evaluation in architecture exploration," in 36th Design Automation Conference, pp. 927-932, 1999.
[88] L. Hafer and A. Parker, "Automated synthesis of digital hardware," IEEE Trans. Computers, vol. C-31, no. 2, Feb. 1982.
[89] A. Halambi, P. Grun, V. Ganesh, A. Khare, N. Dutt, and A. Nicolau, "EXPRESSION: A language for architecture exploration through compiler/simulator retargetability," in Design Automation and Test in Europe, 1999.
[90] A. Halambi, P. Grun, H. Tomiyama, N. Dutt, and A. Nicolau, "Automatic software toolkit generation for embedded systems-on-chip," in 6th International Conference on VLSI and CAD, pp. 107-116, Oct. 1999.
[91] S. Hanono and S. Devadas, "Instruction selection, resource allocation, and scheduling in the AVIV retargetable code generator," in 35th Design Automation Conference, pp. 510-515, 1998.
[92] D. Harel, H. Lachover, A. Naamad, A. Pnueli, M. Politi, R. Sherman, A. Shtull-Trauring, and M. Trakhtenbrot, "STATEMATE: A working environment for the development of complex reactive systems," IEEE Transactions on Software Engineering, vol. 16, no. 4, pp. 403-414, April 1990.
[93] M. R. Hartoog, J. A. Rowson, P. D. Reddy, S. Desai, D. D. Dunlop, E. A. Harcourt, and N. Khullar, "Generation of software tools from processor descriptions for hardware/software codesign," in 34th Design Automation Conference, pp. 303-306, 1997.
[94] Zhengting He, "How to create delay-based audio effects on a TMS320C6727 DSP," http://www.ti.com.
[95] Zhengting He and Aloysius K. Mok, "Fast cosimulation of transformative systems with OS support on SMP computer," in Hardware/Software Codesign and System Synthesis, pp. 164-169, 2004.
[96] Zhengting He, Al. Mok, and C. Peng, "Timed RTOS modeling for embedded system design," in 11th IEEE Real Time Application and System Symposium, pp. 448-457, Mar. 2005.
[97] Zhengting He and Aloysius K. Mok, "A real time simulation platform for hardware/software codesign," submitted to EMSOFT, 2007.
[98℄ Zhengting He, C. Peng, and Al. Mok, \A performane estimation tool for videoappliations," in 12th Real-Time and Embedded Tehnology and AppliationsSymposium, pp. 267{276, Apri. 2006.[99℄ Dan Henriksson, Ola Redell, Jad El-Khoury, Martin Torngren, and Karl-Erik_Arzen, \Tools for real-time ontrol systems o-design - a survey,"http://www.se.unt.edu/ sweany/CoDesign/Hendriksson05.pdf.[100℄ K. Hines and G. Borriello, \Dynami ommuniation models in embeddedsystem o-simulation," in Design Automation Conf., pp. 395{400, Jun. 1997.[101℄ K. Hines and G. Borriello, \A geographially distributed framework for embed-ded system design and validation," in Design Automation Conf., pp. 140{145,Jun. 1998.[102℄ A. Homann, T. Kogel, and H. Meyr, \A framework for fast hardware-softwareo-simulation," in Design, Automation and Test in Europe, pp. 760{764, Mar.2001.[103℄ S.-Y. Huang and K.-T. Cheng, Formal Equivalene Cheking and DesignDebugging, Boston, MA: Kluwer, 1998.[104℄ C. Hylands, E. Lee, J. Liu, X. Liu, S. Neuendorer, Y. Xiong, Y. Zhao, andH. Zheng, \overview of the ptolemy projet,"http://ptolemy.ees.berkeley.edu/publiations/papers/03/overview/overview03.pdf.[105℄ A. Inoue, H. Tomiyama, H. Okuma, H. Kanbara, and H. Yasuura, \Languageand ompiler for optimizing datapath widths of embedded systems," IEICETrans. Fundamentals, vol. E81-A, no. 12, pp. 2595{2604, De. 1998.[106℄ A. Jantash, P. Ellervee, J. Oberg, A. Hemani, and H. Tenhunen, \Hard-ware/software partitioning and minimizing memory interfae traÆ," in EU-RODAC, pp. 226{231, 1994. 140
[107℄ J. Jeon and K. Choi, \Loop pipelining in hardware/software partitioning," inASPDAC, 1998.[108℄ Jinyong Jung, Sungjoo Yoo, and Kiyoung Choi, \Performane improvementof multi-proessor systems osimulation based on SW analysis," in Design,Automation and Test in Europe, pp. 749{753, Mar. 2001.[109℄ A. Kalavade and E. A. Lee, \A global ritiality/loal phase driven algorithmfor the onstrained hardware/software partitioning problem," in 3rd Interna-tional Workshop on Hardware/Software Codesign, pp. 42{48, Sept. 1994.[110℄ M. Keating and P. Briaud, Reuse Methodology Manual: For System-on-a-Chip Designs, 3rd edition, Boston, MA: Kluwer, 2002.[111℄ A. Khare, N. Savoiu, A. Halambi, P. Grun, N. Dutt, and A. Niolau, \V-SAT: A visual speiation and analysis for system-on-hip exploration," inEUROMICRO, 1999.[112℄ Dohyung Kim, Chan-EunRhee, Youngmin Yi, Sunghan Kim, Hyunguk Jung,and Soonhoi Ha, \Virtual synhronization for fast distributed osimulation ofdataow task graphs," in 15th International Symposium on System Synthesis,pp. 174{179, Ot. 2002.[113℄ Dohyung Kim, Chan-Eun Rhee, and Soonhoi Ha, \Combined data-driven andevent-driven sheduling tehnique for fast distributed osimulation," IEEETransation on VLSI, vol. 10, no. 5, pp. 672{679, Ot. 2002.[114℄ Dohyung Kim, Youngmin Yi, and Soonhoi Ha, \Trae-driven HW/SW osim-ulation using virtual synhronization tehnique," in 42nd Design AutomationConferene, pp. 345{348, June 2005.[115℄ J. Kim and Y. Kim, \Simulating multimedia systems withMVPSIM," IEEETransation on Design and Test of Computers, vol. 12, no. 4, pp. 18{27, 1995.141
[116℄ Sunghan Kim, Chaeseok Im, and Soonhoi Ha, \Shedule-aware performaneestimation of ommuniation arhiteture for eÆient design spae explo-ration," in Hardware/Software Codesign and System Synthesis, pp. 195{200,Ot. 2003.[117℄ Christian Kreiner, Christian Steger, Egon Teiniker, and Reinhold Weiss, \AHW/SW odesign framework based on distributed DSP virtual mahines,"in Euromiro Symposium on Digital Systems Design, pp. 212{219, Sept. 2001.[118℄ D. Ku and G. De Miheli, \Relative sheduling under timing onstraints: Al-gorithms for high-level synthesis of digital iruits," IEEE Trans. CAD/ICAS,pp. 696{718, June 1992.[119℄ E. D. Lagnese and D. Thomas, \Arhitetural partitioning for system leveldesriptions," in Pro. DAC, pp. 62{67, 1989.[120℄ Marello Lajolo, Mihai Lazaresu, and Alberto Sangiovanni-Vinentelli, \Aompilation-based software estimation sheme for hardware/software o-simulation," in 7th International Workshop on Hardware/Software Codesign,pp. 85{89, May 1999.[121℄ D. Lanneer, J. Van Praet, A. Kii, K. Shoofs, W. Geurts, F. Thoen, andG. Goossens, CHESS: Retargetable Code Generation for Embedded DSP Pro-essors, In Code Generation for Embedded Proessors, Kluwer Aademi Pub-lishers, 1995.[122℄ E. A. Lee and A. Sangiovanni-Vientelli, \Comparing models of omputation,"in ICCAD, pp. 234{241, 1996.[123℄ R. Leupers and P. Marwedel, \Retargetable ode generation based on stru-tural proessor desriptions," Design Automation for Embedded Systems,Kluwer Aademi Publishers, vol. 3, no. 1, pp. 1{36, Jan. 1998.142
[124℄ C. L. Liu and J. W. Layland, \Sheduling algorithms for multiprogrammingin a hard-real-time environment," Journal of ACM, vol. 20, no. 1, 1973.[125℄ Jie Liu, Marello Lajolo, and A. Sangiovanni-Vinentelli, \Software timinganalysis usingHW/SW osimulation and instrution set simulator," in 6th In-ternational Workshop on Hardware/Software Codesign, pp. 65{69, Mar. 1998.[126℄ X. Liu, J. Liu, J. Eker, and E. A. Lee, \Heterogeneous modeling and designof ontrol systems," Software-Enabled Control: Information Tehnology forDynamial Systems, 2002.[127℄ R. Mateos, J. L. Lazaro, and F. Espinosa, \Hardware/software o-simulationenvironment for CSoC with soft proessors," in IEEE International Confer-ene on Field-Programmable Tehnology, pp. 445{448, 2004.[128℄ S. A. Maxwell, Linux Core Kernel 2nd Edition, Coriolis Tehnology Press,2001.[129℄ D. Messershmitt, \Synhronization in digital systems design," IEEE Trans.on Seleted Areas in Communiations, vol. 8, no. 8, pp. 1404{1419, Ot. 1990.[130℄ G. De Mihell and R.K. Gupta, \Hardware/software o-design," Proeedingsof the IEEE, vol. 85, no. 3, pp. 349{365, 1997.[131℄ R. Le Moigne, O. Pasquier, and J.-P. Calvez, \A generi RTOS model forreal-time systems simulation with SystemC," in Design, Automation and Testin Europe Conferene and Exhibition, vol. 3, pp. 82{87, 2004.[132℄ A. Mok and A. X. Feng, \Real-time virtual resoure: A timely abstration forembedded systems," in EMSOFT, pp. 182{196, 2002.[133℄ Al. Mok, X. Feng, and Zhengting He, \Implementation of real-time virtualCPU partition on Linux," in 7th Real-Time Linux Workshop, 2005.143
[134℄ J. Mutterbah, T. Villiger, and W. Fihtner, \Pratial design of globally-asynhronous, loally-synhronous systems," in International symposium onAdvaned Researh in Asynhronous Ciruits and Systems, pp. 52{59, 2000.[135℄ G. Niolesu, Sungjoo Yoo, A. Bouhhima, and A. A. Jerraya, \Validation in aomponent-based design ow for multiore SoCs," in International Symposiumon System Synthesis, 2002.[136℄ R. Niemann and P. Marwedel, \Hardware/software partitioning using integerprogramming," in EDTC, pp. 473{479, 1996.[137℄ J. Noguera, L. Baldez, N. Simon, and L. Abello, \Software-friendly HW/SWo-simulation: An industrial ase study," in Design, Automation and Test inEurope, vol. 2, pp. 1{6, Mar. 2006.[138℄ M. G. Norman and P. Thanish, \Models of mahines and omputation formapping in multiomputers," ACM Computing Surveys, vol. 25, no. 3, pp.263{302, 1993.[139℄ C. Passerone, L. Lavagno, C. Sansoe, M. Chiodo, and A. Sangiovanni-Vinentelli, \Trade-o evaluation in embedded system design via o-simulation," in Proeedings of the ASP-DAC, Jan. 1997.[140℄ P. Paulin, C. Liem, T. May, and S. Sutarwala, \Flexware: A exible rmwaredevelopment environment for embedded systems," Code Generators for Em-bedded Proessors, P. Marwedel and G. Goossens, Eds. Amsterdam: Kluwer,1995.[141℄ R. Perego and G. De Petris, \Minimizing network ontention for mappingtasks onto massively parallel omputers," in Euromiro Workshop on Paralleland Distributed Proessing, pp. 210{218, Jan. 1995.144
[142℄ A. Pnueli, \In transition from global to modular temporal reasoning aboutprograms," Logis and Models of Conurrent Systems, pp. 123{144, 1989.[143℄ H. Posadas, F. Herrera, P. Sanhez, E. Villar, and F. Blaso, \System-levelperformane analysis in SystemC," in Design, Automation and Test in EuropeConferene and Exhibition, pp. 378{383, 2004.[144℄ S. Prakash and A. C. Parker, \SOS: Synthesis of appliation-spei hetero-geneous multiproessor systems," Parallel Distributed Computing, vol. 16, pp.338351, 1992.[145℄ D. Ragan, P. Sandborn, and P. Stoaks, \A detailed ost model for onurrentuse with hardware/software o-design," in 39th Design Automation Confer-ene, pp. 269{274, 2002.[146℄ R. Rajsuman, System-on-a-Chip Design and Test, Boston, MA: Kluwer, 2000.[147℄ P. Ramanathan and J. Stankovi, \Sheduling algorithms and operating sys-tem support for real-time systems," Pro. IEEE, vol. 82, pp. 55{67, Jan.1994.[148℄ K. Ramani and R.L. Haggard, \A survey of tehniques used in the synthesisof hardware from C/C++ as a part of hardware/software o-design," in 33rdSoutheastern Symposium on System Theory, pp. 301{304, Mar. 2001.[149℄ B. R. Rau and C. D. Glaeser, \Some sheduling tehniques and an easilyshedulable horizontal arhiteture for high performane sienti omputing,"in 14th Workshop Miroprogramming, p. 183198, Ot. 1981.[150℄ Bertil Roslund and Per Andersson, \A exible tehnique for OS-support ininstrution level simulators," in 27th Simulation Symposium, pp. 134{141,Apri. 1994. 145
[151℄ Jery T Russell and Margarida F Jaome, \Arhiteture-level performaneevaluation of omponent-based embedded systems," in Design AutomationConferene, pp. 396{401, Jun. 2003.[152℄ R. Saleh, S. Wilton, S. Mirabbasi, A. Hu, M. Greenstreet, G. Lemieux, P.P.Pande, C. Greu, and A. Ivanov, \System-on-hip: reuse and integration,"Proeedings of the IEEE, vol. 94, no. 6, pp. 1050{1069, June 2006.[153℄ T. Shubert, \High level formal veriation of next-generation miroproes-sors," in Design Automation Conferene, pp. 1{6, 2003.[154℄ J. Seizovi, \Pipeline synhronization," in IEEE International Symposim onAsynhronous Ciruits and Systems, pp. 87{96, 1994.[155℄ D. Shapiro, Globally-asynhronous, loally synhronous systems, Ph.D. thesis,Computer Siene Department, Standard University, Stanford, CA, 1984.[156℄ R. Siegmund and D. Muller, \SystemCSV: an extension of SystemC for mixedmulti-level ommuniation modeling and interfae-based system design," inDATE, 2001.[157℄ I. Stoia, H. Abdel-Wahab, K. Jeay, S.K. Baruah, J.E. Gehrke, and C.G.Plaxton, \A proportional share resoure alloation algorithm for real-time,time-shared systems," in Real-Time Systems Symposium, pp. 288{299, De.1996.[158℄ H. J. Stolberg, M. Berekovi, and P. Pirsh, \A platform-independent method-ology for performane estimation of streaming media appliations," in Multi-media and Expo, vol. 2, pp. 105{108, Aug. 2002.[159℄ Wonyong Sung and Soonhoi Ha, \Optimized timed hardware software osim-ulation without roll-bak," in Design, Automation and Test in Europe, pp.945{946, Feb. 1998. 146
[160℄ S. Swan, \SystemC transation level models and RTL veriation," in 43rdACM/IEEE Design Automation Conferene, pp. 90{92, Jul. 2006.[161℄ D. Thomas, J. Adams, and H. Shmitt, \A model and methodology forhardware-software o-design," IEEE Design and Test, vol. 10, no. 3, pp. 6{15,Sept. 1993.[162℄ R. Tijdeman, \The hairmain assignment problem," Disrete Mathematis,vol. 32, pp. 323{330, 1980.[163℄ Kyoko Ueda, Keishi Sakanushi, Yoshinori Takeuhi, and Masaharu Imai,\Arhiteture-level performane estimation for IP-based embedded systems,"in Design, Automation and Test in Europe Conferene and Exhibition, vol. 2,pp. 1002{1007, Feb. 2004.[164℄ F. Vahid, J. Gong, and D. Gajski, \A binary-onstraint searh algorithmfor minimizing hardware during hardware/software partitioning," in Pro.EURODAC, pp. 214{219, 1994.[165℄ Catherine Lingxia Wang, Bo Yao, Yang Yang, and Zhengyong Zhu, \A surveyof embedded operating system,"http://www.s.usd.edu/lasses/fa01/se221/projets/group2.pdf.[166℄ Duen-Jeng Wang and Yu Hen Hu, \Fully stati multiproessor realizationfor real-time reursive DSP algorithms," in International Conferene onAppliation-Spei Array Proessors, pp. 664{678, Aug. 1992.[167℄ Shige Wang, Sharath Kodase, Kang G. Shin, and Daniel L. Kiskis, \Measure-ment of OS servies and its appliation to performane modeling and analysisof integrated embedded software," in Real-Time and Embedded Tehnologyand Appliations Symposium, pp. 113{122, 2002.147
[168℄ W. Wolf, \A deade of hardware/software odesign," Computer, vol. 36, no.4, pp. 38{43, Apri. 2003.[169℄ Wooseung Yang, Moo-Kyeong Chung, and Chong-Min Kyung, \Current sta-tus and hallenges of SoC veriation for embedded system market," in IEEEInternational onferene on SoC, pp. 213{216, Sept. 2003.[170℄ Mitsuhiro Yasuida, Barry Shakleford, and Fumio Suzuki, \A top-down hard-ware/software o-simulation method for embedded systems based upon a om-ponent logial bus arhiteture," in ASP-DAC, pp. 169{175, Feb. 1998.[171℄ T.-Y. Yen and W. Wolf, \Communiating synthesis for distributed embeddedsystems," in ICCAD, pp. 288{294, 1995.[172℄ Y. Yi, D. Kim, and S. Ha, \Virtual synhronization tehnique withOS model-ing for fast and time-aurate osimulation," in Hardware/Software Codesignand System Synthesis, pp. 1{6, 2003.[173℄ Sungjoo Yoo, Performane Improvement of HW/SW Cosimulation Based onSynhronization Overhead Redution, Ph.D. thesis, Seoul National University,Korea, Feb. 2000.[174℄ S. Yoo and K. Choi, \Synhronization overhead redution in timed osimula-tion," in International High Level Design Validation and Test Workshop, pp.157{164, Nov. 1997.[175℄ S. Yoo and A. A. Jerraya, \Hardware/software osimulation from interfaeperspetive," IEE Proeedings-Computers and Digital Tehniques, vol. 152,no. 3, pp. 369{379, May 2005.[176℄ Sungjoo Yoo, Babriela Niolesu, Lovi Gauthier, and Ahmed A. Jerraya,\Automati generation of fast timed simulation models for operating systems148
in SoC design," in Design, Automation and Test in Europe Conferene andExhibition, pp. 620{627, 2002.[177℄ Sungjoo Yoo, Gabriela Niolesu, Damien Lyonnard, Amer Baghdadi, andAhmed A. Jerraya, \A generi wrapper arhiteture for multi-proessor SoCosimulation and design," in 9th International Hardware/Software CodesignSymposium, pp. 195{200, Apri. 2001.[178℄ H. Yu, A. Gerstlauer, and D. Gajski, \RTOS sheduling in transation levelmodels," in International Conferene on Hardware/software Codesign andSystem Synthesis, pp. 31{36, Ot. 2003.[179℄ V. D. Zivkovi, E. Deprettere, P. van der Wolfa, and E. Kok, \From highlevel appliation speiation to system-level arhiteture denition: Explo-ration, design and omplitation," in International Workshop on Compilers forParallel Computers, pp. 39{49, Jan. 2003.[180℄ V. Zivojnovi and H. Meyr, \Compiled HW/SW o-simulation," in 33rdDesign Automation Conferene, pp. 690{695, June 1996.[181℄ V. Zivojnovit, S. Pees, and H. Meyr, \LISA: Mahine desription languageand generi mahine model for HW/SW o-design," in International Work-shop on VLSI Signal Proessing, 1996.
149
Vita
Zhengting He reeived his B.S.E.E. from Tsinghua University, Beijing, China in2000 and M.S.E.E. from the The University of Texas at Austin in 2002. From May2002 to August 2003, Zhengting worked at Texas Instruments in Houston, Texas forseveral terms as a o-op. From May 2004, he joined Texas Instruments in Houston,Texas as a full-time employee. Zhengting has been a member of the Institute ofEletrial and Eletronis Engineers (IEEE) sine 2001.
Permanent Address: 7223 Sierra Night Dr., Richmond, TX, 77469, U.S.A.
This dissertation was typeset with LaTeX 2ε¹ by the author.
¹LaTeX 2ε is an extension of LaTeX. LaTeX is a collection of macros for TeX. TeX is a trademark of the American Mathematical Society. The macros used in formatting this dissertation were written by Dinesh Das, Department of Computer Sciences, The University of Texas at Austin, and extended by Bert Kay and James A. Bednar.
