Abstract: This paper addresses performance estimation and architecture exploration issues within the context of hardware/software codesign. We introduce a new methodology to rapidly explore the large design space encountered in hardware/software systems. The proposed methodology is based on a fast and accurate estimation approach. This estimation approach takes advantage of both system and RT levels of abstraction, and combines both static and dynamic analysis techniques, in order to obtain the best trade-off between speed and accuracy. It has been implemented as an extension to a hardware/software codesign flow to enable the exploration of a large number of multiprocessor architecture solutions from the very start of the design process. The effectiveness of the proposed methodology is illustrated by a significant application example. Experimental results indicate strong advantages of the proposed methodology.
INTRODUCTION
The ever-growing demand for application performance makes multiprocessor architectures increasingly important in many industries (e.g., telecommunications, aerospace, automotive). To deal with these complex architectures and to meet ever tighter time-to-market constraints, we need new system design methods. Hardware/software codesign has emerged as a promising approach to cope with this challenge. One of the most important issues of this approach is design space exploration: finding the best system architecture, including the right partition between hardware and software components, the right hardware components, and the right communication protocols. Starting from the same system specification, several architectures may be produced. Exploring all these architectures requires the ability to rapidly determine the performance resulting from a particular partitioning.
The number of solutions for mapping a system specification made of n tasks (i.e., processes) onto an architecture made of q nonempty modules (i.e., processors) may be computed using the Stirling numbers of the second kind:
$$S(n, q) = \frac{1}{q!} \sum_{k=0}^{q} (-1)^{q-k} \binom{q}{k} k^{n}. \quad (1)$$
Now, we assume that we have p different kinds of technologies to implement each module. Each module may be implemented as specific hardware or as software executed on a specific processor. The following equation gives the number of architectural solutions:
$$\mathit{Nb\_Architecture}(n, p) = \sum_{q=1}^{n} p^{q} \, S(n, q). \quad (2)$$
We notice that the number of solutions increases exponentially with n and p. For example, for a system composed of four tasks and three kinds of technologies (e.g., a hardware implementation and two kinds of processors executing software), we find a design space with 309 different architectures. This space becomes even larger if we consider different communication protocols.
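As a quick check of (2), the following sketch computes the Stirling numbers of the second kind through their standard recurrence and sums over q. The function names are ours and are not part of any tool described in this paper.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, q: int) -> int:
    """Ways to split n tasks into q nonempty modules."""
    if n == q:
        return 1  # includes S(0, 0) = 1
    if q == 0 or q > n:
        return 0
    # Task n either joins one of the q existing modules or opens a new one.
    return q * stirling2(n - 1, q) + stirling2(n - 1, q - 1)


def nb_architectures(n: int, p: int) -> int:
    """Number of architectures for n tasks and p technologies, following (2)."""
    return sum(p ** q * stirling2(n, q) for q in range(1, n + 1))


print(nb_architectures(4, 3))  # 309, the figure quoted in the text
```

For four tasks the terms are 3·1 + 9·7 + 27·6 + 81·1, which shows that most of the design space comes from the finer partitions.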
For each architecture, synthesis and low-level cosimulation may take days. Thus, we cannot afford to synthesize and simulate every single architecture at the cycle level to measure its performance. These facts constitute the basis of our motivation for the work presented in this paper. They explain the need for a performance estimation approach that can accomplish the complex task of architecture exploration within a reasonable lapse of time. The combination of such an approach with a codesign flow constitutes a complete environment for the efficient implementation of complex heterogeneous multiprocessor systems.
Related Work
In the literature, several works address architectural exploration and performance analysis within the hardware/software codesign context. There are also many works on isolated problems like the estimation of the running time of the software, performance analysis of ASIC circuits, and even complex system architectures. Existing works on performance analysis for hardware/software codesign space exploration can be divided into two classes according to the complexity of the target architecture.
In the first class, the target architecture is monoprocessor. PMOSS [5], COSYMA [14], [15], [16], and LYCOS [19], [23] follow this scheme. In PMOSS [5], the authors only calculate the speed-up due to the coprocessor (i.e., hardware) on the overall system performance. To that end, for the software part they combine dynamic analysis (profiling) and static analysis (calculation of a lower-bound execution time based on the assembler code). For the hardware part, they use static analysis (calculation of the lower-bound execution time based on the description of the control machine) and, for the communication part, dynamic analysis (profiling). In COSYMA [14], [15], [16], the authors calculate separate metrics for the software, the hardware, and the communication parts. These metrics are then combined into equations to work out a partition based on the simulated annealing method. Worst-case execution time is calculated for software and hardware implementations using variants of path analysis techniques. The communication time is assessed for their particular model: shared memory. In LYCOS [19], [23], the authors estimate performance using profiling techniques and evaluations of low-level execution time for hardware, software, and communication. Despite their merits, these methods cannot handle complex architectures containing more than one processor.
In the second class, the target architecture is multiprocessor. SpecSyn [9], [10], [27], POLIS [22], [25], [26], and the method proposed by Yen and Wolf [28], [29] follow this scheme. In SpecSyn [9], [10], [27], the authors deal with multiprocessor architectures. The performance estimation approach is mixed: static/dynamic. With this approach, it is difficult to capture dynamic changes of the execution time during the design space exploration, as during this phase only simple static methods are used (the global time is the sum of all partial delays of the different execution resources).
The approach for timing analysis used in POLIS [22], [25], [26] may capture much of the dynamic timing behavior because this system uses a combination of high-level simulation and low-level estimations (i.e., a static/dynamic approach). However, it still lacks precision because of the use of static analysis (a prior characterization of microprocessors) for software parts. Yen and Wolf [28], [29] tackle the problem from a generic point of view. They analyze the interaction between the different processes at the system level, giving the best- and worst-case execution time for each of them. Then, starting from an acyclic graph representing data dependencies among processes, together with information on the partitioning/allocation (the distribution on processing units), they calculate the worst-case execution time for the whole system. This method is accurate and can take communication time into account. Unfortunately, its use remains limited to applications for which it is sufficient to know the worst-case delays for those sets of behavior which may be described by acyclic graphs.
None of the above works solves the problem of accurate estimation or hardware/software architecture exploration in the case of multiprocessor architectures. The main contribution of this paper is to provide an accurate performance estimation method enabling design space exploration in the case of hardware/software codesign. Our estimation/exploration methodology makes use of an existing system-level simulation tool and a codesign tool.
Methodology Overview
When defining the specifications, one of the major issues of our methodology was the optimal trade-off between speed and accuracy. For accuracy, it is necessary to use simulation at the cycle level for every data-dependent behavior. However, this approach is not feasible in our context as it does not allow for fast exploration of the design space. This is why we would rather use simulation at the system level. This high-level simulation allows us to evaluate the dynamic behaviors of the interaction between the different processes, whatever the complexity of the architecture. However, it lacks accurate timing information. To make up for this shortcoming, we use, in addition, a back-annotation approach. As a matter of fact the analysis of "some" implementations (at RT level) allows us to extract "all" timing elements needed for performance estimation of "all" feasible implementations. These timing elements are then combined and introduced into the system specification once and for all. Thanks to this new time-annotated specification, it is possible to predict the performance of all feasible architectures.
The estimation/exploration methodology we present in this paper makes use of the system-level simulator GEODESIM from the ObjectGEODE environment of VERILOG [20] and the codesign tool MUSIC [18]. The system design flow starts with a specification given in SDL [6], [8].
We can identify four main steps in the overall flow of our methodology (see Fig. 1): execution time computation of basic elements, back-annotation, architecture modeling, and system-level simulation. The execution time computation step makes use of the codesign tool to map the initial specification on each of the available target technologies. The back-annotation step uses the timing information to instrument the initial specification. Each architecture that needs to be evaluated has to be described. The designer has to provide the designed partitioning and the technology used for the design of each module. The system-level simulation step starts with the instrumented specification of the system and computes the performance for each architecture. In this scheme, the first step is executed for each target technology. The second step is executed only once and the last two steps are executed for each architecture model. These steps are explained in detail in Section 3.
The rest of this paper is organized as follows: Section 2 presents some aspects of the system description language and the tools we used for a clear understanding of subsequent sections. Section 3 describes the proposed estimation/exploration methodology and how it is used to provide feedback, which guides the search for good architectural solutions. Section 4 contains the experimental results of our method applied to an application example. The results show how the combination of the codesign flow with the system-level simulator, together with our approach, permits a fast and efficient exploration of the design space. Finally, Section 5 concludes the paper.
SYSTEM DESIGN ENVIRONMENT
In this section, we only present certain aspects of the system description language and the tools we used for a clear understanding of subsequent sections.
The Codesign Tool
For this work, we used a codesign tool called MUSIC [18]. It starts from the system-level specification language SDL to produce heterogeneous multiprocessor architectures composed of hardware and software components. The generated hardware components are described in VHDL to be synthesized later on ASICs/FPGAs. The software components, for one or several processors, are described in the C language. The codesign flow may be summarized in three main stages, as shown in Fig. 2:
• System modeling: In this stage, the required system functionality is specified. The system is described in SDL [6], [8]. This specification is validated by the use of the GEODESIM simulator [20]. The result of this stage is a functional specification without realization detail. The first step consists in generating a functional prototype (VHDL/C) which may be validated through cosimulation [13]. The second step consists in targeting the components of the architecture and generating a cycle accurate model, which may also be validated through cosimulation [13].
SDL Language
The SDL language (Specification and Description Language) [6], [8] is an object-oriented formal language defined by the ITU-T [17] for the specification of complex, real-time applications. It is particularly well suited to the modeling and simulation of real-time, distributed, and telecommunication systems. The basic theoretical model of an SDL system consists of a set of extended finite state machines (EFSMs) that run in parallel. These machines are independent of each other and communicate with discrete signals. SDL supports the different concepts which are essential to system description: structure, behavior, and communication.
The structural view of an SDL system is hierarchical. The highest entity of the hierarchy is called "system." A system is composed of a set of "blocks." A block may contain other blocks or a set of "processes." The different processes of one and the same block are interconnected by "signal routes" within the block. The blocks are interconnected by "channels." This hierarchical structure makes it possible to model complex systems with different topologies.
The dynamic behavior in an SDL system is described in the processes. The system/block hierarchy is only a static description of the system structure. Processes in SDL can be created at system start or created and terminated at run time. A process is described by a Finite State Machine that communicates asynchronously with the other processes through signals. Each process has an infinite length FIFO queue at its entrance where signals are stored upon arrival. The arrival of a signal determines and validates the transition that has to be executed by the process.
The communication is based on the "message passing" model. The signals are the only means of interprocess synchronization. A signal always implicitly carries the address of the issuing process, the recipient's address if it is specified explicitly, and, if need be, a set of parameters. The signals exchanged by processes follow a path composed of signal routes and channels. A channel can be connected to several signal routes, whereas a signal route can only be connected to one channel.
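The process model described above can be sketched in a few lines. This is a simplified illustration, not GEODESIM's implementation: the class names, the transition table layout, and the choice to silently discard a signal that has no matching transition are assumptions of the sketch.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Signal:
    name: str
    sender: str          # a signal always carries the issuing process address
    params: tuple = ()   # optional parameters


@dataclass
class Process:
    name: str
    state: str
    transitions: dict    # hypothetical table: (state, signal name) -> next state
    queue: deque = field(default_factory=deque)  # unbounded FIFO input queue

    def receive(self, signal: Signal) -> None:
        self.queue.append(signal)

    def step(self) -> bool:
        """Consume the oldest signal and fire the matching transition, if any."""
        if not self.queue:
            return False
        signal = self.queue.popleft()
        key = (self.state, signal.name)
        if key in self.transitions:
            self.state = self.transitions[key]
        return True  # a signal without a matching transition is discarded


pid = Process("PID", "idle", {("idle", "start"): "running",
                              ("running", "stop"): "idle"})
pid.receive(Signal("start", sender="HOST"))
pid.receive(Signal("stop", sender="HOST"))
pid.step()
print(pid.state)  # running
```

Signals are consumed strictly in arrival order, which is why the second signal only takes effect on the next call to `step`.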
GEODESIM Simulator
The GEODESIM simulator [20], [21] employs the semantics of the SDL language (Z.100 and Z.105 recommendations) [17]. Additionally, it uses an extra annotation format for performance assessment and architecture modeling [24]. These annotations have no influence on the initial specification of the system since they are considered as comments and are only interpreted by the simulator. This is possible thanks to so-called directives, i.e., COMMENT strings that contain a specific pattern recognized by the GEODESIM simulator. These directives can be attached to individual actions inside an SDL transition in order to get accurate evaluations. Fig. 3 illustrates the architecture-modeling directives:
• The NODE directive identifies the execution resources of a given model, called nodes. This is generally used to specify the kind of processor that will be used to execute the module. For instance, the first node in Fig. 3 is attached to an 80C51 processor. A node can be associated with systems, blocks, or processes. All processes (or blocks) inside a node share the same execution resources. Processes that are not inside a node are considered as default nodes.
• The PRIORITY directive assigns an order of priority to every SDL process inside the same node, to model priority-based multitask execution (modeling of the scheduling strategy). In standard SDL, the choice of a transition among all fireable transitions of all process instances inside a node is made according to a uniform random distribution. The PRIORITY directive can be associated with a process in order to modify this random choice.
• The DELAY directive is associated with SDL actions in order to specify their execution time. The default execution time of an action is zero. During the simulation, when a delay action is reached, the corresponding node is blocked for the time specified by the directive parameters; then the action is executed and its effects can be observed.
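The effect of the three directives on one node can be illustrated with a small model. It is a sketch under our own assumptions: the function names are hypothetical, a larger number means a higher priority, and each process is reduced to a list of pending delays.

```python
import random


def pick_process(fireable, priorities=None, rng=random):
    """Choose which process of a node fires next.

    fireable:   names of the processes that have a fireable transition
    priorities: optional {process: priority}; higher value wins (our convention)
    """
    candidates = sorted(fireable)
    if not priorities:
        return rng.choice(candidates)  # default: uniform random choice
    return max(candidates, key=lambda p: priorities.get(p, 0))


def run_node(actions, priorities=None, rng=random):
    """Execute all pending actions of one node.

    actions: {process: [delay, ...]}, one delay per pending action.
    Returns (elapsed time of the node, order in which processes fired).
    """
    pending = {p: list(delays) for p, delays in actions.items()}
    clock, trace = 0, []
    while any(pending.values()):
        fireable = [p for p, delays in pending.items() if delays]
        proc = pick_process(fireable, priorities, rng)
        clock += pending[proc].pop(0)  # DELAY: the node is blocked, then executes
        trace.append(proc)
    return clock, trace


clock, trace = run_node({"MOTOR1": [5, 5], "MOTOR2": [3]},
                        priorities={"MOTOR1": 2, "MOTOR2": 1})
print(clock, trace)  # 13 ['MOTOR1', 'MOTOR1', 'MOTOR2']
```

Because all processes of a node share one execution resource, the elapsed time is the same whatever the firing order; the order only matters once processes on different nodes wait for each other.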
ESTIMATION/EXPLORATION METHODOLOGY
The proposed methodology is based on an accurate estimation approach and has been implemented as an extension to the MUSIC codesign flow. It takes advantage of both system and RT levels of abstraction, and combines both static and dynamic analysis techniques, in order to obtain the best trade-off between speed and accuracy. Accuracy is very high because delays are computed from the implementation model (at RT level) and dynamic behaviors are captured by simulation (i.e., dynamic analysis). Speed is maintained because the architecture exploration loop (see Fig. 1) is performed at the system level (using system-level simulation). The analytic aspect of our approach appears in the computation of the timing elements, the back-annotation equations, and the architecture modeling stages.
In this section, we will explain the four stages of the flow of our methodology in detail (see Fig. 1 ).
Execution Time Computation of Basic Elements
This first stage consists in analyzing an implementation (at RT level), generated by MUSIC, to calculate the execution time of all SDL operations (i.e., computation and communication operations).
Each SDL process may have a software or a hardware realization. For the software realization, the SDL-specified process will be translated into C code. The execution time computation, which depends on the couple "processor, compiler," will be achieved by the analysis of the corresponding assembler code. For the hardware realization, the SDL process is translated into VHDL RTL code. At this abstraction level, we assume that parallelism is limited to the cycle level and we consider a simple architecture model: each VHDL RTL transition takes one clock cycle.
The approach consists in identifying all basic blocks in the SDL specification. A basic block (noted BB) is a sequence of operations without control instructions (branches and conditions) or communication. Then, after executing MUSIC, the corresponding generated code is identified and the execution time of each basic block is calculated and assigned. The communication time is also calculated separately for all available protocols. A communication time is modeled by a sequence of delays to be as close as possible to the execution model of the physical implementation. The results of this stage are used to build a database containing enough timing information for the architecture exploration loop (cf. Fig. 1).
Identification and Execution Time Computation of the Basic Blocks
The first step consists in identifying and labeling the beginning and the end of each basic block in the SDL specification. The correspondence between SDL actions and the generated code is obtained thanks to this labeling and to a technique worked out for the code generator of MUSIC. After identifying the basic blocks in SDL and starting from the same system specification, two implementations are made: one purely software and the other purely hardware. The software implementation may give several realizations according to the number of available couples "processor, compiler." The generated codes (assembler, VHDL RTL) are analyzed to isolate the basic blocks and to calculate their execution time in terms of "number of clock cycles." Fig. 4 gives the overall flow.
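For the software side, the analysis of the generated assembler code can be sketched as follows. The label syntax (`; BB_BEGIN`, `; BB_END`), the instruction mnemonics, and the cycle counts are all hypothetical; the actual labeling technique of the MUSIC code generator is not detailed here.

```python
# Hypothetical cycle counts for one "processor, compiler" couple.
CYCLES = {"mov": 1, "add": 1, "mul": 4, "jnz": 2}


def basic_block_cycles(listing):
    """Sum cycle counts between BB_BEGIN/BB_END labels of an assembler listing.

    listing: lines such as "; BB_BEGIN bb1", "mul r0, r1", "; BB_END bb1".
    Instructions outside any labeled block (e.g., branches) are not counted.
    """
    totals, current = {}, None
    for raw in listing:
        line = raw.strip()
        if line.startswith("; BB_BEGIN"):
            current = line.split()[-1]
            totals[current] = 0
        elif line.startswith("; BB_END"):
            current = None
        elif current is not None and line:
            mnemonic = line.split()[0]
            totals[current] += CYCLES[mnemonic]
    return totals


listing = [
    "; BB_BEGIN bb1", "mov r0, #2", "mul r0, r1", "add r0, r2", "; BB_END bb1",
    "jnz loop",
    "; BB_BEGIN bb2", "mov r3, r0", "; BB_END bb2",
]
print(basic_block_cycles(listing))  # {'bb1': 6, 'bb2': 1}
```

A fixed cost per instruction is the simplest possible timing model; it would need refinement for processors with pipelines or internal parallelism.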
Communication Time Modeling
The execution model of the system allows every processor of the target machine to execute one or several concurrent processes. The communication may be internal or external. Modeling the communication time and synchronization between two processes actually means modeling three kinds of delays [11], [12]: interface initialization time $T_{Startup}$ (particularly for external communication), data transmission time $T_{Trans}$, and synchronization time $T_{Synchro}$. Thus, the time needed for a communication operation can be modeled by the following equation:
where n is the amount of data (e.g., number of bytes), and the binary coefficient stands for "0" or "1" according to the type of communication, internal or external, respectively. Interface initialization time $T_{Startup}$ is the fixed part of the protocol that remains independent of the amount of data. For a particular communication protocol, this time only depends on the target processor. Data transmission time $T_{Trans}$ depends on the amount of data and the communication speed. Synchronization time $T_{Synchro}$ is the time the process is delayed for the communication to be completed. This delay, which may vary a lot, depends on the protocol which is used and on the dynamic execution of the system, but remains data-independent. In fact, as $T_{Synchro}$ is execution-dependent, it is taken into account by the GEODESIM simulator thanks to its supported communication schemes [21]. $T_{Startup}$ and $T_{Trans}$ are calculated for each communication protocol following the same computation flow as the one shown in Fig. 4. Yet, the starting point is the description of the communication protocol in SDL or in another intermediate language supported by MUSIC.
Unlike the execution time computation of the basic blocks, which has to be done again and again for every new application, the communication time can be reused in other applications. For that reason, we gathered all fixed parts of the communication time of all available protocols into a library of communication delays.
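One possible organization of the library of communication delays is sketched below. It rests on two assumptions of ours: the 0/1 coefficient applies to the startup time, and the transmission time grows linearly with the amount of data. All names and numerical values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtocolDelays:
    t_startup: int    # cycles, fixed part of the protocol
    t_per_byte: int   # cycles per byte (assumes a linear transmission time)


# Hypothetical library, keyed by (protocol, processor).
COMM_LIBRARY = {
    ("rendezvous", "ST10"): ProtocolDelays(t_startup=40, t_per_byte=6),
    ("rendezvous", "80C51"): ProtocolDelays(t_startup=90, t_per_byte=12),
}


def static_comm_time(protocol, processor, n_bytes, external):
    """Static part of a communication: startup (external only) plus transfer.

    The synchronization time is deliberately left out: it depends on the
    dynamic execution and is handled by the system-level simulator.
    """
    delays = COMM_LIBRARY[(protocol, processor)]
    startup = delays.t_startup if external else 0
    return startup + n_bytes * delays.t_per_byte


print(static_comm_time("rendezvous", "ST10", 4, external=True))   # 64
print(static_comm_time("rendezvous", "ST10", 4, external=False))  # 24
```

Keying the library by protocol and processor reflects the statement that, for a given protocol, the fixed part only depends on the target processor.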
Back-Annotation
During this stage, the SDL specification is annotated with the timing elements calculated during the previous stage, using the DELAY directive. This stage consists in annotating the basic blocks and the communications. At completion of this stage, we have a time-annotated SDL specification that is ready for the architecture exploration loop.
Annotation of the Basic Blocks
For every basic block in SDL, we obtain several delays corresponding to the different possible realizations. The timing parameters corresponding to these delays are inserted in SDL by adding "TASK COMMENT DELAY ExecTime_BB(x)_Techno(p)" at the beginning of every basic block. The parameter "ExecTime_BB(x)_Techno(p)" is replaced by a numerical value at the architecture modeling stage according to the selected realization. When the simulator reaches the operation "TASK COMMENT DELAY ExecTime_BB(x)_Techno(p)," the process is blocked for the time specified by ExecTime_BB(x)_Techno(p), while the global time progresses. This operation allows the simulator to execute the basic block within a lapse of time that corresponds to its real execution time, whereas the SDL execution of a basic block normally takes zero time.
Annotation of the Communications
An action that models the communication time is inserted in SDL by adding "TASK COMMENT DELAY T_Comm(n)" after all the output and input actions in the communicating processes. The parameter $T_{Comm}(n)$ is given by (3) (cf. Section 3.1.2) and is replaced by numerical values at the architecture modeling stage according to the selected protocol and using the library of communication delays.
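The two annotation rules can be sketched as a single pass over a process body. The flat list representation of SDL actions and the `bb_begin` marker are assumptions of this sketch; the inserted strings follow the form quoted in the text rather than the exact concrete syntax expected by the simulator.

```python
def annotate(actions):
    """Insert DELAY placeholders into a flat list of SDL-like actions.

    actions: list of (kind, text) with kind in
             {"bb_begin", "task", "input", "output"}.
    A "bb_begin" entry carries the basic-block label as its text.
    """
    annotated = []
    for kind, text in actions:
        if kind == "bb_begin":
            # Placeholder resolved later, once a technology is selected.
            annotated.append(f"TASK COMMENT DELAY ExecTime_BB({text})_Techno(p)")
        elif kind in ("input", "output"):
            annotated.append(text)
            annotated.append("TASK COMMENT DELAY T_Comm(n)")
        else:
            annotated.append(text)
    return annotated


body = [
    ("bb_begin", "1"), ("task", "x := x + 1"),
    ("output", "OUTPUT speed(x)"),
    ("bb_begin", "2"), ("task", "y := 0"),
]
for line in annotate(body):
    print(line)
```

The placeholders stay symbolic at this point, so the annotated specification only has to be produced once for all architectures.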
Architecture Modeling
The SDL specification obtained from the previous stage is instrumented with the basic timing elements of all available technologies. Then, in this stage, the designer has to instrument this specification with additional information to target one specific system architecture. The result of this stage is an instrumented SDL specification that fully models the timing behavior of the execution of the selected architecture.
The architecture information that the designer has to introduce in the SDL specification is related to the system partitioning and technologies used for the design of each module. This may be achieved by three steps:
1. Selection of SDL processes that will share the same execution resource: This step is equivalent to the system partitioning which is done by a codesign tool. Therefore, in our case, it consists in modifying the hierarchical structure of the SDL model in order to gather all processes that have to be executed on the same processor in the same SDL block. Then, the directive NODE is added to the declaration of this new block. So, all processes of this block share the same execution resource. This modification of the hierarchical structure of the SDL model does not have any effect on the functional behavior of the system. Moreover, it can easily be automated.
2. Assignment of the software and hardware processors: For each node (i.e., module), the designer has to choose the kind of processor on which the processes of this node will be executed (i.e., the technology). The processor may be hardware (ASIC) or software (microprocessor, microcontroller, DSP). In this stage of processor assignment, the timing parameters of all the basic blocks are replaced by the numerical values corresponding to the target processor.
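The first two steps above amount to a lookup in the timing database once the partition and the processor assignment are known. The following sketch uses a hypothetical database layout and made-up cycle counts.

```python
# Hypothetical timing database: cycles per (basic block, technology).
TIMING_DB = {
    ("bb1", "ST10"): 120, ("bb1", "80C51"): 410, ("bb1", "HW"): 9,
    ("bb2", "ST10"): 35, ("bb2", "80C51"): 140, ("bb2", "HW"): 4,
}


def resolve_delays(partition, technology, blocks_of):
    """Replace every ExecTime_BB(x)_Techno(p) parameter by a number.

    partition:  {node: [processes]}       step 1, processes sharing a resource
    technology: {node: technology name}   step 2, processor assignment
    blocks_of:  {process: [basic-block labels]}
    """
    resolved = {}
    for node, processes in partition.items():
        tech = technology[node]
        for proc in processes:
            for bb in blocks_of[proc]:
                resolved[(proc, bb)] = TIMING_DB[(bb, tech)]
    return resolved


delays = resolve_delays(
    partition={"node1": ["PID"], "node2": ["MOTOR1"]},
    technology={"node1": "ST10", "node2": "HW"},
    blocks_of={"PID": ["bb1", "bb2"], "MOTOR1": ["bb2"]},
)
print(delays)
```

Evaluating another architecture only requires new `partition` and `technology` arguments; the database itself is left untouched.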
System-Level Simulation
This last stage consists in executing the architecture-annotated SDL specification using the GEODESIM simulator. The simulation result gives us the performance of the architecture. We define the same test-bench for all the architectures that we have to compare. At the end of every simulation, we collect the overall execution time.
With the GEODESIM simulator, the scheduling of all the processes of a node is done according to a uniform random distribution. The directive PRIORITY associated with a process allows us to change this order of execution. We can also define our own scheduling strategy through a "scenario" that corresponds best to reality, so we can pre-determine the scheduling strategy, priorities management, etc.
EXPERIMENTAL RESULTS AND METHOD ANALYSIS
In this section, we present the results that we obtained from the design of a robot arm control system. We started by describing the system functionality in SDL. Then, several architectural solutions were explored using our estimation/exploration methodology. These architectures were also synthesized using an existing codesign tool called MUSIC [18]. The resulting architectures were validated by cycle accurate cosimulation. The comparison of these two results shows the effectiveness of our approach.
Robot Arm Control System
This system adjusts the speed variation of each motor so that the arm moves smoothly and the physical constraints of acceleration and braking are met. The complete version of this application contains 18 motors. In our case, we only consider two motors for a clear explanation of the example. The SDL specification of the system is composed of four processes, HOST, PID, MOTOR1, and MOTOR2, as shown in Fig. 5. The process HOST sends speed and control parameters to the PID so that the latter can take over the control of the two motors. The PID controls the speed of the two motors: at every cycle, it calculates the new speed value and sends it to one of the two motors.
Method Assessment
Three kinds of technologies are available: two microcontrollers (ST10 and 80C51) and the hardware option. Concerning communications, we limited our choice to a point-to-point communication with a "Rendez-vous" protocol. For this example, and using (2) (cf. Section 1), we find a design space with 309 different architectures. Of course, this number will drop significantly if we consider some nonfunctional constraints (e.g., the two motors are always implemented with the same technology).
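The size of this design space, and the effect of a constraint, can be checked by brute-force enumeration. In this sketch, we read the example constraint as requiring the modules that host MOTOR1 and MOTOR2 to use the same technology, whether or not the two processes share a module; this reading and all function names are ours.

```python
from itertools import product

TASKS = ["HOST", "PID", "MOTOR1", "MOTOR2"]
TECHNOLOGIES = ["ST10", "80C51", "HW"]


def set_partitions(items):
    """Yield every partition of items into nonempty modules."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put the first item into each existing module in turn...
        for i, module in enumerate(partition):
            yield partition[:i] + [[first] + module] + partition[i + 1:]
        # ...or give it a module of its own.
        yield [[first]] + partition


def architectures(tasks, technologies):
    """Yield {task: (module index, technology)} for every architecture."""
    for partition in set_partitions(tasks):
        for techs in product(technologies, repeat=len(partition)):
            yield {task: (i, techs[i])
                   for i, module in enumerate(partition) for task in module}


def motors_share_technology(arch):
    return arch["MOTOR1"][1] == arch["MOTOR2"][1]


all_archs = list(architectures(TASKS, TECHNOLOGIES))
constrained = [a for a in all_archs if motors_share_technology(a)]
print(len(all_archs), len(constrained))  # 309 141
```

Under this reading, the single constraint removes more than half of the candidate architectures.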
We applied the different stages of our approach. Among all the architectures of the design space, we selected five alternatives to explore. We used the same test-bench for all the architectures to compare their performances. In addition, in order to validate the results obtained with the estimation/exploration method, we measured the performances of the five architectures generated through synthesis and RT-level cosimulation using the MUSIC codesign tool. The partitioning and the assignment of processors for these architectures, as well as their performances are illustrated by Fig. 6 . The given time corresponds to one control iteration. Two performance results are presented, the first is obtained using the estimation/exploration approach and the second is obtained through synthesis and RT-level cosimulation.
Analysis of the Results
The comparison between the performance results obtained with our estimation approach and those obtained through synthesis and RT-level cosimulation is given in Table 1 .
We notice that the results are almost perfect for the last three architectures and lack precision when the two processes of the two motors share the same processor (e.g., ST10). The execution of these two processes on the ST10 is sequential. However, the scheduling strategy of the execution model of the GEODESIM simulator is set to random. This lack of precision was quickly fixed through the implementation of a new predefined scenario modeling the right sequential execution of both processes in the simulator. With this transformation, the error rate, in the case of the first two architectures, drops from 10 percent to about 3 percent. These very small error rates for all the explored architectures show the high accuracy of our approach.
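The error rates discussed above can be computed as a relative deviation from the cosimulation result, which we take here as the reference; this definition is our assumption. The numerical values in the usage lines are purely illustrative and are not taken from Table 1.

```python
def error_rate(estimated: float, measured: float) -> float:
    """Relative error, in percent, of an estimate against the reference value."""
    return abs(estimated - measured) / measured * 100.0


# Illustrative values only (arbitrary time units).
print(round(error_rate(estimated=103.0, measured=100.0), 1))  # 3.0
print(round(error_rate(estimated=90.0, measured=100.0), 1))   # 10.0
```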
Considering the simulation speed, the result is even more promising, since the system-level simulation of an architecture model is about $10^4$ times faster than the cosimulation of the same architecture at the RT level. The system-level simulation of each of the five architectures takes approximately 10 seconds, whereas the cycle accurate cosimulation takes more than 30 hours. This approach nevertheless has some weaknesses. We encountered the following difficulties:
• It is sometimes very tedious to extract, with fidelity and statically, the execution time from the assembler code for the more elaborate microprocessors (e.g., pipeline, internal parallelism,...). However, since this step is executed only once for each implementation technology, it remains affordable for the exploration of a large solution space. Additionally, this step is easy to automate.
• Some compilers make use of their own internal procedure calls, which complicates the identification of the basic blocks in spite of the labeling technique used by our approach. The same problem was encountered with the use of code optimization options in these compilers. Once the annotation steps are automated, this difficulty will disappear.
• The approach is limited to only one performance factor, namely execution time. Area and power consumption have not, so far, been taken into account. Of course, this work may be easily extended to cover these performance factors.
From a practical point of view, one of the most important advantages of our methodology is that all its stages can be automated. This methodology is currently based on the codesign flow of MUSIC and the GEODESIM simulator. Yet, the principal ideas can be reused to adapt it to other system design environments.
CONCLUSIONS
In this paper, we presented a new methodology to rapidly explore the large design space encountered in hardware/software systems. The proposed methodology is based on a fast and accurate estimation approach. This approach takes advantage of both system and RT levels of abstraction and combines both static and dynamic analysis techniques, in order to obtain the best trade-off between speed and accuracy. It has been implemented as an extension to a hardware/software codesign flow to enable the exploration of a large number of multiprocessor architecture solutions from the very start of the design process. The effectiveness of the proposed methodology was illustrated by a significant application example.
