UNU/IIST is jointly funded by the Governor of Macau and the governments of the People's Republic of China and Portugal through a contribution to the UNU Endowment Fund. As well as providing two-thirds of the endowment fund, the Macau authorities also supply UNU/IIST with its office premises and furniture and subsidise fellow accommodation.
Introduction
A computer system specification is usually implemented entirely as a software solution. However, strong performance requirements sometimes demand an implementation fully in hardware. Consequently, between these two extremes, Hardware/Software Codesign [24], which systematically studies the design of systems containing both hardware and software components, has emerged as an important field. A critical phase of the codesign process is to partition a specification into hardware and software components.
One of the objectives of hardware/software partitioning is to search for a reasonable composition of hardware and software components which not only satisfies constraints such as timing, but also optimises desired quality metrics, such as communication cost and power consumption.
Several algorithm-based approaches have been developed, as described, for example, in [2, 20, 21, 23, 25]. All of these approaches emphasise the algorithmic aspects: integer programming [20, 25], evolutionary algorithms [23] and simulated annealing [21] have each been applied to the partitioning process. These approaches are applied to different architectures and cost functions. For example, in [25], Markus provides a technique based on integer programming to minimise the communication cost and the total execution time in hardware for a certain physical architecture. The common feature of these approaches is that the communication cost is simplified to a linear function of the data transfer or of the relation between adjacent nodes in the task graph. This assumption is reasonable in an asynchronous communication model, but not in a synchronous communication model, in which the cost of waiting for communication between processes is very high. To handle the synchronous model, which is common in many practical systems, a new approach must be introduced into the partitioning problem.
A number of papers in the literature have introduced formal methods into the partitioning process [17, 22, 3]. Some of them adopt a subset of the Occam language [16] as the specification language [17, 22]. For example, in [22], Qin provided a formal strategy for carrying out the partitioning phase automatically, and presented a set of algebraic laws to prove the correctness of the partitioning process. That paper, however, did not deal with the optimisation of the partitioning.
Few approaches analyse the initial specification to explore its hidden concurrency and thereby relax the initial conditions of the partitioning for optimisation. In [17], Juliano et al. provide several algebraic laws to transform the initial description of the system into a parallel composition of a number of simple processes. However, this method yields a large number of processes and communication channels, which not only makes it harder to merge those small processes, but also raises the communication load between the hardware and software components.
• Explore the hidden concurrency, i.e. find the processes which could be executed in parallel from the initial sequential specification.
• Obtain the optimal performance of the overall program in terms of the limited resources in hardware. The communication waiting time between software and hardware components is considered as well.
• Improve the communication efficiency by moving the commands to reduce the communication waiting time between hardware and software components after partitioning.
Given a specification, system designers are required to divide the specification into a set of basic processes (blocks), which are regarded as candidate processes for the partitioning phase. In general, the next step is to select some processes and put them into hardware components to obtain the best performance. Because the software and hardware components run in parallel, the hidden concurrency among the processes relaxes the precedence conditions of the partitioning; that is, an optimal solution is obtained from a larger search space. We design two algorithms to explore the control and data flow dependencies. To allocate the processes to the software and hardware components, we transform the partitioning into a reachability problem of timed automata [10], and obtain the best solution by means of an optimal reachability algorithm.
Since the synchronous communication model is adopted in our target architecture, to reduce the communication waiting time further, we adjust communication commands in the program of each component by applying a scheduling algorithm.
The paper is organized as follows. Section 2 presents an overview of our technique. Section 3 explores the dependency relations between processes. Section 4 describes a formal model of hardware/software partitioning using timed automata. In Section 5, we propose a scheduling algorithm to improve the communication efficiency. Some partitioning experiments are conducted in Section 6. Finally, Section 7 is a short summary of the paper.
2 Overview of Our Partitioning Approach
In this section we present our approach to the hardware/software partitioning problem. The partitioning flow is depicted in Figure 1.
In the profiling stage, a system specification is divided into a set of basic candidate processes which are never split further. However, there is a trade-off between the granularity of the candidate processes and the feasibility of optimisation. The smaller the candidate processes are, the greater the number of different partitions. A large number of partitions may increase the complexity of computing the optimum. Furthermore, small candidate processes bring heavy communication costs. On the other hand, larger candidate processes restrict the number of possible partitions, but may reduce the concurrency and increase the waiting time for communication. We leave this choice to the designers. Our approach enables them to repeat the profiling process as long as they are not satisfied with the partitioning results obtained at the current granularity of candidate processes. Once the designer has decided on the process granularity, the initial specification is transformed into a sequential composition of candidate processes, that is, $P_1; P_2; \ldots; P_n$, where $P_i$ denotes the $i$th candidate process.
The analyzing phase in Figure 1 explores the control and data flow dependency among the sequential processes $P_1; P_2; \ldots; P_n$. The data flow dependency is as important as the control flow dependency, and helps to decide whether data transfer occurs between any two processes. The details are discussed in Section 3.
Our goal is to select those processes which yield the highest speedup if moved to hardware. That is, the total execution time of the initial program is minimised subject to the limited resources in hardware. The overhead of the required communication between the software and the hardware must be included, and the synchronous waiting time is also counted in the performance of the partitioning. In fact, this partitioning is a scheduling problem constrained by the precedence relation, synchronous communication and limited resources. We transform the scheduling problem into a reachability problem of timed automata (TA) [10] and obtain the optimal result by means of an optimal reachability algorithm. A timed automaton is a finite state automaton extended with clock variables. TA have proven to be a useful formalism for modelling real-time systems [18]. The verification of a system is usually converted to checking reachability properties, i.e. whether a certain state is reachable or not. In recent years several automatic model checking tools for timed automata have become available, such as UPPAAL [19], KRONOS [7] and HyTech [13]. We will use UPPAAL as our modelling tool to conduct some partitioning experiments in Section 6.
When the partitioning process is finished, we will get the following form:
$$(P_{S_1}; P_{S_2}; \ldots; P_{S_m}) \parallel (P_{H_1}; P_{H_2}; \ldots; P_{H_n})$$
where the $P_{S_i}$ ($1 \le i \le m$) denote the processes which are allocated in the software component, and the $P_{H_i}$ ($1 \le i \le n$) denote the processes which are allocated in the hardware component.
At the end of the partitioning phase, communication commands are added to support data exchange between the software and hardware components. To reduce the waiting time, we reorganize the software and hardware components by moving those added communication commands. For example, let us consider the following partitioned processes P and Q.
P: P1(x); y := f(x); P2(x); C!y; P3(x);
Q: Q1(z); Q2(z); C?q; Q3(z);
Suppose process P is implemented in software and process Q is implemented in hardware, where C!y denotes the output and C?q the input. In process P, moving the action C!y before P2(x) or after P3(x), and in process Q moving the action C?q before Q2(z) or after Q3(z), will not affect the result of the program P || Q. We assume that the estimated execution times of P1, P2 and P3 are 2, 2 and 2 respectively, and that those of Q1, Q2 and Q3 are 1, 1 and 1 respectively. Then moving C!y to the line between y := f(x) and P2(x) will make the program run faster. We will propose a general algorithm, applicable to more than two parallel processes, to improve the communication efficiency.
Exploring Dependency Relations between Processes
Dependency Relations
Let $P_1; P_2; \ldots; P_n$ be the initial sequential specification produced by the system designer in the profiling stage. In this section we explore the dependency relations between any two processes. This is an important step in the analyzing phase. Our intention is to disclose the control and data flow relations of the processes in the specification. These relations are passed to the next step as input for partitioning using the timed automata model. Moreover, by analysing the control relation among processes, we find those processes that are independent, so that they can be executed in any order on one component, or in parallel on two components, without any change to the computation result specified by the original specification.
For a process $P_i$, let $Wr(P_i)$ and $Re(P_i)$ denote, respectively, the set of variables modified by $P_i$ and the set of variables read by $P_i$.
The control flow dependency is represented by the relation ≺ over two processes defined as follows.
Definition 1
If $P_i \prec P_j$, we call $P_i$ a control predecessor of $P_j$. If $P_i \prec P_{i+1}$ holds, then process $P_{i+1}$ cannot start before the completion of process $P_i$. Otherwise, $P_{i+1}$ can be activated before $P_i$, leaving the behaviour of the whole program unchanged.
Theorem 1
To prove this property formally, we follow the convention of Hoare and He [14] that every program can be represented as a design. A design has the form $pre \vdash post$, where $pre$ denotes the precondition and $post$ denotes the postcondition. Sequential composition is formally defined as follows [14]:
$$P(v, v'); Q(v, v') \;\widehat{=}\; \exists m \cdot P(v, m) \wedge Q(m, v')$$
where the variable lists $v$ and $v'$ stand for initial and final values respectively, and $m$ is a set of fresh variables which denote the hidden observation.
The following lemma is taken from [14] :
Because processes P i and P i+1 do not satisfy relation P i ≺ P j , we can easily obtain W r(
where variables x, y, z stand for list variables respectively. Let P i = pre 1 ⊢ post 1 , P i+1 = pre 2 ⊢ post 2 . We could note pre 1 , post 1 , pre 2 and post 2 as follows:
From the above, we could easily obtain: (1)(2), we establish
In the same way, we can prove the following equation
From Lemma 1, the theorem is proved.
□
Let the set $S_j$ ($1 \le j \le n$) store all the control predecessors of $P_j$, and let the constant $max_j$ be the maximum index of the processes in $S_j$. To uncover the hidden concurrency among the processes, we have the following corollary.
Corollary $P_1; \ldots; P_{max_j}; P_{max_j+1}; \ldots; P_j; \ldots; P_n = P_1; \ldots; P_{max_j}; P_j; P_{max_j+1}; \ldots; P_{j-1}; P_{j+1}; \ldots; P_n$ if $max_j < j - 1$
This corollary shows that each process $P_k$ lying between processes $P_{max_j}$ and $P_j$ could be executed in parallel with process $P_j$. If processes $P_k$ and $P_j$ are allocated to the software and hardware components respectively, the execution time of the whole program should be reduced.
[Table 1: Algorithms for finding the control predecessors $S_j$ and the data predecessors $T_i$ of each process]
To be more concrete about the data flow specified by the initial specification, we introduce the relation $\stackrel{d}{\prec}$ between processes, which is exactly the "read-from" relation in the theory of concurrency control in databases.
Definition 2
If processes $P_i$ and $P_j$ satisfy the relation $\stackrel{d}{\prec}$, there is a direct data transfer between them in any execution. We call process $P_i$ a data predecessor of process $P_j$. Through this relation, we know from which processes a process may need data, and can estimate the communication time between them.
Algorithms for Exploring Dependency Relations
In this section, we present two algorithms. One is for finding control predecessors of each process, and the other is for finding data predecessors of each process. The two algorithms are intuitive, so we will omit the proof of their correctness here.
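The set variables $S$ and $T$ are vectors with $n$ components. They store the control and data predecessors of each process respectively, i.e. the postconditions of the two algorithms are as follows:
$$S_j = \{P_i \mid P_i \prec P_j\}, \qquad T_j = \{P_i \mid P_i \stackrel{d}{\prec} P_j\}$$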
Obviously, S 1 = T 1 = ∅. Table 1 shows the two algorithms.
Although the control flow algorithm is very simple, the set $S_j$ ($1 \le j \le n$) discloses the hidden concurrency in the sequential specification, by the corollary in the last subsection.
Both algorithms have complexity $O(n^2)$. The set variables $S$ and $T$ provide all the necessary information on the temporal order between processes, which will be the input for modelling the partitioning with timed automata in UPPAAL. For simplicity, let $D_i$ be the set of indexes of the processes in $T_i$.
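The two predecessor computations can be sketched as follows. This is only an illustrative sketch, not the algorithms of Table 1: it assumes that two processes are control dependent when their read and write sets conflict (write/read, write/write or read/write overlap), and that a data predecessor is the latest earlier writer of a variable that a process reads. The function names, the 0-based indices and the sample read/write sets are hypothetical.

```python
from typing import List, Set


def control_predecessors(writes: List[Set[str]], reads: List[Set[str]]) -> List[Set[int]]:
    """S[j] holds the indices i < j whose variable accesses conflict with process j."""
    n = len(writes)
    S: List[Set[int]] = [set() for _ in range(n)]
    for j in range(1, n):
        for i in range(j):
            # Assumed conflict test: write/read, write/write or read/write overlap.
            if writes[i] & (reads[j] | writes[j]) or reads[i] & writes[j]:
                S[j].add(i)
    return S


def data_predecessors(writes: List[Set[str]], reads: List[Set[str]]) -> List[Set[int]]:
    """T[j] holds the indices of the latest earlier writers of the variables read by j."""
    n = len(writes)
    T: List[Set[int]] = [set() for _ in range(n)]
    for j in range(1, n):
        pending = set(reads[j])         # variables whose last writer is still unknown
        for i in range(j - 1, -1, -1):  # scan backwards from the nearest process
            supplied = pending & writes[i]
            if supplied:
                T[j].add(i)
                pending -= supplied
    return T


# Hypothetical specification with four processes.
writes = [{"a"}, {"b"}, {"a"}, {"c"}]
reads = [{"x"}, {"y"}, {"b"}, {"a", "b"}]
S = control_predecessors(writes, reads)
T = data_predecessors(writes, reads)
```

In this sample the first two processes have no conflict, so under the stated assumptions they could run in either order or in parallel, while the last process reads `a` from the third process rather than from the first, because the third overwrites it. Both functions use two nested loops over the processes, in line with the quadratic bound stated above.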
In this section, we transform hardware/software partitioning into a reachability problem of timed automata. The timed behaviour of each process is modelled as a timed automaton, and the whole system is composed as a network of timed automata. The timed automata models of all processes are similar except for some guard conditions. After the model is constructed, an optimal reachability algorithm is applied to obtain an optimal trace, which records for each process in sequence whether it is allocated to the hardware or the software component. As the model checker UPPAAL has implemented this algorithm in its latest version, we use UPPAAL as our modelling tool.
Behaviour Analysis
Here we list some key elements of the timed automata in our model.
Shared variables. Each process has two possible states, which indicate whether the process is allocated in software or in hardware. We use a global variable $St_i$ to record the state of process $P_i$. $St_i$ is 1 if process $P_i$ is implemented in software, otherwise it is 0.
Precedence constraints. Clearly, a process has the opportunity to be executed, either in hardware or in software, only when all of its control predecessors have terminated. We use a local variable $X_i$ to denote the number of control predecessors of $P_i$ which have finished.
Resource constraints. There is only one general processor in our architecture, so no more than one process can be executed on the processor at any time. We introduce a global variable SR to indicate whether the processor resource is occupied or not. The processor is free if SR is 1, otherwise SR is 0. As far as hardware resources are concerned, the situation is a little complicated. We introduce global variable Hres to record the available resources in hardware.
As the processes in hardware are also sequential like in software in our architecture, we introduce another global variable HR to denote whether a process is executed in hardware. If HR is 1, it indicates that no process occupies the hardware. Otherwise HR is 0.
Clock variables. Local clock variables $CH_i$ and $CS_i$ represent the hardware clock and software clock for process $P_i$ respectively. To calculate the communication time between the software and hardware we introduce a local clock $CC_i$ for process $P_i$. Table 2 lists the variables used in our timed automata models together with their intended meaning. Most of these notations have been explained above.
Model Construction
In this section we present two models. One, called the ideal model, helps to understand the behaviour of the system; the other, the complete model, takes into account all the elements, including resources and communication. Figure 2 shows the ideal model. It expresses the timed behaviour of process $P_i$. There are four states in this model. The wait state denotes that process $P_i$ is waiting. The states srun and hrun denote that process $P_i$ is running in software or in hardware respectively. When process $P_i$ finishes its computation task, it enters the state end. Our purpose is to find the fastest system trace in which all processes reach their end states.
Ideal Model
If the control predecessors of process $P_i$ have all terminated, i.e. $X_i = Token_i$ is satisfied, $P_i$ is enabled to be executed in software or in hardware. When both components are free, it chooses one of them nondeterministically. If both components are occupied by other processes, $P_i$ remains in state wait.
Suppose process $P_i$ is chosen to run in software. Once $P_i$ has occupied the software, it sets the global variable SR to 0 to prevent other processes from occupying the processor. It sets the software clock $CS_i$ to 0 as well. The transition from the srun state to the end state can be taken only when the value of the clock $CS_i$ equals $STexe_i$. As soon as the transition is taken, the variable $X_j$ is incremented by one for every process $P_j$ of which $P_i$ is a control predecessor. At the same time, process $P_i$ releases the software processor. The situation is similar if process $P_i$ is implemented in hardware.
This ideal model shows that every process $P_i$ may be implemented in software or in hardware. When all the processes have reached their end states, the reachability property of the system is satisfied. To obtain the minimal execution time of the whole system, we use a global clock in UPPAAL; once the optimal reachability trace is found, the global clock shows the minimal execution time.
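The search that the ideal model poses can be illustrated by a small exhaustive sketch. It is not the timed-automata encoding and does not use UPPAAL: it simply enumerates every precedence-respecting order and every software/hardware assignment, with one software processor and one sequential hardware unit and no communication cost. The function name, the execution times and the predecessor sets are hypothetical, and the enumeration is only practical for a handful of processes.

```python
from itertools import permutations, product
from typing import List, Optional, Set, Tuple


def best_partition(sw_time: List[int], hw_time: List[int],
                   preds: List[Set[int]]) -> Tuple[float, Optional[Tuple[str, ...]]]:
    """Return (minimal completion time, assignment) by brute-force enumeration."""
    n = len(sw_time)
    best: Tuple[float, Optional[Tuple[str, ...]]] = (float("inf"), None)
    for order in permutations(range(n)):
        position = {p: k for k, p in enumerate(order)}
        # Skip orders in which a process precedes one of its control predecessors.
        if any(position[q] > position[p] for p in range(n) for q in preds[p]):
            continue
        for assign in product("SH", repeat=n):
            free = {"S": 0, "H": 0}   # time at which each component becomes free
            finish = [0] * n
            for p in order:
                # A process starts once its component and all predecessors are done.
                start = max([free[assign[p]]] + [finish[q] for q in preds[p]])
                duration = sw_time[p] if assign[p] == "S" else hw_time[p]
                finish[p] = start + duration
                free[assign[p]] = finish[p]
            if max(finish) < best[0]:
                best = (max(finish), assign)
    return best


# Hypothetical instance: process 2 has processes 0 and 1 as control predecessors.
makespan, assignment = best_partition([4, 3, 2], [2, 1, 1], [set(), set(), {0, 1}])
```

Here `makespan` plays the role of the value read from the global clock at the end of the fastest trace; a model checker explores the same choices symbolically instead of enumerating them.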
Complete Model
Now we present the complete model, which takes into account communication, resources and so on. The complete model is depicted in Figure 3. In addition to the states introduced by the ideal model, we solve two problems not addressed before. One is how to simulate the resource allocation in the hardware component, and the other is how to simulate the data transfer between the hardware and software components.
The approach to the first one is simple. When considering whether process $P_i$ can be implemented in hardware, the automaton tests not only the variable HR but also the variable Hres, to see whether enough resources are available for $P_i$ in the hardware component. If the remaining resources are sufficient, $P_i$ may be put into hardware. Although there are reusable resources in hardware, such as adders and comparators, we do not need to consider them here, because the processes are executed sequentially in hardware in our target architecture. When process $P_i$ finishes its computation task, it releases the reusable resources for subsequent processes.
To model the communication between software and hardware, the data dependencies among the processes have to be considered. When process $P_i$ uses data defined in other processes, data transfer occurs between them. If they are all in the same component, the communication time can be ignored. For example, when the processes which communicate with each other are all in software, they exchange data through shared memory. If process $P_i$ is put into the software component and at least one process that communicates with $P_i$ is allocated to the hardware component, the communication between them takes place by means of the bus or another physical communication mechanism. The overhead of communication between software and hardware is not negligible, and we take it into account in our model.
Recall that the variable $St_i$ denotes whether process $P_i$ is implemented in the hardware or the software component. When process $P_i$ is chosen to run in the software component, it performs $St_i := 1$. Process $P_i$ then checks whether the processes that transfer data to it (i.e. the processes in $T_i$) are in the software component or the hardware component. If at least one of them has been put into hardware, the communication must be taken into account. In Figure 3, the condition $\sum_{j \in D_i} St_j < |T_i|$ is a guard expressing that at least one process that $P_i$ reads data from is in hardware. The situation is similar if the process is in the hardware component (the guard becomes $\sum_{j \in D_i} St_j > 0$ in this case).
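The two guards can be checked with a few lines of code. The following is a minimal sketch, assuming the allocation flags are kept in a dictionary indexed by process number; the function name `needs_communication` and the sample data are hypothetical.

```python
from typing import Dict, Iterable


def needs_communication(St: Dict[int, int], D_i: Iterable[int], to_software: bool) -> bool:
    """Evaluate the communication guard for a process with data predecessors D_i.

    St[j] is 1 if process j is in software and 0 if it is in hardware.
    """
    indexes = list(D_i)
    in_software = sum(St[j] for j in indexes)
    if to_software:
        # At least one data predecessor is in hardware: the sum is below |T_i|.
        return in_software < len(indexes)
    # The process goes to hardware: at least one data predecessor is in software.
    return in_software > 0


St = {1: 1, 2: 0}  # process 1 in software, process 2 in hardware
```

With these flags, a process that reads only from process 1 needs no hardware/software communication if it is placed in software, but does if it is placed in hardware.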
Next, when communication actually occurs between the software and the hardware, it occupies both the software and hardware components. That is to say, no other process can be performed until the communication is finished. Accordingly, the two variables SR and HR are set to 0 simultaneously as soon as the communication takes place. The clock $CC_i$ is set to 0 as well. Since the communication time $Tcomm_i$ depends on the states of process $P_i$'s data predecessors, there are two methods to estimate its value. One is to let $Tcomm_i$ be the probability-weighted average over all communication combinations. The other is to compute the value of every possible communication in advance, and then to choose one of them for $Tcomm_i$ according to the current states of $P_i$'s data predecessors.
Finally, once the communication of process $P_i$ is finished, it immediately releases control of the hardware and software. Process $P_i$ then competes for hardware or software resources with the other ready processes.
It is worth pointing out that even if process $P_k$ is one of the data predecessors of process $P_j$, there need not be a non-negligible, time-consuming communication between processes $P_j$ and $P_k$. Another process $P_i$ may act as a delegate to transfer the data for them. The data will not be modified by process $P_i$, by the data dependency defined before. For example, suppose both processes $P_i$ and $P_j$ are implemented in hardware and have to transfer data to the process $P_k$, which is allocated in software. Process $P_i$ or $P_j$ will act as a delegate to send all the data to process $P_k$. Although more than one process communicates with process $P_k$, the communication between the hardware and software occurs only once.
An Optimal Reachability Algorithm
We have shown that hardware/software partitioning can be formulated as a scheduling problem constrained by the precedence relation, limited resources, etc. In the partitioning model, we need not only to check that all processes can reach their end states, but also to obtain a trace with the shortest accumulated delay. This is regarded as the optimal reachability problem of model checking in timed automata.
For a model checking algorithm, it is necessary to translate the infinite state space of timed automata into a finite state representation. For pure reachability analysis, symbolic states [11] of the form $(l, Z)$ are often used, where $l$ is a location of the timed automaton and $Z$ is a convex set of clock valuations called a zone. The formal definition of $(l, Z)$ can be found in [11]. Here an abstract optimal reachability algorithm based on forward reachability analysis is given. The function $D(l, Z)$ calculates the minimal time delay with respect to a global clock in zone $Z$. The algorithm is as follows:
which is not in WAITING is added to WAITING; return AccumuTime
The algorithm uses two sets WAITING and PASSED to store states waiting to be checked, and states already explored respectively. This algorithm always searches the entire state-space of the analyzed automaton to find the optimal trace.
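The role of the two sets can be illustrated on an explicit finite graph. The sketch below is a simplification and not the zone-based algorithm: states are plain hashable values instead of symbolic states $(l, Z)$, every edge carries a fixed non-negative delay, and a simple bound is used for pruning. The function name, the callback parameters and the sample graph are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def optimal_reachability(initial: Hashable,
                         successors: Callable[[Hashable], Iterable[Tuple[Hashable, int]]],
                         is_goal: Callable[[Hashable], bool]) -> float:
    """Return the minimal accumulated delay needed to reach a goal state."""
    best = float("inf")
    waiting: List[Tuple[Hashable, int]] = [(initial, 0)]  # states still to be checked
    passed: Dict[Hashable, int] = {}                      # best delay seen per explored state
    while waiting:
        state, elapsed = waiting.pop()
        # Prune states that cannot improve on what has already been found.
        if elapsed >= best or elapsed >= passed.get(state, float("inf")):
            continue
        passed[state] = elapsed
        if is_goal(state):
            best = elapsed
            continue
        for nxt, delay in successors(state):
            waiting.append((nxt, elapsed + delay))
    return best


# Hypothetical one-process instance: 5 time units in software, 2 in hardware.
graph = {"wait": [("srun", 0), ("hrun", 0)], "srun": [("end", 5)],
         "hrun": [("end", 2)], "end": []}
fastest = optimal_reachability("wait", lambda s: graph[s], lambda s: s == "end")
```

The pruning on `best` is sound only because delays are assumed non-negative; it corresponds loosely to the branch-and-bound idea, whereas the abstract algorithm described above explores the entire state space.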
To generalise minimum-time reachability, [5] introduces a general model called linearly priced timed automata (LPTA), which extends the model of TA with prices on all transitions and locations, to solve the minimum-cost reachability problem. Uniformly Priced Timed Automata (UPTA) [4], a variant of LPTA, adopt an algorithm similar to ours, which uses techniques such as branch-and-bound to improve the search efficiency and has been implemented in the latest version of UPPAAL. In Section 6, we will use UPPAAL to carry out some experiments on hardware/software partitioning cases.
Improving the Communication Efficiency
After the partitioning stage is finished, we obtain two parallel processes running in software and hardware components respectively. The communication is synchronised between them. Moreover, we can improve communication efficiency further by moving the communication commands appropriately.
The idea is to find, for each communication command, a flexible interval $[ASAP, ALAP]$ in which the communication can occur without changing the program result. This interval denotes the earliest and latest times at which the communication command can execute, relative to the computation time of the program. We then apply a scheduling algorithm to decide the appropriate place for each communication command so as to reduce the waiting time between processes. Here we propose a general algorithm which works for more than two processes in parallel.
System Modelling
Let $S$ be a system of $m$ processes $\{P_1, \ldots, P_m\}$ running in parallel and synchronised by handshaking. All the processes start at time 0. In our partitioning problem, $m$ equals 2.
Description of each process $P_i$
• $P_i$ has a communication trace $\alpha_i = c^i_1 c^i_2 \ldots c^i_{n_i}$ with $c^i_j \in \Sigma_i$, where $\Sigma_i$ is the alphabet of the communication actions of $P_i$.
• $P_i$ needs a computation time $A_i$ before it completes.
• Each $c^i_j$ has a "flexible" interval $[a^i_j, b^i_j]$ for its starting time, relative to the computation time of $P_i$, with $a^i_j \le b^i_j \le A_i$. This means that $c^i_j$ is enabled when the accumulated execution time of $P_i$ has reached $a^i_j$ time units, and should take place before the accumulated execution time of $P_i$ reaches $b^i_j$ time units. $b^i_j$ and $A_i$ can be infinity, and $a^i_j$ can be 0. To be meaningful, we assume that $a^i_j \le a^i_{j+1}$ and $b^i_j \le b^i_{j+1}$ for $1 \le j < n_i$.
• $P_i$ is either running or waiting when not yet completed. It is waiting iff it is executing a communication action $c^i_j$ for which the co-action has not been executed.
The system S completes when all of its processes complete. Our task is to schedule communication actions such that S completes at the earliest time.
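To make the role of the flexible intervals concrete, the sketch below computes the earliest time of the first communication between two processes and the waiting it causes. It covers only this simplest case, under stated assumptions: neither process has waited before, so its accumulated execution time equals the global time, and the communication itself takes no time. The function name and the sample windows are hypothetical.

```python
from typing import Tuple


def first_rendezvous(window_i: Tuple[int, int],
                     window_k: Tuple[int, int]) -> Tuple[int, int, int]:
    """Return (communication time, waiting of P_i, waiting of P_k)."""
    a_i, b_i = window_i
    a_k, b_k = window_k
    if max(a_i, a_k) <= min(b_i, b_k):
        # The windows overlap: both sides meet at the earliest common instant.
        return max(a_i, a_k), 0, 0
    if b_i < a_k:
        # P_i must offer the communication by b_i and then waits until P_k is enabled.
        return a_k, a_k - b_i, 0
    # Symmetric case: P_k offers at b_k and waits until P_i is enabled at a_i.
    return a_i, 0, a_i - b_k


overlap = first_rendezvous((2, 5), (4, 8))
early_i = first_rendezvous((0, 3), (5, 9))
early_k = first_rendezvous((6, 7), (1, 4))
```

When the windows are disjoint, the earlier process offers the communication as late as its window allows, which keeps its waiting as short as possible. A full scheduler also has to carry accumulated waiting times from one communication to the next, which this sketch does not attempt.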
We now formulate the problem precisely. The purpose of formalisation here is just to avoid ambiguity and to simplify the long text in the proof when applicable. Any formalism to model the problem must have the capacity to express the "accumulated execution time" for processes.
For this reason, we borrow some ideas from Duration Calculus (DC) [8] to formalise the problem.
For each process $P_i$, we introduce four state variables (which are mappings from $[0, \infty)$ to $\{0, 1\}$), $P_i.running$, $P_i.waiting$, $P_i.completed$ and $P_i.start$, to express the behaviour of $P_i$. At time $t$, the state variable $P_i.running$ ($P_i.waiting$, $P_i.completed$ and $P_i.start$) has the value 1 iff $P_i$ is running (waiting, completed and started, correspondingly) at that time. These state variables are mutually exclusive. A process $P_i$ starts at time 0 and terminates when its accumulated execution time has reached $A_i$. All processes stay in the state "completed" after they terminate.
Systems and Assumptions
• $\alpha_1, \alpha_2, \ldots, \alpha_m$ are assumed to be matched in the sense of handshaking synchronisation. Let $f$ be the matching function, i.e. $f(c^i_j, c^k_h) = 1$ iff $c^i_j$ and $c^k_h$ are matched (they are the partners for the same communication).
• Let $t^i_j$ be the starting time of $c^i_j$ (according to the unique global clock). Then $t^i_j$ must satisfy the constraint for $c^i_j$, i.e. $a^i_j \le \int_0^{t^i_j} P_i.running(t)\,dt \le b^i_j$.
• $P_i.waiting(t) = 1$ if and only if there exist $c^i_j$ and $c^k_h$ such that $t^i_j \le t \le t^k_h$ and $f(c^i_j, c^k_h) = 1$ ($P_i$ is waiting iff it decides to communicate and its partner is not ready).
To formalise the behaviour of communication actions described in the items above, we introduce, for each $c^i_j$ and $c^k_h$ such that $f(c^i_j, c^k_h) = 1$, a state variable $comm(i, j, k, h)$. $comm(i, j, k, h)(t) = 1$ iff at time $t$ one of the partner actions (either $c^i_j$ or $c^k_h$) has started and the communication has not completed. Note that $comm(i, j, k, h) = comm(k, h, i, j)$.
An execution of $S$ is a set of intervals $[t^i_j, t'^i_j]$ of the starting and ending times of the communication actions $c^i_j$. An execution terminates at time $t$ iff $t$ is the termination time of the latest process.
Question: Develop a procedure for the scheduling to produce an execution of S that terminates at the earliest time.
In the following algorithm and example, we assume that communication takes no time for simplicity. The algorithm is also correct when including the communication time.
Scheduling Algorithm
Because $\alpha_1, \ldots, \alpha_m$ are matched, we can easily construct a dependency graph $G$ to express the synchronised computation of the system $S$ (a Mazurkiewicz trace [1], [9]) as follows. Each node of the graph represents a synchronised action $(c^i_j, c^k_h)$ with $f(c^i_j, c^k_h) = 1$ (and is labelled by $(c^i_j, c^k_h)$). There exists a directed edge from
$G$ is used as an additional input for the algorithm.
(a) If $B \neq \emptyset$ then $t_n := \min B$. In this case, $t^i_j := t^k_h := t_n$ (no waiting time),
(b) If $B = \emptyset$ then $t_n := \min I \cap K$ if $\max J \cap K < \min I \cap K$, and update the waiting time
(3) Remove all the nodes in $C$ and the edges leaving from them from graph $G$.
Example: Suppose there are 3 processes $P_1$, $P_2$ and $P_3$ that communicate with each other. The communication intervals of the processes are shown in Fig 4. The dependency graph $G$ for $S$ is constructed as well.
The first execution of Step 2 is on the slice $C_1 = \{n_1\}$, and gives $t_1 = 4$, $W = (0, 0, 0)$, $V = (4, 4, 0)$, meaning that up to time $t_1$, when the actions represented by $n_1$ finish, no process is waiting, and that the action represented by $n_1$ involves $P_1$ and $P_2$ and terminates at time 4.
The second execution of Step 2 is on the slice $C_2 = \{n_2\}$, and gives $t_2 = 6$, $W = (0, 0, 0)$, $V = (4, 6, 6)$, meaning that up to time $t_2$, when the actions represented by $n_2$ finish, no process is waiting.
The last execution of Step 2 is on the slice $C_3 = \{n_3\}$, and gives $t_3 = 11$, $W = (1, 0, 0)$, $V = (11, 11, 6)$, meaning that up to time $t_3$, when the actions represented by $n_3$ finish, $P_1$ has to wait for 1 time unit.
Theorem 2 $t_n$ is the earliest time at which the communication actions represented by $n$ can take place. Hence, the algorithm gives a correct answer to the problem stated above.
Proof: Let $V_i$, $V'_i$ and $W_i$, $W'_i$ be the values of the $i$th components of the vectors $V$ and $W$ just before and after an execution of Step 2 respectively. We will prove by induction on the number of executions of Step 2 on graph $G$ that $t_n$, $W'_i$, $t^i_j$, $t^k_h$ and $V'_i$ produced by the last application of Step 2 have the following properties:
1. $t_n$ is the earliest time at which the communication actions represented by $n$ can take place (i.e. the constraints for the communication actions represented by $n$ are satisfied),
2. $W'_i$ is the total waiting time of process $i$ over the interval $[0, t_n]$, and
First, from the assumption that the $\alpha_i$'s are matched and from the definition of $G$, it follows that $G$ is acyclic and that each set $C$ produced in Step 2 has at most one node whose label includes an action from any given process.
Basic step: Before the execution of Step 2 for each node n = (
t n ] will make the constraints for c j j and c k h satisfied together with other desired properties. We verify this for the two cases of Step 2.
When $\min\{t^i_j, t^k_h\} = t^k_h$ and $t^k_h = t_n$ (the case $I \cap J \neq \emptyset$), it holds that $t^i_j = t_n$. Both $P_i.\mathit{waiting}$ and $P_k.\mathit{waiting}$ have the value 0 in the interval $[0, t_n]$. Hence, $\int_0^{t_n} P_i.\mathit{running}(t)\,dt = t_n$ and $\int_0^{t_n} P_k.\mathit{running}(t)\,dt = t_n$. Because $t_n = \min B$, $a^i_j \le$
When $\min\{t^i_j, t^k_h\} = t^k_h$ and $t^k_h < t_n$ (the case $I \cap J = \emptyset$ and $\max I \cap K < \min J \cap K$), it holds that in the interval $[0, t_n]$, $P_k.\mathit{waiting}(t)$ has the value 1 iff $t \in [t^k_h, t_n]$, and $P_i.\mathit{waiting}(t)$ has the value 0 for all $t \in [0, t_n]$. Therefore,
Because in this case $t^i_j = t_n$, and
Because, for $n, n' \in C$, $n \neq n'$ implies that $n$ and $n'$ do not have a common communication action, we conclude that the three properties hold for all $n \in C$ in the first application of Step 2.
Induction step: The arguments are almost the same as for the basic step, with some small modifications.
Let $C$ be the slice of $G$ in the $r$th application of Step 2, and let $n = (c^i_j, c^k_h)$ be any node in $C$. The fact that $\mathit{comm}(i, j, k, h)(t) = 1$ iff $t \in [\min\{t^i_j, t^k_h\}, t_n]$ makes the constraints for $c^i_j$ and $c^k_h$ satisfied, together with the other desired properties. We verify this for the two cases of Step 2.
When $\min\{t^i_j, t^k_h\} = t^k_h$ and $t^k_h = t_n$ (the case $I \cap J \neq \emptyset$), it holds that $t^i_j = t_n$. $P_i.\mathit{waiting}(t)$ has the value 0 in the interval $[V_i, t_n]$ and $P_k.\mathit{waiting}(t)$ has the value 0 in the interval $[V_k, t_n]$. Hence, by the inductive hypothesis,
When $\min\{t^i_j, t^k_h\} = t^k_h$ and $t^k_h < t_n$ (the case $I \cap J = \emptyset$ and $\max I \cap K < \min J \cap K$), it holds that in the interval $[V_k, t_n]$, $P_k.\mathit{waiting}(t)$ has the value 1 iff $t \in [t^k_h, t_n]$, and $P_i.\mathit{waiting}(t)$ has the value 0 for all $t \in [V_i, t_n]$. Therefore,
The fact that $\mathit{comm}(i, j, k, h)(t) = 1$ iff $t \in [\min\{t^i_j, t^k_h\}, t'_n]$ with $t'_n < t_n$ would violate the constraints for the communication actions, because either the constraint ($a^i_j \le$
or the temporal order of actions is not satisfied.
Because, for $n, n' \in C$, $n \neq n'$ implies that $n$ and $n'$ do not have a common communication action, we conclude that the three properties hold for all $n \in C$ in the $r$th application of Step 2. $\Box$
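The slice-by-slice traversal of the dependency graph used in the algorithm and in the proof can be illustrated with a minimal sketch. It assumes that a slice $C$ is the set of nodes of $G$ that currently have no incoming edge, and that removing $C$ together with its outgoing edges exposes the next slice; the computation of $t_n$, $V$ and $W$ in Step 2 is not reproduced here. The function name `slices` and the edge list are hypothetical, chosen only to be consistent with the three-process example (Fig. 4 itself is not reproduced).

```python
from collections import defaultdict


def slices(nodes, edges):
    """Return the successive slices of a dependency graph.

    Assumption: a slice is the set of nodes with no incoming edge in the
    graph that remains after all earlier slices have been removed.
    """
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    remaining = set(nodes)
    result = []
    while remaining:
        current = sorted(n for n in remaining if indegree[n] == 0)
        if not current:
            # Cannot happen when the graph is acyclic, as stated in the proof.
            raise ValueError("dependency graph contains a cycle")
        result.append(current)
        # Remove the slice and the edges leaving its nodes.
        for n in current:
            remaining.remove(n)
            for m in successors[n]:
                indegree[m] -= 1
    return result


# Hypothetical edges: n1 and n2 share P2, n1 and n3 share P1, n2 and n3 share P2.
example_nodes = ["n1", "n2", "n3"]
example_edges = [("n1", "n2"), ("n1", "n3"), ("n2", "n3")]
example_slices = slices(example_nodes, example_edges)
print(example_slices)  # [['n1'], ['n2'], ['n3']]
```

Under these assumed edges the sketch yields the slices $\{n_1\}$, $\{n_2\}$, $\{n_3\}$ in that order, matching the three executions of Step 2 in the example.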
Experiments in UPPAAL
We have used the technique of the previous section to find optimal solutions for some hardware/software partitioning case studies. In this section we present some of our experiments in solving them with the model checker UPPAAL version 3.3.32, running on a Linux machine with 256 MB of memory.
After modelling a hardware/software partitioning problem as a network of timed automata with n processes, we input the model to the UPPAAL model checker. We then ask UPPAAL to verify:
E<> P1.end and P2.end and ... and Pn.end
This property, written in the UPPAAL specification language, states that there exists a trace of the automata network in which eventually all n processes reach their end states.
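For a concrete number of processes the reachability query can be written out mechanically. The following minimal sketch builds the query string; the helper name `all_end_query` is hypothetical, and it assumes that the processes are named P1, ..., Pn and that each has a location called `end`, as in the property above.

```python
def all_end_query(n, end_location="end"):
    """Build the reachability query stating that all n processes
    P1..Pn are simultaneously in their end location."""
    if n < 1:
        raise ValueError("at least one process is required")
    conjuncts = [f"P{i}.{end_location}" for i in range(1, n + 1)]
    return "E<> " + " and ".join(conjuncts)


print(all_end_query(3))  # E<> P1.end and P2.end and P3.end
```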
To let UPPAAL find the optimal solution to our problem, we choose the breadth-first model checking algorithm (UPPAAL offers several search algorithms) and the "fastest trace" option. A global clock variable is declared to store the execution time. When the reachability property is satisfied, the fastest trace, which records the partitioning scheme, is found, and the global clock records the minimal execution time of all the processes. After the necessary communication statements have been added, this trace can be passed to the software compiler and the hardware compiler for implementation.
Here we use an Occam-like language as our specification language, and use the hardware compiler technique [6] to estimate the hardware resources required by each process. For simplicity, the only resource we list here is the estimated number of gates required for each problem.
The experimental results for the three case studies are shown in Table 3. We assume that the hardware resources amount to 15,000 gates.
The first case study is a Huffman decoder algorithm, the second is a matrix multiplier algorithm, and the last is a data-packing algorithm used in networking.
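The resource constraint used in the experiments can be stated as a simple check: the estimated gates of all processes placed in hardware must not exceed the 15,000 available gates. The sketch below is only an illustration of that constraint. The process names, the gate estimates and the assignment are placeholder values, not figures from Table 3, and the function names are hypothetical.

```python
GATE_BUDGET = 15_000  # hardware resources assumed in the experiments


def hardware_gates(assignment, gate_estimates):
    """Sum the estimated gates of the processes assigned to hardware."""
    return sum(gate_estimates[p] for p, side in assignment.items() if side == "hw")


def fits_budget(assignment, gate_estimates, budget=GATE_BUDGET):
    """Check that a partitioning respects the gate budget."""
    return hardware_gates(assignment, gate_estimates) <= budget


# Placeholder estimates and assignment, for illustration only.
gate_estimates = {"P1": 6000, "P2": 7000, "P3": 4000}
assignment = {"P1": "hw", "P2": "hw", "P3": "sw"}
print(hardware_gates(assignment, gate_estimates))  # 13000
print(fits_budget(assignment, gate_estimates))     # True
```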
Conclusion
This paper presents a new approach to hardware/software partitioning that supports an abstract architecture in which communication is synchronous. Once the designer has decided the process granularity of the initial specification, the partitioning process can be carried out automatically. We explore the relations among processes to find the hidden concurrency and data dependency in the initial specification. These relations serve as input to the timed automata model, ensuring that the behaviours of the processes are modelled correctly. Once the formal partitioning model has been constructed with timed automata, the optimal result can be obtained by means of an optimal reachability algorithm. To further improve the efficiency of synchronous communication between hardware and software components, a scheduling algorithm is introduced to adjust the communication commands after partitioning. The experiments with the model checker UPPAAL demonstrate the feasibility and advantages of the proposed approach.
