The complexity of multiprocessor architectures for mobile multimedia applications renders their validation challenging. In addition, to provide the necessary flexibility, a part of the functionality is realized by software. Thus, a formal model has to take into account both hardware and software. In this article we report on the use of the CADP toolbox for the formal modeling and analysis of the DTD (Dynamic Task Dispatcher), a complex hardware block of an industrial hardware architecture developed by STMicroelectronics. The formal LNT model developed by an industry engineer was appropriate to discuss implementation details with the architect and enabled model-checking temporal properties expressed in MCL, which discovered a possible problem. We investigated the existence of the problem in the architect's C++ model using co-simulation of the C++ and the formal LNT models.
Introduction
Current multimedia applications require complex multiprocessor architectures, even for mobile terminals such as smartphones or netbooks. Due to physical constraints, in particular the distribution of a global clock on large circuits, modern multiprocessor architectures for mobile multimedia applications are implemented using a globally asynchronous, locally synchronous (GALS) approach, combining a set of synchronous blocks using an asynchronous communication scheme.
Due to the high cost of chip-fabrication, errors in the architecture have to be found as early as possible. Therefore, architects are interested in applying formal methods in the design phase. In addition, a formal model has to take into account both hardware and software, because a part of the system's functionality is implemented in software to provide the flexibility required by the rapidly evolving market. However, even if the software part can be updated easily, the basic functionalities implemented in hardware have to be thoroughly verified.
This article reports on the application of a modern formal analysis tool (CADP [1] ), and in particular the LNT (formerly LOTOS NT, for "LOTOS New Technology") language [2, 3] , to a complex hardware block of an industrial architecture developed by STMicroelectronics, namely the Dynamic Task Dispatcher (DTD). The DTD serves to dispatch data-intensive applications on a cluster of processors for parallel execution.
Until now, formal methods have been used by STMicroelectronics mainly for checking the equivalence between different steps in the design flow (e.g. between a netlist and a placed and routed netlist) or for establishing the correctness of a computational block (e.g. an inverse discrete cosine transform) by theorem proving. However, STMicroelectronics is unfamiliar with formal methods to validate a control block such as the DTD. For this reason, STMicroelectronics participates in research projects, such as the Multival project on the validation of multiprocessor architectures using CADP. Our choice of CADP was also motivated by related successful case-studies, in particular the analysis of a system of synchronous automata communicating asynchronously [4] , and the cosimulation of complex hardware circuits for cache-coherency protocols with their formal models [5] . Finally, because the considered design is a GALS architecture, the interfaces between the processors and the DTD can be considered asynchronous, which fits well with the modeling style supported by CADP.
Contributions. We illustrate several advantages of modeling and analyzing the DTD using LNT, a new formal language based on process algebra and functional programming. First, although modeling the DTD in a classical formal specification language, such as LOTOS [6] , is theoretically possible, using LNT made the development of a formal model practically feasible. In particular, features such as predefined array data-types, loops, and modifiable variables helped to obtain a model easily understandable by hardware architects. Second, the automatic analysis capabilities offered by CADP (e.g. step-bystep simulation, model checking, co-simulation) enabled to uncover a problem in the borderline use case with both, heavy application load and partially broken hardware.
Compared to [7] , this article presents a generalized model of the DTD that can manage a heterogeneous cluster with processors featuring different processor extensions; this generalization was facilitated by the use of LNT. Handling processor extensions also required to generalize the temporal logic correctness properties. We also took advantage of this model change to optimize the LNT model and to ease the study of different numbers of processors. The resulting model is included as appendix. Finally, we experimented with adding constraints: this enabled the formal analysis of scenarios larger than those of [7] .
Dynamic Task Dispatcher
The joint STMicroelectronics-CEA "platform 2012" project [8] aims at developing a many-core programmable accelerator for ultra-efficient embedded computing. This accelerator includes one or several processor clusters. We focus on a cluster designed for fine-grained parallelism (data and task level), consisting of processors, memories, communication networks, and control blocks.
The cluster features 16 STxP70 processors, 32-bit microcontrollers with a Harvard architecture (separated data and instruction busses). The STxP70 technology allows the extension of the general purpose instruction set through the addition of user-defined instructions and registers.
For the platform 2012 project, two processor extensions were considered: vector and bitstream. The vector extension is optimized for video and imaging applications providing a Single Instruction Multiple Data (SIMD) instruction set so that a 128-bit wide data can be seen as eight 16-bit values or four 32-bit values, on which the same arithmetic instruction is performed simultaneously. The bitstream extension is devoted to the manipulation of non-aligned words and is useful for encoding/decoding compressed data. With this extension, a word with a size of an arbitrary number of bits can be extracted from any position in an aligned 32-bit word or two consecutive words. The opposite manipulation is also possible (moving a register value to an arbitrary position in a word).
The underlying programming model is the "ready to run until completion" model, i.e. a task can be divided in several subtasks, which can be executed independently if each subtask has all the data needed for its completion at the time it is launched. As there is no interaction between subtasks, the subtasks respect the Bernstein conditions [9] , and thus can be executed in any order, even in parallel (this might be required to reach the expected performance).
Among the routines for subtask-execution, we focus on dup(void *f (int i), int e, int n), which requests the creation of n subtasks requiring processor extension e (each subtask executes the function f with an index i as argument) and returns when all the subtasks have terminated.
In order for this execution scheme to be efficient, tasks must be switched in only a few cycles and subtasks must be allocated at run time to an idle processor. This has several implications on the hardware architecture. First, the cluster is based on a data memory shared by all processors. Thus, even if a subtask runs on a different processor than its ancestor, it has the same frame pointer and thus an easy access to global variables. Second, all processors share the same instruction cache, lowering the cost of replicating a task on several processors. Lastly, a dedicated hardware block, the Dynamic Task Dispatcher (DTD), is responsible for task selection and launch on the selected processor. Figure 1 shows the overall architecture of the cluster. Processors communicate with the DTD through data accesses on dedicated, memory-mapped addresses. The DTD is thus connected, in parallel, to the data bus of each processor. A processor will use a store operation to ask the DTD to dispatch a task and a load operation when willing to execute a new task.
The cluster is designed as globally asynchronous, locally synchronous (GALS) system: Even if all the processors run at the same clock frequency, their clocks may not be ...
Shared Data Memory
Shared Instruction Cache Figure 1 : Global architecture of the cluster synchronized due to physical limitations. Furthermore, due to its complexity, the DTD is not targeted to run at the same clock frequency as the processors. In order to reduce power consumption, inactive processors are kept in idle mode and are woken up by the DTD using an asynchronous wakeup signal. After wakeup, a processor immediately issues a load to the DTD. The answer to a load is either a task descriptor, containing the address of a function to execute (in this case, the processor jumps to the address and executes the function), or a special descriptor indicating that there is no more work (in this case, the processor switches to the idle mode). To signal the end of task execution, a processor issues a load for a new task.
The implementation of dup() first issues a store to ask for a task to be dispatched, and then enters a loop, which starts by issuing a load. The response is a task descriptor (in this case, the processor executes the task -a processor p is guaranteed to execute one instance of the function f it asked to replicate if p provides the extension required for f ), a special descriptor indicating that there are no more instances to execute but some instances executing on other processors are not yet terminated (this case is called active polling), or a special descriptor indicating that all the subtasks have been executed (in this case, the processor can leave the active polling loop, return from dup(), and go on executing the calling task). The cluster supports three levels of nested tasks per processor, which is enough for the forecasted applications and is not too expensive in terms of silicon area.
The DTD also has an interface to handle the main tasks requests issued by the host processor (application deployment on the accelerator). This interface is connected to a queue. As soon as there is a task to execute in this queue and an idle processor with the relevant extension, the task is assigned to the processor and removed from the queue.
Example 1. Figure 2 shows a subtask execution scenario using three processors. The processor P0 requests the execution of four instances of the subtask foo() requiring processor extension ext, which is supported by all three processors. Processor P0 is assigned the execution of the subtask with index 3, processors P1 and P2 are awakened and assigned the execution of the subtasks with respective indexes 2 and 1. As execution on processor P2 terminates, P2 is assigned the execution of the subtask with the last index, 0. When the processor P0 finishes its execution, it is first informed that it has to wait for the completion of subtasks instances ( LD RSP (WAIT SLAVE)). When asking once more after all subtasks have been executed, P0 is informed about the completion ( LD RSP (DONE)). worrying about its mapping on the hardware processors. The application only requests the execution of its subtasks, on processors having the requested extensions, without explicitly identifying processors or specifying the number of tasks to be executed in parallel. As the application's subtasks can be executed in any order, the application's results are always the same, but the performance might not be optimal if too few subtasks are executed in parallel. In addition, this feature enables the very same application to be run on depreciated hardware (second class chips where some processors are known to be non-functional, hardware failures due to mechanical or electrical chocks, or voluntarily turned off processors to save energy when battery level is low).
Formal Model of the DTD
We formally modeled the DTD using LNT [2, 3] , a variant of the E-LOTOS [10] standard implemented within CADP. LNT combines the best of process-algebraic languages and imperative programming languages: a user-friendly syntax common to data types and processes, constructed type definitions and pattern-matching, and imperative statements (assignments, conditionals, loops, etc.). LNT is supported by the lnt.open tool 2 , which translates an LNT specification into a labeled transition system (LTS) suitable for on-the-fly verification using CADP.
Design Choices
From the DTD point of view, all the interfaces (with the host, the memory, and the processors) evolve in parallel: hence, an unconstrained state space exploration would lead to state space explosion. Furthermore, an application running on processors must respect some rules that are embedded in the programming model (such as the number of nested tasks and order of transactions on the interface). Modeling these rules in the DTD model would be artificial. For all these reasons, we have chosen to abstract applications and to limit the analysis to a set of scenarios running on abstracted processors.
The classical way of verifying a hardware block is to run massive simulations. For a block like the DTD, these simulations mean executing several scenarios. These simulations rely on the event scheduler of the simulator. Precise hardware simulations of the whole system are expensive in time and some abstractions are used, which imply that the resulting scheduling may not be the same as the real one. Even if we restrict the verification of our formal model to a set of scenarios, we improve coverage because we explore all the scheduling possibilities for each scenario. Furthermore, we are able to use model checking, which is impossible for standard simulations.
We decided to model everything, hardware (both the DTD and the processors), applications, and software routines (in particular dup()) using LNT processes, because only the code inside an LNT process has access to the gates and can synchronize with other processes. For example, it is mandatory to define dup() as a sequence of three rendezvous, namely a store, a load request, and a load response.
The representation in an asynchronous language of events taken into account simultaneously was a modeling challenge. Indeed, the DTD is a classical synchronous hardware block, scanning its inputs at each cycle of its clock and computing the relevant outputs. Hence, the decisions taken by the DTD are not based on a response to a single input but on the totality of all inputs. We did not want to artificially synchronize on a global clock, so we used a multi-phase approach: an input is, asynchronously, taken into account by modifying an internal state vector S i , and outputs are issued according to a second state vector S o . The outputs are computed, asynchronously, by scanning the state vector S i , updating the state vector S o by a decision clause. This clause may include a rendezvous on a particular gate, which can be seen as clock for this decision function in a synchronous design. This rendezvous prevents the non-determinism in the generated LTS. Using this multi-phase approach enables interleaving of synchronization in the independent interfaces of the model because the model is never blocked waiting for a synchronization and parallel parts of the model evolve atomically.
The main difference between this approach and that proposed for integrating a synchronous automaton in an asynchronous environment [4] is that we need to aggregate several asynchronous events into a single synchronous event, whereas in [4] each asynchronous message is decomposed into a set of synchronous signal changes.
Example 2. Consider an arbiter used to serialize accesses to a shared resource (in hardware, for instance, access to a shared bus from several initiators). Each actor willing to access the resource, sends a request I and waits for being elected ( O rendezvous). From the arbiter point of view, several requests can be received, but only one actor is selected each time a decision is taken according to the arbitration policy. Figure 3 (a) presents the code of an arbiter between two actors respecting the rules presented before.
3 This arbiter has two interfaces A and B, the states of which are recorded in the variables state A and state B. Each interface evolves by the rendezvous on gate I followed by the rendezvous on gate O. The first two when-clauses deal with the first rendezvous and modification of the state, while the last two clauses deal with the second rendezvous, according to the computed state. The middle clause is the decision function which updates the state; this clause issues a rendezvous on gate D. Figure 3 (b) presents the resulting LTS. Due to the fact that time is not modeled, one or two inputs can be received before a decision is taken and inputs may be received before the result of a decision has led to an output. This explains why several decisions can be taken before any rendezvous on OA or OB. The unfair characteristic of the arbitration policy (giving priority to agent A) can be seen on state 4 of the LTS where, after receiving the two requests IA and IB, the decision always selects OA. Moreover, if a second decision is taken before the rendezvous on OA, rendezvous OA and OB may occur in any order.
Modeling the Dynamic Task Dispatcher Hardware
From the DTD point of view, the state of a processor can be unknown (before the processor signals it has booted), idle (in the idle mode), neutral (executing a top-level task), master (having caused a dispatch of subtasks by calling dup()), or slave (executing a subtask dispatched by another processor). In the last case, the DTD has to keep a reference to the corresponding processor having called dup(). Due to the nested task mechanism, the processor state has to be kept in a stack-like structure of fixed depth.
Additionally, we have to record the state of the interface of each processor. The state of the interface of a processor is used to propose the relevant rendezvous. For example, the running state of the interface is used when the processor executes a task or a subtask, so that the interface can accept a load signaling the end of task execution. Each rendezvous affects only the state of the corresponding interface: thus, all interfaces can change independently of the others. DTD decisions are based on (and modify) all the interface states and processor states.
The model of the DTD is thus described by an LNT process Dtd executing an infinite loop containing:
• For each processor, several guarded clauses dealing with its interface. Each of these clauses handles a rendezvous with the processor and updates the variables representing the state of the processor interface. A clause also deals with the communication with the host processor, filling a queue with task requests. These blocks of code implement the connection between the asynchronous communication scheme and the synchronous decision function.
• Several clauses to launch tasks requested by the host and to achieve the dispatches requested by the tasks executing on the processors. This corresponds to the function executed by the real hardware DTD on each cycle.
The communication between the DTD and the processors is modeled using four gates: WAKEUP, LD RQ (load request), LD RSP (load response), and ST (store, considered to be atomic). Each synchronization on one of these gates has a first offer identifying the involved processor, e.g., "WAKEUP (2)" models the activation of processor P2.
The number of processors impacts the internal structure of the DTD, mainly because the arbitration is based on the status information of all processors (and their interfaces), and not only on the status of a single processor. Hence, we used an array indexed by processor identifiers to store the status of the processors: changing the number of supported processors thus requires simply to change the size of the array. 4 The LNT process defining the behavior of the DTD is given in Appendix A.
Modeling Applications and Processors
First, we define an enumerated type, called PC representing the addresses (or program counters) of the task functions. Second, we define the type Pid of process identifiers as the range 0..N − 1 of natural numbers.
To circumvent a limitation of the LNT compiler, which rejects some non-tail recursive calls, we include a call-stack in the processor model; this call-stack is passed by reference (mode inout) to the processes Execute and Dup implementing the execution of the tasks. Thus, the execution of a task function is modeled by a simple process, called Execute, that is mainly a switch between the various values of PC, as shown in Figure 4 .
Dup adds the continuation cont (the address of the task function to be executed at the end of the subtask) to the stack, performs the store operation, and exits. When Dup exits, so does the calling Execute process. Then Processor requests a new subtask. After termination of all subtasks, Processor calls Execute to execute the continuation, which is removed from the stack. The corresponding LNT code is shown in Figure 5 .
Formal Analysis of the DTD
We used the CADP toolbox [1] to formally analyze the LNT model of the DTD for 19 finite scenarios, each for up to eight processors. Let N b (respectively, N v ) be the number of available processors with bitstream (respectively, vector) extension: hence, the total number of processors is N = N b + N v . Without loss of generality, we consider that processors with bitstream extension are numbered from 0 to N b − 1, and processors with vector extension are numbered from N b to N − 1. We distinguish two kinds of scenarios:
• The twelve homogeneous scenarios 1 to 5 1 contain only tasks that require no particular processor extension (i.e. dont care); these scenarios are exactly those studied already in [7] . Scenario 1 defines a set of N + 1 tasks, which can be executed in parallel and do not fork subtasks. This scenario aims at verifying the execution of main tasks requested by the host under saturation of the system (more tasks than available processors).
The other homogeneous scenarios all contain calls to dup(), the simplest one being scenario 2 (see also Figure 4 ). Scenario 2 defines one main task that forks N subtasks; scenario 2 1 adds to scenario 2 more subtasks and scenario 2 2 adds to scenario 2 two other main tasks that do not fork subtasks. These scenarios are designed to validate the fork mechanism issued by a main task and the spreading of subtasks among the available processors. Scenario 2 1 aims at validating the system saturation by subtasks and scenario 2 2 aims at validating the system saturation by other main tasks (forking subtasks is perturbed by the execution of other main tasks).
Scenario 3 uses nested calls of dup(): a main task forks subtasks that also fork, the total number of tasks and subtasks being greater than N . The main task of Scenario 4 invokes dup() twice consecutively, each time forking more than N subtasks; scenario 4 1 just forks more subtasks at each invocation of dup() than scenario 4. This enables to validate the sequencing of subtasks.
Lastly, scenario 5 consists of two main tasks, each invoking dup(); scenario 5 1 changes the number of subtasks. These scenarios aim at validating that subtasks of different main tasks do not interfere.
• The seven heterogeneous scenarios 6 to 9 handle tasks requiring different processor extensions. Scenario 6 defines a set of more than N +1 tasks (one "don't care" task, N b + 1 bitstream tasks, and N v − 1 vector tasks); scenario 6 1 adds to scenario 6 two more vector tasks. These scenarios are extensions of scenario 1 to the case of heterogeneous processors.
Scenario 7 defines one main task that first forks 2 * N b bitstream subtasks and then 10
forks N v + 1 vector subtasks. This scenario corresponds to an application that uses the bitstream extension to align a bit-compressed multimedia stream and relies on the vector extension to adapt the data (for instance to dynamically rescale images to a different screen resolution). This is the typical intended application scheme of the accelerator. This set of scenarios covers different ways of using the hardware accelerator with, or without, saturation of the resources. Note that each scenario may be executed in many different ways, depending on the interleaving: for instance, subtasks might be executed in parallel (on several processors) or sequentially (on the same processor). Therefore, we believe that this set of scenarios covers the main features of the DTD.
State space generation
Using CADP, we generated the LTS for each of the scenarios listed in Table 1 , columns four and five of which give the LTS size. Each LTS has been minimized for strong bisimulation using the bcg min tool 5 . On average, minimization reduced the number of states by a factor of 1.7 (we observed a minimal factor of 1.03 and a maximum factor of 4.08). Because we did not hide any transition (to enable the verification of any temporal logic property and to obtain as explicit counterexamples as possible), using weaker bisimulations (such as branching bisimulation) would have led to the same results. For all these experiments, we used a computer with a 2.8 GHz processor and 100 GB of RAM; the execution time to generate and minimize one scenario varied from less than one second to almost four hours (scenario 7 for eight processors). For six (respectively, eight) processors, LTS generation was possible for only eight (respectively, one) scenarios. For more processors, even the generation of the smallest scenario ran out of memory. For even smaller scenarios (only two tasks in scenario 1, or a duplication to only two processors in scenario 2), the LTS can be visualized step-by-step and checked manually.
For each of the homogeneous scenarios, the generic model yields an LTS that is strongly bisimilar to the one generated by the model described in [7] . Interestingly, because the generic LNT model is smaller, state space generation is more time-efficient than for the model described in [7] . Table 1 : Minimized state space sizes; the total number of processors is N = N b + Nv Figure 6 : Number of states for some selected scenarios; the x-axis is labeled with a subset of the scenarios of Table 1 , where "s-N " stands for scenario s for N processors and "s t-N " stands for scenario for s t for N processors; the repartition between processors with bitstream and vector extension is as indicated in Table 1 The LNT model of the DTD includes the initial communication between the DTD and the processors to determine the set of available processors. Hence, each scenario covers all possible interleavings of this protocol with the execution of the jobs requested by the host processor. However, in a concrete implementation, all processors that will eventually boot do so before the arrival of the first job requested by the host processor. A possibility to reproduce this constraint in the LNT model is to synchronize the DTD on gates BOOT and HOST with the additional process Force Boot shown in Figure 7 (a). Columns six and seven of Table 1 show that adding this constraint significantly reduces LTS size, enabling, for some scenarios, the generation of the LTS for a larger number of processors: for six processors the LTS for 17 scenarios can be generated, and for eight processors the LTS of seven scenarios can be generated. Figure 6 illustrates the state space reduction for some selected scenarios.
Note that in both the unconstrained and constrained model all processors eventually boot. To take into account the possibility that some processors may never boot (corresponding to an execution on a depreciated hardware), the DTD can be synchronized on gates BOOT and HOST with the additional process Partial Boot shown in Figure 7 (b), which ensures that no processor boots after the arrival of the first host request -in the LNT behavior "disrupt B 1 by B 2 end disrupt" behavior B 1 might be interrupted at any moment to start execution of B 2 . For each kind of processor extension, Partial Boot forces at least one processor to boot, thus avoiding trivial blocking situations where no processor would be able to execute a task.
Columns eight and nine of Table 1 and Figure 6 show that adding this constraint also reduces LTS size, but not as much as forcing all processors to boot. This should be expected, as each scenario now includes executions with exactly i processors for all 13 
Model-Checking the DTD
To formally verify the correct execution of the different scenarios, we expressed some properties in MCL (Model Checking Language) [11] . The ability to capture the number of a processor in one transition label proved to be crucial to express properties in a concise and generic way.
A first property expresses that each scenario is acyclic, i.e. from each state, a terminal state without outgoing transitions is eventually reached:
The set of states satisfying this fix-point property is computed iteratively, starting with X = ∅: Initially, "[ true ] ∅" is satisfied by states without outgoing transitions, and iteration k adds to X those states from which a deadlock can be reached in k steps.
Unfortunately, this property does not hold for all scenarios with a dup() operation, because the master processor stays in its state after receiving a WAIT SLAVE. Indeed, the third block of messages in Figure 2 (i.e. "LD RQ (NEED JOB)" followed by "LD RSP (WAIT SLAVE)") might be repeated an arbitrary number of times. However, under the hypothesis that each slave always terminates, such a cycle is executed at most a finite number of times. Thus, cycles of this form should not be considered a problem, and the property must be refined, for instance by requiring that only cycles of this form are permitted, i.e. that the system inevitably reaches either a deadlock or gets stuck in a cycle of the permitted form, yielding property ϕ1 (the property "< true* . ϕ >@" is satisfied by all states of a cycle containing a transition with a label of the form ϕ): µ X . [ true ] X or exists y:Nat . < true* . {LD RSP !y !"WAIT SLAVE"} >@ A second property ϕ2 requires that, after waking up a processor, the DTD eventually tells the processor that there is no more work left, i.e. each "WAKEUP !x" is eventually followed by "LD RSP !x !NONE" (where x is a processor number):
[ true* . {WAKEUP ?x:Nat} ] inevitable {LD RSP !x !"NONE"} Note how the number x of the processor woken up is extracted from a transition label by the first action predicate "{WAKEUP ?x:Nat}" and is used subsequently in the property. The predicate "inevitable(B)" expresses that a transition labeled with B is eventually reached from the current state. It can be defined in MCL by the following macro definition: macro inevitable (B) = not (B) * < true* . B > and not < not {LD RSP ?y:Nat !"WAIT SLAVE"} >@ end macro
As for property ϕ1, the definition of inevitable ignores any (spurious) cycles corresponding to a master processor waiting indefinitely for the slave processes to terminate.
This macro definition of inevitable improves the one given in [7] by using fair inevitability [12] , i.e. by requiring, as long as there has been no B, that there is always an execution leading to B. This additional condition is required to properly handle the case of heterogeneous processors, because the assignment of task c to processor x, i.e. a transition of the form "LD RSP !x !c !i" (where i is either −1 or the index of the subtask), is only correct if x can handle the extension required to execute c.
A third property ϕ3 expresses that each call to dup() executes to completion, i.e. each "ST !x !DUP ..." is eventually followed by "LD RSP !x !DONE" (the three dots "..." are part of the concrete syntax of MCL and match any remaining offers):
[ true* . {ST ?x:Nat !"DUP" ...} ] inevitable {LD RSP !x !"DONE"} This and the fifth property require an additional renaming operation to enable access to the parameters of the dup() operation, i.e. to rename "ST !n !DUP (c, e, m)" to "ST !n !DUP !c !e !m".
A fourth property ϕ4 requires that each task sent by the host application is executed exactly once, i.e. each "HOST !c !e" (c being the address of the task to be executed and e being the required processor extension) is eventually followed by a transition of the form "LD RSP !x !c" (x being the identifier of a processor with extension e), but cannot be followed by a sequence containing two transitions of the form "LD RSP !y !c" (x and y being processor numbers, and c being the task received previously from the host processor). This property can be expressed as a conjunction of the properties for each kind of processor extension; the one for the bitstream extension can be expressed in MCL as follows (the expression "(true* . L){2}" characterizes transition sequences that contain exactly two occurrences of L): This and the fifth property require an additional renaming operation, namely to rename "LD RSP !n !EXEC(c, i)" to "LD RSP !n !c !i". A last property ϕ5 expresses that each dup() operation is followed by the correct number of subtask assignments. As the fourth property, this property can be expressed as a conjunction of properties for each kind of processor extension; the one for the 15 vector extension can be expressed in MCL a follows (the property "forall i:Nat among {low ...high} . ϕ(i)" is satisfied by all states that satisfy ϕ(i) for all natural numbers i in the range [low ..high]): true * . {ST ?x:Nat !"DUP" ?c:String !"VECTOR" ?n:Nat} forall i:Nat among { 0 ... n-1 } . inevitable {LD RSP ?y:Nat !c !i where (y ≥ N b )} Verifying these properties for all scenarios listed in Table 1 (i.e. a total of 540 model checking tasks) with the EVALUATOR 4 model checker [11] 6 required less than two weeks on the computer used for LTS generation; the 285 model checking tasks for only the scenarios with four processors required less than three hours; see Table 2 for details.
For all kinds of systems, scenarios 8 1 and 8 2 do not satisfy properties ϕ2, ϕ3, and ϕ5; for partially booted systems, also scenario 8 does not satisfy properties ϕ2, ϕ3, and ϕ5, and scenarios 8 1 and 8 2 do not satisfy property ϕ4.
The counterexamples provided EVALUATOR 4 help in understanding the reasons. For instance, Figure 8 shows the counterexample produced for property ϕ3, scenario 8 (two tasks, each forking two bitstream and four vector subtasks), and a partially booted system. One observes that only processors 0 (with bitstream extension) and 3 (with vector extension) boot. The first task (PC11) is assigned to processor 0, which requests the execution of two bitstream subtasks (PC12); then the second task (PC21) is assigned to processor 3, which also requests the execution of two bitstream subtasks (PC22). The transition DISPATCH DUP from state 14 to state 15 represents the decision made by the DTD to schedule first the subtasks requested by processor 0. Thus, processor 3 is waiting for the execution of the requested bitstream tasks, whereas processor 0 is assigned two subtasks PC12 (first with index 1, then 0) and then requests the execution of four vector subtasks. At this point, the system is live-locked: both processors wait for the termination of subtasks that cannot be executed because no processor with an appropriate extension is available; in Figure 8 this corresponds to the strongly connected component of states {35, 37 -44}. Note that this scenario describes a heavy application load: a total of twelve subtasks for only two available processors.
To avoid the live-lock, one might think of changing the DTD to allow to assign another task t to a processor p waiting for the termination of subtasks t i not executable by p itself. However, this approach requires to significantly increase the depth of the stacks used by the DTD, because the subtasks t i might be executed at maximum depth and t also might fork further subtasks t ′ i not executable by p, etc. As our model is not suitable to the execution of forecasted applications on the full architecture, we decided to co-simulate two models of only the DTD: the C++ model of the architect and our LNT model.
Co-simulation of the C++ and LOTOS NT models
The DTD has been designed by the architect directly as a C++ model suitable for high level synthesis tools such as CatapultC 7 or the Symphony C compiler 8 . Therefore, 6 See also "http://cadp.inria.fr/man/evaluator4.html". 7 See "http://www.mentor.com/esl/catapult/overview". ϕ1 ϕ2 ϕ3 ϕ4 ϕ5 ϕ1 ϕ2 ϕ3 ϕ4 ϕ5 ϕ1 ϕ2 ϕ3 ϕ4 ϕ5  1  51m  8s  1s 24s 3s  3m  1s  0s  2s 0s  7m  1s  0s  0s  1s  2  0s  1s  0s  1s 1s  0s  0s  0s  0s 0s  0s  0s  0s  0s  0s  2 1  0s  2s  1s  1s 2s  0s  0s  0s  0s 0s  0s  0s  0s  0s  1s  2 2  1s 22s  9s 35s 20s  0s  2s  1s  2s 1s  0s  2s  1s  0s  2s  3  0s  3s  2s  2s 2s  0s  0s  0s  0s 0s  0s  0s  0s  0s  0s  3 1  1s 40s 24s 22s 32s  0s  2s  1s  1s 1s  0s  2s  2s  0s  2s  3 2  1s 15s  9s 10s 14s  0s  1s  0s  1s 1s  0s  1s  1s  0s  1s  3 9  0s  1s  1s  1s 1s  0s  0s  0s  0s 0s  0s  0s  0s  0s  0s  1  -----10h 7m 18s 18m 78s  -28m 73s 61m 5m  2  5s 11m 2m 4m 8m  0s  4s  2s  2s 4s  0s 10s  3s  5s  9s  2 1  15s 47m 12m 20m 41m  1s 32s  8s 13s 24s  1s 76s 16s 28s 57s  2 2  -----6s 16m 3m 12m 10m  10s 33m 6m 24m 18m  3  -----1s 28s 16s 11s 13s  2s 2m 56s 46s 54s  3 1  -----10s 36m 24m 11m 17m  17s 68m 41m 22m 37m  3 2  -----2s 3m 64s 67s 98s  3s 7m 2m 3m 4m  3 3  -----56s 5h 2h 3h 2h  96s 7h 4h 6h  4h  6 4  19s 25m 6m 11m 21m  0s 10s  3s  5s 8s  1s 26s  8s 14s 23s  4 1  33s 2h 30m 42m 91m  1s 97s 17s 31s 63s  2s 4m 41s 79s 3m  5  59s 4h 2h 2h 3h  0s 16s  7s 11s 12s  1s 59s 26s 41s 48s  5 1  -----48s 4h 2h 2h 3h  57s 5h 2h 2h  4h  6  -75m 4m 3h 17m 19m  4s  1s 11s 1s  2h 24s  3s 2m  7s  6 1  -----18h 16m 35s 42m 3m  -35m 85s 90m 7m  7  0s  9s  4s  6s 10s  0s  0s  0s  0s 1s  0s  0s  0s  1s  1s  8  -----3s 7m 2m 3m 5m  4s ko ko 5m ko  9  6s 14m 9m 6m 10m  0s  5s  4s  3s 4s  0s  8s  6s  4s  6s  2  -----6s 22m 4m 6m 18m  24s 2h 23m 32m 76m  2 1  -----37s 3h 37m 49m 2h  2m 9h 2h 3h  6h  4 -----12s 57m 9m 12m 34m 58s 4h 58m 79m 3h
51s 3h 54m 73m 2h 1s 60s 13s 21s 38s 2s 3m 32s 58s 2m 9 -----24s 2h 75m 29m 53m ----- Table 2 : Rounded property verification times ("s" means seconds, "m" means minutes, "h" means hours); unsatisfied properties are marked as "ko" this model follows the synchronous approach commonly applied in the hardware design community. In this approach, a hardware block is represented as a function f : inputs × state → outputs × state that is called on each clock cycle to evaluate its inputs and to compute the outputs and the new internal state to be used in the next clock cycle. The C++ model of the DTD comes with a clock-based simulation environment providing abstractions of the host processor, the cluster processors, and the software executing on them. In order to assess the correctness of the C++ model (and thus the generated hardware circuit), we experimented with the co-simulation of the C++ and LNT models, using the EXEC/CAESAR framework [5] . Practically, we added the LNT process Dtd (i.e. the model of the DTD without its environment) to the simulation environment coming with the C++ model. Keeping also the C++ model of the DTD ensured that both models were exposed to the same stimuli, enabling us to crosscheck both models, in particular that both models behave similarly. This differs from classical co-simulation environments where a part of a design is replaced by a model not depicted at the same level of abstraction, such as an Instruction Set Simulator of a processor inserted in the simulation of a full System On Chip with peripherals depicted in a hardware description language.
In some sense, this co-simulation is similar to model-based conformance testing, as for instance with TGV [13] or JTorX [14] 9 . Taking as input a model and a test purpose, TGV computes a test case that, when used to test an implementation, enables to check conformance of the implementation to the model: without a test purpose, our co-simulation simply checks the conformance of each step in an execution. Similar to our approach, JTorX does not require an explicit representation of the model; hence our approach differs from JTorX mainly in the connection between the model and the system under test: calls to C functions versus communication via standard input/output (or dedicated adapters).
The main challenge was the combination of an asynchronous event-based LNT model with a synchronous clock-based C++ model and simulation environment. Indeed, in one single clock cycle, several inputs to the DTD might change, and it might also be necessary to change more than one output: thus, a single simulation step of the C++ model might require several events (i.e. rendezvous synchronizations) in the LNT model. To further complicate matters, the number of events corresponding to a single clock cycle is not known in advance, because it depends on the current state and inputs.
Before presenting our approach to driving an asynchronous model within a synchronous simulation environment and the results of our experiments, we briefly recall the principles of the EXEC/CAESAR framework.
Principles of the EXEC/CAESAR framework
In the EXEC/CAESAR framework, an LNT model interacts with its simulation environment only by rendezvous on the visible gates. Practically, for each visible gate, the simulation environment has to provide a C function, called a gate function; offers of the rendezvous are passed as arguments to the gate function (in a nutshell, offers sent from the LNT model to the environment are passed by value, and offers received from the environment are passed by reference). Each gate function returns a boolean value, indicating whether or not the simulation environment accepts the rendezvous.
Using the CAESAR compiler [15, 16] , an LNT model is automatically translated into a C function f , which tries to advance the simulation by one step. In each state, f first determines the set of rendezvous permitted by the LNT model; if this set is empty, f signals a deadlock, otherwise it iterates on the elements of the set, calling the corresponding gate functions with appropriate parameters. As soon as one rendezvous is accepted by the environment, the model performs the corresponding transition and moves to the next state. If none of the rendezvous is accepted, f returns with an indication that the state has not changed; this feature enables the simulation environment to compute the set of all rendezvous possible in the current state of the LNT model; calling f once more then enables one of these rendezvous to be accepted.
Approach
To integrate the asynchronous LNT model into the synchronous C++ simulation environment, we took advantage of the feature of EXEC/CAESAR mentioned above to compute the set of all enabled rendezvous. We also exploited the fact that, as usual for hardware circuits, input and outputs can be distinguished by the gate of the rendezvous: the gates ST, LD RQ, and HOST represent inputs of (i.e. signals received by) the DTD, whereas the gates LD RSP and WAKEUP represent outputs of (i.e. signals sent by) the DTD. Furthermore, we used the fact that any output of the DTD is always the reaction to (a set of) inputs. Last but not least, we relied on the modeling style, in particular the independence of the different interfaces in the LNT model of the DTD. Indeed, for a set of actions (only inputs or only outputs) that may occur in the same clock cycle, the modeling style ensures the confluence of the execution of the actions in the set, i.e. when the LNT model of the DTD executes such a set of actions, all orderings lead to the same state. Thus, one can arbitrarily choose one ordering.
Concretely, to simulate the equivalent of one clock cycle of the synchronous C++ model, we execute the following steps.
• Iterate over all proposed rendezvous to compute the set of all enabled outputs of the LNT model. If this set is different from the set of outputs produced by the C++ model (since the last clock), signal an error.
• Accept all outputs in the set once. If an output is enabled more than once, signal an error.
• Iterate over all proposed rendezvous to compute the set of all enabled inputs of the LNT model. If this set does not include all inputs to be given to the C++ model, signal an error.
• For all inputs given to the C++ model, provide them once to the LNT model.
• Accept the rendezvous marking the execution of the decision function.
Example 3. If we apply this approach to the arbiter example presented in Figure 3 , the output signals are OA and OB, input signals are IA and IB, and the decision making signal is D. In a co-simulation, the behavior of the model will not cover the full LTS as an output is always accepted before the next input. For example, in state 3, the input 20 transition 3 → 6 cannot be taken, due to the output transition 3 → 0; this implies that transition 6 → 8 is never taken. Because also transition 5 → 7 cannot be taken, states 7 and 8 are unreachable. Thus, co-simulation obviously explores only a sub-set of the LTS.
Results
Using the EXEC/CAESAR framework, we co-simulated the LNT model of the DTD for 16 processors with the architect's C++ model, using the architect's simulation environment for stimuli generation. After a ramp-up phase mainly devoted to fine-tuning which signal should be considered in which clock phase and dealing with C/C++ mangling, we were able to run the first scenarios.
We found that the architect's C++ model did not behave correctly for heterogeneous scenarios where a task t running on a processor with extension e requests the execution of subtasks requiring an extension different from e: the subtasks are correctly executed, but t never receives a "LD RSP (DONE)".
At the time we made these experiments, some part of the hardware had been designed and the silicon area of the shared instruction cache, needed for this fine-grained dispatching scheme, appeared to be too expensive regarding the whole chip silicon budget. The architecture of the system has been redefined and the DTD has been discarded. Unfortunately, this occurred before the architect had been able to modify its model. This prevented us to achieve the full co-simulation.
Conclusion
We illustrated that LNT, a formal modeling language based on process algebra and supported by the CADP analysis and verification toolbox, is well-suited for modeling of a complex industrial hardware circuit in an asynchronous multiprocessor environment. Although all this would certainly have been possible using a classical formal specification language or other formal methods, we found that using LNT/CADP helped in obtaining the model, communicating with the architect, and pointing out a problem that has not been detected by other methods used in STMicroelectronics. Our formal model thus does not appear as "just another model" and the time spent to develop the model is not a loss.
This work points to several research directions. First, the case study poses a challenge of using more elaborate and/or prototype state space exploration techniques (e.g. distributed, compositional, and on-the-fly verification, or static analysis for state space reduction) to handle larger scenarios. Second, the user friendly syntax of LNT enabled its application in an industrial environment. Furthermore, it would also be interesting to widen the analysis by considering the performance evaluation features provided by CADP.
