Message passing is commonly used to preserve data coherency in distributed systems. This paper presents an algorithm for inserting a minimal set of message-passing operations in system-level design to guarantee data coherency. The target architecture is a multi-component heterogeneous system, where some components have local memory (or are memory components themselves). The algorithm enables automatic insertion of message passing during system-level design, relieving designers of tedious and error-prone manual work. The optimality of the solution given by the algorithm also ensures the quality of the automatic insertion. Experiments show that the automatic approach achieves a productivity gain of 200X over manual refinement.
INTRODUCTION
In order to handle the ever-increasing complexity and time-to-market pressures in the design of systems-on-chip (SOCs) and embedded systems, design abstraction has been raised to the system level to increase productivity. At the system level, designers deal with system components including microprocessors, special-purpose hardware units, memories and busses. System level design usually starts with a specification model written in a system level design language, such as C++, VHDL, SystemC or SpecC. The specification model is a purely functional description of the system, composed of a hierarchy of modules. Each leaf module in the hierarchy encapsulates a small part of the computation (a code segment). Non-leaf modules can be a parallel, sequential, pipelined or finite-state-machine composition of sub-modules. Inter-module communication is realized using shared variables and channels. As a purely functional model, the specification does not assume any implementation detail of the modules and variables. The specification model can be simulated to obtain profiling data, which can help designers make good design decisions.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISSS'02

Daniel Gajski, Center for Embedded Computer Systems
University of California, Irvine, CA 92697, USA, gajski@ics.uci.edu
Figure 1: Module/Variable Mapping.
During the process called architecture exploration, designers come up with a system architecture by selecting a set of system components and connecting them with busses. Modules in the specification are then partitioned and mapped to the system components, shared variables are partitioned and mapped to the memories of these components (Figure 1), and channels are partitioned and mapped to the busses. A new description, called the architecture model, is developed to reflect the selected architecture and the module/variable/channel mapping decisions. In the architecture model, additional component modules representing the allocated components are introduced in the module hierarchy. The design is then described as a parallel composition of these component modules, since they run concurrently. The architecture model can be simulated to verify the desired functionality and to estimate performance metrics, and thus to evaluate the quality of the selected architecture.
Since the architecture allocation and mapping decisions have a first-order impact on the quality of the final design, architecture exploration is usually an iterative process seeking the best solution. We can divide architecture exploration into two tasks, synthesis and refinement. Synthesis task [...] increase productivity. There are four major tasks in refining a specification into an architecture model. The first task, behavior refinement, is to synchronize the execution of modules running in parallel on different components after module mapping, in order to preserve the original execution order specified (or implied) in the specification model. The second task, scheduling refinement, is to serialize module execution on components that are single-threaded. The third task, variable refinement, is to insert message passing among components to ensure data coherency after shared variables are mapped to the local memories of different components. The last task, channel refinement, is to implement channels using the bus interfaces of the components.
The focus of this paper is on variable refinement. We identify and discuss the major issues in achieving automated variable refinement. The paper is organized as follows. In section 2, we point out some related work. Data coherency at the system level is discussed in section 3. The problem is formulated in section 4. In section 5 we present our algorithm for the problem. Experimental results are shown in section 6. Finally, we give our conclusions.
RELATED WORK
Most of the work in system level design has focused on synthesis problems, including architecture allocation ([1], [2]), software/hardware partitioning ([3], [4]) and cosimulation ([1]). However, automatic refinement has not received much attention from the system level design community.
Automatic model refinement, including control-related refinement, data-related refinement and interface synthesis, is described in [5]. In [6], a set of formal models and transformations between models are defined to enable automatic model refinement.
DATA COHERENCY WITH MESSAGE PASSING
The goal of variable refinement is to guarantee data coherency in a multi-component architecture when variables are mapped to memories. Depending on how and where variables are mapped, there are different approaches to achieve data coherency. In the shared-memory mechanism, there is a dedicated memory component in the system architecture. All shared variables are stored in this memory component, and all other components access the memory through the system bus. In this case, the original correct order of accesses to a variable can be preserved by synchronizing the execution of modules on different components. An example is shown in Figure 2. In the original specification, module A produces data x for module B, which in turn modifies it and passes it to C. After partitioning, A and C run on PE1 and B runs on PE2. Here, x is mapped to a global memory component.
Synchronizations, wait and notify, are inserted to preserve the correct accessing order to x. The issue of inserting synchronizations was discussed in [7]. With this mechanism, the memory component becomes the critical component, which dictates the overall performance. A variety of techniques have been proposed to reduce memory access latency. [...] modeled as a (special) component whose only task is to store and retrieve data.) The values of the local copies are kept consistent by sending messages through message-passing channels. Message-passing channels encapsulate the implementation details of the communication methods, i.e., send and recv. Channels are widely used in system level design. The use of channels separates communication from computation so that the two can be refined separately without interference. Continuing with the same example, but now with x mapped to both PE1's and PE2's local memories (Figure 3), the values of x on PE1 and PE2 are updated via message passing (send(x) and recv(x)). The issue of adding the appropriate message-passing operations is our focus.
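The A, B, C example can be illustrated with a small sequential simulation. This is only a sketch: the `Channel` class, the `run_example` function and the concrete values are hypothetical, and a real architecture model would run PE1 and PE2 concurrently with blocking channel calls.

```python
from collections import deque


class Channel:
    """Hypothetical message-passing channel exposing send and recv."""

    def __init__(self):
        self._queue = deque()

    def send(self, value):
        self._queue.append(value)

    def recv(self):
        # A real channel would block until data arrives; this sketch assumes
        # the matching send has already happened.
        return self._queue.popleft()


def run_example():
    pe1_to_pe2 = Channel()
    pe2_to_pe1 = Channel()

    # Module A on PE1 writes PE1's local copy of x, then sends it.
    x_pe1 = 10
    pe1_to_pe2.send(x_pe1)

    # Module B on PE2 receives x into PE2's local copy, modifies it, sends it back.
    x_pe2 = pe1_to_pe2.recv()
    x_pe2 = x_pe2 + 1
    pe2_to_pe1.send(x_pe2)

    # Module C on PE1 receives the updated value before reading x.
    x_pe1 = pe2_to_pe1.recv()
    return x_pe1, x_pe2


result = run_example()
```

After the two messages, both local copies hold the same value, which is the coherency property the inserted send/recv pairs are meant to provide.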
Send-before-read and send-after-write
Although message passing can be inserted at any point between the writing module and the reading module, it is advantageous to send the data right after it is produced (send-after-write), rather than when it is needed by the reading party (send-before-read). This allows prompt delivery of the data through the channel. For instance, if the data size is big, it takes considerable communication time to transfer the data. In addition, since it is more common to have a single write followed by multiple reads than the other way around, another advantage of send-after-write is that it avoids redundant messages when data is written only once but read multiple times (or the read is inside a loop). An example is shown in Figure 4. Send-before-read results in two message-passing operations, while send-after-write needs only one. Therefore, send-after-write is used in our approach.
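The difference in message count can be made concrete with a toy model. The function below is a hypothetical illustration, not part of the paper's tool: it assumes one writer and one remote reader, and it assumes that a naive send-before-read policy transfers the data before every read.

```python
def count_messages(events, policy):
    """Count messages for one variable shared by a writer and a remote reader.

    events: execution-ordered sequence of "W" (write on the producing
            component) and "R" (read on the remote component).
    policy: "send-after-write" or "send-before-read".
    """
    if policy == "send-after-write":
        # One message immediately after each write.
        return sum(1 for e in events if e == "W")
    if policy == "send-before-read":
        # Assumed naive policy: one message before each read.
        return sum(1 for e in events if e == "R")
    raise ValueError("unknown policy: " + policy)


# One write followed by two reads.
trace = ["W", "R", "R"]
after_write = count_messages(trace, "send-after-write")
before_read = count_messages(trace, "send-before-read")
```

For this trace the send-after-write policy needs one message and the naive send-before-read policy needs two.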
Potential Redundancy
Because of the rich control constructs (branch, loop, fsm) in most languages, the control flow graph of the modules in a specification is expected to be very complex, rather than a simple sequential execution like the example shown in Figure 3. Furthermore, the same variable is usually accessed by multiple modules. Simply broadcasting a message to all other components after each write to the variable will definitely guarantee data coherency. But this method will most probably introduce redundant messages and thus increase inter-component communication overhead. In practice, designers [...] This manual approach is time-consuming for a normal-size design. Even worse, it is error-prone and difficult to debug. Determining the minimal number of messages needed requires thorough data dependency analysis, to which some compiler techniques can be applied. In this paper, we propose a graph algorithm that finds the true data dependencies between modules and inserts message passing only as needed, avoiding redundancy.
PROBLEM FORMULATION
As pointed out in the previous section, the problem here is to derive a minimal set of messages sent across components to keep data consistent. A couple of definitions are introduced to formulate our problem.
A transition graph is a graph G(V, E), where 1) V represents the modules in the specification and 2) E represents the transitions between modules.
In the transition graph, each node has two attributes, PE and TYPE. PE stores the module mapping information, i.e., which component the module is mapped to. (Note that leaf modules are indivisible and cannot be partitioned across different components.) TYPE stores variable access information, which is initialized and used internally by the algorithm described later. Each edge has one attribute, Length, which is also initialized and used by the algorithm described later. The transition graph is a directed graph that can be constructed from the original specification. It can be cyclic if loops or finite-state machines are present in the specification.
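One possible in-memory representation of such a graph is sketched below. The class and method names (`Node`, `TransitionGraph`, `add_node`, `add_edge`, `is_cyclic`) are hypothetical and chosen only for illustration; the attributes mirror PE, TYPE and Length as defined above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    pe: str          # PE attribute: component the module is mapped to
    type: str = "X"  # TYPE attribute: filled in later, per variable


@dataclass
class TransitionGraph:
    nodes: dict = field(default_factory=dict)  # name -> Node
    edges: dict = field(default_factory=dict)  # (src, dst) -> Length

    def add_node(self, name, pe):
        self.nodes[name] = Node(name, pe)

    def add_edge(self, src, dst):
        self.edges[(src, dst)] = 0  # Length is initialized later

    def successors(self, name):
        return [dst for (src, dst) in self.edges if src == name]

    def is_cyclic(self):
        """Depth-first search for a back edge (loops or FSMs create them)."""
        WHITE, GREY, BLACK = 0, 1, 2
        color = {n: WHITE for n in self.nodes}

        def visit(n):
            color[n] = GREY
            for m in self.successors(n):
                if color[m] == GREY:
                    return True
                if color[m] == WHITE and visit(m):
                    return True
            color[n] = BLACK
            return False

        return any(color[n] == WHITE and visit(n) for n in self.nodes)


# Sequential example: A -> B -> C, with A and C on PE1 and B on PE2.
g = TransitionGraph()
g.add_node("A", "PE1")
g.add_node("B", "PE2")
g.add_node("C", "PE1")
g.add_edge("A", "B")
g.add_edge("B", "C")
acyclic_before = not g.is_cyclic()

# Wrapping the sequence in a loop adds a back edge and makes the graph cyclic.
g.add_edge("C", "A")
cyclic_after = g.is_cyclic()
```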
The optimal messagepassing problem can be formulated as follows.
Given:
1) a specification in the form of a transition graph G(V, E);
2) a set of variables D = {variable1, variable2, ...}.
Determine:
A set of messages M = {message1, message2, ...}.
Such that: 1) data coherence is kept (correctness) and 2) the number of messages is minimal (optimality).
Eliminate redundancies
As we pointed out earlier, sending a message to each of the other components after every writer would satisfy the correctness requirement but not the optimality requirement. To obtain a minimal set of messages, a number of conditions can be checked to eliminate all potential redundancies.
A module that reads from a given variable is called a reader, and a module that writes to that variable is called a writer. Each variable may have multiple writers and readers, which are partitioned and mapped to different components.
We claim that a message is needed only when the following conditions on a writer/reader pair are satisfied: 1) the writer and the reader are mapped to different components; 2) there exists at least one path from the writer to the reader in the transition graph; 3) among all paths from the writer to the reader, at least one path does not contain another writer (which would overwrite the value); 4) the message is not already in the message set.
Conditions 1), 2) and 4) are easy to check. To check condition 3), we can augment the transition graph so that the problem of finding data dependencies reduces to a shortest path problem, to which existing algorithms can be applied.
Algorithm
The input to our algorithm is a transition graph G representing the specification and a set of variables D. The output of the algorithm is a set of messages M. The algorithm is a two-step iteration over all variables in D. The first step augments the transition graph with variable access information for a given variable. The second step then checks all pairs of writers and readers against the aforementioned conditions and adds messages as needed.
OPTIMAL MESSAGE-PASSING APPROACH
Figure 5: Augmented Transition Graph.

To perform data dependency analysis, the access type of a module on a variable is needed. Since it is very common for current system-level design languages to specify port directions (in, out, inout) for modules, it is trivial to obtain the access type of any module on a given variable. For languages that do not support port directions, access type can be obtained [...]
Augment transition graph
To simplify our explanation, we abstract the process of finding the access type of a module (m) on a variable (v) as an operation FindAccess(m, v). FindAccess(m, v) simply returns the access type, which can be R (read), W (write), RW (read-write) or X (no access).
In this step, both the nodes (V) and edges (E) of the transition graph are augmented with information that will be used in the later step. First, each node is assigned a TYPE value depending on its access type to the given variable. Second, each edge is assigned a length. The edge length is set to 1 if its tail has type W or RW, and 0 otherwise. An example of an augmented transition graph is shown in Figure 5.
Each node has a PE number and a TYPE, while each edge has a length. The pseudo-code for this step is shown here.
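As a minimal Python sketch of the augmentation step (not the authors' pseudo-code), the function below assigns TYPE to every node and Length to every edge for one variable. The names `find_access`, `augment` and `access_table` are hypothetical; in particular, modeling FindAccess as a lookup in a table of port directions is an assumption made only for this example.

```python
def find_access(access_table, module, variable):
    """Stand-in for FindAccess(m, v): returns "R", "W", "RW" or "X".

    access_table maps (module, variable) -> access type; a missing entry
    means the module does not access the variable.
    """
    return access_table.get((module, variable), "X")


def augment(nodes, edges, access_table, variable):
    """Return (types, lengths) for one variable.

    types:   module -> TYPE
    lengths: (tail, head) -> 1 if the tail writes the variable, else 0
    """
    types = {m: find_access(access_table, m, variable) for m in nodes}
    lengths = {
        (tail, head): 1 if types[tail] in ("W", "RW") else 0
        for (tail, head) in edges
    }
    return types, lengths


# Hypothetical example: A writes x, B reads and writes x, C reads x,
# and D does not touch x.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C")]
access_table = {("A", "x"): "W", ("B", "x"): "RW", ("C", "x"): "R"}

types, lengths = augment(nodes, edges, access_table, "x")
```

In this example every edge leaving A or B gets length 1, while the edge leaving D gets length 0, so a path's total length counts the writers it leaves behind.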
Analyze transition graph
This step checks all four conditions stated in section 5.1 for each pair of a W (or RW) type node w and an R (or RW) type node r to decide whether a message is needed. The subroutine ShortestPath(G, w, r) is called to return the length of the shortest path from node w to node r in graph G. An infinite length implies that there is no path from w to r. This subroutine can employ any shortest path algorithm for directed graphs, for instance, Dijkstra's algorithm. The pseudo-code of this step is shown here. Although the algorithm presented here operates on a non-hierarchical graph, it is straightforward to extend it to handle a hierarchical graph, where each node is itself a transition graph. The only modification needed is to introduce a source node and a sink node for each transition graph to glue a hierarchical node. The source node and sink node have no internal functionality.
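A hedged Python sketch of the analysis step follows; it is an illustration, not the authors' pseudo-code. Two points are assumptions of this sketch rather than statements from the text: a shortest path of length exactly 1 is taken to mean that only the edge leaving w itself contributes, so no other writer lies on that path (condition 3), and a message is identified by the triple (variable, writer, destination component), so a set enforces condition 4. All function and variable names are hypothetical.

```python
import heapq
import math


def shortest_path(nodes, lengths, src, dst):
    """Dijkstra's algorithm; returns math.inf if dst is unreachable."""
    adj = {n: [] for n in nodes}
    for (tail, head), length in lengths.items():
        adj[tail].append((head, length))
    dist = {n: math.inf for n in nodes}
    dist[src] = 0
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        if u == dst:
            return d
        for v, length in adj[u]:
            if d + length < dist[v]:
                dist[v] = d + length
                heapq.heappush(heap, (d + length, v))
    return dist[dst]


def find_messages(pe, types, lengths, variable):
    """Return the set of (variable, writer, destination PE) messages."""
    messages = set()
    writers = [n for n, t in types.items() if t in ("W", "RW")]
    readers = [n for n, t in types.items() if t in ("R", "RW")]
    for w in writers:
        for r in readers:
            if w == r or pe[w] == pe[r]:   # condition 1
                continue
            d = shortest_path(pe, lengths, w, r)
            if d == math.inf:              # condition 2: no path at all
                continue
            if d != 1:                     # condition 3 (assumed threshold)
                continue
            messages.add((variable, w, pe[r]))  # condition 4 via the set
    return messages


# Hypothetical example: A -> B -> C and B -> D.
pe = {"A": "PE1", "B": "PE2", "C": "PE1", "D": "PE3"}
types = {"A": "W", "B": "RW", "C": "R", "D": "R"}
edges = [("A", "B"), ("B", "C"), ("B", "D")]
lengths = {(t, h): 1 if types[t] in ("W", "RW") else 0 for (t, h) in edges}

messages = find_messages(pe, types, lengths, "x")
```

In this example A's value reaches D only through B, which overwrites it, so no message from A to PE3 is generated; B sends to both PE1 and PE3, and A sends to PE2.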
EXPERIMENTS AND RESULTS
The algorithm has been implemented and integrated into the SpecC Architecture Refinement tool. The input to the tool is a system specification model and architectural parameters, such as allocation and mapping decisions. The output is an architecture model reflecting the selected architecture.
A GSM Vocoder design, an industrial-strength example, was used to perform our experiments. The original specification model has 10,000 lines of SpecC code. It is composed of a hierarchy of 120 modules, with more than 100 variables used to connect these modules hierarchically (Figure 6). For clarity, not all modules are shown in the figure. After architecture exploration, a system architecture composed of a DSP56600 and an ASIC connected by a system bus was chosen. The module code-book, which is the performance bottleneck, is implemented in a custom ASIC to speed it up. The rest of the specification is executed on the DSP56600. Since there is no global memory component in the architecture, all data shared by code-book and the rest of the system are mapped to the local memories of both components, i.e., registers and DSP local memory. The module code-book is five levels down the module hierarchy and accesses more than a dozen shared variables. Therefore, it takes considerable effort to perform data dependency analysis and insert message passing between these two components.
Before the automatic refinement tool was available, it took 24 hours (3 days) for a person to manually write and debug the architecture model. The architecture model has 11,000 lines of SpecC code with 14 global message-passing channels used for inter-component communication. With automatic refinement, the only work required is to input the architecture parameters and the module/variable mapping information and to invoke the tool. Using a graphical user interface, this work can usually be done within 5 minutes. The tool then performs the refinement in less than 1 minute. The automatically generated model has the same number of channels and synchronizations as the manually refined model.
The generated architecture model was successfully simulated to validate functional equivalence against the specification. As we can see, the productivity increase is over 200X for this example with only two components. We can expect an even better gain for a typical SOC design that consists of more than one hardware and one software component.
Another example, a JPEG Encoder design, was also run through the tool (Figure 7). Its size is smaller than that of the Vocoder design, but the speedup with the automatic tool is also significant. This automatic model refinement not only saves designers from tedious yet error-
