Abstract. Asynchronous/self-timed designs are beginning to attract attention as a promising means of dealing with the complexity of modern VLSI technology. In this paper, we present our views on why asynchronous systems matter. We then present details of our high level synthesis tool SHILPA, which can automatically synthesize asynchronous circuits from descriptions in our concurrent programming language, hopCP. We outline many of the novel features of hopCP and sketch how these constructs are compiled into asynchronous circuits, and then focus on the high level optimizations employed by SHILPA, including concurrent guard evaluation and concurrent process decomposition.
Introduction
It has been pointed out by many researchers recently that asynchronous circuits (circuits that do not employ global clocks) have a number of advantages over synchronous circuits when it comes to building large and complex sequential systems [1, 2, 3, 4]. In this paper, we summarize recent developments in asynchronous circuit design and then present our high-level synthesis system, SHILPA. We will focus on the high-level optimizations used by SHILPA. High-level optimizations are similar to "flow-graph level optimizations" in programming language compilers [5]; they should not be confused with circuit level optimizations, which are similar to machine code optimizations.
Synchronous vs. Asynchronous Circuits
Synchronous circuits are employed virtually everywhere. They have a number of desirable characteristics, some of which are the following. The clock period of a synchronous circuit is chosen to be long enough to allow its combinational stages to settle down, thereby preventing failures due to hazards. In asynchronous circuits, hazards can be mistaken for genuine signal transitions. Hence, it is of paramount importance to eliminate hazards, for instance by employing special purpose Boolean minimization procedures [6]. Synchronous circuits do not have the overhead of handshaking. Very many simulation and testing techniques, as well as Computer-Aided Design (CAD) tools, are available for them.
Synchronous circuits also have many shortcomings. Large synchronous circuits employ high frequency, low skew global clocks, driving which can consume considerable amounts of power [7]. The design of synchronous/asynchronous interfaces (for example, peripheral interfaces) must be done with great care, for fear of inviting failure due to metastability [1].
There are many specific kinds of asynchronous circuits, some of which are: self-timed circuits (those that generate completion signals), delay insensitive circuits (whose behavior is invariant over module and wire delays), and speed independent circuits (whose behavior is invariant over module delays, but not necessarily wire delays). These distinctions depend largely on the granularity of the circuit primitives. For example, a synchronous system that communicates externally using handshake signals can be regarded as a self-timed component in a larger context.
Asynchronous circuits are attractive in many ways. To a large extent, they allow one to focus on functionality and not on timing details. This makes the task of high-level synthesis centered around asynchronous circuits much easier in many respects. For example, there is no need to perform clock scheduling. Operations whose durations are data dependent as well as I/O dependent can be more cleanly and efficiently handled in the asynchronous high level synthesis framework. Asynchronous circuits can also exhibit better average case performance, unencumbered by clocking rules [8].
Despite these promises, many designers have shunned asynchronous circuits. It is feared that asynchronous circuits are much larger than synchronous circuits. Asynchronous circuits also offer the designer even more freedom to explore the design space. The designer has the choice of numerous concurrent algorithms to begin with; upon each chosen algorithm, he can effect numerous high level optimizations; each leads to circuits to which different circuit level optimizations can be applied; finally, each specific circuit has its own "best suited" circuit design style; and all these tasks are inter-related. For instance, if an addition operation is used in a thread whose average execution time should be kept low, carry-completion addition would be a viable alternative. This may, in turn, suggest a DCVSL CMOS style implementation with its own associated transistor sizing rules. Without adequate design space exploration support tools, this added freedom offered by the asynchronous style can be a burden for the designer.
It is hoped that many of these limitations of asynchronous circuits can be overcome very soon through additional research. Many of the early failures involving asynchronous circuits can now be avoided through careful design [9] or verification [10]. Area overheads are becoming less severe, especially if a slight increase in area can actually buy reduced design time. Recently, there have been many convincing demonstrations of the practicality of large asynchronous designs [11, 12]. Many high-level [11, 13, 14, 15] and low-level [16, 17, 18] synthesis tools have been developed.
The clean separation between "synchronous" and "asynchronous" systems has already begun to blur. Mixed synchronous/asynchronous circuits [19, 20], Q-modules [21], locally clocked asynchronous systems [18], and the "asynchronous style" synchronous control networks used in Olympus [22] are indicative of this trend. Whichever course the hardware design community may ultimately follow, it seems inevitable that asynchronous design will play an increasing role as time goes by. Based on this assumption, we are justified in taking the approach of studying asynchronous designs in isolation in this paper.
Context and Motivation for our work
A prominent category of efforts in asynchronous design deals with compiling behavioral descriptions in high-level languages based on the communicating sequential process paradigm into asynchronous circuits. In these efforts, asynchronous design is viewed as concurrent programming, where the computation to be implemented is expressed in a high-level concurrent HDL. This approach is more suitable for system level synthesis. This is in contrast to the works of [23, 24], as well as the more recent works of [25, 10, 9, 26], which are more suited for low level synthesis and verification of asynchronous state machines.
Our system, SHILPA, belongs to the former category. To the best of our knowledge, systems similar to ours that have been fully implemented and tried out in practice are those by Brunvand [14] [...] [29], that can determine if two actions are serially ordered or not, could be developed fairly easily, thanks to the HFG based notation. Concur is central to many of the optimizations performed by SHILPA. The HFG based intermediate representation also helps in smoothly integrating all the SHILPA tools (a compiled code simulator, Concur, and, in future, performance evaluation tools). SHILPA compiles circuits by taking each action in the HFG and rewriting it to a normal form HFG (NHFG) fragment (to be explained later) as well as the associated resources; this graph rewriting based compilation keeps the SHILPA compiler modular, easier to understand, and, it is hoped, easy to verify in future. Flow analysis based optimizations, a common intermediate form for a variety of asynchronous design tools, and compilation through graph rewriting have not been addressed before in asynchronous high level synthesis.
There are two classes of approaches for realizing asynchronous circuits in hardware: Boolean gate based, and macromodule based. Asynchronous macromodules implement functions such as rendezvous, arbitration, procedure call and return, and control merging. Many approaches using macromodules view the given design problem as a concurrent programming problem; more specifically, one of mapping a given concurrent program into an interconnection of macromodules. There are also many efforts in which macromodules are used directly for realizing state machines (i.e., for low level synthesis). Some examples are [25, 9]. Some of these distinctions are also rapidly blurring, with the use of complex gates that directly realize multi-input, multi-output Boolean functions as macromodules. In SHILPA, macromodules are the target of compilation at present. Our set of macromodules was originally developed by Brunvand [30] using the Actel field programmable gate arrays (FPGAs); we have made numerous extensions to this cell set.
Organization
In Section 2, we briefly sketch the syntax and semantics of hopCP. In Section 3, we illustrate SHILPA on a two-stage pipeline. In Section 4, we examine concurrent guard evaluation in some detail. In Section 5, we present an example of parallel decomposition, a useful technique for obtaining pipelined designs. Concluding remarks are provided in Section 6.
hopCP System Overview
Syntax
A hopCP description consists of one or more sequential processes composed in parallel using the || operator. Two sequential processes are shown in Figure 1, through HFGs as well as using a textual notation. A sequential process is one or more process definitions composed in series using the ; operator, such that for every process call, there is a corresponding process definition. Each sequential process shown in Figure 1 consists of two process definitions. The processes defined are P, Q, R, and S.
A process definition consists of a choice node annotated with a process name and a list of formal parameters. Arcs lead off from the choice node (a "circle") to one or more alternative transitions that are annotated with actions. These actions are commonly known as guards. Arcs lead off from the guards to nodes that perform process calls.
The left-most process definition defines process P, which has two formal parameters x and y. We will use the words process and state synonymously. The guards of P are a?z and b?x. These actions belong to the category data input. These transitions are, in turn, followed by the process calls Q[x+1, f(y), z-x] and P[x+y, y-x]. Note that every process call has a corresponding process definition in the same sequential process. When a process call is made, the actual parameters are passed by value. The guard of process Q is the compound action (c?, d!x1+y1). A compound action can appear as a guard if it is the only guard of a choice node. Further restrictions on hopCP's guards are noted later. This guard requires the input synchronization action c? and the data output d!x1+y1 to be both finished before the process call to P is made. All the constituent actions of a compound action must be disjoint, i.e., must not share channels, registers, or other resources, so that they may run in parallel without interference. Compound actions are useful for specifying a collection of primitive actions to be done in parallel. They are also very useful for specifying the compilation rules of SHILPA, which break up high level actions into collections of simpler actions that can be done concurrently.
The guards of process R are the expression actions odd(x2) and ¬odd(x2). These form Boolean guards that decide where control passes from state R[x2]. We encourage designers to specify Boolean guards in a mutually exclusive manner using the form "formula" and "¬formula", as this situation arises very frequently. SHILPA compiles such guards using predicate action blocks (Figure 2). A predicate action block evaluates pred(data) and steers the request transition to either the T output (if pred(data) holds) or the F output (if ¬pred(data) holds). If SHILPA does not find the pattern "formula" and "¬formula", it assumes that the Boolean guards are not mutually exclusive, and uses an arbiter to select one of the true Boolean guards (Figure 2). We use the ring-style arbiter from [30], which functions roughly as follows: after a request is applied, a token is circulated within the arbiter; more than one req_i input may be asserted at any time; one of these requests is acknowledged. In general, arbiters occupy more area than predicate action blocks. They also use circuits such as the interlock that cannot be realized in many technologies, such as most of today's FPGAs. (Note: The FPGA realization in [30] is only an approximation, to permit rapid prototyping.)
The different categories of variables, channel names, and their scoping rules are as follows. Variables can either be local to a process definition (e.g., x, y are local to P), or declared to be globals (in this example, dsvar has been declared as a global variable). Variables used in data input actions (e.g., z) are local to the process definition in which they appear; their scope begins at the data input action and lasts till the ensuing process call. Channel names are local to a sequential process. Global variables can be shared across process definitions as well as sequential processes. Other comparable description languages disallow sharing global variables across parallel threads for a good reason: they have no tool support to determine if global variable accesses can be potentially concurrent. In hopCP, we allow such shared variables because (a) it has been our common observation that many real world systems frequently communicate over shared registers or busses; (b) procedure Concur can determine whether two actions in an HFG are serially ordered or potentially concurrent. Using Concur, all accesses to global variables can be checked to make sure that they are serial. (Note: The serial ordering itself is imposed by the synchronizations between the sequential processes.)
Algorithm Concur works as follows. When invoked with two actions a and b as arguments, Concur first composes the sequential processes into one HFG by merging transitions that can rendezvous. Then it performs a reduction of the HFG by removing places and transitions in such a way that the causal orderings between a and b are unaffected. Then, Concur performs reachability analysis on the reduced HFG, to determine all the reachable markings. It then checks whether there exists a marking y such that the union of the preconditions of a and b is a subset of y, but the intersection of the preconditions of a and b is empty; if so, actions a and b are potentially concurrent; if not, these actions are serial. Concur assumes that all Boolean guards are true. Therefore, although it can tell that only one guard will be picked, it cannot tell which one will be picked. Hence, its results are pessimistic. Despite this caveat, in practice, we find that it is relatively easy to determine whether two actions are serially ordered or not. Though its worst case complexity is exponential in the HFG size, Concur has performed reasonably fast on many practical examples.
A commonly used notational abbreviation is as follows: if a process definition has only one reference (i.e., only one process calls the process being defined), then it is possible to in-line substitute the process definition in place of the process call. This abbreviation is illustrated on an example consisting of two process definitions for T and U that are mutually recursive. Here, process U has exactly one reference, while process T has two references, because it is also the initial state. We can eliminate an explicit definition for process U. The textual syntax for this abbreviated definition would be, after simplifications:
Informal Execution Semantics
The informal execution semantics of the example in Figure 1 are as follows (the formal semantics of hopCP are given in [31]). Suppose the execution is begun at P and R. These processes begin their execution concurrently. Process P first makes a choice between the guards a?z and b?x. This "alternative choice" command has the same meaning as in CSP-like languages. For example, if action a?z is to take place, a matching action of the form a!exp must also be enabled in another sequential process; in this case, a?z and a!exp are said to rendezvous, whereupon the value of exp gets bound to z. In our example, the data input action b?x of P is matched by the data output action b!dsvar+1 of S; a matching action for a?z is not shown. Input and output synchronization actions are value-less counterparts of data input and data output, respectively.
In hopCP, data output follows the multicast semantics: a data output action such as b!dsvar+1 can rendezvous with more than one data input action that uses the same channel. For example, suppose that three concurrent processes P1, P2 and P3 attain a state in which P1 offers action b!dsvar+1, while P2 and P3 offer b?v and b?x, respectively. According to the multicast semantics, P2 can proceed as soon as b!dsvar+1 is offered by P1; likewise, P3 can proceed as soon as b!dsvar+1 is offered by P1. However, P1 can proceed only after both b?v and b?x have been offered by P2 and P3, respectively.
Valueless communication actions in hopCP follow the barrier synchronization semantics: an output synchronization action such as e! can synchronize with more than one e? action in as many sequential processes. In this case, all the actions e? as well as the single e! action must wait for each other and proceed only after they all have been enabled.
A designer may use barrier synchronization when "time alignment" is called for. Since, in hopCP, interactions between concurrent threads can occur through value assignments on global variables (as noted earlier) or through rendezvous, it makes a semantic difference whether barrier synchronization is followed or multicast. In addition, the effect of multicast can be obtained even for valueless communications, by suitably "faking" a value communication (for example, following the syntax e!nullvalue and e?ignore).
The availability of barrier synchronization as well as multicast offers considerable flexibility in specifying system level behavior, as we have shown through numerous large examples, notably the specification of the high level protocols obeyed by the Intel 8251 USART [32]. These constructs are also useful for specifying concurrent algorithms [33]. These features of hopCP are absent from comparable languages that are used for asynchronous high level synthesis.
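The difference between the two semantics can be illustrated with a small timing model. The sketch below is not part of hopCP or SHILPA; the function names and the idea of representing each party by the time at which it offers its action are assumptions made purely for illustration. Under multicast, each receiver is released as soon as both it and the sender have offered, while the sender waits for all receivers; under barrier synchronization, every party is released at the time of the last offer.

```python
def multicast_release(send_time, recv_times):
    """Release times under multicast: receivers wait only for the sender,
    the sender waits for every receiver."""
    receivers = {name: max(send_time, t) for name, t in recv_times.items()}
    sender = max([send_time, *recv_times.values()])
    return sender, receivers


def barrier_release(send_time, recv_times):
    """Release times under barrier synchronization: everyone waits for everyone."""
    t = max([send_time, *recv_times.values()])
    return t, {name: t for name in recv_times}


# Hypothetical offer times: P1 sends at time 2, P2 is ready at 1, P3 at 5.
offers = {"P2": 1, "P3": 5}
print(multicast_release(2, offers))  # sender released at 5, P2 at 2, P3 at 5
print(barrier_release(2, offers))    # all parties released at 5
```

In this model P2 proceeds earlier under multicast than under barrier synchronization, which is the kind of observable difference that matters when threads also interact through global variables.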
Coming back to process P, consider a situation in which the communication actions a?z and b?x can arrive potentially concurrently. In this situation, an arbiter would be used to pick one of these communications (for example, as in [14]). However, if it can be determined using Concur that these actions are mutually exclusive, SHILPA compiles a circuit using the concurrent guard evaluation technique. This technique also uses a circuit that is smaller and easier to realize than an arbiter. This is one of the high level optimizations to be discussed later.
Coming back to the guards of P, if action b?x is chosen, the existing value of x is overwritten during the data input action b?x. Then, control goes back to P through a process call P[x+y, y-x], when the current value of x gets replaced by the value of x+y and the value of y by the value of y-x. Note that this particular process call P[...] cannot use variable z in its actual parameter expressions, because z is visible only in the scope of action a?z.
If guard a?z of P is chosen for execution, control reaches process Q. In the process, formal parameters x1, y1, z1 are bound to the values of expressions x+1, f(y), z-x, respectively. Process Q performs a compound action; i.e., it waits for the input synchronization c? and the data output d!x1+y1 to both finish before it engages in the process call P[x1, x1+dsvar]. While the value of the global variable dsvar is being acquired during the computation of expression x1+dsvar, dsvar must not be concurrently changed by another process. We can determine whether this is the case using Concur. The execution semantics of processes R and S are similar. Process R involves Boolean guards that are mutually exclusive. If control passes to S, it performs the data output, which can synchronize with action b?x of process P.
Notice the common subexpressions dsvar+1 in process S. Currently SHILPA cannot avoid recomputing dsvar+1; however, this optimization can be incorporated in a straightforward way, as done in standard compilers. However, SHILPA can be made to do resource sharing: for example, since the two uses of '+' are in the same process definition, the designer can request SHILPA to share the adder, if he/she so desires. The two invocations of add used in process definitions Q and S can be shared only if they are guaranteed to occur serially. Again, Concur can be used to determine if these two usages are always serial or not.
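The serial-versus-concurrent test applied by Concur, as described earlier in this section, can be sketched on a simple safe net. This is only an illustrative approximation, not SHILPA's implementation: it assumes the sequential processes have already been composed into one net, it skips the reduction step, and the data representation (a dictionary mapping each transition name to a pair of precondition and postcondition place sets) as well as the example net are hypothetical.

```python
from collections import deque


def reachable_markings(transitions, initial):
    """Enumerate all markings of a safe net; a marking is a frozenset of marked places."""
    start = frozenset(initial)
    seen = {start}
    queue = deque([start])
    while queue:
        marking = queue.popleft()
        for pre, post in transitions.values():
            if pre <= marking:  # transition is enabled
                nxt = frozenset((marking - pre) | post)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return seen


def potentially_concurrent(transitions, initial, a, b):
    """True if some reachable marking contains the (disjoint) preconditions of both a and b."""
    pre_a, pre_b = transitions[a][0], transitions[b][0]
    if pre_a & pre_b:
        return False
    markings = reachable_markings(transitions, initial)
    return any((pre_a | pre_b) <= m for m in markings)


# Hypothetical net: fork into two threads (actions a and b), join, then action c.
NET = {
    "fork": (frozenset({"p0"}), frozenset({"p1", "p2"})),
    "a": (frozenset({"p1"}), frozenset({"p3"})),
    "b": (frozenset({"p2"}), frozenset({"p4"})),
    "join": (frozenset({"p3", "p4"}), frozenset({"p5"})),
    "c": (frozenset({"p5"}), frozenset({"p0"})),
}

print(potentially_concurrent(NET, {"p0"}, "a", "b"))  # a and b lie on parallel threads
print(potentially_concurrent(NET, {"p0"}, "a", "c"))  # c can only follow the join
```

The exhaustive enumeration of markings is what makes the worst case exponential in the size of the net.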
Restrictions on Guards
Guards in hopCP have to obey a number of restrictions. These restrictions help in many ways: they help avoid potentially dangerous situations (e.g., deadlocks). They also help in obtaining efficient circuits without compromising the expressive power too much. Some of the restrictions on guards are now listed. If a compound action is used as one of the guards, it must be the only guard going out of the choice node. Similarly, if a data output action is one of the guards, it must be the only guard going out of the choice node. Also, if an assignment action is one of the guards, it must be the only guard going out of the choice node. Two guards must not use the same input channel. All input channels used in guards must be point-to-point: in other words, broadcast or multicast channels should not be used in guards. The guards associated with a choice node may consist of expression actions, data input actions, and input synchronization actions. During execution, however, all the expression actions are examined before any of the non-expression actions within guards are examined.
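Several of these restrictions are simple structural checks on the guards of one choice node. The following sketch shows how such checks might be expressed; it is not taken from SHILPA, and the Guard record, the kind names, and the check_guards function are all hypothetical names introduced for illustration. The evaluation-order rule for expression actions is not modeled.

```python
from dataclasses import dataclass
from typing import Optional

# Guard kinds that must be the only guard of their choice node.
SOLE_KINDS = {"compound", "data_output", "assignment"}
# Guard kinds that read from an input channel.
INPUT_KINDS = {"data_input", "input_sync"}


@dataclass(frozen=True)
class Guard:
    kind: str                      # e.g. "data_input", "expression", "compound"
    channel: Optional[str] = None  # channel name for communication guards


def check_guards(guards, multicast_channels=frozenset()):
    """Return a list of violated restrictions for the guards of one choice node."""
    errors = []
    for g in guards:
        if g.kind in SOLE_KINDS and len(guards) > 1:
            errors.append(f"{g.kind} must be the only guard of its choice node")
    input_channels = [g.channel for g in guards if g.kind in INPUT_KINDS]
    for ch in sorted(set(input_channels)):
        if input_channels.count(ch) > 1:
            errors.append(f"input channel {ch} is used by more than one guard")
        if ch in multicast_channels:
            errors.append(f"input channel {ch} is not point-to-point")
    return errors


# The guards a?z and b?x of process P satisfy the restrictions.
print(check_guards([Guard("data_input", "a"), Guard("data_input", "b")]))
# A data output guard next to another guard violates them.
print(check_guards([Guard("data_output", "d"), Guard("data_input", "a")]))
```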
Summary of Features
To sum up, our work makes a number of advances over comparable works. hopCP has been designed for supporting the specification of large hardware systems at a high level. It is more expressive than the HDLs used in comparable works. Although Martin [11] also makes the distinction between "mutually exclusive and non-exclusive guards", his approach is slightly different. In Martin's approach, an input guard is turned into an input probe, which is then made part of the Boolean guard. We do not use probes in hopCP for several reasons. First, we believe that not having probes keeps the HDL simple. Second, many of the proposed uses of probes can be replaced by corresponding uses of global variables. The synthesis systems developed by Martin, van Berkel, or Brunvand do not support flow analysis or sharing analysis. Last, but not least, we have built an integrated design system that includes a flow analyzer, an efficient compiled code simulator, and a high level synthesis system. It generates circuits ready for implementation in Actel FPGAs, supported by Viewlogic tools.
Overview of SHILPA
SHILPA generates transition style circuits using bundled data, as presented in [1]. We illustrate SHILPA through the design of a two-stage pipeline: Each action in the HFG is then refined into simpler actions, which consist of signal transitions on allocated resources. This results in NHFGs, which were introduced in Section 1. For process Q, the NHFG is as follows: Actions that end with two question marks are input signal transitions that are awaited. Actions that end with two exclamation marks are output signal transitions that are generated. For example, C_12_in1!! is a signal transition generated on input in1 of C-element number 12. Notice that the allocated resource instances for process Q include one C-element and one register. SHILPA can explain why each resource is being allocated, in the following form:
C_12:data assert for b!x + y
REG_5:argument for AB_4_arg1
C_13:data query for a?y
REG_6:argument for AB_4_arg2
REG_3: datapath for x
FAB_4:2 for x + y
REG_14:query var for z
REG_7:result for 4
REG_10:const for y
CTREE_8:2 for AB_4_arg2 -y
One example from this printout, FAB_4:2 for x + y, explains that function action block (FAB) number 4, of arity 2, has been allocated to support x+y. Control circuitry is now generated in SHILPA by detecting shared resources, i.e., resources that are triggered from two different places. In the pipeline, there are no shared resources. Next, excess registers are eliminated based on the user's interactive commands. For example, users may like to retain result registers of function blocks, so as to share the results of evaluating common subexpressions. Retaining registers can also help pipeline the evaluation of nested expressions (e.g., x+y+z+w). Sometimes, result registers have to be retained to prevent combinational loops from forming. For instance, if the value of the actual parameter expression in process Q is z+1, then the result of z+1 is held in a result register and only then loaded back to z. Detection of these situations is straightforward, though not automated at present.
After eliminating the desired number of registers, an abstract netlist can be generated. SHILPA checks whether this netlist is structurally well-formed (for example, whether two outputs are connected together, etc.); this check is redundant but quite reassuring. Then it technology maps the netlist, currently to Actel FPGAs. The resulting circuit is shown in Figure 3. The circuit works as follows. First CLR is lowered to reset the components. Then START is applied. This "arms" both the C-elements, which then await A_IN as well as B_IN. When A_IN comes from the external world, the lower C-element fires. It causes the lower reg8 module (variable y) to store the data coming through the A_DATA port. The acknowledge signal of this register is forked to A_OUT and also starts the addition of x (held in the upper reg8 module) and y. Completion of this addition triggers the upper C-element, thus finishing the synchronization involved with b?z. This causes the rightmost reg8 (variable z) to be loaded, thus finishing the data acquisition part of b?z. Finally, C_OUT is generated, register x gets loaded with the value of y, and process P is resumed. Process Q is resumed by the arrival of C_IN.
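The "arming" and "firing" of C-elements in this description relies on the standard behavior of a Muller C-element: its output changes only when both inputs agree, and otherwise holds its previous value. A minimal behavioral model is sketched below for readers unfamiliar with the component; the class and method names are hypothetical, and the model ignores delays and the reset input of the actual library cell.

```python
class CElement:
    """Behavioral model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.a = initial
        self.b = initial
        self.out = initial

    def update(self, a=None, b=None):
        """Apply new input levels (None keeps the old level) and return the output."""
        if a is not None:
            self.a = a
        if b is not None:
            self.b = b
        if self.a == self.b:  # inputs agree: output follows them
            self.out = self.a
        return self.out       # inputs differ: output holds


c = CElement()
trace = [
    c.update(a=1),  # first input arrives ("arming"): output unchanged
    c.update(b=1),  # second input arrives: the element fires
    c.update(b=0),  # inputs disagree again: output holds
    c.update(a=0),  # both inputs low: output returns low
]
print(trace)  # [0, 1, 1, 0]
```

In transition signalling, each output change of this kind is one event, so the element acts as a rendezvous between the two input events.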
Concurrent Guard Evaluation
We shall illustrate concurrent guard evaluation through process P given below: In this example, after engaging in an a? action, P engages in a b? action, and after a b?, it does an a?. There are two invocations of the input synchronization action on channel a, and likewise on channel b. The usual semantics of channels requires that these invocations share the same resources (C-elements and handshake wires, in our case). Usually this is achieved by using a call module [1]. Coming back to our example, assuming that the guards are mutually exclusive, SHILPA generates the circuit shown in Figure 5. The circuit works as follows. After CLR, when START is applied, both the call elements make a "procedure call" onto the C-elements through the respective R2 inputs. Suppose A_IN happens first; then C2 fires. It generates A_OUT and also "returns the call" through AS and A2 of CALL2. This transition first triggers XOR1 through its lower input. This causes another R2 on CALL1.
We have selected the CALL implementation of [30], in which the sequence R1; R1 causes a sequence RS; RS and likewise R2; R2 also causes RS; RS, and in addition, the CALL element is reset at the end of this sequence. Therefore, CALL1 is reset by the two R2 transitions it sees. Since the R2; R2 sequence causes an RS; RS sequence at the output of CALL1, C-element C1 also gets reset! The value of delay must be large enough to make sure that this resetting happens before a call is made through the A1 input of CALL1.
Though the generated circuit is large, the situation shown is fortunately rare. Thus, in most instances, we will have to generate a circuit similar to that in Figure 4; in those circumstances, we can indeed use a library call component.
The purpose of presenting this example was (a) to demonstrate how non-obvious the interactions between features (for example, concurrent guard evaluation and sharing) can be; (b) to show that by imposing one-sided delay constraints, often "clever" designs can be obtained. As discussed in Section 2, Martin's implementation of mutually exclusive guards uses probes and hence may avoid some of the difficulties we are facing. However, an exact comparison is not possible between our approach and Martin's approach, because we synthesize two-phase transition style circuits, which have a large number of desirable characteristics [1], while Martin synthesizes four-phase level based circuits.
Concurrent Process Decomposition
It is easy to come up with iterative specifications for many computations. We have identified a useful heuristic for implementing iterative computations through concurrent process decomposition. Concurrent process decomposition is a very convenient way to achieve software pipelining. To make this clear, consider the iterative specification of a multiplier: Notice that the value of the actual parameter z+x is not needed until the corresponding formal parameter, z, is used in the body of MULT. Also notice that z is used only in certain threads; while not odd(y) is true, this updated value of z is not needed! The situation of not odd(y) being true for many iterations can happen if the number being multiplied has a string of 0s in it. Thus, holding up the recursive invocation of MULT till z+x has finished computing can be wasteful in time.
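To make the observation about z concrete, the sketch below gives a conventional shift-and-add multiplier. It is only assumed to resemble MULT: the doubling of x and halving of y at each step, the function name, and the counter are choices made for this illustration. The accumulator z is touched only on iterations where y is odd, so a multiplier operand with a long run of 0 bits goes through many iterations that never need the result of z+x.

```python
def shift_add_multiply(x, y):
    """Multiply non-negative integers by shift-and-add.

    Returns the product and the number of additions into z that were needed.
    """
    z = 0
    additions = 0
    while y != 0:
        if y % 2 == 1:       # only this branch reads or updates z
            z = z + x
            additions += 1
        x, y = 2 * x, y // 2  # these updates do not depend on z
    return z, additions


print(shift_add_multiply(13, 8))   # (104, 1): four iterations, one addition
print(shift_add_multiply(13, 11))  # (143, 3)
```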
A modified MULT algorithm can take advantage of this situation. Expressing such algorithmic modifications in traditional sequential HDLs (e.g., VHDL) can be tricky. Fortunately, a CSP-style language is very expressive in this regard because, using rendezvous style communications, the desired interactions between various threads of computation can be conveniently specified. Note that concurrent decomposition, as we propose here, is different from Martin's process decomposition, which essentially only gives the ability to call a subroutine and return to the place of call, and not to spawn two concurrent threads as we do.
The modified specification is as follows: We first factor out variable z from MULT, and make it a local variable of a new process PZ.
PZ's role is to treat variable z as an abstract data type object, allowing it to be accessed only through two operations: operation sz, which stands for "send z", and operation azx, which stands for "add z to x". These operations can be conveniently implemented through rendezvous communications sz? and azx?x, respectively. Notice that the second rendezvous communication involves data x that is sent from MULT (Figure 6). We do not show the rest of the circuit to conserve space. Process PZ has been compiled to take advantage of concurrent guard evaluation because it is clear (as was also checked using Concur) that the guards sz? and azx?x1 are mutually exclusive. Process PZ functions as follows. Upon receiving START, (a) if AZX_IN is triggered, the bottom C-element is first reset; then, AZX_DATA is loaded into the bottom reg8 module. Its acknowledgement starts the addition and also generates AZX_OUT. When the addition finishes, the results of addition are first loaded into the result register (top-left reg8), and then transferred into register z (top-right reg8) before restarting process PZ. We are currently studying the process of semi-automating concurrent process decomposition in SHILPA [34]. Until we have tool support for concurrent process decomposition, we believe that this technique can still be applied manually without too much trouble.
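The decomposition can be imitated in software with two threads, as a rough analogy only. In the sketch below, the pz thread owns z and serves the two operations azx and sz, while mult hands over each x and continues iterating without waiting for the addition. Several things here are assumptions rather than the paper's design: the use of buffered queues (hopCP channels are unbuffered rendezvous), the message format, and the shift-and-add loop itself.

```python
import queue
import threading


def pz(requests, replies):
    """Owner of z: serves 'azx' (add x into z) and 'sz' (send z, then stop)."""
    z = 0
    while True:
        op, arg = requests.get()
        if op == "azx":
            z += arg
        elif op == "sz":
            replies.put(z)
            return


def mult(x, y):
    """Shift-and-add loop that delegates every update of z to the pz thread."""
    requests, replies = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=pz, args=(requests, replies))
    worker.start()
    while y != 0:
        if y % 2 == 1:
            requests.put(("azx", x))  # hand over x; do not wait for the addition
        x, y = 2 * x, y // 2
    requests.put(("sz", None))        # ask for the final value of z
    result = replies.get()
    worker.join()
    return result


print(mult(13, 11))  # 143
```

Because the request queue is first-in first-out, the sz request is served only after all earlier azx requests, so the result does not depend on thread scheduling.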
Concluding Remarks
In this paper, we have tried to demonstrate that asynchronous VLSI design can be greatly facilitated by designing a high level synthesis system that uses an expressive HDL and incorporates numerous optimizations. We have detailed such a system in this paper. It is well known that writing and debugging concurrent programs is difficult without tool support. The hopCP notation avoids many of the possible pitfalls in writing concurrent HDL programs by offering many high level descriptive mechanisms. To facilitate design debugging, the hopCP system offers CFSIM, a compiled code functional simulator, and Concur, a flow analyzer. hopCP also allows low level hardware features, such as global variables, to be used in hardware descriptions, and such usages to be checked for safety using Concur. Last, but not least, SHILPA tries to keep the designer fully informed about its actions, and allows the designer to influence the final circuit in many ways, through many interactive commands.
Unit-delay simulation of the pipelined multiplier showed that despite its pipelined nature, it will run slower than its non-pipelined counterpart! The reason is that the '+' operation finishes too soon, thereby not allowing the '+' to overlap in any significant way with other operations. However, with other examples, we have actually observed significant speedups due to pipelining. As is clear from these examples, the main problem we are facing currently is in performance estimation. In our experience, a high level synthesis framework for asynchronous circuits offers the designer even more freedom to explore the design space. Tool support for conducting design space exploration in this manner is sorely missed in SHILPA; but that is exactly what we will begin working on next. At present, we have synthesized many small circuits using the SHILPA system. Sizes of SHILPA generated circuits seem to compare favorably with the results produced by one VHDL synthesis system that generates synchronous circuits (the VHDLDesigner tool of the Viewlogic family was fed VHDL descriptions obtained through hand-translation of hopCP descriptions; Figure 7). These results, though by no means definitive, are at least reassuring.
