Abstract. In our terminology, the term "formal synthesis" stands for a synthesis process where the implementation is derived from the specification by applying elementary mathematical rules within a theorem prover. As a result, the implementation is guaranteed to be correct. In this paper we introduce a new methodology to formally derive register-transfer structures from descriptions at the algorithmic level via program transformations. Some experimental results at the end of the paper give an impression of the run-time cost of the synthesis process in our approach.
Introduction
The synthesis of hardware systems is heading toward more and more abstract design levels. This is because the systems are becoming more complex, and so is the synthesis process for deriving them. Therefore, the correctness of hardware components has become an important matter, especially in safety-critical domains. By correctness we mean that the synthesis result (implementation) satisfies the synthesis input (specification) in a formal mathematical sense. It is assumed that the specifications are correct, which has to be examined separately, e.g., by model-checking certain properties or by simulation. For proving the correctness of implementations, simulation is no longer suitable, since for large designs it cannot be exhaustive in reasonable time. Formal post-synthesis verification [1], on the other hand, needs manual interaction at higher abstraction levels; it can be automated at the gate level, but it is extremely costly and can only be applied if only some very simple synthesis steps have been performed. Therefore, it is our objective to perform synthesis via logical transformations and thus to guarantee "correctness by construction".
There are many approaches that claim to fulfill this paradigm. Regarding the state of the art in this area, one can distinguish two concepts: transformational design and formal synthesis. In transformational design [2], the synthesis process is based on correctness-preserving transformations. However, in most cases a lot of intuition is used during the proofs [3]. Furthermore, the proofs are often based on non-mathematical formalizations [4] and are performed in a paper-and-pencil style [5], which means that they have to be examined by others to verify them. However, the most restrictive fact in transformational design is that the implementations of the transformations are not proven to be correct. The transformations are realized by complex software programs that might be error-prone. Therefore, these approaches do not fulfill the above-mentioned paradigm.
In formal synthesis approaches the synthesis process is performed within some logical calculus. The circuit descriptions are formalized in a mathematical manner and the transformations are based on some logical rules. The DDD system [6], e.g., starts from a specification in a Lisp-like syntax. The behavior is specified as an iterative system of tail-recursive functions. This is translated into a sequential description which can be regarded as a network of simultaneous signal definitions comprising variables, constants, delays and expressions involving operations. Then a series of transformations is applied to refine the description into an implementation. The disadvantage of this method is that the description language is not strongly typed. Therefore, the consistency of an expression has to be checked separately. Furthermore, although all the transformations are based on functional algebra, their implementations have not been formally verified, nor are they based on a small core of elementary rules. Finally, the derivation process needs manual interaction. An automatic design space exploration method is not provided.
Our work is based on a functional hardware description language named Gropius, which ranges from the gate level to the system level. Gropius is strongly typed, polymorphic and higher-order. Each construct of Gropius is defined within the higher-order logic theorem prover HOL [7], and since it is a subset of higher-order logic, Gropius has a mathematically exact semantics. This is the precondition for proving correctness. The implementation of HOL is not formally verified. However, since the implementation of the correctness-critical part of HOL (i.e. deriving new theorems) is very small and is independent of the size of our formal synthesis system, our approach can be considered to be extremely safe as to correctness. In the next section, we briefly introduce the way we represent circuit descriptions at the algorithmic level and give a small program as a running example.
Existing approaches in the area of formal synthesis deal with lower levels of abstraction (register-transfer (RT) level, gate level) [8, 9, 6, 10, 11] or with pure data flow graphs at the algorithmic level [12]. This paper addresses formal synthesis at the algorithmic level. The approach goes beyond pure basic blocks and allows synthesizing arbitrary computable, i.e. μ-recursive, programs.
The starting point for high-level synthesis (HLS) is an algorithmic description. The result is a structure at the RT level. Usually, hardware at the RT level consists of a data-path and a controller. In conventional approaches [13], first, all loops in a given control/data flow graph (CDFG) are cut, thus introducing several acyclic program pieces, each corresponding to one clock tick. The number of these cycle-free pieces grows exponentially with the size of the CDFG. Afterwards, scheduling, allocation and binding are performed separately on these parts, leading to a data-path and a state transition table. Finally, the controller and the communication part are generated.
We have developed a methodology that differs fundamentally from this standard. In our approach, the synthesis process is not reduced to the synthesis of pure data flow graphs; instead, the circuit description always remains compact and the RT-level structure is derived via program transformations. Besides an RT-level structure, our approach additionally delivers an accompanying proof in terms of a theorem stating that this implementation is correct. High-level synthesis is performed in four steps. In the first two steps, which are explained in Section 3, scheduling and register allocation/binding are performed. Based on pre-proven program equations, which can be steered by external design exploration techniques, the program is first transformed into an equivalent but optimized program, and then this program is transformed into an equivalent program with a single while-loop. The third step (Section 4) performs interface synthesis. An interface behavior can be selected and the program is mapped by means of a pre-proven implementation theorem to an RT-level structure that realizes the interface behavior with respect to the program. In the last step, which is not addressed explicitly here, functional units are allocated and bound. Section 5 gives some experimental results.
Formal representation of programs
At the algorithmic level, behavioral descriptions are represented as pure software programs. The concrete timing of the circuit that has to be synthesized is not yet considered. In Gropius, we distinguish between two different algorithmic descriptions. DFG-terms represent non-recursive programs that always terminate (Data Flow Graphs). They have some type α → β. P-terms are means for representing arbitrary computable functions (Programs). Since P-terms may not terminate, we have added an explicit value to represent nontermination: a P-term either has the value Defined(x), indicating that the function application terminates with result x, or, in case of nontermination, the value Undefined.
The type of P-terms is expressed by α → (β)partial.
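The distinction between total DFG-terms and possibly non-terminating P-terms can be illustrated with a minimal Python sketch. This is not Gropius: the class names mirror Defined and Undefined from the text, while `partialize`, `while_loop` and the `fuel` bound are assumptions of this sketch. In particular, real nontermination cannot be observed by running code, so the bound only approximates it (a loop that needs `fuel` or more iterations is reported as Undefined).

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar, Union

A = TypeVar("A")
B = TypeVar("B")


@dataclass(frozen=True)
class Defined(Generic[B]):
    """The computation terminated with result `value`."""
    value: B


@dataclass(frozen=True)
class Undefined:
    """The computation did not terminate (within the fuel bound of this sketch)."""


Partial = Union[Defined[B], Undefined]


def partialize(f: Callable[[A], B]) -> Callable[[A], Partial]:
    """Lift a total function (a DFG-term) to a function returning a partial value."""
    return lambda x: Defined(f(x))


def while_loop(cond: Callable[[A], bool],
               body: Callable[[A], Partial],
               fuel: int = 10_000) -> Callable[[A], Partial]:
    """Apply `body` while `cond` holds; input and output type coincide (a block)."""
    def run(x: A) -> Partial:
        for _ in range(fuel):
            if not cond(x):
                return Defined(x)
            result = body(x)
            if isinstance(result, Undefined):
                return result
            x = result.value
        return Undefined()  # fuel exhausted: treated as nontermination
    return run


countdown = while_loop(lambda n: n > 0, partialize(lambda n: n - 1))
diverge = while_loop(lambda n: True, partialize(lambda n: n + 1), fuel=100)
```

Here `countdown(5)` yields `Defined(0)`, whereas `diverge(0)` exhausts its fuel and yields `Undefined()`.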
In our approach, P-terms are used for representing entire programs as well as blocks. Blocks are used for representing inner pieces of programs. For blocks, in contrast to programs, the input type equals the output type. This is necessary for loops, which apply some function iteratively. In Gropius, there is a small core of 8 basic control structures for building arbitrary computable blocks and programs based on basic blocks and conditions. Basic blocks (type α → α) and conditions (type α → bool) themselves are represented by DFG-terms. In Table 1, only those control structures are explained that are used in this paper. Based on this core of control structures, further control structures like for- and repeat-loops can be derived by the designer. In the rest of the paper a specific pattern called single-loop form (SLF) plays an important role. Programs in SLF have the following shape:
PROGRAM out_init (LOCVAR var_init (WHILE c (PARTIALIZE a)))   (1)

The expressions out_init and var_init denote arbitrary constants, c is an arbitrary condition, and a is an arbitrary basic block.
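To make the shape (1) more tangible, the following Python sketch executes a program in single-loop form. The semantics of PROGRAM and LOCVAR are defined in Table 1, which is not reproduced here, so the state layout `((input, output), local)` is purely an assumption of this sketch, as are the names `run_slf` and `max_steps`. Returning `None` after `max_steps` iterations stands in for the value Undefined.

```python
def run_slf(out_init, var_init, c, a, inp, max_steps=1000):
    """Run an SLF program: one while-loop over a single basic block `a`.

    Assumed state layout: ((input, output), local variables).
    Returns (output, number of loop-body executions), or (None, max_steps)
    if the loop did not stop within the bound.
    """
    state = ((inp, out_init), var_init)
    for steps in range(max_steps):
        if not c(state):
            (_, out), _ = state
            return out, steps
        state = a(state)  # one execution of the loop-body
    return None, max_steps


# Hypothetical example: sum of 1..n with a local counter i.
def cond(state):
    (n, _out), i = state
    return i < n


def body(state):
    (n, out), i = state
    return ((n, out + i + 1), i + 1)


result, steps = run_slf(out_init=0, var_init=0, c=cond, a=body, inp=4)
```

For the input 4 the loop-body is executed four times and the output is 10. The step count is the quantity that later corresponds to the number of control steps.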
Basically, no front-end is required, since Gropius is both the input language and the intermediate format for transforming the circuit description. However, since the "average designer" may not be willing to specify in a mathematical notation, it would also be possible to automatically translate descriptions from other languages like Pascal into Gropius. On the other hand, this adds an error-prone part to the synthesis process, which can be avoided, since Gropius is easy to learn: there are only few syntax rules.

The correspondence between the two descriptions in Fig. 1 is not one-to-one. To yield a more efficient description with fewer addition and multiplication operations, the following two theorems for conditionals have been applied:

⊢ f (MUX(c, a, b)) = MUX(c, f a, f b)
⊢ (MUX(c, a, b)) g = MUX(c, a g, b g)   (2)

The imperative program can first be translated into a Gropius description containing the same eight multiplications and six additions in the loop-body. Then the theorems (2) can be applied to generate the description shown in Fig. 1, which needs only four multiplications and three additions in the loop-body.
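The effect of the theorems (2) on the operation count can be checked on concrete values. The sketch below reads MUX as a selection between two values and reads juxtaposition as function application; both readings are assumptions of this sketch. The `Counter` class is a hypothetical helper that only exists to count multiplications.

```python
def mux(c, a, b):
    """MUX on values: select a if c holds, otherwise b."""
    return a if c else b


class Counter:
    """Hypothetical helper: a multiplier that counts how often it is used."""
    def __init__(self):
        self.muls = 0

    def mul(self, x, y):
        self.muls += 1
        return x * y


def before(c, x, y, z, k):
    # MUX(c, x*y, x*z): both products appear in the data flow graph
    return mux(c, k.mul(x, y), k.mul(x, z))


def after(c, x, y, z, k):
    # f (MUX(c, y, z)) with f = (x *): a single product behind the multiplexer
    return k.mul(x, mux(c, y, z))


def second_theorem_holds(c, a, b, g):
    # (MUX(c, a, b)) g = MUX(c, a g, b g), with a and b being functions
    return mux(c, a, b)(g) == mux(c, a(g), b(g))
```

Both variants compute the same value for either value of the condition, but `before` uses two multiplications and `after` only one, which is the kind of saving described above.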
Program transformations
The basic idea of our formal synthesis concept is to transform a given program into an equivalent one in SLF. This is motivated by the fact that hardware implementations are nothing but a single while-loop, always executing the same basic block. Every program can be transformed into an equivalent SLF-program (Kleene's normal form of μ-recursive functions). However, for a given program there is not a unique SLF; there are infinitely many equivalent SLF-programs. In the loop-body of an SLF-program, all operations of the originally given program are scheduled. The loop-body behaves like a case-statement: within a single execution, certain operations are performed according to the control state indicated by the local variables (LOCVAR var_init). After mapping the SLF-program to an RT-level structure (see Section 4), every execution of the loop-body corresponds to a single control step. The cost of the RT-implementation therefore depends on which operations are performed in which control step. Thus every SLF corresponds to an RT-implementation with certain costs. Performing high-level synthesis therefore requires transforming the program into an SLF-program that corresponds to a cost-minimal implementation.
In the HOL theorem prover we proved several program transformation theorems, which can be subdivided into two groups. The first group consists of 27 theorems. One can prove that these theorems are sufficient to transform every program into an equivalent program in SLF. The application of these theorems is called the standard-program-transformation (SPT). During the SPT, control structures are removed and auxiliary variables holding the control information are introduced instead. Theorem (3) is an example:

WHILE c1 (LOCVAR v (WHILE c2 (PARTIALIZE a))) =
  LOCVAR (v, F)
    (WHILE (λ(x, h1, h2). c1 x ∨ h2)
      (PARTIALIZE (λ(x, h1, h2). MUX (c2 (x, h1), (a (x, h1), T), (x, v, F)))))   (3)

Two nested while-loops with a local variable at the beginning of the outer loop-body are transformed into a single while-loop. The local variable is now outside the loop and there is an additional local variable with initial value F. This variable holds the control information whether the inner while-loop is being performed (value T) or not (value F).
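The equivalence stated by theorem (3) can be tried out on a concrete instance. The following Python sketch runs both sides on the same input. It assumes that LOCVAR v extends the state x by a local variable initialized with v and discards it afterwards; this reading, the function names and the example loop are assumptions of the sketch, and it only handles terminating instances.

```python
def nested(c1, v, c2, a, x):
    """Left-hand side of (3): an outer loop whose body is a local variable plus an inner loop."""
    while c1(x):
        h1 = v                      # LOCVAR v: fresh local variable per outer iteration
        while c2((x, h1)):
            x, h1 = a((x, h1))
    return x


def single(c1, v, c2, a, x):
    """Right-hand side of (3): one loop, with flag h2 recording 'inside the inner loop'."""
    h1, h2 = v, False               # LOCVAR (v, F)
    while c1(x) or h2:
        if c2((x, h1)):
            (x, h1), h2 = a((x, h1)), True
        else:
            h1, h2 = v, False       # inner loop finished: reset local variable and flag
    return x


# Hypothetical example: x = (n, total); at most two inner steps per outer iteration.
c1 = lambda x: x[0] > 0
c2 = lambda s: s[1] < 2 and s[0][0] > 0
a = lambda s: ((s[0][0] - 1, s[0][1] + s[0][0]), s[1] + 1)

lhs = nested(c1, 0, c2, a, (5, 0))
rhs = single(c1, 0, c2, a, (5, 0))
```

Both sides return `(0, 15)` for the start value `(5, 0)`. The single loop needs extra iterations for resetting the flag, which illustrates how control structure is traded for control state.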
Although the SLF representation is not unique, the SPT always leads to the same SLF for a given program by scheduling the operations in a fixed way. Therefore, the SPT unambiguously assigns costs to every program. To produce other, equivalent SLF representations, which result in another scheduling and thus in other costs for the implementation, the theorems of the second group have to be applied before performing the SPT. Currently, we have proved 19 optimization-program-transformation (OPT) theorems. These OPT-theorems can be selected manually, but it is also possible to integrate existing design space exploration techniques which steer the application of the OPT-theorems. The OPT-theorems realize transformations which are known from the optimization of compilers in the software domain [15]. Two of these transformations are loop-unrolling and loop-cutting.
Loop unrolling reduces the execution time since several operations are performed in the same control step. On the other hand, it increases the combinatorial depth and therefore the amount of hardware. Theorem (4) shows the loop unrolling theorem. It describes the equivalence between a while-loop and an n-fold unrolled while-loop with several loop-bodies which are executed successively. Between two loop-bodies, the loop-condition is checked to guarantee that the second body is only executed if the value of the condition is still true. 
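The trade-off described for loop unrolling can be observed in a small Python sketch. It is not the theorem itself: the function names and the step counter are assumptions of this sketch, and one counted step stands for one execution of the (possibly unrolled) loop-body, i.e. one control step.

```python
def run_while(c, a, x):
    """Plain while-loop; returns the final value and the number of body executions."""
    steps = 0
    while c(x):
        x = a(x)
        steps += 1
    return x, steps


def run_unrolled(c, a, x, n):
    """n-fold unrolled loop: up to n copies of the body per step.

    Between two copies the loop-condition is checked again, so a later
    copy is only executed if the condition still holds.
    """
    steps = 0
    while c(x):
        x = a(x)
        for _ in range(n - 1):
            if c(x):
                x = a(x)
        steps += 1
    return x, steps


plain = run_while(lambda x: x < 10, lambda x: x + 1, 0)
unrolled = run_unrolled(lambda x: x < 10, lambda x: x + 1, 0, 3)
```

Both loops end with the value 10, but the plain loop needs 10 steps and the 3-fold unrolled loop only 4. Each unrolled step, however, chains up to three copies of the body.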
The counterpart to loop unrolling is loop cutting: the loop is cut into several smaller parts. Each part then corresponds to a separate control step. This results in a longer execution time; however, the hardware consumption might be reduced if the parts can share function units.

Which part of the loop-body is to be performed is stored in a local variable that has been introduced. This variable has an enumeration datatype. Its initial value is 0 and its value ranges from 0 to (LENGTH r). The semantics of enum is shown in (7). If the local variable has value 0, the loop-condition c is checked to decide whether to perform the loop-body or not. If the local variable's value differs from 0, the loop-body will be executed independently of c. (CASE L i) picks the i-th function of the list L. Therefore, within one execution of the loop-body, a function of the list (k :: r) is selected according to the value h of the local variable, and then this function is applied to the value x of the global input variable. Furthermore, the new value of the local variable is determined. The semantics of next is shown in (7).

⊢ enum n m = MUX(m < n, m, 0)
⊢ next n x = MUX(SUC x < n, SUC x, 0)   (7)

Returning to our example program b, the body of the while-loop can be scheduled in many ways. The decision on how the body should be scheduled can be made outside the logic by incorporating existing scheduling techniques for data-paths. Table 2 shows the results of applying the ASAP (as-soon-as-possible), force-directed [16] and list-based scheduling techniques to our example program. The ASAP algorithm delivers the minimal number of control steps. In addition to this, the force-directed algorithm tries to minimize the amount of hardware. The list-based scheduling, on the other hand, restricts the number of hardware components and tries to minimize the execution time. For each control step, we list the local variables of the loop from Fig. 1 that hold the result of the corresponding operations.
Note that no chaining was allowed in the implementation of these scheduling programs. However, this is not a general restriction. For performing the list-based scheduling, the number of multiplications and additions was each restricted to two.
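The difference between ASAP and resource-constrained list-based scheduling without chaining can be sketched as follows. The data flow graph below is hypothetical: it contains four multiplications and three additions like the loop-body of the example, but its dependency structure is an assumption, since Fig. 1 is not reproduced here. The list scheduler uses the insertion order of the operations as its priority, which is a simplification, and force-directed scheduling is not covered.

```python
def asap(dfg):
    """ASAP: every operation is placed one step after its latest predecessor."""
    step = {}

    def visit(op):
        if op not in step:
            _kind, preds = dfg[op]
            step[op] = 1 + max((visit(p) for p in preds), default=0)
        return step[op]

    for op in dfg:
        visit(op)
    return step


def list_schedule(dfg, limits):
    """List-based scheduling with at most limits[kind] operations of a kind per step.

    No chaining: an operation is ready only if all predecessors were
    scheduled in strictly earlier steps.
    """
    step = {}
    t = 0
    while len(step) < len(dfg):
        t += 1
        used = {kind: 0 for kind in limits}
        for op, (kind, preds) in dfg.items():
            if op in step:
                continue
            ready = all(p in step and step[p] < t for p in preds)
            if ready and used[kind] < limits[kind]:
                step[op] = t
                used[kind] += 1
    return step


# Hypothetical data flow graph: operation -> (kind, predecessors)
dfg = {
    "m1": ("mul", []), "m2": ("mul", []), "m3": ("mul", []), "m4": ("mul", []),
    "s1": ("add", ["m1", "m2"]),
    "s2": ("add", ["m3", "m4"]),
    "s3": ("add", ["s1", "s2"]),
}

asap_steps = asap(dfg)
list_steps = list_schedule(dfg, {"mul": 2, "add": 2})
```

For this graph ASAP needs three control steps but four multipliers in the first step, while the list-based schedule with two multipliers and two adders needs four control steps.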
In the next step, the number and types of the registers that have to be allocated are determined. This is also listed in Table 2.

Table 2. Control information extracted by different scheduling algorithms

Before the loop-body is scheduled by a logical transformation, additional input variables have to be introduced, since the number of input and output variables directly corresponds to the number of registers necessary at the RT-level. The number of additional variables is (#regalloc − #invars), with #invars being the number of input variables of the loop-body and #regalloc being the number of allocated registers. For our example b in the case of allocation after the ASAP scheduling, this value is 11 − 6 = 5. Therefore, 5 additional input variables for the loop have to be introduced. This is done by theorem (8). Applying it to the loop in Fig. 1 with an appropriate instantiation of i gives program (9).
The additional variables are only dummies for the following scheduling and register allocation/binding. They must not be used within the loop-body. Since the original input variables are all of type num, one variable of type bool (h1) and four variables of type num (h2, ..., h5) are introduced. Some default initial values are used for each type. Since the output type must equal the input type, additional outputs have to be introduced as well. Now the DFG-term representing the loop-body can be scheduled and register binding can be performed, both by logical conversions within HOL. Fig. 2 shows the resulting theorem after this conversion. The equivalence between the original and the scheduled DFG-term is proven by normalizing the terms, i.e. performing β-conversions on all β-redices [19]. The register binding was performed based on the result of a heuristic that tries to keep a variable in the same register as long as possible to avoid unnecessary register transfers. The right-hand side of the theorem in Fig. 2 is actually an expression of the form (list o L), with L being a list of five DFG-terms. The theorem in Fig. 2 can be used to transform the circuit description b by rewriting, and afterwards the loop-cutting theorem (6) can be applied.
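The mechanism behind loop cutting, as described in connection with (7), can be sketched in Python. The definitions of `enum` and `next_step` follow (7), with `next_step` standing for next because `next` is a Python built-in. Theorem (6) itself is not reproduced here, so `run_cut` is only an assumed reading of the prose: the parts k :: r are executed one per step, and the loop-condition is consulted only when the local variable is 0.

```python
def mux(c, a, b):
    """MUX on values: select a if c holds, otherwise b."""
    return a if c else b


def enum(n, m):
    """enum n m = MUX(m < n, m, 0): map values outside the range to 0."""
    return mux(m < n, m, 0)


def next_step(n, x):
    """next n x = MUX(SUC x < n, SUC x, 0): cyclic successor on 0 .. n-1."""
    return mux(x + 1 < n, x + 1, 0)


def run_cut(c, parts, x):
    """Execute a loop whose body was cut into `parts`, one part per step."""
    h, steps = 0, 0                       # local variable, initial value 0
    n = len(parts)
    while h != 0 or c(x):                 # c is only consulted when h is 0
        x = parts[enum(n, h)](x)          # CASE: pick the h-th part
        h = next_step(n, h)
        steps += 1
    return x, steps


def run_whole(c, parts, x):
    """Reference: the uncut loop, executing all parts in one step."""
    steps = 0
    while c(x):
        for part in parts:
            x = part(x)
        steps += 1
    return x, steps


# Hypothetical loop-body x -> 2*x + 1, cut into three parts.
parts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 1]
whole = run_whole(lambda x: x < 20, parts, 0)
cut = run_cut(lambda x: x < 20, parts, 0)
```

Both variants end with the value 31. The uncut loop takes 5 steps, the cut loop 15, but each of its steps contains only one of the three operations.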
⊢ λ(((n, y1), a1, a2, y2, m), h1, h2, h3, h4, h5). ...

Besides loop-unrolling and loop-cutting, several other OPT-theorems can be applied. After that, the SPT is performed, generating a specific program in SLF. Fig. 3 shows the theorem after performing the SPT without applying any OPT-theorem before. When the program in SLF is generated without any OPT, the operations in the three blocks before, within and after the while-loop in the original program b will be performed in separate executions of the resulting loop-body. Therefore, function units can be shared among these three blocks. Although, e.g., two DIV-operations appear in Fig. 3, only one divider is necessary. One division comes from the block before the loop and the other results from the loop-body. These two division operations are therefore needed in different executions of the new loop-body. They can be shifted behind the multiplexer by using one of the theorems (2). The allocation and binding of functional units is the last step in our high-level synthesis scenario. Before this, interface synthesis must be applied.

Interface synthesis

At the algorithmic level, the timing of the circuit is not yet considered. During high-level synthesis the algorithmic description is mapped to an RT-level structure. To bridge the gap between these two different abstraction levels, one has to determine how the circuit communicates with its environment. Therefore, as the second component of the circuit representation, an interface description is required.
In contrast to most existing approaches, we strictly separate the algorithmic description from the interface description. We currently provide a set of nine interface patterns, of which the designer can select one. Some of these patterns are used for the synthesis of P-terms and others for the synthesis of DFG-terms. The orthogonal treatment of functional and temporal aspects supports reuse of designs in a systematic manner, since the designer can use the same algorithmic description in combination with different interface patterns. Remark: At the algorithmic level we only consider single processes that do not communicate with other processes. In addition to this, we have developed an approach for formal synthesis at the system level, where several processes interact with each other [17]. Fig. 4 shows the formal definition of two of those interface patterns. Besides the data signals of an algorithmic description P, the interface descriptions contain additional control signals, which are used to steer the communication and to stop or start the execution of the algorithm.
The two patterns are functions which map an arbitrary program P and the signals (in, start, out, ready) and (in, reset, out, ready), respectively, to a relation between these signals with respect to the program P. The pattern P_IFC_START states that at the beginning the process is idle if the start-signal is not active. As long as the process is idle and no calculation is started, the [...] For a calculation P (in t), there are two cases:
- The calculation P (in t) will terminate after some time steps m. As long as the calculation is performed, the process is active, i.e. ready is F. When the calculation is finished at time step t + m, out holds the result y and ready is T, indicating that the process is idle again. However, the calculation can only be finished if start is not set to T while the calculation is performed.
- The calculation P (in t) will not terminate. Then the process will be active, producing no result, until a new calculation is started by setting start to T.
The pattern P_IFC_CYCLE describes a process which always performs a calculation and starts a new one as soon as the old one has been finished. The reset-signal can be used here to stop a (non-terminating) calculation and to start a new one.
For each interface pattern that we provide, we have also proven an implementation theorem. All the implementation theorems corresponding to patterns for P-terms expect the programs to be in SLF. The formal Gropius descriptions of implementations at the RT-level can be found in [18]. In (10), an implementation theorem is shown, stating that an implementation pattern called IMP_START fulfills the interface pattern P_IFC_START for each program in SLF (see also the pattern of the SLF in (1)).
The final theorem (11) is achieved by first instantiating the universally quantified variables in theorem (10) with the components a_SLF, c_SLF, out_init_SLF and var_init_SLF of the SLF in Fig. 3. Afterwards, the SLF-theorem ⊢ b = b_SLF shown there can be applied.
Experimental results
Our formal synthesis approach consists of four steps: OPT and SPT for scheduling and register allocation/binding, applying an implementation theorem for interface synthesis, and allocation/binding of functional units within the resulting basic block of the RT-implementation. SPT and interface synthesis consist of rewriting and β-conversions, which can be done fully automatically within the HOL theorem prover. For the OPT and the FU-allocation/binding, however, heuristics are needed to explore the design space. Those non-formal methods can be integrated in the formal synthesis process, since the design space exploration part is separated from the transformation within the theorem prover. After establishing the formal basis for our synthesis approach, it is our objective in the future to develop further heuristics and to integrate more existing techniques for the OPT. An interesting approach is proposed in [20], where a method is described in which the design space is explored for performing transformations similar to those used in the OPT of our synthesis process.

2 Due to lack of space, the loop-body of the SLF (component a_SLF) is not explicitly shown in Fig. 5. q denotes an arbitrary initial value in the eight allocated registers.
To give an impression of the costs of formal synthesis, Fig. 6 shows the run-times (on a SUN UltraCreator, Solaris 5.5.1, 196 MB main memory) of several programs for both performing the SPT and instantiating an implementation theorem. The descriptions of the programs in the programming language C can be found in [21]. Since we did not perform any OPT, it does not make sense to compare the results with other approaches with respect to the number of registers and FUs. As we have demonstrated on our small running example, very different implementations can be achieved if OPT-theorems are applied. Since only few heuristics for automatically invoking the OPT-theorems have been implemented, we have considered only the SPT and the interface synthesis. The cost of the SPT mainly increases with the number of control structures, but also with the number of operations in the program. The cost of the interface synthesis mainly increases with the size of the loop-body of the SLF-program.

The experiments have been run using a slight variant of the HOL theorem prover. Compared to the original HOL system, it has been made more efficient by changing the term representation and adding two core functions. See [22] for a detailed description and discussion of this.

As we have demonstrated in the paper, the result of our synthesis process is a guaranteed correct implementation. A proof is given together with the implementation, stating that the implementation fulfills the specification. Therefore, one should be aware that the run-times must be compared with conventional synthesis plus exhaustive simulation. Furthermore, we believe that, due to the complexity, it is very hard (or even impossible) to develop automatic post-synthesis verification methods at this abstraction level which could prove the correctness of the synthesis process.
Conclusion
In this paper, we presented a formal way for performing high-level synthesis. The main contribution is that we perform the whole synthesis process within a theorem prover. The result is therefore not only an implementation but also an accompanying proof that this implementation is correct. Furthermore, we developed a new synthesis method, where the implementation is derived by applying program transformations instead of generating and analyzing a number of control paths that grows exponentially with the CDFG size. Last but not least we orthogonalize the treatment of algorithmic and temporal aspects and therefore support a systematic reuse of designs.
