The design space of a special-purpose system, together with the tools currently being experimented with at Irisa. The Signal language is used for specification. Specification and derivation of regular parts are done using Alpha. The result of the derivation is an Alpha0 program, which can be translated into Vhdl or Model, or given to the Madmacs system for layout generation. MicMacs is a lego architecture for real-time simulation.
The design time should be as short as possible. Indeed, the cost of the total system depends heavily on this parameter, not only because manpower is costly, but also because a short time-to-market is a crucial factor of success, especially when the technology is changing so fast.

The system must be efficient, i.e., optimal in hardware cost and speed. In fact, the goal is most often to reach a system which just meets some given speed requirement, while minimizing silicon real estate.

There exist no generally accepted methods for specifying and implementing a complete hardware system [Hay88]. The different design steps are most often performed using a set of heterogeneous tools, which do not guarantee the coherence of the design through its various representations.
Several formalisms have been studied and applied for the modeling of systems (algebraic specifications, functional languages, temporal logic, Petri nets, Statecharts, etc. [HGdR88]), with the goal of studying their properties and reasoning about their behaviour. However, the proof power of these formalisms is currently orders of magnitude away from the needs of applications.
The state-of-the-art approach consists of using a single framework such as Vhdl [ABOR90] for both simulation and description. But even if the use of Vhdl represents significant progress over the previously prevailing methodology, it can hardly be said that all the mentioned problems are solved. The most severe limitation of Vhdl is its lack of formal semantics, which prevents its use for formal verification.
In this paper, we consider several steps which we believe to be essential in the design path of a special-purpose architecture, and we present methodologies for achieving design requirements. These solutions are based on experience gathered in the Parallel VLSI Architecture group of Irisa. Our effort is guided by three key ideas:
The most permanent part of a system is its high-level specification; indeed, the lifetime of a system spans a period of time which goes far beyond one particular implementation as a special-purpose architecture. The architecture is therefore technology dependent, and as the technology changes, so does the structure of the implemented system. We believe that the use of a formal specification language, supporting synthesis techniques and verification, will eventually provide a great benefit for designers.

In many situations, a real-time simulation of a system is needed; however, even supercomputers do not have the computing power needed to simulate in real time systems such as those we are currently considering for implementation. We advocate a lego approach based on programmable hardware elements and software components that would make it possible to quickly emulate a desired architecture.

Finally, we think that methods for synthesizing regular parts of systems are becoming mature, and that they will eventually provide a great saving in design time, as well as more independence from the technology.

In section 2, we consider the problem of specifying a system. Our approach relies upon the premise that most embedded systems are a mix of computation-intensive but regular parts glued together by highly complex control mechanisms. We present the use of the synchronous language Signal for the specification and the functional simulation of a system, and we explain why such a language is useful for learning about the synchronization difficulties of the application early in the design process.

In section 3, we describe an approach to the real-time simulation of parallel systems, corresponding to the bottom right region of the design space triangle. Our claim is that building blocks for rapidly assembling a real-time simulator for a system can be designed both at the hardware and the software level. We illustrate this concept on the design of a special-purpose chip for spelling correction, with application to the optical reading of postal addresses. The machine upon which our experiment is based is named MicMacs.

Section 4 is concerned with the design of regular architectures. We describe the use of a language named Alpha, which was designed especially for the description and the synthesis of systolic architectures but has applications beyond this class of architectures. In our approach, Alpha serves as a formal specification language for regular algorithms. It is also a framework for the formal derivation of parallel architectures, which are described in a subset of the language named Alpha0. This language can then be translated into hardware languages such as Vhdl
Specifications
Most embedded processing systems, at least among those which are candidates for Vlsi or Ulsi implementation, contain computation-intensive but regular parts. These parts are amenable to highly parallel hardware implementations such as systolic, Simd or pipeline architectures. The specification and implementation of these regular parts will be covered in section 4. One of the main difficulties faced by a system designer is to take into account the interaction between these regular parts, in such a way that the global utilization of resources such as memory, input/output, and silicon can be optimized. The control of the system as a whole can therefore become extremely complex. Our opinion is that this problem is often underestimated, and this greatly increases the probability of residual errors in the final design. Moreover, this problem is even more severe when one has to implement the specification on complex parallel architectures. Indeed, parallelization transformations on the initial specification can dramatically affect the control of the whole system, leading to intractable situations.
Another design problem is that an application has to be represented by different descriptions at different stages in the design process. The initial specification has to be augmented with technology-related information and transformed to fit the constraints of the implementation. It is essential that these descriptions be checked against one another, either by synthesis or by formal verification. To this end, a specification language should meet several requirements:
the execution results should conform to the specification and be independent of the underlying architecture;
transformations applied to the description should be valid and, if possible, formally proved;
the language must allow for parallel and hierarchical descriptions, for the obvious reason of modularity;
the formalism should permit the description of both algorithms and architectures;
the ease of specification should not come at the cost of the final efficiency of simulation.
The Signal language
The so-called synchronous language approach [Ber91] has been defined to address some of the above-mentioned problems, mainly the mastering of synchronization in real-time systems. Synchronous languages are based upon the hypothesis that the calculations or reactions of the program take no time. This makes it possible to express and check logical properties of an application before taking into account the physical properties of a particular hardware realization. The implementation of a specific process is obtained when one associates an execution time with each elementary action, which should be checked to have a bounded duration. Examples of synchronous languages are Esterel, Lustre, and Signal [Ber91].
In the following, we report our experience in using the Signal language [LGLL91], developed at Irisa, to specify systems. A Signal process is a system of equations whose variables are signals, i.e., infinite sequences of values, each associated with a discrete time. The set of time instants when the signal is defined is called the clock of the signal. The kernel of the Signal language is made of basic operators which are used to define elementary Signal processes. The specification of more complex processes is obtained by composition of elementary processes. There are four elementary processes: component-wise signal arithmetic operators, delay, subsampling of a signal by a condition, and shuffle of two signals. External processes can be written in another language, provided they can be considered as instantaneous operations.
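To make the four elementary processes more concrete, the following Python sketch models a finite prefix of a signal as a list in which `None` marks an instant where the signal is absent. This encoding, and the function names `add`, `delay`, `when` and `default`, are assumptions made for illustration only; they are not Signal syntax, and the sketch is not the Signal compiler's semantics in full.

```python
# Illustrative model: a signal prefix is a list; None means "absent at this instant".
ABSENT = None


def add(x, y):
    """Component-wise arithmetic: both operands must share the same clock."""
    out = []
    for a, b in zip(x, y):
        if (a is ABSENT) != (b is ABSENT):
            raise ValueError("operands are not synchronous")
        out.append(ABSENT if a is ABSENT else a + b)
    return out


def delay(x, init):
    """Delay: same clock as x, carrying the previous present value of x."""
    out, prev = [], init
    for v in x:
        if v is ABSENT:
            out.append(ABSENT)
        else:
            out.append(prev)
            prev = v
    return out


def when(x, c):
    """Subsampling: x is kept only at instants where c is present and true."""
    return [v if (v is not ABSENT and b is True) else ABSENT
            for v, b in zip(x, c)]


def default(x, y):
    """Shuffle (merge): take x when it is present, otherwise take y."""
    return [a if a is not ABSENT else b for a, b in zip(x, y)]
```

In this model, the clock of a signal is simply the set of positions holding a value other than `None`; `when` can only shrink a clock, while `default` yields the union of the clocks of its operands.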
As an example, consider the modulo counter described in figure 3. The counter is described as a process counter parameterized by N (line 1). The input is a boolean signal h (line 2), and the output is an integer signal nmod (line 3) whose value is the number of ticks of h modulo the parameter N. The instruction of line 4 specifies that h and nmod are synchronous, that is to say, they have the same clock. Line 5 gives the definition of an intermediate signal znmod which is a delayed version of nmod. The definition of nmod in line 6 can be read as follows: nmod has value 1 whenever its past value znmod reaches value N+1 (this is the reset of the modulo function). Otherwise (default operator), nmod receives the value of znmod increased by 1. Finally, line 7 indicates that the signal znmod is local to the process.
A Signal program can easily be associated with a conventional signal flow graph. A graphical Signal editor can be used to input programs, using a mix of graphic and text representation. A typical representation of a Signal process is shown in figure 4. Boxes represent Signal processes, connection ports correspond to signals, and links between boxes represent the flow of data between the processes. Processes can be organized into a hierarchy.
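The behaviour of such a counter can be sketched in Python as follows. This is only an illustration: the function name `modulo_counter`, the initial value 0 of the delayed signal, and the reset convention (reset to 0 when the past value reaches N-1, so that the output is exactly the tick count modulo N) are assumptions and may differ from the constants used in the Signal program of the figure.

```python
def modulo_counter(ticks, n):
    """Yield nmod at each tick of the input clock h.

    ticks: any iterable; each element stands for one tick of h.
    n:     the modulus (parameter N of the process).
    """
    znmod = 0  # assumed initial value of the delayed signal
    for _ in ticks:
        # "reset when the past value reaches the bound, else increment"
        nmod = 0 if znmod == n - 1 else znmod + 1
        yield nmod
        znmod = nmod  # the delay: znmod is nmod at the previous instant


# Seven ticks counted modulo 3.
values = list(modulo_counter([True] * 7, 3))
```

Because nmod is produced once per tick of h, the two signals share the same clock by construction, which mirrors the synchrony constraint stated in the Signal description.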
Application to image compression
We have experimented with the use of Signal for an application of animated image sequence compression. This example is interesting for several reasons. First, compression algorithms are now standardized (for example, mpeg [Gal91]) and thus are "real life" examples. Second, they are good examples of complex signal processing systems.
The compression algorithms we consider belong to the class of hybrid coding schemes. Inter-image predictive coding exploits the temporal redundancy of successive images, whereas intra-image coding takes advantage of the spatial correlation of luminance and chrominance amplitude around a given pixel. Transformations are applied on subblocks of the image (typically 8 × 8), as shown in figure 2.
The functional specification of such an application using Signal is a direct translation of the structure of its block diagram. A Signal process is associated with each block of the diagram, and the data transfers between processes consist of blocks of pixels.
The main advantages of Signal for representing specifications are the following:
the compiler can check many coherence properties of programs. For example, the Signal compiler checks that there exists no deadlock situation resulting from cycles in the definition of signals. More generally, the compiler carries out a detailed analysis of the clocks, and checks the synchrony of signals. These constraints on the way programs have to be written generally result in a better specification of events;
Signal expressions obey algebraic rules which can be used either to derive programs, or to check the equivalence of programs. This property is well suited to architecture design, because synthesis and formal verification are becoming pervasive in this field.
Simulation and real-time emulation
The simulation of specifications is important for two reasons. First, behavioral specifications cannot be proved, and they must be checked by simulation. Second, during the design process, specifications are often refined. This may affect the behavior of the system, and in many cases, real-time simulation is the only way of seeing the effects of these refinements. This is particularly true for image compression applications. Even standardized compression coding schemes leave the designer with the freedom of choosing, for example, the representation of data and the best architectural organization for algorithms. Functional simulation can often be done directly from the specification language. Signal, for example, can be executed. However, the simulation of such a language is unsatisfactory, because it is much too slow to attain real time, even when executed on a supercomputer.
Accelerating the simulation may rely upon various techniques:
1. one can execute the specification language in parallel. However, this is possible only if a parallel compiler is available for the specification language. This is seldom the case.
2. one can rewrite the most time-consuming parts of the specification in another language which can be executed more efficiently on a general-purpose parallel architecture.
3. finally, one can execute the time-consuming parts on a special-purpose programmable architecture.
Although solution 2 seems to be reasonable, we believe that in practice, only solution 3 can lead to real time. Indeed, the computing power needed to simulate systems such as those we consider here is much higher than that available even on supercomputers. Moreover, general-purpose parallel machines often lack the input/output capabilities needed for high performance signal processing applications.
Our proposition is to build a lego of hardware and software elements partly tailored to the domain of application. The hardware elements needed to reach this goal must be easy to interconnect so that parallel architectures can be quickly assembled, and their communication bandwidth and computing power must be balanced, in order to efficiently execute fine grain parallel algorithms. These requirements exclude the use of Digital Signal Processing chips or general-purpose processors available off the shelf. Our approach is supported by the experience we report here, which concerns the design of special-purpose architectures for string matching problems, with an application to the correction of optical reading. We also describe the ReLaCs language, an extension of C that we designed to program systolic algorithms.
The MicMacs machine
The MicMacs machine is a Vlsi programmable systolic array which was designed at Irisa [FL91a] to become the building block of various systolic architectures. The MicMacs machine can be seen as a peripheral for high speed systolic-like data processing. It is organized into two modules as shown in figure 5. The first module, called Mics, is the systolic array properly speaking, while the second one, called Macs, is in charge of supplying the array with instructions and data.

The systolic architecture (the Mics module) is a network of locally interconnected processing elements operating in Simd mode. The processing element itself is a single chip named Api15c which was also designed at Irisa. To support systolic applications, the processing element was designed along the following ideas:
it is a single chip processor. This is essential to meet the high speed requirements of applications while keeping the hardware volume reasonably low;
it is programmable, in order to cover as large a class of applications as possible;
it follows a Simd execution mode, in order to simplify the control of the machine and its programming;
its instruction set is dedicated to systolic computations. In particular, local reinitializations and boundary conditions, which are present in almost all systolic algorithms, can be handled efficiently by special conditional instructions;
the chip has several parallel I/O ports. Thus, different interconnection topologies (linear or bidimensional) can be supported, and the communications do not slow down computations when fine grain parallelism is required.
These requirements resulted in the design of the Api15c systolic processor.
The Macs control module has two functions: it generates instructions and handles data transfers to/from the systolic network. It can be observed that these functions are executed by programs which have roughly the same structure: the program which generates the flow of instructions is data-independent, whereas the program handling the data is data-dependent. As observed in [AAG*87], merging both programs into a single one often results in inefficiencies. In MicMacs, these functions are implemented as two separate controllers running independent processes. These controllers have to be synchronized when data are supplied to the network. Synchronization operations are implemented directly by a hardware mechanism to keep throughput performance high.
Two different machines were built upon this model. The first one is a linear array of 18 Api15c processors [Lav89]. It can be considered as a general-purpose architecture (cf. figure 6a) for regular computation and, as such, can be used to program video processing algorithms efficiently. The second machine is an accelerator for string matching algorithms and was developed as a vehicle to study automatic recognition of typewritten postal addresses. The address is read by an Optical Character Recognizer (OCR) device. This operation is not fault-free and, therefore, it has to be followed by a correction algorithm. To meet the reading speed of the OCR, the correction algorithm must be able to compare 2 million words in one second. These performances can be reached only with a 2-D systolic
A single-chip version
The 28 processor version was also used to study a single chip architecture for spelling correction named Api69 [Lav92]. The idea was to integrate on silicon the same 2-D structure after processors were customized to the application. The resulting chip is an array of 69 processors containing 300000 transistors which can process more than 2 million words per second. Our approach, which consisted of designing a special-purpose parallel machine using a flexible building block processor and a powerful control module, allowed us to test the architecture very rapidly before integrating it in silicon. In the case of spelling correction, this step was fundamental, since the possible variants of the string comparison algorithm had to be simulated in real time.
The ReLaCs language
The ReLaCs language is designed to efficiently program parallel architectures that support systolic communications. Current target machines include Intel iPSC/2, Iwarp and MicMacs [Lav89]. ReLaCs is a C-like language [KBR78] augmented with parallel constructs. The user writes a single source program, from which the compiler generates code for the host processor and for each processor of the network.
The programmer of such a machine needs a way to differentiate between host variables (scalar variables) and objects which are located on the systolic array (systolic variables). The ReLaCs language defines a new storage class specifier, systolic, to allocate variables in each cell of the network. A statement operating on systolic variables describes a simultaneous execution of this operation on all the cells of the network. This programming model is referred to as data parallel [HS86]. Figure 7 provides an example of such a statement: the statement B = A is executed on the host, whereas the statement y = x is a parallel component-wise assignment of systolic variable x to y. This example also shows that, despite the fact that these operations are written sequentially, they can be executed in parallel. In systolic architectures, data transfers between processors are very important. Special care is devoted to this I/O mechanism in the ReLaCs language. New assignment operators match the hardware architecture and express the tight coupling between neighbouring cells.
The programming model underlying the ReLaCs language is that of a Simd architecture, whose processors use a synchronous communication mechanism. In this model, each cell, when communicating, is forced to execute a send followed by a receive during the same communication cycle. As all cells execute the send/receive simultaneously, one can view the data transfer as a global shift operation on the systolic variables involved in the communication sequence. In ReLaCs, this overall operation is expressed by the left (=<) and right (=>) systolic assignment operators. Figure 8 illustrates the operation of the =< operator, which describes data flowing from the right to the left of the array: each cell sends the content of its component of the systolic variable x to its left neighbour and receives in its component of the variable y the value sent by its right neighbour.

Systolic assignments may have two additional arguments which allow communications between the host machine and the array to be described. A scalar variable or a constant added to the right hand side of the assignment means that a data item is input from the host to the network. Adding a scalar variable to the left hand side of the assignment indicates that a data item is read from the array to the host. Figure 9 illustrates the case of an extended communication.

The compiler of the ReLaCs language provides object code for sequential workstations as well as for several parallel architectures. Together with the MicMacs architecture, it provides a basic hardware/software lego which can be used for quickly assembling flexible real-time simulators for systolic-like special-purpose parallel architectures.
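The global shift performed by the systolic assignment operators can be sketched on a linear array as follows. The sketch is written in Python rather than ReLaCs, and the function names `shift_left` and `shift_right` are hypothetical. A systolic variable is modelled as a non-empty list with one component per cell. The boundary behaviour is an assumption: the value supplied by the host enters at the end of the array that has no sending neighbour, and the value leaving the opposite end is returned to the host.

```python
def shift_left(x, host_in=None):
    """Model of the left systolic assignment on a linear array.

    Each cell i receives the value sent by its right neighbour, x[i + 1].
    The rightmost cell has no right neighbour and receives host_in instead
    (assumed boundary behaviour). The value sent by the leftmost cell
    leaves the array and is returned as the host output.
    """
    y = x[1:] + [host_in]
    host_out = x[0]
    return y, host_out


def shift_right(x, host_in=None):
    """Mirror image: data flows from the left to the right of the array."""
    y = [host_in] + x[:-1]
    host_out = x[-1]
    return y, host_out


# One communication cycle on a four-cell array, with the host feeding 9.
y, out = shift_left([1, 2, 3, 4], host_in=9)
```

All components of `y` are computed from the old contents of `x`, which reflects the fact that every cell executes its send and its receive during the same communication cycle.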
Derivation of parallel regular algorithms
As noted earlier, the regular parts of applications are the ones which perform most of the computations, and are candidates for parallel Vlsi implementation. In recent years, much work has been devoted to automatic techniques for designing regular algorithms (see [Thi92] for a survey). The motivation for using such techniques was the observation that systolic algorithms can be represented by space-time reindexing of systems of linear recurrence equations (see [Qui92]). The work we report here is based on the Alpha du Centaur design environment developed in our laboratory.
The Alpha language
The Alpha language [DVQS91, DGL*91, LMQ91] is based on the recurrence equation formalism. It is therefore an equational language whose constructs are well-suited for the expression of regular algorithms. The Alpha language can also be used to describe synchronous systems and, therefore, provides a natural framework for the transformation of algorithm specifications into architectures. Interactive transformations of Alpha programs can be done using the Alpha du Centaur environment, implemented with the language design system Centaur [BCD*87]. Alpha du Centaur includes a library of mathematical routines that are used to support efficient transformations of programs.
To explain the language, let us consider the Alpha program, also called a system of equations, presented in figure 10. This program represents an iterative version of the calculation $s = \sum_{i=1}^{3} X_i$. It takes an input variable X, indexed on the set $\{i \mid 1 \le i \le 3\}$, and returns an integer s. The program makes use of a local variable sum, defined on the set $\{i \mid 0 \le i \le 3\}$. Between the keywords let and tel are the definitions of sum and s that we now explain in more detail. Spatial operators allow recurrence equations to be expressed. In figure 10, the value of the variable sum is the sequence of partial sums of the elements of X and is defined by means of a case, whose first branch specifies the initialization part and the second one the recurrence itself. Finally, Alpha includes reduction operators. They can be used for directly writing expressions containing $\sum$ or similar operators. With a reduction operator, the definition of s in the above program could be simply written s = red(+, (i ->), X), which means "sum all elements of variable X over the i index".
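The recurrence described above can be mimicked in Python to show how the two branches of the case and the reduction relate. The figure itself is not reproduced here, so the exact equations are an assumption reconstructed from the prose: sum is taken to be 0 at index 0, and sum at index i is sum at index i-1 plus X at index i. The function names are hypothetical.

```python
def partial_sums(x):
    """x maps each index i in 1..n to X_i; returns sum on the domain 0..n."""
    n = len(x)
    sums = {0: 0}                          # first branch: initialization
    for i in range(1, n + 1):
        sums[i] = sums[i - 1] + x[i]       # second branch: the recurrence
    return sums


def s_iterative(x):
    """s is the last partial sum, i.e. sum at the upper bound of the domain."""
    return partial_sums(x)[len(x)]


def s_reduction(x):
    """Counterpart of the reduction form: add all X_i over the index i."""
    return sum(x.values())


X = {1: 4, 2: 5, 3: 6}
```

Both `s_iterative(X)` and `s_reduction(X)` denote the same value; the iterative form simply makes the intermediate partial sums explicit, which is what a derivation towards an architecture can exploit.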
Transformations
Alpha follows the substitution principle: any variable can be substituted by its definition, without changing the meaning of the program. For example, substituting sum in the definition of s in the program of figure 10 gives the program shown in figure 11. One can prove that any Alpha expression can be rewritten as an equivalent expression, called its normal form, whose structure consists of one single case expression.
The normalization process serves to simplify expressions obtained by complex transformations. It can be used, together with the substitution, to do a symbolic simulation of an Alpha program. For example, the definition of s in program (11)
A practical application
An effective algorithm for video compression uses the idea of motion compensation. In this algorithm, the previous position of part of the current image is searched for. Only the distance (or motion vector) is then transmitted instead of all the pixels. This results in a large compression of the video signal.
The current image is divided into small neighbourhoods of m × n pixels (current window) whose location in the previous image is to be determined. In order to limit the number of comparisons, this search is restricted to a window of appropriate size in the previous image.
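The search described above can be sketched as an exhaustive block-matching loop in Python. The matching criterion used here, a sum of absolute differences, is an assumption chosen for illustration and is not necessarily the criterion used by the authors; the names `sad` and `motion_vector` are hypothetical as well. Images are modelled as lists of rows of integer pixel values.

```python
def sad(current, search, top, left):
    """Sum of absolute differences between the current window and the
    block of the search window whose top-left corner is (top, left)."""
    return sum(
        abs(current[r][c] - search[top + r][left + c])
        for r in range(len(current))
        for c in range(len(current[0]))
    )


def motion_vector(current, search):
    """Return the offset (i, j) inside the search window that minimizes
    the matching criterion over all admissible positions."""
    m, n = len(current), len(current[0])
    best = None
    for i in range(len(search) - m + 1):
        for j in range(len(search[0]) - n + 1):
            d = sad(current, search, i, j)
            if best is None or d < best[0]:
                best = (d, (i, j))
    return best[1]


# A 4 x 4 search window and a 2 x 2 current window cut out of it.
search_window = [[0, 1, 2, 3],
                 [4, 5, 6, 7],
                 [8, 9, 10, 11],
                 [12, 13, 14, 15]]
current_window = [[6, 7],
                  [10, 11]]
```

The two nested loops over i and j and the two nested sums inside `sad` give the four-dimensional iteration space that makes this computation a natural candidate for a regular array.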
The 
The motion vector to be transmitted is the pair (i, j) for which the minimum in (2) is reached. Figure 12 shows the Alpha description of this algorithm. It combines both equations into a single line of Alpha code. From this initial specification, the final architecture is derived through formal transformations which lead to an Alpha description of about 100 lines. Figure 13 presents a fragment of this description. One can see that Delta is now defined by means of two variables S1 and S2. Each one of these variables corresponds to the serialization of one reduction in the initial system. One can also notice that these variables are indexed by t1, t2, x, and y. The first two indexes represent a multidimensional system motion estimator ( N : 
Automatic processor array generation
In this last section, we consider the problem of generating layouts from the description of a regular architecture. Our research is motivated by the observation that available circuit compilers (gate arrays, standard cells, datapath generators, array compilers, etc.) most often consist of a unique placement and wiring strategy. They produce dense layouts only if this strategy fits the circuit topology. For instance, the so-called datapath approach [MLB*86] is only efficient for the computational part of a chip, but is unacceptable for generating control logic. Past experience in the design of regular arrays with available compilers has shown poor results [DGL*91] compared to designs made by hand.
There is a wide range of possible array structures (linear, bidimensional, triangular, etc.), and it is not practical to define a unique place and route strategy nor to develop a single compiler for each conceivable array structure. The alternative chosen was to implement an environment for the development of specific generators, named Madmacs. A regular array consists of active elements (processors, memories, latches, etc.) interconnected with neighbours. The layout of a regular array can be generated in two steps, as follows:
Active element generation: first, the layout of active elements is generated by classical compilers. However, this generation must be constrained in order to retain the routing regularity at the array level.
Array assembly: the array is then produced by abutment of the active elements.
The Madmacs system [GP92] is an environment for the development of generators. Basically, Madmacs is a Lisp-based design tool, coupled with a graphical front-end. Mixing language and graphics is interesting for two reasons. On the one hand, the use of a language gives great flexibility. For example, parameters and conditional generation instructions can be used. On the other hand, a graphical front-end gives great interactivity. With the "WYSIWYG" macro-command mechanism, one can develop functions for repetitive tasks such as routing. The most important feature of Madmacs is the possibility it offers of using coordinate-free movements and functions. It is similar to the emacs text editor, in which some commands are independent of word or line size. A designer can therefore interactively produce a symbolic layout by using only operations that are independent of object size. Such procedures can be easily generalized into a program. The Madmacs system has been used to develop several generators. In particular, the layout of the Api69 chip presented in section 3.2 [FL91b] was designed in order to test the concepts. The floorplan of this chip is shown in figure 14.
Conclusion
We have presented some of the research directions followed by our group regarding methods for designing special-purpose architectures. These directions are based on several ideas:
formal specifications will play a prominent role in the future, as they permit the synthesis and verification of designs;
real-time simulation is very often needed before a special-purpose architecture is committed to silicon. A hardware and software lego approach is a good way to reach the performance needed while keeping the development effort at a reasonable level;
regular parallel architectures will be the key to the design of the computationally intensive Vlsi parts of systems. The design of these architectures can benefit from powerful new methods such as those which are studied for systolic arrays.
We have illustrated these principles through various examples. The use of the Signal synchronous language for the specification of a video compression algorithm was presented. We have then described the MicMacs lego architecture. The principles of the Alpha du Centaur environment for the design of regular architectures were explained and illustrated on the example of a motion estimation algorithm. Finally, we have explained the concepts of the Madmacs layout generator for regular arrays.
In the future, it is very likely that approaches such as the ones mentioned in this paper will become more commonly used, either directly, or more likely embedded in integrated CAD frameworks.
The main impediment to the spreading of such methods is certainly their current esoteric notations, which are very close to the formal models they come from. A considerable effort is thus needed to shorten the distance between these languages and tools and the way designers are used to working.
