This paper describes a set of simple design and performance analysis techniques that have been successfully used to design a number of nontrivial delay insensitive circuits. Examples are building blocks for digital filters and a vector multiplier using a serial-parallel multiply and accumulate algorithm. The vector multiplier circuit has been laid out, submitted for fabrication and successfully tested. Throughout the paper, elements from this design are used to illustrate the design and performance analysis techniques. The design technique is based on a data flow approach using pipelines and rings that are composed into larger multi-ring structures by joining and forking of signals.
Introduction
Design and automatic synthesis of delay insensitive circuits are active areas of research [1, 2, 3, 4, 5, 6]. Because delay insensitive circuits are different from synchronous circuits, and because delay insensitive circuits are difficult to verify by simulation, most research is devoted to the development of formal design methods. Performance analysis and optimization of delay insensitive circuits is a topic that has only been addressed very recently [7, 8, 9, 10], and therefore the material has not yet matured into widespread use. This paper describes a set of simple design and performance analysis techniques that have been successfully used to design a number of nontrivial delay insensitive circuits. Examples are building blocks for digital filters [11] and a vector multiplier using a serial-parallel multiply and accumulate algorithm [12]. The design technique is based on a static data flow concept, and the structure of the circuits consists of pipelines and rings that are connected into multi-ring structures by forking and joining of signals. The designs are implemented using a small set of building blocks (latches, combinational circuits and switches) that are realized using C-elements and simple gates like inverters and OR-gates. Only static CMOS circuitry is used, and the circuits are delay insensitive except for some isochronous forks at well defined places within the basic building blocks.
The performance analysis technique that we have used is based on signal transition graphs and follows the lines developed in [8, 9]. The main contribution of the paper is to demonstrate that by limiting the design to a specific class of pipeline and ring structures implemented from a simple set of circuit elements, it is possible, even for complex designs, to analyze the performance and establish an understanding of the bottlenecks.
The paper is organized as follows: Section 2 introduces delay insensitive pipelines, rings and multi-ring structures. Section 3 describes in qualitative terms the basic performance characteristics of these structures. Section 4 describes the implementation of the basic building blocks. Section 5 illustrates the multi-ring concept by describing briefly the vector multiplier design. Sections 6, 7, 8, and 9 deal with performance analysis: Section 6 defines some key performance parameters. Section 7 analyzes simple ring structures composed of identical stages. Section 8 describes a general analysis technique based on signal transition graphs, and finally Section 9 illustrates this technique by calculating the critical path that determines the performance of the vector multiplier.
Delay insensitive multi-rings
Delay insensitive circuits are asynchronous and the sequencing of their computations is determined by the data flow rather than by clock signals or other global control signals. When inputs to a sub-circuit are ready, the computation can start and as soon as the result is computed, the next computation can be initiated. In this section we describe a class of circuits, called delay insensitive multi-rings, using such a data driven approach.
Data representation
In a delay insensitive circuit there is no clock signal to determine when a computation can start and when it is complete. Instead, it must be possible to detect the arrival of a new input from the data themselves. To be able to do this, a 
Pipelines
The composition of F, G and H described above can be realized as a pipeline if a latch is added on the output of each function block. Figure 1 shows this three stage pipeline. A delay insensitive latch holds back input data until the successor circuits are ready to receive them. The latch is controlled by acknowledge signals from succeeding latches. A latch may load and hold a valid value when its successor latch in the pipeline holds the empty value (indicated by the incoming acknowledge being false). Similarly a latch may load and hold an empty value when its successor latch in the pipeline holds a valid value (indicated by the incoming acknowledge being true).
In general the function blocks pass the acknowledge signals backwards without any modification, and throughout the rest of the paper the bus symbol representing data signals also implies the associated acknowledge signal.
The handshaking described above ensures that the data flowing through a delay insensitive pipeline always consist of alternating valid and empty values, and the term data token or just token is used to denote a valid-empty data pair. A token thus occupies two latches or pipeline stages.
When data has been propagated from one latch, L1, to the succeeding latch, L2, the contents of L1 are no longer needed, and they can later be overwritten by data propagated from the preceding latch. Note that there is a period where data stemming from the same token occupies two or more neighboring stages. This is called a bubble. Bubbles can be viewed as catalysts: they are necessary for propagating data. When data values flow forward in the pipeline, bubbles flow backwards.
As we shall see in the following sections the number of bubbles in a delay insensitive circuit has a dominating influence on its performance.
Delay insensitive rings
In this section, we describe a number of generalisations of the pipeline leading to a general characterization of the delay insensitive circuit structures used in this work.
In a latched pipeline with at least three latches, it is possible to connect the output of the last stage to the input of the first, forming a delay insensitive ring. Such a ring is capable of performing an iterative computation. In [5, 8] it is described how such a ring is used to perform floating point division. The data in the ring represents a partial remainder, and each stage computes one bit of the final result and forms a new partial remainder which is sent to the next stage, etc. Consider a delay insensitive ring with three latches. These latches will always contain one of two patterns: either two valid and one empty element, or two empty and one valid. A sequence of computations is shown in figure 2. It is quite simple to build rings with more than three pipeline stages. The choice is mainly governed by performance considerations. In a large ring with many pipeline stages, computations can be overlapped by having more tokens in the ring, and this can improve the throughput.
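The claim about the two patterns can be checked exhaustively. The following Python sketch assumes a simplified latch rule derived from the handshake described in the Pipelines section: a latch copies its predecessor when the predecessor holds the opposite kind of value and the successor already holds the same kind as the latch itself. The function names `enabled` and `successors` are illustrative only and do not come from the design.

```python
from itertools import product
from typing import List


def enabled(ring: str, i: int) -> bool:
    """True if latch i may copy the value held by its predecessor."""
    pred, cur, succ = ring[i - 1], ring[i], ring[(i + 1) % len(ring)]
    # New data is offered (pred != cur) and the successor has already
    # taken over the current contents (succ == cur).
    return pred != cur and succ == cur


def successors(ring: str) -> List[str]:
    """All states reachable by firing exactly one enabled latch."""
    return [ring[:i] + ring[i - 1] + ring[i + 1:]
            for i in range(len(ring)) if enabled(ring, i)]


# Classify all 8 states of a three-latch ring ('V' = valid, 'E' = empty).
all_states = ["".join(s) for s in product("VE", repeat=3)]
live = [s for s in all_states if s.count("V") in (1, 2)]
dead = [s for s in all_states if s.count("V") in (0, 3)]

for state in live:
    print(state, "->", successors(state))
```

Under this rule every state with one or two valid elements has exactly one successor, which again has one or two valid elements, while the all-valid and all-empty states cannot move at all.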
Independent rings (and pipelines) can be connected in different ways using fork and/or join elements. In a simple join element both inputs are propagated when either they are both empty or they are both valid. There are many possible variations and combinations of join and fork elements. The switch described in section 4.3 is an example of such an element: it has two data inputs and two data outputs, and a control signal selects how inputs and outputs are connected.
Combining the various building blocks such as function blocks, latches, fork elements, join elements and switches, it is possible to construct a class of delay insensitive circuits consisting of interacting rings and pipelines called multi-ring structures.
Basic performance characteristics
This section describes in qualitative terms the basic performance characteristics of pipelines, rings and multi-ring structures. 
Introduction
In a synchronous circuit all storage elements are updated in parallel and in this sense a synchronous design is highly concurrent. Another characteristic of a synchronous circuit is that its performance can be determined by a static analysis of the structure of the circuit.
In both respects a delay insensitive circuit is different: The updating of storage elements in a delay insensitive circuit depends on the state (content) of neighbouring storage elements, and if the circuit is not designed carefully these dependencies can greatly reduce the number of operations that can take place in parallel, thus reducing the performance significantly. Furthermore, the performance of a delay insensitive circuit depends not only on the structure of the circuit but also on how the circuit is initialized and on how it is used by the environment.
An understanding of this dynamic behaviour is essential to the design of efficient delay insensitive circuits, and this section introduces some basic principles and develops a qualitative understanding of the topic. Sections 6-9 in the paper describe in detail the analysis techniques.
Basic concepts
The basic concepts can be illustrated by a simple example: a shift register in which there are N tokens. A shift register is simply a pipeline that is used by its environment in such a way that the number of tokens is invariant. This example is relevant because (1) the vector multiplier described in section 5 contains a number of shift registers, and (2) it models the behaviour of a ring (in which the number of tokens is also constant). Figure 3 illustrates the behaviour of different implementations with 2N, 3N, and 4N stages respectively (for the case N = 3). The boxes represent pipeline stages (i.e. latches) and the numbers represent different valid data values. The boldfaced bus-arrows depict state changes (data transfers).
The difference between the three pipeline realizations is the number of bubbles. In figure 3(a) reading a value at the output introduces a bubble that travels backwards, causing the data transfers to take place one at a time. Hence, the time it takes to move all elements one stage to the right is at least 2N. In figure 3(b) the same computation is illustrated in a pipeline with 3N stages. This pipeline contains N bubbles, and therefore N data transfers can occur simultaneously. Hence, the time it takes to move all elements is at least 2. Finally, in figure 3(c) the pipeline consists of 4N stages. This makes it possible for all valid and empty data values to move simultaneously, reducing the time to 1. Increasing the number of bubbles beyond the 2N found in figure 3(c) does not increase the performance further.
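The effect of the number of bubbles can be reproduced with a small unit-delay simulation. The Python sketch below is an illustration under stated assumptions, not a model of the actual circuits: the shift register is closed into a ring so that the number of tokens stays constant, every enabled data transfer takes one time unit, and a ring with 2N + 1 stages stands in for figure 3(a), where the environment introduces a single bubble. The names `step` and `steps_to_shift` are hypothetical.

```python
def step(ring: str) -> str:
    """Fire all enabled latches simultaneously (one unit of time)."""
    n = len(ring)
    out = []
    for i in range(n):
        pred, cur, succ = ring[i - 1], ring[i], ring[(i + 1) % n]
        # A latch copies its predecessor when new data is offered and the
        # successor has already taken over the current contents.
        out.append(pred if (pred != cur and succ == cur) else cur)
    return "".join(out)


def steps_to_shift(ring: str, limit: int = 1000) -> int:
    """Time units until every value has moved one stage to the right."""
    target = ring[-1] + ring[:-1]
    for t in range(1, limit + 1):
        ring = step(ring)
        if ring == target:
            return t
    raise RuntimeError("ring did not reach the shifted state (deadlock?)")


N = 3  # number of tokens; 'V' = valid value, 'E' = empty value
print(steps_to_shift("VVE" + "VE" * (N - 1)))  # one bubble
print(steps_to_shift("VVE" * N))               # N bubbles
print(steps_to_shift("VVEE" * N))              # 2N bubbles
print(steps_to_shift("VVVEE" * N))             # 3N bubbles
```

In this model the four cases take 2N = 6, 2, 1 and 1 time units respectively, in line with the qualitative argument above: once there are 2N bubbles, adding more does not help.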
As the number of bubbles in a design depends on the number of latches per token, the above analysis illustrates that performance optimization of a given circuit is primarily a task of structural modification; circuit level optimization like transistor sizing is of secondary importance.
Finally, it must be emphasized that the above is only a simplified analysis that illustrates some fundamental qualitative properties. It is, however, sufficient to understand the following illustrative but more complex example.
3.3
In the vector multiplier described in section 5 we use a shift register with parallel load, a Parallel-In-Serial-Out register called PISO. Figure 4(a) shows an initial design with one switch (SW) and two latches per bit. The switch (see figure 7(a)) controls when to parallel load new data and when to perform a right shift. The two latches contain a data token. This design has the same problem as the pipeline in figure 3(a): there are too few bubbles.
To obtain a reasonable performance 3 latches are needed per stage. This creates more bubbles which enables more data transfers to take place concurrently when data is being shifted. Instead of having 3 latches in the data path the 3rd latch is added in the control path, figure 4(b). In this way, broadcasting of the switch control signal to all switches in the PISO can be avoided, and as will be clear from the following sections, broadcasting of information can have severe impact on the performance.
The PISO design in figure 4(b) exhibits an interesting and illustrative dynamic behaviour: Initially, the data latches in the PISO are densely packed with data tokens, and the latches in the control signal path all contain empty values. For each control token that is input to the PISO it outputs a data token at the serial port or it performs a parallel load. A parallel load replaces the valid data items with new values. As data is being read from the serial port the data tokens will spread over more latches, and control tokens will be travelling in the opposite direction in the control latch chain. When a control token reaches the leftmost end of the PISO a "0-data token" is input from the environment, and as part of this operation the control token disappears. Finally, when the environment stops reading data from the PISO the control latch chain will eventually flush, and the chain of data latches will again fill with valid data items in every second register stage. Notice finally that the number of data and control tokens in the design is invariant.
Figure 4: Shift register sections from the PISO: (a) initial and poor design, and (b) final design that avoids broadcasting of the switch control signal.
Realization of building blocks
In this section, we describe the realization of the most important building blocks (function blocks, latches and switches) that are needed for constructing the delay insensitive multi-ring structures introduced in section 2. More details on the complete set of circuit elements are given in [12] .
Function blocks
A function block computes a specific (combinational) function when its inputs are valid, and it also propagates empty values. To synthesize such a delay insensitive circuit for a function block we use a technique called delay insensitive min-term synthesis (DIMS). This technique resembles the traditional sum of products approach, but there are also a few important differences:
• the min-terms are formed using C-elements (instead of AND-gates),
• reduction of the boolean equations by combining min-terms into simpler terms is (in general) not allowed.
Together, these requirements assure that the combinational circuits do not produce any valid output signals until all input signals are valid, and that none of the output signals change back to the empty value until all inputs are empty. A similar technique has been used by others [13].
Data is represented in a dual-rail code where two wires, x.t and x.f, are used to represent a single bit x.
The value empty (E) is represented with the signals on both wires low, true (T) is represented by x.t high and x.f low, and false (F) by x.f high and x.t low.
The DIMS technique can be illustrated with the circuit for a delay insensitive dual-rail AND-gate, see figure 5. The DIMS technique has been automated in a tool which can synthesize delay insensitive circuits from high-level descriptions [15], and we are currently working on a more refined synthesis tool.
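The behaviour of a DIMS dual-rail AND-gate can be mimicked in a few lines of Python. The sketch below is a behavioural illustration only; it assumes one two-input C-element per min-term and a single OR of the three min-terms that yield false, which follows from the DIMS rules above but is not taken from figure 5. The class names `CElement` and `DimsAnd` are hypothetical.

```python
from typing import Dict, Tuple

Rail = Tuple[int, int]  # (x.t, x.f)
EMPTY, TRUE, FALSE = (0, 0), (1, 0), (0, 1)


class CElement:
    """Muller C-element: the output changes only when both inputs agree."""

    def __init__(self) -> None:
        self.out = 0  # initialized low, i.e. to the empty state

    def update(self, a: int, b: int) -> int:
        if a and b:
            self.out = 1
        elif not a and not b:
            self.out = 0
        return self.out  # inputs disagree: hold the previous value


class DimsAnd:
    """Dual-rail AND with one C-element per min-term (no reduction)."""

    def __init__(self) -> None:
        self.minterms: Dict[str, CElement] = {
            k: CElement() for k in ("tt", "tf", "ft", "ff")}

    def update(self, a: Rail, b: Rail) -> Rail:
        rail_a = {"t": a[0], "f": a[1]}
        rail_b = {"t": b[0], "f": b[1]}
        m = {k: c.update(rail_a[k[0]], rail_b[k[1]])
             for k, c in self.minterms.items()}
        out_t = m["tt"]                      # true only if both are true
        out_f = m["tf"] | m["ft"] | m["ff"]  # OR of the other min-terms
        return (out_t, out_f)


gate = DimsAnd()
for a, b in [(TRUE, EMPTY), (TRUE, FALSE), (EMPTY, FALSE), (EMPTY, EMPTY)]:
    print(a, b, "->", gate.update(a, b))
```

The input sequence exercises both guarantees: the output stays empty while only one input is valid, and it stays valid until both inputs have returned to empty.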
Multi output function blocks
The DIMS technique does not in general allow reduction of boolean equations. If, however, multiple logic functions depend on the same input, they can share the C-elements and thus achieve a reasonably efficient circuit implementation. As an example, we mention a full adder where both the sum and the carry depend on the two input operands and the incoming carry. A delay insensitive full adder can thus be built using 8 C-elements and 4 OR-gates.
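The gate count for the full adder can be verified by enumerating the min-terms. The following Python sketch assumes one (three-input) C-element per min-term and one OR-gate per output rail; the function name `dims_full_adder_structure` is illustrative only.

```python
from itertools import product
from typing import Dict, List, Tuple

MinTerm = Tuple[int, int, int]  # (a, b, carry_in)


def dims_full_adder_structure() -> Tuple[List[MinTerm], Dict[str, List[MinTerm]]]:
    """Return the shared min-terms and the inputs of each output OR-gate."""
    minterms = list(product((0, 1), repeat=3))  # one C-element each
    or_gates: Dict[str, List[MinTerm]] = {
        "sum.t": [], "sum.f": [], "carry.t": [], "carry.f": []}
    for term in minterms:
        total = sum(term)
        # Each min-term drives exactly one sum rail and one carry rail.
        or_gates["sum.t" if total % 2 else "sum.f"].append(term)
        or_gates["carry.t" if total >= 2 else "carry.f"].append(term)
    return minterms, or_gates


minterms, or_gates = dims_full_adder_structure()
print(len(minterms), "C-elements,", len(or_gates), "OR-gates")
for rail, inputs in or_gates.items():
    print(rail, inputs)
```

Each of the four OR-gates collects four of the eight shared min-terms, which is where the saving compared with two independent DIMS circuits comes from.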
Latches
A latch for a single dual-rail encoded bit is built from two C-elements, an OR-gate and an inverter, see figure 6. The OR-gate generates the acknowledge signal, indicating whether the latch holds a valid or empty value. The corresponding acknowledge from the succeeding register, ackout, determines whether the register should hold its current value or load a new one. Notice the similarity between the latch implementation and the function block implementation.
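A behavioural sketch of such a latch is given below in Python. The wiring is an assumption made for illustration (figure 6 is not reproduced here): each data rail feeds a C-element together with the inverted ackout, and the OR of the two C-element outputs is the acknowledge sent to the preceding stage. The names `CElement`, `DualRailLatch` and `ackin` are hypothetical.

```python
from typing import Tuple

Rail = Tuple[int, int]  # (x.t, x.f)
EMPTY, TRUE, FALSE = (0, 0), (1, 0), (0, 1)


class CElement:
    """Muller C-element: the output changes only when both inputs agree."""

    def __init__(self) -> None:
        self.out = 0

    def update(self, a: int, b: int) -> int:
        if a and b:
            self.out = 1
        elif not a and not b:
            self.out = 0
        return self.out


class DualRailLatch:
    """Two C-elements, one OR-gate and one inverter (assumed wiring)."""

    def __init__(self) -> None:
        self.c_t = CElement()
        self.c_f = CElement()

    def update(self, data: Rail, ackout: int) -> Tuple[Rail, int]:
        enable = 1 - ackout  # inverter on the acknowledge from the successor
        out = (self.c_t.update(data[0], enable),
               self.c_f.update(data[1], enable))
        ackin = out[0] | out[1]  # OR-gate: 1 when the latch holds a valid value
        return out, ackin


latch = DualRailLatch()
print(latch.update(TRUE, 0))   # successor empty: load the valid value
print(latch.update(EMPTY, 0))  # successor has not acknowledged: hold
print(latch.update(EMPTY, 1))  # successor holds valid: load the empty value
```

With this wiring the latch loads a valid value only while ackout is false and returns to empty only while ackout is true, which is the rule stated for pipeline latches in section 2.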
Two degenerated forms of the latch are also needed, one for consuming values (and generating acknowledgements) and one for producing constant values (and consuming acknowledgements). These are quite simple variations of the fundamental latch shown in figure 6. They are, for example, used to "terminate" unused inputs and outputs at the boundary of multi-ring structures.
Switches
Switches are used as data flow control elements. In the general case two data signals are either crossed or just passed through, determined by a control signal. For the vector multiplier design discussed in this paper we need an asymmetric switch where either both data signals are passed through or only one of them is crossed over and the other waits, see figure 7(a). The asymmetric switch makes it possible to implement shift register structures and to combine rings with different data rates. It should be noted that the control input also follows the four cycle protocol alternating between empty and valid values.
The switch consists of five single output combinational DIMS circuits, see figure 7(b): two for generating data on the output ports, two for generating the acknowledge signals on the input ports and one for generating the acknowledge signal on the control port.
Fork and join elements
The fork and the join elements complete the set of building blocks. Forks are used when the same signal is input to several circuits, and joins are used when signals from several sources are input to a circuit. A join is simply a concatenation of data busses; it does not involve any active circuitry. Similarly, a fork is mainly just wires. It only requires a C-element to combine the acknowledge signals from the sinks into a single acknowledge signal.
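The acknowledge handling in a fork can be sketched as follows in Python. This is a behavioural illustration, assuming that a fork with more than two sinks combines its acknowledges in a balanced tree of two-input C-elements; the names `ForkAcknowledge` and `c_tree_depth` are hypothetical.

```python
from typing import Sequence


class ForkAcknowledge:
    """Combine the acknowledges of all sinks into one acknowledge."""

    def __init__(self) -> None:
        self.ack = 0

    def update(self, sink_acks: Sequence[int]) -> int:
        # Behaves like an n-input C-element: change only on agreement.
        if all(sink_acks):
            self.ack = 1
        elif not any(sink_acks):
            self.ack = 0
        return self.ack


def c_tree_depth(n_sinks: int) -> int:
    """Levels of two-input C-elements needed for n_sinks acknowledges."""
    if n_sinks < 1:
        raise ValueError("a fork needs at least one sink")
    return (n_sinks - 1).bit_length()  # equals ceil(log2(n_sinks))


fork = ForkAcknowledge()
print(fork.update([1, 0]))  # only one sink has acknowledged
print(fork.update([1, 1]))  # all sinks have acknowledged
print(c_tree_depth(16))     # levels needed for a fork with 16 sinks
```

The tree depth grows with the logarithm of the number of sinks, so a wide fork adds several C-element delays to the acknowledge path.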
Initialization
The initialization of a delay insensitive circuit plays a major role. Because the circuit is data driven, it is important to insert tokens and bubbles in such a way that it will start to operate (i.e. to avoid deadlocks). In a synchronous circuit it is very often the case that only a subset of the registers are reset. The remaining registers will then assume well defined values during the first clock cycles of normal operation. This is not a feasible scheme in delay insensitive circuits because all latches depend not only on their input but also on the output of the succeeding state holding element (via the acknowledge signal). This bidirectional flow of information (data forward and acknowledge backward) makes it necessary to explicitly initialize all C-elements including those used in combinational logic.
The vector multiplier
Based on the structural and circuit design techniques presented in the preceding sections we have designed a number of experimental chips. One of these is a vector multiplier which is described briefly below, both to illustrate the design technique and to provide an example for the following sections on performance analysis. More details on the design are given in [12].
Algorithm
The inputs to the vector multiplier are two streams of vector elements (in bit-parallel form) and the output is the inner product of the two vectors. An iterative serial-parallel multiplication algorithm implementing the "paper and pencil approach" is used, see figure 8. In each iteration step the circuit performs a multiply, add and shift operation, corresponding to the processing of one row of bit products. This algorithm requires: (1) a multiply-accumulate unit in which the result is gradually formed (the width of the accumulator corresponds to the width of the result), (2) a shift register that converts one of the operands into serial representation, and (3) a shift register for shifting the other operand (extended with zeros at both ends) one place to the left in each iteration. The two operands are called the "serial operand" and the "parallel operand" respectively.
To avoid ripple carry propagation in each iteration step, and because it fits nicely with the serial-parallel algorithm, the temporary result is represented in carry save form. Conversion into binary representation is postponed until after the last two vector elements have been multiplied, and the conversion is then done by extending the last "serial operand" with leading zeroes and by taking the circuit through a number of additional iteration steps as indicated in figure 8.
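The arithmetic of the algorithm can be illustrated at the bit level in Python. The sketch below models only the numbers, not the circuit: the accumulator is an unbounded Python integer rather than a fixed-width register, the serial operand is assumed to be processed least significant bit first, and the names `carry_save_add` and `inner_product` are hypothetical.

```python
from typing import Sequence, Tuple


def carry_save_add(s: int, c: int, row: int) -> Tuple[int, int]:
    """One carry save step: three operands in, sum and carry words out."""
    new_sum = s ^ c ^ row
    new_carry = ((s & c) | (s & row) | (c & row)) << 1
    return new_sum, new_carry


def inner_product(xs: Sequence[int], ys: Sequence[int], width: int) -> int:
    """Serial-parallel multiply-accumulate with a carry save accumulator."""
    s = c = 0
    for serial, parallel in zip(xs, ys):
        shifted = parallel
        for _ in range(width):               # one iteration per serial bit
            row = shifted if serial & 1 else 0  # row of bit products
            s, c = carry_save_add(s, c, row)
            serial >>= 1                      # next bit of the serial operand
            shifted <<= 1                     # shift the parallel operand left
    # Conversion to binary: extra iterations with zero rows until no
    # carries remain (the serial operand extended with leading zeroes).
    while c:
        s, c = carry_save_add(s, c, 0)
    return s


print(inner_product([3, 5, 7], [2, 4, 6], width=4))
```

Note that no carry propagation takes place between elements; the carries are only resolved once, after the last pair of elements.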
Design
The core of the design is a combined and bit-sliced implementation of the multiply-accumulate unit and the shift register for the parallel operand. This block is called the Multiply-Accumulate-Shift (MAS) block, and figure 9 shows the design of a bit-slice of this block. In addition, the design consists of a parallel-in-serial-out shift register for the serial operand (called PISO), and a small and simple control unit that in each iteration step issues the control signals (Ctl1 and Ctl2) for the switches in the MAS-block. The implementation of the PISO is explained in section 3.3. A description of the control unit is beyond the scope of this paper, and the interested reader is referred to [12].
Figure 9: A bit-slice of the MAS unit
The MAS bit-slice in figure 9 is a nice illustration of the multi-ring concept. Starting from the top there are two switches controlling the data flow. Below, a number of signals are broadcast to all MAS bit-slices. These signals are the serial-operand bit and the control signals for the switches. In the bottom there are three rows of latches, an AND-gate and a full-adder, that implement the actual multiply-accumulate-shift function.
From a data-flow point of view the bit-slice consists of a small ring and two pipeline sections:
• A 3-stage ring for the sum bit of the accumulator.
• A 3-stage pipeline (or shift register) for the carry bit of the accumulator (the path from Cy[i-1] to Cy[i]).
• A 3-stage pipeline (or shift register) for the parallel operand (the path from P[i-1] to P[i]).
The sum ring and the two pipeline sections are connected and synchronized with each other and with the corresponding pipeline sections and rings in the other bit-slices via the switches and the function blocks.
Physical implementation
A test chip has been fabricated on EUROCHIP's October 1991 run at ES2 (European Silicon Structures Inc.) in a 1.5 micron CMOS technology. This chip multiplies vectors with &bit elements, and it has a 10-bit accumulator for the result. The test of the fabricated chips showed that they are fully functional.
The physical implementation is a standard cell layout. The Autocells tool (part of the GDT design system from Mentor Graphics Inc.) has been used for the physical implementation. As C-elements are used extensively in the design, we have developed a C-element standard cell generator for the GDT design system. The chip contains 12,450 transistors, and the area of the core of the chip is 7.3 mm2. The area including pad-cells is 18 mm2.
The time for a multiply, accumulate and shift iteration step has been determined by a post layout simulation to be 30 ns, which is in accordance with the result of the performance analysis described in section 9.2.
Performance parameters
The performance of a pipeline can be characterized by the parameters throughput and latency. A third performance parameter, which does not have an equivalent for synchronous circuits, is the dynamic wavelength. Below is a brief definition of these three parameters. A more elaborate definition is given in [8, 9].
Latency:
The latency is the delay from input of a data item until the corresponding output data item is produced. When data flows forward, acknowledge signals propagate in the reverse direction, and therefore two parameters are defined:
The forward latency, L_f, is the delay from new data arriving on the input of a stage to the production of the corresponding output. It is assumed that the latency is independent of the value of the data.
The reverse latency, L_r, is the delay from receiving an acknowledge (from the succeeding stage) until the corresponding acknowledge is produced (to the preceding stage).
Period:
The period, P, is the minimal delay between input of successive tokens. As a token consists of both a valid and an empty data value, the period includes a complete four-phase handshake cycle: (1) forward propagation of a valid data value, (2) reverse propagation of acknowledgement, (3) forward propagation of the empty data value, and (4) reverse propagation of acknowledge. Therefore:
$$P = 2 L_f + 2 L_r$$
Throughput:
The throughput, T, is the number of tokens that can flow through a pipeline stage per time unit:
$$T = 1 / P$$
Dynamic wavelength:
The dynamic wavelength, W_d, of a pipeline is the number of pipeline stages that a forward propagating data item will pass through during P:
$$W_d = P / L_f$$
The parameters defined above are local performance parameters characterizing the circuit implementation of the individual pipeline stages. When a number of pipeline stages are connected to form a ring the following parameter is relevant:
The cycle time of a ring, T_cycle, is the time it takes a token to make one round trip through all pipeline stages in the ring. To achieve maximum performance (minimum cycle time) of a ring, the number of pipeline stages (per token) should match the dynamic wavelength, in which case T_cycle = P. If the number of pipeline stages is less, the cycle time will be limited by lack of bubbles, and if there are more pipeline stages the cycle time will be limited by the forward latency through the pipeline stages. In the bubble limited case the cycle time of a ring with N pipeline stages holding a single token is:
$$T_{cycle}(\mathrm{BubbleLimited}) = \frac{2N}{N-2} L_r$$
For the sake of completeness we mention that a third possible mode of operation called "control limited" exists for some circuit configurations [8, 9]. This is, however, not relevant for delay insensitive multi-ring structures implemented using the building blocks described in section 4.
Analysis of simple rings
When the overall structure of a design is being settled, an important design task is to determine the optimal number of pipeline stages in the rings in the design. In order to establish a basis for first order design decisions, this section analyzes some simple rings composed of identical pipeline stages, each consisting of a one-level DIMS function block followed by a latch. To get a lower bound on the cycle time, the analysis also includes rings without function blocks, i.e. rings consisting of latches only. Although rings may contain many tokens we restrict the analysis to rings with a single token, because this accounts for most practical designs and because it simplifies the analysis. The cycle time is expressed in terms of the delays in the basic components: C-elements, NOR-gates and inverters (used as buffers). Their delays are denoted tc, tN and tI respectively. From simulations of standard cells implemented in the 1.5 micron CMOS technology that is available to us via EUROCHIP, we have found the following actual delay values: tc = 1.1 ns, tN = 0.6 ns and tI = 0.7 ns. These values include the effect of representative output loads.
It should be noted that although most circuits use OR-gates it is normally possible to optimize the circuits by combining the OR-gates with inverters elsewhere in the circuit (c.f. the latch in figure 6).
In order to perform computations a ring must contain at least one function block, and from table 1 it can be seen that the optimal number of pipeline stages is less than 5. Going from 3 to 4 stages results in a marginal performance improvement at a significant area increase. Therefore, rings with 3 pipeline stages are in general the optimal choice with our circuit configuration.
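The performance parameters can be evaluated numerically with a few helper functions. The Python sketch below makes several assumptions that are not stated in the text: the period is taken as the sum of the four handshake phases, P = 2(L_f + L_r); the dynamic wavelength is taken as P / L_f, read directly off its definition; and the reverse latency of a latch is assumed to be one C-element plus one NOR-gate delay. All function names are hypothetical, and the printed values are not a reproduction of table 1.

```python
T_C, T_N, T_I = 1.1, 0.6, 0.7  # component delays in ns, as quoted above


def period(l_f: float, l_r: float) -> float:
    """Four-phase handshake: two forward and two reverse propagations."""
    return 2.0 * (l_f + l_r)


def throughput(p: float) -> float:
    """Tokens per time unit through one pipeline stage."""
    return 1.0 / p


def dynamic_wavelength(p: float, l_f: float) -> float:
    """Stages a forward propagating data item passes during one period."""
    return p / l_f


def bubble_limited_cycle(n_stages: int, l_r: float) -> float:
    """Cycle time of a single-token ring that is limited by its bubbles."""
    if n_stages <= 2:
        raise ValueError("a ring needs at least three stages")
    return 2.0 * n_stages / (n_stages - 2) * l_r


L_R = T_C + T_N  # assumed reverse latency of a latch, in ns
for n in range(3, 7):
    print(n, "stages:", round(bubble_limited_cycle(n, L_R), 2), "ns")
```

For N = 3 the bubble-limited formula gives six reverse latencies per cycle, which agrees with the observation that the single bubble in a 3-stage ring takes part in six data transfers per cycle.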
Signal transition graph analysis
When all pipeline stages are identical (as was the case above), it is quite simple to determine the cycle time directly. However, in general a more systematic procedure is needed. Signal transition graphs have been proposed as a formal model in which it is possible [...] figure 10(c) is composed. The labels outside the node boxes denote circuit delays associated with the signal transition. We have developed a particular style for the graphs that we find very illustrative and easy to understand: The nodes corresponding to the forward flow of valid and empty data values are organized as two horizontal rows, and nodes representing the reverse-flowing acknowledge signals appear as segments connecting the rows.
The cycle time of the ring is the time from some signal transition until the same signal transition occurs again. Notice that the single bubble is involved in 6 data transfers and the bubble therefore makes two reverse round trips in the ring for each forward cycle of data. Figure 10(c) also illustrates that if function blocks with delays greater than tc + 2tN are used, the cycle time will be data limited, corresponding to a path through the upper row of "valid assignment nodes" or through the bottom row of "empty assignment nodes." A dependency graph analysis of a 4-stage ring is very similar. The only difference is that there are two bubbles in the ring. In the signal transition graph this corresponds to the existence of two "bubble cycles" that do not interfere with each other.
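Finding the cycle that determines the cycle time can be automated for small graphs. The Python sketch below is a generic illustration and not the analysis procedure of [8, 9]: it assumes that every cycle carries exactly one token, so that the cycle time is simply the largest sum of node delays around any simple cycle. The function `longest_cycle` and the three-node example graph are hypothetical and do not correspond to figure 10.

```python
from typing import Dict, List, Tuple


def longest_cycle(successors: Dict[str, List[str]],
                  delay: Dict[str, float]) -> Tuple[float, List[str]]:
    """Return (total delay, nodes) of the slowest simple cycle."""
    best_time, best_path = 0.0, []

    def dfs(start: str, node: str, total: float, path: List[str]) -> None:
        nonlocal best_time, best_path
        for nxt in successors[node]:
            if nxt == start:
                if total > best_time:
                    best_time, best_path = total, list(path)
            elif nxt > start and nxt not in path:
                # Only visit nodes "larger" than the start node, so that
                # each cycle is enumerated once, from its smallest node.
                path.append(nxt)
                dfs(start, nxt, total + delay[nxt], path)
                path.pop()

    for start in sorted(successors):
        dfs(start, start, delay[start], [start])
    return best_time, best_path


# Hypothetical graph: node -> successor nodes, and a delay (ns) per node.
example_successors = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
example_delay = {"a": 1.1, "b": 0.6, "c": 1.1}
print(longest_cycle(example_successors, example_delay))
```

The exhaustive search is exponential in the worst case, which is acceptable for graphs of the size that arise from a few pipeline stages.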
Cycle time of the vector multiplier
The core of the vector multiplier chip is the multiply-accumulate-shift block and as a designer one would expect the critical path in the design to be found here. As we shall see in the following this is unfortunately not the case in the present design.
Cycle time of the MAS block
The MAS block in the design (figure 9) consists of 3 [...] more complex than the simple DIMS-circuits from the analysis in the previous section. Because of the implicit synchronization of data signals in the function blocks and switches, the MAS-block can be analyzed using an equivalent three stage ring with the following function blocks between the latches:
1. The full adder (section 4.1) is a multi-output function block. Conceptually it consists of two combinational circuits to which the input signals are forked. This forking requires a C-element in the acknowledge path to combine the acknowledge signals from the two output ports.
2. The switch (figure 7) is even more complex than the adder. It consists of: (a) two combinational circuits in the forward data path (from control input to data output), (b) a combinational circuit [...] combinational circuit that computes the acknowledge signal on the control port.
3. The AND-gate (figure 5) that computes the bit product.
In order to take the acknowledge circuitry in the adder and the switch into account, the signal transition graph in figure 10 must be extended with extra nodes in the outgoing edges of the appropriate acknowledge assignment nodes. The delay associated with these extra nodes is tc for the new full-adder acknowledge nodes and tc + tN for the new switch acknowledge nodes. [...] extra nodes and as these nodes go into the longest cycle in the graph, the period in equation 1 of the ring increases with 2(tc + tN) + 2tc:
$$P_{MAS} = P + 2(t_c + t_N) + 2 t_c = P + 5.6\ \mathrm{ns}$$
This is the minimum period for a step of the serialparallel multiplication algorithm and therefore the speed at which one would wish the entire circuit to operate. As explained in the next section the critical path is unfortunately somewhat larger.
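The increase in the period can be checked with a short calculation. In the Python sketch below the delay values are those given in the analysis of simple rings, while the period of the plain 3-stage ring is an assumption: it is taken as six reverse latencies of tc + tN each (the bubble-limited case). With that assumption the result coincides with the 15.8 ns quoted in the summary. The function names are hypothetical.

```python
T_C, T_N = 1.1, 0.6  # C-element and NOR-gate delays in ns


def mas_period_increase(tc: float = T_C, tn: float = T_N) -> float:
    """Extra delay from the switch and full-adder acknowledge nodes."""
    switch_nodes = 2 * (tc + tn)  # two extra switch acknowledge nodes
    adder_nodes = 2 * tc          # two extra full-adder acknowledge nodes
    return switch_nodes + adder_nodes


def mas_period(base_period_ns: float) -> float:
    """Period of the MAS ring given the period of the plain ring."""
    return base_period_ns + mas_period_increase()


# Assumed period of a bubble-limited 3-stage ring: 6 * (tc + tN).
assumed_base = 6 * (T_C + T_N)
print(round(mas_period_increase(), 1), "ns increase")
print(round(mas_period(assumed_base), 1), "ns period")
```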
The critical path
The critical path in the design is found in a small control block that generates the two signals, Ctl1 and Ctl2, that control the switches in the MAS-block (figure 9), in the PISO (figure 4) and in the control unit itself. The control signals make one four-phase handshake cycle per iteration step.
The control unit is based on the same circuit elements that are used in the PISO and in the MAS-block (switches, latches and function blocks) and its structure is also based on connected pipelines and rings. By virtue of the switch control signal Ctl1 the control block contains some 3-stage rings. An equivalent circuit diagram from which the critical path can be computed is shown in figure 11(a). The corresponding signal transition graph is shown in figure 11(b).
In comparison with the results of the previous sections, extra delay is associated with forking (or broadcasting) of the control signal Ctl1 to the many switches. This broadcasting requires buffering of the control signal, and the depth of the C-element tree that combines all the acknowledge signals becomes significant. This circuitry is represented by two "fork" nodes in the graph representing C-element trees with a depth of 4. From figure 11(b) we can compute the cycle time of the vector multiplier. There are two cycles in the graph with the same worst case cycle time:
Summary
The results of the previous sections are summarized in table 2. The only difference between (2) and (3) in table 2 is that in (3) broadcasting (i.e. forking) of the switch control signals has been taken into account. Broadcasting of signals is very expensive, primarily because the many acknowledge signals have to be combined into a single acknowledge signal using a tree of C-elements (section 4.4), but in some cases the buffers that are needed to drive the large capacitive load of the many inputs can also enter the critical path.
In general, broadcasting should therefore be avoided. A possible redesign that could reduce the cycle time would be to distribute the control signal to the switches in the RING in the same way as in the PISO (section 3.3). This would cause some minor skewing of the operations taking place in the bit-slice rings and a corresponding overhead at start up and termination.
In such a redesign the switch control signal would have to bubble through a latch in each of the 10 bit-slices from which the ring is composed, and with a delay of tc = 1.1 ns in each this would take 11.0 ns.
With this investment per multiplication the cycle time of the iterations could be reduced from 30.2 ns to 15.8 ns. A design incorporating these modifications is currently in fabrication.
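The trade-off can be made concrete with a small calculation. The Python sketch below uses the figures quoted above (a one-off skew of 10 × 1.1 ns and cycle times of 30.2 ns and 15.8 ns); the number of iteration steps per multiplication is left as a parameter because it depends on the operand width, and the function names are hypothetical.

```python
import math

SKEW_NS = 10 * 1.1                # control signal through 10 bit-slice latches
OLD_CYCLE_NS, NEW_CYCLE_NS = 30.2, 15.8


def total_time_ns(iterations: int, cycle_ns: float,
                  startup_ns: float = 0.0) -> float:
    """Time for a number of iteration steps plus a one-off start-up cost."""
    return startup_ns + iterations * cycle_ns


def break_even_iterations(old_cycle_ns: float, new_cycle_ns: float,
                          startup_ns: float) -> int:
    """Smallest number of iterations for which the redesign is faster."""
    if new_cycle_ns >= old_cycle_ns:
        raise ValueError("the new cycle time must be shorter than the old")
    return math.floor(startup_ns / (old_cycle_ns - new_cycle_ns)) + 1


print(break_even_iterations(OLD_CYCLE_NS, NEW_CYCLE_NS, SKEW_NS))
for k in (1, 4, 8):  # illustrative iteration counts
    old = total_time_ns(k, OLD_CYCLE_NS)
    new = total_time_ns(k, NEW_CYCLE_NS, SKEW_NS)
    print(k, round(old, 1), round(new, 1))
```

Because the skew is smaller than the saving per iteration step, the redesign pays off from the first iteration onwards.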
Finally we mention that the analytically determined cycle time of 30.6 ns of the existing design conforms nicely with the results obtained from a post layout simulation of the entire chip (30 ns, c.f. section 5).
Conclusion
We have described a data flow based approach to the design of delay insensitive circuits. The underlying structural concept is simple: pipelines and rings. These can be combined into more complex structures, called multi-rings, by joining and forking of signals.
Such multi-rings can be implemented from a small set of building blocks, consisting of latches and a variety of combinational circuits: function blocks, switches etc.
One of the nice features of delay insensitive multi-rings is that they make it relatively simple to analyze and optimize the performance of a design. This has been illustrated with examples taken from a fully developed design of a delay insensitive VLSI chip for computing the inner product of two vectors.
