Architecture design of a fully asynchronous VLSI chip for DSP custom applications by Fan, Xingcha & Bergmann, Neil W
ARCHITECTURE DESIGN OF A FULLY ASYNCHRONOUS 
VLSI CHIP FOR DSP CUSTOM APPLICATIONS 
Xingcha Fan and Neil Bergmann 
CSIRO/Flinders Joint Research Center in Information Technology 
School of Information Science and Technology 
Flinders University 
GPO Box 2100, Adelaide 5001, Australia 
ABSTRACT 
A fully asynchronous, distributed VLSI architecture is 
introduced for dedicated real-time Digital Signal Processing 
(DSP) applications. The architecture is based on a data- 
driven computing model to allow maximum exploitation of 
the fine-grained concurrency. An asynchronous, self-timed 
signaling protocol is used in the architecture to naturally
match data-driven computing and circumvent the clock skew
problem. After a brief description of the architecture, key
issues of the architecture, such as the interconnection
network, data identification, and operand matching are
discussed. Finally, advantages of the architecture and future
work are outlined.
1. INTRODUCTION 
Advances in modern signal processing technology de- 
pend critically on the device and architecture innovations 
of computing hardware. Due to the severe system control 
overhead, general-purpose computers cannot offer satisfactory
processing speed for most real-time digital signal
processing (DSP) applications, and they are often an overkill
in price and programmability for specific DSP applications.
Therefore, application specific integrated circuits (ASICs) or
dedicated VLSI implementations of DSP algorithms have 
been considered as the most appealing alternative when de- 
sign costs allow. 
The architectures of the most commonly used ASICs or 
dedicated DSP processors are either systolic array
architectures [1], or centrally controlled multi-functional unit
architectures [2], [3]. The locally interconnected systolic arrays
maximize the strength of VLSI in terms of intensive and
pipelined computing and yet circumvent its main limitation
on communication. Their massive concurrency, derived from
pipeline processing and/or parallel processing, makes
systolic arrays very successful for those DSP applications
with highly regular algorithms, but the strict regularity re- 
quirement for the algorithms limits their applications. 
The centrally controlled multi-functional unit architec- 
ture, on the other hand, has been used to cover a much wider 
range of DSP applications with medium to high performance 
requirements. The common feature of this architecture is 
the utilization of multiple functional units to exploit con- 
currency by supporting pipeline processing and/or parallel 
processing. The interconnections between functional units 
are through buses and multiplexing. Resources are shared 
among operations, hence leading to an efficient hardware 
utilization. All the operations and data transfers are con- 
trolled in a predefined schedule by a central controller, which 
is usually a microcode sequencer. The control-flow comput- 
ing model of this architecture limits the exploitation of the 
fine-grained parallelism in the algorithms. 
Conventionally, these architectures both use the syn- 
chronous design methodology, i.e. the system is synchro- 
nized by a global clock signal. This becomes a limiting 
factor for VLSI system performance with the continuous 
growth of chip size and scaling down of the IC process tech- 
nology, because of the difficulty of clock signal distribution 
and clock skew [4]. The synchronous design methodology 
also requires a large effort for the chip level layout design 
and timing simulation, and can limit the board level extensi- 
bility of the system because of the global timing constraint. 
To circumvent these disadvantages, we introduce a fully 
asynchronous, self-timed distributed VLSI architecture for 
dedicated real-time DSP applications, especially those with 
irregular (e.g. data-dependent) algorithms.
Asynchronous, self-timed techniques have been attracting
a lot of attention in VLSI design in recent years. Several
publications [4], [5], [6] about the design theory and
automatic synthesis of self-timed logic circuits have appeared.
Due to the handshaking character of self-timed signaling [4],
self-timed circuits are inherently slower than their
synchronous counterparts at the gate level, and are usually
more complex because of the extra circuitry needed for the
self-timed signaling protocol. But self-timed signaling has
the advantage of its event-driven character. Therefore, we
believe our efforts to apply self-timed techniques to VLSI
system level design will allow us to take full advantage of
its potential, and make it a more competitive technique for
VLSI design.
Asynchronous, self-timed signaling has been introduced
into VLSI system level design by several researchers [7], [8],
[9]. Sutherland’s “Micropipeline” [8] concept is committed
to those applications with data flow and control flow
which are simple and have no feedback paths. On the
other hand, Martin’s asynchronous microprocessor [9] and
the asynchronous DSP processor in [7] are both based on
the conventional centrally controlled control-flow computing
model. In contrast to previous research, the architecture we
introduce here is a data-driven, distributed controlled
multi-functional unit architecture.
The purpose of this paper is to introduce such an
architecture, and briefly discuss some key issues of the
architecture and their implementations. The synthesis of VLSI
chips based on this architecture from high-level descriptions
is outside the scope of this paper. The paper is organized
as follows. First, we present a general description of the
architecture. Then we discuss the implementation of the
interconnection network by using a multi-bus structure.
Following a discussion of the data identification scheme, the
matching block and I/O blocks will be briefly discussed.
Finally, some conclusions and future work will be outlined.
0-7803-0593-0/92 $3.00 1992 IEEE
2. ARCHITECTURE DESCRIPTION
The architecture is based on the data-driven computing
model, instead of the conventional control-flow computing
model. As a special case of dataflow computing,
data-driven computing allows the maximum exploitation of the
fine-grained concurrency inherent in an algorithm. A four- 
cycle self-timed signaling protocol [4] is enforced through- 
out the system to naturally match the data-driven com- 
puting. The DSP algorithms are represented by data flow 
graphs (DFG’s) [1], in which nodes model computation (or
logical operations), and arcs model communication.
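
As an illustration of the DFG representation, the sketch below encodes a complex multiplication (a + jb)(c + jd) as nodes and arcs and evaluates it in a data-driven order. The node names, the `DFG` dictionary layout and the `evaluate` helper are assumptions made for this example only; they are not part of the architecture description.

```python
import operator

# Hypothetical DFG for (a + jb) * (c + jd): each node names its operation
# and the arcs (operand names) that feed it.
DFG = {
    "m1": ("mul", ["a", "c"]),
    "m2": ("mul", ["b", "d"]),
    "m3": ("mul", ["a", "d"]),
    "m4": ("mul", ["b", "c"]),
    "re": ("sub", ["m1", "m2"]),
    "im": ("add", ["m3", "m4"]),
}
OPS = {"mul": operator.mul, "add": operator.add, "sub": operator.sub}


def evaluate(dfg, inputs):
    """Fire every node whose operands are available, wave by wave.

    Returns the computed values and the list of waves; nodes in the same
    wave have no mutual dependence and could execute concurrently.
    """
    values = dict(inputs)
    pending = dict(dfg)
    waves = []
    while pending:
        ready = [n for n, (_, args) in pending.items()
                 if all(a in values for a in args)]
        if not ready:
            raise ValueError("unsatisfied operand or cyclic graph")
        results = {n: OPS[pending[n][0]](*(values[a] for a in pending[n][1]))
                   for n in ready}
        values.update(results)
        for n in ready:
            del pending[n]
        waves.append(sorted(ready))
    return values, waves


values, waves = evaluate(DFG, {"a": 1, "b": 2, "c": 3, "d": 4})
print(values["re"], values["im"])
print(waves)
```

The four multiplications form one wave and the two additive operations form the next, which is the fine-grained concurrency the data-driven model is meant to expose.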
Figure 1: An asynchronous, distributed VLSI architecture 
The architecture is shown in Fig. 1. It is composed of
multiple self-timed functional units (FU’s), such as adders,
multipliers, ALU’s, and even some complex computation 
blocks. For example, a complex multiplication block can be 
designed and included as a functional unit for a FFT chip. 
These functional units are interconnected through an inter- 
connection network which allows the concurrent transfer of 
multiple data items, or tokens. External communications 
are through the Input and Output blocks which are also 
connected to the interconnection network. 
A functional unit can be shared by logical operations
with the same function, which trades system throughput
for a smaller chip size.
The execution of functional units, and the communica- 
tion among them are controlled and coordinated by a decen- 
tralized, or distributed control scheme. A functional unit is 
fired in a data-driven manner, i.e. it will start the execution 
of a logical operation mapped to it as soon as all of the in- 
puts for that operation are ready and the FU is free. The re- 
sult of the execution together with a tag which identifies the 
data is sent out as a token. This token is self-routed through 
the interconnection network to the functional unit(s) that 
will consume it. The tag is generated and attached locally by
each functional unit. Therefore, the central controller, or
control path, of the conventional VLSI processor architecture is
eliminated.
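
The firing rule and the effect of functional unit sharing can be sketched with a small timing model. The code below is only an illustration under stated assumptions: the latencies are fixed, hypothetical values (real self-timed units have data-dependent delays), bus transfer time is ignored, and the `simulate` function and the unit names `MUL`, `MUL1`, `MUL2` and `ALU` are invented for the example.

```python
def simulate(dfg, mapping, latency, inputs):
    """Data-driven firing: an operation starts as soon as all of its
    operands are ready and the functional unit it is mapped to is free.

    dfg:     operation -> list of operand names
    mapping: operation -> functional unit name
    latency: functional unit name -> execution time (assumed fixed here)
    Returns operation -> (start, finish).
    """
    ready_at = {name: 0 for name in inputs}   # when each data item exists
    fu_free_at = {fu: 0 for fu in latency}    # when each FU becomes free
    schedule = {}
    pending = list(dfg)

    def start_time(op):
        return max([fu_free_at[mapping[op]]] + [ready_at[a] for a in dfg[op]])

    while pending:
        fireable = [op for op in pending
                    if all(a in ready_at for a in dfg[op])]
        if not fireable:
            raise ValueError("unsatisfied operand or cyclic graph")
        op = min(fireable, key=start_time)    # earliest possible firing
        start = start_time(op)
        finish = start + latency[mapping[op]]
        schedule[op] = (start, finish)
        ready_at[op] = finish
        fu_free_at[mapping[op]] = finish
        pending.remove(op)
    return schedule


def makespan(schedule):
    return max(finish for _, finish in schedule.values())


# Complex multiplication: four multiplies, one subtract, one add.
DFG = {"m1": ["a", "c"], "m2": ["b", "d"], "m3": ["a", "d"],
       "m4": ["b", "c"], "re": ["m1", "m2"], "im": ["m3", "m4"]}
INPUTS = ["a", "b", "c", "d"]

shared = simulate(
    DFG,
    {"m1": "MUL", "m2": "MUL", "m3": "MUL", "m4": "MUL",
     "re": "ALU", "im": "ALU"},
    {"MUL": 2, "ALU": 1}, INPUTS)
dual = simulate(
    DFG,
    {"m1": "MUL1", "m3": "MUL1", "m2": "MUL2", "m4": "MUL2",
     "re": "ALU", "im": "ALU"},
    {"MUL1": 2, "MUL2": 2, "ALU": 1}, INPUTS)
print(makespan(shared), makespan(dual))
```

With a single shared multiplier the whole graph finishes at time 9 in this model, against 5 with two multipliers, which is the throughput versus chip size trade-off of sharing. In the shared case the subtraction fires at time 4, as soon as its two operands exist, without waiting for the remaining multiplications.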
Because of functional unit sharing, the operands of log- 
ical operations shared on a functional unit may overtake 
each other and arrive at the two input ports of the func- 
tional unit in a different sequence due to the irregularity of 
the algorithm, and the data-dependent computation delay 
of self-timed functional units. Therefore, all the operand 
pairs are matched before being forwarded to an FU for ex- 
ecution through a local matching block. A matching block 
also produces tags for each logical operation result. 
3. INTERCONNECTION NETWORK: MULTI-BUS
STRUCTURE 
VLSI systems are wire limited. With the continuous 
scaling down of IC process technology, the area occupied by 
interconnections becomes more significant relative to that 
of functional units, and the communication delays become 
a limiting factor of the system performance. Therefore, 
the structure and the implementation of the interconnection
network in our architecture are very important to the
performance and the size of the synthesised chips. Two
performance metrics of an interconnection network are network
latency and network throughput. 
In recent years, regular, direct communication networks 
such as the hypercube [11], or dynamic multistage switching
networks [10] have been commonly used to interconnect
highly concurrent computers. The advantages of these com- 
munication networks are higher throughput, which allows 
many packets to travel concurrently through the network, 
and their extensibility, which allows the system intercon- 
nected by these networks to be extended arbitrarily. 
On the other hand, the bus structure has often been 
criticized for its data transfer ability since it is limited to a 
single transaction at once. More importantly, it is limited in 
extensibility because of the large capacitance of long
interconnection wires, especially off-chip interconnection wires.
Nonetheless, the bus has the advantages of being simple and
wire-efficient.
At this stage, our architecture is aimed at dedicated
DSP applications which can be implemented with about ten
or fewer functional units on a single chip. For this special
case, we believe the advantages of hypercubes
or multistage switching networks cannot be fully exploited.
Although the simulation data have yet to be obtained, we
believe the data transfer latency of such networks would
be larger than that of a bus, because with relatively low
numbers of functional units, the delay of on-chip buses is
not so significant, but significant switching delay must be
counted for self-routing communication networks. Such a
network also occupies larger chip area than the buses due to
the wires and the switching circuitry utilized. Therefore,
for the current version of the architecture, we choose the
multi-bus structure for on-chip interconnection. In a
future upgrade of the architecture, in which a large number of
functional units are implemented, pipelined buses or the
high-dimension communication networks discussed above will be
considered.
Figure 2: The architecture with multi-bus interconnection
With the multi-bus interconnection structure, the current
version of our architecture is as shown in Fig. 2. Each
bus transfers the tokens in parallel. A token is delivered to
its destination(s) through proper decoding of its tag. The
number of buses used in the system is decided by the cost
constraints and the performance requirements of the specific 
applications. The output of a functional unit is connected 
to only one bus, but the inputs of the functional unit can be 
obtained from different buses for different logical operations 
through proper connection and decoding. Buses are shared 
by functional units to reduce the number of buses required. 
Because of the distributed character of the architecture, 
the functional units operate independently of each other. 
Operations and bus transfers are not assigned to any spe- 
cific times, as in a synchronous architecture. Therefore, an 
arbiter is necessary for a bus whenever more than one FU 
will possibly request the bus concurrently. The function of 
the bus arbiter is to ensure that at most one bus request is
served at any time, i.e. only one data item can be loaded onto
the bus at any time. Although a distributed bus arbiter is
desired to match the distributed architecture, a fair arbiter 
constructed on a simple interlock circuit [4] is currently im- 
plemented for its simplicity and speed advantages [13]. 
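
The behaviour required of the arbiter can be sketched in software as follows. This is a behavioural illustration only: the arbiter described above is built on an interlock circuit, whereas the round-robin policy used here is an assumed stand-in for "fair", and the class and method names are hypothetical.

```python
class RoundRobinArbiter:
    """Grants the bus to at most one requester at a time.

    Fairness is modelled by rotating priority: the requester granted
    most recently gets the lowest priority on the next grant.
    """

    def __init__(self, num_requesters):
        self.n = num_requesters
        self.last = num_requesters - 1   # requester 0 has priority first
        self.owner = None                # current bus owner, if any

    def grant(self, requests):
        """requests: set of requester indices currently asking for the bus.
        Returns the granted index, or None if the bus is busy or idle."""
        if self.owner is not None:
            return None                  # bus busy: nobody else is served
        for k in range(1, self.n + 1):
            candidate = (self.last + k) % self.n
            if candidate in requests:
                self.owner = candidate
                self.last = candidate
                return candidate
        return None

    def release(self):
        """Called when the current bus service has finished."""
        self.owner = None


arb = RoundRobinArbiter(3)
first = arb.grant({0, 2})     # FU 0 and FU 2 request concurrently
blocked = arb.grant({2})      # bus still owned by the first winner
arb.release()
second = arb.grant({0, 2})    # priority has rotated past FU 0
print(first, blocked, second)
```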
The self-timed buses in an asynchronous architecture can 
be utilized more efficiently than their counterparts in syn- 
chronous architecture. In a synchronous architecture, bus 
transfers are usually carried out only in a specific phase of
the clock signal [2], [12], thus if a bus request is delayed
due to bus congestion, it must be delayed to the next clock
cycle. This will significantly affect the system performance,
especially when such a request is a critical one and a long 
clock cycle is used. For the self-timed buses in an asyn- 
chronous architecture, if a bus request is delayed because of 
congestion, it will be served as soon as the bus finishes the 
current bus service. 
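
The difference can be made concrete with a simple timing model. The sketch below assumes, for illustration only, that a synchronous bus starts transfers exactly on clock edges at multiples of the clock period; the function names and the numbers in the example are hypothetical.

```python
def sync_service_time(request_time, bus_free_time, clock_period):
    """Synchronous bus: the transfer starts at the first clock edge at or
    after both the request and the end of the current bus service."""
    earliest = max(request_time, bus_free_time)
    cycles = -(-earliest // clock_period)      # integer ceiling division
    return cycles * clock_period


def self_timed_service_time(request_time, bus_free_time):
    """Self-timed bus: the request is served as soon as the bus is free."""
    return max(request_time, bus_free_time)


# A request raised at t = 12 while the bus is busy until t = 23,
# with an assumed clock period of 20 time units.
print(sync_service_time(12, 23, 20))
print(self_timed_service_time(12, 23))
```

In this model the congested request waits until t = 40 on the synchronous bus but is served at t = 23 on the self-timed bus, and the gap grows with the clock period.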
The buses in this architecture do not need to be all 
global. Each bus can be made as short as necessary to 
cover the functional units it connects, reducing its capaci- 
tance and improving its speed. 
Two communication styles can be supported by self- 
timed bus structure. In one style, a functional unit sends 
out its output token through a bus without taking care of
the state of the functional unit(s) which will receive the
token. In the other style, a functional unit sends out its output
token only when the functional unit(s) which will receive
the token are ready to do so. In our architecture, the former
is used because it is faster and simpler. A potential problem
of this communication style is that it may cause deadlock
of the system. For example, a token which is waiting to be
accepted on a bus may block other bus transfers, and hence
may put the system into a deadlock state. To avoid deadlock,
it is desirable that each token on a bus will always be
accepted instantly by its destination(s) by offering enough
storage space to accommodate it. This will be discussed
later in this paper.
The authors have designed a self-timed bus with a four-cycle
self-timed, bundled data communication scheme. This
bus is a so-called indirect-transfer bus, on which a bus transfer
is completed by two separate self-timed communications,
i.e. a handshaking communication from the sending FU to
the bus, and a handshaking communication from the bus to
the receiving FU(s). A functional unit that gets the grant
from the bus arbiter puts a token onto the bus. The FU
is then released after the bus has received the token. Then
the bus requests the receiving functional unit(s) through
the decoding of the bus lines which carry the tag. After all
the receiving functional units have received the token, the
bus is released and precharged. By dividing the bus transfer
into two separate procedures, more overlapping between
the operations of sending and receiving functional units is
allowed, hence hardware utilization and system throughput
are improved.
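
The sequence of events in one indirect transfer can be listed with a short sketch. It only reproduces the ordering described above, using the usual four-cycle (return-to-zero) handshake of request up, acknowledge up, request down, acknowledge down; the signal names and the `four_cycle` and `indirect_transfer` helpers are assumptions for illustration, not the actual bus circuit.

```python
def four_cycle(sender, receivers):
    """Event order of one four-cycle handshake. With bundled data the
    data lines are assumed valid before the request is raised."""
    events = [f"{sender}: req+"]
    events += [f"{r}: ack+" for r in receivers]   # data latched
    events.append(f"{sender}: req-")
    events += [f"{r}: ack-" for r in receivers]   # return to idle
    return events


def indirect_transfer(source, destinations):
    """One bus transfer as two separate handshakes: source FU to bus,
    then bus to every receiving FU selected by decoding the tag."""
    events = [f"arbiter: grant to {source}"]
    events += four_cycle(source, ["bus"])
    events.append(f"{source}: released")          # sender may continue
    events += four_cycle("bus", destinations)
    events.append("bus: released and precharged")
    return events


trace = indirect_transfer("FU1", ["FU2", "FU3"])
for event in trace:
    print(event)
```

The sending unit is released before any receiver acknowledges, which is where the extra overlap between sending and receiving units comes from.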
This indirect-transfer self-timed bus is estimated to have 
the same drive capability as a typical synchronous precharged 
bus [12]. The details of our self-timed bus are to be discussed 
in a forthcoming paper [13].
4. DATA IDENTIFICATION AND TOKEN STRUC- 
TURE 
The synchronization of logical operations in a data-driven 
computing system is a difficult task. Because of functional 
unit sharing and communication path sharing, it is necessary
to identify the logical operations shared on the same
functional unit, and identify data in the system, so that
a data item can be sent to the correct functional unit, and
matched with another operand for the correct execution of a
specific logical operation. This identification is represented
as a tag which is bound to the data.
There are two approaches to naming data items in the 
DFG. One is by identifying the parent of the data, i.e. by the
logical operation which produces the data, while the other
is by identifying the child of the data, i.e. by the logical
operation(s) which will consume the data. If an algorithm
contains $N_{op}$ logical operations, and we assume each
logical operation has two inputs, then the parent identification
scheme has $N_{op}$ different data items to identify, but the child
identification scheme has at least $2N_{op}$ different items to identify.
In our architecture, it is desired that an output of a func- 
tional unit is transferred only once, no matter how many 
children it has, i.e. one-to-multiple destination bus trans- 
fer may be needed. With the parent identification scheme, 
this is easily done by a proper decoding scheme, but this 
is difficult with the child identification scheme. Therefore, 
the parent identification is chosen to identify the output 
data of every functional unit in the current version of our 
architecture. 
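
The two schemes can be compared numerically with a short sketch. It assumes binary-encoded tags, so that distinguishing n items takes ceil(log2(n)) bits, and it models one-to-multiple delivery as every register decoder comparing the tag on the bus with the parent tag it was built to accept. The register names, the operation count of 20 and the helper functions are hypothetical.

```python
def tag_bits(num_items):
    """Bits needed by a binary tag that distinguishes num_items items,
    i.e. ceil(log2(num_items)) computed with integer arithmetic."""
    return (num_items - 1).bit_length()


def identification_cost(n_ops, inputs_per_op=2):
    """Number of items to identify and tag width for both schemes.
    The child count is the lower bound stated in the text."""
    parent_items = n_ops
    child_items = inputs_per_op * n_ops
    return {
        "parent": (parent_items, tag_bits(parent_items)),
        "child": (child_items, tag_bits(child_items)),
    }


# Hypothetical decoders: register -> parent tag it accepts.
DECODERS = {
    "FU2.RF-A[op5]": "op1",
    "FU3.RF-B[op7]": "op1",
    "FU3.RF-A[op7]": "op4",
}


def deliver(tag):
    """All registers that load a token carrying this parent tag; a single
    bus transfer reaches every one of them."""
    return sorted(reg for reg, accepted in DECODERS.items() if accepted == tag)


print(identification_cost(20))
print(deliver("op1"))
```

For 20 operations the parent scheme needs 5 tag bits against 6 for the child scheme, and the result of `op1` reaches both of its consumers in one transfer.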
Because the logical operations shared on a functional 
unit are specified in the design, a tag which includes the 
parent identification of a data is sufficient for uniquely defin- 
ing that data, its destination in bus transfer and the logical 
operation(s) it belongs to. Therefore, a token can be composed
of $W_d$ bits of data and $W_t$ bits of tag. The width of
the data is not fixed in this architecture description. It can
vary between different applications, with a practical choice
from 12 bits to 16 bits for most real-time DSP applications.
$W_t$ equals $\lceil \log_2 N_{op} \rceil$.
5. LOCAL MATCHING BLOCK
Operand matching to synchronize logical operations is
an inherent problem of dataflow computing. As we mentioned
above, because of functional unit sharing and the
irregularity of the algorithms, it is possible for the operands
of logical operations to arrive at the two input ports of the
functional unit they share in different sequences. Operand
matching is carried out for every functional unit to ensure
that the functional unit takes the correct or matched pairs of
operands to execute. The local matching block also generates
the tag for the output token. A fast and simple matching
block is strongly desired for our architecture to suit the
real-time applications.
Figure 3: The matching block 
A matching block is shown in Fig. 3. Each input port
of a functional unit has a register file, i.e. RF-A for port
A and RF-B for port B. To ensure that every token on
the bus can be accepted instantly to avoid the potential of
system deadlock, a register is offered for the operands of each
logical operation in both register files. To reduce the size
of register files, two registers in the same register file can
be merged if the two data items they accommodate have a direct
or indirect dependence relation, i.e. one of them cannot
exist without the consumption of the other.
To allow the concurrent loading of the registers in a reg- 
ister file, each register is connected to an appropriate bus, or 
buses for the shared register, through a simple decoder. The
decoder decides which tokens on the bus should be loaded
into the register based on the tag currently on the bus. 
The decoder also decodes the tag of the incoming token 
to give the identification of the logical operation which will 
consume it, to allow the correct matching of the operand 
pair. This is especially important for the matching of data 
in a shared register. After the operands have been matched 
and forwarded for execution, this identification can be used 
directly as the tag for the output token. 
In Fig.3, a token in one register file only needs to match 
its partner in a specific register, or several specific registers
in the opposite register file, hence fast matching can be
expected. The matching block in Fig. 3 is also convenient for
implementing the priority scheme between operations. 
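
A behavioural sketch of such a matching block is given below. It assumes one register per logical operation in each register file (no register merging), models a decoder as a lookup from the incoming parent tag to the registers that accept it, and omits the priority scheme. The class name, the operation names `s1` and `s2` and the tags `x`, `y`, `z` are hypothetical.

```python
class MatchingBlock:
    """Operand matching for one shared two-input functional unit."""

    def __init__(self, operations):
        """operations: op_id -> (parent tag for port A, parent tag for port B)."""
        self.rf_a = {op: None for op in operations}   # register file RF-A
        self.rf_b = {op: None for op in operations}   # register file RF-B
        # Decoders: incoming tag -> (register file, operation) pairs.
        self.decode = {}
        for op, (tag_a, tag_b) in operations.items():
            self.decode.setdefault(tag_a, []).append((self.rf_a, op))
            self.decode.setdefault(tag_b, []).append((self.rf_b, op))

    def accept(self, tag, value):
        """Load a token from the bus. Returns the matched operand pairs as
        (op_id, a, b); op_id then serves as the tag of the output token."""
        matched = []
        for register_file, op in self.decode.get(tag, []):
            register_file[op] = value
            if self.rf_a[op] is not None and self.rf_b[op] is not None:
                matched.append((op, self.rf_a[op], self.rf_b[op]))
                self.rf_a[op] = None                   # operands consumed
                self.rf_b[op] = None
        return matched


# A shared adder executing s1 = x + y and s2 = y + z.
block = MatchingBlock({"s1": ("x", "y"), "s2": ("y", "z")})
print(block.accept("z", 3))   # no partner yet
print(block.accept("y", 2))   # completes s2 although s1 comes first
print(block.accept("x", 1))   # completes s1
```

Each token is compared only against the register of the same operation in the opposite file, so tokens that arrive out of sequence are still paired correctly.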
6. INPUT AND OUTPUT BLOCK 
The function of an input block is to buffer and tag the in- 
put data. An input FIFO buffer is usually needed to smooth 
out density fluctuations in the data flow, even though a real- 
time DSP chip is usually designed to process a continuous 
input data stream at a rate equal to or faster than the rate
the data are fed in. If blocks of data are transferred between
chips, then buffer memories may be needed to store the data 
block. 
In addition to the buffering function, an input block also 
tags the input data with appropriate identification. We call 
this the input operation. If an input port is shared for several 
inputs, then data are tagged based on the sequence of data 
entering the input block. The tagged input data tokens can 
be sent to different buses. The number of input operations 
should be counted into $N_{op}$.
Another important function of input blocks is to control 
entry of the data in the buffer into the system in a pipelined 
implementation of the architecture, so that only the desired 
number of continuous input data sets can enter and concurrently
stay in the system, to make the system work most
efficiently. This is done by a token control scheme.
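
The token control scheme is not detailed here, so the sketch below shows only one plausible reading: a credit counter that admits a new input data set only while fewer than a chosen number of sets are in the system, combined with tagging by arrival sequence for a shared input port. The class, its methods and the tags `a` and `b` are assumptions made for illustration.

```python
from collections import deque


class InputBlock:
    """Buffers input samples, tags them by sequence, and limits how many
    data sets are concurrently in the system."""

    def __init__(self, tags, max_sets_in_flight):
        self.tags = list(tags)             # tag of each item within a set
        self.credits = max_sets_in_flight  # sets that may still enter
        self.fifo = deque()                # input FIFO buffer

    def push(self, sample):
        """Store one incoming sample in the FIFO."""
        self.fifo.append(sample)

    def admit(self):
        """Release one tagged data set if a full set is buffered and a
        credit is available; otherwise release nothing."""
        if self.credits == 0 or len(self.fifo) < len(self.tags):
            return []
        self.credits -= 1
        return [(tag, self.fifo.popleft()) for tag in self.tags]

    def set_completed(self):
        """Called when a data set has left the system; frees one credit."""
        self.credits += 1


block = InputBlock(tags=["a", "b"], max_sets_in_flight=1)
for sample in (1, 2, 3, 4):
    block.push(sample)
print(block.admit())   # first set enters
print(block.admit())   # second set is held back
block.set_completed()
print(block.admit())   # second set enters after the first has finished
```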
The output block is much simpler than the input block. 
It just de-tags the tokens to be sent off-chip and puts
them into the output FIFO buffer in an appropriate se- 
quence. 
7. CONCLUSIONS AND FUTURE WORK 
In this paper, we presented a fully asynchronous dis- 
tributed VLSI architecture based on a data-driven comput- 
ing model and a self-timed signaling protocol. This archi- 
tecture has the following advantages: 
• The asynchronous, self-timed signaling protocol circumvents
the difficult global distribution of the clock signal,
and overcomes the potential clock skew problem
in a large VLSI system design. This advantage will be
more significant with the continuous scaling down of
IC process technology and growth of chip size in the
near future.
• The data-driven computing and the distributed control
allow the maximum exploitation of the fine-grained
concurrency. The out-of-order execution of
logical operations gives the architecture more efficient
hardware utilization and higher throughput than the
conventional control-flow synchronous VLSI processor
architecture.
• This architecture covers a wider range of applications
than the systolic array architecture, from applications
with very regular algorithms to those with irregular
algorithms.
• The self-timed design methodology allows functional
units in the system to work on an average computation
delay, instead of the worst-case computation delay in
a synchronous design. Thus, a single functional unit
may potentially have a better average performance
than its synchronous counterpart.
• The design effort of an asynchronous chip can be reduced,
because each functional unit and its matching
block can be designed and tested locally, without undue
attention to the global timing constraints.
• The asynchronous chip can be used more flexibly in a
board level design because of the very loose inter-chip
communication constraint. The system constructed
on asynchronous chips is more extendable.
The problems of the current version of the architecture 
include the inefficient utilization of local registers. At this 
stage, because register sharing is only considered within the
register files, the chance of register sharing is limited.
The direct connection of registers in a register file to the bus
may increase the loading of the bus. These problems are to 
be considered in the near future. 
The future work will be focused on the high-level syn- 
thesis of the fully asynchronous VLSI chip from a data flow 
graph (DFG) or a high level behavior description. Special
synthesis algorithms are to be developed for functional unit
sharing and logical operation assignment, especially for the
design of pipelined systems.
As far as we know, this is the first fully asynchronous, 
distributed VLSI architecture which supports data-driven 
computing, for ASICs or dedicated DSP applications. We 
anticipate that our research will lead to the wider adoption 
of asynchronous, self-timed techniques, with their inherent 
advantages, in such applications. 
8. REFERENCES 
[1] S. Y. Kung, VLSI Array Processors. Prentice Hall, 1988.
[2] B. S. Haroun and M. I. Elmasry, "Architecture synthesis for DSP
silicon compilers," IEEE Trans. Computer-Aided Design, vol. CAD-8, ...
[3] ... Reading, MA: Addison-Wesley, pp. 311 - 360, 1988.
[4] C. L. Seitz, "System timing," Chap. 7 in: C. Mead and L. Conway,
Introduction to VLSI Systems. Reading, MA: Addison-Wesley, 1980.
[5] A. J. Martin, "Programming in VLSI: From communicating
processes to delay-insensitive circuits," in Developments in Concurrency
and Communication, C. A. R. Hoare, Ed., pp. 1 - 64, Addison-Wesley,
1990.
[6] T. H.-Y. Meng, R. W. Brodersen, and D. G. Messerschmitt,
"Automatic synthesis of asynchronous circuits from high-level
specifications," IEEE Trans. Computer-Aided Design, vol. CAD-8, pp. 1185
- 1205, Nov. 1989.
[7] G. M. Jacobs and R. W. Brodersen, "A fully asynchronous digital
signal processor using self-timed circuits," IEEE J. Solid-State
Circuits, vol. 25, pp. 1526 - 1537, Dec. 1990.
[8] I. E. Sutherland, "Micropipelines," Communications of the ACM, ...
[9] ... MIT Press, 1989.
[10] X. C. Fan and N. W. Bergmann, "Design of elements for a
self-timed fast packet switch," in Proc. 1991 IEEE International
Symposium on Circuits and Systems, pp. 1025 - 1028, 1991.
[11] C. L. Seitz, "Let's route packets instead of wires," in Advanced
Research in VLSI, Proc. 6th MIT Conf., pp. 133 - 138, 1990.
[12] C. Mead and L. Conway, Introduction to VLSI Systems. Reading,
MA: Addison-Wesley, 1980.
[13] X. C. Fan and N. W. Bergmann, "Design of a self-timed bus for
a fully asynchronous VLSI DSP chip," in preparation.
