An O(n) time discrete relaxation architecture for real-time processing of the consistent labeling problem by Henderson, Thomas C. & Gu, Jun
An O(n) Time Discrete Relaxation Architecture for Real-Time Processing of the Consistent Labeling Problem
UUCS-TR-86-116
Jun Gu, Wei Wang* and Thomas C. Henderson
Department of Computer Science
* Department of Electrical Engineering
University of Utah 
Salt Lake City, UT 84112
December 1986 
Abstract
Discrete relaxation techniques have proven useful in solving a wide range of problems in digital signal and digital 
image processing, artificial intelligence, operations research, and machine vision. Much work has been devoted to 
finding efficient hardware architectures. This paper shows that a conventional hardware design for a Discrete 
Relaxation Algorithm (DRA) suffers from $O(n^2m^3)$ time complexity and $O(n^2m^2)$ space complexity. By 
reformulating DRA into a parallel computational tree and using a multiple tree-root pipelining scheme, time 
complexity is reduced to O(nm), while the space complexity is reduced by a factor of 2. For certain relaxation 
processing, the space complexity can even be decreased to O(nm). Furthermore, a technique for dynamically 
configuring an architectural wavefront is used, which leads to an O(n) time, highly configurable DRA3 architecture.
Index Terms
Algorithm-configured dynamic architectural wavefront system, associative circular pipelining, data-structured 
computer architecture, Consistent labeling problem (CLP), data-communication-intensive VLSI computation, 
Discrete Relaxation Algorithm (DRA), interleaved processing, multiprocessor architecture, recursive systolic 
computation, VLSI.
This work is supported in part by National Science Foundation Grants MCS-82-21750, DCR-85-06393, and DMC-85-02115; and
in part by a University of Utah Research Fellowship and an ACM/IEEE Scholarship.
UUCS-TR-86-008 contains a DRA2 architecture that is suitable for DRA computation only with n = m = 8. In response to inquiries 
about its extensibility to larger DRA problems, this paper, in addition to presenting a faster DRA3 architecture, 
also generalizes DRA2 to DRA problems of arbitrary size (as long as chip size permits), with descriptions that distinguish 
the object number n from the label number m throughout the paper.





1.1 Consistent Labeling Problem and Discrete Relaxation Operator
1.2 Related Work
2. Discrete Relaxation Algorithm
2.1 Boolean Formulation of Discrete Relaxation Algorithm
2.2 An Example of Region Coloring Problem
3. The Hardware Implementation Problems for DRA
3.1 Assumptions and Statements
3.2 DRA Hardware Implementation
4. A Conventional DRA1 Design and the Complexity Analysis
5. Parallel Tree-Root Pipelining and the DRA2 System
5.1 A Parallel Tree-Structured Reformulation for DRA
1. Constructing the Parallel Computation Tree
2. Speeding Up Iteration
3. Introducing a Time Dimension in the Computation
5.2 Implementation Issues for the DRA2 Chip
1. System Architecture and Block Diagram
2. Multiprocessor SIMD Array and the Cell Design
3. Circuit Features and Design Techniques
5.3 Performance Evaluation
5.4 The PPL Layout for DRA2 System
6. An O(n) Time Algorithm-Configured Dynamic Architectural Wavefront System
6.1 Complexity Issues Revisited
6.2 $C_{ij}(k,p)$-Pattern Distribution Analysis
6.3 A Dynamically Configurable DRA3 Architecture
1. The Architectural-Computational Wavefront (ACW) Notation





Figure 1: Leaf Node for Computing $l_{jp} \times \Lambda_{ij}(k,p)$
Figure 2: Modified Leaf Node Computation for $l_{jp} \times \Lambda_{ij}(k,p)$
Figure 3: A Parallel Tree for Computing the nth $l_{ik}$
Figure 4: Circuit Block Diagram for the DRA2 System 
Figure 5: Basic Principle of the Parallel DRA2 System 
Figure 6: Two Cells in Multiprocessor DRA2 Array
Figure 7: Construction of SIMD Multiprocessor Array Using Cell-A and Cell-B
Figure 8: Computational Wavefront Pipelining and Circulation for Interleaved Processing
Figure 9: Broadcasting Scheme for Signal
Figure 10: Associative Circular Pipelining
Figure 11: Self-Timed Synchronization
Figure 12: State Graph of Finite State Machine
Figure 13: The PPL Layout of the Multiprocessor SIMD DRA2 System
Figure 14: The Photomicrograph of the DRA2 Chip
Figure 15: The $C_{1j}(k,p)$-Pattern for Generating $l_{1k}$'s
Figure 16: The $C_{2j}(k,p)$-Pattern for Generating $l_{2k}$'s
Figure 17: A Circularly Skewed $C_{ij}(k,p)$-Pattern when Generating $l_{ik}$'s
Figure 18: A Candidate Design for DRA3 System
Figure 19: The DRA3 Architecture
Figure 20: A $C_{ij}(k,p)$-Pattern under Dynamic Reconfiguration of DRA3
1. Introduction and Motivation
1.1 Consistent Labeling Problem and Discrete Relaxation Operator
The Consistent Labeling Problem (CLP) [4,5,6] has long been viewed as a computational model based on a unit constraint relation containing 2N-tuples of units and labels, which specifies which N-tuples of units are compatible with which N-tuples of labels.
A variety of problems in digital signal and digital image processing, artificial intelligence, operations research, symbolic logic, computer graphics, robot vision, and robot manipulation are special cases of CLP. Some problems that are naturally viewed in this way are finding N-ary relations such as transitive closure [24], detection of graph and subgraph isomorphism [3,18], database consistency maintenance, query answering and redundancy checking [30], the relational homomorphism problem [4], the constraint satisfaction problem [23,25], the automata homomorphism problem [17], finding spanning trees and Euler tours in a graph [32], the graph coloring problem [18], the packing problem [22], scene and edge labeling problems [4], the scene and shape matching problem [24], many puzzles such as the Latin square puzzle [4,28] and the cryptarithmetic puzzle [31], determining satisfiability of propositional logic statements and theorem proving [26], the space planning problem [29], and finding Hamiltonians in a graph [24].
One general technique for finding a consistent labeling is depth-first search, but it suffers from thrashing [4]. The search procedure fixes labels to units as long as it can find a label for each new unit that is compatible, according to the constraint relation, with the labels already fixed to previous units. Whenever the procedure cannot find a label for a new unit, it backtracks to the previous unit and tries to find a different label for that unit. If the procedure finds a label for all units, it has found a consistent labeling. If the procedure backs up all the way to the first unit without finding any consistent labeling and there are no more possible labels for the first unit, the procedure fails and no consistent labeling exists. Montanari [33] pointed out that this problem is NP-complete.
Arc-consistency algorithms filter the problem variable domains to eliminate unacceptable candidate values, while path-consistency algorithms filter problem constraints to eliminate unacceptable tuples. For example, the Waltz filtering algorithm reduces any binary compatibility relation to its maximal arc-consistent subset [35]. Ullman [3] applies the Waltz algorithm to the subgraph isomorphism problem. Montanari [33] discusses various aspects of the constrained labeling problem for binary relations. The Discrete Relaxation Algorithm (DRA), described in [9,21], is a look-ahead operator that runs in polynomial time and accelerates the tree search required to find the consistent labelings. Henderson [9] also noted that DRA is a restriction of the classical relaxation techniques to systems of Boolean inequalities which take values over the two-element set {0,1}.
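The depth-first search procedure described in this subsection can be sketched as follows. This is a minimal illustration, not the formulation of [4]: the function names `backtrack` and `differ` and the pairwise `compatible` test are assumptions introduced here, and only binary constraints are modeled.

```python
def backtrack(units, labels, compatible, assignment=None):
    """Depth-first search for one consistent labeling.

    compatible(u1, l1, u2, l2) is a hypothetical binary constraint test;
    units are labeled in the order given.
    """
    if assignment is None:
        assignment = {}
    if len(assignment) == len(units):
        return dict(assignment)          # every unit has a label
    unit = units[len(assignment)]        # next unit to label
    for label in labels:
        # Keep the label only if it agrees with all labels fixed so far.
        if all(compatible(unit, label, u, l) for u, l in assignment.items()):
            assignment[unit] = label
            result = backtrack(units, labels, compatible, assignment)
            if result is not None:
                return result
            del assignment[unit]         # backtrack and try another label
    return None                          # no label fits: fail at this level


def differ(unit_a, label_a, unit_b, label_b):
    """Illustrative constraint: no two units may share a label."""
    return label_a != label_b


print(backtrack([1, 2, 3], ["red", "green", "blue"], differ))
```

With three units and only two labels the same call returns `None`, which corresponds to the case where the procedure backs up to the first unit and fails.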
1.2 Related Work
Since the introduction of the discrete relaxation technique in the mid-seventies, many DRA applications have been found, and much work has been done to develop optimal DRA algorithms [8,9,21]. On the other hand, researchers have also tried to build efficient hardware architectures to support real-time DRA processing. Unfortunately, due to high computational cost, including space complexity, time complexity, and data communication cost, a conventional hardware architecture implementation is not feasible [13]. Research in this direction has been blocked, and results have appeared only in the form of a virtual software simulator [7].
The research described in this paper concerns the hardware implementation of a discrete relaxation architecture. In Section 2 we briefly describe the Discrete Relaxation Algorithm and give an example for solving the region coloring problem. We then define the DRA hardware implementation problem in Section 3. Complexity analyses and our first design, the DRA1 chip, are presented in Section 4. In Section 5, we concentrate on the design of the parallel DRA computational tree and a tree-root pipelining scheme, and the corresponding O(nm) time SIMD multiprocessor architecture DRA2. Finally, we show that by adopting a dynamically configurable, highly parallel routing scheme on the DRA3 switch lattice, an O(n) time algorithm-configured architectural wavefront system, DRA3, is obtained.
These three architectures are designed for the same 8-object 8-label DRA problems. The first two systems are implemented using PPL (Path Programmable Logic) [36] at the University of Utah. The DRA1 chip requires two 4K memory blocks and a maximum execution time of over an hour in a 3 µm NMOS process, which makes such a hardware implementation infeasible [11,13]. The DRA2 design eliminates the excessive memory requirements and performs the DRA computation in microseconds, or in milliseconds in the worst case. This chip was fabricated using a 3 µm NMOS process by MOSIS [11,15]. DRA3 will be fabricated using a 1.0 µm GaAs technology. The clock speed of DRA3 is expected to be at least 500 MHz [12]. In this paper, we try to describe our research ideas and the architectural concepts for these DRA architectures as concisely as possible. For detailed reference see [11-15].
2. The Discrete Relaxation Algorithm (DRA)
2.1 Boolean Formulation of Discrete Relaxation Algorithm
Instead of seeking a real number solution as in numerical relaxation [19], the solution to be found in the discrete relaxation case involves the assignment of a set of labels at each unknown such that some constraint relation among the labels is satisfied by neighboring unknowns [8,9]. Whereas the unknowns in numerical relaxation take on real number values, the unknowns in a labeling problem take on a Boolean vector value, with each element in the vector corresponding to a possible label.
The generalized problem involves a set of unknowns, which usually represents a set of objects to be given names; a set of labels, which are the possible names for the unknowns; and a compatibility model containing ordered groups of units which mutually constrain one another and ordered groups of unit-label pairs which are compatible. The compatibility model is sometimes called a world model. This model tells us which objects mutually constrain one another and which labelings are permitted or legal for those objects which do constrain one another. The problem is to find a label for each object such that the resulting set of object-label pairs is consistent with the constraints of the world model.
Boolean vector operations are denoted by $'$, $\times$, $^t$, $*$, $+$ and $\cdot$, which represent complementation, vector multiplication, transpose, Boolean "and," Boolean "or," and Boolean vector dot product, respectively. The basic definitions used in formulating DRA are as follows.
Definition 1: Let $U = \{u_1,\ldots,u_n\}$ be the set of unknowns, and $\Lambda = \{\lambda_1,\ldots,\lambda_m\}$ be the set of possible labels. Then
1. $C$ is an m by m compatibility matrix for label pairs, where $C(i,j)=1$ if $\lambda_i$ is compatible with $\lambda_j$; 0 otherwise.
2. $\Lambda_i = (l_{i1},\ldots,l_{im})^t$ is the column vector describing the set of labels (i.e., zero or one) possible for $u_i$, where $l_{ij}=1$ if $\lambda_j$ is compatible with $u_i$; 0 otherwise.
3. $\Lambda_{ij} = (\Lambda_i \times \Lambda_j^t)*((Nei(i,j)'E)+C)$ is an m by m compatibility matrix for $u_i$ and $u_j$, where $E$ is the m by m matrix of all 1's, and $Nei(i,j) = 1$ if $u_i$ neighbors $u_j$; 0 otherwise. We use $\Lambda_{ij}(k)$ to denote the kth row of $\Lambda_{ij}$.
Definition 2: A labeling is a vector $L = (L_1,\ldots,L_n)^t$, where $L_i = (l_{i1},\ldots,l_{im})^t$ in $\Lambda_i$ is a Boolean vector with $l_{ik} = 1$ if label $\lambda_k$ is a possible label for object $u_i$; 0 otherwise.
Definition 3: A labeling is consistent if for every i and k,
$$l_{ik} \le \prod_{j=1}^{n}\Big[\sum_{p=1}^{m}\big(l_{jp} * \Lambda_{ij}(k,p)\big)\Big]. \qquad (1)$$
Once a formal definition of local consistency such as (1) has been given, it is easy to see that a situation very similar to classical relaxation now holds. However, instead of having to manipulate (1) into a form amenable to iterative solution, we merely note that (1) can be rewritten
$$l_{ik} \le l_{ik} * \prod_{j=1}^{n}\Big[\sum_{p=1}^{m}\big(l_{jp} * \Lambda_{ij}(k,p)\big)\Big] \qquad (2)$$
for every i and k, since the $l_{ik}$ on the right-hand side is independent of j and p. It is clear that if (2) does not hold, it can be made to hold by setting $l_{ik}$ equal to the value on the right-hand side. This is, in fact, equivalent to discarding label $\lambda_k$ for $u_i$ if at some neighbor $u_j$ there does not exist a compatible label. If the $l_{ik}$'s, k=1,m, are now gathered together in vector form:
$$\begin{pmatrix} l_{i1}\\ l_{i2}\\ \vdots\\ l_{im}\end{pmatrix} \le \begin{pmatrix} l_{i1}\\ l_{i2}\\ \vdots\\ l_{im}\end{pmatrix} * \begin{pmatrix} \sum_{p=1}^{m}(l_{1p} * \Lambda_{i1}(1,p))\\ \sum_{p=1}^{m}(l_{1p} * \Lambda_{i1}(2,p))\\ \vdots\\ \sum_{p=1}^{m}(l_{1p} * \Lambda_{i1}(m,p))\end{pmatrix} * \cdots * \begin{pmatrix} \sum_{p=1}^{m}(l_{np} * \Lambda_{in}(1,p))\\ \sum_{p=1}^{m}(l_{np} * \Lambda_{in}(2,p))\\ \vdots\\ \sum_{p=1}^{m}(l_{np} * \Lambda_{in}(m,p))\end{pmatrix}$$
Writing each inner sum as a Boolean dot product, row k of this inequality reads
$$l_{ik} \le l_{ik} * \big(L_1 \cdot \Lambda_{i1}(k)\big) * \cdots * \big(L_n \cdot \Lambda_{in}(k)\big),$$
so that
$$L_i \le L_i * P_i,$$
where the column vector:
$$P_i = \prod_{j=1}^{n}\Big(\big\{[L_j]^t\,[\Lambda_{ij}(1)\ \cdots\ \Lambda_{ij}(m)]\big\}^t\Big). \qquad (7)$$
Gathering together the $L_i$'s, i=1,n, we have
$$L \le L * P. \qquad (8)$$
This formulation emphasizes the relation to classical relaxation. The relaxation is achieved by repeating
$$L = L * P \qquad (9)$$
until L does not change value.
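The iteration of Eq. (9) can be sketched in software as a fixed-point loop. This is a minimal illustration under stated assumptions: the function name `relax` and the nested-list layout `Lam[i][j][k][p]` for the matrices are introduced here, indices are zero-based, the matrices are held fixed across iterations, and the two-object, two-label data are a made-up toy case rather than an example from the paper.

```python
def relax(L, Lam):
    """Repeat L = L * P until L stops changing.

    L[i][k] is 1 if label k is still possible for object i.
    Lam[i][j][k][p] is the (k, p) entry of the matrix for objects i and j.
    """
    n, m = len(L), len(L[0])
    L = [row[:] for row in L]            # do not modify the caller's data
    while True:
        new = []
        for i in range(n):
            row = []
            for k in range(m):
                # Product over j of the Boolean sum over p, as in Eq. (2).
                supported = all(
                    any(L[j][p] and Lam[i][j][k][p] for p in range(m))
                    for j in range(n)
                )
                row.append(1 if (L[i][k] and supported) else 0)
            new.append(row)
        if new == L:
            return L
        L = new


# Toy case (assumption): object 0 is fixed to label 0, object 1 is free,
# and the two objects must carry different labels.
Lam = [
    [[[1, 0], [0, 0]], [[0, 1], [0, 0]]],
    [[[0, 0], [1, 0]], [[1, 0], [0, 1]]],
]
toy = relax([[1, 0], [1, 1]], Lam)
print(toy)
```

Because labels are only ever discarded, the loop terminates after at most n times m discarding passes.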
2.2 An Example of Region Coloring Problem
Suppose that we are analyzing a picture of a scene, with the aim of describing it, and that we have detected a set of
objects $u_1,\ldots,u_n$ in the scene, but have not identified them unambiguously. The relationships that exist among the
objects are used to eliminate the ambiguity.
An example for eliminating the ambiguity in a region coloring problem is given here to demonstrate these ideas and 
computation procedures. For simplicity, consider the case of three regions to be colored red, green or blue with the 
constraints:
1. Region 1 must be red.
2. Region 3 must be blue, and
3. No two regions may be colored the same color.
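Thus, $u_i$ = Region i (for i=1,2,3) and:
$$U = \{u_1, u_2, u_3\}$$
$$\Lambda = \{\lambda_1, \lambda_2, \lambda_3\}$$
where $\lambda_1$ is red, $\lambda_2$ is green, and $\lambda_3$ is blue. Since region 1 must be red, we have:
$$\Lambda_1 = [1\ 0\ 0]^t$$
and since region 3 must be blue:
$$\Lambda_3 = [0\ 0\ 1]^t$$
Finally, since there is no restriction on region 2's color, we have all possibilities:
$$\Lambda_2 = [1\ 1\ 1]^t$$
Since only identical colors are incompatible, we have:
$$C = \begin{pmatrix} 0&1&1\\ 1&0&1\\ 1&1&0 \end{pmatrix} \qquad (10)$$
for different objects, and
$$C = \begin{pmatrix} 1&0&0\\ 0&1&0\\ 0&0&1 \end{pmatrix} \qquad (11)$$
for the same object. The neighbor term is
$$Nei(i,j)'E = \begin{pmatrix} Nei(i,j)' & Nei(i,j)' & Nei(i,j)'\\ Nei(i,j)' & Nei(i,j)' & Nei(i,j)'\\ Nei(i,j)' & Nei(i,j)' & Nei(i,j)' \end{pmatrix} \qquad (12)$$
where
$$Nei(i,j) = \begin{cases} 0 & \text{if Region } i \text{ does not neighbor Region } j,\\ 1 & \text{if Region } i \text{ does neighbor Region } j. \end{cases}$$
Now we can calculate $\Lambda_{ij}$ as:
$$\Lambda_{11} = ([1\ 0\ 0]^t \times [1\ 0\ 0])*((0'E)+C) = \begin{pmatrix} 1&0&0\\ 0&0&0\\ 0&0&0 \end{pmatrix} * \begin{pmatrix} 1&0&0\\ 0&1&0\\ 0&0&1 \end{pmatrix} = \begin{pmatrix} 1&0&0\\ 0&0&0\\ 0&0&0 \end{pmatrix} \qquad (13)$$
$$\Lambda_{12} = ([1\ 0\ 0]^t \times [1\ 1\ 1])*((1'E)+C) = \begin{pmatrix} 1&1&1\\ 0&0&0\\ 0&0&0 \end{pmatrix} * \begin{pmatrix} 0&1&1\\ 1&0&1\\ 1&1&0 \end{pmatrix} = \begin{pmatrix} 0&1&1\\ 0&0&0\\ 0&0&0 \end{pmatrix} \qquad (14)$$
$$\Lambda_{13} = ([1\ 0\ 0]^t \times [0\ 0\ 1])*((1'E)+C) = \begin{pmatrix} 0&0&1\\ 0&0&0\\ 0&0&0 \end{pmatrix} * \begin{pmatrix} 0&1&1\\ 1&0&1\\ 1&1&0 \end{pmatrix} = \begin{pmatrix} 0&0&1\\ 0&0&0\\ 0&0&0 \end{pmatrix} \qquad (15)$$
$$\Lambda_{21} = ([1\ 1\ 1]^t \times [1\ 0\ 0])*((1'E)+C) = \begin{pmatrix} 1&0&0\\ 1&0&0\\ 1&0&0 \end{pmatrix} * \begin{pmatrix} 0&1&1\\ 1&0&1\\ 1&1&0 \end{pmatrix} = \begin{pmatrix} 0&0&0\\ 1&0&0\\ 1&0&0 \end{pmatrix} \qquad (16)$$
$$\Lambda_{22} = ([1\ 1\ 1]^t \times [1\ 1\ 1])*((0'E)+C) = \begin{pmatrix} 1&1&1\\ 1&1&1\\ 1&1&1 \end{pmatrix} * \begin{pmatrix} 1&0&0\\ 0&1&0\\ 0&0&1 \end{pmatrix} = \begin{pmatrix} 1&0&0\\ 0&1&0\\ 0&0&1 \end{pmatrix} \qquad (17)$$
$$\Lambda_{23} = ([1\ 1\ 1]^t \times [0\ 0\ 1])*((1'E)+C) = \begin{pmatrix} 0&0&1\\ 0&0&1\\ 0&0&1 \end{pmatrix} * \begin{pmatrix} 0&1&1\\ 1&0&1\\ 1&1&0 \end{pmatrix} = \begin{pmatrix} 0&0&1\\ 0&0&1\\ 0&0&0 \end{pmatrix} \qquad (18)$$
$$\Lambda_{31} = ([0\ 0\ 1]^t \times [1\ 0\ 0])*((1'E)+C) = \begin{pmatrix} 0&0&0\\ 0&0&0\\ 1&0&0 \end{pmatrix} * \begin{pmatrix} 0&1&1\\ 1&0&1\\ 1&1&0 \end{pmatrix} = \begin{pmatrix} 0&0&0\\ 0&0&0\\ 1&0&0 \end{pmatrix} \qquad (19)$$
$$\Lambda_{32} = ([0\ 0\ 1]^t \times [1\ 1\ 1])*((1'E)+C) = \begin{pmatrix} 0&0&0\\ 0&0&0\\ 1&1&1 \end{pmatrix} * \begin{pmatrix} 0&1&1\\ 1&0&1\\ 1&1&0 \end{pmatrix} = \begin{pmatrix} 0&0&0\\ 0&0&0\\ 1&1&0 \end{pmatrix} \qquad (20)$$
$$\Lambda_{33} = ([0\ 0\ 1]^t \times [0\ 0\ 1])*((0'E)+C) = \begin{pmatrix} 0&0&0\\ 0&0&0\\ 0&0&1 \end{pmatrix} * \begin{pmatrix} 1&0&0\\ 0&1&0\\ 0&0&1 \end{pmatrix} = \begin{pmatrix} 0&0&0\\ 0&0&0\\ 0&0&1 \end{pmatrix} \qquad (21)$$
The (p,q) entry of $\Lambda_{ij}$ tells if $\lambda_p$ at object i is compatible with $\lambda_q$ at object j. For example, $\Lambda_{11}$ reveals that only $\lambda_1$ is 
compatible with $\lambda_1$ at object 1; i.e., that Region 1 must be colored red.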
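The nine matrix products of Eqs. (13) to (21) can be checked mechanically. The sketch below is illustrative only: the helper names `outer_and`, `elementwise_and` and `lam` are introduced here, and, as an assumption, the second factor is taken directly as the matrix displayed in each equation (the identity of Eq. (11) for the same object, the matrix of Eq. (10) for different objects) rather than being evaluated from the neighbor term.

```python
DIFF = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]   # Eq. (10): different objects
SAME = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # Eq. (11): same object

# Initial label vectors for Regions 1, 2, 3 (red, green, blue).
LABELS = {1: [1, 0, 0], 2: [1, 1, 1], 3: [0, 0, 1]}


def outer_and(a, b):
    """Boolean outer product of a column vector a and a row vector b."""
    return [[x & y for y in b] for x in a]


def elementwise_and(A, B):
    """Entry-by-entry Boolean 'and' of two equally sized matrices."""
    return [[x & y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def lam(i, j):
    """Compatibility matrix for Regions i and j (1-based, as in the text)."""
    second = SAME if i == j else DIFF
    return elementwise_and(outer_and(LABELS[i], LABELS[j]), second)


for i in (1, 2, 3):
    for j in (1, 2, 3):
        print(i, j, lam(i, j))
```

For instance, `lam(2, 1)` has an all-zero first row, which is the reason red is later rejected for Region 2.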
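Finally we continue the iteration process. According to Equation (2), for i=1 and k=1:
$$l_{11}^{(n)} \le l_{11}^{(n-1)} * \prod_{j=1}^{3}\Big[\sum_{p=1}^{3}\big(l_{jp}^{(n-1)} * \Lambda_{1j}(1,p)\big)\Big] \qquad (22)$$
$$1 \le 1*[1*1+0*0+0*0]*[1*0+1*1+1*1]*[0*0+0*0+1*1]$$
$$1 \le 1 \text{ which is true.}$$
This says that the color red is all right for Region 1. To determine if the color red is possible for Region 2, we must 
find $l_{21}$.
For i=2 and k=1:
$$l_{21}^{(n)} \le l_{21}^{(n-1)} * \big[l_{11}^{(n-1)}*\Lambda_{21}(1,1)+l_{12}^{(n-1)}*\Lambda_{21}(1,2)+l_{13}^{(n-1)}*\Lambda_{21}(1,3)\big] * \cdots \qquad (23)$$
$$1 \le 1*0$$
$$1 \le 0 \text{ which is false.}$$
Thus, $l_{21}$ must be set to zero. Likewise, for i=2 and k=3, $l_{23}$ is set to zero, and blue is not a possible label for Region 2. Finally,
for i=2 and k=2:
$$l_{22}^{(n)} \le l_{22}^{(n-1)} * \prod_{j=1}^{3}\Big[\sum_{p=1}^{3}\big(l_{jp}^{(n-1)} * \Lambda_{2j}(2,p)\big)\Big] \qquad (24)$$
$$1 \le 1 \text{ which is true.}$$
We see then that the values of $l_{11}$, $l_{22}$, and $l_{33}$ are not affected by the change of $l_{21}$ and $l_{23}$ to zero. In fact, the system 
of equations stabilizes after the change of $l_{21}$ and $l_{23}$, and the result is $l_{11} = l_{22} = l_{33} = 1$, while all other hypotheses 
are zero. Thus, the only consistent labeling is to label Regions 1, 2 and 3 the colors red, green and blue, respectively.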
3. The Hardware Implementation Problems for DRA
3.1 Assumptions and Statements
To concentrate on the major design complexities without loss of generality in developing various DRA architectures, the following assumptions are adopted in our implementations.
Assumption 1: The numbers of objects and labels are assumed to be equal, i.e., n = m. Because extensibility to different numbers of objects and labels was treated as an important factor in system design from the outset, DRA2 and DRA3 can be specified for an arbitrary number of objects and labels as long as chip size permits [3].
Assumption 2: For useful DRA applications, we have chosen the minimal value of n and m to be 8. This is reasonable, for example, in image analysis applications.
Assumption 3: Contrary to the assumption in [7] that the input data are always ready before the DRA computation begins, we assume that neither the data load-in time nor its hardware support can be ignored in a VLSI DRA computation. This assumption poses a greater challenge for an advanced DRA system implementation.
Assumption 4: The time for a binary operation is generally derived from the gate delay of the combinational logic circuits, which is, in our DRA designs, always less than one global clock cycle. Similarly, one clock cycle is usually less than one Read/Write memory cycle. That is, we have
$$t_{\text{gate delay}} < t_{\text{clock cycle}} \le t_{\text{memory R/W cycle}} \qquad (25)$$
Whenever we analyze the time complexity, the following three statements hold.
Statement 1: Since the time complexity of each DRA iteration is, theoretically, the same, we estimate the time complexity for a single iteration only.
Statement 2: The worst-case number of iterations, which is determined solely by the convergence properties of the algorithm and the features of the problem, is estimated in terms of the corresponding VLSI fabrication technology.
Statement 3: We also follow the traditional notation for time complexity analysis of the DRA system, i.e., we describe it in terms of the object number n and the label number m, not the number of bits of input data. For example, when we say, following Statement 1, that it takes O(n) time to compute a DRA problem on the DRA3 architecture, n is the number of objects, whereas n by m bits of input data are actually loaded for each computation.
3.2 DRA Hardware Implementation Problems
Definition 4: The DRA2 Hardware Implementation problem is defined as finding the consistent labeling matrix L:
$$L = \begin{pmatrix} l_{11} & l_{12} & \cdots & l_{1m}\\ \vdots & \vdots & & \vdots\\ l_{n1} & l_{n2} & \cdots & l_{nm} \end{pmatrix} \qquad (26)$$
for the given Region Coloring Problem, provided with the initial labeling matrix:
$$\Lambda = (\Lambda_1, \Lambda_2, \ldots, \Lambda_n)^t \qquad (27)$$
and the object-label pairs' compatibility matrices $C_{ij}$ for every i and j (i, j = 1,2,...,n).
Definition 5: The DRA3 Hardware Implementation problem is defined as finding the consistent labeling matrix L:
$$L = \begin{pmatrix} l_{11} & l_{12} & \cdots & l_{1m}\\ \vdots & \vdots & & \vdots\\ l_{n1} & l_{n2} & \cdots & l_{nm} \end{pmatrix} \qquad (28)$$
for the given World Model, provided with the initial labeling matrix:
$$\Lambda = (\Lambda_1, \Lambda_2, \ldots, \Lambda_i, \ldots, \Lambda_n)^t \qquad (29)$$
and the object-label pairs' compatibility matrices $C_{ij}$ for every i and j (i, j = 1,2,...,n).
The above definitions represent our research strategy for developing DRA architectures. The DRA2 system is aimed first at reducing both the space and time complexities for a particular DRA application problem, i.e., the region coloring problem. The second design, DRA3, generalizes DRA2 to an arbitrary DRA problem and leads to a faster and more general-purpose architecture.
4. A Conventional DRA1 Design and the Complexity Analysis
The entire DRA computation consists mainly of two parts. The first part generates the $n^2$ m by m compatibility matrices for object pairs $u_i$ and $u_j$, i.e., computes each matrix element $\Lambda_{ij}(k,p)$ (i, j = 1,...,n; k, p = 1,...,m) from the most recently evaluated labelings (or the originally given labelings at the first iteration), as depicted in Eqs. (13) to (21). The second part, as shown in Eqs. (22) to (24), is an iteration process that relaxes the current object pairs' constraints among all different objects to produce new labelings, i.e., computes each new labeling $l_{ik}$ (i = 1,...,n; k = 1,...,m) from the currently known object constraint matrices $\Lambda_{ij}(k,p)$. Reviewing the DRA formulation and the computational procedure for the region coloring example in Section 2 reveals the complexity issues met in the DRA1 system design.
Time Complexity: With Eqs. (22)-(24), at least n x m memory Read/Write operations must be performed for computing an n by m matrix of $l_{jp}^{(n-1)} * \Lambda_{ij}(k,p)$ elements. Because there are n x m iteration equations (for computing new labels $l_{ik}$, where i = 1,...,n; k = 1,...,m) which need to be evaluated, the sequential computation time would be $O(n^2m^2)$. In the first DRA1 design, in addition to using the sequential evaluation strategy, another O(m) time is spent on the complicated nested control loop for memory Read/Write operations during each iteration evaluation [11,13]. Thus each iteration costs $O(n^2m^3)$ time. Other binary computations must also be taken into account; according to Assumption 4, however, their contributions are negligible compared with the memory R/W operations.
Space Complexity: During each iteration of DRA computation, several intermediate arguments (initial arguments at the first iteration) require space for storage. The initial label matrix $\Lambda$ (the $L^{(0)}$ matrix) and the next iteration result of $\Lambda$ each take $O(n^2)$ space. The matrices $\Lambda_i \times \Lambda_j^t$ and the label pairs' compatibility matrices $C_{ij}(k,p)$, as well as the resulting object pairs' compatibility matrices $\Lambda_{ij}(k,p)$ (i, j = 1,...,n; k, p = 1,...,m), each take $O(n^2m^2)$ space. The total space complexity is:
$$O(2n^2 + 3n^2m^2) = O(n^2m^2).$$
Data Routing Complexity: It is meaningful to consider the routing complexity only if a parallel DRA architecture may possibly exist. We discuss this issue in Sections 5 and 6 when developing parallel DRA architectures. In fact, the data routing complexity during each iteration is over $O(n^2m^2)$. For an 8-object 8-label parallel DRA system, at least 8K data items are routed or communicated during each iteration.
The DRA1 architecture, unfortunately, computes each intermediate element serially. The computation strategy embedded in this design is purely I/O-bound. Assuming $t_{read} = t_{write} = 500$ ns for an NMOS process and a time complexity of $O(n^2m^3)$ for each iteration, and multiplying by the maximum possible number of iterations, which is on the order of O(nm) and is determined by the features of the computational model [9], the worst-case execution time is on the order of seconds or minutes. For practical applications, the number of objects could be 8, 16 or 32; the corresponding memory requirements for these cases are 8K, 32K and 128K, respectively. As shown in the DRA1 design, this greatly increases the circuit size, which becomes a bottleneck in DRA1 system design as n becomes larger.
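The figures quoted above can be reproduced with a short calculation. This is a rough sketch under explicit assumptions: every hidden constant in the O-notation is taken to be 1, the memory count is two blocks of $n^2m^2$ cells with the number of labels held at m = 8 (an inference from the quoted 8K, 32K and 128K values), and the function names are introduced here for illustration.

```python
def dra1_memory_cells(n, m=8):
    """Two memory blocks of n^2 * m^2 cells each (assumed layout)."""
    return 2 * n * n * m * m


def dra1_iteration_time(n, m, t_rw=500e-9):
    """Seconds per iteration for n^2 * m^3 sequential R/W operations."""
    return n * n * m ** 3 * t_rw


def dra1_worst_case_time(n, m, t_rw=500e-9):
    """Seconds for the assumed maximum of n * m iterations."""
    return n * m * dra1_iteration_time(n, m, t_rw)


for n in (8, 16, 32):
    print(n, dra1_memory_cells(n) // 1024, "K cells")

print(dra1_iteration_time(8, 8), "s per iteration")
print(dra1_worst_case_time(8, 8), "s worst case")
```

With these unit constants the 8-object, 8-label case needs about 16 ms per iteration and about one second for 64 iterations; actual DRA1 timings depend on control overhead that this estimate ignores.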
5. Parallel Tree-root Pipelining and the DRA2 System
In the analyses above, most of the computing time was spent on the R/W operations for the matrices $\Lambda_{ij}(k,p)$; moreover, 
one half of the $O(n^2m^2)$ space is taken by the generation of the matrices $\Lambda_{ij}(k,p)$, which blocks the DRA1 VLSI 
computation. In this section we describe a DRA2 architecture which is completely independent of the $\Lambda_{ij}(k,p)$ 
evaluation [11].
5.1 A Parallel Tree-Structured Reformulation for DRA
It should be clear that any attempt to speed up an I/O-bound computation must rely on an increase in the memory 
bandwidth [39]. Speeding up a compute-bound computation, however, may frequently come from the concurrent 
use of many processing elements [39-44]. The degree of parallelism in a special-purpose system is largely 
determined by the underlying algorithm. In order to solve the complexity arising in the conventional DRA1 design, 
the following three steps have been taken to design a hardware algorithm that supports a high degree of concurrency 
in DRA computation.
1. Constructing the Parallel Computation Tree
When Eq. (2) is analyzed further, we see that element $\Lambda_{ij}(k,p)$ can be decomposed as:
$$\Lambda_{ij}(k,p) = l_{ik}^{(0)}\, l_{jp}^{(0)}\, C_{ij}(k,p)$$
which can form a leaf node like:
Figure 1: Leaf Node for Computing $l_{jp} \times \Lambda_{ij}(k,p)$
so that Eq. (2) can be hierarchically formed as a tree-like structure, with each level embedded in the parallel 
computation for its leaves' operands as shown in Figure 3.
2. Speeding Up Iteration
Once a DRA problem being processed is identified as convergent, its current labeling approaches the final consistent labeling asymptotically in such a way that the error vectors, i.e., the difference vectors between these two labelings, decrease monotonically. Therefore, the node computation in Figure 1 can be sped up by replacing the initially given elements $l_{ik}^{(0)}$ and $l_{jp}^{(0)}$ with the most recent, (n-1)th iterated results $l_{ik}^{(n-1)}$ and $l_{jp}^{(n-1)}$. The modified computation is composed of the leaf node:
Figure 2: Modified Leaf Node Computation for $l_{jp} \times \Lambda_{ij}(k,p)$
The computation tree for $l_{ik}^{(n)}$ is formed as:
Figure 3: A Parallel Tree for Computing the nth $l_{ik}$
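The effect of the tree reformulation can be illustrated in software: each leaf combines the current $l_{ik}$, the current $l_{jp}$ and the fixed $C_{ij}(k,p)$, so the matrices $\Lambda_{ij}$ never have to be stored. The sketch below is not the DRA2 circuit. The names `c`, `tree_root` and `iterate` are introduced here, the loop is sequential rather than parallel, indices are zero-based, and the data are the region coloring example of Section 2.2.

```python
SAME = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # Eq. (11): same object
DIFF = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]   # Eq. (10): different objects


def c(i, j):
    """Label-pair compatibility matrix for objects i and j."""
    return SAME if i == j else DIFF


def tree_root(L, i, k):
    """Value at the root of the tree for l_ik: a Boolean product over j
    of Boolean sums over p of the leaves l_ik * l_jp * C_ij(k, p)."""
    n, m = len(L), len(L[0])
    return int(all(
        any(L[i][k] and L[j][p] and c(i, j)[k][p] for p in range(m))
        for j in range(n)
    ))


def iterate(L):
    """Re-evaluate every tree with the most recent labels until stable."""
    while True:
        new = [[tree_root(L, i, k) for k in range(len(L[0]))]
               for i in range(len(L))]
        if new == L:
            return L
        L = new


final = iterate([[1, 0, 0], [1, 1, 1], [0, 0, 1]])
print(final)   # rows are Regions 1-3, columns are red, green, blue
```

The result marks red for Region 1, green for Region 2 and blue for Region 3, in agreement with the worked example, while storing only the labels and the two $C$ matrices.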
In section 5.2 it is shown that the bottommost leaves’ operands have been associated with a data pipelining channel, 
completing a multiple parallel tree-root pipelining scheme which supports a highly concurrent DRA computation. 
For convenient notation in the rest of the paper, we classify these three kinds of bottommost leaves’ operands of 
Figure 3 in:
Definition 6: (1) We define the labeling elements $l_{jp}$ (j = 1,...,n and p = 1,...,m) which are inside the pipelining channel to be l-pipe, and those on the tree root to be l-root. Both l-pipe and l-root have the same logic values, but they are topologically separated in the system layout. (2) Signal $l_{ik}$ passes through all roots horizontally along each row array k; it is defined as the broadcasting signal $b_k$. (3) The $C_{ij}(k,p)$ matrix elements, which are systematically distributed along each tree root, are defined as the $C_{ij}(k,p)$-Pattern.
3. Introducing a Time Dimension in the Computation
To compute an n-object m-label DRA problem, a total of n by m $l_{ik}$'s need to be evaluated. This means at least 64 computation trees as shown in Figure 3 need to be built inside the circuit for our n = m = 8 case, which greatly increases the circuit size. To minimize this problem, each operand at the bottom of the tree has been constructed in a time dimension. As time advances, different $l_{ik}$'s (i = 1,...,n; k = 1,...,m) are generated. These time-varying characteristics can best be described by the following spatial-temporal (ST) index equations:
$$i \leftarrow i + t \bmod n, \qquad (32)$$
$$j \leftarrow j + t \bmod n, \qquad (33)$$
where $A \leftarrow B$ denotes that A is generated by B. We call i (or i+t) and j (or j+t) the spatial-temporal indexes, since they characterize the indexes in the recursive formulas of Eqs. (1)-(9). They constitute the theoretical basis for the recursive systolic computation and interleaved processing in the DRA2 and DRA3 systems, while the basic DRA-PE cell realizes the computation in the formula. It is also of interest to see that in the DRA2 system [11], the ST indexes generate the dynamic forward DRA computational wavefront on a statically configured DRA architecture, while in the DRA3 architecture [12], they form the dynamically configurable DRA3 architectural wavefront for a virtually static computation.
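One way to read Eqs. (32) and (33) is as a circular schedule: at time step t, the hardware position that started with index i works on index (i + t) mod n. The sketch below only illustrates this indexing. The function name `st_indexes` is introduced here, indices are zero-based, and the mapping of positions to physical rows or columns of the array is an assumption, not a description of the DRA2 or DRA3 layout.

```python
def st_indexes(n):
    """schedule[t][i] is the index handled at time t by the position
    that started with index i, following i <- (i + t) mod n."""
    return [[(i + t) % n for i in range(n)] for t in range(n)]


schedule = st_indexes(8)
for t, row in enumerate(schedule):
    print("t =", t, row)
```

Over n consecutive time steps each position visits every index exactly once, and at any single time step all n indexes are covered, which is what allows one set of trees to be reused for all objects.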
5.2 Implementation Issues for the DRA2 Chip
1. System Architecture and Block Diagram
The block diagram of the DRA2 circuit is illustrated in Figure 4. The chip consists of four functional blocks.
1. Compatibility Matrix Registers (CMR). The $C_{ij}$ Registers are a set of eight 8-bit shift registers in the 
leftmost part of the circuit; they are used for storing each $C_{ij}$ matrix. Another set of $C_{ii}$ Registers in 
the rightmost part of the circuit is for storing $C_{ii}$.
2. 8 x 8 Multiprocessor SIMD Array (MSA). The MSA is composed of 8 by 8 simple and regular 
cells. They are predefined to map the highly parallel computation algorithm of Figure 3 onto silicon. 
A number of parallel horizontal and vertical communication wires are designed around the four edges 
of the cells to make use of higher degrees of parallelism in the computation.
3. L-matrix Shift Register (LSR). It serves as (1) the input and output data path for the original and 
final labeling matrices, (2) the pipelining channel for tree-root operand broadcasting and pipelining, 
forming a recursive DRA computational wavefront, and (3) temporary storage for data storing 
and updating.
4. Control Module (CM). This module includes four units. An 8-Bit Comparator is located on top of 
the first 8-bit shift register of the LSR to sense the equality between the nth output vector $L^{(n)}$ of the 
MSA array and the corresponding (n-1)th row vector $L^{(n-1)}$ inside the LSR. A Timer serves as both 
the systole pacer and the tagged-bit signal generator for iteration control. An 8-Bit State Register is used 
for collecting comparison results from the Comparator and monitoring iteration states. Finally, a Finite 
State Machine (FSM) is built for performing a self-timed synchronization among these functional blocks.







[Block diagram: eight 8-bit shift registers | 8 x 8 SIMD array | eight 8-bit shift registers]
Figure 4: Circuit Block Diagram for the DRA2 System
This diagram of four functional blocks also serves as the PPL layout floor plan for efficient layout (in subsection 
5.4).
2. Multiprocessor SIMD Array and the Cell Design
The basic principle of the Multiprocessor SIMD architecture for DRA2 is illustrated in Figure 5. By replacing a 
single Processing Element with an array of 8 by 8 PEs, a higher computation throughput can be achieved without 
increasing memory bandwidth. The function of the memory (i.e., the L matrix shift registers) in the diagram is to 
pulse data l-pipe$_{jp}$ (j = 1,...,n; p = 1,...,m) through the array of cells. Then new data l-pipe$_{ik}$ (i = 1,...,n; k = 1,...,m) 
are returned to memory in a rhythmic fashion. The crux of this approach is to ensure that once the data are brought 
out from the memory they can be used effectively at each cell they pass while being pumped through the entire 
array.
Figure 5: Basic Principle of the Parallel DRA2 System
To perform the parallel DRA2 computation, two cells (illustrated in Figure 6 (a) and (b)) with almost identical logic and structure were used in constructing the entire array. The only difference is that the first cell generates the multiple broadcasting signals bk (k = 1,...,m) for each row array, while the second cell simply passes the bk signals through. The construction of the multiprocessor SIMD array using these two cells is illustrated in Figure 7.
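$$\mathrm{Out}(j,k)\big|_{j=1} = \sum_{p=1}^{m} \big(l_{jp} \times A_{ij}(k,p)\big) = \sum_{p=1}^{m} \big(l_{jp} \times l_{ik} \times C_{ij}(k,p)\big) = \sum_{p=1}^{m} \big(l_{jp} \times b_k \times C_{ij}(k,p)\big) = \sum_{p=1}^{m} \overline{\big(\overline{l_{jp}} + \overline{b_k} + \overline{C_{ij}(k,p)}\big)}$$
(a) Cell-A
$$b_k(\mathrm{in}) = b_k(\mathrm{out}), \quad \text{at columns } j \neq 1 \tag{36}$$
$$\mathrm{Out}(j,k)\big|_{j \neq 1} = \sum_{p=1}^{m} \big(l_{jp} \times A_{ij}(k,p)\big) = \sum_{p=1}^{m} \big(l_{jp} \times l_{ik} \times C_{ij}(k,p)\big) = \sum_{p=1}^{m} \big(l_{jp} \times b_k \times C_{ij}(k,p)\big) = \sum_{p=1}^{m} \overline{\big(\overline{l_{jp}} + \overline{b_k} + \overline{C_{ij}(k,p)}\big)} \tag{37}$$
(b) Cell-B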
Figure 6: Two Cells in Multiprocessor DRA2 Array
According to Figure 3 and equations (34) to (37), these two cells are implemented in two levels of NOR gate 




















Figure 7: Construction of SIMD Multiprocessor Array Using Cell-A and Cell-B
We see from Figure 6 that in essence the inner summation part of Equation 2 is carried out in Cell-A and Cell-B, 
while the outer multiplication part of that equation is implemented in each horizontal array in Figure 7.
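As a rough illustration of this division of labor, the following Python sketch models one cell as an OR over p of three-input product terms, and one horizontal row array as an AND over j of the cell outputs. It assumes that Eq. (2) has the Boolean form implied by the cell equations; the names `nor`, `cell_out`, and `relax_row`, the nested-list layout `C[i][j][k][p]`, and the two-object example are illustrative assumptions, not part of the DRA2 design.

```python
def nor(*bits):
    """NOR gate on 0/1 inputs: 1 only when every input is 0."""
    return 0 if any(bits) else 1


def cell_out(l_j, b_k, c_row):
    """Inner summation for one cell: OR over p of (l_jp AND b_k AND C_ij(k,p)).

    Each product term is formed as a NOR of the complemented inputs
    (De Morgan), mirroring a NOR-gate realization.
    """
    terms = [nor(1 - l_jp, 1 - b_k, 1 - c) for l_jp, c in zip(l_j, c_row)]
    return 1 if any(terms) else 0


def relax_row(L, C, i):
    """Outer multiplication: new l_ik = AND over j of the cell outputs."""
    n, m = len(L), len(L[0])
    new_row = []
    for k in range(m):
        b_k = L[i][k]  # l_ik, broadcast along row k as b_k
        outs = [cell_out(L[j], b_k, C[i][j][k]) for j in range(n)]
        new_row.append(1 if all(outs) else 0)
    return new_row


# Hypothetical example: 2 objects, 2 labels, distinct objects need distinct labels.
SAME = [[1, 0], [0, 1]]  # C_ii(k,p) = 1 iff k == p
DIFF = [[0, 1], [1, 0]]  # C_ij(k,p) = 1 iff k != p, for i != j
C = [[SAME, DIFF], [DIFF, SAME]]
L = [[1, 0], [1, 1]]     # object 0 allows label 0 only; object 1 allows both

print(relax_row(L, C, 0))  # [1, 0]
print(relax_row(L, C, 1))  # [0, 1]
```

In this toy case the second object loses label 0 because the first object can only take label 0, which is the pruning behavior the relaxation is meant to produce.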
It is also noted from Figures 6 and 7 that the DRA2 architecture can readily be extended to different numbers of objects and labels. For example, a row array of the number k in Figure 7 can be added in order to attach one more label for all objects, while a column array of the number j in the same figure can be added to extend the system for one more object, provided that the corresponding changes inside each cell are made. A critical analysis indicates that the DRA2 systems for 8, 16, and 32 objects are technically implementable in the 3.0 µ NMOS, 2.0 µ and 1.2 µ CMOS, and 1.0 µ GaAs processes [14].
3. Circuit Features and Design Techniques
In addition to designing the simple and regular cells, several efficient techniques, such as interleaved processing, 
multiple signal broadcasting, and self-timed synchronization, were applied to the implementation of the SIMD 
multiprocessor DRA2 architecture.
Recursive Systolic Computation and Interleaved Processing
With the introduction of the time dimension in subsection 5.1, the multiprocessor SIMD array in Figure 7 possesses a time-varying characteristic which makes recursive systolic computation and interleaved processing possible. Let us focus on the first column (j = 1) of the array. The first input vector, which is the i-th row vector of the labeling matrix L at the (n-1)-th iteration, is fed into the first column of the DRA2 array as
$$L_i^{(n-1)} = \big(l_{i1}^{(n-1)}, l_{i2}^{(n-1)}, \ldots, l_{im}^{(n-1)}\big),$$
where i = 1,...,n. The corresponding output vector $L_i$ of the multiprocessor SIMD array, which is the i-th row vector of the labeling matrix L at the n-th iteration, is generated:
$$L_i^{(n)} = \big(l_{i1}^{(n)}, l_{i2}^{(n)}, \ldots, l_{im}^{(n)}\big),$$
where i is fixed at a time t = i. As time moves forward, the elements in the L shift register have shifted from the left

















Figure 8: Computational Wavefront Pipelining and Circulation for Interleaved Processing
Each Li vector is computed through interleaved use of the multiprocessor SIMD array, and eight Li vectors form an entire computational wavefront of the labeling matrix L for the n-th relaxation iteration. Note that we use n computing trees to generate the n×m lik's in O(nm) time; we may also use O(nm) computing trees to compute the same number of lik's in O(n) time, provided that the latter has a uniformly progressing wavefront in time and in space.
Multiple Signal Broadcasting
Broadcasting is probably one of the most obvious ways to make multiple use of each input element. It plays an important role in making the parallel computation tree of Figure 3 implementable. Two multiple broadcasting schemes are used in the DRA2 architecture. In the first, n by m vertical broadcasting lines from each pipelining operand are connected to the bottom-most leaf nodes of each parallel computing tree (passing each l-pipe to l-root). In the second, as depicted in Figures 6 and 9, Cell-A at column j (= 1) is used to jog the signal lik (an l-pipe operand) and then propagate it horizontally from right to left through the entire row array. Thus, the output vector of the multiprocessor SIMD array, i.e., (li1, li2, li3, li4, li5, li6, li7, li8), can be generated simultaneously in a highly concurrent manner.
Definition 7: The second multiple data routing scheme, for jogging bk at column j = 1, as illustrated in Eqs. (32) and (34) and in Figure 9, is defined as the J-Pattern.
Figure 9: Broadcasting Scheme for the bk Signals 
Cross Circle Associative Pipelining
Much associativity is introduced through the associative circular pipelining in the DRA2 design. Referring to Figures 7, 8, 9, and 10, two circular pipelining loops are implemented.
The first pipelining is designed for the ljp elements. It can be viewed in two parts, a master pipelining part and a slave pipelining part. The master pipelining refers to a serial horizontal circular pipelining of the lj elements inside the main pipelining channel, the LSR. While the master pipelining is going on, a parallel horizontal virtual circular pipelining follows in the k horizontal row arrays. The parallel pipelining in the DRA array is a slave pipelining of the master one, which is supported by multiple signal broadcasting.

vector from the multiprocessor SIMD array in order to update the current (n-1)-th Li row vector. This iteration and updating process is the core of the relaxation process described in Eq. (8). To sense the completion of computation, a Comparator is built on top of the first 8-bit SR. If the two vectors are equal, a row-eq signal of 1 is produced and stored into the 8-bit State Register of the Control Module; otherwise a 0 signal is sent. As soon as the State Register holds eight 1's, which means the equality of Eq. (8) is reached, an all-eq signal is issued to the FSM. Since the control processes in this system are based on the data validity of a control data flow, reliable and fast execution in a data-driven environment is obtained. The control mechanism used in the parallel DRA2 architecture is shown in Figure 11.
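The completion test can be pictured with a small software model. The sketch below is only an illustration under stated assumptions: the State Register is modeled as a sliding window over the most recent n row-eq bits, and `toy_update` is a hypothetical stand-in for the SIMD array (it is not the DRA update), chosen so that the loop visibly converges.

```python
from collections import deque


def toy_update(L, i):
    """Hypothetical stand-in for the SIMD array: keep only labels also held by row 0."""
    return [a & b for a, b in zip(L[i], L[0])]


def run_until_all_eq(L, update_row):
    """Circulate rows until n consecutive row-eq bits are 1 (the all-eq condition)."""
    n = len(L)
    state = deque([0] * n, maxlen=n)  # State Register: the last n row-eq bits
    i = steps = 0
    while not all(state):
        new_row = update_row(L, i)
        state.append(1 if new_row == L[i] else 0)  # Comparator output (row-eq)
        L[i] = new_row                             # write back into the LSR
        i = (i + 1) % n
        steps += 1
    return L, steps


final, steps = run_until_all_eq([[1, 0, 1], [1, 1, 1], [0, 1, 1]], toy_update)
print(final, steps)  # [[1, 0, 1], [1, 0, 1], [0, 0, 1]] 6
```

The loop stops only after a full pass in which every row compares equal to its previous value, which is the software analogue of collecting eight 1's before raising all-eq.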
To ensure that the iteration cycle completes at the end of n×m cycles, a tagged bit is derived by ANDing the ln bit with the 64th count of the Timer; it serves as a reliable alignment signal for computation in the control flow.
Figure 11: Self-Timed Synchronization 
In Figure 12 the state graph of the FSM is illustrated.
Figure 12: State Graph of the Finite State Machine (legible state labels: Reset, Ending)
5.3 Performance Evaluation
With the parallel tree-structured reformulation and tree-root pipelining in its system design, the DRA2 architecture takes advantage of a high degree of pipelining and multiprocessing. It eliminates the need to store and compute each element of the object pairs matrices Aij(k,p). One-half of the O(n²m²) space is eliminated; the other half is reserved for the input data Cij(k,p), the label pairs compatibility matrices. In certain DRA computations, as shown in the Region Coloring Problem, the space complexity can even be reduced to O(nm); i.e., only an O(nm)-bit shift register is required to store all n by m intermediate (or original) labeling elements [11]. The DRA2 computation during each iteration takes only O(nm) clock cycles. Assuming a clock cycle of about 120 ns (using a 3 µ NMOS process and the PPL design methodology), DRA2 performs the DRA computation in microseconds and, in the worst case, i.e., multiplying by the maximum possible number of iterations, O(nm), in milliseconds [11,15].
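To make these orders of magnitude concrete, the sketch below plugs n = m = 8 and the 120 ns clock into the stated bounds. It assumes, purely for illustration, that the hidden constants are 1 (exactly nm cycles per iteration and at most nm iterations); the function names are hypothetical.

```python
def dra2_iteration_time_ns(n, m, clock_ns=120):
    """One relaxation iteration, assuming exactly n*m clock cycles."""
    return n * m * clock_ns


def dra2_worst_case_time_ns(n, m, clock_ns=120):
    """Worst case, assuming at most n*m iterations are needed."""
    return n * m * dra2_iteration_time_ns(n, m, clock_ns)


per_iter = dra2_iteration_time_ns(8, 8)
worst = dra2_worst_case_time_ns(8, 8)
print(per_iter / 1e3, "microseconds per iteration")   # 7.68
print(worst / 1e6, "milliseconds in the worst case")  # 0.49152
```

Under these assumptions one iteration takes a few microseconds and the worst case is roughly half a millisecond for the 8 by 8 system.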
5.4 The PPL Layout for DRA2 System
The DRA2 chip was built by assembling the four functional blocks in Figure 4 using PPL (Path Programmable Logic) tools at the University of Utah [36]. Since parallel computation and the multiprocessor SIMD array greatly reduce the design difficulties, the PPL layout is simple and straightforward. An overview of the complete PPL layout, which is a PPL mapping of the block floor plan in Figure 4, is shown in Figure 13.
For details of the complete system design, the simulation results, the interfacing strategy with the host computer, timing and wiring delay analysis, testing, pinout description, and fabrication of the DRA2 chip, see [11,15].
Figure 13: The PPL Layout of the Multiprocessor SIMD DRA2 System
Figure 14: The Photomicrograph of the DRA2 Chip
6. An O(n) Time Algorithm-Configured Dynamic Architectural Wavefront System
Compared to the conventional DRA1 design, the DRA2 system achieves a substantial performance improvement in terms of speed, size, and memory access requirements. However, there are still several bottlenecks in developing a fast general purpose DRA architecture.
6.1 Complexity Issues Revisited
Time Complexity: The O(nm) time of the DRA2 system is suitable only for a small number of objects and labels, even though the DRA2 architecture can be extended to larger numbers of both. The larger the number of objects and labels, the slower the DRA2 system becomes. It is desirable that a much faster DRA architecture be designed for real-time processing of a larger number of objects and labels.
Data Preprocessing Complexity: One of the remarkable characteristics of the DRA problem is that a large amount of data, such as the Θ(n²m²) elements of the label pairs compatibility matrices Cij(k,p), must be loaded into the system before the computation is performed. In addition, the data ordering and format, which might be processed during the load-in time, must be arranged to support efficient DRA computation. Following assumption 3, the term Data Preprocessing Complexity (DPC) is used here to refer to the complexity issues which arise in the analysis of the time, space, and hardware complexities of this kind of data-preprocessing-intensive VLSI computation. Some labeling solutions avoid this problem by assuming that the data is always ready for computation.
Data Routing Complexity: One of the proposed DRA3 architectures [12] is shown in Figure 18, in which the data routing complexity in the horizontal direction between two DRA modules is O(m²); a total of O(nm³) connections must be routed for nm DRA modules. The data routing complexity becomes another dominant factor in designing the DRA3 system as the number of labels increases.
Fabrication Difficulty: Although we have separated the main pipelining channel and have used larger buffers to drive between shift registers, its characteristic turns out to be worse as n and m increase.
6.2 Cij(k,p)-Pattern Distribution Analysis
The DRA3 architecture overcomes these difficulties through the use of a dynamically configured architectural wavefront on the DRA3 switch lattice [12]. It runs the DRA computation in O(n) time without the extra requirement for data preprocessing hardware and without requiring horizontal routing among the DRA modules, provided that only O(nm²) switches are added to the original multiple broadcasting wires of the DRA2 system.
In fact, computing time, data preprocessing complexity, and data routing complexity form a coherent mixture in the DRA3 system design. We illustrate its major architectural concepts beginning with the analysis of Cij(k,p)-Patterns in the case of n = m = 3.
Referring to Eq. (2) and Figure 8, the first computational wavefront for computing the l1k labels is formed at time t = i = 1 as:




A snapshot of the C1j(k,p) (j, k, p = 1,2,3) matrix data distribution pattern which is matched with the l1k wavefront at t = 1 is:




k=1 C13(1,3) C13(1,2) C13(1,1) C12(1,3) C12(1,2) C12(1,1) C11(1,3) C11(1,2) C11(1,1)
k=2 C13(2,3) C13(2,2) C13(2,1) C12(2,3) C12(2,2) C12(2,1) C11(2,3) C11(2,2) C11(2,1)




Figure 15: The C1j(k,p) Pattern for generating l1k's













l13(n-1) l12(n-1) l11(n-1) l33(n-1) l32(n-1) l31(n-1) l23(n-1) l22(n-1) l21(n-1)
C21(1,3) C21(1,2) C21(1,1) C23(1,3) C23(1,2) C23(1,1) C22(1,3) C22(1,2) C22(1,1)
C21(2,3) C21(2,2) C21(2,1) C23(2,3) C23(2,2) C23(2,1) C22(2,3) C22(2,2) C22(2,1)







Figure 16: The C2j(k,p) Pattern for generating l2k's
From Figures 15 and 16, several criteria for DRA3 architecture are derived.
(1) At time t = i, to compute each new labeling vector Li, different computational wavefronts are formed and the associated Cij(k,p) matrices are distributed. More precisely, at time t, the elements of each matrix Ctj(k,p) (where t, j = 1,...,n and k, p = 1,...,m) stored in the k-th row of a parallel memory in Figures 15 and 16 are retrieved. Thus a parallel Row-Readable RAM is required.
(2) The Cij(k,p) elements associated with each dynamically pipelined operand in the main tree-root pipelining channel are skewed by one unit of j and horizontally circulated. The cyclic shifting of the bits of the Cij(k,p)-Pattern elements to the right by m bit positions corresponds to a Block-data Shuffling operation [43]. We summarize and extend all distributed C matrix patterns in Figure 17. (Each symbol actually is a matrix distributed as in Figures 15 and 16.)
t = 1: C13(k,p) C12(k,p) C11(k,p)
t = 2: C21(k,p) C23(k,p) C22(k,p)
t = 3: C32(k,p) C31(k,p) C33(k,p)
Figure 17: A Circularly Skewed Cij(k,p)-Pattern when Generating lik's
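The circular skew in Figure 17 can be reproduced with a short index calculation. The sketch below is an inference from the n = 3 table only: it assumes that, reading right to left at time t = i, the column index j starts at t and wraps around modulo n. The function name `skewed_pattern` is hypothetical.

```python
def skewed_pattern(n):
    """Left-to-right (i, j) indices of the C_ij blocks at each time t = i."""
    rows = []
    for t in range(1, n + 1):
        # Reading right to left, j starts at t and wraps around modulo n.
        right_to_left = [((t - 1 + c) % n) + 1 for c in range(n)]
        rows.append([(t, j) for j in reversed(right_to_left)])
    return rows


for t, row in enumerate(skewed_pattern(3), start=1):
    print(f"t = {t}:", " ".join(f"C{i}{j}(k,p)" for i, j in row))
# t = 1: C13(k,p) C12(k,p) C11(k,p)
# t = 2: C21(k,p) C23(k,p) C22(k,p)
# t = 3: C32(k,p) C31(k,p) C33(k,p)
```

Each row is the previous column order rotated by one block, which is why the data must be pre-shuffled, or the architecture reconfigured, to keep the blocks aligned with the pipelined operands.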
The first criterion requires only simple associative processing, i.e., parallel Row-Readable RAMs, while the second criterion demands higher-order design complexity in time and data preprocessing, as well as in data routing.
A candidate design for the DRA3 system, where n = m = 8, is presented in Figure 18. The DRA3 system is made up of 8 by 8 DRA modules. Each module consists of a PE cell (with the same structure as Cell-A and Cell-B in the DRA2 system) and a local Row-Readable parallel RAM for the Cij(k,p)-Pattern.
Figure 18: A Candidate Design for DRA3 System
A pre-shuffling chip was implemented and fabricated using a 3 µ NMOS PPL design methodology in order to put the Cij(k,p)-Pattern into the same format as in Figure 17 [16]. However, the pre-processing hardware costs O(b³) time (where b is the number of bits to be shuffled) and takes more than half of the chip, while a difficult inter-chip routing problem remains unsolved [12].
6.3 A Dynamically Configurable DRA3 Architecture
The configurable, highly parallel computer system is a multiprocessor architecture that provides a programmable interconnection structure integrated with the PEs [10]. In the original idea, processing begins with the controller broadcasting a command to all switches to invoke a particular static configuration setting. The design of the DRA3 system has revealed that, by adopting an advanced dynamic configuring strategy on the DRA3 switch lattice, not only can the DRA computational wavefront be generated while retaining the benefits of uniformity and locality that the DRA-PE exploits, but the combined limitations imposed in upgrading the DRA3 architecture can also be eliminated.
1. The Architectural-Computational Wavefront (ACW) Notation
We rewrite Weiser and Davis’s definition of wavefront [38]:
Definition 8: A wavefront, denoted as A, represents an ordered set of data elements: {a(1,m), a(2,m), ..., a(N-1,m), a(N,m)}, where m is the "time" subscript. The elements a(i,m) for all m belong to the i-th data stream. For simplicity, the "time" subscript in the elements of a wavefront is omitted and a(i,m) will simply be represented as a(i).
We extend the definition as:
Definition 9: The Architectural-Computational Wavefront (ACW) Notation [12] consists of two different wavefronts. The Computational Wavefront is an ordered set of data elements: {c(1,i+t), c(2,i+t), ..., c(N,i+t)}, where i+t is the spatial-temporal index in Eq. (32). The Architectural Wavefront is an ordered set of architectural configurations: {a(1,j+t), a(2,j+t), ..., a(N,j+t)}, where j+t is the spatial-temporal index in Eq. (33).
The following distinctions are made which are necessary for building a DRA3 system:
1. The computational wavefront dynamically progresses on a static architecture as the ST indexes increase. The architectural wavefront progresses, by dynamically configuring the system architecture, in the direction opposite to the computational wavefront as the ST indexes increase. For instance, we have seen that the ST index in DRA2 forms a dynamic DRA computational wavefront on a static DRA2 architecture; we will see that the ST index of DRA3 generates a dynamic DRA3 architectural wavefront for a virtual static DRA computation.
2. The equivalence between the computational wavefront and the architectural wavefront holds. Thus either one characterizes both of them.
3. Both the architectural and the computational wavefront notations have spatial parameters, such as i and j, and a temporal parameter, such as t, so that both can be manipulated in the space and time domains.
4. The conventional wavefront notation assumes a uniform progression of the data stream. This restriction is eliminated by combining ST indexes with the ACW notation, since these indexes may behave synchronously or asynchronously.
5. The ACW notation first differentiates the computational wavefront and the architectural wavefront, and is more suitable for exploring architectural configurability and the synthesis of highly configurable systems.
Referring to Eqs. (2), (32), and (33), there is always a data dependence relation among these arguments with respect to time t and positions k and j, that
(40)
In Figure 7 we see that this data dependency equation is topologically mapped onto silicon in terms of the parallel tree structure. The vertical broadcasting of the l-pipe operands from the pipelined channel to the DRA array is described by a routing function:
(41)
where the symbol I denotes an alignment relation such that A I B if and only if A and B have the same physical column position j. The J-Pattern of Definition 7 is routed by the equation:
(42)
The DRA computation is valid if and only if these two dependency equations hold at any time t and positions k and j, as well as at any iteration n. Eq. (41) stands for a static network in the DRA3 architecture. Eq. (42) is a virtually static network; a potential speed-up in the DRA3 computation can be reached by dynamically configuring Eq. (41) in an ACW notation.
2. Dynamic Configuring the DRA3 Architectural Wavefront
Although the symbol j represents the horizontal spatial shift (in Figures 15 and 16) while the symbol t represents the temporal unit, the computational wavefront progression of l-root → l-pipe proceeds (referring to Eqs. (32) and (33)) as j+t increases, where the combined parameter j+t must not be associated with any dimension. By Definition 9, the computational wavefront moving from left to right as the ST indexes increase is equivalent to the J-Pattern (one configuration of the architectural wavefront) shifting in the reverse horizontal direction as the ST indexes increase. The DRA computation can be performed either by manipulating a computational wavefront or by dynamically configuring an architectural wavefront.
In Figure 19, we create a dynamically configurable DRA3 architectural wavefront by building an algorithm-configured switch lattice. A total of O(nm² + nm) switches are added between each vertical wire and horizontal wire.

operands of the "pipelining channel," noting that the pipelining channel is now static and is replaced with an O(nm)-bit RAM. These switches are programmed by a Self-Timed System Controller, as shown in the DRA2 circuit, to avoid the time delay incurred during broadcasting on the switch lattice. For DRA3 computation, we give:
Definition 10: We define the DRA3 architectural wavefront to be the interconnection patterns of the DRA3 switch lattice in which both the switches SNjk for a column j, i.e., the J-Pattern at a specified column j, and the switches BSjk for that column j are invoked simultaneously, where the index j is defined in Eq. (33).
During the DRA3 computation, at index j (mod n), the DRA3 architectural wavefront shifts from right to left in O(n) clock cycles; an entire DRA computation, which is virtually static, is thereby dynamically generated.
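The equivalence between moving the data and moving the configuration can be illustrated with a deliberately simplified model. In the sketch below, `dra2_schedule` rotates the rows past a fixed tap (a moving computational wavefront on a static architecture), while `dra3_schedule` leaves the rows in place and steps the active switch column (a moving architectural wavefront over static data). Both function names, the tap position, and the direction of rotation are assumptions made for illustration only.

```python
from collections import deque


def dra2_schedule(rows):
    """Static architecture: rows circulate and a fixed tap at position 0 reads them."""
    sr = deque(rows)
    seen = []
    for _ in range(len(rows)):
        seen.append(sr[0])
        sr.rotate(-1)  # shift the register by one position
    return seen


def dra3_schedule(rows):
    """Static data: rows stay in RAM and the active switch column advances each cycle."""
    seen = []
    for t in range(len(rows)):
        active_column = t % len(rows)
        seen.append(rows[active_column])
    return seen


rows = ["L1", "L2", "L3"]
print(dra2_schedule(rows))  # ['L1', 'L2', 'L3']
print(dra3_schedule(rows))  # ['L1', 'L2', 'L3']
```

Both schedules present the same operand to the computation at each step; only the thing that moves differs, which is the sense in which either wavefront characterizes the other.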
It is now of interest to examine whether the complexity issues in section 6.1 are resolved by the DRA3 architecture. First, a time upper bound of O(n) is reached. Second, there is no particular hardware support and no special requirement for data preprocessing. Under dynamic parallel configuring of DRA3, the Cij(k,p)-Pattern is naturally distributed in matrix index order in the original parallel memory. For an intuitive understanding, we redraw the Cij(k,p)-Pattern of Figure 17 in Figure 20, which is associated with the DRA3 computation strategy.
t = 1: C13(k,p) C12(k,p) C11(k,p)
t = 2: C23(k,p) C22(k,p) C21(k,p)
t = 3: C33(k,p) C32(k,p) C31(k,p)
Figure 20: A Cij(k,p)-Pattern under Dynamic Reconfiguration of DRA3
The third problem, an intensive multiple data routing requirement, is eliminated by adding O(nm² + nm) switches on the DRA3 switch lattice. As for the long pipelining channel, it is replaced with O(nm) bits of smaller and more reliable RAM cells under the dynamic data routing. Finally, since the Cij(k,p)-Pattern is designed for O(n²) arbitrary m by m C-matrices, the DRA3 system is suitable for any general purpose DRA computation.
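For a sense of scale, the sketch below evaluates the two asymptotic counts at n = m = 8. It assumes unit constants, which the text does not state, so the numbers indicate relative growth rather than actual wire or switch counts; the function names are hypothetical.

```python
def horizontal_routing_connections(n, m):
    """Candidate design of Figure 18: m*m links per module, n*m modules."""
    return (n * m) * (m * m)


def lattice_switches(n, m):
    """Switch lattice: n*m^2 + n*m switches."""
    return n * m * m + n * m


print(horizontal_routing_connections(8, 8))  # 4096
print(lattice_switches(8, 8))                # 576
```

The gap widens by a further factor of about m as the number of labels grows, which matches the observation that routing dominates the candidate design as m increases.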
7. Conclusions
We have described several VLSI architectures for speeding up the computation of the Discrete Relaxation
Algorithm. The key ideas are a new tree-root pipelining scheme and a technique to dynamically configure the architectural wavefront. The implementations of these architectures offer much greater processing performance than general purpose processors. Further research in this area will aim to embed the highest degree of flexibility in the DRA design by allowing programmability in cells as well as reconfigurability of cell interconnections, in order to generate efficient and faster dynamically configurable MIMD DRA architectures.
REFERENCES
1. D. H. Ballard and C. M. Brown, Computer Vision, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 
1982.
2. B. K. P. Horn, Robot Vision, The MIT Press, Cambridge, 1986.
3. J. Ullman, An Algorithm for Subgraph Homomorphisms, In J. ACM. Vol. 23, pp. 31-42, Jan. 1976.
4. R. M. Haralick and L. G. Shapiro, The Consistent Labeling Problem: Part 1, In IEEE Trans. on 
Pattern Analysis and Machine Intelligence. Vol. PAMI-1, No. 2, pp. 173-184, April, 1979.
5. R. M. Haralick and L. G. Shapiro, The Consistent Labeling Problem: Part 2, In IEEE Trans. on 
Pattern Analysis and Machine Intelligence. Vol. PAMI-2, No. 3, pp. 193-203, May, 1980.
6. B. Nudel, Consistent-Labeling Problems and their Algorithms: Expected-Complexities and Theory- 
Based Heuristics, In Artificial Intelligence. Vol. 21, pp. 135-178,1983.
7. J. T. McCall, J. T. Tront, F. G. Gray, R. M. Haralick and W. M. McCormack, Parallel Computer 
Architectures and Problem Solving Strategies for the Consistent Labeling Problem, In IEEE Trans. on 
Computer. Vol. C-34, No. 11, pp. 973-980, November 1985.
8. R. Mohr and T. C. Henderson, Arc and Path Consistency Revisited, In Artificial Intelligence. Vol. 28, 
pp. 225-233,1986.
9. T. C. Henderson, Discrete Relaxation, Oxford University Press, London, to appear.
10. L. Snyder, Introduction to the Configurable, Highly Parallel Computer, In IEEE Computer, pp. 47-56, 
January, 1982.
11. W. Wang, J. Gu and T. C. Henderson, A Pipelined Architecture for Parallel Image Relaxation 
Operations, In IEEE Trans. on Circuits and Systems. Vol. CAS-34, No. 11, 1987. A detailed version is 
in Technical Report. UUCS-TR-86-008. Department of Computer Science, University of Utah, March, 
1987.
12. J. Gu, W. Wang and T. C. Henderson, An 0(n) Time Discrete Relaxation Architecture for Real-Time 
Processing of the Consistent Labeling Problem, Technical Report UUCS-TR-86-116. Department of 
Computer Science, University of Utah, December, 1986.
13. D. Ku, DRA1 Chip Implementation Report, Project Report, Department of Computer Science, 
University of Utah, March, 1986.
14. M. Gaspar, Discussion with the PPL Design Methodology and MOSIS Fabrication Technology, 
Private Communication, December, 1986.
15. MOSIS, DRA2 Chip Fabrication Report, Information Science Institute, University of Southern 
California, December, 1986.
16. W. Burger, A Virtual Shuffle Dynamic RAM Controller, Project Report, Department of Computer 
Science, University of Utah, February, 1986.
17. A. Ginzberg, Algebraic Theory of Automata, Academic, New York, 1968.
18. F. Harary, Graph Theory, Addison-Wesley, Reading, 1969.
19. R. V. Southwell, Relaxation Methods in Engineering Science, Oxford University Press, London, 1940.
20. D. Waltz, Understanding Line Drawings of Scenes with Shadows, In P. H. Winston (editor), 
Psychology of Computer Vision, pp. 19-91, McGraw-Hill, New York, 1975.
21. A. Rosenfeld, R. A. Hummel and S. W. Zucker, Scene Labeling by Relaxation Operations, In IEEE 
Trans. on Systems, Man, and Cybernetics. Vol. SMC-6, No. 6, pp. 420-433, June 1976.
22. J. P. A. Deutsch, A Short Cut for Certain Combinational Problems, In Proceedings of the British Joint 
Comput. Conference, 1966.
23. R. E. Fikes, REF-ARF: A System for Solving Problems Stated as Procedures, In Artificial Intelligence. 
Vol. 1, pp. 27-120, 1970.
24. A. Bundy (editor), Catalogue of Artificial Intelligence Tools, In SYMBOLIC COMPUTATION 
Artificial Intelligence, L. Bolc, A. Bundy, P. Hayes and J. Siekmann (editors), Springer-Verlag, Berlin, 
1984.
25. E. C. Freuder, Synthesizing Constraint Expressions, In Comm. ACM. Vol. 21, No. 11, 1978.
26. R. Kowalski, A Proof Procedure Using Connection Graphs, In J. ACM. Vol. pp. 572-595, October, 
1975.
27. R. M. Haralick and J. Kartus, Arrangements, Homomorphisms, and Discrete Relaxation, In IEEE 
Trans. on Systems, Man, and Cybernetics. Vol. SMC-8, No. 8, pp. 600-612, August, 1978.
28. E. G. Whitehead, Jr., Combinatorial Algorithm, Courant Inst. Mathematical Sciences, New York 
University, 1972.
29. C. Eastman, Preliminary Report on A System for General Space Planning, In Comm. ACM. Vol. 15, 
pp. 76-87, 1972.
30. R. W. Grossman, Some Data Base Applications of Constraint Expressions, Research Technical 
Report, LCS-158, MIT, 1976.
31. A. Newell and H. A. Simon, Human Problem Solving, Prentice-Hall Inc., Englewood Cliffs, 1972.
32. A. Nijenhuis and H. S. Wilf, Combinatorial Algorithms, Academic Press, New York, 1975.
33. U. Montanari, Networks of Constraints: Fundamental Properties and Applications to Picture 
Processing, In Inform. Sci. Vol. 7, pp. 95-132, 1974.
34. A. Rosenfeld, Networks of Automata: Some Applications, In IEEE Trans. on Systems, Man, and 
Cybernetics. Vol. SMC-5, pp. 380-383, 1975.
35. D. L. Waltz, Generating Semantic Descriptions from Drawings of Scenes with Shadows, MIT 
Technical Report AI271. November, 1972.
36. K. F. Smith, T. M. Carter and C. E. Hunt, Structured Logic Design of Integrated Circuits Using the 
Storage/Logic Array (SLA), In IEEE Trans. on Electron Devices. Vol. ED-29, No. 4, April, 1982.
37. A. B. Hayes, Self-Timed IC Design with PPL's, In R. E. Bryant (editor), Proceedings of the Third 
CalTech Conference on VLSI, pp. 257-273, Computer Science Press, Rockville, January, 1983.
38. U. Weiser and A. Davis, A Wavefront Notation Tool for VLSI Array Design, In H. T. Kung, B. Sproull 
and G. Steele (editors), Proceedings of CMU Conference on VLSI System and Computations, pp. 
226-234, Computer Science Press, October 1981.
39. H. T. Kung, Putting Inner Loops Automatically in Silicon, Lecture Notes in Computer Science Vol. 
163: VLSI Engineering, pp. 70-104, Edited by Tosiyasu L. Kunii, Springer-Verlag, 1985.
40. J. Miklosko and V. E. Kotov, Algorithms, Software and Hardware of Parallel Computers, Springer- 
Verlag, Berlin, 1984.
41. S. Y. Kung, et al., Wavefront Array Processor: Language, Architecture, and Applications, In IEEE 
Trans. on Computer. Vol. C-31, No. 11, pp. 1054-1066, November 1982.
42. D. J. Kuck, The Structure of Computers and Computations Vol. 1, John Wiley & Sons, New York, 
1978.
43. K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, New
York, 1984.
44. C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley Publishing Company, 
Reading, 1980.
