Abstract-The design of a flexible parallel architecture for both the discrete relaxation labeling (DRL) algorithm and the probabilistic relaxation labeling (PRL) algorithm is addressed. Through the analysis of parallelism in the computational models of both algorithms, the parallel execution of the algorithms on a flexible parallel architecture is presented. Three basic types of parallel operations are performed in the architecture: simultaneous, pipeline, and systolic. An illustrative example is used to show how the DRL algorithm can be executed on the parallel architecture. In doing so, the processing element (PE) organization and the combiner organization of the architecture are described. The same architecture with programmable functional units is shown to be able to execute the PRL algorithm, too. Performance comparisons between the proposed architecture and some existing ones are also given.
I. INTRODUCTION
Relaxation labeling algorithms have been applied successfully to a number of applications, for instance, signal restoration, language identification, graph homomorphism, image segmentation, and scene interpretation [1]-[8]. One of the major issues with this type of algorithm is that it is computationally intensive and typically requires exponential execution time on a single-processor architecture [9]. Many researchers have tried to build fast architectures [9]-[15] for executing the relaxation labeling algorithms. Uresin [...] (MCC) architectures to approach the relaxation problems. Their techniques have an appealing run time performance from a theoretical viewpoint, but suffer from physical realization problems such as data communication congestion and I/O routing complexities. Most of the above approaches remain in the phase of software simulation [9]. Lately, there has been an increasing interest in implementing relaxation operations by using dedicated hardware architectures that can be built in a modular fashion, for instance, in the form of systolic arrays. Several preferential factors of the systolic approach, such as the [...] The underlying idea behind these methods is the application of suitable transformations to the relaxation algorithms to obtain a representation that can be easily mapped onto their proposed architecture. Thus, different relaxation labeling algorithms give rise to different array designs. However, the function of the resultant systolic array is somewhat restricted, because the applied architecture is too fixed to cover different applications.
In this paper, we consider the relaxation labeling algorithms in two different forms: discrete relaxation labeling (DRL) and probabilistic relaxation labeling (PRL). Both of them are executable on our proposed architecture. The arrays of the architecture use one-dimensional, one-way communication lines between adjacent PE's and interact with the external environment through a single I/O port. Because of the hardware simplicity and programmability of the PE's, the architecture is well suited for VLSI implementation and is flexible enough to be adapted to a number of applications. Moreover, the proposed architecture runs in linear time for each iteration of the labeling process. It is also able to check the consistency status of the DRL algorithm and the convergence condition of the PRL algorithm at the hardware level without host involvement. In addition, the use of one-way flow through the arrays simplifies the circuit design, and the system can be converted to self-timed arrays to avoid the clock skew problem for large labeling processes [18].
The paper consists of five sections. Section II presents an overview of the two relaxation labeling algorithms: DRL and PRL. Section III analyzes the parallelism in the mathematical models of the two algorithms, which is the key to our design of a flexible parallel architecture. Three basic types of parallel operations are used: simultaneous, pipeline, and systolic. The processing element organization of the systolic array and the combiner organization are then introduced. An illustrative example is used to show how the DRL algorithm is executed on the architecture. The same architecture with programmable functional units is also shown to be able to execute the PRL algorithm. Section IV covers the performance comparisons between our architecture and some existing ones. The final section is a summary.
II. OVERVIEW OF TWO RELAXATION LABELING ALGORITHMS
Given a set of $N$ objects $U = \{U_1, U_2, \ldots, U_N\}$ and a set of $M$ labels $\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_M\}$, a relaxation labeling algorithm attempts to iteratively assign the labels to all objects such that these object-label assignments are consistent with a set of prespecified compatibility constraints (or coefficients) between the pairs of object-label assignments. We shall consider two types of algorithms: discrete relaxation labeling (DRL) and probabilistic relaxation labeling (PRL).

A. Discrete Relaxation Labeling (DRL)
In this case the labels are assigned to objects in an all-or-none fashion. The prespecified compatibility coefficients are such that $C_{i,j}(\lambda_t, \lambda_p) = 1$ if the assignment of label $\lambda_t$ to object $U_i$ is compatible with the assignment of label $\lambda_p$ to object $U_j$, and 0 otherwise. In the iterative relaxation labeling process, for each object $U_i$, let $L_i^k(\Lambda) = [L_i^k(\lambda_1), L_i^k(\lambda_2), \ldots, L_i^k(\lambda_M)]$ denote the vector of the label assignments given to object $U_i$ at the $k$th iteration, $k = 1, 2, 3, \ldots$, where $L_i^k(\lambda_t) = 1$ if label $\lambda_t$ is assigned to $U_i$ and 0 otherwise. The complete list of object-label assignments, called a labeling $\mathcal{L}^k$, is indicated by $\{L_1^k(\Lambda), L_2^k(\Lambda), \ldots, L_N^k(\Lambda)\}$. Initially, each object $U_i$ is assigned all labels in $\Lambda$, i.e., every entry of its initial label vector is 1. Obviously, these initial assignments may not be consistent with the prespecified compatibility coefficients, so the labeling will be updated through the following formula:
[...]
where $\cdot$, $\prod$, and $\sum$ denote the Boolean AND, PRODUCT, and SUM operations, respectively. When there exists an iteration $k$ such that $L_i^{k+1}(\lambda_t) = L_i^k(\lambda_t)$ for all $i \in [1, N]$ and $t \in [1, M]$, the labeling process is said to satisfy the consistency condition and the process stops.
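A minimal sketch of one DRL iteration may help make the Boolean AND, PRODUCT, and SUM roles concrete. It assumes the standard discrete relaxation update, in which a label stays assigned to an object only if every object still holds at least one label compatible with it. The nested-list data layout and the names `drl_step`, `drl_run`, and `compat` are illustrative assumptions, not the paper's notation.

```python
def drl_step(labeling, compat):
    """Apply one assumed DRL update to a 0/1 labeling.

    labeling[i][t]     = 1 if label t is still assigned to object i
    compat[i][j][t][p] = 1 if (object i, label t) is compatible
                         with (object j, label p)
    """
    n = len(labeling)
    m = len(labeling[0])
    new = [[0] * m for _ in range(n)]
    for i in range(n):
        for t in range(m):
            keep = labeling[i][t]
            for j in range(n):
                # Boolean SUM over labels p of (C AND L):
                # does object j still support label t on object i?
                support = any(
                    compat[i][j][t][p] and labeling[j][p] for p in range(m)
                )
                # Boolean PRODUCT over objects j, ANDed with the old value.
                keep = keep and int(support)
            new[i][t] = int(bool(keep))
    return new


def drl_run(labeling, compat, max_iter=100):
    """Iterate until the labeling stops changing (consistency condition)."""
    for _ in range(max_iter):
        new = drl_step(labeling, compat)
        if new == labeling:
            return new
        labeling = new
    return labeling
```

In this sketch, labels can only be removed, never restored, so the loop reaches a fixed point after at most N x M removals; `max_iter` is only a safety bound.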
B. Probabilistic Relaxation Labeling (PRL)
The PRL algorithm can be thought of as a generalization of the DRL algorithm. Instead of all-or-none assignments, the algorithm assigns probabilities to the object-label pairs. The probabilistic labeling estimates of object $U_i \in U$, denoted by $P_i(\lambda_t)$, $t \in [1, M]$, lie in the interval $[0, 1]$ and are updated during each iteration. The heuristic knowledge embedded in what are termed compatibility coefficients $CC = \{C_{i,j}(\lambda_t, \lambda_p),\ i, j \in [1, N];\ t, p \in [1, M]\}$ controls the contribution that the probability of assigning label $\lambda_p$ to object $U_j$ makes to the probability of assigning label $\lambda_t$ to object $U_i$. Historically, $C_{i,j}(\lambda_t, \lambda_p)$ is defined to take a value in the range $[-1, 1]$, where $-1$, $0$, and $1$ indicate "totally incompatible," "independent," and "totally compatible," respectively. The initial labeling estimate of [...] The updating of these labeling estimates stops if the estimates are unchanged or nearly unchanged after a certain finite number of iterations. In this case, a convergence condition is said to be reached.
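As an illustration of how such probabilistic estimates can be updated, the sketch below uses the classical Rosenfeld-Hummel-Zucker rule: each estimate is scaled by one plus its supporting evidence and then renormalized per object. Treating this as the update used here is an assumption, as are the uniform 1/N weighting of the evidence, the tolerance `tol`, and the names `prl_step` and `prl_run`.

```python
def prl_step(prob, compat):
    """One assumed PRL update (classical Rosenfeld-Hummel-Zucker form).

    prob[i][t]         = current probability of label t for object i
    compat[i][j][t][p] = compatibility coefficient in [-1, 1]
    """
    n = len(prob)
    m = len(prob[0])
    new = []
    for i in range(n):
        unnormalized = []
        for t in range(m):
            # Supporting evidence: average contribution from all objects j.
            q = sum(
                compat[i][j][t][p] * prob[j][p]
                for j in range(n)
                for p in range(m)
            ) / n
            unnormalized.append(prob[i][t] * (1.0 + q))
        total = sum(unnormalized)
        # Renormalize so the estimates of object i sum to 1 again.
        new.append(
            [u / total for u in unnormalized] if total > 0 else list(prob[i])
        )
    return new


def prl_run(prob, compat, tol=1e-6, max_iter=1000):
    """Iterate until the largest change in any estimate falls below tol."""
    for _ in range(max_iter):
        new = prl_step(prob, compat)
        change = max(
            abs(a - b) for ra, rb in zip(new, prob) for a, b in zip(ra, rb)
        )
        prob = new
        if change < tol:
            break
    return prob
```

The stopping test in `prl_run` corresponds to the "unchanged or nearly unchanged" convergence condition described above; in the proposed architecture this check is done in hardware rather than by a host program.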
III. PARALLEL EXECUTION OF RELAXATION LABELING ALGORITHMS
In the following we shall examine the parallelism in the computation models of both the DRL and PRL algorithms. From this analysis of parallelism we shall determine three basic types of parallel operations for hardware execution of these algorithms. We shall first describe the architecture for the DRL algorithm, then point out the modifications of the DRL architecture needed in order to execute the PRL algorithm.
A. The DRL Algorithm
To illustrate the idea more explicitly, let us consider a region color labeling problem [13]. Suppose that we are analyzing a picture with five regions which are to be colored in red, green, and blue, subject to certain compatibility constraints. The DRL formulation for this problem is given below:
1) The objects $U = \{U_1, U_2, \ldots, U_5\}$, where $U_i$ = region $i$. The object set size is $N = 5$.
2) The coloring labels $\Lambda = \{\lambda_1, \lambda_2, \lambda_3\}$, where $\lambda_1$ is red, $\lambda_2$ is green, and $\lambda_3$ is blue. The label set size is $M = 3$.
3) The compatibility coefficients (or constraints) between any two region-label assignments, $C_{i,j}(\lambda_t, \lambda_p)$.
All computations in (3) will be executed in parallel on a parallel architecture to be introduced below. Next we shall describe, in more detail, i) how the systolic computation of $S_i^k(\lambda_1)$ is done in the first row of five PE's in the main module, and ii) how the pipeline computation of $L_i^{k+1}(\Lambda)$ is done in the combiner module. We shall introduce the required hardware architecture, too. First of all, the input data streams at the $Y_{in}$ and $X_{in}$ ends of the first PE's of all three rows are given in Table I. These input streams step through the PE's in each row simultaneously. The three pipeline stages of a PE, as shown in Fig. 2(a), are designed so that:
1) The first PFIFO stage is to delay the input stream at [...] The delayed $L_i^k(\Lambda)$ stream is fed to the W line.
2) The I stage is to compute $S_{i,j}^k(\lambda_t)$ of (3a) for some $j \in [1, 5]$.
3) The A stage is to compute some partial product of $S_i^k(\lambda_t)$ of (3b).
4) The Z buffer is to adjust the data flow rate on the $X_{out}$ line such that it is one clock behind that on the Y line.
This synchronization is needed in the systolic operation for executing (3b). . . . Table III.
$S_1^k(\lambda_2)$ and $S_1^k(\lambda_3)$ are simultaneously generated in the second and third rows when $S_1^k(\lambda_1)$ is generated in the first row. The subsequent supporting evidences $S_i^k(\lambda_t)$ for $i = 2, 3, \ldots$, are generated in the same way during the clock cycles given in Table III.
The correct values of supporting evidence generated above in each row rely on the proper pairing between the compatibility coefficients and the labeling estimates, as given in (3a). The compatibility coefficients are arranged in a circular shift register in the five PE's of the first row and are ordered according to the timing relations needed to compute $S_1^k(\lambda_1)$, $S_2^k(\lambda_1)$, $S_3^k(\lambda_1)$, $S_4^k(\lambda_1)$, and $S_5^k(\lambda_1)$ in the first row of Fig. 3.
The compatibility coefficient arrangements in the PE's of the other rows in Fig. 3 can be obtained by simply changing $\lambda_1$ to $\lambda_2$ for the second row, and changing $\lambda_1$ to $\lambda_3$ for the third row.
Next, consider (3c) for computing the new labeling estimates. The generated new labeling estimate stream at the $R_{out}$ end is shown in Table IV. [...] starts to appear at the $X_{in}$ end.
[Figure: The compatibility coefficient arrangements in the PE's of the first row.]
B. The PRL Algorithm
[...]
1) The I stage is configured as a multiplier.
2) The A stage is configured as an adder.
3) The number of PE's in each row is extended to 15.
4) The lengths of the PFIFO buffer in the first PE's of the three rows are 14, 13, and 12, respectively, where $14 = N \times M - 1$. . . .
The computation of a new labeling estimate $P_i^{k+1}(\lambda_t)$ according to (4c) using the combiner is more complicated than before. The modifications include:
1) The first pipeline stage is changed to parallel-in-and-parallel-out (PIPO) [...]
2) The constant buffer C is set to 1.0 to get the term of [...]
3) The A stage is configured as an adder.
C. The General Parallel Architecture for DRL and PRL Algorithms
From the above, we can see that a common parallel architecture can be used for executing the DRL and PRL algorithms. The differences in the PE and combiner hardware organizations for these two cases can be settled by using programmable units in the organizations. The different modes of these functional units for the DRL and PRL cases are shown in Table VIII.
IV. PERFORMANCE EVALUATION
To evaluate the proposed architecture, we shall consider several evaluation factors including the time complexity, space utilization, and input channel bandwidth. The comparisons of our architecture with other relevant architectures for both DRL and PRL cases are shown in Tables IX and X, respectively.
A. The DRL Algorithm
Table IX summarizes the differences between our architecture and those by Gu et al. [14] and Resis and Kumar [12] for the execution of the DRL algorithm. Assuming the clock cycle time is $T$, the time complexity per iteration of our architecture is estimated as follows:
1) The time to produce the first new labeling estimate. It consists of two parts: a) $(N - 1)T$, which is the time required to wait for the arrival of $L_i^k(\Lambda)$ in order to compute the first partial result of a supporting evidence, and b) $(3 \times N + 5)T$, which is the time to get through both the main module and the combiner module.
2) The time for computing the subsequent labeling estimates. There are $N - 1$ subsequent labeling estimates to be generated at the rate of one per clock cycle, so it takes $(N - 1)T$ to complete.
As a result, the time complexity for a single iteration is $O(N)$. Table IX also lists several advantageous features of our architecture, such as the communication with the external environment through a single I/O port, the simplicity of the architecture using one-dimensional, one-way systolic arrays, and the programmability of the functional units.
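As a quick check of the DRL timing estimate, the sketch below adds the three terms from items 1) and 2) and reports the result in clock cycles (multiples of $T$). The function name `drl_cycles_per_iteration` is hypothetical, and the sketch assumes the three terms do not overlap in time.

```python
def drl_cycles_per_iteration(n):
    """Clock cycles for one DRL iteration with n objects.

    Sums the terms from the text: the initial wait (n - 1), the pass
    through the main and combiner modules (3n + 5), and the n - 1
    remaining estimates produced at one per clock cycle.
    """
    wait_for_first_input = n - 1
    traverse_modules = 3 * n + 5
    remaining_estimates = n - 1
    return wait_for_first_input + traverse_modules + remaining_estimates


# For the five-region example (N = 5) this gives 4 + 20 + 4 = 28 cycles.
print(drl_cycles_per_iteration(5))
```

The sum simplifies to $5N + 3$ cycles, which grows linearly in $N$, in line with the linear-time claim made in the Introduction.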
B. The PRL Algorithm
Table X summarizes the differences between our architecture and those by Kamada et al. [11] and Guerra [15] for executing the PRL algorithm. The time complexity per iteration is estimated as follows, assuming $T$ is the clock cycle time:
1) The time to produce the first new labeling estimate. It consists of two parts: a) $(N \times M - 1)T$, which is the time required to wait for the arrival of $P_i^k(\lambda_t)$ in order to compute the first partial result of a supporting evidence, and b) $(3 \times N \times M + M + 4)T$, which is the time to get through the main module and the combiner module.
2) The time for computing the subsequent labeling estimates. There are $N \times M - 1$ subsequent new labeling estimates to be generated at the rate of one per clock cycle, so it takes $(N \times M - 1)T$ to finish.
Therefore, the total time complexity for a single iteration is $O(N \times M)$. Although this is of the same order as the multiprocessor architecture proposed by Kamada et al., the multiprocessor system takes a much longer clock cycle time than ours due to its complicated task scheduling and synchronization. Moreover, ours is superior to the architecture proposed by Guerra as far as the time complexity is concerned. Our architecture does need more PE's than the other two architectures. Nevertheless, our PE circuit is simpler and can be implemented in a high-density VLSI chip. Furthermore, the convergence condition can be checked in our architecture at the hardware level without host involvement, while the other two systems cannot do so.
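The same bookkeeping applies to the PRL case. The sketch below adds the three terms from items 1) and 2) to give the clock cycles per iteration (multiples of $T$). The function name `prl_cycles_per_iteration` is hypothetical, and the sketch again assumes the terms do not overlap.

```python
def prl_cycles_per_iteration(n, m):
    """Clock cycles for one PRL iteration with n objects and m labels.

    Sums the terms from the text: the initial wait (n*m - 1), the pass
    through the main and combiner modules (3*n*m + m + 4), and the
    n*m - 1 remaining estimates produced at one per clock cycle.
    """
    wait_for_first_input = n * m - 1
    traverse_modules = 3 * n * m + m + 4
    remaining_estimates = n * m - 1
    return wait_for_first_input + traverse_modules + remaining_estimates


# For N = 5 objects and M = 3 labels this gives 14 + 52 + 14 = 80 cycles.
print(prl_cycles_per_iteration(5, 3))
```

The sum simplifies to $5NM + M + 2$ cycles, which is consistent with the stated $O(N \times M)$ complexity per iteration.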
V. SUMMARY
We have presented a flexible parallel architecture for executing the DRL and PRL algorithms. The architecture is designed based on the analysis of the parallelism in the mathematical models of the algorithms. We preload the compatibility coefficients into the PE's so that only the stream of labeling estimates needs to step through the PE's. The one-dimensional, one-way layout of the systolic arrays is suitable for VLSI fabrication. The performance evaluations listed in Tables IX and X show that the proposed architecture is generally better than the existing architectures.
In order to verify the design of the proposed systolic array and the combiner module, a simulation software package called the DAISY system [19] has been used to check the specification. We have finished the logic simulation to check the timing sequence of the architecture operation and the signal simulation to verify the intermediate processing results. The experiments show that the proposed architecture works properly. Future work includes the development of a custom VLSI chip and suitable architecture modifications to cover a wide spectrum of related algorithms.
