We present a digit-serial architecture for gray-scale morphological operations that operates on radix-2 redundant numbers. We present new implementations of a redundant number adder and a maximum unit, both of which are used in the morphological dilation unit. These new designs have areas comparable to 2's complement implementations, but have significantly smaller latencies.
Introduction
Morphological operations are widely used in image processing for feature extraction and machine vision [1, 2]. The basic morphological operations are dilation, erosion, opening, and closing. Complex operations such as skeletonization and feature extraction utilize these basic operations [3]. Morphological operations can be defined in the binary domain as well as in the gray-scale domain. Morphological operations in the gray-scale domain are significantly more complex than those in the binary domain. The operations in the gray-scale domain require additions and subtractions followed by minimum or maximum operations, instead of the simple Boolean logic functions used in the binary domain.
silicon, and as a result, have large area and large latency. Recently, gray-scale morphological operations that use threshold decomposition to map gray-scale operations into binary operations have been proposed in [9, 10]. Such implementations are very fast, but have area complexities that are proportional to $2^k$, where k is the wordlength.
In this paper we present an efficient digit-serial VLSI architecture for gray-scale morphology, which has an area complexity of O(N), where N is the window size, and a latency of O(1). This work is an extension of our work in [11]. The basic module is a dilation unit, which can be used to implement erosion, opening and closing. The dilation unit consists of adders followed by a maximum computation unit. We choose digit-serial radix-2 redundant arithmetic over bit-serial 2's complement in order to reduce the latency of the dilation unit from O(k) to O(1). Furthermore, the area requirements of the redundant number implementation are comparable to those of the 2's complement implementation. We present new designs for a most significant digit (MSD)-first adder and an MSD-first N-input maximum unit. The proposed architecture has a small area, supports variable wordlengths, and scales linearly with the input size. We have implemented this architecture in 2 micron custom CMOS using the Berkeley CAD tools and in 1.2 micron CMOS standard cells using the Mentor Graphics CAD tools.
The paper is organized as follows. In Section 2, we define the common morphological operations. In Section 3 we describe bit-serial 2's complement implementations and their limitations. We describe the proposed MSD-first digit-serial implementation in Section 4, and present the VLSI implementation of a dilation/erosion unit in Section 5.
Morphological Operations
The gray-scale morphological operations can be defined as follows [3]. Let f(x) and g(x) be one-dimensional gray-scale functions of coordinate x, where f(x) is the image and g(x) is the structuring element. Let E represent Euclidean space. Then $f : F \to E$ and $g : G \to E$. The dilation and the erosion operations are defined as follows:

Dilation:
$$(f \oplus g)(x) = \max\{f(x - z) + g(z)\} \quad \forall\, z \in G \text{ and } x - z \in F. \tag{1}$$

Erosion:
$$(f \ominus g)(x) = \min\{f(x + z) - g(z)\} \quad \forall\, z \in G \text{ and } x + z \in F. \tag{2}$$

The erosion operation can also be defined in terms of dilation as
$$f \ominus g = -\big((-f) \oplus \hat{g}\big), \tag{3}$$
where $\hat{g}(z) = g(-z)$ is the reflection of g. The opening operation can be defined in terms of dilation and erosion, and the closing operation can be defined in terms of opening.

Opening:
$$f \circ g = (f \ominus g) \oplus g \tag{4}$$

Closing:
$$f \bullet g = -\big((-f) \circ (-g)\big). \tag{5}$$

Since erosion, opening and closing can all be implemented by the dilation operation, we choose the dilation operation as our kernel operation. The aim of this paper is to develop an efficient architecture for the dilation unit. The architecture should have (i) small area, (ii) small latency, and (iii) support for variable wordlengths. We choose a digit-serial mode of operation since it satisfies requirements (i) and (iii). The proposed architecture should also be able to support large structuring elements. This can be easily achieved by decomposing the large structuring element into small ones, and computing the output in multiple passes.
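As an illustration of definitions (1) and (2), the following is a minimal software sketch, not the hardware architecture. It assumes that an image f and a structuring element g are stored as Python dictionaries mapping integer coordinates to gray levels, and that the output is evaluated at every x for which at least one term of the max or min exists; the function names and this boundary convention are assumptions made for the example.

```python
def dilate(f, g):
    """Gray-scale dilation: max over z in G of f(x - z) + g(z), with x - z in F."""
    xs = {u + z for u in f for z in g}  # every x with at least one valid term
    return {x: max(f[x - z] + g[z] for z in g if (x - z) in f) for x in sorted(xs)}


def erode(f, g):
    """Gray-scale erosion: min over z in G of f(x + z) - g(z), with x + z in F."""
    xs = {u - z for u in f for z in g}  # every x with at least one valid term
    return {x: min(f[x + z] - g[z] for z in g if (x + z) in f) for x in sorted(xs)}


def reflect(g):
    """Reflection of the structuring element: g_hat(z) = g(-z)."""
    return {-z: v for z, v in g.items()}


def negate(f):
    """Additive inverse of a gray-scale function."""
    return {x: -v for x, v in f.items()}


# Hypothetical sample data for the example.
f = {0: 3, 1: 1, 2: 4, 3: 1, 4: 5}
g = {-1: 0, 0: 1, 1: 0}

print(dilate(f, g))
# Erosion obtained from the dilation kernel by negation and reflection.
print(erode(f, g) == negate(dilate(negate(f), reflect(g))))
```

The last line checks numerically that erosion can be computed with the dilation routine alone, which is the property that motivates using dilation as the kernel operation.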
2's Complement Implementations
The gray-scale dilation unit is shown in Fig. 1. Here the N input samples represent the sample window of data or image elements, and are labeled f[1] to f[N]; the corresponding structuring elements are labeled g[1] to g[N]. The sample window can be one- or two-dimensional. In this paper we assume that the data is stored in a buffer and is accessed appropriately.
There are two possible implementations of a dilation unit when the inputs are represented by 2's complement numbers and are accessed least significant bit (LSB)-first. In the first implementation, the input samples and the structuring elements are first added by bit-serial adders. The outputs of the adders are then compared using a tree of 2-input comparison units. The comparisons are performed bit-serially, as in a subtracter unit. Thus the results of the comparisons are not available until the entire word has been processed. The latency of this implementation is $O(k \lceil \log_2 N \rceil)$.
In the second implementation, the input samples and the structuring elements are added using bit-serial adders, but now the outputs of the adders are fed to an N-input maximum computation (MAX) unit which operates on most significant bit (MSB)-first bit-serial data. Fig. 2 describes such an implementation. Here the output from each bit-serial adder is placed in a LIFO (Last In First Out) register which holds the results of the add operation until the entire word has been processed. The N-input MAX unit recursively computes the output bits, starting from the MSB. All inputs are initially candidates for the maximum output. The current output bit is the bit with the maximum value among the inputs in the candidate set. In each cycle, the candidacy of an input is updated by comparing the current output bit with the current input bit. Since the output of the MAX unit is generated MSB-first, it requires an additional LIFO to reverse the bit order for input/output compatibility. The latency of this implementation is O(k).
In an attempt to further reduce the latency, Lee, Chen and Jen [12] proposed a scheme consisting of candidate pair selection (which selects the correct (f[j], g[j]) input pair that contributes to the output), followed by a single addition. Unfortunately, this scheme is erroneous, since both f and g are represented in binary 2's complement form, and thus an MSB-first candidate pair selection cannot be done on the inputs prior to adding the bits MSB-first.
The latency of the second implementation cannot be reduced to less than O(k) when the inputs are represented in 2's complement form, because 2's complement addition is inherently an LSB-first operation, while computing the maximum is inherently an MSB-first operation. If, however, the numbers are represented in the redundant number system or in the carry save system, both the addition and the maximum can be done most significant digit (MSD)-first, and the latency of the dilation unit reduces from O(k) to O(1).
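The candidate-set recursion of the MSB-first MAX unit can be sketched in software as follows. This is a minimal behavioral model under stated assumptions, not the circuit of Fig. 2: the helper names are hypothetical, and the treatment of the sign bit (a 0 sign bit wins in the first cycle because it marks a non-negative 2's complement word) is an assumption added for the example.

```python
def to_bits(value, k):
    """k-bit 2's complement representation of value, MSB first."""
    return [(value >> (k - 1 - i)) & 1 for i in range(k)]


def from_bits(bits):
    """Integer value of a 2's complement bit list given MSB first."""
    k = len(bits)
    magnitude = sum(b << (k - 1 - i) for i, b in enumerate(bits) if i > 0)
    return -bits[0] * 2 ** (k - 1) + magnitude


def msb_first_max(words):
    """Bit-serial maximum of equal-length 2's complement words, MSB first."""
    n, k = len(words), len(words[0])
    candidate = [True] * n  # all inputs start in the candidate set
    out = []
    for i in range(k):
        current = [words[j][i] for j in range(n) if candidate[j]]
        # Assumption: in the sign position a 0 beats a 1; elsewhere a 1 beats a 0.
        z = min(current) if i == 0 else max(current)
        out.append(z)
        # An input stays a candidate only if its bit matches the output bit.
        for j in range(n):
            if candidate[j] and words[j][i] != z:
                candidate[j] = False
    return out


samples = [3, -2, 5, -8, 5]  # hypothetical adder outputs
result = msb_first_max([to_bits(v, 4) for v in samples])
print(from_bits(result))
```

Each output bit is available one cycle after the corresponding input bits arrive, but the first input bit cannot arrive until the LSB-first adders have finished the whole word, which is the source of the O(k) latency.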
Redundant Number Implementation
In this section, we describe a digit-serial implementation where the numbers are represented in the radix-2 redundant system. We chose a radix-2 system over a higher order radix system, since the input formatting is straightforward, and the hardware components have area and latency comparable to binary systems. The proposed architecture consists of (i) N digit-serial adders which add radix-2 redundant numbers to 2's complement numbers, and (ii) a new digit-serial MAX unit which finds the maximum of a set of N radix-2 redundant numbers. The block diagram of the proposed dilation unit is given in Fig. 3. This dilation unit has a latency of O(1) and does not require any LIFO registers.
It is also possible to achieve a similar reduction in the latency and in the number of LIFO registers by operating on carry save numbers. Carry save numbers also allow addition and MAX operations to be performed MSD-first. We choose to work with the radix-2 redundant number system because it is easier to perform inversions with radix-2 redundant numbers. Recall that inversions are required to build erosion, opening and closing units using the dilation unit (refer to Section 2).
Redundant Adder
The proposed redundant adder is a digit-serial adder which adds a radix-2 redundant number, f, to a binary 2's complement number, g, and outputs a radix-2 redundant number, y. We represent the image samples, f, in redundant form to allow us to use the result of one dilation as the input to the next dilation. The image samples could also be input as 2's complement numbers and be converted into radix-2 redundant numbers. This conversion is straightforward. The structuring element, g, is represented in 2's complement form, as it is never the result of a previous dilation. This simplifies the design of the redundant adder. The proposed adder is an extension of the digit-parallel adder of [13], which adds a k-bit 2's complement number to a k-digit redundant number, and outputs a (k+1)-digit redundant number. We have modified this adder to operate as a digit-serial adder, and also to correct the MSD so that the output can be represented by k digits (instead of k+1 digits as in [13]).
In the proposed adder, the redundant digits belong to the set {-1, 0, 1}. Each digit is coded with 2 bits. Digit 0 is coded with 00 or 11, digit 1 is coded with 01, and digit -1 is coded with 10.
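The digit coding above, and the conversion of a 2's complement sample into redundant form, can be illustrated with a short sketch. The helper names are hypothetical. The conversion rule used here (the sign bit becomes a digit of -1 or 0 and all other bits are copied) follows from the weight of the sign bit in 2's complement and is an assumption about what the straightforward conversion looks like.

```python
# Two-bit code -> digit, as stated in the text; both 00 and 11 represent digit 0.
DECODE = {(0, 0): 0, (1, 1): 0, (0, 1): 1, (1, 0): -1}
# One canonical code per digit (choosing 00 for digit 0 is an arbitrary choice here).
ENCODE = {0: (0, 0), 1: (0, 1), -1: (1, 0)}


def decode_digit(code):
    """Digit value of a two-bit code; equals second bit minus first bit."""
    return DECODE[code]


def redundant_value(digits):
    """Integer value of a radix-2 redundant number given MSD first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def twos_complement_to_redundant(bits):
    """Convert MSB-first 2's complement bits to redundant digits.

    The sign bit has weight -2^(k-1), so it maps to digit -b0; the
    remaining bits already are valid digits in {0, 1}.
    """
    return [-bits[0]] + list(bits[1:])


# The same value can have several redundant representations.
print(redundant_value([1, -1, 0, 1]), redundant_value([0, 1, 0, 1]))
print(redundant_value(twos_complement_to_redundant([1, 1, 0, 1])))
```

The existence of several digit strings for one value is what allows addition to proceed MSD-first without waiting for a carry to ripple from the least significant end.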
Since the structuring element, g, is in 2's complement form, its digits belong to the set {0, 1}. In our notation, $f_i$ represents digit i of input f, where i = 0 corresponds to the MSD, and $f^c$ represents the cth bit that is used to encode a particular digit, with c = 0 for the MSB and c = 1 for the LSB.

Fig. 4 describes the proposed redundant adder. It consists of an adder followed by an MSD correction unit. The inputs are processed digit-serially, MSD-first. In the ith cycle, the adder adds $f_i$ and $g_i$ and generates an interim sum, $s_i$, and carry, $c_i$. Since we are adding the digits MSD-first, the output digit depends on both the current carry bit and the sum bit from the previous cycle.
Thus the adder has a latency of 1 cycle. Let R be the control signal that is active high when it accompanies the MSD of the operands (R = 1 for i = 0, and R = 0 otherwise).
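The interim sum and carry mechanism can be illustrated with a behavioral sketch. This is one possible decomposition under stated assumptions, not the circuit of Fig. 4: each digit sum is split as 2c + s with s restricted to {-1, 0}, the sign bit of g is given negative weight, and the MSD correction step is omitted, so the sketch returns k+1 digits as in the digit-parallel adder. The function names are hypothetical.

```python
def value(digits):
    """Integer value of a radix-2 signed-digit number given MSD first."""
    total = 0
    for d in digits:
        total = 2 * total + d
    return total


def msd_first_add(f_digits, g_bits):
    """Add redundant digits f (in {-1, 0, 1}) to 2's complement bits g, MSD first.

    Returns k+1 redundant digits (no MSD correction is applied here).
    """
    out = []
    prev_sum = 0
    for i, (fd, gb) in enumerate(zip(f_digits, g_bits)):
        g = -gb if i == 0 else gb  # sign bit of g has negative weight
        t = fd + g                 # digit sum for this position
        s = -(t % 2)               # interim sum in {-1, 0}
        c = (t - s) // 2           # carry, so that t == 2 * c + s
        if i == 0:
            out.append(c)          # leading carry becomes the extra digit
        else:
            # Output digit for position i - 1: previous sum plus current carry.
            out.append(prev_sum + c)
        prev_sum = s
    out.append(prev_sum)           # last interim sum is the least significant digit
    return out


f = [1, -1, 0, 1]  # value 5 in redundant form
g = [1, 1, 0, 1]   # value -3 in 2's complement
y = msd_first_add(f, g)
print(y, value(y))
```

Because the interim sum is never positive and the carry for positions after the first is never negative, their combination always stays within {-1, 0, 1}, so each output digit is final one cycle after its inputs arrive.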
N Input Redundant Max Unit
The redundant adders output k-digit radix-2 redundant numbers, and thus cannot be fed directly to the MAX unit for 2's complement numbers. One possible way of handling this is to convert the radix-2 redundant numbers into 2's complement numbers using the method of [14], and then feed them to the MAX unit. However, such a method is inefficient since it requires an additional 2N registers and an additional delay of k for the conversion. Instead, we propose a new MAX unit which operates on a set of N radix-2 redundant numbers.
The N-input redundant MAX unit is a digit-serial unit which finds the maximum of N redundant numbers entered MSD-first, and outputs a redundant number which is equivalent to the maximum number. Here each input y[j] has a state variable p[j] associated with it, where p[j] can be one of three states: candidate, undecided, or eliminated. Initially, all the inputs belong to the candidate set. The current output digit, z, is the maximum digit among the inputs in the candidate set. In each cycle, the state of each input is updated by comparing the current input digit to the current output digit. If p[j] = candidate, then y[j] is a candidate for the maximum output. If p[j] = undecided, then y[j] may become a candidate for the output in the future. If p[j] = eliminated, then y[j] can never be the maximum output. The accompanying figure shows the state transition diagram of the MAX unit. The operation of the MAX unit is described by the following algorithm.
Algorithm:
Step 1: For j = 1 to N do in parallel: $p[j]_0$ = candidate.
Step 2: Repeat steps 3, 4, and 5 for each digit i = 0 to k - 1. (It is assumed that steps 3, 4, and 5 are applied in parallel to all the inputs j = 1 to N.)
Step 3: $z_i = \max_j \{\, y[j]_i : p[j]_i = \text{candidate} \,\}$.
Step If the dilation unit is used for erosion, then the inputs and outputs have to be replaced by their additive inverses. In other words, digit 1 has to be replaced by -1, and digit -1 has to be replaced by 1. Since digit 1 is represented by 01, and digit -1 is represented by 10, the bits corresponding to a digit must be inverted. Since the output of the dilation unit is a radix-2 redundant number, it can be fed directly to the redundant adders of other erosion and dilation units.
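The three-state MAX recursion and the use of digit negation for erosion can be sketched together. The state update below is an assumption chosen to be consistent with the description of the candidate, undecided and eliminated states: it tracks the gap between the output prefix and each input prefix, measured in units of the current digit weight, and is not claimed to be the exact update steps of the algorithm. All function names are hypothetical.

```python
CANDIDATE, UNDECIDED, ELIMINATED = "candidate", "undecided", "eliminated"


def value(digits):
    """Integer value of a radix-2 redundant number given MSD first."""
    total = 0
    for d in digits:
        total = 2 * total + d
    return total


def redundant_max(inputs):
    """MSD-first maximum of equal-length redundant numbers (behavioral sketch)."""
    n, k = len(inputs), len(inputs[0])
    state = [CANDIDATE] * n
    out = []
    for i in range(k):
        # Output digit: maximum digit among the current candidates.
        z = max(inputs[j][i] for j in range(n) if state[j] == CANDIDATE)
        out.append(z)
        for j in range(n):
            if state[j] == ELIMINATED:
                continue
            # Assumed update: gap 0 -> candidate, gap 1 -> undecided,
            # gap 2 or more -> eliminated (it can no longer be closed).
            gap = (0 if state[j] == CANDIDATE else 2) + z - inputs[j][i]
            state[j] = (CANDIDATE if gap == 0
                        else UNDECIDED if gap == 1
                        else ELIMINATED)
    return out


def negate(digits):
    """Additive inverse; at the bit level this inverts both coding bits."""
    return [-d for d in digits]


def redundant_min(inputs):
    """Minimum obtained from the MAX sketch by negating inputs and output."""
    return negate(redundant_max([negate(x) for x in inputs]))


ys = [[0, 1, 1], [1, -1, -1], [1, -1, 0]]  # values 3, 1 and 2
print(value(redundant_max(ys)), value(redundant_min(ys)))
```

In this example the first input falls behind after the first digit, becomes undecided, and returns to the candidate set in the second cycle, which is the situation the undecided state exists to handle.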
Implementation
We have implemented a morphology chip which functions either as a dilation unit or as an erosion unit, depending on a control signal. We have implemented this unit in three ways. First, we used custom CMOS and the Berkeley CAD tools. Next, we handcrafted the design using standard cells and the Mentor Graphics CAD tools. Finally, we described the design in VHDL and synthesized it using standard cells and the Mentor Graphics CAD tools. A comparison of the three chip layouts is shown in Table 1. Fig. 10 shows the layout of the standard cell implementation produced by the VHDL version. We have simulated all implementations and have fabricated the VHDL-synthesized 1.2 micron standard cell design. The sample period of these implementations is bounded by the delay in the MAX unit. We are currently studying ways in which the delay can be reduced by applying look-ahead techniques as in [15], and by processing multiple digits at a time.
Conclusion
We have presented a new architecture for gray-scale morphological operations with the dilation unit as the basic module. The dilation unit operates on radix-2 redundant numbers in the digit-serial MSD-first mode. This implementation has a latency of O(1), compared to O(k) for the existing digit-serial implementations. We have proposed new designs for an adder and an N-input MAX unit, both of which operate on radix-2 redundant numbers. We have designed the dilation unit so that it can be used for both dilation and erosion. This allows the dilation unit to be used as part of a larger morphological engine to implement open and close operations, as shown in Fig. 9.
Acknowledgements
We would like to acknowledge the students of the VLSI Design classes at the University of Minnesota and Arizona State University for their help in the implementations. 