# Murdoch 

UNIVERSITY

# MURDOCH RESEARCH REPOSITORY 

http://dx.doi.org/10.1049/ip-cds:19941109

Kuczborski, W., Attikiouzel, Y. and Crebbin, G. (1994) Decomposition of logic networks with emphasis on signed digit arithmetic systems. IEE Proceedings - Circuits, Devices and Systems, 141 (4). pp. 307-314.
http://researchrepository.murdoch.edu.au/18735/

# Decomposition of logic networks with emphasis on signed digit arithmetic systems 

W. Kuczborski<br>Y. Attikiouzel<br>G. Crebbin


#### Abstract

This paper describes an attempt to combine advantages of the signed digit number representation, applied at the word-level, and the residue number system applied at the digit-level, to achieve arithmetic decomposition of high-radix systems. Also introduced is a new decomposition algorithm for multiple-output Boolean functions based on partition products. Analysis of the proposed new method of arithmetic decomposition, when compared to an approach based on the theory of digit sets, reveals a more efficient use of data storage plus a higher degree of structural uniformity. The practical importance of the proposed method has been tested on a number of designs for the field programmable gate arrays. Comparison with a commercially available CAD system indicates a significant reduction in implementation complexity.


## 1 Introduction

Interest in the arithmetic decomposition of high-radix systems and the Boolean decomposition for multiple-output functions was prompted by the needs of real-time grey-scale morphological processors. The enormous hardware complexity of processors applying the principles of umbra transform [4] or threshold decomposition [11] directed attention towards the direct implementation of morphological transformations. The direct approach demands a fast implementation of two conflicting operations: (1) addition and maximum search (dilation); (2) subtraction and minimum search (erosion). A conflict results from the characteristic of conventional arithmetic systems, which require opposite directions of digit processing.

Image data throughputs of the order of $10^{8}$ pixels per second and a high degree of design regularity can be achieved by applying the principles of digit-level systolic arrays [7]. However, the conflict between addition and magnitude comparison introduces loops into the dependence graph, and the whole problem becomes non-computable. The solution adopted is to replace the conventional arithmetic system by the signed digit number representation (SDNR). The SDNR eliminates loops in the dependence graph by allowing a uniform direction of addition/subtraction and magnitude comparison - from the most to the least significant digit. Moreover, carry propagation is restricted to a single digit.

© IEE, 1994

Paper 1109 G (C2), first received 22nd February and in revised form 20th December 1993

W. Kuczborski is at Edith Cowan University, Department of Computer and Communication Engineering, Joondalup WA6065, Australia

Y. Attikiouzel and G. Crebbin are with The University of Western Australia, Centre for Intelligent Information Processing Systems, Nedlands WA6009, Australia

SDNR systems are more difficult to implement than conventional arithmetic circuits. Boolean functions, which define the SDNR systems, are more complex and have a larger number of independent variables. Apparently some new methods of logic synthesis are required for a wider application of this non-conventional data representation. In most cases, especially for systems of radices higher than two, a complex logic network must be replaced by a cluster of simpler ones.

Carter and Robertson [2] applied the theory of digit sets to replace high-radix modules by simpler modules of radix two. An alternative approach is proposed here, based on a combination of the SDNR, applied at the word level, and the residue number system (RNS), applied at the digit level. Such a combination simplifies the complexity of basic modules and reduces the number of inputs to these modules. If the selected implementation technology, e.g. programmable logic arrays (PLAs), gate arrays, or field programmable gate arrays (FPGAs), requires a further reduction of the input number, then another design step is applied - the decomposition of the multivalued Boolean functions [3]. A new decomposition algorithm is presented here, based on partition products. The algorithm has been tested on FPGAs from Xilinx which allow five Boolean variables per logic module.

Being aware of the speed and density limitations of the technology, when compared with full custom or semicustom designs, the choice was influenced by two strengths of FPGAs: (1) fast and zero-cost modifications of the prototypes, a significant virtue for the unconventional data representation employed; (2) the very complex SDNR functions can be efficiently implemented by look-up tables of the Xilinx FPGAs.

## 2 Principles of the signed digit representation (SDNR) and the residue number system (RNS)

### 2.1 SDNR

The SDNR [1] has become an attractive alternative for dedicated VLSI systems such as morphological processors [6], CORDIC processors [13], and IIR filters [9].

The SDNR is a redundant data format which requires additional memory space. However, the redundancy reduces carry propagation to at most a single digit. The SDNR has a number of important advantages: parallel processing of all digits, modularity, variable operand lengths, regular logic structures, and local connections. SDNR, unlike RNS, is a positional number system:

$$
N_{S D N R}=\sum_{i=0}^{n} a_{i} r^{i}
$$

The digits $a_{i}$ can be negative, zero, or positive. For symmetric digit sets $\{-a,-a+1, \ldots,-1,0,1, \ldots, a-1, a\}$, the allowed range of $a$ becomes

$$
a_{\text {min }}=r / 2+1
$$

(minimal redundant set), and

$$
a_{\max }=r-1
$$

(maximal redundant set), where $r$ is the radix. A sensible choice of carry values is within the range $-1$ to $1$. The threshold value $t$, which determines carry generation, must be within the range:

$$
1 \leqslant r-a \leqslant t \leqslant a-1
$$
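As a worked instance of these bounds (the radix $r=10$ is chosen here only for illustration, because the same radix appears in the later examples):

$$
\begin{aligned}
r=10: \quad & a_{\min }=10 / 2+1=6, \qquad a_{\max }=10-1=9 \\
a=9: \quad & 1 \leqslant 10-9 \leqslant t \leqslant 9-1 \;\Rightarrow\; 1 \leqslant t \leqslant 8 \\
a=6: \quad & 1 \leqslant 10-6 \leqslant t \leqslant 6-1 \;\Rightarrow\; 4 \leqslant t \leqslant 5
\end{aligned}
$$

A larger digit set therefore leaves more freedom in the choice of the threshold.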

For the addition $S U M=X+Y$, the sum digit at the $i$ th position is a function of only four arguments, $X_{i}, Y_{i}$, $X_{i-1}, Y_{i-1}$, and can be calculated in two stages:

Stage 1:

$$
\begin{array}{lll}
\text { if } & X_{i}+Y_{i}>t & \text { then } C_{i-1}=1 \\
\text { else if } & X_{i}+Y_{i}<-t & \text { then } C_{i-1}=-1 \\
\text { else } & & C_{i-1}=0
\end{array}
$$

$$
S_{i}=X_{i}+Y_{i}-r C_{i-1}
$$

Stage 2:

$$
S U M_{i}=S_{i}+C_{i}
$$
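The two stages can be sketched in a few lines of Python. This is only an illustrative behavioural model, not the circuit of the paper: the function names `sdnr_add` and `sdnr_value` are hypothetical, digits are stored most significant first (an assumption of this sketch), and the operands are those of the radix-10 example with $t=8$ used later in Table 3.

```python
def sdnr_add(x, y, r=10, t=8):
    """Add two equal-length SDNR digit lists (most significant digit first)."""
    if len(x) != len(y):
        raise ValueError("operands must have the same number of digits")
    interim, carries = [], []
    # Stage 1: every position independently produces an interim sum and a carry.
    for xd, yd in zip(x, y):
        s = xd + yd
        c = 1 if s > t else -1 if s < -t else 0
        carries.append(c)
        interim.append(s - r * c)
    # Stage 2: each position absorbs the carry generated one position to its right.
    incoming = carries[1:] + [0]
    return [carries[0]] + [s + c for s, c in zip(interim, incoming)]


def sdnr_value(digits, r=10):
    """Integer value of an SDNR digit list (most significant digit first)."""
    value = 0
    for d in digits:
        value = value * r + d
    return value


x = [-9, 3, -7, 3]   # -8767
y = [1, 8, -6, 4]    # +1744
total = sdnr_add(x, y)
print(total, sdnr_value(total))
```

Because a carry never travels further than one position, Stage 1 and Stage 2 can each be evaluated for all digits at once, which is what the list comprehensions mimic.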

### 2.2 RNS

The RNS $[5,12,14]$ is defined by a set of relatively prime moduli

$$
\{p 1, p 2, \ldots, p n\}
$$

The dynamic range of the numbers is defined by the product of all moduli:

$$
0 \cdots(p 1 \times p 2 \times \cdots \times p n)-1
$$

The use of the RNS for coding the SDNR digits requires a range which includes negative numbers as well:

$$
-(p 1 \times p 2 \times \cdots \times p n / 2) \cdots(p 1 \times p 2 \times \cdots \times p n / 2)-1
$$

A numerical value $N$ can be converted into its $n$-digit RNS equivalent

$$
N=\langle X P 1, X P 2, \ldots, X P n\rangle
$$

by modulo operations:

$$
\begin{aligned}
& X P 1=N \bmod p 1 \\
& X P 2=N \bmod p 2, \ldots, X P n=N \bmod p n
\end{aligned}
$$

The main advantage of the residue arithmetic lies in its parallelism - additions, subtractions, and multiplications can be executed at each digit independently. This parallelism was utilised in the first stage of logic synthesis - arithmetic decomposition (next section).
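The conversion and the digit-wise arithmetic described above can be illustrated with a short Python sketch. The helper names (`to_rns`, `rns_add`, `rns_mul`, `from_rns_signed`) are hypothetical, the moduli 4 and 7 are borrowed from the later examples, and decoding is done by brute-force search over the signed range purely for clarity.

```python
from math import prod


def to_rns(n, moduli):
    """Residues of n; Python's % already returns a non-negative result."""
    return tuple(n % p for p in moduli)


def rns_add(a, b, moduli):
    """Carry-free addition: every residue digit is handled independently."""
    return tuple((x + y) % p for x, y, p in zip(a, b, moduli))


def rns_mul(a, b, moduli):
    """Carry-free multiplication, again one modulus at a time."""
    return tuple((x * y) % p for x, y, p in zip(a, b, moduli))


def from_rns_signed(code, moduli):
    """Decode into the signed range -(M/2) .. (M/2)-1 by exhaustive search."""
    m = prod(moduli)
    for n in range(-(m // 2), m - m // 2):
        if to_rns(n, moduli) == tuple(code):
            return n
    raise ValueError("code outside the dynamic range")


moduli = (4, 7)
print(to_rns(-9, moduli))                                             # (3, 5)
print(from_rns_signed(rns_add(to_rns(-9, moduli), to_rns(5, moduli), moduli), moduli))
print(from_rns_signed(rns_mul(to_rns(3, moduli), to_rns(-4, moduli), moduli), moduli))
```

The exhaustive decoder is acceptable here only because the dynamic range is tiny; it also makes visible why magnitude comparison is awkward in the RNS, since no residue digit carries positional weight.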

Despite its parallelism, the RNS is not an efficient representation of image data in the morphological processor. The reason for this is the strongly non-linear character of mathematical morphology, requiring constant magnitude comparisons. According to Winograd's lower bound theorem [14], the speed potential of the RNS can be fully utilised only if the number of additions significantly outnumbers the number of magnitude comparisons. However, the usual difficulties with RNS magnitude comparisons or sign detections can be avoided if the application of the RNS is restricted to SDNR digit level.

## 3 Arithmetic decomposition based on the SDNR and the RNS

As mentioned above, the SDNR system is used to represent data of the grey-scale morphological processor. Although the signed digit representation demands additional storage space for signal/image data, memory requirements can be reduced for systems with higher radices. Higher radices have other advantages too. They reduce the interconnection complexity, which is an important consideration in view of the fact that interconnections occupy approximately $70 \%$ of silicon area of the VLSI devices. Finally, high-radix systems may be faster - for example, multiplication will be completed after fewer algorithmic steps.

The logic synthesis procedure is executed in two consecutive steps. The first step, based on a combination of SDNR and RNS, reduces the number of inputs to each module by a factor of two. The SDNR, applied at the word level, allows easy magnitude comparisons and sign detections. The RNS, applied at the digit level, decomposes each SDNR digit into simpler networks. Since the dynamic range of RNS values is very small and is determined by the SDNR digit set, the RNS conversions and sign detections are handled by unified and simple logic circuits. It is seen in the next section that the SDNR/RNS decomposition method reduces storage requirements, when compared to an alternative method based on the theory of digit sets [2].

An important decision for designers of an RNS system is the choice of relatively prime moduli. Assuming that the system is implemented using conventional digital circuits and that the maximum number of inputs in each module is restricted to six, the choice of moduli depends on the required set of SDNR digits. Table 1 specifies the choices of moduli for various radices of the arithmetic system. The moduli guarantee maximum dynamic ranges of digit sets for three, four, five and six bits per SDNR digit. It is assumed that all digit sets are symmetric.

Table 1: Choice of moduli for various radices

| Bits/SDNR digit | Moduli $p1, p2$ | Dynamic range $p1 \times p2$ | $a$ ($p1 \times p2 \geqslant 2a+1$) | Maximum radix for minimum redundant SDNR ($2a-1$) |
| :--- | :--- | :--- | :--- | :--- |
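The selection of moduli can be explored with a small search. The sketch below rests on assumptions that are not spelled out in the text: exactly two moduli are used, each modulus needs at most three bits (so that a two-operand module has at most six inputs), and the pair with the largest product is preferred. Under these assumptions the search returns the moduli quoted elsewhere in the paper for five bits (4 and 7) and six bits (7 and 8); it is not claimed to reproduce the remaining entries of Table 1. The function names are hypothetical.

```python
from math import gcd


def bits(p):
    """Number of bits needed to encode the residues 0 .. p-1."""
    return (p - 1).bit_length()


def best_moduli(total_bits, max_bits_per_modulus=3):
    """Coprime pair (p1 < p2) with the largest product that fits the bit budget."""
    best = None
    limit = 2 ** max_bits_per_modulus
    for p1 in range(2, limit + 1):
        for p2 in range(p1 + 1, limit + 1):
            if gcd(p1, p2) != 1:
                continue                      # moduli must be relatively prime
            if bits(p1) + bits(p2) > total_bits:
                continue                      # digit would need too many bits
            if best is None or p1 * p2 > best[0] * best[1]:
                best = (p1, p2)
    return best


for total_bits in (5, 6):
    p1, p2 = best_moduli(total_bits)
    n = p1 * p2
    a = (n - 1) // 2                          # largest a with n >= 2a + 1
    print(total_bits, (p1, p2), n, a, 2 * a - 1)
```

For six bits the last column evaluates to 53, which agrees with the largest radix mentioned in the text.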

The principle of the arithmetic SDNR/RNS decomposition is based on RNS parallelism, which allows independent handling of the RNS digits because of carry-free arithmetic.

Assume it is required to add two SDNR digits of a radix-10 system. To represent the maximum redundant digit set of $\{-9, \ldots, 9\}$, the sign-and-magnitude (or two's complement) code would require five bits. Then, a single-digit adder would require 10 inputs - too many for an efficient implementation of the circuit using FPGAs or similar technology. However, if the sign-and-magnitude code is replaced by the RNS, no module will have more than six inputs.

Table 2 specifies the RNS representation of the digit set $\{-9, \ldots, 9\}$ for moduli $p 1=4$ and $p 2=7$. The single line indicates intermediate sums which require a correction by -10 and generate carry $=1$; the double line indicates a correction by +10 and carry $=-1$.

Table 2: RNS representation of the digit set $\{-9, \ldots, 9\}$ for moduli $p 1=4, p 2=7$

| SDNR | RNS | SDNR | RNS |
| :---: | :---: | :---: | :---: |
| 0 | $\langle 0,0\rangle$ | 14, -14 | $\langle 2,0\rangle$ |
| 1 | $\langle 1,1\rangle$ | 15, -13 | $\langle 3,1\rangle$ |
| 2 | $\langle 2,2\rangle$ | 16, -12 | $\langle 0,2\rangle$ |
| 3 | $\langle 3,3\rangle$ | 17, -11 | $\langle 1,3\rangle$ |
| 4 | $\langle 0,4\rangle$ | 18, -10 | $\langle 2,4\rangle$ |
| 5 | $\langle 1,5\rangle$ | -9 | $\langle 3,5\rangle$ |
| 6 | $\langle 2,6\rangle$ | -8 | $\langle 0,6\rangle$ |
| 7 | $\langle 3,0\rangle$ | -7 | $\langle 1,0\rangle$ |
| 8 | $\langle 0,1\rangle$ | -6 | $\langle 2,1\rangle$ |
| 9 | $\langle 1,2\rangle$ | -5 | $\langle 3,2\rangle$ |
| 10, -18 | $\langle 2,3\rangle$ | -4 | $\langle 0,3\rangle$ |
| 11, -17 | $\langle 3,4\rangle$ | -3 | $\langle 1,4\rangle$ |
| 12, -16 | $\langle 0,5\rangle$ | -2 | $\langle 2,5\rangle$ |
| 13, -15 | $\langle 1,6\rangle$ | -1 | $\langle 3,6\rangle$ |

In the paired entries, the positive value (10 to 18) requires a correction by $-10$ and generates carry $=1$; the negative value ($-18$ to $-10$) requires a correction by $+10$ and generates carry $=-1$.

Table 3 clarifies SDNR addition for $r=10, a=9$ and $t=8$. The equivalent SDNR/RNS addition operates independently on each module to calculate a non-corrected intermediate sum. If the intermediate sum generates $C_{i-1} \neq 0$, a correction by $\pm 10$ is required. Finally, $S U M_{i}$ is calculated by adding carries.

This example of the SDNR/RNS addition reveals that a proper correction of intermediate sums and a carry generation require identification of four regions of the non-corrected intermediate sums:

$$
\begin{array}{lll}
\text { region I } & C_{i-1}=0 & \left(-8 \leqslant S_{i} \leqslant 8\right) \\
\text { region II } & C_{i-1}=1 & \left(S_{i}=9\right) \\
\text { region III } & C_{i-1}= \pm 1 & \left(10 \leqslant S_{i} \leqslant 18 \text { or }-18 \leqslant S_{i} \leqslant-10\right) \\
\text { region IV } & C_{i-1}=-1 & \left(S_{i}=-9\right)
\end{array}
$$

Identification of these regions demands a magnitude test of the non-corrected sums. Additionally, in the case of region III, where positive and negative values have identical RNS representations, it is required to identify the signs of both input arguments. The structure given in Fig. 1 implements all necessary operations. It should be stressed that no module requires more than six inputs despite the circuit's suitability for a wide range of radices, between 3 and 53.

The unified circuit of Fig. 1 can be simplified further if the moduli are selected with sufficient redundancy to eliminate the ambiguous region III. This requires that the condition

$$
n \geqslant 4 \times a+1
$$

is true, where $n$ is the product of RNS moduli (the dynamic range) and $a$ is the greatest element of the SDNR digit set. Fig. 2 shows the simplified structure of the SDNR/RNS adder.
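The region handling and the disjointness condition can be checked with a behavioural Python model. This is a sketch under stated assumptions, not a description of the modules in Fig. 1: the dictionaries `DIGIT_OF` and `SUMS_OF` stand in for look-up logic, the function name `add_digit` is hypothetical, and the parameters are those of the radix-10 example ($p1=4$, $p2=7$, $a=9$, $t=8$).

```python
MODULI = (4, 7)        # p1, p2 of the radix-10 example
R, A, T = 10, 9, 8     # radix, largest digit, carry threshold


def to_rns(v):
    return tuple(v % p for p in MODULI)


# Look-up tables standing in for combinational logic.
DIGIT_OF = {to_rns(d): d for d in range(-A, A + 1)}
SUMS_OF = {}
for s in range(-2 * A, 2 * A + 1):
    SUMS_OF.setdefault(to_rns(s), []).append(s)


def add_digit(x_code, y_code):
    """Return (corrected intermediate sum as an RNS code, carry)."""
    sp = tuple((x + y) % p for x, y, p in zip(x_code, y_code, MODULI))
    candidates = SUMS_OF[sp]
    if len(candidates) == 1:
        s = candidates[0]
    else:
        # Region III: one code stands for a positive and a negative sum.
        # Such sums occur only when both operands have the same sign,
        # so the sign of either operand resolves the ambiguity.
        negative = DIGIT_OF[x_code] < 0
        s = min(candidates) if negative else max(candidates)
    carry = 1 if s > T else -1 if s < -T else 0
    return to_rns(s - R * carry), carry


ambiguous = sorted(code for code, sums in SUMS_OF.items() if len(sums) > 1)
disjoint = MODULI[0] * MODULI[1] >= 4 * A + 1

print(add_digit(to_rns(3), to_rns(8)))     # sum 11: corrected by -10, carry 1
print(add_digit(to_rns(-7), to_rns(-6)))   # sum -13: corrected by +10, carry -1
print(len(ambiguous), disjoint)
```

With $n=28$ and $a=9$ the condition $n \geqslant 4a+1$ fails, and the model finds nine ambiguous codes, one for each pair of sums in region III.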


Fig. 1 The unified structure of the SDNR/RNS adder (first argument $=\langle X P 1, X P 2\rangle$; second argument $=\langle Y P 1, Y P 2\rangle$; non-corrected intermediate sum $=\langle S P 1, S P 2\rangle$; corrected intermediate sum $=\left\langle S P 1^{\prime}, S P 2^{\prime}\right\rangle$; final sum $=\langle S U M P 1, S U M P 2\rangle$)


Fig. 2 SDNR/RNS adder for disjoint sets

## 4 SDNR/RNS arithmetic decomposition versus arithmetic decomposition based on the theory of digit sets

The importance of new design methods for the high-radix systems was appreciated by Carter and Robertson [2], who developed a unified design technique based on the theory of digit sets. The theory applies the concept of the digit set and, not unlike our method, is suitable for redundant SDNR systems.

Table 3: SDNR addition for $r=10, a=9$ and $t=8$

| SDNR addition | SDNR/RNS addition |  |
| :---: | :---: | :---: |
| $\overline{9}\, 3\, \overline{7}\, 3$ $(-8767)$ | $\langle 3,5\rangle\langle 3,3\rangle\langle 1,0\rangle\langle 3,3\rangle$ |  |
| $+\,1\, 8\, \overline{6}\, 4$ $(+1744)$ | $+\langle 1,1\rangle\langle 0,1\rangle\langle 2,1\rangle\langle 0,4\rangle$ |  |
|  | $\langle 0,6\rangle\langle 3,4\rangle\langle 3,1\rangle\langle 3,0\rangle$ | non-corrected $S$ |
|  | $\langle 0,0\rangle\langle 2,4\rangle\langle 2,3\rangle\langle 0,0\rangle$ | correction |
| $\overline{8}\, 1\, \overline{3}\, 7$ | $\langle 0,6\rangle\langle 1,1\rangle\langle 1,4\rangle\langle 3,0\rangle$ | corrected $S$ |
| $0\, 1\, \overline{1}\, 0$ carries | $\langle 0,0\rangle\langle 1,1\rangle\langle 3,6\rangle\langle 0,0\rangle$ | carries |
| $0\, \overline{7}\, 0\, \overline{3}\, 7$ SUM $(-7023)$ | $\langle 0,0\rangle\langle 1,0\rangle\langle 0,0\rangle\langle 1,4\rangle\langle 3,0\rangle$ | SUM |

A digit set $\left\langle\delta^{\Omega}\right\rangle$ is defined by two parameters - $\delta$, the diminished cardinality (that is, the number of digits less one) and $\Omega$, the offset (that is, the magnitude of the smallest digit).

For example, the conventional digit set of a radix-10 system is represented by $\left\langle 9^{0}\right\rangle$, and the radix-10 maximum redundant SDNR digit set is defined by $\left\langle 18^{9}\right\rangle$.

The decomposition process replaces a digit set of a high diminished cardinality by weighted sums of digit sets of lower diminished cardinalities (ternary and binary sets).

For example,

$$
\left\langle 18^{9}\right\rangle \rightarrow 8\left\langle 1^{1}\right\rangle+4\left\langle 1^{0}\right\rangle+2\left\langle 2^{0}\right\rangle+\left\langle 2^{1}\right\rangle
$$

An important characteristic of any decomposition method is storage requirements. The above decomposed digit set requires $1+1+2+2=6$ bits (two bits/ternary set, one bit/binary set).

The decomposition method described here reduces storage requirements for the same digit set to 5 bits (moduli $p 1=4$ and $p 2=7$ are to be selected).
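The bit counts quoted above can be reproduced with a short check. The sketch assumes the reading of $\left\langle\delta^{\Omega}\right\rangle$ given in the text ($\delta+1$ consecutive integers starting at $-\Omega$); the helper names `digit_set` and `weighted_sum` are hypothetical.

```python
from itertools import product


def digit_set(delta, omega):
    """Digit set <delta^omega>: delta + 1 consecutive integers starting at -omega."""
    return range(-omega, delta - omega + 1)


def weighted_sum(terms):
    """All values reachable as sum(weight * digit) over the component digit sets."""
    scaled = [[w * d for d in digit_set(delta, omega)] for w, delta, omega in terms]
    return {sum(choice) for choice in product(*scaled)}


# 8<1^1> + 4<1^0> + 2<2^0> + <2^1>, written as (weight, delta, omega)
terms = [(8, 1, 1), (4, 1, 0), (2, 2, 0), (1, 2, 1)]

covered = weighted_sum(terms)
# delta.bit_length() equals ceil(log2(delta + 1)): 1 bit per binary set, 2 per ternary set
bits_digit_sets = sum(delta.bit_length() for _, delta, _ in terms)
bits_rns = sum((p - 1).bit_length() for p in (4, 7))

print(sorted(covered) == list(range(-9, 10)), bits_digit_sets, bits_rns)
```

The decomposed sets cover exactly $\{-9, \ldots, 9\}$ with six bits, whereas the residues modulo 4 and 7 need two and three bits respectively.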

Assuming 32-bit words for a signal/image processing system, Fig. 3 compares the dynamic ranges for radices between 3 (smallest radix for the SDNR) and 53 (the largest radix for six bits per SDNR digit). For each radix an optimum digit set is selected (between minimum and maximum redundancy) which provides a maximum dynamic range. The figure clearly shows that the SDNR/RNS decomposition method allows a more efficient use of data storage. The additional advantage of our approach is a higher degree of structural uniformity for a wide range of radices. The structure of Fig. 1 can be applied to any radix between 3 and 53. On the other hand, the set theoretical approach, although very interesting from a mathematical point of view, requires different structures for different radices.

Fig. 3 Dynamic ranges of SDNR/RNS and digit set systems for 32-bit words

## 5 Theory of decomposition of Boolean functions

The purpose of the second stage of logic synthesis, Boolean decomposition, is the replacement of a complex logic network by a number of simpler ones. In the case of the FPGAs, the primary objective is the elimination of networks with more than five inputs using a minimum number of configurable logic blocks (CLBs). Each CLB can implement any function of five variables or two functions of four variables. For functions of more than five variables, there are two optimisation criteria: minimal number of required CLBs and minimal number of logic levels. The complexity of decomposed functions is irrelevant, since functions are implemented via look-up tables.

The above optimisation criteria require new synthesis algorithms which return efficient designs. The very general method of decomposition based on Shannon's expansion theorem is not very useful because it requires a large number of CLBs. For example, a function of seven variables would require as many as seven CLBs. A less general but more efficient approach is simple disjoint decomposition [3]. It expresses the Boolean function $F(X)$ as $G(H(Y), Z)$ where $Y$, the set of bound variables, and $Z$, the set of free variables, are disjoint sets and the union of $Y$ and $Z$ equals $X$. The problem here is that only a tiny percentage of all possible functions meets the above requirement; for instance, $0.00046 \%$ in the case of five arguments [10]. However, many functions applied in practice, especially functions which are not fully defined or sequential functions with optimum state assignments, can be decomposed in this way.
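Whether a given bound set admits a simple disjoint decomposition can be tested with the classical decomposition-chart criterion: for a single-output function, a one-output $H$ exists when the chart has at most two distinct columns. The sketch below applies this test to a toy four-variable function that is not taken from the paper; the function name `column_multiplicity` and the example `f` are illustrative assumptions.

```python
from itertools import product


def column_multiplicity(f, n, bound):
    """Number of distinct columns of the decomposition chart of f.

    Columns are indexed by the bound variables Y, rows by the free variables Z;
    variable positions are counted from 0.
    """
    free = [i for i in range(n) if i not in bound]
    columns = set()
    for y in product((0, 1), repeat=len(bound)):
        column = []
        for z in product((0, 1), repeat=len(free)):
            args = [0] * n
            for i, v in zip(bound, y):
                args[i] = v
            for i, v in zip(free, z):
                args[i] = v
            column.append(f(*args))
        columns.add(tuple(column))
    return len(columns)


def f(x1, x2, x3, x4):
    # Toy function: F = G(H(x1, x2), x3, x4) with H = x1 XOR x2
    return (x1 ^ x2) & (x3 | x4)


print(column_multiplicity(f, 4, bound=(0, 1)))   # bound set {x1, x2}
print(column_multiplicity(f, 4, bound=(0, 2)))   # bound set {x1, x3}
```

The first bound set gives two distinct columns, so a single-output $H$ suffices; the second gives four, so that split cannot be realised with one intermediate signal.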

The less restrictive simple non-disjoint decomposition allows common elements in the bound and free variables: $F(X)=G(H(C, Y), C, Z)$, where $C$ is a set of common variables (Fig. 4).


Fig. 4 Simple non-disjoint decomposition

Other possibilities include multiple disjoint, iterative disjoint, and complex disjoint decompositions.

Conditions for all types of decompositions have been well documented, although there is a shortage of efficient implementation algorithms. The spectral technique (Walsh transform) [10] is an attempt to avoid time-consuming searches, although its complexity has prevented its practical implementation. Another approach has been proposed in Reference 8. It uses symbolic partition description (SPD) for multiple-output functions. Although the decomposition procedure is very interesting from a mathematical point of view, the final stage of the procedure requires time-consuming merge-and-check operations on partitions.

## 6 Boolean decomposition based on partition products

The task of the decomposition algorithm is to find a simple disjoint or non-disjoint decomposition of a multiple-output function. The objective is to replace a logic network with too many arguments by an interconnected pair of networks with a reduced number of inputs. The algorithm is explained with a specific example. Then, the general pseudo-code is given, followed by a comparison between the algorithm and a commercially available CAD system.

Consider the three-output function $S_{i}$, representing the first stage of a radix-4 SDNR adder:

$$
\begin{array}{llrl}
\text { if } & X_{i}+Y_{i}>2 & \text { then } C_{i-1} & =1 \\
\text { else if } & X_{i}+Y_{i}<-2 & \text { then } C_{i-1} & =-1 \\
\text { else } & & C_{i-1} & =0
\end{array}
$$

$$
S_{i}=X_{i}+Y_{i}-4 \times C_{i-1}
$$

Table 4 specifies all possible function values. Letters A, $\ldots, \mathrm{E}$ represent function vectors $000, \ldots, 110$. It is assumed that the adder uses the maximum redundant set of digits $\{-3,-2,-1,0,1,2,3\}$ and each digit is represented by the three-bit sign-and-magnitude code.

For any valid permutation of common, free, and bound arguments, the algorithm calculates the number of required outputs of the network $H$ in Fig. 4. If the total number of inputs of the network $G$ does not exceed a preset maximum, then the algorithm accepts the permutation and, in the case of FPGAs, calculates the number of required CLBs.

Table 4: Function values for radix-4 SDNR adder

| $x_{1}$ (1) | $x_{2}$ (2) | $x_{3}$ (3) | $y_{1}$ (4) | $y_{2}$ (5) | $y_{3}$ (6) | $s_{1}$ | $s_{2}$ | $s_{3}$ | Vector |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | A |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | B |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | C |
| 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | D |
| 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | D |
| 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | E |
| 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | B |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | B |
| 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | C |
| 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | D |
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | A |
| 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | A |
| 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | D |
| 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | E |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | C |
| 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | D |
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | A |
| 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | B |
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | B |
| 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | A |
| 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | D |
| 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | D |
| 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | A |
| 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | B |
| 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | C |
| 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | C |
| 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | B |
| 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | A |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | D |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | A |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | B |
| 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | C |
| 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | E |
| 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | B |
| 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | A |
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | E |
| 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | D |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | A |
| 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | B |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | B |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | A |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | D |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | B |
| 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | E |
| 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | D |
| 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | A |
| 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | A |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | D |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | E |

If the selected permutation contains common argument(s), the algorithm creates a function table for each combination of the common argument(s); otherwise a single function table is generated. For example, the set combination

$$
C=\{3\} \quad Z=\{4,6\} \quad Y=\{1,2,5\}
$$

which corresponds to the network $F=G\left(H\left(x_{3}, x_{1}, x_{2}, y_{2}\right), x_{3}, y_{1}, y_{3}\right)$ in the form of Fig. 4, leads to the following tables:

$C: 0$ (i.e. $x_{3}=0$)

| $Z \backslash Y$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| ---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | A | C | C | A | . | . | E | A |
| 1 | B | D | D | B | . | . | D | B |
| 2 | . | E | . | A | . | . | . | A |
| 3 | D | B | B | D | . | . | B | D |

Partition products:

$$
\begin{array}{l}
0,3,7 ; \quad 1,2 ; \quad 6 \\
0,3,7 ; \quad 1,2 ; \quad 6 \\
0,3,7 ; \quad 1,2 ; \quad 6
\end{array}
$$

$C: 1$ (i.e. $x_{3}=1$)

| $Z \backslash Y$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| ---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | B | D | D | B | D | B | B | D |
| 1 | C | A | A | C | A | C | E | A |
| 2 | . | D | . | B | . | B | . | D |
| 3 | A | E | C | A | E | A | A | E |

Partition products:

$$
\begin{array}{l}
0,3,5 ; \quad 6 ; \quad 1,2,4,7 \\
0,3,5 ; \quad 1,2,4,7 ; \quad 6 \\
0,3,5 ; \quad 2 ; \quad 1,4,7 ; \quad 6
\end{array}
$$

Each column of the table(s) represents a unique combination of bound variables (1, 2 and 5 for our specific example). Rows of the function table represent combinations of free arguments (4 and 6). The next step creates a partition for each row using function values as equivalence criteria. For example, the first row of the lower table is represented by the partition

$$
0,3,5,6 ; 1,2,4,7
$$

In the case of rows which are not fully defined, don't-cares will be added to every partition block. For example, the third row of the lower table is represented by the pseudo-partition

$$
1,7, \quad 0,2,4,6 ; 3,5, \quad 0,2,4,6
$$
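The construction of a partition and of a pseudo-partition from a table row can be sketched as follows. The rows are copied from the lower table ($x_{3}=1$) as printed, with `.` marking an undefined entry; the function names and the optional `empty_columns` argument are hypothetical conveniences of this sketch.

```python
def row_partition(row):
    """Group column indices by function value; '.' marks an undefined entry."""
    blocks = {}
    for col, value in enumerate(row):
        if value != ".":
            blocks.setdefault(value, []).append(col)
    return list(blocks.values())


def pseudo_partition(row, empty_columns=()):
    """Add every don't-care position of the row to each block of its partition."""
    dont_cares = [c for c, v in enumerate(row)
                  if v == "." and c not in empty_columns]
    return [sorted(block + dont_cares) for block in row_partition(row)]


lower_table = ["BDDBDBBD", "CAACACEA", ".D.B.B.D", "AECAEAAE"]   # C: 1 (x3 = 1)

print(row_partition(lower_table[0]))      # first row: two blocks
print(pseudo_partition(lower_table[2]))   # third row: don't-cares join both blocks
```

For a fully defined row the two functions return the same blocks, so the pseudo-partition is a strict generalisation of the ordinary partition.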

Finally, the algorithm calculates the product of all partitions and pseudo-partitions. After each product calculation, all redundant blocks caused by don't-cares are deleted. The number of blocks in the final product determines the required number of network $H$ output lines.

In the above example, the partition products are formed sequentially, row by row. However, for a larger problem, it is possible to calculate many partition products simultaneously, as the partition product is an associative operation.

This algorithm is at least one order of magnitude faster than the XACT-CAD system and significantly reduces the required number of CLBs (Table 5).

The following decomposition algorithm is applicable to both partial and total functions.

MAIN PROGRAM

INPUT: $n$ - number of arguments, $m$ - number of function outputs, function definition, MAX - maximum number of inputs.

OUTPUT: Ordered triples $(C, Z, Y)$ allowing simple decompositions, number of configurable logic blocks.

ALGORITHM:

    Input.
    Repeat
        Generate (C, Z, Y).
        If |C ∪ Z| + 1 ≤ MAX and |C ∪ Y| ≤ MAX then
            DECOMPOSITION TEST.
            If |C ∪ Z| + h ≤ MAX then
                Calculate number of configurable logic blocks.
                Output.
    Until all non-trivial (C, Z, Y) tested.

DECOMPOSITION TEST FOR PARTIAL OR TOTAL FUNCTIONS

INPUT: Function definitions.

OUTPUT: $h$ - required number of the $H$ network outputs.

ALGORITHM:

    i = 0.
    For each combination of C do
        Create a table of function values with 2^|Z| rows and 2^|Y| columns.
        Identify empty rows.
        Identify empty columns.
        For every row do
            If row not empty then
                Create a partition excluding empty positions.
                Convert the partition into a pseudo-partition adding empty positions,
                    excluding positions of empty columns, to each partition block.
        Calculate intersection of all pseudo-partitions.
        Delete all blocks which are subblocks of other blocks.
        Delete blocks containing only redundant positions.
        If number of intersection blocks > i then
            i = number of intersection blocks.
    h = ⌈log2(i)⌉.
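The decomposition test can be prototyped in Python and applied to the radix-4 example with $C=\{3\}$, $Z=\{4,6\}$, $Y=\{1,2,5\}$. This is a sketch under stated assumptions rather than the authors' implementation: function values are represented by the interim sum itself (0, 1, 2, -1, -2 for A to E), the unused code 100 is treated as a don't-care, and "blocks containing only redundant positions" is read as blocks whose positions are all covered by other blocks. All function names are hypothetical.

```python
from itertools import product


def decode(bits):
    """Three-bit sign-and-magnitude digit; the unused code 100 gives None."""
    magnitude = 2 * bits[1] + bits[2]
    if bits[0] and magnitude == 0:
        return None
    return -magnitude if bits[0] else magnitude


def stage1(args):
    """Interim sum of the radix-4 adder for args = (x1, x2, x3, y1, y2, y3)."""
    x, y = decode(args[:3]), decode(args[3:])
    if x is None or y is None:
        return None                       # undefined entry (don't-care)
    s = x + y
    carry = 1 if s > 2 else -1 if s < -2 else 0
    return s - 4 * carry                  # 0, 1, 2, -1, -2 stand for A, B, C, D, E


def product_of_rows(rows):
    """Intersect the pseudo-partitions of all non-empty rows."""
    n_cols = len(rows[0])
    live = {c for c in range(n_cols) if any(r[c] is not None for r in rows)}
    blocks = [frozenset(live)]
    for row in rows:
        if all(v is None for v in row):
            continue                      # empty row
        dont_care = {c for c in live if row[c] is None}
        classes = {}
        for c in live:
            if row[c] is not None:
                classes.setdefault(row[c], set()).add(c)
        pseudo = [cls | dont_care for cls in classes.values()]
        crossed = {b & p for b in blocks for p in pseudo} - {frozenset()}
        # delete blocks which are sub-blocks of other blocks
        blocks = [b for b in crossed if not any(b < o for o in crossed)]
    # assumed reading of "redundant": every position already lies in another block
    kept = list(blocks)
    for b in sorted(blocks, key=len):
        others = [o for o in kept if o != b]
        if others and b <= frozenset().union(*others):
            kept.remove(b)
    return sorted(sorted(b) for b in kept)


def decomposition_test(f, n, C, Z, Y):
    """Return ({C-combination: blocks}, h); argument numbers are counted from 1."""
    blocks_for = {}
    for c_val in product((0, 1), repeat=len(C)):
        rows = []
        for z_val in product((0, 1), repeat=len(Z)):
            row = []
            for y_val in product((0, 1), repeat=len(Y)):
                args = [0] * n
                for index, bit in zip(C + Z + Y, c_val + z_val + y_val):
                    args[index - 1] = bit
                row.append(f(tuple(args)))
            rows.append(row)
        blocks_for[c_val] = product_of_rows(rows)
    i = max(len(blocks) for blocks in blocks_for.values())
    h = (i - 1).bit_length()              # ceiling of log2(i)
    return blocks_for, h


C, Z, Y, MAX = (3,), (4, 6), (1, 2, 5), 5
blocks_for, h = decomposition_test(stage1, 6, C, Z, Y)
for c_val, blocks in blocks_for.items():
    print(c_val, blocks)
accepted = len(C) + len(Z) + h <= MAX and len(C) + len(Y) <= MAX
print("h =", h, "accepted =", accepted)
```

The sketch yields three blocks for $x_{3}=0$ and four blocks for $x_{3}=1$, matching the final partition products listed with the tables, so $h=2$ and both networks stay within five inputs.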

## 7 Design examples

A number of experiments indicated that the XACT-CAD systems could not efficiently handle the implementation of SDNR modules with radices higher than two. In contrast, a decomposition of the modules based on the unified structure of the SDNR/RNS adder (see Fig. 1) and the new decomposition algorithm leads to highly efficient hardware implementations. To illustrate the potential of the new approach, the complex case of a radix-53 SDNR/RNS is presented. As was mentioned in Section 3, high-radix arithmetic has a number of functional and technological advantages. Most importantly, the freedom of radix choice offered by the new decomposition method may dramatically widen the data dynamic range (see Fig. 3).

According to Table 1 the $\left\langle 52^{27}\right\rangle$ digit set may apply moduli $p 1=7$ and $p 2=8$. All modules of the SDNR/RNS adder which have more than five inputs must be decomposed by the algorithm from the previous section. For instance, the modulo-8 adder (the ' + mod p2' block of Fig. 1) is a network of six Boolean arguments $X P 21, X P 22, X P 23, Y P 21, Y P 22, Y P 23$ which encode values $0, \ldots, 7$ of $X P 2$ and $Y P 2$. The algorithm found that the following segregation of arguments reduces hardware requirements to just three CLBs:


Symbols A, ..., H represent values $0, \ldots, 7$ of the non-corrected intermediate sum $S P 21, S P 22, S P 23$. The product of all partitions has only four blocks:
$\{0,7,10,13 ; 1,4,11,14 ; 2,5,8,15 ; 3,6,9,12\}$
and, as a result, only two internal outputs ( $h=2$ ) are required:


Repeating the decomposition algorithm for all modules with more than five inputs leads to a very efficient mapping of the radix-53 SDNR/RNS adder with just 25 CLBs (Fig. 5).
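The four-block product quoted for the modulo-8 adder can be reproduced with a short check. The argument segregation itself is not recoverable from the text, so the sketch assumes one: the bound set consists of the two upper bits of each operand (taking $XP21$ and $YP21$ as most significant), ordered as $XP21, XP22, YP21, YP22$, and the free set consists of the two least significant bits. Because the function is fully defined, the partition product reduces to grouping identical columns.

```python
from collections import defaultdict

blocks = defaultdict(list)
for col in range(16):
    x_hi, y_hi = col >> 2, col & 3          # assumed bound variables (upper bits)
    # Column signature: the modulo-8 sum for every setting of the free variables
    signature = tuple((2 * x_hi + x_lo + 2 * y_hi + y_lo) % 8
                      for x_lo in (0, 1) for y_lo in (0, 1))
    blocks[signature].append(col)

partition = sorted(blocks.values())
h = (len(partition) - 1).bit_length()       # ceiling of log2(number of blocks)
print(partition, h)
```

Under this assumption the blocks coincide with the product given in the text, and the two internal outputs simply encode the sum of the upper bits modulo 4.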


Fig. 5 Radix-53 SDNR/RNS adder

The significance of the SDNR/RNS data representation concept is not restricted to linear systems. The magnitude comparison, required by morphological transformations, median filters, and other non-linear methods, can be based on a subtractor and a sign detector. The sign of an SDNR/RNS difference is determined by the sign of the most significant non-zero digit [1]. Again, this algorithm leads to an efficient mapping of the network. For example, only two CLBs are required for a sign detector of a comparator with the dynamic range of $\pm 2727277_{53}= \pm 233,280$.
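A comparator built from a subtractor and a sign detector can be modelled as follows. The sketch works on plain SDNR digits in radix 10 with $t=8$ (the RNS coding of the digits is left out for brevity), stores digits most significant first, and uses hypothetical function names throughout.

```python
def sdnr_add(x, y, r=10, t=8):
    """Two-stage SDNR addition of equal-length digit lists (MSD first)."""
    sums = [xd + yd for xd, yd in zip(x, y)]
    carries = [1 if s > t else -1 if s < -t else 0 for s in sums]
    interim = [s - r * c for s, c in zip(sums, carries)]
    incoming = carries[1:] + [0]
    return [carries[0]] + [s + c for s, c in zip(interim, incoming)]


def sdnr_sign(digits):
    """Sign of an SDNR number: sign of its most significant non-zero digit."""
    for d in digits:
        if d != 0:
            return 1 if d > 0 else -1
    return 0


def sdnr_compare(x, y, r=10, t=8):
    """Return -1, 0 or 1 for x < y, x == y, x > y."""
    negated = [-d for d in y]            # symmetric digit set: negate digit by digit
    return sdnr_sign(sdnr_add(x, negated, r, t))


print(sdnr_compare([-9, 3, -7, 3], [1, 8, -6, 4]))   # -8767 versus +1744
print(sdnr_compare([0, 0, 1, -9], [0, 0, 0, 2]))     # 1 versus 2
```

The sign detector scans from the most significant digit, which is the same processing direction as the addition, so no loop appears in the dependence graph.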

## 8 Conclusions

The two-stage decomposition process leads to a unified structure, suitable for a wide range of linear and nonlinear systems. It is applicable to high-radix systems which could not be designed using conventional methods.

A comparison of the first decomposition stage, based on SDNR/RNS data representation with a set theoretical approach, indicates a significant reduction of memory requirements. Another benefit is a higher degree of structural uniformity for a wide range of radices.

An additional synthesis stage, which uses a new algorithm for simple Boolean decompositions, allows a further reduction in module complexity and an efficient hardware implementation based on conventional technologies, such as PLAs, gate arrays, and FPGAs. Both stages lead to a reduction in hardware complexities by factors of 2-3 when compared to designs generated by a complex, commercially available CAD system.

Although the research results presented here have been inspired by our work on morphological processors, their significance carries over to other linear and non-linear signal and image processing systems, such as FFT, rankorder filters, and encryption devices.

## 9 References

1 AVIZIENIS, A.: 'Signed-digit representations for fast parallel arithmetic', IRE Transactions on Electronic Computers, 1961, pp. 389-400
2 CARTER, T.M., and ROBERTSON, J.E.: 'The set theory of arithmetic decomposition', IEEE Trans. Comput., 1990, 39, (8), pp. 993-1005
3 CURTIS, H.A.: 'A new approach to the design of switching circuits' (D. Van Nostrand, Princeton, 1962)
4 GIARDINA, C.R., and DOUGHERTY, E.R.: 'Morphological methods in image and signal processing' (Prentice-Hall, Englewood Cliffs, 1988)
5 JULLIEN, G.A., BIRD, P.D., CARR, J.T., TAHERI, M., and MILLER, W.C.: 'An efficient bit-level systolic cell design for finite ring digital signal processing applications', J. VLSI Signal Process., 1989, 1, (3), pp. 189-207
6 KUCZBORSKI, W., ATTIKIOUZEL, Y., and CREBBIN, G.: 'Video rate morphological processor based on a redundant number representation', in B.G. BATCHELOR, M.J.W. CHEN, and F.W. WALTZ (Eds.): 'Machine vision, architectures, integration, and applications' (SPIE, Bellingham, 1992), pp. 249-260
7 KUNG, S.Y.: 'VLSI array processors' (Prentice Hall, Englewood Cliffs, 1988)
8 LUBA, T., JASINSKI, K., and KRASNIEWSKI, A.: 'Combining serial decomposition with topological partitioning for effective multi-level PLA implementations', in P. MICHEL, and G. SAUCER (Eds.): 'Logic and architecture synthesis' (North-Holland, Amsterdam, 1991), pp. 243-252
9 McNALLY, O.C., McCANNY, J.V., and WOODS, R.F.: 'A 40 Megasample IIR filter chip', in M. VALERO, S.Y. KUNG, T. LANG, and J.A.B. FORTES (Eds.): 'Special purpose architectures' (IEEE Computer Society Press, 1991), pp. 416-430
10 POSWIG, J.: 'Disjoint decomposition of Boolean functions', IEE Proc. E, 1991, 138, (1), pp. 48-56
11 YEONG-CHYANG SHIH, F., and MITCHELL, O.R.: 'Threshold decomposition of gray-scale morphology into binary morphology', IEEE Trans. Pattern Anal. Mach. Intell., 1989, 11, (1), pp. 31-42
12 SODERSTRAND, M.A., JENKINS, W.K., JULLIEN, G.A., and TAYLOR, F.J.: 'Residue number systems arithmetic: modern applications in digital signal processing' (IEEE Press, New York, 1986)
13 TAKAGI, N., ASADA, T., and YAJIMA, S.: 'Redundant CORDIC methods with a constant scale factor for sine and cosine computation', IEEE Trans. Comput., 1991, 40, (9), pp. 989-994
14 TAYLOR, F.J.: 'Residue arithmetic: a tutorial with examples', Computer, 1984, pp. 50-62
15 TORNG, H.C.: 'Switching circuits - theory and logic design' (Addison-Wesley, Reading, 1972)



