## UNIVERSITY OF OKLAHOMA

GRADUATE COLLEGE

## A HIGH-PERFORMANCE INNER-PRODUCT PROCESSOR FOR REAL AND COMPLEX NUMBERS

A Dissertation<br>SUBMITTED TO THE GRADUATE FACULTY in partial fulfillment of the requirements for the degree of Doctor of Philosophy

By
GUOPING WANG
Norman, Oklahoma
2003

## INFORMATION TO USERS

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleed-through, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

> UMÍ

UMI Microform 3260769
Copyright 2007 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company<br>300 North Zeeb Road<br>P.O. Box 1346<br>Ann Arbor, MI 48106-1346

© Copyright by Guoping Wang 2003 All Rights Reserved.

# A HIGH-PERFORMANCE INNER-PRODUCT PROCESSOR FOR REAL AND COMPLEX NUMBERS 

A Dissertation APPROVED FOR THE
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING

BY


To my family
who have lovingly supported my years of study.

## ACKNOWLEDGMENTS

In the years leading to this dissertation, I have had the great pleasure of working for and with an advisor who provided an environment conducive to learning. I would like to take this opportunity to thank my advisor, Dr. Monte P. Tull for his consistent guidance.

I would also like to thank Dr. Gerald Crain, Dr. Linda DeBrunner, Dr. Joe Havlicek and Dr. Murad Ozaydin for serving on my supervisory committee.

I appreciate many others who have been helpful at the University of Oklahoma. They make my years at the university a unique and memorable experience.

I am especially indebted to my family for their support. Guanglan Zhang, my wife, has been invaluable for her patience and love during my research. My daughter, Ying, and my son, Christopher always give me great joy and peace.

"This is what the Lord says ...<br>'Call to me and I will answer you and tell you great and unsearchable things you do not know.'"

Jeremiah 33:2-3

## TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION ..... 1
1.1 Inner-Product Implementation by the General Purpose Processors ..... 3
1.2 Inner-Product Implementation by Digital Signal Processing Processors ..... 6
1.3 Other Inner-Product Processor Implementation Methods ..... 7
1.4 Multiplier Implementation Review ..... 10
1.5 The Redundant Binary Number System. ..... 13
1.6 The Conversion of 2's-Complement to Redundant Binary. ..... 15
CHAPTER 2 INNER-PRODUCT PROCESSOR OF REAL, COMPLEX AND REDUNDANT BINARY NUMBERS ..... 21
2.1 Real Number Inner-Product Computation ..... 21
2.1.1 Inline Partial Product Redundant Binary Inner-Product ..... 21
2.1.2 Cross Partial Product Redundant Binary Inner-Product ..... 26
2.1.3 Booth Encoding Methods ..... 30
2.1.4 Implementation Comparison of Inline and Cross Inner-Product Methods... ..... 34
2.2 Complex Number Inner-Product Computation ..... 36
2.2.1 Review of Complex Number Arithmetic ..... 36
2.2.2 Comparison of Different Complex Radices ..... 39
2.2.3 Complex Number Multiplier and Inner-Product Computation ..... 51
2.3 Inner-Product Computation Comparison ..... 58
2.4 Implementation of Unified Signed/Unsigned Multiplier. ..... 62
2.4.1 Unified Signed/Unsigned Multiplier Without Booth Coding ..... 62
2.4.2 Unified Signed/Unsigned Multiplier With Booth Coding ..... 68
2.5 The Implementation of a Unified Signed/Unsigned Inner-Product Processor for $\boldsymbol{A B} \pm \boldsymbol{C D}$ ..... 69
2.6 The Implementation of a Redundant Binary Multiplier ..... 71
2.6.1 Direct Implementation of Redundant Binary Multiplier ..... 72
2.6.2 Redundant Binary Multiplier Implementation Using Inner-Product Processor ..... 73
2.7 Redundant Binary Inner-Product Computation ..... 75
CHAPTER 3 IMPLEMENTATIONS OF DIVISION METHOD ..... 76
3.1 Division Algorithm Review ..... 76
3.2 Further Studies of the Goldschmidt and Newton-Raphson Methods ..... 80
3.2.1 Comparison of the Goldschmidt and Newton-Raphson Methods ..... 80
3.2.2 Further Discussion of the Goldschmidt Method. ..... 83
3.2.3 Implementation of the Goldschmidt Division. ..... 84
3.3 Real Number Division Implementation ..... 85
3.4 Comparison of the Implementations of Division. ..... 87
3.5 Complex Number Division Implementation ..... 88
CHAPTER 4 COMPUTATIONAL EXTENSIONS ..... 90
4.1 Real-Number Computational Extensions ..... 92
4.1.1 $\quad 8$-Element Real Number Inner-Product Computation ..... 92
4.1.2 Dual 4-Element Real Number Inner-Product ..... 92
4.1.3 Quad 2-Element Real Inner-Product Using Four Redundant Binary Accumulators ..... 93
4.1.4 Eight Parallel Multipliers Using 8 Redundant Binary Accumulators. ..... 93
4.2 Complex-Number Computational Extensions ..... 94
4.2.1 Single 2-Element Complex Number Inner-Product Computation Using One Real/Imaginary Redundant Binary Accumulator ..... 94
4.2.2 Dual Single-element Complex Number Inner-Product Computation Using Four Redundant Binary Accumulators ..... 94
4.2.3 Two Parallel Complex Number Multipliers ..... 95
4.3 Redundant Binary Number Computational Extensions ..... 95
4.3.1 Single Element Redundant Binary Number Inner-Product Computation ..... 95
4.3.2 Dual 2-Element RB Inner-Product plus Two Redundant Binary Accumulators ..... 96
4.3.3 Four Parallel Redundant Binary Multipliers Using Four Redundant Binary Accumulators ..... 97
4.4 Pipeline Extensions ..... 97
CHAPTER 5 REDUNDANT BINARY TO 2'S-COMPLEMENT NUMBER CONVERSION ..... 102
5.1 An Improved Redundant Binary to 2's-Complement Converter ..... 102
5.2 Comparison Result ..... 108
CHAPTER 6 SUMMARY AND CONCLUSIONS ..... 109
APPENDIX: VHDL HARDWARE IMPLEMENTATION ..... 113
REFERENCES ..... 114
Figure 1-1. Flowchart of Inner-Product Computation by the General Purpose Processors ..... 4
Figure 1-2. Inner-Product Implementation in Pentium MMX Processor ..... 5
Figure 1-3. Sample Code for an Inner-Product by Pentium MMX ..... 5
Figure 1-4. Sample Code of the Fixed-Point Inner-Product by TMS 320C60 ..... 6
Figure 1-5. Kazakova's Inner-Product Processor Architecture ..... 8
Figure 1-6. Lin's Reconfigurable Inner-Product Processor ..... 9
Figure 1-7. A RB MAC (Multiply and Accumulate) by Huang ..... 9
Figure 1-8. Baik's Redundant Binary Filter Implementation ..... 10
Figure 1-9. Design of a $5 \times 5$ Array Multiplier. ..... 11
Figure 1-10. 4:2 Counter Using 3:2 Counter ..... 11
Figure 1-11. 3:2 Counter Based $4 \times 4$ Multiplier ..... 12
Figure 1-12. 4:2 Counter Based $4 \times 4$ Multiplier ..... 12
Figure 1-13. Mapping from the Sum of Two 2's-Complement Numbers to a RB Number ..... 17
Figure 1-14. Mapping from the Subtraction of Two 2's-Complement Numbers to a RB Number ..... 18
Figure 2-1. Inline Partial Product RB Implementation of $A_{0} B_{0}+A_{1} B_{1}$ ..... 24
Figure 2-2. Inline Partial Product Structure of $A_{0} B_{0}+A_{1} B_{1}$ ..... 25
Figure 2-3. Overall Structure of the Redundant Binary Inner-Product ..... 26
Figure 2-4. Cross Partial Product Structure of $A_{0} B_{0}+A_{l} B_{1}$ ..... 29
Figure 2-5. Basic Diagram of Complex Number Multiplication ..... 38
Figure 2-6. Blahut's Complex Number Multiplier ..... 38
Figure 2-7. An Example of Addition in Radix-(2j) ..... 43
Figure 2-8. An Example of Subtraction in Radix-(2j) ..... 44
Figure 2-9. An Example of Multiplication in Radix-(2j) ..... 45
Figure 2-10. An Example for Radix- $(j-1)$ Carry Propagation ..... 50
Figure 2-11. Inline Implementation of $A_{0} B_{0}-A_{1} B_{1}$ ..... 52
Figure 2-12. The Real Part of the Complex Number Inner-Product ..... 56
Figure 2-13. The Imaginary Part of Complex-Number Inner-Product ..... 56
Figure 2-14. Unified RB IP Processor for $A B \pm C D$ ..... 57
Figure 2-15. An Example Code of Fixed-Point Inner-Product ..... 58
Figure 2-16. Dependency Graph of Fixed-Point Inner-Product ..... 59
Figure 2-17. RB Inner-Product Implementation ..... 61
Figure 2-18. Unsigned Multiplier with Partial Product Generation ..... 62
Figure 2-19. Mapping of $\boldsymbol{P} \boldsymbol{P}_{N-2}$ and $\boldsymbol{P} \boldsymbol{P}_{N-1}$ for Signed Multiplier into a RB Digit ..... 63
Figure 2-20. Mapping of $P Q_{N-2}$ and $P Q_{N-I}$ for Signed Multiplier into a RB Digit ..... 64
Figure 2-21. Circuit Realization of the Last Partial Product $\boldsymbol{P} \boldsymbol{P}_{N-I}^{\prime}$ for Signed/Unsigned Multiplier ..... 64
Figure 2-22. First Partial Product PQ0 for Unsigned Multiplier ..... 65
Figure 2-23. Circuit of the First Partial Product $\boldsymbol{P P}_{0}^{\prime}$ for Signed/Unsigned Multiplier ..... 67
Figure 2-24. Circuit of the Partial Products from $\boldsymbol{P} \boldsymbol{P}_{1}^{\prime}$ to $\boldsymbol{P} \boldsymbol{P}_{N-2}^{\prime}$ for Signed/Unsigned
Multiplier ..... 67
Figure 2-25. A Unified Sign/Unsigned Multiplier ..... 68
Figure 2-26. Unified Signed/Unsigned IP Processor for $A B \pm C D$ ..... 70
Figure 2-27. A RB Multiplier Diagram ..... 71
Figure 2-28. An Example of RB Multiplication ..... 72
Figure 2-29. Implementation of RB Multiplier $A_{R B} B_{R B}$ ..... 75
Figure 2-30. IP Implementation for RB Number $A B+X A$ ..... 75
Figure 3-1. Goldschmidt Divisor Implementation ..... 82
Figure 3-2. Newton-Raphson Divider Implementation ..... 82
Figure 3-3. Implementation of 2- $A_{R B}$ ..... 85
Figure 3-4. First Iteration Implementation of the Goldschmidt Division ..... 86
Figure 3-5. Implementation of Successive Iteration Computation for $Z$ and $D$ ..... 86
Figure 3-6. Overall Structure of Divider Using RB IP Processor ..... 87
Figure 3-7. Complex-Number Division Implementation Initial Process ..... 88
Figure 4-1. An Example of a Redundant Number Adder Tree ..... 90
Figure 4-2. Dual 4-Element Real Number Inner-Product ..... 92
Figure 4-3. Quad 2-Element Real Number Inner-Product ..... 93
Figure 4-4. Eight Parallel Multipliers Using 8 RB Accumulators ..... 93
Figure 4-5. Single 2-Element Complex Number IP Using One Real/Imaginary RB Accumulator ..... 94
Figure 4-6. Dual 2-Element Complex Number Inner-Products Using Four RB Accumulators ..... 95
Figure 4-7. Two Parallel Complex Number Multipliers ..... 95
Figure 4-8. 4-Element Redundant Binary Inner-Product ..... 96
Figure 4-9. Dual 2-Element RB Inner-Product ..... 96
Figure 4-10. Four Parallel RB Multipliers Using 4 RB Accumulators ..... 97
Figure 4-11. 8-Word 8-Bit RB IP Processor ..... 98
Figure 4-12. Two-Stage Pipelined RB IP Processor ..... 99
Figure 4-13. Three-Stage Pipelined RB IP Processor ..... 100
Figure 5-1. Four-Bit Carry-Lookahead RB-NB Converter ..... 106
Figure 5-2. Diagram of a 4-Bit Carry-Lookahead RB-NB Carry Generator ..... 107
Figure 5-3. Two-Level 16-bit RB-NB Converter ..... 108
Table 1-A. Computation Rules for the First Step in Carry-Propagation-Free Addition for RB Numbers ..... 15
Table 1-B. Coding Table for Binary Signed Digits ..... 16
Table 1-C. Logic Functions for RBFA and RBHA ..... 19
Table 2-A. Modified Booth Encoding Table ..... 30
Table 2-B. Booth Correction Factors for the Inline Multiplication Method ..... 32
Table 2-C. Booth Correction Factors for the Cross Partial Product Method. ..... 33
Table 2-D. 16-Bit FPGA Implementations of $A_{0} B_{0}+A_{1} B_{1}$ Without Booth Encoding ..... 35
Table 2-E. 16-Bit FPGA Implementations of $A_{0} B_{0}+A_{1} B_{l}$ With Booth Encoding ..... 35
Table 2-F. Truth Table for Radix- $(-1+j)$ One-Bit Addition ..... 50
Table 2-G. Booth Correction Factors for Redundant Binary Partial Product Generation of $A_{0} B_{0}-A_{1} B_{1}$ ..... 55
Table 2-H. Comparison of IP Computation between TMS320C62X and RB Inner- Product Processor ..... 61
Table 2-I. Partial Product $P Q 0_{0}$ to $P Q 0_{N-3}$ for Unsigned Multiplier ..... 65
Table 2-J. Partial Product $\mathrm{PQO}_{\mathrm{N}-2}$ for Unsigned Multiplier ..... 65
Table 2-K. Partial Product $P Q 0_{N-1}$ for Unsigned Multiplier ..... 66
Table 2-L. Partial Product $P Q 0_{N}$ and $P Q 0_{N+1}$ for Unsigned Multiplier ..... 66
Table 2-M. RB Partial Product Generation ..... 72
Table 2-N. Encoded RB Partial Product Generation ..... 73
Table 4-A. Time Delay Model of RB Multiplier ..... 97
Table 5-A. Conversion Rules in Stage $I$. ..... 103
Table 5-B. Conversion Truth Table for RB-NB ..... 104

## ABSTRACT OF THE DISSERTATION

# A High-Performance Inner-Product Processor for Real and Complex Numbers 

by<br>Guoping Wang<br>Doctor of Philosophy in Electrical and Computer Engineering University of Oklahoma, Norman, OK, 2003<br>Dr. Monte P. Tull, Chair


#### Abstract

A novel, high-performance fixed-point inner-product processor based on a redundant binary number system is investigated in this dissertation. This scheme decreases the number of partial products to $50 \%$, while achieving better speed and area performance, as well as providing pipeline extension opportunities. When modified Booth coding is used, partial products are reduced by almost $75 \%$, thereby significantly reducing the multiplier addition depth. The design is applicable for digital signal and image processing applications that require real and/or complex numbers inner-product arithmetic, such as digital filters, correlation and convolution. This design is well suited for VLSI implementation and can also be embedded as an inner-product core inside a general purpose or DSP FPGA-based processor. Dynamic control of the computing structure permits different computations, such as a variety of inner-product real and complex number computations, parallel multiplication for real and complex numbers, and real and complex number division. The same structure can also be controlled to accept redundant binary number inputs for multiplication and inner-product computations. An improved 2 's-complement to redundant binary converter is also presented.


## Chapter 1 Introduction

Consider the definition of the inner-product. For two vectors $\boldsymbol{A}=\left(A_{0}, A_{1}, A_{2}, \ldots . A_{M-1}\right)$, and $\boldsymbol{B}=\left(B_{0}, B_{1}, B_{2}, \ldots . B_{M-1}\right)$, the inner-product of $\boldsymbol{A}$ and $\boldsymbol{B}$ is defined as:

$$
\begin{equation*}
<\boldsymbol{A}, \boldsymbol{B}\rangle=\boldsymbol{A}\left(\boldsymbol{B}^{*}\right)^{T}=\sum_{i=0}^{M-1} A_{i} B_{i}^{*} \tag{1.1}
\end{equation*}
$$

In general, $A_{i}$ and $B_{i}$ may be real or complex numbers. $\boldsymbol{A} \boldsymbol{B} *$ denotes matrix multiplication with the row vectors $A$ and $B$ considered as $1 \times M$ matrices, and $\left(B^{*}\right)^{T}$ denotes the conjugate transpose of $B$. In the traditional method, all of the multiplications are processed independently of one another, thereby requiring $M$ multiplications and $M-1$ additions. To obtain high-performance circuit implementations of the inner-product, several salient features of Equation (1.1) can be utilized; namely, carry-free addition, high-speed multiplication, and parallel or pipelined multiplication and addition.

The application of redundant binary ( RB ) numbers was previously investigated for carry-free addition and fast multiplication. These techniques have proven to be easily laid out in VLSI and result in high-speed circuit implementations [1]-[4]. In this dissertation, high-performance and easily pipelined implementations of an inner-product processor are presented. The designs utilize RB numbers for achieving the carry-free addition of partial products. Redundant binary schemes are less viable in applications that require persistent conversion back to 2 's-complement [5]-[7], since this process is relatively slow due to an unavoidable carry propagation requirement. The overall motivation for this work is the design of a high-performance Complex Arithmetic Signal Processor (CASP) capable of offering novel extended inner-product operations. The

CASP design relies on the high-speed multiplication afforded by redundant binary techniques, while avoiding the relatively slow conversion back to 2 's-complement numbers until a final 2's-complement result is necessary. Inherently, the CASP device provides intermediate register storage for redundant binary, as well as 2 's-complement numbers. The methods for implementing the core inner-product structure and general extensions are presented in this dissertation.

Inner-product computations play a central role in digital signal processing, most often in digital filters, signal correlation, convolution, FFT, etc. Current implementations of inner-product computations include the following methods: 1) general purpose processors, 2) digital signal processor devices, such as Texas Instruments TMS320C60, 3) VLSI devices, such as FPGAs or ASICs. Various researchers have investigated the implementation of inner-product processors. Implementations include array multipliers [8],[9], VLSI Residue Number System architecture [10], serial implementations [11],[12], distributed arithmetic [13],[14], carry-save addition [15]-[19], specific DSP processor and FPGA [20]-[27], redundant binary implementations [28]-[30].

Complex number arithmetic computation is a key arithmetic feature required in modern digital communication and optical systems [31]-[38]. Many algorithms based on convolution, correlation, and complex number filters require complex number multiplication and high-speed inner-product computation. These applications require efficient representation and manipulation of complex numbers together with real numbers. Considerable research exists for hardware implementations of complex number systems [39]-[54] and representations of complex numbers in different radices [55]-[68].

The redundant binary ( RB ) representation is one of the signed-digit number representations originally introduced by Avizienis [69] for achieving the carry-propagation-free addition. RB numbers differ from the conventional 2's-complement representation in that the individual digits comprising a number may have negative values as well as positive values. High-speed VLSI multiplication algorithms, which are based on redundant binary numbers, are proposed in [1],[3],[4]. Since integer numbers in most digital systems are represented in 2's-complement form, a converter is needed to convert a redundant binary number to a 2 's-complement number in the last step. Different implementations of this converter have been proposed in [5], [7], [12], [70]-[74].

Although division is an infrequent operation, it has been shown [75] that ignoring its hardware implementation can result in significant system performance degradation for many applications. Extensive literature describes the theory of division [75]-[90]. Division algorithms can be generally divided into the following classes: digit recurrence, functional iteration, table look-up and variable latency [84]. Choosing an optimal design of a divider depends heavily upon its requirements for area and speed.

In the following sections, these hardware implementations and research issues will be reviewed and investigated.

### 1.1 Inner-Product Implementation by the General Purpose Processors

General purpose processors, such as Intel Pentium and 80x86, Motorola 68000, AMD K6 and K7, etc., can perform different algorithms using combinations of various machine instructions. The systems built with these programmable processors are adaptable to different applications and easily upgradeable to changing requirements.

Even with such potential advantages, traditional programmable processors have not been widely used for high-speed inner-product computation because of their limited performance. For example, in order to find the inner-product of two vectors $\boldsymbol{A}$ and $\boldsymbol{B}$, the flowchart in Figure 1-1 is usually employed.


Figure 1-1. Flowchart of Inner-Product Computation by the General Purpose Processors

In a general purpose (GP) processor, all these computations are sequential and each load, multiplication or summation requires one or more clock cycles. Traditional multiplication and accumulation methods are generally used. Some GP processors provide additional hardware features for inner-product calculations. Among these processors, the Pentium MMX processor contains a super scalar architecture, which includes: 1) enhanced pipelines 2) two pipelined integer units capable of two instructions
per clock, as well as other features. With the new architecture, the Pentium MMX can compute inner-products more efficiently than other general purpose processors. The diagram of inner-product implementation by the Pentium MMX processor is shown as Figure 1-2 [91]:


Figure 1-2. Inner-Product Implementation in Pentium MMX Processor [91]
Sample code [91] for an inner-product implementation using Pentium assembly language is shown Figure 1-3:


Figure 1-3. Sample Code for an Inner-Product by Pentium MMX [91]

### 1.2 Inner-Product Implementation by Digital Signal Processing Processors

DSP processors are specifically designed for DSP applications. One typical DSP processor is the Texas Instruments TMS320C60. It is a highly integrated, multiprocessor, single chip device specifically designed for DSP applications. The TMS320C60 integrates the following components onto a single device [92]:

1. a single 32-bit RISC master processor (MP) with an integral IEEE-754 floating point unit
2. four 32-bit integer DSP parallel processors (PP)
3. a sophisticated direct memory access (DMA) transfer controller (TC)
4. a video controller (VC)
5. 50 K bytes of on-chip SRAM memory

The five processors on the TMS320C60, i.e. the MP and four PPs, can be configured for a variety of multiple-instruction, multiple-data, multiple-instruction, single-data, or single-instruction, multiple-data modes. The PPs, similar to most DSPs, perform all operations, except division, in a single cycle. For example, it can perform the parallel operations, $\mathrm{A} * \mathrm{~B}=>\mathrm{C}$ and $\mathrm{A}+1 \Rightarrow \mathrm{~A}$ in one clock cycle, while in a general purpose processor, at least two cycles are required. Sample code of the fixed-point innerproduct computation is shown Figure 1-4 [92].

|  | ZERO | L1 | A7 |  |
| :--- | :--- | :--- | :--- | :--- |
|  | LDH | .D1 | ${ }^{*} A 4++$, A2 | ;load $a_{i}$ from memory |
|  | LDH | .D1 | ${ }^{*} A 3++, A 5$ | ;load $b_{i}$ from memory |
|  | MPY | .M1 | A2,A5,A6 | ; $a_{i}{ }^{*} b_{i}$ |
|  | ADD | L1 | A6,A7,A7 | ; sum $+=\left(a_{i}{ }^{*} b_{i}\right)$ |
|  | SUB | .S1 | A1,1,A1; decrement loop counter |  |
| [A1] | B | .S2 | LOOP | ; branch to loop |

Figure 1-4. Sample Code of the Fixed-Point Inner-Product by TMS 320C60 [92]

While DSP processors allow flexibility, for some applications that require high speed inner-product computation, FPGAs or ASICs can provide higher performance options.

### 1.3 Other Inner-Product Processor Implementation Methods

Besides the inner-product implementations on general purpose processors and DSP processors, other arithmetic and implementations of inner-product processors have been investigated. Ahmad and Poornalah [8] proposed an inner-product implementation using array multipliers. Although the array multipliers provide convenient layout for VLSI, this method may not be a good option in high-performance requirements for innerproduct computation because of its high latency. Fahmi, et al., [11] and Haynal and Parhami [12] investigated serial implementations of an inner-product processor. The designs result in a small area but has a high latency. Inner-product implementations based on distributed arithmetic are proposed by Burleson et al., [13] and Vega, et al., [14]. Various inner-product implementations using carry-save adders are investigated by many researchers [15]-[19]. Application specific inner-product processors are studied in [21][25] and redundant binary implementations are proposed in [28],[29]. In this research, which is focused upon the high-performance implementation of an inner-product processor, only implementations of high-performance inner-product processors will be reviewed and compared.

With a carry-save adder structure, Kazakova [15] investigated a fast and lowpower three-dimensional inner-product processor. This processor consists of Booth encoders, a Wallace reduction tree, and a final two-operand adder. Its structure is shown in Figure 1-5.


Figure 1-5. Kazakova's Inner-Product Processor Architecture [15]

A novel approach for high-performance inner-product processor, which is dynamically reconfigurable, was proposed by Lin [24],[25]. This processor mainly consists of an $8 \times 8$ or $4 \times 4$ array of small multipliers plus two or three arrays of adders. It requires very simple reconfigurable components. The entire summation network can be reconfigured by using a few control bits for the desired computations, where the reconfiguration can be done dynamically. The design is regular, modular, and, it can easily be pipelined. The diagram is shown in Figure 1-6.

Since the implementation of an array multiplier has a high latency, compared to the design of carry-save addition and redundant binary representation, this proposed inner-product processor has a high latency.


Figure 1-6. Lin's Reconfigurable Inner-Product Processor [24],[25]


Figure 1-7. A RB MAC (Multiply and Accumulate) by Huang [2]
Based upon redundant binary numbers and the Booth encoding, Huang [2] proposed a high-performance, two-stage pipelined MAC (Multiply and Accumulate) unit,
which is shown in Figure 1-7. Later, Sacristan [29] further developed this structure as a reusable inner-product unit for multipliers with different sizes of word length.

Baik et al. [28] proposed a redundant binary implementation of an FIR filter. The diagram is shown in Figure 1-8 (See Section 2.1.4. for further discussion):


Figure 1-8. Baik's Redundant Binary Filter Implementation [28]

### 1.4 Multiplier Implementation Review

Multiplication is the key operation in the implementation of inner-product computation. Three popular implementations for multipliers are an array multiplier [85],[93], a multiplier using a Wallace tree [94] and a multiplier using redundant binary number representation [1],[4]. An array multiplier has good repeatability of unit cells and is very regular in its structure. It uses only short wires that connect one full adder to horizontally, vertically, or diagonally adjacent full adders. Thus, it results in a very simple and efficient layout in VLSI implementation. However, the $N$-bit multiplication time is linearly proportional to $N$. This method requires a long computation time for larger $N$ 's. Figure 1-9 [85] is a design of $5 \times 5$ unsigned multiplier:


Figure 1-9. Design of a $5 \times 5$ Array Multiplier [85],[93]
The Wallace-tree method is commonly used to realize high-speed multiplication. The basic cell in Wallace-tree multiplication is 3-to-2 or 4-to-2 CSA (Carry Save Adder), also called $3: 2$ or $4: 2$ counter. A $3: 2$ counter can be realized by a full adder, which reduces three numbers to two numbers while a $4: 2$ counter can be realized by two $3: 2$ counters, as shown in Figure 1-10 [85]. Figure 1-11 and Figure 1-12 are $4 \times 4$ multipliers, using $3: 2$ counters and $4: 2$ counters.


Figure 1-10. 4:2 Counter Using 3:2 Counter [94]


Figure 1-11. 3:2 Counter Based $4 \times 4$ Multiplier [85]


Figure 1-12. 4:2 Counter Based $4 \times 4$ Multiplier [85]
The traditional Wallace-tree method uses a 3:2 counter. This scheme results in a complicated interconnection between three-input/two-output counters. This makes the VLSI layout difficult and inefficient. The extended layout process increases the design complexity. As the multipliers increase in bit length, the interconnection becomes exponentially complicated. To solve this problem with conventional Wallace-trees, the following two methods have been proposed. One method is to use $4: 2$ counters [94] instead of $3: 2$ counters [94]. The use of $4: 2$ counters simplifies the interconnection
drastically because the partial products are added using a binary tree. Another method is to use redundant binary representation for the partial products [1], [3], [4]. The use of the RB simplifies the interconnection because the RB partial products can be summed using an RB adder tree. The $N$-bit multiplication time of RB multipliers and Wallace-tree multipliers is proportional to $\log _{2} N$. The physical layout of a RB multiplier has good repeatability. The RB multiplier does not require any optional sign bits for adding partial products. Makino's research [3] indicates that a $54 \times 54$-bit multiplier using redundant binary number representation is faster than the conventional 4:2 counter-based multiplier and has lower power dissipation. The power dissipation of 540 mW is estimated for the $54 \times 54 \mathrm{RB}$ multiplier operating at 100 MHz . These figures are more than $12 \%$ faster speed and $38 \%$ lower power than the conventional CSA multipliers.

Using redundant binary representation in our research results in an easily controlled/reconfigurable high-performance computing structure capable of handling various computations for both real and complex numbers.

### 1.5 The Redundant Binary Number System

Redundant binary (RB) representation is one of the signed-digit (SD) number systems originally introduced by Avizienis [69], which provides carry-propagation-free (CPF) addition. In a signed-digit system, the individual digits have negative as well as positive values. Given a radix- $(r)$ signed-digit number, each digit of the signed-digit number can take one of the following $2 \alpha+1$ values:

$$
\begin{equation*}
\{-\alpha, \ldots,-1,0,1, \ldots, \alpha\} \tag{1.2}
\end{equation*}
$$

where the magnitude of a positive integer $\alpha$ must be within the following interval:

$$
\begin{equation*}
\left\lfloor\frac{r}{2}\right\rfloor \leq \alpha \leq r-1 \tag{1.3}
\end{equation*}
$$

The radix-(2) signed-digit system (Redundant Binary (RB) representation) uses the digit set $\{-1,0,1\}$ to represent numbers. The SD number system is also called redundant because a given integer number may have more than one representation. For example, the radix-(2) integer, (7) ${ }_{10}$, can be represented in several ways, e.g., $\left[\begin{array}{llll}0 & 1 & 1 & 1\end{array}\right]_{\mathrm{RB}}$, $\left[\begin{array}{lll}1 & 0 & 0\end{array}-1\right]_{\mathrm{RB}}$, or $\left[\begin{array}{llll}1 & -1 & 1 & 1\end{array}\right]_{\mathrm{RB}}$. Based on the SD redundancy property, addition rules can be devised so that carry propagation is limited to only one digit position, thereby eliminating the possibility of a carry from the LSD (Least-Significant-Digit) to the MSD (Most-Significant-Digit). In a RB adder circuit implementation, the addition time is fixed and does not depend on the word length. Also, no explicit mechanism to handle the overall sign of a signed-digit number is required since it is determined by the most significant non-zero digit. Since the multiplication of two numbers is generally performed by the addition of partial products, the carry-propagation-free (CPF) feature of the RB arithmetic can be used to design high-speed multipliers [1], [3], [4] and multiply-andaccumulate (MAC) units [2].

The algorithmic rules for the RB addition are defined by Takagi, et al., [4]. Basically, two steps are required. In the first step, the intermediate carry-out, $\chi_{i} \in\{-1,0$, $1\}$, and the intermediate sum digit, $\sigma_{i} \in\{-1,0,1\}$, is generated at each position and satisfy the equation:

$$
\begin{equation*}
\alpha_{i}+\beta_{i}=2 \chi_{i}+\sigma_{i} \tag{1.4}
\end{equation*}
$$

where $\alpha_{i}$ and $\beta_{i}$ are the RB augend and addend digits, respectively. Note that for increased speed, the circuit implementation may utilize the next lower order digits,
$\alpha_{i-1}$ and $\beta_{i-1}$, to determine the carry-out from that digit position. Table 1-A describes these rules of step 1 in detail. In the second step, the final sum digit $\zeta_{i}$ is obtained at each position by adding the intermediate sum digit, $\sigma_{i}$, and the intermediate carry, $\chi_{i-1}$, from the next-lower-order position, without generating a carry. That is,

$$
\begin{equation*}
\zeta_{i}=\sigma_{i}+\chi_{i-1} \tag{1.5}
\end{equation*}
$$

Table 1-A. Computation Rules for the First Step in Carry-Propagation-Free Addition for RB Numbers [1]

| Type | Augend digit $\alpha_{i}$ | Addend digit <br> $\beta_{i}$ | Digits at the next-lower-order position $\left(\alpha_{i-1}, \beta_{i-1}\right)$ | Intermediate carry ( $\chi_{i}$ ) | Intermediate sum digit ( $\sigma_{i}$ ) |
| :---: | :---: | :---: | :---: | :---: | :---: |
| <1> | 1 | 1 | ------------- | 1 | 0 |
| <2> | 1 | 0 | Both are negative | 0 | 1 |
|  | 0 | 1 | All other cases | 1 | -1 |
| <3> | 0 | 0 | ------------- | 0 | 0 |
| <4> | $\begin{gathered} 1 \\ -1 \\ \hline \end{gathered}$ | $\begin{gathered} \hline-1 \\ 1 \\ \hline \end{gathered}$ | ------------ | 0 | 0 |
| <5> | 0 | -1 | Both are negative | -1 | 1 |
|  | -1 | 0 | All other cases | 0 | -1 |
| <6> | -1 | -1 | ------------- | -1 | 0 |

In general, throughout this dissertation, RB numbers are expressed using Greek symbols.

### 1.6 The Conversion of 2's-Complement to Redundant Binary

A limited precision RB number, $\Delta$, can be derived from the addition of a pair of $N$-bit 2's-complement numbers $A$ and $B$ [2].

$$
\begin{align*}
(A+B)_{2 c} & =A-(-B)_{2 c} \\
& =A-(\bar{B}+1) \\
& =A-\bar{B}-1 \\
& =\left(-a_{N-1} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} 2^{i}\right)+\left(\overline{b_{N-1}} 2^{N-1}-\sum_{i=0}^{N-2} \overline{b_{i}} 2^{i}\right)-1  \tag{1.6}\\
& =\left(-a_{N-1}+\overline{b_{N-1}}\right) 2^{N-1}+\left(\sum_{i=0}^{N-2}\left(a_{i}-\overline{b_{i}}\right) 2^{i}\right)-1 \\
& =\left(\delta_{N-1} 2^{N-1}+\sum_{i=0}^{N-2} \delta_{i} 2^{i}\right)-1 \\
& =\Delta-1
\end{align*}
$$

where $\delta_{N-1}=-a_{N-1}+\overline{b_{N-1}}, \delta_{i}=a_{i}-\bar{b}_{i}$ for $0 \leq i \leq N-2,2 \mathrm{c}$ is the 2 's-complement operations, $\bar{B}$ is the 1 's-complement operations, $\bar{b}_{i}$ is the bit-complement, and -1 can be considered as a -1 carry-in to a subsequent RB addition. For inner-product calculations, the -1 correction is applied in the RB partial product adder tree.

The binary-signed digits can be encoded into binary in several ways. In this work, the binary signed digits $\{-1,0,0,1\}$ are coded as $\{00,01,10,11\}$, respectively, as given in Table 1-B. Another encoding method is to encode redundant binary number in signedmagnitude [28], that is, to encode $\{-1,0,0,1\}$ as $\{11,00,10,01\}$. It is less efficient to map 2C to RB for signed-magnitude encoding. See Section 2.1.4 for further discussions.

Table 1-B. Coding Table for Binary Signed Digits

| $a$ value | Encoded $a^{-} a^{+}$ |  |
| :---: | :---: | :---: |
| -1 | 0 | 0 |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 1 | 1 | 1 |

Examining Equation (1.6), beginning with the $\delta_{i}$ term, the signed digits are encoded using the relationship, $\delta_{i}=a_{i}-\bar{b}_{i}$, where $\delta_{i}$ is a binary signed digit, $\delta_{i} \in\{-1,0$, $1\}$. The mapping equations for $\delta_{i}^{-}$and $\delta_{i}^{+}$are [2],,[49],[50]

$$
\begin{align*}
& \delta_{i}^{-}=a_{i}  \tag{1.7}\\
& \delta_{i}^{+}=b_{i} \quad \text { for } 0 \leq i \leq N-2
\end{align*}
$$

Similarly, in the Most Significant Digit (MSD) term of Equation (1.6), $\delta_{N-1}$ is encoded with the mapping equations

$$
\begin{align*}
& \delta_{N-1}^{-}=\overline{a_{N-1}} \\
& \delta_{N-1}^{+}=\overline{b_{N-1}} \tag{1.8}
\end{align*}
$$

The structure of mapping the sum of two 2 's-complement binary numbers to a RB number is shown in Figure 1-13:


Figure 1-13. Mapping from the Sum of Two 2's-Complement Numbers to a RB Number

Note that a single 2 's-complement number, $A$, is converted directly into a RB number $\Delta$ in Equation (1.9):

$$
\begin{align*}
& \delta_{N-1}^{+}=\overline{a_{i}}, \delta_{N-1}^{-}=0  \tag{1.9}\\
& \delta_{i-1}^{+}=a_{i}, \quad \delta_{i}^{-}=1 \quad(0 \leq i \leq N-2)
\end{align*}
$$

For example, a 2's-complement number $(00000101)_{2 C}$ is converted directly into a RB number (0101010101110111) $)_{R B}$.

The subtraction of two N -bit 2 's-complement numbers can also be represented by a redundant binary number:

$$
\begin{align*}
E=(A-B)_{2 c} & =\left(-a_{N-1} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} 2^{i}\right)+\left(b_{N-1} 2^{N-1}-\sum_{i=0}^{N-2} b_{i} 2^{i}\right) \\
& =\left(-a_{N-1}+b_{N-1}\right) 2^{N-1}+\sum_{i=0}^{N-2}\left(a_{i}-b_{i}\right) 2^{i}  \tag{1.10}\\
& =\varepsilon_{N-1} 2^{N-1}+\sum_{i=0}^{N-2} \varepsilon_{i} 2^{i}
\end{align*}
$$

where $\varepsilon_{N-1}=-a_{N-1}+b_{N-1}$ and $\varepsilon_{i}=a_{i}-b_{i}$ for $0 \leq i \leq N-2$.
The mapping equations for the encoded $\varepsilon_{i},(0 \leq i \leq N-1)$ in Equation (1.10) are:

$$
\begin{align*}
& \varepsilon_{i}^{-}=a_{i} \\
& \varepsilon_{i}^{+}=\bar{b}_{i} \text { for } 0 \leq i \leq N-2 \tag{1.11}
\end{align*}
$$

and

$$
\begin{align*}
& \varepsilon_{N-1}^{-}=\overline{a_{N-1}}  \tag{1.12}\\
& \varepsilon_{N-1}^{+}=b_{N-1}
\end{align*}
$$

The structure of mapping the subtraction of two 2 's-complement binary numbers to a RB number is shown in Figure 1-14.


Figure 1-14. Mapping from the Subtraction of Two 2's-Complement Numbers to a RB Number

Based on this coding for RB numbers, the logic functions of a RBFA (RB full adder) and RBHA (RB half adder) are obtained [49] and shown in Table 1-C for the sum, $z_{i}$, with inputs $x$ and $y$. Boolean variables, $g$ and $h$, are used as intermediate variables to simplify the equations for the carry, $c$, and sum, $z$. Note, in Table 1-C, the notation used is the same as [49] but corrects the RBHA equations found in [49].

Table 1-C. Logic Functions for RBFA and RBHA

| RBFA | RBHA |
| :--- | :--- |
| $g_{i}=\left(x_{i}^{-} \oplus x_{i}^{+}\right) \oplus\left(y_{i}^{-} \oplus y_{i}^{+}\right)$ | $z_{i}^{-}=\left(\overline{x_{i}^{-} \oplus x_{i}^{+}}\right) \oplus y_{i}^{-}$ |
| $h_{i}=x_{i}^{-} x_{i}^{+}+y_{i}^{-} y_{i}^{+}$ | $z_{i}^{+}=y_{i}^{+}$ |
| $z_{i}^{-}=g_{i} \oplus c_{i-1}^{-}$ | $c_{i}^{-}=x_{i}^{-}+x_{i}^{+}$ |
| $z_{i}^{+}=c_{i-1}^{+}$ | $c_{i}^{+}=\left(\overline{x_{i}^{-} \oplus x_{i}^{+}}\right) y_{i}^{-}+\left(x_{i}^{-} \oplus x_{i}^{+}\right) x_{i}^{-} x_{i}^{+}$ |
| $c_{i}^{-}=\left(x_{i}^{-}+x_{i}^{+}\right)\left(y_{i}^{-}+y_{i}^{+}\right)$ |  |
| $c_{i}^{+}=g_{i} c_{i-1}^{-}+\overline{g_{i} h_{i}}$ |  |

In this work, a novel, high-performance, fixed-point, inner-product processor based upon a redundant binary number system will be investigated. Similar to Baik's [28] methods, this scheme decreases the number of partial products by $50 \%$, while achieving better speed and area performance and providing pipeline extension opportunities. When modified Booth encoding is used, partial products are reduced by almost $75 \%$, thereby significantly reducing the multiplier addition depth. This design is well suited for VLSI implementation, and it can also be embedded as an inner-product core inside a general purpose DSP FPGA-based processor. This inner-product processor can be easily reconfigured for different computations, such as real number inner-product computations, parallel real number multipliers, complex number multipliers, complex number inner-
product processors, redundant binary multipliers, redundant binary inner-product processors, etc. Chapter 2 proposes a fixed-point number inner-product processor. Computational structures for both real and complex number inner-products for both 2 'scomplement and unsigned integers is presented. A new division method using the IP structure is investigated in Chapter 3. Two convergence division methods Goldschimdt and Newton-Raphson are compared. Chapter 4 discusses extended computations, such as parallel multiplications, inner-product processors using the innerproduct processor for real, complex and redundant binary numbers. In Chapter 5, an improved redundant binary number to 2 's-complement number converter is discussed. Chapter 6 provides a summary of contributions and future research directions for this work. The implementation of the redundant binary IP processor for real and complex number and the Goldschmidt division unit using the IP processor have been implemented using VHDL on Xilinx FPGA. The original contributions of this research are:

- IP processor reduces the number of partial products.
- A unified signed/unsigned 2's-complement/RB multiplier is developed using this IP structure.
- With the same IP structure, a novel Goldschmidt high-performance division circuit is developed.
- This IP structure can be used to build a multi-purpose dynamical processor for real, complex and redundant binary number computations.
- An improved 2's-complement to RB converter is proposed.


## Chapter 2 Inner-Product Processor of Real, Complex and Redundant Binary Numbers

### 2.1 Real Number Inner-Product Computation

Consider an inner-product for an $M$ dimensional ( $M=$ even), $N$-bit real vectors ( $N$ even), $\boldsymbol{A}$ and $\boldsymbol{B}$, where $\boldsymbol{A}=\left(A_{0}, A_{1}, A_{2} \ldots . A_{M-1}\right)$ and $\boldsymbol{B}=\left(B_{0}, B_{1}, B_{2} \ldots . B_{M-1}\right)$ with

$$
\begin{equation*}
A_{i}=\left(a_{N-1, i} a_{N-2, i} a_{N-3, i} \ldots . . a_{1, i} a_{0, i}\right) \text { and } B_{i}=\left(b_{N-1, i} b_{N-2, i} b_{N-3, i} \ldots . b_{1, i} b_{0, i}\right) \tag{2.1}
\end{equation*}
$$

where $A_{i}$ and $B_{i}$ are real numbers.
The real inner-product is defined as:

$$
\begin{equation*}
<\boldsymbol{A}, \boldsymbol{B}>=\left(A_{0} A_{1} \cdots A_{M-2} A_{M-1}\right) \bullet\left(B_{0} B_{1} \cdots B_{M-2} B_{M-1}\right)=\sum_{i=0}^{M-1} A_{i} B_{i} \tag{2.2}
\end{equation*}
$$

Two basic approaches exist for performing the necessary inner-product multiplications using redundant binary arithmetic. The first method uses inline conversion or mapping of 2 's-complement partial products into a redundant binary number for each multiplication of $A_{i} B_{i}$. The second method combines or maps equivalent 2's-complement partial products into a redundant binary number across the $A_{i} B_{i}$ pairs [49]. Both approaches are considered in the following sections.

### 2.1.1 Inline Partial Product Redundant Binary Inner-Product

Considering the simple case of $M=2, A_{0} B_{0}+A_{1} B_{1}$, we first compute the redundant binary products for $A_{0} B_{0}$ and $A_{1} B_{1}$, and then add the RB products together to produce the inner-product. Redundant binary partial products are generated by mapping even/odd pairs of 2 's-complement partial product sums. For N -bit numbers, the product of $A B$ is expanded in the following equations:

$$
\begin{aligned}
A B= & \left(-a_{N-1} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} 2^{i}\right)\left(-b_{N-1} 2^{N-1}+\sum_{i=0}^{N-2} b_{i} 2^{i}\right) \\
= & \left(-a_{N-1} b_{0} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{0} 2^{i}\right) 2^{0}+\left(-a_{N-1} b_{1} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{1} 2^{i}\right) 2^{1}+ \\
& \left(-a_{N-1} b_{2} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{2} 2^{i}\right) 2^{2}+\left(-a_{N-1} b_{3} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{3} 2^{i}\right) 2^{3}+ \\
& \ldots \ldots . . \\
& \left(-a_{N-1} b_{N-2} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{N-2} 2^{i}\right) 2^{N-2}+\left(a_{N-1} b_{N-1} 2^{N-1}-\sum_{i=0}^{N-2} a_{i} b_{N-1} 2^{i}\right) 2^{N-1}
\end{aligned}
$$

Denote the 2 's-complement partial products as:

$$
\begin{align*}
& P P_{0}=\left(-a_{N-1} b_{0} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{0} 2^{i}\right) 2^{0}, \quad P P_{1}=\left(-a_{N-1} b_{1} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{1} 2^{i}\right) 2^{1} \\
& P P_{2}=\left(-a_{N-1} b_{2} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{2} 2^{i}\right) 2^{2}, P P_{3}=\left(-a_{N-1} b_{3} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{3} 2^{i}\right) 2^{3} \tag{2.3}
\end{align*}
$$

$P P_{N-2}=\left(-a_{N-1} b_{N-2} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{N-2} 2^{i}\right) 2^{N-2}, \quad P P_{N-1}=\left(a_{N-1} b_{N-1} 2^{N-1}-\sum_{i=0}^{N-2} a_{i} b_{N-1} 2^{i}\right) 2^{N-1}$
Consider the first $N-2$ even/odd partial products, $P P_{2 j}$ and $P P_{2 j+l}$, where $j=0,1,2, \ldots, \frac{N-4}{2}$. To align the 2 's-complement partial products, the sign of the even partial products is extended, and a low order zero is appended to the odd partial products:

$$
\begin{align*}
P P_{2 j} & =\left(-a_{N-1} b_{2 j} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{2 j} 2^{i}\right) 2^{2 j} \\
& =\left(-a_{N-1} b_{2 j} 2^{N}+a_{N-1} b_{2 j} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{2 j} 2^{i}\right) 2^{2 j} \\
P P_{2 j+1} & =\left(-a_{N-1} b_{2 j+1} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{2 j+1} 2^{i}\right) 2^{2 j+1}  \tag{2.4}\\
& =\left(-a_{N-1} b_{2 j+1} 2^{N}+a_{N-2} b_{2 j+1} 2^{N-1}+\sum_{i=1}^{N-2} a_{i-1} b_{2 j+1} 2^{i}+0\right) 2^{2 j}
\end{align*}
$$

Combining even/odd 2's-complement partial product pairs according to Equation (1.6), we have:

$$
\begin{align*}
P P_{2 j}+P P_{2 j+1} & =\left\{\left(-a_{N-1} b_{2 j}+\overline{a_{N-1} b_{2 j+1}}\right) 2^{N}+\sum_{i=1}^{N-1}\left(a_{i} b_{2 j}-\overline{a_{i-1} b_{2 j+1}}\right) 2^{i}+a_{0} b_{2 j}-1-1\right\} 2^{2 j}  \tag{2.5}\\
& =\left(\alpha_{N, 2 j} 2^{N}+\sum_{i=1}^{N-1} \alpha_{i, 2 j} 2^{i}+\alpha_{0,2 j}-1\right) 2^{2 j}
\end{align*}
$$

where,

$$
\begin{array}{ll}
\alpha_{N, 2 j}=-a_{N-1} b_{2 j}+\overline{a_{N-1} b_{2 j+1}} & \\
\alpha_{i, 2 j}=a_{i} b_{2 j}-\overline{a_{i-1} b_{2 j+1}}, & 1 \leq i \leq N-1  \tag{2.6}\\
\alpha_{0,2 j}=a_{0} b_{2 j}-1 &
\end{array}
$$

Encoding the redundant binary coefficients, $\alpha_{i}$, using two binary bits, all but the final RB partial product is encoded as:

$$
\begin{align*}
& \text { For } j=0,1, \ldots, \frac{N-4}{2}, \\
& \qquad \begin{array}{rll}
\alpha_{N, 2 j}^{+} & =\overline{a_{N-1} b_{2 j}}, & \alpha_{N, 2 j}^{-}=\overline{a_{N-1} b_{2 j+1}}, \\
\alpha_{i, 2 j}^{+} & =a_{i} b_{2 j}, & \alpha_{i, 2 j}^{-}=a_{i-1} b_{2 j+1}, \\
\alpha_{0,2 j}^{+} & =a_{0} b_{2 j}, & \alpha_{0,2 j}^{-}=0
\end{array} \tag{2.7}
\end{align*}
$$

Now, consider the last two 2's-complement partial products, $P P_{N-2}$ and $P P_{N-1}$ :

$$
\begin{align*}
P P_{N-2}+P P_{N-1}= & \left(-a_{N-1} b_{N-2} 2^{N}+a_{N-1} b_{N-2} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{N-2} 2^{i}\right) 2^{N-2} \\
& +\left(a_{N-1} b_{N-1} 2^{N}-a_{N-2} b_{N-1} 2^{N-1}-\sum_{i=1}^{N-2} a_{i-1} b_{N-1} 2^{i}\right) 2^{N-2}  \tag{2.8}\\
= & \left\{\left(-a_{N-1} b_{N-2}+a_{N-1} b_{N-1}\right) 2^{N}+\sum_{i=1}^{N-1}\left(a_{i} b_{N-2}-a_{i-1} b_{N-i}\right) 2^{i}+a_{0} b_{N-2}-0\right\} 2^{N-2} \\
= & \beta_{N}+\sum_{i=1}^{N-1} \beta_{i} 2^{i}+\beta_{0}
\end{align*}
$$

where,

$$
\begin{equation*}
\beta_{N}=-a_{N-1} b_{N-2}+a_{N-1} b_{N-1}, \beta_{i}=a_{i} b_{N-2}-a_{i-1} b_{N-1},(1 \leq i \leq N-1), \beta_{0}=a_{0} b_{N-2}-0 \tag{2.9}
\end{equation*}
$$

Encoding the redundant binary coefficients, $\beta_{i}$, using two binary bits, the final RB partial product is:

$$
\begin{array}{rlrl}
\beta_{N}^{+} & =\overline{a_{N-1} b_{N-2}}, & \beta_{N}^{-}=a_{N-1} b_{N-1}, & \\
\beta_{i}^{+} & =a_{i} b_{N-2}, & \beta_{i}^{-}=\overline{a_{i-1} b_{N-1}}, & 1 \leq i \leq N-1  \tag{2.10}\\
\beta_{0}^{+} & =a_{0} b_{N-2}, & \beta_{0}^{-}=1
\end{array}
$$

Figure 2-1 shows the RB implementation diagram of $A_{0} B_{0}+A_{1} B_{1}$ for 8 -bit numbers and Figure 2-2 shows the hardware implementation of $A_{0} B_{0}+A_{1} B_{1}$ with the inline redundant binary partial product generation (RBPPG) using 2 's-complement even/odd partial products. If the final redundant binary adder (RBA) is bypassed, the circuit in Figure 2-2 can also perform the separate multiplications, $A_{0} B_{0}$ and $A_{1} B_{1}$.


Figure 2-1. Inline Partial Product RB Implementation of $\boldsymbol{A}_{0} \boldsymbol{B}_{0}+\boldsymbol{A}_{I} \boldsymbol{B}_{I}$

Defining the redundant binary number $\Delta_{j}=A_{2 j} B_{2 j}+A_{2 j+1} B_{2 j+1}$, the general form of the inline multiplication inner-product is given by:

$$
\begin{equation*}
<\boldsymbol{A}, \boldsymbol{B}>=\sum_{j=0}^{M-1} A_{i} B_{i}=\sum_{j=0}^{\frac{M}{2}-1}\left(A_{2 j} B_{2 j}+A_{2 j+1} B_{2 j+1}\right)=\sum_{j=0}^{\frac{M}{2}-1} \Delta_{j} \tag{2.11}
\end{equation*}
$$



Figure 2-2. Inline Partial Product Structure of $\boldsymbol{A}_{0} \boldsymbol{B}_{0}+A_{1} B_{1}$

To realize the inner-product, all of the redundant binary numbers, $\Delta_{j}$, are added using a redundant binary adder tree, and the final sum of the redundant binary numbers can be converted into a 2 's-complement number using a RB-NB converter [5],[7],[12]. Figure 2-3 depicts the overall architecture of the RB inner-product circuit showing the RB adder tree, a RB accumulator, a RB-to-2's-complement converter, and additional data paths that might exist when the circuit is implemented in a DSP core.


Figure 2-3. Overall Structure of the Redundant Binary Inner-Product

### 2.1.2 Cross Partial Product Redundant Binary Inner-Product

An alternative method for mapping 2's-complement partial products to redundant binary partial products is to combine like partial products across the $A_{j} B_{j}$ and $A_{j+1} B_{j+1}$ pairs. The method derived here is similar to that provided by Shin and Jeon [50] for complex number multiplication. Consider the simple case of $M=2, A_{0} B_{0}+A_{1} B_{1}$, and expand it as:

$$
\begin{align*}
& A_{0} B_{0}+A_{1} B_{1} \\
&=\left(-a_{N-1,0} 2^{N-1}+\sum_{i=0}^{N-2} a_{i, 0} 2^{i}\right)\left(-b_{N-1,0} 2^{N-1}+\sum_{i=0}^{N-2} b_{i, 0} 2^{i}\right)+ \\
& \quad\left(-a_{N-1,1} 2^{N-1}+\sum_{i=0}^{N-2} a_{i, 1} 2^{i}\right)\left(-b_{N-1,1} 2^{N-1}+\sum_{i=0}^{N-2} b_{i, 1} 2^{i}\right) \\
&= 2^{N-1}\left\{a_{N-1,0} b_{N-1,0} 2^{N-1}+\sum_{i=0}^{N-2}\left(-a_{i, 0} b_{N-1,0}\right) 2^{i}+a_{N-1,1} b_{N-1,1} 2^{N-1}+\sum_{i=0}^{N-2}\left(-a_{i, 1} b_{N-1,1}\right) 2^{i}\right\}+ \\
& \sum_{i=0}^{N-2} 2^{i}\left\{b_{i, 0}\left(-a_{N-1,0} 2^{N-1}+\sum_{j=0}^{N-2} a_{j, 0} 2^{j}\right)+b_{i, 1}\left(-a_{N-1,1} 2^{N-1}+\sum_{j=0}^{N-2} a_{j, 1} 2^{i}\right)\right\} \\
&= 2^{N-1}\left\{\left(a_{N-1,0} b_{N-1,0}+a_{N-1,1} b_{N-1,1}\right) 2^{N-1}+\sum_{i=0}^{N-2}\left(-a_{i, 0} b_{N-1,0}-a_{i, 1} b_{N-1,1}\right) 2^{i}\right\}+ \\
& \sum_{i=0}^{N-2} 2^{i}\left\{\left(-a_{N-1,0} b_{i, 0}-a_{N-1,1} b_{i, 1}\right) 2^{N-1}+\sum_{j=0}^{N-2}\left(a_{j, 0} b_{i, 0}+a_{j, 1} b_{i, 1}\right) 2^{j}\right\} \tag{2.12}
\end{align*}
$$

Consider the first term of Equation (2.12),

$$
\begin{align*}
& 2^{N-1}\left\{\left(a_{N-1,0} b_{N-1,0}+a_{N-1,1} b_{N-1,1}\right) 2^{N-1}+\sum_{i=0}^{N-2}\left(-a_{i, 0} b_{N-1,0}-a_{i, 1} b_{N-1,1}\right) 2^{i}\right\} \\
&=-2^{N-1}\left\{\left(-a_{N-1,0} b_{N-1,0}\right) 2^{N-1}+\sum_{i=0}^{N-2}\left(a_{i, 0} b_{N-1,0}\right) 2^{i}\right. \\
&\left.\quad+\left(-a_{N-1,1} b_{N-1,1}\right) 2^{N-1}+\sum_{i=0}^{N-2}\left(a_{i, 1} b_{N-1,1}\right) 2^{i}\right\}  \tag{2.13}\\
&= 2^{N-1}\left\{\left(a_{N-1,0} b_{N-1,0}-\overline{a_{N-1,1} b_{N-1,1}}\right) 2^{N-1}\right. \\
&\left.\quad+\sum_{i=0}^{N-2}\left(-a_{i, 0} b_{N-1,0}+\overline{a_{i, 1} b_{N-1,1}}\right) 2^{i}+1\right\} \\
&= 2^{N-1}\left\{\kappa_{N-1,0} 2^{N-1}+\sum_{i=0}^{N-2} \kappa_{i, 0} 2^{i}+1\right\}
\end{align*}
$$

where $\kappa_{i, 0}$ is a redundant binary number and,

$$
\begin{equation*}
\kappa_{N-1,0}=a_{N-1,0} b_{N-1,0}-\overline{a_{N-1,1} b_{N-1,1}} \text {, and } \kappa_{i, 0}=-a_{i, 0} b_{N-1,0}+\overline{a_{i, 1} b_{N-1,1}} \text { for } 0 \leq i \leq N-2 \tag{2.14}
\end{equation*}
$$

Encoding the redundant binary numbers, $\kappa_{N-1,0}$ and $\kappa_{i, 0}$, the Boolean equations are

$$
\begin{align*}
& \kappa_{N-1,0}^{-}=a_{N-1,0} b_{N-1,0}, \kappa_{N-1,0}^{+}=a_{N-1,1} b_{N-1,1},  \tag{2.15}\\
& \kappa_{i, 0}^{-}=\overline{a_{i, 0} b_{N-1,0}}, \text { and } \kappa_{i, 0}^{+}=\overline{a_{i, 1} b_{N-1,1}}
\end{align*}
$$

Considering the second term of Equation (2.12), and using Equation (1.6),

$$
\begin{align*}
& \sum_{i=0}^{N-2} 2^{i}\left\{\left(-a_{N-1,0} b_{i, 0}-a_{N-1,1} b_{i, 1}\right) 2^{N-1}+\sum_{j=0}^{N-2}\left(a_{j, 0} b_{i, 0}+a_{j, 1} b_{i, 1}\right) 2^{j}\right\} \\
= & \sum_{i=0}^{N-2} 2^{i}\left\{-a_{N-1,0} b_{i, 0}+\sum_{j=0}^{N-2}\left(a_{j, 0} b_{i, 0}\right) 2^{j}-\left(a_{N-1,1} b_{i, 1}\right) 2^{N-1}+\sum_{j=0}^{N-2}\left(a_{j, 1} b_{i, 1}\right) 2^{j}\right\} \\
= & \sum_{i=0}^{N-2} 2^{i}\left\{\left(-a_{N-1,0} b_{i, 0}+\overline{a_{N-1,1} b_{i, 1}}\right) 2^{N-1}+\sum_{j=0}^{N-2}\left(a_{j, 0} b_{i, 0}-\overline{a_{j, 1} b_{i, 1}}\right) 2^{j}-1\right\}  \tag{2.16}\\
= & \sum_{i=0}^{N-2} 2^{i}\left\{\lambda_{i, N-1} 2^{N-1}+\sum_{j=0}^{N-2} \lambda_{i, j} 2^{j}-1\right\}
\end{align*}
$$

where $\lambda_{i, j}(0 \leq i \leq N-2,0 \leq j \leq N-1)$ is a redundant binary number with
$\lambda_{i, j}=a_{j, 0} b_{i, 0}-\overline{a_{j, 1} b_{i, 1}}$ for $0 \leq j \leq N-2$, and $\lambda_{i, N-1}=-a_{N-1,0} b_{i, 0}+\overline{a_{N-1,1} b_{i, 1}}$ for $j=N-1$
Encoding $\lambda_{i, j}$ as two binary bits, $\lambda_{i, j}^{-}$and $\lambda_{i, j}^{+},(0 \leq i \leq N-2,0 \leq j \leq N-1)$,

$$
\begin{align*}
& \lambda_{i, N-1}^{-}=\overline{a_{N-1,0} b_{i, 0}}, \lambda_{i, N-1}^{+}=\overline{a_{N-1,1} b_{i, 1}}  \tag{2.18}\\
& \lambda_{i, j}^{-}=a_{j, 0} b_{i, 0}, \text { and } \lambda_{i, j}^{+}=a_{j, 1} b_{i, 1}
\end{align*}
$$

The overall inner-product is expressed as:

$$
\begin{align*}
& A_{0} B_{0}+A_{1} B_{1} \\
&= 2^{N-1}\left\{\left(a_{N-1,0} b_{N-1,0}+a_{N-1,1} b_{N-1,1}\right) 2^{N-1}+\sum_{i=0}^{N-2}\left(-a_{i, 0} b_{N-1,0}-a_{i, 1} b_{N-1,1}\right) 2^{i}\right\}+ \\
& \sum_{i=0}^{N-2} 2^{i}\left\{\left(-a_{N-1,0} b_{i, 0}-a_{N-1,1} b_{i, 1}\right) 2^{N-1}+\sum_{j=0}^{N-2}\left(a_{j, 0} b_{i, 0}+a_{j, 1} b_{i, 1}\right) 2^{j}\right\} \\
&= 2^{N-1}\left\{\kappa_{N-1,0} 2^{N-1}+\sum_{i=0}^{N-2} \kappa_{i, 0} 2^{i}+1\right\}+\sum_{i=0}^{N-2} 2^{i}\left\{\lambda_{i, N-1} 2^{N-1}+\sum_{j=0}^{N-2} \lambda_{i, j} 2^{j}-1\right\}  \tag{2.19}\\
&= 2^{N-1}\left\{\kappa_{N-1,0} 2^{N-1}+\sum_{i=0}^{N-2} \kappa_{i, 0} 2^{i}\right\}+\sum_{i=0}^{N-2} 2^{i}\left\{\lambda_{i, N-1} 2^{N-1}+\sum_{j=0}^{N-2} \lambda_{i, j} 2^{j}\right\} \\
&+2^{N-1}-\sum_{i=0}^{N-2} 2^{i}
\end{align*}
$$

Since $2^{N-1}-\sum_{i=0}^{N-2} 2^{i}=1$, Equation (2.19) becomes:

$$
\begin{equation*}
A_{0} B_{0}+A_{1} B_{1}=2^{N-1}\left\{\kappa_{N-1,0} 2^{N-1}+\sum_{i=0}^{N-2} \kappa_{i, 0} 2^{i}\right\}+\sum_{i=0}^{N-2} 2^{i}\left\{\lambda_{i, N-1} 2^{N-1}+\sum_{j=0}^{N-2} \lambda_{i, j} 2^{j}\right\}+1 \tag{2.20}
\end{equation*}
$$

The adjusting term, +1 , can be applied as a carry-in to the LSD of the redundant binary full adder. Figure 2-4 shows the schematic structure of the $A_{0} B_{0}+A_{1} B_{1}$ hardware implementation.


Figure 2-4. Cross Partial Product Structure of $A_{0} B_{0}+A_{1} B_{1}[50]$

Defining the redundant binary number $\Phi_{j}=A_{2 j} B_{2 j}+A_{2 j+1} B_{2 j+1}$, the general form of the cross partial product method inner-product is given by:

$$
\begin{equation*}
\langle\boldsymbol{A}, \boldsymbol{B}\rangle=\sum_{j=0}^{M-1} A_{i} B_{i}=\sum_{j=0}^{\frac{M}{2}-1}\left(A_{2 j} B_{2 j}+A_{2 j+1} B_{2 j+1}\right)=\sum_{j=0}^{\frac{M}{2}-1} \Phi_{j} \tag{2.21}
\end{equation*}
$$

Again, all of the redundant binary numbers, $\Phi_{j}$, are added using a redundant binary adder tree, and the final sum of the redundant numbers can be converted into a 2 's-complement number using a RB-NB converter [5],[7],[12]. The same redundant binary adder tree used for the inline inner-product, shown in Figure 2-4, can be applied to the cross partial product method.

### 2.1.3 Booth Encoding Methods

To further reduce the number of partial products, the modified Booth encoding technique is used [96]. The modified Booth algorithm recodes an $N$-bit 2's-complement number, $B$, by the following equation:

$$
\begin{align*}
B & =-b_{N-1} 2^{N-1}+\sum_{i=0}^{N-2} b_{i} 2^{i}=\sum_{i=0}^{\frac{N}{2}-1}\left(-2 b_{2 i+1}+b_{2 i}+b_{2 i-1}\right) 2^{2 i}  \tag{2.22}\\
& =\sum_{i=0}^{\frac{N}{2}-1} Q_{i} 2^{2 i}
\end{align*}
$$

where $b_{-1}=0$ and $Q_{i} \in\{-2,-1,0,+1,+2\}$ is determined according to the bit pattern of the 3-bit string of B as given in Table 2-A.

Table 2-A. Modified Booth Encoding Table [96]

| $b_{i+1}$ | $b_{i}$ | $b_{i-1}$ | $Q_{i}$ |
| :---: | :---: | :---: | :---: |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 2 |
| 1 | 0 | 0 | -2 |
| 1 | 0 | 1 | -1 |
| 1 | 1 | 0 | -1 |
| 1 | 1 | 1 | 0 |

For the inline partial product method, we apply Booth coding to find the product of $A$ and $B$ (assuming $N$ is even), using Equation (2.22),

$$
\begin{align*}
A B & =\left(-a_{N-1} 2^{N-1}+\sum_{k=0}^{N-2} a_{k} 2^{k}\right) \sum_{i=0}^{\frac{N}{2}-1} Q_{i} 2^{2 i} \\
& =\sum_{i=0}^{\frac{N}{2}-1} Q_{i}\left(-a_{N-1} 2^{N-1}+\sum_{k=0}^{N-2} a_{k} 2^{k}\right) 2^{2 i}  \tag{2.23}\\
& =\sum_{i=0}^{\frac{N}{2}-1}\left(C_{i}+g_{i}\right) 2^{2 i}
\end{align*}
$$

If $Q_{i}=1, C_{i}=-a_{N-1} 2^{N-1}+\sum_{k=0}^{N-2} a_{k} 2^{k}$ and $g_{i}=0$
If $Q_{i}=2, C_{i}=-a_{N-1} 2^{N}+\sum_{k=0}^{N-2} a_{k} 2^{k+1}$ and $g_{i}=0$
If $Q_{i}=-1, C_{i}=-\overline{a_{N-1}} 2^{N-1}+\sum_{k=0}^{N-2} \overline{a_{k}} 2^{k}$ and $g_{i}=1$
If $Q_{i}=-2, C_{i}=-\overline{a_{N-1}} 2^{N}+\sum_{k=0}^{N-2} \overline{a_{k}} 2^{k+1}$ and $g_{i}=1$
Notice that $C_{i}$ is a 2 's-complement number. Mapping the product to redundant binary,

$$
\begin{align*}
A B & =\sum_{i=0}^{\frac{N}{2}-1}\left(C_{i}+g_{i}\right) 2^{2 i} \\
& =\sum_{i=0,2,4, \ldots \frac{N}{2}-2}\left\{\left(C_{i}+g_{i}\right) 2^{2 i}+\left(C_{i+1}+g_{i+1}\right) 2^{2 i+2}\right\} \\
& =\sum_{i=0,2,4, \ldots, \ldots-2}^{2}-2  \tag{2.25}\\
& \left.=\sum_{i=0,2,4, \ldots \frac{N}{2}-2}\left\{C_{i}+4 C_{i+1}\right) 2^{2 i}+\left(g_{i}+4 g_{i+1}\right) 2^{2 i}\right\} \\
& \left.\left.=\sum_{i=0,2,4, \ldots, \frac{N}{2}-2}\left(E_{i, i+1}+\gamma_{i}+y_{i+1}\right) 2^{2 i}-1\right) 2^{2 i}\right\}
\end{align*}
$$

where $E_{i}$ is a redundant binary number and $E_{i}-1=C_{i}+4 C_{i+1}$. The correction factors, $\gamma_{i}$ and $\gamma_{i+1}$, depend on the values of $g_{i}$ and $g_{i+1}$, as shown in Table 2-B. Here, the two redundant binary Booth correction factors, $\gamma_{i}$ and $\gamma_{i+1}$, are used since $g_{i}$ and $g_{i+1}$ can not be combined. The -1 is encoded into the $\gamma_{i}$ correction factor.

Table 2-B. Booth Correction Factors for the Inline Multiplication Method

| $g_{i}$ | $g_{i+1}$ | $\gamma_{i}\left(\gamma_{i}^{+} \gamma_{i}^{-}\right)$ | $\gamma_{i+1}\left(\gamma_{i+1}^{+} \gamma_{i+1}^{-}\right)$ |
| :---: | :---: | :---: | :---: |
| 0 | 0 | $-1\left(\begin{array}{ll}0 & 0\end{array}\right)$ | $0\left(\begin{array}{ll}0 & 1\end{array}\right)$ |
| 0 | 1 | $-1\left(\begin{array}{ll}0 & 0\end{array}\right)$ | $1\left(\begin{array}{ll}1 & 1\end{array}\right)$ |
| 1 | 0 | $0\left(\begin{array}{ll}0 & \end{array}\right.$ | $0\left(\begin{array}{ll}0 & 1\end{array}\right)$ |
| 1 | 1 | $0(01)$ | $1\left(\begin{array}{ll}1 & 1\end{array}\right)$ |

Referring to Booth coding Table 2-A, the Boolean equations for the correction factors are,

$$
\begin{align*}
\gamma_{i}^{+} & =0 \\
\gamma_{i}^{-} & =b_{i+1} \overline{b_{i}} \overline{b_{i-1}}+b_{i+1} \overline{b_{i}} b_{i-1}+b_{i+1} b_{i} \overline{b_{i-1}} \\
& =b_{i+1} \overline{b_{i} b_{i-1}}  \tag{2.26}\\
\gamma_{i+1}^{+} & =b_{i+3} \overline{b_{i+2} b_{i+1}} \\
\gamma_{i+1}^{-} & =1
\end{align*}
$$

where $i=0,2,4, \ldots \frac{N}{2}-2$. After the redundant binary products of $A_{0} B_{0}$ and $A_{1} B_{1}$ are computed, a redundant binary adder is used to compute $A_{0} B_{0}+A_{1} B_{1}$.

The general form of the inline Booth encoded inner-product is,

$$
\begin{equation*}
<\boldsymbol{A}, \boldsymbol{B}>=\sum_{j=0}^{M-1} A_{j} \boldsymbol{B}_{j}=\sum_{j=0}^{M-1} \sum_{i=0,2,4, \ldots, \frac{N}{2}-2}\left(E_{j,(i, i+1)}+\gamma_{j, i}+\gamma_{j, i+1}\right) 2^{2 i} \tag{2.27}
\end{equation*}
$$

where the number of partial products is decreased to slightly above $25 \%$, with consideration being given to the correction factors.

Applying Booth encoding for the cross partial product method, we must consider $A_{0} B_{0}+A_{1} B_{1}$. Using Equation (2.22),

$$
\begin{align*}
A_{0} B_{0}+A_{1} B_{1}= & \sum_{i=0}^{\frac{N}{2}-1} Q_{i, 0}\left(-a_{N-1,0} 2^{N-1}+\sum_{k=0}^{N-2} a_{k, 0} 2^{k}\right) 2^{2 i} \\
& +\sum_{i=0}^{\frac{N}{2}-1} Q_{i, 1}\left(-a_{N-1,1} 2^{N-1}+\sum_{k=0}^{N-2} a_{k, 1} 2^{k}\right) 2^{2 i} \\
= & \sum_{i=0}^{\frac{N}{2}-1}\left(C_{i, 0}+g_{i, 0}\right) 2^{2 i}+\sum_{i=0}^{\frac{N}{2}-1}\left(C_{i, 1}+g_{i, 1}\right) 2^{2 i}  \tag{2.28}\\
= & \sum_{i=0}^{\frac{N}{2}-1}\left(C_{i, 0}+C_{i, 1}+g_{i, 1}+g_{i, 0}\right) 2^{2 i}
\end{align*}
$$

From Equation (1.6), the sum of two 2's-complement numbers can be considered as a redundant binary number minus 1 . Equation (2.28) can be converted to:

$$
\begin{align*}
A_{0} B_{0}+A_{1} B_{1} & =\sum_{i=0}^{\frac{N}{2}-1}\left(C_{i, 0}+C_{i, 1}+g_{i, 1}+g_{i, 0}\right) 2^{2 i} \\
& =\sum_{i=0}^{\frac{N}{2}-1}\left(\Gamma_{i, 01}+g_{i, 1}+g_{i, 0}-1\right) 2^{2 i}  \tag{2.29}\\
& =\sum_{i=0}^{\frac{N}{2}-1}\left(\Gamma_{i, 01}+\gamma_{i, 01}\right) 2^{2 i}
\end{align*}
$$

where $\Gamma_{i, 01}$ is a redundant binary number from the addition of $C_{i, 0}$ and $C_{i, 1}$, and the redundant binary number $\gamma_{i, 01}=g_{i, 0}+g_{i, 1}-1$. The correction factor, $\gamma_{j, 01}$, depends on the values of $g_{i, 0}$ and $g_{i, 1}$, as shown in Table 2-C.

Table 2-C. Booth Correction Factors for the Cross Partial Product Method

| $g_{i, 0}$ | $g_{i, 1}$ | $\gamma_{i, 01}=g_{i, 0}+g_{i, 1}-1$ | $\gamma_{i, 01}^{-} \gamma_{i, 01}^{+}$ |  |
| :---: | :---: | :---: | :---: | :---: |
| 0 | 0 | -1 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 |

From the Booth coding Table 2-A, the bit encoding Boolean equations are:

$$
\begin{align*}
\gamma_{i, 01}^{-}=g_{i, 0} & =b_{i+1,0} \overline{b_{i, 0}} \overline{b_{i-1,0}}+b_{i+1,0} \overline{b_{i, 0}} b_{i-1,0}+b_{i+1,0} b_{i, 0} \overline{b_{i-1,0}} \\
& =b_{i+1,0} \overline{b_{i, 0} b_{i-1,0}}  \tag{2.30}\\
\gamma_{i, 01}^{+}=g_{i, 1} & =b_{i+1,1} \overline{b_{i, 1}} \overline{b_{i-1,1}}+b_{i+1,1} \overline{b_{i, 1}} b_{i-1,1}+b_{i+1,1} b_{i, 1} \overline{b_{i-1,1}} \\
& =b_{i+1,1} \overline{b_{i, 1} b_{i-1,1}}
\end{align*}
$$

Using Booth coding, the cross partial product inner-product method is given by

$$
\begin{equation*}
<\boldsymbol{A}, \boldsymbol{B}>=\sum_{j=0}^{M-1} A_{j} B_{j}=\sum_{j=0}^{\frac{M}{2}-1}\left(A_{2 j} B_{2 j}+A_{2 j+1} B_{2 j+1}\right)=\sum_{j=0}^{\frac{M}{2}-1} \sum_{i=0}^{\frac{N}{2}-1}\left(\Gamma_{i,(2 j, 2 j+1)}+\gamma_{i,(2 j, 2 j+1)}\right) 2^{2 i} \tag{2.31}
\end{equation*}
$$

Again, the number of partial products is decreased to slightly above $25 \%$, with consideration being given to the correction factors.

### 2.1.4 Implementation Comparison of Inline and Cross InnerProduct Methods

An 8-tap digital filter implementation in [28] uses a signed-amplitude system to encode a redundant binary number. The signed-amplitude method requires two gate delays for the conversion from 2 's-complement to redundant binary. Examining Equations (1.7) and (1.8), only inverters are necessary for the inline partial product redundant binary mapping.

In [49], the cross partial product implementations of $A_{0} B_{0}+A_{1} B_{1}$ and $A_{0} B_{0}-A_{1} B_{1}$ are discussed for complex number multiplication. An equivalent derivation was provided in Section 2.1.2. The inline method, presented in Section 2.1.1, combines partial products within the partial products of $A_{0} B_{0}$ and $A_{1} B_{1}$, respectively. Figure 2-2 and Figure 2-4 depict these methods. For a qualitative comparison of these two designs, note that as feature sizes shrink in deep submicron VLSI technology, interconnection wires contribute
a large portion of the total delay [49],[50]. The inline implementation provides more direct routing for vertical (horizontal) wires, while the cross partial product method [49],[50] will need crossing horizontal (vertical) wiring paths for partial product mapping, with the routing distance proportional to the word width. Therefore, the inline partial product method will result in improved performance, compared to the cross partial product method. In addition, the inline method offers more extended operational capability than the cross partial product scheme (see Chapter 4). The inline method requires more horizontal gates, primarily due to the overhead of the partial product alignment. Table 2-D and Table 2-E show comparisons for Xilinx FPGA implementations for $A_{0} B_{0}+A_{1} B_{1}$, with and without Booth encoding. The Xilinx Virtex2 2V6000FF1517 device was targeted for the implementations using VHDL and Xilinx Foundation software. For this word length, the higher performance and area savings of the Booth encoded designs are evident (See Appendix for VHDL code availability).

Table 2-D. 16-Bit FPGA Implementations of $A_{0} B_{0}+A_{1} B_{I}$ Without Booth Encoding

|  | Cross PP Method | Inline PP Method |
| :--- | :---: | :---: |
| Number of Slices | 1094 | 1094 |
| Number of LUTs | 1992 | 2002 |
| Equivalent Gate Count | 11952 | 12012 |
| Maximum Delay | 30.014 ns | 28.084 ns |

Table 2-E. 16-Bit FPGA Implementations of $\boldsymbol{A}_{0} B_{0}+A_{1} B_{1}$ With Booth Encoding

|  | Cross PP Method | Inline PP Method |
| :--- | :---: | :---: |
| Number of Slices | 846 | 905 |
| Number of LUTs | 1603 | 1717 |
| Equivalent Gate Count | 9618 | 10302 |
| Maximum Delay | 29.460 ns | 27.970 ns |

### 2.2 Complex Number Inner-Product Computation

### 2.2.1 Review of Complex Number Arithmetic

Complex number arithmetic computation is a key arithmetic feature in modern digital communication, radar systems and optical systems. Many algorithms based on convolutions, correlations, and complex filters require complex number multiplication, complex number division, and high-speed inner-products. These applications require efficient representation and manipulation of complex numbers together with real numbers. Among these computations, high-performance complex number multipliers and complex number inner-products are desirable in modern digital communication, optical systems, and radar systems. Recent research in hardware implementation of complex number arithmetic circuits is focused on utilization of radix-(2), as well as alternative radices, for the representation of complex numbers.

In this chapter, different complex radices are investigated and compared. It is found that the complex radices have no advantage in hardware implementations. Traditional radix-(2) redundant binary numbers are used to implement complex-number multiplication and inner-product processing. The investigated inline inner-product processor can be reconfigured/controlled to perform complex-number computations. The computational structures of $A_{0} B_{0}+A_{1} B_{1}$ and $A_{0} B_{0}-A_{1} B_{1}$ are developed for performing complex number inner-products. The implementation of $A_{0} B_{0}+A_{1} B_{1}$ can be easily controlled to perform the computation of $A_{0} B_{0}-A_{1} B_{1}$. A complex number inner-product processor is realized, based upon a unified structure for $A_{0} B_{0} \pm A_{1} B_{1}$.

To represent a complex number other than radix-(2), several representations have been proposed. Knuth [63],[64] described a "quater-imaginary" number system with
radix-(2j). Dao [58] further analyzed this quater-imaginary system for complex-radix arithmetic. Penney [66] proposed a complex number representation with the base of $j-1$. Slekys [67] defined arithmetic operations on radix- $(j \sqrt{2})$. Recently, further investigations examined the arithmetic algorithms and hardware implementation of these representations. Aoki [31],[97],[98], and Mcllhenny [99],[100] investigated complex number arithmetic in a redundant radix-(2j) number system. Jamil [62] and Blest [55] further analyzed the complex number computations in the radix- $(j-1)$ number system and included proposed arithmetic methods for addition, subtraction, multiplication and division. Frougny [60] and Koren [65] provided a theoretic investigation of complex number arithmetic for complex numbers in the bases $j \sqrt{b}$ and $-b+j$. Stepanenko [68] also investigated the complex number arithmetic in radix- $(j \sqrt{2})$.

Multiplication is an essential operation for high-speed hardware implementation of complex number computations. It can be used to compare the complexity of complex number arithmetic with different complex radices. The analysis of complex number multiplication in these various radices will provide one metric for comparison.

To compute the product of two complex numbers, the conventional method is to use four binary multiplications, one addition, and one subtraction, as shown in Figure $2-5$. Define two complex numbers as:

$$
\begin{align*}
& A=A_{r}+j A_{i}  \tag{2.32}\\
& B=B_{r}+j B_{i}
\end{align*}
$$

where $j=\sqrt{-1}$, and $A_{r}, A_{i}, B_{r}$, and $B_{i}$ are the real and imaginary parts of the complex numbers, $A$ and $B$. Multiplication of $A$ and $B$ is given by:

$$
\begin{equation*}
A \times B=\left(A_{r}+j A_{i}\right) \times\left(B_{r}+j B_{i}\right)=A_{r} B_{r}-A_{i} B_{i}+j\left(A_{r} B_{i}+A_{i} B_{r}\right) \tag{2.33}
\end{equation*}
$$



Figure 2-5. Basic Diagram of Complex Number Multiplication

In this direct implementation method, four multiplications plus two additions are required. To reduce the arithmetic complexity of the complex number multiplication, an algebraic transform, given in Equation (2.34), is proposed by Blahut [101]. This method saves one real number multiplication at the expense of three more additions:

$$
\begin{align*}
& \left(A_{r}+j A_{i}\right) \times\left(B_{r}+j B_{i}\right)  \tag{2.34}\\
& =\left\{\left(A_{r}-A_{i}\right) B_{i}+A_{r}\left(B_{r}-B_{i}\right)\right\}+j\left\{\left(A_{r}-A_{i}\right) B_{i}+A_{i}\left(B_{r}+B_{i}\right)\right\}
\end{align*}
$$



Figure 2-6. Blahut's Complex Number Multiplier [101]

As shown in Figure 2-6, this method requires pre-addition of $B_{r}+B_{i}$ and presubtractions, $A_{r}-A_{i}$ and $B_{r}-B_{i}$, before the binary multiplications, resulting in an increase of critical path delay. Although addition is generally less expensive in area than multiplication, the overall savings in hardware does not offset the non-trivial critical path delay. Therefore, the complex multiplication scheme given in Equation (2.33) will be utilized in this research.

### 2.2.2 Comparison of Different Complex Radices

Representing complex numbers with a complex radix implies that the complex numbers can be manipulated without separating the real and the imaginary part. It is supposed that in these complex radices, the complex-number computational arithmetic will be simplified. For example, complex number multiplication may only need one complex radix multiplication and hence provide a major performance improvement. Can a complex radix system really achieve such improvement? In the analysis, the various complex radices are compared, although it is interesting that no significant improvement is achieved compared to the traditional 2's-complement binary representation of complex numbers with the real and imaginary parts treated separately. Further, many of the alternative complex radix representations are unbalanced or fractal, thereby providing, in limited precision hardware, a significant representation issue for the range of the real and imaginary number components as well as the positive and negative fixed-point values. A review of these alternative complex radices follows.

### 2.2.2.1 Radix-(2j)

As early as 1960 , Knuth [63],[64] proposed radix-(2j) which leads to an interesting system called "quater-imaginary" (by analogy with "quaternary" or base-4). In this system every complex number is represented with the digits $0,1,2$, and 3 without a sign. For example:

$$
\begin{equation*}
(11210.31)_{2 j}=16+(-8 j)+2 \times(-4)+(2 j)+3 \times\left(-\frac{1}{2} j\right)+\left(-\frac{1}{4}\right)=7 \frac{3}{4}-7 \frac{1}{2} j \tag{2.35}
\end{equation*}
$$

Here the number $\left(a_{2 N} \mathrm{~L} a_{1} a_{0} \cdot a_{-1} \mathrm{~L} a_{-2 K}\right)_{2 j}$ is equal to

$$
\begin{equation*}
\left(a_{2 N} \mathrm{~L} \quad a_{2} a_{0} \cdot a_{-2} \mathrm{~L} a_{-2 K}\right)_{-4}+2 j\left(a_{2 N-1} \mathrm{~L} a_{3} a_{1} \cdot a_{-1} \mathrm{~L} a_{-2 K+1}\right)_{-4} \tag{2.36}
\end{equation*}
$$

Conversion to and from quater-imaginary radix reduces to the conversion to and from negative quaternary representation of the real and imaginary parts. In his book [64], Knuth proposed that the interesting property of this system is that it allows the multiplication and division of complex numbers to be done in a fairly unified manner without treating real and imaginary parts separately. For example, we can multiply two numbers in this system, much as we may do with any radix, by merely using a different carry rule. Whenever a sum digit exceeds 4 , we subtract 4 from the sum digit and carry -1 two columns to the left; when a sum digit is negative, we add four to it and carry +1 two columns to the left.

Representing complex numbers in radix-( $2 j$ ) is the same as representing the real and imaginary parts in radix-(-4). Although there is no sign to deal with in radix-(-4), the number system is imbalanced. The imbalance of the negative-base number system in Zohar's work [102] isn't correct. Zohar's results [102] are shown in Equations (2.37) and (2.38):

For radix $q<0$, the maximum positive number is given by

$$
\begin{equation*}
P=\frac{q^{2[(D+1) / 2]}-1}{|q|+1} \tag{2.37}
\end{equation*}
$$

and the minimum negative number is given by

$$
\begin{equation*}
N=q \frac{q^{2[D / 2]}-1}{|q|+1} \tag{2.38}
\end{equation*}
$$

Here are the correct results. Consider a system that uses $D$ digits to represent numbers in the base $q<0$. When $D$ is even, the largest representable integer is the positive number $P$, whose representation is:

$$
\begin{equation*}
P=(0,|q|-1, \mathrm{~L}, 0,|q|-1,0,|q|-1) . \tag{2.39}
\end{equation*}
$$

Its value is given by,

$$
\begin{equation*}
P=(|q|-1) \frac{1-q^{D-2} q^{2}}{1-q^{2}}=\frac{|q|^{D}-1}{1+|q|} \tag{2.40}
\end{equation*}
$$

Similarly, the smallest integer $(N)$ is:

$$
\begin{equation*}
N=(|q|-1, \mathrm{~L}, 0,|q|-1,0,|q|-1,0) . \tag{2.41}
\end{equation*}
$$

Its value is given by,

$$
\begin{equation*}
N=(1-|q|) \frac{|q|-|q|^{D-1}|q|^{2}}{1-|q|^{2}}=\frac{|q|\left(1-|q|^{D}\right)}{1+|q|} \tag{2.42}
\end{equation*}
$$

The number of integers contained in the closed interval defined this way is $P-N+1$. That is:

$$
\begin{equation*}
P+N-1=P=\frac{|q|^{D}-1}{1+|q|}+\frac{|q|\left(|q|^{D}-1\right)}{1+|q|}+1=|q|^{D} \tag{2.43}
\end{equation*}
$$

The result is very similar when $D$ is odd. The largest representable integer is the positive number $P$ whose representation is

$$
\begin{equation*}
P=(|q|-1, \mathrm{~L}, 0,|q|-1,0,|q|-1) \tag{2.44}
\end{equation*}
$$

Its value is given by:

$$
\begin{equation*}
P=(|q|-1) \frac{1-q^{D-1} q^{2}}{1-q^{2}}=\frac{|q|^{D+1}-1}{1+|q|} \tag{2.45}
\end{equation*}
$$

and the smallest integer ( $N$ ) is

$$
\begin{equation*}
N=(0,|q|-1, \mathrm{~L}, 0,|q|-1,0,|q|-1,0) \tag{2.46}
\end{equation*}
$$

Its value is given by:

$$
\begin{equation*}
N=(1-|q|) \frac{|q|-|q|^{D-2}|q|^{2}}{1-|q|^{2}}=\frac{|q|\left(1-|q|^{D-1}\right)}{1+|q|} \tag{2.47}
\end{equation*}
$$

The number of integers contained in the closed interval defined this way is $P-N+1$. That is:

$$
\begin{equation*}
P+N-1=P=\frac{|q|^{D+1}-1}{1+|q|}+\frac{|q|\left(|q|^{D-1}-1\right)}{1+|q|}+1=|q|^{D} \tag{2.48}
\end{equation*}
$$

This, however, is the number of different configurations of the $D$ digits. We conclude, therefore, that $D$ digits span all the integers from $N$ to $P$; regardless of $D$ being even or odd. A troubling result is that the closed interval $(N, P)$ is quite asymmetrical. A simple example will illustrate these statements.

Assume

$$
q=-10 \text { and } D=3
$$

Then the largest number is $(909)_{-10}=909$, and the smallest is $(90)_{-10}=-90$.
Assume

$$
q=-10 \text { and } D=4
$$

Then the largest number is $(0909)_{-10}=909$, and the smallest is $(9090)_{-10}=-9090$.

### 2.2.2.1.1 Complex Number Addition in Radix-(2j)

Dao [58] proposed a hardware implementation for the radix-( $2 j$ ) addition. The adding of two numbers X and Y in the quater-imaginary system, as in any positional representation, consists of adding digits of the same weight. The modulo-(-4) result produces a sum digit and a carry digit. In this radix, the carry is -1 and has a weight equal to that of the digit two digit positions to the left:

$$
x_{i}(2 j)^{i}+y_{i}(2 j)^{i}=\left(x_{i}+y_{i}\right)(2 j)^{i}
$$

where

$$
\begin{align*}
& 4 \leq\left(x_{i}+y_{i}\right) \leq 6  \tag{2.49}\\
& x_{i}+y_{i}=c_{i}(2 j)^{2}+s_{i}
\end{align*}
$$

For example:


Figure 2-7. An Example of Addition in Radix-(2j) [58]

Actually, the radix- $(2 j)$ adder is a radix-(-4) adder in the separate even and odd digit positions. The negative radix addition for real numbers is further investigated in [103],[104]. We conclude that the radix-(2j) addition reduces to the radix-(-4) addition in the even and odd positions separately.

### 2.2.2.1.2 Complex Number Subtraction in Radix-(2j)

In negative bases like (-4), no explicit sign digit is required in the representation. The negation of a number is obtained by taking the 4's-complement of each non-zero digit together with a positive carry digit of 1 two positions ahead:

$$
\begin{align*}
-x_{k}(2 j)^{k} & =\left(-4+\overline{x_{k}}\right)(2 j)^{k} \text { with } \overline{x_{k}}=4-x_{k}  \tag{2.50}\\
& =(2 j)^{k+2}+\overline{x_{k}}(2 j)^{k}
\end{align*}
$$

The subtraction of a number $X$ is reduced to adding its 4's-complement with proper carry propagation:

Similar to addition, the implementation of radix-(2j) subtraction is radix-(-4) subtraction in the even and odd positions separately.


## Figure 2-8. An Example of Subtraction in Radix-(2j) [58]

### 2.2.2.1.3 Complex Number Multiplication in Radix-(2j):

Serial multiplication, i.e., one digit of the multiplier at a time, proceeds as in the binary case. Given,

$$
\begin{aligned}
& X=x_{0}(2 j)^{0}+\mathrm{L}+x_{N-1}(2 j)^{N-1} \\
& Y=y_{0}(2 j)^{0}+\mathrm{L}+y_{N-1}(2 j)^{N-1}
\end{aligned}
$$

the product is:

$$
\begin{equation*}
Z=X Y=\sum_{k=0}^{N-1} \sum_{i=0}^{N-1}\left[y_{k} x_{i}(2 j)^{i}\right](2 j)^{k} \tag{2.51}
\end{equation*}
$$

The digit product $y_{k} x_{i}$ can generate a carry $(0,-1,-2)$, which must be added to the digit position two places ahead in the partial sum:

$$
\begin{align*}
& y_{k} x_{i}=-c_{k, i}(-4)+d_{k, i}=-c_{k, i}(2 j)^{2}+d_{k, i} \\
& Z=\sum_{k=0}^{N-1} \sum_{i=0}^{N-1}\left[-c_{k, i}(2 j)^{i+2}+d_{k, i}(2 j)\right](2 j)^{k} \tag{2.52}
\end{align*}
$$

The terms inside the bracket represent the partial product from the lower digit of the multiplier.

Notice that shifting a number $X$ one position to the left is equivalent to rotating the vector $x$ by 90 degrees and doubling its length, while shifting to the right one position results in a -90 degree rotation and halving the length.


Figure 2-9. An Example of Multiplication in Radix-(2j) [58]

The implementation of radix-( $2 j$ ) multiplication results in a radix-( $2 j$ ) partial product generator with even and odd positions separately followed by a radix-(-4) addition tree to generate the final product (see example in Figure 2-9).

An analysis of Knuth's "quater-imaginary" radix shows that there are several disadvantages of this imaginary number system:

- Since the numbers obtained from sensors and digital systems are normally 2's-complement binary numbers, a conversion from 2's-complement to radix(2j) must be conducted. Our research [105] shows that the implementation of this conversion procedure will require a delay on the order of a carrylookahead adder and will add an additional computational delay in the critical path.
- A key computation for complex numbers in radix-( $2 j$ ) will invariably be multiplication. Without further developments, radix-(2j) is slow, compared to 2's-complement binary multiplication that uses Wallace trees, redundant binary number addition, Booth encoding, or array multipliers.
- Representing complex numbers in radix-(2j) is the same as representing the real and imaginary parts in radix-(-4). As previously shown in Equations (2.40), (2.42), (2.45) and (2.47), the radix-(-4) is an imbalanced system while a traditional positive radix system is a balanced number system.

Recently, Aoki [31],[97],[98] and Mcllhenny [99],[100] investigated the redundant complex radix-( $2 j$ ) arithmetic for high-speed signal processing with emphasis on complex number addition and multiplication. The addition of two numbers, $X=\left(x_{K-1} \mathrm{~L} x_{i} \mathrm{~L} x_{-L}\right)$ and $Y=\left(\begin{array}{l}y_{K-1} \mathrm{~L} \\ y_{i} \\ \mathrm{~L} \\ y_{-L}\end{array}\right)$ in the redundant complex number system $(2 j, 3)$, where $x_{i}, y_{i} \in\{-3,-2,-1, \ldots 1,2,3\}$, is performed by the following three steps for each digit:

$$
\begin{align*}
& \text { Step 1: } z_{i}=x_{i}+y_{i} \\
& \text { Step 2: }-4 c_{i}+w_{i}=z_{i}  \tag{2.53}\\
& \text { Step 3: } s_{i}=w_{i}+c_{i-2}
\end{align*}
$$

Here $z_{i}$ is the linear sum, $w_{i}$ is called the intermediate sum, and $c_{i}$ is the carry. This socalled radix-(2j) redundant number system is actually a radix-(4) redundant number system with the real and imaginary parts treated separately.

In summary, this analysis shows that the non-redundant radix-(2j) number system has several disadvantages and the redundant radix-( $2 j$ ) complex number system is a radix-(4) redundant binary system, with real and imaginary parts treated separately.

Therefore, radix-( $2 j$ ) offers no overall hardware implementation advantage over a conventional binary number system.

### 2.2.2.2 Radix-( $j \sqrt{2}$ )

A system similar to radix- $(2 j)$ that uses only the digits 0 and 1 is based on $j \sqrt{2}$. This scheme, however, requires an infinite non-repeating expansion for the simple number $0+1 j$. Slekys [67] defined arithmetic operations on a modified bi-imaginary number system based on radix- $(j \sqrt{2})$. If a modified bi-imaginary complex system is used to encode each complex number $a+j b$ as $a+j \sqrt{2} c$, then the number $\left(a_{2 N} \mathrm{~L} a_{1} a_{0} \cdot a_{-1} \mathrm{~L} \quad a_{-2 K}\right)_{\sqrt{2} j}$ is equal to:

$$
\left(a_{2 N} \mathrm{~L} a_{2} a_{0} \cdot a_{-2} \mathrm{~L} a_{-2 K}\right)_{-2}+\sqrt{2} j\left(a_{2 N-1} \mathrm{~L} a_{3} a_{1} \cdot a_{-1} \mathrm{~L} a_{-2 K+1}\right)_{-2}
$$

The conversion to and from radix- $(j \sqrt{2})$ notation reduces to the conversion to and from a negative- 2 representation of the real and imaginary parts separately. Slekys defined algorithms for complex number addition, subtraction, multiplication and division. However, since Slekys's radix- $(j \sqrt{2})$ system is not redundant, the computational arithmetic requires additional operations comparable to the conventional 2 's-complement representation of complex numbers. To see this, multiplication in this system is considered.

Define the multiplier and multiplicand as the complex numbers $Z_{1}$ and $Z_{2}$ respectively, where:

$$
\begin{aligned}
& Z_{1}=A+j B \\
& Z_{2}=C+j D
\end{aligned}
$$

Using a modified bi-imaginary representation let,

$$
\begin{aligned}
Z_{1}^{\prime} & =A+j \sqrt{2} B \\
Z_{2}^{\prime} & =C+j \sqrt{2} D
\end{aligned}
$$

Then the multiplication of $Z_{l}$ and $Z_{2}$ will be:

$$
\begin{equation*}
Z_{1} \bullet Z_{2}=Z_{1}^{\prime} \bullet Z_{2}^{\prime}+B \bullet D \tag{2.54}
\end{equation*}
$$

Equation (2.54) shows that in the bi-imaginary complex-radix system the multiplication of two complex numbers will be composed of one complex number multiplication requiring four multiplications and two additions, plus one real number multiplication and one complex number addition.

For complex radix- $(j \sqrt{2})$, we can also use a redundant binary system for the computational arithmetic. Similar to the case of radix-( $2 j$ ), this redundant complex radix( $j \sqrt{2}$ ) is just a radix-(2) redundant binary system with the real and imaginary parts treated separately.

### 2.2.2.3 Radix-(j-1)

A binary complex number system is also obtained by using the base $(j-1)$, as first suggested by Penney [66]. Further studies of this radix were conducted by Jamil [62] and Blest [55]. Jamil shows that the conversion from 2's-complement to radix- $(j-1)$ is actually the conversion from radix-(2) to radix-(-4) for the real and imaginary parts separately, with a ( $j-1$ )-based addition needed to complete the conversion procedure. The multiplication and division of complex numbers based on this radix are also presented in Jamil's [62] and Blest's work [55]. Hardware implementations were not specifically addressed. In fact, it appears that the hardware for this radix will possess considerable
latency and gate count due the necessary carry detection logic requirement for the addition operation.

In radix-( $j-1$ ), there exists a carry propagation problem in complex number addition that further exacerbates the partial product additions in the multiplication hardware. To deal with this carry propagation problem for high-speed parallel hardware, a zero-detector is required for each digit. The zero-detector adds additional latency and gate complexity.

The value of an $N$-bit binary number $A=\left(a_{N-1}, a_{N-2}, \cdots a_{1}, a_{0}\right)$ with radix- $(j-1)$ can be written in the form of a power series as follows:

$$
\begin{equation*}
A=a_{N-1}(-1+j)^{N-1}+a_{N-2}(-1+j)^{N-2}+\cdots+a_{1}(-1+j)+a_{0} \tag{2.55}
\end{equation*}
$$

where the coefficients $a_{N-1}, a_{N-2}, \cdots a_{1}, a_{0} \in\{0,1\}$. As an example, if $N$ is a 16 bit number, the powers of $-1+j$ associated with the coefficients will be (from bottom to top, right to left, in groups of four):
[Row 4] (-128-j 128), (0+j128), (64-j64), (-64+j0)
[Row 3] $(32+j 32),(0-j 32),(-16+j 16),(16+j 0)$
[Row 2] $(-8-j 8),(0+j 8),(4-j 4),(-4+j 0)$
[Row 1] $(2+j 2),(0-j 2),(-1+j 1),(1+j 0)$
To describe the hardware implementation difficulties for this system, we need only consider the addition operation. In radix $-(-1+j)$, we have:

$$
\begin{equation*}
(1)_{-1+j}+(1)_{-1+j}=(2)_{-1+j}=(1100)_{-1+j} \tag{2.56}
\end{equation*}
$$

In radix- $(-1+j)$ there are two carries from one bit position. The addition truth table for one bit position is shown in the Table 2-F.

From the truth table, it is seen that there is a carry propagation problem in the addition of numbers in radix- $(-1+j)$. Figure $2-10$ shows an example for $x_{i}=1, y_{i}=1, c_{i-2}=1$ and $c_{i-3}=1$.

In this example, $x_{i}=y_{i}=c_{i-2}=c_{i-3}=1$, from Table 2-F and Equation (2.56), the carry-out $c_{i+2}$ and $c_{i+3}$ to the 2 and 3 digit positions to the left are $c_{i+2}=c_{i+3}=2$. Then carry $c_{i+2}=2$ will further propagate to $c_{i+4}=c_{i+5}=1$ and carry $c_{i+3}=2$ will further propagate to $c_{i+5}=c_{i+6}=1$. Thus, carry $c_{i+5}=1+1=2$ will propagate to $c_{i+7}=c_{i+8}=1$.

Table 2-F. Truth Table for Radix-( $\mathbf{- 1 + j )}$ One-Bit Addition

| $x_{i}$ | $y_{i}$ | $c_{i-2}$ | $c_{i-3}$ | $z_{i}$ | $c_{i+2}$ | $c_{i+3}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 | 1 | 1 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 2 | 2 |
| 1 | 1 | 1 | 2 | 1 | 2 | 2 |
| $\ldots$ |  |  |  |  |  |  |
| 1 | 1 | 2 | 2 | 0 | 3 | 3 |
| $\ldots$. |  |  |  |  |  |  |
| 1 | 1 | 3 | 3 | 0 | 4 | 4 |
| $\ldots$ |  |  |  |  |  |  |



$$
\begin{array}{lllllll}
c_{i+8} & c_{i+7} & c_{i+6} & c_{i+5} & c_{i+4} & c_{i+3} & c_{i+2}
\end{array}
$$

Figure 2-10. An Example for Radix-(j-1) Carry Propagation

A possible hardware method to work around this problem is a zero-detector based on the equality:

$$
\begin{equation*}
(11)_{-1+j}+(111)_{-1+j}=(0)_{-1+j} \tag{2.57}
\end{equation*}
$$

However, the zero-detector is expensive, serial in nature, and produces high latency. From this analysis, the radix $-(-1+j)$ has no advantage in the complex number computational hardware for addition and multiplication.

In summary, different complex radices such as radix $-(2 j)$, radix $-(-1+j)$ and radix$(j \sqrt{2})$ are studied. It is shown that these complex radices have no advantage over traditional binary number systems in hardware implementation. Chang's research [56],[57] also supports this conclusion. In Chang's research, a RIA (Real Imaginary Alternate) complex number system is proposed. In essence, his system represents complex numbers in 2's-complement binary form with interleaved real and imaginary parts. Therefore, based upon traditional binary number representation and the previously discussed real-number inner-product processor, a high-performance complex multiplier and complex number inner-product processor is developed in the following sections.

### 2.2.3 Complex Number Multiplier and Inner-Product Computation

### 2.2.3.1 RB Complex Number Multiplier

Applying the inline multiplication method to the complex inner-product requires no modification to Figure $2-2$ to produce the redundant binary imaginary part, $j\left(A_{R} B_{I}+A_{I} B_{R}\right)$. However, the real part, $A_{R} B_{R}-A_{I} B_{I}$, requires a final redundant binary subtraction, rather than an addition. The subtraction is easily implemented by modifying the redundant binary adder (RBA) to add the complement of the redundant binary
number $\left(-A_{I} B_{I}\right)_{\mathrm{RB}}$, since $\left(-A_{l} B_{I}\right)_{\mathrm{RB}}=\overline{\left(A_{I} B_{I}\right)_{\mathrm{RB}}}$, where $\overline{-1}=1, \overline{1}=-1$, and $\overline{0}=0$. In the actual hardware implementation of Figure 2-2, the addend or its RB complement are multiplexed into the RBA, thereby converting it to an adder/subtracter. Defining the control signal Real_Img as Real_Img=1, for the $A_{0} B_{0}-A_{1} B_{1}$ computation and Real_Img=0 for the $A_{0} B_{0}+A_{1} B_{1}$ computation, and then the inline implementation for $A_{0} B_{0} \pm A_{1} B_{1}$ is shown in Figure 2-11. For Booth encoding of the inline method, no further modification is required.


Figure 2-11. Inline Implementation of $\boldsymbol{A}_{0} B_{0}-\boldsymbol{A}_{1} B_{I}$
The cross partial product method is slightly more difficult and requires mapping the difference of two 2 's-complement numbers into a redundant binary number to
compute the real part of the complex product, $A_{R} B_{R}-A_{l} B_{I}$. The result given here is similar to that provided in [49] and is derived much the same as given in Section 2.1.1 for $A_{0} B_{0}+A_{1} B_{1}$.

$$
\begin{align*}
A_{0} B_{0}-A_{1} B_{1}= & \left(-a_{N-1,0} 2^{N-1}+\sum_{i=0}^{N-2} a_{i, 0} 2^{i}\right)\left(-b_{N-1,0} 2^{N-1}+\sum_{i=0}^{N-2} b_{i, 0} 2^{i}\right)- \\
& \left(-a_{N-1,1} 2^{N-1}+\sum_{i=0}^{N-2} a_{i, 1} 2^{i}\right)\left(-b_{N-1,1} 2^{N-1}+\sum_{i=0}^{N-2} b_{i, 1} 2^{i}\right) \\
= & 2^{N-1}\left(a_{N-1,0} b_{N-1,0} 2^{N-1}+\sum_{i=0}^{N-2}\left(-a_{i, 0} b_{N-1,0}\right) 2^{i}-a_{N-1,1} b_{N-1,1} 2^{N-1}+\sum_{i=0}^{N-2}\left(a_{i, 1} b_{N-1,1}\right) 2^{i}\right)+ \\
& \sum_{i=0}^{N-2} 2^{i}\left\{b_{i, 0}\left(-a_{N-1,0} 2^{N-1}+\sum_{j=0}^{N-2} a_{j, 0} 2^{j}\right)+b_{i, 1}\left(a_{N-1,1} 2^{N-1}-\sum_{j=0}^{N-2} a_{j, 1} 2^{j}\right)\right\}  \tag{2.58}\\
= & 2^{N-1}\left\{\left(a_{N-1,0} b_{N-1,0}-a_{N-1,1} b_{N-1,1}\right) 2^{N-1}+\sum_{i=0}^{N-2}\left(-a_{i, 0} b_{N-1,0}+a_{i, 1} b_{N-1,1}\right) 2^{i}\right\}+ \\
& \sum_{i=0}^{N-2} 2^{i}\left\{\left(-a_{N-1,0} b_{i, 0}+a_{N-1,1} b_{i, 1}\right) 2^{N-1}+\sum_{j=0}^{N-2}\left(a_{j, 0} b_{i, 0}-a_{j, 1} b_{i, 1}\right) 2^{j}\right\}
\end{align*}
$$

Considering the first term of Equation (2.58),

$$
\begin{align*}
& 2^{N-1}\left\{\left(a_{N-1,0} b_{N-1,0}-a_{N-1,1} b_{N-1,1}\right) 2^{N-1}+\sum_{i=0}^{N-2}\left(-a_{i, 0} b_{N-1,0}+a_{i, 1} b_{N-1,1}\right) 2^{i}\right\} \\
& =2^{N-1}\left\{\left(a_{N-1,0} b_{N-1,0}\right) 2^{N-1}+\sum_{i=0}^{N-2}\left(-a_{i, 0} b_{N-1,0}\right) 2^{i}\right\}+\left\{\left(-a_{N-1,1} b_{N-1,1}\right) 2^{N-1}+\sum_{i=0}^{N-2}\left(a_{i, 1} b_{N-1,1}\right) 2^{i}\right\}  \tag{2.59}\\
& =2^{N-1}\left\{\mu_{N-1,0} 2^{N-1}+\sum_{i=0}^{N-2} \mu_{i, 0} 2^{i}\right\}
\end{align*}
$$

where $\mu_{i, 0}(0 \leq i \leq N-1)$ is a redundant binary number with

$$
\begin{equation*}
\mu_{N-1,0}=a_{N-1,0} b_{N-1,0}-a_{N-1,1} b_{N-1,1} \text { and } \mu_{i, 0}=-a_{i, 0} b_{N-1,0}+a_{i, 1} b_{N-1,1} \text { for } 0 \leq i \leq N-2 \tag{2.60}
\end{equation*}
$$

Encoding $\mu_{i, 0}$ as two binary bits for $0 \leq i \leq N-2$ :

$$
\begin{align*}
& \mu_{N-1,0}^{-}=a_{N-1}, 0  \tag{2.61}\\
& b_{N-1,0}, \mu_{N-1,0}^{+}=\overline{a_{N-1}, 1} b_{N-1,1} \\
& \mu_{i, 0}^{-}=\overline{a_{i, 0} b_{N-1,0}}, \text { and } \mu_{i, 0}^{+}=a_{i, 1} b_{N-1,1} .
\end{align*}
$$

Converting the last term of Equation (2.58) to redundant binary,

$$
\begin{align*}
& \sum_{i=0}^{N-2} 2^{i}\left\{\left(-a_{N-1,0} b_{i, 0}+a_{N-1,1} b_{i, 1}\right) 2^{N-1}+\sum_{j=0}^{N-2}\left(a_{j, 0} b_{i, 0}-a_{j, 1} b_{i, 1}\right) 2^{j}\right\} \\
& =\sum_{i=0}^{N-2} 2^{i}\left\{v_{i, N-1} 2^{N-1}+\sum_{j=0}^{N-2} v_{i, j} 2^{j}\right\} \tag{2.62}
\end{align*}
$$

where $v_{i, j}(0 \leq i \leq N-2,0 \leq j \leq N-1)$ is a redundant binary number and

$$
\begin{equation*}
v_{i, j}=a_{j, 0} b_{i, 0}-a_{j, 1} b_{i, 1} \text { for } 0 \leq j \leq N-2, \text { and } v_{i, N-1}=-a_{N-1,0} b_{i, 0}+a_{N-1,1} b_{i, 1} . \tag{2.63}
\end{equation*}
$$

Encoding $v_{i, j}(0 \leq i \leq N-2,0 \leq j \leq N-1)$ as a redundant binary number using two binary bits:

$$
\begin{align*}
& v_{i, N-1}^{-}=\overline{a_{N-1,0} b_{i, 0}}, \quad v_{i, N-1}^{+}=a_{N-1,2,} b_{i, 1}, \\
& v_{i, j}^{-}=a_{j, 0} b_{i, 0} \text {, and } \quad v_{i, j}^{+}=\overline{a_{j, 1} b_{i, 1}} . \tag{2.64}
\end{align*}
$$

The general redundant binary equation for the summation of $A_{0} B_{0}-A_{1} B_{1}$ is:

$$
\begin{equation*}
A_{0} B_{0}-A_{1} B_{1}=2^{N-1}\left\{\mu_{N-1,0} 2^{N-1}+\sum_{i=0}^{N-2} \mu_{i, 0} 2^{i}\right\}+\sum_{i=0}^{N-2} 2^{i}\left\{v_{N-1,0} 2^{N-1}+\sum_{i=0}^{N-2} v_{i, 0} 2^{i}\right\} \tag{2.65}
\end{equation*}
$$

Now consider the inner-product, $A_{0} B_{0}-A_{1} B_{1}$, using modified Booth encoding:

$$
\begin{align*}
A_{0} B_{0}-A_{1} B_{1}= & \sum_{i=0}^{\frac{N}{2}-1} Q_{i, 0}\left(-a_{N-1,0} 2^{N-1}+\sum_{k=0}^{N-2} a_{k, 0} 2^{k}\right) 2^{2 i} \\
& -\sum_{i=0}^{\frac{N}{2}-1} Q_{i, 1}\left(-a_{N-1,1} 2^{N-1}+\sum_{k=0}^{N-2} a_{k, 1} 2^{k}\right) 2^{2 i}  \tag{2.66}\\
= & \sum_{i=0}^{\frac{N}{2}-1}\left(C_{i, 0}+g_{i, 0}\right) 2^{2 i}-\sum_{i=0}^{\frac{N}{2}-1}\left(C_{i, 1}+g_{i, 1}\right) 2^{2 i} \\
= & \sum_{i=0}^{\frac{N}{2}-1}\left(C_{i, 0}-C_{i, 1}+g_{i, 0}-g_{i, 1}\right) 2^{2 i}
\end{align*}
$$

From Equation (1.10), the subtraction of two 2's-complement numbers can be considered as a redundant binary number. So Equation (2.66) becomes

$$
\begin{align*}
A_{0} B_{0}-A_{1} B_{1} & =\sum_{i=0}^{\frac{N}{2}-1}\left(C_{i, 0}-C_{i, 1}+g_{i, 0}-g_{i, 1}\right) 2^{2 i} \\
& =\sum_{i=0}^{\frac{N}{2}-1}\left(Y_{i, 01}+g_{i, 0}-g_{i, 1}\right) 2^{2 i}  \tag{2.67}\\
& =\sum_{i=0}^{\frac{N}{2}-1}\left(Y_{i, 01}+\gamma_{i, 01}\right) 2^{2 i}
\end{align*}
$$

where $Y_{i, 01}$ is a redundant binary number from the subtraction of $C_{i, 0}$ and $C_{i, 1}$ with $Y_{i, 01}=C_{i, 0}-C_{i, 1}$ and $\gamma_{j, 01}=g_{j, 0}-g_{j, 1}$. The correction factor, $\gamma_{i, 01}$, depends upon the value $g_{i, 0}$ and $g_{i, 1}$, as shown in Table 2-G.

Table 2-G. Booth Correction Factors for Redundant Binary Partial Product Generation of $\boldsymbol{A}_{0} \boldsymbol{B}_{0}-\boldsymbol{A}_{1} \boldsymbol{B}_{I}$

| $g_{i, 0}$ | $g_{i, 1}$ | $\gamma_{i, 01}=g_{i, 0}-g_{i, 1}$ | $\gamma_{i, 01}^{-} \gamma_{i, 01}^{+}$ |  |
| :---: | :---: | :---: | :---: | :---: |
| 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | -1 | 0 | 0 |
| 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 |

From the Booth coding Table 2-A,

$$
\begin{align*}
\gamma_{i, 01}^{-}=g_{i, 0} & =b_{i+1,0} \overline{b_{i, 0}} \overline{b_{i-1,0}}+b_{i+1,0} \overline{b_{i, 0}} b_{i-1,0}+b_{i+1,0} b_{i, 0} \overline{b_{i-1,0}} \\
& =b_{i+1,0} \overline{b_{i, 0} b_{i-1,0}} \\
\gamma_{i, 01}^{+}=\overline{g_{i, 1}} & =\overline{b_{i+1,1} \overline{b_{i, 1}} \overline{b_{i-1,1}}}+b_{i+1,1} \overline{b_{i, 1}} b_{i-1,1}+b_{i+1,1} b_{i, 1} \overline{b_{i-1,1}}  \tag{2.68}\\
& =\overline{b_{i+1,1} \overline{b_{i, 1} b_{i-1,1}}}
\end{align*}
$$

Again, the " + " in the equations above is the Boolean OR operation.

### 2.2.3.2 RB Complex Number Inner-Product Processor

The inner-product of complex numbers $C_{0}, C_{1}, \cdots C_{M-1}$ and $D_{0}, D_{1}, \cdots D_{M-1}$,

$$
\begin{align*}
& C_{0}=A_{0}+j A_{1}, C_{1}=A_{2}+j A_{3}, \ldots C_{M-1}=A_{2 M-2}+j A_{2 M-1}  \tag{2.69}\\
& D_{0}=B_{0}+j B_{1}, D_{1}=B_{2}+j B_{3}, \ldots D_{M-1}=B_{2 M-2}+j B_{2 M-1}
\end{align*}
$$

is

$$
\begin{align*}
\langle\boldsymbol{C}, \boldsymbol{D}\rangle & =\sum_{j=0}^{M-1} C_{j} D_{j}=\sum_{j=0}^{M-1}\left(A_{2 j}+j A_{2 j+1}\right)\left(B_{2 j}+j B_{2 j+1}\right)  \tag{2.70}\\
& =\sum_{j=0}^{M-1}\left(A_{2 j} B_{2 j}-A_{2 j+1} B_{2 j+1}\right)+j \sum_{j=0}^{M-1}\left(A_{2 j} B_{2 j+1}+A_{2 j+1} B_{2 j}\right)
\end{align*}
$$

The real part of the complex number inner-product is shown as Figure 2-12.


Figure 2-12. The Real Part of the Complex Number Inner-Product

The imaginary part of the complex number inner-product is shown as Figure 2-13:


Figure 2-13. The Imaginary Part of Complex-Number Inner-Product

Defining a control signal Real_Img=1/0 for $A B-C D / A B+C D$ computation, then the overall structure of a unified inner-product processor for $A B \pm C D$ is shown in Figure 2-14:


Figure 2-14. Unified RB IP Processor for $A B \pm C D$

Using the above structures, the real and imaginary parts of the complex number inner-products are computed in redundant binary form. Finally, a RB to 2 's-complement converter is needed to convert the redundant binary inner-product real and imaginary parts to 2 's-complement form, if required.

### 2.3 Inner-Product Computation Comparison

In this section, the computation time for real number inner-product processing is compared between the Texas Instruments TMS320C6000 series and the RB innerproduct processor. The sample inner-product assembly code using TMS320C6000 [92] is shown in Figure 2-15:

```
MVK .S1 100, A1 ; set up loop counter
ZERO .L1 A7 ; zero out accumulator
LOOP:
LDH .D1 *A4++,A2 ; load ai from memory
LDH .D1 *A3++,A5 ; load bi from memory
NOP 4 ; delay slots for LDH
MPY .M1 A2,A5,A6 ; ai * bi
NOP ; delay slot for MPY
ADD .L1 A6,A7,A7 ; sum += (ai * bi)
SUB .Sl A1,l,A1 ; decrement loop counter
[A1] B .S2 LOOP ; branch to loop
NOP 5 ; delay slots for branch
; Branch occurs here
```

Figure 2-15. An Example Code of Fixed-Point Inner-Product [92]

To analyze the execution clock cycles of this sample, a dependency graph is very useful. Dependency graphs can help analyze loops by showing the flow of instructions and data in an algorithm. These graphs also show how instructions depend on one another. The following terms are used in defining a dependency graph:

- A node is a point on a dependency graph with one or more data paths flowing in and/or out.
- The path shows the flow of data between nodes. The numbers beside each path represent the number of cycles required to complete the instruction.
- An instruction that writes to a variable is referred to as a parent instruction and defines a parent node.
- An instruction that reads a variable written by a parent instruction is referred to as its child and defines a child node.

Use the following steps to draw a dependency graph:

1) Define the nodes based on the variables accessed by the instructions.
2) Define the data paths that show the flow of data between nodes.
3) Add the instructions and latencies.
4) Add the functional units.

Figure 2-16 shows the dependency graph for the fixed-point inner-product assembly instructions, shown in Figure 2-15 and their corresponding register allocations:


Figure 2-16. Dependency Graph of Fixed-Point Inner-Product [92]

Figure 2-16 provides the following observations:

- The two LDH instructions, which write the values of $a_{i}$ and $b_{i}$, are parents of the MPY instruction. Five cycles for the parent (LDH) instruction are
needed. Therefore, if the LDH is scheduled on cycle $i$, then its child (MPY) cannot be scheduled until cycle $i+5$.
- The MPY instruction, which writes the product $p_{i}$, is the parent of the ADD instruction. The MPY instruction takes two cycles to complete.
- The ADD instruction adds $p_{i}$ (the result of the MPY) to the sum. The output of the ADD instruction feeds back to become an input on the next iteration and, thus, creates a loop carry path.

The dependency graph for this inner-product algorithm has two separate parts since the decrement of the loop counter and the branch do not read or write any variables from the other part. The loop counter graph shows the following:

- The SUB instruction writes to the loop counter, entr. The output of the SUB instruction feeds back and creates a loop carry path.
- The branch (B) instruction is a child of the loop counter.

Executing this inner-product code serially requires 16 cycles for each iteration plus two cycles to set up the loop counter and initialize the accumulator, thus 100 iterations require 1602 cycles. For the fixed-point TMS320C62X ('C62X) devices, which are operated typically at a 200 MHz clock ( 5 ns ) frequency, 100 iterations require:

$$
\begin{equation*}
1602 \times 5=8010 \mathrm{~ns} \tag{2.71}
\end{equation*}
$$

For the RB inner-product processor, Figure 2-17 shows the structure to implement the real number inner-product computation. For an accurate and fair comparison, we use only one RB multiplier and RB accumulator.


Figure 2-17. RB Inner-Product Implementation

A CMOS implementation of the RB multiplier [3] with $0.5 \mu \mathrm{~m}$ fabrication shows that a $54 \times 54$ bit multiplier achieves 8.8 ns delay, which includes 2.4 ns delay for the RB-to-2's-complement converter. The actual delay of the RB multiplier is only 7.2 ns . Compared to the implementation of TMS320C62X processor, if two-stage pipelines are used for the multiplication and the RB multiplier is employed in the TMS320C62X, then the clock cycle can be reduced to $6 \cdot 4 / 2=3.2 \mathrm{~ns}$. Assuming that all the other instruction operations (LDH, ADD, SUB, etc) take the same time in the RB IP processor, then for 100 iterations of inner-product computations, the total time required is

$$
\begin{equation*}
1602 \times 3.2=5126.4 n \mathrm{~s} \tag{2.72}
\end{equation*}
$$

Table 2-H shows the comparison result using TMS320C62X and the RB innerproduct processor for 100 iterations of inner-product computations:

Table 2-H. Comparison of IP Computation between TMS320C62X and RB InnerProduct Processor

|  | 100 Iterations |
| :--- | :--- |
| TMS320C62X | 8010 ns |
| RB Inner-Product <br> Processor | 5126.4 ns |

### 2.4 Implementation of Unified Signed/Unsigned Multiplier

In this section, a unified signed/unsigned multiplier is developed using the RB inner-product core without and with Booth encoding. An unsigned binary number can be considered as a 2 's-complement number with an extra sign bit ' 0 ' padded before the MSB (Most Significant Bit). For example, an unsigned binary number 10001111 can be considered as a signed binary number 010001111 with an extra sign bit ' 0 '.

### 2.4.1 Unifled Signed/Unsigned Muitiplier Without Booth Coding

From Equation (1.6), a RB number, $\Delta$, can be derived from the addition of a pair of 2's-complement numbers. Thus for an unsigned $N \times N$ multiplier of $A \times B, N$ unsigned partial products are generated. These $N$ unsigned partial products are converted to $N+1$ signed partial products with extra bit 0 padded before MSB and are mapped to N/2 RB partial products with correction factors as shown in Figure 2-18.


Figure 2-18. Unsigned Multiplier with Partial Product Generation
Consider the partial products of $P P_{N-2}$ and $P P_{N-1}$ for signed multiplier of $A$ and $B$ as:

$$
\begin{align*}
& P P_{N-2}=\left(-a_{N-1} b_{N-2} 2^{N}+a_{N-1} b_{N-2} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{N-2} 2^{i}\right) 2^{N-2}  \tag{2.73}\\
& P P_{N-1}=-\left(-a_{N-1} b_{N-1} 2^{N}+a_{N-2} b_{N-1} 2^{N-1}+\sum_{i=1}^{N-2} a_{i-1} b_{N-1} 2^{i}\right) 2^{N-2}
\end{align*}
$$

Where $P P_{N-2}$ and $-P P_{N-1}$ are 2 's-complement binary numbers, the sum of $P P_{N-2}$ and $P P_{N-1}$ are mapped into a RB number according to Equations (1.10), (1.11), (1.12) and Figure 1-14. The mapping structure is shown in Figure 2-19.


Figure 2-19. Mapping of $\boldsymbol{P} \boldsymbol{P}_{N-2}$ and $\boldsymbol{P} \boldsymbol{P}_{N-1}$ for Signed Multiplier into a RB Digit

Consider the partial products of $P Q_{N-2}$ and $P Q_{N-1}$ for unsigned multiplier of $A$ and $B$ as:

$$
\begin{align*}
& P Q_{N-2}=\left(0\left\llcorner 2^{N+1}+a_{N-1} b_{N-2} 2^{N}+a_{N-1} b_{N-2} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} b_{N-2} 2^{i}\right) 2^{N-2}\right.  \tag{2.74}\\
& P Q_{N-1}=\left(02^{N+1}+a_{N-1} b_{N-1} 2^{N}+a_{N-2} b_{N-1} 2^{N-1}+\sum_{i=1}^{N-2} a_{i-1} b_{N-1} 2^{i}\right) 2^{N-2}
\end{align*}
$$

Where $P Q_{N-2}$ and $P Q_{N-1}$ are 2 's-complement binary numbers, the sum of $P Q_{N-2}$ and $P Q_{N-1}$ are mapped into a RB number according to Equations (1.6), (1.7), (1.8) and Figure 1-13 with an extra correction factor -1 . The mapping structure is shown in Figure 2-20.


Figure 2-20. Mapping of $P Q_{N-2}$ and $P Q_{N-1}$ for Signed Multiplier into a RB Digit Define a control signal, $S I G N$, where $S I G N=1$ for signed multiplication and $S I G N=0$ for unsigned multiplication, then the combined partial product $P P_{N-1}^{\prime}$ is realized in Figure 2-21.


Figure 2-21. Circuit Realization of the Last Partial Product $\boldsymbol{P} \boldsymbol{P}_{N-I}^{\prime}$ for Signed/Unsigned Multiplier

Thus, for the implementation of $N$-bit unsigned multiplier, the correction factors are $\{0,-1,0,-1,0,-1, \ldots \ldots-1\}$. That is:

$$
\begin{align*}
& \alpha_{N-2}=\alpha_{N-4}=\cdots=\alpha_{2}=\alpha_{0}=-1  \tag{2.75}\\
& \alpha_{N-1}=\alpha_{N-3}=\cdots=\alpha_{3}=\alpha_{1}=0
\end{align*}
$$

For unsigned multiplication, all the correction factors except $\alpha_{N-2}=-1$ can be added in the RB addition tree. Here the $\alpha_{N-2}$ factor is combined with the first partial product, $P P 0_{N-1} P P 0_{N-2} \cdots P P 0_{1} P P 0_{0}$, which is shown in Figure 2-22:


Figure 2-22. First Partial Product PQ0 for Unsigned Multiplier

Table 2-I to Table 2-L are the truth tables for the first partial product $P Q 0$ of the unsigned multiplier:

Table 2-I. Partial Product $P Q 0_{0}$ to $P Q 0_{N-3}$ for Unsigned Multiplier

| First Partial Products $P P_{0}$ <br> $P P 0_{0}$ to $P P 0_{N-3}$ | First Partial Products $P Q_{0}$ <br> $P Q 0_{0}$ to $P Q 0_{N-3}$ |
| :---: | :---: |
| 0 | 1 |

Table 2-J. Partial Product $P Q 0_{N-2}$ for Unsigned Multiplier

| First Partial Products $P P_{0}$ <br> $P P 0_{N-2}$ | First Partial Products $P Q_{0}$ <br> $P Q 0_{N-2}$ |
| :---: | :---: |
| 0 | 1 |

Table 2-K. Partial Product $P Q 0_{N-1}$ for Unsigned Multiplier

| $P P 0_{N-2}$ | $P P 0_{N-1}$ | First Partial Products $P Q_{0}$ <br> $P Q 0_{N-1}$ |
| :---: | :---: | :---: |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |

Table 2-L. Partial Product $P Q 0_{N}$ and $P Q 0_{N+1}$ for Unsigned Multiplier

| $P P 0_{N-2}$ | $P P 0_{N-1}$ | First Partial Products $P Q_{0}$ <br> $P Q 0_{N}$ and $P Q 0_{N+1}$ |
| :---: | :---: | :---: |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |

The logic equation of the partial product $P Q 0$ for unsigned multiplier is as follows:

$$
\begin{align*}
& P Q 0_{0}=P P 0_{0} \\
& P Q 0_{1}=P P 0_{1} \\
& \cdots \ldots  \tag{2.76}\\
& P Q 0_{N-3}=P P 0_{N-3} \\
& P Q 0_{N-2}=\overline{P P 0_{N-2}} \\
& P Q 0_{N-1}=\overline{P P 0_{N-1} \oplus P P 0_{N-2}} \\
& P Q 0_{N+1}=P Q 0_{N}=\overline{P P 0_{N-1}+P P 0_{N-2}}
\end{align*}
$$

Figure 2-23 shows the circuit realization of the combined partial product $P P_{0}^{\prime}$.


Figure 2-23. Circuit of the First Partial Product $\boldsymbol{P P}_{\boldsymbol{j}}^{\prime}$ for Signed/Unsigned Multiplier For the partial products of $P Q_{1}$ and $P Q_{N-2}$ for unsigned multiplier, an extra bit 0 is padded before the MSB. Figure 2-24 shows the circuit realization of the combined partial products from $P P_{1}^{\prime}$ to $P P_{N-2}^{\prime}$.


Figure 2-24. Circuit of the Partial Products from $P P_{1}^{\prime}$ to $P P_{N-2}^{\prime}$ for Signed/Unsigned Multiplier

Figure 2-25 shows a unified signed/unsigned multiplier with the control signal SIGN.


Figure 2-25. A Unified Sign/Unsigned Multiplier

Both the signed and unsigned multiplier have the same structure of the RB adder tree. The partial products are controlled to switch between signed and unsigned multiplication. Then the combined binary partial products are mapped to RB partial products and added using the RB adder tree to compute the signed/unsigned RB product.

### 2.4.2 Unified Signed/Unsigned Multiplier With Booth Coding

For the unsigned multiplier for $A$ and $B$,

$$
\begin{align*}
& A=\sum_{i=0}^{N-1} a_{i} 2^{i}=-0 \times 2^{N}+a_{N-1} 2^{N-1}+\sum_{i=0}^{N-2} a_{i} 2^{i}=A^{\prime} \\
& B=\sum_{i=0}^{N-1} b_{i} 2^{i}=b_{N-1} 2^{N}-b_{N-1} 2^{N-1}+\sum_{i=0}^{N-2} b_{i} 2^{i}=b_{N-1} 2^{N}+B^{\prime} \tag{2.77}
\end{align*}
$$

where $A^{\prime}$ and $B^{\prime}$ are 2 's-complement binary numbers.
Thus, the product of $A B$ is calculated as:

$$
\begin{align*}
A B & =A^{\prime}\left(b_{N-1} 2^{N}+B^{\prime}\right) \\
& =A^{\prime} b_{N-1} 2^{N}+A^{\prime} B^{\prime}  \tag{2.78}\\
& =\sum_{i=0}^{N-1} a_{i} b_{N-1} 2^{i} 2^{N}+A^{\prime} B^{\prime}
\end{align*}
$$

The product of $A^{\prime} B^{\prime}$ using Booth encoding was previously discussed in Section
2.1.3. The extra $\sum_{i=0}^{N-1} a_{i} b_{N-1} 2^{i} 2^{N}$ value can be combined with the correction factors $\gamma_{i}$. The new correction factors are

$$
\begin{align*}
\gamma_{i}^{+} & =0 \\
\gamma_{i}^{-} & =b_{i+1} \overline{b_{i}} \overline{b_{i-1}}+b_{i+1} \bar{b}_{i} b_{i-1}+b_{i+1} b_{i} \overline{b_{i-1}} \\
& =b_{i+1} \overline{b_{i} b_{i-1}}  \tag{2.79}\\
\gamma_{i+1}^{+} & =b_{i+3} \overline{b_{i+2} b_{i+1}} \\
\gamma_{i+1}^{-} & =1
\end{align*}
$$

where $i=0,2,4, \ldots \frac{N}{2}-2$ and

$$
\begin{equation*}
\gamma_{i}^{+}=a_{i} b_{N-1} \quad \gamma_{i}^{-}=1 \tag{2.80}
\end{equation*}
$$

where $i=N, N+1, \ldots, 2 N-1$.

### 2.5 The Implementation of a Unified Signed/Unsigned InnerProduct Processor for $A B \pm C D$

The overall structure of a unified signed/unsigned inner-product processor is described for the computations of $A B+C D$ and $A B-C D$. In Section 2.2.3.1 and Figure

2-11. The $A B-C D$ implementation is developed using the structure for $A B+C D$ with inverters added in the RBA tree and a unified structure for $A B \pm C D$ is developed. In Section 2.4, a unified sign/unsigned multiplier was developed using RB representations. Define two control signals, $\operatorname{SIGN}=1 / 0$ for signed/unsigned multiplication, and Real_Img $=1 / 0$ for the $A B-C D / A B+C D$ computation. Then the overall structure of a unified signed/unsigned inner-product processor for $A B \pm C D$ is shown in Figure 2-26, where the $P Q_{0}$ to $P Q_{N-1}$ of $A B$ and $C D$ refers to the unsigned partial products discussed in Section 2.4.


Figure 2-26. Unified Signed/Unsigned IP Processor for $A B \pm C D$

### 2.6 The Implementation of a Redundant Binary Multiplier

Currently, numerous floating-point unit designs incorporating a fast multiplier make iterative use of the multiplier for implementing fast algorithms for division, square root, and/or transcendental function computations by extended polynomial approximation [107]-[112]. If multipliers are to be used iteratively for RB computations, it is advantageous for the multiplier to accept redundant binary coded input directly, in addition to the initial 2 's-complement numbers. A multiplier capable of accepting both 2's-complement and RB inputs avoids the excessive RB to 2's-complement delay. To our knowledge, no prior multiplier design exists with this capability. Recently a new floating point arithmetic unit was proposed [113]. A redundant number system is used to achieve IEEE compliant results. All operations in the arithmetic units are carried in redundant form with conversion back to the standard IEEE format performed only when an operand is written to memory. In [113], it is argued that the proposed floating point unit could achieve better performance across all of the required functions. In all these cases, a fast multiplier that can accept either 2 's-complement or RB inputs is advantageous, i.e., the multiplicand and multiplier are both redundant binary numbers with the product produced in redundant binary form, as shown in Figure 2-27:


Figure 2-27. A RB Multiplier Diagram

### 2.6.1 Direct Implementation of Redundant Binary Multiplier

To implement the dual input multiplier, we first consider a RB multiplier. Figure 2-28 shows an example of RB multiplication.

Let $A_{R B}=\left[\begin{array}{llll}1 & -1 & 0 & 1\end{array}\right], B_{R B}=\left[\begin{array}{llll}-1 & 1 & 0 & -1\end{array}\right]$. Then, the product of $A_{R B}$ and $B_{R B}$ is computed as:


Figure 2-28. An Example of RB Multiplication

The RB partial product is generated according to Table 2-M, where $\alpha_{i}$ and $\beta_{i}$ are the RB signed digits of $A_{R B}$ and $B_{R B}$, respectively.

Table 2-M. RB Partial Product Generation

| $\alpha_{i}$ | $\beta_{i}$ | $\alpha_{i} \beta_{i}$ |
| :---: | :---: | :---: |
| -1 | -1 | 1 |
| -1 | 0 | 0 |
| -1 | 1 | -1 |
| 0 | -1 | 0 |
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | -1 | -1 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |

Encoding the RB digits $1=(11), 0=(01)=(10),-1=(00)$, Table $2-\mathrm{N}$ shows the encoded RB partial products.

Table 2-N. Encoded RB Partial Product Generation

| $\alpha_{i}$ | $\beta_{i}$ | $\rho_{i}=\alpha_{i} \beta_{i}$ |
| :--- | :--- | :--- |
| $-1(0,0)$ | $-1(0,0)$ | $1(1,1)$ |
| $-1(0,0)$ | $0(0,1)(1,0)$ | $0(0,1)(1,0)$ |
| $-1(0,0)$ | $1(1,1)$ | $-1(0,0)$ |
| $0(0,1)(1,0)$ | $-1(0,0)$ | $0(0,1)(1,0)$ |
| $0(0,1)(1,0)$ | $0(0,1)(1,0)$ | $0(0,1)(1,0)$ |
| $0(0,1)(1,0)$ | $1(1,1)$ | $0(0,1)(1,0)$ |
| $1(1,1)$ | $-1(0,0)$ | $-1(0,0)$ |
| $1(1,1)$ | $0(0,1)(1,0)$ | $0(0,1)(1,0)$ |
| $1(1,1)$ | $1(1,1)$ | $1 \quad(1,1)$ |

From Table 2-N, Equation (2.81) is derived to find the encoded RB partial product:

$$
\begin{align*}
& \rho_{i}^{-}=\overline{\alpha_{i}^{-}} \overline{\overline{\alpha_{i}^{+}}} \overline{\beta_{i}^{-}} \overline{\beta_{i}^{+}}+\alpha_{i}^{-} \alpha_{i}^{+} \beta_{i}^{-} \beta_{i}^{+} \\
& \rho_{i}^{+}=\overline{\overline{\alpha_{i}^{-}} \overline{\alpha_{i}^{+}} \beta_{i}^{-} \beta_{i}^{+}+\alpha_{i}^{-} \alpha_{i}^{+}} \overline{\beta_{i}^{-}} \overline{\beta_{i}^{+}} \tag{2.81}
\end{align*}
$$

Another way to implement a RB multiplier is to use the RB inline inner-product processor core. In Section 2.4, and Section 2.5, the implementation of $A_{0} B_{0}+A_{1} B_{1}$ and $A_{0} B_{0}-A_{1} B_{1}$ for both 2 's-complement and unsigned numbers was discussed. Here the reuse of these cores is investigated to implement the RB multiplier.

### 2.6.2 Redundant Binary Multiplier Implementation Using InnerProduct Processor

Let $Z_{R B}=\sum_{i=0}^{N-1} \zeta_{i} 2^{i}$, where $Z_{R B}$ is a RB number and $\zeta_{i}$ is encoded as two binary bits $\left(\zeta_{i}^{+} \zeta_{i}^{-}\right)$(refer to Table 1-B). In this research, the RB encoding is $\zeta_{i}=\zeta_{i}^{+}-\overline{\zeta_{i}^{-}}$. Therefore,

$$
\begin{align*}
Z_{R B} & =\sum_{i=0}^{N-1} \zeta_{i} 2^{i}=\sum_{i=0}^{N-1}\left(\zeta_{i}^{+}-\overline{\zeta_{i}^{-}}\right) 2^{i} \\
& =\sum_{i=0}^{N-1} \zeta_{i}^{+} 2^{i}-\sum_{i=0}^{N-1} \overline{\zeta_{i}^{-}} 2^{i} \tag{2.82}
\end{align*}
$$

Let $Z^{+}=\sum_{i=0}^{N-1} \zeta_{i}^{+} 2^{i}$ and $Z^{-}=\sum_{i=0}^{N-1} \zeta_{i}^{-} 2^{i}$ then

$$
\begin{equation*}
Z_{R B}=Z^{+}-\overline{Z^{-}} \tag{2.83}
\end{equation*}
$$

where $Z^{+}$and $Z^{-}$are unsigned binary numbers.
For two RB numbers, $A_{R B}$ and $B_{R B}$, we have:

$$
\begin{align*}
A_{R B} & =\sum_{i=0}^{N-1} \alpha_{i} 2^{i}, A^{+}=\sum_{i=0}^{N-1} \alpha_{i}^{+} 2^{i} \text { and } A^{-}=\sum_{i=0}^{N-1} \alpha_{i}^{-} 2^{i}  \tag{2.84}\\
B_{R B} & =\sum_{i=0}^{N-1} \beta_{i} 2^{i}, B^{+}=\sum_{i=0}^{N-1} \beta_{i}^{+} 2^{i} \text { and } B^{-}=\sum_{i=0}^{N-1} \beta_{i}^{-} 2^{i}
\end{align*}
$$

Here $\alpha_{i}$ and $\beta_{i}$ are encoded as $\left(\alpha_{i}^{+} \alpha_{i}^{-}\right)$and $\left(\beta_{i}^{+} \beta_{i}^{-}\right)$, with $\alpha_{i}=\alpha_{i}^{+}-\overline{\alpha_{i}^{-}}$and $\beta_{i}=\beta_{i}^{+}-\overline{\beta_{i}^{-}}$.

From Equation (2.83),

$$
\begin{align*}
& A_{R B}=A^{+}-\overline{A^{-}}  \tag{2.85}\\
& B_{R B}=B^{+}-\overline{B^{-}}
\end{align*}
$$

so $A_{R B} B_{R B}$ becomes

$$
\begin{align*}
A_{R B} B_{R B} & =\left(A^{+}-\overline{A^{-}}\right)\left(B^{+}-\overline{B^{-}}\right) \\
& =\left(A^{+} B^{+}-\overline{A^{-}} B^{+}\right)+\left(\overline{A^{-}} \overline{B^{-}}-A^{+} \overline{B^{-}}\right) \tag{2.86}
\end{align*}
$$

Where $A^{+}, A^{-}, B^{+}$and $B^{-}$are all unsigned binary numbers, so the product of RB
$A_{R B}$ and $B_{R B}$ can be realized, using two unsigned binary numbers computations with $A^{+} B^{+}-\overline{A^{-}} B^{+}$and $\overline{A^{-}} \overline{B^{-}}-A^{+} \overline{B^{-}}$. The diagram is shown in Figure 2-29:


Figure 2-29. Implementation of RB Multiplier $A_{R B} B_{R B}$

Using this method to split the RB encoding bits and utilizing the unsigned feature of the multiplier (see Figure 2-26), the basic IP computing core will generate RB products with RB multiplicand and multiplier inputs.

### 2.7 Redundant Binary Inner-Product Computation

With the development of a RB multiplier in Section 2.6, an inner-product processor which can accept RB numbers input is easily designed. For example, Figure 2-30 shows the implementation structure to find the inner-product of $A B+X A$, where $A, B, X$ and $\Delta$ are RB numbers.


RB multiplier

Figure 2-30. IP Implementation for RB Number $A B+X A$

## Chapter 3 Implementations of Division Method

### 3.1 Division Algorithm Review

The notation is used in the discussion here of division algorithms:

| $Z$ Dividend | $z_{2 N-1} z_{2 N-2} \ldots z_{1} z_{0}$ |
| :--- | :--- |
| $D$ Divisor | $d_{N-1} d_{N-2} \ldots d_{1} d_{0}$ |
| $Q$ Quotient | $q_{N-1} q_{N-2} \ldots q_{1} q_{0}$ |
| $S$ Remainder $[Z-(D \times Q)]$ | $s_{N-1} s_{N-2} \ldots s_{1} s_{0}$ |

Division algorithms can generally be divided into the following classes: digit recurrence (restoring or non-restoring), functional iteration, table look-up and variable latency. The basis for these classes is the difference in the hardware operations used in their implementations, such as multiplication, subtraction, and table look-up. Many practical division algorithms are not pure forms of a particular class but rather are combinations of multiple classes. For example, a high performance algorithm may use table look-up to gain an initial approximation of the reciprocal, then use a function iteration algorithm to converge quadratically to the quotient. Table look-up may be impractical for general applications. The division method of table look-up requires a large RAM size for longer divisor size. The size of RAM increases exponentially with the word length of the divisor. The variable latency method results in a complex design for the control circuit and requires an asynchronous design method. The latency of variable latency division method depends on the value of the divisor. For different values of divisor, the latency is different. The two most popular division methods are digital recurrence and functional iteration.

Digit recurrence is the oldest class of high-speed division algorithms and, as a result, a significant quantity of literature exists proposing digit recurrence algorithms, implementations, and techniques. The most common implementation of digit recurrence division in modern processors was named $S R T$ division by Freiman [81], taking its name from the initials of Sweeney, Robertson, and Tocher, who developed the algorithm independently at approximately the same time. Atkins [78] did fundamental research on division by digit recurrence, which was the first major analysis of SRT algorithms. Tan [89] derived and presented the theory of high-radix SRT division and an analytic method of implementing SRT look-up tables. Ercegovac and Lang [79] presented a comprehensive treatment of division by digit recurrence. Kuninobu [83], Aoki [77], and Srinivas [88] investigated the digit-recurrence division method with the redundant binary representation of the remainders. Basically the equation of the digit-recurrence division method in radix-(r) is:

$$
\begin{equation*}
s_{j}=r s_{j-1}-q_{N-j}\left(r^{N} D\right) \tag{3.1}
\end{equation*}
$$

Digit recurrence algorithms deal with how to represent the remainder and quotient, how to choose the quotient, and choice of radix. Convergence of digitrecurrence is linear and has order $N$. A high performance quadratically convergent method, function iteration, was proposed which included the Goldschmidt [82] and Newton-Raphson [85] methods. Both methods first find the reciprocal and then use multiplication to compute the quotient. The functional iteration method is discussed below.

To compute the radio $Q=Z / D$, one can repeatedly multiply $Z$ and $D$ by a sequence of $M$ multipliers $X_{0}, X_{1} \ldots, X_{M-1}$ :

$$
\begin{equation*}
Q=\frac{Z}{D}=\frac{Z X_{0}, X_{1} \ldots, X_{M-1}}{D X_{0}, X_{1} \ldots, X_{M-1}} \tag{3.2}
\end{equation*}
$$

If this is done in such a way that de denominator $D X_{0}, X_{1} \ldots, X_{M-1}$ converges to 1 , the numerator $Q X_{0}, X_{1} \ldots, X_{M-1}$ will converge to $Q$. This process does not yield a remainder, but the remainder $S$ (if needed) can be computed, via an additional multiplication and a subtraction, using $S=Z-Q D$.

To perform division based on the preceding idea, we face two questions:

1. How should we select the multipliers $X_{i}$ so that the denominator does in fact converge to 1 ?
2. Given a selection rule for the multipliers $X_{i}$ how many iterations are needed?

In the following discussion, we answer these questions in turn, but first, we formulate this process as a convergence computation.

Assume a bit-normalized fractional divisor, $D$, and dividend, $Z$, in [1/2 1). If this condition is not satisfied initially, it can be made to hold by appropriately shifting $Z$ and/or $D$. The corresponding convergence computation is formulated as follows [82]:

$$
\begin{array}{ll}
D_{i+1}=D_{i} X_{i} & \text { Set } D_{0}=D ; \text { make } D_{M} \text { converge to } 1 \\
Z_{i+1}=Z_{i} X_{i} & \text { Set } Z_{0}=Z ; \text { obtain } Z / D=Q \approx Z_{M} \tag{3.3}
\end{array}
$$

We now answer the first question posed above by selecting,

$$
\begin{equation*}
X_{i}=2-D_{i} \tag{3.4}
\end{equation*}
$$

This choice transforms the recurrence equations into:

$$
\begin{array}{ll}
D_{i+1}=D_{i}\left(2-D_{i}\right) & \text { Set } D_{0}=D ; \text { iterate until } D_{M} \approx 1 \\
Z_{i+1}=Z_{i}\left(2-D_{i}\right) & \text { Set } Z_{0}=Z ; \text { obtain } Z / D=Q \approx Z_{M} \tag{3.5}
\end{array}
$$

Thus, computing the functions $f$ and $g$ consists of determining the 2 's-complement of $D_{i}$ and two multiplications by the result $2-D_{i}$.

Now to address the second question: How quickly does $D_{i}$ converge to 1 ? In other words, how many multiplications are required to perform division? Noting that

$$
\begin{equation*}
D_{i+1}=D_{i}\left(2-D_{i}\right)=1-\left(1-D_{i}\right)^{2} \tag{3.6}
\end{equation*}
$$

It is concluded that [82]:

$$
\begin{equation*}
1-D_{i+1}=\left(1-D_{i}\right)^{2} \tag{3.7}
\end{equation*}
$$

Thus, if $D_{i}$ is already close to 1 (i.e. $1-D_{i} \leq \varepsilon$ ), $D_{i+1}$ will be even closer to I (i.e. $1-D_{i+1} \leq \varepsilon^{2}$ ). This property is known as quadratic convergence and leads to a logarithmic number, $M$, of iterations to complete the process.

Another way to compute $Q=Z / D$ is to first find $1 / D$ and then multiply the result by $Z$. If several divisions by the same divisor $D$ need to be performed, this method [85] is particularly efficient. One method for computing $1 / D$ is based on the Newton-Raphson iteration to determine a root of $f(x)=0$. We start with some initial estimate $X_{0}$ for the root and then iteratively refine the estimate using the recurrence:

$$
\begin{equation*}
X_{i+1}=X_{i}-\frac{f\left(X_{i}\right)}{f^{\prime}\left(X_{i}\right)} \tag{3.8}
\end{equation*}
$$

where $f^{\prime}\left(X_{i}\right)$ is the derivative of $f(x)$. To apply the Newton-Raphson method to reciprocation, we use $f(x)=1 / x-d$ which has a root at $x=1 / d$. Then $f^{\prime}(x)=-1 / x^{2}$, leading to the recurrence.

$$
\begin{equation*}
X_{i+1}=X_{i}\left(2-X_{i} D\right) \tag{3.9}
\end{equation*}
$$

Computationally, two multiplications and a 2 's-complement step are required per iteration.

Let $\delta_{i}=1 / D-X_{i}$ be the error at the $i$ th iteration. Then:

$$
\begin{equation*}
\delta_{i+1}=1 / D-X_{i+1}=1 / D-X_{i}\left(2-X_{i} D\right)=D\left(1 / D-X_{i}\right)^{2}=D\left(\delta_{i}\right)^{2} \tag{3.10}
\end{equation*}
$$

Since $D<1$, we have $\delta_{i+1}<\left(\delta_{i}\right)^{2}$, thus this functional iteration based upon Newton-Raphson converges quadratically.

### 3.2 Further Studies of the Goldschmidt and NewtonRaphson Methods

In this section the algorithm of the Goldschmidt and the Newton-Raphson method are compared and studied. We show that these two methods are theoretically equivalent, but are often treated separately in the literature. Further studies of the Goldschmidt method are presented. Next, the RB inner-product processor core is investigated for performing the division computations for both real and complex numbers. We show how to control and/or reconfigure the RB inner-product processor to provide highperformance division.

### 3.2.1 Comparison of the Goldschmidt and Newton-Raphson Methods

For the initial divisor, $D$, and dividend, $Z$, the Goldschmidt iteration equations are:

$$
\begin{align*}
& D_{i+1}=D_{i}\left(2-D_{i}\right), \text { Set } D_{0}=D ; \text { iterate until } D_{M} \approx 1  \tag{3.11}\\
& Z_{i+1}=Z_{i}\left(2-D_{i}\right), \text { Set } Z_{0}=Z ; \text { obtain } Z / D=Q \approx Z_{M}
\end{align*}
$$

where $D_{i}$ is the iterated divisor, $Z_{i}$ is the iterated dividend, and $Q$ is the quotient.
For the Newton-Raphson method, the iteration equations are:

$$
\begin{equation*}
X_{i+1}=X_{i}\left(2-X_{i} D\right) \tag{3.12}
\end{equation*}
$$

where $X_{i}$ is the approximate reciprocal of $D$.
Multiplying $Z$ on Equation (3.12) on both sides, we have:

$$
\begin{equation*}
Z X_{i+1}=Z X_{i}\left(2-X_{i} D\right) \tag{3.13}
\end{equation*}
$$

Comparing Equations (3.13) with (3.11), we notice that in Equation (3.13), $Z X_{i}$ is the approximate quotient which is gradually close to $Z / D$ after each iteration. Therefore,

$$
\begin{equation*}
Z_{i}=Z X_{i} \tag{3.14}
\end{equation*}
$$

Then Equation (3.13) becomes:

$$
\begin{equation*}
Z_{i+1}=Z X_{i+1}=Z X_{i}\left(2-X_{i} D\right)=Z_{i}\left(2-X_{i} D\right) \tag{3.15}
\end{equation*}
$$

If we define $D_{i}=X_{i} D$, then Equation (3.15) is:

$$
\begin{equation*}
Z_{i+1}=Z_{i}\left(2-X_{i} D\right)=Z_{i}\left(2-D_{i}\right) \tag{3.16}
\end{equation*}
$$

Note that,

$$
\begin{equation*}
D_{i+1}=X_{i+1} D=X_{i} D\left(2-X_{i} D\right)=D_{i}\left(2-D_{i}\right) \tag{3.17}
\end{equation*}
$$

and

$$
\begin{equation*}
D_{0}=X_{0} D, \quad Z_{0}=Z X_{0} \tag{3.18}
\end{equation*}
$$

Letting $X_{0}=1$, then

$$
\begin{equation*}
D_{0}=X_{0} D=D \quad Z_{0}=Z X_{0}=Z \tag{3.19}
\end{equation*}
$$

Under the condition of $X_{0}=1$, Equations (3.12) and (3.13) are equivalent to Equation (3.11). However, from the standpoint of implementation, Goldschmidt and Newton-Raphson methods are different. For the Goldschmidt method, as shown in Figure 3-1, two parallel multiplications plus two complement operations are required. For the Newton-Raphson method, as shown in Figure 3-2, two sequential multiplications and one complement operation are required. Therefore, the critical time delay in the Newton-

Raphson method is two multiplications plus one addition, while for the Goldschmidt method, only one multiplication step (parallel multiplications) and one addition are necessary. As far as the implementation area is concerned, the Goldschmidt method needs only one extra complement operation to implement. From this, we conclude that there are performance advantages for using the Goldschmidt method rather than NewtonRaphson method.


Figure 3-1. Goldschmidt Divisor Implementation [82]


Figure 3-2. Newton-Raphson Divider Implementation [85]

### 3.2.2 Further Discussion of the Goldschmidt Method

In [85], it is claimed that the number of clock cycles for the Goldschmidt division method is $\log _{2} N$, where $N$ is the word length for dividend and divisor. Further study of our circuit implementations show that the actual number of clock cycles to achieve precise accuracy in the LSB of the quotient is $\log _{2} N+1$. In the implementation of the Goldschidmt method, two additional guard bits are required to get the quotient precision of $N$ bits.

After $M=\log _{2} N$ iterations,

$$
\begin{equation*}
D_{M}=D X_{0} X_{1} \cdots X_{M-2}=1-2^{-N} \tag{3.20}
\end{equation*}
$$

so,

$$
\begin{equation*}
X_{0} X_{1} \cdots X_{M-2}=\frac{1-2^{-N}}{D} \tag{3.21}
\end{equation*}
$$

and,

$$
\begin{equation*}
Z_{M}=Z X_{0} X_{1} \cdots X_{M-1}=\frac{Z}{D}\left(1-2^{-N}\right)=Q\left(1-2^{-N}\right) \tag{3.22}
\end{equation*}
$$

The error between the actual quotient and untruncated quotient $Z_{M}$ is

$$
\begin{equation*}
\varepsilon_{1}=\left|Q-Z_{M}\right|=\left|Q-Q\left(1-2^{-N}\right)\right|=Q 2^{-N} \tag{3.23}
\end{equation*}
$$

For $Z<D$, we have $Q<1$, so $\varepsilon_{1}<2^{-N}$.
The approximate quotient is taken by truncating $Z_{M}$ to $N$ bits as $\left(Z_{M}\right)_{T}$, so

$$
\begin{equation*}
\varepsilon_{2}=\left|Z_{M}-\left(Z_{M}\right)_{T}\right|<2^{-N} \tag{3.24}
\end{equation*}
$$

The error between the actual quotient and the computed one is:

$$
\begin{align*}
\varepsilon & =\left|Q-\left(Z_{M}\right)_{T}\right|=\left|Q-Z_{M}+Z_{M}-\left(Z_{M}\right)_{T}\right| \\
& \leq\left|Q-Z_{M}\right|+\left|Z_{M}-\left(Z_{M}\right)_{T}\right|  \tag{3.25}\\
& =2^{-N}+2^{-N}=2^{-N+1}
\end{align*}
$$

i.e. $\varepsilon \leq 2^{-N+1}$

After $M=\log _{2} N$ iterations, the precision of the quotient found from the Goldschimdt method is $N-1$ bits. In order to reach the precision of $N$ bits, $\log _{2} N+1$ iterations are needed.

If we want to achieve the computation error of the division using the RB IP processor to satisfy $\varepsilon \leq 2^{-N}$, from Equations (3.23)(3.24) and (3.25), the following conditions must be satisfied as:

$$
\begin{equation*}
\varepsilon_{1} \leq 2^{-N-1} \text { and } \varepsilon_{2} \leq 2^{-N-1} \tag{3.26}
\end{equation*}
$$

That is, the dividend $z_{i}$ and divisor $d_{i}$ during the iteration must be truncated to $N+1$ bits instead of $N$ bits, so one extra guard bit is required for the iteration.

In the same way, if the computation error of the division is required to meet $\varepsilon<2^{-N}$, then two guard bits are required to compute the quotient using the RB IP processor.

### 3.2.3 Implementation of the Goldschmidt Division

Here we will explore how to implement the Goldschmidt division method using the RB inner-product structure. From Equation (3.5) in order to implement the high-speed divisor, all the intermediate dividend, divisor are in RB forms. Therefore, a RBcomplement operation, $2-A_{R B}$, similar to 2 's-complement operation, and a $R B$ multiplier must be developed. In Section 2.6, the RB multiplier is studied using the RB IP structure, so only the RB-complement operation requires development.

Let a real number $A$ with precision $N$ be represented in RB form, that is:

$$
\begin{equation*}
A=\sum_{i=-L}^{N-L-1} \alpha_{i} 2^{i}=\sum_{i=0}^{N-L-1} \alpha_{i} 2^{i}+\sum_{i=-L}^{-1} \alpha_{i} 2^{i} \tag{3.27}
\end{equation*}
$$

where $\alpha_{i}=\{-1,0,1\}$, then

$$
\begin{equation*}
2-A=2-\left(\sum_{i=0}^{N-L-1} \alpha_{i} 2^{i}+\sum_{i=-L}^{-1} \alpha_{i} 2^{i}\right)=(0010)_{2}+\sum_{i=0}^{N-L-1}\left(-\alpha_{i}\right) 2^{i}+\sum_{i=-L}^{-1}\left(-\alpha_{i}\right) 2^{i} \tag{3.28}
\end{equation*}
$$

Using the RB coding system, $1=(11), 0=(1,0)=(0,1)$ and $-1=(0,0)$, notice that if $\alpha_{i}$ is encoded as $\left(\alpha_{i}^{+} \alpha_{i}^{-}\right)$, then $-\alpha_{i}$ will be encoded as $\left(\overline{\alpha_{i}^{+}} \overline{\alpha_{i}^{-}}\right)$. The implementation of 2- $A_{R B}$ is shown in Figure 3-3:


Figure 3-3. Implementation of 2- $\boldsymbol{A}_{\boldsymbol{R}}$

### 3.3 Real Number Division Implementation

First, the dividend $Z$ and the divider $D$ are normalized to satisfy $Z$ and $D \in\left[\begin{array}{ll}0.5 & 1)\end{array}\right.$.
For the normalization circuit, see references [116]-[118].
Then for the first iteration equation,

$$
\begin{array}{ll}
D_{1}=D_{0}\left(2-D_{0}\right) & \text { Set } D_{0}=D  \tag{3.29}\\
Z_{1}=Z_{0}\left(2-D_{0}\right) & \text { Set } Z_{0}=Z
\end{array}
$$

where $Z_{0}$ and $D_{0}$ are both 2's-complement numbers. From Equation (1.9), $Z_{0}$ and $D_{0}$ can be mapped into RB digit numbers $\left(Z_{0}\right)_{R B}$ and $\left(D_{0}\right)_{R B}$ directly. Thus a RB multiplier which can accept the inputs $\left(D_{0}\right)_{R B},\left(Z_{0}\right)_{R B}$, and $\left(2-D_{0}\right)_{R B}$ can realize the first iteration, as is shown in Figure 3-4.


Figure 3-4. First Iteration Implementation of the Goldschmidt Division

Notice that the output $Z_{1}$ and $D_{1}$ of the first iteration are redundant binary numbers. Thus the successive iterations can be implemented as shown in Figure 3-5.


Figure 3-5. Implementation of Successive Iteration Computation for $\boldsymbol{Z}$ and $\boldsymbol{D}$

After $\log _{2} N+1$ iterations are carried out, where $N$ is the number of bit precision, a RB-to-2's-complement converter is required to convert the quotient back to 2 'scomplement, if required. Four unified structures of $A B \pm C D$ are required to realize the
real number division. Figure 3-6 shows the overall structure of the divider using RB IP processor.


Figure 3-6. Overall Structure of Divider Using RB IP Processor

### 3.4 Comparison of the Implementations of Division

The implementation time required for division is compared between the Pentium Processor and the division implementation using a RB processor. Division implemented on the Pentium processor uses the SRT method. The 8 -bit unsigned division implemented on the Pentium requires 17 clock cycles [85],[114],[115]. If the VLSI fabrication in [3] is implemented to realize the RB inner-product processor, then one iteration for the RB multiplication requires $8.8-2.4=6.4 \mathrm{~ns}$ [3]. For 8 -bit division, 4 iterations are required. The total time required for 8 -bit division using the RB IP processor is:

$$
\begin{equation*}
\left(\log _{2} 8+1\right) \times 6.4=25.6 n s \tag{3.30}
\end{equation*}
$$

If this division implementation result is compared to the Pentium processor, the equivalent clock cycle will be $25.6 / 17=1.505 n s$, and is equivalent to 660 Mhz clock frequency of Pentium processor.

### 3.5 Complex Number Division Implementation

To find the quotient of the complex number $\frac{A+j B}{C+j D}$,

$$
\begin{equation*}
Q=\frac{A+j B}{C+j D}=\frac{(A+j B)(C-j D)}{(C+j D)(C-j D)}=\frac{A C+B D}{C^{2}+D^{2}}+j \frac{B C-A D}{C^{2}+D^{2}} \tag{3.31}
\end{equation*}
$$

For the implementation of a complex number divisor, $A C+B D, B C-A D$ and $C^{2}+D^{2}$ need to be computed. These computations can be realized by the unified signed/unsigned $A B \pm C D$ IP structure. For $C^{2}+D^{2}$, let $A=B$ and $C=D$. Notice that the outputs are in RB form for $A C+B D, B C-A D$ and $C^{2}+D^{2}$. For the Goldschmidt division method, both the dividend and the divisor need to be normalized. To normalize a RB digit, a RB-to-2's-complement converter is required to convert it back to 2 's-complement. A normalization circuit is required to normalize $A C+B D, B C-A D$ and $C^{2}+D^{2}$ into [0.5 1). The diagram is shown in Figure 3-7.


Figure 3-7. Complex-Number Division Implementation Initial Process

Following the derivation of the normalized 2's-complement values of $A C+B D$, $B C-A D$ and $C^{2}+D^{2}$ in the first iteration, the real-number division implementation procedure is utilized to develop the quotient.. Six blocks of unified IP structure $A B \pm C D$ are needed to compute the complex number division since the implementation of divisors for both $\frac{A C+B D}{C^{2}+D^{2}}$ and $\frac{B C-A D}{C^{2}+D^{2}}$ can share the same computing structure for $C^{2}+D^{2}$. For normalization circuits, see [116]-[118].

## Chapter 4 Computational Extensions

The inner-product structures described in Chapter 2 can be extended to provide a rich set of real, complex, RB and mixed real, complex and RB number computations. The inline partial product method [39] allows more extensions than the cross partial product scheme and will be used for illustrating added capabilities. Together with the basic inner-product operation, the computational capabilities afforded can be implemented using control signals or accomplished with circuit reconfiguration if configurable hardware is used. All of these extended computational capabilities are targeted for implementation in a Complex Arithmetic Signal Processor (CASP).

Referring to Figure 4-1, up to eight accumulator segments are required to support the following operations for $M=8$.


Figure 4-1. An Example of a Redundant Number Adder Tree
The structure in Figure 4-1 can support the following real number computations:

1. 8-element real number inner-product computation using a single $R B$ accumulator segment.
2. Dual 4-element real number inner-product using two RB accumulator segments.
3. Quad 2-element real inner-product using four RB accumulator segments.
4. Eight parallel multipliers with or without eight accumulator segments.

The structure in Figure 4-1 can support the following complex number computations:

1. Single 2-element complex number inner-products using one $R B$ accumulator segment.
2. Dual single complex number inner-products using four RB accumulator segments.
3. Two parallel complex number multipliers with or without two real and imaginary accumulator segments.

The structure in Figure 4-1 can support the following redundant binary number computations:

1. Single element redundant binary number inner-product computation using one accumulator segment.
2. Dual 2-element RB inner-product using two RB accumulator segments.
3. Four parallel RB multipliers using 4 RB accumulator segments.

Mixed real and complex number operations and mixed real/complex, 2'scomplement/RB operations are also possible using the same 8-element IP structure. All of the extended computations are performed by bypassing some or all of the RB adder tree shown in Figure 4-1. The basic inner-product structure has the highest latency since the entire RB adder tree is utilized. When implementing one or more of the extended
operations using control signals, design choices should be carefully considered since additional multiplexers are necessary for a multiple operation capability.

For a general purpose complex number DSP core, a key element of the design is the segmented accumulator and the ability to provide both overflow and saturation arithmetic. The design of the segmented accumulator and its associated final RBA for implementing the extended operations in a CASP device is beyond the scope of this dissertation and is the subject of continuing research.

### 4.1 Real-Number Computational Extensions

### 4.1.1 8-Element Real Number Inner-Product Computation

This structure is developed in Chapter 2 and provides the basic computational foundation for extended calculations. Refer to Figure 2-2 and Figure 4-1.

### 4.1.2 Dual 4-Element Real Number Inner-Product

Figure 4-2 shows the structure to perform dual 4-element real number inner-
product using two RB accumulator segments. This calculation requires two accumulators, one for each inner-product.


Figure 4-2. Dual 4-Element Real Number Inner-Product

### 4.1.3 Quad 2-Element Real Inner-Product Using Four Redundant Binary Accumulators

For this computation, the RB adder tree requires reconfiguration as shown in
Figure 4-3. Here four RB accumulators are required:


Figure 4-3. Quad 2-Element Real Number Inner-Product

### 4.1.4 Eight Parallel Multipliers Using 8 Redundant Binary Accumulators

For this computation, the RB adder tree needs to be controlled as shown in Figure
4-4. Here eight RB accumulators are required, if the structure is used for computing eight inner-products; otherwise, the accumulators are bypassed.


Figure 4-4. Eight Parallel Multipliers Using 8 RB Accumulators

### 4.2 Complex-Number Computational Extensions

Defining four complex numbers as: $C_{0}=A_{0}+j B_{0}, C_{1}=A_{1}+j B_{1}, C_{2}=A_{2}+j B_{2}$,
$C_{3}=A_{3}+j B_{3}$, the computational extensions for complex number computations are
depicted as follows.

### 4.2.1 Single 2-Element Complex Number Inner-Product Computation Using One Real/Imaginary Redundant Binary Accumulator

Figure 4-5 shows the structure of a single 2-element complex number innerproduct using a RB accumulator segment for the real and imaginary parts separately.


Real Part


Figure 4-5. Single 2-Element Complex Number IP Using One Real/Imaginary RB Accumulator

### 4.2.2 Dual Single-element Complex Number Inner-Product Computation Using Four Redundant Binary Accumulators

Figure 4-6 shows the structure of dual 2-element complex number inner-products using four RB accumulators.


Figure 4-6. Dual 2-Element Complex Number Inner-Products Using Four RB Accumulators

### 4.2.3 Two Parallel Complex Number Multipliers

Figure 4-7 shows the structure of two parallel complex number multipliers using four RB accumulators.


Figure 4-7. Two Parallel Complex Number Multipliers

### 4.3 Redundant Binary Number Computational Extensions

### 4.3.1 Single Element Redundant Binary Number inner-Product Computation

Figure 4-8 shows the structure of a 4-element redundant binary inner-product computation, where $\Phi_{0}$ to $\Phi_{3}$ and $\Gamma_{0}$ to $\Gamma_{3}$ are redundant binary numbers. The structure of the RB multiplier for $\Phi_{0} \Gamma_{0}$ to $\Phi_{3} \Gamma_{3}$ is discussed in Section 2.6.


Figure 4-8. 4-Element Redundant Binary Inner-Product

### 4.3.2 Dual 2-Element RB Inner-Product plus Two Redundant Binary Accumulators

Figure 4-9 shows the structure of dual 2-element RB inner-product computation using two RB accumulators.


Figure 4-9. Dual 2-Element RB Inner-Product

### 4.3.3 Four Parallel Redundant Binary Multipliers Using Four Redundant Binary Accumulators

Figure 4-10 shows the structure of four parallel RB multipliers using 4 RB accumulators:


Figure 4-10. Four Parallel RB Multipliers Using 4 RB Accumulators

### 4.4 Pipeline Extensions

In this section, the possible pipeline design alternatives of the RB inner-product processor are investigated. The $0.5 \mu \mathrm{~m}$ CMOS time delay model from [3] for an 8 -bit RB multiplier is used for the discussion and is shown as Table 4-A:

Table 4-A. Time Delay Model of RB Multiplier [3]

|  | Time Delay |
| :--- | :--- |
| 2's-complement to RB Mapping | 2-Nand gate $=200 \mathrm{ps}=0.2 \mathrm{~ns}$ |
| RBA (RB adder) | 0.9 ns |
| RB to 2's-complement converter | 1.6 ns |

Consider the 8-word 8-bit RB IP processor in Figure 4-1 and re-draw it as shown in Figure 4-11:


Figure 4-11. 8-Word 8-Bit RB IP Processor

The implementation of $A_{0} B_{0}+A_{1} B_{1}$ consists of 2's-complement to RB mapping and 4 RB adders. If a two-stage pipeline structure is used, the resulting pipeline RB IP processor is depicted in Figure 4-12.


Figure 4-12. Two-Stage Pipelined RB IP Processor

The structure for stage 1 consists of the 2's-complement to RB mapping and 4 RB adders. The time delay is for the first stage is

$$
\begin{equation*}
0.2+4 \times 0.9=3.8 \mathrm{~ns} \tag{4.1}
\end{equation*}
$$

The structure for stage 2 consists of 3 RB adders and one RB to 2 's-complement converter. The latency for the second stage is

$$
\begin{equation*}
3 \times 0.9+1.6=4.3 n s \tag{4.2}
\end{equation*}
$$

The difference of the time delay between these two stages is

$$
\begin{equation*}
4.3-3.8=0.5 n s \tag{4.3}
\end{equation*}
$$

and can be considered as a balanced pipelined structure.
If a three-stage pipeline structure is employed, then the resulting structure is shown as Figure 4-13.


Figure 4-13. Three-Stage Pipelined RB IP Processor

The time delay for each stage is:
Stage 1

$$
\begin{equation*}
0.2+3 \times 0.9=2.9 n s \tag{4.4}
\end{equation*}
$$

Stage 2

$$
\begin{equation*}
3 \times 0.9=2.7 n s \tag{4.5}
\end{equation*}
$$

Stage 3

$$
\begin{equation*}
0.9+1.6=2.5 n s \tag{4.6}
\end{equation*}
$$

The maximum difference in time delay among the three stages is 0.4 ns . This pipeline structure can be considered to be balanced.

## Chapter 5 Redundant Binary to 2's-Complement Number Conversion

Since the inner-product processor produces results in a redundant binary form, a RB-NB (redundant binary to normal binary) converter may be required to provide a 2 'scomplement representation. Ruiz [5] proposed a carry-look-ahead RB-to-2'scomplement converter which is similar to the structure of a carry-look-ahead adder. Rajashekhara [71] proposed a similar converter that is based upon a borrow-look-ahead structure. In his paper [7], Yen gave a novel definition of carry in the proposed RB to 2'scomplement converter. An on-the-fly converter was discussed in [72] which converts serial RB inputs to a 2's-complement number. Choo [74] claimed a breakthrough of a new converter which has no latency proportional to the word length. However, according to the proof in [70], it is equivalent of 2 's-complement addition in the conversion of redundant binary to 2 's-complement. Choo's converter never works correctly. Ling [106] proposed a high-speed adder which is currently the fastest known binary adder. In this chapter, based on Yen's method [7] and Ling's adder scheme [106], we propose an improved RB-NB converter.

### 5.1 An Improved Redundant Binary to 2's-Complement Converter

Define a new variable carry $c_{i}$ [7] as follows:

1) $c_{i}=1$ means that, for the current RB digit position $i$, there is at least one -1 to the right of the current bit position and no +1 s between the -1 and the current position.
2) $c_{i}=0$ otherwise.

Table 5-A shows the conversion rules at stage $i$, where $x_{i}$ is the redundant binary bit, $c_{i}$ is the carry-in from next lower order position, $s_{i}$ is the 2 's-complement binary bit output, and $c_{i+1}$ is the carry-out to the next higher bit position. Example 1 shows a conversion from RB to 2's-complement based on the foregoing rules.

Table 5-A. Conversion Rules in Stage I [7]

| Input |  | Output |  |
| :---: | :---: | :---: | :---: |
| Redundancy bit $x_{i}$ | Carry in $c_{i}$ | Binary bit $s_{i}$ | Carry out $c_{i+1}$ |
| 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| -1 | 0 | 1 | 1 |
| -1 | -1 | 0 | 1 |

Example 1: Letting the RB number $X_{R B}=\left[\begin{array}{llllllllllll}-1 & 1 & 0 & -1 & 0 & -1 & 1 & 0 & -1 & 0 & 0 & 0\end{array}\right]$, then

$$
\left.\begin{array}{l}
X_{R B}=\left[\begin{array}{lllrlrllllll}
-1 & 1 & 0 & -1 & 0 & -1 & 1 & 0 & -1 & 0 & 0 & 0
\end{array}\right]=-1320 \\
C=\left[\begin{array}{llllrrrrrrrr} 
\\
\text { and } & 1 & 1 & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0
\end{array}\right] \\
\text { and } \\
S=\left[\begin{array}{llllllllll}
1 & 1 & 0 & 1 & 0 & 1 & 1 & 0 & 1 & 1
\end{array} 0\right.
\end{array}\right)
$$

For example, at bit position $0, x_{0}=0$ and $c_{0}=0$; then according to Table 5-A, $s_{0}=0$ and $c_{1}=0$. For bit position $3, x_{3}=-1, c_{3}=0$; then according to Table 5-A, $s_{3}=1$ and $c_{4}=1$.

Using the encoding provided in Table 1-B to encode Table $5-\mathrm{A}$, the RB-NB conversion is shown in Table 5-B.

Table 5-B. Conversion Truth Table for RB-NB

| Input |  | Output |  |
| :---: | :---: | :---: | :---: |
| Redundancy bit $x_{i}=\left(x_{i}^{-} x_{i}^{+}\right)$ | Carry in $c_{i}$ | Binary bit $s_{i}$ | Carry out $c_{i+1}$ |
| 0 (01) (10) | 0 | 0 | 0 |
| $0(01)(10)$ | 1 | 1 | 1 |
| 1 (11) | 0 | 1 | 0 |
| 1 (11) | 1 | 0 | 0 |
| -1 (00) | 0 | 1 | 1 |
| $-1(00)$ | 1 | 0 | 1 |

According to this conversion truth table, we derive the following equation:

$$
\begin{align*}
& s_{i}=\overline{c_{i} \oplus\left(x_{i}^{-} \oplus x_{i}^{+}\right)} \\
& c_{i+1}=\overline{x_{i}^{-}+x_{i}^{+}}+c_{i} \overline{x_{i}^{-} x_{i}^{+}} \tag{5.1}
\end{align*}
$$

For the $c_{i+1}$ equation above, define the signals, carry propagate, $p_{i}=\overline{x_{i}^{-} x_{i}^{+}}$, and carry generate, $g_{i}=\overline{x_{i}^{-}+x_{i}^{+}}$. Then:

$$
\begin{equation*}
c_{i+1}=g_{i}+c_{i} p_{i} \tag{5.2}
\end{equation*}
$$

Unrolling the carry equations, we get:

$$
\begin{align*}
& c_{4}=g_{3}+g_{2} p_{3}+g_{1} p_{2} p_{3}+g_{0} p_{1} p_{2} p_{3}+c_{0} p_{0} p_{1} p_{2} p_{3} \\
& c_{3}=g_{2}+g_{1} p_{2}+g_{0} p_{1} p_{2}+c_{0} p_{0} p_{1} p_{2}  \tag{5.3}\\
& c_{2}=g_{1}+g_{0} p_{1}+c_{0} p_{0} p_{1} \\
& c_{1}=g_{0}+c_{0} p_{0}
\end{align*}
$$

Based on the Ling adder [106], a more efficient RB-NB converter can be designed.

Define the signal, carry transfer, $t_{i}=p_{i}+g_{i}$, (carry is not annihilated) and $h_{i}=c_{i}+c_{i-1}$, then

$$
\begin{align*}
c_{i} & =g_{i-1}+c_{i-1} p_{i-1}=g_{i-1}+c_{i-1} t_{i-1} \\
& =g_{i-1}+g_{i-2} t_{i-1}+g_{i-3} t_{i-2} t_{i-1}+g_{i-4} t_{i-3} t_{i-2} t_{i-1}+c_{i-4} t_{i-4} t_{i-3} t_{i-2} t_{i-1} \tag{5.4}
\end{align*}
$$

Ling's modification consists of propagating $h_{i}=c_{i}+c_{i-1}$ instead of $c_{i}$. To understand the following derivations, we note that $g_{i-1}$ implies $c_{i}\left(c_{i}\right.$ if $\left.g_{i-1}=1\right)$, which in turn implies $h_{i}$.

$$
\begin{aligned}
p_{i} g_{i} & \left.\left.=\overline{x_{i}^{-} x_{i}^{+}} \overline{x_{i}^{-}+x_{i}^{+}}=\overline{\left(x_{i}^{-}\right.}+\overline{x_{i}^{+}}\right) \overline{\left(x_{i}^{-}\right.} \overline{x_{i}^{+}}\right) \\
& =\overline{x_{i}^{-}} \overline{x_{i}^{+}}=\overline{x_{i}^{-}+x_{i}^{+}}=g_{i} \\
c_{i} & =g_{i-1}+c_{i-1} p_{i-1} \\
& =g_{i-1} h_{i}+g_{i-1}+c_{i-1} p_{i-1} \quad\left(g_{i-1} h_{i}+g_{i-1}=g_{i-1}\right) \\
& =g_{i-1} h_{i}+c_{i-1} p_{i-1}+p_{i-1} g_{i-1} \quad\left(p_{i-1} g_{i-1}=g_{i-1}\right)+c_{i-1} p_{i-1} p_{i-1}(\text { repeated term }) \\
& =g_{i-1} h_{i}+c_{i-1} p_{i-1}+p_{i-1}\left(g_{i-1}+c_{i-1} p_{i-1}\right) \\
& =g_{i-1} h_{i}+c_{i-1} p_{i-1}+c_{i} p_{i-1} \\
& =g_{i-1} h_{i}+p_{i-1}\left(c_{i-1}+c_{i}\right) \\
& =g_{i-1} h_{i}+p_{i-1} h_{i}=h_{i} t_{i-1} \\
h_{i} & =c_{i}+c_{i-1}=\left(g_{i-1}+c_{i-1} p_{i-1}\right)+c_{i-1} \\
& =g_{i-1}+c_{i-1} \\
& =g_{i-1}+h_{i-1} t_{i-2}
\end{aligned}
$$

Unrolling the recurrence for $h_{i}$, we get:

$$
\begin{align*}
h_{i}= & g_{i-1}+t_{i-2} h_{i-1}=g_{i-1}+t_{i-2}\left(g_{i-2}+h_{i-2} t_{i-3}\right) \\
& =g_{i-1}+g_{i-2}+h_{i-2} t_{i-2} t_{i-3}\left\{\text { Since } t_{i-2} g_{i-2}=g_{i-2}\right\} \\
& =g_{i-1}+g_{i-2}+g_{i-3} t_{i-3} t_{i-2}+h_{i-3} t_{i-4} t_{i-3} t_{i-2} \\
= & g_{i-1}+g_{i-2}+g_{i-3} t_{i-2}+g_{i-4} t_{i-3} t_{i-2}+h_{i-4} t_{i-3} t_{i-4} t_{i-3} t_{i-2} \\
& h_{1}=g_{0}+t_{-1} h_{0} \\
& h_{2}=g_{1}+g_{0}+h_{0} t_{0} t_{-1}  \tag{5.5}\\
& h_{3}=g_{2}+g_{1}+g_{0} t_{1}+h_{0} t_{-1} t_{0} t_{1} \\
& h_{4}=g_{3}+g_{2}+g_{1} t_{2}+g_{0} t_{1} t_{2}+h_{0} t_{-1} t_{0} t_{1} t_{2}
\end{align*}
$$

Now, the expression for the converter output is:

$$
\begin{align*}
s_{i} & =\overline{c_{i} \oplus\left(x_{i}^{-} \oplus x_{i}^{+}\right)}  \tag{5.6}\\
& =\overline{\left(h_{i} t_{i-1}\right) \oplus\left(x_{i}^{-} \oplus x_{i}^{+}\right)}
\end{align*}
$$

where,

$$
\begin{aligned}
t_{i} & =p_{i}+g_{i}=\overline{x_{i}^{-} x_{i}^{+}}+\overline{x_{i}^{-}+x_{i}^{+}}=\left(\overline{x_{i}^{-}}+\overline{x_{i}^{+}}\right)+\left(\overline{x_{i}^{-}} \overline{x_{i}^{+}}\right) \\
& =\overline{x_{i}^{-}}+\overline{x_{i}^{+}}=\overline{x_{i}^{-} x_{i}^{+}}=p_{i}
\end{aligned}
$$

Here, $h_{0}$ and $h_{4}$ are the 4-bit converter's $h_{\text {in }}$ and $h_{\text {out }}$, respectively. A carry network based on the preceding equations can be used in conjunction with 2-input NANDs, producing the $t_{i}$ signals, 2 -input NORs, producing $g_{i}$, and 3-input XNORs, producing the sum bits, to build a 4-bit binary RB-NB converter. Note that since $h_{4}$ does not affect the computation of the sum bits, it can be derived based on the simpler equation:

$$
\begin{equation*}
h_{4}=g_{3}+h_{3} t_{2}, \tag{5.7}
\end{equation*}
$$

with a slight speed penalty. The resulting carry network is depicted in Figure 5-1.


Figure 5-1. Four-Bit Carry-Lookahead RB-NB Converter (Similar to [85])

Compared to Equation (5.3), Equation (5.5) for $h_{4}$ contains only 12 terms, while in Equation (5.3), $c_{4}$ has 15 terms. The cost is that the sum is obtained by a slightly more complex expression in (5.6), as compared to Equation (5.1).

Given the design represented in Equations (5.5), the group "block generate" and "block propagate" signals can be derived as follows:

$$
\begin{align*}
& g\left[{ }_{i, i+3}\right]=g_{i+3}+g_{i+2}+g_{i+1} t_{i+2}+g_{i} t_{i+1} t_{i+2}  \tag{5.8}\\
& t\left[\left[_{i, i+3}\right]=t_{i-1} t_{i} t_{i+1} t_{i+2}\right.
\end{align*}
$$

Figure 5-2 shows a schematic diagram of a 4-bit carry-lookahead block carry generator based on Ling's design.


Figure 5-2. Diagram of a 4-Bit Carry-Lookahead RB-NB Carry Generator (Similar to [85])

Given the 4-bit carry-lookahead generator from Figure 5-1 and Figure 5-2, the construction of a multilevel-lookahead circuit is straightforward. For example, to construct a two-level 16-bit carry-lookahead RB-NB converter, we need four 4-bit RBNB converters and a 4-bit carry-lookahead generator, connected together as shown in

Figure 5-3.


Figure 5-3. Two-Level 16-bit RB-NB Converter (Similar to [85])

### 5.2 Comparison Result

In this section, the novel converter is compared to a traditional carry-look-ahead based converter. A 4-bit converter is used for comparison. Assume only 2-inputs OR or AND gates can be used to build such a converter. For a traditional carry-look-ahead converter, the longest latency is defined by Equation (5.3). That is,

$$
c_{4}=g_{3}+g_{2} p_{3}+g_{1} p_{2} p_{3}+g_{0} p_{1} p_{2} p_{3}+c_{0} p_{0} p_{1} p_{2} p_{3}
$$

If only two-input gates are allowed, then it requires 14 gates to the realized $c_{4}$.

The critical path delay is six gates level.
For the converter investigated here, the critical path delay is defined by Equation (5.5). That is,

$$
h_{4}=g_{3}+g_{2}+g_{1} t_{2}+g_{0} t_{1} t_{2}+h_{0} t_{-1} t_{0} t_{1} t_{2}
$$

If only two-input gates are allowed, then this requires 10 gates to the realized $h_{4}$ and the critical path delay is only five gates level.

## Chapter 6 Summary and Conclusions

Inner-product computations play a central role in digital signal processing, especially in the areas of digital filters, signal correlation, convolution, FFT, etc. Complex number arithmetic computation is a key arithmetic feature required in modern digital communication, radar systems and optical systems. Many algorithms based on convolutions, correlations, and complex number filters require complex number multiplication and high-speed inner-product computation. The overall motivation for this work is the design of a high-performance complex arithmetic processor (CASP) capable of offering novel extended inner-product operations.

The CASP design relies on the high-speed multiplication afforded by redundant binary techniques, while avoiding the relatively slow conversion back to 2 's-complement numbers until a final 2 's-complement result is necessary. Inherently, the CASP device provides intermediate register storage for redundant binary, as well as 2 's-complement numbers. A new high-performance inner-product processor using redundant binary number representation is presented in this dissertation.

When the Booth coding technique is used, our proposed RB inner-product processor can significantly reduce the number of partial product to $25 \%$. Also, it can be dynamically reconfigured/controlled to perform real, complex and redundant binary number computations such as parallel multiplications and inner-product computations. The extended computational capabilities of the RB IP processor are developed for real, complex, and redundant binary number or mixed computations. In Chapter 2, the structure of $A_{0} B_{0}+A_{1} B_{1}$ for IP computation is studied. Two possible implementations,
the inline and the cross partial product methods, are compared, with our inline method provides several advantages in speed and flexibility.

Complex number representations and arithmetic are also studied. Different complex radices such as radix- $(2 j)$, radix- $(j-1)$ and radix- $(\sqrt{2} j)$ are investigated and compared. It is found that the complex radices have no advantage in hardware implementations. The traditional redundant binary number representation is used to implement complex-number multiplication and inner-product processing. The new RB inline inner-product processor can be reconfigured/controlled to perform complexnumber computations. The structures for $A_{0} B_{0}+A_{1} B_{1}$ and $A_{0} B_{0}-A_{1} B_{1}$ is developed and compared. The implementation of $A_{0} B_{0}+A_{1} B_{1}$ can be easily controlled to perform the computation of $A_{0} B_{0}-A_{1} B_{1}$. The complex number inner-product processor is investigated based upon this unified structure for $A_{0} B_{0} \pm A_{1} B_{1}$. The implementation using the RB IP processor is compared with the TMS320C6XXX processor. This comparison shows there is some speed improvement for the RB IP core. Next, a unified signed/unsigned multiplier without and with Booth encoding is presented. Based upon the unified multiplier, the RB IP processor is further extended to realize a redundant binary multiplier that can accept both 2's-complement or RB inputs. The ability to accept RB inputs is essential for iterative calculations such as real and complex number division.

In Chapter 3, different division methods are reviewed. Two function iteration division methods, Newton-Raphson and Goldschimdt, are compared in detail. The theoretical equivalence of these two methods is shown. Further studies show that the Goldschmidt method is preferred over the Newton-Raphson method for efficient hardware implementation. Extension to the RB IP core are provided for performing

Goldschmidt division. The division implementation structure for both real and complex numbers is discussed using the same IP processor.

In Chapter 4, together with the basic inner-product operations, the computational capabilities afforded can be implemented using control signals or circuit reconfiguration if configurable hardware is used. These extended operations provide a rich set of computational capabilities targeted for implementation in a complex arithmetic signal processor (CASP). Various extensions such as IP computations, parallel multiplication of real, complex and redundant binary numbers are studied. Possible pipeline implementations of the RB IP core are investigated. A two-stage and three-stage pipeline structures are presented and the time delay model of these stages is studied. An improved RB to 2's-complement number converter is investigated in Chapter 5. This converter shows improvement in speed with a small increase in area.

Several areas of research are suggested. Further development of the IP core is required for the extended calculation capabilities, primarily dealing with the segmented accumulator and the requirements for flag setting based on arithmetic results for both saturation and overflow arithmetic. In addition, the IP processor can be developed to provide computational capabilities for square root, CORDIC, and other iterative functions.

Since the IP processor developed here serves as a core DSP computing capability, the overall architecture of the Complex Arithmetic Signal Processor (CASP) device requires extensive research to provide a dual numeric representation, i.e., 2'scomplement and redundant binary. The CASP device should have a rich instruction set architecture that leverages the IP core for performing calculations for signed/unsigned,
real/complex binary numbers, as well as intermediate calculations on redundant binary numbers. The CASP device will significantly enhance future applications requiring highperformance inner-product calculations.

## APPENDIX VHDL HARDWARE IMPLEMENTATION

For the information regarding VHDL hardware implementations, please contact the Office of Technology Development, University of Oklahoma.

660 Parrington Oval
Evans Hall, Room 201
Norman, Oklahoma 73019
Tel: (405) 325-3800
Fax: (405) 325-7162

## REFERENCES

[1] Y. Harata, Y. Nakamura, H. Nagase, M. Takigawa, and N. Takagi, "A high-speed multiplier using a redundant binary adder tree," IEEE Journal of Solid-State Circuits, vol. 22, no. 1, pp. 28-33, Feb. 1987.
[2] X. Huang, W. Liu, and B. W. Y. Wei, "A high-performance CMOS redundant binary multiplication-and-accumulation (MAC)," IEEE Transactions on Circuits and Systems I. vol. 41, no. 1, pp. 33-39, Jan. 1994.
[3] H. Makino, Y. Nakase, H. Suzuki, H. Morinaka, H. Shinohara, and K. Mashiko, "An 8.8 -ns $54 \times 54$-bit multiplier with high speed redundant binary architecture, " IEEE Journal of Solid-state Circuits, vol. 31, no. 6, pp. 773-783, June 1996.
[4] N. Takagi, H. Yasuura, and S. Yajima, "High-speed VLSI multiplication algorithm with a redundant binary addition tree," IEEE Transactions on Computers, vol. C-34, no. 9, pp. 789-796, Sept., 1985.
[5] G. A. Ruiz, "4bit CLA-based conversion from redundant to binary representation for CMOS simple and multi-output implementations," Electronics Letters, vol. 35, no. 4, pp. 281-283, Feb. 1999.
[6] A. Herrfeld and Hentschke S., "Conversion of redundant binary into two's complement representations," Electronics Letters, vol. 31, no. 14, pp. 1132-1133, July 1995.
[7] S. Yen, C. Laih, C. Chen, and J. Lee, "An efficient redundant-binary number to binary number converter, " IEEE Journal of Solid-State Circuits, vol. 27, no. 1, pp. 109-112, Jan. 1992.
[8] M. O. Ahmad and D. V. Poornalah, "Design of an efficient VLSI inner-product processor for real-time DSP applications," IEEE Transactions on Circuits and Systems, vol. 36, no. 2, pp. 324-329, Feb. 1989.
[9] E. Abdel-Raheem, A. Tawfik, M. Fahmi, and F. El-Guibaly, "New inner-product processor for FIR filter implementation," in Proceedings of IEEE Pacific RIM Conference Communication, Computer and Signal Processing, Victoria, May 17-19, 1995, pp. 395-398.
[10] D. J. Soudris, V. Paliouras, T. Stouraitis, and C. E. Goutis, "A VLSI design methodology for RNS full adder-based inner product architectures," IEEE Transactions on Circuits and Systems - II: Analog and Digital Signal Processing, vol. 44, no. 4, pp. 315-318, April 1997.
[11] M. N. Fahmi, F. El-Guibaly, S. Sunder, and D. J. Shpak, "Design of novel serialparallel inner-product processors," in 1994 IEEE International Symposium on Circuits and Systems, London, UK, May 30-June 2, 1994, pp. 55-58.
[12] S. Haynal and B. Parhami, "Arithmetic structures for inner-product and other computations based on a latency-free bit-serial multiplier design," in $33^{\text {rd }}$ Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Nov. 3-6, 1996, pp. 197-201.
[13] W. P. Burleson and L. L. Scharf, "VLSI design of inner-product computers using distributed arithmetic," in 1989 IEEE International Symposium on Circuits and Systems, Portland, OR, May 8-11, 1989, pp. 158-161.
[14] A. S. Vega, P. S. R. Diniz, and A. C. Mesquita, "A modular distributed-arithmetic implementation of the inner product and its application to digital filters," Journal of VLSI Signal Processing, vol. 10, pp. 93-106, 1995.
[15] N. Kazakova, R. Sung, N. Durdle, M. Margala, and J. Lamoureux, "Fast and lowpower inner product processor," in The 2001 IEEE International Symposium on Circuits and Systems, Sydney, Australia, May 6-9, 2001, vol. 4, pp. 646-649.
[16] R. Lin and S. Olariu, "A new buses scheme for fast inner-product computation," in $28^{\text {th }}$ Asilomar Conference on Signals, Systems and Computers, Monterey, CA, Oct. 30-Nov. 2, 1994, pp. 1402-1406.
[17] J. A. Starzyk and C. Chen, "A VLSI inner-product processor for real-time DSP applications," in Proceedings of the 26th Southeastern Symposium on System Theory, Athens, OH, March 20-22, 1994, pp. 219-223.
[18] C. C. Wang, C. J. Huang, and Y. P. Chen, "Design of an inner-product processor for hardware realization of multi-valued exponential bidirectional associative memory," IEEE Transactions on Circuits and Systems II: Analog Digital Signal Processing, vol. 47, no. 11, pp. 1271-1278, Nov. 2000.
[19] C. C. Wang, P. M. Lee and C. J. Huang, "Three alternative architectures of digital ratioed compressor design with application to inner-product processing," IEE Proceedings on Computers and Digital Techniques, vol. 147, no. 2, pp. 65-74, March 2000.
[20] D. C. M. Bilsby, R. L. Walke, and R. W. M. Smith, "Comparison of a programmable DSP and a FPGA for real-time multiscale convolution," in Proceeding of the 1998 IEE Colloquium on High Performance Architectures for Real-Time Image Processing, London, UK, Feb. 12, 1998, pp. 4/1-4/6.
[21] L. Breveglieri and L. Dadda, "A VLSI inner product macrocell," IEEE Transactions on VLSI Systems, vol. 6, no. 2, pp. 292-298, June 1998.
[22] E. Dujardin and O. Gay-bellile, "Software implementation of ADSL application with a convolution coprocessor," in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, Seattle, WA, May 12-15, 1998, pp. 3053-3056.
[23] O. Gay-Bellile and E. Dujardin, "Architecture of a programmable FIR filter coprocessor," in 1998 IEEE International Symposium on Circuits and Systems, Pacific Grove, CA, May 31-June 3, 1998, vol. 5, pp. 433-436.
[24] R. Lin, "Reconfigurable parallel inner product processor architectures," IEEE Transactions on VLSI Systems, vol. 9, no. 2, pp. 261-272, April, 2001.
[25] R. Lin, A.S., Botha, K.E. Kerr, and G.A., Brown, "An inner product processor design using novel parallel counter circuits", in The First IEEE Asia Pacific Conference on ASICs, Seoul, Korea, Aug. 23-25, 1999, pp. 99-102.
[26] W. K. Luk and J. E. Vuillemin, "Recursive implementation of optimal time VLSI integer multiplier," in Proceedings of VLSI 1983, Amsterdam, Netherlands, 1983, pp. 155-168.
[27] R. Managuli, G. York, and Y. Kim, "An efficient convolution algorithm for VLIW mediaprocessors," in Proceeding of SPIE-the International Society for Optical Engineering, Jan. 1999, pp. 65-74.
[28] S. H. Baik, K.N. Han, and E. Yoon, "A 230MHz 8 tap programmable FIR filter using redundant binary number system," in IEEE International Symposium on Circuits and Systems, Orlando, FL, May 30-June 2, 1999, pp. 415-417.
[29] M. A. Sacristan, V. Rodellar, A. Diaz, V. Garcia, and P. Gomez, "A reusable inner product unit for DSP applications," in $25^{\text {th }}$ Proceedings of EUROMICRO Conference, 1999, pp. 209-213.
[30] G. Wang and M. Tull, "The implementation of an efficient and high-speed innerproduct processor," in $35^{\text {th }}$ Asilomar Conference on Signals, Systems, and Computers, Nov. 4-7, 2001, vol. 2, pp. 1362-1366.
[31] T. Aoki, Ohi, Y., and Higuchi, T., "Redundant complex number arithmetic for highspeed signal processing," in IEEE Workshop on VLSI Signal Processing, Sakai, Japan, Oct. 16-18, 1995, pp. 523-532.
[32] J. Buhler, M.A. Shokrollahi, and V. Stemann, "Fast and precise Fourier transforms," IEEE Transactions on Information Theory, vol. 46, no. 1, pp. 213-228, Jan 2000.
[33] D. Fu and A. N. Willson Jr., "A high-speed processor for rectangular-to-polar conversion with applications in digital communications," in 1999 Global

Telecommunications Conference, GLOBECOM '99, Janeireo, Brazil, Dec. 5-9, 1999, vol. 4, pp. 2172-2176.
[34] Y.P. Lee, L.G., M.J. Chen, and C.W. Ku, "A new design and implementation of $8 \times 8$ 2-D DCT/IDCT," in Workshop on VLSI Signal Processing, San Francisco, CA, Oct. 30-Nov. 1, 1996, pp. 408-417.
[35] K. Sasayama, M. Okuno, and K. Habara, "Coherent optical transversal filter using silica-based single-mode waveguides," Electronics Letters, vol. 25, no. 22, pp. 15081509, Oct. 1989.
[36] K. Sasayama, M. Okuno, and K. Habara, "Coherent optical transversal filter using silica-based waveguides for high-speed signal processing," Journal of Lightwave Technology, vol. 9, no. 10, pp. 1225-1230, Oct 1991.
[37] M.A. Soderstrand, D.H. Chu, W. Chan, M. Lazkani, H.H. Loomis Jr., "Multi-rate bandpass filter bank implemented in QRNS complex arithmetic using parallel multiple DSP chips or ASICs," in $27^{\text {th }}$ Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Nov. 1-3, 1993, pp. 801-806.
[38] S. Toledo, "On the communication complexity of the discrete Fourier transform," IEEE Signal Processing Letters, vol. 3, no. 6, pp. 171-172, June 1996.
[39] M. Tull, G. Wang, and M. Ozaydin, "High-speed complex number multiplier and inner-product processor," in $45^{\text {th }}$ Midwest Symposium on Circuits and Systems, August 4-7, 2002, vol. 3, pp. 640-643.
[40] A. Berkeman, V. Owall, and M. Torkelson, "A low logic depth complex multiplier using distributed arithmetic," IEEE Journal of Solid-State Circuits, vol. 35, no. 4, pp. 656-659, April 2000.
[41] S. He and M. Torkelson, "A complex array multiplier using distributed arithmetic," in Proceedings of the IEEE 1996 Custom Integrated Circuits Conference, San Diego, CA, May 5-8, 1996, pp. 71-74.
[42] M. Karlsson, M. Vesterbacka, and L. Wanhammar, "Design and implementation of a complex multiplier using distributed arithmetic, " in 1997 IEEE Workshop on Signal Processing, Leicester, UK, Nov. 3-5, 1997, pp. 222-231.
[43] V. G. Oklobdzija, "An integrated multiplier for complex numbers," Journal of VLSI Signal Processing, vol. 7, pp. 213-222, 1994.
[44] V. G. Oklobdzija, D. Villeger, and T. Soulas, "Considerations for design of a complex multiplier," in $26^{\text {th }}$ Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Oct. 26-28, 1992, pp. 366-370.
[45] A. P. Pascual, J. Valls, and M. M. Peiro, "Efficient complex number multipliers mapped on FPGA," in 6th IEEE International Conference on Electronics, Circuits and Systems, Pafos, Cyprus, Sept. 5-8, 1999, pp. 1123-1126.
[46] A. P. Pascual, T. Sansaloni, and J. Valls, "FPGA based on-line complex-number multipliers," in $8^{\text {th }}$ IEEE International Conference on Electronics, Circuits and Systems, Malta, Sept. 2-5, 2001, vol. 3, pp. 1481-1481.
[47] K. Z. Pekmestzi, "Complex number multipliers," IEE Proceedings, vol. 136, no. 1, pp. 70-75, Jan. 1989.
[48] T. J. Sansalon, J. Valls, and K. K. Parhi, "FPGA-based digit-serial complex number multiplier accumulator," 2000 IEEE International Symposium on Circuits and Systems, Geneva, Switzerland, May 28-31, 2000, pp. 585-588.
[49] K. W. Shin and H. W. Jeon, "High-speed complex-number multiplications based on redundant binary representation of partial products," International Journal of Electronics, vol. 87, no. 6, pp. 683-702, June 2000.
[50] K. W. Shin, B. S. Song, and K. Bacrania, "A 200-Mhz complex number multiplier using redundant binary arithmetic," IEEE Journal of Solid-State Circuits, vol. 33, no. 6, pp. 904-909, June 1998.
[51] A. Skavantzos and T. Stouraitis, "Decomposition of complex multipliers using polynomial encoding," IEEE Transactions on Computers, vol. 41, no. 10, pp. 13311333, Oct. 1992.
[52] T. Soulas, D. Villeger, and V. G. Oklobdzija, "An ASIC macro cell multiplier for complex numbers," in $4^{\text {th }}$ European Conference on Design Automation with the European Event in ASIC Design, 1993, pp. 589-593.
[53] T. Stouraitis and A. Skavantzos, "Multiplication of complex numbers encoded as polyniamials," Journal of VLSI Signal Processing, vol. 3, no. 4, pp. 319-328, April 1991.
[54] T. Stouraitis and A. Skavantzos, "Parallel decomposition of complex multipliers," in $22^{\text {nd }}$ Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, 1988, pp. 379-383.
[55] D. C. Blest and Jamil, T., "Efficient division in the binary representation of complex numbers," in Southeastcon 2001 Proceedings IEEE, Clemson, SC, March 30-April 1, 2001, pp. 188-195.
[56] Y. N. Chang and K. K. Parhi, "High-performance digit-serial complex-number multiplier-accumulator", in Proceedings of the 1998 IEEE International Conference on Computer Design, Austin, TX, Oct. 5-7, 1998, pp. 211-213.
[57] Y. N. Chang and K. K. Parhi, "High-performance digit-serial complex multiplier," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing II, vol. 47, no. 6, pp. 570-572, June 2000.
[58] T. T. Dao, "Knuth's complex arithmetic with quaternary hardware," in $12^{\text {th }}$ International Symposium on Multiple-Value Logic, May 1982, pp. 94-98.
[59] J. Duprat, Y. Herreros, and S. Kla, "New redundant representations of complex number," IEEE Transactions on Computers, vol. 42, no. 7, pp. 817-823, July 1993.
[60] C. Fougny, "Parallel and on-line addition in negative base and some complex number systems," Euro-Par'96 Parallel Processing, pp. 175-182, 1996.
[61] W. J. Gilbert, "Complex numbers with three radix expansions," Canadian Journal of Mathematics, vol. 24, no. 6, pp.1335-1348, June 1982.
[62] T. Jamil, N. Holmes, and D. Blest, "Towards implementation of a binary number system for complex numbers," in Proceedings of the IEEE Southeastcon 2000, Nasville, TN, April 7-9, 2000, pp. 268-274.
[63] D. E. Knuth, "An imaginary number system," Communications of the ACM, vol. 3, no. 4, pp. 245-247, April 1960.
[64] D. E. Knuth, The Art of Computer Programming, Vol. 2: Addison-Wesley Publishing Company, 1969.
[65] I. Koren and Maliniak, Y., "On classes of positive, negative and imaginary radix number systems", IEEE Transactions on Computers vol. C-30, no. 5, pp. 312-317, May 1981.
[66] W. Penney, "A binary system for complex numbers," Journal of the Associated for Computing Machinery, vol. 12, no. 2, pp. 247-248, April, 1965.
[67] A. Slekys, "Design of complex number digital arithmetic units based on a modified bi-imaginary number system," Ph.D. dissertation, University of California at Los Angels, Los Angeles, CA, 1976.
[68] V. N. Stepanenko, "Computer arithmetic of complex numbers," Cybernetics and Systems Analysis, vol. 32, no. 4, pp. 585-591, July 1996.
[69] A. Avizienis, "Signed-digit number representations for fast parallel arithmetic," IRE Transactions on Electronic Computers, vol. EC-10, no. 3, pp. 389-400, Sept. 1961.
[70] G. M. Blair, "The Equivalence of two's-complement addition and the conversion of redundant binary of twos-complement numbers," IEEE Transactions on Circuits and

Systems I: Fundamental Theory and Applications, vol. 45, no. 6, pp. 669-671, June 1998.
[71] T. N. Rajashekhara and A.S. Nale, "Conversion from signed-digit to radix complement representation," International Journal of Electronics, vol. 69, no. 6, pp. 717-721, Dec. 1990.
[72] M. D. Ercegovac and T. Lang, "On-the-fly conversion of redundant into conventional representations", IEEE Transactions on Computers, vol. C-36, no. 7, pp. 895-897, July 1987.
[73] M.P. Tull, G. Wang, and M. Ozaydin, "Method and apparatus for converting redundant binary numbers to two's-complement binary numbers", Disclosure 03NOR005, University of Oklahoma, July 2002.
[74] I. Choo and R. G. Deshmukh, "A novel conversion scheme from a redundant Binary to two's complement binary number for parallel architectures," in SoutheastCon 2001 Proceedings of IEEE, 2001, Clemson, SC, March 30-April 1, 2001, pp. 196 201.
[75] S. F. Oberman and M. J. Flynn, "Design issues in division and other floating-point operations," IEEE Transactions on Computers, vol. 46, no. 2, pp. 154-161, Feb. 1997.
[76] G. Wang, M. Ozaydin, and M. Tull, "High-performance divider using redundant binary representation," in $45^{\text {th }}$ Midwest Symposium on Circuits and Systems, August 4-7, 2002, vol. 3, pp. 640-643.
[77] T. Aoki, H. Tokoyo, and T. Higuchi, "High-radix parallel dividers for VLSI signal processing," in $4^{\text {th }}$ Workshop on VLSI Signal Processing, Pacific Grove, CA, Oct. 30-Nov. 1, 1996, pp. 83-92.
[78] D. E. Atkins, "Higher-radix division using estimates of the divisor and partial remainder," IEEE Transactions on Computers, vol. C-17, no. 10, pp. 925-934, Oct. 1968.
[79] M. D. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations: Kluwer Academic, 1994.
[80] M. Flynn, "On division by functional iteration," IEEE Transaction on Computers, vol. 19, no. 8, Aug. 1970.
[81] C. V. Freiman, "Statistical analysis of certain binary division algorithms," IRE Proceedings, vol. 49, pp. 91-103, 1961.
[82] R. E. Goldschmidt, "Application of division by convergence," M.S. thesis, MIT, Cambridge, MA, June 1964.
[83] S. Kuninobu, T. Nishiyama, H. Edamatsu, T. Taniguchi, and N. Takagi, "Design of high speed MOS multiplier and divider using redundant binary representation," in Proceedings of $8^{\text {th }}$ Symposium on Computer Arithmetic, 1987, pp. 80-86.
[84] S. F. Oberman and M. J. Flynn, "Division algorithms and implementations," IEEE Transactions on Computers, vol. 46, no. 8, pp. 833-854, Aug. 1997.
[85] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs: Oxford University Press, 2000.
[86] J. E. Robertson, "A new class of digital division methods," IRE Transactions on Electronic Computers, vol. 7, pp. 218-222, Sep. 1958.
[87] E. M. Schwarz and M. J. Flynn, "Hardware starting approximation method and its application to the square root operation," IEEE Transactions on Computers, vol. 45, no. 12, pp. 1356-1369, Dec. 1996.
[88] H. R. Srinivas, "High speed computer arithmetic architectures," Ph.D. dissertation, University of Minnesota, Twin Cities, MN, 1994.
[89] K. G. Tan, "The theory and implementation of high-radix division," in Proceeding of $4^{\text {th }}$ IEEE Symposium of Computer Arithmetic, 1978, pp. 154-163.
[90] K. D. Tocher, "Techniques of multiplication and division for automatic binary computers," Quarterly Journal of Mechanic Applied Mathematics, vol. 11, part 3, pp. 386-383, 1958.
[91] Intel Architecture Tutorial, Intel Company, http://www.intel.com.
[92] "TMS320C6000 programmer's guide," http://dspvillage.ti.com/docs/catalog/resources/techdocs.jhtml?navSection=user_guid es\&familyId=132.
[93] S. D. Pezaris, "A 40-ns 17 -bit by 17 -bit array multipliers," IEEE Transactions on Computers, vol. 20, pp. 442-447, April 1971.
[94] C. S. Wallace, "A suggestion for a fast multiplier," IEEE Transaction on Electronic Computer, vol. EC-13, pp. 14-17, Feb. 1964.
[95] IEEE Standard VHDL Language Reference Manual IEEE Std 1076-2002, http://ieeexplore.ieee.org/Xplore/DynWel.jsp
[96] A.D. Booth, "A Signed binary multiplication technique," Quarterly Journal Mechanics and Applied Mathematics, vol. 4, pt. 2, pp. 236-240, June 1951.
[97] T. Aoki, Amada, H., and T. Higuchi, "Real/Complex reconfigurable arithmetic using redundant complex number systems," in $13^{\text {th }}$ IEEE Symposium on Computer Arithmetic-Arith'97, Pacific Grove, CA, July 6-9, 1997, pp. 200-207.
[98] T. Aoki, Hoshi, K., and Higuchi, T., "Redundant complex arithmetic and its application to complex multiplier design," in Proceeding of $29^{\text {th }}$ IEEE International Symposium on Multiple Valued Logic, Freiburg, Germany, May 20-22, 1999, pp. 200-207.
[99] R. Mcllhenny and M. D. Ercegovac, "On-line algorithms for complex number arithmetic," in $32^{\text {nd }}$ Asilomar Conference on Signals, Systems, Computers, Pacific Grove, CA, Nov. 1-4, 1998, pp. 172-176.
[100] R. Mcllhenny and M.D. Ercegovac, "On the design of on-line givens rotation," in $35^{\text {th }}$ Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Nov. 4-7, 2001, pp. 160-164.
[101] R. E. Blahut, Fast Algorithms For Digital Signal Processing: Addison-Wesley, 1987.
[102] S. Zhoar, "Negative radix conversion," IEEE Transactions on Computers, vol. C19, no. 3, pp. 222-226, March 1970.
[103] D.P. Agrawal, "Negabinary carry-look-ahead adder and fast multiplier," Electronic Letters, vol. 10, no. 15, pp. 312-313, July 1974.
[104] N.G.P. Satish, "Negative radix arithmetic and its applications," M.S. thesis, University of Nevada, Las Vegas, NV, 1973.
[105] G. Wang, M. P. Tull, and M. Ozaydin, "Binary conversion algorithms for the implementation of complex-radix numbers," presented at Proceedings of $2^{\text {nd }}$ IEEE Electro/Information Technology, Oakland, MI, June 6-8, 2001.
[106] H. Ling, "High-speed binary adder," IBM Journal of Research and Development, vol. 25, no. 3, pp. 156-166, March 1981.
[107] W. S. Briggs and D. W. Matula, "A $17 \times 69$ bit multiply and add unit with redundant binary feedback and single cycle latency," in Proceeding of IEEE $11^{\text {th }}$ Symposium on Computer Arithmetic, Windsor, Canada, June 29-July 2, 1993, pp. 163-170.
[108] P. Chai, et al., "A 120 MFLOPS COS floating-point processor," in Proceeding Of IEEE 1991 Custom Integrated Circuit Conference, San Diego, CA, May 12-15, 1991, pp. 15.1/1-15.1/4.
[109] H. M. Darley, et al., "Floating-point/integer processor with divide and square root functions," U.S. Patent 4878190, Oct. 1989.
[110] S. M. Quek, L. Hu, J. P. Prabhu, and F. A. Ware, "Apparatus for determining booth recoder input control signal," U.S. Patent 5280439, Jan. 1994.
[111] Y. Harata, Y. Nakamura, H. Nagase, M. Takigawa, and N. Takagi, "A high-speed multiplier using a redundant binary adder tree," IEEE Journal of Solid-State Circuits, vol. 22, no. 1, pp. 28-34, Feb. 1987.
[112] N. Takagi and S. Yajima, "On a fast iterative multiplication method by recoding intermediate product," in Proceedings Of $36^{\text {th }}$ National Convention of Information Science, Kyoto University, Aug. 1987.
[113] H.A.H. Fahmy, A.A. Liddicoat, and M.J. Flynn, "Improving the effectiveness of floating point arithmetic," in $35^{\text {th }}$ Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, Nov. 4-7, 2001, pp. 875-879.
[114] IA-32 Intel Architecture Software Developer's Manual volume 2: Instruction Set Reference, http://developer.intel.com/design/pentium4/manuals/.
[115] Intel Pentium Family User's Manual volume 3: Architecture and Programming Manual, Intel Corporation, http://developer.intel.com/design/pentium4/manuals/.
[116] E. Antelo, M. Boo, J. D. Bruguera, and E. L. Zapata, "A novel design of a two operand normalization circuit", IEEE Transactions on VLSI Systems, vol. 6, no. 1, March 1998, pp. 173-176.
[117] V. G. Oklobdzija, "An algorithmic and novel design of a leading zero detector circuit: comparison with logic synthesis", IEEE Transactions on VLSI Systems, vol. 2, no. 1, pp. 124-148, March 1994.
[118] V. G. Oklobdzija, "Algorithmic design of a hierarchical and modular leading zero detector circuit," Electronic Letters, vol. 29, no. 3, pp. 283-284, Feb. 1993.

