A high-performance inner-product processor for real and complex numbers. by Wang, Guoping.
UNIVERSITY OF OKLAHOMA 
GRADUATE COLLEGE
A HIGH-PERFORMANCE INNER-PRODUCT PROCESSOR 
FOR REAL AND COMPLEX NUMBERS
A Dissertation 
SUBMITTED TO THE GRADUATE FACULTY 









Copyright by Guoping Wang 2003 
All Rights Reserved.
A HIGH-PERFORMANCE INNER-PRODUCT PROCESSOR 
FOR REAL AND COMPLEX NUMBERS
A Dissertation APPROVED FOR THE 
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING
BY





To my family 
who have lovingly supported my years of study.
ACKNOWLEDGMENTS
In the years leading to this dissertation, I have had the great pleasure of 
working for and with an advisor who provided an environment conducive to 
learning. I would like to take this opportunity to thank my advisor, Dr. Monte P. 
Tull, for his consistent guidance.
I would also like to thank Dr. Gerald Crain, Dr. Linda DeBrunner, Dr. Joe 
Havlicek and Dr. Murad Ozaydin for serving on my supervisory committee.
I appreciate many others who have been helpful at the University of 
Oklahoma. They make my years at the university a unique and memorable 
experience.
I am especially indebted to my family for their support. Guanglan Zhang, 
my wife, has been invaluable for her patience and love during my research. My 
daughter, Ying, and my son, Christopher, always give me great joy and peace.
“This is what the Lord says ...
‘Call to me and I will answer you and 
tell you great and unsearchable things 
you do not know.’ ”
Jeremiah 33:2-3
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
1.1 Inner-Product Implementation by the General Purpose Processors
1.2 Inner-Product Implementation by Digital Signal Processing Processors
1.3 Other Inner-Product Processor Implementation Methods
1.4 Multiplier Implementation Review
1.5 The Redundant Binary Number System
1.6 The Conversion of 2’s-Complement to Redundant Binary
CHAPTER 2 INNER-PRODUCT PROCESSOR OF REAL, COMPLEX AND REDUNDANT BINARY NUMBERS
2.1 Real Number Inner-Product Computation
2.1.1 Inline Partial Product Redundant Binary Inner-Product
2.1.2 Cross Partial Product Redundant Binary Inner-Product
2.1.3 Booth Encoding Methods
2.1.4 Implementation Comparison of Inline and Cross Inner-Product Methods
2.2 Complex Number Inner-Product Computation
2.2.1 Review of Complex Number Arithmetic
2.2.2 Comparison of Different Complex Radices
2.2.3 Complex Number Multiplier and Inner-Product Computation
2.3 Inner-Product Computation Comparison
2.4 Implementation of Unified Signed/Unsigned Multiplier
2.4.1 Unified Signed/Unsigned Multiplier Without Booth Coding
2.4.2 Unified Signed/Unsigned Multiplier With Booth Coding
2.5 The Implementation of a Unified Signed/Unsigned Inner-Product Processor for AB ± CD
2.6 The Implementation of a Redundant Binary Multiplier
2.6.1 Direct Implementation of Redundant Binary Multiplier
2.6.2 Redundant Binary Multiplier Implementation Using Inner-Product Processor
2.7 Redundant Binary Inner-Product Computation
CHAPTER 3 IMPLEMENTATIONS OF DIVISION METHOD
3.1 Division Algorithm Review
3.2 Further Studies of the Goldschmidt and Newton-Raphson Methods
3.2.1 Comparison of the Goldschmidt and Newton-Raphson Methods
3.2.2 Further Discussion of the Goldschmidt Method
3.2.3 Implementation of the Goldschmidt Division
3.3 Real Number Division Implementation
3.4 Comparison of the Implementations of Division
3.5 Complex Number Division Implementation
CHAPTER 4 COMPUTATIONAL EXTENSIONS
4.1 Real-Number Computational Extensions
4.1.1 8-Element Real Number Inner-Product Computation
4.1.2 Dual 4-Element Real Number Inner-Product
4.1.3 Quad 2-Element Real Inner-Product Using Four Redundant Binary Accumulators
4.1.4 Eight Parallel Multipliers Using 8 Redundant Binary Accumulators
4.2 Complex-Number Computational Extensions
4.2.1 Single 2-Element Complex Number Inner-Product Computation Using One Real/Imaginary Redundant Binary Accumulator
4.2.2 Dual Single-element Complex Number Inner-Product Computation Using Four Redundant Binary Accumulators
4.2.3 Two Parallel Complex Number Multipliers
4.3 Redundant Binary Number Computational Extensions
4.3.1 Single Element Redundant Binary Number Inner-Product Computation
4.3.2 Dual 2-Element RB Inner-Product plus Two Redundant Binary Accumulators
4.3.3 Four Parallel Redundant Binary Multipliers Using Four Redundant Binary Accumulators
4.4 Pipeline Extensions
CHAPTER 5 REDUNDANT BINARY TO 2’S-COMPLEMENT NUMBER CONVERSION
5.1 An Improved Redundant Binary to 2’s-Complement Converter
5.2 Comparison Result
CHAPTER 6 SUMMARY AND CONCLUSIONS




LIST OF FIGURES
Figure 1-1. Flowchart of Inner-Product Computation by the General Purpose Processors
Figure 1-2. Inner-Product Implementation in Pentium MMX Processor
Figure 1-3. Sample Code for an Inner-Product by Pentium MMX
Figure 1-4. Sample Code of the Fixed-Point Inner-Product by TMS 320C60
Figure 1-5. Kazakova’s Inner-Product Processor Architecture
Figure 1-6. Lin’s Reconfigurable Inner-Product Processor
Figure 1-7. A RB MAC (Multiply and Accumulate) by Huang
Figure 1-8. Baik’s Redundant Binary Filter Implementation
Figure 1-9. Design of a 5 × 5 Array Multiplier
Figure 1-10. 4:2 Counter Using 3:2 Counter
Figure 1-11. 3:2 Counter Based 4 × 4 Multiplier
Figure 1-12. 4:2 Counter Based 4 × 4 Multiplier
Figure 1-13. Mapping from the Sum of Two 2’s-Complement Numbers to a RB Number
Figure 1-14. Mapping from the Subtraction of Two 2’s-Complement Numbers to a RB Number
Figure 2-1. Inline Partial Product RB Implementation of + AjB^...................... 24
Figure 2-2. Inline Partial Product Structure of A^B  ̂+ AjB^..........................................25
Figure 2-3. Overall Structure of the Redundant Binary Inner-Product
Figure 2-4. Cross Partial Product Structure of A^B  ̂+ A jB j ..........................................29
Figure 2-5. Basic Diagram of Complex Number Multiplication
Figure 2-6. Blahut’s Complex Number Multiplier
Figure 2-7. An Example of Addition in Radix-(2j)
Figure 2-8. An Example of Subtraction in Radix-(2j)
Figure 2-9. An Example of Multiplication in Radix-(2j)
Figure 2-10. An Example for Radix-(j-1) Carry Propagation
Figure 2-11. Inline Implementation of A„B„ -A jB j ........................................................... 52
Figure 2-12. The Real Part of the Complex Number Inner-Product
Figure 2-13. The Imaginary Part of Complex-Number Inner-Product
Figure 2-14. Unified RB IP Processor for AB ± CD
Figure 2-15. An Example Code of Fixed-Point Inner-Product
Figure 2-16. Dependency Graph of Fixed-Point Inner-Product
Figure 2-17. RB Inner-Product Implementation
Figure 2-18. Unsigned Multiplier with Partial Product Generation
Figure 2-19. Mapping of and PPyy-/ for Signed Multiplier into a RB D ig it.... 63
Figure 2-20. Mapping of and , for Signed Multiplier into a RB Digit... 64
Figure 2-21. Circuit Realization of the Last Partial Product PPy .̂, for
Signed/Unsigned M ultiplier...........................................................................................64
Figure 2-22. First Partial Product PQO for Unsigned Multiplier.................................. 65
Figure 2-23. Circuit of the First Partial Product PPg
for Signed/Unsigned M ultiplier....................................................................................67
Figure 2-24. Circuit of the Partial Products from to PP]̂ _2 for Signed/Unsigned
Multiplier........................................................................................................................... 67
Figure 2-25. A Unified Sign/Unsigned Multiplier
Figure 2-26. Unified Signed/Unsigned IP Processor for AB ± CD
Figure 2-27. A RB Multiplier Diagram
Figure 2-28. An Example of RB Multiplication
Figure 2-29. Implementation of RB Multiplier
Figure 2-30. IP Implementation for RB Number AB + X A ............................................75
Figure 3-1. Goldschmidt Divisor Implementation
Figure 3-2. Newton-Raphson Divider Implementation
Figure 3-3. Implementation of 2 - A r b ...................................................................................85
Figure 3-4. First Iteration Implementation of the Goldschmidt Division
Figure 3-5. Implementation of Successive Iteration Computation for Z and D
Figure 3-6. Overall Structure of Divider Using RB IP Processor
Figure 3-7. Complex-Number Division Implementation Initial Process
Figure 4-1. An Example of a Redundant Number Adder Tree
Figure 4-2. Dual 4-Element Real Number Inner-Product
Figure 4-3. Quad 2-Element Real Number Inner-Product
Figure 4-4. Eight Parallel Multipliers Using 8 RB Accumulators
Figure 4-5. Single 2-Element Complex Number IP Using One Real/Imaginary RB Accumulator
Figure 4-6. Dual 2-Element Complex Number Inner-Products Using Four RB Accumulators
Figure 4-7. Two Parallel Complex Number Multipliers
Figure 4-8. 4-Element Redundant Binary Inner-Product
Figure 4-9. Dual 2-Element RB Inner-Product
Figure 4-10. Four Parallel RB Multipliers Using 4 RB Accumulators
Figure 4-11. 8-Word 8-Bit RB IP Processor
Figure 4-12. Two-Stage Pipelined RB IP Processor
Figure 4-13. Three-Stage Pipelined RB IP Processor
Figure 5-1. Four-Bit Carry-Lookahead RB NB Converter
Figure 5-2. Diagram of a 4 Bit Carry-Lookahead RB NB Carry Generator
Figure 5-3. Two-Level 16-bit RB NB Converter
LIST OF TABLES
Table 1-A. Computation Rules for the First Step in Carry-Propagation-Free Addition for RB Numbers
Table 1-B. Coding Table for Binary Signed Digits
Table 1-C. Logic Functions for RBFA and RBHA
Table 2-A. Modified Booth Encoding Table
Table 2-B. Booth Correction Factors for the Inline Multiplication Method
Table 2-C. Booth Correction Factors for the Cross Partial Product Method
Table 2-D. 16-Bit FPGA Implementations of A„Bi, + AjBj
Without Booth Encoding................................................................................................ 35
Table 2-E. 16-Bit FPGA Implementations of + A,B, With Booth Encoding... 35
Table 2-F. Truth Table for Radix-(-1 + j) One-Bit Addition
Table 2-G. Booth Correction Factors for Redundant Binary Partial Product
Generation of AgBg - A , B j .............................................................................................. 55
Table 2-H. Comparison of IP Computation between TMS320C62X and RB Inner-Product Processor
Table 2-1. Partial Product PQO„ to PQOĵ _̂  for Unsigned Multiplier..............................65
Table 2-J. Partial Product for Unsigned Multiplier...........................................65
Table 2-K. Partial Product for Unsigned Multiplier...........................................66
Table 2-L. Partial Product PQO^ and PQOj^ ĵ for Unsigned Multiplier...................... 66
Table 2-M. RB Partial Product Generation
Table 2-N. Encoded RB Partial Product Generation
Table 4-A. Time Delay Model of RB Multiplier
Table 5-A. Conversion Rules in Stage / .............................................................................103
Table 5-B. Conversion Truth Table for RB NB
ABSTRACT OF THE DISSERTATION
A High-Performance Inner-Product Processor 
for Real and Complex Numbers
by
Guoping Wang
Doctor of Philosophy in Electrical and Computer Engineering 
University of Oklahoma, Norman, OK, 2003 
Dr. Monte P. Tull, Chair
A novel, high-performance fixed-point inner-product processor based on a 
redundant binary number system is investigated in this dissertation. This scheme 
decreases the number of partial products to 50%, while achieving better speed and 
area performance, as well as providing pipeline extension opportunities. When 
modified Booth coding is used, partial products are reduced by almost 75%, thereby 
significantly reducing the multiplier addition depth. The design is applicable for 
digital signal and image processing applications that require real and/or complex 
number inner-product arithmetic, such as digital filters, correlation and convolution. 
This design is well suited for VLSI implementation and can also be embedded as an 
inner-product core inside a general purpose or DSP FPGA-based processor. Dynamic 
control of the computing structure permits different computations, such as a variety of 
inner-product real and complex number computations, parallel multiplication for real 
and complex numbers, and real and complex number division. The same structure can 
also be controlled to accept redundant binary number inputs for multiplication and 
inner-product computations. An improved redundant binary to 2’s-complement 
converter is also presented.
Chapter 1 Introduction
Consider the definition of the inner-product. For two 
vectors $A = (A_0, A_1, \ldots, A_{M-1})$ and $B = (B_0, B_1, \ldots, B_{M-1})$, the inner-product of
$A$ and $B$ is defined as:
$$\langle A, B \rangle = A (B^*)^T = \sum_{i=0}^{M-1} A_i B_i^* \tag{1.1}$$
In general, $A_i$ and $B_i$ may be real or complex numbers. $A (B^*)^T$ denotes matrix
multiplication with the row vectors $A$ and $B$ considered as $1 \times M$ matrices, and $(B^*)^T$ 
denotes the conjugate transpose of B. In the traditional method, all of the multiplications 
are processed independently of one another, thereby requiring M  multiplications and M-1 
additions. To obtain high-performance circuit implementations of the inner-product, 
several salient features of Equation (1.1) can be utilized; namely, carry-free addition, 
high-speed multiplication, and parallel or pipelined multiplication and addition.
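As a concrete reading of Equation (1.1), the following minimal Python sketch evaluates the inner-product in the traditional sequential way and reports the operation counts mentioned above. The function name `inner_product` and the returned tuple are illustrative assumptions and are not part of the processor design.

```python
def inner_product(a, b):
    """Evaluate <A, B> = sum_i A_i * conj(B_i), as in Equation (1.1).

    Returns (result, multiplications, additions) for the traditional
    method, in which every product is formed independently.
    """
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    # M independent multiplications
    products = [complex(x) * complex(y).conjugate() for x, y in zip(a, b)]
    total = products[0]
    for p in products[1:]:
        total += p  # M - 1 additions in total
    return total, len(products), len(products) - 1


if __name__ == "__main__":
    print(inner_product([1, 2, 3], [4, 5, 6]))
    print(inner_product([1 + 2j], [3 + 4j]))
```

For real inputs the conjugate has no effect, so the same routine covers both the real and the complex case.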
The application of redundant binary (RB) numbers was previously investigated 
for carry-free addition and fast multiplication. These techniques have proven to be easily 
laid out in VLSI and result in high-speed circuit implementations [l]-[4]. In this 
dissertation, high-performance and easily pipelined implementations of an inner-product 
processor are presented. The designs utilize RB numbers for achieving the carry-free 
addition of partial products. Redundant binary schemes are less viable in applications that 
require persistent conversion back to 2’s-complement [5]-[7], since this process is 
relatively slow due to an unavoidable carry propagation requirement. The overall 
motivation for this work is the design of a high-performance Complex Arithmetic Signal 
Processor (CASP) capable of offering novel extended inner-product operations. The
CASP design relies on the high-speed multiplication afforded by redundant binary 
techniques, while avoiding the relatively slow conversion back to 2’s-complement 
numbers until a final 2’s-complement result is necessary. Inherently, the CASP device 
provides intermediate register storage for redundant binary, as well as 2’s-complement 
numbers. The methods for implementing the core inner-product structure and general 
extensions are presented in this dissertation.
Inner-product computations play a central role in digital signal processing, most 
often in digital filters, signal correlation, convolution, FFT, etc. Current implementations 
of inner-product computations include the following methods: 1) general purpose 
processors, 2) digital signal processor devices, such as Texas Instruments TMS320C60, 
3) VLSI devices, such as FPGAs or ASICs. Various researchers have investigated the 
implementation of inner-product processors. Implementations include array multipliers 
[8],[9], VLSI Residue Number System architecture [10], serial implementations 
[11],[12], distributed arithmetic [13],[14], carry-save addition [15]-[19], specific DSP 
processor and FPGA [20]-[27], redundant binary implementations [28]-[30].
Complex number arithmetic computation is a key arithmetic feature required in 
modern digital communication and optical systems [31]-[38]. Many algorithms based on 
convolution, correlation, and complex number filters require complex number 
multiplication and high-speed inner-product computation. These applications require 
efficient representation and manipulation of complex numbers together with real 
numbers. Considerable research exists for hardware implementations of complex number 
systems [39]-[54] and representations of complex numbers in different radices [55]-[68].
The redundant binary (RB) representation is one of the signed-digit number 
representations originally introduced by Avizienis [69] for achieving 
carry-propagation-free addition. RB numbers differ from the conventional 2’s-complement 
representation in that the individual digits comprising a number may have negative values 
as well as positive values. High-speed VLSI multiplication algorithms, which are based 
on redundant binary numbers, are proposed in [1],[3],[4]. Since integer numbers in most 
digital systems are represented in 2’s-complement form, a converter is needed to convert 
a redundant binary number to a 2’s-complement number in the last step. Different 
implementations of this converter have been proposed in [5], [7], [12], [70]-[74].
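To make the signed-digit idea concrete, the sketch below models an RB number as a list of digits drawn from {-1, 0, 1} and converts it to an ordinary integer. The function names and the most-significant-digit-first ordering are assumptions made for illustration. The conversion shown, which subtracts the negative-digit word from the positive-digit word, is a common textbook approach and is not claimed to be any of the converters cited above.

```python
def rb_value(digits):
    """Value of a redundant binary number, most significant digit first.

    Each digit must be -1, 0, or 1.
    """
    value = 0
    for d in digits:
        if d not in (-1, 0, 1):
            raise ValueError("RB digits must be -1, 0, or 1")
        value = 2 * value + d
    return value


def rb_to_binary(digits):
    """Convert RB to an ordinary integer by splitting the digits into a
    positive word and a negative word and subtracting them.

    The subtraction is the step that needs carry (borrow) propagation.
    """
    plus = 0
    minus = 0
    for d in digits:
        plus = 2 * plus + (1 if d == 1 else 0)
        minus = 2 * minus + (1 if d == -1 else 0)
    return plus - minus


if __name__ == "__main__":
    # Two different digit strings for the same value show the redundancy.
    print(rb_value([1, 0, -1]), rb_value([0, 1, 1]))
    print(rb_to_binary([-1, 0, 1]))
```

Because several digit strings map to one value, additions can be arranged so that no carry ripples across the word; the cost is deferred to this final conversion.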
Although division is an infrequent operation, it has been shown [75] that ignoring 
its hardware implementation can result in significant system performance degradation for 
many applications. Extensive literature describes the theory of division [75]-[90]. 
Division algorithms can be generally divided into the following classes: digit recurrence, 
functional iteration, table look-up and variable latency [84]. Choosing an optimal design 
of a divider depends heavily upon its requirements for area and speed.
In the following sections, these hardware implementations and research issues 
will be reviewed and investigated.
1.1 Inner-Product Implementation by the General Purpose 
Processors
General purpose processors, such as Intel Pentium and 80x86, Motorola 68000, 
AMD K6 and K7, etc., can perform different algorithms using combinations of various 
machine instructions. The systems built with these programmable processors are 
adaptable to different applications and easily upgradeable to changing requirements.
Even with such potential advantages, traditional programmable processors have not been 
widely used for high-speed inner-product computation because of their limited 
performance. For example, in order to find the inner-product of two vectors A  and B, the 
flowchart in Figure 1-1 is usually employed.
[Flowchart; recoverable steps: Begin; Sum <= 0; Register 1 <= A; Register 2 <= B; Register 3 <= Sum]
Figure 1-1. Flowchart of Inner-Product Computation by the General Purpose
Processors
In a general purpose (GP) processor, all these computations are sequential and 
each load, multiplication or summation requires one or more clock cycles. Traditional 
multiplication and accumulation methods are generally used. Some GP processors 
provide additional hardware features for inner-product calculations. Among these 
processors, the Pentium MMX processor contains a superscalar architecture, which 
includes: 1) enhanced pipelines and 2) two pipelined integer units capable of two instructions
per clock, as well as other features. With the new architecture, the Pentium MMX can 
compute inner-products more efficiently than other general purpose processors. The 
implementation is shown in Figure 1-2.










[Diagram; recoverable labels: MMX registers loaded from the a and b vectors (b3, b2, b1, b0); products a3*b3+a2*b2 and a1*b1+a0*b0; each added to the Σ of previous loops]
Figure 1-2. Inner-Product Implementation in Pentium MMX Processor [91]
Sample code [91] for an inner-product implementation using Pentium assembly
language is shown in Figure 1-3:
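loop:
    movq    MM0, [a_vector]   ; load four 16-bit elements of a
    movq    MM1, [b_vector]   ; load four 16-bit elements of b
    pmaddwd MM0, MM1          ; multiply element pairs, add adjacent products
    paddd   MM7, MM0          ; accumulate the two 32-bit partial sums
    add     [a_vector], 8     ; advance both pointers by 8 bytes
    add     [b_vector], 8
    sub     [count], 4        ; four elements consumed per iteration
    jnz     loop
    movq    MM0, MM7          ; fold the two partial sums into one
    psrlq   MM7, 32
    paddd   MM7, MM0
    movd    mem_vdp, MM7      ; store the 32-bit result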
Figure 1-3. Sample Code for an Inner-Product by Pentium MMX [91]
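The data flow of the loop in Figure 1-3 can be modelled in a few lines of Python. This is a behavioural sketch only: the helper names `to_int32`, `pmaddwd` and `mmx_dot` are hypothetical, the elements are assumed to be signed 16-bit values, and the accumulator is assumed to wrap at 32 bits as packed doubleword arithmetic does.

```python
def to_int32(x):
    """Wrap an integer to a signed 32-bit value, as a packed doubleword lane would."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def pmaddwd(a4, b4):
    """Model of PMADDWD on four signed 16-bit lanes (lane 0 first).

    Returns [a0*b0 + a1*b1, a2*b2 + a3*b3] as two 32-bit sums.
    """
    return [to_int32(a4[0] * b4[0] + a4[1] * b4[1]),
            to_int32(a4[2] * b4[2] + a4[3] * b4[3])]


def mmx_dot(a, b):
    """Dot product of 16-bit vectors whose length is a multiple of 4."""
    if len(a) != len(b) or len(a) % 4:
        raise ValueError("lengths must match and be a multiple of 4")
    acc = [0, 0]                                  # plays the role of MM7
    for i in range(0, len(a), 4):                 # one iteration per 4 elements
        lo, hi = pmaddwd(a[i:i + 4], b[i:i + 4])
        acc = [to_int32(acc[0] + lo), to_int32(acc[1] + hi)]  # PADDD
    return to_int32(acc[0] + acc[1])              # final fold after the loop


if __name__ == "__main__":
    print(mmx_dot([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8))
```

Each loop iteration performs four multiplications and two additions in one multiply-add step, which is the source of the efficiency gain over a purely sequential loop.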
1.2 Inner-Product Implementation by Digital Signal 
Processing Processors
DSP processors are specifically designed for DSP applications. One typical DSP 
processor is the Texas Instruments TMS320C60. It is a highly integrated, multiprocessor, 
single chip device specifically designed for DSP applications. The TMS320C60 
integrates the following components onto a single device [92]:
1. a single 32-bit RISC master processor (MP) with an integral IEEE-754 
floating point unit
2. four 32-bit integer DSP parallel processors (PP)
3. a sophisticated direct memory access (DMA) transfer controller (TC)
4. a video controller (VC)
5. 50K bytes of on-chip SRAM memory
The five processors on the TMS320C60, i.e. the MP and four PPs, can be 
configured for a variety of multiple-instruction, multiple-data, multiple-instruction, 
single-data, or single-instruction, multiple-data modes. The PPs, similar to most DSPs, 
perform all operations, except division, in a single cycle. For example, it can perform the 
parallel operations, A*B =>C and A+I => A in one clock cycle, while in a general 
purpose processor, at least two cycles are required. Sample code of the fixed-point 
inner-product computation is shown in Figure 1-4 [92].
        ZERO  .L1  A7            ; clear the accumulator
LOOP:   LDH   .D1  *A4++,A2      ; load a_i from memory
        LDH   .D1  *A3++,A5      ; load b_i from memory
        MPY   .M1  A2,A5,A6      ; a_i * b_i
        ADD   .L1  A6,A7,A7      ; sum += (a_i * b_i)
        SUB   .S1  A1,1,A1       ; decrement loop counter
 [A1]   B     .S2  LOOP          ; branch to LOOP while A1 is nonzero
Figure 1-4. Sample Code of the Fixed-Point Inner-Product by TMS 320C60 [92]
While DSP processors allow flexibility, for some applications that require high-speed 
inner-product computation, FPGAs or ASICs can provide higher performance 
options.
1.3 Other Inner-Product Processor Implementation Methods
Besides the inner-product implementations on general purpose processors and 
DSP processors, other arithmetic schemes and implementations of inner-product processors have 
been investigated. Ahmad and Poornaiah [8] proposed an inner-product implementation 
using array multipliers. Although array multipliers provide a convenient layout for 
VLSI, this method may not be a good option for high-performance inner-product 
computation because of its high latency. Fahmi et al. [11] and Haynal and 
Parhami [12] investigated serial implementations of an inner-product processor. These 
designs result in a small area but have a high latency. Inner-product implementations based 
on distributed arithmetic are proposed by Burleson et al. [13] and Vega et al. [14]. 
Various inner-product implementations using carry-save adders are investigated by many 
researchers [15]-[19]. Application specific inner-product processors are studied in [21]-[25] 
and redundant binary implementations are proposed in [28],[29]. Because this research 
is focused upon the high-performance implementation of an inner-product 
processor, only implementations of high-performance inner-product processors will be 
reviewed and compared.
With a carry-save adder structure, Kazakova [15] investigated a fast and low- 
power three-dimensional inner-product processor. This processor consists of Booth 
encoders, a Wallace reduction tree, and a final two-operand adder. Its structure is shown 
in Figure 1-5.













[Block diagram; recoverable labels: Wallace Tree, Two-Operand Adder, Dot Product]
Figure 1-5. Kazakova’s Inner-Product Processor Architecture [15]
A novel approach to a high-performance inner-product processor, which is 
dynamically reconfigurable, was proposed by Lin [24],[25]. This processor mainly 
consists of an 8 × 8 or 4 × 4 array of small multipliers plus two or three arrays of adders. 
It requires very simple reconfigurable components. The entire summation network can be 
reconfigured dynamically for the desired computation by using a few control bits. 
The design is regular and modular, and it can 
easily be pipelined. The diagram is shown in Figure 1-6.
Because an array multiplier has a higher latency than designs based on 
carry-save addition or redundant binary representation, this 
inner-product processor also has a high latency.
[Block diagram; recoverable labels: four 4 × 4 multiplier arrays, reconfigurable switch, final adder]
Figure 1-6. Lin’s Reconfigurable Inner-Product Processor










[Block diagram; recoverable labels: 4 RB partial product generations, two pipeline registers, RB to 2’s-complement converter, output]
Figure 1-7. A RB MAC (Multiply and Accumulate) by Huang [2]
Based upon redundant binary numbers and Booth encoding, Huang [2]
proposed a high-performance, two-stage pipelined MAC (Multiply and Accumulate) unit,
which is shown in Figure 1-7. Later, Sacristan [29] further developed this structure as a 
reusable inner-product unit for multipliers with different word lengths.
Baik et al. [28] proposed a redundant binary implementation of an FIR filter. The 
implementation is shown in Figure 1-8.




Figure 1-8. Baik’s Redundant Binary Filter Implementation [28]
1.4 Multiplier Implementation Review
Multiplication is the key operation in the implementation of inner-product 
computation. Three popular implementations for multipliers are an array multiplier 
[85],[93], a multiplier using a Wallace tree [94] and a multiplier using redundant binary 
number representation [1],[4]. An array multiplier has good repeatability of unit cells and 
is very regular in its structure. It uses only short wires that connect one full adder to 
horizontally, vertically, or diagonally adjacent full adders. Thus, it results in a very 
simple and efficient layout in VLSI implementation. However, the N-bit multiplication 
time is linearly proportional to N. This method requires a long computation time for 










Figure 1-9. Design of a 5 × 5 Array Multiplier [85],[93]
The Wallace-tree method is commonly used to realize high-speed multiplication.
The basic cell in Wallace-tree multiplication is a 3-to-2 or 4-to-2 CSA (Carry Save Adder),
also called a 3:2 or 4:2 counter. A 3:2 counter can be realized by a full adder, which
reduces three numbers to two numbers, while a 4:2 counter can be realized by two 3:2
counters, as shown in Figure 1-10 [85]. Figure 1-11 and Figure 1-12 show 4 × 4 multipliers
using 3:2 counters and 4:2 counters, respectively.
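The counter-based reduction described above can be illustrated with a small behavioral sketch. It is not the circuit of [85]: the function names are hypothetical, operands are assumed unsigned, and the final Python addition merely stands in for the carry-lookahead adder.

```python
def csa_3to2(x, y, z):
    """3:2 counter applied bitwise: three operands in, (sum, carry) words out."""
    s = x ^ y ^ z                              # per-bit sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit majority, weighted by 2
    return s, c


def csa_4to2(w, x, y, z):
    """4:2 counter modelled as two cascaded 3:2 counters."""
    s1, c1 = csa_3to2(w, x, y)
    return csa_3to2(s1, c1, z)


def multiply_4x4(a, b):
    """Unsigned 4 x 4 multiply: reduce four partial products, then one final add."""
    partial_products = [(a << i) if (b >> i) & 1 else 0 for i in range(4)]
    s, c = csa_4to2(*partial_products)
    return s + c                               # stands in for the carry-lookahead adder
```

Only the last addition propagates carries across the word; every earlier stage works on each bit position independently, which is the property the Wallace tree exploits.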
Figure 1-11. 3:2 Counter Based 4 × 4 Multiplier [85] (block labels: 3:2 CSA stages, Carry Lookahead Adder, Product)
Figure 1-12. 4:2 Counter Based 4 × 4 Multiplier [85] (block labels: Binary Partial Products, three 4:2 Counters, Carry Lookahead Adder, Product)
The traditional Wallace-tree method uses a 3:2 counter. This scheme results in
complicated interconnections between the three-input/two-output counters, which makes the
VLSI layout difficult and inefficient. The extended layout process increases the design
complexity. As the multipliers increase in bit length, the interconnection becomes
exponentially complicated. To solve this problem with conventional Wallace trees, the
following two methods have been proposed. One method is to use 4:2 counters [94]
instead of 3:2 counters [94]. The use of 4:2 counters simplifies the interconnection
drastically because the partial products are added using a binary tree. Another method is
to use redundant binary representation for the partial products [1],[3],[4]. The use of
RB simplifies the interconnection because the RB partial products can be summed using
an RB adder tree. The $N$-bit multiplication time of RB multipliers and Wallace-tree
multipliers is proportional to $\log_2 N$. The physical layout of a RB multiplier has good
repeatability. The RB multiplier does not require any optional sign bits for adding partial
products. Makino’s research [3] indicates that a 54×54-bit multiplier using redundant
binary number representation is faster than the conventional 4:2 counter-based multiplier
and has lower power dissipation. A power dissipation of 540 mW is estimated for the
54×54 RB multiplier operating at 100 MHz. These figures correspond to more than 12% higher
speed and 38% lower power than the conventional CSA multipliers.
Using redundant binary representation in our research results in an easily
controlled/reconfigurable high-performance computing structure capable of handling
various computations for both real and complex numbers.
1.5 The Redundant Binary Number System
Redundant binary (RB) representation is one of the signed-digit (SD) number
systems originally introduced by Avizienis [69], which provides carry-propagation-free
(CPF) addition. In a signed-digit system, the individual digits have negative as well as
positive values. Given a radix-($r$) signed-digit number, each digit of the signed-digit
number can take one of the following $2a + 1$ values:
$$\{-a, \ldots, -1, 0, 1, \ldots, a\} \qquad (1.2)$$
where the magnitude of the positive integer $a$ must be within the following interval:
$$\left\lceil \frac{r-1}{2} \right\rceil \le a \le r-1 \qquad (1.3)$$
The radix-(2) signed-digit system (Redundant Binary (RB) representation) uses
the digit set {-1, 0, 1} to represent numbers. The SD number system is also called
redundant because a given integer may have more than one representation. For
example, the integer $(7)_{10}$ can be represented in radix-(2) in several ways, e.g., $[0\ 1\ 1\ 1]_{RB}$,
$[1\ 0\ 0\ {-1}]_{RB}$, or $[1\ {-1}\ 1\ 1]_{RB}$. Based on the SD redundancy property, addition rules can be
devised so that carry propagation is limited to only one digit position, thereby eliminating
the possibility of a carry from the LSD (Least-Significant-Digit) to the MSD
(Most-Significant-Digit). In a RB adder circuit implementation, the addition time is fixed and
does not depend on the word length. Also, no explicit mechanism to handle the overall
sign of a signed-digit number is required since it is determined by the most significant
non-zero digit. Since the multiplication of two numbers is generally performed by the
addition of partial products, the carry-propagation-free (CPF) feature of the RB
arithmetic can be used to design high-speed multipliers [1],[3],[4] and
multiply-and-accumulate (MAC) units [2].
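The redundancy of the digit set can be checked by brute force. The sketch below is illustrative only (the function names are hypothetical); it enumerates every fixed-width digit string over {-1, 0, 1} and keeps those that evaluate to a given integer.

```python
from itertools import product


def rb_value(digits):
    """Value of an RB digit sequence written most-significant digit first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def rb_representations(value, width):
    """All width-digit RB strings over {-1, 0, 1} that evaluate to value."""
    return [ds for ds in product((-1, 0, 1), repeat=width)
            if rb_value(ds) == value]
```

Calling `rb_representations(7, 4)` returns, among others, the three forms quoted in the text.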
The algorithmic rules for RB addition are defined by Takagi et al. [4].
Basically, two steps are required. In the first step, the intermediate carry-out, $c_i \in \{-1, 0, 1\}$,
and the intermediate sum digit, $\sigma_i \in \{-1, 0, 1\}$, are generated at each position and
satisfy the equation:
$$\alpha_i + \beta_i = 2c_i + \sigma_i \qquad (1.4)$$
where $\alpha_i$ and $\beta_i$ are the RB augend and addend digits, respectively. Note that for
increased speed, the circuit implementation may utilize the next-lower-order digits,
$\alpha_{i-1}$ and $\beta_{i-1}$, to determine the carry-out from that digit position. Table 1-A describes
these rules of step 1 in detail. In the second step, the final sum digit, $\zeta_i$, is obtained at
each position by adding the intermediate sum digit, $\sigma_i$, and the intermediate carry, $c_{i-1}$,
from the next-lower-order position, without generating a carry. That is,
$$\zeta_i = \sigma_i + c_{i-1} \qquad (1.5)$$
Table 1-A. Computation Rules for the First Step in Carry-Propagation-Free Addition for RB Numbers [1]
Type  Augend digit  Addend digit  Digits at the next-lower-order      Intermediate   Intermediate
      ($\alpha_i$)  ($\beta_i$)   position ($\alpha_{i-1}$, $\beta_{i-1}$)  carry ($c_i$)  sum digit ($\sigma_i$)
<1>   1             1             -                                   1              0
<2>   1 (or 0)      0 (or 1)      Both are negative                   0              1
                                  All other cases                     1              -1




<5>   0 (or -1)     -1 (or 0)     Both are negative                   -1             1
                                  All other cases                     0              -1
<6>   -1            -1            -                                   -1             0
In general, throughout this dissertation, RB numbers are expressed using Greek 
symbols.
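The two-step addition of Equations (1.4) and (1.5) can be sketched behaviorally as follows. This is an illustration, not the adder circuit of [4]: the function names are hypothetical, digits are stored least-significant first, and the lower-order test is coded in the commonly stated form (the alternative carry/sum pair is chosen when at least one of the two lower-order digits is negative). That reading of the condition is an assumption and should be checked against Table 1-A and [4].

```python
def rb_value_lsd(digits):
    """Value of an RB digit list stored least-significant digit first."""
    return sum(d * 2 ** i for i, d in enumerate(digits))


def rb_add(alpha, beta):
    """Carry-propagation-free addition of two equal-length RB digit lists."""
    n = len(alpha)
    carry = [0] * (n + 1)   # carry[i + 1] is the carry out of position i
    inter = [0] * (n + 1)   # intermediate sum digits
    for i in range(n):
        p = alpha[i] + beta[i]
        # Assumption: test whether either lower-order digit is negative.
        lower_neg = i > 0 and (alpha[i - 1] < 0 or beta[i - 1] < 0)
        if p == 2:
            c, s = 1, 0
        elif p == 1:
            c, s = (0, 1) if lower_neg else (1, -1)
        elif p == 0:
            c, s = 0, 0
        elif p == -1:
            c, s = (-1, 1) if lower_neg else (0, -1)
        else:
            c, s = -1, 0
        carry[i + 1], inter[i] = c, s
    # Step 2: each final digit depends only on positions i and i - 1.
    return [inter[i] + carry[i] for i in range(n + 1)]
```

Because step 1 looks only one position down, every final digit stays inside {-1, 0, 1} without a second carry, which is why the addition time does not grow with the word length.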
1.6 The Conversion of 2’s-Complement to Redundant Binary
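A limited precision RB number, $\Delta$, can be derived from the addition of a pair of
$N$-bit 2’s-complement numbers $A$ and $B$ [2]:
$$(A + B)_{2C} = A - (-B)_{2C} = A - (\bar{B} + 1)$$
$$= \left(-a_{N-1}2^{N-1} + \sum_{i=0}^{N-2} a_i 2^i\right) - \left(-\bar{b}_{N-1}2^{N-1} + \sum_{i=0}^{N-2} \bar{b}_i 2^i\right) - 1$$
$$= \left(-a_{N-1} + \bar{b}_{N-1}\right)2^{N-1} + \sum_{i=0}^{N-2}\left(a_i - \bar{b}_i\right)2^i - 1$$
$$= \delta_{N-1}2^{N-1} + \sum_{i=0}^{N-2}\delta_i 2^i - 1 \qquad (1.6)$$
where $\delta_{N-1} = -a_{N-1} + \bar{b}_{N-1}$, $\delta_i = a_i - \bar{b}_i$ for $0 \le i \le N-2$, $2C$ is the 2’s-complement
operation, $\bar{B}$ is the 1’s-complement operation, $\bar{b}_i$ is the bit-complement, and $-1$ can be
considered as a $-1$ carry-in to a subsequent RB addition. For inner-product calculations,
the $-1$ correction is applied in the RB partial product adder tree.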
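The identity in Equation (1.6) can be checked numerically. The sketch below is illustrative (the helper names are hypothetical); it forms the digits $\delta_i$ from the bits of two $N$-bit 2’s-complement operands and relies on Python's arithmetic right shift to read the bits of negative integers.

```python
def bits(x, n):
    """The n two's-complement bits of x, index 0 = least significant."""
    return [(x >> i) & 1 for i in range(n)]


def sum_to_rb(a, b, n):
    """RB digits (LSD first) whose value equals (a + b) + 1, as in Equation (1.6)."""
    a_bits, b_bits = bits(a, n), bits(b, n)
    digits = [a_bits[i] - (1 - b_bits[i]) for i in range(n - 1)]    # a_i - not(b_i)
    digits.append(-a_bits[n - 1] + (1 - b_bits[n - 1]))             # MSD term
    return digits


def rb_value_lsd(digits):
    """Value of an RB digit list stored least-significant digit first."""
    return sum(d * 2 ** i for i, d in enumerate(digits))
```

Subtracting 1 from `rb_value_lsd(sum_to_rb(a, b, n))` recovers `a + b`; no carry chain is involved in producing the digits.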
The binary signed digits can be encoded into binary in several ways. In this work,
the binary signed digits {-1, 0, 0, 1} are coded as {00, 01, 10, 11}, respectively, as given
in Table 1-B. Another method encodes the redundant binary number in signed-magnitude
form [28], that is, {-1, 0, 0, 1} as {11, 00, 10, 01}. Mapping 2’s-complement (2C) numbers
to RB is less efficient with the signed-magnitude encoding. See Section 2.1.4 for further discussion.
Table 1-B. Coding Table for Binary Signed Digits






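Examining Equation (1.6), beginning with the $\delta_i$ term, the signed digits are
encoded using the relationship $\delta_i = a_i - \bar{b}_i$, where $\delta_i$ is a binary signed digit, $\delta_i \in \{-1, 0, 1\}$.
The mapping equations for $\delta_i^+$ and $\delta_i^-$ are [2],[49],[50]
$$\delta_i^+ = a_i,\quad \delta_i^- = b_i \quad \text{for } 0 \le i \le N-2 \qquad (1.7)$$
Similarly, in the Most Significant Digit (MSD) term of Equation (1.6), $\delta_{N-1}$ is encoded
with the mapping equations
$$\delta_{N-1}^+ = \bar{a}_{N-1},\quad \delta_{N-1}^- = \bar{b}_{N-1} \qquad (1.8)$$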
The structure of mapping the sum of two 2’s-complement binary numbers to a RB 
number is shown in Figure 1-13:
Figure 1-13. Mapping from the Sum of Two 2’s-Complement Numbers to a RB Number
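Note that a single 2’s-complement number, $A$, is converted directly into a RB
number with digits $\alpha_i$ using Equation (1.9):
$$\alpha_i^+ = a_i,\quad \alpha_i^- = 1 \quad (0 \le i \le N-2); \qquad \alpha_{N-1}^+ = 0,\quad \alpha_{N-1}^- = \bar{a}_{N-1} \qquad (1.9)$$
For example, a 2’s-complement number $(00000101)_{2C}$ is converted directly into a
RB number $(01\ 01\ 01\ 01\ 01\ 11\ 01\ 11)_{RB}$.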
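The worked example can be reproduced with a short sketch. It assumes the two-bit coding stated earlier ({-1, 0, 0, 1} coded as {00, 01, 10, 11}, so a code pair has value first bit + second bit - 1); the most-significant-digit rule used here is inferred from the example itself, and the function names are hypothetical.

```python
def twos_to_rb_code(word):
    """Map a 2's-complement bit string (MSB first) to two-bit RB digit codes."""
    codes = []
    for pos, ch in enumerate(word):
        a = int(ch)
        if pos == 0:
            codes.append(f"0{1 - a}")   # MSD has value -a (assumed rule: 0, not a)
        else:
            codes.append(f"{a}1")       # remaining digits have value a: (a, 1)
    return " ".join(codes)


def rb_code_value(code):
    """Value of a space-separated string of two-bit RB codes, MSD first."""
    value = 0
    for pair in code.split():
        value = 2 * value + int(pair[0]) + int(pair[1]) - 1
    return value
```

For the input `"00000101"` the first function returns `"01 01 01 01 01 11 01 11"`, and the second maps that string back to 5.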
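The subtraction of two $N$-bit 2’s-complement numbers can also be represented by
a redundant binary number:
$$E = (A - B)_{2C} = \left(-a_{N-1}2^{N-1} + \sum_{i=0}^{N-2} a_i 2^i\right) - \left(-b_{N-1}2^{N-1} + \sum_{i=0}^{N-2} b_i 2^i\right)$$
$$= \left(-a_{N-1} + b_{N-1}\right)2^{N-1} + \sum_{i=0}^{N-2}\left(a_i - b_i\right)2^i = \varepsilon_{N-1}2^{N-1} + \sum_{i=0}^{N-2}\varepsilon_i 2^i \qquad (1.10)$$
where $\varepsilon_{N-1} = -a_{N-1} + b_{N-1}$ and $\varepsilon_i = a_i - b_i$ for $0 \le i \le N-2$.
The mapping equations for the encoded $\varepsilon_i$ $(0 \le i \le N-1)$ in Equation (1.10) are:
$$\varepsilon_i^+ = a_i,\quad \varepsilon_i^- = \bar{b}_i \quad \text{for } 0 \le i \le N-2 \qquad (1.11)$$
and
$$\varepsilon_{N-1}^+ = \bar{a}_{N-1},\quad \varepsilon_{N-1}^- = b_{N-1} \qquad (1.12)$$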
The structure of mapping the subtraction of two 2’s-complement binary numbers
to a RB number is shown in Figure 1-14.
Figure 1-14. Mapping from the Subtraction of Two 2’s-Complement Numbers to a RB Number
Based on this coding for RB numbers, the logic functions of a RBFA (RB full
adder) and RBHA (RB half adder) are obtained [49] and shown in Table 1-C for the sum,
$z_i$, with inputs $x_i$ and $y_i$. Boolean variables, $g_i$ and $h_i$, are used as intermediate variables to
simplify the equations for the carry, $c_i$, and sum, $z_i$. Note that the notation in Table 1-C
is the same as in [49], but the RBHA equations found in [49] have been corrected.
Table 1-C. Logic Functions for RBFA and RBHA
RBFA RBHA
g. = (x: @ x+ ) @ (yr @ y / ) z: = (x: © x; ) © y
A, = x :x ; 4-y:y;
z,: =g,. ©c:_,
4̂- +
cT = x: + x/
Z/ - ^  (a'  ® A + (a~ ® x]^)x:x/
c: =(x,: +X.)(y7 +y+)
In this work, a novel, high-performance, fixed-point, inner-product processor
based upon a redundant binary number system will be investigated. Similar to Baik’s [28]
methods, this scheme decreases the number of partial products by 50%, while achieving
better speed and area performance and providing pipeline extension opportunities. When
modified Booth encoding is used, partial products are reduced by almost 75%, thereby
significantly reducing the multiplier addition depth. This design is well suited for VLSI
implementation, and it can also be embedded as an inner-product core inside a general
purpose DSP FPGA-based processor. This inner-product processor can be easily
reconfigured for different computations, such as real number inner-product computations,
parallel real number multipliers, complex number multipliers, complex number
inner-product processors, redundant binary multipliers, redundant binary inner-product
processors, etc. Chapter 2 proposes a fixed-point number inner-product processor.
Computational structures for both real and complex number inner-products for both
2’s-complement and unsigned integers are presented. A new division method using the IP
structure is investigated in Chapter 3. Two convergence division methods, Goldschmidt
and Newton-Raphson, are compared. Chapter 4 discusses extended computations, such as
parallel multiplications and inner-product computations for real, complex and redundant
binary numbers using the inner-product processor. In Chapter 5, an
improved redundant binary number to 2’s-complement number converter is discussed.
Chapter 6 provides a summary of contributions and future research directions for this
work. The redundant binary IP processor for real and complex numbers and the
Goldschmidt division unit using the IP processor have been implemented
using VHDL on a Xilinx FPGA. The original contributions of this research are:
• IP processor reduces the number of partial products.
• A unified signed/unsigned 2’s-complement/RB multiplier is developed 
using this IP structure.
• With the same IP structure, a novel Goldschmidt high-performance 
division circuit is developed.
• This IP structure can be used to build a multi-purpose dynamical processor 
for real, complex and redundant binary number computations.
• An improved 2’s-complement to RB converter is proposed.
Chapter 2 Inner-Product Processor of Real, Complex 
and Redundant Binary Numbers
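2.1 Real Number Inner-Product Computation
Consider an inner-product for $M$-dimensional ($M$ even), $N$-bit ($N$ even) real vectors,
$A$ and $B$, where $A = (A_0, A_1, A_2, \ldots, A_{M-1})$ and $B = (B_0, B_1, B_2, \ldots, B_{M-1})$ with
$$A_i = (a_{N-1,i}\ a_{N-2,i} \cdots a_{1,i}\ a_{0,i}), \quad B_i = (b_{N-1,i}\ b_{N-2,i} \cdots b_{1,i}\ b_{0,i}) \qquad (2.1)$$
where $A_i$ and $B_i$ are real numbers.
The real inner-product is defined as:
$$\langle A, B \rangle = (A_0\ A_1 \cdots A_{M-2}\ A_{M-1}) \bullet (B_0\ B_1 \cdots B_{M-2}\ B_{M-1}) = \sum_{i=0}^{M-1} A_i B_i \qquad (2.2)$$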
Two basic approaches exist for performing the necessary inner-product
multiplications using redundant binary arithmetic. The first method uses inline
conversion or mapping of 2’s-complement partial products into a redundant binary
number for each multiplication of $A_iB_i$. The second method combines or maps equivalent
2’s-complement partial products into a redundant binary number across the $A_iB_i$ pairs
[49]. Both approaches are considered in the following sections.
2.1.1 Inline Partial Product Redundant Binary Inner-Product
Considering the simple case of $M = 2$, $A_0B_0 + A_1B_1$, we first compute the
redundant binary products for $A_0B_0$ and $A_1B_1$, and then add the RB products together to
produce the inner-product. Redundant binary partial products are generated by mapping
even/odd pairs of 2’s-complement partial product sums. For $N$-bit numbers, the product
$AB$ is expanded in the following equations:
$$AB = \left(-a_{N-1}2^{N-1} + \sum_{i=0}^{N-2} a_i 2^i\right)\left(-b_{N-1}2^{N-1} + \sum_{i=0}^{N-2} b_i 2^i\right)$$
$$= \sum_{k=0}^{N-2}\left(-a_{N-1}b_k 2^{N-1} + \sum_{i=0}^{N-2} a_i b_k 2^i\right)2^k + \left(a_{N-1}b_{N-1}2^{N-1} - \sum_{i=0}^{N-2} a_i b_{N-1} 2^i\right)2^{N-1} \qquad (2.3)$$
Denote the 2’s-complement partial products as:
$$PP_k = \left(-a_{N-1}b_k 2^{N-1} + \sum_{i=0}^{N-2} a_i b_k 2^i\right)2^k \quad (0 \le k \le N-2), \qquad PP_{N-1} = \left(a_{N-1}b_{N-1}2^{N-1} - \sum_{i=0}^{N-2} a_i b_{N-1} 2^i\right)2^{N-1}$$
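Consider the first $N-2$ even/odd partial products, $PP_{2j}$ and $PP_{2j+1}$, where
$j = 0, 1, 2, \ldots, \frac{N-4}{2}$. To align the 2’s-complement partial products, the sign of the even
partial products is extended, and a low order zero is appended to the odd partial products:
$$PP_{2j} = \left(-a_{N-1}b_{2j}2^{N-1} + \sum_{i=0}^{N-2} a_i b_{2j} 2^i\right)2^{2j} = \left(-a_{N-1}b_{2j}2^{N} + a_{N-1}b_{2j}2^{N-1} + \sum_{i=0}^{N-2} a_i b_{2j} 2^i\right)2^{2j}$$
$$PP_{2j+1} = \left(-a_{N-1}b_{2j+1}2^{N-1} + \sum_{i=0}^{N-2} a_i b_{2j+1} 2^i\right)2^{2j+1} = \left(-a_{N-1}b_{2j+1}2^{N} + a_{N-2}b_{2j+1}2^{N-1} + \sum_{i=1}^{N-2} a_{i-1} b_{2j+1} 2^i + 0\right)2^{2j} \qquad (2.4)$$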
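Combining even/odd 2’s-complement partial product pairs according to Equation
(1.6), we have:
$$PP_{2j} + PP_{2j+1} = \left\{\left(-a_{N-1}b_{2j} + \overline{a_{N-1}b_{2j+1}}\right)2^{N} + \sum_{i=1}^{N-1}\left(a_i b_{2j} - \overline{a_{i-1}b_{2j+1}}\right)2^i + a_0 b_{2j} - 1 - 1\right\}2^{2j}$$
$$= \left(\alpha_{N,2j}2^{N} + \sum_{i=1}^{N-1}\alpha_{i,2j}2^i + \alpha_{0,2j} - 1\right)2^{2j} \qquad (2.5)$$
where
$$\alpha_{N,2j} = -a_{N-1}b_{2j} + \overline{a_{N-1}b_{2j+1}}, \quad \alpha_{i,2j} = a_i b_{2j} - \overline{a_{i-1}b_{2j+1}} \ (1 \le i \le N-1), \quad \alpha_{0,2j} = a_0 b_{2j} - 1 \qquad (2.6)$$
Encoding the redundant binary coefficients, $\alpha_{i,2j}$, using two binary bits, all but the final RB
partial product are encoded, for $j = 0, 1, \ldots, \frac{N-4}{2}$, as:
$$\alpha_{N,2j}^+ = \overline{a_{N-1}b_{2j}}, \quad \alpha_{N,2j}^- = \overline{a_{N-1}b_{2j+1}}$$
$$\alpha_{i,2j}^+ = a_i b_{2j}, \quad \alpha_{i,2j}^- = a_{i-1}b_{2j+1}, \quad 1 \le i \le N-1$$
$$\alpha_{0,2j}^+ = a_0 b_{2j}, \quad \alpha_{0,2j}^- = 0 \qquad (2.7)$$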
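Now, consider the last two 2’s-complement partial products, $PP_{N-2}$ and $PP_{N-1}$:
$$PP_{N-2} + PP_{N-1} = \left(-a_{N-1}b_{N-2}2^{N-1} + \sum_{i=0}^{N-2} a_i b_{N-2} 2^i\right)2^{N-2} + \left(a_{N-1}b_{N-1}2^{N-1} - \sum_{i=0}^{N-2} a_i b_{N-1} 2^i\right)2^{N-1}$$
$$= \left\{\left(-a_{N-1}b_{N-2} + a_{N-1}b_{N-1}\right)2^{N} + \sum_{i=1}^{N-1}\left(a_i b_{N-2} - a_{i-1}b_{N-1}\right)2^i + a_0 b_{N-2} - 0\right\}2^{N-2}$$
$$= \left(\beta_N 2^{N} + \sum_{i=1}^{N-1}\beta_i 2^i + \beta_0\right)2^{N-2} \qquad (2.8)$$
where
$$\beta_N = -a_{N-1}b_{N-2} + a_{N-1}b_{N-1}, \quad \beta_i = a_i b_{N-2} - a_{i-1}b_{N-1} \ (1 \le i \le N-1), \quad \beta_0 = a_0 b_{N-2} - 0 \qquad (2.9)$$
Encoding the redundant binary coefficients, $\beta_i$, using two binary bits, the final RB
partial product is:
$$\beta_N^+ = \overline{a_{N-1}b_{N-2}}, \quad \beta_N^- = a_{N-1}b_{N-1}$$
$$\beta_i^+ = a_i b_{N-2}, \quad \beta_i^- = \overline{a_{i-1}b_{N-1}}, \quad 1 \le i \le N-1$$
$$\beta_0^+ = a_0 b_{N-2}, \quad \beta_0^- = 1 \qquad (2.10)$$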
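The even/odd pairing can be checked numerically. The sketch below is an illustration rather than the hardware mapping: it works with signed digit values instead of the two-bit codes, the function names are hypothetical, and it assumes the digit definitions implied by Equation (1.6) (the odd partial product enters bit-complemented, with a $-1$ correction per pair, and the last pair needs no correction).

```python
def bit(x, i):
    """Bit i of the two's-complement integer x (sign bits repeat above the MSB)."""
    return (x >> i) & 1


def rb_pair_value(digits):
    """Value of an RB digit list stored least-significant digit first."""
    return sum(d * 2 ** i for i, d in enumerate(digits))


def inline_rb_product(a, b, n):
    """Rebuild a * b (n-bit signed operands, n even) from RB partial products."""
    total = 0
    for j in range((n - 2) // 2):                    # pairs PP_2j, PP_2j+1
        b0, b1 = bit(b, 2 * j), bit(b, 2 * j + 1)
        digits = [bit(a, 0) * b0 - 1]                                   # digit 0
        digits += [bit(a, i) * b0 - (1 - bit(a, i - 1) * b1)            # digits 1..n-1
                   for i in range(1, n)]
        digits.append(-bit(a, n - 1) * b0 + (1 - bit(a, n - 1) * b1))   # digit n
        total += (rb_pair_value(digits) - 1) * 4 ** j                   # -1 correction
    b0, b1 = bit(b, n - 2), bit(b, n - 1)            # last pair, no correction
    digits = [bit(a, 0) * b0]
    digits += [bit(a, i) * b0 - bit(a, i - 1) * b1 for i in range(1, n)]
    digits.append(-bit(a, n - 1) * b0 + bit(a, n - 1) * b1)
    return total + rb_pair_value(digits) * 2 ** (n - 2)
```

Each pair of binary partial products becomes one RB partial product, so an $N$-bit multiplication yields $N/2$ RB rows.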
Figure 2-1 shows the RB implementation diagram of $A_0B_0 + A_1B_1$ for 8-bit
numbers and Figure 2-2 shows the hardware implementation of $A_0B_0 + A_1B_1$ with the
inline redundant binary partial product generation (RBPPG) using 2’s-complement
even/odd partial products. If the final redundant binary adder (RBA) is bypassed, the
circuit in Figure 2-2 can also perform the separate multiplications, $A_0B_0$ and $A_1B_1$.
Figure 2-1. Inline Partial Product RB Implementation of $A_0B_0 + A_1B_1$ (diagram labels: Multiplicand, Multiplier, RBPP23, PP6, PP7)
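Defining the redundant binary number $\Xi_j = A_{2j}B_{2j} + A_{2j+1}B_{2j+1}$, the general form
of the inline multiplication inner-product is given by:
$$\langle A, B \rangle = \sum_{j=0}^{M-1} A_j B_j = \sum_{j=0}^{\frac{M}{2}-1}\left(A_{2j}B_{2j} + A_{2j+1}B_{2j+1}\right) = \sum_{j=0}^{\frac{M}{2}-1}\Xi_j \qquad (2.11)$$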
Figure 2-2. Inline Partial Product Structure of $A_0B_0 + A_1B_1$ (block labels: Binary Number Partial Product Generator for $A_0B_0$, Binary Number Partial Product Generator for $A_1B_1$, RBPPG, $PP_0 \ldots PP_{N-1}$, RB Sum of $A_0B_0 + A_1B_1$)
To realize the inner-product, all of the redundant binary numbers, $\Xi_j$ (the pairwise sums
$A_{2j}B_{2j} + A_{2j+1}B_{2j+1}$), are added using a redundant binary adder tree, and the final sum
of the redundant binary numbers can be converted into a 2’s-complement number using a
RB-NB converter [5],[7],[12]. Figure 2-3 depicts the overall architecture of the RB
inner-product circuit showing the RB adder tree, a RB accumulator, a
RB-to-2’s-complement converter, and additional data paths that




Figure 2-3. Overall Structure of the Redundant Binary Inner-Product (block labels: From RB, RB Adder Tree, RB to 2’s-Complement Converter, To RB Register or Memory, To 2’s-Complement Register or Memory)
2.1.2 Cross Partial Product Redundant Binary Inner-Product
An alternative method for mapping 2’s-complement partial products to redundant
binary partial products is to combine like partial products across the $A_{2j}B_{2j}$ and
$A_{2j+1}B_{2j+1}$ pairs. The method derived here is similar to that provided by Shin and Jeon [50] for
complex number multiplication. Consider the simple case of $M = 2$, $A_0B_0 + A_1B_1$, and
expand it as:
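$$A_0B_0 + A_1B_1 = \left(-a_{N-1,0}2^{N-1} + \sum_{i=0}^{N-2} a_{i,0}2^i\right)\left(-b_{N-1,0}2^{N-1} + \sum_{i=0}^{N-2} b_{i,0}2^i\right) + \left(-a_{N-1,1}2^{N-1} + \sum_{i=0}^{N-2} a_{i,1}2^i\right)\left(-b_{N-1,1}2^{N-1} + \sum_{i=0}^{N-2} b_{i,1}2^i\right)$$
$$= 2^{N-1}\left\{\left(a_{N-1,0}b_{N-1,0} + a_{N-1,1}b_{N-1,1}\right)2^{N-1} + \sum_{i=0}^{N-2}\left(-a_{i,0}b_{N-1,0} - a_{i,1}b_{N-1,1}\right)2^i\right\}$$
$$+ \sum_{i=0}^{N-2} 2^i\left\{\left(-a_{N-1,0}b_{i,0} - a_{N-1,1}b_{i,1}\right)2^{N-1} + \sum_{j=0}^{N-2}\left(a_{j,0}b_{i,0} + a_{j,1}b_{i,1}\right)2^j\right\} \qquad (2.12)$$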
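Consider the first term of Equation (2.12),
$$2^{N-1}\left\{\left(a_{N-1,0}b_{N-1,0} + a_{N-1,1}b_{N-1,1}\right)2^{N-1} + \sum_{i=0}^{N-2}\left(-a_{i,0}b_{N-1,0} - a_{i,1}b_{N-1,1}\right)2^i\right\}$$
$$= -2^{N-1}\left\{\left(-a_{N-1,0}b_{N-1,0}2^{N-1} + \sum_{i=0}^{N-2} a_{i,0}b_{N-1,0}2^i\right) + \left(-a_{N-1,1}b_{N-1,1}2^{N-1} + \sum_{i=0}^{N-2} a_{i,1}b_{N-1,1}2^i\right)\right\}$$
$$= 2^{N-1}\left\{\left(a_{N-1,0}b_{N-1,0} - \overline{a_{N-1,1}b_{N-1,1}}\right)2^{N-1} - \sum_{i=0}^{N-2}\left(a_{i,0}b_{N-1,0} - \overline{a_{i,1}b_{N-1,1}}\right)2^i + 1\right\}$$
$$= 2^{N-1}\left\{\varpi_{N-1,0}2^{N-1} + \sum_{i=0}^{N-2}\varpi_{i,0}2^i + 1\right\} \qquad (2.13)$$
where $\varpi_{i,0}$ is a redundant binary number and
$$\varpi_{N-1,0} = a_{N-1,0}b_{N-1,0} - \overline{a_{N-1,1}b_{N-1,1}}, \quad \varpi_{i,0} = -a_{i,0}b_{N-1,0} + \overline{a_{i,1}b_{N-1,1}} \ \text{for } 0 \le i \le N-2 \qquad (2.14)$$
Encoding the redundant binary numbers, $\varpi_{N-1,0}$ and $\varpi_{i,0}$, the Boolean equations are
$$\varpi_{N-1,0}^+ = a_{N-1,0}b_{N-1,0}, \quad \varpi_{N-1,0}^- = a_{N-1,1}b_{N-1,1}, \quad \varpi_{i,0}^+ = \overline{a_{i,0}b_{N-1,0}}, \quad \varpi_{i,0}^- = \overline{a_{i,1}b_{N-1,1}} \qquad (2.15)$$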
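Considering the second term of Equation (2.12), and using Equation (1.6),
$$\sum_{i=0}^{N-2} 2^i\left\{\left(-a_{N-1,0}b_{i,0} - a_{N-1,1}b_{i,1}\right)2^{N-1} + \sum_{j=0}^{N-2}\left(a_{j,0}b_{i,0} + a_{j,1}b_{i,1}\right)2^j\right\}$$
$$= \sum_{i=0}^{N-2} 2^i\left\{\left(-a_{N-1,0}b_{i,0} + \overline{a_{N-1,1}b_{i,1}}\right)2^{N-1} + \sum_{j=0}^{N-2}\left(a_{j,0}b_{i,0} - \overline{a_{j,1}b_{i,1}}\right)2^j - 1\right\}$$
$$= \sum_{i=0}^{N-2} 2^i\left\{\lambda_{i,N-1}2^{N-1} + \sum_{j=0}^{N-2}\lambda_{i,j}2^j - 1\right\} \qquad (2.16)$$
where $\lambda_{i,j}$ $(0 \le i \le N-2,\ 0 \le j \le N-1)$ is a redundant binary number with
$$\lambda_{i,j} = a_{j,0}b_{i,0} - \overline{a_{j,1}b_{i,1}} \ \text{for } 0 \le j \le N-2, \ \text{and} \ \lambda_{i,N-1} = -a_{N-1,0}b_{i,0} + \overline{a_{N-1,1}b_{i,1}} \ \text{for } j = N-1. \qquad (2.17)$$
Encoding $\lambda_{i,j}$ as two binary bits, $\lambda_{i,j}^+$ and $\lambda_{i,j}^-$ $(0 \le i \le N-2,\ 0 \le j \le N-1)$,
$$\lambda_{i,j}^+ = a_{j,0}b_{i,0}, \quad \lambda_{i,j}^- = a_{j,1}b_{i,1} \ (0 \le j \le N-2); \qquad \lambda_{i,N-1}^+ = \overline{a_{N-1,0}b_{i,0}}, \quad \lambda_{i,N-1}^- = \overline{a_{N-1,1}b_{i,1}} \qquad (2.18)$$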
The overall inner-product is expressed as:
$$A_0B_0 + A_1B_1 = 2^{N-1}\left\{\varpi_{N-1,0}2^{N-1} + \sum_{i=0}^{N-2}\varpi_{i,0}2^i + 1\right\} + \sum_{i=0}^{N-2} 2^i\left\{\lambda_{i,N-1}2^{N-1} + \sum_{j=0}^{N-2}\lambda_{i,j}2^j - 1\right\} \qquad (2.19)$$
Since $2^{N-1} - \sum_{i=0}^{N-2} 2^i = 1$, Equation (2.19) becomes:
$$A_0B_0 + A_1B_1 = 2^{N-1}\left\{\varpi_{N-1,0}2^{N-1} + \sum_{i=0}^{N-2}\varpi_{i,0}2^i\right\} + \sum_{i=0}^{N-2} 2^i\left\{\lambda_{i,N-1}2^{N-1} + \sum_{j=0}^{N-2}\lambda_{i,j}2^j\right\} + 1 \qquad (2.20)$$
The adjusting term, +1, can be applied as a carry-in to the LSD of the redundant binary
full adder. Figure 2-4 shows the schematic structure of the $A_0B_0 + A_1B_1$ hardware
implementation.
Figure 2-4. Cross Partial Product Structure of $A_0B_0 + A_1B_1$ [50] (block labels: Binary Number Partial Product Generator for $A_0B_0$, Binary Number Partial Product Generator for $A_1B_1$, 2’s-Complement Partial Products, RB Mapping, RBPPG, RB Sum of $A_0B_0 + A_1B_1$)
Defining the redundant binary number $\Phi_j = A_{2j}B_{2j} + A_{2j+1}B_{2j+1}$, the general form
of the cross partial product inner-product is given by:
$$\langle A, B \rangle = \sum_{j=0}^{M-1} A_j B_j = \sum_{j=0}^{\frac{M}{2}-1}\left(A_{2j}B_{2j} + A_{2j+1}B_{2j+1}\right) = \sum_{j=0}^{\frac{M}{2}-1}\Phi_j \qquad (2.21)$$
Again, all of the redundant binary numbers, $\Phi_j$, are added using a redundant binary adder
tree, and the final sum of the redundant numbers can be converted into a 2’s-complement
number using a RB-NB converter [5],[7],[12]. The same redundant binary adder tree
used for the inline inner-product, shown in Figure 2-3, can be applied to the cross partial
product method.
2.1.3 Booth Encoding Methods
To further reduce the number of partial products, the modified Booth encoding
technique is used [96]. The modified Booth algorithm recodes an $N$-bit 2’s-complement
number, $B$, by the following equation:
$$B = -b_{N-1}2^{N-1} + \sum_{i=0}^{N-2} b_i 2^i = \sum_{i=0}^{\frac{N-2}{2}} Q_i 2^{2i} \qquad (2.22)$$
where $b_{-1} = 0$ and $Q_i \in \{-2, -1, 0, +1, +2\}$ is determined according to the bit pattern of the
3-bit string $b_{2i+1}\,b_{2i}\,b_{2i-1}$ of $B$, as given in Table 2-A.
Table 2-A. Modified Booth Encoding Table [96]
$b_{2i+1}$  $b_{2i}$  $b_{2i-1}$  $Q_i$
0  0  0   0
0  0  1   1
0  1  0   1
0  1  1   2
1  0  0  -2
1  0  1  -1
1  1  0  -1
1  1  1   0
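The recoding in Table 2-A can be sketched in a few lines. This is a behavioral illustration with a hypothetical function name; it computes each digit with the standard radix-4 relation $Q = -2b_{\text{high}} + b_{\text{mid}} + b_{\text{low}}$, which reproduces every row of the table.

```python
def booth_digits(b, n):
    """Radix-4 modified Booth digits of an n-bit 2's-complement integer b (n even).

    Digit k carries weight 4**k; the bit below the LSB is taken as 0.
    """
    digits = []
    low = 0                                     # b_{-1} = 0
    for i in range(0, n, 2):
        mid, high = (b >> i) & 1, (b >> (i + 1)) & 1
        digits.append(-2 * high + mid + low)    # one row of Table 2-A
        low = high                              # overlapping bit for the next digit
    return digits
```

An $N$-bit multiplier operand yields $N/2$ digits, so the number of partial product rows is halved before any RB pairing is applied.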
For the inline partial product method, we apply Booth coding to find the product








If 6, = 1, C, +22^*2* and g, = 0
A=0
If g, - 2 ,  q  = -o^_ ,2 '' + ^ 0 ,2 * + ' and g, = 0
k=0
If g, = - I , C, = 2""-' + ^  a, 2" and g, = 1 (2.24)
k=0
N - 2
If Q, = - 2 ,  C, = -a,_, 2“ + y  a, 2 " ' and g, = 1
k=0
Notice that $C_i$ is a 2’s-complement number. Mapping the product to redundant binary,
f '
AB = 2 ( C ,+ « ,) 2 ”
/-o
= E  !(C ,+«,)2"'+<C,.,+«„i)2"“ }







where $E_i$ is a redundant binary number and $E_i - 1 = C_i + 4C_{i+1}$. The correction factors,
$\gamma_i$ and $\gamma_{i+1}$, depend on the values of $g_i$ and $g_{i+1}$, as shown in Table 2-B. Here, the two
redundant binary Booth correction factors, $\gamma_i$ and $\gamma_{i+1}$, are used since $g_i$ and $g_{i+1}$ cannot
be combined. The $-1$ is encoded into the $\gamma_i$ correction factor.
Table 2-B. Booth Correction Factors for the Inline Multiplication Method
$g_i$  $g_{i+1}$  $\gamma_i$ (code)  $\gamma_{i+1}$ (code)
0  0  -1 (0 0)  0 (0 1)
0  1  -1 (0 0)  1 (1 1)
1  0   0 (0 1)  0 (0 1)
1  1   0 (0 1)  1 (1 1)
Referring to Booth coding Table 2-A, the Boolean equations for the correction 
factors are,
r : - o  
7Ï = Kxbi 4-1 + 4+14 4-1 + 4+i4 4-1
4+1 44-1 C2.26)
= 4+3 4+24+1 
= 1
where $i = 0, 2, 4, \ldots, \frac{N}{2}-2$. After the redundant binary products of $A_0B_0$ and $A_1B_1$ are
computed, a redundant binary adder is used to compute $A_0B_0 + A_1B_1$.
The general form of the inline Booth encoded inner-product is,
M - \
< A , B >  -  2 ]  A j B j  -  ^  2 ]  + 7j,x +  7 j,m )2 '̂ (2.27)
y=o /=0,2,4,.. T-2
where the number of partial products is decreased to slightly above 25% of the original
count once the correction factors are taken into account.
Applying Booth encoding for the cross partial product method, we must
consider $A_0B_0 + A_1B_1$. Using Equation (2.22),
f - ' N - 2
A,B„ + A,B, = y  + Z
i = 0 k = 0
+ Z  a , , 2 " ' ' + Z " w 2 ' ) 2 "
/ = 0 k = 0
y-' y-'
= Z C : , . + g , . . ) 2 "  + Z ( C , , + g , , ) 2 "
i = 0 i = 0
fc '
=  X  ((",.0 +  ^/,1 +  S i,I  +  g / ,o )2 ^ '
C2.28)
i =  0
From Equation (1.6), the sum of two 2’s-complement numbers can be considered
as a redundant binary number minus 1. Equation (2.28) can be converted to:
4 )^ 0  +  4 ^ 1  -  X !  (A .o +  +  Si,\ +  g,,o)2^'
1=0
Y"'
=  X  +  ^1,1 +  Si,o - 1)2^' 
/=0
Y"'
~ ^  (-̂ 1,01 +U',oi)2
(2.29)
1=0
where $U_{i,01}$ is a redundant binary number from the addition of $C_{i,0}$ and $C_{i,1}$, and the
redundant binary number $\gamma_{i,01} = g_{i,0} + g_{i,1} - 1$. The correction factor, $\gamma_{i,01}$, depends on the
values of $g_{i,0}$ and $g_{i,1}$, as shown in Table 2-C.
Table 2-C. Booth Correction Factors for the Cross Partial Product Method
$g_{i,0}$  $g_{i,1}$  $\gamma_{i,01} = g_{i,0} + g_{i,1} - 1$  $\gamma_{i,01}^+$  $\gamma_{i,01}^-$
0  0  -1  0  0
0  1   0  0  1
1  0   0  1  0
1  1   1  1  1
From the Booth coding Table 2-A, the bit encoding Boolean equations are;
,0
_  _  (2,30)
y,!oi = &i =4n,i4,i 4 - 1 . 1 4-i.i+4+i.i4.i 4-n
“ 4 + 1 , 1  4 , i 4 - i . i
Using Booth coding, the cross partial product inner-product method is given by
M M N
M-i y - ’ y - 'y - '
< A ,B  > ^  AjBj ^  + ^2y+i-̂ 2y+i ) ^  (̂ i,(2j,2j+i) +/,,(2;.2y+i))  ̂ (2-31)
J=0 j = 0  J=0 i=0
Again, the number of partial products is decreased to slightly above 25% of the original
count once the correction factors are taken into account.
2.1.4 Implementation Comparison of Inline and Cross Inner-Product Methods
An 8-tap digital filter implementation in [28] uses a signed-magnitude system to
encode a redundant binary number. The signed-magnitude method requires two gate
delays for the conversion from 2’s-complement to redundant binary. Examining
Equations (1.7) and (1.8), only inverters are necessary for the inline partial product
redundant binary mapping.
In [49], the cross partial product implementations of $A_0B_0 + A_1B_1$ and $A_0B_0 - A_1B_1$
are discussed for complex number multiplication. An equivalent derivation was provided
in Section 2.1.2. The inline method, presented in Section 2.1.1, combines partial products
within the partial products of $A_0B_0$ and $A_1B_1$, respectively. Figure 2-2 and Figure 2-4
depict these methods. For a qualitative comparison of these two designs, note that as
feature sizes shrink in deep submicron VLSI technology, interconnection wires contribute
a large portion of the total delay [49],[50]. The inline implementation provides more
direct routing for vertical (horizontal) wires, while the cross partial product method
[49],[50] will need crossing horizontal (vertical) wiring paths for partial product
mapping, with the routing distance proportional to the word width. Therefore, the inline
partial product method will result in improved performance, compared to the cross partial
product method. In addition, the inline method offers more extended operational
capability than the cross partial product scheme (see Chapter 4). The inline method
requires more horizontal gates, primarily due to the overhead of the partial product
alignment. Table 2-D and Table 2-E show comparisons for Xilinx FPGA
implementations of $A_0B_0 + A_1B_1$, with and without Booth encoding. The Xilinx Virtex2
2V6000FF1517 device was targeted for the implementations using VHDL and Xilinx
Foundation software. For this word length, the higher performance and area savings of
the Booth encoded designs are evident (see the Appendix for VHDL code availability).
Table 2-D. 16-Bit FPGA Implementations of $A_0B_0 + A_1B_1$ Without Booth Encoding
Cross PP Method Inline PP Method
Number of Slices 1094 1094
Number of LUTs 1992 2002
Equivalent Gate Count 11952 12012
Maximum Delay 30.014ns 28.084ns
Table 2-E. 16-Bit FPGA Implementations of $A_0B_0 + A_1B_1$ With Booth Encoding
Cross PP Method Inline PP Method
Number of Slices 846 905
Number of LUTs 1603 1717
Equivalent Gate Count 9618 10302
Maximum Delay 29.460ns 27.970ns
2.2 Complex Number Inner-Product Computation
2.2.1 Review of Complex Number Arithmetic
Complex number arithmetic is a key feature in modern digital communication,
radar systems and optical systems. Many algorithms based on
convolutions, correlations, and complex filters require complex number multiplication,
complex number division, and high-speed inner-products. These applications require
efficient representation and manipulation of complex numbers together with real
numbers. Among these computations, high-performance complex number multipliers and
complex number inner-products are especially desirable. Recent research in hardware
implementation of complex number arithmetic circuits is focused on utilization of
radix-(2), as well as alternative radices, for the representation of complex numbers.
In this chapter, different complex radices are investigated and compared. It is
found that the complex radices have no advantage in hardware implementations.
Traditional radix-(2) redundant binary numbers are used to implement complex-number
multiplication and inner-product processing. The investigated inline inner-product
processor can be reconfigured/controlled to perform complex-number computations. The
computational structures of $A_0B_0 + A_1B_1$ and $A_0B_0 - A_1B_1$ are developed for performing
complex number inner-products. The implementation of $A_0B_0 + A_1B_1$ can be easily
controlled to perform the computation of $A_0B_0 - A_1B_1$. A complex number inner-product
processor is realized, based upon a unified structure for $A_0B_0 \pm A_1B_1$.
To represent a complex number in a radix other than radix-(2), several representations
have been proposed. Knuth [63],[64] described a “quater-imaginary” number system with
radix-($2j$). Dao [58] further analyzed this quater-imaginary system for complex-radix
arithmetic. Penney [66] proposed a complex number representation with the base $j-1$.
Slekys [67] defined arithmetic operations on radix-($j\sqrt{2}$). Recently, further
investigations examined the arithmetic algorithms and hardware implementation of these
representations. Aoki [31],[97],[98] and McIlhenny [99],[100] investigated complex
number arithmetic in a redundant radix-($2j$) number system. Jamil [62] and Blest [55]
further analyzed complex number computations in the radix-($j-1$) number system
and proposed arithmetic methods for addition, subtraction, multiplication and
division. Frougny [60] and Koren [65] provided a theoretical investigation of complex
number arithmetic in the bases $j\sqrt{b}$ and $-b+j$. Stepanenko [68]
also investigated complex number arithmetic in radix-($j\sqrt{2}$).
Multiplication is an essential operation for high-speed hardware implementation 
of complex number computations. It can be used to compare the complexity of complex 
number arithmetic with different complex radices. The analysis of complex number 
multiplication in these various radices will provide one metric for comparison.
To compute the product of two complex numbers, the conventional method is to
use four binary multiplications, one addition, and one subtraction, as shown in Figure
2-5. Define two complex numbers as:
$$A = A_r + jA_i, \quad B = B_r + jB_i \qquad (2.32)$$
where $j = \sqrt{-1}$, and $A_r$, $A_i$, $B_r$, and $B_i$ are the real and imaginary parts of the complex
numbers, $A$ and $B$. Multiplication of $A$ and $B$ is given by:
$$A \times B = (A_r + jA_i) \times (B_r + jB_i) = A_rB_r - A_iB_i + j(A_rB_i + A_iB_r) \qquad (2.33)$$
Figure 2-5. Basic Diagram of Complex Number Multiplication
In this direct implementation method, four multiplications plus two additions are
required. To reduce the arithmetic complexity of the complex number multiplication, an
algebraic transform, given in Equation (2.34), is proposed by Blahut [101]. This method
saves one real number multiplication at the expense of three more additions:
$$A \times B = \left[(A_r - A_i)B_i + A_r(B_r - B_i)\right] + j\left[(A_r - A_i)B_i + A_i(B_r + B_i)\right] \qquad (2.34)$$
Figure 2-6. Blahut’s Complex Number Multiplier [101]
As shown in Figure 2-6, this method requires the pre-addition $B_r + B_i$ and the
pre-subtractions $A_r - A_i$ and $B_r - B_i$ before the binary multiplications, resulting in an
increase of critical path delay. Although addition is generally less expensive in area than
multiplication, the overall savings in hardware does not offset the non-trivial critical path
delay. Therefore, the complex multiplication scheme given in Equation (2.33) will be
utilized in this research.
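The trade-off between the two schemes can be made concrete with a small sketch. The direct form follows Equation (2.33); the three-multiplication form is one standard arrangement that uses exactly the pre-addition and pre-subtractions named in the text, and is assumed here to correspond to Blahut's transform. The function names are hypothetical.

```python
def cmul_direct(ar, ai, br, bi):
    """Equation (2.33): four multiplications, one subtraction, one addition."""
    return ar * br - ai * bi, ar * bi + ai * br


def cmul_three_mult(ar, ai, br, bi):
    """Three multiplications; the pre-add/subtract operands feed the multipliers."""
    shared = (ar - ai) * bi                 # product reused by both outputs
    real = shared + ar * (br - bi)
    imag = shared + ai * (br + bi)
    return real, imag
```

In the second form every multiplier input must wait for an adder, which is the critical-path penalty the text cites as the reason for keeping Equation (2.33).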
2.2.2 Comparison of Different Complex Radices
Representing complex numbers with a complex radix implies that the complex
numbers can be manipulated without separating the real and the imaginary parts. It is
supposed that in these complex radices, complex-number arithmetic
will be simplified. For example, complex number multiplication may only need one
complex radix multiplication and hence provide a major performance improvement. Can
a complex radix system really achieve such an improvement? In the analysis, the various
complex radices are compared. Interestingly, no significant improvement
is achieved compared to the traditional 2’s-complement binary representation of complex
numbers with the real and imaginary parts treated separately. Further, many of the
alternative complex radix representations are unbalanced or fractal, thereby creating, in
limited precision hardware, a significant representation issue for the range of the real and
imaginary number components as well as the positive and negative fixed-point values. A
review of these alternative complex radices follows.
2.2.2.1 Radix-(2j)
As early as 1960, Knuth [63],[64] proposed radix-(2j), which leads to an interesting system called “quater-imaginary” (by analogy with “quaternary” or base-4). In this system every complex number is represented with the digits 0, 1, 2, and 3 without a sign. For example:
$$(11210.31)_{2j} = 1 \cdot 16 + 1 \cdot (-8j) + 2 \cdot (-4) + 1 \cdot (2j) + 3 \cdot (-\tfrac{1}{2}j) + 1 \cdot (-\tfrac{1}{4}) = 7\tfrac{3}{4} - 7\tfrac{1}{2}j \qquad (2.35)$$
Here the number $(a_{2n} \ldots a_1 a_0 . a_{-1} \ldots a_{-2k})_{2j}$ is equal to
$$(a_{2n} \ldots a_2 a_0 . a_{-2} \ldots a_{-2k})_{-4} + 2j\,(a_{2n-1} \ldots a_3 a_1 . a_{-1} \ldots a_{-2k+1})_{-4} \qquad (2.36)$$
Conversion to and from quater-imaginary radix reduces to the conversion to and from negative quaternary representation of the real and imaginary parts. In his book [64], Knuth notes that the interesting property of this system is that it allows the multiplication and division of complex numbers to be done in a fairly unified manner without treating real and imaginary parts separately. For example, two numbers can be multiplied in this system much as in any radix, merely by using a different carry rule. Whenever a sum digit exceeds 3, we subtract 4 from the sum digit and carry -1 two columns to the left; when a sum digit is negative, we add 4 to it and carry +1 two columns to the left.
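As a quick check of the positional weights, the following Python sketch evaluates a quater-imaginary digit string exactly. The helper names are hypothetical and the code is only an illustration of the definition, not part of any referenced implementation.

```python
from fractions import Fraction


def power_2j(k):
    """Return (2j)**k as an exact (real, imag) pair of Fractions."""
    magnitude = Fraction(2) ** k
    # j**k cycles through 1, j, -1, -j (also valid for negative k)
    re, im = [(1, 0), (0, 1), (-1, 0), (0, -1)][k % 4]
    return magnitude * re, magnitude * im


def quater_imaginary_value(digits):
    """Evaluate a string of digits 0..3 (optionally with a radix point)."""
    int_part, _, frac_part = digits.partition(".")
    re = im = Fraction(0)
    for k, d in enumerate(reversed(int_part)):
        pr, pi = power_2j(k)
        re += int(d) * pr
        im += int(d) * pi
    for k, d in enumerate(frac_part, start=1):
        pr, pi = power_2j(-k)
        re += int(d) * pr
        im += int(d) * pi
    return re, im


if __name__ == "__main__":
    print(quater_imaginary_value("11210.31"))
```

Because even positions carry only real weights and odd positions only imaginary weights, the loop is effectively evaluating two interleaved radix-(-4) numbers.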
Representing complex numbers in radix-(2j) is the same as representing the real and imaginary parts in radix-(-4). Although there is no sign to deal with in radix-(-4), the number system is imbalanced. The treatment of this imbalance for negative-base number systems in Zohar’s work [102] is not correct. Zohar’s results [102] are shown in Equations (2.37) and (2.38):
For radix $q<0$, the maximum positive number is given by
2[(fl+l)/2] 1
and the minimum negative number is given by
#  = (2.38)
M+1
Here are the correct results. Consider a system that uses $D$ digits to represent numbers in the base $q<0$. When $D$ is even, the largest representable integer is the positive number $P$, whose representation is:
$$P = (0, |q|-1, \ldots, 0, |q|-1, 0, |q|-1). \qquad (2.39)$$
Its value is given by,
$$P = \sum_{i=0}^{D/2-1} (|q|-1)\,|q|^{2i} = \frac{|q|^{D}-1}{1+|q|} \qquad (2.40)$$
Similarly, the smallest integer ($N$) is:
$$N = (|q|-1, \ldots, 0, |q|-1, 0, |q|-1, 0). \qquad (2.41)$$
Its value is given by,
$$N = \frac{|q| - |q|^{D+1}}{1+|q|} \qquad (2.42)$$
The number of integers contained in the closed interval defined this way is $P - N + 1$. That is:
$$P - N + 1 = \frac{|q|^{D}-1}{1+|q|} + \frac{|q|^{D+1}-|q|}{1+|q|} + 1 = |q|^{D} \qquad (2.43)$$
The result is very similar when $D$ is odd. The largest representable integer is the positive number $P$ whose representation is
$$P = (|q|-1, \ldots, 0, |q|-1, 0, |q|-1). \qquad (2.44)$$
Its value is given by:
$$P = \sum_{i=0}^{(D-1)/2} (|q|-1)\,|q|^{2i} = \frac{|q|^{D+1}-1}{1+|q|} \qquad (2.45)$$
and the smallest integer ($N$) is
$$N = (0, |q|-1, \ldots, 0, |q|-1, 0, |q|-1, 0). \qquad (2.46)$$
Its value is given by:
$$N = \frac{|q| - |q|^{D}}{1+|q|} \qquad (2.47)$$
The number of integers contained in the closed interval defined this way is $P - N + 1$. That is:
$$P - N + 1 = \frac{|q|^{D+1}-1}{1+|q|} + \frac{|q|^{D}-|q|}{1+|q|} + 1 = |q|^{D} \qquad (2.48)$$
This, however, is the number of different configurations of the $D$ digits. We conclude, therefore, that $D$ digits span all the integers from $N$ to $P$, regardless of $D$ being even or odd. A troubling result is that the closed interval $[N, P]$ is quite asymmetrical. A simple example will illustrate these statements.
Assume
$q = -10$ and $D = 3$
Then the largest number is $(909)_{-10} = 909$, and the smallest is $(090)_{-10} = -90$.
Assume
$q = -10$ and $D = 4$
Then the largest number is $(0909)_{-10} = 909$, and the smallest is $(9090)_{-10} = -9090$.
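The range statements above are easy to confirm by enumeration. The sketch below is only a sanity check with hypothetical function names: it lists every $D$-digit string in a negative base, and compares the extremes with the closed forms for even and odd $D$.

```python
from itertools import product


def negabase_range(q, D):
    """Enumerate all D-digit numbers in base q < 0; return (min, max, count)."""
    values = {
        sum(d * q ** k for k, d in enumerate(digits))
        for digits in product(range(abs(q)), repeat=D)
    }
    return min(values), max(values), len(values)


def closed_form(q, D):
    """Smallest and largest values (N, P) from the even/odd closed forms."""
    b = abs(q)
    if D % 2 == 0:
        P = (b ** D - 1) // (b + 1)
        N = -(b ** (D + 1) - b) // (b + 1)
    else:
        P = (b ** (D + 1) - 1) // (b + 1)
        N = -(b ** D - b) // (b + 1)
    return N, P


if __name__ == "__main__":
    for D in (3, 4):
        print(D, negabase_range(-10, D), closed_form(-10, D))
```

The count returned by the enumeration equals $|q|^D$, so every integer between the two extremes is represented exactly once.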
2.2.2.1.1 Complex Number Addition in Radix-(2j)
Dao [58] proposed a hardware implementation for the radix-(2j) addition. The
adding of two numbers X and Y in the quater-imaginary system, as in any positional 
representation, consists of adding digits of the same weight. The modulo-(-4) result 
produces a sum digit and a carry digit. In this radix, the carry is -1 and has a weight equal 






   5 + 10j       1   1   3   3   1   1
 + 8 +  2j       0   1   0   2   1   0
                 1   2   3   1   2   1
                    -1
  13 + 12j       1   1   3   1   2   1
Figure 2-7. An Example of Addition in Radix-(2j) [58]
Actually, the radix-(2j) adder is a radix-(-4) adder operating on the even and odd digit positions separately. The negative radix addition for real numbers is further investigated in [103],[104]. We conclude that the radix-(2j) addition reduces to the radix-(-4) addition in the even and odd positions separately.
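A software model of this digit-wise addition is sketched below. It is not Dao's hardware design; the function names are hypothetical, and the code simply applies the rule that a digit above 3 sends a carry of -1, and a negative digit a carry of +1, two positions to the left.

```python
def value_2j(digits):
    """Value of a quater-imaginary digit string as an integer (real, imag) pair."""
    re = im = 0
    for k, d in enumerate(reversed(digits)):
        weight = int(d) * 2 ** k
        if k % 4 == 0:
            re += weight
        elif k % 4 == 1:
            im += weight
        elif k % 4 == 2:
            re -= weight
        else:
            im -= weight
    return re, im


def add_radix_2j(x, y):
    """Add two quater-imaginary digit strings (most significant digit first)."""
    a = [int(d) for d in reversed(x)]
    b = [int(d) for d in reversed(y)]
    n = max(len(a), len(b))
    acc = [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
           for i in range(n)]
    i = 0
    while i < len(acc):
        carry = 0
        if acc[i] > 3:        # digit too large: subtract 4, carry -1
            acc[i] -= 4
            carry = -1
        elif acc[i] < 0:      # digit negative: add 4, carry +1
            acc[i] += 4
            carry = 1
        if carry:
            while len(acc) <= i + 2:
                acc.append(0)
            acc[i + 2] += carry   # carry skips one position
        i += 1
    while len(acc) > 1 and acc[-1] == 0:
        acc.pop()
    return "".join(str(d) for d in reversed(acc))


if __name__ == "__main__":
    s = add_radix_2j("113311", "010210")
    print(s, value_2j(s))
```

Since carries from even positions only ever reach even positions (and likewise for odd ones), the loop behaves as two independent radix-(-4) adders.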
2.2.2.1.2 Complex Number Subtraction in Radix-(2j)
In negative bases like (-4), no explicit sign digit is required in the representation. The negation of a number is obtained by taking the 4’s-complement of each non-zero digit together with a positive carry digit of 1 two positions ahead:
$$-x_i (2j)^{i} = (-4 + \bar{x}_i)(2j)^{i} \quad \text{with } \bar{x}_i = 4 - x_i$$
$$= (2j)^{i+2} + \bar{x}_i (2j)^{i}$$
The subtraction of a number X  is reduced to adding its 4’s-complement with 
proper carry propagation:
Similar to addition, the implementation of radix-(2j) subtraction is radix-(-4) subtraction in the even and odd positions separately.
   5 + 10j       1   1   3   3   1   1
 - (8 + 2j)      0   3   0   2   3   0
                     1   1               carry from 4's-complement
 +                  -1  -1               carry from radix-(-4) addition
  -3 + 8j        1   0   3   1   0   1
Figure 2-8. An Example of Subtraction in Radix-(2j) [58]
2.2.2.1.3 Complex Number Multiplication in Radix-(2j)
Serial multiplication, i.e., one digit of the multiplier at a time, proceeds as in the binary case. Given,
$$X = x_0 (2j)^{0} + \cdots + x_{n-1} (2j)^{n-1}$$
with $Y$ expressed in the same way using the digits $y_k$, the product is:
$$Z = XY = \sum_{k=0}^{n-1} \sum_{i=0}^{n-1} x_i\, y_k\, (2j)^{i} (2j)^{k} \qquad (2.51)$$
The digit product can generate a carry (0,-1,-2), which must be added to the 




The terms inside the bracket represent the partial product from the lower digit of the multiplier.
Notice that shifting a number $X$ one position to the left is equivalent to rotating its vector by 90 degrees and doubling its length, while shifting one position to the right results in a -90 degree rotation and a halving of the length.
   5 + 10j                     1   1   3   3   1   1
 x 8 +  2j                     0   1   0   2   1   0
                           1   1   3   3   1   1   0   Partial Product
                       2   2   2   2   2   2           Partial Product
                      -1  -1
                       1   1   2   1   1   3   1   0
               1   1   3   3   1   1                   Partial Product
              -1  -1
  20 + 90j                     3   2   1   3   1   0
Figure 2-9. An Example of Multiplication in Radix-(2j) [58]
The implementation of radix-(2j) multiplication results in a radix-(2j) partial product generator operating on the even and odd positions separately, followed by a radix-(-4) addition tree to generate the final product (see the example in Figure 2-9).
An analysis of Knuth’s “quater-imaginary” radix shows that this imaginary number system has several disadvantages:
• Since the numbers obtained from sensors and digital systems are normally 2’s-complement binary numbers, a conversion from 2’s-complement to radix-(2j) must be performed. Our research [105] shows that implementing this conversion requires a delay on the order of a carry-lookahead adder, which adds computational delay to the critical path.
• A key computation for complex numbers in radix-(2j) will invariably be multiplication. Without further developments, radix-(2j) multiplication is slow compared to 2’s-complement binary multiplication that uses Wallace trees, redundant binary number addition, Booth encoding, or array multipliers.
• Representing complex numbers in radix-(2j) is the same as representing the real and imaginary parts in radix-(-4). As previously shown in Equations (2.40), (2.42), (2.45) and (2.47), radix-(-4) is an imbalanced system, while a traditional positive radix system is a balanced number system.
Recently, Aoki [31],[97],[98] and McIlhenny [99],[100] investigated redundant complex radix-(2j) arithmetic for high-speed signal processing with emphasis on complex number addition and multiplication. The addition of two numbers, $X = (x_{n-1} \ldots x_i \ldots x_{-m})$ and $Y = (y_{n-1} \ldots y_i \ldots y_{-m})$, in the redundant complex number system (2j, 3), where $x_i, y_i \in \{-3, -2, -1, 0, 1, 2, 3\}$, is performed by the following three steps for each digit:
Step 1: $z_i = x_i + y_i$
Step 2: $-4c_i + w_i = z_i$ (2.53)
Step 3: $s_i = w_i + c_{i-2}$
Here $z_i$ is the linear sum, $w_i$ is called the intermediate sum, and $c_i$ is the carry. This so-called radix-(2j) redundant number system is actually a radix-(4) redundant number system with the real and imaginary parts treated separately.
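The three steps can be modeled in a few lines of Python. This is a sketch under stated assumptions: the text does not give the rule for choosing $c_i$ and $w_i$ in Step 2, so the thresholds below (carry when $|z_i| \ge 3$) are one workable choice, not necessarily the one used in [31] or [99]; the function names are hypothetical.

```python
def value_lsb_first(digits):
    """Value of signed radix-(2j) digits, least significant first, as (re, im)."""
    re = im = 0
    for k, d in enumerate(digits):
        weight = d * 2 ** k
        if k % 4 == 0:
            re += weight
        elif k % 4 == 1:
            im += weight
        elif k % 4 == 2:
            re -= weight
        else:
            im -= weight
    return re, im


def redundant_add_2j(x, y):
    """Carry-free addition of two digit lists with digits in -3..3."""
    n = max(len(x), len(y))
    x = x + [0] * (n - len(x))
    y = y + [0] * (n - len(y))
    s = [0] * (n + 2)
    for i in range(n):
        z = x[i] + y[i]                  # Step 1: linear sum
        if z >= 3:                       # Step 2 (assumed selection rule)
            c = -1
        elif z <= -3:
            c = 1
        else:
            c = 0
        w = z + 4 * c                    # so that -4*c + w == z, |w| <= 2
        s[i] += w                        # Step 3: s_i = w_i + c_{i-2}
        s[i + 2] += c
    return s


if __name__ == "__main__":
    print(redundant_add_2j([3, 3, 3, 3], [3, 3, 3, 3]))
```

With this rule each intermediate sum satisfies $|w_i| \le 2$ and each carry $|c_i| \le 1$, so every result digit stays within -3..3 and no carry propagates further than two positions.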
In summary, this analysis shows that the non-redundant radix-(2j) number system has several disadvantages and that the redundant radix-(2j) complex number system is a radix-(4) redundant binary system with real and imaginary parts treated separately. Therefore, radix-(2j) offers no overall hardware implementation advantage over a conventional binary number system.
2.2.2.2 Radix-($j\sqrt{2}$)
A system similar to radix-(2j) that uses only the digits 0 and 1 is based on $j\sqrt{2}$. This scheme, however, requires an infinite non-repeating expansion for the simple number $0+1j$. Slekys [67] defined arithmetic operations on a modified bi-imaginary number system based on radix-($j\sqrt{2}$). If a modified bi-imaginary complex system is used to encode each complex number $a+jb$ as $a + j\sqrt{2}\,b$, then the number $(a_{2K} \ldots a_1 a_0 . a_{-1} \ldots a_{-2K})_{j\sqrt{2}}$ is equal to:
$$(a_{2K} \ldots a_2 a_0 . a_{-2} \ldots a_{-2K})_{-2} + j\sqrt{2}\,(a_{2K-1} \ldots a_3 a_1 . a_{-1} \ldots a_{-2K+1})_{-2}$$
The conversion to and from radix-($j\sqrt{2}$) notation reduces to the conversion to and from a negative-2 representation of the real and imaginary parts separately. Slekys defined algorithms for complex number addition, subtraction, multiplication and division. However, since Slekys's radix-($j\sqrt{2}$) system is not redundant, its arithmetic requires additional operations compared with the conventional 2’s-complement representation of complex numbers. To see this, consider multiplication in this system.
Define the multiplier and multiplicand as the complex numbers $Z_1$ and $Z_2$ respectively, where:
$Z_1 = A + jB$
$Z_2 = C + jD$
Using a modified bi-imaginary representation let,
$Z_1' = A + j\sqrt{2}\,B$
$Z_2' = C + j\sqrt{2}\,D$
Then the multiplication of $Z_1$ and $Z_2$ will be:
$$Z_1 \cdot Z_2 = Z_1' \cdot Z_2' + B \cdot D \qquad (2.54)$$
Equation (2.54) shows that in the bi-imaginary complex-radix system the 
multiplication of two complex numbers will be composed of one complex number 
multiplication requiring four multiplications and two additions, plus one real number 
multiplication and one complex number addition.
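The extra term in Equation (2.54) can be checked numerically. The sketch below assumes the reading of the modified encoding in which the pair $(a, b)$ stands for $a + j\sqrt{2}\,b$ and decodes to $a + jb$; the function names are hypothetical.

```python
def mul_encoded(p, q):
    """Multiply two encoded numbers (a, b) ~ a + j*sqrt(2)*b.

    (a + j*sqrt(2)*b)(c + j*sqrt(2)*d) = (a*c - 2*b*d) + j*sqrt(2)*(a*d + b*c)
    """
    (a, b), (c, d) = p, q
    return a * c - 2 * b * d, a * d + b * c


def complex_mul_bi_imaginary(A, B, C, D):
    """Product of A + jB and C + jD via the encoded product plus B*D."""
    re, im = mul_encoded((A, B), (C, D))
    # The encoded product has -2*B*D in its real part, one B*D too many,
    # so one real multiplication and one addition restore the true value.
    return re + B * D, im


if __name__ == "__main__":
    print(complex_mul_bi_imaginary(5, 10, 8, 2))
```

The correction costs one real multiplication and one addition on top of the radix-($j\sqrt{2}$) product, which is the overhead noted above.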
For complex radix-($j\sqrt{2}$), we can also use a redundant binary system for the computational arithmetic. Similar to the case of radix-(2j), this redundant complex radix-($j\sqrt{2}$) is just a radix-(2) redundant binary system with the real and imaginary parts treated separately.
2.2.2.3 Radix-(j-1)
A binary complex number system is also obtained by using the base (j-1), as first suggested by Penney [66]. Further studies of this radix were conducted by Jamil [62] and Blest [55]. Jamil shows that the conversion from 2’s-complement to radix-(j-1) is actually the conversion from radix-(2) to radix-(-4) for the real and imaginary parts separately, with a (j-1)-based addition needed to complete the conversion procedure. The multiplication and division of complex numbers based on this radix are also presented in Jamil's [62] and Blest’s work [55]. Hardware implementations were not specifically addressed. In fact, it appears that the hardware for this radix will possess considerable latency and gate count due to the carry detection logic required for the addition operation.
In radix-(j-1), there exists a carry propagation problem in complex number addition that further complicates the partial product additions in the multiplication hardware. To deal with this carry propagation problem in high-speed parallel hardware, a zero-detector is required for each digit. The zero-detector adds latency and gate complexity.
The value of an $N$-bit binary number $A = (a_{N-1} a_{N-2} \ldots a_1 a_0)$ with radix-(j-1) can be written in the form of a power series as follows:
$$A = a_{N-1}(-1+j)^{N-1} + a_{N-2}(-1+j)^{N-2} + \cdots + a_1(-1+j) + a_0 \qquad (2.55)$$
where the coefficients $a_i \in \{0,1\}$. As an example, if $A$ is a 16-bit number, the powers of $-1+j$ associated with the coefficients will be (from bottom to top, right to left, in groups of four):
[Row 4] (-128 - j128), (0 + j128), (64 - j64), (-64 + j0)
[Row 3] (32 + j32), (0 - j32), (-16 + j16), (16 + j0)
[Row 2] (-8 - j8), (0 + j8), (4 - j4), (-4 + j0)
[Row 1] (2 + j2), (0 - j2), (-1 + j1), (1 + j0)
To describe the hardware implementation difficulties for this system, we need only consider the addition operation. In radix-(-1+j), we have:
$$(1)_{-1+j} + (1)_{-1+j} = (2)_{-1+j} = (1100)_{-1+j} \qquad (2.56)$$
In radix-(-1+j) there are two carries from one bit position. The addition truth table for one bit position is shown in Table 2-F.
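The listed powers and the identity in Equation (2.56) can be reproduced with exact integer arithmetic. The following Python sketch is only an illustration with hypothetical function names.

```python
def powers_of_radix(n):
    """Return [(re, im)] for (-1 + j)**k, k = 0 .. n-1, using integers only."""
    out = []
    re, im = 1, 0
    for _ in range(n):
        out.append((re, im))
        # (re + j*im) * (-1 + j) = (-re - im) + j*(re - im)
        re, im = -re - im, re - im
    return out


def value_radix_jm1(bits):
    """Value of a bit string (most significant bit first) in radix-(-1+j)."""
    weights = powers_of_radix(len(bits))
    re = sum(int(b) * weights[k][0] for k, b in enumerate(reversed(bits)))
    im = sum(int(b) * weights[k][1] for k, b in enumerate(reversed(bits)))
    return re, im


if __name__ == "__main__":
    for row in range(3, -1, -1):
        print(list(reversed(powers_of_radix(16)[4 * row:4 * row + 4])))
    print(value_radix_jm1("1100"))
```

The bit string 1100 evaluates to the real number 2, which is why adding 1 + 1 in a single position generates carries into two higher positions at once.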
From the truth table, it is seen that there is a carry propagation problem in the addition of numbers in radix-(-1+j). Figure 2-10 shows an example for $x_i = 1$, $y_i = 1$, $c_{i-2} = 1$ and $c_{i-3} = 1$.
In this example, $x_i = y_i = c_{i-2} = c_{i-3} = 1$; from Table 2-F and Equation (2.56), the carry-outs $c_{i+2}$ and $c_{i+3}$ to the digit positions 2 and 3 places to the left are $c_{i+2} = c_{i+3} = 2$. Then carry $c_{i+2} = 2$ will further propagate to $c_{i+4} = c_{i+5} = 1$ and carry $c_{i+3} = 2$ will further propagate to $c_{i+5} = c_{i+6} = 1$. Thus, carry $c_{i+5} = 1 + 1 = 2$ will propagate to $c_{i+7} = c_{i+8} = 1$.
Table 2-F. Truth Table for Radix-(-1+j) One-Bit Addition
x_i    y_i    c_{i-2}    c_{i-3}    s_i    c_{i+2}    c_{i+3}
0      0      0          0          0      0          0
0      0      1          1          0      1          1
0      1      1          1          1      1          1
1      1      1          1          0      2          2
1      1      1          2          1      2          2
1      1      2          2          0      3          3
1      1      3          3          0      4          4
Figure 2-10. An Example for Radix-(j-1) Carry Propagation
A possible hardware method to work around this problem is a zero-detector based on the equality:
$$(11)_{-1+j} + (111)_{-1+j} = (0)_{-1+j} \qquad (2.57)$$
However, the zero-detector is expensive, serial in nature, and produces high latency. From this analysis, radix-(-1+j) has no advantage in complex number computational hardware for addition and multiplication.
In summary, different complex radices such as radix-(2j), radix-(-1+j) and radix-($j\sqrt{2}$) are studied. It is shown that these complex radices have no advantage over traditional binary number systems in hardware implementation. Chang’s research [56],[57] also supports this conclusion. In Chang’s research, a RIA (Real Imaginary Alternate) complex number system is proposed. In essence, his system represents complex numbers in 2’s-complement binary form with interleaved real and imaginary parts. Therefore, based upon the traditional binary number representation and the previously discussed real-number inner-product processor, a high-performance complex multiplier and complex number inner-product processor are developed in the following sections.
2.2.3 Complex Number Multiplier and Inner-Product 
Computation
2.2.3.1 RB Complex Number Multiplier
Applying the inline multiplication method to the complex inner-product requires no modification to Figure 2-2 to produce the redundant binary imaginary part, $(A_0B_1 + A_1B_0)$. However, the real part, $A_0B_0 - A_1B_1$, requires a final redundant binary subtraction, rather than an addition. The subtraction is easily implemented by modifying the redundant binary adder (RBA) to add the complement of the redundant binary number $(A_1B_1)_{RB}$, since $-(A_1B_1)_{RB} = (\overline{A_1B_1})_{RB}$, where $\overline{-1} = 1$, $\overline{1} = -1$, and $\overline{0} = 0$. In the actual hardware implementation of Figure 2-2, the addend or its RB complement is multiplexed into the RBA, thereby converting it to an adder/subtracter. Defining the control signal Real_Img as Real_Img=1 for the $A_0B_0 - A_1B_1$ computation and Real_Img=0 for the $A_0B_0 + A_1B_1$ computation, the inline implementation for $A_0B_0 \pm A_1B_1$ is shown in Figure 2-11. For Booth encoding of the inline method, no further modification is required.
[Figure: binary number partial product generators for $A_0B_0$ and $A_1B_1$ feeding the RB sum of $A_0B_0 - A_1B_1$]
Figure 2-11. Inline Implementation of $A_0B_0 - A_1B_1$
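The cross partial product method is slightly more difficult and requires mapping the difference of two 2’s-complement numbers into a redundant binary number to compute the real part of the complex product, $A_0B_0 - A_1B_1$. The result given here is similar to that provided in [49] and is derived much the same as given in Section 2.1.1 for $A_0B_0 + A_1B_1$.
$$\begin{aligned}
A_0B_0 - A_1B_1 &= 2^{N-1}\Big\{\Big(a_{N-1,0}b_{N-1,0}2^{N-1} + \sum_{i=0}^{N-2}(-a_{i,0}b_{N-1,0})2^{i}\Big) - \Big(a_{N-1,1}b_{N-1,1}2^{N-1} + \sum_{i=0}^{N-2}(-a_{i,1}b_{N-1,1})2^{i}\Big)\Big\} \\
&\quad + \sum_{i=0}^{N-2}2^{i}\Big\{\Big(-a_{N-1,0}b_{i,0}2^{N-1} + \sum_{j=0}^{N-2}a_{j,0}b_{i,0}2^{j}\Big) - \Big(-a_{N-1,1}b_{i,1}2^{N-1} + \sum_{j=0}^{N-2}a_{j,1}b_{i,1}2^{j}\Big)\Big\} \qquad (2.58) \\
&= 2^{N-1}\Big\{(a_{N-1,0}b_{N-1,0} - a_{N-1,1}b_{N-1,1})2^{N-1} + \sum_{i=0}^{N-2}(-a_{i,0}b_{N-1,0} + a_{i,1}b_{N-1,1})2^{i}\Big\} \\
&\quad + \sum_{i=0}^{N-2}2^{i}\Big\{(-a_{N-1,0}b_{i,0} + a_{N-1,1}b_{i,1})2^{N-1} + \sum_{j=0}^{N-2}(a_{j,0}b_{i,0} - a_{j,1}b_{i,1})2^{j}\Big\}
\end{aligned}$$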
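Considering the first term of Equation (2.58),
$$2^{N-1}\Big\{(a_{N-1,0}b_{N-1,0} - a_{N-1,1}b_{N-1,1})2^{N-1} + \sum_{i=0}^{N-2}(-a_{i,0}b_{N-1,0} + a_{i,1}b_{N-1,1})2^{i}\Big\} = 2^{N-1}\Big(\mu_{N-1,01}2^{N-1} + \sum_{i=0}^{N-2}\mu_{i,01}2^{i}\Big) \qquad (2.59)$$
where $\mu_{i,01}$ ($0 \le i \le N-1$) is a redundant binary number with
$$\mu_{N-1,01} = a_{N-1,0}b_{N-1,0} - a_{N-1,1}b_{N-1,1}, \qquad \mu_{i,01} = -a_{i,0}b_{N-1,0} + a_{i,1}b_{N-1,1} \ \text{ for } 0 \le i \le N-2. \qquad (2.60)$$
Encoding $\mu_{i,01}$ as two binary bits for $0 \le i \le N-2$: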
B n - \ , o ^ n - \ ’0 ^ n - \ , 0 ’ B n - 1,0 ^ N - i n ^ N - [ , i ’   ̂ (2.61)
Bi,o ~ î,o^N-i,o’ Bifi ~
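Converting the last term of Equation (2.58) to redundant binary,
$$\sum_{i=0}^{N-2}2^{i}\Big\{(-a_{N-1,0}b_{i,0} + a_{N-1,1}b_{i,1})2^{N-1} + \sum_{j=0}^{N-2}(a_{j,0}b_{i,0} - a_{j,1}b_{i,1})2^{j}\Big\} = \sum_{i=0}^{N-2}2^{i}\sum_{j=0}^{N-1}\nu_{i,j}2^{j} \qquad (2.62)$$
where $\nu_{i,j}$ ($0 \le i \le N-2$, $0 \le j \le N-1$) is a redundant binary number and
$$\nu_{i,N-1} = -a_{N-1,0}b_{i,0} + a_{N-1,1}b_{i,1}, \qquad \nu_{i,j} = a_{j,0}b_{i,0} - a_{j,1}b_{i,1} \ \text{ for } 0 \le j \le N-2. \qquad (2.63)$$
Encoding $\nu_{i,j}$ ($0 \le i \le N-2$, $0 \le j \le N-1$) as a redundant binary number using two binary bits: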
^ i , N - \  ^V-l,o4.0 ’ ^
  (2.64)
The general redundant binary equation for the summation of $A_0B_0 - A_1B_1$ is:
$$A_0B_0 - A_1B_1 = 2^{N-1}\Big(\mu_{N-1,01}2^{N-1} + \sum_{i=0}^{N-2}\mu_{i,01}2^{i}\Big) + \sum_{i=0}^{N-2}2^{i}\Big(\nu_{i,N-1}2^{N-1} + \sum_{j=0}^{N-2}\nu_{i,j}2^{j}\Big). \qquad (2.65)$$
Now consider the inner-product, $A_0B_0 - A_1B_1$, using modified Booth encoding:
"-12 V-2 2i-  A,B, = X  + Z  « » 2 ' ) 2
/=0 k = 0
y -'
“  Z i 6z,i(-<3^_i ,2^  ' + ^  0^ ,2 ' ' )2
1 = 0 k = 0
y - '  y - 1
- Z ( C , . + & , . ) 2 " - Z ( V , , + g u ) 2 '
1 =  0 /  =  0
a '




From Equation (1.10), the subtraction of two 2’s-complement numbers can be considered 
as a redundant binary number. So Equation (2.66) becomes






= Z ( n o ,+ r , , . , ) 2 ”
;=0
where $V_{i,01}$ is a redundant binary number from the subtraction of $C_{i,0}$ and $C_{i,1}$, with $V_{i,01} = C_{i,0} - C_{i,1}$ and $\gamma_{i,01} = g_{i,0} - g_{i,1}$. The correction factor, $\gamma_{i,01}$, depends upon the values of $g_{i,0}$ and $g_{i,1}$, as shown in Table 2-G.
Table 2-G. Booth Correction Factors for Redundant Binary Partial Product Generation of $A_0B_0 - A_1B_1$
g_{i,0}    g_{i,1}    γ_{i,01} = g_{i,0} - g_{i,1}    γ+_{i,01}    γ-_{i,01}
0          0          0                               0            1
0          1          -1                              0            0
1          0          1                               1            1
1          1          0                               1            0
From the Booth coding Table 2-A,
y i,m S i,o  4 + 1 , 0 4 .0  4 - 1 ,0  ■ * " 4 + i ,o 4 ,o  4 - 1 ,0  4 + i , o 4 , o  4 - i , o
“ 4 + 1 ,0  4 , o 4 - i , o
/;,oi= &;.i =4+i.i4.i 4_].i +4+i,i4,i 4-1,1 +4+i.i4,i 4-
(2 .68)
1,1
“ 4+ 1 ,1  4 , i 4 - i , i
Again, the “+” in the equations above is the Boolean OR operation.
2.2.3.2 RB Complex Number Inner-Product Processor
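The inner-product of complex numbers $C_0, C_1, \ldots, C_{M-1}$ and $D_0, D_1, \ldots, D_{M-1}$,
$$C_0 = A_0 + jA_1, \quad C_1 = A_2 + jA_3, \quad \ldots, \quad C_{M-1} = A_{2M-2} + jA_{2M-1}$$
$$D_0 = B_0 + jB_1, \quad D_1 = B_2 + jB_3, \quad \ldots, \quad D_{M-1} = B_{2M-2} + jB_{2M-1} \qquad (2.69)$$
is
$$(C, D) = \sum_{i=0}^{M-1} C_iD_i = \sum_{i=0}^{M-1}(A_{2i} + jA_{2i+1})(B_{2i} + jB_{2i+1})$$
$$= \sum_{i=0}^{M-1}(A_{2i}B_{2i} - A_{2i+1}B_{2i+1}) + j\sum_{i=0}^{M-1}(A_{2i}B_{2i+1} + A_{2i+1}B_{2i}) \qquad (2.70)$$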
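The separation into a real accumulation of differences and an imaginary accumulation of sums can be illustrated in software. The sketch below assumes interleaved real arrays in the layout of Equation (2.69); the function name is hypothetical, and no conjugation is applied, matching the expression above.

```python
def complex_inner_product(A, B):
    """Inner product of complex vectors stored as interleaved real lists.

    C_i = A[2i] + j*A[2i+1] and D_i = B[2i] + j*B[2i+1].
    Returns (real_part, imag_part).
    """
    M = len(A) // 2
    real = sum(A[2 * i] * B[2 * i] - A[2 * i + 1] * B[2 * i + 1] for i in range(M))
    imag = sum(A[2 * i] * B[2 * i + 1] + A[2 * i + 1] * B[2 * i] for i in range(M))
    return real, imag


if __name__ == "__main__":
    # (5 + 10j)(8 + 2j) + (1 + 2j)(3 + 4j)
    print(complex_inner_product([5, 10, 1, 2], [8, 2, 3, 4]))
```

Each of the two sums has the form $\sum (AB \pm CD)$, which is why a single datapath with an add/subtract control can serve both the real and the imaginary part.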






Figure 2-12. The Real Part of the Complex Number Inner-Product







Figure 2-13. The Imaginary Part of Complex-Number Inner-Product
Defining a control signal Real_Img=1/0 for the $AB - CD$ / $AB + CD$ computation, the overall structure of a unified inner-product processor for $AB \pm CD$ is shown in Figure 2-14:
[Figure: 2's-complement inputs A and B, C and D; binary number partial product generators for AB and CD; RBPPG; RB sum of AB ± CD]
Figure 2-14. Unified RB IP Processor for $AB \pm CD$
Using the above structures, the real and imaginary parts of the complex number inner-products are computed in redundant binary form. Finally, an RB to 2’s-complement converter is needed to convert the redundant binary real and imaginary parts of the inner-product to 2’s-complement form, if required.
2.3 Inner-Product Computation Comparison
In this section, the computation time for real number inner-product processing is compared between the Texas Instruments TMS320C6000 series and the RB inner-product processor. The sample inner-product assembly code using TMS320C6000 [92] is shown in Figure 2-15:
        MVK   .S1  100, A1       ; set up loop counter
        ZERO  .L1  A7            ; zero out accumulator
LOOP:
        LDH   .D1  *A4++,A2      ; load ai from memory
        LDH   .D1  *A3++,A5      ; load bi from memory
        NOP   4                  ; delay slots for LDH
        MPY   .M1  A2,A5,A6      ; ai * bi
        NOP                      ; delay slot for MPY
        ADD   .L1  A6,A7,A7      ; sum += (ai * bi)
        SUB   .S1  A1,1,A1       ; decrement loop counter
  [A1]  B     .S2  LOOP          ; branch to loop
        NOP   5                  ; delay slots for branch
        ; Branch occurs here
Figure 2-15. An Example Code of Fixed-Point Inner-Product [92]
To analyze the execution clock cycles of this sample, a dependency graph is very 
useful. Dependency graphs can help analyze loops by showing the flow of instructions 
and data in an algorithm. These graphs also show how instructions depend on one 
another. The following terms are used in defining a dependency graph:
• A node is a point on a dependency graph with one or more data paths flowing in 
and/or out.
• The path shows the flow of data between nodes. The numbers beside each path represent the number of cycles required to complete the instruction.
• An instruction that writes to a variable is referred to as a parent instruction and defines a parent node.
• An instruction that reads a variable written by a parent instruction is referred to as its child and defines a child node.
Use the following steps to draw a dependency graph:
1) Define the nodes based on the variables accessed by the instructions.
2) Define the data paths that show the flow of data between nodes.
3) Add the instructions and latencies.
4) Add the functional units.
Figure 2-16 shows the dependency graph for the fixed-point inner-product










Figure 2-16. Dependency Graph of Fixed-Point Inner-Product [92]
Figure 2-16 provides the following observations:
• The two LDH instructions, which write the values of $a_i$ and $b_i$, are parents of the MPY instruction. Five cycles are needed for the parent (LDH) instruction. Therefore, if the LDH is scheduled on cycle $i$, then its child (MPY) cannot be scheduled until cycle $i + 5$.
• The MPY instruction, which writes the product $p_i$, is the parent of the ADD instruction. The MPY instruction takes two cycles to complete.
• The ADD instruction adds $p_i$ (the result of the MPY) to the sum. The output of the ADD instruction feeds back to become an input on the next iteration and, thus, creates a loop carry path.
The dependency graph for this inner-product algorithm has two separate parts
since the decrement of the loop counter and the branch do not read or write any variables 
from the other part. The loop counter graph shows the following:
• The SUB instruction writes to the loop counter, cntr. The output of the 
SUB instruction feeds back and creates a loop carry path.
• The branch (B) instruction is a child of the loop counter.
Executing this inner-product code serially requires 16 cycles for each iteration plus two cycles to set up the loop counter and initialize the accumulator; thus 100 iterations require 1602 cycles. For the fixed-point TMS320C62X (‘C62X) devices, which typically operate at a 200 MHz clock frequency (5 ns cycle time), 100 iterations require:
$$1602 \times 5\ \text{ns} = 8010\ \text{ns} \qquad (2.71)$$
For the RB inner-product processor, Figure 2-17 shows the structure to implement the real number inner-product computation. For an accurate and fair comparison, we use




[Figure: RB inner-product structure followed by an RB-to-2's-complement converter]
Figure 2-17. RB Inner-Product Implementation
A CMOS implementation of the RB multiplier [3] in a 0.5 μm fabrication process shows that a 54x54-bit multiplier achieves an 8.8 ns delay, which includes a 2.4 ns delay for the RB-to-2’s-complement converter. The actual delay of the RB multiplier itself is therefore only 6.4 ns. For comparison with the TMS320C62X processor, if a two-stage pipeline is used for the multiplication and the RB multiplier is employed in the TMS320C62X, then the clock cycle can be reduced to 6.4/2 = 3.2 ns. Assuming that all the other instruction operations (LDH, ADD, SUB, etc.) take the same time in the RB IP processor, then for 100 iterations of inner-product computations, the total time required is
$$1602 \times 3.2\ \text{ns} = 5126.4\ \text{ns} \qquad (2.72)$$
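The cycle and time figures follow from simple bookkeeping, sketched below in Python. The per-instruction cycle counts are read off the serial listing of Figure 2-15 (each NOP n occupying n cycles); the names in the code are hypothetical and the model ignores any overlap between instructions.

```python
# Cycles occupied by each line of the loop body when executed serially.
LOOP_BODY = [
    ("LDH", 1), ("LDH", 1), ("NOP 4", 4),
    ("MPY", 1), ("NOP", 1),
    ("ADD", 1), ("SUB", 1),
    ("B", 1), ("NOP 5", 5),
]
SETUP_CYCLES = 2  # MVK and ZERO before the loop


def serial_cycles(iterations):
    """Total cycles for the serial (unscheduled) inner-product loop."""
    per_iteration = sum(cycles for _, cycles in LOOP_BODY)
    return SETUP_CYCLES + iterations * per_iteration


def total_time_ns(iterations, cycle_ns):
    """Execution time in nanoseconds for a given clock period."""
    return serial_cycles(iterations) * cycle_ns


if __name__ == "__main__":
    print(serial_cycles(100))
    print(total_time_ns(100, 5))     # 200 MHz clock
    print(total_time_ns(100, 3.2))   # assumed 3.2 ns clock with the RB multiplier
```

Under these assumptions the speed-up comes entirely from the shorter clock period, since the cycle count is held fixed.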
Table 2-H shows the comparison result using the TMS320C62X and the RB inner-product processor for 100 iterations of inner-product computations:
Processor                       Time for 100 iterations
TMS320C62X                      8010 ns
RB Inner-Product Processor      5126.4 ns
2.4 Implementation of Unified Signed/Unsigned Multiplier
In this section, a unified signed/unsigned multiplier is developed using the RB inner-product core without and with Booth encoding. An unsigned binary number can be considered as a 2’s-complement number with an extra sign bit ‘0’ padded before the MSB (Most Significant Bit). For example, the unsigned binary number 10001111 can be considered as the signed binary number 010001111 with an extra sign bit ‘0’.
2.4.1 Unified Signed/Unsigned Multiplier Without Booth Coding
From Equation (1.6), a RB number, $A$, can be derived from the addition of a pair of 2’s-complement numbers. Thus for an unsigned $N \times N$ multiplier of $A \times B$, $N$ unsigned partial products are generated. These $N$ unsigned partial products are converted to $N+1$ signed partial products with an extra bit 0 padded before the MSB and are mapped to $N/2$ RB partial products with correction factors as shown in Figure 2-18.
[Figure: unsigned partial products with sign extension bits 0, mapped to redundant binary partial products; correction factors $\alpha_i$ = 0, -1, 0, -1]
Figure 2-18. Unsigned Multiplier with Partial Product Generation









Where $PP_{N-2}$ and $PP_{N-1}$ are 2’s-complement binary numbers, the sum of $PP_{N-2}$ and $PP_{N-1}$ is mapped into a RB number according to Equations (1.10), (1.11), (1.12) and Figure 1-14. The mapping structure is shown in Figure 2-19.
Figure 2-19. Mapping of $PP_{N-2}$ and $PP_{N-1}$ for Signed Multiplier into a RB Digit










Where $PQ_{N-2}$ and $PQ_{N-1}$ are 2’s-complement binary numbers, the sum of $PQ_{N-2}$ and $PQ_{N-1}$ is mapped into a RB number according to Equations (1.6), (1.7), (1.8) and Figure 1-13 with an extra correction factor -1. The mapping structure is shown in Figure 2-20.
Figure 2-20. Mapping of $PQ_{N-2}$ and $PQ_{N-1}$ for Signed Multiplier into a RB Digit
Define a control signal, SIGN, where SIGN=1 for signed multiplication and




[Figure: 2-to-1 MUX stages selecting the last partial product]
Figure 2-21. Circuit Realization of the Last Partial Product for Signed/Unsigned Multiplier
Thus, for the implementation of an $N$-bit unsigned multiplier, the correction factors are {0, -1, 0, -1, 0, -1, ..., -1}. That is:
$$\alpha_{N-2} = \alpha_{N-4} = \cdots = \alpha_2 = \alpha_0 = -1$$
$$\alpha_{N-1} = \alpha_{N-3} = \cdots = \alpha_3 = \alpha_1 = 0 \qquad (2.75)$$
For unsigned multiplication, all the correction factors except $\alpha_0 = -1$ can be added in the RB addition tree. Here the factor $\alpha_0$ is combined with the first partial product, PP0, to form PQ0, which is shown in Figure 2-22:
[Figure: partial product PP0 with sign extension, combined with the correction factor -1 to give partial product PQ0 for the unsigned multiplier]
Figure 2-22. First Partial Product PQ0 for Unsigned Multiplier
Table 2-I to Table 2-L are the truth tables for the first partial product PQ0 of the unsigned multiplier:
Table 2-1. Partial Product fgOg to PQO^_j for Unsigned Multiplier
First Partial Products PPo First Partial Products PQo
PPOq to PPO^ ^ P G O o to P % _ ,
0 1 1 0 1 1
Table 2-J. Partial Product f   ̂ for Unsigned Multiplier
First Partial Products PPq First Partial Products PQo
0 1 1 1 0
Table 2-K. Partial Product f  y for Unsigned Multiplier










and PQOj^ ĵ for Unsigned Multiplier






The logic equation of the partial product PQO for unsigned multiplier is as
follows:
~  P P ^ N  3 (2.76)




[Figure: 2-to-1 MUX stages selecting the bits of the first partial product]
Figure 2-23. Circuit of the First Partial Product $PP_0$ for Signed/Unsigned Multiplier
For the partial products $PQ_1$ and $PQ_{N-2}$ of the unsigned multiplier, an extra bit 0 is padded before the MSB. Figure 2-24 shows the circuit realization of the combined partial products from $PP_1$ to $PP_{N-2}$.
[Figure: 2-to-1 MUX controlled by SIGN]
Figure 2-24. Circuit of the Partial Products from $PP_1$ to $PP_{N-2}$ for Signed/Unsigned Multiplier
Figure 2-25 shows a unified signed/unsigned multiplier with the control signal
[Figure: signed/unsigned inputs A and B; 2:1 MUX stages; mapping blocks]
Figure 2-25. A Unified Signed/Unsigned Multiplier
Both the signed and unsigned multiplier have the same structure of the RB adder 
tree. The partial products are controlled to switch between signed and unsigned 
multiplication. Then the combined binary partial products are mapped to RB partial 
products and added using the RB adder tree to compute the signed/unsigned RB product.
2.4.2 Unified Signed/Unsigned Multiplier With Booth Coding
For the unsigned multiplier for $A$ and $B$,
$$A = \sum_{i=0}^{N-1} a_i2^{i} = -0 \times 2^{N} + a_{N-1}2^{N-1} + \sum_{i=0}^{N-2} a_i2^{i} = A'$$
$$B = \sum_{i=0}^{N-1} b_i2^{i} = b_{N-1}2^{N} - b_{N-1}2^{N-1} + \sum_{i=0}^{N-2} b_i2^{i} = b_{N-1}2^{N} + B' \qquad (2.77)$$
where $A'$ and $B'$ are 2’s-complement binary numbers.
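The decomposition can be checked numerically. The sketch below is an illustration with hypothetical function names: it reads the $N$-bit pattern of $B$ as a 2's-complement number $B'$, so that $B = b_{N-1}2^{N} + B'$, and rebuilds the unsigned product from the signed product $A \cdot B'$ plus the term $b_{N-1}A\,2^{N}$ that this split implies.

```python
def split_unsigned(b, n):
    """Return (msb, b_prime) with b == msb * 2**n + b_prime.

    b_prime is the value of the same n-bit pattern read as 2's-complement.
    """
    msb = (b >> (n - 1)) & 1
    b_prime = b - msb * 2 ** n
    return msb, b_prime


def unsigned_product(a, b, n):
    """Unsigned n x n product built from a signed product plus a correction."""
    msb, b_prime = split_unsigned(b, n)
    signed_part = a * b_prime           # A' * B' with A' = A (sign bit 0)
    correction = (a * msb) << n         # sum of a_i * b_{N-1} * 2**i * 2**N
    return signed_part + correction


if __name__ == "__main__":
    print(split_unsigned(0b10001111, 8))
    print(unsigned_product(200, 0b10001111, 8))
```

Only the multiplier needs the correction term here, because the multiplicand is handled by padding a zero sign bit.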





The product $A'B'$ using Booth encoding was previously discussed in Section 2.1.3. The extra $\sum_{i=0}^{N-1} a_i b_{N-1} 2^{i} 2^{N}$ value can be combined with the correction factors $\gamma_i$. The new correction factors are
r ! - o  
r î  = 4+14 4-1 +4+14 4-i + 4+i4 4-i
= 4+1 44-1 (2.79)
y 1+1 4+3 4+24+1
r,;. = 1
Nwhere / = 0,2,4,...— - 2  and
7i —^Pn-\ 7i =1 (2.80)
where N ,N  + \ , . . . ,2 N - \ .
2.5 The Implementation of a Unified Signed/Unsigned Inner-Product Processor for $AB \pm CD$
The overall structure of a unified signed/unsigned inner-product processor is described for the computations of $AB + CD$ and $AB - CD$. In Section 2.2.3.1 and Figure 2-11, the $AB - CD$ implementation was developed using the structure for $AB + CD$ with inverters added in the RBA tree, and a unified structure for $AB \pm CD$ was developed. In Section 2.4, a unified signed/unsigned multiplier was developed using RB representations. Define two control signals, SIGN=1/0 for signed/unsigned multiplication, and Real_Img=1/0 for the $AB - CD$ / $AB + CD$ computation. Then the overall structure of a unified signed/unsigned inner-product processor for $AB \pm CD$ is shown in Figure 2-26, where the $PQ_0$ to $PQ_{N-1}$ of $AB$ and $CD$ refer to the unsigned partial products discussed in Section 2.4.
[Figure: signed/unsigned inputs A and B, C and D; binary number partial product generators for AB and CD; 2:1 MUX stages; RB partial product mapping; RB sum of AB ± CD]
Figure 2-26. Unified Signed/Unsigned IP Processor for $AB \pm CD$
2.6 The Implementation of a Redundant Binary Multiplier
Currently, numerous floating-point unit designs incorporating a fast multiplier make iterative use of the multiplier for implementing fast algorithms for division, square root, and/or transcendental function computations by extended polynomial approximation [107]-[112]. If multipliers are to be used iteratively for RB computations, it is advantageous for the multiplier to accept redundant binary coded input directly, in addition to the initial 2’s-complement numbers. A multiplier capable of accepting both 2’s-complement and RB inputs avoids the excessive RB to 2’s-complement conversion delay. To our knowledge, no prior multiplier design exists with this capability. Recently a new floating point arithmetic unit was proposed [113]. A redundant number system is used to achieve IEEE compliant results. All operations in the arithmetic units are carried out in redundant form, with conversion back to the standard IEEE format performed only when an operand is written to memory. In [113], it is argued that the proposed floating point unit could achieve better performance across all of the required functions. In all these cases, a fast multiplier that can accept either 2’s-complement or RB inputs is advantageous, i.e., the multiplicand and multiplier may both be redundant binary numbers with the product produced in redundant binary form, as shown in Figure 2-27:
[Figure: multiplier with inputs $X_{2C}$ or $X_{RB}$ and $Y_{2C}$ or $Y_{RB}$]
Figure 2-27. A RB Multiplier Diagram
2.6.1 Direct Implementation of Redundant Binary Multiplier
To implement the dual input multiplier, we first consider a RB multiplier. Figure 
2-28 shows an example of RB multiplication.


















Figure 2-28. An Example of RB Multiplication
The RB partial product is generated according to Table 2-M, where $\alpha_i$ and $\beta_i$ are
the RB signed digits of $A_{RB}$ and $B_{RB}$, respectively.











Encoding the RB digits as 1=(1,1), 0=(0,1)=(1,0), -1=(0,0), Table 2-N shows the
encoded RB partial products.
Table 2-N. Encoded RB Partial Product Generation
$\alpha_i$            $\beta_i$             $P_i = \alpha_i \beta_i$
-1 (0,0)         -1 (0,0)          1 (1,1)
-1 (0,0)          0 (0,1)(1,0)     0 (0,1)(1,0)
-1 (0,0)          1 (1,1)         -1 (0,0)
 0 (0,1)(1,0)    -1 (0,0)          0 (0,1)(1,0)
 0 (0,1)(1,0)     0 (0,1)(1,0)     0 (0,1)(1,0)
 0 (0,1)(1,0)     1 (1,1)          0 (0,1)(1,0)
 1 (1,1)         -1 (0,0)         -1 (0,0)
 1 (1,1)          0 (0,1)(1,0)     0 (0,1)(1,0)
 1 (1,1)          1 (1,1)          1 (1,1)
From Table 2-N, Equation (2.81) is derived to find the encoded RB partial
product:
p r  or; or; #+
(281)
=orr yg: +crr y?/
Another way to implement a RB multiplier is to use the RB inline inner-product
processor core. In Sections 2.4 and 2.5, the implementation of $A_0B_0 + A_1B_1$ and
$A_0B_0 - A_1B_1$ for both 2’s-complement and unsigned numbers was discussed. Here the
reuse of these cores to implement the RB multiplier is investigated.
2.6.2 Redundant Binary Multiplier Implementation Using Inner- 
Product Processor
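Let $Z_{RB} = \sum_{i=0}^{N-1} z_i 2^i$, where $Z_{RB}$ is a RB number and $z_i$ is encoded as two
binary bits $(z_i^+, z_i^-)$ (refer to Table 1-B). In this research, the RB encoding is
$z_i = z_i^+ - \overline{z_i^-}$. Therefore,

$$Z_{RB} = \sum_{i=0}^{N-1} z_i 2^i = \sum_{i=0}^{N-1} \left(z_i^+ - \overline{z_i^-}\right) 2^i = \sum_{i=0}^{N-1} z_i^+ 2^i - \sum_{i=0}^{N-1} \overline{z_i^-}\, 2^i \qquad (2.82)$$

Let $Z^+ = \sum_{i=0}^{N-1} z_i^+ 2^i$ and $Z^- = \sum_{i=0}^{N-1} \overline{z_i^-}\, 2^i$; then

$$Z_{RB} = Z^+ - Z^- \qquad (2.83)$$

where $Z^+$ and $Z^-$ are unsigned binary numbers.
For two RB numbers, $A_{RB}$ and $B_{RB}$, we have:

$$A_{RB} = \sum_{i=0}^{N-1} \alpha_i 2^i, \quad A^+ = \sum_{i=0}^{N-1} \alpha_i^+ 2^i \quad \text{and} \quad A^- = \sum_{i=0}^{N-1} \overline{\alpha_i^-}\, 2^i$$
$$B_{RB} = \sum_{i=0}^{N-1} \beta_i 2^i, \quad B^+ = \sum_{i=0}^{N-1} \beta_i^+ 2^i \quad \text{and} \quad B^- = \sum_{i=0}^{N-1} \overline{\beta_i^-}\, 2^i \qquad (2.84)$$

$$A_{RB} B_{RB} = (A^+ - A^-)(B^+ - B^-) = (A^+B^+ - A^-B^+) + (A^-B^- - A^+B^-)$$

where $A^+$, $A^-$, $B^+$ and $B^-$ are all unsigned binary numbers, so the product of the RB
numbers $A_{RB}$ and $B_{RB}$ can be realized using two unsigned binary computations,
$A^+B^+ - A^-B^+$ and $A^-B^- - A^+B^-$. The diagram is shown in Figure 2-29: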
Figure 2-29. Implementation of RB Multiplier
Using this method to split the RB encoding bits and utilizing the unsigned feature
of the multiplier (see Figure 2-26), the basic IP computing core will generate RB
products with RB multiplicand and multiplier inputs.
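The bit-splitting identity above can be checked numerically. The following is a minimal software sketch, not the hardware datapath: it assumes digits are stored LSB-first as (plus, minus) bit pairs with the encoding 1=(1,1), 0=(0,1)=(1,0), -1=(0,0), and the function names `split_rb`, `rb_value`, `rb_multiply` and `check_all` are hypothetical.

```python
from itertools import product

# Digit encodings stated in the text: 1 = (1,1), 0 = (0,1) or (1,0), -1 = (0,0).
ENCODINGS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def split_rb(pairs):
    """Split LSB-first (plus, minus) bit pairs into the unsigned numbers Z+ and Z-."""
    z_plus = sum(p << i for i, (p, _) in enumerate(pairs))
    # Z- collects the complemented minus bits.
    z_minus = sum((1 - m) << i for i, (_, m) in enumerate(pairs))
    return z_plus, z_minus


def rb_value(pairs):
    """Value of the RB number: digit i equals plus - NOT(minus), weighted by 2**i."""
    return sum((p - (1 - m)) * (1 << i) for i, (p, m) in enumerate(pairs))


def rb_multiply(a_pairs, b_pairs):
    """Product formed from two unsigned difference computations."""
    a_plus, a_minus = split_rb(a_pairs)
    b_plus, b_minus = split_rb(b_pairs)
    return (a_plus * b_plus - a_minus * b_plus) + (a_minus * b_minus - a_plus * b_minus)


def check_all(num_digits=3):
    """Compare the split product with the direct product for every encoding."""
    operands = list(product(ENCODINGS, repeat=num_digits))
    return all(rb_multiply(a, b) == rb_value(a) * rb_value(b)
               for a in operands for b in operands)
```

Calling `check_all(3)` exercises every pair of 3-digit operands, including both encodings of the digit 0, so the identity does not depend on which zero encoding a digit happens to carry.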
2.7 Redundant Binary Inner-Product Computation
With the RB multiplier developed in Section 2.6, an inner-product processor that
accepts RB inputs is easily designed. For example, Figure 2-30 shows the implementation
structure to find the inner-product AB + XA, where all four operands are RB numbers.
Figure 2-30. IP Implementation for RB Number AB + XA
Chapter 3 Implementations of Division Method
3.1 Division Algorithm Review
The following notation is used in the discussion of division algorithms:
Z   Dividend                        $z_{2N-1} z_{2N-2} \cdots z_1 z_0$
D   Divisor                         $d_{N-1} d_{N-2} \cdots d_1 d_0$
Q   Quotient                        $q_{N-1} q_{N-2} \cdots q_1 q_0$
S   Remainder $[Z - (D \times Q)]$      $s_{N-1} s_{N-2} \cdots s_1 s_0$
Division algorithms can generally be divided into the following classes: digit
recurrence (restoring or non-restoring), functional iteration, table look-up and variable
latency. The basis for these classes is the difference in the hardware operations used in
their implementations, such as multiplication, subtraction, and table look-up. Many
practical division algorithms are not pure forms of a particular class but rather are
combinations of multiple classes. For example, a high-performance algorithm may use
table look-up to gain an initial approximation of the reciprocal, then use a functional
iteration algorithm to converge quadratically to the quotient. Table look-up alone may be
impractical for general applications, because the required RAM size increases
exponentially with the word length of the divisor. The variable latency method results in
a complex design for the control circuit and requires an asynchronous design method,
because its latency depends on the value of the divisor. The two most popular division
methods are digit recurrence and functional iteration.
Digit recurrence is the oldest class of high-speed division algorithms and, as a
result, a significant quantity of literature exists proposing digit recurrence algorithms,
implementations, and techniques. The most common implementation of digit recurrence
division in modern processors was named SRT division by Freiman [81], taking its name
from the initials of Sweeney, Robertson, and Tocher, who developed the algorithm
independently at approximately the same time. Atkins [78] did fundamental research on
division by digit recurrence, which was the first major analysis of SRT algorithms. Tan
[89] derived and presented the theory of high-radix SRT division and an analytic method
of implementing SRT look-up tables. Ercegovac and Lang [79] presented a
comprehensive treatment of division by digit recurrence. Kuninobu [83], Aoki [77], and
Srinivas [88] investigated the digit-recurrence division method with the redundant binary
representation of the remainders. The basic equation of the digit-recurrence division
method in radix r is:
(3.1)
Digit recurrence algorithms deal with how to represent the remainder and
quotient, how to choose the quotient digits, and the choice of radix. Convergence of
digit recurrence is linear and has order N. A high-performance, quadratically convergent
alternative is functional iteration, which includes the Goldschmidt [82] and
Newton-Raphson [85] methods. Both methods first find the reciprocal and then use
multiplication to compute the quotient. The functional iteration method is discussed
below.
To compute the ratio $Q = Z/D$, one can repeatedly multiply Z and D by a
sequence of M multipliers $X_0, X_1, \ldots, X_{M-1}$:

$$Q = \frac{Z}{D} = \frac{Z X_0 X_1 \cdots X_{M-1}}{D X_0 X_1 \cdots X_{M-1}} \qquad (3.2)$$

If this is done in such a way that the denominator $D X_0 X_1 \cdots X_{M-1}$ converges to 1, the
numerator $Z X_0 X_1 \cdots X_{M-1}$ will converge to Q. This process does not yield a remainder,
but the remainder S (if needed) can be computed, via an additional multiplication and a
subtraction, using $S = Z - QD$.
To perform division based on the preceding idea, we face two questions:
1. How should we select the multipliers $X_i$ so that the denominator does in fact
converge to 1?
2. Given a selection rule for the multipliers $X_i$, how many iterations are needed?
In the following discussion, we answer these questions in turn, but first, we
formulate this process as a convergence computation.
Assume a bit-normalized fractional divisor, D, and dividend, Z, in [1/2, 1). If this
condition is not satisfied initially, it can be made to hold by appropriately shifting Z
and/or D. The corresponding convergence computation is formulated as follows [82]:

$$D_{i+1} = D_i X_i \qquad \text{Set } D_0 = D; \text{ make } D_M \text{ converge to } 1$$
$$Z_{i+1} = Z_i X_i \qquad \text{Set } Z_0 = Z; \text{ obtain } Z/D = Q \approx Z_M \qquad (3.3)$$

We now answer the first question posed above by selecting

$$X_i = 2 - D_i \qquad (3.4)$$

This choice transforms the recurrence equations into:

$$D_{i+1} = D_i (2 - D_i) \qquad \text{Set } D_0 = D; \text{ iterate until } D_M \approx 1$$
$$Z_{i+1} = Z_i (2 - D_i) \qquad \text{Set } Z_0 = Z; \text{ obtain } Z/D = Q \approx Z_M \qquad (3.5)$$

Thus, each iteration consists of determining the 2’s-complement of $D_i$ and two
multiplications by the result, $2 - D_i$.
Now to address the second question: How quickly does $D_i$ converge to 1? In
other words, how many multiplications are required to perform division? Noting that

$$D_{i+1} = D_i (2 - D_i) = 1 - (1 - D_i)^2 \qquad (3.6)$$

it is concluded that [82]:

$$1 - D_{i+1} = (1 - D_i)^2 \qquad (3.7)$$

Thus, if $D_i$ is already close to 1 (i.e., $1 - D_i \le \varepsilon$), $D_{i+1}$ will be even closer to 1
(i.e., $1 - D_{i+1} \le \varepsilon^2$). This property is known as quadratic convergence and leads to a
logarithmic number, M, of iterations to complete the process.
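The recurrence of Equation (3.5) and the squaring of the gap $1 - D_i$ in Equation (3.7) can be illustrated in a few lines. This is a minimal sketch in exact rational arithmetic, assuming operands already normalized to [1/2, 1); the name `goldschmidt` is hypothetical and the sketch ignores truncation and the RB representation.

```python
from fractions import Fraction


def goldschmidt(z, d, iterations):
    """Return (Z_M, D_M) after the given number of Goldschmidt iterations."""
    for _ in range(iterations):
        x = 2 - d                 # multiplier X_i = 2 - D_i
        z, d = z * x, d * x       # both products use the same multiplier
    return z, d


# Example: Z = 3/5, D = 3/4, so Q = Z/D = 4/5 and 1 - D_0 = 1/4.
z_m, d_m = goldschmidt(Fraction(3, 5), Fraction(3, 4), 3)
gap = 1 - d_m                     # equals (1/4)**(2**3) by Equation (3.7)
```

Because the ratio $Z_i / D_i$ is preserved by every iteration, `z_m` equals `Fraction(4, 5) * d_m` exactly, and the remaining quotient error is Q times the gap.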
Another way to compute $Q = Z/D$ is to first find $1/D$ and then multiply the result
by Z. If several divisions by the same divisor D need to be performed, this method [85] is
particularly efficient. One method for computing $1/D$ is based on the Newton-Raphson
iteration to determine a root of $f(x) = 0$. We start with some initial estimate $X_0$ for the
root and then iteratively refine the estimate using the recurrence:

$$X_{i+1} = X_i - \frac{f(X_i)}{f'(X_i)} \qquad (3.8)$$

where $f'(X_i)$ is the derivative of $f(x)$ evaluated at $X_i$. To apply the Newton-Raphson
method to reciprocation, we use $f(x) = 1/x - D$, which has a root at $x = 1/D$. Then
$f'(x) = -1/x^2$, leading to the recurrence

$$X_{i+1} = X_i (2 - X_i D) \qquad (3.9)$$

Computationally, two multiplications and a 2’s-complement step are required per
iteration.
Let $\delta_i = 1/D - X_i$ be the error at the ith iteration. Then:

$$\delta_{i+1} = \frac{1}{D} - X_{i+1} = \frac{1}{D} - X_i (2 - X_i D) = D \left(\frac{1}{D} - X_i\right)^2 = D\, \delta_i^2 \qquad (3.10)$$

Since $D < 1$, we have $\delta_{i+1} < \delta_i^2$; thus this functional iteration based upon
Newton-Raphson converges quadratically.
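The reciprocal recurrence $X_{i+1} = X_i(2 - X_i D)$ and its error behavior can be sketched as follows. This is an illustrative fragment in exact rational arithmetic; the name `newton_reciprocal` and the starting estimate of 1 are assumptions made for the example.

```python
from fractions import Fraction


def newton_reciprocal(d, x0, iterations):
    """Iterate X <- X(2 - X D) and record the error 1/D - X after each step."""
    x = x0
    errors = [1 / d - x]
    for _ in range(iterations):
        x = x * (2 - x * d)       # two multiplications and one complement step
        errors.append(1 / d - x)
    return x, errors


x, errors = newton_reciprocal(Fraction(3, 4), Fraction(1), 4)
```

Each recorded error satisfies `errors[i + 1] == d * errors[i] ** 2`, which is the quadratic convergence relation: for D below 1, every new error is smaller than the square of the previous one.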
3.2 Further Studies of the Goldschmidt and Newton- 
Raphson Methods
In this section, the Goldschmidt and Newton-Raphson algorithms are compared and
studied. We show that these two methods are theoretically equivalent, although they are
often treated separately in the literature. Further studies of the Goldschmidt
method are presented. Next, the RB inner-product processor core is investigated for
performing the division computations for both real and complex numbers. We show how
to control and/or reconfigure the RB inner-product processor to provide high-performance
division.
3.2.1 Comparison of the Goldschmidt and Newton-Raphson 
Methods
For the initial divisor, D, and dividend, Z, the Goldschmidt iteration equations are:

$$D_{i+1} = D_i (2 - D_i), \qquad \text{Set } D_0 = D; \text{ iterate until } D_M \approx 1$$
$$Z_{i+1} = Z_i (2 - D_i), \qquad \text{Set } Z_0 = Z; \text{ obtain } Z/D = Q \approx Z_M \qquad (3.11)$$

where $D_i$ is the iterated divisor, $Z_i$ is the iterated dividend, and Q is the quotient.
For the Newton-Raphson method, the iteration equations are:

$$X_{i+1} = X_i (2 - X_i D) \qquad (3.12)$$

where $X_i$ is the approximate reciprocal of D.
Multiplying both sides of Equation (3.12) by Z, we have:

$$Z X_{i+1} = Z X_i (2 - X_i D) \qquad (3.13)$$

Comparing Equation (3.13) with (3.11), we notice that in Equation (3.13), $Z X_i$ is
the approximate quotient, which moves closer to Z/D after each iteration. Therefore,

$$Z_i = Z X_i \qquad (3.14)$$

Then Equation (3.13) becomes:

$$Z_{i+1} = Z X_i (2 - X_i D) = Z_i (2 - X_i D) \qquad (3.15)$$

If we define $D_i = X_i D$, then Equation (3.15) is:

$$Z_{i+1} = Z_i (2 - X_i D) = Z_i (2 - D_i) \qquad (3.16)$$

Note that

$$D_{i+1} = X_{i+1} D = X_i D (2 - X_i D) = D_i (2 - D_i) \qquad (3.17)$$

and
(3.18)
Letting $X_0 = 1$, then

$$D_0 = X_0 D = D, \qquad Z_0 = Z X_0 = Z \qquad (3.19)$$
Under the condition $X_0 = 1$, Equations (3.12) and (3.13) are equivalent to
Equation (3.11). However, from the standpoint of implementation, the Goldschmidt and
Newton-Raphson methods are different. For the Goldschmidt method, as shown in Figure
3-1, two parallel multiplications plus two complement operations are required. For the
Newton-Raphson method, as shown in Figure 3-2, two sequential multiplications and one
complement operation are required. Therefore, the critical time delay in the
Newton-Raphson method is two multiplications plus one addition, while for the Goldschmidt
method, only one multiplication step (parallel multiplications) and one addition are
necessary. As far as the implementation area is concerned, the Goldschmidt method
needs only one extra complement operation to implement. From this, we conclude that






Figure 3-1. Goldschmidt Divider Implementation [82]
Figure 3-2. Newton-Raphson Divider Implementation [85]
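The equivalence argued in Section 3.2.1 can be checked step by step: with the Newton-Raphson estimate started at 1, the Goldschmidt state should satisfy $Z_i = Z X_i$ and $D_i = X_i D$ after every iteration. The sketch below is only a numerical illustration in exact rational arithmetic; the function name `iterates_match` is hypothetical.

```python
from fractions import Fraction


def iterates_match(z, d, iterations):
    """Run both recurrences side by side and compare them after every step."""
    z_i, d_i = z, d               # Goldschmidt state: Z_0 = Z, D_0 = D
    x_i = Fraction(1)             # Newton-Raphson state: X_0 = 1
    for _ in range(iterations):
        # Goldschmidt: both updates use the same multiplier 2 - D_i (parallel).
        z_i, d_i = z_i * (2 - d_i), d_i * (2 - d_i)
        # Newton-Raphson: the second multiplication depends on the first (sequential).
        x_i = x_i * (2 - x_i * d)
        if z_i != z * x_i or d_i != d * x_i:
            return False
    return True
```

The comments mark the implementation difference discussed in the text: the two Goldschmidt products are independent of each other, while the Newton-Raphson products form a chain.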
3.2.2 Further Discussion of the Goldschmidt Method
In [85], it is claimed that the number of clock cycles for the Goldschmidt division
method is $\log_2 N$, where N is the word length of the dividend and divisor. Further study
of our circuit implementations shows that the actual number of clock cycles needed to
achieve precise accuracy in the LSB of the quotient is $\log_2 N + 1$. In the implementation
of the Goldschmidt method, two additional guard bits are required to get a quotient
precision of N bits.
After $M = \log_2 N$ iterations,

$$D_M = 1 - 2^{-2^M} \qquad (3.20)$$

so,

$$Q = \frac{Z_M}{D_M} \qquad (3.21)$$

and,

$$Z_M = Q \left(1 - 2^{-2^M}\right) = Q \left(1 - 2^{-N}\right) \qquad (3.22)$$

The error between the actual quotient and the untruncated quotient is

$$\varepsilon_1 = |Q - Z_M| = |Q - Q(1 - 2^{-N})| = Q\, 2^{-N} \qquad (3.23)$$

For $Z < D$, we have $Q < 1$, so $\varepsilon_1 < 2^{-N}$.
The approximate quotient is taken by truncating $Z_M$ to N bits as $(Z_M)_N$, so

$$\varepsilon_2 = |Z_M - (Z_M)_N| < 2^{-N} \qquad (3.24)$$

The error between the actual quotient and the computed one is:

$$\varepsilon = |Q - (Z_M)_N| \le |Q - Z_M| + |Z_M - (Z_M)_N| \qquad (3.25)$$
$$= 2^{-N} + 2^{-N} = 2^{-N+1}$$

i.e., $\varepsilon < 2^{-N+1}$.
After $M = \log_2 N$ iterations, the precision of the quotient found from the
Goldschmidt method is N-1 bits. In order to reach a precision of N bits, $\log_2 N + 1$
iterations are needed.
If we want the computation error of the division using the RB IP
processor to satisfy $\varepsilon < 2^{-N}$, then from Equations (3.23), (3.24) and (3.25), the
following conditions must be satisfied:

$$\varepsilon_1 < 2^{-N-1} \quad \text{and} \quad \varepsilon_2 < 2^{-N-1} \qquad (3.26)$$

That is, the dividend $Z_i$ and divisor $D_i$ during the iteration must be truncated to
N+1 bits instead of N bits, so one extra guard bit is required for the iteration.
In the same way, if the computation error of the division is required to meet
$\varepsilon < 2^{-(N+1)}$, then two guard bits are required to compute the quotient using the RB IP
processor.
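The two error contributions discussed in this section can be measured directly. The sketch below assumes exact arithmetic during the iterations and a single truncation of the final result to N fractional bits, which is the situation of Equations (3.23) to (3.25); it does not model truncation inside each iteration. The names `goldschmidt_state`, `truncate` and `error_terms` are hypothetical, and N is assumed to be a power of two.

```python
from fractions import Fraction
from math import floor


def goldschmidt_state(z, d, iterations):
    """Exact Goldschmidt iterations; returns (Z_M, D_M)."""
    for _ in range(iterations):
        x = 2 - d
        z, d = z * x, d * x
    return z, d


def truncate(value, n_bits):
    """Drop everything below the n-th fractional bit."""
    return Fraction(floor(value * 2 ** n_bits), 2 ** n_bits)


def error_terms(z, d, n_bits):
    """Return (1 - D_M, eps1, eps2) for M = log2(N) iterations."""
    m = n_bits.bit_length() - 1            # log2(N) when N is a power of two
    q = z / d
    z_m, d_m = goldschmidt_state(z, d, m)
    eps1 = abs(q - z_m)                    # convergence error
    eps2 = abs(z_m - truncate(z_m, n_bits))  # truncation error
    return 1 - d_m, eps1, eps2
```

For N = 8 and the smallest normalized divisor D = 1/2, the gap `1 - D_M` after three iterations is exactly 1/256, which is the worst case; every larger divisor in [1/2, 1) leaves a smaller gap.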
3.2.3 Implementation of the Goldschmidt Division
Here we explore how to implement the Goldschmidt division method using
the RB inner-product structure. From Equation (3.5), in order to implement a high-speed
divider, all the intermediate dividends and divisors are kept in RB form. Therefore, a
RB-complement operation, $2 - A_{RB}$, similar to the 2’s-complement operation, and a RB
multiplier must be developed. In Section 2.6, the RB multiplier is studied using the RB IP
structure, so only the RB-complement operation requires development.
Let a real number A with precision N be represented in RB form, that is:

$$A = \sum_{i=-L}^{N-L-1} \alpha_i 2^i = \sum_{i=0}^{N-L-1} \alpha_i 2^i + \sum_{i=-L}^{-1} \alpha_i 2^i \qquad (3.27)$$

where $\alpha_i \in \{-1, 0, 1\}$; then

$$2 - A = 2 - \left(\sum_{i=0}^{N-L-1} \alpha_i 2^i + \sum_{i=-L}^{-1} \alpha_i 2^i\right) = (0010)_2 + \sum_{i=0}^{N-L-1} (-\alpha_i) 2^i + \sum_{i=-L}^{-1} (-\alpha_i) 2^i \qquad (3.28)$$

Using the RB coding system, 1=(1,1), 0=(1,0)=(0,1) and -1=(0,0), notice that if
$\alpha_i$ is encoded as $(\alpha_i^+, \alpha_i^-)$, then $-\alpha_i$ will be encoded as
$(\overline{\alpha_i^+}, \overline{\alpha_i^-})$. The implementation of $2 - A_{RB}$ is shown in Figure 3-3.
Figure 3-3. Implementation of $2 - A_{RB}$
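The digit-negation rule used for the RB-complement can be exercised in software. The sketch below assumes LSB-first (plus, minus) bit pairs with L fractional digits, and it evaluates $2 - A$ as a rational value rather than performing the final RB addition in hardware; the names `negate_rb`, `rb_value` and `two_minus` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def negate_rb(pairs):
    """Negate every digit by complementing both encoding bits."""
    return [(1 - p, 1 - m) for p, m in pairs]


def rb_value(pairs, frac_bits):
    """Value of LSB-first (plus, minus) pairs with frac_bits fractional digits."""
    # Digit value is plus + minus - 1: (1,1) -> 1, (0,0) -> -1, mixed -> 0.
    return sum(Fraction((p + m - 1) * 2 ** i, 2 ** frac_bits)
               for i, (p, m) in enumerate(pairs))


def two_minus(pairs, frac_bits):
    """Value of 2 - A, formed as the constant 2 plus the digit-wise negation of A."""
    return 2 + rb_value(negate_rb(pairs), frac_bits)


# A = 1.25 with two fractional digits, LSB first: digits 1, 0, 1.
example = [(1, 1), (1, 0), (1, 1)]
```

Here `two_minus(example, 2)` evaluates to 3/4. The point of the rule is that negation costs only bitwise inversion, with no carry propagation.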
3.3 Real Number Division Implementation
First, the dividend Z and the divisor D are normalized to satisfy $Z, D \in [0.5, 1)$.
For the normalization circuit, see references [116]-[118].
Then for the first iteration,

$$D_1 = D_0 (2 - D_0) \qquad \text{Set } D_0 = D$$
$$Z_1 = Z_0 (2 - D_0) \qquad \text{Set } Z_0 = Z \qquad (3.29)$$

where $Z_0$ and $D_0$ are both 2’s-complement numbers. From Equation (1.9), $Z_0$ and $D_0$ can
be mapped into RB digit numbers $(Z_0)_{RB}$ and $(D_0)_{RB}$ directly. Thus a RB multiplier which
can accept the inputs $(D_0)_{RB}$, $(Z_0)_{RB}$, and $(2 - D_0)_{RB}$ can realize the first iteration, as
shown in Figure 3-4.
Figure 3-4. First Iteration Implementation of the Goldschmidt Division
Notice that the outputs $Z_1$ and $D_1$ of the first iteration are redundant binary
numbers. Thus the successive iterations can be implemented as shown in Figure 3-5.
Figure 3-5. Implementation of Successive Iteration Computation for Z and D
After $\log_2 N + 1$ iterations are carried out, where N is the number of bits of precision,
a RB-to-2’s-complement converter is required to convert the quotient back to
2’s-complement, if required. Four unified structures of AB ± CD are required to realize the
Figure 3-6. Overall Structure of Divider Using RB IP Processor
3.4 Comparison of the Implementations of Division
The time required for division is compared between the Pentium
processor and the division implementation using a RB processor. Division implemented
on the Pentium processor uses the SRT method. The 8-bit unsigned division implemented
on the Pentium requires 17 clock cycles [85],[114],[115]. If the VLSI fabrication in [3] is
used to realize the RB inner-product processor, then one iteration of the RB
multiplication requires 8.8 - 2.4 = 6.4 ns [3]. For 8-bit division, 4 iterations are required.
The total time required for 8-bit division using the RB IP processor is:

$$(\log_2 8 + 1) \times 6.4 = 25.6\ \text{ns} \qquad (3.30)$$

If this division implementation result is compared to the Pentium processor, the
equivalent clock cycle is $25.6 / 17 \approx 1.506$ ns, which corresponds to a Pentium clock
frequency of about 660 MHz.
3.5 Complex Number Division Implementation
The quotient of two complex numbers is found as

$$\frac{A + jB}{C + jD} = \frac{(A + jB)(C - jD)}{(C + jD)(C - jD)} = \frac{AC + BD}{C^2 + D^2} + j\, \frac{BC - AD}{C^2 + D^2} \qquad (3.31)$$

For the implementation of a complex number divider, $AC + BD$, $BC - AD$ and $C^2 + D^2$
need to be computed. These computations can be realized by the unified signed/unsigned
AB ± CD IP structure. For $C^2 + D^2$, the two operands of each product are set equal
(A = B and C = D in the AB ± CD structure). Notice that the outputs are in
RB form for $AC + BD$, $BC - AD$ and $C^2 + D^2$. For the Goldschmidt division method,
both the dividend and the divisor need to be normalized. To normalize a RB number, a
RB-to-2’s-complement converter is required to convert it back to 2’s-complement. A
normalization circuit is then required to normalize $AC + BD$, $BC - AD$ and $C^2 + D^2$ into
[0.5, 1). The diagram is shown in Figure 3-7.
Figure 3-7. Complex-Number Division Implementation Initial Process
Following the derivation of the normalized 2’s-complement values of $AC + BD$,
$BC - AD$ and $C^2 + D^2$ in the first iteration, the real-number division implementation
procedure is utilized to develop the quotient. Six blocks of the unified IP structure
AB ± CD are needed to compute the complex number division, since the implementation
of the divisors for both $(AC + BD)/(C^2 + D^2)$ and $(BC - AD)/(C^2 + D^2)$ can share the same
computing structure for $C^2 + D^2$. For normalization circuits, see [116]-[118].
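The data flow of the complex division can be sketched in software: three inner products, a power-of-two normalization of the shared divisor, and two Goldschmidt divisions. This is a simplified illustration under stated assumptions, not the circuit: it scales both numerators by the same factor as the divisor instead of normalizing each value separately and storing shifts and signs, it uses exact rational arithmetic, and the names `goldschmidt` and `complex_divide` are hypothetical.

```python
from fractions import Fraction


def goldschmidt(z, d, iterations):
    """Goldschmidt quotient estimate for a divisor d in [1/2, 1)."""
    for _ in range(iterations):
        x = 2 - d
        z, d = z * x, d * x
    return z


def complex_divide(a, b, c, d, iterations=6):
    """Approximate (a + jb) / (c + jd) as (real, imaginary)."""
    re_num = Fraction(a * c + b * d)      # AC + BD
    im_num = Fraction(b * c - a * d)      # BC - AD
    den = Fraction(c * c + d * d)         # C^2 + D^2, shared by both divisions
    if den == 0:
        raise ZeroDivisionError("divisor is zero")
    # Power-of-two scaling brings the divisor into [1/2, 1); the quotient is
    # unchanged because the numerators are scaled by the same factor.
    scale = Fraction(1)
    while den * scale >= 1:
        scale /= 2
    while den * scale < Fraction(1, 2):
        scale *= 2
    den *= scale
    return (goldschmidt(re_num * scale, den, iterations),
            goldschmidt(im_num * scale, den, iterations))


re_part, im_part = complex_divide(1, 2, 3, 4)   # exact quotient is 11/25 + j 2/25
```

Both divisions run on the same normalized divisor, which mirrors the observation in the text that the real and imaginary quotients can share the computing structure for $C^2 + D^2$.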
Chapter 4 Computational Extensions
The inner-product structures described in Chapter 2 can be extended to provide a
rich set of real, complex, RB and mixed real, complex and RB number computations. The
inline partial product method [39] allows more extensions than the cross partial product
scheme and will be used for illustrating added capabilities. Together with the basic
inner-product operation, the computational capabilities afforded can be implemented
using control signals or accomplished with circuit reconfiguration if configurable
hardware is used. All of these extended computational capabilities are targeted for
implementation in a Complex Arithmetic Signal Processor (CASP).
Referring to Figure 4-1, up to eight accumulator segments are required to support







Figure 4-1. An Example of a Redundant Number Adder Tree
The structure in Figure 4-1 can support the following real number computations:
1. 8-element real number inner-product computation using a single RB
accumulator segment.
2. Dual 4-element real number inner-product using two RB accumulator
segments.
3. Quad 2-element real inner-product using four RB accumulator segments.
4. Eight parallel multipliers with or without eight accumulator segments.
The structure in Figure 4-1 can support the following complex number
computations:
1. Single 2-element complex number inner-products using one RB
accumulator segment.
2. Dual single complex number inner-products using four RB accumulator
segments.
3. Two parallel complex number multipliers with or without two real and
imaginary accumulator segments.
The structure in Figure 4-1 can support the following redundant binary number
computations:
1. Single element redundant binary number inner-product computation using
one accumulator segment.
2. Dual 2-element RB inner-product using two RB accumulator segments.
3. Four parallel RB multipliers using 4 RB accumulator segments.
Mixed real and complex number operations and mixed real/complex,
2’s-complement/RB operations are also possible using the same 8-element IP structure. All
of the extended computations are performed by bypassing some or all of the RB adder
tree shown in Figure 4-1. The basic inner-product structure has the highest latency since
the entire RB adder tree is utilized. When implementing one or more of the extended
operations using control signals, design choices should be carefully considered since
additional multiplexers are necessary for a multiple operation capability.
For a general-purpose complex number DSP core, a key element of the design is
the segmented accumulator and the ability to provide both overflow and saturation
arithmetic. The design of the segmented accumulator and its associated final RBA for
implementing the extended operations in a CASP device is beyond the scope of this
dissertation and is the subject of continuing research.
4.1 Real-Number Computational Extensions
4.1.1 8-Element Real Number Inner-Product Computation
This structure is developed in Chapter 2 and provides the basic computational 
foundation for extended calculations. Refer to Figure 2-2 and Figure 4-1.
4.1.2 Dual 4-Element Real Number Inner-Product
Figure 4-2 shows the structure to perform dual 4-element real number inner- 
product using two RB accumulator segments. This calculation requires two accumulators, 






Figure 4-2. Dual 4-Element Real Number Inner-Product
4.1.3 Quad 2-Element Real Inner-Product Using Four Redundant
Binary Accumulators
For this computation, the RB adder tree requires reconfiguration as shown in Figure 4-3.









Figure 4-3. Quad 2-Element Real Number Inner-Product
4.1.4 Eight Parallel Multipliers Using 8 Redundant Binary 
Accumulators
For this computation, the RB adder tree needs to be controlled as shown in Figure
4-4. Here eight RB accumulators are required if the structure is used for computing eight
inner-products; otherwise, the accumulators are bypassed.
Figure 4-4. Eight Parallel Multipliers Using 8 RB Accumulators
4.2 Complex-Number Computational Extensions
Defining four complex numbers as $C_0 = A_0 + jB_0$, $C_1 = A_1 + jB_1$, $C_2 = A_2 + jB_2$ and
$C_3 = A_3 + jB_3$, the computational extensions for complex number computations are
depicted as follows.
4.2.1 Single 2-Element Complex Number Inner-Product
Computation Using One Real/Imaginary Redundant Binary
Accumulator
Figure 4-5 shows the structure of a single 2-element complex number inner-product
using a RB accumulator segment for the real and imaginary parts separately.





Figure 4-5. Single 2-Element Complex Number IP Using One Real/Imaginary RB
Accumulator
4.2.2 Dual Single-element Complex Number Inner-Product
Computation Using Four Redundant Binary Accumulators
Figure 4-6 shows the structure of dual 2-element complex number inner-products











Figure 4-6. Dual 2-Element Complex Number Inner-Products Using Four RB
Accumulators
4.2.3 Two Parallel Complex Number Multipliers














Figure 4-7. Two Parallel Complex Number Multipliers
4.3 Redundant Binary Number Computational Extensions
4.3.1 Single Element Redundant Binary Number Inner-Product
Computation
Figure 4-8 shows the structure of a 4-element redundant binary inner-product
computation, where $\Phi_0$ to $\Phi_3$ and $\Gamma_0$ to $\Gamma_3$ are redundant binary numbers. The structure of
the RB multiplier for $\Phi_0\Gamma_0$ to $\Phi_3\Gamma_3$ is discussed in Section 2.6.
Figure 4-8. 4-Element Redundant Binary Inner-Product
4.3.2 Dual 2-Element RB Inner-Product plus Two Redundant 
Binary Accumulators
Figure 4-9 shows the structure of dual 2-element RB inner-product computation 






Figure 4-9. Dual 2-Element RB Inner-Product
4.3.3 Four Parallel Redundant Binary Multipliers Using Four
Redundant Binary Accumulators





Figure 4-10. Four Parallel RB Multipliers Using 4 RB Accumulators
4.4 Pipeline Extensions
In this section, the possible pipeline design alternatives of the RB inner-product
processor are investigated. The 0.5 μm CMOS time delay model from [3] for an 8-bit
RB multiplier is used for the discussion and is shown in Table 4-A:
Table 4-A. Time Delay Model of RB Multiplier [3]
Operation                            Time Delay
2’s-complement to RB mapping         2-NAND gate ≈ 200 ps = 0.2 ns
RBA (RB adder)                       0.9 ns
RB to 2’s-complement converter       1.6 ns
Consider the 8-word 8-bit RB IP processor in Figure 4-1, redrawn as shown
in Figure 4-11:




Figure 4-11. 8-Word 8-Bit RB IP Processor
The implementation of consists of 2’s-complement to RB mapping
and 4 RB adders. If a two-stage pipeline structure is used, the resulting pipeline RB IP 












Figure 4-12. Two-Stage Pipelined RB IP Processor
The structure for stage 1 consists of the 2’s-complement to RB mapping and 4 RB
adders. The time delay for the first stage is

$$0.2 + 4 \times 0.9 = 3.8\ \text{ns} \qquad (4.1)$$

The structure for stage 2 consists of 3 RB adders and one RB to 2’s-complement
converter. The latency for the second stage is

$$3 \times 0.9 + 1.6 = 4.3\ \text{ns} \qquad (4.2)$$

The difference of the time delay between these two stages is

$$4.3 - 3.8 = 0.5\ \text{ns} \qquad (4.3)$$

and the structure can be considered a balanced pipeline.
If a three-stage pipeline structure is employed, then the resulting structure is
shown in Figure 4-13.
Figure 4-13. Three-Stage Pipelined RB IP Processor
The time delay for each stage is:
Stage 1

$$0.2 + 3 \times 0.9 = 2.9\ \text{ns} \qquad (4.4)$$

Stage 2

$$3 \times 0.9 = 2.7\ \text{ns} \qquad (4.5)$$

Stage 3

$$0.9 + 1.6 = 2.5\ \text{ns} \qquad (4.6)$$

The maximum difference in time delay among the three stages is 0.4 ns. This
pipeline structure can be considered to be balanced.
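The stage-balance arithmetic can be reproduced from the delay model of Table 4-A. The sketch below is a small calculator, not a timing tool; the names `stage_delay` and `imbalance` are hypothetical, and the three-stage partition (mapping plus 3 adders, 3 adders, 1 adder plus converter) is an assumption inferred from the seven RB adders in the datapath and the stated 0.4 ns maximum difference.

```python
# Delay model of Table 4-A, in nanoseconds.
MAP_NS = 0.2         # 2's-complement to RB mapping
RBA_NS = 0.9         # one RB adder
CONVERTER_NS = 1.6   # RB to 2's-complement converter


def stage_delay(mappings, adders, converters):
    """Delay of one pipeline stage built from the listed components."""
    return mappings * MAP_NS + adders * RBA_NS + converters * CONVERTER_NS


def imbalance(stages):
    """Largest difference in delay between any two stages."""
    return max(stages) - min(stages)


two_stage = [stage_delay(1, 4, 0), stage_delay(0, 3, 1)]
# Assumed partition for the three-stage case (see the lead-in).
three_stage = [stage_delay(1, 3, 0), stage_delay(0, 3, 0), stage_delay(0, 1, 1)]
```

The slowest stage sets the clock period, so under this model the three-stage split would shorten the period from 4.3 ns to 2.9 ns at the cost of one more register boundary.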
Chapter 5 Redundant Binary to 2’s-Complement
Number Conversion
Since the inner-product processor produces results in a redundant binary form, a
RB-NB (redundant binary to normal binary) converter may be required to provide a
2’s-complement representation. Ruiz [5] proposed a carry-look-ahead RB-to-2’s-complement
converter which is similar to the structure of a carry-look-ahead adder.
Rajashekhara [71] proposed a similar converter that is based upon a borrow-look-ahead
structure. In his paper [7], Yen gave a novel definition of carry in the proposed RB to
2’s-complement converter. An on-the-fly converter was discussed in [72] which converts
serial RB inputs to a 2’s-complement number. Choo [74] claimed a breakthrough with a
new converter which has no latency proportional to the word length. However, according
to the proof in [70], the conversion of redundant binary to 2’s-complement is equivalent
to 2’s-complement addition, so Choo’s converter cannot work correctly. Ling [106]
proposed a high-speed adder which is currently the fastest known binary adder. In this
chapter, based on Yen’s method [7] and Ling’s adder scheme [106], we propose an
improved RB-NB converter.
5.1 An Improved Redundant Binary to 2’s-Complement 
Converter
Define a new variable, the carry $c_i$ [7], as follows:
1) $c_i = 1$ means that, for the current RB digit position i, there is at least one -1 to
the right of the current bit position and no +1s between the -1 and the current position.
2) $c_i = 0$ otherwise.
Table 5-A shows the conversion rules at stage i, where $x_i$ is the redundant binary
digit, $c_i$ is the carry-in from the next lower order position, $s_i$ is the 2’s-complement
binary bit output, and $c_{i+1}$ is the carry-out to the next higher bit position. Example 1
shows a conversion from RB to 2’s-complement based on the foregoing rules.
Table 5-A. Conversion Rules in Stage i [7]
Input                                      Output
Redundancy bit $x_i$    Carry in $c_i$     Binary bit $s_i$    Carry out $c_{i+1}$
 0                      0                  0                   0
 0                      1                  1                   1
 1                      0                  1                   0
 1                      1                  0                   0
-1                      0                  1                   1
-1                      1                  0                   1
Example 1: Letting the RB number X = [-1 1 0 -1 0 -1 1 0 -1 0 0 0], then
X = [-1 1 0 -1 0 -1 1 0 -1 0 0 0] = -1320
C = [0 1 1 1 1 0 1 1 0 0 0 0]
and
S = [1 1 0 1 0 1 1 0 1 1 0 0 0] = -1320
For example, at bit position 0, $x_0 = 0$ and $c_0 = 0$; then according to Table 5-A,
$s_0 = 0$ and $c_1 = 0$. For bit position 3, $x_3 = -1$ and $c_3 = 0$; then according to Table 5-A,
$s_3 = 1$ and $c_4 = 1$.
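The conversion rules of Table 5-A translate directly into a bit-serial procedure. The sketch below is a plain ripple version written only to make the rules concrete; it is not the lookahead converter developed in this chapter. The names `rb_to_twos_complement` and `twos_complement_value` are hypothetical, and the final carry is taken as the sign bit, which is an assumption consistent with the 13-bit result of Example 1.

```python
def rb_to_twos_complement(digits_msb_first):
    """Convert RB digits in {-1, 0, 1} (MSB first) to 2's-complement bits (MSB first)."""
    carry = 0                                   # c_0 = 0
    bits_lsb_first = []
    for x in reversed(digits_msb_first):
        nonzero = 1 if x != 0 else 0
        bits_lsb_first.append(nonzero ^ carry)  # s_i per Table 5-A
        # c_{i+1}: a -1 generates the carry, a 0 passes it on, a +1 absorbs it.
        carry = 1 if (x == -1 or (x == 0 and carry == 1)) else 0
    bits_lsb_first.append(carry)                # final carry acts as the sign bit
    return bits_lsb_first[::-1]


def twos_complement_value(bits_msb_first):
    """Signed value of a 2's-complement bit list."""
    unsigned = 0
    for b in bits_msb_first:
        unsigned = unsigned * 2 + b
    return unsigned - (bits_msb_first[0] << len(bits_msb_first))


example_digits = [-1, 1, 0, -1, 0, -1, 1, 0, -1, 0, 0, 0]
example_bits = rb_to_twos_complement(example_digits)
```

For the digits of Example 1, `example_bits` is `[1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0]`, whose signed value is -1320.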
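Using the encoding provided in Table 1-B to encode Table 5-A, the RB-NB
conversion is shown in Table 5-B.
Table 5-B. Conversion Truth Table for RB-NB
Input                                          Output
Redundancy bit              Carry in $c_i$     Binary bit $s_i$    Carry out $c_{i+1}$
$x_i = (x_i^+\ x_i^-)$
 0 (0 1)(1 0)               0                  0                   0
 0 (0 1)(1 0)               1                  1                   1
 1 (1 1)                    0                  1                   0
 1 (1 1)                    1                  0                   0
-1 (0 0)                    0                  1                   1
-1 (0 0)                    1                  0                   1
According to this conversion truth table, we derive the following equations:

$$s_i = c_i \oplus \overline{(x_i^+ \oplus x_i^-)}$$
$$c_{i+1} = \overline{x_i^+}\, \overline{x_i^-} + c_i\, \overline{x_i^+ x_i^-} \qquad (5.1)$$

For the $c_{i+1}$ equation above, define the signals carry propagate, $p_i = \overline{x_i^+ x_i^-}$, and
carry generate, $g_i = \overline{x_i^+ + x_i^-}$. Then:

$$c_{i+1} = g_i + c_i p_i \qquad (5.2)$$

Unrolling the carry equations, we get:

$$c_1 = g_0 + c_0 p_0$$
$$c_2 = g_1 + g_0 p_1 + c_0 p_0 p_1$$
$$c_3 = g_2 + g_1 p_2 + g_0 p_1 p_2 + c_0 p_0 p_1 p_2$$
$$c_4 = g_3 + g_2 p_3 + g_1 p_2 p_3 + g_0 p_1 p_2 p_3 + c_0 p_0 p_1 p_2 p_3 \qquad (5.3)$$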
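Based on the Ling adder [106], a more efficient RB-NB converter can be
designed.
Define the signal carry transfer, $t_i = p_i + g_i$ (the carry is not annihilated), and
$h_i = c_i + c_{i-1}$; then

$$c_i = g_{i-1} + c_{i-1} p_{i-1} = g_{i-1} + c_{i-1} t_{i-1}$$
$$= g_{i-1} + g_{i-2} t_{i-1} + g_{i-3} t_{i-2} t_{i-1} + g_{i-4} t_{i-3} t_{i-2} t_{i-1} + c_{i-4} t_{i-4} t_{i-3} t_{i-2} t_{i-1} \qquad (5.4)$$

Ling’s modification consists of propagating $h_i = c_i + c_{i-1}$ instead of $c_i$. To
understand the following derivations, we note that $g_{i-1}$ implies $c_i$ ($c_i = 1$ if $g_{i-1} = 1$),
which in turn implies $h_i$.

$$h_i = c_i + c_{i-1} = (g_{i-1} + c_{i-1} p_{i-1}) + c_{i-1} = g_{i-1} + c_{i-1}$$
$$c_i = g_{i-1} + c_{i-1} p_{i-1} = g_{i-1} h_i + p_{i-1} h_i = h_i t_{i-1}$$
$$h_i = g_{i-1} + c_{i-1} = g_{i-1} + t_{i-2} h_{i-1}$$

Unrolling the recurrence for $h_i$, we get:

$$h_i = g_{i-1} + t_{i-2} (g_{i-2} + h_{i-2} t_{i-3})$$
$$= g_{i-1} + g_{i-2} + h_{i-2} t_{i-2} t_{i-3} \qquad \{\text{since } t_{i-2} g_{i-2} = g_{i-2}\}$$
$$= g_{i-1} + g_{i-2} + g_{i-3} t_{i-2} + h_{i-3} t_{i-4} t_{i-3} t_{i-2}$$
$$= g_{i-1} + g_{i-2} + g_{i-3} t_{i-2} + g_{i-4} t_{i-3} t_{i-2} + h_{i-4} t_{i-5} t_{i-4} t_{i-3} t_{i-2}$$

$$h_1 = g_0 + t_{-1} h_0$$
$$h_2 = g_1 + g_0 + t_0 t_{-1} h_0$$
$$h_3 = g_2 + g_1 + g_0 t_1 + h_0 t_{-1} t_0 t_1$$
$$h_4 = g_3 + g_2 + g_1 t_2 + g_0 t_1 t_2 + h_0 t_{-1} t_0 t_1 t_2 \qquad (5.5)$$

Now, the expression for the converter output is: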
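
$$s_i = c_i \oplus \overline{(x_i^+ \oplus x_i^-)} = h_i t_{i-1} \oplus \overline{(x_i^+ \oplus x_i^-)} \qquad (5.6)$$

where

$$t_i = p_i + g_i = \overline{x_i^+ x_i^-} + \overline{x_i^+ + x_i^-} = \left(\overline{x_i^+} + \overline{x_i^-}\right) + \left(\overline{x_i^+}\, \overline{x_i^-}\right)$$
$$= \overline{x_i^+} + \overline{x_i^-} = \overline{x_i^+ x_i^-} = p_i$$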
Here, $h_0$ and $h_4$ are the 4-bit converter’s $c_{in}$ and $c_{out}$, respectively. A carry network
based on the preceding equations can be used in conjunction with 2-input NANDs, producing
the $t_i$ signals, 2-input NORs, producing the $g_i$ signals, and 3-input XNORs, producing the
sum bits, to build a 4-bit binary RB-NB converter. Note that since $h_4$ does not affect the
computation of the sum bits, it can be derived based on the simpler equation:

$$h_4 = g_3 + t_2 h_3 \qquad (5.7)$$

Figure 5-1. Four-Bit Carry-Lookahead RB-NB Converter (Similar to [85])
Compared to Equation (5.3), Equation (5.5) for $h_4$ contains only 12 literals, while
in Equation (5.3), $c_4$ has 15 literals. The cost is that the sum is obtained by a slightly
more complex expression in (5.6), as compared to Equation (5.1).
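The relation between the Ling-style signal $h_i$ and the ordinary carry $c_i$ can be checked exhaustively for a 4-digit block. The sketch below assumes the definitions used in this chapter (generate as the NOR and transfer as the NAND of the two encoding bits) and the boundary values $h_0 = c_0$ and $t_{-1} = 1$; those boundary values and the names `ripple_carries`, `ling_signals` and `check_block` are assumptions made for the illustration.

```python
from itertools import product

ENCODINGS = [(0, 0), (0, 1), (1, 0), (1, 1)]   # (plus, minus) bit pairs


def ripple_carries(pairs, c0):
    """Carries c_0..c_4 from c_{i+1} = g_i + c_i p_i, digits given LSB first."""
    carries = [c0]
    for plus, minus in pairs:
        g = 1 - (plus | minus)      # generate: NOR, digit is -1
        p = 1 - (plus & minus)      # propagate: NAND, digit is not +1
        carries.append(g | (carries[-1] & p))
    return carries


def ling_signals(pairs, c0):
    """h_0..h_4 from the unrolled expressions, plus the transfer signals t_i."""
    g = [1 - (plus | minus) for plus, minus in pairs]
    t = [1 - (plus & minus) for plus, minus in pairs]
    h0, t_m1 = c0, 1                # assumed boundary values
    h1 = g[0] | (t_m1 & h0)
    h2 = g[1] | g[0] | (t[0] & t_m1 & h0)
    h3 = g[2] | g[1] | (g[0] & t[1]) | (h0 & t_m1 & t[0] & t[1])
    h4 = (g[3] | g[2] | (g[1] & t[2]) | (g[0] & t[1] & t[2])
          | (h0 & t_m1 & t[0] & t[1] & t[2]))
    return [h0, h1, h2, h3, h4], t


def check_block():
    """Verify c_i = h_i t_{i-1} and h_i = c_i + c_{i-1} for every input."""
    for pairs in product(ENCODINGS, repeat=4):
        for c0 in (0, 1):
            c = ripple_carries(pairs, c0)
            h, t = ling_signals(pairs, c0)
            for i in range(1, 5):
                if c[i] != (h[i] & t[i - 1]) or h[i] != (c[i] | c[i - 1]):
                    return False
    return True
```

`check_block()` covers all 256 encodings of a 4-digit block with both carry-in values, so it also confirms that the result does not depend on which of the two encodings a zero digit uses.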
Given the design represented in Equation (5.5), the group “block generate” and
“block propagate” signals can be derived as follows:

$$h_{[i,i+3]} = g_{i+3} + g_{i+2} + g_{i+1} t_{i+2} + g_i t_{i+1} t_{i+2}$$
$$t_{[i,i+3]} = t_{i-1} t_i t_{i+1} t_{i+2} \qquad (5.8)$$

Figure 5-2 shows a schematic diagram of a 4-bit carry-lookahead block carry
generator based on Ling’s design.
Figure 5-2. Diagram of a 4-Bit Carry-Lookahead RB-NB Carry Generator
(Similar to [85])
Given the 4-bit carry-lookahead generator from Figure 5-1 and Figure 5-2, the 
construction of a multilevel-lookahead circuit is straightforward. For example, to 
construct a two-level 16-bit carry-lookahead RB-NB converter, we need four 4-bit RB- 


































4-Bit Lookahead RB-NB Converter Block Carry Generator
Figure 5-3. Two-Level 16-bit RB-NB Converter (Similar to [85])
5.2 Comparison Result
In this section, the novel converter is compared to a traditional carry-lookahead
based converter, using a 4-bit converter in both cases. Assume that only 2-input OR and
AND gates can be used to build such a converter. For a traditional carry-lookahead
converter, the longest latency is defined by Equation (5.3). That is,
$$c_4 = g_3 + g_2 p_3 + g_1 p_2 p_3 + g_0 p_1 p_2 p_3 + c_0 p_0 p_1 p_2 p_3$$
If only two-input gates are allowed, then 14 gates are required to realize $c_4$,
and the critical path delay is six gate levels.
For the converter investigated here, the critical path delay is defined by Equation 
(5.5). That is,
$$h_4 = g_3 + g_2 + g_1 t_2 + g_0 t_1 t_2 + h_0 t_{-1} t_0 t_1 t_2$$
If only two-input gates are allowed, then 10 gates are required to realize $h_4$,
and the critical path delay is only five gate levels.
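The gate counts quoted above can be reproduced with a simple cost model. The sketch below is an illustration under stated assumptions, not the dissertation's own procedure: each product term is built as a balanced tree of 2-input ANDs, the terms are combined by a balanced tree of 2-input ORs, and the product-term sizes are assumed to be 1, 2, 3, 4, 5 literals for the conventional carry and 1, 1, 2, 3, 4 for the Ling-style carry (that is, the incoming carry is treated as a single available input). The function name `two_input_cost` is hypothetical.

```python
from math import ceil, log2

def two_input_cost(term_sizes):
    """Gate count and depth of a sum of products built from 2-input gates.

    term_sizes lists the number of literals in each product term.
    A k-literal AND needs k-1 gates and ceil(log2 k) levels as a balanced tree;
    OR-ing m terms needs m-1 gates and ceil(log2 m) levels.
    """
    and_gates = sum(k - 1 for k in term_sizes)
    or_gates = len(term_sizes) - 1
    and_depth = max(ceil(log2(k)) for k in term_sizes)
    or_depth = ceil(log2(len(term_sizes)))
    return and_gates + or_gates, and_depth + or_depth

# Assumed product-term sizes (see the lead-in above).
c4_cost = two_input_cost([1, 2, 3, 4, 5])  # conventional lookahead carry
h4_cost = two_input_cost([1, 1, 2, 3, 4])  # Ling-style carry
print(c4_cost, h4_cost)
```

Under these assumptions the model returns (gates, levels) pairs that agree with the 14 gates and six levels, and the 10 gates and five levels, stated in the text. A delay-aware arrangement of the OR tree could shorten both paths, so the depth figure is specific to this balanced-tree model.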
Chapter 6 Summary and Conclusions
Inner-product computations play a central role in digital signal processing,
especially in the areas of digital filters, signal correlation, convolution, and the FFT.
Complex number arithmetic is a key feature required in modern
digital communication, radar systems and optical systems. Many algorithms based on
convolutions, correlations, and complex number filters require complex number
multiplication and high-speed inner-product computation. The overall motivation for this
work is the design of a high-performance complex arithmetic signal processor (CASP) capable
of offering novel extended inner-product operations.
The CASP design relies on the high-speed multiplication afforded by redundant 
binary techniques, while avoiding the relatively slow conversion back to 2’s-complement 
numbers until a final 2’s-complement result is necessary. Inherently, the CASP device 
provides intermediate register storage for redundant binary, as well as 2’s-complement 
numbers. A new high-performance inner-product processor using redundant binary 
number representation is presented in this dissertation.
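As a reminder of why the final conversion is the slow step, the following Python sketch models a redundant binary number as a list of digits from {-1, 0, 1} and converts it to an ordinary integer by subtracting the negative-digit vector from the positive-digit vector. The digit ordering (least significant first) and the function names are assumptions made for illustration only.

```python
def rb_value(digits):
    """Value of a redundant binary number; digits in {-1, 0, 1}, LSB first."""
    return sum(d * (1 << k) for k, d in enumerate(digits))

def rb_to_int(digits):
    """Convert via X = X_plus - X_minus, one full-width subtraction."""
    x_plus = sum(1 << k for k, d in enumerate(digits) if d == 1)
    x_minus = sum(1 << k for k, d in enumerate(digits) if d == -1)
    return x_plus - x_minus

# Two different digit vectors for the same value show the redundancy.
a = [-1, 0, 0, 1]   # -1 + 8
b = [1, 1, 1, 0]    # 1 + 2 + 4
print(rb_to_int(a), rb_to_int(b))
```

The subtraction in `rb_to_int` is a carry-propagating operation over the full word, which is why deferring it until a 2’s-complement result is actually needed saves time in iterated computations.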
When the Booth coding technique is used, our proposed RB inner-product
processor can significantly reduce the number of partial products, to 25%. It can also be
dynamically reconfigured/controlled to perform real, complex and redundant binary
number computations such as parallel multiplications and inner-product computations.
The extended computational capabilities of the RB IP processor are developed for real,
complex, and redundant binary number or mixed computations. In Chapter 2, the
structure for IP computation is studied. Two possible implementations,
the inline and the cross partial product methods, are compared; our inline method
provides several advantages in speed and flexibility.
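To illustrate the Booth part of this reduction, the sketch below recodes an n-bit 2’s-complement multiplier into radix-4 Booth digits, which halves the number of partial products. Radix-4 (modified) Booth recoding is an assumption here, since the summary does not name the radix, and the sketch does not model how the remaining reduction is obtained in the RB structure. Function names are illustrative.

```python
def booth_radix4(multiplier, n_bits):
    """Recode an n-bit two's-complement multiplier into radix-4 Booth digits.

    Returns n_bits // 2 digits in {-2, -1, 0, 1, 2}, least significant first.
    n_bits is assumed to be even.
    """
    bits = [(multiplier >> k) & 1 for k in range(n_bits)]
    digits = []
    prev = 0  # implicit zero to the right of the least significant bit
    for k in range(0, n_bits, 2):
        # digit = b_{k-1} + b_k - 2 * b_{k+1}
        digits.append(bits[k] + prev - 2 * bits[k + 1])
        prev = bits[k + 1]
    return digits

def booth_value(digits):
    """Value represented by radix-4 digits, least significant first."""
    return sum(d * 4 ** k for k, d in enumerate(digits))

digits = booth_radix4(-3, 8)
print(digits, booth_value(digits))
```

Each nonzero digit selects a partial product of 0, ±1, or ±2 times the multiplicand, so an 8-bit multiplier yields four partial products instead of eight.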
Complex number representations and arithmetic are also studied. Different
complex radices such as radix-($2j$), radix-($j-1$) and radix-($\sqrt{2}j$) are investigated and
compared. It is found that the complex radices have no advantage in hardware
implementations. The traditional redundant binary number representation is used to
implement complex-number multiplication and inner-product processing. The new RB 
inline inner-product processor can be reconfigured/controlled to perform complex- 
number computations. The structures for Aç̂Bq -f yfjR, and is developed and
compared. The implementation of A^B^+A^B^ can be easily controlled to perform the 
computation of A^B  ̂-  A-̂ B̂ . The complex number inner-product processor is investigated 
based upon this unified structure for Â B̂̂  ±A^B^. The implementation using the RB IP
processor is compared with the TMS320C6XXX processor. This comparison shows there 
is some speed improvement for the RB IP core. Next, a unified signed/unsigned
multiplier with and without Booth encoding is presented. Based upon the unified
multiplier, the RB IP processor is further extended to realize a redundant binary
multiplier that can accept either 2’s-complement or RB inputs. The ability to accept RB
inputs is essential for iterative calculations such as real and complex number division.
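The role of a sign-controllable sum of two products in complex arithmetic can be sketched as follows. This Python example assumes the usual rectangular decomposition, in which the real part of a product is a difference of two products and the imaginary part is a sum of two products; the function names and the tuple representation of complex values are hypothetical and are not taken from the processor design.

```python
def sum_of_products(a1, b1, a2, b2, subtract=False):
    """One a1*b1 + a2*b2 operation; `subtract` selects a1*b1 - a2*b2 instead."""
    return a1 * b1 - a2 * b2 if subtract else a1 * b1 + a2 * b2

def complex_multiply(a, b):
    """(ar + j*ai) * (br + j*bi) using two sum-of-two-products operations."""
    ar, ai = a
    br, bi = b
    real = sum_of_products(ar, br, ai, bi, subtract=True)
    imag = sum_of_products(ar, bi, ai, br)
    return real, imag

def complex_inner_product(xs, ys):
    """Accumulate the element-wise complex products of two vectors."""
    re = im = 0
    for a, b in zip(xs, ys):
        r, i = complex_multiply(a, b)
        re += r
        im += i
    return re, im

print(complex_multiply((3, 4), (1, -2)))
print(complex_inner_product([(1, 2), (3, 4)], [(5, 6), (7, 8)]))
```

Because the two operations differ only in one sign, a single structure with a control input can serve both the real and the imaginary part.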
In Chapter 3, different division methods are reviewed. Two function iteration
division methods, Newton-Raphson and Goldschmidt, are compared in detail. The
theoretical equivalence of these two methods is shown. Further studies show that the
Goldschmidt method is preferred over the Newton-Raphson method for efficient
hardware implementation. Extensions to the RB IP core are provided for performing
Goldschmidt division. The division implementation structure for both real and complex
numbers is discussed using the same IP processor.
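The equivalence of the two iterations can be illustrated in exact rational arithmetic. The sketch below uses the textbook forms of both methods, assumed here rather than taken from Chapter 3: Newton-Raphson refines a reciprocal estimate with $x_{k+1} = x_k (2 - d x_k)$, while Goldschmidt scales numerator and denominator by $F_k = 2 - D_k$. Both start from the same initial reciprocal estimate `x0`; all names are illustrative.

```python
from fractions import Fraction

def newton_raphson(n, d, x0, steps):
    """Quotient estimates n * x_k, with x_{k+1} = x_k * (2 - d * x_k)."""
    x = Fraction(x0)
    estimates = []
    for _ in range(steps):
        x = x * (2 - d * x)
        estimates.append(n * x)
    return estimates

def goldschmidt(n, d, x0, steps):
    """Quotient estimates N_k, with N and D both scaled by F_k = 2 - D_k."""
    num, den = n * Fraction(x0), d * Fraction(x0)
    estimates = []
    for _ in range(steps):
        f = 2 - den
        num, den = num * f, den * f  # the two products are independent
        estimates.append(num)
    return estimates

n, d, x0 = Fraction(3), Fraction(7, 8), Fraction(1)
nr_estimates = newton_raphson(n, d, x0, 4)
gs_estimates = goldschmidt(n, d, x0, 4)
print(nr_estimates == gs_estimates, float(gs_estimates[-1]))
```

In exact arithmetic the two sequences coincide. The practical difference is that the two Goldschmidt multiplications in each step do not depend on each other, so they can be overlapped in a multiplier pipeline, whereas the Newton-Raphson multiplications are sequential.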
In Chapter 4, it is shown that, together with the basic inner-product operations, the
extended computational capabilities can be implemented using control signals, or using
circuit reconfiguration if configurable hardware is used. These extended operations provide a rich set of
computational capabilities targeted for implementation in a complex arithmetic signal
processor (CASP). Various extensions such as IP computations and parallel multiplication of
real, complex and redundant binary numbers are studied. Possible pipeline
implementations of the RB IP core are investigated. Two-stage and three-stage pipeline
structures are presented and the time delay model of these stages is studied. An improved
RB to 2’s-complement number converter is investigated in Chapter 5. This converter
shows improvement in speed with a small increase in area.
Several areas of research are suggested. Further development of the IP core is 
required for the extended calculation capabilities, primarily dealing with the segmented
accumulator and the requirements for flag setting based on arithmetic results for both 
saturation and overflow arithmetic. In addition, the IP processor can be developed to 
provide computational capabilities for square root, CORDIC, and other iterative 
functions.
Since the IP processor developed here serves as a core DSP computing 
capability, the overall architecture of the Complex Arithmetic Signal Processor (CASP) 
device requires extensive research to provide a dual numeric representation, i.e., 2’s- 
complement and redundant binary. The CASP device should have a rich instruction set 
architecture that leverages the IP core for performing calculations for signed/unsigned,
real/complex binary numbers, as well as intermediate calculations on redundant binary
numbers.





For the information regarding VHDL hardware implementations, please
contact the Office of Technology Development, University of Oklahoma.
660 Farrington Oval 
Evans Hall, Room 201 
Norman, Oklahoma 73019 




[1] Y. Harata, Y. Nakamura, H. Nagase, M. Takigawa, and N. Takagi, “A high-speed
multiplier using a redundant binary adder tree,” IEEE Journal of Solid-State Circuits,
vol. 22, no. 1, pp. 28-33, Feb. 1987.
[2] X. Huang, W. Liu, and B. W. Y. Wei, “A high-performance CMOS redundant binary
multiplication-and-accumulation (MAC),” IEEE Transactions on Circuits and
Systems I, vol. 41, no. 1, pp. 33-39, Jan. 1994.
[3] H. Makino, Y. Nakase, H. Suzuki, H. Morinaka, H. Shinohara, and K. Mashiko, “An
8.8-ns 54x54-bit multiplier with high speed redundant binary architecture,” IEEE
Journal of Solid-State Circuits, vol. 31, no. 6, pp. 773-783, June 1996.
[4] N. Takagi, H. Yasuura, and S. Yajima, “High-speed VLSI multiplication algorithm
with a redundant binary addition tree,” IEEE Transactions on Computers, vol. C-34,
no. 9, pp. 789-796, Sept. 1985.
[5] G. A. Ruiz, “4bit CLA-based conversion from redundant to binary representation for 
CMOS simple and multi-output implementations,” Electronics Letters, vol. 35, no. 4, 
pp. 281-283, Feb. 1999.
[6] A. Herrfeld and S. Hentschke, “Conversion of redundant binary into two's
complement representations,” Electronics Letters, vol. 31, no. 14, pp. 1132-1133,
July 1995.
[7] S. Yen, C. Laih, C. Chen, and J. Lee, “An efficient redundant-binary number to
binary number converter,” IEEE Journal of Solid-State Circuits, vol. 27, no. 1, pp.
109-112, Jan. 1992.
[8] M. O. Ahmad and D. V. Poornaiah, “Design of an efficient VLSI inner-product
processor for real-time DSP applications,” IEEE Transactions on Circuits and 
Systems, vol. 36, no. 2, pp. 324-329, Feb. 1989.
[9] E. Abdel-Raheem, A. Tawfik, M. Fahmi, and F. El-Guibaly, “New inner-product 
processor for FIR filter implementation,” in Proceedings of IEEE Pacific RIM
Conference Communication, Computer and Signal Processing, Victoria, May 17-19,
1995, pp. 395-398.
[10] D. J. Soudris, V. Paliouras, T. Stouraitis, and C. E. Goutis, “A VLSI design
methodology for RNS full adder-based inner product architectures,” IEEE
Transactions on Circuits and Systems II: Analog and Digital Signal Processing,
vol. 44, no. 4, pp. 315-318, April 1997.
[11] M. N. Fahmi, F. El-Guibaly, S. Sunder, and D. J. Shpak, “Design of novel serial-
parallel inner-product processors,” in 1994 IEEE International Symposium on
Circuits and Systems, London, UK, May 30-June 2, 1994, pp. 55-58.
[12] S. Haynal and B. Parhami, “Arithmetic structures for inner-product and other
computations based on a latency-free bit-serial multiplier design,” in 33rd Asilomar
Conference on Signals, Systems and Computers, Pacific Grove, CA, Nov. 3-6, 1996,
pp. 197-201.
[13] W. P. Burleson and L. L. Scharf, “VLSI design of inner-product computers using
distributed arithmetic,” in 1989 IEEE International Symposium on Circuits and
Systems, Portland, OR, May 8-11, 1989, pp. 158-161.
[14] A. S. Vega, P. S. R. Diniz, and A. C. Mesquita, “A modular distributed-arithmetic
implementation of the inner product and its application to digital filters,” Journal of
VLSI Signal Processing, vol. 10, pp. 93-106, 1995.
[15] N. Kazakova, R. Sung, N. Durdle, M. Margala, and J. Lamoureux, “Fast and low- 
power inner product processor,” in The 2001 IEEE International Symposium on 
Circuits and Systems, Sydney, Australia, May 6-9, 2001, vol. 4, pp. 646-649.
[16] R. Lin and S. Olariu, “A new buses scheme for fast inner-product computation,” in
28th Asilomar Conference on Signals, Systems and Computers, Monterey, CA, Oct.
30-Nov. 2, 1994, pp. 1402-1406.
[17] J. A. Starzyk and C. Chen, “A VLSI inner-product processor for real-time DSP
applications,” in Proceedings of the 26th Southeastern Symposium on System
Theory, Athens, OH, March 20-22, 1994, pp. 219-223.
[18] C. C. Wang, C. J. Huang, and Y. P. Chen, “Design of an inner-product processor for
hardware realization of multi-valued exponential bidirectional associative memory,”
IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing,
vol. 47, no. 11, pp. 1271-1278, Nov. 2000.
[19] C. C. Wang, P. M. Lee and C. J. Huang, “Three alternative architectures of digital
ratioed compressor design with application to inner-product processing,” IEE
Proceedings on Computers and Digital Techniques, vol. 147, no. 2, pp. 65-74,
March 2000.
[20] D. C. M. Bilsby, R. L. Walke, and R. W. M. Smith, “Comparison of a programmable
DSP and a FPGA for real-time multiscale convolution,” in Proceedings of the 1998
IEE Colloquium on High Performance Architectures for Real-Time Image
Processing, London, UK, Feb. 12, 1998, pp. 4/1-4/6.
[21] L. Breveglieri and L. Dadda, “A VLSI inner product macrocell,” IEEE Transactions
on VLSI Systems, vol. 6, no. 2, pp. 292-298, June 1998.
[22] E. Dujardin and O. Gay-Bellile, “Software implementation of ADSL application with
a convolution coprocessor,” in Proceedings of the 1998 IEEE International
Conference on Acoustics, Speech and Signal Processing, ICASSP, Seattle, WA, May
12-15, 1998, pp. 3053-3056.
[23] O. Gay-Bellile and E. Dujardin, “Architecture of a programmable FIR filter
co-processor,” in 1998 IEEE International Symposium on Circuits and Systems, Pacific
Grove, CA, May 31-June 3, 1998, vol. 5, pp. 433-436.
[24] R. Lin, “Reconfigurable parallel inner product processor architectures,” IEEE
Transactions on VLSI Systems, vol. 9, no. 2, pp. 261-272, April 2001.
[25] R. Lin, A. S. Botha, K. E. Kerr, and G. A. Brown, “An inner product processor
design using novel parallel counter circuits,” in The First IEEE Asia Pacific
Conference on ASICs, Seoul, Korea, Aug. 23-25, 1999, pp. 99-102.
[26] W. K. Luk and J. E. Vuillemin, “Recursive implementation of optimal time VLSI
integer multiplier,” in Proceedings of VLSI 1983, Amsterdam, Netherlands, 1983,
pp. 155-168.
[27] R. Managuli, G. York, and Y. Kim, “An efficient convolution algorithm for VLIW 
mediaprocessors,” in Proceedings of SPIE, the International Society for Optical
Engineering, Jan. 1999, pp. 65-74.
[28] S. H. Baik, K.N. Han, and E. Yoon, “A 230MHz 8 tap programmable FIR filter 
using redundant binary number system,” in IEEE International Symposium on 
Circuits and Systems, Orlando, FL, May 30-June 2, 1999, pp. 415-417.
[29] M. A. Sacristan, V. Rodellar, A. Diaz, V. Garcia, and P. Gomez, “A reusable inner
product unit for DSP applications,” in Proceedings of the 25th EUROMICRO
Conference, 1999, pp. 209-213.
[30] G. Wang and M. Tull, “The implementation of an efficient and high-speed inner-
product processor,” in 35th Asilomar Conference on Signals, Systems, and
Computers, Nov. 4-7, 2001, vol. 2, pp. 1362-1366.
[31] T. Aoki, Y. Ohi, and T. Higuchi, “Redundant complex number arithmetic for
high-speed signal processing,” in IEEE Workshop on VLSI Signal Processing, Sakai,
Japan, Oct. 16-18, 1995, pp. 523-532.
[32] J. Buhler, M. A. Shokrollahi, and V. Stemann, “Fast and precise Fourier transforms,”
IEEE Transactions on Information Theory, vol. 46, no. 1, pp. 213-228, Jan. 2000.
[33] D. Fu and A. N. Willson Jr., “A high-speed processor for rectangular-to-polar
conversion with applications in digital communications,” in 1999 Global
Telecommunications Conference, GLOBECOM '99, Rio de Janeiro, Brazil, Dec. 5-9, 1999,
vol. 4, pp. 2172-2176.
[34] Y. P. Lee, L. G. Chen, M. J. Chen, and C. W. Ku, “A new design and implementation of 8x8
2-D DCT/IDCT,” in Workshop on VLSI Signal Processing, San Francisco, CA, Oct. 
30-Nov. 1, 1996, pp. 408-417.
[35] K. Sasayama, M. Okuno, and K. Habara, “Coherent optical transversal filter using
silica-based single-mode waveguides,” Electronics Letters, vol. 25, no. 22,
pp. 1508-1509, Oct. 1989.
[36] K. Sasayama, M. Okuno, and K. Habara, “Coherent optical transversal filter using
silica-based waveguides for high-speed signal processing,” Journal of Lightwave
Technology, vol. 9, no. 10, pp. 1225-1230, Oct. 1991.
[37] M. A. Soderstrand, D. H. Chu, W. Chan, M. Lazkani, and H. H. Loomis Jr., “Multi-rate
bandpass filter bank implemented in QRNS complex arithmetic using parallel
multiple DSP chips or ASICs,” in 27th Asilomar Conference on Signals, Systems and
Computers, Pacific Grove, CA, Nov. 1-3, 1993, pp. 801-806.
[38] S. Toledo, “On the communication complexity of the discrete Fourier transform,” 
IEEE Signal Processing Letters, vol. 3, no. 6, pp. 171-172, June 1996.
[39] M. Tull, G. Wang, and M. Ozaydin, “High-speed complex number multiplier and
inner-product processor,” in 45th Midwest Symposium on Circuits and Systems,
August 4-7, 2002, vol. 3, pp. 640-643.
[40] A. Berkeman, V. Owall, and M. Torkelson, “A low logic depth complex multiplier
using distributed arithmetic,” IEEE Journal of Solid-State Circuits, vol. 35, no. 4, pp.
656-659, April 2000.
[41] S. He and M. Torkelson, “A complex array multiplier using distributed arithmetic,”
in Proceedings of the IEEE 1996 Custom Integrated Circuits Conference, San Diego,
CA, May 5-8, 1996, pp. 71-74.
[42] M. Karlsson, M. Vesterbacka, and L. Wanhammar, “Design and implementation of a
complex multiplier using distributed arithmetic,” in 1997 IEEE Workshop on Signal
Processing, Leicester, UK, Nov. 3-5, 1997, pp. 222-231.
[43] V. G. Oklobdzija, “An integrated multiplier for complex numbers,” Journal of VLSI
Signal Processing, vol. 7, pp. 213-222, 1994.
[44] V. G. Oklobdzija, D. Villeger, and T. Soulas, “Considerations for design of a
complex multiplier,” in 26th Asilomar Conference on Signals, Systems and
Computers, Pacific Grove, CA, Oct. 26-28, 1992, pp. 366-370.
[45] A. P. Pascual, J. Valls, and M. M. Peiro, “Efficient complex number multipliers
mapped on FPGA,” in 6th IEEE International Conference on Electronics, Circuits
and Systems, Pafos, Cyprus, Sept. 5-8, 1999, pp. 1123-1126.
[46] A. P. Pascual, T. Sansaloni, and J. Valls, “FPGA based on-line complex-number
multipliers,” in 8th IEEE International Conference on Electronics, Circuits and
Systems, Malta, Sept. 2-5, 2001, vol. 3, pp. 1481-1481.
[47] K. Z. Pekmestzi, “Complex number multipliers,” IEE Proceedings, vol. 136, no. 1,
pp. 70-75, Jan. 1989.
[48] T. J. Sansalon, J. Valls, and K. K. Parhi, “FPGA-based digit-serial complex number
multiplier accumulator,” 2000 IEEE International Symposium on Circuits and
Systems, Geneva, Switzerland, May 28-31, 2000, pp. 585-588.
[49] K. W. Shin and H. W. Jeon, “High-speed complex-number multiplications based on
redundant binary representation of partial products,” International Journal of
Electronics, vol. 87, no. 6, pp. 683-702, June 2000.
[50] K. W. Shin, B. S. Song, and K. Bacrania, “A 200-Mhz complex number multiplier
using redundant binary arithmetic,” IEEE Journal of Solid-State Circuits, vol. 33, no.
6, pp. 904-909, June 1998.
[51] A. Skavantzos and T. Stouraitis, “Decomposition of complex multipliers using 
polynomial encoding,” IEEE Transactions on Computers, vol. 41, no. 10, pp. 1331- 
1333, Oct. 1992.
[52] T. Soulas, D. Villeger, and V. G. Oklobdzija, “An ASIC macro cell multiplier for 
complex numbers,” in European Conference on Design Automation with the 
European Event in ASIC Design, 1993, pp. 589-593.
[53] T. Stouraitis and A. Skavantzos, “Multiplication of complex numbers encoded as
polynomials,” Journal of VLSI Signal Processing, vol. 3, no. 4, pp. 319-328, April
1991.
[54] T. Stouraitis and A. Skavantzos, “Parallel decomposition of complex multipliers,” in
22nd Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA,
1988, pp. 379-383.
[55] D. C. Blest and T. Jamil, “Efficient division in the binary representation of complex
numbers,” in Southeastcon 2001 Proceedings IEEE, Clemson, SC, March 30-April
1, 2001, pp. 188-195.
[56] Y. N. Chang and K. K. Parhi, “High-performance digit-serial complex-number
multiplier-accumulator,” in Proceedings of the 1998 IEEE International Conference
on Computer Design, Austin, TX, Oct. 5-7, 1998, pp. 211-213.
[57] Y. N. Chang and K. K. Parhi, “High-performance digit-serial complex multiplier,”
IEEE Transactions on Circuits and Systems II: Analog and Digital Signal
Processing, vol. 47, no. 6, pp. 570-572, June 2000.
[58] T. T. Dao, “Knuth’s complex arithmetic with quaternary hardware,” in 12th
International Symposium on Multiple-Value Logic, May 1982, pp. 94-98.
[59] J. Duprat, Y. Herreros, and S. Kla, “New redundant representations of complex
number,” IEEE Transactions on Computers, vol. 42, no. 7, pp. 817-823, July 1993.
[60] C. Frougny, “Parallel and on-line addition in negative base and some complex
number systems,” Euro-Par’96 Parallel Processing, pp. 175-182, 1996.
[61] W. J. Gilbert, “Complex numbers with three radix expansions,” Canadian Journal of
Mathematics, vol. 24, no. 6, pp. 1335-1348, June 1982.
[62] T. Jamil, N. Holmes, and D. Blest, “Towards implementation of a binary number
system for complex numbers,” in Proceedings of the IEEE Southeastcon 2000,
Nashville, TN, April 7-9, 2000, pp. 268-274.
[63] D. E. Knuth, “An imaginary number system,” Communications of the ACM, vol. 3,
no. 4, pp. 245-247, April 1960.
[64] D. E. Knuth, The Art of Computer Programming, vol. 2. Addison-Wesley Publishing
Company, 1969.
[65] I. Koren and Y. Maliniak, “On classes of positive, negative and imaginary radix
number systems,” IEEE Transactions on Computers, vol. C-30, no. 5, pp. 312-317,
May 1981.
[66] W. Penney, “A binary system for complex numbers,” Journal of the Association for
Computing Machinery, vol. 12, no. 2, pp. 247-248, April 1965.
[67] A. Slekys, “Design of complex number digital arithmetic units based on a modified
bi-imaginary number system,” Ph.D. dissertation, University of California at Los
Angeles, Los Angeles, CA, 1976.
[68] V. N. Stepanenko, “Computer arithmetic of complex numbers,” Cybernetics and 
Systems Analysis, vol. 32, no. 4, pp. 585-591, July 1996.
[69] A. Avizienis, “Signed-digit number representations for fast parallel arithmetic,” IRE 
Transactions on Electronic Computers, vol. EC-10, no. 3, pp. 389-400, Sept. 1961.
[70] G. M. Blair, “The equivalence of two’s-complement addition and the conversion of
redundant binary to two’s-complement numbers,” IEEE Transactions on Circuits and
Systems I: Fundamental Theory and Applications, vol. 45, no. 6, pp. 669-671, June
1998.
[71] T. N. Rajashekhara and A. S. Nale, “Conversion from signed-digit to radix
complement representation,” International Journal of Electronics, vol. 69, no. 6, pp.
717-721, Dec. 1990.
[72] M. D. Ercegovac and T. Lang, “On-the-fly conversion of redundant into
conventional representations,” IEEE Transactions on Computers, vol. C-36, no. 7,
pp. 895-897, July 1987.
[73] M. P. Tull, G. Wang, and M. Ozaydin, “Method and apparatus for converting
redundant binary numbers to two’s-complement binary numbers,” Disclosure
03NOR005, University of Oklahoma, July 2002.
[74] I. Choo and R. G. Deshmukh, “A novel conversion scheme from a redundant binary
to two’s complement binary number for parallel architectures,” in SoutheastCon
2001 Proceedings of IEEE, Clemson, SC, March 30-April 1, 2001, pp. 196-201.
[75] S. F. Oberman and M. J. Flynn, “Design issues in division and other floating-point 
operations,” IEEE Transactions on Computers, vol. 46, no. 2, pp. 154-161, Feb. 
1997.
[76] G. Wang, M. Ozaydin, and M. Tull, “High-performance divider using redundant
binary representation,” in 45th Midwest Symposium on Circuits and Systems, August
4-7, 2002, vol. 3, pp. 640-643.
[77] T. Aoki, H. Tokoyo, and T. Higuchi, “High-radix parallel dividers for VLSI signal 
processing,” in 4th Workshop on VLSI Signal Processing, Pacific Grove, CA, Oct.
30-Nov. 1, 1996, pp. 83-92.
[78] D. L. Atkins, “Higher-radix division using estimates of the divisor and partial 
remainder,” IEEE Transactions on Computers, vol. C-17, no. 10, pp. 925-934, Oct. 
1968.
[79] M. D. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence
Algorithms and Implementations. Kluwer Academic, 1994.
[80] M. Flynn, “On division by functional iteration,” IEEE Transaction on Computers, 
vol. 19, no. 8, Aug. 1970.
[81] C. V. Freiman, “Statistical analysis of certain binary division algorithms,” IRE 
Proceedings, vol. 49, pp. 91-103, 1961.
[82] R. E. Goldschmidt, “Application of division by convergence,” M.S. thesis, MIT, 
Cambridge, MA, June 1964.
[83] S. Kuninobu, T. Nishiyama, H. Edamatsu, T. Taniguchi, and N. Takagi, “Design of
high speed MOS multiplier and divider using redundant binary representation,” in
Proceedings of 8th Symposium on Computer Arithmetic, 1987, pp. 80-86.
[84] S. F. Oberman and M. J. Flynn, “Division algorithms and implementations,” IEEE 
Transactions on Computers, vol. 46, no. 8, pp. 833-854, Aug. 1997.
[85] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs. Oxford
University Press, 2000.
[86] J. E. Robertson, “A new class of digital division methods,” IRE Transactions on 
Electronic Computers, vol. 7, pp. 218-222, Sep. 1958.
[87] E. M. Schwarz and M. J. Flynn, “Hardware starting approximation method and its
application to the square root operation,” IEEE Transactions on Computers, vol. 45,
no. 12, pp. 1356-1369, Dec. 1996.
[88] H. R. Srinivas, “High speed computer arithmetic architectures,” Ph.D. dissertation,
University of Minnesota, Twin Cities, MN, 1994.
[89] K. G. Tan, “The theory and implementation of high-radix division,” in Proceedings of
IEEE Symposium of Computer Arithmetic, 1978, pp. 154-163.
[90] K. D. Tocher, “Techniques of multiplication and division for automatic binary
computers,” Quarterly Journal of Mechanics and Applied Mathematics, vol. 11, part 3,
pp. 386-383, 1958.
[91] Intel Architecture Tutorial, Intel Company, http://www.intel.com.
[92] “TMS320C6000 programmer’s guide,”
http://dspvillage.ti.com/docs/catalog/resources/techdocs.jhtml?navSection=user_guides&familyId=132.
[93] S. D. Pezaris, “A 40-ns 17-bit by 17-bit array multipliers,” IEEE Transactions on 
Computers, vol. 20, pp. 442-447, April 1971.
[94] C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Transaction on Electronic 
Computer, vol. EC-13, pp. 14-17, Feb. 1964.
[95] IEEE Standard VHDL Language Reference Manual 
IEEE Std 1076-2002, http://ieeexplore.ieee.org/Xplore/DynWel.jsp
[96] A. D. Booth, “A signed binary multiplication technique,” Quarterly Journal of
Mechanics and Applied Mathematics, vol. 4, pt. 2, pp. 236-240, June 1951.
[97] T. Aoki, H. Amada, and T. Higuchi, “Real/complex reconfigurable arithmetic using
redundant complex number systems,” in 13th IEEE Symposium on Computer
Arithmetic (Arith ’97), Pacific Grove, CA, July 6-9, 1997, pp. 200-207.
[98] T. Aoki, K. Hoshi, and T. Higuchi, “Redundant complex arithmetic and its
application to complex multiplier design,” in Proceedings of 29th IEEE International
Symposium on Multiple Valued Logic, Freiburg, Germany, May 20-22, 1999, pp.
200-207.
[99] R. McIlhenny and M. D. Ercegovac, “On-line algorithms for complex number
arithmetic,” in 32nd Asilomar Conference on Signals, Systems, Computers, Pacific
Grove, CA, Nov. 1-4, 1998, pp. 172-176.
[100] R. McIlhenny and M. D. Ercegovac, “On the design of on-line Givens rotation,” in
35th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA,
Nov. 4-7, 2001, pp. 160-164.
[101] R. E. Blahut, Fast Algorithms for Digital Signal Processing. Addison-Wesley,
1987.
[102] S. Zohar, “Negative radix conversion,” IEEE Transactions on Computers, vol.
C-19, no. 3, pp. 222-226, March 1970.
[103] D. P. Agrawal, “Negabinary carry-look-ahead adder and fast multiplier,”
Electronics Letters, vol. 10, no. 15, pp. 312-313, July 1974.
[104] N. G. P. Satish, “Negative radix arithmetic and its applications,” M.S. thesis,
University of Nevada, Las Vegas, NV, 1973.
[105] G. Wang, M. P. Tull, and M. Ozaydin, “Binary conversion algorithms for the
implementation of complex-radix numbers,” presented at Proceedings of 2nd IEEE
Electro/Information Technology, Oakland, MI, June 6-8, 2001.
[106] H. Ling, “High-speed binary adder,” IBM Journal of Research and Development,
vol. 25, no. 3, pp. 156-166, March 1981.
[107] W. S. Briggs and D. W. Matula, “A 17x69 bit multiply and add unit with 
redundant binary feedback and single cycle latency,” in Proceedings of IEEE
Symposium on Computer Arithmetic, Windsor, Canada, June 29-July 2, 1993, pp. 
163-170.
[108] P. Chai, et al., “A 120 MFLOPS COS floating-point processor,” in Proceedings of
IEEE 1991 Custom Integrated Circuit Conference, San Diego, CA, May 12-15,
1991, pp. 15.1/1-15.1/4.
[109] H. M. Darley, et al., “Floating-point/integer processor with divide and square root
functions,” U.S. Patent 4878190, Oct. 1989.
[110] S. M. Quek, L. Hu, J. P. Prabhu, and F. A. Ware, “Apparatus for determining 
booth recoder input control signal,” U.S. Patent 5280439, Jan. 1994.
[111] Y. Harata, Y. Nakamura, H. Nagase, M. Takigawa, and N. Takagi, “A high-speed
multiplier using a redundant binary adder tree,” IEEE Journal of Solid-State Circuits,
vol. 22, no. 1, pp. 28-34, Feb. 1987.
[112] N. Takagi and S. Yajima, “On a fast iterative multiplication method by recoding
intermediate product,” in Proceedings of 36th National Convention of Information
Science, Kyoto University, Aug. 1987.
[113] H. A. H. Fahmy, A. A. Liddicoat, and M. J. Flynn, “Improving the effectiveness of
floating point arithmetic,” in 35th Asilomar Conference on Signals, Systems, and
Computers, Pacific Grove, CA, Nov. 4-7, 2001, pp. 875-879.
[114] IA-32 Intel Architecture Software Developer’s Manual volume 2: Instruction Set
Reference, http://developer.intel.com/design/pentium4/manuals/.
[115] Intel Pentium Family User’s Manual volume 3: Architecture and Programming 
Manual, Intel Corporation, http://developer.intel.com/design/pentium4/manuals/.
[116] E. Antelo, M. Boo, J. D. Bruguera, and E. L. Zapata, “A novel design of a two
operand normalization circuit,” IEEE Transactions on VLSI Systems, vol. 6, no. 1,
pp. 173-176, March 1998.
[117] V. G. Oklobdzija, “An algorithmic and novel design of a leading zero detector
circuit: comparison with logic synthesis,” IEEE Transactions on VLSI Systems, vol.
2, no. 1, pp. 124-148, March 1994.
[118] V. G. Oklobdzija, “Algorithmic design of a hierarchical and modular leading zero
detector circuit,” Electronics Letters, vol. 29, no. 3, pp. 283-284, Feb. 1993.
