VLSI architectures for geometrical mapping problems in high-definition image processing by Kim, K. & Lee, J.
3rd NASA Symposium on VLSI Design 1991
N94-18345
VLSI Architectures for Geometrical Mapping
Problems in High-Definition Image Processing
K. Kim
Superconducting Super Collider Lab.
2550 Beckleymeade Avenue
Dallas, TX 75237
J. Lee
Department of Electrical Engineering
University of Houston
Houston, TX 77204-4793
Abstract- This paper explores a VLSI architecture for geometrical mapping address
computation. Geometric transformation is reviewed within the field of plane
projective geometry, which yields a set of basic transformations to be
implemented for general image processing. Homogeneous and 2-dimensional
Cartesian coordinates are employed to represent the transformations, each of
which is implemented via an augmented CORDIC as a processing element. A
specific scheme for a processor, utilizing full pipelining at the macro-level,
parallel constant-factor-redundant arithmetic, and full pipelining at the
micro-level, is assessed for producing a single-chip VLSI for HDTV
applications under the current state-of-the-art MOS technology.
1 Introduction
Geometrical transformations are widely discussed in digital image processing fields
such as high-definition television (HDTV), image recognition, interactive computer graphics
and vision processing [1,2,3]. The primary purpose of these transformations is to project
an image into a different domain in order to extract additional signals conveying
information about the image. Moreover, they afford value-added images over conventional
display through high resolution, high definition, and flexible framing. Consequently,
geometrical mapping processors that support real-time processing are about to appear.
In recent years, several geometrical mapping processing modules have been developed and
applied successfully to specific applications. They are implemented either by a popular
graphics package or application software accompanied by an acceleration box [5], or by
a VLSI processor [6]. We are interested in a VLSI implementation of a processor that
achieves real-time speed for TV image processing, with a set of transformations
sufficient to make a value-added display.
Two barriers are known to have stood in the way of developing such a processor.
The first is the lack of a sufficiently high-speed arithmetic computation technique
to generate the mathematical functions required for geometrical mapping. The second is
the need for an extensive library of geometrical mapping functions. To overcome these,
two key techniques have been developed in [4,6]: the first is a very high speed radix-2
signed-digit adder and the second is a pipelined micro-programmable arithmetic function
generator. In this paper, we study the same problem with the goal of optimizing the overall
functionality and performance. We achieve this goal by improving the basic cell.
In the following section, we review the requirements of the geometrical mapping
processor by introducing its definition and applications. In Section 3, we study various
CORDIC schemes to implement a basic cell, which can be used to compose the necessary
function set for the geometric transformations.
2 Geometrical Mapper
Transformation of a sub-image requires mapping the sub-image from its original location
to the transformed one, pixel by pixel. To rearrange the image, it is necessary to
calculate the destination address of each pixel, which is called geometrical mapping.
In the field of plane projective geometry, transformation from a point to another point
is represented as a multiplication in homogeneous coordinates [10]. Let a 2-dimensional
(2-D) point $p_x = (x, y)$ be represented as $(ax, ay, a)$ in right-handed homogeneous
coordinates, with a non-zero constant $a$. The vector $p_x$ is referenced to an origin
(0, 0). The most useful transformations are translation, rotation and scaling, examples
of which are respectively defined as:
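Trans(x, d) : translating $p_x$ to $(x + d, y)$
Rot(x, θ) : rotating the vector $p_x$ by an angle of θ about x-axis
Scale(x, c) : scaling the vector $p_x$ by $c$ along x-axis.
$$
\begin{aligned}
(x, y)\cdot \mathrm{Trans}(x, d) &= (x + d,\; y) \\
(x, y)\cdot \mathrm{Rot}(x, \theta) &= (x\cos\theta - y\sin\theta,\; x\sin\theta + y\cos\theta) \\
(x, y)\cdot \mathrm{Scale}(x, c) &= (cx,\; y)
\end{aligned}
\tag{1}
$$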
The composite of the 3 different transformations in 2-D is represented by
$$
T = \begin{bmatrix}
c\cos\theta & \sin\theta & 0 \\
-c\sin\theta & \cos\theta & 0 \\
cd & 0 & 1
\end{bmatrix}
\tag{2}
$$
which is called an affine transformation. The affine transformation is performed via a set
of multiplications and trigonometric functions.
As is easily observed, the affine transformation is necessary to map a sub-image
into another area of the image domain, with sliding, re-sizing and proper rotation.
Its immediate applications include sub-image generation for multiple picture-in-picture
(PIP) TV, and image template generation for recognition and vision/graphics processing.
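
As an illustration, the composite of eq. (2) can be reproduced by multiplying the three elementary matrices in homogeneous coordinates. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, the row-vector convention is assumed from the form (x, y)·Trans(x, d), and the order rotate, translate, scale is inferred from the `cd` entry of eq. (2).

```python
import math

def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rot(theta):
    """Rotation block as it appears in eq. (2) with c = 1, d = 0."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]

def trans_x(d):
    """Translation by d along x (row-vector convention: offset in bottom row)."""
    return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [d, 0.0, 1.0]]

def scale_x(c):
    """Scaling by c along x."""
    return [[c, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def affine(theta, d, c):
    """Composite T = Rot * Trans * Scale (assumed order; matches eq. (2))."""
    return matmul3(matmul3(rot(theta), trans_x(d)), scale_x(c))

def map_point(x, y, t, a=1.0):
    """Map (x, y) through t using homogeneous coordinates (a*x, a*y, a)."""
    h = [a * x, a * y, a]
    out = [sum(h[k] * t[k][j] for k in range(3)) for j in range(3)]
    return out[0] / out[2], out[1] / out[2]
```

Because the result is divided by the third homogeneous component, the mapped address does not depend on the choice of the non-zero constant `a`.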
A more sophisticated transformation useful for general image processing is the spherical
transformation, which basically transforms between the plane and sphere surfaces. A
spherical transformation from $p_x$ to $q_u = (u, v)$ can be represented by using a set of
elementary functions, such as square root, division, and squaring operations:
$$
u = \frac{rx}{\sqrt{r^2 - x^2 - y^2}}, \qquad
v = \frac{ry}{\sqrt{r^2 - x^2 - y^2}},
\tag{3}
$$
where r denotes the degree of curvature of the sphere surface. A conventional way to
implement the transformations starts from a software package, e.g., an interactive
graphics package. To implement dedicated hardware, possibly a set of modular structures
in VLSI, it is necessary to identify a basic cell for those functions, and there have
been two different approaches: the first is based on a set of elementary function
generators and the second on a programmable module. For the first approach, fast
function generators are necessary and the performance is limited by the slowest function
generator. The trigonometric functions are the bottleneck when implemented via the first
approach. To optimize trigonometric function generation while considering the regularity
of its structure, CORDIC has been suggested. However, the recursiveness of the CORDIC
iteration has fostered the misleading notion that the second approach is usually no
better than the first one.
Recently, as VLSI technologies evolve, the effectiveness of an integrated design is no
longer measured simply by the complexity of multiplication but, even more, by its
communication complexity, which includes regularity of the structure, simplicity of the
design and locality of the interfacing. In these respects, CORDIC has been widely
revisited and shown to be appropriate for a number of algorithmic processors. In
brief, CORDIC is a set of recursive algorithms which can easily be programmed to generate
a set of elementary functions by choosing a mode and a proper zero-enforcing condition.
It is also capable of vector-oriented processing.
3 CORDIC Techniques
In this section, we review CORDIC functions to i) perform a vector transformation and
ii) generate elementary functions. CORDIC comprises three linear recursive equations,
namely the X-, Y- and Z-recurrences. Table 1 summarizes the computing mode, input
and output specifications of the CORDIC functions of interest. As shown in the table,
these functions are classified into two cases: one which enforces Z[N] to be zero (known
as rotating) and the other which enforces Y[N] to be zero (known as vectoring). We
discuss these cases in the following sections.
3.1 Rotating case
The vector rotation of $p_x = (X[0], Y[0])$ by the angle $\theta$ can be realized by an
iterative algorithm called CORDIC [12] instead of computing trigonometric functions and
applying matrix multiplication. CORDIC realizes a vector rotation as a sequence of
micro-angle rotations with a pre-fixed sequence of angles. When the rotation macro-angle
is represented as a sum of decomposed micro-angles, i.e., $\theta = \sum_{k=0}^{n-1} \theta_k$,
the rotated vector is
$$
\begin{bmatrix} x' \\ y' \end{bmatrix}
= \prod_{k=0}^{n-1} k_k
\begin{bmatrix} 1 & \tan\theta_k \\ -\tan\theta_k & 1 \end{bmatrix} p_x^{T}
\tag{4}
$$
where $k_k = \cos\theta_k$ is a micro-scale composing a final scale factor, explained later.
Such a specific form of the pre-fixed micro-angle sequence as $\tan^{-1} 2^{-i}$ is
attractive for VLSI implementation since it is composed only of additions, shifts, and
an arctangent look-up.

Table 1: Available CORDIC Processing

Mode        Input                        Enforcing   Output
Circular    Z[0] = θ, (X[0], Y[0])       Z[N] = 0    Rotation by θ
Circular    Z[0] = 0, (X[0], Y[0])       Y[N] = 0    $X[N] = \sqrt{X[0]^2 + Y[0]^2}$, $Z[N] = \tan^{-1}(Y[0]/X[0])$
Linear      Z[0] = 0, (X[0], Y[0])       Y[N] = 0    $Z[N] = Y[0]/X[0]$
Hyperbolic  (X[0], Y[0])                 Y[N] = 0    $X[N] = \sqrt{X[0]^2 - Y[0]^2}$

Non-redundant : The micro-iterations of the conventional (hereafter called
non-redundant) CORDIC use the following 3 linear recursive equations [12]:
$$
\begin{aligned}
X[i+1] &= X[i] + m\,\sigma_i 2^{-i} Y[i] \\
Y[i+1] &= Y[i] - \sigma_i 2^{-i} X[i] \\
Z[i+1] &= Z[i] - \sigma_i \tan^{-1} 2^{-i}
\end{aligned}
\tag{5}
$$
where $m$ is set to one for the circular CORDIC, while $m = 0$ for the linear and $-1$
for the hyperbolic. With an initial value of $Z[0] = \theta$, CORDIC rotates the initial
values $X[0]$ and $Y[0]$ to the final values $X[n]$ and $Y[n]$ while driving $Z[i]$ closer
to zero in each iteration $i$, so that $Z[n]$ is forced to be zero. With $n$ iterations,
$n$-bit accuracy of $X[n]$ and $Y[n]$ can be achieved. For a known angle, the direction of
the rotation, $\sigma_i$, can be pre-computed or calculated one by one on-the-fly using the
following selection function:
$$
\sigma_i = \begin{cases} 1 & \text{if } Z[i] \ge 0 \\ -1 & \text{if } Z[i] < 0 \end{cases}
\tag{6}
$$
The CORDIC rotation does not preserve the input norm. To get a rotated vector having
the same length as the input (X[0], Y[0]), X[n] (Y[n]) needs to be compensated by a scaling
factor K
$$
K = \frac{\left\| [X[n], Y[n]]^T \right\|}{\left\| [X[0], Y[0]]^T \right\|}
  = \prod_{i=0}^{n-1} \sqrt{1 + \sigma_i^2 2^{-2i}},
\tag{7}
$$
where $\|\cdot\|$ stands for the norm of the vector. Note that K is constant for the
non-redundant scheme since $\sigma_i$ is in {-1, 1}.
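
The rotating case can be tried out in software. The sketch below is a floating-point model of the non-redundant circular recurrences (m = 1), the selection function of eq. (6) and explicit compensation by K of eq. (7); it is not a fixed-point or hardware description, and the function name and the default of 16 iterations are assumptions made for illustration. With these recurrences a positive θ turns the vector clockwise, and the input angle must lie within the convergence range of about ±1.74 rad.

```python
import math

def cordic_rotate(x0, y0, theta, n=16):
    """Non-redundant circular CORDIC, rotating case (Z driven toward zero)."""
    x, y, z = x0, y0, theta
    k = 1.0
    for i in range(n):
        sigma = 1 if z >= 0 else -1                    # selection function, eq. (6)
        t = 2.0 ** -i                                  # shift by i positions
        x, y = x + sigma * t * y, y - sigma * t * x    # X- and Y-recurrences, m = 1
        z -= sigma * math.atan(t)                      # Z-recurrence (table look-up)
        k *= math.sqrt(1.0 + t * t)                    # eq. (7); sigma**2 == 1
    return x / k, y / k, z                             # explicit scaling by K
```

The returned residual `z` indicates how far the accumulated micro-angles are from the requested macro-angle; it shrinks by roughly one bit per iteration.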
Redundant : Non-redundant CORDIC is inherently slow, with a delay of $O(n^2)$ due to
its recursiveness and serial dependency, since a micro-rotation with delay $O(n)$ must be
finished before the next micro-rotation is processed. The delay of a macro-rotation
($n$ micro-rotations) can be improved from $O(n^2)$ to $O(n)$ by using redundant arithmetic
(carry-free addition such as carry-save or signed-digit addition) and determining the
direction of the rotation $\hat{\sigma}_i$ from an estimate instead of an exact value [14].
Redundant arithmetic gives an addition delay of $O(1)$ instead of $O(n)$, and estimating
the direction is necessary so as not to erode the advantage of $O(1)$. This requires
modification of the recurrences and the selection function. This redundant CORDIC scheme
produces the output about 4 times faster than the non-redundant one [14]. However, it
introduces additional cost since the scale factor K becomes variable, depending on the
macro-angle, because $\hat{\sigma}_i$ is allowed to be in {-1, 0, 1}.
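
To make the carry-free idea concrete, the sketch below models one carry-save addition step on non-negative integers. It is only an illustration of why the delay per addition is independent of the word length (every bit position is computed from the same position of the inputs); the function name is hypothetical and the paper's signed-digit variant is not modelled.

```python
def carry_save_add(a, b, c):
    """Reduce three non-negative integers to a (sum, carry) pair.

    No carry ripples across bit positions: each output bit depends only on
    the three input bits at the same position.
    """
    s = a ^ b ^ c                                   # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1      # majority bits at the next weight
    return s, carry
```

The pair `(s, carry)` is a redundant representation of `a + b + c`; a full carry-propagating addition is needed only when the exact value is required, which is why the direction of rotation is taken from an estimate.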
Constant-Factor-Redundant : To reduce the implementation cost of redundant CORDIC,
it would be good to have a constant scale factor by forcing $\hat{\sigma}_i$ to be in
{-1, 1}. However, since $\hat{\sigma}_i$ is determined from an estimate, the question of
assuring convergence arises. A scheme appending correcting iteration stages at proper
positions was proposed for this [15]. Following this idea, the number of extra correcting
iterations is further reduced by dividing the micro-iterations (for i = 0 to i = n - 1)
into two groups: one group where the direction of the rotation is in {-1, 1}, for i = 0
to i = n/2, and the other where it is in {-1, 0, 1}, for i = (n + 1)/2 to i = n - 1.
This reduces the correcting iterations by 50 % since a correcting iteration is not needed
for the second half of the micro-iterations, and we still obtain a constant scale factor
K since the value of K in n-bit precision does not depend on the $\hat{\sigma}_i$ value
for $(n + 1)/2 \le i \le n - 1$. The Z-recurrence can also be modified so that
$\hat{\sigma}_i$ is determined quickly by looking at a few most significant bits. This new
scheme is called Constant-Factor-Redundant CORDIC (CFR-CORDIC). The modified recurrences
and selection functions for the scheme are described below.
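$$
\begin{aligned}
X[i+1] &= X[i] + \hat{\sigma}_i 2^{-i} Y[i] \\
Y[i+1] &= Y[i] - \hat{\sigma}_i 2^{-i} X[i] \\
U[i+1] &= 2\left(U[i] - \hat{\sigma}_i 2^{i} \tan^{-1} 2^{-i}\right)
\end{aligned}
\tag{8}
$$
where U[i], which is equal to $2^i Z[i]$, is introduced for implementation simplicity,
and the selection function is given as follows:
$$
\hat{\sigma}_i = \begin{cases}
1 & \text{if } \hat{U}[i] > 0, \text{ or } \hat{U}[i] = 0 \text{ and } i \le n/2 \\
0 & \text{if } \hat{U}[i] = 0 \text{ and } i > n/2 \\
-1 & \text{if } \hat{U}[i] < 0
\end{cases}
\tag{9}
$$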
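
For readers checking the scaled recurrence, the U-recurrence follows from the Z-recurrence of the non-redundant scheme by the substitution $U[i] = 2^i Z[i]$; the short derivation below uses only that substitution:

```latex
\begin{aligned}
U[i+1] &= 2^{i+1} Z[i+1] \\
       &= 2^{i+1}\left(Z[i] - \hat{\sigma}_i \tan^{-1} 2^{-i}\right) \\
       &= 2\left(U[i] - \hat{\sigma}_i\, 2^{i} \tan^{-1} 2^{-i}\right)
\end{aligned}
```

Since $2^{i}\tan^{-1} 2^{-i}$ approaches 1 as $i$ grows, the scaled constants keep a fixed magnitude, which is consistent with deciding the direction from a few most significant bits at fixed positions.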
When $t$ fractional bits are used in the estimate, i.e., $\hat{U}[i]$ is computed using $t$
fractional bits of the redundant representation of $U[i]$, the following correcting
iteration needs to be included, where the interval between indexes of correcting
iterations should be less than or equal to $(t - 1)$, up to the last iteration index equal
to $n/2$. When the correction stage is necessary at the $j$th step of micro-iteration,
$$
U^c[j+1] = U[j+1] - 2\hat{\sigma}^c_j 2^{j} \tan^{-1} 2^{-j}
\tag{10}
$$
with the direction of the rotation $\hat{\sigma}^c_j$ determined from the same selection
function of eq. (9), except that it is decided based on $\hat{U}[j+1]$ instead of
$\hat{U}[i]$.
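
The statement that K stays constant to n-bit precision even when the direction may be zero in the second half of the micro-iterations can be checked numerically. The sketch below compares the two extreme cases (all second-half directions non-zero versus all zero). It assumes that "n-bit precision" means a difference below $2^{-n}$ and that the first group covers i = 0 to floor(n/2); both are assumptions made for this illustration, and the function names are hypothetical.

```python
import math

def scale_factor(sigmas):
    """K = prod sqrt(1 + sigma_i^2 * 2^(-2i)) over the given direction digits."""
    k = 1.0
    for i, s in enumerate(sigmas):
        k *= math.sqrt(1.0 + (s * s) * 2.0 ** (-2 * i))
    return k

def k_spread(n):
    """Largest change in K when second-half directions vary between 0 and +-1."""
    half = n // 2
    first = [1] * (half + 1)                         # i = 0 .. n/2, direction in {-1, 1}
    k_all_one = scale_factor(first + [1] * (n - half - 1))
    k_all_zero = scale_factor(first + [0] * (n - half - 1))
    return k_all_one - k_all_zero
```

For the word lengths tried, the spread remains below $2^{-n}$, because each second-half factor differs from 1 by only about $2^{-2i-1}$.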
3.2 Vectoring case
While the rotating case affords vector-wise rotation to implement a geometrical mapper,
the vectoring case computes elementary functions as in Table 1. The apparent difference
between the vectoring and rotating modes is the zero-enforcing parameter, which
necessitates a different selection function. For the conventional CORDIC, the recurrence
equations are given by:
$$
\begin{aligned}
X[i+1] &= X[i] + \sigma_i 2^{-i} Y[i] \\
Y[i+1] &= Y[i] - \sigma_i 2^{-i} X[i] \\
Z[i+1] &= Z[i] + \sigma_i \tan^{-1} 2^{-i}
\end{aligned}
\tag{11}
$$
with the following selection function:
$$
\sigma_i = \begin{cases} 1 & \text{if } Y[i] \ge 0 \\ -1 & \text{if } Y[i] < 0 \end{cases}
\tag{12}
$$
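The selection function for CFR-CORDIC in vectoring has been developed as shown below.
Let $W[i] = 2^i Y[i]$, in the same manner as for the rotating case; then
$$
\begin{aligned}
X[i+1] &= X[i] + \hat{\sigma}_i 2^{-i} Y[i] \\
W[i+1] &= 2\left(W[i] - \hat{\sigma}_i X[i]\right) \\
Z[i+1] &= Z[i] + \hat{\sigma}_i \tan^{-1} 2^{-i}
\end{aligned}
\tag{13}
$$
$$
\hat{\sigma}_i = \begin{cases}
1 & \text{if } \hat{W}[i] > 0, \text{ or } \hat{W}[i] = 0 \text{ and } i \le n/2 \\
0 & \text{if } \hat{W}[i] = 0 \text{ and } i > n/2 \\
-1 & \text{if } \hat{W}[i] < 0
\end{cases}
\tag{14}
$$
Here the correcting stage at the jth step is defined as follows:
$$
W^c[j+1] = W[j+1] - 2\hat{\sigma}^c_j X[j+1]
\tag{15}
$$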
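
The vectoring case can be modelled in the same way as the rotating case. The sketch below is a floating-point model of the conventional (non-redundant) circular vectoring recurrences, with the direction chosen from the sign of Y so that Y is driven toward zero; the redundant estimate and the correcting stages of CFR-CORDIC are not modelled. The function name, the default of 16 iterations and the assumption X[0] > 0 are illustrative choices.

```python
import math

def cordic_vector(x0, y0, n=16):
    """Non-redundant circular CORDIC, vectoring case (Y driven toward zero).

    Assumes x0 > 0. Returns (magnitude, angle) after compensating by K.
    """
    x, y, z = x0, y0, 0.0
    k = 1.0
    for i in range(n):
        sigma = 1 if y >= 0 else -1                    # direction from the sign of Y
        t = 2.0 ** -i
        x, y = x + sigma * t * y, y - sigma * t * x    # X- and Y-recurrences
        z += sigma * math.atan(t)                      # accumulate the angle in Z
        k *= math.sqrt(1.0 + t * t)                    # scale factor, sigma**2 == 1
    return x / k, z
```

For example, `cordic_vector(3.0, 4.0)` approximates the magnitude 5 and the angle $\tan^{-1}(4/3)$, which correspond to the second row of Table 1.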
So far, we have discussed the recursive structures of several CORDIC schemes to
implement the basic PE. The PE, augmented by a translator, necessitates a scaling
operation at each stage, because shuffling of the output at each stage makes continuous
accumulation of the scaling factor too complex to be processed at the final stage. The
scaling operation has been solved either explicitly or implicitly. The explicit way is
to divide the rotated vector by a constant, which is known in advance for the
non-redundant scheme and otherwise has to be calculated while running the micro-steps of
CORDIC [12,14]. The division can be processed by another CORDIC (in a linear mode) or a
divider. The implicit approach reconfigures the sequence of micro-iterations of the
CORDIC, so that the result eventually has a different norm from that obtained without
scaling micro-iterations. Scaling micro-iterations in general aim at bringing the
adjusted scaling factor to the form of a power of 2 or 1, which can easily be set to the
unit size. Each micro-iteration can be composed of i) reduction axis-scaling [16],
ii) repetition of vector-scaling, iii) expansion axis-scaling, or combinations thereof.
Relevant issues regarding the search for a solution better than the greedy method or the
decomposed search [18] are to be studied further. In summary, explicit scaling almost
doubles the system complexity, while implicit scaling increases it by 25 % for
non-redundant CORDIC and about 30 % for redundant CORDIC.
3.3 VLSI Scheme
To maximize the throughput of the geometric processor, the fully spanned architecture is
selected. The affine transformer is a trivial case, which can be implemented by using a
single CORDIC whose micro-iteration is expanded to include an addition. To implement a
spherical transformer, 4 CORDICs are configured: i) a circular square root of
$x^2 + y^2$, ii) a hyperbolic square root of $r^2 - (\sqrt{x^2 + y^2})^2$, and iii) two
linear divisions for u and v. To get first estimates of the VLSI size, a typical TV
image processing application is
considered: O(10 s) pixel/image addressing and $O(10^{-1})$ sec screen flashing. For this
case, the number of input bits $b_i$ is set by the per-axis address range of about
$\sqrt{\text{pixel number}}$, for which 12 bits are sufficient. To allow possible
interpolations between pixels, $b_t$ is set to be 16. Each CORDIC module requires
$(b_t + \log_2 b_t)$ steps of micro-iterations, and 30 % additional iterations for
implicit scaling. For the spherical transformer, using the fully spanned 4-CORDIC, the
number of transistors is estimated at about 30K (4 × 6K × 1.3).
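
The four-stage decomposition of the spherical transformer and the iteration count can be sketched as follows. This is a functional model only: each stage uses a standard-library call as a stand-in for the corresponding CORDIC module, forming the numerators r·x and r·y before the division is an assumption of this sketch, and the function names are hypothetical.

```python
import math

def spherical_map(x, y, r):
    """Map plane point (x, y) to (u, v) by eq. (3), staged as four CORDIC-like steps."""
    rho = math.hypot(x, y)                # stage i: circular vectoring, sqrt(x^2 + y^2)
    if rho >= r:
        raise ValueError("point must satisfy x^2 + y^2 < r^2")
    s = math.sqrt(r * r - rho * rho)      # stage ii: hyperbolic vectoring, sqrt(r^2 - rho^2)
    u = (r * x) / s                       # stage iii: linear division for u
    v = (r * y) / s                       # stage iii: linear division for v
    return u, v

def micro_iterations(bits, scaling_overhead=0.30):
    """Steps per CORDIC module: bits + log2(bits), without and with implicit scaling."""
    base = bits + math.log2(bits)
    return base, base * (1.0 + scaling_overhead)
```

With 16 bits this gives 20 micro-iterations per module, or 26 once the 30 % implicit-scaling overhead is included, which is the same factor of 1.3 that appears in the transistor estimate.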
References
[1] R. Nicholl and T. Nicholl, "Performing Geometric Transformations by Program
Transformation," ACM Trans. on Graphics, Vol. 9, No 1, pp. 28-40, 1990.
[2] N. Ansari and E. Delp, "Recognizing Planar Objects in 3-D Space," Proc. of SPIE,
Vol. 1197, pp. 127-138, 1989.
[3] R. Cossu, M. Ercoli and L. Moltedo, "Extension of CGI Functions for Generation and
Manipulation of Raster Image," Computers & Graphics, Vol. 13, No 1, pp. 39-48, 1989.
[4] T. Nakanishi and H. Yoshimura, "A High-speed Address Generator for Affine
Transformation," Nat. Conv. IECE, 1985.
[5] ACM/SIGGRAPH Graphics Standards Planning Committee, Report of the CORE
Definition Subgroup, 1977.
[6] H. Yoshimura, T. Nakanishi and H. Yamauchi, "A 50-MHz CMOS Geometrical Mapping
Processor," IEEE Transactions on Circuits and Systems, Vol. 36, No. 10, pp. 1360-1364,
October 1989.
[7] K. Arbter et al., "Application of Affine-invariant Fourier Descriptors to
Recognition of 3-D Objects," IEEE Trans. Pattern Analysis and Machine Intelligence,
Vol. 12, No 7, pp. 640-647, July 1990.
[8] T. Wakahara, "On-line Handwritten Character Recognition Using Local Affine
Transformation," Systems and Computers in Japan, Vol. 20, No 7, pp. 10-19, July 1989.
[9] K. Aono, M. Toyokura and T. Araki, "30nsec (600 Mops) Image Processor with a
Reconfigurable Pipeline Architecture," Proc. IEEE 1989 Custom Integrated Circuits,
pp. 24.4.1-4, 1989.
[10] E. Maxwell, General Homogeneous Coordinates in Space of Three Dimensions, The
University Press, Cambridge, England, 1961.
[11] [...] and E. Swartzlander, "Image Processing Address Generator Chip,"
Proceedings of IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. 993-996,
1985.
[12] J. Walther, "A Unified Algorithm for Elementary Functions," AFIPS Spring Joint
Computer Conference, pp. 379-385, 1971.
[13] H. Kung, "Let's Design Algorithms for VLSI Systems," Caltech Conf. VLSI, pp. 65-90,
1979.
[14] M. Ercegovac and T. Lang, "Redundant and On-Line CORDIC: Application to Matrix
Triangularization and SVD," IEEE Trans. on Computers, Vol. C-39, No 6, pp. 725-740,
June 1990.
[15] N. Takagi, T. Asada and S. Yajima, "Redundant CORDIC Methods with a Constant
Scale Factor for Sine and Cosine Computation," Submitted to IEEE Trans. on Computers,
1989.
[16] G. Haviland and A. Tuszynski, "A CORDIC Arithmetic Processor Chip," IEEE Trans.
on Computers, Vol. C-29, No 2, pp. 68-79, Feb. 1980.
[17] J. Delosme, "VLSI Implementation of Rotations in Pseudo-Euclidean Spaces,"
Proceedings of IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. 927-930,
1983.
[18] J. Lee and T. Lang, "Matrix Triangularization by Fixed-point Redundant CORDIC
with a Constant Scale Factor," Proc. SPIE Conference on Advanced Signal Processing
Algorithms, Architectures, and Implementations, July 1990.
[19] S. Note et al., "Automated Synthesis of a High Speed CORDIC Algorithm with the
CATHEDRAL-III Compilation System," Int. Conf. Circuit and System, pp. 581-584,
1988.
