Super scalar high speed 2µ N-well MOSIS CMOS digital halftoning processor by Gupta, Anupam
Rochester Institute of Technology 
RIT Scholar Works 
Theses 
6-1-1996 
Super scalar high speed 2µ N-well MOSIS CMOS digital halftoning processor
Anupam Gupta 
Recommended Citation 
Gupta, Anupam, "Super scalar high speed 2µ N-well MOSIS CMOS digital halftoning processor"
(1996). Thesis. Rochester Institute of Technology. Accessed from 






Partial Fulfillment of the





Chairman - George A. Brown, - Professor - Computer Engineering Department
Graduate Advisor - Peter G. Anderson, - Professor and Chairman - Computer Science Department
Graduate Advisor - Roger L. Easton, - Professor - Center For Imaging Science





THESIS REPRODUCTION RESTRICTION NOTICE
ROCHESTER INSTITUTE OF TECHNOLOGY
COLLEGE OF ENGINEERING
Title: Super Scalar High Speed 2µ N-well MOSIS CMOS Digital Halftoning Processor.
I, Anupam S. Gupta, hereby deny permission to the Wallace Memorial Library to reproduce my






Digital halftoning is the algorithmic process for converting electronic images into bitonal images
that preserves the perception of a continuous-tone image. Various digital halftoning algorithms
were considered in the development of this processor on the basis of quality of image and
amenability for VLSI implementation. An error-diffusion algorithm with the options of noise
encoding, printer model adjustment, and edge enhancement was chosen for implementation.
Since the algorithm allows for multiple independent parallel processors to operate on the same
image, the system is capable of super scalar processing. The processor is intended for an 8-bit
input (256 gray levels). The processor was designed using a 2µ N-well MOSIS CMOS process.
The expected processor speed for that process is about 21 million pixels/sec. The processing
speed was enhanced by using Double Pass Transistor Logic implementation on all logic
components used in the processor.
ACKNOWLEDGMENTS
I wish to acknowledge the invaluable assistance and cooperation I have received from many
faculty members, students and IEEE members. I wish to thank my faculty committee members
Professor George Brown, Professor Peter Anderson and Professor Roger Easton for their
invaluable assistance, advice and guidance throughout the project. I wish to thank Professor Ken
Hsu for providing the requisite handouts on VLSI design rules and tools and the necessary
contact with Professor W. Liu of N. C. State University. Professor Liu was invaluable in assisting
me with the crucial adder design. I would especially like to express my gratitude to the computer
engineering department student technician Mr. Frank Casilio whose assistance has been
invaluable throughout the project. He has been instrumental in providing the intended




List of Figures
List of Images
List of Tables
List of Blocks
List of Schematics
List of Simulations
List of Layout
1.0 Digital Halftoning Introduction
1.1 Reflectance and Luminance
1.2 Discussion On Types of Noise Produced By Halftoning
1.3 Printer Model Theory
2.0 Methods of Digital Halftoning
2.1 Noise Encoding
2.2 Ordered Dither
2.3 Error Diffusion
2.4 Selected Algorithm
3.0 Processor Architecture
3.1 Pseudorandom Number Generator
3.2 Error Computation Block
3.3 Binarization Block
3.4 Overall Top Level Post Layout Back-annotated Simulation





Figure 1.3-1 Overlap of two neighboring black pixels.
Figure 4.01 Block diagram of carry generator parallel adder.
Figure 4.02 Block diagram of the carry generator.
Figure 4.03 Comparison of CPL, CMOS, DPL equivalent resistance.
Figure 4.04 Comparison of circuit design, operation, and voltage swing of XOR gate implemented in CPL, CMOS, and DPL technology.
List of Images
Image 2.1-1 Vertical Gray Scale Ramp processed with Noise Encoding.
Image 2.1-2 Portrait of Lena processed with Noise Encoding.
Image 2.1-3 Landscape of a house processed with Noise Encoding.
Image 2.2-1 Vertical Gray Scale Ramp processed with Clustered Dot Algorithm.
Image 2.2-2 Portrait of Lena processed with Clustered Dot Algorithm.
Image 2.2-3 Landscape of a house processed with Clustered Dot Algorithm.
Image 2.2-4 Vertical Gray Scale Ramp processed with Dispersed Dot Algorithm.
Image 2.2-5 Portrait of Lena processed with Dispersed Dot Algorithm.
Image 2.2-6 Landscape of a house processed with Dispersed Dot Algorithm.
Image 2.4-1 Vertical Gray Scale Ramp processed with Error Diffusion Algorithm.
Image 2.4-2 Portrait of Lena processed with Error Diffusion Algorithm.
Image 2.4-3 Landscape of a house processed with Error Diffusion Algorithm.
List of Tables
Table 2.2-1 Example of a clustered dot dither screen.
Table 2.2-2 Example of a dispersed dot dither screen.
List of Blocks
Block 3.0-1 Black box representation of the Pseudorandom Number generator.
Block 3.0-2 Black box representation of the Error Computation Block.






















Schematic of the Pseudorandom Number generator.
Schematic of the Error Computation Block.
Schematic of the Binarization block.
Top Level Logic model for the processor.
Schematic of f_gl_buffer.
Schematic of f_pl_gl_buffer.
Schematic of f_gl_pl_gr_buffer.
Schematic of f_gl_pl_gr_pr_buffer.
Top level schematic of 12 bit adder.
Enlarged view of bits 0 through 3 of the 12 bit adder.
Enlarged view of bits 4 through 7 of the 12 bit adder.
Enlarged view of bits 8 through 11 of the 12 bit adder.








Schematic of a Double Pass Transistor Logic AND gate.
Schematic of a Double Pass Transistor Logic XOR gate.
Schematic of the worst case load on the XOR and AND gate in the adder's input stage.
Schematic of the worst case load on f_gl_pl_gr_buffer.
Schematic 4.0-13 Schematic of the worst case load of the XOR gate at the








Back-annotated post layout simulation of the logic model.
Back-annotated post layout simulation of the logic model.
Worst case (g,p) Generator transistor level Spice Model Simulations.
Worst case Carry Generator transistor level Spice Model Simulations.




Lay 4.0-1 Layout of f_gl_buffer.
Lay 4.0-2 Layout of f_pl_gl_buffer.
Lay 4.0-3 Layout of f_gl_pl_buffer.
Lay 4.0-4 Layout of f_gl_pl_gr_pr_buffer.
1.0 Digital Halftoning Introduction
Digital halftoning is the algorithmic process of converting gray-scale digitized images into bitonal
images while preserving the perception of continuous tone in the original image. Continuous
gray-scale images are usually obtained from scanned photographs or by computer generation.
Many algorithms are available to achieve such a conversion. Some of the common algorithms,
such as error diffusion and ordered-dither masking, are used extensively in many printing
applications.
The criteria for the selection of a suitable algorithm were high image quality, low computational
complexity, and the possibility of an efficient VLSI implementation.
1.1 Reflectance and Luminance
An image function denoted by f(x,y) refers to the two-dimensional distribution of light-intensity (f)
at the spatial coordinate (x,y). This function consists of two components. The first is the
illumination function i(x,y), which is defined as the distribution of light incident at the spatial
coordinate (x,y). The second component is the reflectance r(x,y), which corresponds to the
amount of light reflected at the spatial coordinate (x,y). The relationship between f(x,y), i(x,y),
and r(x,y) is expressed in eqn 1.1-1 [1].
$f(x,y) = i(x,y) \, r(x,y)$ eqn - 1.1-1
The gray level (g) of any monochrome image at any point (x,y) is the intensity of the image after
quantization to a limited number of gray levels.
1.2 Discussion On Types Of Noise Produced By Halftoning
The process of halftoning creates noise, or non-image components, in the resulting bitonal image.
Noise components are generally classified on the basis of their frequency spectrum. Blue noise is
the high spatial frequency noise component of the bitonal image and tends to be invisible to the
eye. Pink noise is the low spatial frequency noise component of the bitonal image. This type of
noise may be visible to the eye.
1.3 Printer Model Theory
Fig. 1.3-1 - Overlap of two neighboring black pixels. Note that black pixels
are bigger than the white pixels.
Most of the theory discussed in this section has been obtained from Pappas and Neuhoff [2].
This section is based on the model of "write black" electrophotographic laser printers with 300
dpi resolution. These printers print black spots of approximately circular shape. Let T be the
spatial distance between the centers of any two adjacent pixels. The reciprocal of T is the printer
resolution, which is commonly expressed in dots per inch. For an ideal circular dot to cover its
T x T pixel square completely, its radius must equal half the diagonal of the square, which by the
Pythagorean theorem is $T/\sqrt{2}$. The area of this dot is $(\pi/2) T^2$, which is approximately
57% larger than the area of a T x T square (Fig. 1.3-1). Dividing this excess among the four
edge-adjacent pixels implies that a black pixel tends to occupy about 14.3% of the area
of each neighboring pixel. Suppose a white pixel is surrounded by d black pixels; then (14.3)(d)%
of the white pixel area will be blackened. This is a major source of gray scale distortion of
halftoned images. These characteristics are applicable only to an ideal circular dot. The dots
actually printed usually are not perfectly round, not perfectly black, not the ideal size, and may
be slightly misplaced. These distortions may be due to the spread and/or movement of toner
particles, distortions in the laser beam, uneven heat finish, and reflections of light within the
paper. This implies that the distortion of the black pixel in a real printer will be greater than that
calculated for the ideal printer [2].
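The dot-overlap figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the ideal circular dot of radius T/√2 and assuming that the excess area is shared equally by the four edge-adjacent pixels; the function name `blackened_fraction` is illustrative and does not come from the thesis.

```python
import math

T = 1.0 / 300.0                      # pixel pitch in inches for a 300 dpi printer
radius = T / math.sqrt(2)            # ideal dot passes through the corners of the T x T square
dot_area = math.pi * radius ** 2     # equals (pi / 2) * T**2
square_area = T ** 2

# Fraction by which the dot exceeds its own pixel square (about 0.57).
excess = dot_area / square_area - 1.0

# Assumption: the excess is split equally over the four edge-adjacent pixels (about 0.143 each).
overlap_per_neighbor = excess / 4.0


def blackened_fraction(d):
    """Fraction of a white pixel covered when d of its edge neighbors are black."""
    return d * overlap_per_neighbor


print(round(excess, 4), round(overlap_per_neighbor, 4))
```

Under these assumptions a white pixel with two black edge neighbors loses roughly 28.5% of its area, which illustrates why an uncompensated halftone prints too dark.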
Since a pixel can have only two states, it can be referred to as a bit (binary digit). Let u(x,y) be
the gray level produced by the printer at point (x,y) on a matrix of M x N pixels. The gray level
produced at (x,y) will depend on the states of the surrounding pixels [2]:
$u(x,y) = f(x,y;B_{x,y})$, $T/2 \le x \le MT + T/2$ eqn - 1.3-1
$T/2 \le y \le NT + T/2$ eqn - 1.3-2
where f could be a deterministic or stochastic function, and $B_{x,y}$ denotes the set of bits in an
immediate neighborhood of the point (x,y).
Assume a uniform toner blur over a distance T; due to close spacing of the dots and the limited
spatial resolution of the eye, the gray level at site (ij) can be modeled as having a constant




$iT - T/2 \le x \le iT + T/2$, eqn - 1.3-3
$jT - T/2 \le y \le jT + T/2$ eqn - 1.3-4
for $1 \le i \le M$ and $1 \le j \le N$. Assuming a uniform average, the gray level can be computed as
shown in eqn 1.3-5 [2]:
$p_{i,j} = \frac{1}{T^2} \int_{jT - T/2}^{jT + T/2} \int_{iT - T/2}^{iT + T/2} f(x,y;B_{x,y}) \, dx \, dy$ eqn - 1.3-5
for $1 \le i \le M$ and $1 \le j \le N$
Therefore the actual gray level perceived by the eye at any given microscopic section is the ratio
of the area which is black to the total area in consideration. If the oversize black pixel is
uncompensated in the halftoning algorithm, then the final image will be too dark.
2.0 Methods Of Digital Halftoning
Three methods are commonly used for digital halftoning [3]. The first method to be discussed is
called "noise encoding". This method is based on probabilistic thresholding to maintain gray
appearance. The second method is ordered dither, and compares the gray-scale image to a two-
dimensional dither mask containing threshold values. The third method is known as error
diffusion, which relies on the computation of the darkness error incurred by the bitonalization of
the current pixel and the incorporation of this error in the processing of subsequent pixels.
2.1 Noise Encoding
The goal of any halftoning algorithm is to reduce the number of quantization levels in a digital
image while still maintaining the illusion of the original image. In a noise-encoding algorithm, the
threshold is adjusted to introduce noise into the output image [3]. The type of noise used for
encoding will determine the gray appearance and the sharpness of the edges in the image. A
typical algorithm for input image f(i,j) with a gray scale range [0,H] is:
$b(i,j) = 0$, if $f(i,j) + r[-a,a] < t$ eqn - 2.1-1
$b(i,j) = 1$, if $f(i,j) + r[-a,a] \ge t$
for $1 \le i \le M$ and $1 \le j \le N$
where b(i,j) is the halftoned output at the point (i,j), r[-a,a] is a deterministic or a pseudorandom
function within the range [-a,a], f(i,j) represents the gray scale in the original image at the point
(i,j), and t is a fixed threshold. The values of t and a are specified by the algorithm.
The distribution of black and white pixels depends on the function r and the range [-a,a]. Due to
the spatial integration of the eye (eqn 1.3-5), the perceived gray level at any fractional area
will be proportional to the density of black pixels in the area. The relative number of black pixels
at any given area in the output image depends on the average gray level of the corresponding area
in the original image. This can be demonstrated by a simple case in which t is chosen to be
midway in the gray scale range [0,H] of the original image and r is chosen to be a uniform
random distribution over [-H/2, H/2]. The probability that b(i,j) represents a white pixel is given by:
$P[b(i,j) = 1] = f(i,j)/H$ eqn - 2.1-1
This implies that the perceived gray level is proportional to the gray level of the continuous-tone
image. Therefore the gray-level representation is accurate and the edge-preservation capabilities
are maintained [3]. One of the main problems with this type of method is that the images appear
to be grainy. This occurs because there is pink noise in the frequency spectrum of the image
generated by the algorithm.
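The probability argument above can be illustrated numerically. The sketch below is not the thesis implementation: it assumes uniform pseudorandom noise over [-H/2, H/2] and a threshold t = H/2, and the names `noise_encode` and `seed` are introduced here only for illustration.

```python
import random


def noise_encode(image, H=255, a=None, t=None, seed=0):
    """Threshold each pixel after adding uniform noise drawn from [-a, a]."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    a = H / 2 if a is None else a      # assumed noise range
    t = H / 2 if t is None else t      # threshold midway in [0, H]
    return [[1 if f + rng.uniform(-a, a) >= t else 0 for f in row]
            for row in image]


# A flat patch of gray level 64 should come out about 64/255 white.
flat = [[64] * 200 for _ in range(200)]
out = noise_encode(flat)
white_fraction = sum(map(sum, out)) / (200 * 200)
print(white_fraction, 64 / 255)
```

The white fraction tracks f/H only on average; the pixel-to-pixel randomness is what gives the grainy appearance described above.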
Professor Peter Anderson has developed a noise encoding algorithm which generates quality
images [4]. The quality of the image was determined by psychovisual inspection. This algorithm
is based on a sequence of numbers (X, Y, Z) which generate a set of pseudorandom numbers
which are used in determining the threshold values.
$b(i,j) = 0$, if $f(i,j) \cdot Z < \{(iX + jY) \bmod Z + (H/2 - Z/2)\} \cdot H$ eqn - 2.1-2
$b(i,j) = 1$, if $f(i,j) \cdot Z \ge \{(iX + jY) \bmod Z + (H/2 - Z/2)\} \cdot H$ eqn - 2.1-3
for $1 \le i \le M$ and $1 \le j \le N$
where X, Y, Z are consecutive terms of the sequence:
$C_n = C_{n-1} + C_{n-3}$ eqn - 2.1-4
$C_0 = C_1 = C_2 = 1$
where $C_n$ is the nth term of the sequence.
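The sequence in eqn 2.1-4 is easy to generate. The sketch below reads the recurrence as C(n) = C(n-1) + C(n-3) with three initial terms equal to 1, an interpretation that reproduces the triple X = 28, Y = 41, Z = 60 used in the text; the function name `anderson_sequence` is hypothetical.

```python
def anderson_sequence(count):
    """Return the first `count` terms of C(n) = C(n-1) + C(n-3), C(0) = C(1) = C(2) = 1."""
    terms = [1, 1, 1]
    while len(terms) < count:
        terms.append(terms[-1] + terms[-3])
    return terms[:count]


terms = anderson_sequence(13)
print(terms)        # [1, 1, 1, 2, 3, 4, 6, 9, 13, 19, 28, 41, 60]
print(terms[-3:])   # [28, 41, 60]
```

Any three consecutive terms could be used as (X, Y, Z); the values 28, 41, 60 are simply the eleventh through thirteenth terms.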
For example: let i = 1, j = 1, X = 28, Y = 41, Z = 60, H = 255, and f(1,1) = 59, and substituting
these values in eqn 2.1-2, the equation can be expressed as:
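$b(1,1) = 0$ if $[59 \cdot 60 < \{(1 \cdot 28 + 1 \cdot 41) \bmod 60 + (128 - 30)\} \cdot 255]$ eqn - 2.1-5
Here 59 · 60 = 3540 and (9 + 98) · 255 = 27285, so the condition holds and b(1,1) = 0.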
This algorithm was used with the sequence X = 28, Y = 41, Z = 60 to process a number of
images. The resulting images for a gray scale ramp (Img 2.1-1), portrait of Lena (Img 2.1-2), and
landscape of a house (Img 2.1-3) are shown on the following pages. Since the algorithm does not
compensate for the printer model, the images have been printed in 72 dpi resolution. To get the
effect of a higher resolution of 300 dpi, one has to view the images at a distance of about 4.2
feet. The transition between various gray levels in the gray scale ramp is relatively smooth, and
the images of Lena and the house appear to be fairly accurate.
Img 2.1-1 - Vertical Gray Scale Ramp processed by using Dr. Peter Anderson's algorithm
with X = 28, Y = 41, Z = 60 and printed at a resolution of 72 dpi. Image should be viewed at
a distance of about 4.2 feet to get the effect of 300 dpi. Note the smooth transition
between gray levels.
Img 2.1-2 - Portrait of Lena processed by using Dr. Peter Anderson's algorithm with
X = 28, Y = 41, Z = 60 and printed at a resolution of 72 dpi. Image should be viewed at a
distance of about 4.2 feet to get the effect of 300 dpi. Note the accuracy of different gray
levels and the well-defined edges.
Img 2.1-3- Landscape of a house processed by using Dr. Peter Anderson's algorithm with
X = 28, Y = 41, Z = 60 and printed at a resolution of 72 dpi. Image should be viewed at a
distance of about 4.2 feet to get the effect of 300 dpi. Note the accuracy of different gray
levels and the good definition of edges.
2.2 Ordered Dither
This method adds a deterministic function to the original image before thresholding. This method
is illustrated in equation 2.2-1:
$b(i,j) = 0$, if $f(i,j) < T(k,l)$ eqn - 2.2-1
$b(i,j) = 1$, if $f(i,j) \ge T(k,l)$
where T is a threshold array having dimensions m by n, with $k = \mathrm{mod}(i,m)$ and $l = \mathrm{mod}(j,n)$.





latter algorithm, the black pixels are clustered in large groups which make up the halftone dots in
a fashion similar to traditional printer halftones. These algorithms have the advantage of
reducing the black-pixel distortion error discussed in section 1.3. One of the main disadvantages of
these algorithms is that they produce a strong fundamental frequency which blurs the image
detail. If used with a low resolution display, the image tends to appear to be periodically granular
[6].
Dispersed-dot dither algorithms produce isolated printed dots. They have the advantage of
generating a halftone texture which contains higher frequencies and therefore is less grainy. The
second advantage of the dispersed dot algorithms is that they reproduce fine image detail better
than the clustered dot. Their main disadvantage is that they are highly susceptible to printer
distortions. Their second disadvantage is that they create false edges in the reproduced image.
These edges are caused by the directional changes in the halftone patterns as function of the
gray level [6].
Classical screens have been used extensively in printing applications since the nineteenth
century. They produce clustered dots with the size of the dot at any point in the resulting bitonal
image being proportional to the darkness at that point in the original image. The threshold matrix
for a 6 x 6 Classical screen is shown on Tab 2.2-1. Note that this screen can generate up to 37
gray levels, and the maximum possible input gray level is 36.
35 30 18 22 31 36
29 15 10 17 21 32
14 9 5 6 16 20
13 4 1 2 11 19
28 8 3 7 24 25
34 27 12 23 26 33
Tab 2.2-1 - Example of a clustered dot dither algorithm - 6x6 Classical screen [5].
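As an illustration of how a threshold array is applied, the following sketch tiles the 6x6 classical screen of Tab 2.2-1 over an image and thresholds each pixel against it. It assumes zero-based indices and the convention that an output of 1 (white) is produced when the pixel value is at least the threshold; the function names are illustrative only.

```python
SCREEN = [
    [35, 30, 18, 22, 31, 36],
    [29, 15, 10, 17, 21, 32],
    [14,  9,  5,  6, 16, 20],
    [13,  4,  1,  2, 11, 19],
    [28,  8,  3,  7, 24, 25],
    [34, 27, 12, 23, 26, 33],
]


def ordered_dither(image, screen):
    """Threshold each pixel against the screen entry at (i mod m, j mod n)."""
    m, n = len(screen), len(screen[0])
    return [[1 if f >= screen[i % m][j % n] else 0
             for j, f in enumerate(row)]
            for i, row in enumerate(image)]


def white_count(level):
    """Number of white pixels in one 6x6 tile filled with a constant gray level."""
    tile = [[level] * 6 for _ in range(6)]
    return sum(map(sum, ordered_dither(tile, SCREEN)))


counts = [white_count(g) for g in range(37)]
print(counts)   # one distinct count per input level 0..36
```

Because the screen holds each threshold from 1 to 36 exactly once, a constant input of level g turns on exactly g pixels per tile, which gives the 37 reproducible gray levels mentioned above.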
Some images generated by the above matrix are shown in Img 2.2-1 through 2.2-3. These
images have been printed at a resolution of 300 dpi without any printer model adjustments. The
dots in the image appear to be periodic with a strong fundamental frequency. The gray level in
the gray scale ramp image (Img 2.2-1) changes by equal increments at equal distances because
the number of gray levels shown in the ramp is only 36. Furthermore, this mask algorithm could
not reproduce the gray level next to the completely black band. The portrait of Lena and the
image of the house appear to be fairly similar to the respective original images.
Ordered dither is an example of a dispersed dot algorithm. This mask tends to produce isolated
pixel dots. This makes the algorithm more susceptible to printer model distortions. An 8x8 ordered
dither mask is shown in Tab 2.2-2. This mask is capable of producing 65 gray levels. The
maximum gray level in the mask is assumed to be 255.
Img 2.2-1- Vertical gray scale ramp processed with a 6x6 classical clustered dot dither
screen. The image has been printed at 300 dpi. This algorithm does adjust for the printer
model and it is for this reason that the resulting bitonal image could be printed at a
higher resolution of 300 dpi. Note that each gray level occupies a uniform width on the
ramp.
Img 2.2-2 - Portrait of Lena processed with a 6x6 classical clustered dot dither screen. The
image has been printed at 300 dpi. This algorithm does adjust for the printer model and it
is for this reason that the resulting bitonal image could be printed at a higher resolution
of 300 dpi. The gray levels have been represented accurately and the picture appears to
have a strong fundamental frequency.
Img 2.2-3 - Landscape of a house processed with a 6x6 classical clustered dot dither
screen. The image has been printed at 300 dpi. This algorithm does adjust for the printer
model and it is for this reason that the resulting bitonal image could be printed at a
higher resolution of 300 dpi. The gray levels have been represented accurately and the
picture appears to have a strong fundamental frequency.
4 132 36 164 12 140 44 172
196 68 228 100 204 76 236 108
52 180 20 148 60 188 28 156
244 116 212 84 252 124 220 92
12 140 44 172 4 132 36 164
204 76 236 108 196 68 228 100
60 188 28 156 52 180 20 148
252 124 220 92 244 116 212 84
Tab 2.2-2 - Example of a dispersed dot dither mask - 8x8 Ordered dither mask.
The matrix shown in Tab 2.2-2 was used to generate images of the gray-scale ramp, the portrait
of Lena, and the landscape of a house. This algorithm does not compensate for the printer model,
therefore a resolution of 72 dpi was used. Due to this low resolution, the image should be viewed
at a distance of about 4.2 feet to get the effect of 300 dpi resolution. The gray-scale ramp image
(Img 2.2-4) has clearly marked demarcations for different gray levels. The width of any gray
level varies with the gray level. The other images (Img 2.2-5 and Img 2.2-6) of Lena and house
have a less grainy appearance and the fine details have been well represented.
Img 2.2-4 - Vertical gray scale ramp processed with an 8x8 ordered (dispersed dot dither)
screen. The image has been printed at 72 dpi. This algorithm does not adjust for the
printer model and it is for this reason that the resulting bitonal image had to be printed at
a lower resolution of 72 dpi. To get the effect of a higher resolution of 300 dpi, one should











Img 2.2-5 - Portrait of Lena processed with an 8x8 ordered (dispersed dot dither) screen.
The image has been printed at 72 dpi. This algorithm does not adjust for the printer
model and it is for this reason that the resulting bitonal image had to be printed at a lower
resolution of 72 dpi. To get the effect of a higher resolution of 300 dpi, one should view
the image at a distance of 4.2 feet. The image appears less grainy and the fine details
have been well represented.
Img 2.2-6 - Landscape of a house processed with an 8x8 ordered (dispersed dot dither)
screen. The image has been printed at 72 dpi. This algorithm does not adjust for the
printer model and it is for this reason that the resulting bitonal image was printed at a
lower resolution of 72 dpi. To get the effect of a higher resolution of 300 dpi, one should
view the image at a distance of 4.2 feet. The image appears less grainy and the fine
details have been well represented.
2.3 Error Diffusion
Error diffusion is a "one-pass sequential thresholding method" with negative feedback [6].
Starting at a corner of the image, each pixel is compared to a threshold and the result of the
comparison determines the quantized output pixel. The error introduced by the quantization is
applied to the neighboring pixels yet to be quantized. This is an adaptive halftoning method
which does not have a fundamental frequency, has reduced texture noise and is capable of
reproducing fine image detail.
There are two reasons for the superior quality of the images produced by error diffusion. The first
is that the negative feedback of the error to the neighboring pixels enhances the edges of the
output image [6]. A typical example of this phenomenon is an algorithm which adds the value of
error to the neighboring pixel. Let the value of the current pixel be 200 where 255 represents the
maximum gray level and the binary output at that pixel is white or 255. Therefore the error is -55
for that pixel. This error will be added to the subsequent pixel before processing. This error
addition acts as a negative feedback mechanism which causes edge enhancement. The second
reason is that the method may optimize the output pixel patterns to represent any gray level.
This phenomenon is due to its "frequency modulation" characteristics [3]. The main difference
between ordered dither and error diffusion is that the latter varies the spacing between equally
sized halftone dots [6], which is believed to generate the high-frequency blue noise in the output
image.
The standard error diffusion method can be expressed in the following equations:
$v(i,j) = x(i,j) + \sum_{m,n} h(m,n) \, e(i-m, j-n)$ eqn - 2.3-1
v(i,j) is the modified image pixel obtained after adding the current pixel value and the weighted
sum of the errors generated by the previous pixels. The weight function h(m,n) is determined by
the specific algorithm used for error diffusion.
$b(i,j) = 0$, if $v(i,j) < t(i,j)$ eqn - 2.3-2
$b(i,j) = 1$, if $v(i,j) \ge t(i,j)$ eqn - 2.3-3







e(ij) is the error generated by the current pixel which is computed by subtracting the bitonal
output from the modified image v(ij).
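The standard method can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the well-known Floyd-Steinberg weights (7/16, 3/16, 5/16, 1/16) as the weight function h, a fixed threshold of H/2, and it scales the bitonal output to H when computing the error; none of these choices is specified by the equations above, and the name `error_diffuse` is hypothetical.

```python
def error_diffuse(image, H=255, t=None):
    """Raster-scan error diffusion; returns a matrix of 0/1 output bits."""
    t = H / 2 if t is None else t
    rows, cols = len(image), len(image[0])
    # v accumulates the input pixel plus the weighted errors pushed to it so far.
    v = [[float(p) for p in row] for row in image]
    b = [[0] * cols for _ in range(rows)]
    # Assumed Floyd-Steinberg weights: (row offset, column offset, weight).
    weights = [(0, 1, 7 / 16), (1, -1, 3 / 16), (1, 0, 5 / 16), (1, 1, 1 / 16)]
    for i in range(rows):
        for j in range(cols):
            b[i][j] = 1 if v[i][j] >= t else 0
            e = v[i][j] - H * b[i][j]          # modified pixel minus (scaled) output
            for di, dj, w in weights:
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    v[ni][nj] += w * e         # error falling outside the image is dropped
    return b


flat = [[64] * 64 for _ in range(64)]
out = error_diffuse(flat)
white_fraction = sum(map(sum, out)) / (64 * 64)
print(white_fraction, 64 / 255)
```

Pushing each error forward to unprocessed neighbors is equivalent to the weighted sum over previous errors in eqn 2.3-1; it is written this way only because it is shorter to code.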
This method is computationally intensive and tends to produce isolated pixels which are
susceptible to printer distortions [2]. Since it produces "superior image quality" compared to the
previously discussed algorithms, it is used extensively in many display units and printers that
have limited gray scale [6]. A drawback of the algorithm is that it may produce artifacts in the
output image. The texture of the artifacts depends on the algorithm. For example, in the
Floyd-Steinberg filter the artifacts have a snake-like appearance. This problem may be mitigated
by processing the image along a different path (such as a serpentine raster or various types of
space-filling curves [7]), or by using random thresholds, random error distribution, ordered
dither thresholds, or random weights [8] [3].
Error-diffusion algorithms are known to enhance edges [6]. By modifying eqn 2.3-2 and eqn 2.3-3
to include an additional edge enhancement factor 'k', it is possible to further enhance the edges
of the output bitonal image without increasing the pink noise content in the output image [9]. The
modified equations for this algorithm are given below:
$b(i,j) = 0$, if $(k \cdot x(i,j) + v(i,j)) < t(i,j)$ eqn - 2.3-5
$b(i,j) = 1$, if $(k \cdot x(i,j) + v(i,j)) \ge t(i,j)$
where k is known as the edge enhancement factor. By multiplying this value with the input
image pixel x(i,j) and adding it to the modified image v(i,j) before comparing it to the threshold
t(i,j), it is possible to additionally enhance the edges of the output bitonal image b(i,j).
2.4 Selected Algorithm
Error diffusion produced the best image quality of the algorithms discussed above and was therefore
used to implement this processor. The algorithm which was chosen was the modified error diffusion





Compute the modified image v(i) by subtracting the previous error e(i-1) from the input gray
scale pixel x(i).









$e(i) = b(i) - v(i)$ eqn - 2.4-3
If printer model adjustment mode is chosen and the bitonal output has a black to white edge then
the error value is the same as the modified image, otherwise the error value is obtained by
subtracting the modified image pixel v(i) from the output bitonal image b(i).
$b(i) = 0$, if $(k \cdot x(i) + v(i)) < t(i)$ eqn - 2.4-4
$b(i) = 1$, if $(k \cdot x(i) + v(i)) \ge t(i)$ eqn - 2.4-5
$k \in \{0, 1, 2\}$
The output bitonal image b(i) is obtained by multiplying the input gray-scale pixel x(i) by the
edge enhancement factor k and adding the product to the modified image v(i) before comparing it
to the threshold function t(i). Adding k times the input gray-scale pixel x(i) to the
modified image v(i) enhances the edge(s) of the output bitonal image.
$t(i) = 127 + \text{adjustment}$ eqn - 2.4-6
If the adjustment value is chosen to be a pseudo random value generated by Dr.
Anderson's algorithm then the adjustment = X(i) otherwise the adjustment can be
an arbitrary user chosen input.
Dr. Peter Anderson's algorithm can be summarized by eqn 2.4-7 through and including 2.4-10.
Let A, B, X be twelve bit registers which are initialized to zero. If the previous value of the B-
register is less than zero then X-register is assigned the value of the A-register otherwise X-
register is assigned the value of the B-register. The A-register is assigned the value X-register
incremented by decimal 41 and B-register is assigned the value of the X-register decremented
by decimal 19.
If $B(i-1) \ge 0$ then $X(i) = B(i-1)$ eqn - 2.4-7
Else $X(i) = A(i-1)$ eqn - 2.4-8
$A(i) = X(i) + 41$ eqn - 2.4-9
$B(i) = X(i) - 19$ eqn - 2.4-10
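The register updates can be traced with a short sketch. It follows the prose description (a negative B register selects A, otherwise B is selected) and uses ordinary Python integers in place of twelve-bit registers, which is an assumption that holds here because the values stay small; the function name `anderson_prng` is hypothetical.

```python
def anderson_prng(count):
    """Return the first `count` values of X for registers A, B, X initialized to zero."""
    a = b = x = 0
    values = []
    for _ in range(count):
        x = b if b >= 0 else a   # select B unless its sign is negative
        a = x + 41               # A register: X incremented by 41
        b = x - 19               # B register: X decremented by 19
        values.append(x)
    return values


first = anderson_prng(8)
print(first)   # [0, 41, 22, 3, 44, 25, 6, 47]
```

Since 41 + 19 = 60, each step advances X by 41 modulo 60, so the output cycles through every value from 0 to 59 once before repeating.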
A random number generator is used for encoding pseudo-random noise into the output bitonal
image. This additionally enhances the texture of the artifacts which are produced with simple
error diffusion.
The structure of the artifacts can be altered by changing the order in which the pixels are
processed, by random threshold, or by combining the algorithm with an ordered dither threshold
mask. This algorithm is amenable to an efficient VLSI implementation and meets all of the
previous guidelines outlined for algorithm selection. The function t(i) is based on Dr. Peter
Anderson's pseudo-random number generation algorithm, the edge-enhancement factor k is
based on the edge enhancement algorithm of Eschbach and Knox [8], and the printer model
adjustment is based on a heuristic. This algorithm was tested on portraits, landscapes, and gray
scale ramps. Some of the typical results are shown in Img 2.4-1 to 2.4-3. All the images have
been printed in 300 dpi with k = 1 for all images. Img 2.4-1 is a gray scale ramp which shows
transition from black to white. Note that the transition between various gray levels is smoother
than with any other algorithm shown previously. The images of Lena and the house (Img 2.4-2 and
2.4-3 respectively) have a smoother and sharper finish. The parameters used in processing each
image are given.
The algorithm also allows for super scalar processing because the processing of the current pixel
uses only the error value generated by the previous pixel. Therefore it is possible to have
multiple independent processors processing the same image at different points. This feature of
the algorithm allows a substantial increase in processing speed.
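The independence argument can be illustrated with a software sketch of the one-dimensional algorithm. This is not the processor's implementation: it assumes that no error is carried from one scan line into the next, that an output bit of 1 corresponds to gray value 255 when the error is computed, that the threshold adjustment is a constant, and it omits the printer model adjustment. The names `halftone_row`, `halftone_image`, and `WHITE` are hypothetical, and the thread pool only demonstrates that rows can be handled independently; it is not a performance model.

```python
from concurrent.futures import ThreadPoolExecutor

WHITE = 255  # assumed gray value represented by an output bit of 1


def halftone_row(row, k=0, adjustment=0):
    """Process one scan line using only the error of the previous pixel."""
    t = 127 + adjustment              # threshold with a constant adjustment
    e_prev = 0                        # assumed: each row starts with zero error
    bits = []
    for x in row:
        v = x - e_prev                # modified pixel: input minus previous error
        bit = 1 if k * x + v >= t else 0
        e_prev = WHITE * bit - v      # error: (scaled) output minus modified pixel
        bits.append(bit)
    return bits


def halftone_image(image, workers=4, **kwargs):
    """Process rows independently, as separate processors could."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: halftone_row(r, **kwargs), image))


ramp = [list(range(0, 256, 4)) for _ in range(8)]
result = halftone_image(ramp, k=1)
print(len(result), len(result[0]))
```

Because each row depends only on its own pixels, the parallel result is identical to processing the rows one after another.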
Img 2.4-1 - Vertical gray scale ramp processed with error diffusion, noise encoding based
on Professor Peter Anderson's algorithm, edge enhancement with k = 1, and printer
model adjustment. The image has been printed at 300 dpi. The transition between
different gray levels appears to be smooth.
Img 2.4-2 - Portrait of Lena processed with error diffusion, noise encoding based on
Professor Peter Anderson's algorithm, edge enhancement with k = 1, and printer model
adjustment. The image has been printed at 300 dpi. The image has a smooth finish and
the edges are well defined.
Img 2.4-3 - Landscape of a house processed with error diffusion, noise encoding based on
Professor Peter Anderson's algorithm, edge enhancement with k = 1, and printer model
adjustment. The image has been printed at 300 dpi. The image has a smooth finish and
the edges are well defined.
3.0 Processor Architecture
Every component block of the processor is designed to operate in parallel with any other
component block. This parallelization has led to the need for prediction circuitry to eliminate data
dependence between blocks. As per the algorithm discussed in the previous chapter, the
processor can be divided into three main logical units. The first block is the pseudorandom
number generator unit based on Dr. Anderson's algorithm (eqn 2.4-7 to 2.4-10). This block
generates a pseudorandom number every clock cycle (at the rising edge of clock phase
phi1). The schematic of this unit is shown in Sch 3.0-1. The second block computes the error, and
is based on equations 2.4-2 and 2.4-3. The schematic of this unit is shown in Sch 3.0-2. The
binarization block is based on eqn 2.4-1, 2.4-4, and 2.4-5. The schematic is shown in Sch 3.0-3.
Since the binarization block depends on the error generated by the previous pixel, there is a data
dependency which eliminates the possibility of a cost effective pipelined implementation of this
processor. The complete top-level schematic is shown on Sch 3.0-4. As previously discussed in
section 2.4, the algorithm does allow for super scalar processing. Therefore it is possible to have
multiple independent processors processing the same image in parallel. This can be achieved
very efficiently with this architecture.
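The independence between processors can be illustrated with a short sketch. It assumes, hypothetically, that each processor handles whole rows and starts every row with zero error; the thesis does not state how the image is divided among processors. The function names and the simplified kernel (plain error diffusion with a fixed threshold of 127, without noise, edge enhancement, or printer model adjustment) are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Row = std::vector<std::uint8_t>;

// One "processor": halftones a single row. The only state carried from pixel
// to pixel is the error of the previous pixel (assumed to start at zero).
Row halftoneRow(const Row& in) {
    Row out(in.size());
    int err = 0;
    for (std::size_t j = 0; j < in.size(); ++j) {
        int m = in[j] + err;           // modified image value
        bool white = m > 127;          // threshold
        out[j] = white ? 255 : 0;
        err = white ? m - 255 : m;     // error passed to the next pixel
    }
    return out;
}

// Simulates `procs` independent processors (hypothetical row interleaving):
// processor p handles rows p, p + procs, p + 2*procs, ...
std::vector<Row> halftoneImage(const std::vector<Row>& img, std::size_t procs) {
    assert(procs > 0);
    std::vector<Row> out(img.size());
    for (std::size_t p = 0; p < procs; ++p)
        for (std::size_t r = p; r < img.size(); r += procs)
            out[r] = halftoneRow(img[r]);
    return out;
}
```

Under these assumptions the result does not depend on the number of processors, because no state crosses a row boundary: `halftoneImage(img, 1)` and `halftoneImage(img, 3)` return the same image.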
3.1 Pseudorandom Number Generator
The pseudorandom number generator is based on eqn 2.4-7 through 2.4-10.
The functions X(i), A(i), B(i) have been translated into twelve-bit registers. Two adder circuits are
used to compute the next two possible values for the X register and the sign bit of the B register
determines the input to be channeled through a multiplexer to the X register. If the value of the B
register is negative, then the value in the A register is fed to the X register, otherwise the value























































































































































































































Each adder has a constant value feeding one of its inputs, and the second input is the current
output of the X register. Since this processing is done in parallel with the other blocks in the
processor, it supports high-speed parallel operation of the processor.
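The two-adder-plus-multiplexer update can be sketched in software. The constants 41 and 19 below are taken from the C++ listing in Appendix A (`iA = x + 41`, `iB = x - 19`); whether the hardware uses the same constants or a scaled twelve-bit version of them is an assumption, and the function names are hypothetical.

```cpp
#include <cassert>

// Constants from the Appendix A listing; assumed to match the hardware adders.
const int kAddA = 41;   // constant input of the first adder
const int kSubB = 19;   // constant subtracted by the second adder

// One clock cycle of the generator: both candidate values are computed in
// parallel and the sign of B selects which one is loaded into X.
int nextX(int x) {
    assert(x >= 0 && x < kAddA + kSubB);
    int a = x + kAddA;            // first adder
    int b = x - kSubB;            // second adder
    return (b < 0) ? a : b;       // mux: negative B selects A, otherwise B
}

// Number of steps before the sequence returns to its starting value.
int period(int seed) {
    int x = nextX(seed);
    int n = 1;
    while (x != seed) {
        x = nextX(x);
        ++n;
    }
    return n;
}
```

With these constants the update is equivalent to adding 41 modulo 60, so starting from 0 the sequence runs 41, 22, 3, 44, ... and visits all 60 values before repeating.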
3.2 Error Computation Block
The Error Computation block is based on eqn 2.4-2 and 2.4-3 of the algorithm. This unit is
responsible for computing the error generated by the current pixel, and is operational only if the
EDE (error diffusion enable) input is high. This option allows for processing other types of
algorithms. This unit also makes the adjustment for the printer model, if needed. Since the value
of the current bitonal output is not yet available, circuitry is used to predict the error which will
be generated by the current bitonal output. This allows the binarization unit of the processor to
operate in parallel, since the dependence of the data on the error is eliminated. Also, since the
error unit processes data in parallel with the rest of the processor, faster data processing is
possible. A detailed schematic is shown in Sch 3.0-2.
The top adder is connected to the PMA unit and is responsible for predicting the error if the
bitonal output is 1. Note that the PMA unit outputs a value of 0 if the PMA option is selected and
the previous output was a 0. Otherwise the output of the PMA unit is -255. The output of the
PMA unit is added to the modified image value computed by the lower adder as per eqn 2.4-1.
Since the error is computed for both possible output values, 1 and 0, this unit does not have to
wait for the binarization unit to actually produce the bitonal output before computing the error.
This leads to a speed-up in the processing.
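The prediction scheme can be sketched as follows. This is a minimal illustration, assuming that the error for a 0 output equals the modified image value itself and that a disabled EDE input yields zero error; eqn 2.4-2 and 2.4-3 are not reproduced in this chapter, so those details and all names below are assumptions.

```cpp
#include <cassert>

// Candidate errors for both possible bitonal outputs of the current pixel.
struct ErrorPair {
    int ifZero;   // error if the bitonal output turns out to be 0
    int ifOne;    // error if the bitonal output turns out to be 1
};

// PMA unit as described in the text: 0 when the PMA option is selected and
// the previous output was 0, otherwise -255.
int pmaOutput(bool pmaEnabled, bool prevOut) {
    return (pmaEnabled && !prevOut) ? 0 : -255;
}

// Both candidates are computed before the binarization result is known.
ErrorPair predictErrors(int modified, bool pmaEnabled, bool prevOut) {
    assert(modified > -1024 && modified < 1024);   // loose sanity bound
    return { modified, modified + pmaOutput(pmaEnabled, prevOut) };
}

// Once the bitonal output is available, only a selection remains.
int selectError(const ErrorPair& e, bool out, bool ede) {
    if (!ede) return 0;            // assumption: no error fed back when EDE is low
    return out ? e.ifOne : e.ifZero;
}
```

For example, with a modified value of 200 and PMA disabled, the candidates are 200 (output 0) and -55 (output 1); the final choice costs only a multiplexer delay after the binarization result arrives.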
3.3 Binarization Block
This block is responsible for the binarization of the input pixel and the edge enhancement of the
image. The operation is based on eqn 2.4-1, 2.4-4, and 2.4-5 of the algorithm. The inputs of the
top adder, as shown in Sch 3.0-3, are connected to two muxes. The select input of the top mux is
named Dr.AA (which stands for Dr. Anderson's algorithm). If this input is enabled, the output of
the random number generator is fed to the adder; otherwise a user-chosen value can be fed to
the adder through the Adj(11:0) input. The Adj(11:0) bus input allows the user to specify a custom
dithering mask or noise to process the input pixel. The lower mux, connected to the lower input of
the top adder, is responsible for the edge enhancement of the input. If the select input Mul0 is
enabled, the mux outputs the value of the current pixel; otherwise the output of the mux is
zero. The operation of the top adder is based on eqn 2.4-6 and partially on eqn 2.4-4 and
2.4-5. The lower adder is responsible for the edge enhancement and the addition of the error
generated by the previous bitonal output (corresponding to eqn 2.4-4 and 2.4-5). If the Mul1
input is enabled, the output of 2xmult is twice the input pixel value; otherwise the output is just
the input pixel value. The doubling is achieved by shifting the input pixel value to the left by one
position. Therefore if k from eqn 2.4-4 and 2.4-5 is chosen to be 0, both Mul0 and Mul1 should
be disabled. If k is chosen to be 1, then either Mul0 or Mul1 (not both) should be enabled. If k is
chosen to be 2, then both Mul0 and Mul1 should be enabled. The outputs of the two adders are
summed and fed to a comparator, which outputs a white pixel if the value of its input is greater
than 127. This comparator also takes care of any overflows. The output of the
comparator is fed to the D flip-flop, and the flip-flop output, which is fed to the pad (labeled
"Vout"), is the bitonal output of the processor.
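The datapath described above can be mirrored in a few lines of code. This sketch follows only the prose description of the two adders and the comparator; the exact form and sign conventions of eqn 2.4-4 through 2.4-6 are not reproduced here, overflow handling is ignored, and the function and parameter names are hypothetical.

```cpp
#include <cassert>

// Edge-enhancement factor k implied by the two select inputs.
int edgeGainK(bool mul0, bool mul1) {
    return (mul0 ? 1 : 0) + (mul1 ? 1 : 0);
}

// Returns true for a white pixel. `noise` stands for either the random number
// generator output (Dr.AA enabled) or the user value on Adj(11:0).
bool binarize(int pixel, int noise, int prevError, bool mul0, bool mul1) {
    assert(pixel >= 0 && pixel <= 255);
    int top   = noise + (mul0 ? pixel : 0);                  // top adder
    int lower = (mul1 ? (pixel << 1) : pixel) + prevError;   // lower adder, 2xmult is a left shift
    return (top + lower) > 127;                              // comparator threshold
}
```

In this form the pixel contributes (1 + k) times its value to the comparator input, which is why enabling either select input gives k = 1 and enabling both gives k = 2.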
3.4 Overall Top Level Post Layout Back-Annotated Simulations
After the processor was laid out for a 2µ N-well MOSIS CMOS process, the worst-case
parameters of the layout were extracted to determine the worst-case delay combinations of the
interconnected components. The delays of the metal interconnections were computed and found
to be insignificant, so they were not included in the back-annotated computations. All the functional
components were back-annotated to the logical model and the model was simulated for both delay
and functional verification. The pad delays were taken from the MOSIS pad specification report
and were also included in the simulations. Some typical simulation results are shown in
Sim 3.4-1 and Sim 3.4-2. In Sim 3.4-1 the PMA option is disabled, and EDE, Mul0, and Mul1 are
all enabled to obtain error-diffusion feedback with high edge enhancement (k = 2). The input
values start at 0 and are incremented by 15 in every clock cycle. The input "Dr.AA" is disabled
and Adj(11:0) is forced to a value of 0 for the duration of this simulation. A clock speed of about
21 MHz is maintained throughout the simulation. In Sim 3.4-2 the input pixel value started at
255 and was decremented by 15 in each clock cycle. The Dr.AA, PMA, and EDE inputs were
enabled and the Mul0 and Mul1 inputs were disabled. This corresponds to an error-diffusion
algorithm with noise encoding and printer model adjustment but without any additional edge
enhancement (k = 0).
The clock speed for this simulation was the same as the previous simulation. In both cases the
values were verified against expected results and the simulations were found to be accurate. The
expected results were obtained by applying the test vectors in these simulations to the C++



























































































4.0 Adder Implementation Details
The adder is one of the key components of this project. Since this processor has two adders
connected in series, their operation consumes about 71% of the clock period. It is
therefore imperative to have a very high-speed adder processing the input pixel values. The
architecture of the adder was borrowed from the wave-pipelined adder design of Liu [10].
Since the halftoning algorithm is not amenable to efficient pipelining, wave pipelining itself was
not implemented on this processor. The individual cells in the adder were implemented in
Double Pass Transistor Logic (DPL), which reduced the delays. The block diagram is shown in
Fig. 4.0-1. A basic carry-generate adder circuit was used because the delay associated with this
architecture is on the order of log2(n), where n is the number of input bits. The adder is divided
into three stages. In order, they are the (g,p) generation stage, the carry generator stage, and the
sum generator stage. The carry stage is shown in Fig. 4.0-2. Each circle and square in that
figure corresponds to an individual cell. The block structure located above the individual cells
shows the architecture of the carry generator. The carry generator, sum generator, and (g,p)
generator have all been implemented in DPL technology.
DPL technology has the advantage of better performance when compared with Complementary
Pass Transistor Logic (CPL) or conventional CMOS logic because of its lower equivalent
resistance (shown on Fig. 4.0-3) [11]. Therefore the drive capability of the gate is enhanced
tremendously [11]. Despite the improved performance, there are many disadvantages associated
with this logic. Since the complementary output is also used at the next stage, additional drivers
are needed either at the input of the next stage or at the output of the current stage. This leads
to additional area and power consumption. Since this logic requires that the input as well as the












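[Figure residue: adder block diagram showing a Carry Generator block with (g, p) inputs up to
g16, p16, carry outputs c1, c2 through c16, and operand and sum signals a_i, b_i, s_i.]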






































Fig. 4.0-3 Comparison of the equivalent resistances for Complementary Pass Transistor
Logic (CPL), CMOS Transmission Gate, and Double Pass Transistor Logic (DPL). Note
that DPL technology has the lowest equivalent resistance.

















Fig. 4.0-4 Comparison of the circuit design, operation, and voltage swing of a typical gate
(XOR gate) implemented in Complementary Pass Transistor Logic (CPL), CMOS












































to either the input or the output stage of each gate. This creates additional propagation delays in
the system.
Each individual cell of the carry generator shown in Fig. 4.0-2 was directly translated into its
corresponding DPL equivalent; i.e., the white square corresponds to the f_gl_buffer (Sch 4.0-1),
the white circle corresponds to the f_pl_gl_buffer (Sch 4.0-2), the black square corresponds to
the f_gl_pl_gr_buffer (Sch 4.0-3), and the black circle corresponds to the f_gl_pl_gr_pr_buffer
(Sch 4.0-4). The architecture shown above the individual cells is shown in Sch 4.0-5. The individual
logic components in each of the cells have been translated to their DPL equivalents. Those
schematic and layout translations are shown in Sch 4.0-6 through Sch 4.0-10 and Lay 4.0-1 and
Lay 4.0-4 respectively.
The (g,p) generator of the adder can be summarized by the following equations:
$g_i = a_i \cap b_i$    eqn 4.0-1
$p_i = a_i \oplus b_i$    eqn 4.0-2
The carry generator for an n-bit adder can be summarized by the following
equations:
$c_i = G_i$, for $i = 0, 1, \ldots, n-1$    eqn 4.0-3
$(G_i, P_i) = (g_0, p_0)$, for $i = 0$    eqn 4.0-4
$(G_i, P_i) = (g_i, p_i) \circ (G_{i-1}, P_{i-1})$, for $i > 0$    eqn 4.0-5
where $\circ$ is the concatenation operator with the following definition:
$(g_l, p_l) \circ (g_r, p_r) = (g_l \cup (p_l \cap g_r),\ p_l \cap p_r)$    eqn 4.0-6
The equations for the sum generator are defined as follows:
$s_i = p_i \oplus c_{i-1}$, for $i > 0$    eqn 4.0-7
$s_0 = p_0$    eqn 4.0-8
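The three stages can be checked against ordinary addition with a small bit-level model. The sketch below assumes p is the exclusive-OR of the operand bits, no carry-in, and a log-depth doubling arrangement of the concatenation operator; the actual cell network of Fig. 4.0-2 may be arranged differently, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct GP { bool g; bool p; };

// Concatenation operator of the carry generator:
// (g_l, p_l) o (g_r, p_r) = (g_l OR (p_l AND g_r), p_l AND p_r)
GP combine(GP l, GP r) {
    return { l.g || (l.p && r.g), l.p && r.p };
}

// n-bit addition (result modulo 2^n) built from the three adder stages.
std::uint32_t prefixAdd(std::uint32_t a, std::uint32_t b, int n) {
    assert(n > 0 && n <= 32);
    // Stage 1: (g,p) generation for every bit.
    std::vector<GP> gp(n);
    for (int i = 0; i < n; ++i) {
        bool ai = (a >> i) & 1u;
        bool bi = (b >> i) & 1u;
        gp[i] = { ai && bi, ai != bi };
    }
    // Stage 2: carry generation. After the last level, G[i] holds
    // (g_i,p_i) o ... o (g_0,p_0); the loop runs about log2(n) levels.
    std::vector<GP> G = gp;
    for (int d = 1; d < n; d <<= 1) {
        std::vector<GP> next = G;
        for (int i = d; i < n; ++i) next[i] = combine(G[i], G[i - d]);
        G = next;
    }
    // Stage 3: sum generation, s_0 = p_0 and s_i = p_i XOR c_(i-1).
    std::uint32_t sum = 0;
    for (int i = 0; i < n; ++i) {
        bool carryIn = (i == 0) ? false : G[i - 1].g;
        if (gp[i].p != carryIn) sum |= (std::uint32_t{1} << i);
    }
    return sum;
}
```

Because the concatenation operator is associative, the serial recursion of the carry equations and the doubling loop above give the same carries; the tree form is what makes a delay on the order of log2(n) possible.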










































































































































































































































































































































































































































































Sch 4.0-9 - Schematic of a Double Pass Transistor Logic AND gate
The results of the Accusim simulation of the transistor-level model of the adder can be summarized
in three sets. The first set of results was obtained by simulating the (g,p) generator stage of
the adder. The schematic of the setup is shown in Sch 4.0-11. The resulting waveforms
obtained from the simulation of this schematic are available in Sim 4.0-1. The (g,p) generator
stage is based on eqn 4.0-1 and 4.0-2. Only a one-bit (g,p) generator has been simulated since
the results will be similar for the rest of the bits. The rise time and the fall time of the output were
measured to be about 0.99 ns and 0.80 ns respectively. The propagation delay of the output
with respect to the input was measured to be about 2.2 ns.
The carry generator is based on eqn 4.0-3 through 4.0-6. The schematic used for the simulation
of the f_gl_pl_gr_buffer in the carry generator under the worst-case loading condition
is shown in Sch 4.0-12. This buffer was selected because it has the largest
capacitive load connected to its output. The resulting waveforms obtained from the
simulation of this schematic are available in Sim 4.0-2. The rise time and the fall time of the
output of the f_gl_pl_gr_buffer were measured to be 1.42 ns and 1.18 ns respectively. The
propagation delay of the output with respect to the input was found to be 3.2 ns.
The sum generator is based on eqn 4.0-7 and 4.0-8. One bit of this generator was simulated
under worst-case load conditions as shown in Sch 4.0-13. The output waveforms are shown in
Sim 4.0-3. The rise time and fall time of the output were found to be 1.39 ns and 1.02 ns
respectively. The propagation delay of the output with respect to the input was found to be 2.4
ns. The total propagation delay of the whole adder was computed to be 17.4 ns. This value was
used in the digital simulations as the overall adder delay. Note that the rise, fall, and
propagation delay values are much lower than those of CPL or conventional CMOS; this
speed-up was obtained by using DPL technology.
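The stage delays can be combined in a short worked calculation. The sketch below assumes, as a guess not stated in the text, that the critical path passes through four carry-generator levels (consistent with a log2(n) depth for 12 to 16 bits); with that assumption the stage delays reproduce the quoted 17.4 ns. The clock period of 49.1 ns is the per-pixel time quoted in the conclusion, and all names are hypothetical.

```cpp
#include <cassert>

// Measured stage delays from the Accusim simulations, in picoseconds.
const int kGpDelayPs        = 2200;   // (g,p) generator stage
const int kCarryLevelPs     = 3200;   // worst-case carry generator cell
const int kSumDelayPs       = 2400;   // sum generator stage

// Total adder delay for an assumed number of carry-generator levels.
int adderDelayPs(int carryLevels) {
    assert(carryLevels >= 0);
    return kGpDelayPs + carryLevels * kCarryLevelPs + kSumDelayPs;
}

// Fraction of the clock period used by two adders connected in series.
double seriesAdderShare(int adderPs, int clockPs) {
    assert(clockPs > 0);
    return 2.0 * adderPs / clockPs;
}
```

With four carry levels, `adderDelayPs(4)` gives 17400 ps, and `seriesAdderShare(17400, 49100)` is roughly 0.71, in line with the statement that the two series adders consume about 71% of the clock period.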
Sim 4.0-1 Worst case (g,p) Generator transistor level Spice Model Simulations.



















Sim 4.0-2 Worst case Carry generator transistor level Spice Model Simulations.



























































5.0 Conclusion
There are many algorithms available for digital halftoning. The selection criteria for a suitable
algorithm were high image quality, low computational complexity, and the possibility of an
efficient VLSI implementation. It was on this basis that an error-diffusion algorithm was chosen
for implementation of the processor. The architecture of the processor was designed to have
three blocks working in parallel with each other. The first block is the random number generator
block, which generates the noise in the algorithm. The second block is the error computation
block, which calculates the error generated by the current pixel, and the third block is the
binarization block, which determines the bitonal image by thresholding the modified image. The
logic-level model of the architecture was simulated in Quicksim and the results of the simulation
were verified against the expected results. The expected results were obtained by
substituting the test vectors used for simulation into the C++ program used to generate images.
The processor was designed in Double Pass Transistor Logic, which sped up the operations
at the cost of increased power and area consumption. The processor was laid out for a 2µ N-well
MOSIS CMOS process and the expected speed of the processor is about 21 million pixels/sec.
Panasonic had created a similar processor in a 1.5µ CMOS process with a processing speed of 60
ns/pixel [12]. This processor takes only 49.1 ns/pixel on a 2µ process. Similarly, the Panasonic
adder takes about 20 ns/operation, whereas the adder designed for this processor takes only 17.4 ns.




1. R. C. Gonzalez and P. Wintz, Digital Image Processing, Second Edition, Addison-Wesley
Publishing Company, Inc., USA, 1987.
2. T. N. Pappas and D. L. Neuhoff, "Printer Models and Error Diffusion," IEEE Trans. Image
Processing, vol. 4, no. 1, pp. 66-79, January 1995.
3. P. G. Roetling and R. P. Loce, "Digital Halftoning," Digital Image Processing Methods, ed. E. R.
Dougherty, Marcel Dekker, Inc., New York, New York, 1994.
4. P. G. Anderson, "Linear Pixel Shuffling Applications," Recent Progress in Digital Halftoning,
ed. R. Eschbach, IS&T, pp. 74-76, 1995.
5. R. Ulichney, Digital Halftoning, The MIT Press, Cambridge MA, 1987.
6. K. T. Knox, "Introduction to Digital Halftones," Proc. IS&T 47th Annual Conference, pp. 456-459,
1994.
7. L. Velho and J. M. Gomes, "Digital Halftoning with Space Filling Curves," Proceedings
SIGGRAPH in Computer Graphics, vol. 25, no. 4, pp. 81-90, July 1991.
8. T. Kurosawa and Y. Maruyama, "A High Speed Halftoning Processor for Raster Scanned
Images," IEEE Journal of Solid State Circuits, vol. 27, no. 2, pp. 222-224, February 1992.
9. R. Eschbach and K. T. Knox, "Error-diffusion algorithm with edge enhancement," J. Opt. Soc.
Am. A, vol. 8, no. 12, pp. 1844-1850, December 1991.
10. W. Liu et al., "A 250 MHz Wave Pipelined Adder in 2um CMOS," IEEE J. of Solid-State
Circuits, vol. 29, no. 9, pp. 1117-1127, September 1994.
11. M. Suzuki et al., "A 1.5 ns 32-b CMOS ALU in Double Pass Transistor Logic," IEEE Journal
of Solid State Circuits, vol. 28, no. 11, pp. 1145-1150, Nov. 1993.
12. T. Kurosawa et al., "A High-Speed Halftoning Processor for Raster Scanned Images," IEEE
Journal of Solid State Circuits, vol. 27, no. 2, pp. 222-224, Feb. 1992.
APPENDIX A - Selected algorithm expressed in C++.
This code was also used to generate the expected results and output images.








Create a bitonal image by error diffusion, edge






















































// pixel width of the image
// pixel height of the image
// PGM file format
// name of input/output files
















// Open the input and output files
cout << "Enter PGM filename:";
cin >> infile;
cout << "Enter halftone output filename:";
cin >> outfile;
ifstream input_file(infile, ios::in);






// Check if the input file is of PGM format
input_file >> pgm_flag;









// Get the dimensions of the original image
input_file >> width;
input_file >> height;
input_file >> max_gray;
ofstream output_file(outfile, ios::out);
if (!output_file)
{





output_file << pgm_flag << data_type << endl;
output_file << width << " " << height << endl;
output_file << max_gray << endl;
// Read each line of the file and store it in the buffer for processing
for (i = 0; i < height; i++)
{
    for (j = 0; j < width; j++)
    {
        input_file >> a[j];
for
{




- 1 - 2*j ) * j ;
current = a[column];
mImage = current + prev;
prOut = out;
// Apply the edge enhancement value and the noise to the modified image
// to compute the bitonal pixel value




b [column] = max_gray ;




b[column] = 0;
out = 0;
}




// Apply printer model adjustment to the error value
if (prOut == 0 && out == 1) prev += max_gray ;
// Compute the next random number for noise encoding
iA = x + 41;
iB = x - 19;
if (iB >= 0) x = iB;
else x = iA;
}
// Write the bitonal output buffer to the output file
for (j = 0; j < width; j++)
{
    output_file << b[j] << " ";
    if (j % 18 == 17) output_file << endl;
}
}
input_file.close();
output_file.close();
return (0);

