VHDL implementation of an image processing chip by Kelly, E. Michael
Rochester Institute of Technology 
RIT Scholar Works 
Theses 
2-18-1996 
VHDL implementation of an image processing chip 
E. Michael Kelly 
Recommended Citation 
Kelly, E. Michael, "VHDL implementation of an image processing chip" (1996). Thesis. Rochester Institute 
of Technology. Accessed from 
VHDL Implementation





Partial Fulfillment of the





Graduate Advisor - George A. Brown, Professor
Kenneth W. Hsu, Associate Professor
Ronald G. Matteson, Professor
Department of Computer Engineering
College of Engineering
Rochester Institute of Technology
Rochester, New York
February 18, 1996
Thesis Release Permission Form
Rochester Institute of Technology
College of Engineering
Title: VHDL Implementation of an Image Processing Chip
I, E. Michael Kelly, hereby grant permission to the Wallace Memorial




Abstract
Digital copiers and printers require that processing steps be performed
on image data after it is captured and before it is finally printed. Grayscale
data is typically captured using an image scanner or generated using image
composition software. Single bit data is usually printed using a laser, LED,
ink-jet, or thermal writer. This thesis describes the design of an ASIC that
implements several image processing algorithms. Image histogram modification
and convolution filtering are used for operating on gray data. Ordered
dither, error diffusion and thresholding convert gray data to binary. A pipeline
architecture is used to maximize both the modularity and the throughput
of the design. A CPU interface is used to allow flexible programming of the
image processing parameters. The design is implemented and simulation
stimulus is generated entirely in VHDL.
A design of this type allows many of the image processing operations
commonly required in digital copiers and printers to be performed in real
time rather than as a pre-processing software step. The programmability of
the image processing parameters makes the design suitable for a wide variety
of applications. The use of VHDL for the design allows flexibility in selecting
the target implementation technology for synthesis.
This document was produced using FrameMaker4 on a unix workstation.
The VHDL code was developed and simulated using QuickVHDL version
8.4_4.3e on a Mentor Graphics design workstation. The C programs used in
support of imaging verification were developed using Turbo C++ version 3.0
on a PC clone. Logic synthesis was performed using Autologic II version
8.4 3.1.
The following names used here and in the remainder of the document are reg











Copyright 1996 by Mike Kelly
All rights reserved




List of Figures
List of Tables
Glossary
Chapter 1 - Introduction and Methodology
Chapter 2 - Device Architecture
Chapter 3 - Histogram Modification
Chapter 4 - Two Dimensional Convolution
Chapter 5 - Halftone Screening
Chapter 6 - Error Diffusion
Chapter 7 - Threshold Application
Chapter 8 - VHDL Issues
Chapter 9 - Simulation and Test Bench Generation
Chapter 10 - Synthesis
Chapter 11 - Conclusions
Bibliography
Appendix A - Device User's Specification
Appendix B - Device VHDL Code
Appendix C - Test Bench VHDL Code
Appendix D - Register Initialization Files
Appendix E - C Code
List of Figures
Figure 2.1 - EMKIPC Top Level Block Diagram
Figure 3.1 - Original image
Figure 3.2 - Image processed using Histogram Modification
Figure 3.3 - Histogram Modification Block Diagram
Figure 4.1 - Filter kernel examples
Figure 4.2 - Convolution Filter Block Diagram
Figure 4.3 - Convolution Boundary Condition
Figure 4.4 - Convolution FIFO interconnect
Figure 4.5 - PIXEL_CTL Block
Figure 4.6 - Image processed with Laplacian Filter
Figure 5.1 - Image processed Using Ordered Dither
Figure 5.2 - Example Dither Matrix
Figure 5.3 - Halftone Screening Top Level Block Diagram
Figure 6.1 - Image processed using Error Diffusion
Figure 6.2 - Error Diffusion Top Level Block Diagram
Figure 6.3 - Error Diffusion Process Block Diagram
Figure 7.1 - Threshold Block Diagram
Figure 7.2 - Image Processed using Thresholding Only
Figure A.1 - EMKIPC Top Level Block Diagram
Figure A.2 - Histogram Modification Block Diagram
Figure A.3 - Convolution Filter Block Diagram
Figure A.4 - Halftone Block Diagram
Figure A.5 - Error Diffusion Block Diagram
Figure A.6 - Error Diffusion Processing Block Diagram
Figure A.7 - Threshold Block Diagram
Figure A.8 - CPU Write Cycle Timing
Figure A.9 - CPU Read Cycle Timing
Figure A.10 - Input Image Data Macro Timing
Figure A.11 - Input Image Data Micro Timing
Figure A.12 - Output Image Data Interface Signal Timing
List of Tables
Table 1 - Glossary of Terms
Table 2.1 - Pipeline Delays
Table 6.1 - Error Diffusion Pixel Reference
Table 10.1 - Global Cell Usage Statistics, Halftone Block
Table A.1 - EMKIPC CPU Address Map
Table A.2 - Histogram Modification Control Register Bits
Table A.3 - Histogram Modification Read back Select Bits
Table A.4 - Convolution Filter Kernel Elements
Table A.5 - Convolution Filtering Address Map
Table A.6 - Dither Matrix Register Addresses (Hex)
Table A.7 - Error Diffusion Pixel Reference
Table A.8 - Error Diffusion Multiplier Addresses
Table A.9 - CPU Interface Signals
Table A.10 - Write Cycle Timing Requirements
Table A.11 - Read Cycle Timing Parameters
Table A.12 - Image Data Input Interface Signals
Table A.13 - Input Image Data Interface Timing Requirements
Table A.14 - Output Image Interface Signal Descriptions
Table A.15 - Output Image Data Interface Timing Characteristics
Glossary
Table 1 - Glossary of Terms
Term     Definition
ASIC     Application Specific Integrated Circuit
CPU      Central Processing Unit
CSZ      Chip Select (active low)
DFF      D-type Flip Flop storage element
DPI      Dots (or pixels) Per Inch
EMKIPC   E. Michael Kelly - Image Processing Chip
FIFO     First In First Out storage device
LED      Light Emitting Diode
LUT      Look-Up Table
Pixel    Picture Element - The dots (gray or binary), usually arranged in a
         rectangular array, which form a digital image.
UUT      Unit Under Test
VHDL     VHSIC Hardware Description Language
VHSIC    Very High Speed Integrated Circuit
Chapter 1 - Introduction and Methodology
This thesis project has two primary goals. The first is to produce a
design for a chip which can perform, at high speed, image processing functions
which are useful for a number of applications. The second is to explore the
issues around the use of VHDL (VHSIC Hardware Description Language) in
the implementation of a digital design of this type. Some of the issues are
code style, bus representation, and mathematical accuracy. A secondary goal
is to gain an understanding of the issues around the synthesis of a VHDL
description to a gate level schematic by actually synthesizing a portion of the
design. Two of the synthesis issues are limitations on coding style and the
impact of coding style on chip size. Throughout this thesis, 8-bit monochrome
image data are assumed unless explicitly stated otherwise.
To meet the first goal, a number of decisions and assumptions must be
made about the type of system toward which the design is targeted. The high
end of the digital copier and printer markets is targeted so that the design can
be used in the widest possible range of products. In meeting the requirements
of the high end of the market, the needs of the low end are automatically







600 pixels per inch printing resolution.
Eight-bit per pixel input data.
Eight-bit per pixel or binary output data (configurable).
Monochrome or color applications.
These assumptions result in the following design requirements.
(8.5" * 600 pixels/inch)
* (1








When overhead is added for time between pages and time between lines
within a page, a processing speed of approximately 70 Mpixels/second is
needed. A four pixel wide image data path is used to minimize the frequency
of the image data clock. The image data interface is 32 bits wide (four pixel
data paths at eight bits/pixel) and must run with a minimum data clock
frequency of 17.5 MHz. The cost of this architecture decision is chip size. The
chip area increases as a result of the internal parallelism required for many of
the processing blocks. This increase, however, is not by a factor of four since
the size of several of the large blocks of the design is independent of, or
minimally affected by, the width of the data path.
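The clock frequency and bus width quoted above follow from simple arithmetic
on the required pixel rate. The short Python sketch below reproduces the two
figures; the function and parameter names are illustrative only and do not
come from the thesis.

```python
def data_path_requirements(pixel_rate_hz, pixels_per_clock, bits_per_pixel):
    """Return (minimum data clock in Hz, interface width in bits)."""
    clock_hz = pixel_rate_hz / pixels_per_clock   # pixels/s spread over parallel paths
    bus_bits = pixels_per_clock * bits_per_pixel  # one word carries all parallel pixels
    return clock_hz, bus_bits


# 70 Mpixels/second over four 8-bit pixel paths.
clock_hz, bus_bits = data_path_requirements(70e6, 4, 8)
print(clock_hz / 1e6, "MHz,", bus_bits, "bits")
```

Widening the data path lowers the clock in direct proportion, which is the
trade against chip area described in the paragraph above.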
The use of four pixel channels does not imply that the chip can be used
for color applications by using a different pixel channel for each color
separation. Internal to the device, many of the operations performed on the data
depend on neighboring pixel data. This means that, for most operations,
independent pixel channel processing cannot occur. It is intended that multiple
instances of this device be used in parallel for simultaneous processing of
multiple color channels.
The specific image processing functions implemented in the design are
functions which are both useful and practical. Operations which require a
page buffer are not implemented. The limitation of implementing the design
in one or several custom ASICs does not allow for the inclusion of a page
buffer so image transformations such as rotation and scaling are not done.
The operations which are implemented are divided into two classes. First are
operations which are applied to grayscale data and which generate grayscale
data as a result. These include image histogram modification and two
dimensional convolution. The second class of operations are those which convert
grayscale image data to binary for output to devices which print only binary
data. These include the application of a digital halftone screen, error
diffusion, thresholding, or any combination of the three.
A CPU interface to the device is required for configuration and test. An
eight-bit bidirectional data bus, a nine-bit input address bus, as well as read,
write, and chip select strobes are used to implement this simple interface.
The existence of this interface implies that this device must be used in
systems which have an intelligent host processor and associated software which
will be used to configure internal registers and look up tables.
The selection of a pipeline architecture for the device serves as an
advantage from several viewpoints [3]. First, image processing applications
are well suited to pipeline processing since the algorithms to be applied are
done in a sequential manner and since image data is typically available in a
sequential manner. Data dependencies are eliminated through careful selection
of the processing algorithm order. Second, pipeline processing allows
the device to be easily partitioned into functional modules which can be
developed independently. A standard interface protocol at the input and output
of the device and at the input and output of each block allows individual
processing modules to be inserted or removed as needed during development.
Third, pipeline processing allows internal parallel processing in that each
block is working on a different part of the image at the same time. This helps
to maximize the data throughput of the device.
The second thesis goal is met by using a design methodology which
takes advantage of the modular nature of the pipeline architecture and of the
various code styles available when designing with VHDL. Each of the processing
blocks of the device is described in VHDL using a mixture of structural,
data flow, and behavioral styles. The blocks are then interconnected
using the structural modeling style to create a hierarchical design. This
allows the device to be described and simulated entirely using VHDL and
still be partitioned into smaller design tasks. Schematic block diagrams were
created using the Mentor Graphics schematic capture package, called Design
Architect. These schematics are not used as part of the design database but
were created as accurate schematics to fulfill two needs. The first is for ease
of design visualization during VHDL code creation. The second is to provide
a means to interconnect the gate level description of each block which is
produced during synthesis. Duplicate schematics were created using the
Intergraph Aceplus schematic editor. These are the schematic diagrams used for
documentation purposes throughout this document.
As each VHDL block design was developed, it was compiled using the
QuickVHDL compiler on the Mentor Graphics workstation (invoked by typing
qvcom filename.vhd) and then simulated using the QuickVHDL simulator
(invoked by typing qvsim). The simulation stimulus was generated using
VHDL to create a test bench for each block. The block level test benches are
made up of an instantiation of the block under test and the signal generation
statements needed to initialize and exercise the block. Block functional
verification was done by viewing the simulation waveforms to make sure that the
response of the block was as expected. At the block level, real image data
was not used. Instead, patterns such as incrementing data were used.
More elaborate testing was performed at the top level. File I/O was used
to load the device configuration registers and look up tables and to read input
image data and write resulting output image data. All file I/O was performed
on text files. Several programs were written in C to support design
verification through file manipulation. One program was written to convert raster
image files (.img format binary files) to a text file format which was read by
the VHDL test bench during simulation. Another was written to take the text
file generated by the VHDL test bench during simulation and convert it to a
raster image file. An existing program (written by Professor Ronald G.
Matteson, Dept. of Computer Engineering, Rochester Institute of Technology)
was used to view the original and processed images on a Personal Computer
to verify the operation of the device. Multiple simulations were run with
each processing block individually enabled and with combinations of blocks
enabled to fully verify the functionality of each block and of the entire data
path.
The final thesis goal of exploring some of the issues around synthesis of
VHDL code was met by performing synthesis on the halftone block of the
design. The Autologic II tools provided by Mentor Graphics Corporation
were used to convert the behavioral description of the block into a gate level
description. Discussion of the results is found in Chapter 8 - VHDL Issues.
Throughout this document, signal names which end with the letter "Z"
are active low.
Chapter 2 - Device Architecture
The architecture of the EMKIPC device evolved during development
into a pure pipeline design. Image data enters the device at the input
interface, is independently processed and then propagated by each block, and then
exits at the output interface. This can be seen in Figure 2.1. Initially,
the architecture was more of a mix of a pipeline and a muxed parallel
design. There was a separate CPU interface block which fed control signals
to each of the processing blocks. This proved to be cumbersome in that even
minor changes to a single block usually propagated to the top level and to
other blocks. It also made the device more difficult to partition into multiple
devices, if this were desired, due to each processing block's dependence on
the CPU interface block. For these reasons, a common set of CPU interface
signals is brought into each of the blocks in the final architecture (except for the
threshold block) and each block implements its own CPU decode.
The initial design also differs from the final design in that the initial
design could convert grayscale data to binary using either ordered dither
(halftoning) or error diffusion, but not both. Image data passed out of the
convolution block and into both the halftone and error diffusion blocks in
parallel. It was intended that one or the other block would be used at a time
while the unused block was switched completely out of the data path. This is
clearly not a pipeline methodology.
Figure 2.1 - EMKIPC Top Level Block Diagram
The possibility of using the halftone block to provide a varying threshold
to the error diffusion block [13] led to the current implementation in which
image data is passed from the convolution block to the halftone block and
then both image data and threshold values are passed from the halftone block
to the error diffusion block. This, combined with the ability to individually
enable or disable each of the blocks via the CPU interface, gives the user
complete flexibility in deciding how to process images.
The input and output image data interfaces are 32 bits wide and synchronous.
The protocol of the image data interface is based on the natural way of
processing bitmap images. Input and output devices (e.g. scanners and printers,
respectively) operate on a page at a time and, within a page, generally
operate on a single scan line at a time. The image data interface, therefore,
has synchronization signals which are used to indicate the start and end of a
page (VSYNC or vertical sync) and the start and end of a line within a page
(HSYNC or horizontal sync). The terms horizontal and vertical are inherited
from video processing and do not indicate page orientation. They can be
interpreted as line and page sync respectively. The data and sync signals are
sampled on the rising edge of a master clock signal, making this a synchronous
interface.
A synchronous data interface was chosen to minimize the handshaking
required and maximize data throughput. The image data source indicates that
it is ready to transmit an image by raising VSYNC. It then indicates that it is
ready to send a line by simultaneously raising HSYNC and presenting the
first data word. Once a line of data is started, it must be completed with valid
image data available at each clock rising edge. When HSYNC is sampled
low, it indicates the end of an image line. It is expected that every line within
an image will be of the same length. The area operators will not operate
correctly if this is not the case. VSYNC goes low after transmission of the last
line of data to indicate the end of the image. At least two more HSYNC
cycles are required after VSYNC goes low to complete the processing of
internally stored data when convolution and error diffusion are enabled. Data
are passed from block to block within the device and are fed to the output
interface using the same protocol. This scheme assumes that the data recipient
is always ready and can accept data at the full system clock rate.
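The sync protocol described above can be modeled in a few lines of software.
The Python sketch below is an illustration only, not part of the thesis test
bench: it takes one (vsync, hsync, word) sample per rising clock edge, treats a
word as image data only while both syncs are sampled high, and enforces the
equal line length rule. The function name and the tuple layout are assumptions
made for this example.

```python
def collect_image(samples):
    """Group data words into image lines from per-clock (vsync, hsync, word) samples."""
    lines, current = [], []
    for vsync, hsync, word in samples:
        if vsync and hsync:
            current.append(word)          # valid data word inside a line
        elif current:
            lines.append(current)         # HSYNC sampled low: the line has ended
            current = []
    if current:
        lines.append(current)
    if len({len(line) for line in lines}) > 1:
        raise ValueError("all lines in an image must have the same length")
    return lines


stream = [
    (0, 0, 0), (1, 0, 0),                 # idle, then VSYNC raised
    (1, 1, 10), (1, 1, 11), (1, 0, 0),    # first line
    (1, 1, 12), (1, 1, 13), (1, 0, 0),    # second line
    (0, 0, 0), (0, 1, 99), (0, 0, 0),     # VSYNC low, first flush HSYNC cycle
    (0, 1, 99), (0, 0, 0),                # second flush HSYNC cycle
]
print(collect_image(stream))
```

The HSYNC cycles issued after VSYNC goes low carry no image data in this
model; they stand in for the cycles needed to flush internally stored lines.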
The selection of a pipeline architecture over other possible architectures
implies certain advantages and limitations. On the advantages side, image
data can be processed through the device in real time with no page delays or
external page buffers. The internal parallelism of the pipeline means that the
output data rate is the same as the input rate and there is only a small latency
penalty of one line and sixteen pixels when all the blocks are enabled. Refer
to Table 2.1 for the delay introduced by each of the pipeline
blocks. The use of internal FIFO devices allows local area operations such as
two dimensional convolution and error diffusion to be performed without a
significant impact on processing speed or delay. A second advantage of the
pipeline is that it is extremely modular. This makes it easy to partition the
design into multiple devices, if desired, or to come up with new
implementations with various functions added or deleted.













Halftone 0 clocks 1 clock
Error Diffusion 0 clocks 4 clocks
Threshold 1 clock 1 clock
One limitation of the architecture is that data can only be processed
sequentially. This means that it cannot perform operations which require
more arbitrary access to data in an image bitmap, such as image rotation or
scaling. These operations would require a page buffer at either the input or
output of the device. This added functionality was sacrificed in favor of a
simpler interface and no page buffer requirement.
Another drawback of the pipeline architecture is that all blocks must
operate at the same clock rate, which is limited by the slowest block. This
means that a computationally intensive block, such as convolution or error
diffusion, can negatively impact overall device performance if it is not
structured properly. This problem can be minimized by structuring complex
blocks as smaller local pipelines with the block function broken into smaller
operations which can be performed quickly. An example of this is found in
the convolution block, where a DFF is inserted in the image data path after
each arithmetic operation (multiply, add, or divide). This methodology
increases the number of clock cycles required to process data through a
block, but also increases the rate at which the clock can run. A potentially
large gain in throughput is achieved at the price of a small increase in latency.
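The trade between latency and clock rate can be illustrated numerically. In
the Python sketch below the stage delays are hypothetical values chosen for
the example; they are not measurements from the EMKIPC design. The only
assumption about the hardware is that the clock period must cover the slowest
register-to-register stage.

```python
def max_clock_mhz(stage_delays_ns):
    """The clock period must be at least as long as the slowest stage."""
    return 1000.0 / max(stage_delays_ns)


# Hypothetical delays in nanoseconds (not taken from the thesis).
unpipelined = [60.0]             # multiply, add and divide within one clock
pipelined = [25.0, 15.0, 20.0]   # a DFF after each arithmetic operation

print(max_clock_mhz(unpipelined), "MHz with", len(unpipelined), "clock of latency")
print(max_clock_mhz(pipelined), "MHz with", len(pipelined), "clocks of latency")
```

With these example numbers the pipelined block adds two clocks of latency but
lets the clock run more than twice as fast.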
A final drawback to the pipeline approach is that some additional latency
may be incurred as a result of passing data from one pipeline stage to the
next. This is mainly due to the need for synchronization at the output of each
stage.
Chapter 3 - Histogram Modification
The histogram of an image is a representation of the frequency of
occurrence of each of the possible grayscale levels. Histogram modification is the
process of changing the distribution of gray levels in an image for the
purpose of improving the appearance of the image or to emphasize or de-emphasize
one or more ranges of gray levels. In the case of a digital copier this can
serve several functions. One is to allow a user to change the copy contrast.
Another is to compensate for undesirable characteristics of the input scanner
or the output printing engine. In all applications of histogram modification
discussed here, the number of utilized gray levels in the resulting image is
less than or equal to the number in the original. The distribution of the
original gray levels is all that is changed.
One example of an application of histogram modification is histogram
equalization. The goal of histogram equalization is to obtain an output image
with a uniform histogram. That is, there is roughly the same number of pixels
at each of the gray levels (0 to 255 in an 8-bit system). The result of
histogram equalization is the full utilization of the available dynamic range of the
system. The equation for histogram equalization is found from the expected
value shown in (Equation 3.1).
Histogram Equalization [1][2] (Equation 3.1)
In (Equation 3.1), sk is the new pixel value to which original pixels with
value k are mapped, N is the total number of pixels in the image, j is the
current gray level, and nj is the number of original pixels at gray level j.
Histogram equalization requires knowledge of the original image's histogram and
calculation of the value map prior to processing the image.
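As an illustration of how such a value map might be computed on the host, the
Python sketch below assumes the standard cumulative form of histogram
equalization, sk = (L - 1) * (n0 + n1 + ... + nk) / N, with L gray levels. The
scaling to the range 0 to L - 1 and the rounding are assumptions of this
sketch, as is the function name; the thesis equation itself is not reproduced
here.

```python
def equalization_lut(histogram):
    """Build the k -> sk value map from a list of per-level pixel counts nj."""
    levels = len(histogram)           # L, e.g. 256 for 8-bit data
    total = sum(histogram)            # N, the number of pixels in the image
    lut, running = [], 0
    for count in histogram:
        running += count              # sum of nj for j = 0..k
        lut.append(round((levels - 1) * running / total))
    return lut


# A tiny 4-level example: every pixel sits at level 0.
print(equalization_lut([4, 0, 0, 0]))
```

The resulting list is indexed by the original pixel value, so it has the same
shape as a table that is addressed by the input pixel.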
Another application is density modification. In density modification, the
overall lightness of the image is changed. This can be accomplished by adding
a fixed value (positive or negative) to each pixel. Limits must be applied
to keep the resulting values within the range of valid values (0 to 255 in an
8-bit system). This function can be used to implement the lighten and darken
copy modes seen on most copiers. The equation for density modification is
seen in (Equation 3.2) below where x is the original pixel value, b is the
positive or negative offset, and y is the resulting value.
y = x + b (Equation 3.2)
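A minimal Python sketch of (Equation 3.2) with the limits applied is shown
here. The function name is illustrative; the clamp to 0 and 255 follows the
8-bit range stated in the text.

```python
def density_modify(x, b):
    """y = x + b, limited to the valid 8-bit range 0..255."""
    return max(0, min(255, x + b))


print(density_modify(100, 20), density_modify(250, 10), density_modify(5, -10))
```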
A third application of histogram modification is contrast adjustment. To
increase image contrast, pixels which fall below a selected threshold are
moved closer to zero and pixels above the threshold are moved closer to 255.




y = m * x                      x < thresh    (Equation 3.3)
y = 255 - (n * (255 - x))      x >= thresh   (Equation 3.4)
In (Equation 3.3) and (Equation 3.4), x is the original pixel value, thresh
is the selected threshold value, m is the scale factor by which pixels below
thresh are moved closer to zero, n is the scale factor by which pixels above
thresh are moved closer to 255, and y is the resulting pixel value. Once
again, limits must be applied to keep y in the range 0 to 255.
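The contrast adjustment can be sketched in Python as follows. The upper branch
is (Equation 3.4) as printed. The lower branch, y = m * x, is an assumption
inferred from the definition of m as the factor that moves pixels below thresh
closer to zero. The function name and the rounding step are also assumptions.

```python
def contrast_adjust(x, thresh, m, n):
    """Scale pixels below thresh toward 0 and pixels at or above it toward 255."""
    if x < thresh:
        y = m * x                      # assumed form of the below-threshold branch
    else:
        y = 255 - n * (255 - x)        # Equation 3.4
    return max(0, min(255, round(y)))  # keep y in the range 0 to 255


print(contrast_adjust(100, 128, 0.5, 0.5), contrast_adjust(201, 128, 0.5, 0.5))
```

Scale factors between 0 and 1 increase contrast; a factor of 1 leaves that
side of the threshold unchanged.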
There are a couple of options with regard to the means of executing
histogram modification in a VLSI device. One that comes to mind is to implement
a linear transformation by performing addition and multiplication
operations on the input data as in (Equation 3.5) below.
y = (m * x) + b (Equation 3.5)
This method would yield fairly simple hardware but would have limited
usefulness as demonstrated by the examples cited above. Even if it were
designed so that multiple unique linear equations could be applied to multiple
ranges of values, it is still limited to piecewise linear operations. The
hardware would become large and complex quickly if the number of allowable
equations is increased to add flexibility. For these reasons, the EMKIPC
implements histogram modification using a 256 element Look Up Table
(LUT). The 8-bit input pixel value is used as the address into the table and
the 8-bit data value at that location is the output. The LUT is programmable
using the host CPU interface. This method allows any histogram modification
algorithm (linear or non-linear) to be implemented in software on the host
CPU and then loaded into the LUT prior to processing images. Four LUTs are
used to simultaneously process all four pixels of the 32-bit wide data path.
All four LUTs are written at once during CPU access to save table load time.
Each of the four tables can be selected individually for diagnostic read back
using two register select bits in the control register.
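The table lookup described above can be modeled in software as shown in the
Python sketch below. It is an illustration only: the placement of the four
pixels within the 32-bit word (pixel 0 in the low byte) and the function name
are assumptions, and an inverting table is used simply as an example of a map
computed on the host.

```python
def apply_luts(word, luts):
    """Map each 8-bit pixel of a 32-bit word through its own 256-entry LUT."""
    out = 0
    for channel in range(4):
        pixel = (word >> (8 * channel)) & 0xFF        # pixel value is the LUT address
        out |= luts[channel][pixel] << (8 * channel)  # table data is the output pixel
    return out


# The same table is written to all four LUTs; this one inverts each pixel.
table = [255 - x for x in range(256)]
luts = [table] * 4
print(hex(apply_luts(0x00FF10A0, luts)))
```

Because the map is just a table, replacing `table` with any other 256-entry
list changes the function without changing the lookup itself.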
A sample of an image which was processed under simulation by the
EMKIPC device with histogram modification enabled is shown in Figure 3.2.
The histogram modification LUTs were loaded in such a way that the original
pixel values are inverted, creating a negative of the original image. The
equation used is [y = 255 - x] where x is the original pixel value and y is the output
pixel value. This function was used because of the ease of visual verification
of the results. The image was processed through the entire pipeline with only
the histogram modification block enabled. Refer to Chapter 8 - VHDL
Issues for a discussion of the methods used for processing an image under
simulation. The original and output images are both continuous tone. The
output shown here appears as a halftone because it was printed on a binary
laser printer.
Figure 3.1 - Original image
Figure 3.2 - Image processed using Histogram Modification
The block diagram for the histogram modification hardware, which is
defined using VHDL, is shown in Figure 3.3. Note that there are
data and sync signal DFFs on the input to this block which synchronize the
device input to the input clock. These contribute one of the two minimum
clock delays to get data through the device. Input registers are required so
that processing internal to the device can happen over a full clock cycle
independent of the setup or hold time of the data and control signals coming into
the chip. The muxes on the output of the block are used to select either the
input data or the output of the LUTs depending on whether or not histogram
modification is enabled.
The eight-bit two-to-one muxes (mux_2to1_8) which have the CPU
address bus and synchronized image data as inputs are used to allow CPU
access to the LUTs during device configuration. Note that the muxes are
controlled by the device chip select. This means that if CSZ goes low during
imaging when histogram modification is enabled, the output image data will
be affected. This device should not be accessed by the CPU while it is
processing image data. The four reg_arry blocks are the actual LUTs. The
eight-bit address input to these blocks is the lower eight CPU address bits
during setup and is the image data during imaging. The cpu_d(7:0) bus is
only used during setup.
The second set of D Flip Flops is used to capture the LUT output when
the block is enabled. Corresponding DFFs for the sync signals give them the
same delay as the image data.
The hra_cpu_dcd (histogram RAM array cpu decode) block is where the
control register is implemented and where the lut_sel(1:0) bits are decoded to
produce a read enable for each of the four LUTs.
Figure 3.3 - Histogram Modification Block Diagram
The VHDL code for the histogram modification block can be found in
Appendix B. A mixture of the structural, behavioral, and dataflow styles is
used in the implementation of this block. The dff, dff32, and reg_arry
(register array) blocks are described behaviorally. The mux_2to1, mux_2to1_8,
mux_2to1_32 and hra_cpu_dcd blocks are described using the data flow style
and all the blocks are interconnected using structural VHDL. This method
allows the code to closely resemble the block diagram, which simplifies
development and debug. It also allows the LUT block to be developed and
potentially synthesized as a stand alone unit since it is declared as an entity.
This simplifies the optimization of the design.
The reg_arry (register array) block was defined in the code as an array of
256 elements of type std_logic_vector(7 downto 0). This type (ram_data)
was declared in the package types_emk which can be found at the beginning
of Appendix B. The index to an array in VHDL is an integer so the address
input, which is an 8-bit std_logic_vector, must be converted to an integer.
The solution to this problem is also found in the package types_emk in the
form of the function vec_int (vector to integer). This function takes an
argument of type std_logic_vector (of any number of bits up to the limit of type
integer) and returns an integer value equivalent to the interpretation of the
vector as an unsigned binary number. This function is used frequently
throughout the design of the EMKIPC.
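The conversion performed by vec_int can be illustrated with a short software
model. The Python sketch below is not the VHDL function from the package
types_emk; it only shows the unsigned interpretation described above, using a
string of '0' and '1' characters with the most significant bit first. How the
real function treats other std_logic values (such as 'X' or 'Z') is not stated
here, so this sketch simply rejects them.

```python
def vec_int(bits):
    """Interpret a string of '0'/'1' characters (MSB first) as an unsigned integer."""
    value = 0
    for bit in bits:
        if bit not in ("0", "1"):
            raise ValueError("only '0' and '1' are handled in this sketch")
        value = value * 2 + (1 if bit == "1" else 0)  # shift left, add the new bit
    return value


# An 8-bit address of all ones selects the last element of a 256 entry table.
print(vec_int("11111111"))
```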
The bidirectional nature of the cpu data bus requires special care in its
implementation within the histogram modification block. There are five
sub-blocks which can potentially drive the bus. They are the four LUT
blocks (reg_arry in the block diagram) and the hra_cpu_dcd block which
contains the control register for the block. The only time any of these
blocks can drive the bus is when it is selected (through decode of the
address bus and chip select) and the cpu_rdz line is held low. At all other
times, these blocks must not drive the data bus (i.e. they must leave it in
a high impedance state). This can be seen in the definition of each of
these two sub-block types.
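The bus rule above can be illustrated with a single readable register. This is a sketch under stated assumptions rather than the Appendix B code: the entity name, the sel input (standing in for the address and chip select decode), the cpu_wrz write strobe, and the synchronous write are all hypothetical; only cpu_rdz and the high impedance requirement come from the text.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity tri_reg_sketch is
  port (
    clk      : in    std_logic;
    sel      : in    std_logic;  -- hypothetical: '1' when this block is decoded
    cpu_rdz  : in    std_logic;  -- active-low read strobe
    cpu_wrz  : in    std_logic;  -- hypothetical active-low write strobe
    cpu_data : inout std_logic_vector(7 downto 0)
  );
end entity tri_reg_sketch;

architecture sketch of tri_reg_sketch is
  signal reg : std_logic_vector(7 downto 0) := (others => '0');
begin
  -- Drive the shared bus only while selected and being read;
  -- otherwise release it to high impedance so another block can drive it.
  cpu_data <= reg when (sel = '1' and cpu_rdz = '0') else (others => 'Z');

  -- Capture bus data on a write access (assumed synchronous here).
  write_reg : process (clk)
  begin
    if rising_edge(clk) then
      if sel = '1' and cpu_wrz = '0' then
        reg <= cpu_data;
      end if;
    end if;
  end process write_reg;
end architecture sketch;
```

Because std_logic is a resolved type, several such blocks can share cpu_data as long as at most one of them drives a value other than 'Z' at any time.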
The histogram modification block is relatively simple in design and
straightforward to implement in VHDL, but it is likely to use a fairly large
amount of chip area due to the size of the four LUTs. It is not expected to be
a limiting block in terms of chip speed since it is not computationally
intensive. A single access to each of the four LUT arrays is all that needs
to be done within a clock cycle. The regular structure and absence of
computational complexity of the LUTs will allow them to be accessed quickly.
Chapter 4 - Two Dimensional Convolution
Two dimensional convolution has a number of uses in image processing
including low pass filtering, high pass filtering and edge detection. Low pass
filtering is used to reduce the undesirable components of an image which
have a high spatial frequency, such as noise. High pass filtering, in contrast,
is used to accentuate the desirable high spatial frequency components of an
image such as edges. Edge detection is used to detect sharp transitions in
image density or lightness.
High and low pass filtering have direct application to image processing
in the digital copying and printing market. Low pass filtering can be used to
reduce noise in images, such as salt and pepper noise, which can result from
the scanning and data transmission processes. It can also be used to give an
image a "softer" overall appearance when this is desired. High pass filtering
can be useful in systems where the edges in the input image are deemed too
soft to be acceptable, such as in text. This can be the case in systems where
the MTF (modulation transfer function) of the optical system or relative
motion between the document and the imaging system attenuates the high
frequency components of the image [1][2].
The EMKIPC implements high and low pass filtering by performing a
two dimensional convolution on the original image (or on the image data
output from histogram modification) with a three by three filter kernel. The
kernel elements are programmable via the cpu interface with values from -127
to 127. Negative values are allowed so that high pass or Laplacian filter
kernels can be implemented. The equation for the convolution process [1][2]
is shown in Equation 4.1 on page 23 where:
Q is the image width in pixels
R is the image height in pixels
g(x,y) is the Q x R filtered image
f(x,y) is the Q x R original image
h(i,j) is the (2M + 1) x (2M + 1) filter kernel and M = 1 for a 3 x 3 kernel
P is the sum of the values of the (2M + 1)^2 kernel elements
\[ g(x,y) = \frac{1}{P} \sum_{i=-M}^{M} \sum_{j=-M}^{M} h(i,j)\, f(x+i,\, y+j), \quad 0 \le x \le Q-1 \text{ and } 0 \le y \le R-1 \]   (Equation 4.1)
This operation results in a weighted average of the pixel values in the 3 x
3 neighborhood of pixels covered by the kernel with the weights assigned by
the kernel elements. Examples of low pass, high pass, and edge detection
filter kernels are shown in Figure 4.1 on page 23.
Figure 4.1 - Filter kernel examples
 1  2  1        -1 -2 -1        -1 -2 -1
 2  4  2        -2 13 -2        -2 12 -2
 1  2  1        -1 -2 -1        -1 -2 -1
Low Pass        High Pass       Laplacian
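As a worked check of Equation 4.1 against the kernels of Figure 4.1, consider a flat region of the image. The pixel value 100 is an arbitrary illustration and does not come from the text.

```latex
% Low pass kernel: sum of the nine coefficients
P_{LP} = 1 + 2 + 1 + 2 + 4 + 2 + 1 + 2 + 1 = 16
% Flat region, every f(x+i, y+j) = 100:
g(x,y) = \frac{1}{16}\,(16 \cdot 100) = 100

% High pass kernel: center 13, four edge terms of -2, four corner terms of -1
P_{HP} = 13 - 4 \cdot 2 - 4 \cdot 1 = 1
g(x,y) = \frac{1}{1}\,(1 \cdot 100) = 100

% Laplacian kernel: center 12, same edge and corner terms
\sum_{i,j} h(i,j) = 12 - 4 \cdot 2 - 4 \cdot 1 = 0
\sum_{i,j} h(i,j)\, f(x+i, y+j) = 0 \cdot 100 = 0
```

The low pass and high pass kernels therefore leave flat areas unchanged, while the weighted sum for the Laplacian kernel is zero wherever the image is flat, so only transitions produce a response.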
The structure of the convolution hardware as implemented in the VHDL
code for the EMKIPC is shown in Figure 4.2 on page 25. Refer to this
diagram for the following discussion. Each of the major blocks in the diagram
represents an entity in the convolution filter VHDL code found in File:
cf_blk.vhd on page 124 of Appendix B - Device VHDL Code.
Throughout the discussion of convolution, the terms "previous",
"current", and "next" are used in reference to the three lines of image data
to which the convolution matrix is being applied. The word "Next" refers to a
line further down in the image which has not yet been convolved. "Current"
refers to the line being processed. "Previous" refers to a line further up in
the image which has already been convolved.
Figure 4.2 - Convolution Filter Block Diagram
The FIFO_CTL block consists primarily of a state machine which is
used to control the operation of the convolution block. It detects boundary
conditions and sends FIRST_LN and LAST_LN signals to the FIFOs block
as well as FIRST_WD and LAST_WD signals to the PIXEL_CTL block.
The FIFO_CTL block also sends the write enable (wen), read enable (ren),
and reset (FIFO_RSTZ) signals to the FIFOs. Delayed versions of VSYNC
and HSYNC which line up with the convolution output data are also
produced.
The area nature of the convolution operation requires a certain amount
of data storage. In the case of the three by three convolution performed by
the EMKIPC, two lines of stored data are required along with the input data
to perform the operation. A 32-bit wide by 2048 word deep FIFO entity was
created to meet this need. The VHDL code for the FIFO (and for a 36-bit wide
version used by error diffusion) can be found in File: fifo.vhd on page 114 of
Appendix B - Device VHDL Code. The FIFO depth (2048 words or 8192
pixels) was chosen as the minimum power of 2 size to handle an 11 inch 600
DPI image. Two of these FIFOs are used in the convolution block to allow
operation on three lines of data at once. The basic interconnection of the
FIFOs is shown in Figure 4.4 on page 28. This figure represents the contents
of the FIFOS block shown in Figure 4.2 - Convolution Filter Block Diagram
on page 25.
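The FIFO sizing stated above follows from a short calculation, shown here as a check using only the figures given in the text (11 inch line, 600 DPI, four 8-bit pixels per 32-bit word):

```latex
% Pixels in one 11 inch line at 600 DPI
11 \times 600 = 6600 \text{ pixels}
% 32-bit words needed at four pixels per word
\lceil 6600 / 4 \rceil = 1650 \text{ words}
% Smallest power of two that holds 1650 words
2^{10} = 1024 < 1650 \le 2^{11} = 2048 \text{ words}
% Pixel capacity of the chosen depth
2048 \times 4 = 8192 \text{ pixels}
```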
Boundary conditions exist at the edges of the image (Refer to Figure 4.3
on page 27). When HSYNC goes active for the first time after VSYNC goes
active (i.e. the first line of the image) the two FIFOs are empty so valid
output data cannot be produced. During this first line, MUX1 is switched so
that IN_IMG_D is loaded into both of the FIFOs. No data is read out of the
FIFOs during the first line. During subsequent lines, data is simultaneously
read from and written to both FIFOs which, combined with IN_IMG_D (next
line data), produces the three lines of data out from this block at the same
time. MUX1 in this configuration has the effect of replicating the first line of
data for use on the boundary at the beginning of the image.
Figure 4.3 (diagram not recovered; surviving labels: cna, cnb, cnc, Image line 2, Image line 3)
The matrix coefficients are defined as follows: cpa, cpb, cpc =
coefficient - previous line - pixel a, b, c; cca, ccb, ccc = coefficient -
current line - pixel a, b, c; cna, cnb, cnc = coefficient - next line - pixel
a, b, c; a is the left most pixel in the matrix row and c is the right most.
MUX2 in Figure 4.4 on page 28 is used to handle the boundary
condition which exists on the last line of the image. The fact that no output
line is produced during the processing of the first input line means that an
extra line of data must be produced after the last input line in order for the
output image to be of the same size as the input image. MUX2 is used to
switch the output of FIFO1 to go to both the NEXT and CURRENT outputs of the
block on the last line of the image. This has the effect of replicating the
last line of data for processing by kernel elements which lie off the bottom
of the page during processing of the last line.
A similar approach is used at the pixel level to handle the first and last
pixels of a line at the left and right edges of the image respectively. This
operation is performed in the block labeled PIXEL_CTL in Figure 4.2 on
page 25. Refer to Figure 4.5 - PIXEL_CTL Block on page 29 for the block
diagram of this sub-block. This figure shows only the NEXT line data path.
The same function is also performed in this block for the CURRENT and
PREVIOUS data paths. The mux paths selected when the mux control
signals are at logic level 0 are the paths for normal processing. The paths
selected by a logic level 1 are the paths which are switched in at the boundary
conditions at the beginning and end of a line.
In Figure 4.5, NLA through NLI represent the various delayed versions
of data. NL stands for next line (it would be CL or PL if this block showed
the CURRENT or PREVIOUS path respectively) and A represents the oldest
data (furthest left in the image) while I is the newest. Data for six pixels
(NLA to NLF) are required by the FILT blocks on each rising edge of the
clock. This is due to the four-pixel wide data paths and the two overlap
pixels required by the 3 x 3 convolution matrix. The number of unique pixels
required per clock is N + 2 where N is the width of the data path in pixels. A
one pixel wide data path would require three pixels of data (for each of the
three lines) on every clock to perform the convolution operation.
Figure 4.5 - PIXEL_CTL Block
The four FILT blocks are where the multiplications and additions take
place to calculate the value for each of the four pixel data paths. All four
FILT blocks get the same kernel information from the cpu interface block
(CF_CPUIF) which includes the nine kernel values and the kernel total for
normalization. Each of the FILT blocks receives a unique set of nine pixel
values, three from each of the three lines, which it processes according to the
following equation.
\[ \text{result} = \frac{(nl\_a \cdot cna + nl\_b \cdot cnb + nl\_c \cdot cnc) + (cl\_a \cdot cca + cl\_b \cdot ccb + cl\_c \cdot ccc) + (pl\_a \cdot cpa + pl\_b \cdot cpb + pl\_c \cdot cpc)}{kern\_tot} \]   (Equation 4.2)
nl_a, b, c are next line pixels a, b, and c.
cl_a, b, c are current line pixels a, b, and c.
pl_a, b, c are previous line pixels a, b, and c.
cna, b, c are the matrix coefficients for the next line pixels a,
b, and c.
cca, b, c are the matrix coefficients for the current line pixels
a, b, and c.
cpa, b, c are the matrix coefficients for the previous line pixels
a, b, and c.
This operation is broken down into four separate operations, which
occur in a pipeline, in order to maximize the speed at which the device can
operate. The operations are:
Nine multiplications - multiply each pixel data by its
coefficient.
Six additions - add the multiplication results for each line.
Two additions - add the line sums.
One division - normalize the result using the kern_tot input.
This separation can be seen in the code fragment below, where the result
    pla_prod <= cf_mult(cpa,pl_a) AFTER DELAY1;
    plb_prod <= cf_mult(cpb,pl_b) AFTER DELAY1;
    plc_prod <= cf_mult(cpc,pl_c) AFTER DELAY1;
    cla_prod <= cf_mult(cca,cl_a) AFTER DELAY1;
    clb_prod <= cf_mult(ccb,cl_b) AFTER DELAY1;
    clc_prod <= cf_mult(ccc,cl_c) AFTER DELAY1;
    nla_prod <= cf_mult(cna,nl_a) AFTER DELAY1;
    nlb_prod <= cf_mult(cnb,nl_b) AFTER DELAY1;
    nlc_prod <= cf_mult(cnc,nl_c) AFTER DELAY1;
END IF;
IF hsync1 = '1' THEN
    pl_sum <= pla_prod + plb_prod + plc_prod AFTER DELAY1;
    cl_sum <= cla_prod + clb_prod + clc_prod AFTER DELAY1;
    nl_sum <= nla_prod + nlb_prod + nlc_prod AFTER DELAY1;
END IF;
IF hsync2 = '1' THEN
    pcn_sum <= nl_sum + cl_sum + pl_sum AFTER DELAY1;
END IF;
IF hsync3 = '1' THEN
The bit precision which is carried through the mathematical operations is
worthy of mention. The inputs to the multiply and divide functions are sign
extended to the precision of the result prior to performing the operation in
order to avoid overflow. Each of the nine multiplications produces a 16-bit
result from multiplying two 8-bit input values. The sum of these nine 16-bit
numbers produces a 20-bit result. The final normalizing division by the
12-bit kernel total returns an 8-bit value. Maintaining this precision is very
costly in several ways. Simulation times are increased due to the large
number of bits being processed. Synthesis and layout of the circuit will be
very difficult and the resulting chip area will be quite large. Some of the
lower order bits can be dropped at each stage of the process to alleviate
these problems. This will result in some loss of accuracy, depending on how
many bits are dropped. The resulting image quality from various precisions
can be tested using simulation to optimize the balance between image quality
and chip area.
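The 16-bit and 20-bit widths quoted above can be checked with a short bound calculation. It assumes unsigned 8-bit pixel values (0 to 255), which the text does not state explicitly at this point, together with the coefficient range of -127 to 127 given earlier.

```latex
% Largest magnitude of one pixel-by-coefficient product
|p|_{max} = 255 \times 127 = 32385 < 2^{15} = 32768
% so each product fits in 15 magnitude bits plus a sign bit: 16 bits

% Largest magnitude of the sum of nine such products
|s|_{max} = 9 \times 32385 = 291465
2^{18} = 262144 < 291465 < 2^{19} = 524288
% so the sum needs 19 magnitude bits plus a sign bit: 20 bits
```

Equivalently, adding nine 16-bit values can grow the width by at most \lceil \log_2 9 \rceil = 4 bits, which gives the same 20-bit figure.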
The outputs of the four FILT blocks are used to form a 32-bit result for
the convolution block. The muxes at the output of the convolution block are
used to select either processed data or the original input data dependent on
whether or not the block is enabled. Additionally, muxes at the input of the
block prevent the input sync signals from going into the block when it is not
enabled. This will reduce unnecessary switching and thus power usage when
the convolution block is disabled.
An image was processed through the EMKIPC device under simulation
with only the convolution filter block enabled. The filter kernel values were
chosen to generate an output image which would show obvious results. In
this case, the Laplacian kernel shown in Figure 4.1 on page 23 was used to
demonstrate edge detection. The resulting image is shown in Figure 4.6 on
page 33.
Figure 4.6 - Image processed with Laplacian Filter
Chapter 5 - Halftone Screening
The purpose of the Halftoning block is to convert 8-bit image data to
binary in a way that allows pictorial images to retain a grayscale appearance.
This function is performed in the EMKIPC using two methods. The first
method, ordered dither, is discussed in this chapter. The second method,
error diffusion, is discussed in Chapter 6 on page 41. Ordered dither is
accomplished by overlaying a matrix of threshold values on the image.
Applying a spatially variant threshold in this way produces a pattern of dots
which the human visual system integrates into an image with a gray
appearance at normal viewing distances.
The primary trade-off involved in implementing ordered dither is cell
size versus the number of gray levels which can be represented. The higher
the number of cells per inch, the less likely it is that the cell structure will be
visible when viewing the image. A cell is a number of pixels grouped
together to form a larger dot of varying size and, therefore, gray appearance.
The size of the dot is determined by the number of "on" pixels in the cell.
The number of pixels in a cell determines the number of gray levels the cell
can represent. Unfortunately, a higher number of cells per inch means a
lower number of pixels per cell and, as a result, a smaller number of gray
levels which can be represented by a cell. Larger cell size reduces the chance
that visible contouring will be present in the binary image by increasing the
number of gray levels which can be represented. The compromise between
these conflicting goals is generally decided as a function of output system
resolution. A higher resolution system such as a 600 DPI printer can use a
large cell size and still be fairly immune to the cells becoming objectionable.
Cells become objectionable when the human eye does not integrate them into
an image with a continuous gray appearance at a normal viewing distance.
An image processed using the halftone feature of the EMKIPC is shown
in Figure 5.1 on page 35. It should be noted that the printing device used to
print this document attempts to apply its own halftoning algorithm to this
image so the results shown are not exactly as the bitmap appears. A dither
matrix was used which grows two dots at a 45 degree screen angle in the
eight by eight matrix. This matrix is shown in Figure 5.2 on page 36.
Figure 5.1 - Image processed Using Ordered Dither
The EMKIPC uses an 8 pixel by 8 pixel dither matrix which can be
programmed by the user via the device CPU interface. Each matrix element is
loaded with an 8-bit threshold with valid values from 0 to 255. The actual
halftone cell size depends on how the dither matrix is loaded. The device
programmer is free to design any cell pattern that will fit within the 8 by 8
matrix constraint. Cell structures at both 90 degree and 45 degree orientation
have been tested under simulation. At 90 degrees, 2 x 2, 4 x 4, and 8 x 8
cells are simple to implement. At 45 degrees, 8 element and 32 element cells
have been tested. The possibilities for cell design are even more varied if
mixed cell sizes are used within the 8 x 8 dither matrix.
Figure 5.2 - Example Dither Matrix
 12  52 196 240 228 156  76  20
 60 108 184 136 192 220 116  68
204 176  96  40  48 104 212 148
232 128  32   0   8  56 200 252
236 168  88  24  16  64 144 248
132 180 120  80  72 112 208 172
 36 100 188 160 152 216 124  92
  4  44 140 244 224 164  84  28
An additional use for the dither matrix is to supply the error diffusion
block with a spatially varying threshold [13]. Some of the artifacts associated
with error diffusion can be reduced in this way. The values programmed into
the dither matrix for this purpose would be different than those used to
perform straight ordered dither. The matrix would be used to apply a small
signal variation around the desired threshold.
The dither matrix is mapped into the device address space for
programming via the CPU interface. Refer to Appendix A - Device User's
Specification on page 74 for a description of the CPU interface to this block.
The block diagram of the halftone block is shown in Figure 5.3 on page
39. The HT_CPU_IF (Halftone CPU Interface) block decodes the device
chip select (CSZ) and the nine address lines to generate the eight select
signals (R_SELZ(7:0)) for the eight rows of the dither matrix. This block also
implements a control register which contains a single bit (HT_ENABLE)
which is used to enable/disable the halftone block. When the halftone block
is disabled, the input image data (IMG_DIN) and sync signals are passed
straight through to the output. Also, a fixed 8-bit threshold value from
another register in the HT_CPU_IF block is propagated instead of the dither
matrix output. This allows straight thresholding without programming the
entire dither matrix with a single value.
The halftone block has the same input image data interface as the other
blocks in the device but on the output there is an additional 32-bit bus. This
bus is used to transmit the threshold value for each pixel along with the image
data for that pixel to the next processing block. The image data is not
modified in the halftone block but it is delayed as required using DFFs to
remain in sync with the threshold values. The HT_MATRIX block contains the
dither matrix. The eight values contained in a row of the matrix are sent out
four at a time over the 32-bit bus in an alternating manner during active
image time (VSYNC and HSYNC high). The same eight values are repeated
throughout an entire line of the image. When HSYNC goes low (inactive)
the three-bit line counter in the HT_LINE_CNT block is incremented so that
the next line will use the next row of the dither matrix. The output of the line
counter is used as the select input to an eight-to-one by 32 mux. After eight
lines have been processed the line counter value rolls over to zero and the
dither matrix is indexed once again starting with the first row.
The output of the MUX_8TO1_32 block is then synchronized, along
with the input image data and sync signals, to present signals to the next
block in the pipeline with adequate setup time. Access to data in the 8 x 8
dither matrix will be very fast so the halftone block will not be a factor in
limiting device speed.
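The row selection described above can be sketched as follows. This is an illustration written for this discussion and not the Appendix B halftone code: the entity name, the rstz reset, the 64-bit row storage, the half toggle used to alternate between the two groups of four thresholds, and the clearing of the counter while VSYNC is low are all assumptions. Only the three-bit line counter that advances when HSYNC goes low and wraps after eight lines comes from the text.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ht_row_sel_sketch is
  port (
    clk    : in  std_logic;
    rstz   : in  std_logic;  -- hypothetical active-low reset
    vsync  : in  std_logic;  -- high during the active image
    hsync  : in  std_logic;  -- high during an active line
    thresh : out std_logic_vector(31 downto 0)  -- four 8-bit thresholds
  );
end entity ht_row_sel_sketch;

architecture sketch of ht_row_sel_sketch is
  type row_array is array (0 to 7) of std_logic_vector(63 downto 0);
  -- Placeholder contents; the real matrix is loaded via the CPU interface.
  signal matrix   : row_array := (others => (others => '0'));
  signal line_cnt : unsigned(2 downto 0) := (others => '0');
  signal hsync_d  : std_logic := '0';  -- HSYNC delayed by one clock
  signal half     : std_logic := '0';  -- '0' selects the first four values
begin
  process (clk)
  begin
    if rising_edge(clk) then
      hsync_d <= hsync;
      if rstz = '0' or vsync = '0' then
        line_cnt <= (others => '0');
        half     <= '0';
      elsif hsync = '1' then
        half <= not half;            -- alternate row halves on each word
      elsif hsync_d = '1' then       -- HSYNC has just gone low: end of line
        line_cnt <= line_cnt + 1;    -- three-bit count wraps from 7 to 0
        half     <= '0';
      end if;
    end if;
  end process;

  -- Eight-to-one row select followed by the choice of row half.
  thresh <= matrix(to_integer(line_cnt))(63 downto 32) when half = '0'
            else matrix(to_integer(line_cnt))(31 downto 0);
end architecture sketch;
```

Because line_cnt is a three-bit unsigned value, the wrap from row 7 back to row 0 needs no explicit comparison.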
Figure 5.3 - Halftone Screening Top Level Block Diagram
The halftone block was synthesized using the AutoLogic II tools from
Mentor Graphics. This block was chosen for synthesis because it is of
moderate complexity and should result in a synthesized design of reasonable
size. For a more complete discussion of the synthesis results, refer to
Chapter 9 on page 61.
Chapter 6 - Error Diffusion
Error diffusion is another method of halftoning. It is used, like ordered
dither, to generate the appearance of gray levels in a binary document. Error
diffusion has several advantages over ordered dither. First, it avoids aliasing
of the applied cell frequency with any spatial frequencies which exist in the
original document. This is particularly useful when the original scanned
document was created using ordered dither. In this case, the scan resolution
can create an interference frequency with the cell frequency of the original
which results in visible artifacts known as moire patterns. Error diffusion
avoids this problem by generating a dispersed pattern of dots which has no
inherent spatial frequency imposed on it.
Another advantage of error diffusion over ordered dither is that error
diffusion does a better job of retaining the overall gray level of the original
image. With ordered dither, error is introduced as the threshold is applied to
each pixel. These errors generally average out over the image but in some
types of images, a total error can result. Error diffusion avoids this by
passing on the threshold error to neighboring pixels which have not yet had a
threshold applied to them. This allows the resulting binary image to retain
the same overall gray content as the original multi-bit per pixel image.
One disadvantage of error diffusion is the computational complexity
required to implement it. The error from thresholding a given pixel is
typically distributed over several as yet unprocessed pixels. This means that
once the error is calculated for a given pixel, a percentage of that error must
be computed for each pixel which will receive a portion of it, and then the
portion must be added to the value of the target pixel. This requires several
multiplications, divisions, and additions for each pixel in the image. On top
of this, since the error is typically propagated to pixels one or more lines
further into the image, line storage is required.
A second disadvantage to error diffusion is that it has its own set of
visible artifacts. A pattern which looks like squiggly lines can be apparent,
particularly in areas of constant low or high density near the gray level limits
(white or black). This results from error slowly building up as it is
propagated until it finally "spills over", resulting in a pixel of the opposite
sense. This problem can be reduced by spreading the error over a larger
number of pixels or by using a varying threshold as discussed in the previous
chapter.
Error diffusion is implemented in the EMKIPC using a programmable
mask with the same shape as the Floyd-Steinberg mask [3]. This is an "L"
shaped mask in which the error is distributed to one pixel to the right on the
same line as the current pixel and to three pixels in the next line. This is
equivalent to saying that the current pixel, X, receives portions of the total
error from each of the four pixels, i through l, as shown in Table 6.1 on page
42.
Table 6.1 - Error Diffusion Pixel Reference
j  k  l
i  X
The equation for the new value of X is as follows:
\[ X_{new} = X_{old} + \frac{m_i}{16}\, e_i + \frac{m_j}{16}\, e_j + \frac{m_k}{16}\, e_k + \frac{m_l}{16}\, e_l \]
The values for e_i through e_l are the total error for pixels i through l. The
multiplier values (m_i through m_l) are programmable via the device CPU
interface.
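As a worked illustration of this equation, the multipliers can be set to reproduce the classic Floyd-Steinberg weights. In the pixel layout of Table 6.1 this corresponds to m_i = 7, m_j = 1, m_k = 5, and m_l = 3. The pixel and error values below are arbitrary and chosen only so that each term divides evenly; they do not come from the text.

```latex
% Classic Floyd-Steinberg weights expressed as multipliers (sum must be 16)
m_i + m_j + m_k + m_l = 7 + 1 + 5 + 3 = 16

% Illustrative inputs
X_{old} = 100, \quad e_i = 32, \quad e_j = -16, \quad e_k = 16, \quad e_l = 0

% Contribution of each neighbor
\frac{7}{16}(32) = 14, \quad \frac{1}{16}(-16) = -1, \quad
\frac{5}{16}(16) = 5, \quad \frac{3}{16}(0) = 0

% New pixel value
X_{new} = 100 + 14 - 1 + 5 + 0 = 118
```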
An example of an image processed using error diffusion, as
implemented in the EMKIPC, is shown in Figure 6.1 on page 43. This image was
produced by running the grayscale original through a simulation of the
EMKIPC with error diffusion and thresholding turned on.
Figure 6.1 - Image processed using Error Diffusion
The top level block diagram for the error diffusion block is shown in
Figure 6.2 on page 44.
Figure 6.2 - Error Diffusion Top Level Block Diagram
The error diffusion block required more work to achieve functional
status than any other block in the EMKIPC design. The initial attempt at
implementing it had several errors in it which required that the design
virtually be started over. In the first design, the error for a pixel was
calculated, the fraction meant for the next pixel on the same line was
diffused, and then all the pixel values for that line were stored with part of
their error added. As the next line was processed, the previous line data was
retrieved and the error contribution for pixels in the next line was computed
and added to them. This approach has the flaw that, when the error is
calculated for the next line, some error has already been added to the original
pixel values. This approach resulted in images which did not retain the gray
content of the original and which had an objectionable appearance. In
addition, this approach required 10-bit storage for each pixel since pixel
value plus error was being stored with possible values from -254 to +510.
Further, this approach required a second FIFO to store the threshold values
along with the pixel values. The final approach achieves correct results with
significantly less hardware.
The blocks in the block diagram closely resemble the way the VHDL
code is structured. The ED_CTL, ED_CPU, FIFO_36, and ED_PROCESS
blocks are all defined as separate entities and then interconnected using
structural VHDL. The output muxes are defined using a dataflow style of code
and the D Flip-Flops are defined behaviorally within a process. Refer to the
File: ed_blk.vhd on page 145 in Appendix B - Device VHDL Code, and to
Figure 6.2 on page 44 for the following discussion.
The ED_CPU block decodes CPU accesses to the error diffusion block
and implements the five required registers. There is one control register
which contains two active bits. These are used to enable/disable the error
diffusion and threshold blocks respectively. The enable bit for the threshold
block is implemented here since the threshold block has no other reason to
have a CPU interface. Chip area is saved by using the logic already present
in the error diffusion block to implement this bit and pass it on to the
threshold block. The other registers implemented here are the four multiplier
values which determine how much of the pixel error is propagated from which
of the four neighboring pixels. The four multiplier values (m_i to m_l) are
implemented as four-bit read/write registers. The multiplication result is a
12-bit value. An automatic divide by 16 is done on the multiplication result
by dropping the lower four bits. This has the effect of making the multiplier
m/16 where the maximum value of m is 15. It is up to the device programmer
to ensure that the total of the four m values equals 16 so that no error is
introduced to or lost from the image. The ED_CPU block is implemented
entirely using the dataflow coding style.
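The multiply and divide-by-16 step described above can be sketched as a function. This is a hypothetical helper written for illustration, not code from Appendix B: the package and function names are invented, and the intermediate width of 14 bits is simply what ieee.numeric_std produces for these operand sizes, which is wider than the 12-bit result quoted in the text.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package ed_weight_sketch is
  -- Hypothetical helper: returns (err * m) / 16 by dropping four low bits.
  function weight_err (err : signed(8 downto 0);
                       m   : unsigned(3 downto 0)) return signed;
end package ed_weight_sketch;

package body ed_weight_sketch is
  function weight_err (err : signed(8 downto 0);
                       m   : unsigned(3 downto 0)) return signed is
    variable prod : signed(13 downto 0);
  begin
    -- Extend m with a leading '0' so it is treated as a non-negative
    -- signed value; a 9-bit by 5-bit signed multiply yields 14 bits.
    prod := err * signed('0' & m);
    -- Arithmetic shift right by four discards the lower four bits.
    -- With m <= 15 the result always fits back into nine bits.
    return resize(shift_right(prod, 4), 9);
  end function weight_err;
end package body ed_weight_sketch;
```

Note that discarding the low bits of a two's complement value rounds toward negative infinity, so a product of -7 becomes -1 rather than 0. Whether the EMKIPC code behaves this way for negative errors is not stated in the text.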
The ED_CTL block is used primarily to control writing to and reading
from the error FIFO, and to generate delayed versions of the sync signals
which are aligned with the output of the ED_PROCESS block. These
functions are performed by a state machine implemented with behavioral VHDL.
The entire block is made up of two processes. The first defines the state
transitions of the state machine and the second implements the needed
sequential logic.
The FIFO_36 block contains a 36-bit wide by 2048 word deep FIFO for
storing the quantization error which results when a threshold is applied to the
gray image data. This FIFO stores a nine bit error value, in 2's complement
format, for each pixel in the current line. When the next line is processed, the
error values for the previous line are read back out and added, in the
proportions specified by the multiplier values, to the appropriate pixels before
they are thresholded. Nine bits are required for each pixel since the
quantization error can be positive or negative in the range -255 to 255. The
VHDL code for the FIFO can be found in the File: fifo.vhd on page 114 in
Appendix B - Device VHDL Code. The FIFO uses a synchronous interface so
that when wen (write enable) is active, data is written into the FIFO on every
rising edge of clk. A word of data is read out on every rising edge of clk
when ren (read enable) is active. The 36-bit FIFO is implemented in the same
way as the 32-bit FIFO used for convolution.
The ED_PROCESS block is where the bulk of the computational work
is done for error diffusion. The block diagram for the ED_PROCESS block
is shown in Figure 6.3 on page 49. This diagram shows the details for
processing all four pixels in a 32-bit data word simultaneously. The
characteristic of this implementation of error diffusion which makes it the
slowest block in the EMKIPC device is the fact that the ERR output of the
ED_ERR_CALC blocks on the right side of the diagram is fed back to the
input of the next pixel in the word. This means that within a clock cycle the
sequence ADD, ERR_CALC, MULTIPLY is done four times in sequence.
This is due to the nature of the Floyd-Steinberg mask which passes a portion
of the error to the next pixel to the right. This, along with the fact that the
EMKIPC pipeline processes four pixels on every clock cycle, results in a
large amount of logic between D flip-flops in this block. This problem could
be reduced by a factor of four if a mask were used which only diffuses error
to pixels on subsequent lines. The Floyd-Steinberg mask is used for the
purposes of this thesis, but if this design were to be implemented in hardware,
it is strongly recommended that it be changed so that error is only propagated
to pixels in subsequent lines.
It should be noted that the output of the error diffusion block is still gray
data. The threshold is not actually applied to the data until it reaches the
threshold block which is discussed in Chapter 7 on page 50. This is done to
accommodate the case where error diffusion is disabled but ordered dither or
straight thresholding is enabled. The threshold values are only needed in
the error diffusion block so that quantization error can be calculated.
The ED_PROCESS VHDL architecture code has two main parts to it.
The first part is a set of function definitions and calls to implement the
required multiplication, addition, error calculation, and output limit checking.
The function definitions are located just before the BEGIN statement for the
ED_PROCESS and the function calls are made as the first group of
operations after the BEGIN statement. The second part of the code is a process
which is used to implement the DFFs for the intermediate and final results.
This partitioning of the code for this complex block makes it much easier to
read, understand, and debug and was arrived at after many iterations of failed
attempts as described previously.
Figure 6.3 - Error Diffusion Process Block Diagram
Chapter 7 - Threshold Application
The application of a threshold to the image data is the final step used in
any of the methods employed by the EMKIPC for converting grayscale data
into binary. The process of thresholding simply involves comparing the
image data, pixel by pixel, to a threshold value and sending a "0" to the
output if the pixel value is less than or equal to the threshold or a "1" to the
output if the pixel value is greater than the threshold. A possible enhancement
to the current implementation is to make the threshold output polarity user
selectable since a "0" output may be desired for pixel values greater than the
threshold rather than a "1". Many 8-bit gray data systems represent black
pixels with a value of 0 and white with a value of 255. In binary printing
systems, a "1" is often used to indicate a printed or black pixel. This can
easily be implemented with a 2-input exclusive-OR gate at each output bit and
a CPU accessible control bit.
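The suggested enhancement is small enough to sketch directly. The entity below is hypothetical and is not part of the EMKIPC source: the entity name th_polarity, the port names, and the invert control bit are assumptions chosen for illustration.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical output polarity stage; all names are assumptions.
ENTITY th_polarity IS
    PORT (
        th_d   : IN  std_logic_vector(7 DOWNTO 0);  -- thresholded pixel, 00 or FF hex
        invert : IN  std_logic;                     -- CPU accessible control bit
        out_d  : OUT std_logic_vector(7 DOWNTO 0)
    );
END th_polarity;

ARCHITECTURE dataflow OF th_polarity IS
BEGIN
    -- One 2-input XOR per output bit: invert = '0' passes the data unchanged,
    -- invert = '1' complements it so that FF hex becomes 00 hex and vice versa.
    g_xor : FOR i IN 7 DOWNTO 0 GENERATE
        out_d(i) <= th_d(i) XOR invert;
    END GENERATE;
END dataflow;
```

With the control bit cleared the block behaves as in the current design, so existing register settings would not need to change.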
The threshold value enters the threshold block as a 32-bit value from the
halftone block, via the error diffusion block. This value can be either fixed or
varying. It will be a fixed 32-bit value, with all four bytes in the word having
the same value, when halftoning is not enabled. This value is determined by
a programmable value in the halftone block. It will be a varying value if
halftoning is enabled. In this case, each pixel has a threshold value applied to it
according to the dither matrix in the halftone block.
The threshold is applied to the image data according to (Equation 7.1) on
page 51.
    IMG_DOUT = FF Hex when IMG_DIN > THRESH_IN    (Equation 7.1)
               00 Hex otherwise
A strictly greater than comparison is used since it requires less logic than
a greater than or equal to compare.
The result of the comparison is one of the inputs to a mux which allows
the user to select either the original image data or the result of the comparison
operation. The mux is controlled by the TH_ENABLE signal which is
programmable via the device CPU interface. This control bit is physically located
in the error diffusion block and passed to the threshold block in order to avoid
implementing a CPU interface in the threshold block. The ability to pass the
gray data out of the device rather than the result of the threshold operation is
important for applications where additional gray level processing is to be
done.
The block diagram for the threshold block is shown in Figure 7.1 on
page 53. The last operation performed in this block is a final capturing of the
output data and sync signals, using DFFs to provide controlled, synchronous
output signals to the output pins of the device.
Figure 7.2 on page 54 shows the result when our original image is processed
by applying a fixed threshold with a value of 80 Hex (128 decimal).
This type of conversion is not normally done to pictorial images due to the
obvious loss of image data content and degradation of the appearance of the
image. Straight thresholding can, however, be a reasonable way of converting
a text or line art document from gray to binary since these types of documents
are already of high contrast. It can even be used to reduce or eliminate
low level background in the image which is usually undesirable. In general,
though, thresholding will be used as the final step in converting gray
pictorials to binary images which can be printed on today's binary printing devices.
Figure 7.1 - Threshold Block Diagram
Figure 7.2 - Image Processed using Thresholding Only
The VHDL code for the threshold block is very straightforward. First
the comparison is done, as shown in the code fragment below, using the
dataflow style of code.
    cmp_dd <= "11111111"
        ...
Notice that, in addition to the threshold operation, zero data is guaranteed
when the VSYNC and HSYNC signals are not active. The second operation
is the mux implementation, again using dataflow VHDL.
    out_d(7 downto 0) <=
        "00000000"
        ...
        cmp_dd AFTER DELAY1 WHEN th_enable = '1' ELSE
        ed_img_d(7 downto 0) AFTER DELAY1;
The last operation, the capturing of the data and sync signals, is done
using behavioral VHDL implemented in a PROCESS statement as shown in
the code fragment below.
        ...
        ELSIF clk'EVENT AND clk = '1' THEN -- DFF
            out_img_d <= out_d AFTER DELAY1;
            out_vsync <= ed_vsync AFTER DELAY1;
        ...
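Taken together, the three operations of the threshold block (comparison, mux, and output capture) can be sketched for a single 8-bit pixel lane. This is an illustration under stated assumptions rather than the thesis source: the entity name, the thresh_in, ed_hsync and out_hsync port names, the reset values, the 1 ns value of DELAY1, and the use of ieee.numeric_std for the unsigned comparison (in place of the std_logic_1164_extensions package used in the design) are all hypothetical.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;  -- assumption: stands in for std_logic_1164_extensions

ENTITY th_lane_sketch IS
    PORT (
        clk, resetz : IN  std_logic;
        th_enable   : IN  std_logic;                     -- '1' selects thresholded data
        ed_vsync    : IN  std_logic;
        ed_hsync    : IN  std_logic;
        ed_img_d    : IN  std_logic_vector(7 DOWNTO 0);  -- gray pixel
        thresh_in   : IN  std_logic_vector(7 DOWNTO 0);  -- threshold for this pixel
        out_img_d   : OUT std_logic_vector(7 DOWNTO 0);
        out_vsync   : OUT std_logic;
        out_hsync   : OUT std_logic
    );
END th_lane_sketch;

ARCHITECTURE sketch OF th_lane_sketch IS
    CONSTANT DELAY1 : time := 1 ns;  -- simulation-only delay; value is an assumption
    SIGNAL cmp_dd, out_d : std_logic_vector(7 DOWNTO 0);
BEGIN
    -- Equation 7.1, forced to zero while either sync signal is inactive.
    cmp_dd <= "11111111" WHEN ed_vsync = '1' AND ed_hsync = '1'
                              AND unsigned(ed_img_d) > unsigned(thresh_in)
              ELSE "00000000";

    -- Mux: thresholded result or the unmodified gray data.
    out_d <= cmp_dd AFTER DELAY1 WHEN th_enable = '1' ELSE
             ed_img_d AFTER DELAY1;

    -- Output registers with asynchronous, active-low reset.
    PROCESS (resetz, clk)
    BEGIN
        IF resetz = '0' THEN
            out_img_d <= (OTHERS => '0');
            out_vsync <= '0';
            out_hsync <= '0';
        ELSIF clk'EVENT AND clk = '1' THEN  -- DFF
            out_img_d <= out_d AFTER DELAY1;
            out_vsync <= ed_vsync AFTER DELAY1;
            out_hsync <= ed_hsync AFTER DELAY1;
        END IF;
    END PROCESS;
END sketch;
```

The actual block operates on a 32-bit word, so four such lanes would share the sync signals and the enable bit.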
Chapter 8 - VHDL Issues
VHDL is more than just a hardware description language. It has enough
power and flexibility built into it to allow it to be used as a simulation
stimulus generation language and also as a simulation component modeling
language. The versatility of the language can sometimes make it seem daunting.
The fact that there are multiple methods available for accomplishing the same
task places the burden of deciding which is the best approach for a given task
on the user. One of the earliest decisions to be made is which predefined
packages to include and which data types to use. The desire for the EMKIPC
to be implementable as hardware led to the decision to represent it, to as great
a degree as possible, using the std_logic type for signals and variables. This
is because the std_logic type closely matches the representation of logic
hardware used by most simulators today. As a result of this decision, most of the
entities described in the design of the EMKIPC use the following predefined
packages from the IEEE library.
std_logic_1164.ALL
std_logic_1164_extensions.ALL
The std_logic_1164_extensions package is used in entities which perform
arithmetic operations on std_logic signals or variables. When this package
is included and these functions are used, the entity must be compiled
using the -explicit switch due to the overloading of the arithmetic operations.
Additionally, the test benches which use file I/O also include the following
package from the STD library.
    textio.ALL
VHDL'87 was used throughout the design and simulation.
One issue faced in deciding how to write the VHDL for a device of this
type centers around how to represent the internal data paths. The approach
taken in the EMKIPC is to keep them as close to actual hardware as possible
by representing them throughout as std_logic_vectors. This adds some
complexity to the description of the device in several areas. The main areas of
difficulty in using std_logic_vectors are the convolution filter and error
diffusion blocks. The arithmetic functions required in these blocks are more
easily implemented using integer functions. An alternative approach to the
design would be to convert all busses to constrained integer types as soon as
they enter the device and operate on them as integers. This would greatly
simplify many of the internal operations and would shift the burden of
representing the data paths in hardware to the synthesis tools. The trade-off here is
simulation speed versus synthesis speed.
Another question faced is whether or not to include delays in the device
description using the AFTER keyword. Small delays are included in the
description of the operation of the EMKIPC. This is done to make
verification in simulation easier. Cause and effect can sometimes be difficult to
determine when all transitions appear to happen at the same time in the
simulator. The addition of a small delay after capturing a result in a DFF, for
example, can make it much easier to see that a particular clock edge caused a
particular event. The AFTER clauses are ignored by most synthesis tools,
including the AutoLogic II tools used to synthesize the halftone block. This
has no negative effect on the operation of the synthesized circuit since the
AFTER clauses were added for simulation clarity only.
The methods for accomplishing various hardware functions using
VHDL evolved as the project progressed and more was learned about what
worked well and what didn't. The first stage of the design task was to
partition the design into blocks of manageable size. This task fit naturally with
the use of structural VHDL where the various design blocks are defined as
entities and are then interconnected as individual components. The use of
this style carried over into the development of the first lower level block,
histogram modification. Here, low level building blocks such as one, eight,
nine, and 32-bit D flip-flops were defined behaviorally and interconnected
using structural VHDL. These low level blocks are not associated with any
one block of the design and are, therefore, kept in a general file. This file
(File: Comps.vhd on page 111) can be found in the Device VHDL Code appendix.
As the design progressed and grew, more basic building blocks were
needed and the Comps file began to get quite large and cumbersome to
maintain. Also, the resulting structural VHDL code is not very readable and is
somewhat difficult to change when necessary. As a result, subsequent blocks
use more dataflow and behavioral code which is more easily read, modified
and maintained. The error diffusion block was one of the last to be
completed and it uses very few low level building blocks. The ED_PROCESS
block is a good example of this. The arithmetic operations required by this
block were easily implemented in functions which could then be called many
times. All of the DFF functions for this block are done in one PROCESS
statement which makes the overall structure of the block easy to see from the
code.
Functions, data types, and constants which were used in many of the
design blocks were included in a package called types_emk which can be
found in File: Packages.vhd on page 108. Two of the commonly used
functions are for conversion from std_logic_vector to integer or from integer to
std_logic_vector. These were necessary when access to an element of an
array was needed and was indexed by the value of a std_logic_vector. The
fact that array indices must be of a discrete type such as integer means that a
conversion is necessary in order to implement functions such as look up tables
where the address to the table is formed by the inputs to the device which are
generally of type std_logic. These functions were also useful in the test bench
code where the image data values were read from or written to files. The file
I/O functions included in the std.textio package are geared toward working
with integers so the std_logic_vector format used in the device had to be
converted to/from integer format for file access.
Another function in this package is called valid_vec. This function takes
a std_logic_vector as an argument and returns a boolean true if every bit of
the vector is either a '1' or a '0'. This function is used in multiple places
throughout the device for the purpose of verifying that certain operations are
only performed on valid data. This allowed these functions to be simplified
since it could be guaranteed that they would only have to work on binary
values. This function can also be used to flag invalid data prior to capturing it to
a file.
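The conversion and validity functions described here can be sketched as follows. This is not the types_emk package itself: the package name conv_sketch, the function names slv_to_int and int_to_slv, and the parameter names are assumptions, and the leftmost vector element is assumed to be the most significant bit. The conversions are limited to values that fit in a VHDL natural, so a 32-bit vector with its top bit set is outside their range.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical stand-in for part of types_emk; names are assumptions.
PACKAGE conv_sketch IS
    FUNCTION valid_vec  (v : std_logic_vector) RETURN boolean;
    FUNCTION slv_to_int (v : std_logic_vector) RETURN natural;
    FUNCTION int_to_slv (n : natural; width : positive) RETURN std_logic_vector;
END conv_sketch;

PACKAGE BODY conv_sketch IS
    -- True only if every element is a strict '0' or '1'.
    FUNCTION valid_vec (v : std_logic_vector) RETURN boolean IS
    BEGIN
        FOR i IN v'RANGE LOOP
            IF v(i) /= '0' AND v(i) /= '1' THEN
                RETURN false;
            END IF;
        END LOOP;
        RETURN true;
    END valid_vec;

    -- Unsigned conversion, leftmost element first. Callers are expected to
    -- check valid_vec beforehand; any element other than '1' counts as 0.
    FUNCTION slv_to_int (v : std_logic_vector) RETURN natural IS
        VARIABLE result : natural := 0;
    BEGIN
        FOR i IN v'RANGE LOOP
            result := result * 2;
            IF v(i) = '1' THEN
                result := result + 1;
            END IF;
        END LOOP;
        RETURN result;
    END slv_to_int;

    -- Returns a (width-1 DOWNTO 0) vector holding the low bits of n.
    FUNCTION int_to_slv (n : natural; width : positive) RETURN std_logic_vector IS
        VARIABLE result : std_logic_vector(width - 1 DOWNTO 0);
        VARIABLE rest   : natural := n;
    BEGIN
        FOR i IN 0 TO width - 1 LOOP
            IF rest MOD 2 = 1 THEN
                result(i) := '1';
            ELSE
                result(i) := '0';
            END IF;
            rest := rest / 2;
        END LOOP;
        RETURN result;
    END int_to_slv;
END conv_sketch;
```

A look up table indexed by an 8-bit input would then be read with an expression of the form lut(slv_to_int(addr)), guarded by valid_vec(addr) where unknown inputs are possible.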
Procedures are included for performing read and write cycles over the
CPU interface. These procedures are used extensively in the test bench code
for configuring the EMKIPC device prior to simulation. Their inclusion in
the package allows multiple stimulus programs to be used without having to
re-implement these common operations in every file. Constants are defined
in the package for the depth of the FIFOs and for commonly used bus values
such as 32-bit high impedance and 32-bit unknown. The FIFO depth
constant allows easy configuration of the maximum size of the image lines to be
processed. The bus value constants are used to make the code simpler and
more readable.
Chapter 9 - Simulation and Test Bench Generation
The most time consuming part of a device design of this nature is
verifying that the VHDL code for the device operates as intended. This is done by
executing the device code in a simulator. The simulation of the device is only
meaningful if the inputs to the device are manipulated in a way that causes
the device to operate in its intended manner. This is easily done using VHDL
to create a test bench for the entity under test.
Compilation and simulation of the VHDL code for the EMKIPC and its
associated test benches was done using the QuickVHDL tools from Mentor
Graphics Corporation. The simulator actually executes the code of the entity
that the user loads and allows viewing of waveform or list output of the
entity's signals. In this way, the designer can verify that the expected signal
transitions actually occur.
The general approach to simulation and test bench generation is to
instantiate the entity to be tested as a component, connect signals to its ports,
stimulate the input signals using one or more processes and concurrent signal
assignments, and monitor the outputs. The outputs, internal signals, and
internal variables can be monitored in one of three ways. First, they can be
viewed as waveforms or lists in the simulator. This allows detailed probing
of signals and is most useful for tracking down the cause of problems which
have been observed at a higher level. Second, ASSERT statements can be
embedded in the test bench to check that expected signal levels occur at the
appropriate time. This method is useful for verifying signal timing
relationships as well as detecting incorrect results such as unknown data when image
output is expected. If incorrect results occur, the ASSERT statements can be
used to output messages to the simulator or to stop the simulation. Third, the
key output results can be captured to a file and verified using other means.
This is particularly useful in the processing of image data where the results
can either be viewed using some form of display or print media, or they can
be compared to results generated by an independent means such as a software
algorithm.
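The general test bench structure described above can be sketched with a deliberately small unit under test. Everything here is hypothetical and chosen for illustration: the dff1 entity, the tb_sketch test bench, the signal names, and the 10 ns and 25 ns timing values are assumptions, not taken from the EMKIPC test benches.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Hypothetical unit under test: one D flip-flop with active-low reset.
ENTITY dff1 IS
    PORT (resetz, clk, d : IN std_logic; q : OUT std_logic);
END dff1;

ARCHITECTURE behave OF dff1 IS
BEGIN
    PROCESS (resetz, clk)
    BEGIN
        IF resetz = '0' THEN
            q <= '0';
        ELSIF clk'EVENT AND clk = '1' THEN
            q <= d AFTER 1 ns;
        END IF;
    END PROCESS;
END behave;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY tb_sketch IS
END tb_sketch;

ARCHITECTURE test OF tb_sketch IS
    COMPONENT dff1
        PORT (resetz, clk, d : IN std_logic; q : OUT std_logic);
    END COMPONENT;

    SIGNAL resetz : std_logic := '0';
    SIGNAL clk    : std_logic := '0';
    SIGNAL d, q   : std_logic := '0';
BEGIN
    -- Instantiate the entity under test and connect signals to its ports.
    uut : dff1 PORT MAP (resetz => resetz, clk => clk, d => d, q => q);

    -- Concurrent signal assignments: reset pulse and free-running clock.
    resetz <= '0', '1' AFTER 25 ns;
    clk    <= NOT clk AFTER 10 ns;

    -- Stimulus process with an embedded check of the expected result.
    stimulus : PROCESS
    BEGIN
        WAIT UNTIL resetz = '1';
        d <= '1';
        WAIT UNTIL clk'EVENT AND clk = '1';
        WAIT FOR 5 ns;
        ASSERT q = '1'
            REPORT "q did not follow d after the rising clock edge"
            SEVERITY error;
        WAIT;  -- stimulus complete; wait forever
    END PROCESS;
END test;
```

Since the clock never stops, the simulation has to be run for a fixed time, for example 100 ns, rather than until it ends on its own.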
The EMKIPC was tested in stages as each block of the pipeline was
completed. Once all of the blocks had been individually verified, the entire
device was verified. The methods used for test bench generation evolved as
the design progressed, similar to the way the methods for designing device
functions evolved. Test bench generation for the individual blocks was made
simpler by the common image data interface used at the device input, block
interconnect, and device output. Stimulus written for one block could be
copied and reused on a subsequent block with relatively minor modifications.
The initial verification of the logic blocks was somewhat simplistic, and
not really a complete verification. A test bench was written which loaded the
block's CPU registers and then exercised the image path by controlling the
sync signals and feeding the block a repeating data pattern to mimic image
data. This required close scrutiny of the resulting waveforms to verify that
the results were as expected. The advantage to this methodology at this level
is fast execution under simulation. This is a real benefit in the early stages of
design when the potential exists for many iterations of the simulate/fix cycle.
Later test benches utilize file I/O to process complete images. This approach
is more appropriate to mature designs which have had all the low level
functional problems fixed and which require a higher level verification.
One of the major improvements made as test bench development
progressed was the adoption of file I/O for loading processing parameters and
image data into the device. This offered two major advantages. First, it
meant that the code no longer had to be modified and recompiled in order to
change the operation of the device. Files containing the values to be loaded
into the device configuration registers and tables could be easily modified
and the simulation rerun without any other changes. Second, it provided an
additional means of device operation verification. With the use of file I/O,
the device can be fed actual image data and the processed result can be
captured, viewed, and compared to the input or to expected results. The images
included throughout this thesis were produced in this manner.
A good example of this methodology is seen in the File: test_img.vhd on
page 166 in Appendix C - Test Bench VHDL Code. The entity declaration of
a test bench is typically trivial and serves primarily to give it an entity name.
The Architecture declaration includes signal definitions for the UUT (unit
under test) inputs and outputs, the component declaration for the UUT, and in
this case the file declarations needed for file I/O. The Architecture body
consists of a component instantiation for the UUT and then a set of concurrent
statements. These statements can take the form of signal assignments or
PROCESS statements. In this example, signal assignments are used for
generating a reset pulse at the beginning of the simulation and a clock which
cycles indefinitely. Process P1 is used to load all of the device registers and
tables which are accessible via the CPU interface of the device. WAIT
statements are used to control the timing of events as the process is executed.
When P1 reaches the end, it waits forever. Process P2 is used to control the
input image data interface signals. It waits until P1 has completed loading
the device parameters and then reads image data in from file and drives it,
along with the VSYNC and HSYNC signals, into the EMKIPC. The duration
of the VSYNC and HSYNC signals is determined from the file size as read
from a header at the beginning of the image file. Process P3 is used to
monitor the output VSYNC and HSYNC signals and capture the output image data
on the rising edge of CLK when VSYNC and HSYNC are high. P3 then
writes this captured data to an output file. This example test bench is the
one that was used for final verification of the device and output image
generation.
The only drawback to the high level approach taken for final verification
of the complete device is long simulation times. The processing of a 512
pixel by 512 pixel image, such as the one included in this report, with one or
more of the complex blocks enabled (such as convolution or error diffusion)
requires about three hours on the workstations used for this project. In this
case the workstation used was a Hewlett Packard HP 715/100 RISC based
machine with 96 MBytes of RAM running the HP-UX version 10.01
operating system.
Chapter 10 - Synthesis
The goal of synthesis is to translate a behavioral description of a system
into an interconnection description of basic logic elements (a netlist) for the
purpose of building hardware. VHDL as a language is not designed strictly
for synthesis, but synthesis tools do exist for at least a subset of the constructs
available in VHDL. Synthesis is the capability which makes VHDL a useful
language for hardware description. Without the emergence of synthesis
tools, VHDL would probably not have gained the wide acceptance that it has
today as a standard. It would still be a powerful tool for simulation stimulus
generation, component simulation model generation, and hierarchical system
description, but it would not be a particularly useful tool for design entry.
Synthesis empowers VHDL to be a common tool for all stages of logic
design entry and verification.
The synthesis tool used for this project is called AutoLogic II from Mentor
Graphics Corporation. The steps required to take a behavioral description
of a logic device through synthesis to hardware are listed below.
1. Design and simulate synthesizable VHDL code to verify that
   logical operation meets design goals.
2. Synthesize the VHDL code to a netlist interconnection of
   generic logic function modules.
3. Optimize the netlist by targeting a specific implementation
   technology and imposing performance and area constraints.
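All three of these steps were performed on the Halftone block of the
EMKIPC. One reason for selecting this block is that it contains the
following types of logic:
    - Wide (32-bit), high speed D flip-flops.
    - Data path muxes.
    - High speed state machine control.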
The moderate size of the Halftone block in comparison to the Convolution
or Error Diffusion blocks was another reason for its selection. The goal
was to gain experience with the issues around synthesis in general and a
moderately sized block allowed this without getting bogged down in long compile
times.
Synthesis, as stated earlier, imposes some limitations on the constructs
which can be utilized out of the full scope of VHDL constructs. The main
classes of limitations are as follows [9]:
    - Delay expressions (such as the AFTER clause) are ignored.
    - Restrictions are placed on the writing of PROCESS statements.
    - A limited number of types are allowed.
    - Descriptions are oriented toward synchronous styles.
The main reason for these limitations is that synthesis tools need to be
able to translate VHDL code to actual hardware elements. Certain types
which are allowed by VHDL (such as the type FILE) have no hardware
counterpart and are not allowed. The types that generally are allowed are
enumerated types, integer types and subtypes, and one-dimensional arrays. These
allowed types still permit a very wide range of implementation options for
most desired hardware functions.
The restrictions placed on the writing of PROCESS statements by the
AutoLogic II tools are best illustrated by the following example, which is a D
flip flop with enable.
    PROCESS (resetz, clk)
    BEGIN
        IF (resetz = '0') THEN
            q <= '0';
        ...
The code fragment above would not compile for synthesis. It had to be
rewritten as follows:
    PROCESS (resetz, clk)
    BEGIN
        IF (resetz = '0') THEN
            q <= '0';
        ...
The reason for this restriction is to avoid ambiguity around synchronous
elements. The compiler requires that a conditional statement which implies a
clocked element contain only the clock signal at that level. This is a fairly
minor restriction and the only one which was encountered during the
synthesis process for the halftone block.
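The restriction can be illustrated with a pair of architectures for a D flip-flop with enable. This is a reconstruction from the rule stated above (only the clock may appear in the condition that implies the clocked element) and is not the original code fragments: the entity name, the en and d port names, and the architecture names are assumptions. The two architectures simulate identically; only the second follows the stated rule.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY dff_en_sketch IS
    PORT (resetz, clk, en, d : IN std_logic; q : OUT std_logic);
END dff_en_sketch;

-- Style that mixes the enable into the clock edge condition.
ARCHITECTURE mixed OF dff_en_sketch IS
BEGIN
    PROCESS (resetz, clk)
    BEGIN
        IF (resetz = '0') THEN
            q <= '0';
        ELSIF (clk'EVENT AND clk = '1' AND en = '1') THEN
            q <= d;
        END IF;
    END PROCESS;
END mixed;

-- Style with only the clock in the edge condition; the enable is nested.
ARCHITECTURE clock_only OF dff_en_sketch IS
BEGIN
    PROCESS (resetz, clk)
    BEGIN
        IF (resetz = '0') THEN
            q <= '0';
        ELSIF (clk'EVENT AND clk = '1') THEN
            IF (en = '1') THEN
                q <= d;
            END IF;
        END IF;
    END PROCESS;
END clock_only;
```

Nesting the enable inside the clocked branch leaves no doubt about which signal is the clock and which is a data qualifier.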
Other issues that came up during synthesis centered more around the
tools than around VHDL or the actual design. For example, after optimizing
the design, prior to leaving the AutoLogic II tools, the results of the process
need to be saved to file. In this case, schematic drawings of the synthesis
results were requested with an upper limit of 200 components per page set.
The large number of files generated for each schematic page caused the user
file limit on the system to be reached. The user file limit was increased from
2000 to 5000 and the process repeated. Once again, the file limit was
reached. In the end the design netlist was saved without schematic pages.
This limitation is not a major problem since the utility of synthesized
schematics is minimal. It actually points out the power of VHDL as a circuit
description tool in terms of the compactness of the VHDL description in
comparison to a schematic representation of the same circuit.
The area results for synthesis of the halftone block are shown in Table
10.1 on page 68. The cost numbers in the table can be used as an indicator of
relative chip area for the listed cell types. Actual area is dependent on the
target technology.
Table 10.1 - Global Cell Usage Statistics, Halftone Block
Cell Name        Instance Count    Cost/Cell    Subtotal
cmosn:anotb                  62         6.30      390.60
cmosn:aorbn                  15         6.30       94.50
cmosn:dec3                    2        38.70       77.40
cmosn:dffr                  607        24.50    14871.50
cmosn:dsel                  314        18.50     5809.00
cmosn:inv1                  188         2.20      413.60
cmosn:inv2                   69         2.70      186.30
cmosn:inv4                  767         3.20     2454.40
cmosn:inv16                 697         5.20     3624.40
cmosn:nd2x1                 472         4.30     2029.60
cmosn:nd3                     4        12.40       49.60
cmosn:nd4                    18        12.50      225.00
cmosn:ndi2x1                160         6.40     1024.00
cmosn:nr2                    16         8.30      132.80
cmosn:nr3                   107         9.40     1005.80
cmosn:trib1                  19         8.30      157.70
cmosn:trib2                 501        16.30     8166.30
cmosn:xnor                    2        10.30       20.60
cmosn:xor                     1        10.30       10.30
Total:                     4021                 40743.40
The total for the halftone block translates to approximately 8150 two-input
NAND gates. The AutoLogic II synthesis settings used are as follows:
Destination Technology: CMOSN/1.2 - Worst
Technology Environment: 25 Degrees C, 5 Volts
Input Capacitance Limit: 10 pF
Optimization Type: area
Area Optimization: low
With these results from the halftone block, rough estimates can be made











It should be noted that the size of the Convolution and Error Diffusion
blocks will be highly dependent on the depth of the FIFOs used and on the
efficiency of the synthesis tools in optimizing these regular structures. A
better option for implementing the FIFOs would be to use an existing FIFO
macrocell and avoid synthesis of these already optimized structures. Another
important factor for these blocks is the bit accuracy carried after the
arithmetic operations, as discussed in Chapter 4.
Chapter 11 - Conclusions
The design of complex electronic systems has grown to the point where
the use of traditional schematic capture methods for their description has
become cumbersome and inefficient. The desire for a higher level of
abstraction in system design has led to the development and acceptance of VHDL as
a standard hardware description language. This thesis has attempted to
highlight the benefits of using VHDL for the description of an image processing
system with substantial capabilities. In addition, the advantages of using
VHDL for the creation of test benches for design verification are
demonstrated.
The high level of abstraction provided by VHDL, combined with the
availability of synthesis tools, allows compact description of complex
systems and flexibility in their implementation. Synthesis allows the selection
of the target hardware technology to be delayed until design entry and
verification under simulation are nearly complete. This is also a benefit for the
long term support of a design since the source code can be retargeted to new
technologies for implementation as old technologies become expensive or
obsolete. The use of VHDL for test bench generation reduces the number of
tools required in the design process and tightly couples the design entry and
test tasks.
The most effective use of VHDL can be made by writing code at as high
a level as possible and leaving the implementation details to the synthesis
tools. An example of where this could be applied in the EMKIPC is the
convolution block. In this block, several additions, multiplications, and a
division are performed on std_logic_vector types. This requires low level
handling of signed versus unsigned values and the implementation of some
specific arithmetic functions to handle these cases. The design could be done
and maintained more efficiently if the busses were handled as constrained
subtypes of the type integer and if the predefined integer arithmetic functions
were used. In fact, a good argument can be made for converting the image
data bus to a constrained integer subtype as soon as it enters the device, and
maintaining this type through the entire device. The same argument can be
made for the CPU address and data busses. This approach would allow a
more general implementation of the individual blocks and would make the
transition to a different signal type at the interfaces, if desired, much simpler.
The exception to the above stated guideline for writing code at a high
level is when more control is desired over how the synthesis tools will
implement certain functions. An example of this is once again taken from the
convolution block. The computational complexity which results from the many
multiplication and addition operations could lead to the desire to simplify the
hardware by constraining multiplier and divisor values to powers of 2. This
allows multiplication and division to be performed using shift operations,
greatly reducing hardware requirements. This kind of constraint requires
lower level control over the arithmetic operations but can still be done
utilizing integer operations.
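A minimal sketch of this idea, using integer operations only, is given below. The package name pow2_sketch, the function names, and the 0 to 7 exponent range are assumptions made for illustration and do not come from the EMKIPC code. The division helper is restricted to non-negative operands because integer division in VHDL truncates toward zero, which differs from an arithmetic right shift for negative values.

```vhdl
-- Hypothetical helpers; names and the exponent range are assumptions.
PACKAGE pow2_sketch IS
    SUBTYPE exp_t IS natural RANGE 0 TO 7;
    -- Multiply by 2**n: corresponds to a left shift by n bit positions.
    FUNCTION mul_pow2 (x : integer; n : exp_t) RETURN integer;
    -- Divide by 2**n: corresponds to a right shift by n bit positions
    -- for non-negative x.
    FUNCTION div_pow2 (x : natural; n : exp_t) RETURN natural;
END pow2_sketch;

PACKAGE BODY pow2_sketch IS
    FUNCTION mul_pow2 (x : integer; n : exp_t) RETURN integer IS
    BEGIN
        RETURN x * (2 ** n);
    END mul_pow2;

    FUNCTION div_pow2 (x : natural; n : exp_t) RETURN natural IS
    BEGIN
        RETURN x / (2 ** n);
    END div_pow2;
END pow2_sketch;
```

Storing the exponent n in a CPU register instead of an arbitrary coefficient is what constrains the multiplier and divisor to powers of 2; whether a given synthesis tool reduces these expressions to shifters would need to be confirmed for that tool.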
The relationship between image processing and electronics hardware
development will remain a close one and development tools such as VHDL
continue to strengthen and improve it. The ability to bridge the gap between
software image processing and hardware implementation of the algorithms is
enhanced by the high level language capabilities of VHDL. Verification of a
hardware design through simulation which enables comparison with a
software algorithm is critical to the development of image processing hardware
of increasing capability and complexity.
The pipeline architecture and the use of common CPU and image data
interfaces to all blocks make the design easy to partition into smaller pieces,
if desired. One logical place to split the design is between the convolution
and halftone blocks. This would split the device into two devices of roughly
equivalent size. It would also result in the first device being a processor of
gray data and the second being a converter from gray to binary. This would
be convenient if, for example, multiple convolution operations were desired
before converting to binary. In this case, multiple gray processors would be
connected in a pipeline followed by the gray to binary converter.
The EMKIPC device meets the needs of a variety of high speed image
processing applications. It is a highly integrated and highly configurable
solution which will process image data at the speeds required by today's high
end digital copiers and printers. The image processing algorithms integrated
into the device are useful for many types of imaging systems. The use of a
CPU interface for configuring the device allows it to be tailored to a
specific application without having to modify the device hardware.
Bibliography
[1] Ronald G. Matteson, Professor's notes for EECC-683, Survey of
Electronic Document Processing, 1994.
[2] Ronald G. Matteson, Introduction to Document Image Processing,
Artech House, 1995.
[3] Edward R. Dougherty, Digital Image Processing Methods, M.
Dekker, 1994.
[4] Zhigang Fan and Reiner Eschbach, Limit Cycle Behavior of
Error Diffusion, 1994 1st IEEE International Conference on
Image Processing, Vol. 2, pp. 1041-1045.
[5] Anil K. Jain, Fundamentals of Digital Image Processing,
Prentice Hall, 1989.
[6] Raymond J. Offen, VLSI Image Processing, McGraw-Hill, 1985.
[7] Douglas L. Perry, VHDL - Second Edition, McGraw-Hill, Inc.,
1994.
[8] K. Hsu, L.J. D'Luna, H. Yeh, W.A. Cook, G.W. Brown, A Pipelined
ASIC for Color Matrixing and Convolution, Proceedings:
The Third Annual IEEE ASIC Seminar and Exhibit, Rochester,
New York, 1990, pp. 7-6.1 to 7-6.6.
[9] Roland Airiau, Circuit Synthesis with VHDL, Kluwer Academic
Publishers, 1994.
[10] James R. Armstrong, Chip-Level Modeling with VHDL,
Prentice Hall, 1989.
[11] David R. Coelho, The VHDL Handbook, Kluwer Academic
Publishers, 1989.
[12] Randolf E. Harr, Applications of VHDL to Circuit Design,
Kluwer Academic, 1991.
[13] Donald E. Knuth, Digital Halftones by Dot Diffusion, ACM
Transactions on Graphics, Vol. 6, No. 4, Oct. 1987, pp. 245-273.
Design Specification - Image Processing Chip E. Michael Kelly
Appendix A Device User's Specification
A.1 Introduction
A.1.1 Document Scope
The purpose of this document is to provide the reader with a complete
specification of the Image Processing chip developed by Mike Kelly as part
of a Master's degree thesis in the Computer Engineering department of
Rochester Institute of Technology. For the remainder of this document, the
device will be referred to as the EMKIPC (E. Michael Kelly - Image Processing
Chip). This document will operate on two levels. First, it will provide an
overall understanding of the architecture and operation of the device through
text description and block diagrams. Second, it will provide low level
programming, timing, and pin connection information which a software or
hardware engineer attempting to use the chip would require.
The block diagrams in this document were created using the Intergraph
Aceplus schematic editor. They are accurate schematics but were not used in
implementing the design. The entire chip was designed and simulated using
VHDL for both the logic specification of the device and test bench
generation.
A.1.2 General Description
The EMKIPC is an image processing chip designed using VHDL and
implemented using synthesis. Fundamental image processing techniques
such as filtering and image histogram modification are implemented. The
device operates on eight-bit monochrome images. Multiple devices can be
used to independently process multiple channels for color applications. The
chip architecture utilizes a pipeline approach for the various operations in
order to maximize throughput. The EMKIPC is not targeted to any single
imaging application but is flexible enough to be used in many types of
systems. These can include electronic copiers and printers as well as real-time
video applications.
The device performs the following functions:
- Histogram Modification (lighten, darken, or change contrast) using a
  programmable Look Up Table (LUT) for linear or non-linear pixel value
  translations.
- Convolution Filtering with a programmable 3x3 filter matrix.
- Conversion from eight-bit to binary using thresholding, error diffusion,
  or halftoning, as desired.
The chip will have the following interfaces:
- An 8-bit CPU data interface for loading processing parameters.
- A 32-bit synchronous input image data interface.
- A 32-bit synchronous output image data interface.
All of the device functions are capable of being enabled/disabled in any
combination. Note that when thresholding is disabled the output is 8-bit con
tinuous tone image data. When it is enabled, output pixel values are either 00
hex or FF hex, which constitutes binary data.
A.2 Device Architecture
The device operates on eight-bit data, 32 bits (four pixels) at a time, in a
pipeline fashion as shown in Figure A.1 on page 78. The device operates
synchronously from a single clock. The signal IN_VSYNC (Vertical Sync)
is used to frame valid input image data on a page basis. The signal
IN_HSYNC (Horizontal Sync) is used to frame valid input image data on a
line basis as shown in Figure A.10 on page 105. IN_VSYNC, IN_HSYNC
and IN_IMG_D(31:0) are all sampled on the rising edge of CLK.
OUT_VSYNC, OUT_HSYNC, and OUT_IMG_D(31:0) are synchronized to
the rising edge of CLK using DFFs. This interface scheme for the image data
is used internally between blocks as well as at the device input and output
image data interfaces. This allows flexibility in partitioning the design for
ease of implementation. The EMKIPC device could just as easily be
implemented as multiple devices as a single device if, for example, access were
needed to an internal data path.
The CPU interface allows easy access to processing parameters and
device control. Each of the major blocks in the design, except for the
thresholding block, has one or more registers which can be configured. If no
registers in the device are programmed, the device defaults after reset to simply
pass the input image data and control signals through after a two clock delay.
The CPU interface consists of an eight-bit bi-directional data bus, a nine-bit
address bus, a chip select, a write signal, and a read signal. The nine address
bits allow access to 512 addresses (0 to 511). The table below describes the
location of each of the registers in the device address map.
Table A.1 - EMKIPC CPU Address Map
Address (Hex)   Function
000             Histogram Modification Control register
001 - 003       Unused
004             Convolution Filter Control register
005 - 007       Unused
008             Halftone Control register
009             Threshold value register
00A - 00B       Unused
00C             Error Diffusion Control register
00D - 00F       Unused
010 - 018       Convolution Filter Kernel elements
019 - 01D       Unused
01E - 01F       Convolution Filter Kernel Total
020 - 023       Error Diffusion Coefficients
024 - 03F       Unused
040 - 07F       Halftone Cell Values
080 - 0FF       Unused
100 - 1FF       Histogram Modification LUT
More detail on the function of the registers can be found in the following
sections which describe the individual blocks.
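The address map in Table A.1 can be captured in a small software lookup, which may be handy when writing a driver or a test bench stimulus generator. The following Python sketch is illustrative only: the names `ADDRESS_MAP` and `decode_address` are hypothetical and do not appear in the EMKIPC design; only the address ranges are taken from the table.

```python
# Address ranges transcribed from Table A.1 (inclusive bounds, hex).
# ADDRESS_MAP and decode_address are illustrative names, not part of the design.
ADDRESS_MAP = [
    (0x000, 0x000, "Histogram Modification Control register"),
    (0x004, 0x004, "Convolution Filter Control register"),
    (0x008, 0x008, "Halftone Control register"),
    (0x009, 0x009, "Threshold value register"),
    (0x00C, 0x00C, "Error Diffusion Control register"),
    (0x010, 0x018, "Convolution Filter Kernel elements"),
    (0x01E, 0x01F, "Convolution Filter Kernel Total"),
    (0x020, 0x023, "Error Diffusion Coefficients"),
    (0x040, 0x07F, "Halftone Cell Values"),
    (0x100, 0x1FF, "Histogram Modification LUT"),
]


def decode_address(addr):
    """Return the Table A.1 function name for a nine-bit CPU address."""
    if not 0 <= addr <= 0x1FF:
        raise ValueError("address must fit in nine bits (0 to 511)")
    for first, last, name in ADDRESS_MAP:
        if first <= addr <= last:
            return name
    return "Unused"
```

For example, `decode_address(0x01E)` names the kernel total register, while any address that falls in a gap of the table is reported as unused.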
The order of the blocks in the pipeline is important. The first two blocks
(histogram modification and convolution filtering) are grayscale operations
which need to happen prior to converting the data to binary.
The last three blocks in the pipe (halftone, error diffusion and threshold)
are used for converting 8-bit data to binary data. The halftone block needs to
be the first of the three since it generates threshold values which the two
subsequent blocks require. The threshold block needs to be last since it is the
last step in converting gray data to binary. This ordering of the blocks allows
halftoning and error diffusion to be used simultaneously.
Figure A.1 - EMKIPC Top Level Block Diagram
A.3 Histogram Modification
The purpose of the histogram modification block is to allow the 8-bit
image data coming in to be modified for the purpose of changing the overall
brightness and/or contrast of the image. This is typically done by applying
some equation (linear or non-linear) to the 8-bit pixel value. In the EMKIPC,
this function is accomplished through the use of a 256 entry Look Up Table
(LUT). This methodology allows complete flexibility with respect to the
function to be applied to the data. Four separate LUTs are used in parallel for
the four pixel data paths through the device in order to maximize device
throughput. The four tables are written simultaneously when the LUT
address range of the device (100 - 1FF) is accessed. For read back, each table
can be individually selected using the lut_sel(1:0) bits in the hm_ctl register.
This block can be enabled or disabled using bit-0 of the hm_ctl register.
When it is enabled, one clock delay is added to the device data path delay.
The following two tables summarize the control register bit assignments and
functions.
Table A.2 - Histogram Modification Control Register Bits
Bit Number(s)   Description
0               hm_enable - Histogram Modification enable
(2,1)           lut_sel(1:0) - LUT read back select.

Table A.3 - Histogram Modification Read Back Select Bits
lut_sel(1:0) value   Function
11                   Read back LUT A (image data path bits 31 - 24).
10                   Read back LUT B (image data path bits 23 - 16).
01                   Read back LUT C (image data path bits 15 - 8).
00                   Read back LUT D (image data path bits 7 - 0).
The ability to read back the individual tables was added primarily for
diagnostic purposes. The ability to write all four tables at once was included
to minimize the time and CPU intervention needed to set up the chip. The
VHDL source could easily be modified to allow the tables to be written
individually using the same select bits described in the table above. The utility in
this was seen to be minimal.
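The LUT mechanism described in this section can be modelled in a few lines of software. The Python sketch below builds a 256-entry table from a linear equation and applies it to the four pixels of a 32-bit word, mirroring the four parallel tables that hold identical contents. The function names, the gain/offset parameterisation, and the rounding and clamping choices are assumptions made for illustration; the device itself only stores whatever 256 values the CPU writes.

```python
def build_linear_lut(gain, offset):
    """Build a 256-entry table for out = gain * in + offset.

    Rounding to nearest and clamping to 0..255 are assumptions of this
    sketch; any 256 eight-bit values may be loaded into the device.
    """
    lut = []
    for value in range(256):
        result = int(round(gain * value + offset))
        lut.append(max(0, min(255, result)))
    return lut


def apply_lut(word, lut):
    """Translate the four 8-bit pixels of a 32-bit word through one table.

    The device uses four LUTs with identical contents, one per pixel
    data path, so a single table is sufficient in software.
    """
    out = 0
    for shift in (24, 16, 8, 0):
        pixel = (word >> shift) & 0xFF
        out |= lut[pixel] << shift
    return out
```

Loading `list(range(256))` as the table leaves the image unchanged, which is a convenient sanity check when comparing a software model against simulation results.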
The Histogram Modification block diagram is shown in Figure A.2 on
page 82. Note that there are data and sync signal DFFs on the input to this
block which serve to synchronize the device input. These contribute one
clock delay of the two minimum to get data through the device. Input DFFs
are required so that processing internal to the device can happen over a full
clock cycle independent of the setup or hold time of the data and control
signals coming into the chip. The muxes on the output of the block are used to
pass the image data and sync signals through the block when it is not enabled.
The eight-bit two-to-one muxes (mux_2to1_8) which have the CPU
address bus and synchronized image data as inputs are used to allow CPU
access to the LUTs during device configuration. Note that the muxes are
controlled by the device chip select. This means that if CSZ goes low during
imaging when histogram modification is enabled, the output image data will
be affected. The EMKIPC device should not be accessed by the CPU while
an image is being processed. The four reg_arry blocks are the actual LUTs.
The eight-bit address input to these blocks is the lower eight CPU address
bits during setup and is the image data during imaging. The d(7:0) bus is
only used during setup.
The second set of D Flip Flops is used to capture the LUT output when
the block is enabled. Corresponding DFFs for the sync signals give them the
same delay as the image data.
The hm_cpu_dcd block is where the control register is implemented and
where the lut_sel(1:0) bits are decoded to produce a read enable for each of
the four LUTs.
A.4 Convolution Filtering
The purpose of the convolution filtering block is to allow the 8-bit image
data coming in to be filtered to improve the overall appearance of the image.
This is done by applying an area operator to the current pixel with weighted
contributions from the surrounding pixels. In the EMKIPC chip the
weighting is done with the surrounding eight pixels and the weighting coefficients
are programmable with values from -127 to 127. This operation requires that
three lines of image data be available concurrently since the lines above and
below the current pixel are required to perform the calculation. The Filtering
block contains two FIFOs which allow the storage of two lines of data so that
image data from three lines can be worked on simultaneously.
The filter element is a 3 x 3 kernel as shown in the table below. In the
table elements, the second letter indicates the line (p = previous, c = current,
and n = next), and the third letter indicates the pixel (a = previous,
b = current, c = next) (e.g. cp_a = coefficient, previous line, pixel a). When
calculating the new value for the current pixel, the original values for all the
surrounding pixels are used (i.e. the new values calculated for the previous
line are not used).

cp_a   cp_b   cp_c
cc_a   cc_b   cc_c
cn_a   cn_b   cn_c
The filter elements are programmable via the device CPU port. Valid
values are from -127 to 127. The values are eight-bit (7 down to 0) in
sign/magnitude notation with bit-7 as the sign bit. The sum of the kernel elements
is generated automatically in the device hardware and is used to normalize
the convolution result. This value can be read back (but not written) from
addresses 01E hex and 01F hex. It is a 12-bit value (11 down to 0) in two's
complement format with bit-11 as the sign bit. When programming the filter
elements, the sum of the values must not be zero to avoid divide by zero
errors. On reset, all of the kernel elements are given a value of zero except
for the center element (cc_b) which gets a value of one.
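The two number formats described above are easy to confuse, so a short software illustration may help. This Python sketch encodes a coefficient in eight-bit sign/magnitude form and computes the 12-bit two's complement kernel total as it would be split across addresses 01E hex (bits 7 to 0) and 01F hex (bits 11 to 8). The function names are hypothetical, and treating the upper four bits read from 01F hex as zero is an assumption of the sketch.

```python
def encode_coefficient(value):
    """Encode -127..127 as eight-bit sign/magnitude (bit 7 is the sign)."""
    if not -127 <= value <= 127:
        raise ValueError("coefficient must be in the range -127 to 127")
    return (0x80 | -value) if value < 0 else value


def decode_coefficient(byte):
    """Decode an eight-bit sign/magnitude byte back to a signed integer."""
    magnitude = byte & 0x7F
    return -magnitude if byte & 0x80 else magnitude


def kernel_total_registers(coefficients):
    """Return (value at 01E hex, value at 01F hex) for nine coefficients.

    The total is held as 12-bit two's complement. Bits 7:4 of the byte
    read from 01F hex are assumed to be zero in this sketch.
    """
    if len(coefficients) != 9:
        raise ValueError("a 3 x 3 kernel has nine coefficients")
    raw = sum(coefficients) & 0xFFF
    return raw & 0xFF, (raw >> 8) & 0x0F
```

As a usage example, a sharpening kernel such as `[0, -1, 0, -1, 5, -1, 0, -1, 0]` has a total of one, so the read back values would be 01 hex and 00 hex under these assumptions.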
CPU address 004 hex contains the convolution filter control register.
Bit-0 of this eight-bit register is the convolution filter enable bit. When it is
cleared, the block is disabled and image data passes straight through without
modification and with no clock or line delays. When it is set, the block is
enabled and image data is filtered. When enabled, the block introduces a
delay in the image data and sync signals of 1 line and eight clocks.
Table A.5 - Convolution Filtering Address Map
Address (Hex)   Description
004             Convolution Filter Control Register
010             cp_a - Previous line, coefficient a
011             cp_b - Previous line, coefficient b
012             cp_c - Previous line, coefficient c
013             cc_a - Current line, coefficient a
014             cc_b - Current line, coefficient b
015             cc_c - Current line, coefficient c
016             cn_a - Next line, coefficient a
017             cn_b - Next line, coefficient b
018             cn_c - Next line, coefficient c
019 - 01D       Unused
01E             Filter Kernel Total (bits 7 to 0)
01F             Filter Kernel Total (bits 11 to 8)
The Convolution Filter Block diagram is shown in Figure A.3 on page
88.
The CPU registers described above are implemented in the cf_cpuif
block shown in the block diagram. The fifo_ctl block takes the input sync
signals and the input image data and generates the required FIFO read and
write signals (ren and wen) as well as the FIFO reset signal. It also generates
the first line, last line, first word, and last word signals needed to handle the
image boundary conditions.
Boundary conditions exist for the first and last lines of the image and for
the first and last pixel of each line. When the first line of the image is input to
the block, no output data is produced and the VSYNC output of the block
remains low. During this line the image data is written into the two
convolution filter FIFOs (FIFO1 and FIFO2 which are implemented in the fifos
block). The FIFOs are each 2048 words deep by 32-bits. On subsequent
lines, image data is output. Input data is used as next line data and as FIFO1
input. FIFO1 output is used as current line data and as FIFO2 input. FIFO2
output is used as previous line data. The filter uses the first line of data as
both current and previous data during processing of the first line of the image.
When the input VSYNC to the block goes inactive, there is still one more line
to be output so that the output image is of the same size as the input image.
This means that an additional input HSYNC is required after input VSYNC
goes low. During this last line of data out, FIFO2 output is used as previous
line data, FIFO1 output is used as both current and next line data.
The first and last pixels of a line are handled in a similar fashion. During
the first pixel, previous pixel data is the same as current pixel data. During
the last pixel, next pixel data is the same as current pixel data. The muxes to
handle these conditions are in the pixel_ctl block. The pixel_ctl block is also
used to generate six pixels from each line at the same time. This is needed so
that the four pixels of each data word can be processed concurrently.
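The filt block is where the actual filter computation is performed. There
are four of these blocks to allow the processing of four pixels in parallel.
Each filt block receives the nine filter coefficients (cna, cnb, cnc, cca, ccb,
ccc, cpa, cpb, and cpc in (Equation A.1)), the kernel total, and the nine pixel
values (nl_a, nl_b, nl_c, cl_a, cl_b, cl_c, pl_a, pl_b, and pl_c in (Equation
A.1)) for the three by three area that it is processing. Each block produces an
8-bit output value for the center pixel of the three by three window that it is
processing. The filt output equation, in which the weighted sum of the nine
pixels is normalized by the kernel total, is:

\[
\mathit{out} = \frac{c_{na}\,\mathit{nl}_a + c_{nb}\,\mathit{nl}_b + c_{nc}\,\mathit{nl}_c
 + c_{ca}\,\mathit{cl}_a + c_{cb}\,\mathit{cl}_b + c_{cc}\,\mathit{cl}_c
 + c_{pa}\,\mathit{pl}_a + c_{pb}\,\mathit{pl}_b + c_{pc}\,\mathit{pl}_c}
 {\mathit{total}}
\qquad \text{(Equation A.1)}
\]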
The convolution filter block has an output called cf_fifo_err which is
used to indicate that a FIFO underflow or overflow condition has occurred.
Once this signal goes high, it remains high until the device is given a reset.
FIFO overflow can occur if IN_HSYNC is active for more than 8192 pixels
(2048 words). This limitation is due to the depth of the built in FIFOs in the
Filter block. FIFO underflow can occur if IN_HSYNC is active longer
during the last two lines of the image, when data is only being read from the
FIFOs, than it was for the previous lines.
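The filter computation and the boundary handling described in this section can be summarized in a software reference model of the kind that simulation results could be compared against. The Python sketch below works on a whole image held in memory rather than on streamed 32-bit words, so it does not model the FIFOs or the pipeline delay. The function name is hypothetical, and two details are assumptions because this specification does not state them: the quotient is truncated toward zero, and the result is clamped to the range 0 to 255.

```python
def convolve_3x3(image, kernel):
    """Behavioural model of the 3 x 3 convolution with edge replication.

    image  : list of rows, each a list of pixel values 0..255
    kernel : three rows (previous, current, next line) of three signed
             coefficients (pixel a, b, c), each in the range -127..127
    """
    total = sum(sum(row) for row in kernel)
    if total == 0:
        raise ValueError("kernel total must not be zero")
    height, width = len(image), len(image[0])
    output = []
    for y in range(height):
        # First line: previous line = current line.
        # Last line: next line = current line.
        lines = (image[max(y - 1, 0)], image[y], image[min(y + 1, height - 1)])
        out_row = []
        for x in range(width):
            # First pixel: previous pixel = current pixel.
            # Last pixel: next pixel = current pixel.
            cols = (max(x - 1, 0), x, min(x + 1, width - 1))
            acc = 0
            for k_row, line in zip(kernel, lines):
                for coeff, col in zip(k_row, cols):
                    acc += coeff * line[col]  # original pixel values only
            value = int(acc / total)  # truncation toward zero: assumption
            out_row.append(max(0, min(255, value)))  # clamping: assumption
        output.append(out_row)
    return output
```

With the reset kernel (all zeros except a center value of one) the model returns the input image unchanged, which matches the pass-through behaviour expected from the default register contents.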
Figure A.3 - Convolution Filter Block Diagram
A.5 Image Halftoning
The purpose of the Halftoning block is to convert 8-bit image data to
binary in a way that allows pictorial images to retain a grayscale appearance.
This is done by overlaying a matrix of threshold values on the image. The
primary trade-off in doing this is halftone cell size versus the number of gray
levels which can be represented. The higher the number of cells per inch, the
less likely it is that the cell structure will be visible when viewing the image.
Unfortunately, a higher number of cells per inch means a lower number of
pixels per cell and, as a result, a smaller number of gray levels which can be
represented by a cell. Larger cell size reduces the chance that visible
contouring will be present in the binary image by increasing the number of gray
levels which can be represented. The compromise between these conflicting
goals is generally decided as a function of output system resolution. A higher
resolution system such as a 600 dpi printer can use a large cell size and still
be fairly immune to the cells becoming objectionable.
The EMKIPC uses an 8 pixel by 8 pixel dither matrix which can be
programmed by the user via the device CPU interface. Each matrix element is
loaded with an 8-bit threshold with valid values from 0 to 255. The actual
halftone cell size depends on how the dither matrix is loaded. The device
programmer is free to design any cell pattern that will fit within the 8 by 8
matrix constraint. Cell structures at both 90 degree and 45 degree orientation
have been tested under simulation. At 90 degrees, 2 x 2, 4 x 4, and 8 x 8 cells
are simple to implement. At 45 degrees, 8 element and 32 element cells have
been tested. The possibilities for cell design are even more varied if mixed
cell sizes are used within the 8 x 8 dither matrix.
The dither matrix is mapped into the device address space for
programming via the CPU interface. The following table shows the addressing of
these registers. In the table the elements are shown positionally as they will
overlay the image.
Table A.6 - Dither Matrix Register Addresses (Hex)
040 041 042 043 044 045 046 047
048 049 04a 04b 04c 04d 04e 04f
050 051 052 053 054 055 056 057
058 059 05a 05b 05c 05d 05e 05f
060 061 062 063 064 065 066 067
068 069 06a 06b 06c 06d 06e 06f
070 071 072 073 074 075 076 077
078 079 07a 07b 07c 07d 07e 07f
The Halftone block also contains two other registers. The first is the
halftone control register. It is located at address 008 hex. Bit-0 of this
register is used as the enable for the halftone block. When it is set, halftoning
is enabled. Bits 1 to 7 are unused. The second register is the threshold value
register. This value is used to implement straight thresholding. It is passed
out of the halftoning block over the same bus as the halftone values when
halftoning is disabled. The thresh register is located at address 009 hex and
valid values are from 0 to 255.
The diagram of the halftone block is shown in Figure A.4 on page 92. It
has the same input image data interface as the other blocks in the device but
on the output there is an additional 32-bit bus. This bus is used to transmit
the threshold value for each pixel along with the image data for that pixel to
the next processing block. The image data is not modified in the halftone
block but it is delayed as required, using DFFs, to remain in sync with the
threshold values. The eight ht_reg_row blocks in the block diagram each
contain a row of the dither matrix. The eight values contained in a row are
sent out four at a time over the 32-bit bus in an alternating manner during
active image time (VSYNC and HSYNC high). The same eight values are
repeated throughout an entire line of the image. When HSYNC goes low
(inactive) a three-bit line counter is incremented so that the next line will use
the next row of the dither matrix. The output of the line counter is used as the
select input to an eight to one by 32 mux. After eight lines have been
processed the line counter value rolls over to zero and the dither matrix is
indexed once again starting with the first row.
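The row and column indexing described above amounts to tiling the 8 by 8 dither matrix over the image. The Python sketch below shows the register address of each matrix element, following the layout of the dither matrix address table, and the threshold that each pixel position receives. The names `DITHER_BASE`, `dither_address`, and `threshold_plane` are hypothetical, and the sketch assumes that the leftmost pixel of each line is paired with the first element of the selected matrix row.

```python
DITHER_BASE = 0x040  # address of the first dither matrix element


def dither_address(row, col):
    """CPU address of the dither matrix element at (row, col), 0..7 each."""
    if not (0 <= row <= 7 and 0 <= col <= 7):
        raise ValueError("row and col must be in the range 0 to 7")
    return DITHER_BASE + 8 * row + col


def threshold_plane(matrix, width, height):
    """Threshold value seen by each pixel of a width x height image.

    The three-bit line counter selects row (y mod 8); within a line the
    eight values of that row repeat, so the column is (x mod 8).
    """
    return [[matrix[y % 8][x % 8] for x in range(width)]
            for y in range(height)]
```

A uniform matrix reproduces straight thresholding, while a matrix filled with a repeating 4 by 4 pattern gives an effective 4 by 4 halftone cell, as noted in the discussion of cell sizes.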
A.6 Error Diffusion
The purpose of the Error Diffusion block is to convert 8-bit image data
to binary in a way that allows pictorial images to retain a grayscale
appearance. The goal is the same as halftoning but the method is entirely different.
In error diffusion, the overall gray level of the image is maintained by
propagating thresholding error to subsequent pixels as the image is processed. In
the EMKIPC implementation of error diffusion, the current pixel value has
error added to (or subtracted from) it from the three pixels above it and from
the pixel immediately to its left. The new pixel value is then compared to
the threshold value which applies to it. This value comes from the halftone
block and can be either a single threshold value applied to all pixels (if
halftoning is disabled) or a value from the dither matrix (when both halftoning
and error diffusion are enabled). The total error is calculated for the current
pixel and is sent out to a FIFO for use in processing the next line. The total
error for a pixel can be in the range -255 to +255. The error is negative when
either of the following conditions holds:
din < 0
thresh < din < 255
where din is defined as the value of the current pixel after previous pixel
error is added to it.
The pixels which contribute error to the current pixel are shown in the
table below. The table is a positional representation with the cell labeled
"X" being the current pixel. Each of the pixels (i) through (l) contributes
error to pixel X in the proportions (m_i) through (m_l), where (m_i) through
(m_l) are 4-bit multipliers with values from 0 to 15.
Table A.7 - Error Diffusion Pixel Reference
j   k   l
i   X
The equation for the new value of X is as follows:

\[
X_{new} = X_{old} + \frac{m_i}{16}\,e_i + \frac{m_j}{16}\,e_j
 + \frac{m_k}{16}\,e_k + \frac{m_l}{16}\,e_l
\]
The values for (e_i) through (e_l) are the total error for pixels (i) through
(l). The multiplier values (m_i through m_l) are programmable via the device
CPU interface. The sum of the multiplier values must never exceed 16 since
this could cause an error value which would exceed the 9-bit capacity of the
error storage for each pixel. This is a reasonable limitation since it also
prevents the device from modifying the overall density of the image using error
diffusion. Ideally, the total sum of the multipliers should be exactly 16 to
maintain the overall gray level content of the image. The error diffusion
multipliers are programmed at the CPU addresses shown in the table below.






There is also an Error Diffusion Control register in the Error Diffusion
block. It is located at address 00C hex. Bit-0 of this register is the enable bit
for error diffusion. When this bit is set, error diffusion is enabled and data
passes through the block after a 5 clock delay. When this bit is cleared, data
passes through the block with no clock delays. Bit-1 of this register is the
threshold enable bit. This bit is located in the error diffusion block because
the threshold block does not have a CPU interface. This bit is passed on to
the threshold block. Both register bits are cleared with the assertion of the
device reset.
There are no line delays through error diffusion, even though there is a
2048 word deep by 36-bit FIFO in the system. This is due to the fact that the
FIFO is used to store error information, not image information. When the
first line of an image is processed, the resulting error is written out to the
FIFO. Data out for the first line is produced 5 clocks after the first data enters
the block. The error input data (err_in(35:0)) for the first line is all zero since
the FIFO is not read during the first line. Refer to the error diffusion block
diagrams on page 96 and page 97. On subsequent lines of the image, error
information from the previous line enters the ed_process block along with the
current line of image data for processing. On the last line, error information
is written to the FIFO but it is never used. This is due to the fact that the
FIFO is reset when VSYNC goes low.
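The error diffusion behaviour described in this section can be sketched as a software reference model. The Python code below processes a whole image held in memory and folds in the final greater-than comparison that the device performs in the threshold block, so it models the combined result rather than the error diffusion block alone. The function name is hypothetical. Three details are assumptions not stated in this specification: the scaled error is rounded with floor division, error from positions outside the image is taken as zero, and the previous line error is zero for the first line (the last point follows from the statement that err_in is all zero on the first line).

```python
def error_diffuse(image, thresholds, multipliers):
    """Behavioural model of error diffusion followed by thresholding.

    image       : list of rows of pixel values 0..255
    thresholds  : same shape as image, one threshold per pixel
    multipliers : (m_i, m_j, m_k, m_l), each 0..15, sum at most 16
                  i = left, j = above left, k = above, l = above right
    """
    m_i, m_j, m_k, m_l = multipliers
    if any(not 0 <= m <= 15 for m in multipliers):
        raise ValueError("each multiplier must be in the range 0 to 15")
    if m_i + m_j + m_k + m_l > 16:
        raise ValueError("sum of multipliers must not exceed 16")
    width = len(image[0])
    prev_err = [0] * width  # error from the line above (zero for first line)
    output = []
    for line, thresh_line in zip(image, thresholds):
        cur_err = [0] * width
        out_line = []
        for x, pixel in enumerate(line):
            e_i = cur_err[x - 1] if x > 0 else 0
            e_j = prev_err[x - 1] if x > 0 else 0
            e_k = prev_err[x]
            e_l = prev_err[x + 1] if x + 1 < width else 0
            weighted = m_i * e_i + m_j * e_j + m_k * e_k + m_l * e_l
            din = pixel + weighted // 16  # floor division: assumption
            out = 0xFF if din > thresh_line[x] else 0x00
            cur_err[x] = din - out  # total error, within -255..255
            out_line.append(out)
        output.append(out_line)
        prev_err = cur_err
    return output
```

With all four multipliers set to zero the model reduces to plain thresholding. With the full error passed to the pixel on the right (multipliers 16, 0, 0, 0) a mid-gray line alternates between FF hex and 00 hex, which illustrates how the average gray level is preserved.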
The error diffusion block has an output called ed_fifo_err which is
similar to the cf_fifo_err signal described for the convolution filter block.
This signal goes high to indicate error diffusion FIFO underflow or overflow.
It will remain high until the device is reset. FIFO overflow will occur if
IN_HSYNC is active for more than 8192 pixels (2048 clocks).
Figure A.6 - Error Diffusion Processing Block Diagram
A.7 Threshold Block
The Threshold block is the only block in the EMKIPC which does not
require a CPU interface. The only configurable parameter needed for this
block is the threshold enable (th_enable) bit. This bit is generated in the
Error Diffusion block which requires a CPU interface for other parameters
and which passes this bit to the threshold block. This reduces the I/O and
logic requirements of the Threshold block significantly and only increases the
Error Diffusion block logic and I/O a very small amount.
The Threshold block has 32-bit image data and 32-bit threshold data as
inputs. Each of the four, 8-bit pixels is compared with its corresponding 8-bit
threshold value. The compare is a greater than compare. If the image data is
greater than the threshold value, then the output is FF hex. If the image data
is less than or equal to the threshold value, then the output is 00 hex. There
are four comparators of eight-bits each for the four pixel data paths. Refer to
the Threshold block diagram on page 100. Four 8-bit muxes (one for
each of the four comparator outputs) follow the comparators. If the threshold
block is disabled (the th_enable input is low) then the 8-bit input image data
is synchronized using DFFs and sent out of the device. If the block is enabled
(th_enable high) then the comparator outputs are synchronized and output.
The Threshold block must be enabled whenever binary output is desired.
This includes straight thresholded output, halftone output, or error diffused
output. When the halftone block is enabled, the threshold value input will
vary as a result of the dither matrix. When the halftone block is disabled, the
threshold value will be static throughout the image and will be equal to the
value of the thresh register in the halftone block. It is valid to have both the
halftone and error diffusion blocks enabled at the same time. In this case, the
threshold values come from the halftone dither matrix.
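The comparison performed by the Threshold block is simple enough to state as a few lines of software. The Python sketch below operates on one 32-bit image word and one 32-bit threshold word, comparing the four pixel lanes independently. The function name is hypothetical and the output DFF delay is not modelled.

```python
def threshold_word(image_word, thresh_word, th_enable=True):
    """Model of the Threshold block for one 32-bit word (four pixels).

    Each 8-bit pixel becomes FF hex if it is greater than its threshold
    and 00 hex otherwise. When th_enable is false the image data passes
    through unchanged.
    """
    if not th_enable:
        return image_word & 0xFFFFFFFF
    out = 0
    for shift in (24, 16, 8, 0):
        pixel = (image_word >> shift) & 0xFF
        thresh = (thresh_word >> shift) & 0xFF
        if pixel > thresh:
            out |= 0xFF << shift
    return out
```

Note that a pixel equal to its threshold produces 00 hex, since the compare is strictly greater than.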
A.8 Device Interfaces
There are three interfaces to the EMKIPC device. They are the CPU
interface, the input image data interface, and the output image data interface.
These interfaces are described in more detail in the following sections. Note
that the signal timing characteristics presented in this section are estimates.
Actual timing would be obtained after synthesis, layout, and back-annotation.
The timing numbers presented as requirements must be met as specified to
allow interfacing to external devices.
A.8.1 CPU Interface
The CPU interface is a basic Intel type eight-bit bi-directional interface.
The signals are described in the table below.
Table A.9 - CPU Interface Signals
Signal Name   Description
CSZ           INPUT - Chip Select, active low. This signal must be low during
              read or write access to the EMKIPC device. The other CPU
              interface signals have no effect on the device when CSZ is high.
CPU_WRZ       INPUT - Write Strobe, active low. Data is written into the device
              registers via CPU_D(7:0) on the low to high transition of
              CPU_WRZ when CSZ is low.
CPU_RDZ       INPUT - Read Strobe, active low. Data is driven out of the device
              onto CPU_D(7:0) when CPU_RDZ is low and CSZ is low.
CPU_A(8:0)    INPUT - Address bus. The 9-bit address bus is decoded internally
              along with CSZ to select which device register is to be accessed
              during read or write cycles.
CPU_D(7:0)    BI-DIRECTIONAL - Data bus. The eight data bits function as
              inputs during valid write cycles (CSZ and CPU_WRZ low) and as
              outputs during valid read cycles (CSZ and CPU_RDZ low).
              Otherwise they are in a high impedance state.
The timing relationships among the CPU interface signals are shown
below.
-101-









j lASU M- ?> lAH





X Valid Data X






Parameter  Description                                          Value
tW         WRZ active (low) time                                20
tCSU       Setup time, CSZ low to rising edge of WRZ            20
tASU       Setup time, Address valid to rising edge of WRZ      20
tDSU       Setup time, Data valid to rising edge of WRZ         15
tCH        Hold time, rising edge of WRZ to CSZ high            5
tAH        Hold time, rising edge of WRZ to invalid Address     5
tDH        Hold time, rising edge of WRZ to invalid Data        5
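The write-cycle setup requirements above could be checked in simulation with a small monitor such as the sketch below. The entity name cpu_wr_check is hypothetical, and the generics assume the table values are in nanoseconds, which the extracted table does not state explicitly.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Simulation-only monitor (not synthesizable). Names are illustrative,
-- and the ns units are an assumption about the timing table.
entity cpu_wr_check is
  generic (
    T_DSU : time := 15 ns;   -- data setup to rising edge of WRZ
    T_ASU : time := 20 ns);  -- address setup to rising edge of WRZ
  port (
    csz     : in std_logic;
    cpu_wrz : in std_logic;
    cpu_a   : in std_logic_vector(8 downto 0);
    cpu_d   : in std_logic_vector(7 downto 0));
end entity cpu_wr_check;

architecture sim of cpu_wr_check is
begin
  check_p : process (cpu_wrz)
  begin
    -- Only writes to a selected chip are checked.
    if rising_edge(cpu_wrz) and csz = '0' then
      -- 'last_event is the time elapsed since the signal last changed.
      assert cpu_d'last_event >= T_DSU
        report "tDSU violation on CPU_D" severity warning;
      assert cpu_a'last_event >= T_ASU
        report "tASU violation on CPU_A" severity warning;
    end if;
  end process check_p;
end architecture sim;
```

Hold-time checks would need a second process that watches CPU_A and CPU_D for changes shortly after the strobe rises; they are left out here to keep the sketch short.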














[Figure: CPU interface read cycle timing diagram, showing CPU_A with a valid address and CPU_D with valid data after the access time tDA.]






Parameter  Description                                          Min  Max
tR         RDZ active (low) time                                20
tCSU       Setup time, CSZ low to falling edge of RDZ           5
tASU       Setup time, Address valid to falling edge of RDZ     5
tDA        Access time, Data valid after falling edge of RDZ    2    10
tCH        Hold time, rising edge of RDZ to CSZ high            0
tAH        Hold time, rising edge of RDZ to invalid Address     0
tDH        Hold time, Data valid after rising edge of RDZ       2

In Table A.11 above, tDA is a characteristic and the rest of the parameters
are requirements.
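As an illustration of how a test bench might honour the read-cycle requirements, the following package sketches a bus-functional read procedure. The package and procedure names are hypothetical, and the delays assume the table values are nanoseconds; this is not the cpu_read procedure from the thesis source.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

package cpu_bfm_sketch is
  -- Hypothetical helper: performs one read cycle and returns the data.
  procedure cpu_read_cycle (
    constant addr    : in  std_logic_vector(8 downto 0);
    signal   cpu_a   : out std_logic_vector(8 downto 0);
    signal   csz     : out std_logic;
    signal   cpu_rdz : out std_logic;
    signal   cpu_d   : in  std_logic_vector(7 downto 0);
    variable data    : out std_logic_vector(7 downto 0));
end package cpu_bfm_sketch;

package body cpu_bfm_sketch is
  procedure cpu_read_cycle (
    constant addr    : in  std_logic_vector(8 downto 0);
    signal   cpu_a   : out std_logic_vector(8 downto 0);
    signal   csz     : out std_logic;
    signal   cpu_rdz : out std_logic;
    signal   cpu_d   : in  std_logic_vector(7 downto 0);
    variable data    : out std_logic_vector(7 downto 0)) is
  begin
    cpu_a <= addr;
    csz   <= '0';
    wait for 5 ns;      -- tCSU and tASU before RDZ falls (assumed ns)
    cpu_rdz <= '0';
    wait for 20 ns;     -- tR; longer than the worst-case tDA of 10
    data := cpu_d;      -- sample just before RDZ rises
    cpu_rdz <= '1';
    csz     <= '1';     -- tCH and tAH are both 0
  end procedure cpu_read_cycle;
end package body cpu_bfm_sketch;
```

Sampling at the end of the strobe rather than at tDA keeps the procedure valid for any device whose access time is shorter than tR.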
A.8.2 Image Data Input Interface
The Input Image Data interface is a 32-bit synchronous raster image data
interface with vertical and horizontal sync signals. Each input pixel is made
up of eight-bit data, so four pixels are brought into the chip and processed on
each rising edge of CLK. The following conventions are used for locating
image data on a page.
- A line of data which arrives earlier in time is considered to be
  above a line which arrives later.
- A word of data which arrives earlier in time is considered to
  be to the left of a word which arrives later.
- Within a word, a more significant byte (pixel) is considered to
  be to the left of a less significant byte (e.g.
  IN_IMG_D(31:24) is to the left of IN_IMG_D(23:16)).
- Within a byte, the most significant bit corresponds to the most
  significant pixel data bit (e.g. bits 31, 23, 15, and 7 of a word
  each correspond to bit 7 (the MSB) of an 8-bit pixel value).
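The word and byte ordering conventions above can be captured in a small helper function. The package name pixel_order_sketch and the function pixel_at are illustrative assumptions and do not appear in the thesis source.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

package pixel_order_sketch is
  subtype pixel_t is std_logic_vector(7 downto 0);
  -- pos = 0 selects the leftmost pixel of the word, pos = 3 the rightmost.
  function pixel_at (word : std_logic_vector(31 downto 0);
                     pos  : natural range 0 to 3) return pixel_t;
end package pixel_order_sketch;

package body pixel_order_sketch is
  function pixel_at (word : std_logic_vector(31 downto 0);
                     pos  : natural range 0 to 3) return pixel_t is
    -- The leftmost pixel occupies the most significant byte.
    constant hi : natural := 31 - 8 * pos;
    variable p  : pixel_t;
  begin
    p := word(hi downto hi - 7);
    return p;
  end function pixel_at;
end package body pixel_order_sketch;
```

With this convention, pixel_at(w, 0) returns w(31 downto 24) and pixel_at(w, 3) returns w(7 downto 0), so pixels are visited left to right as pos increases.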
The input interface signals are described in Table A.12.
Table A.12 - Image Data Input Interface Signals

Signal Name      Description
CLK              INPUT - Clock. IN_IMG_D(31:0), IN_VSYNC, and
                 IN_HSYNC are sampled on the rising edge of CLK.
                 OUT_IMG_D(31:0), OUT_VSYNC and OUT_HSYNC are
                 clocked out as a result of the rising edge of CLK. CLK can run
                 at up to TBD MHz.
IN_IMG_D(31:0)   INPUT - 32-bit Input Image Data
IN_VSYNC         INPUT - Vertical Sync, active high. This signal is used to
                 indicate valid image data on a page basis. It must go active one
                 CLK cycle prior to the assertion of the first IN_HSYNC of the
                 image and must remain active throughout the image.
                 IN_VSYNC must remain active at least one CLK cycle after the
                 last IN_HSYNC of the image has gone inactive. It must go
                 inactive at least one CLK cycle before the assertion of the first
                 IN_HSYNC outside of valid image data.
IN_HSYNC         INPUT - Horizontal Sync, active high. This signal is used to
                 indicate valid image data on a line basis. IN_HSYNC can be
                 free running or can go active only when image data is presented.
                 When the Convolution Filter or Error Diffusion functions
                 (operations that use FIFOs) are enabled, IN_HSYNC must be active
                 for the same number of CLK cycles for every line in an image
                 in order for these functions to operate correctly. When the
                 Convolution Filter function is enabled, either IN_HSYNC must be
                 free running or there must be at least one additional cycle of
                 IN_HSYNC after IN_VSYNC goes inactive to get the last line
                 of the image out.
RESETZ           INPUT - Chip Reset, active low. The RESETZ input is used to
                 completely reset the device. All internal flip-flops, including
                 state machines, CPU registers, counters, and error condition
                 DFFs are asynchronously set or reset to their default condition
                 when RESETZ is active.
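The requirement that IN_HSYNC be active for the same number of CLK cycles on every line could be verified in a test bench with a monitor along the following lines. This is a simulation-only sketch; the entity name hsync_len_check and the mismatch output are hypothetical and are not part of the EMKIPC design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Simulation-only line length monitor. All names are illustrative.
entity hsync_len_check is
  port (
    clk      : in  std_logic;
    resetz   : in  std_logic;
    in_vsync : in  std_logic;
    in_hsync : in  std_logic;
    mismatch : out std_logic);
end entity hsync_len_check;

architecture sim of hsync_len_check is
begin
  process (clk, resetz)
    variable count    : natural := 0;      -- active cycles in current line
    variable ref_len  : natural := 0;      -- length of first line of page
    variable have_ref : boolean := false;
  begin
    if resetz = '0' then
      count    := 0;
      ref_len  := 0;
      have_ref := false;
      mismatch <= '0';
    elsif rising_edge(clk) then
      if in_vsync = '0' then
        -- Between pages: forget the reference length.
        count    := 0;
        have_ref := false;
      elsif in_hsync = '1' then
        count := count + 1;
      elsif count > 0 then
        -- IN_HSYNC has just gone inactive, so a line has ended.
        if not have_ref then
          ref_len  := count;
          have_ref := true;
        elsif count /= ref_len then
          mismatch <= '1';                 -- stays high until reset
        end if;
        count := 0;
      end if;
    end if;
  end process;
end architecture sim;
```

The monitor relies on IN_VSYNC staying active for at least one CLK cycle after the last IN_HSYNC goes inactive, as the table requires, so that the final line is also compared.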
The timing relationships among the Input Image Data interface signals
are shown in Figure A.10, Figure A.11, and Table A.13.
Figure A.10 - Input Image Data Macro Timing
[Timing diagram: IN_VSYNC, IN_HSYNC, and IN_IMG_D carrying Line 1 through Line 5, with don't-care (XXXX) data before and after the valid lines.]










[Figure A.11 timing diagram: IN_IMG_D carrying Word 1 through Word 7.]






Parameter  Description                                      Value
tCLK       CLK cycle time                                   50
tVSU       Setup time, IN_VSYNC to rising edge of CLK       5
tVH        Hold time, IN_VSYNC after rising edge of CLK     2
tHSU       Setup time, IN_HSYNC to rising edge of CLK       5
tHH        Hold time, IN_HSYNC after rising edge of CLK     2
tDSU       Setup time, IN_IMG_D to rising edge of CLK       5
tDH        Hold time, IN_IMG_D after rising edge of CLK     2
A.8.3 Image Data Output Interface
The Output Image Data interface is a 32-bit synchronous raster image
data interface with vertical and horizontal sync signals identical to the input
image data interface. The output data will be eight bits per pixel but, if any
of the gray to binary functions are enabled, the values will be either 00 hex or
FF hex. There are two error signals in the output interface in addition to the
image data and sync signals. These are used to signal that internal FIFO
errors have occurred. All of the output image data interface signals are
synchronous to the rising edge of the input CLK.
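The 00 hex / FF hex output format can be illustrated with a simple per-pixel threshold applied to all four pixels of a word. The entity name thresh_sketch is hypothetical, and the polarity (a pixel greater than the threshold becomes FF hex) is an assumption made for this sketch rather than a statement about the EMKIPC implementation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative gray-to-binary conversion for one 32-bit word.
entity thresh_sketch is
  port (
    thresh : in  std_logic_vector(7 downto 0);
    din    : in  std_logic_vector(31 downto 0);
    dout   : out std_logic_vector(31 downto 0));
end entity thresh_sketch;

architecture rtl of thresh_sketch is
begin
  -- One comparator per pixel; each output byte is either 00 or FF hex.
  gen_pix : for i in 0 to 3 generate
    dout(8*i + 7 downto 8*i) <= x"FF"
      when unsigned(din(8*i + 7 downto 8*i)) > unsigned(thresh)
      else x"00";
  end generate gen_pix;
end architecture rtl;
```

Keeping the binary result in an eight-bit field lets the output interface use the same 32-bit word format whether or not a gray to binary function is enabled.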
The output interface signals are described in Table A.14.
Table A.14 - Output Image Interface Signal Descriptions

Signal Name       Description
OUT_IMG_D(31:0)   OUTPUT - 32-bit Output Image Data
OUT_VSYNC         OUTPUT - Vertical Sync, active high. This signal is used to
                  indicate valid image data on a page basis. It will follow
                  IN_VSYNC by a number of clock cycles which depends on
                  which functions are enabled. When the Convolution Filter is
                  enabled, OUT_VSYNC will be delayed by one line plus the
                  applicable clock delays.
OUT_HSYNC         OUTPUT - Horizontal Sync, active high. This signal is used to
                  indicate valid image data on a line basis. It will follow
                  IN_HSYNC by a number of clock cycles which depends on
                  which functions are enabled. OUT_HSYNC will propagate
                  through the chip independent of the level of IN_VSYNC.
CF_FIFO_ERR       OUTPUT - Convolution Filter FIFO error, active high. This
                  signal goes high to indicate that an overflow or an underflow has
                  occurred on one of the convolution filter FIFOs. It will remain
                  high until the device is reset.
ED_FIFO_ERR       OUTPUT - Error Diffusion FIFO error, active high. This signal
                  goes high to indicate that an overflow or an underflow has
                  occurred on the error diffusion FIFO. It will remain high until
                  the device is reset.
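The "remains high until the device is reset" behaviour of the two error outputs corresponds to a sticky flag. A minimal sketch follows; the entity name sticky_err_sketch is hypothetical, and the over and under inputs stand in for the FIFO overflow and underflow indications.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Illustrative sticky error flag, cleared only by the asynchronous reset.
entity sticky_err_sketch is
  port (
    clk      : in  std_logic;
    resetz   : in  std_logic;
    over     : in  std_logic;
    under    : in  std_logic;
    fifo_err : out std_logic);
end entity sticky_err_sketch;

architecture rtl of sticky_err_sketch is
begin
  process (clk, resetz)
  begin
    if resetz = '0' then
      fifo_err <= '0';
    elsif rising_edge(clk) then
      -- Once set, the flag is never cleared by the clocked branch.
      if over = '1' or under = '1' then
        fifo_err <= '1';
      end if;
    end if;
  end process;
end architecture rtl;
```

A sticky flag means a single-cycle overflow or underflow cannot be missed by a CPU or test fixture that samples the error pins only occasionally.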
The timing relationships among the output image data interface signals
are described in Figure A.12 and Table A.15.









[Figure A.12 timing diagram: OUT_IMG_D carrying Word 1 through Word 7.]






Parameter  Description                                     Min  Max
tCLK       CLK cycle time                                  50
tV         Rising edge of CLK to OUT_VSYNC transition      0    5
tH         Rising edge of CLK to OUT_HSYNC transition      0    5
tD         Rising edge of CLK to OUT_IMG_D transition      0    5
VHDL Implementation of an Image Processing Chip












TYPE ram_data IS ARRAY(0 to 255) of std_logic_vector(7 downto 0);
TYPE fifo_data IS ARRAY(0 to (FIFO_DEPTH - 1)) of std_logic_vector(31 downto 0);
TYPE fifo36_data IS ARRAY(0 to (FIFO_DEPTH - 1)) of std_logic_vector(35 downto 0);
TYPE fifo40_data IS ARRAY(0 to (FIFO_DEPTH - 1)) of std_logic_vector(39 downto 0);
SUBTYPE fifo_range IS INTEGER RANGE 0 to (FIFO_DEPTH - 1);
CONSTANT X_32: std_logic_vector(31 downto 0) := (OTHERS => 'X');
CONSTANT HIZ_32: std_logic_vector(31 downto 0) := (OTHERS => 'Z');
CONSTANT ZERO_32: std_logic_vector(31 downto 0) := (OTHERS => '0');
CONSTANT X_36: std_logic_vector(35 downto 0) := (OTHERS => 'X');
CONSTANT HIZ_36: std_logic_vector(35 downto 0) := (OTHERS => 'Z');
CONSTANT ZERO_36: std_logic_vector(35 downto 0) := (OTHERS => '0');
CONSTANT DELAY1: TIME := 2 ns;
CONSTANT DELAY2: TIME := 5 ns;
CONSTANT CPU_CYCLE: TIME := 20 ns;
FUNCTION vec_int(a: std_logic_vector) RETURN INTEGER;
FUNCTION int_vec(n: INTEGER; v: std_logic_vector) RETURN std_logic_vector;
FUNCTION valid_vec(v: std_logic_vector) RETURN BOOLEAN;
PROCEDURE cpu_write(a,d: IN std_logic_vector;
                    SIGNAL cpu_a,cpu_d: INOUT std_logic_vector;
                    SIGNAL csz,cpu_wrz: INOUT std_logic);
PROCEDURE cpu_read(a: IN std_logic_vector;
                   SIGNAL cpu_a: INOUT std_logic_vector;
                   SIGNAL csz,cpu_rdz: INOUT std_logic);
END types_emk;
PACKAGE BODY types_emk IS
-- Function: Vector to Integer Conversion.
FUNCTION vec_int(a: std_logic_vector) RETURN INTEGER IS
  VARIABLE work: INTEGER := 0;
  VARIABLE unknown: BOOLEAN := FALSE;
BEGIN












ASSERT (unknown = FALSE)





-- Function: Integer to Vector Conversion.




hi := v'HIGH + v'LOW;
work := n;
FOR i IN v'RANGE LOOP
  IF (work MOD 2) = 0 THEN








-- Function: Returns a boolean TRUE if each element of the
-- vector is a 1 or 0.
FUNCTION valid_vec(v: std_logic_vector) RETURN BOOLEAN IS
  VARIABLE b: boolean := TRUE;
  VARIABLE j: INTEGER;
BEGIN
  FOR j in v'RANGE LOOP
    IF ((v(j) /= '1') AND (v(j) /= '0')) THEN
      b := FALSE;
      ASSERT(FALSE)








-- Procedure: Generates a cpu write cycle.
PROCEDURE cpu_write(a,d: IN std_logic_vector;
SIGNAL cpu_a,cpu_d: INOUT std_logic_vector;




AFTER 2 ns, '1' AFTER (CPU_CYCLE - 2 ns);
cpu_wrz <= '0'




cpu_d <= d AFTER 2 ns,
         "ZZZZZZZZ" AFTER (CPU_CYCLE - 2 ns);
WAIT FOR CPU_CYCLE;
END cpu_write;
-- Procedure: Generates a cpu read cycle.
PROCEDURE cpu_read(a: IN std_logic_vector;
                   SIGNAL cpu_a: INOUT std_logic_vector;






AFTER (CPU_CYCLE - 2 ns);
cpu_rdz <= '0'








-- File: Comps.vhd
-- Basic logical building blocks.
-- Mike Kelly
-- RIT Computer engineering Master's Thesis project.
-- Created: 8/5/95






a,b: IN std_logic_vector(7 downto 0);
c: OUT std_logic_vector(7 downto 0));
END mux_2to1_8;
ARCHITECTURE bhv of mux_2to1_8 IS
BEGIN
c <= b AFTER DELAY1 WHEN a_bz = '0'










ARCHITECTURE bhv of mux_8to1_32 IS
BEGIN
mux_out <= a AFTER DELAY1 WHEN sel = "000" ELSE
           b AFTER DELAY1 WHEN sel = "001" ELSE
           c AFTER DELAY1 WHEN sel = "010" ELSE
           d AFTER DELAY1 WHEN sel = "011" ELSE
           e AFTER DELAY1 WHEN sel = "100" ELSE
           f AFTER DELAY1 WHEN sel = "101" ELSE




























ELSIF (clk'EVENT AND clk = '1') THEN































d: IN std_logic_vector(7 downto 0);
q: OUT std_logic_vector(7 downto 0));
END dff8;





















d: IN std_logic_vector(31 downto 0);
q: OUT std_logic_vector(31 downto 0));
END dff32;




IF (clrz = '0') THEN
  q <= ZERO_32 AFTER DELAY1;
ELSIF (clk'EVENT AND clk = '1') THEN
q









d: INOUT std_logic_vector(7 downto 0);
q: OUT std_logic_vector(7 downto 0));
END reg_rb_8;
ARCHITECTURE bhv of reg_rb_8 IS








ELSIF (clk'EVENT AND clk = '1') THEN
  reg_conts <= d AFTER DELAY1;
END IF;
END PROCESS;















-- RIT Computer engineering Master's Thesis project.
-- Created: 8/12/95
-- Last Revision: 10/17/95
-- On 10/17/95, I added a 36 bit fifo implementation to work with a new
-- approach on error diffusion which stores total error (9 bits/pixel)
-- instead of pixel values with error added (10 bits/pixel).
-- On 1/9/96 I deleted the fifo_40 block since this device is no longer







d: IN std_logic_vector(31 downto 0);
over,under: OUT std_logic;
q: OUT std_logic_vector(31 downto 0));
END fifo;
ARCHITECTURE bhv of fifo IS
TYPE concat_2 IS ARRAY (0 to 1) OF std_logic;
SIGNAL over_int,under_int:std_logic;
SIGNAL q_int:std_logic_vector(31 downto 0);
SIGNAL w_cnt,r_cnt:fifo_range;
FUNCTION addr_incr( a:fifo_range) RETURN fifo_range IS
VARIABLE work: fifo_range;
BEGIN










VARIABLE r_rdy:boolean := false;
BEGIN










ELSIF (clk'EVENT AND clk = '1') THEN










































































d: IN std_logic_vector(35 downto 0);
over,under: OUT std_logic;
q: OUT std_logic_vector(35 downto 0));
END fifo_36;
ARCHITECTURE bhv of fifo_36 IS
TYPE concat_2 IS ARRAY (0 to 1) OF std_logic;
SIGNAL over_int,under_int:std_logic;
SIGNAL q_int:std_logic_vector(35 downto 0);
SIGNAL w_cnt,r_cnt:fifo_range;
FUNCTION addr_incr( a:fifo_range) RETURN fifo_range IS
VARIABLE work: fifo_range;
BEGIN










VARIABLE r_rdy:boolean := false;
BEGIN








ELSIF (clk'EVENT AND clk = '1') THEN




































































-- File: emkipc.vhd
-- Top level for the EMKIPC chip. This file is used to tie all the functional blocks
-- together. Structural VHDL is used.
-- Mike Kelly
-- RIT Computer engineering Master's Thesis project.
-- Created: 8/20/95








cpu_a: IN std_logic_vector(8 downto 0);
cpu_d: INOUT std_logic_vector(7 downto 0);
in_img_d: IN std_logic_vector(31 downto 0);
out_vsync,out_hsync: OUT std_logic;
cf_fifo_err,ed_fifo_err: OUT std_logic;
out_img_d: OUT std_logic_vector(31 downto 0));
END emkipc;




SIGNAL ht_val,ed_thresh: std_logic_vector(31 downto 0);






cpu_a: IN std_logic_vector(8 downto 0);
in_img_d:IN std_logic_vector(31 downto 0);








cpu_a: IN std_logic_vector(8 downto 0);
hm_img_d: IN std_logic_vector(31 downto 0);
cpu_d: INOUT std_logic_vector(7 downto 0);
cf_vsync,cf_hsync,cf_fifo_err: OUT std_logic;
cf_img_d: OUT std_logic_vector(31 downto 0));
END COMPONENT;





cpu_a: IN std_logic_vector(8 downto 0);
cf_img_d: IN std_logic_vector(31 downto 0);







cpu_a: IN std_logic_vector(8 downto 0);
ht_img_d,ht_val: IN std_logic_vector(31 downto 0);







































-- RIT Computer engineering Master's Thesis project.
-- Created: 8/6/95







cpu_a: IN std_logic_vector(8 downto 0);
cpu_d: INOUT std_logic_vector(7 downto 0);
hm_enable,ra_wrz: OUT std_logic;
ra_rdz0,ra_rdz1,ra_rdz2,ra_rdz3: OUT std_logic);
END hra_cpu_dcd;
ARCHITECTURE bhv of hra_cpu_dcd IS






































= '0') ELSE '1';









































































IN std_logic_vector(7 downto 0);
INOUT std_logic_vector(7 downto 0);
OUT std_logic_vector(7 downto 0));
ARCHITECTURE bhv of reg_arry IS
BEGIN































cpu_a: IN std_logic_vector(8 downto 0);
in_img_d: IN std_logic_vector(31 downto 0);
cpu_d: INOUT std_logic_vector(7 downto 0);
hm_vsync,hm_hsync: OUT std_logic;
hm_img_d: OUT std_logic_vector(31 downto 0));
END hist_mod;
ARCHITECTURE mixed of hist_mod IS
SIGNAL hm_enable: std_logic;
SIGNAL ra_wrz,ra_rdz0,ra_rdz1,ra_rdz2,ra_rdz3: std_logic;
SIGNAL vsync1,vsync2,hsync1,hsync2: std_logic;
SIGNAL hra_a_a,hra_a_b,hra_a_c,hra_a_d: std_logic_vector(7 downto 0);
SIGNAL l_in_d,hra_d,img_dout: std_logic_vector(31 downto 0);
COMPONENT dff32
PORT(clk,clrz: IN std_logic;
     d: IN std_logic_vector(31 downto 0);
     q: OUT std_logic_vector(31 downto 0));





cpu_a: IN std_logic_vector(8 downto 0);






a: IN std_logic_vector(7 downto 0);
d: INOUT std_logic_vector(7 downto 0);

























































PORT MAP (ra_wrz,ra_rdz3,hra_a_a,cpu_d,hra_d(31 downto 24));
RA_B: reg_arry
PORT MAP (ra_wrz,ra_rdz2,hra_a_b,cpu_d,hra_d(23 downto 16));
RA_C: reg_arry
PORT MAP (ra_wrz,ra_rdz1,hra_a_c,cpu_d,hra_d(15 downto 8));
RA_D: reg_arry
PORT MAP (ra_wrz,ra_rdz0,hra_a_d,cpu_d,hra_d(7 downto 0));
DFF_OUT: dff32
PORT MAP (clk,resetz,hra_d,img_dout);
hm_vsync <= vsync2 AFTER DELAY1 WHEN hm_enable = '1'
            ELSE vsync1 AFTER DELAY1;
hm_hsync <= hsync2 AFTER DELAY1 WHEN hm_enable = '1'
            ELSE hsync1 AFTER DELAY1;














-- File: cf_blk.vhd
-- Convolution Filter block
-- Mike Kelly
-- RIT Computer engineering Master's Thesis project.
-- Created: 8/10/95
-- Last Revision: 10/24/95
-- Modified the design to allow negative filter coefficients.
-- Valid coefficient values are from -127 to 127. Note that
-- negative coefficients must be entered by the cpu in sign-












cd_img_d: OUT std_logic_vector(31 downto 0));
END cf_fifo_cd;
ARCHITECTURE bhv of cf_fifo_cd IS







-- Generate the delayed hsync signals and the fifo wen and ren signals.
-- Latch image data to line up with wen and ren.
-- Also detect and signal fifo error conditions.




































ELSIF clk'EVENT AND clk = '1' THEN
  hsync1 <= hm_hsync AFTER DELAY1;
  hsync2 <= hsync1 AFTER DELAY1;
  hsync3 <= hsync2 AFTER DELAY1;
  hsync4 <= hsync3 AFTER DELAY1;
  hsync5 <= hsync4 AFTER DELAY1;
  hsync6 <= hsync5 AFTER DELAY1;
  hsync7 <= hsync6 AFTER DELAY1;
  cf_hsync <= hsync7 AFTER DELAY1;
  filt_hsync <= vsync3 AND hsync3 AFTER DELAY1;
  wen1 <= (hm_hsync AND hm_vsync AND NOT(over)) AFTER DELAY1;
  wen2 <= (hsync1 AND hm_vsync AND NOT(over)) AFTER DELAY1;
  ren <= (hm_hsync AND vsync1 AND NOT(under)) AFTER DELAY1;
  IF (hm_vsync = '1' AND hm_hsync = '1') THEN
    cd_img_d <= hm_img_d AFTER DELAY1;
  ELSE




OR under = '1') THEN
  fifo_err <= '1' AFTER DELAY1; -- This condition persists until resetz goes low.
END IF;
vsync2 <= vsync1 AFTER DELAY1;
vsync3 <= vsync2 AFTER DELAY1;
vsync4 <= vsync3 AFTER DELAY1;
vsync5 <= vsync4 AFTER DELAY1;
vsync6 <= vsync5 AFTER DELAY1;
vsync7 <= vsync6 AFTER DELAY1;
cf_vsync <= vsync7 AFTER DELAY1;
fifo_rstz <= hm_vsync OR vsync1 AFTER DELAY1;
first_ln <= hm_vsync AND NOT(vsync1) AFTER DELAY1;
last_ln <= NOT(hm_vsync) AND vsync1 AFTER DELAY1;
f_wd2 <= f_wd1 AFTER DELAY1;
first_wd <= f_wd2 AFTER DELAY1;
















































































































pres_state <= wait_v AFTER DELAY1 ;
ELSE










cd_img_d: IN std_logic_vector(31 downto 0);
over,under: OUT std_logic;
prev,curr,n_ext: OUT std_logic_vector(31 downto 0));
END cf_fifos;
ARCHITECTURE mixed of cf_fifos IS
SIGNAL curr_int,fifo2_in,latch_cd: std_logic_vector(31 downto 0);
SIGNAL over1,over2,under1,under2: std_logic;
COMPONENT dff32
PORT(clk,clrz: IN std_logic;
d: IN std_logic_vector(31 downto 0);




d: IN std_logic_vector(31 downto 0);
over,under: OUT std_logic;
q: OUT std_logic_vector(31 downto 0));
END COMPONENT;
BEGIN










n_ext <= curr_int WHEN last_ln = '1'
         ELSE latch_cd;
over <= over1 OR over2;













ARCHITECTURE netlist of cf_pix_ctl IS
SIGNAL plin_a,clin_a,nlin_a: std_logic_vector(7 downto 0);
SIGNAL pla,plb,plc,pld,ple,plf,plg,plh,pli,pl_e: std_logic_vector(7 downto 0);
SIGNAL cla,clb,clc,cld,cle,clf,clg,clh,cli,cl_e: std_logic_vector(7 downto 0);
SIGNAL nla,nlb,nlc,nld,nle,nlf,nlg,nlh,nli,nl_e: std_logic_vector(7 downto 0);
COMPONENT mux_2to1_8
PORT(a_bz: IN std_logic;
     a,b: IN std_logic_vector(7 downto 0);




d: IN std_logic_vector(7 downto 0);









PORT MAP (clk,resetz,prev(23 downto 16),plg);
DFF_P3: dff8
PORT MAP (clk,resetz,prev(15 downto 8),plh);
DFF_P4: dff8


















PORT MAP (clk,resetz,curr(23 downto 16),clg);
DFF_C3: dff8
PORT MAP (clk,resetz,curr(15 downto 8),clh);
DFF_C4: dff8


















PORT MAP (clk,resetz,n_ext(23 downto 16),nlg);
DFF_N3: dff8
PORT MAP (clk,resetz,n_ext(15 downto 8),nlh);
DFF_N4: dff8
PORT MAP (clk,resetz,n_ext(7 downto 0),nli);













prev_a <= pla AFTER DELAY1;
prev_b <= plb AFTER DELAY1;
prev_c <= plc AFTER DELAY1;
prev_d <= pld AFTER DELAY1;
prev_e <= ple AFTER DELAY1;
prev_f <= plf AFTER DELAY1;
curr_a <= cla AFTER DELAY1;
curr_b <= clb AFTER DELAY1;
curr_c <= clc AFTER DELAY1;
curr_d <= cld AFTER DELAY1;
curr_e <= cle AFTER DELAY1;
curr_f <= clf AFTER DELAY1;
next_a <= nla AFTER DELAY1;
next_b <= nlb AFTER DELAY1;
next_c <= nlc AFTER DELAY1;
next_d <= nld AFTER DELAY1;
next_e <= nle AFTER DELAY1;









cpu_a: IN std_logic_vector(8 downto 0);
cpu_d: INOUT std_logic_vector(7 downto 0);
cf_enable: OUT std_logic;
kern_tot: OUT std_logic_vector(11 downto 0);
cpa,cpb,cpc,cca,ccb,ccc,cna,cnb,cnc: OUT std_logic_vector(7 downto 0));
END cf_cpuif;
ARCHITECTURE bhv of cf_cpuif IS
SIGNAL cf_cd_reg,d_mux: std_logic_vector(7 downto 0);
SIGNAL k_tot: std_logic_vector(11 downto 0);
SIGNAL cp_a,cp_b,cp_c,cc_a,cc_b,cc_c,cn_a,cn_b,cn_c: std_logic_vector(7 downto 0);
FUNCTION kern_add(a,b,c,d,e,f,g,h,i:std_logic_vector(7 downto 0))
RETURN std_logic_vector IS
  VARIABLE aw,bw,cw,dw,ew,fw,gw,hw,iw,work: std_logic_vector(11 downto 0);
BEGIN
aw(6 downto 0) := a(6 downto 0);
bw(6 downto 0) := b(6 downto 0);
cw(6 downto 0) := c(6 downto 0);
dw(6 downto 0) := d(6 downto 0);
ew(6 downto 0) := e(6 downto 0);
fw(6 downto 0) := f(6 downto 0);
gw(6 downto 0) := g(6 downto 0);
hw(6 downto 0) := h(6 downto 0);
iw(6 downto 0) := i(6 downto 0);
aw(11 downto 7) := "00000";
bw(11 downto 7) := "00000";
cw(11 downto 7) := "00000";
dw(11 downto 7) := "00000";
ew(11 downto 7) := "00000";
fw(11 downto 7) := "00000";
gw(11 downto 7) := "00000";
hw(11 downto 7) := "00000";
iw(11 downto 7) := "00000";













cw := NOT(cw) + "000000000001";
END IF;
IF d(7) = '1' THEN
  dw := NOT(dw) + "000000000001";
END IF;
IF e(7) = '1' THEN
  ew := NOT(ew) + "000000000001";
END IF;
IF f(7) = '1' THEN





gw := NOT(gw) + "000000000001";
END IF;
IF h(7) = '1' THEN
  hw := NOT(hw) + "000000000001";
END IF;
IF i(7) = '1' THEN
  iw := NOT(iw) + "000000000001";
END IF;




BEGIN
-- The cpu registers for the conv_filt are a control register and 9 filter
-- element registers. Bit 0 of the control register is used as an enable
-- for this block. The other bits of the control register are unused.
-- The filter element registers are 8 bits. Bit 7 is the sign bit. Sign/
-- magnitude representation is used. Valid values are -127 to 127.




-- The filter kernel total (12 bits) can be read as 2 bytes. Bits 7 to 0 are
-- read at address 01e hex and bits 11 to 8 are read at address 01f hex.






























































AND cpu_a = "000010011") ELSE
cc_a;










































cpu_d WHEN (cpu_wrz'EVENT AND cpu_wrz = '1' AND
            csz = '0' AND cpu_a = "000010111") ELSE












AND cpu_a = "000011000") ELSE
cn_c;
-- k_tot is the sum of all 9 filter elements. This value is used to normalize
-- the convolution result. It can be read via the cpu port at address 01f hex.
-- Note that at reset, it has a value of 1 since cc_b resets to 1. Read only.
k_tot <= kern_add(cp_a,cp_b,cp_c,cc_a,cc_b,cc_c,cn_a,cn_b,cn_c) AFTER 10 ns;
WITH cpu_a SELECT
d_mux <= cp_a WHEN "000010000", -- 010 hex
         cp_b WHEN "000010001", -- 011 hex
         cp_c WHEN "000010010", -- 012 hex
         cc_a WHEN "000010011", -- 013 hex
         cc_b WHEN "000010100", -- 014 hex
         cc_c WHEN "000010101", -- 015 hex
         cn_a WHEN "000010110", -- 016 hex
         cn_b WHEN "000010111", -- 017 hex
         cn_c WHEN "000011000", -- 018 hex
         k_tot(7 downto 0) WHEN "000011110", -- 01e hex
         ("0000"
















cpu_d <= d_mux WHEN (csz = '0'
                     AND (cpu_a(8 downto 4) = "00001"
                          OR cpu_a = "000000100")
                     AND cpu_rdz = '0'










cpa,cpb,cpc,cca,ccb,ccc,cna,cnb,cnc: IN std_logic_vector(7 downto 0);
pl_a,pl_b,pl_c,cl_a,cl_b,cl_c,nl_a,nl_b,nl_c: IN std_logic_vector(7 downto 0);
kern_tot: IN std_logic_vector(11 downto 0);
d_out: OUT std_logic_vector(7 downto 0));
END filt;
ARCHITECTURE bhv of filt IS
SIGNAL hsync1,hsync2,hsync3: std_logic;
SIGNAL pla_prod,plb_prod,plc_prod: std_logic_vector(19 downto 0);
SIGNAL cla_prod,clb_prod,clc_prod: std_logic_vector(19 downto 0);
SIGNAL nla_prod,nlb_prod,nlc_prod: std_logic_vector(19 downto 0);
SIGNAL pl_sum,cl_sum,nl_sum: std_logic_vector(19 downto 0);
SIGNAL pcn_sum: std_logic_vector(19 downto 0);
-- Function: Convolution Filter Multiplication
-- INPUTS:  a  8 bit value in sign/mag format with possible values
--             from -127 to 127.
--          b  8 bit positive value.
-- RETURNS: c  The 20 bit product of a and b in 2's complement format.
FUNCTION cf_mult(a,b:std_logic_vector(7 downto 0))
RETURN std_logic_vector IS
  VARIABLE aw: std_logic_vector(7 downto 0);
  VARIABLE work: std_logic_vector(15 downto 0);
  VARIABLE c: std_logic_vector(19 downto 0);
BEGIN





-- if a (the coefficient) is negative, convert result to 2's complement.
IF a(7) = '1' THEN
  work := NOT(work) + "0000000000000001";
END IF;
c(15 downto 0) := work;
-- sign extend the result





-- Function: Convolution Filter Division
-- INPUTS:  a  20 bit sum of products result in 2's comp format
--          b  12 bit kernel total in 2's comp format
-- RETURNS: c  The 8 bit quotient of a divided by b.
FUNCTION cf_div(a:std_logic_vector(19 downto 0); b:std_logic_vector(11 downto 0))
RETURN std_logic_vector IS
  VARIABLE aw,work: std_logic_vector(19 downto 0);
  VARIABLE bw: std_logic_vector(11 downto 0);
  VARIABLE c: std_logic_vector(7 downto 0);
BEGIN




ASSERT (NOW = 0 ns)





AND b(11) = '0') OR (a(19) = '0' AND b(11) = '1') THEN
  c := "00000000"; -- negative results
ELSIF (a(19) = '1' AND b(11) = '1') THEN
  aw := NOT(a) + "00000000000000000001";
  bw := NOT(b) + "000000000001";
  work := aw / bw;




work := a / b;















ELSIF clk'EVENT AND clk = '1' THEN
  hsync1 <= hsync AFTER DELAY1;
  hsync2 <= hsync1 AFTER DELAY1;
































ELSIF clk'EVENT AND clk = '1' THEN
  IF hsync = '1' THEN
    pla_prod <= cf_mult(cpa,pl_a) AFTER DELAY1;
    plb_prod <= cf_mult(cpb,pl_b) AFTER DELAY1;
    plc_prod <= cf_mult(cpc,pl_c) AFTER DELAY1;
    cla_prod <= cf_mult(cca,cl_a) AFTER DELAY1;
    clb_prod <= cf_mult(ccb,cl_b) AFTER DELAY1;
    clc_prod <= cf_mult(ccc,cl_c) AFTER DELAY1;
    nla_prod <= cf_mult(cna,nl_a) AFTER DELAY1;
    nlb_prod <= cf_mult(cnb,nl_b) AFTER DELAY1;





pl_sum <= pla_prod + plb_prod + plc_prod AFTER DELAY1;
cl_sum <= cla_prod + clb_prod + clc_prod AFTER DELAY1;





pcn_sum <= nl_sum + cl_sum + pl_sum AFTER DELAY1;
END IF;
IF hsync3 = '1' THEN
















cpu_a: IN std_logic_vector(8 downto 0);
hm_img_d: IN std_logic_vector(31 downto 0);
cpu_d: INOUT std_logic_vector(7 downto 0);
cf_vsync,cf_hsync,cf_fifo_err: OUT std_logic;
cf_img_d: OUT std_logic_vector(31 downto 0));
END conv_filt;




SIGNAL cd_img_d,n_ext,curr,prev,img_dout: std_logic_vector(31 downto 0);
SIGNAL prev_a,prev_b,prev_c,prev_d,prev_e,prev_f: std_logic_vector(7 downto 0);
SIGNAL curr_a,curr_b,curr_c,curr_d,curr_e,curr_f: std_logic_vector(7 downto 0);
SIGNAL next_a,next_b,next_c,next_d,next_e,next_f: std_logic_vector(7 downto 0);
SIGNAL cf_enable: std_logic;
SIGNAL kern_tot: std_logic_vector(11 downto 0);















cd_img_d: IN std_logic_vector(31 downto 0);
over,under: OUT std_logic;





prev,curr,n_ext: IN std_logic_vector(31 downto 0);
prev_a,prev_b,prev_c,prev_d,prev_e,prev_f: OUT std_logic_vector(7 downto 0);
curr_a,curr_b,curr_c,curr_d,curr_e,curr_f: OUT std_logic_vector(7 downto 0);





cpu_a: IN std_logic_vector(8 downto 0);
cpu_d: INOUT std_logic_vector(7 downto 0);
cf_enable: OUT std_logic;






cpa,cpb,cpc,cca,ccb,ccc,cna,cnb,cnc: IN std_logic_vector(7 downto 0);
pl_a,pl_b,pl_c,cl_a,cl_b,cl_c,nl_a,nl_b,nl_c: IN std_logic_vector(7 downto 0);
kern_tot: IN std_logic_vector(11 downto 0);




vsync_in <= hm_vsync WHEN cf_enable = '1'
            ELSE '0';


































cf_vsync <= vsync_out AFTER DELAY1 WHEN cf_enable = '1'
            ELSE hm_vsync AFTER DELAY1;
cf_hsync <= hsync_out AFTER DELAY1 WHEN cf_enable = '1'
            ELSE hm_hsync AFTER DELAY1;









-- RIT Computer engineering Master's Thesis project.
-- Created: 8/17/95







lc: OUT std_logic_vector(2 downto 0));
END ht_line_cnt;
ARCHITECTURE bhv of ht_line_cnt IS
TYPE eol_sm IS (wait_h,wait_l,pulse);
SIGNAL pres_state,next_state:eol_sm;
SIGNAL eol_pulse:std_logic;































pres_state <=wait_h AFTER DELAY1 ;





pres_state <= wait_h AFTER DELAY1;
ELSIF clk'EVENT AND clk = '1' THEN













































































a: IN std_logic_vector(2 downto 0);
rbz,wck: OUT std_logic_vector(7 downto 0));
END ht_rr_cd;



































































































































































a: IN std_logic_vector(2 downto 0);
d: INOUT std_logic_vector(7 downto 0);
val: OUT std_logic_vector(31 downto 0));
END ht_reg_row;
ARCHITECTURE netlist of ht_reg_row IS
SIGNAL firstz: std_logic;
SIGNAL rbz,wck:std_logic_vector(7 downto 0);













d: INOUT std_logic_vector(7 downto 0);




a,b: IN std_logic_vector(7 downto 0);


























PORT MAP (firstz,r4d,r0d,val(31 downto 24));
MUX2: mux_2to1_8
PORT MAP (firstz,r5d,r1d,val(23 downto 16));
MUX3: mux_2to1_8
PORT MAP (firstz,r6d,r2d,val(15 downto 8));
MUX4: mux_2to1_8









cpu_a: IN std_logic_vector(8 downto 0);
cf_img_d: IN std_logic_vector(31 downto 0);




ARCHITECTURE mixed of halftone IS
SIGNAL vsync_out,hsync_out,ht_enable:std_logic;
SIGNAL r_selz,ht_cd_reg,thresh:std_logic_vector(7 downto 0);
SIGNAL lc: std_logic_vector(2 downto 0);
SIGNAL val_a,val_b,val_c,val_d,val_e,val_f,val_g,val_h:std_logic_vector(31 downto 0);
SIGNAL ht_out,val,img_dout:std_logic_vector(31 downto 0);
COMPONENT ht_line_cnt
PORT(clk,clrz,vsync,hsync: IN std_logic;





a: IN std_logic_vector(2 downto 0);
d: INOUT std_logic_vector(7 downto 0);
val: OUT std_logic_vector(31 downto 0));
END COMPONENT;
COMPONENT mux_8to1_32





































OR csz = '1' OR cpu_a(8 downto 6) /= "001") ELSE
"11111110"





























































































ELSIF clk'EVENT AND clk = '1' THEN
  vsync_out <= cf_vsync AFTER DELAY1;





val <= ZERO_32 AFTER DELAY1 ;
img_dout <= ZERO_32 AFTER DELAY1 ;
ELSIF clk'EVENT AND clk = '1' THEN
  IF cf_vsync = '1' AND cf_hsync = '1' THEN
    img_dout <= cf_img_d AFTER DELAY1;
    val <= ht_out AFTER DELAY1;
ELSE
img_dout <= ZERO_32 AFTER DELAY1 ;





ht_vsync <= vsync_out AFTER DELAY1 WHEN ht_enable = '1'
            ELSE cf_vsync AFTER DELAY1;
ht_hsync <= hsync_out AFTER DELAY1 WHEN ht_enable = '1' ELSE cf_hsync AFTER DELAY1;
ht_img_d <= img_dout AFTER DELAY1 WHEN ht_enable = '1' ELSE cf_img_d AFTER DELAY1;
ht_val <= val AFTER DELAY1 WHEN ht_enable = '1' ELSE
          (thresh & thresh & thresh & thresh) AFTER DELAY1;
END mixed;
-- File: ed_blk.vhd
-- Error Diffusion Block.
-- Mike Kelly
-- RIT Computer engineering Master's Thesis project.
-- Created: 8/17/95
-- Last Revision: 11/14/95
-- This revision changes the implementation to do all calculations in
-- one clock cycle. This fixes a problem with the first implementation








cpu_a: IN std_logic_vector(8 downto 0);




ARCHITECTURE bhv of ed_cpu IS









cpu_d WHEN cpu_wrz'EVENT AND cpu_wrz = '1' AND
           csz = '0'

















































cpu_d WHEN cpu_wrz'EVENT AND cpu_wrz = '1' AND
           csz = '0'



































mi <= mi_reg(3 downto 0);
mj <= mj_reg(3 downto 0);
mk <= mk_reg(3 downto 0);











ARCHITECTURE bhv of ed_cd IS







































































ELSIF clk'EVENT AND clk = '1' THEN
  pres_state <= next_state;
  vs1 <= ht_vsync;


























err_in: IN std_logic_vector(35 downto 0);
val_out,ed_dout: OUT std_logic_vector(31 downto 0);
err_out: OUT std_logic_vector(35 downto 0));
END ed_process;
ARCHITECTURE mixed of ed_process IS
SIGNAL en_3d: std_logic;
SIGNAL era_i,era_j,era_k,era_l: std_logic_vector(8 downto 0);
SIGNAL erb_i,erb_j,erb_k,erb_l: std_logic_vector(8 downto 0);
SIGNAL erc_i,erc_j,erc_k,erc_l: std_logic_vector(8 downto 0);
SIGNAL erd_i,erd_j,erd_k,erd_l: std_logic_vector(8 downto 0);
SIGNAL era1_i,era1_j,era2_j,era1_k,era1_l: std_logic_vector(8 downto 0);
SIGNAL erb1_j,erb1_k,erb1_l: std_logic_vector(8 downto 0);
SIGNAL erc1_j,erc1_k,erc1_l: std_logic_vector(8 downto 0);
SIGNAL erd1_j,erd1_k: std_logic_vector(8 downto 0);
SIGNAL sum_a,sum_b,sum_c,sum_d: std_logic_vector(9 downto 0);
SIGNAL val_3d,id_3d,result: std_logic_vector(31 downto 0);
SIGNAL err: std_logic_vector(35 downto 0);
-- Function: Error diffusion signed multiplication
--   e_in is a 9 bit value from the fifo or the previous
--   pixel's error calculator. It is in 2's complement
--   notation.
--   mult is a 4 bit positive value.
--   prod is returned and is a 9 bit value. It is in 2's
--   complement notation.
--   This function multiplies the 2 inputs together and divides
--   the result by 16 by dropping the lower 4 result bits.
FUNCTION ed_mult(e_in:std_logic_vector(8 downto 0);
                 mult:std_logic_vector(3 downto 0))
RETURN std_logic_vector IS
  VARIABLE e_in_v,prod: std_logic_vector(8 downto 0);
  VARIABLE prod_v: std_logic_vector(12 downto 0);
BEGIN
  IF (NOT(valid_vec(e_in)) OR NOT(valid_vec(mult))) THEN
    ASSERT(NOW = 0 ns)




















-- Function: Error diffusion signed addition
-- din is an 8 bit value and is always positive
-- erri,errj,errk,errl are 9 bit values and are already
-- in 2's complement notation from the ed_mult function.
-- sum is output in 2's complement notation.
FUNCTION ed_add(din:std_logic_vector(7 downto 0);
erri,errj,errk,errl:std_logic_vector(8 downto 0))
RETURN std_logic_vector IS
VARIABLE dinval,sum:std_logic_vector(9 downto 0);






ASSERT(NOW = 0 ns)
REPORT "ed_add warning: X in input vector."
SEVERITY WARNING;
sum(7 downto 0) := din;
sum(9 downto 8) := "00";
ELSE
-- convert to 10 bit values
dinval(7 downto 0) := din(7 downto 0);
dinval(9 downto 8) := "00";
eriv(8 downto 0) := erri;
erjv(8 downto 0) := errj;
erkv(8 downto 0) := errk;




erjv(9) := errj(8); -- sign extension
erkv(9) := errk(8); -- sign extension
erlv(9) := errl(8); -- sign extension




-- Function: Error diffusion error calculator
-- din is a 10 bit value in 2's complement notation.
-- thresh is an 8 bit positive value.
-- enable is a single bit control signal.
-- err is returned and is a 9 bit value. It is in 2's
-- complement notation.




VARIABLE din_v:std_logic_vector(9 downto 0);
VARIABLE err:std_logic_vector(8 downto 0);
BEGIN
IF NOT(enable = '1') THEN
err := "000000000";
ELSIF (NOT(valid_vec(din)) OR NOT(valid_vec(thresh))) THEN
ASSERT(NOW = 0 ns)





IF din(9) = '1' THEN -- din is negative.
IF din(8) = '0' THEN -- din < (-256)
err := "100000001"; -- err = (-255)
ELSIF din(8 downto 0) = "100000000" THEN -- din = (-256)
err := "100000001"; -- err = (-255)
ELSE -- (-255) < din < 0
err := din(8 downto 0);
END IF;
ELSIF din(8) = '1' THEN -- 255 < din <= 510
err(7 downto 0) := din(7 downto 0) + "00000001";
err(8) := '0';
ELSIF din(7 downto 0) > thresh THEN -- thresh < din <= 255
err := "011111111" - din(8 downto 0); -- 255 - din is the magnitude
err := NOT(err) + "000000001";
ELSE -- 0 <= din <= thresh






-- Function: Error diffusion output limit
-- sum_in is a 10 bit value in 2's complement notation.
-- enable is a single bit control signal.
-- result is returned and is a positive 8 bit value.








OR NOT(enable = '1') THEN
result := "00000000";
ELSIF sum_in(8) = '1' THEN
result := "11111111";
ELSE





-- Calculate the error fractions.
era_i <= ed_mult(err(8 downto 0),mi);
era_j <= ed_mult(err_in(8 downto 0),mj);
era_k <= ed_mult(err_in(35 downto 27),mk);
era_l <= ed_mult(err_in(26 downto 18),ml);
erb_i <= ed_mult(err(35 downto 27),mi);
erb_j <= ed_mult(err_in(35 downto 27),mj);
erb_k <= ed_mult(err_in(26 downto 18),mk);
erb_l <= ed_mult(err_in(17 downto 9),ml);
erc_i <= ed_mult(err(26 downto 18),mi);
erc_j <= ed_mult(err_in(26 downto 18),mj);
erc_k <= ed_mult(err_in(17 downto 9),mk);
erc_l <= ed_mult(err_in(8 downto 0),ml);
erd_i <= ed_mult(err(17 downto 9),mi);
erd_j <= ed_mult(err_in(17 downto 9),mj);
erd_k <= ed_mult(err_in(8 downto 0),mk);
erd_l <= ed_mult(err_in(35 downto 27),ml);
-- Calculate the sum of the data and errors.
sum_a <= ed_add(id_3d(31 downto 24),era1_i,era2_j,era1_k,era1_l);
sum_b <= ed_add(id_3d(23 downto 16),erb_i,erb1_j,erb1_k,erb1_l);
sum_c <= ed_add(id_3d(15 downto 8),erc_i,erc1_j,erc1_k,erc1_l);
sum_d <= ed_add(id_3d(7 downto 0),erd_i,erd1_j,erd1_k,erd_l);
-- Calculate the new error for the next line.
err(35 downto 27) <= ed_err_calc(sum_a,val_3d(31 downto 24),en_3d);
err(26 downto 18) <= ed_err_calc(sum_b,val_3d(23 downto 16),en_3d);
err(17 downto 9) <= ed_err_calc(sum_c,val_3d(15 downto 8),en_3d);
err(8 downto 0) <= ed_err_calc(sum_d,val_3d(7 downto 0),en_3d);
-- Calculate the error diffusion output.
result(31 downto 24) <= ed_limit(sum_a,en_3d);
result(23 downto 16) <= ed_limit(sum_b,en_3d);
result(15 downto 8) <= ed_limit(sum_c,en_3d);
result(7 downto 0) <= ed_limit(sum_d,en_3d);
-- Latch input, intermediate results, and output.
P0:PROCESS(clk,resetz)
BEGIN


































































cpu_a: IN std_logic_vector(8 downto 0);
ht_img_d,ht_val:IN std_logic_vector(31 downto 0);










SIGNAL ed_cd_reg:std_logic_vector(7 downto 0);
SIGNAL mi,mj,mk,ml:std_logic_vector(3 downto 0);
SIGNAL id_1d,id_2d,val_1d,val_2d:std_logic_vector(31 downto 0);
SIGNAL ed_dout,val_out:std_logic_vector(31 downto 0);
SIGNAL err_in,err_out:std_logic_vector(35 downto 0);
COMPONENT ed_cpu
PORT(cpu_wrz,cpu_rdz,csz,resetz:IN std_logic;
cpu_a: IN std_logic_vector(8 downto 0);














d: IN std_logic_vector(35 downto 0);
over,under:OUT std_logic;






err_in: IN std_logic_vector(35 downto 0);
val_out,ed_dout:OUT std_logic_vector(31 downto 0);






































ed_vsync <= vsync_out AFTER DELAY1 WHEN ed_enable = '1' ELSE ht_vsync AFTER DELAY1;
ed_hsync <= hsync_out AFTER DELAY1 WHEN ed_enable = '1' ELSE ht_hsync AFTER DELAY1;
ed_img_d <= ed_dout AFTER DELAY1 WHEN ed_enable = '1' ELSE ht_img_d AFTER DELAY1;
ed_thresh <= val_out AFTER DELAY1 WHEN ed_enable = '1' ELSE ht_val AFTER DELAY1;
END mixed;
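The fixed-point arithmetic described in the ed_process function comments can be cross-checked against a small software model. The Python sketch below is not part of the thesis code; it models ed_mult, ed_add, ed_err_calc and ed_limit on plain integers. Several points are assumptions, because the corresponding parts of the listing are not reproduced here: that dropping the lower 4 product bits acts on the two's complement product (a floor division by 16), that ed_add wraps to 10 bits, that ed_err_calc returns din unchanged when 0 <= din <= thresh, and that ed_limit passes in-range sums through unchanged. The helper name to_signed is hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def ed_mult(e_in, mult):
    """Scale a 9-bit signed error by mult/16, where mult is 0..15."""
    assert -256 <= e_in <= 255 and 0 <= mult <= 15
    # Assumption: the low 4 bits of the two's complement product are
    # dropped, which is a floor division by 16 (Python's >> also floors).
    return to_signed((e_in * mult) >> 4, 9)


def ed_add(din, erri, errj, errk, errl):
    """Add an unsigned 8-bit pixel and four signed 9-bit errors."""
    assert 0 <= din <= 255
    # Assumption: the 10-bit result wraps on overflow.
    return to_signed(din + erri + errj + errk + errl, 10)


def ed_err_calc(din, thresh, enable=True):
    """Return the 9-bit signed error for a 10-bit signed sum."""
    assert -512 <= din <= 510 and 0 <= thresh <= 255
    if not enable:
        return 0
    if din < 0:
        return max(din, -255)      # negative sums saturate at -255
    if din > 255 or din > thresh:
        return din - 255           # pixel is output as 255
    return din                     # assumption: pixel is output as 0


def ed_limit(sum_in, enable=True):
    """Clamp a 10-bit signed sum to an unsigned 8-bit pixel value."""
    if not enable or sum_in < 0:
        return 0
    return min(sum_in, 255)        # assumption for the in-range case


# Split an error of 45 using the multipliers written by the test bench
# (1, 3, 5 and 7, which sum to 16).
print([ed_mult(45, m) for m in (1, 3, 5, 7)])  # [2, 8, 14, 19]
```

In this example the four fractions add up to 43 rather than 45, which illustrates the small loss introduced by truncating each product separately.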
-- File: th_blk.vhd
-- Threshold block of the emkipc chip.
-- Mike Kelly













ARCHITECTURE bhv of thresh IS
SIGNAL cmp_da,cmp_db,cmp_dc,cmp_dd:std_logic_vector(7 downto 0);
SIGNAL out_d: std_logic_vector(31 downto 0);
BEGIN
-- When the thresholder is enabled, a "1" output is a 255 and a "0" output
-- is an eight bit zero.
cmp_da <= "11111111" WHEN ed_img_d(31 downto 24) > ed_thresh(31 downto 24) AND
ed_vsync = '1' AND ed_hsync = '1' ELSE "00000000";
cmp_db <= "11111111" WHEN ed_img_d(23 downto 16) > ed_thresh(23 downto 16) AND
ed_vsync = '1' AND ed_hsync = '1' ELSE "00000000";
cmp_dc <= "11111111" WHEN ed_img_d(15 downto 8) > ed_thresh(15 downto 8) AND
ed_vsync = '1' AND ed_hsync = '1' ELSE "00000000";
cmp_dd <= "11111111" WHEN ed_img_d(7 downto 0) > ed_thresh(7 downto 0) AND
ed_vsync = '1' AND ed_hsync = '1' ELSE "00000000";
-- The output mux determines whether gray data or the comparator output is sent to the
-- device output pins when vsync and hsync are active.
out_d(31 downto 24) <= "00000000"





cmp_da AFTER DELAY1 WHEN th_enable = '1' ELSE
ed_img_d(31 downto 24) AFTER DELAY1;
out_d(23 downto 16) <= "00000000"





cmp_db AFTER DELAY1 WHEN th_enable = '1' ELSE
ed_img_d(23 downto 16) AFTER DELAY1;
out_d(15 downto 8) <= "00000000"





cmp_dc AFTER DELAY1 WHEN th_enable = '1' ELSE
ed_img_d(15 downto 8) AFTER DELAY1;
out_d(7 downto 0) <= "00000000"





cmp_dd AFTER DELAY1 WHEN th_enable = '1' ELSE
ed_img_d(7 downto 0) AFTER DELAY1;














ELSIF clk'EVENT AND clk = '1' THEN
out_img_d <= out_d AFTER DELAY1;
out_vsync <= ed_vsync AFTER DELAY1;





Appendix C - Test Bench VHDL Code
-- File: test_dev.vhd
-- This file contains the test bench for the emkipc chip.
-- Mike Kelly










out_img_d: OUT std_logic_vector(31 downto 0));
END emkipc_test;
ARCHITECTURE bhv of emkipc_test IS
SIGNAL in_vsync,in_hsync,clk,resetz:std_logic := '0';
SIGNAL cpu_rdz,cpu_wrz,csz:std_logic := '1';
SIGNAL in_img_d:std_logic_vector(31 downto 0) := X_32;
SIGNAL cpu_a: std_logic_vector(8 downto 0);
SIGNAL cpu_d: std_logic_vector(7 downto 0);
SIGNAL int_hsync:std_logic := '0';
SIGNAL int_out_vsync:std_logic; -- mirrors out_vsync
SIGNAL init_regs:boolean := FALSE;
CONSTANT HALF_CLK: TIME := 10 ns;
CONSTANT PERIOD: TIME := 20 ns;







cpu_a: IN std_logic_vector(8 downto 0);
cpu_d: INOUT std_logic_vector(7 downto 0);
in_img_d: IN std_logic_vector(31 downto 0);
out_vsync,out_hsync:OUT std_logic;
cf_fifo_err,ed_fifo_err:OUT std_logic;
out_img_d: OUT std_logic_vector(31 downto 0));
END COMPONENT;
BEGIN
resetz <= '0','1' AFTER 50 ns;
clk <= NOT clk AFTER HALF_CLK;
in_hsync <= int_hsync'DELAYED(4 * PERIOD);
out_vsync <= int_out_vsync;
P1:PROCESS -- Load the processing parameters using the cpu port.
BEGIN
WAIT for (5 * PERIOD); -- Wait for reset to complete.
-- Write the convolution filter coefficient registers.
WAIT for CPU_CYCLE;
FOR i IN 16 to 31 LOOP






-- Read back the registers.




-- Write to the error diffusion multiplier registers.
FOR i IN 32 to 35 LOOP -- 20 hex to 23 hex
IF i = 32 THEN
cpu_write(int_vec(i,cpu_a),"00000001",cpu_a,cpu_d,csz,cpu_wrz);
ELSIF i = 33 THEN
cpu_write(int_vec(i,cpu_a),"00000011",cpu_a,cpu_d,csz,cpu_wrz);
ELSIF i = 34 THEN
cpu_write(int_vec(i,cpu_a),"00000101",cpu_a,cpu_d,csz,cpu_wrz);
ELSE
cpu_write(int_vec(i,cpu_a),"00000111",cpu_a,cpu_d,csz,cpu_wrz);
END IF;
END LOOP;
-- Read back the multiplier registers.









FOR i IN 64 to 127 LOOP
cpu_write(int_vec(i,cpu_a),int_vec((4 * (i - 64)),cpu_d),cpu_a,cpu_d,csz,cpu_wrz);
END LOOP;
-- Read back the dither matrix.
FOR i IN 64 to 127 LOOP
cpu_read(int_vec(i,cpu_a),cpu_a,csz,cpu_rdz);
END LOOP;
WAIT for CPU_CYCLE;
-- Write to all 256 addresses of the histmod LUTs.
-- This data inverts the video (0 => 255, 1 => 254, ... 255 => 0).
FOR i IN 256 to 511 LOOP
cpu_write(int_vec(i,cpu_a),int_vec((511 - i),cpu_d),cpu_a,cpu_d,csz,cpu_wrz);
END LOOP;
-- Read back all 256 addresses of LUT 0.





-- Let the first image complete with no processing.
WAIT until int_out_vsync = '1';
WAIT until int_out_vsync = '0';
WAIT for (2 * PERIOD);
-- Write to the histogram modification control register to set hm_enable
cpu_write("000000000","00000001",cpu_a,cpu_d,csz,cpu_wrz);
cpu_read("000000000",cpu_a,csz,cpu_rdz);
-- Let the second image complete with histogram modification.
WAIT until int_out_vsync = '1';
WAIT until int_out_vsync = '0';
WAIT for (2 * PERIOD);
-- Write to the histogram modification control register to clear hm_enable
cpu_write("000000000","00000000",cpu_a,cpu_d,csz,cpu_wrz);
-- Write to the convolution filter control register to set cf_enable
cpu_write("000000100","00000001",cpu_a,cpu_d,csz,cpu_wrz);
cpu_read("000000100",cpu_a,csz,cpu_rdz);
-- Let the third image complete with convolution filtering.
WAIT until int_out_vsync = '1';
WAIT until int_out_vsync = '0';
WAIT for (2 * PERIOD);
-- Write to the convolution filter control register to clear cf_enable
cpu_write("000000100","00000000",cpu_a,cpu_d,csz,cpu_wrz);
-- Write to the halftone control register to set ht_enable
cpu_write("000001000","00000001",cpu_a,cpu_d,csz,cpu_wrz);
cpu_read("000001000",cpu_a,csz,cpu_rdz);
-- Let the fourth image complete with halftoning (thresh disabled).
WAIT until int_out_vsync = '1';
WAIT until int_out_vsync = '0';
WAIT for (2 * PERIOD);
-- Write to the error diffusion control register to set th_enable
cpu_write("000001100","00000010",cpu_a,cpu_d,csz,cpu_wrz);
cpu_read("000001100",cpu_a,csz,cpu_rdz);
-- Let the fifth image complete with halftoning (thresh enabled).
WAIT until int_out_vsync = '1';
WAIT until int_out_vsync = '0';
WAIT for (2 * PERIOD);
-- Write to the halftone control register to clear ht_enable
cpu_write("000001000","00000000",cpu_a,cpu_d,csz,cpu_wrz);
-- Write to the error diffusion control register to set ed_enable and clear th_enable
cpu_write("000001100","00000001",cpu_a,cpu_d,csz,cpu_wrz);
cpu_read("000001100",cpu_a,csz,cpu_rdz);
-- Let the sixth image complete with error diffusion (thresh disabled).
WAIT until int_out_vsync = '1';
WAIT until int_out_vsync = '0';
WAIT for (2 * PERIOD);
-- Write to the error diffusion control register to set th_enable




P2:PROCESS -- Continuously cycle in_hsync
BEGIN




WAIT for HSYNC_CYCLE - PERIOD;
END PROCESS P2;
P3:PROCESS -- Generate vsync and image data.
BEGIN
WAIT until init_regs;
-- Run a small test image (2 lines by 16 words) with nothing enabled
in_vsync <= '1';
FOR i IN 0 to 1 LOOP
WAIT for (4 * PERIOD);
FOR j IN 0 to 15 LOOP
IF j > 7 THEN
in_img_d(31 downto 24) <= "10001000";
in_img_d(23 downto 16) <= "10011001";
in_img_d(15 downto 8) <= "10101010";
in_img_d(7 downto 0) <= "10111011";
ELSE
in_img_d(31 downto 24) <= "00010001";
in_img_d(23 downto 16) <= "00100010";
in_img_d(15 downto 8) <= "00110011";
in_img_d(7 downto 0) <= "01000100";
END IF;








-- Run a small test image (2 lines by 16 words) with hist_mod enabled
in_vsync <= '1';
FOR i IN 0 to 1 LOOP
WAIT for (4 * PERIOD);
FOR j IN 0 to 15 LOOP
IF j > 7 THEN
in_img_d(31 downto 24) <= "10001000";
in_img_d(23 downto 16) <= "10011001";
in_img_d(15 downto 8) <= "10101010";
in_img_d(7 downto 0) <= "10111011";
ELSE
in_img_d(31 downto 24) <= "00010001";
in_img_d(23 downto 16) <= "00100010";
in_img_d(15 downto 8) <= "00110011";









-- Run a small test image (3 lines by 16 words) with conv_filt enabled
in_vsync <= '1';
FOR i IN 0 to 2 LOOP
WAIT for (4 * PERIOD);
FOR j IN 0 to 15 LOOP
IF j > 7 THEN
in_img_d(31 downto 24) <= "10001000";
in_img_d(23 downto 16) <= "10011001";
in_img_d(15 downto 8) <= "10101010";
in_img_d(7 downto 0) <= "10111011";
ELSE
in_img_d(31 downto 24) <= "00010001";
in_img_d(23 downto 16) <= "00100010";
in_img_d(15 downto 8) <= "00110011";





WAIT for (HSYNC_CYCLE - (20 * PERIOD));
END LOOP;
in_vsync <= '0';
WAIT for (2 * HSYNC_CYCLE);
-- Run a small test image (2 lines by 16 words) with halftoning enabled
in_vsync <= '1';
FOR i IN 0 to 1 LOOP
WAIT for (4 * PERIOD);
FOR j IN 0 to 15 LOOP
IF j > 7 THEN
in_img_d(31 downto 24) <= "10001000";
in_img_d(23 downto 16) <= "10011001";
in_img_d(15 downto 8) <= "10101010";
in_img_d(7 downto 0) <= "10111011";
ELSE
in_img_d(31 downto 24) <= "00010001";
in_img_d(23 downto 16) <= "00100010";
in_img_d(15 downto 8) <= "00110011";









-- Run a small test image (2 lines by 16 words) with halftoning and thresh enabled
in_vsync <= '1';
FOR i IN 0 to 1 LOOP
WAIT for (4 * PERIOD);
FOR j IN 0 to 15 LOOP
IF j > 7 THEN
in_img_d(31 downto 24) <= "10001000";
in_img_d(23 downto 16) <= "10011001";
in_img_d(15 downto 8) <= "10101010";
in_img_d(7 downto 0) <= "10111011";
ELSE
in_img_d(31 downto 24) <= "00010001";
in_img_d(23 downto 16) <= "00100010";
in_img_d(15 downto 8) <= "00110011";











-- Run a small test image (3 lines by 16 words) with error diffusion enabled
in_vsync <= '1';
FOR i IN 0 to 2 LOOP
WAIT for (4 * PERIOD);
FOR j IN 0 to 15 LOOP
IF j > 7 THEN
in_img_d(31 downto 24) <= "10001000";
in_img_d(23 downto 16) <= "10011001";
in_img_d(15 downto 8) <= "10101010";
in_img_d(7 downto 0) <= "10111011";
ELSE
in_img_d(31 downto 24) <= "00010001";
in_img_d(23 downto 16) <= "00100010";
in_img_d(15 downto 8) <= "00110011";





WAIT for (HSYNC_CYCLE - (20 * PERIOD));
END LOOP;
in_vsync <= '0';
WAIT for (2 * HSYNC_CYCLE);
-- Run a small test image (3 lines by 16 words) with error diffusion and thresh enabled
in_vsync <= '1';
FOR i IN 0 to 2 LOOP
WAIT for (4 * PERIOD);
FOR j IN 0 to 15 LOOP
IF j > 7 THEN
in_img_d(31 downto 24) <= "10001000";
in_img_d(23 downto 16) <= "10011001";
in_img_d(15 downto 8) <= "10101010";
in_img_d(7 downto 0) <= "10111011";
ELSE
in_img_d(31 downto 24) <= "00010001";
in_img_d(23 downto 16) <= "00100010";
in_img_d(15 downto 8) <= "00110011";





WAIT for (HSYNC_CYCLE - (20 * PERIOD));
END LOOP;
in_vsync <= '0';










-- File: test_img.vhd
-- This file contains the test bench for the emkipc chip.
-- Text I/O is used to initialize the device registers and
-- to read in a test image and write out the processed result.
-- Mike Kelly









ARCHITECTURE bhv of emkipc_test IS
-- Chip Inputs
SIGNAL in_vsync,in_hsync,clk,resetz:std_logic := '0';
SIGNAL cpu_rdz,cpu_wrz,csz:std_logic := '1';
SIGNAL in_img_d:std_logic_vector(31 downto 0) := X_32;
SIGNAL cpu_a: std_logic_vector(8 downto 0);




SIGNAL out_img_d:std_logic_vector(31 downto 0);
-- Internal control signals
SIGNAL init_regs:boolean := FALSE;
SIGNAL pixels,lines,words:integer := 0;
CONSTANT HALF_CLK: TIME := 10 ns;
CONSTANT PERIOD: TIME := 20 ns;
FILE cf_val_file:text IS IN "/home/stu2/emk6904/vhdl/emkipc/text/cf_vals.txt";
FILE hm_val_file:text IS IN "/home/stu2/emk6904/vhdl/emkipc/text/hm_vals.txt";
FILE ht_val_file:text IS IN "/home/stu2/emk6904/vhdl/emkipc/text/ht_vals.txt";
FILE ed_val_file:text IS IN "/home/stu2/emk6904/vhdl/emkipc/text/ed_vals.txt";
FILE cd_val_file:text IS IN "/home/stu2/emk6904/vhdl/emkipc/text/cd_reg.txt";
FILE in_img_file:text IS IN "/home/stu2/emk6904/vhdl/emkipc/text/girlin.txt";





cpu_a: IN std_logic_vector(8 downto 0);
cpu_d: INOUT std_logic_vector(7 downto 0);
in_img_d: IN std_logic_vector(31 downto 0);
out_vsync,out_hsync:OUT std_logic;
cf_fifo_err,ed_fifo_err:OUT std_logic;







resetz <= '0','1' AFTER 50 ns;
clk <= NOT clk AFTER HALF_CLK;




WAIT for (5 * PERIOD); -- Wait for reset to complete.
-- Write the convolution filter coefficient registers.
WAIT for CPU_CYCLE;





-- Read back the convolution filter coefficient registers.
WAIT for CPU_CYCLE;
FOR i IN 16 to 31 LOOP
cpu_read(int_vec(i,cpu_a),cpu_a,csz,cpu_rdz);
END LOOP;
-- Write to the error diffusion multiplier registers.






FOR i IN 64 to 127 LOOP






-- Write to all 256 addresses of the histmod LUTs.
-- This data inverts the video (0 => 255, 1 => 254, ... 255 => 0).
FOR i IN 256 to 511 LOOP






-- Read the control register values into the line buffer. 5 values, one line.
READLINE(cd_val_file, p1buf);
-- Load the hm_cd_reg register, address = 000 hex.
READ(p1buf, ftemp);
cpu_write("000000000",int_vec(ftemp,cpu_d),cpu_a,cpu_d,csz,cpu_wrz);
-- Load the cf_cd_reg register, address = 004 hex.
READ(p1buf, ftemp);
cpu_write("000000100",int_vec(ftemp,cpu_d),cpu_a,cpu_d,csz,cpu_wrz);
-- Load the ht_cd_reg register, address = 008 hex.
READ(p1buf, ftemp);
cpu_write("000001000",int_vec(ftemp,cpu_d),cpu_a,cpu_d,csz,cpu_wrz);
-- Load the threshold register, address = 009 hex.
READ(p1buf, ftemp);
cpu_write("000001001",int_vec(ftemp,cpu_d),cpu_a,cpu_d,csz,cpu_wrz);





















-- Process the image.
in_vsync <= '1';
FOR i IN 1 to lpi LOOP
WAIT for (4 * PERIOD);
in_hsync <= '1';






in_img_d(31 downto 24) <= int_vec(apix,in_img_d(31 downto 24));
in_img_d(23 downto 16) <= int_vec(bpix,in_img_d(23 downto 16));
in_img_d(15 downto 8) <= int_vec(cpix,in_img_d(15 downto 8));





WAIT for (4 * PERIOD);
END LOOP;
in_vsync <= '0';
-- Continue to cycle hsync to finish out processing.
FOR i IN 0 TO 7 LOOP
WAIT for (4 * PERIOD);
in_hsync <= '1';
WAIT for (wpl * PERIOD);
in_hsync <= '0';




P3:PROCESS -- Capture the processed image to file.
VARIABLE ftemp,ppl,lpi:integer;
VARIABLE p3buf:line;











WAIT UNTIL out_vsync = '1';
WHILE out_vsync = '1' LOOP
WAIT UNTIL clk'EVENT AND clk = '1';
IF out_vsync = '1' AND out_hsync = '1' THEN
ftemp := vec_int(out_img_d(31 downto 24));
WRITE(p3buf, ftemp);
WRITE(p3buf, tab);
ftemp := vec_int(out_img_d(23 downto 16));
WRITE(p3buf, ftemp);
WRITE(p3buf, tab);
ftemp := vec_int(out_img_d(15 downto 8));
WRITE(p3buf, ftemp);
WRITE(p3buf, tab);













-- File: hm_test.vhd
-- This file contains the test bench for the Histogram Modification block of the emkipc chip.
-- Mike Kelly











ARCHITECTURE bhv of hist_mod_test IS
SIGNAL resetz,clk,in_vsync,in_hsync:std_logic := '0';
SIGNAL csz,cpu_wrz,cpu_rdz:std_logic := '1';
SIGNAL cpu_d: std_logic_vector(7 downto 0);
SIGNAL cpu_a: std_logic_vector(8 downto 0);
SIGNAL in_img_d:std_logic_vector(31 downto 0) := X_32;
CONSTANT HALF_CLK: TIME := 10 ns;






cpu_a: IN std_logic_vector(8 downto 0);
in_img_d:IN std_logic_vector(31 downto 0);





resetz <= '0','1' AFTER 50 ns;




WAIT for (5 * PERIOD); -- Wait for reset to complete.
-- Write to all 256 addresses of all 4 LUTs.
FOR i IN 256 to 511 LOOP
cpu_write(int_vec(i,cpu_a),int_vec((511 - i),cpu_d),cpu_a,cpu_d,csz,cpu_wrz);
END LOOP;
-- Read back all 256 addresses of LUT 0.
FOR i IN 256 to 511 LOOP
cpu_read(int_vec(i,cpu_a),cpu_a,csz,cpu_rdz);
END LOOP;
WAIT for (3 * PERIOD);
-- Run a small test image (2 lines by 16 words) with hist_mod disabled
in_vsync <= '1';
WAIT for (2 * PERIOD);




in_img_d(31 downto 24) <= int_vec((temp + 0),in_img_d(31 downto 24));
in_img_d(23 downto 16) <= int_vec((temp + 1),in_img_d(23 downto 16));
in_img_d(15 downto 8) <= int_vec((temp + 2),in_img_d(15 downto 8));





WAIT for (4 * PERIOD);
END LOOP;
WAIT for (2 * PERIOD);
in_vsync <= '0';
WAIT for (6 * PERIOD);
-- Run a small test image (2 lines by 16 words) with hist_mod enabled




-- second, run the image.
WAIT for (6 * PERIOD);
in_vsync <= '1';
WAIT for (2 * PERIOD);




in_img_d(31 downto 24) <= int_vec((temp + 0),in_img_d(31 downto 24));
in_img_d(23 downto 16) <= int_vec((temp + 1),in_img_d(23 downto 16));
in_img_d(15 downto 8) <= int_vec((temp + 2),in_img_d(15 downto 8));





WAIT for (4 * PERIOD);
END LOOP;
WAIT for (2 * PERIOD);
in_vsync <= '0';








-- File: cf_test.vhd
-- This file contains the test bench for the Convolution Filter block of the emkipc chip.
-- Mike Kelly









cf_img_d: OUT std_logic_vector(31 downto 0));
END conv_filt_test;
ARCHITECTURE bhv of conv_filt_test IS
SIGNAL resetz,clk,hm_vsync,hm_hsync:std_logic := '0';
SIGNAL int_hsync:std_logic := '0';
SIGNAL int_vsync:std_logic := '0'; -- mirrors cf_vsync
SIGNAL csz,cpu_wrz,cpu_rdz:std_logic := '1';
SIGNAL cpu_a: std_logic_vector(8 downto 0);
SIGNAL cpu_d: std_logic_vector(7 downto 0);
SIGNAL hm_img_d:std_logic_vector(31 downto 0) := X_32;
SIGNAL init_regs:boolean := FALSE;
CONSTANT HALF_CLK: TIME := 10 ns;
CONSTANT PERIOD: TIME := 20 ns;





cpu_a: IN std_logic_vector(8 downto 0);
hm_img_d: IN std_logic_vector(31 downto 0);
cpu_d: INOUT std_logic_vector(7 downto 0);
cf_vsync,cf_hsync,cf_fifo_err:OUT std_logic;
cf_img_d: OUT std_logic_vector(31 downto 0));
END COMPONENT;
BEGIN
resetz <= '0','1' AFTER 50 ns;





P1:PROCESS -- Load the filter kernel registers and read them back.
BEGIN
WAIT for (5 * PERIOD); -- Wait for reset to complete.
-- Write to the filter coefficient registers.
FOR i IN 16 to 31 LOOP






-- Read back the registers.
FOR i IN 16 to 31 LOOP
cpu_read(int_vec(i,cpu_a),cpu_a,csz,cpu_rdz);
END LOOP;
WAIT for (2 * PERIOD);
init_regs <= TRUE;
WAIT UNTIL int_vsync = '1';
WAIT UNTIL int_vsync = '0';
WAIT for (2 * PERIOD);







P2:PROCESS -- Continuously cycle hm_hsync
BEGIN




WAIT for (HSYNC_CYCLE - PERIOD);
END PROCESS P2;
P3:PROCESS -- Generate vsync and image data.
BEGIN
WAIT UNTIL init_regs;
-- Run a small test image (4 lines by 16 words) with conv_filt disabled
hm_vsync <= '1';
FOR i IN 0 to 3 LOOP
WAIT for (4 * PERIOD);
FOR j IN 0 to 15 LOOP
IF j > 7 THEN
hm_img_d(31 downto 24) <= "10001000";
hm_img_d(23 downto 16) <= "10011001";
hm_img_d(15 downto 8) <= "10101010";
hm_img_d(7 downto 0) <= "10111011";
ELSE
hm_img_d(31 downto 24) <= "00010001";
hm_img_d(23 downto 16) <= "00100010";
hm_img_d(15 downto 8) <= "00110011";









-- Run a small test image (4 lines by 16 words) with conv_filt enabled
hm_vsync <= '1';
FOR i IN 0 to 3 LOOP
WAIT for (4 * PERIOD);
FOR j IN 0 to 15 LOOP
IF j > 7 THEN
hm_img_d(31 downto 24) <= "10001000";
hm_img_d(23 downto 16) <= "10011001";
hm_img_d(15 downto 8) <= "10101010";
hm_img_d(7 downto 0) <= "10111011";
ELSE
hm_img_d(31 downto 24) <= "00010001";
hm_img_d(23 downto 16) <= "00100010";
hm_img_d(15 downto 8) <= "00110011";















-- File: ht_test.vhd
-- This file contains the test bench for the Halftone block of the emkipc chip.
-- Mike Kelly











ARCHITECTURE bhv of halftone_test IS
SIGNAL resetz,clk,cf_vsync,cf_hsync:std_logic := '0';
SIGNAL csz,cpu_wrz,cpu_rdz:std_logic := '1';
SIGNAL cpu_a: std_logic_vector(8 downto 0);
SIGNAL cf_img_d:std_logic_vector(31 downto 0) := ZERO_32;
SIGNAL cpu_d: std_logic_vector(7 downto 0) := "ZZZZZZZZ";
SIGNAL int_hsync:std_logic := '0';
SIGNAL int_vsync:std_logic := '0'; -- mirrors ht_vsync
SIGNAL init_regs:boolean := FALSE;
CONSTANT HALF_CLK: TIME := 10 ns;
CONSTANT PERIOD: TIME := 20 ns;







cpu_a: IN std_logic_vector(8 downto 0);
cf_img_d: IN std_logic_vector(31 downto 0);





resetz <= '0','1' AFTER 50 ns;





P1:PROCESS -- Load the cpu registers and read them back.
BEGIN
WAIT for (5 * PERIOD); -- Wait for reset to complete.




FOR i IN 64 to 127 LOOP
cpu_write(int_vec(i,cpu_a),int_vec((4 * (i - 64)),cpu_d),cpu_a,cpu_d,csz,cpu_wrz);
END LOOP;
-- Read back the dither matrix.
FOR i IN 64 to 127 LOOP
cpu_read(int_vec(i,cpu_a),cpu_a,csz,cpu_rdz);
END LOOP;
WAIT for (2 * PERIOD);
init_regs <= TRUE;
-- Wait for the first image to run and then set the ht_enable bit.
WAIT UNTIL int_vsync = '1';
WAIT UNTIL int_vsync = '0';





P2:PROCESS -- Continuously cycle cf_hsync
BEGIN




WAIT for (HSYNC_CYCLE - PERIOD);
END PROCESS P2;
P3:PROCESS -- Generate vsync and image data.
BEGIN
WAIT UNTIL init_regs;
-- Run an image with halftoning disabled
cf_vsync <= '1';
FOR i IN 0 TO 1 LOOP
WAIT for (4 * PERIOD);
cf_img_d(31 downto 24) <= "01010101";
cf_img_d(23 downto 16) <= "01010101";
cf_img_d(15 downto 8) <= "01010101";
cf_img_d(7 downto 0) <= "01010101";
WAIT for (16 * PERIOD);
cf_img_d <= X_32;





-- Run an image with halftoning enabled
cf_vsync <= '1';
FOR i IN 0 TO 1 LOOP
WAIT for (4 * PERIOD);
cf_img_d(31 downto 24) <= "00110011";
cf_img_d(23 downto 16) <= "00110011";
cf_img_d(15 downto 8) <= "00110011";
cf_img_d(7 downto 0) <= "00110011";
WAIT for (16 * PERIOD);
cf_img_d <= X_32;











-- File: ed_test.vhd
-- This file contains the test bench for the Error Diffusion block of the emkipc chip.
-- Mike Kelly












ARCHITECTURE bhv of err_diff_test IS
SIGNAL resetz,clk,ht_vsync,ht_hsync:std_logic := '0';
SIGNAL int_hsync:std_logic := '0';
SIGNAL int_ed_vsync:std_logic; -- mirrors ed_vsync
SIGNAL csz,cpu_wrz,cpu_rdz:std_logic := '1';
SIGNAL cpu_a: std_logic_vector(8 downto 0);
SIGNAL cpu_d: std_logic_vector(7 downto 0);
SIGNAL ht_img_d,ht_val:std_logic_vector(31 downto 0) := X_32;
SIGNAL init_regs:boolean := FALSE;
CONSTANT HALF_CLK: TIME := 10 ns;
CONSTANT PERIOD: TIME := 20 ns;




cpu_a: IN std_logic_vector(8 downto 0);
ht_img_d,ht_val:IN std_logic_vector(31 downto 0);














P1:PROCESS -- Load the error distribution coefficients and read them back.
BEGIN
WAIT for (5 * PERIOD); -- Wait for reset to complete.
-- Write to the filter coefficient registers.





ELSIF i = 33 THEN
cpu_write(int_vec(i,cpu_a),"00000011",cpu_a,cpu_d,csz,cpu_wrz);
ELSIF i = 34 THEN
cpu_write(int_vec(i,cpu_a),"00000101",cpu_a,cpu_d,csz,cpu_wrz);
ELSE
cpu_write(int_vec(i,cpu_a),"00000111",cpu_a,cpu_d,csz,cpu_wrz);
END IF;
END LOOP;
WAIT for (3 * PERIOD);
-- Read back the coefficient registers.
FOR i IN 32 to 35 LOOP
cpu_read(int_vec(i,cpu_a),cpu_a,csz,cpu_rdz);
END LOOP;
WAIT for (2 * PERIOD);
init_regs <= TRUE;
WAIT UNTIL int_ed_vsync = '1';
WAIT UNTIL int_ed_vsync = '0';
WAIT for (2 * PERIOD);
-- Write to the control register to set ed_enable and th_enable
cpu_write("000001100","00000011",cpu_a,cpu_d,csz,cpu_wrz);




P2:PROCESS -- Continuously cycle hm_hsync
BEGIN




WAIT for (HSYNC_CYCLE - PERIOD);
END PROCESS P2;
P3:PROCESS -- Generate vsync and image data.
BEGIN
WAIT UNTIL init_regs; -- Wait for cpu cycles to finish
-- Run a small test image (4 lines by 16 words) with err_diff disabled
ht_vsync <= '1';
FOR i IN 0 to 3 LOOP
WAIT for (4 * PERIOD);
FOR j IN 0 to 15 LOOP
IF j > 7 THEN
ht_img_d(31 downto 24) <= "10001000";
ht_img_d(23 downto 16) <= "10011001";
ht_img_d(15 downto 8) <= "10101010";
ht_img_d(7 downto 0) <= "10111011";
ELSE
ht_img_d(31 downto 24) <= "00010001";
ht_img_d(23 downto 16) <= "00100010";
ht_img_d(15 downto 8) <= "00110011";
ht_img_d(7 downto 0) <= "01000100";
END IF;









-- Run a small test image (4 lines by 16 words) with err_diff enabled
ht_vsync <= '1';
FOR i IN 0 to 3 LOOP
WAIT for (4 * PERIOD);
FOR j IN 0 to 15 LOOP
IF j > 7 THEN
ht_img_d(31 downto 24) <= "10001000";
ht_img_d(23 downto 16) <= "10011001";
ht_img_d(15 downto 8) <= "10101010";
ht_img_d(7 downto 0) <= "10111011";
ELSE
ht_img_d(31 downto 24) <= "00010001";
ht_img_d(23 downto 16) <= "00100010";
ht_img_d(15 downto 8) <= "00110011";
ht_img_d(7 downto 0) <= "01000100";
END IF;














END bhv;
-- File: th_test.vhd
-- This file contains the test bench for the Thresh block of the emkipc chip.
-- Mike Kelly











ARCHITECTURE bhv of thresh_test IS
SIGNAL ed_vsync,ed_hsync:std_logic := '0';
SIGNAL resetz,clk,th_enable:std_logic := '0';
SIGNAL ed_img_d,ed_thresh:std_logic_vector(31 downto 0) := ZERO_32;
CONSTANT HALF_CLK: TIME := 10 ns;
CONSTANT PERIOD: TIME := 20 ns;










resetz <= '0','1' AFTER 50 ns;
clk <= NOT clk AFTER HALF_CLK;
P0:PROCESS -- Generate vsync and image data.
BEGIN
WAIT for (5 * PERIOD); -- Wait for reset to complete.
-- Run an image with threshold disabled
ed_vsync <= '1';
WAIT for (2 * PERIOD);
FOR j IN 0 TO 3 LOOP
ed_hsync <= '1';
FOR i IN 0 TO 15 LOOP




ed_img_d(23 downto 16) <= int_vec((8 * i) + 2,"00000000");
ed_img_d(15 downto 8) <= int_vec((8 * i) + 4,"00000000");





WAIT for (HSYNC_CYCLE - (16 * PERIOD));
END LOOP;
WAIT for (2 * PERIOD);
ed_vsync <= '0';
WAIT for HSYNC_CYCLE;




ed_thresh <= "00001000000100000010000001000000"; -- 08102040 hex
WAIT for (2 * PERIOD);
ed_vsync <= '1';
WAIT for (2 * PERIOD);
FOR j IN 0 TO 3 LOOP
ed_hsync <= '1';
FOR i IN 0 TO 15 LOOP
ed_img_d(31 downto 24) <= int_vec(8 * i,"00000000");
ed_img_d(23 downto 16) <= int_vec((8 * i) + 2,"00000000");
ed_img_d(15 downto 8) <= int_vec((8 * i) + 4,"00000000");







WAIT for (HSYNC_CYCLE - (16 * PERIOD));
END LOOP;









-- File: fifo_tst.vhd
-- This file contains the test bench for the fifo block of the emkipc chip.
-- Mike Kelly









q: OUT std_logic_vector(31 downto 0));
END fifo_test;
ARCHITECTURE bhv of fifo_test IS
SIGNAL resetz,clk,ren,wen:std_logic := '0';
SIGNAL d: std_logic_vector(31 downto 0) := X_32;
CONSTANT HALF_CLK: TIME := 10 ns;




d: IN std_logic_vector(31 downto 0);
over,under: OUT std_logic;
q: OUT std_logic_vector(31 downto 0));
END COMPONENT;
BEGIN
clk <= NOT clk AFTER HALF_CLK;
P1:PROCESS -- Generate vsync and image data.
BEGIN
resetz <= '0','1' AFTER 50 ns;
WAIT for (5 * PERIOD); -- Wait for reset to complete.
-- Run a small test image (10 lines by 256 words per line)
-- to test address pointer roll-over under normal operation.







FOR j IN 0 to 255 LOOP -- 256 words per line





WAIT for (5 * PERIOD);
END LOOP;
resetz <= '0','1' AFTER 50 ns;
WAIT for (5 * PERIOD); -- Wait for reset to complete.
-- Run a small test image (6 lines by 256 words per line)
-- to test underflow.
FOR i IN 0 to 5 LOOP -- Run for 6 lines






FOR j IN 0 to 255 LOOP -- 256 words per line










WAIT for (5 * PERIOD); -- Wait for reset to complete.
-- Run a small test image (10 lines by 256 words per line)
-- to test overflow




IF i > 10 THEN
ren <= '1';
END IF;
FOR j IN 0 to 255 LOOP -- 256 words per line











END bhv;
Appendix D - Register Initialization Files
This section contains the text files used in simulation of the EMKIPC
design. These are ASCII text files which are read in by the VHDL code as
integers and are used to configure the device registers and LUTs. The code
fragment below is taken from the File: test_img.vhd on page 166 and shows
the FLLE variable declarations for the files referenced in the following sec
tions.
FILE cf_val_file:text IS IN "/home/stu2/emk6904/vhdl/emkipc/text/cf_vals.txt";
FILE hm_val_file:text IS IN "/home/stu2/emk6904/vhdl/emkipc/text/hm_vals.txt";
FILE ht_val_file:text IS IN "/home/stu2/emk6904/vhdl/emkipc/text/ht_vals.txt";
FILE ed_val_file:text IS IN "/home/stu2/emk6904/vhdl/emkipc/text/ed_vals.txt";
FILE ctl_val_file:text IS IN "/home/stu2/emk6904/vhdl/emkipc/text/ctl_reg.txt";
FILE in_img_file:text IS IN "/home/stu2/emk6904/vhdl/emkipc/text/girlin.txt";
FILE out_img_file:text IS OUT "/home/stu2/emk6904/vhdl/emkipc/text/girlout.txt";
Samples of the in_img_file and out_img_file are not included due to
their large size.
D.1 - Control register file
The following text is from a sample control register file (ctl_val_file)
which is read in by the test_img test bench. The text on the second line of the
file is for editing reference only and is not read by the test bench.
0 0 0 128 2
hmctl cfctl htctl thresh edctl
In this example file, the following conditions are set:
histogram modification is disabled (hmctl = 0)
convolution is disabled (cfctl = 0)
ordered dither is disabled (htctl = 0)
The fixed threshold is set to 128
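The test bench itself reads this file in VHDL. As an illustration only, the following C sketch parses the first line of such a file into named fields. The structure CtlRegs and the function parse_ctl_line are hypothetical names introduced here; the field names simply follow the reference line of the sample file.

```c
#include <stdio.h>

/* Hypothetical container for the five values on the first line of the
   control register file. Field names follow the reference line in the
   sample file; they are not taken from the VHDL source. */
typedef struct {
    int hmctl;   /* histogram modification control */
    int cfctl;   /* convolution control */
    int htctl;   /* ordered dither control */
    int thresh;  /* fixed threshold */
    int edctl;   /* fifth value on the line (edctl) */
} CtlRegs;

/* Parse one line of five whitespace-separated integers.
   Returns 1 on success, 0 if fewer than five values were found. */
int parse_ctl_line(const char *line, CtlRegs *r)
{
    return sscanf(line, "%d %d %d %d %d",
                  &r->hmctl, &r->cfctl, &r->htctl,
                  &r->thresh, &r->edctl) == 5;
}
```

Only the first line would be passed to this function; the second line of the file is for editing reference and is skipped.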




D.2 - Histogram modification LUT file
The following text is from a sample Histogram modification file
(hm_val_file) which is read in by the test_img test bench.
255 254 253 252 251 250 249 248
247 246 245 244 243 242 241 240
239 238 237 236 235 234 233 232
231 230 229 228 227 226 225 224
223 222 221 220 219 218 217 216
215 214 213 212 211 210 209 208
207 206 205 204 203 202 201 200
199 198 197 196 195 194 193 192
191 190 189 188 187 186 185 184
183 182 181 180 179 178 177 176
175 174 173 172 171 170 169 168
167 166 165 164 163 162 161 160
159 158 157 156 155 154 153 152
151 150 149 148 147 146 145 144
143 142 141 140 139 138 137 136
135 134 133 132 131 130 129 128
127 126 125 124 123 122 121 120
119 118 117 116 115 114 113 112
111 110 109 108 107 106 105 104
103 102 101 100 99 98 97 96
95 94 93 92 91 90 89 88
87 86 85 84 83 82 81 80
79 78 77 76 75 74 73 72
71 70 69 68 67 66 65 64
63 62 61 60 59 58 57 56
55 54 53 52 51 50 49 48
47 46 45 44 43 42 41 40
39 38 37 36 35 34 33 32
31 30 29 28 27 26 25 24
23 22 21 20 19 18 17 16
15 14 13 12 11 10 9 8
7 6 5 4 3 2 1 0
This file is read in and the values are loaded into the histogram modification
LUTs. The values shown here create a negative image of the original by
inverting the pixel values.
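The table above follows the rule "entry i holds 255 - i". A minimal C sketch of that rule is given below, assuming that the values in the file are stored in address order (entry 0 first). The function names build_negative_lut and apply_lut are illustrative and do not appear in the design.

```c
#include <stddef.h>

typedef unsigned char BYTE;

/* Fill a 256-entry LUT so that entry i holds 255 - i,
   which maps each gray level to its negative. */
void build_negative_lut(BYTE lut[256])
{
    int i;
    for (i = 0; i < 256; i++)
        lut[i] = (BYTE)(255 - i);
}

/* Replace each of the n pixels by its LUT entry, in place. */
void apply_lut(const BYTE lut[256], BYTE *pixels, size_t n)
{
    size_t k;
    for (k = 0; k < n; k++)
        pixels[k] = lut[pixels[k]];
}
```

Any other histogram modification (for example a contrast stretch) would use the same apply_lut step with a different table.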
D.3 - Convolution kernel value file
The following text is from a sample Convolution file (cf_val_file) which
is read in by the test_img test bench.
129 - sign mag notation for -1








This is a laplacian filter: -1 -2 -1
                            -2 12 -2
                            -1 -2 -1
kernel total is zero.
The text in the file other than the first nine values is for documentation
purposes only and is not read by the test bench.
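The sample file notes that 129 is the sign-magnitude notation for -1. The C sketch below shows one way to convert between that notation and ordinary signed integers, assuming an 8-bit value with bit 7 as the sign and bits 6..0 as the magnitude (consistent with 129 = 128 + 1). The function names are hypothetical, and the encoded kernel in the usage note is derived from the Laplacian shown above rather than copied from the file.

```c
/* Decode an 8-bit sign-magnitude value:
   bit 7 is the sign, bits 6..0 are the magnitude. */
int sm8_to_int(unsigned int v)
{
    int mag = (int)(v & 0x7f);
    return (v & 0x80) ? -mag : mag;
}

/* Encode a value in the range -127..127 as 8-bit sign-magnitude. */
unsigned int int_to_sm8(int x)
{
    return (x < 0) ? (0x80u | (unsigned int)(-x)) : (unsigned int)x;
}

/* Sum the nine kernel entries after decoding them. */
int kernel_total(const unsigned int k[9])
{
    int i, total = 0;
    for (i = 0; i < 9; i++)
        total += sm8_to_int(k[i]);
    return total;
}
```

Under this assumption the Laplacian kernel would encode as 129 130 129 130 12 130 129 130 129, and kernel_total returns zero for it.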
D.4 - Ordered Dither matrix file
The following text is from a sample halftone matrix file (ht_val_file)
which is read in by the test_img test bench.
12 52 196 240 228 156 76 20
60 108 184 136 192 220 116 68
204 176 96 40 48 104 212 148
232 128 32 0 8 56 200 252
236 168 88 24 16 64 144 248
132 180 120 80 72 112 208 172
36 100 188 160 152 216 124 92
44 140 244 224 164 84 28
The text in this file implements a 32 element halftone dot oriented at 45
degrees.
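As a rough illustration of how such a matrix is used, the C sketch below thresholds one gray pixel against an 8 by 8 matrix that is tiled across the image. The function name dither_pixel is hypothetical, and the comparison direction (output 1 when the pixel is greater than the matrix entry) is an assumption, not a statement about the chip.

```c
typedef unsigned char BYTE;

/* Threshold one pixel against an 8x8 dither matrix tiled over the image.
   x and y are the pixel's column and row in the image.
   Assumption: the output bit is 1 when the pixel exceeds the threshold. */
int dither_pixel(const BYTE m[8][8], BYTE pixel,
                 unsigned int x, unsigned int y)
{
    return (pixel > m[y % 8][x % 8]) ? 1 : 0;
}
```

Because the thresholds vary with position, a region of constant gray produces a pattern of dots whose density follows the gray level.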
D.5 - Error diffusion multiplier file
The following text is from a sample error diffusion multiplier file
(ed_val_file) which is read in by the test_img test bench.





Notice that the total of these values is 16. Notice also that more error is
diffused to the pixels directly next to (i) and directly below (k) the pixel being
processed than to those on the diagonal (j and l).
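Since the multipliers total 16, each one can be read as a number of sixteenths of the quantization error. The C sketch below splits an error value into four shares on that basis. The structure EdWeights, the function diffuse_error, and the weights used in the usage note (5, 3, 5, 3) are hypothetical; they are chosen only to satisfy the two properties stated above and are not the values from the sample file.

```c
/* Hypothetical set of four error diffusion multipliers, in sixteenths.
   i: pixel directly next to the current pixel
   k: pixel directly below the current pixel
   j, l: the two diagonal neighbors */
typedef struct {
    int i, j, k, l;
} EdWeights;

/* Split the quantization error err into four shares, one per neighbor.
   out[0..3] receive the shares for i, j, k and l respectively.
   Integer division truncates, so a small remainder may be lost. */
void diffuse_error(int err, const EdWeights *w, int out[4])
{
    out[0] = (err * w->i) / 16;
    out[1] = (err * w->j) / 16;
    out[2] = (err * w->k) / 16;
    out[3] = (err * w->l) / 16;
}
```

With the illustrative weights 5, 3, 5, 3 and an error of 32, the shares are 10, 6, 10 and 6, which add back up to the original error.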
Appendix E - 'C' Code
This section contains the two 'C' programs which are used to convert
between the binary raster image data format (.img) and the ASCII text data
format which is read by the test bench.
E.1 - Converting .img to .txt (img2v.c)
* File name = img2v.c. *
* Converts an img file to an ASCII file which can *









/*********** Global Definitions ***********/
#define BYTE unsigned char
#define IMGHDRSIZE 128 /* IMG header input buffer size. */
#define SLMAXWIDTH 1080 /* Maximum scan line width. */
/*******************************************************************
* *
* MAIN PROGRAM *
* *
*******************************************************************/







Array for first 128 bytes of image file. */
int DataOffset;
/* Start of image data in IMG file. */
unsigned int width, length, comsize;
/* IMG size parameters. */
unsigned int i,j,k;
int numlines, outwidth;
BYTE SLBI[SLMAXWIDTH], SLBO[SLMAXWIDTH], data;
/************* Open imgFile and read 1st 128 bytes. *************/
if( argc != 3 )
{
















if( (OutFile = fopen( OutFileName, "w+" )) == NULL )
{












if((ImgHdr[0] != 'I') || (ImgHdr[1] != 'M'))
{












comsize = (ImgHdr[2] & 0x00ff) | ((ImgHdr[3] & 0x00ff) << 8);
width = (ImgHdr[4] & 0x00ff) | ((ImgHdr[5] & 0x00ff) << 8);
length = (ImgHdr[6] & 0x00ff) | ((ImgHdr[7] & 0x00ff) << 8);
outwidth = (width/4) * 4; /* Out file width is a multiple of 4 */
fprintf(OutFile, "%d\t%d\n",outwidth,length);
/* Reset the file pointer to the beginning of Image. */
DataOffset = 64 + comsize;
if(fseek(InFile, (long)DataOffset, 0) != 0)
{
printf("Can't move IMG pointer there.");
exit(1);
/* A non-zero argument indicates abnormal end. */
}
numlines = 0;
for (j = 0; j < length; j++)






if(width != fread(SLBI,sizeof(BYTE),width, InFile))
{





for (i = 0; i < (width/4); i++)
{
fprintf(OutFile, "%d\t", SLBI[(i * 4) + 0]);
fprintf(OutFile, "%d\t", SLBI[(i * 4) + 1]);
fprintf(OutFile, "%d\t", SLBI[(i * 4) + 2]);
fprintf(OutFile, "%d", SLBI[(i * 4) + 3]);
fprintf(OutFile, "\n");
}
numlines = numlines + i;
}
printf("%d pixels per line\n",outwidth);
printf("%d lines per image\n",length);
printf("%d lines in the output file\n",numlines);






E.2 - Converting .txt to .img (v2img.c)
* File name = v2img.c. *
* Converts a text image file to an img file which.
*
* The text file has the following format: *
* line 1: pixels_per_line lines_per_image *
* line 2->(N+1): 4 decimal pixel values *
* N is the number of 4 pixel words in the *
* image. *
* Each pixel is represented by an ascii
*
*









/*********** Global Definitions ***********/
#define BYTE unsigned char
#define IMGHDRSIZE 64 /* IMG header input buffer size. */
#define SLMAXWIDTH 1080 /* Maximum scan line width. */
/*******************************************************************
* *
* MAIN PROGRAM *
* *
*******************************************************************/







Array for first 128 bytes of image file. */
int DataOffset;
/* Start of image data in IMG file. */
long temp_long;
/* IMG size parameters. */
unsigned int width, length, comsize; /* IMG size parameters. */
unsigned int i,j,k;
BYTE SLBI[SLMAXWIDTH], SLBO[SLMAXWIDTH], data;
/************* Open imgFile and read 1st 128 bytes. *************/
if( argc != 3 )
{







if( (InFile = fopen( InFileName, "r+" )) == NULL )
{





if (fscanf(InFile, "%u", &width) != 1)
{





if (fscanf(InFile, "%u", &length) != 1)
{










ImgHdr[4] = (BYTE)(width & 0x00ff) ;
ImgHdr[5] = (BYTE)((width & 0xff00) >> 8) ;
ImgHdr[6] = (BYTE)(length & 0x00ff) ;







for (i=14; i < 64; i++)
ImgHdr[i] = 0;













printf("\n Can't seem to write header. \n");
return;
}
for (j = 0; j < length; j++)






























/************* Close files. **********/
fclose( InFile );
fclose( OutFile );
}
/* end main() */
