Rochester Institute of Technology

RIT Scholar Works
Theses
5-1-2008

Reconfigurable hardware for color space conversion
Sreenivas Patil

Recommended Citation
Patil, Sreenivas, "Reconfigurable hardware for color space conversion" (2008). Thesis. Rochester Institute
of Technology. Accessed from

This Thesis is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in
Theses by an authorized administrator of RIT Scholar Works. For more information, please contact
ritscholarworks@rit.edu.

Reconﬁgurable Hardware for Color Space Conversion
by
Sreenivas Patil
A Thesis Submitted in Partial Fulﬁllment of the Requirements for the Degree of
Master of Science
in
Electrical Engineering
Approved By:

Dr. Eric Peskin
Thesis Advisor

Dr. Eli Saber
Thesis Committee

Dr. Sohail A. Dianat
Thesis Committee

Dr. Vincent Amuso
Department Head
Department of Electrical Engineering
Kate Gleason College of Engineering
Rochester Institute of Technology
Rochester, New York
May 2008

Thesis Release Permission Form
Rochester Institute of Technology
Kate Gleason College of Engineering

Title: Reconﬁgurable Hardware for Color Space Conversion

I, Sreenivas Patil, hereby grant permission to the Wallace Memorial Library to
reproduce my thesis in whole or part.

Sreenivas Patil

Date

Dedication

To my parents and family, Dr. K. V. Reddy and family, and friends.

For making it possible for me to complete my studies in graduate school.
For the continued support and patience.
For helping me to be a better person.
For inspiring me to reach higher.
For the unconditional love.
For always being there for me.

I dedicate this thesis to you.

Acknowledgments
I would like to thank Dr. Eric Peskin for giving me the opportunity to be a part of this
research project. He has signiﬁcantly improved my knowledge about digital design and its
application in the industry. He has always supported me in my research work, discussed
and critiqued my ideas, and also provided different perspectives to consider in my work. He
helped provide structure and clarity to my paper and thesis through numerous proofreading
sessions. Dr. Peskin was so committed to my success on this project that he gave up time
with his family on late nights and weekends so he could provide me with additional support.
I would also like to thank Dr. Eli Saber, Dr. Vincent Amuso, and Dr. Sohail Dianat for
their reviews and suggestions during our weekly discussions. They took time from their
schedules to provide their expert opinions and insights into my project work. Thank you to
the professors for taking the time to be a part of my thesis committee.
A special thanks goes to Hewlett-Packard, especially Dr. Kenneth Lindblom, Mr. Brad
Larson, and Mr. Gene Roylance for supporting this research work and for the technical
guidance they have provided with tools, documentation, and information on design and
implementation. I would also like to thank Xilinx, Inc. for donating the software that I
have used in this thesis.
Thank you to the Rochester Institute of Technology Electrical Engineering Department
for providing me with the software, hardware, and technical support for my research work.
I would also like to thank my colleagues: Mr. Mustafa Jaber, Mr. Luis Garcia, Mr.
Harsha Narne, Mr. Prudhvi Gurram, Mr. Kartheek Chandu, Mr. Manoj Reddy, Mr. Guru
Balasubramanian, Mr. Bhargava Chinni and Mr. Juan Galindo for their valuable advice on
image processing concepts and MATLAB usage.

Abstract
Color space conversion (CSC) is an important application in image and video processing systems. CSC has been implemented in software and various kinds of hardware.
Hardware implementations can achieve a higher performance compared to software-only
solutions. Application-specific integrated circuits (ASICs) are efficient and have good performance. However, they lack the programmability of devices such as field-programmable
gate arrays (FPGAs).
This thesis studies the performance vs. ﬂexibility tradeoffs in the migration of an existing CSC design from an ASIC to an FPGA. The existing ASIC is used within a commercial
color-printing pipeline. Performance is critical in this application. However, the ﬂexibility
of FPGAs is desirable for faster time to market and also the ability to reuse one physical
device across multiple functions. This thesis investigates whether the reprogrammability
of FPGAs can be used to reallocate idle resources and studies the suitability of FPGAs for
image processing applications. In the ASIC design, two major conversion units that are
never used at the same time are identiﬁed. The FPGA-based implementation instantiates
only one of these two units at a time, thus saving area. Reconﬁguring the FPGA switches
which of the two units is instantiated.
The goal is to conﬁgure the device and process an entire page within one second. The
FPGA implementation is approximately a factor of three slower than the ASIC design,
but fast enough to process one page per second. In the current setup, the conﬁguration
time is very high. It exceeds the total time allotted for both conﬁguration and processing.
However, other methods of conﬁguration seem promising to reduce the time. Evaluation of
the performance of the implementation and the reconﬁguration time is presented. Methods
to improve the performance and reduce the time and area for reconﬁguration are discussed.

Contents
Dedication
Acknowledgments
Abstract
Table of Contents
List of Figures
List of Tables
1 Introduction
2 Background
    2.1 Model-based Transforms
    2.2 Color Look-Up Tables with Interpolation
3 Implementation
    3.1 Existing ASIC Implementation
    3.2 Proposed FPGA Implementation
4 Results
    4.1 Implementation
    4.2 Testing
    4.3 Faster Configuration: Preliminary Results
5 Conclusions and Future Work
References
A Generation of Test Vectors
B Hardware Co-simulation
C Hardware and Software Used
D MATLAB Source Code

List of Figures
2.1 Simplest form of interpolation using eight vertices.
2.2 Interpolation using sub-cubes in the color space.
3.1 Core of the CSC engine.
3.2 FPGA version including 3D module.
3.3 FPGA version including 4D module.
4.1 Test methodology.
4.2 Hardware-in-the-loop testing.
4.3 Test image results.
4.4 Floor plan showing PRR in Virtex-II Pro (XC2VP30-7FF896).
4.5 Floor plan showing PRR in Virtex-4 (XC4VSX35-10FF668).
B.1 System Generator project for simulation.
B.2 System Generator project for hardware-in-the-loop testing.

List of Tables
4.1 Implementation results for Virtex-II Pro (XC2VP30-7FF896).
4.2 Implementation results for Virtex-4 (XC4VSX35-10FF668).
4.3 Hardware-in-the-loop testing – XUP Development System.
4.4 Hardware-in-the-loop testing – Annapolis WILDCARD-4.
4.5 Tests in the different modes of operation.
4.6 Physical resource usage within PRR in Virtex-II Pro (XC2VP30-7FF896).
4.7 Physical resource usage within PRR in Virtex-4 (XC4VSX35-10FF668).
4.8 Configuration time.
A.1 General structure of the test vector file.
C.1 List of hardware used for testing.
C.2 List of software used in implementation and testing.

Chapter 1
Introduction
Commercial printers typically use application-speciﬁc integrated circuits (ASICs) in
their color-processing pipelines. ASICs are efﬁcient and can achieve higher performance
compared to software implementations. However, they are inﬂexible, and incorporating
new features requires designing and fabricating a new ASIC. This incurs considerable cost
and lead time. Furthermore, a given ASIC may need to support multiple features that are
never used at the same time. In an ASIC, unused units sit idle. In contrast, reconﬁgurable
devices such as ﬁeld-programmable gate arrays (FPGAs) can redeploy silicon to the task
at hand. FPGAs have established an attractive point on the tradeoff spectrum between the
ﬂexibility and low cost of software and the performance of hardware. The large number of
logic elements available for use is well suited for processing large streams of image data and
processing many streams in parallel. They offer good performance and design ﬂexibility.
Modern FPGAs feature embedded multipliers, on-chip memory, high speed transceivers
and sometimes embedded processors as a part of the device. This combination of features
enables FPGAs to be used in high performance image and video processing solutions.
The work presented in this thesis is a part of an ongoing research project to develop a
dynamic reconﬁguration [1] system for hardware features. It investigates how well FPGAs
are suited to functions in typical color-processing pipelines within printing applications
and whether they can replace ASICs in the future. Of particular interest are the tradeoffs
involved in migrating existing functions from ASICs to FPGAs. In hardware implementations of color-processing pipelines, different resources are used to implement different
functions. These functions are commonly mutually exclusive in time (that is,
not all of the hardware resources are active at the same time). These idle areas of the chip
can be reprogrammed to perform another function, or the same function as another area,
effectively realizing parallel processing. It would be possible to swap in and out different
functions based on the resource availability.
As an initial case study, this thesis considers the color space conversion (CSC) unit
from a Hewlett-Packard (HP) color-processing pipeline. The CSC unit is responsible for
converting images represented in one color space to another color space. The HP CSC
engine is chosen as the driving example for three main reasons. First, it is a commercial
ASIC design with realistic size and speed requirements. Second, it performs CSC using
color look-up tables (CLUTs) followed by interpolation [2]. This advanced technique requires both signiﬁcant storage and arithmetic processing capabilities. Third, at the core of
the pipeline, it contains two main conversion units. A 3D module is used when the input
space has three channels. A separate 4D module is used when the input space has four
channels. In most of the ASIC’s target applications, these two units are never used at the
same time. This presents an opportunity to take advantage of the ability of FPGAs to reallocate resources to the task at hand. This thesis investigates whether the reprogrammability
of FPGAs can be used to redeploy any unused resources.
Most prior work on FPGA implementations of CSC [3, 4, 5] is based on linear matrix-based methods. These are simpler and only apply to conversions that are linear across
the entire color space. Han [6, 7] does present a color-gamut-mapping architecture that
uses CLUTs and bilinear interpolation. This is implemented on both an FPGA and an
ASIC. However, the FPGA is used for prototyping rather than as the ﬁnal implementation.
Furthermore, all these works deal with input color spaces that have three channels. Hence,
they do not use a separate 4D module.
This thesis presents an FPGA-based implementation of the HP CSC design. The FPGA
implementation consists of two separate configurations. One instantiates the 3D module.
The other instantiates the 4D module. Switching between use of the 3D or the 4D module can be accomplished by reconﬁguring the FPGA, depending on the target application.
The implementation described in this thesis also sets the stage for implementing dynamic
reconﬁguration. This will allow for individual modules to be replaced without reprogramming the entire FPGA and unused resources to be redeployed for other purposes. This
also introduces the possibility of adding new functions into the existing pipeline, without
redesigning the pipeline. The FPGA implementation has been tested on two different hardware platforms. One is a Xilinx University Program (XUP) Virtex-II Pro Development
System featuring a Xilinx Virtex-II Pro series (XC2VP30-7FF896) FPGA [8]. The other is
an Annapolis Micro Systems WILDCARD-4 with a Xilinx Virtex-4 series (XC4VSX35-10FF668) FPGA [9].
The following contributions are made in this thesis:
• Implemented an existing, commercial CSC ASIC design on an FPGA.
• Obtained design speed capable of processing one page per second.
• Tested design in various modes of operation and veriﬁed correct implementation of
the design on the FPGA.
• Evaluated reconﬁguration time and obtained results indicating that reconﬁguration
can be done once per job.
• Performed an initial analysis of methods to improve reconfiguration time to allow reconfiguration
on a per-page basis.
The rest of this thesis is structured as follows. Chapter 2 reviews the background material and related work in hardware implementations of CSC. Chapter 3 describes the existing
CSC ASIC design and presents the FPGA-based implementation and the design changes
made to implement the two versions. Chapter 4 describes the test methodology used to
ensure correct implementation. It also presents the results obtained and evaluates the performance. Chapter 5 presents concluding remarks and future work.

Chapter 2
Background
Color is a visual sensation resulting from visible light falling on the retina. The human
retina has three types of color photoreceptor cone cells, each responsive to a different region in the color spectrum. So, any given color can be described using three components,
provided that an appropriate weight of each component is used [10]. A color image can be
represented as an array of picture elements or pixels, each containing a combination of the
components that describe a color.
A color space is a method of describing and representing colors in a standard way [11].
There are many color spaces in use, such as CIE XYZ, RGB, YUV, CMY, HSV, CIE LAB,
etc., most of which have three components. So, a color can also be deﬁned as a point in the
three-dimensional coordinate system of a color space. However, in printing applications, a
fourth component, black (K), is also used to produce more accurate images and save toner,
hence the color space CMYK. There are three popular groups of color spaces used to define
colors in electronic devices, namely RGB (used in display devices); YCrCb, YIQ and YUV
(used in video systems); and CMYK (used in color printing).
Color space conversion (CSC) is the process of converting the representation of a given
color or image from one color space to another. Different devices such as cathode ray tube
(CRT) displays, digital cameras, scanners and printers use different color spaces. These
devices are often used in conjunction, so a conversion between their color
spaces is needed. A typical application is to convert from the color space of an image sensor
of a camera to the color space of a CRT display or a printer. The computations involved
in these transformations are usually nonlinear and complex in multiple dimensions [12,
2]. This makes them computationally intensive to implement in software. Other applications
of CSC include Joint Photographic Experts Group (JPEG) image compression [4], Moving
Picture Experts Group (MPEG) decoding [5], face detection [13, 14, 15], display device
modeling, device-independent color reproduction and colorimetry instrumentation [16].
Conversion between color spaces presents many interesting challenges. The transformation from one space to another is not trivial because the relationship between the
spaces is generally nonlinear. It is important to preserve the color information between the
original image in a source device (such as an image sensor, color scanner, CRT display, digital or video camera, or computer software) and a translated copy in a target device (such as a
CRT display or color printer).
There are a few different approaches in common use.

2.1 Model-based Transforms

Pixels in one color space can be converted to any other color space using mathematical
equations [11]. A straightforward approach to create a hardware design is to implement
these mathematical equations. Depending on the conversion, the mathematical models
range from simple and linear to complex and nonlinear. Conversions from the color space
of one physical device to another physical device are usually nonlinear [2, 11]. It is a design challenge to implement the complex equations involved in the conversions and obtain
accurate results with considerable speed.
A simple and widely used numerical method is matrix transformation. This is used
when the underlying color transformation can be approximated by a linear conversion. A
matrix consisting of conversion constants is derived from the mathematical equations used
for the conversion. Each pixel of the input image is then multiplied with the conversion
matrix to generate the pixel values in the output color space.
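As a minimal sketch of the matrix method, the following Python fragment multiplies one pixel by a 3x3 conversion matrix and adds an offset vector. The coefficients are the full-range RGB to YCbCr constants used in JPEG (JFIF), chosen here only as a familiar example. They are an assumption of this sketch and are not taken from any of the designs cited in this chapter. The function names are hypothetical.

```python
# Illustrative only: full-range RGB -> YCbCr constants as used in JPEG (JFIF).
# These values are an assumption for the example, not part of the thesis design.
RGB_TO_YCBCR = (
    (0.299, 0.587, 0.114),
    (-0.168736, -0.331264, 0.5),
    (0.5, -0.418688, -0.081312),
)
OFFSET = (0.0, 128.0, 128.0)


def convert_pixel(pixel, matrix=RGB_TO_YCBCR, offset=OFFSET):
    """Multiply one 3-channel pixel by the conversion matrix and add the offset."""
    return tuple(
        sum(coeff * channel for coeff, channel in zip(row, pixel)) + off
        for row, off in zip(matrix, offset)
    )


def convert_image(pixels):
    """Apply the same matrix to every pixel of an image given as a list of tuples."""
    return [convert_pixel(p) for p in pixels]


if __name__ == "__main__":
    for rgb in [(255, 255, 255), (128, 128, 128), (0, 0, 0)]:
        y, cb, cr = convert_pixel(rgb)
        print(rgb, "->", round(y), round(cb), round(cr))
```

A different linear conversion only requires a different matrix and offset. A hardware version would typically use fixed-point coefficients and clamp the result to the output range, which this floating-point sketch omits.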

The matrix-based approach has been implemented in different kinds of hardware. Bensaali et al. [3] present an FPGA-based architecture for RGB to YCrCb color space conversion that offers a speedup of 100 compared to software. Agostini et al. [4] present parallel
and pipelined architectures for the conversion from RGB to YCbCr. The FPGA implementation has operating frequencies of 40 MHz and can be used in real-time applications such
as JPEG compression. Sima et al. [5] conduct a case study on Y'CbCr to R'G'B' CSC for
MPEG decoding and present an FPGA-based implementation that has a 40% speedup over
the original design. Bilal and Masud [17] discuss a new architecture for color space conversion from RGB to YCbCr, using the instruction set of OpenRISC 32-bit reduced instruction
set computer (RISC) processor. Andreadis [16] presents an ASIC design that is capable of
conversion from RGB to CIE L*u*v* color space in real-time, operating at speeds of 20
MHz. Andreadis et al. [18] present a similar ASIC design that performs conversion from
XYZ to CIE L*u*v* color space in real-time, with a maximum operation speed of 20
MHz. Nsour and Abdel-Aty-Zohdy [19] discuss an ASIC design that is capable of conversion from RGB to the CIE L*a*b* color space in real time. CSC intellectual property (IP) cores
are available for use from various vendors like Alma Technologies [20], CAST Inc. [21],
Athena Group Inc. [22], Xilinx [23], Altera [24], and a semi-custom implementation from
Triad Semiconductor [25].
Using these designs, other conversions may be performed by changing the coefﬁcients
in the conversion matrix. However, all of the above mentioned implementations are limited
to conversions in which the color transformation can be approximated by a linear conversion. They are also limited to input and output spaces with three channels. The CSC IP
cores offer a small selection of conversions, but they are also limited to color spaces with
three channels.
Matrix transformations can be efﬁciently implemented in hardware. However, such
methods are only applicable to conversions that are linear throughout the entire space. Not
all important CSCs ﬁt this proﬁle. In particular, conversions from the color space of one
physical device to that of another physical device are usually nonlinear. This is because
of other physical dependencies such as the dye and toner properties, the color response
of different types of media for the same colorant [12] and monitor component properties.
Accurately modeling such transformations with mathematical equations poses difﬁculties.
Simple equations do not provide enough accuracy. However, complex equations are too
slow in software and too expensive in hardware.

2.2 Color Look-Up Tables with Interpolation

When the two color spaces are not trivially related, the arbitrary transformation function
can be implemented using color look-up tables (CLUTs). In the extreme, the CLUT has an
entry for every possible position in the input color space. Each entry in the CLUT stores the
coordinates of the corresponding color in the output color space. With this approach, any
arbitrary conversion can be implemented and no arithmetic hardware is required. While
this method is more accurate than a matrix transformation, it requires signiﬁcant memory
resources for storing the output values. In a typical case of eight bits per input component
of a three-channel input, the look-up table would have 2^24 or 16,777,216 entries, each containing three bytes for a three-component output or four bytes for a four-component output.
The memory requirements of the pure CLUT approach quickly become impractical [2, 12].
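The memory figure above can be reproduced with a few lines of arithmetic. The sketch below assumes one byte per output component, as in the example in the text. The function name is hypothetical.

```python
def full_clut_bytes(bits_per_channel, input_channels, output_channels,
                    bytes_per_output=1):
    """Return (entries, bytes) for a CLUT with one entry per possible input color."""
    entries = 2 ** (bits_per_channel * input_channels)
    return entries, entries * output_channels * bytes_per_output


if __name__ == "__main__":
    for out_ch in (3, 4):
        entries, size = full_clut_bytes(8, 3, out_ch)
        print(f"{entries:,} entries, {out_ch}-channel output: "
              f"{size / 2**20:.0f} MiB")
```

For an 8-bit, three-channel input this gives 48 MiB for a three-component output and 64 MiB for a four-component output. A four-channel input raises the entry count to 2^32, which shows how quickly the pure CLUT approach grows.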
An efﬁcient way of reducing the memory requirements is to use CLUTs with interpolation [2]. In this approach, the CLUTs only have entries for a subset of the possible
positions in the input color space. In its simplest form, the entire input color space can
be considered as a single cube. For example, consider a three-channel input space like RGB,
as shown in Figure 2.1. The primary colors (red, green and blue) and their combinations
(cyan, magenta, yellow, black and white) are the vertices. There is a CLUT entry for each
vertex. The output value is then calculated by interpolation between the vertices, using the
distance from the vertices as weights.

Figure 2.1: Simplest form of interpolation using eight vertices.

Figure 2.2: Interpolation using sub-cubes in the color space.

In this extreme case where the entire input space is a single cube, this method reduces
to the approach of Section 2.1. In the other extreme, there is a vertex at every possible
position in the input space and this method reduces to the pure CLUT method. In the
intermediate case, the input space can be divided into sub-cubes as shown in Figure 2.2.
CLUT entries are allocated for the vertices of the sub-cubes. Given an input color that
lies within a particular sub-cube, the algorithm extracts the output colors stored at each of
the vertices of that sub-cube. Interpolation between these values is used to determine the
final output color. The size of the cubes must be chosen to balance memory requirements,
arithmetic logic requirements, speed and accuracy of the conversion.
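The sub-cube lookup and interpolation can be sketched in Python as follows. This is an illustration under stated assumptions, not the HP design: it assumes 8-bit input channels, a grid of 17 nodes per axis (16 sub-cubes per axis), and trilinear interpolation. All names are hypothetical. The identity CLUT exists only so that the result is easy to check.

```python
NODES = 17          # grid points per axis (assumed); gives 16 sub-cubes per axis
INDEX_SHIFT = 4     # upper 4 bits of an 8-bit channel select the sub-cube
FRAC_MASK = 0x0F    # lower 4 bits give the position inside the sub-cube
FRAC_SCALE = 16.0


def build_identity_clut():
    """Hypothetical CLUT whose output equals its input, handy for checking."""
    return [[[(16 * i, 16 * j, 16 * k) for k in range(NODES)]
             for j in range(NODES)] for i in range(NODES)]


def trilinear(clut, pixel):
    """Interpolate an output color from the 8 vertices of the enclosing sub-cube."""
    idx = [c >> INDEX_SHIFT for c in pixel]
    frac = [(c & FRAC_MASK) / FRAC_SCALE for c in pixel]
    n_out = len(clut[0][0][0])
    out = [0.0] * n_out
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                # The weight of a vertex is the product of the per-axis
                # distances measured from the opposite corner of the sub-cube.
                w = ((frac[0] if di else 1 - frac[0]) *
                     (frac[1] if dj else 1 - frac[1]) *
                     (frac[2] if dk else 1 - frac[2]))
                vertex = clut[idx[0] + di][idx[1] + dj][idx[2] + dk]
                for ch in range(n_out):
                    out[ch] += w * vertex[ch]
    return tuple(out)


if __name__ == "__main__":
    clut = build_identity_clut()
    print(trilinear(clut, (200, 17, 3)))
```

With 17 nodes per axis the table holds 17^3 = 4,913 entries instead of 2^24, at the cost of eight table reads and a weighted sum per pixel. Choosing fewer or more nodes moves the design along the memory versus accuracy tradeoff described above.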
There are a number of interpolation methods used, for example: bilinear, trilinear,
PRISM, tetrahedral, etc. [2, 6]. The CLUTs-with-interpolation method has a balanced use
of both memory and computing resources. However, the computational complexity of the
methods and accuracy of the results varies, depending on which interpolation method is
used. By using CLUTs with interpolation, a practical design with a wide range of conversion choices and accurate CSC results can be obtained.
A survey of current literature shows that very few implementations of the CLUT-with-interpolation
method are available in the public domain. My hypothesis is that the implementations used in various devices are proprietary. The CSC design described in this thesis
is one example. There are a few FPGA implementations of this method. Han [6,7] presents
a color-gamut-mapping architecture that uses CLUTs and bilinear interpolation. The input and output color spaces each have three channels. The design is implemented on both
an FPGA and an ASIC. The FPGA is used for prototyping rather than as the ﬁnal implementation. In contrast, this thesis studies the suitability of FPGAs and the use of dynamic
reconﬁguration for related applications.

Chapter 3
Implementation
3.1 Existing ASIC Implementation

The starting point for the current research work is a design that was provided to us by
HP. This design is used in an ASIC and is a part of the color pipeline. It supports a wide
variety of modes and input methods. The design consists of many modules, each with a
speciﬁc purpose. The conﬁguration of the modules and the behavior of the CSC design
are controlled by a set of conﬁguration registers. The core part of the pipeline consists of
a pre-processing unit, two main conversion units and a post-processing unit as shown in
Figure 3.1. The 3D module handles conversions in which the input color space has three
channels (such as RGB or L*a*b*). The 4D module handles conversions in which the input
color space has four channels (such as CMYK). Both these modules convert the input space
to a four-channel output space.

Figure 3.1: Core of the CSC engine.

This design uses CLUTs with interpolation to perform the color space conversion. The
CLUTs can be loaded with the different values corresponding to the conversion requirements. The CLUTs are accessed by the higher order bits of each input channel. These bits
identify the sub-cube that contains the input color. The output colors for the vertices of
this sub-cube are extracted from the CLUT. The output value is obtained by interpolating
between these values. The lower order bits of each input channel determine the relative
position of the input color within the sub-cube. This is used by the interpolation algorithm
to calculate the output value.
In order to convert one pixel per clock cycle, it is necessary to access all values required
for interpolation in a single clock cycle. This is achieved by implementing the CLUTs using
multiple memories that can be accessed in parallel. The CLUT entries are distributed such
that for any color, each vertex of the selected sub-cube is guaranteed to be in a different
memory [12]. Thus, all the vertices of the sub-cube can be extracted in parallel. This
enables a throughput of one pixel per clock cycle.
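One simple way to obtain such a distribution for a three-channel grid is to choose the memory from the parity of each vertex index. The sketch below illustrates that idea only. The source states that the vertices of a sub-cube land in different memories [12] but does not describe the actual mapping here, so the parity scheme, the grid size and the function names are assumptions.

```python
from itertools import product


def bank_of(i, j, k):
    """Memory bank for grid vertex (i, j, k), chosen by the parity of each index."""
    return ((i & 1) << 2) | ((j & 1) << 1) | (k & 1)


def subcube_banks(i, j, k):
    """Banks holding the eight vertices of the sub-cube whose origin is (i, j, k)."""
    return [bank_of(i + di, j + dj, k + dk)
            for di, dj, dk in product((0, 1), repeat=3)]


if __name__ == "__main__":
    # Check every sub-cube of a 17-node grid (an assumed size).
    conflict_free = all(
        sorted(subcube_banks(i, j, k)) == list(range(8))
        for i, j, k in product(range(16), repeat=3)
    )
    print("conflict-free:", conflict_free)
```

Because neighbouring vertices along any axis differ in the parity of that axis index, the eight vertices of a sub-cube always map to eight different banks, so all of them can be read in the same clock cycle.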
The CLUTs also feature high and low resolution modes. The design is able to process
16-bit color data as well as 8-bit color data. The CLUTs are typically loaded through a
processor register interface, but the module also provides a means to load the CLUTs via
a direct memory access (DMA) controller. The ASIC implementation has a throughput of
one pixel per clock cycle with a maximum clock frequency of 167 MHz.
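The relationship between clock frequency and page time follows directly from the throughput of one pixel per clock cycle. The sketch below uses the page size given in Chapter 4 (8.5 by 11 inches at 600 DPI). It ignores pipeline latency and CLUT loading time, and the function names are hypothetical.

```python
PAGE_WIDTH_IN = 8.5
PAGE_HEIGHT_IN = 11
DPI = 600


def pixels_per_page(width_in=PAGE_WIDTH_IN, height_in=PAGE_HEIGHT_IN, dpi=DPI):
    """Number of pixels on one page at the given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


def page_time_seconds(clock_hz, pixels=None, pixels_per_clock=1):
    """Time to stream one page through a pipeline with the given throughput."""
    if pixels is None:
        pixels = pixels_per_page()
    return pixels / (pixels_per_clock * clock_hz)


if __name__ == "__main__":
    print(f"{pixels_per_page():,} pixels per page")
    print(f"{page_time_seconds(167e6):.3f} s per page at 167 MHz")
```

At 167 MHz a page of about 33.7 million pixels takes roughly 0.2 s. The same function applied to a clock near 50 MHz gives roughly 0.67 s, which is still within a one-second budget for processing alone.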

3.2 Proposed FPGA Implementation

The ASIC used for CSC has the advantage of achieving higher performance compared
to a software-only solution. However, it has two key disadvantages. First, incorporating
new features into the algorithm requires the design and fabrication of a new ASIC. This
involves a considerable amount of time and cost. Second, even if the design remains
constant, the conversion requirements or types may change from job to job, or in some
cases, from page to page. This requires multiple modules, each suited for a particular CSC
requirement. In most products in which the ASIC is deployed, the two conversion units are
never used at the same time. At any given time, one of the two modules is idle. This can be
considered an inefficient use of available device resources. In contrast, on programmable
hardware such as an FPGA, these resources can be either eliminated or redeployed to the
task at hand. The motivation for this thesis is to study how well FPGAs are suited for
the type of CSC currently used and to use the reprogrammability of FPGAs to redeploy the
available hardware resources to the task at hand.

Figure 3.2: FPGA version including 3D module.

Figure 3.3: FPGA version including 4D module.

In a typical print application, only one of the two main conversion modules is used at
any given time. Since the other module is idle, the design is modified such that only one
of the modules is present at any given time. Thus, there are two versions of the design,
each with one of the modules as shown in Figure 3.2 and Figure 3.3. The first step is to
implement the existing ASIC design in an FPGA. In the process, some changes are made
to the design that allow for implementation on an FPGA.
The existing ASIC design can make use of static random access memory (SRAM) from
various vendors. Each kind of SRAM is enclosed in memory wrapper modules to keep the
interface to the design constant. In the FPGA implementation, Xilinx CORE Generator is
used to create these memory structures, which are implemented in block random-access
memories (BRAMs) on the FPGA. Custom wrappers are created for each of the memory
modules used.
The current ASIC design has two versions. One includes just the 3D module. The other
includes both the 3D and 4D modules. The chip designer has to choose between the two
versions before fabricating the ASIC. The FPGA implementation also has two versions,
one with just the 3D module and the other with just the 4D module. The two versions can
be swapped at runtime, by reconﬁguring the FPGA with different bit streams.
As a result of the above change, the length of the pipeline also changes. In order to
prevent loss of pixel information, the control module for the pipeline is also modiﬁed to
compensate for the change in the pipeline length.
Another change to the design is the addition of a clock enable port. In the hardware test
bench, Xilinx System Generator is used. This requires the use of a clock enable port for
the imported hardware description language (HDL) module [26].
The ASIC design has a built-in self test (BIST) interface connected to the SRAMs.
The BIST interface is very large and requires a large number of I/O ports. If BIST were
included, there would be a shortage of I/O ports to implement the FPGA design. Also, the
SRAM modules created do not include a BIST port. So, the BIST interface is not supported
and is removed from the FPGA implementation.
Once these changes are made to the design, it can be tested in simulation or synthesized
for hardware co-simulation and implementation. The FPGA can be programmed with
one of the two configurations. The decision about which configuration to load and when
to configure the FPGA is currently made manually. This could be automated by using a
controller to manage the conﬁgurations and the programming of the FPGA.

Chapter 4
Results

4.1 Implementation

The FPGA versions of the design are synthesized for the Xilinx Virtex-II-Pro and
Virtex-4 FPGAs. The synthesis tool used is Xilinx ISE 8.2.03i. The resource usage of
each of the implementations is shown in Table 4.1 and Table 4.2. The values for resource
usage are obtained after place and route. The values for design speed
are obtained from the post-place-and-route static timing analysis.
The FPGA implementations occupy a large number of resources, especially the block
RAMs. CSC using CLUTs with interpolation is an inherently memory-intensive application. Furthermore, the CLUTs in this design are not of the standard sizes of memory
available as BRAMs on-chip. Hence, each instantiation of a memory block uses more than
the required memory size.
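The over-allocation described above can be illustrated with a short calculation. This is a sketch only: the actual CLUT dimensions of the design are not stated here, so the 17 x 17 x 17 table with 8-bit entries is a hypothetical example, and the mapping rule (tile one RAMB16 aspect ratio over the memory) is a simplification of what the memory generator does.

```python
import math

# Aspect ratios (depth, width) of an 18 Kb RAMB16 block.
RAMB16_SHAPES = [(16384, 1), (8192, 2), (4096, 4), (2048, 9), (1024, 18), (512, 36)]
BRAM_BITS = 18 * 1024


def brams_needed(depth, width):
    """Fewest blocks needed when a single aspect ratio is tiled over the memory."""
    return min(math.ceil(depth / d) * math.ceil(width / w) for d, w in RAMB16_SHAPES)


def utilization(depth, width):
    """Fraction of the allocated block RAM bits that the memory actually uses."""
    return depth * width / (brams_needed(depth, width) * BRAM_BITS)


# Hypothetical CLUT: 17 * 17 * 17 entries of 8 bits each.
depth, width = 17 ** 3, 8
print(brams_needed(depth, width), round(utilization(depth, width), 2))
```

For this hypothetical table, three blocks are allocated but only about 71% of their bits are used, which is the kind of waste that a non-standard memory size causes.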

Table 4.1: Implementation results for Virtex-II Pro (XC2VP30-7FF896).
                    CSC Design with           Available
Feature             3D Mod.      4D Mod.      Resources
Slice Flip Flops    3,817        4,268        27,392
4 input LUTs        13,409       15,153       27,392
Slices              7,737        9,292        13,696
Block RAMs          92           66           136
MULT18x18s          32           40           136
Max. clock rate     50.56 MHz    50.39 MHz

Table 4.2: Implementation results for Virtex-4 (XC4VSX35-10FF668).
                    CSC Design with           Available
Feature             3D Mod.      4D Mod.      Resources
Slice Flip Flops    3,832        4,236        30,720
4 input LUTs        13,725       15,205       30,720
Slices              7,930        9,297        15,360
RAMB16s             92           66           192
DSP48s              16           24           192
Max. clock rate     50.33 MHz    50.95 MHz
The goal of the implementation is to be able to configure the FPGA and then process
an entire page within one second. One page is 8.5 by 11 inches at 600 dots per inch
(DPI), which is about 33.7 million pixels. The FPGA versions maintain the throughput of
one pixel per clock cycle (as in the ASIC version). Based on the post-place-and-route
static timing analysis, the maximum operating speed of the FPGA implementations is about
50 MHz. This is approximately a factor of three slower than the ASIC. However, at this rate,
processing one page would take approximately 0.7 seconds.
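The page-time figure follows directly from the page size and the clock rate. The short calculation below reproduces it; the variable names are illustrative, and the 50 MHz clock is the rounded value quoted in the text.

```python
WIDTH_IN, HEIGHT_IN, DPI = 8.5, 11, 600

# Pixels per page: 5100 columns by 6600 rows.
pixels = round(WIDTH_IN * DPI) * round(HEIGHT_IN * DPI)


def page_time_s(clock_hz, pixels_per_cycle=1):
    """Time to stream one page through a pipeline with the given throughput."""
    return pixels / (clock_hz * pixels_per_cycle)


print(pixels)                        # pixels on one page
print(round(page_time_s(50e6), 3))   # seconds per page at 50 MHz
```

At one pixel per clock cycle this leaves roughly 0.3 seconds of the one-second budget for configuring the FPGA.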

4.2 Testing

Each type of conversion has a specific CLUT data file. These custom data files are
processed by an HP software executable to generate the configuration register information,
CLUT values and a CSC data file. The CSC data file is used to process the source image using another HP software executable that simulates the ASIC design. The result is a
reference output image that is used for comparison with the simulation results. The configuration register information and CLUT values are used along with the image information to
create an input text file containing test vectors. (The structure of the text file containing the
test vectors is available in Appendix A.) This text file is used as an input to the Xilinx ISE
simulator to perform the software simulation. Image information is extracted from the simulation output and compared against the reference output image obtained from the software
executable. Figure 4.1 details the test method.
[Block diagram. Block labels: Configuration Data; CLUT Data; CLUT Data Converter; Config Register Data; CLUT Values; CSC Data; Input Image; CSC Software Model; Simulation Input File; Software Simulation (Xilinx ISE); Hardware Co-simulation (Simulink, System Generator & FPGA); Output Image 1; Output Image 2; Output Image 3; Compare Images; Results Match.]
Figure 4.1: Test methodology.

Figure 4.2: Hardware-in-the-loop testing.
Once the design is verified in software, it can be tested in hardware. Xilinx System
Generator is used for hardware-in-the-loop testing. This gives the ability to use MATLAB
via Simulink as a hardware test platform. This interface is very convenient because it allows
for testing the design on the FPGA and also allows for sending test vectors and receiving
output values from within MATLAB. The design to be tested is imported as a black box
component in Simulink. Gateway-in and gateway-out components are used to connect the
input and output ports of the design to the MATLAB workspace. This Simulink model
is synthesized for the target device and a hardware co-simulation block is created. (More
information about the Simulink model and screen shots are available in Appendix B.) Test
vectors, including both configuration data and an input image, are set up in MATLAB using
the input text file and applied to the inputs of the design on the FPGA. The outputs are
collected, and the image information is extracted and compared with the results of the software
model so that a match can be validated. This is illustrated in Figure 4.2. The test results are
expected to have a bit-for-bit match with the reference output image.
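A bit-for-bit comparison of this kind can be sketched in a few lines. The thesis performs the comparison in MATLAB; the Python function below is only an illustration, it assumes both images are available as raw byte strings, and `first_mismatch` is a hypothetical helper name.

```python
def first_mismatch(reference: bytes, result: bytes):
    """Return the index of the first differing byte, or None if the data match exactly."""
    for index, (expected, actual) in enumerate(zip(reference, result)):
        if expected != actual:
            return index
    if len(reference) != len(result):
        # One image is a strict prefix of the other: report where the shorter one ends.
        return min(len(reference), len(result))
    return None


reference = bytes([0x00, 0x10, 0x20, 0x30])
print(first_mismatch(reference, bytes([0x00, 0x10, 0x20, 0x30])))
print(first_mismatch(reference, bytes([0x00, 0x11, 0x20, 0x30])))
```

Reporting the position of the first difference, instead of a plain pass or fail, makes it easier to trace a mismatch back to a particular pixel in the test vector file.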
Even though Xilinx System Generator provides a convenient interface, the speed of
testing is limited by the slow link between the host and the FPGA. Vectors stream through
the host computer, then MATLAB and System Generator and finally the FPGA. The test
vectors are shifted serially from the host computer and then applied to the design. Table 4.3
and Table 4.4 show the time for processing four different test images. The test vectors
contain the CSC configuration, CLUT values and the image pixels. The time required to
configure the FPGA and to process all the test vectors is measured. In the tests performed,
the effective clock rate is approximately 4.6 kHz for the Virtex-4 FPGA and approximately
641 Hz for the Virtex-II Pro FPGA. Even in the case of the Virtex-4, the clock rate is four orders
of magnitude slower than the maximum operating speed of the FPGA implementation.
Thus, although System Generator is a convenient interface to verify correctness on the
hardware, it does not allow for at-speed testing of this design.

Table 4.3: Hardware-in-the-loop testing – XUP Development System.
          Num. of   Num. of   Config   Proc.     Time per        Vector
          Pixels    Vectors   Time     Time      Vector          Rate
Image 1   9600      17234     7.7 s    21.1 s    1.6×10^-3 s     636 Hz
Image 2   19200     26834     7.9 s    41.8 s    1.6×10^-3 s     642 Hz
Image 3   38400     46034     8.1 s    71.3 s    1.5×10^-3 s     646 Hz
Image 4   57600     65234     8.2 s    100.2 s   1.5×10^-3 s     641 Hz

Table 4.4: Hardware-in-the-loop testing – Annapolis WILDCARD-4.
          Num. of   Num. of   Config   Proc.     Time per        Vector
          Pixels    Vectors   Time     Time      Vector          Rate
Image 1   9600      17234     2.5 s    3.8 s     2.2×10^-4 s     4.5 kHz
Image 2   19200     26834     2.5 s    5.8 s     2.2×10^-4 s     4.6 kHz
Image 3   38400     46034     2.4 s    9.7 s     2.1×10^-4 s     4.7 kHz
Image 4   57600     65234     2.5 s    14.3 s    2.2×10^-4 s     4.6 kHz
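The "four orders of magnitude" gap can be checked against the WILDCARD-4 measurements in Table 4.4. The sketch below recomputes the vector rate from the number of vectors and the processing time of each image and compares the average with the roughly 50 MHz post-place-and-route clock; the dictionary layout and names are illustrative only.

```python
import math

# (number of vectors, processing time in seconds) from Table 4.4.
TABLE_4_4 = {
    "Image 1": (17234, 3.8),
    "Image 2": (26834, 5.8),
    "Image 3": (46034, 9.7),
    "Image 4": (65234, 14.3),
}
MAX_CLOCK_HZ = 50e6  # approximate maximum clock rate of the FPGA implementation

rates = {name: vectors / seconds for name, (vectors, seconds) in TABLE_4_4.items()}
mean_rate = sum(rates.values()) / len(rates)
gap_orders = math.log10(MAX_CLOCK_HZ / mean_rate)

for name, rate in rates.items():
    print(name, round(rate), "vectors/s")
print(round(mean_rate / 1000, 1), "kHz average;", round(gap_orders, 1), "orders of magnitude below 50 MHz")
```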
The two FPGA versions of the CSC design have been tested in several different modes
of operation. Table 4.5 shows the 24 tests performed and the different combinations of
parameters used. The columns in Table 4.5 are defined as follows:
• Pipeline - indicates whether the pipeline includes the 3D module or the 4D module.
• CLUT Load Method - shows the method used to load the CLUTs.
• CLUT Resolution - shows whether the CLUTs use high resolution or low resolution.
• Conversion - indicates the specific conversion performed.
• Image Resolution - shows the resolution of the source image.
• Results Match - shows the result of the comparison between the images obtained
from hardware co-simulation and the software model.

Table 4.5: Tests in the different modes of operation.
[Table body (24 rows; columns Pipeline, CLUT Load Method, CLUT Resolution, Conversion, Image Resolution, Results Match) is not recoverable from the extracted text.]

In all the test cases, both the output from software simulation and also that from the actual FPGA exactly match (bit-for-bit) the desired result from the HP software model. Figure 4.3 shows the result of one of the tests. The conversion performed is RGB to CMYK.
Figure 4.3(a) is the source RGB image. The result of the HP CSC software executable
is shown in Figure 4.3(b) and the result of the hardware co-simulation is shown in Figure 4.3(c).


(a) Source Image.

(b) CSC Executable.

(c) Hardware Co-simulation.

Figure 4.3: Test image results.
In the current test setup, the FPGA configuration process is a part of the test routine. In
the case of the XUP board, the FPGA can be programmed independently to measure the
time for configuration. Xilinx iMPACT can configure the FPGA in 3.5 seconds. Using the
test setup, configuration time is approximately 8 seconds (as shown in Table 4.3). In the case of
the WILDCARD-4, the time to configure the FPGA is 2.5 seconds (as shown in Table 4.4).
The configuration time alone is over the one-second budget in either case. Section 4.3
presents potential solutions to this problem.

4.3 Faster Configuration: Preliminary Results

Section 4.2 shows that when using hardware co-simulation, the time to configure the
FPGA externally exceeds the budget of one second to both reconfigure the FPGA and
process one page. The configuration time can be reduced in two ways. One is to reduce the
size of the configuration bit-stream. The other is to increase the throughput at which the
bit-stream is loaded.
The FPGA versions of the CSC differ only by the main conversion module. All other
modules are common to both the designs. Hence, there is no need to reprogram the FPGA
with the entire CSC design. If the 3D module can be replaced by the 4D module and vice versa without disturbing the other modules, then switching between the two designs could
take place quickly. This can be achieved by implementing partial reconfiguration [27].
This allows part of the FPGA to be reconfigured, while the configuration in other areas of
the FPGA remains unchanged. Furthermore, a partial bit-stream contains only the information required to reconfigure the changing region. A suitable area on the FPGA, called
the partially reconfigurable region (PRR) [27], that can serve as either the 3D or the 4D
module is reserved. The rest of the CSC design is built around the PRR. This design
is synthesized and implemented to yield full and partial bit streams, which can be used to
program the FPGA. Figure 4.4 and Figure 4.5 show the floor plan of the CSC design in Xilinx Plan Ahead 8.2.10. The PRR is indicated by the rectangular shaped region surrounded
by the magenta border. The area surrounding the PRR is the static region. It contains
the base design that remains unchanged during partial reconfiguration. Partially Reconfigurable Modules (PRMs) [27] are connected to the base design using bus macros [28]. The
resource usage for the different PRMs in the PRR is shown in Table 4.6 and Table 4.7.

Table 4.6: Physical resource usage within PRR in Virtex-II Pro (XC2VP30-7FF896).
              PRR with              Available resources
Feature       PRM-3D     PRM-4D     within PRR
LUT           4729       7074       11200
FF            974        1399       11200
SLICE         2885       4315       5600
MULT18X18     0          0          60
RAMB16        60         34         60

Table 4.7: Physical resource usage within PRR in Virtex-4 (XC4VSX35-10FF668).
              PRR with              Available resources
Feature       PRM-3D     PRM-4D     within PRR
LUT           4001       5725       9600
FF            924        1370       9600
SLICEL        1220       1746       2400
SLICEM        1220       1746       2400
DSP48         16         24         80
FIFO48        0          0          80
RAMB16        60         34         80
[Floor plan image: static design labeled CSC_3D_4D, with five PRR areas labeled ReconfigModules.]
Figure 4.4: Floor plan showing PRR in Virtex-II Pro (XC2VP30-7FF896).

[Floor plan image: static design labeled CSC_3D_4D and PRR labeled ReconfigModules, surrounded by package pin labels.]
Figure 4.5: Floor plan showing PRR in Virtex-4 (XC4VSX35-10FF668).

The size and shape of the PRR are determined by the resources required by the different
PRMs. In the Virtex-II Pro and Virtex-4 FPGAs, BRAMs are arranged in columns. Since
the 3D and 4D modules make use of the BRAMs, the area covered by the PRR must include
these columns. In the current implementation, the size of the PRR is chosen to meet the
BRAM (RAMB16) requirements of the PRM. In the PR implementation using the Virtex-II Pro FPGA, the BRAM usage is 100% (as shown in Table 4.6). In the case of the Virtex-4
FPGA, the size chosen for the PRR is slightly larger than the required size (as shown in
Table 4.7). The area can be trimmed down to the exact size required.
The shape of the PRR is usually rectangular. In the case of the PR implementation using
the Virtex-II Pro FPGA, the PRR has a dumbbell-like shape. This is because the device
contains two PowerPC blocks (the two rectangular blocks in the middle of the device in
Figure 4.4). In order to meet the BRAM requirements of the PRM, the area chosen for
the PRR overlaps the PowerPC blocks. This splits the PRR into five rectangular areas,
arranged in a dumbbell-like pattern. Partial reconfiguration for PRRs of non-rectangular
shapes can be performed using Xilinx Plan Ahead [29] and the software tools in [30].
In the Virtex-II Pro FPGA, the smallest unit of configuration is a frame, which spans
the entire height of the device [27]. In the Virtex-4 FPGA, the configuration architecture
is still frame-based, but a frame spans 16 rows of configurable logic blocks (CLBs) rather
than the full device height [27]. However, using the software tools in [30] and the partial
reconfiguration flow described in [28], a PRR of any rectangular size can be implemented.
The design flow for using Plan Ahead to generate the full and partial bit streams is described
in [29, 31].
As indicated in Table 4.8, the size of the original bit stream is reduced by a factor
of about three in the Virtex-II Pro and almost four in the Virtex-4. In research experiments with the
Virtex-II Pro device, it is observed that the configuration time scales linearly with the configuration bit stream size. In the case of the Virtex-4, when using hardware co-simulation, the
time to configure the entire FPGA externally is about 2.5 seconds. If partial reconfiguration
were used to swap in a different module, the configuration time would be about 0.7 seconds.
In the case of the Virtex-II Pro, the time to configure the entire FPGA externally using iMPACT
is about 3.5 seconds. If partial reconfiguration were used to swap in a different module, the
configuration time would be about 1.1 seconds. Even though this is a good reduction in the
configuration time, the total time for configuring the FPGA and processing an entire page
is still over the budget of one second.

Table 4.8: Configuration time.
                                         Time (Goal: 1 s)
                           Bit-stream    Externally    Internally
Device          Type       Size          configured    configured
Virtex-4        Full       1673 KB       2.5 s†        0.03 s∗
Virtex-4        Partial    445 KB        0.7 s∗        0.01 s∗
Virtex-II Pro   Full       1415 KB       3.5 s‡        0.03 s∗
Virtex-II Pro   Partial    466 KB        1.1 s‡        0.01 s∗
∗ Estimated values.
† Using hardware co-simulation.
‡ Using iMPACT.
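The partial-reconfiguration times quoted above are consistent with scaling the full configuration time by the ratio of bit-stream sizes. The sketch below applies that linear model to the sizes in Table 4.8; the function name is illustrative, and the 0.67 s page time is the value implied by 33.66 million pixels at 50 MHz.

```python
def scaled_time_s(full_time_s, full_size_kb, partial_size_kb):
    """Linear model: configuration time proportional to bit-stream size."""
    return full_time_s * partial_size_kb / full_size_kb


virtex4_partial = scaled_time_s(2.5, 1673, 445)
virtex2p_partial = scaled_time_s(3.5, 1415, 466)
PAGE_TIME_S = 33_660_000 / 50e6  # one pixel per cycle at 50 MHz

print(round(virtex4_partial, 2), round(virtex2p_partial, 2))
print(round(virtex4_partial + PAGE_TIME_S, 2))  # total for reconfigure plus one page
```

The linear estimate for the Virtex-II Pro comes out within about 0.05 s of the 1.1 s listed in the table, and even the faster Virtex-4 case leaves the combined time above one second.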
Current research on reducing the configuration time involves using different methods of configuration. Virtex FPGAs can be programmed externally using SelectMAP. This
interface could yield faster configuration times. Another promising method of configuration
is the internal configuration access port (ICAP) [32]. This can be used to
program the FPGAs internally. This interface allows for reading and writing configuration
bit streams. It has an 8-bit data port and can be run at a clock speed of 50 MHz [33, 34].
Table 4.8 shows that at this rate, reconfiguration time would be negligible compared to the
time to process one full page of pixels. This would bring the total for configuration and
processing within the goal of one second. Implementation using these methods is
still in its preliminary stages.
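The internally configured estimates in Table 4.8 can be reproduced from the ICAP parameters given above: an 8-bit port at 50 MHz moves one byte per cycle. The sketch below assumes 1 KB = 1024 bytes and ignores any protocol overhead, both of which are assumptions and not statements from the thesis.

```python
ICAP_BYTES_PER_S = 1 * 50e6  # 8-bit port, one byte per 50 MHz clock cycle
BYTES_PER_KB = 1024          # assumption about the unit used in Table 4.8


def icap_time_s(size_kb):
    """Idealized time to stream a bit-stream of size_kb through the ICAP."""
    return size_kb * BYTES_PER_KB / ICAP_BYTES_PER_S


SIZES_KB = {
    "Virtex-4 full": 1673,
    "Virtex-4 partial": 445,
    "Virtex-II Pro full": 1415,
    "Virtex-II Pro partial": 466,
}
PAGE_TIME_S = 33_660_000 / 50e6  # one pixel per cycle at 50 MHz

for name, size_kb in SIZES_KB.items():
    print(name, round(icap_time_s(size_kb), 2), "s")
print(round(icap_time_s(445) + PAGE_TIME_S, 2), "s for partial reconfiguration plus one page")
```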


Chapter 5
Conclusions and Future Work
An existing, commercial CSC ASIC design has been successfully reimplemented in
an FPGA. The FPGA implementation consists of two versions. The first version contains
all units except the 4D conversion engine. The second version contains all units except
the 3D conversion engine. In each case, the interface is the same as that of the original
ASIC implementation of the CSC, except that BIST is not supported. Test results show an
exact match between the output of the FPGA implementation and a software model of the
original ASIC. The FPGA-based implementation is slower than the ASIC, but it can still
process a full page of pixels in under one second. This has been achieved with few FPGA-specific
optimizations.
In the ASIC implementation, the pipeline contains two main conversion modules, only
one of which is used. In the FPGA implementation, only one main conversion module is
present in each version. This decreases the logic resources required. Since there is no need
to include the bypass logic for the other module, there is also a reduction in the required
routing resources.
It is important to note that an attempt has been made to keep the FPGA implementation
as similar as possible to the original ASIC implementation. The CSC design is a good
target for optimizations such as partial evaluation [35, 36]. This gives the possibility of
specializing the design for a specific set of values or modes, which could potentially result
in decreased resource usage and improved performance.


In order to support the various modes of the design, several multiplexers are implemented in the data path. This allows for existing features to be bypassed and alternate
features to be included. If the design is implemented as-is on an FPGA, the multiplexers are also synthesized. However, if a conversion requires a specific configuration of the
data path, some of the multiplexers need not be included. This would reduce the area and
simplify place and route (PAR). However, this type of optimization requires that many bit
streams be generated for the different configurations.
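The idea of removing multiplexers for a known configuration can be illustrated with a software analogy. This sketch is not the thesis design: the stage names, operations and mode flags are hypothetical, and in hardware the specialization would be done by the synthesis flow, not at run time.

```python
def generic_datapath(pixel, modes, stages):
    """General data path: every stage sits behind a bypass multiplexer."""
    for name, stage in stages:
        if modes[name]:  # models the per-stage bypass multiplexer
            pixel = stage(pixel)
    return pixel


def specialize(modes, stages):
    """Partially evaluate the data path for fixed modes, dropping bypassed stages."""
    active = [stage for name, stage in stages if modes[name]]

    def datapath(pixel):
        for stage in active:
            pixel = stage(pixel)
        return pixel

    return datapath, len(active)


# Hypothetical stages and one fixed configuration of the mode bits.
STAGES = [
    ("pre_scale", lambda p: p * 2),
    ("offset", lambda p: p + 3),
    ("clamp", lambda p: min(p, 255)),
]
MODES = {"pre_scale": True, "offset": False, "clamp": True}

fast, kept = specialize(MODES, STAGES)
print(kept, fast(100), generic_datapath(100, MODES, STAGES))
```

The specialized version produces the same results with fewer stages and no mode checks, but a separate specialized version is needed for each configuration, which mirrors the need for many bit streams noted above.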
Another area where partial evaluation can be applied is the CLUT configurations.
The initial values of the CLUTs are zero (reset value). The values for the required conversion are loaded prior to conversion. However, if the type of conversion is known beforehand, CLUTs can be initialized with the conversion values. For other conversions, partial
reconfiguration can be used to change the CLUT values. In the current implementation, the
CLUTs are instantiated as random access memories (RAMs). If they are preloaded with
the conversion values, they can be specified as read only memories (ROMs) instead. Thus,
no interface for writing values into the CLUTs would be synthesized. This reduction in
logic reduces the resource usage and area on the chip.
Future work will consider methods to optimize the speed of the FPGA-based implementation. Of particular interest is the time required to switch between these two configurations. Results show that JTAG configuration is too slow for the target application. However, ongoing work investigates the use of partial reconfiguration and internal configuration
through ICAP. These methods promise to enable applications that require reconfiguring on
a page-by-page basis. Future work will add a controller to automate device programming
and to manage configuration bit streams.


References
[1] P. Lysaght and J. Dunlop, “Dynamic reconfiguration of FPGAs,” in More FPGAs:
Proceedings of the 1993 International workshop on field-programmable logic and
applications, W. Moore and W. Luk, Eds., Oxford, England, 1993, pp. 82–94.
[2] J. M. Kasson, S. I. Nin, W. Plouffe, and J. L. Hafner, “Performing color space conversions with three-dimensional linear interpolation,” Journal of Electronic Imaging,
vol. 4, no. 3, pp. 226–250, 1995.
[3] F. Bensaali, A. Amira, and A. Bouridane, “Accelerating matrix product on reconfigurable hardware for image processing applications,” IEE Proceedings - Circuits,
Devices and Systems, vol. 152, no. 3, pp. 236–246, 2005.
[4] L. V. Agostini, I. S. Silva, and S. Bampi, “Parallel color space converters for JPEG
image compression,” Microelectronics Reliability, vol. 44, no. 4, pp. 697–703, April
2004.
[5] M. Sima, S. Vassiliadis, S. Cotofana, and J. T. J. van Eijndhoven, “Color space conversion for MPEG decoding on FPGA-augmented trimedia processor,” in Proceedings.
IEEE International Conference on Application-Specific Systems, Architectures, and
Processors, Jun. 2003, pp. 250–259.
[6] D. Han, “Real-time color gamut mapping method for digital tv display quality enhancement,” IEEE Transactions on Consumer Electronics, vol. 50, no. 2, pp. 691–698, 2004.
[7] D. Han, “A cost effective color gamut mapping architecture for digital tv color reproduction enhancement,” IEEE Transactions on Consumer Electronics, vol. 51, no. 1,
pp. 168–174, 2005.
[8] Xilinx XUP Virtex II Pro Development System. [Online]. Available: http://www.xilinx.com/univ/xupv2p.html
[9] WILDCARD-4 from Annapolis Micro Systems, Inc. [Online]. Available: http://www.annapmicro.com/wc4.html
[10] C. Poynton, “A guided tour of color space,” in Proceedings of the SMPTE Advanced
Television and Electronic Imaging Conference, February 1995, pp. 167–180.
[11] P. Green and L. MacDonald, Eds., Color Engineering, Achieving Device Independent
Color. John Wiley & Sons Ltd, 2002.
[12] G. L. Vondran, Jr., “Apparatus for generating interpolator input data,” U.S. Patent
5 717 507, Feb. 10, 1998.
[13] A. Albiol, L. Torres, and E. J. Delp, “An unsupervised color image segmentation algorithm for face detection applications,” in Proceedings. 2001 International Conference
on Image Processing, vol. 2, 2001, pp. 681–684.
[14] P. Kuchi, P. Gabbur, P. S. Bhat, and S. David, “Human face detection and tracking
using skin color modelling and connected component operators,” The IETE Journal
of Research, Special issue on Visual Media Processing, May 2002.
[15] B. Menser and M. Brunig, “Face detection and tracking for video coding applications,” in Conference Record of the Thirty-Fourth Asilomar Conference on Signals,
Systems and Computers, vol. 1, 2000, pp. 49–53.
[16] I. Andreadis, “A real-time color space converter for the measurement of appearance,”
Pattern Recognition, vol. 34, no. 6, pp. 1181–1187, June 2001.
[17] M. Bilal and S. Masud, “Efficient color space conversion using custom instruction in
a risc processor,” in IEEE International Symposium on Circuits and Systems, 2007,
pp. 1109–1112.
[18] I. Andreadis, A. Gasteratos, and P. Tsalides, “A new asic for the measurement of
appearance,” in Instrumentation and Measurement Technology Conference, 1996.
IMTC-96. Conference Proceedings. ‘Quality Measurements: The Indispensable
Bridge between Theory and Reality’., IEEE, vol. 1, 1996, pp. 545–548.
[19] M. Nsour and H. S. Abdel-Aty-Zohdy, “An improved asic design and implementation for color space conversion applications,” in IEEE 39th Midwest symposium on
Circuits and Systems, vol. 2, 1996, pp. 609–612.
[20] CSC-PT core. [Online]. Available: http://www.alma-tech.com/products index.php?item=/02 Image%20Processing&sid=0
[21] CSC Color Space Conversion Core. [Online]. Available: http://www.cast-inc.com/cores/csc/index.shtml
[22] Athena Color Space Converter. [Online]. Available: http://www.athena-group.com/pdf/AthenaCSCv1.pdf
[23] CSC Color Space Converter. [Online]. Available: http://www.xilinx.com/products/logicore/alliance/cast/cast csc.pdf
[24] Altera Color Space Converter MegaCore. [Online]. Available: http://www.altera.com.cn/literature/ug/csc ug.pdf
[25] Designing a Video Color Space Converter on the VCA-6160 Platform. [Online].
Available: http://www.triadsemi.com/page/TSA001
[26] System Generator for DSP User Guide. [Online]. Available: http://www.xilinx.com/support/sw manuals/sysgen user.pdf
[27] P. Sedcole, B. Blodget, T. Becker, J. Anderson, and P. Lysaght, “Modular dynamic
reconfiguration in Virtex FPGAs,” IEE Proceedings - Computers and Digital Techniques, vol. 153, no. 3, pp. 157–164, 2006.
[28] Early Access Partial Reconfiguration User Guide. [Online]. Available: http://www.xilinx.com/support/prealounge/protected/docs/ug208.pdf
[29] Partial Reconfiguration Design with PlanAhead. [Online]. Available: http://www.xilinx.com/support/prealounge/protected/docs/PR User Guide.pdf
[30] Partial Reconfiguration Early Access Software Tools. [Online]. Available: http://www.xilinx.com/support/prealounge/protected/index.htm
[31] Partial Reconfiguration Software Users Guide. [Online]. Available: http://www.xilinx.com/support/prealounge/protected/software/pa pr user guide 81.pdf
[32] B. Blodget, P. James-Roxby, E. Keller, S. Mcmillan, and P. Sundararajan, “A self-reconfiguring platform,” Field-Programmable Logic and Applications, pp. 565–574,
2003.
[33] R. J. Fong, S. J. Harper, and P. M. Athanas, “A versatile framework for FPGA field
updates: an application of partial self-reconfiguration,” in Proceedings of the 14th
IEEE International Workshop on Rapid Systems Prototyping, 9-11 June 2003, pp.
117–123.
[34] T. Huffmire, B. Brotherton, G. Wang, T. Sherwood, R. Kastner, T. Levin, T. Nguyen,
and C. Irvine, “Moats and drawbridges: An isolation primitive for reconfigurable
hardware based systems,” in IEEE Symposium on Security and Privacy, 20-23 May
2007, pp. 281–295.
[35] S. Singh, J. Hogg, and D. McAuley, “Expressing dynamic reconfiguration by partial evaluation,” in Proceedings. IEEE Symposium on FPGAs for Custom Computing
Machines, J. Arnold and K. L. Pocek, Eds., Napa, CA, Apr. 1996, pp. 188–194.
[36] A. DeHon, J. Adams, M. DeLorimier, N. Kapre, Y. Matsuda, H. Naeimi, M. Vanier,
and M. Wrighton, “Design patterns for reconfigurable computing,” in 12th Annual
IEEE Symposium on Field-Programmable Custom Computing Machines. Washington, DC, USA: IEEE Computer Society, 2004, pp. 13–23.


Appendix A
Generation of Test Vectors
Table A.1 shows the general structure of the test vector file. The size of the CLUT
values and image data section varies based on the conversion required, the resolution of the
CLUTs and the size of the input image.

Table A.1: General structure of the test vector file.
[Table body is not recoverable from the extracted text.]

Appendix B
Hardware Co-simulation
The following screen shots from Simulink show the hardware co-simulation model.
Figure B.1 shows the model that uses the top level HDL module as a block. This model
can be used for software simulation. Once the design is verified, a hardware co-simulation
block can be generated. The hardware co-simulation block is used to program the FPGA to
implement the CSC design. Figure B.2 shows the model with the hardware co-simulation
block. More details about performing hardware co-simulation are available in [26].

[Simulink screen shot. From Workspace sources (pipe_enbl, addr_valid, reg_write, reg_addr, reg_wdata, PreCscAck, PreCscEol, PreCscEop, PreCscOt, PreCscNop, PreCscData0 to PreCscData3, PostCscReq), Constant blocks, a Step and a Switch drive Gateway In00 to Gateway In23. The black box is labeled "csc with ce and bist removed (now with csc_3d_4d module)". Black-box port labels: nreset_t, cscsoftreset_t, pipeline_enable_t, addr_valid_t, reg_write_t, reg_addr_t, reg_wdata_t, hsel_t, haddr_t, htrans_t, hwrite_t, hsize_t, hwdata_t, hready_in_t, precscack_t, precsceol_t, precsceop_t, precscot_t, precscnop_t, precscdata0_t to precscdata3_t, postcscreq_t, reg_ready_t, reg_rdata_t, hrdata_t, hready_out_t, hresp_t, precscreq_t, postcscack_t, postcsceol_t, postcsceop_t, postcscot_t, postcscdata0_t to postcscdata3_t, postseccscdata0_t to postseccscdata3_t. Gateway Out00 to Gateway Out17 feed two Scope blocks and To Workspace sinks (Ch1 to Ch4, PostCscAck). A System Generator token is present.]
Figure B.1: System Generator project for simulation.

[Simulink screen shot. Same sources, gateways (Gateway In00 to In23, Gateway Out00 to Out17), scopes and sinks as in Figure B.1, connected to a JTAG Co-sim block labeled "csc hwcosim model (for 3x)".]
Figure B.2: System Generator project for hardware-in-the-loop testing.


Appendix C
Hardware and Software Used
Table C.1: List of hardware used for testing.

Development Board                           FPGA
XUP Virtex-II Pro Development System [8]    XC2VP30-7FF896
Annapolis Micro Systems WILDCARD-4 [9]      XC4VSX35-10FF668


Table C.2: List of software used in implementation and testing.

Software                                  Version               Purpose
Xilinx ISE (for normal designs)           8.2.03i               Synthesis, Place and Route; Software Simulation (stand alone and using System Generator)
Xilinx ISE (for PR designs)               8.2.01i PR 07b        Synthesis of netlist for PRM and Synthesis of PR designs
Xilinx iMPACT                             8.2.03i               Programming the FPGA
MATLAB                                    7.2.0.232 (R2006a)    Generating image information and extracting output from results
Simulink                                  6.4 (R2006a)          Create the hardware test bench
System Generator                          8.2                   Extends Simulink for use in FPGA hardware design
PlanAhead                                 8.2.10                Reserve PRR and Synthesis of PR designs
CSC Software Model (HP Executable)        5/23/2007 15:32:10    Convert source image into target color space
CLUT Data Converter (HP Executable)       0.810 Linux           Generate configuration register information, CLUT values and CSC data file
Board support Packages for FPGA boards    -                     Enables communication between MATLAB/Simulink and FPGA


Appendix D
MATLAB Source Code
This appendix presents the source code used in MATLAB/Simulink for hardware
co-simulation.
1. Config File - csc_config.m
This file is originally generated by System Generator when the top-level HDL module is imported into Simulink. It is then edited to include all the source files required for compiling the design. The implementation tools use it to generate the block used for hardware
co-simulation.
2. Preload File - im_preload.m
This file reads the test vector input text file. The columns of the input file are separated and applied to the inputs of the HDL module.
3. Post Execute File - post_exec.m
This file extracts the output image from the simulation results and compares it with
the output of the software model of the CSC. It also displays the result of the
comparison in a message box. The statistics (max, min and mean) of the two output
images and of the difference image, together with the time required to process the image, are displayed
in the MATLAB command window.


4/27/08 12:21 PM

Config Code - csc_config.m

function csc_config(this_block)
% Revision History:
%
%   18-Sep-2007 (04:14 hours):
%     Original code was machine generated by Xilinx's System Generator after parsing
%     C:\Work\PR\s_code_PR_test\csc.v
%
this_block.setTopLevelLanguage('Verilog')
this_block.setEntityName('csc')

% System Generator has to assume that your entity has a combinational feed through;
%   if it doesn't, then comment out the following line:
this_block.tagAsCombinational
this_block.addSimulinkInport('nreset_t')
this_block.addSimulinkInport('cscsoftreset_t')
this_block.addSimulinkInport('pipeline_enable_t')
this_block.addSimulinkInport('addr_valid_t')
this_block.addSimulinkInport('reg_write_t')
this_block.addSimulinkInport('reg_addr_t')
this_block.addSimulinkInport('reg_wdata_t')
this_block.addSimulinkInport('hsel_t')
this_block.addSimulinkInport('haddr_t')
this_block.addSimulinkInport('htrans_t')
this_block.addSimulinkInport('hwrite_t')
this_block.addSimulinkInport('hsize_t')
this_block.addSimulinkInport('hwdata_t')
this_block.addSimulinkInport('hready_in_t')
this_block.addSimulinkInport('precscack_t')
this_block.addSimulinkInport('precsceol_t')

this_block.addSimulinkInport('precsceop_t')
this_block.addSimulinkInport('precscot_t')
this_block.addSimulinkInport('precscnop_t')
this_block.addSimulinkInport('precscdata0_t')
this_block.addSimulinkInport('precscdata1_t')
this_block.addSimulinkInport('precscdata2_t')
this_block.addSimulinkInport('precscdata3_t')
this_block.addSimulinkInport('postcscreq_t')
this_block.addSimulinkOutport('reg_ready_t')
this_block.addSimulinkOutport('reg_rdata_t')
this_block.addSimulinkOutport('hrdata_t')
this_block.addSimulinkOutport('hready_out_t')
this_block.addSimulinkOutport('hresp_t')
this_block.addSimulinkOutport('precscreq_t')
this_block.addSimulinkOutport('postcscack_t')
this_block.addSimulinkOutport('postcsceol_t')
this_block.addSimulinkOutport('postcsceop_t')
this_block.addSimulinkOutport('postcscot_t')
this_block.addSimulinkOutport('postcscdata0_t')
this_block.addSimulinkOutport('postcscdata1_t')
this_block.addSimulinkOutport('postcscdata2_t')
this_block.addSimulinkOutport('postcscdata3_t')
this_block.addSimulinkOutport('postseccscdata0_t')
this_block.addSimulinkOutport('postseccscdata1_t')
this_block.addSimulinkOutport('postseccscdata2_t')
this_block.addSimulinkOutport('postseccscdata3_t')
reg_ready_t_port = this_block.port('reg_ready_t')
reg_ready_t_port.setType('UFix_1_0')
reg_ready_t_port.useHDLVector(false)
reg_rdata_t_port = this_block.port('reg_rdata_t')
reg_rdata_t_port.setType('UFix_32_0')

hrdata_t_port = this_block.port('hrdata_t')
hrdata_t_port.setType('UFix_32_0')
hready_out_t_port = this_block.port('hready_out_t')
hready_out_t_port.setType('UFix_1_0')
hready_out_t_port.useHDLVector(false)
hresp_t_port = this_block.port('hresp_t')
hresp_t_port.setType('UFix_1_0')
hresp_t_port.useHDLVector(false)
precscreq_t_port = this_block.port('precscreq_t')
precscreq_t_port.setType('UFix_1_0')
precscreq_t_port.useHDLVector(false)
postcscack_t_port = this_block.port('postcscack_t')
postcscack_t_port.setType('UFix_1_0')
postcscack_t_port.useHDLVector(false)
postcsceol_t_port = this_block.port('postcsceol_t')
postcsceol_t_port.setType('UFix_1_0')
postcsceol_t_port.useHDLVector(false)
postcsceop_t_port = this_block.port('postcsceop_t')
postcsceop_t_port.setType('UFix_1_0')
postcsceop_t_port.useHDLVector(false)
postcscot_t_port = this_block.port('postcscot_t')
postcscot_t_port.setType('UFix_2_0')
postcscdata0_t_port = this_block.port('postcscdata0_t')
postcscdata0_t_port.setType('UFix_12_0')
postcscdata1_t_port = this_block.port('postcscdata1_t')
postcscdata1_t_port.setType('UFix_12_0')
postcscdata2_t_port = this_block.port('postcscdata2_t')
postcscdata2_t_port.setType('UFix_12_0')
postcscdata3_t_port = this_block.port('postcscdata3_t')
postcscdata3_t_port.setType('UFix_12_0')
postseccscdata0_t_port = this_block.port('postseccscdata0_t')
postseccscdata0_t_port.setType('UFix_12_0')
postseccscdata1_t_port = this_block.port('postseccscdata1_t')

postseccscdata1_t_port.setType('UFix_12_0')
postseccscdata2_t_port = this_block.port('postseccscdata2_t')
postseccscdata2_t_port.setType('UFix_12_0')
postseccscdata3_t_port = this_block.port('postseccscdata3_t')
postseccscdata3_t_port.setType('UFix_12_0')
% -----------------------------
if (this_block.inputTypesKnown)
% do input type checking, dynamic output type and generic setup in this code block.
if (this_block.port('nreset_t').width ~= 1)
this_block.setError('Input data type for port "nreset_t" must have width=1.')
end
this_block.port('nreset_t').useHDLVector(false)
if (this_block.port('reg_write_t').width ~= 1)
this_block.setError('Input data type for port "reg_write_t" must have width=1.')
end
this_block.port('reg_write_t').useHDLVector(false)
if (this_block.port('reg_addr_t').width ~= 18)
this_block.setError('Input data type for port "reg_addr_t" must have width=18.')
end
if (this_block.port('reg_wdata_t').width ~= 32)
this_block.setError('Input data type for port "reg_wdata_t" must have width=32.')
end
if (this_block.port('hsel_t').width ~= 1)
this_block.setError('Input data type for port "hsel_t" must have width=1.')
end
this_block.port('hsel_t').useHDLVector(false)
if (this_block.port('haddr_t').width ~= 18)
this_block.setError('Input data type for port "haddr_t" must have width=18.')
end
if (this_block.port('htrans_t').width ~= 2)
this_block.setError('Input data type for port "htrans_t" must have width=2.')
end
if (this_block.port('hwrite_t').width ~= 1)
this_block.setError('Input data type for port "hwrite_t" must have width=1.')
end

this_block.port('hwrite_t').useHDLVector(false)
if (this_block.port('hsize_t').width ~= 3)
this_block.setError('Input data type for port "hsize_t" must have width=3.')
end
if (this_block.port('hwdata_t').width ~= 32)
this_block.setError('Input data type for port "hwdata_t" must have width=32.')
end
if (this_block.port('hready_in_t').width ~= 1)
this_block.setError('Input data type for port "hready_in_t" must have width=1.')
end
this_block.port('hready_in_t').useHDLVector(false)
if (this_block.port('precscack_t').width ~= 1)
this_block.setError('Input data type for port "precscack_t" must have width=1.')
end
this_block.port('precscack_t').useHDLVector(false)
if (this_block.port('precsceol_t').width ~= 1)
this_block.setError('Input data type for port "precsceol_t" must have width=1.')
end
this_block.port('precsceol_t').useHDLVector(false)
if (this_block.port('precsceop_t').width ~= 1)
this_block.setError('Input data type for port "precsceop_t" must have width=1.')
end
this_block.port('precsceop_t').useHDLVector(false)

if (this_block.port('precscot_t').width ~= 2)
this_block.setError('Input data type for port "precscot_t" must have width=2.')
end
if (this_block.port('precscnop_t').width ~= 1)
this_block.setError('Input data type for port "precscnop_t" must have width=1.')
end
this_block.port('precscnop_t').useHDLVector(false)
if (this_block.port('precscdata0_t').width ~= 16)
this_block.setError('Input data type for port "precscdata0_t" must have width=16.')
end
if (this_block.port('precscdata1_t').width ~= 16)
this_block.setError('Input data type for port "precscdata1_t" must have width=16.')
end
if (this_block.port('precscdata2_t').width ~= 16)
this_block.setError('Input data type for port "precscdata2_t" must have width=16.')
end
if (this_block.port('precscdata3_t').width ~= 16)
this_block.setError('Input data type for port "precscdata3_t" must have width=16.')
end
if (this_block.port('postcscreq_t').width ~= 1)
this_block.setError('Input data type for port "postcscreq_t" must have width=1.')
end
this_block.port('postcscreq_t').useHDLVector(false)

end % if(inputTypesKnown)
% -----------------------------
% -----------------------------
if (this_block.inputRatesKnown)
setup_as_single_rate(this_block,'clk_t','ce_t')
end % if(inputRatesKnown)
% -----------------------------
% Add additional source files as needed.
% |-------------
% | Add files in the order in which they should be compiled.
% | If two files "a.vhd" and "b.vhd" contain the entities
% | entity_a and entity_b, and entity_a contains a
% | component of type entity_b, the correct sequence of
% | addFile() calls would be:
% |    this_block.addFile('b.vhd');
% |    this_block.addFile('a.vhd');
% |-------------

this_block.addFile('../csc_defs.vh')
this_block.addFile('../rbist_defs.vh')
this_block.addFile('../sram_65x24.edn')
this_block.addFile('../sram_65x24.xco')
this_block.addFile('../sram_65x24.v')
this_block.addFile('../sram_dp_bw_65x48_wrapper_coregen2.v')
this_block.addFile('../csc_lut_wrappers_1d.v')

this_block.addFile('../csc_phase2_1d_channel.v')
this_block.addFile('../csc_phase2_1d.v')
this_block.addFile('../csc_phase1_1d.v')
this_block.addFile('../csc_lut1d.v')
this_block.addFile('../sram_729x15.edn')
this_block.addFile('../sram_729x15.xco')
this_block.addFile('../sram_729x15.v')
this_block.addFile('../sram_729x20.edn')
this_block.addFile('../sram_729x20.xco')
this_block.addFile('../sram_729x20.v')
this_block.addFile('../sram_256x20.edn')
this_block.addFile('../sram_256x20.xco')
this_block.addFile('../sram_256x20.v')
this_block.addFile('../sram_320x20.edn')
this_block.addFile('../sram_320x20.xco')
this_block.addFile('../sram_320x20.v')
this_block.addFile('../sram_400x20.edn')
this_block.addFile('../sram_400x20.xco')
this_block.addFile('../sram_400x20.v')
this_block.addFile('../sram_500x20.edn')
this_block.addFile('../sram_500x20.xco')
this_block.addFile('../sram_500x20.v')
this_block.addFile('../sram_512x15.edn')
this_block.addFile('../sram_512x15.xco')
this_block.addFile('../sram_512x15.v')

this_block.addFile('../sram_512x20.edn')
this_block.addFile('../sram_512x20.xco')
this_block.addFile('../sram_512x20.v')
this_block.addFile('../sram_576x15.edn')
this_block.addFile('../sram_576x15.xco')
this_block.addFile('../sram_576x15.v')
this_block.addFile('../sram_576x20.edn')
this_block.addFile('../sram_576x20.xco')
this_block.addFile('../sram_576x20.v')
this_block.addFile('../sram_625x20.edn')
this_block.addFile('../sram_625x20.xco')
this_block.addFile('../sram_625x20.v')
this_block.addFile('../sram_648x15.edn')
this_block.addFile('../sram_648x15.xco')
this_block.addFile('../sram_648x15.v')
this_block.addFile('../sram_648x20.edn')
this_block.addFile('../sram_648x20.xco')
this_block.addFile('../sram_648x20.v')
this_block.addFile('../sram_sp_bw_256x40_wrapper_coregen2.v')
this_block.addFile('../sram_sp_bw_320x40_wrapper_coregen2.v')
this_block.addFile('../sram_sp_bw_400x40_wrapper_coregen2.v')
this_block.addFile('../sram_sp_bw_500x40_wrapper_coregen2.v')
this_block.addFile('../sram_sp_bw_512x30_wrapper_coregen2.v')
this_block.addFile('../sram_sp_bw_512x40_wrapper_coregen2.v')
this_block.addFile('../sram_sp_bw_576x30_wrapper_coregen2.v')
this_block.addFile('../sram_sp_bw_576x40_wrapper_coregen2.v')
this_block.addFile('../sram_sp_bw_625x40_wrapper_coregen2.v')

this_block.addFile('../sram_sp_bw_648x30_wrapper_coregen2.v')
this_block.addFile('../sram_sp_bw_648x40_wrapper_coregen2.v')
this_block.addFile('../sram_sp_bw_729x30_wrapper_coregen2.v')
this_block.addFile('../sram_sp_bw_729x40_wrapper_coregen2.v')
this_block.addFile('../csc_lut_wrappers_3d.v')
this_block.addFile('../csc_phase1_3d.v')
this_block.addFile('../csc_phase2_3d.v')
this_block.addFile('../csc_phase3_3d_channel.v')
this_block.addFile('../csc_phase3_3d.v')
this_block.addFile('../csc_3d.v')
this_block.addFile('../csc_4d.v')
this_block.addFile('../csc_lut_wrappers_4d.v')
this_block.addFile('../csc_phase1_4d.v')
this_block.addFile('../csc_phase2_4d.v')
this_block.addFile('../csc_phase3_4d_channel.v')
this_block.addFile('../csc_phase3_4d.v')
this_block.addFile('../csc_lutchk_crc16_parallel_n.v')
this_block.addFile('../lutchk_ctrl.v')
this_block.addFile('../lutchk_regs.v')
this_block.addFile('../lutchk_timer.v')
this_block.addFile('../lutchk_top.v')
this_block.addFile('../ahblite_to_regbus.v')
this_block.addFile('../csc_reg.v')
this_block.addFile('../busmacro_l2r_standin.v')
this_block.addFile('../csc_3d_4d.v')
this_block.addFile('../csc_control.v')
this_block.addFile('../pipeline_handshake_enable.v')
this_block.addFile('../csc_isolation_stage.v')
this_block.addFile('../csc_auto_load.v')
this_block.addFile('../csc_pre_match.v')
this_block.addFile('../csc_k_plane_mag.v')
this_block.addFile('../csc_target.v')

this_block.addFile('../csc_toner_limit.v')
this_block.addFile('../clk_w_ce.v')
this_block.addFile('../csc.v')
return
% ------------------------------------------------------------
function setup_as_single_rate(block,clkname,cename)
inputRates = block.inputRates
uniqueInputRates = unique(inputRates)
if (length(uniqueInputRates)==1 & uniqueInputRates(1)==Inf)
block.setError('The inputs to this block cannot all be constant.')
return
end
if (uniqueInputRates(end) == Inf)
hasConstantInput = true
uniqueInputRates = uniqueInputRates(1:end-1)
end
if (length(uniqueInputRates) ~= 1)
block.setError('The inputs to this block must run at a single rate.')
return
end
theInputRate = uniqueInputRates(1)
for i = 1:block.numSimulinkOutports
block.outport(i).setRate(theInputRate)
end
block.addClkCEPair(clkname,cename,theInputRate)
return
% ------------------------------------------------------------
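
The rate check performed by setup_as_single_rate can be tried outside System Generator with plain numbers. The following is a minimal sketch under stated assumptions, not the generated helper itself: the sample rate vector is made up, error() stands in for block.setError, and the final fprintf stands in for the setRate and addClkCEPair calls.

```matlab
% Hypothetical sample rates for four inputs; Inf marks a constant input.
inputRates = [1 Inf 1 1];

% unique() returns sorted values, so Inf (if present) ends up last.
uniqueInputRates = unique(inputRates);

% Reject the case where every input is a constant.
if all(isinf(uniqueInputRates))
    error('The inputs to this block cannot all be constant.');
end

% Drop the constant-rate entry before checking for a single rate.
uniqueInputRates = uniqueInputRates(~isinf(uniqueInputRates));

if numel(uniqueInputRates) ~= 1
    error('The inputs to this block must run at a single rate.');
end

% In the real helper this rate is applied to every output port and to
% the clock/clock-enable pair.
theInputRate = uniqueInputRates(1);
fprintf('All outputs would be assigned rate %g\n', theInputRate);
```

Constant inputs are ignored when the rate is chosen, which is why the co-simulation model can tie unused ports to Constant blocks without affecting the block's sample rate.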


4/27/08 12:12 PM

Pre-Load Code - im_preload.m


%% This file is used to read the input text file and load the vectors
% into the matlab workspace, for use in simulink.
clear all
close all
%% Read the Configuration data file.
[InFileName,PathName] = uigetfile({'*.txt','Text Files (*.txt)'; '*.*','All Files (*.*)'}, ...
    'File Selector - Configuration Data','..\..\Sim_Inputs\');

if isequal(InFileName,0)
    disp('File not selected. Restart execution.')
    msgbox('File not selected. Restart simulation.','HW Simulation','help')
else
    fid = fopen([PathName InFileName]);
    confdata = textscan(fid, '%u8 %u8 %u8 %5c %8c %u8 %u8 %u8 %u8 %u8 %4c %4c %4c %4c %u8');
    fclose(fid);
    %% Generate the input vectors from the configuration file.
    pipe_enbl   = confdata{1,1};
    addr_valid  = confdata{1,2};
    reg_write   = confdata{1,3};
    reg_addr    = hex2dec(confdata{1,4});
    reg_wdata   = hex2dec(confdata{1,5});
    PreCscAck   = confdata{1,6};
    PreCscEol   = confdata{1,7};
    PreCscEop   = confdata{1,8};
    PreCscOt    = confdata{1,9};
    PreCscNop   = confdata{1,10};
    PreCscData0 = hex2dec(confdata{1,11});
    PreCscData1 = hex2dec(confdata{1,12});
    PreCscData2 = hex2dec(confdata{1,13});
    PreCscData3 = hex2dec(confdata{1,14});
    PostCscReq  = confdata{1,15};

    % Find the number of rows in the input vectors.
    % Any of the vectors can be used; pipe_enbl is an arbitrary choice.
    [row_count,col_count] = size(pipe_enbl);
    %% Add the 'time value' column to each of the input vectors.
    % The data values also need to be in double form.
    %% The 'From Workspace' block in Simulink requires the array to
    % be in the following format -> var = [TimeValues DataValues]
    pipe_enbl   = [ double(0:row_count-1)' double(pipe_enbl)];
    addr_valid  = [ double(0:row_count-1)' double(addr_valid)];
    reg_write   = [ double(0:row_count-1)' double(reg_write)];
    reg_addr    = [ double(0:row_count-1)' double(reg_addr)];
    reg_wdata   = [ double(0:row_count-1)' double(reg_wdata)];
    PreCscAck   = [ double(0:row_count-1)' double(PreCscAck)];
    PreCscEol   = [ double(0:row_count-1)' double(PreCscEol)];
    PreCscEop   = [ double(0:row_count-1)' double(PreCscEop)];
    PreCscOt    = [ double(0:row_count-1)' double(PreCscOt)];
    PreCscNop   = [ double(0:row_count-1)' double(PreCscNop)];
    PreCscData0 = [ double(0:row_count-1)' double(PreCscData0)];
    PreCscData1 = [ double(0:row_count-1)' double(PreCscData1)];
    PreCscData2 = [ double(0:row_count-1)' double(PreCscData2)];
    PreCscData3 = [ double(0:row_count-1)' double(PreCscData3)];
    PostCscReq  = [ double(0:row_count-1)' double(PostCscReq)];
end
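
The two conversions the preload script relies on, hexadecimal text to numbers and a data column to the [TimeValues DataValues] layout, can be tried on a few in-memory rows without a test vector file. This is a minimal sketch, not the thesis code: the sample values and the helper name to_from_workspace are made up for illustration.

```matlab
% Hypothetical sample: three clock cycles of a register address, given as
% 5-character hexadecimal strings (one row per cycle), plus a write flag.
reg_addr_hex = ['00000'; '0000C'; '3FFFF'];
reg_write_u8 = uint8([0; 1; 1]);

% hex2dec converts each row of the character array to a double.
reg_addr_val = hex2dec(reg_addr_hex);

% Hypothetical helper: prepend a time column 0,1,2,... to a data column.
% The data is cast to double first so the result is a double matrix.
to_from_workspace = @(v) [(0:numel(v)-1)' double(v(:))];

reg_addr  = to_from_workspace(reg_addr_val);
reg_write = to_from_workspace(reg_write_u8);

disp(reg_addr)
disp(class(reg_write))
```

The cast to double matters: concatenating a double column with a uint8 column yields a uint8 matrix in MATLAB, so time values above 255 would saturate.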


4/27/08 12:42 PM

Post-Execute Code - post_exec.m


%% The program is used to compare the outputs of the Xilinx or Hardware
% co-simulation and the csc application without losing any bits (bit width = 16)
%% Read File : Result of the CSC Application.
[OutFileName,PathName,FilterIndex] = uigetfile({'*.tiff;*.tif','TIFF Files (*.tiff,*.tif)'}, ...
    'Choose result of CSC Application','..\..\csc_v3\');

if isequal(OutFileName,0)
    disp ('File not selected. Restart simulation.')
    msgbox('File not selected. Restart simulation.','HW Simulation','help')
else
    app_img = imread([PathName OutFileName]);
    [num_row,num_col,dim] = size(app_img);
    num_pixels = num_row * num_col;
    % We can use the PostCscAck signal to find out where the image pixel starts
    i = find(PostCscAck == 1);
    startrow = i(1) + 1;
    % The extra 1 is because of the blank first pixel. This can be removed if the config file is modified
    iCh1 = Ch1(startrow:startrow + num_pixels -1,1);
    iCh2 = Ch2(startrow:startrow + num_pixels -1,1);
    iCh3 = Ch3(startrow:startrow + num_pixels -1,1);
    iCh4 = Ch4(startrow:startrow + num_pixels -1,1);

    sim_C = uint16(reshape(iCh1,num_row,num_col));
    sim_M = uint16(reshape(iCh2,num_row,num_col));
    sim_Y = uint16(reshape(iCh3,num_row,num_col));
    sim_K = uint16(reshape(iCh4,num_row,num_col));

    sim_img (:,:,1) = sim_C;
    sim_img (:,:,2) = sim_M;
    sim_img (:,:,3) = sim_Y;
    sim_img (:,:,4) = sim_K;
    %% Display the statistics of the individual images and the comparison
    %clc
    disp ('Result Image from Simulation')
    disp ('? bit      C      M      Y      K')
    maxval = sprintf('Max %6.3f %6.3f %6.3f %6.3f',max(max(sim_img(:,:,1))),max(max(sim_img(:,:,2))), ...
        max(max(sim_img(:,:,3))),max(max(sim_img(:,:,4))));
    disp(maxval)
    meanval = sprintf('Mean %6.3f %6.3f %6.3f %6.3f',mean(mean(sim_img(:,:,1))),mean(mean(sim_img(:,:,2))), ...
        mean(mean(sim_img(:,:,3))),mean(mean(sim_img(:,:,4))));
    disp(meanval)
    minval = sprintf('Min %6.3f %6.3f %6.3f %6.3f\n',min(min(sim_img(:,:,1))),min(min(sim_img(:,:,2))), ...
        min(min(sim_img(:,:,3))),min(min(sim_img(:,:,4))));
    disp(minval)
    disp (OutFileName)
    disp ('? bit      C      M      Y      K')
    maxval = sprintf('Max %6.3f %6.3f %6.3f %6.3f',max(max(app_img(:,:,1))),max(max(app_img(:,:,2))), ...
        max(max(app_img(:,:,3))),max(max(app_img(:,:,4))));
    disp(maxval)
    meanval = sprintf('Mean %6.3f %6.3f %6.3f %6.3f',mean(mean(app_img(:,:,1))),mean(mean(app_img(:,:,2))), ...
        mean(mean(app_img(:,:,3))),mean(mean(app_img(:,:,4))));
    disp(meanval)
    minval = sprintf('Min %6.3f %6.3f %6.3f %6.3f\n',min(min(app_img(:,:,1))),min(min(app_img(:,:,2))), ...
        min(min(app_img(:,:,3))),min(min(app_img(:,:,4))));
    disp(minval)
    diff = double(app_img) - double(sim_img);
    disp ('Difference in CMYK Values (Application - Simulation)')
    disp ('? bit      C      M      Y      K')
    maxval = sprintf('Max %6.3f %6.3f %6.3f %6.3f',max(max(diff(:,:,1))),max(max(diff(:,:,2))), ...
        max(max(diff(:,:,3))),max(max(diff(:,:,4))));
    disp(maxval)
    meanval = sprintf('Mean %6.3f %6.3f %6.3f %6.3f',mean(mean(diff(:,:,1))),mean(mean(diff(:,:,2))), ...
        mean(mean(diff(:,:,3))),mean(mean(diff(:,:,4))));
    disp(meanval)
    minval = sprintf('Min %6.3f %6.3f %6.3f %6.3f\n',min(min(diff(:,:,1))),min(min(diff(:,:,2))), ...
        min(min(diff(:,:,3))),min(min(diff(:,:,4))));
    disp(minval)

    %% Compare the two images using isequal
    resmatch = isequal(app_img,sim_img);
    if resmatch
        match_image = sprintf('The images match');
    else
        match_image = sprintf('The images do not match');
    end
    % Display the results in the MATLAB console and a message box
    disp_info = sprintf('Image1 : Result Image from Simulation,\n (Input = %s) \nImage2 : %s \n %s', ...
        InFileName,OutFileName,match_image);
    disp(disp_info)
    msgbox(disp_info,'HW Simulation Results','help')
end
% Display processing time after im_preload in the InitFcn, at start of
% StartFcn and after end of simulation - StopFcn
times = sprintf('\nSimulation times \n Load = %2.4f seconds \n Configuration = %2.4f seconds \n Simulation = %2.4f seconds', ...
    toc1,toc2-toc1,toc3-toc1);
disp (times)
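
The extraction and comparison steps of the post-execute script can be exercised on a small synthetic stream without running the co-simulation. The following is a minimal sketch under stated assumptions: the 2x3 reference channel, the stream layout (two idle samples, one blank pixel, then the image in column-major order), and the variable names ack, ch and ref are all made up for illustration.

```matlab
% Hypothetical 2x3 reference channel, standing in for one plane of the
% software model output.
ref = uint16([10 20 30; 40 50 60]);
[num_row, num_col] = size(ref);
num_pixels = num_row * num_col;

% Assumed stream: two idle samples, one blank pixel, then the image in
% column-major order (the order that reshape fills a matrix).
ack = [0; 0; 1; ones(num_pixels, 1)];
ch  = [0; 0; 0; double(ref(:))];

% The first asserted acknowledge marks the blank pixel; the image
% starts one sample later.
first_ack = find(ack == 1, 1);
startrow  = first_ack + 1;
pix       = ch(startrow:startrow + num_pixels - 1);
sim       = uint16(reshape(pix, num_row, num_col));

% Difference image and its statistics, computed in double precision.
d = double(ref) - double(sim);
fprintf('max %g  mean %g  min %g\n', max(d(:)), mean(d(:)), min(d(:)));
images_match = isequal(ref, sim);
```

Both images are converted to double before subtraction because unsigned integer arithmetic in MATLAB saturates at zero, which would hide every negative difference.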

