Rochester Institute of Technology

RIT Scholar Works
Theses

Thesis/Dissertation Collections

12-1-2011

Dynamic partial reconfiguration for pipelined
digital systems— A Case study using a color space
conversion engine
Ryan Toukatly

Recommended Citation
Toukatly, Ryan, "Dynamic partial reconfiguration for pipelined digital systems— A Case study using a color space conversion engine"
(2011). Thesis. Rochester Institute of Technology. Accessed from

This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion
in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact ritscholarworks@rit.edu.

Dynamic Partial Reconfiguration for Pipelined Digital Systems
A Case Study Using A Color Space Conversion Engine
by
Ryan Michael Toukatly
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Science
in
Electrical Engineering
Approved By:

Dr. Dorin Patru
Associate Professor, Department of Electrical and Microelectronic Engineering
Thesis Advisor

Dr. Eli Saber
Professor, Department of Electrical and Microelectronic Engineering

Dr. Marcin Lukowiak
Assistant Professor, Department of Computer Engineering

Dr. Sohail Dianat
Department Head, Department of Electrical and Microelectronic Engineering
Department of Electrical Engineering
Kate Gleason College of Engineering
Rochester Institute of Technology
Rochester, New York
December 2011

Thesis Release Permission Form
Rochester Institute of Technology
Kate Gleason College of Engineering

Title: Dynamic Partial Reconfiguration for Pipelined Digital Systems —
A Case Study Using A Color Space Conversion Engine

I, Ryan Michael Toukatly, hereby grant permission to the Wallace Memorial Library to
reproduce my thesis in whole or part.

Ryan Michael Toukatly

Date

Dedication

Dedicated to my family for supporting me in numerous ways,
and to Ellen for keeping me looking toward the future.

Acknowledgments

I would like to thank my advisor Dr. Dorin Patru
for his guidance throughout the project;
Dr. Eric Peskin, Brad Larson, and Gene Roylance
for providing this research opportunity and valuable insight;
and Jordan Hibbits and Alex Mykyta
for assistance at the beginning and the end of my work.

Abstract
In digital hardware design, reconfigurable devices such as Field Programmable Gate Arrays (FPGAs) allow for a unique feature called partial reconfiguration (PR). This refers
to the reprogramming of a subset of the reconfigurable logic during active operation. PR
allows multiple hardware blocks to be consolidated into a single partition, which can be
reprogrammed at run-time as desired. This may reduce the logic circuit (and silicon area)
requirements and greatly extend functionality. Furthermore, dynamic partial reconfiguration (DPR) refers to PR that does not halt the system during reprogramming. This allows
for configuration to overlap with normal processing, potentially achieving better system
performance than a static (halting) PR implementation.
This work has investigated the advantages and trade-offs of DPR as applied to an existing color space conversion (CSC) engine provided by Hewlett-Packard (HP). Two versions
were created: a single-pipeline engine, which can only overlap tasks in specific sequences;
and a dual-pipeline engine, which can overlap any consecutive tasks. These were implemented in a Virtex-6 FPGA. Data communication occurs over the PCI Express (PCIe)
interface. Test results show improvements in execution speed and resource utilization,
though some are minor due to intrinsic characteristics of the CSC engine pipeline. The
dual-pipeline version outperformed the single-pipeline in most test cases. Therefore, future work will focus on multiple-pipeline architectures.

Contents

Dedication
Acknowledgments
Abstract
Table of Contents
List of Figures
List of Tables

1 Introduction
2 Background
3 Theory
    3.1 Benefits and Drawbacks of DPR
    3.2 Color Space Conversion
    3.3 Potential for Engine Improvement
4 Experimental Setup
    4.1 Hardware Components
    4.2 Software Components
5 Methodology
    5.1 Upgrading the Platform
    5.2 Integrating DPR
    5.3 Communicating with a PC
    5.4 Expanding to Two Pipelines
    5.5 Testing Procedures
6 Results
    6.1 CSC Performance
    6.2 Configuration Times
    6.3 FPGA Resources
7 Discussion
    7.1 Observed Benefits and Drawbacks
    7.2 The PC-FPGA Platform
    7.3 Future Work
8 Conclusions
References
A Hardware Setup Details
B Software Setup Details
C ICAP Control Module

List of Figures

2.1 Example System without PR
2.2 Example System with PR
3.1 Conversion Methods, Real-time v. CLUT-interpolate
3.2 Overview of the Static CSC Engine
3.3 Processing Example without PR
3.4 The PR-Enabled CSC Engine
3.5 Processing Example with DPR
3.6 The Dual-Pipe PR CSC Engine
3.7 Processing Example with Dual-Pipe DPR
4.1 PC-FPGA-PC Hardware Setup
4.2 Development Software Flow Diagram
4.3 Test Software Flow Diagram
5.1 The ICAP VIRTEX6 Block
5.2 The Modified, PR-Capable 3D/4D Stage
5.3 The PCIe Reference Design
5.4 The Full Software-PCIe-CSC System
5.5 Dual-Pipeline Support Logic
5.6 Verification Process: Simulation of Migrated CSC
5.7 Verification Process: Hardware BIST
7.1 Speedup of Equal-Length Tasks
7.2 Speedup of Unequal-Length Tasks
7.3 Single-pipe Layout
7.4 Dual-pipe Layout
7.5 Input Vector Timing Diagram with HDL Pseudocode
7.6 Theoretical Multi-Pipe Engine
7.7 Theoretical PR-Pool Engine

List of Tables

3.1 Overview of CSC Input Signals
5.1 Comparison of I/O Interface Options
5.2 Single-Pipe CSC Test Cases
5.3 Dual-Pipe CSC Test Cases
6.1 Performance Results, Single-Pipe
6.2 Performance Results, Dual-Pipe
6.3 Bitstream Sizes & Configuration Times, Single-Pipe
6.4 Bitstream Sizes & Configuration Times, Dual-Pipe
6.5 Resource Utilization & Power Estimates, Single-Pipe
6.6 Resource Utilization & Power Estimates, Dual-Pipe

Chapter 1
Introduction
In specialized computing applications, field programmable gate array (FPGA) devices are
often employed when a microprocessor is not powerful enough and an application specific
integrated circuit (ASIC) solution is too expensive (in terms of production costs or development time). Compared to an ASIC, a system on an FPGA can typically be designed more
quickly, and modifications can be implemented and tested “immediately” without any need
to produce new chips or boards. Compared to a microprocessor executing a software routine, an FPGA can perform equivalent tasks faster (even at lower clock frequencies) assuming the workload can be parallelized to some extent. This is because a proper FPGA design
will utilize available logic resources more efficiently; many independent processing units
can work simultaneously on different data or tasks. Similarly, FPGA power consumption
may be less since microprocessors spend more cycles scheduling instructions and waiting
for limited functional units to become available. FPGAs also have disadvantages that must
be considered: the unit cost is greater than that of a microprocessor with equivalent computing
power, a smaller selection of platforms and development tools is available, and system
design complexity may increase.
The processing strengths of FPGA chips are derived from their specialized structures.
They contain large pools of combinational and sequential logic elements which can be connected and enabled as desired. Most of these elements are basic blocks such as registers,
buffers, lookup tables (LUTs), and simple logic gates. Some FPGAs include dedicated
blocks which handle digital signal processing (DSP) functions, mixed signal conversions,
or input/output (I/O) protocols. An FPGA is programmed or configured by streaming
in a special sequence of bits, called a bitstream, which may toggle solid-state interconnects, define LUT values, etc. These bits are written to volatile static random access
memory (SRAM) elements, which must be reprogrammed manually (or booted from nonvolatile memory) each time the system is powered up. In normal usage, a bitstream will
reprogram the entire FPGA; this process is referred to as reconfiguration.
Certain FPGA platforms support an extended feature called partial reconfiguration (PR).
This allows small partitions of the chip to be reprogrammed independently, using smaller
bitstreams. Modules implementing different functionality can therefore be swapped in and
out of a larger system, rather than all residing on-chip together. The advantages of PR systems are conservation of logic resources and layout space, as well as shorter times in the
implementation, bitstream generation, and programming stages. However, the operation
of the chip may be temporarily halted during partition programming, which interferes with
overall system performance. One solution to this is dynamic partial reconfiguration (DPR),
a mode which allows programming of an individual partition while the rest of the system
proceeds normally, available in some higher-end FPGA device families. In modular systems which benefit from basic PR, it is proposed that utilizing DPR (and therefore overlapping processing and reconfiguration stages) can improve the overall system performance.
This thesis demonstrates a methodology which applies DPR techniques to an existing
digital system, specifically a color space conversion (CSC) engine that processes sequences
of full-color images. The modifications are described in detail, and various metrics are used
to measure improvements in performance and functionality. Observed drawbacks are also
reported. Finally, potential areas for DPR refinement and extension are proposed.

Chapter 2
Background
In the last two decades, FPGA technologies have evolved tremendously and attained a
majority status in the market of reconfigurable logic devices. Due to their increasing capabilities and decreasing costs, FPGAs have been replacing (or sharing computational roles
with) microprocessors, ASICs, and simpler programmable logic devices (PLDs) in many
applications [1]. Microprocessors are widely available and well-suited for general purpose
computing, but are often the least efficient option due to their highly-sequential operation
and non-configurable architecture. FPGAs are commonly preferred in real-time computing
for their massive parallelism, large quantity of diverse logic blocks, and customizable data
paths and I/O interfaces. Even compared to other PLDs such as programmable logic arrays (PLAs) and complex programmable logic devices (CPLDs), FPGAs have significant
advantages in resource diversity and routing flexibility.
Trade-offs between FPGAs and ASICs also exist; they have become a significant topic
of research, in both academia and industry. ASICs are usually superior in computing performance since they are fine-tuned and optimized for each desired application. Additionally,
clock speeds on FPGAs are typically lower due to longer propagation delays and physical
trace lengths. In most cases, an ASIC implementation will also require less power and
layout space, since FPGAs must accommodate reprogramming overhead and chip-level control logic. (However, these gaps in requirements are continuously shrinking, and FPGA
overhead may become negligible as the implemented designs grow larger [11].)
ASICs have those advantages, but suffer from their own set of drawbacks. The costs
and time involved in developing and testing a new ASIC are far greater than for an FPGA
implementation of the same system; the time-to-market is a critical factor of success in
any competitive product field. Also, since ASICs cannot be reconfigured after production, the interval between design revisions is measured in months or years. In contrast,
FPGA designs can be upgraded at any time (even by users in the field) by a comparatively
simple firmware update. For these reasons, hybrid ASIC/FPGA platforms have been proposed [21]. These implement simple ASIC “containers” (including static I/O logic) around
internal logic that can be reconfigured and customized.
Unique to FPGA devices is a feature called partial reconfiguration (PR), sometimes
called “run-time reconfiguration”. PR allows small partitions to be reprogrammed, independently of other partitions, after the initial full-chip configuration. This creates the possibility for functional blocks to be loaded into a system as needed, in a plugin-like fashion [9].
The main benefits of PR are increased functionality combined with decreased layout size,
which are usually conflicting goals. The process can either be initiated by an external
controller or by the system itself, called self-partial reconfiguration [6]. The latter is utilized in adaptive systems, which automatically detect the need for a new function and load
it (as a partial reconfiguration bitstream) from some attached memory source [3]. Since
externally-stored bitstreams may be intercepted during read access, they can optionally be
encrypted to prevent third-party replication of the design [5].
Figure 2.1 shows a simple example system, which processes data through one of N
selectable functions. Figure 2.2 shows an equivalent system which utilizes PR. Note that
the routing complexity is slightly reduced, and the area reserved for processing is greatly
reduced. This comes at the cost of some added control logic (plus execution time dedicated
to reconfiguration). Because the functional modules are merged, only one is available at any given
moment; depending on the application, this restriction may or may not be acceptable.
Figure 2.1: Example System without PR
(Blocks shown: Input Data, Function 1 through Function N, Selection Logic, Output Data)

Figure 2.2: Example System with PR
(Blocks shown: Input Data, Variant Function, PR Control, Bitstream Source, Output Data)

PR is particularly beneficial in embedded applications (in which area and power consumption are very important metrics) or where many variations of similar functionality
need to be available (encryption blocks, signal processing filters, I/O protocols, etc.). In
FPGAs that implement static partial reconfiguration, each reprogramming action will temporarily halt the entire chip (performance may be impaired). This halting does not occur
in higher-end FPGAs that support dynamic partial reconfiguration (DPR). DPR allows individual modules to be reprogrammed while others continue to operate, enabling a method
of uninterrupted hardware multi-tasking [14].
DPR is currently available on high-end FPGA families, including the Virtex family
produced by Xilinx, Inc. Virtex devices and Xilinx software tools have been used to create
DPR-based Dynamic Hardware Plugins (DHPs): partitions which support diverse run-time
functionality via a set of partial bitstreams, provided that all were implemented with the
same static interface at the partition boundary [13]. Huang and Lee [10] have demonstrated
fast automated reconfiguration of Discrete Cosine Transform (DCT) partitions for adaptive
video compression on a Virtex-4 device. Most systems make use of the Xilinx-specific
Internal Configuration Access Port (ICAP) feature which grants user access to the chip’s
configuration control [4]. This is useful for simplified loading of PR bitstreams, but it should
be noted that it is a hardware-only construct; simulation of ICAP activity is not possible because the bitstreams are device- and implementation-specific, and because most simulation
software does not allow module architectures to change after initial compilation.
Many DPR subsystems utilize a microcontroller which handles fetching, decompression (if needed), and writing of partial bitstreams [4] [10] [12]. They also tend to introduce
new top-level interfaces for the fetching of these bitstreams. One of the design challenges
of integrating DPR with existing systems is to minimize the amount of additional hardware
requirements. Galindo and Peskin [7] demonstrated a method of integrating static (halting) partial reconfiguration into a Color Space Conversion (CSC) engine without adding
any microcontroller (neither physical nor soft-core) or new I/O bus. This was achieved by
transporting PR bitstreams on an existing data bus, with additional hardware to recognize
and utilize them.

Chapter 3
Theory
Partial reconfiguration can provide valuable advantages when applied appropriately. It will
always involve one or more drawbacks, but dynamic partial reconfiguration mitigates at
least some of these. Section 3.1 discusses the major trade-offs of generalized DPR techniques. Section 3.2 introduces an image processing engine, and Section 3.3 predicts the
potential benefits of integrating partial reconfiguration into it. This engine will serve as the
experimental platform for DPR implementation and analysis.

3.1 Benefits and Drawbacks of DPR

Partial reconfiguration introduces many improvement opportunities in FPGA-implemented
systems. At its core, it allows a variety of functional blocks to be inserted into a system,
without the need to implement these blocks elsewhere on-chip, and without resetting the
state of the system. Some quantitative benefits can immediately be inferred. The amount
of required logic resources decreases, since N different processing functions (assuming a
common interface) can be performed by a single dynamic module. The physical layout
space and routing requirements for these will similarly decrease by a factor of approximately N. Both static and dynamic (switching) power consumption may decrease slightly
as modules are merged, although a properly designed system should already disable idle
blocks to conserve power.
As an example, consider a high-speed connection with dedicated hardware for fast data
encryption and decryption. For a static system to support a variety of encryption algorithms, each must be individually implemented in hardware, with additional logic to selectively route the passed data. A PR version would only require one partition and data path,
plus an added port for receiving reconfiguration data whenever the user wishes to switch
algorithms. Similarly, PR has been used to provide fast-switching of I/O protocols at the
interfaces of hot-swappable devices. Each PR partition must be sized to accommodate the
requirements of its largest function. For this reason, blocks of comparable sizes should
be grouped together to avoid wasting layout space on smaller functions. In any case, it is
important to note that functions can only be merged via PR if they are mutually exclusive in
time (never utilized simultaneously).
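
The sizing rule above can be illustrated with a minimal sketch. The function names, the resource counts, and the overhead terms are illustrative assumptions and are not measurements from this work: a static design pays for every function plus selection logic, while a PR design pays only for the largest function plus PR control logic.

```python
# Illustrative area model; every number and name here is hypothetical.

def static_area(function_sizes, mux_overhead=0):
    """All functions are implemented side by side, plus selection logic."""
    return sum(function_sizes) + mux_overhead


def pr_area(function_sizes, control_overhead=0):
    """One partition sized for the largest function, plus PR control logic."""
    return max(function_sizes) + control_overhead


def wasted_area(function_sizes):
    """Unused partition resources while each function is the one loaded."""
    partition = max(function_sizes)
    return [partition - size for size in function_sizes]


sizes = [900, 1000, 400]  # hypothetical LUT counts for three functions
print(static_area(sizes, mux_overhead=50))   # 2350
print(pr_area(sizes, control_overhead=100))  # 1100
print(wasted_area(sizes))                    # [100, 0, 600]
```

In this made-up case the third function leaves most of the partition idle, which is why blocks of comparable size are the better candidates for sharing one partition.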
Some time-saving benefits are introduced by PR as well, in both the development stage
and usage stage. In the former, the Place And Route (PAR) and bitstream generation processes inherent to FPGA-based design may be shortened, since an individual partition and
partial bitstream require considerably less processing than a fully routed device. In the latter, the reprogramming duration (which the end user experiences) is proportionally reduced
by the ratio of the partial bitstream size to the full-FPGA bitstream size. These values may
differ greatly. Full-FPGA reconfiguration could perform the same module-swapping functionality, but with numerous drawbacks: increased programming time, forced halting and
resetting of the system, and substantially longer PAR time (since the entire chip must be
re-implemented for every combination of functional blocks).
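
As a rough sketch of the proportionality described above, reprogramming time can be modeled as bitstream size divided by the throughput of the configuration port. The bitstream sizes below are hypothetical, and the port rate assumes a 32-bit port clocked at 100 MHz; neither is a value taken from this work.

```python
# Sketch only: sizes and port rate are assumptions, not measured values.

def config_time_s(bitstream_bytes, port_bytes_per_s):
    """Time to stream a bitstream through a configuration port at a fixed rate."""
    if port_bytes_per_s <= 0:
        raise ValueError("port rate must be positive")
    return bitstream_bytes / port_bytes_per_s


def time_ratio(partial_bytes, full_bytes):
    """Partial-to-full reprogramming time ratio; the port rate cancels out."""
    return partial_bytes / full_bytes


FULL_BYTES = 9_000_000       # hypothetical full-device bitstream
PARTIAL_BYTES = 450_000      # hypothetical partial bitstream
PORT_RATE = 4 * 100_000_000  # assumed 32-bit port at 100 MHz, in bytes/s

print(f"full:    {config_time_s(FULL_BYTES, PORT_RATE) * 1e3:.3f} ms")
print(f"partial: {config_time_s(PARTIAL_BYTES, PORT_RATE) * 1e3:.3f} ms")
print(f"ratio:   {time_ratio(PARTIAL_BYTES, FULL_BYTES):.2f}")
```

Because the port rate cancels in the ratio, the relative saving depends only on how small the partial bitstream is compared with the full one.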
Some drawbacks are associated with PR techniques. It is generally only available on
relatively recent, high-end FPGAs. It requires a slight increase in control logic complexity,
to detect PR bitstreams and route them to the appropriate FPGA port. Reprogramming
latency may become an issue as partition size grows. By definition, static partial reconfiguration will halt the full chip’s operation for at least some short amount of time, reducing
performance in active systems. Dynamic partial reconfiguration specifically aims to mitigate these performance hits, by keeping other partitions running while one reprograms.

3.2 Color Space Conversion

To investigate the benefits of DPR, an image processing engine was provided by the Hewlett-Packard Company (HP) in the form of Verilog hardware description language (HDL) modules. ASIC implementations of the engine have been used as components within larger
systems in multiple HP products, particularly color printers. Documentation, associated
software tools, and sets of input/output files were provided to create a complete test platform.
The engine’s core functionality is called color space conversion (CSC), which translates the pixels of an image from one color coordinate system to another. A common
example of this is a conversion from the red-green-blue space (RGB, the typical format
for storing and displaying images on personal computers) to the cyan-magenta-yellow-key
space (CMYK, typically used by four-color ink printers) [2]. Software CSC implementations usually calculate each output pixel as a function of one input pixel. Depending
on the source and destination spaces, a standard mathematical mapping may be defined,
or some slight variations may exist. These tend to be non-linear and require conditional
calculations, and many redundant conversions might be performed (on input pixels which
repeat identical colors). For these reasons, hardware CSC implementations often resort to
pre-computed color lookup tables (CLUTs) rather than real-time calculation. Since CLUT
access time is constant, all pixels can be processed at an equal speed without stalling for
slow or conditional calculations. Furthermore, CLUTs can be customized for specific applications: for example, to tweak gamma levels or compensate for the non-linear effects
inherent to printing on paper [8]. Note that some amount of interpolation is still performed
after CLUT access, because it would be unnecessary and unreasonable to store conversion
values for every possible input combination. Figure 3.1 shows a simplified block-level view
of these two CSC methods, real-time calculation versus CLUT-interpolation, as applied to
an RGB-CMYK conversion.
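
The CLUT-interpolation idea can be sketched in software as follows. This is a generic trilinear interpolation over a three-channel table, chosen only for illustration; the interpolation scheme, table size, and data format of the HP engine are not described here, so every name and parameter below is an assumption. The textbook RGB-to-CMYK formula is used solely to fill the example table.

```python
def build_clut(grid, convert):
    """Pre-compute convert() at grid**3 evenly spaced RGB nodes in [0, 1]."""
    if grid < 2:
        raise ValueError("need at least two nodes per axis")
    step = 1.0 / (grid - 1)
    return [[[convert(r * step, g * step, b * step)
              for b in range(grid)]
             for g in range(grid)]
            for r in range(grid)]


def lookup(clut, r, g, b):
    """Trilinear interpolation between the 8 table nodes surrounding (r, g, b)."""
    grid = len(clut)

    def split(value):
        # Clamp to [0, 1], then split into a lower node index and a fraction.
        pos = min(max(value, 0.0), 1.0) * (grid - 1)
        low = min(int(pos), grid - 2)
        return low, pos - low

    (ri, rf), (gi, gf), (bi, bf) = split(r), split(g), split(b)
    channels = len(clut[0][0][0])
    out = [0.0] * channels
    for dr, wr in ((0, 1.0 - rf), (1, rf)):
        for dg, wg in ((0, 1.0 - gf), (1, gf)):
            for db, wb in ((0, 1.0 - bf), (1, bf)):
                node = clut[ri + dr][gi + dg][bi + db]
                weight = wr * wg * wb
                for c in range(channels):
                    out[c] += weight * node[c]
    return tuple(out)


def naive_rgb_to_cmyk(r, g, b):
    """Textbook RGB to CMYK formula; note the conditional for pure black."""
    k = 1.0 - max(r, g, b)
    if k >= 1.0:
        return (0.0, 0.0, 0.0, 1.0)
    return tuple((1.0 - v - k) / (1.0 - k) for v in (r, g, b)) + (k,)


clut = build_clut(5, naive_rgb_to_cmyk)  # 5 nodes per axis is arbitrary
print(lookup(clut, 0.25, 0.5, 1.0))      # exact at a table node
print(lookup(clut, 0.30, 0.5, 0.9))      # interpolated between nodes
```

Every lookup performs the same eight reads and the same arithmetic regardless of the input color, which mirrors the constant-time property that makes table-based conversion attractive in a pipeline.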
Figure 3.1: Conversion Methods, Real-time v. CLUT-interpolate
(Blocks shown: r, g, b inputs; Conversion Function; CLUT; nearest cmyk values; Interpolate; c, m, y, k outputs)

The provided CSC engine performs multi-stage, deeply-pipelined conversions on either
three-channel or four-channel pixels. These are streamed in at a peak rate of one full pixel
per clock cycle, and follow a linear path through a series of cascaded processing modules.
Each module can be independently activated or bypassed by setting appropriate control
register values. These components can perform scaling, dithering, three-channel (3D) conversion, four-channel (4D) conversion, etc. For most aspects of this research, the system
can be viewed as consisting of a pre-processing stage, a selectable 3D/4D conversion stage,
and a post-processing stage. In its ASIC implementations, both the 3D and 4D modules are
physically present as cascaded bypassable stages. This is illustrated in Figure 3.2. In our
experimental version, the modules will be treated as mutually exclusive; only one will be
implemented at any given moment, consuming roughly half the resources.

3.3 Potential for Engine Improvement

Figure 3.2: Overview of the Static CSC Engine
(Blocks shown: Reg Bus, Control Modules, Pixel Bus, Pre-Processing Modules, 3D Processing Module, 4D Processing Module, Post-Processing Modules, Output Pixels)

After powering up, operation of the normal (ASIC) CSC begins with an input stream of configuration data (register values, CLUT data, control flags) followed by a sequence of images
to be processed. New CLUT values can be loaded between images, to perform different
conversions. Pixels are read into the pipeline on four 16-bit channel buses. These buses,
along with some binary control flags, are collectively known as the Pixel Bus. Each pixel
is processed and interpolated as needed, and the result is written out to four 12-bit buses.
As a side note, the use of four 16-bit input channels implies 2^64 possible colors, making
exhaustive CLUTs infeasible and creating the need for interpolation. Configuration data
is grouped as an 18-bit address, a 32-bit word, and some control flags collectively known
as the register bus or Reg Bus. These two buses (plus a pipeline enable flag) constitute
the 123-bit input interface of the CSC, as summarized in Table 3.1. Figure 3.3 illustrates
bus utilization during a simple example sequence (processing three separately-configured
images). Shaded boxes indicate intervals in which an input bus is unused.

Figure 3.3: Processing Example without PR
(Timeline, time increasing to the right. Reg Bus: Configuration 1, Configuration 2, Configuration 3. Pixel Bus: Image 1, Image 2, Image 3. One static pipeline: no tasks can be overlapped.)
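
The infeasibility of an exhaustive table for four 16-bit input channels can be checked with simple arithmetic. The sketch below assumes, for illustration only, that each table entry would hold four 12-bit output values, and the choice of 17 nodes per axis for the sparse table is hypothetical.

```python
INPUT_CHANNELS = 4
INPUT_BITS = 16
OUTPUT_CHANNELS = 4   # assumed: one stored value per output bus
OUTPUT_BITS = 12      # assumed: stored width equals the output bus width

# One entry per possible input color.
exhaustive_entries = 2 ** (INPUT_CHANNELS * INPUT_BITS)
exhaustive_bytes = exhaustive_entries * OUTPUT_CHANNELS * OUTPUT_BITS // 8


def sparse_entries(nodes_per_axis, channels=INPUT_CHANNELS):
    """Entries in a table that stores only a coarse grid of input colors."""
    return nodes_per_axis ** channels


print(exhaustive_entries)   # 18446744073709551616, i.e. 2**64
print(exhaustive_bytes)     # 110680464442257309696
print(sparse_entries(17))   # 83521
```

A coarse grid reduces storage by many orders of magnitude, at the cost of interpolating between the stored nodes.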
Previous research has been performed on this CSC platform by Galindo and Peskin [7].
Topics of interest included trade-offs in ASIC versus FPGA implementations, explorations
of different memory architectures, and integration of static partial reconfiguration. The
PR experiments were focused on reducing resource requirements and power consumption, rather than improving performance. That research provided the foundation for this
performance-oriented DPR methodology, and directly drove some key design choices, detailed in Section 5.1. It was decided to merge the 3D and 4D modules into a single reprogrammable module to demonstrate the functionality and resource conservation of PR.
Also, it was decided to stream PR data via the existing Reg Bus, as shown in Figure 3.4, to
avoid adding a new interface to the system.
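
The idea of carrying PR data on the existing Reg Bus can be sketched as a simple address-based router. This is a software analogy under stated assumptions, not the hardware described in this work: the reserved address value, the function name, and the transaction format are all hypothetical.

```python
PR_WINDOW_ADDR = 0x3FFFF  # hypothetical virtual address reserved for PR data


def route_reg_bus(transactions, pr_addr=PR_WINDOW_ADDR):
    """Split (address, data) writes into register updates and PR words.

    Writes to the reserved virtual address are collected, in order, as
    partial-bitstream words; all other writes update ordinary registers.
    """
    registers = {}
    pr_words = []
    for addr, data in transactions:
        if not 0 <= addr < 2 ** 18:
            raise ValueError("address exceeds the 18-bit RegAddr field")
        if not 0 <= data < 2 ** 32:
            raise ValueError("data exceeds the 32-bit RegData field")
        if addr == pr_addr:
            pr_words.append(data)
        else:
            registers[addr] = data
    return registers, pr_words


stream = [(0x00010, 0x1), (PR_WINDOW_ADDR, 0xDEADBEEF), (PR_WINDOW_ADDR, 0x5)]
regs, words = route_reg_bus(stream)
print(regs)    # {16: 1}
print(words)   # [3735928559, 5]
```

The point of the design choice is that no new top-level port is needed: the same address and data fields serve both purposes.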

Bus           Input Signal Name  Width (bits)  Description
------------  -----------------  ------------  ----------------------------------------------
-             PipeEnable         1             A global enable signal for the pipeline
Register Bus  AddrValid          1             Indicates a valid bus transaction
Register Bus  RegAddr            18            Address of register to write
                                               (real RAM address or virtual)
Register Bus  RegWrite           1             Indicates the specified data should be written
Register Bus  RegData            32            Multi-purpose data word
                                               (control data, CLUT data, PR data)
Pixel Bus     PreCscEol          1             End-of-line indicator
Pixel Bus     PreCscEop          1             End-of-page indicator
Pixel Bus     PreCscOt           1             Object-type indicator (LUT selector)
Pixel Bus     PreCscNop          1             No-operation (do not modify) flag
Pixel Bus     PreCscAck          1             Acknowledge pre-conversion pixel
Pixel Bus     PostCscReq         1             Request post-conversion pixel
Pixel Bus     PreCscData0        16            Input pixel, Channel 0
Pixel Bus     PreCscData1        16            Input pixel, Channel 1
Pixel Bus     PreCscData2        16            Input pixel, Channel 2
Pixel Bus     PreCscData3        16            Input pixel, Channel 3

Table 3.1: Overview of CSC Input Signals
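
As a consistency check, the signal widths listed in Table 3.1 can be summed to confirm the stated 123-bit input interface. The grouping of signals into buses follows the table; the Python names are illustrative only.

```python
# (signal name, width in bits, bus) as listed in Table 3.1
CSC_INPUTS = [
    ("PipeEnable", 1, None),
    ("AddrValid", 1, "Register Bus"),
    ("RegAddr", 18, "Register Bus"),
    ("RegWrite", 1, "Register Bus"),
    ("RegData", 32, "Register Bus"),
    ("PreCscEol", 1, "Pixel Bus"),
    ("PreCscEop", 1, "Pixel Bus"),
    ("PreCscOt", 1, "Pixel Bus"),
    ("PreCscNop", 1, "Pixel Bus"),
    ("PreCscAck", 1, "Pixel Bus"),
    ("PostCscReq", 1, "Pixel Bus"),
    ("PreCscData0", 16, "Pixel Bus"),
    ("PreCscData1", 16, "Pixel Bus"),
    ("PreCscData2", 16, "Pixel Bus"),
    ("PreCscData3", 16, "Pixel Bus"),
]


def bus_width(bus):
    """Total width of all signals assigned to the given bus."""
    return sum(width for _, width, b in CSC_INPUTS if b == bus)


total_width = sum(width for _, width, _ in CSC_INPUTS)
print(bus_width("Register Bus"))  # 52
print(bus_width("Pixel Bus"))     # 70
print(total_width)                # 123
```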

Figure 3.4: The PR-Enabled CSC Engine
(Blocks shown: Reg Bus, Control Modules, PR Control, Pixel Bus, Pre-Processing Modules, 3D/4D Processing Module, Post-Processing Modules, Output Pixels)

Figure 3.5: Processing Example with DPR
(Timeline, time increasing to the right. Reg Bus: Configuration 1, Configuration 2 (PR), Configuration 3, Configuration 4 (PR). Pixel Bus: Image 1, Image 2, Image 3, Image 4.)
One dynamic pipeline: 3D/4D reconfiguration can overlap with 1D image processing.
Here, tasks numbered 1 and 3 are 1D-only; tasks 2 and 4 include 3D/4D processing.

In the current research, it is desired to retain the benefits of the PR-enabled engine, and
to compensate for the performance hit caused by bitstream loading. DPR can be used to
increase the system throughput (without modifying the clock frequency or I/O buses) under certain conditions. First, a sequence of multiple images must be processed; DPR does
not provide any benefits in processing a single image. Second, the images must require
different conversions; this means that configuration data must be inserted between images.
Otherwise, the configuration is constant and the system is strictly limited by the rate it can
feed pixels through the pipeline. Third, the images should be “large” or else the performance benefits will be negligible. In this context, “large” means the length of image data
is greater than or approximately equal to the length of the configuration data.
Finally, and importantly, the configuration and pixel processing actions must be able
to overlap to some extent. Since the provided engine has just one pipeline path, the potential for overlapping is limited. Any module in the chain can only be reconfigured if the
image currently being processed bypasses it. No module that is in use (including global
control logic) can be reconfigured while processing an image. Figure 3.5 shows an example: Images 1 and 3 do not utilize the DPR-capable module (the 3D/4D processor), so their
processing can be overlapped with the DPR setup for subsequent Images 2 and 4.
To better utilize dynamic partial reconfiguration, a second architecture is investigated
which encapsulates two identical, parallel pipelines (including duplicated control registers).
By properly routing their incoming data streams, one path can be fully reconfigured while
the other processes an image, switching roles when appropriate. This version obviously
suffers from increased resource consumption, but it is used to demonstrate the even greater
performance possible with DPR.
Figure 3.7 shows a processing sequence example using DPR and two pipes. Part (a)
shows the activity of each input bus over time; Part (b) shows which corresponding pipeline
performs each activity. Because the pipes are fully independent, complete configuration for
the next image can be performed while processing the current image. The illustrated case
is ideal, i.e. the images have equal lengths and configuration data, and hence the speedup factor is exactly 2. In practice, one action will take longer than the other. This results in
some amount of bus inactivity, and the speedup factor will be between 1.0 and 2.0.

Figure 3.6: The Dual-Pipe PR CSC Engine
(Block diagram. Inputs: Reg Bus and Pixel Bus, entering through the Input Routing Logic. Pipeline 1 and Pipeline 2 each contain Control Modules, Pre-Processing Modules, a 3D/4D Processing Module, and Post-Processing Modules. Other blocks: PR Control, Output Routing Logic. Output: Output Pixels.)

Figure 3.7: Processing Example with Dual-Pipe DPR
(Timing diagrams, time running left to right. Two dynamic pipelines: any configuration task can overlap with any image processing.
(a) Bus Activity. Reg Bus: Configuration 1, Configuration 2 (PR), Configuration 3 (PR), Configuration 4 (PR), Configuration 5 (PR), Configuration 6 (PR). Pixel Bus: …, Image 1, Image 2, Image 3, Image 4, Image 5.
(b) Pipeline Activity. CSC Pipeline #1: Configuration 1, Image 1, Configuration 3 (PR), Image 3, Configuration 5 (PR), Image 5. CSC Pipeline #2: …, Configuration 2 (PR), Image 2, Configuration 4 (PR), Image 4, Configuration 6 (PR).)

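The speedup bounds stated above for the dual-pipe case can be explored with a small timing model. The following is only a sketch under stated assumptions: each task is a (configuration time, image time) pair in arbitrary units, the Reg Bus and Pixel Bus each carry one transfer at a time, and consecutive tasks alternate between the two pipes. The function names are hypothetical and are not part of the thesis tools.

```python
def serial_time(tasks):
    """Total time when every configuration and image runs back to back."""
    return sum(cfg + img for cfg, img in tasks)


def dual_pipe_time(tasks):
    """Completion time with two pipes, one Reg Bus, and one Pixel Bus.

    Assumed rules: configuration i needs the Reg Bus to be free and its pipe
    (last used two tasks earlier) to be finished; image i needs its own
    configuration to be finished and the Pixel Bus to be free.
    """
    cfg_end, img_end = [], []
    for i, (cfg, img) in enumerate(tasks):
        reg_bus_free = cfg_end[i - 1] if i >= 1 else 0
        pipe_free = img_end[i - 2] if i >= 2 else 0
        cfg_end.append(max(reg_bus_free, pipe_free) + cfg)
        pixel_bus_free = img_end[i - 1] if i >= 1 else 0
        img_end.append(max(cfg_end[i], pixel_bus_free) + img)
    return img_end[-1] if img_end else 0


def speedup(tasks):
    """Ratio of the serial time to the overlapped dual-pipe time."""
    return serial_time(tasks) / dual_pipe_time(tasks)
```

Under this model, n identical tasks with equal configuration and image lengths give a speedup of 2n/(n+1), which approaches the ideal factor of 2 as the sequence grows (the first configuration has nothing to overlap with). Sequences dominated by either images or configuration data fall closer to 1.0.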
The pre-processing stage, post-processing stage, control logic, and I/O logic are designated as static components. Their values vary while the engine runs, but their underlying
logic structures never need to change. The selectable 3D/4D module is designated as dynamic, since it is reprogrammed “on the fly” to accommodate the hardware differences
between the 3D and 4D functionality (mainly the amount, sizes, and connections of required CLUTs). The methodology described hereon will demonstrate the implementation,
analysis, and results for both single-pipe and dual-pipe DPR engines.

Chapter 4
Experimental Setup
The major components used in this research, and an overview of the entire setup, are described in this chapter. The roles of these components are discussed in detail in Chapter 5.

4.1 Hardware Components

The chosen reconfigurable platform is a Virtex-6 FPGA from Xilinx, Inc. This serves
as an extension to previous PR research on a Virtex-II Pro, as described in Section 5.1.
Specifically, the ML605 evaluation board was used, which includes these key features:
• DPR-capable Virtex-6 FPGA
• JTAG (Joint Test Action Group) programming via USB (Universal Serial Bus)
• BPI (Byte Peripheral Interface) Flash memory for non-volatile configuration
• PCIe (Peripheral Component Interconnect Express) link for high-speed I/O
• DDR3 (Double Data Rate 3) memory for I/O FIFO buffering
For programming access, the ML605 was connected by a USB cable to a desktop
computer (PC) running the Microsoft Windows 7 operating system. For run-time communication with the CSC engine, the board was directly plugged into a Generation 1 8-lane (Gen1x8) PCIe slot on a second PC running the Fedora 10 (Linux) operating system.
This setup is illustrated in Figure 4.1. Further details of the hardware components are
presented in Appendix A.

Figure 4.1: PC-FPGA-PC Hardware Setup
(Diagram. Development & Programming: the Windows PC connects to the ML605 with Virtex-6 FPGA over JTAG. Testing & Execution: the ML605 connects to the Linux PC over PCIe.)

4.2 Software Components

Modifications to the CSC engine and the DPR integration were performed on the Windows PC, using the Xilinx Integrated Software Environment suite. This package includes
three key applications, ISE Project Navigator (called “ISE” hereon), PlanAhead, and Impact,
along with the other software modules they require.
To make any modifications to the engine, ISE is used first for design entry. Most
changes are made in the textual Verilog HDL format, in which the supplied system is
modeled. The CORE Generator (coregen) tool is called from within ISE to customize
and generate useful sub-modules (e.g. RAMs, PCIe back-end, clock generators). The ISE
Simulator tool (ISim) is also called from within ISE, when system simulation is necessary
and possible. ISE outputs a set of netlist files, generated from the HDL sources by the
Xilinx Synthesis Tool (xst).
Netlist files are imported into PlanAhead, which manages the creation, placement, and
sizing of partitions within the FPGA. These can be designated as reconfigurable partitions (RPs) and multiple variant netlists can be assigned to them (assuming a PR-enabled
software license is present). The design implementation process in PlanAhead invokes
the mapping, place-and-route, and bitstream generation tools (map, par, bitgen), and
the end product is a set of bitstream files: full-chip configuration bitstreams and partial
bitstreams for PR.

Finally, the Impact application is used to program the FPGA via the USB-JTAG interface. Typically, full-chip bitstreams are loaded by Impact and programmed to the FPGA’s
volatile SRAM memory for testing purposes. Finished designs can alternatively be programmed into a non-volatile Flash memory. On power up, the full-chip configuration is automatically loaded from this Flash memory. The partial configuration bitstreams are stored
on the PC or equivalent master device, and are used at run-time as necessary. Figure 4.2
illustrates the general software/file flow of these development tools.
Other software used on the Windows PC includes a CSC emulation application provided by HP (for generating “known good” output files from source images), MATLAB
(for comparing experimental and “known good” results), and Tiny C Compiler (TCC, for
compiling custom tools).
On the Linux PC, modifications to the PCIe drivers were required, as detailed in Section 5.3.
These drivers, and associated software, were based directly off the Virtex-6 Connectivity
Kit Targeted Reference Design (TRD), which demonstrates PCIe communication on the
ML605 board. The GNU C Compiler (GCC) was included with the Fedora 10 installation,
and was used to compile the tool and driver code. A detailed list of all software and version
numbers is available in Appendix B.
Custom test software was written for the project, including two main programs called
vecBuilder and vecSender. These were executed on the Windows PC and Linux PC respectively, but were written in the C language and designed for simple OS portability.
Figure 4.3 illustrates the software/file flow of these testing tools. Details of their roles and
functionalities are described in Sections 5.2 and 5.3.

Figure 4.2: Development Software Flow Diagram
(Flow diagram. HDL Source Files, Xilinx ISE (ISim, xst), Netlist Files, PlanAhead (map, par, bitgen), Bitstream Files, Impact, Program FPGA.)

Figure 4.3: Test Software Flow Diagram
(Flow diagram. CSC Register Files, CLUT Data Files, PR Bitstream Files, and Image Files; vecBuilder; CSC Vector Files; vecSender; FPGA Processing; Output Images.)

Chapter 5
Methodology
This chapter describes the main procedures followed to implement the DPR-capable system, as well as the necessary preparation and testing steps. It is divided into sections which
provide details of each major task. Results and conclusions follow in Chapters 6 and 8.

5.1 Upgrading the Platform

The CSC engine had previously been migrated from its original ASIC target to a PR-capable FPGA, to study the trade-offs of non-dynamic partial reconfiguration and other
system modifications [6]. Four important design choices were made in the previous research, which directly affect this methodology. First, the CSC’s 3D and 4D modules have
been merged into a single, selectable module. This reduces power requirements and layout space, at the costs of slightly increased control complexity and the inability to perform
consecutive 3D and 4D conversions on pixels (a rarely used configuration). Second, the
Virtex-II Pro FPGA was previously chosen as the hardware platform; a more recent Virtex
model is now used. Third, the system’s clock frequency has been reduced from 167 MHz
to 50 MHz, due to logic timing constraints. Finally, the choice was made to integrate PR internally, without modifying the existing I/O interface. This increases the practicality of the
research; theoretically a DPR-FPGA version of the CSC engine could replace the original
ASIC inside a product with no other changes to the interface, routing, or bill of materials.
The throughput goal of one full-page image per second [7] is retained. A full page was
defined as 8.5 inches by 11 inches, at 600 dots per inch (DPI), or 33,660,000 pixels [7].
If configuration data is not considered, then this target rate can already be achieved by the
fully-saturated pipeline running at 50 MHz. However, for this methodology, our goal of
less than one second also includes the upload of CLUT data (and any DPR data) associated
with each image.
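The throughput target can be checked with simple arithmetic. The sketch below assumes, as the saturated-pipeline statement implies, that one pixel enters the engine per 50 MHz clock cycle; the variable names are illustrative only.

```python
PAGE_WIDTH_IN = 8.5
PAGE_HEIGHT_IN = 11
DPI = 600
CLOCK_HZ = 50_000_000  # assumed: one pixel per clock when the pipeline is saturated

# Pixel count of a full page at 600 DPI.
pixels_per_page = int(PAGE_WIDTH_IN * DPI) * int(PAGE_HEIGHT_IN * DPI)

# Time to stream those pixels, ignoring all configuration data.
seconds_per_page = pixels_per_page / CLOCK_HZ

# Clock cycles left within one second for CLUT and DPR data that is not overlapped.
config_budget_cycles = CLOCK_HZ - pixels_per_page
```

This gives 33,660,000 pixels and about 0.673 s per page, which leaves roughly 16.3 million clock cycles of the one-second budget for any configuration data that cannot be overlapped with pixel processing.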
The Virtex-II Pro was replaced with a Virtex-6 LXT, incorporated on the ML605 development board [20]. To accommodate the new device, the development software (ISE
Design Suite) and CSC project files were updated from version 8.2 to version 13.1. It is
important to note that the acquired ML605 board contained revision 1 “Engineering Sample” (CES) silicon, which has some firmware limitations, as described in Section 5.3.
Migrating the CSC core itself required minimal modifications, since the Virtex-6 includes the same logic and memory resources (as the Virtex-II Pro) in even greater quantities. The random access memory (RAM) modules designated for CLUTs needed to be
updated to Xilinx’s latest format. The majority of migration effort involved modifying the
PR control module as described in Section 5.2, and the removal of bus macro blocks at
the interfaces of certain CSC sub-modules. Bus macros were required components in PR-capable HDL designs in version 8 of the ISE Suite, but are obsolete in version 13. They
provided reliable, safe, buffered interconnects at partition boundaries during dynamic reconfiguration [17]. Formerly, these needed to be explicitly instantiated and connected by
the designer. In PlanAhead version 13, similar functionality is automatically inserted at
PR-capable boundaries, and explicit bus macros become redundant. Although keeping the
bus macros would not have impaired functionality, they were nonetheless removed and the
sub-modules were directly connected to avoid unnecessary resource utilization.
Once the CSC platform has been migrated and confirmed working (see test procedures
in Section 5.5), it is possible to begin the implementation of a dynamic partial reconfiguration subsystem.

Figure 5.1: The ICAP_VIRTEX6 Block
(Block diagram. The ICAP_VIRTEX6 primitive sits between the User Interface, with ports CLK, CSB, RDWRB, I[31:0], O[31:0], and BUSY, and the Internal Interface.)

5.2 Integrating DPR

The Virtex family supports methods for both external and self dynamic reconfiguration.
The main external interface is called SelectMAP. It provides a straightforward user interface to the FPGA’s reconfiguration control logic; it consists of one data bus plus control
signals. Essentially, PR bitstreams are written sequentially to the data bus, and the chip
handles the decoding, checking, and reprogramming automatically. As long as the bitstream contains valid headers, addresses, and data (should be true if produced by bitgen)
and the control signals are driven correctly (by the user), reprogramming should occur just
as if done manually with the Impact software tool.
A similar interface can be instantiated within the HDL source itself, and then driven
by user-defined logic. This is a special Xilinx primitive block called the Internal Configuration Access Port (ICAP), which allows for self reconfiguration. Its ports are shown in
Figure 5.1. Functionally, the interface is identical to SelectMAP. The ICAP block can be
instantiated like any other module, and connected to a custom bus to receive bitstream data.
During the previous research, it was desired to pass in the PR bitstream data without
adding a new I/O bus. Therefore, the existing Reg Bus was utilized, but only during a
special PR mode which halted the CSC pipeline. For our DPR goal of overlapping image
processing and configuration, the Reg Bus can conveniently still be used, since it normally
remains idle during any image processing stage. Also, the limited CSC clock frequency of
50 MHz matches the ICAP block’s standard operating frequency of 50 MHz. This allows
ICAP data to be piped in without additional synchronization logic.
In the old Virtex-II Pro implementation, PR was performed by siphoning bitstream data
from the Reg Bus’s data pins, named RegData in the system. PR data was distinguished
from regular register data by a new virtual register address, CSC_PR_REG. This address was
given the value of 0x08B8. Whenever the Reg Bus’s 18-bit address, RegAddr, matches
this value, appropriate PR actions are initiated. This method and the virtual address are
retained in the Virtex-6 implementation, with some modifications.
The Virtex-II Pro ICAP block only accepted eight bits (one byte) of bitstream data per
clock cycle. Because of this, the 32-bit wide RegData word was not fully utilized; only
eight bits ever held valid PR data. In contrast, the Virtex-6 ICAP block features a selectable
data bus width, up to 32 bits, which is specified as a compile-time Verilog parameter. For
our new implementation, the 32-bit (four byte) width was chosen to match the Reg Bus.
This utilizes the existing interface more efficiently by reducing the number of cycles needed
to load a bitstream by up to 75%.
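The 75% figure follows directly from the ratio of the two data widths. A minimal sketch, using a hypothetical bitstream size purely for illustration (the function name is not from the thesis):

```python
def icap_write_cycles(bitstream_bytes, icap_width_bits):
    """Clock cycles needed to write a bitstream at one ICAP word per cycle."""
    bytes_per_cycle = icap_width_bits // 8
    return -(-bitstream_bytes // bytes_per_cycle)  # ceiling division


EXAMPLE_BYTES = 1_000_000  # hypothetical partial bitstream size
cycles_8bit = icap_write_cycles(EXAMPLE_BYTES, 8)    # Virtex-II Pro style
cycles_32bit = icap_write_cycles(EXAMPLE_BYTES, 32)  # Virtex-6, 32-bit mode
reduction = 1 - cycles_32bit / cycles_8bit
```

For any bitstream whose length is a multiple of four bytes the reduction is exactly 75%; otherwise the final partial word makes it marginally smaller, hence "up to".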
Furthermore, since the ICAP data bus now matches the Reg Bus one-to-one, the surrounding control logic was simplified. The ICAP’s Chip Select (CSB) input is only asserted
(pulled low) when PR words appear on the Reg Bus. Previously, this was managed by a
small finite state machine (FSM), which required the total number of PR bytes to be explicitly specified, and then asserted Chip Select for that many cycles. This was replaced with
a simpler ICAP controller, which asserts Chip Select whenever the address on RegAddr
matches the CSC_PR_REG constant. Since the ICAP is only used for write-access, the
Read-Write input (RDWRB) is tied low (indicating Write), and the output bus (O[31:0])
is left unconnected, as recommended by Xilinx [15].
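The simplified controller described above reduces to an address comparison. The following behavioral sketch is a software model only, not the HDL from Appendix C; the function name and the tuple it returns are assumptions made for illustration, and the per-byte bit reordering discussed separately is omitted here.

```python
CSC_PR_REG = 0x08B8            # virtual register address given in the text
REG_ADDR_MASK = (1 << 18) - 1  # RegAddr is 18 bits wide


def icap_inputs(reg_addr, reg_data):
    """Return (CSB, RDWRB, data) as the ICAP would see them in one cycle."""
    pr_word = (reg_addr & REG_ADDR_MASK) == CSC_PR_REG
    csb = 0 if pr_word else 1  # active low: asserted only for PR words
    rdwrb = 0                  # tied low, meaning write
    return csb, rdwrb, reg_data & 0xFFFFFFFF
```

Because the decision depends only on the current address, no byte counter or state machine is needed, which is the simplification the text describes.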
Two more notable hardware changes were needed to support DPR. First, the bits within
each byte of ICAP data needed to be reversed to accommodate its non-conventional bit
order. Most n-bit signals within the system interpret Bit 0 as the least significant and
Bit (n − 1) as the most significant, but the ICAP interprets Bit 0 as most significant and
Bit 7 as least significant. For example, the bitstream synchronization word 0xAA995566
should be written to the ICAP block as 0x5599AA66 [15]. This swapping could have
been done in either the hardware or the CSC vector preparation software. It was chosen to
perform this in hardware, by simply re-routing the signals properly. This makes it easier to
generate, read, and edit input vectors (on the PC) since they retain “standard” bit ordering.
The finalized ICAP control module is provided in Appendix C in its HDL form.
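The bit reordering can be illustrated in software, even though the thesis performs it in hardware by re-routing signals. The sketch below is a behavioral equivalent with hypothetical function names; it reverses the bit order within each byte while leaving the byte order unchanged.

```python
def reverse_bits_in_byte(value):
    """Mirror the eight bits of one byte (bit 0 <-> bit 7, and so on)."""
    out = 0
    for i in range(8):
        out |= ((value >> i) & 1) << (7 - i)
    return out


def icap_bit_swap(word):
    """Apply the per-byte bit reversal to a 32-bit word."""
    out = 0
    for shift in (0, 8, 16, 24):
        out |= reverse_bits_in_byte((word >> shift) & 0xFF) << shift
    return out
```

Applied to the synchronization word, `icap_bit_swap(0xAA995566)` yields `0x5599AA66`, matching the example in the text. The operation is its own inverse, so applying it twice restores the original word.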
The final hardware addition needed was a path of cascaded registers to bypass the reconfigurable 3D/4D module. In the previous implementation, PR (which temporarily places
a partition in an indeterminate state) was never performed while an image was being processed. In this DPR-extended version, PR may be performed during an image processing
stage that does not utilize the 3D/4D module. Normally, these pixels would pass through
the 3D/4D register path unmodified, but partial reconfiguration would briefly invalidate the
logic and corrupt the data. For this reason, a bypass branch of equivalent “length” (register delays) was added around the module, and a PR-aware multiplexer selects either the
module’s output or the bypass branch’s output to write to the next stage.
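The bypass arrangement can be modeled as two delay lines of equal depth feeding a multiplexer. This is a behavioral sketch under assumptions: the register depth is left as a parameter because the text does not state it, the class and sentinel names are hypothetical, and the module's contents during PR are represented by a placeholder object.

```python
from collections import deque

GARBAGE = object()  # stands in for indeterminate data during reconfiguration


class BypassedStage:
    """3D/4D register path and an equal-length bypass, joined by a mux."""

    def __init__(self, depth, transform):
        self.module = deque([None] * depth, maxlen=depth)  # 3D/4D path
        self.bypass = deque([None] * depth, maxlen=depth)  # plain delay line
        self.transform = transform

    def clock(self, pixel, pr_active):
        """Advance one clock; return the pixel handed to the next stage."""
        out = self.bypass[0] if pr_active else self.module[0]
        self.bypass.append(pixel)
        self.module.append(GARBAGE if pr_active else self.transform(pixel))
        return out
```

With PR active, pixels emerge unmodified after `depth` clocks and the placeholder never reaches the output, which is the property the bypass branch is meant to guarantee.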
A visual summary of the required hardware changes is shown in Figure 5.2. All changes
were made by either editing the appropriate Verilog modules, or by adding new self-contained modules. For the 3D/4D module to be PR-capable, it was defined in the CSC’s
top level as a black-box module. A black-box module’s internals are undefined at synthesis time, allowing a separately-synthesized netlist to take its place in implementation. In
ISE, this was achieved by including the black-box’s port definition in the top-level Verilog
source, explicitly declaring all I/O ports but no internal logic.
The top-level CSC and its static sub-modules were synthesized to netlist files in ISE.
The dynamic variants of the 3D/4D module (that is, the 3D block and the 4D block) were
also synthesized in ISE to separate netlist files. All were imported into PlanAhead, which
allows partitioning of the Virtex-6’s resources and fine control of block locations. For our
purposes, one partition was created and designated as PR-capable, sized according to the maximum resource requirements of the two variants. PlanAhead’s implementation stage
automatically handles the boundaries of PR partitions; it produces buffering/safety logic
and ensures the 3D/4D variants share identical I/O locations. The final output was a set of
bitstream files: full-chip bitstreams which include all static logic, and a partial bitstream
defining each variant of each partition. The static bitstreams are typically programmed to
the chip through the Impact software; the partial bitstreams are saved for later use during
run-time. These software steps are illustrated in Figure 4.2.

Figure 5.2: The Modified, PR-Capable 3D/4D Stage
(Block diagram. Labels: pixels from previous stage, Bypass Path, 3D/4D Module, PR active flag, pixels to next stage, PR Detection and Control, Reg Bus address, Reg Bus data, ICAP control, ICAP PR data.)

Finally, in order to benefit from the overlapping capabilities of PR, a method was
needed to analyze a sequence of tasks, determine which can be overlapped properly, and
combine Reg Bus and Pixel Bus data into CSC-ready input vectors. This can either be
achieved by dynamic scheduling within the hardware engine, or static scheduling within
the test software. For this research (and the prior research) it was chosen to schedule these
CSC vectors in software, because a hardware method may add complexity and slow down
the engine whose performance is being tested, and because the sequence analysis process is
more suited to software than FPGA logic. Also, a software method allows many sequence
test cases to be generated and stored, so benchmark test cases can easily be re-run and
compared as engine modifications are made.
A custom command-line application was written in the C language, called vecBuilder.
It accepts a user-specified sequence of CSC register data files, CLUT files, partial bitstream
files, and image files, and generates a file of formatted 128-bit vectors ready to be streamed
to the hardware engine. The software recognizes and performs task overlapping when possible; for this single-pipe CSC, it specifically overlaps 3D/4D module reconfiguration with
images which do not utilize that module. The role of vecBuilder in the testing process
is shown in Figure 4.3 and detailed in Section 5.5. An extension of its functionality is
described in Section 5.4, to accommodate further hardware expansion.
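The single-pipe overlap rule can be expressed compactly. The sketch below is not the vecBuilder source; it is a hypothetical illustration of the stated rule, in which a 3D/4D reconfiguration may overlap the preceding image only if that image bypasses the 3D/4D module. The `Task` fields and function name are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    uses_3d4d: bool  # image passes through the 3D/4D module
    needs_pr: bool   # a partial bitstream must be loaded before this image


def pr_overlaps(tasks):
    """For each task, report whether its PR load can overlap the previous image."""
    result = []
    for i, task in enumerate(tasks):
        prev_bypasses = i > 0 and not tasks[i - 1].uses_3d4d
        result.append(task.needs_pr and prev_bypasses)
    return result


# The sequence of Figure 3.5: tasks 1 and 3 are 1D-only, tasks 2 and 4 need PR.
figure_3_5 = [
    Task("Image 1", uses_3d4d=False, needs_pr=False),
    Task("Image 2", uses_3d4d=True, needs_pr=True),
    Task("Image 3", uses_3d4d=False, needs_pr=False),
    Task("Image 4", uses_3d4d=True, needs_pr=True),
]
```

For this sequence the reconfigurations for Images 2 and 4 are marked as overlappable, in agreement with the example in Figure 3.5.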

5.3 Communicating with a PC

Once the hardware was modified to accommodate DPR and software was written to generate DPR-aware input vectors, communication with the FPGA could be established. Since
high-performance processing is the goal of this project, an appropriate high-speed connection was desired. Specifically, the desired bandwidth $BW_d$ was calculated from the input
vector width $w_{in}$ and clock frequency $F_{clk}$ as
$$BW_d = w_{in} \times F_{clk} = 123\ \text{bits} \times 50\ \text{MHz} = 6.15\ \text{Gbps}$$
The previous PR-related research had utilized a second (“test rig”) board with an identical Virtex-II Pro chip to feed test input vectors to the CSC engine. This input data was
stored on a Compact Flash (CF) card, loaded and prepared by the on-board PowerPC (PPC)
microprocessor, and stored in DDR memory until a test was initiated; the formatted vectors
were then streamed to the CSC board on general purpose input/output (GPIO) pins. Thus,
this test setup required two boards and a microprocessor, and required the CF card to be rewritten for every test case. Because of the additional hardware and the limited capacity
of the CF card, the CSC interfaces with a PC in this DPR research.

Table 5.1: Comparison of I/O Interface Options

I/O Interface     Advantages                                                                        Disadvantages
Ethernet          high throughput; reference designs available                                      many data abstraction layers; not designed for high-speed processing units
PCIe              high throughput; standard on many processing units; reference designs available   relatively complex; greatest hardware requirements
UART (over USB)   simplest to implement; least hardware requirements                                very slow; serial data transfer
DDR3 RAM          test cases stored on-board; no software link required; least latency              limited by RAM size; need an interface to pre-load test cases (from a PC)

The ML605 board contains multiple I/O ports. Table 5.1 lists four I/O interfaces that
were considered, and some major advantages and disadvantages of each. After exploring Xilinx’s design examples, a PCI Express (PCIe) interface was chosen. The interface
is fast (with a reported throughput of up to 10 Gbps [16]) and is common among high-performance processing units. Xilinx also provides a user-friendly wrapper interface. Since
this establishes communication with a PC, sets of different CSC test cases can be stored on
the hard drive and streamed to the FPGA in “real-time”.
The PCIe interfaces at both the FPGA and PC ends were based off Xilinx’s ML605
Connectivity Kit. This is a hardware, HDL, and software package that demonstrates basic
loopback functionality over PCIe. The provided HDL and netlists handle the low-level
PCIe control and direct memory access (DMA) operations, and present a simplified First
In First Out (FIFO) data interface to the user. In its unmodified form, the HDL simply
loops incoming (read) data back to the outgoing (write) port of the FIFO module.
The kit includes software and drivers, with source code in the C language, for PCIe
communication with the board. A graphical application called xpmon (Xilinx Performance
Monitor) is used to start and stop data transfers and to view performance statistics. It
passes commands to the xdma (Xilinx DMA) driver, which generates packets of numeric
data in memory and writes them to the hardware through two linked drivers, xrawdata
and xaui. These correlate to two separate data paths in the hardware, Raw Data and
XAUI (10 Gigabit Attachment Unit Interface), which can be independently activated. The
drivers also allow for packet verification upon reception.
It is important to note that our ML605 board contained CES silicon, which limits us to
use of the ISE 12.4 version of the Connectivity Kit Targeted Reference Design (TRD) [18]
and version 1.3 of the PCIe CORE module [19]. Also, software and drivers were only
provided for the Linux operating system for CES silicon (Xilinx provides a Fedora 10
disk), although support for Windows has since been added for non-CES silicon. Figure 5.3
illustrates the structure of the version 12.4 TRD.
Only minor modifications were needed to run the TRD successfully on the ML605
board. Some identification and power control constants needed to be changed in the HDL
and software code. For example, the design expects a Gen. 2 4-lane PCIe connection by
default; the test PC contains a Gen. 1 8-lane PCIe slot. The xpmon application showed
steady run-time throughput around 5.9 Gbps, less than the ideal 10 Gbps and the calculated 6.15 Gbps. Furthermore, no method for transferring data between Linux’s user space
(filesystem and software) and kernel space (drivers) was immediately available. All TRD
data packets are generated, transmitted, received, checked, and discarded within the drivers.
Therefore “significant changes” were required to establish a custom data path to and from
the filesystem [16].
One of the first steps taken was to remove as much of the XAUI data path as possible,
in both hardware and software. It was decided to only utilize the Raw Data path, which
involves less overhead and is better suited for streaming data (as opposed to packetized
bursts of data). After the XAUI removal, the Raw Data path was tested to ensure its own
functionality was not affected. The Raw Data path was also given full clock cycle priority,
slightly increasing throughput, since time is no longer spent handling the XAUI path.
Next, data read/write functionality was added to the xdma driver. Since it is registered in Linux as a character device, the functionality was added via the standard file operations (f_ops) interface, which can be called from user software. The new write function
accepts a pointer to a block of data, which it copies to the next empty slot in a larger ring
buffer for PCIe transmission. The new read function copies the next block of received
data, if any, back to the user’s reception buffer. Since buffers of 4096 bytes (4 KB) are internally used by the DMA driver, data blocks of 4096 bytes were chosen in the software to
avoid misalignment problems. In the final CSC system, transmitted data blocks will contain
CSC input vectors, while received data blocks will contain processed pixel values.

Figure 5.3: The PCIe Reference Design
(Block diagram spanning Software Space, Driver Space, and Hardware (FPGA) Space. Labels: xpmon, control, stats, xdma, xrawdata, xaui, DMA data transfers, PCIe CORE, FIFO, loopback.)

Figure 5.4: The Full Software-PCIe-CSC System
(Block diagram spanning Software Space, Driver Space, and Hardware (FPGA) Space. Labels: Input Vectors, Output Images, vecSender, input vectors, output pixels, xdma, xrawdata, DMA data transfers, PCIe CORE, FIFO, CSC-PCIe Link, CSC Engine.)

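The block-based write and read behavior described above can be pictured with a user-space model. This is not the xdma driver code, which runs in kernel space and is written in C; it is a simplified sketch with hypothetical names showing fixed 4096-byte blocks moving through a ring of slots.

```python
BLOCK_SIZE = 4096  # matches the DMA driver's internal buffer size


class BlockRing:
    """Fixed-size ring of 4 KB slots, filled by write() and drained by read()."""

    def __init__(self, slots):
        self.slots = [None] * slots
        self.head = 0   # next slot to fill
        self.tail = 0   # next slot to drain
        self.count = 0

    def write(self, block):
        """Copy one block into the next empty slot; False if the ring is full."""
        if len(block) != BLOCK_SIZE:
            raise ValueError("blocks must be exactly 4096 bytes")
        if self.count == len(self.slots):
            return False
        self.slots[self.head] = bytes(block)
        self.head = (self.head + 1) % len(self.slots)
        self.count += 1
        return True

    def read(self):
        """Return the oldest stored block, or None if nothing is waiting."""
        if self.count == 0:
            return None
        block = self.slots[self.tail]
        self.tail = (self.tail + 1) % len(self.slots)
        self.count -= 1
        return block
```

Keeping the application's block size equal to the slot size means every transfer maps onto exactly one slot, which is the misalignment concern the text mentions.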
The graphical xpmon application was adapted into a new, command-line tool called
vecSender. The role of this tool was to read files of CSC vectors (as generated by
vecBuilder), write data blocks to the drivers, receive new data blocks back from the
drivers, and save this data to one or more output image files. The file I/O programming
was straightforward, and the driver communication was largely adapted from xpmon, but
utilizing the new write and read driver functions. At this point, a header was introduced
into the vector file format denoting the size (pixel count) of each image to be processed.
vecBuilder generates this header, and vecSender uses the information to correctly
route received data to one or more output images. The roles of each tool can be reviewed
in Figure 4.3.
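The routing step that the header enables amounts to slicing the received pixel stream by the recorded pixel counts. A minimal sketch, assuming the header has already been parsed into a list of counts and the received data into a flat list of pixels (the actual file layout is not specified here, and the function name is hypothetical):

```python
def split_by_header(pixel_counts, received_pixels):
    """Divide one received pixel stream into per-image lists."""
    images, start = [], 0
    for count in pixel_counts:
        images.append(received_pixels[start:start + count])
        start += count
    return images
```

For example, counts of `[2, 3]` applied to five received pixels produce one two-pixel image followed by one three-pixel image; anything received beyond the declared counts is simply not assigned to an image.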
Finally, a new HDL module was needed for handling communication between the PCIe
CORE and DPR CSC engine. This module (called the CSC-PCIe link) reads data from
the Raw Data FIFO in 64-bit words, groups them into 128-bit vectors for the CSC engine, receives 48-bit output pixels, and aligns them into 64-bit words to be written back
to the FIFO. Fortunately, these grouping and alignment operations are performed in the
PCIe/FIFO clock domain, which runs five times faster (250 MHz) than the CSC domain
(50 MHz), so the engine sees no added latency. Some synchronization logic was required
since the FIFO and CSC operate on different clocks. Originally, a full two-way handshake
system was used, piped through synchronizer flip-flops between clock domains, indicating:
when a new CSC vector is ready, when this new vector has been read, when a new output
pixel is ready, and when this new output pixel has been read. In later versions, this logic
was simplified to remove handshake latency, as described in Section 7.2.
The full software-PCIe-CSC system is illustrated in Figure 5.4. This provides a fast
and flexible platform for testing the CSC engine, as detailed in Section 5.5. The main
drawback of this platform is the many stages of I/O handling, compared to the previous
FPGA-to-FPGA platform. Files are read from the hard drive in blocks, copied to driver
space memory, transferred via DMA, retrieved from a FIFO, and combined to 128 bits
before reaching the CSC engine. Because of these steps, a throughput of slightly less than
one valid CSC input vector per 50 MHz clock cycle is expected.
In any clock cycle that a new CSC vector is not available, the CSC-PCIe link module
will deassert the entire Reg Bus and Pixel Bus. The pipeline enable flag will remain asserted
to continue processing any pixels already in the pipeline. At the finish of a CSC test, there is
a need to retrieve any output pixels that are waiting in the hardware FIFO. For this purpose,
a new virtual register address called CSC_FLUSH was created. When this value appears on
the address field of the Reg Bus, the CSC engine is not affected, but the CSC-PCIe link
will know to write null data into the FIFO until the final 4 KB buffer is transmitted back to
the PC.
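The amount of null data required by the flush can be estimated from the number of output bytes still pending. A sketch, assuming the link pads up to the next 4 KB boundary; the function name is hypothetical.

```python
def flush_padding(bytes_pending, block_size=4096):
    """Null bytes needed to complete the final block of the given size."""
    return (-bytes_pending) % block_size
```

If the pending output already ends on a block boundary, no padding is needed; otherwise the remainder of the final 4 KB buffer is filled so that it can be transmitted back to the PC.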
Once bi-directional communication is established with the vecSender application
over PCIe, testing can commence as outlined in Section 5.5.

5.4 Expanding to Two Pipelines

To further demonstrate the advantages of a DPR-enabled system, a second version of the
CSC engine was created. Since the engine’s interface was already divided into two independent halves (Reg Bus and Pixel Bus) each carrying different types of independent data,
it could potentially support two processing pipelines simultaneously. In theory, consecutive
tasks could be overlapped and allocated to alternating pipelines, producing a peak system
speedup of 2.0.

This dual-pipe version of the CSC required moderate hardware and software modifications. The processing data path of the engine is duplicated into two identical instances,
each with its own control registers, CLUTs, and reconfigurable 3D/4D partition. (The PCIe
communication modules and PR control modules are not duplicated.) Unlike the normal
single-pipe engine, this version allows unrestricted reconfiguration of one pipeline via the
Reg Bus (including control registers, CLUT values, and PR bitstreams) while the other
pipeline utilizes the Pixel Bus to process an image. Some new control logic was necessary
at the front and back ends of the pipelines, to separate and route incoming data appropriately and to select any valid pixel data at their outputs. A diagram of the added features
is shown in Figure 5.5. The system remembers which pipe is currently processing pixels
and which is reconfiguring; it toggles these roles when triggered by a new virtual register
address, CSC_DP_SWAP. This state is simply held as a 1-bit register called config1 (and
its complement config2) which is inverted when CSC_DP_SWAP appears on RegAddr.
(The config1 register and address check logic are not pictured.)
Whenever either pipeline produces a valid output pixel (indicated by its acknowledge
flag), the output acknowledge of the whole CSC (called PostCscAck) is asserted. The associated 48-bit output pixel is multiplexed from the outputs of the two pipes, automatically
selecting the correct one. By design, the pipes will never produce valid pixels simultaneously, because only one can be in the image-processing state (or rather, non-reconfiguration
state) at any moment in time.
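
The role register and output multiplexer described above can be sketched as a small behavioral model. This is only an illustration in Python, not the thesis HDL: the numeric value of `CSC_DP_SWAP`, the reset value of `config1`, and the class and method names are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical address value; the text names CSC DP SWAP but not its encoding.
CSC_DP_SWAP = 0x7F


@dataclass
class DualPipeSelect:
    """Behavioral model of the pipe-role register and the output multiplexer."""

    config1: bool = True  # assumed reset value: pipe 1 starts in the reconfiguring role

    @property
    def config2(self) -> bool:
        # config2 is always the complement of config1.
        return not self.config1

    def on_reg_addr(self, reg_addr: int) -> None:
        # Seeing CSC DP SWAP on RegAddr inverts the roles of the two pipes.
        if reg_addr == CSC_DP_SWAP:
            self.config1 = not self.config1

    @staticmethod
    def merge_outputs(ack1: bool, output1: int, ack2: bool, output2: int):
        """Return (PostCscAck, PostCscData) for one clock cycle."""
        # By design, at most one pipe produces a valid pixel in a given cycle.
        assert not (ack1 and ack2), "both pipes valid in the same cycle"
        post_csc_ack = ack1 or ack2
        post_csc_data = output1 if ack1 else output2  # 48-bit pixel
        return post_csc_ack, post_csc_data
```

Because only one pipe can be in the image-processing role at a time, the multiplexer needs no arbitration; the acknowledge flag of the active pipe is enough to select the valid pixel.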
A dual-pipe option was added to vecBuilder to provide usable input vectors to the
dual-pipe hardware. The main modifications were to the logic which decides which tasks
to overlap and which to halt; for example, a new image can be written to the Pixel Bus for
one pipeline if the other pipeline is idle or reconfiguring, but not if the other pipeline is
processing its own image still. The software also inserts the new CSC DP SWAP control
code where appropriate, namely between any configuration task and its subsequent image
processing task. No software changes were required in vecSender, the PCIe communication program.

Figure 5.5: Dual-Pipeline Support Logic
(Diagram labels: Pipe 1 and Pipe 2, each containing a 3D/4D partition and fed by pixels from the Pixel Bus and config data from the Reg Bus; output1 and output2 multiplexed onto the 48-bit PostCscData; acknowledge flags ack1 and ack2 producing PostCscAck; select signals config1 and config2, where config2 is the complement of config1.)

Naturally, the expanded dual-pipe engine will require approximately twice the previous FPGA resources and operating power. (The increased number of PR partitions also
increases the implementation time required during development.) Therefore, it is feasible only in applications that require high throughput and for which resource utilization and power consumption are not concerns. The Virtex-6 can easily provide the required resources, and
power estimation is reported in Chapter 6. The reduced restrictions on PR overlapping may
produce better system performance, so the dual-pipe and single-pipe versions were both
implemented and tested for comparison.

5.5

Testing Procedures

Several tests were needed throughout the system modification process to verify proper functionality, and to measure the effects of modifications on performance. The tests performed
can be grouped into three categories: verification of the CSC migration to the new hardware
and software, verification and measurement of the DPR-enabled engine, and verification
and measurement of the dual-pipe DPR engine.
In the first phase of the project, the provided HDL was migrated to an up-to-date version
of Xilinx’s ISE suite, as described in Section 5.1. It was necessary to verify that no functionality was broken during the upgrade. This and all subsequent verifications were made
possible by the software CSC Application provided by HP. Like the custom vecBuilder
tool, it accepts a set of CSC register data, CLUT data, and image data files. Rather than
generating inputs for the CSC engine, it generates the ideal (“known good”) output of the
CSC in the form of Tagged Image File Format (TIFF) images. Each pixel in these images
is stored as four 16-bit channels, although only the lower 12 bits are used since the CSC
engine produces 12-bit four-channel output.
Before running any hardware implementation or programming the Virtex-6, many test
cases were executed via Xilinx’s ISim simulation tool. This allowed us to verify the output of the CSC engine (before considering hardware constraints) and monitor any desired
signal activity. An HDL testbench from the previous research was adapted to read in input
vectors (which can be generated by vecBuilder), feed the simulated CSC engine, and
save the processed output to text files. A MATLAB script is used for comparing the simulated output files and ideal TIFF files. MATLAB was chosen for its native ability to read
the 16-bit TIFF file format, and its simple methods for comparing numeric matrices. The
verification process of the migrated CSC project is illustrated in Figure 5.6.

Figure 5.6: Verification Process: Simulation of Migrated CSC
(Diagram flow: CSC Register Files, CLUT Data Files, and Image Files feed both vecBuilder and the CSC Application; vecBuilder produces CSC Vector Files for ISE / ISim, which produces Output Images; the CSC Application produces its own Output Images; a Compare Script reports identical / different.)

The tests performed covered a variety of processing configurations: some used the
3D module, some used the 4D module, some used neither; in each case multiple input
images and CLUT definitions were utilized. These cases produced successful comparisons
and made it possible to detect Type I errors (different images when identical images are
expected). In other cases, one or more input pixels or CLUT values were intentionally
changed to produce different images and failed comparisons, detecting any Type II errors
(identical images when different images are expected).
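
The comparison step can be sketched as follows. This is a Python stand-in for illustration only, not the MATLAB script used in this work; it assumes each image has already been loaded as a sequence of four-channel pixel tuples, and the decision to mask off the unused upper four bits of each 16-bit channel is also an assumption.

```python
MASK_12BIT = 0x0FFF  # only the lower 12 bits of each 16-bit channel carry data


def images_identical(simulated, ideal):
    """Compare two images given as sequences of 4-channel pixel tuples."""
    if len(simulated) != len(ideal):
        return False
    for sim_px, ref_px in zip(simulated, ideal):
        for sim_ch, ref_ch in zip(sim_px, ref_px):
            if (sim_ch & MASK_12BIT) != (ref_ch & MASK_12BIT):
                return False
    return True


ideal = [(1, 2, 3, 4), (4095, 0, 17, 256)]
altered = [(1, 2, 3, 4), (4095, 0, 18, 256)]  # one channel intentionally changed

print(images_identical(list(ideal), ideal))  # matching case, guards against Type I errors
print(images_identical(altered, ideal))      # altered case, guards against Type II errors
```

Running both a matching case and an intentionally altered case exercises the comparison in both directions, which is the purpose of the two groups of tests described above.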
When the migrated CSC engine was implemented and programmed to the FPGA, a
similar verification process was used. However, no external I/O interface was implemented
then, so a new Built-In Self Test (BIST) sub-module was created. This module features
two large read-only memories (ROMs); the first stores a stream of CSC input vectors and
the second stores ideal output values. Upon reset, the input vectors are read sequentially
from the Input ROM and passed to the CSC engine, while its output is checked against the
sequence of pixels read from the Output ROM. A pass or fail comparison is indicated by
the on-board light emitting diodes (LEDs).

Figure 5.7: Verification Process: Hardware BIST
(Diagram flow: within the BIST Module, the Input COE File initializes the Input ROM, which supplies input vectors to the CSC Engine, the unit under test; the Output COE File initializes the Output ROM, which supplies the ideal output; Compare Logic checks the output pixels against the ideal output and reports identical / different.)

The hardware BIST is illustrated in Figure 5.7. The CSC engine is designated as the
unit under test (UUT). The same test cases that were simulated were executed again in the
hardware, with one additional step. The outputs of vecBuilder and the CSC Application needed to be converted to Xilinx’s memory coefficients file format (.COE) in order
to synthesize to a hardware ROM. These two conversions were performed by the simple
custom tools testVector2coe (written in C) and tiff2coe (written in MATLAB)
respectively.
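
As an illustration of the conversion step, the sketch below formats a list of integer words as .COE text using the `memory_initialization_radix` and `memory_initialization_vector` keywords of that file format. It is not the testVector2coe or tiff2coe tool; the function name, the 12-bit word width in the example, and the one-word-per-line layout are assumptions.

```python
def to_coe(words, width_bits):
    """Format integer words as the text of a Xilinx .COE memory file."""
    hex_digits = (width_bits + 3) // 4  # hex characters needed per word
    limit = 1 << width_bits
    body = []
    for word in words:
        if not 0 <= word < limit:
            raise ValueError(f"word {word:#x} does not fit in {width_bits} bits")
        body.append(f"{word:0{hex_digits}x}")
    lines = [
        "memory_initialization_radix=16;",
        "memory_initialization_vector=",
        ",\n".join(body) + ";",  # comma-separated words, semicolon after the last
    ]
    return "\n".join(lines) + "\n"


print(to_coe([0x001, 0xABC, 0xFFF], width_bits=12), end="")
```

The same routine would serve both ROMs, since the Input ROM and Output ROM differ only in word width and contents.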
Once dynamic partial reconfiguration was added to the system, simulations became
less useful. In its current form, PR execution cannot be simulated since it is a hardware-specific feature of the FPGA (the ICAP simulation model is non-functional). Also, ISim
links and compiles project modules in one initial stage; these cannot be reconfigured mid-simulation. For these reasons, simulation can only be performed for static (non-PR) test
cases as previously done. The focus shifted to hardware self tests and PC-FPGA interactive
tests.
The BIST module was utilized again for the DPR-enabled CSC engine. In contrast
to earlier tests, the input vector files now include multiple images and PR bitstreams inserted where appropriate. A set of four test cases was created for DPR verification and
performance measurement, each containing two different processing configurations and a
PR bitstream in between (overlapped with the first stage’s image data). These four cases
are described in Table 5.2. Because the single-pipe CSC engine is being used, PR is only
possible when a 3D or 4D processing stage follows a stage which utilizes neither 3D nor 4D
processing. The first pair of test cases (1-X) operate on “small” images, while the second
pair (2-X) operate on full-page 8.5 inch by 11 inch, 600 DPI images. Hardware tests were
limited to sequences of two images for practical purposes: two full-page images translate to
more than one gigabyte (2^30 bytes) when uncompressed and formatted into CSC input vectors. The performance of longer sequences will be estimated by extrapolating the results,
as documented in Chapter 6.
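
The image and file sizes quoted above can be checked with a short calculation. The pixel count follows directly from the page dimensions and resolution; the figure of 16 bytes per stored input vector is an assumption made for this sketch, since the vector file layout is not specified here.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel count of a page rendered at a uniform resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


full_page = page_pixels(8.5, 11, 600)  # 5100 x 6600 pixels

# Assumption: one input vector per pixel, each stored in 16 bytes.
BYTES_PER_VECTOR = 16
two_pages_bytes = 2 * full_page * BYTES_PER_VECTOR

print(full_page)                # 33660000
print(two_pages_bytes > 2**30)  # True
```

Under that assumption, two full-page images occupy about 1.08 billion bytes, which is consistent with the statement that they exceed 2^30 bytes.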
Once the PCIe communication is established with the Virtex-6, test cases can be run
in real-time via vecSender, without the need to repeatedly reprogram the BIST ROMs.
Performance is measured using two related metrics: processing execution clock cycles
(directly correlated to the length of the input vector file), and total execution time (which
includes all actual latencies: file I/O, software, PCIe wrapper, etc.). The first metric only
accounts for “active” cycles in which a new, valid CSC vector is provided to the engine; the
second metric includes also the overhead cycles, because it measures actual test duration.
The results of each test case are tabulated and analyzed in Chapter 6.
The dual-pipe CSC was tested and measured just like the single-pipe CSC, but with
an extended set of test cases to cover the additional possibilities for overlapping. Loading
of 3D module bitstreams can now overlap with image processing that uses a 4D module,
and vice versa. This second set of test cases is shown in Table 5.3. Like the previous
cases, both small (3-X) and full-page (4-X) tests are performed for comparison. Real-time
performance results are presented in Chapter 6.

Test Case   Stage 1                                                Stage 2
1-1         no 3D/4D module, small image, 18,335 pixels            3D module, small image, 19,200 pixels
1-2         no 3D/4D module, small image, 18,335 pixels            4D module, small image, 18,335 pixels
2-1         no 3D/4D module, full-page image, 33,660,000 pixels    3D module, full-page image, 33,660,000 pixels
2-2         no 3D/4D module, full-page image, 33,660,000 pixels    4D module, full-page image, 33,660,000 pixels

Table 5.2: Single-Pipe CSC Test Cases

Test Case   Stage 1                                                Stage 2
3-1         no 3D/4D module, small image, 18,335 pixels            3D module, small image, 19,200 pixels
3-2         no 3D/4D module, small image, 18,335 pixels            4D module, small image, 18,335 pixels
3-3         3D module, small image, 19,200 pixels                  4D module, small image, 18,335 pixels
3-4         4D module, small image, 18,335 pixels                  3D module, small image, 19,200 pixels
4-1         no 3D/4D module, full-page image, 33,660,000 pixels    3D module, full-page image, 33,660,000 pixels
4-2         no 3D/4D module, full-page image, 33,660,000 pixels    4D module, full-page image, 33,660,000 pixels
4-3         3D module, full-page image, 33,660,000 pixels          4D module, full-page image, 33,660,000 pixels
4-4         4D module, full-page image, 33,660,000 pixels          3D module, full-page image, 33,660,000 pixels

Table 5.3: Dual-Pipe CSC Test Cases

Chapter 6
Results
This chapter presents all quantitative results of the implementations and the hardware tests,
for both single-pipe and dual-pipe CSC engines.

6.1

CSC Performance

Performance of the CSC engine in each test case was measured using several metrics. These
are defined as follows:
Active Cycles, Static: The number of clock cycles (and corresponding input vectors)
required by the previous (static PR) CSC engine, as produced by vecBuilder in non-overlapping mode.
Active Cycles, DPR: The number of cycles (and vectors) required by the DPR-enabled
CSC engine, as produced by vecBuilder in overlapping mode.
Speedup: The relative processing speed attained by enabling DPR. Calculated as:

\[ S = \frac{\text{active cycles, static}}{\text{active cycles, DPR}} \]

Speedup, N = 10: Theoretical speedup if the sequence was extended to 10 images:
$N$ = number of configuration-image pairs
$CL_k$ = configuration length for image $k$, $k \in \{1, 2\}$
$IL_k$ = pixel data length for image $k$

\[ S_N = \frac{\frac{N}{2} \times [\, CL_1 + IL_1 + CL_2 + IL_2 \,]}{CL_1 + \frac{N}{2} \times [\, \max(IL_1, CL_2) + \max(IL_2, CL_1) \,]} \]

Speedup, N = ∞: Theoretical speedup if the sequence became very long (N approaches infinity). Note that for the ideal case of $CL_1 = IL_1 = CL_2 = IL_2$, speedup
approaches 2.0; the configuration overhead of the very first image becomes negligible.
Time per Image, Ideal: Total processing time (in seconds) divided by the number of
images (two, here), assuming a fully-saturated CSC running at 50 MHz with no gaps in the
input stream. The proposed goal is one second or less.
Time per Image, Measured: Total processing time divided by the number of images,
in the full software-PCIe-FPGA system, measured by hardware counters running at the
CSC’s native 50 MHz.
The performance of the single-pipe engine, as calculated from collected data and measured in hardware, is captured in Table 6.1. The performance of the dual-pipe engine is
captured in Table 6.2.
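
The speedup metrics defined above can be evaluated with a few lines of code. The sketch below is illustrative only: the function names are chosen here, and the equal-length value of 1,000 cycles is a hypothetical input used to show the ideal case, not a measured quantity.

```python
import math


def speedup(static_cycles, dpr_cycles):
    """Measured speedup: static active cycles over DPR active cycles."""
    return static_cycles / dpr_cycles


def speedup_n(n, cl1, il1, cl2, il2):
    """Extrapolated speedup S_N for n configuration-image pairs (or math.inf)."""
    static_per_two = cl1 + il1 + cl2 + il2
    dpr_per_two = max(il1, cl2) + max(il2, cl1)
    if math.isinf(n):
        # The one-time cost of the first configuration becomes negligible.
        return static_per_two / dpr_per_two
    return (n / 2) * static_per_two / (cl1 + (n / 2) * dpr_per_two)


# Test case 1-1 from Table 6.1.
print(round(speedup(142_079, 123_725), 3))

# Ideal case: all four task lengths equal (hypothetical value of 1,000 cycles).
for n in (2, 10, math.inf):
    print(n, round(speedup_n(n, 1000, 1000, 1000, 1000), 3))
```

In the ideal case the expression reduces to 2N / (N + 1), so the extrapolated speedup rises from 4/3 for two images toward 2.0 as the sequence grows.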

6.2

Configuration Times

The time required for FPGA reconfiguration depends on three parameters: the length of the
bitstream, the programming frequency, and the size of the word loaded in each cycle. Two
methods of reconfiguration are used in this project.
Impact: Manual programming through the Impact software is typically performed after
FPGA power-up or design re-implementation, using the full bitstream. (Alternatively, it
can be downloaded to the board’s Flash memory and automatically booted upon powerup.) Impact uses a serial JTAG-over-USB protocol, and writes 1 bit/cycle at 6 MHz.

Test   Active Cycles,   Active Cycles,   Speedup   Speedup,   Speedup,   Time/Image,   Time/Image,
Case   Static           DPR                        N = 10     N = ∞      Ideal (sec)   Measured (sec)
1-1    142,079          123,725          1.148     1.171      1.173      0.001         0.002
1-2    144,520          126,166          1.145     1.172      1.174      0.001         0.002
2-1    67,424,544       67,331,838       1.001     1.001      1.001      0.673         3.441
2-2    67,427,850       67,334,747       1.001     1.001      1.001      0.673         3.524

Table 6.1: Performance Results, Single-Pipe

Test   Active Cycles,   Active Cycles,   Speedup   Speedup,   Speedup,   Time/Image,   Time/Image,
Case   Static           DPR                        N = 10     N = ∞      Ideal (sec)   Measured (sec)
3-1    154,112          135,758          1.135     1.154      1.156      0.001         0.002
3-2    154,154          135,800          1.135     1.158      1.160      0.001         0.002
3-3    255,075          235,856          1.081     1.086      1.265      0.002         0.004
3-4    259,702          241,348          1.076     1.110      1.295      0.002         0.004
4-1    67,436,577       67,321,104       1.002     1.002      1.002      0.673         3.606
4-2    67,437,484       67,321,104       1.002     1.002      1.002      0.673         3.537
4-3    67,537,540       67,421,160       1.002     1.001      1.001      0.674         3.476
4-4    67,542,167       67,426,694       1.002     1.001      1.001      0.674         3.461

Table 6.2: Performance Results, Dual-Pipe

Single-Pipe              Size         Size    Config. Time, Impact   Config. Time, Impact   Config. Time,
Bitstream                (bits)       (KB)    (sec, calculated)      (sec, measured)        ICAP (sec)
Full                     50,106,376   6,117   8.351                  15                     0.031
Partial (3D/4D Module)   2,978,624    364     0.496                  1                      0.002

Table 6.3: Bitstream Sizes & Configuration Times, Single-Pipe

Dual-Pipe                Size         Size    Config. Time, Impact   Config. Time, Impact   Config. Time,
Bitstream                (bits)       (KB)    (sec, calculated)      (sec, measured)        ICAP (sec)
Full                     50,055,720   6,110   8.343                  15                     0.031
Partial (3D/4D Module)   3,350,912    409     0.558                  1                      0.002

Table 6.4: Bitstream Sizes & Configuration Times, Dual-Pipe
ICAP: Automated reconfiguration via the ICAP block, using partial bitstreams, is the
foundation of our DPR system. Here it operates at 50 MHz and reads 32 bit/cycle.
The full and partial bitstream sizes, Impact programming times, and ICAP programming times for the single-pipe implementation are listed in Table 6.3. The corresponding
data for the dual-pipe version is listed in Table 6.4.
Note that these bitstreams were generated by bitgen with the compression option
(-g compress) enabled. This feature detects segments containing identical bits and
combines them, utilizing the FPGA’s Multiple Frame Write Register (MFWR). Because
our design contains large uninitialized CLUT RAMs, significant compression of 30% to
50% is typically achieved. However, compression also adds variance to the size of partial
bitstreams, depending on the contents (3D, 4D) and pipeline structure (single, dual). In
Tables 6.3 and 6.4, the largest (worst case) of the partial bitstream variants is listed.
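
The calculated configuration times follow from the three parameters named at the start of this section: bitstream length, programming frequency, and word size per cycle. The sketch below recomputes the single-pipe figures from the bitstream sizes in Table 6.3; the function and dictionary names are chosen for this example, and one KB is taken as 1,024 bytes.

```python
def config_time_s(bits, clock_hz, bits_per_cycle):
    """Time to load a bitstream at a fixed clock rate and word width."""
    return bits / (clock_hz * bits_per_cycle)


IMPACT = dict(clock_hz=6e6, bits_per_cycle=1)   # serial JTAG-over-USB
ICAP = dict(clock_hz=50e6, bits_per_cycle=32)   # 32-bit words at the CSC clock

for name, bits in [("full", 50_106_376), ("partial", 2_978_624)]:
    size_kb = round(bits / 8 / 1024)
    t_impact = round(config_time_s(bits, **IMPACT), 3)
    t_icap = round(config_time_s(bits, **ICAP), 3)
    print(name, size_kb, t_impact, t_icap)
```

The ratio between the two methods is fixed by the parameters alone: ICAP moves 32 bits per cycle at 50 MHz against 1 bit per cycle at 6 MHz, roughly a 267-fold difference, independent of bitstream size.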

6.3

FPGA Resources

Finally, statistics concerning FPGA resource utilization were collected for both engines,
with and without the PCIe link (which adds considerable resource requirements). Estimates
of static and dynamic power consumption were derived from these statistics plus Virtex-6
power information provided by Xilinx. Power consumption of the CSC’s original ASIC
implementation (running at a higher frequency) is not available, but the estimates are useful
for comparing the single-pipe versus dual-pipe designs, and the contribution of PCIe. The
quantities collected or estimated are:
Registers: General-purpose registers utilized in the design.
LUTs: 6-input lookup tables, some of which may also be configured as shift registers.
Slices: Groups of four LUTs and eight registers providing various functionality.
IOBs: Input/output buffers at the ports of the design’s top-level module.
RAMB36: 36-kilobit block RAMs, automatically configured as either RAMs, ROMs,
or FIFOs by the implementation tool as appropriate.
BUFG: Global buffered traces, usually reserved for clock signals.
MMCM: Clock signal managers/generators, used to produce multiple clocks of different frequencies from one (or more) single-ended or differential source.
Static Power: Power consumed by the design regardless of any system activity.
Dynamic + I/O Power: Power consumed by switching logic and I/O buffering.
Total Power: Sum of these two power metrics.
Resource utilization and power estimates for the single-pipe and dual-pipe engines are
listed in Table 6.5 and Table 6.6 respectively. Data was collected from PlanAhead post-implementation reports, and imported into Xilinx’s Power Estimator spreadsheet.

Resource    Number      Number     Percent    Number Utilized   Percent Utilized   Rel. Increase
            Available   Utilized   Utilized   (PCIe added)      (PCIe added)       (PCIe added)
Registers   301,440     3,888      1%         31,408            10%                708%
LUTs        150,720     9,277      6%         33,308            22%                259%
Slices      37,680      3,204      9%         13,521            36%                322%
IOBs        600         296        49%        135               23%                -54%
RAMB36      416         60         14%        135               32%                125%
BUFG        32          1          3%         10                31%                900%
MMCM        12          0          0%         2                 17%                -

Power Type      Estimate (W)   Estimate (W)   Relative Increase
                               (PCIe added)   (PCIe added)
Static          1.99           2.03           2%
Dynamic + I/O   0.13           1.47           1052%
Total           2.12           3.50           66%

Table 6.5: Resource Utilization & Power Estimates, Single-Pipe

Resource    Number      Number     Percent    Number Utilized   Percent Utilized   Rel. Increase
            Available   Utilized   Utilized   (PCIe added)      (PCIe added)       (PCIe added)
Registers   301,440     7,778      3%         35,297            12%                354%
LUTs        150,720     18,756     12%        42,463            28%                126%
Slices      37,680      6,527      17%        17,138            45%                163%
IOBs        600         296        49%        135               23%                -54%
RAMB36      416         120        29%        195               47%                63%
BUFG        32          1          3%         10                31%                900%
MMCM        12          0          0%         2                 17%                -

Power Type      Estimate (W)   Estimate (W)   Relative Increase
                               (PCIe added)   (PCIe added)
Static          1.99           2.03           2%
Dynamic + I/O   0.18           1.53           748%
Total           2.17           3.56           64%

Table 6.6: Resource Utilization & Power Estimates, Dual-Pipe

Chapter 7
Discussion
The primary purpose of adding dynamic partial reconfiguration to the CSC engine is to
overlap tasks and increase processing throughput. Reductions in resource utilization, power
consumption, and design cost/complexity are also targeted, but are considered secondary
goals. Section 7.1 discusses the overall trade-offs of adding DPR, while Section 7.2 specifically considers the performance of our PCIe-enabled hardware platform. Ideas and suggestions for future system enhancements are proposed in Section 7.3.

7.1

Observed Benefits and Drawbacks

Depending on pipeline configuration, processing sequence, and image sizes, the use of DPR
in the CSC engine results in speedup ranging from moderate to negligible. The single-pipe
implementation provides the most relevant results, because it is most similar to the real-world engine implementations currently in use.
Table 6.1 indicates that small images (around 160×120 pixels, test cases 1-X) are processed in one DPR pipeline with speedup between 1.145 (sequence of two) and 1.174 (long
sequence). Full-page images (33 Mpixel, cases 2-X) showed negligible speedup of 1.001,
less than 1%, regardless of sequence length. Looking at only these test cases, the results
would not justify the integration of a DPR subsystem.
The speedup was limited by a number of factors. In any system, overlapping two tasks
should allow a maximum speedup of 2.0; however, the nature of the single-pipe CSC restricts what can overlap. Two pixel sequences (images) cannot overlap because they require
the Pixel Bus, and configuration sequences (control registers, CLUTs, PR bitstreams) cannot overlap because they require the Reg Bus. Recall that the only pair of tasks that can be
overlapped is reconfiguration of the 3D/4D partition together with any pixel processing configured
to bypass it.
Even when overlapping is possible, speedup is affected by the relative lengths of the
two tasks. An ideal speedup of 2.0 is only possible for tasks of identical length; otherwise
speedup falls between 1.0 and 2.0. These cases are illustrated in Figures 7.1 and 7.2. In
our test cases, a typical PR bitstream spans about 95,000 clock cycles — more than the
small images listed, and much less than the full-page images. (In all test cases, each image
requires one clock cycle per pixel.)
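
The effect of relative task length can be illustrated numerically. The sketch below considers only the overlapped pair itself (one PR bitstream and one image, at one cycle per pixel) and ignores the register and CLUT loads that cannot overlap, so the values are upper bounds for that pair and will not match the whole-test speedups in Chapter 6. The function name is chosen for this example.

```python
def overlap_speedup(task_a, task_b):
    """Speedup from running two tasks concurrently instead of back to back."""
    return (task_a + task_b) / max(task_a, task_b)


PR_CYCLES = 95_000  # typical PR bitstream length quoted above

for label, pixels in [("small", 18_335), ("equal", 95_000), ("full-page", 33_660_000)]:
    print(f"{label:9s} {overlap_speedup(PR_CYCLES, pixels):.3f}")
```

The shorter task is hidden entirely behind the longer one, so the gain is the shorter length divided by the longer: about 19% for a small image, 100% for equal lengths, and under 0.3% for a full page.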
The dual-pipe version of the engine is useful, despite the increase in resources and complexity, because it can take greater advantage of DPR. The addition of a second independent
pipeline allows any pixel sequence to overlap with any configuration sequence. Table 6.2
captures the speedup obtained in small-image tests (3-X) and full-page tests (4-X). For sequences of small 1D-only processing followed by small 3D/4D processing, speedup was
slightly less than in the corresponding single-pipe tests (about 14% versus 15% for short
sequences, 16% versus 17% for long sequences). This is likely due to slight variance in
generated bitstream lengths, which are the dominant contribution to test length in small-image cases. The full-page test cases show the same outcome as their single-pipe counterparts: almost no speedup (1.002) regardless of processing configuration, overlap amount,
or sequence length. This highlights an important conclusion for both the single-pipe and
dual-pipe designs: as input images become very large, configuration times become negligible (whether overlapped or not), and the rate of incoming pixels becomes the performance
bottleneck. The most straightforward way to improve this would be to scale up system
parameters — clock frequency, input bus width, etc. — as long as the system providing the
input vectors can match the increased throughput.

Figure 7.1: Speedup of Equal-Length Tasks
(Diagram: Tasks 1A, 1B, 2A, 2B shown as a sequential execution sequence and again with task overlapping; speedup = 2.0.)

Figure 7.2: Speedup of Unequal-Length Tasks
(Diagram: Tasks 1A, 1B, 2A, 2B of unequal length shown as a sequential execution sequence and again with task overlapping; 1.0 < speedup < 2.0.)

The greatest speedup was achieved in dual-pipe test cases 3-3 and 3-4: a small 3D-processed image followed by 4D, and vice versa. Note that this sequence was not tested
on the single-pipe because both images require the one 3D/4D module, so no overlapping
would be possible. Case 3-4 showed speedup ranging from 1.076 (two-image sequence)
to 1.295 (long sequence). Because every image was paired with a PR bitstream, and the
dual-pipe bitstreams tended to be slightly longer, these tests were able to benefit the most
from overlapping configurations. Longer configurations have greater impact on total processing time, so greater relative speedup is achieved when properly overlapped.
The dual-pipe engine bears some functional resemblance to a two-core microprocessor,
and its effective speedup S can be approximated by Amdahl’s Law, with N = 2, as:

\[ S = \frac{1}{(1 - P) + \frac{P}{N}} \]

$N$ = number of parallel processing elements
$P$ = fraction of work which can be parallelized
$1 - P$ = fraction of work which must execute sequentially
In the case of the CSC engine, fraction P is determined by the relative lengths of images
and configuration sequences. If all tasks can be overlapped and are of equal lengths (the
ideal case), then P = 1 and speedup S becomes 2. If one task is shorter than the other,
then not all of the work can be overlapped; one pipeline will temporarily be idle, and P
will be between 0 and 1. For example, if P happens to be 1/3, then speedup S would
become 6/5 = 1.2. This concept can be applied to theoretical extended sequences of any
length. (Note that in all calculations regarding extended sequences, it is assumed that a
new configuration is applied between every image. This may or may not occur in real-world operating conditions. If consecutive images used the same parameters and CLUTs,
then no reconfiguration would be necessary and the system would again be limited only by
the rate it can stream in images.)
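
One way to make the link between P and the task lengths explicit is the following short derivation. It is a sketch under a simplifying assumption introduced here: exactly two tasks, of lengths a and b with a ≤ b, where the shorter task is fully hidden behind the longer one. The symbols a and b are not used elsewhere in this work.

```latex
% Sequential time, overlapped time, and resulting speedup for two tasks a <= b
\[
T_{\text{seq}} = a + b, \qquad T_{\text{overlap}} = b, \qquad
S = \frac{a + b}{b}
\]
% Equating with Amdahl's Law for N = 2 and solving for P
\[
\frac{1}{(1 - P) + \frac{P}{2}} = \frac{a + b}{b}
\;\Longrightarrow\;
1 - \frac{P}{2} = \frac{b}{a + b}
\;\Longrightarrow\;
P = \frac{2a}{a + b}
\]
% Checks: equal lengths, and the P = 1/3 example from the text
\[
a = b \;\Rightarrow\; P = 1,\; S = 2;
\qquad
b = 5a \;\Rightarrow\; P = \tfrac{1}{3},\; S = \tfrac{6}{5} = 1.2
\]
```

Under this model, P is twice the shorter task's share of the total work, which is why the speedup of 1.2 in the example above corresponds to one task being five times longer than the other.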
The length of tested configuration tasks (including FPGA reconfiguration and CLUT
loading) varied from about 50,000 to 100,000 vectors depending on the sizes of bitstreams
and conversion tables. This indicates that an optimal processing speedup of nearly 2.0
would have been achieved for images between 50,000 and 100,000 pixels. For reference, this range approximately corresponds to the Quarter Video Graphics Array resolution (QVGA, 320 × 240 or 76,800 pixels) and Wide QVGA resolution (400 × 240 or
96,000 pixels). These sizes are much smaller than the full-page image definition, but are
reasonable for individual page elements which might require independent conversions.
Dynamic reconfiguration via the ICAP block was successful and reasonably fast. As
captured in Tables 6.3 and 6.4, partial bitstream sizes fell between 350 and 450 kilobytes (KB). These could be written to the Virtex-6 in as little as 2 milliseconds (ms), assuming a continuous 50 MHz stream of 32-bit words. For the full-page test cases, this was
an insignificant fraction of the 670+ milliseconds required per image (Tables 6.1 and 6.2).
In the small-image test cases, reconfiguration consumed a larger relative portion of total
processing time. Both situations are acceptable because they support throughput better
than the proposed goal of one image per second (1000 ms). The majority of resources used
in the PR partitions were block RAMs allocated for CLUT data. An alternative engine that
performs conversions by arithmetic logic would produce smaller bitstreams, but would lack
the advantages discussed in Section 3.2.
It should also be noted that the larger, full-chip configuration bitstreams were around
6 megabytes (MB) in size, but were typically stored in the ML605’s Flash memory and
booted at power-up. While the Flash programming itself is a slow action, booting from
Flash is fast and completes within 100 ms, the timing window for establishing proper PCIe
communication with the test PC. For size comparison, Figures 7.3 and 7.4 illustrate the
full FPGA layouts (single and dual-pipe) and the PR partitions within them. The partitions
appear as highlighted regions in the upper-right and lower-right quadrants.
Table 6.5 captures the FPGA resources required by the single-pipe engine. Excluding I/O buffers (IOBs, 49% utilized), the worst-case resource utilization was block RAMs
(RAMB36) at 14%. Adding the PCIe link (including bus management, DDR3 buffering,
FIFO interfacing) increased all resources by a factor of 2 or more, except for IOBs. (The
non-PCIe version required more IOBs because all CSC inputs, outputs, and global control signals [200+ bits] were top-level ports; the PCIe version communicates all I/O data
on common PCIe data ports.) The worst-case utilization in the PCIe-enabled engine was
general-purpose Slices at 36%. A new resource, the MMCM clock manager, was introduced since the PCIe subsystem requires several clock frequencies. The non-PCIe engine
only requires a single 50 MHz clock input at the top level.

Figure 7.3: Single-pipe Layout
(FPGA layout image with the 3D/4D PR partition highlighted.)

Figure 7.4: Dual-pipe Layout
(FPGA layout image with 3D/4D PR partitions #1 and #2 highlighted.)

As expected, the dual-pipe engine approximately doubled all resource requirements,
except for IOBs and clock buffers (BUFG) since the top-level interface did not change.
Table 6.6 shows that block RAM remained the dominant resource at 29%. This indicates
that perhaps two more pipelines could safely fit on the FPGA (Xilinx considers critical
utilization to begin around 80%). However, with the PCIe link included, RAM utilization
increased to 47% and Slices increased to 45%.
Power consumption was estimated based on resource utilization, clock frequencies,
and Virtex-6 technology parameters. Tables 6.5 and 6.6 indicate that static power was the
dominant contribution in all four cases (single/dual, PCIe or not). Changing from single
to dual-pipes seemed to increase total power consumption insignificantly, while adding
PCIe increased it by about 65%. The power estimator provided by Xilinx uses simplistic assumptions, and therefore actual power consumption may vary widely during CLUT
loading, partial reconfiguration, etc.
It is important to note that the resource utilization and power consumption introduced
by the PCIe subsystem may not apply to actual implementations of the CSC engine, since
it may be directly connected to another ASIC or hardware component. PCIe was only
introduced here to demonstrate a functional, flexible, real-time test platform.

7.2

The PC-FPGA Platform

The test platform consisting of a Linux PC, custom software, and a PCIe-connected ML605
board was successfully used to demonstrate the DPR-enabled CSC engine. However, at the
time of writing, the performance does not meet the desired goal of at least one full-page
image (33 Mpixel) per second. Tables 6.1 and 6.2 instead indicate measured processing
times around 3.5 seconds, for a throughput of about 10 million pixels per second.
The hardware engine is fully capable of processing a continuous 50 MHz stream of
input vectors, which it does whenever a large burst of data arrives over the PCIe bus. Unfortunately, delays between bursts significantly affect throughput, particularly in the larger
tests. The previous PR research on a Virtex-II platform had supplied continuous CSC input vectors from a second, microcontroller-operated Virtex-II board. Custom PC software
and a PCIe link were chosen here for flexibility and demonstration purposes, at the cost of
increased complexity and additional stages in the I/O chain.
Throughout testing, the entire data path (from disk drive to software, to drivers, PCIe,
CSC engine, and back) was analyzed to find and improve points of low performance. Currently, CSC input vectors are stored on the hard disk drive (HDD) as formatted binary files,
as generated by vecBuilder, one per test case. vecSender reads these in buffered
4 kilobyte blocks and writes them to the PCIe driver as quickly as possible, which then
writes them to the hardware via fast DMA transfers.
In the first version of the customized PCIe driver, a DMA transfer was initiated for every
individual 4 KB block provided by vecSender. This resulted in intermittent transmission
to the FPGA. In later versions, these blocks of data were queued in buffers within the
driver and a fast sequence of DMA transfers was initiated once a threshold was reached.
Transmission became less sparse, and throughput was improved by a factor of nearly 10.
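
The difference between the two driver versions can be sketched as a simple batching queue. This is a Python model for illustration only, not the driver code: the class name, the `start_dma` callback, and the threshold of four blocks are all hypothetical.

```python
class DmaBatcher:
    """Queue fixed-size blocks and release them in bursts once a threshold is reached."""

    def __init__(self, threshold_blocks, start_dma):
        self.threshold_blocks = threshold_blocks
        self.start_dma = start_dma  # callable that transfers a list of blocks
        self.queue = []

    def write(self, block):
        """Accept one block from the sender; trigger a burst when enough are queued."""
        self.queue.append(block)
        if len(self.queue) >= self.threshold_blocks:
            self.flush()

    def flush(self):
        """Transfer whatever is queued, for example at the end of a test case."""
        if self.queue:
            self.start_dma(self.queue)
            self.queue = []


bursts = []
batcher = DmaBatcher(threshold_blocks=4, start_dma=bursts.append)
for _ in range(10):
    batcher.write(bytes(4096))  # one 4 KB block per write
batcher.flush()
print([len(burst) for burst in bursts])
```

Setting the threshold to one block reproduces the behavior of the first driver version, with one transfer per 4 KB write; larger thresholds trade latency for fewer, denser bursts.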
Improvements were also made on the hardware side. The first version of the CSC-PCIe
link included a full, two-way handshake protocol between the PCIe FIFO buffer (operating
at 250 MHz) and the CSC input logic (operating at 50 MHz). Handshakes ensured robustness and made the HDL easier to understand, but would add up to 3 wasted CSC clock
cycles per input vector. The logic that receives, assembles, and transports input vectors
was later refactored into a highly optimized, five-state cycle which does not rely on handshakes and can provide the CSC engine a valid input vector at every rising 50 MHz clock

54

edge. Figure 7.5 shows a partial timing diagram of the new state machine, along with HDL
pseudocode to summarize the role of each state. (Note the two-state delay between asserting the FIFO’s read signal and actually reading the FIFO’s data.) This method relies on the
implementation-specific knowledge that the two clock signals are generated from dividers
of the same clock source — they will be edge-synchronized and occur in a 5:1 pattern. If
the two clock signals came from unknown sources with arbitrary phase, synchronization
would not be guaranteed and metastability errors could occur in data reads.
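The five-state read cycle described above (and summarized in Figure 7.5) can be approximated by a behavioral model. The following is only a software sketch, not the HDL used in this work: the function names are hypothetical, the FIFO is modeled as a queue of 64-bit integers, and `None` is assumed to stand in for the "empty vector" that is presented when fewer than 16 bytes are available.

```python
from collections import deque
from typing import Deque, List, Optional

WORD_BYTES = 8      # the FIFO delivers 64-bit words
VECTOR_BYTES = 16   # one CSC input vector is 128 bits


def run_frame(fifo: Deque[int]) -> Optional[int]:
    """Model one A-B-C-D-E cycle (five 250 MHz clocks, one 50 MHz clock)."""
    # State A: check whether a full vector is waiting; assert FIFO read if so.
    x = len(fifo) * WORD_BYTES >= VECTOR_BYTES
    read_in_a = x
    # State B: keep FIFO read asserted for the second 64-bit word.
    read_in_b = x
    # State C: FIFO read is deasserted; the word requested in state A arrives.
    vec_reg_low = fifo.popleft() if read_in_a else 0
    # State D: the word requested in state B arrives.
    vec_reg_high = fifo.popleft() if read_in_b else 0
    # State E: present a full vector, or an empty one if the FIFO was short.
    if x:
        return (vec_reg_high << 64) | vec_reg_low
    return None  # assumption: None represents the "empty vector"


def run_frames(fifo: Deque[int], frames: int) -> List[Optional[int]]:
    """Run several consecutive five-state cycles and collect the outputs."""
    return [run_frame(fifo) for _ in range(frames)]
```

The model makes the two-state read delay explicit: the word requested in state A is captured in state C, and the word requested in state B is captured in state D, so the first word read becomes the low half of the vector.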
Some minimal handshaking remains in the output (pixel) handling logic, since the output throughput is less demanding and is generally not problematic. The CSC’s output
bandwidth of 2.4 Gbps (48-bit pixel per cycle, 50 MHz) is less than half the desired input
bandwidth of 6.15 Gbps.
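The output bandwidth figure follows directly from the bus width and clock rate. As a quick check, the short sketch below recomputes it and compares it with the stated input bandwidth; the function name is hypothetical and the 6.15 Gbps value is simply taken from the text.

```python
def bandwidth_gbps(bits_per_cycle: int, clock_hz: float) -> float:
    """Return the bandwidth in Gbps for a bus that moves bits_per_cycle each clock."""
    return bits_per_cycle * clock_hz / 1e9


# One 48-bit pixel per cycle at 50 MHz (values from the text).
output_gbps = bandwidth_gbps(48, 50e6)

# Desired input bandwidth as stated in the text.
desired_input_gbps = 6.15

# Fraction of the input bandwidth needed on the output side.
output_to_input_ratio = output_gbps / desired_input_gbps
```

The ratio comes out below one half, which is consistent with the decision to leave some handshaking in the less demanding output path.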
Further hardware optimizations are possible. Since the PCIe link was adapted from
Xilinx’s reference design, it included a FIFO interface around a 256 KB buffer, stored in
off-chip DDR3 memory. Because the CSC reads from the FIFO more slowly than the PCIe
bus can write to it (∼10 Gbps peak, when large bursts of data arrive), the FIFO is at risk of
overflowing if incoming data is transmitted too quickly. (This limits the DMA-triggering
threshold parameter of the PCIe driver mentioned earlier; data loss was observed if the
threshold was set too high and too many packets were transferred at once.) Removing the
DDR3 reliance and FIFO interface for direct access to DMA data may increase hardware
throughput, but this would require more complex PCIe link logic and block RAM utilization for an on-chip buffer.
[Timing diagram: signals clk250, FIFO bytes (16, 8, 0), FIFO read, vecRegLow, vecRegHigh,
cscVec, and clk50 plotted against time across states A, B, C, D, E, A]

State A : x = ( FIFO bytes ≥ 16)
          FIFO read ⇐ x
State B : FIFO read ⇐ x
State C : FIFO read ⇐ 0
          vecRegLow ⇐ FIFO data (64-bit)
State D : vecRegHigh ⇐ FIFO data (64-bit)
State E : if (x) cscVec ⇐ {vecRegHigh, vecRegLow}
          else   cscVec ⇐ empty vector
Figure 7.5: Input Vector Timing Diagram with HDL Pseudocode

After performance analysis and many data path improvements, an effective throughput
of around 1.2 Gbps was achieved for full-sized test cases. Most latency was observed in the
interface components external to the CSC; these may not apply when integrated into a
larger, real-world hardware system. One bottleneck that has not yet been mitigated is the
disk drive latency of reading large (gigabyte range) test files. Supplementary tests showed
that “dummy” vectors generated in software (rather than large, well-defined test cases)
could be transmitted nearly 5 times faster. Theoretically this speedup could reduce image
processing time from 3.5 seconds to about 0.7 seconds, the original calculated duration.
Upgrading the test workstation (perhaps with a solid-state drive and/or more RAM) and the
FPGA design (growing the vector buffer size) might result in processing times of less than
one second. Nonetheless, the important aspects of using dynamic partial reconfiguration
were successfully demonstrated on the current platform.
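The processing-time figures quoted above can be cross-checked with simple arithmetic. The sketch below assumes one pixel is processed per 50 MHz clock cycle and uses the 33 Mpixel full-page image size mentioned in the conclusions; the function names are hypothetical.

```python
def ideal_seconds(pixels: float, clock_hz: float, pixels_per_cycle: int = 1) -> float:
    """Time to stream an image through the engine with no stalls."""
    return pixels / (clock_hz * pixels_per_cycle)


def scaled_seconds(measured_s: float, speedup: float) -> float:
    """Projected time if the measured duration improves by the given factor."""
    return measured_s / speedup


# Full-page image at one pixel per 50 MHz cycle.
ideal = ideal_seconds(33e6, 50e6)

# Measured 3.5 s improved by the roughly 5x seen with in-software dummy vectors.
projected = scaled_seconds(3.5, 5.0)
```

Both values land close to 0.7 seconds, which suggests that the remaining gap is dominated by the file-read path rather than by the CSC hardware itself.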

7.3 Future Work

Several improvements and enhancements to both the overlapping-DPR subsystem and the
specific Virtex-6 implementation are proposed here for potential future research. These can
be grouped into three main categories.
First, some improvements can be made to the software and hardware without changing
the CSC’s interface. Many were mentioned in Section 7.2. Throughput may be increased
by removing the off-chip DDR3 buffering stage and adding custom logic to receive input
vectors directly. (An on-chip buffer would be useful to ensure that no packets of data are
lost.) Along with this, the 64-bit FIFO interface could be removed from the PCIe link.
Direct access to 128 bits at a time would provide full CSC vectors without the need for
a state machine to read and combine half-vectors. The main clock frequency, lowered to
50 MHz in the previous Virtex-II research, could be increased closer to its original 167 MHz
by inserting additional pipeline stages at critical combinational nodes, or possibly improved
by manual placing and routing of critical components.
If the hardware throughput is increased, the DMA driver could be configured to send
larger bursts of CSC vector packets. Further optimizations may be possible deep within
Xilinx’s driver code, since we have specific knowledge of the size and format of outgoing
and incoming data. File read latency could be reduced by upgrading parts of the testing
PC. Partial bitstreams could be made smaller by using a more advanced compression algorithm than bitgen’s method; however, this would require more complex decompression
in hardware and longer decompression times. CLUT RAM values could be directly embedded into PR bitstreams, so that the subsequent step of loading values on the Reg Bus is
not needed, but this would require more sophisticated bitstream generation (and speedup
would remain negligible when processing very large images).

[Block diagram: Input Buses 1 through NI feed an Input Switching block, which feeds
Reconfigurable Pipes 1 through NP; an Output Switching block connects the pipes to
Output Buses 1 through NO]
Figure 7.6: Theoretical Multi-Pipe Engine
Second, some engine enhancements are possible if the constraint of keeping the CSC’s
original I/O interface is abandoned. Scaling the Pixel Bus and Reg Bus widths by a factor
of 2 or more would relax the pixel-streaming bottleneck which limited most of our test
cases, or it could be used to feed additional pipelines simultaneously. (The Virtex-6 could
accommodate four or more total pipelines.) An example of a generalized multi-pipe system is pictured in Figure 7.6, with NP processing pipes, NI input buses, and NO output
buses. These variables would not necessarily need to be equal if proper I/O multiplexing is
used. Of course, like the dual-pipe version, the layout costs of additional pipes may not be
practical, and the I/O throughput would have to grow accordingly. A new interface would
also break compatibility with existing real-world applications.
In its current state, the CSC engine expects a full 128-bit input vector at every clock
edge. Most of these bits (especially control signals) do not change often and are redundant
in typical usage. For example, when processing an image, the (64) Pixel Bus bits may
change in every cycle, but the rest are usually constant. Therefore the input stream is
well-suited for compression or run-length encoding. By only transmitting data which is
changing, the available bandwidth is more efficiently utilized and higher throughput can be
achieved. This method would require an additional hardware layer to receive this stream
and expand it properly for feeding the CSC. Software enhancements would also be required
to generate input streams in a new format.

[Block diagram: a Wide Multi-Purpose Input Bus and a Wide Multi-Purpose Output Bus
connect through I/O Switching to a pool of Reconfigurable Partitions and Datapaths]
Figure 7.7: Theoretical PR-Pool Engine
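The input-stream compression idea can be sketched in software as a simple change-only encoding. This is an illustration under stated assumptions, not a proposed format: the vector layout (Pixel Bus in the low 64 bits, all remaining signals in the high 64 bits), the one-bit record flag, and the function names are all hypothetical.

```python
from typing import List, Optional, Tuple

PIXEL_BITS = 64     # Pixel Bus width (from the text)
CONTROL_BITS = 64   # remaining bits of the 128-bit vector
PIXEL_MASK = (1 << PIXEL_BITS) - 1

Record = Tuple[Optional[int], int]  # (control bits or None if unchanged, pixel bits)


def compress(vectors: List[int]) -> List[Record]:
    """Send the control bits only when they differ from the previous vector."""
    records: List[Record] = []
    last_control: Optional[int] = None
    for vec in vectors:
        control = vec >> PIXEL_BITS
        pixel = vec & PIXEL_MASK
        if control != last_control:
            records.append((control, pixel))
            last_control = control
        else:
            records.append((None, pixel))
    return records


def expand(records: List[Record]) -> List[int]:
    """Rebuild full 128-bit vectors, as the extra hardware layer would."""
    vectors: List[int] = []
    control: Optional[int] = None
    for new_control, pixel in records:
        if new_control is not None:
            control = new_control
        if control is None:
            raise ValueError("stream must begin with a control record")
        vectors.append((control << PIXEL_BITS) | pixel)
    return vectors


def encoded_bits(records: List[Record]) -> int:
    """Stream size: 1 flag bit + pixel bits, plus control bits when present."""
    return sum(1 + PIXEL_BITS + (CONTROL_BITS if c is not None else 0)
               for c, _ in records)
```

For a long image where the control bits rarely change, the encoded stream approaches 65 bits per vector instead of 128, which is where the potential bandwidth saving would come from. A true run-length scheme could go further when pixel values also repeat.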
Finally, a more ambitious research endeavor might abandon the strict Pre-3D/4D-Post
structure of the pipeline. A generalized DPR system might contain a pool of interconnected PR partitions, composing a reconfigurable, dynamically-defined data path. Figure 7.7 shows an example of this. The functionality of each partition could be programmed
independently (perhaps from a collection of bitstreams stored off-chip) and the data buses
may no longer represent strictly image data. Of course, increased flexibility usually comes
with an additional overhead cost, but the principles of DPR and intelligent task overlapping
demonstrated throughout this research would remain relevant and valuable.


Chapter 8
Conclusions
The presented methodology has demonstrated several advantages of introducing dynamic
partial reconfiguration (DPR) into an existing system design. The immediate advantages
are reduction in layout area (and possibly power consumption) due to the merging of multiple functional blocks into a single reconfigurable partition, and a drastic increase in functional capability and diversity (limited by the number of partial bitstreams the designer
chooses to implement). The availability of a user-controlled reconfiguration interface (here,
the Virtex-6 ICAP block) allows for DPR to occur in parallel alongside normal processing.
The experimental color space conversion (CSC) engine was able to benefit from overlapping of configuration and image data. Speedup was achieved without widening the existing
input buses, although it was typically lower than the theoretical maximum of 2.0. (The
engine was modified with the constraint of not changing the top-level interface; this would
theoretically allow it to directly replace ASICs in real-world products.)
Observed disadvantages of DPR are increased complexity in both design implementation and hardware control logic, and the initial requirement of a high-end FPGA chip. The
amount of processing speedup achieved is highly dependent on the nature of the data being
processed. Speedup of the CSC engine was limited when processing full-page (33 Mpixel)
images; overlapping becomes insignificant when one task is orders of magnitude longer
than the other. CSC speedup was much greater when the image lengths were comparable
to the configuration lengths.
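The dependence of speedup on the relative task lengths can be seen in a simplified two-task model. The sketch below is an idealization assumed here for illustration, not the measurement method used in this research: without overlap, configuration and image processing run back-to-back; with perfect overlap, the time per job is set by the longer of the two. The function name is hypothetical.

```python
def overlap_speedup(config_time: float, image_time: float) -> float:
    """Ideal speedup from fully overlapping configuration with image processing."""
    longest = max(config_time, image_time)
    if longest <= 0:
        raise ValueError("at least one task must take a positive amount of time")
    serial = config_time + image_time  # tasks executed one after the other
    return serial / longest            # overlapped time is the longer task


# Comparable task lengths approach the theoretical maximum of 2.0.
equal_case = overlap_speedup(1.0, 1.0)

# An image phase 1000x longer than configuration gains almost nothing.
lopsided_case = overlap_speedup(1.0, 1000.0)
```

In this model the speedup is exactly 2.0 only when both tasks take equal time, and it tends toward 1.0 as one task dominates, matching the behavior observed for full-page images.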
A complete test platform consisting of a PC with custom software and a DPR-capable
FPGA (linked via PCI Express bus) was used to demonstrate the functionality of dynamic
partial reconfiguration and measure real-time performance. Even after various optimizations,
processing throughput did not meet the desired goal of one second per full-page image
(including any associated configuration data). Although the CSC hardware was capable of
processing at the target frequency of 50 MHz, notable latencies were introduced by the test
PC and the adapted PCIe subsystem. Suggestions for future improvements were proposed,
as well as some generalized DPR-based processing systems. A pool of reconfigurable,
general-purpose partitions could be used to perform various operations on any data, in
a flexible order. This system may introduce its own trade-offs — complexity, area, and
throughput requirements — which would need to be thoroughly analyzed.
Overall, dynamic partial reconfiguration was successfully integrated into an existing
system, with no modification to the system’s external interface, providing greater functional
diversity and moderate speedup in some cases. Many of the specific and general DPR
principles demonstrated in this research can be applied to future FPGA-based designs.


References
[1] P. Alfke. The future of field-programmable gate arrays. 1999.
[2] R.S. Berns, editor. Principles of Color Technology. John Wiley & Sons Inc., 2000.
[3] L. Braun, D. Göhringer, T. Perschke, V. Schatz, M. Hübner, and J. Becker. Adaptive real-time image processing exploiting two dimensional reconfigurable architecture. Journal of Real-Time Image Processing, 4:109–125, 2009. doi:10.1007/s11554-008-0095-8.
[4] C. Claus, B. Zhang, W. Stechele, L. Braun, M. Hübner, and J. Becker. A multi-platform controller allowing for maximum dynamic partial reconfiguration throughput. In Field Programmable Logic and Applications, 2008. FPL 2008. International
Conference on, pages 535–538, September 2008.
[5] R.J. Fong, S.J. Harper, and P.M. Athanas. A versatile framework for FPGA field
updates: an application of partial self-reconfiguration. In Rapid Systems Prototyping,
2003. Proceedings. 14th IEEE International Workshop on, pages 117 – 123, June
2003.
[6] J. Galindo. A novel partial reconfiguration methodology for FPGAs of multichip
systems. Master’s thesis, Dept. Computer Engineering, Rochester Institute of Technology, October 2008.
[7] J. Galindo, E. Peskin, B. Larson, and G. Roylance. Leveraging firmware in multichip
systems to maximize FPGA resources: An application of self-partial reconfiguration.
In RECONFIG’08, pages 139–144, 2008.
[8] P. Green and L. MacDonald, editors. Colour Engineering. John Wiley & Sons Ltd.,
2002.
[9] E.L. Horta, J.W. Lockwood, D.E. Taylor, and D. Parlour. Dynamic hardware plugins in an FPGA with partial run-time reconfiguration. In Proceedings of the 39th
annual Design Automation Conference, DAC ’02, pages 343–348, New York, NY,
USA, 2002. ACM.
[10] J. Huang and J. Lee. A self-reconfigurable platform for scalable DCT computation
using compressed partial bitstreams and blockRAM prefetching. Circuits and Systems
for Video Technology, IEEE Transactions on, 19(11):1623 –1632, November 2009.
[11] I. Kuon and J. Rose. Measuring the gap between FPGAs and ASICs. Computer-Aided
Design of Integrated Circuits and Systems, IEEE Transactions on, 26(2):203–215,
February 2007.
[12] K. Parnell. Could microprocessor obsolescence be history? XCell Journal, 45, 2003.
[13] C. Patterson, P. Athanas, M. Shelburne, J. Bowen, J. Surís, T. Dunham, and J. Rice.
Slotless module-based reconfiguration of embedded FPGAs. ACM Trans. Embed.
Comput. Syst., 9:6:1–6:26, October 2009.
[14] F. Say and C. F. Bazlamaci. A reconfigurable computing platform for real time embedded applications. Microprocessors and Microsystems, 2011.
[15] Xilinx, Inc. Virtex-6 FPGA Configuration User Guide. Application Note [Online].
Available: http://www.xilinx.com/support/documentation/user_guides/ug360.pdf,
November 2010.
[16] Xilinx, Inc. Virtex-6 FPGA Connectivity Targeted Reference Design. User Guide,
October 2010.
[17] Xilinx, Inc. Module-Based Partial Reconfiguration. Application Note [Online].
Available: http://www.xilinx.com/itp/xilinx7/books/data/docs/dev/dev0038_8.html,
September 2011.
[18] Xilinx, Inc. Virtex-6 FPGA Connectivity Kit Reference Design and Documentation.
Application Note [Online]. Available:
http://www.xilinx.com/products/boards/v6conn/reference_designs.htm,
September 2011.
[19] Xilinx, Inc. Virtex-6 FPGA ML605 Board - PCI Express link will not
train on boards using ES silicon. Answer Record #34009 [Online]. Available:
http://www.xilinx.com/support/answers/34009.htm, September 2011.
[20] Xilinx, Inc. Virtex-6 FPGA ML605 Evaluation Kit. Application Note [Online].
Available: http://www.xilinx.com/products/boards-and-kits/EK-V6-ML605-G.htm,
September 2011.
[21] P.S. Zuchowski, C.B. Reynolds, R.J. Grupp, S.G. Davis, B. Cremen, and B. Troxel.
A hybrid ASIC and FPGA architecture. In Proceedings of the 2002 IEEE/ACM international conference on Computer-aided design, ICCAD ’02, pages 187–194, New
York, NY, USA, 2002. ACM.


Appendix A
Hardware Setup Details
The FPGA prototype board described in Section 4.1 is specified as follows:
◦ Target Board: Xilinx ML605 Evaluation Board
◦ Silicon Type: Revision 1 (CES)
◦ FPGA Family: Virtex-6 LXT
◦ Device: xc6vlx240t
◦ Package: ff1156
◦ Speed Grade: -1
◦ System Clock Frequency: 50 MHz (generated)
◦ Debug Interface: JTAG over USB
◦ On-Chip Memory: 3.5 Mb (Distributed RAM), 14.6 Mb (Block RAM)
◦ Off-Chip Memory: 512 MB DDR3 SDRAM
The PC used for development and implementation consisted of:
◦ Operating System: Microsoft Windows 7 (32-bit, SP1)
◦ Processor: Intel Core 2 Duo at 2.66 GHz
◦ RAM: 3.00 GB
The PC used for testing and execution consisted of:
◦ Operating System: Fedora 10 (Linux Kernel 2.6.27.5)
◦ Processor: Intel Core 2 Duo at 2.40 GHz
◦ RAM: 2.00 GB


Appendix B
Software Setup Details
Software versions used on the development PC include:
◦ Xilinx ISE Design Suite: 13.1 System Edition (with April 2011 patch)
    · ISE Project Navigator
    · PlanAhead (with PR license)
    · Included Tools: xst, map, par, bitgen
◦ MathWorks MATLAB: R2010b
◦ Tiny C Compiler: 0.9.25
Software versions used on the testing PC include:
◦ ML605 Connectivity Kit Reference Design: 12.4, CES Silicon
◦ GNU C Compiler: 4.3.2


Appendix C
ICAP Control Module
The system component responsible for the detection and proper routing of partial reconfiguration data (PR bitstreams) is presented below in Verilog HDL.
Its functionality is described in detail in Section 5.2. The Virtex-6 ICAP block (the
interface which allows for FPGA self-reconfiguration) is instantiated within this module.
This is an updated and simplified version of the ICAP control module used in previous
research by J. Galindo on the Virtex-II Pro platform.


/*
 * icap_eai.v
 *
 * Monitors the Reg Bus, detects PR bitstream data,
 * and routes it to the V6 ICAP block appropriately.
 *
 * 10/13/2011 - Simplified / FSM removed, by RMT
 *  7/26/2011 - Adapted to Virtex-6, by RMT
 *  6/01/2008 - Original ICAP controller, by JMG
 *
 */

// Include CSC definitions, including PR Reg Bus address
`include "csc_defs.vh"
// Bit-swap parameter for ICAP data words
`define ICAPBITSWAP

module icap_eai
(
    // INPUTS
    Clk,            // 50 MHz clock input
    nReset,         // CSC reset, active-low
    regAddr,        // Reg Bus address (18-bit)
    regAddrValid,   // Reg Bus valid flag
    regWrite,       // Reg Bus write flag
    regData,        // Reg Bus data word (32-bit)
    // OUTPUTS
    prWord,         // PR word, for debugging (32-bit)
    prEnable,       // PR enabled (in-progress) flag
    prDone          // PR done flag
);

    // Define I/O port widths
    input           Clk;
    input           nReset;
    input  [17:0]   regAddr;
    input           regAddrValid;
    input           regWrite;
    input  [31:0]   regData;
    output [31:0]   prWord;
    output          prEnable;
    output reg      prDone;

    // Assert prEnable when CSC_PR_REG is detected,
    // and the writing flags are valid and enabled.
    assign prEnable = ((regAddr == CSC_PR_REG)
                    && (regAddrValid)
                    && (regWrite)
                    && (nReset));

    // Inverted prEnable signal, for ICAP
    wire prCSB;
    assign prCSB = !prEnable;

    // Assert prDone when a PR sequence has ended (registered)
    reg prEnableDelay = 1'b0;
    always @(posedge Clk) begin
        if (!nReset) begin
            prEnableDelay <= 1'b0;
            prDone        <= 1'b0;
        end else begin
            prEnableDelay <= prEnable;
            prDone        <= (prEnableDelay && (!prEnable));
        end
    end

    // Define an intermediate net "prData" between
    // Reg Bus input and ICAP block. Bit-swapping
    // within each byte can optionally be enabled
    // by the compile-time ICAPBITSWAP define.
    //
    // prData <- [optional bit-swap] <- regData
    //
    wire [31:0] prData;

    // Output PR word (post bit-swap) for debugging
    assign prWord = prData;

    // Connect to Virtex-6 ICAP block
    ICAP_VIRTEX6 #(
        .ICAP_WIDTH("X32"),         // Select 32-bit data width
        .DEVICE_ID(32'h04250093)    // For xc6vlx240t model
    )
    ICAP_VIRTEX6_INST (
        .BUSY  (      ),    // Busy signal, not used
        .O     (      ),    // Output data, not used
        .CSB   (prCSB ),    // Chip Select (active-low)
        .CLK   (Clk   ),    // 50 MHz clock
        .I     (prData),    // 32-bit PR data word
        .RDWRB (1'b0  )     // Read/write control (wired to write)
    );

    // Connect PR data to ICAP with optional (intra-byte) bit-swapping
`ifdef ICAPBITSWAP
    // Bit-swap highest byte
    assign prData[31] = regData[24];
    assign prData[30] = regData[25];
    assign prData[29] = regData[26];
    assign prData[28] = regData[27];
    assign prData[27] = regData[28];
    assign prData[26] = regData[29];
    assign prData[25] = regData[30];
    assign prData[24] = regData[31];

    // Bit-swap second byte
    assign prData[23] = regData[16];
    assign prData[22] = regData[17];
    assign prData[21] = regData[18];
    assign prData[20] = regData[19];
    assign prData[19] = regData[20];
    assign prData[18] = regData[21];
    assign prData[17] = regData[22];
    assign prData[16] = regData[23];

    // Bit-swap third byte
    assign prData[15] = regData[ 8];
    assign prData[14] = regData[ 9];
    assign prData[13] = regData[10];
    assign prData[12] = regData[11];
    assign prData[11] = regData[12];
    assign prData[10] = regData[13];
    assign prData[ 9] = regData[14];
    assign prData[ 8] = regData[15];

    // Bit-swap lowest byte
    assign prData[ 7] = regData[ 0];
    assign prData[ 6] = regData[ 1];
    assign prData[ 5] = regData[ 2];
    assign prData[ 4] = regData[ 3];
    assign prData[ 3] = regData[ 4];
    assign prData[ 2] = regData[ 5];
    assign prData[ 1] = regData[ 6];
    assign prData[ 0] = regData[ 7];
`else
    // Direct connect, no bit-swapping
    assign prData = regData;
`endif

endmodule // icap_eai
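The intra-byte bit-swap performed by the module above can be mirrored in software, which may be useful when preparing or checking bitstream words on the PC side. The function below is a minimal sketch with a hypothetical name; it reverses the bit order within each byte of a 32-bit word while leaving the byte order unchanged, matching the `assign` statements under `ICAPBITSWAP`.

```python
def bitswap_bytes(word: int) -> int:
    """Reverse the bit order within each byte of a 32-bit word."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("word must fit in 32 bits")
    result = 0
    for byte_index in range(4):
        byte = (word >> (8 * byte_index)) & 0xFF
        # Reverse the 8 bits of this byte (bit 0 <-> bit 7, bit 1 <-> bit 6, ...).
        reversed_byte = int(f"{byte:08b}"[::-1], 2)
        result |= reversed_byte << (8 * byte_index)
    return result
```

Because the operation is its own inverse, applying it twice returns the original word, so the same routine can be used to check either direction of the conversion.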

