Behavior synthesis for high speed 3D color interpolation using VHDL by Glanville, Thomas, Jr
Rochester Institute of Technology 
RIT Scholar Works 
Theses 
8-1-1999 
Behavior synthesis for high speed 3D color interpolation using VHDL
Thomas Glanville Jr 
Recommended Citation 
Glanville, Thomas Jr, "Behavior synthesis for high speed 3D color interpolation using VHDL" (1999). 
Thesis. Rochester Institute of Technology. Accessed from 
BEHAVIORAL SYNTHESIS FOR HIGH SPEED 3D
COLOR INTERPOLATION USING VHDL
By
Thomas W. Glanville, Jr.
A Thesis Submitted in Partial Fulfillment









Department of Computer Engineering
College of Engineering




Rochester Institute of Technology
Behavioral Synthesis for High Speed 3D Color
Interpolation using VHDL
I, Thomas W. Glanville, Jr., hereby grant permission to any individual or
organization to reproduce this thesis in whole or in part for non-commercial and
non-profit purposes only.
Thomas W. Glanville, Jr.
Date
ABSTRACT
The purpose of this thesis is to study the methodology of behavioral synthesis and
evaluate its usefulness compared to Register Transfer Level (RTL) synthesis.
Custom IC design relies on powerful synthesis tools. Engineers have traditionally
used RTL descriptions of their circuits as input to these tools. As Behavioral
Synthesis tools become more powerful, describing circuitry at a higher, more
abstract level is becoming a feasible option. Describing circuitry at a higher level
has many advantages: architecture changes are easier to make, and higher-level
descriptions generally need significantly fewer lines of code and shorter
development times.
To study behavioral synthesis, a tri-linear interpolation algorithm is used. The
algorithm is written in one RTL style and two different behavioral styles, and each
is compared for area, power consumption, synthesis time, code length and
throughput. The design is simulated in VHDL before and after synthesis to verify its
accuracy. Behavioral Compiler from Synopsys is used to synthesize the design from
VHDL to the gate level. It was found that behavioral synthesis can produce results
nearly as good as an RTL-described circuit; for this implementation, the behavioral
synthesis results were generally 20% - 30% worse.
TABLE OF CONTENTS
1 INTRODUCTION 1
1.1 Challenges using HLS Tools 1
1.2 Design Objectives 4
1.3 Quality Measures 5
2 ALGORITHM STUDIED 6
2.1 Linear Color Conversion 6
2.2 Nonlinear Color Conversion 7
2.3 Empirically Determined Color Space Conversion 8
2.4 Algorithm Used for this Study 9
3 BEHAVIORAL SYNTHESIS 16




3.5 Coding Styles and Restrictions 26
4 DESIGN PROCEDURE 28
4.1 Baseline Algorithm Design using RTL 28
4.2 High Level Synthesis Method 1 31
4.3 High Level Synthesis Method 2 33
5 TESTING AND SIMULATION 36
5.1 C++ Model 36
5.2 VHDL Code Testing 37
5.3 Gate Level Verification 38
6 SYNTHESIS 41
6.1 Script common to all Methodologies 41
6.2 RTL Script 41
6.3 HLS Method 1 43
6.4 HLS Method Number 2 (Behavioral Retiming) 45
7 RESULTS 47
8 CONCLUSIONS 51
8.1 Justification for High Level Synthesis 51
8.2 Concerns using High Level Synthesis 52
8.3 Recommendations 53
9 REFERENCES 54
10 APPENDIX A SYNTHESIS SCRIPTS 56
10.1 Technology.scr 56
10.2 RTL Compilation Script 56
10.3 HLS Method 1 Scripts 57
10.4 HLS Method 2 (Behavioral Retiming) 62
11 APPENDIX B VHDL SOURCE CODE 64
11.1 Testbench Source Code 64
11.2 Algorithm Code 79
12 ACKNOWLEDGMENTS 89
LIST OF FIGURES
Figure 1: Shared Multiplier Architecture with Latency of Three 2
Figure 2: Two or Three Multiplier Architecture with Latency of Two 2
Figure 3: Three Multipliers with no Latency and slower Clock Speed 3
Figure 4: One Dimensional Interpolation 9
Figure 5: Single Interpolation Cube 11
Figure 6: Equal Sampling Color 11
Figure 7: Plane Address Pseudo code 12
Figure 8: Memory Configuration 13
Figure 9: Memory Addressing Pseudo code 14
Figure 10: Design and Test Bench Architecture 15
Figure 11: Two Different Architectures for the Same Design 17
Figure 12: A Generic CDFG 19
Figure 13: ASAP Scheduling 22
Figure 14: ALAP Scheduling 23
Figure 15: Original HDL timing and results of three I/O modes 24
Figure 16: Manual Pipeline of TLI Algorithm 30
Figure 17: HLS Coding of TLI Algorithm 32
Figure 18: Behavioral Retiming Code Fragment 34




Application Specific Integrated Circuit
Behavioral Synthesis:
A technique for describing circuit functionality at a high level. Generally,
the architecture cannot be inferred from the description.
Color Space Conversion: page 12
There are many ways to describe color. Red, Green, Blue is one way. Hue,




Cycle Fixed: page 32
A mode used for scheduling an algorithm using behavioral synthesis
FPGA: page 36
Field Programmable Gate Array
Free Floating: page 32







A method used to describe color using hue, saturation and intensity
RGB: page 12
A method used to describe color using the colors red, green and blue.
RTL: page 9
Register Transfer Level. A style of coding an HDL that is synthesizable.
Superstate Fixed: page 23
A mode used for scheduling an algorithm using behavioral synthesis
VHDL: page 8
Very High Speed Integrated Circuit Hardware Description Language.
VLSI: page 13
Very Large Scale Integrated Circuit
YCC: page 12
A method used to describe color using intensity and two chroma values
1 INTRODUCTION
Today, increased competition in the consumer electronics field is driving
ever-shorter product life cycles and lower costs. Design teams must therefore
reduce development cycle times, increase complexity (to add new features) and
reduce costs as profit margins decrease. High Level Synthesis (HLS) tools promise
to drastically reduce the effort needed to design circuits. With these tools, a
custom IC designer can either develop a given circuit in less time or design much
more complex circuitry in the same amount of time.
1.1 Challenges using HLS Tools
Most engineers are not experienced with these new HLS tools. They must choose
between relatively new and untried tools and the traditional design tools with which
they are familiar. Adopting HLS means learning new techniques and commands and
changing their coding style.
1.1.1 Changes in Design Coding Style
In traditional register transfer level (RTL) descriptions of circuitry, a design
engineer is very explicit about the architecture of the design. In high level
synthesis, the engineer writes one simple description of the algorithm with no
particular architecture in mind. To accomplish this, the engineer must use a style
of HDL coding that is different from RTL style. In HLS, the engineer determines the
architecture of a design through constraints and commands given to the synthesis
tool. These constraints can include clock speed, desired throughput, and the number
of cycles in a pipeline. For example, consider the equation:
Y = A*B*C*D
There are a number of different ways to design this. If area is the primary
concern, a single multiplier can be shared over three cycles, as shown in Figure 1.
However, if speed is the primary concern, architecture 2 or 3 could be chosen (see
Figure 2 and Figure 3). Architecture 2 has the same clock speed as architecture 1,
a larger area and a smaller latency. Architecture 3 has a longer clock period and a
larger area, and may complete faster with zero latency since






















































Figure 3: Three Multipliers with no Latency and slower Clock Speed
In this example equation, it is not difficult to change between architectures.
However, in a much larger equation, such as the tri-linear interpolation used in
this thesis, changing the code to a different architecture can be very time
consuming. The engineer must ensure that the change introduced no bugs, get the
timing correct in each cycle and handle the many other issues involved in custom IC
design. With an HLS tool there is only one textual description, written at a high
level, and it is the constraint scripts that drive each new architecture. Because of
this, the source code is much smaller, fewer bugs are introduced and different
design architectures can be explored much more quickly.
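The effect of the multiplier count on Y = A*B*C*D can be illustrated with a small model. The sketch below is hypothetical (it is not how Behavioral Compiler schedules): it assumes each cycle can start at most `n_mult` two-input multiplications and that every product is registered, so it can be used only in the next cycle (no chaining).

```python
def schedule_product(values, n_mult):
    """Greedy schedule for multiplying all `values` together.

    Hypothetical model: at most `n_mult` two-input multiplications per cycle,
    and each product becomes usable only in the following cycle.
    Returns the final product and the list of multiplications per cycle.
    """
    pending = list(values)
    cycles = []
    while len(pending) > 1:
        ops, results = [], []
        # Pair up available operands until the multipliers run out.
        while len(pending) >= 2 and len(ops) < n_mult:
            a, b = pending.pop(0), pending.pop(0)
            ops.append((a, b))
            results.append(a * b)
        # Registered results join the operands not consumed this cycle.
        pending = results + pending
        cycles.append(ops)
    return pending[0], cycles


A, B, C, D = 2, 3, 4, 5
for n in (1, 2, 3):
    y, cycles = schedule_product([A, B, C, D], n)
    print(f"{n} multiplier(s): Y = {y} in {len(cycles)} cycle(s)")
```

With one multiplier the product takes three cycles, and with two it takes two cycles. A third multiplier gives no further gain in this model, because the last multiplication depends on the first two; removing that cycle requires chaining the multipliers within one longer clock period, as in architecture 3.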
1.1.2 New Commands and Techniques
Because the architecture of a design is not stated explicitly in HLS, as it is in
RTL synthesis, new techniques and commands must be learned. It may not be known at
the beginning how many cycles a pipeline will have or what latency a computational
block will have. Therefore, handshaking signals must be built into the design so
that data flows correctly. If one block provides data to a computational block whose
computation takes an unknown number of cycles, the feeder block must wait until the
computational block is ready. The same is true for the test benches used to validate
the design. New commands must also be learned to drive constraints such as the
maximum number of cycles, the latency of a pipeline and which operators are to be
re-used. This can be challenging, since the engineer is learning a new coding style
at the same time.
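The handshaking requirement can be sketched as a cycle-level model in which the feeder never counts cycles and only watches the block's `ready` and `valid` signals. The class and function names are hypothetical, and the squaring stands in for the real computation; this is an illustration of the idea, not the thesis's interface.

```python
class VariableLatencyBlock:
    """Hypothetical computational block whose latency is hidden from its neighbours.

    `ready` is high when it can accept a new operand; `valid` is high when a
    result is waiting to be read. Squaring is a placeholder computation.
    """

    def __init__(self, latency):
        assert latency >= 1
        self.latency = latency
        self.busy = 0          # cycles left in the current computation
        self.valid = False
        self.result = None
        self._next = None

    @property
    def ready(self):
        return self.busy == 0 and not self.valid

    def start(self, x):
        assert self.ready, "feeder must wait for ready"
        self.busy = self.latency
        self._next = x * x

    def take(self):
        assert self.valid, "consumer must wait for valid"
        self.valid = False
        return self.result

    def clock(self):
        """Advance one clock cycle."""
        if self.busy:
            self.busy -= 1
            if self.busy == 0:
                self.result, self.valid = self._next, True


def run(inputs, latency):
    """Feed `inputs` through the block using only the handshake signals."""
    block = VariableLatencyBlock(latency)
    queue, outputs, cycles = list(inputs), [], 0
    while queue or not block.ready:
        if block.valid:
            outputs.append(block.take())
        if queue and block.ready:
            block.start(queue.pop(0))
        block.clock()
        cycles += 1
    return outputs, cycles


for lat in (1, 3):
    print(lat, run([1, 2, 3], lat))
```

The same feeder loop produces the same outputs whatever the block's latency; only the total cycle count changes.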
1.1.3 Algorithm used in this Study
In this study of behavioral synthesis, an algorithm for tri-linear interpolation was
chosen because it is computationally intensive and relevant to many designs in the
industry today. For example, when an image is displayed on a CRT, a red, green and
blue color description is used. When that image is printed, it is converted to a
cyan, magenta, yellow and black color description. The same image, when stored
electronically on magnetic or optical media, may be stored in luma, chroma1 and
chroma2 (YCC) format [1]. This algorithm is particularly useful for nonlinear
conversions such as RGB to YCC, LAB, or YIQ. These nonlinear equations cannot be
expressed in a simple form such as R = K1*L + K2*A + K3*B. In fact, RGB values cannot
cover all the colors in the visible spectrum, which is why other coordinate systems
are often preferred.
1.2 Design Objectives
The final circuit must fulfill the following requirements.
Verification
The design will be verified both before and after synthesis using an HDL
simulator.
Single Phase Clock Logic
Behavioral Compiler from Synopsys only permits single-phase clock logic.
Throughput of one
To meet high-speed requirements, the throughput must approach one result
per clock cycle.
The algorithm will operate at 20 MHz
This speed was chosen based upon a survey of previous designs done in
industry.
The design will have a synchronous reset
Some HLS synthesis commands only support a synchronous reset.
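The throughput and clock-rate requirements combine into a simple conversion rate: pixels per second equal the clock frequency times the results per cycle. The sketch below applies this to a hypothetical 1024 x 768 image (the image size is an illustrative assumption, not a requirement from the thesis).

```python
def frame_time_s(width, height, f_clk_hz, results_per_cycle):
    """Seconds needed to convert one image at the given clock and throughput."""
    pixels_per_second = f_clk_hz * results_per_cycle
    return width * height / pixels_per_second


# Hypothetical 1024 x 768 image at the 20 MHz target clock.
print(frame_time_s(1024, 768, 20e6, 1.0))  # one result every cycle
print(frame_time_s(1024, 768, 20e6, 0.5))  # one result every other cycle
```

Halving the throughput doubles the conversion time, which is why a throughput approaching one is required.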
1.3 Quality Measures
Since this thesis attempts to measure the quality of the different synthesis
techniques quantitatively, certain metrics must be chosen. In VLSI design, the three
major measures used to support design decisions are size, performance and power
consumption [2, pp. 63-64]. Size is important because it directly correlates with
die yield. Performance is determined by the propagation delay through the design
and the execution time. Power measures represent the power dissipated in the design,
which can be very important in mobile devices. These three measures are used to
determine the quality of each synthesis technique applied to the tri-linear
interpolation algorithm.
2 ALGORITHM STUDIED
Color space conversion can be done using many different algorithms. Some use less
silicon area and take more time, while others consume large amounts of silicon but
execute very quickly. Depending on the problem at hand, different color space
conversion equations can be employed. Some color space conversions are linear and
can easily be computed using Equation 1. Others are nonlinear and very complicated.
The third kind of conversion problem uses data that is determined empirically;
calibrating a printer so that its output colors match those displayed on a monitor
is an example.
2.1 Linear Color Conversion
For a linear color space conversion, a simple equation can be used. This is the
case for RGB to HSV (Hue, Saturation and Value, or brightness). HSV is a cylindrical
coordinate system: the vertical coordinate defines brightness, the angular position
defines hue and the radial position defines saturation. The RGB to HSV conversion
is shown below.
Equation 1: Linear Color Space Equation [5] [3]
$$H = c_{11}R + c_{12}G + c_{13}B$$
$$S = c_{21}R + c_{22}G + c_{23}B$$
$$V = c_{31}R + c_{32}G + c_{33}B$$
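Equation 1 is a 3x3 matrix applied to each pixel. A minimal sketch, with the coefficient matrix supplied by the caller (the example matrices are hypothetical placeholders, not published conversion coefficients):

```python
def linear_convert(rgb, c):
    """Apply Equation 1: each output channel is a weighted sum of R, G and B.

    `c` is the 3x3 coefficient matrix [[c11, c12, c13], [c21, ...], ...].
    """
    r, g, b = rgb
    return tuple(row[0] * r + row[1] * g + row[2] * b for row in c)


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
swap_rb = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]  # hypothetical matrix
print(linear_convert((10, 20, 30), identity))
print(linear_convert((10, 20, 30), swap_rb))
```

Each pixel costs nine multiplications and six additions, which is why linear conversions are straightforward to build directly in hardware.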
2.2 Nonlinear Color Conversion
Nonlinear color space conversions can be difficult to perform digitally. RGB to YCC
conversion, for example, involves exponential calculations and many multiplications
and additions, which would be difficult to implement in silicon. The RGB to YCC
conversion is performed in three steps:
1. A nonlinear transformation is applied to the image signals.
2. The resulting values are converted to one luma and two chroma components.
3. These three components are converted to 8-bit data for storage.
The equations for this conversion process are [1, pg. 1]:
For R, G, B > 0.018:
$$R' = 1.099\,R^{0.45} - 0.099,\qquad G' = 1.099\,G^{0.45} - 0.099,\qquad B' = 1.099\,B^{0.45} - 0.099$$
For R, G, B < -0.018:
$$R' = -1.099\,|R|^{0.45} + 0.099,\qquad G' = -1.099\,|G|^{0.45} + 0.099,\qquad B' = -1.099\,|B|^{0.45} + 0.099$$
For -0.018 < R, G, B < 0.018:
$$R' = 4.5\,R$$











+ 0.587G' + 0.114B'




- 0.587G' - 0.114B'
The last step in color processing is the quantization of the luma and chroma
information to digital code values for storage. For the 8-bit encoding used in
current Photo CD products, the resulting luma, chroma1, and chroma2 values are









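Step 1 of the RGB to YCC conversion is a piecewise function applied to each channel independently. A minimal sketch of that transfer function as given above; values exactly at plus or minus 0.018 fall in the linear segment here, an assumption since the equations leave those boundary points unspecified:

```python
def photo_ycc_transfer(v):
    """Step 1 of RGB to YCC: nonlinear transfer applied to R, G or B."""
    if v > 0.018:
        return 1.099 * v ** 0.45 - 0.099
    if v < -0.018:
        return -1.099 * abs(v) ** 0.45 + 0.099
    return 4.5 * v  # linear segment near zero (boundaries included here)


for v in (1.0, 0.5, 0.01, 0.0, -0.5):
    print(v, round(photo_ycc_transfer(v), 4))
```

The fractional power is what makes this step expensive in silicon and motivates the lookup-table approach of Section 2.3.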
2.3 Empirically Determined Color Space Conversion
In empirically determined color space conversion, only a limited number of data
points are available; the remaining points are obtained by interpolation. This
method can also be used for nonlinear color space conversions when the
transformations are very difficult to calculate digitally. The RGB to YCC algorithm
example above would be very difficult to convert digitally and would be suited






* (x - x1) / Δx







[Figure: interpolated vs. actual values of a function f(X); X marks the data points stored in memory, with X-axis ticks at 64, 128, 192 and 256. f(150) = 22500 (calculated). f(150) can be interpolated as f(128) + [f(192) - f(128)] x (150 - 128)/64, or f(150) = 16384 + [36864 - 16384] x 22/64 = 23424 (interpolated).]
Figure 4: One Dimensional Interpolation [5, pg. 22]
This algorithm uses a lookup table to reduce the amount of calculation needed.
However, since not every possible color combination is stored in memory, it is not
a simple lookup table, and some calculation is still needed.
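The one-dimensional case of Figure 4 can be reproduced directly. This sketch assumes table entries every 64 units whose values are consistent with f(x) = x^2 (16384 at 128, 36864 at 192), as in the figure:

```python
def interp_1d(x, xs, fs):
    """Linear interpolation between stored samples (xs in ascending order)."""
    for x1, x2, f1, f2 in zip(xs, xs[1:], fs, fs[1:]):
        if x1 <= x <= x2:
            return f1 + (f2 - f1) * (x - x1) / (x2 - x1)
    raise ValueError("x outside the stored range")


xs = [0, 64, 128, 192, 256]
fs = [x * x for x in xs]        # stored data points
print(interp_1d(150, xs, fs))   # interpolated value
print(150 * 150)                # calculated value
```

The interpolated 23424 differs from the calculated 22500 because f is not linear between the stored points; a denser table reduces that error at the cost of memory.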
2.4 Algorithm Used for this Study
For this study, the empirically determined color space conversion algorithm was
chosen. Only the algorithm block was synthesized. The other blocks of the design
such as memories, address calculators and registers are trivial and not well suited
for behavioral synthesis. Because of this, these blocks are treated as part of the
design test bench. The equation for this color space algorithm is given below.
[4] pp. 67-68]
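Equation 3: Tri-Linear Interpolation
$$P(X,Y,Z) = C_0 + C_1\Delta X + C_2\Delta Y + C_3\Delta Z + C_4\Delta X\Delta Y + C_5\Delta X\Delta Z + C_6\Delta Y\Delta Z + C_7\Delta X\Delta Y\Delta Z$$
Where:
$$\Delta X = X - X_0,\qquad \Delta Y = Y - Y_0,\qquad \Delta Z = Z - Z_0$$
$$C_0 = P_{000}$$
$$C_1 = \frac{P_{100} - P_{000}}{X_1 - X_0},\qquad C_2 = \frac{P_{010} - P_{000}}{Y_1 - Y_0},\qquad C_3 = \frac{P_{001} - P_{000}}{Z_1 - Z_0}$$
$$C_4 = \frac{P_{110} - P_{010} - P_{100} + P_{000}}{(X_1 - X_0)(Y_1 - Y_0)}$$
$$C_5 = \frac{P_{101} - P_{001} - P_{100} + P_{000}}{(X_1 - X_0)(Z_1 - Z_0)}$$
$$C_6 = \frac{P_{011} - P_{001} - P_{010} + P_{000}}{(Y_1 - Y_0)(Z_1 - Z_0)}$$
$$C_7 = \frac{P_{111} - P_{011} - P_{101} - P_{110} + P_{100} + P_{010} + P_{001} - P_{000}}{(X_1 - X_0)(Y_1 - Y_0)(Z_1 - Z_0)}$$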











Figure 5: Single Interpolation Cube
Figure 6: Equal Sampling Color
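Equation 3 can be checked in software before committing it to hardware, in the spirit of the thesis's C++ reference model. A minimal sketch, assuming the eight corner values are held in a dictionary keyed by (i, j, k), with 0 selecting the low plane and 1 the high plane; the test function `f` is hypothetical and chosen to be linear in each axis so that interpolation reproduces it exactly:

```python
def trilinear(p, x, y, z, lo, hi):
    """Evaluate Equation 3 inside one interpolation cube.

    p[(i, j, k)] holds P_ijk; lo = (X0, Y0, Z0) and hi = (X1, Y1, Z1).
    """
    (x0, y0, z0), (x1, y1, z1) = lo, hi
    dx, dy, dz = x - x0, y - y0, z - z0
    wx, wy, wz = x1 - x0, y1 - y0, z1 - z0

    # Constants C0..C7 from the eight corner values.
    c0 = p[0, 0, 0]
    c1 = (p[1, 0, 0] - p[0, 0, 0]) / wx
    c2 = (p[0, 1, 0] - p[0, 0, 0]) / wy
    c3 = (p[0, 0, 1] - p[0, 0, 0]) / wz
    c4 = (p[1, 1, 0] - p[0, 1, 0] - p[1, 0, 0] + p[0, 0, 0]) / (wx * wy)
    c5 = (p[1, 0, 1] - p[0, 0, 1] - p[1, 0, 0] + p[0, 0, 0]) / (wx * wz)
    c6 = (p[0, 1, 1] - p[0, 0, 1] - p[0, 1, 0] + p[0, 0, 0]) / (wy * wz)
    c7 = (p[1, 1, 1] - p[0, 1, 1] - p[1, 0, 1] - p[1, 1, 0]
          + p[1, 0, 0] + p[0, 1, 0] + p[0, 0, 1] - p[0, 0, 0]) / (wx * wy * wz)

    return (c0 + c1 * dx + c2 * dy + c3 * dz
            + c4 * dx * dy + c5 * dx * dz + c6 * dy * dz
            + c7 * dx * dy * dz)


def f(x, y, z):
    """Hypothetical table contents, linear in each axis."""
    return 3 + 2 * x - y + 0.5 * z + 0.01 * x * y * z


lo, hi = (64, 0, 128), (128, 64, 192)
p = {(i, j, k): f((lo[0], hi[0])[i], (lo[1], hi[1])[j], (lo[2], hi[2])[k])
     for i in (0, 1) for j in (0, 1) for k in (0, 1)}
print(trilinear(p, 100, 30, 150, lo, hi), f(100, 30, 150))
```

The operation count visible here (the divisions and the many multiplications and additions) is what makes the ALU block a good candidate for behavioral synthesis.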
The color space is represented as the cube shown in Figure 6. The X, Y and Z
coordinates can represent any color space; in this example, Red, Green and Blue are
the color channels. The total color space is broken into equal sections, or planes.
The values to be interpolated are referenced by the address of each corner of the
cube. These data points are the P_xyz values used to calculate the constants
C0 - C7, and they are the values from the lookup table. The P_xyz values used in
Equation 3 are shown in Figure 5. It is necessary to find the cube in which
interpolation needs to occur, that is, to locate points P000 - P111 shown in
Figure 5. This can be done by using the most significant bits of the incoming color
values to address locations in memory, as described below. Given the input value
of, say, the X color channel, the two most significant bits of the input determine
which planes in the color space are needed for the calculation. If the range of
input values is 0 - 255, this can be described in pseudo code as:
X = input






}else if(X < 128) { // Plane 1
low_plane = "001";
high_plane = "010";









Figure 7: Plane Address Pseudo code
The code fragment in Figure 7 can be found in the VHDL code for the index block.
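For 8-bit inputs and planes at 0, 64, 128, 192 and 255, the if/else chain of Figure 7 reduces to taking the two most significant bits. A minimal sketch under that assumption (the function name is hypothetical):

```python
def planes(value):
    """Return (low_plane, high_plane) for an 8-bit channel value.

    The two most significant bits select the low plane; the high plane is
    the next one up, so inputs 192-255 use planes 3 and 4.
    """
    if not 0 <= value <= 255:
        raise ValueError("expected an 8-bit value")
    low = value >> 6
    return low, low + 1


for v in (60, 100, 180, 255):
    print(v, planes(v))
```

In hardware this is just wiring: the low plane number is bits 7 and 6 of the input, and the high plane number is that value plus one.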
Next, all combinations of the low and high planes must be found to locate the data
in memory. How the memory is organized is important so that the data can be
retrieved easily. For this study, the 0 - 255 range of each color channel is
divided into 4 equal sections: (0 - 64), (64 - 128), (128 - 192) and (192 - 255).
The boundaries of these sections (0, 64, 128, 192 and 255) make up the five planes
in the color space shown in Figure 6. The lookup table was organized in memory as
follows:
Memory    Incoming MSB           Interpolated Values
Address   X     Y     Z          A    B    C
0         0     0     0          X    X    X
1         0     0     64         X    X    X
2         0     0     128        X    X    X
3         0     0     192        X    X    X
4         0     0     255        X    X    X
5         0     64    0          X    X    X
6         0     64    64         X    X    X
7         0     64    128        X    X    X
...
122       255   255   128        X    X    X
123       255   255   192        X    X    X
124       255   255   255        X    X    X
Figure 8: Memory Configuration
There are three channels, and each channel has a low and a high plane needed for
interpolation. Since there are five planes per channel and three channels, there
are 5^3, or 125, memory locations. Given three channels, the memory address
locations for the lookup table can be addressed by using the plane numbers










Figure 9: Memory Addressing Pseudo code








(X_low_plane_number * 25) + (Y_high_plane_number * 5) + Z_low_plane_number
(X_low_plane_number * 25) + (Y_high_plane_number * 5) + Z_high_plane_number




(X_high_plane_number * 25) + (Y_low_plane_number * 5) + Z_high_plane_number
(X_high_plane_number * 25) + (Y_high_plane_number * 5) + Z_low_plane_number
(X_high_plane_number * 25) + (Y_high_plane_number * 5) + Z_high_plane_number
Incoming XYZ Values   High and Low Plane Numbers for X   Address(0) Calculated
60, 60, 60            0, 1                               (0*25) + (0*5) + 0 = 0
100, 60, 180          1, 2                               (1*25) + (0*5) + 2 = 27
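The addressing scheme treats the three plane numbers as digits of a base-5 number. A minimal sketch, assuming five planes per axis; it keys each cube corner by (i, j, k) (0 = low plane, 1 = high plane) rather than reproducing the Address(0) - Address(7) ordering used in the thesis, and the function names are hypothetical:

```python
def lut_address(x_plane, y_plane, z_plane, planes_per_axis=5):
    """Lookup-table address: X plane * 25 + Y plane * 5 + Z plane."""
    return (x_plane * planes_per_axis + y_plane) * planes_per_axis + z_plane


def corner_addresses(x_planes, y_planes, z_planes):
    """Map each cube corner (i, j, k) to its memory address, where i, j, k
    choose the low (0) or high (1) plane of X, Y and Z; (0, 0, 0) is P000."""
    return {
        (i, j, k): lut_address(x_planes[i], y_planes[j], z_planes[k])
        for i in (0, 1) for j in (0, 1) for k in (0, 1)
    }


# Second row of the table above: input (100, 60, 180).
corners = corner_addresses((1, 2), (0, 1), (2, 3))
print(corners[(0, 0, 0)], corners[(1, 1, 1)])
```

Because the multipliers are the constants 25 and 5, a hardware address block needs only shifts and additions.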
These eight addresses correspond to the data stored in memory for the P000 - P111
values, which Equation 3 uses to calculate the constants C0 - C7. In this algorithm,
the output values for all channels are stored at one address. If the color values
are 8 bits, a 24-bit word is stored at each address. The mapping of addresses to P
values is given below:
Address(0) = P010   Address(3) = P001   Address(6) = P100
Address(1) = P011   Address(4) = P110   Address(7) = P101
Address(2) = P000   Address(5) = P111
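Storing three 8-bit output channels at one address amounts to packing them into a 24-bit word. A minimal sketch, assuming channel A occupies the most significant byte (the byte order is an illustrative assumption):

```python
def pack_word(a, b, c):
    """Pack three 8-bit output channels into one 24-bit memory word."""
    for v in (a, b, c):
        if not 0 <= v <= 0xFF:
            raise ValueError("channel values must be 8-bit")
    return (a << 16) | (b << 8) | c


def unpack_word(word):
    """Split a 24-bit word back into its three 8-bit channels."""
    return (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF


w = pack_word(1, 2, 3)
print(hex(w), unpack_word(w))
```

One memory read therefore delivers the P value for all three output channels at once.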
Figure 10 shows the architecture of this design. Figure 7 shows how the Red, Green
and Blue channels produce their corresponding plane indices, which feed into the
address block to generate the memory addresses depicted in Figure 9. These memory
locations contain the P000 - P111 values, which are then transferred to the ALU
block. The ALU block executes Equation 3. In this study, everything in the partition
labeled RTL is part of the test bench.

















Behavioral Synthesis is very different from RTL Synthesis. RTL synthesis is
generally a fairly simple process. First, the tool takes the HDL input and replaces
HDL operators with synthetic operators, replaces case and if-else statements with
combinational structures, and maps certain predefined functions to optimized logic
components supplied with the target library. Next, the user drives compilation and
optimization through the constraints they specify for the design. Finally, the
operators and combinational logic are mapped to components of the target technology
library. Because the result closely follows the RTL description, there is generally
little flexibility in what the synthesized design looks like. In HLS, by contrast,
the first step is to break the high-level description of the circuit into a
Control/Data Flow Graph (CDFG). The CDFG is an abstract representation of the
desired circuit behavior [6]. From this CDFG, the synthesis tools "schedule" the
design subject to user constraints. Once the design is scheduled, the RTL netlist
generated from the resulting schedule can be compiled.
3.1 Control/Data Flow Graphs
Control/Data Flow Graphs (CDFGs) consist of four different types of nodes:
conditional nodes, data nodes, hierarchical nodes and loop control nodes [6]. Each
node can have a data flow block associated with it that describes operational
activity or assignment types. These data flow blocks are represented as circles for
operations and arcs (or edges) for data flow. The four node types are described
below [6]:
Data nodes represent arithmetic and logical operations such as addition and
multiplication. These nodes correspond to operators in the HDL, such as the '+'
operator.
Conditional nodes correspond to IF and CASE statements in the HDL. They are used
as split and join nodes for MUX and DeMUX operations.
Hierarchical nodes are often used to represent loops, functions and procedures.
Each hierarchical node contains other nodes and edges.
Loop control nodes consist of four different types of control nodes: loop begin,
loop end, loop exit and loop continue. These nodes mark the boundaries of a loop,
that is, its entry and exit points.
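The node types above can be captured in a small data structure. This is a hypothetical sketch for illustration only, not Behavioral Compiler's internal representation; the example loop computes y = a * b + c and uses no conditional node:

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    DATA = "data"                  # arithmetic/logic operator such as '+'
    CONDITIONAL = "conditional"    # split/join for IF and CASE
    HIERARCHICAL = "hierarchical"  # loop, function or procedure body
    LOOP_CONTROL = "loop_control"  # loop begin, end, exit or continue


@dataclass
class Node:
    name: str
    kind: Kind
    children: list = field(default_factory=list)  # used by hierarchical nodes


@dataclass
class CDFG:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source, destination) pairs

    def add(self, node):
        self.nodes[node.name] = node
        return node

    def connect(self, src, dst):
        self.edges.append((src, dst))


# Hypothetical loop whose body computes y = a * b + c.
g = CDFG()
body = [
    g.add(Node("loop_begin", Kind.LOOP_CONTROL)),
    g.add(Node("mul", Kind.DATA)),
    g.add(Node("add", Kind.DATA)),
    g.add(Node("loop_end", Kind.LOOP_CONTROL)),
]
loop = g.add(Node("loop", Kind.HIERARCHICAL, body))
for src, dst in zip(body, body[1:]):
    g.connect(src.name, dst.name)
print(sorted(n.kind.value for n in g.nodes.values()))
```

The hierarchical node owns its body, matching the rule in Section 3.5 that each rolled loop becomes a level of hierarchy that is scheduled separately.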













[Figure panels: State N + 1, State N + 2]
Figure 11: Two Different Architectures for the Same Design
3.2 Csteps
Csteps and scheduling are at the heart of behavioral synthesis. Csteps control the
flow of the design and generally correspond to clock cycles or states. Scheduling
automatically maps the design's operations to the different csteps, a task the user
performs manually in RTL synthesis. Automatic scheduling is particularly useful for
pipelining high-speed designs. By adjusting constraints and synthesis scripts, the
designer can quickly explore a wide range of architectures without re-coding the
design.
A cstep is an integer representing an abstract machine state or clock cycle. When
two operations are mapped to the same control step, the scheduler makes them execute
in parallel. The results generated by behavioral synthesis are best thought of in
terms of a Control/Data Flow Graph (CDFG). Consider the code fragment in Figure 11,
which can be implemented in many different ways. The CDFG in Figure 11 shows two
possible implementations.
In the implementation on the left of Figure 11, there are three resources: RA, RB
and RC. On the right there are two resources, RA and RB. In the left example, the
addition and subtraction are implemented in parallel. This gives a larger but faster
design than the one on the right, which uses the same general-purpose unit (RA) for
both the addition and the subtraction in two different cycles. The architecture
chosen depends on the user constraints. For this example, the user may specify
whether operation chaining is allowed. If it is not, the second architecture must be
chosen. If the clock cycle is long enough, the left architecture may be chosen when
low latency and higher speed are required.
To make changes like this, an RTL designer must re-code the design, re-simulate to
verify its accuracy and re-synthesize. An HLS (High-Level Synthesis) designer,
however, only needs to modify the constraints and re-synthesize. The code does not
need to change because the architecture is not designed into it.
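The area side of this trade-off follows from the cstep assignment: operations sharing a cstep need separate units, while operations in different csteps can share one. A minimal sketch, assuming single-cycle operations and a hypothetical general-purpose "alu" unit type that can perform both addition and subtraction:

```python
from collections import Counter


def units_needed(schedule, unit_of):
    """Functional units required by a schedule.

    schedule: op -> cstep; unit_of: op -> unit type that can execute it.
    Each unit type needs as many instances as its busiest cstep uses.
    """
    usage = Counter((unit_of[op], step) for op, step in schedule.items())
    need = {}
    for (unit, _step), count in usage.items():
        need[unit] = max(need.get(unit, 0), count)
    return need


alu = {"add": "alu", "sub": "alu"}
print(units_needed({"add": 1, "sub": 1}, alu))  # parallel: one cstep, two units
print(units_needed({"add": 1, "sub": 2}, alu))  # shared: two csteps, one unit
```

Moving the subtraction to a later cstep saves a unit at the cost of an extra cycle, the same choice a scheduler makes when driven by area or latency constraints.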
3.3 Scheduling
After the CDFG is created, the next step is to schedule the design. Since the CDFG
only defines the dependencies between operations, scheduling the CDFG determines the
exact start time of each task [7].
Figure 12: A Generic CDFG
In general, the process is to generate a CDFG and then partition the design into
subgraphs (time slices). Each subgraph is executed in one control step (cstep), and
each cstep corresponds to a state in a controlling finite state machine (FSM).
Figure 12 shows a general CDFG without any regard to timing or resource constraints
[7] [2] [8]. When the design is scheduled, the start times must satisfy the original
dependencies of the CDFG, which limits the parallelism of operations. Since
scheduling determines the concurrency in the design, it affects the design's
performance. Likewise, the maximum number of concurrent operations allowed in a
schedule depends on the hardware available. The scheduling algorithm must deal with
both of these constraints. The minimum-latency resource-constrained scheduling
problem can be defined formally as follows [7]:
Definition: Given a set of operations V with integer delays D, a partial order on the
operations E and upper bounds {a_k; k = 1, 2, ..., n_res}, find an integer labeling
of the operations φ: V → Z+ such that
$$t_i = \varphi(v_i),\qquad t_i \ge t_j + d_j \quad \forall\, i, j \ \text{s.t.}\ (v_j, v_i) \in E,$$
$$\left|\{\, v_i : \mathcal{T}(v_i) = k \ \text{and}\ t_i \le l < t_i + d_i \,\}\right| \le a_k$$
for each operation type k = 1, 2, ..., n_res and schedule step l = 1, 2, ..., t_n,
and t_n is minimum. Here T(v_i) denotes the type of operation v_i.
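The two constraints in the definition can be checked mechanically for any proposed schedule. A minimal sketch, assuming schedules, delays, operation types and resource bounds are given as dictionaries; the multiplier example reuses Y = A*B*C*D from Chapter 1 and is illustrative only:

```python
def is_valid_schedule(start, delay, edges, op_type, bounds):
    """Check a schedule against the constraints of the definition.

    start: op -> t_i, delay: op -> d_i, edges: (v_j, v_i) pairs meaning v_i
    depends on v_j, op_type: op -> resource type k, bounds: k -> a_k.
    """
    # Precedence: t_i >= t_j + d_j for every edge (v_j, v_i).
    if any(start[vi] < start[vj] + delay[vj] for vj, vi in edges):
        return False
    # Resources: at each step l, ops of type k busy in [t_i, t_i + d_i) <= a_k.
    last = max(start[v] + delay[v] for v in start)
    for step in range(1, last):
        for k, a_k in bounds.items():
            busy = sum(1 for v in start
                       if op_type[v] == k and start[v] <= step < start[v] + delay[v])
            if busy > a_k:
                return False
    return True


# m1 = A*B, m2 = C*D, m3 = m1*m2, all single-cycle multiplications.
ops = {"m1": "mul", "m2": "mul", "m3": "mul"}
d = {v: 1 for v in ops}
e = [("m1", "m3"), ("m2", "m3")]
print(is_valid_schedule({"m1": 1, "m2": 1, "m3": 2}, d, e, ops, {"mul": 2}))
print(is_valid_schedule({"m1": 1, "m2": 1, "m3": 2}, d, e, ops, {"mul": 1}))
```

A scheduler searches for the valid schedule with the smallest t_n; this check only tests a candidate.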
Area, latency and cycle-time requirements also drive the schedule of the design.
This section discusses two scheduling algorithms: As Soon As Possible (ASAP) and
As Late As Possible (ALAP) scheduling.
3.3.1 ASAP Scheduling
When a design is scheduled without any timing constraints, As Soon As Possible
(ASAP) scheduling is chosen by default. In ASAP scheduling, each node is scheduled
in the earliest state to which it can possibly be assigned. The ASAP algorithm is
given below [7], where:
The start times computed by the ASAP algorithm are t^S, a vector
{t_i^S; i = 0, 1, ..., n}.
The number of resources is n_res.
The total latency of the schedule is λ.
ASAP(G_S(V, E)) {
    Schedule v_0 by setting t_0^S = 1;
    repeat {
        Select a vertex v_i whose predecessors are all scheduled;
        Schedule v_i by setting t_i^S = max{ t_j^S + d_j : (v_j, v_i) ∈ E };
    }
    until (v_n is scheduled);
    return (t^S);
}
Figure 13: ASAP Scheduling
Figure 13 shows the CDFG in Figure 12 scheduled using the ASAP algorithm.
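The ASAP pseudocode translates almost line for line into Python. In this sketch the source and sink NOP vertices v_0 and v_n are implicit: operations without predecessors start in cstep 1. The graph is hypothetical (Figure 12's graph is not reproduced in the text): m3 = m1 * m2 and s1 = a1 - m3, all single-cycle.

```python
def asap(preds, delay):
    """As Soon As Possible schedule.

    preds: op -> list of predecessor ops; delay: op -> d_i in csteps.
    Ops with no predecessors start in cstep 1; every other op starts at
    the max over its predecessors j of (t_j + d_j).
    """
    start = {}
    remaining = set(preds)
    while remaining:
        ready = [v for v in remaining if all(p in start for p in preds[v])]
        if not ready:
            raise ValueError("dependency cycle")
        for v in ready:
            start[v] = max((start[p] + delay[p] for p in preds[v]), default=1)
            remaining.discard(v)
    return start


preds = {"m1": [], "m2": [], "a1": [], "m3": ["m1", "m2"], "s1": ["a1", "m3"]}
delay = {v: 1 for v in preds}
print(asap(preds, delay))
```

ASAP places a1 in cstep 1 even though its result is not needed until cstep 3; that slack is what ALAP scheduling exposes.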
3.3.2 ALAP Scheduling
If a schedule exists that satisfies the latency constraints of the design, it is
possible to explore the range of start times for the operations that meet the
constraints. The ASAP algorithm yields the minimum start times. A complementary
algorithm is As Late As Possible (ALAP) scheduling. When the scheduler finds that it
can meet the constraints with ASAP scheduling, it can also try ALAP and schedules in
between to find the best timing and area results. The start times computed by the
ALAP algorithm are denoted t^L.
The algorithm for ALAP is as follows:
ALAP(G_S(V, E), λ) {
    Schedule v_n by setting t_n^L = λ + 1;
    repeat {









    until (v_0 is scheduled);
    return (t^L);
}
Figure 14 shows the CDFG in Figure 12 scheduled using the ALAP algorithm.
Figure 14: ALAP Scheduling
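The ALAP algorithm is the mirror image of ASAP: it works backwards from a latency bound λ, scheduling each operation once all of its successors are scheduled. A minimal sketch, with the sink v_n implicit at cstep λ + 1 and the same hypothetical graph as the ASAP example (m3 = m1 * m2, s1 = a1 - m3):

```python
def alap(preds, delay, latency):
    """As Late As Possible schedule under a latency bound.

    Ops with no successors start at latency + 1 - d_i; every other op starts
    at the min over its successors j of t_j, minus its own delay d_i.
    """
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)
    start = {}
    remaining = set(preds)
    while remaining:
        ready = [v for v in remaining if all(s in start for s in succs[v])]
        if not ready:
            raise ValueError("dependency cycle")
        for v in ready:
            start[v] = min((start[s] for s in succs[v]), default=latency + 1) - delay[v]
            remaining.discard(v)
    return start


preds = {"m1": [], "m2": [], "a1": [], "m3": ["m1", "m2"], "s1": ["a1", "m3"]}
delay = {v: 1 for v in preds}
print(alap(preds, delay, 3))
```

Comparing with ASAP, a1 can start in cstep 1 or 2 (a mobility of one cstep), while the operations on the critical path m1/m2, m3, s1 have no freedom; a scheduler uses this mobility to balance resource usage.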
3.4 Scheduling Modes
This study used the Synopsys HLS tool Behavioral Compiler (BC). BC provides three
scheduling modes for a design's input and output (I/O): Cycle-fixed,
Superstate-fixed and Free-floating. In Cycle-fixed mode, I/O operations cannot be
rescheduled; their timing must match the timing in the HDL code. In Superstate-fixed
mode, I/O timing can be stretched, but the order of I/O operations cannot be
modified.



























Figure 15: Original HDL timing and results of three I/O modes
3.4.1 Cycle-fixed Scheduling Mode [6]
Cycle-fixed mode preserves the original I/O timing of the source HDL exactly. This
is a great advantage because the interfaces to the design do not have to be modified
to communicate with the synthesized block. No special handshaking is therefore
needed in the simulation test bench, or in other design modules when the HLS block
is one of many blocks in a design. One disadvantage of cycle-fixed mode is that
synthesis constraints or commands that change the I/O timing of the design cannot be
used. These include the important commands that automatically pipeline the design
and implement resource sharing.
3.4.2 Superstate-Fixed Scheduling Mode [6]
Superstate-fixed mode is the most widely used scheduling mode. In Superstate-fixed
mode the order in which I/O transactions occur is preserved, but the number of clock
cycles in each state can vary. When it varies, Behavioral Compiler has inserted
additional clock cycles to satisfy constraints and accommodate data and control
dependencies. This gives Behavioral Compiler greater freedom to find the optimum
design. When a test bench is used to test code synthesized in Superstate-fixed mode,
handshaking signals must be used so that it is known when signals are valid and
changing. Test benches and other interface code cannot be timed to such a design by
counting clock cycles.
3.4.3 Free-floating Scheduling Mode [6]
The Free-floating mode allows I/O operations to float with respect to one another.
This mode is not used often because it is difficult to constrain synthesis so that
I/O operations are kept in tight blocks. Figure 15.d shows an example of
free-floating timing and how it differs from Superstate-fixed and Cycle-fixed
timing. When building free-floating designs, it is very important to constrain I/O
operations and their synchronizing signals so that they float together [6]. For
example, when a bus is synchronized by a data-ready strobe, the bus transfer floats
to the step where it uses the fewest resources and wastes the least latency, as
determined by data dependencies and constraints. The strobe depends on none of these,
so its value is an unconstrained constant. This causes the data-strobe signal to
float to the beginning of the schedule and eliminates its value as a strobe.
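The three modes differ only in which rescheduled I/O timings they accept. The sketch below is a simplified model of the rules stated in this section, not of the tool itself; the mode strings and I/O operation names are hypothetical:

```python
def respects_mode(original, scheduled, mode):
    """Return True if moving I/O operations from `original` to `scheduled`
    clock cycles is allowed in `mode` (simplified rules of Section 3.4).

    original, scheduled: dict mapping I/O operation name -> clock cycle.
    """
    if mode == "cycle-fixed":
        # I/O timing must match the HDL exactly.
        return scheduled == original
    if mode == "superstate-fixed":
        # Timing may stretch, but an operation that came first stays first.
        return all(scheduled[a] < scheduled[b]
                   for a in original for b in original
                   if original[a] < original[b])
    if mode == "free-floating":
        # I/O operations may float with respect to one another.
        return True
    raise ValueError(f"unknown mode: {mode}")


hdl = {"read_in": 1, "write_out": 3}
stretched = {"read_in": 1, "write_out": 5}
reordered = {"read_in": 4, "write_out": 2}
for mode in ("cycle-fixed", "superstate-fixed", "free-floating"):
    print(mode, respects_mode(hdl, stretched, mode), respects_mode(hdl, reordered, mode))
```

The free-floating branch accepts any timing, which is why a strobe that is not tied to its data by a constraint can drift away from the bus transfer it is meant to mark.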
3.5 Coding Styles and Restrictions
Behavioral Synthesis using a High Level Synthesis tool requires a coding style different from RTL, so an RTL designer who starts using HLS will need to learn new techniques.
First, BC schedules only processes; it does not try to schedule component instantiations.
Multiple processes are scheduled independently of each other, and BC provides no synchronization between them. The designer must use appropriate handshaking between processes to keep them synchronized.
Both rising and falling clock edges may be used, but their polarities may not be mixed within the same process.
Only one signal can be the argument of a wait statement, and process sensitivity lists are not used.
Behavioral code is written as a series of nested loops in a single process block.
The outermost loop is the reset loop. It does not have to reset any variables or signals, but it must bound the main algorithm loop. Resetting all signals and variables is highly recommended.
BC treats every rolled loop as a level of hierarchy.
Constraints must begin and end in the same level of hierarchy; loop-begin and loop-end are at the same level.
All lower levels of hierarchy are scheduled first and then included in the next higher level, which is then scheduled.
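Because BC provides no synchronization between processes, the handshake has to be designed by hand. The C++ model below is a minimal cycle-based sketch of one common choice, a four-phase (return-to-zero) req/ack handshake between a producer process and a consumer process. It is not BC output or VHDL; the names `req`, `ack`, `bus` and `four_phase_transfer` are hypothetical, and both processes are assumed to sample each other's signals as the values registered on the previous clock edge.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Cycle-based sketch of a four-phase (return-to-zero) req/ack handshake
// between two independently scheduled processes. Each loop iteration is one
// clock edge; both processes see the values registered on the previous edge,
// as two clocked VHDL processes would.
std::vector<int> four_phase_transfer(const std::vector<int>& items,
                                     std::size_t max_cycles = 1000) {
  bool req = false;  // driven by the producer (hypothetical name)
  bool ack = false;  // driven by the consumer (hypothetical name)
  int bus = 0;       // data bus driven by the producer
  std::size_t next = 0;
  std::vector<int> received;

  for (std::size_t cycle = 0;
       cycle < max_cycles && received.size() < items.size(); ++cycle) {
    const bool req_q = req;  // values visible at this clock edge
    const bool ack_q = ack;
    const int bus_q = bus;

    // Producer: start a transfer only when both lines are idle (low),
    // then release req once the consumer has acknowledged.
    if (!req_q && !ack_q && next < items.size()) {
      bus = items[next++];
      req = true;
    } else if (req_q && ack_q) {
      req = false;
    }

    // Consumer: latch the bus on a new request, then release ack
    // after the producer has released req.
    if (req_q && !ack_q) {
      received.push_back(bus_q);
      ack = true;
    } else if (!req_q && ack_q) {
      ack = false;
    }
  }
  return received;
}
```

In this model each word occupies four clock edges (raise req, raise ack, drop req, drop ack), which is the price paid for a protocol that stays correct no matter how the two processes are scheduled relative to each other.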
An example of good behavioral coding style is shown below.




















As shown in Figure 10, the design is partitioned into two sections: an RTL partition and a behavioral (HLS) partition. Traditional RTL synthesis is used to create the circuitry for the address, index and memory blocks. The ALU block, however, is very complex, with many operations, which makes it an ideal candidate for High Level Synthesis. To study HLS technology, the ALU block was written three different ways. The first used a standard RTL coding style: the pipeline for the algorithm had to be timed manually, resources were allocated manually and speed-versus-area trade-offs were planned manually. The RTL code provided a good baseline for what an average engineer would design, and timing, area and power reports were generated as baseline data to compare against the HLS methods of synthesis. The second and third methods employed HLS coding styles and methodologies that are explained in detail later. Area, timing and power reports were generated for each HLS method for comparison with the baseline RTL data. The part of the design that contains the memories and memory addressing is not a good candidate for HLS, so it was not synthesized and was treated as part of the design test bench. In a real ASIC or FPGA design these blocks would not be part of the test bench; they would be synthesized using a standard RTL compiler.
The memory and memory addressing functions are discussed thoroughly in section 2.4.
4.1 Baseline Algorithm Design using RTL
To understand how well HLS tools can meet requirements, a baseline algorithm design was created using a current RTL synthesis tool, Design Compiler from Synopsys. The block diagram in Figure 16 shows how the calculations were partitioned for the RTL design. Refer to Equation 3 for a description of the constants and variables. In this design the factors (X1 - X0), (Y1 - Y0) and (Z1 - Z0) all equal the constant 64, because the color space is evenly divided from 0 to 255. These factors are not calculated; their values are hard coded into the design. Should the design be modified to include more bits of color, this factor will have to be re-coded.
When a design is manually pipelined, the designer must have very detailed knowledge of the target synthesis library so that the design can be pipelined efficiently. One must know how long multiplies, additions and other operations take in order to time the design properly without a long, iterative trial-and-error process. Unfortunately, in this study the synthesis library was not accompanied by documentation, so a trial-and-error process was used to time the design. The synthesis library chosen for this project was an obsolete library, which turned out to be a major advantage for the study of behavioral synthesis. State-of-the-art technologies are so fast that many small to mid-sized algorithms have little trouble meeting timing; if that were the case in this study, there would be little to evaluate in terms of the performance of HLS tools. Because an older technology library (LSI 10K) was chosen, pipelining the design was a challenge. Trial and error showed that with a 20MHz clock it is impossible to execute three 8-bit multiplies in a single operation, or seven 8-bit additions, in one clock cycle. Therefore, the terms in Equation 3 had to be broken into smaller terms and calculated separately over several clock cycles. How the terms were decomposed into smaller calculation segments is shown below.
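$$
\begin{aligned}
P(X,Y,Z) &= C_0 + C_1\Delta X + C_2\Delta Y + C_3\Delta Z + C_4\Delta X\Delta Y + C_5\Delta X\Delta Z + C_6\Delta Y\Delta Z + C_7\Delta X\Delta Y\Delta Z \\
&= C_0 + \mathrm{Term}_1 + \mathrm{Term}_2 + \mathrm{Term}_3 + \mathrm{Term}_4 + \mathrm{Term}_5 + \mathrm{Term}_6 + \mathrm{Term}_7 \\
&= \mathrm{Left} + \mathrm{Right}
\end{aligned}
$$
where
$$
\begin{aligned}
\mathrm{Term}_1 &= C_1\Delta X, & \mathrm{Term}_5 &= C_5\Delta X\Delta Z, \\
\mathrm{Term}_2 &= C_2\Delta Y, & \mathrm{Term}_6 &= C_6\Delta Y\Delta Z, \\
\mathrm{Term}_3 &= C_3\Delta Z, & \mathrm{Term}_7 &= C_7\Delta X\Delta Y\Delta Z, \\
\mathrm{Term}_4 &= C_4\Delta X\Delta Y, & & \\
\mathrm{Left} &= C_0 + \mathrm{Term}_1 + \mathrm{Term}_2 + \mathrm{Term}_3, & \mathrm{Right} &= \mathrm{Term}_4 + \mathrm{Term}_5 + \mathrm{Term}_6 + \mathrm{Term}_7.
\end{aligned}
$$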
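The Left/Right split can be checked in software before it is committed to hardware. The C++ sketch below evaluates Left and Right as separate partial sums in fixed point. It assumes integer offsets dx, dy, dz in [0, 64] measured from the lower cube corner (so ΔX = dx/64, and likewise for Y and Z), a grid spacing of 64 so that each division becomes a power-of-two scaling, and the standard tri-linear expansion of the eight corner values into C0..C7; Equation 3's own coefficient definitions are not reproduced here. The names `coefficients`, `interpolate` and the corner ordering are hypothetical, and the sketch keeps full precision until the final division, whereas the 8-bit hardware may truncate intermediate terms.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Grid spacing: X1 - X0 = Y1 - Y0 = Z1 - Z0 = 64 (2^6).
constexpr std::int64_t kStep = 64;
// Common denominator 64^3 = 2^18, so every term can be kept as an integer.
constexpr std::int64_t kScale = kStep * kStep * kStep;

using Corners = std::array<std::int64_t, 8>;  // index = x | (y << 1) | (z << 2)
using Coeffs = std::array<std::int64_t, 8>;   // C0 .. C7

// Standard tri-linear expansion of the eight corner values (assumed form).
Coeffs coefficients(const Corners& p) {
  return {p[0],
          p[1] - p[0],
          p[2] - p[0],
          p[4] - p[0],
          p[3] - p[1] - p[2] + p[0],
          p[5] - p[1] - p[4] + p[0],
          p[6] - p[2] - p[4] + p[0],
          p[7] - p[3] - p[5] - p[6] + p[1] + p[2] + p[4] - p[0]};
}

// Evaluates P = Left + Right for integer offsets dx, dy, dz in [0, 64].
std::int64_t interpolate(const Coeffs& c, std::int64_t dx, std::int64_t dy,
                         std::int64_t dz) {
  assert(dx >= 0 && dx <= kStep && dy >= 0 && dy <= kStep && dz >= 0 &&
         dz <= kStep);
  // Left = C0 + Term1 + Term2 + Term3, scaled by 2^18.
  const std::int64_t left =
      c[0] * kScale + (c[1] * dx + c[2] * dy + c[3] * dz) * kStep * kStep;
  // Right = Term4 + Term5 + Term6 + Term7, scaled by 2^18.
  const std::int64_t right =
      (c[4] * dx * dy + c[5] * dx * dz + c[6] * dy * dz) * kStep +
      c[7] * dx * dy * dz;
  // With non-negative corner values the sum is a non-negative weighted
  // average, so integer division here is equivalent to a right shift by 18.
  return (left + right) / kScale;
}
```

Computing Left and Right independently mirrors the idea of splitting the seven additions across cycles: the two partial sums have no data dependency on each other until the final add.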




Figure 16: Manual Pipeline of TLI Algorithm
This method can produce very efficient results, with good timing and low area usage. Designing circuitry in RTL style rather than HLS style is akin to programming in assembly language instead of a high-level language such as C or Java. In the code listed in section 11.2.1, each block, or cycle, is implemented in a separate VHDL process block. As with assembly language, however, this method is much more difficult to maintain and change. In this example, a change made in one process block may ripple through all four blocks and create many opportunities for error. After any change, the design has to be re-validated to be sure the changes did not introduce new bugs before it can be re-synthesized. The source code for this method is 342 lines long.
4.2 High Level Synthesis Method 1
In the first High Level Synthesis method, standard HLS techniques were used: code is written at a high level, HLS constraints are provided as additional input to the synthesis tool, and the tool schedules the design. Scheduling includes creating a pipeline, choosing resources to share and creating a state machine to manage data flow through the design. There is no manual allocation or scheduling (see section 11.2.2). Figure 17 shows a fragment of the algorithm code with the computational operations and signal assignments removed. As Figure 17 makes evident, no architecture is built into the source code.
Figure 17: HLS Coding of TLI Algorithm




PROCESS
  -- VARIABLE declarations go here
BEGIN
  reset_loop: loop  --synopsys line_label reset_loop
    -- Reset signals here.
    wait until clk = '1' and clk'event;
    -- (remaining reset statements omitted)
    wait until clk'event and clk = '1';
    if reset = '0' then exit reset_loop; end if;
    algo_loop: loop  --synopsys line_label calc_loop
      -- Algorithm goes here.
    end loop algo_loop;
  end loop reset_loop;
END PROCESS;


















The source code for HLS Method 1 is 171 lines long, half the 342 lines required by the RTL style of code. Since no architecture is designed into the source code, it is not necessary to retest the code when new requirements for technology, clock speed or area arise. Therefore, no new errors are introduced, and the design does not need to be re-validated before synthesis can begin. New architectures can easily be explored using this method.
As shown in Figure 17, this HLS coding style employs a sequence of nested loops. The outermost loop is the reset loop. From any internal loop it is possible to exit to the reset state, which forces the reset to be synchronous. When the data-ready signal goes high, the algorithm calculation loop can start, and it continues until a reset signal is asserted. Nothing in this coding style indicates the number of cycles required, the technology used or the clock speed. Timing constraints can be applied to each loop independently. Since the algorithm loop is the most computationally intensive, constraints are set on this loop. Some of these constraints set the maximum or minimum number of cycles the loop can have; others pipeline a loop automatically so that the design achieves a throughput of one.
4.3 High Level Synthesis Method 2
The second HLS method is similar to Method 1 in that the code is written at a high level. However, it employs a different coding style and synthesis technique: a strategy called Behavioral Retiming. Behavioral Retiming (BRT) takes a gate-level netlist that has already been compiled to a target technology [9, pg. 1-3] and moves flip-flops through the logic of the design without modifying the compiled logic. BRT does this by moving flip-flops across combinational gates or by merging flip-flops. In this study, the automatic pipelining feature was used: BRT can automatically pipeline a design by taking a purely combinational design and adding registers at the output boundary. The steps performed are [9, pg. 1-8]:
1. Insert registers at output ports
2. Collapse the clock tree
3. Perform minimum-period and minimum-area retiming
4. Connect registers to the clock, reset and pipeline stall ports
5. Perform incremental compile and design rule fixes
The code fragment for this method is shown in Figure 18.
Figure 18: Behavioral Retiming Code Fragment
ARCHITECTURE alu_behave OF alu IS
  -- Declare signals here
BEGIN
  -- Put the purely combinational algorithm here (not much different
  -- from the equation shown in Equation 3).
END alu_behave;
Once the purely combinational design is created, it is compiled to the target technology library. There are no clock, reset or pipeline-stall signals that affect the behavior of the design. The gate-level netlist of the combinational design is used as input to the BRT script, and it is the BRT script that adds the clock, reset and pipeline-stall signals.


















Method 2 took only 105 lines of HLS code. Both the RTL method and HLS Method 1 took an average of 6 to 8 hours to complete compilation, whereas HLS Method 2 took only about 2 hours to compile completely and produce a synthesized netlist that met design requirements.
5 TESTING AND SIMULATION
A great deal of planning must take place when creating custom integrated circuits. The designer must be certain that the design is correct before sending it to the chip supplier for fabrication, and this algorithm was treated no differently. The first model of the algorithm was a test model written in C++. Since this algorithm uses a look-up table as the source data for its calculations (see section 2.3 Empirically Determined Color Space Conversion), only a limited number of data points are available to check for accuracy. The C++ model was therefore designed so that output from the circuitry could be compared against its output. Once the C++ model was created and could generate results for any input, the design was created in VHDL using all three methods, and each method was verified against the output of the C++ model. Once the VHDL model was known to be sound, the design was synthesized and a gate-level VHDL netlist was generated. The gate-level netlist was simulated and compared against the output from the C++ model.
5.1 C++ Model
For the C++ model, a Color Class was designed for easy manipulation of data. Equation 3 is a calculation in three dimensions, so operator overloading was designed into the Color Class so that three-dimensional color values, such as Red, Green and Blue, could easily be added to, subtracted from, multiplied by or divided by another color value. The type of the color values can easily be changed between integer and floating point via the colorType value. The equations for memory addressing and the tri-linear interpolation calculation are identical to the ones used in the VHDL model.























colorType L, a, b;
colorType R, G, B;
};
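The fragment above shows only the data members of the Color Class. To illustrate the operator-overloading approach described in section 5.1, the sketch below gives a minimal RGB color type whose arithmetic operators work component by component, with the component type selected by a single `colorType` alias. The struct layout, the omission of the L, a, b members and the exact set of operators are assumptions for illustration, not the thesis's actual class.

```cpp
#include <cassert>

// Change this alias to int to model integer color arithmetic instead.
using colorType = double;

// Hypothetical three-component color with component-wise operators.
struct Color {
  colorType R, G, B;

  Color operator+(const Color& o) const { return {R + o.R, G + o.G, B + o.B}; }
  Color operator-(const Color& o) const { return {R - o.R, G - o.G, B - o.B}; }
  Color operator*(const Color& o) const { return {R * o.R, G * o.G, B * o.B}; }
  Color operator/(const Color& o) const { return {R / o.R, G / o.G, B / o.B}; }
  // Scale every component by the same factor, e.g. a tri-linear weight.
  Color operator*(colorType s) const { return {R * s, G * s, B * s}; }
  bool operator==(const Color& o) const {
    return R == o.R && G == o.G && B == o.B;
  }
};
```

With operators like these, a term such as C1ΔX in Equation 3 can be written once for a whole color (for example `c1 * dx`) instead of three times, once per channel.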
5.2 VHDL Code Testing
The VHDL model was tested using a VHDL test bench. The test bench stores all output generated by the algorithm in an ASCII file, which can then be used to compare the results from the VHDL model against the results generated by the C++ model. Some error is expected in the VHDL model that the C++ model does not exhibit, because of the nature of the digital circuitry: the circuitry is only 8 bits wide and integer based, whereas the C++ model can calculate with 32-bit floating-point values. In addition, the C++ model can divide by odd numbers such as 63 very accurately, while division by 63 is difficult to implement efficiently in digital circuitry. For this reason, when (X - X0) = 63 in Equation 3, it is rounded to 64 in the VHDL model. When the results from the VHDL model are compared against the C++ model, the amount of error is analyzed in a spreadsheet program to determine whether it is within an acceptable range. For this study, the acceptable range is less than 2% error.
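The comparison against the 2% threshold is simple to automate. The C++ sketch below assumes paired output samples from the C++ model and the VHDL simulation, computes the relative error of each pair and checks it against the tolerance. It also shows why the 63-to-64 rounding fits inside that budget on its own: dividing by 64 instead of 63 shrinks the affected quotient by a factor of 63/64, a relative error of 1/64, or 1.5625%. The function names and the convention for a zero reference value are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Relative error in percent between a reference (C++ model) value and a
// measured (VHDL simulation) value. Assumed convention: a zero reference is
// 0% error if the measured value is also zero, otherwise infinite error.
double percent_error(double reference, double measured) {
  if (reference == 0.0)
    return measured == 0.0 ? 0.0 : std::numeric_limits<double>::infinity();
  return std::fabs(measured - reference) / std::fabs(reference) * 100.0;
}

// True if every paired sample is strictly below tol_percent error.
bool within_tolerance(const std::vector<double>& reference,
                      const std::vector<double>& measured,
                      double tol_percent = 2.0) {
  if (reference.size() != measured.size()) return false;
  for (std::size_t i = 0; i < reference.size(); ++i)
    if (percent_error(reference[i], measured[i]) >= tol_percent) return false;
  return true;
}

// Relative error introduced by dividing by 64 instead of 63 (1/64 = 1.5625%).
double divisor_rounding_error_percent() {
  return (1.0 / 63.0 - 1.0 / 64.0) / (1.0 / 63.0) * 100.0;
}
```

Note that 8-bit truncation adds further error on top of the divisor rounding, so small output values can exceed 2% even when the 63-to-64 substitution alone would not.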
This algorithm was designed three different ways: with traditional RTL, with High Level Synthesis and with Behavioral Retiming. Each method was tested using the same test bench and evaluated against the C++ model. Before synthesis, all three methods gave the same results, but with different timing: the RTL method had a latency of four cycles, whereas the two behavioral methods had only one cycle of latency, since no architecture was designed into their source code. Once the initial designs for all three methods were verified to be accurate, synthesis of the designs could begin.
5.3 Gate Level Verification
After all three methods met design requirements during synthesis, a VHDL gate-level netlist was exported from the synthesis tool. At this point, the test bench used for pre-synthesis testing might not work for post-synthesis testing. Before synthesis, the latency and throughput of the behavioral designs were both one cycle; after high level synthesis there is a very good chance that both have changed, because they depend on the constraints, or lack of constraints, provided to the synthesis tool. The requirement for this study was a design with a throughput of one. Since the throughput did not change between pre- and post-synthesis, the test bench was good for both. If the throughput were not one, the test bench would have to be modified to handle the different data rate, via handshaking signals between the test bench and the synthesized algorithm block.
The latency in this study often varied as new constraints were added to drive better results; during synthesis trials it varied between 3 and 10 cycles. This variation had to be accounted for when comparing results for correctness, which was done by shifting the expected results by the appropriate number of clock periods. The shift was easily found visually in a spreadsheet program, without any special software.
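The spreadsheet alignment can also be expressed as a small program. The C++ sketch below assumes the expected samples are the pre-synthesis VHDL outputs (so an exact integer match is appropriate) and that the gate-level output stream is recorded from the same cycle as the input stimulus. It searches the 3 to 10 cycle range observed in this study for the shift that makes the streams agree. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// True if actual[i + latency] == expected[i] for every expected sample.
bool matches_with_latency(const std::vector<int>& expected,
                          const std::vector<int>& actual,
                          std::size_t latency) {
  if (actual.size() < latency + expected.size()) return false;
  for (std::size_t i = 0; i < expected.size(); ++i)
    if (actual[i + latency] != expected[i]) return false;
  return true;
}

// Smallest latency in [min_latency, max_latency] that aligns the two
// streams, or std::nullopt if no shift in that range works.
std::optional<std::size_t> find_latency(const std::vector<int>& expected,
                                        const std::vector<int>& actual,
                                        std::size_t min_latency = 3,
                                        std::size_t max_latency = 10) {
  for (std::size_t l = min_latency; l <= max_latency; ++l)
    if (matches_with_latency(expected, actual, l)) return l;
  return std::nullopt;
}
```

Searching for the shift, rather than hard-coding it, lets the same check be rerun unchanged each time a new constraint moves the latency.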
All three methods were simulated using the gate-level netlists generated by Synopsys, and all three were shown to give correct results using the LSI 10k library. The synthesis results for the CBA Core library were not verified by gate-level simulation because its VHDL gate-level library was not available; the synthesis vendor provided a gate-level library only for the LSI 10k library. It is assumed that if post-synthesis simulation works for the LSI library, it will very likely also work for the CBA Core library. This assumption can be made because no architecture information is embedded in the HLS code and the CBA Core library is a much higher performing library. The higher performance of the CBA Core library can be seen in Figure 19: the slack time for the RTL synthesis with the CBA Core library is 32ns. This is over half of the clock period.




This chapter discusses the synthesis strategy employed and how the scripts
worked to create the design that met requirements.
6.1 Script common to all Methodologies
All of the scripts must perform some common tasks that are independent of synthesis strategy. These tasks choose the target synthesis library and the wire load models within it, set up operating conditions, and configure other parts of the environment such as the search path and the library names. They were all grouped together into a file named technology.scr. By setting a single "technology" variable in the calling script, technology.scr sets up everything the synthesis tool needs to use that technology; in this way the LSI, CBA Core or LSI lca300 library could be chosen. The technology script is shown in section 10.1 Technology.scr.
6.2 RTL Script
The RTL synthesis script is the simplest of the three because so much of the architecture is predetermined by the source code. The technology variable must be set and the technology script called before anything else is done in the RTL script. After the technology library information is set, the clock period can be declared and synthesis can begin. Synthesizing this design takes only a simple nine-line script:
1. clock_period = 50
This line sets the clock period variable. It is convenient to put this at the head of the script so that "magic numbers" are not scattered throughout the script.
2. analyze -format vhdl alu.vhd
This line checks the source file for syntax errors.
3. elaborate alu
Elaboration creates a netlist that the synthesis tool understands so that it can compile the design later.
4. create_clock clk -p clock_period
The create_clock command sets the period of the clock. The compiler needs this information so that it can meet the timing requirements of the design.
5. set_operating_conditions -library my_library my_ops
6. set_wire_load my_wire_load -library my_library
Steps 5 and 6 set operating conditions relating to the fabrication of the chip and the environment it will work in. The variables my_ops, my_wire_load and my_library are set in the technology.scr script.
7. transform_csa
Transform_csa converts multipliers into carry-save adders. Carry-save adders are smaller and faster than ripple-carry adders and carry-lookahead adders. This is a feature of Behavioral Compiler from Synopsys; the command is used here so that the comparison between techniques is fair.
8. compile -map_effort medium
This command compiles the VHDL design into a gate-level representation.
9. write -format db -o design_output -hierarchy
This command saves the design for future use.
6.3 HLS Method 1
HLS Method 1 has the most complicated script of the three. The design must be converted into a format that Behavioral Compiler can schedule; it must then be scheduled and compiled. All the scripts used to compile Method 1 are listed in section 10.3 HLS Method 1 Scripts. The steps required for synthesizing this method are:
1. bc_enable_multi_cycle = "true"
This setting allows pipelined components to be used. These components can span multiple clock cycles and can improve clock speed. If pipelined components are not allowed, the clock must be slow enough for entire operations to fit into a single cycle.
2. analyze alu.vhd -format vhdl
This line checks the source file for syntax errors.
3. elaborate -schedule design -lib WORK
Elaboration creates a netlist that the synthesis tool understands so that it can compile the design later. The -schedule switch creates the control/data flow graph (CDFG) of the design.
4. transform_csa
Transform_csa converts multipliers into carry-save adders. Carry-save adders are smaller and faster than ripple-carry adders and carry-lookahead adders. This is a feature of Behavioral Compiler from Synopsys.
5. bc_estimate_timing_effort = high
This setting allows more time in the schedule. The default value for this synthesis variable is medium; with the default value it was not possible to meet timing requirements when the wire load was set.
6. create_clock clk -period clock_period
The create_clock command sets the period of the clock. The compiler needs this information so that it can meet the timing requirements of the design.
7. loop_label = find(cell -hier "algo_loop")
The algorithm loop in the HLS code is identified by its label, "algo_loop". The cells that correspond to the loop must be identified so that the loop can be pipelined to achieve a throughput of one.
8. pipeline_loop loop_label -init 1 -latency cycles
This command pipelines the algorithm loop. It is the most powerful command in this method, because the architecture of the design takes shape here. The latency of the pipeline is set by the -latency switch followed by an integer. The initiation interval of the pipeline, the number of cycles between the starts of successive loop iterations, is set by the -init switch followed by an integer; an interval of 1 gives a throughput of one.
9. schedule -io iomode -eff med
This command takes the CDFG as input, allocates resources and schedules the design. See Section 3.3 for a description of scheduling.
10. set_operating_conditions -library my_library my_ops
11. set_wire_load my_wire_load -library my_library
Steps 10 and 11 set operating conditions relating to the fabrication of the chip and the environment it will work in. The variables my_ops, my_wire_load and my_library are set in the technology.scr script.
12. compile -map_effort high -boundary_optimization
This command compiles the VHDL design into a gate-level representation.
13. write -format db -o design_output -hierarchy
This command saves the design for future use.
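The effect of the -init and -latency arguments can be estimated with the usual pipeline timing model: with an initiation interval of II cycles and a latency of L cycles, the first result appears after L cycles and each later result follows II cycles after the one before it, so N samples take L + (N - 1)·II cycles. The C++ sketch below evaluates that model. It is a back-of-the-envelope estimate, not a Behavioral Compiler report, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Cycles needed to push n samples through a loop pipelined with initiation
// interval ii (cycles between the starts of successive iterations) and a
// latency (cycles from an iteration's start to its result).
std::uint64_t pipeline_cycles(std::uint64_t n, std::uint64_t ii,
                              std::uint64_t latency) {
  if (n == 0) return 0;
  return latency + (n - 1) * ii;
}
```

With -init 1 and a four-cycle latency, 1,000 samples take 1,003 cycles, close to one result per clock; a loop that starts each iteration only after the previous one finishes (II equal to the latency) would need 4,000.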
6.4 HLS Method Number 2 (Behavioral Retiming)
HLS Method 2 is a very easy and fast way to create a pipelined design
automatically. The design is simply compiled without any clock signals and then
re-timed (pipelined) after the initial compilation is complete. The steps for this
method are listed below:
1. analyze -format vhdl alu.vhd
This line checks the source file for syntax errors.
2. elaborate alu
Elaboration creates a netlist that the synthesis tool understands so that it can compile the design later.
3. set_operating_conditions -library my_library my_ops
4. set_wire_load my_wire_load -library my_library
Steps 3 and 4 set operating conditions relating to the fabrication of the chip and the environment it will work in. The variables my_ops, my_wire_load and my_library are set in the technology.scr script. For an unknown reason, BRT did not work properly when the wire load and operating conditions were set. As stated earlier, the first step in pipelining the design is to register the outputs; however, when the wire load and operating conditions were set, the pipeline_design command would not register the outputs. To obtain a working design, steps 3 and 4 were omitted. When a new version of the tool is released, it should be checked whether this problem still exists.
5. compile -map_effort high -boundary_optimization
This command compiles the VHDL design into a gate-level representation.
6. pipeline_design alu -period clock_period -stall_ports ready -stages cycles -sync_reset reset -clock_port_name clk -reset_polarity low -stall_polarity low -check_design
This command takes the purely combinational tri-linear interpolation equation and pipelines it. All that is required is the clock period, the number of cycles desired in the pipeline, a list of the pipeline stall ports and the reset port name. If the synthesis tool can meet these constraints, the design is complete and ready for post-synthesis verification. If it cannot meet the constraints, the number of stages in the pipeline or the clock period should be relaxed until timing requirements are met. This iterative process is very quick compared with the RTL and HLS Method 1 approaches: their synthesis time was approximately 8 hours, while the Behavioral Retiming method took only about 2 hours.
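The relax-and-retry loop can be illustrated with a deliberately simplified model. The C++ sketch below treats the datapath as a linear chain of operations with known delays and packs consecutive operations greedily into clock stages, ignoring register setup and clock-to-output overhead. This is not the retiming algorithm used by pipeline_design; it only shows how the number of stages needed grows as the clock period shrinks, which is why either the stage count or the period has to be relaxed when constraints are not met. The operation delays in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Minimum number of pipeline stages for a linear chain of operations whose
// combinational delays (ns) are given in order, at the given clock period.
// Greedy packing is optimal for a chain: each stage is filled as far as it
// can go before a register boundary is inserted. Returns std::nullopt if a
// single operation is longer than the period (it would have to be split).
std::optional<std::size_t> stages_needed(const std::vector<double>& delays_ns,
                                         double period_ns) {
  std::size_t stages = 0;
  double stage_delay = 0.0;
  for (double d : delays_ns) {
    if (d > period_ns) return std::nullopt;
    if (stages == 0 || stage_delay + d > period_ns) {
      ++stages;  // insert a register boundary and start a new stage
      stage_delay = d;
    } else {
      stage_delay += d;  // the operation still fits in the current stage
    }
  }
  return stages;
}

// True if the chain fits within max_stages at the given period.
bool meets_constraints(const std::vector<double>& delays_ns, double period_ns,
                       std::size_t max_stages) {
  const auto s = stages_needed(delays_ns, period_ns);
  return s.has_value() && *s <= max_stages;
}
```

For a made-up chain of four 20ns operations, a 50ns period needs two stages, while halving the period to 25ns doubles the requirement to four stages.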
7 RESULTS
As discussed in earlier chapters, three methods were used to synthesize the tri-linear interpolation algorithm. In addition, two different target libraries were used, both provided by Synopsys. As discussed in section 1.3, three quality measures are examined to determine the quality of the synthesis results: area, performance and power consumption. Figure 19 shows the results of the different synthesis techniques and the quality measures for each method.
For the area measure, the sum of the combinational and non-combinational area is used. Net interconnect area and any area added during place and route are not considered, since the synthesis results were not carried through to place and route. Area is given in the units reported by Synopsys and is used only to compare the methods against one another.
For the performance quality measure, three items are considered: the number of cycles needed to execute the algorithm, the clock period needed and the throughput of the design.
The power quality measure was determined simply by using the Synopsys report_power command, which reports the dynamic power used by the design. In this study that is the net switching power, because neither the LSI 10K nor the CBA Core library reports any leakage power or cell internal power.
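As background for reading the power results, net switching power is conventionally modeled as shown below, where the sum runs over all nets, α_i is the average number of transitions of net i per clock cycle, C_i is its load capacitance, V_DD is the supply voltage and f is the clock frequency. This is the standard textbook model, offered as a sketch of what the switching-power figure measures rather than a description of report_power's internals; the activity values assumed by the tool in this study are not reported here.

$$
P_{\text{switch}} = \frac{1}{2}\, V_{DD}^{2}\, f \sum_{i \in \text{nets}} \alpha_i\, C_i
$$

Because every term in the sum is non-negative, a netlist with more registers and nets toggling each cycle reports more switching power at the same clock frequency and supply voltage.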
LSI LIBRARY








RTL 41,191 0.03 407.7 1 50ns 4 317 3 hours















50ns 4 317 3 hours
HLS 131,338 16.37 508.9 mW
BRT 147,823 2.25 562.2 mW
Figure 19: Synthesis Results
Note that the BRT method of synthesis for the LSI library took only three cycles. This is believed to be caused by a bug in the synthesis tool: the method would not work properly when the wire load and operating conditions were set. As discussed in Section 4.3, the first thing the pipeline_design command should do is insert registers at the output ports. For some reason, when the operating conditions and wire load were set, pipeline_design would not register the outputs. Because the BRT design would not simulate properly with the operating conditions set, they were omitted for this method, which gave it an unfair advantage over the other two methods: it was much easier for it to meet timing requirements. This method would not have been able to synthesize in fewer than four cycles if the wire load and operating conditions had been set.
The results show that behavioral synthesis did not do as well as a manually pipelined design using an RTL description. However, for certain criteria the trade-offs are not that great. For example, if design requirements changed so that the clock speed had to increase by 50%, the RTL design would be useless, since there is almost no slack in the critical path (0.03ns). More pipeline stages would be needed to meet timing requirements, so the algorithm would certainly have to be redesigned. The HDL source of a redesigned circuit would then need to be re-verified, since new bugs may have been introduced, and any errors in the new design would have to be removed before synthesis could start again. In both HLS methods, only the constraint arguments of pipeline_design or pipeline_loop would need to be modified, and the design could be re-synthesized immediately. This can save a great deal of time in the design phase of product development.
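The 50% scenario can be quantified with a rough calculation. Treating slack as the clock period minus the critical-path delay, and assuming the critical path of the existing RTL netlist does not change, the 20MHz (50ns) design with 0.03ns of slack would face the following at 30MHz:

$$
\begin{aligned}
t_{\text{crit}} &\approx T - \text{slack} = 50\,\text{ns} - 0.03\,\text{ns} = 49.97\,\text{ns} \\
T_{\text{new}} &= \frac{50\,\text{ns}}{1.5} \approx 33.33\,\text{ns} \\
\text{slack}_{\text{new}} &\approx 33.33\,\text{ns} - 49.97\,\text{ns} \approx -16.64\,\text{ns}
\end{aligned}
$$

A negative slack of roughly half the new clock period is far more than re-optimizing the same four-stage structure could be expected to recover.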
Since the architecture is not "hard-coded" into the design, it is also very easy to change target libraries. In the RTL example, four pipeline cycles are used because of the limitations of the LSI 10k library. If a new library such as the CBA Core library is targeted, however, an excessive number of registers will be used, since only two stages are needed. This will either force the designer to redesign the circuitry or waste silicon, which may drive up cost and decrease yield. The waste in the RTL design is evident in the CBA Core library results: there, the HLS method used only 2% more area than the RTL method, whereas with the LSI library it used 21% more area.
Synthesis time also varied greatly among the three methods. The RTL and HLS methods both took approximately the same amount of time, but the BRT method was substantially quicker: most BRT synthesis batch jobs took less than half the time of the other two methods. Since the BRT method had fewer lines of code, shorter synthesis time and shorter development time, it has a lot of potential.
The power used in the CBA Core library by the two HLS methods seems disproportionately high. It is unknown why the dynamic power of these two methods is around 200% more.
8 CONCLUSIONS
For designs with tight schedules and evolving requirements, HLS is a very attractive tool. Changes to latency, speed, area or technology requirements do not require changing the architecture of the design and re-simulating it. As the area and timing results of this study show, Behavioral Compiler performed well.
8.1 Justification for High Level Synthesis
For the LSI library, area results were only 20-30% worse than those of a manually pipelined design, and this shortcoming can be offset by other strengths. For example, if requirements changed so that the design required a 25ns clock, a design built with the RTL method would be useless: it has only 0.03ns of slack, and further optimization will not improve this situation. The RTL design would need major rework and re-testing before synthesis could begin, which, depending on the nature of the algorithm or design, could take a significant amount of time in the design cycle. Neither the HLS nor the BRT method, however, specifies any architecture in the code. For these methods, a few constraints simply need to be modified in the synthesis script, and the synthesis tool can then attempt to meet the more aggressive requirements.
Behavioral Synthesis is also useful when the target library changes or is not yet known. In this study, the LSI library was the original target library, and it was not possible to pipeline the design in fewer than four cycles. When the design is retargeted to the CBA Core library, however, only two cycles are needed, leaving approximately 128 unneeded registers (8 bits × 8 data points × 2 cycles). Either the designer must live with this waste or, as in the previous scenario, the design will need major rework and re-testing before synthesis can begin again. This is why the HLS method is only 2% worse than the RTL method for the CBA Core library: in that library, the RTL design carries many unneeded registers.
Another attractive element of HLS is the significantly smaller amount of code required. Considering the time needed for debug and maintenance, having only half as much code to maintain is a major advantage. The code is also easier to understand, since the algorithm is clearly described in the behavioral methods. In the RTL method, the algorithm is "buried" in four different process statements, which makes it very difficult to understand what is going on.
8.2 Concerns using High Level Synthesis
There are some considerations before deciding to use an HLS tool. The first is that an engineer practiced at writing RTL-style code will have to learn a new style, which can be confusing at first. Instead of familiar if-then-else or case statements, a nested-loop strategy must be used, as shown in Figure 17, and some engineers may find it challenging to implement handshaking signals this way.
In addition, HLS has limitations that do not exist in standard RTL methods. HLS cannot schedule across multiple processes, so care must be taken to ensure that timing between processes is correct. This is especially true if the I/O of the design is modified by using the Superstate-fixed or Free-floating scheduling modes described in sections 3.4.2 and 3.4.3.
8.3 Recommendations
Based on the results of this study, HLS can be recommended for designs in which area is not a major concern; RTL methods should be used if area is a critical factor, since the hand-architected design still outperformed the HLS tools in area. However, with today's advances in silicon technology, area is not a major problem for many medium to large designs.
If design requirements are not completely defined or if the target library is not
known, using HLS techniques can be a great advantage. An engineer will be able
to start a design and verify its functionality while requirements become more firm
and foundry selection proceeds. Once these things are known, the constraints will
be set in the synthesis script and it may not be necessary to change the design.
Another benefit of HLS is the ability to quickly experiment with different architectures. This can be very useful for mobile devices, where the design with the lowest power consumption is desired. It could take days or weeks to re-code designs in order to compare the power consumption of different architectures, while this can be done in hours using HLS.
9 REFERENCES
[1] Eastman Kodak Company, Fully Utilizing Photo CD Images, Article Number 4 - PhotoYCC Color Encoding and Compression Schemes, April 1994 (http://www.kodak.com/country/US/en/digital/techInfo/pcd-045.shtml)
[2] Daniel Gajski, Nikil Dutt, Allen Wu, Steve Lin, High-Level Synthesis: Introduction to Chip and System Design, Kluwer Academic Publishers
[3] Vijay Raghavan, Personal Correspondence, Synopsys Technical Instruction
[4] Henry R. Kang, Color Technology for Electronic Imaging Devices, SPIE Press, 1997
[5] Mark Brown, PCI ASIC Reference Manual, Eastman Kodak Company, November 1997
[6] David Knapp, Behavioral Synthesis, Prentice Hall, 1996
[7] Giovanni De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill Inc, 1994
[8] P. G. Paulin and J. P. Knight, Force-Directed Scheduling for the Behavioral Synthesis of ASICs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 8, June 1989
[9] Synopsys, Behavioral Compiler User Guide, Version 1998.08
[10] John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers
[11] J. Bhasker, A VHDL Primer, Prentice Hall, 1995
[12] Raul Camposano, Path-Based Scheduling for Synthesis, IEEE Transactions on Computer-Aided Design, vol. 10, no. 1, January 1991
10 APPENDIX A SYNTHESIS SCRIPTS
10.1 technology.scr
The script below is called by the scripts for every synthesis method. It sets up search paths and defines the libraries and operating conditions used during scheduling and/or compiling.
/*
 * Description : This script will set global variables to be used by
 *               all synthesis jobs. The thought behind this file is
 *               to be able to set the operating environment for all other
 *               scripts from one master script. Then I can easily switch
 *               between synthesis libraries, operating conditions, and
 *               turn phone messaging on in one file.
 * Date        : March 29, 1999
 * Author      : Tom Glanville
 * History:
 *   March 29 : Creation.
 */
if (technology == "LSI") {
















} else if (technology == "tc6a") {









link_library = tc6a.db + link_library
link_library = dw_foundation.sldb + link_library




} else if (technology == "ACT") {












10.2 RTL Compilation Script
The script below was used to compile the RTL method.
/*






 * Date           : Note
 *
 * March 14, 1999 : Creation
 * March 29, 1999 : Added support for the global technology.scr file
 *
 * May 8, 1999    : Added support











include scripts/technology.scr
design_output = "DB/alu_rtl_" + technology +
"
report_file = "reports/alu_rtl_" + technology + ".txt";
vhdl_file = "results/alu_rtl_" + technology + ".vhd";
analyze -format vhdl SRC/typepack.vhd
analyze -format vhdl RTLSRC/alu.vhd
elaborate alu
create_clock clk -p clock_period
set_operating_conditions -library my_library my_ops
set_wire_load my_wire_load -library my_library
transform_csa
echo "#### Starting Compilation ########" > report_file
sh date >> report_file
echo "##################################" >> report_file
compile -map_effort medium
echo "#### Finished Compilation ########" >> report_file
sh date >> report_file
echo "##################################" >> report_file




write -format vhdl -hier -output vhdl_file
if (messages == "ON"){
/* This script will call me on my Cell phone




10.3 HLS Method 1 Scripts
The next four scripts were used for HLS method number 1.
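For orientation, the pipelining parameters used by these scripts can be turned into a latency and throughput estimate. This worked example is a sketch under stated assumptions: clock_period is taken to be in nanoseconds (the second synthesis run is commented "50 ns clock"), cycles = 4 is the latency passed to pipeline_loop, and the -init 1 option of pipeline_loop in the scheduling script is read as an initiation interval (II) of one clock, so a new pixel can enter the loop every cycle.

```latex
\begin{aligned}
T_{clk} &= 50\ \text{ns} \\
\text{latency} &= \text{cycles} \times T_{clk} = 4 \times 50\ \text{ns} = 200\ \text{ns} \\
\text{throughput} &= \frac{1}{II \times T_{clk}} = \frac{1}{1 \times 50\ \text{ns}} = 20 \times 10^{6}\ \text{pixels/s}
\end{aligned}
```

These are idealized figures derived from the script settings, not measured results.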
10.3.1 Master HLS Method 1 Synthesis Script
/*
 * The following variables set up the scheduling
 * environment and constraints.
 * Variables are not set up in any other scripts.
 * All settings should be made here.
 */
/*
* First synthesis run
* 4 stage pipeline
* LSI Technology













messages = "ON" /* This will turn phone messaging on and off */
pipeline = "true"
clock_period = 50 /* master clock for design */
io_mode = superstate_fixed
cycles = 4
/*





 * The following line sets up the library, operating
 * conditions and wire load.
 */
include scripts/technology.scr
/*************************************************************/
/*
* The following line times the design.











* The next lines will schedule and compile the design





echo "####### Scheduling ########";
include scripts/alu_schedule.scr
echo "####### Compiling the design ###########";
include scripts/alu_BC_compile.scr
remove_design -designs
/*
* Second synthesis run
* 2 stage pipeline
* CBA CORE Technology
* 50 ns clock
*/




include scripts/technology.scr
echo "###### Timing Design ######";
include scripts/alu_timed.scr
echo "####### Scheduling ########";
include scripts/alu_schedule.scr
echo "####### Compiling the design ###########";








10.3.2 HLS Method 1 Timed Script
Description : This script will schedule the alu and then compile it.
Date : February 22, 1999
Author : Tom Glanville
The shell scripts success.bin and fail.bin call me on my cell






















1: Creation. Schedules the design
2 : Added compilation section
added BC View variable
Added bc_time_design command before schedule.
Supposedly schedule will automatically call
bc_time_design if it has not been called yet.
But I thought it would be good to manually
call it out for clarity.
Added bc_enable_multi_cycle = false command.
Removed the compile step and added it to the
alu_BC_compile.scr script.
Added support for global file technology.scr. This file sets the
operating conditions, wire load and other
variables.
Added the maximum cycle limitation constraint.
combined alu_elab.scr and alu_timed.scr
added the pipeline_loop command.
added report_file variable
Put the set_max_cycles command back.
Added date time stamp in report file
: Added final Clean up. Tweaked scripts to reflect final results.
remove_design -designs
/* This next line is needed for BC View */
bc_enable_analysis_info = true












analyze SRC/add_array.vhd -format vhdl -lib WORK
analyze SRC/typepack.vhd -format vhdl -lib WORK








csa + clock_period +
"_"

















echo design_output >> report_file
sh date >> report_file
echo
report_file
analyze SRC/alu.vhd -format vhdl
current_design = alu
elaborate -schedule -arch alu_behave -lib WORK alu
/* Include the next line for smaller area





Write out the elaborated design, but not scheduled or compiled */
write -format db -o design_output -hierarchy
10.3.3 HLS Method 1 Scheduling Script
/* clear out anything in memory */
remove_design -designs
bc_estimate_timing_effort = high




+ bc_speed + "_" + csa +
".db";
if (pipeline == true) {
design_output = "DB/alu" + "_" + technology +
"_sch_"
+ bc_speed + "_" + csa +
clock_period +
"_"
+ cycles + "pipe.db";




+ bc_speed + "_" +
csa + clock_period +
"_"
+ cycles + "pipe.txt";
} else {














+ bc_speed + "_" +




echo "##### alu_schedule.scr will read in ##########"
echo design_input






create_clock clk -period clock_period
loop_label = find (cell -hier "algo_loop")
if (pipeline == true ){
echo "####### Pipelining the Design! #######" >> report_file
pipeline_loop loop_label -init 1 -latency cycles
}
echo "###### Executing bc_time_design command #####" >> report_file





schedule -io io_mode -eff med
report_schedule -summ >> report_file
write -format db -o design_output -hierarchy
if (dc_shell_status == 0) {





echo "##### Scheduling was successful. Writing scheduled DB file. ######"
10.3.4 HLS Method 1 Compilation Script
/*
 * Description : This script will schedule the alu and then compile it.
 * Date        : February 22, 1999
 * Author      : Tom Glanville
 * The shell scripts success.bin and fail.bin call me on my cell





 * February 21 : Creation. Schedules the design
 * February 22 : Added compilation section
 * March 13    : added BC View variable
 * March 14    : Added bc_time_design command before schedule.
 *               Supposedly schedule will automatically call
 *               bc_time_design if it has not been called yet.
 *               But I thought it would be good to manually
 *               call it out for clarity.
 * March 24    : Creation.
 * March 29    : Added support for global file technology.scr. This file sets the
 *               library, operating conditions, wire load and other variables.
 */
remove_design -designs
if (pipeline == true) {
design_input= "DB/alu" + "_" + technology +
"_sch_"
+ bc_speed + "_" + csa +
clock_period +
"_"
+ cycles + "pipe.db";
design_output= "DB/alu" + "_" + technology +
"_comp_"
+ bc_speed + "_" + csa +
clock_period + "_" + cycles + "pipe.db";




+ bc_speed + "_" +
csa + clock_period +
"_"
+ cycles + "pipe.txt";




+ bc_speed + "_" + csa
+ clock_period +
"_"
+ cycles + "pipe.vhd";
}else{
design_input= "DB/alu" + "_" + technology +
"_sch_"
+ bc_speed + "_" + csa +
clock_period +
"_"
+ cycles + "_no_pipe
design_output= "DB/alu" + "_" + technology +
"_comp_"
+ bc_speed + "_" + csa +
clock_period +
"_"







+ bc_speed + "_" +










+ bc_speed + "_" + csa
+ clock_period +
"_"
+ cycles + "_no_pipe







set_operating_conditions -library my_library my_ops
set_wire_load my_wire_load -library my_library
compile -map_effort high -boundary_optimization




vhdlout_use_packages = {IEEE.std_logic_1164, IEEE.std_logic_arith,




write -format vhdl -hier -output vhdl_file
10.4 HLS Method 2 (Behavioral Retiming)
The next three scripts were used for HLS method number 2, which employed behavioral retiming.
10.4.1 HLS Method 2 Master Script
/*




Variables are not set up in any other scripts.






LSI, tc6a or lca300*/
include scripts/technology.scr
clock_period = 50
cycles = 4
/* Compile an asynchronous version of the design */
include scripts/brt_async_compile.scr
/* Retime / Pipeline the design */
include scripts/alu_brt_compile.scr
remove_design -designs
quit





include scripts/technology.scr
include scripts/brt_async_compile.scr
include scripts/alu_brt_compile.scr




10.4.2 HLS Method 2 Asynchronous Compile Script
/*
* This script will compile the ASYNC
* Version of the ALU Block
* The output of this will be used as
* input to the pipeline_design command









COMPILE THE ASYNC DESIGN WITH NO CLOCK **********/
analyze -format vhdl SRC/typepack.vhd
analyze -format vhdl SRC/alu_brt.vhd
elaborate alu
/*set_operating_conditions -library my_library my_ops
set_wire_load my_wire_load -library my_library */
/* Because of an apparent bug in Synopsys do not use the
previous two lines */
/* Include this line for smaller area and faster timing */
transform_csa
echo "##### Compiling Combinational Design ######"
compile -map_effort med -boundary_optimization
write -format db -o design_output -hierarchy
10.4.3 HLS Method 2 Retiming Script
/*
 * This script will compile the alu block in RTL mode
 * using Design Analyzer.
 * Then it will automatically pipeline the design - I hope!
*
* Date : Note
*



































/******* NOW DO BRT ***********/
read design_input
echo "##### Pipelining the Design ##########" > report_file
sh date >> report_file
create_clock clk -p 50




echo "########## Pipelining Complete ###########" >> report_file
sh date >> report_file







-format db -o design_output













11 APPENDIX B VHDL SOURCE CODE
The following section is a listing of the VHDL source code used to simulate,
verify and synthesize the design under study.
11.1 Testbench Source Code
This section lists all of the code that makes up the testbench. Refer to Figure 10 for a graphical view of how these files are structured and interconnected.
11.1.1 Index block.vhd
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;






input : IN unsigned(7 DOWNTO 0);
out_low : OUT unsigned(2 DOWNTO 0);
out_high : OUT unsigned(2 DOWNTO 0));
END index_block;














ELSIF (input < 64) THEN
out_low <= "000";
out_high <= "001";
ELSIF (input < 128) THEN
out_low <= "001";
out_high <= "010";















use ieee.std_logic_arith.all;





reset : IN std_logic;
red : IN unsigned (7 DOWNTO 0) ;
green : IN unsigned (7 DOWNTO 0) ;
blue : IN unsigned (7 DOWNTO 0) ;
r_low : OUT unsigned ( 2 DOWNTO 0) ;
g_low : OUT unsigned ( 2 DOWNTO 0) ;
b_low : OUT unsigned(2 DOWNTO 0);
r_high : OUT unsigned(2 DOWNTO 0);
g_high : OUT unsigned(2 DOWNTO 0);
b_high : OUT unsigned(2 DOWNTO 0));
END index;
ARCHITECTURE behave OF index is
COMPONENT index_block
PORT( clk : IN std_logic;
reset : IN std_logic;
input : IN unsigned (7 DOWNTO 0) ;
out_low : OUT unsigned ( 2 DOWNTO 0);






PORT MAP( input => red, clk=>clk, reset=>reset ,
out_low => r_low,
out_high => r_high) ;
INDEX2 : INDEX_BLOCK
PORT MAP (input => green, clk => clk, reset => reset,
out_low => g_low,
out_high => g_high) ;
INDEX3 : INDEX_BLOCK
PORT MAP (input => blue, clk => clk, reset => reset,
out_low => b_low,




use ieee.std_logic_1164.all;



























IN unsigned(2 DOWNTO 0);
OUT addressType);
ARCHITECTURE lut OF address IS
BEGIN
PROCESS (clk)




redlow_greenhigh, redhigh_greenlow : INTEGER;
BEGIN
IF (clk = '1' and clk'event) THEN


































CONV_INTEGER(blue_low_2), 7);
addr(1) <=
CONV_INTEGER(blue_high_2), 7);
addr(2) <=
CONV_INTEGER(blue_low_2), 7);
addr(3) <=




CONV_INTEGER(blue_high_2), 7);
addr(6) <=
CONV_INTEGER(blue_low_2), 7);
addr(7) <=






= CONV_INTEGER(red_high) * 25;
= CONV_INTEGER(green_low) * 5;
= CONV_INTEGER(green_high) * 5;
= CONV_INTEGER(blue_low) ;
= CONV_INTEGER(blue_high) ;
= red_low_temp + green_low_temp;
= red_high_temp + green_high_temp;
= red_low_temp + green_high_temp;








CONV_UNSIGNED(redhigh_greenlow
CONV_UNSIGNED(redhigh_greenhigh






use ieee.std_logic_arith.all;

















IN unsigned (7 DOWNTO 0) ;
IN unsigned (7 DOWNTO 0) ;
IN unsigned (7 DOWNTO 0) ;
OUT unsigned(7 DOWNTO 0);
OUT unsigned(7 DOWNTO 0);
OUT unsigned(7 DOWNTO 0));




IF (clk = '1' and clk'event) THEN




















use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
PACKAGE add_array IS




use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;






address : IN unsigned(6 DOWNTO 0);
data_out : OUT unsigned(23 DOWNTO 0));











































































































































































































data_out <= value(CONV_INTEGER(address));
END PROCESS;






use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
ENTITY rgb_mem IS
PORT( address : IN unsigned (6 DOWNTO 0) ;
data_out : OUT unsigned(23 DOWNTO 0));
END rgb_mem;
ARCHITECTURE rgb_mem_behave OF rgb_mem IS
BEGIN
PROCESS (address)
TYPE value_rom IS ARRAY (0




























































































































































is INTEGER RANGE -64 to 64;
is INTEGER RANGE -128 to 127;
is INTEGER RANGE -256 to 256;
is INTEGER RANGE 0 to 255;
SUBTYPE signed16bits is INTEGER RANGE






use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
USE WORK.add_array.all;



















IN unsigned (7 DOWNTO 0) ;
IN unsigned (7 DOWNTO 0) ;
IN unsigned(7 DOWNTO 0) ;
: OUT unsigned (7 DOWNTO 0) ;
: OUT unsigned (7 DOWNTO 0);
: OUT unsigned (7 DOWNTO 0) ;
d2, d3, d4, d5, d6, d7 : OUT unsigned(23 DOWNTO 0);
: OUT unsigned(23 DOWNTO 0) ) ;
















IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
OUT unsigned(2 DOWNTO
OUT unsigned(2 DOWNTO
OUT unsigned(2 DOWNTO
OUT unsigned(2 DOWNTO
OUT unsigned(2 DOWNTO











PORT( clk : IN std_logic;
reset : IN std_logic;
red_low : IN unsigned(










IN unsigned(2 DOWNTO 0);
IN unsigned(2 DOWNTO 0);
IN unsigned(2 DOWNTO 0);
IN unsigned(2 DOWNTO 0);
OUT addressType) ;
COMPONENT lab_mem
PORT( address : IN unsigned(6 DOWNTO 0);
data_out : OUT unsigned (23 DOWNTO 0) ) ;
END component ;
COMPONENT rgb_mem
PORT( address : IN unsigned (6 DOWNTO 0) ;


















IN unsigned (7 DOWNTO 0) ;
: IN unsigned (7 DOWNTO 0) ;
IN unsigned (7 DOWNTO 0) ;
OUT unsigned (7 DOWNTO 0) ;
: OUT unsigned (7 DOWNTO 0);
: OUT unsigned (7 DOWNTO 0));
SIGNAL ind0, ind1, ind2, ind3, ind4, ind5 : unsigned(2 DOWNTO 0);
SIGNAL addr : addressType;
SIGNAL red1, green1, blue1 : unsigned(7 DOWNTO 0);
SIGNAL ready1 : std_logic;
begin
reg1 : rgb_reg
PORT MAP ( redin=>red, greenin=>green, bluein=>blue, clk=>clk,
readyin=>readyin, ready=>ready1,
reset=>reset, redout=>red1, greenout=>green1, blueout=>blue1);
reg2 : rgb_reg
PORT MAP ( redin=>red1, greenin=>green1, bluein=>blue1, clk=>clk,
readyin=>ready1, ready=>ready,
reset=>reset, redout=>redout, greenout=>greenout,
blueout=>blueout);
index1 : INDEX




r_low => ind0, r_high => ind1,
g_low => ind2, g_high => ind3,
b_low => ind4, b_high => ind5);
address1 : address
PORT MAP( red_low => ind0, red_high => ind1, clk=>clk,
green_low => ind2, green_high => ind3,
blue_low => ind4, blue_high => ind5,
reset=>reset, addr=>addr);
memrgb : rgb_mem
PORT MAP( address => addr(2),
data_out => rgb0);
mem0 : lab_mem
PORT MAP( address => addr(2),
data_out => d0);
mem1 : lab_mem
PORT MAP( address => addr(3),
data_out => d1);
mem2 : lab_mem
PORT MAP( address => addr(0),
data_out => d2);
mem3 : lab_mem
PORT MAP( address => addr(1),
data_out => d3);
mem4 : lab_mem
PORT MAP( address => addr(6),
data_out => d4);
mem5 : lab_mem
PORT MAP( address => addr(7),
data_out => d5);
mem6 : lab_mem
PORT MAP( address => addr(4),
data_out => d6);
mem7 : lab_mem
PORT MAP( address => addr(5),






use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
USE WORK.add_array.all;
USE work.typepack.all;
entity chip is
PORT (red : IN unsigned(7 DOWNTO 0);
green : IN unsigned(7 DOWNTO 0);
blue : IN unsigned(7 DOWNTO 0);
reset : IN std_logic;




clk : IN std_logic;
output_ready : OUT std_logic) ;
end chip;
architecture behave of chip is
-- Mini_chip is the RTL partition of the algorithm
COMPONENT mini_chip
PORT( clk : IN std_logic;
reset : IN std_logic;
readyin : IN std_logic;
ready : OUT std_logic;
red : IN unsigned (7 DOWNTO 0) ;
green : IN unsigned (7 DOWNTO 0) ;
blue : IN unsigned (7 DOWNTO 0) ;
redout : OUT unsigned (7 DOWNTO 0) ;
greenout : OUT unsigned (7 DOWNTO 0) ;
blueout : OUT unsigned (7 DOWNTO 0) ;
d0, d1, d2, d3, d4, d5, d6, d7 : OUT unsigned(23 DOWNTO 0);
rgb0 : OUT unsigned(23 DOWNTO 0));
END component ;
-- ALU is the behavioral partition of the algorithm
COMPONENT alu
PORT (clk : IN std_logic;
reset : IN std_logic;
ready : IN std_logic ;




IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)

































IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned(7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0).
: IN unsigned (7 DOWNTO 0)
: IN unsigned (7 DOWNTO 0);
: IN unsigned(7 DOWNTO 0) ;
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
: OUT signed8bits;
: OUT signed8bits;
: OUT signed8bits) ;
SIGNAL ind0, ind1, ind2, ind3
SIGNAL addr : addressType;
SIGNAL redl, greenl , bluel
SIGNAL readyl : std_logic;










, ind5 : unsigned (2 DOWNTO 0)



















d7_B: unsigned (7 DOWNTO 0) ;













d0=>d0, d1=>d1, d2=>d2, d3=>d3,
d4=>d4, d5=>d5, d6=>d6, d7=>d7, rgb0=>rgb0);
d0_L <= CONV_UNSIGNED(d0(23 DOWNTO 16), 8);
d0_A <= CONV_UNSIGNED(d0(15 DOWNTO 8), 8);
d0_B <= CONV_UNSIGNED(d0(7 DOWNTO 0), 8);
d1_L <= CONV_UNSIGNED(d1(23 DOWNTO 16), 8);
d1_A <= CONV_UNSIGNED(d1(15 DOWNTO 8), 8);
d1_B <= CONV_UNSIGNED(d1(7 DOWNTO 0), 8);
d2_L <= conv_unsigned(d2(23 DOWNTO 16), 8);
d2_A <= conv_unsigned(d2(15 DOWNTO 8), 8);
d2_B <= conv_unsigned(d2(7 DOWNTO 0), 8);
d3_L <= conv_unsigned(d3(23 DOWNTO 16), 8);
d3_A <= conv_unsigned(d3(15 DOWNTO 8), 8);
d3_B <= conv_unsigned(d3(7 DOWNTO 0), 8);
d4_L <= conv_unsigned(d4(23 DOWNTO 16), 8);
d4_A <= conv_unsigned(d4(15 DOWNTO 8), 8);
d4_B <= conv_unsigned(d4(7 DOWNTO 0), 8);
d5_L <= conv_unsigned(d5(23 DOWNTO 16), 8);
d5_A <= conv_unsigned(d5(15 DOWNTO 8), 8);
d5_B <= conv_unsigned(d5(7 DOWNTO 0), 8);
d6_L <= conv_unsigned(d6(23 DOWNTO 16), 8);
d6_A <= conv_unsigned(d6(15 DOWNTO 8), 8);
d6_B <= conv_unsigned(d6(7 DOWNTO 0), 8);
d7_L <= conv_unsigned(d7(23 DOWNTO 16), 8);
d7_A <= conv_unsigned(d7(15 DOWNTO 8), 8);
d7_B <= conv_unsigned(d7(7 DOWNTO 0), 8);
resetl <= not reset;
Behavioral : alu
PORT MAP (clk => clk,
reset => resetl,
ready => readyl,
output_ready => output_ready,
P0_L => d0_L,
P0_A => d0_A,
P0_B => d0_B,
P1_L => d1_L,
P1_A => d1_A,
P1_B => d1_B,
P2_L => d2_L,




P3_L => d3_L,
P3_A => d3_A,
P3_B => d3_B,
P4_L => d4_L,
P4_A => d4_A,
P4_B => d4_B,
P5_L => d5_L,
P5_A => d5_A,
P5_B => d5_B,
P6_L => d6_L,
P6_A => d6_A,
P6_B => d6_B,
P7_L => d7_L,
P7_A => d7_A,
P7_B => d7_B,
R => rgb0(23 DOWNTO 16),
G => rgb0(15 DOWNTO 8),
B => rgb0(7 DOWNTO 0),
Rin => redl,
Gin => greenl,
Bin => bluel,
OUT_L => L,
OUT_A => A,



use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;
use STD.TEXTIO.all;




ARCHITECTURE state_machine OF testbench IS
FILE dump : TEXT open WRITE_MODE is "logs/output.txt";
COMPONENT chip
PORT ( clk : IN std_logic;
reset : IN std_logic;
red : IN unsigned (7 DOWNTO 0) ;
green : IN unsigned (7 DOWNTO 0) ;
blue : IN unsigned (7 DOWNTO 0) ;
readyin : IN std_logic;
output_ready : OUT std_logic;
L : OUT signed8bits;
A : OUT signed8bits;
B : OUT signed8bits) ;
end COMPONENT ;
SIGNAL red, green, blue : unsigned(7 DOWNTO 0);





SIGNAL L, A, B : signed8bits ;
BEGIN
UUT : chip
port map (clk=>clock, red => red, green=>green, blue=>blue,
readyin=>readyin,
reset=>reset, L=>L, A=>A, B=>B, output_ready=>output_ready);
INPUT : PROCESS
--TYPE vect_type IS ARRAY (0 to 3) of INTEGER;
--CONSTANT red_vector : vect_type := (0, 0, 64, 0 );
CONSTANT green_vector : vect_type := (64, 192, 255, 0) ;
--CONSTANT blue_vector : vect_type := (128, 255, 64, 0);
CONSTANT red_vector : vect_type := (25, 20, 50, 0 );
--CONSTANT green_vector : vect_type := (75, 170, 254, 0) ;
--CONSTANT blue_vector : vect_type := (100, 200, 73, 0);
VARIABLE red_var, green_var, blue_var : unsigned (7 DOWNTO 0);
VARIABLE state, i, r, rval , g, gval, b, bval, index : INTEGER := 0;
BEGIN


























wait until clock = '1'









FOR r IN 0 to 4 LOOP
FOR g IN 0 to 4 LOOP
FOR b IN 0 to 4 LOOP

















= conv_unsigned(gval, 8);
= conv_unsigned(bval, 8);
<= red_var after 20 ns;
<= green_var after 20 ns;






















wait until clock = '1' and clock'event;
IF (output_ready = '0') THEN exit finish_loop; END IF;
end loop;
assert (true)




OUTPUT : PROCESS (clock)
VARIABLE L_var, A_var, B_var : integer;
VARIABLE L_div, A_div, B_div : signed(10 DOWNTO 0);
VARIABLE buf : LINE;
VARIABLE delim : STRING (1 to 2)
BEGIN
-- This should only read when data is valid!
-- But let's just get it working and not be too













:= conv_integer (L );
conv_integer (A) ;
= conv_integer (B) ;






















use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
use work.typepack.all;
ENTITY alu IS
PORT( clk : IN std_logic;
reset : IN std_logic;
ready : IN std_logic;
P0_L : IN unsigned(7 DOWNTO 0);
P0_A : IN unsigned(7 DOWNTO 0);
P0_B : IN unsigned(7 DOWNTO 0);
P1_L : IN unsigned(7 DOWNTO 0);
P1_A : IN unsigned(7 DOWNTO 0);
P1_B : IN unsigned(7 DOWNTO 0);
P2_L : IN unsigned(7 DOWNTO 0);
P2_A : IN unsigned(7 DOWNTO 0);
P2_B : IN unsigned(7 DOWNTO 0);
P3_L : IN unsigned(7 DOWNTO 0);
P3_A : IN unsigned(7 DOWNTO 0);
P3_B : IN unsigned(7 DOWNTO 0);
P4_L : IN unsigned(7 DOWNTO 0);
P4_A : IN unsigned(7 DOWNTO 0);
P4_B : IN unsigned(7 DOWNTO 0);
P5_L : IN unsigned(7 DOWNTO 0);
P5_A : IN unsigned(7 DOWNTO 0);
P5_B : IN unsigned(7 DOWNTO 0);
P6_L : IN unsigned(7 DOWNTO 0);
P6_A : IN unsigned(7 DOWNTO 0);
P6_B : IN unsigned(7 DOWNTO 0);
P7_L : IN unsigned(7 DOWNTO 0);
P7_A : IN unsigned(7 DOWNTO 0);
P7_B : IN unsigned(7 DOWNTO 0);
R : IN unsigned(7 DOWNTO 0);
G : IN unsigned(7 DOWNTO 0);
B : IN unsigned(7 DOWNTO 0);
Rin : IN unsigned(7 DOWNTO 0);
Gin : IN unsigned(7 DOWNTO 0);
Bin : IN unsigned(7 DOWNTO 0);
output_ready : OUT std_logic;
out_L : OUT signed8bits;
out_A : OUT signed8bits;
out_B : OUT signed8bits) ;
END alu;






C0_A, C0_B, C1_L, C1_A, C1_B,
C2_A, C2_B, C3_L, C3_A, C3_B,
C4_A, C4_B, C5_L, C5_A, C5_B,
C6_A, C6_B,
C7_L1, C7_A, C7_A1, C7_B, C7_B1 : signed8bits;
SIGNAL P0_Lsig, P0_Asig, P0_Bsig : signed8bits;
SIGNAL P0_Lsig1, P0_Asig1, P0_Bsig1 : signed8bits;
SIGNAL deltaXY, deltaXZ,
deltaYZ : signed16bits := 0;
--SIGNAL deltaXYZ : signed24bits := 0;








Aterm3, Aterm4, Aterm5,
Bterm3, Bterm4, Bterm5,
std_logic;
Lterm6, Lterm7, L1, L2
Aterm6, Aterm7, A1, A2
Bterm6, Bterm7, B1, B2













R_var, G_var, B_var,
deltaXvar, deltaYvar, deltaZvar : signed8bits;
BEGIN





if (reset = '0') then
C1_L <= 0; C1_A <= 0; C1_B <= 0;
C2_L <= 0; C2_A <= 0; C2_B <= 0;
C3_L <= 0; C3_A <= 0; C3_B <= 0;
C4_L <= 0; C4_A <= 0; C4_B <= 0;
C5_L <= 0; C5_A <= 0; C5_B <= 0;
C6_L <= 0; C6_A <= 0; C6_B <= 0;
C7_L <= 0; C7_A <= 0; C7_B <= 0;
deltaX <= 0; deltaY <= 0; deltaZ <= 0;















Rin_var := CONV_INTEGER(Rin);
Gin_var := CONV_INTEGER(Gin);
Bin_var := CONV_INTEGER(Bin);
R_var := CONV_INTEGER(R);
G_var := CONV_INTEGER(G);
B_var := CONV_INTEGER(B);
P0_Lvar := CONV_INTEGER(P0_L);
P0_Avar := CONV_INTEGER(P0_A);
P0_Bvar := CONV_INTEGER(P0_B);
P1_Lvar := CONV_INTEGER(P1_L);
P1_Avar := CONV_INTEGER(P1_A);
P1_Bvar := CONV_INTEGER(P1_B);
P2_Lvar := CONV_INTEGER(P2_L);
P2_Avar := CONV_INTEGER(P2_A);




P4_Lvar := CONV_INTEGER(P4_L);
P4_Avar := CONV_INTEGER(P4_A);
P4_Bvar := CONV_INTEGER(P4_B);


























deltaXvar := Rin_var - R_var;
deltaZvar := Bin_var - B_var;











C1_L <= (P4_Lvar - P0_Lvar);
C1_A <= (P4_Avar - P0_Avar);
C1_B <= (P4_Bvar - P0_Bvar);
C2_L <= (P0_Lvar - P2_Lvar);
C2_A <= (P0_Avar - P2_Avar);
C2_B <= (P0_Bvar - P2_Bvar);
C3_L <= (P1_Lvar - P0_Lvar);
C3_A <= (P1_Avar - P0_Avar);
C3_B <= (P1_Bvar - P0_Bvar);
C4_L <= (P2_Lvar + P4_Lvar - P0_Lvar - P6_Lvar);
C4_A <= (P2_Avar + P4_Avar - P0_Avar - P6_Avar);
C4_B <= (P2_Bvar + P4_Bvar - P0_Bvar - P6_Bvar);
C5_L <= (P5_Lvar - P1_Lvar - P4_Lvar + P0_Lvar);
C5_A <= (P5_Avar - P1_Avar - P4_Avar + P0_Avar);
C5_B <= (P5_Bvar - P1_Bvar - P4_Bvar + P0_Bvar);
C6_L <= (P3_Lvar - P1_Lvar - P2_Lvar + P0_Lvar);
C6_A <= (P3_Avar - P1_Avar - P2_Avar + P0_Avar);
C6_B <= (P3_Bvar - P1_Bvar - P2_Bvar + P0_Bvar);
C7_L <= (P0_Lvar - P7_Lvar + P3_Lvar + P5_Lvar + P6_Lvar
- P4_Lvar - P1_Lvar - P2_Lvar);
C7_A <= (P0_Avar - P7_Avar + P3_Avar + P5_Avar + P6_Avar
- P4_Avar - P1_Avar - P2_Avar);
C7_B <= (P0_Bvar - P7_Bvar + P3_Bvar + P5_Bvar + P6_Bvar
- P4_Bvar - P1_Bvar - P2_Bvar);






















































if (clk = '1' and c





















































































































































































elsif (ready2 = '1') then
L1 <= P0_Lsig1 + Lterm1 + Lterm2 + Lterm3;
L2 <= Lterm4 + Lterm5 + Lterm6 + (Lterm7*C7_L1)/64;
A1 <= P0_Asig1 + Aterm1 + Aterm2 + Aterm3;
A2 <= Aterm4 + Aterm5 + Aterm6 + (Aterm7*C7_A1)/64;
B1 <= P0_Bsig1 + Bterm1 + Bterm2 + Bterm3;
B2 <= Bterm4 +




L1 <= 0;
L2 <= 0;
A1 <= 0;
A2 <= 0;
B1 <= 0;








if (clk = '1' and clk'event) then













out_L <= (L1 + L2);
out_A <= (A1 + A2);












11.2.2 High Level Synthesis Method 1
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;





































IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0)
IN unsigned (7 DOWNTO 0) ;
IN unsigned (7 DOWNTO 0) ;
IN unsigned (7 DOWNTO 0) ;
output_ready : OUT std_logic;
out_L : OUT signed8bits;
out_A : OUT signed8bits;
out_B : OUT signed8bits);
END alu;




VARIABLE C0_L, C0_A, C0_B, C1_L, C1_A, C1_B,
C2_L, C2_A, C2_B, C3_L, C3_A, C3_B,
C4_L, C4_A, C4_B, C5_L, C5_A, C5_B,
C6_L, C6_A, C6_B, C7_L, C7_A, C7_B :
















R_var, G_var, B_var : signed8bits;




VARIABLE deltaX, deltaY, deltaZ
-- add these two lines
variable deltaXY, deltaYZ, deltaXZ :

















then exit reset_loop; end if;
start_loop : loop










































































































































C3_L := (P1_Lvar - P0_Lvar);
C3_A := (P1_Avar - P0_Avar);
C3_B := (P1_Bvar - P0_Bvar);
C4_L := (P2_Lvar + P4_Lvar - P0_Lvar - P6_Lvar);
C4_A := (P2_Avar + P4_Avar - P0_Avar - P6_Avar);
C4_B := (P2_Bvar + P4_Bvar - P0_Bvar - P6_Bvar);
C5_L := (P5_Lvar - P1_Lvar - P4_Lvar + P0_Lvar);
C5_A := (P5_Avar - P1_Avar - P4_Avar + P0_Avar);
C5_B := (P5_Bvar - P1_Bvar - P4_Bvar + P0_Bvar);
C6_L := (P3_Lvar - P1_Lvar - P2_Lvar + P0_Lvar);
C6_A := (P3_Avar - P1_Avar - P2_Avar + P0_Avar);
C6_B := (P3_Bvar - P1_Bvar - P2_Bvar + P0_Bvar);
C7_L := (P0_Lvar - P7_Lvar + P3_Lvar + P5_Lvar + P6_Lvar - P4_Lvar -
P1_Lvar - P2_Lvar);
C7_A := (P0_Avar - P7_Avar + P3_Avar + P5_Avar + P6_Avar - P4_Avar -
P1_Avar - P2_Avar);
C7_B := (P0_Bvar - P7_Bvar + P3_Bvar + P5_Bvar + P6_Bvar - P4_Bvar -
P1_Bvar - P2_Bvar);
out_L <= (P0_Lvar + (C1_L*deltaX)/64 + (C2_L*deltaY)/64 + (C3_L*deltaZ)/64
+ (C4_L*deltaXY)/4096
+ (C5_L*deltaXZ)/4096 + (C6_L*deltaYZ)/4096 +
(C7_L*deltaX*deltaYZ)/262144);
out_A <= (P0_Avar + (C1_A*deltaX)/64 + (C2_A*deltaY)/64 + (C3_A*deltaZ)/64
+ (C4_A*deltaXY)/4096
+ (C5_A*deltaXZ)/4096 + (C6_A*deltaYZ)/4096 +
(C7_A*deltaX*deltaYZ)/262144);
out_B <= (P0_Bvar + (C1_B*deltaX)/64 + (C2_B*deltaY)/64 + (C3_B*deltaZ)/64
+ (C4_B*deltaXY)/4096








wait on clk until clk = '1' and clk'event;
if reset = '0' then exit reset_loop; end if;
end loop algo_loop;
end loop start_loop;
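The interpolation arithmetic in the listing above can be checked against a small software model. The sketch below is an assumption-laden reference, not the thesis's code: corner k of the cube is assumed to use bit 2 of k for X (red), bit 1 for Y (green) and bit 0 for Z (blue); the coefficients follow the textbook trilinear expansion, whose signs for C2, C4 and C7 may differ from the listing; and the offsets dx, dy, dz are assumed to run from 0 to 64, which is consistent with the divisors 64, 64^2 = 4096 and 64^3 = 262144. One detail matters for a bit-accurate comparison: VHDL integer division truncates toward zero, whereas Python's // rounds toward negative infinity, so a helper reproduces the VHDL behaviour.

```python
from typing import Sequence


def vhdl_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero, like VHDL '/' on INTEGER."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b > 0) else -q


def trilinear(p: Sequence[int], dx: int, dy: int, dz: int) -> int:
    """Interpolate one channel (L, A or B) from the 8 cube corners p[0]..p[7].

    Assumed corner ordering: bit 2 of the index selects X (red),
    bit 1 selects Y (green), bit 0 selects Z (blue). dx, dy, dz are in 0..64.
    """
    p0, p1, p2, p3, p4, p5, p6, p7 = p
    c1 = p4 - p0                                  # X term
    c2 = p2 - p0                                  # Y term
    c3 = p1 - p0                                  # Z term
    c4 = p6 - p4 - p2 + p0                        # XY term
    c5 = p5 - p4 - p1 + p0                        # XZ term
    c6 = p3 - p2 - p1 + p0                        # YZ term
    c7 = p7 - p6 - p5 - p3 + p4 + p2 + p1 - p0    # XYZ term
    # Each product is divided separately, mirroring the out_L/out_A/out_B
    # expressions, so truncation happens term by term.
    return (p0
            + vhdl_div(c1 * dx, 64) + vhdl_div(c2 * dy, 64) + vhdl_div(c3 * dz, 64)
            + vhdl_div(c4 * dx * dy, 4096)
            + vhdl_div(c5 * dx * dz, 4096)
            + vhdl_div(c6 * dy * dz, 4096)
            + vhdl_div(c7 * dx * dy * dz, 262144))


if __name__ == "__main__":
    corners = [10, 20, 30, 40, 50, 60, 70, 80]   # made-up, linear test data
    print(trilinear(corners, 32, 32, 32))          # prints 45 (cell centre)
```

At the cell origin the model returns P0 and at dx = dy = dz = 64 it returns P7, which is a quick sanity check for any coefficient set.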




11.2.3 High Level Synthesis Method 2 (Behavioral Retiming)
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
use work.typepack.all;
ENTITY alu IS
PORT( clk : IN std_logic;
reset : IN std_logic;
ready : IN std_logic;
P0_L : IN signed8bits;
P0_A : IN signed8bits;
P0_B : IN signed8bits;
P1_L : IN signed8bits;
P1_A : IN signed8bits;
P1_B : IN signed8bits;
P2_L : IN signed8bits;
P2_A : IN signed8bits;
P2_B : IN signed8bits;
P3_L : IN signed8bits;
P3_A : IN signed8bits;
P3_B : IN signed8bits;
P4_L : IN signed8bits;
P4_A : IN signed8bits;
P4_B : IN signed8bits;
P5_L : IN signed8bits;
P5_A : IN signed8bits;
P5_B : IN signed8bits;
P6_L : IN signed8bits;
P6_A : IN signed8bits;
P6_B : IN signed8bits;
P7_L : IN signed8bits;
P7_A : IN signed8bits;
P7_B : IN signed8bits;
R : IN unsigned(7 DOWNTO 0);
G : IN unsigned(7 DOWNTO 0);
B : IN unsigned(7 DOWNTO 0);
Rin : IN unsigned(7 DOWNTO 0);
Gin : IN unsigned(7 DOWNTO 0);




out_L : OUT signed8bits;
out_A : OUT signed8bits;
out_B : OUT signed8bits) ;
END alu;
ARCHITECTURE alu_behave OF alu IS
SIGNAL C0_L, C0_A, C0_B, C1_L, C1_A, C1_B,
C2_L, C2_A, C2_B, C3_L, C3_A, C3_B,
C4_L, C4_A, C4_B, C5_L, C5_A, C5_B,
C6_L, C6_A, C6_B, C7_L, C7_A, C7_B : signed8bits := 0;
SIGNAL deltaX, deltaY, deltaZ :
signed8bits := 0;








-- synopsys label deltaX
-- synopsys label deltaZ
-- synopsys label deltaY
C1_L <= (P4_L - P0_L);
C1_A <= (P4_A - P0_A);
C1_B <= (P4_B - P0_B);
C2_L <= (P0_L - P2_L);
C2_A <= (P0_A - P2_A);
C2_B <= (P0_B - P2_B);
C3_L <= (P1_L - P0_L);
C3_A <= (P1_A - P0_A);
C3_B <= (P1_B - P0_B);
C4_L <= (P2_L + P4_L - P0_L - P6_L);
C4_A <= (P2_A + P4_A - P0_A - P6_A);
C4_B <= (P2_B + P4_B - P0_B - P6_B);
C5_L <= (P5_L - P1_L - P4_L + P0_L);
C5_A <= (P5_A - P1_A - P4_A + P0_A);
C5_B <= (P5_B - P1_B - P4_B + P0_B);
C6_L <= (P3_L - P1_L - P2_L + P0_L);
C6_A <= (P3_A - P1_A - P2_A + P0_A);
C6_B <= (P3_B - P1_B - P2_B + P0_B);
C7_L <= (P0_L - P7_L + P3_L + P5_L + P6_L - P4_L - P1_L - P2_L);
C7_A <= (P0_A - P7_A + P3_A + P5_A + P6_A - P4_A - P1_A - P2_A);




out_L <= ((P0_L + (C1_L*deltaX)/64 + (C2_L*deltaY)/64 + (C3_L*deltaZ)/64
+ (C4_L*deltaX*deltaY)/4096
+ (C5_L*deltaX*deltaZ)/4096 + (C6_L*deltaY*deltaZ)/4096 +
(C7_L*deltaX*deltaY*deltaZ)/262144)) when (reset = '0') else 0;
out_A <= ((P0_A + (C1_A*deltaX)/64 + (C2_A*deltaY)/64 + (C3_A*deltaZ)/64
+ (C4_A*deltaX*deltaY)/4096
+ (C5_A*deltaX*deltaZ)/4096 + (C6_A*deltaY*deltaZ)/4096 +
(C7_A*deltaX*deltaY*deltaZ)/262144)) when (reset = '0') else 0;
out_B <= ((P0_B + (C1_B*deltaX)/64 + (C2_B*deltaY)/64 + (C3_B*deltaZ)/64
+ (C4_B*deltaX*deltaY)/4096
+ (C5_B*deltaX*deltaZ)/4096 + (C6_B*deltaY*deltaZ)/4096 +






I would like to thank Vijay Raghavan for the tremendous amount of assistance he provided. I would also like to thank Gary Parrett for the help with VHDL he provided, Dr. Ken Hsu for his guidance as my thesis advisor, and Sohail Dianat for helping me to understand the tri-linear interpolation algorithm. Dave Bishop and Mark Brown from Eastman Kodak Company were also a great help with tool issues and algorithm assistance.
