A VHDL design for hardware assistance of fractal image compression by Erickson, Andrew
Rochester Institute of Technology 
RIT Scholar Works 
Theses 
6-1-2000 
Recommended Citation 
Erickson, Andrew, "A VHDL design for hardware assistance of fractal image compression" (2000). Thesis. 
Rochester Institute of Technology. Accessed from 






Partial Fulfillment of the






Muhammad E. Shaaban, Assistant Professor, Computer Engineering
Committee Member
Kenneth W. Hsu, Professor, Computer Engineering
Committee Member
Roy S. Czernikowski, Professor, Computer Engineering
Department of Computer Engineering
Kate Gleason College of Engineering




Rochester Institute of Technology
A VHDL Design for Hardware Assistance of Fractal Image
Compression
I, Andrew J. Erickson, hereby grant permission to any individual or organiza-





Fractal image compression schemes have several unusual and useful attributes,
including resolution independence, high compression ratios, good image quality,
and rapid decompression. Despite this, one major difficulty has prevented their
widespread adoption: the extremely high computational complexity of compression.
Fractal image compression algorithms represent an image as a series of contractive
transformations, each of which maps a large domain block to a smaller
range block. Given only this set of transformations, it is possible to reconstruct an
approximation of the original image by iteratively applying the transformations
to an arbitrary image.
Compression consists of partitioning the image into range blocks and finding
a suitable transformation of a domain block to represent each one. This search
for transformations must generally be done using a brute force approach, comparing
successive domain blocks until a suitable match is found. Some algorithmic
improvements have been found, but none are adequate to reduce the required
compression time to something reasonable for many uses.
This thesis presents a new ASIC design which performs a large number of the
required comparisons in parallel, yielding a substantial speedup over a program on
a general-purpose computer system. This ASIC is designed in VHDL, which may
be synthesized to many different target architectures. The design has considerable
flexibility which makes it applicable to different images and applications.
The design is based around a pipeline of units that each compare one range
block with a series of domain blocks which are fed through the pipeline. Comparisons
are made to minimize the mean square error (MSE) of a transform
given a linear mapping of the intensity values. This is, by far, the most common
minimization strategy used in the literature.
The speedup provided by this design is estimated to be about 1,000 times
for 256 x 256 images divided into 8x8 blocks over a sequential processor given
similar implementation technologies.
Contents
List of Figures
List of Tables
Glossary
1 Introduction
1.1 Scope of Work
1.2 Organization of the Material
2 Fractal Theory and Image Compression
2.1 Images and Transformations
2.1.1 Contraction Mapping Theorem
2.2 Fractals
2.2.1 Iterated Function Systems
2.2.2 Local Iterated Function Systems
2.3 Other Theoretical Approaches
2.3.1 Wavelet Compression
2.3.2 Self Vector Quantization (VQ)
2.3.3 Convolution Transform Coding
3 Algorithms for Fractal Image Compression
3.1 Characteristics of Fractal Image Compression Algorithms
3.2 A Generic Algorithm
3.3 Software Variations
3.3.1 Jacquin's Algorithm
3.3.2 Classification of Blocks
3.3.3 Quadtree Recomposition
3.4 Genetic Algorithms
3.5 Hardware and Parallel Approaches
3.5.1 Obvious Parallel Algorithms
3.5.2 Jackson's and Mahmoud's Parallel Approach
3.5.3 Acken's, Irwin's, and Owens's ASIC Architecture
4 Hardware Design Overview
4.1 Goals and overview
4.2 General Notes and Some Conventions
4.3 Primary Components
4.3.1 Main Memory Interface
4.3.2 Pipeline Unit
4.3.3 Host Interface Unit
4.4 Testing and Support Code
4.4.1 Clock and Reset Generator
4.4.2 Simulated Memory
5 The Memory Interface Unit
5.1 Memory Cache Chunk
5.1.1 The Sample Memory Interface
5.1.2 System Interface Logic
5.1.3 Memory Interface Logic
5.2 Block Addressing Chunk
5.3 Denominator Computation Chunk
5.3.1 High-level Block Addressing
5.3.2 Denominator Computation
6 The Pipeline Unit
6.1 The Parameter Computation Chunk
6.1.1 Mathematical Basis for Parameter Computation
6.1.2 Implementation
6.2 The Range Block Chunk
6.3 The Multiply-Accumulate Chunks
7 The Host Interface Unit
7.1 Description
7.1.1 Global Registers
7.1.2 Local Registers
7.2 Implementation
8 Testing and Verification
8.1 Algorithmic Testing
8.2 Hardware Simulation
9 Synthesis
9.1 Tools
9.2 Procedures and Scripts
9.3 Difficulties
9.4 Results
10 Conclusions and Extensions
10.1 Possible Improvements and Extensions
10.1.1 Memory Interface Unit
10.1.2 Pipeline Unit
10.2 Concluding Remarks
A Software Implementation of Compression and Decompression
A.1 Introduction
A.1.1 Invocation
A.2 Compression Program: fcomp3.c
A.3 Decompression Program: fdecomp3.c
A.4 Bitmap File Library
A.4.1 Header File: bmp.h
A.4.2 Source File: bmp.c
B Synthesis Script File
B.1 Master Script File: syn.scr2
B.2 Memory Interface Unit: syn.scr2a
B.3 Parameter Computation Chunk: syn.scr2b
B.4 Pipeline Unit: syn.scr2c
B.5 Host Interface and Top Level: syn.scr2d
C Schematics Generated by Synthesis
C.1 Top Level Schematic
C.2 Pipeline Unit
C.3 MAC chunk










List of Figures
3.1 A Flowchart of the Generic Algorithm
4.1 A Block Diagram of the Hardware
5.1 Memory Interface State Diagram
5.2 State Diagram for the Block Addressing Chunk
5.3 High-Level Block Addressing State Diagram
5.4 Dataflow for the Denominator Computation





















Test Image "text1" (Original)
Test Image "chapel" (Original)
Test Image "coke" (Original)










Test Result "sunset" (MSE cutoff = 10)
Test Result "sunset" (Max a = 10)
Test Result "sunset" (Initial Blocksize = 8)
Test Result "sunset" (MSE cutoff = 50)
Test Result "sunset" (MSE cutoff = 80)
Test Result "sunset" (MSE cutoff = 125)
Test Result "sunset" (MSE cutoff = 200)
Test Result (Magnification = 2)
Test Image for Hardware Simulation
Result Image Generated by Hardware Simulation
Result Image Generated by Compression Software
List of Tables
7.1 The Host Registers
7.2 The Isometry Codes
8.1 Summary of Test Results with Various Images
8.2 Summary of Test Results with Varying Parameters (sunset)
9.1 Summary of Synthesized Area
A.1 fcomp3 Command Line Arguments
A.2 fdecomp3 Command Line Arguments
Glossary
Page numbers in brackets refer to locations in the text where the term defined
is discussed in greater detail. Terms which have no such references are merely
mentioned in passing in the text.
affine transformation A transformation of the form T(x) = Ax + b, where
x and b are column vectors and A is a square matrix.
ASIC Application Specific Integrated Circuit; a complex digital circuit which
performs some specialized function (as opposed to a general-purpose
device).
attractor [4] The limiting value of a contractive transformation. The attractor
may be approximated by repeatedly applying the transformation to a seed
value.
chunk [17] In the design developed here, a chunk is a subcomponent within a
unit.
complete [3] A metric space is complete if every Cauchy sequence of points in
the space converges to a point in the space (if it has no "holes").
contraction mapping theorem [4] A key theorem for fractal image compression;
it states that a contractive transform on a complete metric space has a
single fixed point attractor.
contractive [3] A transformation on a space is contractive if its contractivity
factor is strictly less than one.
contractivity factor [3] A (fractional) upper limit on the change in the distance
between any two points after a transformation is applied to them.
Design Compiler A VHDL synthesis tool which is part of Synopsys.
domain block [10] A block in an image which is the source of a transformation.
entropy encoding Any general-purpose (lossless) compression scheme which
relies on disparities in the relative frequency of various bit patterns to
compress data. Entropy encoding of some sort is often used as a last step in
fractal image compression.
FIFO [54] First In First Out; a cache eviction strategy where the item which
has been in the cache the longest is evicted when a new item is inserted,
regardless of intervening accesses.
fixed point [4] A point in a space which remains unchanged by a transformation
on that space; in other words, $x_f = T(x_f)$.
fractal A mathematical set which exhibits self-similarity. Typically, such sets
are represented graphically.
fractal image compression [10] A class of algorithms for image compression
which are based on fractal theory; they are generally based on LIFSMs.
fractal transform [7] Another name for an LIFSM sometimes encountered in
the literature.
gzip A popular lossless compression system which uses Lempel-Ziv (LZ77)
coding.
$\mathcal{H}$ [5] The Hausdorff set.
Hausdorff metric [5] A metric which defines the distance between two subsets
of a metric space based on the distance metric of the underlying space.
With the Hausdorff set, it forms a metric space.
Hausdorff set [5] The set of all possible subsets of a space except the empty
set.
HDL Hardware Description Language; any of several programming languages
used for describing digital hardware operation, generally for simulation,
synthesis, or human communication.
IFS [5] Iterated Function System; a set of transformations operating on a metric
space which defines a fractal.
IFSM [6] Iterated Function System with Graymap Functions; an IFS with an
additional dimension for intensity information.
JPEG Joint Photographic Experts Group; a group known for a common
image compression algorithm based on the discrete cosine transform. Also,
this compression algorithm.
LIFS [6] Local Iterated Function System; an IFS in which each transformation
acts on a subset of the space, rather than the whole.
LIFSM [7] Local Iterated Function System with Graymap Functions; a LIFS
with an additional dimension for intensity information. An image
compressed with fractal image compression is generally an LIFSM.
lossless A compression algorithm which reproduces its input precisely after
decompression.
lossy A compression algorithm which is not guaranteed to reproduce the precise
original after decompression.
metric [3] A distance measure between points in a space.
metric space [3] A space and an associated metric.
MSE [28] Mean Square Error; a common distance function and error measure.
PIFS [6] Partitioned Iterated Function System; another name for an LIFS
sometimes found in the literature.
QD [12] Quadtree Decomposition.
QR [13] Quadtree Recomposition.
quadtree A tree structure where each non-leaf node has (at most) four children.
In fractal image compression, a quadtree is formed when a square range
block is split into four smaller range blocks by bisecting the edges.
quadtree decomposition [12] The partitioning of an image into square blocks
of varying size using a quadtree approach, where large blocks are examined
and are recursively split into four equal smaller blocks as needed.
quadtree recomposition [13] An algorithmic variant of quadtree decomposition
where the tree structure is constructed starting with the smallest blocks
and joining them as appropriate.
RAM Random Access Memory; a term which is universally applied solely to
semiconductor memories which allow both reading and writing.
range block [10] A block in a LIFS which is the target of a transformation.
When compressing an image, it is completely divided into non-overlapping
range blocks.
s [3] The contractivity factor.
space [3] An infinite set upon which topological transformations are performed.
(In this thesis, all spaces considered are metric spaces.)
Synopsys A well-known digital logic development system; the Synopsys Design
Compiler was used to synthesize the design presented here.
synthesis [50] The process, generally performed by a computer system, of
automatically generating a logic design from an HDL description.
transformation [3] A function which maps a member of a space onto another
member of the same space. In fractal image compression, this is generally
limited to a subset of the affine transforms.
unit [17] In the design developed here, the primary components are referred to
as units. Units may consist of smaller subcomponents named chunks.
VHDL VHSIC Hardware Description Language; a popular hardware description
language developed by the Department of Defense. VHDL was developed
to support simulation, but is now also frequently used for synthesis.
VHSIC Very High Speed Integrated Circuit; a Department of Defense program
which, among other things, developed VHDL.
VQ [8] Vector Quantization; a class of block-based image compression algorithms
where blocks are assumed to take the form of one of a number of possible





Chapter 1
Introduction
Fractal image compression has several features which would seemingly make it
ideal for many applications, including resolution independence, rapid decompression,
and high compression ratios. Despite this, it has remained of only secondary
importance, primarily because of one difficulty: the very high computational cost
of compression. Despite ongoing research, software improvements which reduce
the time required for compression enough to be practical for general-purpose use
have not been forthcoming.
The algorithms required for fractal image compression are easily performed in
parallel; the complexity of the software is due to a very large number of relatively
simple cases. This makes fractal image compression an attractive problem to
perform in special-purpose hardware.
1.1 Scope of Work
In this thesis, a new design for an ASIC to assist with fractal image compression
is presented. This ASIC is designed using VHDL, which allows flexibility in
implementation; it is not tied to any particular system architecture.
The design presented is based around a pipeline of many identical units which
perform the block comparisons necessary for fractal image compression in parallel.
This allows a substantial speedup when compared with software written for
general purpose processors.
The system was tested using a VHDL simulation, and the results were compared
with those obtained from compression software. The design was also synthesized to
a gate level using Synopsys synthesis tools. Although no physical devices were
actually constructed to test the design, the testing which was done gives good
confidence that such devices would work as expected.
1.2 Organization of the Material
The remainder of this thesis covers three major topics. First, some background on
fractal image compression is given. Chapter 2 discusses the mathematical theory
supporting fractal image compression: the theory of local iterated function systems.
It also contains a brief summary of some alternate theoretical approaches
to fractal image compression. Chapter 3 describes the algorithm generally used
for fractal image compression, and briefly describes some of the improvements
on this general algorithm. This chapter also describes some related hardware
designs which have been proposed.
The second section describes, in detail, the ASIC design developed for this
thesis. Chapter 4 gives an overview of the design; chapters 5, 6, and 7 each
describe one of the three major components of the design.
The third section considers the testing and analysis of this design. Chapter 8
describes the testing of the design by verifying the expected operation of the ASIC
and by checking the underlying algorithm. Chapter 9 describes the results of
synthesizing the design using Synopsys, and chapter 10 contains some concluding
observations and some suggestions for future work on this design.
Appendices contain source code for a software fractal image compression and
decompression system, a copy of the script files used to synthesize the design, a
representative selection of schematics generated from the synthesis, and a brief
description of some utility programs developed in conjunction with this thesis. The
accompanying CD-ROM contains source code to these utility programs, along
with the VHDL code for the ASIC design and an electronic copy of this thesis.
Chapter 2
Fractal Theory and Image
Compression
2.1 Images and Transformations
An image consists of a metric space with an associated set of chromatic attributes.
A metric space $(X, d)$ is composed of a topological space $X$ (e.g. an infinite set)
and a metric $d$, a measure of distance. A metric is a non-negative real-valued
function of two points in the space which obeys three requirements [2, 11]:
1. $d(x, y) = 0$ iff $x = y$
2. $d(x, y) = d(y, x)$ (symmetry)
3. $d(x, y) \le d(x, z) + d(z, y)$ (the triangle inequality)
For images, the space used is a plane, and the distance measure is commonly
the Euclidean distance (or Euclidean metric). Other metric spaces are used from
time to time with fractal manipulations.
A metric space $(X, d)$ is complete if every Cauchy sequence of points in $X$
converges to a point which is also in $X$. In other words, complete metric
spaces have no "holes".
The chromatic attributes of an image are assumed throughout this document
to consist of a numerical equivalent of an intensity (a grayscale value). Actual
real-world images contain color information at a variety of wavelengths, which
may be represented in a computer with any of several color models, such as
the RGB model. The algorithms described herein may be applied to each color
component individually if non-grayscale images are used.
A transformation is a mapping of a space into itself (a function of points
in the space yielding points in the space). A transformation $T$ is said to be
contractive with contractivity factor $s$ if
\[ d(T(x), T(y)) \le s\, d(x, y), \qquad 0 \le s < 1 \]
That is, the distance between an arbitrary pair of points shrinks by a factor
of at least $s$ after applying the transformation. For fractal image compression,
the transformations are usually restricted to a subset of contractive affine
transformations, although the theories upon which fractal image compression is based
hold for any contractive transformations.
2.1.1 Contraction Mapping Theorem
Every contractive transformation on a complete metric space has an attractor
which is invariant across the transformation. A sequence formed by repeatedly
applying the transformation to any subset of the space has, as its limiting value,
this fixed point. This is the Contraction Mapping Theorem.
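Theorem 1 (Contraction Mapping Theorem) A contraction mapping $T$ on
a complete metric space $(X, d)$ has exactly one fixed point $x_f \in X$ (for which
$T(x_f) = x_f$). Further, for any $x \in X$,
\[ \lim_{n \to \infty} T^{\circ n}(x) = x_f, \]
where $T^{\circ 1}(x) = T(x)$, and $T^{\circ n}(x) = T[T^{\circ (n-1)}(x)]$ for $n > 1$.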
Proof Let $x_1, x_2, \ldots, x_n \in X$. Since $T$ is contractive,
\[ s \max\{d(x_a, x_b)\} \ge \max\{d(T(x_a), T(x_b))\} \]
for $a, b \le n$. Let $\epsilon > 0$ be given. Then,





This proves there is a single attractor for any set of points in the space (and,
by extension, for all points in the space). Further, as the space is complete, the
limiting value $x_f$ is a member of $X$. $x_f$ is fixed because $d(x_f, x_f) = 0$ and thus
$d(x_f, T(x_f)) = 0$ (as $x_f$ is the attractor).
This is proven, with similar reasoning but different notation, in [1, 72-73].
This proof also implies that the error (distance between the attractor and a
sequence value) is constrained by a monotonically decreasing upper bound as the
number of iterations increases. Thus, the limit of the sequence may be approximated
to within any desired error bound by making a finite number of transformations.
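The convergence behaviour described above can be illustrated numerically. The following is a minimal sketch, not part of the thesis software: the map `T`, its contractivity factor `S`, and the seeds are arbitrary choices made here for illustration. It checks that the error after $n$ iterations stays within the bound $s^n\, d(x_0, x_f)$.

```python
def iterate(T, x0, n):
    """Return [x0, T(x0), T(T(x0)), ...] after n applications of T."""
    xs = [x0]
    for _ in range(n):
        xs.append(T(xs[-1]))
    return xs

S = 0.5                       # contractivity factor of the example map (assumed)

def T(x):
    """An example contraction on the real line with |T(x) - T(y)| = S * |x - y|."""
    return S * x + 1.0

X_F = 1.0 / (1.0 - S)         # the fixed point: T(2) = 2

for seed in (-10.0, 0.0, 37.0):
    for n, x in enumerate(iterate(T, seed, 30)):
        # The error after n iterations is bounded by s^n * d(x_0, x_f).
        assert abs(x - X_F) <= S**n * abs(seed - X_F) + 1e-12
```

Every seed is drawn to the same fixed point, and the number of iterations needed for a given error bound can be read off from $s^n$.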
2.2 Fractals
A fractal is a set (often represented visually) which exhibits self-similarity. Fractals
are, precisely speaking, defined only for a metric space and do not directly
involve any chromatic information associated with it. (An alternate approach is
to consider the chromatic information to lie along an additional dimension of the
space and to define a new metric which includes this dimension. This approach
is useful in developing the theory but is not frequently used in practice.)
Many mathematical constructs lead to fractals; for the purpose of image compression,
however, fractals based on local iterated function systems (LIFSs) are
used exclusively. A LIFS is a modification of an iterated function system (IFS).
2.2.1 Iterated Function Systems
An iterated function system consists of a set $T$ of contractive transformations
on a metric space $\mathbf{A}$ with a distance metric $d_{\mathbf{A}}$. These transformations act in
parallel: the union of all of the transformations is iterated.
For analysis, a new metric space is constructed, called the Hausdorff space
$\mathcal{H}_{\mathbf{A}}$. $\mathcal{H}_{\mathbf{A}}$ is defined on the space of all subsets of $\mathbf{A}$ (the power set of $\mathbf{A}$) except
the empty set.
Let $a, a_1, a_2, \ldots \in \mathbf{A}$ and $A, A_1, A_2, \ldots \in \mathcal{H}_{\mathbf{A}}$. Then, one may define:
\[ d(a, A) = \min\{d_{\mathbf{A}}(a, a_i) : a_i \in A\} \]
\[ d(A_1, A_2) = \max\{d(a_1, A_2) : a_1 \in A_1\} \]
\[ d_{\mathcal{H}}(A_1, A_2) = \max\{d(A_1, A_2), d(A_2, A_1)\} \]
$d_{\mathcal{H}}$ is the Hausdorff distance.
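For finite point sets, the three definitions above translate directly into code. The sketch below is illustrative only: it assumes the underlying metric is the Euclidean distance on the plane, and the function names and example sets are chosen here, not taken from the thesis.

```python
import math

def point_to_set(a, B):
    """d(a, B): distance from point a to the nearest point of B."""
    return min(math.dist(a, b) for b in B)

def directed(A, B):
    """d(A, B): the largest point-to-set distance over all points of A."""
    return max(point_to_set(a, B) for a in A)

def hausdorff(A, B):
    """The Hausdorff distance: the larger of the two directed distances."""
    return max(directed(A, B), directed(B, A))

# Two example subsets of the plane; A2 contains A1 plus one outlying point.
A1 = [(0.0, 0.0), (1.0, 0.0)]
A2 = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]
```

Here `directed(A1, A2)` is zero because every point of `A1` also lies in `A2`, while `directed(A2, A1)` is 3 because of the outlying point. Taking the maximum of the two is what makes the Hausdorff distance symmetric.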
An IFS may be characterized by a single contractive transformation in $\mathcal{H}_{\mathbf{A}}$.
A single contractive transformation on a metric space is also a contractive
transformation on the associated Hausdorff space with the Hausdorff metric. (Since
the transformation maps any subset of the original space onto another subset, it
clearly is a transformation of points in the Hausdorff space. It is contractive by
nature of the definition of $d_{\mathcal{H}}$, which is the distance of some particular pair of
points in the original metric space. This distance must be smaller by at least a
factor of $s$ after the transformation; hence, $d_{\mathcal{H}}$ must also be smaller by at least
the same factor.)
Similar reasoning applies to a set of contractive transformations; they (the
union of the transformations) correspond to a single contractive transformation
in the Hausdorff metric space with a contractivity factor of the maximum of
the contractivity factors of the component transformations. This implies, by the
Contraction Mapping Theorem (Theorem 1), that there exists a single fixed point
in $\mathcal{H}$ which is the attractor for the IFS. A point in $\mathcal{H}$ is equivalent to a subset
of the original metric space. Further, a finite number of applications of the IFS
to any point in $\mathcal{H}$ will suffice to approximate, to within a desired error, this
attractor.
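As a small illustration of an IFS acting as one contraction on sets, the sketch below uses the two maps that generate the Cantor set on the unit interval. This example IFS, the function names, and the use of exact fractions are assumptions made here for illustration; they do not appear in the thesis. Two different seed sets are iterated side by side, and the Hausdorff distance between them is recorded at each step.

```python
from fractions import Fraction

# The two contractive maps of the example IFS, each with factor 1/3.
MAPS = (
    lambda x: x / 3,
    lambda x: x / 3 + Fraction(2, 3),
)

def apply_ifs(points):
    """One iteration: the union of the images of the set under every map."""
    return {w(x) for x in points for w in MAPS}

def hausdorff(A, B):
    """Hausdorff distance between two finite sets of numbers."""
    def directed(P, Q):
        return max(min(abs(p - q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))

def distances(seed_a, seed_b, n):
    """Hausdorff distance between the two orbits after 0, 1, ..., n iterations."""
    A, B = {seed_a}, {seed_b}
    out = [hausdorff(A, B)]
    for _ in range(n):
        A, B = apply_ifs(A), apply_ifs(B)
        out.append(hausdorff(A, B))
    return out
```

Starting from the seeds 0 and 1, the distance between the two orbits shrinks by the factor 1/3 (the largest contractivity factor among the maps) at every iteration, so both seeds approach the same attractor.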
For grayscale images, one may use a three-dimensional metric space for the
IFS, with the third dimension representing the intensity of a point. This may lead
to, in the fixed attractor, one location in the image having several simultaneous
intensity values; for display, these values are of necessity combined into a single
value. This is commonly done by either taking the maximum value or the mean
value of the set of intensities, although other algorithms are possible.
IFSs which are applied to grayscale images in this way (treating the intensity
as a third axis) are often IFSMs (Iterated Function Systems with Graymap
functions), where the graymap function transforms the intensity at a point in the
metric space independently of the location transformation. Although the
transformations are usually handled separately in practice, the theory is simpler and
more general without this requirement.
In practice, the combining of multiple grayscale values takes place at each
iteration, as the storage requirements for a two-dimensional structure are less
than for a three-dimensional one. This change may affect distance metrics, and
hence theoretical convergence; however, given a spatially discrete representation
and a limited grayscale value range, it is possible to create a metric which will
converge regardless of the graymap function (for instance, by weighting chromatic
differences far less than spatial differences in the distance function).
Graymap functions are still generally required to be convergent; otherwise,
although the system as defined above (with the weighted distance function) is
theoretically convergent, approximations with even a small error may still have
very large grayscale differences when compared with the original. (This is
unsurprising, as such large grayscale differences do not lead to large distances in the
distance metric for the space. This distance function does not realistically model
human perception of differences between images.)
2.2.2 Local Iterated Function Systems
A LIFS (Local Iterated Function System) is identical to an IFS except the domain
of each transformation is not the entire space, but rather a subset of that space.
The range of the transformation is therefore also a subset of the space. The
transformations are not individually contractive with respect to the entire space;
hence, the system may not be globally contractive and the results of the Contraction
Mapping Theorem cannot be entirely relied upon. (As with an IFS, the
result of applying the function system to a subset of the space is the union of the
transformation results; hence, any parts of the space which are not covered by the
range of any transformation will be absent from the result of a transformation.)
In [1], it is observed that a contractive LIFS may not have an attractor;
further, it may have several attractors. If it has one or more attractors, it has
one largest attractor, the superset of all other attractors.
Attractors in an LIFS arise whenever an area in the range of a transformation
affects the area of its domain, either directly (where the range and domain
intersect) or indirectly through some series of transformations. In such spaces,
the system forms a pure IFS and has a single attractor; hence, any seed which
contains information which, through a series of transformations, affects the range,
will give rise to that attractor.
It follows that, given a seed which is the union of all transformation domains,
the largest attractor will be produced. (By extension, the universal set for the
space will also produce this largest attractor; in practice, it is often simpler to
use the universal set than to find the union of the domains, especially if there are
a large number of transformations.)
LIFSs used in image compression are frequently referred to as LIFSMs; the
graymap function is treated separately from the topological transformations,
analogously to IFSMs. Image compression algorithms are based on LIFSs, rather
than IFSs, because doing so makes the burden of computation more reasonable;
one need only find transformations whose fixed points approximate small pieces
of the image, rather than the whole.
Maintaining the intensity axis as a single point, rather than a set of points,
makes it impossible to guarantee that the largest attractor is produced. In practice,
this does not appear to be a significant difficulty; the desired largest attractor
is still usually found. An image compression system could perform a trial
decompression to verify this, provided a constant seed image is always used at the start
of iteration. If an incorrect attractor is found, some recovery action must be
taken: either attempting compression again with slightly different parameters,
or simply returning an error indicating the failure of the system to compress
the desired image. As already mentioned, this is not a common problem with
real-world images.
(It is an unproven conjecture of the author that, in the case of an image
compression system, an attractor always will exist. An image compression system
uses an LIFSM where the union of the ranges of the transformations covers the
entire area of the image, and the domains of the transformations come from
within the image. Under these conditions, it appears that it is impossible to
generate a system which does not have an attractor.)
2.3 Other Theoretical Approaches
Fractal image compression may also be viewed as a special case of other compression
technologies. Doing so gives some additional insight into how and why fractal
compression works and also may give rise to hybrid approaches which have better
performance. In [4], in addition to introducing convolution transform coding, a
concise taxonomy of all of these theoretical approaches is presented.
2.3.1 Wavelet Compression
In [3], fractal image compression is investigated in terms of wavelets. It is shown
that block-based fractal compression schemes are equivalent to self-quantizing a
Haar subtree. Further, much of the effectiveness of fractal image coding systems
is based on their ability to efficiently encode zerotrees.
By applying these observations to wavelets with basis functions other than the
Haar basis function, Davis was able to improve their effectiveness. Further, Davis
presented some statistical analysis which indicated that self-quantization should
be fairly effective at compressing textures in real-world images. (This analysis
agrees with empirical observations of fractal image compression systems.)
2.3.2 Self Vector Quantization (VQ)
Vector quantization algorithms assume that image blocks may be adequately
represented by one of a number of entries in a predetermined codebook, consisting
of parameterized pattern blocks. This codebook is either agreed upon beforehand
or must be transmitted as overhead data.
Fractal image compression may be viewed as VQ where the codebook is inherent
in the image, consisting of the set of possible domain blocks. This codebook,
rather than being transmitted separately or predetermined, is rebuilt by the
decoder during the decoding process. Jacquin, in [6], recognized this when he
introduced his algorithm.
2.3.3 Convolution Transform Coding
In [4], it is observed that fractal image compression may be regarded as a
transform operator, based on the convolution of the range block with the set of domain
blocks. This allows for a somewhat faster approach to compression (by performing
fast convolution in the frequency domain) which is more suitable for software
than for hardware. This also implies that fractal image encoders have, at their
core, something very similar to a convolution engine; this is indeed true for the
hardware design developed here, which could be transformed into a parallel
convolution engine with only relatively simple changes.
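The link between block matching and convolution can be seen in one dimension. The sketch below is an illustration under assumptions made here, not the method of [4] or of the hardware: it treats the image as a 1-D list, and it computes only the sum of products between a range block and every candidate domain window, which is the kind of term a least-squares block comparison needs. The same numbers are obtained once with an explicit sliding window and once by convolving the signal with the reversed block.

```python
def sliding_dot(signal, block):
    """Inner product of the block with every window of the signal."""
    n = len(block)
    return [sum(signal[k + i] * block[i] for i in range(n))
            for k in range(len(signal) - n + 1)]

def convolve_full(x, h):
    """Direct (full) discrete convolution of two sequences."""
    out = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            out[i + j] += xi * hj
    return out

def sliding_dot_via_convolution(signal, block):
    """The same sliding inner products, taken from a convolution
    of the signal with the reversed block."""
    n = len(block)
    full = convolve_full(signal, block[::-1])
    # Window k corresponds to output index k + n - 1 of the full convolution.
    return full[n - 1:len(signal)]

SIGNAL = [1, 2, 3, 4, 5]   # stand-in for a row of domain pixels (hypothetical)
BLOCK = [1, 0, -1]         # stand-in for a range block (hypothetical)
```

Because the two functions agree, any fast convolution routine (for example, one working in the frequency domain) could in principle replace the explicit window loop in software.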
Chapter 3
Algorithms for Fractal Image
Compression
To compress an image with an LIFSM, it is necessary to find a set of contractive
local transformations whose fixed points approximate the image. For compression
to actually take place, this set must be expressible with less information than the
image.
Decompression is performed by iterating the LIFSM over some seed image to
produce an approximation of the attractor.
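A toy decoder makes the decompression step concrete. The sketch below is a one-dimensional illustration under assumptions made here, not the thesis's `fdecomp3` program: the "image" is a list of 8 samples, each range block of 2 samples is produced from a domain block of 4 samples by averaging adjacent pairs and applying a linear intensity map $s \cdot v + o$, and the transformation table is invented for the example. All scale factors have magnitude below one, so each decoding step is contractive.

```python
DOMAIN_SIZE = 4   # samples per domain block (range blocks are half this size)

# One hypothetical transformation per range block, in range-block order:
# (domain start index, intensity scale s, intensity offset o)
TRANSFORMS = [
    (0, 0.5, 10.0),
    (4, -0.25, 40.0),
    (2, 0.75, 5.0),
    (0, 0.5, 100.0),
]

def shrink(block):
    """Average adjacent pairs, halving the block length."""
    return [(block[i] + block[i + 1]) / 2 for i in range(0, len(block), 2)]

def decode_step(image):
    """Apply every transformation once to build the next image."""
    out = []
    for start, s, o in TRANSFORMS:
        domain = shrink(image[start:start + DOMAIN_SIZE])
        out.extend(s * v + o for v in domain)
    return out

def decode(seed, iterations=100):
    """Iterate the transformations starting from an arbitrary seed image."""
    image = list(seed)
    for _ in range(iterations):
        image = decode_step(image)
    return image
```

Decoding from an all-black seed (`[0] * 8`) and from an all-white seed (`[255] * 8`) yields the same result to within rounding, which is the behaviour the Contraction Mapping Theorem predicts for a contractive system.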
3.1 Characteristics of Fractal Image Compression Algorithms
Some characteristics are common to virtually all fractal image compression
algorithms. Many of these are not found in most other image compression techniques;
they are largely unique to fractal image compression.
Fractal image compression algorithms are inherently resolution-independent.
Images may be decompressed at any desired target resolution, including those




at their final resolution which were not part of the
original. Obviously, no image compression technique can add information
to an image; these details are solely figments of the compression and decompression
techniques. Subjectively, they are often believable, especially at moderate
levels of magnification.
As decompression at any desired resolution is possible, compression ratios
must be computed at the original image resolution. Although this seems obvious,
some proponents of fractal image compression wrongly compute compression ratios
from higher-resolution decompressions and thereby create astoundingly good




They are still illusionary and unhelpful,
except perhaps as marketing tools, as they may be crafted to be unrealistically
high by performing decompression at large magnifications.)
Fractal image compression algorithms are extremely asymmetric with respect
to processing time. Compression requires much more computation than
decompression (often by orders of magnitude). This makes them more suitable for
applications where compression is less frequent than decompression, such as shared
image databases. The hardware proposed in this document is intended to help
speed up compression; it is useless for decompression.
Fractal image compression algorithms are lossy, making them unsuitable for
certain classes of work; in particular, many medical imaging applications require
lossless compression.
Fractal image compression algorithms usually provide high compression
ratios, often a bit better than JPEG compression at a similar perceived quality.
(Image quality is generally subjective, imprecise, and often varies with the
intended application of the images. In [5], for instance, it is shown that JPEG
better preserves small, moderate-contrast features, especially at relatively high
compression ratios.)
Fractal image compression algorithms are generally more suitable for use with
"natural" images than with line art or similar classes of images. Further, the
quality of compression varies significantly from image to image, and is not
always discernible from obvious image qualities. (By contrast, JPEG compression
displays artifacts around steep chromatic gradients in images; determining the
relative content of high and low frequencies in an image suffices for determining
its suitability for JPEG compression.)
3.2 A Generic Algorithm
Compressing an image using fractal image compression consists of finding an
LIFSM whose attractor approximates the image in question. Since there are a
theoretically infinite number of possible LIFSMs, only small classes of possible
LIFSMs are considered; generally, those which are easiest to work with on a digital
computer and those which require little information to describe (thus leading to
high compression ratios).
This generic algorithm is typical of most approaches. A flowchart of the
algorithm may be found in Figure 3.1.
The image is divided into a number of non-overlapping square range blocks
of a fixed initial size. Domain blocks are constrained to have dimensions twice
as large as those of the range blocks (that is, to each be a square with an area
four times greater than that of a range block) and to fall within the image.
The geometric transforms permitted are constrained to those affine transforms
which map a given domain block to a given range block: four rotations and four
reflected rotations. The intensity transformations are constrained to be linear
functions.
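The eight permitted orientations can be enumerated as in the following sketch. The representation of a block as a list of rows and the function names are illustrative assumptions, not part of the thesis code.

```python
def rotate(block):
    """Rotate a square block (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*block[::-1])]

def reflect(block):
    """Mirror a square block left to right."""
    return [row[::-1] for row in block]

def isometries(block):
    """Return the four rotations followed by the four reflected rotations."""
    out = []
    current = [list(row) for row in block]
    for _ in range(4):
        out.append(current)
        current = rotate(current)
    return out + [reflect(b) for b in out]

SAMPLE = [[1, 2],
          [3, 4]]
ORIENTATIONS = isometries(SAMPLE)
```

For a block whose pixels are all different, the eight results are distinct, so three bits suffice to record the chosen orientation.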
Figure 3.1: A Flowchart of the Generic Algorithm
For each range block, domain blocks are searched until one domain block (and
orientation) can be made to fit closely. The graymap function is determined
using a least squares fit, and the error of the fit is computed. (The error
measure used is generally the mean square error, or MSE; this implies that the
distance metric used is not the Hausdorff metric as defined above; the theory
is much simpler when the Hausdorff metric is used.) When a good fit (or the
best possible fit) is found, it is recorded.
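A least squares graymap fit for one range and domain pairing might look like the sketch below. It assumes the domain pixels have already been averaged down to the range block size and flattened into a list; the function name and sample values are illustrative only.

```python
def fit_graymap(domain, rng):
    """Least-squares fit rng ~= a * domain + b over flattened pixel lists.

    Returns (a, b, mse). A flat domain block gives a = 0 and b = mean(rng).
    """
    n = len(rng)
    sy, sx = sum(domain), sum(rng)
    syy = sum(y * y for y in domain)
    sxy = sum(x * y for x, y in zip(rng, domain))
    denom = n * syy - sy * sy
    a = (n * sxy - sx * sy) / denom if denom else 0.0
    b = (sx - a * sy) / n
    mse = sum((a * y + b - x) ** 2 for x, y in zip(rng, domain)) / n
    return a, b, mse

# The range block here is exactly 0.5 * domain + 20, so the fit is perfect.
DOMAIN = [10, 50, 90, 130]
RANGE = [25, 45, 65, 85]
A, B, MSE = fit_graymap(DOMAIN, RANGE)
```

An encoder would call such a routine once per domain block and orientation and keep the pairing with the smallest error.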
Usually, if no good fit is found for a block, it is subdivided and each smaller
block is tested. (A quadtree approach is frequently used; this is referred to as
quadtree decomposition, or QD.) Certain constraints also may be placed on the
graymap functions; in particular, the allowed multiplicative constants are usually
bounded to ensure the convergence of the decompression algorithm along the
intensity axis.
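Quadtree decomposition can be sketched as a short recursion. The `fits` callback below is a hypothetical stand-in for the domain search and error threshold, and the toy criterion is chosen only to make the subdivision visible.

```python
def quadtree(x, y, size, min_size, fits):
    """Return the list of (x, y, size) range blocks chosen by decomposition.

    `fits(x, y, size)` is a caller-supplied test standing in for the
    domain search; a block is kept if it fits or cannot be split further.
    """
    if size <= min_size or fits(x, y, size):
        return [(x, y, size)]
    half = size // 2
    blocks = []
    for oy in (0, half):
        for ox in (0, half):
            blocks += quadtree(x + ox, y + oy, half, min_size, fits)
    return blocks

# Toy criterion: every block except those touching the origin fits at once.
def easy_away_from_origin(x, y, size):
    return x + y > 0

LEAVES = quadtree(0, 0, 16, 4, easy_away_from_origin)
```

Whatever the criterion, the resulting blocks tile the original square exactly, so every pixel is covered by one range block.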
The bit encodings of the transforms have a large impact on the compression
ratios. These encodings vary considerably from system to system; therefore, they
will not be covered in greater detail here. Applying some entropy encoding to
the resulting transform encodings is common and gives a small increase in the
compression ratio.
Appendix A gives example C programs which perform fractal image compression
and decompression. The hardware is designed to essentially replace the core
of this program (the do_block function); without too much difficulty, the program
could be modified to drive the hardware.
Decompression is performed by applying the LIFSM defined in the compression
to some seed image several times until it converges on (an approximation of)
the attractor. Typically, this does not require many iterations, usually fewer than




Arnaud E. Jacquin was the first to propose a practical automatic (not requiring
manual intervention or assistance) method of fractal image compression. His
initial work was in a Ph.D. dissertation in 1989; a summary of that work is found
in [6].
Jacquin's system used two fixed block sizes, selected based on an analysis of
the range blocks. Range blocks were classified as either constant ("shade") blocks,
modeled as a constant value; slowly-varying ("midrange") blocks, mapping to




transformations of four small domain blocks. The pool of domain
blocks of each size was based on a sampling from the image, with blocks of
constant brightness eliminated.
For midrange blocks, only one isometry was allowed and the graymap function
was limited to a linear function with one of four given multipliers. For edge
blocks, all eight isometries were allowed and six different multipliers were
permitted for the graymap function. (It would seem that somewhat better results
might have been achieved if eight multipliers were permitted, rather than six,
as doing so would not require any additional bits to encode.)
Jacquin used 8x8 range blocks for the large blocks and 4x4 blocks for the
small ones.
Jacquin's algorithm, when compared with the generic algorithm presented
previously, relies on categorizing the blocks to assist in finding good matches for
them, rather than using a pure brute-force approach as above. Further, only
two sizes of blocks are used, rather than a deeper quadtree as above. These
differences are presumably to reduce the required compression time to a more
reasonable amount, especially given the computer power available at the time of
his work. (Moore's law suggests that current computers are approximately 100
times faster than they were at the time of the publication of Jacquin's paper.)
3.3.2 Classification of Blocks
Jacquin, and most other researchers, classify the set of range and domain blocks
using some simple determination of their variance. Frequently, this is little more
than separating out blocks which are essentially a solid gray value (usually
called shade blocks) from those which contain noticeable gradients. Jacquin used
a slightly more complex model, as described above. By classifying range and
domain blocks, a directed search of the domain blocks may be done, requiring less
time (by a constant factor).
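The simplest form of this classification can be sketched as a variance threshold. The threshold value and the label names below are hypothetical; the text does not give specific figures.

```python
def classify(block, shade_threshold=4.0):
    """Label a flattened block "shade" if its variance is small.

    The threshold is a hypothetical tuning value, not taken from the text.
    """
    n = len(block)
    mean = sum(block) / n
    variance = sum((p - mean) ** 2 for p in block) / n
    return "shade" if variance < shade_threshold else "non-shade"

FLAT = [100] * 16          # a solid gray 4 x 4 block
STRIPED = [0, 255] * 8     # a block with strong gradients
```

A directed search would then compare a range block only against domain blocks of a compatible class, which is where the constant-factor saving comes from.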
Jumar and Jain proposed a much more complex system in [7], where 4 x 4
blocks were classified as either shade blocks or one of fifty-two varieties of
edge blocks. Each of the edge blocks in a category was mapped to the same
domain block, significantly cutting down on compression time by eliminating
the search for domain blocks from the inner loop of the compression engine.
They also employed a simple prediction algorithm for the shade and edge blocks
which improved the compression ratios by omitting similar adjacent transforms.
(Their approach is essentially a VQ scheme encoded as an LIFSM. A pure VQ
approach might produce similar results with less overhead; they did not, however,
investigate this possibility.)
3.3.3 Quadtree Recomposition
Quadtree recomposition (QR) builds a quadtree by starting with small blocks
and then combining them as needed. This yields a significant speedup over a
decomposition approach, as less computation is performed which is later rejected.
QR is analyzed in some detail in [8] (and in [9], which has nearly identical material
on the subject). In [8], QR reduced the compression times to about one-half to
one-third of the times required for QD.
QR is a useful approach for software-based compression systems; however, it
is less suitable for a parallel hardware implementation similar to that proposed
here, as it initially requires that a very large number of small blocks be
computed. This would, in turn, require a very large number of repeated units in
the pipeline. By performing decomposition, the total number of units required at
a given quadtree level is reduced. Further, unlike in software, computations which
are later ignored do not incur a time penalty; they may occur in parallel with
those which are not ignored.
Since it was desired that the sample compression engine code presented here
be similar to what would be used to drive the hardware, QD was used in preference
to QR.
3.4 Genetic Algorithms
In [10], a genetic algorithm is used to direct the search for domain blocks. The
authors found that doing so reduced the required number of comparisons
substantially (by a factor of twenty in one case) while achieving results nearly as
good as a complete search of the problem space.
3.5 Hardware and Parallel Approaches
Since hardware implementations may be (and generally are) inherently parallel,
parallel software implementations can give useful insight into possible hardware
designs.
3.5.1 Obvious Parallel Algorithms
Fractal image compression is easily parallelizable; the encoding of each range
block is independent of the others. In [11], one example is given. This obvious
general approach, treating the range blocks separately in parallel, is carried over
in the design presented here.
3.5.2 Jackson's and Mahmoud's Parallel Approach
In [8], D. J. Jackson and W. Mahmoud describe an interesting parallel
supercomputer implementation of fractal image compression on a SIMD parallel
machine, the nCube-2. Their approach is somewhat similar to the hardware approach
described here, although it is implemented in software.
They used a single master control processor and a queue of slave processors
which communicated with the master processor. The set of domain blocks was
distributed among the slave processors. Range blocks circulated around a circular
queue formed by the slave processors and were checked against the domain blocks.
When a good match was found, the data was transmitted to the master and a new
range block was inserted in its place.
Jackson and Mahmoud distributed the domain blocks, rather than the more
obvious replication of them at all the slave nodes, due to memory constraints
of the machine they were using. In the hardware proposed by this thesis, range
blocks are distributed and domain blocks circulated through the set of range
blocks. This proves somewhat simpler to implement in hardware, as range blocks
are both less numerous and smaller than domain blocks, allowing more
parallelism.
3.5.3 Acken's, Irwin's, and Owens's ASIC Architecture
In [12], an ASIC architecture is described for fractal image compression. The
architecture described is effectively a smart memory design; many small
general-purpose processing elements are scattered throughout a memory array. (These
processors are optimized for fractal image compression operations; however, it
appears that they could do other operations with different microcoding.) The
processors are arranged in a tree structure. Leaf nodes in this tree calculate
sums and cross products for domain and range blocks; the non-leaf nodes use the
results of lower nodes in the tree to determine a good mapping for the blocks.
Through this tree structure, several levels of a quadtree decomposition of the
image are computed simultaneously.
Several of the low-level design decisions they made are similar to those made
here; in particular, the computations are performed using a bit-serial approach,
similar to that used here. Similar constraints gave rise to these similar decisions;
in both cases, the area consumed was a primary concern.
One important optimization used in their design is a reduction in the
precision of some of the intermediate results in the calculations. They found that
such reductions had very little effect on the quality of the output image, while
reducing the time and area required for the computations. Such optimizations




4.1 Goals and Overview
The hardware design presented herein is aimed at substantially speeding up the
encoding process by performing many of the repetitive tasks involved in finding
transformations in parallel. It performs block matching using least squares linear
regression on several blocks and orientations simultaneously. As it is a
parameterized VHDL model, the number of range blocks processed simultaneously may
be easily changed; this allows the design to be synthesized to suit the available
area. (It would be desirable to have the maximum block size also configurable.
Doing this would require a substantial amount of additional work and testing, as
the widths of many of the signals depend upon this value. While design flexibility
is an excellent goal, flexibility in this particular area required too much effort to
be practical.)
The device is designed to be used in conjunction with a host processor which
performs some of the higher-level computations and many bookkeeping tasks.
Communication with the host processor is via several device control registers
and a shared memory area to hold the source image.
4.2 General Notes and Some Conventions
A few important conventions are followed throughout this design. Signals,
entities, and other VHDL objects are named in all caps, with underscores between
words. In this document, references to specific VHDL objects are made with a
typewriter font, SAMPLE_VHDL_OBJECT.
All signals are active high, with one important exception: RESET, which is
for a power-on reset, is active low. This exception follows current practice and
may lead to better synthesis for certain target architectures. RESET operates
asynchronously. Not all sequential parts of the design are affected by RESET;
some are reset at various times in the normal course of operation, and therefore



































Figure 4.1: A Block Diagram of the Hardware
The hardware consists of three distinct kinds of blocks, called units: a single
memory interface unit, a single host interface unit, and a number of pipeline
units. Each pipeline unit performs comparisons of domain blocks with one range
block; thus, ideally one should have as many pipeline units as there are range
blocks of a particular size in an image.
The memory interface and pipeline units are further subdivided into smaller
blocks, called chunks. Figure 4.1 shows an overview of the system architecture.
(In this diagram, control signals have been omitted for clarity; the arrows indicate
data flow only.)
In general, the design has been kept fairly simple (and thus small and
relatively slow) to allow for a large number of units to be pipelined together. It is
believed that performing many computations in parallel will be faster overall than
performing fewer at a higher speed. Additionally, a fairly simple implementation
of the various parts yielded a design which is evenly matched in speed; no single
component is a substantial performance bottleneck.
4.3.1 Main Memory Interface
The main memory interface is responsible for addressing memory shared between
the host processor and the rest of the device. It is composed of three main
components: a block addressing chunk, an interface and cache chunk, and a
denominator computation chunk.
Block Addressing Chunk
The block addressing chunk generates the required addresses to access a single
block within an image. This is controlled by three control signals: the starting
address, the size of a block, and the length of a row in the source image.
Additionally, the block addressing chunk performs pixel averaging for domain
blocks. The block addressing chunk is implemented, in VHDL, as a fairly
straightforward state machine with some auxiliary registers for performing addressing.
More details of the design are found in Section 5.2.
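The address generation can be modeled in a few lines. The sketch below is a software analogy rather than the VHDL state machine: it assumes a row-major image held in a flat byte array, treats `size` as the range block size, and uses a truncating average, all of which are assumptions made for illustration.

```python
def block_addresses(start, size, row_length):
    """Yield byte addresses of a size-by-size block in row-major order."""
    for row in range(size):
        for col in range(size):
            yield start + row * row_length + col

def read_range_block(memory, start, size, row_length):
    """Range blocks are read directly, one pixel per address."""
    return [memory[a] for a in block_addresses(start, size, row_length)]

def read_domain_block(memory, start, size, row_length):
    """Domain blocks cover 2*size pixels per side; average each 2x2 group."""
    out = []
    for row in range(size):
        for col in range(size):
            a = start + 2 * row * row_length + 2 * col
            total = (memory[a] + memory[a + 1]
                     + memory[a + row_length] + memory[a + row_length + 1])
            out.append(total // 4)  # truncating average (an assumption)
    return out

# An 8-pixel-wide image stored as a flat byte array: pixel value = address.
ROW_LENGTH = 8
MEMORY = list(range(64))
```

Both readers return the same number of pixels, which is what lets a domain block be compared directly against a range block of one quarter its area.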
Memory Interface and Cache Chunk
This chunk performs the actual interfacing with the memory bus. Currently, it
is assumed that the memory is sixteen bits wide; this block performs the byte
extraction as necessary. It also maintains any memory cache used; in the current
design, this is a very small four byte cache, which is only sufficient to prevent
repeated accesses to the same memory word, such as would otherwise occur when
averaging four adjacent pixels during the reading of domain blocks.
If the memory interface were the limiting factor on system performance, a
larger (but still relatively modest) cache capable of holding any single domain
block would significantly reduce the memory bandwidth requirements. A further
reduction could be realized by having an entire row of domain blocks in cache at
once; in this last case, an average of about one byte per domain block would be
fetched from memory during operation.
Separating these parts from the rest of the system allows the design to be
modified for use with various system architectures easily. The specifics of the
cache are very heavily related to the precise bus design and the hardware speed,
and are therefore likely to change when the memory interface changes. Keeping
these two in the same chunk minimizes the difficulty of adapting the design for
other environments.
Details on the design used are found in Section 5.1.
Denominator Computation Chunk
This chunk performs computations which are used by all the parameter
computation units in the pipeline. This chunk is also responsible for high-level
addressing of domain blocks; that is, for automatically iterating over all the
blocks in the image.
The two functions of this chunk are largely separate, and could easily be
implemented as two chunks. As neither is very complex, however, it was simplest
to combine them and avoid an additional level of interface.
Section 5.3 describes this chunk in greater detail.
4.3.2 Pipeline Unit
The core of the system is a series of pipeline units; each unit compares a single
range block with a stream of domain blocks. The units each consist of several
chunks: a range block chunk, a parameter computation chunk, and eight
multiply-accumulate (MAC) chunks. Ideally, enough pipeline units should be present in
an implementation to be able to hold the maximum number of range blocks of
some size in an image to be compressed.
Range Block Chunk
The range block chunk stores one range block from the image and provides the
MAC chunks with pixels from the block in various orderings (which correspond
to the eight possible isometries for mapping one block to another). It consists
of a small memory array and the required addressing hardware. The memory
array is actually implemented in a separate file, to allow one to easily use a
predefined memory component for an actual implementation technology. (Synthesis
tools generally produce inefficient circuitry for relatively large blocks of memory.)
Two versions of this file were developed: one as a placeholder for synthesis,
containing instantiations of architecture-independent register arrays; and one as a
placeholder for simulation, containing an algorithmic description of the memory.
(This was done because the synthesizable version was significantly slower for
simulation.)
Key to the range block chunk, but also important elsewhere, is the maximum
block size; larger blocks require larger memory arrays. A block size of 32 x 32 is
currently used and appears to be a reasonable value. (Each range block chunk
contains enough memory to store one block; for a 32 x 32 block size, this amount
is 1,024 bytes.)
Details on the range block chunk are found in Section 6.2.
Multiply-Accumulate Chunks
These chunks combine an eight-bit multiplier and a twenty-six-bit accumulator.
Multiplication is unsigned and is performed using the usual shift-and-accumulate
approach (taking eight clock cycles per multiplication). Faster multiplier designs
are not useful here, as the range block units would have to have more than one
read port and the main memory interface would need to supply data faster, both
of which add substantially to the complexity of the system. Further, it is likely
that the bandwidth to the main memory would not allow a substantial increase
in operational speed without a lot more caching of memory data.
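The shift-and-accumulate scheme and the choice of accumulator width can be checked with a short software model. This is a behavioral sketch only, not the bit-serial VHDL; the function names are illustrative.

```python
def shift_add_multiply(a, b):
    """Multiply two unsigned 8-bit values, one bit of b per clock cycle."""
    assert 0 <= a < 256 and 0 <= b < 256
    product = 0
    cycles = 0
    for bit in range(8):
        if (b >> bit) & 1:
            product += a << bit  # add the shifted multiplicand
        cycles += 1
    return product, cycles

def mac_block(xs, ys, acc_bits=26):
    """Accumulate sum(x * y) and check it fits the accumulator width."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += shift_add_multiply(x, y)[0]
    assert acc < (1 << acc_bits), "accumulator overflow"
    return acc

# Worst case for a 32 x 32 block of 8-bit pixels.
WORST = mac_block([255] * 1024, [255] * 1024)
```

The worst-case sum for a 32 x 32 block, 1,024 x 255 x 255 = 66,585,600, lies just below 2^26 = 67,108,864, so twenty-six bits is the smallest width that cannot overflow.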
The MAC chunks are equipped with a double-buffered output; this allows the
parameter computation chunk to operate on one set of numbers while the next
set is being computed in the MAC chunks.
Since all the MAC units are implemented in a single pipeline, one of the
operand shift registers (that for the domain block data) is actually implemented
as single bit slices in each MAC unit.
Parameter Computation Chunk
The parameter computation chunk is responsible for computing a distance
measure for the mapping. This error measure is based on the mean square error when
linear regression by least squares is applied, given certain constraints (primarily
requiring that the resulting transformation be contractive in all dimensions).
The mathematical background and design of this chunk are more complicated
than those of the other chunks; the reader is referred to Section 6.1, where they
are explained in detail.
4.3.3 Host Interface Unit
The host interface provides for communication between the host processor and
the hardware. It also generates certain control signals which are necessary for
the operation of the system. The host interface consists of a number of
sixteen-bit registers; currently, no interrupt capability exists, although that
could be added without great difficulty. (This was not included because the
specifics of the interrupt system are highly dependent on the specific host
processor used and because it was not helpful in the simulations of the system.)
Specifics for the host interface are found in Chapter 7.
4.4 Testing and Support Code
A few VHDL entities were developed for testing the system. These are not
intended to be synthesized; they exist only to aid in simulating the system
operation.
4.4.1 Clock and Reset Generator
A very simple clock and reset generator was implemented. The simulated clock
operates at 10 MHz. This speed was chosen because it made mental conversions
of time and clock tick counts simple; an actual implementation of the design using




A simulation of the shared memory was developed; this simulated memory is
automatically loaded from a file when the simulation is started. The bmp2memfile
program, described in Section D.1, creates this file.
Chapter 5
The Memory Interface Unit
The memory interface unit is divided into three components: the memory cache
chunk, the block addressing chunk, and the denominator computation chunk.
The memory cache chunk is responsible for talking to the main memory system
and for any data caching. The block addressing chunk generates the appropriate
addresses for accessing blocks of an image and performs some important
computations on domain blocks (pixel averaging and the computation of the
quantities needed for parameter computation). Finally, the denominator
computation unit forms the interface between the memory addressing hardware and the
pipeline units by automatically adjusting the block addresses as needed and by
computing the denominator value required by the parameter computation unit
($n\sum y^2 - (\sum y)^2$). It also computes the parameters associated with range
blocks when they are loaded ($\sum x$ and $(\sum x)^2$).
This separation has several important advantages. It makes the design more
manageable and understandable; a single unified unit would be significantly more
complex than any chunk. It makes improving the cache system much easier, as
the cache logic is separate from the other parts. It also makes the design much
easier to modify for different memory bus designs.
Separating the block addressing and the denominator computation allows
changes to be made more easily to the pipeline (and, in particular, to the
parameter computation strategy) without changing the logic to address blocks.
The cache and the memory interface were not separated because they are
inherently closely related. The amount and organization of the cache are largely
determined by the specifics of the memory interface.
5.1 Memory Cache Chunk
5.1.1 The Sample Memory Interface
The memory interface currently used in this design is a simple, representative
sixteen-bit wide bus. While typical of many standard bus designs, and especially
similar to an MC68000 bus, it does not attempt to adhere to any specific protocol.
A modern implementation would probably need a more complex external bus,
and thus substantial changes to the implementation. The general approach to
the design described here ought to be adaptable to a wide variety of situations,
however.
Arbitration signals for the memory are BUS_REQUEST and BUS_AVAILABLE.
When the system needs some datum from memory, it first asserts BUS_REQUEST;
some external logic should assert BUS_AVAILABLE when the device has the bus.
The device keeps BUS_REQUEST active throughout its memory operation (which is
always a single word).
The memory is addressed using ADDRESS_OUT, strobed by ADDRESS_STROBE.
The memory is byte addressed with a 32-bit wide address bus; however, since
only word (two byte) transfers are performed, the least significant bit of the
address is omitted from the physical memory bus.
The external memory asserts MEM_READY after it puts the requested data on
DATA_IN. Once the device has strobed in this value, the memory cycle terminates
and control of the bus is relinquished.
All inputs are sampled on the rising edge of the clock, and outputs change
on the rising edge of the clock; hence, this interface may be characterized as
semi-synchronous. No precise timing requirements are presented for the sample
memory interface bus; it is intended primarily as a placeholder to enable testing.
Since the device never writes to memory, no protocol for writes is defined in
this sample interface.
5.1.2 System Interface Logic
The memory cache chunk must provide individually addressed bytes to the rest
of the system. The interface used is synchronous.
The system provides a thirty-two bit address on ADDRESS_IN, and a flag
indicating a valid request (DATA_OUT_NEEDED). The cache unit should respond with
the requested data on DATA_OUT and set the DATA_OUT_HERE flag.
In the sample design, the logic to drive this interface is entirely
combinational. It simply determines if the data is in the cache (based on the cache tags),
and extracts it if it is present. If the data is not in the cache, a flag indicating
this is asserted, which starts the main memory interface state machine.
Eventually, when a memory cycle completes, the data will be placed in the cache and
simultaneously propagated to the rest of the system.
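The hit and miss behavior can be illustrated with a small software model of a two-word cache serving byte reads. The replacement policy (entries replaced in turn) and the big-endian byte order are assumptions made for this sketch, as are the class and attribute names.

```python
class WordCache:
    """Two-entry cache of 16-bit words serving byte reads."""

    def __init__(self, memory_words):
        self.memory = memory_words   # list of 16-bit words
        self.tags = [None, None]     # word address held by each entry
        self.words = [0, 0]
        self.next_victim = 0         # assumed policy: replace entries in turn
        self.bus_reads = 0

    def read_byte(self, address):
        word_address = address >> 1  # LSB is not sent on the physical bus
        if word_address in self.tags:
            word = self.words[self.tags.index(word_address)]
        else:
            # Miss: one bus cycle fetches the word and fills a cache entry.
            word = self.memory[word_address]
            self.bus_reads += 1
            self.tags[self.next_victim] = word_address
            self.words[self.next_victim] = word
            self.next_victim ^= 1
        # Assumed big-endian byte order: even addresses use the high byte.
        return (word >> 8) & 0xFF if address % 2 == 0 else word & 0xFF

# Eight bytes 0..7 packed into four big-endian words.
CACHE = WordCache([0x0001, 0x0203, 0x0405, 0x0607])
# A 2x2 pixel group in a 4-byte-wide image: addresses 0, 1, 4, 5.
GROUP = [CACHE.read_byte(a) for a in (0, 1, 4, 5)]
```

In this model the four byte requests of one averaging step cost two bus cycles rather than four, which is the saving the small cache is intended to provide.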
5.1.3 Memory Interface Logic
A state diagram of the memory interface logic is shown in Figure 5.1. The
BUS_REQUEST state is active whenever the memory bus is wanted but not currently
granted; the MEMORY_WAIT state is active while the system waits for the memory
to respond.
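A read cycle through these two states might be modeled as follows. The IDLE state name and the exact transition conditions are assumptions made for this sketch; only BUS_REQUEST and MEMORY_WAIT are named in the text.

```python
def next_state(state, miss, bus_available, mem_ready):
    """One clock step of a three-state read-cycle controller."""
    if state == "IDLE":
        return "BUS_REQUEST" if miss else "IDLE"
    if state == "BUS_REQUEST":
        # The request is held until external logic grants the bus.
        return "MEMORY_WAIT" if bus_available else "BUS_REQUEST"
    if state == "MEMORY_WAIT":
        # The cycle ends once the memory signals MEM_READY.
        return "IDLE" if mem_ready else "MEMORY_WAIT"
    raise ValueError(state)

def run(inputs, state="IDLE"):
    """Apply a list of (miss, bus_available, mem_ready) samples."""
    trace = [state]
    for miss, bus_available, mem_ready in inputs:
        state = next_state(state, miss, bus_available, mem_ready)
        trace.append(state)
    return trace

TRACE = run([(True, False, False),   # cache miss starts a request
             (True, False, False),   # bus not yet granted
             (True, True, False),    # bus granted
             (True, True, False),    # memory still busy
             (True, True, True)])    # data ready
```

Each wait in the trace corresponds to one clock cycle in which the controller samples its inputs on the rising edge and stays in place.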
When a read occurs, the data goes into one of two word-sized cache blocks. A




















Figure 5.1: Memory Interface State Diagram
tually an optimal strategy when reading domain blocks for this (small) cache size.
A further discussion of the cache in this design may be found in Section 10.1.1.
In addition to updating the caches, the result of a read is propagated to the
block addressing chunk immediately. (The alternative, only passing on values
from the cache registers, adds an additional wait state to memory accesses and
thereby slows down the system substantially.)
5.2 Block Addressing Chunk
The block addressing chunk is primarily responsible for generating memory
addresses to access a block within an image. The image is assumed to be
uncompressed, so the addressing is fairly straightforward. Specifically, the image is
stored in a rectangular array in memory, perhaps with padding at the ends of the
rows of the image. (Throughout this document, it is assumed that the image is in
row-major ordering with the origin at the upper-left corner of the image. This is
primarily a conceptual convention; column-major ordering poses no operational
difficulties to the hardware, but does change the interpretation of a few inputs
and outputs.)
In addition, the block addressing chunk performs averaging on four adjacent
pixels when reading domain blocks. When loading range blocks, no averaging
occurs.











Figure 5.2: State Diagram for the Block Addressing Chunk
dotted arrows, from the IDLE state to the NORM_ASK and AVG_ASK1 states indicate
a typical path; in the actual implementation, the AVG_ASK1 and NORM_ASK states
may be entered at any clock cycle when the START signal is active.
The NORM_ASK and NORM_UPDATE states are used when loading a range block;
the others are used when comparing blocks (e.g. when reading domain blocks and
averaging pixel values). During the various _ASK states, the system loops until the
memory is ready; during AVG_OUTPUT1, the system loops until the denominator
computation chunk is ready for another value.
5.3 Denominator Computation Chunk
The denominator computation chunk fulfills two main purposes: it provides
high-level addressing of blocks (as opposed to addressing pixels within a block, which
the block addressing chunk does), and it computes a few values which are used
for parameter computation. These two activities do not interact much; they
could fairly easily be implemented in separate chunks. As they are both fairly
simple and straightforward, this was not done; the complexity of a single chunk
containing both is not too great and avoids an additional interface which must
be maintained.
5.3.1 High-level Block Addressing
The state machine which performs the high-level block addressing is shown in
Figure 5.3. There are two primary paths in this system: one, through LOAD_START,
for loading range blocks, and one for checking denominators. The M1 through M8
states ensure that the various multipliers, in the multiply-accumulate units and
in the denominator computation itself, have enough time to operate. Although
the arrows are not explicitly shown, the system goes invariably from state M1 to
M2 and so on to M8.
Figure 5.3: High-Level Block Addressing State Diagram
A series of addressing counters, updated in the NEW_BLOCK state, control block
addressing. A single counter is used to count the pixels within a single block.
5.3.2 Denominator Computation
The denominator computed is $n\sum y^2 - (\sum y)^2$. The computation is performed
using a combinational subtracter and shift-and-add multipliers; it effectively adds
ten cycles to the parameter computation time required. (The entire process
requires eighteen cycles, but eight of those are overlapped with the eight cycles
required for the multiply-accumulate units to perform a multiplication.)
Also computed as part of this are $\sum y$ and $\sum y^2$, which are also required
by the parameter computation. Through double-buffering, the parameter computations
and multiply-accumulate steps overlap. For large blocks, larger than about 10 x
10, the limiting factor is the multiply-accumulate time; for smaller blocks, it is
the parameter computation time.
This hardware is also used to compute $(\sum x)^2$ when range blocks are loaded.
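The denominator computation can be stated compactly in software. The sketch below accumulates the sums as pixels stream past and uses integer arithmetic throughout; it models the arithmetic only, not the eighteen-cycle hardware schedule, and the function names are illustrative.

```python
def domain_sums(pixels):
    """Accumulate n, sum(y) and sum(y^2) as pixels stream past."""
    n = sy = syy = 0
    for y in pixels:
        n += 1
        sy += y
        syy += y * y
    return n, sy, syy

def denominator(pixels):
    """Return n * sum(y^2) - (sum(y))^2 using integer arithmetic only."""
    n, sy, syy = domain_sums(pixels)
    return n * syy - sy * sy

FLAT_BLOCK = [7] * 16
RAMP_BLOCK = [1, 2, 3, 4]
```

The result equals $n^2$ times the variance of the block, so it is never negative and is zero exactly when every averaged domain pixel has the same value.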








The pipeline unit is composed of a range block chunk, eight MAC chunks, and a
parameter computation chunk. The parameter computation chunk is, by far, the
most complex of these.
6.1 The Parameter Computation Chunk
6.1.1 Mathematical Basis for Parameter Computation
In least squares linear regression, given a set of range points $x_1, x_2, \ldots, x_n$ and
a set of domain points $y_1, y_2, \ldots, y_n$ (which, in this case, are averaged values), an














Expanding this expression, one finds that
JZ\x2
-











































The parameter computation chunk computes an expression derived from this
for every isometry and keeps the one with the lowest error, provided the a
parameter satisfies $-1 < a < 1$. It would be desirable to also determine the MSE with
$|a| = 1$ when the best a satisfies $|a| > 1$ and store the pairing if, with $|a| = 1$, the
MSE is still better than the previous best. This is not actually done, as doing
so would require a great deal of additional hardware, and the improvement in
compression is usually insignificant.




is computed by the memory interface unit,
and is passed through the pipeline of the pipeline units. If this quantity is zero,
then the domain block is (when averaged) a solid, constant value; in this case, it
should be discarded from the search. (It is assumed that range blocks which are
adequately modeled as shade blocks are handled by the host processor separately;
therefore, comparing a non-shade block to a pure shade block is useless. Not
discarding such a domain block may result in an attempt to divide by zero if it
is determined to be a new best block, hence the requirement to discard. The
actual hardware uses a slightly different approach to the problem, and avoids the
division altogether; because of this, it is not necessary for the hardware to discard
such domain blocks.)
The parameter computation unit must (for each isometry) determine if $|a| <$









If so, it stores new values for r









Ds = D. (Here, r/Ds is a measure of the relative MSE difference). Note that r
may become negative; thus, the comparison and




can never be negative, however.
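The division-free style of test described above can be sketched in software as follows. This is an illustration under stated assumptions, not the circuit itself: it assumes that a is a ratio of a signed numerator to a strictly positive denominator, and that two candidates are ranked by comparing ratios of the form r/Ds. The function names are hypothetical, and the products are assumed to fit in 64 bits.

```c
#include <stdint.h>

/* Nonzero when |num/den| < 1, assuming den > 0. Comparing the magnitude of
 * the numerator against the denominator avoids the division. */
int magnitude_below_one(int64_t num, int64_t den)
{
    int64_t mag = (num < 0) ? -num : num;
    return mag < den;
}

/* Nonzero when r1/d1 < r2/d2, assuming d1 > 0 and d2 > 0. Multiplying both
 * sides by the positive denominators preserves the ordering, so a negative
 * numerator is handled correctly by signed multiplication. */
int ratio_less(int64_t r1, int64_t d1, int64_t r2, int64_t d2)
{
    return r1 * d2 < r2 * d1;
}
```

The two signed products in `ratio_less` are one plausible reading of why signed multipliers would be needed at the end of such a computation.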
6.1.2 Implementation
A block diagram of this operation is shown in Figure 6.1. The implementation
used is a pipelined serial design, where one bit proceeds down the pipeline at each
clock cycle. Pipeline registers are located only at the multipliers; the adders and
comparators are essentially combinational and







Figure 6.1: A Block Diagram of the Parameter Computation
(Legend: D = time delay (multiply by 2); subtraction or negation; unsigned multiplication)
(These components are not completely combinational, in that there is some state
associated with them, for instance carry bits in the adders. They do not operate
as separate pipeline stages, however.) This approach was used because it requires
relatively little area while giving adequate performance. (With the current design,
the parameter computation becomes a performance bottleneck when the range
block size is less than about 10 x 10; for larger blocks, the time required to perform
the comparison with the domain block dominates. These two operations take
place in parallel. The exact transition point depends on the maximum block size
allowed by the design, which in turn determines the width of the computations
in the unit. Currently, 32 x 32 is the maximum value.)
Multiplications throughout the parameter computation unit are performed
by a shift-add approach. Where signed multiplication is required (the two final
multiplications), a radix 2 Booth multiplier is used. In the situations where an
unsigned multiply is followed immediately by a negation, these two operations
are combined by using a shift-subtract design rather than a shift-add design; this
is not reflected in the diagram in any special way.
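The two multiplication schemes named above can be modeled in software as follows. This is a behavioral sketch only, assuming 16-bit operands for illustration (the actual operand widths in the design are not stated here); the function names are hypothetical, and the loop processes one multiplier bit per iteration in the manner of a serial multiplier.

```c
#include <stdint.h>

/* Unsigned shift-and-add: add the shifted multiplicand for each set bit
 * of the multiplier, least significant bit first. */
uint32_t shift_add_mul(uint16_t a, uint16_t b)
{
    uint32_t acc = 0;
    uint32_t addend = a;
    for (int i = 0; i < 16; i++) {
        if (b & 1u)
            acc += addend;
        addend <<= 1;
        b >>= 1;
    }
    return acc;
}

/* Signed radix-2 Booth multiplication: examine each multiplier bit together
 * with the bit below it (an implicit zero below bit 0). */
int32_t booth_mul(int16_t a, int16_t b)
{
    int32_t acc = 0;
    int prev = 0;
    uint16_t bits = (uint16_t)b;
    for (int i = 0; i < 16; i++) {
        int cur = (bits >> i) & 1;
        /* a * 2^i, written as a product to avoid shifting a negative value */
        int32_t term = (int32_t)a * ((int32_t)1 << i);
        if (cur && !prev)
            acc -= term;   /* bit pair 10: a run of ones begins */
        if (!cur && prev)
            acc += term;   /* bit pair 01: a run of ones ends */
        prev = cur;
    }
    return acc;
}
```

Replacing the addition in `shift_add_mul` with a subtraction yields the negated product directly, which corresponds to the shift-subtract variant mentioned in the text.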
If a parameter comparison is good, that is, if the new mapping is better than
the previous best, the system must store four pieces of information: the new Ds,
the new r, the coordinates of the domain block (stored as the starting address
of the block), and the particular isometry. Only the last two of these need be
visible to the host processor; the values of r and Ds are of little concern to the
driver program. Since an actual MSE value and the a and b regression parameters
are not computed, the host processor will need to determine these for each block.
This is not a great computational burden, as it must be done for only one
mapping per range block. The alternative, computing them in special-purpose
hardware, would require significantly more circuitry than the present design, and
would end up reducing the performance of the system by allowing fewer pipelined
comparison units on a chip. (In particular, computing the actual values for these
would require circuitry for division, which does not fit well with the bit-serial
computations performed in this design.)
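The host-side computation mentioned above might look like the following sketch. It uses the standard least squares formulas for fitting range pixels x by a*y + b from averaged domain pixels y; the type `fit_t` and function `fit_block` are hypothetical names, and this is not taken from the driver program itself.

```c
#include <stddef.h>

/* Regression result for the mapping x ~ a*y + b. */
typedef struct {
    double a;    /* scale parameter */
    double b;    /* offset parameter */
    double mse;  /* mean squared error of the fit */
} fit_t;

/* x: range pixels, y: averaged domain pixels, n: pixels per block.
 * Returns 0 on success, or -1 when n is zero or the denominator
 * n*sum(y^2) - (sum(y))^2 is zero (a constant domain block). */
int fit_block(const double *x, const double *y, size_t n, fit_t *out)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (size_t i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double d = (double)n * syy - sy * sy;
    if (n == 0 || d == 0.0)
        return -1;
    out->a = ((double)n * sxy - sx * sy) / d;
    out->b = (sx - out->a * sy) / (double)n;
    /* Expanded form of sum((a*y + b - x)^2) / n */
    out->mse = (sxx + out->a * (out->a * syy - 2.0 * sxy + 2.0 * out->b * sy)
                + out->b * ((double)n * out->b - 2.0 * sx)) / (double)n;
    return 0;
}
```

Because this runs once per range block rather than once per candidate mapping, the division cost falls on the host only for the final selection.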
The parameter computation chunk also buffers the required values from the
memory interface unit and passes them on to the next pipeline stage. Further,
it produces values for the local host registers, which are multiplexed to the host
when requested.
6.2 The Range Block Chunk
The range block unit is conceptually straightforward. It consists of a small
memory block, sufficient to hold one range block, and appropriate addressing circuitry.
The addressing circuitry required to produce the eight possible isometries is not
too complex; various permutations of two counters and two subtracters are all
that is required. Each pipeline unit maintains a separate set of these counters;
it is thought that the overhead of passing the values along the pipeline would
outweigh the overhead of the counters themselves. The current approach is also
somewhat simpler conceptually.
The actual implementation is a little less clear, due to constraints in synthesis.
Synopsys, like most current synthesis tools, is not especially adept at synthesizing
anything other than small blocks of memory. For this reason, a separate file
contains an entity for the actual memory block. It is anticipated that an actual
device would use a pre-designed memory component; having the description in
a separate file allows this to be done easily. As a generic placeholder, an array
of DesignWare memory components is instantiated in this file to allow synthesis.
The resulting synthesized hardware is not very area efficient, however; a flip-flop
is used for each bit in the memory, rather than a normal static or dynamic RAM
cell.
Table 7.2 lists the eight isometries and the addressing which is used to
generate them.
6.3 The Multiply-Accumulate Chunks
These are quite straightforward; the overview in Section 4.3.2 should be sufficient
for the reader to understand them.
Chapter 7
The Host Interface Unit
7.1 Description
The host interface allows the host processor to control the device and to read
data back after the computation is done. It consists of several memory-mapped
registers, some of which are read-only and some of which are also writable. The
registers are divided into two sets: global registers, which apply to the system
as a whole, and local registers, which apply only to one particular pipeline unit.
There are sixteen address locations for each set of registers, but not all are actually
implemented.
All registers are nominally sixteen bits wide, although not all bits may be
operable; bits which are not used are always read as zeros and are ignored on
writes. In general, if a register is partially implemented, the least significant bits
are the ones implemented.
The registers are outlined in Table 7.1; a description of the individual registers
follows.
7.1.1 Global Registers
Except for certain bits in the Mode/Control register, all global registers are
writable. For correct operation, there are some restrictions on when they may
be written. These restrictions are not enforced by the hardware and must be
observed by the host software.
Mode/Control Register
This register determines the global mode of the system (whether loading range
blocks or computing domains) and commences any operations. It also contains
flags indicating the system status. Only two bits are used.
The load bit, bit 1, determines the mode the system is in; writing a value of
1 to this bit indicates that range blocks are to be loaded, while a zero indicates
that comparisons are to be performed. This bit should not be changed during
the entire loading process (e.g. it should remain at 1 throughout).
A3..0  Global Reg (A4 = 0)    Local Reg (A4 = 1)
0      Mode/Control           Flags
1      Block Size             Best Isometry
2      Block Pixels           Best Address (MSB)
3                             Best Address (LSB)





6      Image Row Length
7      Image Width
8      Image Height           Best Ds (MSB)
9      Pipeline Unit Select   Best Ds (LSB)













Table 7.1: The Host Registers
The start bit, bit 0, commences a range block load or a series of comparisons
when a 1 is written to it. The implementation allows the writing of both the
mode bit and the start bit at the same time; if the mode is changed, starting
will commence the operation in the new mode. Technically, the start bit is a
write-only bit; reading the bit produces the busy bit. Since the two are closely
related, no real confusion should result. Setting the start bit while an operation
is in progress is not allowed. (It probably would cause the operation to start
again from scratch, but this is untested and possibly untrue in some cases.)
The busy bit, also bit 0, indicates that the system is working, either loading
or comparing as the case may be. This is a read-only bit. This bit is cleared
when the first parameter computation chunk has finished; the others may still be
busy for a time afterwards. (This allows the host processor to read early results
while the later ones are still being computed.)
There may be a delay of a few clock cycles between writing a one to the start
bit and the busy bit being set. It is therefore recommended that a small delay
be inserted between a write and a subsequent read of the register.
Additional bits could be implemented in the future if some application required
them. Some possible examples include a last pipeline unit busy bit, a
reset bit, and a halt bit.
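The bit layout of the Mode/Control register described above can be captured in driver code along the following lines. This is a sketch only: the macro and function names are hypothetical and do not come from the actual driver program, and the code manipulates register values without performing any bus access.

```c
#include <stdint.h>

/* Hypothetical names for the two bit positions described in the text. */
#define MC_START 0x0001u  /* bit 0 when written: commence an operation  */
#define MC_BUSY  0x0001u  /* bit 0 when read: an operation is under way */
#define MC_LOAD  0x0002u  /* bit 1: 1 = load range blocks, 0 = compare  */

/* Build the value to write to the Mode/Control register. The mode and
 * start bits may be written together, as the text permits. */
uint16_t mode_control_word(int load, int start)
{
    uint16_t w = 0;
    if (load)
        w |= MC_LOAD;
    if (start)
        w |= MC_START;
    return w;
}

/* Interpret a value read back from the Mode/Control register. */
int is_busy(uint16_t readback)
{
    return (readback & MC_BUSY) != 0;
}
```

A driver would write `mode_control_word(1, 1)` to begin loading a range block, wait briefly as recommended above, and then poll with `is_busy` until the bit clears.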
Block Size Register
This six-bit register must be loaded with the size of the desired block, less one;
thus, for 8 x 8 blocks, a value of 7 must be loaded. The largest supported block
size is (currently) 32 x 32. This should not be changed while any operation is in
progress; further, it should have the same value when comparing blocks as it had
when loading the range blocks for the results to be valid.
Block Pixels Register
This eleven-bit register should be loaded with the number of pixels in a block;
for example, for an 8 x 8 block, it should be loaded with 64. This register should
be changed under the same circumstances as the Block Size register.
Image Row Length Register
This sixteen-bit register specifies the length, in memory, of a row in the image;
this includes any padding which may be present in memory but not part of the
image proper. It is used for address computations. (It is assumed that the image
is stored in an uncompressed row-major array of bytes.)
This register generally should be set only once per image and not changed
subsequently. It must not be changed while any operation is in progress.
Image Width Register
This sixteen-bit register counts the number of (overlapping) blocks present in a
row of the image.
This register should not be changed while block comparisons are happening.
Image Height Register
This sixteen-bit register counts the number of (overlapping) blocks present
vertically in the image.
This register should not be changed while comparing blocks.
Pipeline Unit Select Register
This sixteen-bit register chooses which pipeline unit will be loaded and which one
will provide values for the local registers. Pipeline units are numbered starting
with zero; the results of loading this register with a number greater than the
greatest pipeline unit are undefined (and thus doing so should be avoided).
This register must not be changed while a range block is being loaded; it may
be changed at any time during block comparisons, however.
Base Address Register
This thirty-two bit register specifies the base address of either the image (for
comparing blocks) or the specific range block (for loading a range block). It may




Other registers may be helpful in some applications; for instance, a read-only
register which contains the number of pipeline units would allow driver software
to automatically adjust to various implementations of the design.
7.1.2 Local Registers
All local registers are read-only. Many may not be of much use during general
operation of the device, but are provided for the software in case they actually
turn out to be useful. These unnecessary registers may be removed from the
system, or replaced with more useful ones. (They are not particularly helpful for
simulations, as the simulator allows one to easily examine the signals directly.)
Flags Register
This register contains flags which are local to the pipeline unit. At present,
there is only one such flag, the idle flag, which is set whenever the parameter
computation chunk of the pipeline unit is not operating. This allows the driver
software to ensure that the pipeline unit has finished the last comparison before
reading the results.
Best Isometry Register
This three-bit register contains a code for the isometry of the best match found
in the search. The translation of the isometries is provided by Table 7.2; all
rotations are clockwise, and all reflections are about the horizontal axis. (This
table assumes a row-major ordering of the pixels, from the upper-left corner to
the lower-right corner of the image. Other arrangements would lead to different
geometric descriptions.) This register is automatically cleared to zero when a
new range block is loaded.
Code  Geometric Definition          Algebraic Description
000   No Isometric Transformation   (x, y) -> (x, y)
001   Reflection, Rotation by 90    (x, y) -> (y, x)
010   Reflection, Rotation by 180   (x, y) -> (n-x, y)
011   Rotation by 270               (x, y) -> (n-y, x)
100   Reflection                    (x, y) -> (x, n-y)
101   Rotation by 90                (x, y) -> (y, n-x)
110   Rotation by 180               (x, y) -> (n-x, n-y)
111   Reflection, Rotation by 270   (x, y) -> (n-y, n-x)
Table 7.2: The Isometry Codes
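Read row by row, the index mappings in Table 7.2 are consistent with each code bit acting independently: bit 0 exchanges the two indices, bit 1 reverses the first resulting index, and bit 2 reverses the second. The following sketch models that reading in software; it is an interpretation of the table rather than a description of the actual circuit, the function name is hypothetical, and n is assumed to be the largest index (the block size less one).

```c
/* Map counter values (x, y) to the indices used for a given isometry code.
 * code: three-bit isometry code; n: largest valid index in the block. */
void isometry_map(unsigned code, unsigned x, unsigned y, unsigned n,
                  unsigned *ox, unsigned *oy)
{
    unsigned first  = (code & 1u) ? y : x;  /* bit 0: exchange the indices */
    unsigned second = (code & 1u) ? x : y;
    if (code & 2u)
        first = n - first;                  /* bit 1: reverse first index  */
    if (code & 4u)
        second = n - second;                /* bit 2: reverse second index */
    *ox = first;
    *oy = second;
}
```

This matches the remark in Section 6.2 that permutations of two counters and two subtracters suffice to produce all eight isometries.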
Best Address Register
This register contains the address of the first pixel in the best domain block for
the range block; it is a fairly simple affair to convert this into a set of coordinates.
This register is cleared to all zeros when a new range block is loaded; hence, it is
recommended that the image data exist elsewhere in the address space.
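The conversion from a best address to block coordinates can be sketched as follows. The sketch assumes that the best address is an absolute byte address in the same address space as the image base address, and that the image is stored as a row-major array of bytes with the row length given by the Image Row Length register; the names used are hypothetical.

```c
#include <stdint.h>

/* Pixel coordinates of the first pixel of a block. */
typedef struct {
    uint32_t x;  /* column, counted from the left edge */
    uint32_t y;  /* row, counted from the top edge     */
} coord_t;

/* addr: value read from the Best Address register (assumed >= base);
 * base: image base address; row_length: bytes per image row (nonzero). */
coord_t address_to_coord(uint32_t addr, uint32_t base, uint32_t row_length)
{
    uint32_t offset = addr - base;
    coord_t c;
    c.x = offset % row_length;
    c.y = offset / row_length;
    return c;
}
```

Because the register is cleared to zero on a range block load, an image based at address zero would make an unset result indistinguishable from a match at the first pixel, which is the reason for the recommendation above.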
Best r Register
This register gives the r parameter (as computed by the parameter computation
unit) for the best domain block. This value is most useful for debugging and
testing the chip design, rather than being useful for any host-based computation,
and may be removed from the design eventually.
Best Ds Register
This register gives the denominator used in the parameter computation unit. Like
the best r register, it is of little use to the host system under normal operation.
$\sum x$ Register
This register contains the sum of the pixels contained in the range block. This
is computed when the block is loaded, and is useful to the host for determining
regression parameters or other simple image processing.
$(\sum x)^2$ Register
This register contains the square of the sum of the range pixels, computed when
the range block is loaded. This value is also useful for finding regression
parameters.
7.2 Implementation
The implementation of the host interface is fairly straightforward; unlike the
other units, it is not a composition of several chunks. The global registers form a
specialized register file, the outputs of which are always available to the system.
The local registers are multiplexed from the various pipeline units, as selected
by the pipeline unit select global register. (The actual selection of which specific
local register to read is performed by each pipeline piece. This simplifies the
wiring overhead of the system somewhat.)
If the VHDL synthesis tool supports it, the local register selection could be
simplified by the use of internal tri-state signals. Not all target architectures
support such signals, however, and not all VHDL synthesis tools are capable of
using tri-state signals. For these reasons,





Testing was performed in several stages. Algorithmic testing was performed
using the compression and decompression program written in C and contained in
Appendix A. Verification of the basic operation of the chunks was performed
individually; care was taken to ensure that all major points of operation were tested,
although no attempt was made to formally cover every possible condition.
Verification of the design as a whole was performed by simulating the compression
of a (quite small) image and comparing the result to that generated by the C
program.
This hybrid testing approach was used for two main reasons. First, it allowed
the algorithm to be perfected without undue difficulty in constantly changing the
hardware design. Second, simulating the hardware takes a very long time and is
therefore impractical for all but the smallest images. (For the 80 x 56 test image
used, each hardware simulation required about eighteen hours.)
8.1 Algorithmic Testing
The program presented in the appendix is the result of several iterations of
development, experimenting with various distance measurements and other
algorithmic strategies. In its current state, it is typical of many fractal image
compression systems in the literature. A few simplifications are made, particularly in
the method of handling images whose dimensions are not an even multiple of the
block size used. (Currently, such images are cropped; this is hardly acceptable
for general-purpose use.)
A summary of some results of using this program with various images is
presented in Table 8.1. A summary of results for a single image with various
parameter values is in Table 8.2. In these tables, two compression ratios are
given; the ratio in parentheses is the compression ratio after the encoded image
has been compressed using gzip. (All the images are 256 x 256 grayscale images;
all compression ratios are computed based on an input Windows .BMP file, which
is 67382 bytes long including 1846 bytes of header information.)
Note that the compression ratios achieved are not the best possible with fractal
image compression, as the transformation parameters are encoded with greater
precision than is necessary. It is expected that, without significant changes in
image quality, the compression ratios (prior to entropy encoding with gzip) could
be at least doubled. (The entropy encoding step, in this case, would probably
not be quite as effective, although it would still improve the compression.)
Image                 Options                  Compression       Output
sunset (Figure 8.1)   -a 1 -b 32 -c 25 -m 4    3.26:1 (6.17:1)   Figure 8.5
text1 (Figure 8.2)    -a 1 -b 32 -c 25 -m 4    1.38:1 (4.42:1)   Figure 8.6
chapel (Figure 8.3)   -a 1 -b 32 -c 25 -m 4    1.54:1 (2.59:1)   Figure 8.7
coke (Figure 8.4)     -a 1 -b 32 -c 25 -m 4    2.82:1 (5.21:1)   Figure 8.8
Table 8.1: Summary of Test Results with Various Images
Max a   Blocksize Range   MSE Cutoff   Compression        Output
1.0     32-4              25           3.26:1 (6.17:1)    Figure 8.5
1.0     32-4              10           2.07:1 (3.53:1)    Figure 8.9
10.0    32-4              25           3.28:1 (5.94:1)    Figure 8.10
1.0     8-4               25           2.43:1 (5.19:1)    Figure 8.11
1.0     32-4              50           4.62:1 (9.81:1)    Figure 8.12
1.0     32-4              80           5.92:1 (14.55:1)   Figure 8.13
1.0     32-4              125          7.49:1 (22.17:1)   Figure 8.14
1.0     32-4              200          25.32:1 (45.81:1)  Figure 8.15
Table 8.2: Summary of Test Results With Varying Parameters (sunset)
Some general observations are clear from these results. High maximum a
values, such as in Figure 8.10, lead to divergence chromatically, as expected. The
most interesting parameter is the MSE cutoff, which determines the quality (and
compression ratio) of the final image. This is visible in the progression of sunset
results (Figures 8.9, 8.5, 8.12, 8.13, 8.14, 8.15).
Natural images, such as the sunset (Figure 8.1) and chapel (Figure 8.3) test
images, tend to compress relatively well with fractal image compression. Artificial
images, such as text1 (Figure 8.2), are less suitable; the compression ratios are
lower and the distortion in the output more objectionable.
The visible artifacts of compression are blockiness and somewhat ragged edges
in some cases. Continuous tones and textures are preserved fairly well, as are
large-scale sharp transitions.
The resolution independence of fractal image compression may be seen from




































[Figure residue: "sunset" test results. Surviving labels: Figure 8.10, Figure 8.12, Figure 8.14. Surviving panel titles: MSE cutoff = 10; Initial Blocksize = 8; MSE cutoff = 80; Magnification = 2.]
8.2 Hardware Simulation
The sample image selected for hardware simulation is shown in Figure 8.17. This
image was compressed using only 8x8 blocks; the results are shown in Figure 8.18.
Compression using the software compressor, with similar parameters, resulted
in Figure 8.19. The options used were -a 1 -b 8 -c 1 -m 8 -h; with these
options, the results should be identical (except for possible minuscule differences
due to floating point rounding errors). Comparing the transforms emitted shows
that this is indeed the case; the only differences are insignificant variations in the a
and b coefficients, which are attributable to slightly different rounding errors in the
floating point computations producing these figures. (These values are computed
by the driver program, not by the simulated hardware; the results found by the
simulated hardware agree exactly with the predicted results generated by the
compression software.)
Figure 8.17: Test Image for Hardware Simulation
Figure 8.18: Result Image Generated by Hardware Simulation
Figure 8.19: Result Image Generated by Compression Software
The simulation required approximately 200 clock cycles to load each range
block and about 2,150,000 clock cycles to perform comparisons with all 2,560
domain blocks. (Each domain block comparison requires about 840 clock cycles for
8x8 blocks.) With a 10 MHz simulated clock, the total simulated time required
for compression was about 220 ms. The time required for loading the range blocks
is insignificant when compared with the time required for the comparisons; both
times grow linearly with the area of the image.
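The arithmetic behind these figures can be checked with a short calculation. The sketch below uses only the numbers quoted above (2,560 domain blocks, about 840 cycles per comparison, and a 10 MHz clock); the function names are hypothetical.

```c
#include <stdint.h>

/* Total cycles for comparing against every domain block. */
uint64_t comparison_cycles(uint64_t domain_blocks, uint64_t cycles_per_block)
{
    return domain_blocks * cycles_per_block;
}

/* Convert a cycle count to milliseconds for a given clock frequency in Hz. */
double cycles_to_ms(uint64_t cycles, double clock_hz)
{
    return 1000.0 * (double)cycles / clock_hz;
}
```

With these inputs, 2,560 x 840 gives 2,150,400 cycles, or roughly 215 ms at 10 MHz for the comparisons alone, which is consistent with the quoted total of about 220 ms once range block loading is included.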
It is expected that an implementation using current technology would have a
significantly faster clock speed than 10 MHz. This value was used for simulation
because it allowed easy conversion between time and clock cycles when analyzing
the results. With a 10 MHz clock and an adequate number of pipeline units,
performing compression using 8x8 blocks on a 256 x 256 image would take a





The Synopsys Design Compiler was used to synthesize the design. A target
architecture of LSI 10K was used. (The design is in no way tied to this target
architecture; it is simply a reasonable placeholder.)
9.2 Procedures and Scripts
Appendix B contains a copy of the scripts used to synthesize the system. Various
optimizations are used, particularly ungrouping certain portions of the design.
All parts are synthesized assuming a 20 MHz clock.
The set_dont_use command near the start of several of the scripts exists to
get around a difficulty with the particular library installation here; these specific
instances require re-analysis, but the source code for them is missing.
Following standard practice, the CLOCK and RESET networks were not optimized
at all. These global signals are, depending on the architecture, generally
either routed semi-manually or constructed using special-purpose, pre-built
circuitry.
9.3 Difficulties
The memory block in the range block chunk was the only problematic area for
synthesis. A generic placeholder entity for this memory block is used in the
design; it is expected that an actual implementation would replace the generic
design with one specific to (and optimized for) the target architecture. Design
Compiler is not especially adept at synthesizing memory blocks.
(Synopsys does provide some wrappers for architecture-specific memories;
unfortunately, none of these wrappers were for memories with an asynchronous
read port, as the range block chunk was designed to use. Using a synchronous
memory would require significant changes to at least the range block chunk and
potentially to other parts of the design.)
9.4 Results
The synthesis was performed for a system with four pipeline units. An actual
implementation would contain as many of these units as practical, given the area
constraints of the device; ideally, this number should be much higher than four.
Appendix C contains some sample schematics of the synthesized system. Only
representative portions are included because the complete set of schematics would
require several hundred pages.
Synthesis of the entire design from scratch (with nothing in the synthetic
cache used by Synopsys) using these scripts takes about two or three hours on
an HP 9000/785 system.
The areas reported by Design Compiler are in "gate equivalences." It keeps










Memory Interface Unit 2,935 5,697 7,294 12,991
Host Interface Unit 268 413 1,486 1,899
Pipeline Unit (No Memory)






























Total Area (No Memory)







Table 9.1: Summary of Synthesized Area
Table 9.1 contains a summary of the area statistics for the various components
of the design. The cells (in the "cells required" column) refer to LSI 10K standard
cells in the target library. A few components do not have this information because
they contained references to other entities, either generated by Design Compiler
or instantiated directly; such components are reported as a single cell.
The difficulty of Design Compiler in generating memory blocks should be
immediately evident from this table. Over four-fifths of the area required by the
design is consumed by these (fairly small) memory blocks. If a target-specific
memory design is not significantly better, it may be worthwhile to reduce the
maximum block size.
The area required for the host interface will vary somewhat depending on the
number of pipeline units implemented. Since the host interface is relatively small
and simple, this change should be insignificant when compared to the additional




In general, the design performed admirably in the testing. Actual hardware
implementation would probably be best performed with an ASIC design (rather
than some programmable logic system) due to the complexity of the design.
10.1 Possible Improvements and Extensions
10.1.1 Memory Interface Unit
Cache and Memory Interface
As was mentioned in Section 5.1, portions of the memory interface require
modification for actual use based on the precise nature of the memory system. Part
of this is designing a suitable cache, such as the very simple one developed here.
Larger caches would decrease the memory bandwidth requirements substantially;
how necessary this is depends on the relative speed of the system to the
memory bus and on what other devices share the bus. A few cache size
possibilities are evident.
The smallest reasonable size, as is used here, is two machine words. This
reduces the number of memory accesses required by a factor of two over having
no cache (as a word is not accessed twice, once for each byte, in close succession).
A second possibility, useful primarily if the memory bus operates in a burst
mode, is to have enough cache to store two lines of a domain block; this would
enable the system to read memory using only burst mode transfers. While this
does not change the number of memory locations accessed, it does typically speed
up the accesses substantially over reading words piecemeal.
A third possibility is having enough cache to hold an entire domain block;
this would allow most comparisons (all except for those at the start of a line in
the image) to read only one column per block rather than the entire block. This
reduces the memory bandwidth by a factor of the size of the blocks being used
over the original scheme. (For example, it would take about 1/8 of the current
memory bandwidth for 8 x 8 blocks.)
A fourth possibility is to have enough cache to hold an entire row of domain
blocks. This is adequate to ensure that the image is only read once during the
course of a set of comparisons, yielding an average of not much more than one
byte read per block compared.
In all cases, using a simple FIFO algorithm should yield fairly good cache
performance, as the memory addressing is largely linear. (Larger caches, of course,
would probably be set associative rather than the fully associative cache used in
the current design.)
Denominator Computation Chunk
The denominator computation chunk currently starts comparisons in cases where
the denominator is zero. Since division is nowhere actually performed by the
hardware, this does not lead to illegal arithmetical operations. Testing also did
not reveal any incorrect mappings from this strategy; the test images were
compressed properly, even with such blocks.
It may be better not to perform comparisons in these cases; at the least, it
may save some time with small blocksizes, where the parameter computation is
the bottleneck on performance. It is easy to test, in software, whether or not the
block is reasonably represented by a shade block.
10.1.2 Pipeline Unit
The parameter computation units currently use roughly one hundred clock cycles
per block per isometry. This number could be reduced somewhat for smaller
blocks without changing the overall design, since the maximum possible sums
are smaller. If the number of cycles used were dynamically determined from the
blocksize, the system would be a little faster for small blocks. (For large blocks,
larger than approximately 10 x 10, the parameter computation unit processing
time is not a bottleneck.)
It may also be helpful to determine if limiting the precision of the
calculations would result in substantially worse performance, as was done in [12]. If
similar results are applicable to this architecture, the time required for parameter
computation could be reduced by a significant fraction.
10.2 Concluding Remarks
The design performed well in simulations and was shown to be synthesizable. A
speedup of roughly 1000 times versus a pure software approach was predicted for
similar implementation technologies.
Manufacture of the design developed here may be covered by patents owned by
Iterated Systems, Inc. [13]; a reader wishing to produce this design or derivative
designs should contact them. The author has no affiliation with Iterated Systems,






The implementation is written in ANSI C on Unix (DEC OSF/1). The only
operating system dependent sections involve file I/O and the call to setpriority
in the compression program. (This call reduces the priority of the process to
prevent it from causing angst among interactive users of the system. It could be
removed without affecting the operation of the program.)
There are three pieces of code: the compression program, the decompression
program, and a simple library to read and write MS Windows .BMP files.
A.1.1 Invocation
fcomp3
The command line arguments for fcomp3, the compression program, are listed in
Table A.1. Options may be entered in any order.
The input file must be a grayscale MS Windows .BMP file. The two blocksize
parameters are further constrained to be either a power of two or three times
a power of two. (This is to ensure that quadtree dissection works accurately.)
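The blocksize constraint can be checked by dividing out factors of two and inspecting what remains. The sketch below is a standalone restatement of that idea, not the thesis code; the function name is hypothetical, and the 2 to 96 bounds are taken from the range check applied to the initial blocksize in the listing of Section A.2.

```c
#include <assert.h>

/* Returns 1 if b is 2^n or 3*2^n and lies in [2, 96], else 0. */
int valid_blocksize(int b)
{
    if (b < 2 || b > 96) {
        return 0;
    }
    /* Strip factors of two; b >= 2 here, so b never reaches zero. */
    while ((b & 1) == 0) {
        b >>= 1;
    }
    return b == 1 || b == 3;
}
```

With this check, 32, 48 and 96 are accepted, while 20 (which reduces to 5) is rejected.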
If the maximum a parameter is greater than one it is possible that the LIFSM
produced may be divergent chromatically. Whether or not it is divergent (or
divergent in certain areas) depends on the input image and the other compression
factors; often, a maximum a modestly greater than 1.0 will still produce a globally
convergent LIFSM.
A small MSE cutoff leads to an accurate reconstruction of the original image
and a relatively low compression ratio. A high MSE cutoff implies a high
compression ratio (but a potentially poor reconstruction).
Compression takes a considerable amount of time, which varies quadratically
with the number of pixels in the image. The test images used (256 x 256 pixels)
typically take around 15 minutes of cpu time to compress, depending on the image
and the specific options used. (This is on a 532 MHz Alpha 21164A system.)

Option  Description                            Default  Legal Range
-a      Maximum permitted a regression value   1.0      0 < a < 256.7
-b      Initial Blocksize                      32       2 < m < b < 96
-i      Input filename                         none
-m      Minimum Blocksize                      3        2 < m < b < 96

Table A.1: fcomp3 Command Line Arguments
The hardware emulation mode simplifies the algorithm somewhat to better
simulate what the hardware design does. (The change affects situations where
linear regression produces an a value which is beyond the allowable range. With
hardware emulation enabled, the mapping is not considered; with it disabled,




The command line arguments for the decompression program fdecomp3 are listed





















Table A.2: fdecomp3 Command Line Arguments
The input file must be one produced by the fcomp3 program; the output file
will be a grayscale Windows .BMP file.
The number of repetitions required to converge on (a good approximation of) the
attractor seems to be between ten and fifteen; much smaller numbers of
repetitions lead to noticeably poorer images, and greater numbers do not lead to
substantially different images.
Decompression is relatively quick, especially when compared with compression.
Decompression which produces a large image may take a few seconds on a
fairly fast system.
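The repetition count can be illustrated with a scalar analogy: a single contractive affine map x <- a*x + b, iterated from an arbitrary start, approaches its fixed point b/(1 - a) geometrically. The sketch below is only that analogy, not the decoder; the function names and the values a = 0.5 and b = 64 used afterwards are assumptions chosen for illustration.

```c
#include <assert.h>

/* Apply x <- a*x + b `iterations` times, starting from x0. */
double iterate_affine(double a, double b, double x0, int iterations)
{
    double x = x0;
    int i;
    assert(iterations >= 0);
    for (i = 0; i < iterations; i++) {
        x = a * x + b;
    }
    return x;
}

/* Fixed point of x <- a*x + b; only meaningful when |a| < 1. */
double affine_fixed_point(double a, double b)
{
    assert(a < 1.0 && a > -1.0);
    return b / (1.0 - a);
}
```

With a = 0.5 and b = 64 the fixed point is 128. Starting from 0, three iterations give 112, still 16 grey levels away, while fifteen iterations land within a small fraction of one grey level regardless of the starting value.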





* Fractally compress an image; various parameters of the compression are *
* configurable, including the range of a values permitted, the sizes of *














/* Access for mallocable two-dimensionable arrays. This is a bit hard to do */
/* neatly in C; it only does statically defined multidimensional arrays */
/* using nice array syntax, AFAIK */
#define SOURCE(x, y) source[ (y)*srow + (x) ]
#define SAVG(x, y) savg[ (y)*avgrow + (x) ]
#define SSQR(x, y) ssqr[ (y)*avgrow + (x) ]
#define SUMY(x, y) sumy[ (y)*avgrow + (x) ]
#define SUMYY(x, y) sumyy[ (y)*avgrow + (x) ]
/* Other macros */
#define ABS(a) ( (a)<0?-(a) : (a) )















/* Stuff to get stored in output file */
/* These are shorts for ease; they could easily be characters or */
/* bitfields and thereby increase the compression ratio. */
/* The x coordinate */
/* The y coordinate */
/* The specific mapping */
/* a greymap factor */




/* Element in (internal) quadtree */
/* Block size unneeded */
/* Range x position */

















/* Domain x */
/* Domain y */
/* A graymap factor: y = ax + b */
/* B graymap factor */
/* Regression coefficient */
/* Saved rs param */
/* Saved ds param */
/* Next in (breadth-first) list do */
/* Quadtree kid 1 */
/* Quadtree kid 2 */
/* Quadtree kid 3 */










/* Source data array */
/* Averages of source data pixels */
/* Squares of source data pixels */
/* Sums of domain blocks */







/* xsize - xcrop */
/* xsize - xcrop - 1 */
/* ysize - ycrop */
/* ysize - ycrop - 1 */
/* xsize - xcrop - blocksize, kind of */
int xsize, ysize; /* Size of original picture */
int xcrop, ycrop; /* Number of pixels chopped off edge */
int rowlen; /* Number of bytes in a row */
unsigned long blockpixs = INITIAL_BLOCK_SIZE * INITIAL_BLOCK_SIZE;
/* size of block squared */




int blocksize = INITIAL_BLOCK_SIZE; /* Size of a range block */
int minblocksize = INITIAL_MIN_BLOCK_SIZE; /* Minimum size of a range block */
int verbose; /* Verbose flag */
int hwflag; /* Hardware emulation flag */
float cutoff = DEFAULT_CUTOFF; /* closeness cutoff for r squared */
float maxa = DEFAULT_MAXA; /* maximum a parameter */
char *infile = NULL; /* Input file name */
char *outfile = NULL;




/* Input bitmap file */
/* Input file pointer */
/* Output file descriptor */
void precompute_stuff(); /* Once per quadtree level init */
void do_args( int argc, char **argv );
void initialize(); /* Once per execution init */
void do_block( struct qt_elem *block );
void *my_malloc( unsigned long size );
int check_and_lower_tree();
void write_qtmap();
void write_transform( struct qt_elem *which );
void write_bit( int bitval );
void write_qtmap_elem( int blocksize, struct qt_elem *here );
int main( int argc, char **argv ) {
int err;
inbmp = 0 ;
/* Play with command line */
do_args( argc, argv );
/* We have our options. We need to do a few things: */
/* o read in the file */
/* o Set stuff up for compression */
/* o For each quadtree level: */
/* x precompute domain stuff (for speed) */
/* x Do the compression */
/* o Write out the resultant file. */
/* Read in the file */
infilep = fopen( infile, "r" );
if( infilep == NULL ) {





inbmp = read_bmp( infilep );
err = cleanup_bmp( inbmp ) ;
if( err != 0 ) {





/* Determine a few important constants and such */
xsize = inbmp->bmi->bmih.biWidth;
ysize = inbmp->bmi->bmih.biHeight ;
xcrop = xsize % blocksize; /* This should really be changed... */
ycrop = ysize % blocksize; /* but dealing with partial cols/rows is hard. */
rowlen = ROWLEN( inbmp ); /* Bytes in bmp row */
/* Report vital stats to user if appropriate */
if ( verbose ) {
printf( "******** Invocation details *********\n\n" );
printf( "Source image: %s\n", infile );
printf( "Image size: %d x %d\n"






, blocksize, blocksize );
printf( "Minimum Block: %dx%d\n", minblocksize, minblocksize );
printf( "MSE Cutoff: %f\n", cutoff );







/* Report important stuff (errors, warnings) to user */
if( blocksize * 2 > xsize || blocksize * 2 > ysize ) {






!= 0 && ycrop != 0 ) {




, xcrop, ycrop );
} else {
if( xcrop != 0 ) {





!= 0 ) {
fprintf( stderr, "Notice: I shall crop






/* Be nice to your friends; this does the same as a nice 10 on UNIX. */
/* On other OSs, this could be replaced with something appropriate or */
/* just eliminated. (This also improves the chances of the admins not */
/* killing off the program when they see it using a lot of CPU time; */
/* at least it doesn't steal the CPU from people doing "useful" work.) */
setpriority( PRIO_PROCESS, 0, 10 );
/* Precompute some things */
initialize();
/* Store important stuff for file header */
/* Done now because blocksize will change as time goes on. */
hdr.blocksize = blocksize;
hdr.width = xsize - xcrop;
hdr.height = ysize - ycrop;
/* A word on data structures used: */
/* The blocks at a single level are linked in a list to allow for easy */
/* breadth-first access (by following the list) . The descendents of a */
/* top-level block are linked in a tree. At each level, nextlist is */
/* created to hold dissected blocks from thislist ; after a level is */
/* done, nextlist becomes thislist. top always points to the list at */
/* the very top of the tree, the initial block size. */
top = nextlist; /* Set top to be the top of the quadtree */
/* Do the compression */
while( nextlist != 0 ) {
if( verbose ) {
printf (







while ( currlist != 0 ) {
do_block( currlist );
currlist = currlist -> next;
}
blocksize >>= 1;




== 1 && nextlist != 0 ) {





/* Write the output file */
/* Lower the height of the quadtree until we find the largest block */
/* which has an actual transform in it (is not split up.) This makes */
/* the map a bit smaller, especially in cases where an obscenely huge */
/* initial blocksize was used (for the image/MSE combination)--such as */
/* a blocksize of 64 and a MSE of 4 or some such. */






/* Intentionally empty */
}
/* Write the header */
outfilenum = open( outfile, O_WRONLY | O_CREAT | O_TRUNC, 0644 );
if( verbose ) printf( "Writing header...\n" );
write( outfilenum, &hdr, sizeof( struct my_header ) );
/* Write the quadtree map (recursively) */




/* (recursively) write the transforms out */




while ( currlist != 0 ) {
write_transform( currlist ) ;
currlist = currlist -> next;
}
close( outfilenum );




/* Here, we do comparisons for a single block. This takes up the lion's */
/* share of the processor time, so it is more or less optimized for speed */
/* (on the system I've been using, at least: an Alpha, where floating */
/* point operations seem nearly as quick as integer operations). */
void do_block( struct qt_elem *block ) {
int rx, ry;
int tempx, tempy;







unsigned long sumxy[8]; /* Results of doing the various transforms */
int i, slow, fast;
if( verbose ) {
printf( "Doing %dx%d@(%d,%d) ", blocksize, blocksize,
block->rx, block->ry );








sumxx = 0 ;
for( tempx = 0; tempx < blocksize; tempx++ ) {
for( tempy = 0; tempy < blocksize; tempy++ ) {
sumx




SOURCE (rx+tempx, ry+tempy) ;
}
}
/* Assume a shade block, compute r appropriately */
/* (r = MSE; rs and ds are saved.r













if( block->r >= cutoff ) {
/* Only do if shade is not good enough. */
block->rs = 1.0;
block->ds =0.0;
for( dx = 0; dx <= sumrow; dx++ ) {
for( dy = 0; dy <= sumcol; dy++ ) {
if( blockpixs * SUMYY(dx, dy) != SUMY(dx, dy)*SUMY(dx, dy) ) {
for( fast = 0; fast < 8; fast++ ) {
sumxy[fast] = 0;
}
for( slow = 0; slow < blocksize; slow++ ) {
for( fast = 0; fast < blocksize; fast++ ) {
/* orient range and domain */
sumxy[0] += SOURCE(rx+slow,ry+fast) *
SAVG(dx+slow*2,dy+fast*2);
sumxy[1] += SOURCE(rx+fast,ry+slow) *
SAVG(dx+slow*2,dy+fast*2);
sumxy[2] += SOURCE(rx+slow,ry+fast) *
SAVG(dx+dblocklen-slow*2,dy+fast*2);
sumxy[3] += SOURCE(rx+fast,ry+slow) *
SAVG(dx+dblocklen-slow*2,dy+fast*2);
sumxy[4] += SOURCE(rx+slow,ry+fast) *
SAVG(dx+slow*2,dy+dblocklen-fast*2);









den = ((float)blockpixs)*SUMYY(dx,dy) -
((float)SUMY(dx,dy))*SUMY(dx,dy);
sumxxyy = sumx*sumx * ((float)SUMYY(dx,dy));
for( i = 0; i < 8; i++ ) {
r = 2*((float)sumxy[i])*sumx*((float)SUMY(dx,dy));
r -= sumxxyy + ((float)blockpixs)*((float)sumxy[i])*sumxy[i];
if( r * block->ds < block->rs * den ) {
a = ((float)((signed)(blockpixs) * (signed)sumxy[i] -
(signed)sumx * (signed)SUMY(dx,dy))) /
((float)blockpixs*SUMYY(dx,dy) -
SUMY(dx,dy)*SUMY(dx,dy));











} else if( !hwflag ) {
/* We may still do better than the best if */
/* we clamp a at the max or min value */





r = (float) ( blockpixs ) *




a * a * ((float)SUMY(dx,dy)) * SUMY(dx,dy);
r /= blockpixs;
r -= sumxx;
if( r * block->ds < block->rs ) {






block->b = ( (float) ( sumx - a*SUMY(dx,dy)))/blockpixs;
}
}
/* Previously, if the MSE was the same but the a */
/* value was better, we stored the new value; this */
/* helps the result converge a bit faster. This is */
/* not done anymore because the hardware cannot */
/*
easily do the same and because it's a bit harder */
/* to figure out with the rs/ds approach. */
}
r = ((block->rs/block->ds)+sumxx)/(float)blockpixs;
if( r < block->r ) { /* What we found was better */
block->r = r; /* Update the MSE in the block */
} else {
/* Shade was better. We do know it's not under the cutoff; */
/* we still update this in case we are already at the min */
/* blocksize the user has allowed. It's an unlikely case, but */





block->b = (float) sumx / blockpixs;
block->r = (float)(blockpixs * sumxx - sumx*sumx)/(float)(blockpixs*blockpixs);
}
}
if( block->r > cutoff && blocksize > minblocksize ) {
/* Perform a quadtree dissection */
if( verbose ) {
printf (
"





block->quad1 = (struct qt_elem *)my_malloc( sizeof( struct qt_elem ) );
block->quad2 = (struct qt_elem *)my_malloc( sizeof( struct qt_elem ) );
block->quad3 = (struct qt_elem *)my_malloc( sizeof( struct qt_elem ) );
block->quad4 = (struct qt_elem *)my_malloc( sizeof( struct qt_elem ) );
if( nextlist == 0 ) {
nextlist = block->quad1;
} else {







block->quad1->rx = block->quad3->rx = block->rx;
block->quad1->ry = block->quad2->ry = block->ry;
block->quad2->rx = block->quad4->rx = block->rx + (blocksize >> 1);
block->quad3->ry = block->quad4->ry = block->ry + (blocksize >> 1);
} else {
/* We found a winner! Either the MSE was good enough, or we can't */
/* get any smaller than we already have. */
if( verbose ) {
printf( "MSE=%f (%d, %d) a=%f b=%f o=%d rs=%f ds=%f\n",
block->r, block->dx, block->dy, block->a,
block->b, block->orient, block->rs, block->ds );
}
}
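The regression at the heart of do_block can be read as an ordinary least-squares fit of the range pixels x to the (averaged) domain pixels y, so that x is approximated by a*y + b. The sketch below restates that fit in isolation for a single orientation. It is not the thesis routine: the struct and function names are hypothetical, the pixels are passed as flat arrays, and the error is computed directly rather than through the saved rs/ds ratio.

```c
#include <assert.h>
#include <stddef.h>

struct fit { double a, b, mse; };   /* hypothetical result record */

/* Least-squares fit range[i] ~ a*domain[i] + b over n pixels.
 * Returns 0 (and leaves *out untouched) when the domain block is flat,
 * since the denominator n*sum(y^2) - sum(y)^2 is then zero. */
int fit_block(const unsigned char *range, const unsigned char *domain,
              size_t n, struct fit *out)
{
    double sumx = 0, sumy = 0, sumxy = 0, sumyy = 0, den, err = 0;
    size_t i;

    assert(range != NULL && domain != NULL && out != NULL && n > 0);
    for (i = 0; i < n; i++) {
        double x = range[i], y = domain[i];
        sumx += x;
        sumy += y;
        sumxy += x * y;
        sumyy += y * y;
    }
    den = (double)n * sumyy - sumy * sumy;
    if (den == 0.0) {
        return 0;
    }
    out->a = ((double)n * sumxy - sumx * sumy) / den;
    out->b = (sumx - out->a * sumy) / (double)n;
    for (i = 0; i < n; i++) {
        double d = (double)range[i] - (out->a * domain[i] + out->b);
        err += d * d;
    }
    out->mse = err / (double)n;
    return 1;
}
```

For a domain of {0, 10, 20, 30} and a range of {7, 12, 17, 22} the fit recovers a = 0.5 and b = 7 with zero error. A flat domain is rejected, mirroring the test that do_block applies before accumulating the sums.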
/* Do command line argument processing. */





int err = 0;
int temp;
optind = 1;
while(( chr = getopt( argc, argv, "a:b:c:hi:o:m:v" )) != EOF ) {




maxa = atof( optarg );
if( maxa <= 0 || maxa >= 256.7 ) {
fprintf( stderr, "Error: maximum A parameter "
"must be betwixt zero and two "









minblocksize = atoi( optarg );
if( minblocksize < 3 || minblocksize > 96 ) {
fprintf( stderr, "Error: minblocksize must be "














!= 1 && temp != 3 ) {









blocksize = atoi( optarg );
if( blocksize < 2 || blocksize > 96 ) {
fprintf( stderr, "Error: blocksize must be "





/* Test blocksize for 2^n or 3*2^n */
temp = blocksize;





!= 1 && temp != 3 ) {






blockpixs = blocksize * blocksize;







cutoff = atof( optarg ) ;
if( cutoff < 0 || cutoff > 1315 ) {
fprintf ( stderr, "Error: cutoff must be "

































if( infile == NULL ) {
err++;
}
if( outfile == NULL ) {
err++;
}
if( minblocksize > blocksize ) {






if( err != 0 ) {







































exit ( err ) ;
}
if( hwflag && maxa != 1 ) {
fprintf( stderr, "Warning: hardware flag set and max a not 1; hardware "





/* Once-per-execution initialization stuff; mainly filling source, savg,






















/* Allocate memory; this ought to use my_malloc, I guess. */
source = (unsigned int *)malloc( sizeof( unsigned int ) * (srow+1)*(scol+1) );
savg = (unsigned int *)malloc( sizeof( unsigned int ) * (avgrow+1)*(avgcol+1) );
ssqr = (unsigned int *)malloc( sizeof( unsigned int ) * (avgrow+1)*(avgcol+1) );
sumy = (unsigned long *)malloc( sizeof( unsigned long ) * (avgrow+1)*(avgcol+1) );
sumyy = (unsigned long *)malloc( sizeof( unsigned long ) * (avgrow+1)*(avgcol+1) );
if( source == NULL || savg == NULL || ssqr == NULL ||
sumy == NULL || sumyy == NULL ) {
fprintf( stderr, "Error: Out Of Memory!!!\n" );
exit( -1 );
}
/* Fill source array */
for( dx = 0; dx < srow; dx++ ) {
for( dy = 0; dy < scol; dy++ ) {
SOURCE( dx, dy ) = inbmp->data[dy*rowlen+dx];
}
}
/* Compute averages, squares of averages */
if ( verbose ) {




for( dx = 0; dx < avgrow; dx++ ) {
for( dy = 0; dy < avgcol; dy++ ) {
SAVG( dx, dy ) = ( SOURCE(dx, dy) + SOURCE(dx, dy+1) +
SOURCE(dx+1, dy) + SOURCE(dx+1, dy+1) + 2 ) / 4;





= SAVG(dx, dy)*SAVG(dx, dy) ;
}
}
/* Set up initial list of blocks to do */
/* This first block will be free'd and is just to prevent a special */
/* case within the loop that follows. */
nextlist = (struct qt_elem *)my_malloc( sizeof(





for( dx = 0; dx < xsize-xcrop; dx += blocksize ) {
for( dy = 0; dy < ysize-ycrop; dy += blocksize ) {
nextlistend->next


















/* Free that extra one at the start; currlist is used as a temp */
currlist = nextlist;
nextlist = nextlist->next ;
free( currlist ) ;
currlist = 0;
}
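The averaging step in the initialization above shrinks domain data by replacing each 2 x 2 neighbourhood with its rounded mean; adding 2 before the integer division by 4 rounds to nearest rather than truncating. A minimal standalone sketch of that one operation follows, assuming a row-major 8-bit buffer; the function and parameter names are hypothetical and not part of the thesis code.

```c
#include <assert.h>
#include <stddef.h>

/* Rounded mean of the 2x2 neighbourhood whose top-left pixel is (x, y).
 * `img` is an assumed row-major 8-bit image with `stride` pixels per row;
 * the caller must ensure (x+1, y+1) is inside the image. */
unsigned avg2x2(const unsigned char *img, size_t stride, size_t x, size_t y)
{
    unsigned sum;
    assert(img != NULL && stride >= 2);
    sum = (unsigned)img[y * stride + x]
        + (unsigned)img[y * stride + x + 1]
        + (unsigned)img[(y + 1) * stride + x]
        + (unsigned)img[(y + 1) * stride + x + 1];
    return (sum + 2u) / 4u;   /* +2 rounds to nearest */
}
```

The pixels {1, 2, 3, 4} average to 2.5 and are stored as 3, whereas plain truncation would store 2.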
/* once-per-blocksize initialization--sumy and sumyy arrays */










for( dx = 0; dx <= sumrow; dx++ ) {
for( dy = 0; dy <= sumcol; dy++ ) {
SUMY( dx, dy ) = 0;
SUMYY( dx, dy ) = 0;
}
}
/* Compute first four sums */
if ( verbose ) {






for( dx = 0; dx < blocksize * 2; dx+=2 ) {
for( dy = 0; dy < blocksize * 2; dy+=2 ) {
sum += SAVG(dx, dy);
sumsq += SSQR(dx, dy);
}
}
if( sum < 0 || sumsq < 0 ) printf( "ERROR!\n" );
SUMY(0, 0) = sum;




for( dx = 1; dx < blocksize * 2; dx+=2 ) {
for( dy = 0; dy < blocksize * 2; dy+=2 ) {
sum += SAVG(dx, dy);
sumsq += SSQR(dx, dy);
}
}
if( sum < 0 || sumsq < 0 ) printf( "ERROR!\n" );
SUMY(1, 0) = sum;
SUMYY(1, 0) = sumsq;
sum = 0;
sumsq = 0;
for( dx = 0; dx < blocksize * 2; dx+=2 ) {
for( dy = 1; dy < blocksize * 2; dy+=2 ) {
sum += SAVG(dx, dy);
sumsq += SSQR(dx, dy);
}
}
if( sum < 0 || sumsq < 0 ) printf( "ERROR!\n" );
SUMY(0, 1) = sum;




for( dx = 1; dx < blocksize * 2; dx+=2 ) {
for( dy = 1; dy < blocksize * 2; dy+=2 ) {
sum += SAVG(dx, dy);
sumsq += SSQR(dx, dy);
}
}
if( sum < 0 || sumsq < 0 ) printf( "ERROR!\n" );
SUMY(1, 1) = sum;
SUMYY(1, 1) = sumsq;
/* Compute the rest reasonably efficiently (moving the edges) */
for( dx = 0; dx < sumrow; dx++ ) {
if( dx > 1 ) {
sum = SUMY(dx-2, 0);
sumsq = SUMYY(dx-2, 0);
for( dy = 0; dy < blocksize*2; dy+=2 ) {








SUMY(dx, 0) = sum;
SUMYY(dx, 0) = sumsq;
sum = SUMY(dx-2, 1);
sumsq = SUMYY(dx-2, 1);
for( dy = 1; dy < blocksize * 2; dy+=2 ) {








SUMY(dx, 1) = sum;
SUMYY(dx, 1) = sumsq;
}
for( dy = 2; dy < sumcol; dy++ ) {
sum = SUMY(dx, dy-2);
sumsq = SUMYY(dx, dy-2);
for( i = 0; i < blocksize*2; i += 2 ) {







SUMY(dx, dy) = sum;
SUMYY(dx, dy) = sumsq;
}
}





void *my_malloc( unsigned long size ) {
void *result;
result = calloc( 1, size );
if( result == 0 ) {
fprintf( stderr, "Error: Out of Memory!\n" );




/* Determine if we can knock off the top level of the tree, and do it */
int check_and_lower_tree() {
struct qt_elem *this ;
struct qt_elem *prev;
for( this = top; this != 0; this = this->next ) {
if( this->quad1 == 0 ) {
if( verbose ) printf( "Can't lower any more.\n" );
return 0; /* e.g. we can't prune any more. */
} else {










/* done with check; now lower */
if( verbose ) printf( "Lowering removing all blocksize %d\n",




while( prev != 0 ) {
this = prev->next;
free( prev ) ;
prev = this;
}
hdr.blocksize = hdr.blocksize >> 1;
return 1; /* Check again */
}
/* bit by bit output; for quadtree map */





switch( bitval ) {
case ZERO_BIT:
this_byte >>= 1;















fprintf (stderr, "Error: bad magic constant given to




if( bits_in_char >= 8 ) {




/* Spit out a transformation (recursively) . */
void write_transform( struct qt_elem *this ) {
    struct filebit thisbit;
    static int i = 0;
    if( this->quad1 == 0 ) {
        /* Leaf: copy the transform into the on-disk record and write it */
        thisbit.dx = this->dx;
        thisbit.dy = this->dy;
        thisbit.a = this->a;
        thisbit.b = this->b;
        thisbit.orient = this->orient;
        if( verbose ) printf( "T#%d dx=%d dy=%d rx=%d ry=%d orient=%d a=%f b=%f\n",
            i++, (int) this->dx, (int) this->dy, (int) this->rx,
            (int) this->ry, (int) this->orient, this->a, this->b );
        write( outfilenum, &thisbit, sizeof( struct filebit ) );
    } else {
        /* Interior node: recurse into the four quadrants */
        write_transform( this->quad1 );
        write_transform( this->quad2 );
        write_transform( this->quad3 );
        write_transform( this->quad4 );
    }
}
/* Spit out a quadtree map (recursively) for the given element */
void write_qtmap_elem( int size, struct qt_elem *here ) {
/* No need to write out map for base cases */




== 0 ) {












/* Spit out the entire quadtree map and flush the last (partially filled) */
/* byte at the end. */
void write_qtmap() {
struct qt_elem *here;
int blocksize = hdr.blocksize;
for( here = top; here != 0; here = here->next ) {









* Decompress an image fractally compressed using fcomp3 or ftrans3; the *










/* The following ensures all array accesses are within the array; this is a */
/* good idea since we don't verify that the input file is okay in this */
/* regard and since we may have other strange









+ ((x)%xsize)] )
y ) (pictdata2[(((y)%ysize)*xsize)
+ ((x)%xsize)] )
y ) (*(pictdata + (y)*xsize
+ (x)))
y ) (*(pictdata2




















/* Stuff to get stored in output file */
/* This should probably contain bitfields and */
/* thereby increase the compression ratio. */
/* The x coordinate */
/* The y coordinate */





/* range x coord */
/* range y coord */
/* domain x coord */
/* domain y coord */
/* Orientation index */













char * infile; /* Input file name */
char *outfile; /* output file name */
int infilenum; /* Input file descriptor */
FILE *outfilep; /* Output file pointer */
/* We keep two copies of the image around. We apply the transforms to one */
/* to produce a new generation in the other, then swap the pointers around */










struct transform *last; /*For building the list up */
void apply_transform( struct transform *which ) ;
void read_transform( struct transform *which );
void read_qt_map( int x, int y, int blocksize );
int read_bit() ;





















while((chr = getopt( argc, argv, "i:o:m:r:vV" )) != EOF ) {



















= atoi( optarg );





iterations = atoi( optarg );
if( iterations < 1 || iterations > 1000 ) error++;
break;
case 'v': /* Allows for a very verbose output by specifying both */







if( error > 1 || infile == NULL || outfile == NULL ) {
fprintf( stderr, "Usage: %s -i <infile> -o <outfile> "
"[options...]\n", argv[0] );
fprintf( stderr, "Options: -m <magnification> "











"Verbose mode\n" ) ;
exit ( error ) ;
}
/* Read in compressed info */
if( verbose ) {




infilenum = open( infile, O_RDONLY );
if( infilenum < 0 ) {







/* We assume the data is out there. That probably isn't good. */
read( infilenum, &hdr, sizeof( struct my_header ) );
/* Read quadtree map and create transforms list */
top = (struct transform *)malloc( sizeof( struct transform ) );
if( top == 0 ) {








ysize = hdr.height;
for( x = 0; x < xsize; x += hdr.blocksize ) {
for( y = 0; y < ysize; y += hdr.blocksize ) {
read_qt_map( x, y, hdr.blocksize );
}
}
last->next =0; /* Null-terminate list */
this = top; /* Get rid of first fake element */
top = top->next;
free( this ) ;
this = 0;
/* Read in transforms */
for( this = top; this != 0; this = this->next ) {




/* Adjust things for magnification */
if( mag != 1 ) {
xsize *= mag;
ysize *= mag;
maxblocksize = hdr.blocksize * mag;








/* Now, we can forget about the magnification; it's just as though the */
/* magnified picture were compressed initially. */
/* Picture computations are done in floating point to keep this easy. */
/* Picture data is converted to bytes only for output. This approach */
/* is not a good idea for really big pictures, as it wastes a fair bit */
/* of memory in that case; it's also bad if floating point operations */
/* are especially slow (e.g. no FPU). Neither are true for me now. */
pictdata = (float *) malloc( sizeof( float ) * xsize * ysize );
pictdata2 = (float *) malloc( sizeof( float ) * xsize * ysize );
if( verbose ) {
printf( "File read in.\n" );
printf( "mag = %d xsize = %d ysize = %d maxblocksize = %d\n",
mag, xsize, ysize, maxblocksize );
i = 0;
for( this = top; this != 0; this = this->next ) {
printf( " T#%d dx=%d dy=%d rx=%d ry=%d orient=%d a=%f b=%f\n",
i, this->dx, this->dy, this->rx, this->ry,
this->orient, this->a, this->b );
i++;
}
}
/* Set up picture array */
for( y = 0; y < ysize; y++ ) {
for( x = 0; x < xsize; x++ ) {




if( verbose ) printf( "Array inited.\n" );
/* Iterate and transform */
for( i = 0; i < iterations; i++ ) {
for( this = top; this != 0; this = this->next ) {











/* Output data--first free up core */
free( pictdata2 );
for( this=top->next ; top != 0; top=this ) {
this=top->next ;
free( top ) ;
}
this=top=0;
bmp = new_bmp( xsize, ysize );
if( bmp == NULL ) {







= 0; y < ysize; y++ ) {
for( x = 0; x < xsize; x++ ) {
if( PICT(x, y) < 0 ) PICT(x, y) = 0;
if( PICT(x, y) > 255) PICT(x, y)=255;
*datap = (unsigned char) PICT(x, y);
datap++;
}
datap += (ysize % 4); /* Correct any misalignment at each row */
}






== NULL ) {





if( bmp == NULL ) {




write_bmp( outfilep, bmp );
fclose( outfilep );
if( verbose ) printf( "Done %s written.\n", outfile );
/* Apply a single transform to a single block; since this gets done a lot, */
/* the case statement is outside the loops rather than inside them. */


















dblocklen = 2*(blocksize-1);
a = which->a;
b = which->b;
switch( which->orient ) {
case 0:
for( slow = 0; slow < blocksize; slow++ ) {
for( fast = 0; fast < blocksize; fast++ ) {










for( slow = 0; slow < blocksize; slow++ ) {
for( fast = 0; fast < blocksize; fast++ ) {










for( slow = 0; slow < blocksize; slow++ ) {
for( fast = 0; fast < blocksize; fast++ ) {










for( slow = 0; slow < blocksize; slow++ ) {
for( fast = 0; fast < blocksize; fast++ ) {










for( slow = 0; slow < blocksize; slow++ ) {
for( fast = 0; fast < blocksize; fast++ ) {










for( slow = 0; slow < blocksize; slow++ ) {
for( fast = 0; fast < blocksize; fast++ ) {











for( slow = 0; slow < blocksize; slow++ ) {
for( fast = 0; fast < blocksize; fast++ ) {










for( slow = 0; slow < blocksize; slow++ ) {
for( fast = 0; fast < blocksize; fast++ ) {















/* I have found that clamping the values every time through generally */
/* leads to better convergence and output image quality, although it */
/* is a bit slower. */
for( slow = 0; slow < blocksize; slow++ ) {
for( fast = 0; fast < blocksize; fast++ ) {
if( PICT2( rx+slow, ry+fast ) < 0 ) {
PICT2( rx+slow, ry+fast ) = 0;
}
if( PICT2( rx+slow, ry+fast ) > 255 ) {





/* Read a file bit by bit. This is used to read the quadtree map. */
int read_bit() {
static unsigned char this_char = 0;
static int bits_left = 0;
if( bits_left <= 0 ) {












/* This is a recursive function, "unrolling" the tree and creating the */
/* transform structures. The block size is filled in at this point. */
void read_qt_map( int x, int y, int blksize ) {
if( blksize <= 3 ) {
/* In this case, we need to simply create the transform structure */
last->next =(struct transform *) malloc( sizeof( struct transform) );
last=last->next ;
if( last == 0 ) {









/* Read the map and do what it says. */
if( read_bit() == ZERO_BIT ) {
/* Create a transform structure */
last->next =(struct transform *) malloc( sizeof( struct transform) );
last=last->next ;
if( last == 0 ) {










/* Recursively create transforms */
blksize /= 2;
read_qt_map( x, y, blksize );
read_qt_map( x+blksize, y, blksize );
read_qt_map( x, y+blksize, blksize );




/* Fill in a transform structure from data in the file. */
void read_transform( struct transform *this ) {
struct filebit bit;





this->a = bit.a;
this->b = bit.b;
this->orient = bit.orient;
}
A.4 Bitmap File Library
A.4.1 Header File: bmp.h
/* Stuff for parsing .BMP files */
/* Approximate definitions. We'll have to do junk to fix up x86 endianness. */
/* THESE ARE PLATFORM DEPENDENT: THESE WILL WORK FOR MANY 32 BIT SYSTEMS. */
typedef unsigned short UINT;
typedef signed short WORD;
typedef unsigned int DWORD;
typedef unsigned char BYTE;
typedef signed int LONG;





/* A really useful macro */
#define ROWLEN( bmpp ) ((3+(bmpp)->bmi->bmih.biWidth)&(~0x03))
/* RGB values */

















struct bmfh { /* BITMAPFILEHEADER */
UINT bfType; /* Must be "BM" */
DWORD bfSize; /* Size of file, bytes */
UINT bfReserved1; /* Zero */
UINT bfReserved2; /* Zero */
DWORD bfOffBits; /* Offset of actual picture data */
} ;
struct bmih { /* BITMAPINFOHEADER */
DWORD biSize; /* Sizeof (struct bmih ) */
LONG biWidth; /* Width in pixels of bitmap */
LONG biHeight; /* Height in pixels of bitmap */
WORD biPlanes; /* Pixel planes--must be 1 */
WORD biBitCount; /* 1, 4, 8, 24--we use 8 */
DWORD biCompression; /* BI_RLE8, BI_RLE4, BI_RGB */
DWORD biSizeImage; /* Size of actual image */
LONG biXPelsPerMeter; /* resolution in X direction */
LONG biYPelsPerMeter; /* resolution in Y direction */
DWORD biClrUsed; /* Number of used colors */
DWORD biClrImportant; /* Number of important colors */
} ;
struct bmi { /* BITMAPINFO */
struct bmih bmih;
RGBQUAD bmiColors[256] ; /* Yecch this should be dynamic. */
} ;
/* Public functions: */
/* Read in stuff from a file; does not handle RLE, but does everything else. */
struct bmpfile *read_bmp( FILE *f );
/* Write stuff out to a file */
int write_bmp( FILE *f, struct bmpfile *bmp );
/* Create a new bitmap with "random" data */
struct bmpfile *new_bmp( int xsize, int ysize );
/* Get rid of a bitmap structure */
void dispose_bmp( struct bmpfile *bmp );
/* Format it for use with our program--make it properly greyscale. */
int cleanup_bmp( struct bmpfile *bmp );
/* Print in a human-readable (?) form */
void dump_bmp( struct bmpfile *bmp );
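The ROWLEN macro above relies on the fact that each row of pixel data in a .BMP file is padded to a multiple of four bytes. For the 8-bit images used here one pixel is one byte, so the padded row length is the width rounded up to the next multiple of four. The sketch below shows that arithmetic on its own; the function names are hypothetical.

```c
#include <assert.h>

/* Bytes occupied by one row of an 8-bit-per-pixel bitmap after padding
 * to a four-byte boundary: add 3, then clear the two low bits. */
unsigned long bmp_row_bytes(unsigned long width)
{
    return (width + 3UL) & ~3UL;
}

/* Number of padding bytes appended to each row. */
unsigned long bmp_row_padding(unsigned long width)
{
    return bmp_row_bytes(width) - width;
}
```

A 256-pixel row needs no padding, while a 5-pixel row occupies 8 bytes, 3 of which are padding.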





/* Utility function prototypes */
static UINT read_UINT( FILE *f );
static WORD read_WORD( FILE *f );
static DWORD read_DWORD( FILE *f );
static BYTE read_BYTE( FILE *f );
static LONG read_LONG( FILE *f );
static void write_UINT( FILE *f , UINT what );
static void write_WORD( FILE *f , WORD what );
static void write_DWORD( FILE *f , DWORD what );
static void write_BYTE( FILE *f, BYTE what );
static void write_LONG( FILE *f, LONG what );
void dump_bmp( struct bmpfile *it ) ;
/* Reading various Wintel base types; these should be platform independent. */
static BYTE read_BYTE( FILE *f ) {
return getc( f ) ;
}
static void write_BYTE( FILE *f, BYTE what ) {
putc( what , f ) ;
}
static WORD read_WORD( FILE *f ) {
int a,b;
a = getc( f );
b = getc( f );




(int) a, (int) b, (b * 256) + a ); */
return (b * 256) + a;
}
static void write_WORD( FILE *f, WORD what ) {
    /* printf( "write_WORD: bytes %02X %02X, word = %04X\n",
       (int) what%256, (int) what/256, what ); */
    putc( what % 256, f ); /* low byte first (little-endian) */
    putc( what / 256, f );
}
static DWORD read_DWORD( FILE *f ) {
    int a,b,c,d;
    DWORD dw;
    a = getc( f ); b = getc( f );
    c = getc( f ); d = getc( f );
    /* printf( "read_DWORD: bytes %02X %02X %02X %02X, dword = %08X\n",
       (int) a, (int) b, (int) c, (int) d, (d * 256 + c) * 65536 + b * 256 + a ); */
    return (d * 256 + c) * 65536 + b * 256 + a;
}
static void write_DWORD( FILE *f, DWORD what ) {
    int a, b, c, d;
    a = what % 256;
    b = (what / 256) % 256;
    what /= 65536;
    c = what % 256;
    d = (what / 256);
    /* printf( "write_DWORD: bytes %02X %02X %02X %02X, long = %08X\n",
       (int) a, (int) b, (int) c, (int) d, (d * 256 + c) * 65536 + b * 256 + a ); */
    putc( a, f ); putc( b, f );
    putc( c, f ); putc( d, f );
}
static LONG read_LONG( FILE *f ) {
    int a,b,c,d;
    a = getc( f ); b = getc( f );
    c = getc( f ); d = getc( f );
    /* printf( "read_LONG: bytes %02X %02X %02X %02X, long = %08X\n",
       (int) a, (int) b, (int) c, (int) d, (d * 256 + c) * 65536 + b * 256 + a ); */
    return (d * 256 + c) * 65536 + b * 256 + a;
}
static void write_LONG( FILE *f, LONG what ) {
    int a, b, c, d;
    a = what % 256;
    b = (what / 256) % 256;
    what /= 65536;
    c = what % 256;
    d = (what / 256);
    /* printf( "write_LONG: bytes %02X %02X %02X %02X, long = %08X\n",
       (int) a, (int) b, (int) c, (int) d, (d * 256 + c) * 65536 + b * 256 + a ); */
    putc( a, f ); putc( b, f );
    putc( c, f ); putc( d, f );
}
static UINT read_UINT( FILE *f ) {
    int a,b;
    a = getc( f );
    b = getc( f );
    /* printf( "read_UINT: bytes %02X %02X, uint = %04X\n",
       (int) a, (int) b, (b * 256) + a ); */
    return (b * 256) + a;
}
static void write_UINT( FILE *f, UINT what ) {
    /* printf( "write_UINT: bytes %02X %02X, word = %04X\n",
       (int) what%256, (int) what/256, what ); */
    putc( what % 256, f );
    putc( what / 256, f );
}
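The helper routines above read and write multi-byte values one byte at a time, least significant byte first, so that the file layout does not depend on the byte order of the host. The same idea can be sketched with shifts on an in-memory buffer instead of a FILE stream. The function names below are hypothetical, and the fixed-width types from <stdint.h> are a C99 convenience assumed here rather than something the thesis code uses.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 16-bit value least significant byte first. */
void put_le16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v & 0xFFu);
    p[1] = (unsigned char)((v >> 8) & 0xFFu);
}

/* Load a 16-bit little-endian value. */
uint16_t get_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

/* Store a 32-bit value least significant byte first. */
void put_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xFFu);
    p[1] = (unsigned char)((v >> 8) & 0xFFu);
    p[2] = (unsigned char)((v >> 16) & 0xFFu);
    p[3] = (unsigned char)((v >> 24) & 0xFFu);
}

/* Load a 32-bit little-endian value. */
uint32_t get_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Writing the value 0x4D42 with `put_le16` produces the bytes 'B' and 'M' in that order, which is the signature expected in the bfType field.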
/****************** Public Functions ***********************/
/* Get rid of a bitmap */
void dispose_bmp( struct bmpfile *bmp ) {
    if( bmp != NULL ) {
        if( bmp->bmfh != NULL ) {
            free( bmp->bmfh );
        }
        if( bmp->bmi != NULL ) {
            free( bmp->bmi );
        }
        if( bmp->data != NULL ) {
            free( bmp->data );
        }
        free( bmp );
    }
}
/* Create a new bitmap in memory */




it = (struct bmpfile *)malloc( sizeof( struct bmpfile ) );
if( it != NULL ) {
it->bmfh = (struct bmfh *)malloc( sizeof( struct bmfh ) );
it->bmi = (struct bmi *)malloc( sizeof( struct bmi ) );
/* Data must be aligned mod four */
it->data = (unsigned char *)malloc( xsize * (4*(ysize + 3 ) / 4 ) );
if( it->bmfh == NULL || it->bmi == NULL || it->data == NULL ) {




if( it == NULL ) return it; /* Not enough core */




it->bmfh->bfOffBits = sizeof( struct bmfh) + sizeof( struct bmi );
it->bmfh->bfSize = it->bmfh->bfOffBits + xsize * (4*(ysize + 3) / 4 ) ;
/* Fill in bitmap info header */
it->bmi->bmih.biSize = sizeof( struct bmih ); /* This should work */
it->bmi->bmih.biWidth = xsize;
it->bmi->bmih.biHeight = ysize;
it->bmi->bmih.biPlanes = 1;
it->bmi->bmih.biBitCount = 8;
it->bmi->bmih.biCompression = BI_RGB;
it->bmi->bmih.biSizeImage = xsize * (4*(ysize+3)/4) ;
it->bmi->bmih.biXPelsPerMeter = 2800; /* somewhere near 72 dpi */
it->bmi->bmih.biYPelsPerMeter = 2800;
it->bmi->bmih.biClrUsed = 256; /* all of them */
it->bmi->bmih.biClrImportant = 256; /* all of them */
/* Fill in the color quads */
for( i = 0; i < 256; i++ ) {
it->bmi->bmiColors[i].Red = i;
it->bmi->bmiColors[i].Green = i;
it->bmi->bmiColors[i].Blue = i;
it->bmi->bmiColors[i].Reserved = 0;
}
/* We will not touch the initial picture data; use random data */
return it;
}
/* Write a bitmap to a file */





/* dump_bmp( it ) ; */
if( outfile == NULL ) fprintf( stderr, "ERROR: NULL OUTFILE!\n" );
/* Write out the header */
write_UINT( outfile, it->bmfh->bfType );
write_DWORD( outfile, it->bmfh->bfSize );
write_UINT( outfile, it->bmfh->bfReserved1 );
write_UINT( outfile, it->bmfh->bfReserved2 );
write_DWORD( outfile, it->bmfh->bfOffBits );
/* Verify the number of color entries */
ColorEntries = it->bmi->bmih.biClrUsed;
if( ColorEntries > 256 ) {






/* Write out the bitmap info header */
write_DWORD( outfile, it->bmi->bmih.biSize );
write_LONG( outfile, it->bmi->bmih.biWidth );
write_LONG( outfile, it->bmi->bmih.biHeight );
write_WORD( outfile, it->bmi->bmih.biPlanes );
write_WORD( outfile, it->bmi->bmih.biBitCount );
write_DWORD( outfile, it->bmi->bmih.biCompression );
write_DWORD( outfile, it->bmi->bmih.biSizeImage );
write_LONG( outfile, it->bmi->bmih.biXPelsPerMeter );
write_LONG( outfile, it->bmi->bmih.biYPelsPerMeter );
write_DWORD( outfile, it->bmi->bmih.biClrUsed );
write_DWORD( outfile, it->bmi->bmih.biClrImportant );
/* Write out the color quads */
if( ColorEntries == 0 ) {


















for( i = 0; i < ColorEntries; i++ ) {
write_BYTE( outfile, it->bmi->bmiColors[i].Blue );
write_BYTE( outfile, it->bmi->bmiColors[i].Green );
write_BYTE( outfile, it->bmi->bmiColors[i].Red );
write_BYTE( outfile, it->bmi->bmiColors[i].Reserved );
}
/* Write out actual image data */
datap = it->data;







/* Read a bitmap from a file */





it = (struct bmpfile *)malloc( sizeof( struct bmpfile ) );
if( it != NULL ) {
it->bmfh = (struct bmfh *)malloc( sizeof( struct bmfh ) );
it->bmi = (struct bmi *)malloc( sizeof( struct bmi ) );
/* Data must be aligned mod four */
it->data = NULL;





if( it == NULL ) return it; /* Not enough core */
/* Fill in bitmap header */
it->bmfh->bfType = read_UINT( infile );
it->bmfh->bfSize = read_DWORD( infile );
it->bmfh->bfReserved1 = read_UINT( infile );
it->bmfh->bfReserved2 = read_UINT( infile );
it->bmfh->bfOffBits = read_DWORD( infile );
/* Check some stuff */
if( it->bmfh->bfType != CORRECT_BF_TYPE ) {
fprintf( stderr, "Warning: Incorrect magic number for bmp file.\n" );
/* We probably ought to quit right about now, but... */
}
/* Fill in bitmap info header */
it->bmi->bmih.biSize = read_DWORD( infile );
it->bmi->bmih.biWidth = read_LONG( infile );
it->bmi->bmih.biHeight = read_LONG( infile );
it->bmi->bmih.biPlanes = read_WORD( infile );
it->bmi->bmih.biBitCount = read_WORD( infile );
it->bmi->bmih.biCompression = read_DWORD( infile );
it->bmi->bmih.biSizeImage = read_DWORD( infile );
it->bmi->bmih.biXPelsPerMeter = read_LONG( infile );
it->bmi->bmih.biYPelsPerMeter = read_LONG( infile );
it->bmi->bmih.biClrUsed = read_DWORD( infile );
it->bmi->bmih.biClrImportant = read_DWORD( infile );
/* Verify a couple of things */
if( it->bmi->bmih.biSize != sizeof( struct bmih ) ) {




if( it->bmi->bmih.biClrImportant > it->bmi->bmih.biClrUsed ) {




switch( it->bmi->bmih.biBitCount ) {
case 1:
if( it->bmi->bmih.biClrUsed > 2 ) {





!= BI_RGB ) {







if( it->bmi->bmih.biClrUsed > 16 ) {
fprintf( stderr, "Warning: More than sixteen colors for 4-bit image.\n" );
}
if( it->bmi->bmih.biCompression != BI_RGB && it->bmi->bmih.biCompression != BI_RLE4 ) {






if( it->bmi->bmih.biCompression != BI_RGB && it->bmi->bmih.biCompression != BI_RLE8 ) {






if( it->bmi->bmih.biCompression != BI_RGB ) {






fprintf( stderr, "Warning: Incorrect color depth.\n" );
/* Fill in the color quads */
ColorEntries = it->bmi->bmih.biClrUsed;
if( ColorEntries > 256 ) {






if( ColorEntries == 0 ) {




















for( i = 0; i < ColorEntries; i++ ) {
it->bmi->bmiColors[i].Blue = read_BYTE( infile );
it->bmi->bmiColors[i].Green = read_BYTE( infile );
it->bmi->bmiColors[i].Red = read_BYTE( infile );
it->bmi->bmiColors[i].Reserved = read_BYTE( infile );
}
/* Read in actual image data */
if( it->bmi->bmih.biSizeImage




it->data = (unsigned char *)malloc( it->bmi->bmih.biSizeImage );
if( it->data == NULL ) {






for( i = 0; i < it->bmi->bmih.biSizeImage; i++ ) {








/* Check a couple of essentials */
if( bmp->bmi->bmih.biBitCount != 8 ) {





if( bmp->bmi->bmih.biCompression != BI_RGB ) {





/* Read in translation array */
for( i = 0; i < bmp->bmi->bmih.biClrUsed; i++ ) {
if( bmp->bmi->bmiColors[i].Red != bmp->bmi->bmiColors[i].Green ||
bmp->bmi->bmiColors[i].Red != bmp->bmi->bmiColors[i].Blue ) {
fprintf( stderr, "Warning: Color #%d is not a greyscale value.\n", i );
}
conv[i] = bmp->bmi->bmiColors[i].Red;
}
/* Now make the "correct" translation array. */
for( i = 0; i < 256; i++ ) {
bmp->bmi->bmiColors[i].Red = i;
bmp->bmi->bmiColors[i].Green = i;
bmp->bmi->bmiColors[i].Blue = i;
bmp->bmi->bmiColors[i].Reserved = 0;
}
bmp->bmi->bmih.biClrUsed = 256;
/* Now, fix our actual image data */
datap = bmp->data;







void dump_bmp( struct bmpfile *bitmap ) {
int i;
































































printf ( "\nColors :\n" );
for( i = 0; i < 64; i++ ) {









i+64, (int) bitmap->bmi->bmiColors[i+64].Red,
(int) bitmap->bmi->bmiColors[i+64].Green,
(int) bitmap->bmi->bmiColors[i+64].Blue,
(int) bitmap->bmi->bmiColors[i+64].Reserved,
i+128, (int) bitmap->bmi->bmiColors[i+128].Red,
(int) bitmap->bmi->bmiColors[i+128].Green,
(int) bitmap->bmi->bmiColors[i+128].Blue,
(int) bitmap->bmi->bmiColors[i+128].Reserved,
i+192, (int) bitmap->bmi->bmiColors[i+192].Red,
(int) bitmap->bmi->bmiColors[i+192].Green,
(int) bitmap->bmi->bmiColors[i+192].Blue,




The scripts are separated into a few sections, to allow easier compilation of various
portions of the system. The Master script file calls these various sections in order,
producing a final design.
B.1 Master Script File: syn.scr2
include syn.scr2a /* Compile the memory interface unit */
remove_design /* Remove the stuff from memory */
include syn.scr2b /* Compile the parameter computation block */
remove_design /* Remove the stuff from memory */
include syn.scr2c /* Compile the rest of the pipeline unit */
remove_design /* Remove the stuff from memory */
include syn.scr2d /* Compile the host interface and top level */





B.2 Memory Interface Unit: syn.scr2a
define_design_lib work -path syn_work2




set_dont_use { dw01.sldb/DW01_sub/rpcs dw01.




read -format vhdl memory_cache_chunk.vhd
check_design
read -format vhdl mem_iface_unit.vhd
check_design
create_clock -period 50 CLOCK
set_drive 0 { CLOCK RESET }
set_dont_touch_network { CLOCK RESET }
set_attribute all_inputs() is_clk false -type boolean -quiet
set_attribute { CLOCK RESET } is_clk true -type boolean -quiet
set_input_delay 1 -clock CLOCK filter(all_inputs() "@is_clk != true")




create_clock -period 50 CLOCK
set_drive 0 { CLOCK RESET }
set_dont_touch_network { CLOCK RESET }
set_attribute all_inputs() is_clk false -type boolean -quiet
set_attribute { CLOCK RESET } is_clk true -type boolean -quiet
set_input_delay 1 -clock CLOCK filter(all_inputs() "@is_clk != true")
set_output_delay 5 -clock CLOCK all_outputs()
set_load 5 all_outputs()
compile -map_effort high
write -format db
B.3 Parameter Computation Chunk: syn.scr2b
define_design_lib work -path syn_work2
read -format vhdl thesis_pkg.vhd
read -format vhdl param_comp_chunk.vhd
check_design
set_dont_use { dw01.sldb/DW01_sub/rpcs dw01.sldb/DW01_add/rpcs dw01.sldb/DW01_addsub/rpcs }
create_clock -period 50 CLOCK
set_drive 0 { CLOCK RESET }
set_dont_touch_network { CLOCK RESET }
set_attribute all_inputs() is_clk false -type boolean -quiet
set_attribute { CLOCK RESET } is_clk true -type boolean -quiet
set_input_delay 1 -clock CLOCK filter(all_inputs() "@is_clk != true")




create_clock -period 50 CLOCK
set_drive 0 { CLOCK RESET }
set_dont_touch_network { CLOCK RESET }
set_attribute all_inputs() is_clk false -type boolean -quiet
set_attribute { CLOCK RESET } is_clk true -type boolean -quiet
set_input_delay 1 -clock






B.4 Pipeline Unit: syn.scr2c







set_dont_use { dw01.sldb/DW01_sub/rpcs dw01.sldb/DW01_add/rpcs dw01.sldb/DW01_addsub/rpcs }
create_clock -period 50 CLOCK
set_drive 0 { CLOCK RESET }
set_dont_touch_network { CLOCK RESET }
set_attribute all_inputs() is_clk false -type boolean -quiet
set_attribute { CLOCK RESET } is_clk true -type boolean -quiet
set_input_delay 1 -clock CLOCK filter(all_inputs() "@is_clk != true")




create_clock -period 50 CLOCK
set_drive 0 { CLOCK RESET }
set_dont_touch_network { CLOCK RESET }
set_attribute all_inputs() is_clk false -type boolean -quiet
set_attribute { CLOCK RESET } is_clk true -type boolean -quiet
set_input_delay 1 -clock CLOCK filter(all_inputs() "@is_clk != true")





read -format vhdl range_memory_syn.vhd
create_clock -period 50 CLOCK
set_drive 0 { CLOCK RESET }
set_dont_touch_network { CLOCK RESET }
set_attribute all_inputs() is_clk false -type boolean -quiet
set_attribute { CLOCK RESET } is_clk true -type boolean -quiet
set_input_delay 1 -clock CLOCK filter(all_inputs() "@is_clk != true")




create_clock -period 50 CLOCK
set_drive 0 { CLOCK RESET }
set_dont_touch_network { CLOCK RESET }
set_attribute all_inputs() is_clk false -type boolean -quiet
set_attribute { CLOCK RESET } is_clk true -type boolean -quiet
set_input_delay 1 -clock CLOCK filter(all_inputs() "@is_clk != true")





read -format vhdl range_block_chunk.vhd
check_design
create_clock -period 50 CLOCK
set_drive 0 { CLOCK RESET }
set_dont_touch_network { CLOCK RESET }
set_attribute all_inputs() is_clk false -type boolean -quiet
set_attribute { CLOCK RESET } is_clk true -type boolean -quiet
set_input_delay 1 -clock CLOCK filter(all_inputs() "@is_clk != true")




read -format db MAC_CHUNK.db
read -format db RANGE_MEMORY.db
read -format db PARAM_COMP_CHUNK.db
read -format vhdl pipeline_unit.vhd
write -format db
B.5 Host Interface and Top Level: syn.scr2d
define_design_lib work -path syn_work2
read -format vhdl thesis_pkg.vhd
hdlin_auto_save_templates = TRUE
read -format vhdl host_iface_unit.vhd
read -format vhdl top_level.vhd
read -format db PIPELINE_UNIT.db
read -format db MEM_IFACE_UNIT.db
set_dont_touch PIPELINE_UNIT
set_dont_touch MEM_IFACE_UNIT
read -format vhdl syn_top_level.vhd
check_design
set_dont_use { dw01.sldb/DW01_sub/rpcs dw01.sldb/DW01_add/rpcs dw01.sldb/DW01_addsub/rpcs }
create_clock -period 50 CLOCK




set_attribute all_inputs() is_clk false -type boolean -quiet
set_attribute { CLOCK RESET } is_clk true -type boolean -quiet
set_input_delay 1 -clock CLOCK filter(all_inputs() "@is_clk != true")




write -format db HOST_IFACE_UNIT_NUM_PIPELINE_UNITS4





Some representative schematic diagrams, generated by Synopsys, are included
below. These are only small portions of the design; the entire set of schematic
diagrams would require several hundred pages.










Several small image manipulation tools were developed during the writing of this
thesis; their source code is provided on the included CD-ROM. This appendix gives
a brief introduction to each of them and information on their operation.
D.1 bmp2memfile
This program converts a bitmap file (Windows .BMP format) into the form
read by the simulated main memory VHDL entity. This format is just a series
of four-digit (two-byte) hex numbers, each on its own line, without any other
characters.
The program takes exactly two arguments: a bitmap file and an output file,
in that order. The simulated memory expects the output from this program to be
named memory.data. The program writes five words (ten bytes) of nulls before
the image data; thus, during simulation, the image base address is 10 (decimal).
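The memory-file layout described above can be sketched in C as follows. This is not the bmp2memfile source: the function name format_memfile, the packing of two pixels per 16-bit word (first pixel in the high byte), and the use of upper-case hex digits are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Zero words written ahead of the image data (five words = ten bytes). */
#define HEADER_WORDS 5
/* Characters per output line: four hex digits plus a newline. */
#define LINE_CHARS 5

/*
 * Format npixels 8-bit pixels as memory-file text in out.
 * ASSUMPTION: two pixels are packed per 16-bit word, first pixel in the
 * high byte; an odd trailing pixel is padded with a zero low byte.
 * Returns the number of characters written (excluding the terminating
 * NUL), or -1 if out is too small.
 */
static int format_memfile(const unsigned char *pixels, size_t npixels,
                          char *out, size_t outsize)
{
    size_t words = HEADER_WORDS + (npixels + 1) / 2;
    size_t used = 0;
    size_t i;

    assert(pixels != NULL && out != NULL);
    if (outsize < words * LINE_CHARS + 1)
        return -1;

    /* Leading null words; the image data therefore starts at byte 10. */
    for (i = 0; i < HEADER_WORDS; i++)
        used += (size_t)snprintf(out + used, outsize - used, "%04X\n", 0u);

    /* One four-digit hex word per line, nothing else on the line. */
    for (i = 0; i < npixels; i += 2) {
        unsigned hi = pixels[i];
        unsigned lo = (i + 1 < npixels) ? pixels[i + 1] : 0u;
        used += (size_t)snprintf(out + used, outsize - used, "%04X\n",
                                 (hi << 8) | lo);
    }
    return (int)used;
}
```

Under these assumptions, three pixels 0x12, 0xAB, 0x7F yield five lines of 0000 followed by 12AB and 7F00.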
D.2 bmp2pgm
This program converts a grayscale bitmap file to the pgm format, which is
commonly used as a (simple) intermediate format for translation into any of several
image file formats. Image data in a pgm file is stored as the digits of decimal
numbers (stored as text); thus, a pgm file takes up considerably more disk space
than a bitmap file.
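To make the size difference concrete, here is a minimal C sketch of a plain-text ("P2") pgm writer. It is not the bmp2pgm source: the name write_pgm_plain is hypothetical, and the sketch assumes the caller has already converted the bitmap data to unpadded rows with the top row first.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write an 8-bit greyscale image as plain-text PGM ("P2").
 * ASSUMPTION: pixels holds height rows of width bytes, top row first,
 * with no row padding (a BMP stores rows bottom-up, padded to four bytes,
 * so the caller must undo that layout first).
 * Returns the number of bytes written, or -1 on an output error.
 */
static long write_pgm_plain(FILE *out, const unsigned char *pixels,
                            int width, int height)
{
    long total = 0;
    int x, y, n;

    assert(out != NULL && pixels != NULL && width > 0 && height > 0);

    /* Header: magic number, dimensions, maximum grey value. */
    n = fprintf(out, "P2\n%d %d\n255\n", width, height);
    if (n < 0)
        return -1;
    total += n;

    for (y = 0; y < height; y++) {
        for (x = 0; x < width; x++) {
            /* One decimal value per line: up to four bytes per pixel. */
            n = fprintf(out, "%d\n", pixels[y * width + x]);
            if (n < 0)
                return -1;
            total += n;
        }
    }
    return total;
}
```

For a 2x2 image with pixel values 0, 9, 10, and 255, the pixel data alone occupies 11 bytes of text, against 4 bytes in raw form.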
The program can take zero, one, or two command line arguments. If zero
are provided, it acts as a filter, translating the standard input to the standard
output. If a single argument is given, it is taken to be the input file (instead of
standard input) and the output is placed on standard output. If two arguments




This program provides a dump of the headers for a bitmap file: essentially, all
the information contained in it except for the actual image data. This includes
the size, the format of the data, and the color table.
One or more bitmap files must be supplied via the command line; information
is printed for each.
D.4 imgdiff
This program compares two images and produces a new image where the grayscale
value of each pixel is the absolute value of the difference between the
corresponding pixels of the two original images. It thus provides a visual indication
of errors between the two images.
Three command line arguments must be supplied: two input files and the
output file name, in that order. The output file will have the same dimensions as
the two input files, which must be of equal size. All image files are bitmap files.
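The per-pixel operation described above can be sketched in C as follows. The function name image_abs_diff is hypothetical, and the sketch assumes both images have already been loaded as equal-length arrays of 8-bit grayscale values; it does not reproduce the file handling of imgdiff itself.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Store |a[i] - b[i]| in out[i] for every pixel.
 * ASSUMPTION: a, b, and out each hold npixels 8-bit grayscale values.
 */
static void image_abs_diff(const unsigned char *a, const unsigned char *b,
                           unsigned char *out, size_t npixels)
{
    size_t i;

    assert(a != NULL && b != NULL && out != NULL);
    for (i = 0; i < npixels; i++) {
        /* Subtract as int so the difference cannot wrap around. */
        int d = (int)a[i] - (int)b[i];
        out[i] = (unsigned char)(d < 0 ? -d : d);
    }
}
```

Identical images therefore produce an all-black output, and larger errors appear as brighter pixels.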
D.5 imgcompare
This program provides a rudimentary histogram of differences of grayscale values
between two equally sized images. Both of these images may be supplied, or one
may be an (automatically generated) image consisting of all black; in that case,
the histogram represents the relative distribution of the grayscale values of the
supplied image. This program produces a text output (a crude histogram) which
is designed to fit on a 132x24 terminal.
Two versions of the histogram are possible: a normal and a "scrunched"
version, which differ in the size of the histogram bins. In the normal version,
each bin holds a single difference value and only the first 130 are displayed. In
the scrunched mode, each bin holds two difference values, allowing the display of
all 256 possible difference values (at a loss of resolution).
Scrunched mode is selected by a -s flag as the first argument to the program;
its absence indicates normal mode. One or two additional arguments specify the
images to be compared; if only one is present, the all black image is used as the
second and a histogram of grayscale values is produced.
Typically, scrunched mode is used when a single image is specified and normal
mode is used when two (presumably similar) images are compared.
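The two binning modes can be illustrated with a short C sketch. It covers only the counting step, not the text rendering; the name diff_histogram and the convention of passing NULL for the all-black second image are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define NORMAL_BINS 130     /* one difference value per bin, first 130 shown */
#define SCRUNCHED_BINS 128  /* two difference values per bin, all 256 shown */

/*
 * Count absolute grayscale differences between images a and b.
 * ASSUMPTION: b == NULL stands for the automatically generated all-black
 * image, so the result is a histogram of the grayscale values of a.
 * Returns the number of bins in use for the selected mode.
 */
static size_t diff_histogram(const unsigned char *a, const unsigned char *b,
                             size_t npixels, int scrunched,
                             unsigned long bins[NORMAL_BINS])
{
    size_t nbins = scrunched ? SCRUNCHED_BINS : NORMAL_BINS;
    size_t i;

    assert(a != NULL && bins != NULL);
    for (i = 0; i < NORMAL_BINS; i++)
        bins[i] = 0;

    for (i = 0; i < npixels; i++) {
        int other = (b != NULL) ? b[i] : 0;
        int d = (int)a[i] - other;
        size_t bin;

        if (d < 0)
            d = -d;
        bin = scrunched ? (size_t)d / 2 : (size_t)d;
        /* In normal mode, differences of 130 or more are not displayed. */
        if (bin < nbins)
            bins[bin]++;
    }
    return nbins;
}
```

With 130 single-value bins or 128 double-value bins, either mode fits within the 132 columns mentioned above.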
D.6 ftrans3
This program translates a text file indicating the transformations found by
(simulated) hardware into my fractal image compression file format for later
decompression by fdecomp3. The input file format is designed for easy entry by hand
from a simulation; its lines consist solely of an increasing integer (primarily
intended for detecting missing or duplicate lines), a space, the address of the best
domain block found (assuming an offset of ten to the start of the image), a
forward slash character, and the index of the isometry for the best transformation.
No additional characters should be present. Successive lines are assumed
to be successive range blocks in the image, working from low addresses to high
addresses.
The program requires four arguments: the original bitmap file, the block size
used, the results filename (whose contents are described above), and an output
filename, in that order. When it runs, the program produces, on standard output,
a list of the transformations and the MSE values associated with them.
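The input line format described above can be parsed along the following lines. This is a sketch rather than the ftrans3 source: the names transform_line and parse_transform_line are hypothetical, and the sketch assumes that the sequence number, address, and isometry index are all written in decimal.

```c
#include <assert.h>
#include <stdio.h>

/* Address of the first image byte in the simulated memory. */
#define IMAGE_BASE 10UL

struct transform_line {
    unsigned long seq;            /* line counter from the results file */
    unsigned long domain_offset;  /* domain address relative to image start */
    int isometry;                 /* index of the best isometry */
};

/*
 * Parse one results line of the form "<seq> <address>/<isometry>".
 * ASSUMPTION: all three fields are decimal integers.
 * Returns 0 on success, or -1 if the line is malformed, carries trailing
 * characters, has an unexpected sequence number, or has an address below
 * the image base.
 */
static int parse_transform_line(const char *line, unsigned long expected_seq,
                                struct transform_line *out)
{
    unsigned long seq, addr;
    int iso;
    int consumed = 0;

    assert(line != NULL && out != NULL);
    if (sscanf(line, "%lu %lu/%d%n", &seq, &addr, &iso, &consumed) != 3)
        return -1;
    /* Permit a single trailing newline, but nothing else. */
    if (line[consumed] == '\n')
        consumed++;
    if (line[consumed] != '\0')
        return -1;
    /* A mismatched counter signals a missing or duplicate line. */
    if (seq != expected_seq || addr < IMAGE_BASE || iso < 0)
        return -1;

    out->seq = seq;
    out->domain_offset = addr - IMAGE_BASE;
    out->isometry = iso;
    return 0;
}
```

A caller would read the results file line by line, incrementing expected_seq after each successful parse.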
D.7 fstat3
This program dumps some statistics on a fractally compressed image (stored in
the file format of fcomp3 or ftrans3). Additionally, if the -v flag is used, the
transforms are dumped. Any number of files may be specified on the command
line.
The information dumped includes the image dimensions, the number of blocks
of various sizes, and the maximum, minimum, and average values for |a| and b.
99
Bibliography
[1] Barnsley, Michael F.; Hurd, Lyman P., Fractal Lmage Compression, A. K.
Peters, Ltd., Wellesley, MA, 1993.
[2] Rolewicz, Stefan, Metric Linear Spaces, Panswowe Wydawnietwo Naukowe
(PWNPolish Scientific Publishers), Warszawa, Poland, 1972.
[3] Davis, G. M., "A Wavelet-Based Analysis of fractal Image
Compression,"
IEEE Transactions on Image Processing, 7 141-154, 1998.
[4] Saupe, Dietmar, "A New View of Fractal Image Compression as Convolution
Transform
Coding,"
IEEE Signal Processing Letters, 3 193-195, 1996.
[5] Ewing, Gary J.; Woodruff, Christopher J., "Comparison of JPEG and
Fractal-Based Image Compression on Target Acquisition by Human Ob
servers,"
Optical Engineering, 35 284-288, 1996.
[6] Jacquin, Arnaud E., "Image Coding Based on a Fractal Theory of Iterated
Contractive Image
Transformations,"
IEEE Transactions on Image Process
ing, 1 18-30, 1992.
[7] Kumar, S.; Jain, R. C, "Low-Complexity Fractal-Based Image Compression
Technique,"
IEEE Transactions on Consumer Electronics, 43 987-993, 1997.
[8] Jackson, D. J.; Mahmoud, W., "Parallel Pipelined Fractal Image Compres
sion Using Quadtree
Recomposition,"
The Computer Journal, 39 1-13, 1996.
[9] Jackson, D. J.; Mahmoud, W.; Gaughan, P. T., "Faster Fractal Image Com
pression Using Quadtree
Recomposition,"
Image and Vision Computing, 15
759-???, 1997.
[10] Mitra, S. K.; Murthy, C. A.; Kundu, M. K.,
"Technique for Fractal Im
age Compression Using Genetic
Algorithm,"
IEEE Transactions on Image
Processing, 7 586-593, 1998.
[11] Palazzari, Paolo; Coli, Moreno; Lulli, Guglielmo, "Massively
Parallel Pro




Journal of Systems Architecture: The EUROMICRO
Journal, 45 765-779, 1999.
100
[12] Acken, Kevin P.; Irwin, Mary Jane; Owens, Robert M., "A Parallel ASIC




Processing Systems for Signal, Image, and Video Technology, 19 97-113,
1998.
[13] Barnsley, Michael F.; Sloan, Alan D., "Method and apparatus for com
pression and decompression of digital image
data,"
US Patent 5347600, 23
October 1991.
101

