Fractal image compression has several useful properties, including resolution independence, high compression ratios, rapid decoding, and subjectively good image quality. Despite this, its adoption has been limited, largely because of the very high computational burden of compression.
INTRODUCTION
A practical automatic method for fractal image compression, not requiring human intervention, was first proposed by Arnaud E. Jacquin.' Research has been ongoing to improve on his work, but many of the difficulties remain-chief among them the high computational complexity of compression.
Fractal image compression has several useful properties, many of which are uncommon in other image compression schemes. Several of these properties are especially appealing for use in modern multimedia applications, such as the world wide web.
Fractal image compression algorithms are inherently resolution independent. The image is stored as a series of contractive transformations on a space; these transformations may be scaled to apply to a space of any size, resulting in decompression at any desired resolution. Decompression at higher resolutions than the original produces a smooth, detailed image. (Obviously, an image compression algorithm cannot add information which does not exist in the original image. While decompression at higher resolutions than the original can apparently add details to the image, such details, although often appearing believable, are figments of the compression and decompression processes. ) Fractal image compression algorithms provide fairly high compression ratios, often at least as good as JPEG compression at similar perceived qualities. 2 The decompression of images compressed using fractal image compression is quite rapid and the time required scales linearly with the image area. Real time image decompression is easily possible without exotic hardware.
Fractal image compression algorithms tend to produce images which are subjectively good looking; the artifacts produced by compression and decompression are typically not as distracting as with many other image compression algorithms. The most noticeable artifacts are usually a small or moderate amount of blockiness; edges and textures are usually preserved faithfully.
Fractal image compression suffers from one primary drawback, that of very high compression time. An ideal compression algorithm, which performs a full domain search, requires 0(n2) time for an image with area n; further, the actual time required is relatively high, even with modern, powerful computers. This paper presents an ASIC architecture which seeks to alleviate this high compression time by performing the required block matching in parallel, reducing the time complexity to 0(m) for images up to some maximum size (determined by the implementation). This ASIC architecture was developed and simulated using VHDL and synthesized using the Synopsys Design Compiler software; thus, it may be targeted to virtually any desired technology.
FRACTAL IMAGE COMPRESSION THEORY
The fractal image compression algorithms in current use are based on local iterated function systems (LIFSs) ; this is an extension to the class of fractals defined by iterated function systems (IFSs) . What follows is a terse overview of the theories used; further detail is readily available in the literature.3"
Iterated Function Systems
An IFS is a set of contractive transformations defined on a metric space. A transformation of a space is contractive with contractivity factor s if, for any two points x and y in the space, D(x,y) sD(T(x),T(y)),s < 1 (1) where T(x) is the transformation and D(x, y) is the distance metric used in the space. (Throughout this paper, lowercase variables indicate points in the space, while uppercase represent subsets of the space.) An IFS is used by taking the union of all the transformations on each of a set of points in the space; that is, for an IFS F(X) consisting of transformations T0 ,T,, . . . , T, F(X) = UTj(x : x X). (2) The contraction mapping theorem3 states that, given a contractive IFS F, there exists a fixed attractor Xf such that Xf F(Xf). (3) Further, given any nonempty subset Y of the space, lim F°Th(Y) = Xf, where F°'(X) = F(X), F°2(X) =
An IFS thus may be used to describe a subset of a space, and an iterative method exists to approximate this subset. Further, due to its fractal nature, an IFS can encode a seemingly complicated subset using relatively simple rules. To form an image compression system based on IFSs, one would have to invert this transformation; given a subset of the space of all images, one would be required to find an IFS whose attractor is nearly equal to the subset. No practical general algorithm exists for this inverse problem; hence, for a practical image compression system, some modifications must be made to the IFS definition.
Local Iterated Function Systems
An LIFS extends an IFS by associating with each transformation a domain, consisting of a subset of the space. All points lying inside this domain are transformed; all points outside the domain are ignored by the transform. (This also implies that the transform also has a range, formed of the union of the transforms of all the domain points.)
An LIFS, even if made up of individually contractive transforms, is not necessarily globally contractive; the contraction mapping theorem does not apply directly to LIFSs. Barnsley3 has shown that an LIFS may have zero, one, or several attractors; further, if at least one attractor exists, then there is a single largest attractor which is formed by the union of all the attractors. This largest attractor may be found by applying the LIFS to the universal set of the space.
In current fractal image compression algorithms, an LIFS is generated such that every point in the image is in the range of exactly one transform and such that the domains of the transforms are within the image area. Typically, the ranges are square blocks, arranged in a quadtree starting from some arbitrary root block size. The domains are also square blocks and are spatially larger than the range blocks; frequently, they have four times the area of range blocks. Transformations are limited to suitable affine transformations spatially and contractive linear functions chromatically. (There are eight possible affine transformations mapping one square block to another.) ( Some researchers formulate the theory with the transformations acting in the opposite sense-changing small blocks into larger blocks-and consequently reverse the terms "range block" and "domain block." Throughout this paper, domain blocks are taken to be larger than range blocks.)
It is an unproven conjecture of the authors that any LIFS with individually contractive transforms in which all the transform domains lie entirely within the union of the transform ranges will have at least one attractor. If this conjecture is valid, then the set of transforms produced by a fractal image compression algorithm, as described here, must be convergent.
RELATED WORK
A practical fractal image compression was first introduced by Jacquin.' Virtually all current approaches are direct descendants of his work, and share many of the same strengths and drawbacks.
A significant amount of research has been devoted to reducing the time required for fractal image compression. The attempts have generally fallen into a few categories, outlined here.
Limiting the Domain Pool
Many researchers have searched only a limited portion of the domain, either by searching only the domain blocks in the vicinity of a given range block or by examining only a periodic sampling of all possible domain blocks. Unfortunately, such measures of necessity lower the compression quality somewhat; a broad domain pooi is more likely to have a close match for a given range block than a narrower one. In practice, the redundancy in the domain pool is generally adequate to give reasonable results even if many blocks are eliminated. (If the domain pool is limited to those blocks within the vicinity of the range block being considered, the complexity of the algorithm may be reduced to 0(n) with respect to the image area; however, it is still slow when compared with other image compression techniques, as many blocks are still searched.)
Classifying Blocks
Another common approach is classifying range and domain blocks according to their features and searching only blocks with similar features. Jacquin, in his original system,' used a simple three way classification: blocks were either shade (constant-intensity) blocks, midrange blocks (no sharp contrasts), or edge blocks. Other researchers have used more complex classification schemes.
Algorithmic Improvements
Some minor algorithmic improvements have been suggested; however, none change the basic complexity of the algorithm. One such improvement is quadtree recomposition (QR)4 instead of building a quadtree from the root down, as in QD, one is built from the leaves up. This tends to reduce wasted work, as most images have more leaf (minimum-sized) blocks than root blocks.
Parallelization
The transformations of individual blocks are independent of the others; thus, the algorithm may be easily parallelized. Both software parallel approaches, operating on multiprocessing systems, and hardware parallel approaches, generally ASICs, have been described in the literature. The design discussed in this paper is likewise based on the parallel nature of the algorithm. They distributed the domain blocks to various processors of the parallel computer and then circulated range blocks in a circular pipeline through the processors. If a good enough match between a range block and a domain block was found, the transformation was recorded and a new range block swapped into the pipeline for the old one.
An obvious approach would have been to distribute all the domain blocks to each node in the machine and not circulate the range blocks. This was not done in their case because of memory constraints of the parallel processors.
Ancarani, De Gloria, Olivieri, and Stazzone's ASIC architecture
Ancarani et a15 designed an ASIC which operated over a PCI bus in a PC. Their device only worked with 8 x 8 range blocks and further used the mean absolute difference (MAD) metric rather than the more common mean square error (MSE) metric used here. (It is generally accepted that the MAD metric, while much simpler to compute, gives poorer results than the MSE metric.) Their design also worked with only one range block and one domain block at a time, although all eight isometries were compared in parallel.
Several of the design tradeoffs made in this design are opposite to those made here. Speed of operation was paramount to their design, even at the expense of parallelism. One pairing required thirty-two clock cycles to compare (neglecting overheads which amortize to zero); on the design presented here, each pairing of 8 x 8 blocks requires approximately 800 clock cycles. At equal clock speeds, the design presented here would need to have twenty-five pipeline units to equal the speed of their design.
3.4.3. Acken, Irwin, and Owen's ASIC architecture Ackens et a16 developed a unique ASIC architecture for fractal image compression. Their system uses a "smart memory" approach: small computation elements are superimposed on a memory array, which contains pieces of the image in two planes. On one of these planes, the image data remains stationary, forming the range blocks; on the other, image data is shifted through the ASIC, forming the domain blocks.
Processing elements in the ASIC are of two basic types. One type, directly connected to the memory, computes the cross products needed. The other type, connected in a tree structure, performs the actual correlation tests and selects the best mappings. Three levels of a quadtree dissection are performed in parallel due to the tree structure of the processors.
The processing elements used, although carefully optimized for the operations of fractal image compression, are fairly general and likely could, with different microprogramming, perform other image manipulations.
SYSTEM ARCHITECTURE
The core of the architecture presented here is a pipeline of special-purpose processing elements. Each of these elements maintains a single range block and compares it with a stream of domain blocks which are fed through the pipeline. Figure 1 gives an overview of this design. (The arrows in this figure show datafiow only; control signals are not illustrated.) A complete description of the design may be found in Erickson.7 Blocks of three different types make up the top level of this design: the memory interface unit, the host interface unit, and a number of pipeline units.
Host Interface Unit
This design is intended to be used in conjunction with a general-purpose host processor. The host interface unit is a fairly simple piece which provides communication between the rest of the ASIC and the host processor; it allows for the setup and control of the hardware, and the reading of the results of comparisons.
Main Memory Interface Unit
The memory interface unit performs several operations necessary to interface the pipeline to the image data, stored in an external memory. It is divided into three subcomponents: the memory interface and cache chunk, the block addressing chunk, and the denominator computation chunk. This separation was made for two reasons: it separates the parts of the design which are dependent on the precise nature of the external memory from the rest, and it simplifies the logic considerably when compared with a single massive entity. During synthesis, these three subcomponents are ungrouped and optimized into a single component; thus this separation is primarily a design convenience. 
Memory interface and cache chunk
This chunk is responsible for the details of arbitrating for and accessing the external memory bus. In addition, any desired cache is included here. The current implementation is largely an example placeholder; it implements a simplistic sixteen bit bus and a two word cache. (Although miniscule, this size cache does reduce the memory bandwidth required considerably over having no cache; it prevents repeated accesses to the same location for adjacent bytes of image data.)
This chunk, and perhaps small portions of the host interface unit, are the only sections which must be modified for a new application environment.
Block addressing chunk
The block addressing chunk extracts square blocks from a linear address space. Additionally, when reading domain blocks, it averages four neighboring pixels.
The block extraction is controlled by two inputs: a starting address and the length of a row of the image. It is assumed that image data is stored in memory using a row-major ordering, starting at the upper-left corner. (Other orientations of two-dimensional arrays are usable, although some of the inputs and outputs must be interpreted slightly differently.) The denominator computation chunk performs two largely unrelated operations: it provides high-level addressing of blocks, scanning all the possible domain blocks in the image; and it forms the various signals required for the pipeline units, primarily the denominator value D = y2 -(> y)2. These two operations were included in the one chunk because they were both fairly simple, making the overhead of maintaining two separate parts unattractive.
Pipeline Units
The pipeline units each consist of several subcomponents: a range block chunk, eight multiply-accumulate (MAC) chunks, and a parameter computation chunk. During comparisons, the pipeline units have a latency of eight clock cycles.
Range block chunk
The range block chunk consists of a small memory array, sufficient to hold a single range block, and some associated addressing and control logic. The actual memory block is instantiated from a separate entity; this allows an architecture-specific memory array design to be easily used. (Many VHDL synthesis tools are inefficient at synthesizing significant amounts of memory.) This memory block is just large enough to hold the largest permitted range block; in the current design, this is 32 x 32 pixels, or a total of 1,024 bytes.
The addressing circuitry is based around a row counter and a column counter. By applying various combinations of these and their inverses, the eight possible isometries may be obtained. Table 4 Table 1 . Possible isometries for mapping one square block to another, along with the identifying codes used by the design.
MAC chunks
The MAC chunks each perform unsigned eight-bit by eight-bit multiplication and twenty-six bit accumulation. They employ a shift and add algorithm, requiring eight clock cycles for each multiplication and accumulation. One operand shift register, that corresponding to the domain data, is implemented as a distributed shift register, with only a single bit slice per MAC chunk. This is possible because the domain data is identical for all the MAC chunks in the pipeline.
The MAC chunks are equipped with a double-buffered output, allowing the parameter computation chunk to operate in parallel with them (on successive block pairings).
Parameter computation chunk
The parameter computation chunk must determine if a particular block pairing is an improvement over previously examined pairings. To do this, it computes a value proportional to the mean square error (MSE) the transformation would produce, assuming a least squares linear regression grayscale transformation.
Least squares linear regression over a range block with m pixels to approximate a range block pixel x as a linear function of an averaged domain block pixel y is given by x = ay + b, where a=>> (4 and b=mxa>Y. (5) The MSE is given by MSE= [x-(ay+b)]2 (6) Algebraic manipulations lead to
Substituting the formulae for b and a into this and reducing yields MSE -::
In examining this equation, it is evident that the factor is constant for a given block size and may therefore be ignored in determining the best pairing. Further, x2 is constant for a given range block and also may be ignored. The denominator D = n > y2 -(> y)2 is constant for a given domain block; it is computed by the denominator computation chunk in the main memory interface. The computations left for the parameter computation chunk are determining the value of R, given by (9) and then determining if D8R < R8D, where D and R5 are saved values for D and R from the previous best match. If this inequality holds, and if the value for a given by (4) is in the range -1 < a < 1, then D8 and R8 are updated and the specifics identifying this mapping are stored. (These specifics are the starting address of the block and the isometry code.) D8 and R5 are initialized to 0 and 1 respectively, ensuring that the first block which produces a legal a value will be stored. (In the design, all intensity levels are treated as unsigned quantities, which causes D to always be nonnegative.)
A datafiow diagram of the computation is shown in Figure 2 . The computations are performed using a bit-serial approach. The unsigned multipliers are constructed using a shift-add design (or by a shift-subtract design when immediately followed by a negation). The signed multipliers use a radix-2 Booth multiplier. The comparators and adders are implemented in an essentially combinational design and do not add latency to the data flow. (They are not entirely combinational, as some state is stored between bits-for instance, the adders maintain a carry bit.)
In the current design, all arithmetical operations are carried out to full precision. This is almost certainly unnecessary; however, determining the amount of unnecessary precision is difficult, especially in light of the various block sizes and the precision-reducing subtractions. (If too many bits were eliminated, the results for small sized blocks would be severely degraded, even if larger blocks suffered no great problems.)
In addition to performing the computations described above, the parameter computation chunk also buffers the D, y, and > y2 values for the next pipeline stage.
Operational Description
Once some key image parameters are set, such as the base address and the length of a row, operation of the device proceeds in two stages. First, the range blocks are loaded from memory into the various pipeline stages. Second, the domain blocks are sent through the pipeline and the comparisons are performed. Loading each range block is performed by selecting the desired pipeline unit with a device register, setting the starting address of the block, and activating the device. The actual data is loaded via direct memory access (DMA) and block-related constants necessary for the comparisons (> x and ( x)2) are computed and stored.
Once all the range blocks have been loaded, the device is put into comparison mode and started. Further interaction with the host processor is unnecessary until the comparisons have all finished; the blocks are processed automatically. The pixel values (or, more precisely, the averaged pixel values) of the domain blocks are moved through the MAC chunks in the pipeline units at a rate of one bit per clock cycle (one byte every eight clock cycles). This rate allows the range block chunk to select the eight pixels for the eight isometries without requiring more than one read port. This also allows simple shift-add multipliers in the MAC units.
Simultaneously with this, the denominator computation chunk of the main memory interface unit computes the sum of the averaged range pixel values and the sum of their squares. Once all the pixels in the domain block have been processed, the cross-products, the sum of the pixel values, and the sum of their squares are transferred to holding buffers and processing on the next of the domain blocks is started.
Concurrently with the summations and multiplications, the denominator computation chunk of the main memory interface unit computes the denominator value of y2 -( y)2 and then the parameter computation chunks in the pipeline units compute the MSE values for the eight isometries of the previous block pairing. If the MSE value for an isometry is lower than the previously stored best MSE value and the multiplicative factor for a least-squares linear regression between the two has an absolute value not greater than one, the pairing is stored as the new best pairing. All eight isometries are compared; each comparison requires about one hundred clock cycles.
Through the many pipeline units, and through the separation of the MSE computation and the summations, the device achieves a large degree of parallelism which gives it a substantial speed advantage over sequential processors.
SIMULATION RESULTS
Testing of the design was performed through VHDL simulation. Once the basic operation of the design was verified, tests on small sample images were performed and compared with results obtained using a C implementation of the same algorithm. Larger images were also tested with the C program, but were not tested directly on the hardware design simulation; this is because simulating the hardware design is an extremely slow process.
The C program was primarily written to test the algorithms used in the compression process; it allows several options relating to quadtree dissection and the valid range of various parameters. The actual compression ratios (c) Figure 3 . Images used for testing: (a) is the input image, (b) is the output generated using 8 x 8 blocks, and (c) is the output generated using 4 x 4 blocks.
this is primarily because of inefficient encoding of the transformations. A less wasteful encoding would at least double the compression ratio without any noticeable degradation of the resulting images. (This was not done because the focus of this research was the development of an ASIC architecture, not improvements in the encoding of fractal transforms.) Figure 5 (a) shows an 80 x 56 sample test image, which was compressed using both the C program and the simulated hardware. Figure 5 (b) shows the output after compression; in this case, 8 x 8 blocks were used. The transforms found by the software and the hardware simulation were identical in every case; hence, the output images generated were also identical. The hardware simulation (made assuming eighty pipeline units) required approximately 16,000 clock cycles to load the eighty range blocks and 2,150,000 clock cycles to perform the block comparisons; at a clock speed of 100 MHz, this is about 21.7 ms of real time. The C program required 1.65 seconds of CPU time to execute on a 532 MHz Alpha 21164A system; under these conditions, the hardware would provide a speedup of about 78 times. It is estimated that, for a 256 x 256 image, the hardware running at 100 MHz would require about 487 ms to do compression with 8 x 8 blocks, assuming an adequate number of pipeline units are available; the software, running on the same system as before, requires about 560 seconds to do this same task. This is a speedup of about 1150
times.
The image quality in Figure 5 (b) is poor because of the relatively large block size used. The pool of domain blocks is fairly small, given the small image size, and the details relatively fine; this makes good matches difficult to find. Figure 5(c) shows the same image compressed (by the C program) using 4 x 4 blocks.
SYNTHESIS RESULTS
This design was synthesized to the gate level using the Synopsys Design Compiler software. The synthesis used the LSI 10K target library and assumed a device with four pipeline units. The blocks of memory for storing the range blocks were synthesized from a placeholder file, which instantiated a number of Synopsys pre-built register file designs. This approach allowed the entire design to be synthesized; however, the logic produced for these memory blocks is extremely inefficient. 
CONCLUSIONS AND AREAS FOR FURTHER DEVELOPMENT
The architecture presented here can provide a substantial speedup for fractal image compression over a general purpose processor by exploiting the parallelism inherent in the problem. Other ASIC designs have produced similar speed gains; however, the design presented here provides greater flexibility than other published designs, both in its ability to operate with arbitrary block sizes and in its easy retargetability for different device architectures and applications. This flexibility comes at the cost of chip area and of operational speed, both in terms of the clock rate and the number of clock cycles required for a compression.
Several areas remain where further investigation would be fruitful. First, the current implementation of the parameter computation circuitry performs all operations at full precision. Reducing the precision of certain operations would lead to faster operation, especially at the smaller block sizes where the parameter computation time is the limiting factor on the speed of operation; and to lower area requirements.
(A further refinement of this idea is dynamically changing the precision based on the block size. As the design uses a bit-serial approach to computation, the precision computed can be adjusted by the number of clock cycles allowed for the computations.)
Second, the system is a somewhat inflexible in addressing domain blocks; it only supports a complete search of a rectangular area of the image. Adding more addressing flexibility to this, such as the ability to skip space in between blocks, would allow the user to trade off compression time and compression quality.
Third, allowing the maximum block size to be configurable would further increase the flexibility of the design. This is not as simple as it first appears, as many parts of the design depend directly on this value; in particular, the maximum widths of the various quantities in the computation datapaths are determined by the block size. For most applications, a maximum block size of 16 x 16, 12 x 12, or 8 x 8 may be more suitable than the 32 x 32 used in the current design; this is especially true if the memory required for the range blocks takes up a significant portion of the chip area. The synthesis results presented indicate that this is the case; however, the memory used is synthesized from individual flip-flops, rather than much smaller SRAM cells. A specialized, architecture-specific memory block design should be substantially smaller.
