Set Partitioning in Hierarchical Trees (SPIHT) is a highly efficient technique for compressing Discrete Wavelet Transform (DWT) decomposed images. Though its compression efficiency is slightly lower than that of Embedded Block Coding with Optimized Truncation (EBCOT), which is adopted by JPEG2000, SPIHT has a straightforward coding procedure and requires no tables. These properties make SPIHT a more appropriate algorithm for low-cost hardware implementation. In this paper, a modified SPIHT algorithm is presented. The modifications include a simplification of the coefficient scanning process, a 1-D addressing method instead of the original 2-D arrangement of wavelet coefficients, and a fixed memory allocation for the data lists instead of the dynamic allocation required in the original SPIHT. Although the distortion is slightly increased, the modified algorithm facilitates an extremely fast throughput and easier hardware implementation. The VLSI implementation demonstrates that the proposed design can encode a CIF (352 × 288) 4:2:0 image sequence at no less than 30 frames per second at a 100-MHz working frequency.
key words: discrete wavelet transform (DWT), set partitioning in hierarchical trees (SPIHT), image coding
Introduction
The Discrete Wavelet Transform (DWT) has been widely used for digital image compression [1], [2]. The bi-orthogonal (5,3) and (9,7) filters were chosen as the standard filters in the JPEG 2000 codec standard [3], [4]. Since the DWT was introduced, several coding algorithms have been proposed to compress the transform coefficients as much as possible. Among them, Embedded Zerotree Wavelet (EZW), Set Partitioning In Hierarchical Trees (SPIHT) and Embedded Block Coding with Optimized Truncation (EBCOT) are the most famous [5], [6]. In [6], if no entropy coding or arithmetic coding methods are incorporated, coding tables are not required, at the cost of a slight loss in compression ratio. Moreover, SPIHT can easily be used for either fixed bit rate or variable bit rate applications, and it is also very suitable for progressive transmission [5]. Furthermore, SPIHT has about 0.6 dB peak signal-to-noise ratio (PSNR) gain over EZW [7] and is very close to EBCOT in many circumstances [6].
While EBCOT has the best compression rate of all, it requires complex multi-layer coding procedures, multiple coding tables and arithmetic coding techniques, so its hardware implementation is more difficult and expensive [10]. In contrast, SPIHT applies a much simpler coding procedure and needs no coding table. An implementation of SPIHT is therefore much cheaper and well suited to still image compression appliances. Moreover, SPIHT-based encoding has also been applied to SOT-based audio compression [15]-[17]. There are still some interesting open issues in SPIHT-based algorithms and applications. For example, rate-distortion performance is always an important issue in data compression. Much work has focused on maximizing the compression rate while minimizing distortion under a limited data rate. Achieving a progressive bitstream with a smooth rate-distortion curve has received more attention over the past decade for various multimedia applications over the Internet. Scalability is easily achieved by SPIHT-based methods, such as image, video, or audio coders [23]-[25]. However, implementing SPIHT on a silicon chip still encounters some difficulties. First, the wavelet coefficients are addressed in a 2-D manner, and the search for the descendants of a given coefficient has to be performed very frequently. Second, frequent transactions are necessary among the three lists that store the coefficient coordinates: the List of Insignificant Sets (LIS), the List of Insignificant Pixels (LIP) and the List of Significant Pixels (LSP). Third, linked lists are suitable for implementing the three lists, because insertion and deletion operations are needed to update them. Yet linked lists require more memory space [14] and make search, insertion and deletion more time-consuming.
Therefore, for VLSI realization, it is necessary to develop fast hardwired circuit solutions to speed up the time-consuming computations mentioned above.
Based on the above observations, a modified SPIHT suitable for hardware implementation is proposed. With a special arrangement, the coefficients can be stored in a 1-D memory array, and the search for the descendants of each coefficient becomes very simple. Besides, instead of dynamic linked lists, fixed-size tables are used to carry out the operations on the three lists. Transactions among the lists require no storage of coordinates; only the special-purpose flags stored in the tables are updated. Inspecting the content of the three lists also becomes very convenient. In summary, the modified SPIHT makes the overall coding procedure simpler and more efficient. The only disadvantage is that the average PSNR occasionally decreases by about 0.2 dB. The VLSI implementation of the modified SPIHT can encode a CIF (352 × 288) 4:2:0 image sequence at no less than 30 frames per second at a 100-MHz working frequency. Using TSMC 0.35-µm technology, the overall gate count is merely 2950, and the internal memory for storing the three lists and the wavelet coefficients is 4 kb. Since SPIHT can be applied not only to image compression but also to audio and video compression [23]-[25], it is possible to handle both video and audio bit-streams in a single hardware module with the architecture proposed in this paper.
This paper is organized as follows. The modified SPIHT is presented in Sect. 2. VLSI implementation of the modified SPIHT is discussed in Sect. 3. Finally, the conclusion is shown in Sect. 4.
Modified SPIHT

Motivation
The original SPIHT was presented in [5] by A. Said and W.A. Pearlman. Searching for the descendants of a specific coefficient takes many operations. For dedicated hardware, it is necessary to store the 2-D data in a 1-D array. Without loss of generality, a 16-by-16 DWT-decomposed image with 4 resolution levels is used as an example. There are 256 coefficients to be encoded.
The above operations are not complicated at all. However, it is not efficient to implement the dedicated SPIHT hardware by using the rule. One can simplify the design by altering the addressing method. In 16-by-16 cases, 256 addresses are needed. One can use 8-bit symbols to represent 2-D addresses. In Fig. 2 (left) , the upper nibble is for y-axis (height) and the lower nibble is for x-axis (width).
A new addressing scheme is used, as shown in Fig. 2 (right). For instance, the coefficient originally at position-071 (b'01000111) is stored in position-053 (b'00110101). This notation yields the physical addresses in Fig. 3. Because the search for the descendants of a given coefficient has to be performed very frequently, it is necessary to design a dedicated circuit to compute the addresses, which also consumes more clock cycles. In Fig. 3, the numbers in brackets represent the original addresses of the wavelet coefficients, which are also shown in Fig. 1, and the numbers not in brackets indicate the new storage addresses. The thick lines are used to divide the sub-bands like the thick lines in Fig
Thus, the direct offspring of a specific coefficient are stored in consecutive addresses. This makes the memory-read operation more efficient and can also reduce the switching frequency of the address bus. The address generation circuit is reduced to a 2-bit shifter and an increment-by-one operation. Hence, a new coding algorithm, which simplifies the hardware design without losing much coding efficiency, is derived and explained in the next section. Most image compression standards, e.g., JPEG2000, suggest that images should be divided into code blocks whose size is limited to powers of two, with the minimum size being $2^2$ and the maximum being $2^{10}$ [22]. Besides, the width and height of a code block are usually the same in a dedicated and efficient hardware design. Under these conditions the rule in Eqs. (5)-(8) always holds, as proved in Appendix A.
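The address mapping can be illustrated with a short sketch. It assumes that the new 1-D address is formed by interleaving the bits of the 2-D address as y3 x3 y2 x2 y1 x1 y0 x0, which is inferred from the 071 to 053 example in the text rather than taken from Eqs. (5)-(8). The function names are hypothetical, and any special handling of the tree roots is not modeled.

```python
def interleave_address(addr_2d: int, bits: int = 4) -> int:
    """Map a 2-D address (y in the upper bits, x in the lower bits) to a
    1-D address by interleaving the bits as y3 x3 y2 x2 ... y0 x0.
    The bit order is an assumption inferred from the 071 -> 053 example."""
    mask = (1 << bits) - 1
    y = (addr_2d >> bits) & mask
    x = addr_2d & mask
    new_addr = 0
    for k in range(bits):
        new_addr |= ((x >> k) & 1) << (2 * k)      # x bit goes to an even position
        new_addr |= ((y >> k) & 1) << (2 * k + 1)  # y bit goes to an odd position
    return new_addr


def offspring_addresses(addr_1d: int) -> list[int]:
    """Four consecutive offspring addresses: a 2-bit left shift,
    followed by increment-by-one steps."""
    first = addr_1d << 2
    return [first + k for k in range(4)]


print(interleave_address(0b01000111))  # 53
print(offspring_addresses(5))          # [20, 21, 22, 23]
```

Under this assumed mapping, the node at (x, y) = (1, 1) receives address 3, and its offspring at (2, 2) through (3, 3) receive addresses 12 to 15, so the four offspring of a node occupy one aligned group of consecutive addresses.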
Modified SPIHT Algorithm
Most definitions of the lists and symbols are identical to those of the original SPIHT, except that 1-D addresses are used instead of 2-D addresses and three definitions, MaxLIP, MaxLIS and MaxLSP, are added. They are as follows:
• O(i): Set of coordinates of all offspring of node i.
• D(i): Set of coordinates of all descendants of node i.
• H: Set of coordinates of all spatial orientation tree roots.
• L(i): D(i) − O(i).
• c_i: The magnitude of the coefficient of node i.
• MaxLIP: The maximal address of all nodes in the LIP list.
• MaxLIS: The maximal address of all nodes in the LIS list.
• MaxLSP: The maximal address of all nodes in the LSP list.
• T : All nodes in the image.
• N: The address of the last node in the image.
• A set L(i) or D(i) is said to be significant if any coefficient in the set has a magnitude greater than the threshold, such as
where G, G ⊆ T , is a set of nodes.
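The significance test on a set of nodes can be sketched as follows. This is a minimal illustration that assumes the convention of the original SPIHT, where a set is significant at bit plane n if its largest coefficient magnitude is at least 2^n. The function name and the sample coefficients are hypothetical.

```python
def significance(coeffs, nodes, n):
    """Return 1 if any coefficient addressed by `nodes` has a magnitude of
    at least 2**n, else 0. The >= 2**n convention is an assumption taken
    from the original SPIHT formulation."""
    threshold = 1 << n
    return int(any(abs(coeffs[i]) >= threshold for i in nodes))


coeffs = [34, -9, 3, 0, -17, 5, 1, 2]      # illustrative values only
print(significance(coeffs, [0, 1], 5))     # 1, because |34| >= 32
print(significance(coeffs, [1, 2, 3], 4))  # 0, because the largest magnitude is 9 < 16
```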
Similar to the SPIHT algorithm, four encoding steps (initialization, sorting pass, refinement pass and quantization pass) are performed, and three lists, LIS, LSP and LIP, are used in the proposed SPIHT. Its pseudo-code is described as follows.
1. Initialization step:
• Set LIP(i) = 1, ∀i ∈ H, otherwise set to 0;
• Set LIS(i) = A, ∀i ∈ H with descendants, otherwise set to 0;
• MaxLSP = 0;
2. Refinement pass:
• Set LSP(i) = 1;
3. Sorting pass:
• Decrement n by one and go to Step 2.
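The lists used in these steps are kept as fixed-size flag tables rather than linked lists. The sketch below is one possible way to picture that bookkeeping. It assumes the LSP flag values given in this section (0 for insignificant, 2 for newly significant, 1 after the refinement pass); the class and method names are hypothetical, and the table size is left as a parameter.

```python
class ListTables:
    """Fixed-size flag tables standing in for the LIP, LIS and LSP lists.
    Names and layout are illustrative assumptions, not the paper's circuit."""

    def __init__(self, size):
        self.lip = [0] * size  # 1 = node is in the LIP
        self.lis = [0] * size  # 0 = not in the LIS, "A" or "B" = entry type
        self.lsp = [0] * size  # 0 = insignificant, 2 = newly significant, 1 = refined
        self.max_lsp = 0       # largest address flagged in the LSP so far

    def mark_significant(self, i):
        """Move node i from the LIP to the LSP by updating flags only."""
        self.lip[i] = 0
        self.lsp[i] = 2
        self.max_lsp = max(self.max_lsp, i)

    def finish_refinement(self):
        """Turn newly significant nodes (flag 2) into ordinary LSP entries (flag 1)."""
        for i in range(self.max_lsp + 1):
            if self.lsp[i] == 2:
                self.lsp[i] = 1


tables = ListTables(size=16)
tables.lip[3] = 1
tables.mark_significant(3)
print(tables.lsp[3], tables.lip[3], tables.max_lsp)  # 2 0 3
tables.finish_refinement()
print(tables.lsp[3])  # 1
```

No coordinates are stored or moved in this sketch. A list transaction is a flag write at a known address, and the MaxLSP bound limits how far a scan has to run.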
In both SPIHT algorithms, three lists, LIS, LIP and LSP, have to be constructed. When an element in the LIS changes its type, transactions among the lists are necessary, and insertion and deletion operations on the lists are required, too. In the original SPIHT algorithm, dynamic linked lists have to be employed to construct the three lists. It is easy to construct dynamic linked lists with high-level programming languages because a memory management mechanism is available in computer operating systems. However, designing a dedicated low-cost circuit for this purpose is not so straightforward, and these operations also reduce the throughput. With the proposed approach, no dynamic linked list is necessary. If N is the total number of coefficients, then under the proposed addressing method the addresses of the coefficients that have descendants run from 0 to (N/4) − 1. Tables with N/4 entries are used for the lists. They are LIS(i), LIP(i) and LSP(i), for i = 0, ..., (N/4) − 1. LIS(i) is set to zero when node i has no descendant or has not joined the encoding process yet. Otherwise, it marks the coefficient at position i as either type A or type B. Similar definitions are used for LIP(i) and can be found in the algorithm. LSP(i) is set to 0 when node i is an insignificant pixel. LSP(i) is set to 2 when node i becomes a significant pixel for the first time, and it is then set to 1 after node i passes the refinement pass. One encoding pass is finished by scanning the tables sequentially. Simple counters and finite state machines (FSMs) are enough for the implementation.

The test images are 8 bpp grey-scale images. The bi-orthogonal (9,7) filter and a 5-level DWT are used [2]. The other pre-processing follows the suggestions in JPEG2000 [4]. The distortion measure is calculated by Eq. (10).
$$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\mathrm{MSE}} \qquad (10)$$
where MSE denotes the mean-square error between the original image and the decoded image. No entropy coding is employed. Figure 4 presents the rate-distortion comparison among three algorithms: the original SPIHT [5], the modified SPIHT [11] and the proposed SPIHT. The solid blue lines show the rate vs. distortion performance of SPIHT without arithmetic coding. The solid red lines show the performance of the proposed SPIHT without arithmetic coding. The dash-dot blue lines show the performance of the modified SPIHT, which is an earlier version of the proposed SPIHT.
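For reference, the PSNR measure of Eq. (10) can be evaluated directly from pixel values. The sketch below is a minimal version for 8 bpp images stored as flat sequences. The function name and the sample pixel values are hypothetical, and returning infinity for identical images is a convention chosen here.

```python
import math


def psnr(original, decoded, peak=255):
    """PSNR in dB between two equally sized pixel sequences (8 bpp by default)."""
    if len(original) != len(decoded) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10 * math.log10(peak ** 2 / mse)


# Every pixel is off by one, so MSE = 1 and PSNR = 10 * log10(255**2).
print(round(psnr([10, 20, 30, 40], [11, 19, 31, 39]), 2))  # 48.13
```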
All decoded images were recovered from a single fidelity embedded encoded file, truncated at the desired rate. Clearly, the PSNR of the proposed SPIHT is much higher than that of the modified SPIHT at all rates. Now compare the performance of the original SPIHT and the proposed SPIHT. At rates below 1.0 bpp, the difference falls within 0.2 dB, and it becomes negligible at lower rates. Between 1.0 bpp and 1.8 bpp, the difference increases but stays within 0.3 dB. Since most applications demand higher compression ratios (more than 16:1), the proposed algorithm should be acceptable. Because linked lists are used in the original SPIHT, the nodes that turn significant earlier are placed near the heads of the lists. These nodes are more important than the nodes that turn significant later, so the bits for the earlier nodes are also output first. The new algorithm has no such scheme. Consequently, the original SPIHT tends to have better PSNR performance than the proposed algorithm. Fortunately, the PSNR performance of the new algorithm does not decrease much. No arithmetic coding was applied to the significance test for these results. Back-end arithmetic coding using contexts and joint encoding generally improves SPIHT by about 0.5 dB [14], and a similar improvement may be expected for the proposed SPIHT.

Although the proposed algorithm targets low-cost ASIC implementation, an improvement in speed on a general-purpose computer can also be expected. Excluding the time spent on I/O and the 5-level DWT, the CPU times of the two algorithms for encoding LENA at different rates are shown in Table 1. The results are averages over 1000 executions of the programs. The programs run on a Pentium IV-2G with 512 MB of memory. The new method is over 6 times faster than SPIHT.
The Proposed Architecture
In this paper, 16 × 16 image blocks are used as a design example. Implementations for different sizes can be extended easily. The architecture of the encoding engine includes a core module, a system controller, two bus selectors, two address translators and two internal memory buffers. This is shown in Fig. 5.

Table 2  Memory map.
Address      Usage
x000-x0ff    The Wavelet Coefficient
x100-x1ff    The Output Bit-Stream
x200         Bit-Stream Length
x201         Encode Stop Level and Encode Point
The Proposed Encoding Engine
The proposed encoding engine is treated as a peripheral of a CPU. Memory-mapped I/O is adopted in this architecture. The engine and the memory share the memory map displayed in Table 2. The first 256 addresses are for the wavelet coefficients. The output bit-stream is stored in the second local memory, which occupies the next 256 addresses. The bit-stream length, the Stop Level and the Stop Point are also recorded. The Stop Level is the bit plane at which the encoding process stops. Similarly, the Stop Point is the coordinate at which the encoding process stops.
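From the CPU side, the memory map of Table 2 can be pictured as a simple address decoder. This sketch assumes that the "x" prefix in Table 2 denotes hexadecimal; the function name and region labels are hypothetical and are not part of the design.

```python
def decode_address(addr):
    """Map a bus address to a (region, offset) pair following Table 2.
    Region names are illustrative; hexadecimal addresses are an assumption."""
    if 0x000 <= addr <= 0x0FF:
        return ("wavelet_coefficient", addr)       # first local memory
    if 0x100 <= addr <= 0x1FF:
        return ("output_bitstream", addr - 0x100)  # second local memory
    if addr == 0x200:
        return ("bitstream_length", 0)
    if addr == 0x201:
        return ("stop_level_and_point", 0)
    raise ValueError(f"address {addr:#05x} is outside the memory map")


print(decode_address(0x035))  # ('wavelet_coefficient', 53)
print(decode_address(0x1FF))  # ('output_bitstream', 255)
print(decode_address(0x201))  # ('stop_level_and_point', 0)
```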
After the bi-orthogonal (9,7) or (5,3) DWT decomposition is performed, the coefficients are carried into the local memory of the encoding engine. The coefficient of each node should be stored using the format shown in Fig. 6. The first four bits record the list transactions that happened in the coding passes. The remaining bits indicate the results after each coding pass. Based on the proposed algorithm, the coefficients should be stored as in Fig. 3. The address is translated using the translator shown in Fig. 7.
The Encode Controller module handles the messages from the CPU. It has two states: in the first state, it waits for the transfer of the wavelet coefficients or the bit-stream to complete, and the second state indicates that the engine is in the encoding process.
The source of the data for the first local memory is selected using the circuit in Fig. 8. Likewise, there are three sources for the data output bus, and the selection is done by using the circuit in Fig. 9.
The Encode CORE module, which consists of a finite state machine, is the major processing unit in Fig. 5. It controls the dataflow of the modified SPIHT algorithm and generates the final bit-stream. The finite state machine of the Encode CORE module implements the algorithm in Sect. 2.2, and its state diagram is shown in Fig. 10. The operation of each state is explained in Appendix B. It outputs the bit-stream bit by bit to the Encode Controller module. The bit-stream is carried into the

Fig. 11 The process of the wavelet-based compression system.

(9,7) 3-level DWT are utilized. The process of the wavelet-based compression system is shown in Fig. 11. The average number of clock cycles required over all 16 × 16 blocks is shown in Table 3, and the average PSNR for each Stop Level is shown in Table 4. For example, when the Stop Level is 0, 5530 clock cycles are required for a 16 × 16 block, and the average PSNR is 49.69 dB. If the required quality is lower, the encoding speed is much higher.
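The cycle count quoted for Stop Level 0 can be related to the CIF frame-rate target with a short calculation. This is a back-of-the-envelope sketch that assumes blocks are encoded back to back at 100 MHz with no transfer overhead, and that a 4:2:0 frame carries 1.5 samples per pixel. The function names are hypothetical.

```python
def pixels_per_second(clock_hz, cycles_per_block, block_pixels=16 * 16):
    """Throughput when every block costs `cycles_per_block` clock cycles."""
    return clock_hz / cycles_per_block * block_pixels


def cif_420_sample_rate(fps, width=352, height=288):
    """Samples per second for 4:2:0 video: one luma plane plus two
    quarter-size chroma planes, i.e. 1.5 samples per pixel."""
    return width * height * 1.5 * fps


throughput = pixels_per_second(100e6, 5530)
required = cif_420_sample_rate(30)
print(round(throughput))       # 4629295
print(round(required))         # 4561920
print(throughput >= required)  # True
```

Under these assumptions, the 5530-cycle figure corresponds to a little over 4.6 million pixels per second, which is just above the rate needed for CIF 4:2:0 at 30 frames per second.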
The PSNR performance for the 'Lena' image is shown in Table 5, and it is identical to the results obtained from our previous software implementation. Rate control is employed to select the appropriate data rate for each coding block to obtain better rate-distortion performance [9]. To improve rate-distortion performance further, the block size can be increased, and the implementation can be extended from the current design in a very straightforward way. The design passes both pre-simulation and post-simulation tests. Our system was designed in Verilog-HDL and simulated for debugging purposes with Verilog-XL. Design Analyzer from Synopsys was used to compile the Verilog-HDL code and generate a netlist. The Apollo tool was used to place and route the design. The specification of the coding engine is shown in Table 6. This design includes 6 modules: an Encode CORE module, an Encode Controller module, two selectors, and two address translators. In addition, two static RAMs are used in this implementation. The synthesis result of the coding engine is shown in Table 7, where the address translators and selectors are synthesized together with the Encode Controller. The functions of the two RAMs are presented in Table 8. The photograph is shown in Fig. 12. Moreover, only the address translators and RAMs need to be changed to handle a different block size.
The proposed design has to be combined with a DWT to become a complete image coder. Bi-orthogonal (9,7) DWT decomposition with lifting is used [4]. A reference design can be found in [11]. The design has also been completely verified on an Altera APEX PCI development kit. This board carries an Altera APEX EP20K1000E FPGA. The Altera APEX 20K FPGA allows for 1.7 million gate designs [20]. Only 2% of the logic elements and 4% of the RAM bits are required for our modified SPIHT design. It can process over 4.6 million image pixels per second, depending on the required image quality.
Two hardware implementations of SPIHT-based algorithms are presented in [18], [19]. The SPIHT coder in [18] uses content-addressable memories to keep track of the order in which information about the wavelet is sent for each image. No optimizations or modifications were made to the algorithm to account for the fact that the design would run on a hardware platform as opposed to a software platform. The design was simulated on an 8 × 8 image for functional verification. Since the design was only simulated, no performance numbers were given. The SPIHT coder in [19] parallelizes the computation and is based upon fixed-order SPIHT, which was developed specifically for use within adaptive hardware. For an N × N image, fixed-order SPIHT may be calculated in $N^2/4$ cycles. However, their SPIHT design required 98% of the FPGA area of a Xilinx Virtex 2000E FPGA. The Xilinx Virtex 2000E FPGA allows for two million gate designs [21]. It is apparent that our system requires less FPGA area and may be a good option for low-cost applications.
Conclusion
An efficient VLSI implementation of the modified SPIHT encoder is presented. A new coefficient addressing method, fixed-size list tables and a straightforward coding procedure are employed, and a low-cost and simple hardware implementation is achieved. Though the distortion is slightly increased, it is hard to perceive the difference between images coded by the two algorithms, especially at lower rates. The design is implemented with TSMC 0.35-µm 2P4M technology. The area is 1.2 × 1.2 mm², and the simulated clock frequency is 100 MHz. It can process over 4.6 million image pixels per second, depending on the required image quality. We have combined it with a DWT module and an 8-bit microprocessor, and the design has been fully verified on an Altera APEX 20K FPGA board. In addition, rate control is easy with SPIHT-based methods and no look-up table is needed. In summary, the proposed design is attractive for embedded appliances.

(height) and the lower N nibble is for x-axis (width). The general form of physical address in a 1-D array is
Therefore, the general form of searching for descendants of the node by using original address description is shown as follows. 
Address of direct off-spring No
Therefore, the addressing rule of the direct offspring of a coefficient is modified as follows.

has been significant and update the threshold. Set the LSP flag of the current node to 1 when the LSP flag of the current node is 2.
7. The SortingPass0 state: If the current node is insignificant, output S_n(i). If the coefficient of the insignificant node is greater than the threshold, reset the LSP flag and the LIP flag of the current node and set MaxLSP to the address of the current node.
8. The SortingPass1 state: Output the sign of the current node.
9. The SortingPass2 state: Find the first and the last address of the offspring of the current node.
10. The CheckD state: Output the result when the current node is greater than the threshold or the search of all descendants is finished. Set the LIS flag to type B when the address of the current node is smaller than that of the last node; otherwise set the LIS flag to 0.
11. The SortingPass3 state: Output the result of comparing the current node with the threshold. Set the LSP flag to 2 and MaxLSP to the address of the current node when the coefficient is greater than the threshold; otherwise set the LSP flag to 1 and MaxLIP to the address of the current node.
12. The SortingPass4 state: Output the sign of the current node.
13. The SortingPass5 state: Find the first and the last address of the offspring of the current node.
14. The CheckL state: Output the result when the current node is greater than the threshold or the search of all descendants is finished. Set the LIS flag to 0 when the address of the current node is greater than that of the last node.
15. The SortingPass6 state: Set the LIS flag to type A and MaxLIS to the maximal address of the descendants of the current node.
16. The SortingPass7 state: Update the address for the next search for insignificant nodes.
New address of direct offspring No. 0:

$$\mathrm{NewAddr}(y_{N-2}, \ldots, y_0, 0, x_{N-2}, \ldots, x_0, 0) = 2^N \times \left(y_{N-2} \times 2^{N-1} + x_{N-2} \times 2^{N-2} + \cdots + x_{N/2-1} \times 2^0\right) + \left(y_{N/2-2} \times 2^{N-1} + x_{N/2-2} \times 2^{N-2} + \cdots + y_0 \times 2^3 + x_0 \times 2^2\right) = \qquad (A\cdot 7)$$
Yau-Hwang Kuo
