Photographic Experts Group (JPEG) coding standard for lossy compression. Although JPEG is a simple coding standard, its compression efficiency is very low compared to typical state-of-the-art image coding schemes such as set partitioning in hierarchical trees (SPIHT). In this paper, a novel state-table-based SPIHT (STS) algorithm and its field programmable gate array (FPGA) implementation are proposed. The STS uses two small state-tables and two extremely small lists. The STS not only provides better compression efficiency than the state-of-the-art JPEG 2000 at high bit rates but also requires very little memory to hold the state-tables and lists in comparison to SPIHT. On average, STS requires 0.86% of the memory needed by SPIHT when evaluated for image sizes ranging from 4 Mpixels to 40 Mpixels. The implementation results show that STS consumes much less FPGA area than SPIHT-based architectures. The dynamic power dissipation of STS is also lower than that of JPEG-like compression standards. This makes the proposed algorithm a better candidate for compression in low-power, low-memory digital image acquisition devices.
Introduction
The global camera industry has grown significantly in the last decade. The main focus has been on high-end digital single lens reflex (DSLR) cameras. The areas of continuous improvement in camera technology include resolution, compactness, multiple scene modes, storage, and others. The central goal of all improvements in sophisticated camera technology is to improve the quality of the captured scene. One issue limiting image quality in small digital cameras that has attracted less attention is image compression. Once the image is captured by the camera sensor, it is either compressed and then stored, or stored directly on a secure digital (SD) card. Many cameras have an option to store the captured images in uncompressed RAW format.
Some cameras support a lossless Tagged Image File Format (TIFF) compression standard. Joint Photographic Experts Group (JPEG) [1] is the widely used lossy compression standard in modern DSLR cameras provided by leading manufacturers like Canon and Nikon [2]. A comparison of JPEG with RAW was given in [3]. Despite providing low performance in comparison to the state-of-the-art compression standards, JPEG is widely used

* Correspondence: rafimiet@gmail.com
This work is licensed under a Creative Commons Attribution 4.0 International License.

These context and decision pairs are then encoded using an AC coder to generate a compressed bit stream. The tier 2 stage finally reorganizes the bit stream and places markers into it. The BPC has high computational complexity, as it has to process context-decision pairs at bit-level. On the other hand, the AC coder is highly sequential and restricts the throughput of the overall JPEG 2000 standard. Although many researchers have proposed modifications to the binary arithmetic encoder [7, 8], the resources utilized and the dynamic power dissipation are still high and the throughput achieved is low. High throughput is achievable in SPIHT. The complexity of SPIHT comes from the three lists that it uses to encode an image. Different variants of SPIHT have been proposed. The goals of these variants are to:
(i) Improve compression efficiency [9] [10] [11];
(ii) Reduce memory requirements for software and hardware implementations [12, 13];
(iii) Provide better throughput; and
(iv) Ensure less hardware utilization [14].
Furthermore, cameras are mostly based on microprocessors. Microprocessor-based algorithms provide low performance and consume more power in comparison to Application Specific Integrated Circuit (ASIC) implementations [15]. Although microprocessors are power-hungry devices, microprocessor-based implementation is preferred in modern cameras due to its reconfigurability, reusability, and low cost. On the other hand, field programmable gate arrays (FPGAs) provide performance very close to that of ASICs and flexibility very close to that of microprocessors [16].
No-list SPIHT (NLS) [17] has been proposed to reduce the memory of SPIHT. The hardware implementation of the algorithm was presented in [18]. The memory is reduced successfully in NLS; however, it provides a low throughput. A 4 × 4 bit-plane is processed per clock cycle in the modified SPIHT implementation [19]. However, the critical path reduces its frequency significantly. A bit-plane parallel SPIHT architecture was proposed in [20]. Four pixels are processed in parallel in this architecture. The throughput of the encoder is improved; however, the decoder cannot reach a similar throughput. Also, the hardware cost of this architecture is very high. Block-based pass parallel SPIHT (BPS) was proposed in [21]. A bit-plane block of 4 × 4 is processed in a single clock cycle. A 1D SPIHT was presented in [22], which exploits parallelism to achieve high throughput. High throughput is reached; however, the performance is reduced as it considers 1D-DWT instead of 2D-DWT.
A low-power FPGA implementation of the DCT-based Cordic-Loeffler (CL-DCT) algorithm in comparison to JPEG was provided in [23].
From the literature, it follows that a low-power, performance-efficient algorithm must be put forward. The key contributions of this work include the following:
1. A state table-based SPIHT (STS) algorithm is proposed that provides:
a. Performance similar to that of SPIHT at low bit rates.
b. Performance higher than SPIHT at high bit rates.
c. Much lower memory requirement than SPIHT.
2. FPGA implementation of the STS algorithm that provides:
a. FPGA area utilization less than SPIHT and SPIHT-based architectures.
b. Dynamic power dissipation comparable to that of JPEG architectures.
The paper is organized as follows. Section 2 briefly describes the traditional SPIHT algorithm. The proposed algorithm is presented in Section 3. Section 4 provides the memory required by STS and other algorithms. Section 5 provides the FPGA implementation of the proposed algorithm. The results and discussion are provided in Section 6. Finally, the work is concluded in Section 7.
SPIHT algorithm
The SPIHT algorithm is a spatial orientation tree (SOT)-based compression scheme. As the image is decomposed into many subbands using DWT, there is a correlation between different subbands at different levels of decomposition. The correlation is such that a parent node in a higher dyadic level is said to be related to four offspring nodes in the next lower dyadic level. This correlation is exploited in SPIHT by observing that if a parent node is insignificant at a particular threshold, it is most probable that the descendant nodes are also insignificant. Each node is checked for significance at every threshold value. However, if descendants are found insignificant, not all the bits corresponding to each coefficient are generated, and only a single '0' may represent the whole tree or a branch of it.
Three lists are used by SPIHT while encoding/decoding an image: the list of significant pixels (LSP), the list of insignificant pixels (LIP), and the list of insignificant sets (LIS). LSP and LIP contain the addresses of the coefficients, while LIS contains the addresses of the parent nodes that represent their descendants. Initially each node in LIS is a Type-A node, which means that its descendants are all insignificant and need to be checked at the current threshold. If a node in LIS has significant descendants, then the offspring nodes are moved to LSP or LIP according to their significance with respect to the threshold. The node in LIS is moved behind all entries in LIS and treated as Type-B, which implies that all the descendants except the offspring are insignificant and need to be checked at the current threshold. If the descendants are found significant, a '1' is output and the offspring nodes are added to LIS as Type-A.
These lists are updated each time a node is checked. This makes SPIHT an undesirable compression technique, as the number of entries in these lists in certain cases exceeds the total number of coefficients. A large memory requirement is imposed by SPIHT, which makes its hardware implementation inefficient. Besides high memory requirements, power dissipation in SPIHT also increases as the memory size increases, because block RAMs in FPGAs consume a significant amount of power.
Proposed algorithm
The proposed algorithm is put forth primarily to eliminate the need for a large dynamic memory, which limits the use of SPIHT despite its very high compression efficiency. Our proposed algorithm is based on state-tables that preserve the status of each block of coefficients. The following terms are used in the algorithm: The DWT transformed image is stored in Morton scan order [24]. Using Morton scan order, it is easier to navigate through a SOT tree. For any parent node located at x, the offspring nodes are located at locations 4x, 4x+1, 4x+2, and 4x+3. In STS, the image is divided into blocks of size 2 × 2. Unlike SPIHT, these blocks are treated as nodes of the SOT tree. Two state-tables, SIG_B and SIG_D, are used to hold the status of the nodes. SIG_B reflects whether a node is significant or not at any given threshold. SIG_D reflects whether any descendant node is insignificant or not. SIG_B and SIG_D are also arranged in Morton scan order. For an image of size R × C, the sizes of SIG_B and SIG_D in number of memory bits are given in Eq. (2) and Eq. (3), respectively. Two passes, a refinement pass (RP) and a sorting pass (SP), are employed by the proposed algorithm. While encoding a bit-plane, the SP is always preceded by the RP. The blocks representing the highest level of DWT are initialized as significant blocks (SIG_B = '1'), while all other blocks are treated as insignificant (SIG_B = '0'). SIG_D is initialized as '0' for all blocks. The initial threshold is set as the highest power of 2 just below the largest coefficient in the transformed image.
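The Morton addressing and the initial threshold described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and two details are assumptions rather than statements from the text: the column bit is taken as the less significant bit of each interleaved pair, and the threshold is taken as the largest power of 2 that does not exceed the largest coefficient magnitude.

```python
def morton_index(row: int, col: int) -> int:
    """Interleave the bits of (row, col) into a Morton (Z-order) index.

    Assumption: the column bit is the less significant bit of each pair.
    """
    index = 0
    bit = 0
    while (row >> bit) or (col >> bit):
        index |= ((col >> bit) & 1) << (2 * bit)
        index |= ((row >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return index


def offspring(x: int) -> list[int]:
    """Offspring of the parent node at Morton address x: 4x, 4x+1, 4x+2, 4x+3."""
    return [4 * x + k for k in range(4)]


def initial_threshold(coeffs) -> int:
    """Largest power of 2 not exceeding the largest coefficient magnitude.

    Assumption: integer coefficients with at least one nonzero value.
    """
    peak = max(abs(int(c)) for c in coeffs)
    return 1 << (peak.bit_length() - 1)
```

Because the offspring addresses follow from a left shift by two bits, walking down a SOT tree needs no stored pointers, which is what makes the linear Morton layout convenient here.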
Refinement pass
In RP, the significant blocks are encoded for the current bit-plane. Each coefficient is checked for significance at the current threshold. If a coefficient is significant at the current threshold, a bit '1' is output; otherwise, a bit '0' is output. If a coefficient was insignificant previously and becomes significant at the current threshold, a sign bit is also output. The image is encoded in RP in breadth-first search mode, i.e. the blocks in a higher level of DWT are always encoded before the blocks in a lower level of DWT. 
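The per-coefficient rule in RP can be sketched as follows. This is one possible reading of the description, not the authors' implementation: it assumes that thresholds are powers of 2, that the previous threshold is twice the current one, that a coefficient already significant at the previous threshold contributes its magnitude bit for the current bit-plane, and that a sign bit of '1' denotes a negative value. The function name is hypothetical.

```python
def encode_coefficient(coeff: int, t_now: int) -> list[int]:
    """Return the bits emitted for one coefficient at threshold t_now.

    Assumptions: t_now is a power of 2, the previous threshold is 2 * t_now,
    and sign bit 1 means negative.
    """
    magnitude = abs(coeff)
    t_prev = 2 * t_now
    if magnitude >= t_prev:
        # Already significant: emit the magnitude bit of the current bit-plane.
        return [1 if magnitude & t_now else 0]
    if magnitude >= t_now:
        # Newly significant: significance bit followed by the sign bit.
        return [1, 1 if coeff < 0 else 0]
    # Still insignificant at this threshold.
    return [0]
```

Under these assumptions a coefficient emits at most two bits per bit-plane.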
Sorting pass
In SP, a block is treated as significant if at least one of its coefficients is greater than the current threshold or if any of its descendant nodes are significant. It is observed that the initial few passes do not yield any significant coefficient in SP. The proposed algorithm therefore skips SP for these initial passes. The STS uses a depth-first search for SP. Depth-first search limits the number of coefficients that actively take part in the encoding process at any given time. Each SOT is treated independently, and hence only the nodes of one SOT tree are considered for filling up the lists. Two small lists, a list of child blocks (LCB) and a list of parent blocks (LPB), are employed to store the nodes of the SOT tree. The LCB contains the child nodes and its maximum size is fixed, given by Eq. (4). The LPB contains the parent nodes and its maximum size is also fixed, given by Eq. (5). Both of these lists are used to update the state-tables. Each SOT tree has its roots in the highest DWT level as shown in Figure 1. If SIG_D is '1' for a root node, then all the nodes of the SOT are significant already and are encoded in RP. Otherwise, the SOT tree is checked for the current threshold in SP. Each SOT tree is encoded in STS as follows:
Memory requirement
The input image is first transformed into DWT coefficients. For a transform block size of R × C and each coefficient represented by W bits, the memory (bits) required is presented in Eq. (1). The memory required by the two state tables is given in Eqs. (2) and (3), respectively. The maximum fixed size of the two lists is given in Eqs. (4) and (5), respectively. L represents the DWT transformation level. 
The total number of memory bits needed by the algorithm in addition to the image itself is given in Eq. (6) below:
Here, $\log_2(RC)$ gives the number of bits needed to address the image.
The memory requirements of STS in comparison to different algorithms are presented in Table 1 for various image sizes. As expected, the memory size increases with the image size. Column 2 to Column 11 show the memory requirement corresponding to different image sizes ranging from 4 Mpixels to 40 Mpixels. It should be noted that the memory requirements for JPEG 2000 and BTCA are given for a bitrate of 1 bpp. The memory requirement for these algorithms will increase as the bitrate increases. It is evident that the memory size required by STS is very small in comparison to the other algorithms. On average, STS requires only 0.86%, 0.88%, 7.6%, 45%, and 82% of the memory required by SPIHT, WBTC, LBTC, BTCA, and JPEG 2000, respectively.
Hardware Design
In this section, the hardware design of the proposed image compressor is presented. The implementation targets FPGAs. The FPGA platform has been chosen because it provides reconfigurability in addition to high performance. The top level architecture of the STS compressor is shown in Figure 2. The image to be encoded is either input as a whole or first partitioned into many subimages called tiles and then input to the compressor. The DWT stage performs a three-level 2-D DWT transformation using Cohen-Daubechies- architecture, it can be observed that the wavelet transformed block is stored in a dual-port memory within the DWT stage and the corresponding coefficients are made available to the STS encoder as and when it needs them. Also, the initial threshold is calculated in the DWT stage and made available to the STS encoder. When the tile is completely transformed, the DWT_End signal is set high and the STS starts processing the transform block. The decoder works in the reverse way. The STS decoder decodes the bitstream generated by the encoder and builds the transform block, which is then inverse transformed to recover the image tile. The transform block is stored in the memory located in the inverse DWT stage, 2D_IDWT. Coefficients are read from the memory using the signal R_coeff1 and then updated and written back using the signals W_coeff1, W_coeff2, we1, and we2.
STS encoder
The top level architecture for the STS encoder is shown in Figure 3a. It mainly consists of two passes, REFINEMENT PASS (RP) and SORTING PASS (SP). The resources used by the modules BITSTREAM GEN, COEFF ADDR GEN, and PACKETIZER are shared between the two passes. These resources are utilized by one of the two passes at a time, as decided by the CONTROL UNIT. As mentioned in Section 3, the proposed encoder encodes the image in a block-based manner. The two passes produce the addresses of the blocks that need to be encoded as per the algorithm. The COEFF ADDR GEN module produces the desired coefficient addresses, which are then sent to the transform block memory to retrieve the corresponding coefficients. A Valid signal is sent to the BITSTREAM GEN module to prevent it from encoding coefficients that do not meet the requirements to be encoded. The BITSTREAM GEN receives the coefficients and compares them with the current threshold T_n and the previous threshold T_{n-1} to encode them. Two coefficients are encoded per clock cycle and hence a bitstream of width up to 4 bits is generated. If the encoding control is with SP, the bitstream BS is made available to the SORTING PASS module, where the addressing bits are appended to the bitstream so that the decoder can recognize the coefficients that were encoded. Finally, the PACKETIZER module reorganizes the bitstream and produces a byte-long output, which can then be stored in off-chip memory.
The RTL schematic for REFINEMENT PASS is given in Figure 3b. Select is used to enable one of the two passes at a time. In RP, if the SIG_B value corresponding to the block address is '0', then the signal Valid_Ref is set high so that the coefficients belonging to the block are encoded by the subsequent modules. Otherwise, the address is incremented by 1 and checked again. If the last block address is reached, End_Ref is set high, so that the CONTROL UNIT can activate the SP. In SP, the transform block is encoded by taking one SOT tree at a time, as shown in Figure 4. A SOT tree encapsulates blocks from all three levels of DWT. From the figure, it can be seen that the block addresses from all three levels are calculated and checked. The block address from level L3, L3_Addr, is used to calculate its offspring in level L2, which are located at 4 × L3_Addr, 4 × L3_Addr + 1, 4 × L3_Addr + 2, and 4 × L3_Addr + 3. The entries in SIG_D corresponding to the level 2 and 3 addresses are checked by Sort_Control, and a decision is made accordingly on whether to take the SOT tree or not. If the SOT tree is taken, it is also decided which of the branches originating from level 2 is to be taken. When a branch is taken, SIG_B is used to decide which of the blocks need to be encoded. A block needs to be encoded if it is not previously significant. The control signals sel1, sel2, and sel3 are generated by Sort_Control to manage the right sequence of encoding the SOT tree. When a branch is encoded, the bitstreams corresponding to the level L2 node and its four offspring nodes from level L1 are obtained from BITSTREAM GEN. Consider the corresponding BS values to be stored in registers L2, L11, L12, L13, and L14.
If any of the bitstreams are found to contain values other than zero (significant coefficient), then a '1' followed by the BS corresponding to L2 is output. Furthermore, if all of the level L1 bitstreams L11, L12, L13, and L14 are found to be insignificant, then a '0' follows. Otherwise, a '1' follows, which is further followed by encoded bits from each of the level L1 blocks. For convenience, the encoding of only one block from level L1 (L11) is shown in the RTL schematic. The RTL corresponding to COEFF ADDR GEN is shown in Figure 5a. The coefficient addresses are generated from the block address BL_Addr. The BL_Addr is shifted by two bits to the left to get the coefficient located at 4 × BL_Addr. Two coefficient addresses are generated per clock cycle. The Alt signal generated by RP and SP is used to select between the pairs of coefficient addresses. In one clock cycle, one pair of addresses is selected, while in the next clock cycle, the remaining pair of coefficient addresses is selected. The bitstream BS is generated by the BITSTREAM GEN module with the RTL shown in Figure 5b. The coefficient value is first converted into signed magnitude form and then its magnitude is compared with the current and previous thresholds to encode the coefficient. If the coefficient was previously insignificant and becomes significant at the current threshold, then a sign bit 'sgn' is appended to the significance bit. The encoded bits from the two coefficients are concatenated and output in BS. The number of valid bits in BS is given by n, which can be at most 4. The PACKETIZER receives the bitstreams from BITSTREAM GEN or SP and then reorganizes the bits to produce output in byte-long packets, as shown in Figure 5c. The BS values are buffered into a temporary register. When the contents exceed 8 bits, a byte is output and the rest of the contents are shifted by 8 bits. The Reg_Addr pointing to the highest valid bit in the register is decremented by 8 when the address exceeds 8.
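A software model may help make the address generation and packetizing steps concrete. The Python sketch below is only a behavioural illustration, not the RTL of Figure 5. It assumes that Alt = 0 selects the first pair of coefficient addresses, that bits are packed most significant bit first, and that a byte is emitted as soon as at least 8 bits are buffered; the names `coeff_addresses` and `Packetizer` are hypothetical.

```python
def coeff_addresses(bl_addr: int, alt: int) -> tuple[int, int]:
    """Two coefficient addresses of a 2 x 2 block for one clock cycle.

    The base address is BL_Addr shifted left by two bits (4 * BL_Addr).
    Assumption: alt = 0 selects the first pair, alt = 1 the second pair.
    """
    base = bl_addr << 2
    return (base + 2 * alt, base + 2 * alt + 1)


class Packetizer:
    """Buffers variable-width bit groups and emits whole bytes."""

    def __init__(self) -> None:
        self.reg = 0            # temporary register holding pending bits
        self.count = 0          # number of valid bits currently in reg
        self.out = bytearray()  # emitted bytes

    def push(self, bits: int, n: int) -> None:
        """Append the n low-order bits of `bits` (n is at most 4 in the text)."""
        self.reg = (self.reg << n) | (bits & ((1 << n) - 1))
        self.count += n
        while self.count >= 8:
            # Emit the oldest 8 bits and keep the remainder in the register.
            self.count -= 8
            self.out.append((self.reg >> self.count) & 0xFF)
            self.reg &= (1 << self.count) - 1
```

For example, pushing the 4-bit groups `0b1011` and `0b0110` yields the single byte `0xB6`, and a subsequent 2-bit group stays buffered until enough bits arrive to complete the next byte.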
STS decoder
The architecture of the STS decoder is the reverse of the STS encoder, as shown in Figure 6a. Here the bitstream is used to reconstruct the transform block. Figure 6b shows the schematic for the STS decoder refinement address generator. C_Addr1 is used to access the coefficient from the memory located in the IDWT module. In the next clock cycle, the coefficient is updated and written back into the same location from which it was accessed. That is why C_Addr2 is a delayed version of C_Addr1. From Figure 6c, it is clear that the updating of a coefficient depends on the bitstream. BS[n] is the current bit from the bitstream that is consumed by updating the coefficient, and BS[n+1] is the next bit that follows BS[n]. Depending on the coefficient value (whether it is already significant or not) and these two BS bits, the coefficient may be written back unchanged, increased or decreased by 0.5 × T_n, or replaced with 1.5 × T_n or -1.5 × T_n. When a coefficient becomes significant, it is replaced by 1.5 × T_n or -1.5 × T_n, and afterwards bits '0' and '1' correspond to subtracting 0.5 × T_n from and adding 0.5 × T_n to the coefficient magnitude, respectively. As in the encoder sorting pass, three addresses are needed in the decoder sorting pass, as shown in Figure 7. Only one of the three address generators L3_ADDR, L2_ADDR, and L1_ADDR is enabled at any time. Whether a SOT tree is selected or not depends on SIG_D and the next bit from BS, as shown in Figure 7a. When a particular block is selected, block address to coefficient address generation takes place in the same way as in the STS encoder. However, the coefficient is not updated the same way as in the decoder refinement pass. In a sorting pass, the coefficients to be updated are all previously insignificant. Hence, the coefficient locations are written only with 1.5 × T_n or -1.5 × T_n, as shown in the 
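The coefficient update rule of the decoder refinement pass can be sketched in Python as follows. This is a hedged behavioural model rather than the circuit of Figure 6c: it assumes that an insignificant coefficient is stored as zero, that a sign bit of '1' denotes a negative value, and that `bits` holds the upcoming bitstream bits with `bits[0]` playing the role of BS[n] and `bits[1]` that of BS[n+1]. The function name is hypothetical.

```python
def decode_update(coeff: float, t_now: int, bits: list[int]) -> tuple[float, int]:
    """Update one coefficient; return (new value, number of bits consumed).

    Assumptions: an insignificant coefficient is stored as 0, and a sign
    bit of 1 means negative.
    """
    if coeff != 0:
        # Already significant: bit 1 adds 0.5 * T_n to the magnitude,
        # bit 0 subtracts 0.5 * T_n from it.
        step = 0.5 * t_now
        delta = step if bits[0] else -step
        return (coeff + delta if coeff > 0 else coeff - delta), 1
    if bits[0] == 0:
        # Still insignificant: written back unchanged.
        return 0.0, 1
    # Newly significant: replaced by +1.5 * T_n or -1.5 * T_n.
    sign = -1.0 if bits[1] else 1.0
    return sign * 1.5 * t_now, 2
```

As a worked case under these assumptions, a coefficient that becomes significant at T_n = 4 with a positive sign is reconstructed as 6; a following '0' at T_n = 2 moves it to 5, and a following '1' at T_n = 1 moves it to 5.5.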
Results and discussion
The main aim of this work is to reduce the FPGA area and dynamic power dissipation of the proposed STS architecture below those of JPEG and to provide compression efficiency better than SPIHT. For performance evaluation, images from two databases are used. These are the DSLR Photo Enhancement Dataset (DPED) [25] and Pixabay, as shown in Table 2. The color images are converted to monochrome images before encoding. For coding performance, peak signal-to-noise ratio (PSNR) has been compared against bitrate. The PSNR and bitrate are expressed by the following relations [26]:
$$\text{Bitrate (bpp)} = \frac{\text{Number of output bits}}{R \times C}$$

[7] and ABRC [8], are 15 and 1.6 times more than STS. It should be noted that the logic required to implement the entire JPEG 2000 will be more than that required by MQ Coder and ABRC.
Finally, Table 5 provides the power dissipation comparison between STS, JPEG-based CL-DCT, and the arithmetic coders MQ Coder and ABRC. For the dynamic power dissipation comparison, both STS and CL-DCT are implemented on a Spartan 3 FPGA board. The second row gives the maximum operating frequency, which is 96 MHz for STS and only 66.4 MHz for CL-DCT. The third and fourth rows give the dynamic power dissipation and normalized power dissipation, respectively. It is evident that the power dissipation and its normalized value are lower in STS than in CL-DCT, ABRC, and MQ Coder. STS has 96%, 35%, and 48% less normalized power dissipation in comparison to MQ Coder, ABRC, and CL-DCT, respectively. Finally, the power density is provided in the fifth row. STS has a slightly higher power density than CL-DCT, but one very close to that of ABRC.
It should be noted that the comparison with JPEG and CL-DCT is made only in order to understand the power and area efficiency of STS; the compression efficiency of both these algorithms is very low in comparison to STS. Comparison is also made with the binary arithmetic encoders used in JPEG 2000, and it is clear that STS requires less FPGA area and dissipates less dynamic power. Unfortunately, no variant of SPIHT has reported power dissipation for an FPGA implementation to date. The area used by JPEG is 2.7 times that of STS, while the area used by SPIHT-based architectures is 10 to 82 times that of STS. By analogy, the power dissipation in SPIHT-based designs is expected to be correspondingly high.
Conclusion
A novel state-table-based SPIHT (STS) algorithm is proposed in this paper. The main focus of this algorithm is to provide high performance and low memory requirements. The performance of our proposed algorithm is higher than that of SPIHT at higher bitrates. The memory requirement is 0.86% that of SPIHT. The algorithm is implemented on Xilinx FPGAs. The FPGA area utilized by SPIHT-based architectures is approximately 10 to 80 times more than STS. Also, JPEG occupies approximately 2.7 times more area than STS. Most importantly, the power dissipation in JPEG is higher than that in STS. All these features make the proposed compression algorithm a better candidate for DSLR and other cameras. STS provides better quality than JPEG, with less FPGA area and dynamic power dissipation.
