Abstract. A high-speed image compression architecture with region-of-interest (ROI) support and with flexible access to compressed data based on the Consultative Committee for Space Data Systems 122.0-B-1 image data compression standard is presented. Modifications of the standard permit a change of compression parameters and the reorganization of the bit stream after compression. An additional index of the compressed data is created, which renders it possible to locate individual parts of the bit stream. On request, stored images can be reassembled according to the application's needs and as requested by the ground station. Interactive transmission of the compressed data is possible such that overview images can be transmitted first followed by detailed information for the ROI. The architecture was implemented for a Xilinx Virtex-5QV and a single instance is able to compress images at a rate of 200 Mpx∕s at a clock frequency of 100 MHz. The design ensures that all parts of the system have a high utilization and parallelism. A Virtex-5QV allows compression of images with a width of up to 4096 px without external memory. The power consumption of the architecture is ∼4 W. This example is one of the fastest implementations yet reported and sufficient for future high-resolution imaging systems.
Introduction
Remote sensing sensors are used in applications ranging from Earth sciences, archeology, reconnaissance, and change detection to planetary research and astronomy. Disaster management after floods or earthquakes, detection of environmental pollution, and fire detection are only a few examples. The spatial as well as the spectral resolution of satellite image data increases steadily with new technologies and user requirements, resulting in higher precision and new application scenarios. In the future, it will be possible to derive application-specific information in real time on board the satellite, even from high-resolution images. On the technical side, there is a tremendous increase in the data rate that must be handled by such systems. Although the memory capacity requirements can still be fulfilled, the transmission capability becomes increasingly problematic. Real-time transmission of the acquired image data is not possible, and it can be assumed that it will not become possible in the near future.
The Institute for Optical Sensor Systems at the German Aerospace Center (DLR) has more than 20 years of experience in developing remote sensing sensors. During this period, several focal plane assemblies (FPA) for projects such as Mars96, ADS40, KompSat3, or KompSat3A have been developed. The last two are high-resolution FPAs with a ground sample distance (GSD) of 70 or 55 cm, respectively.
The article is organized as follows: In Sec. 2, some background and related work are presented. In Sec. 3, the Consultative Committee for Space Data Systems (CCSDS) 122.0-B-1 image data compression standard is presented, which is the basis for the extensions presented in Sec. 4. Section 5 presents a new architecture for image data compression on-board satellites. In Sec. 6, the results are presented and discussed. The conclusion and outlook are given in Sec. 7.
Image Data Compression On-Board Spacecraft
The first known satellite with on-board image compression was SPOT-1 (1986). It used differential pulse code modulation (DPCM) with a fixed compression ratio of 1.3:1. A transform-based compression algorithm was first used for the PHOBOS (1988) Mars exploration missions. The algorithm used a discrete cosine transform for spatial decorrelation, followed by scalar quantization and fixed-length coding. Compression was performed off-line on a Z80. 1, 2 In the following years, the data rate of high-resolution systems increased rapidly. The JPEG standard was approved in 1992 and has been used in many remote sensing missions with moderate data rates. High-resolution systems of that period, such as IKONOS (1999), QuickBird (2001), or WorldView-1 (2007), used relatively simple algorithms, such as DPCM, in order to perform image compression in real time. 3 In 1991, even before the JPEG standard was approved, CNES developed a JPEG-like compression application-specific integrated circuit (ASIC) capable of 4 Mpx∕s real-time compression. For SPOT-5 (2002), a compression architecture targeted at Earth observation was developed. 4 SPOT-5 contains three instruments producing up to seven data streams of up to 128 Mbit∕s each. A proprietary adaptive DPCM image compression algorithm was developed by Eastman Kodak Company. An ASIC called bandwidth compression plus, which implements the algorithm, was developed and used in IKONOS (1999), QuickBird (2001), and WorldView-1 (2007). The compression ASIC achieved an operating rate of 22 Mpx∕s. 3
EADS Astrium developed Compression Recording and Ciphering, which is used for SPOT-6 (2013), SPOT-7 (2014), and KazEOSat-2 (2014). As the name suggests, it is an image compression, mass storage, and ciphering unit. It uses multiple dedicated compression ASICs called wavelet image compression module, each with a speed of up to 25 Mpx∕s. 5 MultiRésolution par Codage de Plans Binaires, a wavelet-based image compression algorithm similar to JPEG2000, is used. The CCSDS wavelet image compression module (CWICOM) was recently developed by EADS Astrium in the context of a European Space Agency contract. 6 It is an image compression ASIC that implements the CCSDS 122.0-B-1 standard and supports both lossy and lossless image compression at a data rate of up to 60 Mpx∕s. It does not need external memory, since it contains almost 5 Mbit of internal memory. The ASIC is based on the Atmel ATC18RHA technology/cell library, which is intended for space applications.
In recent years, field-programmable gate arrays (FPGA) have been increasingly used for space applications. GRACE (2002) and FedSAT (2002) were the first missions to use an early Xilinx space-grade FPGA. 7 The FedSAT system is able to compress pixels at a rate of 43. 8 Mpx∕s. An overview of FPGA-based image compression systems is presented in Ref. 2. The authors also propose an architecture for on-board image compression. The design and implementation of the FORMOSAT-5 (2015) remote sensing instrument are described in Refs. 8 and 9. FORMOSAT-5 is an optical satellite with a GSD of 2 m. The total output data rate of the instrument is 970 Mbit∕s (PAN + 4 × MS). Three space-grade Xilinx XQR5VFX130 are used for online image compression. 10 One FPGA is used for panchromatic (PAN) processing and two FPGAs are used for multispectral (MS) processing. The compression system uses 24 external SRAM chips of 1 Mbyte each. An implementation of the CCSDS 122.0-B-1 standard used for Proba-V (2013) and EnMAP (2015) is presented in Ref. 11. The authors use a Microsemi ProASIC3E for development and an antifuse RTAX2000S for the flight model. The estimated data throughput is 173 Mbit∕s at 66 MHz.
16 bit signed and unsigned images in lossless as well as in lossy mode. A CCSDS 122.0-B-1 encoder consists of two parts: the discrete wavelet transform (DWT) module and the bit-plane encoder (BPE). As denoted in Ref. 12, the standard differs from JPEG2000 in several aspects:
(1) it specifically targets high-rate instruments used on board spacecraft; (2) compression performance has been traded off against complexity, with particular emphasis on spacecraft applications; (3) the lower complexity of the recommendation supports fast and low-power hardware implementation; and (4) it has a limited set of options, supporting its successful application without in-depth algorithm knowledge.
At first, the DWT module applies a three-level two-dimensional DWT on the input image, as shown in Fig. 1. Two specific wavelet filters are provided by the standard: the 9/7 biorthogonal wavelet transform referred to as Float DWT and a reversible integer approximation of that transform referred to as Integer DWT. The Float DWT is intended to be used for lossy compression, whereas lossless compression can only be achieved with the reversible Integer DWT. In the case of the Float DWT, limited precision of the coefficients' floating point representation and a conversion to the nearest integer after transform lead to some loss of information. Both transform types are separable and apply the column transform after the row transform.
The DWT module forms a hierarchy of wavelet coefficients, as shown in Fig. 2. A "block" is a group of one DC coefficient and the 63 corresponding AC coefficients (3 parents, 12 children, and 48 grandchildren). A block loosely represents a region in the input image. For the BPE, the blocks are further arranged into groups: a segment is a group of S consecutive blocks, where $16 \le S \le 2^{20}$. Segments are encoded independently and are further partitioned into "gaggles," which are groups of G = 16 consecutive blocks.
Fig. 2 A block consisting of a DC coefficient and 63 AC coefficients. 12
Once all coefficients are grouped, the BPE starts to encode the image segmentwise. The first step is to "weight" the coefficients if the Integer DWT is used. This is necessary to optimize the rate distortion. 14 Since the sub-band weights have been obtained empirically, the standard supports user-defined weights. Every sub-band has its own weighting factor w and the coefficients of each sub-band are multiplied by $2^w$. Each segment starts with a "segment header" containing information about the current segment. The DC coefficients are coded in two's complement representation. The AC coefficients are coded in sign-magnitude representation. After the segment header is written, the DC coefficients are quantized with a quantization factor q that depends on the wavelet transform type and on the dynamic range of the wavelet coefficients. In the next step, DPCM is applied on the quantized DC coefficients and is followed by Rice coding. After all quantized DC coefficients are encoded, some additional DC bit planes may be refined. The next step is to encode the bit depth of the AC coefficients in each block with the same DPCM method already used for the quantized DC coefficients.
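The DPCM step described above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the fold of signed differences into non-negative symbols is a plain zig-zag mapping chosen for brevity, whereas the mapping in the standard additionally takes the dynamic range limits into account.

```python
def dpcm_map(values):
    """Difference each value from its predecessor and fold the signed
    difference into a non-negative symbol (simplified zig-zag mapping)."""
    first = values[0]  # reference sample, kept as is
    symbols = []
    for prev, cur in zip(values, values[1:]):
        delta = cur - prev
        # Even symbols carry non-negative deltas, odd symbols negative ones.
        symbols.append(2 * delta if delta >= 0 else -2 * delta - 1)
    return first, symbols


def dpcm_unmap(first, symbols):
    """Invert dpcm_map and recover the original sequence."""
    values = [first]
    for s in symbols:
        delta = s // 2 if s % 2 == 0 else -((s + 1) // 2)
        values.append(values[-1] + delta)
    return values
```

Because neighboring quantized DC coefficients are usually similar, the mapped symbols tend to be small, which is what makes the subsequent Rice coding effective.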
The BPE encodes the wavelet coefficients, as the name suggests, bit-plane-wise and in decreasing order. For each bit plane, the encoding process is divided into stages 0 to 4. In stage 0, remaining bits of the DC coefficients are coded (DC refinement). Stages 1 to 3 encode the AC coefficients' sign and the position of the "significant bit," which is the highest nonzero bit. Stage 1 refers to the refinement of the parents' coefficients. The same procedure is applied to the children coefficients at stage 2 and to the grandchildren coefficients at stage 3. Stages 1 to 3 produce words which are first mapped to symbols. The symbols are then encoded with a variable-length code (VLC). All bits of a stage are written to the output bit stream before the next stage is commenced, even though the optimal code for the VLC is determined by stages 1, 2, and 3. Once an AC coefficient is selected, stage 4 encodes the AC coefficients' refinement.
Extensions of the CCSDS 122.0-B-1 Standard
CCSDS 122.0-B-1 supports neither ROI encoding, multispectral compression, nor spectral decorrelation, and it does not produce a bit stream that can be reassembled in an arbitrary manner. Compression parameters cannot be changed without re-encoding. ROI encoding would be useful in scenarios where on-board classification, registration, or object or change detection algorithms are used. If a certain event is detected or a matching object is found, the compression system might encode the corresponding area with higher detail or losslessly. If there are multiple ground stations with divergent downlink capabilities, reassembling already compressed image data might be desirable in order to adjust the amount of data to the bandwidth of the transmission channel. Another application scenario for this approach is ground stations with different access rights to the resolution level or spatial areas of the images. Because the encoder generates an embedded bit stream, the significance of each bit or the position of a block, segment, or image region inside the bit stream can only be determined by decoding. However, the algorithm is well-suited for real-time image compression on-board spacecraft.
ROI Encoding
The basic idea of ROI encoding is to encode certain regions with a low distortion (lossless) and other regions with a higher distortion (lossy). A ROI mask contains the information about whether a certain region is of interest or not. It is convenient to adjust the granularity of the ROI mask to a unit of information used in the compression algorithm. In this work, ROI encoding is achieved by controlling the compression parameters segmentwise. Figure 3 shows an example using ROI encoding. The test image "marstest" from the CCSDS reference image set was used, together with the integer wavelet and a segment size of S = 16. For this work, it was assumed that the ROI masks are either transferred via telecommand (TC) before image acquisition or determined with image classifiers (e.g., cloud detection). Note that the architecture presented in Sec. 5 does not implement any online classification algorithms.
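Segmentwise control of the compression parameters can be sketched as follows. The sketch assumes a per-block ROI mask and a hypothetical parameter record; the names `SegmentParams`, `bit_plane_stop`, and `background_stop` are illustrative and are not taken from the architecture.

```python
from dataclasses import dataclass


@dataclass
class SegmentParams:
    lossless: bool       # True if the segment touches the ROI
    bit_plane_stop: int  # hypothetical: lowest bit plane that is still coded


def segment_params(roi_blocks, blocks_per_segment, background_stop=4):
    """Derive one parameter set per segment from a per-block ROI mask.

    A segment is treated as ROI as soon as any of its blocks is flagged,
    so the ROI granularity equals the segment size.
    """
    params = []
    for start in range(0, len(roi_blocks), blocks_per_segment):
        in_roi = any(roi_blocks[start:start + blocks_per_segment])
        params.append(SegmentParams(in_roi, 0 if in_roi else background_stop))
    return params
```

A small segment size such as S = 16 keeps the losslessly coded area close to the requested region, at the cost of more segment headers.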
Scalability
Scalability here means that the compressed image or parts of it can be reassembled to achieve a particular image quality, or spatial or spectral resolution in any area of an image without re-encoding. Furthermore, the method will also be useful to efficiently build transfer frames for multispectral encoded data. A similar approach was presented in Ref. 15. Scalability is achieved by (1) using or modifying the compression algorithm in a way that different spatial or spectral regions of an image can be independently decoded and (2) creating an index of the encoded bit stream such that the position of an image block is known.
It is desirable to compress images with a region-specific spatial or spectral resolution based on a mask, which is available during compression, and to assemble a transfer frame which possibly contains an even coarser spatial or spectral resolution, i.e., to change compression parameters after compression without re-encoding. Figure 4 shows this concept on three images.
In order to reassemble the bit stream after compression, the index must be stored in memory. On the one hand, this effectively reduces the compression performance. On the other hand, the index does not have to be transmitted to the ground station, since reassembling of the bit stream and transfer frame generation is performed on-board spacecraft. Table 1 shows the percentage size of the index compared to the size of the input image. A segment size of S = 128 blocks leads to an average index size of 1.19% compared to the input image size for the examined images.
The index can be used to manipulate or reassemble the compressed bit stream. The index itself does not need to be transmitted in any of the following application scenarios: (1) Any quality-related compression parameter can be changed before transmission to the ground station. The resulting bit stream may include an ROI; however, it can be decoded without the index since it complies with the standard, and thus requires only the embedded headers. Figure 5 illustrates this approach. (2) If the ground station requests an "update" for an already transmitted (overview) image, only details for some specific regions are transmitted. The ground station can merge this update packet with the previously transferred image during decoding, since it knows which parts of the bit stream can be found in which transmission packet. Depending on the implementation of the decoder, it is also possible to reuse already decoded parts of the bit stream, which leads to interactive decoding. This approach is shown in Fig. 6. It should be noted that this approach may need increased mass memory, since more data need to be stored on-board than are transmitted to the ground station.
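The two scenarios can be illustrated with a small sketch of an index-driven reassembly. The index layout, the part names, and byte-granular offsets are assumptions made for illustration; the sketch also ignores the header updates that a standard-compliant truncated stream would need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    segment: int  # segment number within the image
    part: str     # hypothetical part name, e.g. "header", "dc", "bp"
    offset: int   # byte offset of the part in the stored bit stream
    length: int   # length of the part in bytes


def reassemble(stream, index, keep):
    """Concatenate the stored parts selected by keep(entry), in stored order."""
    return b"".join(
        stream[e.offset:e.offset + e.length] for e in index if keep(e)
    )


def build_demo():
    """Two toy segments, each with a header, a DC part and bit-plane data."""
    stream = b"H0D0P0P0H1D1P1P1"
    index = [
        IndexEntry(0, "header", 0, 2), IndexEntry(0, "dc", 2, 2),
        IndexEntry(0, "bp", 4, 4),
        IndexEntry(1, "header", 8, 2), IndexEntry(1, "dc", 10, 2),
        IndexEntry(1, "bp", 12, 4),
    ]
    return stream, index
```

With the demo data, `reassemble(stream, index, lambda e: e.part != "bp")` yields a coarse overview of both segments, and a later call that keeps only the `"bp"` part of segment 1 yields the update packet for that region.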
Hardware Architecture
In order to fulfill the high demands on the data throughput, which is ∼200 Mpx∕s for a single instance and still achieve an effective and flexible compression, a pipeline approach is chosen. Figure 7 shows the underlying structure of the proposed architecture. The design of the architecture consists of five main modules. The compression controller (CC) communicates with the spacecraft via TC/telemetry and controls all modules of the system except the memory controller. Compression parameters and a ROI mask are supplied by the CC. The data path starts at the DWT module. The coefficient grouping (CG) module is used to rearrange the coefficients from the 10 sub-bands and to form the blocks necessary for the BPE.
The architecture has two 16 bit input channels including data valid signals for image data. It processes two image pixels per clock cycle. This is a valid assumption since the detector can operate in a different clock domain. Two or more readouts of a detector can be combined into a single data stream using some on-chip memory. Furthermore, current sensors developed at DLR have integrated flat field or radiometric correction so that additional processing can be performed before image compression. The output bit-stream interface has a 64-bit data signal and an 8-bit byte-valid signal indicating whether each output byte is valid.
Discrete Wavelet Transform
The DWT module uses a line-based architecture for the three individual two-dimensional (2-D) modules. Internally, each 2-D DWT module uses one row- and two column-transform modules. The memory demands of the 2-D DWT are dominated by the column transform and depend in particular on the image width and the chosen wavelet kernel. For the Float DWT or Integer DWT, five or six lines, respectively, must be cached. Figure 8 shows the arithmetic principles of the Integer and Float DWT pipeline. When input values for two lines are to be processed, the corresponding entries stored in memory are read. At the end, the memory is updated for the next iteration. In the case of the floating-point DWT, the corresponding lifting scheme is implemented directly.
The memory requirements are as follows: Since up to six lines must be cached, up to six temporary values must be read and written every clock cycle. For an image width of 16384 px and an internal precision of 24 bpp, the total buffer size is 172032 px (504 kbyte). Assuming a clock frequency of 100 MHz, the total data rate of the memory, which is independent of the image size, is 3600 Mbyte∕s.
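The lifting principle, in which each step reads stored values, combines them with new input, and writes back an update, can be illustrated in one dimension. Note that the sketch deliberately swaps in the simpler reversible 5/3 integer lifting steps, not the 9/7 filters of CCSDS 122.0-B-1, and that the boundary handling by index clamping is an assumption of this example.

```python
def lift_forward(x):
    """One level of reversible 5/3 integer lifting on an even-length list."""
    even, odd = x[0::2], x[1::2]
    half = len(x) // 2
    # Predict step: odd sample minus the floored mean of its even neighbors.
    high = [odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1)
            for i in range(half)]
    # Update step: even sample plus a rounded quarter of the adjacent details.
    low = [even[i] + ((high[max(i - 1, 0)] + high[i] + 2) >> 2)
           for i in range(half)]
    return low, high


def lift_inverse(low, high):
    """Undo lift_forward by running the two lifting steps in reverse order."""
    half = len(low)
    even = [low[i] - ((high[max(i - 1, 0)] + high[i] + 2) >> 2)
            for i in range(half)]
    odd = [high[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1)
           for i in range(half)]
    x = [0] * (2 * half)
    x[0::2], x[1::2] = even, odd
    return x
```

Because every step only adds or subtracts an integer that the inverse can recompute exactly, the transform is lossless, which is the property the Integer DWT relies on.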
At the low-pass filter of the Float DWT, four input values must be available on each side of the current input value (before and after). With a line-based architecture, this causes an output latency of four rows for each decomposition stage. In other words, the $LL_{k+1}$ wavelet coefficients depend on the corresponding input pixel position in $LL_k$ with a radius of four pixels, or an image area of up to 9 × 9 pixels. Every DWT module $L_k$ decomposes an input signal $LL_k$ into the four sub-bands $LL_{k+1}$, $HL_{k+1}$, $LH_{k+1}$, and $HH_{k+1}$. The first output arises a few clock cycles after the fifth input line has started. From then on, the DWT module produces output on every second line (due to the subsampling). When the first output arises in the DWT module $L_3$, DWT module $L_1$ has already emitted 12 lines of wavelet coefficients. At this time, DWT module $L_2$ has emitted four lines of wavelet coefficients. Thus, at least $3 S_x$ blocks must be buffered for synchronization with the $L_3$ sub-band, where $S_x$ denotes the maximum number of blocks in horizontal direction. A value of $4 \cdot S_x$ is chosen for this design in order to buffer more blocks. Thus, assuming a dynamic range of the wavelet coefficients of d = 24 bit, the total amount of memory necessary for coefficient rearranging is as follows:
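A form consistent with the 1.5 Mbyte quoted for $S_x = 2048$, under the assumption that each buffered block holds 64 coefficients of $d$ bit, would be:

$$
M_{\mathrm{CG}} = 4 \cdot S_x \cdot 64 \cdot d\ \text{bit}.
$$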
If the image has up to $S_x = 2048$ blocks in horizontal direction, the coefficient rearranging module requires 1.5 Mbyte of memory. An external memory is necessary to buffer and rearrange the DWT coefficients. Since all the coefficients need to be written and read only once, the data rate of the memory is twice the input data rate of the module. The input data rate of the module is 200 Mpx∕s. Thus, the memory data rate (read and write) is 400 Mpx∕s. Assuming a dynamic range of the wavelet coefficients of d = 24 bit, the total data rate of the memory is 1200 Mbyte∕s.
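The figures in this paragraph can be checked with a few lines of arithmetic. The sketch assumes 64 coefficients per block and four buffered block rows, as stated in the text; the function names are illustrative.

```python
def cg_buffer_bytes(s_x, d_bits=24, block_rows=4, coeffs_per_block=64):
    """Buffer size for coefficient rearranging, in bytes."""
    return block_rows * s_x * coeffs_per_block * d_bits // 8


def cg_memory_rate_mbyte_s(input_rate_mpx_s=200, d_bits=24):
    """Memory data rate in Mbyte/s: every coefficient is written once
    and read once, so the memory sees twice the input rate."""
    return 2 * input_rate_mpx_s * d_bits // 8
```

For $S_x = 2048$, `cg_buffer_bytes(2048)` evaluates to 1572864 byte, which is 1.5 Mbyte when 1 Mbyte is taken as $2^{20}$ byte, and `cg_memory_rate_mbyte_s()` evaluates to 1200.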
The total amount of memory needed for a single-level row transform is as follows:
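A per-level form that reproduces the 172032 px total quoted earlier, assuming that the levels are numbered $l = 1, 2, 3$ and that the line width halves at each level, would be:

$$
M_l = R_{\max} \cdot \frac{I_w}{2^{\,l-1}} \cdot d\ \text{bit},
$$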
where $I_w$ denotes the image width, $l$ denotes the level of the dyadic decomposition, and $R_{\max} = 6$ denotes the maximum number of lines which must be cached. For a three-level dyadic decomposition, the total amount of memory is as follows:
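Summing an assumed per-level term $R_{\max} \cdot I_w / 2^{\,l-1} \cdot d$ over three levels, with the levels numbered 1 to 3 as an assumption of this reconstruction, gives:

$$
M = \sum_{l=1}^{3} R_{\max} \cdot \frac{I_w}{2^{\,l-1}} \cdot d
  = \frac{7}{4} \cdot R_{\max} \cdot I_w \cdot d\ \text{bit}.
$$

For $I_w = 16384$, $R_{\max} = 6$, and $d = 24$ bit, this evaluates to 172032 px at 24 bit each, or 504 kbyte, in agreement with the buffer size stated earlier.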
Since up to six lines must be cached, up to six temporary values must be read and written every clock cycle. Assuming a data width of d = 24 bit and a clock frequency of 100 MHz, the total data rate (read and write) of the memory is as follows:
$$\text{Total memory data rate} = 2 \cdot R_{\max} \cdot 24\ \text{bit} \cdot 100\ \text{MHz} = 3600\ \text{MB/s}.$$
It should be noted that this value is independent of the image size.
Coefficient Rearranging
The idea of the entropy encoder is that the wavelet coefficients are read block by block, until the required number of blocks has been read and the entropy encoding process can start. Unfortunately, this is not the order in which the DWT module outputs the wavelet coefficients. Reordering and block formation are complex processes, since the input comes from 10 sub-bands with a non-negligible time offset. A line-based DWT architecture has two problems: on the one hand, a CCSDS 122.0-B-1 encoder usually operates in stripe-based mode, i.e., the segment size is $S = S_x = N \times \lceil I_w/8 \rceil$ blocks, where $I_w$ denotes the width of the image and N is an integer ≥ 1. In order to support ROI, the segment size must be variable or at least less than this value. In stripe-based mode, all blocks in any image line belong to the same segment. Otherwise, the DWT module generates coefficients belonging to multiple segments. On the other hand, the structure of the DWT module causes temporal delays of the wavelet coefficients in the higher decomposition levels $L_2$ and $L_3$ (see also Fig. 1).
The main task of the CG module is to group the wavelet coefficients for a blockwise output. It buffers the wavelet coefficients of the $L_1$ and $L_2$ decomposition levels and synchronizes its output to $L_3$. Thus, it also buffers coefficients belonging to another segment. Figure 9 shows the structure of the CG module. The input module receives the DWT coefficients of the three 2-D DWT modules and writes them to temporary memory. To do so, it must determine the number of the block that a coefficient belongs to. The input and the output modules share a circular list of memory block addresses: the input module signals to the output module the position of the first $LL_1$, $LL_2$, and $LL_3$ block that has not been completely written. In turn, the output module signals the position of the last block that has been completely read. Furthermore, segment parameters are requested from the CC. Parameters such as StartImageFlag and EndImageFlag, which indicate the first or last segment, are updated and sent to the output module. The size of the memory buffer depends on the number of blocks that must be buffered. This in turn depends on the length of the wavelet filter and the width of the image.
If the output module has received the segment parameters from the input module, it transfers all blocks that were completely written to memory, to the segment buffer module, until the desired number of blocks for the segment has been read (segment size S). While reading the blocks from memory, the module converts the AC coefficients to sign/magnitude-representation and determines BitDepthDC and BitDepthAC for the segment and BitDepthAC_block for each block. Again, the parameters are updated and sent to the segment buffer module.
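The conversion and the bit depth determination can be sketched as follows. The sketch assumes that BitDepthAC is the number of magnitude bits of the largest AC coefficient and that BitDepthDC is the smallest two's complement width that holds every DC coefficient; the function names are illustrative.

```python
def to_sign_magnitude(value):
    """Split a signed coefficient into (sign bit, magnitude)."""
    return (1 if value < 0 else 0), abs(value)


def bit_depth_ac(ac_coeffs):
    """Number of bits needed for the largest AC magnitude."""
    return max(abs(c) for c in ac_coeffs).bit_length()


def bit_depth_dc(dc_coeffs):
    """Smallest two's complement width that represents all DC coefficients."""
    def width(c):
        # One sign bit plus the magnitude bits of c (or of -c - 1 if negative).
        return 1 + (c.bit_length() if c >= 0 else (-c - 1).bit_length())
    return max(width(c) for c in dc_coeffs)
```

The same `bit_depth_ac` function applied to the 63 AC coefficients of a single block gives the per-block value that the text calls BitDepthAC_block.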
The segment buffer module provides blocks to the BPE module. A double buffering mechanism is used, so that the next segment can be written to the buffer while the BitPlaneEncoder processes the last segment. The module buffers the weighted coefficients in sign and magnitude representation, each consisting of d + 4 bit, as well as the bit depth of the AC coefficients of each block (5 bit). If the dynamic range of the coefficients is d = 24 bit, the total size of the buffer is
Assuming a segment size of S = 128 blocks, the memory size will be ≈56 kbyte. The FPGA internal BlockRAM should be sufficient.
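One expression that reproduces the ≈56 kbyte quoted for S = 128, assuming that the double buffer holds two segments of S blocks with 64 coefficients each, is:

$$
M_{\mathrm{SB}} = 2 \cdot S \cdot \bigl(64 \cdot (d + 4) + 5\bigr)\ \text{bit}.
$$

With S = 128 and d = 24 bit this gives 460032 bit, or about 56.2 kbyte.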
Bit Plane Encoder
The BPE is the entropy encoder of the compression algorithm and produces the individual bit stream parts. It gets the input from the coefficient rearranging module. Without considering parallel execution mechanisms, it must be able to compress each segment in 32 · S clock cycles, since a segment of S blocks holds 64 · S coefficients and two pixels arrive per clock cycle. In order to achieve real-time compression, the individual modules must operate in parallel. The structure of the proposed BPE is shown in Fig. 10. A BPE compression control module reads the segment blocks from the coefficient rearranging module (segment buffer). Parameters necessary to write the segment header are also provided, such that the encoding process can start immediately. Whenever the segment buffer has collected an entire segment, it sends a request to the BPE compression control module. When the compression of the segment is finished, the module confirms this to the segment buffer. During operation, the BPE control module sequentially generates input data for the encoding modules.
Fig. 9 Structure of the coefficient grouping module.
Segment header, quantized DC coefficients, additional DC bit planes, and AC coefficient bit depths are processed sequentially, while the stages 0 to 4 are processed in parallel. The reason for this is that stage encoding produces the major part of the compressed bit stream and consumes the majority of the execution time. All encoding modules use a data valid mask (one bit for each data bit) in order to mark valid output bits and two flags in order to mark the end of a segment or a bit plane (end-of-segment and end-of-bit-plane).
Initial encoding modules
The DC coefficient of each block is read from the segment buffer, quantized and transmitted to the QuantizedDC and BitDepthAC module. The module is shown in Fig. 11. The module is also used to encode the bit depth of the AC coefficients of each block. In both cases, one block is read from the segment buffer every clock cycle. All input values are first mapped to symbols. Then, the optimal code option for subsequent Rice coding is determined. Note that the architecture does not support the heuristic method presented in the CCSDS 122.0-B-1 standard, since it does not lead to a significant simplification of this architecture. During the determination of the Rice parameter, the mapped symbols are cached in a small FIFO. The depth of the FIFO depends on the maximum gaggle size G and is set to 2 · G. Depending on the code option p, data are written uncoded in one stage or coded in two stages. In the next step, the additional DC bit planes are read from memory and transmitted to the additional DC bit planes module. DC coefficient refinement is performed depending on the DC quantization factor and BitDepthAC. In this step, some bits of the DC coefficients of all blocks are encoded. The order of the encoded bits was changed: instead of sending the (q − 1)th most significant bit of each DC coefficient followed by the (q − 2)th most significant bit of each DC coefficient (and so on, until the BitDepthACth bit of each DC coefficient), bits q − 1, ..., BitDepthAC of the first DC coefficient are sent, followed by the corresponding bits of the second DC coefficient, and so on. This reduces the number of memory accesses to the segment buffer.
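The exhaustive selection of the code option over one gaggle can be sketched as follows. This is a simplified illustration of generic Rice coding: the candidate range for the parameter k and the function names are assumptions, and details of the standard such as the uncoded reference sample and the code option identifier are left out.

```python
def rice_cost(symbols, k):
    """Bits needed to Rice-code all symbols with parameter k: a unary
    quotient (s >> k zeros and one terminating bit) plus k remainder bits."""
    return sum((s >> k) + 1 + k for s in symbols)


def best_code_option(symbols, n_bits):
    """Return (k, total_bits) for the cheapest option over one gaggle.

    k is None when writing the symbols uncoded with n_bits each is cheapest.
    """
    best_k, best_bits = None, n_bits * len(symbols)
    for k in range(n_bits - 1):  # assumed candidate range 0 .. n_bits - 2
        cost = rice_cost(symbols, k)
        if cost < best_bits:
            best_k, best_bits = k, cost
    return best_k, best_bits
```

In hardware, the per-option sums can be accumulated in parallel while the symbols wait in the FIFO, which is why the FIFO depth is tied to the gaggle size.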
Stage encoding modules
After completion of the initial encoding of the segment, the encoder starts the stage encoding procedure. In this step, every block is read from the segment buffer for every bit plane to be encoded. The encoding process runs for the bit planes b = BitDepthAC − 1, BitDepthAC − 2, ..., 0. The output of stages 0 to 4 is now generated simultaneously.
Stage 0 data, if valid, consist of a single bit (the corresponding bit of the DC coefficient). Stages 1, 2, and 3 data are generated from the current block status, the encoding parameters, the index of the current bit plane, the number of the current gaggle, and the signs of each AC coefficient (Fig. 12). Stage 4 data consist of up to 63 bit, one bit for each AC coefficient that was selected in a previous bit plane.
In stages 1, 2, and 3, for every block, the types (0, 1, 2, or −1) of each coefficient and the maximum types in the block hierarchy are determined. Afterwards, up to 38 symbols are generated (although not all symbols can occur simultaneously). Each of the two operations requires one clock cycle. Subsequently, the symbols are sent to the VLC encoding modules. The width of the data path here is two mapped symbols for stage 1 and six mapped symbols for stages 2 and 3. The VLC encoding modules consist of a statistics module, which sums up the number of bits for each symbol length and code option. In parallel, the mapped symbols are buffered in FIFOs. The sizes of the FIFOs depend on the gaggle size (over which the optimal code option is determined) and the maximum number of symbols that can be generated in each stage. After the statistics (number of bits for each code option and symbol length) of each stage are determined, the results are fused in a CodeOption Determination module that calculates the optimal code option. Furthermore, it determines the stage in which the corresponding code option identifier must be written. Since the optimal code option selection method is implemented, the heuristic method described in the CCSDS 122.0-B-1 standard is not necessary and thus not implemented. At the end, a VLC writer module encodes the mapped symbols. The output width of 64 bit is sufficient to write up to six mapped symbols per clock cycle: the maximum code length of a mapped symbol is 8 bit (see Ref. 12). Since a code option identifier of 2 bit may be located immediately before a corresponding code, 10 bit of the output vector are used for one coded symbol. This also explains why up to six symbols are sent in one clock cycle to the VLC encoding modules.
Output Module
Depending on the image and the compression parameters, a segment consists of different bit stream components or parts. The output module, shown in Fig. 13, merges the bit streams from up to eight encoding modules into a single output bit stream. The output vector has a width of 64 bit, which corresponds to the internal width of the compression architecture: stages 2 and 3 produce up to 64 bit and stage 4 up to 63 bit of data in one clock cycle. Assuming a worst-case compression ratio of 1:1 for natural (nonrandom) images, an output data width of 32 bit should be sufficient (the input interface has 2 × 16 bit). This would need an additional FIFO to buffer the peak output data. This has not been implemented, because there are no interface requirements at the moment. In order to conform to the standard, the output interface has a byte-valid signal instead of a data valid signal. A CodeWordLength of 1, 2, 3, and 4 is supported. Furthermore, the output module creates the bit stream index that is necessary to achieve scalability. The output module gets the information on the segment parts from the compression control module.
Since some previous encoding modules do not use the entire 64 bits for data, preshifter modules are used to compress the sparse bit vectors. In principle, the preshifter modules are similar to the output shifter module shown in Fig. 14; however, they operate on byte level instead of bit level in order to reduce the FPGA's resource consumption. The output shifter module first shifts all valid bits together. Simultaneously, it counts the number of valid bits. Based on the number of valid bits, it shifts the compacted data such that it can easily be combined with the output buffer by logical operations. If the output buffer is full, its value forms the output of the module. The new buffer will either be empty or contain the bits that did not fit in the output. If there is no output, the buffer will be reused for the next input data.
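The behavior of the output shifter can be modeled in software as follows. This is a behavioral sketch only: the class name, the MSB-first bit order, and the bit-serial compaction loop are assumptions, whereas the hardware performs the compaction with parallel shifters.

```python
class OutputShifter:
    """Behavioral model: compact the valid bits of each input word and
    collect them into full 64-bit output words."""

    WIDTH = 64

    def __init__(self):
        self.buffer = 0  # pending bits, oldest bit in the highest position
        self.count = 0   # number of pending bits

    def push(self, data, mask):
        """Append the bits of `data` flagged in `mask`; return full words."""
        packed, n = 0, 0
        for pos in range(self.WIDTH - 1, -1, -1):  # scan MSB first
            if (mask >> pos) & 1:
                packed = (packed << 1) | ((data >> pos) & 1)
                n += 1
        # Combine the compacted bits with the bits left from earlier inputs.
        self.buffer = (self.buffer << n) | packed
        self.count += n
        words = []
        while self.count >= self.WIDTH:
            self.count -= self.WIDTH
            words.append(self.buffer >> self.count)
            # Keep only the bits that did not fit into the output word.
            self.buffer &= (1 << self.count) - 1
        return words
```

A call returns an empty list as long as fewer than 64 valid bits have accumulated, which corresponds to the case in which the buffer is reused for the next input data.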
Memory Controller
As shown in the previous sections, the amount of FPGA internal memory is not sufficient for large row or segment sizes. Since external memory is necessary in the case of wide images, a concept to manage external memory is required. The choice of the specific memory technology is a compromise between flexibility, power consumption, and access speed as well as the memory size.
As already mentioned, the memory sizes for the DWT column transform or the CG module will exceed the FPGA's internal memory resources. The corresponding VHDL entities in the design use either BlockRAMs or FIFOs based on BlockRAM. The maximum row size determines the size of the memory in the particular modules. If only the FPGA's internal memory resources are used and the maximum row size exceeds a certain threshold, the design will not fit into the FPGA. From an engineering point of view, it is often not trivial to replace internal with external memory, since signals must be routed through the complete design hierarchy. The situation is further complicated if the decision as to whether internal or external memory is used is made based on generic parameters.
With the hardware operating system presented in Ref. 16, it is possible to abstract external memory (e.g., QDRII+ SRAM) in such a way that it can be used like the FPGA's internal BlockRAM. In principle, the "structure compiler" presented in Ref. 16 can substitute a generic instance of a memory with either BlockRAM or external memory. In the latter case, it also adds the required signals to the external memory. This approach is used in the architecture to implement a unified memory interface. Thus, BlockRAM can easily be replaced by external memory. Figure 15 shows this concept.
The benefits of using a unified memory interface are that the design can be developed and verified independently of the specific external memory technology. It can be used on all platforms for which the corresponding interfaces are available. Furthermore, with the use of the hardware operating system, no routing of memory-specific signals between the top and bottom levels is necessary in the user application.
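The idea of a unified memory interface can be illustrated by a software analogy. The sketch below is not the VHDL mechanism of Ref. 16; all class and function names (`Memory`, `BlockRam`, `ExternalSram`, `make_memory`, `delay_line`) are hypothetical. It only shows that a user module written against one interface works unchanged when the backing memory is swapped based on a size parameter.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class Memory(ABC):
    """Unified interface: user modules only see read and write."""

    @abstractmethod
    def read(self, addr: int) -> int: ...

    @abstractmethod
    def write(self, addr: int, data: int) -> None: ...


class BlockRam(Memory):
    """Model of FPGA-internal memory."""

    def __init__(self, depth: int) -> None:
        self._cells = [0] * depth

    def read(self, addr: int) -> int:
        return self._cells[addr]

    def write(self, addr: int, data: int) -> None:
        self._cells[addr] = data


class ExternalSram(Memory):
    """Model of external memory; counts accesses to mimic a separate port."""

    def __init__(self, depth: int) -> None:
        self._cells = [0] * depth
        self.accesses = 0

    def read(self, addr: int) -> int:
        self.accesses += 1
        return self._cells[addr]

    def write(self, addr: int, data: int) -> None:
        self.accesses += 1
        self._cells[addr] = data


def make_memory(depth: int, internal_limit: int) -> Memory:
    """Pick the backend from a generic parameter, as a structure compiler might."""
    return BlockRam(depth) if depth <= internal_limit else ExternalSram(depth)


def delay_line(mem: Memory, depth: int, samples: Sequence[int]) -> List[int]:
    """User module: delays every sample by `depth` positions using `mem`."""
    out = []
    for i, sample in enumerate(samples):
        addr = i % depth
        out.append(mem.read(addr))   # oldest value stored at this address
        mem.write(addr, sample)
    return out
```

The user module `delay_line` contains no backend-specific code, so it can be verified once against the internal model and reused with the external one.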
Results and Discussion
The compression algorithm presented in the previous sections has been successfully implemented on reconfigurable hardware which is qualified for space applications.
The data compression throughput is measured via simulation in ModelSim at a clock frequency of 100 MHz. The evaluation is made for the CCSDS reference images presented in Ref. 14. The dataset consists of various images from Earth observation and remote sensing. The images have a dynamic range of 8 bit to 16 bit. The image sizes are between 512 × 512 px and 1400 × 5504 px. All images are compressed losslessly, i.e., the Integer DWT is chosen. Using the Float DWT will lead to almost identical results, since the floating point module has the same timing behavior. The segment size is S = 128 blocks. The results for a selection of images are shown in Table 2.
The complete dataset consists of 34 images (≈7 Mbyte). All images were compressed in the hardware simulation. The resulting bit stream was validated with a functionally identical software implementation of the algorithm. The average data compression throughput for all classes of images is ∼197.71 Mpx∕s (198.96 Mpx∕s for the CCSDS reference images). It is evident that a higher dynamic range of the input images has almost no impact on the data compression throughput with respect to the pixel rate. The encoding time depends almost exclusively on the spatial size of the input data. This can be explained by the fact that the entropy encoder is optimized for a dynamic range of 16 bit and the wavelet transform module limits the throughput of the system.
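The measured throughput can be related to the theoretical peak with a short calculation. The sketch below assumes the figures stated in this paper (100 MHz clock, two pixels per clock cycle, 16 bit per pixel at most); the function names are hypothetical.

```python
def peak_throughput_mpx_s(clock_mhz: float, pixels_per_cycle: int) -> float:
    """Peak pixel rate in Mpx/s for a given clock and input width."""
    return clock_mhz * pixels_per_cycle


def utilization(measured_mpx_s: float, peak_mpx_s: float) -> float:
    """Fraction of the peak pixel rate that is actually reached."""
    return measured_mpx_s / peak_mpx_s


def data_rate_mbyte_s(throughput_mpx_s: float, bits_per_pixel: int) -> float:
    """Input data rate in Mbyte/s for a given pixel rate and pixel depth."""
    return throughput_mpx_s * bits_per_pixel / 8


if __name__ == "__main__":
    peak = peak_throughput_mpx_s(clock_mhz=100, pixels_per_cycle=2)
    print(peak)                        # peak pixel rate in Mpx/s
    print(utilization(197.71, peak))   # average over all image classes
    print(data_rate_mbyte_s(peak, 16)) # input rate for 16 bit images
```

With the average of 197.71 Mpx∕s, the architecture reaches roughly 99% of the peak rate of 200 Mpx∕s, which is consistent with the statement that the throughput is nearly independent of the dynamic range.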
For the evaluation of the resource consumption, the internal precision of the integer/floating point arithmetic is set to 24 bit, and the Integer DWT as well as the Float DWT were considered. Tables 3 and 4 show the absolute number and the percentage of used resources. The internal resource consumption of the architecture on a Virtex 5 XC5FX130T-1 for a maximum image width of 4096 px is as follows: if only the Integer DWT is included in the design and for a given maximum segment size S = 128 blocks, the usage of slice registers, slice lookup tables (LUTs), and distributedRAM is ∼34%, 46%, and 8%, respectively. There is a dependence between the maximum image width and the amount of memory or the number of BlockRAMs. The percentage of used BlockRAMs is 77%; for I_w = 1024 px, it is 50%. A maximum image width of more than 4096 px requires external SRAM. For I_w = 8192 px, the number of BlockRAMs is ∼330, which does not fit into the desired FPGA (298 BlockRAMs). The results for an architecture that includes both the Integer DWT and the Float DWT are as follows: for a given maximum segment size S = 128 blocks, the usage of slice registers, LUTs, and distributedRAM is ∼46%, 83%, and 15%, respectively. The percentage of used BlockRAMs is 76%.
The power consumption of the compression system is needed to estimate the total power of a higher level component. The relative power consumption per Mpx∕s of the "Integer only" version is ∼20 mW∕Mpx∕s (total 3.873 W; 70°C junction temperature). The relative power consumption of the "Integer and Float" version is ∼40 mW∕Mpx∕s (total 4.228 W; 70°C junction temperature). The power consumption as a function of the junction temperature is shown in Fig. 16.
In order to compare the results with other approaches, the total power consumption can be normalized with the data compression throughput. The power consumption of a Xilinx FPGA can be reliably estimated with the Xilinx Power Estimator (XPE). Besides the FPGA device type, XPE needs the clock frequency of the design, the output load, parameters of the environment (for junction temperature estimation), the used resources of the FPGA design, and the toggle and enable rates of the logic, BlockRAMs, digital signal processing (DSP) blocks, and I/O cells. For this investigation, the clock frequency is set to the maximum clock frequency of the design (100 MHz). The output load is set to 5 pF. The toggle and enable rates are set to their default values.
Fig. 16 Power consumption of the design on Xilinx Virtex-5 and Virtex-5QV field-programmable gate arrays.
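The normalization of the total power with the data compression throughput can be written as a one-line calculation. The sketch below is only an illustration: the function name is hypothetical, and the nominal throughput of 200 Mpx∕s is assumed as the divisor for the "Integer only" figure of 3.873 W.

```python
def normalized_power_mw_per_mpx_s(total_power_w: float,
                                  throughput_mpx_s: float) -> float:
    """Total power divided by throughput, in mW per Mpx/s."""
    if throughput_mpx_s <= 0:
        raise ValueError("throughput must be positive")
    return total_power_w * 1000.0 / throughput_mpx_s


if __name__ == "__main__":
    # "Integer only" version: 3.873 W total at an assumed 200 Mpx/s.
    print(normalized_power_mw_per_mpx_s(3.873, 200.0))
```

For the "Integer only" version this yields a value slightly below 20 mW∕Mpx∕s, in line with the approximate figure quoted for that version.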
Conclusion
A demonstrator was built to test the real-time capability of the system. The architecture was implemented for a Xilinx Virtex-5QV and a single instance is able to compress images at a rate of 200 Mpx∕s (or 400 Mbyte∕s for 16 bit images). It operates at a clock frequency of 100 MHz and processes two image pixels per clock cycle. The design ensures that all parts of the system have a high utilization and parallelism. The Virtex-5QV allows compressing images with a width of up to 4096 px without external memory. Without external memory or additional interfaces, the power consumption of the architecture is ∼4 W. This example is one of the fastest implementations yet reported and sufficient for recent high-resolution imaging systems. Investigations of the resource and power consumption and of external memory devices show that it will be possible to integrate the architecture directly onto an FPA. For future developments, it is planned to build FPAs with an integrated image compression module. Since the Virtex-5QV has considerable resources, it is conceivable, and in our opinion feasible, to combine detector interface, preprocessing (e.g., flat field correction), image data compression, and ciphering into a single FPGA design, directly mounted onto the FPA.
The results presented in Sec. 6 are based on an internal-memory-only version of the architecture, i.e., no external memory or memory controller is used. Using external memory for wavelet transform and CG is currently being investigated.
Not mentioned in this paper is the fact that the architecture can be used for multispectral compression, since only a spectral decorrelation technique must be added. However, the resource consumption will be quite high, such that a single FPGA might not be sufficient. It is conceivable to modify the architecture to support "resource-shared" multispectral compression. Besides the classic scenario of a store-and-download architecture, more advanced application scenarios are conceivable: change, event, or object detection algorithms can be used in conjunction with an image data compression system. The detected areas can be stored in high quality, whereas the other areas are stored in low quality. More advanced image processing algorithms for "scene interpretation or abnormal event detection" are also conceivable.
Since the download data rate is usually much lower than the image acquisition rate, on-board reassembling of the bit stream can be done with software running on a CPU. It is conceivable to use a radiation-tolerant version of the Freescale P4080, which has eight embedded PowerPC cores running at 1.5 GHz.
