from failing. In this sense, SmallSats are often used as technology demonstrations to reduce time for advancing the state of the art. Space-related applications such as synthetic aperture radar, image processing, and hyperspectral imaging (HSI) require critical data processing to be performed onboard in order to preserve transmission bandwidth. HSI applications, in particular, are characterized by large amount of data in a threedimensional (3-D) cube form. An HSI pixel consists typically of hundreds of components in the spectral domain. The widely used push-broom imagers, such as a lightweight imager of high precision for drone and airborne applications in [1] , record spatial dimensions as a perpendicular cross section of any target along the track direction.
The data compression in on-board HSI processing is performed to lower the amount of data for transmission back to earth. The challenge is to perform extensive data reduction with minimum requirements of limited on-board resources. In addition, researchers in HSI remote sensing urge for no loss of information in the experiments, making lossless compression the preferable compression mode. Even though lossless compression reduces data volume, it does not compromise data integrity and the image can be fully recovered after decompression. Real-time data processing and reconfigurable hardware solutions have become the standard choice for on-board remote sensing application and compression due to their small size, weight, power consumption, and their resistance to damages and malfunctions caused by ionizing radiation [2] . Hybrid processing systems, system-on-a-chip (SoC) platforms, combining different technological solutions such as CPU, GPU, DSPs, and field-programmable gate arrays (FPGAs) are attractive for on-board processing. The sequential portion of algorithms run on the processors, whereas intensive computational operations suitable for parallel implementation are placed in hardware accelerators (in FPGAs on the programmable logic). FPGAs are desirable for space-directed applications because of high performance achieved by creating custom application-specific architectures. Massive computational speedup and low energy consumption are often achieved by implementing highly optimized data paths. By expanding logic resources in current FPGAs, it is also possible to execute several complex algorithmic task in parallel achieving the throughput required for real-time processing. Common SoC devices for space-related applications (CubeSat and other SmallSat single-board computers) used in ongoing low earth orbit missions include the Xilinx Zynq 7000 series, Microsemi Smart Fusion, or Xilinx Virtex-5 as stated in [3] .
The survey also reports the poor performance of rad-hard FPGAs compared to commercial ones (Xilinx Virtex 5QV FX150 versus Zynq-7020).
A typical compression system in image/video processing consists of decorrelation, quantization, and entropy encoding stages. The decorrelation stage contains a prediction process and/or transform process. Most common transforms are KarhunenLoeve transform (KLT), discrete wavelet transform (DWT), and discrete cosine transform. Among algorithms for compressing HSI images, JPEG2000 standard uses DWT for spatial decorrelation, whereas KLT is part of principle component analysis widely used for spectral dimensionality reduction [4] . The low complexity JPEG-LS uses a simple nonlinear predictor and a context-based entropy coder, whereas a differential version of JPEG-LS algorithm encodes differences of adjacent bands. The LCE algorithm [5] , characterized by high tolerance to bit-flips in the encoded stream, is considered as a good candidate for parallel implementation. Recent FPGA implementations of LCE algorithm are presented in [6] and [7] .
The Consultative Committee for Space Data Systems (CCSDS) defines standards for lossless data compressors applicable to 3-D images produced by multispectral and hyperspectral imagers. The CCSDS-121 standard [8] consists of preprocessing unit with prediction and mapping operations and adaptive entropy coder based on Rice coding. A recent FPGA implementation of CCSDS-121 algorithm is presented in [9] . The successor CCSDS-122 [10] shares features with JPEG2000 such as a three-level 2-D DWT, successive quality refinement of images, and a bit plane encoder. An FPGA implementation of CCSDS-122 standard in [11] is tested in EnMap mission satellite. The latest CCSDS-123 standard [12] is characterized by low complexity suitable for fast real-time hardware implementation. Recent FPGA implementations of the CCSDS-123 standard [13] presented in [14] [15] [16] [17] target high throughput and take into account limitations in terms of storage utilization.
In this paper, an efficient high-throughput FPGA implementation of the CCSDS-123 compression standard is proposed. The proposed compression core is capable of processing HSI cube sizes of various imagers. The core supports the majority of user-defined compression parameters proposed by the standard. The achieved throughput is higher in comparison with the state-of-the-art implementations [14] [15] [16] [17] . This paper is structured as follows. Section II presents an overview of the CCSDS-123 standard. The proposed hardware implementation is described in Section III. Results in terms of logic utilization are presented for a variety of compression parameters in Section IV. In addition, performance comparison with the state of the art is described. Finally, conclusions are given in Section V.
II. BACKGROUND
The CCSDS-123 standard [13] is a data compression algorithm for multispectral and hyperspectral images. In this section, detailed algorithm description is given for identifying data dependencies affecting the performance.
An HSI cube is a 3-D array of size (N x , N y , N z ) with sample s z ,y ,x , where z indicates the spectral band and (x, y) are spatial coordinates. The data samples may be also described by the pair of indices (z, t) where t = y · N x + x. In a pushbroom imager, the dimensions x and y correspond to scanned line (cross-track) direction and flight (along-track) direction, respectively. A frame F y is the set of sample values for one y-coordinate. The prediction stage computes estimates of each component based on samples in 3-D space around the processed sample. The difference between the estimate and the actual component value, prediction residual, is then encoded by the entropy encoder. Cube size, dynamic range, output word size, subframe interleaving depth, sample type, scanning order, and entropy encoder type are first encoded in the bitstream. The dynamic range D defines bit resolution of a sample within the range [2, 16] , whereas the output word size B is within the range 1 ≤ B ≤ 8 bytes. The CCSDS-123 standard supports band interleaved (BI) and band sequential (BSQ) encoding orders for defining the arrangement of encoded samples in the input bitstream. The BI ordering reads the image in (y, x, z) order, whereas BSQ ordering reads the image band by band (z, y, x). In BI mode, the parameter subframe interleaving depth M , defined in the range (1, N z ), partitions frame F y into subframes. Each subframe contains M consecutive spectral bands. Special cases of BI ordering are band interleaved by pixel (BIP) for M = 1 and band interleaved by line (BIL) for M = N z .
A. Prediction Stage
Prediction of the image sample s z ,y ,x is computed based on values of nearby predicted samples in the current spectral band and in P preceding bands.
1) Local Sum and Differences:
The local sum σ z ,y ,x is a weighted sum of adjacent samples to sample s z ,y ,x within the spectral band z. The neighbor-oriented local sum is computed as follows: (1) for y > 0 and 0 < x < N X − 1, whereas sample predictions at the edges are given as
The central local differences for P + 1 spectral bands are then defined as
where k = 0, . . . , P is the distance of the previously processed spectral band from the currently processed band z. Reduced prediction mode contains only central local differences d k z ,y ,x , whereas full mode includes in addition the directional
where labels "N," "W," and "NW" define position of neighboring samples with respect to the currently processed sample within the band z. The computed differences are stored in local difference vector U z (t) with size C z = P + 3 for full mode and C z = P for reduced prediction mode.
2) Prediction:
An adaptive linear prediction computes the integer scaled predicted sample valueŝ * z (t) as follows:
where W z (t) is a weight vector with (Ω + 3)-bit signed integer coefficients, and resolution Ω is within range [4, 19] . The roundeds round version of scaled predicted sample value is given ass
where mod *
T U z (t). The final scaled predicted sample values z , for t > 0, is clipped version ofs round s z (t) = clip (s round (t) + 2s mid + 1, (2s min , 2s max + 1)) (9) whereas for t = 0, it is given bỹ
The lower and the upper sample value limit and the midrange sample value [s min , s max ,
for unsigned and signed samples, respectively. The parameters z is then renormalized to the range of the input sample (D-bit quantity)
3) Weight Update: The standard defines default and custom weight initialization for selecting the initial weight vector W z . Default values of vector coefficients ω k z are defined as
Initial directional weight coefficients are set to zero. The weights are dynamically updated based on prediction error e z (t) = 2s z (t) −s z (t) as follows:
Weight update scaling exponent ρ, controlling convergence speed, is given as
Computation of ρ includes parameters ν min , ν max and weight update change interval t inc , which determine the rate at which the predictor adapts to the image content statistics. The ranges of ν min and ν max are −6 ≤ ν min ≤ ν max ≤ 9, whereas the range of t inc is (2 4 , 2 11 ). The exponent ρ is incremented at regular intervals determined by the value t inc and its initial and final values are defined as ν min + D − Ω and ν max + D − Ω, respectively. The final update step for pixel at position t + 1 is given by (15) where ω min = −2 Ω+2 and ω max = 2 Ω+2 − 1.
4) Residual Mapping:
The prediction residual Δ z (t) is the difference between the actual sample value s z (t) and the predicted sampleŝ z (t), Δ z (t) = s z (t) −ŝ z (t). The residual mapping converts the signed predicted residuals to a D-bit unsigned integer, producting the mapped prediction residual δ z (t)
where
B. Entropy Encoding
The CCSDS-123 standard defines a sample-adaptive encoder and a block-adaptive entropy encoder. The focus of this paper is on sample-adaptive encoder, which uses a Golomb 2 n variablelength binary coding technique (GPO2). The maximum produced code word length is U max + D for each incoming mapped residual δ z (t), where the unary length limit U max is defined in the range [8, 32] .
The generation of code words relies on tracking of the average values of the residuals in each band. The average value computation is performed by accumulating the sample values in an accumulator Σ z (t) and dividing the result by the number of processed samples Γ(t). The ratio Σ z (t)/Γ(t) is an estimate of the mean mapped prediction residual within a spectral band. The accumulator Σ z (t) is defined as
and counter Γ(t) is incremented for each sample
The rescaling counter size γ * determines the maximum value of counter Γ z (t). The initial value of the counter and the accumulator is given as
The initial count exponent γ 0 is defined in the range [1, 8] , whereas the range of accumulator initialization constant K is in the range [0, D − 2]. After reaching Γ(t − 1) = 2 γ * − 1, the accumulator and the counter are rescaled in order to increase impact of more recent sample values as follows:
The parameter k z (t) is defined as the largest nonnegative integer k z (t) = max{1 ≤ i ≤ D − 2} satisfying the following inequality:
otherwise k z (t) = 0, for 2Γ(t) > Σ z (t) + 49 2 7 Γ(t) . For an input residual δ z (t) and a given U max , code word generation involves computation of the quotient and residual pair (u z , r z ) as follows:
The parameters k z (t), r z (t), and u z (t) are finally used in code word generation as follows. 1) If t = 0, the code word is the D-bit unsigned integer binary representation of δ z (t). 2) If t > 0 and u z (t) < U max , the code word consists of u z (t) "0s" followed by "1" and the k z (t) least significant bits of δ z (t) corresponding to the remainder r z (t). 3) If t > 0 and u z (t) ≥ U max , the code word consists of U max "0s" followed by the D-bit unsigned integer binary representation of δ z (t).
III. PROPOSED CCSDS-123 ENCODER IMPLEMENTATION
The important aspects of the hardware implementation performance are throughput, algorithmic complexity, and memory/buffering requirements. The computational complexity of the CCSDS-123 algorithm is considered low, so focus is on data dependence identification affecting storage requirements and throughput. Due to the causal nature of the prediction stage, the amount of data to be stored vary with parameter P and with the processing order. In particular, the sample ordering affects storage requirements for local differences and weight vectors, data dependencies between various stages, throughput, and susceptibility to parallel implementation. A. Memory Requirements 1) Neighboring Samples: When encoding a sample s z (t), it is required to store locally the neighboring samples used to compute the local sums and differences. Sample distance is defined as the number of clock cycles from the cycle samples is first streamed in until the cycle when these samples are used as neighbors. Table I lists sample distances required for storing neighbors (W , NW , N , NW ) for different sample orderings.
2) Weight Vectors and Accumulators: Each band uses its own weight vector in the prediction stage and accumulator in the encoding stage. In BIP ordering, it is required to store the weight vector and accumulator for each band due to the fact that sample s z (t + 1) is processed N z samples after s z (t). In BIL ordering, sample s z (t + 1) is processed after sample s z (t) within the same line of the image. When the end of the line is reached, it is required to store the weight vector and accumulator until processing for the same band starts in the next line. This implies the need for storing the weight vector and the accumulator for each band also in BIL ordering. In BSQ ordering, only one weight vector and accumulator are stored since s z (t + 1) is processed after s z (t).
3) Previous Local Differences: For each sample, there is a need to store the central local differences computed in P previous bands of the same pixel. The requirements for storing the local differences are summarized in Table II . In BIP ordering, differences are computed during the P previous cycles. In BIL ordering, the local differences are computed for the whole line of width N x in the P previous bands. The sample distance between the least recent local difference required to be stored and current sample is P × N x cycles. In BSQ ordering, the frames are processed sequentially, implying that the largest sampling distance is P × N x × N y cycles.
A summary of required memory resources in different stages of the compression algorithm for various sample orderings is given in Table III . The BSQ ordering requires large amounts of memory, even for small values of P , for storing the set of N y × N x local difference vectors. Thus, storing local differences in BSQ ordering for a cube of size 512 × 2000 × 128 with 16-b samples and with a default value of P = 3 requires 6.95 MB exceeding available block RAM capacity in a midrange Zynq- 7020 FPGA. For larger P , available block RAM capacity is quickly exceeded even in high-end devices.
B. Data Dependencies
In BIP ordering, the updated weight vectors for all the bands need to be stored. On the other side, the BIP ordering is a good candidate for parallel processing since it is not required to complete prediction of a sample s z (t) before starting prediction of the next sample s z +1 (t). In BIL and BSQ ordering, the prediction and weight update operations have to be completed before starting the prediction of the sample since updated weight vector W z (t + 1) is used for computingd z (t + 1).
C. Architecture Overview
In a typical HSI system, samples are either streamed directly from camera sensor of an imager or from the external memory to the accelerator. Based on the previous analysis in terms of memory requirements, possibilities for parallel implementation, and taking into consideration native layout of the HSI image in memory, the BIP ordering is chosen as sample ordering. An overview of the proposed CCSDS-123 implementation is shown in Fig. 1 . 2) Control Signal Generation: In addition to the sample streaming between the modules in the pipeline, control signals such as x, y, and z coordinates of the current sample are also produced by using a set of counters. The additional signals based on the position of the sample are generated as follows: 1) flags indicating if the sample is in the first line, in the first pixel of a line, the last pixel of a line, or the last sample in the last pixel; and 2) weight update scaling exponent ρ(t), used in the weight update in (14).
3) Local Sum and Difference Computation:
These operations are performed in a three-stage pipeline, as presented in Fig. 3 . The first two stages compute the local sum, whereas the last stage computes the local central difference and the directional differences. The local sum defined by (2) is split across two-pipeline stages in order to reduce delay. In the first stage, expressions term 1 and term 2 are computed as follows:
whereas the local sum is then produced by summation of term 1 and term 2 .
4) Central Difference Store:
The central difference store keeps the local central differences d z −k,y ,x computed in the previous k = 1, . . . , P bands in a shift register shown in Fig. 4 . The computed local central difference d z ,y ,x is stored in the first register, whereas the previous contents are shifted by one position. When z = N z − 1, the contents of the shift register are set to zero so that local differences from the previous pixel are not used for the prediction of a new pixel.
5) Dot Product:
The dot product is performed in a pipeline with variable depth depending on C z . In the first stage, each element in U z (t) is multiplied with the corresponding element in W z (t) followed by a tree of adders, as presented in Fig. 5 . The number of stages in the adder tree is S = log 2 (C z ) . The adder tree is implemented as follows:
S − 2, whereas the initial 2 S − 1 indices are reserved for the multiplication operations s(i) = u i · ω i for 0 ≤ i ≤ 2 S − 1. The result of the dot product operation isd z = s(2 S +1 − 2).
6) Predictor:
The predictor computes the scaled predicted samples z (t) defined in (9) by splitting the operation across two-pipeline stages, as presented in Fig. 6 . In the first stage, fraction temp 1 in the numerator is computed as follows:
where multiplication by a power of 2 is implemented by shifting. The term 4s mid is set to zero since implementation uses signed numbers. The scaled predicted samples z (t) is then computed as follows:
For t = 0, P > 0, and z > 0, the sample s z −1 (t) is used in the computation ofs z (t), otherwises z (t) = 0. A multiplexer finally chooses the result between three defined cases based on status of the flags, parameter P , and band z.
7) Weight Update:
The weight update from (15) is performed in three-pipeline stages, as presented in Fig. 7 . The parameter ρ(t) computed in the control unit is sent along with the input sample. The first stage computes the componentwise product temp 1 = sgn + [e z (t)] · U z (t), equivalent to the change of the sign of the component depending on whether e z (t) is positive or negative. Since e z (t) = 2s z (t) −ŝ z (t), the condition e z (t) < 0 is equivalent to the inequality 2s z (t) <ŝ z (t). This is implemented by choosing between vectors U z (t) and −U z (t) based on the result of the comparison. The second stage computes weight update factor ΔW z (t) from (13) . The parameter ρ(t) has a small range of possible values (at most − 6 to 9), so the expression 2 −ρ(t) temp 1 for each value of ρ(t) is computed in parallel. The operations are left or right shifts depending on the sign of ρ(t). A multiplexer chooses the value of expression 2 −ρ(t) temp 1 based on the computed value of ρ(t). The selected output vector of the multiplexer is added to 1 and shifted to the right. The final stage is the update operation of weight vector W z (t + 1) computed by the sum of the weight vector W z (t) and the weight update factor ΔW z (t).
8) Weight Store:
The weight store keeps the weight vectors in between operations of weight vector updating for band z and reading the weights of the same band for the next sample. A dual port block RAM is used to read a weight vector from one address and simultaneously write the weights at another address. The band z of the incoming sample is used as an address to read the corresponding weights from the weight store. When updating weights, the coordinate z of the new weight vector is used as a write address. Fig. 8 illustrates situation when weight vector W z in (t) in band z in is read and the weight vector W z update (t + 1) in band z update is updated. The relation between the band z update and the band z in of the currently processed sample is given as
where N = 1 + 2 + S + 2 + 3 is a delay that equals to the number of pipeline stages from reading operation in weight store to the final stage of weight update operation. The value of N depends on the number of stages S in tree adder of the dot product module. The time diagram of the described pipeline in Fig. 9 shows that read operation of a weight vector is performed in parallel with the local sum and difference computations. The weight vector from the block RAM is delayed by two cycles, such that the local difference vector and weight vector for the same sample arrive simultaneously at the dot product module.
9) Residual Mapping:
The residual mapping is computed in two-pipeline stages, as presented in Fig. 10 . The first stage computes Δ z (t) and θ z (t) from (17) . The mapped prediction residual δ z (t), defined in (16), is computed then by using a multiplexer to select between three cases. The inequality 0 ≤ (−1)s z (t) Δ z (t) ≤ θ z (t) holds in one of the cases. The expression (−1)s z (t) is equivalent to 1 whens z (t) is even, and −1 whens z (t) is odd. Then, the restated inequality given as s z (t) is even and Δ z ≥ 0 ors z (t) is odd and Δ z (t) ≤ 0 is implemented in hardware as checking whether the LSB ofs z (t) is 0 or 1.
10) Encoding:
The sample adaptive encoder implementation with five-stage pipeline is shown in Fig. 11 . The first and second stages compute the right-hand side of (24). In the third stage, each of the inequalities for i = 1, . . . , D − 2 is evaluated in parallel. Then, parameters k z and u z are chosen based on the results from inequality evaluation by using priority encoder. The highest integer i, for which the inequality with left-hand side Γ(t)2 i holds, is chosen as the value for k z , whereas expression δ z (t)/2 i for chosen i is assigned to u z (t). In addition, a truncated version of δ z (t) with the k z least significant bits is created. The bits are right shifted so that the most significant bits are taken from the truncated δ z (t). In the last stage, the code word is generated based on the defined rules determined by values of U max , u z (t), and position t. Both code word and its length are output of the encoding stage.
11) Bit Packing:
The bit packing module collects variablelength encoded words into packets of a given configurable output word size B. The packing operation is centered around two registers of the size B. The registers alternate between being the current register and next register. Incoming words from the encoder are stored in the current register. When the current register is full, leftover bits are put in the next register, and the data from the current register are sent to the output. In the following cycle, the registers switch roles. Due to the variable length of the incoming code words, the bit position in the current register for storing an incoming word can be any value ranging from the most significant bit N − 1 to the least significant bit 0. Creation of N different candidates for register layout is done by creating words that consist of the i most significant bits of the current register followed by the (U max + D)-bit padded input word for each i ∈ [0, N − 1]. Selection of the correct candidate is performed by using a write pointer that keeps track of the first nonoccupied bit position in the current register. Fig. 12 illustrates an example of the packing process. The left-most register is assigned to be the current register, and the incoming words in the first four cycles are stored into this register. The fifth word is larger than the remaining space in the current register, so the leftover bits are stored as the most significant bits of the next register. In the next clock cycle, the content of the current register is written to the output FIFO, and the next and current registers swap roles. The next input words are written to the newly assigned current register, until it gets full in the eighth cycle.
IV. RESULTS
The factors affecting computational complexity and timing of hardware implementation of CCSDS-123 algorithm are size of the image to be compressed and the configuration parameters, in particular number of bands P for prediction and the sample ordering. The proposed architecture of the CCSDS-123 algorithm is described by the VHDL language, whereas the Vivado tool is used for synthesis, implementation, power estimation, testing, and verification of the proposed implementation on a ZedBoard development board with a Zynq-7020 FPGA. The proposed hardware implementation supports the majority of the parameter settings defined by the compression standard, including full ranges of bit depths, number of previous bands for prediction P , and output word size B. The proposed hardware implementation supports both on-the-fly processing of one input sample per clock cycle and offline processing from the external memory. For on-the-fly processing, N z must be larger than the number of stages N to avoid data corruption. This is due to the fact that the sample s z (t + 1) arrives N z clock cycles after s z (t), and it takes N clock cycles from s z (t) arrives until the weight for s z (t + 1) is stored. The block-adaptive entropy encoding and custom initialization of weight vectors are not supported. The chosen scanning order is BIP based on the memory requirement analysis performed in the previous section. Fig. 13 shows the complete pipeline of the proposed implementation. It is observed that the processing chain after a sample is input until the code word generated takes 15 clock cycles. The packing stage takes a variable number of clock cycles depending on the size of encoded words and output word size B.
A. Logic Utilization Results
Table IV lists the default values of the CCSDS-123 parameters used in performance and logic utilization analysis. The cube sizes of multispectral and hyperspectral imagers-Moderate Resolution Imaging Spectroradiometer (MODIS) [19] , Hyperion [20] , Airborne Visible Infrared Imaging Spectrometer (AVIRIS) [21] , and Hyperspectral Imager for the Coastal Ocean (HICO) [22] and their total resource utilization in terms of LUTs, registers, BRAM, and DSP blocks of the proposed CCSDS-123 implementation-are presented in Table V .
Detailed block-level resource utilization for the HICO cube size is further elaborated. The area resources are dependent on a number of parameters. In particular, the number of bands used for prediction P defined within the range [0, 15] can significantly affect the area utilization. The area utilization in terms of LUTs and registers for the modules dot product, predictor, and weight update dependent on parameter P are given in Tables VI and VII, respectively. Both the number of LUTs and the number of registers scale linearly with P , as illustrated in Figs. 14 and 15 . The LUTs and register resources used in other modules, which are not dependent on P , are presented in Table VIII . For the default value of P = 3, 2952 LUTs and 2469 registers are used corresponding to 5.5% and 2.3% utilization on a Zynq-7020 FPGA, respectively. Block RAM utilization is shown in Fig. 16 Tables IX and X,  respectively. The estimated maximum operating frequency for HICO size images is approximately 147 MHz, where the top ten critical paths of the design are in the first pipeline stage of the local sum computation. The pipeline implementation allows onthe-fly processing of one sample per clock cycle, achieving a throughput of 147 Msample/s. The power estimation performed in Vivado tool using default parameters and HICO cube size is 0.295 W.
B. Verification
Testing of the complete core was initially performed in simulations against the Emporda compressor [23] for a selection of parameters and for a number of edge cases with cubes in all three dimensions restricted to less than 100 elements to reduce simulation time. The core was tested on a set of special test pattern images produced by CCSDS for verifying that certain edge cases with overflows are handled correctly.
An automated verification system is also created comparing the compressed bitstream from the implemented design with the compressed bitstream from the Emporda software [23] . The proposed verification flow shown in Fig. 17 allows the design to be automatically checked for different sets of generic parameters yielding greater confidence in the robustness of the design. The last verification stage for assessment of processing performance and cross-testing of the proposed FPGA implementation with Emporda reference software uses well-known hyperspectral datasets collected by imagers listed in Table V . The core was also tested on a ZedBoard development board by loading these cubes in BIP ordering into the external memory and by using Xilinx AXI DMA core [24] for streaming data to the proposed accelerator. The resulting bitstream was successfully compared to the expected compressed image bitstream.
C. Comparison
The proposed CCSDS-123 implementation is compared with recent state-of-the-art FGPA implementations [14] [15] [16] [17] , as presented in Table XI . The sensor maximum data rates for AVIRIS, AVIRIS NG, and HICO imagers are listed representing realtime constraint for data processing. The lack of implementation details of various stages in state-of-the-art works limits the comparison to the maximum frequency, the throughput performance, power, and to ability to perform on-the-fly compression by streaming one sample per clock cycle directly from the sensor without use of external memory. The data dependencies in BSQ ordering significantly limit the throughput in hardware implementation, whereas BIP ordering provides theoretical throughput of one compressed sample per clock cycle.
In HyLoC implementation [15] , the maximum throughput depends on the selected configuration parameters. Parameter P affects the number of cycles per sample processing. The maximum operating frequency varies between 43.0 and 43.9 MHz for RTAX1000S FPGA, achieving throughput of 1.75 and 3.5 MSamples/s for P = 15 and P = 3, respectively. On Virtex-5 FPGA, the achieved throughput is 11.3 MSamples/s for P = 3. The advantage of this implementation is removal of internal storage requirements for neighboring samples for BSQ ordering. Instead of storing intermediate results, the data are recalculated for each input sample requiring also neighboring samples to be fetched from the external memory. In this manner, the irregular data access patterns are introduced in communication with external memory.
SHyLoC implementation [16] consists of two IP cores-the CCSDS-121 and the CCSDS-123-supporting all three compression orderings. There is a need for different architectures of the CCSDS-123 standard for each ordering. The design is optimized for default value of P. The achieved throughput is higher in BIP than in other two orderings. The implementation requires the use of external memory.
The approach in [18] provides both the use of external memory as a buffer for storing data cube and on-the-fly data streaming directly from the sensor. The achieved throughput for Virtex-4 and Virtex-7 FPGAs is 23 and 48 MSamples/s, respectively. FPGA implementation proposed in [17] achieves throughput of 110 MSamples/s on Virtex-5 FPGA, streaming both current sample to process and its N and NW neighbors from the external memory. This disables further improvements that can be introduced by parallel implementation due to limited I/O capabilities of the supporting system.
The proposed implementation achieves throughput of 147 MSamples/s, with negligible initial latency of 15 clock cycles, after which the data are output after bit packing stage in B-byte words. On-the-fly compression of data streamed in BIP ordering directly from the sensor is supported. Processing of data stored in external memory in the BIP ordering is also supported by the use of AXI DMA core. The improved performance is achieved at the expense of increased BRAM utilization compared to [15] . However, the BRAM utilization corresponding to 1.327 Mbits occupies only 26% of the available resources on Zynq-7020 FPGA for D = 16 and P = 3.
V. CONCLUSION
In this paper, an efficient FPGA implementation of the CCSDS-123 compression algorithm with BIP ordering is presented. The complete pipeline processing one input sample per clock cycle is described in detail and efficient hardware solutions are presented for a number of stages in the algorithm. The implementation is tested against available reference software and it is fully compliant with standard allowing user-defined parameter adjustments. The achieved throughput is 147 MSamples/s for 16-b samples, which represents better performance than the proposed state-of-the-art implementations.
