This paper proposes a high-throughput lossless image-compression algorithm based on Golomb-Rice coding, along with its hardware architecture. The proposed solution increases compression ratios (CRs) while preserving throughput by exploiting a novel parallel variable-length sign coding (PVSC) algorithm that reduces the number of sign bits to achieve a higher CR. In addition, the proposed solution adopts and modifies two existing compression techniques to improve the overall compression performance. The experimental results show that the proposed solution yields an average CR of 3.12, which is higher than those achieved with the previous algorithms. The hardware implementation of the proposed solution for an 8×8 block unit achieves a throughput of 24 GBps when encoding and 18 GBps when decoding. This hardware performance is sufficient to handle 7680 × 4320 image processing at 240 Hz.
I. INTRODUCTION
In recent years, high-definition (HD) images, such as full HD (1920 × 1080), quad HD (QHD, 2560 × 1440), and ultra HD (UHD, 3840 × 2160 or 7680 × 4320), have been used in mobile devices, PCs, and TVs. To handle 4:2:0 YUV images at 30 Hz, full HD requires a processing capability of 93 MBps, QHD requires 166 MBps, and UHD requires 373 MBps or 1.5 GBps. These processing speeds increase to 3 GBps or 12 GBps if the UHD scanning frequency is 240 Hz. With the rapid improvement in image resolution in the latest video systems, the bus bandwidth needed to access the images stored in the frame buffer has increased dramatically. In addition, the memory bandwidth requirement has become one of the most pressing issues in binocular video applications such as virtual reality systems, as they require twice the throughput. Therefore, there have been many studies on image-compression techniques to alleviate this problem.
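For reference, the bandwidth figures above follow directly from the frame geometry; the short sketch below reproduces the arithmetic (an illustrative calculation only, not part of the proposed solution).

```python
def yuv420_bandwidth_gbps(width, height, refresh_hz):
    """Raw frame-buffer bandwidth of an uncompressed 4:2:0 YUV stream (1.5 bytes/pixel)."""
    return width * height * 1.5 * refresh_hz / 1e9

print(yuv420_bandwidth_gbps(1920, 1080, 30))    # ~0.093 GBps (full HD at 30 Hz)
print(yuv420_bandwidth_gbps(3840, 2160, 240))   # ~3 GBps (4-K UHD at 240 Hz)
print(yuv420_bandwidth_gbps(7680, 4320, 240))   # ~12 GBps (8-K UHD at 240 Hz)
```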
Image-compression techniques are classified into two categories: lossy and lossless. The quantization in lossy methods increases compression ratios (CRs), but data loss can occur. Lossless compression methods have lower CRs than lossy methods, but they allow the original data to be perfectly reconstructed from the compressed data. As a result, lossless compression is well suited as a frame-buffer recompression algorithm, for example in liquid crystal display (LCD) overdrive [40]. Here, data redundancy is generally eliminated in the prediction stage, and the outcome is compressed via entropy coding. The available prediction methods are either spatial-based (e.g., CALIC [2], LOCO [3], DPCM [4]) or transform-based (e.g., wavelet analysis [5]) in nature. The coding strategies commonly used for entropy coding include Golomb-Rice coding and Huffman coding [39].
To address the memory bandwidth problem in high-resolution images without quality degradation, a number of lossless embedded compression (LEC) techniques have been proposed [7], [8], [18], [23]-[38], [42]. However, the studies in [8], [18] point out that the previous LEC schemes in [23], [27]-[31], [33]-[36] are not sufficient to handle high-performance applications such as HD video sequences in real time due to heavy data dependency, high hardware complexity, and low throughput. The works in [7], [32] achieve higher throughput by employing line-based algorithms. In this approach, the images are compressed line by line, and non-power-of-two 3D textures are supported. According to the findings in [37], block-based algorithms generally yield better compression performance than line-based algorithms.
The studies in [8], [18], [36]-[38], [42], [43] increase CRs by proposing block-based prediction algorithms or by enhancing entropy coding algorithms. These studies also implement hardware for high-resolution image processing. In [36], [38], the implemented hardware can process 4-K images with a high CR, but it is unable to process more than 3 pixels per cycle due to data-processing dependency. The methods in [18], [37], [43] achieve high CRs while processing 5.1 pixels/cycle, 10.7 pixels/cycle, and 10.67 pixels/cycle when encoding and 14.2 pixels/cycle, 21.3 pixels/cycle, and 10.67 pixels/cycle when decoding. In other words, they still encounter the problem of data dependencies. During compression or decompression, they take an N × N block in a frame as the basic unit, which cannot be applied to non-power-of-two 3D textures.
In [8], our previous work and the base model to be improved in this work, differential-differential pulse-coded modulation (DDPCM) prediction is performed on various M × N blocks of the original image frame, and the prediction errors are encoded using Golomb-Rice coding. The hardware architecture in [8] can perform massively parallel processing in both the variable-length coding stage and the prediction stage. It compresses and decompresses a block every cycle, achieving a 6-12 times performance improvement over the comparative models [13], [26]-[29]. However, in terms of CR, the model in [8] still has room for improvement.
In this paper, we propose a lossless compression solution (algorithms and architecture) that increases CRs while preserving the massively parallel pixel-processing architecture suggested in [8]. For this purpose, the following three techniques are utilized:
• sign-bit field compression
• efficient use of spatial locality in image data
• flexibility in the use of the k parameter

The first is realized by developing a new parallel variable-length sign coding (PVSC) algorithm. The second and third are realized by adopting and modifying the previous compression solution, replacing DDPCM with DPCM and adopting an adaptive-k instead of a fixed-k, respectively. Kim et al. [8] predicted the pixels in the leftmost column with a vertical prediction method and the remaining pixels with a horizontal prediction method; we follow the same scanning order, as sketched below.
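For illustration, a minimal behavioral sketch of this prediction order is given below (Python reference code written for this description; the boundary handling and block sizes of the actual design may differ).

```python
def dpcm_residuals(block):
    """DPCM prediction for an M x N block: the top-left pixel is the seed,
    the leftmost column is predicted vertically (from the pixel above),
    and all other pixels are predicted horizontally (from the pixel to the left)."""
    M, N = len(block), len(block[0])
    res = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            if i == 0 and j == 0:
                res[i][j] = block[i][j]                     # seed (stored as-is)
            elif j == 0:
                res[i][j] = block[i][j] - block[i - 1][j]   # vertical prediction
            else:
                res[i][j] = block[i][j] - block[i][j - 1]   # horizontal prediction
    return res
```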
The hardware architecture of the proposed algorithm enables each sign bit to be coded in parallel (PVSC), thus allowing massively parallel processing. In addition, the proposed architecture eliminates the pipeline latency that can occur in variable-length sign decoding. In the experiments with six full-HD benchmarks, the proposed solution yields an average data-reduction ratio of 70%. The four-stage pipelined encoder and decoder implemented in a 55-nm fabrication process have maximum clock frequencies of 370 MHz and 286 MHz and gate counts of 83 K and 121 K, respectively. Owing to pipeline depth adjustment, logic optimization, and the high-end manufacturing process, this implementation achieves a throughput of 24 GBps, which is higher than the hardware performance (13 GBps) reported in [8].
The proposed architecture has the following three characteristics. First, it exceeds the throughput requirement (about 12 GBps) necessary to support a high-end screen refresh rate (e.g., 240 Hz) for 8-K images. Second, its throughput relative to the hardware size is more than 1.8 times higher than that in [37]. Third, there is no tradeoff between hardware performance and compression efficiency: despite the improved hardware performance, the CR of the proposed solution is as good as those reported in the latest literature on lossless compression technologies.
The remainder of this paper is organized as follows. Section II reviews the previous works related to the topic. Section III introduces the proposed architecture for our parallel algorithm and the parallel packing/unpacking scheme for the two variable-length coded data (unary and PVSC) without individual length information. Section IV presents the experimental results for the algorithm and hardware performance. Section V concludes the paper.
II. RELATED WORK
In many LEC methods, Golomb or Golomb-Rice algorithms are used for entropy coding [8], [11], [12], [13], [22], [32]. Golomb-Rice coding divides a non-negative integer (an input value) into two parts: quotient q and remainder r. The quotient is sent in unary coding; unary coding represents a natural number n as n ones followed by a zero (a unique terminating symbol). The remainder r is sent as a k-bit binary number, since the Rice divisor is 2^k. In [8], [13], [32], a fixed k value is used in Golomb-Rice coding.
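As a minimal sketch of this coding scheme (illustrative Python, with codewords shown as bit strings rather than packed hardware words):

```python
def golomb_rice_encode(value, k):
    """Golomb-Rice codeword of a non-negative integer:
    unary-coded quotient (q ones + terminating zero) followed by a k-bit remainder."""
    q, r = value >> k, value & ((1 << k) - 1)
    remainder = format(r, "0{}b".format(k)) if k > 0 else ""
    return "1" * q + "0" + remainder

# With the fixed k = 2 used in [8]: 11 -> quotient 2, remainder 3 -> "110" + "11"
print(golomb_rice_encode(11, 2))  # "11011"
```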
As far as the authors are aware, the massively parallel pixel-processing architecture for Golomb-Rice coding was first proposed in [8]. This architecture consists of DDPCM and Golomb-Rice encoding with a fixed-k (k = 2) value. The original image frames are organized as M × N sub-window arrays, to which DDPCM is applied, thereby producing one seed and M × N − 1 pieces of differential data. The Golomb-Rice algorithm then encodes the differential data into a variable-length codeword. The study in [8] noted that the position of the unique terminating symbol in a variable-length codeword indicates the original data; based on this, a hardware architecture for parallel encoding and decoding was proposed. The experiments performed with 8 × 8 blocks show that this architecture achieves fully parallel processing of 64 pixels/cycle, but its CR is 1.52, which is rather low.
According to [37], variable-length coding algorithms are difficult to implement with high throughput in hardware; therefore, its reference frame recompression scheme uses semi-fixed length (SFL) or significant bit truncation (SBT) algorithms that encode the prediction errors of a group with the same number of bits and store fixed-length data for each group. In [18], the hierarchical average and copy prediction (HACP) algorithm, which processes N × N blocks at L levels, is used to create prediction residuals, and the prediction errors are entropy coded using SBT. Reference [37] proposes the multi-mode DPCM and averaging (MDA) prediction algorithm for N × N blocks, which combines the advantages of DPCM scanning and averaging, and uses SFL for entropy coding. The compression algorithm in [18] has a parallelism of 5.1 pixels/cycle in compression and 14.2 pixels/cycle in decompression for a 16 × 8 block unit. The algorithm in [37] yields a parallelism of 10.7 pixels/cycle in compression and 21.3 pixels/cycle in decompression for an 8 × 8 block unit. Although the algorithms in [18], [37] are not fully parallel, they achieve high CRs of 2.2 and 2.49, respectively.
III. PROPOSED ARCHITECTURE
This section describes the proposed algorithm and architecture for high-throughput lossless compression, covering the overall architecture, the pipeline stages, PVSC, the adaptive-k parallel scheme for Golomb-Rice coding, and the packing/unpacking scheme for a codeword that contains two variable-length data items.
A. ARCHITECTURE OVERVIEW

Figure 1 shows the proposed architecture. The solid-line components are identical to those proposed in [8], and the dotted-line components are the newly proposed parts presented in this paper. The seven dotted-line boxes in the compressor and decompressor are classified into five groups: 1) DPCM/InvDPCM, 2) KSplitter, 3) SignENC/SignDEC, 4) ZeroDT, and 5) VLSplitter.
The DPCM/InvDPCM component, a replacement for the previous DDPCM/InvDDPCM prediction algorithms in [8], enables a slight increase in the CR. The KSplitter component replaces the previous fixed-k (k = 2) algorithm with a block-based adaptive-k algorithm in Golomb-Rice coding, compensating for the CR degradation associated with the fixed-k algorithm. The SignENC/SignDEC component enables parallel processing by resolving the data-dependency problem in the proposed PVSC algorithm, which is a variable-length algorithm. The ZeroDT component eliminates the latency that occurs in PVSC decoding by accelerating the restoration of the sign bits that are deleted during encoding. Finally, the VLSplitter splits the unary code and the PVSC code from the packed data without requiring knowledge of their individual lengths.
The pipeline architecture consists of four stages, as shown in Figure 2. One stage is added to the compressor of [8] to perform the proposed algorithm; the adaptive-k condition and the shift amount (SA) are calculated in this added stage. If a macro block is given to the compressor at time t0, DPCM, the adaptive condition/SA calculation, and unary/sign encoding are performed from time t1 to t3, respectively, and the codeword is completed at time t4. One stage is also added to the decompressor, but unlike in the compressor, this stage alleviates the critical timing path. If the codeword is given to the decompressor at time t0, the SA/quotient reconstruction and the two steps of inverse DPCM are performed from time t1 to t3, respectively, and a macro block is reconstructed at time t4.
The algorithm proposed for the decompressor is performed in parallel at time t1, when the quotient is reconstructed (see Figure 2). Inverse DPCM is separated into two stages: half of the mathematical operations are computed and stored in the first stage, and the other half are completed from the stored partial results in the next stage. This technique can be applied to mathematical operations that have no feedback or branch path, such as unary coding or DPCM in our algorithm. Considering the tradeoff between area and performance, we apply it to the inverse DPCM.
Our compressor and decompressor each have a latency of four cycles because of the four-stage pipeline. There is no throughput drop relative to [8] because a macro block is compressed or decompressed every cycle once the initial pipeline latency has elapsed.
The data-compression flow of the proposed architecture is as follows. The DPCM component takes the pixel image data of an M × N block to be compressed and produces residual data for the M × N block by eliminating data redundancy. The residual data are split into the first element (seed), which is exempted from compression, and the prediction error field (consisting of (M × N) − 1 prediction errors), which is to be compressed. The SignCONV component takes the prediction error field and splits it into the sign field consisting of (M × N) − 1 sign bits and the magnitude field consisting of (M × N) − 1 magnitudes. The SignENC component takes the sign field and produces variable-length sign data by referring to the magnitude data (see Section III-B). The SignENC component performs massively parallel bit processing to achieve a throughput as high as that achieved with the previous algorithms (see Section III-C). The Golomb-Rice encoder takes the magnitude field and produces a variable-length codeword.
During Golomb-Rice encoding, the KSplitter finds the k that is optimized for the code length (see Section III-E) and produces the remainder of (M × N − 1) × k bits. The UnaryENC component performs the massively parallel processing proposed in [8] on the quotient field obtained by splitting the magnitude field with k and produces a variable-length unary code. The produced variable-length unary code, the seed, the variable-length sign data, and the (M × N − 1) × k remainder bits are packed, creating the final codeword (see Section III-F).
The decoding flow of the proposed architecture, which is the reverse of the encoding flow, is as follows. The compressed variable-length codeword is unpacked and split into the seed, the remainder, and the variable-length data. The VLSplitter takes the variable-length data and divides it into the variable-length sign code and the unary code. In the SignDEC component, the variable-length sign code, together with the sign bit data that are partially restored in the ZeroDT using the unary code (see Section III-D), is restored to the sign field. At the same time, the Golomb-Rice decoder restores the magnitude field using the input unary code and the remainder. The restored sign field and magnitude field are reconstructed into the (M × N) − 1 signed residuals. Finally, the InvDPCM component generates the original data of the M × N block from the signed residual data and the seed.
B. PROPOSED PVSC ALGORITHM
Golomb-Rice coding, widely used in existing LEC algorithms, is a compression technique for non-negative integers. To use Golomb-Rice coding, prediction errors are generally converted to non-negative numbers in the prediction stage via mapping, as in JPEG-LS [1], [10], [12], [28], or via preprocessing, as in FELICS [13], [32]. In [8], the sign field produced after the prediction stage is stored without being compressed. In this paper, the PVSC algorithm, which compresses the sign field, is proposed.
The proposed PVSC algorithm is based on the fact that the sign bits of +0 and −0 are redundant in signed-magnitude number representations. During encoding, the proposed PVSC algorithm concatenates the sign bits of the non-zero magnitudes. After passing through the PVSC encoding stage, the zero-magnitude sign bits are removed, which contributes to increasing the CR. During decoding, the zero-magnitude sign bits are automatically restored to zero, and the non-zero magnitude sign bits are restored from the input PVSC code.

Figure 3 shows an example of PVSC coding with five prediction errors when k = 1. Figure 3 (a) illustrates PVSC encoding. Among the prediction errors a0-a4, those with both a zero quotient and a zero remainder are found. Such a prediction error is 0, so its sign bit is unnecessary (removable). In Figure 3 (a), a1 and a3 have both a zero quotient and a zero remainder, so their sign bits can be removed. Therefore, the input sign bits ''1, 0, 0, 0, 1'' become ''1, 0, 1'' after passing through PVSC encoding. Figure 3 (b) presents the decoding process. During PVSC decoding, the prediction errors with both a zero quotient and a zero remainder are found, and their sign bits are reconstructed to 0. The sign bits of the other data are recovered from the PVSC code. In Figure 3 (b), a1 and a3 satisfy zero detection (i.e., the output of the zero-detection stage is ''true''), so their sign bits are restored to 0. The sign bits of a0, a2, and a4 are restored from the PVSC code ''1, 0, 1'' in a sequential manner. Finally, the restored sign field is ''1, 0, 0, 0, 1''.
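At the algorithm level, the PVSC idea can be summarized by the following sequential reference sketch (illustrative Python; the parallel hardware realization is described in Section III-C, and the example magnitudes are made up to match the shape of Figure 3):

```python
def pvsc_encode(signs, magnitudes):
    """Keep only the sign bits of residuals with non-zero magnitude."""
    return [s for s, m in zip(signs, magnitudes) if m != 0]

def pvsc_decode(code, magnitudes):
    """Restore the full sign field: zero-magnitude positions get sign 0,
    the remaining positions are filled from the PVSC code in order."""
    bits = iter(code)
    return [0 if m == 0 else next(bits) for m in magnitudes]

# Example shaped like Figure 3: a1 and a3 are zero, so their sign bits are dropped.
signs, mags = [1, 0, 0, 0, 1], [1, 0, 2, 0, 3]   # magnitudes are illustrative
assert pvsc_encode(signs, mags) == [1, 0, 1]
assert pvsc_decode([1, 0, 1], mags) == signs
```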
C. SignENC/SignDEC: PARALLEL ARCHITECTURE FOR PVSC
The PVSC algorithm described in Section III-B either deletes the unnecessary sign bits or reorders the remaining sign bits using shifts during compression. This can be done in a sequential manner, as shown in Figure 4 (a). During sequential processing, concatenating the sign bit of each residual with the previously encoded variable-length sign code is repeated sequentially, which gives rise to long latency.
As represented in Figure 4 (b), this paper proposes an architecture that enables parallel processing of the proposed PVSC algorithm. To provide parallelism during compression, two different types of components are introduced: a single SignDELPosition component produces the bit-position information simultaneously, and multiple SignBitEncoder (SBE) components perform the bit encoding.
The SignDELPosition checks the magnitude field to determine whether each sign bit should be deleted or reordered and calculates the corresponding SA and the total sign-bit length. If a sign bit needs to be deleted, its SA becomes a fixed value equal to the field length (M × N) − 1. If the sign bit needs to be reordered, the SA is the number of sign bits that have been deleted up to the current bit position. Each SBE encodes its 1-bit sign data by shifting it by its SA and generates a code of length (M × N) − 1. If an SA is (M × N) − 1, its sign bit is shifted out of range and is eventually deleted. Finally, a bitwise OR operation is performed on all the encoded sign data of length (M × N) − 1, producing a PVSC code whose variable length is calculated by the SignDELPosition.

Figure 5 depicts the encoding and decoding of four sign data items in the proposed parallel-processing architecture. The SignENC represented in Figure 5 (a) encodes a 4-bit sign field into a variable-length sign code. The SignDELPosition takes the magnitude field as input, determines whether to delete or shift the sign bit of each data item, and calculates the SA of each data item using (1) and the total variable length using (2). Equation (1) is the SA calculation formula; here M denotes the magnitude and the inversion symbol denotes zero detection. That is, when the i-th magnitude is non-zero, the i-th SA is the number of zero magnitudes from the 0th to the i-th position; when the i-th magnitude is zero, the i-th SA is n, the field length. Equation (2) gives the PVSC length, which is the number of non-zero magnitudes in the magnitude field (i.e., the number of retained sign bits). If the magnitude field {2, 0, 7, 0} is input, the SAs of the second and fourth data items, which have zero magnitude, are the field length n (i.e., n = 4). The SAs of the first and third data items, which have non-zero magnitude, are the accumulated numbers of removed sign bits, that is, 0 and 1, respectively. As a result, the SAs are {0, 4, 1, 4}.
Each SBE encodes an individual sign bit of the input sign field. When the given sign field is {1, 0, 0, 0}, each bit of the sign field is sent to its SBE (SBE1, SBE2, SBE3, SBE4) in order. In SBE2 and SBE4, where the SA is 4, the sign bits are shifted out of range and thus deleted. The SAs of the first and third SBEs are 0 and 1, so they produce {1, 0, 0, 0} and {0, 0, 0, 0}, respectively. A bitwise OR operation is performed on all SBE outputs, creating {1, 0, 0, 0}. Finally, the PVSC code becomes {1, 0} with a 2-bit length that comes from the SignDELPosition.

Figure 5 (b) illustrates how the SignDEC component restores the variable-length sign code {1, 0} into the 4-bit sign field {1, 0, 0, 0}. To restore the sign field, the zero-detection result of the magnitude field, {0, 1, 0, 1}, is used. The magnitudes of the second and fourth data items are 0, so their sign bits are reconstructed to 0. To restore the sign bits of the first and third data items, the PVSC code is decoded. The SignRECPosition component accumulates the number of zero-magnitude data items, which makes the SA values of the first and third data items 0 and 1, respectively. The SignBitDecoder1 (SBD1) creates {1, 0, 0, 0} by shifting the PVSC code bit ''1'' zero times. SBD2 produces {0, 0, 0, 0} by shifting the PVSC code bit ''0'' one time. A bitwise OR operation is then performed on all the created sign fields and on the reconstructed zero-magnitude sign field, restoring the final 4-bit sign field {1, 0, 0, 0}.
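The shift-amount scheme can be modeled in software as follows (a behavioral sketch; in hardware all SAs and SBE shifts are produced concurrently rather than in a loop):

```python
def sign_enc_parallel(signs, magnitudes):
    """Behavioral model of SignENC: SignDELPosition derives an SA per bit and the
    PVSC length; each SBE places its sign bit at position i - SA (bits with SA = n
    fall out of range and are deleted); all per-bit fields are OR-ed together."""
    n = len(magnitudes)
    sas, zeros = [], 0
    for m in magnitudes:
        sas.append(n if m == 0 else zeros)   # SA: field length for zero magnitudes
        zeros += (m == 0)                    # running count of deleted sign bits
    length = n - zeros                       # PVSC length = retained sign bits
    merged = [0] * n
    for i, (s, sa) in enumerate(zip(signs, sas)):   # one iteration per SBE
        pos = i - sa
        if 0 <= pos < n:
            merged[pos] |= s
    return merged[:length]

# Figure 5 example: magnitudes {2, 0, 7, 0}, sign field {1, 0, 0, 0} -> PVSC code {1, 0}
assert sign_enc_parallel([1, 0, 0, 0], [2, 0, 7, 0]) == [1, 0]
```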
D. ZeroDT: AN ADDITIONAL COMPONENT FOR LATENCY REDUCTION
In the proposed PVSC algorithm, decoding is performed in two stages. As shown in Figure 6 (a), the Golomb-Rice decoder restores the magnitude field in the first stage. In the second stage, the sign field is restored using the zero-detection output of the magnitude field and the PVSC code. That is, two-stage processing is needed to restore the original data.
As shown in Figure 6 (b), the proposed architecture introduces an additional component called ZeroDT that removes the dependency between the first and second decoding stages. The basic idea is that zero detection is possible by identifying code segments that consist only of terminating symbols (i.e., successive zero symbols) in the unary code. For example, the 16-bit unary code 1011101111100110 is decoded into {1, 3, 5, 0, 2}. The fourth element, which corresponds to the successive terminating symbols (''00''), can be pinpointed during zero detection. With ZeroDT, sign-field restoration and magnitude-field restoration can be performed independently and simultaneously in a single pipeline stage.
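A behavioral sketch of this detection is shown below (illustrative Python; in hardware the flags are derived combinationally from the unary bitstream, and the fixed-length remainder bits are checked separately to confirm a fully zero residual):

```python
def zero_detect_from_unary(bits):
    """Return one flag per decoded value: True where the unary segment contains no
    '1' before its terminating '0', i.e., the quotient of that value is zero."""
    flags, run = [], 0
    for b in bits:
        if b == "1":
            run += 1
        else:                       # terminating symbol closes one value
            flags.append(run == 0)
            run = 0
    return flags

# The 16-bit unary code from the text decodes to {1, 3, 5, 0, 2};
# only the fourth value is flagged as zero.
assert zero_detect_from_unary("1011101111100110") == [False, False, False, True, False]
```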
E. KSplitter: AN ADAPTIVE-k BIT SPLITTER
According to [13], the FELICS algorithm uses a simple and efficient method for selecting the k parameter of the Golomb-Rice (GR) code, but it also gives rise to heavy data dependencies that limit parallelism during compression. Another work [41] improves the compression efficiency by using an adaptive divisor k; it determines the k of the current block from that of the previous block, which creates a dependency between blocks when calculating the current k. Due to this block-dependency issue, only sequential processing is possible, and random access cannot be implemented. In [8], [13], the implemented hardware uses a fixed-k value (k = 2) for Golomb-Rice coding. The KSplitter of the proposed architecture replaces the fixed-k (k = 2) algorithm for Golomb-Rice coding in [8] with a block-based adaptive-k algorithm. This compensates for the loss of compression efficiency (CR) related to the fixed-k algorithm. Note that k is still fixed within a block to avoid data-dependency issues.

Figure 7 presents the hardware architecture of the KSplitter that finds an adaptive k in block-based Golomb-Rice encoding. In the KSplitter component, k values are in the range of 0-3 because experimental results show that CRs are not significantly affected when the parameter k is 4 or greater. Each LenUNARY computes the length of the Golomb-Rice code that is created when a given k is applied to the input magnitude field. The Golomb-Rice code length is the sum of the unary code length and the remainder length; the unary code length is proportional to the sum of the quotients, and the remainder length is a fixed length determined by k. The splitter separates the quotient field from the magnitude field using the k value determined in the SEL_K component and sends the quotient field to the UnaryENC component.
Unlike in the sequential processing case, k can be determined for each block in a single cycle without increasing latency, because the proposed technique simultaneously obtains the unary lengths for all candidate k values. In addition, k is determined before unary encoding because the proposed architecture places the KSplitter component ahead of the UnaryENC component.
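The k selection can be expressed compactly as follows (an algorithmic sketch in Python; the KSplitter evaluates all candidate lengths in parallel in hardware, whereas this model simply iterates over them):

```python
def select_k(magnitudes, candidates=(0, 1, 2, 3)):
    """Choose the k in 0..3 that minimizes the Golomb-Rice code length of the block:
    each value costs (quotient + 1) bits for the unary part plus k remainder bits."""
    def gr_length(k):
        return sum((m >> k) + 1 + k for m in magnitudes)
    return min(candidates, key=gr_length)

# Example: a block of small magnitudes favors a small k.
print(select_k([2, 0, 7, 0, 1, 1, 3, 0]))  # -> 1
```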
F. PACKING OF VARIABLE-LENGTH DATA

Figure 8 shows two types of data-pack formats. Figure 8 (a) presents the format used in the previous algorithm. It contains one variable-length field, Unary data, and stores the length of the entire data in the Length field. Figure 8 (b) shows the data-pack format of the proposed solution. There are two variable-length data fields, Unary data and Sign data, and an additional fixed-length field, adaptive-k (AK). To prevent a decrease in the CR, the individual lengths of the two variable-length codes are not stored; only the total length of the data is stored in the Length field. If the value of the Length field is greater than or equal to the original data length, the original data are stored instead. In this case, since a separate flag indicating whether the data are compressed is unnecessary, the total length of the codewords is prevented from exceeding the frame-buffer size. As represented in Figure 8 (c), the unary code and the sign code that compose a variable-length data item place their first bits at the two opposite ends of their field.
When decoding packed data with the format shown in Figure 8 (b), the fixed-length data are separated and sent to the appropriate decoding components. The fixed-length data include the 2-bit AK field, the 8-bit seed field, and the remainder field, whose length is determined by AK; the remaining data consist of the two variable-length data items. In the VLSplitter, this remaining field is split into the two variable-length data items (i.e., a unary code and a sign code) by retrieving their bits starting from the opposite ends of the field. If the value of the Length field is equal to the original data length, the data at the location of the unary data are the uncompressed original data.
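A toy model of this two-ended packing is given below (bit strings for readability; the field widths and '0' padding are purely illustrative, and the real VLSplitter hands each decoder a bit stream from its own end rather than explicit lengths):

```python
def pack_two_ended(unary, sign, field_len):
    """Pack two variable-length codes into one field: the unary code grows from the
    left end and the sign (PVSC) code from the right end; middle bits are unused."""
    assert len(unary) + len(sign) <= field_len
    return unary + "0" * (field_len - len(unary) - len(sign)) + sign[::-1]

def vlsplitter(field):
    """Expose the packed field as two streams, one read from each end; the unary
    and sign decoders then consume only as many bits as they need."""
    return field, field[::-1]

packed = pack_two_ended("11011", "10", 12)
left, right = vlsplitter(packed)
assert left.startswith("11011") and right.startswith("10")
```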
IV. EXPERIMENTAL RESULTS
This section presents the simulation results of the proposed algorithm and the implementation of the proposed hardware architecture. The effects of the PVSC algorithm on data compression are analyzed, and the proposed algorithms are compared with others [8], [18], [37] in terms of CR. The hardware implementation is designed in Verilog HDL, and its evaluation is expressed in terms of clock frequency, throughput, and unit/total area in a 55-nm cell library. Regarding the benchmarks, we use two groups of sequences: 18 HEVC benchmark sequences and four 4-K × 2-K sequences, shown in Figure 13. The HEVC benchmark sequences are from the Joint Collaborative Team on Video Coding (JCT-VC), and the 4-K × 2-K sequences are from the Xiph.Org Foundation. They are listed in Table 1.

A. COMPRESSION RATIOS

Figure 11 shows the distribution of the residual data obtained after DPCM processing. The distribution of 0 data was the highest in all test benches. When our PVSC was applied to these test benches, the sign bits decreased by up to 35%.
The compression efficiency of the proposed algorithm is evaluated with the CR defined in (3). The CR evaluation is performed on all benchmark sequences, both without and with quantization, by applying quantization parameter (QP) values of 22, 27, 32, and 37.

CR = (Original data size) / (Compressed data size)    (3)

Figure 12 shows the results of the CR comparison between the adaptive-k GR and fixed-k GR coding algorithms for all test benches. The experiment was performed by limiting the adaptive k to the range 0 to 3 and by fixing the fixed k to 2. As a result, the compression ratio increased in every test bench, by 9.87% on average. The increase was largest in HEVC-16 and HEVC-17, where the colors and patterns are relatively simple. Table 2 shows the simulation results of the CR of the proposed solution, with an average CR of 2.06 in luma sample frames and 2.30 in 4:2:0 frames. Table 3 presents the results of the experiment that examined how the CR is influenced by QP. For this experiment, the FHD images were transformed into 8 × 8 block images through a discrete cosine transform (DCT) with four different QP values. The images were then restored via inverse quantization and inverse DCT. Next, the restored images were compressed using the proposed compression solution. The proposed solution achieved an average CR of 3.48. The CRs for the QP values 22, 27, 32, and 37 were 2.94, 3.18, 3.58, and 4.22, respectively; that is, images reconstructed with a higher QP (i.e., with a larger quality loss) yield higher CRs under lossless compression. Table 5 and Table 4 show the CRs of the proposed and existing algorithms for the HEVC sequences and the 4-K sequences, respectively. In Table 5, the CRs of the existing algorithms are taken from [37]. The tests with the HEVC sequences showed an average CR of 2.78, which is higher than the CRs of 1.7, 2.06, and 2.33 achieved by the previous studies. In Table 4, the tests with the 4-K sequences showed an average CR of 2.71, which is higher than the CRs of 1.7, 2.06, and 2.23 achieved by the previous studies.
B. PARALLEL HARDWARE IMPLEMENTATION
This paper proposes a parallel architecture for PVSC that codes each sign bit in parallel, as well as an adaptive-k scheme for the Golomb-Rice coding algorithm. Performance comparisons with other works are made in terms of bytes per cycle and cycles per 8 × 8 block. In Table 6, the lower section shows the comparison results. The proposed parallel architecture compresses and decompresses 64 bytes per cycle; that is, one clock cycle is required to process an 8 × 8 block of data. This parallelism is at the same level as [8] and higher than [18], [37].
The proposed hardware architecture with a four-stage pipeline was designed for the encoder and decoder in Verilog HDL (Hardware Description Language). It was implemented with a 55-nm standard cell library up to the synthesis step (Synopsys Design Compiler). After synthesis, the maximum operating frequencies were 370 MHz and 286 MHz for the encoder and the decoder, respectively. Both perform massively parallel processing, yielding throughputs of 24 GBps and 18 GBps for 8 × 8 blocks, total gate counts of 83 K and 121 K, and gate counts per pixel of 1.3 K and 1.9 K, respectively. Table 7 summarizes the hardware implementation results in terms of hardware performance, total area, and unit area. The proposed solution achieved better hardware performance and lower power consumption than the previous algorithms.
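These throughput figures follow from the 64-byte-per-cycle parallelism and the synthesized clock frequencies; the short check below reproduces the arithmetic (illustrative only).

```python
def throughput_gbps(bytes_per_cycle, freq_mhz):
    """Throughput = block bytes processed per cycle times clock frequency."""
    return bytes_per_cycle * freq_mhz * 1e6 / 1e9

print(throughput_gbps(64, 370))  # encoder: ~23.7 GBps (reported as 24 GBps)
print(throughput_gbps(64, 286))  # decoder: ~18.3 GBps (reported as 18 GBps)
```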
The proposed hardware architecture achieves a throughput of 24 GBps during encoding and 18 GBps during decoding, which is the highest rate among the compared algorithms. This can be explained by two key points.
First, the proposed architecture provides the massively parallel processing of [8] , whereas the algorithms in [18] , [37] have data-processing dependencies during compression and decompression. Second, the proposed hardware implementation has a higher operating frequency than the implementation in [8] thanks to pipeline depth adjustment, logic optimization, and high-end manufacturing.
The hardware of the proposed solution requires the smallest gate counts per pixel, which leads to the lowest power consumption per pixel. All these characteristics make the proposed compression solution suitable for high-performance mobile applications.
Our decompressor is larger than our compressor because the proposed design uses adders for parallel processing. For parallel unary processing, encoding requires M × N − 1 adders to compute the termination positions, whereas decoding requires a number of adders proportional to the unary length to reconstruct the quotients. As a result, the larger number of adders makes the decompressor larger than the compressor.
Even though the proposed method adds hardware modules for SignENC/SignDEC, ZeroDT, and KSplitter, the hardware cost is reduced compared with that in [8]. This result has two causes. First, the number of subtraction steps has been reduced from two to one because we replaced the DDPCM algorithm adopted in [8] with the DPCM algorithm. Second, the logic complexity of the adders for unary coding and inverse DPCM has been reduced by adjusting the pipeline depth of [8].
V. CONCLUSION
This paper proposed a lossless compression solution by developing novel algorithms for frame-buffer recompression and by extending some previous compression methods. It also proposed a hardware architecture that allows massively parallel processing for the compressor and decompressor. The proposed solution was implemented as a lossless embedded compressor; it achieves an average CR of 3.12 and operates at a frequency of 370 MHz.
Its throughput is 24 GBps, which exceeds the throughput requirement for 8-K UHD image processing at 240 Hz (i.e., 12 GBps). It requires 1.3 K gates per pixel, which leads to low power consumption (23 mW for the compressor and 17.6 mW for the decompressor). Therefore, the proposed solution is suitable for mobile applications where energy efficiency is a significant factor. In addition, the size of the compression unit (M × N blocks) can be adjusted, so the proposed solution can be used in both line-based and block-based applications.
The work presented in this paper focuses on compression algorithms for the entropy coding stage and on the parallel-processing architecture. In the future, prediction-stage compression algorithms will be studied to further improve compression efficiency. In addition, the current architecture, which offers block-level random access, will be extended to provide intra-block random access. We also plan to study a hybrid compression algorithm that guarantees a fixed bandwidth; such a hybrid algorithm can combine lossy and lossless algorithms.
