High efficiency video coding (HEVC) is the new-generation video compression standard. Sample adaptive offset (SAO) is a new compression tool adopted in HEVC that reduces the distortion between original samples and reconstructed samples. SAO estimation is the process of determining SAO parameters in video encoding. It is divided into two phases: statistic collection and parameters determination. There are two difficulties for VLSI implementation of SAO estimation. The first is that a huge number of samples must be processed in the statistic collection phase. The other is that the complexity of Rate Distortion Optimization (RDO) in the parameters determination phase is very high. In this article, a fast SAO estimation algorithm and its corresponding VLSI architecture are proposed. For the first difficulty, we use bitmaps to collect statistics of all the 16 samples in one 4 × 4 block simultaneously. For the second difficulty, we simplify a series of complicated procedures in HM to balance the algorithm's complexity and BD-rate performance. Experimental results show that the proposed algorithm maintains the picture quality improvement. The VLSI design based on this algorithm can be implemented using 156. 
Introduction
With the rapid development of video compression technology, the resolution and frame rate of popular video formats have increased quickly over the past twenty years. Ultra HDTV (Ultra High Definition Television) [1], a new video format conceptualized by the Japanese public broadcasting network, NHK, supports video throughput as high as 8 K × 4 K @ 120 FPS [2]. It is therefore worthwhile to work on VLSI technology for 8 K × 4 K @ 120 FPS video coding.
High efficiency video coding (HEVC) [3] is a video compression format, a successor to H.264/MPEG-4 AVC, that was jointly developed by the ISO/IEC moving picture experts group and the ITU-T video coding experts group as ISO/IEC 23008-2 MPEG-H Part 2 and ITU-T H.265 [4]. Sample Adaptive Offset (SAO) [5] is a new in-loop filtering technique that reduces the distortion between original samples and reconstructed samples in HEVC. The concept of SAO is to reduce the mean sample distortion of a region by first classifying the region samples into multiple categories with a selected classifier, obtaining an offset for each category, and then adding the offset to each sample of the category, where the classifier index and the offsets of the region are coded in the bitstream. Practically, SAO parameters are Coding Tree Block (CTB) based. There are three types of SAO: band offset (BO), edge offset (EO) and SAO not applied (NA). If the SAO type is SAO not applied, no samples need to be offset in the SAO process. If the SAO type is band offset, as shown in Fig. 1, the sample value range is equally divided into 32 ranges and each range is called a band. Among the 32 bands, four consecutive bands are selected as four categories. A separate offset is determined for each of these categories.
Manuscript received March 13, 2014. Manuscript revised June 30, 2014. † The authors are with the Graduate School of Information, Production and Systems, Waseda Univ., Kitakyushu-shi, 808-0135 Japan. a) E-mail: jzhu@aoni.waseda.jp DOI: 10.1587/transfun.E97.A.2488
If the SAO type is edge offset, there are four classes (directions) of neighboring samples. As shown in Fig. 2, they are horizontal, vertical, 135° diagonal and 45° diagonal. As shown in Fig. 3, the relationship between each sample and its two neighboring samples is divided into four categories. Parameter 5 determines whether the current CTB is in left merge mode, upper merge mode or no merge mode. As shown in Fig. 4, if the current CTB is in left merge mode or upper merge mode, only parameter 5, the 1-bit syntax element sao left merge flag or sao upper merge flag, is transmitted. In these cases, parameters 1-4 of the current CTB are copied from parameters 1-4 of the left or upper CTB and do not need to be encoded. Only when the current CTB is in no merge mode are parameters 1-4 encoded. Hence the number of bits for the SAO parameters in left or upper merge mode is much lower than in no merge mode.
Since SAO is a new encoding tool in the video coding standard, related research on SAO estimation is limited. Zhu [6], Park [7] and Mihir [8] worked on VLSI design for SAO decoding rather than encoding. Praveen [9] worked on an SAO encoding algorithm but not on a hardware architecture. No publication on hardware implementations of SAO encoding has been found so far, and only one software implementation, the HEVC reference model HM, is well known. HM is software that aims to implement as many encoding tools as possible, which makes it perform well in compression efficiency; hence it is a good reference for evaluating the proposed encoding algorithm.
The SAO estimation algorithm in HM (we follow version 12.0) has good BD-rate performance, but it is not well suited to VLSI design. The algorithm is divided into two phases: statistic collection and parameters determination. In the first phase, the difficulty for VLSI design is the large number of samples that must be processed. The algorithm in HM processes the samples one by one without considering throughput, which is unacceptable for VLSI implementation. In the second phase, the difficulty is that the RDO (Rate Distortion Optimization) used frequently in determining the various SAO parameters in HM is unsuitable for VLSI implementation.
In this article, we propose a fast encoding algorithm based on the HM algorithm, together with its VLSI architecture. For the first difficulty, bitmaps are used to collect statistics of the 16 samples in one 4 × 4 block simultaneously, which improves throughput. For the second difficulty, a series of complicated procedures in the HM algorithm are simplified to achieve a better balance between BD-rate and complexity.
Experimental results show that the proposed algorithm maintains the picture quality improvement. The VLSI design based on this algorithm can be implemented with 156.32 K gates and 8,832 bits of single-port RAM, operates at 400 MHz in 65 nm technology, and is capable of 8 K × 4 K @ 120 fps encoding.
The rest of this article is organized as follows. Sections 2 and 3 introduce the details of SAO estimation algorithm in HM 12.0 and our improved proposals respectively. Section 4 describes the VLSI architecture in detail. The experimental results and implementation results are illustrated in Sect. 5. Finally, Sect. 6 concludes this article.
SAO Estimation Algorithm
The SAO estimation algorithm in HM 12.0 is divided into statistic collection phase and parameters determination phase. They are illustrated in the following two subsections.
Statistic Collection
As introduced in Sect. 1, there are three types of SAO. One is SAO not applied, and the other two, edge offset (EO) and band offset (BO), are the effective SAO types. The division and classification of the two effective SAO types are shown in Fig. 5. There are four classes (directions) for EO (EO 0: horizontal; EO 1: vertical; EO 2: 135° diagonal; EO 3: 45° diagonal) and four categories for each EO class. There are 32 bands for BO. Refer to [5] for details. The 16 EO categories and 32 BO bands are collectively called the 48 classifications.
For each classification, two statistics are collected: count (C) and sum (S). C is the number of samples within one CTB that belong to the specified classification. S is the sum of the differences between the original samples and the reconstructed samples within one CTB that belong to the specified classification. Table 2 to Table 5 are similar to this.
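The definitions of C and S can be illustrated with a plain sample-by-sample loop over the 32 BO bands. This is only a minimal sketch: the function name and the flat-list input layout are hypothetical, and 8-bit samples are assumed so that the band index is the reconstructed sample value shifted right by 3.

```python
def collect_bo_stats(org, rec):
    """Accumulate count C and sum S for the 32 band-offset classifications.

    org, rec: equal-length sequences of 8-bit original and reconstructed
    samples (hypothetical flat layout of one CTB component).
    """
    count = [0] * 32
    total = [0] * 32
    for o, r in zip(org, rec):
        band = r >> 3           # 256 levels / 32 bands = 8 levels per band
        count[band] += 1        # C: number of samples in this band
        total[band] += o - r    # S: sum of (original - reconstructed)
    return count, total


C, S = collect_bo_stats([150, 152, 9], [148, 151, 12])
```

Here two reconstructed samples fall into band 18 and one into band 1, so only those two entries of C and S become non-zero.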
Parameters Determination
There are four procedures in the parameters determination phase. Procedure 1 determines the offset (O), distortion (D) and cost (CO) for each classification of the three components within one CTB. Procedure 2 determines the start band position (sbp) of luma, cb and cr for band offset. Procedure 3 determines the SAO type and the class (direction) of edge offset for luma and chroma. Procedure 4 determines whether left merge mode, upper merge mode or no merge mode is adopted. The four procedures are explained as follows.
• P1: Offset, distortion and cost determination. As in Table 1, the number of variable instances of O is also 144, as listed in the 1st line of Table 2. D is obtained through Formula (1). The variables O, C, S and D in Formula (1) may be replaced by the 144 variable instances in Table 1 and Table 2. CO is obtained from Formula (2). CO, D and R in Formula (2) can be replaced by the variable instances in Table 2. R (rate) in Formula (2) is obtained through rate estimation; it is a function of the value of O in the HM algorithm. L (lambda) in Formula (2) can be regarded as a known parameter in HEVC encoding. The parameter for luma is different from the parameter for chroma.
Table 2: List of variables in procedure 1 of the parameter determination phase.
Fig. 7: Start band position determination.
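Formulas (1) and (2) are not reproduced in this text. The sketch below assumes that they take the forms commonly given in the SAO literature, D = C·O² − 2·O·S and CO = D + L·R; the function names are hypothetical, and the rate R is passed in as a plain number because the rate estimation itself is not described here.

```python
def sao_distortion(offset, count, total):
    # Assumed form of Formula (1): change in squared error when `offset`
    # is added to `count` samples whose summed difference is `total`.
    return count * offset * offset - 2 * offset * total


def sao_cost(distortion, rate, lam):
    # Assumed form of Formula (2): rate-distortion cost CO = D + L * R.
    return distortion + lam * rate


# Candidate offsets for S = -11, C = 6 (a negative D means reduced distortion).
candidates = {o: sao_distortion(o, 6, -11) for o in (-2, -1, 0)}
```

Under these assumptions, the three candidate offsets −2, −1 and 0 give D = −20, −16 and 0 respectively, so the rate term decides whether the largest distortion reduction is also the lowest cost.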
• P2: Start band position (sbp) determination. There are 32 bands for BO. Four consecutive bands form a band group. As shown in Formula (3), the CO of one band group, written as CO_bg, is the sum of the CO of the four bands within that band group. CO_bg in Formula (3) can be replaced by the variable instances in Table 3. There are 29 band groups for each component in the HM algorithm and three components in all. X in Table 3 is the position of the first of the four bands in that band group. CO_cX in Formula (3) shall be replaced with the first 29 variable instances for BO listed in the 3rd line of Table 2.
There are 29 band groups in all. The CO_bg values of the 29 band groups are compared, and the band group with the minimum CO_bg is selected. The first of its four bands is the selected start band position (sbp). sbp_y, sbp_cb and sbp_cr for luma, cb and cr are generated in this way.
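The band group search described above can be sketched as a sliding window over the 32 per-band costs of one component. The function name is hypothetical, and keeping the first minimum on ties is an assumption, since the tie-breaking rule is not stated in the text.

```python
def select_start_band(band_costs):
    """Return (sbp, cost) of the 4-band group with the minimum summed cost."""
    assert len(band_costs) == 32
    best_sbp, best_cost = 0, sum(band_costs[0:4])
    for x in range(1, 29):             # 29 candidate start positions: 0..28
        group_cost = sum(band_costs[x:x + 4])
        if group_cost < best_cost:     # strict '<' keeps the first minimum
            best_sbp, best_cost = x, group_cost
    return best_sbp, best_cost


costs = [0] * 32
costs[17:21] = [-5, -9, -4, -1]
sbp, cost = select_start_band(costs)
```

In this example only bands 17 to 20 carry a negative cost, so the group starting at band 17 is selected.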
• P3: Types and edge offset classes (directions) determination. As introduced in Sect. 1, there are three types of SAO: edge offset, band offset and SAO not applied. For the edge offset type, there are four classes (directions). So there are actually six sub-type candidates, labeled from 0 to 5, as shown in Fig. 8. Each effective sub-type (sub-types 0-4 in Fig. 8), whether edge offset or band offset, contains four classifications. For edge offset, the four classifications of the sub-type are shown in Fig. 3. For band offset, the four classifications of the sub-type are the four consecutive bands starting from sbp (start band position). The criterion for determining the sub-type is also Formula (2) in P1, except that CO, D and R have different meanings than in P1: in this procedure they are the CO, D and R of one sub-type for luma or chroma, instead of one classification for luma, cb or cr. D, R and CO in Formula (2) shall be replaced with the variable instances listed in Table 4, where X = 0...5 denotes the 6 sub-types shown in Fig. 8. L (lambda) in Formula (2) in this procedure is the same as in P1.
The sub-type with the minimum CO is selected. The types and edge offset classes for both luma and chroma components are generated through this procedure.
• P4: Modes (left merge, upper merge and no merge) determination. As shown in Fig. 9, upper CTB merge mode, left CTB merge mode and no merge mode are compared and the best one is selected as the mode of the current CTB. The criterion in this comparison is a transform of the cost (CO), named COT (cost transformed) and shown in Formula (5). The mode is determined by the sum of COT for luma and chroma. COT in Formula (5) is the COT for each mode of the luma or chroma component. It can be replaced by the variables instances listed in 1st line of 
Proposals
Although the algorithm adopted in HM effectively improves the BD-rate performance of HEVC, it is difficult to implement in VLSI. In this section, proposals for each of the two phases of the SAO algorithm are presented to reduce the complexity and make it suitable for hardware implementation.
In the statistic collection phase, we propose to use bitmaps to collect the statistics of the 16 samples in one 4 × 4 block simultaneously. This is efficient and suitable for hardware implementation. In the parameters determination phase, a series of modifications are adopted to balance complexity and BD-rate performance. The structure of this section matches that of Sect. 2.
Statistic Collection
In our proposal, the statistics of one 4 × 4 block (16 samples) are collected in one round (cycle). So for a 64 × 64 CTB, 256 rounds are needed to finish luma statistic collection and 64 rounds each are needed for cb and cr. There are 48 4 × 4 bitmaps, which match the 48 classifications mentioned in Sect. 2.1. Each bit in a bitmap indicates whether the corresponding sample in the 4 × 4 block belongs to that classification. S and C mentioned in Sect. 2.1 are easily collected by means of bitmaps.
An example of how bitmaps are generated is shown in Fig. 10. One 4 × 4 sample block together with its surrounding samples is input as one 6 × 6 block, shown in the top-left of Fig. 10. For edge offset, 16 bitmaps are generated. For each sample, there are four classes (directions) for its two neighboring samples, as shown in the left-middle of Fig. 10. For each class, the relationship between the current sample and its two neighboring samples falls into one of the four categories shown in the top-right of Fig. 10.
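The generation of the 16 EO bitmaps from one 6 × 6 input block can be sketched as follows. The category conditions follow the standard HEVC edge offset classification; the neighbor table, the function names and the bit ordering (bit y·4 + x for the sample at row y, column x) are illustrative assumptions, and the CTB boundary masking used in the hardware is not modeled.

```python
# Neighbor positions (dy, dx) of the two neighbors for each EO class.
EO_NEIGHBORS = {
    0: ((0, -1), (0, 1)),     # horizontal
    1: ((-1, 0), (1, 0)),     # vertical
    2: ((-1, -1), (1, 1)),    # 135-degree diagonal
    3: ((-1, 1), (1, -1)),    # 45-degree diagonal
}


def eo_category(a, c, b):
    """Category 1..4 of sample c against neighbors a and b, or 0 if none."""
    if c < a and c < b:
        return 1              # smaller than both neighbors
    if (c < a and c == b) or (c == a and c < b):
        return 2              # smaller than one neighbor, equal to the other
    if (c > a and c == b) or (c == a and c > b):
        return 3              # larger than one neighbor, equal to the other
    if c > a and c > b:
        return 4              # larger than both neighbors
    return 0


def eo_bitmaps(block6x6):
    """Return {(eo_class, category): 16-bit bitmap} for the inner 4x4 block."""
    maps = {(k, cat): 0 for k in range(4) for cat in range(1, 5)}
    for y in range(4):
        for x in range(4):
            c = block6x6[y + 1][x + 1]
            for k, ((dy0, dx0), (dy1, dx1)) in EO_NEIGHBORS.items():
                a = block6x6[y + 1 + dy0][x + 1 + dx0]
                b = block6x6[y + 1 + dy1][x + 1 + dx1]
                cat = eo_category(a, c, b)
                if cat:
                    maps[(k, cat)] |= 1 << (y * 4 + x)
    return maps


block = [[10] * 6 for _ in range(6)]
block[2][2] = 5               # one dip at inner position (row 1, column 1)
maps = eo_bitmaps(block)
```

In this example the single dip sets bit 5 in the category 1 bitmap of all four classes, and its two horizontal neighbors set bits 4 and 6 in the category 3 bitmap of the horizontal class.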
For band offset, the sample value range is equally divided into 32 bands (classifications). They are labeled BO 0, BO 1, ..., BO 31. The range of BO 0 is 0-7, the range of BO 1 is 8-15, and so on. In the example of Fig. 10, the top-left sample is 0x96, which belongs to classification BO 18. Hence, the top-left bit of the bitmap for BO 18 is set to 1. All 16 samples are processed in this way to determine which band (classification) they belong to. In this example, since all samples in the 4 × 4 block are in the range from 0x8a to 0xa6, only four bitmaps are non-zero. The other 28 bitmaps are all-zero.
After the 48 bitmaps are generated, S and C can be obtained easily by operating on the bitmaps. An example of how to use a bitmap to generate S and C of one 4 × 4 block is shown in Fig. 11. The sum of all 16 bits in one bitmap is C, as shown in the bottom-right of Fig. 11. To obtain S, the 4 × 4 original samples and the 4 × 4 reconstructed samples are first input and their clipped difference is output. The resulting 4 × 4 block of differences is ANDed with the bitmap, and the 16 masked values are then added together to obtain S.
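The BO bitmap generation can be sketched in a few lines. The function name and the bit ordering (bit y·4 + x for the sample at row y, column x) are assumptions, and the sample values other than the top-left 0x96 are made up to stay within the 0x8a to 0xa6 range of the example.

```python
def bo_bitmaps(block4x4):
    """Return 32 bitmaps, one per band, for a 4x4 block of 8-bit samples."""
    maps = [0] * 32
    for y in range(4):
        for x in range(4):
            band = block4x4[y][x] >> 3       # 8 sample levels per band
            maps[band] |= 1 << (y * 4 + x)   # mark this sample in its band
    return maps


block = [
    [0x96, 0x8A, 0x90, 0xA6],
    [0x98, 0x9A, 0xA0, 0x8C],
    [0x96, 0x96, 0x96, 0x96],
    [0xA6, 0xA6, 0xA6, 0xA6],
]
maps = bo_bitmaps(block)
nonzero_bands = [band for band, m in enumerate(maps) if m]
```

With these values the top-left sample sets bit 0 of the BO 18 bitmap, and only the bitmaps of bands 17 to 20 are non-zero.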
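The derivation of C and S from one bitmap can be sketched as a population count plus a masked sum. The function name and bit ordering are assumptions, and the clipping of the differences is omitted because its range is not given in the text.

```python
def count_and_sum(bitmap, diff4x4):
    """C = number of set bits; S = sum of differences whose bit is set."""
    c = bin(bitmap).count("1")
    s = 0
    for y in range(4):
        for x in range(4):
            if (bitmap >> (y * 4 + x)) & 1:   # gate each difference by its bit
                s += diff4x4[y][x]
    return c, s


diff = [
    [1, 2, 3, 4],
    [-5, 6, -7, 8],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
c, s = count_and_sum(0b0101_0001, diff)   # bits 0, 4 and 6 are set
```

In hardware, all 16 gated differences would be summed in one cycle by an adder tree; the nested loop here only models the result.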
Parameter Determination
The parameter determination phase of HM is introduced in Sect. 2.2. It is not suitable for hardware implementation. In this section, a series of modifications are proposed on the basis of the original algorithm to reduce its complexity while keeping BD-rate performance. There are four procedures in the HM algorithm. Our proposals are presented in the order of the four procedures.
• P1: Offset, distortion and cost determination. In Sect. 2.2, there is an iteration process in finding the offset. As shown in Fig. 6, suppose S is −11, C is 6 and the rounded quotient of (S/C) is −2. In the original algorithm, three offset (O) candidates, −2, −1 and 0, are checked one by one and RDO is used to select the best offset according to Formula (1) and Formula (2) in Sect. 2.2. The iteration process and the rate estimation process are complicated. In our proposal, these two processes are removed and the offset is obtained directly as the rounded quotient of (S/C). In the example, −1 and 0 are not evaluated; the rounded quotient, −2, is directly selected as the offset. Given O, the process to obtain D and CO remains unchanged from the original algorithm.
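The direct offset derivation can be sketched with integer arithmetic only. The function name is hypothetical; rounding ties away from zero, returning 0 for an empty classification and clipping to the range −7 to 7 are assumptions made for this sketch.

```python
def direct_offset(total, count, limit=7):
    """Rounded quotient S / C, clipped to [-limit, limit]."""
    if count == 0:
        return 0                                   # no samples: no offset
    # Integer rounding of |S| / C, with ties rounded away from zero.
    magnitude = (2 * abs(total) + count) // (2 * count)
    offset = magnitude if total >= 0 else -magnitude
    return max(-limit, min(limit, offset))


offset = direct_offset(-11, 6)   # the example above: S = -11, C = 6
```

For the example in the text, −11/6 ≈ −1.83 rounds to −2, which matches the offset selected without iteration.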
• P2: Start band position (sbp) determination. In Sect. 2.2, the CO_bg of the band group is used to determine the sbp, and the band group with the smallest CO_bg is selected. In our proposal, the D_bg of the band group replaces the CO_bg of the band group. As shown in Formula (6), D_bg is the sum of D of the four bands within that band group. The band group with the smallest D_bg is selected, and its first band is the start band position.
• P3: Types and edge offset classes (directions) determination. In Sect. 2.2 P3, the CO of one sub-type is used to determine the types and edge offset classes for luma and chroma respectively. As shown in Formula (2), the R of one sub-type for both the luma and chroma components is required to obtain the CO. These R values are obtained through CABAC. Unfortunately, it is difficult for an SAO estimation hardware implementation to include a CABAC encoder for calculating R. The main reason is that a CABAC encoder is quite large [10]. It is even larger than the SAO estimation implementation itself, which is shown in Sect. 5 of this article. So the cost of using a CABAC encoder is high.
To avoid this issue, we use constant values to replace the values from the CABAC process for R. The rate values of the sub-types are listed in the top four lines of Table 6. For the SAO NA type, the rates for both luma and chroma are 3. For an edge offset or band offset sub-type, the rate for the luma component is 10 and the rate for the chroma component is 16. The rate values in Table 6 are intended to be close to the values obtained through CABAC. Our basic approach in setting these values is to count the number of bits of the syntax elements and then apply a discount that emulates CABAC compression. The discount is a rough estimate based on experience and test results. These values have been tested and shown to give good performance. In Table 6, NAL and NAC are 3 because under this sub-type (NA), only 3 syntax elements are transmitted: sao type luma or sao type chroma takes 2 bits, and sao left merge flag and sao upper merge flag take 1 bit each. So 4 bits of syntax elements are transmitted. The number of bits after CABAC should be less than the number of bits before CABAC, so we set the value to 3. EBL is 10 because under these sub-types (EO or BO), sao type luma (2 bits), eo classes (2 bits) or start band position (5 bits), and 4 offsets (4 × 4 = 16 bits) are transmitted. So the number of bits of syntax elements before CABAC is 20 bits (EO) or 23 bits (BO). We set the discounted number of bits to 10 based on rough estimation and experimental results. EBC is similar to EBL except that 8 offsets (4 cb and 4 cr) and 2 sbp (1 cb and 1 cr) are needed. So the number of bits of syntax elements before CABAC is 36 bits (EO) or 44 bits (BO). We set the discounted number of bits to 16 based on rough estimation and experimental results.
Table 6: Value for rate estimation.
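The effect of the constant rates on the sub-type decision can be sketched as follows. The rate constants are taken from the text, and the cost is assumed to have the form CO = D + L·R. The function name, the assignment of index 5 to SAO not applied, and the assumption that SAO not applied has zero distortion change are illustrative choices, since Fig. 8 is not reproduced here.

```python
# Constant rates described for Table 6 (values taken from the text).
RATE = {"NAL": 3, "NAC": 3, "EBL": 10, "EBC": 16}


def choose_subtype(effective_dist, lam, luma=True):
    """Pick the sub-type with minimum cost CO = D + L * R.

    effective_dist: distortions D of the five effective sub-types (0..4).
    Index 5 is assumed to be SAO not applied, with D = 0.
    """
    rate_eff = RATE["EBL"] if luma else RATE["EBC"]
    rate_na = RATE["NAL"] if luma else RATE["NAC"]
    costs = [d + lam * rate_eff for d in effective_dist]
    costs.append(0 + lam * rate_na)
    best = min(range(6), key=costs.__getitem__)
    return best, costs[best]


best, cost = choose_subtype([-20, -50, -10, -5, -30], lam=2.0)
```

Because the rate of an effective sub-type is larger than that of SAO not applied, a sub-type is selected only when its distortion reduction outweighs the extra L·R term; with small reductions such as [-1, -2, 0, 0, -3] the sketch falls back to index 5.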
• P4: Modes (left merge, upper merge and no merge) determination. In the original algorithm, two points in this procedure are unsuitable for hardware implementation. Firstly, similar to the problem in the previous procedure, R in Formula (5) is obtained through CABAC. Secondly, the division in Formula (5) is unsuitable for VLSI implementation.
To avoid the two problems, we change the definition of COT from Formula (5) to Formula (7), which removes the division in Formula (5). In addition, R in Formula (7) is set to a constant instead of the value from CABAC. The related constants are listed in the bottom three lines of Table 6. For upper merge or left merge mode, R is set to 1, because under these two modes only the 1-bit syntax element sao left merge flag or sao upper merge flag is transmitted. For no merge mode, R is set to the sum of EBL and EBC, which is 26.
In summary, a series of modifications are made to simplify the original algorithm and make it suitable for hardware implementation. Despite these modifications, the BD-rate performance of the improved algorithm remains good. The details are given in Sect. 5.
VLSI Architecture
The whole SAO estimation architecture is divided into two modules: the statistic collection module and the parameter determination module, as shown in Fig. 12. pRec and pOrg denote sample blocks from reconstructed pictures and original pictures respectively. Info denotes S (sum) and C (count) for the 48 classifications of the three components within one CTB, which are introduced in Sect. 2.1 and Sect. 3.1. The results are the SAO parameters, which are introduced in Sect. 1, Sect. 2.2 and Sect. 3.2 and are also listed in Table 7.
For each CTB, the statistic collection module takes 256 cycles for luma and 64 cycles each for cb and cr. The parameters determination module takes 64 cycles to process each component. The pipeline between the statistic collection module and the parameter determination module is shown in Fig. 13. The details of the two modules are explained in the following two sub-sections.
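A rough cycle budget shows how these per-CTB cycle counts relate to the clock frequency. The sketch below assumes that statistic collection (256 + 64 + 64 cycles per CTB) is the pipeline bottleneck and that 8 K × 4 K means 7680 × 4320 samples; both are assumptions, as are the function name and its parameters.

```python
import math


def cycles_per_second(width, height, fps, ctb=64, cycles_per_ctb=256 + 64 + 64):
    """Clock cycles needed per second if each CTB takes cycles_per_ctb cycles."""
    ctbs_per_frame = math.ceil(width / ctb) * math.ceil(height / ctb)
    return ctbs_per_frame * fps * cycles_per_ctb


required = cycles_per_second(7680, 4320, 120)   # assumed 8K x 4K @ 120 fps
```

Under these assumptions a frame contains 120 × 68 = 8,160 CTBs, so about 376 million cycles per second are required, which is within a 400 MHz clock.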
Statistic Collection Module
The block diagram of the statistic collection module is shown in Fig. 14. On each cycle, one 4 × 4 reconstructed block is input to the bo classification sub-module. Together with its surrounding samples, it is also input to the eo classification sub-module as one 6 × 6 block. The two sub-modules then generate 16 EO bitmaps and 32 BO bitmaps. For EO, the boundary samples of a CTB are excluded from the statistics, which avoids referencing samples of neighboring CTBs. This is achieved by the 16 mask sub-modules in Fig. 14. There are 48 b2n sub-modules (16 for EO bitmaps and 32 for BO bitmaps), which output the 48 C values (count, shown in Fig. 11). The diff sub-module in Fig. 14 outputs the 48 S values (sum, shown in Fig. 11). 48 unsigned accumulators (16 for EO and 32 for BO) are needed to store C for the whole 64 × 64 CTB, and another 48 signed accumulators (16 for EO and 32 for BO) are needed to store S for the whole 64 × 64 CTB.
For luma components, one 64 × 64 CTB can be divided into 256 4 × 4 blocks. So it takes 256 cycles to accumulate S 
Parameters Determination Module
The block diagram of parameters determination module is shown in Fig. 15 . There are two sub-modules and two storage devices in it. The two sub-modules are dist & offset generation (DOG) sub-module and cost generation & decision (CGD) sub-module.
Storages in Parameters Determination Module
Two storage devices are necessary: one is an SRAM holding partial SAO parameters of the upper CTB and the other is a group of registers holding those of the left CTB, both of which are used in deriving the SAO parameters of the current CTB. The stored content is all the parameters of the upper or left CTB listed in Table 7, except for the first line (merge left and merge upper). As shown in Table 7, each offset value takes four bits (its range is −7 to 7), so the 12 offset values (4 categories by 3 components) take 48 bits. The start band position (sbp) is 5 bits (its range is 0-31), so the sbp values of the three components take 15 bits. The SAO type is 3 bits (BO, 4 EO classes and SAO not applied, 6 options in all), so luma and chroma together take 6 bits. Hence, for each CTB, there 
CGD Sub-Module
As shown in Fig. 17, CGD is divided into three sub-sub-modules. Among them, offset storage is composed of a set of register groups that hold all the offsets (O) from DOG. It finally outputs four Os according to the type, edge offset class and sbp of the current CTB SAO parameters from the cost compare sub-sub-module.
As shown in Fig. 18, the type dist sub-sub-module receives the Ds of the 48 classifications of the current CTB and the Ds of 8 categories of the left and upper CTBs from DOG. It accumulates D for the four EO classes, BO, and the left and upper merge modes respectively. The band offset distortion accumulation is achieved by a four-stage shift register, a register holding the minimum band offset distortion, and a comparator. The sum of the four shift-register stages is compared with the register holding the minimum band offset distortion. If the sum is smaller, the register is updated to the sum and the corresponding band position is stored. When all 32 band offset distortions have been input to the shift register, the distortion of band offset is held in the register and the start band position is also obtained.
The cost compare sub-module is shown in Fig. 19. The distortions of the current CTB, left merge and upper merge are input to the cost compare sub-module. Its function is to obtain the costs of the current CTB SAO parameters, left merge mode and upper merge mode and to compare them. The candidate with the smallest cost is selected.
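The shift-register search for the minimum band group distortion can be modeled in software as a streaming loop, one band distortion per step. This is a behavioral sketch only: the function and variable names are hypothetical, and keeping the first minimum on ties is an assumption.

```python
def stream_start_band(band_dists):
    """Stream band distortions through a 4-stage shift register.

    Returns (sbp, minimum group distortion) after all bands have been shifted in.
    """
    shift = [0, 0, 0, 0]           # four-stage shift register
    min_dist, sbp = None, 0        # minimum register and stored band position
    for i, d in enumerate(band_dists):
        shift = shift[1:] + [d]    # shift in the newest band distortion
        if i >= 3:                 # register now holds bands i-3 .. i
            window = sum(shift)
            if min_dist is None or window < min_dist:
                min_dist, sbp = window, i - 3
    return sbp, min_dist


dists = [0] * 32
dists[17:21] = [-5, -9, -4, -1]
sbp, d_bg = stream_start_band(dists)
```

The first valid window appears once four bands have been shifted in, so 29 comparisons are made for 32 bands, matching the 29 band groups of the algorithm.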
Experimental Results
Experiments are conducted to show that the modified SAO estimation algorithm keeps good BD-rate performance compared to original HM algorithm.
The document [11] defines common test conditions and software reference configurations to be used in experiments. It defines 8 test conditions, from which we select the low delay, main, P slices only condition. That is, encoder_lowdelay_P_main.cfg in the HM 12.0 package [12] is the basic configuration file for this study. This condition is chosen because the effect of SAO is most obvious under it [5].
Note that this condition implies that the encoder operates in 8-bit mode. Only 8-bit mode encoder and 8-bit source sequences are tested in our study. However, no particular difficulties are foreseen to apply the proposed algorithm and VLSI architecture to higher bit depth cases.
The document [11] also defines the set of test sequences for test conditions. There are six classes (class A to class F) of sequences and there are 3-5 sequences for each class. For each class from class B to class F, one sequence is selected as test object, as shown in Table 8 . These source sequences are all 8-bit. Ten frames for each of these sequences are tested in this study.
HM 12.0 [12] works as the reference for BD-rate measurement and the basis of our modified algorithm.
The evaluation criteria in [13] are adopted in this article. The Bjøntegaard measurement method [14] for calculating objective differences between rate-distortion curves is used to evaluate the performance of the proposed algorithm. In practice, BD-rate (piecewise cubic) is calculated with the Excel file published in the [11] package.
As shown in Table 8, separate rate-distortion curves are used for the luma and chroma components, resulting in three different average bit-rate differences, one for each component. The left three columns record the BD-rate reduction between SAO off and the original SAO estimation on. The right three columns record the BD-rate reduction between SAO off and the modified SAO estimation on.
Table 8 shows that the luma BD-rate degrades slightly, while the chroma BD-rate is even better than with the original algorithm. This suggests that the original HM 12.0 algorithm is not optimal. For example, the rates obtained from CABAC in the parameters determination phase may be inaccurate because only part of the SAO parameters, rather than the complete set, passes through CABAC in that procedure.
In addition to the BD-rate for the individual components, the BD-rate for the combined components is also used in this study. Using the bit rate and the combined PSNR_yuv as the input to the Bjøntegaard measurement method gives a single average difference in bit rate that (at least partially) takes into account the tradeoffs between luma and chroma component fidelity [13]. The derivation of PSNR_yuv is shown in Formula (8) [13]; PSNR_y, PSNR_u and PSNR_v are calculated by the software (HM).
Table 8: BD-rate reduction comparison (three components).
Table 9: BD-rate reduction comparison (combined components).
PSNR_yuv = (6 × PSNR_y + PSNR_u + PSNR_v) / 8    (8)
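Formula (8) is a weighted average that gives luma six times the weight of each chroma component. A one-line sketch with made-up PSNR values (the function name is hypothetical):

```python
def psnr_yuv(psnr_y, psnr_u, psnr_v):
    """Combined PSNR of Formula (8): luma weighted 6/8, each chroma 1/8."""
    return (6 * psnr_y + psnr_u + psnr_v) / 8


combined = psnr_yuv(40.0, 42.0, 44.0)   # illustrative values, in dB
```

With these illustrative inputs the combined value is 40.75 dB, which stays close to the luma PSNR because of the 6:1:1 weighting.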
BD-rate reduction for the combined components is shown in Table 9. The 1st column records the BD-rate reduction of the HM 12.0 SAO estimation algorithm. The 2nd column records the reduction of the proposed SAO estimation algorithm. Although a lot of complexity is saved, the BD-rate reduction of the proposed algorithm is only slightly lower than that of the HM 12.0 SAO estimation.
Columns 3-6 of Table 9 show the BD-rate reduction of the four procedures of our algorithm individually. In column PX (X = 1...4), the data is collected with only PX (procedure X) modified according to our proposal in Sect. 3.2 and the other procedures left unmodified. The data shows three points. Firstly, the BD-rate reductions of the individual procedures are better than the reduction of the four procedures combined. This meets expectation because each single procedure modifies the original algorithm less than all of them combined. Secondly, the same procedure has different effects on different sequences. Thirdly, the BD-rate reduction of our proposed algorithm (four procedures enabled together) is not the sum of the four BD-rate reductions of the individually enabled procedures, because the four procedures are not independent: the impact of each procedure influences the impact of the others. In addition, the BD-rate calculation in [14] is non-linear.
The synthesis results of the proposed VLSI architecture are shown in Table 10. The VLSI architecture is intended to be suitable for all bit depths, but only the 8-bit case has actually been implemented, and the contents of Table 9 are for the 8-bit implementation. Although 10-bit and 12-bit VLSI implementations have not been verified, no particular difficulties are foreseen for them at this moment. 
Conclusion
In this article, we propose a fast SAO estimation algorithm and its corresponding VLSI architecture. Our proposals effectively address the two difficulties of the huge number of samples and the complex RDO. The proposed algorithm keeps good BD-rate performance and is suitable for high-performance VLSI implementation.
