Abstract-This paper describes an online lossless data-compression method using adaptive arithmetic coding. To achieve good compression efficiency, we employ an adaptive fuzzy-tuning modeler that applies fuzzy inference to deal efficiently with the problem of conditional probability estimation. In comparison with other lossless coding schemes, the proposed method compresses various types of source data well. Since we adopt the table-lookup approach for the fuzzy-tuning modeler, the design is simple, fast, and suitable for VLSI implementation.
I. INTRODUCTION
SINCE arithmetic coding [1]-[12] can approach the entropy limit as long as the statistics are accurate, it is superior to Huffman coding and has been widely used for data compression. However, arithmetic coding tends to be slow, because in its standard form it requires at least one multiplication per input symbol. Moreover, it needs an extra division at every coding step when used adaptively. Many approximate methods, which replace the multiplication or division operations with less expensive operations such as shifts and additions, have been proposed to reduce the computational complexity [4]-[8]. Some VLSI architectures for arithmetic coding have also been presented [9]-[12] to improve the coding speed. Because multialphabet arithmetic coding is very complicated to implement, few VLSI designs that use it have been presented. To make the implementation of arithmetic coding easier and more practical, the alphabet must be reduced to binary so that the coding process is simplified correspondingly. The Q-coder, an adaptive binary arithmetic coding chip for bilevel image compression, has been presented [4], [9]. Furthermore, the QM-coder, a lineal descendant of the Q-coder, has been adopted by both the JPEG and JBIG still-image compression algorithms [13]. Fu and Parhi [12] also proposed an algorithm that uses redundant arithmetic to obtain further speedup for this family of coders. However, all the Q-coder-based arithmetic coding hardware described above is designed mainly to process bilevel image data and may perform poorly on other types of data. It would be desirable to have a compressor universal enough to compress any type of data quickly and still achieve a good compression ratio.
The characteristics of various source data bear a lot of uncertainty and are hard to extract, so it is not easy to construct a probability estimator that provides accurate estimates for different types of data. To solve this problem, we employ fuzzy inference [14], [15] and propose a novel division-free arithmetic coding algorithm that can be readily implemented in hardware. In the design, we adopt a binary order-$o$ fixed-context model that uses the previous $o$ symbols as the state (or context). To reduce the complexity of hardware implementation and to increase the coding throughput, we use the table-lookup approach to construct an adaptive fuzzy-tuning modeler (AFTM). With the help of a fuzzy inference process that dynamically selects the probability-tuning step, the modeler can determine the estimated probabilities more efficiently and precisely, which improves the compression efficiency of the proposed method. Experimental results demonstrate that the proposed method performs better than other lossless compression methods, such as Huffman, approximate arithmetic [5]-[7], and Lempel-Ziv, for different types of source data: text files, image files, and binary files. In addition, some online processing problems of arithmetic coding, such as source termination and carryover, are solved efficiently in the design.
II. PROPOSED ADAPTIVE ARITHMETIC CODING METHOD
The process of general arithmetic coding can be split into two tasks: coding and modeling. A coder actually produces the compressed bit-stream, and a modeler feeds probability estimates to it. The encoding process starts by initializing a semi-open interval [0, 1), which is recursively divided into subintervals in proportion to the conditional probabilities of the symbols being encoded. Let $A_n$ denote the width of the selected subinterval at stage $n$ and $C_n$ represent the position of the lower boundary of the selected subinterval. The encoding process requires the following iterative computations:

$$A_n = A_{n-1}\, P_n(x_n \mid s) \tag{1}$$

$$C_n = C_{n-1} + A_{n-1} \sum_{y < x_n} P_n(y \mid s) \tag{2}$$

where $P_n(x_n \mid s)$ is the estimated conditional probability of input symbol $x_n$ at stage $n$, given the previous string $s$.

0090-6778/99$10.00 © 1999 IEEE

If general arithmetic coding is applied to a binary alphabet (0 or 1), it permits a simple and fast coding process and is more suitable for hardware implementation. Thus, we adopt binary arithmetic coding in our method. The coding system consists of two main components: the AFTM and the coder. While encoding the $n$th input binary symbol $x_n$ (symbol "0" or symbol "1"), the AFTM determines $P_n(0 \mid s)$ and feeds it to the coder. The coder uses $P_n(0 \mid s)$ and $x_n$ to produce the compressed bit-stream. Finally, the AFTM updates the conditional probability [or determines the new conditional probability $P_{n+1}(0 \mid s)$] for the next coding cycle. Here, we adopt the order-$o$ fixed-context model (or $o$-memory Markov model), which means $s$ is composed of the $o$ previously coded bits before $x_n$. Fig. 1 shows the flowchart for the encoding procedure of our method. Clearly, $P_n(0 \mid s)$ is the only probability needed during the whole coding process, since $P_n(1 \mid s)$ can be calculated as $1 - P_n(0 \mid s)$. In the following discussion, the AFTM and the coder are described in turn.
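The interval recursion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `encode_interval` and `p0_of` are hypothetical, exact fractions replace the finite-precision registers of a hardware coder, and symbol "0" is assumed to take the lower part of the current interval.

```python
from fractions import Fraction

def encode_interval(bits, p0_of):
    """Return (lower boundary, width) of the interval selected by `bits`.

    `p0_of(history)` is a hypothetical modeler callback that returns the
    estimated probability of symbol 0 given the bits coded so far.
    """
    low, width = Fraction(0), Fraction(1)   # start from [0, 1)
    for n, bit in enumerate(bits):
        p0 = p0_of(bits[:n])
        if bit == 0:
            # Assumed convention: 0 keeps the lower part of the interval.
            width = width * p0
        else:
            # 1 takes the upper part: move the lower boundary, then shrink.
            low = low + width * p0
            width = width * (1 - p0)
    return low, width
```

With a fixed estimate of 3/4 for symbol "0", the input 0, 1 ends with lower boundary 9/16 and width 3/16, so any number in [9/16, 3/4) identifies that input.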
A. The AFTM
The first task of the AFTM is to determine $P_n(0 \mid s)$, the estimated conditional probability of symbol "0," for each input bit. Here, $P_n(0 \mid s)$ is calculated from the relative frequency of symbol "0" at stage $n$ under the current state $s$. It can be described as

$$P_n(0 \mid s) = \frac{N_n(0 \mid s)}{N_n(s)} \tag{3}$$

where $N_n(0 \mid s)$ is the number of 0's that have occurred under state $s$ until stage $n$, and $N_n(s)$ stands for the total number of input bits that have occurred under state $s$ until stage $n$. To avoid the expensive division operation and to make hardware implementation easier, we use two tables to approximate $P_n(0 \mid s)$. We simulated the coding process, found the 128 values of $P_n(0 \mid s)$ that occur most frequently, and saved them in a probability table, which is used to approximate the possible values of $P_n(0 \mid s)$. Since each of the $2^o$ possible states has its own $P_n(0 \mid s)$, a pointer table is constructed to store the pointers, each of which points to one of the entries of the probability table so as to get the corresponding state's $P_n(0 \mid s)$.

The probability table contains the 128 most frequently occurring probability values for symbol "0." To find these values, we use a complete binary tree and some counters to simulate the process of calculating conditional probabilities during coding (see Fig. 2). Every node in the tree represents a state of the probability-calculating process and contains two kinds of data: the value of the probability of symbol "0" at that state and its weight. The weight of a node, shown in parentheses, stands for the relative occurring frequency of the state compared to all other states. The left child of a node is the new state reached when the input symbol is "1," and the right child is the new state reached when the input symbol is "0." Initially, the value of the root node is "1/2" and its weight is set to 1000. The value of the right child of the root is "2/3," which means that the probability of symbol "0" is "2/3" after a symbol "0" is input, and the weight of the node is 500 because its parent node's weight is 1000 and the probability of getting the input symbol "0" is assumed to be 1/2. In this way, we build a complete binary tree of 255 levels. The partial tree and its corresponding results are shown in Fig. 2 and in Table I, respectively. After the complete binary tree of 255 levels is constructed, the 128 probability values with the largest weights (occurring frequencies) are selected and put into the probability table.
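The tree simulation can be approximated with a short Python sketch. Several choices here are assumptions rather than details given in the text: nodes reached with the same counts are merged instead of being stored separately, equal probability values found at different depths pool their weights, ties are broken by the smaller value, and the node value is taken as (zeros + 1)/(bits + 2), which reproduces the root value 1/2 and the right-child value 2/3 quoted above. The function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from math import comb

def node_weights(levels, root_weight=1000):
    """Map each probability value to its total weight over `levels` tree levels."""
    weights = defaultdict(Fraction)
    for n in range(levels):                 # n bits have been seen at depth n
        for zeros in range(n + 1):
            value = Fraction(zeros + 1, n + 2)
            # Each input bit is assumed equally likely, so every path to
            # depth n carries root_weight / 2**n; comb counts the paths.
            weights[value] += Fraction(root_weight * comb(n, zeros), 2 ** n)
    return weights

def top_probabilities(levels, keep):
    """Return the `keep` values with the largest pooled weights."""
    ranked = sorted(node_weights(levels).items(), key=lambda kv: (-kv[1], kv[0]))
    return [value for value, _ in ranked[:keep]]
```

Under these assumptions, `top_probabilities(255, 128)` would produce a candidate probability table; the table actually used in the design may differ.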
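Since each input bit increases $N_n(s)$ by one, (3) can be updated incrementally as

$$P_{n+1}(0 \mid s) = \begin{cases} P_n(0 \mid s) + \dfrac{1 - P_n(0 \mid s)}{N_n(s) + 1}, & \text{if the $n$th input bit is a 0} \\[2ex] P_n(0 \mid s) - \dfrac{P_n(0 \mid s)}{N_n(s) + 1}, & \text{if the $n$th input bit is a 1.} \end{cases} \tag{4}$$

Evidently, the conditional probabilities change faster when few input data have been compressed and slower when many have been compressed. After many experiments, we found that some source data files require faster probability changes, and some require slower probability changes, to achieve better compression efficiency. Hence, based on (4), a new way is applied to approximate the new conditional probability as follows:

$$P_{n+1}(0 \mid s) = \begin{cases} P_n(0 \mid s) + \dfrac{1 - P_n(0 \mid s)}{S}, & \text{if the $n$th input bit is a 0} \\[2ex] P_n(0 \mid s) - \dfrac{P_n(0 \mid s)}{S}, & \text{if the $n$th input bit is a 1} \end{cases} \tag{5}$$

where $S$ is the probability-tuning step used to reflect the degree of variation of the conditional probabilities. To avoid the division operation, we employ two offset tables to approximate $P_{n+1}(0 \mid s)$. If the current input bit and the tuning step $S$ are given, the update can be precomputed by using (5).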
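To make the role of the tuning step concrete, the sketch below assumes the update moves the estimate toward 1 by (1 - P)/S after a "0" and toward 0 by P/S after a "1", and shows how such an update could be precomputed as a lookup so that no division is needed while coding. The function names, the nearest-entry rounding, and the idea of storing the next table index directly are all assumptions made for illustration; the layout of the offset tables in the design may differ.

```python
from fractions import Fraction

STEPS = (8, 24, 32, 40, 64)   # the five tuning steps listed in the text

def tuned_update(p0, bit, step):
    """Assumed update rule: nudge p0 by 1/step of the remaining distance."""
    if bit == 0:
        return p0 + (1 - p0) / step
    return p0 - p0 / step

def build_next_pointer(table, bit, step):
    """For every entry of `table`, precompute the index of the entry that is
    closest to the tuned update (hypothetical stand-in for an offset table)."""
    def nearest(x):
        return min(range(len(table)), key=lambda i: abs(table[i] - x))
    return [nearest(tuned_update(p, bit, step)) for p in table]
```

For example, starting from an estimate of 1/2 with step 8, a coded "0" raises the estimate to 9/16 and a coded "1" lowers it to 7/16; larger steps give proportionally smaller changes.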
In fact, a large number of tuning steps could be selected. Using too many steps is impractical, however, because each additional step requires more tables (memory). After many experiments, therefore, we chose only five different steps for our design: 8, 24, 32, 40, and 64. As shown in Fig. 3, the five offset tables are indexed simultaneously according to the current probability estimate. One of the five offset values, selected by the tuning step that the fuzzy inference process generates, is used to determine the new probability estimate.

The fuzzy inference process performed in our design is based on the concepts of fuzzy implication and the compositional rules of inference for approximate reasoning [14], [15]. Its main function is to select a proper probability-tuning step so that the probability estimate is tuned effectively. For each possible state, we use a ten-bit queue to record the ten previously coded bits under that state. The queues of all the states are stored in a queue table. Two evaluation parameters are used to observe the queue: the switching activity and the repeating activity. The switching activity is the number of binary symbol transitions (0 changes to 1 or 1 changes to 0) in the state queue, and the repeating activity is the number of identical bits counted from the last bit of the queue. A small switching activity means almost no transitions in the previously coded bits and suggests that a higher tuning step is more suitable for good performance. A small repeating activity means almost no repetition and suggests that a lower tuning step is more suitable for good performance. We use the switching activity and the repeating activity as inputs to infer the proper probability-tuning step. The corresponding membership functions and the fuzzy control rules used for the step selection problem are shown in Fig. 4. To reduce the design complexity and achieve higher fuzzy inference speed, the inference process is implemented by table lookups. As shown in Fig. 5, the current state is used to index the queue table and fetch that state's queue, which contains the ten previous bits coded under that state. Then, the ten bits are used to index a step table to determine the corresponding tuning step for updating the probability estimate.
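The two queue measurements and the table-lookup form of the inference can be sketched as follows. The membership functions and rules of Fig. 4 are not reproduced here, so `toy_rule` is a purely hypothetical crisp placeholder, and counting the last bit itself as part of the repeating activity is also an assumption.

```python
from itertools import product

QUEUE_LEN = 10   # ten previously coded bits per state, as in the text

def switching_activity(queue):
    """Number of 0->1 or 1->0 transitions between neighboring bits."""
    return sum(a != b for a, b in zip(queue, queue[1:]))

def repeating_activity(queue):
    """Length of the run of identical bits ending at the last bit."""
    run = 0
    for bit in reversed(queue):
        if bit != queue[-1]:
            break
        run += 1
    return run

def toy_rule(switching, repeating):
    """Hypothetical crisp stand-in for the fuzzy rules of Fig. 4."""
    return 64 if switching <= 2 and repeating >= 5 else 8

def build_step_table(choose_step):
    """Precompute a step for all 2**10 queue contents; the index is the
    queue read as a binary number with the oldest bit first."""
    return [choose_step(switching_activity(q), repeating_activity(q))
            for q in product((0, 1), repeat=QUEUE_LEN)]
```

Once the table is built, selecting a step at coding time is a single lookup, for example `build_step_table(toy_rule)[0b0000011111]`.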
B. The Coder
While coding the $n$th input bit, the coder accepts the probability estimate $P_n(0 \mid s)$, calculates the interval width $A_n$ and the lower boundary $C_n$, normalizes $A_n$ and $C_n$ if necessary, and produces the compressed result in encoding mode or the uncompressed data in decoding mode. In [2], a bit-stuffing technique is presented to solve the carryover problem. In our design, the coder applies a new bit-stuffing technique to solve the problems of source termination and carryover together efficiently. A $k$-bit register is used as the output buffer during the encoding process. That is, the compressed bits shifted out from $C_n$ are put temporarily into the buffer instead of being sent out directly. Thus, carries generated from $C_n$, as shown in Fig. 1, can propagate into the buffer. However, if the buffer contains only consecutive 1 bits, a carry propagating into it would propagate through it and out into the coded outputs that have already been transmitted. Two extra stuffed bits are added to resolve this problem. If all the bits of the buffer are 1's, two stuffed bits "00" are shifted into the right side of the buffer to block the carry propagation. The second stuffed bit (now bit 0 of the buffer) may be changed to 1 if a carryover occurs during encoding. Because no carryover can propagate to the same bit position twice, as demonstrated in [2], the first stuffed bit (now bit 1 of the buffer) will always be 0. This feature is used to indicate the source termination condition; that is, we send $k+1$ consecutive 1's as the termination mark when all the input bits are coded. If the decoder receives $k$ 1's while decoding, it checks the next two input bits (the stuffed bits). If the stuffed bits are "00," the decoder just ignores them. If the stuffed bits are "01," the decoder adds 1 to the received code value to account for the carry and sets the stuffed bit back to 0. If the stuffed bits are "1x" (x: don't care), the decoder ends the decoding process, since the termination mark [$k+1$ consecutive 1's], which consists of the $k$ 1's in the buffer and the 1 in the first stuffed bit, is detected.
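The decoder's three-way decision on the stuffed bits can be written as a small function. This is only a sketch of the case analysis described above; the function name and the returned labels are hypothetical, and the surrounding work of detecting the run of 1's and applying the carry is left out.

```python
def classify_stuffed_bits(first, second):
    """Interpret the two bits that follow a full run of 1's in the buffer.

    first  -- first stuffed bit (always 0 unless it is part of the termination mark)
    second -- second stuffed bit (set to 1 by the encoder when a carry was absorbed)
    """
    if first == 1:
        return "terminate"   # "1x": the run of 1's continues, so the source has ended
    if second == 1:
        return "carry"       # "01": a carry was blocked here and must be applied
    return "ignore"          # "00": plain stuffing, nothing to do
```

Because the first stuffed bit can never be turned into a 1 by a carry, a 1 in that position is unambiguous, which is what lets the same mechanism serve both carry blocking and termination.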
III. EXPERIMENTAL RESULTS AND IMPLEMENTATION
To implement our coding method, we need enough memory to store the related tables. The total table size in the design is the sum of the storage sizes of the six tables: the probability table, the pointer table, the two offset tables, the queue table, and the step table. Since the design uses an order-$o$ fixed-context modeler, the real storage size is determined by the value of $o$. Table II shows the memory sizes of our method for different orders.

For comparison purposes, we considered several different schemes for various source data. Table III shows the outcomes of compressing text, image, and binary files with the various schemes. The figures in the table are the ratios of the total corpus size to the total compressed size for each kind of source data. Among these schemes, HUFF (the compact utility on UNIX) is the adaptive Huffman scheme. LZW (the compress utility on UNIX) is an LZ78-family coder. ARTH is an arithmetic coding scheme implemented by Jiang [7]. JPEG, obtained from the public domain facility at Stanford University,¹ is the lossless JPEG coder. It supports seven different prediction methods, denoted JPEG_1 to JPEG_7. JBIG is the JBIG lossless compression software.² AR_MF and AR_MDF are the arithmetic coding methods presented in [5] and [6], respectively. AFT is the proposed coder. For easier comparison, AR_MF, AR_MDF, and AFT are all implemented with fixed-context modelers of different orders, indicated by the number in parentheses. To evaluate the performance of the schemes on text data, we adopt the text files used in [1], which were obtained from the Calgary/Canterbury text compression corpus.³ In addition, three types of image files are used for image evaluation. The bilevel images include the eight CCITT document images, four bilevel graph images, and two halftone images. The grayscale images and color images from the Waterloo BragZone⁴ contain 12 8-bit grayscale test images and 8 24-bit full-color test images, respectively. The binary files include ten executable files on UNIX: gcc, ghostview, xv, gzip, edit, tcsh, nroff, audioplay, fmli, and csh.
Consequently, our method achieves good compression efficiency for these different types of data.
Based on the hardware optimization and tradeoff concepts taken from high-level synthesis, we have developed a simple version of the VLSI architecture with order 10 for the proposed method, using Cadence's Verilog simulator run on a SUN SPARC10 workstation. It includes an asynchronous interface circuit for I/O communication, so that the I/O operation and the coding operation can proceed in parallel. In addition, the concept of "design for testability" is used, and a full scan is implemented in the design. The probability table, the two offset tables, and the step table are designed with ROM, while the pointer table and the queue table are designed with RAM. Under Verilog simulation, the design yields a compression and decompression rate of 12 Mbits/s with a clock rate of 50 MHz [16]. A full version is under development, and we are investigating ways to reduce the sizes of the tables (memories) needed in the coding method.
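Two quick figures follow from the numbers quoted above and can be checked with a few lines of Python. Both are derived here rather than reported in the paper, and the memory estimate rests on assumptions: one pointer and one ten-bit queue per state, 7-bit pointers (enough to address 128 probability entries), and no packing or padding. The function names are hypothetical.

```python
def cycles_per_bit(clock_hz, throughput_bps):
    """Average number of clock cycles spent per coded bit."""
    return clock_hz / throughput_bps

def state_memory_bits(order, pointer_bits=7, queue_bits=10):
    """Assumed per-state storage (pointer + queue) over all 2**order states."""
    return (1 << order) * (pointer_bits + queue_bits)

# 50 MHz clock at 12 Mbits/s: roughly 4.2 cycles per bit on average.
print(round(cycles_per_bit(50e6, 12e6), 2))
# Order-10 design: 1024 states x 17 bits of state-dependent storage.
print(state_memory_bits(10))
```

Under these assumptions, each additional order doubles the state-dependent storage, which is consistent with the interest in reducing table sizes for a full version.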
IV. CONCLUSIONS
For online lossless data compression, we have proposed a novel division-free adaptive arithmetic coding method that can be easily realized with VLSI technology. With the help of fuzzy inference, we achieve better compression results for various types of source data. The drawback of the method is that it requires one multiplication operation per input bit to obtain a more accurate probability estimate. However, the method can still achieve high coding speed by using the simplified parallel multiplier that we proposed in [17], which requires approximately half the area of a standard parallel multiplier without sacrificing any performance.
