A high throughput multiplication free approximation of arithmetic coding by May, Frank et al.
This paper appeared in Proc. of Int. Symp. on Inform. Theory and Its Applications, Victoria, Canada, 1996
A High Throughput Multiplication Free Approximation to
Arithmetic Coding
Frank May, Andreas Klappenecker,
Volker Baumgarte, Armin Nückel, Thomas Beth
Universität Karlsruhe
Institut für Algorithmen und Kognitive Systeme
Am Fasanengarten 5, 76128 Karlsruhe
email: wavelet@informatik.uni-karlsruhe.de
Abstract: Several solutions have been proposed to avoid costly multiplications in approximations to arithmetic coding. These methods rely on repeated renormalizations, which turn out to be the bottleneck in VLSI implementations. We propose a new renormalization scheme that achieves significantly higher throughput in terms of encoded symbols per clock cycle, and we give some details on a VLSI implementation of this scheme.
I. Introduction
Arithmetic coding [1] is a fixed-precision version of Elias coding [2, pp. 61-62] and is widely used as a final step in complex compression systems [3, 4]. It achieves the zero-order entropy asymptotically for arbitrary probability distributions. In contrast to Huffman coding (which is in general not optimal, but allows very efficient implementations), arithmetic coding involves expensive multiplications. Several authors have proposed approximation techniques to avoid these multiplications [5, 6, 7, 8]. In VLSI implementations, these methods require multiple clock cycles for each encoded symbol (e.g. [5]). Unfortunately, the number of cycles is even higher for lower compression ratios. This may be one reason why arithmetic coding is rarely used in hardware implementations. We propose a new renormalization scheme that allows one symbol to be encoded per clock cycle. Furthermore, we show that these modifications lead to area-efficient CMOS circuits. This research is part of a VLSI wavelet compression project at the authors' institution.
II. Arithmetic Coding
A sequence of symbols $(s_i)$ over a finite alphabet $\{a_0, \dots, a_{N-1}\}$ produced by a source with probabilities $p(a_k) > 0$ is encoded by repeated subdivision of the interval $[0, 1)$. The current interval is represented by its lower bound $C_i$ and its width $A_i$. The subdivision is given by the recurrence equations

$$
\begin{aligned}
A_0 &= 1, & A_{i+1} &= p(s_i)\,A_i, \\
C_0 &= 0, & C_{i+1} &= C_i + P(s_i)\,A_i,
\end{aligned}
\tag{1}
$$

where $P(a_k)$ denotes the sum $\sum_{l=0}^{k-1} p(a_l)$. A finite sequence $(s_0, \dots, s_{K-1})$ can be decoded from the value $C_K$.

Footnote: This research was supported by DFG under project Be 887/6-3.
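The recurrence (1) can be traced with exact rational arithmetic. The following Python sketch is only an illustration under stated assumptions: the function names `cumulative`, `encode` and `decode`, the use of symbol indices instead of symbols, and the unbounded-precision `Fraction` type are choices made here and are not part of the hardware scheme described in this paper.

```python
from fractions import Fraction

def cumulative(p):
    """Return the list of P(a_k) = sum of p(a_l) for l < k."""
    cum, total = [], Fraction(0)
    for pk in p:
        cum.append(total)
        total += pk
    return cum

def encode(symbols, p):
    """Apply recurrence (1) exactly; return (C_K, A_K)."""
    cum = cumulative(p)
    A, C = Fraction(1), Fraction(0)
    for s in symbols:
        C += cum[s] * A   # C_{i+1} = C_i + P(s_i) A_i (uses the old width)
        A *= p[s]         # A_{i+1} = p(s_i) A_i
    return C, A

def decode(C, p, K):
    """Recover K symbol indices from the code value C."""
    cum = cumulative(p)
    low, A = Fraction(0), Fraction(1)
    out = []
    for _ in range(K):
        # Find the subinterval [lo, lo + p_k * A) that contains C.
        for k in range(len(p)):
            lo = low + cum[k] * A
            if lo <= C < lo + p[k] * A:
                break
        out.append(k)
        low, A = lo, p[k] * A
    return out

probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
message = [0, 2, 1, 0]
code, width = encode(message, probs)
```

Decoding `code` with `decode(code, probs, len(message))` returns the original index sequence, because $C_K$ lies in every nested subinterval selected during encoding.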
Since the probabilities have to be stored in finite-precision registers, the probabilities $p$ are approximated by dyadic rationals $\hat p$ with a precision of $\rho$ bits, such that

$$
2^{-\rho} \le \hat p(a_k) < 1 \quad\text{and}\quad \sum_{l=0}^{N-1} \hat p(a_l) \le 1. \tag{2}
$$

Assume that $a_{N-1}$ is the most probable symbol. As the sum of the approximated probabilities may be less than one, we add the difference to $\hat p(a_{N-1})$; that is, $\hat p(a_{N-1}) := 1 - \hat P(a_{N-1})$, where $\hat P(a_k) = \sum_{l=0}^{k-1} \hat p(a_l)$.
The values $A_i$ and $C_i$ should also be represented by fixed-precision registers $A$ and $C$, though it follows from (1) that $A_i$ and $C_i$ in general require an arbitrary number of bits. Note that the sequence of interval widths $(A_i)$ is monotonically decreasing. It follows that for small $A_i$ the addition of $P(s_i)\,A_i$ does not change the leading bits of $C_i$. At least these constant bits can be communicated to the decoder, and $A$ and $C$ can be renormalized. The renormalization is achieved by shifting both registers $A$ and $C$ to the left, such that $A$ always lies in a given interval.
The addition in (1) may result in a carry overflow in register $C$. The bits shifted out of $C$ are passed through an additional shift register $C'$ to which the carries are added. This resolves most of the overflow carries. The bits shifted out of $C'$ are blocked in an output buffer and then communicated to the decoder. If all bits in this output buffer are `1', the addition of a carry must be propagated further. This carry is stored in an additional bit (called a stuff bit) so that the addition can be finished by the decoder, cf. [5].
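Why a buffer consisting only of `1' bits is the problematic case can be seen from a small Python sketch of carry propagation. The function name `add_carry` and the list-of-bits model of the buffer are illustrative assumptions; the sketch does not model the stuff-bit mechanism of [5] itself.

```python
def add_carry(bits):
    """Add 1 to a buffer of bits (most significant bit first).

    Returns the updated buffer and the carry that leaves the buffer.
    """
    out = list(bits)
    for k in range(len(out) - 1, -1, -1):
        if out[k] == 0:
            out[k] = 1        # the carry is absorbed by the first 0 bit
            return out, 0
        out[k] = 0            # a 1 bit flips to 0 and passes the carry on
    return out, 1             # all bits were 1: the carry leaves the buffer
```

A carry into `[1, 0, 1, 1]` is absorbed inside the buffer, whereas a carry into `[1, 1, 1]` leaves it and would have to reach bits that were already released.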
III. Renormalization
Previous arithmetic coders keep $A$ in a tight interval, e.g. the interval $[0.5, 1)$ used in [6, 7]. Coding one symbol then leads in general to several shifts for renormalization [5]. Thus, several clock cycles are required for renormalization, which prohibits a continuously processed input stream. This effect is even worse for larger alphabets. Apart from the obvious gain in throughput, continuous processing simplifies the design of a complex compression chip, since it makes it possible to avoid costly internal synchronization logic and allows the use of dynamic CMOS circuits in line buffers.
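The cost of the tight interval can be illustrated by counting one-bit shifts. The sketch below is an assumption-laden model, not the circuit of [5]: the register is a plain integer of `width` bits, the interval $[0.5, 1)$ corresponds to values with the top bit set, and the name `shifts_to_renormalize` is hypothetical.

```python
def shifts_to_renormalize(a, width):
    """Count one-bit left shifts until a lies in [2**(width-1), 2**width)."""
    if a <= 0:
        raise ValueError("register value must be positive")
    n = 0
    while a < (1 << (width - 1)):
        a <<= 1
        n += 1
    return n

# 8-bit register holding 0.75 (binary 11000000 = 192).
after_likely = 192 // 2     # symbol with probability 1/2
after_unlikely = 192 // 16  # symbol with probability 1/16
```

The likely symbol needs one shift, the unlikely one four. Larger alphabets tend to contain more low-probability symbols, which is consistent with the remark above that the effect worsens with alphabet size.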
Assume that the arithmetic coder has a parallel output with $\omega$ bits, i.e. in each clock cycle either zero or $\omega$ bits are released. To achieve high throughput, we keep $A$ in the larger interval $[2^{-\omega}, 1)$ and additionally require $\rho \le \omega - 1$, where $\rho$ is the precision in bits of the approximated probabilities in (2). Thus, a single shift by $\omega$ or $\omega - 1$ bits (depending on stuff bits) is sufficient for renormalization. At the same time $\omega$ bits are released to the channel. In hardware, the single shift can be realized
by simple multiplexers and does not require additional

[Figure 1 plot. Horizontal axis: compression ratio (ticks 5 to 30). Curves: our coder; known coders, N = 2; known coders, N = 4; known coders, N = 8.]

Figure 1: The diagram shows the performance of our coder compared to previous coders with different alphabet sizes $N$ in terms of the average number of clock cycles needed to encode one input symbol. Note that the performance of our coder does not depend on $N$.
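The single-shift property of Section III can be checked numerically under one reading of its conditions. The following Python sketch assumes that the width is kept in $[2^{-w}, 1)$ for an output width of `w` bits, that a renormalization shifts by `w` bits, that the approximated width is more than half the true width, and that every probability is at least $2^{-\mathit{precision}}$. These modelling choices, the function names, and the neglect of stuff bits are assumptions made for illustration only.

```python
from fractions import Fraction

def renormalize_once(a, w):
    """Shift a left by w bits once if it fell below 2**-w.

    Returns the new width and the number of bits shifted (0 or w).
    """
    lower = Fraction(1, 1 << w)
    if a < lower:
        return a * (1 << w), w
    return a, 0

def single_shift_suffices(w, precision):
    """Check the worst case: the smallest width, halved by the
    approximation (a conservative bound), times the smallest
    allowed probability 2**-precision."""
    lower = Fraction(1, 1 << w)
    worst = lower * Fraction(1, 2) * Fraction(1, 1 << precision)
    a, _ = renormalize_once(worst, w)
    return lower <= a < 1
```

Under these assumptions `single_shift_suffices(8, 7)` holds, while `single_shift_suffices(8, 8)` fails, which matches a bound of the form precision at most `w - 1`.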
IV. Approximation
As the two multiplications in (1) are costly in terms of chip size and throughput, we apply the method proposed by Feygin et al. [7] to approximate $A$, thereby reducing each multiplication to one addition/subtraction and two shifts. Depending on $A$, there are two types of approximation:

$$
\hat A = 2^{-s}\,\bigl(2^{-1} + 2^{-(i+1)}\bigr), \quad i = 1, \dots, \beta - 1, \tag{3}
$$

$$
\hat A = 2^{-s}\,\bigl(1 - 2^{-i}\bigr), \quad i = 1, \dots, \beta, \tag{4}
$$

where $s \in \{0, \dots, \omega - 1\}$ is the number of leading `0' bits in $A$ (with $\omega$ the number of output bits from Section III) and $\beta$ is the maximum number of bits used for the approximation. The exponent $i$ is chosen to maximize $\hat A$ under the constraint $\hat A < A$. The approximation $\hat A$ in case (3) can be visualized as

$$
\hat A = 0.\underbrace{0 \cdots 0}_{s}\;1\;\underbrace{0 \cdots 0\,1}_{i \le \beta - 1}\;0 \cdots
$$

and in case (4) as

$$
\hat A = 0.\underbrace{0 \cdots 0}_{s}\;\underbrace{1 \cdots 1}_{i \le \beta}\;0 \cdots.
$$
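The two approximation types and the resulting shift-and-add product can be sketched in Python as follows. This is an illustration, not the circuit of [7]: the names `approximate` and `shift_add_product` and the parameter `beta` (the maximum number of approximation bits) are hypothetical, and the sketch accepts $\hat A = A$ (a non-strict inequality) so that a width that is an exact power of two still has a feasible candidate, which is an assumption made here.

```python
from fractions import Fraction

def approximate(a, beta):
    """Return (a_hat, kind, s, i) for a width a in (0, 1).

    kind is 3 or 4, naming the approximation type of equations (3), (4).
    """
    if not 0 < a < 1:
        raise ValueError("a must lie in (0, 1)")
    s = 0
    while a < Fraction(1, 2 ** (s + 1)):
        s += 1                      # s = number of leading zero bits of a
    scale = Fraction(1, 2 ** s)
    candidates = []
    for i in range(1, beta):        # type (3): 2^-s (2^-1 + 2^-(i+1))
        candidates.append((scale * (Fraction(1, 2) + Fraction(1, 2 ** (i + 1))), 3, i))
    for i in range(1, beta + 1):    # type (4): 2^-s (1 - 2^-i)
        candidates.append((scale * (1 - Fraction(1, 2 ** i)), 4, i))
    feasible = [c for c in candidates if c[0] <= a]
    a_hat, kind, i = max(feasible, key=lambda c: c[0])
    return a_hat, kind, s, i

def shift_add_product(x, kind, s, i):
    """Multiply the integer x by a_hat with two shifts and one add/subtract."""
    if kind == 3:
        return ((x << i) + x) >> (s + i + 1)
    return ((x << i) - x) >> (s + i)
```

For example, the width 45/64 (binary 0.101101) with `beta = 4` is approximated by 5/8 (binary 0.101), a type (3) value with `i = 2`. The final right shift truncates, so the integer product is exact only when `x` carries enough trailing zero bits.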
In [7] it is shown that the worst case normalized excess

Figure 2: This block diagram sketches the architecture of the arithmetic coding chip. The main modules APPROX and REGAC are used twice, easing the layout process of the chip.
V. VLSI Implementation
Our algorithm is well suited to VLSI implementation. A rough sketch of the architecture is given in Figure 2. The inputs p and psum supply the probability $\hat p(s_i)$ and the cumulative probability $\hat P(s_i)$, respectively. The two APPROX modules compute the approximated products $\hat A\,\hat p(s_i)$ and $\hat A\,\hat P(s_i)$, which are used to update $A$ and $C$ in the REGAC modules. The THERM ENC module computes the number of shifts needed by the approximation. The output of $C$ is passed through the OUTREG $C'$ and through OUTBUF. OUTREG and OUTBUF handle the overflow problem. Finally, $\omega$ bits at a time, with $\omega$ the width of the parallel output, are released via out.
We formulated this algorithm in the high-level hardware description language ELLA. This abstract description was translated into a gate-level language as input to a VLSI CAD tool. We have designed a set of full-custom layouts using a 1 µm dual-metal CMOS process to improve the geometric and electrical efficiency. An electrical worst-case simulation of the layout showed a minimum clock frequency of 33 MHz. Figure 3 shows the final layout of the APPROX module and Figure 4 that of the REGAC module. Note that both modules are used twice.
Table 1 shows some simulation results of the circuit. Each line gives the zero-order entropy of the encoded sequence, the resulting bit rate, and the normalized excess code length, i.e. the difference between bit rate and entropy normalized by the entropy.
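The two quantities reported per line can be computed as in the following Python sketch. The function names are hypothetical and the example bit rate is an arbitrary illustrative input, not a result from Table 1.

```python
import math
from collections import Counter

def zero_order_entropy(symbols):
    """Zero-order entropy in bits per symbol of a symbol sequence."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def normalized_excess(bit_rate, entropy):
    """Excess code length: (bit rate - entropy) / entropy."""
    return (bit_rate - entropy) / entropy

h = zero_order_entropy("aabb")      # two equally likely symbols
excess = normalized_excess(1.05, h)  # illustrative bit rate of 1.05 bits/symbol
```

A sequence with two equally frequent symbols has an entropy of one bit per symbol, so a bit rate of 1.05 bits per symbol corresponds to a normalized excess of about 5 percent.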
Compared to the architecture described in [7], the width of the registers $A$ and $C$ is increased by $\omega$ bits, with $\omega$ the number of output bits per clock cycle, and two additional shifters are necessary to scale the product approximations appropriately. On the other hand, we were able to simplify the control logic significantly.

Figure 3: The final layout of the APPROX module with about 2,300 transistors per mm$^2$.

Figure 4: The final layout of the REGAC module with about 3,600 transistors per mm$^2$.
Another algorithm which achieves high throughput
using pipelining and fast multiplications is described
in [9]. However, these multiplications are implemented
with huge lookup tables which are very costly.
VI. Conclusion
Many image and video compression algorithms include the following three steps: a signal transformation, a quantizer, and a traditional entropy coder. The first two steps typically lead to highly skewed probability distributions. Thus, an arithmetic coder is a natural and efficient choice. However, most commercially available compression chip sets use a less efficient Huffman coder, since this coder achieves significantly higher throughput than previously known arithmetic coders with comparable chip size. The proposed algorithm combines high throughput with efficient coding and is cost-effective in terms of chip size.
VII. Acknowledgements
The VLSI system and layout libraries were developed
under the DFG project IDEAS [10]. The support by
DFG is gratefully acknowledged.




Table 1: Results of the simulation.
References
[1] I. H. Witten, R. M. Neal, and J. G. Cleary, "Arithmetic coding for data compression," Comm. ACM, vol. 30, pp. 520-540, June 1987.
[2] N. Abramson, Information Theory and Coding. McGraw-Hill, New York, 1963.
[3] W. B. Pennebaker and J. L. Mitchell, JPEG still image compression standard. Van Nostrand Reinhold, 1993.
[4] A. Klappenecker and F. U. May, "Evolving better wavelet compression schemes," in Proc. of Wavelet Applications in Signal and Image Processing III, 12-14 July 1995, San Diego, California, pp. 614-622, SPIE, 1995.
[5] R. B. Arps, T. K. Truong, D. J. Lu, R. C. Pasco, and T. D. Friedman, "A multi-purpose VLSI chip for adaptive data compression of bilevel images," IBM J. Res. Develop., vol. 32, pp. 775-795, Nov. 1988.
[6] D. Chevion, E. D. Karnin, and E. Walach, "High efficiency, multiplication free approximation of arithmetic coding," in Proceedings of the Data Compression Conference, Snowbird, Utah, pp. 43-52, 1991.
[7] G. Feygin, P. G. Gulak, and P. Chow, "Minimizing excess code length and VLSI complexity in the multiplication free approximation of arithmetic coding," Inform. Processing & Management, vol. 30, no. 6, pp. 805-816, 1994.
[8] T. Y. Tong and I. F. Blake, "An improved multiplication-free multialphabet arithmetic code and the redundancy of arithmetic codes," to appear, 1992.
[9] H. Printz and P. Stubley, "Multialphabet arithmetic coding at 16 MBytes/sec," in Proceedings of the Data Compression Conference, Snowbird, Utah, pp. 128-137, 1993.
[10] T. Beth, A. Klappenecker, T. Minkwitz, and A. Nückel, "The ART behind IDEAS," in Computer Science Today (J. van Leeuwen, ed.), vol. 1000 of Lecture Notes in Computer Science, pp. 141-158, Springer Verlag, 1995.
