A Parallel Algorithm for Inverse Halftoning and its Hardware Implementation by Siddiqi, Umair F. & Sait, Sadiq M.
 
 
 
A Parallel Algorithm for Inverse Halftoning and its Hardware 
Implementation 
 
Umair F. Siddiqi and Sadiq M. Sait 
{umair, sadiq}@ccse.kfupm.edu.sa 
KFUPM Box: 673  
Department of Computer Engineering, 
King Fahd University of Petroleum & Minerals, Dhahran 31261 
Saudi Arabia 
Telephone: +966-3-860 1099 
Fax: +966-3-860 2440 
 
 
 
 
 
 
 
 
 
 
Abstract 
 
The Lookup Table (LUT) method for inverse halftoning requires no computation, is fast, and 
yields good results. This paper proposes a parallel algorithm for inverse halftoning obtained 
by parallelizing the LUT method. The method is parallelized by dividing its single Look-Up 
Table into several smaller Look-Up Tables (sLUTs). In the parallel algorithm up to four 
pixels can be fetched from the halftone image concurrently and sent to separate sLUTs, from 
which the template of each pixel fetches its inverse halftone value independently of the other 
pixels. The parallelization can increase the speed of inverse halftoning by up to 4 times, 
while the total number of entries in all sLUTs remains equal to the number of entries in the 
single LUT of the serial LUT method. Some degradation in image quality is observed due to 
parallelization. The complete implementation of the method takes two CPLD devices with 
external content addressable memories (CAMs) and static RAMs to store the sLUTs.  
 
Keywords: (1) Inverse Halftoning (2) Hardware Implementation  (3) Look-Up 
Table Inverse Halftoning  (4) Complex Programmable Logic Devices (CPLD) 
 (5) Image Processing  
 
 
 
 
 
 
1. Introduction:  
Halftoning is the rendition of continuous tone pictures on media on which only two levels can 
be displayed [1]. The problem has been important since the early days of the printing press, 
when images were printed on paper by adjusting the size of dots according to the local print 
intensity. This process is termed analog halftoning. Digital halftoning has also become 
important with the availability and adoption of bi-level devices such as fax machines and 
plasma displays [2]. The input to a digital halftoning system is an image whose pixels have 
more than two levels (e.g. 256 levels), and the result of the halftoning process is an image 
that has only two levels. Inverse halftoning, on the other hand, converts an image from its 
halftone version to a grey level image, i.e. from a two-level image to, say, a 256-level 
image. Inverse halftoning finds applications in areas where processing is required on printed 
images. The images are first scanned and inverse halftoned, and then operations like zooming, 
rotation and transformation are applied. Standard compression techniques cannot process 
halftones directly; therefore inverse halftoning is required before printed images can be 
compressed [1].  
 
Lookup table (LUT) inverse halftoning is a fast method with low computational cost [3]. LUT 
inverse halftoning was first introduced by Netravali and Bowen [4], but their method requires 
information that is not always available for halftone images. Subsequently Ting and Riskin 
proposed another LUT based method [5], which did not yield good image quality. More recently, 
a computation free LUT method was reported that provides fast inverse halftoning with good 
image quality and can be applied to several different kinds of halftones [1, 3]. Two other 
LUT methods for inverse halftoning [6, 7] give better image quality, but they are not 
completely computation free and require computation in addition to the Look-Up Table (LUT) 
access. This paper presents a parallelization of the LUT method for inverse halftoning of 
Mese and Vaidyanathan [3]. In that method pixels are first fetched from the halftone image 
and then go to the LUT to obtain their contone values. If the LUT method is parallelized 
without any modification, the memory requirements grow very large because one complete LUT 
must be stored for each pixel that is inverse halftoned concurrently. Therefore, this paper 
presents a computationally simple algorithm that parallelizes the LUT method without 
increasing the number of Look-Up Table entries. This is accomplished by dividing the single 
LUT into eight smaller Look-Up Tables (sLUTs). Using the sLUTs, up to four pixels can be 
fetched from the halftone image and inverse halftoned concurrently, in the time in which the 
serial LUT method inverse halftones only one pixel. The rest of this paper is organized as 
follows. The serial LUT method is described first, and then the parallelization of the LUT 
method is explained. The explanation is followed by a simulation of the parallelized method 
and the images obtained from it. Finally, the implementation of the parallelized LUT method 
on CPLD devices is presented.   
  
2 Look-Up Table (LUT) Method for Inverse Halftoning:  
In the LUT method for inverse halftoning a template (t) is a group of pixels consisting of 
the pixel to be inverse halftoned and the pixels in its neighborhood. The LUT method uses 
three types of templates, namely 16pels, 19pels and Rect. The 16pels template consists of 
16 pixels, 19pels of 19 pixels and Rect of 21 pixels. The templates are fetched from the 
halftone image in raster-scan order, i.e. from left to right within a row and row by row 
from top to bottom. One template (t) is fetched and inverse halftoned before the next 
template is fetched. The LUT method also incorporates a Look-Up Table (LUT) that stores 
pre-computed contone values for a large number of templates. The templates stored in the LUT 
are selected from a training set of images that comprises both halftone images and their 
continuous tone versions before halftoning. The templates are taken from the halftone images 
and their contone values from the continuous tone versions. When a template occurs more than 
once, its contone value is the mean of all contone values that correspond to that template. 
The inverse halftone operation is performed as follows: a template (t) is fetched from the 
halftone image and sent to the Look-Up Table (LUT). If the LUT holds a contone value for the 
template (t), it returns that value; otherwise the contone value of the template (t) is 
estimated by one of these methods: (a) Low Pass Filtering, or (b) Best Linear Estimator. 
The LUT method for inverse halftoning can also be applied to color halftones. Color inverse 
halftoning involves three color planes (R, G, B); each plane has its own independent LUT 
that stores the contone values for that plane, and the templates may contain pixels from 
different color planes.  
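
The table construction and lookup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: templates are assumed to be packed into integers, and the names `build_lut`, `lookup` and `fallback` are hypothetical. The `fallback` callable stands in for low pass filtering or the best linear estimator, neither of which is implemented here.

```python
from collections import defaultdict


def build_lut(training_pairs):
    """Build a template -> contone table from (template, contone) pairs.

    A template that occurs several times is stored with the mean of all
    contone values observed for it (rounded to an integer grey level).
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for template, contone in training_pairs:
        sums[template] += contone
        counts[template] += 1
    return {t: round(sums[t] / counts[t]) for t in sums}


def lookup(lut, template, fallback):
    """Return the stored contone value, or estimate it for unseen templates."""
    if template in lut:
        return lut[template]
    # Assumption: any estimator can be plugged in here for missing templates.
    return fallback(template)
```

A serial inverse halftoner would call `lookup` once per pixel in raster-scan order, which is the one-template-at-a-time behaviour that the next section parallelizes.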
 
3 Parallelization of Look-Up Table (LUT) Method for Inverse Halftoning: 
To parallelize the LUT method for inverse halftoning we need to fetch more than one template 
from the halftone image at the same time and perform the inverse halftone operation on them 
independently of each other. The main problems in parallelizing the LUT method for inverse 
halftoning are the following:  
(a) The Look-Up Table (LUT) is composed of a single memory block that does not allow 
simultaneous access to more than one location. Therefore, parallel templates cannot 
fetch their contone values at the same time.  
(b) If the LUT method for inverse halftoning is parallelized as it is, the memory 
requirements grow very large because we need to store one complete LUT for each 
template that is fetched in parallel.  
This section presents an algorithm that parallelizes the LUT method for inverse halftoning 
while solving the above problems.  
 
The algorithm to parallelize the Look-Up Table (LUT) method for inverse halftoning consists 
of: (a) pre-computation of eight smaller Look-Up Tables (sLUTs), and (b) a method to 
parallelize the inverse halftone operation. Both are discussed in the following: 
3.1 Pre-Computation of Eight Smaller Look-Up Tables (sLUT): 
The proposed parallelization algorithm requires the pre-computation, on a host PC, of 8 sLUTs 
from the LUT of the LUT method for inverse halftoning. The sLUTs are numbered from 0 to 7 for 
reference. The entries generated in the pre-computation phase are stored in a Read Only 
Memory (ROM) that is part of the hardware implementation performing the inverse halftone 
operation. The pre-computation starts by extending the LUT to include the contone values of 
all $2^p$ templates, where p is the number of bits in a template. This extension can be 
performed by increasing the size of the training set, or by using the Hamming distance or 
Best Linear Estimator methods to calculate contone values for the templates not found in the 
training set images. After the entries in the LUT are complete, a parameter m is calculated 
as follows: 
$$m = \frac{\text{sum of all templates in the LUT}}{\text{number of templates in the LUT}}$$
After the LUT is complete, one template (t) at a time is taken out of it and the following 
logic operation is applied to the template (t): 
$$v(0..p-1) = t(0..p-1) \oplus m(0..p-1)$$
The above operation is a bitwise XOR between t and m, where both t and m are p bits wide 
(bits 0 to p-1): p = 16 for template type 16pels, p = 19 for template type 19pels, and 
p = 21 for template type Rect. The following arithmetic operation is then applied to the 
result: 
$$s(0..\log_2 p - 1) = v(0) + v(1) + \dots + v(p-1)$$
$$s(\log_2 p) = 0$$
In the above expressions s(0..log2p) stores the sum of v(0) to v(p-1). In the next step the 
following operations are applied: 
if t < m then s = -s, i.e. the 2’s complement of s is taken, 
slut(0..2) = s(0..2). 
slut takes values from 0 to 7, and the fetched template (t) and its contone value are stored 
in the smaller Look-Up Table (sLUT) whose number equals the slut value, because the sLUTs 
are also numbered from 0 to 7. The next template is then taken from the LUT and the same 
procedure is repeated.  
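
The pre-computation can be sketched in a few lines. This is an illustrative sketch under stated assumptions: templates are packed into p-bit integers, m is taken as the integer (floor) mean of the templates because the text does not say how it is rounded, and the function names are hypothetical.

```python
def mean_template(lut):
    """Parameter m: sum of all templates divided by their count (floor, assumed)."""
    return sum(lut) // len(lut)


def slut_index(t, m, p):
    """Number (0..7) of the sLUT that stores template t."""
    v = (t ^ m) & ((1 << p) - 1)   # bitwise XOR of t and m, p bits wide
    s = bin(v).count("1")          # sum of the bits v(0) .. v(p-1)
    if t < m:
        s = -s                     # 2's complement of the sum
    return s & 0b111               # keep the three least significant bits


def split_lut(lut, p):
    """Distribute every (template, contone) entry of the LUT over 8 sLUTs."""
    m = mean_template(lut)
    sluts = [dict() for _ in range(8)]
    for t, contone in lut.items():
        sluts[slut_index(t, m, p)][t] = contone
    return m, sluts
```

Because every entry lands in exactly one sLUT, the total number of stored entries is unchanged, which is the property the method relies on to avoid replicating the table.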
3.2 Method to Parallelize Inverse Halftone Operation: 
In the proposed algorithm of parallelization four p-bit templates t1, t2, t3 and t4 are fetched 
from the halftone image in parallel and the following logic operations are applied on it: 
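$$
\begin{aligned}
v_1(0..p-1) &\leftarrow t_1(0..p-1) \oplus m(0..p-1)\\
v_2(0..p-1) &\leftarrow t_2(0..p-1) \oplus m(0..p-1)\\
v_3(0..p-1) &\leftarrow t_3(0..p-1) \oplus m(0..p-1)\\
v_4(0..p-1) &\leftarrow t_4(0..p-1) \oplus m(0..p-1)
\end{aligned}
$$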
 
Following this operation each result v1 to v4 goes through the Carry Save Adder (CSA) trees. 
The CSA tree has p inputs and each is connected to one bit of the input number (v1 or v2 or v3 
or v4).   The CSA trees for templates 16pels, 19pels, and Rect are shown in Fig. 1, Fig. 2 and 
Fig. 3 respectively. The equations below show the addition operation: 
s1(0..log2p-1) ← CSA_TREE(v1(0), v1(1), ..,v1(p-1)) 
s1(log2p) ← 0 
s2(0..log2p-1) ← CSA_TREE(v2(0), v2(1), ..,v2(p-1)) 
s2(log2p) ← 0 
s3(0..log2p-1) ← CSA_TREE(v3(0), v3(1), ..,v3(p-1)) 
s3(log2p) ← 0 
s4(0..log2p-1) ← CSA_TREE(v4(0), v4(1), ..,v4(p-1)) 
s4(log2p) ← 0 
 
Fig. 1: Carry Save Adder (CSA) Tree for template type 16pels 
 
 
Fig. 2 Carry Save Adder (CSA) Tree for template type 19pels 
 
Fig. 3: Carry Save Adder (CSA) Tree for template type Rect 
The next step consists of the following comparison and arithmetic operations: 
if (t1 < m) = True then s1 = -s1, i.e. the 2’s complement of s1 is taken 
if (t2 < m) = True then s2 = -s2, i.e. the 2’s complement of s2 is taken 
if (t3 < m) = True then s3 = -s3, i.e. the 2’s complement of s3 is taken 
if (t4 < m) = True then s4 = -s4, i.e. the 2’s complement of s4 is taken 
In the next step we keep only the three least significant bits of each sum (or of its 2’s 
complement) and discard the remaining bits, i.e.: 
slut1 = s1(0..2), slut2 = s2(0..2), slut3 = s3(0..2), and slut4 = s4(0..2) 
The values slut1, slut2, slut3 and slut4 direct the templates t1, t2, t3 and t4 to the sLUTs 
whose reference numbers equal their slut values. The procedure used to send the templates to 
the corresponding sLUTs is described in the following text: 
The templates t1, t2, t3 and t4 are appended with the numbers 001, 010, 011, and 100 
respectively. The appended templates are p+3 bits wide and are named t1’, t2’, t3’, and t4’ 
respectively. In the next step four 1x8 de-multiplexers are connected to eight 8x1 
multiplexers. The Boolean equations representing the digital logic of these two steps are 
shown below: 
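$$
\begin{aligned}
A_0(0..p+2) &\leftarrow \overline{\mathit{slut}_1(2)} \cdot \overline{\mathit{slut}_1(1)} \cdot \overline{\mathit{slut}_1(0)} \cdot t_1'(0..p+2)\\
A_1(0..p+2) &\leftarrow \overline{\mathit{slut}_1(2)} \cdot \overline{\mathit{slut}_1(1)} \cdot \mathit{slut}_1(0) \cdot t_1'(0..p+2)\\
A_2(0..p+2) &\leftarrow \overline{\mathit{slut}_1(2)} \cdot \mathit{slut}_1(1) \cdot \overline{\mathit{slut}_1(0)} \cdot t_1'(0..p+2)\\
A_3(0..p+2) &\leftarrow \overline{\mathit{slut}_1(2)} \cdot \mathit{slut}_1(1) \cdot \mathit{slut}_1(0) \cdot t_1'(0..p+2)\\
A_4(0..p+2) &\leftarrow \mathit{slut}_1(2) \cdot \overline{\mathit{slut}_1(1)} \cdot \overline{\mathit{slut}_1(0)} \cdot t_1'(0..p+2)\\
A_5(0..p+2) &\leftarrow \mathit{slut}_1(2) \cdot \overline{\mathit{slut}_1(1)} \cdot \mathit{slut}_1(0) \cdot t_1'(0..p+2)\\
A_6(0..p+2) &\leftarrow \mathit{slut}_1(2) \cdot \mathit{slut}_1(1) \cdot \overline{\mathit{slut}_1(0)} \cdot t_1'(0..p+2)\\
A_7(0..p+2) &\leftarrow \mathit{slut}_1(2) \cdot \mathit{slut}_1(1) \cdot \mathit{slut}_1(0) \cdot t_1'(0..p+2)
\end{aligned}
$$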
 
 
where A0A1A2A3A4A5A6A7 are the outputs from the first de-multiplexer. 
 
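$$
\begin{aligned}
B_0(0..p+2) &\leftarrow \overline{\mathit{slut}_2(2)} \cdot \overline{\mathit{slut}_2(1)} \cdot \overline{\mathit{slut}_2(0)} \cdot t_2'(0..p+2)\\
B_1(0..p+2) &\leftarrow \overline{\mathit{slut}_2(2)} \cdot \overline{\mathit{slut}_2(1)} \cdot \mathit{slut}_2(0) \cdot t_2'(0..p+2)\\
B_2(0..p+2) &\leftarrow \overline{\mathit{slut}_2(2)} \cdot \mathit{slut}_2(1) \cdot \overline{\mathit{slut}_2(0)} \cdot t_2'(0..p+2)\\
B_3(0..p+2) &\leftarrow \overline{\mathit{slut}_2(2)} \cdot \mathit{slut}_2(1) \cdot \mathit{slut}_2(0) \cdot t_2'(0..p+2)\\
B_4(0..p+2) &\leftarrow \mathit{slut}_2(2) \cdot \overline{\mathit{slut}_2(1)} \cdot \overline{\mathit{slut}_2(0)} \cdot t_2'(0..p+2)\\
B_5(0..p+2) &\leftarrow \mathit{slut}_2(2) \cdot \overline{\mathit{slut}_2(1)} \cdot \mathit{slut}_2(0) \cdot t_2'(0..p+2)\\
B_6(0..p+2) &\leftarrow \mathit{slut}_2(2) \cdot \mathit{slut}_2(1) \cdot \overline{\mathit{slut}_2(0)} \cdot t_2'(0..p+2)\\
B_7(0..p+2) &\leftarrow \mathit{slut}_2(2) \cdot \mathit{slut}_2(1) \cdot \mathit{slut}_2(0) \cdot t_2'(0..p+2)
\end{aligned}
$$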
 
 
where B0B1B2B3B4B5B6B7 are the outputs from the second de-multiplexer. 
 
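$$
\begin{aligned}
C_0(0..p+2) &\leftarrow \overline{\mathit{slut}_3(2)} \cdot \overline{\mathit{slut}_3(1)} \cdot \overline{\mathit{slut}_3(0)} \cdot t_3'(0..p+2)\\
C_1(0..p+2) &\leftarrow \overline{\mathit{slut}_3(2)} \cdot \overline{\mathit{slut}_3(1)} \cdot \mathit{slut}_3(0) \cdot t_3'(0..p+2)\\
C_2(0..p+2) &\leftarrow \overline{\mathit{slut}_3(2)} \cdot \mathit{slut}_3(1) \cdot \overline{\mathit{slut}_3(0)} \cdot t_3'(0..p+2)\\
C_3(0..p+2) &\leftarrow \overline{\mathit{slut}_3(2)} \cdot \mathit{slut}_3(1) \cdot \mathit{slut}_3(0) \cdot t_3'(0..p+2)\\
C_4(0..p+2) &\leftarrow \mathit{slut}_3(2) \cdot \overline{\mathit{slut}_3(1)} \cdot \overline{\mathit{slut}_3(0)} \cdot t_3'(0..p+2)\\
C_5(0..p+2) &\leftarrow \mathit{slut}_3(2) \cdot \overline{\mathit{slut}_3(1)} \cdot \mathit{slut}_3(0) \cdot t_3'(0..p+2)\\
C_6(0..p+2) &\leftarrow \mathit{slut}_3(2) \cdot \mathit{slut}_3(1) \cdot \overline{\mathit{slut}_3(0)} \cdot t_3'(0..p+2)\\
C_7(0..p+2) &\leftarrow \mathit{slut}_3(2) \cdot \mathit{slut}_3(1) \cdot \mathit{slut}_3(0) \cdot t_3'(0..p+2)
\end{aligned}
$$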
 
 
where C0C1C2C3C4C5C6C7 are the outputs from the third de-multiplexer. 
 
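$$
\begin{aligned}
D_0(0..p+2) &\leftarrow \overline{\mathit{slut}_4(2)} \cdot \overline{\mathit{slut}_4(1)} \cdot \overline{\mathit{slut}_4(0)} \cdot t_4'(0..p+2)\\
D_1(0..p+2) &\leftarrow \overline{\mathit{slut}_4(2)} \cdot \overline{\mathit{slut}_4(1)} \cdot \mathit{slut}_4(0) \cdot t_4'(0..p+2)\\
D_2(0..p+2) &\leftarrow \overline{\mathit{slut}_4(2)} \cdot \mathit{slut}_4(1) \cdot \overline{\mathit{slut}_4(0)} \cdot t_4'(0..p+2)\\
D_3(0..p+2) &\leftarrow \overline{\mathit{slut}_4(2)} \cdot \mathit{slut}_4(1) \cdot \mathit{slut}_4(0) \cdot t_4'(0..p+2)\\
D_4(0..p+2) &\leftarrow \mathit{slut}_4(2) \cdot \overline{\mathit{slut}_4(1)} \cdot \overline{\mathit{slut}_4(0)} \cdot t_4'(0..p+2)\\
D_5(0..p+2) &\leftarrow \mathit{slut}_4(2) \cdot \overline{\mathit{slut}_4(1)} \cdot \mathit{slut}_4(0) \cdot t_4'(0..p+2)\\
D_6(0..p+2) &\leftarrow \mathit{slut}_4(2) \cdot \mathit{slut}_4(1) \cdot \overline{\mathit{slut}_4(0)} \cdot t_4'(0..p+2)\\
D_7(0..p+2) &\leftarrow \mathit{slut}_4(2) \cdot \mathit{slut}_4(1) \cdot \mathit{slut}_4(0) \cdot t_4'(0..p+2)
\end{aligned}
$$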
 
 
where D0D1D2D3D4D5D6D7 are the outputs from the fourth de-multiplexer. 
 
The eight 8x1 multiplexers are implemented in the following way. If more than one template 
has the same slut value (among slut1, slut2, slut3 and slut4), the template with the highest 
template number fetches its inverse halftone value from the sLUT and the remaining templates 
with the same slut value are dropped. For example, suppose t3 and t4 have the same slut 
value, i.e. slut4 = slut3. Since t3 has template number 3 and t4 has template number 4, t3 
and slut3 are dropped and t4 goes to the sLUT indicated by slut4. The following Boolean 
equations show the multiplexers with selection of the highest template number on the select 
lines:    
 
         
For k = 0, 1, ..., 7, let

$$V^X_k = X_k(p) + X_k(p+1) + X_k(p+2), \qquad X \in \{A, B, C, D\},$$

be the OR of the three template-number bits of X_k, which is 1 only when a template has been 
routed to that line. The eight multiplexers are then

$$
\begin{aligned}
g_{k+1}(0..p+2) \leftarrow{} & V^D_k \cdot D_k(0..p+2)\\
 &+ \overline{V^D_k} \cdot V^C_k \cdot C_k(0..p+2)\\
 &+ \overline{V^D_k} \cdot \overline{V^C_k} \cdot V^B_k \cdot B_k(0..p+2)\\
 &+ \overline{V^D_k} \cdot \overline{V^C_k} \cdot \overline{V^B_k} \cdot V^A_k \cdot A_k(0..p+2),
 \qquad k = 0, 1, \dots, 7.
\end{aligned}
$$
 
g1 to g8 are the outputs of the 8 multiplexers. Each 8x1 multiplexer has combinational logic 
attached to its select lines so that, among the templates routed to it, the one with the 
highest template number reaches the output. The outputs of the multiplexers are templates 
together with their template numbers.  
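
The routing performed by the de-multiplexers and multiplexers can be modelled in software as follows. This is a behavioural sketch only, not the gate-level logic: the function name `route` and the list-based port representation are assumptions made for illustration.

```python
def route(templates, slut_values):
    """Route four templates to eight sLUT ports.

    templates   -- [t1, t2, t3, t4]
    slut_values -- [slut1, slut2, slut3, slut4], each in 0..7
    Returns (ports, dropped): ports[k] is None or (template_number, template)
    for sLUT k, and dropped lists the template numbers that lost a conflict.
    """
    ports = [None] * 8
    for number, (t, k) in enumerate(zip(templates, slut_values), start=1):
        # Templates are visited in increasing number, so on a conflict the
        # higher template number overwrites the lower one.
        ports[k] = (number, t)
    kept = {entry[0] for entry in ports if entry is not None}
    dropped = [n for n in range(1, len(templates) + 1) if n not in kept]
    return ports, dropped
```

For example, if t2, t3 and t4 all map to sLUT 5, only t4 reaches that sLUT and template numbers 2 and 3 are reported as dropped.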
 
The next step contains smaller Look-Up Tables (sLUTs) that are implemented using Content 
Addressable Memory (CAM) and Read Only Memory (ROM) pairs. The block diagram in 
Fig. 5 shows implementation of one sLUT. The CAM stores the templates that are assigned 
to the sLUT and it returns the address of the adjacent ROM where the contone value for the 
template is stored. The Boolean equations below show the operations performed in this 
block: 
 
$$
\begin{aligned}
x_{k+1}(0..d-1) &\leftarrow \mathrm{CAM}_k\big(g_{k+1}(0..p-1)\big),\\
c_{k+1}(0..7) &\leftarrow \mathrm{ROM}_k\big(x_{k+1}(0..d-1)\big), \qquad k = 0, 1, \dots, 7,
\end{aligned}
$$

where d is the width of the CAM output, the number of entries in the smaller Look-Up Table 
(sLUT) is $2^d - 1$, and the maximum width of the ROM is 8 bits, i.e. 256 grey levels. 
 
 
In the above equations the CAM() function represents an access to the CAM and the ROM() 
function represents an access to the ROM. 
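
A software analogue of one CAM and ROM pair may help to picture this step. The sketch below is an assumption-laden model, not the hardware: a Python dict plays the role of the CAM (template to address) and a list plays the role of the ROM (address to contone value); `make_slut` and `slut_lookup` are hypothetical names.

```python
def make_slut(entries):
    """Build a (cam, rom) pair from a {template: contone} mapping."""
    cam = {}   # template -> address, like a content addressable memory
    rom = []   # address -> 8-bit contone value
    for address, (template, contone) in enumerate(sorted(entries.items())):
        cam[template] = address
        rom.append(contone)
    return cam, rom


def slut_lookup(cam, rom, template):
    """Two-stage access: the CAM yields an address, the ROM yields the value."""
    address = cam.get(template)
    if address is None:
        return None   # template is not stored in this sLUT
    return rom[address]
```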
 
Fig. 5: smaller Look-Up Table (sLUT) implemented in terms of CAM and ROM 
 The next step consists of pixel compensation in which the dropped pixels are assigned 
contone values from their neighbors. This block outputs four valid contone values with 
template numbers t1= 001, t2= 010, t3= 011, and t4=100. The Boolean expressions for the 
combinational logic in this block are as follows: 
 
For k = 0, 1, ..., 7, a_k is 1 exactly when output g_{k+1} carries the template number 001 
of t1, i.e. it is the product of the bits g_{k+1}(p), g_{k+1}(p+1) and g_{k+1}(p+2), each 
complemented where the template number has a 0:

$$
\begin{aligned}
a_k &\leftarrow \big[\, g_{k+1}(p..p+2) = 001 \,\big], \qquad k = 0, 1, \dots, 7\\
a_8(0..7) &\leftarrow a_0 \cdot c_1(0..7) + a_1 \cdot c_2(0..7) + \dots + a_7 \cdot c_8(0..7)\\
a_9 &\leftarrow a_0 + a_1 + a_2 + a_3 + a_4 + a_5 + a_6 + a_7\\
a_{10}(0..7) &\leftarrow a_9 \cdot a_8(0..7) + \overline{a_9} \cdot b_8(0..7)\\
\mathrm{Contone}_{t1}(0..7) &\leftarrow a_{10}(0..7)
\end{aligned}
$$
 
where Contonet1 is the contone value of the template t1.   
 
For k = 0, 1, ..., 7, b_k is 1 exactly when output g_{k+1} carries the template number 010 
of t2:

$$
\begin{aligned}
b_k &\leftarrow \big[\, g_{k+1}(p..p+2) = 010 \,\big], \qquad k = 0, 1, \dots, 7\\
b_8(0..7) &\leftarrow b_0 \cdot c_1(0..7) + b_1 \cdot c_2(0..7) + \dots + b_7 \cdot c_8(0..7)\\
b_9 &\leftarrow b_0 + b_1 + b_2 + b_3 + b_4 + b_5 + b_6 + b_7\\
b_{10}(0..7) &\leftarrow b_9 \cdot b_8(0..7) + \overline{b_9} \cdot d_8(0..7)\\
\mathrm{Contone}_{t2}(0..7) &\leftarrow b_{10}(0..7)
\end{aligned}
$$
 
where Contonet2 is the contone value of the template t2.   
 
For k = 0, 1, ..., 7, d_k is 1 exactly when output g_{k+1} carries the template number 011 
of t3:

$$
\begin{aligned}
d_k &\leftarrow \big[\, g_{k+1}(p..p+2) = 011 \,\big], \qquad k = 0, 1, \dots, 7\\
d_8(0..7) &\leftarrow d_0 \cdot c_1(0..7) + d_1 \cdot c_2(0..7) + \dots + d_7 \cdot c_8(0..7)\\
d_9 &\leftarrow d_0 + d_1 + d_2 + d_3 + d_4 + d_5 + d_6 + d_7\\
d_{10}(0..7) &\leftarrow d_9 \cdot d_8(0..7) + \overline{d_9} \cdot e_8(0..7)\\
\mathrm{Contone}_{t3}(0..7) &\leftarrow d_{10}(0..7)
\end{aligned}
$$
 
where Contonet3 is the contone value of the template t3.     
For k = 0, 1, ..., 7, e_k is 1 exactly when output g_{k+1} carries the template number 100 
of t4:

$$
\begin{aligned}
e_k &\leftarrow \big[\, g_{k+1}(p..p+2) = 100 \,\big], \qquad k = 0, 1, \dots, 7\\
e_8(0..7) &\leftarrow e_0 \cdot c_1(0..7) + e_1 \cdot c_2(0..7) + \dots + e_7 \cdot c_8(0..7)\\
\mathrm{Contone}_{t4}(0..7) &\leftarrow e_8(0..7)
\end{aligned}
$$
 
where Contonet4 is the contone value corresponding to pixel fetched at 0th position in 
template t4.   
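
The effect of pixel compensation can be illustrated with a short behavioural sketch. It rests on an assumption that goes slightly beyond the text: a dropped template copies the value of the nearest higher-numbered template that has a value, so a run of dropped templates is filled from the right. The name `compensate` and the use of `None` for a dropped template are illustrative choices.

```python
def compensate(values):
    """Fill dropped contone values from the neighbouring template.

    values -- contone values for [t1, t2, t3, t4], None where the template
              was dropped. t4 has the highest template number and therefore
              always wins a conflict, so its value must be present.
    """
    if values[-1] is None:
        raise ValueError("the highest-numbered template is never dropped")
    out = list(values)
    # Walk from t3 down to t1 so that the right neighbour is already filled.
    for i in range(len(out) - 2, -1, -1):
        if out[i] is None:
            out[i] = out[i + 1]
    return out
```

Copying from a neighbour is what trades image quality for speed: every dropped pixel repeats a nearby grey level instead of using its own table entry.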
 
4 Simulation and Discussion: 
The algorithm to parallelize the LUT method for inverse halftoning is simulated in order to: 
(a) measure the percentage of pixels that are dropped and copy their contone values from 
neighbors in the pixel compensation step, and (b) create a sample image that shows the 
effect of pixel compensation (copying from neighbors) on a perfect contone image, i.e. when 
the smaller Look-Up Tables (sLUTs) store the exact contone values that correspond to the 
templates.  
4.1 Percentage of Pixels Dropped and Compensated from Neighbors: 
The training set is developed from the images boat, barbara, and lena, and m is computed 
from it. The size of each sLUT is found to be 2K entries on average. The method fetches four 
templates from the halftone image, and if more than one template has the same slut value 
then the template with the highest template number is kept and the remaining ones are 
counted as dropped pixels. Table I shows the percentage of pixels dropped.  
 
 
Table I: Percentage of templates that have their contone values copied from the 
neighbors 
Image      Percentage of pixels dropped and compensated
Boat       31.14%
Lena       31.35%
Boat       16.30%
Barbara    32.40%
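
The counting behind such a table can be sketched as below. This is not the authors' simulation code: it assumes the slut values of all templates are already available as a flat list in fetch order, grouped four at a time, and `dropped_fraction` is a hypothetical name.

```python
def dropped_fraction(slut_values, group_size=4):
    """Fraction of templates dropped when fetched `group_size` at a time.

    Within each group, templates that share a slut value compete for one
    sLUT; one of them is served and the others are dropped.
    """
    dropped = 0
    for start in range(0, len(slut_values), group_size):
        group = slut_values[start:start + group_size]
        dropped += len(group) - len(set(group))
    return dropped / len(slut_values)
```

With one group of four distinct values and one group of four equal values, three of the eight templates are dropped and the function returns 0.375.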
 
4.2 Image quality Analysis: 
This section examines the degradation in image quality that occurs due to copying contone 
values from the neighbors. It is illustrated with the example image Boat, which achieved a 
PSNR of 21.1783 dB with the parallelized LUT method for inverse halftoning when the loss due 
to the LUT method itself is zero. The image is shown in Fig. 6 below:  
  
Fig. 6: The image obtained through proposed parallelized LUT inverse halftoning. 
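
For reference, PSNR for 8-bit images is conventionally computed as 10 log10(255^2 / MSE), where MSE is the mean squared error between the reference and the reconstructed image. The sketch below assumes this standard definition and images given as equally sized lists of rows; the paper does not state which variant of the formula was used.

```python
import math


def psnr(reference, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_errors = [
        (r - x) ** 2
        for ref_row, rec_row in zip(reference, reconstructed)
        for r, x in zip(ref_row, rec_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf   # identical images
    return 10 * math.log10(peak ** 2 / mse)
```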
4. Hardware Implementation: 
The LUT method for Floyd and Steinberg [8] error diffused halftones is parallelized. The 
training set is built from boat, barbara and lena, and the average size of one sLUT is found 
to be 2K entries. The complete algorithm is implemented in two CPLDs (Complex Programmable 
Logic Devices), and external CAMs and SRAMs are used to store the sLUTs. Fig. 7 illustrates 
the system block diagram. The CPLDs used are Altera [9] MAX II devices; the CAMs and SRAMs 
are implemented in Altera APEX FPGA devices but can be replaced with discrete devices in 
future designs. CPLD I contains the proposed parallelization algorithm and CPLD II contains 
the pixel compensation circuit. The assignment of template numbers to the incoming “19pels” 
templates is performed partially in both CPLD I and CPLD II in order to fit the design 
within the MAX II pin count and to reduce the fitting complexity of CPLD I.     
 
Fig. 7: Block diagram of the algorithm implementation.  
In the above figure, CPLD I accepts four “19pels” templates from the halftone image and 
sends each one, according to its slut value, to one of its eight output ports, so that at 
most four of the eight ports are in use at a time. The ports of CPLD I are connected to 
CAMs, which are in turn connected to SRAMs. The grey level values from the SRAMs go to 
CPLD II, where four pixel compensation circuits are present. CPLD II outputs the grey level 
values in the correct sequence, i.e. G1 corresponds to the contone value of t1 and so on. 
The results of the CPLD implementation, obtained from the Fitter and Timing Analyzer tools 
of Altera Quartus II 5.0, are tabulated in Table II.  
 
Table II: Results of CPLD implementations 
Device                       Area (logic elements)   I/O pins   Clock frequency
CPLD I  (EPM2210GF324I5)     2049/2210               261/272    33.86 MHz
CPLD II (EPM2210GF324I5)     262/2210                262/272    164.85 MHz
 
5. Conclusion: 
The LUT method for inverse halftoning has been parallelized, with the following advantages: 
(a) the inverse halftone operation is sped up by up to 4 times while the number of LUT 
entries remains the same, and (b) the inverse halftone operation on halftone images is 
performed in fast hardware instead of an embedded hardware-software system.    
 
Acknowledgements: 
The authors would like to acknowledge King Fahd University of Petroleum & Minerals, Dhahran, 
for all its support.  
References: 
[1] Murat Mese and P. P. Vaidyanathan, “Recent Advances in Digital Halftoning and Inverse 
Halftoning Method,” IEEE Trans. Circuits and Systems I, June 2002.  
[2] Ping Wong and Nasir D. Memon, “Image Processing for Halftoning,” IEEE Signal 
Processing Magazine, vol. 20, July 2003. 
[3] Murat Mese and P. P. Vaidyanathan, “Lookup Table (LUT) Method for Inverse 
Halftoning,” IEEE Trans. Image Processing, vol. 10, October 2001.  
[4] A. N. Netravali and E. G. Bowen, “Display of Dithered Images,” Proc. SID, vol. 22, pp. 
185-190, 1981. 
[5] M. Y. Ting and E. A. Riskin, “Error-diffused Image Compression using a binary to gray 
scale decoder and predictive pruned tree structured vector quantization,” IEEE Trans. Image 
Processing, vol. 3, pp. 854-858, 1994.  
[6] P. C. Chang, C. S. Yu and T. H. Lee, “Hybrid LMS-MMSE Inverse Halftoning 
Technique,” IEEE Transactions on Image Processing, vol. 10, January 2001. 
[7] Kuo-Liang Chung and Shih-Tung Wu, “Inverse Halftoning Algorithm using Edge-Based 
Lookup Table Approach,” IEEE Trans. Image Processing, vol. 14, no. 10, pp. 1583-1589, 
October 2005. 
[8] R. Floyd and L. Steinberg, “An Adaptive Algorithm for Spatial Grey-scale,” Proc. SID, 
pp. 75-77, 1976. 
[9] http://www.altera.com 
