Design and implementation of a high-quality, low-power deinterlacer circuit by Tenze, Livio et al.
Design and implementation
of a high-quality, low-power deinterlacer circuit
Livio Tenze, Stefano Marsi, Sergio Carrato
D.E.E.I., University of Trieste, v. Valerio 10, 34100 Trieste, Italy
e-mail: {tenze,marsi,carrato}@ipl.univ.trieste.it
ABSTRACT — A novel circuit for a simple
non motion-compensated deinterlacer is presented,
where a motion detection module effectively com-
putes a weighted mean of a temporal and a spatial
interpolator. Both the motion detection module and
the two interpolators are innovative, simple and
effective, and experiments show that they provide
very good results at a low computational
complexity. The final circuit, based on
a pipeline architecture, has been designed to work
on a real time video digital signal with a low power
consumption.
1 Introduction
A common need today is to display interlaced video
sequences on progressive displays, mainly because of the
widespread use of PCs for multimedia applications and
the increasing diffusion of solid state displays for TV
sets. In this paper, we consider the implementation of
a non motion-compensated deinterlacer [1] where two
innovative, simple and effective interpolators are used,
and their contributions are weighted by a suitable mo-
tion detection module. Very good operation is provided
also in “difficult” cases such as static horizontal edges
and thin lines, and no image flow artifacts are intro-
duced.
After a brief review of the algorithm, a VLSI imple-
mentation is proposed, where the adopted architecture
will be described along with the technical features of the
realized circuit.
2 The proposed algorithm
As already mentioned, our deinterlacing algorithm is
aimed at providing good quality and artifact free im-
ages at a low computational complexity and, conse-
quently, with low power consumption. It consists of a
motion detector module and of two operators for the re-
construction, one being purely temporal and the other
purely spatial. The detector modulates the contribution
of the two operators, according to the level of estimated
motion. Both the detector module and the two oper-
ators, while presenting low computational complexity,
have very accurate behaviour, i.e. their output is cor-
rect also in non trivial cases such as image flow or thin
horizontal, or quasi horizontal, lines.
Let us consider a 5 × 5 mask around the pixel to be
interpolated, on the previous, the present, and the fol-
lowing field. Going into detail, with reference to Fig. 1,
we consider 5 pixels on the line above and below the
pixel to be interpolated, x7, in the current field, and
one pixel above and one below x7 in the previous and in
the following fields (pixels $x^p_1$, $x^p_{13}$ and
$x^f_1$, $x^f_{13}$, respectively).
As already mentioned, the approach consists in finding
a good spatial estimate, y7,spatial, and a good temporal
one, y7,temp, and then suitably combining them in the
final output y7 of the deinterlacer.

              x1
    x2   x3   x4   x5   x6
              x7
    x8   x9   x10  x11  x12
              x13

Figure 1: Pixels considered by the deinterlacer. x7 is
the pixel to be reconstructed; pixels x2 to x6 and x8 to
x12 belong to the current field, while pixels x1, x7, and
x13 are available in the previous and in the following
field, and will be referred to as $x^p_1$, $x^p_7$, and
$x^p_{13}$, and $x^f_1$, $x^f_7$, and $x^f_{13}$, respectively.
Let us first consider the spatial interpolator. If an
edge is present, it is advisable to interpolate along its
direction in order to avoid blurring. The edge can be de-
tected by considering differences along some directions
(namely, 45°, 90°, and 135°) and giving more weight to
the most similar pixels, i.e. the most “correlated” ones.
Consequently, we define the following directional means
m45 = (x5 + x9)/2
m90 = (x4 + x10)/2
m135 = (x3 + x11)/2
and differences
d45 = (|x4 − x8|+ 2|x5 − x9|+ |x6 − x10|)/4
d90 = (|x3 − x9|+ 2|x4 − x10|+ |x5 − x11|)/4
d135 = (|x2 − x10|+ 2|x3 − x11|+ |x4 − x12|)/4
dmax = max(d45, d90, d135)
dmin = min(d45, d90, d135)
A direction θ (θ = 45, 90, 135) with good correlation will
have a small dθ and, consequently, a large (dmax − dθ)
and a small (dθ − dmin). We then compute the weights
as
$$w_{s,\theta} = \frac{d_{max} - d_\theta + 1}{d_\theta - d_{min} + 1} \quad \text{for } \theta = 45, 90, 135$$
where the terms +1 have been added in order to avoid
null numerators and denominators, and finally
$$y_{7,spatial} = \frac{w_{s,45}\, m_{45} + w_{s,90}\, m_{90} + w_{s,135}\, m_{135}}{w_{s,45} + w_{s,90} + w_{s,135}} \qquad (1)$$
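The spatial interpolator of Eq. 1 can be sketched in a few lines of Python. This is only an illustrative floating-point model, not the fixed-point circuit: the function name `spatial_estimate` and the use of a dictionary keyed by the pixel indices of Fig. 1 are assumptions made for this example.

```python
def spatial_estimate(x):
    """Edge-directed spatial estimate of x7 (Eq. 1).

    `x` maps the pixel indices of Fig. 1 (2..6 and 8..12) to intensities.
    Hypothetical helper: floating point is used instead of fixed point.
    """
    # Directional means along 45, 90 and 135 degrees.
    m = {
        45: (x[5] + x[9]) / 2,
        90: (x[4] + x[10]) / 2,
        135: (x[3] + x[11]) / 2,
    }
    # Directional differences: small values indicate good correlation.
    d = {
        45: (abs(x[4] - x[8]) + 2 * abs(x[5] - x[9]) + abs(x[6] - x[10])) / 4,
        90: (abs(x[3] - x[9]) + 2 * abs(x[4] - x[10]) + abs(x[5] - x[11])) / 4,
        135: (abs(x[2] - x[10]) + 2 * abs(x[3] - x[11]) + abs(x[4] - x[12])) / 4,
    }
    d_max, d_min = max(d.values()), min(d.values())
    # The +1 terms keep numerators and denominators away from zero.
    w = {t: (d_max - d[t] + 1) / (d[t] - d_min + 1) for t in d}
    return sum(w[t] * m[t] for t in d) / sum(w.values())


# A step edge running along the 45 degree direction.
edge = {2: 0, 3: 0, 4: 0, 5: 100, 6: 100,
        8: 0, 9: 100, 10: 100, 11: 100, 12: 100}
# A flat region: all directions are equally good.
flat = {i: 50 for i in (2, 3, 4, 5, 6, 8, 9, 10, 11, 12)}
print(spatial_estimate(edge), spatial_estimate(flat))
```

For the 45° edge the weight of that direction dominates, so the estimate stays close to the value found along the edge rather than the blurred vertical average.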
Concerning the temporal part, a weighted mean is
computed between x7 in the previous and in the
following field, i.e. $x^p_7$ and $x^f_7$. In order to do so, some
new differences are computed, this time between $x^p_7$
or $x^f_7$ and the 6 closest pixels in the current field, i.e.
xi, i ∈ I = {3, 4, 5, 9, 10, 11}. In this case we are not
looking for edge directions; instead, the task is to
understand whether one of the two pixels $x^p_7$ and $x^f_7$ may be
considered a reliable estimate for x7. This is likely to
be the case if that pixel is similar to the xi, i ∈ I.
By defining suitable temporal differences [1] we are able
to compute two weights $w^p_t$ and $w^f_t$, similarly to what
has been done for the spatial interpolation. Therefore
the temporal estimate y7,temp can be computed according
to the following expression:
$$y_{7,temp} = \frac{w^p_t\, x^p_7 + w^f_t\, x^f_7}{w^p_t + w^f_t}. \qquad (2)$$
Finally, we combine the spatial and the temporal
parts with
y7 = (1− wtemp) · y7,spatial + wtemp · y7,temp (3)
with a suitable weight wtemp. In simple cases, a weight
of the type
$$w_{temp} = 1 - \frac{|x^f_7 - x^p_7|}{d_{min} + 1}$$
(with again the term +1 added to avoid null
denominators) should be sufficient. Indeed, in the fraction two
differences, a temporal and a spatial one, appear in the
numerator and in the denominator, respectively; in case
of good temporal correlation, $|x^f_7 - x^p_7|$ would
be much smaller than dmin, so that wtemp ≈ 1, and vice
versa if temporal correlation is poor.
A more sophisticated scheme has to be used, however,
if thin lines have to be correctly reconstructed and image
flow has to be avoided. To this purpose, we also consider
several differences between pixels in the present field
close to x7, i.e. x4 and x10, and pixels above and below
$x^p_7$ and $x^f_7$, i.e. $x^p_1$, $x^p_{13}$, $x^f_1$, and $x^f_{13}$. In [1] we show
that a suitable parameter $\bar{\delta}$ can be introduced, defined
upon these differences, which is small if the temporal
estimate y7,temp is reliable; consequently we compute
the final weight for Eq. 3 according to
$$w_{temp} = \max\left(1 - \frac{|x^f_7 - x^p_7| + \bar{\delta}}{2\, d_{min} + 1},\ 0\right)$$
If the background is uniform and similar to $x^p_7$ or $x^f_7$,
both $|x^f_7 - x^p_7|$ and $\bar{\delta}$ are small, so that wtemp ≈ 1 and
temporal interpolation is used. It has to be noted that
this situation also occurs in case of static thin lines,
which consequently are correctly reconstructed and
appear sharp in the progressive image; an example is
provided in Section 4. In case of image flow,
instead, $\bar{\delta}$ is large, so that wtemp ≈ 0 and y7 ≈ y7,spatial.
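The final weight and the combination of Eq. 3 can be illustrated with the following minimal sketch. The parameter δ̄ is defined in [1] and is not reproduced in this paper, so here it is simply passed in as an input (`delta_bar`); the function names are hypothetical and floating point replaces the fixed-point arithmetic of the circuit.

```python
def temporal_weight(xp7, xf7, d_min, delta_bar):
    """Clamped weight w_temp; `delta_bar` is taken as given (defined in [1])."""
    return max(1 - (abs(xf7 - xp7) + delta_bar) / (2 * d_min + 1), 0.0)


def combine(y_spatial, y_temp, w_temp):
    """Eq. 3: blend of the spatial and temporal estimates."""
    return (1 - w_temp) * y_spatial + w_temp * y_temp


# Static area: previous and following fields agree, delta_bar is small.
w_static = temporal_weight(xp7=80, xf7=80, d_min=10, delta_bar=0)
# Image flow: delta_bar is large, so the temporal estimate is rejected.
w_flow = temporal_weight(xp7=80, xf7=80, d_min=10, delta_bar=200)

print(combine(40, 80, w_static))  # temporal estimate is used
print(combine(40, 80, w_flow))    # spatial estimate is used
```

The clamp to zero matters because the fraction can exceed 1 when the temporal differences are large compared with the spatial ones.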
3 Implementation
Even if the proposed algorithm can be considered
quite simple from an analytic point of view, it involves
a large number of similar operations. A direct
implementation of the functions described by the
algorithm equations would require many complex
operators (e.g., adders, multipliers, minimum and maximum
functions, dividers and so on). Therefore, to realize
the algorithm effectively on a small, low-power circuit,
a possible solution is to share the elementary operators
and to schedule them in a sequential pipeline
architecture.
Considering the algorithm as a loop that repeatedly
executes the operations in its body until an exit con-
dition becomes true, a significant improvement can be
obtained scheduling consecutive loop iterations to be
partially overlapped in time, i.e. a new loop iteration is
started before the current iteration has finished. In such
a way the rate of the input/output data becomes
independent of the time required to complete all the
operations described in the algorithm, and the throughput
of the design can be significantly improved.
The two fundamental pipeline parameters, which
greatly affect the throughput, are the “initiation
interval”, i.e. the number of clock cycles between the start
of two consecutive loop iterations, and the “latency”,
i.e. the number of clock cycles required to execute all
the operations in a single loop iteration. While the
latency simply corresponds to the input-to-output delay,
the initiation interval is tightly related to the
throughput rate; in fact, the higher the number of clock cycles
in the initiation interval, the larger the re-use of the
operators in the algorithm. For example, if the initiation
interval consists of 8 clock cycles, an operator that
requires less than two clock cycles to complete its
operation can be used up to four times by different parts of
the algorithm along the entire latency; similarly, a faster
operator which requires less than one clock cycle can be
re-employed up to eight times.
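The re-use figures quoted above follow from a simple integer division. The short sketch below reproduces them, assuming that an operator occupies a whole number of clock cycles each time it is used; `max_reuse` is a hypothetical helper name.

```python
def max_reuse(initiation_interval, operator_cycles):
    """Times one operator instance can be scheduled per initiation interval."""
    if operator_cycles < 1 or operator_cycles > initiation_interval:
        raise ValueError("operator must fit inside the initiation interval")
    return initiation_interval // operator_cycles


print(max_reuse(8, 2))  # operator occupying two cycles
print(max_reuse(8, 1))  # operator occupying one cycle
```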
In the implementation of the proposed algorithm we
focused our attention on a particular application, i.e.
the real time deinterlacing of digital video with frames
of 720×576 pixels at 50 fields (25 frames) per second
(i.e. the input samples arrive with a 96.4 ns cadence).
The developed architecture exploits a pipeline with a
latency of 64 cycles and an initiation interval of 8 cycles,
while the clock cycle has been kept under 12 ns. Thus the
entire initiation interval of the pipeline, at most
8 × 12 ns = 96 ns, can be completed before a new input
sample appears.
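The timing budget can be checked with a few lines of arithmetic. The sketch below assumes, as the 96.4 ns figure suggests, that only the 720×576 active pixels are counted and that blanking intervals are ignored; the function names are illustrative.

```python
def sample_period_ns(width, height, frames_per_second):
    """Average time between active pixels in nanoseconds (blanking ignored)."""
    return 1e9 / (width * height * frames_per_second)


def initiation_time_ns(initiation_interval_cycles, clock_period_ns):
    """Duration of one initiation interval in nanoseconds."""
    return initiation_interval_cycles * clock_period_ns


period = sample_period_ns(720, 576, 25)
budget = initiation_time_ns(8, 12.0)
# Number of loop iterations that are in flight at the same time.
overlapping_iterations = 64 // 8

print(f"sample period: {period:.2f} ns, initiation interval: {budget:.0f} ns")
print(f"overlapping iterations: {overlapping_iterations}")
```

With a 12 ns clock the margin is well under 1 ns per sample, which is why the clock period has to stay below that bound.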
A complete scheme of the pipeline is represented in
Fig. 2, where however, for the sake of clarity, the transi-
tions have not been shown; in this figure, every operator
is represented by a column, while the time cadence has
been reported along the vertical axis, and the gray cells
represent the period during which every operator is used
in the pipeline cycle.
In this scheme we can note that all the operators are
characterized by a delay smaller than the input sample
period, and that most of them are re-employed several
times during different clock cycles to perform similar
tasks, obtaining in such a way a high-throughput
architecture. Moreover, considering that any
point of intersection between a column and a row in the
scheme represents a time-slot where an operator can
be scheduled, it can be noted that the entire scheduling
has been designed under the following constraint:
when the whole pipeline scheme is shifted vertically by
any multiple of the initiation interval, every
operator has to fall on a free time-slot.
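This constraint is equivalent to requiring that, for each operator, the clock cycles in which it is busy are all distinct modulo the initiation interval. A minimal checker is sketched below; the dictionary layout and the example cycle numbers are invented for illustration and do not reproduce the actual schedule of Fig. 2.

```python
def schedule_is_conflict_free(usage, initiation_interval):
    """Check the shift constraint for a pipeline schedule.

    `usage` maps an operator name to the list of clock cycles (counted
    from the start of one loop iteration) in which that operator is busy.
    Two cycles collide after shifting by a multiple of the initiation
    interval exactly when they are equal modulo that interval.
    """
    for cycles in usage.values():
        slots = [c % initiation_interval for c in cycles]
        if len(slots) != len(set(slots)):
            return False
    return True


# Hypothetical schedules, not taken from Fig. 2.
ok = {"div": [10, 11, 20], "mult": [5, 14]}
bad = {"div": [10, 18]}  # 10 and 18 fall in the same slot modulo 8

print(schedule_is_conflict_free(ok, 8), schedule_is_conflict_free(bad, 8))
```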
It has to be noted that this constraint is not
fundamental for the inputs, since the input samples can be reused
several times while the pipeline loop goes. Moreover, the
overlap of inputs among different pipeline loops makes
it possible to limit the number of input pins.
[Figure 2 (pipeline chart): one column per operator, grouped as I/O, add, abs, inc, mult, add, div, min/max; rows are clock cycles 0 to 64.]
Figure 2: Pipeline scheme for the circuit architecture

For the sake of clarity, referring to Fig. 1 and
considering that several pixels are involved in evaluating the
output, it is possible to constrain a single input pixel to
be interpreted as x6 in the current pipeline cycle, as x5 in
the previous one, and so on. In such a way it is sufficient
to use, for the whole circuit, just 4 different input data
buses. Two of them are devoted to the inputs of the
horizontal pixels, belonging to the two picture rows
involved in the current field (x2 . . . x6 and x8 . . . x12), and
their rate is equal to that of the input data. The
other two input buses are used for the acquisition of the
vertical pixels of the two columns (x1, x7, x13), one in
the previous and one in the next field. Since all these data
have to be acquired before a new pipeline cycle starts, a
higher rate has to be used to feed the circuit with these
data.
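The re-use of the horizontal inputs can be pictured as a five-sample sliding window over each row. The sketch below is one possible software reading of this idea, assuming that each new sample enters the window as x6 while older samples move toward x2; the generator name `sliding_row_window` is hypothetical and the hardware may organize its registers differently.

```python
from collections import deque


def sliding_row_window(row, width=5):
    """Yield (x2, x3, x4, x5, x6) tuples while reading one sample per step."""
    window = deque(maxlen=width)  # the oldest sample drops out automatically
    for pixel in row:
        window.append(pixel)      # the newest sample takes the x6 position
        if len(window) == width:
            yield tuple(window)


windows = list(sliding_row_window([1, 2, 3, 4, 5, 6]))
print(windows)
```

Only one new sample per row is read at each step, which is what allows a single bus per row.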
In the physical implementation of the circuit all of the
elementary operators (e.g., adders, multipliers, dividers)
have been developed in combinational form, using fixed
point arithmetic with a suitable number of bits. In order
to implement these operators we exploited the features
of architectures such as carry-lookahead and Wallace
tree.
The final circuit has been realized with a 0.35 µm
standard cell CMOS technology; a simplified layout is
reported in Fig. 5. The core occupies a total area
of 1.4 mm² and is composed of about 6000 standard
cells, while the simulated power consumption is
approximately 40 mW.
4 Experimental results
The presented algorithm has been tested with several
real world sequences; as an example, we show the results
obtained with “Salesman” and “Tennis”. These original
sequences are progressive; in order to numerically eval-
uate the behaviour of the deinterlacing algorithms, the
mentioned movies have been artificially interlaced. In
this way it is possible to estimate, using the MSE, the
difference between the original progressive frames and
the reconstructed ones.
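The evaluation procedure can be sketched as follows. The example drops every other line of a progressive frame, rebuilds the frame with line repetition (one of the baseline methods considered below), and measures the MSE against the original. Frames are plain lists of rows and the helper names are assumptions made for this illustration; only the even field is handled.

```python
def interlace_even(frame):
    """Keep the even-indexed lines of a progressive frame (one field)."""
    return [row for i, row in enumerate(frame) if i % 2 == 0]


def line_repetition(field):
    """Baseline deinterlacer: each missing line copies the line above it."""
    frame = []
    for row in field:
        frame.append(list(row))
        frame.append(list(row))
    return frame


def mse(a, b):
    """Mean square error between two frames of equal size."""
    n = sum(len(row) for row in a)
    total = sum((p - q) ** 2
                for row_a, row_b in zip(a, b)
                for p, q in zip(row_a, row_b))
    return total / n


original = [[0, 0], [10, 10], [20, 20], [30, 30]]
reconstructed = line_repetition(interlace_even(original))
print(mse(original, reconstructed))
```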
Each of these sequences is characterized by different
features: in “Salesman” slow motion and some thin lines
are present, while in “Tennis” a large amount of fast
motion and thin features show up. Following [2], our
method is compared with several well-known non
motion-compensated methods of comparable complexity,
namely line repetition, line averaging, field insertion,
linear VT filtering, VT median filtering, hybrid median
filtering and weighted median filtering.
In Fig. 3a we compare the results obtained on
“Salesman”. It may be seen
that the lowest value of Mean Square Error (MSE) is
achieved by the proposed deinterlacer. Moreover, direct
inspection of the sequences shows that our approach is
able to preserve the thin lines which are present, without
introducing the flickering that most of the other methods
produce.
[Figure 3 (plots): MSE versus frame number (frames 0 to 30) for line repetition, line averaging, field insertion, linear VT filtering, VT median filtering, hybrid median filtering, weighted median filtering, and the proposed method. Panel a): MSE axis from 0 to 70; panel b): MSE axis from 0 to 400.]
Figure 3: MSE for “Salesman” (a) and “Tennis” (b) for several well-known non motion-compensated deinterlacers.

Figure 4: Image number 3 of the “Tennis” sequence. From top-left to bottom-right: line repetition, line averaging,
field insertion, linear VT filtering, VT median, hybrid median, weighted median, proposed deinterlacer.

In Figs. 3b and 4 we present the results obtained with
“Tennis”. Also in this case, our approach produces the
best results: the moving hand of the table tennis player
is well reconstructed and, most notably, the white
diagonal line (the border of the table) is in our case almost
perfectly interpolated, while line repetition, line
averaging, linear VT filtering and weighted median produce
visible artifacts. The field insertion technique correctly
reconstructs the diagonal line, due to the lack of
motion in that region, but introduces large artifacts while
interpolating the moving hand. Median-based methods
perform somewhat better than the linear techniques, but
the diagonal line is smoothed or jagged in some parts;
moreover, they are not able to maintain the background
details when there is no motion.
References
[1] L. Tenze, A. Fermo, and S. Carrato, “A high-quality
edge and motion sensitive deinterlacer,” in Proc. 1st
COST 276 Workshop on Information and Knowl-
edge Management for Integrated Media Communica-
tion, (Leganes/Madrid, Spain), Nov. 2001. In print.
[2] G. De Haan and E. B. Bellers, “Deinterlacing—
An overview,” Proceedings of the IEEE, vol. 86,
pp. 1839–1857, Sept. 1998.
Figure 5: Simplified layout of the final chip. Core size
is about 1.4 mm².
