Focal-plane dynamic texture segmentation by programmable binning and scale extraction by Fernández-Berni, J. & Carmona-Galán, R.
Abstract Dynamic textures are spatially repetitive, time-varying visual patterns that
nevertheless present some temporal stationarity within their constituent elements. In
addition, their spatial and temporal extents are a priori unknown. This kind of pattern
is very common in nature; therefore, dynamic texture segmentation is an important
task for surveillance and monitoring. Conventional methods employ optic flow
computation, though it represents a heavy computational load. Here we describe texture
segmentation based on focal-plane scale-space generation. The programmable
size of the subimages to be analyzed and the scales to be extracted encode sufficient
information from the texture signature to signal its presence. A prototype smart
imager has been designed and fabricated in 0.35 µm CMOS, featuring a very low-power
scale-space representation of user-defined subimages.
1 Introduction
Dynamic textures (DTs) are visual patterns with spatial repeatability and a certain
temporal stationarity. They are time-varying, but some relations between their con-
stituting elements are maintained through time. Because of this, we can talk about
the frequency signature of the texture [1]. An additional feature of a DT is its inde-
terminate spatial and temporal extent. Smoke, waves, a flock of birds or tree leaves
swaying in the wind are some examples. The detection, identification and track-
ing of DTs is essential in surveillance because they are very common in natural
scenes. Amongst the different methods proposed for dynamic texture recognition,
those based on optical flow are currently the most popular [2]. Optic flow is a
J. Fernández-Berni and R. Carmona-Galán
Institute of Microelectronics of Seville (IMSE-CNM-CSIC)
Consejo Superior de Investigaciones Científicas-Universidad de Sevilla
C/ Americo Vespucio s/n 41092 Seville (Spain)
e-mail: {jfberni,rcarmona}@imse-cnm.csic.es
computationally efficient and natural way to characterise the local dynamics of a
temporal texture. This is the case for weak DTs, which become static when referred
to a local coordinate system that moves across the scene. However, the recogni-
tion of strong DTs implies a much greater computational effort. For these textures,
possessing intrinsic dynamics, the brightness constancy assumption associated with
standard optical flow algorithms cannot be applied. More complex approaches must
be considered in order to overcome this problem. Recently, interesting results have
been achieved by applying the so-called brightness conservation assumption [3].
However, this method entails a heavy computational load and, consequently, high energy
consumption. For a particular type of artificial vision system, a power-efficient
implementation of dynamic texture recognition is mandatory. Wireless multimedia
sensor networks [4] are an obvious example. These networks are composed of a large
number of low-power sensors that are densely deployed throughout a region of in-
terest in order to capture and analyse video, audio and environmental data from their
surroundings. The massive and scattered deployment of these sensors makes them
quite difficult to service and maintain. Therefore, energy efficiency must be a major
design goal, in order to extend the lifetime of the batteries as much as possible.
We propose an approach that does not rely on heavy computation by a general-purpose
processor, but on an adapted architecture in which the more tedious tasks
are conveyed to the focal plane to be realized concurrently with the image capture.
This results in a simplified scene representation that nevertheless carries all the
necessary information. In this scheme, redundant spatial information is removed at
the earliest stages of the processing by means of simple, flexible and power-efficient
computation at the focal plane. This architecture encodes the major features of the
DTs, in the sense that the spatial sampling rate and the spatial filter passband limits
are programmed into the system. This makes it possible to track textures of an expected
spatial spread and frequency signature. The main processor then operates on a reduced
representation of the original image obtained at the focal plane, so its computational
load is greatly alleviated.
2 Simplified scene representation
2.1 Programmable binning and filtering
In general, existing research on dynamic texture recognition is based on global features
computed over the whole scene. A clear sign of this fact is that practically all
of the sequences composing the reference database DynTex [5] contain only close-
up sequences. It does make sense, in these conditions, to apply strategies of global
feature detection. However, in a different context, e. g. video-surveillance, textures
can appear at any location of the scene. Local analysis is required for texture de-
tection and tracking. One way of reducing the amount of data to be processed is to
summarize the joint contribution of a group of pixels to the appropriate medium-size-feature
index.
Fig. 1 Binning and filtering applied to a scene containing a flock of starlings
Let us consider the picture in Fig. 1(a), which depicts a flock of
starlings. It is known that these flocks maintain an internal coherence based on local
rules that place each individual bird at a distance prescribed by their wing span [6].
This is an example of self-organized collective behaviour, whose emergent features
are characterized by a set of physical parameters like, for instance, flock density. We
can estimate the density of the flock —more precisely, the density of its projected
image— by conveniently encoding the nature of the object into the spatial sampling
rate and the passband of the spatial filter selected to process the subimages.
The first step is then to subdivide the image into pixel groups that, for the sake
of clarity, will be of the same size and regularly distributed along the two image
dimensions (Fig. 1(b)). Then, if the image size is $M \times N$ pixels, and the image is
divided into equally sized subimages, or bins, of size $m \times n$ pixels, we will end up with a
representation of the scene that is $1/R$ times smaller than the original image, where:
$$R = \frac{m}{M} \cdot \frac{n}{N} \tag{1}$$
The problem is now that of finding a magnitude that summarizes the information
contained in each $m \times n$-pixel bin and is still useful for our objective of
texture detection. In the case of the starling flock density, a measure of the number of
birds contained in a certain region of the image can be given by the high-frequency
content of each bin. Notice that features of a low spatial frequency do not represent
any object of interest, i.e. a bird, but details belonging to the background. Therefore
the value of each bin, represented in Fig. 1(c), can be defined as the quotient:
$$B_{kl} = \frac{\sum_{\forall |\mathbf{k}| > 0} E_{kl}(\mathbf{k})}{\sum_{\forall \mathbf{k}} E_{kl}(\mathbf{k})} \tag{2}$$
where $k \in \{1, \dots, M/m\}$ and $l \in \{1, \dots, N/n\}$. Also, $\mathbf{k} = (u, v)$ represents the possible
wave numbers and each summand $E_{kl}(\mathbf{k})$ is the energy associated with frequency
$\mathbf{k}$, computed within the bin indexed by $(k, l)$. This is a spatial highpass filter
normalized to the total energy associated with the image bin. If $V_{ij}$ is the value of the
pixel located at the $i$-th row and $j$-th column of the bin, and $\hat{V}_{uv}$ is the component
of the spatial DFT of the bin corresponding to the $u$ and $v$ reciprocal lengths, the
total image energy associated with the bin is:
$$\sum_{\forall \mathbf{k}} E_{kl}(\mathbf{k}) = \sum_{u=0}^{m-1} \sum_{v=0}^{n-1} |\hat{V}_{uv}|^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} |V_{ij}|^2 \tag{3}$$
The result is an estimation of the bird density, at a coarser grain than the full-size
image, that avoids pixel-level analysis.
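The binning and the normalized highpass measure of Eqs. (2) and (3) can be sketched in software as follows. This is only an illustration under stated assumptions, not the focal-plane implementation: the helper names `dft2`, `bin_value` and `binned_representation` are hypothetical, the DFT is taken as unitary so that the Parseval identity of Eq. (3) holds without extra factors, the image dimensions are assumed to be multiples of the bin dimensions, and an all-zero bin is arbitrarily assigned the value 0.

```python
import cmath
import math


def dft2(block):
    """Unitary 2-D DFT of a small block given as a list of rows."""
    m, n = len(block), len(block[0])
    norm = 1.0 / math.sqrt(m * n)
    return [
        [
            norm * sum(
                block[i][j] * cmath.exp(-2j * math.pi * (u * i / m + v * j / n))
                for i in range(m)
                for j in range(n)
            )
            for v in range(n)
        ]
        for u in range(m)
    ]


def bin_value(block):
    """High-frequency energy of a bin divided by its total energy, Eq. (2)."""
    energy = [[abs(c) ** 2 for c in row] for row in dft2(block)]
    total = sum(sum(row) for row in energy)
    if total == 0.0:
        return 0.0  # assumption: an all-zero bin carries no texture
    dc = energy[0][0]  # the only component with |k| = 0
    return (total - dc) / total


def binned_representation(image, m, n):
    """Split an M x N image into m x n bins and return the grid of bin values."""
    rows, cols = len(image), len(image[0])
    return [
        [
            bin_value([row[c:c + n] for row in image[r:r + m]])
            for c in range(0, cols, n)
        ]
        for r in range(0, rows, m)
    ]
```

A flat bin yields a value close to 0 and a checkerboard bin a value close to 1, which is the behaviour expected from a density indicator based on high-frequency content.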
2.2 Linear diffusion as a scale-space generator
As already described, apart from the appropriate binning, which sets the feature
size and geometrical ratio of interest, a suitable filter family is needed to discriminate
the spatial frequency components within each subimage. Let us consider that an
image is a continuous function evaluated over the real plane, $V(x, y)$, that assigns
a brightness value to each point in the plane. If this brightness is regarded as the
concentration of a scalar property [7], the flux density is proportional to the gradient
and directed against it. The diffusion equation:
$$\frac{\partial V(x, y, t)}{\partial t} = D \nabla^2 V(x, y, t) \tag{4}$$
follows from continuity considerations, i.e. no sources or sinks of brightness are
present in the image plane. The original image is assumed to be the initial value
at every point, $V(x, y, 0)$. Then, applying the Fourier transform to this equation and
solving in time:
$$\hat{V}(\mathbf{k}, t) = \hat{V}(\mathbf{k}, 0)\, e^{-4\pi^2 |\mathbf{k}|^2 D t} \tag{5}$$
which represents, in the reciprocal space, the transfer function of a Gaussian filter:
$$G(\mathbf{k}; \sigma) = e^{-2\pi^2 \sigma^2 |\mathbf{k}|^2} \tag{6}$$
whose smoothing factor is related to the diffusion duration $t$ through:
$$\sigma = \sqrt{2Dt} \tag{7}$$
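The step from Eq. (4) to Eqs. (5)-(7) can be spelled out as below. The sketch assumes the Fourier convention $\hat{V}(\mathbf{k}) = \int V(\mathbf{x})\, e^{-2\pi i\, \mathbf{k} \cdot \mathbf{x}}\, d\mathbf{x}$, which is the one consistent with the factor $4\pi^2$ in Eq. (5); the convention itself is not stated in the text.

```latex
\begin{align*}
\frac{\partial \hat{V}(\mathbf{k}, t)}{\partial t}
  &= -4\pi^2 |\mathbf{k}|^2 D \, \hat{V}(\mathbf{k}, t)
  && \text{(Fourier transform of Eq. (4): } \nabla^2 \to -4\pi^2 |\mathbf{k}|^2 \text{)} \\
\hat{V}(\mathbf{k}, t)
  &= \hat{V}(\mathbf{k}, 0) \, e^{-4\pi^2 |\mathbf{k}|^2 D t}
  && \text{(first-order linear ODE in } t \text{)} \\
4\pi^2 |\mathbf{k}|^2 D t
  &= 2\pi^2 \sigma^2 |\mathbf{k}|^2
  && \text{(matching the exponent with Eq. (6))} \\
\sigma^2 &= 2Dt \quad \Longrightarrow \quad \sigma = \sqrt{2Dt}
\end{align*}
```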
Fig. 2 Gaussian filters of increasing $\sigma$
Fig. 3 Resistor network supporting linear diffusion (a) and its MOS-based counterpart (b)
Hence, the longer the diffusion time, $t$, the larger the smoothing factor, $\sigma$, and thus
the narrower the transfer function (Fig. 2) and the smoother the output image.
Gaussian filters are equivalent to the convolution with Gaussian kernels, $g(x, y; \sigma)$,
of the reciprocal widths:
$$g(x, y; \sigma) * V(x, y) = \frac{1}{2\pi\sigma^2} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} V(x - x', y - y')\, e^{-\frac{x'^2 + y'^2}{2\sigma^2}}\, dx'\, dy' \tag{8}$$
These kernels satisfy the scale-space axioms: linearity, shift invariance, semi-group
structure, and non-enhancement of local extrema. This makes them unique for scale-space
generation [8]. Scale-space is a framework for image processing [9] that
makes use of the representation of the images at multiple scales. It is useful in the
analysis of the image, e.g. to detect scale-invariant features that characterize the
scene [10]. Different textures are noticeable at a particular scale, $\xi$, which is related
to the smoothing factor:
$$\xi = \sigma^2 \tag{9}$$
However, convolving two large images or, alternatively, computing the FFT, and
the inverse FFT, of a given image with a conventional processing architecture repre-
sents a very heavy computational load. We are interested in a low-power focal-plane
operator able to deliver an image representation at the appropriate scale.
2.3 Spatial filtering by a resistor network
Consider a grid composed of resistors like that depicted in Fig. 3(a). Let $V_{ij}(0)$ be
the voltages stored at the grounded capacitors attached to each node of the resistor
grid. If the network is allowed to evolve, at every time instant each node will satisfy:
$$\tau \frac{dV_{ij}}{dt} = -4V_{ij} + V_{i+1,j} + V_{i-1,j} + V_{i,j+1} + V_{i,j-1} \tag{10}$$
where $\tau = RC$. Applying the spatial DFT, for a grid of $M \times N$ nodes, we arrive at:
$$\tau \frac{d\hat{V}_{uv}}{dt} = -4 \left[ \sin^2\left(\frac{\pi u}{M}\right) + \sin^2\left(\frac{\pi v}{N}\right) \right] \hat{V}_{uv} \tag{11}$$
Notice that $\hat{V}_{uv}$ represents the discrete Fourier transform of $V_{ij}$, which is also discrete
in space. Therefore, $u$ and $v$ take discrete values ranging from 0 to $M - 1$ and $N - 1$,
respectively. Solving Eq. (11) in the time domain, we obtain:
$$\hat{V}_{uv}(t) = \hat{V}_{uv}(0)\, e^{-\frac{4t}{\tau} \left[ \sin^2\left(\frac{\pi u}{M}\right) + \sin^2\left(\frac{\pi v}{N}\right) \right]} \tag{12}$$
which defines a discrete-space version of the Gaussian filter in Eq. (6), given by:
$$G_{uv}(\sigma) = e^{-2\sigma^2 \left[ \sin^2\left(\frac{\pi u}{M}\right) + \sin^2\left(\frac{\pi v}{N}\right) \right]} \tag{13}$$
where now $\sigma = \sqrt{2t/\tau}$. This function approximates quite well the continuous-space
Gaussian filter at the lower frequencies and close to the principal axes of the reciprocal
space. At higher frequencies, especially at the bisectors of the axes, i.e. when
$u$ and $v$ both become comparable to $M$ and $N$ respectively, isotropy degrades as the
approximation of $\sin^2(\pi u/M)$ and $\sin^2(\pi v/N)$ by $(\pi u/M)^2$ and $(\pi v/N)^2$, respectively,
becomes too coarse.
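The quality of this approximation can be checked numerically. The sketch below evaluates Eq. (13) next to Eq. (6) sampled at $\mathbf{k} = (u/M, v/N)$, which amounts to replacing each sine by its argument. The function names are hypothetical, and the comparison is only meaningful for $u \le M/2$ and $v \le N/2$, before the discrete spectrum folds back.

```python
import math


def sigma_from_time(t, tau):
    """Smoothing factor reached after diffusing for t seconds: sqrt(2 t / tau)."""
    return math.sqrt(2.0 * t / tau)


def grid_transfer(u, v, M, N, sigma):
    """Discrete-space transfer function of the resistor grid, Eq. (13)."""
    s = math.sin(math.pi * u / M) ** 2 + math.sin(math.pi * v / N) ** 2
    return math.exp(-2.0 * sigma ** 2 * s)


def gaussian_transfer(u, v, M, N, sigma):
    """Eq. (6) sampled at k = (u/M, v/N), i.e. with sin(x) replaced by x."""
    s = (math.pi * u / M) ** 2 + (math.pi * v / N) ** 2
    return math.exp(-2.0 * sigma ** 2 * s)
```

For a 64 x 64 grid diffused for one time constant, the two functions are almost indistinguishable at `u = v = 1`, whereas along the diagonal at `u = v = 32` the grid attenuates far less than the ideal Gaussian, in line with the loss of isotropy described above.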
3 VLSI implementation of time-controlled diffusion
3.1 MOS-resistor grid design
Resistive networks are, as we have already seen, massively parallel processing sys-
tems that can be employed to realize spatial filtering [11]. But a true linear resistive
grid is difficult to implement in VLSI. The low sheet resistance exhibited by the
most resistive materials available in standard CMOS results in excessively large areas for the
necessary resistances. A feasible alternative is to employ MOS transistors to replace
the resistors one by one. They can achieve larger resistances with less area than
resistors made of polysilicon or diffusion strips. In addition, their resistance can be
modified by controlling the gate voltage. They can also be operated as switches,
thus configuring the connectivity of the network. This substitution of resistors by
MOS transistors, however, entails, amongst others, linearity problems. In [12], the
linearity of the currents through resistive grids is achieved by using transistors
in weak inversion. The value of the resistance associated with each transistor
is directly controlled by the corresponding gate voltage. This property of current
linearity also holds if the transistors leave weak inversion, as long as all
of them share the same gate voltage [13]. Linearity is not so easy to achieve when
signals are encoded by voltages, as in Fig. 3(a). The use of MOSFETs operating
in the ohmic region instead of resistors is apparently the simplest option [14].
However, the intrinsic nonlinearity in the I-V characteristic leads to more elaborate
alternatives for the cancellation of the nonlinear term [15], or even to transconductor-based
implementations [16].
Fig. 4 2-node ideal resistive grid (a) and its MOS-based implementation (b)
For moderate accuracy requirements, though, the error committed by using MOS
transistors in the ohmic region can be kept under a reasonable limit if the elementary
resistor is adequately designed. For an estimation of the upper bound of this error,
let us compare the circuits in Fig. 4. They represent a 2-node ideal resistor grid and
its corresponding MOS-based implementation. The gate voltage $V_G$ is fixed and we
will assume, without loss of generality, that the initial conditions of the capacitors
fulfill $V_1(0) > V_2(0)$, with $V_1(0) = V'_1(0)$ and $V_2(0) = V'_2(0)$. We will also assume
that the transistor is biased in the triode region for any voltage at the drain and source
terminals, which will range from $V_{min}$ to $V_{max}$. The evolution of the circuit in Fig. 4(a)
is described by this set of ODEs:
$$\begin{cases} C \dfrac{dV_1}{dt} = -\dfrac{V_1(t) - V_2(t)}{R} \\[2ex] C \dfrac{dV_2}{dt} = \dfrac{V_1(t) - V_2(t)}{R} \end{cases} \tag{14}$$
while the behaviour of the circuit in Fig. 4(b) is described by:
$$\begin{cases} C \dfrac{dV'_1}{dt} = -G_M \left[ V'_1(t) - V'_2(t) \right] \\[2ex] C \dfrac{dV'_2}{dt} = G_M \left[ V'_1(t) - V'_2(t) \right] \end{cases} \tag{15}$$
by making use of the following transconductance:
$$G_M = k_n S_n \left\{ 2(V_G - V_{Tn}) - \left[ V'_1(t) + V'_2(t) \right] \right\} \tag{16}$$
where $k_n = \mu_n C'_{ox}/2$ and $S_n = W/L$. This transconductance remains constant during
the network evolution, if we neglect the substrate and other second-order effects, as
the sum $V'_1(t) + V'_2(t)$ remains the same because of charge conservation. Therefore,
and given that the charge extracted from one capacitor ends up in the other, we
can define the error in the corresponding node voltages as:
$$\begin{cases} V'_1(t) = V_1(t) + \varepsilon(t) \\ V'_2(t) = V_2(t) - \varepsilon(t) \end{cases} \tag{17}$$
or, equivalently:
$$\varepsilon(t) = \frac{V'_1(t) - V'_2(t)}{2} - \frac{V_1(t) - V_2(t)}{2} \tag{18}$$
Because of our initial assumptions, $V_1(0) = V'_1(0)$ and $V_2(0) = V'_2(0)$, we have
that $\varepsilon(0) = 0$. Also, the stationary state, reached when $t \to \infty$, renders $\varepsilon(\infty) = 0$, as
$V_1(\infty) = V_2(\infty)$ and $V'_1(\infty) = V'_2(\infty)$. Therefore, there must be at least one point in
time, let us call it $t_{ext}$, at which the error reaches an extreme value, either positive or
negative. In any case the time derivative of the error:
$$\frac{d\varepsilon}{dt} = \frac{1}{\tau} \left[ V'_1(t) - V'_2(t) \right] (1 - G_M R) - \frac{2\varepsilon(t)}{\tau} \tag{19}$$
must vanish at $t_{ext}$, resulting in an extreme error of:
$$\varepsilon(t_{ext}) = \frac{1}{2} \left[ V'_1(t_{ext}) - V'_2(t_{ext}) \right] (1 - G_M R) \tag{20}$$
Notice that for $G_M R = 1$ the error is zero at any moment. This happens if the transistor
aspect ratio, $S_n$, is selected to match the resistance $R$ through:
$$S_n = \frac{1}{k_n R \left\{ 2(V_G - V_{Tn}) - \left[ V'_1(0) + V'_2(0) \right] \right\}} \tag{21}$$
and the actual values of $V'_1(0)$ and $V'_2(0)$ add up to the same $V'_1(0) + V'_2(0)$ with
which $S_n$ was selected. Unfortunately, this will very seldom happen. And because
we do not know a priori where within the interval $[V_{min}, V_{max}]$ $V'_1(0)$ and $V'_2(0)$ lie,
nor $V'_1(t_{ext})$ and $V'_2(t_{ext})$, we are interested in a good estimate of the largest possible
error within the triangle $\triangle ABC$ (Fig. 5), delimited by points $A: (V_{min}, V_{min})$,
$B: (V_{max}, V_{min})$, $C: (V_{max}, V_{max})$. This is because one of our initial assumptions was
that $V'_1(0) > V'_2(0)$, and this condition is maintained until these voltages become equal at
the steady state.
Let us express this error, $\varepsilon_x = \varepsilon(t_{ext})$, as a function of $V'_{1x} = V'_1(t_{ext})$ and $V'_{2x} = V'_2(t_{ext})$.
This can be done because the sum $V'_1(t) + V'_2(t)$, present in the definition of
$G_M$, is constant along the evolution of the network, and therefore also at $t_{ext}$:
$$\varepsilon_x(V'_{1x}, V'_{2x}) = \frac{1}{2} \left( V'_{1x} - V'_{2x} \right) \left\{ 1 - k_n S_n R \left[ 2(V_G - V_{Tn}) - \left( V'_{1x} + V'_{2x} \right) \right] \right\} \tag{22}$$
Then, any possible extreme of $\varepsilon_x(V'_{1x}, V'_{2x})$ must be at a critical point, i.e. a point at
which $\nabla \varepsilon_x(V'_{1x}, V'_{2x}) = 0$. But the only one that can be found in $\triangle ABC$ is a saddle
point. Therefore, we can only talk of absolute maxima or minima, and they will be
at the borders of the triangle (Fig. 5). More precisely, they lie on sides AB and BC, given that
$\varepsilon_x \equiv 0$ along side AC, at points:
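$$\begin{cases} V'_{1x} = V_{max}, & V'_{2x} = V_G - V_{Tn} - \dfrac{1}{2 k_n S_n R} \\[2ex] V'_{1x} = V_G - V_{Tn} - \dfrac{1}{2 k_n S_n R}, & V'_{2x} = V_{min} \end{cases} \tag{23}$$
and their values are:
$$\begin{cases} \varepsilon_x|_{max} = \dfrac{1}{2} k_n S_n R \left( V_{max} - V_G + V_{Tn} + \dfrac{1}{2 k_n S_n R} \right)^2 \\[2ex] \varepsilon_x|_{min} = -\dfrac{1}{2} k_n S_n R \left( V_{min} - V_G + V_{Tn} + \dfrac{1}{2 k_n S_n R} \right)^2 \end{cases} \tag{24}$$
Notice that increasing or decreasing $S_n$ has antagonistic effects on the magnitude
of $\varepsilon_x|_{max}$ and $\varepsilon_x|_{min}$ (Fig. 5). Therefore the optimal design is obtained for:
$$S_n = \frac{1}{k_n R \left[ 2(V_G - V_{Tn}) - (V_{max} + V_{min}) \right]} \tag{25}$$
which minimizes the maximum error, rendering [17]:
$$\min(\max |\varepsilon_x|) = \frac{1}{16} \frac{(V_{max} - V_{min})^2}{V_G - V_{Tn} - \frac{V_{max} + V_{min}}{2}} \tag{26}$$
Fig. 5 Error estimate for the possible and reasonable values of $S_n$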
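The minimax property of Eq. (25) can be verified numerically from Eqs. (24) and (26). The sketch below works with the product $a = k_n S_n R$ (in units of 1/V) so that no process parameter has to be guessed. The signal range and gate voltage are those quoted in Sect. 3.2, whereas the threshold voltage of 0.6 V is an assumed, purely illustrative value; the function names are hypothetical.

```python
def extreme_errors(a, v_min, v_max, v_ov):
    """Eq. (24): boundary extremes of the error for a = k_n * S_n * R.

    v_ov stands for V_G - V_Tn. Returns (eps_max, eps_min) in volts.
    """
    eps_max = 0.5 * a * (v_max - v_ov + 1.0 / (2.0 * a)) ** 2
    eps_min = -0.5 * a * (v_min - v_ov + 1.0 / (2.0 * a)) ** 2
    return eps_max, eps_min


def worst_case(a, v_min, v_max, v_ov):
    """Largest error magnitude over both boundary extremes."""
    eps_max, eps_min = extreme_errors(a, v_min, v_max, v_ov)
    return max(abs(eps_max), abs(eps_min))


def optimal_product(v_min, v_max, v_ov):
    """Eq. (25) rewritten for the product a = k_n * S_n * R."""
    return 1.0 / (2.0 * v_ov - (v_max + v_min))


def minimax_error(v_min, v_max, v_ov):
    """Eq. (26): error bound reached by the optimal design."""
    return (v_max - v_min) ** 2 / (16.0 * (v_ov - 0.5 * (v_max + v_min)))


V_MIN, V_MAX = 0.0, 1.5   # signal range quoted in Sect. 3.2
V_OV = 3.3 - 0.6          # V_G = 3.3 V; V_Tn = 0.6 V is an assumed value

a_opt = optimal_product(V_MIN, V_MAX, V_OV)
bound = minimax_error(V_MIN, V_MAX, V_OV)
```

At `a_opt` both extremes have the same magnitude, equal to `bound`; moving `a` by 10% in either direction makes one of the two extremes grow beyond it.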
It is important to notice that the design space is limited by the extreme values of
$S_n$, those beyond which the target resistance $R$ falls outside the interval of resistance
values that can be implemented within the triangle $\triangle ABC$, which represents all the
possible values that can be taken by $V'_{1x}$ and $V'_{2x}$. These extrema are:
$$\begin{cases} S_{n,min} = \dfrac{1}{2 k_n R (V_G - V_{Tn} - V_{min})} \\[2ex] S_{n,max} = \dfrac{1}{2 k_n R (V_G - V_{Tn} - V_{max})} \end{cases} \tag{27}$$
Notice that if one chooses $S_n = (S_{n,min} + S_{n,max})/2$, led by the groundless
intuition that it will render the smallest error because it is equidistant from the
two extremes of the design space, the result will be a suboptimal design, as the optimum
$S_n$, given by Eq. (25), is notably below the midpoint. It is also worth taking
into account that Eq. (26) represents a conservative upper bound of the maximum
error that is going to be achieved by the optimal design. This is because we have not
considered the relation between $V'_1(t)$ and $V'_2(t)$ imposed by Eq. (15):
$$V'_1(t_{ext}) - V'_2(t_{ext}) = \left[ V'_1(0) - V'_2(0) \right] e^{-\frac{2 G_M}{C} t_{ext}} \tag{28}$$
This means that not all the possible values contained in $\triangle ABC$, defined on the $V'_{1x}$-$V'_{2x}$
plane, not the $V'_1(0)$-$V'_2(0)$ plane, will be covered by all the possible trajectories of the
circuit. Thus, by equating Eq. (19) to zero after having substituted the trajectories by
their actual waveforms, $t_{ext}$ is obtained:
$$t_{ext} = \frac{\tau}{2} \frac{\ln G_M R}{G_M R - 1} \tag{29}$$
and the exact value of the error at the extreme is then given by:
$$\varepsilon(t_{ext}) = \frac{V'_1(0) - V'_2(0)}{2} (1 - G_M R) (G_M R)^{-\left( \frac{G_M R}{G_M R - 1} \right)} \tag{30}$$
Unfortunately, it is not possible to obtain the position of the extreme error in closed form from this
equation, but we have confirmed numerically that the $S_n$ rendering
the smallest error is the one given by Eq. (25).
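Eqs. (29) and (30) can be cross-checked against the explicit waveforms. The sketch below assumes that $G_M$ is constant, so that both node-voltage differences decay exponentially, the ideal one as $e^{-2t/\tau}$ and the MOS one as $e^{-2 G_M R\, t/\tau}$; inserting them in Eq. (18) gives the error at any time. Function and parameter names are hypothetical.

```python
import math


def error(t, gmr, tau, delta0):
    """Error of Eq. (18) at time t for a given product gmr = G_M * R.

    delta0 is the initial voltage difference V'_1(0) - V'_2(0).
    """
    return 0.5 * delta0 * (
        math.exp(-2.0 * gmr * t / tau) - math.exp(-2.0 * t / tau)
    )


def t_ext(gmr, tau):
    """Eq. (29). Undefined for gmr == 1, where the error is identically zero."""
    if gmr == 1.0:
        raise ValueError("G_M * R = 1 gives zero error at all times")
    return 0.5 * tau * math.log(gmr) / (gmr - 1.0)


def extreme_error(gmr, delta0):
    """Eq. (30): error at the instant given by Eq. (29)."""
    return 0.5 * delta0 * (1.0 - gmr) * gmr ** (-gmr / (gmr - 1.0))
```

Evaluating `error` at `t_ext(gmr, tau)` reproduces `extreme_error(gmr, delta0)`, and the error magnitude is smaller slightly before and slightly after that instant.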
3.2 Extrapolated results for larger networks
The results described above have been applied to the design of a 64 × 64 MOS-based
resistive grid. Simulations have been carried out using standard 0.35 µm CMOS 3.3 V
process transistor models in HSPICE. The signal range at the nodes is [0 V, 1.5 V],
wide enough to reveal any excessive influence of the MOSFET nonlinearities on
the spatial filtering. $V_G$ is set to 3.3 V. We aim to implement an array of capacitors
interconnected by a network of resistors with a time constant of $\tau = 100$ ns.
Fig. 6 (a) Original image, (b) MOS-diffused image at instant of maximum error, (c) image diffused
by resistor network, (d) absolute error, (e) absolute error multiplied by 10, (f) absolute error
normalized to maximum individual pixel error
For that we will assume a resistor of $R = 100$ kΩ and a capacitor of $C = 1$ pF. $S_n$
is obtained according to Eq. (25). But this equation does not take into account the
substrate effect; in other words, $S_n$ depends not only on the sum $V_{max} + V_{min}$
but also on the specific values of the initial voltages at drain and source that render
the same sum. For a specific value of $S_n$, and the same $V'_1(0) + V'_2(0)$, the resistance
implemented can vary by 5%. We have selected $W = 0.4$ µm and $L = 7.54$ µm (see footnote 1), which
result in an average resistance of 100 kΩ for all the possible initial conditions rendering
the optimum sum, i.e. $V_{max} + V_{min}$.
Fig. 7 RMSE of the MOS-based grid state vs. resistor grid state: (a) w/o mismatch, (b) Monte
Carlo with 10% mismatch. Both panels plot RMSE (%), from 0 to 0.7, against time in seconds,
from 0 to $10^{-5}$.
The initial voltage at the capacitors is proportional to the image intensity displayed
in Fig. 6(a). A MOS-based resistor network runs the diffusion of the initial
voltages in parallel with an ideal linear resistor network. The deviation is measured
via the RMSE (Fig. 7) and reaches a maximum soon after the beginning of the diffusion
process. The states of the corresponding nodes in both networks at this point,
displayed in Figs. 6(b) and 6(c), are compared in Figs. 6(d), 6(e) and 6(f). The
maximum observed RMSE for the complete image is 0.5%, while the maximum
individual pixel error is 1.76%. This error remains below 0.6% even when introducing an
exaggerated mismatch (10%) in the transistors' $V_{Tn0}$ and $\mu_n$ (Fig. 7(b)) (see footnote 2).
3.3 Diffusion duration control
Implementing a Gaussian filter, Eq. (13), by means of dynamic diffusion requires
precise control of the diffusion duration, $t$, given that the scale, $\xi$, i.e. the square
of the smoothing factor, is related to it through:
$$\xi = \frac{2t}{\tau} \tag{31}$$
Therefore, to filter the image with a Gaussian function of scale $\xi$ means to let the
diffusion run for $t = \tau\xi/2$ seconds. The actual value of $\tau$ is not important, as long
as the necessary resolution of $t$ for the required $\xi$ is physically realizable. Actually,
$\xi$ is not determined by $t$ itself, but by the quotient $t/\tau$. In other words, a different
selection of $R$ and $C$ will still render the same set of $\xi$'s as long as the necessary
fraction of $\tau$ can be fine-tuned. Implementing this fine control of the diffusion duration
is not a trivial task when we are dealing with $\tau$'s in the hundred-nanosecond
range. In order to provide robust sub-$\tau$ control of the diffusion duration, the operation
must rely on internally, i.e. on-chip, generated pulses. Otherwise, propagation
delays will render this task very difficult to accomplish.
We propose a method for the fine control of $t$ based on an internal VCO. This
method has been tested in the prototype that is described later in the chapter.
The first block of the diffusion duration control circuit is the VCO (Fig. 8(a)). It
consists of a ring of pseudo-NMOS inverters in which the load current is controlled
by $V_{bias}$, thus modifying the propagation delay of each stage. As the inverter ring is
composed of an odd number of stages, the circuit will oscillate. A pull-up transistor,
with a weak aspect ratio, has been introduced to avoid start-up problems. Also, a flip-flop
is placed to render a 50% duty cycle at the output. This circuit provides an internal
clock that will be employed to time pulses that add up to the final diffusion duration.
The main block of the diffusion control is the 12-stage shift register. It will store
a chain of 1's indicating how many clock cycles the diffusion process will run. The
clock employed for this will be either external or the already described internal
VCO (see footnote 3). The output signal, diff_ctrl in Fig. 8(b), is a pulse with the desired duration
of the diffusion, which is delivered to the gates of the MOS-resistors.
Fig. 8 (a) 15-stage inverter ring VCO and (b) diffusion control logic
1 This transistor length lies off the physical design grid, which fixes the minimum feature size at
0.05 µm. We are using it here to illustrate the design procedure.
2 Global deviations within the process parameters space have not been considered. In that case, the
nominal resistance being implemented differs from the prescribed 100 kΩ.
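The relation between the target scale, the clock frequency and the contents of the shift register can be illustrated with a short calculation. The sketch below rests on one interpretation that is not a chip specification: the diffusion pulse lasts an integer number of clock periods, one per '1' stored in the 12-stage register. The function names and the default of 12 stages as a parameter are illustrative.

```python
def diffusion_time(xi, tau):
    """Eq. (31) solved for t: diffusion duration that yields scale xi."""
    return 0.5 * tau * xi


def pulse_cycles(xi, tau, f_clk, stages=12):
    """Number of clock cycles that best approximates the requested scale.

    Assumes one clock period per stored '1' (an interpretation of the text).
    """
    cycles = round(diffusion_time(xi, tau) * f_clk)
    if not 1 <= cycles <= stages:
        raise ValueError("scale not reachable with this clock and register")
    return cycles


def achieved_scale(cycles, tau, f_clk):
    """Scale actually implemented by a pulse of the given number of cycles."""
    return 2.0 * (cycles / f_clk) / tau
```

With $\tau = 100$ ns and a 100 MHz clock, a scale of 1 corresponds to a 50 ns pulse, i.e. 5 cycles, whereas a scale of 10 would need 50 cycles and is rejected under the 12-cycle assumption.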
3.4 Image energy computation
For real-time detection and tracking of dynamic textures we will be interested in a
simplified representation of the scene. It can be constructed from the filtered image
by first dividing it into subimages, usually of the same size. Each subimage is then
represented by a number that encodes information related to the spatial frequency
content of the subimage. This number is the image energy, as defined in Eq. (3). The
energy of the image bin can be expressed as a function of time:
$$E_{kl}(t) = \sum_{\forall \mathbf{k}} E_{kl}(\mathbf{k}; t) = \sum_{u=0}^{m-1} \sum_{v=0}^{n-1} |\hat{V}_{uv}(t)|^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} |V_{ij}(t)|^2 \tag{32}$$
meaning that the energy of the image at time $t$ accounts for the frequency components
that have not been filtered yet. In terms of the dynamics of the resistor grid,
the total charge in the array of capacitors is conserved but, naturally, the system
evolves toward the least energetic configuration. Therefore the energy at time instant
$t$ indicates how the subimage has been affected by the diffusion until that exact
point in time. The longer $t$, the less of $E_{kl}(0)$ will remain in $E_{kl}(t)$. The energy lost
between two consecutive points in time during the diffusion corresponds to that of
the spatial frequencies filtered. Notice also that changing the reference level for the
amplitude of the pixels does not have an effect beyond the dc component of the image
spectrum. A constant value added to every pixel neither eliminates nor modifies
any of the spatial frequency components already present, apart from changing the
dc component.
Fig. 9 In-pixel energy computing circuit
3 The aim of the internal VCO is to reach a better time resolution than an external clock. For
programming the appropriate sequence into the shift register (SHR) an external, and slower, clock is usually
preferred.
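Charge conservation and the monotonic loss of energy can be observed in a small software model of the grid. The sketch below integrates Eq. (10) with forward Euler steps. It assumes periodic boundaries, which is what the DFT treatment implies but not necessarily what the chip implements, and it requires the step $t/(\tau \cdot \mathrm{steps})$ to stay below 0.25 for numerical stability. All names are hypothetical.

```python
def diffuse(grid, t, tau, steps=200):
    """Integrate Eq. (10) for t seconds with forward Euler on a periodic grid."""
    rows, cols = len(grid), len(grid[0])
    state = [row[:] for row in grid]
    h = (t / tau) / steps  # step expressed in units of tau
    for _ in range(steps):
        state = [
            [
                state[i][j] + h * (
                    state[(i + 1) % rows][j] + state[(i - 1) % rows][j]
                    + state[i][(j + 1) % cols] + state[i][(j - 1) % cols]
                    - 4.0 * state[i][j]
                )
                for j in range(cols)
            ]
            for i in range(rows)
        ]
    return state


def energy(grid):
    """Eq. (32) in the spatial domain: sum of squared node voltages."""
    return sum(v * v for row in grid for v in row)


def charge(grid):
    """Sum of node voltages, proportional to the total stored charge."""
    return sum(v for row in grid for v in row)
```

The difference `energy(diffuse(img, t1, tau)) - energy(diffuse(img, t2, tau))` for `t1 < t2` is then the energy of the components removed between the two instants, which is the band-energy reading described in the text.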
In order to analyze the presence of different spatial frequency components within
a particular bin of the image, we need to measure the energy of the bin pixels
once filtered. Remember that for analyzing a particular band of frequencies we
subtract two lowpass-filtered versions of the image. In this way, only the components
of the targeted frequency band remain. This allows changes in that band to be tracked
without pixel-level analysis. The hardware employed to calculate the energy
of the bins at the pixel level (Fig. 9) consists of a transistor, $M_E$, that converts the
pixel voltage into a current proportional to the square of the voltage; a switch,
$S_E$, that controls the amount of charge subtracted from the capacitor; and the capacitor $C_E$, which
performs the charge-to-voltage conversion. All the $C_E$'s within the bin are connected
to the same node. At the beginning, all of them are pre-charged to $V_{REF}$. Then,
for a certain period of time, $T_E$, the transistor $M_E$ is allowed to discharge $C_E$. But
because the $m \times n$ capacitors of the bin are tied to the same node, the final voltage
representing the bin energy after $t$ seconds of diffusion will be:
$$V_{E_{kl}}(t) = V_{REF} - \frac{k_E S_E T_E}{m n C_E} \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ V_{ij}(t) - V_{Tn0} \right]^2 \tag{33}$$
We are assuming that all the $M_E$'s are nominally identical and operate in saturation.
The offset introduced by $V_{Tn0}$ does not affect any spatial frequency other than
the dc component. Deviations occur from pixel to pixel due to mismatch in the
threshold voltage ($V_{Tn0}$), the transconductance parameter ($k_E$), and the body-effect
constant ($\gamma_E$, not in this equation). Since these are area-dependent effects, transistors $M_E$ are
tailored to control the resulting error in the computation. Also, mobility degradation
contributes to the deviation from the behaviour described by Eq. (33), which will
ultimately limit the useful signal range.
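Eq. (33) translates directly into a short function. The sketch below is a behavioural model only: it ignores mismatch, body effect and mobility degradation, and it does not check that the transistors stay in saturation. The function name is hypothetical, and any parameter values passed to it are placeholders chosen by the caller rather than chip data.

```python
def bin_energy_voltage(pixels, v_ref, k_e, s_e, t_e, c_e, v_tn0):
    """Eq. (33): voltage left on the shared node of an m x n bin.

    pixels is a list of rows holding the node voltages V_ij(t) in volts.
    """
    count = sum(len(row) for row in pixels)  # m * n capacitors on the node
    squares = sum((v - v_tn0) ** 2 for row in pixels for v in row)
    return v_ref - (k_e * s_e * t_e) / (count * c_e) * squares
```

Because the sum is divided by the number of capacitors tied to the node, a uniform bin gives the same output regardless of its size, and a bin holding larger voltages discharges the node further.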
4 Prototype texture segmentation chip
4.1 Chip description and data
The floorplan of the prototype chip is depicted in Fig. 10. It comprises a mixed-signal
processing array, concurrent with the photosensor array, intended to carry out
Gaussian filtering of the appropriate scale within user-defined image areas. In addition,
the total energy associated with each image bin is calculated. On the periphery,
there are circuits for bias and control of the operation. The outcome of the processing
can be read out pixel by pixel in a random-access fashion. The value of the
pixel is buffered at the column bus and delivered to either an analog output pad or
an on-chip 8-bit SAR ADC.
The elementary cell of the analog core (Fig. 11) was described in [18]. It contains
a diffusion network built with p-type MOS transistors. The limits of the diffusion
areas are selected column-wise and row-wise by enabling the appropriate connection
pattern. In this way, scale spaces can be generated at different user-defined areas of
the scene. The pulse that determines the duration of the diffusion can be either
externally input or internally generated, as mentioned in Sect. 3.3. The main characteristics
of the chip are summarised in Table 1. A microphotograph of the prototype
chip with a close-up of the photosensors is shown in Fig. 12.
Fig. 10 Floorplan of the prototype chip
Fig. 11 Schematic of the elementary cell
Technology                            0.35 μm CMOS 2P4M
Vendor (Process)                      Austria Microsystems (C35OPTO)
Die size (with pads)                  7280.8 μm × 5780.8 μm
Pixel size                            34.07 μm × 29.13 μm
Fill factor                           6.45%
Resolution                            QCIF: 176 × 144 px
Photodiode type                       n-well/p-substrate
Power supply                          3.3 V
Power consumption (including ADC)     1.5 mW
Internal clock freq. range            10-150 MHz
ADC throughput (I/O limit)            0.11 MSa/s (9 μs/Sa)
Exposure time range                   100ms-500ms
Table 1 Prototype chip data
Fig. 12 Microphotograph of the prototype chip
4.2 Linearity of the scale-space representation
As expected from simulations, the use of MOS transistors instead of true linear
resistors in the diffusion network (Fig. 3(b)) achieves moderate accuracy even under
strong mismatch conditions. However, the resistance implemented by the transistors,
and therefore the network time constant, τ, is quite sensitive to the process
parameters. In order to have a precise estimation of the scale implemented by stopping
the diffusion at different points in time (recall that ξ = 2t/τ), the actual τ needs to
be measured. We have provided access to the extremes of the array and have
characterized τ from the charge redistribution between two isolated pixels. The average
measured τ is 71.1 ns (1.8%). According to the technology corners, the predicted range
for τ was [49, 148] ns. By reverse engineering the time constant, using Eq. (25), the
best emulated resistance (R_eq) is obtained.
Once τ is calibrated, any on-chip scale space can be compared to its ideal counterpart,
obtained by solving the spatially-discretized diffusion equation corresponding
to a network consisting of the same capacitors C and resistors of value R_eq. In order to
generate an on-chip scale space, a single image is captured. This image is converted to
the digital domain and delivered through the output bus. It becomes the initial image
of both the on-chip scale space and the ideal scale space calculated off-chip. The
rest of the on-chip scale space is generated by applying successive diffusion steps
to the original captured image. After every step, the image is converted to digital
and delivered to the test instruments to be compared to the ideal image generated
by MATLAB (Fig. 13). The duration of each diffusion step is internally configured
as sketched in Sect. 3.3. A total of 12 steps have been applied to the original
captured image. Six of them are represented in Fig. 13 (first row) and compared to
the ideal images (second row). The last row contains a pictorial representation of
the error, normalized in each case to the highest measured error on individual pixels,
which is 0%, 24.99%, 19.39%, 6.17%, 3.58% and 6.68%, respectively. It can
be seen how FPN eventually becomes the dominant error at coarse scales. Keep in
mind that this noise is present in the initial image of both scale spaces, but it is only
added to each subsequent image of the on-chip scale space because of the readout
mechanism. The key point here is that the error is kept at a reasonable level even
though no FPN post-processing is carried out. This fact, together with the efficiency
of the focal-plane operation, is crucial for artificial vision applications under strict
power budgets. Two additional issues are worth mentioning. First, the accuracy
of the processing predicted by simulation [17] is very close to that of the first
images of the scale space, where fixed pattern noise is not yet dominant. Second,
it has been confirmed for this and other scenes that the second major source of error
in the scale space generation comes from uniform areas where the pixel values
fall at the lowest extreme of the signal range. The reason is that the instantaneous
resistance implemented by a transistor when its source and drain voltages coincide
around the lowest extreme of the signal range presents the maximum possible deviation
with respect to the equivalent resistance considered. However, in such
a situation the charge diffused between the nodes involved is very small, keeping
the error moderate despite the large deviation in the resistance.
Fig. 13 Comparison of scale spaces over time, with panels at t = 0 ns, 40 ns, 100 ns, 400 ns,
800 ns and 1500 ns. The first row corresponds to the on-chip scale space, the second to the
ideal scale space, and the third to their normalized difference
Fig. 14 Independent scale spaces within four subdivisions of the focal plane
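The off-chip reference used in this comparison can be emulated with a few lines of Python.
The sketch below is not the authors' MATLAB code. It assumes a 4-neighbour grid of identical
capacitors and resistors whose node equation is τ·dV_ij/dt = Σ (V_neighbour − V_ij), zero-flux
borders at the limits of the diffusion area, and explicit Euler integration. All function
names are hypothetical, and the random array only stands in for a captured image.

```python
import numpy as np


def laplacian_zero_flux(v):
    """4-neighbour grid Laplacian; replicated borders mean no charge crosses the edge."""
    p = np.pad(v, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * v


def diffuse(v0, t, tau, substeps_per_tau=20):
    """Integrate tau * dV/dt = Laplacian(V) from 0 to t with explicit Euler steps."""
    v = np.asarray(v0, dtype=float).copy()
    # At least 20 substeps per time constant keeps dt/tau well below the 0.25 stability limit.
    n_steps = max(1, int(np.ceil(substeps_per_tau * t / tau)))
    dt = t / n_steps
    for _ in range(n_steps):
        v += (dt / tau) * laplacian_zero_flux(v)
    return v


def normalized_error(measured, ideal):
    """Absolute pixel error scaled by its own maximum (all zeros if the images match)."""
    err = np.abs(np.asarray(measured, dtype=float) - np.asarray(ideal, dtype=float))
    peak = err.max()
    return err / peak if peak > 0 else err


# Stand-in data: a random image instead of a real capture (assumption for illustration).
rng = np.random.default_rng(0)
captured = rng.uniform(0.0, 1.0, size=(32, 32))
tau = 71.1e-9                              # measured average time constant, in seconds
times_ns = [0, 40, 100, 400, 800, 1500]    # diffusion times shown in Fig. 13
ideal_scale_space = [diffuse(captured, t * 1e-9, tau) for t in times_ns]
```

With zero-flux borders the total charge in the grid is conserved, so every image of the ideal
scale space keeps the mean of the initial image while its variance shrinks as the diffusion
time grows.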
4.3 Subsampling modalities
Finally, as a glimpse into the possibilities of the prototype, independent scale spaces
were generated within four subdivisions of the focal plane programmed into the
chip (Fig. 14). Because of the flexible subdivision of the image, the chip can deliver
anything from full-resolution digital images to different simplified representations of the
scene, which can be reprogrammed in real time according to the results of their
processing. As an example of the image subsampling capabilities of the chip, three
different schemes are shown in Fig. 15. The first one corresponds to the full-resolution
representation of the scene (it was taken in the lab by displaying a real outdoor
sequence on a flat-panel monitor). The second one represents the same scene
after applying some binning. In the third picture, the region of interest (ROI), in
the center, is at full resolution while the areas outside this region are binned
together. The binning outside the ROI becomes progressively coarser. This represents
a foveatized version of the scene in which the greatest detail is only considered at the
ROI, and the grain becomes coarser with distance from it. From the computational point
of view, this organization translates into important savings in computing resources.
Fig. 15 Example of the image sampling capabilities of the prototype. Panels: full-resolution
image, binning, foveatization
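The binning and foveatization schemes described in this section can be emulated off-chip as
follows. This is only a software sketch under stated assumptions: each bin is replaced by the
mean of its pixels, the bin sizes and window coordinates are arbitrary illustrative choices,
and the function names are hypothetical. It does not describe how the chip wires its
diffusion areas.

```python
import numpy as np


def bin_image(img, bh, bw):
    """Replace every bh x bw bin by its mean value, keeping the original image size."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    if h % bh or w % bw:
        raise ValueError("image size must be a multiple of the bin size")
    means = img.reshape(h // bh, bh, w // bw, bw).mean(axis=(1, 3))
    return np.repeat(np.repeat(means, bh, axis=0), bw, axis=1)


def foveate(img, levels):
    """Apply (bin_size, (r0, r1, c0, c1)) levels in order; later levels overwrite earlier ones."""
    out = np.asarray(img, dtype=float).copy()
    for bin_size, (r0, r1, c0, c1) in levels:
        out[r0:r1, c0:c1] = bin_image(img, bin_size, bin_size)[r0:r1, c0:c1]
    return out


# Synthetic 176 x 144 array standing in for a captured frame (assumption for illustration).
scene = np.arange(176 * 144, dtype=float).reshape(176, 144)
binned = bin_image(scene, 16, 16)
levels = [
    (16, (0, 176, 0, 144)),    # coarsest bins over the whole frame
    (8, (48, 128, 32, 112)),   # finer ring closer to the centre
    (4, (64, 112, 48, 96)),    # finer still
    (1, (80, 96, 64, 80)),     # ROI kept at full resolution
]
foveated = foveate(scene, levels)
```

Counting the distinct values in `binned` or `foveated` (for instance with `np.unique`) gives a
rough idea of how much less data has to be processed compared with the full-resolution frame.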
5 Conclusions
This chapter has presented a feasible alternative for focal-plane generation of a scale
space, intended for real-time detection and tracking of dynamic textures.
Focal-plane filtering with a resistor grid results in a very low-power implementation,
while the appropriate image subdivision, accommodated to the size of the
targeted features, also helps to alleviate the computing load. A methodology
for the design of the MOS-based resistor network has been explained, leading to an optimal
design of the grid. Also, the means for a simplified representation of the scene are
provided at the pixel level. These techniques have been applied to the design of a
prototype smart CMOS imager. Some experimental results confirming the predicted
behaviour have been shown.
Acknowledgements This work is funded by CICE/JA and MICINN (Spain) through projects
2006-TIC-2352 and TEC2009-11812 respectively.
References
1. R. Nelson, R. Polana, CVGIP: Image Understanding 56(1), 78 (1992)
2. D. Chetverikov, R. Péteri, in International Conference on Computer Recognition Systems
(CORES’05) (Rydzyna Castle, Poland, 2005), pp. 17–26
3. T. Amiaz, S. Fazekas, D. Chetverikov, N. Kiryati, in International Conference on Scale Space
and Variational Methods in Computer Vision (SSVM’07) (2007), pp. 848–859
4. I. Akyildiz, T. Melodia, K. Chowdhury, Computer Networks 51(4), 921 (2007)
5. R. Péteri, M. Huiskes, S. Fazekas. Dyntex: A comprehensive database of dynamic textures
(2006). http://www.cwi.nl/projects/dyntex/
6. M. Ballerini, N. Cabibbo, R. Candelier, A. Cavagna, E. Cisbani, I. Giardina, A. Orlandi,
G. Parisi, A. Procaccini, M. Viale, V. Zdravkovic, Animal Behaviour 76(1), 201 (2008)
7. B. Jähne, H. Haußecker, P. Geißler, Handbook of Computer Vision and Applications
(Academic Press, 1999), vol. 2, chap. 4
8. J. Babaud, A.P. Witkin, M. Baudin, R.O. Duda, IEEE Trans. Pattern Anal. Mach. Intell. 8(1),
26 (1986)
9. T. Lindeberg, International Journal of Computer Vision 30(2), 79 (1998)
10. D.G. Lowe, International Journal of Computer Vision 60(2), 91 (2004)
11. L. Raffo, S. Sabatini, G. Bo, G. Bisio, IEEE Transactions on Neural Networks 9(6), 1483
(1998)
12. E. Vittoz, X. Arreguit, Electronics Letters 29(3), 297 (1993)
13. K. Bult, J. Geelen, IEEE Journal of Solid-State Circuits 27(12), 1730 (1992)
14. L. Vadasz, IEEE Transactions on Electron Devices 13(5), 459 (1966)
15. H. Kobayashi, J. White, A. Abidi, IEEE Journal of Solid-State Circuits 26(5), 738 (1991)
16. K. Hui, B. Shi, IEEE Transactions on Circuits and Systems-I 46(10), 1161 (1999)
17. J. Fernández-Berni, R. Carmona-Galán, in European Conference on Circuit Theory and
Design (ECCTD’09) (Antalya, Turkey, 2009)
18. J. Fernández-Berni, R. Carmona-Galán, in 12th Int. W. Cellular Nanoscale Networks and
Apps. (CNNA) (Berkeley, CA, 2010), pp. 453–458
