Parallel Implementation of Sequential Morphological Filters by Bartovsky, Jan et al.
Jan Bártovský, Petr Dokládal, Eva Dokládalová, Vjaceslav Georgiev
To cite this version:
Jan Bártovský, Petr Dokládal, Eva Dokládalová, Vjaceslav Georgiev. Parallel Implementation of Sequential Morphological Filters. Journal of Real-Time Image Processing, Springer Verlag, 2011, to appear, pp. 1-13. <10.1007/s11554-011-0226-5>. <hal-00786367>
HAL Id: hal-00786367
https://hal-upec-upem.archives-ouvertes.fr/hal-00786367
Submitted on 8 Feb 2013
Abstract Many useful morphological filters are built as long concatenations of erosions and dilations: openings, closings, size distributions, sequential filters, etc.
An efficient implementation of such a concatenation would let all the sequentially concatenated operators run simultaneously on time-delayed data. A recent algorithm (see below) for the morphological dilation/erosion allows such inter-operator parallelism.
This paper introduces an additional, intra-operator level of parallelism in this dilation/erosion algorithm. Realized in dedicated hardware, for rectangular structuring elements of programmable size, such an implementation achieves previously unattainable real-time performance for these traditionally costly operators. Its main benefits are low latency and low memory requirements, with performance that does not deteriorate even for long concatenations or high-resolution images.
Keywords Mathematical Morphology, Serial Filters,
Real-Time Implementation, Dedicated Hardware
J. Bártovský and E. Dokládalová
Laboratoire Informatique Gaspard Monge
CNRS-UMLV-ESIEE (UMR 8049)
University Paris-Est
93162 Noisy-le-Grand Cedex, France
E-mail: {bartovsj,dokladae}@esiee.fr
P. Dokládal
Center of Mathematical Morphology (CMM)
Mines ParisTech
77305 Fontainebleau Cedex, France
E-mail: petr.dokladal@mines-paristech.fr
V. Georgiev
Faculty of Electrical Engineering
University of West Bohemia
30614, Pilsen, Czech Republic
E-mail: georg@kae.zcu.cz
1 Introduction
Mathematical Morphology is a very popular, self-contained image processing framework providing a complete set of tools ranging from filtering and multi-scale image analysis to pattern recognition. It has been used in a number of applications, including biomedical and medical imaging, video surveillance, industrial control, video compression, stereology, and remote sensing [8, 16, 18, 19].
In image-interpretation applications requiring highly reliable decisions, one often uses robust multi-criteria and/or multi-scale analysis. It generally consists of a serial concatenation of the alternating atomic operators dilation and erosion with a progressively increasing computing window, the so-called structuring element (SE). Examples include:
– Alternate Sequential Filters (ASF): concatenations of openings and closings with a progressively increasing structuring element, useful for multi-scale analysis [19, 20].
– Size distributions (also known as granulometries): concatenations of openings that measure the size distribution in a population of objects [14, 17, 28].
– Statistical learning: a selected set of morphological operators $\zeta_i$ can be separately applied to an image f. Then, for every pixel f(x,y), the vector of values $\zeta_i(f)(x,y)$ can serve as a vector of descriptors for pixel-wise learning and classification [4].
Although built from basic blocks (the dilation and ero-
sion), these operators are costly due to the number of itera-
tions. The real-time capabilities (i.e., low latency) are even
more difficult to achieve due to the sequential data depen-
dence and high memory requirements.
The recently introduced algorithm for the dilation and
erosion [7] shows how to handle efficiently the implemen-
tation of such concatenations. It enables an inter-operator
level of parallelism where all the sequentially concatenated
operators can run simultaneously, on time-delayed data. Ob-
viously, it is fully exploited only if the algorithm is imple-
mented in an adequate hardware (HW).
In this paper, we propose a HW implementation of the original algorithm and introduce an additional, intra-operator level of parallelism. Such an implementation achieves previously unattainable real-time performance for these traditionally costly operators. Section 2 discusses the state of the art of existing dilation/erosion algorithms and concludes with the novelties presented in this paper. Sections 3 and 4 recall the definitions and algorithmic principles, and Section 5 illustrates the functional scheme of a sequential HW implementation. Section 6 then introduces the parallel implementation, which yields an additional performance increase. Finally, Section 7 presents results obtained on an FPGA.
2 Fast Implementations of Morphological Filters
During the last decades, propositions of optimized imple-
mentations concentrated on the efficiency of computing the
dilation and erosion. The majority of authors measures this
efficiency as a number of comparisons per pixel. Neverthe-
less, the minimization of comparisons can result in high
memory requirements. It can even penalize the execution
time since the overall latency issues are neglected.
In the following, we define operator latency as the latency introduced by the dependence of the result on future data samples. For example, the max filter $y_i = \max(x_{i-2}, x_{i-1}, \ldots, x_{i+2})$ has operator latency 2. We define algorithm latency as any additional latency introduced by the algorithm, e.g., the necessity to perform a reverse scan over the data. Latency is a time-less measure expressed as a number of data samples.
Note that there is also an additional delay, called com-
puting latency, induced by the time needed to compute the
result after all data are available. It is a platform dependent
measure, independent of two previous latency definitions.
In the example above, the polyadic max can either be exe-
cuted sequentially on sequential machines, or in parallel on
the dedicated hardware.
Then the overall latency of the system is the sum of these
three terms.
2.1 Algorithmic Advances
The most efficient dilation/erosion algorithms are based on
the SE decomposition to a set of basic, more easily opti-
mized shapes, see [22, 30, 31]. A special attention is paid
to 1-D algorithms obtaining a significant gain in the overall
performance.
The most popular 1-D algorithm is called HGW (pub-
lished by van Herk [26], and Gill and Werman [9]). The
computation complexity per pixel is O(1), i.e., is indepen-
dent of the SE size. Nonetheless, the algorithm requires two
scans: forward and reverse. Lemonnier [12] proposes to identify local extrema and propagate their values as long as they are covered by the SE. Again, forward and reverse scans are needed. Notice that in 2-D the reverse scan of the vertical component multiplies the algorithm latency by a factor of the image width.
Lemire [11] proposes a fast, stream-processing algo-
rithm for causal linear SE. It runs also on floating-point
data, has low memory requirements and zero algorithm
latency. However, an intermediate storage of local maxima
results in a random access to the input data. This problem
is solved in Dokladal and Dokladalova [7] using the strictly
sequential access to the data. It allows the real on-the-fly
computing and has zero algorithm latency.
A different approach is taken by the algorithm of Buckley and Van Droogenbroeck [25]. It detects the anchors (the portions of the signal unaffected by the operator) and updates only the parts to be modified by the operation. It has zero algorithm latency. However, the algorithm uses a histogram, which makes the memory requirements dependent on the number of gray levels.
Recently, Urbach and Wilkinson [24] propose an algo-
rithm for arbitrary shaped 2-D flat SEs based on the compu-
tation of multiple horizontal linear SEs for every pixel and
storing them in a look-up table. The result is then computed
by taking the maximum from the intermediate values (stored
in the look-up table) corresponding to the shape of the SE.
The horizontal linear SE can be computed with one of the
above mentioned 1-D algorithms.
2.2 Implementations
At the beginning of the 1970s, Klein and Serra [10] proposed a texture analyzer for linear and rectangular SEs by decomposition, based on the delay-line concept. More recently, Velten and Kummert [27] proposed another delay-line based architecture supporting arbitrarily shaped SEs. However, its complexity is quadratic, $O(W^2)$ (W denotes the length of the SE), which becomes penalizing for large SEs. Chien et al. [2] show how to reduce the number of redundant comparisons within large SEs by merging adjacent smaller SEs. The complexity becomes $O(\lceil \log_2(W) \rceil)$ with identical memory requirements.
Clienti et al. [3] propose a highly parallel morphological System-on-Chip. It is based on a set of neighbourhood processors optimized for 3×3 SEs, interconnected in a partially configurable pipeline. Larger SEs are obtained by homothecy (see Section 3 below), which requires instantiating a deep pipe of these processors or multiple image scans.
A similar approach has been published by Deforges et
al. [5]. Based on Xu’s [30] decomposition combined with a
stream implementation, the authors propose a pipeline archi-
tecture composed of the elementary parametrizable blocks.
It handles an arbitrary convex shape of structuring element
in only one scan of the input image. However, using large
SE will require the proportional increase of the atomic HW
resources, concatenated in a deep pipe. The principal limita-
tion comes from a limited programmability of such a pipe.
To complete this brief survey, we can also cite the sys-
tolic architectures proposed by Diamantaras and Kung [6],
Malamas et al. [13] or Shih et al. [21] for gray-scale or bi-
nary morphology. Their common inconvenience is the need
of an intermediate storage for 2-D structuring element and a
long response time of the system.
2.3 Novelty of this Paper
All previous algorithms optimize the dilation/erosion algo-
rithm, rather than the entire operator chain. The performance
will inevitably decrease for more complex applications with
long loops (iterations, idempotence) or concatenations.
Consider some serial morphological filter ζ = δε . . . δε
(with δ and ε standing for dilation and erosion, see Section
3 below for details). If the atomic operators δ and ε use se-
quential access to data then the entire ζ can run on pipelined,
time-delayed data. If the atomic algorithms δ and ε - in ad-
dition - have zero algorithm latency, then the entire chain ζ
inherits the same properties: sequential data access and zero
algorithm latency. This is an interesting property, since com-
puting ζ suddenly becomes very efficient: in stream, with
only the (further irreducible) operator latency of ζ .
In comparison with the preceding state of the art, the
Dokladal [7] algorithm extends the possibility to implement
erosion/dilation filters with the arbitrarily large, 2-D SE in
only one scan over the image, with the minimal algorithm
latency and memory requirements. If implemented in a ded-
icated hardware, we can obtain the same features even for
the implementation of long concatenations ζ .
This paper starts from the sequential HW implemen-
tation of the Dokladal algorithm (published in Bartovsky
et al. [1]). It describes more deeply the implementation
features and optimization techniques. It shows how to
exploit the inter-operator parallelism in ζ . Additionally, it
introduces another intra-operator parallelism in the compu-
tation of the 2-D erosion/dilation. The 2-D erosion/dilation
is implemented as a run-time programmable block. The
operation (erosion or dilation), the size and the origin of the
SE can be modified on-the-fly between two frames.
Several such blocks concatenated in a pipeline allow ob-
taining previously unachievable, real-time performances for
operators in the form of ζ . We can reach almost 100Hz
HDTV 1080p performance, independent of the length of ζ .
3 Basic principles
Let $\delta_B, \varepsilon_B : \mathbb{Z}^2 \to \mathbb{R}$ be a dilation and an erosion on grey-scale images, parametrized by a structuring element B, assumed rectangular, flat (i.e., $B \subset \mathbb{Z}^2$) and translation-invariant, defined as
$$\delta_B(f) = \bigvee_{b \in B} f_b \qquad (1)$$
$$\varepsilon_B(f) = \bigwedge_{b \in \hat{B}} f_b \qquad (2)$$
The hat denotes the transposition of the structuring element, equal to the set reflection $\hat{B} = \{x \mid -x \in B\}$, and $f_b$ denotes the translation of the function f by some vector b. The SE B is equipped with an origin $x \in B$. Below, B(x) denotes B placed with its origin at x.
The implementation of (1) and (2) consists of searching for the extremum of f within the scope of B
$$[\delta_B(f)](x) = \max_{b \in B} f(x - b) \qquad (3)$$
$$[\varepsilon_B(f)](x) = \min_{b \in B} f(x + b) \qquad (4)$$
Dilation and erosion by convex structuring elements satisfy homothecy. Let B be a convex structuring element, and rB the change of scale of B, with r > 1, $r \in \mathbb{Z}$. Then for the dilation we have
$$\delta_{rB} f = \underbrace{\delta_B \cdots \delta_B}_{r \text{ times}} f. \qquad (5)$$
Homothecy allows large dilations to be obtained by repeating the dilation by a small SE several times.
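As a small numerical check of Eqs. (3) and (5), the following minimal sketch verifies in 1-D that a dilation by rB equals r successive dilations by B. It is a brute-force illustration only: the helper names `dilate` and `dilate_repeated`, the symmetric segment and the border handling (samples outside the signal are ignored) are assumptions made for this example, not part of the algorithm discussed in this paper.

```python
def dilate(f, half_width):
    """Brute-force 1-D dilation by the symmetric segment [-half_width, half_width].

    Samples outside the signal are ignored (assumption for this sketch).
    """
    n = len(f)
    return [max(f[max(0, x - half_width):min(n, x + half_width + 1)])
            for x in range(n)]


def dilate_repeated(f, half_width, r):
    """Apply the dilation by B r times in a row (right-hand side of Eq. 5)."""
    for _ in range(r):
        f = dilate(f, half_width)
    return f


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
r = 3
large = dilate(signal, r)                 # B has half-width 1, so rB has half-width r
repeated = dilate_repeated(signal, 1, r)  # r dilations by the small SE
print(large == repeated)
```

The cost of the repeated form grows with r, which is why architectures relying on homothecy need deep pipelines for large SEs.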
Combinations of dilations and erosions form other operators. The basic concatenation products are the opening $\gamma_B = \delta_B \varepsilon_B$ and the closing $\varphi_B = \varepsilon_B \delta_B$. From these we can form the Alternating Filters $\gamma\varphi$, $\varphi\gamma$, $\gamma\varphi\gamma$ and $\varphi\gamma\varphi$. The number of combinations obtained from two filters is rather limited. Other filters can be obtained by combining two families of filters. This leads to morphological Alternate Sequential Filters (ASF), originally proposed by Sternberg [23] and studied in Serra [19], Chapter 10. In general, an ASF is a family of operators parametrized by some $\lambda \in \mathbb{Z}^+$, obtained by the alternating concatenation of two families of increasing and decreasing filters $\{\xi_i\}$ and $\{\psi_i\}$, respectively, such that $\psi_n \le \ldots \le \psi_1 \le \xi_1 \le \ldots \le \xi_n$.
The best-known ASFs are those based on openings and closings, obtained by taking $\psi = \gamma$ and $\xi = \varphi$:
$$ASF_\lambda = \gamma_\lambda \varphi_\lambda \cdots \gamma_1 \varphi_1 \qquad (6)$$
starting with a closing, and
$$ASF_\lambda = \varphi_\lambda \gamma_\lambda \cdots \varphi_1 \gamma_1 \qquad (7)$$
starting with an opening.
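To make the operator chains of Eqs. (6) and (7) concrete, here is a minimal 1-D sketch built from brute-force dilations and erosions. The function names, the symmetric segment of half-width k used as the SE of size k, and the border handling are assumptions chosen for illustration; the sketch says nothing about how the chain is scheduled in hardware.

```python
def dilate(f, k):
    """Brute-force 1-D dilation by the symmetric segment [-k, k]."""
    n = len(f)
    return [max(f[max(0, x - k):min(n, x + k + 1)]) for x in range(n)]


def erode(f, k):
    """Brute-force 1-D erosion by the symmetric segment [-k, k]."""
    n = len(f)
    return [min(f[max(0, x - k):min(n, x + k + 1)]) for x in range(n)]


def opening(f, k):
    """gamma = delta epsilon: erosion first, then dilation."""
    return dilate(erode(f, k), k)


def closing(f, k):
    """phi = epsilon delta: dilation first, then erosion."""
    return erode(dilate(f, k), k)


def asf(f, lam, start_with_closing=True):
    """Alternate sequential filter of order lam, Eq. (6) or Eq. (7)."""
    for k in range(1, lam + 1):
        if start_with_closing:
            f = opening(closing(f, k), k)   # gamma_k phi_k
        else:
            f = closing(opening(f, k), k)   # phi_k gamma_k
    return f


# A flat signal with one isolated peak (9) and one isolated hole (0).
signal = [5, 5, 5, 9, 5, 5, 5, 0, 5, 5, 5]
print(asf(signal, 1))
```

On this signal, an ASF of order 1 removes both the isolated peak and the isolated hole. Note that an ASF of order lam already chains 4·lam atomic dilations and erosions, which is the kind of long concatenation the paper targets.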
The second application example is the size distribution of a population of objects [14, 17, 28]. One way to compute it is as the residue of a sequence of openings
$$sd(\lambda) = \|\gamma_\lambda - \gamma_{\lambda-1}\| \qquad (8)$$
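The residue of Eq. (8) can be sketched in a few lines for a 1-D binary signal. The following is an illustration under stated assumptions: the norm is taken as the sum of absolute differences, the opening of size 0 is taken as the identity, and the SE of size λ is the symmetric segment of half-width λ. None of these choices is fixed by the text above.

```python
def dilate(f, k):
    n = len(f)
    return [max(f[max(0, x - k):min(n, x + k + 1)]) for x in range(n)]


def erode(f, k):
    n = len(f)
    return [min(f[max(0, x - k):min(n, x + k + 1)]) for x in range(n)]


def opening(f, k):
    return dilate(erode(f, k), k)


def size_distribution(f, max_lam):
    """sd(lam) = || opening_lam(f) - opening_{lam-1}(f) || for lam = 1..max_lam."""
    previous = list(f)  # assumption: the opening of size 0 is the identity
    sd = []
    for lam in range(1, max_lam + 1):
        current = opening(f, lam)
        # Assumption: the norm is the sum of absolute differences.
        sd.append(sum(abs(a - b) for a, b in zip(current, previous)))
        previous = current
    return sd


# Three objects of widths 1, 3 and 5, separated by background.
signal = [0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0]
print(size_distribution(signal, 3))
```

Each opening removes the objects narrower than its SE, so the successive residues measure how much signal disappears at each scale.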
The following section briefly recalls the principles of the
used algorithm, [7]. It starts by the 1-D dilation algorithm,
followed by the principle of separation of the n-D dilation
into perpendicular 1-D computations, preserving the stream
aspects at all levels.
4 Algorithm Description
4.1 1-D Dilation Algorithm
The algorithm principles and properties have been originally
published in [7]. We briefly recall the main important prin-
ciples for HW implementation.
For some 1-D input signal $f : 1 \ldots N \to \mathbb{R}$, the algorithm (see the Appendix for the 1-D dilation pseudocode) computes the value $\delta_B f(wp) = f(rp)$. The SE B, $B \subset \mathbb{Z}$, is a line segment containing its origin, and not necessarily symmetric. Consequently, the size of B is given by its span from the centre to the left and to the right, SE1 and SE2. The length of B is SE1 + SE2 + 1. The coordinates wp and rp stand for the current writing and reading positions.
The algorithm uses a FIFO queue Q. The queue supports the operations push, pop and dequeue (modifying the FIFO's content) and the queries front and back. The input signal f is read sequentially. A newly read value F := f(rp) is inserted in the FIFO queue as a pair {F, rp}, the sample F and its reading position rp (code line 3). In this pair, one can independently access either the value or the position by indexing. For example, the last stored element's value can be accessed (without dequeuing it) by the query Q.back()[1].
The algorithm does not store non-decreasing intervals (see [7] for details and proof). Values that turn out to belong to increasing or constant intervals are dequeued (code lines 1-2). Consequently, the values stored in the queue are always in decreasing order.
Old values, no longer covered by the SE, are removed from the queue (code lines 4-5). The result of the dilation $\delta_B f(x)$ is read at the front of the queue (code line 7). The result becomes available as soon as enough input data have been read; otherwise the output is empty (code line 9).
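The queue mechanics described above can be sketched as follows. This is a software illustration written from the description in this section, not the Appendix pseudocode itself: the function names, the convention that SE1 and SE2 are the spans to the left and right of the origin in the read signal, the end-of-signal flush, and the use of Python's `collections.deque` as the queue are all assumptions of this sketch. A brute-force reference is included for comparison.

```python
from collections import deque


def dilate_1d_stream(signal, se1, se2):
    """Streaming 1-D dilation; output[x] = max(signal[x - se1 : x + se2 + 1]).

    Assumption: samples outside the signal are ignored.
    """
    queue = deque()  # pairs (value, position); values strictly decrease front -> back
    out = []
    n = len(signal)
    for rp, sample in enumerate(signal):
        # Code lines 1-2: dequeue stored values dominated by the new sample.
        while queue and queue[-1][0] <= sample:
            queue.pop()
        # Code line 3: enqueue the current sample with its reading position.
        queue.append((sample, rp))
        if rp >= se2:  # enough look-ahead has been read to produce an output
            wp = rp - se2
            # Code lines 4-5: delete values no longer covered by the SE.
            while queue[0][1] < wp - se1:
                queue.popleft()
            out.append(queue[0][0])  # code line 7: the front holds the maximum
    # End of signal: flush the outputs still waiting for look-ahead samples.
    for wp in range(max(0, n - se2), n):
        while queue[0][1] < wp - se1:
            queue.popleft()
        out.append(queue[0][0])
    return out


def dilate_1d_brute(signal, se1, se2):
    """Reference implementation: explicit maximum over every window."""
    n = len(signal)
    return [max(signal[max(0, x - se1):min(n, x + se2 + 1)]) for x in range(n)]


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7]
print(dilate_1d_stream(signal, 2, 1) == dilate_1d_brute(signal, 2, 1))
```

Every sample is pushed once and removed at most once, so the work per sample is constant on average regardless of the SE length. An erosion follows from the same code with the comparison reversed.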
4.2 2-D Dilation Algorithm
The separability of n-D morphological dilation into lower
dimensions is a well known property. For example, a rect-
angular SE R decomposes as R = H⊕V where H and V are
horizontal and vertical segments and⊕ is the Minkowski ad-
dition. Then the dilation by a rectangle R can be computed
by concatenation of two perpendicular 1-D dilations
δR = δV δH . (9)
Because the 1-D computation accesses the data sequentially, two perpendicular 1-D computations can be assembled into a 2-D computation with sequential access at both levels, 2-D and 1-D, for both input and output data. There is no additional latency and no intermediate storage (the data are pipelined).
See the example of dilation by a rectangle R=H⊕V of
an N×M image f , Fig. 1. The image is sequentially read in
the raster-scan mode, line by line from left to right. The vari-
ous indices rp and wp denote reading and writing positions,
respectively, for the segments H and V , and the rectangle R.
The computation is illustrated for column i and line j, i.e.,
the result δR f (i, j) is to be written at wpR.
The computation of $\delta_R = \delta_V \delta_H$ decomposes as follows: the current reading position of $\delta_R$ coincides with $rp_H$, that is $rp_R = rp_H$. The result of the horizontal dilation, at $wp_H$, is immediately read by the vertical dilation in the respective column, that is $wp_H = rp_V$. The result of the vertical dilation $\delta_V$ is written at the writing position $wp_R$, i.e., $wp_V = wp_R$.
Fig. 1 Decomposition of dilation by a rectangle R into two 1-D dilations by segments H and V, see (9). The figure marks the positions $wp_R = wp_V$, $rp_R = rp_H$ and $wp_H = rp_V$ at column i and line j of the N×M image.
Notes:
- The $rp_R$ and $wp_R$ run over the image in the raster-scan mode. The distance between $rp_R$ and $wp_R$ is the (further irreducible) operator latency.
- There is one instance of the horizontal dilation running at the current line j, and N instances of the vertical dilation, i.e., one per column.
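The separability of Eq. (9) can be checked numerically with a short sketch. It only verifies that a horizontal pass followed by a vertical pass equals a direct rectangle dilation; it does not model the raster-scan scheduling or the per-column queue instances described in the notes. The function names, the span parameters (left, right, up, down) and the border handling are assumptions of this example.

```python
def dilate_1d(f, se1, se2):
    """Brute-force 1-D dilation over the window [x - se1, x + se2]."""
    n = len(f)
    return [max(f[max(0, x - se1):min(n, x + se2 + 1)]) for x in range(n)]


def dilate_rect_separable(img, left, right, up, down):
    """delta_R = delta_V delta_H: rows first, then columns."""
    rows = [dilate_1d(row, left, right) for row in img]           # delta_H on each line
    cols = [dilate_1d(list(col), up, down) for col in zip(*rows)]  # delta_V on each column
    return [list(row) for row in zip(*cols)]                       # transpose back


def dilate_rect_brute(img, left, right, up, down):
    """Direct maximum over the whole rectangle, for comparison."""
    m, n = len(img), len(img[0])
    return [[max(img[y][x]
                 for y in range(max(0, j - up), min(m, j + down + 1))
                 for x in range(max(0, i - left), min(n, i + right + 1)))
             for i in range(n)]
            for j in range(m)]


image = [
    [1, 0, 0, 2, 0],
    [0, 7, 0, 0, 0],
    [0, 0, 0, 0, 3],
    [4, 0, 0, 5, 0],
]
print(dilate_rect_separable(image, 1, 2, 1, 0) == dilate_rect_brute(image, 1, 2, 1, 0))
```

The separable form needs two 1-D passes per pixel instead of one pass over the whole rectangle, which is what makes the 1-D block a reusable building element.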
5 Sequential Hardware Implementation
In this section, we first describe in detail the implementation of the 1-D dilation in a basic block that (thanks to separability) can be used as a building block in a system of any dimension. We illustrate this below on a 2-D dilation or erosion. Second, we show how intra-operator parallelism can be introduced to increase performance.
5.1 1-D Algorithm Implementation
The 1-D algorithm presented in Section 4 is a system with
sequential behavior. It contains a while loop that can not be
unrolled (uncertain number of iterations). The common way
to implement such a system is the Mealy Finite-State Ma-
chine (FSM, see [15]). The FSM issues all the necessary op-
erations over the memory as well as it controls the input and
output data-flow. It consists of 2 states {S1,S2}.
[Figure 2: state diagram. States: Start, S1, S2, End. Transition conditions: Q.back()[1] ≤ F; Q.back()[1] > F; End of data; not End of data. Output signals: Q.dequeue(); Q.push({F, rp}); Q.pop(); return (Q.front()[1]).]
Fig. 2 State diagram of the 1-D Algorithm FSM. State transition conditions are typeset in bold; the output signals are given in a shadow bounding box.
The S1 state Dequeues all useless values. It is a data-dependent stage of the algorithm, as it dequeues an a priori unknown number of pixels. This is represented in the code by the while statement (code line 1). Consequently, its computation time varies from 1 clock cycle up to the SE size in clock cycles, in the worst case when all the previously stored pixels are unnecessary.
[Figure 3: block diagram. FSM part: control unit, comparators 1 to 4, switch request logic, counters CNTR page, CNTR stamp, CNTR RP and CNTR WP, input FIFO empty and output FIFO full flags, Dilation / Erosion selection, parameters SE2 and SE1+SE2. Memory part: queue memory accessed through push, back, back(DL), dequeue, pop and front (N queues for the vertical unit, 1 for the horizontal unit).]
Fig. 3 Overview of the implemented 1-D architecture. The FSM part manages the computation; the memory part contains the data storage (queues).
The Enqueue current sample signal (code line 3) is is-
sued upon the transition from S1 to S2.
The S2 state handles the code lines 4 and 5, Delete too
old values, and the lines 6 to 9 Return valid value or Return
empty. These instructions are independent and executed in
parallel. Consequently, the execution of S2 takes only one
clock cycle.
5.2 1-D Block Architecture
The HW implementation can be separated into 2 areas
(Fig. 3), the FSM part and the memory part. The FSM
manages entire computing procedure and temporarily stores
values in the memory part. The memory instantiates one
FIFO queue in the case of horizontal direction (horizontal
scan) and N FIFO queues in the vertical case (N is the
image width). The queues are addressed by a modulo N
Page counter (active in the case of vertical direction).
The Control unit is a sequential circuit that manages the
state transitions. It increments the rp, wp and manages the
Page and position Stamp counter appropriately. The Control
unit also performs the queue memory operations and handles
the backward full flags used for data-flow control.
Principle
At the beginning of S1, the last queued pixel is fetched from the queue by the Back() operation and passed to Comparator 1. The pixel value is compared with the value of the current sample. Notice that the comparator evaluates all three possible relations (>, <, =) at the same time, for both dilation and erosion. Based on the comparison results and the selected morphological function (dilation or erosion), the Control unit decides whether the enqueued pixel is to be dequeued (lines 1-2). Otherwise, the current pixel is extended with the reading-position stamp and enqueued (line 3).
S2 fetches the oldest queued pair {pixel, stamp} by the Front() operation. The read pixel is a correct result if rp has already reached or exceeded the SE2 parameter. This output-enabling condition (line 6) is checked by Comparator 3. Outdated values are deleted by comparing the current value of the reading-position stamp with the rp value of the oldest pair. Notice that the deletion has no impact on the output dilation value, because the Pop() operation (lines 4 and 5) issued by the Control unit takes effect only at the next clock edge.
The switch request logic is used only in the parallelized
version of the architecture, see Section 6. It is a simple block
containing several comparators which generate a signal with
the last output value of each parallel segment. Its purpose is
to inform the switch connected to the output that the end of
the segment has been reached and the following segment is
to be processed.
The entire set of parameters, i.e., SE dimensions and
selection of the morphological function, is run-time pro-
grammable at the beginning of the line for 1-D, and of the
frame for the 2-D implementation, respectively. In addition,
no further controller is needed; the internal behavior is
driven only by the regular scan order data-flow.
5.2.1 Reducing the impact of data-dependency
Hereafter, we briefly describe two techniques brought to the
system for higher throughput and lesser area occupation.
Number of dequeue steps
The data-dependent number of dequeue steps (denoted below by Steps) has an unpleasant consequence for the HW design: longer balancing FIFOs (see Fig. 4) and lower data throughput. For the HW design it is therefore important to minimize the worst-case upper bound $Steps_{orig} = SE - 1$.
The number of stored pixels lies within [1, SE[. Suppose that we are to dequeue D pixels. We know that the pixels are queued in strictly decreasing order. Thus, if the DL-th pixel from the back of the queue (DL < D) can be dequeued, then all the pixels queued after it, which are smaller, can be dequeued as well. This can be done at the same time. Hence, the worst-case number of dequeue steps is
$$Steps = \max_{D < SE} \left( D \,\mathrm{div}\, DL + D \bmod DL \right) \qquad (10)$$
where D denotes the number of pixels to be dequeued, and div and mod denote integer division and the remainder operation. D can be regarded as a uniformly distributed random variable $D \in [1, SE[$. We then need to find the optimal DL that minimizes Steps (Eq. 10) over all D:
$$DL_{optim} = \arg\min_{DL < D} \max_{D < SE} \left( D \,\mathrm{div}\, DL + D \bmod DL \right) \qquad (11)$$
The optimal $DL_{optim}$ yields the minimal number of dequeue steps $Steps_{optim}$
$$Steps_{optim} = \min_{DL < D} \max_{D < SE} \left( D \,\mathrm{div}\, DL + D \bmod DL \right) \qquad (12)$$
Table 1 lists, for some SE widths, the original and reduced number of dequeue Steps obtained with the optimal DL. Notice that more than one optimal DL can exist.
The SE is user programmable. $DL_{optim}$ is programmable as well, but there is no need to expose it to the user; it can instead be read from a LUT for every user-specified SE.
Table 1 Optimal dequeue length DL, original and reduced number of dequeue steps for selected SE widths
SE width      3    11      21         31         41
Steps_orig    2    10      20         30         40
DL_optim      2    3, 4    4, 5, 6    5, 6, 7    6, 7
Steps_optim   2    4       7          9          10
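The optimization of Eqs. (10) to (12) is small enough to be evaluated by exhaustive search, for instance to fill the LUT mentioned above. The sketch below is a literal reading of Eq. (10) under two assumptions that the text does not pin down exactly: D ranges over 1..SE-1 and the candidate DL ranges over 1..SE-1. The function names are hypothetical. Depending on how these bounds are interpreted, the set of minimizing DL values (and the result for very small SEs) may differ from the entries of Table 1.

```python
def worst_case_steps(se, dl):
    """Eq. (10): worst case over D in [1, SE[ of (D div DL) + (D mod DL)."""
    return max(d // dl + d % dl for d in range(1, se))


def optimal_dequeue_length(se):
    """Eqs. (11)-(12): minimal worst-case Steps and every DL that reaches it."""
    # Assumption: candidate dequeue lengths are 1 .. SE-1.
    costs = {dl: worst_case_steps(se, dl) for dl in range(1, se)}
    best = min(costs.values())
    return best, [dl for dl, cost in costs.items() if cost == best]


for se in (11, 21, 31, 41):
    steps, lengths = optimal_dequeue_length(se)
    print(se, worst_case_steps(se, 1), steps, lengths)
```

With DL = 1 the search reproduces the original bound SE - 1, and for the widths 11, 21, 31 and 41 the minimal worst case evaluates to 4, 7, 9 and 10 steps.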
Pixel addressing
Absolute pixel addressing in the queues can be advantageously replaced in HW by modulo addressing. Instead of the absolute reading position rp, we use the relative modulo position stamp = rp mod SE. The pixels are enqueued by Q.push(f, stamp) (code line 3).
The delete condition of line 4 changes accordingly. Using modulo addressing, a stored pixel becomes outdated whenever its modulo address equals that of the current pixel (stamp = Q.front()[2]).
The advantage of modulo addressing is a smaller data width. The stamp fits into $\lceil \log_2(SE - 1) \rceil$ bits, whereas absolute addressing requires $\lceil \log_2(N - 1) \rceil$ bits. This is mainly advantageous for the vertical orientation, which uses N queues per unit.
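A few lines suffice to illustrate the two properties that modulo addressing relies on, together with the bit widths given above. The helper names are hypothetical, and the values SE = 31 and N = 800 are simply taken from the worked example later in the paper.

```python
def ceil_log2(x):
    """ceil(log2(x)) for an integer x >= 1, computed without floating point."""
    return (x - 1).bit_length()


def stamp(rp, se):
    """Relative modulo position used instead of the absolute position rp."""
    return rp % se


SE, N = 31, 800
stamp_bits = ceil_log2(SE - 1)     # width of the modulo stamp, as given in the text
absolute_bits = ceil_log2(N - 1)   # width of an absolute position, as given in the text
print(stamp_bits, absolute_bits)

rp = 500
window = range(rp - SE + 1, rp + 1)  # the SE most recent reading positions
# Within one window every position has a distinct stamp ...
print(len({stamp(p, SE) for p in window}) == SE)
# ... and the sample that has just left the window shares the stamp of the current one.
print(stamp(rp - SE, SE) == stamp(rp, SE))
```

The second property is what allows the delete condition to be reduced to an equality test between two short stamps.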
5.3 2-D Dilation Implementation
Recall that dilation is separable into lower dimensions,
Eq. 9. The dilation by a rectangle can be implemented using
two 1-D dilation blocks, Fig. 4.
[Figure 4: block diagram. Data flow: input FIFO, horizontal unit (1-D DILAT), balancing FIFO, vertical unit (1-D DILAT), output FIFO; the positions RP and WP are marked.]
Fig. 4 2-D implementation is composed of 1-D blocks for the respective directions.
The computing latency of the dilation varies per each
pixel. In order to preserve the input/output stream flow, one
needs to compensate the different latencies by insertion of
balancing FIFOs. The FIFO fills when the preceding block
outputs data faster than the subsequent block can read. The
depth of this FIFO directly defines the upper bound of the
system latency of the 2-D block.
Obviously, the FIFOs should be as small as possible. The necessary depth follows from the dequeuing worst case
$$F_{input} = \frac{Steps_{hor} + 2}{StreamRate} - 1 \qquad (13)$$
$$F_{balance} = N \left( \frac{Steps_{ver} + 2}{StreamRate} - 1 \right) \qquad (14)$$
where $Steps_{hor}$ and $Steps_{ver}$ are the numbers of dequeue steps in the horizontal and vertical directions (12).
The output FIFO ensures a permanent stream delay in all
circumstances. Its maximal size is a sum of both FIFOs (in-
put and balancing). The instantaneous filling of output FIFO
is complementary to the filling of both FIFOs combined.
The overall delay does not change. If more 2-D blocks are
pipelined to form compound operators (e.g., opening, clos-
ing, ASF), only one output FIFO at the end is necessary.
[Figure 5: block diagram. The input FIFO feeds the horizontal unit (1-D DILAT); the vertical unit (1-D DILAT) is connected to a merged FIFO that holds both the balancing FIFO (pointers balancing.front, balancing.back) and the output FIFO (pointers output.front, output.back) within the depth of the output FIFO.]
Fig. 5 Merged FIFO replaces the balancing and output FIFOs to reduce memory requirements.
The output and balancing FIFOs can be merged (see Fig. 5) into one memory thanks to the following properties: 1) the vertical unit reads exactly one pixel from the balancing FIFO for each pixel written to the output FIFO. Consequently, the filling of these two FIFOs is complementary, and the occupied memory spaces cannot collide with each other; 2) the read/write activity is at most 1 access per 2 clock cycles. Hence, the reading ports of both FIFOs can share one memory port and the writing ports can share the other memory port (without overloading it). Merging both FIFOs reduces the memory to approximately one half. The resulting memory (see Fig. 5) has two pairs of standard FIFO ports, but it contains only one dual-port RAM.
5.4 Clock rate
The overall average clock rate lies between 2 clock cycles per pixel in the best case and 3 clock cycles per pixel in the worst case. Where the rate falls between 2 and 3 clock cycles per pixel is data dependent.
A temporary worst case arises whenever a monotonically decreasing signal is followed by a high value. This causes a number of samples to be dequeued at once (code lines 1-2, and the S1 state of the FSM), and the computing latency temporarily increases. However, the average computing latency remains unchanged, because during the entire monotonic decrease of the signal no values were dequeued. Therefore, the average number of clock cycles per pixel remains within these bounds.
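The amortization argument can be illustrated with a simple cycle count. The model below is an assumption made for this sketch, not a description of the actual circuit: state S1 is charged one cycle plus one cycle per dequeued value, state S2 is charged one cycle, the window is causal with length `se`, and the grouped dequeue with DL is ignored. The function name is hypothetical.

```python
from collections import deque


def cycles_per_pixel(signal, se):
    """Average cycles per sample under the simplified S1/S2 cost model."""
    queue = deque()  # pairs (value, position), values decreasing front -> back
    cycles = 0
    for rp, sample in enumerate(signal):
        dequeues = 0
        while queue and queue[-1][0] <= sample:   # S1: data-dependent dequeuing
            queue.pop()
            dequeues += 1
        queue.append((sample, rp))
        cycles += 1 + dequeues                    # assumed cost of S1
        if queue[0][1] <= rp - se:                # S2: delete the outdated value
            queue.popleft()
        cycles += 1                               # assumed cost of S2
    return cycles / len(signal)


decreasing = list(range(100, 0, -1))             # nothing is ever dequeued in S1
increasing = list(range(100))                    # every sample dequeues its predecessor
sawtooth = [9, 7, 5, 3, 1, 10] * 20              # long decreases followed by a high value
for name, data in (("decreasing", decreasing),
                   ("increasing", increasing),
                   ("sawtooth", sawtooth)):
    print(name, cycles_per_pixel(data, 5))
```

Since every sample is enqueued once and can be dequeued at most once, the total number of dequeue cycles never exceeds the number of samples, which keeps the average between 2 and 3 cycles per pixel under this model.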
5.5 Memory requirements
The memory requirements of the 2-D architecture consist of
horizontal and vertical computation-involved memories and
two balancing FIFOs, defined by (13) and (14).
In the vertical case, the algorithm uses several queues. Instantiating N separate memories would be resource inefficient because the FPGA RAM blocks could not be exploited. Instead, these queues are gathered in a single dual-port memory (see Fig. 6), since only one queue is accessed at a time (the others are idle). A single memory block also allows using an off-chip memory.
[Figure 6: memory map. Queue 1 occupies addresses 0 to H-1, Queue 2 addresses H to 2*H-1, ..., Queue N addresses (N-1)*H to N*H-1; each queue has its own pointers Q1.front/Q1.back, Q2.front/Q2.back, ..., QN.front/QN.back.]
Fig. 6 Vertical queues are mapped into a linear memory space side by side. The front and back pointers are stored in a separate memory.
Every queue has an associated pair of front and back pointers, which must be retained throughout the entire computation. The appropriate pair is always read before the particular queue is used, and the modified pointers are stored back after the computation leaves the queue. These pointers are stored in a separate pointer memory. The queues are efficiently packed into RAM blocks, resulting in only a small memory overhead.
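Let W×H denote the width×height of the rectangular SE, and bpp the number of bits per pixel. The memory contribution per 2-D unit is given by:
$$M_{hor} = W \left( bpp + \lceil \log_2(W-1) \rceil \right) \quad [\text{bits}] \qquad (15)$$
$$M_{ver} = N \left( H \left( bpp + \lceil \log_2(H-1) \rceil \right) + 2 \lceil \log_2(H-1) \rceil \right) \quad [\text{bits}] \qquad (16)$$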
The following example illustrates the very low memory consumption achieved thanks to the stream processing. Neither the input, nor the output, nor any working image is buffered.
Example: Consider a dilation of an 8 bpp SVGA image (i.e., 800×600 = N×M) by a square 31×31 SE.
The computation (the queues) requires, by (15) and (16),
$$M_{hor} = 31\,(8+5) = 403 \text{ bits}$$
and
$$M_{ver} = 800\,\bigl(31\,(8+5) + 2 \times 5\bigr) = 330.4 \text{ kbits}$$
resulting in a total of 331 kbits for the 2-D dilation.
The input and the balancing FIFOs require, by (13) and (14),
$$F_{input} + F_{balance} = (N+1)\left(\frac{Steps+2}{StreamRate} - 1\right) \cdot 8\,\text{bpp} = (800+1)\left(\frac{9+2}{3} - 1\right) \cdot 8\,\text{bpp} \approx 17 \text{ kbits}$$
The total memory needed to implement the 2-D dilation is 331+17 = 348 kbits. This is far below the size of the image itself, 800×600×8 bpp = 3.84 Mbits, which never needs to be stored.
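The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the function names are ours, and the values Steps = 9 and StreamRate = 3 are taken from the numbers appearing in the example, not from a restatement of (13) and (14).

```python
from math import ceil, log2


def m_hor(w, bpp):
    """Horizontal queue memory in bits, after (15)."""
    return w * (bpp + ceil(log2(w - 1)))


def m_ver(n, h, bpp):
    """Vertical queue memory in bits, after (16): N queues plus two pointers each."""
    ptr_bits = ceil(log2(h - 1))
    return n * (h * (bpp + ptr_bits) + 2 * ptr_bits)


def fifo_bits(n, steps, stream_rate, bpp):
    """Input and balancing FIFO memory in bits, as used in the example."""
    return (n + 1) * ((steps + 2) / stream_rate - 1) * bpp


# SVGA image (N = 800), 31x31 SE, 8 bits per pixel
queues = m_hor(31, 8) + m_ver(800, 31, 8)
fifos = fifo_bits(800, steps=9, stream_rate=3, bpp=8)
image = 800 * 600 * 8

print(f"queues: {queues / 1e3:.1f} kbits")
print(f"FIFOs:  {fifos / 1e3:.1f} kbits")
print(f"total:  {(queues + fifos) / 1e3:.1f} kbits, image: {image / 1e6:.2f} Mbits")
```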
6 Parallel Hardware Implementation
This section develops and implements the previously mentioned intra-operator parallelism in the dilation/erosion operator. Its main objective is to increase the throughput while preserving, as far as possible, the beneficial properties of the proposed algorithm, namely the sequential data access and the minimal latency.
The principle is based on concurrently working units that process different parts of the image simultaneously. The number of units used in parallel in each of the horizontal and vertical directions defines the parallelism degree (PD). Considering that the input data are fetched line by line, we propose a solution that minimizes the waiting-for-data periods of all units.
The image partition for the 2-D dilation is the intersection of a horizontal and a vertical partition (Fig. 7). Its granularity is determined by the PD. The horizontal partition (the partition of the image among the horizontal units) is interleaved, whereas the vertical units use a partition into compact blocks.
[Figure 7: three panels. Horizontal: image lines assigned cyclically to H1, H2, H3, H1, H2, H3, ... Vertical: three column stripes V1, V2, V3. Final segments: every segment is labeled by a pair of one horizontal unit Hi and one vertical unit Vj.]
Fig. 7 Example of image partition for PD=3: the image is divided horizontally line by line and into PD equal stripes in the vertical direction. The final image partition is obtained by intersection.
During the parallel processing, the computation runs simultaneously on multiple segments of the image, see Fig. 8. These segments must belong to different columns and lines, i.e., they must be placed on a diagonal.
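The partition can be expressed as a simple mapping from a pixel position to the pair of units that handle it. The sketch below is illustrative only: the function name is ours, unit indices are 1-based to match H1..Hn and V1..Vn, and rounding the stripe width up when the image width is not a multiple of PD is our assumption.

```python
def segment_of(x, y, width, pd):
    """Return (h_unit, v_unit) for the pixel in column x of line y."""
    h_unit = y % pd + 1          # lines are interleaved among the horizontal units
    stripe = -(-width // pd)     # columns per vertical stripe, rounded up
    v_unit = x // stripe + 1     # compact blocks of columns for the vertical units
    return h_unit, v_unit


# SVGA width, PD = 3: first pixel of the image, and a pixel further right on line 2
print(segment_of(0, 0, 800, 3))
print(segment_of(267, 2, 800, 3))
```

Two segments can be processed at the same time only if they differ in both components of this pair, which is the diagonal placement mentioned above.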
[Figure 8: four panels (a) to (d). Each panel shows the input line buffers, the routing between the horizontal units H1..H3 and the vertical units V1..V3, and the image segments with their processing state; the data interval handled by each unit is annotated, e.g. (0, N/3) for H1 in panel (a).]
Fig. 8 Image partitioning and switch routing in parallel processing for PD=3, decomposed in time: (a) beginning of processing, (b) to (d) after kN pixels, k=1..3. The shading denotes the state: dark gray, being computed; light gray, already computed; white, waiting.
The input data rate can theoretically be PD times faster than the computational throughput of one unit. Therefore, each image line needs to be buffered in a line buffer. The line buffers are filled at the external (fast) pixel rate and read at the internal, PD-times slower rate.
Figure 8 gives an example for PD = 3. We have three horizontal (H1 .. H3) and three vertical (V1 .. V3) processing units. As soon as the line buffer receives the first pixel, the first horizontal unit H1 starts processing and feeds its results to the first vertical unit V1, whose output is fed to the first output line, see Fig. 8(a). After N received pixels, the output of H1 is connected to V2, which still belongs to output line 1. Since H1 has left V1 and line 2 is being read, H2 can start processing the second line, feeding V1, which is now connected to output line 2, see Fig. 8(b). When the 2N-th input pixel is received, H1 connects to V3, H2 connects to V2 and H3 connects to V1, see Fig. 8(c), and so on.
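The routing sequence just described can be reproduced with a small scheduling function. This is a sketch under idealized assumptions of ours: no unit is ever stalled, the image has enough lines, and time is counted in intervals of N input pixels; the function name is hypothetical.

```python
def routing(k, pd):
    """Connections {horizontal unit: vertical unit} during the k-th interval
    of N input pixels (k = 0 is the beginning of the processing)."""
    links = {}
    # Line `line` starts at interval `line` and spends one interval per stripe.
    for line in range(max(0, k - pd + 1), k + 1):
        h_unit = line % pd + 1   # horizontal unit that owns this line
        v_unit = k - line + 1    # stripe reached by this line after k - line intervals
        links[h_unit] = v_unit
    return links


for k in range(4):
    print(k, routing(k, 3))
```

In every interval each horizontal unit is linked to a different vertical unit, so no two units compete for the same stripe in this idealized schedule.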
6.1 Architecture
The parallel architecture depicted in Fig. 9 contains four separable generic parts scalable by n ≡ PD: the input buffer, the horizontal part, the vertical part and the output buffer. The input buffer is mainly composed of a 1-to-n multiplexer and n line buffers (the control logic is omitted). It divides the fast input stream into n (n-times slower) streams processed by the computational units as described above. The output buffer composes the n slow streams of processed data into a single, fast output stream respecting the horizontal scan order of the image. The operator blocks can be concatenated into more complex functions (opening, closing, ASF, etc.); the buffers are then used only at the beginning and at the end of the chain.
[Figure 9: block diagram with four parts: INPUT BUFFER, HORIZONTAL PART (units H1, H2, ..., Hn and a switch built of basic units), VERTICAL PART (units V1, V2, ..., Vn and a switch built of basic units), OUTPUT BUFFER. The buffers run on the stream clock, the horizontal and vertical parts on the processing clock.]
Fig. 9 Overview of the parallel 2-D architecture. The horizontal and vertical stages can be instantiated several times between the input/output buffers to create compound operators.
Both the horizontal and the vertical part instantiate n balancing FIFOs, n horizontal or vertical units, and one switch that manages the interconnection. Each horizontal unit, along with its front-end FIFO, conforms to Section 5.2.
The width of the processing area proportionally affects both vertical memories, see (14) and (16). The area of every horizontal unit remains unchanged, since every unit processes an entire line; the overall memory of the horizontal part is therefore multiplied by n. In contrast, the memory requirement of every vertical unit is divided by n, because it processes only a fraction of the original image width. The area of the FSM of the vertical units increases linearly with n.
6.2 Switching
The routing of the computation units is handled by a switch block. Every switch contains n input ports from the previous units and the same number of output ports linked to the subsequent units. The purpose of the switch is to manage up to n interconnection channels. Notice that they are bidirectional: the data travel forward and the FIFO full flag travels backward. As shown in Fig. 8, the output switching of all input ports is circular, i.e., V1, V2, ..., Vn, V1, V2, ... and so forth. This property makes the switching easier because the only conditions to evaluate are when to switch and whether the requested output unit is available.
The moment when a given port is to be switched is provided by the Switch Request logic of the preceding unit, which generates a request every time it crosses the border between adjacent segments. If the desired unit is free, the switch reconnects the channel. If not, the switch raises the FIFO full flag of the requesting unit to stall it until the desired destination unit is freed and the channel can be established. All channels are switched independently, so stalling one unit does not affect the others.
[Figure 10: schematic of the switch basic unit for port pair A, with a control block and the signals Input data, Output data, Fifo full, Halt and Switch request; the signals to and from the basic units of ports B:N are shown alongside. Legend: A(1:n), destination identifier; A:N(1), source identifier.]
Fig. 10 Basic unit of the switch. Every switch contains n basic units for a correct routing between n input/output ports.
Figure 10 depicts the basic unit of the switch for one pair of input/output ports, referred to as A. For n pairs of ports this circuitry is instantiated n times. Each input port has an associated control block that manages all channel transitions, taking into account the availability of the requested partition. If the partition is still occupied, the requesting computation unit is stalled by holding its FIFO full flag active.
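The request and stall behavior of the switch can be modeled in a few lines. The model below is a behavioral sketch, not the authors' circuit: the class and attribute names are hypothetical, ports are numbered from 0, and we assume that a requesting port releases its previous output as soon as it issues the request.

```python
class Switch:
    """Circular n-port switch: every input walks through outputs 0, 1, ..., n-1, 0, ..."""

    def __init__(self, n):
        self.n = n
        self.target = [0] * n     # output that each input port will request next
        self.owner = [None] * n   # input port currently connected to each output

    def request(self, port):
        """Input `port` crossed a segment border. Returns True if the channel is
        reconnected, False if the port must stall (FIFO full flag held active)."""
        # Assumption: the segment is finished, so the previous output is released.
        for out in range(self.n):
            if self.owner[out] == port:
                self.owner[out] = None
        want = self.target[port]
        if self.owner[want] is not None:
            return False          # destination busy: the request is retried later
        self.owner[want] = port
        self.target[port] = (want + 1) % self.n
        return True


switch = Switch(3)
print(switch.request(0))  # port 0 takes output 0
print(switch.request(1))  # port 1 also wants output 0, which is busy: stalled
print(switch.request(0))  # port 0 moves on to output 1 and frees output 0
print(switch.request(1))  # the retried request of port 1 now succeeds
```

Because every port keeps its own target and the outputs are checked individually, a stalled port does not block the others, as stated above.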
7 Experimental results
The proposed 2-D stream processing architectures have been implemented in VHDL and targeted to the Xilinx Virtex5 FPGA (XC5VSX95T-2) using the XST synthesis tool. The processing clock frequency is 100 MHz. Notice that the queues are gathered in a block RAM memory, and thus its access time adds to the critical path delay.
The measured performance of the non-parallel architecture (PD=1) in terms of overall latency, clock cycles per pixel and FPGA area is given in Tables 2 and 3.
Table 2 Timing and area vs. SE, SVGA image size, PD=1.
Size of SE (sq.)  | 3x3   | 11x11 | 21x21 | 31x31 | 41x41
Latency [clk]     | 1908  | 9474  | 18888 | 28351 | 37969
Av. rate [clk/px] | 2.344 | 2.356 | 2.360 | 2.361 | 2.361
Registers         | 212   | 232   | 242   | 242   | 252
LUTs              | 584   | 761   | 859   | 859   | 953
Block RAMs        | 2     | 6     | 13    | 13    | 28
Table 3 Timing, frame rate and area w.r.t. image size, SE = 31x31 square, PD=1.
Size of Image     | CIF   | VGA   | SVGA  | XGA   | 1080p
Latency [clk]     | 12826 | 23465 | 28351 | 37472 | 69548
Av. rate [clk/px] | 2.371 | 2.376 | 2.361 | 2.383 | 2.368
Experimental FPS  | 384   | 130   | 85    | 51    | 20.5
Worst-case FPS    | 319   | 106   | 68    | 41    | 16
Registers         | 231   | 237   | 242   | 242   | 253
LUTs              | 761   | 853   | 859   | 859   | 1057
Block RAMs        | 7     | 13    | 13    | 13    | 26
One can observe that the overall latency depends on the SE size, the image width (both through the operator latency) and the pixel rate (computing latency). The average pixel rate (AR) remains nearly constant (Tables 2 and 3). The average pixel rate is expressed by (17) and the stream frame rate (frames per second, FPS) by (18), where Tproc is the overall time consumed by the processing and fclk = 100 MHz is the clock frequency of the computation units.
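$$AR = \frac{T_{proc} - 2\,SE_2\,(N + M + SE_2)/PD}{N M} \quad \text{[clk/px]} \qquad (17)$$
$$FPS = \frac{f_{clk}\,PD}{PD \cdot AR \cdot N M + 2\,SE_2\,(N + M + SE_2)} \quad \text{[fr/s]} \qquad (18)$$
N and M denote the width and height of the image, and SE2 denotes the width of the structuring element from the origin rightwards.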
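As a plausibility check, the tabulated frame rates can be recomputed from the tabulated average rates. The sketch below solves (17) for Tproc and assumes that Tproc is counted in processing clock cycles, so that one frame takes Tproc / fclk seconds. The function names are ours, and SE2 = 15 (a 31×31 SE with a centered origin) is our assumption.

```python
def t_proc(ar, n, m, se2, pd):
    """Processing time per frame in clock cycles, obtained by solving (17) for Tproc."""
    return ar * n * m + 2 * se2 * (n + m + se2) / pd


def fps(ar, n, m, se2, pd, f_clk=100e6):
    """Frames per second, assuming one frame takes t_proc / f_clk seconds."""
    return f_clk / t_proc(ar, n, m, se2, pd)


# Average rates taken from the tables: SVGA with PD=1 and PD=6, 1080p with PD=6
print(round(fps(2.361, 800, 600, 15, pd=1), 1))
print(round(fps(0.426, 800, 600, 15, pd=6), 1))
print(round(fps(0.418, 1920, 1080, 15, pd=6), 1))
```

With these inputs the results come out at about 85.1, 472.7 and 113.4 frames per second, in line with the experimental FPS rows for SVGA and 1080p.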
Concerning the area occupation (see the Xilinx documentation [29]), the number of registers is quasi-constant, whereas the number of LUTs and BRAM blocks increases linearly with the SE and image sizes (Tables 2 and 3). Although the vertical memory (whose size is given by (16)) is packed into RAM blocks, the amount of memory used always exceeds the theoretical value. This is caused by the different memory organization; e.g., the required word width is 13 bits, whereas the available memories have a width of 36 bits or fractions thereof.
The experimental frames-per-second (FPS) rate is obtained on a natural test image (see Fig. 11). The worst-case FPS is the theoretical worst-case performance of the system, expected on synthetic saw-shaped data.
Table 4 presents the relative speed-up of the parallel architecture vs. the degree of intra-operator parallelism PD. In terms of overall latency and average processing rate, the clock cycle of the processing domain is considered as the reference unit. Note that the latencies of the parallel versions are merely fractions (divided by PD) of the non-parallel values.
[Figure 11: four panels. (a) Test image, (b) Zoom on test, (c) ASF4 filtered, (d) Zoom on ASF4.]
Fig. 11 (a) Experimental 800×600 lotus image. (b) Zoom on the fine veined texture, disadvantageous for the algorithm. (c) Result of the ASF4 filter, see Eq. 6 or 7. (d) Zoom on the ASF4 filtered image.
Table 4 Timing vs. degree of intra-operator parallelism PD. SVGA image, SE = 31x31 square.
PD                | 2     | 3     | 4     | 5     | 6
Latency [clk]     | 14243 | 9561  | 7244  | 5818  | 4893
Av. rate [clk/px] | 1.220 | 0.824 | 0.625 | 0.505 | 0.426
Exp. speed-up     | 1.938 | 2.869 | 3.785 | 4.682 | 5.554
Table 5 Area vs. degree of intra-operator parallelism PD. SVGA image, SE = 31x31 square.
PD                  | 2    | 3    | 4    | 5    | 6
Registers           | 650  | 978  | 1280 | 1605 | 1938
LUTs                | 2138 | 3227 | 3862 | 4875 | 6054
Block RAMs          | 13   | 14   | 14   | 18   | 21
Registers (buffers) | 661  | 969  | 1279 | 1587 | 1896
LUTs (buffers)      | 1408 | 2086 | 2776 | 3459 | 4135
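How closely the latency follows the 1/PD rule can be checked directly from the tabulated values. The short script below is only a reading aid: it uses the SVGA, 31x31 latency for PD=1 from Tables 2 and 3 and the latencies and speed-ups of Table 4, and the variable names are ours.

```python
latency_pd1 = 28351  # SVGA image, 31x31 SE, PD=1
latency = {2: 14243, 3: 9561, 4: 7244, 5: 5818, 6: 4893}
speedup = {2: 1.938, 3: 2.869, 4: 3.785, 5: 4.682, 6: 5.554}

# Ratio of the non-parallel latency to the parallel one; ideally equal to PD.
latency_ratio = {pd: latency_pd1 / lat for pd, lat in latency.items()}
# Parallel efficiency: measured speed-up divided by the number of units.
efficiency = {pd: s / pd for pd, s in speedup.items()}

for pd in latency:
    print(pd, round(latency_ratio[pd], 2), round(efficiency[pd], 3))
```

The latency ratio stays within a few percent of PD for all tested degrees, and the efficiency decreases slowly as PD grows.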
The FPGA area results (Table 5) are separated into two groups: the area of the computing parts and that of the buffers. The area of the input and output buffers is linear w.r.t. both N and PD, since their essential components are PD line buffers (FIFO memories of N elements with independent ports). The area of the operator units in terms of slice registers and LUTs is proportional to PD as well, because n independent circuits are instantiated in parallel. Although the overall vertical memory requirements remain unaffected by PD, in practice the number of occupied RAM blocks slightly increases, owing to the different memory organization.
Table 6 Timing and frame rate vs. image size, PD = 6, SE = 31x31 square.
Size of Image     | CIF   | VGA   | SVGA  | XGA   | SXGA  | 1080p
Latency [clk]     | 2208  | 3996  | 4893  | 6390  | 7391  | 11641
Av. rate [clk/px] | 0.443 | 0.431 | 0.426 | 0.426 | 0.427 | 0.418
Experimental FPS  | 2075  | 724   | 472   | 290   | 174   | 113
Worst-case FPS    | 1915  | 640   | 411   | 246   | 151   | 96
The ultimate timing results (PD=6) versus the image size are listed in Table 6, which illustrates the real performance of the architecture. It achieves at least 96 fps at the 1080p image size (full HD TV image size).
The worst case occurs on an artificial saw-shaped image with no constant plateaus. Such an image induces the maximal number of iterations of the algorithm's while loop. The best-case fps (not given in the table) is obtained with a constant image. A real, unfiltered image containing textures or random noise achieves a performance somewhere between the best and worst cases. For instance, at 1080p the worst case is 96 fps, the best case is 140 fps, and the achieved experimental performance is 113 fps.
This frame rate remains constant for any serial morphological filter (such as an ASF). Obviously, the FPGA area increases according to the size of the ASF. The implementation is eased by the fact that one can use an off-chip memory.
7.1 Comparison with existing HW implementations
Table 7 presents a comparison with other recent architectures. The table is divided into three sections. The processing unit section presents the features of a single 2-D computational unit, the second section gives the HW specifications, and the third section gives the performance on a given application, an ASF filter.
One can see that Clienti [3] offers a high throughput for small 3×3 rectangular SEs. Similarly, the Chien ASIC chip [2] provides very reasonable performance on small SEs. On the other hand, Déforges [5] directly offers large, non-rectangular, convex SEs, but with a lower processing rate. The programmability is not mentioned; namely, it is not clear whether the SE shape can be controlled after the synthesis.
Although all these solutions are efficient for small SE sizes or short concatenations, they become more or less penalized for longer filters. This issue is illustrated by the Example Application in Table 7, which estimates the performance on a five-stage $ASF_5 = \varphi_{11\times11}\gamma_{11\times11} \dots \varphi_{3\times3}\gamma_{3\times3}$. Decomposed into a sequence of dilations and erosions, it can be realized as $ASF_5 = \varepsilon_{11\times11}\delta_{21\times21} \dots \varepsilon_{5\times5}\delta_{3\times3}$. Notice that it makes use of a progressively increasing SE. On neighborhood processors, a large SE can be obtained using the homothecy, Eq. 5. The Clienti SPOC instantiates 16 processing units of size 3×3. Hence, the ASF5 will require 5 image scans, with the entire image necessarily buffered in the memory. Chien also uses the homothecy. This deteriorates the throughput.
One might think of instantiating a longer pipe in order to reduce the number of image scans. Alas, a long, fixed-length pipe lacks flexibility. Consider another application to illustrate the problem: the size distributions, exemplified by Fig. 12. Contrary to ASF, the size distributions are often sampled sparsely (the SE increments by more than one) and, at the same time, one often goes to much larger SE sizes. Every opening $\{\gamma_{B_i}\}$ in (8) needs to be output and stored in the memory to compute the subtraction. For small sizes, a long pipeline is underused and the workload of the processing units is unbalanced, whereas for large λ one may still need several image scans.
For example, for sizes λ = 5, 10, 15, 20, 25, as in Fig. 12, the Clienti SPOC will require 7 image scans: the 16-processor pipe is underused for λ = 5, 10, 15, whereas it requires 2 scans for each of λ = 20, 25.
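The scan count quoted above can be reproduced with a very simple cost model. The model is our assumption, chosen because it is consistent with the numbers in the text: an opening of size λ is taken to occupy about λ stages of 3×3 units, and a pipe of 16 stages processes at most 16 of them per image scan. The function name is hypothetical.

```python
from math import ceil


def scans_per_opening(size, pipe_length=16):
    """Image scans needed when an opening of the given size occupies about
    `size` elementary 3x3 stages (assumed cost model)."""
    return ceil(size / pipe_length)


sizes = [5, 10, 15, 20, 25]
per_size = [scans_per_opening(s) for s in sizes]
# Fraction of the available stages that do useful work for each size
utilization = [s / (n * 16) for s, n in zip(sizes, per_size)]

print(per_size, sum(per_size))
print([round(u, 2) for u in utilization])
```

Under this model the five openings need 1, 1, 1, 2 and 2 scans, 7 in total, and the pipe is far from fully used for the smallest sizes.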
Our processing unit with programmable SE size avoids using the homothecy. This allows an optimal workload distribution over the entire pipe, which is important for processing large images in real-time systems.
[Figure 12: two panels. (a) Example of a texture. (b) Size distribution sd(λ); horizontal axis: size of λ (ticks 5 to 30), vertical axis: residue of opening (scale ×10^7).]
Fig. 12 The size distribution of the texture grain.
Table 7 Comparison of several FPGA and ASIC architectures concerning morphological dilation and erosion. Column groups, from left to right: Processing unit, HW System, Example Application ASF5.
Architecture | Parallel degree | Supported SE | Throughput [Mpx/s] | fmax [MHz] | Clock rate [clk/px] | Number of units | Supported image | Image scans | FPS
Clienti [3]  | 4 | arb. 3x3    | 403 | 100 | 0.25  | 16* | 1024x1024 | 5  | 80
Chien [2]    | 1 | disk 5x5    | 190 | 200 | 1.052 | 1   | 720x480   | 27 | 21.5
Déforges [5] | 1 | arb. convex | 50  | 50  | 1     | 1*  | 512x512   | 11 | 17.2
This paper   | 6 | rectangles  | 234 | 100 | 0.426 | 11* | 1920x1080 | 1  | 113
* Number of available stages varies with the size of the FPGA used
8 Conclusions
This paper describes an efficient implementation of serial morphological filters with flat, rectangular structuring elements of arbitrary size. The efficiency is obtained through the following properties:
− The computational complexity is linear w.r.t. the image size and independent of the SE size.
− The overall latency is mostly equal to the latency of the operator, which is determined by the size of the structuring element used.
− It uses strictly sequential access to the data at all algorithm levels.
− The low memory consumption (far below the size of the image) allows complex operators able to process large images to be embedded on a single chip.
− Two levels of parallelism: i) the inter-operator parallelism in serial concatenations ζ = δε . . . δε allows running all these atomic δ and ε operators simultaneously, and ii) the intra-operator parallelism in every atomic dilation/erosion. The intra-operator parallelism is scalable (tested up to six) and allows the decomposition of fast streams into several slower streams processed in parallel without altering the streaming property of the system.
The architecture serves as a basic building block for the construction of more complex operators such as ASF, granulometries, etc., with the same properties and performance. The performance obtained on an FPGA approaches the 100 Hz HDTV 1080p standard, which is far above what has been reported in the literature.
This performance, allied to the programmability, is extremely interesting. It makes advanced morphological operators accessible to industrial systems running under severe time constraints. Examples include on-line production control, defectoscopy of aging materials, etc., wherever the processing of high-resolution images with low latency is required.
Appendix: The 1-D Dilation Pseudocode
Algorithm 1: df ← 1D DILATION (rp, wp, f, SE1, SE2, N)
Input: rp, wp - reading/writing position; f - input signal value f(rp); SE1, SE2 - SE size towards left and right; N - length of the signal; Q - a FIFO-like queue
Result: output signal value δB f(wp)
    while Q.back()[1] ≤ f do
        Q.dequeue()                      // Dequeue useless values
    Q.push({f, rp})                      // Enqueue the current sample
    if wp - SE1 > Q.front()[2] then
        Q.pop()                          // Delete too old value
    if rp = min(N, wp + SE2) then
        return (Q.front()[1])            // Return valid value
    else
        return ({})                      // Return empty
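For readers who wish to experiment, the pseudocode can be transcribed into a short Python function. This is a software sketch, not the hardware implementation: it processes a whole signal at once instead of one sample per call, uses 0-based positions, guards against an empty queue, and flushes the remaining outputs when the end of the signal is reached. The function name and the use of collections.deque are our choices.

```python
from collections import deque


def dilate_1d(signal, se1, se2):
    """Streaming 1-D dilation: out[x] = max(signal[x - se1 : x + se2 + 1]),
    with the window clipped at the borders of the signal."""
    n = len(signal)
    queue = deque()  # (value, position) pairs with strictly decreasing values
    out = []
    wp = 0           # writing position
    for rp, f in enumerate(signal):  # rp is the reading position
        # Dequeue values that can no longer be a maximum
        while queue and queue[-1][0] <= f:
            queue.pop()
        queue.append((f, rp))
        # Emit every output whose window has been read completely
        # (or is cut off by the end of the signal)
        while wp < n and rp == min(n - 1, wp + se2):
            while queue[0][1] < wp - se1:  # delete too old values
                queue.popleft()
            out.append(queue[0][0])
            wp += 1
    return out


print(dilate_1d([1, 3, 2, 5, 4, 0, 0, 6], se1=1, se2=1))
```

For the sample signal the result is [3, 3, 5, 5, 5, 4, 6, 6], i.e., the maximum over each window of three samples, clipped at both ends.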
References
1. J. Bartovský, E. Dokládalová, P. Dokládal, and V. Georgiev. Pipeline architecture for compound morphological operators. In ICIP10, 2010.
2. S.-Y. Chien, S.-Y. Ma, and L.-G. Chen. Partial-result-reuse archi-
tecture and its design technique for morphological operations with
flat structuring elements. Circuits and Systems for Video Technol-
ogy, IEEE Transactions on, 15(9):1156 – 1169, sept. 2005.
3. Ch. Clienti, S. Beucher, and M. Bilodeau. A system on chip dedi-
cated to pipeline neighborhood processing for mathematical mor-
phology. In EURASIP, editor, EUSIPCO 2008, Lausanne, August
2008.
4. A. Cord, D. Jeulin, and F. Bach. Segmentation of random textures
by morphological and linear operators. In 8th ISMM, pages 387–
398, Oct. 2007.
5. O. Déforges, N. Normand, and M. Babel. Fast recursive grayscale morphology operators: from the algorithm to the pipeline architecture. Journal of Real-Time Image Processing, pages 1–10, 2010. 10.1007/s11554-010-0171-8.
6. K. I. Diamantaras and S. Y. Kung. A linear systolic array for real-
time morphological image processing. J. VLSI Signal Process.
Syst., 17(1):43–55, 1997.
7. P. Dokládal and E. Dokládalová. Computationally efficient, one-pass algorithm for morphological filters. Journal of Visual Communication and Image Representation, 22(5):411–420, 2011.
8. E.R. Dougherty. Mathematical morphology in image processing. Taylor and Francis, Inc., 1992.
9. J. Gil and M. Werman. Computing 2-d min, median, and max
filters. IEEE Trans. Pattern Anal. Mach. Intell., 15(5):504–507,
1993.
10. J.-C. Klein and J. Serra. The texture analyser. J. of Microscopy,
95:349–356, 1972.
11. D. Lemire. Streaming maximum-minimum filter using no more
than three comparisons per element. CoRR, abs/cs/0610046, 2006.
12. F. Lemonnier and J.-C. Klein. Fast dilation by large 1D structuring
elements. In Proc. Int. Workshop Nonlinear Signal and Img. Proc.,
pages 479–482, Greece, Jun. 1995.
13. E. N. Malamas, A. G. Malamos, and T. A. Varvarigou. Fast im-
plementation of binary morphological operations on hardware-
efficient systolic architectures. J. VLSI Signal Process. Syst.,
25(1):79–93, 2000.
14. P. Maragos. Pattern spectrum and multiscale shape representation.
IEEE Trans. Pattern Anal. Mach. Intell., 11(7):701–716, 1989.
15. G. H. Mealy. A method for synthesizing sequential circuits. Bell
Systems Technical Journal, 34:1045–1079, 1955.
16. L. Najman and H. Talbot, editors. Mathematical Morphology:
From Theory to Applications. ISTE Ltd and John Wiley & Sons
Inc, 2010.
17. R. Sabourin, G. Genest, and F. Prêteux. Off-line signature verification by local granulometric size distributions. IEEE Trans. Pattern Anal. Mach. Intell., 19(9):976–988, 1997.
18. J. Serra. Image Analysis and Mathematical Morphology, vol-
ume 1. Academic Press, New York, 1982.
19. J. Serra. Image Analysis and Mathematical Morphology, vol-
ume 2. Academic Press, NY, 1988.
20. J. Serra and L. Vincent. An overview of morphological filtering.
Circuits Syst. Signal Process., 11(1):47–108, 1992.
21. F. Y. Shih, T. K. Chung, and C. C. Pu. Pipeline architectures for re-
cursive morphological operations. IEEE Trans. Image Processing,
4(1):11 –18, jan. 1995.
22. P. Soille, E. Breen, and R. Jones. Recursive implementation of ero-
sions and dilations along discrete lines at arbitrary angles. IEEE
Trans. Pattern Anal. Mach. Intell., 18(5):562–567, 1996.
23. S. Sternberg. Grayscale morphology. Comput. Vision Graph. Im-
age Process., 35(3):333–355, 1986.
24. E. R. Urbach and M. H. F. Wilkinson. Efficient 2-D grayscale
morphological transformations with arbitrary flat structuring ele-
ments. IEEE Trans. Image Processing, 17(1):1 –8, jan. 2008.
25. M. Van Droogenbroeck and M. J. Buckley. Morphological ero-
sions and openings: Fast algorithms based on anchors. J. Math.
Imaging Vis., 22(2-3):121–142, 2005.
26. M. van Herk. A fast algorithm for local minimum and maximum
filters on rectangular and octagonal kernels. Pattern Recogn. Lett.,
13(7):517–521, 1992.
27. J. Velten and A. Kummert. Implementation of a high-performance
hardware architecture for binary morphological image processing
operations. In Circuits and Systems, 2004. MWSCAS ’04. The
2004 47th Midwest Symposium on, volume 2, pages II–241 – II–
244 vol.2, 25-28 2004.
28. L. Vincent. Granulometries and opening trees. Fundamenta In-
formaticae, 41(1-2):57–90, January 2000.
29. Xilinx. Virtex-5 family documentation, 2009, available at
http://www.xilinx.com/support/documentation/virtex-5.htm.
30. J. Xu. Decomposition of convex polygonal morphological struc-
turing elements into neighborhood subsets. IEEE Trans. Pattern
Anal. Mach. Intell., 13(2):153–162, 1991.
31. X. Zhuang and R. M. Haralick. Morphological structuring element
decomposition. Computer Vision, Graphics, and Image Process-
ing, 35(3):370–382, 1986.
