Fast median calculation method by Cadenas Medina, Oswaldo et al.
1 
 
Fast median calculation method 
 
 
J. Cadenas, G. M. Megson, R. S. Sherratt and P. Huerta 
 
 
 
The ever increasing demand for high image quality requires fast and efficient 
methods for noise reduction. The best known order-statistics filter is the median filter. 
A method is presented to calculate the median on a set of N W-bit integers in BW  
time steps. Blocks containing B-bit slices are used to find B-bits of the median; using 
a novel quantum-like representation allowing the median to be computed in an 
accelerated manner compared to the best known method (W time steps). The 
general method allows a variety of designs to be synthesised systematically. A 
further novel architecture to calculate the median for a moving set of N integers is 
also discussed.  
 
Introduction: The median filter is the most common non-linear spatial filter known for 
its excellent noise-reduction capability in smoothing of images [1]. It also finds 
applications broadly in digital signal processing analysis [2]. The median, M, on a set 
of integers is such that half the integers in the set are less or equal to M, and half are 
greater or equal to M. Without loss of generality, an odd number of integers in the 
set, N = 2k +1, is most practical; this is maintained throughout this Letter. If the 
integers were sorted, the median is the integer in the middle location. 
The median can be calculated by sequential sorting with a complexity of O(N log N) 
[3]. Non-sorting based methods, especially those designed for hardware 
architectures achieve the calculation in a number of steps related to the bit length of 
unsigned integers, W, rather than N [4]. The method of Prokin and Prokin [5] is better 
than previous sorting and non-sorting methods and uses W steps to find the median. 
This Letter presents a method to calculate the median on a set of N W-bit integers in 
2 
 
BW  time steps, where B is a parameter used to decide how to slice the integers in 
the data set into blocks.  
The new method creates slices of B-bits, forming BW  processing blocks, each 
block calculating B-bits of the output median. Blocks of 2-bit or 3-bit are practical for 
hardware implementation. Working on more than 1-bit reveals hidden parallelism. 
However, no previous known method operates on more than 1-bit at a time, as 
presented here. The data representation of B-bit slices is changed on the fly allowing 
the method to explore ahead, further than one bit a time, using a Quantum 
Representation (QR) of bits, which retains the advantages of flexible data 
manipulation and data parallel properties. This Letter starts with an example on a 
small set of non-negative integers. The analysis produces an architecture for 
calculating the median on a sliding window of size N. The method is then extended to 
cover signed integers and to solve the more generic problem of a rank filter.  
 
Quantum representation: Define a data bit as a qubit with the form d = a0Q[0] + a1Q[1] 
where ai are quantum amplitudes [6]. The qubit is a superposition of two states 
defined as Q[0] = [0 1] and Q[1] = [1 0]. Hence, d = a0 [0 1] + a1 [1 0] = [a1 a0] = Q[d]; if d 
= 0, Q[d] = Q[0], else Q[d] = Q[1]. The tensor operator onto two qubits is defined as: Q[g] 
  Q[h] = [g1 g0]   [h1 h0] = [g1h1 g1h0 g0h1 g0h0] which extends straightforwardly to 
higher dimensions. A p-bit binary string Xj = xp-1xp-2 ... x0 can be represented as 
         0121 XXpXpXjX QQQQQ    , which forms a 2
p-bit QR. For instance, when 
W = 2, x1x0 = 00, Q[00] = [0 1]   [0 1] = [0 0 0 1]. This format maps the input bits to a 
set of mutually orthogonal vectors which allows implicit parallel processing and 
provides a unique index within a 2p elemental vector. 
 
Small example: Consider a data set of N = 5 elements 4, 8, 2, 7, 14, each of W = 4 
bits. N = 2k+1 implies k = 2 and indicates that, P = k+1 = 3 is the position of the 
3 
 
median. Partitioning the input data with B = 2 bits forms two groups as in Table 1, 
creating vertical slices of two bits (far left in table). Group 1 is processed first. The 
QR is generated, for each 2-bit slice and labelled as qi for i = 3, ..., 0. All qi are added 
in parallel (vertically). These sums are then accumulated right to left into A. A parallel 
comparison is performed on each accumulation against P, searching for the first 
occurrence (right to left) when A ≥ P; (column q1 in the example). The two MSB of the 
median are then found as “01”, indicating that only the elements 4, and 7 are median 
candidates. Next, group 2 is processed. First, P is recalculated as P = 3 -1 = 2 (1 
being the A value to the right under q1 column) and then the slices for elements 8, 2 
and 14 (gray in table) are nullified. Computing the QR, sum and accumulation for the 
second group proceeds, as before, on the remaining 2-bit slice. The condition A ≥ P 
is first satisfied for A under q3. The remaining two bits for the median are “11”. 
Combining the results from group 1 and 2 give the median as M = 0111 = 710.  
The QR representation produces a virtual parallel sorting mechanism (see blocking 
4-bits at the time, bottom of Table 1). For 2-bit blocks, block 1 discards item elements 
2, 8, and 14. The element 2 must be to the right of the median, so the median new 
position is P = 3 – 1 = 2, within the only remaining median candidates 4 and 7. The 
process generalises to large problem sizes and in principle after the accumulation 
and comparisons, the median can be obtained in a single step. The exponential 
nature of the tensor operator for generating QR becomes unwieldy for large W, but 
for small and reasonable cases (particularly those used in applications) can provide a 
fast architecture.  
 
Sliding window median architecture: Figure 1 shows an inside view of block 1 to 
process a set of N elements of W bits each, by blocking B-bits at a time which 
produces  BW  processing blocks. Input elements are pipelined with N stages. 
Each stage holds B bits; this forms a sliding window with element slices X1W-1-(j-1)B, ..., 
4 
 
XNW-jB, for j = 1, 2, ..., BW . The QR is generated for each slice producing 2
B bits with 
indices i = 0, ..., 2B-1. The QR is easily implemented with B-to-2B binary decoders. A 
parallel sum of N bits each is performed along index i and labelled S1i for block 1 with 
accumulation A1i. For the sum S, carry-save-adder (CSA) tree structures [7] produce 
a sum of log2N bits over which the accumulation is performed. For a typical N = 9 or 
11, fast 4-bit adders will suffice. The expected position for the median is P1 for the 
first block with a value P1 = k+1. Internally to the logic, 2B parallel greater than or 
equal comparisons are performed on the log2N bits of A
1 against P1. A simple 
arrangement of XOR gates then extracts the first occurrence of the comparisons A ≥ 
P. A single bit over an i bit-array, i = 0, ..., 2B-1 will be set (for B = 2, this could be [0 0 
1 0]) as the result of the comparison process. The set bit position bit-code 
corresponds to the B bits from the median that can be extracted in that block. The 
logic then calculates the new position P2 to be used in the next block, calculated as 
P2 = P1 – A1i-1 for the i value where the comparison is true.  
All previous results are pipelined vertically where the calculation continues to the next 
block. Elements that cannot be the median get nullified (using the enable bits shown 
in the figure as e1t, t = 1, ..., N for the first block). A structure to generate these nullify 
enable signals for the next block is shown in Figure 2 (left) along with a suggested 
logic for each Et block in the figure (for B = 2, right). For the first block these enable 
signals are conveniently wired to 1. Thus, after BW  processing blocks the median 
M is found at the bottom, with each block contributing B bits of M. The position of the 
median in the window is indicated by the enable set bits generated on the last block. 
 
Timing analysis: Reducing the median computation by B steps would not represent 
an improvement if the time of each step, due to the complexity of the arrangement, is 
larger than before. Figure 3 shows that this does not happen for N >7. The median 
computation time in [5] is t[5] = WT[5] and, for figure 1, t = (W/B)T, thus the speedup 
factor of Figure 1 is s = B(T[5]/T) with T[5] and T representing the critical paths of [5] 
5 
 
and Figure 1 respectively. At N = 15, directly from Figure 3, T[5]/T ~ 0.625 for B = 2, 
and consequently speedup s = B(T[5]/T) ~ 1.25 or 25% faster. Similarly, a 25% 
speedup is seen for B = 3. We corroborated this evidence for a circuit implemented in 
FPGA technology (Virtex 4 device) for the case N = 15 and B = 2. T is essentially due 
to the CSA tree followed by a chain of 2B-1 accumulators of log2N width each, which 
make them suitable for implementation with fast full carry look-ahead adders [7].  T[5] 
is basically the delay cost of the CSA tree. 
 
Signed integer extension: Let the number of positive integers be C0 and negative 
integers C1, then N = C0 + C1. If C0 > C1, set P = k + 1 – C1 initially, otherwise set P 
= k + 1 – C0; the method applies unmodified to the remaining W-1 bits. 
 
A rank filter: An order R filter, for a set of N elements has R elements less or equal to 
the output and N – R elements greater or equal to the output [5]. For N = 2k +1, the 
median is a rank filter with R = k. The method in this Letter calculates the median by 
setting P = R + 1 initially. The accumulation in figure 1 can be performed left to right 
as shown, or right to left. Thus, in order for the method to perform as a rank filter, for 
left to right accumulation, requires setting P = N – R initially. For right to left it 
requires setting P = R + 1.  
 
Conclusion: This method reduces by B the number of steps to calculate the median 
on a set of N items compared to previously known methods. We found parameters B 
and N for which circuit implementations also make the method faster.  This is ideal 
for real-time processing of images. The method applies to non-negative or signed 
integers and also easily extends to the case of rank filtering. Multiple architectures 
can be derived with different tradeoffs between area and time in fully pipelined 
structures.  
6 
 
References 
 
 
1 GONZALEZ, R., and WOODS R.: „Digital image processing‟ (Prentice Hall, 2002)   
 
2 YOUNG, N., and EVANS A. N.: „Median centred difference gradient operator and 
its applications in watershed segmentation‟, Electron. Lett., 2011, 47, (3), pp. 178-
180  
 
3 CORMEN, T. H, LEISERSON, C. E., RIVEST R. L., and STEIN C.: „Introduction to 
Algorithms‟ (The MIT Press, 2003)  
 
4 CHANG, S., and CHIG, W.: „A parallel median filter with pipelined scheduling for 
real-time 1D and 2D signal processing‟, IEICE Trans. Fundamentals, 2000, E83-A, 
(7), pp. 1396-1404 
 
5 PROKIN, D., and PROKIN, M.: „Low hardware complexity pipelined rank filter‟, 
IEEE Trans. On Circ and Syst. II., 2010, 57, (6), pp. 446-450 
 
6 NEILSEN, M. A., and CHUANG, I. L.: „Quantum computation and quantum 
Information‟ (Cambridge, 2000) 
 
 
 
7 PARHAMI, B.: „Computer arithmetic, algorithms and hardware designs‟ (Oxford, 
2000) 
 
Authors’ affiliations: 
 
J. Cadenas and R. S. Sherratt (School of Systems Engineering, University of 
Reading, Reading, RG6 6AY, United Kingdom) 
 
G. M. Megson (School of Electronics and Computer Science, University of 
Westminster, London W1T 3UW, United Kingdom) 
 
P. Huerta (Escuela Técnica Superior, Universidad Rey Juan Carlos, Madrid, Spain) 
 
E-mail: o.cadenas@reading.ac.uk 
 
 
7 
 
Figure Captions 
 
 
Fig. 1 A pipelined architecture of a block to calculate B-bits of the median by 
processing slices of B-bits from a set of N = 2k + 1 non-negative elements of W-bits. 
 
Fig. 2 Left: Nullify enable signal architecture and Right: A logic structure for each 
block Et on the left for the case of B = 2 bits, W = 4 bits.  
 
Fig. 3 Estimated ratio of T[5]/T for critical paths of the architecture in [5] (T[5]) against 
the critical path of the architecture in Figure 1 (T) as a function of window size N.  
 
 
Table Captions 
 
 
Table 1: (Top) Median calculation for input items 4, 6, 2, 7, 14 by blocking 2-bits at a 
time. (Bottom) Quantum representation of 4, 6, 2, 7, 14 by blocking 4-bits at a time. 
8 
 
Figure 1 
 
 
 
 
 
+ 
Logic 
X1 X2   XN ... 
XW-1, .., W-B 
 QR  QR  QR 
1
1e  
 
1
2e  
 
1
Ne  
 

N
 
N
 

N
 
1
1B2
S
  
 
1
2B2
S

 
 
1
0S  
 
+ + 
0 ... 
1
0A  
 
1
2B2
A

 
 
1
1B2
A

 
 
2p  
 
 MW-1, .., W-B 
1p  
 
 
 
 
 
 
 
9 
 
Figure 2 
 
 
 
 
M3 
M2 X1 X2   XN ... 
XW-1, .., W-B 
   E1    E2    EN 
 2
1e  
 
2
2e  
 
2
Ne  
 
MW-1, .., W-B 
t
3X
 
 
1
te  
 
1
1e  
 
1
Ne  
 
... 
 
t
2X  
 
2
te  
 
1
2e  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
10 
 
Figure 3 
 
 
  
11 
 
Table 1 
 
Item10 Item2  Group 1 q3 q2 q1 q0  Group 2 q3 q2 q1 q0 
4 0100  01 0 0 1 0  00 0 0 0 1 
8 1000  10 0 1 0 0  00 0 0 0 0 
2 0010  00 0 0 0 1  10 0 0 0 0 
7 0111  01 0 0 1 0  11 1 0 0 0 
14 1110  11 1 0 0 0  10 0 0 0 0 
   S = Σqi 1 1 2 1  S = Σqi 1 0 0 1 
   A = ΣS 5 4 3 1  A = ΣS 2 1 1 1 
 
q15 q14 q13 q12 q11 q10 q9 q8 q7 q6 q5 q4 q3 q2 q1 q0 
0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 
 
 
  
 
 
