Abstract-The time to process each of the W/B processing blocks of a median calculation method on a set of N W-bit integers is improved here by a factor of three compared with the literature. The parallelism uncovered in blocks containing B-bit slices is exploited by independent accumulative parallel counters so that the median is calculated faster than by any known previous method for any values of N and W. The improvements to the method are discussed in the context of calculating the median for a moving set of N integers, for which a pipelined architecture is developed. An extra benefit of a smaller area for the architecture is also reported.
bits toward finding the median. A designer has to analyze the tradeoffs of parameters N, B, and W in order to produce a winning architecture. For instance, our previous architecture [6] is faster than the work in [3] only for N > 7 when working on slices of B = 2, 3, or 4 bits. The improvement here makes the method faster than our previous work [6] for any N while maintaining blocks of 2 or 3 bits for practical hardware implementations. In fact, an analysis indicates that the architecture presented here is faster than those previously reported, even in the case of 1-bit slices (B = 1). As each block contributes B bits to the median, the key idea in this brief is to maintain a parallel accumulation to select these B bits within each block, whereas previously, this accumulation was computed serially within a block. The novel approach that led to the improvement in this brief relies on the concept of accumulative parallel counters (APCs) [7]. This brief starts by applying the APC concept to a set of N nonnegative integers (a single window), using a small value of N as an example. An APC is then applied to the case of maintaining the accumulation on a sliding window of size N, from which an architecture for calculating the median follows.
II. APCS
An APC is defined as an $l$-bit register that is updated by the sum of its previous contents and its $r$ 1-bit inputs [7]. For instance, for a 3-bit register with a current value of 3 and a vector of four 1-bit inputs with values [0, 1, 1, 0], the register value is incremented by 2 and is thus updated to 5. In effect, the register accumulates the number of ones in its 1-bit inputs. An APC circuit with $r$ 1-bit inputs is arranged in such a way that the delay to perform its operation, in terms of full/half adders and using an $l$-bit ripple-carry adder, is given by $\log_2 r + l$ (for details, see [7]). The impact of this result on the median architecture in this brief is discussed in the timing analysis in Section V.
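The register update described above can be mimicked in software. The following is a minimal sketch, not the circuit of [7]: the function names are hypothetical, the wrap-around on overflow is an assumption, and so is the ceiling applied to $\log_2 r$ when $r$ is not a power of two.

```python
import math

def apc_update(value, bits, width):
    """Add the number of ones in `bits` to an l-bit register (l = width)."""
    ones = sum(bits)
    # Assumption: the register wraps around if the sum exceeds l bits.
    return (value + ones) % (1 << width)

def apc_delay(r, width):
    """Delay in full/half-adder units as quoted in the text: log2(r) + l."""
    # Assumption: round log2(r) up when r is not a power of two.
    return math.ceil(math.log2(r)) + width

# The example from the text: a 3-bit register holding 3, inputs [0, 1, 1, 0].
new_value = apc_update(3, [0, 1, 1, 0], width=3)
print(new_value)            # 5
print(apc_delay(4, 3))      # 5 adder delays for r = 4, l = 3
```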
Our median calculation method slices each $W$-bit data item into slices of $B$ bits, which gives $W/B$ processing blocks. Within each block, the accumulation of the slices of bits is kept using an array of APC registers. Consequently, within a block, $2^B$ APC registers are maintained, with the first register having $r = 1$ 1-bit input, the second having $r = 2$ 1-bit inputs, and the last register having $r = 2^B$ 1-bit inputs. In general, given the bits $q_i$ of an $r = 2^B$ 1-bit input, an array of $2^B$ APCs, i.e., $A_i = \sum_{i=0}^{r-1} q_i$, is arranged for processing per block. For $N$ data items within a window, each APC register $A_i$ is of $l = \log_2 N$ bits. Before an example is presented, it is worth recalling how to generate a 1-bit input vector of length $r = 2^B$ taking $B$-bit slices from the data items.
1549-7747 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. 
A. Generation of Bit Vectors
An item data bit is interpreted as having disjoint amplitudes $a_0$ and $a_1$ for bit values 0 and 1, respectively. The item data bit is manipulated to be expressed in the form
] so that when a data bit value is 0, it is represented as $Q[0]$; otherwise, it is represented as $Q[1]$. This expression is based on the quantum representations of bits [8]; thus, let us call $Q[d]$ a qubit. The operations on qubits, such as a tensor between two or multiple qubits, can now be defined. For instance, the tensor between two qubits is defined as
Thus, for two bits $x_0 x_1 = 10_2$, the qubit tensor is
The tensor between two qubits is already familiar to us; it is equivalent to the binary decoding of two bits, i.e., a 2-to-4 binary decoder. The method here manipulates the bit slices of a data input as qubits and builds their tensor; on a $B$-bit slice, this is equivalent to performing a binary decoding operation of $B$ bits to generate a $2^B$-bit vector, i.e., a $B$-to-$2^B$ binary decoder [9]. This bit vector is the one previously referred to as vector $q$ with size $r = 2^B$. For the specific case of $B = 2$, $q = [q_3\, q_2\, q_1\, q_0]$, and $r = 4$. APC register $A_0$ takes as input $q_0$, $A_1$ takes as input $q_1 q_0$, $A_2$ takes as input $q_2 q_1 q_0$, and $A_3$ takes as input $q_3 q_2 q_1 q_0$. A median calculation procedure using APC registers proceeds as in the following example.
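The decoding step and the wiring of the decoded bits into the APC registers can be sketched as follows. This is an illustrative software model only; `decode_slice` and `apc_inputs` are hypothetical names, and the lists are stored with $q_0$ first, whereas the text writes $q_3 q_2 q_1 q_0$.

```python
def decode_slice(s, B):
    """B-to-2^B binary decoder: q[i] is 1 only for i == s (q[0] is q_0)."""
    return [1 if i == s else 0 for i in range(1 << B)]

def apc_inputs(q):
    """Inputs of register A_i are the bits q_0 .. q_i of the decoded vector."""
    return [q[: i + 1] for i in range(len(q))]

q = decode_slice(0b10, B=2)                      # q_0..q_3 = [0, 0, 1, 0]
increments = [sum(bits) for bits in apc_inputs(q)]
print(q)            # [0, 0, 1, 0]
print(increments)   # [0, 0, 1, 1]: only A_2 and A_3 count this slice
```

A slice of value $s$ therefore increments exactly the registers $A_s, \ldots, A_{2^B-1}$, which is what makes each $A_i$ a running count of slices with value at most $i$.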
B. Small Example
Consider a data set of $N = 9$ integers, i.e., $x_j = \{3, 1, 29, 21, 16, 9, 11, 19, 17\}$, each of $W = 5$ bits (labeled as [4 : 0]). Note that $P = \lceil N/2 \rceil = 5$. Using all five bits in the representation of each integer (and applying a qubit tensor) requires a full 5-to-32 binary decoder generating a bit vector of length $r = 2^5 = 32$ bits. Performing the 5-to-32 binary decoding for each integer in the set (and ORing into a bit vector of size 32, with all 32 bit positions initially zero), we generate the bit mapping presented in Table I. The binary decoding produces an indirect ordering of the integers in the set, and the median can then be directly taken as $16_{10}$ since it occupies the middle position of the nine integers in the 32-bit vector $q$ (or the $P = 5$ position of the nine in the vector). However, the growth of the input size $W$ in bits and other nuisances (such as repeated integers in the set) make this full binary decoding approach impractical as a direct method for computing the median, at least for sizes $W > 8$ [10]. Nevertheless, the approach can be used as the principle of operation when processing slices of bits taken from the input integers instead of taking all bits at once.

Table II (note the dot in $(x_j)_2$). The integers are processed one at a time, and the binary decoding of the integer slice is performed on the fly into a $2^B$-bit $q$ vector (e.g., slice "000" of integer 3 in Block 1 is decoded as vector $q$ = [0, 0, 0, 0, 0, 0, 0, 1]). From this decoded bit vector, the input to a given APC is selected as previously stated. For instance, register $A_2$ has three 1-bit inputs, taking from the decoded vector the bits $q_2 q_1 q_0$ = "001"; thus, the $A_2$ register will update to a count of 1. Therefore, in general, APC register $A_i$ operates on $i + 1$ 1-bit inputs, taking from decoded vector $q$ all bits with indices $0, \ldots, i$. All APC registers are updated in parallel for each integer slice, as shown in Table II.
The running count is shown on each APC after the slice for each input integer is processed. After all nine integers are processed by Block 1, $A_7, A_6, \ldots, A_1, A_0$ have counts of 9, 8, 8, 7, 4, 4, 2, and 2, respectively. This is the count after the slice for integer $17_{10}$ (the last in the input window) is processed.
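The Block 1 counts quoted above can be reproduced with a short software model. This is a sketch under the assumption that a slice of value $s$ increments registers $A_s$ and above; the function name `block_counts` is hypothetical, and the loop is serial only for readability (the hardware updates all registers in parallel).

```python
def block_counts(values, hi, lo):
    """Final APC counts A_0..A_{2^B-1} for the slice bits [hi:lo] of each value."""
    B = hi - lo + 1
    A = [0] * (1 << B)
    for x in values:
        s = (x >> lo) & ((1 << B) - 1)   # B-bit slice of this integer
        for i in range(s, 1 << B):       # decoded bit q_s feeds A_s, A_{s+1}, ...
            A[i] += 1
    return A

window = [3, 1, 29, 21, 16, 9, 11, 19, 17]
A = block_counts(window, hi=4, lo=2)     # Block 1 works on bits [4:2]
print(A[::-1])   # A_7 .. A_0 -> [9, 8, 8, 7, 4, 4, 2, 2]
```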
C. Finishing the Example: Calculating Median
Calculating the median proceeds in a similar way to the procedure shown in our previous work [6]. A given block finds $B$ bits of the median as the first index $i$ (scanning right to left in Table II) for which $A_i \ge P$; this comparison is performed in parallel. For Block 1, this comparison resolves to the bit vector "11110000" using $P = 5$ (as shown at the bottom of Table II). Applying priority encoding [9] to this vector (with priority from right to left) gives index $i = 4$, which corresponds to the column under APC $A_4$. This index corresponds to slices of 3 bits with values of "100." Thus, the three MSBs of the median are found as $M[4:2]$ = "100." Of the nine input integers in the window, only integers 16, 19, and 17 had processed bit slices with values of "100," indicating that only integers 16, 19, and 17 are still median candidates (highlighted in gray in Table II). Integers 3, 1, 29, 21, 9, and 11 must be nullified so that they cannot update any $A_i$ for Block 2.
Next, Block 2 is processed. First, the position for the median is recalculated as $P = 5 - 4 = 1$ (4 being the count of $A_3$, immediately to the right of the $A_4$ column, which is underlined in Table II for Block 1). Computing $A_i$ proceeds as before, i.e., on the remaining 2-bit slice for all $x_j$. Condition $A_i \ge P$ is first satisfied for $i = 0$. The remaining two bits for the median are thus $M[1:0]$ = "00." Concatenating the results from Blocks 1 and 2 gives the median as $M = 10000_2 = 16_{10}$.
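The two-block procedure of this example can be checked end to end with the sketch below. It is a software illustration under stated assumptions, not the pipelined hardware: `median_by_slices` is a hypothetical name, nullification is modeled with a list of flags, and the comparison and priority encoding are written as a sequential search.

```python
def median_by_slices(values, slice_widths):
    """Median of an odd-sized list, found block by block from the MSB slices down."""
    N = len(values)
    P = (N + 1) // 2                      # median position, ceil(N/2)
    alive = [True] * N                    # False once an integer is nullified
    median, lo = 0, sum(slice_widths)
    for B in slice_widths:
        lo -= B
        mask = (1 << B) - 1
        A = [0] * (1 << B)
        for x, ok in zip(values, alive):
            if ok:
                for i in range((x >> lo) & mask, 1 << B):
                    A[i] += 1
        # Comparison A_i >= P followed by priority encoding (lowest index wins).
        idx = next(i for i in range(1 << B) if A[i] >= P)
        if idx > 0:
            P -= A[idx - 1]               # position handed to the next block
        median = (median << B) | idx
        alive = [ok and ((x >> lo) & mask) == idx for x, ok in zip(values, alive)]
    return median

window = [3, 1, 29, 21, 16, 9, 11, 19, 17]
print(median_by_slices(window, [3, 2]))   # 16
```

The same function accepts other slicings of the same word, e.g. `[2, 2, 1]` or `[1, 1, 1, 1, 1]`, which is a convenient way to check that the result does not depend on the choice of $B$.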
D. Key Observations for Improvements
From the aforementioned example, the following key observations are made regarding the improvements to the method presented here. First, reformulating the accumulation of bits in terms of APCs allows all $A_i$ to be computed in parallel, and as the decisions for finding the median are made by parallel logic on the accumulations $A_i$, the method should be faster than the method as it stands in [6]. The previous method was equivalent to calculating the histogram of slice values (in parallel) and then accumulating the histogram from right to left (a serial process); this is one key difference. Second, the nullification of integers that are not candidates for the median is easier to handle when postponed until the next block. This leads to the third observation, i.e., further optimizations can be made to each APC arrangement for the case of a sliding window of $N$ integers accepting a single integer at a time. This observation is valid for the front-end processing block (the block that processes the slices of the MSBs). In this case, a single integer leaves the window while a new integer arrives into the window. The front-end block sees and discards at most one integer within a window, which can be conveniently exploited for improvements, as presented in the following.
III. APCS ON SLIDING WINDOW
Consider a continuous stream of input integers arriving one at a time for processing; a median filter is interested in finding the median of the most recent $N$ integers; thus, we have a running window of size $N$. Once a pipeline with $N$ integers gets full, a single old integer leaves the window while a single new integer arrives into the window. For the method here, a processing mechanism requires a coherent update of the accumulations $A_i$ for a correct fully streaming pipelined operation. Such an update can be thought of as a parallel subtraction of the contribution of the oldest integer slice and, likewise, an addition of the contribution of the newest integer slice. Consider a stream of integers $x_j = \{3, 1, 29, 21, 16, 9, 11, 19, 17, 14, \ldots\}$. The first window of nine integers is that presented in Table II. The second window is now composed of integers {1, 29, 21, 16, 9, 11, 19, 17, 14}; the oldest integer in the window was of value 3, and a new integer of value 14 enters the window. With the new window, repeating the whole computation of $A_i$ for Block 1 in Table II gives counts of [9, 8, 8, 7, 4, 3, 1, 1]. The same counts follow from those of the first window by subtracting [1, 1, 1, 1, 1, 1, 1, 1] (the contribution of slice "000" of the leaving integer 3) and adding [1, 1, 1, 1, 1, 0, 0, 0] (the contribution of slice "011" of the arriving integer 14) to the running accumulation. The net effect is subtracting (in parallel) the vector value [0, 0, 0, 0, 0, 1, 1, 1] ([1, 1, 1, 1, 1, 1, 1, 1] XOR [1, 1, 1, 1, 1, 0, 0, 0]).
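The window update for this stream can be worked through numerically. The sketch below assumes that each integer contributes a vector with ones at register positions $A_s$ and above, where $s$ is its 3-bit slice value; `thermometer` and `slide` are hypothetical names, and lists are stored with $A_0$ first.

```python
def thermometer(s, B):
    """Contribution of a slice of value s: ones at positions s .. 2^B - 1."""
    return [1 if i >= s else 0 for i in range(1 << B)]

def slide(A, old_slice, new_slice, B):
    """Subtract the leaving slice's contribution and add the arriving one."""
    out_v = thermometer(old_slice, B)
    in_v = thermometer(new_slice, B)
    return [a - o + n for a, o, n in zip(A, out_v, in_v)]

A = [2, 2, 4, 4, 7, 8, 8, 9]          # A_0..A_7 after the first window
# Integer 3 leaves (bits [4:2] = 000) and integer 14 arrives (bits [4:2] = 011).
A2 = slide(A, old_slice=3 >> 2, new_slice=14 >> 2, B=3)
print(A2[::-1])   # A_7 .. A_0 -> [9, 8, 8, 7, 4, 3, 1, 1]
```

Only registers $A_2$, $A_1$, and $A_0$ change in this step, since the two contributions cancel everywhere else.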
A. Update Logic on APCs
Note the following from the discussion earlier. First, the slice decoding process sets a bit $i$ in the decoded vector $q$, $i = 0, \ldots, 2^B - 1$, and then, all the bits from $i + 1$ up to $2^B - 1$ are also set before the vector is added or subtracted. This, in effect, is a sign-bit extension for a vector of length $2^B$. Let us denote the sign extension on the decoded vector by sign($q_i$), with a bit vector size of $2^B$ bits. Second, the sign-extended decoded values of the old and new slices are XORed. A full analysis of what has to be performed to maintain a coherent accumulation $A_i$ is given by Table II for $B = 2$. The block accepts integer inputs $x_j$ and the median position for this block $P_{in}$; it is most convenient to accept input $M_{in}$ holding the median slice value found by a previous block. The median slice found by this block is generated at the bottom as $M_{out}$, as well as the median position to be used by a next block $P_{out}$. Fig. 1 processes the 2-bit slices of a window of $N = 5$ integers $x_j$ of $W$ bits each; thus, an array of four APC registers of up to four 1-bit inputs, with each register of a size of 3 bits, is arranged. The sign extension performed on the decoder outputs can be maintained in parallel using gates with a fan-in of at most $2^B$ inputs. Alternatively, binary-to-thermometer encoding can replace the decoding and sign extension for a direct lookup table implementation [11]. In addition, notice that the decoders can be inhibited by a single enable bit to account for the nullification of integers within a running window; they are fully enabled for the front-end block by simply making
. This is the reason to have $M_{in}$ as input, i.e., to contain the nullification signals within a block rather than passing these from one block to the next (which would require $N$ bits). A full circuit arrangement for the APC [7] is not necessary because, at most, a single integer arrives at or leaves the window; this is a further key optimization to a front-end block, as in Fig. 1. After the comparison $A_i \ge P$ is performed (the comparator block in Fig. 1), a priority encoder produces the median slice for this block. A simple arrangement of a (priority) multiplexer acting on the comparison output to select the value to be subtracted from input $P_{in}$ to generate output $P_{out}$ completes the operation of the block (the Multiplexer Adder block in Fig. 1).

[Figure caption fragment: "... Table 1, using a front-end block of B = 3 bits followed by a processing block with B = 2 bits."]
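The behavior of a front-end block on a stream can be modeled in software as below. This is a behavioral sketch only, not the circuit of Fig. 1: the class and method names are hypothetical, pipeline registers and the delay $L$ are ignored, and the XOR of the two sign-extended vectors is modeled as "update only the registers where the old and new contributions differ."

```python
from collections import deque

class FrontEndBlock:
    """Behavioral model of a front-end block on the top B bits of W-bit inputs."""

    def __init__(self, N, W, B):
        self.N, self.B, self.lo = N, B, W - B
        self.window = deque()
        self.A = [0] * (1 << B)              # A[0] .. A[2^B - 1]

    def _slice(self, x):
        return (x >> self.lo) & ((1 << self.B) - 1)

    def push(self, x, P_in):
        """Accept one integer; return (M_out, P_out) once the window is full."""
        new = self._slice(x)
        full = len(self.window) == self.N
        old = self._slice(self.window.popleft()) if full else None
        self.window.append(x)
        for i in range(1 << self.B):
            t_new = i >= new                          # sign-extended new slice
            t_old = old is not None and i >= old      # sign-extended old slice
            if t_new != t_old:                        # XOR selects changed registers
                self.A[i] += 1 if t_new else -1
        if len(self.window) < self.N:
            return None
        idx = next(i for i in range(1 << self.B) if self.A[i] >= P_in)
        P_out = P_in - (self.A[idx - 1] if idx else 0)
        return idx, P_out

blk = FrontEndBlock(N=9, W=5, B=3)
outs = [blk.push(x, P_in=5) for x in [3, 1, 29, 21, 16, 9, 11, 19, 17, 14]]
print(outs[8], outs[9])   # (4, 1) (4, 1): M[4:2] = "100" for both windows
```

Because at most one register step of +1 or -1 is applied per cycle, each register in this model behaves as an up/down counter, which is in line with the remark that a full APC circuit is not needed in the front-end block.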
Note that the pipeline arrangement is clear in Fig. 1; a delay of $N$ clock cycles is required to see all integers from a window (from left to right registers in Fig. 1) plus the extra delay (top to bottom) in the architecture in Fig. 1. This extra delay is denoted by $L$. The gray boxes in Fig. 1 indicate the places where extra registers might be necessary for a faster pipelined operation. The overall latency for a front-end block is $N + L$ clock cycles, and after this latency, a median slice is produced every clock cycle.
IV. MEDIAN ARCHITECTURE FOR SLIDING WINDOW
In general, for $W$-bit integers, physical blocks of $B$ bits each give $W/B$ processing blocks; however, it is possible to make each processing block operate with its own value of $B$ (in bits), as shown in Table II. Fig. 2 shows the block diagram for computing the median for the example in Table II. The front-end block (Block 1) computes the 3-bit median slice, i.e., $M[4:2]$, and a second processing block (Block 2) computes the remaining 2-bit median slice $M[1:0]$, as detailed in Table II. $M[4:2]$ is available after $N + L$ clock cycles; thus, the next processing block needs to be aligned in time by delaying input $x_j$ by $L$ clock cycles so that the current input window is already loaded into the next processing block. Note that, for the next processing block, the assumption made earlier of having a single old integer leaving the window while a single new integer arrives into the window is no longer valid. Consequently, the simplifications shown in Fig. 1 cannot be used directly. Fortunately, the processing block previously presented in [6] can be used instead, with two key modifications. The first modification is that integer nullification is replaced by the scheme in Fig. 1 here. Hence, $M[4:2]$ computed by the front-end block is compared with $x_j[4:2]$ seen by the processing block. The second modification is that full APC circuit arrangements, as detailed in [7], can be incorporated into the block for a parallel accumulation. This is needed since more than one integer can enter or leave the set of candidates within a window for Block 2, as is clearly shown in Table II. Each APC accumulator (there are $2^B$) has an $N$-bit input vector with a register output of length $\log_2 N$. The results from a processing block are vertically pipelined, where the calculation continues to the next block concatenating the generated median bit slices, as shown in Fig. 2; the median emerges every clock cycle after an initial latency of $N + 2L$ clock cycles in Fig. 2.
This result is consistent with the latest methods of O(1) time for calculating running medians [12] . In a generic case of K processing blocks (K = W/B if B bits are processed by each block), median M is found every clock cycle with a latency of N + KL clock cycles; L is a tuning design parameter for the speed of operation.
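The nullification by comparison used in the second block of Fig. 2 can be illustrated as follows. This sketch is a simplification and an assumption on our part: `processing_block` is a hypothetical name, the whole window is recounted on every call instead of being accumulated by APC circuits, and only the slice from the immediately preceding block is compared (which suffices for the two-block example).

```python
def processing_block(window, M_in, P_in, hi, lo, B_prev):
    """Block after the front end: counts only integers whose upstream slice equals M_in."""
    B = hi - lo + 1
    A = [0] * (1 << B)
    for x in window:
        # Decoder enable: x_j[hi+B_prev : hi+1] must match the upstream median slice.
        enabled = ((x >> (hi + 1)) & ((1 << B_prev) - 1)) == M_in
        if enabled:
            for i in range((x >> lo) & ((1 << B) - 1), 1 << B):
                A[i] += 1
    idx = next(i for i in range(1 << B) if A[i] >= P_in)
    P_out = P_in - (A[idx - 1] if idx else 0)
    return idx, P_out

window = [3, 1, 29, 21, 16, 9, 11, 19, 17]
# Block 1 of the example produced M[4:2] = 0b100 and P = 1 for Block 2.
m_low, _ = processing_block(window, M_in=0b100, P_in=1, hi=1, lo=0, B_prev=3)
print((0b100 << 2) | m_low)   # 16
```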
V. TIMING ANALYSIS AND IMPLEMENTATION

A. Timing Analysis
The critical path delay $T$ in Fig. 2 is essentially due to the APC accumulators of $l = \log_2 N$ bits each (in Fig. 1), and as the rightmost APC has $r = 2^B$ 1-bit inputs, $T = \log_2 2^B + \log_2 N$; thus, $T = B + \log_2 N$. A processing block in our previous method [6] had a critical path complexity of $3 \log_2 N + 6$ for $B = 2$; hence, the processing block in Fig. 1 is three times faster than our previous method [6] for any $B < 6$. The critical path of the work in [3], i.e., $T_{[3]}$, is the delay cost of the carry-save adder tree and is at least $\log_{1.5}(N/2) + \log_2 N$ to account for the final adder [13]. It follows that, for $B \le \log_{1.5}(N/2)$, a pipeline path here would be faster than a pipeline of the work in [3]. This is satisfied even for small values of $N$ such as $N = 3$ and $B = 1$, which implies that the circuit here is faster for all practical values of $N$ with a suitable choice of $B$. Interestingly, for $B = 1$, this brief is expected to be faster than the work in [3] for any $N$, which suggests that the architecture in [3] may adopt the concept of APCs for a hybrid architecture. For $B > 1$, this brief computes the median in $W/B$ processing blocks, whereas the work in [3] needs $W$ processing blocks. It seems convenient to keep $B$ as small as 2, 3, or 4. This favors the parallel decoding and the sign extension, as shown in Fig. 1. Remarkably, the final accumulator in this brief is still a ripple-carry adder. For the recent hardware sorting-based method in [4], the critical path goes through a chain of $N - 1$ logic OR gates and is therefore longer than the critical path in Figs. 1 and 2. These latest sorting methods have been proposed for area efficiency [4] or power [5].
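The delay expressions quoted in this analysis can be tabulated for a few window sizes. The sketch below simply evaluates the three formulas as real-valued expressions in adder-delay units; the function names are hypothetical, no rounding to whole gate levels is applied, and the small epsilon in `max_slice_bits` is an assumption that guards against floating-point error at exact boundaries such as $N = 3$.

```python
import math

def t_apc(N, B):
    """Critical path of a block in this brief: T = B + log2 N."""
    return B + math.log2(N)

def t_prev(N):
    """Critical path quoted for the previous method [6] with B = 2."""
    return 3 * math.log2(N) + 6

def t_csa_lower_bound(N):
    """Lower bound quoted for [3]: log_1.5(N/2) + log2 N."""
    return math.log(N / 2, 1.5) + math.log2(N)

def max_slice_bits(N):
    """Largest integer B satisfying B <= log_1.5(N/2)."""
    return math.floor(math.log(N / 2, 1.5) + 1e-9)

for N in (3, 9, 25, 49):
    print(N, t_apc(N, 2), t_prev(N), round(t_csa_lower_bound(N), 2), max_slice_bits(N))
```

For any $B$ up to `max_slice_bits(N)`, `t_apc(N, B)` does not exceed `t_csa_lower_bound(N)`, which is the condition $B \le \log_{1.5}(N/2)$ stated in the text.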
B. Hardware Implementation
In order to verify the architecture presented here, designs were expressed in the register-transfer level (RTL) Verilog hardware description language and functionally verified by simulation. From the RTL design of the processing blocks using APC registers (similar to Fig. 1), the circuit area and frequency of operation synthesis results are reported in Table III using the application-specific integrated circuit (ASIC) 0.25-μm technology of Taiwan Semiconductor Manufacturing Company Ltd. For a quick comparison, the results of the previous work in [6] are also included. Clearly, using APC registers improves the frequency of operation, as expected from the timing analysis. The front-end block offers an extra advantage in area, particularly when the input integers are of $W \le 12$ bits.

TABLE III: AREA AND FREQUENCY FOR THE MEDIAN ARCHITECTURE BLOCKS

TABLE IV: AREA AND FREQUENCY FOR A MEDIAN PARAMETERIZED BLOCK
In Table IV, it is seen that the area scales well with parameter value $N$. A front-end block makes it easier to produce designs for any parameter value $B$; however, this table suggests that it is preferable to keep parameter $B$ to 2 or 3 bits (or to make parameter $L > 2$ to increase the frequency). Obtaining a circuit for Fig. 2 is subject to implementation details at the RTL. We explored keeping accumulation $A_i$ coherent by fusing the decoder and sign extension in Fig. 1 into an ad hoc decoder (a lookup table) such that the critical path in the accumulation process is kept bounded by $\log_2 N$. In this case, the critical path could move into the logic toward the bottom of Fig. 1. This is the purpose of introducing the delay elements down the pipeline in Fig. 1. Table IV uses $L = 2$, and this sort of tuning is best evaluated under the specific technology used to target the architecture; thus, it is not discussed in full detail in this brief. Observe that the latency $N + KL$ paid by the architecture here is related to the number of blocks $K$ ($K = W/B$ when each block is of $B$-bit slices); $K$ remains small, even in common practical cases ($W \le 16$). The work in [3] has a latency of $W$ but requires all samples in parallel and thus needs $WN$ input wires, whereas this brief only needs $W$ input wires for the streaming pipelined operation. The works in [4] and [5] need $N$ processing blocks; therefore, there is the tradeoff of evaluating the overall area, the frequency of operation, and the latency for a specific application. Note that Tables III and IV do not report results for a complete median architecture.
C. Extensions
The extensions to handle signed integers and the case of rank filtering, as discussed in [6], remain valid. To keep this brief self-contained, we briefly discuss them here. In order to handle signed integers, count the number of negative and positive integers within a window as $C_0$ and $C_1$, respectively, so that $N = 2k + 1 = C_0 + C_1$. Set the median position $P$ supplied to the first block of the computation to $P = k + 1 - C_1$ if $C_0 > C_1$ or to $P = k + 1 - C_0$ otherwise. The method remains unmodified when applied to the remaining $W - 1$ bits of the input data set within the window. An order-$R$ filter for a set of $N$ data elements has $R$ data elements less than or equal to the output [2]. The median is a rank filter with $R = k$; therefore, this method to calculate the median behaves as a rank filter by setting the initial median position supplied to the first block of computation to $P = R + 1$ when accumulations $A_i$ proceed right to left, as performed here.
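The rank-filter extension amounts to changing only the initial position $P$. The sketch below illustrates this for nonnegative integers; it is a software model with a hypothetical function name `rank_select`, and it does not cover the signed-integer case.

```python
def rank_select(values, slice_widths, R):
    """Order-R output of the sliced method: start with P = R + 1 instead of the median position."""
    P = R + 1
    alive = [True] * len(values)
    out, lo = 0, sum(slice_widths)
    for B in slice_widths:
        lo -= B
        mask = (1 << B) - 1
        A = [0] * (1 << B)
        for x, ok in zip(values, alive):
            if ok:
                for i in range((x >> lo) & mask, 1 << B):
                    A[i] += 1
        idx = next(i for i in range(1 << B) if A[i] >= P)   # first A_i >= P
        if idx > 0:
            P -= A[idx - 1]
        out = (out << B) | idx
        alive = [ok and ((x >> lo) & mask) == idx for x, ok in zip(values, alive)]
    return out

window = [3, 1, 29, 21, 16, 9, 11, 19, 17]
print(rank_select(window, [3, 2], R=4))   # 16, the median (R = k = 4)
print(rank_select(window, [3, 2], R=0))   # 1, the minimum
print(rank_select(window, [3, 2], R=8))   # 29, the maximum
```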
VI. CONCLUSION
The method for calculating the median given here, which is fundamental in median filtering for noise reduction in high-quality imaging, makes faster decisions than previous hardware algorithms in the literature. The computation within each processing block is executed faster than before for any number of bits per block (design parameter $B$). The median on a set of $N$ integers completes after $K$ (typically $W/B$) processing blocks for a serial pipelined stream of $W$-bit integers with a latency of $N + KL$, with $L$ being a tuning pipeline parameter for speed. It is also shown that this result holds irrespective of the actual value of parameter $N$ or of any combination of $B$ and $N$. Full APC circuitry is required to extend the calculation of the median to a parallel approach that accepts more than one integer at a time in a streaming operation. The method is generic and follows a systematic number of steps from which different architectures and implementations can be derived. The method is also easily extended to be implemented as a fast programmed solution.
