This paper presents four novel area-efficient field-programmable gate-array (FPGA) bit-parallel architectures of finite impulse response (FIR) filters that support symmetric signal extension when processing finite-length signals at their boundaries. The key is the use of variable-depth shift registers, which are efficiently implemented in Xilinx FPGAs in the form of shift register logic (SRL) components. Comparisons with the conventional architecture of an FIR filter with symmetric boundary processing show considerable area savings, especially with long-tap filters. For instance, our implementation of the 8-tap low-pass Daubechies-8 FIR filter achieves a 30% reduction in area (in terms of slices) compared to the conventional architecture while maintaining the same throughput. Two of the four architectures are dedicated to the special case of symmetric FIR filters. The first is highly area-efficient but requires a clock frequency doubler. While this reduces the overall processing speed (by a factor of at most 2), it still maintains a high throughput. Moreover, this speed penalty is cancelled in bi-phase filters, which are widely used in multirate architectures (e.g., wavelets). Our second symmetric FIR filter architecture saves less logic than the first (e.g., 10% with the 9-tap low-pass Biorthogonal 9&7 symmetric filter instead of 37% with the first architecture) but overcomes its speed penalty, as it matches the throughput of the conventional architecture.
I. INTRODUCTION
Finite-impulse response (FIR) filters are widely used in digital signal processing. A -tap FIR filter is defined by the following input-output (I/O) equation [1] : (1) where are the filter coefficients. Field-programmable gate arrays (FPGAs) have proved a superior platform for FIR implementation compared with general-purpose and DSP processors as they provide much higher throughputs and I/O bandwidth [2] - [4] . Equally, thanks to dedicated hard logic, FPGAs are closing the performance gap with application-specific integrated circuits (ASICs) in this regard, with the added benefit of reprogrammability. Fig. 1 shows the two conventional hardware architectures (the direct and the inverse form) of a general -tap FIR filter implementation [5] . Both architectures seek to align the products in (1) in time before they are accumulated. In architecture (a), the chained input sample delays align the input samples in time before parallel multiplication and accumulation, whereas in architecture (b), the parallel multiplication is followed by a serial accumulation that aligns the products in time through the internal delays of the adder chain.
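The equivalence of the two conventional forms can be checked with a short behavioural model. The following is a minimal sketch, not the authors' implementation: the function names, the zero initial state, and the use of `h` and `x` for the coefficients and input samples are assumptions made for illustration.

```python
def fir_direct(x, h):
    """Direct form: delay the input samples, multiply in parallel, then sum."""
    taps = [0] * len(h)              # input-sample delay chain (zero initial state)
    y = []
    for sample in x:
        taps = [sample] + taps[:-1]  # shift the newest sample into the chain
        y.append(sum(c * t for c, t in zip(h, taps)))
    return y


def fir_inverse(x, h):
    """Inverse (transposed) form: multiply in parallel, then accumulate
    serially through an adder chain with one register between adders."""
    k = len(h)
    regs = [0] * (k - 1)             # registers inside the adder chain
    y = []
    for sample in x:
        prods = [c * sample for c in h]
        y.append(prods[0] + (regs[0] if k > 1 else 0))
        # Each register captures its product plus the previous register value.
        regs = [prods[i + 1] + (regs[i + 1] if i + 1 < k - 1 else 0)
                for i in range(k - 1)]
    return y
```

Both functions return the same sequence for the same input, which is the time-alignment property that the two structures of Fig. 1 achieve by different means.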
An FIR filter implements a convolution operation [6] , which is often built on the assumption of infinite-length signals, e.g., a continuous audio signal. Finite-length signals (e.g., images), on the other hand, have discontinuities at the boundaries, which raises the problem of which values to use in these regions. Although this problem could be ignored for a one-stage convolution, it cannot be ignored when implementing a multi-stage convolution, as in subband coding and filter banks, which are widely used in speech, image, and video processing standards and applications [5] , [7] . A commonly recommended solution to this problem (called symmetric extension) is to extend each row by reflection at the signal boundary [8] , [9] as shown in Fig. 2 .
With length-preserved filtering using a -tap general FIR filter, the minimum number of extra samples to be introduced is constant and equal to . This is because input samples are required to generate output samples. However, the number of samples to be added at the left border of the input signal (referred to by ) and the right one (referred to by ) can be variable, i.e., not a constant.
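As an illustration of length-preserved filtering with symmetric extension, the sketch below extends a finite signal by reflection and filters it so that the output has as many samples as the input. It is a software model only. The names `n_left` and `n_right`, the helper names, and the `repeat_edge` switch (which selects whether the boundary sample itself is duplicated) are assumptions, since the exact reflection convention of Fig. 2 is not recoverable from the text.

```python
def symmetric_extend(x, n_left, n_right, repeat_edge=False):
    """Extend x by reflection with n_left samples before and n_right after.

    repeat_edge=False mirrors about the boundary sample (x[1], x[2], ...);
    repeat_edge=True also repeats the boundary sample (x[0], x[1], ...).
    """
    off = 0 if repeat_edge else 1
    n = len(x)
    left = [x[i] for i in range(n_left - 1 + off, off - 1, -1)]
    right = [x[n - 1 - off - i] for i in range(n_right)]
    return left + list(x) + right


def filter_length_preserving(x, h, n_left, repeat_edge=False):
    """Filter x with a len(h)-tap FIR so that len(output) == len(x).

    The total number of added samples is len(h) - 1, split between the
    left side (n_left) and the right side (the remainder).
    """
    k = len(h)
    n_right = k - 1 - n_left
    ext = symmetric_extend(x, n_left, n_right, repeat_edge)
    return [sum(h[i] * ext[n + k - 1 - i] for i in range(k))
            for n in range(len(x))]
```

Changing `n_left` moves the symmetry-filtering axis while the total number of added samples stays fixed at the filter length minus one.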
To handle the problem of boundary filtering in hardware, Chakrabarti [10] proposed the use of a router (or switcher) to feed the appropriate data, in parallel, to the multipliers (see Fig. 3 ). The router (referred to as the Hard-Router) configuration is detailed in [10] . It is implemented using multiplexers, and a controller is needed to drive their selection signals. For a -tap FIR filter, a minimum of four-to-one lookup tables (LUTs) are required to implement the Hard-Router when the input signal is extended by samples at its left side [11] . This represents an hardware complexity. Consequently, the Hard-Router solution requires considerable area and routing resources, which inevitably degrades the speed performance.
To reduce the high area cost of the Hard-Router, we present in this paper novel hardware architectures for FIR filters tailored to Xilinx FPGAs. These architectures exploit the variable-length shift register logic (SRL) component [12] , [13] that each LUT can implement in a number of Xilinx FPGA series (see Fig. 4 ). These range from the early Xilinx Virtex series through the Virtex-II, Virtex-IIPro, Virtex-4, and Virtex-5 series and their low-cost equivalents (e.g., the Spartan-II, Spartan-IIE, and Spartan-3 series). Indeed, in all of these FPGAs, each slice LUT can be configured as a shift register (SRL). Its length is 32 bits in the 6-input LUTs of the Virtex-5 FPGA (referred to as SRL32), and 16 bits in the other, 4-input LUT FPGA families (referred to as SRL16). The SRL configuration consists of a chained delay with a multiplexer at the output. The input address Addr selects which bit in the chained delay is output, hence controlling the length of the shift register. Each SRL can be immediately pipelined by using the flip-flop (FF) available in the same slice logic cell (see Fig. 4 ). Longer shift registers can be implemented with multiple chained SRLs.
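The behaviour of an addressable shift register can be modelled in a few lines. This is a simplified sketch for illustration, not a model of a specific Xilinx primitive: the class name `SRL`, its methods, and the convention that address 0 returns the most recently shifted-in value are assumptions.

```python
class SRL:
    """Behavioural model of a variable-length shift register.

    A chain of `depth` cells shifts on every clock. The address selects
    which cell is read, so the delay can change from cycle to cycle
    without touching the stored data.
    """

    def __init__(self, depth=16):
        self.depth = depth
        self.cells = [0] * depth

    def clock(self, data_in):
        """Shift one position and store data_in in the first cell."""
        self.cells = [data_in] + self.cells[:-1]

    def read(self, addr):
        """Return the value shifted in `addr` clock edges before the latest one."""
        if not 0 <= addr < self.depth:
            raise ValueError("address exceeds SRL depth; chain more SRLs")
        return self.cells[addr]
```

For example, after shifting in the values 0 to 9, `read(3)` returns 6 and `read(0)` returns 9. Changing the address between reads is what the dynamic skewing described in this paper relies on.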
In the remainder of this paper, we first present our novel hardware architecture for a general FIR filter regardless of the signal boundaries extension. We then show how our architecture can be efficiently used to handle symmetric signal extension with little hardware penalty. A detailed scheduling algorithm is presented. We then compare the area requirement of our architectures to that of the conventional "Hard-Router" based FIR architectures. A case study showing real timing and area measurements is included. In Section III, we tailor the results given in [14] to the special case of symmetric FIR filters. Two novel architectures are presented. Similarly, area cost evaluation and a dedicated case study are provided. Finally, conclusions are drawn. It is worth noting that this paper is an enhancement and completion of previous work published by the authors in [11] , [14] , and [15] . This includes a novel architecture for symmetric FIR filters which overcomes previous speed penalty reported in [11] as well as a detailed description of the algorithm used in symmetric signal boundary handling for both general FIR filtering with adder chain accumulator structure and symmetric FIR filters.
Throughout the rest of this paper, we assume the use of bit parallel arithmetic. The term SRL corresponds to either one SRL16/SRL32 component or chained SRL16/SRL32 components if needed. The abbreviation "cc" denotes the term clock cycle and the term refers to the SRL associated with the filter coefficient .
II. GENERAL FIR FILTER
This section presents our novel hardware architecture of a general -tap FIR filter. It explains how symmetric boundary signal processing is smartly handled. The area efficiency of our novel architecture is demonstrated through a comparison of the logic area requirements of our architecture compared to the conventional architecture. A case study is included at the end to quantify the area savings as well as speed performance. The input data samples in the architecture are first multiplied in parallel with the filter coefficients. Then, the SRLs skew and align the multiplication results properly in time as shown in Fig. 6 . The skewed products are then summed up to produce the filter outputs.
A. Novel Architecture of a General -tap FIR Filter
Note that the SRLs in Fig. 5 can be placed before or after the multipliers, thus skewing either the multiplicands or the products. If the filter coefficients are fractional and truncation is carried out at the output of the multipliers, the wordlengths at the outputs of the multipliers could be smaller than at their inputs. Thus, for area optimisation, the SRLs should be placed at the multipliers' outputs. On the other hand, if the filter coefficient values are greater than 1, the wordlengths at the outputs of the multipliers will be greater than at their inputs. In that case, placing the SRLs before the multipliers saves more logic. As a consequence, our architecture of Fig. 5 can have the following four varieties:
1) adder tree with pre-multiplier SRLs;
2) adder chain with pre-multiplier SRLs;
3) adder tree with post-multiplier SRLs;
4) adder chain with post-multiplier SRLs.
In the following, we consider only cases 3 and 4. The same analysis and results remain valid for cases 1 and 2, where the raw input samples instead of the products are skewed. In fact, skewing a product:
by clock cycles (cc) is equivalent to skewing by the same number of clock cycles.
1) Filtering With No Boundary Extension:
When the accumulator structure is an adder chain, each product is delayed by only one clock cycle (cc) through the relevant SRL [see Fig. 6(a)]. In fact, with the exception of the SRL layer, the architecture of our filter is similar to that of Fig. 1(b) , which aims to supply the products to the adder chain as soon as they are computed. The products are aligned in time through the internal delays of the adder chain structure and not necessarily through the SRL layer. The latter only increases the FIR latency. In contrast, if the accumulator structure is an adder tree, it is up to the added SRL layer to skew the products and align them properly in time before parallel accumulation, as shown in Fig. 6(b). This is because, unlike the structure of Fig. 1(a) , the structure of Fig. 5 does not include an input sample delay chain. Fig. 6 is used to deduce each SRL skew length, given in Table I . With an adder chain accumulator structure, the SRLs' skew lengths are all equal to one [see Fig. 6(a)]; the skew lengths can actually have any value, and a value greater than one only increases the filter's latency. With an adder tree structure, in contrast, each SRL skew length is equal to the time interval between the SRL product instant and the filter output computation instant [see Fig. 6(b)].
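The role of the SRL layer in the adder tree variant can be illustrated with a behavioural sketch in which every product is delayed by its own skew and all skewed products are then summed in parallel. The skew values of Table I are not reproduced here. The sketch assumes a skew of `i` cycles for the product of coefficient `h[i]`, which is the smallest set of delays that aligns the products, and the function names are hypothetical.

```python
def delay(seq, d):
    """Delay a sequence by d cycles, padding with zeros and keeping its length."""
    return ([0] * d + list(seq))[:len(seq)]


def fir_skewed_products(x, h):
    """Multiply every input sample by all coefficients in parallel,
    skew each product stream, then sum the aligned streams (adder tree)."""
    skewed = [delay([c * s for s in x], i) for i, c in enumerate(h)]
    return [sum(column) for column in zip(*skewed)]
```

No input-sample delay chain appears in this model: all of the time alignment is carried by the per-product delays, which is the function the SRL layer performs in Fig. 5.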
Besides being able to implement a convolution operation, our architecture seeks primarily to handle signal boundaries processing. This is handled efficiently with very little hardware overhead, thanks to the SRLs dynamic skewing feature as shown in Section III.
Throughout, the term refers to the FIR data dependency graph (DDG) product associated with the multiplier . In particular, refers to the first (last) FIR DDG's node product.
2) Filtering With Symmetric Boundary Extension: To implement in hardware filtering with symmetric boundary extension, no alteration on the architecture of Fig. 5 is necessary. The boundary filtering is simply implemented by a proper skewing of the products through a dynamic SRL addressing. Fig. 7 shows for instance a 4-tap general FIR filter DDG at two consecutive sequences boundary region (the deduction of such graph will be detailed later in the section). In this figure, refers to the current sequence of input samples to be filtered, the negative values the instants which precede the start of this sequence, the shaded rectangle shows the boundary region between the two consecutive Sequences and , and the dashed arrows show where the input symmetric extension is applied. The boundary DDGs of (shown in dashed lines) represent the possible DDGs at the left-hand side signal boundary, whereas the ones associated with represent the possible DDGs at the signal right-hand side boundary. We say possible as this depends on the location of the symmetry-filtering axis which is determined by the number of input samples introduced at the left-hand side of the input signal and the number of input samples introduced at the signal right-hand side (see Section I). From Fig. 7 and as a general rule we can see that the regular DDG (a straight line as depicted in Fig. 6 ) of a general -tap FIR filter ends at the of instant " " ( in Fig. 7 ), and starts from the of time " " ( in Fig. 7 ). Between these two values, the DDGs become irregular. Because of the DDG irregularity, the SRLs skew lengths given in Table I are no longer valid.
To find the required SRLs' skew lengths, and subsequently the required addresses, according to the filter length and the symmetry-filtering axis, we suggest in the following a dedicated approach. In this paper, we consider only the filter with adder chain accumulator structure. Details regarding the adder tree accumulator case can be found in [14] .
Throughout, the term Hub refers to the irregular DDG deflection point. The term denotes the Hub associated product whereas represents the th product that comes before (after) the Hub where is a positive integer.
• An approach to determine the SRLs' skews lengths when using the symmetric signal extension. When using an adder chain accumulator structure and as already shown in Fig. 6(a) , the relevant products for each filter output should be fed to the accumulator regularly in time such that the product is fed to the th adder cc's before feeding the product to the th adder, where . Based on this rule (referred to as rule 1), the required new SRLs' skew lengths when handling boundary filtering using symmetric extension can be deduced. This is achieved by following the next two steps.
Left Side Signal Boundary Extension: Fig. 8 shows the DDG at the left side boundary of the input when input samples are introduced through symmetry reflection (see the arced arrows). This was necessary since the multiplicands at this boundary region correspond to the previous input samples (see the dashed line). The resulting boundary DDG becomes irregular, i.e., not a straight line.
Since the input signal is extended at its left side by the first samples of , we need to wait for cc's before starting to accumulate the necessary products of the first output. The first partial accumulation is initiated one cc after computing the product which coincides with instant (see the dashed arrow in Fig. 8 ). This corresponds to a delay of one cc. According to rule 1, the product should feed the adder chain one cc after feeding to the adder chain. As such, the product of instant is skewed by three (and not just one!) cc's delay (see Fig. 8 ). This represents a 2 cc's increment to the regular skew length [see Fig. 6(a) ]. This compensation is necessary since the product in the irregular DDG is computed one cc earlier than instead of one cc later.
By following the same reasoning and because of the DDG linearity, we can easily deduce that each needs to be delayed by 2 cc's more than its predecessor . Therefore, the skew length is equal to cc's. We refer to this value by the term Piv, i.e.,
. This skew value should be applied on all the products since they are, along with , regularly and adequately separated in time, i.e., is produced cc's after (see Figs. 6(a) and 8). For instance, according to rule 1, should be fed into the adder chain one cc after . Since is computed one cc after (see Fig. 8 ), should be skewed by the same skew length, i.e., Piv cc's. From what precedes, we can conclude that if a finite length signal is extended by samples at its left side, the new DDG's products skews for the first filter output can be represented by a skew list such that: , where . The underlined element in the list refers to the skew length of the hub's SRL. The skew values in the above list are applied subsequently on the filter's SRLs from left to right, regularly delayed in time by one cc (see Fig. 8 ).
The irregularity of the filter's DDGs does not occur only with the first filter output of , but rather with all the first outputs. Consequently, we need to determine the remaining SRLs skew lists associated with the next " " outputs. From Fig. 9 , we can see that the computations of the products of the second filter output are done one cc earlier than with the first filter output. For instance, the product of the second filter output is computed at instant " ", one cc earlier than with the first filter output. Inversely, the computations of the products of the second filter output are done one cc later than with the first filter output.
Since the accumulation of each of the filter output products should be initiated at each new cc, the product of the second filter output should be fed to the adder chain one cc after feeding the accumulator with the of the first filter output. This corresponds to the instant " " (see Fig. 9 ). Therefore, the required skew length for the second filter output is equal to 3 cc's delay (see Fig. 9 ). The remaining SRLs skew lengths of this second filter output can be deduced as already explained with the first filter output since the associated DDGs have the same shape. The reader can easily verify that the skew delay list associated with the second output is The above list can be interpreted as the left shift of the list where the right side is filled with the value Piv. This rule can be extrapolated to the remaining boundary filter outputs skew lists. As a consequence, all the DDGs hubs are delayed by the same number of clock cycles (Piv) since for each new irregular DDG, the hub moves one position backward (see Fig. 9 ), as does its associated skew list index. Therefore, we can conclude that the skew list for the last irregular output ( of instant 1, see Fig. 9 ) is
The first output with a regular DDG corresponds to of instant 0 (see Fig. 9 ). It should be delayed by Piv cc's as it has to be fed to the adder chain one cc after of instant 1 (see Fig. 9 ). Logically, this value is applied equally to the rest of the regular DDG's products. We refer to the resulting list by the regular skew list RegSkew, where This list remains valid through the entire non-boundary regions.
Therefore, when the input signal is extended by symmetry at its left side by samples, the required SRLs' skew lengths at the boundary region can be represented with a Left_Matrix matrix expression where refers to the identity matrix of size , and is of size , such that can be shown in the equation at the bottom of the page.
Fig. 10. Irregular DDG of a general FIR filter at the right side signal boundary when using symmetric extension.
The underlined elements in this matrix refer to the skew length of the hub's SRL. All the matrix elements put in bold refer to the pre-hub SRLs skew lengths where irregularity is occurring. They form an upper triangular matrix.
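The construction of the left-boundary skew lists can be sketched in software. The following is a reading of the rules stated above, not the authors' matrix expression: it assumes that the sample with index t reaches the multipliers at instant t, that the reflection does not repeat the boundary sample, that the first pre-hub skew is 1 cc, and therefore that Piv equals twice the number of left-extension samples plus one. The names `k`, `n_left`, `left_skew_rows`, and `feed_instants` are hypothetical.

```python
def left_skew_rows(k, n_left):
    """Return (piv, rows): one skew list per irregular output at the left
    boundary of a k-tap filter extended by n_left reflected samples."""
    piv = 2 * n_left + 1
    rows = []
    for j in range(n_left):            # j-th irregular output
        hub = n_left - j               # hub moves one position back each output
        rows.append([piv - 2 * (hub - i) if i < hub else piv
                     for i in range(k)])
    return piv, rows


def feed_instants(k, n_left, j, skews):
    """Instants at which the products of output j reach the adder chain.

    Product i of output j uses the input sample of index |n_left - j - i|,
    which is computed at that same instant and then delayed by skews[i].
    """
    return [abs(n_left - j - i) + skews[i] for i in range(k)]
```

For a 4-tap filter with two reflected samples, this gives Piv = 5 and the rows `[1, 3, 5, 5]` and `[3, 5, 5, 5]`, the second being the first shifted left and filled with Piv. The resulting feed instants are consecutive within each output and start one cc later for each new output, as rule 1 requires.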
Right side signal boundary extension: Fig. 10 shows the DDG at the right side signal boundary where samples are introduced through symmetric reflection (see the arced arrows). This was needed since the multiplicands at this boundary region correspond to samples from samples (see the dashed line). The resulting boundary DDG is irregular, i.e., not a straight line.
As already explained in the last section, all the regular DDGs products will be delayed by Piv cc's. This is portrayed through the product skew length in the last regular DDG in Fig. 11 . The next is supplied to the adder chain at instant " " cc, and corresponds to the first irregular DDG. The hub of the latter should be fed into the adder chain one cc earlier according to rule 1, i.e., at instant Piv. Since this hub product is computed at instant 0 (see Fig. 11 ), the skew length of the irregular DDG hub is equal to Piv, which is equal to the skew length of all its precedents' DDG hubs (see section a).
By following the same reasoning on the subsequent irregular DDGs in Fig. 11 , we can easily deduce that the required skew length of the remaining irregular DDGs' hubs is also equal to Piv cc's. This result will be used to deduce the skew delays of the remaining irregular DDGs products.
According to Fig. 11 , all products are separated in time as required by rule 1. Therefore, they need to be skewed with the same skew length value, i.e., Piv. In contrast, the skew lengths of the post-hub products need to be updated, as rule 1 is broken.
If the signal is extended by samples from the right, the product has to be skewed by cc's delay (see Fig. 11 ). This is because product should be supplied to the adder chain one cc after . However is computed one cc earlier (see Fig. 11 ). To cope with this irregularity, 2 cc's delay should be then added to the skew delay, and that is where the value comes from. The same reasoning once applied on the subsequent products shows that the skew length of should be 2 cc's greater than of . Consequently, when symmetrically extending the signal by samples from its right side, the skew lengths of can be grouped into a skew list such that
This list values are applied from left to right on the filter's elements. Since the value of ranges from 1 to (see Section I), the skew lengths associated with the irregular DDGs at the right signal boundary can be represented with a Right-Matrix matrix expression where: refers to the identity matrix of size , and is of size such that can be shown in the equation at the bottom of the page.
The underlined elements in this matrix refer to the hubs' SRLs skew lengths. All the matrix elements put in bold refer to the post-hub SRL's skew lengths. They form a lower triangular matrix.
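The right-boundary skew lists can be sketched in the same way. Again, this is a reading of the rules stated above rather than the authors' matrix expression: pre-hub and hub products keep the skew Piv, and each post-hub product is skewed by 2 cc's more than its predecessor. The sketch assumes that the sample with index t reaches the multipliers at instant t and that the reflection does not repeat the boundary sample. The names `k`, `n`, `n_right`, and both function names are hypothetical.

```python
def right_skew_rows(k, n_right, piv):
    """One skew list per irregular output at the right boundary of a
    k-tap filter extended by n_right reflected samples."""
    rows = []
    for j in range(1, n_right + 1):    # j post-hub products in this output
        hub = k - 1 - j
        rows.append([piv if i <= hub else piv + 2 * (i - hub)
                     for i in range(k)])
    return rows


def right_feed_instants(k, n, j, skews):
    """Instants at which the products of the j-th irregular output of an
    n-sample signal reach the adder chain."""
    hub = k - 1 - j
    start = n - k + j                  # index of the first sample used
    samples = [start + i if i <= hub else (n - 1) - (i - hub)
               for i in range(k)]
    return [s + d for s, d in zip(samples, skews)]
```

For a 4-tap filter with Piv = 5 and two reflected samples, the rows are `[5, 5, 5, 7]` and `[5, 5, 7, 9]`, and the feed instants of each output are again consecutive.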
The previous matrix, along with the RegSkew list and Left_Matrix, determines the required skew lengths of the SRLs in the architecture of Fig. 5 with an adder chain accumulator structure when extending the input signal boundaries by symmetry. The expression of these matrices for the case of an adder tree accumulator structure can be found in [14] .
B. Area Comparisons
In the following, we compare the area cost of Figs. 3 and 5 architectures. Table II lists the FPGA logic resources used by these two architectures. It is worth noting that when using Xilinx FPGAs, the internal delays of the adder chain are implemented using the "free" flip flops available in the slices. The LUTs of those slices will be used to implement the combinatorial adders of the adder chain. Thus the cost of the delays in the adder chain can be considered null if the cost of adders has been already considered as it is done in Table II .
From Table II , we can see clearly that our architecture with its two varieties (adder chain and adder tree accumulator structure) does not necessitate input sample delays, thus saving parallel word delays, and consumes the same number of multipliers and nearly the same number of adders as the architecture of Fig. 3 . Indeed, our architecture uses either a -input adder chain or adder tree. Those two types have been grouped in Table II under the term Accumulator . It is well known that both of these accumulator structures consume nearly the same number of (equivalent) adders, i.e., . The router's logic resource requirements given in Table II now need to be inspected carefully. Table II gives the number of LUTs consumed by the hard router as explained in Section I. With our architecture, the routing functionality is implemented through the SRL layer. If the required SRL skew length is less than or equal to its depth, one SRL16/SRL32 (one LUT) can be used. However, if more depth is needed, more SRL16/SRL32 components must be chained, thus increasing the required number of LUTs. Therefore, the Left_Matrix, Right_Matrix, and RegSkew expressions given above are used to determine the number of SRL LUTs per multiplier. This is equal to , where DL represents the maximum skew length value in a matrix column and DS the SRL16/SRL32 depth, i.e., 16 or 32. If we omit the area cost of the SRLs' address generators (which can be considered negligible in comparison with the final filter area), our SRL layer area cost will depend solely on the number of SRLs used. Fig. 12 plots the router area cost evolution for a -tap filter in 4-input LUT FPGAs (similar graph trends are obtained with 6-input LUT FPGAs) [15] . It shows clearly that the hard router consumes much more area than our suggested architecture. The SRL layer cost has an inverted bell shape when using an adder tree accumulator structure. It is minimal for values equal to and . On the other hand, the SRL layer cost has a staircase shape when using the adder chain. It consumes less logic than with the adder tree structure only for values much smaller than (see Fig. 12 ). From the above, we can conclude that our novel architecture consumes fewer logic resources than the conventional architecture of Fig. 3 . This is shown clearly in Fig. 13 for 4-input LUT FPGA architectures (similar graph trends have been obtained with 6-input LUT FPGAs). This figure does not include the cost of the multipliers, as it is the same in all the architectures, and allocates the same number of LUTs to a bit parallel delay and a W-bit parallel adder. This is valid thanks to the dedicated fast carry logic in Xilinx FPGAs [12] , [13] . Fig. 13 shows how different values favor the adder tree or adder chain accumulator structure. It is worth noting that this graph has been normalized in terms of the input wordlength . Thus the real difference in LUTs between the architectures should be multiplied by , which will favor our architectures even further.
Fig. 12. Normalized area cost comparison between the soft and hard router in a 4-input LUT FPGA for K > 9. For K ≤ 9, the SRL layer area cost is reduced to a constant value much smaller than the hard router area cost.
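The LUT count per multiplier can be estimated with a one-line helper. The exact expression is not legible in the text above, so this sketch assumes it is the ceiling of DL/DS, which matches the chaining rule just described (one SRL while the skew fits in the SRL depth, one more for each additional depth's worth of delay). The function names are hypothetical, and the count is per bit of the skewed word.

```python
def srl_luts_per_multiplier(max_skew, srl_depth=16):
    """Number of chained SRLs (LUTs) needed to reach max_skew cycles,
    assuming ceil(DL / DS) with DL = max_skew and DS = srl_depth."""
    if max_skew <= 0:
        return 0
    return -(-max_skew // srl_depth)   # integer ceiling division


def srl_layer_luts(max_skews, srl_depth=16):
    """Total SRL count for a list holding the maximum skew of each SRL column."""
    return sum(srl_luts_per_multiplier(s, srl_depth) for s in max_skews)
```

With a depth of 16, a maximum skew of 16 still fits in one SRL, while a skew of 17 needs two chained SRLs.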
C. Timing and Area Measurement: Case Study
In this case study, we present the real hardware implementation results of the standard low filter Daubechies-8 wavelet (8 taps) [7] on Xilinx Virtex-E FPGAs using our architectures and the architecture of Fig. 3 . The filter has been implemented using bit parallel arithmetic with word level pipelining seeking therefore the maximum speed. To favour the architecture of Fig. 3 , the value of is set to 3, a value for which the area cost of the Hard-Router is minimal (see Fig. 12 ). The adders and multipliers were implemented using the dedicated carry logic of the FPGA slices. Since the filter coefficients are constant, we used a canonic signed digit (CSD) representation based approach for the multiplier design [16] .
The implementation of the SRLs' address generators is straightforward. Each SRL addresses can be subdivided into two sets which correspond to the following.
• Boundary Region: This defines a set of values per SRL corresponding to the Left_Matrix and Right_Matrix rows. These values are stored in the slices' distributed RAMs.
• Non-Boundary Region: Where the SRL address is constant corresponding to a RegSkew value. A counter line and multiplexer are needed to switch between the two sets of values.
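The two address sets above can be modelled as a small lookup driven by an output counter. This is a behavioural sketch only: it indexes by output number rather than by the actual clock cycle, it returns skew lengths rather than SRL address codes, and the function name and example values are hypothetical.

```python
def skew_schedule(n_outputs, left_values, reg_skew, right_values):
    """Skew applied by one SRL for each of n_outputs filter outputs.

    left_values / right_values play the role of the stored boundary tables
    for this SRL; reg_skew is the constant non-boundary value. The counter
    position selects which source drives the SRL (the multiplexer role).
    """
    schedule = []
    first_right = n_outputs - len(right_values)
    for n in range(n_outputs):         # n is the counter value
        if n < len(left_values):
            schedule.append(left_values[n])
        elif n >= first_right:
            schedule.append(right_values[n - first_right])
        else:
            schedule.append(reg_skew)
    return schedule
```

For example, `skew_schedule(6, [1, 3], 5, [7])` gives `[1, 3, 5, 5, 5, 7]`: two stored left-boundary values, the constant value in the middle, and one stored right-boundary value.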
Tables III and IV give the achieved results with 9-bit input word length, 8-bit coded coefficients and a 2-bit intermediate and final precision results. The design has been captured in structured VHDL and synthesised using Xilinx ISE software. Table III covers the implementation with no boundary extension, for which the architectures of Fig. 1 are used as well as ours (see Fig. 5 ). We can see that the implementation using the architecture of Fig. 1(a) delivers higher speed but requires more area than the inverse form architecture of Fig. 1(b) . This is expected, as the architecture of Fig. 1(a) uses input sample delays, which increase the filter area but also its speed, since it does not require long routing lines to feed the multipliers. Our architecture with no boundary processing delivers the same speed as the conventional inverse FIR architecture. However, it consumes more area because of the SRL layer. Table IV shows the performance achieved when implementing the same filter using symmetric boundary extension. The architectures of Fig. 3 are used as well as ours (Fig. 5 ). The first row of Tables III and IV shows the effect of the Hard-Router on the architecture of Fig. 1(a) . It implies the use of 78 extra slices and leads to a 6 MHz speed penalty. On the other hand, by comparing the performance of our architecture with and without boundary processing, we can see clearly that the dynamic skewing of the SRL layer introduces a slight area penalty (12 slices) but no speed penalty.
As stated in Section I, our architecture has been developed mainly to handle the boundary filtering more efficiently than the architecture of Fig. 3 . Table IV confirms the superiority of our architecture as almost 70 slices have been saved while maintaining nearly the same throughput.
III. LINEAR PHASE FIR FILTERS
In this section, we suggest novel architectures to handle the problem of boundary processing when using linear phase FIR filters. These are frequently used in digital signal processing since they do not distort the input signal phase [9] .
To obtain linear phase filters, symmetry relationship is imposed on the filter coefficients such as ( is the filter length) The FIR filter is called symmetric (anti-symmetric) if it satisfies the first (second) above condition. Without loss of generality, we focus in the following only on the odd-length symmetric FIR filters, i.e., the filter length is odd. The extension of the results to the remaining categories is straightforward.
When implementing linear phase FIR filters in hardware [5] , the symmetric coefficients property is often exploited to halve the number of multipliers used (see Fig. 14) . Fig. 14 shows two conventional -tap symmetric FIR filter architectures.
A generic taps symmetric filter output satisfies (2) where are the filter coefficients. As with the general FIR filter, Chakrabarti [10] proposes the use of a hard-router to handle the signal boundary filtering (see Fig. 15 ). To avoid mainly the high area cost of the hard router implementation (see Section I), we suggest in Section III-A novel architectures.
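The multiplier saving from coefficient symmetry can be illustrated with a short sketch. It uses the standard folded structure for an odd-length symmetric filter, in which the two input samples that share a coefficient are added before multiplication. This is not a model of the architectures of Fig. 14, and the function name and zero initial state are assumptions.

```python
def fir_symmetric_folded(x, h):
    """Odd-length symmetric FIR (h[i] == h[-1 - i]) using (len(h) + 1) // 2
    multiplications per output instead of len(h)."""
    k = len(h)
    if k % 2 == 0 or list(h) != list(h)[::-1]:
        raise ValueError("expects an odd-length symmetric coefficient list")
    m = k // 2                          # index of the centre coefficient
    taps = [0] * k                      # input delay chain, zero initial state
    y = []
    for sample in x:
        taps = [sample] + taps[:-1]
        acc = h[m] * taps[m]            # centre tap has no partner
        for i in range(m):
            # Pre-add the two samples sharing coefficient h[i], multiply once.
            acc += h[i] * (taps[i] + taps[k - 1 - i])
        y.append(acc)
    return y
```

Feeding a unit impulse returns the coefficients themselves, which confirms that folding does not change the filter response.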
A. Novel Architectures of Linear Phase FIR Filters
Equation (2) can be rewritten as (3) (4)
Fig. 16. Two variants of our novel multi-clocked symmetric FIR architecture.
In Section III-A1, we explain how (4) is implemented using our novel architectures. Two main architectures are suggested. The first architecture (multi-clocked) is considerably more areaefficient but requires a clock frequency doubler, which could nearly halve the filter throughput. Whereas the second architecture is a single clocked architecture allowing a much higher throughput but with replicated logic. In the following, the multiclocked architecture (being the more complex) implementation is thoroughly detailed. The results are then easily extended to the single clocked architecture case.
1) Filtering With No Boundary Processing-Multi-Clocked Architecture: Fig. 16 shows two variants of our novel multi-clocked symmetric FIR architecture.
Depending on the multipliers' output versus input wordlength, and as explained in Section II-A, one of the architectures of Fig. 16 consumes less hardware. However, since more logic (the multipliers) is clocked at double the master clock frequency in architecture (b) than in architecture (a), the throughput of architecture (b) is expected to be lower than that of architecture (a). Moreover, apart from skewing the input samples instead of the products, the functionality of the architecture of Fig. 16(b) is highly similar to that of Fig. 16(a) . To avoid redundancy, we consider only the architecture of Fig. 16(a) . In the latter, the input data samples are first multiplied in parallel with the filter coefficients. Then, the SRLs skew the products to align them properly in time before parallel accumulation (see Fig. 17 ). The adder tree generates alternately the sums and of (4). A dedicated SRL (referred to as the sink SRL) is placed at the output of the adder tree to align in time with before addition to generate the filter output.
However, since the computation of and can be initiated at the same cycle (see Fig. 18) and since Xilinx FPGA slices accept only one clock polarity (either rising edge or falling edge), the logic involved in the and computation should be clocked at (at least) double the input clock frequency. The FPGA on-chip DLL/DCM components can be used to generate this clock frequency [12], [13]. As such, the first half of the master clock cycle generates whereas the sum is generated in the second half of the clock cycle. In Fig. 16, we show the required clocking at each node of the architecture. Upward and downward arrows refer to the rising and the falling edge of the clock, respectively. Clk is the input clock (master clock), whereas Clkx2 runs at double the Clk frequency. In both architectures, and with the exception of the sink SRL, the address generators of the SRLs are clocked by Clkx2 whereas the SRL inputs are clocked by the master clock Clk. The sink SRL is clocked at the falling edge of the master clock to generate the filter outputs at the rising edge of the master clock, as illustrated in Fig. 19.
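The time-sharing of a single adder tree over the two halves of a master clock cycle can be sketched behaviorally as follows. This is a simplified model that ignores pipeline latency and clock-edge details; the function name, and the choice of which partial sum occupies which half-cycle, are assumptions made for illustration.

```python
def interleaved_adder_tree(first_half_operands, second_half_operands):
    """One shared adder tree evaluated on every Clkx2 tick.

    first_half_operands[n] and second_half_operands[n] are the operand lists
    presented to the tree in the first and second half of master cycle n.
    Returns one filter output per master cycle.
    """
    outputs = []
    sink = 0  # stands in for the sink register holding the first partial sum
    n_cycles = len(first_half_operands)
    for tick in range(2 * n_cycles):  # Clkx2 ticks
        n, half = divmod(tick, 2)
        if half == 0:
            # First half-cycle: the tree produces the first partial sum.
            sink = sum(first_half_operands[n])
        else:
            # Second half-cycle: the tree produces the second partial sum,
            # which is added to the stored one to form the output.
            outputs.append(sink + sum(second_half_operands[n]))
    return outputs


print(interleaved_adder_tree([[1, 2], [3, 4]], [[10, 20], [30, 40]]))
```

The model makes the cost visible: the tree is evaluated twice per master cycle, which is why its logic must meet timing at the doubled clock rate.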
In order to ensure the synchronization shown in Fig. 19 at the sink SRL input, the delay in cycles of the logic that follows the SRL layer has to be considered to identify at which master clock edge the operands of and should be available. If this delay is even, the operands of and should be available at the rising and the falling clock edge, respectively. In contrast, if this delay is odd, the operands of and should be available at the falling and the rising clock edge, respectively. Although the control for our architecture looks more complex than that of the conventional architectures of Fig. 14, it is actually easily parameterized. Indeed, Fig. 20 shows the product skew flow graph for our generic taps symmetric FIR filter. Table V gives the SRLs' skew lengths deduced from Fig. 20. Each SRL skew length, being related to or is equal to the interval of time between the SRL product instant and or computation instant. The variable in Table V is equal to the following:
• 0, if the delay in cycles of the logic that follows the SRL layer is even;
• 0.5, if the delay in cycles of the logic that follows the SRL layer is odd.
By following the same geometrical interpretation, we can easily conclude that the skew length of the sink SRL is constant and equal to .
The SRLs' addresses can be easily implemented. According to Table V, each SRL address bit either has a constant value (0 or 1) or toggles between 0 and 1, in which case a simple toggle flip-flop can be used. It is worth noting that for an odd-length symmetric filter, the SRL linked to the tap " " is actually a simple multiplexer, which outputs its input data and a zero value alternately.
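The addressing scheme can be illustrated with a small behavioral model of an addressable shift register whose least significant address bit is driven by a toggle flip-flop. The class names, the helper `run_demo` and the address pattern (2 and 3) are hypothetical; they are not taken from Table V. Data are shifted once per master cycle while the address changes on every Clkx2 tick, as described above.

```python
class SRL16:
    """Behavioral model of a 16-stage shift register with an addressable tap."""

    def __init__(self):
        self.stages = [0] * 16

    def shift(self, value):
        # New data enters stage 0; the oldest stage is discarded.
        self.stages = [value] + self.stages[:-1]

    def read(self, addr):
        # Returns the value shifted in `addr` shifts before the latest one.
        return self.stages[addr]


class ToggleFlipFlop:
    def __init__(self, state=0):
        self.state = state

    def tick(self):
        self.state ^= 1
        return self.state


def run_demo(samples):
    srl = SRL16()
    toggle = ToggleFlipFlop(state=1)  # first tick then yields 0
    reads = []
    for sample in samples:
        srl.shift(sample)             # master clock edge
        for _ in range(2):            # two Clkx2 ticks per master cycle
            addr = (1 << 1) | toggle.tick()  # bit 1 constant, bit 0 toggling
            reads.append(srl.read(addr))
    return reads


print(run_demo([10, 11, 12, 13, 14]))
```

Each master cycle thus yields two differently delayed copies of the same stream from a single shift register, one per half-cycle.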
• Single-clocked Architecture.
Using the architecture of Fig. 16, a clock frequency doubler was necessary to interleave correctly the computation of the partial sums and since those have to be computed at each master clock cycle through a single adder tree structure. This required careful synchronization, as shown in Figs. 19 and 20. However, the use of a clock doubler decreases the filter throughput, possibly by half. To overcome this shortcoming, we propose the architecture of Fig. 21, which allocates a dedicated SRL layer and an adder tree or adder chain to each partial . This architecture can be seen as a combination of two parallel filters having the architecture of Fig. 5 with shared multipliers. Table VI gives the necessary SRL skew lengths with the adder tree accumulator type. With the adder chain, the SRL skew length is constant and equal to one (see Section II-A1). The reader can verify that the sink SRL skew length is equal to .
Although they can be used to implement standard convolution, the architectures of Figs. 16 and 21 seek primarily to handle the convolution operation with boundary signal processing. Section III-A2 details how the above architectures are used to achieve this goal. Throughout, the architecture of Fig. 16(a) is first considered. The results are then extended to the architecture of Fig. 21 with adder tree accumulator type only (to avoid redundancy). In the following, this notation is used.
• Pivot: The point of the DDG graph reflection (see axis in Fig. 17 ).
• Left_Line: The DDG's line associated with computation.
• Right_Line: The DDG's line associated with computation.
2) Filtering With Symmetric Boundary Extension-Multi-Clocked Architecture:
To handle the boundary processing more efficiently than the approach of Fig. 15, we slightly modify our architecture of Fig. 16 by using dynamic SRL addressing, as explained in the following.
Throughout, we assume the number of samples introduced at the left side and the right side of the input signal to be equal to (i.e., ), as is usually done in practice, since any other symmetry axis location leads to sample redundancy in the filter outputs. and . We denote by negative values the instants which precede the instant 0 of . From Fig. 22, as a general rule, we can see that the regular DDG (as depicted in Fig. 17) of a generic -tap symmetric FIR filter ends at the pivot of instant " " ( in Fig. 22), and starts from the pivot of instant ( in Fig. 22). Between these two values, the DDG is irregular. Because of this irregularity at the boundary region, the SRL skew lengths given in Table V should be updated. The updated skew lengths can be deduced using the following approach:
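For reference, symmetric extension of a finite-length signal can be sketched as below. The two variants (mirroring about the edge sample, or repeating it) are both common in practice; which one applies here, and the extension lengths, are left as parameters. The function name and arguments are assumptions for illustration.

```python
def symmetric_extend(x, n_left, n_right, repeat_edge=False):
    """Mirror n_left / n_right samples around the first / last sample.

    repeat_edge=False: whole-sample symmetry, x(-k) = x(k)
    repeat_edge=True : half-sample symmetry,  x(-k) = x(k - 1)
    """
    limit = len(x) if repeat_edge else len(x) - 1
    if max(n_left, n_right) > limit:
        raise ValueError("extension longer than the available samples")
    off = 0 if repeat_edge else 1
    left = [x[k - 1 + off] for k in range(n_left, 0, -1)]
    right = [x[len(x) - k - off] for k in range(1, n_right + 1)]
    return left + list(x) + right


x = [1, 2, 3, 4, 5]
print(symmetric_extend(x, 2, 2))                    # edge sample not repeated
print(symmetric_extend(x, 2, 2, repeat_edge=True))  # edge sample repeated
```

A conventional implementation materializes this extended stream with a router in front of the filter; the architectures discussed here avoid doing so.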
• An approach to determine the SRLs' skew lengths when using symmetric signal extension. Fig. 22 shows clearly that when filtering at the left side signal boundary, only the Left_Line of the filter's DDG is altered, whereas it is the Right_Line that is altered when filtering at the right side signal boundary. To get the necessary skew lengths when extending the signal by symmetry, we consider the Left_Line and the Right_Line of the filter's DDG separately. This allows us to use the skew matrices and skew lists given in [14].
Left side signal boundary extension: By comparing Figs. 22 and 7 (which is valid for both accumulator structures), we can conclude that the Left_Line DDG of a -Tap symmetric filter is similar to the DDG of a -Tap FIR filter when the input signal is extended by samples at its left side (the Left_Line DDG is regular in ). The Left_Matrix and RegSkew expressions given in [14] can then be used. However, depending on the delay parity of the logic that follows the SRL layer, and as explained in Section III-A1, the pivot of the symmetric filter should feed the adder tree 1 cc or 1.5 cc after being computed, instead of 1 cc as assumed with the general -Tap filter [14]. It is equal to 1 cc for even logic delay parity. Therefore, the pivot as well as the Left_Line skew lengths given in [14] should be updated by a Left_Upd value such that
\[
\text{Left\_Upd} =
\begin{cases}
0.5, & \text{if the logic delay parity after the SRL is odd}\\
0, & \text{if the logic delay parity after the SRL is even.}
\end{cases}
\]
Consequently, the SRLs' skew lengths at the left side boundary regions are given by a K_Matrix (since ) of size equal to , which is shown in the equation at the bottom of the page.
On the other hand, the Left_Line SRLs' skew lengths at the non-boundary regions are given by the LeftRegSkew list such that:
This list is deduced by using the RegSkew list expression given in [14], where . Unlike the Left_Line, the DDG's Right_Line is regular throughout the left side signal boundary. Thus, the values given in Table V are used.
Right side signal boundary extension: When filtering through the non-boundary regions, the DDG Left_Line and Right_Line products are skewed by the LeftRegSkew and RightRegSkew lists given above, respectively. However, unlike the Left_Line, the Right_Line is no longer a straight line when filtering through the right side signal edge. Therefore, the LeftRegSkew list values are still valid at the right boundary regions whereas the Right_Line skew list should be updated.
By comparing Figs. 7 and 22 at the right signal boundary, we can conclude that the Right_Line for a symmetric FIR filter is equivalent to the DDG of a -tap filter where the left extension value is equal to 0. Therefore, the Right_Matrix expression given in [14] can be used with . Similarly, depending on the delay parity of the logic that follows the SRL, and according to Section III-A1, the product should feed the adder tree 1.5 cc or 2 cc after it has been computed, instead of cc as assumed with the general -tap filter [14]. It is equal to 1.5 cc for even logic delay parity. Therefore, the as well as the Right_Line skew lengths given in [14] for a -tap general FIR filter should be updated by a Right_Upd value that takes one value if the logic delay parity after the SRL is odd and another if it is even.
Given the opposite direction of the arrows in Figs. 7 and 22, the SRLs' skew lengths at the right side boundary regions are given by a K_Matrix (since ) of size equal to , which is shown in the equation at the bottom of the page.
The reader can easily verify that the sink SRL skew length becomes equal to 0 (see Fig. 22). This SRL can then be replaced with a flip-flop clocked at the falling edge.
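Why the skew lengths become irregular near the boundaries can be seen from an index-level sketch. It assumes an odd-length filter and whole-sample symmetry, x(-k) = x(k), at both ends, and it reports the delay between the arrival of a sample and its use in an output. These delays are a simplified model: they are not the entries of Table V or of the K_Matrix, which also account for pipeline latency and the Left_Upd and Right_Upd corrections. All function names are hypothetical.

```python
def reflect(m, length):
    """Map any index onto 0..length-1 by whole-sample mirroring at both ends."""
    period = 2 * (length - 1)
    m %= period
    return m if m < length else period - m


def fir_reflected(h, x):
    """Centered FIR over a symmetrically extended signal, without building it."""
    e = (len(h) - 1) // 2
    return [
        sum(h[i] * x[reflect(n + e - i, len(x))] for i in range(len(h)))
        for n in range(len(x))
    ]


def product_delays(n_taps, length, n):
    """Delay between the arrival of x(m) and its use by tap i in output n."""
    e = (n_taps - 1) // 2
    t = n + e  # instant at which the newest sample needed by output n arrives
    return [t - reflect(t - i, length) for i in range(n_taps)]


print(product_delays(5, 8, 3))  # regular region
print(product_delays(5, 8, 0))  # left boundary
print(product_delays(5, 8, 7))  # right boundary
```

In the regular region the delays form a straight ramp, whereas at either boundary they fold back on themselves, which is what a variable-depth shift register can absorb by changing its read address.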
• Single-clocked architecture.
The approach explained in the previous section can be easily extended to the architecture of Fig. 21 with the adder tree accumulator type. The required SRL skew lengths are deduced by simply removing the factors Left_Upd, Right_Upd, and from the expressions of the K_Matrices, LeftRegSkew, and RightRegSkew given above. As explained previously, those updates are no longer necessary since and operands and computation are now generated at the rising edge of the master clock. In fact, the architecture of Fig. 21 is a parallel combination of two instances of the architecture of Fig. 5. The latter's required skew lengths have to be applied separately to the and SRL layers. Since the skew length of the sink SRL becomes equal to 0 (see Fig. 22), it can be replaced with a flip-flop clocked at the rising edge.
B. Area and Speed Comparisons
Table VII gives the area cost of our novel architectures (Figs. 16 and 21) and the conventional architecture of Fig. 15 with a W-bit data input. The cost of the Hard Router in Table VII has been deduced using its formula given in Section I with . Excluding the area cost of the SRLs' address generators and the multipliers (which is the same in all architectures), Fig. 23 depicts the evolution of the normalized area cost of all architectures in relation to the filter length in FPGAs with 4-input LUTs. The SRL16 component is assumed to be used. Similar trends are obtained with 6-input LUT FPGAs. Moreover, the graphs have been normalized in terms of the input wordlength . Thus the real difference in LUTs between the three architectures should be multiplied by , which favors our architectures further. From this figure, we can see clearly that our novel architectures are much more compact than the conventional architecture of Fig. 15. The architecture of Fig. 16 is the most area-efficient. However, since it uses a clock frequency doubler, its speed is limited by the speed of the logic clocked at Clkx2. We expect the architecture of Fig. 16 to be significantly slower than both the conventional architecture of Fig. 15 and our architecture of Fig. 21.
Table VIII gives the results achieved from the implementation of the standard low Biorthogonal 9&7 wavelet filter (9 taps) [7], using bit-parallel arithmetic. The FIR filter was implemented using the same constraints stated in the case study of Section II-C.
From Table VIII, we conclude that the hard router solution introduces a considerable area penalty, whereas the speed of the architecture of Fig. 16 is significantly lower, although still high enough for a real-time implementation. Our architecture of Fig. 21 offers a speed/area trade-off between the two. It is worth noting that if we implement the same filter under the same constraints, but with the output sub-sampled by a factor of 2, the speed penalty of Fig. 16 disappears, as shown in Table IX.
The change in each filter area between Tables VIII and IX is due to the decimation by a factor of 2, which means that fewer inputs are multiplexed or skewed than in the non-decimated version. In fact, to find the SRLs' skew lengths (either for a general or a symmetric FIR filter) when the output is decimated, the expressions of the skew lists and matrices given above can be used. With bi-phase FIR filters [17], the odd (even) coefficients are multiplied with the odd (even) input samples only. Therefore, the above matrices should be sampled, where each odd (even) order SRL is associated with an odd (even) skew matrix column. Moreover, since the multipliers are enabled only once every two cycles, the SRLs' skew lengths given in these matrices should be divided by 2. As a consequence, there is no need for a clock frequency doubler when using the architecture of Fig. 16 to implement a symmetric filter with decimated outputs. As such, the architecture of Fig. 16 delivers a speed similar to that of the conventional filters but with a considerable area saving. This architecture is hence well suited to multirate architectures such as those based on DWTs [7].
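The bi-phase property used above, namely that even coefficients only ever meet even input samples and odd coefficients only odd ones when the output is decimated by 2, can be verified with a short sketch. Zero samples are assumed outside the signal and the function names are illustrative.

```python
def fir(h, x):
    """y(n) = sum_i h[i] * x(n - i) for n = 0..len(x)-1, zeros outside x."""
    return [
        sum(h[i] * x[n - i] for i in range(len(h)) if 0 <= n - i < len(x))
        for n in range(len(x))
    ]


def decimate_direct(h, x):
    """Filter at the full rate, then keep every second output."""
    return fir(h, x)[::2]


def decimate_biphase(h, x):
    """Compute only the kept outputs from two half-rate phases."""
    h_even, h_odd = h[0::2], h[1::2]
    x_even, x_odd = x[0::2], x[1::2]
    out = []
    for k in range((len(x) + 1) // 2):
        acc = sum(
            h_even[j] * x_even[k - j]
            for j in range(len(h_even)) if 0 <= k - j < len(x_even)
        )
        acc += sum(
            h_odd[j] * x_odd[k - 1 - j]
            for j in range(len(h_odd)) if 0 <= k - 1 - j < len(x_odd)
        )
        out.append(acc)
    return out


h = [1, 4, -2, 7, -2, 4, 1]   # hypothetical symmetric coefficients
x = [3, 0, -5, 2, 8, 1, 1, -4, 6]
assert decimate_biphase(h, x) == decimate_direct(h, x)
```

Each multiplier in the bi-phase form produces one useful product every two input samples, which is consistent with halving the skew lengths and dropping the clock doubler.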
IV. CONCLUSION
This paper addressed an important signal processing technique which is often given little attention in hardware implementations, namely signal boundary extension in the processing of finite-length signals. The technique of symmetric extension has been investigated in particular, as it is the most widely used in practice. To this end, the design and implementation of four novel bit-parallel FIR filter architectures have been detailed. The architectures harness the SRL components of Xilinx FPGAs to achieve significant area savings compared to conventional FIR architectures. Detailed scheduling algorithms were presented, making the architectures fully scalable and parameterizable.
For the general FIR filter, we presented two architectures (see Fig. 5 with its two accumulator types) that lead to a very compact FPGA configuration compared to conventional architectures while maintaining the same throughput. The special case of symmetric FIR filters was further considered, for which two novel architectures (see Figs. 16 and 21) were also devised. The first architecture (see Fig. 16) delivered considerable area savings, albeit at the expense of a clock frequency doubler and a lower throughput. Nonetheless, the overall processing speed was still high enough to achieve real-time performance, e.g., for wavelet transformation of HDTV. Furthermore, we noted that this architecture can match the speed of the conventional architecture if the filter output is decimated, as is the case in multirate applications (e.g., wavelets) [7]. The second symmetric filter architecture, in Fig. 21, replicates some logic to avoid the need for a clock frequency doubler. This allowed us to match the throughput of the conventional architecture, albeit with smaller area savings than our first symmetric filter architecture.
Another advantage of our architectures compared to conventional ones resides in the fact that they can readily harness subexpression sharing [18] (i.e., sharing multiplier blocks between different terms of the FIR filter) which can lead to even more area savings.
Finally, it is worth noting that the role of the SRL presented in this paper can be played by distributed dual-port RAM configurations of the FPGAs' LUTs, which makes the benefits of our architectures available on a wider range of FPGA chips, including the Xilinx XC4k series [19] as well as Lattice Semiconductor's FPGAs [20]. The only drawback is that a 16-bit dual-port RAM configuration consumes two LUTs instead of one LUT for an SRL16 configuration. Nonetheless, our architectures remain more area-efficient than conventional FIR architectures.
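How a small dual-port RAM can stand in for a variable-depth shift register is sketched below: one port writes at a free-running pointer and the other reads a programmable number of locations behind it. The class name and the write-before-read behavior within a clock are assumptions of this model; actual RAM primitives differ in their read-during-write behavior.

```python
class RamDelayLine:
    """Dual-port RAM used as a variable-depth delay line (behavioral model)."""

    def __init__(self, depth=16):
        self.mem = [0] * depth
        self.depth = depth
        self.wr = 0  # free-running write pointer (port A)

    def clock(self, value, delay):
        """Write `value`, then read the sample written `delay` clocks earlier."""
        if not 0 <= delay < self.depth:
            raise ValueError("delay must be in 0..depth-1")
        self.mem[self.wr] = value
        out = self.mem[(self.wr - delay) % self.depth]  # port B read address
        self.wr = (self.wr + 1) % self.depth
        return out


line = RamDelayLine()
print([line.clock(v, 3) for v in range(1, 9)])
```

Changing `delay` from one clock to the next plays the same role as changing the SRL address, at the cost of the extra pointer logic and the second LUT noted above.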
The benefits of our architectures are hence applicable to a wide range of FPGAs. The fact that we have presented case study implementation results only for Xilinx Virtex-E FPGAs does not narrow the scope of our conclusions. In fact, further optimizations could have been made to our architectures, including harnessing sub-expression sharing and using the SRL32 component and the embedded arithmetic blocks of the latest FPGAs (e.g., hardwired multipliers and Xtreme DSP blocks [13]), which would have favored our architectures even further. These issues were not explicitly addressed in the paper as we wanted to focus on the novel architectures and algorithms devised, which concern the optimization of the SRL layer, as opposed to the hard router, in order to skew the FIR input data appropriately before subsequent multiplication and accumulation. Any subsequent multiplier or adder optimization can be readily applied to our architectures.
