Abstract-The growing energy footprint and environmental costs of information and communication technologies have created an awareness of the need for greener communications. However, the task of reducing the energy footprint of wireless infrastructure and terminals is daunting due to the requirements of flexibility and reconfigurability in emerging paradigms like 4G. This paper addresses the flexibility and power consumption challenges of channel filtering, which is one of the most computationally intensive kernels in the radio baseband. Power reduction strategies for programmable time-shared filters have generally focused on dynamic power, which has been replaced by leakage power as the dominant mode of power consumption in nanoscale CMOS devices. We investigate the role of parallelism in reducing the nanoscale CMOS power consumption. We also propose a class of programmable time-shared filters that are more area efficient than traditional folded direct form filters when the level of parallelism is increased.
I. INTRODUCTION
Green radio is a moniker for the growing body of research focused on the role of information and communication technologies (ICT) in sustainable development, taking into account its social, economic and environmental impacts [1, 2]. In terms of relative contribution to the overall ICT carbon footprint, the contribution of mobile telephony, even with exponentially increasing subscribers in emerging markets, is expected to decrease by 2020 with respect to other components of the ICT carbon footprint like datacenters [3]. However, merely looking at the carbon footprint can be misleading. The vastly increased mobile telephony subscriber base poses some critical problems which are not very apparent when one looks only at the embodied carbon footprint due to manufacturing or the operational carbon footprint. A holistic lifecycle analysis needs to incorporate the environmental impact, even after the useful lifetime of the devices. It is increasingly apparent that the mounting amount of e-waste from discarded terminals and batteries will exact a significant environmental cost due to the toxic materials used in them [4, 5].
Increasing the energy efficiency can result in a lower number of battery recharge cycles, and hence, increased battery life. However, improving the energy efficiency of mobile terminals faces some daunting challenges as network operators migrate from 2G and 3G technologies to 4G. In power constrained mobile terminals, the available power budget for computations is only about 1 W [6]. Moving towards 4G terminals would require energy efficiencies of the order of 1 TOPS/W [7]. The 4G paradigm envisions multiple access networks built on a core IP network, with flexible terminals providing seamless mobility across multiple access networks. Interoperability across multiple networks allows an 'always best connected' access model, where the choice of the access network can be matched to the QoS requirements and power consumption constraints [7]. Flexibility, however, incurs a power penalty, and flexible hardware is usually several orders of magnitude lower in power efficiency when compared to custom hardware [8].
These conflicting demands of flexibility and low power consumption impose new design challenges for implementing the computationally intensive kernels in the radio baseband, which are typically implemented using hardware (HW) accelerators. HW accelerators are typically inflexible and optimized for a single standard. A scalable radio hardware design strategy for paradigms like 4G requires flexible HW accelerators that can be reused across multiple standards. This paper focuses on the low power implementation of programmable time-shared filters, required for channel selection in the digital front-end [9] of flexible mobile terminals. Channel selection is extremely computationally intensive, as the filters have to operate on the highly oversampled output of the A/D converter. The high regularity and low control overheads make finite impulse response (FIR) channel filters a good candidate for HW acceleration. Single mode filter accelerators benefit from a large class of constant coefficient filter optimizations like common subexpression elimination and graph dependency algorithms [10] [11] [12], which reduce the multiply operations to a set of shift and add operations. Consequently, a spatial style implementation of a constant coefficient filter results in low area, control and power overheads. Reprogrammability, on the other hand, necessitates the use of generic HW multipliers, which have a significantly higher area and power penalty compared to constant multipliers. The regularity of FIR filter data flow graphs allows variable filter lengths to be folded onto a fixed set of multiply and accumulate (MAC) units, with minimal changes to the control logic. Prior works on reducing the power consumption of time-shared FIR filters have usually focused on reducing the increased switching activity due to loss of input correlation in temporal style implementations [13] [14] [15].
In nanoscale CMOS technologies, the leakage component dominates the overall power consumption [16]. Since increased area translates to an increased number of leaking transistors, the role of parallelism as a power reduction strategy in nanoscale CMOS technologies needs to be revisited. In this paper, we propose a class of programmable time-shared FIR filters based on fast filter algorithms that can trade area for increased timing slack more efficiently than traditional folded direct form filters, up to certain filter lengths. The increased timing slack provides more room for supply and threshold optimizations, which can be used to reduce all the major nanoscale CMOS power consumption components. This paper is organized as follows. Section II highlights the need for flexibility in the channel filters for paradigms like 4G. Section III analyzes the role of parallelism as a power reduction strategy in nanoscale CMOS technologies. Section IV compares the area efficiencies of the proposed time-shared filters with traditional folded direct form filters. Section V presents the conclusion and future work.
II. NEED FOR FLEXIBLE CHANNEL SELECTION IN EMERGING COMMUNICATION PARADIGMS
Paradigms like 4G impose a stringent requirement of flexibility on the radio hardware. Providing interoperability between multiple access networks requires a radio front-end capable of handling varying channel bandwidths. Performing the channel selection completely in the analog domain would require highly frequency selective analog filters, which are tunable over a wide range of bandwidths. The limited flexibility of analog filters is a major bottleneck for multistandard support. A more practical multistandard channel selection paradigm is the so-called 'fixed digitization bandwidth' approach [9]. In this method, the flexibility requirements of the analog filters can be considerably relaxed by designing them for the widest channel bandwidth. Once a coarse band is selected by the analog front-end, the task of fine channel selection is performed in the digital domain, where one has the advantage of easier reprogrammability. Selecting a coarse frequency band in the analog front-end requires the A/D converter to digitize a wideband signal, which may comprise unwanted interferers and blockers, in addition to the channel of interest. The power levels of the interferers and blockers can be several times that of the required channel. Successful demodulation requires the potential interfering components to be sufficiently attenuated to satisfy the standard-specific carrier-to-noise (C/N) ratio.
As shown in Fig. 1, a multistandard channel selection filter has to accommodate varying channel bandwidths, interferer locations, interferer power levels, and attenuation requirements. Hence, supporting multiple standards requires both the coefficient set and the filter lengths to be variable. Channel selection has been traditionally performed by FIR filters due to the relative ease of providing a linear phase and stability. FIR channel filters are computationally intensive, due to the large number of multiply and accumulate (MAC) operations involved. These filters also have to operate at the A/D converter sampling rate, which is typically highly oversampled with respect to the symbol/chip rate. The sheer number of MAC operations per second places channel selection beyond the computational capabilities of digital signal processors, where reprogrammability can be achieved by merely changing the software. Instead, the function has to be performed using a dedicated filter HW accelerator.
III. PARALLELISM AND POWER CONSUMPTION
One of the important architectural-level design parameters in the design of a HW accelerator is the degree of parallelism. Parallelism has been traditionally used as an architectural transformation for trading increased area for reduced dynamic power. In larger device geometries, CMOS power consumption is largely dominated by dynamic power consumption, which is caused by the charging and discharging of capacitive nodes. The dynamic power can be modeled as follows [20]:

$$P_{dyn} = \alpha_f C_T V_{DD}^2 f \qquad (1)$$
where C_T stands for the total capacitive load, α_f is the fraction of capacitive nodes that are switching, V_DD is the supply voltage and f is the clock frequency. Eq. (1) reveals a quadratic dependence on the supply voltage and a linear dependence on the clock frequency. In a fixed throughput system, increasing the level of parallelism has the effect of reducing the frequency of operation and hence increasing the timing slack in the critical path. The delay of a CMOS circuit depends on the time taken to charge and discharge the capacitive nodes, and can be expressed using the alpha-power law MOSFET model [17], as follows:

$$T_D = \frac{K_D L_D V_{DD}}{(V_{DD} - V_{th})^{\alpha}} \qquad (2)$$
where K_D is a fitting parameter, L_D is the logic depth, V_th is the threshold voltage and α is the velocity saturation term, which is equal to 1.5 in short channel transistors [17]. Eq. (2) suggests that any available timing slack in the circuit can be exploited for reducing the supply voltage or for increasing the threshold voltage. In larger device geometries, the quadratic dependence of dynamic power on supply voltage makes supply voltage scaling a useful strategy for exploiting the parallelism-induced timing slacks.
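The interplay between Eq. (1) and Eq. (2) can be illustrated with a minimal Python sketch. It assumes the dynamic power takes the form α_f C_T V_DD^2 f and the delay takes the common alpha-power-law form K_D L_D V_DD / (V_DD - V_th)^α. The function names, the numerical values (1.8 V nominal supply, 0.45 V threshold, 5 ns cycle, logic depth of 20, switching activity of 0.1) and the assumption that C_T doubles when the datapath is duplicated are all illustrative and are not taken from the paper.

```python
def dynamic_power(alpha_f, c_total, v_dd, f_clk):
    """Assumed form of Eq. (1): switched-capacitance power in watts."""
    return alpha_f * c_total * v_dd ** 2 * f_clk


def gate_delay(v_dd, v_th, k_d, logic_depth, alpha=1.5):
    """Assumed alpha-power-law delay: K_D * L_D * V_DD / (V_DD - V_th)^alpha."""
    if v_dd <= v_th:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return k_d * logic_depth * v_dd / (v_dd - v_th) ** alpha


def min_vdd_for_period(period, v_th, k_d, logic_depth, v_max, alpha=1.5, tol=1e-6):
    """Bisect for the lowest V_DD in (v_th, v_max] whose delay fits the period.

    For alpha > 1 the delay falls monotonically as V_DD rises, so bisection is valid.
    """
    if gate_delay(v_max, v_th, k_d, logic_depth, alpha) > period:
        raise ValueError("period cannot be met even at v_max")
    lo, hi = v_th, v_max  # hi always meets the period, lo never does
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gate_delay(mid, v_th, k_d, logic_depth, alpha) <= period:
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical operating point: the circuit just meets a 5 ns cycle at 1.8 V.
V_NOM, V_TH, L_D, T_NOM = 1.8, 0.45, 20, 5e-9
K_D = T_NOM * (V_NOM - V_TH) ** 1.5 / (L_D * V_NOM)  # calibrated fitting parameter
ALPHA_F, C_T = 0.1, 1e-12

p_serial = dynamic_power(ALPHA_F, C_T, V_NOM, 1.0 / T_NOM)

# Two parallel units at fixed throughput: half the clock rate, twice the capacitance.
v_par = min_vdd_for_period(2 * T_NOM, V_TH, K_D, L_D, V_NOM)
p_parallel = dynamic_power(ALPHA_F, 2 * C_T, v_par, 0.5 / T_NOM)

print(f"V_DD: {V_NOM:.2f} V -> {v_par:.2f} V, power ratio: {p_parallel / p_serial:.2f}")
```

Under these assumptions the frequency and capacitance factors cancel, so the entire dynamic power saving comes from the lower supply voltage that the doubled cycle period permits.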
Aggressive technology scaling over the years has led to leakage power replacing dynamic power as the dominant source of power consumption. Formerly, leakage was only relevant in devices which spent a significant time in idle modes, with little switching activity. This trend can be attributed to two main factors. Firstly, scaling of the supply voltages has required the threshold voltage to be scaled down as well to maintain the gate overdrive. Reducing the threshold voltage results in an exponential increase in the subthreshold leakage current. The subthreshold leakage power consumption can be modeled as follows [18]:
where C_ox is the gate oxide capacitance per unit area, m is the body effect coefficient, v_T is the thermal voltage, μ_0 is the zero bias mobility, V_GS is the gate-to-source voltage, V_DS is the drain-to-source voltage, W is the gate width and L is the gate length. Secondly, scaling of the silicon dioxide gate oxide in nanoscale CMOS technologies has resulted in the gate and the channel being separated by the thickness of just a few atoms, which has tremendously increased the gate tunneling currents. The power consumption due to the direct tunneling gate leakage current can be modeled as follows [19]:

Both the gate leakage and subthreshold leakage are linearly dependent on the gate width. Hence the total leakage power consumption is strongly correlated to the total gate width and area [16]. The total physical capacitance term C_T in (1) is also strongly correlated to the total area. Hence it is important to ensure that the potential gains from increased parallelism are not offset by the increased area overheads. Parallelism reduces the operating frequency, f, of the datapath operators. Assuming that the supply and threshold voltages are scaled such that the circuit delay is equal to the cycle period, f can be expressed in terms of V_DD and V_th as follows:

$$f = \frac{(V_{DD} - V_{th})^{\alpha}}{K_D L_D V_{DD}} \qquad (5)$$
Eq. (5) can be reformulated as follows:

$$V_{th} = V_{DD} - (K_D L_D V_{DD} f)^{1/\alpha} \qquad (6)$$
Eq. (6) can be used to plot the locus of (V_th, V_DD) pairs for a fixed frequency constraint. To demonstrate how the permissible (V_th, V_DD) pairs vary with changing frequency constraints, we synthesized a 16 bit adder in a TSMC 0.18 μm process, and used it to extrapolate the circuit specific parameters, L_D and K_D. Fig. 2 shows the locus of (V_th, V_DD) pairs having a fixed frequency, for the 16 bit adder. It shows that relaxed frequency constraints can be used to lower the supply voltages and increase the threshold voltages. The higher V_th, lower f, and lower V_DD values result in a cubic reduction of dynamic power, and an exponential reduction of subthreshold leakage power and gate leakage power. These reductions can compensate for the near-linear increase in total power due to increased area.
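A locus of this kind can be sketched in a few lines of Python. The sketch assumes that Eq. (6) takes the form V_th = V_DD - (K_D L_D V_DD f)^(1/α), which follows from the alpha-power-law delay when the delay equals the cycle period. The values of K_D and L_D below are hypothetical placeholders and are not the parameters extracted from the synthesized adder.

```python
def vth_for_frequency(v_dd, f, k_d, logic_depth, alpha=1.5):
    """Highest threshold voltage that still meets frequency f at supply v_dd."""
    return v_dd - (k_d * logic_depth * v_dd * f) ** (1.0 / alpha)


def frequency_for(v_dd, v_th, k_d, logic_depth, alpha=1.5):
    """Inverse relation: frequency at which the delay equals the cycle period."""
    return (v_dd - v_th) ** alpha / (k_d * logic_depth * v_dd)


def locus(f, k_d, logic_depth, vdd_values, alpha=1.5):
    """(V_th, V_DD) pairs that all run at exactly frequency f."""
    return [(vth_for_frequency(v, f, k_d, logic_depth, alpha), v) for v in vdd_values]


K_D = 2.2e-10  # hypothetical fitting parameter (seconds * volts^0.5 per logic level)
L_D = 20       # hypothetical logic depth
VDD_GRID = [0.8, 1.0, 1.2, 1.4, 1.6, 1.8]

LOCUS_FAST = locus(200e6, K_D, L_D, VDD_GRID)  # tighter frequency constraint
LOCUS_SLOW = locus(100e6, K_D, L_D, VDD_GRID)  # relaxed by 2x parallelism

for (vth_fast, v_dd), (vth_slow, _) in zip(LOCUS_FAST, LOCUS_SLOW):
    print(f"V_DD={v_dd:.1f} V: V_th={vth_fast:.3f} V at 200 MHz, {vth_slow:.3f} V at 100 MHz")
```

At every supply voltage on the grid, the relaxed frequency constraint admits a higher threshold voltage, which is the effect the locus plot is meant to show.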
IV. AREA EFFICIENCIES OF TIME-SHARED FIR FILTERS
Given the relationship between increased timing slack and reduced power consumption, one important architectural-level design consideration is the efficiency with which an architecture can trade area for increased timing slack. To measure this trade-off, we define 'area efficiency' as the amount of timing slack increment obtained per unit increase in area. The area efficiency metric of two different classes of time-shared filters is analyzed in this section, namely folded direct form filters, and a fast filter algorithm based time-shared filter structure, proposed by us in [21].
A. Folded Direct Form FIR Filters
A time-shared FIR filter structure can be derived by folding a direct form FIR filter structure onto a limited set of MAC units. Consider a direct form filter of length N, which is folded onto M MAC units (M << N, N is assumed to be a multiple of M for simplicity). Assuming that the fixed throughput requirement of the filter is f_clk, the operating frequency of each MAC can be given by N f_clk/M.
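The folding described above can be modeled behaviorally in Python. The following is a minimal sketch rather than a description of the actual hardware: it assigns each MAC unit a contiguous block of N/M taps (one possible folding set; the tap-to-MAC assignment is an assumption) and counts the MAC cycles spent per input sample. The function names are hypothetical.

```python
def direct_fir(h, x):
    """Reference: y[n] = sum_k h[k] * x[n - k], one output per input sample."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def folded_fir(h, x, num_macs):
    """Behavioral model of an N-tap FIR filter folded onto num_macs MAC units.

    Returns the outputs and the number of MAC cycles needed per input sample.
    """
    n_taps = len(h)
    if n_taps % num_macs:
        raise ValueError("filter length must be a multiple of the MAC count")
    cycles_per_sample = n_taps // num_macs
    delay_line = [0] * n_taps  # delay_line[k] holds x[n - k]
    outputs = []
    for sample in x:
        delay_line = [sample] + delay_line[:-1]
        acc = [0] * num_macs  # one accumulator per MAC unit
        for cycle in range(cycles_per_sample):  # time-multiplexed cycles
            for mac in range(num_macs):         # concurrent in hardware
                tap = mac * cycles_per_sample + cycle
                acc[mac] += h[tap] * delay_line[tap]
        outputs.append(sum(acc))  # final adder combining the MAC outputs
    return outputs, cycles_per_sample


H = [1, 2, 3, 4, 5, 6]
X = [1, -1, 2, 0, 3, 5, -2]
Y_FOLDED, CYCLES = folded_fir(H, X, num_macs=2)
print(Y_FOLDED, "cycles per sample:", CYCLES)
```

Each MAC unit performs N/M multiply-accumulate cycles per input sample, so with samples arriving at f_clk the MAC clock must run at N f_clk/M, in line with the expression above.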
B. FFA based Time-shared Filters
An alternate strategy for implementing time-shared FIR filters, presented in [21], used fast filter algorithm (FFA) structures as the starting point for constructing the time-shared filters. FFAs [22, 23] work on the principle of algorithmic strength reduction, i.e. they reduce the number of expensive MAC operations at the cost of increased add operations. FFA structures are a special class of parallel FIR filter structures. A K-parallel FIR structure corresponding to an N tap FIR filter processes K inputs in parallel and produces K outputs. The above structure has K^2 subfilters, each of length N/K (N is assumed to be a multiple of K for simplicity). When multiple inputs are processed in parallel, there exist significant redundancies across the subfilters. This is used by the FFA structures for reducing the number of subfilters. A K-by-K FFA (henceforth referred to as KxK FFA) has S_K subfilters, where S_K is typically much smaller than K^2. Each subfilter is of length N/K. This reduction of subfilters comes at the expense of A_K pre-processing/post-processing adders. The general FFA structure can be illustrated through the example of the simplest FFA, the 2x2 FFA, shown in Fig. 3. It consists of regular FIR subfilters and a highly irregular pre-processing/post-processing addition network. The subfilters are either a polyphase component of the original filter, or an additive combination of different polyphase components. The time-shared filters in [21] were obtained by folding each of the regular FIR subfilters of a KxK FFA filter onto a set of L MAC units, while directly mapping the irregular pre-processing/post-processing addition network data flow graph onto hardware. This exploits the regularity of the FIR subfilters for reducing the control overheads, while still benefiting from the algorithmic strength reduction of FFAs. In this paper, the notation KxK|L is used to denote the above time-shared filter.
This notation indicates both the FFA order and the subfilter folding set size. Assuming that the throughput requirement is f_clk, the input sample rate to each subfilter is f_clk/K. Each of the subfilters is of length N/K. When each subfilter is folded onto L MAC units, the operating rate of each MAC unit is N f_clk/(K^2 L). The pre-processing/post-processing addition network operates at a lower rate of f_clk/K. The lower bound on the timing slack available in the system, and hence the bottleneck for (V_th, V_DD) scaling, is determined by the time multiplexed MAC units.
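The 2x2 FFA can be checked numerically with a short Python sketch. It uses the standard two-parallel fast FIR decomposition, in which the even and odd output phases are Y0 = H0 X0 + z^-1 H1 X1 and Y1 = (H0 + H1)(X0 + X1) - H0 X0 - H1 X1, with the delay expressed at the subfilter rate. The sketch is an unfolded software model operating on whole blocks, not the time-shared KxK|L hardware, and the function names are hypothetical.

```python
def conv(h, x):
    """Plain linear convolution, used both as subfilter and as reference."""
    y = [0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y


def ffa_2x2(h, x):
    """Two-parallel FFA: three half-length subfilters instead of four."""
    if len(h) % 2 or len(x) % 2:
        raise ValueError("filter and input lengths must be even")
    h0, h1 = h[0::2], h[1::2]  # polyphase components of the filter
    x0, x1 = x[0::2], x[1::2]  # even and odd input phases
    # Pre-processing additions form the inputs of the third subfilter.
    h_sum = [p + q for p, q in zip(h0, h1)]
    x_sum = [p + q for p, q in zip(x0, x1)]
    a = conv(h0, x0)        # subfilter 1
    b = conv(h1, x1)        # subfilter 2
    c = conv(h_sum, x_sum)  # subfilter 3
    # Post-processing additions; b is delayed by one sample at the subfilter rate.
    y0 = [u + v for u, v in zip(a + [0], [0] + b)]
    y1 = [w - u - v for w, u, v in zip(c, a, b)] + [0]
    y = [0] * (2 * len(y0))
    y[0::2] = y0  # even output phase
    y[1::2] = y1  # odd output phase
    return y[:len(h) + len(x) - 1]


H = [1, 2, 3, 4]
X = [1, 0, -1, 2, 3, 1]
print(ffa_2x2(H, X) == conv(H, X))
```

The three subfilters of length N/2 replace the four that a straightforward two-parallel structure would need, which is the source of the MAC savings that the KxK|L structure folds onto L MAC units per subfilter.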
C. Comparison of Area Efficiencies
Let the critical path delay of a MAC unit for the nominal supply and threshold voltages be T_D. The timing slack in each MAC unit of an FIR filter of length N folded onto M MAC units can be given by (M/(N f_clk) - T_D). Increasing the level of parallelism by P MAC units increases the timing slack to ((M+P)/(N f_clk) - T_D). Increased parallelism increases the datapath area, while the coefficient storage memory overhead remains unchanged. Given that the area of each MAC unit is A_m, the area penalty incurred due to the addition of P MAC units is P A_m. Hence, the area efficiency of the folded direct form filter, E_DF, can be expressed as follows:

$$E_{DF} = \frac{P/(N f_{clk})}{P A_m} = \frac{1}{N f_{clk} A_m} \qquad (7)$$
The timing slack in each MAC unit of the KxK|L structure can be given by (K^2 L/(N f_clk) - T_D). The KxK|L time-shared filter 
The area efficiencies in (7) and (8) can be compared by analyzing the K^2/S_K parameter in Table I. For instance, it can be seen that the area efficiency of the 6x6 FFA is twice that of the folded direct form filter. This implies that 2P MAC units need to be added to a folded direct form filter to obtain the same amount of timing slack increment induced by adding P MAC units to the 6x6 time-shared FFA structure. Hence, the area efficiency metric can be used as a useful guideline when parallelism is used as a power reduction strategy in time-shared filters.
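The comparison can be reproduced with a small Python sketch. It assumes that the area efficiency of the folded direct form filter is 1/(N f_clk A_m) and that of the KxK|L structure is K^2/(S_K N f_clk A_m), which follows from the slack expressions above when one MAC unit is added to each of the S_K subfilters. The subfilter counts in the dictionary are the commonly quoted values for small FFAs and for their cascades; they are an assumption here and are not copied from Table I, although the 6x6 entry agrees with the factor of two stated in the text. The numerical arguments in the example are arbitrary.

```python
# Assumed subfilter counts S_K (4x4 and 6x6 built by cascading 2x2 and 3x3 FFAs).
FFA_SUBFILTERS = {2: 3, 3: 6, 4: 9, 6: 18}


def direct_form_efficiency(n_taps, f_clk, a_m):
    """Timing slack gained per unit of added area, folded direct form."""
    return 1.0 / (n_taps * f_clk * a_m)


def ffa_efficiency(k, n_taps, f_clk, a_m):
    """Timing slack gained per unit of added area, KxK|L structure (assumed form)."""
    return k * k / (FFA_SUBFILTERS[k] * n_taps * f_clk * a_m)


def efficiency_gain(k):
    """Ratio of the two efficiencies, i.e. the K^2 / S_K parameter."""
    return k * k / FFA_SUBFILTERS[k]


def ffa_mac_count(k, l):
    """MAC units in a KxK|L structure."""
    return FFA_SUBFILTERS[k] * l


def equivalent_direct_form_macs(k, l):
    """Direct form MAC units needed for the same MAC rate N f_clk / (K^2 L)."""
    return k * k * l


for k in sorted(FFA_SUBFILTERS):
    print(f"{k}x{k}|1: {ffa_mac_count(k, 1)} MACs, "
          f"direct form equivalent {equivalent_direct_form_macs(k, 1)}, "
          f"gain {efficiency_gain(k):.2f}")
```

The gain depends only on K, so under these assumptions the filter length, throughput and MAC area cancel out of the comparison.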
D. Limitations of FFA based Time-shared Filters
The area efficiency metric gives an idea about the amount of timing slack increment induced by adding a given number of MAC units to a time-shared filter structure. In other words, it measures the sensitivity of the timing slack in different time-shared structures to increasing levels of parallelism. However, this metric does not take into account the 'fixed' costs of coefficient memory and pre-processing/post-processing adders. These costs need to be considered while evaluating the total area, which in turn influences the total power consumption. It is shown in this section that the coefficient memory overhead of FFA based time-shared filters increases much faster than that of folded direct form filters. As a result of their higher area efficiency, FFA based time-shared filters require a lower number of MAC units than folded direct form filters for a comparable amount of timing slack. However, in high order filters, the higher memory overhead of FFA based time-shared filters might negate the advantage of their lower datapath area. Hence, it is important to identify the filter lengths up to which FFA based time-shared filters are more area efficient than folded direct form filters.
Consider the case of different time-shared filter implementations having an identical amount of timing slack (or equivalently, identical cycle periods in the MAC units). The identical timing constraints allow identical supply and threshold voltages to be used for all the implementations. When the operating frequency and the operating voltages are identical, the power consumptions of different implementations show a near-linear dependence on the total area. Note that in the FFA based time-shared filters, the pre-processing/post-processing adders operate at a much lower operating frequency than the MAC units, as they are implemented in a spatial style. This is not taken into account in the current analysis, for simplicity.
Assume that the area costs of the fundamental building blocks of a programmable FIR filter, namely a coefficient/data register, a MAC unit and an adder, are given by A_r, A_m, and A_d respectively. The KxK|L FFA structure has S_K L MAC units, with N/(KL) coefficient multiplications mapped to each MAC unit. The total datapath and coefficient/data storage area of this structure can be given as follows:
The operating frequency of the MAC units in the KxK|L FFA structure is N f_clk/(K^2 L). To achieve the same operating frequency, a folded direct form filter should have K^2 L MAC units with N/(K^2 L) coefficient multiplications mapped to each MAC unit. The total datapath and coefficient/data storage area of the folded direct form filter with K^2 L MAC units can be given as follows:
It can be seen from (9) and (10) that the KxK|L FFA structure has a lower number of MAC units, but S_K/K times the coefficient storage area of the K^2 L MAC based folded direct form filter. The S_K, K^2, and S_K/K values in Table I give an idea of the difference in the datapath and memory areas of the above two time-shared filter architectures. The filter length, N_KxK|L, above which the folded direct form filter has a lower area cost than the KxK|L FFA structure can be derived from (9) and (10), and is shown below:
Equation (11) suggests that increasing the number of MAC units, L, in each subfilter, increases the range of filter lengths over which the KxK|L FFA structure has a lower area than the folded direct form filter. This can be attributed to the fact that, with increasing levels of parallelism, the relative contribution of the coefficient memory area to the total area is reduced. These theoretical inferences are demonstrated by practical synthesis and simulation results below.
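The trend can be explored with a rough Python sketch. Since the exact forms of (9) to (11) are not reproduced here, the sketch adopts a simplified area model as an assumption: the KxK|L structure is charged S_K L MAC units, S_K N/K coefficient registers and A_K adders, while the folded direct form filter is charged K^2 L MAC units and N coefficient registers. Data registers are ignored, and the gate counts and the adder count used for the 2x2 FFA are hypothetical placeholders rather than the values of Table I or Table II.

```python
def ffa_area(n_taps, k, l, s_k, a_k, a_m, a_d, a_r):
    """Simplified area of a KxK|L structure (assumed model, data registers ignored)."""
    return s_k * l * a_m + (s_k * n_taps / k) * a_r + a_k * a_d


def direct_form_area(n_taps, k, l, a_m, a_r):
    """Simplified area of a folded direct form filter with K^2 L MAC units."""
    return k * k * l * a_m + n_taps * a_r


def break_even_length(k, l, s_k, a_k, a_m, a_d, a_r):
    """Filter length at which both simplified area expressions are equal."""
    if s_k <= k:
        raise ValueError("model expects S_K > K, i.e. extra coefficient storage")
    return (l * a_m * (k * k - s_k) - a_k * a_d) / (a_r * (s_k / k - 1))


# Hypothetical gate counts: MAC unit, adder, register.
A_M, A_D, A_R = 3000, 200, 100
# 2x2 FFA: 3 subfilters; 4 pre/post-processing adders is an assumed count.
K, S_K, A_K = 2, 3, 4

for l in (1, 2, 4):
    n_star = break_even_length(K, l, S_K, A_K, A_M, A_D, A_R)
    print(f"2x2|{l}: FFA structure is smaller for N below about {n_star:.0f}")
```

With these placeholder numbers, the break-even length grows with L because the MAC savings scale with L while the extra coefficient storage does not, which matches the qualitative behaviour described above.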
E. Experimental Synthesis and Simulation Results
A 16 bit register, a 32 bit adder circuit, and a MAC circuit comprising a 16x16 bit multiplier and a 32 bit adder-accumulator were synthesized on a TSMC 0.18 μm process. The Synopsys Design Compiler was used to estimate the cell areas A_r, A_d and A_m, respectively. The area in terms of gate count, shown in Table II, was obtained by normalizing the above area values by the cell area of a two input NAND gate from the same library. Fig. 4 plots the equivalent number of MAC units, M_eq, in a folded direct form filter, required to achieve the same amount of timing slack as a KxK|L FFA structure, for varying values of K and L. For the same time-shared FFA structures, Fig. 5 plots the filter length, N_KxK|L, below which an FFA based time-shared structure can be implemented with a lower area cost than an equivalent folded direct form filter with an identical amount of timing slack. Fig. 4 and Fig. 5 clearly demonstrate that with increasing levels of parallelism, the FFA based time-shared filters are more area efficient than folded direct form filters over a wider range of filter lengths.
V. CONCLUSION AND FUTURE WORK
The task of making communications greener in emerging paradigms like 4G is challenging, due to the conflicting requirements of flexibility and low power consumption. Flexible accelerator cores are necessary for a flexible terminal that can operate across multiple access networks in paradigms like 4G or cognitive radio networks. This paper highlighted the flexibility and power consumption challenges that exist for the channel selection filters in the digital front-end. The role of parallelism as a power reduction strategy in nanoscale CMOS was investigated. It was shown that, even though parallelism incurs an area penalty, the timing slacks obtained by increasing the level of parallelism can be used for reducing all the major nanoscale CMOS power consumption components. A class of time-shared FIR filters based on FFAs was introduced, which was shown to trade area for lower operating frequency more efficiently than traditional folded direct form filters, up to certain filter lengths. The range of filter lengths over which the proposed structures are more efficient than folded direct form filters increases with increasing levels of parallelism.
Increased device scaling will result in the threshold voltages and supply voltages moving closer to each other. In such a situation, even a small change in these voltages can have a large effect on the circuit delay. Hence the additional timing slack obtained by increased parallelism will only provide limited room for lowering V_DD or raising V_th. We will be studying this timing slack vs. power reduction tradeoff, and the scalability of parallelism as a power reduction strategy into smaller device geometries.
