The impact of technology scaling on FinFET-based Field-Programmable Gate Array (FPGA) components (flip-flops and multiplexers) and on cluster metrics is evaluated for technology nodes from 20 nm down to 7 nm. Trends in power consumption, delay, and energy (Power Delay Product, or PDP) are reported with FinFET technology scaling. Cluster metrics are then evaluated using three benchmark circuits: a 2-bit adder, a 4-bit NAND, and a cascaded flip-flops chain. The study shows that the power, delay, and PDP of the FPGA cluster improve as the technology is scaled down. For example, the 7 nm 2-bit adder is 15% faster than at 20 nm, and its PDP at 7 nm is 43% lower than at 20 nm. The impacts of temperature and threshold voltage variations on FPGA cluster performance are also reported, using the 2-bit adder circuit as a benchmark; these results are then used to calculate the design constraints needed to meet a 99.9% yield.
Introduction
Field-Programmable Gate Arrays (FPGAs) are integrated circuits that can be reprogrammed to implement any digital circuit. There are many differences between FPGAs and traditional fixed logic paradigms, such as Application Specific Integrated Circuits (ASICs), in terms of performance, design flexibility, cost, and tool availability. The main difference is that the designer can reprogram the FPGA many times on-site after manufacturing. Moreover, using FPGAs in production as reliable alternatives to fixed logic implementations eliminates the nonrecurring engineering (NRE) costs and also reduces time-to-market significantly. Hence, FPGAs are in high demand for implementing digital systems due to their design flexibility, reconfigurability, and low-end product life cycle, which make them ideal candidates for prototyping, design debugging, and small- to medium-volume applications. On the other hand, FPGAs are less efficient and slower than fixed logic implementations because of the added circuitry needed to make them so flexible. In the FPGA structure shown in Fig. 1, programmable switches, which are controlled by configuration memory, consume a large area and add a considerable amount of parasitic capacitance and resistance to the logic and routing resources. Consequently, FPGAs are approximately 20 times larger, 12 times less power efficient, and 3 times slower than ASICs. 1 In addition, FPGAs are less suited to high-volume applications because the area overhead, combined with development and research costs, increases their per-unit cost. During the last 20 years, research and development has produced substantial improvements in FPGA area and speed efficiency, narrowing the gap between FPGAs and ASICs and making FPGAs the preferred platform for implementing digital designs. FPGAs also hold remarkable promise as a fast-to-market replacement for ASICs in various applications.
Many research efforts have targeted improving the speed and area efficiency of programmable routing resources.
FinFET devices have been proposed as promising alternatives to traditional CMOS devices at nanoscale technologies 2,3 since they offer outstanding properties such as improved channel controllability, 4,5 a higher I_on/I_off current ratio, 6 reduced short-channel effects, 7 and higher immunity to gate line-edge roughness. 8 Additionally, their near-ideal subthreshold behavior makes FinFET circuits attractive in the near-threshold supply voltage regime, which consumes much less energy than conventional strong-inversion circuits operating in the superthreshold supply voltage regime. Compared to fully depleted SOI MOSFETs or double-gate FinFETs, tri-gate FinFETs are superior due to the improved electrostatic controllability offered by three gates, 9 which leads to efficient control of short-channel effects and allows further scaling to meet the International Technology Roadmap for Semiconductors (ITRS) trends. 10 For FPGA users, these key performance advantages of FinFETs in production represent the continuation of Moore's law in the march of improvements in transistor performance, density, control over power dissipation, and cost per transistor. This would make FPGAs that advance to 14 nm technology and beyond power-competitive with ASIC design solutions on the competing design nodes available, with even more significant advantages in programmability, performance, and flexibility, which motivated us to conduct this study. 11
Several studies have analyzed Predictive Technology Model (PTM) 12 based circuits with technology scaling. 13-16 For instance, a simulation study of a PTM ring oscillator and some basic logic gates is discussed in Ref. 13. In this work, we evaluate the performance of a FinFET-based FPGA cluster and its constituent components with technology scaling from 20 nm down to 7 nm. We also investigate the impact of threshold voltage variations, representing die-to-die process variations, and of temperature variations on the basic performance metrics. This paper is organized as follows: verification of the PTM models against ITRS values is presented in Sec. 2. The simulation setup, methodology, and device parameters used in the simulations, along with the FPGA cluster architecture, are presented in Sec. 3. Simulation results and discussions for the tri-gate FinFET-based FPGA components are presented in Sec. 4. Section 5 presents the evaluation of the FinFET-based FPGA cluster performance metrics at nominal conditions and with threshold voltage and temperature variations, along with some design insights. Finally, the conclusion is drawn in Sec. 6.
PTM Verification
Predictive Technology Model cards for sub-20 nm multi-gate transistors (PTM-MG) have been developed based on MOSFET scaling theory, the 2011 ITRS roadmap, and early-stage silicon data from published results. 14 PTM-MG used published results from foundries such as Intel, TSMC, and IBM 18-21 to extract the fitting PTM parameters, such as DIBL and subthreshold slope. However, the PTM-MG-based models do not have complete information about the fabricated devices; 18-21 instead, they are obtained by fine-tuning both primary parameters (gate length, fin thickness, fin height, and fin pitch) and secondary parameters (gate work function, channel doping, source-drain channel coupling, and DIBL coefficient) 14 to match the on-current and off-current of the published results.
For future technologies (beyond 14 nm), PTM-MG model cards are developed using ITRS as a reference. According to ITRS trends, the off-current for the 14 nm technology node and below is expected to be I_off = 0.01 nA/µm for LSTP and 100 nA/µm for HP. 10 The PTM-MG models are normalized per effective width (W_eff) for a constant off-current (I_off = 0.1 nA/µm for LSTP and 100 nA/µm for HP). The PTM-MG LSTP devices follow the ITRS LSTP trend but are shifted to be slightly stronger. 14 The impact of the difference between the ITRS off-current and the PTM off-current on the transmission gate flip-flop (TG-FF) metrics is evaluated and plotted in Figs. 2-4. This means that simulation results using nominal PTM-MG parameters deviate slightly from those of fabricated devices with the ITRS off-current. For instance, the power of the 7 nm PTM TG-FF deviates by 5% from that of a similar device with the ITRS off-current.

Fig. 2. The impact of the difference between the ITRS off-current and the PTM off-current on TG-FF power. 22
Simulation Setup
Fig. 3. The impact of the difference between the ITRS off-current and the PTM off-current on TG-FF delay. 22
Fig. 4. The impact of the difference between the ITRS off-current and the PTM off-current on TG-FF PDP. 22

In this work, low-standby-power (LSTP) predictive technology models (PTM-MG), 12 based on BSIM-CMG for multi-gate devices (tri-gate FinFETs), are used from the 20 nm down to the 7 nm technology node. A scaling strategy is adopted, according to the PTM models, which involves scaling the channel length (L), the fin thickness (T_fin), the fin height (H_fin), and the supply voltage (V_DD). For tri-gate FinFET devices, the effective channel width is given by
$$W_{eff} = N_{fin}\,(2H_{fin} + T_{fin})$$
where N_fin is the fin count. We used Cadence Virtuoso and Spectre for all simulations. Performance and power consumption simulations of the FPGA cluster are conducted at room temperature with the nominal supply voltage of each technology node from 20 nm down to 7 nm (0.9 V down to 0.7 V, respectively). FPGA cluster metrics are evaluated using three benchmark circuits: a 2-bit adder, a 4-bit NAND, and a cascaded flip-flops chain. We used operation delay, power, and power delay product as the metrics for evaluating the FPGA cluster and its components.
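As a rough illustration of how the fin geometry sets the drive width, the sketch below assumes the commonly used tri-gate approximation in which each fin contributes its two sidewalls and its top surface. The function name and the fin dimensions are hypothetical placeholders; they are not the values of Table 1 or of the PTM-MG model cards.

```python
def effective_width_nm(n_fin, h_fin_nm, t_fin_nm):
    """Tri-gate approximation: W_eff = N_fin * (2 * H_fin + T_fin)."""
    if n_fin < 1:
        raise ValueError("n_fin must be at least 1")
    return n_fin * (2.0 * h_fin_nm + t_fin_nm)


# Hypothetical fin dimensions, for illustration only.
w_one = effective_width_nm(1, h_fin_nm=20.0, t_fin_nm=10.0)
w_two = effective_width_nm(2, h_fin_nm=20.0, t_fin_nm=10.0)
print(w_one, w_two)
```

Because the width is quantized in whole fins, drive strength can only be adjusted in steps of one fin, and shrinking H_fin or T_fin at a fixed fin count reduces the effective width.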
Simulated devices parameters
Simulated FinFET device parameters for the MUXs, LUTs, and flip-flops are presented in Table 1. Figure 5 shows the LUT's SRAM sizing. Figure 6 shows the LUT simulation results, where S0-S3 represent the 4-bit selection lines of the 16-to-1 MUX used to select a specific SRAM cell output. The SRAM cells are programmed with an alternating sequence of ones and zeros (101010...). For the cluster performance evaluation across all technology nodes, we studied the performance metrics with threshold voltage variations within a range of ±18%, in steps of 6%, of the nominal threshold voltage of each technology node. The values are reported in Table 2.
FPGA cluster architecture
The simulated FPGA cluster structure, shown in Fig. 7, consists of three basic logic elements (BLEs). Each BLE consists of a 4-input lookup table (LUT) cascaded with a TG-FF, and a 2:1 multiplexer that outputs either the direct LUT output or the latched one, as shown in Fig. 8.
The cluster has eight distinct inputs and three outputs. Its structural parameters (LUT size, number of LUTs, and number of inputs and outputs) are chosen to achieve a reasonable area with efficient and fast performance.
Flip-flops
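The BLE structure described above can be sketched behaviorally as follows. This is only a logic-level illustration, not the transistor-level circuit that was simulated; the class and method names are hypothetical, and the ordering of the alternating SRAM pattern (cell 0 holding a one) is an assumption.

```python
class BLE:
    """Behavioral model of one basic logic element (logic level only)."""

    def __init__(self, lut_bits, use_ff):
        if len(lut_bits) != 16:
            raise ValueError("a 4-input LUT needs 16 configuration bits")
        self.lut_bits = list(lut_bits)  # contents of the 16 SRAM cells
        self.use_ff = use_ff            # select line of the 2:1 output MUX
        self.q = 0                      # flip-flop state

    def lut(self, s3, s2, s1, s0):
        # S3..S0 drive the 16-to-1 MUX that picks one SRAM cell.
        index = (s3 << 3) | (s2 << 2) | (s1 << 1) | s0
        return self.lut_bits[index]

    def clock_edge(self, s3, s2, s1, s0):
        # On a clock edge the flip-flop captures the current LUT output.
        self.q = self.lut(s3, s2, s1, s0)

    def output(self, s3, s2, s1, s0):
        # The 2:1 MUX forwards either the latched or the direct LUT output.
        return self.q if self.use_ff else self.lut(s3, s2, s1, s0)


# Assumed bit ordering: cell 0 holds 1, cell 1 holds 0, and so on.
pattern = [1, 0] * 8
registered = BLE(pattern, use_ff=True)
registered.clock_edge(0, 0, 0, 0)       # latch the content of cell 0
combinational = BLE(pattern, use_ff=False)
print(registered.output(0, 0, 0, 1), combinational.output(0, 0, 0, 1))
```

With the registered path selected, the output holds the value captured at the last clock edge even after the select lines change, whereas the direct path follows the LUT immediately.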
Four flip-flop topologies are selected to represent the different trade-offs between power dissipation and performance. 23 The TG-FF and the clocked CMOS flip-flop (C2MOS) are implemented by cascading two complementary latches.
This master-slave implementation results in a robust flip-flop with good hold time behavior. In addition, they are used in standard libraries, 23 which makes it important to include them in this comparison.
The semi-dynamic flip-flop (SD) 23 is considered one of the fastest flip-flops. It can be regarded as a pulsed latch, since it samples the input data to the flip-flop output during a very short transparency period around the clock edge. Accordingly, the input data can arrive after the clock edge. This flip-flop is therefore used in high-performance VLSI applications due to its relatively short data-to-output delay, at the expense of poor hold time behavior and excessive power consumption. 23 The sense-amplifier-based flip-flop (SA) can be viewed as a compromise between the robustness of the master-slave flip-flops and the high performance of pulsed latches.
Transmission gate flip-flop (TG-FF)
The TG-FF shown in Fig. 9 is simulated using the device parameters listed in Table 1. Figures 10 and 11 show the PDP of the TG-FF. The TG-FF is the simplest flip-flop type, and it is also the most common type in digital designs due to its simplicity and its low area (owing to its small transistor count).
PDP trends improve with increasing threshold voltage and degrade with increasing temperature. For instance, at the 7 nm technology node, the PDP at a +18% increase over the nominal threshold voltage is lower than the PDP at the nominal value by a factor of 0.18.
The energy of the flip-flop is improved with temperature increase. For instance, at the 16 nm technology node, the PDP at 120 °C is lower than the PDP at room temperature by a factor of 0.065. Figure 10 illustrates that the PDP of the TG-FF increases as the threshold voltage increases, because the driving current is limited and the operation delay therefore increases. Figure 11 shows that the PDP decreases continuously for technologies from 20 nm to 14 nm. The TG-FF also shows the lowest PDP among the four flip-flop topologies.
Clocked CMOS (C2MOS)
The C2MOS-FF shown in Fig. 12 is simulated using the device parameters in Table 1. Figures 13 and 14 show the PDP of the C2MOS-FF. This flip-flop is insensitive to clock overlap provided that the rise and fall times of the clock edges (clock slew) are sufficiently small.
Semi dynamic (SD)
The semi-dynamic flip-flop (SD-FF) shown in Fig. 18 is simulated using the device parameters in Table 3, since transistor sizing is critical for SD-FF operation (it does not work correctly with minimum sizing). Figures 19 and 20 show the PDP of the SD-FF. The flip-flop consists of a dynamic front end and a static back end, which is why it is called a semi-dynamic circuit. In this work, we found that the SD-FF is the fastest of the four types. It also has a negative setup time, so it is a very good choice for high-performance systems (within the available power budget); however, it is the most power-consuming type and has poor hold time behavior.
Compared to the other flip-flops, the TG-FF is the least power-consuming type. It has a positive setup time and a small clock-to-output delay. It also has the fewest transistors of the four types, but it presents a high clock load.
The C2MOS-FF has a small clock load, achieved by local clock buffering, which also makes it robust to clock slope variation; however, it is slower than the TG-FF.
The SA-FF has the very useful feature of monotonic transitions at its outputs, which allows it to drive fast domino logic. However, it has considerable rise and fall times, which not only degrade speed but also cause glitches in succeeding logic stages, which in turn increase the total power consumption.
The FF type most vulnerable to soft errors is the SA-FF, because of its small flipping time. 23 The least vulnerable type is the SD-FF. The PDP sensitivity (variation) of the FFs increases with technology scaling, as illustrated in Figs. 11, 14, 17, and 20, where the 7 nm technology node shows a sharp rise in PDP at high temperatures (power is the dominant factor in this increase).
Multiplexers
We evaluated the multiplexer metrics using a critical path circuit (a ring oscillator, or "RO") that contains the multiplexer along with some logic gates to represent an actual critical path in a digital circuit, 24 since critical paths in real microprocessor designs consist of similar circuits (cascaded standard logic gates). The circuit in Fig. 21 is selected to model the effect of using PTM FinFET devices on the performance of a real microprocessor design. The ring oscillator frequency is an important parameter in evaluating the performance of the critical path in digital designs. The 14 nm technology shows the best performance because of its large saturation current. Beyond the 14 nm technology node, however, the situation is reversed, which implies the need to look for alternative device scaling options such as gate-all-around (GAA) nanowires 25 and/or high-mobility channels. 26 Although the current per unit width is expected to increase with technology scaling, the RO current decreases because of the scaling strategies adopted to keep short-channel effects (SCEs) under control, since scaling both T_fin and H_fin reduces the effective channel width. As the threshold voltage increases, the time period increases because the driving current is limited.
Technology Scaling Roadmap for FinFET-Based FPGA Clusters
As the temperature increases, the RO driving current increases; hence, the time period decreases. For the 7 nm technology node, the time period at 120 °C is lower than the time period at 27 °C by a factor of 0.45.
Nowadays, low-power designs are needed not only to extend battery life in portable devices but also to reduce cooling costs. The power consumption of the RO decreases with technology scaling, as expected; this result can be used to verify the correctness of the results. For 7 nm, the power at the nominal threshold voltage is lower than the 20 nm power at its nominal threshold voltage by a factor of 0.43.
RO power dissipation increases linearly with temperature. As the temperature increases, the driving current increases, which in turn increases the RO power dissipation. For instance, the power dissipation of the 7 nm technology at 120 °C is 20% higher than at 27 °C.
We can conclude from Figs. 22-27 that the RO's PDP decreases with technology scaling. For example, the PDP of the 7 nm technology at the nominal threshold voltage is lower than that of the 20 nm technology at its nominal threshold voltage by a factor of 0.3.
The PDP of the RO is linearly proportional to temperature because power consumption dominates. The PDP sensitivity increases with technology scaling: for instance, the PDP of the 7 nm technology at 120 °C increases by a factor of 0.2 relative to its nominal value, whereas the PDP of the 20 nm technology at the same temperature increases by a factor of 0.1.
From our work, we observe that performance improves with technology scaling down to 14 nm, while both power consumption and PDP decrease with technology scaling.
An increase in threshold voltage has a positive effect on power consumption and PDP; however, it degrades performance. An increase in temperature has a negative effect on both power dissipation and PDP; however, it enhances performance.
FinFET-Based FPGA Cluster Simulation Results and Discussions

Simulation results at nominal conditions
Adder and NAND benchmarks
Two benchmark circuits (a 2-bit adder and a 4-bit NAND) are simulated with technology scaling from the 20 nm node down to 7 nm. Delay improves with technology scaling; however, beyond the 14 nm technology node, performance is degraded. Alternative trajectories with a higher V_DD would lead to improved performance at the cost of reduced power scaling, as presented in Figs. 28-30. Device scaling options such as high-mobility channels 25 and/or gate-all-around (GAA) nanowires 26 hold the potential to improve device scaling in this time frame.
Using a supply voltage of 0.8 V maintains the performance enhancement trend with technology scaling at the 10 nm and 7 nm technologies. For instance, the 7 nm 2-bit adder delay at a 0.8 V supply is 80.645 ps, while it is 152.35 ps at the nominal supply voltage of this technology node (V_DD = 0.7 V).
Regarding power consumption, the 2-bit adder consumes more power than the 4-bit NAND because its switching factor is greater. Power trends also indicate an improvement with technology scaling down to the 10 nm technology node. Since the SRAMs in the FPGA LUT are configured only once, at the FPGA programming phase, leakage power is the dominant source of the average power dissipation. As leakage power increases with technology scaling, the SRAMs' leakage power has a significant impact on the overall average at the 7 nm technology node, which leads to an increase in power dissipation at this node. A higher supply voltage sustains the performance improvement with technology scaling, but at the expense of power reduction at the 10 nm and 7 nm technologies, as discussed earlier. For instance, the 7 nm 2-bit adder power at a 0.8 V supply is 7.4496 µW, while it is 4.3932 µW at the nominal supply voltage of this technology node (V_DD = 0.7 V).
PDP is a key metric in evaluating any digital circuit as it indicates the energy consumption and hence battery life for portable devices. PDP trends also indicate improvement of energy consumption with technology scaling from 20 nm down to 14 nm.
Although a higher supply voltage (V_DD = 0.8 V in this case) increases power consumption at the 10 nm and 7 nm technologies, the overall PDP is enhanced. For instance, the 7 nm 2-bit adder PDP at a 0.8 V supply is 600.773 aJ, while it is 669.304 aJ at the nominal supply voltage of this technology node (V_DD = 0.7 V), which is equivalent to a 10.24% energy reduction.
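The PDP figures quoted for the 7 nm 2-bit adder can be cross-checked directly from the reported delay and power values. The short sketch below does this arithmetic; the function name is a hypothetical helper, and the only inputs are the numbers given in the text.

```python
def pdp_aj(delay_ps, power_uw):
    """Power delay product in attojoules (1 ps * 1 uW = 1e-18 J = 1 aJ)."""
    return delay_ps * power_uw


pdp_nominal = pdp_aj(152.35, 4.3932)  # 7 nm 2-bit adder at V_DD = 0.7 V
pdp_boosted = pdp_aj(80.645, 7.4496)  # 7 nm 2-bit adder at V_DD = 0.8 V
reduction = 100.0 * (pdp_nominal - pdp_boosted) / pdp_nominal
print(f"{pdp_nominal:.3f} aJ, {pdp_boosted:.3f} aJ, {reduction:.2f}% lower")
```

The delay drops by roughly 47% while the power rises by roughly 70%, so the product still decreases, which is why the higher supply voltage improves the energy per operation in this case.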
Cascaded FFs chain benchmark
The cascaded FFs chain consists of a path of three cascaded FFs. It is formed by driving one of the first BLE's inputs, connecting the first BLE's output to one of the inputs of the second BLE, and connecting the second BLE's output to one of the inputs of the third BLE. Simulations are run at a frequency of 200 MHz with a phase difference of 400 ps from the FPGA cluster inputs.
Delay, power consumption, and PDP trends with technology scaling of the benchmark circuit are presented below in Figs. 31-33. 
The performance of the cascaded FFs chain is expected to be worse than that of the adder and NAND circuits, since the FFs trigger on clock edges. With technology scaling, its performance follows the same trend (enhanced from 20 nm down to 14 nm); for instance, the 14 nm technology node is 3% faster than the 20 nm node. Power consumption is also reduced with technology scaling as a result of supply voltage scaling.
The PDP of the cascaded FFs chain reaches its optimum value at the 10 nm technology node: although the 14 nm node has better performance, the 10 nm node consumes less power. PDP also improves with technology scaling.
Performance evaluation of FinFET-based FPGA cluster
We evaluated the performance of the tri-gate FinFET-based FPGA cluster based on the metrics indicated in the simulation setup section:
Operations delay
Delay is an essential parameter in evaluating the performance of any digital circuit. With technology scaling, the delay decreases continuously as a result of shrinking the channel length, despite the scaling of the supply voltage, which usually degrades the delay. The FPGA cluster's performance is thus enhanced with technology scaling. For instance, the speed (performance) of the 7 nm 2-bit adder circuit is 15% higher than its value at 20 nm.
Power consumption
Power dissipation is the major metric for low-power designs, and there has been a recent surge of interest in low-power devices and design techniques. The power dissipation decreases continuously as the technology is scaled down, as a result of shrinking the channel length and scaling the supply voltage. For instance, the power consumption of the 7 nm cascaded flip-flops chain circuit is 41% lower than its value at 20 nm.
Power delay product
Since power and delay always trade off against each other, PDP is a key metric in circuit evaluation. PDP is enhanced with technology scaling from 20 nm to 14 nm. For instance, the PDP of the 7 nm 2-bit adder circuit is 43% lower than its value at 20 nm.
Some design insights based on nominal simulations
Power consumption is reduced with technology scaling from 20 nm down to 10 nm; however, it increases at the 7 nm technology node due to the large static power of the SRAMs at that node.
Cluster speed increases with technology scaling from 20 nm down to 14 nm but degrades beyond 14 nm. While alternative trajectories with a higher V_DD would lead to improved performance, this would come at the cost of reduced power scaling.
PDP decreases with technology scaling from 20 nm down to 14 nm, which motivates looking for alternative scaling options such as high-mobility channels 25 and/or gate-all-around (GAA) nanowires 26 to continue technology scaling beyond the 14 nm technology node.
Simulation results considering variations (on 2-bit adder benchmark as a case study)
Impact of threshold voltage variations
The simulation results indicate that the percentage variation of the average power with threshold voltage variation increases as we scale down the FinFET technology node. Figure 34 shows the percentage variation of the average power for three different percentage changes in threshold voltage, for all the technology nodes included in the study. For each node, the percentage variation of the average power increases as the threshold voltage change goes from -6% to -18%, since the current decreases with increasing threshold voltage. 27 The percentage variations of PDP with threshold voltage variation are reported in Fig. 35. They increase with the downscaling of the FinFET technology nodes, and the PDP chart follows the same trend as the power variation percentage across technology nodes.
Hence, the power variation is considered the dominant contributor to the PDP, compared to delay, owing to the larger percentage variation of the average power. Also, the percentage variation of PDP decreases as the threshold voltage change goes from -18% to -6%.
Impact of temperature variations
The simulation results show that the percentage variation of the average power with temperature variation increases as we scale down the FinFET technology node. Figure 36 shows the percentage variation of the average power for three different percentage changes in temperature, for all the technology nodes included in the study. For each node, the percentage variation of the average power increases as the temperature change goes from 100% to 300%.
The percentage variations of PDP with temperature variation are reported in Fig. 37. They increase with the downscaling of the FinFET technology nodes, and the PDP chart follows the same trend as the power variation percentage across technology nodes.
Hence, the power variation is considered the dominant contributor to the PDP, compared to delay, owing to the larger percentage variation of the average power.
Design insights based on threshold voltage variations
In our study, we defined a targeted yield of 99.87%, for which we determined the design constraints of the different performance metrics. This targeted yield represents the 3σ value, or three standard deviations from the mean, for a particular technology node. The mean value here is the nominal value (the metric value at zero percentage change in the threshold voltage for this node), and σ is obtained by calculating the standard deviation of each metric's values over the different threshold voltage variation percentages from -18% to +18% in 1% steps (a total of 37 corners, including the nominal condition). Figures 38-40 show the design constraint values for the average power, delay, and PDP for all the technology nodes, calculated as ±3σ. The large gap between the design constraints in the power and PDP curves, starting at the 14 nm node and growing down to the 7 nm node, emphasizes the further increase in variations with technology scaling, as previously mentioned.
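The constraint calculation described above can be sketched as follows. The metric values here are produced by a made-up linear model purely so that the example runs; they are not simulation results. The use of the population standard deviation and the helper names are also assumptions. The sketch additionally checks that a one-sided 3σ limit of a normal distribution corresponds to a yield of about 99.87%.

```python
import math
import statistics


def one_sided_yield(k):
    """Fraction of a normal population lying below mean + k * sigma."""
    return 0.5 * (1.0 + math.erf(k / math.sqrt(2.0)))


def three_sigma_limits(nominal, samples):
    """Design constraints taken as nominal -/+ 3 sigma of the corner values."""
    sigma = statistics.pstdev(samples)  # population std. dev. (assumption)
    return nominal - 3.0 * sigma, nominal + 3.0 * sigma


# Threshold-voltage corners from -18% to +18% in 1% steps (37 corners).
corners = list(range(-18, 19))

# Hypothetical metric model: power falls 1% per 1% rise in threshold voltage.
nominal_power = 1.0  # arbitrary units
power = [nominal_power * (1.0 - 0.01 * c) for c in corners]

low, high = three_sigma_limits(nominal_power, power)
yield_pct = round(100.0 * one_sided_yield(3.0), 2)
print(len(corners), yield_pct, round(low, 4), round(high, 4))
```

In practice, the `power` list would be replaced by the simulated values of a given metric at each of the 37 corners for one technology node, and the procedure repeated per metric and per node.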
Conclusion
The performance of a FinFET-based FPGA cluster, based on predictive technology models (PTM-LSTP), is evaluated with technology scaling from 20 nm down to 7 nm. First, we evaluated some of the components that make up the FPGA (MUXs and FFs). The results show that, with technology scaling, the power and PDP decrease, and the delay improves down to the 14 nm technology node. However, the sensitivity of the power, delay, and PDP to threshold voltage variations increases with technology scaling. The FPGA cluster evaluation is then carried out using three benchmark circuits: a 2-bit adder, a 4-bit NAND, and a cascaded FFs chain. Nominal simulations of these benchmarks show that, with technology scaling, the power decreases down to the 10 nm technology node and the PDP improves down to the 14 nm technology node. However, the sensitivity of the power and PDP to threshold voltage variations increases with technology scaling. Also, power and PDP trends are enhanced by increasing the threshold voltage; on the other hand, performance (speed) is degraded. The results show that FPGA cluster performance is enhanced with technology scaling; however, beyond the 14 nm node and down to 7 nm, clear performance degradation is observed. This degradation is a result of scaling other parameters besides the channel length. The impact of a given range of threshold voltage and temperature variations on the basic cluster performance metrics is reported for the 2-bit adder benchmark circuit.
The results show that the variations of the performance metrics with respect to threshold voltage and temperature increase with technology scaling: both the average power variations and the PDP variations increase, while the delay variation does not follow a clear trend. Some design insights and constraints for the performance metrics are investigated and proposed to designers in order to achieve a targeted yield of 99.87% with technology scaling. There is a large difference between the design constraint values for power and PDP, starting at the 14 nm node and growing down to the 7 nm node, which emphasizes the further increase in variations with technology scaling. The evaluation results may help researchers extend the study by using the cluster built here, together with its associated routing channels and inter-cluster routing, to study the performance of a FinFET-based FPGA tile.
