A real-time multiprocessor chip paradigm, also called a Network-on-Chip (NoC), offers a promising architecture for future systems-on-chip. Although Double Tail Sense Amplifiers (DTSAs) are widely used in architectural approaches, the conventional DTSA-based transceiver consumes more energy and incurs more latency than intended under heavy traffic. To overcome this limitation, this research designs the Variable Energy-aware sense amplifier Link for Asynchronous NoC (VELAN), which combines Variable DTSA circuitry (V-DTSA) with a transceiver. The V-DTSA circuitry comprises a bootable DTSA (B-DTSA), a bootable clock-gating DTSA (BCG-DTSA), a Graph theory based Traffic Estimator (GTE), and a controller. Depending on the traffic rate, the controller activates the appropriate DTSA module and transfers information to the receiver. The proposed VELAN design is evaluated in TSMC 90 nm technology, showing a 6.157 Gb/s data rate, 0.27 W total link power, and 354 ps latency for single-stage operation.
Introduction
NoC is a booming area for designing applications such as multimedia, telecommunications, and real-time tasks [1]. Previous research has mainly focused on low power, high speed, and scalability in NoC [2]. Algorithmic [3] and architectural models [4] have been developed and implemented in NoC to improve performance beyond existing designs. Recent NoC designs have made considerable progress at the architectural level by introducing external or internal sense amplifiers (SAs) in on-chip communication [5]. Adding a pre-emphasis capacitance (PEC) to the transmitter section (TXS) increases speed and reduces energy in on-chip communication, but it requires DC bias circuits at the receiver section (RXS). To overcome this issue, a voltage sense amplifier was introduced and tested in a 90 nm CMOS cross-coupled circuit [6]. Because the benefit of a voltage SA is hard to realize in small circuits, it was refined into the Double Tail Sense Amplifier (DTSA). The DTSA-based transceiver consists of a PEC at the transmitter and a DTSA at the RXS [7]. In a recent paper [8], we presented a transceiver with a Reconfigurable DTSA (R-DTSA) to improve performance. Both [7] and [8] achieved improvements in data rate and reductions in link power. In this paper, we concentrate on improving latency by adapting the bootable concept to the DTSA. The bootable concept combines clock enable [9] and clock gating (CG) [10]. Low-power models of this kind have been developed and implemented in many real-time applications; for example, a CG low-power design approach at RTL in TSMC 45 nm CMOS is tested in [11]. CMOS VLSI design has produced real working chips that rely on controlled charge recovery to operate at significantly lower power dissipation than their conventional counterparts. Energy recovering circuits [10] are applied in microcontrollers, memory devices, display drivers, grouped clock networks, and other real-time applications.
The CG sense-amplifier flip-flop (CG-SAFF) [12] exhibits high speed and low energy; the switching activity and delay of various flip-flops are compared with those of the CG-SAFF. In [13], performance improvements in networks are achieved through network traffic modeling based on synthetic traffic. A traffic estimator and generator are introduced for QoS in [14] to generate and estimate real-time traffic data in on-chip communication. In the proposed VELAN design, we follow the traffic generator and the Graph Theory based Traffic Estimator (GTE) of [8]. To achieve higher performance under various traffic conditions in on-chip networks, the energy recovery clocking concept [15] was introduced into the DTSA, yielding the M-DTSA [8], and the clock gating concept [13] was applied to the M-DTSA, yielding the MCG-DTSA [8]. The VELAN design is validated in the same manner as in Schinkel et al. and Sakthivel et al. In the proposed analysis, the functionality of the individual and proposed sense amplifiers is validated. The proposed V-DTSA module is encapsulated into the receiver section of the router in the NoC. Performance metrics such as delay, data rate, energy, and static power consumption are observed and compared with conventional works.
The key contributions of this work are as follows.
• Schinkel et al. proposed the DTSA with a low-power, high-speed architecture.
• The conventional DTSA was not evaluated under high-traffic scenarios; here we examine the same DTSA under various traffic scenarios and find that traffic affects its performance. To solve this issue, R-DTSA was proposed.
• R-DTSA combines four DTSAs; it was validated under various traffic scenarios, and the results reported by Sakthivel et al. are better than those of Schinkel et al.
• Although R-DTSA is more advantageous than DTSA, some issues remain, such as higher latency, larger area, and higher cost. To overcome these issues and reduce the complexity of R-DTSA, we developed a new DTSA called V-DTSA.
• V-DTSA provides better performance than DTSA and R-DTSA in terms of all the parameters considered in the conventional works.
• We introduce a V-DTSA based NoC that performs better than the conventional DTSA and R-DTSA under various traffic scenarios.
• To construct the V-DTSA, we applied the bootable concept to both the DTSA and CG-DTSA circuits.
• As in our previous R-DTSA work, we use the same Graph theory based Traffic Estimator (GTE).
• The top-level modules of V-DTSA are B-DTSA, BCG-DTSA, GTE, and the controller.
• The first part of this work analyzes various sense amplifiers and selects suitable DTSAs for performance comparison.
• This work aims to produce a new latency-aware NoC design based on V-DTSA under various traffic scenarios.
• Performance parameters such as delay, data rate, power, energy, and area are observed and reported in Section 4.
The rest of this paper is organized as follows. Section 2 addresses the system model. Section 3 discusses the proposed work and its module details. Section 4 presents the results of the various architectures. Finally, Section 5 concludes the paper.
System Model
As shown in Figure 1, a capacitance is used in the TXS to reduce power dissipation. In NoC circuitry, communication disturbances occur because of noise and crosstalk [16]; a transceiver with a Differential Interconnect Twist (DIT) provides a substantial performance improvement. In the early stage, bidirectional interconnects are used, and an EM field solver is used to analyze the interconnects. As in [7] [8], a 1.2 V, 6M CMOS technology is used for the interconnects. Table 1 summarizes the concepts involved in VELAN and compares them with existing designs.
Proposed System
Figure 1 compares the conventional transceiver configuration with the proposed one. The proposed VELAN design uses V-DTSA circuitry to reduce both the power consumption of data transmission and the latency. The proposed work consists of four stages: selection, analysis, design, and performance comparison.
In the first stage, a suitable SA is selected based on a power comparison in both sleep and active modes [8]; M-DTSA and MCG-DTSA are retained for further processing. After applying clock enable [9], both DTSAs are refined into B-DTSA and BCG-DTSA. In the second stage, the selected SAs (DTSA, B-DTSA, and BCG-DTSA) are subjected to high traffic (HT) and low traffic (LT), and their energy consumption is compared. In the third stage, we design the V-DTSA circuitry for the complete transceiver. Finally, we compare our results with [7] [8]. The block diagram of the VELAN design is shown in Figure 2. The proposed system consists of the PEC with the TXS, the GTE [8], the V-DTSA circuitry, and the RXS. The Graph theory based Traffic Estimator (GTE) [8] estimates the traffic rate of the transmitted data, and based on this estimate the traffic controller selects the corresponding DTSA in the V-DTSA circuitry.
Clock gating has been shown to perform best when the circuit contains many flip-flops (coarse grained), since its overhead is independent of the circuit size. In a fine-grained system (fewer flip-flops), clock enable achieves better energy conservation, since the power consumption of this option grows linearly with the number of flip-flops. Because clock enable activates only part of the circuit, it works better for partially active tasks; because clock gating activates the complete circuit, it works well for tasks that need the whole circuit. This has been experimentally validated on an FPGA platform by Oliver et al. Based on these modules, we constructed the proposed circuit.
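The trade-off described above can be illustrated with a toy cost model. This is a minimal sketch under stated assumptions, not the measured behavior reported by Oliver et al.: the clock-gating overhead is modeled as a fixed cost `cg_fixed_cost`, and the clock-enable overhead as `ce_cost_per_ff` times the number of flip-flops. Both constants and all function names are hypothetical, in arbitrary energy units.

```python
def clock_enable_cost(n_flip_flops: int, ce_cost_per_ff: float) -> float:
    """Hypothetical clock-enable overhead: grows linearly with flip-flop count."""
    return ce_cost_per_ff * n_flip_flops


def preferred_scheme(n_flip_flops: int, cg_fixed_cost: float, ce_cost_per_ff: float) -> str:
    """Pick the scheme with the lower modeled overhead.

    Clock gating is modeled as a fixed overhead independent of circuit size,
    so it wins once the circuit is large (coarse grained).
    """
    if clock_enable_cost(n_flip_flops, ce_cost_per_ff) < cg_fixed_cost:
        return "clock_enable"
    return "clock_gating"


def crossover_flip_flops(cg_fixed_cost: float, ce_cost_per_ff: float) -> float:
    """Flip-flop count at which both modeled overheads are equal."""
    return cg_fixed_cost / ce_cost_per_ff


# Hypothetical constants (arbitrary units), chosen only for illustration.
print(crossover_flip_flops(50.0, 0.5))   # 100.0
print(preferred_scheme(20, 50.0, 0.5))   # clock_enable
print(preferred_scheme(500, 50.0, 0.5))  # clock_gating
```

The linear-versus-constant split is the only thing the model encodes; real overheads would come from synthesis and power analysis.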
Proposed Work and Its Module Details
This section discusses the energy recovery clocking circuit, the clock gating circuit, the bootable concept, the low swing transmitter, the optimal swing receiver, the V-DTSA components, the graph theory based traffic estimator and controller, and the complete transceiver for the proposed DTSA.
Energy Recovery Clocking (ERC) Circuit
Mahmoodi et al. introduced an energy recovery clocking technique for flip-flops that operate with single-phase sinusoidal clocks. In an ERC circuit, an AC supply voltage is used to recycle the energy stored on the clock capacitances, while the standard supply voltage is used for the rest of the circuit. The ERC schematic used for energy recovery clock generation follows [12]. The energy recovery technique is implemented in the DTSA circuit to reduce power in the NoC architecture.
Clock Gating (CG) Circuit
Tirumalashetty et al. introduced a clock gating technique in sequential circuits for low-power design. In a CG circuit, a universal logic gate masks the local clock signal, decoupling the energy recovery clock from the remaining capacitances in the fan-out circuit. The CG schematic used for clock gating generation follows [11]. An energy loss occurs due to non-adiabatic switching between the device oscillators and the resistance of the clock circuit; this loss can be eliminated by applying the clock gating technique in the DTSA circuit.
Bootable Concept
In general, a DTSA operates in precharge and evaluation phases. The slow rising and falling transitions of a resonant clock cause these two phases to overlap, which results in short-circuit current. The main purpose of the bootable clocking scheme is to reduce short-circuit power by switching on the precharge transistors for only a portion of the clock period.
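To see why a slow resonant clock causes overlap, the sketch below assumes a sinusoidal clock $v(t) = \frac{V_{DD}}{2}\left(1 - \cos\frac{2\pi t}{T}\right)$ driving a complementary pair whose NMOS and PMOS both conduct while $V_{tn} < v < V_{DD} - |V_{tp}|$. It compares the fraction of each period spent in this short-circuit window for the sine clock and for a clock with short linear edges. The threshold values (1.2 V supply, 0.3 V thresholds) and function names are hypothetical, and the model ignores body effect and device sizing; it illustrates the overlap mechanism, not the authors' bootable circuit.

```python
import math


def sine_clock_overlap_fraction(vdd: float, vtn: float, vtp_abs: float) -> float:
    """Fraction of one period in which v(t) = vdd/2 * (1 - cos(2*pi*t/T))
    lies strictly between vtn and vdd - vtp_abs (both devices conducting)."""
    lo, hi = max(vtn, 0.0), min(vdd - vtp_abs, vdd)
    if hi <= lo:
        return 0.0
    # On the rising half (phase 0..pi) the clock crosses lo and hi at these
    # phases; the falling half is symmetric, so the total fraction is gap / pi.
    theta_lo = math.acos(1 - 2 * lo / vdd)
    theta_hi = math.acos(1 - 2 * hi / vdd)
    return (theta_hi - theta_lo) / math.pi


def ramp_clock_overlap_fraction(vdd: float, vtn: float, vtp_abs: float,
                                edge_time: float, period: float) -> float:
    """Same window for a clock with linear rise/fall edges of duration edge_time."""
    if 2 * edge_time > period:
        raise ValueError("rise plus fall time cannot exceed the period")
    lo, hi = max(vtn, 0.0), min(vdd - vtp_abs, vdd)
    if hi <= lo:
        return 0.0
    time_per_edge = edge_time * (hi - lo) / vdd
    return 2 * time_per_edge / period


# Hypothetical values: 1.2 V supply, 0.3 V thresholds, edges at 5% of the period.
print(round(sine_clock_overlap_fraction(1.2, 0.3, 0.3), 4))             # 0.3333
print(round(ramp_clock_overlap_fraction(1.2, 0.3, 0.3, 0.05, 1.0), 4))  # 0.05
```

Under these assumptions the sine clock spends about a third of every period in the overlap window, which is the short-circuit exposure that limiting the precharge interval is meant to cut.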
Low Swing Transmitter
In a low swing transmitter, large transmitters are required to drive the bus with adequate speed, which reduces transmitter efficiency. To overcome this issue and achieve a high data rate, Schinkel et al.
Optimal Swing Receiver
The most commonly used data receivers in a low swing transceiver are clocked comparators and sense amplifiers. Comparators regenerate the voltage to full swing, whereas a sense amplifier is a very fast circuit that regenerates the voltage, samples the incoming data, and realigns it with the clock signal at the receiving end. Splitting the sense amplifier into two tails avoids a tall transistor stack; this double-tail sense amplifier (DTSA) is used in the receiver section of the proposed transceiver.
V-DTSA
Because the DTSA is a basic building block of the clock distribution network, which plays a vital role in NoC, its latency and power dissipation must be carefully designed to achieve low power and small latency [17].
To maximize the power reduction of data transmission in the NoC architecture, the proposed work presents a variable energy-aware sense amplifier design with V-DTSA circuitry. The purpose of the V-DTSA circuit is to switch between DTSA modules according to the traffic rate of the data. It consists of the Graph theory based Traffic Estimator [8], a controller, and two DTSA modules, B-DTSA and BCG-DTSA. The DTSA with the ERC concept implemented is called Modified-DTSA (M-DTSA). The clock gating technique is implemented in the M-DTSA [8] module by adding a logical NOR gate; the resulting circuit is called Clock Gating Modified-DTSA (MCG-DTSA) [8]. After applying clock enable [9], these two circuits become B-DTSA and BCG-DTSA. The functional diagram of the V-DTSA module and its simulation result are shown in Figure 3; the module consists of transistors S1-S12 with the S-pulse signal, the NOR gating, and the controller. The GTE estimates the traffic rate of the data with respect to the specification given in Table 3 and then sends a control signal to the controller, which activates the DTSA module that matches the traffic rate (LT/HT). If the input data is estimated as low traffic (LT), the controller applies the S-pulse (ERC output) as the input to the DTSA circuit; for HT, it applies the output of the NOR gate. In addition, the transistor dimensions of the proposed double-tail sense amplifier are optimized relative to each other to obtain the lowest offset standard deviation per unit of power. Width scaling (also called impedance or area scaling) can then be applied uniformly to all transistors to bring the offset standard deviation to the required value [7] while preserving the original speed characteristics.
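The width-scaling step can be made concrete with a first-order sketch. It assumes Pelgrom-style mismatch, where the offset standard deviation scales as $1/\sqrt{WL}$, so multiplying every transistor width by a factor $k$ divides the offset by $\sqrt{k}$, while currents, capacitances, and hence power grow by roughly $k$ (which is why speed is preserved to first order). The function names and the 20 mV, 10 mV, and 50 µW figures are hypothetical.

```python
import math


def width_scale_factor(sigma_now: float, sigma_target: float) -> float:
    """Factor k such that scaling all widths by k gives sigma_now / sqrt(k) == sigma_target."""
    if sigma_now <= 0 or sigma_target <= 0:
        raise ValueError("offset standard deviations must be positive")
    return (sigma_now / sigma_target) ** 2


def scaled_offset(sigma_now: float, k: float) -> float:
    """First-order offset standard deviation after scaling every width by k."""
    return sigma_now / math.sqrt(k)


def scaled_power(power_now: float, k: float) -> float:
    """First-order power after impedance scaling: currents and capacitances grow by k."""
    return power_now * k


# Hypothetical example: 20 mV offset sigma, 10 mV required, 50 uW baseline power.
k = width_scale_factor(20e-3, 10e-3)
print(k)                        # about 4
print(scaled_offset(20e-3, k))  # about 0.01 V
print(scaled_power(50e-6, k))   # about 200 uW
```

Halving the offset therefore costs roughly four times the power under this model, which is why the relative sizing is optimized first and uniform scaling is applied only as far as the offset requirement demands.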
Graph Theory Based Traffic Estimator and Controller
The optimal weight equation used in the TE design follows [15]. The GTE [8] estimates the traffic rate, compares it with a given threshold, and then selects the corresponding DTSA module in the V-DTSA circuitry via the controller. To reduce the complexity of [8], two DTSA modules are eliminated and the four traffic states are merged into two, namely HIGH (HT) and LOW (LT).
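The estimator-plus-controller flow can be sketched as follows. The graph representation used by the GTE is not detailed here, so this is a hypothetical stand-in: the communication graph is a dictionary of directed edges weighted by flit injection rate, the load at a receiving node is the sum of its incoming edge weights, and a load strictly above `threshold` is classified as HT. The HT to BCG-DTSA (NOR-gated clock) and LT to B-DTSA (S-pulse) mapping follows the text; the names `Traffic`, `estimate_traffic`, and `select_dtsa` are illustrative only.

```python
from enum import Enum


class Traffic(Enum):
    LOW = "LT"
    HIGH = "HT"


def estimate_traffic(edges: dict, node: str, threshold: float) -> Traffic:
    """Classify the load arriving at `node` as LT or HT.

    `edges` maps (source, destination) pairs to flit rates (flits/cycle);
    this representation is an assumption, not the GTE of [8].
    """
    load = sum(rate for (_src, dst), rate in edges.items() if dst == node)
    return Traffic.HIGH if load > threshold else Traffic.LOW


def select_dtsa(traffic: Traffic) -> tuple:
    """Controller: return (DTSA module, clock source) for the estimated traffic."""
    if traffic is Traffic.HIGH:
        return ("BCG-DTSA", "NOR-gated clock")
    return ("B-DTSA", "S-pulse (ERC output)")


# Hypothetical graph: two sources feed receiver R, which forwards to C.
edges = {("A", "R"): 0.20, ("B", "R"): 0.25, ("R", "C"): 0.40}
mode = estimate_traffic(edges, "R", threshold=0.30)
print(mode.value, select_dtsa(mode))  # HT ('BCG-DTSA', 'NOR-gated clock')
```

Collapsing the decision to two states keeps the controller a single comparison followed by a two-way select, which mirrors the complexity reduction relative to the four-state scheme of [8].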
Complete Transceiver
The complete transceiver consists of a transmitter with pre-emphasis capacitance connected through the DIT [7] [8] to the receiver with the V-DTSA module. The V-DTSA circuitry consists of B-DTSA and BCG-DTSA, which receive the input data through the bus. The traffic estimator classifies the traffic as low or high using the graph theory method and enables the suitable DTSA by sending a select signal to the MUX. All other techniques are adopted unchanged from [8]. The complete transceiver is shown in Figure 4, and its simulation results are shown in Figure 5.
Results and Discussion
Performance Measure-Analytical Model
To measure the performance of the proposed work, we use the following metrics, which are widely used for performance measurement in NoC: delay, data rate, energy, static power consumption, average latency, throughput, energy per useful flit, and switching factor. The definitions of these metrics are summarized here.
1) To measure the latency of flows, the packet waiting periods at routers must be evaluated.
2) Power consumption and link power are computed recursively for every communication path, starting from the terminus section.
3) The energy consumption of each core in the router is determined.
4) The data rate is measured over all communication paths beginning at the terminus section.
5) Average latency is the time interval between stimulation and response.
6) Throughput is the rate at which data can be processed.
7) Energy per useful flit is obtained with respect to the number of flits.
8) Switching factor is the probability of output switching.
Latency
For each node $N_i$, the latency $L_{noc}^{N_i}$ is defined using network calculus, following Bhat et al. [18] and Sakthivel et al. [19], as follows.
LnocNi
where $S_{wit}$ is the service bandwidth and $T_{lat}$ is the latency.
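As an illustration of the network-calculus approach (not necessarily the exact form used in [18] and [19]), the sketch below computes the textbook delay bound for a token-bucket arrival curve $\alpha(t) = b + rt$ served by a rate-latency service curve $\beta(t) = R\,(t - T)^+$, where $R$ plays the role of the service bandwidth $S_{wit}$ and $T$ that of $T_{lat}$. The bound is $T + b/R$ when $r \le R$, and a chain of nodes behaves like a single server with rate $\min_i R_i$ and latency $\sum_i T_i$. The class and function names and the two-hop numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateLatencyServer:
    rate: float     # service rate R (e.g. flits/cycle), analogous to S_wit
    latency: float  # fixed latency T (e.g. cycles), analogous to T_lat


def concatenate(servers: list) -> RateLatencyServer:
    """A tandem of rate-latency servers is equivalent to one rate-latency
    server with the minimum rate and the summed latency."""
    return RateLatencyServer(min(s.rate for s in servers),
                             sum(s.latency for s in servers))


def delay_bound(burst: float, arrival_rate: float, server: RateLatencyServer) -> float:
    """Worst-case delay of a token-bucket flow (burst b, rate r): T + b / R."""
    if arrival_rate > server.rate:
        raise ValueError("arrival rate exceeds service rate; delay is unbounded")
    return server.latency + burst / server.rate


# Hypothetical two-hop path, in flits and cycles.
path = concatenate([RateLatencyServer(2.0, 3.0), RateLatencyServer(1.0, 2.0)])
print(delay_bound(burst=4.0, arrival_rate=0.5, server=path))  # 9.0
```

Concatenating before computing the bound means the burst term is paid only once along the path, which gives a tighter result than summing per-hop bounds.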
Power
1) Link Power
In Bhat et al. [18] and Sakthivel et al. [19], a practical power model is used for a NoC router. The model considers the cross-coupling effect in N-wire interconnects. The total power of an N-wire link per unit length is calculated as follows:
where $N_{wire}$ is the total number of wires in the link; $C_{self}$ and $C_{coupl}$ are the self capacitance of a wire and its coupling capacitance to neighboring wires, respectively; $\alpha_{saw}$ is the switching activity on a wire; $\alpha_{Cou}$ is the switching activity with respect to the adjacent wires; $\tau$ is the short-circuit period; and the two remaining $N$ terms are the numbers of links and switches, respectively, involved in transporting the application flows. The total energy consumption can be calculated using network calculus arrival curves, as in Bhat et al. [18] and Sakthivel et al. [19].
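As a rough illustration of how the self and coupling terms combine, the sketch below assumes the generic switched-capacitance form $P = N_{wire}\,(\alpha_{self} C_{self} + \alpha_{coupl} C_{coupl})\,V_{DD}^2 f$, which is not necessarily the exact model of [18] and [19]. It covers only the dynamic self and coupling terms and omits the short-circuit contribution associated with $\tau$. With capacitances given per unit length, the result is power per unit length. Parameter names and example values are hypothetical.

```python
def link_dynamic_power(n_wires: int, c_self: float, c_coupl: float,
                       alpha_self: float, alpha_coupl: float,
                       vdd: float, freq: float) -> float:
    """Dynamic power of an N-wire link under an assumed generic C*V^2*f model.

    c_self, c_coupl: self and coupling capacitance per wire (F, or F/mm for
    power per unit length); alpha_self, alpha_coupl: switching activities.
    The short-circuit term is deliberately omitted.
    """
    per_wire_cap = alpha_self * c_self + alpha_coupl * c_coupl
    return n_wires * per_wire_cap * vdd ** 2 * freq


# Hypothetical 32-wire link: 100 fF/mm self, 50 fF/mm coupling, 1.2 V, 500 MHz.
p = link_dynamic_power(32, 100e-15, 50e-15, 0.5, 0.25, 1.2, 500e6)
print(f"{p * 1e3:.3f} mW/mm")  # 1.440 mW/mm
```

Keeping the coupling activity as a separate factor makes it easy to see how a differential or twisted interconnect, which changes the effective coupling switching, would shift the link power under this model.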
Data Rate
For a FIFO buffer of known capacity, as in Sakthivel et al. [19], that receives a data burst of assumed size, the arrival data rate is defined as follows:
$$\text{DataRate}\,[\text{bps}] = \sum_{i=1}^{N} \frac{P_{size,i}}{P_{interval\_time,i}}$$
where $P_{size}$ is the packet size, $P_{interval\_time}$ is the packet interval time (the subscript $i$ indexing the input), and $N$ is the total number of input flits. The smallest data unit is a bit in the analytical model and a frame of bounded size in the simulation model.
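A small sketch of the arrival data-rate calculation, assuming each of the $N$ inputs contributes $P_{size}/P_{interval\_time}$ and the contributions are summed. The function name and the example packet sizes and intervals are hypothetical.

```python
def arrival_data_rate(packet_sizes_bits: list, interval_times_s: list) -> float:
    """Aggregate arrival rate in bit/s: sum over inputs of P_size / P_interval_time."""
    if len(packet_sizes_bits) != len(interval_times_s):
        raise ValueError("each packet size needs a matching interval time")
    if any(t <= 0 for t in interval_times_s):
        raise ValueError("interval times must be positive")
    return sum(size / t for size, t in zip(packet_sizes_bits, interval_times_s))


# Hypothetical inputs: 64-bit packets every 20 ns and 128-bit packets every 40 ns.
rate = arrival_data_rate([64, 128], [20e-9, 40e-9])
print(f"{rate / 1e9:.1f} Gb/s")  # 6.4 Gb/s
```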
Average Flit Latency
Average Flit Latency (AFL) is defined as the ratio of the total flit delay to the number of flits received, as given in Equation (6):
$$\text{AFL} = \frac{\text{Total Flit Delay}}{M} \tag{6}$$
where $M$ is the number of flits received.
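For concreteness, the average flit latency reduces to a mean over the received flits. A minimal sketch, assuming the per-flit delays are available as a list in cycles; the helper name and delay values are hypothetical.

```python
def average_flit_latency(flit_delays: list) -> float:
    """AFL = total flit delay / M, where M is the number of flits received."""
    if not flit_delays:
        raise ValueError("no flits received")
    return sum(flit_delays) / len(flit_delays)


# Hypothetical per-flit delays in cycles.
print(average_flit_latency([12, 15, 9, 14]))  # 12.5
```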
Average Throughput
Average Throughput (AT) is defined as the ratio of $P$ to the number of IP cores, as given in Equation (8):
$$AT = \frac{P}{N} \tag{8}$$
$P$ is defined as the ratio of the total number of flits received to the total simulation time, as given in Equation (9):
$$P = \frac{\text{Total Flits Received}}{\text{Total Simulation Time}} \tag{9}$$
where $N$ is the number of IP cores.
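A minimal sketch of the two throughput quantities, assuming simulation time is counted in cycles. The 1,200,000-cycle run length matches the experimental setup described later; the flit count and number of IP cores are hypothetical, as are the function names.

```python
def flit_reception_rate(total_flits_received: int, total_simulation_time: float) -> float:
    """P: total flits received divided by total simulation time (Equation (9))."""
    if total_simulation_time <= 0:
        raise ValueError("simulation time must be positive")
    return total_flits_received / total_simulation_time


def average_throughput(total_flits_received: int, total_simulation_time: float,
                       n_ip_cores: int) -> float:
    """AT = P / N, with N the number of IP cores (Equation (8))."""
    if n_ip_cores <= 0:
        raise ValueError("need at least one IP core")
    return flit_reception_rate(total_flits_received, total_simulation_time) / n_ip_cores


# Hypothetical run: 960,000 flits over 1,200,000 cycles on 16 IP cores.
print(flit_reception_rate(960_000, 1_200_000))     # 0.8 flits/cycle
print(average_throughput(960_000, 1_200_000, 16))  # 0.05 flits/cycle/core
```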
Switching Factor
The switching factor (SF) is the ratio of the switched in-port count to the total simulation cycle count, as given in Equation (10):
$$SF = \frac{\text{Switched In-Port Count}}{\text{Total Simulation Cycle Count}} \tag{10}$$
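The switching factor is likewise a simple ratio. A minimal sketch, assuming the switched in-port events have already been counted over the run; the 1,200,000-cycle length matches the experimental setup, while the event count and function name are hypothetical.

```python
def switching_factor(switched_in_port_count: int, total_cycles: int) -> float:
    """SF = switched in-port count / total simulation cycle count (Equation (10))."""
    if total_cycles <= 0:
        raise ValueError("total cycle count must be positive")
    if switched_in_port_count < 0:
        raise ValueError("switch count cannot be negative")
    return switched_in_port_count / total_cycles


# Hypothetical count over a 1,200,000-cycle run.
print(switching_factor(300_000, 1_200_000))  # 0.25
```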
Experimental Section Analysis
To evaluate the performance of the proposed link, each component is modeled. In this experiment, the source router sends packets to the sink router through a FIFO located between the two routers. The NoC implementation starts with an RTL description of the DTSA components. To achieve power reduction, we focus on the bootable concept (clock gating). The RTL description used to evaluate clock gating is synthesized to a gate-level netlist with Synopsys Design Compiler [21], and the switching factor and power consumption are estimated from the resulting layout. The switching factors reported for the proposed work were obtained on an Intel® Core i3-2100 processor (3.1 GHz, LGA 1155). Each experiment runs for 1,200,000 simulation cycles. The power consumption of the interconnection network is extracted using 90 nm technology, and power analysis is carried out with Synopsys PrimeTime PX [21] to investigate the power consumption under a given traffic pattern. A conventional traffic approach cannot realistically reproduce every type of traffic that will traverse the network, but the GT-based traffic pattern [8] provides a reasonable measure of the performance of this method.
The synthesized NoC VHDL code is evaluated in 90 nm TSMC CMOS technology at a 500 MHz operating frequency, a supply voltage of 1.8 V, and a switching factor of 0.5. In the V-DTSA module, the controller is modeled and synthesized in 90 nm TSMC CMOS technology. To evaluate the performance of the proposed V-DTSA circuitry, comparisons are performed with recent works, including DTSA [7] and reconfigurable DTSA [8]. The sleep-mode and active-mode power consumption were tested with and without CG, and the results are presented in [8]. The power is compared with that of the DTSA modules Single-ended Conditional Capturing Energy Recovery (SCCER) [12], DCCER [12], Static Differential Energy Recovery (SDER) [12], Pulsed Flip-Flop (PFF) [22], M-DTSA [8], and MCG-DTSA [8]. The clock enable (bootable) concept is applied to the conventional DTSA circuits (M-DTSA [8] and MCG-DTSA [8]).
The mathematical expressions for the technical evaluation are similar to those in [20]. The energy consumption, delay, data rate, and static power consumption results are presented in Figures 6-9. The DTSA, R-DTSA, and V-DTSA circuitry results are estimated under HT and LT. Table 3 compares the overall parameters (energy consumption, static power, delay, and data rate) with existing work.
Overall, the VELAN design gives superior results to the conventional design. The conventional method achieves a latency of 300/1500 ps under single/five-stage operation. Under average traffic conditions, the latency of the proposed work (354/1770 ps) is better than that of [7] [8].
The following experimental parameters are used to measure NoC performance: Average Flit Latency (AFL), Average Throughput (AT), Switching Factor (SF), and energy per useful flit. These parameters are obtained using the mathematical equations in [20]. The flit rate is the rate at which packets are injected into the system, expressed in flits/node/cycle.
The dynamic power and leakage power are measured for different components, namely the traffic generator, the GT-based traffic estimator [8], the router, the input buffer, the output buffer, and the links, under the various approaches ([7], [8], and the proposed work). The results are presented in Tables 4 and 5, and the comparison plots are shown in Figure 10. These results indicate that the proposed work achieves lower power consumption than the works in [7] [8].
Throughput and average flit latency were evaluated against the traffic injection rate; the results are presented in Tables 6 and 7, and the comparison plots are shown in Figure 11. Throughput and average flit latency were also evaluated against the flit rate; the results are presented in Tables 8 and 9, and the comparison plots are shown in Figure 12. These results indicate that the proposed work achieves higher throughput and lower latency than the works in [7] [8]. Energy per useful flit was evaluated against the flit rate; the results are presented in Table 10, and the comparison plots are shown in Figure 13. The proposed work achieves lower energy consumption than the works in [7] [8].
Schinkel et al. estimated the bandwidth per cross-sectional area (BW/CSA). The proposed design uses differential wires, which operate at high speed with low swing. The technical specifications used for the proposed work are listed in Table 2; the DTSA link occupies only 4% of the 100 mm² tile area. Sakthivel et al. designed R-DTSA and reported its results; because R-DTSA combines four DTSAs, it occupies approximately 12%. In contrast, the proposed V-DTSA is similar in size to a single DTSA element but provides better performance than both DTSA and R-DTSA. It occupies 4% of the 100 mm² tile area, the same as DTSA and less than the R-DTSA based NoC. Therefore, the V-DTSA based NoC uses less area and costs less than the R-DTSA based NoC. The power consumption and latency are estimated through the synthesizable VHDL model in the Synopsys environment with 90 nm technology. Energy, static power, and dynamic power are measured and compared with [7] and [8], and the experimental results show that the VELAN design outperforms [7] and [8].
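Using the approximate tile-area fractions stated above (about 12% for R-DTSA versus 4% for V-DTSA), the relative area saving follows from simple arithmetic. This is a derived illustration rather than a separately reported result, and the function name is hypothetical.

```python
def relative_area_reduction(baseline_fraction: float, new_fraction: float) -> float:
    """Fraction of the baseline area saved by the new design."""
    if baseline_fraction <= 0:
        raise ValueError("baseline area fraction must be positive")
    return 1.0 - new_fraction / baseline_fraction


# R-DTSA at roughly 12% of the tile versus V-DTSA at 4%.
saving = relative_area_reduction(0.12, 0.04)
print(f"{saving:.1%}")  # 66.7%
```

Because the 12% figure is approximate, the resulting saving of about two thirds should be read as an estimate of the same precision.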
Conclusion
The proposed work can be summarized in four stages: selection, analysis, design, and performance comparison. In the first stage, a few circuits are selected from among various sense amplifiers to form the V-DTSA, and their power is compared in both active and sleep modes (M-DTSA and MCG-DTSA are selected and refined into B-DTSA and BCG-DTSA). In the second stage, energy is compared by applying LT (18 Gb/s, 113 fJ) and HT (12.8 GB/s, 164 fJ) traffic to the selected DTSA modules (DTSA, B-DTSA, and BCG-DTSA). This analysis shows that B-DTSA achieves the greater power reduction under LT and BCG-DTSA under HT. In the third stage, we designed the V-DTSA circuit with the GTE, the controller, and the DTSA modules; the GTE estimates the traffic rate and directs the controller to select B-DTSA for LT and BCG-DTSA for HT. In the final stage, the overall transceiver circuit (VELAN) achieves, under average traffic, a 6.157 Gb/s data rate, 0.27 W link power, and a latency of 440/2200 ps for single/five-stage operation. Compared with conventional methods, the VELAN design improves the data rate by 98.512% and reduces link power by 18.51%.
