Abstract: We propose a low-voltage, low-current interconnect architecture using buffered/pipelined spin-torque (ST) sensors to optimize the overall delay and energy consumption. Conventional techniques for reducing energy consumption on long interconnects involve low voltage swings on interconnects or current-mode interconnects. However, such techniques require power-consuming voltage converters or trans-impedance amplifiers at the receivers. ST-sensor-based receivers have recently been introduced that can operate without analog components at the receiver. As a result, their energy consumption is lower than that of existing techniques. However, the delay can be relatively high in these networks for long Cu-lines, since these methods do not accommodate conventional buffering schemes for delay minimization. Here, we propose the use of ST buffers in the line in addition to ST-sensing at the receiver. The buffers and sensors used in our design consist of a magnetic strip with two magnetic domains separated by a domain wall. The domain wall can be moved by a current flowing through an adjacent spin-Hall metal, which changes the resistance of the receiver. This resistance change is easily sensed using simple CMOS components. With the introduction of buffering for ST-sensor interconnects, the proposed method can be highly efficient in optimizing the energy-delay performance for long, global on-chip or off-chip lines. Our simulation results indicate that for a 10 mm line in 45 nm CMOS technology, the energy consumption with ST-sensing is about 2 percent that of full-swing, and about 4 percent that of low-swing, CMOS interconnects. Moreover, the delay is much lower than that of low-swing, and comparable to that of full-swing, CMOS designs.
I. INTRODUCTION
Energy dissipation in charging and discharging global interconnect wire capacitances has kept increasing with continued technology scaling [Rabaey 2009]. One promising alternative is to use a low voltage swing on the interconnect lines. However, this technique requires voltage converters at the receivers, which add power consumption and delay [Rabaey 2009]. Another alternative is current-mode signaling, which can greatly reduce RC losses in long lines [Bashirullah 2003]. However, the current-to-voltage conversion needed at the receiver for standard CMOS operation degrades performance, because power-hungry analog trans-impedance amplifiers are used for this purpose [Lee 2013]. Recently, an alternative scheme has been proposed in which current signals switch the magnetization state of a free-layer magnet at the receiver through domain-wall (DW) movement [Sharad 2013, Azim 2015]. DW movement is achieved by the combined action of spin-orbit torque (SOT) generated by the spin-Hall effect (SHE) and the Dzyaloshinskii-Moriya interaction (DMI) [Liu 2012, Emori 2013]. The magnetization state is sensed using magnetic tunnel junctions (MTJs) and simple digital CMOS components at the receiver. Because it avoids power-hungry analog transceivers, this technique is potentially very energy efficient. However, the proposed method does not use repeaters or buffers in the line to minimize the delay in long interconnect lines, because conventional repeaters are not compatible with this ultra-low-voltage signal transmission. In this work, in addition to using spin-torque (ST) sensing at the receiver, we also use ST-buffers in the interconnect lines to improve the delay and bandwidth. The long interconnect line is pipelined in segments using the ST-buffers, where the output of one stage drives the next. This technique reduces both delay and energy consumption, as we will demonstrate.
The rest of this letter is organized as follows. In Section II, we first present the details of the device operation including modeling and benchmarking with experimental results. The proposed ST receivers/buffers and interconnect circuits are presented in Section III. In Section IV, we present our results and compare the results with existing technologies. In order to better compare the performance, we simulate both full-swing and low-swing CMOS interconnects using IBM 45 nm technology. Our simulation results show significant improvement in energy-delay performance using the ST-buffered interconnects.
II. DEVICE OPERATION AND MODEL BENCHMARKING WITH EXPERIMENTS
We first explain the operation of the DW-based spin device that we use in the buffers (shown in Fig. 1). The ultrathin CoFe free layer with up and down magnetic domains lies on top of a spin-Hall metal (SHM) layer (Pt). Due to the strong DMI for this configuration, DWs in the CoFe layer are stabilized into the Néel type with fixed left-handed chirality at rest [Down-Right-Up as in Fig. 2(a) or Up-Left-Down as in Fig. 2(b)] [Emori 2013]. When a charge [Emori 2013]). Hence, the DW can be moved in either direction by altering the direction of current flow (using terminals V1 and V2 in Fig. 1). To characterize device-to-circuit operation, we use the mixed-mode simulation framework (electron transport and magnetization dynamics, from the device to the circuit level) proposed by Fong [2011]. The CoFe layer is used as the free layer of an MTJ (shown in Section III), the resistance of which is obtained from non-equilibrium Green's function (NEGF) based spin transport simulations [Fong 2011]. Subsequently, the resistance of the MTJ is used in an MTJ-SPICE model with 45 nm CMOS technology to evaluate the interconnect circuit operations (discussed in Section III). The charge current ($I_e$) flowing through the SHM is obtained from the SPICE simulations and the corresponding spin current ($I_s$) is calculated as [Liu 2011]
$$I_s = \theta_{sh} \frac{A_{MTJ}}{A_{SHM}} I_e \qquad (1)$$
where $\theta_{sh}$ is the spin-Hall angle, and $A_{MTJ}$ and $A_{SHM}$ are the cross-sectional areas of the MTJ and the SHM, respectively. The spin current from (1) is used with the generalized Landau-Lifshitz-Gilbert (LLG) equation to analyze the magnetization dynamics [Slonczewski 1996, Sun 2000]. We perform these magnetization dynamics simulations using the Mumax3 platform [Vansteenkiste 2014]. We first benchmark our simulations by matching the DW velocity as a function of SHM layer current density against previously reported micromagnetic simulations (shown in Fig. 3, parameters in Table 1). The DW velocity increases with increasing current density through the SHM layer.
As has been shown experimentally, the DW moves even without the application of an external magnetic field [Emori 2013 ]. 
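The charge-to-spin conversion in (1) can be illustrated with a short numerical sketch. All device values below are hypothetical placeholders chosen only for illustration (they are not the entries of Table 1 or Table 2), and the function names are ours.

```python
def spin_current(i_e, theta_sh, a_mtj, a_shm):
    """Spin current from Eq. (1): I_s = theta_sh * (A_MTJ / A_SHM) * I_e."""
    return theta_sh * (a_mtj / a_shm) * i_e


def shm_current_density(i_e, a_shm):
    """Charge current density in the SHM cross-section (A/m^2)."""
    return i_e / a_shm


if __name__ == "__main__":
    # Hypothetical geometry (assumption, not taken from the tables):
    # free-layer footprint 60 nm x 20 nm, SHM cross-section 20 nm x 3 nm.
    a_mtj = 60e-9 * 20e-9
    a_shm = 20e-9 * 3e-9
    theta_sh = 0.07   # assumed spin-Hall angle for Pt
    i_e = 100e-6      # assumed charge current of 100 uA

    print(f"I_s   = {spin_current(i_e, theta_sh, a_mtj, a_shm):.3e} A")
    print(f"J_SHM = {shm_current_density(i_e, a_shm):.3e} A/m^2")
```

The area ratio $A_{MTJ}/A_{SHM}$ acts as a geometric gain: because the magnet footprint is much larger than the thin SHM cross-section, the spin current can exceed the charge current even for a spin-Hall angle well below one.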
III. CIRCUIT OPERATION AND BUFFER INSERTION FOR LONG INTERCONNECTS
The global interconnect architecture using the above DW-based device at the receiving end is shown in Fig. 4. The Cu-interconnect line is terminated through an SHM layer, which offers a low-impedance termination owing to its relatively low resistivity (20 μΩ·cm [Manipatruni 2014]). Current flows through the SHM layer at the receiver either to the right or to the left according to the data input at the transmitter. This is ensured by switching the voltage $V_A$ in Fig. 4 between $V + \Delta V$ and $V - \Delta V$ in accordance with the data input. The voltage $V_B$ is kept fixed at $V$, so the current direction changes depending on $V_A$. With the device dimensions shown in Table 2 for the receiver, the maximum voltage required across the SHM layer is 20 mV, which keeps the SHM current density in the range of $(1 \text{ to } 2) \times 10^{12}$ A/m$^2$ needed for DW movement. The $\Delta V$ at the transmitter side must ensure this voltage difference across the SHM layer; the rest of the voltage drops across the Cu-interconnect line. Since this method is primarily intended for long on-chip or off-chip global lines, we use wider metal wires (which have lower resistance and higher capacitance than the narrower wires of lower metal levels) for the signal transmission. The metal wire we use has a unit resistance $r_w = 50\ \Omega$/mm (the wire used in Lee [2013] and measured experimentally). So, for Cu-lines up to 20 mm in length, the maximum required $\Delta V$ at the transmitting side is ∼200 mV, which ensures a sufficient voltage drop across the SHM layer. Reversing the polarity of the voltage difference across the SHM layer reverses the current direction. When the current flows to the right through the SHM layer, the DW in the adjacent free layer moves to the right, and vice versa. The position of the DW is read using the reference MTJ shown in Fig. 4, which functions as a resistance divider [Sharad 2014]. The standard binary level is then detected with a clocked CMOS inverter.
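The voltage budget described above can be sketched as a simple DC resistive divider between the Cu line and the SHM termination. The SHM resistance used below (about 111 Ω) is not given in the text; it is an assumption backed out so that a 20 mm line needs roughly 200 mV at the transmitter, and the model ignores driver resistance. The drive swing at the transmitter is written as ΔV.

```python
def required_drive_voltage(length_mm,
                           r_w_ohm_per_mm=50.0,     # unit wire resistance from the text
                           r_shm_ohm=1000.0 / 9.0,  # ASSUMED SHM resistance (~111 ohm)
                           v_shm=0.020):            # 20 mV needed across the SHM layer
    """Transmitter swing (delta-V) so that v_shm drops across the SHM layer.

    DC divider: v_shm = delta_v * R_SHM / (R_wire + R_SHM).
    """
    r_wire = r_w_ohm_per_mm * length_mm
    return v_shm * (r_wire + r_shm_ohm) / r_shm_ohm


if __name__ == "__main__":
    for length_mm in (5.0, 10.0, 15.0, 20.0):
        dv_mv = required_drive_voltage(length_mm) * 1e3
        print(f"{length_mm:5.1f} mm line -> delta-V ~ {dv_mv:6.1f} mV")
```

Under these assumptions the required swing grows linearly with line length, which is one more reason to keep individual segments short.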
Although the voltage across the interconnect line is very low in this method, the delay can be relatively high for very long Cu-lines. The interconnect delay in Cu-lines generally increases as the square of the line length [Rabaey 2009]. In the conventional technique, this delay is minimized by introducing repeaters in the line and thereby breaking the line up into shorter segments [Bakoglu 1985]. We can apply a similar strategy to our proposed architecture using ST-buffers, as shown in Fig. 5. Here, the interconnect line is pipelined into two segments by introducing a buffer stage along the line. Longer lines will use more pipelined stages. As we will show in Section IV, this method can considerably reduce the delay for long lines. We show sample waveforms in Fig. 6 for the circuit shown in Fig. 5. Here, the Cu-line is 10 mm long and is pipelined into two segments by using one buffer stage between the transmitter and the receiver. A change in the data input results in a current flow through the SHM layer of buffer 1 and a subsequent reversal of the magnetization of the free layer in buffer 1. As a result, the output voltage of buffer 1 changes as shown in Fig. 6, and this drives the output stage.
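The benefit of segmentation follows directly from the quadratic length dependence. As a worked sketch, assume a distributed-RC wire delay of the form $k\, r_w c_w L^2$, where $r_w$ and $c_w$ are the wire resistance and capacitance per unit length and $k$ is a constant (about 0.38 for the 50% delay of a distributed RC line; the specific form and constant are our assumption, not a statement of the model used for the figures). Splitting the line into $N$ equal segments gives:

```latex
t_{\mathrm{wire}}(L) = k\, r_w c_w L^{2}
\quad\Longrightarrow\quad
N \cdot t_{\mathrm{wire}}\!\left(\frac{L}{N}\right)
  = N \cdot k\, r_w c_w \frac{L^{2}}{N^{2}}
  = \frac{k\, r_w c_w L^{2}}{N}
```

For the two-segment case of Fig. 5 ($N = 2$), the total wire delay is therefore halved, at the cost of one additional buffer delay.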
IV. RESULTS AND DISCUSSIONS
The proposed architecture is a low-voltage, low-current interconnect with a fast and energy-efficient signal conversion process at the receiver. Moreover, the delay is lower than in previously proposed ST-sensor-based interconnect methods [Sharad 2013, Azim 2015] due to the introduction of buffering in longer lines. We show the delay comparison of ST-sensor-based interconnects with conventional CMOS interconnects in Fig. 7. The CMOS implementation shown is for 45 nm technology with both full-swing and low-swing (using conventional level conversion [Zhang 2000]) designs with buffers. Compared with the full-swing CMOS design, the delay in the ST-sensor-based interconnect is higher due to the delay in the receiver for DW movement and the subsequent resistive divider action. However, the energy consumption in full-swing CMOS is significantly higher than in the ST-sensor design, as we discuss later. Low-swing CMOS can be used to reduce the energy consumption. However, the conversion from low to high voltage is relatively slow, and this leads to higher delay for low-swing schemes [Rabaey 2009]. The ST-sensor-based design without any buffering shows nearly the same delay as low-swing CMOS (Fig. 7). This delay can be reduced by using an optimum number of buffers along the line. The total delay ($T_p$) can be approximated by the following equation:
Here, $L$ is the total wire length, $M$ is the number of buffer stages, $t_{buf}$ is the ST-buffer delay, and $r_w$ and $c_w$ are the unit wire resistance and capacitance, respectively. As shown in Fig. 8, introducing only a few buffers can reduce the delay in the ST-sensor-based design. However, additional buffers can increase the delay, as expected from (2). Next, we analyze the energy consumption and compare it with voltage-mode CMOS interconnects. Due to the very low voltage operation in ST-interconnects, there is a significant reduction in the energy consumption. The total energy consumption consists of the ohmic loss (static energy dissipation in the Cu-line and the SHM layer resistances, $E_{static}$), the capacitive loss (dynamic line loss in the wire capacitances, $E_{dynamic}$) and the energy required for driving the receiver and buffer circuitry ($E_{receiver}$). The total energy consumption can be approximated by the following equation:
Here, $T_p$ is the total delay in the line, and $R_{wire}$ and $R_{SHM}$ are the wire and SHM resistances, respectively. $E_{receiver}$ is determined from HSPICE analysis. The capacitive loss ($\approx CV^2$) is negligible compared to the static ohmic loss, as expected for a current-sensing architecture driven by a low voltage [Bashirullah 2003]. Moreover, the reduction in delay with the introduction of buffers reduces the ohmic/static energy loss, which is proportional to the wire delay, as in (3). However, the additional buffer stages themselves add extra energy consumption. As a result, the overall energy consumption first decreases and then starts to increase with additional buffer stages, as shown in Fig. 9 for different Cu-line lengths. In Table 3, we compare the delay and energy consumption of buffered ST-sensor interconnects with those of full-swing and low-swing CMOS interconnects. The comparisons are shown for Cu-line lengths of 10 and 15 mm. Note that the energy consumption of the ST-sensor design is significantly lower than that of both full-swing and low-swing CMOS interconnects. This reduction is a result of using current-mode signaling with a very low voltage swing on the line, which suppresses the dynamic/capacitive power (proportional to the line voltage squared). Moreover, the static power is also reduced by not using analog amplifiers for signal conversion. However, the delay and area of our scheme will be somewhat higher than those of the full-swing CMOS technique, and hence its applicability will depend on the chip design requirements. Additionally, since the proposed method is a current-mode technique with a low-impedance termination, it inherits the higher noise immunity of current-mode architectures [Bashirullah 2003]. We perform an eye diagram simulation for our design to investigate the effect of inter-symbol interference (ISI) and crosstalk noise from neighboring lines. The eye diagram shown in Fig. 10 is near optimal for the nominal operating conditions (Table 2, clock period 2 ns). At faster operating speeds, the DW movement speed saturates and we observe distortions due to ISI. Nevertheless, the nominal speed is fast enough for most global lines, and at this speed the architecture shows good noise immunity.
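The buffer-count trade-off discussed in this section (a few buffers reduce the delay, too many increase it) can be reproduced qualitatively with a toy model. The sketch below assumes that each of the $M + 1$ segments contributes a distributed-RC wire delay plus one ST-stage delay; this functional form, the coefficient 0.38, and the values of $c_w$ and $t_{buf}$ are illustrative assumptions and are not claimed to match (2) or the simulated data in Fig. 8. Only $r_w = 50\ \Omega$/mm is taken from the text.

```python
K_DISTRIBUTED_RC = 0.38  # 50% delay coefficient of a distributed RC line (assumed model)


def total_delay(length_mm, n_buffers,
                r_w=50.0,       # ohm/mm, unit wire resistance from the text
                c_w=0.4e-12,    # F/mm, ASSUMED unit wire capacitance
                t_buf=0.2e-9):  # s, ASSUMED ST-stage (buffer or receiver) delay
    """Toy total delay for a line pipelined with n_buffers ST-buffers."""
    n_segments = n_buffers + 1
    seg_len = length_mm / n_segments
    wire_delay_per_segment = K_DISTRIBUTED_RC * r_w * c_w * seg_len ** 2
    # Every segment ends in an ST stage: a buffer, or the final receiver.
    return n_segments * (wire_delay_per_segment + t_buf)


def best_buffer_count(length_mm, max_buffers=10, **model_params):
    """Buffer count (0..max_buffers) that minimizes the toy total delay."""
    return min(range(max_buffers + 1),
               key=lambda m: total_delay(length_mm, m, **model_params))


if __name__ == "__main__":
    for length_mm in (10.0, 15.0):
        for m in range(5):
            delay_ns = total_delay(length_mm, m) * 1e9
            print(f"L = {length_mm:4.1f} mm, M = {m}: T_p ~ {delay_ns:.3f} ns")
        print(f"  delay-optimal M = {best_buffer_count(length_mm)}")
```

In this model the wire term falls as $1/(M+1)$ while the stage term grows as $(M+1)$, so the delay has a minimum at a small buffer count that shifts upward for longer lines.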
