Abstract-In order to simplify power-gating requirements in ultra-low-power architectures, design strategies for low-power non-volatile flip-flops are sought, for which spintronic devices offer a promising option. In this paper, we introduce a D flip-flop (D F/F) that uses a 5-terminal spintronic device for non-volatile state-holding in an intrinsically self-complementing fashion. This self-complementing property can be exploited to reduce the overhead interfacing circuitry and realize a compact D F/F built from ten transistors and one spintronic device (10T 1R) with instant store and restore functionality, while consuming less than 9 µW of power.
I. INTRODUCTION
With CMOS technology scaling ever deeper towards the end of the ITRS roadmap, static power dissipation is quickly becoming the primary source of power inefficiency in digital architectures. One technique for mitigating such static power losses is to power gate the portions of a digital circuit that are dormant [1]. However, power gating requires the state information held in the circuit's registers to be stored in Non-Volatile (NV) devices before powering down. Owing to the high switching speed and high write endurance of spintronic devices [1], several approaches have been proposed that use such devices within an edge-triggered D flip-flop (D F/F) to back up information before power gating and then restore it once power has returned [1, 2, 3, 4]. These approaches use purely CMOS circuitry to perform the basic D F/F operations and add read and write circuitry along with magnetic-tunnel-junction (MTJ) devices to store or restore the D F/F state when given a specific store or restore signal. Unfortunately, the overhead circuitry for storing and restoring data to/from the non-volatile memory, together with the additional store/restore signaling circuitry, imposes an area cost ranging from 38% to 169% [4] as well as a delay penalty ranging from 2 ns to 30 ns for backing up the data prior to power gating [5]. The implementation proposed herein uses the NV properties of a particular spintronic device to reduce the number of transistors needed compared to the standard pure-CMOS implementation of a master-slave D F/F while providing zero-delay store/restore functionality with full data retention, simplifying the requirements of power-gating techniques.
The previously developed Domain Wall Coupled Spin-Torque-Transfer (DWCSTT) device [6] is shown in Figure 1. The DWCSTT is a 5-terminal device with decoupled read and write paths, allowing the read and write characteristics to be optimized independently.
The state of the device depends on the orientation of each of the two fixed reference pillars relative to that of the underlying free layer; when the orientations are parallel, the path across the MTJ has a low resistance, RP, and when the orientations are anti-parallel, the path has a high resistance, RAP [4]. An important parameter for sensing the correct state of the device is the Tunnel Magnetoresistance Ratio (TMR), which is defined as TMR = (RAP - RP)/RP [4]. With two anti-parallel fixed reference pillars in the DWCSTT, one pillar will always present a high resistance and the other a low resistance. Because this device structure always provides complementary resistances, the read margin and tolerance to process variation are greatly improved [6]. Seo et al. utilized the self-referencing differential nature of the device to read its state by fixing the read out terminal to ground and then comparing the currents of the two fixed reference pillars when a fixed voltage is applied to both [6]. However, the authors herein propose that, with proper read and write path optimization of the DWCSTT, 16 nm CMOS gates with balanced transistor widths are capable of both writing to and reading from the device as shown in Figure 2, in lieu of using a sense amplifier to compare relative current levels.
The write operation of the device as proposed operates as follows. By passing a write current, IW, from one write terminal to its complementary write terminal, the domain wall is moved via Spin-Transfer-Torque (STT) [4] and the free layer is oriented as shown in Figure 2a, which causes a low voltage (VLow) at the read out terminal due to the voltage divider between the two MTJs as shown in Figure 2b. By passing IW in the opposite direction, the domain wall is moved and the free layer is oriented as shown in Figure 2c, which causes a high voltage (VHigh) at the read out terminal as shown in Figure 2d. If the read characteristics of the device are optimized such that VLow < VTh < VHigh, where VTh is the threshold voltage of a CMOS inverter, then an inverter can be used to obtain the state of the DWCSTT device. This aspect is exploited in order to use 16 nm technology CMOS gates for interfacing with the DWCSTT.
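The read condition VLow < VTh < VHigh can be illustrated with a minimal Python sketch. It assumes an ideal, unloaded voltage divider in which one reference pillar is tied to VDD and the other to ground; the function names and the normalized values in the usage section are hypothetical and are not taken from Table 1.

```python
def tmr(r_p, r_ap):
    """Tunnel magnetoresistance ratio, TMR = (RAP - RP) / RP."""
    return (r_ap - r_p) / r_p


def read_out_levels(vdd, r_p, tmr_ratio):
    """Return (v_low, v_high) at the read out terminal.

    Assumption: ideal unloaded divider, one pillar to VDD, the other to
    ground. The two pillars always hold complementary resistances, so the
    read out node sees either RP or RAP on the ground side.
    """
    r_ap = r_p * (1.0 + tmr_ratio)          # from the TMR definition
    v_low = vdd * r_p / (r_p + r_ap)        # low-resistance pillar to ground
    v_high = vdd * r_ap / (r_p + r_ap)      # high-resistance pillar to ground
    return v_low, v_high


def inverter_can_sense(v_low, v_th, v_high):
    """True when the inverter threshold lies strictly between the two levels."""
    return v_low < v_th < v_high


if __name__ == "__main__":
    # Hypothetical, normalized values chosen only for illustration.
    vdd, r_p, ratio = 1.0, 1.0, 1.0
    v_low, v_high = read_out_levels(vdd, r_p, ratio)
    print(v_low, v_high, inverter_can_sense(v_low, vdd / 2, v_high))
```

Under these assumptions the two levels are symmetric about VDD/2, and their separation grows with TMR, which is why an inverter with a mid-rail threshold is a natural candidate for sensing.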
II. FLIP-FLOP DESIGN
The D F/F proposed herein is shown in Figure 3 and consists of an SRAM-based master latch, a DWCSTT device as the slave latch, an output inverter, and two pass gates used for control. While the clock signal is low, the pass gate leading into the master latch conducts, allowing the master latch to follow the data arriving at the input terminal D. Once the clock signal goes high, the master latch becomes isolated from D and the bit stored in the master latch is latched into the DWCSTT slave latch as shown in Figure 3. Power gating is achieved by simply disconnecting the entire circuit from VDD, as the data is already latched inside the NV element and no pre-sleep data-storing strategies are necessary. However, since the data stored inside the master latch is non-deterministic upon re-powering the circuit, power restoration must commence with the negative edge of CLK. This ensures that the good data in the NV slave latch is propagated through the circuit and that the result is ready at the master latch of the next D F/F before CLK goes high and writes the data from the master latch into the slave latch.
The clock-to-Q (C-Q) delay [1] is dependent upon the speed of the STT-driven domain wall motion in the DWCSTT device, which is proportional to IW. IW can be adjusted by varying the width of the transistors in the SRAM master latch cell. By increasing the transistor width, we can reduce the C-Q delay at the cost of power and area overhead. The relationship between transistor width, power, and C-Q delay is shown in Figure 4, where the x-axis is in multiples of the minimum feature size (F), corresponding to WNMOS = xF and WPMOS = 2xF. For these simulations, F is taken to be 16 nm. We simulated the D F/F design in HSPICE using the 16 nm high-performance transistor models available from Arizona State University [7].
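The latching and power-gating sequence described above can be sketched as a logic-only behavioural model in Python. This is an illustration under stated assumptions, not the simulated circuit: the class and method names are hypothetical, timing and the C-Q delay are ignored, the initial slave state is assumed to be 0, and Q is taken to equal the stored bit without modelling the polarity of the output inverter.

```python
class NVFlipFlop:
    """Logic-only sketch of a D F/F with a non-volatile slave latch."""

    def __init__(self):
        self.master = None   # volatile master latch; None means unknown
        self.slave = 0       # non-volatile slave state (assumed initial value)
        self.powered = True

    @property
    def q(self):
        # Q reflects the non-volatile state whenever the circuit is powered.
        return self.slave if self.powered else None

    def step(self, d, clk):
        """Apply input d at clock level clk and return Q."""
        if not self.powered:
            raise RuntimeError("circuit is power-gated")
        if clk == 0:
            # Input pass gate conducts: master follows D, slave holds.
            self.master = d
        else:
            # Master is isolated from D and its bit is written to the slave.
            if self.master is None:
                raise RuntimeError("restore must begin with CLK low")
            self.slave = self.master
        return self.q

    def power_down(self):
        # Volatile master content is lost; the slave keeps its state.
        self.powered = False
        self.master = None

    def power_up(self):
        self.powered = True


if __name__ == "__main__":
    ff = NVFlipFlop()
    ff.step(d=1, clk=0)
    ff.step(d=1, clk=1)      # bit written to the non-volatile slave
    ff.power_down()
    ff.power_up()
    print(ff.q)              # retained bit is visible without a restore step
```

The error raised when CLK is high immediately after power-up mirrors the requirement that restoration commence with the negative edge of CLK, since the master latch holds no valid data at that point.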
The DWCSTT device was simulated by adapting a Verilog-A model of a similar device available online [8] so that it accurately reproduces the operation of the DWCSTT device. The circuit parameters for the simulation are found in Table 1. The value of RP is chosen to be high, yet within the range of feasibility [9], in order to reduce the read power overhead. WNMOS and WPMOS are chosen to minimize the C-Q delay; applications that can relax the C-Q delay in exchange for improved area and power metrics may reduce the transistor sizing. The device width, length, TMR, and write path resistivity were all chosen as the base values included in the model of [8].
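The sizing rule used for the width sweep (WNMOS = xF, WPMOS = 2xF, with F = 16 nm) can be written as a small Python helper. The function name is hypothetical, and the sketch only converts a sizing multiplier into widths; it makes no claim about the resulting power or C-Q delay, which come from the HSPICE simulations.

```python
F_NM = 16.0  # minimum feature size used in the simulations, in nm


def master_latch_widths(x, f_nm=F_NM):
    """Return (W_NMOS, W_PMOS) in nm for sizing multiplier x.

    Implements WNMOS = x * F and WPMOS = 2 * x * F.
    """
    if x <= 0:
        raise ValueError("sizing multiplier must be positive")
    return x * f_nm, 2 * x * f_nm


if __name__ == "__main__":
    # Illustrative sweep over a few multipliers (values are arbitrary).
    for x in (1, 2, 4):
        print(x, master_latch_widths(x))
```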
The simulated waveforms for the proposed D F/F are shown in Figure 5. At the positive CLK edge, the data present at D is written into the DWCSTT slave latch and is then output at terminal Q as depicted. The functionality of this design is critically dependent upon VR switching above and below VTh, which is shown. Upon power-gating VDD, the data stored in the slave latch is retained and is immediately available upon restoration of VDD to the circuit, illustrating the instant store/restore functionality of the D F/F.
Compared to previous works [1, 4], the proposed design has a reduced transistor count and negates the need for store/restore circuitry and signaling before power gating. However, since the C-Q delay of this design is impacted by the relatively slow write speed of the DWCSTT compared to an SRAM cell, some trade-offs are observed. In particular, compared to [4], the proposed D F/F uses 17 fewer transistors but has a 1.2 ns longer C-Q delay. Although such an increase in C-Q delay is unfavorable for many applications, applications with relaxed speed requirements that utilize power-gating schemes can benefit from the compact size and simplified power-gating requirements of the proposed design. In addition, further advancements in spintronic research may lead to faster switching devices, which can improve the C-Q delay.
IV. CONCLUSION
The D F/F design proposed herein is shown to retain its data in an NV spintronic device as part of its normal operation, which allows instant store/restore functionality without the need for store/restore signaling or overhead control circuitry. Furthermore, the proposed design uses 10 fewer transistors than a traditional pure-CMOS master-slave D F/F [4]. Additionally, we showed that by varying the transistor widths in the SRAM master latch, it is possible to tune the circuit for the power, delay, and area needs of one's application. The functionality of the design was demonstrated using 16 nm CMOS models and a Verilog-A model of the DWCSTT device in HSPICE. Area results were favorable compared to previous works, but the C-Q delay was shown to be worse.
