Abstract: CMOS device scaling faces a daunting challenge: parameter variations increase and leakage current grows exponentially with every new technology generation. Researchers have therefore started looking at alternative technologies. Magnetic Quantum Cellular Automata (MQCA) is one such alternative, with switching energy close to thermal limits and scalability down to 5nm. In this paper, we present a circuit/architecture design methodology using MQCA. Novel clocking techniques and strategies are developed to improve the computation robustness of MQCA systems. We also developed an integrated device/circuit/system compatible simulation framework to evaluate the functionality and architecture of an MQCA-based system, and conducted a feasibility/comparison study to determine the effectiveness of MQCAs in digital electronics. Simulation results for an 8-bit MQCA-based Discrete Cosine Transform (DCT) with novel clocking and architecture show up to 290X and 46X improvement (at iso-delay and under optimistic assumptions) over 45nm CMOS in energy consumption and area, respectively.
I. INTRODUCTION
Parameter variations and leakage current increase with every new technology generation, making it harder to scale CMOS devices. Thus, alternative technologies such as magnetic logic are being developed to overcome the limitations of CMOS devices. Magnetic logic, magnetic memories and their variants have been studied theoretically and demonstrated experimentally over the past 30 years as viable replacements for CMOS [1] [2] [3] [4] [5] [6]. Interestingly, the energy required to switch a single nano-magnet at room temperature is on the order of zepto-Joules (10^-21 J). A system with 1 million such nano-magnets can switch with a theoretical limit of 1fJ; compared with the ITRS projections for double-gate CMOS transistors at the 15nm technology node, this is an improvement of 4 orders of magnitude [7].
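The 1fJ figure follows from a one-line estimate. The worked equation below assumes, for illustration only, a switching energy of exactly 1 zJ per nano-magnet (the text gives only the order of magnitude); the symbols N, E_switch and E_total are introduced here and do not appear in the original.

```latex
E_{\mathrm{total}} = N \, E_{\mathrm{switch}}
                   = 10^{6} \times 10^{-21}\,\mathrm{J}
                   = 10^{-15}\,\mathrm{J}
                   = 1\,\mathrm{fJ}
```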
Magnetic Quantum Cellular Automata (MQCA) was proposed in 2000 by Cowburn and Welland to demonstrate the use of magnetic polarizations for logic computations [3]. In MQCAs, the directions of magnetic polarization of the nano-magnets represent binary states. The axes along these directions, where the polarization energy is lowest, are the easy axes. The axes along the directions of highest-energy polarization are the hard axes. Logic operations are implemented using the magnetic coupling between these nano-magnets, aided by a clocked external magnetic field. However, this magnetic field needs to be suitably clocked to operate a chain of nano-magnets with high reliability. The generation and distribution of such magnetic fields would require additional CMOS circuitry and interconnects that consume large amounts of power. Niemier et al. used current through parallel wires to generate these magnetic fields [8]. Their results show that the energy consumed by the clocking circuitry is significantly greater than the energy dissipated by the nano-magnets. Hence, efficient clocking strategies and architectures are needed to make MQCA systems feasible.
MQCAs have shown good scalability [1] , making them promising for replacing standard CMOS designs. Although basic MQCA logic implementations were experimentally demonstrated [3] , their adoption in commercial products has been hindered by the continued scaling of CMOS transistors and lack of system design methodologies, system level simulators and efficient and reliable manufacturing techniques for MQCAs. With the current development of efficient fabrication techniques for nano-magnets, there is a need to explore MQCA circuits and architectures, and compare them to their CMOS counterparts. As such, we developed a design methodology with which MQCA systems can be designed. As we will show in this paper, the design of MQCA systems requires techniques to mitigate design issues arising from the dynamics of MQCA nano-magnets. Hence, to evaluate the power-performance limits of MQCAs, an integrated device/circuit/system simulation framework is needed and we developed such a framework. To that effect, we also show how novel clocking strategies and architectures mitigate MQCA system design issues.
The rest of the paper is organized as follows. Section II introduces the basics of MQCA and proposes a suitable architecture for MQCA systems. Section III describes novel clocking architectures and strategies for logic isolation and noise immunity in MQCA systems. Evaluating the power-performance limits of an MQCA system requires an integrated device/circuit/architecture simulation framework, which we present in Section IV. Section V discusses our MQCA design methodology, including the synthesis of an MQCA cell library. An MQCA-based implementation of DCT was synthesized using the proposed design methodology and evaluated using our simulation framework. The simulation results and comparisons with a standard CMOS-based DCT design in 45nm predictive technology [9] are discussed in Section VI. Finally, Section VII concludes this paper.
II. BASICS OF MAGNETIC QUANTUM CELLULAR AUTOMATA
The basics of MQCAs were presented in [1, 3]. Fig. 1 shows the majority logic gate developed in [1]. The easy axes are along the directions in which the nano-magnets tend to be polarized in equilibrium. The hard axes are along the directions in which the polarizations of the nano-magnets have the highest energy; for implementing binary logic, they are perpendicular to the easy axes. The easy axis is taken to be the y-axis (vertical) in fig. 1. The magnetic field lines show that dipolar coupling between magnets in the x-direction (horizontal) is anti-ferromagnetic and that between magnets in the y-direction is ferromagnetic. The compute nano-magnet is magnetically coupled to all four surrounding nano-magnets.
However, the output and compute nano-magnets can be first driven into the hard axis using an external magnetic field, and then allowed to relax. Dipolar coupling in the hard axis is very weak; therefore, the compute nano-magnet couples more strongly to nano-magnets A, B and C. As such, its polarization will align itself with the effective magnetic field due to nano-magnets A, B and C but not the output nano-magnet. The majority effect determines the polarization of the compute nano-magnet and this realizes majority logic. By biasing nano-magnet B, the gate can be configured as an OR or AND gate. Using the notation in fig. 1 and using nanomagnets A and C as primary inputs, fixing nano-magnet B as '1' ('0') realizes an OR (AND) gate. Also, MQCAs work like a pipeline since they need clocks to propagate data.
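The Boolean behavior described above can be checked with a short truth-table sketch. It models only the logic function stated in the text (a three-input majority vote with B used as the bias input); it does not model the magnetic coupling, and in particular ignores any inversion introduced by anti-ferromagnetic coupling between horizontal neighbors. The function names are illustrative and not taken from the paper.

```python
from itertools import product


def majority(a: int, b: int, c: int) -> int:
    """Return 1 when at least two of the three binary inputs are 1."""
    return 1 if (a + b + c) >= 2 else 0


def or_gate(a: int, c: int) -> int:
    """Majority gate with the bias input B fixed to '1'."""
    return majority(a, 1, c)


def and_gate(a: int, c: int) -> int:
    """Majority gate with the bias input B fixed to '0'."""
    return majority(a, 0, c)


if __name__ == "__main__":
    print("A C | OR AND")
    for a, c in product((0, 1), repeat=2):
        print(f"{a} {c} |  {or_gate(a, c)}   {and_gate(a, c)}")
```

Fixing one input turns the three-input vote into a two-input gate: with B = 1 a single high input already gives a majority, and with B = 0 both inputs must be high.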
MQCA data propagation is done using chains of nanomagnets [8] . These chains need to be short to realize compact MQCA systems. Thus, a near neighbor architecture is best suited for MQCA systems. We propose implementing MQCA systems as systolic arrays [10] , especially for digital signal processing applications which are well-suited for systolic architectures.
Also, an external magnetic field is needed to isolate the output from the inputs, and it is clocked to enable computations every cycle. Hence, MQCA systems synthesized from these gates need to be properly clocked for reliable, higher-performance operation. The design of clock architectures and strategies is discussed in the next section.
III. CLOCKING ARCHITECTURE AND STRATEGIES
We have seen in Section II that clocks are needed in MQCA systems for input-output isolation and/or computation robustness. Even though basic MQCA cells are very low-power, the clock network may not be [8]. Thus, efficient clock architectures and strategies are needed to make MQCA systems feasible. Current technology uses CMOS circuits to generate clock signals for switching current sources. The current is then converted to magnetic fields. In this section, we propose different clocking technologies and strategies, and power-efficient clock architectures, to trade off computation robustness and speed.
A. Clocking Strategies
As mentioned in Section II, MQCAs work like a pipelined circuit in which computations on different data and data transfers are synchronized by clock signals. Another purpose of the MQCA clock is to control dataflow. Thus, the basic idea for clocking an MQCA logic pipeline can be described as follows:
1) The current stage is clocked in preparation for computation.
2) The following stage is clocked.
3) The previous stage is not clocked while the next stage is, and the magnetic field to the current stage is switched off.
4) The previous stage is clocked since data from the previous stage is no longer needed.
Hence, multiple clocks are needed for high-throughput, robust MQCA systems, and the following sections describe the proposed clocking strategies.
A.1 4-Phase Overlapping Clocks
Fig. 2 shows the operation of a 5-stage MQCA inverter chain using a 4-phase overlapping clock scheme. The figure shows that data propagates correctly along the chain. The propagation delay of each stage depends on the settling time of the nano-magnets. This strategy requires 50% overlap between the clocks of consecutive stages. However, the clocking circuit may consume too much power generating four clocks.
A.2 3-Phase Overlapping Clocks
A less power-hungry alternative is the 3-phase overlapping clock scheme shown in fig. 3. This strategy requires 33% overlap between the clocks of consecutive stages. Compared to the 4-phase scheme, the clock frequency needs to be lowered to maintain the clock overlap time. The overlap time ensures that nano-magnets of consecutive stages are fully aligned along the hard axis before computation occurs. A longer pulse and lower frequency imply that computation robustness is maintained at the expense of speed and power. Thus, the power, speed and robustness of the circuit can be traded off by choosing between the 3- and 4-phase clocking schemes.
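The 50% and 33% overlap figures can be reproduced with a small sketch. It assumes, as an illustration not stated in the text, that every clock has a 50% duty cycle, that the phases are evenly spaced by one period divided by the number of phases, and that overlap is measured as a fraction of the pulse width. The function names and the normalized period of 1 are hypothetical.

```python
def overlap_fraction(num_phases: int, duty: float = 0.5) -> float:
    """Overlap between consecutive clock pulses as a fraction of the pulse width.

    Assumes evenly spaced phases (offset = period / num_phases) and a
    normalized period of 1.
    """
    period = 1.0
    width = duty * period           # time each clock stays high
    offset = period / num_phases    # delay between consecutive phases
    overlap = max(0.0, width - offset)
    return overlap / width


def clock_level(phase: int, t: float, num_phases: int, duty: float = 0.5) -> int:
    """Return 1 while the clock of stage `phase` is high at normalized time t."""
    local = (t - phase / num_phases) % 1.0
    return 1 if local < duty else 0


if __name__ == "__main__":
    for n in (4, 3):
        print(f"{n}-phase: overlap = {overlap_fraction(n):.1%} of the pulse width")
    # Which of the four clocks are high at t = 0.3 of the period?
    print([clock_level(p, 0.3, 4) for p in range(4)])
```

Under these assumptions the absolute overlap is a quarter of the period for four phases and a sixth of the period for three, so keeping the same overlap time with three phases needs a period 1.5 times longer, which is consistent with the lower clock frequency mentioned above.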
B. Clock Architectures
Current Phase-Locked Loop (PLL) circuits are able to generate the clock signals described in Section III-A. Thus, we focus only on technologies that convert current into magnetic fields and discuss their power efficiency.
B.1 Parallel Wires
Alam et al. showed how current through wires is used to generate magnetic fields for nano-magnets [11]. Nano-magnets between zones might be incorrectly clocked due to [...] current noise in wires. Furthermore, large [...] required to generate sufficient magnetic fields [...] power efficiency is compromised. Niemier et al. [...] clock energy using this scheme can be 100X [...] switching energy [8]. The clocking energy [...] proposed clocking strategies are used [...] computation robustness. Hence, more [...] technologies are needed for better control [...] clocking in MQCA systems.
B.2 Spin-Transfer Torque Magnetic Tunneling [...]
By placing individual nano-magnets above [...] layer of spin-transfer torque magnetic tunneling [...] (STT-MTJ), the magnetic field from the MTJ [...] clock the nano-magnets. Fig. 4 illustrates how [...] anti-parallel arrangement of the hard and free [...] MTJ is able to generate a magnetic field in the tunneling layer. The arrangement is switched [...] minimum current through the MTJ layers. 6[...] needed to switch an MTJ with t_ox = 10nm [...] cross-section in 3.3ns. A 250mV current source [...] of energy to deliver this current [12]. However, [...] between the nano-magnet and the MTJ needs [...] (~1nm) for the magnetic field to be large enough [...] nano-magnet.
B.3 Nano-coils and Solenoids
Alternatively, the nano-magnet can be [...] nano-coils or solenoids for clocking. The magnetic field can be increased by replacing [...] with high permeability material or increasing [...]. Table 1 lists the permeability of suitable [...]. Possible remnant polarization in high permeability [...] requires a second solenoid or depolarizing [...] eliminate it. The impact of core material permeability [...] solenoid current and energy requirements are [...]. Fig. 6 illustrates how a solenoid can be used [...] nano-magnets. The drawback of coil based [...] inductive coupling between them, which [...] voltages in non-switching coils. This coupling [...] by larger spacing between coils.
Each of the clocking strategies and technologies [...] has different power requirements. Thus, [...] considered to accurately estimate the power [...] the MQCA system. The next section proposes [...] device/circuit/system simulation framework [...].
IV. DEVICE/CIRCUIT/SYSTEM COMPATIBLE [...] FRAMEWORK
The dynamics of MQCAs presented in Section [...] described by a set of equations describing [...] between MQCA nano-magnets. The Landau-Lifshitz-Gilbert (LLG) equation [17, 18] describes the dynamics [...] and can be modified to include the dipolar [...] between magnets; hence, a system of these [...] derived to describe any MQCA system. The [...] of magnetic moment in an MQCA nano-magnet [...] (1). [...] magnetic field (H_ext), the [...] dipolar interaction field (H_dip) [...] nano-magnet is related to its [...] (3). The magnetic moments [...] between nano-magnets [...] (4). An MQCA system can [...] nano-magnets to an initial [...] and the LLG equation in [...] time-domain to yield the [...] dissipated is computed using [...] the ramp time of the clock [...] time (C), given by (7), can [...] fig. 5 shows the use of this [...].
[...], the clock circuitry can [...] than the MQCA magnets. [...] be modeled and its power [...] circuit model of fig. 6 [...] generator and its energy [...] cycle is given by (8). The [...] clocked magnetic fields is [...] (9); the system power consumption is the mean energy consumed by all magnetic field generators in a clock period.
[...] q dt (9)
Area Estimation
The MQCA system area depends on the technology used for fabrication. CMOS circuitry and wiring for generating the clocked magnetic field can be fabricated on a substrate with the MQCA nano-magnet layers stacked on top, as shown in [11]. In that case, the area is constrained to that of the largest layer. Alternatively, everything is fabricated in one layer and the system area is the sum of the CMOS and MQCA areas. Available commercial design tools are able to estimate the CMOS area from the layout. The MQCA layer area depends on the nano-magnet spacing and size. Table 1 summarizes the equations for area estimation.
The equations in our simulation framework are summarized in Tables 2 and 3. A simulator can be implemented to solve these equations and evaluate the power-performance of MQCA systems. In the following section, we propose a design methodology for synthesizing MQCA systems.
V. DESIGN METHODOLOGY
We have seen in Section II that MQCA systems have to be designed as pipelined systolic arrays controlled using the clocking architectures and strategies presented in Section III. However, a fully customized design can be difficult to synthesize; thus, we propose a cell-library approach to simplify the design process. The design of the MQCA cell library is discussed next, followed by the MQCA system design methodology using the MQCA cell library.
MQCA Cell Library Design
NAND and NOR gates are needed to implement logic. Imre et al. demonstrated a reconfigurable majority logic gate in [1], and this concept can be used to synthesize a library of basic MQCA logic cells. The majority gate can be cascaded and reconfigured to realize higher fan-in logic gates by fixing inputs (Fig. 9). For an n-input MQCA gate, the total number of input magnets (N) and the number of fixed magnets (N_fix) that are required to synthesize it are calculated from (10):
$N = 2n - 2,\ n \ge 2; \quad N_{fix} = n - 1$ (10)
Fig. 9 shows the implementation of 2-, 3- and 4-input gates in MQCA. Positions labeled 'fix' are the fixed inputs, whose values depend on the logic implementation. Table 4 gives the configurations of these gates and the logic functions implemented. An MQCA cell library using 10nm side cubic nano-magnets was synthesized. The delay, area and energy consumed by some MQCA gates are listed in Table 5.
System Design Methodology
An MQCA system can now be synthesized using the cell library synthesized in the previous section. However, the power-performance of the system depends on the technology available and the design of the basic MQCA logic cells. Thus, we propose a design methodology using the following steps:
1. Given the MQCA technology parameters (anisotropy, nano-magnet material, etc.), choose an initial nano-magnet volume.
2. Design an MQCA inverter chain using the initial nano-magnet size and a spacing (t_dm) optimized from H_ani.
3. Using an LLG solver to determine the energy-delay product (EDP) of the chain, minimize [...] iteratively choosing the nano-magnet [...] repeating step 2.
4. The nano-magnet volume and spacing [...] end of step 3 is used to design a library [...] logic cells as discussed in the previous [...].
5. Synthesize an MQCA systolic array [...] targeted application (e.g., DCT) using [...] from step 4.
6. Choose a clocking architecture and strategy [...] Section III.
7. Using an LLG solver incorporating [...] the synthesized design meets power-performance specifications. Repeat steps 5 and 6 until [...] are met. Specifications may have to [...] iterations do not yield a feasible design.
An MQCA-based systolic array 8-bit DCT [...] using our design methodology. A MATLAB [...] on our simulation framework was implemented [...] evaluate the MQCA DCT. Simulation [...] comparisons to the standard CMOS [...] implementation are presented in the next section.
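The LLG solver referred to in the design steps can be illustrated with a minimal single-magnet sketch. This is not the authors' MATLAB simulator, and the paper's equations (1) to (9) are not reproduced here. The sketch integrates the standard Landau-Lifshitz form of the LLG equation for one normalized moment in a fixed effective field, using dimensionless units, explicit Euler steps with renormalization, and illustrative parameter values (alpha, gamma, dt, steps are all assumptions). Dipolar and anisotropy fields, which a full MQCA model would add to the effective field, are omitted.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def llg_rhs(m, h, alpha, gamma):
    """dm/dt = -gamma / (1 + alpha^2) * [m x h + alpha * m x (m x h)]."""
    mxh = cross(m, h)
    mxmxh = cross(m, mxh)
    pref = -gamma / (1.0 + alpha * alpha)
    return tuple(pref * (mxh[i] + alpha * mxmxh[i]) for i in range(3))


def relax(m0, h, alpha=0.5, gamma=1.0, dt=0.01, steps=5000):
    """Integrate the moment from m0 in a constant field h (illustrative values)."""
    m = normalize(m0)
    for _ in range(steps):
        dm = llg_rhs(m, h, alpha, gamma)
        # Explicit Euler step, then renormalize to keep |m| = 1.
        m = normalize(tuple(m[i] + dt * dm[i] for i in range(3)))
    return m


if __name__ == "__main__":
    # Start along the hard axis (x) and relax in a field along the easy axis (y).
    m_final = relax((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
    print("final moment:", m_final)
```

The example mirrors the clocking idea at the level of one magnet: the moment starts along the hard axis and, once released, precesses and damps toward the field direction along the easy axis. Explicit Euler is chosen only for brevity; a production solver would more likely use a higher-order or adaptive integrator.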
VI. SIMULATION RESULTS
A feasibility study of MQCA systems was [...] the simulation framework and design methodology [...] in Sections IV and V, respectively. An MQCA [...] Discrete Cosine Transform (DCT) based [...] systolic array architecture of Chang and [...] implemented using our design methodology [...] the systolic array implementation of the [...] MATLAB simulator based on the simulation [...] Section IV was used to simulate the DCT [...] implementation is made up of a regular [...] identical processing elements (PE), we analyzed [...] and extrapolated the results to the whole system.
The systolic array DCT implementation [...] identical PE. Each PE has sixteen 2-input gates [...] input gates. The 2-input gates can be built using [...] by fixing the remaining inputs to logic '1'. [...] input gates, we can construct a very regular [...] implementation. The area overhead due to [...] gates is 4.3%. The advantages of using such [...] include 1) ease of fabrication, 2) ease of [...] to the circuit and 3) ease of clock control [...]. Furthermore, higher fan-in gates reduce [...] delay in MQCA. Table 6 lists the parameters [...] based DCT. The critical path delay in our [...] 884ns and the total energy dissipation is 5.1fJ [...]
Iso-Delay Comparisons with 45nm CMOS
Due to the large delay in MQCA, we analyzed CMOS in sub-threshold operation for iso-delay comparisons [16]. The design parameters for the CMOS-based DCT are listed in Table 7. Table 8 shows the power-performance numbers of both the CMOS and MQCA implementations of DCT. Fig. 11 [...] the area and energy consumption of both [...] 8-bit DCT implementations. The MQCA [...] shows 290X energy improvement over [...] technology implementation. However, the [...] is an overly optimistic assumption.
[...] [15]. [...] techniques of fabricating these coils are [...] the power efficiency of using these coils to [...] fields in MQCA chips. The use of current [...] magnetic fields also means that the clock [...]
