In the years to come new solutions will be required to overcome the limitations of scaled CMOS technology. One approach is to adopt Nano-Magnetic Logic Circuits, highly appealing for their extremely reduced power consumption. Despite the interesting nature of this approach, many problems arise when this technology is considered for real designs. The wire is the most critical of these problems from the circuit implementation point of view. It works as a pipelined interconnection, and its delay in terms of clock cycles depends on its length. Serious complications arise at the design phase, both in terms of synthesis and of physical design.
Nanomagnetic Logic Circuits (NLC) are mentioned in the International Technology Roadmap for Semiconductors (ITRS) [Semiconductor Industry Association 2008] as one of the most promising substitutes for CMOS technology. They rely on the Quantum-Dot Cellular Automata (QCA) idea [Csurgay et al. 2000; Kummamuru et al. 2003; Lent et al. 1993], based on bistable cells having only two stable states (Figure 1(a)). These two states represent the logic values 0 and 1. Circuits are built using many identical cells placed close together. The state of every cell is therefore defined by the electrostatic interaction between adjacent elements. This theoretical principle is normally implemented in two ways, molecular QCA and magnetic QCA, also called NML (NanoMagnetic Logic). Molecular QCA are built using complex molecules as basic cells [Demarchi et al. 2009; Lu and Lent 2005; Pulimeno et al. 2011]. On the one hand molecular QCA have great appeal due to the high speed they are expected to reach (about 1 THz according to Lu et al. [2006]) along with the possibility of working at room temperature. On the other hand, neither wires nor gates have yet been experimentally demonstrated. At the moment, in fact, self-assembly techniques are not advanced enough to create useful circuits.
NML devices [Csaba and Porod 2002; Imre 2005] are built using rectangular nanometer-scale magnets, which can be approximated as single-domain devices. Since the magnets are rectangular, the shape anisotropy allows only two stable magnetization states for these devices. In these two stable states the magnetization lies along the long side of the magnet (called the easy axis), and it represents the two logic values, 0 and 1 (Figure 1(b)). NML cannot reach the high speed of its molecular counterpart, as the expected speed is a few hundred MHz, according to Rizos et al. [2009]. However, small circuits have already been experimentally demonstrated [Imre et al. 2003; Orlov et al. 2008; Pulecio and Bhanja 2010]. The main advantages of this implementation are the expected low power dissipation [Csaba et al. 2004] and its intrinsic magnetic nature: NML devices maintain the information stored even without an external power supply. It is interesting to note that in this technology, even a simple wire is based on the elementary logic magnet (Figure 1(c)), and this has consequences that will be highlighted in the following paragraphs. Other logic gates are the inverter (Figure 1(d)) and the Majority Voter (MV). The MV can be used as an AND or an OR, depending on whether one of its inputs is fixed to 0 or to 1 (Figure 1(e)).
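The way a Majority Voter reduces to an AND or an OR can be illustrated with a minimal logic-level sketch. It models only the Boolean function, not the magnetic behavior, and the function names are chosen here for illustration.

```python
def majority(a: int, b: int, c: int) -> int:
    """Return the value held by at least two of the three binary inputs."""
    return 1 if a + b + c >= 2 else 0


def and_gate(a: int, b: int) -> int:
    """Majority voter with the third input fixed to 0 behaves as an AND."""
    return majority(a, b, 0)


def or_gate(a: int, b: int) -> int:
    """Majority voter with the third input fixed to 1 behaves as an OR."""
    return majority(a, b, 1)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, and_gate(a, b), or_gate(a, b))
```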
Irrespective of the implementation, QCA circuits require the use of an external clock field to drive the information through the circuit [Niemier et al. 2007; Rizos et al. 2009] in order to avoid error generation. For example, with magnetic implementation, the maximum number of magnets that can correctly align in an antiferromagnetic order is between 10 and 20 [Imre et al. 2003], and fewer if thermal noise is considered [Csaba and Porod 2010]. The external field, magnetic or electric, depending on the implementation chosen, is used to drive the cells into an intermediate unstable state, lowering the energy barrier between the stable states. When the field is removed, the cells are free to reorient themselves depending on the neighbor cells (more details are reported in Section 1). In molecular QCA this clock field may act as a reset field, or on the contrary, as an activating field, enabling cell reorientation when it is applied. However, this depends on the molecule characteristics. In general, the cyclic switching on and off of this external field likens it to a clock, even though it is conceptually more similar to a rhythmic power supply, because it enables the information propagation. The whole circuit is patterned by clock regions; distributed in the whole layout area, these regions can be reached by the external field with appropriate timing. Typically, three or four types of region are necessary, as explained in Section 1, and magnets in the regions of the same type are reached by the same clock signal; different clock signals have different phases, but the same period.
The use of this clock system leads to a problem known as layout-timing [Imre et al. 2003]. The delay of a QCA circuit is measured in terms of clock cycles, or in other words, by counting the number of clock regions the circuit goes through. This number depends in turn on the total number of cells in the circuit, but the number of cells is related to the circuit layout (see Section 2 for a detailed explanation about the layout organization). This dependency increases the constraints on criteria and tools that can be used at the design stage, e.g. synthesis and physical design. The reason behind this is that a layout change during circuit design implies a delay variation, which can cause improper behavior of the circuit. The available tools [Chung et al. 2005; Zhang et al. 2005], as yet developed only at research level, can be helpful, but only for simple circuits. One of the proposed solutions for this problem is to use an asynchronous delay-insensitive logic, like Null Convention Logic (NCL) [Choi et al. 2006, 2007; Fant and Brandt 1996]. In this logic, a gate switches to a new value only when all the signals have arrived at its inputs. Therefore, it is possible to place gates anywhere in the circuit without any concerns regarding signal synchronization (e.g. regarding the number of clock zones crossed by each signal). The use of asynchronous logic applied to NML leads then to the realization of a Globally Asynchronous and Locally Synchronous (GALS) architecture. The GALS technique is often used in CMOS digital designs when interconnect-related issues must be solved [Casu and Macchiarulo 2007] or when different and complex synchronization blocks are to be interfaced [Martina and Masera 2010].
In the case of NML, this happens because the single NCL information propagation is synchronized with the clock signal, but the whole circuit is asynchronous and relies on a handshake communication protocol like most of the commonly used asynchronous architectures [Davis and Nowick 1998; Sparso and Furber 2011] .
The use of asynchronous logic seems then to be a natural solution for NML technology; however, advantages come at a price. As in CMOS technology, NCL leads to a bigger area occupation. From our preliminary investigations we have found that the use of NCL logic on NML circuits has an even bigger penalty than in CMOS, and it also causes a slow-down in circuit operations. For this reason, the way in which asynchronous logic impacts magnetic logic circuits has been exhaustively investigated in this work, outlining advantages and disadvantages. Not only is an NCL solution explored, but both a mixed NCL-Boolean organization and a fully-Boolean asynchronous solution are studied. The analysis is performed using an HDL behavioral model of NML circuits that we have developed [Graziano et al. 2009a, 2009b; Vacca 2008] and explained in detail in Graziano et al. [2011]. This allowed us to design a complex microprocessor architecture as a benchmark to evaluate the circuit logic behavior and to estimate its area and power consumption.
In this article, after an explanation of the NML technology foundations, the multiphase clock system (Section 1) is analyzed. Two main issues are then discussed arising from the use of the external clock system: how to manage synchronization at layout level and how to handle feedback signals (Section 2). In Section 3, NCL logic applied to NML is examined using a microprocessor as a case study, and in Section 4 the implementation of the same microprocessor, based on a mix between Boolean and NCL logic is discussed. We demonstrate in the same section how this solution is a good compromise between feasibility and performance. Finally Section 5 describes how the same microprocessor was implemented using Boolean logic only, but still relying on an asynchronous-like organization. This solution, although requiring some care in the physical design phase, demonstrates an important improvement in performance and discloses interesting suggestions for further research. 
NANOMAGNETIC LOGIC CIRCUITS BACKGROUND
The behavior of Nanomagnetic Logic circuits is based on the interaction between neighboring cells. However, the influence of a nanomagnet on its neighbor may be too small to effectively influence its magnetization. This happens because a high energy barrier exists between the two stable magnetization states, a desirable property if stability is to be assured. An external mechanism is then required to lower the energy barrier and to drive cells into an intermediate unstable state. This action favors the influence between two neighboring nanomagnets. This mechanism is called clock and, in the case of NML, is a magnetic field applied along the short side of the magnet (hard axis). The magnetic field is generated using a current flowing through a wire buried under the plane of the nanomagnets (Figure 2(a)). Different clock signals are routed through the plane patterned with regular regions. In the example in Figure 3, the different colors represent the zones affected by different clock signals. The clock zone layout on the plane is a complex trade-off between physical feasibility of wires and NML signal propagation. We proposed a feasible and realistic layout structure to route this signal [Vacca 2008; Graziano et al. 2009b, 2011]: an example is in Figure 2(b). A planar top-view of the zone organization is shown in Figure 3. When the clock field is applied, the related nanomagnet magnetization assumes the same direction as the field. When the field is removed, the nanomagnets reorient themselves antiferromagnetically. In doing so they follow input cells placed nearby that are meanwhile in a stable state, as in the example in Figure 2(c).
A multiphase clock system is necessary [Imre et al. 2003] in order to build electronic circuits. Three or four phases of clock signals (Figure 2(e)) are used to independently drive the different clock zones of the whole circuit. The working principle of the clock system is shown in Figure 2(d), according to our three-phase proposal [Graziano et al. 2009b]. At every time-step a clock zone can be in one of three different phases: HOLD, RESET, SWITCH. In the HOLD phase no magnetic field is applied, nanomagnets are in one of the two stable magnetization states, and have a strong influence on the magnets of the neighboring zones. In the RESET phase, the magnetic field is applied and magnets are driven into the unstable state; therefore they have a small influence on the magnets of the neighboring zones. In the SWITCH phase, the magnetic field is first applied and then removed, therefore the nanomagnets reorient themselves following the last magnets of the neighboring zone, which is in the HOLD phase, acting as an input. During the next time-step, the situation is repeated but the zone in the SWITCH phase is next in the space sequence due to the time evolution of the clock signals. The time evolution shown in Figure 2(d) gives a demonstration of how the information propagates through the circuit. Figure 3 shows a possible layout of magnetic cells and a possible information flow, which follows a snake-like propagation in zones 1-2-3-1-... (thus the name snake-clock). The circuit, then, is organized as a fully pipelined architecture.
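The rotation of the three phases across consecutive zones can be sketched as follows. The zone indexing (starting from 0) and the assignment of phases at step 0 are assumptions made for illustration; the only property taken from the description above is that the zone in SWITCH reads from a preceding zone in HOLD while the following zone is in RESET, and that the SWITCH zone advances by one position at each time-step.

```python
# Order chosen so that (zone - step) % 3 == 0 marks the zone that is switching.
PHASES = ("SWITCH", "RESET", "HOLD")


def zone_phase(zone: int, step: int) -> str:
    """Phase of a clock zone at a given time-step (hypothetical indexing)."""
    return PHASES[(zone - step) % 3]


def snapshot(n_zones: int, step: int) -> list:
    """Phases of zones 0..n_zones-1 at one time-step."""
    return [zone_phase(z, step) for z in range(n_zones)]


if __name__ == "__main__":
    for step in range(4):
        print(step, snapshot(6, step))
```

Each zone cycles through RESET, SWITCH, and HOLD in that order, so a value written during SWITCH is held stable while the next zone samples it.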
It is worth underlining that this clock signal is different from the CMOS clock, where parameters like clock frequency and number of pipeline stages are free and are designed to obtain the desired performance (provided that technology limits are respected). In this case the clock is first of all necessary to assure correct circuit operation, and its characteristics, such as frequency or slope, are strongly related to the physical constraints of this technology and cannot be easily changed. In fact, in order to ensure correct switching, the clock frequency must be strictly related to the magnets' physical behavior, for example, to the switching time and the associated power consumption. At the same time, the number of pipeline stages, which is equal to the number of clock zones, depends on the physical feasibility of the clock wires (minimum metal pitch) and on the maximum number of magnets that can be placed in one zone. Experimental results show that the maximum number of magnets in a chain that switch without error generation is between 12 and 20 [Imre et al. 2003]. If we consider the thermal noise, this number is significantly reduced to a value between five and ten [Csaba and Porod 2010]. As a consequence, the circuit must be divided into smaller areas with a limited number of magnets. This in turn increases the number of pipeline stages.
The clock signal frequency we have used in our simulations is obtained by accurate micromagnetic simulations performed using a finite element simulator called NMAG [Fischbacher et al. 2007]. In Figure 4(a), an example of a wire starting with one fixed input and nine magnets is shown. The magnetization reorients according to an antiferromagnetic order. In the case presented, the simulation is stopped during the transient in order to show an intermediate point, so that not all magnets are already correctly oriented. In Figure 4(b), the magnetization transient behavior is shown for the first four magnets: magnetization goes from 0 to a negative value for magnets one and three, and to a positive value for magnets two and four. From this and other simulations, we found that the average switching time of a permalloy nanomagnet with sizes 50 nm × 100 nm × 30 nm is about 120 ps. The duration in time of the SWITCH phase must be long enough to assure the correct reorientation of all the magnets inside a clock zone. In order to obtain the timing of the switching phase, which we can apply to the whole circuit, the average number of magnets that can correctly switch (we have chosen in this example 15, as an average value between 12 and 20) is multiplied by the average switching time of a single magnet:
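With the rounded values quoted above, this product works out as follows (the symbol names are introduced here for illustration only):

```latex
T_{\mathrm{SWITCH}} = N_{\mathrm{magnets}} \cdot t_{\mathrm{sw}}
                    = 15 \times 120\,\mathrm{ps}
                    = 1.8\,\mathrm{ns}
```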
The clock period is evaluated as three times the duration of the SWITCH phase, and the frequency is the inverse of the clock period. From our simulations we have obtained a maximum frequency for the clock signal of 180 MHz.
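The dependence of the clock frequency on the number of magnets per zone can be explored with a small sketch. The function name and default arguments are assumptions for illustration; the defaults are the rounded figures quoted in the text (15 magnets, 120 ps, three phases). With these rounded inputs the result is roughly 185 MHz, of the same order as the 180 MHz reported.

```python
def clock_parameters(magnets_per_zone: int = 15,
                     switch_time_s: float = 120e-12,
                     phases: int = 3):
    """Return (SWITCH phase duration, clock period, clock frequency) in SI units."""
    # The SWITCH phase must cover the sequential switching of every magnet in a zone.
    switch_phase_s = magnets_per_zone * switch_time_s
    # One clock period spans all the phases.
    period_s = phases * switch_phase_s
    return switch_phase_s, period_s, 1.0 / period_s


if __name__ == "__main__":
    for n in (5, 10, 15, 20):
        t_sw, period, freq = clock_parameters(n)
        print(f"{n:2d} magnets/zone: period {period * 1e9:.2f} ns, "
              f"frequency {freq / 1e6:.1f} MHz")
```

Fewer magnets per zone shorten the SWITCH phase and raise the frequency, at the cost of more clock zones for the same wire length.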
NANOMAGNETIC LOGIC CIRCUIT DISCUSSION: PROBLEMS AND SOLUTIONS
The clock system previously presented implies an intrinsically wave-pipelined circuit. To better understand this, a comparison can be made with CMOS circuits. Every clock zone, where just a simple magnetic wire is routed, has the same behavior as a D flip-flop with a clock signal similar to the waveforms in Figure 2(a). As also shown in the Figure 3 layout, every clock signal is applied to a different clock zone. In other words, the first clock signal is applied to all the zones with the first color, the second clock signal is applied to all zones with the second color, and the third clock signal is applied to all the zones corresponding to the third color. This means that a group of three consecutive zones has a total delay of one clock cycle. The behavior is wave-pipelined, which is an intrinsic characteristic of NML technology and cannot be changed.
Problems
Two problems arise in this structure. They are discussed in the following.
Layout timing. This issue is explained in Figures 5(a) and (b), where an MV is reached by three inputs, corresponding to two different organizations. The delay of a signal, in terms of clock cycles, depends on the number of clock zones it crosses. In order to implement a correct circuit, signals must arrive at the inputs of every logic gate (the MV in this case) at the same time. This synchronous arrival time is necessary because when the reset field is released a magnet switches according to its neighbors. No more sampling of neighbors' magnetization can occur until after a new reset. Therefore the number of clock zones crossed by each input signal must be the same (Figure 5(b)). If this does not happen (Figure 5(a)) the operation's result is not correct, because data arrive at different clock cycles. In the simple example shown here it is easy to synchronize signals by controlling the routing. However, in complex circuits only automatic tools could help, and constraints might still not be completely satisfied.
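The layout-timing constraint can be expressed as a simple check on the number of clock zones crossed by each input of a gate. The following is a minimal sketch under that reading; the function names and the dictionary representation of a gate's inputs are hypothetical, not part of any existing tool.

```python
def is_synchronized(zones_crossed: dict) -> bool:
    """True when every input of a gate crosses the same number of clock zones."""
    return len(set(zones_crossed.values())) <= 1


def zones_to_pad(zones_crossed: dict) -> dict:
    """Extra clock zones each input wire needs in order to match the longest path."""
    longest = max(zones_crossed.values())
    return {name: longest - n for name, n in zones_crossed.items()}


if __name__ == "__main__":
    # Hypothetical zone counts for the three inputs of a majority voter.
    unbalanced = {"a": 3, "b": 5, "c": 5}
    print(is_synchronized(unbalanced))   # inputs arrive in different clock cycles
    print(zones_to_pad(unbalanced))      # lengthening wire "a" restores alignment
```

Padding shorter paths aligns the inputs of one gate, but every added zone also lengthens the wire, so the total latency of the circuit grows with the longest path.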
Feedback. The second problem arises from feedback signals. An example is presented in Figure 5(c), where an ALU executes addition between one input and its own output. Since in the case of NML the structure is pipelined, the ALU input arrives at every new clock cycle, but the second input (the feedback) arrives later (in this example after 100 clock cycles) due to the length and the delay of the NML wire. Therefore at every time-step, the ALU performs addition between the input and its output result obtained 99 clock cycles before. Changing the length of the input wire does not solve the problem because it simply changes the circuit latency. The circuit works only if the rate at which inputs are sent matches the length of the loop. For example, if a new input is sent exactly every 100 clock cycles, and in the meantime its value is kept constant, the circuit is synchronized and works correctly. But if the input arrives with a bigger delay (e.g. 300 clock cycles), the circuit will again fail to work. This happens because the feedback signal still arrives at the ALU input after 100 clock cycles (for example suppose the output is 0). The value of the first input is kept constant (for example suppose the input value is 1), therefore the ALU executes the addition between 0 and 1 and gives 1 as a result. At 200 clock cycles the situation is repeated, but this time the two input values are 1 (kept constant) and 1 (the output of the previous operation). This operation gives 2 as a result. At 300 clock cycles a new input is sent, but the output of the ALU will show the incorrect value 2 instead of the expected 0. So the circuit works only if the input is delayed by exactly 100 clock cycles. This is critical because a complex circuit is composed of many loops. If, for example, inputs are synchronized with the longest loop, then the shortest loop will not work.
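The effect of a mismatch between the input rate and the loop latency can be reproduced with a cycle-level sketch of an accumulator whose feedback wire is modelled as a FIFO. This is an illustrative model only: the function name, the zero initial state of the wire, and the sampling of the output at the end of each hold interval are assumptions, and the exact cycle at which a value appears may differ by one from the narrative above.

```python
from collections import deque


def run_accumulator(inputs, hold_cycles, loop_latency):
    """Simulate out[t] = in[t] + out[t - loop_latency] on a pipelined feedback wire.

    Each input value is held constant for hold_cycles clock cycles; the output is
    sampled at the end of each hold interval.
    """
    # Values travelling along the feedback wire, oldest first (initially all 0).
    wire = deque([0] * loop_latency, maxlen=loop_latency)
    samples = []
    out = 0
    for value in inputs:
        for _ in range(hold_cycles):
            feedback = wire[0]          # output produced loop_latency cycles ago
            out = value + feedback
            wire.append(out)            # maxlen drops the oldest value
        samples.append(out)
    return samples


if __name__ == "__main__":
    # Input rate matched to the loop: a correct running sum.
    print(run_accumulator([1, 1, 1], hold_cycles=100, loop_latency=100))
    # Input held three times longer than the loop: every input is added three times.
    print(run_accumulator([1, 1], hold_cycles=300, loop_latency=100))
```

When the hold time is a multiple of the loop latency, each input circulates through the adder once per loop traversal, which is why a single input rate cannot serve loops of different lengths.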
A possible solution to this problem, coming from technology, could be the use of electric interconnections only for feedback signals. The magnetization can be converted into an electric signal, using for example the device developed in Becherer et al. [2009], and transferred using a copper line. Finally, it could be reconverted into a magnetic field, for example using the magnetic field generated by the flow of a current. This solution is technologically complicated, but it may solve the problem without delaying circuit operations and could be adopted at least for very long interconnects.
From the logic point of view, of interest in this article, architectural solutions for this specific problem are introduced in the following.
Solutions
A possible solution for these problems is to use asynchronous circuits. In this work we explore and compare three different approaches.
For example, one possibility is to use a delay-insensitive asynchronous logic, Null Convention Logic (NCL). Signals are encoded using two bits: logic value 0 is represented by 01 and logic value 1 is represented by 10. These two values represent the DATA state. Value 00 represents the NULL state and value 11 is not allowed. The delay insensitivity is achieved by alternating NULL and DATA states. An NCL gate, then, switches from the NULL state to one of the two DATA states; however, this happens only when all the inputs switch from NULL to DATA. A gate will remain in the DATA state until all inputs return to the NULL state, and then the cycle restarts. Circuits switch periodically from NULL state to DATA state: therefore the NULL state works as a time reference for this logic. By adopting this logic then, all the synchronization problems of QCA mentioned before are automatically solved. However, using a two-bit encoding implies doubling the number of wires of the circuit. This problem is partially solved, because NML technology, at the moment, allows only coplanar wire crossing. However, this technique, experimentally demonstrated in Pulecio and Bhanja [2010], is nontrivial. Therefore it is better to limit the number of wire crossings in the layout. This constraint is certainly a limitation for NCL logic but can be solved by finding a technological means to make multilayer structures.
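The two-bit encoding and the NULL/DATA hysteresis can be sketched at the logic level. The rail ordering follows the codes given above (01 for logic 0, 10 for logic 1). The gate modelled is the two-input threshold gate with hysteresis commonly named TH22, which sets its output when both inputs are asserted and clears it only when both are deasserted; the class and function names are chosen here for illustration and do not reproduce the authors' VHDL model.

```python
NULL = (0, 0)
DATA0 = (0, 1)   # logic value 0, written 01 in the text
DATA1 = (1, 0)   # logic value 1, written 10 in the text


def encode(bit: int):
    """Boolean value to dual-rail DATA code."""
    return DATA1 if bit else DATA0


def decode(pair):
    """Dual-rail code to Boolean value, or None for NULL."""
    if pair == NULL:
        return None
    if pair == DATA0:
        return 0
    if pair == DATA1:
        return 1
    raise ValueError("code 11 is not allowed")


class TH22:
    """Two-input threshold gate with hysteresis, applied to single rails."""

    def __init__(self):
        self.out = 0

    def step(self, a: int, b: int) -> int:
        if a and b:
            self.out = 1          # all inputs asserted: output is set
        elif not a and not b:
            self.out = 0          # all inputs back to 0: output is cleared
        return self.out           # otherwise the previous value is held


if __name__ == "__main__":
    gate = TH22()
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(a, b, gate.step(a, b))
```

The hold behavior is what makes the gate insensitive to the order in which its inputs arrive: a late input only delays the transition, it never produces a wrong intermediate value.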
Most of the works in the literature focus on small circuits, but a realistic architecture should be used to learn the potential and critical points of this technology. For this reason we have investigated the behavior of a complex circuit (a microprocessor) that we have designed using NML logic. A pure NCL solution is discussed in Section 3.
This version is then compared to two other possible implementations of the same circuit. We explored a mixed Boolean-NCL approach (Section 4) in order to limit the overhead due to NCL, at least where the price to pay because of NCL is unbearable. Finally, still maintaining the idea of asynchronous behavior, we propose a version of the microprocessor based only on Boolean logic (Section 5). This new solution solves most of the problems and, remarkably, improves the performance.
Methodology
To analyze NML circuit operation and performance we have developed a VHDL behavioral model of Nanomagnetic Logic circuits. The model is simply a circuit described using VHDL, which behaves like the equivalent NML counterpart. We introduced the model in Vacca [2008] and Graziano et al. [2009a] and explained it in more detail in Graziano et al. [2009b]. Even though it is based on the ideas in Ottavi et al. [2006], Henderson et al. [2004], and Huang and Lombardi [2007], which were developed for general QCA, our model is specifically dedicated to NML circuits. It accounts for the accurate physical layout of NCL gates, allowing a realistic representation of NML circuit behavior. Moreover, it is based on the clock structure we presented in Graziano et al. [2009a] and Graziano et al. [2009b, 2011]. Starting from the logic equation of each NCL gate (see example in Figure 6(a)), we have designed the custom layout of each NCL gate (Figure 6(b)). Then the layout is converted into the corresponding model described using VHDL (Figure 6(c)). In a real circuit, magnets within a clock zone accept new data when the magnetic field is removed. Therefore we use a register to simulate the delay of a clock zone, using the clock signals reported in Figure 6(c). When the clock signal in the model is high, the register accepts new data, therefore it is equivalent to the SWITCH phase of the real circuit. In this example, in order to perform logic operations, an ideal MV, without delay, is used.
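The idea of representing each clock zone with a register can be illustrated with a short sketch. Python is used here in place of VHDL purely for illustration, and this is not the authors' model: the function name, the sequential firing order of the three clock signals within a cycle, and the sampling of the output at the end of each cycle are assumptions.

```python
def simulate_wire(inputs, n_zones, phases=3):
    """Propagate one input value per clock cycle along a wire of n_zones clock zones.

    Zone i is modelled as a register enabled by clock signal i % phases; the clock
    signals go high one after the other within each clock cycle.
    """
    zones = [None] * n_zones            # None marks a zone that holds no data yet
    outputs = []
    for value in inputs:
        for phase in range(phases):
            # Zones sharing a clock signal are never adjacent, so each register
            # reads a neighbor that is stable during this phase.
            for i in range(phase, n_zones, phases):
                zones[i] = value if i == 0 else zones[i - 1]
        outputs.append(zones[-1])       # value held by the last zone after the cycle
    return outputs


if __name__ == "__main__":
    print(simulate_wire([1, 0, 1, 1], n_zones=3))
    print(simulate_wire([1, 0, 1, 1], n_zones=6))
```

Doubling the wire from three to six zones shifts the output stream by one clock cycle, which is the dependence of delay on layout discussed above.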
This model is adopted to hierarchically build more complex circuits. It is currently based on gate layouts generated ad hoc. Clearly a more general design approach would be helpful. For this reason we are at the moment working on an automatic tool for synthesizing, placing, and routing this kind of circuit in order to obtain an accurate representation of every architecture. In this work the physical layout of the gates is based on the constraint that a wire can be composed of a limited number of magnets, for example, between 12 and 20. If a more critical condition is considered, as in Csaba and Porod [2010], where the presence of thermal noise is taken into account, the number of magnets must be reduced. However, our model is fully parametric with respect to this number, and thus any case can be easily analyzed. It is worth underlining that fewer magnets in each clock zone means a higher clock frequency, but more pipeline stages.
We have also improved our model, allowing for a hierarchical estimation of the circuit power dissipation. Starting from the exact number of magnets of the basic logic gates (majority voter, inverter, wire), the total number of magnets in the circuit is estimated. By multiplying the total number of magnets by the average power dissipated by each of them [Augustine et al. 2011], it is possible to obtain the average total power dissipated by the magnets during switching. Moreover, knowing the dimensions of the magnets and estimating the wasted space, the circuit area can be calculated. These data allow the evaluation of the clock wire length, from which the power dissipated by the Joule effect in the clock wires is calculated. We based our estimation on the most efficient clock wire structure presented in Augustine et al. [2011]. However, the power model is not detailed here, as it is outside the scope of this work.
NCL LOGIC MICROPROCESSOR
Using NCL gates only, we designed a simple but complete microprocessor. The processor, inspired by Walus et al. [2005] but substantially improved, was chosen because it contains both sequential and combinational logic circuits, and is therefore a good benchmark for testing NML. The microprocessor architecture is shown in Figure 7.
It is organized in four main blocks: a program counter, an instruction memory, a data memory, and an ALU. The program counter generates the addresses for the instruction memory, which is a parallel memory capable of storing 16 instructions. Another parallel memory is used for storing data. Finally, the ALU is used for the computational part, and it executes arithmetic (addition, subtraction) and logic (AND, OR) operations. The microprocessor is divided into four stages, separated by asynchronous registers. These registers are different from their CMOS counterparts, because they have no memory ability and their only purpose is to implement the asynchronous communication protocol. As an example, the top-left inset of Figure 7 shows a cell of the instruction memory. It is based on NCL gates and registers. Functions and naming conventions of NCL are complex and are not reported here for the sake of brevity. Details are given in our work in Graziano et al. [2009b, 2011]. However, the gates labelled here with numbers 4 and 1 are very similar (just slightly more complex) to the TH22 gate previously described (Section 2.3, Figure 6), chosen there as the simplest among NCL gates. By associating the layout in Figure 6(b) with the memory cell structure, and moving up another hierarchical level to the whole architecture in Figure 7, one can get an idea of how to relate the processor structure to a topology like that in Figure 3.
The architecture is simple, but can execute many types of instructions, for example, memory read/write, jump, and arithmetic/logic. To test the microprocessor we have implemented a division algorithm that executes, in the case reported here, 12/4. The simulation waveforms obtained using Modelsim are shown in Figure 8. The two-bit encoding, typical of this technology, is shown in Figure 8 in the two top and bottom arrays. A periodic switching from DATA state to NULL state (when all signals are 00) can also be observed in Figure 8. The time evolution of the system follows the ACK signal.
The algorithm is simple and follows the phases enumerated in Figure 8 .
0. Instructions are loaded in memory, outputs are always in DATA 0 (OUT0(x) = 0 and OUT1(x) = 1)
1. First operand (1100) forced to circuit output
2. First operand stored in data memory (outputs = 0)
3. Second operand (0011) subtracted from first
4. Result is stored in data memory
5. A counter variable is incremented by 1
6. Value stored in data memory
7. Previous operation loaded
8. Check if 0
9. Jump to step 4
10. Repeat steps 4-9
11. Repeat steps 4-9
12. Repeat steps 4-9
R. Final result shown = 0100
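The phases above amount to a division by repeated subtraction with a counter. A minimal software sketch of that scheme follows; it abstracts away the memory load/store steps, the function name is hypothetical, and the operands used in the example are the binary values listed in the steps (1100 and 0011).

```python
def divide_by_subtraction(dividend: int, divisor: int):
    """Return (quotient, remainder) by counting how many times divisor can be subtracted."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    remainder = dividend
    count = 0
    while remainder >= divisor:
        remainder -= divisor     # subtract the second operand and store the result
        count += 1               # increment the counter variable
    return count, remainder


if __name__ == "__main__":
    quotient, remainder = divide_by_subtraction(0b1100, 0b0011)
    print(format(quotient, "04b"), remainder)
```

With these operands the loop body runs four times, matching the first pass plus the three repetitions listed in the phases.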
The performance of the processor is shown in Table I. The time required for the execution of an instruction is about 5.35 μs, which is around 1000 times larger than the clock period used (5.46 ns). This can be easily explained because the delay insensitivity of the NCL logic is assured by freezing the circuit operations, while waiting for the arrival of all the signals to the circuit inputs. However, since NML is pipelined, the propagation time of the wires in terms of clock cycles (latency) can be very high; therefore operations are stopped for a very long period.
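Expressed in clock cycles, the two figures quoted above give the following ratio (the symbol names are introduced here for illustration):

```latex
\frac{T_{\mathrm{instr}}}{T_{\mathrm{clk}}}
  = \frac{5.35\,\mu\mathrm{s}}{5.46\,\mathrm{ns}}
  = \frac{5350\,\mathrm{ns}}{5.46\,\mathrm{ns}}
  \approx 980 \ \text{clock cycles per instruction}
```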
It is important to underline this point: a purely synchronous Boolean NML circuit has a throughput of one data item per clock cycle, due to its pipelined nature, but only combinational data-flow circuits are allowed (no feedback). Therefore a hypothetical Boolean NML processor could execute one instruction at every clock cycle, i.e. every 5.46 ns. But since feedback cannot work, this kind of microprocessor cannot really be used in its pure form. NCL solves the synchronization problems, allowing the construction of any kind of circuit at the cost of dramatically decreasing the overall speed. The total power dissipation of this version of the processor (due to magnets and clock) is 63.8 μW, which is a very high value compared to the results found using the other solutions discussed in Sections 4 and 5. This is due to the high number of magnets that compose the microprocessor, about 4 million nanomagnets. This is another disadvantage of NCL logic: the area increment corresponds to an increase in power dissipation.
In the last row of Table I, an estimation of the power consumption due to clock wires is shown, calculated according to the methodology described in the previous section.
To summarize the results for this implementation, we can state that adopting NCL completely solves the NML synchronization problems. Moreover, the circuit fabrication is simpler, because gates can be placed without worrying about signal synchronization. However, the drawbacks are very troublesome. The circuit area is significantly higher, and the area increment generates a proportional increment in power dissipation and a reduction in circuit speed. The huge area increase is mainly due to the memories; therefore if we implement them using Boolean logic, we expect to gain in performance. This leads us to the mixed logic approach discussed in the following section.
MIXED LOGIC MICROPROCESSOR
To improve performance we adopted a different solution. We have designed the most critical combinational parts of the processor using Boolean logic, and the sequential part using NCL logic, thus introducing a new mixed logic. This solution is based on the assertion that in NML technology combinational circuits have good performance, but they are also less complicated to implement from the synthesis point of view. From the layout point of view, particular care must be used in signal synchronization (layout-timing). Therefore, if the combinational parts are implemented using Boolean logic, the performance can be substantially improved, as the number of magnets is reduced. A synchronization signal is still required but it is much easier to route. It is related only to some areas and not to the whole circuit, and it is also necessary only for combinational circuits. The feedback problem still remains; in this case, therefore, asynchronous registers are used to better handle feedback signals.
We developed two interfaces that encode/decode signals from Boolean to NCL and from NCL to Boolean. The two interfaces are shown in Figure 7. The Boolean-NCL logic interface is simple, because it only has to split the Boolean signal into two bits according to the NCL encoding. The NCL-Boolean logic interface is more complicated, because it must include a memory loop. NCL switches periodically from NULL to DATA, but Boolean logic is always in the DATA state. As a consequence, this interface not only has to merge the two-bit encoding into one single bit, but it also has to maintain the stored value while the NCL logic is in the NULL state.
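The behavior of the two interfaces can be illustrated with a minimal behavioral sketch. It assumes the common dual-rail NCL convention, in which the pair (0, 0) is NULL, (0, 1) is DATA0, and (1, 0) is DATA1; the paper's figure may use a different rail ordering. The names `boolean_to_ncl` and `NclToBoolean` are hypothetical, and the synchronization and handshake signals are omitted.

```python
from typing import List, Tuple

# Assumed dual-rail convention (not taken from the paper's figures):
# (rail1, rail0) = (0, 0) is NULL, (0, 1) is DATA0, (1, 0) is DATA1.
NULL = (0, 0)


def boolean_to_ncl(bit: int) -> Tuple[int, int]:
    """Split one Boolean bit into a dual-rail pair (rail1, rail0)."""
    return (1, 0) if bit else (0, 1)


class NclToBoolean:
    """Merge a dual-rail pair into one bit, holding the value during NULL."""

    def __init__(self, initial: int = 0) -> None:
        # Models the memory loop inside the NCL-Boolean interface.
        self.stored = initial

    def step(self, pair: Tuple[int, int]) -> int:
        rail1, rail0 = pair
        if rail1 != rail0:
            # A valid DATA word: update the stored Boolean value.
            self.stored = rail1
        # NULL (0, 0) or the illegal pair (1, 1): keep the previous value.
        return self.stored


def demo() -> List[int]:
    """Feed a DATA/NULL sequence through the NCL-Boolean interface."""
    iface = NclToBoolean()
    stream = [boolean_to_ncl(1), NULL, boolean_to_ncl(0), NULL, boolean_to_ncl(1)]
    return [iface.step(pair) for pair in stream]
```

In this sketch the Boolean output keeps its last DATA value across every NULL phase, which is the property the memory loop is meant to provide.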
Moreover, the two interfaces must guarantee synchronization between the two logic topologies. Therefore the interfaces use an ENABLE signal, which arrives from the previous stage. This signal is different from the ACK signal, which arrives from the next stage. The ENABLE signal is generated by the logic block placed before the interface, and only when that block has updated its output.
The whole architecture is reported in Figure 7. The differences with respect to the pure NCL version consist in the Boolean blocks (gray blocks in the figure) and in the interfaces (B/N and N/B blocks in the figure). We have chosen to realize in Boolean logic only the two memories and to leave the other components in NCL. This choice is due to the inefficiency of memory structures in NCL logic. The memory cell of the Boolean memory is shown in the inset of Figure 9. It is simpler than its NCL counterpart and, thanks to its regularity, it is more likely to keep the delays due to magnetic wires (layout-timing) under control. This would certainly be more complicated in a sparse logic block. We have simulated the same division algorithm and measured the processor performance. Waveforms are not reported because they are identical to those shown in Figure 8, but with a changed time scale.
The performance of the mixed logic processor is reported in Table I. The time required for an instruction execution is slightly smaller, 4.41 μs instead of 5.35 μs. The improvement is modest because in the previous case the NCL memory was a parallel memory, so it did not have a big impact on the time balance. The big improvement is in the estimated number of nanomagnets, which is 600K instead of 4M, and in the power dissipation, which is 6 times smaller.
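As a quick check of the scale of these improvements, the ratios can be computed directly from the figures quoted above. This is only arithmetic on the rounded values given in the text (the nanomagnet counts are approximate), not a reproduction of Table I.

```python
# Figures quoted in the text for the full-NCL and mixed-logic processors.
ncl_instruction_us = 5.35
mixed_instruction_us = 4.41
ncl_magnets = 4_000_000    # "4M", an approximate count
mixed_magnets = 600_000    # "600K", an approximate count

# Ratio of instruction times and the corresponding relative reduction.
speedup = ncl_instruction_us / mixed_instruction_us
time_reduction_pct = 100 * (1 - mixed_instruction_us / ncl_instruction_us)

# Ratio of nanomagnet counts.
magnet_ratio = ncl_magnets / mixed_magnets

print(f"instruction-time speedup: {speedup:.2f}x")
print(f"instruction-time reduction: {time_reduction_pct:.1f}%")
print(f"nanomagnet reduction: {magnet_ratio:.1f}x")
```

The instruction time improves by a factor of about 1.21, whereas the nanomagnet count drops by a factor of about 6.7, in line with the roughly sixfold reduction in power dissipation.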
As we expected, implementing the memories using Boolean logic allows us to save a lot of area, significantly increasing the performance. However, the overall performance is still not satisfactory. It is also clear from our analysis in Section 2 that the presence of at least one feedback signal slows down the operations of any QCA circuit implemented in any technology. Conversely, if the whole circuit is implemented using Boolean logic, the performance can be maximized. However, the implementation of a complex NML circuit using only Boolean logic has two problems: signal synchronization becomes much more complicated, and an asynchronous-like protocol is still needed to handle feedback. While trying to solve these issues, we have found a way to design NML circuits using Boolean logic only, while still implementing an asynchronous-like protocol. An analysis follows in the next section.
BOOLEAN LOGIC MICROPROCESSOR
The innovative idea that we propose here is based on the discussion related to feedback in Section 2. If input data are sent after a certain time interval, which corresponds to the delay of the longest loop, the circuit can work correctly. However, in a complex circuit many loops are present, and it is necessary to take into account the longest loop. Notwithstanding this, as mentioned in Section 2, the shorter loops will not work. The idea is to use a block, placed at the end of every feedback loop, which slows down the faster loops to match the speed of the slowest loop in the circuit. This is done using an asynchronous-like register placed at the end of the loop, at the beginning of the pipe stage. It is sketched in Figure 9, where the architecture of the pure Boolean microprocessor is shown. The architecture is the same as presented before, but now all the blocks are implemented using Boolean logic, and the synchronization block is placed at the end of every feedback loop.
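The timing rule described above can be sketched in a few lines. The loop names and delay values below are hypothetical, chosen only for illustration, and delays are expressed in clock cycles. The input period is set by the longest loop, and each shorter loop must be held by its synchronization block for the difference.

```python
from typing import Dict


def input_period(loop_delays: Dict[str, int]) -> int:
    """New inputs are issued once per longest-loop delay (in clock cycles)."""
    return max(loop_delays.values())


def hold_cycles(loop_delays: Dict[str, int]) -> Dict[str, int]:
    """Cycles each loop's synchronization block must hold its value so that
    every loop appears as slow as the slowest one."""
    period = input_period(loop_delays)
    return {name: period - delay for name, delay in loop_delays.items()}


# Hypothetical loop delays, not taken from the paper.
example_loops = {"alu": 12, "program_counter": 5, "register_file": 8}
example_period = input_period(example_loops)
example_holds = hold_cycles(example_loops)
```

With these example values the input period is 12 cycles, and the two shorter loops are held for 7 and 4 cycles respectively, while the longest loop needs no extra hold time.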
The scheme of this block is shown in the inset of Figure 9 (bottom right). It is implemented using a multiplexer with the output connected to one of its inputs. The other input accepts incoming data from outside. The multiplexer is normally in loop mode, in which the selected input is its own output. In this situation, the output always maintains the same value. The circuit placed after this block works normally, and therefore a feedback signal is generated.
No new inputs are accepted and the circuit is then frozen in the same state, which is a latch in the memory stage. When a time corresponding to the longest loop inside the circuit has passed, a new input is sent. At the same time, a short pulse (ENABLE) is sent to the selection bit of the synchronization multiplexer. This signal travels through the circuit slightly more slowly than the input signal. Slower behavior can, however, be achieved if necessary by using an appropriate layout. This condition, even if not simple, imposes fewer burdens at the layout level than a totally synchronous solution, because only the ENABLE signal routing would be critical. Therefore, when this signal reaches the multiplexer, which behaves like a latch responding to a token, all inputs are already at their destinations. The pulse allows the multiplexer to sample the new inputs, which are stored until the next pulse arrives. In this way, we have again implemented an asynchronous communication protocol, but the whole circuit has less complexity, as there is no encoding and no handshaking. We have again tested the microprocessor using the division algorithm. The results are shown in Figure 10. The waveforms are similar to those shown in Figure 8, but in this case there is no signal encoding. The performance is boosted with respect to the mixed logic case, as the execution of the algorithm requires only 28 μs instead of 194 μs. The synchronization pulse is the ENABLE signal shown in Figure 10.
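The behavior of the synchronization block can be captured by a minimal behavioral sketch, assuming an idealized multiplexer evaluated once per step. The class name `SyncMux` and the sample values are hypothetical. The model ignores the pipelined wire delays and the relative travel times of the ENABLE and data signals, which the real layout must guarantee.

```python
from typing import List


class SyncMux:
    """Multiplexer whose output is fed back to one of its own inputs.

    With ENABLE low it stays in loop mode and holds its output; an ENABLE
    pulse selects the external input, which is then stored until the next
    pulse arrives.
    """

    def __init__(self, initial: int = 0) -> None:
        self.out = initial

    def step(self, data_in: int, enable: int) -> int:
        if enable:
            # ENABLE pulse: sample the new external input.
            self.out = data_in
        # Otherwise loop mode: the output feeds back and is kept unchanged.
        return self.out


def run(data: List[int], enable: List[int]) -> List[int]:
    """Apply a sequence of (data, enable) pairs and collect the outputs."""
    mux = SyncMux()
    return [mux.step(d, e) for d, e in zip(data, enable)]


# Hypothetical stimulus: inputs change freely, ENABLE pulses at steps 0 and 3.
outputs = run(data=[3, 3, 7, 7, 7, 9], enable=[1, 0, 0, 1, 0, 0])
```

Here the output changes only at the two ENABLE pulses, so input changes that arrive between pulses do not disturb the downstream logic.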
In the final columns of Table I the performance of this last version of the microprocessor is shown. A remarkable improvement in terms of speed, area, and power is evident. In particular, the execution time of one instruction is 8 times shorter than in the mixed version, while the number of nanomagnets and the power dissipation are 3 times smaller. The clock-wire dissipation is approximately four times smaller due to the reduced area and complexity. These results indicate that this is a promising direction for future work.
To summarize, this solution greatly enhances the circuit performance, and at the same time provides a simple way to build any kind of NML circuit. However, signal synchronization still remains a problem inside each Boolean block. We were able to synchronize signals because our model is a high-level behavioral model, which does not take into account the real layout of complex interconnections. We are currently building an automatic circuit synthesizer, placer, and router in order to obtain realistic representations of NML circuits. According to the preliminary results we obtained, in pure Boolean circuits the signal synchronization requirement can generate a large area overhead that partially cancels the advantages of this approach. This is something that we are still investigating, but for now we can state that the approach proposed here is the best solution ever proposed in the literature.
CONCLUSIONS AND FUTURE WORK
We have demonstrated that Nanomagnetic Logic Circuit technology is best suited for pure combinational circuits. Feedback signals need complex solutions and slow down circuit operations, unless a technological solution is adopted, such as using electric interconnections for long-range data transmission. The only way to solve NML problems, from the logic point of view, is to use asynchronous logic. We completed a detailed analysis of asynchronous circuits implemented using QCA technology. We have performed the analysis comparing three different types of logic: a full NCL logic, a mixed Boolean-NCL logic, and a full Boolean logic.
As expected, the use of NCL logic completely solves both the synchronization problem and the feedback signal criticality. However, the price paid in terms of area and power is too high. A mixed Boolean-NCL solution is a very good compromise between performance and circuit feasibility. The performance is lower than in the full Boolean case but noticeably better than in the full NCL solution. At the same time, signal synchronization is required, but it can be done more easily, because it is limited to a regular combinational part of the circuit.
The full Boolean solution grants a large gain in speed and large savings in area and power consumption. However, signal synchronization remains troublesome. Given the delay of the biggest loop inside the circuit, the inputs must be updated according to that delay. Furthermore, a special block (latch) is required to slow down the operations of the fastest loops inside the circuit, for synchronization purposes. This block is similar to the asynchronous register in NCL logic, which handles the communication protocol, but in this case there is no handshake. Therefore the solution that we have proposed is asynchronous-like. We believe that the results and solutions discussed in this work soundly enhance knowledge of QCA circuits and will be useful as guidelines for the future development of this technology.
