Abstract-In the post-CMOS scenario, quantum-dot cellular automata (QCA) technology plays a primary role. Irrespective of the specific implementation principle (e.g., molecular, magnetic, or semiconductive in the current scenario), the intrinsic deep-level pipelined behavior is the dominant issue. It has important consequences on circuit design and performance, especially in the presence of feedback in sequential circuits. Though already partially addressed in the literature, these consequences must still be fully understood and solutions thoroughly explored before this technology can advance further. This paper conducts an exhaustive analysis of the effects and consequences of the presence of loops in QCA circuits. For each problem that arises, a solution is presented. The analysis uses as a test architecture a complex systolic array circuit for biosequence analysis (Smith-Waterman algorithm), which represents one of the most promising applications for QCA technology. The circuit is based on nanomagnetic logic as the QCA implementation, is designed down to the layout level considering technological constraints and experimentally validated structures, counts up to approximately 2.3 million nanomagnets, and is described and simulated in an HDL using realistic protein alignment sequences as a testbench. The results presented here constitute a fundamental advancement in the field of emerging technologies because: 1) they are based on a quantitative approach relying on a realistic and complex circuit involving a large variety of QCA blocks; 2) they are derived strictly from current technological limits, without relying on unrealistic assumptions; 3) they provide general rules for designing complex sequential circuits with intrinsically pipelined technologies such as QCA; and 4) they show, with a real application benchmark, how to maximize circuit performance.
I. INTRODUCTION

Manuscript received August 27, 2013; revised February 24, 2014; accepted March 25, 2014. Date of publication October 8, 2014; date of current version September 23, 2015.
The authors are with the VLSI Laboratory, Dipartimento di Elettronica e Telecomunicazioni, Politecnico di Torino, Turin 10129, Italy (e-mail: marco.vacca@polito.it; juanchi.wang@polito.it; mariagrazia.graziano@polito.it; massimo.ruoroch@polito.it; maurizio.zamboni@polito.it).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2014.2358495
Fig. 1 (caption fragment). Since only one plane is available to route signals, a particular block, called cross-wire, allows two signals to cross without interference. Other logic gates can be fabricated by changing the shape of one magnet [18]. (e) Multiphase clock system: timing evolution of the circuit. (f) Clock signal waveforms required for the three-phase clock system.
STUDIES on quantum-dot cellular automata (QCA) envisage this technology as a promising alternative to CMOS [1]. Information is coded using cells that retain only two stable states, used to represent digital values [2]. Nearby cells influence each other like in a domino chain. Circuits are designed by placing identical cells on a plane, and computation is performed through local coupling among neighboring cells [3]. Different implementations of the general QCA principle have been proposed. The most interesting are molecular QCA [4], [5] and nanomagnetic logic (NML) [6]-[8]. In the former, molecules are the basic cells and are interesting for their potentially high operating speed (1 THz) and reduced power density due to the absence of intermolecule conduction [9]. However, the technology is far from mature and from giving experimental results in the short term [10]. NML instead uses single-domain nanomagnets as basic cells [Fig. 1(a)]. While this technology operates at lower frequencies than the molecular case (50-200 MHz) [11], [12], its magnetic nature means that it combines logic and memory in the same device, enabling the development of completely new types of circuits. It has already been experimentally demonstrated [6] and has proved to have very good tolerance to process variations [13], [14]. Furthermore, it is resistant to radiation and heat, which makes it well suited to military and space applications. Even more notably, it has potentially very low power consumption with respect to state-of-the-art CMOS technology [15], making it a possible candidate to solve the power issues that trouble designers dealing with forthcoming scaled CMOS technology nodes [16], [17]. We use NML here as a reference for the discussion. Nonetheless, any aspect mentioned in this paper can be directly applied to the other possible QCA implementations. Details on the circuit organization (an overview is in Fig. 1) and on the most important technological constraints are presented and discussed in Section II.
Two crucial aspects are instead briefly highlighted here, to state clearly the contribution of this paper.
The first issue is related to technological features that have a few consequences: 1) circuits are intrinsically pipelined; 2) the pipeline depth is dictated by technology; and 3) the delay of a signal is counted in number of clock cycles and depends on the circuit layout. This aspect has been named layout = timing [19]. It is well known: several works have discussed careful circuit layout, and circuit-level solutions have been analyzed in depth [20]-[25].
The second issue, a consequence of the first and the focus of this paper, is related to the presence of functional feedback in the architecture to be implemented. Because of the layout = timing issue, two kinds of problems arise in the presence of loops: 1) a dramatic loss of performance; and 2) signal synchronization issues. On the one hand, these might seem obvious to an experienced designer of circuits based on conventional technologies. On the other hand: 1) their solution is not obvious given typical QCA technological constraints and possibilities; 2) the problem has been mentioned in [20], but practical solutions have been given only in some cases [22], and thus it still needs to be thoroughly addressed; and 3) it assumes particular relevance when the designer tackles circuits of realistic complexity implementing functions comparable with those of conventional technology. This is especially true considering that the literature has often used circuits and case studies of simple or medium complexity to discuss these problems. Our goals and main contributions in this paper are therefore the following.
A. Goals
Since the key point is understanding whether QCA technology can be a reliable substitute for CMOS, we believe the following.
1) The issues that arise must be completely revealed.
2) The problems must be discussed considering a circuit of realistic complexity.
3) The feasibility of possible solutions should be thoroughly discussed in light of the currently available technological solutions.
4) The solutions should be general and not specific to a given architecture or a particular QCA implementation.
B. Contributions
After a short introduction to NML circuit layout and a discussion of the timing issues mentioned here (Section II), we perform the following.
1) In Section III, we introduce the test architecture we implemented, based on a complex systolic array (SA) circuit for biosequence analysis [15]. This architecture is itself a novelty for the state of the art in NML, because it is completely designed at the layout level and respects all the technological constraints, without relying on unrealistic assumptions.
2) The analysis is particularly relevant because it involves a complex circuit counting up to 2.3 million nanomagnets and comprising combinational, sequential, and memory blocks, which implies solving various and articulated design issues far beyond those addressed so far in the related literature.
3) In Section IV, we analyze and quantify the loss of performance due to the presence of feedback and propose solutions that can be applied independently of the type of architecture and of the QCA implementation.
4) In Section V, we discuss and reveal the synchronization issues, quantifying the impact of this problem on our realistic circuit.
5) In the same section, we propose solutions that not only achieve full signal synchronization but also maximize performance, while taking into account the constraints that technology imposes.
Our contribution therefore represents an important step forward in the development of QCA technology. Moreover, even though our analysis uses NML as the test technology, the results discussed here can be directly extended to all technologies that exhibit intrinsically pipelined behavior, such as molecular QCA or NanoFabric [26], or even more conventional technologies [27]. This paper thus gives general guidelines for designing sequential circuits in the presence of loops in many emerging and future technologies.
II. NML BACKGROUND AND CIRCUITS ORGANIZATION
Although the basic cell in NML technology is quite different from the cells of other implementations of the QCA principle, circuits are organized and constrained in a similar way, independently of the implementation. Figs. 1 and 2 summarize the most important characteristics. Fig. 1(b) shows, for example, the experimental fabrication of an NML wire, based on horizontally aligned magnets. The basic logic gate is the majority voter (MV) [13], shown in Fig. 1(c). It is a three-input gate where the value of the central magnet is equal to the majority of the inputs. By forcing one of the inputs to 0/1, the MV works as an AND/OR gate. More simply, AND/OR gates can be obtained by changing the shape of one magnet [18], as shown in the circuit example of Fig. 1(d) (bottom box).
Since NML circuits are so far limited to a single plane (no stacked layers are allowed), a cross-wire block [28] is used to cross two wires without interference [Fig. 1(d)].
The first issue mentioned in the introduction arises from two intrinsic technological aspects. First, the interaction among neighboring cells is not sufficient to switch magnets from one state to the opposite one. An external field, normally called the clock [29], is needed to temporarily force magnets into an intermediate unstable state [NULL in Fig. 1(a)]. This action lowers the energy barrier and consequently allows a cell to switch its neighbors. The second important technological feature is that only a limited number of cascaded elements will switch correctly in sequence without errors. This is particularly true if external influences, such as thermal noise [30], are considered.
To solve these problems and allow error-free signal propagation, multiphase clock systems were developed [7], [11], [31], [32]. For example, a three-phase clock system for NML technology was proposed in [7] and [32].
Magnets are organized in zones [e.g., zones 1, 2, and 3 in both Figs. 1(d) and 2(a)]. In each zone, only a sequence of a few magnets can reliably propagate the information, and this is enabled by applying the clock signal with the proper timing to each zone, as shown in Fig. 1(f). Thanks to this mechanism, in every time step the magnets of a clock zone can be in one of three states, RESET, SWITCH, and HOLD, as shown in Fig. 1(e). In the RESET state, an external means, such as a magnetic field, is applied to the magnets, forcing them into the NULL state. This can be obtained, for example, by injecting a current I through a metal wire under the magnet layer, as shown in Fig. 2(a) [33]. This solution works and has also been experimentally demonstrated [34]. In this case, the clock zone layout is made of parallel stripes that correspond to the wires used to carry the current, as shown in both Fig. 2(a) and (b). The current flows and a magnetic field is induced in the direction perpendicular to the main axis of the nanomagnets, thus erasing any previous magnetization state they might have. Fig. 2(c) shows a detail of the three-phase clock system [7]. Wires are placed over and under the plane so that they can be twisted, allowing signal propagation in every direction. This is one of three techniques available to build loops in NML; the others are to use a two-phase clock, as proposed in [35], or magnetoelectric interfaces that translate the magnetic signal into an electric one. For detailed explanations and results, refer to [7] and [34].
Going back to the sequence of phases, after the RESET application, in the SWITCH phase, the magnetic field is removed and magnets are free to switch to a stable state. They switch according to magnets on the left, which are in the HOLD state, that means, no magnetic field is applied. Magnets in the HOLD state act therefore as inputs for switching magnets. Fig. 1 (e) shows how in every time step this situation is repeated, but the clock zone in the SWITCH state is the next in the sequence, so signals propagate through the circuits, in this example from left to right. The multiphase clock system leads to an intrinsic pipelined behavior. Wires are equivalent to a CMOS shift register, because every consecutive group of three clock zones has a delay of one clock cycle.
However, unlike in CMOS, in QCA technology the pipeline depth is not a choice of the designer: it depends on technological constraints, such as the maximum number of cells in a clock zone and the total number of clock zones, and it is normally quite high.
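The shift-register behavior of a clocked wire can be illustrated with a small behavioral model. The following Python sketch is only an illustration under stated assumptions, not the model used for the simulations in this paper: the function name `steps_to_cross`, the rule that zone i is in SWITCH when i mod 3 equals the current phase, and the input held constant at the wire entrance are all hypothetical choices made for the example.

```python
def steps_to_cross(n_zones, value=1):
    """Count the clock phases a value needs to reach the last zone of a wire.

    Hypothetical three-phase model: in each phase, zone i is in SWITCH when
    i % 3 == phase and copies its left neighbor (which is in HOLD), while the
    zone to its right is in RESET. The input is held constant at zone 0.
    """
    zones = [None] * n_zones              # None models the NULL (reset) state
    step = 0
    while zones[-1] != value:
        phase = step % 3
        for i in range(n_zones):
            if i % 3 == phase:                    # SWITCH: copy from the left
                zones[i] = value if i == 0 else zones[i - 1]
            elif i % 3 == (phase + 1) % 3:        # RESET: forced to NULL
                zones[i] = None
            # all remaining zones are in HOLD and keep their value
        step += 1
    return step


for n in (3, 6, 9):
    phases = steps_to_cross(n)
    print(n, "zones:", phases, "phases,", phases // 3, "clock cycle(s)")
```

In this model the value advances by one zone per clock phase, so each group of three consecutive zones adds one clock cycle of delay, which is the shift-register equivalence stated above.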
Apart from the magnetic-field-based clock [7], [34], different clock solutions have been proposed in recent years, such as spin torque coupling through a current flowing through the magnets [12], or systems based on the magnetoelastic effect [15], [36], where an electric field is applied to a piezoelectric material that strains the magnets and rotates the magnetization vector. A comparison between these clock systems can be found in [15]; no further details are reported here, as they are outside the scope of this paper. It is worth mentioning that different QCA implementations will use different mechanisms, such as an electric field instead of a magnetic field in the molecular QCA case [37], [38].
The clock zone layout shown in Fig. 2(a) is based on the constraints of the magnetic field approach. Other clock systems may not be limited to this layout. However, we use this layout organization in this paper because it intrinsically solves the above-mentioned layout = timing problem. As a matter of fact, with this layout the length, in clock cycles, of all the wires from every input to every output is the same. Consequently, signals are perfectly synchronized without the need for asynchronous protocols such as those widely discussed in [21], [23], [25], and [39].
Irrespective of the physical method used, an intrinsic clocking system is not a feature exclusive to QCA technologies. Other emerging technologies, for example nanofabric circuits [26], use dynamic clocking to locally control the information flow, independently of the circuit functions. This takes to the extreme what is already happening in conventional high-performance CMOS-based architectures, where interconnect delay is often reduced by increasing the pipelining depth to maximize throughput [40].
Due to the intrinsic pipelining, the propagation delay (or latency) of a signal over an interconnection [41], measured in number of clock phases, can be very long. As a consequence, it is important to avoid long interconnection wires and to use architectures where no global interconnections are required. SAs were proposed as an ideal target for QCA technology [22], [35], [42]. SAs are circuits composed of a network of identical processors, or processing elements (PEs), that rhythmically compute and pass data through the system. The circuit regularity, coupled with the presence of only local interconnections, makes it possible to optimize the circuit area and therefore minimize the delay. However, if the PE is too complex, further optimizations are required. It is important to underline that reducing the area means reducing the power consumption as well because, as demonstrated in [43], the power dissipation in this technology strictly depends on the circuit area.
III. BIOSEQUENCE ANALYSIS
On the basis of the discussion above, and in light of the suggestion to use SAs to maximize performance in QCA circuits, it is important to identify which real applications can benefit from this technology. We believe that bioinformatics is one of the application fields that can benefit most from QCA technology. This is not only because of the remarkable and growing interest in this field, but especially because of its need for greater computational capability, as it is a so-called embarrassingly parallel application [44]. In [15], we analyzed an NML circuit for biosequence analysis and compared its performance to the same architecture implemented with CMOS transistors. The magnetic implementation is by nature slower than the molecular approach, which should be considered more suitable for an application where speed is one of the essential requirements. Nevertheless, we use NML here because it is the only implementation that is technologically feasible at the time of writing and because, from an architectural point of view, it raises the same implementation issues that any other QCA-like technology (and not only) would suffer. Here, we use the same architecture we demonstrated in [15] as a testbench to analyze quantitatively the impact of loops on NML (and, in general, QCA) technology and to inspect and evaluate the possible solutions. However, for the sake of completeness, we give here a short introduction to what a biosequence is and to how biosequence analysis is normally performed.
A. Background on Biosequence Analysis
Proteins are normally organized as long chains of amino acids (AAs), as shown in Fig. 3. In biology and biotechnology, the need very often arises to identify a specific protein, or a set of characteristics or defects in a protein. This can be done by comparing the AA sequence of the protein under test against a huge database of proteins, where each protein is made of a variable-length sequence of AAs. In most cases, protein identification is carried out by finding local alignments (regions of similarity) between the studied protein (Query) and those in the database (Subject), as shown in Fig. 3. Bioinformatics offers a large variety of algorithms, among which one of the most used is Smith-Waterman (SW). This algorithm finds an optimum local alignment between two protein sequences. Because this problem involves the analysis of a huge amount of data, software and/or hardware accelerators are necessary to improve the analysis speed. Parallel architectures, like SAs, are therefore a natural choice as a base for a dedicated hardware accelerator. We have developed an optimized version of this algorithm [45] and implemented an SA version for CMOS technology in [46]. We have then mapped the same architecture onto NML and compared it with the CMOS version in [15]. In the following, we briefly describe the NML architectural implementation. Fig. 4 shows the architecture of our NML SW implementation. Fig. 4(a) represents the general organization of the circuit. The SA is composed of identical PEs connected in a long chain. Every AA of the Query sequence to be studied is stored in one PE. Subject proteins from the database are fed to the SA input one by one. They pass through the entire structure, and at the end an alignment score is generated. The alignment score identifies the level of similarity between two AA sequences.
Among all the sequences scanned by the circuit, the one that gets the maximum alignment score is the most similar to the studied protein. Fig. 4(b) shows the architecture of a single PE, which is based on the SW algorithm [46]. A configuration part (PE_CONFIG) handles the loading of the AA of the Query sequence to be studied. The AA is stored inside a MEMORY. The calculation part (PE_CALC) is organized in two macroblocks (MAX3 and MAX4), whose aim is to evaluate the alignment score. Each of these macroblocks is based on three subtracters connected in parallel. The MAX3 block compares the alignment score evaluated inside its PE with the maximum alignment score evaluated by previous PEs. If the alignment score evaluated inside its PE is bigger than the maximum, then it becomes the new maximum and is propagated to the next PE of the SA. The MAX4 macroblock is the most important computational part of the PE. It evaluates the alignment score between the stored AA and the AA sent to the PE input. More details can be found in [46].
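For readers unfamiliar with the SW recurrence, the following minimal Python sketch computes a local alignment score in software. It uses the textbook formulation with a linear gap penalty and a plain match/mismatch score. The scoring values, the function name `smith_waterman_score`, and the example strings are assumptions made for illustration; the sketch does not reproduce the optimized algorithm of [45], the PE hardware of [46], or the substitution scores normally used for proteins. The maximum over four candidates and the running maximum are only loosely analogous to the roles of the MAX4 and MAX3 macroblocks.

```python
def smith_waterman_score(query, subject, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between two sequences.

    Textbook Smith-Waterman with a linear gap penalty; the scoring
    parameters are illustrative assumptions, not values from the paper.
    """
    prev = [0] * (len(subject) + 1)   # previous row of the score matrix
    best = 0                          # running maximum over all cells
    for q in query:
        curr = [0]
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            up = prev[j] + gap        # gap in the subject
            left = curr[j - 1] + gap  # gap in the query
            cell = max(0, diag, up, left)   # maximum of four candidates
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))   # identical sequences
print(smith_waterman_score("ACGT", "ACT"))    # one gap inside the alignment
print(smith_waterman_score("AAA", "TTT"))     # no similarity at all
```

Only two rows of the score matrix are kept, because each cell depends only on its left, upper, and upper-left neighbors; this local dependency is also what makes the algorithm a natural fit for a linear chain of PEs.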
B. SW NML Implementation
As an example, Fig. 4(c) shows a detail of a multiplexer implemented at the layout level using NML technology. Clock zones are structured as parallel stripes, cross wires are used to cross two wires on the same plane, and AND/OR gates [18] are used as basic logic gates. The main blocks implemented are adders/subtracters, multiplexers/demultiplexers, Boolean functions, decoders, and memory cells. The parallelism used is eight, as in [46]. The whole circuit has been designed at the layout level considering all the constraints currently derived from experimental results or from accurate micromagnetic simulations (partially our own work and partially found in the literature). Overall, the whole circuit counts approximately 2.3 million nanomagnets, each sized 50 nm × 100 nm. Such a large number of magnets can be fabricated with high-end optical lithography, as shown in [47]. Each clock zone includes six nanomagnets. This number was chosen according to [30] to have a reasonable clock zone size and avoid errors in signal propagation.
C. Circuit Description and Simulation Results
To simulate this circuit, we used an RTL model that we developed and presented in [43]. It is summarized in Fig. 5(a). The model relies on registers, with an appropriate clock signal applied, to simulate the propagation delay of signals through the sequence of clock zones. Ideal logic gates are instead used to model the logic functions. This kind of register-transfer level (RTL) modeling, which relies on the very high-speed IC hardware description language (VHDL), makes it easy to describe and simulate NML circuits. Further details on the model can be found in [43]. As in [15] and [46], the architecture has been simulated using as queries sequences extracted from the human hexokinase 1 regions, with the commonly used Swiss-Prot [48] as the database. Fig. 5(b) shows the simulation results of the whole SA structure. Subject Sequence ID identifies the number of the sequence fed to the SA input, which is composed of many AAs. Maximum Score identifies the maximum alignment score of a sequence. In the simulation shown in Fig. 5(b), the sequences from 2 to 12 obtain the same score, while from 12 to the end the score is different. The most similar sequence is number 14, which gets an alignment score of 15.
It is now important to state the initial performance. A new AA is fed to the circuit input every 208 clock cycles, which is the latency needed to execute the whole evaluation. Since every Subject sequence contains N AAs, finding the maximum alignment score for a particular sequence requires N times 208 clock cycles. This means about 1.8 ms with a clock frequency of 100 MHz (considered an average-case frequency for this technology [34]). In this test case, the Subject sequences all had the same number of AAs, but in general every sequence can have a different length. The longer the sequence, the longer the time required to complete the analysis. The reason why a new AA is fed only every 208 clock cycles lies in the loops present inside the PE. Being the focus of the paper, this point is thoroughly tackled in Section IV. A detailed performance analysis and comparison with CMOS is outside the scope of this paper. However, for interested readers, a timing and power comparison between NML and CMOS circuits can be found in [15].
IV. PERFORMANCE MAXIMIZATION
The presence of loops in the circuit causes a performance issue in NML circuits and, more generally, in intrinsically pipelined technologies, as in all QCA implementations [20], [22]. The circuit throughput is reduced by a factor of N, where N is the length in clock cycles of the longest loop. Fig. 6 shows a simple example that clearly outlines this problem. For simplicity of representation, the circuit in the figure is an adder whose output is connected to one of its inputs. It is in fact an accumulator, where the number of registers reflects the number of clock zones traversed by the signals. At the first clock cycle [Fig. 6(a)], a signal (A) is sent to the adder input. Due to the intrinsically pipelined nature of this technology, it would theoretically be possible to send the circuit a new input every clock cycle, because the first stage at the input is free to operate on a new value. However, if in this case a new input (B) is sent immediately after one clock cycle [Fig. 6(b)], the results are wrong. The reason is that the result of the previous operation has not yet reached the second adder input (as it would, instead, in a normal CMOS-based accumulator structure, where a single register would be present). To correctly synchronize operations, the first input (A) must be kept constant (the well-known concept of stalling) for four clock cycles, as shown in Fig. 6(c)-(e). At the fifth clock cycle [Fig. 6(f)], a new input (B) can be safely sent to the adder input. In this case, the result is correct, because the previous value has had the time to propagate back.
While perfect synchronization is obtained, the circuit throughput is reduced by a factor of four, because a new input signal can be sent only every four clock cycles. This is a common and well-known problem also in CMOS technology; however, there are some substantial differences that make the issue far more severe and its solution much more complex. First, in standard technology the level of pipelining is a design parameter, while in NML (and QCA) circuits it is intrinsic to the technology itself and is therefore a constraint. Second, the pipeline depth in CMOS is only slightly influenced by the physical design phase, while in QCA in general it depends entirely on the circuit layout. Third, in CMOS the level of pipelining is quite low, while in QCA technology it can be dramatically high: every gate is a pipeline stage, and every interconnect must be regarded as a shift register. To be concrete, in the case of the NML SW used here as a testbench, the longest loop has a delay of 208 clock cycles. As a consequence, the throughput is reduced by a factor of 208. This is certainly a remarkable problem, especially because in NML the clock frequency is quite low (around 50-200 MHz, depending on the clock solution chosen). It is then clear that this reduction in speed is not acceptable and largely limits the real possibilities of this technology to become a CMOS substitute. It is worth underlining that the solutions proposed in the literature for the layout = timing problem itself, such as asynchronous logic [7], [25], do not help in the case of loops [23]. To solve this problem, it is possible to work at two different design levels: algorithm and hardware.
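The accumulator example of Fig. 6 can be reproduced with a few lines of Python. The sketch below is a behavioral illustration under stated assumptions: the function name `run_accumulator`, the modeling of the loop as a queue of `loop_latency` in-flight values, and the numeric inputs are all hypothetical and are not taken from the circuit simulations of this paper.

```python
from collections import deque


def run_accumulator(inputs, loop_latency):
    """Model an adder whose output reaches its second input after
    loop_latency clock cycles. One input is consumed per clock cycle;
    the adder output of every cycle is returned."""
    in_flight = deque([0] * loop_latency)   # values travelling in the loop
    outputs = []
    for x in inputs:
        y = x + in_flight.popleft()         # feedback value arriving now
        in_flight.append(y)                 # result enters the loop
        outputs.append(y)
    return outputs


# Goal: accumulate A=1, B=2, C=3, i.e. obtain the partial sums 1, 3, 6.
naive = run_accumulator([1, 2, 3], loop_latency=4)
print(naive)          # inputs sent back to back never meet the feedback

stalled_inputs = [x for x in (1, 2, 3) for _ in range(4)]  # hold 4 cycles
stalled = run_accumulator(stalled_inputs, loop_latency=4)
print(stalled[::4])   # one valid result every four clock cycles
```

With back-to-back inputs, each value is added to a feedback slot that is still empty, so nothing is accumulated. With stalling the partial sums are correct, but three results take twelve clock cycles, which is the factor-of-four throughput loss described above.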
A. Interleaving
Since the pipelining is intrinsic to the hardware, the first solution to improve the throughput is to modify the algorithm to avoid data dependencies between one input data and the next. This is a solution commonly adopted in standard technology, for example, in microprocessors, where instructions are dynamically rearranged to avoid data dependency. Another solution, adopted in superscalar microprocessors in case of jump instructions, is to use predictive techniques to speculate if the next instruction depends on the result of the previous instruction or not. These are solutions that can be adopted also in case of QCA technology. However, the applicability and effectiveness of these solutions strongly depend on the algorithm, so they must be studied specifically for each application. A solution to be applied at the design stage is cut-set retiming, as thoroughly discussed in [22] . Though this is a valid solution for general QCA, if the constraints of realistic technology are considered, like the fact that strict limitations on the possible organization of clock zones hold, then the method has to be proven, especially in the case of complex circuits. This approach is at the basis of some of the modifications we propose in this paper (see next sections).
A general solution that can instead be applied to any architecture is interleaving [15]. Interleaving is based on the idea of parallelizing the algorithm and interleaving data at the circuit inputs [27]. In the case of QCA, it has been envisaged in [39], even though either no implementation and verification or only extremely simple ones have been provided up to now. Fig. 7 shows the interleaving principle applied to the same adder of Fig. 6. Four operations are executed in parallel here. At the first clock cycle, the first input of the first operation (A) is sent to the adder [Fig. 7(a)]. At the second clock cycle, the first input of the second operation (D) is sent to the adder [Fig. 7(b)]. This is correct because D does not rely on A to be evaluated: A and D are part of different operations, so there is no data dependency between them. At the third clock cycle, the first input of the third operation (G) is sent to the adder [Fig. 7(c)], and at the fourth cycle, the first input of the fourth operation (L) is sent [Fig. 7(d)]. At this point, the cycle can start again, and at the fifth clock cycle the second input of the first operation (B) is finally sent to the adder [Fig. 7(e)]. The results are correct because the signal A has had the time to propagate back to the second adder input with the due latency, as shown in Fig. 6(f). By continuing to feed interleaved data to the adder input [Fig. 7(f)-(n)], signals are perfectly synchronized and, at the same time, the throughput is maximized. A single operation is completed with a throughput reduced by a factor of four, but four operations can be executed in parallel, so one output is generated at every cycle. Fig. 8(a) shows a complete simulation of the SW using a level of interleaving equal to 3. Three different analyses are carried out in parallel, so three different Subjects are sent interleaved to the circuit. The delay between two AAs of the same Subject is always 208 clock cycles, about 1.8 µs.
However, between one AA of a sequence and the next, AAs of other sequences are sent to the circuit. In this case, the delay between two AAs of different Subjects is between 68 and 70 clock cycles, because it is not possible to divide 208 (the worst-case loop latency) into exactly three parts with the same number of clock cycles. This, however, is not a problem, and the circuit still works correctly. The maximum alignment score changes according to the Subject sequence sent to the circuit. Using interleaving level 3 improves the throughput by a factor of three. Maximizing performance requires a level of interleaving equal to 208, but this is not mandatory: a lower level of interleaving still improves performance. The throughput will therefore vary between the maximum (interleaving 208) and the minimum (no interleaving), depending on the number of operations that can be run in parallel. The efficiency obtained with a given interleaving level must be traded off against the increased complexity at the input stage, where inputs from different sequences must physically be fetched.
Interleaving is therefore a necessity for NML (and QCA) circuits if loops are present. However, due to the extremely high level of pipelining, a huge amount of data has to be provided to obtain the maximum throughput. In the case of the NML SW architecture, 208 analyses should be run in parallel, and thus all the corresponding sequences should be available from the first iteration. As a consequence, not all applications are good candidates to exploit the potential of this intrinsically pipelined technology. Biosequence analysis is one of the applications best suited to NML (and QCA) technology, because the huge amount of data to process enables massive parallelization of the algorithm, so that the maximum throughput can always be reached. This further validates our choice of developing the SW architecture using NML technology.
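The dependence of throughput on the interleaving level can be summarized with a short sketch. It assumes an idealized model in which each analysis delivers one new item per loop traversal, so that throughput grows linearly with the number of interleaved analyses; the function names are hypothetical and the only figure taken from the text is the 208-cycle loop latency.

```python
from fractions import Fraction


def throughput(loop_latency, level):
    """Items accepted per clock cycle with `level` interleaved analyses.

    Idealized assumption: one item per analysis every `loop_latency` cycles.
    """
    if not 1 <= level <= loop_latency:
        raise ValueError("level must be between 1 and the loop latency")
    return Fraction(level, loop_latency)


def speedup(loop_latency, level):
    """Throughput gain with respect to the non-interleaved circuit."""
    return throughput(loop_latency, level) / throughput(loop_latency, 1)


LOOP_LATENCY = 208  # worst-case loop latency of the SW PE, in clock cycles

print(throughput(LOOP_LATENCY, 1))    # no interleaving
print(speedup(LOOP_LATENCY, 3))       # interleaving level 3
print(throughput(LOOP_LATENCY, 208))  # full interleaving
```

Under this model, level 3 gives a threefold gain and level 208 reaches one item per clock cycle, at the cost of having 208 independent sequences available at the input.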
B. Architecture Redesign for Loops Length Reduction
To improve throughput, it is also possible to work at a different level, modifying the circuit layout to reduce the overall length of the loops. This solution is complementary to the algorithmic approach. Ideally, the loop length should be reduced to one clock cycle, which is clearly not possible in complex circuits. In the case of the SW, the general PE architecture [Fig. 4(b)] has a simple organization: all inputs come in from the left side and go out on the right side. This organization is chosen according to the general SA architecture [Fig. 4(a)], which is composed of a linear chain of PEs. With this PE architecture, the layout is optimized and the latency is minimum. However, as previously discussed, due to the SW algorithm, there is a main loop which connects the end of the blocks for the maximum alignment score to their inputs. This loop is unavoidable, because every SA compares the alignment score with the value evaluated at the previous iteration.
The circuit was changed by bending back the loop and changing the linear structure to a U-shaped structure. This principle is detailed in Fig. 9, which shows the new circuit architecture. The picture is a very simplified schematic, for the sake of clarity. The drawback of this solution is that the overall latency is increased, but the overall length of the loop is reduced from 208 clock cycles to 141 clock cycles. As a result, the circuit performance is greatly improved. Fig. 8(b) shows a simulation comparison between the original PE and the modified one without using interleaving. The analysis of 14 sequences takes only 16 ms instead of 24 ms. If interleaving is also used, it is possible to obtain the maximum throughput, and in this case only 141 analyses must be run in parallel instead of 208. This hardware solution can therefore greatly enlarge the field of applications where NML (and QCA) technology can be used and, coupled with interleaving, makes it easy to maximize performance.
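The reported execution times are consistent with a simple estimate. The sketch below assumes that, without interleaving, the analysis time scales proportionally to the loop length; this proportionality and the function name are assumptions made for illustration, while the 208 and 141 clock cycles and the 24 ms figure come from the text.

```python
def scaled_runtime_ms(runtime_ms, old_loop_cycles, new_loop_cycles):
    """Estimate the runtime after shortening the main loop.

    Assumption: without interleaving, one item is accepted per loop
    traversal, so the runtime is proportional to the loop length.
    """
    return runtime_ms * new_loop_cycles / old_loop_cycles


estimate = scaled_runtime_ms(24.0, old_loop_cycles=208, new_loop_cycles=141)
print(round(estimate, 2))  # 16.27
```

The estimate of about 16.3 ms agrees with the 16 ms observed in Fig. 8(b).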
In Fig. 9, some local loops on interconnection wires can be seen. Their presence is required for signal synchronization, which is discussed in Section V.
V. SIGNALS SYNCHRONIZATION
While the loss of performance is clearly a major problem, the presence of loops has some serious consequences also on the propagation delay; in particular, problems arise when signals must be synchronized. Two important categories of synchronization issues can be identified: 1) the presence of nested loops and 2) additional delays present on specific signals. The two aspects are treated in the following two subsections.
A. Nested Loops
In a generic circuit, it is quite normal to find several loops. Some of these loops have no reciprocal dependencies, while others are nested. A schematic representation of this situation is shown in Fig. 10(a). Since in QCA technology the pipelining is intrinsic to the layout, in the presence of multiple loops the length of these loops must be carefully studied and designed to obtain perfect signal synchronization. The SW PE is again a perfect testbench to reveal and explain this situation. Fig. 10(b) shows the schematic representation of the PEs. Two main loops are present: loop-1) the output of the MAX4 block, which is connected back to one of the adder inputs and to a multiplexer, and loop-2) the output of the MAX3 block, which is connected back to its inputs. These two loops are not nested but independent.
Loop-1 is, however, composed of two nested loops, as shown in the details of Fig. 10(b). The big arrow identifies the output data signal coming from the MAX4 block, which is connected to the adder at the bottom, whose output is in turn connected to the MAX4 inputs again. The small arrow represents instead a control signal, generated by the MAX4 block, which is connected to the selection bit of a multiplexer. The output of this multiplexer is then connected to the adder input together with the signal represented by the big arrow. As a consequence, these two nested loops have two different lengths. This means that the signal represented by the big arrow arrives at the adder input before the correct output can be generated by the multiplexer. The result is therefore unavoidably wrong, as shown in Fig. 8(c) (the whole simulation) or Fig. 10(c) (a detail). These waveforms refer to a simulation of the SW with and without proper loop lengths. Only if the lengths of the two loops are equalized is the operation perfectly synchronized, as shown in Fig. 10(b) (correct box), and the SW behaves correctly, as shown in the simulation.
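The equalization of nested loops can be sketched as a simple bookkeeping step. The path names and latencies below are purely hypothetical values chosen for illustration (the text does not report the lengths of the two nested loops), and the function name is an assumption; the sketch only shows how much extra wire delay each path needs so that all signals reach the adder in the same clock cycle.

```python
def padding_for_equal_arrival(path_latencies):
    """Return the extra delay, in clock cycles, to add on each path.

    `path_latencies` maps a path name to its latency in clock cycles.
    Every path is padded up to the latency of the slowest one.
    """
    target = max(path_latencies.values())
    return {name: target - latency for name, latency in path_latencies.items()}


# Hypothetical latencies: the data path (big arrow) is shorter than the
# control path through the multiplexer (small arrow).
paths = {"data_big_arrow": 12, "control_via_mux": 17}
print(padding_for_equal_arrival(paths))
# {'data_big_arrow': 5, 'control_via_mux': 0}
```

In an intrinsically pipelined layout, this padding corresponds to lengthening the shorter wire, since delay can only be added through additional wire length.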
B. Additional Delay Loops
Another important situation that must be carefully considered is the need to add delay loops to synchronize signals. In CMOS, it is quite normal to add registers to delay specific signals as required by the implemented algorithm (skewing and deskewing networks). This is also the case for the SW algorithm. The key element of this algorithm implemented in CMOS is shown in Fig. 11(a). Every PE computes the local maximum alignment score by comparing the result of the previous MAX operation with the maximum evaluated by the previous PE at the previous clock cycle and two clock cycles before [46]. This situation is well explained in Fig. 11(a), where the MAX_IN signal is connected to the MAX4 block twice, the first time through only one register and the second time through two registers.
To map the same situation onto NML (or QCA) technology, it is important to understand the delay between subsequent data sent to the circuit. In standard technology, a new data item, i.e., an AA symbol, is sent to the circuit input at every clock cycle. Therefore, if an extra register is added to a specific signal, that signal is effectively sampled at the value that it had two clock cycles before. However, in QCA technology, if at least one loop is present in the circuit, a new data item is sent to the circuit every N clock cycles. This is also true when interleaving is used. With interleaving, an AA is sent to the circuit; then, at every clock cycle for the next 207 clock cycles, a new AA of a different sequence is sent to the circuit input. Only after 208 clock cycles is a new AA of the first sequence sent again to the circuit input. As a consequence, even when interleaving is adopted, the delay between two subsequent AAs of the same sequence is always N clock cycles.
To map this algorithm to QCA technology, then, an additional delay must be added on the MAX_IN signal. Since the pipelining is intrinsic to the layout, adding a delay to a specific signal means making its corresponding wire longer. Nonetheless, to solve the layout = timing issue, every input signal of a specific block must have the same length. As a consequence, to add a delay to a specific wire, a wire loop has to be used, as shown in Fig. 11(a). In this way, every input signal to the MAX4 block has the same length, except for the first one, which is longer. Two results are thus obtained: signals are synchronized and the algorithm is respected. Fig. 11(a) shows how, in the mapping process from CMOS to NML, only the additional register on the first signal becomes a wire loop. This happens because the registers that are common to all the inputs change the propagation delay on all signals.
The last issue that calls for investigation is the length of the additional loop. As previously explained, adding a register to a specific signal in CMOS means considering the signal sent one clock cycle before. Since in QCA technology an input must be sent every N clock cycles, the length of this additional loop in terms of clock cycles must be exactly equal to N. This is equivalent to sampling the previously sent AA of the same sequence. Fig. 11(b) highlights the additional loop added to the circuit. Fig. 8(d) shows instead a comparison between simulations obtained with and without the synchronization loop. If this loop is not present, the results are totally wrong. In conclusion, additional CMOS registers used to delay only selected signals correspond in QCA technology to synchronization loops.
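The requirement that the synchronization loop be exactly N clock cycles long can be checked with a small behavioral sketch. The names `interleaved_stream` and `delayed`, the labeling of items as (sequence, position) pairs, and the small value N = 4 are assumptions made for illustration; in the SW circuit discussed here N would be the loop length in clock cycles.

```python
def interleaved_stream(n_sequences, n_items):
    """One item per clock cycle, cycling through the sequences.

    Each item is labeled (sequence_id, position) instead of carrying data.
    """
    return [(s, i) for i in range(n_items) for s in range(n_sequences)]


def delayed(stream, delay_cycles):
    """Model a wire loop: the output is the input `delay_cycles` earlier."""
    return ([None] * delay_cycles + stream)[: len(stream)]


N = 4  # loop length in clock cycles; full interleaving uses N sequences
stream = interleaved_stream(n_sequences=N, n_items=3)

# A loop of exactly N cycles returns the previous item of the same sequence.
for current, previous in zip(stream[N:], delayed(stream, N)[N:]):
    assert previous == (current[0], current[1] - 1)

# A delay of one cycle, as a single CMOS register would give, returns an
# item that belongs to a different sequence.
for current, previous in zip(stream[1:], delayed(stream, 1)[1:]):
    assert previous[0] != current[0]
```

The two checks reflect the argument above: a delay of N cycles plays the role of one CMOS register for each interleaved sequence, while any other length mixes data of different sequences.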
The synchronization loops in Fig. 9(a) therefore emulate CMOS registers and are used to add a delay to a specific signal. In Fig. 9(a), the architecture was changed to reduce the length of the main loop. In that case, the circuit was reshaped by bending back the main loop. The result was a reduction of the loop length, at the price of an increased propagation delay on that specific signal. As a consequence, all the other signals must be delayed, using synchronization loops, to match the increased delay of the feedback signal.
VI. CONCLUSION
In this paper, we have presented a complete overview of the major issues related to the presence of feedback signals in intrinsically pipelined technologies, using as a reference QCA technology in its NML implementation. Results are based on a large and complex systolic architecture for biosequence analysis. It is implemented using NML down to the detailed layout level, considering realistic technological limits. The results we present are valid not only for QCA technology but also for all the emerging technologies that have an intrinsically pipelined behavior at the microarchitectural level. Two kinds of problems arise in the presence of loops: 1) performance reduction, which can be solved by using interleaving and by redesigning circuits to reduce the loop length; and 2) failures due to incorrect signal synchronization, which can be solved by properly designing the loop lengths in the case of nested loops and by adding synchronization loops. This paper represents a milestone in the design of circuits for intrinsically pipelined emerging technologies and can be used by researchers as a collection of guidelines for designing complex circuits with both combinational and sequential parts.
