Biosequences analysis on NanoMagnet Logic by Wang J.C. et al.
04 August 2020
POLITECNICO DI TORINO
Repository ISTITUZIONALE
Biosequences analysis on NanoMagnet Logic / Wang J.C.; Vacca M.; Graziano M.; Ruo Roch M.; Zamboni M. - ELECTRONIC. - (2013), pp. 131-134. (Paper presented at the International Conference on IC Design and Technology, held in Pavia, Italy, 29-31 May.)
Publisher: IEEE - INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/ICICDT.2013.6563320
Availability: This version is available at: 11583/2511684
Terms of use: openAccess. This article is made available under terms and conditions as specified in the corresponding bibliographic description in the repository.
Biosequences analysis on NanoMagnet Logic
J. Wang, M. Vacca, M. Graziano, M. Ruo Roch, and M. Zamboni
Dipartimento di Elettronica e Telecomunicazioni, Politecnico di Torino, Italy
Abstract: In the last decade Quantum dot Cellular Automata (QCA) technology has been one of the most studied among the emerging technologies. Its magnetic implementation, NanoMagnet Logic (NML), is particularly interesting as an alternative to CMOS technology. The main advantages of NML circuits reside in the possibility of mixing logic and memory in the same device, the expected low power consumption, and the remarkable tolerance to heat and radiation. NML and QCA circuits behave differently from their CMOS counterparts. Consequently, the architecture organization must be tailored to their characteristics, and it is important to identify which applications are best suited for this technology.
Our contribution reported in this paper represents a considerable step forward in this direction. We present an optimized NML implementation of a hardware accelerator for biosequences analysis. The architecture leverages the systolic array structure, which is the best organization for this technology due to the regularity of its layout. The circuit is described using a VHDL model and simulated to verify its functionality from the application point of view, and its performance is evaluated in terms of both speed and power consumption. Results indicate that NML technology with the appropriate clock solution can reach a considerable reduction in power consumption over CMOS. This analysis highlights quantitatively, and not only qualitatively, that NML logic is perfectly suited for Massively Parallel Data Analysis applications.
I. INTRODUCTION
Quantum dot Cellular Automata (QCA) was presented in [1] as an alternative to CMOS technology. In QCA, charge states rather than voltage levels are used to represent logic values. Among the implementations of the QCA principle, NanoMagnet Logic (NML) is one of the most successful [2]. The reason behind the success of NML is its magnetic nature: rectangular single-domain nanomagnets with only two stable states (Figure 1.A) are used as basic cells [3]. Since the basic circuit element is a magnet, it can act both as a logic device and as a memory, allowing the development of a totally new kind of logic circuit in which memory and logic are no longer separate entities. Moreover, NML has a strong resistance to heat and radiation, and its power consumption can potentially be tens of times smaller than that of CMOS logic [4].
NML circuits are built by placing magnets on a single plane. Information propagates through the circuit by means of magnetostatic interaction among neighboring elements, achieving simple interconnect functions or more complex logic functions (see, for example, the Majority Voter (MV) [3] that we fabricated, shown in Figure 1.B). To obtain information propagation a clock mechanism is required [5]; a detailed explanation is given in section II. This unavoidable requirement has an important consequence at the architectural level: circuits have an extremely deep level of pipelining (see section II) and, consequently, throughput is notably reduced in the presence of feedback. The solution we propose in this paper is to exploit data interleaving. We demonstrate the problem and the solution with a detailed NML implementation of a systolic array, a structure that is considered particularly suitable for this technology [6]. Systolic arrays are circuits made of a network of identical, locally interconnected processing elements. The regular layout and the absence of long interconnections make them ideal for NML technology. However, in the circuits proposed in the literature [6] the processing elements are very simple, which makes a correct evaluation of this solution difficult.
Fig. 1. A) Nanomagnets are used as basic cell. B) Example of a fabricated Majority Voter (thanks to Nanofacility Piemonte - INRIM Torino). C) Clock signals waveforms. D) Clock zones and signal propagation.
© 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
In this paper we show a complex systolic array design in NML, applied to a hardware accelerator for biosequences analysis [7]. Biosequences analysis is interesting for NML technology because the huge amount of data to process allows massive parallelization, maximizing the throughput. With this work we therefore achieve two important results: 1) we demonstrate a complex systolic array NML architecture applied to a real-life case study, quantifying its advantages and disadvantages, and 2) we show the path that must be followed in the development of this technology, highlighting what kinds of applications can be successful with it.
II. NML CLOCK SYSTEM: PROBLEMS AND SOLUTIONS
The clock system. In NML, when the clock signal is applied, magnets are forced into an intermediate unstable state, with the magnetization vector directed along the shorter side of the magnets. This is achieved by means of an external magnetic field. When the magnetic field is removed, magnets align themselves antiferromagnetically, following an input magnet. Since the number of magnets that can be aligned without incurring errors during the realignment phase is limited [8], a multiphase clock system must be used. Three time-varying clock signals (see Figure 1.C for a simplified representation) with a phase difference of 120° are applied to specific areas of the circuit layout called clock zones (Figure 1.D). Clock zones are made of a limited number of chained magnets, typically 5, ensuring error-free signal propagation. Using this clock system, when the magnets of a clock zone are in the SWITCH state (magnetic field removed), the magnets of the clock zone to the left are in the HOLD state (no field applied) and act as inputs, while the magnets of the clock zone to the right are in the RESET state (magnetic field applied) and have no influence on the switching magnets. Figure 1.D shows signal propagation through an NML wire (a simple chain of magnets) thanks to the multiphase clock system.
Fig. 2. A) Magnetic field based NML. B) STT-current based NML. C) Magnetoelastic NML. D) Circuit example. E) VHDL model.
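The zone-by-zone propagation described above can be mimicked with a small behavioural sketch. It is not a micromagnetic model: each clock zone is reduced to a single stored value, the inversion between antiferromagnetically coupled neighbors is ignored, one time step is assumed to be one third of a clock period, and all names (`zone_state`, `propagate`) are illustrative assumptions rather than anything defined in the paper.

```python
# Behavioural sketch of a three-phase clocked NML wire (illustrative names only).
PHASES = ("RESET", "SWITCH", "HOLD")


def zone_state(zone: int, step: int) -> str:
    """State of a clock zone at a time step (one step = 1/3 of a clock period).

    Neighbouring zones are shifted by one phase: when zone z is in SWITCH,
    zone z-1 (its driver) is in HOLD and zone z+1 is in RESET.
    """
    return PHASES[(step - zone) % 3]


def propagate(bits, n_zones, n_steps):
    """Feed `bits` into zone 0 and return the wire contents after each step."""
    zones = [None] * n_zones      # None marks a zone holding no valid data
    feed = iter(bits)
    history = []
    for step in range(n_steps):
        new = list(zones)
        for z in range(n_zones):
            state = zone_state(z, step)
            if state == "RESET":
                new[z] = None
            elif state == "SWITCH":
                # Copy from the left neighbour (in HOLD) or from the input.
                new[z] = next(feed, None) if z == 0 else zones[z - 1]
            # HOLD: the zone keeps its value and drives its right neighbour.
        zones = new
        history.append(list(zones))
    return history


if __name__ == "__main__":
    for step, row in enumerate(propagate([1, 0], n_zones=4, n_steps=6)):
        print(step, row)
```

In this sketch a value advances by one zone per step, so it crosses three zones per clock period, and zone 0 accepts a new input once every three steps, that is, once per clock cycle.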
The clock physical implementation. The external magnetic field necessary for the clock mechanism can be generated by a current flowing through a wire placed under the magnet plane (Figure 2.A). As shown in [3], a current of 545 mA on a 1 µm wide wire is required to successfully force magnets into the RESET state. This very high current leads to a very high power consumption, wasting the advantage of the tiny power dissipation due to magnet switching. To reduce power consumption other mechanisms have been proposed. For example, in the STT-current approach [9] magnetic tunnel junctions (MTJs) are used as the basic element. An MTJ can be reset by a current flowing through it, leading to an energy consumption of just 1.6 fJ per magnet (Figure 2.B). Alternatively, in [4] we proposed an innovative clock system based on an electric field instead of a magnetic field. With this clock solution magnets are deposited on a piezoelectric layer (Figure 2.C). When an electric field is applied, the strain of the piezoelectric layer induces a mechanical stress on the magnets, forcing them into the RESET state. With this clock solution an energy of just 2 pJ is required to switch magnets [4], making it possible to build truly low-power circuits.
Consequences at architectural level. The most important consequence of this clock system is that every consecutive group of three clock zones introduces a delay of one clock cycle. NML circuits (and QCA circuits in general) therefore have an intrinsically pipelined behavior, where the level of pipelining is not a choice of the designer but depends on the circuit layout. For example, a signal that propagates through a wire has a propagation delay, in clock cycles, proportional to the wire length. This is particularly important if there are loops in the circuit [10]. The presence of a loop prevents new data from being sent every clock cycle, because the signals must first propagate through the feedback path. This might require hundreds of clock cycles [11]. A problem that is typical of pipelined CMOS microprocessors is amplified here by the extremely deep pipelining, which involves both logic and interconnections.
Fig. 3. Biosequence Alignment analysis
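The relation between layout and latency can be made concrete with a back-of-the-envelope estimate. The helper below is only a sketch: it assumes clock zones of 5 magnets (the typical value quoted in the clock system description) and one clock cycle of delay per three zones, and the function names and the rounding-up convention are assumptions made for illustration.

```python
import math


def wire_latency_cycles(n_magnets: int, magnets_per_zone: int = 5) -> int:
    """Estimated delay, in clock cycles, of a straight NML wire."""
    n_zones = math.ceil(n_magnets / magnets_per_zone)  # clock zones crossed
    return math.ceil(n_zones / 3)                       # three zones per cycle


def loop_issue_interval(loop_magnets: int) -> int:
    """Cycles to wait between two dependent inputs when a feedback loop exists."""
    return max(1, wire_latency_cycles(loop_magnets))


if __name__ == "__main__":
    print(wire_latency_cycles(150))    # 30 zones, hence 10 cycles
    print(loop_issue_interval(3000))   # a long feedback path costs many cycles
```

Under these assumptions the latency grows linearly with the number of magnets in the path, which is why long feedback loops dominate the time between dependent inputs.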
To solve this problem, parallel computation can be exploited using the interleaving technique. For example, in an NML microprocessor [11] the output of the ALU is typically connected back to the circuit through a loop. Before a new instruction can be sent to the circuit, it is necessary to wait for the result of the previous operation to propagate through the feedback path. However, if N independent threads run in parallel, where N is the length of the loop in clock cycles, a new instruction from a different thread can be sent at every clock cycle. In this way the pipeline is always full and the throughput is maximized. In section III we show how we implemented this solution for a biosequence analysis systolic array structure.
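The effect of interleaving on pipeline utilization can be illustrated with a toy issue scheduler. This is a sketch under simple assumptions: every thread may issue its next operation only `loop_latency` cycles after its previous one, at most one operation is issued per cycle, and ready threads are served in index order. The names `schedule` and `throughput` are hypothetical.

```python
def schedule(n_threads: int, ops_per_thread: int, loop_latency: int):
    """Return the issue order as a list of (cycle, thread, op_index) tuples."""
    ready_at = [0] * n_threads   # earliest cycle at which each thread may issue
    done = [0] * n_threads       # operations already issued per thread
    issued = []
    cycle = 0
    total = n_threads * ops_per_thread
    while len(issued) < total:
        for t in range(n_threads):
            if done[t] < ops_per_thread and ready_at[t] <= cycle:
                issued.append((cycle, t, done[t]))
                done[t] += 1
                # The result must travel the feedback loop before reuse.
                ready_at[t] = cycle + loop_latency
                break            # at most one issue per clock cycle
        cycle += 1
    return issued


def throughput(issued) -> float:
    """Operations issued per clock cycle over the whole schedule."""
    last_cycle = issued[-1][0]
    return len(issued) / (last_cycle + 1)


if __name__ == "__main__":
    print(throughput(schedule(1, 3, 4)))   # single thread: mostly idle cycles
    print(throughput(schedule(4, 3, 4)))   # N = loop latency: one issue per cycle
```

With a single thread the pipeline is idle for most cycles, while with as many independent threads as the loop latency an operation is issued at every cycle, which is the situation described in the paragraph above.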
Methodology. The multiphase clock system can be used to build a VHDL model of NML circuits [12]. Taking as an example the circuit shown in Figure 2.D, thanks to the pipelined nature of this technology every clock zone is equivalent to a register driven by the clock signals of Figure 1.C. At every new clock cycle new data are sampled by the clock zone. Figure 2.E shows the equivalent VHDL model of the circuit: registers model the clock zones, and therefore the propagation delay of signals, while ideal logic gates model the logic functions. In this case AND/OR gates based on [13] are used as an example. This model makes it possible to easily design and simulate complex circuits, and it was used to design and simulate the architecture presented in this paper.
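The register-plus-ideal-gate modelling idea can be sketched in a few lines of Python (used here instead of VHDL purely for illustration). The circuit below is hypothetical and is not the one of Figure 2.D: inputs `a` and `b` travel two clock cycles to an AND gate, its result travels one cycle to an OR gate together with input `c`, and the output wire adds one more cycle. Each `DelayLine` stage stands for a group of three clock zones, that is, one clock cycle of delay.

```python
from collections import deque


class DelayLine:
    """A chain of clock zones modelled as registers (one stage = one clock cycle)."""

    def __init__(self, n_cycles: int):
        self.regs = deque([0] * n_cycles, maxlen=n_cycles)

    @property
    def out(self) -> int:
        return self.regs[-1]          # oldest value, leaving the wire

    def tick(self, value: int) -> None:
        self.regs.appendleft(value)   # the oldest value falls off the far end


def simulate(inputs):
    """inputs: one (a, b, c) tuple per clock cycle; returns the output per cycle."""
    wire_a, wire_b, wire_c = DelayLine(2), DelayLine(2), DelayLine(3)
    wire_and, wire_out = DelayLine(1), DelayLine(1)
    outputs = []
    for a, b, c in inputs:
        # Ideal gates read the register outputs before any register updates.
        and_value = wire_a.out & wire_b.out
        or_value = wire_and.out | wire_c.out
        outputs.append(wire_out.out)
        wire_out.tick(or_value)
        wire_and.tick(and_value)
        wire_a.tick(a)
        wire_b.tick(b)
        wire_c.tick(c)
    return outputs


if __name__ == "__main__":
    stimulus = [(1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 0, 0)] + [(0, 0, 0)] * 4
    print(simulate(stimulus))   # results appear 4 cycles after their inputs
```

In this sketch the gates contribute no delay of their own; all latency comes from the registers, so the result of `(a AND b) OR c` appears four cycles after the inputs are applied. Note that wire `c` is deliberately made three cycles long so that it reaches the OR gate aligned with the AND result.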
III. ALGORITHM AND ARCHITECTURE
Algorithm. Biosequences alignment analysis is a constantly growing field. Proteins are the fundamental constituents of animal and plant cells and are composed of chains of amino acids (AAs) drawn from a set of 23 (Figure 3), which are normally represented by alphabetical characters. The aim of biosequences analysis is to identify similarities between sequences of amino acids, for example to reconstruct the evolutionary pathway that led to the differentiation of species, or to understand the genetic cause of a disease by comparing a mutated cell with a normal one. Biosequences alignment analysis is commonly done by comparing one sequence of amino acids (Query) with the ones from the databases (Subject) that have been developed from other studies on genome sequencing projects. Such comparisons are done by aligning sections of the sequences to find their maximum similarity (Figure 3). This can be a costly task because of the exponential growth of biosequences databases, so it is important to speed up the algorithms, also by using dedicated hardware accelerators [7]. One of the most used algorithms for biosequences analysis is Smith-Waterman [14], which exhaustively evaluates the best alignment score between a Query sequence and a Subject sequence from the database.
Fig. 4. A) NML Smith-Waterman systolic array. B) Detailed layout of a processing element. C) Circuit detail of an adder. D) Simulation results: Comparison between CMOS and NML.
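For readers unfamiliar with the algorithm, the Smith-Waterman recurrence can be sketched as follows. This is a minimal software version under stated assumptions: a simple match/mismatch score replaces the amino acid substitution table used by the accelerator, gaps carry a linear penalty, and the parameter values are arbitrary choices for illustration.

```python
def smith_waterman_score(query: str, subject: str,
                         match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between query and subject (linear gap penalty)."""
    prev = [0] * (len(query) + 1)   # previous row of the score matrix H
    best = 0
    for s in subject:
        curr = [0]                  # H(i, 0) = 0
        for j, q in enumerate(query, start=1):
            substitution = match if q == s else mismatch
            h = max(0,                              # start a new local alignment
                    prev[j - 1] + substitution,     # align q with s
                    prev[j] + gap,                  # gap in the query
                    curr[j - 1] + gap)              # gap in the subject
            curr.append(h)
            best = max(best, h)
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "ACGT"))
    print(smith_waterman_score("ACGT", "AGT"))
```

Only two rows of the score matrix are kept because the alignment score, not the alignment itself, is the quantity of interest here.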
Architecture. The circuit architecture is shown in Figure 4.A. It is based on a systolic array structure organized as a chain of identical processing elements (PEs). Each amino acid of the Query sequence under study is associated with one PE. As a consequence, the larger the number of PEs, the longer the sequence of amino acids that can be studied. Amino acids of the Subject sequence coming from the database are sent as inputs to the systolic array. The detailed layout of a processing element is shown in Figure 4.B. At the beginning, one amino acid of the Query sequence is loaded into each processing element. This task is executed by a configuration block (PE CONFIG), which stores 23 values in memory. These values represent, for the specific amino acid of the Query sequence, its alignment score against each possible amino acid, that is, its relation with all the other amino acids. These values are used by the PE CALC part of the circuit (Figure 4.B) to calculate the maximum alignment score. After the initial configuration phase, amino acids of the Subject sequence coming from the database are sent to the first processing element and pass through the entire chain of processing elements. Each PE calculates the local alignment score using a circuit based on adders and subtractors. The local alignment score is passed from one processing element to the next until it reaches the end of the chain and becomes the maximum alignment score for a specific sequence of amino acids. The Subject sequence with the highest score is therefore the one most similar to the Query sequence under investigation. Further details, not given here for reasons of space, can be found in [7].
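The data flow through the chain of PEs can be sketched in software. The model below is an illustration, not the circuit of Figure 4: it assumes a match/mismatch score in place of the 23 stored values per PE, a linear gap penalty, a single Subject sequence, and one pipeline register between neighboring PEs. All names are hypothetical.

```python
def systolic_score(query: str, subject: str,
                   match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Local alignment score computed by a simulated chain of processing elements."""
    n = len(query)
    # regs[j] feeds PE j with (subject residue, H from the left PE, running max);
    # regs[n] is the output of the last PE.
    regs = [(None, 0, 0)] * (n + 1)
    h_up = [0] * n      # H(i-1, j): the PE's own previous result
    h_diag = [0] * n    # H(i-1, j-1): the left input seen one residue earlier
    best = 0
    for residue in list(subject) + [None] * n:   # trailing bubbles flush the chain
        regs[0] = (residue, 0, 0)
        # Walk the PEs right to left so each one reads its neighbour's old output.
        for j in reversed(range(n)):
            s, h_left, running_max = regs[j]
            if s is None:                        # pipeline bubble: pass it on
                regs[j + 1] = (None, 0, running_max)
                continue
            substitution = match if s == query[j] else mismatch
            h = max(0, h_diag[j] + substitution, h_up[j] + gap, h_left + gap)
            h_diag[j] = h_left
            h_up[j] = h
            regs[j + 1] = (s, h, max(running_max, h))
        best = max(best, regs[n][2])
    return best


if __name__ == "__main__":
    print(systolic_score("ACGT", "AGT"))
```

Each PE needs only its own previous result and the values arriving from its left neighbor, which is what allows the purely local interconnections of a systolic array.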
Layout. Figure 4.C shows the circuit layout of an adder based on NML logic. It is a simple ripple carry adder, which our analysis showed to be one of the most efficient in this technology. Every full adder is based on majority voters (MV in Figure 1.B), whose output is equal to the majority of the inputs [15]. Cross-wires are instead particular blocks that allow two wires to cross on the same plane without interference.
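A majority-based ripple carry adder can be sketched as follows. The exact gate arrangement of Figure 4.C is not described in the text, so the code uses a well-known construction with three majority gates and inverters per full adder; treat it as one possible realization, not necessarily the layout adopted in the paper.

```python
def maj(a: int, b: int, c: int) -> int:
    """Majority voter: output equals the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def full_adder(a: int, b: int, cin: int):
    """Full adder from three majority gates; 1 - x models an inverter."""
    cout = maj(a, b, cin)
    total = maj(1 - cout, cin, maj(a, b, 1 - cin))
    return total, cout


def ripple_carry_add(x: int, y: int, width: int = 8):
    """Add two unsigned integers bit by bit; returns (sum mod 2**width, carry out)."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry


if __name__ == "__main__":
    print(ripple_carry_add(100, 57))
    print(ripple_carry_add(200, 100))
```

The carry path passes through one majority gate per bit, which keeps the layout regular.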
IV. RESULTS
In order to compare the performance of the NML implementation with CMOS technology, a 5-element systolic array has also been described and synthesized using a standard CMOS technology [7]. Figure 4.D shows the comparison between the waveforms obtained in the CMOS case and in the NML case. Subject ID represents the number of the amino acid subject sequence, while OUT MAX represents the maximum alignment score of the sequence. The two simulation waveforms show the same results, demonstrating the correctness of the circuit. However, it is worth underlining that the waveforms have been normalized to make them easier to compare. The timescale is different: in CMOS the clock frequency is around 370 MHz and the latency for one PE is 1 clock cycle. In the case of NML the frequency is 100 MHz (considering a realistic implementation) and the latency is 209 clock cycles, meaning that a new amino acid can be sent only every 209 clock cycles. The long latency is due to the intrinsic pipelined nature of NML explained previously, and therefore to the long propagation time of feedback signals [10], which is evident here. Using this straightforward implementation, before sending a new amino acid it is necessary to wait for all feedback signals to propagate through the circuit in order to avoid data conflicts. Because of this long latency, the throughput of the NML version is greatly reduced. As a solution to maximize throughput we adopt parallel computing and data interleaving. As a consequence, in both the CMOS and the NML case a new amino acid can be sent every clock cycle, and the only difference in speed is due to the clock frequency. With this improvement the NML implementation is therefore about 4 times slower than the CMOS implementation (100 MHz versus 370 MHz, a factor of 3.7).

TABLE I. POWER CONSUMPTION AND AREA ESTIMATION FOR A SINGLE PROCESSING ELEMENT OF THE SYSTOLIC ARRAY, FOR THE MAIN NML IMPLEMENTATIONS AND FOR CMOS LOP 21NM TECHNOLOGY.
Implementation        Area (µm²)   Power (mW)
Magnetic Field NML    21000        2
STT-current NML       20000        131
Magnetoelastic NML    12000        0.01
CMOS LOP 21nm         1000         0.72
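The speed figures quoted above can be checked with a short calculation. The sketch assumes that throughput equals the clock frequency divided by the number of cycles between consecutive amino acids, and that interleaving keeps enough independent comparisons in flight to fill the 209-cycle loop; the constant and function names are illustrative.

```python
F_CMOS_HZ = 370e6          # CMOS clock frequency quoted in the text
F_NML_HZ = 100e6           # NML clock frequency quoted in the text
NML_LOOP_LATENCY = 209     # cycles between dependent inputs without interleaving


def residues_per_second(clock_hz: float, cycles_per_residue: int) -> float:
    """Amino acids accepted per second by one processing chain."""
    return clock_hz / cycles_per_residue


cmos = residues_per_second(F_CMOS_HZ, 1)
nml_plain = residues_per_second(F_NML_HZ, NML_LOOP_LATENCY)
nml_interleaved = residues_per_second(F_NML_HZ, 1)

slowdown_plain = cmos / nml_plain               # 3.7 * 209, about 773
slowdown_interleaved = cmos / nml_interleaved   # 3.7

if __name__ == "__main__":
    print(f"without interleaving: {slowdown_plain:.1f}x slower than CMOS")
    print(f"with interleaving:    {slowdown_interleaved:.1f}x slower than CMOS")
```

Under these assumptions interleaving recovers the factor of 209 lost to the feedback loop, leaving only the clock frequency ratio.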
However, the lack of speed is compensated by a much smaller power consumption. Table I shows the comparison, in terms of power consumption and circuit area of a single processing element, between CMOS and the main implementations of NML logic. Data for CMOS are obtained by collecting the results of Synopsys synthesis on an industrial 45 nm technology and combining them with the ITRS Roadmap predictions to extrapolate the equivalent 21 nm performance. For NML, data are estimated starting from the circuit layout and technological data. Most of the main blocks of the Smith-Waterman circuit, like the adder shown in Figure 4.C, are designed in detail, so their area and composition are known exactly. The total circuit area and the number of magnets of the processing element can be estimated starting from the main blocks and using multiplicative constants to take the interconnection overhead into account [12]. Once the total circuit area and the total number of magnets are known, it is possible to estimate the circuit power consumption, because it is directly related to circuit area. The total estimated number of magnets for a processing element is 470000. Data in Table I show that the area is around 21000 µm² for magnetic field based NML, slightly lower (20000 µm²) for STT-current based NML, and remarkably smaller for Magnetoelastic NML (12000 µm²). While the circuit structure and the number of magnets are the same in every case, the size of the magnets is different, leading to different values for the area. The area, however, is much bigger than in the CMOS case, which is around 1000 µm². The reason behind this is the availability of multiple interconnection layers in CMOS technology. In NML, at the current level of technology maturity, circuits are built using only one layer.
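The area figures can be put in perspective with a short calculation based only on the numbers reported in Table I and on the estimated 470000 magnets per processing element. The average area per magnet computed here includes spacing and interconnection overhead, so it is not the physical size of a magnet; the names are illustrative.

```python
MAGNETS_PER_PE = 470_000

AREA_UM2 = {
    "Magnetic Field NML": 21_000,
    "STT-current NML": 20_000,
    "Magnetoelastic NML": 12_000,
    "CMOS LOP 21nm": 1_000,
}


def area_per_magnet_um2(total_area_um2: float, n_magnets: int = MAGNETS_PER_PE) -> float:
    """Average layout area per magnet, overhead included."""
    return total_area_um2 / n_magnets


def area_ratio_to_cmos(name: str) -> float:
    """How many times larger than the CMOS processing element a solution is."""
    return AREA_UM2[name] / AREA_UM2["CMOS LOP 21nm"]


if __name__ == "__main__":
    for name, area in AREA_UM2.items():
        if name.endswith("NML"):
            print(f"{name}: {area_ratio_to_cmos(name):.0f}x CMOS area, "
                  f"{area_per_magnet_um2(area):.4f} um^2 per magnet")
```

The single-layer NML layouts are thus between 12 and 21 times larger than the CMOS processing element.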
Comparing instead the power consumption of one processing element (Table I), for magnetic field based NML it is 2 mW, while for STT-current based NML it is much higher, about 131 mW. STT-current based NML is suited for circuits with a limited number of magnets, fewer than 11000 [9]. Unfortunately both values are higher than the value obtained in CMOS, which is around 0.7 mW, so these solutions are not suitable for low-power applications; they are of interest only where a high tolerance to heat or radiation is required. Using the magnetoelastic clock solution instead, it is possible to obtain a considerable reduction in power over CMOS, since the total power consumption is around 0.01 mW. Clearly this is the best solution for NML logic, as it allows a remarkable reduction in power consumption with only a relatively limited reduction in speed.
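Combining the power values of Table I with the clock frequencies given earlier yields a rough energy per clock cycle for one processing element. This is a sketch resting on two assumptions not stated in the paper: that the tabulated power applies at the quoted clock frequency, and that with interleaving one amino acid is processed per cycle. The names are illustrative.

```python
POWER_MW = {
    "Magnetic Field NML": 2.0,
    "STT-current NML": 131.0,
    "Magnetoelastic NML": 0.01,
    "CMOS LOP 21nm": 0.72,
}
CLOCK_HZ = {"NML": 100e6, "CMOS": 370e6}


def energy_per_cycle_pj(power_mw: float, clock_hz: float) -> float:
    """Energy spent by one processing element in one clock cycle, in picojoules."""
    return power_mw * 1e-3 / clock_hz * 1e12


def power_ratio_cmos_over(name: str) -> float:
    """How many times less power than CMOS a given solution consumes."""
    return POWER_MW["CMOS LOP 21nm"] / POWER_MW[name]


if __name__ == "__main__":
    print(power_ratio_cmos_over("Magnetoelastic NML"))
    print(energy_per_cycle_pj(POWER_MW["Magnetoelastic NML"], CLOCK_HZ["NML"]))
    print(energy_per_cycle_pj(POWER_MW["CMOS LOP 21nm"], CLOCK_HZ["CMOS"]))
```

Under these assumptions the magnetoelastic solution consumes 72 times less power than CMOS and, despite the lower clock frequency, roughly 19 times less energy per processed amino acid.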
V. CONCLUSIONS
NML logic enables the fabrication of circuits with a considerable reduction in power dissipation with respect to CMOS, at the cost of a relatively reduced speed. The intrinsic pipelined behavior of this technology leads to a considerable reduction of circuit throughput in the presence of feedback signals unless specific countermeasures are taken. Using data interleaving, and therefore running multiple operations in parallel, is one of the possible solutions. Since the number of operations that must run in parallel to obtain the maximum throughput can be high, not all applications are suitable. It is then clear that massively parallel applications represent the future for NML (and QCA) circuits, because they permit the maximum throughput. Biosequences analysis, used in this paper as a benchmark, is one of these applications. This work helps quantify the problem and the solution, and represents a considerable milestone for the ongoing studies on NML technology, pointing out the path that should be followed in the development of NML circuits. We are currently exploring to what extent the design can be further optimized, moving from a simple implementation to an improved internal PE structure. We are redesigning circuit components in order to obtain a more compact layout, minimizing long interconnections and latency.
REFERENCES
[1] C.S. Lent, P.D. Tougaw, W. Porod, and G.H. Bernstein. Quantum cellular automata. Nanotechnology, 4:49-57, 1993.
[2] A. Imre, L. Ji, G. Csaba, A.O. Orlov, G.H. Bernstein, and W. Porod. Magnetic Logic Devices Based on Field-Coupled Nanomagnets. 2005 Int. Semiconductor Device Research Symp., page 25, Dec. 2005.
[3] M. Niemier et al. Nanomagnet logic: progress toward system-level integration. J. Phys.: Condens. Matter, 23:34, November 2011.
[4] M. Vacca, L.D. Crescenzo, M. Graziano, M. Zamboni, A. Chiolerio, A. Lamberti, E. Enrico, F. Celegato, P. Tiberto, and L. Boarino. Electric clock for NanoMagnet Logic Circuits. Field Couple Computing Workshop (FCN), Tampa, February 2013.
[5] M. Graziano, M. Vacca, A. Chiolerio, and M. Zamboni. A NCL-HDL Snake-Clock Based Magnetic QCA Architecture. IEEE Transactions on Nanotechnology, (10). DOI: 10.1109/TNANO.2011.2118229.
[6] M. Crocker, M. Niemier, and X.S. Hu. A Reconfigurable PLA Architecture for Nanomagnet Logic. ACM Journal on Emerging Technologies in Computing Systems, 8(1), February 2012.
[7] G. Urgese, M. Graziano, M. Vacca, M. Awais, S. Frache, and M. Zamboni. Protein Alignment HW/SW Optimizations. The IEEE International Conference on Electronics, Circuits, and Systems (ICECS), 2012.
[8] G. Csaba and W. Porod. Behavior of Nanomagnet Logic in the Presence of Thermal Noise. In International Workshop on Computational Electronics, pages 1-4, Pisa, Italy, 2010. IEEE.
[9] J. Das, S.M. Alam, and S. Bhanja. Low Power Magnetic Quantum Cellular Automata Realization Using Magnetic Multi-Layer Structures. J. on Emerging and Selected Topics in Cir. and Sys., 1(3):267-276.
[10] M. Vacca et al. Asynchronous Solutions for Nano-Magnetic Logic Circuits. ACM J. Emerging Tech. in Comp. Systems, 7(4), Dec. 2011.
[11] M. Graziano, M. Vacca, D. Blua, and M. Zamboni. Asynchrony in Quantum-Dot Cellular Automata Nanocomputation: Elixir or Poison? IEEE Design & Test of Computers, 2011.
[12] M. Graziano, M. Vacca, and M. Zamboni. Nanomagnetic Logic Microprocessor: Hierarchical Power Model. IEEE Transactions on VLSI Systems, August 2012.
[13] M.T. Niemier, E. Varga, G.H. Bernstein, W. Porod, M.T. Alam, A. Dingler, A. Orlov, and X.S. Hu. Shape Engineering for Controlled Switching With Nanomagnet Logic. IEEE T. on Nanotechnology, 11(2):220-230, 2012.
[14] Smith and Waterman. Identification of Common Molecular Subsequences. Journal of Molecular Biology, 1981.
[15] M. Vacca et al. Majority Voter Full Characterization for Nanomagnet Logic Circuits. IEEE T. on Nanotechnology, 11(5), September 2012.
