Towards FPGA hardware in the loop for QCA simulation by Olson, Alan




Recommended Citation
Olson, Alan, "Towards FPGA hardware in the loop for QCA simulation" (2011). Thesis. Rochester Institute of Technology. Accessed
from
Towards FPGA Hardware in the
Loop for QCA Simulation
by
Alan Olson
A Thesis Submitted in Partial Fulfillment of the Requirements for the
Degree of Master of Science
in Electrical Engineering
Supervised by
Associate Professor Dr. Dorin Patru
Department of Electrical and Microelectronic Engineering
Kate Gleason College of Engineering




Dr. Dorin Patru, Associate Professor
Thesis Advisor, Department of Electrical and Microelectronic Engineering
Dr. Marcin Lukowiak, Assistant Professor
Committee Member, Department of Computer Engineering
Dr. Dhireesha Kudithipudi, Assistant Professor
Committee Member, Department of Computer Engineering
Thesis Release Permission Form
Rochester Institute of Technology
Kate Gleason College of Engineering
Title:
Towards FPGA Hardware in the Loop for QCA Simulation
I, Alan Olson, hereby grant permission to the Wallace Memorial





Dr. Lukowiak and Dr. Kudithipudi for their insightful review
Matt Hibbard for his assistance with the Wildcard 4 system
Konrad Walus and Gabriel Schulhof for their correspondence
regarding QCADesigner
My parents Grieg and Betty Olson for their undying love and support
Dr. Eric Peskin for his assistance and guidance during the early parts
of this work
and
God for his faithfulness
Abstract
Towards FPGA Hardware in the Loop for QCA Simulation
Alan Olson
Supervising Professor: Dr. Dorin Patru
As transistors begin to hit raw physical limits and performance
barriers, other technologies are being researched to potentially replace
conventional integrated circuit technology. Quantum-dot Cellular
Automata (QCA) is one such technology, which executes computations
using Coulomb interactions and quantum-mechanical effects. Part of this
research concerns the design of circuits which exploit QCA
technology and take advantage of what it has to offer. These circuits must
be simulated to ensure their functionality and help prove the viability of
QCA. These simulations, like many scientific computing applications,
can take a long time to complete: hours or days, depending on their
size and complexity. Many scientific applications have benefited from
research into Field Programmable Gate Array (FPGA) application
development, which has been used to accelerate the speed at which such
simulations execute.
This thesis investigates the possibility of using FPGAs to accelerate
the simulation of QCA circuits. The hardware developed is a streaming
type architecture using floating point arithmetic and hardware/software
techniques. Hardware implementation shows the system to run slower
than the existing software code, but demonstrates the ability to simu-
late a small QCA circuit. Analysis of the design reveals good potential
for achieving speedup, and an alternate design is proposed to improve
the execution time. In the course of this work, improvements to the
existing software are also developed and contributed to the community.
Contents
Acknowledgements
Abstract
1 Introduction
1.1 Motivation
1.1.1 QCA Advancement
1.1.2 FPGA Advancement
1.2 Previous Work
1.3 Contributions
1.4 Organization
2 QCA Background
2.1 Basic QCA Operation
2.1.1 QCA Cell
2.1.2 QCA Wires
2.1.3 QCA Logic
2.1.4 The Clock
2.2 Implementations
2.2.1 Metal-Dot
2.2.2 Molecular
2.2.3 Magnetic
2.3 QCA Designer
2.3.1 Bistable Simulation
2.3.2 Coherence Vector Simulation
2.3.3 Simulation Speed
3 Using FPGAs for Application Speedup
3.1 Basic Architecture and Functionality
3.2 Use In Application Speedup
3.3 Number Systems
3.4 Hardware/Software Design
3.5 Nature of QCADesigner Computation
3.6 The Wildcard Hardware
4 Proposed Simulation Architecture
4.1 Software Analysis
4.1.1 Code Analysis
4.1.2 Numerical Analysis
4.2 Hardware Architecture
4.2.1 Proof of Concept
4.2.2 DRAM Interfacing
4.2.3 Other Interfacing Issues
4.2.4 The Arithmetic Core
4.2.5 Operation and Synthesis
5 Results and Discussion
5.1 Cause of Slowdown
5.2 Hardware Reliability
5.3 QCADesigner Enhancements
5.4 Design Alternatives
5.4.1 New Architecture
5.4.2 Other Platforms
6 Conclusion and Future Work
Bibliography
A run bistable simulation C Code Listing
B run bistable simulation hardware C Code Listing
C qcad VHDL Code Listing
D p math VHDL Code Listing
List of Tables
2.1 Truth table for a majority gate
List of Figures
2.1 A QCA cell (reproduced from [45])
2.2 A normal QCA wire (reproduced from [45])
2.3 An inversion chain (reproduced from [45])
2.4 A wire crossing in QCA using an inversion chain (reproduced from [45])
2.5 Majority gate in QCA
2.6 An inverter gate in QCA (reproduced from [45])
2.7 Waveform showing how QCA clock zones operate (reproduced from [52])
2.8 Clock zones as they should appear in a wire
3.1 Basic mathematical structure of 2D convolution algorithm (reproduced from [37])
3.2 Binary reduction tree structure used in [32]
3.3 A 32 bit floating point number
3.4 Block diagram of Wildcard 4 hardware
4.1 QCADesigner simulation using single precision floating point
4.2 QCADesigner simulation using double precision floating point
4.3 Hardware results from minisim test
4.4 Software results from minisim test
4.5 Block diagram of the polarization math pipeline
4.6 Block diagram of high level system architecture
4.7 Flowchart for the operation of the combined hardware/software system
5.1 Result for software run of a majority gate




There is constant demand for computing resources to become more
powerful, more energy efficient, and less expensive. For decades, the
semiconductor industry has been able to keep up with this demand,
doubling processor performance every 18 months according to Moore’s
Law [31]. Recently, however, semiconductor-based transistors have be-
gun to approach absolute physical barriers that will prevent further
progress at this pace. The size of state-of-the-art transistors is not far
from the atomic scale, and their behaviour is increasingly determined
by quantum effects. As such, it is becoming ever more difficult to
continue with the current scaling trend, and alternate paths are being
researched, which specifically leverage the quantum effects at this level.
QCA is one such technology, with research ongoing to demonstrate its
feasibility and utility. Accelerating the simulations that are performed
in the course of this research can help speed up the overall progress,
just as with other research fields. This can be carried out by using
FPGAs to implement specialized hardware that exploits parallelism for
a more efficient architecture.
1.1 Motivation
There are two primary motives for this work. The first is to advance
the state of QCA research as a means to overcome the performance
barriers currently faced by the semiconductor industry. The second is to
advance the state of FPGA utilization in support of simulation speedup.
1.1.1 QCA Advancement
Currently, the primary manufacturing technology for integrated cir-
cuits is based on complementary metal-oxide-semiconductor (CMOS).
This technology is presently facing absolute theoretical scaling limits as
current leakage, power consumption, oxide thickness, mobility degrada-
tion, threshold voltage variation and many other characteristics become
increasingly difficult to control [1]. This means that traditional scaling
techniques are no longer sufficient to maintain the pace of semiconduc-
tor development.
The International Technology Roadmap for Semiconductors (ITRS)
[1] outlines myriad directions industry research may take in addition to
traditional scaling to keep advancement proceeding apace. These in-
clude equivalent scaling, design equivalent scaling, Emerging Research
Materials (ERM), and Emerging Research Devices (ERD). Equivalent
scaling and design equivalent scaling focus on new forms of architec-
tures, such as parallel or three dimensional computing structures to
improve performance through design rather than implementation tech-
nology. ERM and ERD focus on developing alternate technologies to
augment or replace CMOS. The field of ERM includes the develop-
ment of new channel materials to facilitate the advancement of CMOS
technology as well as materials to enable further development of ERD
topics such as carbon nanotubes, silicon nanowires and spin materials.
QCA is one of the technologies mentioned in the ITRS. Materials
research regarding macromolecules and self-directed assembly benefits
QCA, and the concept of QCA is under scrutiny as a possible replace-
ment for CMOS. It has many potential advantages over CMOS with
regard to device density [51], operating speed [28], and power consump-
tion [9], and it may even serve as the basis for a reversible computing
architecture [26, 25]. These are all desirable characteristics for a
data processing technology, and indeed are sought-after goals stated
in the ITRS. However, a direct comparison to CMOS is difficult because
transistors and QCA cells are not directly analogous (see Section 2.1).
Furthermore, none of these aspects of QCA has yet been fully realized
in practice, nor have QCA circuits been manufactured at any significant
scale. QCA research is ongoing, focusing on two distinct aspects:
implementation and architecture.
One direction of QCA research is that of physical implementation.
Although the technology shows promise, the current state of QCA is
highly experimental, and it must be made more practical if it is to
compete with, and supersede, CMOS. QCA must be cheaply and reliably
manufacturable, operational at room temperature, and able to deliver
the high speed and low energy consumption demanded of it.
There are three basic types of QCA proposed: metal-dot, molecular,
and magnetic. Magnetic QCA (see Section 2.2.3) has been shown to
operate at room temperature, but cannot be clocked much faster
than 100 MHz [7]. Metal-dot QCA (see Section 2.2.1) promises to
operate at speeds up to 10 GHz, but has not been shown to work at
room temperature [43]. Molecular QCA (see Section 2.2.2) purports
to be able to achieve room temperature operation [42] at up to terahertz
speeds [28], but no actual circuits have been manufactured at any scale.
These operating speeds have been determined using simulations of the
adiabatic switching properties of the materials involved, but have yet
to be verified in a laboratory. This research is currently the domain of
physicists, materials scientists, and microelectronic engineers.
This thesis relates more closely to the other important aspect of
QCA research, which is nanoarchitecture research [45]. This research
models the physics of QCA and ignores the manufacturing concerns
so that the engineer may focus on more familiar elements like cells,
wires, and gates. It is important to demonstrate that novel circuits
can be created using QCA, thus making the aforementioned physical
implementation research worthwhile. It also provides optimized circuits
“in the pipe,” ready to manufacture as soon as possible. This research
can proceed in parallel with the other, without the immediate need for
physical circuits to test on, but is still held back by the large amount of
time needed to simulate each design. If accurate simulations could be
sped up, rapid progress in this field would ensue, verifying and correcting




Field Programmable Gate Arrays (FPGAs) are hardware devices that
can be programmed to implement any arbitrary logic function. There
are many ways they can be designed, but usually they consist of a
large array of multiplexer-based Look-Up-Tables (LUTs) that can be
individually programmed and interconnected according to a program-
ming bit stream [16]. FPGAs are used in a variety of contexts today,
from Application Specific Integrated Circuit (ASIC) prototyping, to
implementation platforms for end products, to a means of scientific
research.
FPGAs are powerful and versatile. Unlike ASICs, which can only
serve one specialized purpose after manufacture, FPGAs can be easily
reconfigured to implement a large variety of hardware tasks. They also
have an advantage over CPUs in that their hardware structure can be
modified and optimized for very specific tasks, and can have a very
high degree of parallelism. Therefore they have become a way
to achieve performance similar to ASICs without the high monetary
and labour cost of developing and manufacturing an actual ASIC chip.
Furthermore, a single design can be shared between groups that use
different models or brands of FPGAs, and that design can be quickly
implemented, studied, and used without any need to purchase a new
expensive ASIC. Thus, FPGAs are very valuable to the scientific com-
munity and electronics industry as they enable the rapid distribution,
testing, and implementation of novel designs.
Up to now, FPGAs have typically been used to implement fixed
point or limited precision applications; due to constraints on hardware
resources, IEEE-compliant floating point cores could not be
implemented at useful speeds. However, the performance of FPGAs has
been increasing even more rapidly than that of CPUs [47], and their
floating point performance has improved significantly. In [47], Keith
Underwood shows that floating point performance on FPGAs has
indeed advanced to match and exceed that of general purpose CPUs.
Given the importance of floating point precision to scientific applica-
tions, and the usefulness of FPGAs in general, this is a promising de-
velopment. Despite this advancement, relatively few applications have
been developed which take advantage of it, and none have been found
that combine floating point and extensive hardware/software design.
Hardware/software design (see Section 3.4) can be exploited to speed
up the development process by porting scientific applications from soft-
ware to hardware. Showing that hardware/software design can be used
effectively with floating point applications would be significant.
1.2 Previous Work
Up to now, considerable work has been done to develop a number
of novel circuits that could be implemented using QCA, including a
microprocessor [50], fault tolerant architectures [44, 52], and various
programmable logic structures [33, 23, 9, 46, 11]. Specifically, work has
been done at the Rochester Institute of Technology (RIT) to develop
a Combinational Logic Block (CLB) that would form the basis of a
QCA-based FPGA [46]. Thus it can be seen that nanoarchitecture
research is progressing, and various circuit designs exist that could be
implemented with QCA when the time comes.
There are several simulation tools that have been made to help cir-
cuit designers verify the functionality of their designs. These tools in-
clude QCADesigner, MAQUINAS, QBART and HDLQ [36]. QCADe-
signer and MAQUINAS both focus on the cell level (for metal-dot and
molecular implementations respectively), and execute numerical simu-
lations that are based on Schrödinger equations. QBART instantiates
QCA devices into a grid, and uses ad hoc simulation rules. This allows
simulations to run faster, but the simulation environment is less flexi-
ble and not as accurate. HDLQ is a Verilog library which provides a
basis for behavioural simulation of QCA circuits using existing Verilog
simulation tools. The difference between HDLQ and QCADesigner is
somewhat akin to the difference between Verilog and SPICE. At RIT,
QCADesigner has been used for verification of small-scale circuits, and
HDLQ for circuits where QCADesigner’s simulation time proves im-
practical. For reference, the simulation of one CLB (with 22,558 QCA
cells) in [46] takes approximately 50 minutes, while the QCADesigner
simulation of four CLBs takes between 8 and 12 hours.
There has been much work done developing applications for FPGAs
in many categories, with varying degrees of success. Image processing is
commonly implemented due to its relatively light numerical demands
and significant parallelizability. There are also numerous Molecular
Dynamics (MD) simulations implemented on FPGAs, typically achieving
speedups of 5x over pure software [14]. MD simulations often start
as software applications using floating point, but are ported to
some other number system such as logarithmic floating point [6] or
semi-floating point [15] (see Section 3.3) that makes the hardware more
compact. The precision suffers somewhat, but the size of mathematical
cores is reduced to allow for greatly enhanced parallelism. A limited
number of designs using full floating point arithmetic have also been
created to speed up various applications, in some cases by up to 35x [32],
showing the ability of FPGAs to handle floating point applications.
1.3 Contributions
This thesis attempts to make steps toward accelerating the execution
of QCA simulations, a task which does not appear to have been attempted
in the literature. The design used to this end is novel in concept among
similar applications because it attempts to use IEEE 754 compliant
floating point arithmetic in conjunction with hardware/software
design in order to achieve high numerical precision simultaneously with
shortened development time and ease of use. One possible architecture
is designed and tested, but fails to achieve any speedup due to
interfacing bottlenecks. The computational core itself, however, demonstrates
the operational speed necessary to serve as the basis for faster systems
in the future. Another design is proposed that would eliminate the
most significant observed issues and potentially run much faster, while
maintaining floating point precision and hardware/software usability.
Additionally, some minor improvements are made to the QCADesigner
software, including a bug fix in the algorithm and an optimization that
significantly speeds up a common function.
1.4 Organization
This thesis is organized as follows: Chapter 2 provides the background
to QCA necessary to understand what is being simulated, including
descriptions and background of existing software simulators. Chapter 3
offers a detailed look at FPGAs, their basic architecture and
functionality, and how they are used to accelerate various computing applications.
This chapter also contains a brief examination of different number
systems, and how they may be applied in different designs. Chapter 4
details the design process, including the prerequisite software analysis
and explication of the final implemented hardware. Chapter 5 reports
and analyzes the observed results and Chapter 6 discusses their signif-




Quantum-dot Cellular Automata (QCA) is a technology concept
devised by Tougaw and Lent in 1993 [24]. It is a way to create digital
electronic devices without using transistors, and has the potential to be
faster, denser, and more energy efficient than CMOS technology. Rather than
computing with voltage levels requiring current flow, QCA computes
via field polarization [2], in such a way that the ground state of the
system corresponds to the solution state of the computation. Much
progress has been made since 1993 in terms of design and fabrication
techniques. This chapter explains in detail how this form of computation
works and how it can be implemented. Note that the logical architecture and
functionality are the same independent of the method of manufacturing
and implementation.
2.1 Basic QCA Operation
This section describes how data is stored, transferred, and manipulated
in QCA, starting with a description of a basic cell, then building wires
and gates out of cells. The clocking structure, which is central to the
functionality of QCA, is also described.
2.1.1 QCA Cell
A QCA Cell is the basic building block of any QCA circuit [3]. It con-




[Figure panels: binary state ‘0’, binary state ‘1’]
Figure 2.1: A QCA cell (reproduced from [45])
and two distinct charges. The specific configuration of the cell repre-
sents a binary value stored in the cell as shown in Figure 2.1. There
are only two stable configurations in a QCA cell, corresponding to the
electrons occupying opposite corners, hence representing two alternate
binary values. The charges can move between charge sites by means of
quantum tunneling, a process which is facilitated by the clock (Section
2.1.4). The configuration of a cell can be calculated in simulation by
using a Schrödinger equation with a quantum-mechanical Hamiltonian,
and is influenced by any other QCA cells surrounding it. A cell can also
theoretically be arranged with the charge sites rotated by 45 degrees to
facilitate wire crossings [9].
2.1.2 QCA Wires
When QCA cells are arranged in a line, they form a wire. In contrast to
CMOS technology, where simple metal lines are used to transfer data,
QCA transmits data using the same functional devices that process it.
A QCA wire functions by local coulombic interactions
from one cell to the next, generating a “domino effect” that propagates
data from a fixed input to the measured output as shown in Figure 2.2.
Figure 2.2: A normal QCA wire (reproduced from [45])
Figure 2.3: An inversion chain (reproduced from [45])
Alternately, rotated cells may be arranged to form an inversion chain
as shown in Figure 2.3. This is like a normal QCA wire, but the
polarization flips from each cell to the next. Therefore an inversion chain must
contain an odd number of cells in order to preserve the value from input to
output. The main use of an inversion chain is to allow coplanar wire
crossings [9]. In a wire crossing such as that shown in Figure 2.4, the cell at
the crossover point in the primary wire has a neutral coulombic effect
on either of the nearby cells in the secondary wire, allowing data to
cross over unaltered. Multi-layer QCA circuits have not been shown to
be feasible, but coplanar crossings such as the ones described here may
be. Wire crossings, however they are implemented, are important to
having novel circuits in a compact area.
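The parity requirement for an inversion chain can be checked with a minimal sketch. It assumes an idealized model, not taken from QCADesigner, in which each cell holds a plain binary value, a normal wire copies the value from cell to cell, and an inversion chain flips it between each pair of adjacent cells. The function names are hypothetical.

```python
def wire_output(input_value: int, n_cells: int) -> int:
    """Idealized normal wire: every cell copies its neighbour's value."""
    if n_cells < 1:
        raise ValueError("a wire needs at least one cell")
    return input_value


def inversion_chain_output(input_value: int, n_cells: int) -> int:
    """Idealized inversion chain: the value flips between adjacent cells.

    Assumption: the first cell holds the input value and the last cell
    is read as the output, so there are n_cells - 1 flips in total.
    """
    if n_cells < 1:
        raise ValueError("a chain needs at least one cell")
    value = input_value
    for _ in range(n_cells - 1):
        value = 1 - value  # polarization flips at each cell boundary
    return value


if __name__ == "__main__":
    # Print the output of chains of increasing length for an input of 1.
    for n in range(1, 7):
        print(n, inversion_chain_output(1, n))
```

Under this assumption, a chain with an odd number of cells has an even number of flips, which is why the value is preserved from input to output.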
2.1.3 QCA Logic
The logical basis for QCA circuits is the majority gate and the inverter. A
QCA inverter has the same function as the familiar CMOS inverter,
i.e., it inverts a logical value from input to output. A majority gate is a
three-input gate whose output matches the most common input value.
The boolean expression for this is F = AB + BC + AC, and the truth
table for a majority gate is shown in Table 2.1. The QCA schematics
of a majority gate and an inverter are shown in Figures 2.5 and 2.6, respectively.
It can be seen from the schematic that a three-input majority gate






Figure 2.4: A wire crossing in QCA using an inversion chain (reproduced from [45])
implemented in QCA. It functions due to the bistable nature of a QCA
cell, which will “snap” into one configuration or another based on which
cell polarity has a more dominant influence. A majority gate can be
programmed to function as either an AND or an OR gate by holding
one of the inputs at a logical zero or one respectively. With AND, OR
and INV, a complete set of logical devices can be implemented in QCA,
enabling the creation of any digital circuit that can be made currently
in CMOS.
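The logical behaviour described in this section can be illustrated with a short sketch. It models only the boolean function F = AB + BC + AC, not the underlying cell physics, and the function names are hypothetical.

```python
from itertools import product


def majority(a: int, b: int, c: int) -> int:
    """Three-input majority vote: F = AB + BC + AC."""
    return (a & b) | (b & c) | (a & c)


def and_gate(a: int, b: int) -> int:
    """Majority gate with one input held at logical zero."""
    return majority(a, b, 0)


def or_gate(a: int, b: int) -> int:
    """Majority gate with one input held at logical one."""
    return majority(a, b, 1)


def inverter(a: int) -> int:
    """Logical inversion of a single binary value."""
    return 1 - a


def truth_table():
    """Rows (A, B, C, F) in the same order as Table 2.1."""
    return [(a, b, c, majority(a, b, c))
            for a, b, c in product((0, 1), repeat=3)]


if __name__ == "__main__":
    for row in truth_table():
        print(*row)
```

Holding the third input constant is what reduces the majority function to AND or OR, so no separate gate structure is needed for either.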
2.1.4 The Clock
All QCA functions are driven by an underlying clock structure which is
essential to the controlled operation of each cell and governs the flow of
data. There are actually four clocks in a QCA circuit, each separated
by 90 degrees of phase shift, and each cell is attached to one of the
four clock zones. The four phases of each clock period may be labeled
as relax, switch, hold, and release, indicating each phase’s effect on a
cell [52]. A timing diagram showing their basic progression is shown in
Figure 2.7.
A B C F
0 0 0 0
0 0 1 0
0 1 0 0
0 1 1 1
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 1
Table 2.1: Truth table for a majority gate
Figure 2.5: Majority gate in QCA
During the relax phase, a cell is free to switch polarizations based
on the surrounding cells’ influence. In the switch phase, the clock level
progresses from low to high, and the polarization of the cell becomes
locked in place. When the clock reaches the high level, this is the
hold phase, and the polarization of the cell is fully locked and able to
influence other cells without being altered itself. After the hold phase
comes the release phase where the clock progresses from high to low,
and the cell gradually loses its fixed value to re-enter the pliable relax
state.
Figure 2.6: An inverter gate in QCA (reproduced from [45])
These alternating clock phases allow the controlled progression of
data in one direction through a QCA circuit, from input to output,
with minimal error. Figure 2.8 illustrates the prescribed arrangement
of clock zones in a wire. Input is on the left, output is on the right,
and each colour represents a different zone, progressing in order from
Clock 1 to Clock 4 from left to right. However, because every cell is
clocked, including those making up the wires, QCA is a deeply pipelined
architecture. Every full set of clock zones between input and output
adds one clock period of latency to the circuit, and there is a limit to
how many consecutive cells may be in one clock zone. Therefore, even
wires will add to circuit latency, meaning the circuit designer must take
extra care in aligning data arrival times.
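The four-phase clocking scheme and its effect on latency can be sketched as follows. This is an illustrative model under stated assumptions, not the QCADesigner clock: time advances in quarter-period steps, zone 0 is assumed to be in the relax phase at step 0, and each subsequent zone lags the previous one by one quarter period (90 degrees). All names are hypothetical.

```python
PHASES = ("relax", "switch", "hold", "release")
NUM_ZONES = 4


def phase(zone: int, quarter_step: int) -> str:
    """Phase of clock zone `zone` (0 to 3) at a given quarter-period step.

    Assumption: zone 0 is in the relax phase at step 0, and each later
    zone lags the previous one by one quarter period (90 degrees).
    """
    return PHASES[(quarter_step - zone) % NUM_ZONES]


def latency_periods(zones_traversed: int) -> int:
    """Full clock periods of latency for a path crossing this many zones.

    Every full set of four zones between input and output adds one period.
    """
    return zones_traversed // NUM_ZONES


if __name__ == "__main__":
    # Show which phase each zone is in over one full clock period.
    for step in range(NUM_ZONES):
        print(step, [phase(zone, step) for zone in range(NUM_ZONES)])
```

In this model a zone enters the switch phase exactly when the preceding zone is in the hold phase, which is the condition that lets a locked cell drive its neighbour without being altered itself.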
The clock structure in QCA is currently implemented as a standard
CMOS clock network directly below the QCA circuit whose voltage
level directly affects the depth of the energy wells holding the charges.
The clock can generally be routed wherever it is needed in the circuit,
but with four clocks to route, the actual layout could be quite complex,
and minimizing skew can be difficult. Research is ongoing to develop

Figure 2.7: Waveform showing how QCA clock zones operate (reproduced from [52]); the four clock signals are labeled Clock 1, Clock 2, Clock 3, and Clock 4

















Figure 2.8: Clock zones as they should appear in a wire
Concerns have been raised that the power savings of QCA may be
offset by the power consumption of a clock running at the proposed
speeds of 10 GHz and more. This is addressed in [4], where Lent et
al. simulate the power consumption of the type of clock network used
in QCA. Their findings show that the clock does not significantly con-
tribute to the overall power consumption of a QCA circuit, and that a
100 GHz QCA clock consumes similar power to a 2 GHz clock driving
a conventional CMOS circuit.
2.2 Implementations
Three methods of implementation for QCA have been proposed: metal-
dot, molecular, and magnetic. They each have their own advantages
and disadvantages, and are all actively being researched.
2.2.1 Metal-Dot
The most widely researched form of QCA implementation is the
metal-dot implementation. This uses small dots of materials such as
aluminum [22], silicon, or gallium arsenide [2] as charge sites. Each
4-dot cell in a metal-dot QCA circuit is typically a square about 20 nm
on a side. It has been shown to work in a lab environment, in which
some very basic, though important, circuits have been demonstrated
at temperatures below 1 K [34, 35]. In theory, based on adiabatic
switching simulations of metal-dot cells, the clock speed of this form of
QCA circuit can reach into the tens of GHz, but only operation up to the
MHz range has been demonstrated [43]. Metal-dot is the form of QCA best
suited to current CMOS manufacturing lines, needing little upgrading
or modification: it would use the same basic lithography
techniques, and has feature sizes similar to those of current state-of-the-art
transistor technology. It is proposed that if the individual cells and dots
can be made smaller, higher operating temperatures may be achieved.
2.2.2 Molecular
In theory, molecular QCA improves on metal-dot in every respect except
feasibility. Molecular QCA proposes composing a QCA cell out of just
two molecules specially designed to have two primary charge sites and
mobile electrons. Because individual molecules are being
used, the “dots” are small enough to allow operating temperatures up
to 450 K [51], i.e., well above room temperature. It is also theorized
that molecular QCA supports the highest clock speeds of the three
implementations: up to 1 THz [28]. However, development of molecular
QCA has not progressed very far beyond simulation. Several candidate
molecules are being studied, and have been experimentally shown to
behave as a QCA cell, but no complex circuits have been manufactured
to date.
2.2.3 Magnetic
Magnetic QCA is slightly different from metal-dot and molecular, be-
cause instead of charges jumping between charge sites in a cell, a cell
consists of a magnetic dipole that changes its alignment based on its
neighbours. The same type of architecture and logic applies; only the
mechanics of how it works change. Magnetic QCA has been shown to
work at room temperature, and possesses a natural hardness against
radiation, but simulation of the switching properties of a magnetic QCA
cell suggests that it may only work at speeds up to 100 MHz [7]. Despite
the low operating speed, magnetic QCA may find applications in areas
where low power consumption and high reliability are more important
than operating speed, such as in medical implants or space exploration.
2.3 QCADesigner
The simulation tool that RIT and many other institutions [44, 18, 41,
46] use to test their QCA circuits is QCADesigner [48, 49]. It is a
program that allows a designer to easily create and test a QCA circuit
using a graphical user interface (GUI). It offers two different simulation
engines (coherence vector and bistable) with a variety of customizable
parameters. QCADesigner was initially developed by Konrad Walus et
al. in the Microsystems and Nanotechnology Group at the University
of British Columbia, but is now an open source project. Because the
source code is easy to obtain and modify, it is an ideal choice
of application to accelerate.
2.3.1 Bistable Simulation
The bistable simulation engine is the preferred QCADesigner engine
for simulating circuits at RIT, and therefore the engine selected for
the work of this thesis. It models a QCA cell as a system having two
stable states, and the polarization of a cell is calculated based on the
surrounding polarizations and kink energies. The kink energy between
two cells refers to the energy cost of the two cells having opposing
polarizations. Naturally, this value tends to decrease as the distance
between cells increases. The main equation used to determine the kink
energy is

    E_{i,j} = \frac{1}{4 \pi \varepsilon_0 \varepsilon_r} \frac{q_i q_j}{|r_i - r_j|}    (2.1)

This equation is used to calculate the electrostatic energy between
two individual dots in cell i and cell j. ε0 is the permittivity of free
space, and εr is the relative permittivity of the material system in use,
which can be specified by the user in the simulation setup window.
qi and qj are the charges stored on the two dots from cell i and j
respectively, and |ri−rj| is the distance between the dots. This equation
is applied and summed for every combination of dots in each cell, and
the difference between the results for aligned and misaligned polarities is
taken as the kink energy. Because there is no net charge on the cell, the
interaction is quadrupole-quadrupole based, so it decays very rapidly
as distance increases. The set of kink energies for all cells only needs to
be calculated once during a simulation, during the initialization phase
before any polarizations are calculated.
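
The kink energy calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than QCADesigner's own code. The helper names (`dot_layout`, `electrostatic_energy`, `kink_energy`), the 9 nm dot pitch, the relative permittivity of 12.9, and the half-charge model (each dot carrying a net charge of ±e/2 so that the cell is neutral) are all assumptions made for the example.

```python
import math

EPS0 = 8.8541878128e-12      # permittivity of free space, F/m
QE = 1.602176634e-19         # elementary charge, C

def dot_layout(center_x, center_y, polarization, dot_pitch):
    """Return [(x, y, charge)] for the four dots of one cell.

    Assumed model: two mobile electrons plus a neutralizing background,
    so each dot has a net charge of -QE/2 (occupied) or +QE/2 (empty).
    """
    h = dot_pitch / 2.0
    dots = []
    for sx, sy in ((1, 1), (-1, 1), (-1, -1), (1, -1)):
        on_main_diagonal = sx * sy > 0
        occupied = on_main_diagonal if polarization > 0 else not on_main_diagonal
        charge = -QE / 2.0 if occupied else QE / 2.0
        dots.append((center_x + sx * h, center_y + sy * h, charge))
    return dots

def electrostatic_energy(cell_a, cell_b, eps_r):
    """Sum q_i*q_j / (4*pi*eps0*eps_r*|r_i - r_j|) over every dot pair."""
    k = 1.0 / (4.0 * math.pi * EPS0 * eps_r)
    total = 0.0
    for xa, ya, qa in cell_a:
        for xb, yb, qb in cell_b:
            total += k * qa * qb / math.hypot(xa - xb, ya - yb)
    return total

def kink_energy(distance, dot_pitch=9e-9, eps_r=12.9):
    """Energy with opposite polarizations minus energy with equal ones,
    for two cells placed side by side along the x axis."""
    reference = dot_layout(0.0, 0.0, +1, dot_pitch)
    same = dot_layout(distance, 0.0, +1, dot_pitch)
    opposite = dot_layout(distance, 0.0, -1, dot_pitch)
    return (electrostatic_energy(reference, opposite, eps_r)
            - electrostatic_energy(reference, same, eps_r))
```

Under these assumptions, doubling a large centre-to-centre distance reduces the result by a factor close to 32, which is the 1/r^5 decay expected of a quadrupole-quadrupole interaction.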
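The kink energies between a cell and its neighbours are then used
to calculate the polarization of the cell:

    P_i = \frac{\frac{1}{2\gamma} \sum_j E^k_{i,j} P_j}{\sqrt{1 + \left( \frac{1}{2\gamma} \sum_j E^k_{i,j} P_j \right)^2}}    (2.2)
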
where γ is the tunneling potential, which changes with the clock value,
P_j is the polarization of the jth neighbour of the current cell, and
E^k_{i,j} is the kink energy between the current cell and the jth neighbour
cell. This equation is evaluated once for every cell in a QCA circuit,
for every iteration the simulation goes through. One “sample” of the
simulation may consist of several iterations of the circuit, in which
the state of each cell is recalculated until the change between iterations
falls below a maximum allowable threshold. This bistable simulation engine
is able to simulate
a large number of cells relatively quickly to verify logical correctness
and basic structural validity, but it lacks certain considerations, such
as temperature and dissipative effects, required to execute dynamic
simulations with a high degree of accuracy.
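
The iteration within one sample can be sketched as follows. This is a simplified illustration, not the QCADesigner source. It assumes the commonly cited bistable form P = x / sqrt(1 + x^2), with x equal to the sum of kink energy times neighbour polarization divided by 2γ. The function names, the in-place update order, the tolerance value, and the arbitrary energy units are assumptions.

```python
import math

def bistable_polarization(weighted_sum, gamma):
    """P = x / sqrt(1 + x^2) with x = (sum_j Ek_ij * P_j) / (2 * gamma)."""
    x = weighted_sum / (2.0 * gamma)
    return x / math.sqrt(1.0 + x * x)

def settle_sample(polarizations, fixed, neighbours, gamma,
                  tolerance=1e-3, max_iterations=100):
    """Recalculate every free cell until the largest change is below tolerance.

    polarizations: list of floats, updated in place
    fixed:         set of cell indices that act as inputs and never change
    neighbours:    neighbours[i] is a list of (j, kink_energy_ij) pairs
    Returns the number of iterations used.
    """
    for iteration in range(1, max_iterations + 1):
        largest_change = 0.0
        for i in range(len(polarizations)):
            if i in fixed:
                continue
            weighted_sum = sum(ek * polarizations[j] for j, ek in neighbours[i])
            new_p = bistable_polarization(weighted_sum, gamma)
            largest_change = max(largest_change, abs(new_p - polarizations[i]))
            polarizations[i] = new_p
        if largest_change < tolerance:
            return iteration
    return max_iterations

def make_wire(length, kink_energy):
    """Nearest-neighbour coupling for a straight wire; cell 0 is the input."""
    return [[(j, kink_energy) for j in (i - 1, i + 1) if 0 <= j < length]
            for i in range(length)]
```

For example, calling `settle_sample([1.0] + [0.0] * 5, {0}, make_wire(6, 1.0), gamma=0.05)` drives a six-cell wire from a fixed input; with a low tunneling potential the free cells settle close to the input polarization.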
2.3.2 Coherence Vector Simulation
The coherence vector simulation engine is a slightly more detailed sim-
ulation than the bistable simulation engine. Like the bistable engine, it
assumes a cell is basically a bistable system, but the state is calculated
in a more sophisticated way. The status of a cell is tracked using a
coherence vector where the polarization is represented as the z com-
ponent of this vector. Each component of the vector is calculated by
taking the trace of the product of the density matrix and the
corresponding Pauli spin matrix.
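    \lambda_i = \mathrm{Tr}\{\hat{\rho} \hat{\sigma}_i\}, \quad i \in \{x, y, z\}    (2.3)
The equation of motion for the coherence vector is
    \frac{\partial}{\partial t} \vec{\lambda} = \vec{\Gamma} \times \vec{\lambda} - \frac{1}{\tau} \left( \vec{\lambda} - \vec{\lambda}_{ss} \right)    (2.4)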
which models the change in the cell’s state over time, including dissi-
pative effects. The Gamma vector is defined as

    \vec{\Gamma} = \frac{1}{\hbar} \left[ -2\gamma, \; 0, \; \sum_{j \in S} E^k_{i,j} P_j \right]    (2.5)

where γ is the tunneling potential (tied to the clock), ħ is the reduced
Planck constant, S is the
effective neighbourhood of cell i, and E^k_{i,j} and P_j are the kink energy
and polarization respectively of cell j. τ is the relaxation time of the
cell.
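\vec{\lambda}_{ss} from equation 2.4 is the steady-state coherence vector, and is
defined as

    \vec{\lambda}_{ss} = -\frac{\vec{\Gamma}}{|\vec{\Gamma}|} \tanh\left( \frac{\hbar |\vec{\Gamma}|}{2 k_B T} \right)    (2.6)
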
where T is the temperature in kelvin and kB is Boltzmann’s constant.
The simulator evaluates these equations for each cell in the circuit for
each timestep until the end time is reached, as specified by the user.
Because every cell’s state is integrated over time with temperature and
dissipative effects included, this method can be more accurate than the
bistable engine, but it also takes much longer to run.
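
A single time step of this method can be sketched as below. The sketch assumes the commonly cited form of the equations: dλ/dt = Γ × λ − (λ − λ_ss)/τ, Γ = (1/ħ)[−2γ, 0, Σ E^k P], and λ_ss = −(Γ/|Γ|) tanh(ħ|Γ| / (2 k_B T)). The explicit Euler integrator, the function names, and the parameter values used to exercise it are illustrative assumptions and are not taken from QCADesigner.

```python
import math

HBAR = 1.054571817e-34   # reduced Planck constant, J*s
KB = 1.380649e-23        # Boltzmann constant, J/K

def gamma_vector(tunneling, weighted_sum):
    """Gamma = (1/hbar) * [-2*gamma, 0, sum_j Ek_ij * P_j]."""
    return (-2.0 * tunneling / HBAR, 0.0, weighted_sum / HBAR)

def steady_state(gam, temperature):
    """lambda_ss = -(Gamma/|Gamma|) * tanh(hbar*|Gamma| / (2*kB*T))."""
    norm = math.sqrt(sum(c * c for c in gam))
    scale = -math.tanh(HBAR * norm / (2.0 * KB * temperature)) / norm
    return tuple(scale * c for c in gam)

def cross(a, b):
    """Cross product of two 3-component vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def euler_step(lam, gam, lam_ss, tau, dt):
    """One explicit Euler step of
    d(lambda)/dt = Gamma x lambda - (lambda - lambda_ss) / tau."""
    precession = cross(gam, lam)
    return tuple(l + dt * (p - (l - s) / tau)
                 for l, p, s in zip(lam, precession, lam_ss))
```

With Γ held constant, repeated calls to `euler_step` relax the coherence vector toward `steady_state(gam, temperature)`, provided the time step is small compared with both τ and 1/|Γ|. In a full simulation Γ would be recomputed for every cell at every timestep from the clock and the neighbouring polarizations.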
2.3.3 Simulation Speed
Each simulation of a 22,558 cell CLB takes a little less than an hour to
complete using the bistable simulation engine. The bistable simulation
engine is used in our research because it provides basic architecture
verification an order of magnitude faster than the coherence vector
simulation. For larger circuits, the time required grows faster than
linearly. For example, a circuit four times as big, with four CLBs,
requires 10 hours to complete the simulation. The CMOS versions of these
circuits would not be extraordinarily complex in terms of the number of gates
(i.e., devices), 572 gates for one CLB; but because gates and wires are
made up of complex fundamental devices in QCA, the simulation is
much more complex and time consuming.
It is observed that for both of these simulation engines, the new
polarization of a cell may be calculated concurrently with the new
polarization of any and all other cells without violating any data
dependencies. This means that there is a great amount of parallelism that may
be exploited when executing these simulations. If one computational
core can be assigned to each cell, then a very large circuit would take
the same time to simulate as a small circuit currently does, yielding
extraordinary speedup. Such a task is well-suited to FPGAs, as they




Using FPGAs for Application Speedup
An FPGA is a chip containing a reconfigurable hardware structure
which can be programmed to implement any desired hardware circuit.
It has no fixed functionality, and unlike a CPU, programming it results
in an actual reassignment and reconfiguration of hardware resources.
FPGAs often have a repeating structure with computing and memory
resources spread throughout the chip, allowing them to be used for a
large number of applications. Their versatility allows them to serve
as specialized application specific hardware platforms without the high
development costs associated with ASICs. They can be used for hard-
ware prototyping before sending a design to ASIC production, or may
be used as end platforms in themselves.
3.1 Basic Architecture and Functionality
Although there is no universal design standard for an FPGA, the most
common approach is to use a combination of programmable logic blocks
and programmable interconnects. Logic may be programmed using
look up tables that consist of SRAM cells attached to multiplexer in-
puts, and interconnects may be programmed using solid-state switches.
Some FPGAs may include embedded multipliers, digital signal pro-
cessing (DSP) units, block RAM, and other features, but the basic
principle remains the same. FPGAs are usually laid out in a repeating
grid pattern, which facilitates their use for parallelizing calculations.
Computing resources are distributed evenly throughout the chip, al-
lowing multiple computational cores to be implemented at once, cores
that may or may not be identical to each other. Memory resources are
also distributed such that computing resources in one region may use
memory in that region while other memory is free to be used by other
resources, resulting in a very large effective memory bandwidth.
3.2 Use In Application Speedup
A large class of applications commonly implemented on FPGAs is im-
age processing. These are ideally suited to FPGAs because they rely
on simple fixed point mathematics and their operation is often highly
parallelizable. Therefore, there is a long history and large body of
work related to accelerating image processing on FPGAs. Of particu-
lar interest is the 2D Convolution algorithm [37], often used for image
noise reduction, which replaces a pixel value with a weighted average of
surrounding pixel values. The mathematical algorithm has some simi-
larities to that used in the bistable simulation engine of QCADesigner,
both using a sum of products term based on surrounding values. It
calculates the new value of a pixel by computing the sum of the products
of the surrounding pixel values and the corresponding convolution weights
(see figure 3.1). Reference [37] also shows how to use a basic 3x3 convolver
design to compute arbitrarily large convolution kernels. This might be
analogous to using a basic QCA simulation core with a 50 nm radius
of effect to enable the simulation of any larger radius of effect without
having to use a different core.
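
The sum of products structure of the algorithm can be illustrated with a short software sketch. The function name and the choice to copy border pixels unchanged are assumptions made for this example; the sketch does not attempt to model the hardware convolver of [37].

```python
def convolve3x3(image, kernel):
    """Apply a 3x3 kernel to every interior pixel of a 2D image.

    image:  list of rows (lists of numbers)
    kernel: 3x3 list of weights
    Each output pixel is the sum of the nine neighbouring pixel values
    multiplied by the corresponding kernel weights (the kernel is applied
    without flipping). Border pixels are copied unchanged.
    """
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += kernel[ky][kx] * image[y + ky - 1][x + kx - 1]
            out[y][x] = acc
    return out
```

The inner two loops are the sum of products; in the bistable engine the analogous sum runs over the neighbouring cells, with polarizations in place of pixel values and kink energies in place of kernel weights.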
Many advanced scientific applications are also executed on FPGAs,
using a variety of implementation methods. In [6], an alternative num-
ber system, logarithmic floating point (see Section 3.3), is implemented
and an application to accelerate Quantum Chromodynamics calculation
is constructed using Handel-C. Handel-C is one of many alternatives to
VHDL, like JHDL [19] and Carte [40, 21], that abstract the hardware
description task to a level more familiar to software developers. This
can make development easier by lowering the learning curve and mak-
ing the hardware description language more similar to the original lan-
guage of the program being adapted. Alternative number systems are
Figure 3.1: Basic mathematical structure of 2D convolution algorithm (reproduced
from [37])
also a common practice in FPGA development to reduce the hardware
demands of a system where possible [15, 12, 13].
Molecular Dynamics (MD) simulations are also often implemented
on FPGAs to good effect, often achieving greater than 5x speedup
[21, 15, 14, 13], sometimes much more. A variety of architectures and
methodologies are used to obtain these results, including hardware/soft-
ware implementations [21, 40, 15, 39, 13], discrete event simulation [29],
and multigrid computing [14]. In hardware/software implementations,
only a portion of the application is ported to hardware, while other
parts are left in software for ease of development and optimum
resource usage. To speed up the simulation, discrete event simulations
use a variety of simplifications to advance MD simulations by event
rather than timestep. The multigrid computing technique breaks the
computation up into spatial grids to parallelize computation and mem-
ory access, and iteratively uses different sized grids to balance between
Figure 3.2: Binary reduction tree structure used in [32]
speed and accuracy.
Some floating point applications, including SPICE simulation [20],
matrix operations [10, 27, 32], and Fast Fourier Transforms (FFTs)
[17] have also been implemented on FPGAs with speedups of up to 18x
compared to software implementations. For matrix operations, which
involve relatively simple multiply-accumulate mathematics, many float-
ing point cores can be implemented using spatial parallelism to produce
a result quickly in a binary reduction tree pipeline, such as the
one used in a Jacobi iterative solver (see Figure 3.2) [32]. However, for
the more complex mathematical demands of SPICE and FFT, some
spatial parallelism is exchanged for longer pipelines that continuously
compute a stream of data.
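
The structure of a binary reduction tree can be sketched in software as follows. The function name and the zero padding for odd-sized levels are assumptions made for illustration; in hardware every level would be a row of adders operating in parallel, and the levels would form the stages of the pipeline.

```python
def tree_reduce_dot(row, vector):
    """Dot product computed as leaf multiplies followed by pairwise adds.

    Returns (result, adder_levels), where adder_levels is the depth of
    the adder tree.
    """
    values = [a * x for a, x in zip(row, vector)]   # multiplier leaf nodes
    if not values:
        return 0.0, 0
    levels = 0
    while len(values) > 1:
        if len(values) % 2:            # odd count: pad with a zero operand
            values.append(0.0)
        # One tree level: every adjacent pair is summed independently.
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels
```

For n products the tree has about log2(n) adder levels, so eight products need three levels instead of the seven sequential additions of a simple accumulator.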
3.3 Number Systems
There are a variety of number systems that can be used for calculation
in FPGA applications, and which one is selected depends on the
requirements of the application. Floating point is typically preferred
in scientific applications due to its high dynamic range and good rel-
ative accuracy. This means that a very large range of values can be
stored while at the same time having a very small percent error, inde-
pendent of the magnitude. Floating point comes in two basic flavors:
single precision and double precision. The IEEE 754-2008 Standard
for Floating Point Arithmetic [8] defines single precision to be a 32 bit
0 | 10001110 | 00100010111001100110111  = 37235.215
sign = 0 | exponent = 142 - 127 = 15 | mantissa = 1.00100010111001100110111 (binary) = 1.1363286
Figure 3.3: A 32 bit floating point number
number with 1 sign bit, 8 exponent bits, and 23 mantissa bits. Double
precision has 1 sign bit, 11 exponent bits, and 52 mantissa bits to make
a 64 bit number. Both of these precision levels function in the same
way. The exponent is determined by the exponent bits’ difference from
a bias value to give a positive or negative exponent. The mantissa bits
combine with an unrepresented implied leading 1 to give the base value
on which the exponent will operate, and the value is read in decimal
as (−1)^s × m × 2^e, where s is the sign bit, m is the mantissa value, and
e is the exponent. Figure 3.3 shows a 32 bit floating point example.
This is a very versatile method of representing numbers. However, due
to the complexity of operating on such representations, floating point
arithmetic cores are large compared to other implementable systems,
so they are often eschewed in favor of simpler number systems.
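
The decoding rule above can be checked against the bit pattern of Figure 3.3 with a short sketch. The function name is an assumption, and only normalized numbers are handled (zeros, subnormals, infinities, and NaNs are ignored).

```python
import struct

def decode_single(bits):
    """Split a 32-bit pattern into IEEE 754 single precision fields.

    Handles normalized numbers only (exponent field not 0 or 255).
    Returns (sign, biased_exponent, mantissa_field, value).
    """
    sign = (bits >> 31) & 0x1
    biased_exponent = (bits >> 23) & 0xFF
    mantissa_field = bits & 0x7FFFFF
    significand = 1.0 + mantissa_field / float(1 << 23)   # implied leading 1
    value = (-1.0) ** sign * significand * 2.0 ** (biased_exponent - 127)
    return sign, biased_exponent, mantissa_field, value

# Bit pattern shown in Figure 3.3: sign, 8 exponent bits, 23 mantissa bits.
FIGURE_BITS = 0b01000111000100010111001100110111

sign, exponent, mantissa, value = decode_single(FIGURE_BITS)
# Cross-check against the platform's own single precision decoding.
reference = struct.unpack(">f", FIGURE_BITS.to_bytes(4, "big"))[0]
```

Here `exponent` is the biased value 142, giving an unbiased exponent of 15, and `value` agrees with the 37235.215 quoted in the figure to the precision shown there.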
The simplest form of number system is fixed point, often realized as
just integer arithmetic. This system is used in applications where very
little dynamic range is needed, such as some forms of image processing
[37, 38, 30, 5]. Fixed point arithmetic cores are simple, compact, and
fast, but only because the computations are simple. They are not very
useful in scientific applications where higher dynamic ranges and more
accuracy are required.
There are other number systems which are more versatile than fixed
point, but easier to implement than floating point. A logarithmic float-
ing point number system converts floating point numbers into a loga-
rithmic floating point domain, allowing logarithm arithmetic to be used
[6]. This means that multiplication, division, and square root opera-
tions are all made much simpler, but addition and subtraction are more
complex than in standard floating point. This is because multiplying
or dividing in the logarithmic domain requires only simple addition or
subtraction of logarithms, whereas addition and subtraction operations
require large look up tables. This
number system is effective when the majority of operations are multi-
plication, division, or square root, but the advantage is lost if too many
additions or subtractions are present.
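
The trade-off can be made concrete with a small sketch of logarithmic arithmetic on positive numbers. It is not the number system of [6]: the function names are assumptions, sign handling is omitted, and the correction term that hardware would typically read from a look up table is computed directly.

```python
import math

def to_lns(x):
    """Represent a positive real number by its base-2 logarithm."""
    return math.log2(x)

def from_lns(lx):
    """Convert a base-2 logarithm back to the real number it represents."""
    return 2.0 ** lx

def lns_mul(la, lb):
    return la + lb            # multiplication becomes one addition

def lns_div(la, lb):
    return la - lb            # division becomes one subtraction

def lns_sqrt(la):
    return la / 2.0           # square root becomes a halving

def lns_add(la, lb):
    """log2(a + b) = max + log2(1 + 2^(min - max)).

    The second term is the nonlinear correction that makes addition
    expensive in the logarithmic domain.
    """
    hi, lo = max(la, lb), min(la, lb)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))
```

Multiplication, division, and square root each reduce to a single cheap operation, while `lns_add` needs a function evaluation per addition, which is where the advantage is lost when additions dominate.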
A system called semi-floating point [15] may also be used. In this
system, the decimal point may shift only to certain places. This allows the
arithmetic to be carried out using a limited number of fixed point cores
or look up tables, simplifying the hardware requirements compared to
floating point. However, it requires that the application need only a
small amount of dynamic range and that all numbers in the computational
problem can be represented accurately with the limited set of decimal places.
Additionally, unlike floating point, neither logarithmic nor semi-floating
point arithmetic is standardized.
FPGAs have not historically been well-known as effective platforms
for floating point applications due to the large hardware requirements,
but this has changed in recent years. Studies such as [47] have shown
that FPGA floating point performance has increased to the point of
surpassing general purpose CPUs, especially since the inclusion of em-
bedded multiplication units. Still, only a small number of floating point
FPGA applications have been developed, so there is much room to ex-
plore this feature.
3.4 Hardware/Software Design
Hardware/software design is an application design technique where the
hardware and software aspects are designed simultaneously to allow
a software program to communicate with and use an FPGA to carry
out specialized functions. This is a practice that is possible because
of the programmable nature of FPGAs. It differs from using plug-in
ASICs as part of an application because the hardware can be more spe-
cialized, and the software can potentially use it more effectively [39].
Hardware/software design allows a developer to exploit the strengths of
both software and hardware. For example, software is good at handling
tasks that require complex control structures, and can easily present
setup or results in GUI form for ease of use and interpretation, while
hardware is able to parallelize computation for faster execution. Hard-
ware/software design is also very well suited to porting an application
from a software-only implementation to a platform which allows hard-
ware acceleration, because a smaller portion of the application will need
to be redeveloped, cutting down on development time. In order to be
effective, however, the communication avenue between the hardware
and software must not present a significant bottleneck.
3.5 Nature of QCADesigner Computation
The actual algorithmic structure of QCADesigner bears some striking
similarity to the 2D convolution image processing algorithm [37]. A
central cell (analogous to a pixel) is acted upon by a sum of weighted
products in the form of surrounding cells’ polarizations and kink en-
ergies (analogous to the pixel values and convolution weights). There
were techniques presented in [37] which could be used to make arbi-
trarily large and efficient convolution grids (analogous to a variable
radius of effect) for large images (analogous to circuits), using repeated
instances of a relatively small mathematical core. However, the mathe-
matics present in the 2D convolution problem are comparatively simple,
using integer arithmetic and a fixed convolution grid with 4-bit weights
for each pixel, and only 16-bit pixel values. QCADesigner requires 32-
bit floating point values for both cell polarizations and kink energies
(pixel values and convolution weights), so the hardware would consume
far too much area to allow this type of architecture to be implemented
on available resources.
The numerics of QCADesigner are more similar to N-body physics
simulations such as molecular dynamics and quantum chromodynamics.
If a QCA cell is considered like a “particle,” the two problems have
many similarities. Both the cell and the particle change their state
based on the state and proximity of other surrounding particles, and
both problems require high-precision mathematics. On the other hand,
QCA is a simpler problem than N-body simulations because the cells
(“particles”) do not move, and are generally limited to two dimensions
instead of three, therefore making some forms of grid-based memory
structures [14] easier to manage.
The floating point FPGA applications in [20], [17], and [32], also
have similar numerical demands to QCADesigner, but their architecture
and controls are more straightforward. They are implemented using
floating point computational pipelines and data streaming techniques,
easily adaptable to other mathematical requirements. However, these
applications’ mathematics are limited to multiplication and addition,
whereas QCADesigner demands division and square roots as well. Division
and square roots are more complex functions than multiplication and
addition, so there will be more hardware consumed in each core, leaving
less room for multiple cores.
3.6 The Wildcard Hardware
After completion of the background research, and after it was deter-
mined that hardware/software design was viable for this application,
an appropriate implementation platform is selected. An Annapolis Mi-
crosystems Wildstar II, a PCI-based peripheral for use with
desktop computers, is used in [13] and [14] to implement a hard-
ware/software application. It possesses two Xilinx Virtex II FPGA chips as
computing elements, as well as on-board memory and PCI communi-
cation drivers for interaction with software.
A similar upgraded card is available, the Wildcard 4, which has a
Xilinx Virtex 4 FPGA and increased on-board memory. Ready-made
libraries for both hardware and software development exist which sup-
port high data transfer rates over the PCI interface, facilitating the
rapid establishment of effective communication between hardware and
software.
In addition to the Virtex 4 chip, the Wildcard 4 also has 128 MB of
DRAM, 8 MB of SRAM, two clocks (one fixed on the communication
bus, the other programmable up to 240 MHz), and PCI communication.
Refer to figure 3.4 for a block diagram of the system. Also included are
libraries for interacting with the card via C programs and simulating
a software driver in VHDL. The PCI interface allows for several types
of communication, including register transfer, direct memory access
Figure 3.4: Block diagram of Wildcard 4 hardware




The design of a system that could potentially speed up the execution
of QCADesigner could take one of many shapes based on a large set of
constraints. The functionality of the design is explained in this chapter,
as well as some of the related key decisions and software analysis that
took place beforehand.
4.1 Software Analysis
Prior to committing to any new design, the QCADesigner software is
thoroughly analyzed. This analysis includes application profiling using
gprof; qualitative analysis of the code by means of examination and
debug outputs to determine where most of the computation occurs;
further code analysis to determine the memory structures used to store
data and assess how best to use the FPGA resources; numerical analysis
to track in detail the dynamic range of critical values in the calculation
and compare the accuracy of double precision to single precision floating
point. The goal here is to identify which section of code should be
converted to hardware, and confirm that it has as much parallelizability
as expected based on the simulation engine’s mathematics.
4.1.1 Code Analysis
The first step in analyzing the QCADesigner source code is done using
gprof, which generates an execution profile of the application. This
breaks down how much computation time is spent in each function,
and gives other statistics such as how many times each function is
called and where it is called from. This is done by adding the -pg
option to the gcc command line, running the application, and running
the gprof tool on the resulting profile data dump. Each such execution
profile is generated by opening QCADesigner, loading a circuit, setting
up the simulation, running the simulation to completion, then closing
QCADesigner. The results of this gprof analysis show that for
simulations of any appreciable size (i.e. more than one second long),
the dominant function is run_bistable_simulation (see Appendix A for a
complete code listing of this function), which takes approximately 90%
to 98% of the execution time. This initial analysis shows that enough
processor time is concentrated in a single function for QCADesigner to
be a promising candidate for hardware acceleration.
Deeper analysis of the code and application execution details reveals
which specific operations take up the most execution time. For a
particular run of the CLB simulation, 96.6% of the execution time is
spent in the function run_bistable_simulation, which includes 141
seconds of “initialization” and 3500 seconds of “simulation.”
Initialization takes place in lines 89 through 168 of listing A.1, and
most of the time consumed there is actually used by a function called
bistable_refresh_all_Ek, which calculates all of the kink energies that
will be used in the simulation. The simulation takes place in lines 170
through 362 of listing A.1, but the execution time is dominated by a
smaller loop of code, lines 268 through 318 in listing A.1 (reproduced
below for convenience).
iteration = 0;
stable = FALSE;
while (!stable && iteration < max_iterations_per_sample)
{
  iteration++;
  // -- assume that the circuit is stable -- //
  stable = TRUE;

  for (icLayers = 0; icLayers < number_of_cell_layers; icLayers++)
  {
#ifdef REDUCEDEREF
    number_of_cells_in_current_layer = number_of_cells_in_layer[icLayers];
    for (icCellsInLayer = 0; icCellsInLayer < number_of_cells_in_current_layer; icCellsInLayer++)
#else
    for (icCellsInLayer = 0; icCellsInLayer < number_of_cells_in_layer[icLayers]; icCellsInLayer++)
#endif
    {
      cell = sorted_cells[icLayers][icCellsInLayer];

      if (!((QCAD_CELL_INPUT == cell->cell_function) ||
            (QCAD_CELL_FIXED == cell->cell_function)))
      {
        current_cell_model = ((bistable_model *) cell->cell_model);
        old_polarization = current_cell_model->polarization;
        polarization_math = 0;

        for (q = 0; q < current_cell_model->number_of_neighbours; q++)
          polarization_math += (current_cell_model->Ek[q] *
            ((bistable_model *) current_cell_model->neighbours[q]->cell_model)->polarization);

        // math = math / 2 * gamma
        polarization_math /= (2.0 * sim_data->clock_data[cell->cell_options.clock].data[j]);

        // -- calculate the new cell polarization -- //
        // if math < 0.05 then math/sqrt(1+math^2) ~= math with error <= 4e-5
        // if math > 100 then math/sqrt(1+math^2) ~= +-1 with error <= 5e-5
        new_polarization =
          (polarization_math > 1000.0) ? 1 :
          (polarization_math < -1000.0) ? -1 :
          (fabs(polarization_math) < 0.001) ? polarization_math :
          polarization_math / sqrt(1 + polarization_math * polarization_math);

        // -- set the polarization of this cell -- //
        current_cell_model->polarization = new_polarization;

        // If any cells polarization has changed beyond this threshold
        // then the entire circuit is assumed to have not converged.
        stable = (fabs(new_polarization - old_polarization) <=





Listing 4.1: Main simulation loop
The loop in listing 4.1 iterates through every cell in every layer of
the circuit being simulated and calculates a new polarization for that
cell, forming the essential basis of the simulation engine. It reiterates
until total circuit stability is achieved, or until the maximum number
of iterations has elapsed. It executes to its own completion once for
every sample in the simulation, so these 50 lines of code account for
a very large portion of the total execution time of a simulation run by
QCADesigner. Listing 4.1 therefore represents the portion of
QCADesigner code that is most desirable to execute in hardware.
Examining the code for data dependencies, which would affect how it
could be accelerated, it is clear that the samples of the simulation
must be executed sequentially, so the computation cannot be
parallelized across samples. It can also be observed that some cells
whose polarizations are calculated in later iterations of the loop in
listing 4.1 will be affected by cell polarizations that are calculated
in preceding iterations. However, this represents a “numerical problem”
within the calculation, rather than an actual data dependency, as
mentioned in the code’s comments (lines 147-149 of listing A.1).
Comparison of the theory of the bistable engine with the actual C code
execution indicates that no such data dependency is necessary in the
calculation. Its existence in the software is due to the nature of the
storage structures and limited resources. Memory resources on the PC
are constrained due to operating system overhead and the need for
multitasking, preventing the data duplication necessary to avoid the
numerical problem altogether. The software attempts to mitigate this
numerical problem by randomizing the cell list order (lines 152 to 162
in listing A.1) so that the early changes carry as little influence as
possible.
The lack of any true data dependencies between cells means the
idea presented in Section 2.3 of calculating every cell simultaneously
is mathematically plausible. This confirms that a high degree of
parallelism is available to be exploited. The limiting factor of a
practical implementation would be the availability of hardware
resources, but given a large enough chip, such an architecture could
be realized.
The critical data in the calculation, as can be seen in listing 4.1, are
the kink energies, polarizations, and gamma (clock) value associated
with each cell. The specifics of how they are stored in memory, and
how often they change, influence what the most effective design for
retrieving, calculating, and delivering data will be. The communication
functions in the Wildcard libraries operate much more efficiently with
data in contiguous blocks such as arrays, and larger transfers are always
more efficient. Kink energies are stored as contiguous arrays associated
with each cell, but only contiguous by cell, so high volume transfers are
not practical. However, kink energies do not change between iterations
or samples in the simulation, so they can be transferred one time at
the beginning and kept for the entire execution time. Polarizations are
stored by cell, individually, so no bulk transfers can be made without
rearranging the data. Additionally, they change for each iteration of
the simulation, so they must be updated each time. Gamma values are
stored in an array by sample index, but there are only four distinct
values that need to be used at any given time. They could be
transferred in high volume for all samples, or could use the alternate
register transfer method on a per-sample basis.
The total memory this data occupies is also important in determining
how it may be used. The values are stored in 64 bit double precision
floating point variables, and a CLB has about 22,000 cells. Given an
effective radius of 500 nm (for reference, RIT’s simulations use a
radius of 50 nm), the average number of neighbours per cell is about
400, which means 400 kink energies must be stored for each cell, for a
total of about 70 MB of kink energy storage. Only one polarization per
cell is required, for a total of just 176 kB. For the 300,000 sample
simulation being run here, 9.6 MB would be required for gamma value
storage. These totals would be halved if using single precision
floating point. Either way, there is enough memory on board the
Wildcard 4 to accommodate all this data locally.
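The storage totals quoted above can be reproduced with a few lines of arithmetic. The following C sketch is only a check of those figures; the function names are hypothetical, megabytes are taken as decimal (10^6 bytes), and the factor of four clocks in the gamma estimate is an assumption inferred from the four distinct gamma values mentioned earlier and from the 9.6 MB figure.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes for kink energies: one value per (cell, neighbour) pair. */
uint64_t kink_energy_bytes(uint64_t cells, uint64_t avg_neighbours,
                           uint64_t bytes_per_value)
{
    assert(bytes_per_value == 4 || bytes_per_value == 8);
    return cells * avg_neighbours * bytes_per_value;
}

/* Bytes for polarizations: one value per cell. */
uint64_t polarization_bytes(uint64_t cells, uint64_t bytes_per_value)
{
    assert(bytes_per_value == 4 || bytes_per_value == 8);
    return cells * bytes_per_value;
}

/* Bytes for gamma values: one value per sample per clock (assumed 4 clocks). */
uint64_t gamma_bytes(uint64_t samples, uint64_t clocks,
                     uint64_t bytes_per_value)
{
    assert(bytes_per_value == 4 || bytes_per_value == 8);
    return samples * clocks * bytes_per_value;
}
```

With 22,000 cells, 400 neighbours, and 8 byte values, kink_energy_bytes gives 70,400,000 bytes, polarization_bytes gives 176,000 bytes, and gamma_bytes for 300,000 samples and four clocks gives 9,600,000 bytes. Passing 4 as bytes_per_value halves each total.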
4.1.2 Numerical Analysis
An extensive numerical analysis is conducted on QCADesigner to
determine what number systems may be acceptable candidates for
implementation. The dynamic ranges of the kink energy, polarization, and
other intermediate variables are tracked in detail throughout a variety
of simulations in order to determine the numerical accuracy demanded
by QCADesigner. To begin with, the polarization value of a cell is
typically close to 1 or -1; in theory, however, it can be arbitrarily
small. In practice polarizations are not often reported to be any
smaller than 1e-6. Kink energies are more extreme. With a large radius
of effect, a kink energy of less than 1e-42 is not uncommon, while
others come in at 1e-22, representing a minimum of 20 orders of
magnitude of dynamic range that must be supported by the kink energy
variable. With less extreme radii of effect, kink energies of less than
1e-39 become increasingly rare, which is a value readily handled by
single-precision floating point.
The first intermediate calculation value that comes up to be analyzed
is polarization_math, specifically the code shown in listing 4.2. In this
code, the polarization_math variable is a destination for the sum of
all neighbouring cells’ polarization × kink-energy products.
for (q = 0; q < current_cell_model->number_of_neighbours; q++)
  polarization_math += (current_cell_model->Ek[q] *
    ((bistable_model *) current_cell_model->neighbours[q]->cell_model)->polarization);
Listing 4.2: Polarization math accumulator
A possible issue with reducing the usable precision of variables here
is cumulative rounding errors. Up to 20% of the polarization ×
kink-energy values could be less than 5e-42, which is the point in
single precision floating point at which about half of the significand
precision, and thus a significant amount of accuracy, is lost. Adding
that many rounded values together can create a noticeable rounding
error in the final total, depending on what the final total is. This
error, once created, could propagate itself through all subsequent
iterations and samples of the simulation. However, this accumulation
operation typically involves many higher magnitude values as well. Even
in double precision floating point, the least significant digit of a
number can be no more than about 16 decimal orders of magnitude below
its most significant digit. So the net effect of low-magnitude rounding
errors will be numerically insignificant if not zero. In fact, analysis
shows that the total contribution of low-magnitude values to the total
in this accumulation is very rarely more than zero, partly due to
having offsetting signs.
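The absorption effect described above, in which a very small addend leaves a much larger running total unchanged, can be demonstrated directly. The C sketch below is illustrative only: the helper names are hypothetical, and the magnitudes 1e-22 and 1e-42 are simply the kink energy extremes quoted earlier, not values taken from a particular simulation.

```c
#include <assert.h>

/* Returns 1 if adding `small` to `big` leaves `big` unchanged in single
 * precision. The volatile store forces rounding to float. */
int is_absorbed_float(float big, float small)
{
    assert(big != 0.0f);
    volatile float sum = big + small;
    return sum == big;
}

/* Same check in double precision. */
int is_absorbed_double(double big, double small)
{
    assert(big != 0.0);
    volatile double sum = big + small;
    return sum == big;
}
```

A term of 1e-42 added to a total of 1e-22 is absorbed in both single and double precision, because the two values are 20 orders of magnitude apart. A term of 1e-26 added to 1e-22 is not absorbed in single precision, since it lies within the roughly seven decimal digits that a float carries.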
polarization_math /= (2.0 * sim_data->clock_data[cell->cell_options.clock].data[j]);

new_polarization =
  (polarization_math > 1000.0) ? 1 :
  (polarization_math < -1000.0) ? -1 :
  (fabs(polarization_math) < 0.001) ? polarization_math :
  polarization_math / sqrt(1 + polarization_math * polarization_math);
Listing 4.3: Gamma and square root simplification for polarization
Listing 4.3 shows another place in the code where dynamic range
consideration is important. Here, the polarization_math variable is
divided by the clock value, which typically ranges between 3.8e-23 and
9.8e-22, bringing the magnitude of polarization_math back up. Then,













is greater than 0.001 and less than 1000. Therefore the squared term
has no chance of being less than 1e-6, and cannot possibly become a
bottleneck regarding precision. The simplification in this part of the
equation also renders earlier imprecision even more insignificant.
Figure 4.1: QCADesigner simulation using single precision floating point
Considering this analysis, it is clear that floating point is the best
number system to use. This application is not suited to use with fixed
or semi-floating number systems, because of the large dynamic ranges
required, and the fact that a large portion of the operations are
additions rules out logarithmic floating point. In addition, floating
point arithmetic cores are readily available as IPCores in the Xilinx
ISE, which greatly reduces development time. Considering that the data
channel for communication with the FPGA is only 32 bits wide, and that
double precision floating point cores are very large, single
precision floating point is preferable to double precision for this design.
With this in mind, QCADesigner is tested using single precision
floating point in software to confirm that simulation accuracy is main-
tained. By simply changing variable types from double to float and
Figure 4.2: QCADesigner simulation using double precision floating point
comparing the simulation results, it is determined that single preci-
sion floating point is a safe implementation to use. Figure 4.1 shows
the simulation results using single precision, and figure 4.2 shows the
simulation results using double precision. It can be seen that the ex-
pected results (samples 190000 to 300000 on output OUT WEST) are
the same and correct for both. There is some minor discrepancy be-
tween the implementations for the early portion of the results, but this
is inconsequential. Because the latency to the outputs is 16 cycles [45],
any values that appear on the outputs before this latency has elapsed
are simply noise that has been amplified, so they have no special rel-
evance to the simulation. Closer inspection of the numerical results
shows that there is far less than 1% error between double precision
and single precision results, even at the end of the simulation where
rounding errors would have a greater chance to accumulate to signifi-
cant values. Thus single precision floating point is deemed acceptable
for this application.
4.2 Hardware Architecture
To accelerate the above simulations, several architectures are
considered for implementation, including streaming, grid-based
parallelism, cellular automata, and instance specific design. A
streaming architecture takes a constant stream of data in, processes
it, and sends a stream
out, achieving application speedup by means of parallelization in a com-
putational pipeline. Grid-based parallelism would break the problem
into spatial regions and compute results within each grid, achieving
speedup through spatial parallelism. Cellular automata (unrelated to
QCA) refers to a very fine-grained form of grid parallelism in which
one computational core corresponds to one element that needs to be
computed (i.e. one core per QCA cell), and these cores are laid out in a
large repeating structure. Instance specific design uses a basic starting
design and procedurally generates a large design to be implemented on
the FPGA, unique to the specific circuit being simulated [16]. Cellu-
lar automata would be the ideal implementation to achieve maximum
speedup, but the floating point mathematics are too complex to sup-
port a large number of cores, so it must be discounted. The Wildcard
4 libraries do not support any means of procedural circuit generation,
so instance specific design is not feasible. A streaming architecture,
possibly one with multiple pipelines to increase parallelization as much
as possible, is decided on. This also allows easy integration with the
software.
4.2.1 Proof of Concept
Before designing a final general purpose version of the hardware, a
small proof of concept version is created, dubbed “Minisim.” It is an
instance specific design, simulating a QCA majority gate in one clock
zone using a streaming architecture. It is meant to confirm that a QCA
circuit can be simulated accurately using single precision floating point
on hardware in a reasonable time while using a reasonable amount of
resources, and to gauge how much of the simulation can be put on
hardware. Initially, the design attempts to run with a starting point of
just cell positions, dynamically calculating the kink energies to be used
in the simulation. The additional hardware required to calculate the
kink energies proves to be excessive, and because kink energy calcula-
tion only contributes a small fraction of execution time, the concept is
abandoned.
Minisim’s final design involves hard-coded kink energies implying
the structure of a majority gate, with the input values and a stream
of gamma values coming from software, and a stream of output values
going back to the software. This hardware/software execution proceeds
to completion in the same perceived amount of time as the software
implementation, even for long simulations. The numerical results match
software to within 0.001% for steady state values, but deviate from
each other more during transitional values. This greater deviation is
only in terms of relative error; the absolute error is still quite
small, on the order of thousandths. Plots of the hardware and software
results of Minisim are shown in Figures 4.3 and 4.4 respectively. The
synthesis report of Minisim shows that 33% of the board’s total
available resources are used by the hardware, giving initial
indications that three or four computational cores could be included on
the FPGA simultaneously. However, these are still rather large cores,
so any sort of granular grid computing structure is out of the question.
It was determined earlier in Section 4.1.1 that the total set of kink
energy data for a 22,000 cell circuit with a radius of effect of 500 nm
uses about 35 MB in single precision, out of the Wildcard’s available
128 MB of DRAM. Using this same radius of effect, circuits with 50,000
or even 80,000 cells can be easily accommodated, or larger if the
radius of effect is reduced. And because the kink energies do not
change throughout the simulation, they can be loaded once and
referenced later. All the kink energies are


















Figure 4.3: Hardware results from minisim test
they are calculated. There is no need for them to be in host memory
when all further calculations are done in hardware, so extra transfers
are avoided. Furthermore, they are stored in one solid contiguous block
with no form of demarcation between cells. Instead, reliance is put
upon having a consistent list of cells that does not change its order,
and whose neighbour counts are known.
Therefore, cell randomization must be abandoned. Recall from Section
4.1.1 that cell list randomization is done to minimize the effect of
the numerical problem by shuffling the order in which cell
polarizations are calculated. Abandoning it is of no consequence,
however, because the hardware will calculate an entire iteration
without replacing any values in memory, thus completely avoiding the
numerical problem the randomization was trying to reduce.
4.2.2 DRAM Interfacing
Using the DRAM to store kink energies is not completely without issue,
however. First of all, the DRAM has a latency of 173 cycles between
getting a read request and delivering the requested data, so steps must
be taken to ensure this latency is incurred as infrequently as possible.


















Figure 4.4: Software results from minisim test
stored is only 32 bits wide, so a special interface must be created to
write and read data 32 bits at a time.
The DRAM’s “startup latency” of 173 cycles applies when a new set
of read operations is started. Additional latency applies when seeking
new non-sequential addresses, but after the initial latency has passed,
each further sequential address read is delivered on the immediate next
cycle. If a constant stream of reads can be sustained, the entire
contents of DRAM can be delivered with the only latency cost being the
initial 173 cycles. The storage method of the kink energies in the design
actually allows for this to take place independent of the radius of effect,
because they are stored in the same order as they are processed.
However, the mathematical core demands data at a much slower rate than
it is delivered from DRAM, so the DRAM output must go through a
First-In-First-Out queue (FIFO) which can accumulate data and deliver it
when requested. This FIFO and the DRAM are carefully controlled to
ensure that neither overflow nor underflow occurs, so that no data is
lost or miscalculated. The FIFO is 256 words deep, and the state
machine will only send a read request to the DRAM if fewer than 50
values are in the FIFO (indicated by dram_fifo_prog_empty), allowing
the next 173 “cool down” words to be received without overflow.
Requesting a new read when the FIFO data count is back down to 50
allows time for the requested data to get into the FIFO before
underflow occurs.
port map (
  din        => dram_data_in,
  rd_clk     => i_clock,
  rd_en      => dram_fifo_rd,         -- controlled from state machine when data is needed
  rst        => dram_fifo_clr,        -- controlled in the state machine so empty flag resets
  wr_clk     => p_clock,
  wr_en      => dram_data_in_valid,
  dout       => dram_fifo_out,        -- goes to DRAM output buffer
  empty      => dram_fifo_empty,
  full       => dram_fifo_full,
  prog_empty => dram_fifo_prog_empty  -- set to turn on when less than 50 data are present
);
Listing 4.4: DRAM output FIFO
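The sizing argument for the DRAM output FIFO can be written out as a simple worst-case check. This C sketch is only arithmetic on the figures given in the text (256 words deep, a request threshold of 50, and 173 further words arriving after a request); the function names are hypothetical, and the worst case pessimistically assumes that nothing is read out of the FIFO while those words arrive.

```c
#include <assert.h>

/* Worst-case occupancy if a read is requested at `threshold` words and
 * `in_flight_words` more arrive before any word is consumed. */
unsigned fifo_worst_case_fill(unsigned threshold, unsigned in_flight_words)
{
    return threshold + in_flight_words;
}

/* Returns 1 if a request issued at `threshold` can never overflow a FIFO
 * of the given depth under the assumption above. */
int fifo_request_is_safe(unsigned depth, unsigned threshold,
                         unsigned in_flight_words)
{
    assert(depth > 0);
    return fifo_worst_case_fill(threshold, in_flight_words) <= depth;
}
```

With the values from the text, the worst-case fill is 50 + 173 = 223 words, which fits in the 256 word FIFO with 33 words to spare. A threshold of 100 would give 273 words and could overflow.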
The structures used to manage the data flow into and out of DRAM
are shown in their VHDL form in listing 4.5. They refer to
dram_addr(1 downto 0), which are two address line bits that are added
onto the normal 23 address line bits of the DRAM module. They are used
to refer to the 32 bit parts of the 128 bit DRAM word individually. This
part of the address line is controlled independently from the other 23
bits so that new DRAM addresses can continue being read at their
own pace independent of which value is actually demanded for use on
the output. The DRAM output manager is quite simple, with the low
address bits essentially controlling a multiplexer between the DRAM’s
output FIFO and the main math core.
The DRAM data input control is slightly more complex. Since data
is written to DRAM in larger portions than it is received, it must be
stored in parts to a register whose value is written every fourth address.
An override (dram_super_write) is built into the logic controlling the
write line to allow the final set of data to be written if fewer than
four 32 bit words occur in the last set.
-- DRAM output buffer, goes directly to p_math core
Ek_in <= dram_fifo_out(127 downto 96) when (dram_addr(1 downto 0) = "11") else
         dram_fifo_out(95 downto 64)  when (dram_addr(1 downto 0) = "10") else
         dram_fifo_out(63 downto 32)  when (dram_addr(1 downto 0) = "01") else
         dram_fifo_out(31 downto 0)   when (dram_addr(1 downto 0) = "00") else
         (others => '0');

-- dram input buffer
DRAM_input_buffer : process (i_clock, reset)
begin
  if (reset = '1') then
    dram_data_out <= (others => '0');
    dram_write <= '0';
  else
    if rising_edge(i_clock) then
      if (dram_buf_write = '1') then
        case dram_addr(1 downto 0) is
          when "00" =>
            dram_data_out(31 downto 0) <= in_fifo_out;
          when "01" =>
            dram_data_out(63 downto 32) <= in_fifo_out;
          when "10" =>
            dram_data_out(95 downto 64) <= in_fifo_out;
          when "11" =>
            dram_data_out(127 downto 96) <= in_fifo_out;
          when others =>
            dram_data_out <= dram_data_out;
        end case;

        if ((dram_addr(1 downto 0) = "11") or (dram_super_write = '1')) then
          dram_write <= '1';
        else
          dram_write <= '0';
        end if;
      else
        dram_write <= '0';
        dram_data_out <= dram_data_out;
      end if;
    end if;
  end if;
end process;
Listing 4.5: DRAM buffer control
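The behaviour of the buffer logic in listing 4.5 can be mirrored by a small software model, which may help when checking the packing order from the host side. The C sketch below is not part of the design: the type and function names are hypothetical, clock-level timing (such as the registered dram_write signal) is ignored, and the low address bits are assumed to advance by one on every buffered word, which in the hardware is handled by logic outside the listing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One 128 bit DRAM word as four 32 bit lanes; lane[0] is bits 31..0. */
typedef struct {
    uint32_t lane[4];
} dram_word;

/* Software model of the DRAM input buffer. */
typedef struct {
    dram_word buffer;   /* models dram_data_out */
    unsigned  addr_low; /* models dram_addr(1 downto 0) */
} dram_packer;

void packer_reset(dram_packer *p)
{
    memset(p, 0, sizeof *p);
}

/* Stores one 32 bit word in the lane selected by the low address bits.
 * Returns 1 and copies the buffer to *out when a 128 bit write would be
 * issued, i.e. on the fourth lane or when super_write is set. */
int packer_push(dram_packer *p, uint32_t word, int super_write,
                dram_word *out)
{
    assert(p->addr_low < 4);
    p->buffer.lane[p->addr_low] = word;
    int write = (p->addr_low == 3) || super_write;
    if (write)
        *out = p->buffer;
    p->addr_low = (p->addr_low + 1) & 3u; /* assumed address increment */
    return write;
}

/* Models the output multiplexer: selects one lane of a 128 bit word. */
uint32_t unpack_lane(const dram_word *w, unsigned addr_low)
{
    assert(addr_low < 4);
    return w->lane[addr_low];
}
```

As in the VHDL, a write forced by the override leaves the unused upper lanes holding whatever was stored there previously, so the reader must know how many words of the last set are valid.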
4.2.3 Other Interfacing Issues
Other interfacing concerns are involved with sending the polarization
and gamma data to the FPGA, and reporting back the new polarization
results from the FPGA. Methods of efficient transfer must be devised
that are able to meet a large range of parameters. Most importantly, the
hardware must be able to support any reasonable number of neighbours
(effective radius) and all reasonable circuit sizes.
The input into the FPGA is more restrictive than the output from
the FPGA. Data comes in faster than it is used, so data overflow is a
risk on the input, whereas on the output, data can be written to host
memory as soon as it is calculated with less concern of overflow,
because DMA writes are more explicitly controlled in hardware. Care
must be taken in software and hardware to avoid any read-before-write
errors that would corrupt the results of a calculation, while also
carrying out necessary data copying on the host in parallel with data
computation on the hardware.
Another key communication issue is interfacing between different
clock domains. The data transfer bus runs at 40 MHz, while the arith-
metic core runs at 140 MHz. Therefore, dual-clock FIFOs are used
on both the input and output, which are generated using the IPCore-
Gen tool in Xilinx ISE. They are also specifically sized and controlled
to prevent over and underflow while meeting the requirements of the
application.
The input FIFO is designed to have 1024 locations, which will allow
up to 1024 neighbours per cell. This is large enough to accommodate
a radius of effect of well over 500 nm, in which no cell in the CLB has
more than 650 neighbours. Should it be determined that support for
more neighbours is desired, it is easy to create a new, larger FIFO. The
DMA buffer on the software side is allocated at a matching fixed size.
For the hardware’s output, the FIFO does not need to be as large as
the number of cells in the circuit; it only needs to be large enough to
prevent any overflow or underflow problems that may arise from delayed
actuation of control signals. 256 locations is deemed an appropriate
size for this, and is tested to confirm functionality. The DMA buffer
for the results in software is allocated to be able to receive as many
cells as are calculated, which is not equal to the total number of
cells in the circuit; it only needs to accommodate the cells that are
not fixed or inputs, as seen in lines 93 and 94 of listing B.1.
With the architectures in place, the communication algorithm must
be created. In order to send kink energy or polarization data to the
FPGA without garbage data existing at the end of any block transfers,
the FPGA must know how many words to request in the DMA trans-
fer. This is accomplished by sending the number of neighbours to the
FPGA using a register transfer, which is slower than DMA for large
transactions, but better suited to small ones. For transferring polar-
izations, the gamma value for the cell to be calculated is also part of
the register transaction. At the end of a full phase (i.e. all kink ener-
gies have been transferred, or all cell data has been transferred for an
iteration), a special totem value is sent to the hardware instead of the
normal data, so the hardware knows to proceed with the next phase of
operation. The control hardware for this is simpler than including cell
number counters. The hardware sends interrupts back to the software
to indicate when all data has been read or calculated to ensure that no
read-before-write errors occur.
Due to the way polarization data is stored (not in solid blocks),
each neighbour polarization value must be copied individually to the
DMA transfer buffer, as in line 382 of listing B.1. This operation takes
the place of where the software would calculate the accumulation of
polarization_math. Only after this memory copy is complete does the
software wait for the interrupt signifying the completion of a cell’s
calculation, so that the workload is parallelized between host and
hardware.
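The per-neighbour copy described above can be sketched as a gather loop. The C code below is a simplified illustration and is not the code of listing B.1: the structure toy_model and the function name are hypothetical, the Wildcard DMA calls are omitted entirely, and the conversion to single precision is an assumption based on the number format chosen for the hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the per-cell model structure. */
typedef struct toy_model {
    double polarization;
    size_t number_of_neighbours;
    struct toy_model **neighbours;
} toy_model;

/* Copies each neighbour's polarization into a contiguous buffer, in
 * neighbour order, converting to single precision. Returns the number
 * of words to transfer for this cell. */
size_t gather_neighbour_polarizations(const toy_model *cell,
                                      float *dma_buffer, size_t capacity)
{
    assert(cell->number_of_neighbours <= capacity);
    for (size_t q = 0; q < cell->number_of_neighbours; q++)
        dma_buffer[q] = (float)cell->neighbours[q]->polarization;
    return cell->number_of_neighbours;
}
```

The returned count corresponds to the number of neighbours that is sent to the FPGA by register transfer, so that the hardware knows how many words to request.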
4.2.4 The Arithmetic Core
The main mathematical core is where all of the computation happens.
It consists of a pipeline of floating point math cores as shown in Figure
4.5. The floating point cores are all generated using the IPCoreGen tool
Figure 4.5: Block diagram of the polarization math pipeline
in the Xilinx ISE. The VHDL code for this core is shown in Appendix
D.
The control signals on each floating point core are for the most part
connected in such a way that the next core in the line is automatically
started when the data on the previous core is ready. Controlling the
multiplier and adder together to form the accumulator in the first stage
is slightly more complex, requiring precise timing and counting of
inputs. The delay line shown is simply a shift register in place to
synchronize data propagation through multiple branches of one pipeline.
Multiplying gamma by the constant value of two is accomplished by left
shifting the exponent of the floating point value, which saves on
hardware cost. It works without issue because the value in gamma is
never large enough that a 1 would be shifted out of the MSB. The
circuit’s stability is not calculated in this core, or in fact anywhere
else on the hardware. Because all polarizations are delivered back to
the host platform for each iteration, this comparatively simple
operation can be done in software with little extra cost.
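Doubling a floating point value by operating on its exponent field alone can be illustrated in software. The C sketch below uses the standard IEEE 754 single precision layout and increments the biased exponent by one, which is the usual way to multiply a normal number by two without a multiplier. Whether this corresponds exactly to the exponent manipulation in the hardware of Appendix D is an assumption, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Doubles a normal single precision value by adding one to its 8 bit
 * biased exponent (bits 30..23). Assumes a 32 bit IEEE 754 float. */
float times_two_by_exponent(float x)
{
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &x, sizeof bits);

    uint32_t exponent = (bits >> 23) & 0xFFu;
    /* Only valid for normal numbers whose double is still finite. */
    assert(exponent != 0 && exponent < 0xFEu);

    bits += (uint32_t)1 << 23; /* exponent + 1, sign and fraction untouched */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For the gamma range quoted earlier (about 3.8e-23 to 9.8e-22) the exponent field is far from both its minimum and its maximum, so the precondition in the sketch always holds.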
The floating point cores are all tuned to have minimum latency, while
guaranteeing a certain operating clock speed. Of special importance is
the adder’s latency in the accumulator, because this is the critical factor
of how fast a complete cell can be calculated, and how efficiently the full
pipeline may be used. Lower latency for a floating point unit means that
more levels of combinational logic must be waited on between stages in
the computational pipeline, so more delay is inherent per pipeline stage.
Increased latency means the clock may run faster, but more hardware
is required to register more stages of the pipeline. So, the least amount
of latency is desired while still maintaining a minimum clock rate.
There are two clocks on the Wildcard. One is fixed at 40 MHz and
governs data bus transactions. The other runs at a user-specified speed
up to 240 MHz, and in this case must be set to at least 130 MHz for
the DRAM to function properly. Computing the total time required
to calculate the polarization of one cell with ten neighbours shows that
there is little difference between latency configurations: less than 50 ns
out of a total of around 750 ns. A latency configuration is
selected which enables an optimum balance between speed and stability
at 140 MHz, as indicated by the Xilinx place and route tool.
At this point, the duration of a complete QCA circuit simulation
using this core can be calculated. It is based on the simulation of a
CLB (22,558 cells) with a radius of effect of 50 nm (average number of
neighbours per cell is 5), 300,000 samples, and a convergence tolerance
of 0.001 (average reiteration rate per sample of 1.7). Keep in mind that
this calculation assumes maximum pipeline usage and no overhead once
the calculation is started, so it represents a theoretical minimum time.
The combined latency of the accumulator stage is the only real per-
cell calculation cost, because a new cell can be accumulated as the
previous cell flows down the pipeline. Similarly, the only real per-
neighbour cost in the accumulator is in the adder, as the multiplier
only contributes its own additional latency once before any addition can
start. The accumulator latency is equivalent to the multiplier latency,
plus the adder latency for each neighbour, plus one cycle to account
for informing the core that a cell is finished with input. This value
gets multiplied by the number of cells to get the number of cycles for
one iteration, added with the number of cycles for the last part of the
pipeline to complete on the last cell, then multiplied by the number of
iterations per sample, then multiplied by the number of samples. This
total number of cycles for a simulation is divided by the clock speed to
get the elapsed time.
\[
\frac{\bigl((6 + 6 \cdot 5 + 1) \cdot 22558 + 14 + 4 + 5 + 14 + 14\bigr) \cdot 1.7 \cdot 300000}{140\ \mathrm{MHz}} = 3040.68\ \mathrm{s} \tag{4.2}
\]
This result is on par with the current execution time in software. Al-
though it assumes perfect pipeline usage and no overhead, several cores
can fit on the FPGA simultaneously. The execution time could be re-
duced by a linear factor equivalent to the number of cores present, thus
compensating for non-idealities and potentially achieving the desired
speedup.
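The arithmetic behind Equation 4.2 and the idealized linear scaling with the number of cores can be written as a short C sketch. The function and parameter names are illustrative only and do not appear in the thesis code. The sketch carries over the assumptions stated in the text: full pipeline usage, no communication overhead, and a perfectly linear speedup across identical cores.

```c
#include <assert.h>

/* Theoretical minimum simulation time in seconds, following Equation 4.2.
   All parameter names are illustrative. */
double estimate_sim_seconds(int mult_latency, int adder_latency,
                            double avg_neighbours, long cells,
                            int tail_cycles, double iters_per_sample,
                            long samples, double clock_hz, int cores)
{
    assert(clock_hz > 0.0 && cores > 0);
    /* Accumulator cost per cell: multiplier latency once, adder latency
       per neighbour, plus one cycle to signal the end of a cell's input. */
    double per_cell = mult_latency + adder_latency * avg_neighbours + 1.0;
    /* One iteration: every cell, plus the rest of the pipeline draining
       for the last cell. */
    double per_iter = per_cell * (double)cells + tail_cycles;
    double cycles = per_iter * iters_per_sample * (double)samples;
    /* Idealized linear speedup across identical cores. */
    return cycles / clock_hz / cores;
}
```

With the values used in Equation 4.2 (latencies of 6 cycles, 5 neighbours, 22,558 cells, 14 + 4 + 5 + 14 + 14 = 51 tail cycles, 1.7 iterations per sample, 300,000 samples, 140 MHz) and one core, the function returns about 3040.68 s. Six cores would ideally bring this to roughly 507 s.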
4.2.5 Operation and Synthesis
Hardware operation is selected via a checkbox in the simulation setup
window. The block diagram of the system is shown in Figure 4.6,
illustrating the basic connections between major components of the
design. The combined hardware/software system operates as shown in
the flowchart in Figure 4.7.
There is some overhead involved in the frequent DMA transfers,
but any such overhead’s contributing factor is reduced as the number
of neighbours in a circuit increases (i.e. as the size of each transfer
increases). This hardware/software system is shown to work in sim-
ulation using the simplified host software simulator prescribed by the
Annapolis tools. The Xilinx place and route report shows that the
main arithmetic core uses 1834 slices (3037 LUTs, or 11% of the Vir-
tex 4) of the available hardware resources, and the rest of the control
and interface hardware consumes 1809 slices (1608 LUTs, or 12% of
the Virtex 4), indicating that up to 6 arithmetic cores may be used
simultaneously.
Figure 4.6: Block diagram of high level system architecture




As mentioned in Section 4.2.4, the predicted execution speed of the
hardware is expected to be on the same order of magnitude as the
software execution speed. However, upon actually using and testing the
hardware, the observed execution speed is roughly two orders of
magnitude slower. An 80,000 sample simulation of a majority gate with
an 8 cell wire on the output takes a reported 1 second to run to
completion in software, but 157 seconds in hardware, a very surprising
result. Despite this, the hardware is still capable of delivering an accurate result
for small circuits. Figure 5.1 shows a simple majority gate simulation
running in software and Figure 5.2 shows the same simulation done in
hardware, clearly illustrating the same result being obtained in both
cases, with some minor numerical discrepancy due to switching the
floating point precision level. This result verifies the basic function-
ality of the floating point pipeline developed. This and the analysis
showing its potential speed are encouraging results in terms of a proof
of concept design on which to base other designs. Additionally, some
notable improvements have been made to the QCADesigner software
during the course of this work.
5.1 Cause of Slowdown
Figure 5.1: Result for software run of a majority gate
The first effort to determine why the hardware simulation takes
so long is to apply gprof and see whether any of the new functions
called in the main simulation loop, such as IntWait, take an
excessive amount of time, which would indicate that the hardware rather
than the software is taking too long to finish the computation. This
does not yield any useful insight. The next step is to manually
instrument the code with the functions QueryPerformanceFrequency and
QueryPerformanceCounter. QueryPerformanceFrequency gives the
number of performance counter ticks that occur every second, and
QueryPerformanceCounter gives the current value of that counter.
Using these functions, it can be determined precisely how much time is
consumed executing certain sections of the code.
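The instrumentation approach can be sketched as follows. This is not the code used in the thesis. The helper names `read_ticks`, `ticks_to_us` and `time_call_us` are hypothetical, and the non-Windows branch is an added assumption so that the sketch also compiles on POSIX systems. On Windows it uses only QueryPerformanceCounter and QueryPerformanceFrequency.

```c
#ifndef _WIN32
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <stdint.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <time.h>
#endif

/* Read a monotonic tick counter and its frequency in ticks per second. */
static void read_ticks(int64_t *ticks, int64_t *ticks_per_sec)
{
#ifdef _WIN32
    LARGE_INTEGER c, f;
    QueryPerformanceCounter(&c);
    QueryPerformanceFrequency(&f);
    *ticks = c.QuadPart;
    *ticks_per_sec = f.QuadPart;
#else
    /* Fallback (assumption): POSIX monotonic clock, one tick per ns. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    *ticks = (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    *ticks_per_sec = 1000000000LL;
#endif
}

/* Convert a tick difference to microseconds. */
double ticks_to_us(int64_t start, int64_t end, int64_t ticks_per_sec)
{
    assert(ticks_per_sec > 0);
    return (double)(end - start) * 1e6 / (double)ticks_per_sec;
}

/* Time one call of fn(arg) and return the elapsed microseconds. */
double time_call_us(void (*fn)(void *), void *arg)
{
    int64_t t0, t1, freq;
    read_ticks(&t0, &freq);
    fn(arg);
    read_ticks(&t1, &freq);
    return ticks_to_us(t0, t1, freq);
}
```

Wrapping a single section of the simulation loop, such as the interrupt reset, in `time_call_us` gives a per-call figure that can be averaged over many cells to smooth out run-to-run variance.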
To begin with, it is determined that one cell in software takes roughly
600 ns to compute, with some variance depending on the specific run. In
hardware mode, the calculation time for a single cell is between 18 and
45 µs. Upon further analysis, this rather alarming increase in execution
time is determined to be caused by the interrupt reset procedures, of
which there are two, taking 10 to 20 µs each (usually about 14 µs). The
repeated interrupt reset, shown in listing 5.1, is a required procedure
given the architecture in use, because every time an interrupt is used,
it must be reset before being used again. The only way to avoid using
interrupts with this architecture would be to wait a fixed amount of
time for hardware operations to complete, but there is no good way to
estimate the completion time dynamically or wait for such brief precise
periods. There also is no way for the software to detect the logical
value assigned to the interrupt bit by the hardware, because the value
is latched at 1 when triggered until reset by the software.
Figure 5.2: Result for a hardware run of a majority gate
rc = WC_IntEnable(TestInfo.DeviceNum, FALSE); // reset interrupt
rc = WC_IntReset(TestInfo.DeviceNum);
rc = WC_IntEnable(TestInfo.DeviceNum, TRUE);
Listing 5.1: Interrupt reset procedure
Due to the large variance in these measurements, it is impossible
to say exactly how long the calculation would take without these in-
terrupt resets. An estimate based on the difference between total exe-
cution time and total interrupt reset time might come between 1 and
5 µs, low enough to achieve speedup with six pipelines. To be clear,
it is just the interrupt reset that takes that long; the interrupt wait
function itself takes a much shorter amount of time, indicating that the
hardware calculation is proceeding apace. This is consistent with the
determination in Section 4.2.4 that the main arithmetic core should be
able to calculate the results at the same rate as the software. If better
interfacing and control hardware can be developed, speedup may be
achieved when using multiple cores in parallel.
A small test program is created to test in simpler terms what dif-
ference the interrupt reset has on execution time. Two separate data
transfers are initiated in sequence, with an interrupt between them, and
another interrupt at the end, requiring a reset in between. With the
reset in place, the program takes 40 µs to execute, but simply waiting
out the short transfer without the reset brings this down to 1.5 µs. There was
no indication in any manual or simulation that interrupt resets would
take so long.
5.2 Hardware Reliability
In addition to the extra long execution time, there are significant re-
liability problems with the control hardware. Any circuit much larger
than 6 cells encounters significant difficulty in proceeding to the end
of simulation. One problem appears to be related to bad data being
reported back to the host, which could be due to problems on either
the input or output sides of the hardware. This causes the simulation
to constantly reiterate until it hits the reiteration limit, and results in a
flatlined simulation result. Another problem is related to the hardware
and software getting out of sync, causing interrupts to timeout, again
resulting in flatlined results. The inconsistent nature of these issues
suggests a problem related to timing, possibly a clock running too fast.
However, reconfiguring the hardware to run on the 40 MHz clock rather
than 140 MHz does not change this behaviour.
Further debugging is quite difficult. Because the programming in-
terface is on the PCI bus rather than a JTAG interface, it is impossible
to use the Xilinx ChipScope hardware analyzer to examine signal time-
lines. The closed form factor of the card also prevents examining any
signals using an oscilloscope or logic analyzer. The simulation works,
so that is not a debugging avenue to pursue. The only way to debug the
hardware is to request debug outputs from the hardware. But the times
these data can be requested without interfering with the standard oper-
ation of the hardware are very limited, making it nigh impossible to get
comprehensive debugging information. Because of the great difficulty
in debugging and the timing analysis showing a very low performance
ceiling for the given architecture, further development is abandoned.
5.3 QCADesigner Enhancements
The results of this work are not a complete write-off, however. Some
notable improvements to the QCADesigner software have been made.
First, an error was found in the stability checking method. The version
in current distribution has the following for a stability check.
1 // If any cells polarization has changed beyond this threshold
2 // then the entire circuit is assumed to have not converged.
3 stable = (fabs (new_polarization - old_polarization) <= tolerance) ;
Listing 5.2: Original stability check
The comment suggests that the sample should reiterate if any cell’s
polarization change goes beyond the threshold, but the way the code is
written, it only reiterates if the last cell’s polarization changes beyond
the threshold limit. The fix is very simple: changing line three of listing
5.2 to listing 5.3 forces any unstable cell to cause all future
stability checks to fail. The fix causes simulations to take roughly four
times as long, as expected, because samples reiterate more often. The
fix is also confirmed as needed by Konrad Walus, the QCADesigner
lead.
stable = (fabs (new_polarization - old_polarization) <= tolerance) && stable ;
Listing 5.3: Fixed stability check
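The difference between the two stability checks can be seen in a small self-contained C sketch. The function names and the array-based interface are hypothetical. QCADesigner performs the check inside its per-cell simulation loop rather than over plain arrays, and the sketch assumes the flag starts out true at the beginning of each pass over the cells.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Original behaviour: `stable` is overwritten for every cell, so only the
   last cell visited decides whether the sample reiterates. */
int stable_last_cell_only(const double *old_p, const double *new_p,
                          size_t n, double tolerance)
{
    int stable = 1;
    for (size_t i = 0; i < n; i++)
        stable = (fabs(new_p[i] - old_p[i]) <= tolerance);
    return stable;
}

/* Fixed behaviour: once any cell is out of tolerance, `stable` stays
   false for the remainder of the pass. */
int stable_all_cells(const double *old_p, const double *new_p,
                     size_t n, double tolerance)
{
    int stable = 1;
    for (size_t i = 0; i < n; i++)
        stable = (fabs(new_p[i] - old_p[i]) <= tolerance) && stable;
    return stable;
}
```

For three cells whose polarizations change by 0.5, 0.0 and 0.0005 with a tolerance of 0.001, the first function reports the circuit as stable because the last cell is within tolerance, while the second correctly reports it as unstable.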
The other improvement made to QCADesigner is an optimization
of the select_cells_in_radius function. In the current version
of QCADesigner, a cell’s neighbours are found by using the complex
conditional statement shown in listing 5.4.
if (sqrt ((QCAD_DESIGN_OBJECT (sorted_cells[i][j])->x - QCAD_DESIGN_OBJECT (cell)->x) *
          (QCAD_DESIGN_OBJECT (sorted_cells[i][j])->x - QCAD_DESIGN_OBJECT (cell)->x) +
          (QCAD_DESIGN_OBJECT (sorted_cells[i][j])->y - QCAD_DESIGN_OBJECT (cell)->y) *
          (QCAD_DESIGN_OBJECT (sorted_cells[i][j])->y - QCAD_DESIGN_OBJECT (cell)->y) +
          ((double) ABS (the_cells_layer - i) * layer_separation) *
          ((double) ABS (the_cells_layer - i) * layer_separation)) < world_radius)
  number_of_selected_cells++ ;
Listing 5.4: select_cells_in_radius original
This condition contains several arithmetic operations, including a
square root, which take time to execute. Together they compute the
distance between two cells, but that distance does not need to be
calculated precisely every time to achieve the desired
result. A precondition (shown in listing 5.5) can be added to test just
the horizontal and vertical components of the distance, and if either is
greater than the radius of effect, the detailed distance calculation can
be avoided.
if (((QCAD_DESIGN_OBJECT (sorted_cells[i][j])->x - QCAD_DESIGN_OBJECT (cell)->x) < world_radius) &&
    ((QCAD_DESIGN_OBJECT (sorted_cells[i][j])->y - QCAD_DESIGN_OBJECT (cell)->y) < world_radius))
Listing 5.5: select_cells_in_radius optimization
This modification reduces initialization time for large circuits by 37%.
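The idea of a cheap rejection test ahead of the full distance calculation can be sketched in standalone C. This is a variation for illustration, not the QCADesigner code: the function `within_radius` and its parameters are hypothetical. Unlike listing 5.5, it takes the absolute value of each component so that cells on either side are rejected, and it compares squared distances instead of calling a square root.

```c
#include <assert.h>
#include <math.h>

/* Returns 1 if the point (x, y), offset vertically by dz, lies strictly
   within `radius` of (cx, cy), and 0 otherwise. */
int within_radius(double x, double y, double cx, double cy,
                  double dz, double radius)
{
    assert(radius >= 0.0);
    double dx = x - cx;
    double dy = y - cy;
    /* Cheap rejection: if either planar component alone reaches the
       radius, the full distance cannot be smaller than the radius. */
    if (fabs(dx) >= radius || fabs(dy) >= radius)
        return 0;
    /* Full test on squared values; for a non-negative radius this is
       equivalent to comparing the square root against the radius. */
    return dx * dx + dy * dy + dz * dz < radius * radius;
}
```

Most cell pairs in a large circuit are far apart, so the two comparisons in the rejection test settle the majority of cases without any multiplication.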
5.4 Design Alternatives
Several options exist for pursuing alternate solutions to this design
problem. A variety of other platforms could be used which are larger
and better equipped, or different architectures could be explored.
5.4.1 New Architecture
One possible design architecture that could be implemented for future
research would store all cells’ polarizations on the Wildcard’s SRAM
and compute the entire length of the simulation in hardware, rather
than constantly exchanging polarization data and using repeated in-
terrupts. This would eliminate almost all communication overhead,
including most importantly the interrupt resets.
The storage structure of this new architecture would hold all cell
polarizations in the form of a spatial grid in SRAM, where every 36 bit
word stores the polarization and clock zone (34 bits of data) for the
cell at a particular location. The cell’s location would directly
correspond to the address in SRAM, and locations where there is no cell
would hold zeros. In the location encoding, each sequential address
represents the location at the next 10 nm increment. QCADesigner can
only place cells at 10 nm increments (10 nm is half of a cell’s length),
so no more location precision than this is needed. A new row is simply
defined by counting off the number of cells in the previous row. Finding
and retrieving neighbour polarization data would be accomplished by
simple rectangular address arithmetic; a given radius of effect would be
converted to an integer number of half-cell distances, and cells within
a square rather than a circle would be counted, simplifying the
mathematics at a little cost to accuracy. There is enough SRAM on the
Wildcard 4 to hold a circuit 4500 nm × 4500 nm, which is enough for
4 CLBs. Most of the data in SRAM would be zeros, so larger circuits
could be stored if some form of compression can be developed.
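The proposed grid addressing can be illustrated with a short C sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the grid is taken to be row-major, and the 34 data bits are split as a 32 bit polarization pattern plus a 2 bit clock zone. The thesis does not specify that split.

```c
#include <assert.h>
#include <stdint.h>

#define GRID_STEP_NM 10          /* QCADesigner placement increment */

/* Map a cell position (in nm) to a word address in a row-major grid
   that is `cols` locations wide. */
uint32_t grid_address(uint32_t x_nm, uint32_t y_nm, uint32_t cols)
{
    assert(x_nm % GRID_STEP_NM == 0 && y_nm % GRID_STEP_NM == 0);
    uint32_t col = x_nm / GRID_STEP_NM;
    uint32_t row = y_nm / GRID_STEP_NM;
    assert(col < cols);
    return row * cols + col;
}

/* Pack a 32 bit polarization pattern and a 2 bit clock zone into the
   low 34 bits of a 36 bit SRAM word (held here in a uint64_t). */
uint64_t pack_word(uint32_t polarization_bits, unsigned clock_zone)
{
    assert(clock_zone < 4);
    return ((uint64_t)clock_zone << 32) | polarization_bits;
}

/* Count the in-bounds grid locations inside the square of effect around
   (row, col), excluding the centre location itself. */
uint32_t count_square_neighbours(uint32_t row, uint32_t col,
                                 uint32_t rows, uint32_t cols,
                                 uint32_t radius_steps)
{
    uint32_t count = 0;
    int64_t r0 = (int64_t)row - radius_steps, r1 = (int64_t)row + radius_steps;
    int64_t c0 = (int64_t)col - radius_steps, c1 = (int64_t)col + radius_steps;
    for (int64_t r = r0; r <= r1; r++) {
        for (int64_t c = c0; c <= c1; c++) {
            if (r < 0 || c < 0 || r >= rows || c >= cols)
                continue;        /* outside the stored grid */
            if (r == row && c == col)
                continue;        /* skip the cell itself */
            count++;
        }
    }
    return count;
}
```

A 4500 nm × 4500 nm circuit corresponds to a 450 × 450 grid under this scheme, and a 50 nm radius of effect becomes 5 grid steps, so an interior cell has up to 120 candidate neighbour addresses to read.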
Kink energies would be stored in much the same way, but the delivery
of gamma values would have to be modified to be delivered in bulk
rather than per cell, and results would still be sent back to the host as
they are calculated using DMA. Extra software development is required
to sort the cells into the form which they will be stored in, and to change
the calculation of kink energies to consider the new square of effect.
5.4.2 Other Platforms
Other hardware platforms may also be used to great effect. For exam-
ple, a top of the line Virtex 6 card has over 85,000 logic slices, each
containing four 6-input LUTs [53]. The Virtex 4 used in this thesis only
has two 4-input LUTs for each of just 15,000 slices. A Virtex 6 also has
PCI Express integration, which would facilitate communication with a
host PC, as well as a faster clock speed. With the much larger amount
of resources, a design based on a Virtex 6 may be able to support up to
100 floating point pipelines in parallel, moving one step closer to being
able to implement a cellular automata style computational architecture.
Many hardware platforms also offer on-board processors, or have
the ability to generate soft-core processors on the FPGA. This type of
hardware may allow the creation of a hardware/software design with




Conclusion and Future Work
The work done in this thesis has shown the potential for a hardware
coprocessor to accelerate the simulations needed in QCA research. Al-
though the system designed here did not succeed in speeding up the
execution of QCA simulations, this has more to do with interrupt resets
taking much longer than expected than any other aspect of the design.
It is possible to achieve the desired speedup using the arithmetic core
presented here if data can be fed to it fast enough. Avoiding the long
interrupt resets, either by avoiding interrupts altogether or by moving to a
different hardware platform that resets them more quickly, could help provide
the necessary throughput. An implementation of this application that
achieves speedup would be a significant achievement for FPGAs, show-
ing that floating point applications may be sped up on easily available
reconfigurable hardware using hardware/software design. This would
be attractive to the realm of scientific research because it combines
floating point accuracy with the relative ease of use and development
of hardware/software design.
The implementation of this alternate architecture is beyond the
scope of this thesis, but is a prime candidate for future research. Alter-
nate hardware platforms may also be sought that have better interrupt
routines or larger FPGAs, and the QCADesigner coherence vector sim-
ulation engine may also be accelerated. The vein of this research may
also be applied to broader applications such as the physical side of QCA
research, advancing more than just QCA architecture development.
Bibliography
[1] Wolfgang Arden, Patrick Cogez, Mart Graef, Reinhard Mahnkopf,
Hidemi Ishiuchi, Toshihiko Osada, JooTae Moon, JaeSung Roh,
Carlos H Diaz, Burn Lin, Pushkar Apte, Bob Doering, and Paolo
Gargini. 2009 international technology roadmap for semiconduc-
tors. Technical report, IEEE Computer Society, 2009.
[2] Gary H. Bernstein. Quantum-dot cellular automata: computing
by field polarization. In DAC ’03: Proceedings of the 40th annual
Design Automation Conference, pages 268–273. ACM, 2003.
[3] E. P. Blair and C. S. Lent. Quantum-dot cellular automata: an
architecture for molecular computing. In Simulation of Semicon-
ductor Processes and Devices, 2003. SISPAD 2003. International
Conference on, pages 14–18, 2003.
[4] Enrique Blair, Eric Yost, and Craig Lent. Power dissipation
in clocking wires for clocked molecular quantum-dot cellular au-
tomata. Journal of Computational Electronics, 9(1):49–55, March
2010.
[5] B. Bosi, G. Bois, and Y. Savaria. Reconfigurable pipelined 2-D
convolvers for fast digital signal processing. IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, 7(3):299–308,
September 1999.
[6] Owen Callanan, Andy Nisbet, Emre Ozer, James Sexton, and
David Gregg. FPGA implementation of a lattice quantum chro-
modynamics algorithm using logarithmic arithmetic. In IPDPS
’05: Proceedings of the 19th IEEE International Parallel and Dis-
tributed Processing Symposium (IPDPS’05) - Workshop 3, page
146.2, Washington, DC, USA, 2005. IEEE Computer Society.
[7] R. P. Cowburn and M. E. Welland. Room temperature magnetic
quantum cellular automata. Science, 287(5457):1466–1468, 2000.
[8] Mike Cowlishaw, editor. 754-2008 IEEE Standard for Floating-
Point Arithmetic. IEEE Computer Society, 445 Hoes Lane, P,
2008.
[9] M. Crocker, Xiaobo Sharon Hu, M. Niemier, Minjun Yan, and
G. Bernstein. PLAs in quantum-dot cellular automata. IEEE
Trans. Nanotechnol., 7(3):376–386, May 2008.
[10] Yong Dou, S. Vassiliadis, G. K. Kuzmanov, and G. N. Gaydadjiev.
64-bit floating-point FPGA matrix multiplication. In Proceedings
of the 2005 ACM/SIGDA 13th international symposium on Field-
programmable gate arrays, FPGA ’05, pages 86–95, New York, NY,
USA, 2005. ACM.
[11] E. N. Ganesh, Lal Kishore, and M. J. S. Rangachar. Study of
complex gate structures in quantum cellular automata technology
for FPGA applications. In Proceedings of the IET-UK Interna-
tional Conference on Information and Communication Technology
in Electrical Sciences (ICTES 2007), pages 789–794, December
2007.
[12] Akila Gothandaraman, Gregory D. Peterson, G. Lee Warren,
Robert J. Hinde, and Robert J. Harrison. FPGA acceleration of
a quantum monte carlo application. Parallel Computing, 34(4-
5):278–291, 2008.
[13] Yongfeng Gu, Tom Van Court, and Martin C. Herbordt. Improved
interpolation and system integration for FPGA-based molecular
dynamics simulations. In Proceedings of the 2006 International
Conference on Field Programmable Logic and Applications (FPL),
pages 1–8. IEEE, August 2006.
[14] Yongfeng Gu and Martin C. Herbordt. FPGA-based multi-
grid computation for molecular dynamics simulations. In FCCM
’07: Proceedings of the 15th Annual IEEE Symposium on
Field-Programmable Custom Computing Machines, pages 117–126,
Washington, DC, USA, 2007. IEEE Computer Society.
[15] Yongfeng Gu, Tom VanCourt, and Martin C. Herbordt. Explicit
design of FPGA-based coprocessors for short-range force computa-
tions in molecular dynamics simulations. Parallel Comput., 34(4-
5):261–277, 2008.
[16] Scott Hauck and André DeHon, editors. Reconfigurable Computing.
Systems on Silicon. Morgan Kaufmann, 30 Corporate Drive, Suite
400, Burlington, MA 01803, 2008.
[17] K. S. Hemmert and K. D. Underwood. An Analysis of the Double-
Precision Floating-Point FFT on FPGAs. In FCCM ’05: Proceed-
ings of the 13th Annual IEEE Symposium on Field-Programmable
Custom Computing Machines (FCCM’05), pages 171–180, Wash-
ington, DC, USA, 2005. IEEE Computer Society.
[18] Jing Huang, Mariam Momenzadeh, and Fabrizio Lombardi. Defect
tolerance of qca tiles. In DATE ’06: Proceedings of the conference
on Design, automation and test in Europe, pages 774–779, 3001
Leuven, Belgium, 2006. European Design and Automation
Association.
[19] Brad Hutchings, Peter Bellows, Joseph Hawkins, Scott Hem-
mert, Brent Nelson, and Mike Rytting. A CAD suite for high-
performance FPGA design. In FCCM ’99: Proceedings of the
Seventh Annual IEEE Symposium on Field-Programmable Custom
Computing Machines, page 12, Washington, DC, USA, 1999. IEEE
Computer Society.
[20] Nachiket Kapre and André DeHon. Accelerating SPICE Model-
Evaluation using FPGAs. Field-Programmable Custom Computing
Machines, Annual IEEE Symposium on, 0:37–44, 2009.
[21] V. Kindratenko and D. Pointer. A case study in porting a pro-
duction scientific supercomputing application to a reconfigurable
computer. In FCCM ’06: Proceedings of the 14th Annual IEEE
Symposium on Field-Programmable Custom Computing Machines,
pages 13–22, Washington, DC, USA, April 2006. IEEE Computer
Society.
[22] R. K. Kummamuru, A. O. Orlov, R. Ramasubramaniam, C. S.
Lent, G. H. Bernstein, and G. L. Snider. Operation of a quantum-
dot cellular automata (qca) shift register and analysis of errors.
IEEE Transactions on Electron Devices, 50(9):1906–1913, August
2003.
[23] Timothy D. Lantz and Eric R. Peskin. A QCA implementation
of a configurable logic block for an FPGA. In Proceedings of the
Third International Conference on Reconfigurable Computing and
FPGAs (ReConFig 2006), pages 132–141, September 2006.
[24] C. S. Lent, P. D. Tougaw, W. Porod, and G. H. Bernstein. Quan-
tum cellular automata. Nanotechnology, 4(1):49–57, January 1993.
[25] Craig S. Lent, Sarah E. Frost, and Peter M. Kogge. Reversible
computation with quantum-dot cellular automata (qca). In CF
’05: Proceedings of the 2nd conference on Computing frontiers,
page 403, New York, NY, USA, 2005. ACM.
[26] Craig S. Lent, Mo Liu, and Yuhui Lu. Bennett clocking of
quantum-dot cellular automata and the limits to binary logic scal-
ing. Nanotechnology, 17(16):4240+, August 2006.
[27] Antonio Lopes and George Constantinides. A High Throughput
FPGA-Based Floating Point Conjugate Gradient Implementation.
In Roger Woods, Katherine Compton, Christos Bouganis, and Pe-
dro Diniz, editors, Reconfigurable Computing: Architectures, Tools
and Applications, volume 4943 of Lecture Notes in Computer Sci-
ence, chapter 10, pages 75–86. Springer Berlin / Heidelberg, Berlin,
Heidelberg, 2008.
[28] Yuhui Lu, Mo Liu, and Craig Lent. Molecular quantum-dot cellular
automata: From molecular structure to circuit dynamics. Journal
of Applied Physics, 102(3):034311, 2007.
[29] Josh Model and Martin C. Herbordt. Discrete event simulation
of molecular dynamics with configurable logic. In Koen Bertels,
Walid A. Najjar, Arjan J. van Genderen, and Stamatis Vassil-
iadis, editors, FPL 2007, International Conference on Field Pro-
grammable Logic and Applications, Amsterdam, The Netherlands,
27-29 August 2007, pages 151–158. IEEE, 2007.
[30] Tamer S. Mohamed and Wael Badawy. Integrated Hardware-
Software Platform for Image Processing Applications. System-
on-Chip for Real-Time Applications, International Workshop on,
0:145–148, 2004.
[31] Gordon E. Moore. Cramming more components onto integrated
circuits. Electronics, 38(8):114 ff., April 1965.
[32] G. R. Morris and V. K. Prasanna. An FPGA-Based Floating-
Point Jacobi Iterative Solver. In Parallel Architectures, Algorithms
and Networks, 2005. ISPAN 2005. Proceedings. 8th International
Symposium on, pages 420–427, 2005.
[33] Michael Thaddeus Niemier, Arun Francis Rodrigues, and Peter M.
Kogge. A potentially implementable FPGA for quantum dot cellu-
lar automata. In Proceedings of the First Workshop on Non-Silicon
Computation (NSC-1), pages 38–45, 2002.
[34] A. O. Orlov, I. Amlani, G. Toth, C. S. Lent, G. H. Bern-
stein, and G. L. Snider. Experimental demonstration of a binary
wire for quantum-dot cellular automata. Applied Physics Letters,
74(19):2875–2877, 1999.
[35] Alexei O. Orlov, Ravi K. Kummamuru, Rajagopal Ramasubrama-
niam, Geza Toth, Craig S. Lent, Gary H. Bernstein, and Gregory L.
Snider. Experimental demonstration of a latch in clocked quantum-
dot cellular automata. Applied Physics Letters, 78(11):1625–1627,
2001.
[36] Marco Ottavi, Luca Schiano, Fabrizio Lombardi, and Douglas
Tougaw. HDLQ: A HDL environment for QCA design. J. Emerg.
Technol. Comput. Syst., 2(4):243–261, 2006.
[37] S. Perri, M. Lanuzza, P. Corsonello, and G. Cocorullo. A
high-performance fully reconfigurable FPGA-based 2D convolu-
tion processor. Microprocessors and Microsystems, 29(8-9):381–
391, November 2005.
[38] Anders K. Roger and Roger D. Hersch. Designing a Halftoning Co-
processor. In 8th Eurographics Workshop on Graphics Hardware.
Swiss Federal Institute of Technology, September 1993.
[39] R. Scrofano, M. Gokhale, F. Trouw, and V. K. Prasanna. Hard-
ware/software approach to molecular dynamics on reconfigurable
computers. In FCCM ’06: Proceedings of the 14th Annual IEEE
Symposium on Field-Programmable Custom Computing Machines,
volume 0, pages 23–34, Los Alamitos, CA, USA, April 2006. IEEE
Computer Society.
[40] Ronald Scrofano, Maya B. Gokhale, Frans Trouw, and Viktor K.
Prasanna. Accelerating molecular dynamics simulations with
reconfigurable computers. IEEE Trans. Parallel Distrib. Syst.,
19(6):764–778, 2008.
[41] B. Sen, M. Dalui, and B. K. Sikdar. Introducing universal qca logic
gate for synthesizing symmetric functions with minimum wire-
crossings. In ICWET ’10: Proceedings of the International Con-
ference and Workshop on Emerging Trends in Technology, pages
828–833, New York, NY, USA, 2010. ACM.
[42] G. L. Snider, A. O. Orlov, I. Amlani, X. Zuo, G. H. Bernstein, C. S.
Lent, J. L. Merz, and W. Porod. Quantum-dot cellular automata.
In Papers from the 45th National Symposium of the American Vac-
uum Society, volume 17, pages 1394–1398. AVS, 1999.
[43] Yong Tang, Alexei O. Orlov, Gregory L. Snider, and Patrick J.
Fay. Radio frequency operation of clocked quantum-dot cellular
automata latch. Applied Physics Letters, 95(19):193109+, 2009.
[44] Himanshu Thapliyal and Nagarajan Ranganathan. Conservative
qca gate (cqca) for designing concurrently testable molecular qca
circuits. In VLSID ’09: Proceedings of the 2009 22nd Interna-
tional Conference on VLSI Design, pages 511–516, Washington,
DC, USA, 2009. IEEE Computer Society.
[45] Chia-Ching Tung. Implementation of multi-clb designs using
quantum-dot cellular automata. Master’s thesis, Rochester Insti-
tute of Technology, Rochester, NY, February 2010.
[46] Chia-Ching Tung, Ruchi B. Rungta, and Eric R. Peskin. Sim-
ulation of a QCA-based CLB and a multi-CLB application.
In Proceedings of the 2009 International Conference on Field-
Programmable Technology (FPT’09), pages 62–69. IEEE, Decem-
ber 2009.
[47] Keith Underwood. Fpgas vs. cpus: trends in peak floating-point
performance. In FPGA ’04: Proceedings of the 2004 ACM/SIGDA
12th international symposium on Field programmable gate arrays,
pages 171–180, New York, NY, USA, 2004. ACM.
[48] K. Walus. QCADesigner — Microsystems and Nanotechnology
Group (MiNa). http://www.mina.ubc.ca/qcadesigner, June 2009.
[49] K. Walus, T. J. Dysart, G. A. Jullien, and R. A. Budiman.
QCADesigner: a rapid design and simulation tool for quantum-dot
cellular automata. IEEE Trans. Nanotechnol., 3(1):26–31, March
2004.
[50] K. Walus, G. Schulhof, Mazur, and G. A. Jullien. Simple 4-bit pro-
cessor based on quantum-dot cellular automata (QCA). In ASAP
’05: Proceedings of the 2005 IEEE International Conference on
Application-Specific Systems, Architecture Processors, pages 288–
293, Washington, DC, USA, 2005. IEEE Computer Society.
[51] Yuliang Wang and M. Lieberman. Thermodynamic behavior of
molecular-scale quantum-dot cellular automata (QCA) wires and
logic devices. IEEE Trans. Nanotechnol., 3(3):368–376, September
2004.
[52] Tongquan Wei, Kaijie Wu, Ramesh Karri, and Alex Orailoglu.
Fault tolerant quantum cellular array (qca) design using triple
modular redundancy with shifted operands. In ASP-DAC ’05:
Proceedings of the 2005 Asia and South Pacific Design Automation
Conference, pages 1192–1195, New York, NY, USA, 2005. ACM.




run_bistable_simulation C Code Listing
The following is the C code for the run_bistable_simulation function
from QCADesigner.
//-------------------------------------------------------------------//
// -- this is the main simulation procedure -- //
//-------------------------------------------------------------------//
simulation_data *run_bistable_simulation (int SIMULATION_TYPE, DESIGN *design, bistable_OP *options, VectorTable *pvt)
{
  int i, j, k, l, total_cells = 0;
  int icLayers, icCellsInLayer;
  time_t start_time, end_time;
  simulation_data *sim_data = NULL;
  // optimization variables //
  int number_of_cell_layers = 0, *number_of_cells_in_layer = NULL;
  QCADCell ***sorted_cells = NULL;
  double clock_shift = (options->clock_high + options->clock_low) / 2 + options->clock_shift;
  double clock_prefactor = (options->clock_high - options->clock_low) * options->clock_amplitude_factor;
  double four_pi_over_number_samples = 4.0 * PI / (double)options->number_of_samples;
  double two_pi_over_number_samples = 2.0 * PI / (double)options->number_of_samples;
  int idxMasterBitOrder = -1;
  int max_iterations_per_sample = ((bistable_OP *)options)->max_iterations_per_sample;
  BUS_LAYOUT_ITER bli;
#ifdef REDUCE_DEREF
  // For dereference reduction
  int sim_data_number_samples = 0, pvt_vectors_icUsed = 0,
      design_bus_layout_outputs_icUsed = 0, design_bus_layout_inputs_icUsed = 0, pvt_inputs_icUsed = 0;
  int number_of_cells_in_current_layer = 0;
  EXP_ARRAY *pvt_inputs = NULL;
  EXP_ARRAY *pvt_vectors = NULL;
  EXP_ARRAY *design_bus_layout_inputs = NULL;
  EXP_ARRAY *design_bus_layout_outputs = NULL;
  BUS_LAYOUT *design_bus_layout = NULL;
#endif
  // For randomization
  int Nix, Nix1, idxCell1, idxCell2;
  QCADCell *swap = NULL;
  // -- these used to be inside run_bistable_iteration -- //
  int q, iteration = 0;
  int stable = FALSE;
  double old_polarization;
  double new_polarization;
  double tolerance = ((bistable_OP *)options)->convergence_tolerance;
  double polarization_math;
  bistable_model *current_cell_model = NULL;
  QCADCell *cell;

  STOP_SIMULATION = FALSE;

  // -- get the starting time for the simulation -- //
  if ((start_time = time (NULL)) < 0)
    fprintf (stderr, "Could not get start time\n");

  // Create per-layer cell arrays to be used by the engine
  simulation_inproc_data_new (design, &number_of_cell_layers, &number_of_cells_in_layer, &sorted_cells);

  for (i = 0; i < number_of_cell_layers; i++)
  {
#ifdef REDUCE_DEREF
    number_of_cells_in_current_layer = number_of_cells_in_layer[i];
    for (j = 0; j < number_of_cells_in_current_layer; j++)
#else
    for (j = 0; j < number_of_cells_in_layer[i]; j++)
#endif
    {
      // attach the model parameters to each of the simulation cells //
      current_cell_model = g_malloc0 (sizeof (bistable_model));
      sorted_cells[i][j]->cell_model = current_cell_model;

      // -- Clear the model pointers so they are not dangling -- //
      current_cell_model->neighbours = NULL;
      current_cell_model->Ek = NULL;

      // -- set polarization in cell model for fixed cells since they are set with actual dot charges by the user -- //
      if (QCAD_CELL_FIXED == sorted_cells[i][j]->cell_function)
        current_cell_model->polarization = qcad_cell_calculate_polarization (sorted_cells[i][j]);




  // if we are performing a vector table simulation we consider only the activated inputs //
  if (SIMULATION_TYPE == VECTOR_TABLE)
  {
    for (Nix = 0; Nix < pvt->inputs->icUsed; Nix++)
      if (!exp_array_index_1d (pvt->inputs, VT_INPUT, Nix).active_flag)
        exp_array_index_1d (pvt->inputs, VT_INPUT, Nix).input->cell_function = QCAD_CELL_NORMAL;
  }

  // write message to the command history window //
  command_history_message (_("Simulation found %d inputs %d outputs %d total cells\n"), design->bus_layout->inputs->icUsed, design->bus_layout->outputs->icUsed, total_cells);

  command_history_message (_("Starting initialization\n"));
  set_progress_bar_visible (TRUE);
  set_progress_bar_label (_("Bistable simulation:"));

  // -- Initialize the simulation data structure -- //
  sim_data = g_malloc0 (sizeof (simulation_data));
  sim_data->number_of_traces = design->bus_layout->inputs->icUsed + design->bus_layout->outputs->icUsed;
  sim_data->number_samples = options->number_of_samples;
  sim_data->trace = g_malloc0 (sizeof (struct TRACEDATA) * sim_data->number_of_traces);

  // create and initialize the inputs into the sim data structure //
  for (i = 0; i < design->bus_layout->inputs->icUsed; i++)
  {
    sim_data->trace[i].data_labels = g_strdup (qcad_cell_get_label (QCAD_CELL (exp_array_index_1d (design->bus_layout->inputs, BUS_LAYOUT_CELL, i).cell)));
    sim_data->trace[i].drawtrace = TRUE;
    sim_data->trace[i].trace_function = QCAD_CELL_INPUT;
    sim_data->trace[i].data = g_malloc0 (sizeof (double) * sim_data->number_samples);
  }

  // create and initialize the outputs into the sim data structure //
  for (i = 0; i < design->bus_layout->outputs->icUsed; i++)
  {
    sim_data->trace[i + design->bus_layout->inputs->icUsed].data_labels = g_strdup (qcad_cell_get_label (QCAD_CELL (exp_array_index_1d (design->bus_layout->outputs, BUS_LAYOUT_CELL, i).cell)));
    sim_data->trace[i + design->bus_layout->inputs->icUsed].drawtrace = TRUE;
    sim_data->trace[i + design->bus_layout->inputs->icUsed].trace_function = QCAD_CELL_OUTPUT;
    sim_data->trace[i + design->bus_layout->inputs->icUsed].data = g_malloc0 (sizeof (double) * sim_data->number_samples);
  }

  // create and initialize the clock data //
  sim_data->clock_data = g_malloc0 (sizeof (struct TRACEDATA) * 4);

  for (i = 0; i < 4; i++)
  {
    sim_data->clock_data[i].data_labels = g_strdup_printf ("CLOCK %d", i);
    sim_data->clock_data[i].drawtrace = 1;
    sim_data->clock_data[i].trace_function = QCAD_CELL_FIXED; // Abusing the notation here

    sim_data->clock_data[i].data = g_malloc0 (sizeof (double) * sim_data->number_samples);

    if (SIMULATION_TYPE == EXHAUSTIVE_VERIFICATION)
      for (j = 0; j < sim_data->number_samples; j++)
      {
        sim_data->clock_data[i].data[j] = clock_prefactor * cos (((double)(1 << design->bus_layout->inputs->icUsed)) * (double)j * four_pi_over_number_samples - PI * i / 2) + clock_shift;
        sim_data->clock_data[i].data[j] = CLAMP (sim_data->clock_data[i].data[j], options->clock_low, options->clock_high);
      }
    else
    // if (SIMULATION_TYPE == VECTOR_TABLE)
      for (j = 0; j < sim_data->number_samples; j++)
      {
        sim_data->clock_data[i].data[j] = clock_prefactor * cos (((double)pvt->vectors->icUsed) * (double)j * two_pi_over_number_samples - PI * i / 2) + clock_shift;
        sim_data->clock_data[i].data[j] = CLAMP (sim_data->clock_data[i




  // -- refresh all the kink energies to all the cells neighbours within the radius of effect -- //
  bistable_refresh_all_Ek (number_of_cell_layers, number_of_cells_in_layer, sorted_cells, options);
  // this function takes up pretty much the whole initialization time

  // randomize the cells in the design so as to minimize any numerical problems associated //
  // with having cells simulated in some predefined order. //
  // randomize the order in which the cells are simulated //
  // if (options->randomize_cells)
  // for each layer ...
  for (Nix = 0; Nix < number_of_cell_layers; Nix++)
    // ...perform as many swaps as there are cells therein
    for (Nix1 = 0; Nix1 < number_of_cells_in_layer[Nix]; Nix1++)
    {
      idxCell1 = rand () % number_of_cells_in_layer[Nix];
      idxCell2 = rand () % number_of_cells_in_layer[Nix];

      swap = sorted_cells[Nix][idxCell1];
      sorted_cells[Nix][idxCell1] = sorted_cells[Nix][idxCell2];
      sorted_cells[Nix][idxCell2] = swap;
    }

  // -- get and print the total initialization time -- //
  if ((end_time = time (NULL)) < 0)
    fprintf (stderr, "Could not get end time\n");

  command_history_message ("Total initialization time: %g s\n", (double)(end_time - start_time));

  command_history_message ("Starting Simulation\n");

  set_progress_bar_fraction (0.0);

  // perform the iterations over all samples //
#ifdef REDUCE_DEREF
  // Dereference some structures now so we don't do it over and over in the loop
  sim_data_number_samples = sim_data->number_samples;
  pvt_inputs = pvt->inputs;
  pvt_inputs_icUsed = pvt_inputs->icUsed;
  pvt_vectors = pvt->vectors;
  pvt_vectors_icUsed = pvt->vectors->icUsed;
  design_bus_layout = design->bus_layout;
  design_bus_layout_inputs = design_bus_layout->inputs;
  design_bus_layout_inputs_icUsed = design_bus_layout_inputs->icUsed;
  design_bus_layout_outputs = design_bus_layout->outputs;
  design_bus_layout_outputs_icUsed = design_bus_layout_outputs->icUsed;
#else
#define sim_data_number_samples sim_data->number_samples
#define pvt_inputs pvt->inputs
#define pvt_inputs_icUsed pvt_inputs->icUsed
#define pvt_vectors pvt->vectors
#define pvt_vectors_icUsed pvt->vectors->icUsed
#define design_bus_layout design->bus_layout
#define design_bus_layout_inputs design_bus_layout->inputs
#define design_bus_layout_inputs_icUsed design_bus_layout_inputs->icUsed
#define design_bus_layout_outputs design_bus_layout->outputs
#define design_bus_layout_outputs_icUsed design_bus_layout_outputs->icUsed
#endif
  for (j = 0; j < sim_data_number_samples; j++)
  {
    if (j % 100 == 0)
    {
      // write the completion percentage to the command history window //
      set_progress_bar_fraction ((float)j / (float)sim_data_number_samples);
      // redraw the design if the user wants it to appear animated //
      if (options->animate_simulation)
      {
        // update the charges to reflect the polarizations so that they can be animated //
        for (icLayers = 0; icLayers < number_of_cell_layers; icLayers++)
        {
#ifdef REDUCE_DEREF
          number_of_cells_in_current_layer = number_of_cells_in_layer[icLayers];
          for (icCellsInLayer = 0; icCellsInLayer < number_of_cells_in_current_layer; icCellsInLayer++)
#else
          for (icCellsInLayer = 0; icCellsInLayer < number_of_cells_in_layer[icLayers]; icCellsInLayer++)
#endif
            qcad_cell_set_polarization (sorted_cells[icLayers][icCellsInLayer], ((bistable_model *)sorted_cells[icLayers][icCellsInLayer]->cell_model)->polarization);
        }
#ifdef DESIGNER
        redraw_async (NULL);
        gdk_flush ();




    // -- for each of the (VECTOR_TABLE => active?) inputs -- //
    if (EXHAUSTIVE_VERIFICATION == SIMULATION_TYPE)
      for (idxMasterBitOrder = 0, design_bus_layout_iter_first (design_bus_layout, &bli, QCAD_CELL_INPUT, &i); i > -1; design_bus_layout_iter_next (&bli, &i), idxMasterBitOrder++)
        ((bistable_model *)exp_array_index_1d (design_bus_layout_inputs, BUS_LAYOUT_CELL, i).cell->cell_model)->polarization =
          sim_data->trace[i].data[j] = (-1 * sin (((double)(1 << idxMasterBitOrder)) * (double)j * FOUR_PI / (double)sim_data_number_samples) > 0) ? 1 : -1;
    else
    // if (VECTOR_TABLE == SIMULATION_TYPE)
      for (design_bus_layout_iter_first (design_bus_layout, &bli, QCAD_CELL_INPUT, &i); i > -1; design_bus_layout_iter_next (&bli, &i))
        if (exp_array_index_1d (pvt_inputs, VT_INPUT, i).active_flag)
          ((bistable_model *)exp_array_index_1d (pvt_inputs, VT_INPUT, i).input->cell_model)->polarization =
            sim_data->trace[i].data[j] = exp_array_index_2d (pvt_vectors, gboolean, (j * pvt_vectors_icUsed) / sim_data_number_samples, i) ? 1 : -1;

    // randomize the order in which the cells are simulated to try and minimize numerical errors
    // associated with the imposed simulation order.
    if (options->randomize_cells)
      // for each layer ...
      for (Nix = 0; Nix < number_of_cell_layers; Nix++)
      {
        // ...perform as many swaps as there are cells therein
#ifdef REDUCE_DEREF
        number_of_cells_in_current_layer = number_of_cells_in_layer[Nix];
        for (Nix1 = 0; Nix1 < number_of_cells_in_current_layer; Nix1++)
#else
        for (Nix1 = 0; Nix1 < number_of_cells_in_layer[Nix]; Nix1++)
#endif
        {
#ifdef REDUCE_DEREF
          idxCell1 = rand () % number_of_cells_in_current_layer;
          idxCell2 = rand () % number_of_cells_in_current_layer;
#else
          idxCell1 = rand () % number_of_cells_in_layer[Nix];
          idxCell2 = rand () % number_of_cells_in_layer[Nix];
#endif

          swap = sorted_cells[Nix][idxCell1];
          sorted_cells[Nix][idxCell1] = sorted_cells[Nix][idxCell2];




    // -- run the iteration with the given clock value -- //
    // -- iterate until the entire design has stabilized -- //
    iteration = 0;
    stable = FALSE;
    while (!stable && iteration < max_iterations_per_sample)
    {
      iteration++;
      // -- assume that the circuit is stable -- //
      stable = TRUE;

      for (icLayers = 0; icLayers < number_of_cell_layers; icLayers++)
      {
#ifdef REDUCE_DEREF
        number_of_cells_in_current_layer = number_of_cells_in_layer[icLayers];
        for (icCellsInLayer = 0; icCellsInLayer < number_of_cells_in_current_layer; icCellsInLayer++)
#else
        for (icCellsInLayer = 0; icCellsInLayer < number_of_cells_in_layer[icLayers]; icCellsInLayer++)
#endif
        {
          cell = sorted_cells[icLayers][icCellsInLayer];

          if (!((QCAD_CELL_INPUT == cell->cell_function) ||
                (QCAD_CELL_FIXED == cell->cell_function)))
          {
            current_cell_model = ((bistable_model *)cell->cell_model);
            old_polarization = current_cell_model->polarization;
            polarization_math = 0;

            for (q = 0; q < current_cell_model->number_of_neighbours; q++)
              polarization_math += (current_cell_model->Ek[q] * ((bistable_model *)current_cell_model->neighbours[q]->cell_model)->polarization);

            // math = math / 2 * gamma
            polarization_math /= (2.0 * sim_data->clock_data[cell->cell_options.clock].data[j]);

            // -- calculate the new cell polarization -- //
            // if math < 0.05 then math/sqrt(1+math^2) ~= math with error <= 4e-5
            // if math > 100 then math/sqrt(1+math^2) ~= +-1 with error <= 5e-5
            new_polarization =
              (polarization_math > 1000.0) ? 1 :
              (polarization_math < -1000.0) ? -1 :
              (fabs (polarization_math) < 0.001) ? polarization_math :
              polarization_math / sqrt (1 + polarization_math * polarization_math);

            // -- set the polarization of this cell -- //
            current_cell_model->polarization = new_polarization;

            // If any cells polarization has changed beyond this threshold
            // then the entire circuit is assumed to have not converged.
            stable = (fabs (new_polarization - old_polarization) <=







    if (VECTOR_TABLE == SIMULATION_TYPE)
      for (design_bus_layout_iter_first (design_bus_layout, &bli, QCAD_CELL_INPUT, &i); i > -1; design_bus_layout_iter_next (&bli, &i))
        if (!exp_array_index_1d (pvt_inputs, VT_INPUT, i).active_flag)
          sim_data->trace[i].data[j] = ((bistable_model *)exp_array_index_1d (pvt_inputs, VT_INPUT, i).input->cell_model)->polarization;

    // -- collect all the output data from the simulation -- //
    for (design_bus_layout_iter_first (design_bus_layout, &bli, QCAD_CELL_OUTPUT, &i); i > -1; design_bus_layout_iter_next (&bli, &i)) {
      sim_data->trace[design_bus_layout_inputs_icUsed + i].data[j] = ((bistable_model *)exp_array_index_1d (design_bus_layout_outputs, BUS_LAYOUT_CELL, i).cell->cell_model)->polarization;
    }
    // -- if the user wants to stop the simulation then exit. -- //
    if (TRUE == STOP_SIMULATION)
      j = sim_data_number_samples;
  }// for number of samples

  // Free the neighbours and Ek array introduced by this simulation //
  for (k = 0; k < number_of_cell_layers; k++)
  {
#ifdef REDUCE_DEREF
    number_of_cells_in_current_layer = number_of_cells_in_layer[k];
    for (l = 0; l < number_of_cells_in_current_layer; l++)
#else
    for (l = 0; l < number_of_cells_in_layer[k]; l++)
#endif
    {
      g_free (((bistable_model *)sorted_cells[k][l]->cell_model)->neighbours);
      g_free (((bistable_model *)sorted_cells[k][l]->cell_model)->neighbour_layer);




  simulation_inproc_data_free (&number_of_cell_layers, &number_of_cells_in_layer, &sorted_cells);

  // Restore the input flag for the inactive inputs
  if (VECTOR_TABLE == SIMULATION_TYPE)
    for (i = 0; i < pvt_inputs_icUsed; i++)
      exp_array_index_1d (pvt_inputs, VT_INPUT, i).input->cell_function = QCAD_CELL_INPUT;

  // -- get and print the total simulation time -- //
  if ((end_time = time (NULL)) < 0)
    fprintf (stderr, "Could not get end time\n");

  command_history_message ("Total simulation time: %g s\n", (double)(end_time - start_time));
  set_progress_bar_visible (FALSE);

#ifndef REDUCE_DEREF
#undef sim_data_number_samples
#undef pvt_inputs
#undef pvt_inputs_icUsed
#undef pvt_vectors
#undef pvt_vectors_icUsed
#undef design_bus_layout
#undef design_bus_layout_inputs
#undef design_bus_layout_inputs_icUsed
#undef design_bus_layout_outputs
#undef design_bus_layout_outputs_icUsed
#endif

  return sim_data;
}// run bistable
Listing A.1: Bistable simulation engine software code
Appendix B
run_bistable_simulation_hardware C Code Listing
The following is the code for the version of run_bistable_simulation
created to use the hardware.
// calls the hardware to run an alternate version of the bistable simulation engine
simulation_data *run_bistable_simulation_hardware (int SIMULATION_TYPE, DESIGN *design, bistable_OP *options, VectorTable *pvt)
{
  int i, j, k, l, total_cells = 0;
  int icLayers, icCellsInLayer;
  time_t start_time, end_time;
  simulation_data *sim_data = NULL;
  // optimization variables //
  int number_of_cell_layers = 0, *number_of_cells_in_layer = NULL;
  QCADCell ***sorted_cells = NULL;
  float clock_shift = (float)((options->clock_high + options->clock_low) / 2 + options->clock_shift);
  float clock_prefactor = (float)(options->clock_high - options->clock_low) * options->clock_amplitude_factor;
  float four_pi_over_number_samples = 4.0 * (float)PI / (float)options->number_of_samples;
  float two_pi_over_number_samples = 2.0 * (float)PI / (float)options->number_of_samples;
  int idxMasterBitOrder = -1;
  int max_iterations_per_sample = ((bistable_OP *)options)->max_iterations_per_sample;
  BUS_LAYOUT_ITER bli;
#ifdef REDUCE_DEREF
  // For dereference reduction
  int sim_data_number_samples = 0, pvt_vectors_icUsed = 0,
      design_bus_layout_outputs_icUsed = 0, design_bus_layout_inputs_icUsed = 0, pvt_inputs_icUsed = 0;
  int number_of_cells_in_current_layer = 0;
  EXP_ARRAY *pvt_inputs = NULL;
  EXP_ARRAY *pvt_vectors = NULL;
  EXP_ARRAY *design_bus_layout_inputs = NULL;
  EXP_ARRAY *design_bus_layout_outputs = NULL;
  BUS_LAYOUT *design_bus_layout = NULL;
#endif
  // -- these used to be inside run_bistable_iteration -- //
  int Nix;
  int q, iteration, calculatedCellNum = 0;
  int stable = FALSE;
  float old_polarization;
  float new_polarization;
  float tolerance = (float)((bistable_OP *)options)->convergence_tolerance;
  float temp; // just used to move data formats between DWORDS and floats
  bistable_model *current_cell_model = NULL;
  QCADCell *cell;

  // wildcard initialization
  WC_RetCode rc = WC_SUCCESS;
  WC_TestInfo TestInfo;
  char **TestLoc = NULL;
  DWORD control[8];
  WC4_DmaHandle hDMASource;
  WC4_DmaHandle hDMADestination;
  DWORD dGrantedDwords;
  DWORD dSourceHardwareAddress;
  DWORD dDestinationHardwareAddress;
  float *DMABufSource;
  float *DMABufDest;
  boolean intStatus;
  DWORD debugReg[8];
  DWORD maxNeighbours_DMA = 0x400; // max number of neighbours supported, used to allocate source buffer (1024)
  DWORD numCells_DMA = 0; // number of cells to be read from the PE; sets DMA destination buffer; calculated later
  DWORD totem = 0xBF000000; // totem value used to signal end of things

  TestInfo.bVerbose = DEFAULT_VERBOSITY;
  TestInfo.DeviceNum = DEFAULT_SLOT_NUMBER;
  TestInfo.dIterations = DEFAULT_ITERATIONS;
  TestInfo.fClkFreq = DEFAULT_FREQUENCY;

  STOP_SIMULATION = FALSE;

  // -- get the starting time for the simulation -- //
  if ((start_time = time (NULL)) < 0)
    fprintf (stderr, "Could not get start time\n");
  // Create per-layer cell arrays to be used by the engine
  simulation_inproc_data_new (design, &number_of_cell_layers, &number_of_cells_in_layer, &sorted_cells);
  for (i = 0; i < number_of_cell_layers; i++)
  {
#ifdef REDUCE_DEREF
    number_of_cells_in_current_layer = number_of_cells_in_layer[i];
    for (j = 0; j < number_of_cells_in_current_layer; j++)
#else
    for (j = 0; j < number_of_cells_in_layer[i]; j++)
#endif
    {
      // attach the model parameters to each of the simulation cells //
      current_cell_model = g_malloc0 (sizeof (bistable_model));
      sorted_cells[i][j]->cell_model = current_cell_model;

      // -- Clear the model pointers so they are not dangling -- //
      current_cell_model->neighbours = NULL;
      current_cell_model->Ek = NULL;

      // -- set polarization in cell model for fixed cells since they are set with actual dot charges by the user -- //
      if (QCAD_CELL_FIXED == sorted_cells[i][j]->cell_function)
        current_cell_model->polarization = qcad_cell_calculate_polarization (sorted_cells[i][j]);

      total_cells++;

      if (!((QCAD_CELL_INPUT == sorted_cells[i][j]->cell_function) || (QCAD_CELL_FIXED == sorted_cells[i][j]->cell_function)))





  // if we are performing a vector table simulation we consider only the activated inputs //
  if (SIMULATION_TYPE == VECTOR_TABLE)
  {
    for (Nix = 0; Nix < pvt->inputs->icUsed; Nix++)
      if (!exp_array_index_1d (pvt->inputs, VT_INPUT, Nix).active_flag)
        exp_array_index_1d (pvt->inputs, VT_INPUT, Nix).input->




  // write message to the command history window //
  command_history_message (_("Simulation found %d inputs %d outputs %d total cells\n"), design->bus_layout->inputs->icUsed, design->bus_layout->outputs->icUsed, total_cells);

  command_history_message (_("Starting initialization\n"));
  command_history_message (_("Opening card.."));
  rc = WC_Open (TestInfo.DeviceNum, 0);
  if (rc != WC_SUCCESS) {
    command_history_message ("Failed to open card.\n");




  command_history_message (_("Initializing hardware..."));
  // initialize and reset the PE
  rc = wc4_qca_init (&TestInfo);
  if (rc != WC_SUCCESS) {
    command_history_message ("initialization failed\n");
    return NULL;
  }
  //CHECK_RC (rc);

  command_history_message (_("resetting hardware..."));
  rc = WC_PeReset (TestInfo.DeviceNum, TRUE);
  if (rc != WC_SUCCESS) {
    command_history_message ("failed\n");
    return NULL;
  }
  //CHECK_RC (rc);

  rc = WC_PeReset (TestInfo.DeviceNum, FALSE);
  if (rc != WC_SUCCESS) {
    command_history_message ("failed\n");
    return NULL;
  }
  //CHECK_RC (rc);
  command_history_message ("Done\n");

  set_progress_bar_visible (TRUE);
  set_progress_bar_label (_("Bistable simulation:"));

  // -- Initialize the simulation data structure -- //
  sim_data = g_malloc0 (sizeof (simulation_data));
  sim_data->number_of_traces = design->bus_layout->inputs->icUsed + design->bus_layout->outputs->icUsed;
  sim_data->number_samples = options->number_of_samples;
  sim_data->trace = g_malloc0 (sizeof (struct TRACEDATA) * sim_data->number_of_traces);

  // create and initialize the inputs into the sim data structure //
  for (i = 0; i < design->bus_layout->inputs->icUsed; i++)
  {
    sim_data->trace[i].data_labels = g_strdup (qcad_cell_get_label (QCAD_CELL (exp_array_index_1d (design->bus_layout->inputs, BUS_LAYOUT_CELL, i).cell)));
    sim_data->trace[i].drawtrace = TRUE;
    sim_data->trace[i].trace_function = QCAD_CELL_INPUT;
    sim_data->trace[i].data = g_malloc0 (sizeof (double) * sim_data->number_samples);
  }

  // create and initialize the outputs into the sim data structure //
  for (i = 0; i < design->bus_layout->outputs->icUsed; i++)
  {
    sim_data->trace[i + design->bus_layout->inputs->icUsed].data_labels = g_strdup (qcad_cell_get_label (QCAD_CELL (exp_array_index_1d (design->bus_layout->outputs, BUS_LAYOUT_CELL, i).cell)));
    sim_data->trace[i + design->bus_layout->inputs->icUsed].drawtrace = TRUE;
    sim_data->trace[i + design->bus_layout->inputs->icUsed].trace_function = QCAD_CELL_OUTPUT;
    sim_data->trace[i + design->bus_layout->inputs->icUsed].data = g_malloc0 (sizeof (double) * sim_data->number_samples);
  }

  // create and initialize the clock data //
  sim_data->clock_data = g_malloc0 (sizeof (struct TRACEDATA) * 4);

  for (i = 0; i < 4; i++)
  {
    sim_data->clock_data[i].data_labels = g_strdup_printf ("CLOCK %d", i);
    sim_data->clock_data[i].drawtrace = 1;
    sim_data->clock_data[i].trace_function = QCAD_CELL_FIXED; // Abusing the notation here

    sim_data->clock_data[i].data = g_malloc0 (sizeof (double) * sim_data->number_samples);

    if (SIMULATION_TYPE == EXHAUSTIVE_VERIFICATION)
      for (j = 0; j < sim_data->number_samples; j++)
      {
        sim_data->clock_data[i].data[j] = clock_prefactor * cos (((double)(1 << design->bus_layout->inputs->icUsed)) * (double)j * four_pi_over_number_samples - PI * i / 2) + clock_shift;
        sim_data->clock_data[i].data[j] = CLAMP (sim_data->clock_data[i].data[j], options->clock_low, options->clock_high);
      }
    else
    // if (SIMULATION_TYPE == VECTOR_TABLE)
      for (j = 0; j < sim_data->number_samples; j++)
      {
        sim_data->clock_data[i].data[j] = clock_prefactor * cos (((double)pvt->vectors->icUsed) * (double)j * two_pi_over_number_samples - PI * i / 2) + clock_shift;
        sim_data->clock_data[i].data[j] = CLAMP (sim_data->clock_data[i




  command_history_message("Allocating DMA buffers...");
  rc = WC4_DmaMemAlloc(TestInfo.DeviceNum, &hDMASource,
                       maxNeighbours_DMA, (DWORD **)&DMABufSource);
  if (rc != WC_SUCCESS) {
    command_history_message("\nError Allocating DMA source buffer\n");
    WC4_DmaMemFree(TestInfo.DeviceNum, hDMASource);
    WC_PeDeprogram(TestInfo.DeviceNum);
    WC4_PeInitiateAutoProgram(TestInfo.DeviceNum);
    WC_Close(TestInfo.DeviceNum);
    return NULL;
  }
  rc = WC4_DmaMemAlloc(TestInfo.DeviceNum, &hDMADestination,
                       numCells_DMA, (DWORD **)&DMABufDest);
  if (rc != WC_SUCCESS) {
    command_history_message("\nError Allocating DMA destination buffer\n");
    WC4_DmaMemFree(TestInfo.DeviceNum, hDMASource);
    WC4_DmaMemFree(TestInfo.DeviceNum, hDMADestination);
    WC_PeDeprogram(TestInfo.DeviceNum);
    WC4_PeInitiateAutoProgram(TestInfo.DeviceNum);
    WC_Close(TestInfo.DeviceNum);
    return NULL;
  }
  printf("DONE\n");

  printf("Binding DMA buffers\t\t\t");
  rc = WC4_DmaBind(TestInfo.DeviceNum, hDMASource, (DWORD *)DMABufSource,
                   maxNeighbours_DMA, &dGrantedDwords, &dSourceHardwareAddress);
  if (rc != WC_SUCCESS) {
    printf("\nError binding DMA source buffer\n");
    WC4_DmaMemFree(TestInfo.DeviceNum, hDMASource);
    WC4_DmaMemFree(TestInfo.DeviceNum, hDMADestination);
    WC_PeDeprogram(TestInfo.DeviceNum);
    WC4_PeInitiateAutoProgram(TestInfo.DeviceNum);
    WC_Close(TestInfo.DeviceNum);
    return NULL;
  } else if (maxNeighbours_DMA != dGrantedDwords) {
    printf("\nError could not bind full source buffer!\n");
    WC4_DmaMemFree(TestInfo.DeviceNum, hDMASource);
    WC4_DmaMemFree(TestInfo.DeviceNum, hDMADestination);
    WC_PeDeprogram(TestInfo.DeviceNum);
    WC4_PeInitiateAutoProgram(TestInfo.DeviceNum);
    WC_Close(TestInfo.DeviceNum);
    return NULL;
  }

  rc = WC4_DmaBind(TestInfo.DeviceNum, hDMADestination, (DWORD *)DMABufDest,
                   numCells_DMA, &dGrantedDwords, &dDestinationHardwareAddress);
  if (rc != WC_SUCCESS) {
    printf("\nError binding DMA destination buffer\n");
    WC4_DmaUnbind(TestInfo.DeviceNum, hDMASource);
    WC4_DmaMemFree(TestInfo.DeviceNum, hDMADestination);
    WC4_DmaMemFree(TestInfo.DeviceNum, hDMASource);
    WC_PeDeprogram(TestInfo.DeviceNum);
    WC4_PeInitiateAutoProgram(TestInfo.DeviceNum);
    WC_Close(TestInfo.DeviceNum);
    return NULL;
  } else if (numCells_DMA != dGrantedDwords) {
    printf("\nError could not bind full destination buffer!\n");
    WC4_DmaUnbind(TestInfo.DeviceNum, hDMASource);
    WC4_DmaMemFree(TestInfo.DeviceNum, hDMADestination);
    WC4_DmaMemFree(TestInfo.DeviceNum, hDMASource);
    WC_PeDeprogram(TestInfo.DeviceNum);
    WC4_PeInitiateAutoProgram(TestInfo.DeviceNum);
    WC_Close(TestInfo.DeviceNum);
    return NULL;
  }
  printf("DONE\n");

  // send DMA control data to PE
  control[0] = dDestinationHardwareAddress;
  control[1] = dSourceHardwareAddress;
  control[2] = numCells_DMA;
  command_history_message("Sending DMA control data\n");
  rc = WC_PeRegWrite(TestInfo.DeviceNum, CTRL_REG_BASE, 3, control);
  if (rc != WC_SUCCESS) {
    command_history_message("Register transfer failed\n");
    wildcard_cleanup(TestInfo.DeviceNum, &hDMASource, &hDMADestination);
    return NULL;
  }
  WC_PeRegRead(TestInfo.DeviceNum, CTRL_REG_BASE + 4, 4, debugReg);
  command_history_message("state: %x\n", debugReg[0]);
  command_history_message("Calculating kink energies\n");
  command_history_message("DMA To PE Addr: %x\n", dSourceHardwareAddress);
  // -- refresh all the kink energies to all the cells neighbours within the radius of effect -- //
  bistable_refresh_all_Ek_hardware(number_of_cell_layers, number_of_cells_in_layer,
                                   sorted_cells, options, DMABufSource, TestInfo.DeviceNum);
  // Using TestInfo.DeviceNum instead of TestInfo->DeviceNum. Don't know why it has
  // to be like this. Make sure this doesn't cause issues
  command_history_message("done calculating kink energies, waiting for interrupt\n");
  WC_IntQueryStatus(TestInfo.DeviceNum, &intStatus);
  command_history_message("interrupt status: %d\n", intStatus);
  // wait for interrupt for the last set of Eks to be confirmed as received
  rc = WC_IntWait(TestInfo.DeviceNum, INT_TIMEOUT_MS);
  // might want to find a different way of testing for the interrupt, because this seems cumbersome.
  if (rc != WC_SUCCESS) {
    WC_PeRegRead(TestInfo.DeviceNum, CTRL_REG_BASE, 8, debugReg);
    command_history_message("Interrupt timed out; int Counter, numNeighbours, dram addr, dma incount, state, in_fifo_out, DRAM data, DMA outCount: %x %x %x %x %x %x %x %x\n",
      debugReg[0], debugReg[1], debugReg[2], debugReg[3], debugReg[4],
      debugReg[5], debugReg[6], debugReg[7]);
    wildcard_cleanup(TestInfo.DeviceNum, &hDMASource, &hDMADestination);
    return NULL;
  }
  command_history_message("interrupt received\n");
  rc = WC_IntEnable(TestInfo.DeviceNum, FALSE); // reset interrupt
  rc = WC_IntReset(TestInfo.DeviceNum);
  rc = WC_IntEnable(TestInfo.DeviceNum, TRUE);
  command_history_message("interrupt reset\n");
  // send a totem value to tell the PE that all Ek data has been sent
  WC_PeRegWrite(TestInfo.DeviceNum, CTRL_REG_BASE + 3, 1, &totem);

  // -- get and print the total initialization time -- //
  if ((end_time = time(NULL)) < 0)
    fprintf(stderr, "Could not get end time\n");

  // there was a cell randomization sequence here, but the hardware accelerator makes it unnecessary

  command_history_message("Total initialization time: %g s\n", (double)(end_time - start_time));

  command_history_message("Starting Simulation\n");
  set_progress_bar_fraction(0.0);

  // perform the iterations over all samples //
#ifdef REDUCE_DEREF
  // Dereference some structures now so we don't do it over and over in the loop
  sim_data_number_samples = sim_data->number_samples;
  pvt_inputs = pvt->inputs;
  pvt_inputs_icUsed = pvt_inputs->icUsed;
  pvt_vectors = pvt->vectors;
  pvt_vectors_icUsed = pvt->vectors->icUsed;
  design_bus_layout = design->bus_layout;
  design_bus_layout_inputs = design_bus_layout->inputs;
  design_bus_layout_inputs_icUsed = design_bus_layout_inputs->icUsed;
  design_bus_layout_outputs = design_bus_layout->outputs;
  design_bus_layout_outputs_icUsed = design_bus_layout_outputs->icUsed;
#else
#define sim_data_number_samples sim_data->number_samples
#define pvt_inputs pvt->inputs
#define pvt_inputs_icUsed pvt_inputs->icUsed
#define pvt_vectors pvt->vectors
#define pvt_vectors_icUsed pvt->vectors->icUsed
#define design_bus_layout design->bus_layout
#define design_bus_layout_inputs design_bus_layout->inputs
#define design_bus_layout_inputs_icUsed design_bus_layout_inputs->icUsed
#define design_bus_layout_outputs design_bus_layout->outputs
#define design_bus_layout_outputs_icUsed design_bus_layout_outputs->icUsed
#endif
  for (j = 0; j < sim_data_number_samples; j++)
  {
    if (j % 100 == 0)
    {
      // write the completion percentage to the command history window //
      set_progress_bar_fraction((float)j / (float)sim_data_number_samples);
      // redraw the design if the user wants it to appear animated //
    }

    // -- for each of the (VECTOR_TABLE => active?) inputs -- //
    if (EXHAUSTIVE_VERIFICATION == SIMULATION_TYPE)
      for (idxMasterBitOrder = 0,
           design_bus_layout_iter_first(design_bus_layout, &bli, QCAD_CELL_INPUT, &i);
           i > -1;
           design_bus_layout_iter_next(&bli, &i), idxMasterBitOrder++)
        ((bistable_model *)exp_array_index_1d(design_bus_layout_inputs,
            BUS_LAYOUT_CELL, i).cell->cell_model)->polarization =
          sim_data->trace[i].data[j] =
            (-1 * sin(((double)(1 << idxMasterBitOrder)) * (double)j * FOUR_PI /
                      (double)sim_data_number_samples) > 0) ? 1 : -1;
    else
    // if (VECTOR_TABLE == SIMULATION_TYPE)
      for (design_bus_layout_iter_first(design_bus_layout, &bli, QCAD_CELL_INPUT, &i);
           i > -1;
           design_bus_layout_iter_next(&bli, &i))
        if (exp_array_index_1d(pvt_inputs, VT_INPUT, i).active_flag)
          ((bistable_model *)exp_array_index_1d(pvt_inputs, VT_INPUT,
              i).input->cell_model)->polarization =
            sim_data->trace[i].data[j] =
              exp_array_index_2d(pvt_vectors, gboolean,
                (j * pvt_vectors_icUsed) / sim_data_number_samples, i) ? 1 : -1;

    // -- run the iteration with the given clock value -- //
    // -- iterate until the entire design has stabilized -- //
    iteration = 0;
    stable = FALSE;
    while (!stable && iteration < max_iterations_per_sample)
    {
      iteration++;
      // -- assume that the circuit is stable -- //
      stable = TRUE;

      for (icLayers = 0; icLayers < number_of_cell_layers; icLayers++)
      {
#ifdef REDUCE_DEREF
        number_of_cells_in_current_layer = number_of_cells_in_layer[icLayers];
        for (icCellsInLayer = 0; icCellsInLayer < number_of_cells_in_current_layer; icCellsInLayer++)
#else
        for (icCellsInLayer = 0; icCellsInLayer < number_of_cells_in_layer[icLayers]; icCellsInLayer++)
#endif
        {
          cell = sorted_cells[icLayers][icCellsInLayer];

          if (!((QCAD_CELL_INPUT == cell->cell_function) ||
                (QCAD_CELL_FIXED == cell->cell_function)))
          {
            current_cell_model = ((bistable_model *)cell->cell_model);

            for (q = 0; q < current_cell_model->number_of_neighbours; q++)
            {
              DMABufSource[q] = ((bistable_model *)current_cell_model->
                neighbours[q]->cell_model)->polarization;
            }

            command_history_message("setting gamma...");
            temp = (float)(sim_data->clock_data[cell->cell_options.clock].data[j]);
            control[0] = *(DWORD *)&temp; // casting necessary to put a float into a DWORD unaltered
            command_history_message("setting neighbours...\n");
            control[1] = (DWORD)(current_cell_model->number_of_neighbours);
            command_history_message("set\n");
            if (icCellsInLayer != 0) // no interrupt comes for the first cell
            // we may have to add a condition for the first layer as well...
            {
              command_history_message("waiting for calculation\n");
              // wait for interrupt, signifying current calculation is done
              rc = WC_IntWait(TestInfo.DeviceNum, INT_TIMEOUT_MS);
              if (rc != WC_SUCCESS) { // debug output
                WC_PeRegRead(TestInfo.DeviceNum, CTRL_REG_BASE, 8, debugReg);
                command_history_message("Interrupt timed out; int Counter, numNeighbours, dram addr, dma incount, state, in_fifo_out, DRAM data, DMA outCount: %x %x %x %x %x %x %x %x\n",
                  debugReg[0], debugReg[1], debugReg[2], debugReg[3],
                  debugReg[4], debugReg[5], debugReg[6], debugReg[7]);
                wildcard_cleanup(TestInfo.DeviceNum, &hDMASource, &hDMADestination);
                return NULL;
              }
              command_history_message("interrupt received\n"); // debugging information
              rc = WC_IntEnable(TestInfo.DeviceNum, FALSE); // reset interrupt
              command_history_message("interrupt disabled\n");
              rc = WC_IntReset(TestInfo.DeviceNum);
              command_history_message("interrupt reset\n");
              rc = WC_IntEnable(TestInfo.DeviceNum, TRUE);
              command_history_message("interrupt enabled\n");
            }
            if (j == 0 && iteration == 1)
              control[1] = 0x80000000 | (DWORD)(current_cell_model->number_of_neighbours);
              // tell the PE this is not the first sample

            command_history_message("control data: %x %x\n", (int)control[0], (int)control[1]);
            command_history_message("sending polarizations\n");
            // get debugging data
            WC_PeRegRead(TestInfo.DeviceNum, CTRL_REG_BASE, 8, debugReg);
            command_history_message("Ek in, read count, write count, gamma in, state, in_fifo_out, DRAM fifo data, out_fifo_in: %x %x %x %x %x %x %x %x\n",
              debugReg[0], debugReg[2], debugReg[7], debugReg[1],
              debugReg[4], debugReg[5], debugReg[6], debugReg[3]);
            // send gamma and numNeighbours
            rc = WC_PeRegWrite(TestInfo.DeviceNum, CTRL_REG_BASE + 3, 2, control);
            if (rc != WC_SUCCESS) {
              command_history_message("Register transfer failed\n");
              wildcard_cleanup(TestInfo.DeviceNum, &hDMASource, &hDMADestination);
              return NULL;
            }

            command_history_message("waiting for transfer\n");
            // wait for interrupt, signifying all polarizations have been received.
            // Loop around to begin gathering next set
            rc = WC_IntWait(TestInfo.DeviceNum, INT_TIMEOUT_MS);
            if (rc != WC_SUCCESS) {
              WC_PeRegRead(TestInfo.DeviceNum, CTRL_REG_BASE, 8, debugReg);
              command_history_message("Interrupt timed out; int Counter, numNeighbours, dram addr, dma incount, state, in_fifo_out, DRAM data, DMA outCount: %x %x %x %x %x %x %x %x\n",
                debugReg[0], debugReg[1], debugReg[2], debugReg[3],
                debugReg[4], debugReg[5], debugReg[6], debugReg[7]);
              wildcard_cleanup(TestInfo.DeviceNum, &hDMASource, &hDMADestination);
              return NULL;
            }
            command_history_message("interrupt received\n");
            rc = WC_IntEnable(TestInfo.DeviceNum, FALSE); // reset interrupt
            command_history_message("interrupt disabled\n");
            rc = WC_IntReset(TestInfo.DeviceNum);
            command_history_message("interrupt reset\n");
            rc = WC_IntEnable(TestInfo.DeviceNum, TRUE);
            command_history_message("interrupt enabled\n");




        command_history_message("waiting for calculation (after loop)\n");
        // wait for interrupt, signifying last polarization has been calculated
        rc = WC_IntWait(TestInfo.DeviceNum, INT_TIMEOUT_MS);
        if (rc != WC_SUCCESS) {
          WC_PeRegRead(TestInfo.DeviceNum, CTRL_REG_BASE, 8, debugReg);
          command_history_message("Interrupt timed out; int Counter, numNeighbours, dram addr, dma incount, state, in_fifo_out, DRAM data, DMA outCount: %x %x %x %x %x %x %x %x\n",
            debugReg[0], debugReg[1], debugReg[2], debugReg[3],
            debugReg[4], debugReg[5], debugReg[6], debugReg[7]);
          wildcard_cleanup(TestInfo.DeviceNum, &hDMASource, &hDMADestination);
          return NULL;
        }
        rc = WC_IntEnable(TestInfo.DeviceNum, FALSE); // reset interrupt
        rc = WC_IntReset(TestInfo.DeviceNum);
        rc = WC_IntEnable(TestInfo.DeviceNum, TRUE);
        // send a totem value to tell the PE that all cells have been calculated;
        // the result is stored in rc so that the check below tests this write
        rc = WC_PeRegWrite(TestInfo.DeviceNum, CTRL_REG_BASE + 3, 1, &totem);
        if (rc != WC_SUCCESS) {
          command_history_message("Register transfer failed\n");
          wildcard_cleanup(TestInfo.DeviceNum, &hDMASource, &hDMADestination);
          return NULL;
        }

        calculatedCellNum = 0; // this variable is used to generate the index for the relevant data in DMABufDest
        // since it is a smaller size than number_of_cells_in_layer
        command_history_message("sorting out results\n");
        for (icCellsInLayer = 0; icCellsInLayer < number_of_cells_in_layer[icLayers]; icCellsInLayer++)
        {
          cell = sorted_cells[icLayers][icCellsInLayer];
          if (!((QCAD_CELL_INPUT == cell->cell_function) ||
                (QCAD_CELL_FIXED == cell->cell_function)))
          { // this condition must be here, as it was for these lines in the original version
            // tolerance testing could be done in hardware fairly easily, but doing it
            // alongside the data replacement makes sense
            current_cell_model = ((bistable_model *)cell->cell_model);
            old_polarization = current_cell_model->polarization;
            // eliminated need for *(float *)& casting by changing buffer's type
            current_cell_model->polarization = DMABufDest[calculatedCellNum];
            command_history_message("new polarization: %e\n", DMABufDest[calculatedCellNum]);
            stable = (fabs(DMABufDest[calculatedCellNum] - old_polarization) <= tolerance) && stable;







    if (VECTOR_TABLE == SIMULATION_TYPE)
      for (design_bus_layout_iter_first(design_bus_layout, &bli, QCAD_CELL_INPUT, &i);
           i > -1;
           design_bus_layout_iter_next(&bli, &i))
        if (!exp_array_index_1d(pvt_inputs, VT_INPUT, i).active_flag)
          sim_data->trace[i].data[j] = ((bistable_model *)
            exp_array_index_1d(pvt_inputs, VT_INPUT, i).input->cell_model)->polarization;

    // -- collect all the output data from the simulation -- //
    for (design_bus_layout_iter_first(design_bus_layout, &bli, QCAD_CELL_OUTPUT, &i);
         i > -1;
         design_bus_layout_iter_next(&bli, &i))
      sim_data->trace[design_bus_layout_inputs_icUsed + i].data[j] = ((bistable_model *)
        exp_array_index_1d(design_bus_layout_outputs, BUS_LAYOUT_CELL, i).cell->cell_model)->polarization;

    // -- if the user wants to stop the simulation then exit. -- //
    if (TRUE == STOP_SIMULATION)
      j = sim_data_number_samples;

  } // for number of samples

  // Free the neighbours and Ek array introduced by this simulation //
  for (k = 0; k < number_of_cell_layers; k++)
  {
#ifdef REDUCE_DEREF
    number_of_cells_in_current_layer = number_of_cells_in_layer[k];
    for (l = 0; l < number_of_cells_in_current_layer; l++)
#else
    for (l = 0; l < number_of_cells_in_layer[k]; l++)
#endif
    {
      g_free(((bistable_model *)sorted_cells[k][l]->cell_model)->neighbours);
      g_free(((bistable_model *)sorted_cells[k][l]->cell_model)->neighbour_layer);




  simulation_inproc_data_free(&number_of_cell_layers, &number_of_cells_in_layer, &sorted_cells);

  wildcard_cleanup(TestInfo.DeviceNum, &hDMASource, &hDMADestination);

  // Restore the input flag for the inactive inputs
  if (VECTOR_TABLE == SIMULATION_TYPE)
    for (i = 0; i < pvt_inputs_icUsed; i++)
      exp_array_index_1d(pvt_inputs, VT_INPUT, i).input->cell_function = QCAD_CELL_INPUT;

  // -- get and print the total simulation time -- //
  if ((end_time = time(NULL)) < 0)
    fprintf(stderr, "Could not get end time\n");

  command_history_message("Total simulation time: %g s\n", (double)(end_time - start_time));
  set_progress_bar_visible(FALSE);

#ifndef REDUCE_DEREF
#undef sim_data_number_samples
#undef pvt_inputs
#undef pvt_inputs_icUsed
#undef pvt_vectors
#undef pvt_vectors_icUsed
#undef design_bus_layout
#undef design_bus_layout_inputs
#undef design_bus_layout_inputs_icUsed
#undef design_bus_layout_outputs
#undef design_bus_layout_outputs_icUsed
#endif

  return sim_data;
} // run_bistable_hardware
Listing B.1: Bistable simulation engine software code with hardware interaction
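
The gamma transfer in Listing B.1 places the bit pattern of a single-precision clock value into a 32-bit register word through a pointer cast (*(DWORD *)&temp). As a minimal sketch, and not the code used in this work, the same bit-for-bit copy can be written with memcpy, which avoids the aliasing concerns that such a cast can raise on stricter compilers. The type name word32 and the helper names float_to_word and word_to_float are hypothetical; word32 stands in for the WildCard DWORD type, and the sketch assumes 32-bit IEEE 754 floats.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the WildCard DWORD type: a 32-bit unsigned word. */
typedef uint32_t word32;

/* The copy below is only meaningful if float and word32 have the same size. */
static_assert(sizeof(float) == sizeof(word32), "float must be 32 bits wide");

/* Copy the bit pattern of a float into a 32-bit word; the value is not converted. */
word32 float_to_word(float value)
{
    word32 bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Inverse operation: reinterpret a 32-bit word as a float. */
float word_to_float(word32 bits)
{
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

With these helpers the assignment in the listing would read control[0] = float_to_word(temp), and a polarization read back as a raw word could be recovered with word_to_float.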
Appendix C
qcad VHDL Code Listing
The following is the code for the top-level module of the QCA simulation
hardware.
library IEEE;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;

library WILDCARD4_LIB;
use WILDCARD4_LIB.pe_package.all;
use WILDCARD4_LIB.lad_tools_package.all;

library UNISIM;
use UNISIM.vcomponents.all;

architecture qcad of pe is

  --component declarations
  component p_math
    Port ( pol      : in  STD_LOGIC_VECTOR (31 downto 0);
           gamma    : in  STD_LOGIC_VECTOR (31 downto 0);
           Ek       : in  STD_LOGIC_VECTOR (31 downto 0);
           start    : in  STD_LOGIC;
           clk      : in  STD_LOGIC;
           ce       : in  std_logic;
           new_cell : in  std_logic;
           reset    : in  STD_LOGIC;
           new_pol  : out STD_LOGIC_VECTOR (31 downto 0);
           done     : out STD_LOGIC);
  end component;

  component fifo
    port (
      clk   : IN  std_logic;
      din   : IN  std_logic_vector(31 downto 0);
      rd_en : IN  std_logic;
      rst   : IN  std_logic;
      wr_en : IN  std_logic;
      dout  : OUT std_logic_vector(31 downto 0);
      empty : OUT std_logic;
      full  : OUT std_logic);
  end component;

  component out_fifo
    port (
      clk   : IN  std_logic;
      din   : IN  std_logic_vector(31 downto 0);
      rd_en : IN  std_logic;
      rst   : IN  std_logic;
      wr_en : IN  std_logic;
      dout  : OUT std_logic_vector(31 downto 0);
      empty : OUT std_logic;
      full  : OUT std_logic);
  end component;

  component DRAM_fifo
    port (
      din        : IN  std_logic_vector(127 downto 0);
      rd_clk     : IN  std_logic;
      rd_en      : IN  std_logic;
      rst        : IN  std_logic;
      wr_clk     : IN  std_logic;
      wr_en      : IN  std_logic;
      dout       : OUT std_logic_vector(127 downto 0);
      empty      : OUT std_logic;
      full       : OUT std_logic;
      prog_empty : OUT std_logic);
  end component;

  --wildcard interface signals
  constant STARTUP_WAIT : natural := 8000; -- Use for synthesis
  --constant STARTUP_WAIT : natural := 16; -- Use for simulation
  signal reset : std_logic := '0';
  signal clocks_in : clocks_in_type;
  signal clocks_out : clocks_out_type;
  signal lad_in : lad_in_type;
  signal lad_out : lad_out_type;

  --data input/output for p_math unit
  signal Ek_in, gamma_in : std_logic_vector(31 downto 0);

  --control signals for various aspects
  signal p_math_nd, cell_done, p_math_con, p_math_en : std_logic;

  --state machine for core operation
  type core_states is (idle, setup_Ek_DMA, Ek_DMA_init1, Ek_DMA_init2, Ek_to_DRAM,
    next_Ek_DMA, last_Ek_to_DRAM, get_regs, totem_wait, setup_pol_DMA, pol_DMA_init1,
    pol_DMA_init2, DRAM_in_start, dumbState, calculate1, calculate2, next_cell, nextCell_int,
    setup_deliver, deliver_init);
  signal core_state : core_states;

  --DRAM signals
  signal dram_in : dram_in_type;
  signal dram_out : dram_out_type;
  signal dram_read, dram_read_sel : std_logic;
  signal dram_write : std_logic;
  signal dram_buf_write, dram_super_write : std_logic; --write control for dram input buffer
  signal dram_addr : std_logic_vector(24 downto 0);
  signal dram_write_addr, dram_read_addr : std_logic_vector(22 downto 0);
  signal dram_data_out : std_logic_vector(127 downto 0);
  signal dram_data_in_valid : std_logic;
  signal dram_data_in : std_logic_vector(127 downto 0);
  signal dram_write_rdy : std_logic;
  signal dram_read_rdy : std_logic;

  signal dram_fifo_rd_cnt : std_logic_vector(8 downto 0);

  --state machine control signals
  signal smBus_req, smDMA_init, DMAfromPE : std_logic;
  signal smLAD_data_out : std_logic_vector(31 downto 0);
  signal DMA_From_Pe_Addr : std_logic_vector(29 downto 0) := (others => '0');
  signal DMA_To_Pe_Addr : std_logic_vector(29 downto 0) := (others => '0');
  signal DMA_inCount, DMA_outCount, numCells_DMA, numNeighbours_DMA :
    std_logic_vector(31 downto 0);
  signal DMA_inCount_clr, DMA_outCount_clr : std_logic;
  signal wait_count : std_logic;
  signal wait_counter : std_logic_vector(5 downto 0);

  signal dram_addr_inc, dram_addr_clr, dram_realAddr_inc : std_logic;
  signal dram_tailAddr_inc : std_logic;
  signal pN_delay : std_logic_vector(3 downto 0); --perNeighbour delay
  signal pN_count, pN_clr : std_logic;
  signal i_clock_prev : std_logic; --maybe bad practice, but used to detect i_clock edges
  --signal sram_addr_inc, sram_addr_clr : std_logic;

  --other control signals
  type lad_register_vector is array (natural range <>) of std_logic_vector(31 downto 0);
  signal control_register : lad_register_vector(0 to 7);
  signal debug_reg : lad_register_vector(0 to 7);
  signal reg_strobe_out : std_logic;
  signal reg_LAD_out : std_logic_vector(31 downto 0);
  signal Reg_Index : natural := 0;
  signal Start_DMA, gamma_set : std_logic;
  signal generate_interrupt, pci_rdy, Interrupt, Interrupt_Done : std_logic;
  signal clear_interrupt : std_logic;
  signal int_counter : std_logic_vector(4 downto 0);

  --controls to and from the fifo units
  signal in_fifo_rd, in_fifo_empty, in_fifo_full : std_logic;
  signal in_fifo_out : std_logic_vector(31 downto 0);
  signal out_fifo_rd, out_fifo_empty, out_fifo_full, out_fifo_prog_full : std_logic;
  signal out_fifo_out, out_fifo_in : std_logic_vector(31 downto 0);
  signal dram_fifo_rd, dram_fifo_empty, dram_fifo_prog_empty, dram_fifo_full : std_logic;
  signal dram_fifo_clr : std_logic;
  signal dram_fifo_out : std_logic_vector(127 downto 0);

  alias p_clock : std_logic is clocks_in.p_clock.clock;
  alias i_clock : std_logic is lad_in.clock_in;

  constant REG_BASE : std_logic_vector(15 downto 0) := x"0100";
  constant REG_MASK : std_logic_vector(15 downto 0) := x"FFFF";
  constant REG_BASE_ADDRESS : std_logic_vector(15 downto 0) := x"1000";





  polarization_math : p_math
    Port map( pol      => in_fifo_out,
              gamma    => gamma_in,   --from register
              Ek       => Ek_in,      --from DRAM buffer thing
              start    => p_math_nd,
              clk      => i_clock,
              ce       => p_math_en,
              new_cell => p_math_con,
              reset    => reset,
              new_pol  => out_fifo_in,
              done     => cell_done);

  in_fifo : fifo -- fifo storage for incoming data; size is 32x1024
    port map (
      clk   => i_clock,
      din   => lad_in.data_in,
      rd_en => in_fifo_rd,
      rst   => reset,
      wr_en => lad_in.DMA_strobe,
      dout  => in_fifo_out, --goes to DRAM buffer thing (sort out which part of the 128 bit word to use)
      empty => in_fifo_empty,
      full  => in_fifo_full);

  away : out_fifo -- fifo for outgoing data; size is 32x256
  --size was enlarged from 32x32 to deal with laggy full signals and prevent data overflow
    port map (
      clk   => i_clock,
      din   => out_fifo_in,
      rd_en => out_fifo_rd,
      rst   => reset,
      wr_en => cell_done,
      dout  => out_fifo_out,
      empty => out_fifo_empty,
      full  => out_fifo_full);

  out_fifo_rd <= (not out_fifo_empty) and lad_in.bus_gnt and lad_in.pci_rdy;

  DRAM_output_fifo : DRAM_fifo -- fifo for DRAM output; size is 128x256
    port map (
      din    => dram_data_in,
      rd_clk => i_clock,
      rd_en  => dram_fifo_rd, --controlled from state machine when data is needed
195 r s t => d r am f i f o c l r , −−c on t r o l l e d in the s t a t e machine so empty
f l a g r e s e t s
196 wr c lk => p c lock ,
197 wr en => dram data in va l id ,
198 dout => dram f i f o out , −−goes to DRAM bu f f e r th ing ( the th ing that
s o r t s out which part o f the 128 b i t word to use )
199 empty => dram fi fo empty ,
200 f u l l => d r am f i f o f u l l ,
201 prog empty => dram f i fo prog empty ) ; −−s e t to turn on when l e s s
than 50 data are pre sent
202 −−t h i s i s to ensure t h i s f i f o communicates e f f e c t i v e l y with the
DRAM con t r o l
203 −−to ensure data does not over− or underf low
204 −−because data must always be pre sent to meet t iming
205 −−and the 173 cy c l e de lay on DRAM read means
206 −−the re must be a s i z a b l e bu f f e r zone between stopping the read
and the f i f o being f u l l
207
  -- DRAM output buffer thing (between DRAM output fifo and p_math input):
  -- governing which part of the DRAM word goes to the p_math core based on the address bits
  -- this may need to be modified if we are using two p_math cores
  Ek_in <= dram_fifo_out(127 downto 96) when (dram_addr(1 downto 0) = "11") else
           dram_fifo_out(95 downto 64)  when (dram_addr(1 downto 0) = "10") else
           dram_fifo_out(63 downto 32)  when (dram_addr(1 downto 0) = "01") else
           dram_fifo_out(31 downto 0)   when (dram_addr(1 downto 0) = "00") else
           (others => '0');

  -- dram input buffer thing: DRAM write only happens once at the beginning, in sequence,
  -- so every time the last two address bits are 11, the previous three data slots have been filled,
  -- and we have a 128 bit word to write.
  -- the last set of data will need to have a special case to be sure it's written
  -- this also means that real address 0 will have no data.
  DRAM_input_buffer : process (i_clock, reset)
  begin
    if (reset = '1') then
      dram_data_out <= (others => '0');
      dram_write    <= '0';
    else
      if rising_edge(i_clock) then
        if (dram_buf_write = '1') then
          case dram_addr(1 downto 0) is
            when "00" =>
              dram_data_out(31 downto 0) <= in_fifo_out;
            when "01" =>
              dram_data_out(63 downto 32) <= in_fifo_out;
            when "10" =>
              dram_data_out(95 downto 64) <= in_fifo_out;
            when "11" =>
              dram_data_out(127 downto 96) <= in_fifo_out;
            when others =>
              dram_data_out <= dram_data_out;
          end case;

          if ((dram_addr(1 downto 0) = "11") or (dram_super_write = '1')) then
            dram_write <= '1';
          else
            dram_write <= '0';
          end if;
        else
          dram_write    <= '0';
          dram_data_out <= dram_data_out;
        end if;
      end if;
    end if;
  end process;

  ------------------------------------------
  -- DMA control stuff
  ------------------------------------------
  lad_reg_control_proc : process (reset, i_clock) is
  begin
    if (reset = '1') then
      Start_DMA        <= '0';
      gamma_set        <= '0';
      control_register <= (others => (others => '0'));
      reg_LAD_out      <= (others => '0');
      reg_strobe_out   <= '0';
    elsif rising_edge(i_clock) then  -- make sure this is the correct clock to use
      if (lad_in.reg_strobe = '1') then
        if (lad_in.write = '1') then
          control_register(Reg_Index) <= lad_in.data_in;
          if (Reg_Index = 4) then
            Start_DMA <= '1';  -- start DMA on third transfer
          elsif (Reg_Index = 3) then
            gamma_set <= '1';
          end if;
        else
          reg_LAD_out    <= debug_reg(Reg_Index);
          reg_strobe_out <= '1';
        end if;
      else
        gamma_set      <= '0';
        start_DMA      <= '0';
        reg_strobe_out <= '0';
        reg_LAD_out    <= (others => '0');
      end if;
    end if;
  end process lad_reg_control_proc;

  Reg_Index <= CONV_INTEGER('0' & lad_in.Addr(2 downto 0));

  DMA_From_PE_Addr  <= control_register(0)(31 downto 2);
  DMA_To_PE_Addr    <= control_register(1)(31 downto 2);
  numCells_DMA      <= control_register(2);  -- DMA from PE count; the number of cells
  gamma_in          <= control_register(3);
  numNeighbours_DMA <= X"7fffffff" and control_register(4);  -- DMA to PE count; the number of neighbours a cell has
  -- the gamma and neighbours registers are arranged in this order to streamline the transfer
  -- of gamma followed by the number of neighbours for starting a polarization transfer
  -- ANDed with this value because of dumb things. (see comment for dumbState in the transitions process)

  --debug_reg(2) <= dram_fifo_empty & control_register(4)(31) & gamma_set & start_DMA & "000" & dram_addr;
  debug_reg(4) <= X"00000000" when core_state = idle else
                  X"00000001" when core_state = setup_Ek_DMA else
                  X"00000002" when core_state = Ek_DMA_init1 else
                  X"00000003" when core_state = Ek_DMA_init2 else
                  X"00000004" when core_state = Ek_to_DRAM else
                  X"00000005" when core_state = next_Ek_DMA else
                  X"00000006" when core_state = last_Ek_to_DRAM else
                  X"00000007" when core_state = get_regs else
                  X"00000008" when core_state = totem_wait else
                  X"00000009" when core_state = setup_pol_DMA else
                  X"0000000a" when core_state = pol_DMA_init1 else
                  X"0000000b" when core_state = pol_DMA_init2 else
                  X"0000000c" when core_state = DRAM_in_start else
                  X"0000000d" when core_state = dumbState else
                  X"0000000e" when core_state = calculate1 else
                  X"0000000f" when core_state = calculate2 else
                  X"00000010" when core_state = next_cell else
                  X"00000011" when core_state = nextCell_int else
                  X"00000012" when core_state = setup_deliver else
                  X"00000013" when core_state = deliver_init else
                  X"ffffffff";

  debug_out : process (reset, i_clock) is
  begin
    if (reset = '1') then
      debug_reg(0) <= (others => '0');
      debug_reg(1) <= (others => '0');
      debug_reg(5) <= (others => '0');
      debug_reg(3) <= (others => '0');
      debug_reg(2) <= (others => '0');
      debug_reg(7) <= (others => '0');

      dram_fifo_rd_cnt <= (others => '0');
    elsif (rising_edge(i_clock)) then
      if (core_state = calculate1) then
        debug_reg(0) <= Ek_in;
        debug_reg(2) <= "00000000000000000000000" & dram_fifo_rd_cnt;
        debug_reg(7) <= (others => '0');
        debug_reg(1) <= gamma_in;
        debug_reg(5) <= in_fifo_out;
        debug_reg(6) <= dram_fifo_out(127 downto 96);
      end if;

      if (core_state = nextCell_int) then
        debug_reg(3) <= out_fifo_in;
      end if;

      if (dram_fifo_rd = '1') then
        dram_fifo_rd_cnt <= dram_fifo_rd_cnt + 1;
      elsif (dram_fifo_clr = '1') then
        dram_fifo_rd_cnt <= (others => '0');
      end if;
    end if;
  end process;

  -- state machine transitions
  core_transitions : process (i_clock, reset)
  begin
    if (reset = '1') then
      core_state <= idle;
    elsif rising_edge(i_clock) then
      case core_state is
        when idle =>  -- this state should only happen once, when the core is reset
          -- wait for signal to start DMA transfer of Ek data
          if (Start_DMA = '1') then
            core_state <= setup_Ek_DMA;
          end if;
        when setup_Ek_DMA =>  -- this is the state we should loop back to for repeating the DMA transfer
          if (lad_in.bus_gnt = '1') then
            core_state <= Ek_DMA_init1;
          end if;
        when Ek_DMA_init1 =>
          core_state <= Ek_DMA_init2;
        when Ek_DMA_init2 =>
          core_state <= Ek_to_DRAM;
        when Ek_to_DRAM =>
          -- shove Ek values from the FIFO into DRAM
          -- Max address limit is irrelevant because the amount of data is never that high
          -- get number of neighbours from the host in the RegWrite that triggers things
          -- no check for dram_write_rdy here because in_fifo is read only when that signal is high
          -- so in_fifo_empty will activate after the complete set of *successful* DRAM writes
          if ((DMA_inCount >= numNeighbours_DMA) and (in_fifo_empty = '1')) then
            core_state <= next_Ek_DMA;  -- start new DMA for more Ek data...
          end if;
        when next_Ek_DMA =>
          -- will need to send an interrupt here
          if (Start_DMA = '1') then
            core_state <= setup_Ek_DMA;
          elsif (gamma_set = '1') then
            core_state <= last_Ek_to_DRAM;
          end if;
        when last_Ek_to_DRAM =>
          if (dram_write_rdy = '1') then  -- if dram_write_rdy is 1, then the thing has happened and we can move on
            core_state <= setup_deliver;
          end if;
        when setup_deliver =>
          -- open the channel to deliver data back to host (DMA from the SRAM)
          if (lad_in.bus_gnt = '1') then
            core_state <= deliver_init;
          end if;
        when deliver_init =>
          core_state <= totem_wait;
        when totem_wait =>  -- a wait state for between iterations
          if (gamma_in /= X"BF000000" and wait_counter >= X"4") then
            core_state <= get_regs;
          end if;
        when get_regs =>  -- get gamma and neighbour values from host for current cell in register transfer
          -- this is the state we start from every time after we have Ek data
          -- wait until we have gamma and neighbours
          if (DMA_outCount = X"00000000" or start_DMA = '1') then
            core_state <= setup_pol_DMA;
          elsif (gamma_set = '1' and gamma_in = X"BF000000") then
            -- test here for whether the last cell is done or not
            -- test must be done in the first transfer or it will cause a false positive
            core_state <= setup_deliver;
          end if;
        when setup_pol_DMA =>  -- this is for v1 of the hardware
          -- wait for the bus grant, like in the Ek DMA setup
          if (lad_in.bus_gnt = '1') then
            core_state <= Pol_DMA_init1;
          end if;
          -- it would be nice to keep fetching more polarization data as soon as all the current stuff goes in
          -- but I think it would require extracting the DMA control into a separate machine
        when pol_DMA_init1 =>
          core_state <= pol_DMA_init2;
        when pol_DMA_init2 =>
          core_state <= DRAM_in_start;
        when DRAM_in_start =>  -- we have all Ek, gamma, and Polarizations are on their way, so we can start calculating the new polarization
          -- this state should request data from DRAM, then move to a wait state
          -- DRAM address in this and subsequent states should be incrementing continuously until some limit...
          if (dram_fifo_empty = '0' and interrupt_done = '1') then
            -- make sure DRAM data is ready and that polarization data has come through the fifo
            -- the particulars of the logic at play here mean that the in_fifo must be at least
            -- as large as the largest DMA input transaction
            if (control_register(4)(31) = '1' or DMA_outCount /= X"0") then
              -- do dumb things on the first cell of any sample but the first.
              core_state <= calculate1;
            else
              core_state <= dumbState;
            end if;
          end if;
        when dumbState =>
          -- there is a dumb thing that happens where the DRAM doesn't actually read the address it's told to read...
          -- Upon doing read(1) on anything but the first sample, the first data_in_valid would be associated
          -- with all zeroes on the data line (presumably read(0), but I don't know).
          -- in order to compensate for this dumb thing that happens, this dumb state was added
          -- to flush that first worthless word out of the dram_fifo when needed.
          -- The need to do this is indicated by having a zero in the MSB of control_register(4) when sent
          -- from the host. This means also that the MSB of control_register(4) (which receives numNeighbours)
          -- must be anded with 7fffffff in order to make numNeighbours a normal value for use in
          -- DMA initialization
          -- Anyway this is dumb. I don't know why it needs to happen like this, I can't explain the simulated
          -- behaviour of the DRAM, and it seems to happen the same in hardware...
          -- it may have something to do with the write.
          core_state <= calculate1;
        when calculate1 =>
          -- first data ready from DRAM fifo, start calculation
          --if (dram_fifo_empty = '0' and in_fifo_empty = '0') then  -- make sure data is received before moving on
          -- the functionality of the p_math core is dependent upon precise and reliable timing
          -- ZERE CAN BE NO VAITINK!
          -- (*ahem* there can be no waiting) Although extra waiting around between cells is acceptable
          -- the question of the moment is: is it safe to assume the DRAM will reliably have new data ready?
          core_state <= calculate2;
          --end if;
        when calculate2 =>
          -- this is the wait state; it should have two cycles
          if (wait_counter = X"1") then
            -- how to determine when to move on to the next cell...
            if (in_fifo_empty = '1') then
              core_state <= next_cell;
            else
              core_state <= calculate1;
            end if;
          end if;
        when next_cell =>
          -- p_math core prepped for next stuff...
          -- new pol from p_math should be automatically sent when core is finished,
          -- rather than being controlled explicitly here
          if (cell_done = '1') then  -- wait around until calculation is done
            core_state <= nextCell_int;
          end if;
        when nextCell_int =>  -- state for generating interrupt
          if (interrupt_done = '1' and start_DMA = '0') then
            core_state <= get_regs;
          end if;
        when others =>
          core_state <= idle;
      end case;
    end if;
  end process;

  core_control : process (core_state, DMA_To_Pe_Addr, numNeighbours_DMA,
                          dram_write_rdy, in_fifo_empty, dram_addr, dram_read_rdy, dram_fifo_full,
                          DMA_inCount, DMA_From_Pe_Addr, DMA_outCount, numCells_DMA, out_fifo_full,
                          out_fifo_empty, cell_done, dram_fifo_prog_empty)
  begin
    dram_addr_inc      <= '0';
    dram_realAddr_inc  <= '0';
    dram_tailAddr_inc  <= '0';
    dram_addr_clr      <= '0';
    dram_buf_write     <= '0';
    dram_super_write   <= '0';
    dram_read          <= '0';
    dram_fifo_rd       <= '0';
    dram_fifo_clr      <= '0';
    p_math_nd          <= '0';
    p_math_en          <= '0';
    p_math_con         <= '0';
    smbus_req          <= '0';
    smdma_init         <= '0';
    smLAD_Data_out     <= (others => '0');
    DMA_inCount_clr    <= '0';
    DMA_outCount_clr   <= '0';
    in_fifo_rd         <= '0';
    pn_clr             <= '0';
    pn_count           <= '0';
    generate_interrupt <= '0';
    clear_interrupt    <= '0';
    wait_count         <= '0';
    dram_read_sel      <= '0';

    case core_state is
      when idle =>
        clear_interrupt  <= '1';
        DMA_inCount_clr  <= '1';
        DMA_outCount_clr <= '1';
        dram_fifo_clr    <= '1';
      when setup_Ek_DMA =>
        smBus_Req <= '1';  -- request LAD bus for DMA initialization
      when Ek_DMA_init1 =>
        smBus_Req      <= '1';
        smdma_init     <= '1';
        smLAD_Data_Out <= DMA_TO_PE_Addr & "00";
      when Ek_DMA_init2 =>
        smbus_req       <= '1';
        smdma_init      <= '1';
        smLAD_Data_Out  <= numNeighbours_DMA;
        DMA_inCount_clr <= '1';
        clear_interrupt <= '1';
      when Ek_to_DRAM =>  -- shove Ek values from the FIFO into DRAM
        -- we could extract this out to be an autonomous control structure, so that it could continue
        -- running while new data is being fetched, but "in theory" new data should be stored immediately
        -- so there should be no waiting
        -- data has to be written 128 bits at a time, this is handled by the DRAM input buffer
        if (DRAM_write_rdy = '1' and in_fifo_empty = '0') then
          dram_buf_write <= '1';  -- control sent to the DRAM input buffer
          in_fifo_rd     <= '1';
        end if;
      when next_Ek_DMA =>  -- wait for next totem, and fire interrupt
        generate_interrupt <= '1';
      when last_Ek_to_DRAM =>  -- this is for making sure the last set of data in the DRAM input buffer gets written
        -- because if it does not end on a "11" it will otherwise just stay in the buffer and not get written
        if (dram_addr(1 downto 0) /= "00" and DRAM_write_rdy = '1') then
          -- indicating that the last address was not ending in "11"
          dram_super_write  <= '1';
          dram_buf_write    <= '1';
          dram_realAddr_inc <= '1';
          in_fifo_rd        <= '1';
        end if;
      when setup_deliver =>
        -- deliver data back to host (DMA from the SRAM)
        smBus_req       <= '1';
        clear_interrupt <= '1';
        dram_read_sel   <= '1';
      when deliver_init =>
        smBus_Req      <= '1';
        smdma_init     <= '1';
        smLAD_Data_Out <= DMA_From_PE_Addr & "01";
        dram_read_sel  <= '1';
      when totem_wait =>
        wait_count       <= '1';
        DMA_outCount_clr <= '1';
        dram_fifo_clr    <= '1';
        dram_addr_clr    <= '1';  -- all data has been put in DRAM, reset the address for reading
        dram_read_sel    <= '1';
      when get_regs =>
        p_math_en     <= '1';
        p_math_con    <= '1';
        dram_read_sel <= '1';
      when setup_pol_DMA =>
        smBus_Req     <= '1';  -- request LAD bus for DMA initialization
        p_math_en     <= '1';
        dram_read_sel <= '1';
      when pol_DMA_init1 =>
        smBus_Req      <= '1';
        smdma_init     <= '1';
        smLAD_Data_Out <= DMA_TO_PE_Addr & "00";
        p_math_en      <= '1';
        dram_read_sel  <= '1';
      when pol_DMA_init2 =>
        smbus_req       <= '1';
        smdma_init      <= '1';
        smLAD_Data_Out  <= numNeighbours_DMA;
        DMA_inCount_clr <= '1';
        p_math_en       <= '1';
        clear_interrupt <= '1';
        dram_read_sel   <= '1';
      when DRAM_in_start =>  -- we have all Ek, gamma, and Polarizations, so we can start calculating the new polarization
        -- this state should request data from DRAM, then move to a wait state
        -- DRAM address in this and subsequent states should be incrementing continuously until some limit...
        if (DRAM_read_rdy = '1' and dram_fifo_prog_empty = '1') then
          dram_read <= '1';
        end if;
        p_math_en <= '1';

        -- fire the interrupt here, allowing the host to gather the next set of polarizations
        -- while the calculation is being done.
        -- need to fire interrupt again when cell calculation is done to tell host
        -- it is ok to send new gamma and neighbours
        if (DMA_inCount >= numNeighbours_DMA) then  -- this condition is used in a few places...
          generate_interrupt <= '1';
        end if;
        dram_read_sel <= '1';
      when dumbState =>
        if (DRAM_read_rdy = '1' and dram_fifo_prog_empty = '1') then
          dram_read <= '1';
        end if;
        dram_fifo_rd  <= '1';
        p_math_en     <= '1';
        dram_read_sel <= '1';
      when calculate1 =>
        -- first data ready from DRAM fifo, start calculation
        if (DRAM_read_rdy = '1' and dram_fifo_prog_empty = '1') then  -- be careful with this read_rdy signal and p_math_nd....
          dram_read <= '1';
        end if;
        p_math_nd         <= '1';
        p_math_en         <= '1';
        dram_tailAddr_inc <= '1';
        in_fifo_rd        <= '1';
        if (dram_addr(1 downto 0) = "11") then  -- when the fourth 32 bit word has been read, get next 128 bit word
          dram_fifo_rd <= '1';
        end if;
        dram_read_sel <= '1';
      when calculate2 =>
        -- currently the p_math unit is configured for a 2-latency accumulator...
        -- this is the downcycle
        if (DRAM_read_rdy = '1' and dram_fifo_prog_empty = '1') then  -- be careful with this read_rdy signal and p_math_nd....
          dram_read <= '1';
        end if;
        p_math_en     <= '1';
        wait_count    <= '1';
        dram_read_sel <= '1';
      when next_cell =>
        if (DRAM_read_rdy = '1' and dram_fifo_prog_empty = '1') then  -- be careful with this read_rdy signal and p_math_nd....
          dram_read <= '1';
        end if;
        p_math_en       <= '1';
        clear_interrupt <= '1';
        dram_read_sel   <= '1';
      when nextCell_int =>  -- interrupt to inform that the cell has been calculated
        if (DRAM_read_rdy = '1' and dram_fifo_prog_empty = '1') then  -- this is just running all the time to make sure data is present
          -- stopping the DRAM transfer only after fifo full is detected, may cause data loss...
          -- BECAUSE TIMING IS SUPER CRITICAL WE MUST HAVE DATA
          dram_read <= '1';
        end if;
        generate_interrupt <= '1';
        p_math_en          <= '1';
        dram_read_sel      <= '1';
    end case;
  end process;

  process (reset, p_clock, i_clock) is
  begin
    if (reset = '1') then
      dram_read_addr        <= (others => '0');
      dram_write_addr       <= (others => '0');
      dram_addr(1 downto 0) <= "00";
    else
      if rising_edge(p_clock) then
        if (dram_read = '1') then
          dram_read_addr <= dram_read_addr + 1;
        end if;

        if (dram_addr_clr = '1') then
          -- this is the address where the first data is, and it is only set after DRAM is filled
          -- because the nature of how the DRAM input buffer works means that address zero is not written
          -- so we just skip over it when we go to reads.
          -- (the global reset is enough to set the initial writing address to 0.)
          dram_read_addr <= "00000000000000000000001";
        end if;
      end if;

      if rising_edge(i_clock) then
        if (dram_buf_write = '1' and dram_addr(1 downto 0) = "11") then
          dram_write_addr <= dram_write_addr + 1;
        end if;

        if (dram_addr_clr = '1') then
          dram_write_addr <= "00000000000000000000001";
        end if;

        -- it is necessary to separate these two out at some point, because while streaming the DRAM output
        -- we must keep the real address going up as fast as possible to ensure presence of data,
        -- in order to keep up with the timing concerns of the p_math core.
        -- but then the two LSBs get out of sync from where we want them, which is pointing to
        -- the next data coming out of the DRAM fifo
        if (dram_tailAddr_inc = '1' or dram_buf_write = '1') then
          dram_addr(1 downto 0) <= dram_addr(1 downto 0) + 1;
        end if;

        if (dram_addr_clr = '1') then
          dram_addr(1 downto 0) <= "00";
        end if;
      end if;
    end if;
  end process;

  -- this is so we can have the address controlled by two different clocks
  dram_addr(24 downto 2) <= dram_read_addr when dram_read_sel = '1' else
                            dram_write_addr;

  -- be wary of clocks....
  wait_counter_proc : process (reset, i_clock) is
  begin
    if (reset = '1') then
      DMA_inCount  <= (others => '0');
      DMA_outCount <= (others => '0');
      wait_counter <= (others => '0');
    elsif rising_edge(i_clock) then
      -- this is synchronized to i_clock because they are tracking DMA transactions over the LAD bus
      if (lad_in.DMA_strobe = '1') then
        DMA_inCount <= DMA_inCount + 1;
      elsif (DMA_inCount_clr = '1') then
        DMA_inCount <= (others => '0');
      end if;

      if (DMAfromPE = '1') then
        DMA_outCount <= DMA_outCount + 1;
      elsif (DMA_outCount_clr = '1') then
        DMA_outCount <= (others => '0');
      end if;

      if (wait_count = '1') then
        wait_counter <= wait_counter + 1;
      else
        wait_counter <= (others => '0');
      end if;

      DMAfromPE <= out_fifo_rd;  -- similar structure as the DMA example for outputting data
    end if;
  end process wait_counter_proc;

  ------------------------------------------
  -- DRAM signal registration
  ------------------------------------------
  register_signals_proc : process (p_clock, reset) is
  begin
    if (reset = '1') then
      dram_out.read      <= '0';
      dram_out.write     <= '0';
      dram_out.addr      <= (others => '0');
      dram_out.data_out  <= (others => '0');
      dram_data_in       <= (others => '0');
      dram_data_in_valid <= '0';
      dram_write_rdy     <= '0';
      dram_read_rdy      <= '0';
    elsif rising_edge(p_clock) then
      dram_out.read      <= dram_read;
      dram_out.write     <= dram_write;
      dram_out.addr      <= dram_addr(24 downto 2);  -- lower two bits are used to decide which 32 bit word to select
      dram_out.data_out  <= dram_data_out;
      dram_data_in       <= dram_in.data_in;
      dram_data_in_valid <= dram_in.data_in_valid;
      dram_write_rdy     <= dram_in.write_rdy;
      dram_read_rdy      <= dram_in.read_rdy;
    end if;
  end process register_signals_proc;

  lad_out_proc : process (reset, i_clock) is
  begin
    if (reset = '1') then
      lad_out.Data_Out   <= (others => '0');
      lad_out.Strobe_Out <= '0';
      lad_out.dma_init   <= '0';
      lad_out.bus_req    <= '0';
      lad_out.pe_rdy     <= '0';
      lad_out.int_req    <= '0';
      pci_rdy            <= '0';
    elsif rising_edge(i_clock) then
      if (DMAFromPe = '1') then  -- DMAfromPe will need to be controlled somewhere, figure that out
        lad_out.Data_Out <= out_fifo_out;  -- SRAM data output goes here, somehow
      elsif (reg_strobe_out = '1') then
        lad_out.Data_out <= Reg_LAD_out;
      else
        lad_out.Data_Out <= smLAD_Data_Out;
      end if;
      ------------------------------------------
      -- Generate a strobe on the LAD bus for any of the following 2 cases: --
      -- Regs_Strobe_Out = '1' : Register read from the control register    --
      -- DMAFromPe = '1'       : DMA-From-PE transaction data strobe        --
      ------------------------------------------
      lad_out.Strobe_Out <= DMAFromPe or Reg_strobe_out;
      ------------------------------------------
      -- Allow the state machine to drive the dma_init line when necessary. --
      ------------------------------------------
      lad_out.dma_init <= smdma_init;

      lad_out.bus_req <= ((not out_fifo_empty) and lad_in.pci_rdy) or DMAFromPE or smBus_Req;

      -- In addition to performing the DMA-to-PE initialization transactions --
      -- the PE also must inform the PCI controller when it is able to accept --
      -- DMA data. This is done through the lad_out.pe_rdy line.
      lad_out.pe_rdy <= not in_fifo_full;

      lad_out.int_req <= Interrupt;

      -- Delay PCI ready by one clock to make better timing.
      pci_rdy <= lad_in.pci_rdy;
    end if;
  end process lad_out_proc;

  ------------------------------------------
  -- Interrupt Control
  ------------------------------------------
  int_proc : process (i_clock, reset)
  begin
    if (reset = '1') then
      Int_Counter <= (others => '0');
    elsif rising_edge(i_clock) then
      if (Generate_Interrupt = '1' and Interrupt_Done = '0' and lad_in.dma_in_progress = '0') then
        Int_Counter <= Int_Counter + 1;
      end if;

      if (clear_interrupt = '1') then
        int_counter <= (others => '0');
      end if;
    end if;
  end process int_proc;

  Interrupt_Done <= Int_Counter(4);
  Interrupt      <= Int_Counter(4) or Int_Counter(3) or Int_Counter(2) or Int_Counter(1) or Int_Counter(0);

  -- Tie off unused Clock signals




  -- instantiate control interfaces
  ------------------------------------------
  dram_interface_inst : dram_interface
    generic map (
      STARTUPWAIT => STARTUPWAIT
    ) port map (
      pads      => pads.dram,
      reset     => reset,
      p_clock   => clocks_in.p_clock,
      lad_clock => i_clock,
      user_in   => dram_in,
      user_out  => dram_out
    );

  default_interface_inst : default_interface
    port map (
      pads       => pads.default,
      reset_out  => reset,
      clocks_in  => clocks_in,
      clocks_out => clocks_out,
      lad_in     => lad_in,
      lad_out    => lad_out
    );
  ------------------------------------------
  -- end instantiate control interfaces
  ------------------------------------------

  -- Do not delete/modify this line
  pads <= init_pe_pads;

end qcad;
Listing C.1: QCA hardware simulator code
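
The word packing performed by the DRAM input buffer in Listing C.1 can be easier to follow in isolation. The sketch below is not part of the thesis design: the entity word_packer, its ports, and the flush input are hypothetical names chosen for illustration, and it uses ieee.numeric_std rather than the packages used in the listing. Under those assumptions it collects four 32-bit words into one 128-bit word, marks the word valid when the fourth slot is filled, and allows a partly filled word to be forced out, which roughly corresponds to the dram_super_write case.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical stand-alone packer; all names here are illustrative only.
entity word_packer is
  port (
    clk      : in  std_logic;
    reset    : in  std_logic;
    wr_en    : in  std_logic;                        -- a 32-bit word is presented on din
    flush    : in  std_logic;                        -- force out a partly filled word
    din      : in  std_logic_vector(31 downto 0);
    dout     : out std_logic_vector(127 downto 0);
    dout_vld : out std_logic);                       -- high for one cycle per 128-bit word
end entity word_packer;

architecture rtl of word_packer is
  signal slot : unsigned(1 downto 0);                -- which 32-bit slot is filled next
  signal word : std_logic_vector(127 downto 0);
begin
  dout <= word;

  pack : process (clk, reset)
    variable lo : natural range 0 to 96;
  begin
    if reset = '1' then
      slot     <= (others => '0');
      word     <= (others => '0');
      dout_vld <= '0';
    elsif rising_edge(clk) then
      dout_vld <= '0';
      if wr_en = '1' then
        -- store the incoming word in the slot selected by the two low address bits
        lo := 32 * to_integer(slot);
        word(lo + 31 downto lo) <= din;
        if slot = 3 then
          dout_vld <= '1';                           -- fourth slot filled: word complete
        end if;
        slot <= slot + 1;                            -- wraps from 3 back to 0
      elsif flush = '1' and slot /= 0 then
        -- partly filled word: unused upper slots keep their previous contents
        dout_vld <= '1';
        slot     <= (others => '0');
      end if;
    end if;
  end process pack;
end architecture rtl;
```

Unlike the listing, this sketch keeps its own slot counter instead of sharing dram_addr(1 downto 0) with the read path, so it does not reproduce the interaction between the write and read address logic.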
Appendix D
p_math VHDL Code Listing
The listing that follows is the VHDL code for the main mathematical
core.
1 −− use i n s t r u c t i o n s : f i r s t c y c l e − turn on ce , prov ide gamma value ,
prov ide Ek and Pol values , pu l s e s t a r t s i g n a l f o r one cy c l e
2 −− l e ave s t a r t s i g n a l o f f f o r one cy c l e
3 −− prov ide next Ek and Pol va lue s and pu l s e s t a r t s i g n a l f o r
one cy c l e
4 −− l e ave s t a r t s i g n a l o f f f o r one cy c l e
5 −− repeat u n t i l a l l Ek and Pol va lue s have been sent in
6 −− a f t e r pu l s i ng s t a r t s i g n a l f o r the l a s t time ,
7 −− wait f o r (mult + add + 1) latency , then pu l s e n ew ce l l to
s t a r t a new c e l l
8 −− repeat i n s t r u c t i o n s from f i r s t c y c l e
9 −− wait u n t i l done s i g n a l a c t i va t e s , r e t r i e v e data .
10
11 l i b r a r y IEEE ;
12 use IEEE . STD LOGIC 1164 .ALL;
13 use IEEE .STD LOGIC ARITH.ALL;
14 use IEEE .STD LOGIC UNSIGNED.ALL;
15
entity p_math is
    Port ( pol      : in  STD_LOGIC_VECTOR (31 downto 0); --parallel data inputs for polarization and Ek
           gamma    : in  STD_LOGIC_VECTOR (31 downto 0);
           Ek       : in  STD_LOGIC_VECTOR (31 downto 0);
           start    : in  STD_LOGIC; --pulse this with each new set of data input
           clk      : in  STD_LOGIC;
           ce       : in  std_logic; --clock enable, stays on until "done" triggers
           new_cell : in  std_logic; --pulse this when ready to send in new train of cell data
           reset    : in  STD_LOGIC;
           new_pol  : out STD_LOGIC_VECTOR (31 downto 0); --result output
           done     : out STD_LOGIC); --computation complete signal
end p_math;

architecture Behavioral of p_math is

--some of these floating point units could have latencies reduced to free up hardware
component multiply
    port (
    a            : IN  std_logic_VECTOR (31 downto 0);
    b            : IN  std_logic_VECTOR (31 downto 0);
    operation_nd : IN  std_logic;
    clk          : IN  std_logic;
    sclr         : IN  std_logic;
    ce           : IN  std_logic;
    result       : OUT std_logic_VECTOR (31 downto 0);
    rdy          : OUT std_logic);
end component;

component add
    port (
    a            : IN  std_logic_VECTOR (31 downto 0);
    b            : IN  std_logic_VECTOR (31 downto 0);
    operation_nd : IN  std_logic;
    clk          : IN  std_logic;
    sclr         : IN  std_logic;
    ce           : IN  std_logic;
    result       : OUT std_logic_VECTOR (31 downto 0);
    rdy          : OUT std_logic);
end component;

component fast_adder --floating point adder with latency of 1 for low-latency accumulation
    port (
    a            : IN  std_logic_VECTOR (31 downto 0);
    b            : IN  std_logic_VECTOR (31 downto 0);
    operation_nd : IN  std_logic;
    clk          : IN  std_logic;
    sclr         : IN  std_logic;
    ce           : IN  std_logic;
    result       : OUT std_logic_VECTOR (31 downto 0);
    rdy          : OUT std_logic);
end component;

component sqrt
    port (
    a            : IN  std_logic_VECTOR (31 downto 0);
    operation_nd : IN  std_logic;
    clk          : IN  std_logic;
    sclr         : IN  std_logic;
    ce           : IN  std_logic;
    result       : OUT std_logic_VECTOR (31 downto 0);
    rdy          : OUT std_logic);
end component;

component divider
    port (
    a            : IN  std_logic_VECTOR (31 downto 0);
    b            : IN  std_logic_VECTOR (31 downto 0);
    operation_nd : IN  std_logic;
    clk          : IN  std_logic;
    sclr         : IN  std_logic;
    ce           : IN  std_logic;
    result       : OUT std_logic_VECTOR (31 downto 0);
    rdy          : OUT std_logic);
end component;

component shift_reg --32 bit shift register for use as a delay line
    Generic ( depth : integer := 50);
    Port ( data_in  : in  STD_LOGIC_VECTOR (31 downto 0);
           data_out : out STD_LOGIC_VECTOR (31 downto 0);
           clk      : in  STD_LOGIC);
end component;

component one_bit_shift --1 bit delay line
    Generic ( depth : integer := 10);
    Port ( data_in  : in  STD_LOGIC;
           data_out : out STD_LOGIC;
           clk      : in  STD_LOGIC);
end component;

signal PxEk, sum : std_logic_vector(31 downto 0);
signal two_gamma, pol_math, p_math_squared : std_logic_vector(31 downto 0);
signal pol_math_delayed : std_logic_vector(31 downto 0);
signal square_plus_1, SoP, den : std_logic_vector(31 downto 0);
signal in_counter, calc_counter : std_logic_vector(9 downto 0);

signal add_done, pol_math_rdy, square_done, plus1_rdy : std_logic;
signal mpy_done, accum_rdy, den_rdy, initial : std_logic;

begin

PxEk_mult : multiply --this latency must be greater than adder+1 in order for everything to work
    port map (
    a            => pol, --pol and Ek from main inputs
    b            => Ek,
    operation_nd => start, --signal from main inputs, should be pulsed with each new pol/Ek
    clk          => clk,
    sclr         => reset,
    ce           => ce,
    result       => PxEk,
    rdy          => mpy_done);

--adder latency is reduced to 1 cycles
accum : add --accumulator, stage 1b (multiplier above is stage 1a)
    port map (
    a            => PxEk,
    b            => SoP,
    operation_nd => mpy_done,
    clk          => clk,
    sclr         => reset,
    ce           => ce,
    result       => sum,
    rdy          => add_done);

process (clk, reset, new_cell)
begin

    --here the accumulation process is handled by careful timing and control
    if (reset = '1' or new_cell = '1') then
        SoP <= (others => '0');
        calc_counter <= (others => '0');
        initial <= '1'; --this signal is an indicator to whether we are at the beginning

        accum_rdy <= '0';
    elsif (clk'event and clk = '1') then
        if (initial = '1' and mpy_done = '1') then
            initial <= '0';
        elsif (add_done = '1') then
            SoP <= sum; --running total updated manually, easier to manage than a simple loop
            calc_counter <= calc_counter + 1;
        elsif ((calc_counter = in_counter) and (initial = '0')) then
            --needed to keep accum_rdy from going into a permanent high
            calc_counter <= calc_counter + 1;
        end if;

        --this is where "initial" is helpful, because the two counters are also equal at the beginning
        if ((calc_counter = in_counter) and (initial = '0')) then
            accum_rdy <= '1';
        else
            accum_rdy <= '0';
        end if;
    end if;

end process;

--this process controls "in_counter" which counts how many sets of pol/Ek come in
--"in_counter" is compared to "calc_counter", controlled above, to determine when
--the accumulation is finished.
process (clk, reset, new_cell)
begin

    if (reset = '1' or new_cell = '1') then
        in_counter <= (others => '0');
    elsif (clk'event and clk = '1') then
        if (start = '1') then
            in_counter <= in_counter + 1;
        end if;
    end if;

end process;

--trick to double a floating point number, to avoid using an extra multiplier
--this method will only cause issue when gamma > 1.7e38, which it never is.
--saves about 200 slices
twoGamma : process (clk, reset) --gamma doubling, stage 1c
begin
    if (reset = '1') then
        two_gamma <= (others => '0');
    elsif (clk'event and clk = '1') then
        if (start = '1') then --add 1 to exponent
            two_gamma <= gamma(31) & (gamma(30 downto 23)+1) & gamma(22 downto 0);
        end if;
    end if;
end process;

divvy : divider --sum of products divided by gamma, stage 2
    port map (
    a            => SoP, --result from accumulation
    b            => two_gamma,
    operation_nd => accum_rdy, --signal from accumulator control to say it's done
    clk          => clk,
    sclr         => reset,
    ce           => ce,
    result       => pol_math,
    rdy          => pol_math_rdy);

pol_mathDelay : shift_reg --delay line from p_math to final divider
    Generic map (depth => 6) --depth is the combined latency of parts on other path (multiplier, adder, sqrt)
    Port map (data_in  => pol_math,
              data_out => pol_math_delayed,
              clk      => clk);

square : multiply --square the divider result, stage 3
    port map (
    a            => pol_math,
    b            => pol_math,
    operation_nd => pol_math_rdy,
    clk          => clk,
    sclr         => reset,
    ce           => ce,
    result       => p_math_squared,
    rdy          => square_done);

plus_one : add --add one to squared pol_math, stage 4
    port map (
    a            => p_math_squared,
    b            => X"3F800000", -- floating point 1
    operation_nd => square_done, -- nd lines dominoing from rdy lines...
    clk          => clk,
    sclr         => reset,
    ce           => ce,
    result       => square_plus_1,
    rdy          => plus1_rdy);

make_den : sqrt --square root function for denominator, stage 5
    port map (
    a            => square_plus_1,
    operation_nd => plus1_rdy,
    clk          => clk,
    sclr         => reset,
    ce           => ce,
    result       => den,
    rdy          => den_rdy);

final : divider --give final result, stage 6, depends on results from sqrt (stage 5) and divvy (stage 2)
    port map (
    a            => pol_math_delayed,
    b            => den,
    operation_nd => den_rdy,
    clk          => clk,
    sclr         => reset,
    ce           => ce,
    result       => new_pol,
    rdy          => done); --connected to main output

end Behavioral;
Listing D.1: Polarization math arithmetic core hardware code
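The stage comments in Listing D.1 describe a fixed sequence of arithmetic steps, which makes the core easy to cross-check in software. The following is a minimal Python sketch of such a check and is not part of the original design. The function names `double_by_exponent` and `p_math_reference` are hypothetical. The model does its arithmetic in double precision, whereas the hardware uses single-precision floating point cores, so the two should agree only approximately.

```python
import math
import struct


def double_by_exponent(gamma: float) -> float:
    """Double a single-precision value by adding 1 to its exponent field.

    Mirrors the stage 1c concatenation in Listing D.1:
    sign bit & (8-bit exponent + 1) & 23-bit mantissa.
    """
    bits = struct.unpack(">I", struct.pack(">f", gamma))[0]
    sign = bits >> 31
    # The exponent add is kept to 8 bits, like the VHDL vector addition.
    exponent = (((bits >> 23) & 0xFF) + 1) & 0xFF
    mantissa = bits & 0x7FFFFF
    doubled = (sign << 31) | (exponent << 23) | mantissa
    return struct.unpack(">f", struct.pack(">I", doubled))[0]


def p_math_reference(pols, eks, gamma):
    """Software model of the stages annotated in Listing D.1.

    pols and eks are equal-length sequences of polarization and Ek values
    for one cell; gamma is the value presented on the gamma input.
    """
    # Stages 1a/1b: multiply each pol/Ek pair and accumulate the products.
    sum_of_products = sum(p * ek for p, ek in zip(pols, eks))
    # Stage 1c and stage 2: divide by 2*gamma, doubled via the exponent trick.
    x = sum_of_products / double_by_exponent(gamma)
    # Stages 3 to 6: square, add one, square root, final divide.
    return x / math.sqrt(x * x + 1.0)
```

For example, `double_by_exponent(1.5)` turns the bit pattern `3FC00000` into `40400000`, which is 3.0. Calling `p_math_reference([1.0, -1.0, 1.0], [0.5, 0.25, 0.25], 0.25)` gives a sum of products of 0.5 and a divisor of 0.5, so the result is 1/sqrt(2), about 0.7071. The sketch also reproduces the limit noted in the listing comment: an input of 2^127 (about 1.7e38) already has the largest finite exponent, so incrementing it yields infinity rather than a doubled value.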
