Pond: A Robust, scalable, massively parallel computer architecture by Spirer, Adam




Pond: A Robust, scalable, massively parallel
computer architecture
Adam Spirer
This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion
in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact ritscholarworks@rit.edu.
Recommended Citation
Spirer, Adam, "Pond: A Robust, scalable, massively parallel computer architecture" (2010). Thesis. Rochester Institute of Technology.














Dr. Dorin Patru, Assistant Professor
Thesis Advisor, Department of Electrical and Microelectronic Engineering
Dr. Eric Peskin, Assistant Professor
Committee Member, Department of Electrical and Microelectronic Engineering
Dr. Daniel Phillips, Associate Professor
Committee Member, Department of Electrical and Microelectronic Engineering
Dr. Sohail Dianat, Professor
Department Head, Department of Electrical and Microelectronic Engineering
DEPARTMENT OF ELECTRICAL AND MICROELECTRONIC ENGINEERING
KATE GLEASON COLLEGE OF ENGINEERING
ROCHESTER INSTITUTE OF TECHNOLOGY
ROCHESTER, NEW YORK
May, 2010
Thesis Release Permission Form
Rochester Institute of Technology
Kate Gleason College of Engineering
Title:
Pond: A Robust, Scalable, Massively Parallel Computer Architecture
I, Adam R. Spirer, hereby grant permission to the Wallace Memorial Library to repro-





Thank you to my family and my colleagues for supporting me in this research and the
writing of this document, specifically to my advisor Dr. Dorin Patru and my committee
members for their invaluable feedback.
Abstract
Pond: A Robust, Scalable, Massively Parallel Computer Architecture
Adam R. Spirer
Supervising Professor: Dr. Dorin Patru
A new computer architecture, intended for implementation in late and post silicon
technologies, is proposed. The architecture is a fine-grained, inherently parallel system
consisting of a large grid of thousands or millions of simple atomic processors (APs)
employing a simple instruction set. Each AP is configured as either a program instruction
or a data storage element. These elements are organized into logical entities, analogous to
traditional programming functions/methods and data structures. Programming work is
underway to compile and run programs from traditional sequential code, where parallelism
is automatically discovered at a high level, at both the instruction level and the function
level, and integrated into the object code that is then sent to the processor. The result is a
massively parallel architecture that fully exploits instruction and thread-level parallelism.
The architecture design is presented, in-progress work involving conversion of existing
code is discussed, and examples are shown to indicate the speedup potential that exists in
this new architecture when compared to current architectures.
Contents
Acknowledgments
Abstract
1 Background and Motivation
2 Architecture
2.1 Organization
2.2 Instruction Set
2.3 Implementation of Compound Operations
3 Communications
3.1 Handshaking
3.2 Message Passing
3.3 Send and Receive Layers
4 Operation
4.1 Entity Movement
4.2 Entity Abutment
4.3 Instruction Execution
4.4 Function Calls
4.5 Input/Output
4.6 Loops
4.7 Exception Handling
5 Performance Evaluation
5.1 Machine Cycle
5.2 Minimum and Maximum Execution Times
5.3 Results
5.3.1 Integer Multiplication
5.3.2 Integer Division
5.3.3 Floating Point Operations
6 Programming and Compilation
6.1 Principles
6.2 Code Conversion Steps
6.2.1 Parsing and CIL Representation
6.2.2 Semantics Processing
6.2.3 Detection of Parallelism
6.2.4 Translation to Machine Code
6.3 Compiler Development
6.4 Example Simulations
7 Functional Simulation
7.1 Goals
7.2 Software Design
7.3 Current Functionality
7.3.1 Opening Machine Code Files
7.3.2 Atomic Processor Record View
7.3.3 Illustration of a Global Cast
7.4 To Be Implemented
8 Features and Benefits
9 Related Work
10 Future Work and Conclusions
10.1 Detection of Component Defects
10.2 Implementation Considerations
10.2.1 Size Requirements
10.3 Software Development
10.4 Contributions to the State-of-the-Art
Bibliography
A Functional Simulator Programming Guide
A.1 File Structure
A.2 Variables and Structs
A.2.1 record
A.3 Functions
A.3.1 wnd functions.c
A.3.2 ap grid ui.c
A.3.3 ap comm.c
A.3.4 sim engine.c
List of Tables
2.1 Configuration and data processing fields for an AP
2.2 Atomic Processor Instruction Set
3.1 Handshake codes for atomic processor communications
3.2 Transmit directions for global (G), beamed (B), and entity (E) casts
3.3 Message buffer format for result message. All numbers represent bits.
3.4 Message buffer format for coded message types; codes are described in Table 3.5. All numbers represent bits.
3.5 Message codes and descriptions for non-result message types
5.1 Summary of performance metrics, measured in cycles
6.1 Variables table generated for data structure for vector addition code
6.2 Variables table generated for data structure for vector addition code
6.3 Vector addition machine code for function definition elements
6.4 Vector addition code machine code for data structure elements
6.5 Fibonacci series machine code for function definition elements
6.6 Fibonacci series machine code for data structure elements
List of Figures
2.1 A simple representation of the atomic processor sea.
2.2 An example sea of APs, populated with functions and data.
3.1 Message and handshake buffers in an atomic processor.
3.2 Timeline of handshake process; AP x is sending a message to its neighbor, AP y.
3.3 Propagation of communication casts in the sea.
3.4 Send and receive layers of communication buffer circuitry.
4.1 Atomic processor moving around an entity; the dashed line indicates the desired path, and the solid lines represent the redirected path the entity takes.
4.2 Timeline of an instruction cycle, including communication broadcasts.
4.3 Input/output interface at border of sea.
4.4 Examples of loop handling in the architecture: program flow for loop iterations created as separate functions, program flow for parallelizable loop iterations executed within a single function, and program flow for sequential loop iterations executed within a single function.
5.1 Examples of best case 100% parallelizable code, worst case 100% parallelizable code, best case 100% sequential code, and worst case 100% sequential code.
5.2 Example multiplication superentity using left-shift algorithm.
5.3 Example integer division superentity.
5.4 Example floating point multiplication superentity, including use of integer multiplication function for operating on significands.
6.1 Vector addition code as superentity, derived from machine code representation.
6.2 Fibonacci series code as superentity, derived from machine code representation.
6.3 Vector addition code superentity.
6.4 Fibonacci series code superentity.
7.1 Software screen capture of Simulator window with loaded processor sea.
7.2 Software screen capture of AP Record view.
7.3 Software screen captures of global cast propagation; consecutive machine cycles are shown.




1 Background and Motivation
Parallelism and concurrency are inherent in many computational tasks. Techniques that
exploit instruction and thread level parallelism in traditional von Neumann architectures
have been successfully applied in single processors, as described by Hennessy and Patterson
in [1]. During the past decade researchers and manufacturers have turned to multi-core
processors, which at the present time are limited to just a few cores [2–7]. Scaling up the
techniques used to exploit instruction and thread level parallelism in single core processors
to many core processors is challenging for both hardware and software designers [8–11].
As pointed out by Hennessy and Patterson in [1], multi-core processors are a combination
of computer architecture and communications architecture. Computer networks on
a chip or cluster computing on a chip are adapting the vast knowledge base of designs
and architectures of macro computer networks to the micro scale [12–24]. Marculescu
et al. in [25] classify outstanding research problems related to networks on chip into 15
categories. Predominant are problems related to communications infrastructure and
communications paradigms, as illustrated by [26–52]. Dongarra et al. explore the potential
symbiosis between networks on chip and multicore processors in [53].
Late and post silicon era integrated circuit fabrication technologies will continue to in-
crease the number of components on a chip to billions and trillions. The sheer increase in
number will not translate into an increase in performance unless new parallel and concur-
rent architectures are developed, as pointed out by Rabaey and Malik in [54], and Wenmei
et al. in [55]. These new architectures will have to address reliability at the circuit and
system levels because some components will experience premature, transient or permanent
failures, as highlighted by Austin et al. in [56]. Lei Zhang et al. address reliability and fault
tolerance in networks on chip in [57]. Power dissipation will have to be mitigated starting
at the system level. This is already being considered in multicore processors [58, 59], and
in networks on chip [60–65]. Nano architectures attempt to specifically address the afore-
mentioned challenges posed by late and post silicon technologies [66–74].
In this thesis, an architecture is proposed that can efficiently use from a few hundred to
billions of cores. Its implementation is cost effective in late and post silicon technologies,
resilient to component failures, and massively parallel, and it can support a high degree
of concurrency without a radical paradigm shift in programming. The architecture shares
traits with multicore processor architectures, networks on chip, nano architectures,
massively parallel architectures, resilient and fault tolerant architectures, reconfigurable
computing, data flow architectures, and neural networks. These similarities and differences will
be discussed after the proposed architecture is presented.
In Chapter 2, the organization of the architecture is covered, followed by the
communications architecture in Chapter 3, operation in Chapter 4, performance evaluation in
Chapter 5, programming in Chapter 6, architectural features and benefits in Chapter 8, related
work in Chapter 9, and future work and conclusions in Chapter 10.




2 Architecture
As discussed in Chapter 1, a primary question in developing an effective parallel computing
system is the number of cores. Both extremes have been observed; some architectures
implement a small number of complex cores, while others implement a large number of
simple cores. The proposed architecture takes the latter approach and implements a
fine-grained system with a large number of simple cores.
2.1 Organization
Figure 2.1 shows the top-level layout of the processing cores, called atomic processors
(APs), for this architecture.
As shown, the architecture is composed of a grid-like sea of APs. At any given time, an
AP acts either as a program instruction (including its associated operands) or as a storage
element containing one data word. Each AP has all the hardware it needs to execute
any instruction, theoretically allowing all available instruction-level parallelism to
be exploited. Programs on the architecture are logically organized into function definitions
(FD) consisting of multiple APs containing instructions, which are called to execute when
needed. At runtime, function definitions are called to execute as function instances (FI),
which are copies of the corresponding function definition, and are capable of operating on




















Figure 2.1: A simple representation of the atomic processor sea.
logically-organized groups of APs are collectively called entities. These entities are capable
of “moving” through the sea of processing elements by way of transferring their individual
instructions and data words (entity elements) to adjacent APs, allowing them to interact
with other entities. Figure 2.2 shows an example of what a populated small processing sea
might look like (actual sea and program/data sizes will be significantly larger to reflect the
application).
Each AP is capable of storing both a function definition element and a function instance
or data structure element at the same time. This breaks down the sea into two logical
layers: the definition layer (function definitions; referred to as the standby layer in
Table 2.1) and the execution layer (function instances and data structures). Because
programs are composed of common high-level programming constructs (function definitions,
instances, and data structures), their execution takes place in much the same way as in
traditional architectures.
Instantiation of the function is equivalent to the function call in high-level programming,





Figure 2.2: An example sea of APs, populated with functions and data.
APs, creating a function instance. The function instance can then move and interact with
data structures, which are also present in the execution layer. Interaction between function
instances and data structures is preceded by a move and abut process, in which the function
instance locates the data structure of interest and abuts it, creating a superentity in which
instructions can fetch data as needed.
Each AP is functionally identical and operationally independent. In addition to operands
and processing circuitry, each AP stores configuration information: the entity number,
the individual entity element number, the execution order number (indicating when the
instruction can execute), and other information generated at compile time. Table 2.1
shows all the independent configuration fields present in a single AP, as well as the
operand/data fields. These fields are duplicated for the standby layer (for function
definitions) as well as the execution layer (for function instances and data structures). The
only exceptions noted are that the entity type field is slightly different between the two
layers, and the data operand and some other execution-time values are not present in the
standby layer. Development is currently being done using a sea of 64 K atomic processors,
which is considered to satisfy the needs of an embedded computer system. Field bit widths
are also shown in Table 2.1 for 1 Tera atomic processors, which is considered to satisfy the
needs of a desktop computer system.
The fields presented in Table 2.1 provide space for all necessary data, instruction oper-
ation code, execution conditions, and the information necessary to exploit the parallelism
and concurrency discovered at compile time. This also allows for concurrency in instruc-
tions within a function instance, allowing for convergence back to sequential operation if
needed. In addition, all APs are identified physically and logically. Once its operands
have been loaded (they arrive via a communications broadcast from another AP in the sea;
see Chapter 3), each AP executes its instruction independently and
does not require the resources of any other AP. APs are synchronous blocks, but the entire
sea can be asynchronous, as there is no specific need for neighboring APs to operate syn-
chronously. This eliminates the need for global clock synchronization, a common design
issue in large circuits. While implementing this architecture via a globally asynchronous,
locally synchronous strategy is possible, there are many issues to address in terms of the
asynchronous interfaces among blocks [75–78]. In silicon technologies, fields could be
implemented with SRAM cells similar to the way FPGA configuration bits are held (this
is a topic for future work and is discussed in Chapter 10). Note that the data widths for ID
numbers, execution order, and execution repeat values are defined by the sea size; however,
a 64K sea size is assumed in this presentation.
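The dependence of ID field widths on sea size can be illustrated with a minimal sketch. It assumes that the "1 Tera" sea means 2**40 APs (consistent with the 40-bit fields in Table 2.1) and, purely for illustration, that the sea is square with the X coordinate held in the upper half of the hardware ID; the function names are hypothetical and not taken from the simulator.

```python
def id_width_bits(sea_size):
    """Smallest number of bits that gives every AP in the sea a distinct ID."""
    return max(1, (sea_size - 1).bit_length())


def split_hardware_id(hardware_id, sea_size):
    """Split a hardware ID into (x, y).

    Assumption for illustration only: a square sea, with X stored in the
    upper half of the ID and Y in the lower half.
    """
    half = id_width_bits(sea_size) // 2
    return hardware_id >> half, hardware_id & ((1 << half) - 1)


print(id_width_bits(64 * 1024))                # 64 K sea
print(id_width_bits(2 ** 40))                  # 1 Tera sea, taken here as 2**40 APs
print(split_hardware_id(0x1234, 64 * 1024))    # (x, y) under the assumed layout
```

Under these assumptions the computed widths agree with the 16-bit and 40-bit ID fields listed in Table 2.1.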
2.2 Instruction Set
Table 2.2 shows the instruction set for each AP in the sea, and their associated operation






Table 2.1: Configuration and data processing fields for an AP
Field                                       Bits (64 K)  Bits (1 T)  Description
Entity Type (Standby)                       1            1           0=Unoccupied; 1=FD element
Entity Type (Execution)                     2            2           0=Unoccupied; 1=FI element; 2=DS element
Hardware ID                                 16           40          X-Y physical location of element
Entity ID                                   16           40          Entity logical ID number
Element ID                                  16           40          Entity element logical ID number
Target Entity ID                            16           40          FI: Entity ID of the data structure that the FI entity is associated with
Primary Operand/Data Word (Execution Only)  32           64          FD/FI: Primary operand value; DS: Data word value
Secondary Operand (Execution Only)          32           64          FD/FI: Secondary operand value
Primary Operand ID                          16           16          FD/FI: Primary operand ID, corresponds to an associated DS entity or element ID
Secondary Operand ID                        16           16          FD/FI: Secondary operand ID, corresponding to associated DS entity ID or element ID
Primary Operand Type                        4            4           FD/FI: Primary operand type; for future implementation
Secondary Operand Type                      4            14          FD/FI: Secondary operand type; for future implementation
Operation Code                              8            8           FD/FI: Instruction/operation code
Execution Conditions                        8            8           FD/FI: Status bits (C, N, V, Z, !C, !N, !V, !Z) required for execution of instruction
Status Bits ID                              16           40          FD/FI: Element ID whose execution result produces the status bits for this instruction (compared against execution conditions)
Execution Order                             16           40          FD/FI: Element execution order number
Prev.-Execution Order                       16           40          FD/FI: Execution order number of element(s) that must execute before this element
Result Value (Execution Only)               32           64          FI: Result value of last execution
Status Bits (Execution Only)                8            8           FI: Resulting status bits of last execution
Execution Count (Execution Only)            16           16          FI: How many times must this FI receive a result broadcast with a particular execution order number (from a previous instruction) before executing itself?
Execution Count ID                          16           40          FI: Associated DS Element ID that holds the execution count value
Message Code (Execution Only)               8            8           FI: Message code for move, abut, and other special communications operations
Total to Store:                             338          698
To move FD/FI element:                      226          482
To move DS element:                         84           188
codes. The instruction set has been chosen to include all arithmetic and logic operations.
Each instruction can be conditional or unconditional.
The instruction set qualifies as a reduced instruction set, in keeping with the goal
of having simple processing elements. This does not limit the overall processor/architecture
in terms of complexity; a complex operation can almost always be implemented as a set
of smaller operations. The given instruction set supports very common RISC instructions
that can be combined into the complex operations that may be required by an
application. That is, multipliers, floating point units, and other traditional function units can
be implemented as entities containing the appropriate sequence of instructions to execute.
Note that these functions will be “soft” in that they will be inherently tailored at compile
time to the application at hand, making them more efficient.
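As an illustration of how a compound operation can be built from simple ones, the following sketch composes a 32-bit multiplication from additions, single-bit shifts, and a bitwise AND, which are operations of the kind listed in Table 2.2. It is a sequential shift-and-add sketch under stated assumptions, not the parallel multiplication superentity evaluated later in the thesis; the helper names are hypothetical, and the Python `if` stands in for what would be conditional execution on status bits.

```python
MASK32 = 0xFFFFFFFF
MSB = 0x80000000


def addps(primary, secondary):
    """Add primary and secondary operands (32-bit wraparound assumed)."""
    return (primary + secondary) & MASK32


def shl(primary):
    """Shift left by one bit, padding with zero."""
    return (primary << 1) & MASK32


def shr(primary):
    """Shift right by one bit, padding with the MSB as Table 2.2 describes."""
    return (primary >> 1) | (primary & MSB)


def multiply(a, b, width=32):
    """Low 32 bits of a * b, using only add, shift, and AND steps."""
    product = 0
    for _ in range(width):
        if b & 1:                 # test the lowest multiplier bit (bitwise AND)
            product = addps(product, a)
        a = shl(a)                # next partial product
        b = shr(b)                # expose the next multiplier bit
    return product


print(multiply(6, 7))
```

The loop runs a fixed number of iterations because the right shift pads with the MSB, so the multiplier does not necessarily reach zero.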
A given function may have multiple RETURN instructions that can be reached, just
as in most functions in high-level languages. However, in this case, multiple RETURN
instructions may be executed in the same path to allow more data to be sent from the called
function back to the calling function. Regarding execution counts, the default behavior is an
execution count of 1, which will be realized if the FI element’s execution count ID matches
its own element ID. That is, as soon as an FI element receives a message that another FI
element, whose execution order number matches its previous-execution order number, has
executed, it will execute. In the case where there are execution conditions, the FI element will
also compare the message’s source element ID to its own status bits ID when considering
the previous-execution order number, only executing if the element ID and status bits ID
values match. All instructions are therefore capable of conditional execution. However, if an
FI element’s status bits ID matches its own element ID, then the execution conditions field
is ignored, i.e., the instruction will execute unconditionally.
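The execution rule described above can be summarized in a short sketch. The class and field names below are hypothetical, and the bit layout of the execution conditions field (low nibble for flags that must be set, high nibble for flags that must be clear) is an assumption made only so that the example is concrete.

```python
from dataclasses import dataclass

# Assumed flag encoding for illustration: one bit per status flag.
C, N, V, Z = 0x1, 0x2, 0x4, 0x8


@dataclass
class ResultMessage:
    source_element_id: int
    execution_order: int
    status_bits: int = 0          # flags produced by the sending element


@dataclass
class FIElement:
    element_id: int
    prev_execution_order: int
    status_bits_id: int           # equal to element_id means unconditional
    execution_conditions: int = 0
    execution_count: int = 1
    received: int = 0

    def conditions_met(self, flags):
        # Assumed layout: low nibble = must be set, high nibble = must be clear.
        must_set = self.execution_conditions & 0x0F
        must_clear = (self.execution_conditions >> 4) & 0x0F
        return (flags & must_set) == must_set and (flags & must_clear) == 0

    def on_result(self, msg):
        """Return True if this element should execute after seeing msg."""
        if msg.execution_order != self.prev_execution_order:
            return False
        if self.status_bits_id != self.element_id:      # conditional element
            if msg.source_element_id != self.status_bits_id:
                return False
            if not self.conditions_met(msg.status_bits):
                return False
        self.received += 1
        if self.received < self.execution_count:
            return False
        self.received = 0
        return True


unconditional = FIElement(element_id=5, prev_execution_order=2, status_bits_id=5)
print(unconditional.on_result(ResultMessage(source_element_id=3, execution_order=2)))
```

An element whose execution count is greater than 1 simply waits in this sketch until it has seen that many qualifying result messages.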
Table 2.2: Atomic Processor Instruction Set
Mnemonic Operation Code Operation
NOP 0x00 No operation.
ADDPS 0x01 Add primary and secondary operands.
ADDPC 0x02 Add carry and primary operand.
ADDPSC 0x03 Add primary operand, secondary operand, and carry.
SUBPS 0x04 Subtract secondary operand from primary operand.
SUBPC 0x05 Subtract carry from primary operand.
SUBPSC 0x06 Subtract secondary operand and carry from primary operand.
INC 0x07 Increment primary operand.
DEC 0x08 Decrement primary operand.
INV 0x09 Bitwise inversion of primary operand.
AND 0x0A Bitwise AND of primary and secondary operands.
OR 0x0B Bitwise OR of primary and secondary operands.
XOR 0x0C Bitwise XOR of primary and secondary operands.
SETC 0x0D Explicitly set ‘carry’ flag.
SETZ 0x0E Explicitly set ‘zero’ flag.
SETN 0x0F Explicitly set ‘negative’ flag.
SETV 0x10 Explicitly set ‘overflow’ flag.
RSTC 0x11 Explicitly reset ‘carry’ flag.
RSTZ 0x12 Explicitly reset ‘zero’ flag.
RSTN 0x13 Explicitly reset ‘negative’ flag.
RSTV 0x14 Explicitly reset ‘overflow’ flag.
SHL 0x15 Shift left primary operand, pad with zeroes.
SHR 0x16 Shift right primary operand, pad with MSB.
SHLC 0x17 Shift left primary operand through carry, pad with zeroes.
SHRC 0x18 Shift right primary operand through carry, pad with MSB.
CALL 0x19 Function call instruction; requests instantiation of function definition logical ID# in primary operand to process data structure logical ID# in secondary operand. Execution of this instruction completes when the instruction receives a RETURN result from the called function instance.
RETURN 0x1A Function return instruction; broadcasts return value of function instance to the corresponding CALL instruction in the calling function instance. Primary operand holds the data word to return, and secondary operand holds the logical entity ID of the calling function instance.
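A subset of the operation codes in Table 2.2 can be modeled with a small interpreter sketch. The opcode values are taken from the table; everything else is an assumption for illustration: 32-bit operands, Z and N derived from the result, and C modeled only for the add/subtract/increment/decrement group as a carry-out or borrow. The flag-setting, shift-through-carry, CALL, and RETURN instructions are left out.

```python
MASK32 = 0xFFFFFFFF
MSB = 0x80000000

# opcode -> function of (primary, secondary, carry) giving an unbounded result
OPERATIONS = {
    0x01: lambda p, s, c: p + s,                  # ADDPS
    0x02: lambda p, s, c: p + c,                  # ADDPC
    0x03: lambda p, s, c: p + s + c,              # ADDPSC
    0x04: lambda p, s, c: p - s,                  # SUBPS
    0x05: lambda p, s, c: p - c,                  # SUBPC
    0x06: lambda p, s, c: p - s - c,              # SUBPSC
    0x07: lambda p, s, c: p + 1,                  # INC
    0x08: lambda p, s, c: p - 1,                  # DEC
    0x09: lambda p, s, c: ~p,                     # INV
    0x0A: lambda p, s, c: p & s,                  # AND
    0x0B: lambda p, s, c: p | s,                  # OR
    0x0C: lambda p, s, c: p ^ s,                  # XOR
    0x15: lambda p, s, c: p << 1,                 # SHL, pad with zeroes
    0x16: lambda p, s, c: (p >> 1) | (p & MSB),   # SHR, pad with MSB
}


def execute(opcode, primary, secondary=0, carry=0):
    """Return (result, flags) for one instruction under the stated assumptions."""
    raw = OPERATIONS[opcode](primary, secondary, carry)
    result = raw & MASK32
    arithmetic = 0x01 <= opcode <= 0x08
    flags = {
        "Z": result == 0,
        "N": bool(result & MSB),
        # Assumed: C reports carry-out or borrow for arithmetic opcodes only.
        "C": arithmetic and not 0 <= raw <= MASK32,
    }
    return result, flags


print(execute(0x01, 0xFFFFFFFF, 1))
```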
2.3 Implementation of Compound Operations
As shown in Section 2.2, each AP supports only a small set of simple integer manipulation
instructions, holding true to the philosophy of processing via a large number of simple pro-
cessing elements. More complex operations such as multiplication and floating point op-
erations are very commonly used in many applications, and a successful high-performance
architecture should implement these functions efficiently. These compound instructions
can be implemented as function definitions that are called when required by other function




3 Communications
The communication architecture driving these processing APs uses a nearest-neighbor
communication methodology; each AP can communicate directly with its eight adjacent
neighbors. For communication with non-adjacent processors, neighboring APs act as
conduits, passing messages on to other APs in the sea. These communications may take
place as broadcasts in all directions, either to the entire sea or to just a
specific function definition, instance, or data structure (an entity), or as a directional
broadcast to a specific AP inside or outside the entity.
3.1 Handshaking
The APs in the sea communicate via a set of eight pairs of handshake buffers (two for each
neighbor; one to send messages and one to receive messages) for handshaking, as well as a
common message buffer for passing messages and data once communication initialization
has been established via the handshaking protocol. Each AP, as well as its eight neighbors,
can write to and read from these buffers. The buffers are shown visually in Figure 3.1.
In addition, a dual message buffer is provided (one to store a message to be sent, and one
to load a received message).
The handshaking algorithm is a three-step process using the handshake buffer. A hand-









Figure 3.1: Message and handshake buffers in an atomic processor.
buffer filled in the next cycle. If and only if the handshake buffer is currently set to 0, AP x
initiates a request to AP y by filling AP y’s corresponding handshake buffer with a specific
handshake code. The code specifies the particular type of communication that will take
place. This completes cycle 1 of the handshake. Next, AP y indicates its availability to
receive a message in the message buffer by resetting the handshake buffer back to 0. This
completes cycle 2. Finally, AP x loads the contents of its message buffer (the message/data
it wishes to send) into AP y’s message buffer. This completes cycle 3, and the
communication cycle is complete. Naturally, there will be many cases where requests arrive at a
particular AP from multiple neighbors at the same time. The AP will process one request
at a time; the remaining APs will wait for their requests to be served. As described in cycle
2, AP y indicates its availability by resetting the appropriate handshake buffer. Until this
occurs, AP x will wait. Priority in communication handling is also implemented; priorities
are based on the handshake code, and certain codes will be accepted before others.
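The three handshake cycles can be sketched as follows. This is a simplified model, not the simulator's implementation: the class and method names are hypothetical, each AP is reduced to a set of incoming handshake buffers and a single incoming message buffer, and the priority rule (lowest code value first) is a placeholder assumption, since the text only states that some codes are accepted before others.

```python
READY = 0x0          # handshake buffer is free
GLOBAL_CAST = 0x1    # code values as listed in Table 3.1
P2P_CAST = 0x5


class AP:
    def __init__(self, name):
        self.name = name
        self.handshake_in = {}    # sender name -> code, one buffer per neighbor
        self.message_in = None    # received-message buffer

    def request(self, receiver, code):
        """Cycle 1: write a handshake code into the receiver's buffer if it is 0."""
        if receiver.handshake_in.get(self.name, READY) != READY:
            return False          # buffer still busy; the sender must wait
        receiver.handshake_in[self.name] = code
        return True

    def accept(self):
        """Cycle 2: choose one pending request and reset its buffer to 0."""
        pending = [(code, sender)
                   for sender, code in self.handshake_in.items()
                   if code != READY]
        if not pending:
            return None
        code, sender = min(pending)   # placeholder priority: lowest code first
        self.handshake_in[sender] = READY
        return sender, code

    def deliver(self, receiver, message):
        """Cycle 3: load the sender's message into the receiver's message buffer."""
        receiver.message_in = message


x, y, z = AP("x"), AP("y"), AP("z")
x.request(y, P2P_CAST)
z.request(y, GLOBAL_CAST)
print(y.accept())                 # the request that y serves first
```

In this sketch a sender whose request is not chosen stays pending in its buffer, which mirrors the waiting behavior described for AP x.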
Figure 3.2 shows a timeline representation of a communication cycle in which one
element (occupying AP x) sends a message to another element (occupying AP y).
Figure 3.2: Timeline of handshake process; AP x is sending a message to its neighbor, AP y.
The handshake codes are 4 bits wide, and are shown in Table 3.1.
Table 3.1: Handshake codes for atomic processor communications
Code       Operation              Description
0 (0000)   AP is ready            Indicates that the AP is ready to receive the next handshake code
1 (0001)   Global cast            Indicates that the communication is to be propagated throughout the entire sea
2 (0010)   Beamed cast            Indicates that the communication is to be propagated directionally to a specific AP (can be either inside or outside the entity)
3 (0011)   Entity cast            Indicates that the communication is to be propagated inside an entire entity only, ending propagation at the entity border
4 (0100)   Local cast             Indicates that the communication is to be sent only to the AP’s eight neighbors
5 (0101)   P2P cast               Indicates that the communication is to be sent only to a single neighbor
6 (0110)   FD Move request        Indicates that the communication is a P2P cast request for a FD to move into a neighboring cell
7 (0111)   FI Move request        Indicates that the communication is a P2P cast request for a FI to move into a neighboring cell
8 (1000)   DS Move request        Indicates that the communication is a P2P cast request for a DS to move into a neighboring cell
9 (1001)   Abut request           Indicates that the communication is a P2P cast request by a FI to abut a neighboring DS
10-14      Reserved               Reserved for future expansion
15 (1111)  AP is non-functioning  Used to indicate to neighbors that the AP is currently unable to receive communications. If an AP is deemed non-functional, then the AP’s handshake buffers will be all set to 1111, and neighboring APs will not attempt to handshake with this AP.
Table 3.2 shows the propagation directions for the global, beamed, and entity casts listed
in Table 3.1. Figure 3.3 shows visual examples of the casts’ propagations in the sea.
Table 3.2: Transmit directions for global (G), beamed (B), and entity (E) casts
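Received From | N | NE | E | SE | S | SW | W | NW
N | - | - | - | - | G,B,E | - | - | -
NE | - | - | - | - | G,E | G,B,E | G,E | -
E | - | - | - | - | - | - | G,B,E | -
SE | G,E | - | - | - | - | - | G,E | G,B,E
S | G,B,E | - | - | - | - | - | - | -
SW | G,E | G,B,E | G,E | - | - | - | - | -
W | - | - | G,B,E | - | - | - | - | -
NW | - | - | G,E | G,B,E | G,E | - | - | -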
Figure 3.3: Propagation of communication casts in the sea.
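The forwarding rule in Table 3.2 can be sketched as a small lookup function. This is a minimal illustration, not the thesis implementation. It assumes the entries of Table 3.2 are read as follows: a cast arriving from N, E, S, or W is forwarded only in the opposite direction; a global or entity cast arriving on a diagonal is also forwarded to the two directions adjacent to the opposite one; and a beamed cast always continues straight on. The names `DIRS` and `transmit_directions` are hypothetical.

```python
# Compass directions in clockwise order, matching the header of Table 3.2.
DIRS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def transmit_directions(received_from: str, cast: str) -> list:
    """Return the directions in which an AP forwards a cast.

    cast is "G" (global), "B" (beamed), or "E" (entity).
    """
    if cast not in ("G", "B", "E"):
        raise ValueError("cast must be 'G', 'B', or 'E'")
    i = DIRS.index(received_from)
    opposite = DIRS[(i + 4) % 8]
    if cast == "B" or i % 2 == 0:
        # Beamed casts, and any cast arriving from N/E/S/W, go straight on.
        return [opposite]
    # Global and entity casts arriving on a diagonal also fan out to the
    # two directions adjacent to the opposite one.
    fan_out = [DIRS[(i + 3) % 8], opposite, DIRS[(i + 5) % 8]]
    return sorted(fan_out, key=DIRS.index)
```

For example, `transmit_directions("NE", "G")` yields the three directions S, SW, and W, while the same call with a beamed cast yields only SW.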
3.2 Message Passing
The message buffer contains the actual message to be communicated as well as the re-
quired decoding information such as source and destination identification information and
locations. Table 3.3 and Table 3.4 describe the two different formats of the message that
is transferred through the message buffer. The former is for broadcasting entity cast re-
sults from instruction executions; the latter is used for other non-result communication
messages.












64 K 4 16 32 32 16 8 20 128
1 T 4 40 64 64 40 8 44 256




Source ID  Dest. ID  Inter. ID  ID Type: S/D/I  Reserved  Total Bits
64 K 4 16 16 16 2/2/2 70 128
1 T 4 40 40 40 2/2/2 126 256
The shaded fields are fields in which the bit width depends on the sea size. As shown,
the result message type has the capability of storing both the execution result value as well
as a primary or secondary operand from the broadcasting element; this allows for future
extension where instructions may change operand values and need to broadcast them to
keep all other copies of that element updated in the entity. In the coded message type, the
ID Type field is broken into S, D, and I fields, corresponding to source, destination, and
intermediate ID numbers with which the message is associated. The source and destination
IDs are the sending and receiving ID numbers, and the intermediate ID number is used for
other entities or elements associated with the message. These numbers can be either entity
IDs, element IDs, or hardware IDs; interpretation is determined by the message type. Note
that in both message types, there is reserved space for future extension.
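To make the coded message layout concrete, the sketch below packs and unpacks the fields of the 64 K sea row of Table 3.4 into a single 128-bit word. It is only an illustration: the name `code` for the leading 4-bit field, the MSB-first field order, and the helper names `pack` and `unpack` are assumptions of this sketch, not part of the architecture definition.

```python
# Field widths for the coded message in a 64 K sea (Table 3.4).
# The field name "code" and the MSB-first ordering are assumptions.
FIELDS_64K = [
    ("code", 4), ("source_id", 16), ("dest_id", 16), ("inter_id", 16),
    ("type_s", 2), ("type_d", 2), ("type_i", 2), ("reserved", 70),
]

TOTAL_BITS = sum(width for _, width in FIELDS_64K)  # widths add up to 128


def pack(fields, values):
    """Pack named field values into one integer, first field in the top bits."""
    word = 0
    for name, width in fields:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(fields, word):
    """Recover the named field values from a packed integer."""
    values = {}
    for name, width in reversed(fields):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values
```

Keeping the layout in a single table of (name, width) pairs means the 1 T sea format could be described by swapping in a second list with the wider ID fields.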
Table 3.5 shows all current message codes utilized by the architecture.
3.3 Send and Receive Layers
One may expect the nature of the communication architecture to present an issue with
message backup. With many messages passing in a single entity, it is conceivable that
collisions in messages could occur, causing stalls and therefore delay in execution, if the
message passing is not handled appropriately. To avoid backup, each AP contains two sets
of handshake buffers and two sets of message buffers; one set is dedicated to receiving
messages and one set is dedicated to sending messages, essentially creating a bi-directional
message passing system that limits the possibility of collisions. Since each AP contains
a dedicated buffer for receiving and sending, an AP can both receive and send a message
at the same time to any neighboring processor. The only time a message will need
to wait for more than one communication cycle is if it arrives at the same time as
another message from another AP. If this does occur, the priority handshaking discussed in
Section 3.1 is capable of handling simultaneous messages through priority settings. There
is a form of communication back-off inherent in this system; in the case of simultaneous
message receipt caused by a number of close entity casts, the priority-based handling of
the receiving AP will naturally “stagger” the casts after the first collision, helping to pre-
vent further collisions. Also, collisions will only occur at the “wave fronts” of the casts.
Once the communication cycle in which collision occurs is completed, there are no further
conflicts.
Figure 3.4 shows the message and handshake buffer connections between two neighbor-
ing APs, including both send and receive layers.
Table 3.5: Message codes and descriptions for non-result message types
Code | Function | Description
0x0 | Result | The result entity cast by a function instance element after it has executed its instruction, or by a data structure element following abutment with a function instance.
0x1 | Instantiate Function | [...]
0x2 | [...] | Similar to 0x1, except the function is an interrupt service routine.
0x3 | Return Value | Beamed cast by a called function instance to the calling function instance, after it has completed execution.
0x4 | Interrupt Return Value | Similar to 0x3, except the returning function is an interrupt service routine.
0x5 | Move to Abut | Code sent to all elements of a function instance that has just been instantiated from its function definition; triggers the elements to move and abut a given data structure.
0x6 | Abutment Completed | Entity cast by a function instance element once it has abutted an [...]
0x7 | [...] | Entity cast by the function instance element that executes the RE[...]
0x8 | [...] | Global cast by a function instance element executing a CALL instruction; it requests the (x, y) coordinates, or HardwareID, of the [...]
0x9 | [...] | Beamed cast by a function definition in response to 0x8.
0xA | Request Location of Data Structure | Global cast by a function instance element executing a CALL instruction; requests (x, y) coordinates, or HardwareID, of the data [...]
0xB | [...] | Beamed cast by a data structure element in response to 0xA.
Figure 3.4: Send and receive layers of communication buffer circuitry.
Note that according to Figure 3.4, the message buffers (MB) on both layers are con-
nected; in the case where an AP is only propagating a message and not processing it, the
propagating message can be loaded directly into the Send layer, saving a clock cycle and





This is a special case of the P2P cast, using a command message broadcast. A move request
is sent to a specific AP by a neighbor when that neighbor (an entity element) wants to move
into that AP’s cell. This would only take place if that AP were empty, i.e.,
it does not contain a function definition element (in the case of the standby layer) or a
function instance or data structure element (in the case of the execution layer). A single
move operation is described in the following steps:
1. In cycle 1, AP x (which contains one element of an entity) requests to move into AP
y by setting AP y’s handshake buffer to code 6.
2. In cycle 2, AP y will reset its handshake buffer back to 0. In addition, AP y will fill AP
x’s message buffer with a specific message indicating either an affirmative or negative
response to the move request.
3. In cycle 3, if the move request was answered affirmatively, then AP x will transfer
into AP y’s cell. If the response was negative, then AP x will need to use the same
communication process on its other neighbors to attempt an alternative path to move.
The movement algorithm to be used in simulation and in implementation is currently
under development. The three described steps are
together referred to as a communication cycle. The transfer of AP contents from one cell to
another is accomplished via either a parallel interface or the message buffer. In the former
case, each field as described in Table 2.1 is multiplexed to those of its neighbors, allow-
ing the simultaneous transfer to take place. In the latter case, two cycles would be used
for transmitting required fields over the limited message buffer size. In either case, for
simplicity, we will refer to a single move operation as one communication cycle.
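The three-cycle move operation described above can be sketched as follows. This is a simplified model under stated assumptions: the class `AP`, the function `move`, and the reply strings `"accept"` and `"reject"` are hypothetical placeholders, and the transfer of the element's fields is reduced to a single assignment.

```python
READY = 0            # handshake code 0: AP is ready (Table 3.1)
MOVE_REQUEST = 6     # handshake code 6, as used in step 1 of the move operation


class AP:
    """Minimal stand-in for an atomic processor cell."""

    def __init__(self, name, element=None):
        self.name = name
        self.element = element   # None means the cell is empty
        self.handshake = READY
        self.message = None


def move(x, y):
    """Attempt to move x's element into y; return True if the move happened."""
    # Cycle 1: x writes the move request into y's handshake buffer.
    y.handshake = MOVE_REQUEST
    # Cycle 2: y resets its handshake buffer and answers in x's message buffer.
    y.handshake = READY
    x.message = "accept" if y.element is None else "reject"
    # Cycle 3: on an affirmative answer the element transfers into y's cell.
    if x.message == "accept":
        y.element, x.element = x.element, None
        return True
    return False
```

A rejected request leaves both cells unchanged, so the requesting AP is free to repeat the same exchange with another neighbor.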
Naturally, an AP in the sea may encounter other APs from other entities along its path
that impede its movement along that usually direct path. In this case, the AP will attempt
to move in an alternative direction into an available cell, and then again try to move in the
target direction in the next cycle. Consider the example shown in Figure 4.1.
Figure 4.1: Atomic processor moving around an entity; the dashed line indicates the desired path, and the
solid lines represent the redirected path the entity takes.
As shown in Figure 4.1, an entity element that must redirect its path around another
entity will do the following:
1. Element moves one step in either a counterclockwise (CCW) or clockwise (CW) di-
rection as a redirected path (for example, a target southeast movement will redirect to
an east movement).
2. Element repeats step 1, continuing its attempt to move around the entity in a CCW or
CW direction.
3. Element eventually clears the entity and is able to move unobstructed towards its
target.
Note that there is some extra delay when an element must move around an entity. Since
the initial move request is rejected, two extra machine cycles occur for the element to move
into its next cell, making a communication cycle a total of five machine cycles instead of
the usual three machine cycles.
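The redirection step and its cost can be illustrated with a short sketch. The direction list and function names are hypothetical. The cycle count generalizes the single-rejection case described above (three cycles plus two extra) to several rejections in a row, which is an assumption of this sketch rather than a statement from the text.

```python
# Compass directions in clockwise order.
DIRS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def redirect(target: str, ccw: bool = True) -> str:
    """Return the direction one step CCW (default) or CW from the target."""
    step = -1 if ccw else 1
    return DIRS[(DIRS.index(target) + step) % 8]


def move_cycles(rejections: int) -> int:
    """Machine cycles for one move: 3 when accepted at once, plus 2 per
    rejected request (assumed to scale linearly)."""
    return 3 + 2 * rejections
```

With these definitions, a blocked southeast move redirects to east in the counterclockwise case, matching the example in step 1.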
4.2 Entity Abutment
This is also a special case of the P2P cast, using a command message broadcast. This
communication is made to a neighboring processor when a function instance wants to as-
sociate with a data structure. Once the function instance hits a data structure, it initiates
an abutment request to obtain the data structure’s logical ID number at which time it can
then decide if it is the correct data structure to process (in the case that it is not, the APs
abort the request and attempt to move around the data structure via the method described
in Section 4.1). Abutment is described in the following steps:
1. In cycle 1, AP x (which is an element of a FI) requests identification information from
AP y (which is an element of a DS) in order to determine if it is the correct DS to
abut, by setting AP y’s handshake buffer to code 6.
2. In cycle 2, AP y will fill AP x’s message buffer with its identification information,
and also sets its handshake buffer back to 0 to indicate the request has been served.
3. In cycle 3, AP x will respond by setting AP y’s handshake buffer to 6, and fill its
message buffer with its own identification information. Now that both the FI and DS
contain the other’s identification information, an abutment will occur if the identifi-
cation information is correct, i.e., that particular FI was looking for that particular
DS.
As with the movement algorithm described in Section 4.1, simulation- and implementation-level
algorithms for abutment are currently in development. Once abutment completes, the
FI element(s) that abutted send out a coded message indicating abutment has completed,
which is entity cast to the entire superentity. Note that multiple command messages may be
sent. Following receipt of this broadcast, every AP in the data structure (DS) portion of the
superentity will send a result message via an entity cast to distribute its current data
word value. This allows instructions to be preloaded with the current state of their operands.
Previous instructions that modify any of these operands will entity cast the updated data
to the appropriate data structure element as well as any instruction in the sea that requires
the use of that particular data storage element. This means that following the initial data
structure broadcast after abutment, all instructions in the function instance (FI) constantly
have updated copies of their data operands, with no need for explicit “memory access” type
procedures.
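The identification exchange at the heart of abutment can be sketched as below. Handshake codes are deliberately left out, and the class and field names (`FIElement`, `DSElement`, `wanted_ds_id`, and so on) are hypothetical; the sketch only shows that both sides end up holding the other's ID and that abutment proceeds when the FI has found the DS it was looking for.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FIElement:
    fi_id: int
    wanted_ds_id: int                  # logical ID of the DS this FI is seeking
    seen_ds_id: Optional[int] = None   # filled in during cycle 2


@dataclass
class DSElement:
    ds_id: int
    seen_fi_id: Optional[int] = None   # filled in during cycle 3


def try_abut(x: FIElement, y: DSElement) -> bool:
    """Run the three-cycle exchange; return True if abutment should occur."""
    # Cycle 1: x requests y's identification (handshake omitted in this sketch).
    # Cycle 2: y places its identification in x's message buffer.
    x.seen_ds_id = y.ds_id
    # Cycle 3: x answers with its own identification.
    y.seen_fi_id = x.fi_id
    return x.seen_ds_id == x.wanted_ds_id
```

When `try_abut` returns False, the FI element would fall back to the redirection behavior of Section 4.1.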
4.3 Instruction Execution
The execution of an instruction in an AP may be compared to the familiar fetch-execute-
writeback procedure, but note that each of these steps is not exactly as in typical architec-
tures as there is no specific defined memory interface for programs or data. More specifi-
cally, the three steps are most often the following:
1. The needed data operands have been loaded into the AP containing an instruction
(via an earlier result broadcast), and the AP has just received a result entity cast from
a preceding instruction indicating an execution order number that matches the AP’s
previous-execution order number. The conclusion of this step means that the instruc-
tion has all the information it needs to execute and is capable of executing based on
the compile-time discovered parallelism.
2. The instruction is executed locally on the AP with its stored data operands. This may
take one or more machine cycles to execute, but as these are simple atomic operations,
they will likely not take any longer than a few clock cycles to complete. Note that the
AP can continue to propagate entity casts and global casts because the communication
circuitry is independent of the execution circuitry. If a broadcast is received during
execution that mandates that the current execution halt (most likely via a global cast;
consider interrupts, discussed in Section 4.7), this is the only case where execution
will be affected by communication activity. The conclusion of this step means that
the instruction execution has completed.
3. The AP sends out a result entity cast addressed to the data structure element ID num-
ber that should contain the result of the instruction execution. This broadcast will
signal the data structure element as well as any instruction that contains this element
as a data operand (i.e., the data structure element ID number matches the primary or
secondary operand ID numbers). The conclusion of this step means that the result of
the instruction execution has been cast to all locations requiring the data.
Figure 4.2 shows a timeline representation of an entire instruction cycle, including exe-
cution as well as related communication broadcasts.
All instructions defined in Table 2.2 follow this process, with the exception of the CALL
and RETURN instructions, which are described further in Section 4.4. These instructions
utilize the data operands to indicate addresses of data structure and calling/called functions
that are involved in the particular function call being done.
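The three steps of the instruction cycle can be modeled with a single element that reacts to incoming result casts. This is a minimal sketch, assuming that a result cast carries a data structure element ID, a value, and the sender's execution order number; that tuple layout and the names `Element` and `receive` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Element:
    dest_id: int                    # DS element ID that should hold the result
    pop_id: int                     # primary operand ID
    sop_id: int                     # secondary operand ID
    op: Callable[[int, int], int]   # the atomic operation
    eo: int                         # execution order number
    peo: int                        # previous-execution order number
    pop: int = 0                    # local copy of the primary operand
    sop: int = 0                    # local copy of the secondary operand

    def receive(self, cast_id: int, value: int,
                cast_eo: int) -> Optional[Tuple[int, int, int]]:
        """Handle one result entity cast; return this element's own result
        cast if the cast triggers execution, otherwise None."""
        # Step 1: refresh any operand copy whose ID matches the cast.
        if cast_id == self.pop_id:
            self.pop = value
        if cast_id == self.sop_id:
            self.sop = value
        if cast_eo != self.peo:
            return None
        # Step 2: execute locally. Step 3: cast the result to dest_id.
        return (self.dest_id, self.op(self.pop, self.sop), self.eo)
```

Because operand copies are refreshed on every matching cast, the element never issues an explicit read before executing.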
Figure 4.2: Timeline of an instruction cycle, including communication broadcasts.
4.4 Function Calls
A function call involves three entities: the calling function, the called function, and the
associated data structure that contains the arguments for the called function. The steps for
this process can be broken into the following, saying function A calls function B to process
data structure C:
1. Function A’s CALL instruction (see Table 2.2) will request the (x, y) coordinate lo-
cation of the target/called function B via a global cast message.
2. Function B responds back to that CALL instruction with the coordinates, via a beamed
cast directed at the (x, y) coordinates of the CALL instruction.
3. Function A’s CALL instruction sends a command message beamed cast to function
B indicating it should instantiate and process data structure C, indicating the data
structure’s (x, y) coordinates in the sea.
4. Function definition B receives the message and instantiates each of its elements into
the execution layer.
5. Function instance B moves to the location of data structure C and initiates an abut
request.
6. Data structure C accepts the abut request; each element sends an entity broadcast to
the rest of the superentity (data structure and abutted function instance) indicating
their element ID numbers and data word values.
7. Once execution of the function completes (that is, a RETURN instruction is ready
to execute), the RETURN instruction sends a beamed message back to the calling
function instance indicating the return value of the called function.
8. The RETURN instruction sends out an entity cast message to its entity indicating that
the function has completed execution and elements should terminate (disappear from
the sea). Meanwhile, function A’s CALL instruction entity casts the return value it
received from the previous step to the rest of its entity.
The actual execution of a function instance always begins with the “first” instruction,
which is specified by the following two conditions:
1. The previous-execution order number is 0
2. The execution order number is 1
Execution conditions are also ignored for the first-time execution of the first instruction;
subsequent executions however will consider the execution conditions as usually expected.
During the entire function call process, the CALL instruction will not move from its
current location, as it relies on beamed cast communications from the called function and
this is location-sensitive. The called function definition also will not move between steps 1
and 2, as this is also a location-sensitive communication process. If a function is to be in-
stantiated multiple times for parallel execution (on different data structures), then multiple
CALL instructions will be used, using the secondary operand to specify the different data
structure entity IDs to use.
Note that “functions” in the sea may or may not be equivalent to the functions or meth-
ods defined in the high-level source code. This is to be decided by the compiler. For simple
programs, it is likely that the compiler will retain the original function structure of the
program. For more complex programs with many large functions, the compiler will break
large functions into smaller blocks that share common data or parallelism.
4.5 Input/Output
Input and output points can be defined at the sea’s periphery, using a border AP to load data























Figure 4.3: Input/output interface at border of sea.
Any number of the input/output interfaces as presented in Figure 4.3 can be used along
the border of the sea, allowing for as many I/O pins on a chip as needed for a target applica-
tion. Interfaces can be serial or parallel, and are limited only by the bandwidth requirements
of the given protocol and technology.
To communicate with outside elements, interfaces on each output port can be used to
emulate the appropriate interfacing protocol, e.g., UART and USART.
4.6 Loops
Loops can be generated in two ways, depending on the complexity of the loop. If a loop
containing only a few sequential operations is needed, then the final instruction in an it-
eration of the loop is used to broadcast the appropriate earlier execution order number,
allowing previously-executed instructions to execute again. Using the execution conditions
to look for a particular condition, the number of executions is defined. This is similar to
current handling of loops in sequential computers, and is the most common way of imple-
menting a loop in assembly programming. For more complex loops that contain a larger
number of instructions that present possible concurrency, a function definition is created
for the loop contents, and the loop can be called multiple times using multiple CALL in-
structions.
The handling of loops depends heavily on decisions made by the compiler related to
discovery of parallelism, discussed more in Section 6.2.3. There are many cases where loop
iterations can be parallelized completely, especially in vector and matrix operations; these
loops can be built as a function definition and instantiated multiple times at runtime. Loops
can also be implemented as a subset of instructions within a larger function that repeat
execution; this occurs by using the execution order number to trigger earlier instructions to
execute again if a given condition for running subsequent loop iterations is true. Of course,
instruction-level parallelism within a single loop iteration is always possible via the use of
the execution order number (i.e., instructions that can execute concurrently have the same
execution order number), as it is for any instruction in a loop iteration or not. Figure 4.4
shows three examples of the way loops may run on the architecture.
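The first kind of loop, in which the final instruction re-broadcasts an earlier execution order number, can be illustrated with a toy scheduler. The scheduling model here is a simplification made for this sketch: each instruction is a tuple (PEO, EO, action), all instructions whose PEO was cast in the previous step run together, and an action may return an EO number to cast in place of its own. None of these names come from the architecture itself.

```python
def run(instructions, state, start_eo=0, max_steps=100):
    """Execute (peo, eo, action) tuples until no instruction is triggered."""
    broadcast = {start_eo}            # EO numbers cast in the previous step
    trace = []
    for _ in range(max_steps):
        ready = [ins for ins in instructions if ins[0] in broadcast]
        if not ready:
            break
        broadcast = set()
        for peo, eo, action in ready:
            override = action(state)
            trace.append(eo)
            # An action may cast an earlier EO number to loop back.
            broadcast.add(eo if override is None else override)
    return trace


def accumulate(state):
    state["acc"] += state["i"]


def decrement_and_branch(state):
    state["i"] -= 1
    # Re-cast EO 0 while iterations remain, so the loop body runs again.
    return 0 if state["i"] > 0 else None


def finish(state):
    state["done"] = True


PROGRAM = [(0, 1, accumulate), (1, 2, decrement_and_branch), (2, 3, finish)]
state = {"i": 3, "acc": 0, "done": False}
trace = run(PROGRAM, state)
```

In this run the loop body (EO 1 and EO 2) executes three times before the instruction with EO 3 fires, and `state["acc"]` ends at 6.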
(a) (b) (c)
Figure 4.4: Examples of loop handling in the architecture: program flow for loop iterations created as separate
functions, program flow for parallelizable loop iterations executed within a single function, and program flow
for sequential loop iterations executed within a single function.
4.7 Exception Handling
Exceptions can be handled at the input/output ports through the use of the global cast,
utilizing the command message broadcast format, similarly to the way a function call is
performed. The exception will propagate through the entire sea, eventually reaching all APs.
Since global casts are of highest priority, the propagation will not be halted by any other
communications within entities, allowing the exception to be processed as soon as it reaches
the exception handling function (routine). Below is a non-exhaustive list of exceptions, as
described in [1], that will have to be considered:
• Hardware device interrupts
• Breakpoints/debug interrupts
• Arithmetic overflow




• Power failure (requires halting of each AP operation once the exception is received)
Only the power failure will affect all morphological entities and all atomic processors
in the sea. All other exceptions will affect the entities to which they relate. Exceptions
will not stop other communications currently propagating in the sea once the “wave front”
of the exception has passed (as the exception will have higher priority than other com-
munications, those other communications will have to wait for a communication cycle to
allow the interrupt to pass through). That is, while an exception is propagating in the sea,
all other currently propagating communications—result entity cast messages, mostly—will
complete their traveling to their destination, and will update their data structure elements
and data operand values. This is to maintain consistency in data throughout the sea, as the





A machine cycle is assumed to be the standard local instruction cycle on an AP. This
includes the time for execution of the instruction as well as decoding of the message
buffer (analogous to instruction fetch) and encoding of the result message (analogous to
writeback); in other words, a full execution cycle. In addition, a machine cycle is assumed
to encompass the amount of time for handshaking and message buffer transfer (also known
as a full communication cycle).
5.2 Minimum and Maximum Execution Times
The total execution time of a called function, from the moment the calling function in-
stance sends out the CALL message, and until it receives the return message and/or data
structure, depends on the duration of the following events: call to instantiate, instantiate,
move to abut, abut, effective execution, and RETURN. In turn, the duration of all these
events depends on the sizes of the called function instance and its associated data structure,
and except for the effective execution time, on the relative location of the calling function
instance and the called function definition, and the relative location of the called function
instance and its associated data structure. Unless the relative locations are enforced and
therefore known, the times to call, to instantiate, to move, to abut and finally to return, are
a function of the instantaneous computational context. Therefore, their exact values are
best evaluated using a cycle accurate functional simulator, which is discussed in Chapter 7.
However, the effective execution time can be analytically calculated for a few particular
cases. To demonstrate these effective times, some example patterns of execution are shown
in Figure 5.1.
Figure 5.1: Examples of best case 100% parallelizable code, worst case 100% parallelizable code, best case
100% sequential code, and worst case 100% sequential code.
We consider an entity with a square shape and n atomic processors on a side. If the
code is 100% parallelizable, then all elements execute concurrently, except for the one that
executes the RETURN instruction. If the latter is in the center of the square entity, the
minimum effective execution time is $\frac{n-1}{2} + 1$ cycles, where the 1 accounts for the execution
cycle of the return, and $\frac{n-1}{2}$ for the necessary communication cycles between the elements
at the periphery and the center of the entity, as shown in Figure 5.1(a). Alternatively, if
the element that executes the RETURN instruction is located at the periphery, the maximum
effective execution time is equal to $(n-1) + 1 = n$, where the 1 accounts for the execution cycle
of the return, and $n - 1$ for the necessary communication cycles between the RETURN
element and the element diagonally opposite from it, as shown in Figure 5.1(b). The total
number of instructions being $n^2$, the minimum CPI, or minimum cycles per instruction, is
equal to $\frac{n+1}{2n^2}$, or approximately $\frac{1}{2n}$. Similarly, the maximum CPI, or maximum cycles per
instruction, is equal to $\frac{1}{n}$. An alternative metric is the IPC, or instructions per cycle, which
is the inverse of the CPI and thus approximately equal to $2n$ and $n$, respectively.
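As a quick numerical check of the fully parallel case, the sketch below evaluates the effective execution times and CPI values for a given side length n, using exact fractions. The function name and the dictionary keys are hypothetical; the formulas are the ones stated in the paragraph above.

```python
from fractions import Fraction


def parallel_metrics(n: int) -> dict:
    """Effective execution times (cycles) and CPI for 100% parallelizable
    code on an n-by-n entity."""
    instructions = n * n
    t_min = Fraction(n - 1, 2) + 1    # RETURN element at the center
    t_max = Fraction((n - 1) + 1)     # RETURN element at the periphery
    return {
        "t_min": t_min,
        "t_max": t_max,
        "cpi_min": t_min / instructions,
        "cpi_max": t_max / instructions,
    }
```

For n = 9, this gives 5 and 9 cycles, and CPI values of 5/81 and 1/9, the first of which is close to the approximation 1/(2n) = 1/18.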
If the code is 100% sequential or non-parallelizable, all elements execute in sequence,
i.e., no two elements execute concurrently. Then, the minimum effective execution time is
equal to $2n^2$, as shown in Figure 5.1(c). This is achieved when the elements executing in
sequence are always adjacent, i.e., each instruction cycle is equal to the shortest instruction
cycle. The minimum CPI, or minimum cycles per instruction, is 2, and the maximum IPC,
or the number of instructions per cycle, is $\frac{1}{2}$.
For the 100% sequential or non-parallelizable code, the worst-case condition is when
every two elements that execute in sequence are located farthest apart, as shown in
Figure 5.1(d) for an 8×8 sized entity. For convenience, let n be even and $m = \frac{n}{2}$. The number
of elements that are located k atomic processors from the center is equal to $4(2k - 1)$,
and their instruction cycle is $2k$ cycles long, including the execution cycle. Thus the total
execution time of all these elements is $4(2k - 1)(2k)$. Notice that moving diagonally,
horizontally or vertically takes the same number of cycles. Then, for an entity of size $n^2$ or
$4m^2$, the maximum effective execution time is

$$\sum_{k=1}^{m} 4(2k-1)(2k) = 16\,\frac{m(m+1)(2m+1)}{6} - 8\,\frac{m(m+1)}{2} \approx \frac{2}{3}n^3 + n^2 \qquad (5.1)$$

Dividing by the $n^2$ instructions gives the maximum CPI:

$$\frac{1}{n^2}\left(\frac{2}{3}n^3 + n^2\right) = \frac{2}{3}n + 1 \qquad (5.2)$$
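The worst-case sequential time can be checked numerically by summing the per-ring cost $4(2k-1)(2k)$ directly. The sketch below does this and compares it with an exact closed form, $n(n+2)(2n-1)/3$, which is derived here from the standard sums of $k$ and $k^2$ and is not quoted from the text; the function names are hypothetical.

```python
def worst_case_sequential(n: int) -> int:
    """Sum the cost 4(2k-1)(2k) over the rings k = 1..n/2 of an n-by-n entity."""
    if n <= 0 or n % 2 != 0:
        raise ValueError("n must be a positive even integer")
    m = n // 2
    return sum(4 * (2 * k - 1) * (2 * k) for k in range(1, m + 1))


def worst_case_closed_form(n: int) -> int:
    """Exact closed form of the same sum: n(n+2)(2n-1)/3."""
    return n * (n + 2) * (2 * n - 1) // 3


def worst_case_approximation(n: int) -> float:
    """Leading terms only: (2/3)n^3 + n^2."""
    return 2 * n ** 3 / 3 + n ** 2
```

For the 8×8 entity the exact sum is 400 cycles, while the leading-term approximation gives about 405.3; the difference is the dropped lower-order term $\frac{2}{3}n$.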
The performance metrics discussed so far are summarized in Table 5.1. The very good
results for the 100% parallelizable code meet expectations, because of the massively par-
allel character of the architecture. The results for the 100% sequential code with optimal
placement are good. Considered in isolation, the results for the 100% sequential code
with worst placement are at best satisfactory. However, all these results capture only the
exploitation of instruction level parallelism, and must therefore be considered in the larger
context of the sea of atomic processors. Intrinsically, the architecture allows thread or func-
tional level parallelism to be fully exploited. This will compensate and further enhance the
overall performance of the architecture. Thread and functional level parallelism can only
be considered in a specific computational context, and are therefore best evaluated using a
cycle accurate functional simulator, which is discussed in Chapter 7.
Table 5.1: Summary of performance metrics, measured in cycles
Performance Metric | 100% Parallelizable Code | 100% Sequential Code
Minimum Effective Execution Time | $\frac{n+1}{2}$ | $2n^2$
Maximum Effective Execution Time | $n$ | $\frac{2}{3}n^3 + n^2$








To showcase the implementation of a compound operation, we consider multiplication.
APs do not have dedicated multipliers; multiplication can be done instead by using a left-
shift algorithm. For example, multiplying two numbers x and y, yielding result z:
1. Initialize a counter to the data word width, or 32 bits for this case.
2. Decrement the counter by 1. If counter value is negative, multiplication is complete.
Else, continue to step 3.
3. If LSB of x is 1, then add y to z (else, do nothing).
4. Shift x right by 1 bit, and shift y left by 1 bit. Repeat steps 2 through 4.
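The four steps above correspond to the following sketch. It assumes unsigned operands and truncates the result to the 32-bit data word, which the text does not specify; `WORD_BITS`, `MASK`, and `multiply` are hypothetical names.

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1


def multiply(x: int, y: int) -> int:
    """Multiply two unsigned words by shifting and adding."""
    z = 0
    counter = WORD_BITS               # step 1: initialize the counter
    while True:
        counter -= 1                  # step 2: decrement and test
        if counter < 0:
            return z
        if x & 1:                     # step 3: add y when the LSB of x is 1
            z = (z + y) & MASK
        x >>= 1                       # step 4: shift x right and y left
        y = (y << 1) & MASK
```

The loop body runs exactly 32 times regardless of the operand values, which matches the fixed execution time noted for this algorithm.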
Figure 5.2 shows an example integer multiplication superentity incorporating the algo-
rithm described above. The following abbreviations may be used for the element fields:
• ID: Element ID
• POp: Primary operand
• SOp: Secondary operand
• Cond: Execution conditions
• SBID: Status bits ID
• Count: Execution count
• EO: Execution order number
• PEO: Previous-execution order number
Figure 5.2: Example multiplication superentity using left-shift algorithm.
To demonstrate cycle-by-cycle operation, each cycle is enumerated below by execution
order number, indicating the behavior of APs in each cycle.
1. Step 1: In cycle 1, execution starts with element ID 8, which bitwise ANDs x with
the value 1 as a mask. If the least significant bit of the result is zero, then the Z flag is
set to 1. Else, it is reset to 0.
2. Step 2: In cycle 2, if the Z flag is reset to 0, then y is added to z. At the same time, x
is shifted right and y is shifted left. Else if the Z flag is set to 1, then do nothing. In
either case, the counter is decremented.
3. In cycle 3, if the counter is not zero and not negative, then execution continues again
with step 1. Else, if the counter is zero or negative, then the RETURN element exe-
cutes and the multiplication is complete.
The multiplication algorithm takes the same number of cycles to execute, regardless of
the operand values. The algorithm is effectively two execution steps (enumerated steps 1
and 2 as indicated in the algorithm description in Section 5.3.1) repeated for each bit, in
addition to the return step (enumerated step 3).
5.3.2 Integer Division
To showcase the implementation of another compound operation, we consider integer divi-
sion. APs do not have dedicated dividers; division is instead done using a shift-and-subtract
algorithm. For example, dividing a number n (numerator) by d (denominator), yielding re-
sult quotient q and remainder r:
1. Initialize r = n.
2. Subtract d from r.
3. If r ≥ 0, then add 1 to q and repeat steps 2-3. Else, add d to r, and division is
complete.
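The repeated-subtraction steps listed above can be written out as follows. The sketch assumes a non-negative numerator and a positive denominator, which the text leaves implicit; the function name is hypothetical.

```python
def divide(n: int, d: int) -> tuple:
    """Return (quotient, remainder) of n / d by repeated subtraction."""
    if n < 0 or d <= 0:
        raise ValueError("this sketch assumes n >= 0 and d > 0")
    q = 0
    r = n                   # step 1: initialize r = n
    while True:
        r -= d              # step 2: subtract d from r
        if r >= 0:          # step 3: count the subtraction and repeat
            q += 1
        else:               # otherwise restore r and finish
            r += d
            return q, r
```

The number of loop iterations is the quotient plus one, so the execution time grows with the ratio of the operands.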
Figure 5.3 shows an example integer division superentity incorporating the algorithm
described above.
Figure 5.3: Example integer division superentity.
To demonstrate cycle-by-cycle operation, each cycle is enumerated below by execution
order number, indicating the behavior of APs in each cycle.
1. Step 1: In cycle 1, execution starts with element ID 1, which adds 0 to n and stores the
result in r (that is, n is assigned to r).
2. Step 2: In cycle 2, d is subtracted from r. If the result is negative, then the N bit is set
to 1. Else, the N bit is reset to 0.
3. Step 3: In cycle 3, if the N bit is set to 1, then d is added back to r. Else, q is
incremented.
4. In cycle 4, if the N bit was set to 1, then the RETURN element executes and the
division is complete. Else, continue execution at step 2.
The division algorithm takes a varying number of cycles to execute, as the number of
iterations is dependent on the size of the operands. That is, the number of subtractions
of the divisor from the dividend is the number of iterations the division algorithm must
perform. The more iterations, the longer the execution time.
5.3.3 Floating Point Operations
Floating point arithmetic, multiplication, and other operations are accomplished through
the implementation of special function definitions dedicated to unpacking, processing, and
repacking these numbers when the program needs it. As long as there are APs available
in the sea, as many floating point functions can be instantiated as needed. There are effec-
tively no structural dependencies in providing this functionality, as there are commonly in
processors that use a (usually) limited number of dedicated floating point units, whose use
must be scheduled.
Consider the floating point multiplication algorithm in Equation 5.3.
$$s_p \times 2^{e_p} = s_1 \times 2^{e_1} \times s_2 \times 2^{e_2} \qquad (5.3)$$
The significand is indicated by s and the exponent by e for an unpacked floating point
number. Figure 5.4 shows an example floating point multiplication superentity, which con-
sists of a multiply operation on the significands (via a function call to the multiplication
function shown in Figure 5.2) and an add operation on the exponents.
Figure 5.4: Example floating point multiplication superentity, including use of integer multiplication function
for operating on significands.
1. Step 1: In cycle 1, execution begins with element ID 2, where e1 and e2 are added.
2. Step 2: Next, s1 and s2 are multiplied by a call to an integer multiplication function.
Floating point multiplication is now complete.
Note that given the latency required for the CALL operation to implement multiplication,
the multiplication function can instead be placed directly in the floating point function along with the
additional instructions and data required for floating point. The only trade-off is that the
superentity is larger, which corresponds to higher communication latency, but this is likely
much lower than the latency associated with a function call.
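As a rough software analogue of Equation 5.3, the following C sketch multiplies two unpacked numbers by adding exponents and multiplying significands. The type `UnpackedFloat` and the helper `integer_multiply` are hypothetical stand-ins (the latter for the integer multiplication function of Figure 5.2); normalization, rounding, and overflow handling are deliberately left out.

```c
#include <stdint.h>

/* Unpacked floating point number: value = significand * 2^exponent. */
typedef struct {
    int64_t significand;  /* s */
    int32_t exponent;     /* e */
} UnpackedFloat;

/* Stand-in for the call to the integer multiplication function. */
static int64_t integer_multiply(int64_t a, int64_t b)
{
    return a * b;
}

/* s_p * 2^(e_p) = (s_1 * s_2) * 2^(e_1 + e_2); no normalization is done. */
UnpackedFloat fp_multiply(UnpackedFloat x1, UnpackedFloat x2)
{
    UnpackedFloat p;
    p.exponent = x1.exponent + x2.exponent;                          /* step 1 */
    p.significand = integer_multiply(x1.significand, x2.significand); /* step 2 */
    return p;
}
```

Inlining `integer_multiply` into `fp_multiply` corresponds to the option discussed above of placing the multiplication in the floating point function instead of calling it.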









This assumes e1 > e2. Computing the significand and exponent of a floating point
sum requires dividing s2 by a power of two, which can be done by progressively right-shifting
s2 a total of e1−e2 times. Following this shift, the result is added to s1, and this is the sum’s significand.
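The alignment step can be sketched in C as follows. This is an illustrative model under stated assumptions: significands are unsigned, e1 is at least e2, and the sum keeps exponent e1 (which follows from factoring 2^e1 out of both terms). The names `UnpackedFloat` and `fp_add_aligned` are hypothetical, and normalization and rounding are omitted.

```c
#include <stdint.h>

/* Unpacked floating point number: value = significand * 2^exponent. */
typedef struct {
    uint64_t significand;  /* s, assumed non-negative in this sketch */
    int32_t  exponent;     /* e */
} UnpackedFloat;

/* Precondition (as in the text): x1.exponent >= x2.exponent. */
UnpackedFloat fp_add_aligned(UnpackedFloat x1, UnpackedFloat x2)
{
    UnpackedFloat sum;
    uint64_t aligned = x2.significand;
    int32_t shift = x1.exponent - x2.exponent;

    /* Progressive right shift, one bit per pass, e1 - e2 times.
     * Bits shifted out are simply lost. */
    for (int32_t k = 0; k < shift && aligned != 0; k++) {
        aligned >>= 1;
    }

    sum.significand = x1.significand + aligned;
    sum.exponent = x1.exponent;  /* both terms now share exponent e1 */
    return sum;
}
```

For instance, 8 × 2^5 plus 12 × 2^3 shifts 12 right twice to give 3, so the sum is 11 × 2^5.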





A compiler for this architecture incorporates discovery of parallelism at the high-level (C
programs, for example), followed by assembly into a data stream that will then be down-
loaded to the device, loading the APs with instructions and data. Each instruction and data
element is assigned unique identification information which is used to implement the par-
allelism that is discovered in the compilation process—that is, which specific instructions
can execute in parallel (and which cannot), and which higher-level logical functions can
execute in parallel (and which cannot). What follows the integration of this parallelism is
a program flow unimpeded by resource dependencies or memory access bottlenecks. If a
function has the parameters it needs to execute, it will execute. If an instruction within a
function has the operands it needs to execute, it will execute.
There are a number of steps that must be taken to first discover parallelism in conven-
tional code, and then to determine an appropriate level of implementation of that parallelism
in a given architecture. This architecture is designed to inherently support high-level pro-
gramming concepts (function calls, structured data, iterative processes, etc.) and to resolve
parallelism first at this level as well as the instruction level.
6.2 Code Conversion Steps
Traditional compilation involves the use of several intermediate representations of the code
before translation to machine code. Usually these representations resolve memory access
routines explicitly, which makes sense in traditional Von Neumann architectures where
memory read/write operations are explicit instructions. In our architecture, memory op-
erations are implicit operations, taking place during execution, but not defined at the in-
struction level. As such, a new representation must be developed that does not explicitly
define memory operations and focuses on the high-level programming concepts that the
architecture supports.
The C programming language can be used as a basis for these developments. Compila-
tion can be broken into the following steps:
1. Parse the code into a clean high-level format.
2. Identify data structures and their elements, i.e., constants and variables; identify func-
tion blocks and instructions within each function, including function calls.
3. Discover instruction-level parallelism within functions, including function calls.
4. Generate programming information for each AP based on preceding steps.
6.2.1 Parsing and CIL Representation
The first step in compilation is to parse the input code. C Intermediate Language (CIL) [79]
is an intermediate representation of C that parses and rewrites the code using simple con-
trol structures (only if/else and while control structures) while preserving function
definitions, providing explicit variable scope by putting functions further into code blocks,
and also extracting library functions from headers that may be present. Call graphs for
functions may also be generated to indicate dependencies among functions. This cleaner,
simpler representation of C code makes it easier to recognize which groups of variables
in a given program should be built into data structures, and where to split up code
into functions.
To convert a C program, CIL is run on its source code with the following
command:
cilly --save-temps --docallgraph --domakeCFG --dooneRet
--noPrintLn --noInsertImplicitCasts --printCilAsIs --noWrap
source.c > source.c.cg.txt,
where source.c is the source C file. [79] Consider the following simple vector addition
code:
int main() {
int vector1[8] = {1, 2, 3, 4, 5, 6, 7, 8};
int vector2[8] = {5, 9, 1, 45, 13, 52, 9, 23};
int i;
for(i=0; i<8; i++) {




CIL will parse this code into the following:
/* Generated by CIL v. 1.3.6 */
/* print_CIL_Input is true */
int main(void)

























while_0_continue: /* CIL Label */ ;













As shown, CIL converts all control structures in the original code to a predictable
while loop structure with continue and break labels. In addition, the function return
value is given an explicit variable, compared to the direct constant assignment in the
original code. Both of these results provide a solid basis for generating data structures
and instruction flow for building function definitions.





int c[10] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
a = 0;
b = 1;
for(i=0; i<10; i++) {






CIL parses the code into the following:
/* Generated by CIL v. 1.3.7 */
/* print_CIL_Input is true */
int fibonacci(void)




















while (i < 10) {










Once the code has been parsed into CIL format, it can be further processed to identify
functions and data structures.
Function definitions indicated in the original code are first given an identification number.
Each variable within each function is also identified, forming a data structure associated
with the function. In addition, multi-instruction loops within functions may later be
identified as functions themselves if their iterations can be parallelized
(more in Section 6.2.3). The intermediate representation of this step consists of a list of
individual functions and their code, each function supported by a table of variables, logical
identification numbers, and initialization values. The table of variables is a representation
of the data structure that is associated with its corresponding function. Functions and data
structures themselves are also logically identified, allowing cross-referencing for function
calls, as well as allowing multiple functions to operate on the same data structure (consider
global variables here). The preprocessed code from the vector addition CIL-parsed
code observed in Section 6.2.1 is shown below, followed by Table 6.1, which shows the
table describing the data structure.
/* Generated by CIL v. 1.3.6 */
/* print_CIL_Input is true */
int main(void)








































Table 6.1: Variables table generated for data structure for vector addition code
Scope  Entity ID  Element ID  Variable  Index       Type
loc    1          0           vector1   vector1[0]  int
loc    1          1           vector1   vector1[1]  int
loc    1          2           vector1   vector1[2]  int
loc    1          3           vector1   vector1[3]  int
loc    1          4           vector1   vector1[4]  int
loc    1          5           vector1   vector1[5]  int
loc    1          6           vector1   vector1[6]  int
loc    1          7           vector1   vector1[7]  int
loc    1          8           vector2   vector2[0]  int
loc    1          9           vector2   vector2[1]  int
loc    1          10          vector2   vector2[2]  int
loc    1          11          vector2   vector2[3]  int
loc    1          12          vector2   vector2[4]  int
loc    1          13          vector2   vector2[5]  int
loc    1          14          vector2   vector2[6]  int
loc    1          15          vector2   vector2[7]  int
loc    1          16          i         i           int
loc    1          17          retres4   retres4     int
The Scope column describes the scope of the variable in the original code: loc for local
variable, par for passed parameter (not present here), and glb for global variable (not
shown here; global variables get a separate table). The Variable and Index columns describe the
original variable name. These columns are equal except for arrays, in which case the former
gets the variable name and the latter gets the indexed element. Note that in the code above,
an array that is indexed by another variable is replaced by the Element ID of the
zero-index element of that array, which is equivalent to the pointer that would be used in C to
reference the array.
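To make the table construction concrete, here is a small C sketch that builds rows in the layout of Table 6.1, assigning consecutive Element IDs and expanding each array into one row per element. It is an illustration only, not the semantics-processing script itself; the types `VarEntry` and `VarTable`, the function `vartable_add`, and the fixed capacities are all hypothetical.

```c
#include <stdio.h>

#define MAX_VARS 64
#define NAME_LEN 32

/* One row of the variables table. */
typedef struct {
    char scope[4];            /* "loc", "par", or "glb" */
    int  entity_id;
    int  element_id;
    char variable[NAME_LEN];  /* original variable name */
    char index[NAME_LEN];     /* same as variable, or name[k] for arrays */
    char type[8];
} VarEntry;

typedef struct {
    VarEntry rows[MAX_VARS];
    int count;                /* also the next free Element ID */
    int entity_id;
} VarTable;

/* Adds a scalar (length == 0) or an array of `length` elements.
 * Returns the Element ID of the first row added (for an array, the
 * zero-index element), or -1 if the table is full. */
int vartable_add(VarTable *t, const char *scope, const char *name,
                 int length, const char *type)
{
    int n = (length > 0) ? length : 1;
    if (t->count + n > MAX_VARS) {
        return -1;
    }
    int first = t->count;
    for (int k = 0; k < n; k++) {
        VarEntry *e = &t->rows[t->count];
        snprintf(e->scope, sizeof e->scope, "%s", scope);
        e->entity_id = t->entity_id;
        e->element_id = t->count;
        snprintf(e->variable, sizeof e->variable, "%s", name);
        if (length > 0) {
            snprintf(e->index, sizeof e->index, "%s[%d]", name, k);
        } else {
            snprintf(e->index, sizeof e->index, "%s", name);
        }
        snprintf(e->type, sizeof e->type, "%s", type);
        t->count++;
    }
    return first;
}
```

Adding vector1 and vector2 (8 elements each), then i and retres4, to a table with entity ID 1 reproduces the Element ID assignment of Table 6.1; the value returned for an array is the Element ID that replaces a variable-indexed reference to it.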
Consider now the Fibonacci example; semantics processing produces the following
code:
/* Generated by CIL v. 1.3.7 */
/* print_CIL_Input is true */
int main(void)



















while ($L0 < 10) {










The associated data structure table is shown in Table 6.2.
Table 6.2: Variables table generated for data structure for Fibonacci series code
Scope  Entity ID  Element ID  Variable  Index    Type
loc    1          0           i         i        int
loc    1          1           a         a        int
loc    1          2           b         b        int
loc    1          3           c         c[0]     int
loc    1          4           c         c[1]     int
loc    1          5           c         c[2]     int
loc    1          6           c         c[3]     int
loc    1          7           c         c[4]     int
loc    1          8           c         c[5]     int
loc    1          9           c         c[6]     int
loc    1          10          c         c[7]     int
loc    1          11          c         c[8]     int
loc    1          12          c         c[9]     int
loc    1          13          retres5   retres5  int
6.2.3 Detection of Parallelism
Detecting parallelism in a given program is often difficult, though typical code contains
some regular structures from which parallelism can be extracted mechanically. For
example, consider the code for vector addition. A clear loop exists in which the individual
elements of the vector are added. Specifically, there is no iteration dependence; that is, an
iteration of the loop does not depend on a previous iteration of that loop. This is detectable
by the fact that the variables indexed using the loop iterator i are not indexed by i−1, i+1,
etc. Given that this is effectively the only instruction in the loop (the addition operation),
there is also no instruction dependence. For this reason, this loop can be fully unrolled. In
the context of our architecture, this means that each iteration (which consists of a single
instruction in this case) gets its own AP with the same execution order number. When
executing, all iterations will therefore occur concurrently. Note that this is a simple loop
example; more complex loops require more analysis to determine iteration dependencies.
Discovery of parallelism in high-level code loops has been a subject of research in the
context of parallel and concurrent architectures [80].
In the Fibonacci example, there is instruction dependence in the three instructions in the
loop iteration. In addition, there is dependence at the iteration level because variable values
from previous iterations are used in future iterations, so no unrolling is possible. For this
reason, the loop is performed sequentially.
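The two cases above suggest a simple, conservative check, sketched here in C. It is an assumption-laden illustration rather than the detection algorithm of the compiler: the loop body is assumed to have already been summarized as a list of operand accesses, and the `Access` type and `loop_is_fully_unrollable` function are hypothetical. A loop is accepted only if every iterator-indexed access uses offset 0 and no scalar other than the iterator is written.

```c
#include <stddef.h>

/* One operand access inside a loop body (hypothetical summary form). */
typedef struct {
    const char *name;      /* variable or array name, for readability only */
    int indexed_by_iter;   /* 1 if accessed as name[i + offset], 0 for a scalar */
    int offset;            /* offset from the loop iterator i */
    int is_write;          /* 1 if the loop body writes this operand */
} Access;

/* Conservative test: returns 1 only if no access can touch data that
 * belongs to another iteration. The iterator itself is not listed. */
int loop_is_fully_unrollable(const Access *acc, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        if (acc[k].indexed_by_iter && acc[k].offset != 0) {
            return 0;  /* indexed by i-1, i+1, etc. */
        }
        if (!acc[k].indexed_by_iter && acc[k].is_write) {
            return 0;  /* scalar written here may be read by a later iteration */
        }
    }
    return 1;
}
```

Under this rule, a body that only reads and writes elements at index i passes, while a body that updates scalars carried from one iteration to the next is rejected and left sequential. Real loops need considerably more analysis than this.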
Note that discovering parallelism is not a trivial task. The vector addition and Fibonacci
codes discussed above are simple examples in which the programming steps can
be justified; however, parallelism in more complex code is
much more difficult to detect, and is the subject of future work. A variety of work has
investigated the challenges in detecting both instruction- and thread-
level parallelism in code [81–84].
6.2.4 Translation to Machine Code
Once function definitions and data structures are identified, as well as the execution order
and concurrency, the fields presented in Table 2.1 can be configured. These fields
provide all information needed to guide the execution of the program.
The preprocessing step as described in Section 6.2.2 provides the logical identification
numbers for elements as well as data operands, operation codes, status bits, and execu-
tion repeat values. The parallelism discovery step described in Section 6.2.3 provides the
execution order and previous-execution order values.
Consider the vector addition code. Table 6.3 and Table 6.4 show the machine code in
tabular format for the function definition and data structure elements, and Figure 6.1 shows the
machine code applied to atomic processors in a superentity. Note that the (x, y) coordinates
in the figure differ from those in the machine code, as the function may move during the course of architecture
runtime. Numbers are shown in decimal format for readability.
Table 6.3: Vector addition machine code for function definition elements
Y   X   Type  EntID  ElemID  PriID  SecID           OpCode  EO  PEO  Cond  CountID  SBID
30  10  1     1      0       0      8               1       1   0    0     0        0
30  11  1     1      1       1      9               1       2   1    0     1        1
30  12  1     1      2       2      10              1       2   1    0     2        2
30  13  1     1      3       3      11              1       2   1    0     3        3
31  10  1     1      4       4      12              1       2   1    0     4        4
31  11  1     1      5       5      13              1       2   1    0     5        5
31  12  1     1      6       6      14              1       2   1    0     6        6
31  13  1     1      7       7      15              1       2   1    0     7        7
32  10  1     1      18      0      Calling FI HID  26      3   2    0     19       18
Table 6.4: Vector addition code machine code for data structure elements
Y X EntID ElemID PriOp Primary
10 10 2 1 0 1
10 11 2 1 1 2
10 12 2 1 2 3
10 13 2 1 3 4
10 14 2 1 4 5
10 15 2 1 5 6
10 16 2 1 6 7
10 17 2 1 7 8
11 10 2 1 8 5
11 11 2 1 9 9
11 12 2 1 10 1
11 13 2 1 11 45
11 14 2 1 12 13
11 15 2 1 13 52
11 16 2 1 14 9
11 17 2 1 15 23
12 10 2 1 18 0
12 11 2 1 19 7
Figure 6.1: Vector addition code as superentity, derived from machine code representation.
Note that AP(0,0) exhibits an ADDPS instruction that executes before the remaining
ADDPS instructions; this is to provide the single “entry point” for the function as described
in Section 4.4.
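The EO and PEO columns of Tables 6.3 and 6.5 follow two simple patterns, which the C sketch below reproduces. It only mirrors the values that appear in those tables and is not a description of the translation tool; the `Order` type and the three function names are hypothetical.

```c
/* Execution order (EO) and previous execution order (PEO) of one element. */
typedef struct {
    int eo;
    int peo;
} Order;

/* Fully unrolled loop: element 0 is the single entry point (EO 1, PEO 0);
 * every other element shares EO 2 with PEO 1 and so runs concurrently. */
void assign_order_unrolled(Order *out, int n)
{
    for (int k = 0; k < n; k++) {
        out[k].eo  = (k == 0) ? 1 : 2;
        out[k].peo = (k == 0) ? 0 : 1;
    }
}

/* Sequential loop: element k runs at EO k + 1, after EO k. */
void assign_order_sequential(Order *out, int n)
{
    for (int k = 0; k < n; k++) {
        out[k].eo  = k + 1;
        out[k].peo = k;
    }
}

/* The returning element follows the last execution order in use. */
Order order_for_return(const Order *elems, int n)
{
    int last = 0;
    for (int k = 0; k < n; k++) {
        if (elems[k].eo > last) {
            last = elems[k].eo;
        }
    }
    Order ret = { last + 1, last };
    return ret;
}
```

With eight unrolled additions this gives a return element at EO 3 with PEO 2, and with ten sequential elements a return element at EO 11 with PEO 10, matching the last rows of the two function definition tables.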
Consider the Fibonacci series code; Table 6.5 and Table 6.6 show the machine code, and
Figure 6.2 shows a corresponding superentity.
Table 6.5: Fibonacci series machine code for function definition elements
Y   X   Type  EntID  ElemID  PriID  SecID           OpCode  EO  PEO  Cond  CountID  SBID
30  10  1     1      3       1      2               1       1   0    0     3        3
30  11  1     1      4       2      3               1       2   1    0     4        4
30  12  1     1      5       3      4               1       3   2    0     5        5
30  13  1     1      6       4      5               1       4   3    0     6        6
31  10  1     1      7       5      6               1       5   4    0     7        7
31  11  1     1      8       6      7               1       6   5    0     8        8
31  12  1     1      9       7      8               1       7   6    0     9        9
31  13  1     1      10      8      9               1       8   7    0     10       10
32  10  1     1      11      9      10              1       9   8    0     11       11
32  11  1     1      12      10     11              1       10  9    0     12       12
32  12  1     1      13      0      Calling FI HID  26      11  10   0     13       13
Table 6.6: Fibonacci series machine code for data structure elements
Y X EntID ElemID PriOp Primary
10 10 2 1 0 0
10 11 2 1 1 1
10 12 2 1 2 0
10 13 2 1 3 0
10 14 2 1 4 0
10 15 2 1 5 0
10 16 2 1 6 0
10 17 2 1 7 0
11 10 2 1 8 0
11 11 2 1 9 0
11 12 2 1 10 0
11 13 2 1 11 0
11 14 2 1 12 0
12 11 2 1 13 0
Figure 6.2: Fibonacci series code as superentity, derived from machine code representation.
6.3 Compiler Development
A full-featured compiler for this architecture, implementing all steps
in Section 6.2, is currently under development. Steps 1 and 2 (CIL parsing and semantics
processing) are complete. For detection of parallelism, the possibility
exists for user interaction during the compilation process to assist in detection
if needed. Development of algorithms for parallelism detection, as well as scripts to
run semantics processing and machine code translation, is in progress.
6.4 Example Simulations
Consider the vector addition code developed in Section 6.2. The sea in Figure 6.1 can be
manually simulated to show which elements are affected for each execution order step in
the program. The sea is reproduced in Figure 6.3.
Figure 6.3: Vector addition code superentity.
Observing the propagation of communication broadcasts from the executing function
instance elements, a manual cycle-level simulation can determine the total execution time:
1. In step 1, AP(0,0) executes; its entity cast takes a maximum of 3 cycles to reach
AP(0,3) and AP(1,3).
2. In step 2, the maximum entity cast distance is 4 cycles from AP(1,0) to AP(0,4). Note,
however, that with multiple entity casts there may be a cycle delay as AP(0,4) must
process all the messages. For this reason the message propagation cycle count may
increase to 5 or more cycles.
3. In step 3, execution has completed, after 4 cycles in step 1 and 6 cycles in step 2
(execution plus message propagation), or 10 cycles total. Following these 10 cycles,
the function must send its return value back to the calling function.
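The cycle counts in the steps above can be reproduced with a short C sketch. It rests on assumptions that the text does not state outright: a cast is taken to advance one AP per cycle in any of the eight neighbor directions (so that AP(0,0) reaches AP(1,3) in 3 cycles), each step costs one execution cycle, and receiver-side queuing is passed in as an extra delay. The names `ApCoord`, `cast_cycles`, and `step_cycles` are hypothetical.

```c
#include <stdlib.h>

/* Position of an AP within the superentity, as in AP(row, col). */
typedef struct {
    int row;
    int col;
} ApCoord;

/* Cycles for a cast to travel between two APs, assuming one hop per
 * cycle with diagonal hops allowed. */
int cast_cycles(ApCoord from, ApCoord to)
{
    int dr = abs(from.row - to.row);
    int dc = abs(from.col - to.col);
    return (dr > dc) ? dr : dc;
}

/* One execution cycle, plus the slowest cast of the step, plus any
 * queuing delay at a receiver that must process several messages. */
int step_cycles(ApCoord src, const ApCoord *dst, int n_dst, int queue_delay)
{
    int worst = 0;
    for (int k = 0; k < n_dst; k++) {
        int c = cast_cycles(src, dst[k]);
        if (c > worst) {
            worst = c;
        }
    }
    return 1 + worst + queue_delay;
}
```

With these assumptions, step 1 (from AP(0,0) to AP(0,3) and AP(1,3), no queuing) costs 4 cycles and step 2 (from AP(1,0) to AP(0,4), one cycle of queuing) costs 6 cycles, for the 10 cycles quoted above.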
Consider now the Fibonacci series code; the sea in Figure 6.2 can also be simulated.
The Fibonacci sea is reproduced in Figure 6.4.
Figure 6.4: Fibonacci series code superentity.
Observing the propagation of communication broadcasts from the executing function
instance elements, a manual cycle-level simulation can determine the total execution time:
1. In steps 1 through 10, result message propagation only takes 1 cycle to reach the next
element that needs the result. For example, AP(0,1) needs the result from AP(0,0) in
step 1; they are neighbors so the result is received in 1 cycle.
2. In step 11, execution completes and the function can return to the calling function.
In total there are 2 cycles (execution and message propagation) for each of steps 1
through 10. This results in a total function execution time, including the 2 cycles in





Functional simulation software would provide a way of testing the architecture’s cycle-by-cycle
functionality for a given code set, to solidify the expectations indicated in the performance
estimation (Chapter 5). Simulation can begin with the machine code taken from
the translation process described in Section 6.2.4, which provides information on the
contents of all APs and is used to populate an example sea with those APs. Execution can then be
taken in steps. The user would have the ability to view the contents of the sea at each
step, including message propagation and movement of APs. In addition, the contents of
each AP could be observed at each step. Execution time would be monitored as well, providing
an accurate basis for benchmarking and comparison with current architectures.
7.2 Software Design
Work has commenced to develop a cycle-accurate simulator. Execution takes place at ma-
chine cycle level so each operation can be observed, including instruction execution and
communication processes. An interface is provided to view the contents of each atomic
processor (AP) at each cycle so expected performance can be confirmed. Software was
built for Microsoft Windows in C using the Windows C API functions.
7.3 Current Functionality
The simulation software is designed to remain faithful to machine cycle-specific behavior
of the architecture when given appropriate machine code to execute.
7.3.1 Opening Machine Code Files
Machine code is represented in a table-style format that describes the initial contents of
each function definition (FD) and data structure (DS) element. These initial contents consist
of all values described in Table 2.1 for both the execution and standby layers of data structure and
function definition elements, including their ID numbers and operand ID numbers, as well as
execution order and execution count numbers. In addition, the machine code file indicates
the initial physical (x, y) location in the sea for each entity element. The typical Simulator
window viewing a loaded processor sea is shown in Figure 7.1.
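As an illustration of how one row of such a table might be read, the following C sketch parses a function definition element laid out like the numeric rows of Table 6.3. The on-disk format of the simulator's machine code files is not given here, so the whitespace-separated layout, the `FdRecord` type, and the `parse_fd_record` function are assumptions made for the example.

```c
#include <stdio.h>

/* One function definition element, with fields in the column order
 * of Table 6.3. */
typedef struct {
    int y, x;        /* initial physical location in the sea */
    int type;
    int ent_id, elem_id;
    int pri_id, sec_id;
    int opcode;
    int eo, peo;     /* execution order, previous execution order */
    int cond;
    int count_id, sb_id;
} FdRecord;

/* Parses one whitespace-separated row of 13 integers.
 * Returns 1 on success, 0 if the line does not hold 13 integers. */
int parse_fd_record(const char *line, FdRecord *r)
{
    int n = sscanf(line, "%d %d %d %d %d %d %d %d %d %d %d %d %d",
                   &r->y, &r->x, &r->type, &r->ent_id, &r->elem_id,
                   &r->pri_id, &r->sec_id, &r->opcode, &r->eo, &r->peo,
                   &r->cond, &r->count_id, &r->sb_id);
    return n == 13;
}
```

Rows whose operand field is not numeric, such as the returning element of Table 6.3, would need separate handling that this sketch does not attempt.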
Figure 7.1: Software screen capture of Simulator window with loaded processor sea.
7.3.2 Atomic Processor Record View
AP contents at any given simulation step (i.e., any given machine cycle) can be observed
via a modeless dialog box that can remain open alongside the main program window which
contains the entire Pond processor sea, shown in Figure 7.2.
Figure 7.2: Software screen capture of AP Record view.
Each field describing the contents of a single AP is available on this view; following
each simulation step, or if the user clicks on a different AP, the record view is updated to
reflect the most recent activity.
7.3.3 Illustration of a Global Cast
Each time an atomic processor is updated (that is, it receives a new message or produces
a new result), it is highlighted. The wave front of a global cast can be observed in
Figure 7.3.




Figure 7.3: Software screen captures of global cast propagation; consecutive machine cycles are shown.
the cast through message propagation. The message itself can be seen by right-clicking on
a highlighted AP and viewing its AP Record window.
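The wave front behavior can be mimicked with a few lines of C that record the cycle at which each AP in a small grid first receives a global cast. This is a standalone sketch, not code from the simulator: the grid size, the function name `global_cast_arrival`, and the assumption that a message reaches all eight surrounding APs in one cycle are illustrative choices.

```c
#define ROWS 4
#define COLS 5

/* Fills arrival[r][c] with the cycle at which a global cast started at
 * (src_r, src_c) first reaches AP(r, c). The source is cycle 0.
 * Assumes each AP forwards to its eight neighbors in one cycle. */
void global_cast_arrival(int src_r, int src_c, int arrival[ROWS][COLS])
{
    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c < COLS; c++) {
            arrival[r][c] = -1;  /* not reached yet */
        }
    }
    arrival[src_r][src_c] = 0;

    int changed = 1;
    for (int cycle = 0; changed; cycle++) {
        changed = 0;
        for (int r = 0; r < ROWS; r++) {
            for (int c = 0; c < COLS; c++) {
                if (arrival[r][c] != cycle) {
                    continue;  /* only the current wave front forwards */
                }
                for (int dr = -1; dr <= 1; dr++) {
                    for (int dc = -1; dc <= 1; dc++) {
                        int nr = r + dr;
                        int nc = c + dc;
                        if (nr < 0 || nr >= ROWS || nc < 0 || nc >= COLS) {
                            continue;
                        }
                        if (arrival[nr][nc] < 0) {
                            arrival[nr][nc] = cycle + 1;
                            changed = 1;
                        }
                    }
                }
            }
        }
    }
}
```

The set of APs sharing the same arrival value at a given cycle corresponds to the highlighted wave front seen in consecutive screen captures.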
7.4 To Be Implemented
While the majority of the communication architecture has been implemented in the func-
tional simulation software, there is more work to be done on the execution side. Currently
the following functions are planned for implementation in the future:
• Beamed, entity, local, P2P, and redirected cast types
• Send and receive layers for handshaking and message passing
• Decoding of received messages via message buffer
• Encoding of messages to send via message buffer
• Local execution of instructions by APs
• Simulation of defective APs
• Additional simulation engine control (run to completion, pause, etc.)




Having covered the architecture’s organization, communications, operation, and performance
evaluation, we continue in this section to highlight and discuss its most relevant
features and benefits. Provided there are sufficient atomic processors available, any number
of programs can execute in parallel without being impeded by structural dependencies.
These dependencies, and implicitly structural hazards, are practically eliminated. Any number of
instructions can execute concurrently if there are no data or control dependencies. For
example, if there are no data dependencies between the iterations of a loop, several or all
iterations may execute concurrently. Current architectures employ loop unrolling, but the
number of unrolled iterations is severely limited by the number of available registers and
functional units, and the number of memory accesses. In the proposed architecture these
structural bottlenecks are eliminated. As pointed out in the introduction, future technolo-
gies will continue to offer an increasing number of components on a chip. However, some
of these will fail prematurely or during the lifetime of the system. Therefore, for the design
of any architecture the implication is that it must be robust, i.e. it must be able to tolerate
component failures while continuing to operate nominally.
Fault tolerance is traditionally addressed through modular redundancy, which in this
context can be implemented at either the gate or functional block levels. In either case, the
redundant components can only be used in place, i.e., within their related unit. Furthermore,
components usually fail due to locally induced conditions. This means that if redundant
components or modules are located in close proximity, the probability for all redundant
copies to fail increases.
In the proposed architecture, while each atomic processor can be viewed as a redundant
copy of another atomic processor, each copy can be used by different morphological entities
at the same time, and can serve different data manipulation needs. The failure of one
or more atomic processors does not fatally affect the overall storage, communication, and
processing capabilities of the system. From another point of view, the morphological entities
in the sea of atomic processors act like multicellular organisms that discard their dead cells
but maintain their functionality. Each atomic processor is envisioned (in future work) to
have a self-test and diagnosis capability. Such a self-test would be repeated periodically, or
at the request of one of its neighbors or a system function.
There is no central control in the sea of atomic processors, nor a unit that keeps track
of available resources. Program execution adapts to the resources available at the time of
execution. These resources are equal to the total number of atomic processors available in
the sea at the time of fabrication, less the number of atomic processors that have died since
manufacturing, and the number of atomic processors currently used by other morphological
entities. Similarly, data processing is not centralized, i.e., there are no data paths and
no shared manipulation units. Each data word is referenced by the element identification
number within the data structure to which it belongs.
The size of the sea of atomic processors is scalable, and programs are portable. This
means that programs developed in a 64K sea of atomic processors, can run in a 1 Tera sea
of atomic processors, and vice versa, as long as the number of available atomic processors




In the Introduction, work has been cited in the areas of multicore architectures and net-
works on chip, which addresses problems of late and post silicon technologies. In this
section, elaboration is presented on a few other architectural paradigms with which the
Pond architecture shares traits, but from which it also differs.
Patwardhan et al. explore a DNA-based nano architecture in [85]. It is massively parallel
and uses low-complexity processing nodes. However, in our architecture, functions
and data are virtually separated from the physical support, whereas their nodes self-organize
into functional blocks. In their follow-up work, they focus on computing for nanoscale
sensors [86].
The architecture we propose resembles a systolic array with respect to the homogeneous
and structured array of low-complexity data processing elements, and the use of next
neighbor communications. However, unlike a systolic array, which is a data flow architecture,
ours is a control flow architecture in which the processing that a group of atomic
processors can accomplish is completely reconfigurable through software. Systolic arrays
have proven that data driven processing can be a viable alternative to instruction driven
processing, but only for linear and uniform data sets, as is shown in [87].
Reconfigurable computing is currently trying to resolve these limitations by employing
both paradigms [88]. A recent architecture with orthogonally arranged processing elements
and next neighbor communications has been proposed and built by Baas et al. in [89]. As
the authors report, the implementation has been a great success. However, it uses only 36
nodes and it is unclear how it could scale up to a higher number of cores to serve other
applications than the ones tested.
In their search for alternative computing paradigms, computer architects and scientists
have turned for inspiration to biological systems. Neural networks are modeled after the
neural system, and have proved useful for specific classes of applications, such as control
systems. For computationally precise tasks, neural networks do not yet match traditional
computer architectures. However, the former exhibit a feature that is desirable in future ar-
chitectures on a chip: resiliency to defective components. Following the example of neural
networks in learning from biological systems, Abelson et al. have focused on self-assembly
and self-organization [90]. Due to the large number of atomic processing elements, our ar-
chitecture can support the development and implementation of neural networks of sizes and
complexities unmatched today. The same physical platform, the sea of atomic processors,
can support precise computing algorithms and neural network algorithms.
Other notable architectures which have been designed specifically to exploit instruction
and thread level parallelism are: the Connection Machine by Daniel Hillis [91], WaveScalar
by Swanson et al. [92], the RAW microprocessor by Agarwal et al. [93], Blue Gene by
IBM [94], supercomputing vector machines [95], and the INMOS Transputer [96]. All
these are massively parallel, but their processing nodes are more complex than our atomic
processors, and long interconnects are not eliminated, but rather mitigated. Architectures
have also been proposed that incorporate similar morphological movement and interac-
tion as a basis for execution of programs. Some of these include Self-Reproducing Au-
tomata [97] and Invasive Algorithms and Architectures [98]; in contrast with these ap-
proaches, the Pond architecture exhibits only physical nearest-neighbor communications
as opposed to additional routing structures. In addition, the Plastic Cell Architecture [99]
shares similarities in terms of communication infrastructure but is targeted specifically for
reconfigurable hardware devices. In the context of discovery of parallelism as well as
dataflow architectures, Macromodules [100] and Micropipelines [101] are similar func-




Future Work and Conclusions
10.1 Detection of Component Defects
An excellent feature of this architecture design is resiliency against component defects.
Conceivably, APs can be intelligent enough to detect if there is a defect in their circuitry,
and deactivate at runtime to prevent other APs from attempting communication. Defects
are certainly possible and very probable at the time of fabrication, but it is also possible
that over time, damage can occur to the sea, especially with the increase in component den-
sity and smaller feature sizes of state-of-the-art technologies. Ideally, self-checks could be
implemented for periodic diagnostics so the sea is aware of corruption as soon as possible.
The types of checks that can be done are somewhat ambiguous, and there is nothing
preventing the self-check mechanisms themselves from becoming damaged. A good implementation
of self-checking could therefore occur at startup from a separate set of circuitry outside
the sea (but on the same die, as is the case with the input/output interfacing,
which resides outside the grid/sea of APs). This circuitry would run a series of test vectors
along the path for bit stream programming and would receive a bit stream back indicating
which APs failed a test; each failing AP would then be forced into an inactive state. Of course, all
of this circuitry is subject to defects as well, but assuming the defect detection circuitry is a
very small percentage of chip area (a safe assumption), it is much more likely that defects
would occur outside the detection circuitry itself.
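A software model of this startup check might look like the following C sketch. Everything in it is hypothetical: the text does not specify what the test vectors exercise, so the sketch tests a simulated adder, models a defect as output bits stuck at zero, and uses the invented names `TestVector`, `SimAp`, and `startup_self_test`. It shows only the flow of applying vectors, collecting a fail indication per AP, and deactivating the failing ones.

```c
#include <stddef.h>
#include <stdint.h>

/* One test vector: operands and the expected sum. */
typedef struct {
    uint32_t a, b, expected;
} TestVector;

/* Simulated AP with an optional defect (hypothetical fault model). */
typedef struct {
    int active;              /* 1 = usable, 0 = forced inactive */
    uint32_t stuck_at_zero;  /* output bits that always read 0 */
} SimAp;

/* Simulated AP adder; a defect masks some result bits. */
static uint32_t ap_add(const SimAp *ap, uint32_t a, uint32_t b)
{
    return (a + b) & ~ap->stuck_at_zero;
}

/* Runs the vectors on every AP. failed[k] is set to 1 for each AP that
 * fails any vector, and that AP is deactivated.
 * Returns the number of APs deactivated. */
size_t startup_self_test(SimAp *sea, size_t n_ap,
                         const TestVector *tv, size_t n_tv,
                         uint8_t *failed)
{
    size_t n_failed = 0;
    for (size_t k = 0; k < n_ap; k++) {
        failed[k] = 0;
        for (size_t v = 0; v < n_tv; v++) {
            if (ap_add(&sea[k], tv[v].a, tv[v].b) != tv[v].expected) {
                failed[k] = 1;
                break;
            }
        }
        if (failed[k]) {
            sea[k].active = 0;  /* force the AP into an inactive state */
            n_failed++;
        }
    }
    return n_failed;
}
```

The `failed` array plays the role of the bit stream returned to the test circuitry, one entry per AP.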
The fact that the processor sea architecture is functionally able to operate with a significant
number of defects makes the usage of test vectors post-production a sensible approach.
If a given circuit has the capability of operating with defects because of high redundancy
(as is the case here), running diagnostics is a valuable way of preserving
processor functionality by taking advantage of this redundancy. Since it is highly unlikely
that enough defects or runtime damage occur in a given chip to render a significant portion
of APs inoperable, it is very unlikely that fabrication-time defects or field-time
corruption would ever mandate that a chip be discarded, except in extreme damage cases.
As indicated, a small unlikely exception exists if damage occurs to input/output interfaces
or the diagnostic/test circuitry.
10.2 Implementation Considerations
An important aspect of this design to consider is the physical implementation. In the pre-
ceding sections we have discussed mainly the design and performance considerations. The
arrangement of circuitry in terms of small cells with local communications is not a partic-
ularly new approach (though the architecture design, behavior, and programming concepts
presented for this architecture are unique), but follows a well-established compute model
similar to cellular automata where multiple identical elements communicate via a next-
neighbor system [102].
Programming of this architecture may be considered in the form of a bit stream down-
load, where the sea is put into a programming mode, and each AP receives the initialization
bits it needs to create data structures, function definitions, and execution order information.
Logical identification numbers, initialized operands, operand identification numbers, exe-
cution order and preceding execution order numbers, instruction execution repeats (allows
instruction concurrency), opcodes, and condition flags may be programmed into each AP
via SRAM cells similar to the way FPGAs are configured using bit stream downloads.
Consider also the implementation benefits of this architecture design. The scalable na-
ture of the architecture allows for very efficient wafer use—areas along the perimeter of
the wafer usually lost because they do not offer enough area to fit another chip can be used
to fabricate smaller seas, as shown in Figure 10.1.
Figure 10.1: Example of multiple sea sizes on the same wafer (regions labeled "Large Sea Size" and "Smaller Sea Sizes")
As shown, the edges of the wafer that cannot fit another die of a given sea size can be
populated with smaller dies with smaller sea sizes, bringing wafer usage closer to the ideal
100%.
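The effect can be estimated with a rough geometric sketch. The C code below is an
illustration only: it assumes square dies on a fixed grid, a perfectly circular wafer, no
scribe lines or edge exclusion, and smaller seas that are exactly half the side length of
the large ones. The function names square_fits and wafer_utilization are hypothetical.

```c
#include <assert.h>

static const double PI = 3.14159265358979323846;

/* Returns 1 if the axis-aligned square with lower-left corner (x, y) and the
 * given side lies entirely inside a circle of radius r centered at the origin. */
int square_fits(double x, double y, double side, double r)
{
    double xs[2] = { x, x + side };
    double ys[2] = { y, y + side };
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            if (xs[i] * xs[i] + ys[j] * ys[j] > r * r)
                return 0; /* this corner falls off the wafer */
    return 1;
}

/* Fraction of the wafer area covered by dies. Large dies of side `side` are
 * placed on a grid; if fill_edges is nonzero, each grid cell that cannot hold
 * a large die is split into four half-size cells, and every half-size cell
 * that fits on the wafer is used for a smaller sea. */
double wafer_utilization(double r, double side, int fill_edges)
{
    assert(r > 0.0 && side > 0.0);
    int n = (int)(r / side) + 1; /* grid cells per half-axis */
    double half = side / 2.0;
    double used = 0.0;

    for (int i = -n; i < n; i++) {
        for (int j = -n; j < n; j++) {
            double x = i * side, y = j * side;
            if (square_fits(x, y, side, r)) {
                used += side * side;
                continue;
            }
            if (!fill_edges)
                continue;
            for (int a = 0; a < 2; a++)
                for (int b = 0; b < 2; b++)
                    if (square_fits(x + a * half, y + b * half, half, r))
                        used += half * half;
        }
    }
    return used / (PI * r * r);
}
```

Comparing wafer_utilization(r, side, 0) with wafer_utilization(r, side, 1) for the same
wafer radius and die size gives a first-order indication of the area recovered by the
smaller seas; repeating the subdivision with further sea sizes would recover more.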
In addition, this architecture offers a design that theoretically will not suffer from “hot
spots” as in typical architectures. Hot spots often appear in traditional architectures in
high-density manipulation units (multipliers, etc.) and areas of centralized control, where
the high density, high usage area generates significantly more heat than surrounding areas
[103]. Since there is no centralization of control or areas of the sea that are targeted for
specific manipulation functions, hot spots would theoretically be eliminated. However,
“warm” spots are likely to exist at the input/output ports for applications that employ
heavy input or output, such as streaming media or bulk data transfers.
The absence of a central bus and of long interconnects means that the architecture could
likely operate at a significantly higher clock speed than traditional architectures. A common
issue limiting clock speed is parasitics in interconnects. The longer
a wire, the higher the series resistance and inductance, and the higher the parallel capac-
itance. The interconnect can become a low-pass filter, limiting the maximum frequency
that can reliably propagate. Therefore, the clock speed and high-frequency communication
lines are often limited by what can be supported in these long interconnects because of this
low-pass behavior. Since this architecture uses next-neighbor connections only and does
not require global clock distribution, interconnect wires would have significantly lower
parasitics, allowing for reliable operation at much higher clock speeds. Higher clock
speeds on the same processor mean faster execution. In addition, the higher parasitic
capacitance of long interconnects acts as a larger load capacitance on the wire,
increasing power consumption because more charge is needed to drive that wire.
Avoiding long interconnects avoids this power loss. The power saved can
be “reallocated” to allow for higher clock speeds at the same power draw, or simply allow
for a lower-powered chip overall.
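The scaling argument can be made concrete with a first-order model. The sketch below treats
a wire as a single lumped RC low-pass filter, with cutoff f_c = 1/(2*pi*R*C), and uses the
standard dynamic power estimate P = a*C*V^2*f. It ignores inductance and distributed-line
effects, and the function names and per-millimeter parameters are assumptions chosen for
illustration, not measured values for any process.

```c
#include <assert.h>

static const double PI = 3.14159265358979323846;

/* Cutoff frequency (Hz) of a wire modeled as one lumped RC low-pass stage.
 * r_per_mm: series resistance per mm (ohm/mm)
 * c_per_mm: capacitance to ground per mm (F/mm)
 * len_mm:   wire length (mm)
 * Both R and C grow linearly with length, so the cutoff falls with length^2. */
double wire_cutoff_hz(double r_per_mm, double c_per_mm, double len_mm)
{
    assert(r_per_mm > 0.0 && c_per_mm > 0.0 && len_mm > 0.0);
    double r = r_per_mm * len_mm;
    double c = c_per_mm * len_mm;
    return 1.0 / (2.0 * PI * r * c);
}

/* Dynamic switching power (W) for a node of capacitance c_farads switched at
 * f_hz with supply vdd and activity factor `activity` (0..1). */
double dynamic_power_w(double activity, double c_farads, double vdd, double f_hz)
{
    assert(activity >= 0.0 && activity <= 1.0);
    return activity * c_farads * vdd * vdd * f_hz;
}
```

In this model a wire ten times shorter has a cutoff frequency one hundred times higher, and
halving the load capacitance halves the switching power at a fixed clock frequency, which is
the qualitative behavior the argument above relies on.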
10.2.1 Size Requirements
This architecture uses a novel approach to computing by taking multicore processing to its
most fine-grained level. In many multicore designs, the number of processing units is sub-
stantially smaller than the number of words in storage, requiring the majority of processing
to be done sequentially; the amount of parallelism is limited by the small number of pro-
cessing cores. Since processing functions take significantly more hardware to implement
than memory (RAM), it is not surprising that multicore designs have taken this approach.
We propose a design that does implement the same number of processing elements as data
storage elements—in fact, they are interchangeable—to eliminate these structural hazards
that impede concurrent and parallel execution of programs. With feature sizes decreas-
ing dramatically, and the introduction of post-silicon technologies offering more and more
components on a chip, many designs attempt to improve parallelism by further increasing
the complexity of processing cores, sometimes substantially improving instruction throughput
via runtime scheduling systems and more redundancy in functional units. This approach
has its limits; by instead decreasing core complexity and placing more cores on a chip,
the inherent instruction-level and function-level parallelism can likely be exploited to a
much greater degree.
10.3 Software Development
Areas of current and future development of this architecture include refinements to soft-
ware simulation, and development of a compiler/assembler as well as a fully-featured IDE
(Integrated Development Environment) that allows for program flowchart-style graphical
programming with parallelism as well as standard high-level language programming and
compilation. Of course, this is an optional course of action and is intended mainly for
development purposes; this architecture and its associated compiler, once completed, are
intended to work on traditional sequential languages and discover parallelism automati-
cally.
10.4 Contributions to the State-of-the-Art
This thesis has intended to contribute the definition, initial development, and analysis of a
massively parallel computer architecture intended for implementation consideration in late
and post silicon technologies. Among the notable characteristics are the massively parallel
and fine-grained nature, a distributed data storage system providing decentralization of
resources, a theoretically successful next-neighbor message passing system, function or
entity-based morphological behavior of cores, and a basis for important features moving
into late and post silicon design including scalability and defect/fault tolerance.
Bibliography
[1] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative
Approach. Amsterdam: Elsevier, 4th edition, 2007.
[2] D. Geer. Chip makers turn to multicore processors. Computer, 38(5):11–13, May
2005.
[3] David Yeh, Li-Shiuan Peh, Shekhar Borkar, John Darringer, Anant Agarwal, and
Wen-mei Hwu. Thousand-core chips [roundtable]. Design & Test of Computers, IEEE,
25(3):272–278, May/Jun 2008.
[4] Markus Levy and Thomas M. Conte. Embedded multicore processors and systems.
Micro, IEEE, 29(3):7–9, May/Jun 2009.
[5] G. Blake, R.G. Dreslinski, and T. Mudge. A survey of multicore processors. Signal
Processing Magazine, IEEE, 26(6):26–37, November 2009.
[6] J. Dorsey, S. Searles, M. Ciraula, S. Johnson, N. Bujanos, D. Wu, M. Braganza,
S. Meyers, E. Fang, and R. Kumar. An integrated quad-core Opteron processor.
In Solid-State Circuits Conference, 2007. ISSCC 2007. Digest of Technical Papers.
IEEE International, pages 102–103, 2007.
[7] U.M. Nawathe, M. Hassan, L. Warriner, K. Yen, B. Upputuri, D. Greenhill, A. Kumar,
and H. Park. An 8-core 64-thread 64b power-efficient SPARC SoC. In Solid-State
Circuits Conference, 2007. ISSCC 2007. Digest of Technical Papers. IEEE
International, pages 108–590, 11-15 2007.
[8] M. Mehrara, T. Jablin, D. Upton, D. August, K. Hazelwood, and S. Mahlke. Mul-
ticore compilation strategies and challenges. Signal Processing Magazine, IEEE,
26(6):55–63, November 2009.
[9] M.D. McCool. Scalable programming models for massively multicore processors.
Proceedings of the IEEE, 96(5):816–831, May 2008.
[10] S. Gal-On and M. Levy. Measuring multicore performance. Computer, 41(11):99–
102, Nov. 2008.
[11] James Donald and Margaret Martonosi. An efficient, practical parallelization
methodology for multicore architecture simulation. Computer Architecture Letters,
5(2):14, Jul–Dec 2006.
[12] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja,
and A. Hemani. A network on chip architecture and design methodology. In VLSI,
2002. Proceedings. IEEE Computer Society Annual Symposium on, pages 105–112,
2002.
[13] Jiang Xu, W. Wolf, J. Henkel, and S. Chakradhar. A methodology for design, mod-
eling, and analysis of networks-on-chip. In Circuits and Systems, 2005. ISCAS 2005.
IEEE International Symposium on, pages 1778–1781 Vol. 2, 23–26 2005.
[14] L. Benini and D. Bertozzi. Network-on-chip architectures and design methods. Com-
puters and Digital Techniques, IEEE Proceedings on, 152(2):261–272, Mar 2005.
[15] M. Amde, T. Felicijan, A. Efthymiou, D. Edwards, and L. Lavagno. Asynchronous
on-chip networks. Computers and Digital Techniques, IEEE Proceedings
on, 152(2):273–283, Mar 2005.
[16] K. Goossens, J. Dielissen, and A. Radulescu. AEthereal network on chip: concepts,
architectures, and implementations. Design & Test of Computers, IEEE, 22(5):414–
421, Sept-Oct. 2005.
[17] D. Bertozzi and L. Benini. Xpipes: a network-on-chip architecture for gigascale
systems-on-chip. Circuits and Systems Magazine, IEEE, 4(2):18–31, 2004.
[18] G. Leary, K. Srinivasan, K. Mehta, and K.S. Chatha. Design of network-on-chip
architectures with a genetic algorithm-based technique. Very Large Scale Integration
(VLSI) Systems, IEEE Transactions on, 17(5):674–687, May 2009.
[19] M. Forsell. A scalable high-performance computing solution for networks on chips.
Micro, IEEE, 22(5):46–55, Sept/Oct 2002.
[20] D.A. Ilitzky, J.D. Hoffman, A. Chun, and B.P. Esparza. Architecture of the scalable
communications core’s network on chip. Micro, IEEE, 27(5):62–74, Sept-Oct 2007.
[21] P.P. Pande, C. Grecu, A. Ivanov, R. Saleh, and G. De Micheli. Design, synthesis,
and test of networks on chips. Design & Test of Computers, IEEE, 22(5):404–413,
Sept-Oct. 2005.
[22] R. Saleh. An approach that will NoC your SoCs off! Design & Test of Computers,
IEEE, 22(5):488, Sept/Oct 2005.
[23] H.C. Freitas, F.L. Madruga, M. Alves, and P. Navaux. Design of interleaved mul-
tithreading for network processors on chip. In Circuits and Systems, 2009. ISCAS
2009. IEEE International Symposium on, pages 2213–2216, 2009.
[24] Tobias Bjerregaard and Shankar Mahadevan. A survey of research and practices of
network-on-chip. ACM Comput. Surv., 38(1):1, 2006.
[25] R. Marculescu, U.Y. Ogras, Li-Shiuan Peh, N.E. Jerger, and Y. Hoskote. Outstanding
research problems in NoC design: System, microarchitecture, and circuit perspec-
tives. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transac-
tions on, 28(1):3–21, Jan 2009.
[26] Kuei-Chung Chang, Jih-Sheng Shen, and Tien-Fu Chen. Evaluation and design
trade-offs between circuit-switched and packet-switched NOCs for application-
specific SOCs. In Design Automation Conference, 2006 43rd ACM/IEEE, pages
143–148, 2006.
[27] Gaoming Du, Duoli Zhang, Yukun Song, Minglun Gao, Luofeng Geng, and Ning
Hou. Scalability study on mesh based network on chip. In Computational Intel-
ligence and Industrial Application, 2008. PACIIA ’08. Pacific-Asia Workshop on,
volume 2, pages 681–685, 2008.
[28] Huy-Nam Nguyen, Vu-Duc Ngo, and Hae-Wook Choi. Assessing routing behavior
on on-chip-network. In Computer Engineering and Systems, The 2006 International
Conference on, pages 62–65, 5-7 2006.
[29] Xinming Duan, Dakun Zhang, and Xuemei Sun. Routing schemes of an irregu-
lar mesh-based NoC. In Networks Security, Wireless Communications and Trusted
Computing, 2009. NSWCTC ’09. International Conference on, volume 2, pages 572–
575, 2009.
[30] Shijun Lin, Li Su, Haibo Su, Depeng Jin, and Lieguang Zeng. Design trade-offs
in packetizing mechanism for network-on-chip. In Digital Society, 2009. ICDS ’09.
Third International Conference on, pages 316–321, 2009.
[31] J. Hu and R. Marculescu. Communication and task scheduling of application-
specific networks-on-chip. Computers and Digital Techniques, IEEE Proceedings
on, 152(5):643–651, 2005.
[32] A. Leroy, D. Milojevic, D. Verkest, F. Robert, and F. Catthoor. Concepts and imple-
mentation of spatial division multiplexing for guaranteed throughput in networks-
on-chip. Computers, IEEE Transactions on, 57(9):1182–1195, Sept 2008.
[33] A. Shacham, K. Bergman, and L.P. Carloni. Photonic networks-on-chip for future
generations of chip multiprocessors. Computers, IEEE Transactions on, 57(9):1246–
1260, Sept 2008.
[34] D. Schinkel, E. Mensink, E. Klumperink, E. van Tuijl, and B. Nauta. Low-power,
high-speed transceivers for network-on-chip communication. Very Large Scale Inte-
gration (VLSI) Systems, IEEE Transactions on, 17(1):12–21, Jan 2009.
[35] G. Schelle, J. Fifield, and D. Grunwald. A software defined radio application utilizing
modern FPGAs and NoC interconnects. In Field Programmable Logic and
Applications, 2007. FPL 2007. International Conference on, pages 177–182, 2007.
[36] J. Chan and S. Parameswaran. NoCOUT: NoC topology generation with mixed
packet-switched and point-to-point networks. In Design Automation Conference,
2008. ASPDAC 2008. Asia and South Pacific, pages 265–270, 2008.
[37] V.F. Pavlidis and E.G. Friedman. Interconnect-based design methodologies for three-
dimensional integrated circuits. Proceedings of the IEEE, 97(1):123–140, Jan 2009.
[38] A. Mejia, M. Palesi, J. Flich, S. Kumar, P. Lopez, R. Holsmark, and J. Duato.
Region-based routing: A mechanism to support efficient routing algorithms in NoCs.
Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 17(3):356–369,
Mar 2009.
[39] J.H. Bahn and N. Bagherzadeh. Design of simulation and analytical models for a
2D-meshed asymmetric adaptive router. Computers & Digital Techniques, IET, 2(1):63–
73, Jan 2008.
[40] M. Palesi, R. Holsmark, S. Kumar, and V. Catania. Application specific routing
algorithms for networks on chip. Parallel and Distributed Systems, IEEE Transactions
on, 20(3):316–330, Mar 2009.
[41] A. Ganguly, P.P. Pande, and B. Belzer. Crosstalk-aware channel coding schemes
for energy efficient and reliable NOC interconnects. Very Large Scale Integration
(VLSI) Systems, IEEE Transactions on, 17(11):1626–1639, Nov 2009.
[42] Xin Wang, Tapani Ahonen, and Jari Nurmi. Applying CDMA technique to network-
on-chip. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on,
15(10):1091–1100, Oct 2007.
[43] S. Murali, D. Atienza, P. Meloni, S. Carta, L. Benini, G. De Micheli, and L. Raffo.
Synthesis of predictable networks-on-chip-based interconnect architectures for chip
multiprocessors. Very Large Scale Integration (VLSI) Systems, IEEE Transactions
on, 15(8):869–880, Aug 2007.
[44] H.C. Freitas, T.G.S. Santos, and P.O.A. Navaux. Design of programmable
NoC router architecture on FPGA for multi-cluster NoCs. Electronics Letters,
44(16):969–971, Jul 2008.
[45] N. Bagherzadeh and M. Matsuura. Performance impact of task-to-task communi-
cation protocol in network-on-chip. In Information Technology: New Generations,
2008. ITNG 2008. Fifth International Conference on, pages 1101–1106, 7-9 2008.
[46] F. Karim, A. Nguyen, and S. Dey. An interconnect architecture for networking sys-
tems on chips. Micro, IEEE, 22(5):36–45, Sept/Oct 2002.
[47] S. Rodrigo, S. Medardoni, J. Flich, D. Bertozzi, and J. Duato. Efficient implementation
of distributed routing algorithms for NoCs. Computers & Digital Techniques, IET,
3(5):460–475, Sept 2009.
[48] S. Yan and B. Lin. Joint multicast routing and network design optimisation for
networks-on-chip. Computers & Digital Techniques, IET, 3(5):443–459, Sept 2009.
[49] M. Daneshtalab, M. Ebrahimi, S. Mohammadi, and A. Afzali-Kusha. Low-distance
path-based multicast routing algorithm for network-on-chips. Computers & Digital
Techniques, IET, 3(5):430–442, Sept 2009.
[50] M. Palesi, S. Kumar, and V. Catania. Bandwidth-aware routing algorithms for
networks-on-chip platforms. Computers & Digital Techniques, IET, 3(5):413–429,
Sept 2009.
[51] Se-Joong Lee, Kangmin Lee, Seong-Jun Song, and Hoi-Jun Yoo. Packet-switched
on-chip interconnection network for system-on-chip applications. Circuits and Sys-
tems II: Express Briefs, IEEE Transactions on, 52(6):308–312, Jun 2005.
[52] F. Jafari, M.S. Talebi, A. Khonsari, and M.H. Yaghmaee. A novel congestion control
scheme in network-on-chip based on best effort delay-sum optimization. In Parallel
Architectures, Algorithms, and Networks, 2008. I-SPAN 2008. International Sympo-
sium on, pages 191–196, 2008.
[53] J. Dongarra, T. Sterling, H. Simon, and E. Strohmaier. High-performance computing:
clusters, constellations, MPPs, and future directions. Computing in Science &
Engineering, 7(2):51–59, Mar/Apr 2005.
[54] J.M. Rabaey and S. Malik. Challenges and solutions for late- and post-silicon design.
Design & Test of Computers, IEEE, 25(4):296–302, July/Aug 2008.
[55] Wen-mei Hwu, K. Keutzer, and T.G. Mattson. The concurrency challenge. Design &
Test of Computers, IEEE, 25(4):312–320, Jul/Aug 2008.
[56] T. Austin, V. Bertacco, S. Mahlke, and Yu Cao. Reliable systems on unreliable
fabrics. Design & Test of Computers, IEEE, 25(4):322–332, Jul/Aug 2008.
[57] Lei Zhang, Yinhe Han, Qiang Xu, Xiao-wei Li, and Huawei Li. On topology reconfiguration
for defect-tolerant NoC-based homogeneous manycore systems. Very
Large Scale Integration (VLSI) Systems, IEEE Transactions on, 17(9):1173–1186,
Sept 2009.
[58] M. Lammie, P. Brenner, and D. Thain. Scheduling grid workloads on multicore
clusters to minimize energy and maximize performance. In Grid Computing, 2009
10th IEEE/ACM International Conference on, pages 145–152, 2009.
[59] P. Chaparro, J. Gonzalez, G. Magklis, Cai Qiong, and A. Gonzalez. Understand-
ing the thermal implications of multi-core architectures. Parallel and Distributed
Systems, IEEE Transactions on, 18(8):1055–1065, Aug 2007.
[60] Jingcao Hu and R. Marculescu. Energy- and performance-aware mapping for regu-
lar NoC architectures. Computer-Aided Design of Integrated Circuits and Systems,
IEEE Transactions on, 24(4):551–562, Apr 2005.
[61] N. Banerjee, P. Vellanki, and K.S. Chatha. A power and performance model for
network-on-chip architectures. In Design, Automation and Test in Europe Confer-
ence and Exhibition, 2004. Proceedings, volume 2, pages 1250–1255, 2004.
[62] A. Sarathy, A. Louri, and A.K. Kodi. Low-power low-area network-on-chip archi-
tecture using adaptive electronic link buffers. Electronics Letters, 44(8):512–513,
Apr 2008.
[63] Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo. Low-power network-on-chip for
high-performance SoC design. Very Large Scale Integration (VLSI) Systems, IEEE
Transactions on, 14(2):148–160, Feb 2006.
[64] A. Hansson, M. Wiggers, A. Moonen, K. Goossens, and M. Bekooij. Enabling
application-level performance guarantees in network-based systems on chip by applying
dataflow analysis. Computers & Digital Techniques, IET, 3(5):398–412, Sept
2009.
[65] T. Simunic, S.P. Boyd, and P. Glynn. Managing power consumption in networks
on chips. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on,
12(1):96–107, Jan 2004.
[66] R. Iris Bahar, Dan Hammerstrom, Justin Harlow, William H. Joyner Jr., Clifford Lau,
Diana Marculescu, Alex Orailoglu, and Massoud Pedram. Architectures for silicon
nanoelectronics and beyond. Computer, 40(1):25–33, Jan 2007.
[67] Shuo Wang, Lei Wang, and F. Jain. Dynamic redundancy allocation for reliable and
high-performance nanocomputing. In Nanoscale Architectures, 2007. NANOSARCH
2007. IEEE International Symposium on, pages 1–6, 21-22 2007.
[68] F. Martorell and A. Rubio. Defect and fault tolerant cell architecture for feasible
nanoelectronic designs. In Design and Test of Integrated Systems in Nanoscale Tech-
nology, 2006. DTIS 2006. International Conference on, pages 244–249, 2006.
[69] N.Z. Haron and S. Hamdioui. Emerging crossbar-based hybrid nanoarchitectures for
future computing systems. In Signals, Circuits and Systems, 2008. SCS 2008. 2nd
International Conference on, pages 1–6, 2008.
[70] Shuo Wang and Lei Wang. A defect-tolerant memory nanoarchitecture exploiting
hybrid redundancy. In Nanotechnology, 2008. NANO ’08. 8th IEEE Conference on,
pages 707–710, 2008.
[71] Shanrui Zhang, Minsu Choi, and Nohpill Park. Defect characterization and yield
analysis of array-based nanoarchitecture. In Nanotechnology, 2004. 4th IEEE Con-
ference on, pages 50–52, 2004.
[72] Trong Tu Bui and T. Shibata. A scalable architecture of associative processors em-
ploying nano functional devices. In Ultimate Integration of Silicon, 2009. ULIS
2009. 10th International Conference on, pages 213–216, 2009.
[73] Shanrui Zhang, Minsu Choi, and N. Park. Modeling yield of carbon-
nanotube/silicon-nanowire FET-based nanoarray architecture with h-hot addressing
scheme. In Defect and Fault Tolerance in VLSI Systems, 2004. DFT 2004. Proceed-
ings. 19th IEEE International Symposium on, pages 356–364, 2004.
[74] J.A. Casas, J.M. Moreno, J. Madrenas, and J. Cabestany. A novel hardware architec-
ture for self-adaptive systems. In Adaptive Hardware and Systems, 2007. AHS 2007.
Second NASA/ESA Conference on, pages 592–599, 2007.
[75] Shengxian Zhuang, J. Carlsson, and L. Wanhammar. A design approach for GALS
based systems-on-chip. In Solid-State and Integrated Circuits Technology, 2004.
Proceedings. 7th International Conference on, volume 2, pages 1368–1371, 2004.
[76] J. Mekie, S. Chakraborty, G. Venkataramani, P.S. Thiagarajan, and D.K. Sharma.
Interface design for rationally clocked GALS systems. In Asynchronous Circuits
and Systems, 2006. 12th IEEE International Symposium on, pages 12–171, 2006.
[77] M. Krstic, E. Grass, C. Stahl, and M. Piz. System integration by request-
driven GALS design. Computers and Digital Techniques, IEEE Proceedings on,
153(5):362–372, Sept 2006.
[78] F.K. Gurkaynak, T. Villiger, S. Oetiker, N. Felber, H. Kaeslin, and W. Fichtner.
A functional test methodology for globally-asynchronous locally-synchronous sys-
tems. In Asynchronous Circuits and Systems, 2002. Proceedings. Eighth Interna-
tional Symposium on, pages 181–189, 2002.
[79] George C. Necula, Scott McPeak, Shree P. Rahul, and Westley Weimer. Lecture Notes in
Computer Science. 2002.
[80] V. Gustin and P. Bulic. Extracting SIMD parallelism from ‘for’ loops. In Parallel
Processing Workshops, 2001. International Conference on, pages 23–28, 2001.
[81] M. Schlansker, T.M. Conte, J. Dehnert, K. Ebcioglu, J.Z. Fang, and C.L. Thompson.
Compilers for instruction-level parallelism. Computer, 30(12):63–69, Dec 1997.
[82] P.K. Dubey, G.B. Adams III, and M.J. Flynn. Instruction window size trade-offs
and characterization of program parallelism. Computers, IEEE Transactions on,
43(4):431–442, Apr 1994.
[83] L. Baumstark Jr. and L. Wills. Exposing data-level parallelism in sequential image
processing algorithms. In Reverse Engineering, 2002. Proceedings. Ninth Working
Conference on, pages 245–254, 2002.
[84] Vasilii Zakharov. Parallelism and array processing. Computers, IEEE Transactions
on, C-33(1):45–78, Jan 1984.
[85] Jaidev P. Patwardhan, Vijeta Johri, Chris Dwyer, and Alvin R. Lebeck. A defect
tolerant self-organizing nanoscale SIMD architecture. In ASPLOS-XII: Proceedings
of the 12th international conference on Architectural support for programming lan-
guages and operating systems, pages 241–251, New York, NY, USA, 2006. ACM.
[86] Constantin Pistol, Wutichai Chongchitmate, Christopher Dwyer, and Alvin R.
Lebeck. Architectural implications of nanoscale integrated sensing and computing.
In ASPLOS ’09: Proceeding of the 14th international conference on Architectural
support for programming languages and operating systems, pages 13–24, New York,
NY, USA, 2009. ACM.
[87] R. Hartenstein. A decade of reconfigurable computing: a visionary retrospective.
In Design, Automation and Test in Europe, 2001. Conference and Exhibition 2001.
Proceedings, pages 642–649, 2001.
[88] S. Hauck and A. DeHon. Reconfigurable Computing: The Theory and Practice of
FPGA-Based Computing. Morgan Kaufmann, 2008.
[89] B. Baas, Zhiyi Yu, M. Meeuwsen, O. Sattari, R. Apperson, E. Work, J. Webb, M. Lai,
T. Mohsenin, D. Truong, and J. Cheung. AsAP: A fine-grained many-core platform
for DSP applications. Micro, IEEE, 27(2):34–45, Mar/Apr 2007.
[90] Harold Abelson, Don Allen, Daniel Coore, Chris Hanson, George Homsy, Thomas F.
Knight, Jr., Radhika Nagpal, Erik Rauch, Gerald Jay Sussman, and Ron Weiss.
Amorphous computing. Commun. ACM, 43(5):74–82, 2000.
[91] W. Daniel Hillis and Lewis W. Tucker. The CM-5 connection machine: a scalable
supercomputer. Commun. ACM, 36(11):31–40, 1993.
[92] Steven Swanson, Ken Michelson, Andrew Schwerin, and Mark Oskin. WaveScalar.
In MICRO 36: Proceedings of the 36th annual IEEE/ACM International Sympo-
sium on Microarchitecture, page 291, Washington, DC, USA, 2003. IEEE Computer
Society.
[93] M.B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman,
P. Johnson, Jae-Wook Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman,
V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. The Raw microprocessor:
a computational fabric for software circuits and general-purpose programs. Micro,
IEEE, 22(2):25–35, Mar/Apr 2002.
[94] Blue gene. IBM Journal of Research and Development, 49(1), 2005.
[95] T.H. Dunigan Jr., J.S. Vetter, and P.H. Worley. Performance evaluation of the Cray X1
distributed shared memory architecture. In High Performance Interconnects, 2004.
Proceedings. 12th Annual IEEE Symposium on, pages 20–25, 2004.
[96] David Mitchell. The Transputer: The Time Is Now. 1989.
[97] J. Von Neumann and A. W. Burks. Theory of Self-Reproducing Automata. University
of Illinois Press, 1966.
[98] J. Teich. From dynamic reconfiguration to self-reconfiguration: Invasive algorithms
and architectures. In Field-Programmable Technology, 2009. FPT 2009. Interna-
tional Conference on, pages 11–12, 2009.
[99] R. Konishi, H. Ito, H. Nakada, A. Nagoya, K. Oguri, N. Imlig, T. Shiozawa, M. In-
amori, and K. Nagami. PCA-1: a fully asynchronous, self-reconfigurable LSI. In
Asynchronous Circuits and Systems, 2001. ASYNC 2001. Seventh International Sym-
posium on, pages 54–61, 2001.
[100] G. Gopalakrishnan, P. Kudva, and E. Brunvand. Peephole optimization of asyn-
chronous macromodule networks. Very Large Scale Integration (VLSI) Systems,
IEEE Transactions on, 7(1):30–37, Mar 1999.
[101] S. Pagey, S.D. Sherlekar, and G. Venkatesh. Issues in fault modelling and testing of
micropipelines. In Test Symposium, 1992. (ATS ’92), Proceedings., First Asian (Cat.
No.TH0458-0), pages 107–111, 1992.
[102] Edwin Roger Banks. Universality in cellular automata. In Switching and Automata
Theory, 1970., IEEE Conference Record of 11th Annual Symposium on, pages 194–
215, 1970.
[103] Je-Hyoung Park, A. Shakouri, and Sung-Mo Kang. Fast evaluation method for tran-
sient hot spots in VLSI ICs in packages. In Quality Electronic Design, 2008. ISQED
2008. 9th International Symposium on, pages 600–603, 2008.
Appendix A
Functional Simulator Programming Guide
The simulation software is written entirely in C using the Windows Win32 C API.
A.1 File Structure
The simulation software consists of five source files, logically splitting the functional com-
ponents of the program:
• simulator_win.c: Contains the main program window function and message loop.
• wnd_functions.c: Contains other program window functions and their message loops.
• ap_grid_ui.c: Contains functions related to the processing of the atomic processor
visual sea.
• ap_comm.c: Contains state machine for handshaking process and other functions
specific to communications.
• sim_engine.c: Contains the main simulator engine loop and associated functions.
In addition, four header and resource files are present, containing function prototypes,
global variables, and OS-specific resource allocations:
• resource.h: Resource file containing macro definitions for menu items, images, and
other UI elements
• resource.rc: Resource file containing more information on images, dialogs, and other
UI elements
• globals.h: Contains all global variable definitions, type definitions, function proto-
types, and common macros.
• codes.h: Contains macros for message codes for command message format.
A.2 Variables and Structs
A.2.1 record
The record struct represents the storage elements of a single atomic processor. It contains
all identification fields, operation code, execution conditions, order, count, status bits, and
handshake and message buffers. It also contains other AP-specific information such as




BOOL CALLBACK DlgProcOpenPond(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
Function corresponding to modeless AP Record dialog window.
Parameters:
• HWND hWnd: HWND Win32 API window handle for main program window.
• UINT msg: Message passing variable for OS message loop.
• WPARAM wParam: Parameter storing data related to OS message passing.
• LPARAM lParam: Parameter storing data related to OS message passing.
Returns: OS-specific callback.
A.3.2 ap_grid_ui.c
void PondBuildGrid(FILE *fi_ds_pnd, FILE *fd_pnd)
Builds the grid user interface for viewing the Pond processor sea. Machine code files are
whitespace-delimited. A line or row contains the record of one AP.
Parameters:
• FILE *fi_ds_pnd: Pointer to file containing function instance and data structure
machine code.
• FILE *fd_pnd: Pointer to file containing function definition machine code.
Returns: Nothing.
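To illustrate the whitespace-delimited, one-record-per-line layout described above, the
following is a minimal parsing sketch. It assumes that every field is an integer (decimal,
or hexadecimal with a 0x prefix); the actual field list and encodings of the machine code
files are not reproduced here, and the function name parse_ap_line is hypothetical rather
than part of the simulator.

```c
#include <assert.h>
#include <stdlib.h>

/* Splits one whitespace-delimited line into integer fields.
 * line:       NUL-terminated text of one AP record
 * fields:     output array receiving the parsed values
 * max_fields: capacity of the output array
 * Returns the number of fields stored. Parsing stops at the end of the line,
 * at the first token that is not an integer, or when the array is full. */
int parse_ap_line(const char *line, long *fields, int max_fields)
{
    assert(line != NULL && fields != NULL && max_fields > 0);

    int count = 0;
    const char *p = line;
    while (count < max_fields) {
        char *end;
        long value = strtol(p, &end, 0); /* skips leading whitespace itself */
        if (end == p)
            break;                       /* no further number on this line */
        fields[count++] = value;
        p = end;
    }
    return count;
}
```

A loader in this style would read each file with fgets, call the parser once per line, and
copy the resulting fields into the record of the corresponding AP.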
void RecenterGridOffset(HWND hWnd)
Assistive function to re-evaluate locations of APs after dragging and zooming.
Parameters:
• HWND hWnd: HWND Win32 API window handle for main program window.
Returns: Nothing.
void PlaceAPBitmaps(HWND hWnd, HDC windc)
Places colors corresponding to function definitions, function instances, and data structures
in the sea.
Parameters:
• HWND hWnd: HWND Win32 API window handle for main program window.
• HDC windc: HDC Win32 API device context for handling bitmap placement.
Returns: Nothing.
A.3.3 ap_comm.c
void APNeighborComm(int x, int y)
Step in neighbor handshaking process where message buffers are loaded.
Parameters:
• int x: x-coordinate of AP.
• int y: y-coordinate of AP.
Returns: Nothing.
void APNeighborReq(int x, int y)
Step in neighbor handshaking process in which the handshake buffer is initially set by the
sending AP.
Parameters:
• int x: x-coordinate of AP.
• int y: y-coordinate of AP.
Returns: Nothing.
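The two handshaking steps documented above, a request that sets the handshake buffer
followed by loading of the message buffer, can be pictured as a small state machine. The
sketch below is an assumption-laden illustration and not the simulator's code: the type
ap_sketch, the state names, the eight-field message size, and the functions
neighbor_request, neighbor_load, and buffer_process are all hypothetical, and the final
consume step is our own addition to close the cycle.

```c
#include <assert.h>
#include <string.h>

enum { MSG_FIELDS = 8 }; /* assumed message vector length */

typedef enum { HS_IDLE, HS_REQUESTED, HS_LOADED } hs_state;

/* Stand-in for the communication-related part of one AP record. */
typedef struct {
    hs_state hs;               /* state of the handshake buffer        */
    int      hs_from;          /* neighbor that currently owns the slot */
    int      msg_buf[MSG_FIELDS];
} ap_sketch;

/* Step 1: the sending AP claims the receiver's handshake buffer.
 * Returns 0 if the buffer is busy, so the sender must retry later. */
int neighbor_request(ap_sketch *receiver, int sender_id)
{
    assert(receiver != NULL);
    if (receiver->hs != HS_IDLE)
        return 0;
    receiver->hs = HS_REQUESTED;
    receiver->hs_from = sender_id;
    return 1;
}

/* Step 2: the message buffer is loaded, but only by the AP whose request
 * was accepted in step 1. */
int neighbor_load(ap_sketch *receiver, int sender_id, const int msg[MSG_FIELDS])
{
    assert(receiver != NULL && msg != NULL);
    if (receiver->hs != HS_REQUESTED || receiver->hs_from != sender_id)
        return 0;
    memcpy(receiver->msg_buf, msg, sizeof receiver->msg_buf);
    receiver->hs = HS_LOADED;
    return 1;
}

/* Step 3 (assumed): the receiver consumes the message and frees the buffer
 * so that another neighbor can start a new handshake. */
int buffer_process(ap_sketch *receiver, int out[MSG_FIELDS])
{
    assert(receiver != NULL && out != NULL);
    if (receiver->hs != HS_LOADED)
        return 0;
    memcpy(out, receiver->msg_buf, sizeof receiver->msg_buf);
    receiver->hs = HS_IDLE;
    return 1;
}
```

Because a second requester is refused while the buffer is claimed, two neighbors cannot
overwrite each other's messages in this sketch; the refused AP simply retries in a later
simulation cycle.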
void APBufferProcess(int x, int y)




• int x: x-coordinate of AP.
• int y: y-coordinate of AP.
Returns: Nothing.
void APSetTransDest(record *rec, int origin)
Resolves cast types into appropriate directions for APs to send message.
Parameters:
• record *rec: Pointer to AP Record structure of element sending message.














Called to run a single simulation (machine) cycle.
Parameters:
• HWND hWnd: HWND Win32 API window handle for main program window.
Returns: Nothing.
void APExecuteMsgInstr(record *rec)
Called when an instruction is ready to execute.
Parameters:
• record *rec: Pointer to AP Record structure containing element that should execute
its instruction.
Returns: Nothing.
void LoadMessage(record *rec, int msg0, int msg1, int msg2, int msg3, int msg4, int msg5,
int msg6, int msg7)
Called when an AP needs to send a message through a communications handshake. Each
parameter represents one field of the message vector; indices are described in macros,
which are included in the globals.h include file.
Parameters:
• record *rec: Pointer to AP Record structure containing element whose message buffer
should be set up.
• int msgx: Individual message bitfields to transfer; not all may be used for all messages.
Returns: Nothing.
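A function with this shape can be sketched as follows. The stand-in type record_sketch, its
members msg_buf and msg_valid, the constant MSG_FIELDS, and the name LoadMessageSketch
are hypothetical; the real record struct and the index macros in globals.h are not
reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MSG_FIELDS = 8 }; /* one slot per msgN parameter */

/* Stand-in for the message-related part of the AP record. */
typedef struct {
    int msg_buf[MSG_FIELDS]; /* outgoing message vector               */
    int msg_valid;           /* nonzero once a message awaits sending */
} record_sketch;

/* Copies the eight message fields into the AP's message buffer and marks the
 * buffer as holding a message for the next communications handshake. Fields
 * that a given message type does not use can be passed as 0. */
void LoadMessageSketch(record_sketch *rec,
                       int msg0, int msg1, int msg2, int msg3,
                       int msg4, int msg5, int msg6, int msg7)
{
    assert(rec != NULL);
    const int fields[MSG_FIELDS] = {
        msg0, msg1, msg2, msg3, msg4, msg5, msg6, msg7
    };
    memcpy(rec->msg_buf, fields, sizeof fields);
    rec->msg_valid = 1;
}
```

Passing the fields as separate integers keeps call sites readable when only a few fields
are meaningful, at the cost of a long parameter list; a caller with a prepared array could
equally copy it into the buffer directly.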




• record *rec: Pointer to AP Record structure containing element that just received the
message.
Returns: Nothing.
