Efficient Mapping of Neural Network Models on a Class of Parallel Architectures. by Arad, Behnam Seyed
Louisiana State University
LSU Digital Commons
LSU Historical Dissertations and Theses Graduate School
1997
Efficient Mapping of Neural Network Models on a
Class of Parallel Architectures.
Behnam Seyed Arad
Louisiana State University and Agricultural & Mechanical College
Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_disstheses
This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in
LSU Historical Dissertations and Theses by an authorized administrator of LSU Digital Commons. For more information, please contact
gradetd@lsu.edu.
Recommended Citation
Arad, Behnam Seyed, "Efficient Mapping of Neural Network Models on a Class of Parallel Architectures." (1997). LSU Historical
Dissertations and Theses. 6409.
https://digitalcommons.lsu.edu/gradschool_disstheses/6409
INFORMATION TO USERS
This manuscript has been reproduced from the microfilm master. UMI 
films the text directly from the original or copy submitted. Thus, some 
thesis and dissertation copies are in typewriter face, while others may be 
from any type of computer printer.
The quality of this reproduction is dependent upon the quality of the 
copy submitted. Broken or indistinct print, colored or poor quality 
illustrations and photographs, print bleedthrough, substandard margins, 
and improper alignment can adversely affect reproduction.
In the unlikely event that the author did not send UMI a complete 
manuscript and there are missing pages, these will be noted. Also, if 
unauthorized copyright material had to be removed, a note will indicate 
the deletion.
Oversize materials (e.g., maps, drawings, charts) are reproduced by 
sectioning the original, beginning at the upper left-hand corner and 
continuing from left to right in equal sections with small overlaps. Each 
original is also photographed in one exposure and is included in reduced 
form at the back of the book.
Photographs included in the original manuscript have been reproduced 
xerographically in this copy. Higher quality 6” x 9” black and white 
photographic prints are available for any photographs or illustrations 
appearing in this copy for an additional charge. Contact UMI directly to 
order.
UMI
A Bell & Howell Information Company 
300 North Zeeb Road, Ann Arbor MI 48106-1346 USA 
313/761-4700 800/521-0600
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
EFFICIENT MAPPING OF 
NEURAL NETWORK MODELS ON 
A CLASS OF PARALLEL ARCHITECTURES
A Dissertation
Submitted to the Graduate Faculty of the 
Louisiana State University and 
Agricultural and Mechanical College 
in partial fulfilment of the 
requirements for the degree of 
Doctor of Philosophy
in
The Department of Electrical and Computer Engineering
by
Behnam Seyed Arad
B.S., University of Massachusetts, Lowell, 1988 
M.S., Purdue University, 1990 
May 1997
UMI Number: 9736004
UMI Microform 9736004 
Copyright 1997, by UMI Company. All rights reserved.
This microform edition is protected against unauthorized 
copying under Title 17, United States Code.
UMI
300 North Zeeb Road 
Ann Arbor, MI 48103
ACKNOWLEDGMENTS
I would like to thank God, most gracious, most merciful for guiding me throughout 
my life and providing me with the opportunity and patience to complete this project. I 
would like to express my sincere appreciation to my major advisor, Dr. Ahmed El-Amawy, 
for providing me with the opportunity to intellectually challenge myself as an engineer. His 
advice as a mentor is highly valued and will serve me well in the future. I would like to 
thank him for the support and guidance he provided me throughout my studies at L.S.U. and 
the opportunity to benefit from financial support through his grant.
Sincere gratitude is extended to Dr. Alexander Skavantzos and Dr. Ramachandran 
Vaidyanathan for their invaluable advice and insight into my academic pursuits both here at 
L.S.U. and in future endeavors. My thanks to Dr. Gil S. Lee, Dr. John Tyler, and Dr. 
Tryfon T. Charalampopoulos for serving as members of my graduate committee.
I wish to thank my parents for their love, patience, support and encouragement 
throughout my life. Had it not been for them, I could never have found the strength, 
determination, and fortitude necessary to pursue this degree. I would also like to extend my 
gratitude to my in-laws Mr. and Mrs. Jazbi for their support and understanding throughout 
this work. A special acknowledgment to my late grandmother, Bebe Sakineh, for her love 
and kindness. My special thanks are extended to my dear friends Dr. Ramezanian, Dr. 
Sabahi, and Ali Zandieh for their sincere friendship and support over the years.
Finally, I would like to thank the love of my life, Parisa, for her support, patience, 
and encouragement throughout the difficult stages of this project. I can never thank her enough for 
the difficulties she had to put up with.
TABLE OF CONTENTS
ACKNOWLEDGMENTS ..............................................................................................ii
LIST OF TABLES......................................................................................................... v
LIST OF FIGURES.....................................................................................................  vi
ABSTRACT................................................................................................................... viii
CHAPTER 1 INTRODUCTION......................................................................... 1
1.1 Artificial Neural Networks..............................................................................1
1.1.1 Artificial Neuron..............................................................................2
1.1.2 Structure of ANNs.........................................................................  4
1.1.3 ANN Learning (Training) Process................................................ 6
1.2 Implementation of ANNs ..............................................................................  8
1.3 Previous Work...............................................................................................13
1.4 Research Objectives...................................................................................... 18
1.5 Outline of the Dissertation............................................................................21
CHAPTER 2 MAPPING OF FEEDFORWARD AND RADIAL
BASIS NETWORKS.................................................................25
2.1 Preliminaries................................................................................................. 26
2.1.1 Feedforward Artificial Neural Networks...................................... 26
2.1.2 Error Backpropagation Training Algorithm................................. 28
2.1.3 The k-ary n-cube Parallel Architecture.......................................  31
2.1.4 Some Preliminaries in Graph Theory..........................................  33
2.2 Optimal Mapping of FFANN's...................................................................... 34
2.2.1 Mapping the LVL onto a VKNC Architecture.............................  37
2.2.2 Applying the Assignment Procedure to the FFANN................... 39
2.2.3 Optimal Simulation of Each Learning Pass on the VKNC............. 40
2.2.4 Optimal Folding of the VKNC........................................................ 51
2.3 Mapping Radial Basis Function Networks on k-ary n-cubes........................62
2.3.1 Radial Basis Function ANNs ......................................................... 63
2.3.2 Fully Supervised Training of the RBF Networks.......................... 65
2.3.3 Partially Supervised Training of RBF .............................................66
CHAPTER 3 MAPPING UNIT ALLOCATING NETWORKS..................74
3.1 Preliminaries................................................................................................. 76
3.1.1 The Cascade Correlation Learning Algorithm..............................79
3.2 Parallel Implementation.............................................................................. 84
3.2.1 The Computational Model.............................................................84
3.2.2 The Mapping Procedure................................................................ 91
3.2.3 Mapping the PCCHmax onto a VKNC .............................................  93
3.2.4 Folding the VKNC ......................................................................... 95
3.2.4.1 Optimizing H^max ( a , P ) for a Network with Hmax 
Hidden Units .................................................................102
3.2.4.2 Optimizing (a,P) for a Network with Hmax 
Hidden Units.................................................................106
3.2.4.3 Performance of the Proposed Folding Scheme............ 109
3.3 Mapping Adaptive Resonance Theory Networks.....................................121
3.3.1 Adaptive Resonance Theory 1 (ART1)......................................... 123
3.3.2 Training of the ART1 Model.........................................................125
3.3.3 Mapping of the ART1 Model on KNCs ....................................... 127
3.3.4 Performance of the Proposed Mapping....................................... 135
CHAPTER 4 SUMMARY AND CONCLUSIONS.......................... 139
REFERENCES............................................................................................................. 146
VITA.............................................................................................................................. 154
LIST OF TABLES
2.1. Possible foldings for a given layer.......................................................................... 60
3.1. Timing patterns for a pipelined CC network.............................................................86
3.2. Data sets for different adjacent units of PCC^ ........................................................95
LIST OF FIGURES
1.1. A simplified biological neuron.................................................................................. 3
1.2. The McCulloch and Pitts model..............................................................................  4
1.3. An artificial neuron..................................................................................................  5
1.4. Common transition functions..................................................................................  6
1.5. Neural networks structures......................................................................................  7
2.1. A generic feedforward ANN ...................................................................................  28
2.2. 6-ary binomial trees: a) 1-dimensional, b) 2-dimensional.......................................33
2.3. The largest virtual layer........................................................................................... 36
2.4. BPG graphs of an FFANN........................................................................................ 37
2.5. Processor assignment: a) a sample LVL, and b) its implementation on a 3-D 
hypercube................................................................................................................. 39
2.6. The RBF network..................................................................................................... 64
3.1. The cascade correlation architecture with H  hidden units.......................................80
3.2. The PCC model........................................................................................................ 91
3.3. Input and output segments of PCCHmax and PCCH after α input and β output
digit foldings........................................................................................................... 97
3.4. Possible closed sets for the optimization problem.................................................103
3.5. Simulation results for optimizing THmax with Hmax = 20 ........................................111
3.6. Simulation results for optimizing THmax with Hmax = 40 ........................................111
3.7. Simulation results for optimizing THmax with Hmax = 60 ........................................112
3.8. Simulation results for optimizing THmax with Hmax = 80 ........................................112
3.9. Simulation results for optimizing with Hmax = 20 .....................................114
3.10. Simulation results for optimizing with Hmax = 40 .....................................114
3.11. Simulation results for optimizing with Hmax = 60 .....................................115
3.12. Simulation results for optimizing with Hmax = 80 .....................................115
3.13. The iteration time of CC algorithm mapped on KNC's of different sizes when
is optimized.............................................................................................. 116
3.14. The iteration time of CC algorithm mapped on KNC's of different sizes when 
r mnj, is optimized.............................................................................................. 116
3.15. Simulation results of mapping w ith T ^ ^  as the optimization metric with 
different Hmax values........................................................................................... 118
3.16. Simulation results of mapping with THmax as the optimization metric with 
different Hmax values........................................................................................... 118
3.17. Folding digits with THmax as the optimization metric for different Hmax
values.................................................................................................................... 120
3.18. Folding digits with as the optimization metric for different Hmax
values.................................................................................................................... 120
3.19. Performance of the mapping approach for values beyond Hmax, assuming
THmax as the optimization criterion and Hmax = 40 ............................................122
3.20. Performance of the mapping approach for values beyond Hmax, assuming
j^»nax 35 optimization criterion and Hmax -  4 0 ............................................122
3.21. The ART1 model.................................................................................................. 124
3.22. The ART1 task graph............................................................................................ 131
3.23. The PARTmax graph............................................................................................ 132
3.24. Simulation results for Lmax = 20 ..........................................................................136
3.25. Simulation results for Lmax = 30 ..........................................................................137
3.26. Simulation results for Lmax = 40 ..........................................................................137
3.27. Simulation results for Lmax = 100.........................................................................138
ABSTRACT
This dissertation develops a formal and systematic methodology for efficient 
mapping of several contemporary artificial neural network (ANN) models on k-ary n-cube 
parallel architectures (KNC’s). We apply the general mapping to several important ANN 
models including feedforward ANNs trained with the backpropagation algorithm, radial basis 
function networks, cascade correlation learning, and adaptive resonance theory networks.
Our approach utilizes a parallel task graph representing concurrent operations of the 
ANN model during training. The mapping of the ANN is performed in two steps. First, the 
parallel task graph of the ANN is mapped to a virtual KNC of compatible dimensionality. 
This involves decomposing each operation into its atomic tasks. Second, the dimensionality 
of the virtual KNC architecture is recursively reduced through a sequence of transformations 
until a desired metric is optimized. We refer to this process as folding the virtual 
architecture. The optimization criteria we consider in this dissertation are defined in terms 
of the iteration time of the algorithm on the folded architecture. If necessary, the mapping 
scheme may utilize a subset of the processors of a given KNC architecture if it results in the 
most efficient simulation. A unique feature of our mapping is that it systematically selects 
an appropriate degree of parallelism leading to a highly efficient realization of the ANN 
model on KNC architectures.
A novel feature of our work is its ability to efficiently map unit-allocating ANNs. 
These networks possess a dynamic structure which grows during training. We present a 
highly efficient scheme for simulating such networks on existing KNC parallel architectures. 
We assume an upper bound on the size of the neural network. We perform the folding such that
the iteration time of the largest network is minimized. We show that our mapping leads to 
near-optimal simulation of smaller instances of the neural network. In addition, based on 
our mapping no data migration or task rescheduling is needed as the size of the network grows.
CHAPTER 1 
INTRODUCTION
1.1 Artificial Neural Networks
The struggle to understand the human nervous system led to the discovery of 
neurons as structural components of the brain. A typical neuron is five to six orders of 
magnitude slower than typical electronic gates [35]. However, the slow processing rate of 
neurons is compensated by the large number of neurons and their dense connectivity in the 
brain. Shepherd and Koch [72] estimated the number of neurons in the human cortex to be 10 
billion, with 60 trillion connections among them. This makes the human brain an extremely 
efficient system [35].
Artificial Neural Networks (ANNs), also known as neuro-computers, connectionist 
networks, and parallel distributed processors [35], are biologically motivated systems which 
represent simulated models of a real nervous system. They consist of densely 
interconnected processing units called artificial neurons. ANNs are capable of storing 
knowledge and making it available for use [35]. The procedure used to store knowledge 
in an ANN is called learning. Learning constitutes an important feature of ANNs. Unlike 
conventional computers which tackle problems through programming, ANNs are capable 
of solving problems through learning.
ANNs are viable computational models for tackling complex and large scale 
problems intractable by conventional computers. These models are suitable for applications 
where explicit knowledge is not available [71]. In particular, ANNs can be applied to 
problems which are difficult or impossible to express mathematically [71]. ANNs have been
applied to a wide variety of problems including pattern classification, speech synthesis and 
recognition, adaptive interfaces between humans and complex physical systems, function 
approximation, image compression, associative memory, clustering, forecasting and 
prediction, combinatorial optimization, nonlinear system modeling, and control [33]. The 
computing power of ANNs can be attributed to their massively parallel distributed 
architecture and to their ability to learn and generalize [35].
ANNs are made up of artificial neurons which are connected according to various 
architectures. Links between neurons are assigned weights to resemble biological synapses. 
By properly adjusting its weights and transition function, an ANN can realize a relation 
between an input space and an output space [62]. The realized relation depends on several 
factors including the network structure, threshold and weight values, and the nature of 
the network's dynamics [62]. Next we briefly review several important aspects of ANNs.
1.1.1 Artificial Neuron
Figure 1.1 shows a simplified biological neuron. The human brain contains 
approximately 10 billion such cells, each being connected to about 10^4 other cells [62]. The 
actual operation of a nerve cell is a mystery. However, scientists have been able to 
reasonably approximate how a nerve cell operates. The operation of a cell can be stated as 
follows. Each cell receives (electrochemical) signals through its input branches called 
dendrites, which accept outputs of adjacent cells. Each input signal can be excitatory 
(positive value) or inhibitory (negative value). If the overall excitatory input signals of a cell 
are strong enough (exceeding a threshold), the cell will transmit an electrical pulse along
its output branch, called the axon, to the dendrites of its neighboring cells. The junction 
between the axon of one cell and the dendrite of another cell is called a synapse.
Figure 1.1: A simplified biological neuron (labeled parts: dendrites, soma, axon).
The basic component of an ANN is an artificial neuron. It is an abstract mathematical 
model developed to mimic the behavior of a nerve cell. McCulloch and Pitts [52] modeled 
the biological nerve cell using a binary threshold unit [62]. Their model is shown in Figure
1.2. It has n inputs (x_1, x_2, ..., x_n) resembling the dendrites of a nerve cell. A weight (w_i) is 
associated with each input (x_i) to indicate the strength of the connection between the cell 
and a neighboring neuron. These are called synaptic weights. The model computes the 
sum of products of its inputs and their corresponding weights. If the computed sum is 
above a certain threshold (θ), the neuron's output (y) is set to 1. Otherwise, the output 
remains at zero.
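The threshold unit described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `mcculloch_pitts` and the AND-gate weights and threshold are assumptions chosen for the example, not values taken from [52] or [62].

```python
from typing import Sequence


def mcculloch_pitts(inputs: Sequence[float],
                    weights: Sequence[float],
                    theta: float) -> int:
    """Binary threshold unit: 1 if the weighted input sum is above theta, else 0."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Sum of products of the inputs and their synaptic weights.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > theta else 0


# Hypothetical example: with unit weights and theta = 1.5 the unit acts as a
# two-input AND gate (values chosen only for illustration).
and_outputs = [mcculloch_pitts((a, b), (1.0, 1.0), 1.5)
               for a in (0, 1) for b in (0, 1)]
```

With these assumed parameters the unit fires only when both inputs are 1, since that is the only case in which the weighted sum (2.0) is above the threshold.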
The McCulloch and Pitts model can be generalized to allow various transition 
functions. The transition function determines the output of an artificial neuron based on
the weighted sum of its inputs. The artificial neuron model used in this dissertation is 
depicted in Figure 1.3. Typical transition functions are the identity, threshold, piecewise 
linear, and sigmoid functions. These functions are shown in Figure 1.4. We will henceforth 
refer to an artificial neuron simply as a neuron or unit.
Figure 1.2: The McCulloch and Pitts model.
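As a rough illustration of the generalized neuron, the four transition functions named above can be written as follows. The exact shapes plotted in Figure 1.4 are not recoverable from the text, so the breakpoints of the piecewise linear function (saturating at 0 and 1, linear on [-0.5, 0.5]) and the logistic form of the sigmoid with a `slope` parameter are assumptions.

```python
import math
from typing import Callable, Sequence


def identity(s: float) -> float:
    return s


def threshold(s: float, theta: float = 0.0) -> float:
    # Output 1 when the weighted sum is above theta, else 0.
    return 1.0 if s > theta else 0.0


def piecewise_linear(s: float) -> float:
    # Assumed shape: 0 below -0.5, 1 above 0.5, linear in between.
    return max(0.0, min(1.0, s + 0.5))


def sigmoid(s: float, slope: float = 1.0) -> float:
    # Logistic function, evaluated in a numerically stable way.
    z = slope * s
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def neuron_output(inputs: Sequence[float],
                  weights: Sequence[float],
                  transition: Callable[[float], float]) -> float:
    """Apply a transition function to the weighted sum of the inputs."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return transition(weighted_sum)
```

Passing the transition function as an argument keeps the weighted-sum step shared across all neuron types, which mirrors the way the text separates the sum of products from the output rule.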
1.1.2 Structure of ANNs
ANN models are structured in various ways. The structure is generally linked to the 
scheme used to make the ANN realize an input to output relation [35]. There are different 
classifications of ANN structures. Here, we consider the four classes identified in [35], 
namely single-layer feedforward, multilayer feedforward, recurrent, and lattice structure. 
An example of each class is depicted in Figure 1.5. The simplest structure is a single layer 
of neurons. This network simply projects inputs to a layer of neurons referred to as the 
output layer. Basic associative neural memory [33] is an example of this class.
Figure 1.3: An artificial neuron.
A multilayer feedforward network comprises more than one layer of neurons. 
Neuron layers other than the input and output layers are referred to as hidden layers, and 
their corresponding units are called hidden neurons. Nodes in each layer are connected to 
nodes in adjacent layers. In certain models, layers are fully connected, i.e., each neuron is 
connected to all neurons in adjacent layers. This class includes a very popular network 
called the feedforward neural network, which is typically trained with the backpropagation 
learning algorithm [69]. Networks may also be partially connected, i.e., a neuron is 
connected to a subset of neurons in adjacent layers. The locally connected network [35] is 
an example of such structures. A lattice network is very similar to a feedforward network 
except that output neurons are arranged in arrays (rows or columns).
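To make the fully connected case concrete, the following sketch propagates an input vector through a stack of fully connected layers. It assumes sigmoid units and omits bias terms, and the names `layer_forward` and `network_forward` are hypothetical. Each weight matrix is stored as a list of rows, one row per neuron in the receiving layer.

```python
import math
from typing import List

Matrix = List[List[float]]


def sigmoid(s: float) -> float:
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def layer_forward(weights: Matrix, inputs: List[float]) -> List[float]:
    """weights[j][i] is the weight on the link from input i to neuron j."""
    outputs = []
    for row in weights:
        if len(row) != len(inputs):
            raise ValueError("each neuron needs one weight per input")
        outputs.append(sigmoid(sum(w * x for w, x in zip(row, inputs))))
    return outputs


def network_forward(layers: List[Matrix], inputs: List[float]) -> List[float]:
    """Feed the outputs of each layer to the next layer, input to output."""
    activations = inputs
    for weights in layers:
        activations = layer_forward(weights, activations)
    return activations


# Hypothetical 2-3-1 network (two inputs, three hidden neurons, one output).
hidden = [[0.5, -0.5], [1.0, 1.0], [-1.0, 0.25]]
output = [[1.0, -1.0, 0.5]]
y = network_forward([hidden, output], [1.0, 2.0])
```

A partially connected network could be represented in the same form by fixing the weights of absent links to zero, at the cost of storing and multiplying those zeros.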
As opposed to feedforward networks, the recurrent network includes at least one 
feedback link. A feedback link connects the output of a neuron to inputs of neuron(s) in 
previous layers. The most famous recurrent network is the Hopfield network [62]. It
consists of a single layer of neurons. The output of each neuron is connected to inputs of 
other neurons. The feedback loop has a significant impact on the storage and learning 
capability of the network [35].
Figure 1.4: Common transition functions (identity, threshold, piecewise linear, and sigmoid).
Figure 1.5: Neural network structures (single-layer feedforward network, multilayer feedforward network, recurrent network, and one-dimensional lattice structure).
1.1.3 ANN Learning (Training) Process
One of the most profound features of ANNs is their ability to learn. Conventional 
digital computers are generally programmed (in hardware or software) to perform a certain 
task. ANNs, on the other hand, are trained by working with examples rather than algorithms 
[71]. Formally, learning constitutes incrementally adjusting the weights of the network so that
it can realize an input to output relation. Learning generally involves improving a 
predefined performance measure over time [33]. Hence, learning can be viewed as an 
optimization search [33]. The scheme followed to train a network is referred to as the 
learning (training) algorithm.
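The idea of learning as incremental weight adjustment that improves a performance measure can be illustrated with the simple delta (least-mean-squares) rule on a single linear unit. This rule is used here only because it is short. It is not one of the training algorithms studied in this dissertation, and the learning rate, epoch count, and training samples below are assumptions.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (input pattern, desired output)


def train_delta_rule(samples: List[Sample],
                     learning_rate: float = 0.1,
                     epochs: int = 200) -> Tuple[List[float], List[float]]:
    """Adjust the weights of one linear unit a little after every example.

    Returns the final weights and the summed squared error of each epoch,
    which serves as the performance measure being improved.
    """
    weights = [0.0] * len(samples[0][0])
    epoch_errors = []
    for _ in range(epochs):
        total = 0.0
        for x, desired in samples:
            actual = sum(w * xi for w, xi in zip(weights, x))
            err = desired - actual
            total += err * err
            # Incremental update: move each weight along its input,
            # scaled by the output error.
            weights = [w + learning_rate * err * xi
                       for w, xi in zip(weights, x)]
        epoch_errors.append(total)
    return weights, epoch_errors


# Hypothetical training set generated by the relation d = 2*x1 - x2.
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
weights, epoch_errors = train_delta_rule(samples)
```

The per-epoch error list makes the optimization view explicit: training is a search over weight values that drives the error measure down over time.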
Learning algorithms are classified into three major categories: supervised, 
unsupervised, and reinforcement training [33]. In supervised training, a desired output is
associated with each input pattern [33]. Generally, the algorithm is designed to reduce the 
error between the desired and actual output patterns. Error backpropagation learning [69] 
and cascade correlation [25] employ supervised training. Unsupervised training, on the 
other hand, classifies the input patterns into different categories through optimization of 
some criterion [33]. Hebbian learning [36], Adaptive Resonance Theory (ART1) [31], and 
concept-forming cognitive models [3] are examples of unsupervised training. Reinforcement 
learning constitutes updating the weights in order to maximize the probability of a factor 
called the reinforcement signal [33]. This scheme originated in connection with experimental 
studies of animal learning [33]. The Associative Reward-Penalty reinforcement rule [7] is an 
example of reinforcement learning.
1.2 Implementation of ANNs
Neural processing can gain acceptance as a practical problem solving tool only if it 
provides significant performance improvement over conventional methods. In principle, 
neural processing may be the only viable tool for solving many difficult scientific and 
engineering problems. However, only cost-effective implementations of ANN models can 
lead to their practicality. The problem of efficient realization of ANNs has been under study 
since the late 1980's. The studies include both hardware and software approaches.
ANNs are extremely computationally demanding. In particular, ANN training 
algorithms are computationally intensive procedures and are generally rather slow. 
Most training algorithms require a large number of iterations, and hence considerable time, to converge. 
Software and hardware implementations have been proposed in an attempt to efficiently
realize these models and enhance their practicality. Recent advances in simulation methods 
and neurohardware have made ANNs more practical [71].
Hardware implementations have utilized digital and analog technologies. Examples 
of hardware implementations include: a low-power VLSI arrhythmia classifier [48], fully 
parallel stochastic neural networks [79], optoelectronic VLSI shunting neural network [39], 
a single chip realizing a 10^6-synapse neural network [84], a generic systolic array building 
block for on-chip learning [47], the pRAM chip [21], an analog CMOS chip set for neural 
networks with arbitrary topology [46], a general purpose neurochip [23], a CMOS 
implementation of neural network models [30], a VLSI architecture for on-chip learning 
[32], a neuromorphic VLSI learning system [2], and visual computations using analog 
CMOS processing arrays [74].
Digital approaches (in particular those based on VLSI technology) are considered 
well suited for most neural network applications [35][43]. The principal reasons for utilizing 
VLSI technology are: 1) the functional density available on VLSI chips, 2) the regular 
topology of many ANNs, 3) the fact that ANNs require only a few simple and well-defined arithmetic operations 
[11], and 4) the ease and low cost of mass production of VLSI chips. Hitachi's WSI chip 
with 576 digital neurons and 36k weights integrated onto a 5-inch silicon wafer using 0.8 µm 
CMOS technology is one of the fastest neural chips [71]. A system built based on eight 
such WSI boards produces 2.3 giga connection updates per second (CUPS) for a 
backpropagation network [71].
Analog electronics, on the other hand, provide high speed [43] [71], packing density, 
and low power consumption [71]. However, problems such as low precision [43], analog
storage of weights, and susceptibility to temperature changes and interference [71] make analog 
neural networks less practical than digital approaches. Nevertheless, several analog neural 
chips have been developed, including Mitsubishi's Neurochip [43]. It contains 336 neurons 
and 28k connections. The chip can provide 28 giga CUPS during the learning phase.
Despite advances in neurohardware technology, most realizations of ANNs are still 
done in software on general-purpose computers. The process of simulating ANNs on 
conventional computers is referred to as mapping. The popularity of software simulations 
is due to several reasons. First, software simulations provide flexibility, which is a crucial 
criterion for experimental work in the field [49]. Through software implementations 
researchers have been able to explore different characteristics of ANN models. Such studies 
may not be as cost-effective if done on neurohardware. Second, neural processing on 
general purpose computers provides insight into the behavior of ANN models in an attempt 
to determine how these models can be efficiently implemented in hardware. For instance, 
neural processing on parallel computers has been essential in determining which 
parallelization techniques are most suitable for hardware, and which hardware features can 
improve efficiency [71]. Third, hardware implementation of certain ANNs is not feasible. 
For instance, the Cascade Correlation learning architecture [25] which consists of a 
cascaded network of neurons cannot be easily implemented using VLSI technology [64].
Conventional serial computers are generally too slow for computationally intensive 
ANN models, especially for large networks. In particular, processing demands of the 
learning algorithms of ANNs are considerable. A common approach for efficient software 
realization of ANNs is to map them onto parallel computers. Parallelism is at the very heart
of neural processing [71]. ANN models exhibit several characteristics which favor parallel 
implementation. These include a large number of simple operations which can be performed 
concurrently, and distributed memory requirements.
The mapping techniques used for simulating ANNs on parallel architectures are 
categorized into two general groups: heuristic mapping and algorithmic mapping [49]. 
Heuristic mapping schemes rely on trial and error methods based on familiarity with ANN 
algorithms and the target machines [49]. Examples of heuristic approaches include the 
realization of multilayer perceptrons on the MPP [34], Warp [67], Connection Machine [8], 
and Hughes Systolic Cellular/Processor [79]. Algorithmic mapping schemes, on the other 
hand, are systematic implementations of ANN models on target architectures. Examples 
of such mappings include implementations of feedforward ANNs on multiple bus [24] and 
hypercube-based systems [51] [42]. They also include simulation of ANNs on 
mesh-connected SIMD machines [49].
Several factors should be considered in any attempt to develop efficient mappings 
of ANNs onto parallel architectures. These factors include specific features of the ANN 
model and characteristics of the target parallel architectures [71]. Processing and 
communication demands of the ANN should be studied at the initial stages of the mapping 
process. As mentioned earlier, ANNs generally have high computational demands, 
especially during the learning phase. In addition, the communication demands of ANNs are 
also considerable due to their massive interconnectivity. For that reason, broadcast 
operations could be very efficient [24].
It is essential to explore parallelization strategies appropriate for a given ANN model. 
Parallelism can be exploited at different levels of ANN training. The degree of parallelism 
attained at each level is related to the granularity of tasks at that level. Several 
categorizations of mapping algorithms based on the amount of parallelism appear in the 
literature, including the classifications in [40] and [59]. These categorizations are typically 
made with a specific ANN model in mind. However, to some extent they can also be applied 
to other models [71]. The classification due to Nordström and Svensson [71] includes the 
following parallelization approaches:
•  Session parallelism (simultaneous execution of different sessions)
•  Training parallelism (concurrent training using different samples [61])
•  Layer parallelism (simultaneous execution of different network layers)
•  Node parallelism (concurrent execution of node operations for a single input)
•  Weight parallelism (concurrent weight summation per node)
Each approach utilizes the parallelism at a different level. It is crucial to determine which 
approach is suitable for a given ANN  model. The characteristics of the target architecture 
and the limitations it might put on problem decomposition should also be considered.
As far as mapping is concerned, parallel processing machines can be divided into two 
broad categories: single instruction stream multiple data stream (SIMD) and multiple 
instruction stream multiple data stream (MIMD). In the SIMD approach, multiple 
processors perform similar operations on a large amount of data in a regular and possibly 
synchronized fashion. Examples of such (centralized) simulations are algorithmic mapping 
of ANN models on parallel SIMD machines [49], algorithmic mapping of feedforward neural
networks onto multiple bus systems [24], network learning on the Connection Machine [8], 
and systolic architectures for artificial neural networks [45]. These simulations are also 
known as data parallel approaches [71]. In mapping ANNs onto MIMD machines, the ANN 
task is typically divided into several subtasks, each placed on a different computer. Each 
computer operates on its portion of the task and communicates with other computers 
through message passing schemes [71]. Examples of such mapping techniques include the 
simulation of the backpropagation algorithm on transputers [57]. Serbedzija [71] compares 
the performance of several parallel simulations of ANNs on SIMD and MIMD computers. 
Next, we provide an overview of earlier studies on mapping of ANNs on parallel computers. 
1.3 Previous Work
Several studies on mapping ANN models onto parallel architectures have been 
reported in [24], [71], [51], [42], [50], [26], [49], [83], [28], [29], [45], [56], [37], [80] and 
[44]. Here, we provide a brief review of these works. El-Amawy and Kulasinghe [24] 
address the problem of mapping a feedforward ANN onto a multiple bus system with p 
processors and b buses. The proposed mapping scheme minimizes the total execution time 
of the learning algorithm. The multiple bus architecture is selected because of its inherent 
ability to provide concurrent broadcast operations. The paper also introduces a more 
efficient variant of the scheme when overlapping computations and communications are 
possible. The authors show that there is a unique arrangement of bus interfaces such that 
the number of interfaces is minimum while the optimal time is reached. The advantages of 
the proposed scheme over checkerboarding [42] are stated.
The study in [71] provides an overview of the techniques and strategies introduced 
for simulating artificial neural networks on parallel architectures. The study focuses on 
different parallelization schemes and discusses their merits. It explores parallel 
implementation of ANNs on both general-purpose computers and neurohardware. The 
study points out the significance of mapping techniques on general-purpose parallel 
computers and evaluates the performance of various simulation paradigms. It also addresses 
the significance of and the need for dedicated computers for neural processing, known as 
neurocomputers. The merits of digital and analog technologies as avenues for neurocomputers 
are stated.
In [51], a technique for mapping feedforward ANNs and Hopfield ANN models on 
hypercubes is introduced. The proposed scheme initially constructs a parallel architecture 
called mesh-of-appendixed-trees (MATs) for a given ANN. Then, MAT is embedded into 
the hypercube [51]. The asymptotic complexity of this scheme is given as $O(\log N)$, where 
$N$ is the size of the largest layer in the ANN. However, this logarithmic time is obtained at the 
expense of $3N^2$ processors for an $N \times N$ MAT. In that approach, mapping a 
feedforward ANN with a maximum of $N$ neurons per layer on a hypercube requires a target 
architecture of a particular size, $4N^2$. The paper does not address the mapping of the ANN 
model on a hypercube of arbitrary size. Furthermore, this work does not consider the 
impact of the granularity of parallelism on the efficiency of the simulated algorithm.
In [42] a technique called checkerboarding, for mapping the backpropagation algorithm 
on hypercubes and related topologies, is introduced. This scheme utilizes the embedding 
of a $\sqrt{P} \times \sqrt{P}$ grid on a hypercube of size $P$ (a power of two). Elements of each weight matrix are
partitioned among processors of the grid. Neuron activation functions are assigned to 
diagonal nodes of the grid. The checkerboarding scheme requires less communication time 
than the pattern partitioning scheme reported in [40] and the vertical sectioning scheme 
utilized in [1] and [86]. In vertical sectioning an equal number of nodes from each layer of 
a uniform network are assigned to each processor. For non-uniform networks, the relative 
performance of the three schemes remains the same [42]. In a non-uniform FFANN, the 
number of neurons per layer differs from one layer to another. Non-uniform networks are 
utilized in many applications. The extension of checkerboarding scheme to non-uniform 
networks needs further study. The key issue is the choice of a proper square grid. The 
study in [42] provides a simplified algebraic analysis of the speedup for different network 
partitioning methods assuming high processor use. As the authors note, the trade-off 
between processor use and speedup needs to be further studied. The authors also claim that 
the checkerboarding scheme is asymptotically optimal. In other words, it can achieve a time 
complexity of $O(\log L + \log I)$, where $L$ is the number of patterns and $I$ is the number 
of neurons per layer in a uniform network. However, to achieve this lower bound a large 
number of processors is needed, $P = LI^2$ [42]. Hence, to achieve asymptotic optimality 
the number of processors should be a quadratic function of network size. Furthermore, the 
study does not consider whether varying the degree of granularity could improve the 
efficiency.
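The partitioning idea behind checkerboarding can be illustrated with a small sketch. It is not the scheme of [42] itself; it only assumes that a weight matrix is cut into a square grid of equal blocks, that each block is held by one grid processor, and that the partial products along each grid row are summed. The function names and the use of NumPy are assumptions made for illustration.

```python
import numpy as np

def checkerboard_blocks(W, q):
    """Cut W into a q x q grid of equal blocks; block (r, c) would reside on grid processor (r, c)."""
    n_rows, n_cols = W.shape
    assert n_rows % q == 0 and n_cols % q == 0, "sketch assumes dimensions divisible by q"
    br, bc = n_rows // q, n_cols // q
    return [[W[r * br:(r + 1) * br, c * bc:(c + 1) * bc] for c in range(q)]
            for r in range(q)]

def checkerboard_matvec(W, x, q):
    """Compute W @ x from block-local products followed by a sum along each grid row."""
    blocks = checkerboard_blocks(W, q)
    bc = W.shape[1] // q
    x_parts = [x[c * bc:(c + 1) * bc] for c in range(q)]
    row_results = []
    for r in range(q):
        # Each processor in grid row r multiplies its own block by its slice of x.
        partials = [blocks[r][c] @ x_parts[c] for c in range(q)]
        # Summing the partial vectors models the reduction along grid row r.
        row_results.append(np.sum(partials, axis=0))
    return np.concatenate(row_results)
```

With q = 2 the sketch models a 2 x 2 grid, that is, P = 4 processors; the result agrees with the ordinary product W @ x for any block count that divides the matrix dimensions.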
Fujimoto et al. [26] propose massively parallel architectures, such as a toroidal 
lattice architecture (TLA) and a planar lattice architecture (PLA), for simulating ANNs. The 
parallel architecture is designed in two steps. Initially, a multilayer perceptron network is
mapped to a set of virtual processors connected with a PLA or a TLA. Then, the virtual 
processors are mapped to a set of physical processors [26]. The mapping is performed 
based on a load balancing algorithm. It is stated that the PLA can realize a completely 
planar lattice structure which is efficient for WSI implementations [26]. The proposed load 
balancing imposes some limitations on the type of network which can be mapped by this 
scheme.
Lin, Prasanna, and Przytula describe parallel implementations of ANNs on fine grain
mesh-connected SIMD arrays [49]. Their mapping scheme can be applied to those ANN
models of arbitrary network topologies whose retrieval and training phases can be
performed as matrix and vector manipulations [49]. The mapping is designed only for
mesh-connected array processors. The authors introduce two versions of the mapping
scheme. One version is designed for implementing ANN models with problem sizes smaller
than the parallel machine size. In other words, the first version of the scheme can simulate
an ANN with $n$ neurons and $e$ connections on an $N \times N$ mesh-connected SIMD machine
if $n + e \le N^2$. The other version uses a partitioning technique for implementing ANNs of
arbitrary size on a fixed size processor array. For mapping an ANN model with $n$ neurons
and $e$ connections onto a $P \times P$ array, an $O(k)$ memory is required per processor, where 
N1 £ k z  —. The study does not consider the possibility that a smaller sub-array of the fixed 
size SIMD architecture may lead to a more efficient simulation.
Wah and Chu [83] introduce a heuristic mapping of a multilayer ANN on a 
multicomputer system. They derive several results for reducing the complexity of the 
mapping process. Their mapping is based on the fact that neurons in a multilayer ANN can
be grouped in a number of clusters such that neurons in two connected clusters are pairwise 
connected [83]. They consider the computation time to be the dominant factor in the 
training phase of feedforward ANN models (this hardly reflects reality on any existing 
parallel machine). Based on this consideration, they decompose the physical processors into 
partitions in such a way that the error deviation of the heuristic approach from the optimal 
approach can be bounded [83]. They provide experimental results to confirm their claims. 
Their simplified model fails in cases where the communication time is not negligible (which 
is typically the case).
Ghosh and Hwang [29] investigate critical issues in mapping ANNs on message-passing 
multiprocessors and develop a broad set of guidelines for efficient mapping of ANN 
models on parallel systems. They develop a structural model of an ANN by partitioning its 
network topology into groups of highly interconnected neurons. This model is used to find 
a proper set of guidelines for a heuristic mapping and to determine the functional behavior 
of neurons. They provide two principles for their heuristic approach, a partitioning principle 
and a mapping principle. They estimate the communication bandwidth required for 
balancing the communication and processing demands based on the structural model and the 
mapping policy [29]. They examine suitability of different classes of parallel architectures 
for simulation of ANN models. They indicate that architectures with direct links, such as 
hypercubes, multiple bus systems [64], and architectures with constant degree such as 
hypernets [38] or cube-connected cycles are more suitable than multistage networks. As 
we mentioned earlier, their work only provides a set of guidelines for developing a heuristic 
mapping scheme.
Kung and Hwang [44] develop a mapping scheme for implementation of ANN 
models on a linear systolic array of processors. Their scheme is based on the fact that the 
training phase of a feedforward ANN model can be expressed in terms of a sequence of 
matrix and vector operations. Wah and Chu [83] show that the scheme in [44] is optimal 
when the network is a feedforward ANN, and when the interconnection network is fast. 
The scheme in [44] however is only applicable to linear systolic arrays.
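To make the idea of matrix and vector operations on a linear array concrete, the following sketch simulates a matrix-vector product on a ring of processors. It is an illustration only and is not claimed to reproduce the scheme in [44]: it assumes that processor i stores row i of the weight matrix, that each processor initially holds one element of the input vector, and that the vector elements circulate by one position per step. The function name is hypothetical.

```python
import numpy as np

def ring_matvec(W, x):
    """Simulate y = W @ x on a ring of n processors (assumed layout, see lead-in)."""
    n = len(x)
    held = list(range(n))   # held[i]: index of the x element currently at processor i
    acc = np.zeros(n)       # acc[i]: running sum kept by processor i
    for _ in range(n):
        # In hardware all processors would perform this multiply-accumulate concurrently.
        for i in range(n):
            j = held[i]
            acc[i] += W[i, j] * x[j]
        # Circular shift: every processor receives the element held by its neighbor.
        held = [held[(i + 1) % n] for i in range(n)]
    return acc
```

After n steps every processor has seen every element of x exactly once, so the accumulated values equal the entries of W @ x while each step uses only nearest-neighbor communication.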
1.4 Research Objectives
Artificial neural networks with their huge computing power are primary candidates 
for solving large scale intractable scientific problems. The impressive processing capabilities 
of these models are due to their massively distributed structures, their ability to learn and 
generalize, and to their self-organizing and adaptive nature. These models offer a new 
processing paradigm which can be more powerful, robust, and user-friendly than 
conventional approaches [43]. Although ANNs are abstract simulations of the real nervous 
system, we still have a long way to go before we can design a fully operational 
neurocomputer which resembles the human brain [35].
In the most general sense, our objective in this research is to develop a formal 
methodology for efficient mapping of contemporary ANN models on a popular class of 
parallel architectures. We consider parallel computers based on k-ary n-cube topologies 
since they encompass both mesh-connected and hypercube-based parallel systems. Many 
existing parallel architectures are based on these network topologies. These include 
MASPAR, nCUBE, Connection Machine, the Mosaic, Cray T3D, and the J-Machine [10]. 
We refer to parallel systems based on k-ary n-cube topologies as KNC architectures
henceforth. Our mappings will be designed to efficiently simulate an ANN model of arbitrary 
size on a KNC. The problem of mapping a specific ANN model onto a parallel architecture 
has been studied in the literature [24], [71], [51], [42], [50], [26], [49], [83], [28], [29], 
[65], [45], and [44]. Our approach has several significant features which distinguish it from 
earlier works. We elaborate on these features next.
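As a brief illustration of why k-ary n-cubes cover both families of machines, the sketch below enumerates the nodes of a k-ary n-cube and the neighbors of a node. It assumes the common definition in which each node is addressed by n radix-k digits and is linked to the nodes that differ by plus or minus one (modulo k) in exactly one digit; the wraparound links and the function names are assumptions of this sketch.

```python
from itertools import product

def knc_nodes(k, n):
    """All k**n node addresses of a k-ary n-cube, each a tuple of n radix-k digits."""
    return list(product(range(k), repeat=n))

def knc_neighbors(node, k):
    """Neighbors of a node: addresses differing by +1 or -1 (mod k) in one digit."""
    neighbors = set()
    for dim, digit in enumerate(node):
        for step in (1, -1):
            candidate = list(node)
            candidate[dim] = (digit + step) % k
            neighbors.add(tuple(candidate))
    neighbors.discard(node)  # only relevant for the degenerate case k == 1
    return sorted(neighbors)
```

For k = 2 the two steps coincide, so each node has n neighbors and the structure is a binary hypercube; for n = 2 and k > 2 each node has four neighbors, as in a mesh with wraparound connections.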
The first appealing feature of our study is the wide range of ANN models we 
consider. Most studies on mapping ANNs onto parallel architectures only considered one 
or two ANN models. In fact, most studies in this field were centered on the feedforward 
ANN with the backpropagation training algorithm [24], [51], [42], [28]. The study in [51] 
also covered the Hopfield model. We do not restrict our scope to a particular ANN model. 
Rather, we study a wide range of ANN models and develop a unified mapping approach for 
different classes of ANN models. The classification we utilize groups ANNs based on 
similarities of their computational structure. Although specific implementations might vary 
from one model to another within a class, general mapping steps are similar for a given class 
of ANNs. We present a systematic mapping scheme for each class and show how the 
mapping can be applied to specific ANNs within that class. This feature of our study should 
have a significant impact on the study of ANNs. With the availability of efficient 
implementations of a wide range of ANNs at their disposal, researchers in the field can 
effectively study different aspects of neural processing in order to develop powerful problem 
solving tools. Notice that no single ANN model is appropriate for every application. By 
providing efficient implementation of a range of ANNs for users, the practicality of these 
models in engineering problems is significantly enhanced.
Another significant aspect of our study is that it covers an important class of ANNs 
called unit-allocating neural networks. This class includes several important ANN models 
such as the cascade correlation [25] learning algorithm and the adaptive resonance theory 
1 (ART1) [31] model. The common feature of these models is the dynamic nature of their 
architecture, which grows during the learning phase. Hardware implementation of these 
models is difficult due to their dynamic structure. For instance, the cascade correlation 
training algorithm results in a network of cascaded neurons which cannot be implemented 
easily using VLSI technology [64]. Software simulation of such ANNs seems to be a viable 
alternative for utilizing these powerful models. We develop a systematic and formal 
methodology for efficiently implementing these unit-allocating ANNs on existing parallel 
systems. Based on our approach, no data migration or task reassignment is needed as the 
number of hidden neurons grows during the training. To our knowledge, this has not been 
previously attempted perhaps due to the dynamic nature of the network architecture.
Learning speed is an important performance measure for ANNs. Training 
algorithms are generally slow in nature requiring a large number of iterations to converge. 
One of the most important mapping objectives is to ensure that ANNs are efficiently 
simulated on the given parallel architecture. The amount of parallelism achieved depends 
on the decomposition of the ANN model on the target architecture. To achieve an efficient 
simulation, it is essential to choose an appropriate granularity for partitioning the ANN 
model. We propose a unique approach which systematically picks an appropriate degree of 
parallelism which leads to a highly efficient realization of the ANN model on the target 
architecture. Our mapping scheme takes into account several factors for determining the
most suitable task granularity. Among these factors are the computational structure of the 
ANN model and the characteristics of the target parallel architecture including the 
computation time for atomic arithmetic operations and per word communication time 
between adjacent processors. If necessary, the scheme may only utilize a subset of the 
processors of a given KNC architecture (referred to as subcube henceforth). In such cases, 
the simulation on a subcube of the target architecture leads to the most efficient simulation.
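The reason a subcube can outperform the full machine can be seen with a toy cost model. The model below is purely an assumption made for illustration and is not the analysis developed in this dissertation: computation time is taken to shrink as work/p, while communication time is taken to grow as log2(p) times the number of words exchanged. All names and parameters are hypothetical.

```python
import math

def step_time(p, work, t_comp, words, t_comm):
    """Assumed cost model: perfectly divided computation plus log2(p) communication rounds."""
    return work * t_comp / p + words * t_comm * math.log2(p)

def best_subcube_size(max_p, work, t_comp, words, t_comm):
    """Pick the power-of-two processor count (up to max_p) with the smallest modeled time."""
    candidates = [2 ** d for d in range(max_p.bit_length()) if 2 ** d <= max_p]
    return min(candidates, key=lambda p: step_time(p, work, t_comp, words, t_comm))
```

Under this assumed model, with 1000 operations, 100 words per round, and equal per-operation and per-word times, only 8 of 64 processors are selected; when communication is 100 times cheaper than computation, all 64 are used. The ratio of communication time to computation time therefore decides the granularity.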
The problem of determining the proper amount of parallelism for mapping an ANN 
model has not been addressed in earlier studies. Most studies impose some restrictions on 
the size of the ANN model or the target architecture. For instance, the mapping scheme 
introduced in [51] can efficiently simulate a feedforward ANN on a hypercube of size $4N^2$ 
only if the maximum number of nodes per layer in the ANN model is $N$. The scheme 
introduced in [42] maps a feedforward ANN onto a hypercube-based architecture of arbitrary 
size. However, to achieve asymptotic optimality, the number of processors required is a 
quadratic function of the input size. Our scheme on the other hand does not impose any 
restrictions on the ANN model or on the target architecture.
1.5 Outline of the Dissertation
In Chapter 2, a formal methodology for optimal implementation of the 
backpropagation and similar algorithms on k-ary n-cube (KNC) topologies is presented. 
We first consider the mapping of feedforward artificial neural network (FFANN) models 
trained with the backpropagation algorithm on a KNC architecture. The methodology is 
developed by generalizing the optimal mapping of a bipartite graph. Initially, the FFANN 
is mapped onto a virtual KNC. The extent of parallelism is such that the simulation of the
learning pass on the virtual KNC is time optimal. Then, the virtual KNC is recursively folded 
until its dimension matches that of the physical architecture or a subcube thereof, depending 
on the physical size that provides the best execution time. A systematic folding process is 
developed to minimize execution time of each learning pass and to preserve the degree of 
redundancy. We prove that our mapping methodology is time-optimal and that it provides 
for maximum processor utilization regardless of the structure of the FFANN.
We show that the methodology developed for FFANNs can be applied to several 
other classes of ANNs. In particular, we consider several training algorithms for training 
Radial Basis Function (RBF) networks. We show that the training algorithms for these 
networks correspond to computational structures which are similar to those of the 
backpropagation algorithm.
In Chapter 3 we consider several unit allocating networks. A unit allocating network 
is one whose topology is modified during the training. We first consider the mapping of 
the Cascade Correlation learning algorithm. Cascade correlation [33] is an efficient 
supervised learning technique for neural networks. The learning algorithm incrementally 
adds and trains hidden units to a minimal topology until a desired error bound is reached. 
The significant attributes of such a “unit-allocating” network are fast learning (with 
polynomial time complexity) and compact representation of data [64]. The resulting 
architecture is a multi-layer network with cascaded single-unit hidden layers. VLSI 
implementation of this structure is difficult due to its irregular connections and unbounded 
fan-in [64].
The chapter first presents a methodology for efficient parallel implementation of the 
Cascade Correlation algorithm on k-ary n-cubes (KNCs). We develop a computational 
model which captures the inherent parallelism of output-unit and hidden-unit training phases 
of the algorithm. Moreover, our model allows pipelining of several training patterns in 
order to further improve the efficiency of the implementation. The model we develop can 
easily be adapted to various parallel topologies. The mapping is done in two phases. The 
computational model is first mapped onto a virtual KNC of compatible size denoted by 
VKNC. Then, the VKNC is folded repeatedly (as necessary) such that a certain metric is 
optimized for a network with a certain number of hidden units. The folding is repeated until 
the resulting size is less than or equal to the size of the actual KNC. In the Cascade 
Correlation algorithm the number of hidden units is not known in advance. To efficiently 
map the training of such a dynamic network, we consider an upper bound on the number of 
hidden units (Hmax). We consider two optimization criteria defined based on 1) the 
execution time of the algorithm for a network with Hmax hidden units and 2) the sum of 
execution times of the algorithm for all instances of the network with 0 through Hmax 
hidden units. We propose efficient analytical schemes for the mapping based on each 
criterion.
In the same chapter, we use the parameters for the benchmark application NETTalk 
to evaluate the performance of our mappings. We present experimental results which show 
that our approach leads to near-optimal results for networks with H hidden units, where 
H < Hmax. In addition, we show that the proposed scheme leads to very efficient 
simulation of the training algorithm even if the number of hidden units exceeds Hmax. We
also examine the effect of Hmax choice on the mapping. The minimisation of each metric 
(assuming Hmax hidden units) has computational complexity O (logt(L + Hmax)), for a 
network with L output units. Based on the proposed mapping, task assignments for 
networks with 0 through Hmax hidden units are known a priori. Hence, no data transfer or 
task rescheduling is needed as the number of hidden units grows.
Also in Chapter 3 we consider the mapping of a popular clustering network called 
Adaptive Resonance Theory (ART) [31]. We show that the mapping of this algorithm is 
very similar to that of the Cascade Correlation training algorithm. We provide simulation 
results for an efficient mapping of a benchmark example for this case as well.
Chapter 4 contains the conclusions drawn from this work and discusses some open 
problems for future work.
CHAPTER 2
MAPPING OF FEEDFORWARD AND 
RADIAL BASIS NETWORKS
In this chapter we introduce a methodology for optimal implementation of multilayer 
feedforward artificial neural networks (FFANNs) trained with the backpropagation 
training algorithm on KNCs. The mapping is based on generalizing the mapping of a 
bipartite graph onto the KNC architecture. Initially, the FFANN is mapped onto a virtual 
KNC such that simulation of the learning pass on the virtual KNC is time optimal. The 
virtual KNC is then recursively folded until its dimension matches that of the physical 
architecture or a subcube thereof, depending on the physical size that provides the best 
execution time. A systematic folding process has been developed to minimize execution 
time of each learning pass. We prove that our mapping methodology is time-optimal and 
that it provides for maximum processor utilization regardless of the structure of the FFANN. 
The mapping scheme utilizes the given KNC architecture to achieve minimum execution 
time for both uniform and non-uniform FFANNs. By considering the ratio of 
communication time to computation time per basic operation a proper subcube of the given 
KNC architecture is utilized to obtain the best granularity of parallelism in terms of 
minimization of total execution time.
We also address the efficient mapping of Radial Basis Function neural networks 
(RBFs) on KNCs. We consider both fully supervised and partially supervised training of 
RBFs in our mapping. We show that the mapping of fully supervised training of RBFs is 
very similar to that of a two-layer FFANN trained with the backpropagation algorithm.
Partially supervised training of RBFs consists of two major phases. We introduce an 
efficient scheme for simulating the first phase. We show that the second phase can be 
realized as a two-layer FFANN.
2.1 Preliminaries
In this section we briefly review feedforward artificial neural networks, the 
backpropagation training algorithm, k-ary n-cubes, and some definitions from graph theory.
2.1.1 Feedforward Artificial Neural Networks
Feedforward ANNs (FFANNs) belong to a popular class of ANNs used in many 
classification and pattern recognition applications. The backpropagation algorithm is a 
common learning algorithm used in training FFANNs. The basic algorithm, known as 
gradient-descent backpropagation [69], is computationally intensive and converges very 
slowly. Two common approaches to improve the learning speed of the algorithm are 
parallel implementations and the use of efficient variants of the algorithm. There are three 
types of parallelism associated with the backpropagation algorithm: algorithmic parallelism, 
spatial parallelism, and training parallelism [40]. Algorithmic parallelism exploits the 
parallelism intrinsic to the algorithm itself. Spatial parallelism on the other hand is related 
to the concurrency within a particular layer of a FFANN during the forward or backward 
phase of the algorithm [40]. Training parallelism (or pattern partitioning) is closely related 
to off-line training [40] in which weight increments are obtained for all training patterns 
before any weights are updated. In this case, the training set is divided into several subsets 
which are processed concurrently [40].
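Training parallelism relies on the fact that, in off-line training, the total weight increment is a sum over patterns and can therefore be accumulated per subset. The sketch below checks this for a deliberately simple case, a single linear layer with squared error, which is an assumption chosen to keep the example short; the function names are hypothetical and the subsets are processed in a loop rather than on separate processors.

```python
import numpy as np

def weight_increment(W, X, Y, eta):
    """Summed gradient-descent increment for a linear layer with squared error.

    X has one input pattern per row, Y one desired output per row, and W has
    shape (outputs, inputs).
    """
    err = Y - X @ W.T          # per-pattern output errors
    return eta * err.T @ X     # same shape as W

def pattern_partitioned_increment(W, X, Y, eta, parts):
    """Split the training set into subsets and add up the per-subset increments."""
    X_subsets = np.array_split(X, parts)
    Y_subsets = np.array_split(Y, parts)
    # Each subset could be handled concurrently by a different processor.
    partial = [weight_increment(W, Xs, Ys, eta) for Xs, Ys in zip(X_subsets, Y_subsets)]
    return np.sum(partial, axis=0)
```

Because the weights are not changed until all increments have been collected, the partitioned result matches the increment computed over the whole training set, up to floating-point rounding.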
Several efficient variants of the algorithm are presented in [40]. In particular, a 
variant of the algorithm is the second-order least squares backpropagation [41] which is 
more complex than the basic algorithm but is more suitable for parallel implementation [41]. 
Other approaches are based on reducing the connectivity of the FFANN or on a certain 
selection of the learning rate and the momentum term so as to reduce oscillation and speed 
up the convergence time.
FFANNs have been used in different applications. One common application of 
these networks is in continuous or discrete classification. Neurons in an FFANN are 
arranged into layers. In this work we consider a fully connected FFANN in which neurons 
in layer $l$, $0 < l < L$, are fully connected to neurons in layers $l-1$ and $l+1$ in an L-layer 
network. Each link connecting two neurons is assigned a weight called the synaptic weight.
Generally, an FFANN consists of an input layer, one or more intermediate (hidden) 
layers, and an output layer. Figure 2.1 shows a fully connected $L$-layer FFANN. In this 
dissertation, the synaptic weight of the link between neuron $j$ in layer $l-1$ and neuron $i$ in 
layer $l$ is denoted by $w_{ij}^{l}$ for $1 \le l \le L$. The input vector to the network is an $n$-bit vector 
$X = \{x_1, x_2, \ldots, x_n\}$, and the output is an $r$-bit vector $Y = \{y_1, y_2, \ldots, y_r\}$. The output of neuron 
$i$ in layer $l$, $1 \le l \le L$, is denoted by $o_i^l$ and is computed by:
$$o_i^l = f\Big(\sum_{j=1}^{N_{l-1}} w_{ij}^{l}\, o_j^{l-1}\Big) \quad \forall\, l,\ 1 \le i \le N_l \qquad (2.1)$$
where $N_l$ is the number of neurons in layer $l$, $o_j^{l-1}$ is the output of the $j$th neuron in layer $l-1$, and $f(\cdot)$ is the activation function. $o_i^0$ 
denotes bit $i$ of the input pattern. Given a sample set $S = \{(X_1,Y_1), (X_2,Y_2), \ldots, (X_s,Y_s)\}$
Figure 2.1: A generic feedforward ANN (Layer 0, Layer 1, ..., Layer L-1, Layer L).
defined on $R^n \times R^r$, the training of an FFANN is defined as the search for a set of synaptic 
weights which can map the input patterns to their corresponding output patterns.
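The layer-by-layer evaluation in (2.1) can be sketched as follows. This is a minimal illustration only: the nested-list weight layout, the function names, and the choice of a sigmoid for $f$ are assumptions made for the example and are not taken from the dissertation.

```python
import math

def sigmoid(x):
    # Assumed activation function f; the text does not fix a particular f.
    return 1.0 / (1.0 + math.exp(-x))

def forward(weights, x, f=sigmoid):
    """Evaluate Eq. (2.1) layer by layer.

    weights[l-1][i][j] plays the role of w_ij^l, the weight of the link
    from neuron j in layer l-1 to neuron i in layer l (hypothetical layout).
    Returns the list of output vectors o^0, o^1, ..., o^L.
    """
    outputs = [list(x)]                      # o^0 is the input pattern
    for w_l in weights:
        prev = outputs[-1]
        outputs.append([f(sum(w_ij * o_j for w_ij, o_j in zip(row, prev)))
                        for row in w_l])
    return outputs
```

With all weights set to zero, every neuron outputs $f(0) = 0.5$ under the assumed sigmoid, which gives a quick sanity check of the indexing.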
2.1.2 Error Backpropagation Training Algorithm
The error backpropagation training algorithm [69] (referred to as backpropagation 
henceforth) is a supervised learning algorithm which is commonly used in training FFANNs. 
It consists of two passes. In the forward pass each neuron $j$ computes a weighted sum of 
its inputs, commonly referred to as $NET_j$. Then, the output of each neuron is computed by:
$$o_j = f(NET_j) \qquad (2.2)$$
where $f(\cdot)$ is the activation function. At the end of the forward pass the computed and 
desired outputs of output layer neurons are used to calculate the error. The error function 
for the backpropagation training algorithm is expressed as follows:
$$E = \frac{1}{2} \sum_{j=1}^{N_L} \big[\, y_j - o_j^L \,\big]^2 \qquad (2.3)$$
During the backward pass, the synaptic weights are adjusted so that the total error is 
minimized. The synaptic weights are updated using gradient-descent optimization [69]. The 
complete derivation of weight adjustments for a basic backpropagation algorithm is given 
in [69]. The weight updates can be obtained as follows:
$$\Delta w_{ij}^{l}(m) = \eta\, \delta_i^{l} \cdot o_j^{l-1} + \alpha\, \Delta w_{ij}^{l}(m-1) \qquad (2.4)$$
where $\Delta w_{ij}^{l}(m)$ is the weight adjustment for $w_{ij}^{l}$ during the $m$th backward pass, $\eta$ is the 
learning rate parameter, $\delta_i^{l}$ is the partial derivative of the associated error (or the error term 
for short) for neuron $i$ in layer $l$, and $\alpha$ is the momentum term. The momentum term is added 
to speed up the learning without leading to oscillation [69]. For output neurons, $\delta_i^{L}$ is 
computed by
$$\delta_i^{L} = (y_i - o_i^{L}) \cdot f'\Big(\sum_{j=1}^{N_{L-1}} w_{ij}^{L}\, o_j^{L-1}\Big) \qquad (2.5)$$
and for hidden layer neurons it is computed as
$$\delta_i^{l} = f'\Big(\sum_{j=1}^{N_{l-1}} w_{ij}^{l}\, o_j^{l-1}\Big) \cdot \Big(\sum_{j=1}^{N_{l+1}} \delta_j^{l+1} \cdot w_{ji}^{l+1}\Big) \qquad (2.6)$$
The backpropagation algorithm is listed below:
Algorithm 2.1:
/* The backpropagation algorithm */
1. Initialize Weights.
2. If the errors of the output neurons are within the acceptable range for every pattern, quit.
3. For every pattern $s \in \{(X_1,Y_1), \ldots, (X_T,Y_T)\}$ do:
   3.1 For every layer $l = 2, 3, \ldots, L$ do
          For every neuron $i = 1, \ldots, N_l$ in layer $l$ do
             Compute the output of neuron $i$ as follows:
             $o_i^{l}(s) = f\big(\sum_{j=1}^{N_{l-1}} w_{ij}^{l}\, o_j^{l-1}(s)\big)$
          Endfor
       Endfor
   3.2 Compute Error Terms as follows.
       For the output layer neurons:
          $\delta_i^{L}(s) = (y_i(s) - o_i^{L}(s)) \cdot f'\big(\sum_{j=1}^{N_{L-1}} w_{ij}^{L}\, o_j^{L-1}(s)\big)$
       For other layers $l = L-1, L-2, \ldots, 2$ use the backpropagated error as follows:
          $\delta_i^{l}(s) = f'\big(\sum_{j=1}^{N_{l-1}} w_{ij}^{l}\, o_j^{l-1}(s)\big) \cdot \sum_{j=1}^{N_{l+1}} \delta_j^{l+1}(s) \cdot w_{ji}^{l+1}$
4. Compute the Accumulated Weight Updates for layers $l = 1, 2, \ldots, L-1$:
   $\Delta w_{ij}^{l+1}(m) = \eta \sum_{s=1}^{T} \delta_i^{l+1}(s) \cdot o_j^{l}(s) + \alpha\, \Delta w_{ij}^{l+1}(m-1)$
5. Update Weights for $l = 1, 2, \ldots, L-1$ as follows:
   $w_{ij}^{l+1} = w_{ij}^{l+1} + \Delta w_{ij}^{l+1}(m)$, where $i = 1, \ldots, N_{l+1}$ and $j = 1, \ldots, N_l$
6. Go back to step 2.
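One off-line (batch) pass of the algorithm above can be sketched as follows. This is a hedged illustration rather than the author's implementation: the sigmoid activation, the nested-list layout in which `weights[l]` links layer `l` to layer `l+1` (indexed from 0), and all function names are assumptions introduced for the example.

```python
import math

def f(x):
    # Assumed sigmoid activation; the text leaves f unspecified.
    return 1.0 / (1.0 + math.exp(-x))

def f_prime(x):
    s = f(x)
    return s * (1.0 - s)

def forward(weights, x):
    """Return (outputs per layer, NET values per non-input layer)."""
    outs, nets = [list(x)], []
    for w in weights:
        net = [sum(w_ij * o_j for w_ij, o_j in zip(row, outs[-1])) for row in w]
        nets.append(net)
        outs.append([f(v) for v in net])
    return outs, nets

def total_error(weights, patterns):
    """Error of Eq. (2.3), summed over all training patterns."""
    return 0.5 * sum((y_i - o_i) ** 2
                     for x, y in patterns
                     for y_i, o_i in zip(y, forward(weights, x)[0][-1]))

def batch_update(weights, patterns, prev_dw, eta=0.5, alpha=0.9):
    """Steps 3-5: accumulate delta_i(s) * o_j(s) over all patterns, then update."""
    acc = [[[0.0] * len(row) for row in w] for w in weights]
    for x, y in patterns:
        outs, nets = forward(weights, x)
        # Output-layer error terms, Eq. (2.5).
        delta = [(y_i - o_i) * f_prime(n_i)
                 for y_i, o_i, n_i in zip(y, outs[-1], nets[-1])]
        for l in range(len(weights) - 1, -1, -1):
            for i, d_i in enumerate(delta):
                for j, o_j in enumerate(outs[l]):
                    acc[l][i][j] += d_i * o_j
            if l > 0:
                # Hidden-layer error terms, Eq. (2.6).
                delta = [f_prime(nets[l - 1][j]) *
                         sum(d_i * weights[l][i][j] for i, d_i in enumerate(delta))
                         for j in range(len(outs[l]))]
    # Step 4: learning-rate term plus momentum term.
    dw = [[[eta * a + alpha * p for a, p in zip(a_row, p_row)]
           for a_row, p_row in zip(a_l, p_l)] for a_l, p_l in zip(acc, prev_dw)]
    # Step 5: apply the accumulated updates.
    new_w = [[[w + d for w, d in zip(w_row, d_row)]
              for w_row, d_row in zip(w_l, d_l)] for w_l, d_l in zip(weights, dw)]
    return new_w, dw
```

Because the error terms are accumulated over the whole training set before any weight changes, this corresponds to the off-line training mode that pattern partitioning relies on; with `eta=1` and `alpha=0` the returned `dw` is the negative gradient of the error in (2.3).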
2.1.3 The k-ary n-cube Parallel Architecture
Several existing parallel computers such as Ametech 2020, n-cube 1, n-cube 2, 
Mosaic, iWarp, and the Cray T3D are based on k-ary n-cube (KNC) topologies [10]. 
Recent studies on the KNC are reported in [10],[27]. We briefly review general properties 
of KNCs. The size (number of nodes) of an $n$-dimensional KNC is $k^n$, and each node can 
be uniquely identified by an $n$-digit $k$-ary label. The degree of each node is $2n$. The 
diameter of the network is $n \lfloor k/2 \rfloor$. Let $\langle e_{n-1}, e_{n-2}, \ldots, e_0 \rangle$ be the $k$-ary address of an arbitrary 
node in the KNC. This node is connected to every node $\langle e'_{n-1}, e'_{n-2}, \ldots, e'_0 \rangle$ for which there 
exists exactly one $i$ such that $e_i = (e'_i \pm 1) \bmod k$, while $e_j = e'_j$ for all other digits $j \ne i$. We make 
the following assumptions regarding the parallel architecture:
•  A simple arithmetic operation such as addition, subtraction, or multiplication takes 
$t_r$ units of time on a single processor of the KNC (see footnote 1). We refer to such an operation as 
an atomic task.
•  Each processor can communicate over one link at a time (single-port 
communication).
•  Per word communication time between two adjacent processors on the KNC is $t_c$.
•  Each communication unit is one word.
•  Communication and computation can be overlapped.
•  $t_r \le t_c$ (This is the case in most if not all current machines).
Footnote 1: This assumption can be easily changed to allow each operation to take a 
distinct time. It is intended here to simplify the expressions.
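The adjacency rule and the stated degree and diameter of the KNC can be checked with a short sketch. The tuple representation of addresses and the function names are illustrative assumptions, not notation from the text.

```python
from itertools import product

def knc_neighbors(addr, k):
    """Neighbors of a node whose address is a tuple of n digits in 0..k-1.

    Exactly one digit changes by +1 or -1 modulo k. A set is returned, so
    for k = 2 (where +1 and -1 coincide) each dimension contributes one node.
    """
    nbrs = set()
    for i, e in enumerate(addr):
        for step in (1, -1):
            nbrs.add(addr[:i] + ((e + step) % k,) + addr[i + 1:])
    return nbrs

def knc_distance(a, b, k):
    """Shortest-path length: per-digit ring distance, summed over all digits."""
    return sum(min((x - y) % k, (y - x) % k) for x, y in zip(a, b))

def knc_diameter(k, n):
    """Largest distance from node 0...0 (the network is vertex-symmetric)."""
    origin = (0,) * n
    return max(knc_distance(origin, b, k) for b in product(range(k), repeat=n))
```

For $k > 2$ each node has $2n$ neighbors, and `knc_diameter(k, n)` agrees with $n \lfloor k/2 \rfloor$.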
Now, we introduce a binomial spanning tree (BST) of a KNC which is utilized in this 
work for optimal communications. A 0-dimensional $k$-ary binomial tree has one node. A 1-dimensional 
$k$-ary binomial tree is obtained by connecting $k$ nodes in a linear array format 
and selecting the $\lfloor k/2 \rfloor$th node as the root. A 2-dimensional $k$-ary binomial tree is 
constructed out of $k$ 1-dimensional trees by connecting their roots as nodes of a 1-dimensional 
binomial tree. Figure 2.2 shows 1 and 2-dimensional 6-ary binomial trees. An 
$n$-dimensional $k$-ary binomial tree is constructed out of $k$ $(n-1)$-dimensional trees by 
connecting their roots as nodes of a 1-dimensional binomial tree. The height of the tree is 
clearly $n \lfloor k/2 \rfloor$. We assume that the root is at level 0.
We adopt the following labeling scheme for a BST. The scheme we introduce here 
is for an odd $k$. It can be easily modified for an even $k$. In a 1-dimensional $k$-ary BST, each 
node has a one digit label. The root gets address 0. Its left and right children are labeled 
$k-1$ and 1, respectively. The address of any other node on the left (right) subtree is obtained 
by decrementing (incrementing) the address of its parent. In an $n$-dimensional BST, the 
address of each node in any of the $k$ $(n-1)$-dimensional subtrees is obtained by appending 
the address of the root of the subtree (in the 1-dimensional BST) to the left of its $(n-1)$-digit 
node address. This is illustrated in Figure 2.2.
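Under the labeling scheme just described (odd $k$), the parent of a node can be found by moving its lowest-order nonzero digit one step toward 0. The sketch below illustrates this reading of the construction; the function names and the tuple address layout (most significant digit first) are assumptions made for the example.

```python
from itertools import product

def bst_parent(addr, k):
    """Parent of addr = (e_{n-1}, ..., e_0) in the BST; None for the root.

    A node whose digits below position p are all zero is the root of its
    subtree of that dimension, so its tree edge moves digit p toward 0.
    Assumes odd k, as in the labeling scheme of the text.
    """
    for pos in range(len(addr) - 1, -1, -1):   # scan from digit e_0 upward
        d = addr[pos]
        if d != 0:
            # Right-subtree labels 1..k//2 decrement; left-subtree labels increment.
            toward_zero = d - 1 if d <= k // 2 else (d + 1) % k
            return addr[:pos] + (toward_zero,) + addr[pos + 1:]
    return None

def bst_level(addr, k):
    """Number of tree edges between addr and the root (root is at level 0)."""
    level = 0
    parent = bst_parent(addr, k)
    while parent is not None:
        level += 1
        parent = bst_parent(parent, k)
    return level

def bst_height(k, n):
    return max(bst_level(a, k) for a in product(range(k), repeat=n))
```

Every tree edge produced this way changes a single digit by $\pm 1$ modulo $k$, so it is also a link of the KNC, and `bst_height(k, n)` matches the stated height $n \lfloor k/2 \rfloor$.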
Figure 2.2: 6-ary binomial trees: a) 1-dimensional, b) 2-dimensional.
2.1.4 Some Preliminaries in Graph Theory
Let $G = (V_G, E_G)$ be a graph with $p$ vertices and $q$ edges. A non-empty graph is a 
graph with a non-empty vertex set. We denote the edge between any two adjacent vertices 
$u$ and $v$ by $(u,v)$.
Definition 2.1: Given a non-empty graph $G = (V_G, E_G)$, the line graph $L(G)$ of $G$ is defined 
as that graph whose vertices have a one-to-one correspondence with edges of $G$ in such a 
way that any two vertices of $L(G)$ are adjacent if and only if their corresponding edges in
G are adjacent [19]. Note that two edges are considered adjacent if they have one vertex 
in common.
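Definition 2.1 can be illustrated with a small sketch that builds the line graph of an edge list. The representation of edges as vertex pairs and the function name are assumptions made for the example.

```python
from itertools import combinations

def line_graph(edges):
    """Line graph of a simple graph given as a list of (u, v) edges.

    Each edge of G becomes a vertex of L(G); two such vertices are adjacent
    exactly when the corresponding edges of G share an endpoint.
    """
    vertices = [tuple(e) for e in edges]
    adjacent = {(e1, e2) for e1, e2 in combinations(vertices, 2)
                if set(e1) & set(e2)}
    return vertices, adjacent
```

For a complete bipartite graph with two vertices on each side, the line graph has four vertices, and each of them is adjacent to the two edges that share one of its endpoints.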
Definition 2.2: A mapping $\Phi: V(G_1) \to V(G_2)$ is called an elementary contraction [19] for two 
graphs $G_1$ and $G_2$ if $G_1$ contains two adjacent vertices $u$ and $v$ such that:
a) $\Phi(u) = \Phi(v)$, and $\{u_1, v_1\} \ne \{u, v\}$ implies $\Phi(u_1) \ne \Phi(v_1)$,
b) $\{u_1, v_1\} \cap \{u, v\} = \emptyset$ implies $(u_1, v_1) \in E(G_1)$ if and only if $(\Phi(u_1), \Phi(v_1)) \in E(G_2)$,
c) If $w \in V(G_1)$ and $w \ne u, v$ then $(u,w) \in E(G_1)$ or $(v,w) \in E(G_1)$ if and only if 
$(\Phi(u), \Phi(w)) \in E(G_2)$.
We denote the graph obtained from identification of adjacent vertices $u$ and $v$ in graph $G$ by 
$EC_{uv}(G)$.
Definition 2.3: A graph $G_1$ is isomorphic to graph $G_2$ if there exists a one-to-one mapping 
$\Phi$, called isomorphism, from $V(G_1)$ to $V(G_2)$ such that $\Phi$ preserves adjacency [19]. 
Definition 2.4: A contraction $C: V(G_1) \to V(G_2)$ is a mapping that is either an isomorphism 
or a composition of finitely many elementary contractions [19].
2.2 Optimal Mapping of FFANN's
In this section we introduce a scheme for optimal mapping of FFANN's on k-ary n-cube 
multiprocessors. The objective is to map the computations of the FFANN to physical 
processors of the KNC such that the execution times required by both the training and 
retrieval phases of the FFANN are minimized. Our mapping scheme is based on network 
partitioning. However, we still utilize characteristics of the training algorithm to simplify 
our mapping procedure.
The nature of the backpropagation training algorithm does not allow the 
computations corresponding to a given layer of an FFANN to proceed unless the 
computations corresponding to the previous layer have been completed. During the 
forward pass of the algorithm, computations of layer $l$ precede those of layer $l+1$, where 
$1 \le l \le L-1$. On the other hand, computations of layer $l$ follow those of layer $l+1$ during 
the backward pass. This implies that for a particular pattern, the overlap of execution of 
tasks in different layers of an FFANN is not possible during the forward or backward passes 
of the backpropagation training algorithm. Thus, the problem of mapping an FFANN 
reduces to that of mapping a set of bipartite graphs (or BPG's for short) representing 
(overlapping) pairs of adjacent layers of the FFANN and their synaptic links. Since mapping 
of a certain BPG would be very similar to that of another, we can simplify the problem 
further by considering a virtual BPG that is large enough to accommodate any of the BPG's 
corresponding to adjacent layers.
The key issue here is to determine the size of each partite set of the virtual BPG. 
Let $l_{omax}$ and $l_{emax}$ be the indexes of the largest layers among odd-indexed and even-indexed 
layers of the FFANN, respectively. Notice that these layers need not be adjacent in the 
FFANN. The sizes of these two layers are denoted by $N_{omax}$ and $N_{emax}$, respectively. It is 
clear that a BPG whose partite sets have these sizes can accommodate any two adjacent layers in the given FFANN in the 
sense that it is a supergraph of the graph representing these layers. We will refer to such a 
BPG as the largest virtual layer graph (or LVL for short). Figure 2.3 shows the LVL graph 
for a generic FFANN. Conceptually, an $L$-layer FFANN can be considered as $L$ overlapping 
BPG's. Each such BPG consists of two adjacent layers of neurons and their connecting
Figure 2.3: The largest virtual layer (even-layer and odd-layer partite sets).
synaptic links. Such graphs can also be used to represent the concurrent communications 
and computations involved during the forward pass or the backward pass of the 
backpropagation algorithm. We denote the BPG representing layers $l$ and $l+1$ and their 
corresponding links by $BPG_l$. Figure 2.4 illustrates BPG's of a generic $L$-layer FFANN.
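The sizes of the two partite sets of the LVL follow directly from the layer sizes. A minimal sketch, assuming the layers are indexed from 0 and given as a list of neuron counts (the function name is hypothetical):

```python
def lvl_sizes(layer_sizes):
    """Return (N_omax, N_emax) for an FFANN.

    layer_sizes[l] is the number of neurons in layer l (index 0 is assumed
    to be the input layer). Any two adjacent layers consist of one
    odd-indexed and one even-indexed layer, so both fit inside a bipartite
    graph with partite sets of these two sizes.
    """
    n_omax = max(n for l, n in enumerate(layer_sizes) if l % 2 == 1)
    n_emax = max(n for l, n in enumerate(layer_sizes) if l % 2 == 0)
    return n_omax, n_emax
```

For example, a network with layer sizes 4, 7, 3, 5, 2 has its largest odd-indexed layer of size 7 and its largest even-indexed layer of size 4; these two layers are adjacent in this case, but in general they need not be.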
After obtaining the LVL, our mapping methodology consists of three steps. Initially, 
an optimal processor assignment is obtained for the LVL on a virtual KNC which we refer 
to as VKNC henceforth. The mapping of the FFANN onto the VKNC is then obtained by 
generalizing that of its LVL. In particular, the mapping scheme obtained for the LVL is
Figure 2.4: BPG graphs of an FFANN ($BPG_1$, $BPG_2$, ..., $BPG_{L-1}$, $BPG_L$).
applied to every BPG of the FFANN. Finally, the VKNC is contracted (or folded) to fit the 
actual KNC or a subnetwork thereof based on the time optimality requirement. The 
contraction of the VKNC is performed such that the learning is time optimal. We will explain 
these steps in detail in the following subsections.
2.2.1 Mapping the LVL onto a VKNC Architecture
Here we describe a scheme for mapping the LVL associated with a given FFANN 
onto a VKNC architecture. Initially, we obtain the line graph of the LVL, denoted by 
$LG = (V_{LG}, E_{LG})$ (see Definition 2.1). Notice that every node of the LG represents a pair 
of adjacent neurons; one from the even layer and one from the odd layer. We assign vertices 
of the LG to a VKNC with radix $k$ and dimension $n_v = n_o + n_e$, where $n_o = \lceil \log_k N_{omax} \rceil$ 
and $n_e = \lceil \log_k N_{emax} \rceil$. Each of the $k^{n_v}$ nodes of the VKNC can be uniquely identified by 
an $n_v$-digit address. Obviously, $|V_{LG}| \le k^{n_v}$, where $|V_{LG}|$ is the size of the LG. Thus, the
VKNC is large enough to fit the LG in the sense that every vertex of the LG is assigned to 
a unique node of the VKNC. The assignment is done as explained below.
To identify each node in $V_{LG}$, we use the following labeling scheme. The label of 
a node in $V_{LG}$ representing the synaptic link between neuron $\langle a^{n_o-1}, a^{n_o-2}, \ldots, a^{0} \rangle$ in the 
odd-layer and neuron $\langle b^{n_e-1}, b^{n_e-2}, \ldots, b^{0} \rangle$ in the even-layer of the LVL is obtained by 
concatenating the labels of the two neurons as follows:
$$\langle a^{n_o-1}, a^{n_o-2}, \ldots, a^{0}, b^{n_e-1}, b^{n_e-2}, \ldots, b^{0} \rangle$$
For convenience, we will separately denote the two components of the address of node $v_i$ 
in $V_{LG}$ by $\langle A_i, B_i \rangle$, where
$$A_i = a_i^{n_o-1}, a_i^{n_o-2}, \ldots, a_i^{0}, \quad \text{and} \quad B_i = b_i^{n_e-1}, b_i^{n_e-2}, \ldots, b_i^{0}$$
We refer to $A_i$ and $B_i$ as the odd-range and the even-range of the node address in the VKNC, 
respectively.
At this point, the $n_v$-digit label of each node in the LG graph directly specifies the 
processor in the VKNC to which that node will be assigned. Based on this assignment 
scheme a copy of neuron $J$ from the odd-layer of the LVL is assigned to $N_{emax}$ nodes of the 
VKNC with addresses $\langle A_i, B_i \rangle$ where $A_i = J$, and $0 \le B_i \le N_{emax} - 1$. Similarly, a copy of neuron 
$I$ from the even-layer of the LVL is assigned to $N_{omax}$ nodes of the VKNC with addresses 
$\langle A_i, B_i \rangle$ where $0 \le A_i \le N_{omax} - 1$ and $B_i = I$. Figure 2.5 shows a sample LVL and the assignment 
of its neurons to a 3-dimensional binary VKNC. The assignment of node $\langle I, J \rangle$ of the LG 
to a node of the VKNC implies storing, at that node, (or in some cases initializing, as we 
explain later) the parameters of a 5-tuple $(w_{IJ}, o_I, o_J, \delta_I, \delta_J)$ in the corresponding virtual 
processor. The parameters of the 5-tuple are defined as follows: $w_{IJ}$ is the synaptic weight
Figure 2.5: Processor assignment: a) a sample LVL, and b) its implementation on a 3-D hypercube.
between the two neurons $I$ and $J$, $o_I$ ($o_J$) is the output of neuron $I$ ($J$), and $\delta_I$ ($\delta_J$) is the error 
term mapped to neuron $I$ ($J$). For simplicity we refer to this 5-tuple as $DATA_{IJ}$.
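The address construction described above amounts to writing the two neuron indices in radix $k$ and concatenating the digit strings. A minimal sketch, with neurons indexed from 0 and all function names chosen for illustration only:

```python
def num_digits(size, k):
    """Smallest n with k**n >= size, i.e. ceil(log_k(size)) without floats."""
    n = 0
    while k ** n < size:
        n += 1
    return n

def to_digits(value, k, width):
    """Radix-k digits of value, most significant first, padded to width."""
    digits = []
    for _ in range(width):
        digits.append(value % k)
        value //= k
    return tuple(reversed(digits))

def vknc_address(odd_neuron, even_neuron, n_omax, n_emax, k):
    """VKNC address <A, B> of the link between the two given LVL neurons.

    A is the odd-range (n_o digits) and B the even-range (n_e digits).
    """
    n_o, n_e = num_digits(n_omax, k), num_digits(n_emax, k)
    return to_digits(odd_neuron, k, n_o) + to_digits(even_neuron, k, n_e)
```

With $k = 2$, $N_{omax} = 2$ and $N_{emax} = 4$ (sizes picked only for this example), the link between odd-layer neuron 1 and even-layer neuron 2 lands on node `(1, 1, 0)` of a 3-dimensional binary VKNC, and every link receives a distinct address.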
2.2.2 Applying the Assignment Procedure to the FFANN
The simulation of an FFANN onto the VKNC can be obtained through a 
straightforward generalization of the mapping of its corresponding LVL. The mapping 
procedure developed for the LVL is applied independently to each $BPG_l$ ($1 \le l \le L$) of the 
FFANN. Basically, the assignment of neurons in the odd-layer (even-layer) of the LVL is 
applied to each odd-indexed (even-indexed) layer of the FFANN.
The assignment procedure should provide proper communications between adjacent 
BPGs. Outputs of neurons in layer $l$, which are computed by the $BPG_l$, are used by $BPG_{l+1}$ 
to compute outputs of neurons in layer $l+1$. Similarly, error terms associated with neurons 
in layer $l$, which are computed by the $BPG_{l+1}$, are used by $BPG_l$ to compute the error terms
of neurons in layer $l-1$. The following assignment procedure is designed to provide for such 
communications.
Without loss of generality, assume that $l$ is an odd number. Neurons in the odd-layer 
of $BPG_l$ (layer $l$) are mapped to neurons in the odd-layer of the LVL such that neuron $i$ from 
the odd-layer of $BPG_l$ is mapped to neuron $i$ from the odd layer of the LVL, for $1 \le i \le N_l$. 
Also, neurons in the even-layer of $BPG_l$ (layer $l-1$) are mapped to neurons in the even-layer 
of the LVL in a similar manner. Note that by its definition the LVL can fit any $BPG_l$. The 
processor assignment for each neuron in $BPG_l$ is the same as that of the neuron in the LVL 
to which it maps. Recall that the assignment for each neuron in the LVL was obtained by 
mapping the corresponding LG to the VKNC.
2.2.3 Optimal Simulation of Each Learning Pass on the VKNC
Both the forward and backward passes of the gradient-descent backpropagation 
algorithm include the computation of sum-of-products terms. These steps compute the 
output or the error term of each neuron during the forward and backward passes of the 
algorithm, respectively. Efficient implementation of these steps results in efficient overall 
execution of the algorithm. This is due to the fact that the communications and 
computations associated with these steps require significantly more time than other steps 
involved. Note that other steps can be computed locally in each processor without any 
communication overhead.
According to our processor assignment, each virtual processor is assigned a pair of 
adjacent neurons. During the sum-of-products computations, each processor computes one 
product term, the smallest possible task. Then, the sum of such products could be
computed using a fan-in algorithm involving all processors of the KNC. However, this 
might not be the time-optimal solution. We must find the granularity or the extent of 
parallelism which results in the optimal solution, given the specific times needed to perform 
a computation and to communicate.
We need to find the dimension of the KNC architecture which can compute the sum 
of products of $k^n$ pairs optimally. We denote the dimension of such a KNC by $n^{opt}$. Clearly, 
$0 \le n^{opt} \le n$. We shall utilize the BST rooted at an arbitrary node of the KNC to perform 
these computations in minimum time. We first obtain the optimal time for adding $k^n$ 
numbers on an $n$-dimensional KNC. We adhere to the assumptions stated in Subsection 2.1.3 
throughout our derivations.
Theorem 2.1: The minimum time to compute the sum of $k^n$ numbers stored at the nodes of 
an $n$-dimensional KNC is $n\,(\lceil k/2 \rceil t_c + 2t_r)$, where $t_c$ is the communication time between 
two adjacent processors, and $t_r$ is the computation time on a single processor for a simple 
mathematical operation.
Proof: We use mathematical induction. Let $T(n)$ denote the time required for adding $k^n$ 
numbers on an $n$-dimensional KNC.
Induction basis ($T(1)$):
Consider the case where $k$ is even. We shall refer to the node at which the final sum 
resides as the root. We need to show that the sum of $k$ numbers residing on nodes of a 1-dimensional 
KNC can be computed optimally in $T(1) = (k/2)\,t_c + 2t_r$ time units. The 
diameter of the 1-dimensional KNC is $\lfloor k/2 \rfloor$ [10]. Hence, at least $(k/2)\,t_c$ time units are 
required to get every value to the root. The minimum time required to add $k$ numbers
(assuming that a processor can add two numbers in $t_r$ units of time) is $\lceil \log_2 k \rceil\, t_r$ time units. 
This is accomplished by performing the addition in a binary tree fashion. To obtain the 
optimal addition scheme on the 1-dimensional KNC we have to find the maximum number 
of computational (addition) steps which can be overlapped with communication steps. Each 
overlapped step takes $t_c$ units of time, since $t_c \ge t_r$. We show that under the given 
conditions, two computation steps cannot be overlapped with communication. One is 
clearly the final addition which is performed to obtain the final sum. Next, we show that the 
first addition step (in any algorithm) cannot be overlapped either.
Originally, each processor has one operand assigned to it. Hence, the first step of 
any algorithm involves the transfer of values from some processors to others. The first step 
takes $t_c$ time units. Clearly, there exist processors at distances $k/2$, $k/2 - 1$, ..., and 1 from the 
root. We denote the processor at distance $i$ from the root by $P[i]$. If processor $P[k/2]$ does not 
participate in the first transfer, it still has to send its operand to the root, which takes at least $(k/2)\,t_c$ time 
units. Thus, the problem would require at least $(k/2 + 1)\,t_c + t_r$ time units, which is greater 
than or equal to the stated bound.
On the other hand, if processor $P[k/2]$ transfers its value during the first step, it can 
only send it to its adjacent processor denoted by $P[k/2 - 1]$, which is at distance $k/2 - 1$ from 
the root. Now, we show that to perform the addition in minimum time, the second step of 
any algorithm should perform only addition. Assume that at least one processor performs 
a transfer during the second step. We consider two cases. First assume that processor 
$P[k/2 - 1]$ is among the processors that communicate during the second step. Therefore, the 
second step takes $t_c$ time units. Note that this processor has two operands after the initial
communication step, one received from processor $P[k/2]$ and the one assigned to it 
originally. So, if processor $P[k/2 - 1]$ sends one of the two values during the second step, 
it still needs $k/2 - 1$ communication steps to send the second operand to the root. Hence, 
in this case the problem would require at least $(k/2 + 1)\,t_c + t_r$ time units, which is greater 
than or equal to the stated minimum.
Second, assume that processor $P[k/2 - 1]$ is performing addition, but some other 
processor is performing a communication during Step 2. Processor $P[k/2 - 1]$ needs either 
at least $k/2 - 1$ communication steps to send the sum it has calculated during the second 
step to the root or at least $k/2$ steps to send its two operands. The latter case would take 
at least $(k/2 + 1)\,t_c + t_r$ units of time, which is again greater than or equal to the stated 
bound. Thus, in Step 2 only addition should be performed. Hence, under the given 
conditions, the lower bound on computing the sum of $k$ values on a 1-dimensional KNC 
is $T(1) = (k/2)\,t_c + 2t_r$ time units.
When $k$ is odd, there are two nodes at distance $\lfloor k/2 \rfloor$ from the root. Since each 
processor can read from one port at a time, at least $\lceil k/2 \rceil$ communication steps are 
required. Similar to the case with an even $k$, there are two addition steps which cannot be 
overlapped with communications. Hence, the optimal addition scheme takes 
$T(1) = \lceil k/2 \rceil t_c + 2t_r$ time units when $k$ is odd. Since $\lceil k/2 \rceil = k/2$ when $k$ is even, the 
theorem is true for the base case.
Induction Step ($T(n)$ implies $T(n+1)$):
Assume $T(n) = n\,(\lceil k/2 \rceil t_c + 2t_r)$ is the minimum time to compute the sum of $k^n$ 
values on an $n$-dimensional KNC. We show that $T(n+1) = (n+1)\,(\lceil k/2 \rceil t_c + 2t_r)$ is the
optimal time to add $k^{n+1}$ values on an $(n+1)$-dimensional KNC. By definition, an $(n+1)$-dimensional 
KNC is obtained by connecting $k$ $n$-dimensional KNCs as nodes of a 1-dimensional 
KNC. By the induction hypothesis, the local sum at some node (call it root) 
of an $n$-dimensional KNC is computed optimally in $T(n) = n\,(\lceil k/2 \rceil t_c + 2t_r)$ time units. 
It remains to add the values stored at the roots of the $n$-dimensional KNCs. These $k$ roots 
form a 1-dimensional KNC. By the induction basis, the minimum time to add $k$ values on 
a 1-dimensional KNC is $T(1) = \lceil k/2 \rceil t_c + 2t_r$ time units. Thus, the optimal time to add $k^{n+1}$ 
values on an $(n+1)$-dimensional KNC is
$$T(n+1) = T(n) + T(1)$$
$$= n\,(\lceil k/2 \rceil t_c + 2t_r) + (\lceil k/2 \rceil t_c + 2t_r)$$
$$= (n+1)\,(\lceil k/2 \rceil t_c + 2t_r)$$
Thus the proof is complete. □
Next, we present a fan-in algorithm which adds $k^n$ numbers on an $n$-dimensional
KNC optimally when $k$ is an even number. The algorithm can be easily modified for an odd
$k$. We define $lsum[i]$ to represent the partial sum stored in node $i$ of the KNC. Initially,
$lsum[i]$ is equal to the value assigned to node $i$. The fan-in addition algorithm is listed
below.
Algorithm 2.2:
/* Fan-in Addition */
Begin:
For ( $m = 1$ to $n$ ) Do
   /* compute the sum in every m-dimensional BST and store it in its root */
   For ( $p = 0$ to $k^{n-m} - 1$ ) Parallel Do
      1. /* Add values in any two adjacent nodes and send the sum toward the root */
      For ( $j = 0$ to $\lceil k/4 \rceil - 1$ ) Parallel Do
         For ( the left and right branches ) Parallel Do
            1.1. If ( left branch ) Do
                    send $lsum[pk^m + (k/2 + 2j)\,k^{m-1}]$
                       to $[pk^m + ((k/2 + 2j + 1) \bmod k/2)\,k^{m-1}]$;
                    add $lsum[pk^m + (k/2 + 2j)\,k^{m-1}]$
                       to $lsum[pk^m + ((k/2 + 2j + 1) \bmod k/2)\,k^{m-1}]$;
                    send $lsum[pk^m + (k/2 + 2j + 1)\,k^{m-1}]$
                       to $[pk^m + ((k/2 + 2j + 2) \bmod k/2)\,k^{m-1}]$;
            1.2. Else /* right branch */
                    send $lsum[pk^m + (k/2 - 2j - 1)\,k^{m-1}]$
                       to $[pk^m + (k/2 - 2j - 2)\,k^{m-1}]$;
                    add $lsum[pk^m + (k/2 - 2j - 1)\,k^{m-1}]$
                       to $lsum[pk^m + (k/2 - 2j - 2)\,k^{m-1}]$;
                    send $lsum[pk^m + (k/2 - 2j - 2)\,k^{m-1}]$
                       to $[pk^m + (k/2 - 2j - 3)\,k^{m-1}]$;
         Endfor
      Endfor
      2. /* Transfer partial sums toward the root of each m-dimensional BST, and
            if a value has arrived at the root add it to the root's partial sum */
      For ( $i = 2$ to $k/2$ ) Do
         For ( $j = 0$ to $\lfloor \ldots / 2 \rfloor$ ) Parallel Do
            For ( the root and the left & right branches ) Parallel Do
               2.1. If ( the left branch ) Do
                       send $lsum[pk^m + (k/2 + 2j + 1)\,k^{m-1}]$ from
                          node $[pk^m + (k/2 + 2j + i)\,k^{m-1}]$
                          to node $[pk^m + ((k/2 + 2j + i + 1) \bmod k/2)\,k^{m-1}]$;
               2.2. Else If ( the right branch ) Do
                       send $lsum[pk^m + (k/2 - 2j - 2)\,k^{m-1}]$ from
                          node $[pk^m + (k/2 - 2j - i - 1)\,k^{m-1}]$
                          to node $[pk^m + (k/2 - 2j - i - 2)\,k^{m-1}]$;
               /* The computations in the root */
               2.3. Else If ( the root ) Do
                       If ( $k/2$ and $i$ are both even or both odd ) Do
                          add the $lsum$ received from $[pk^m + (k - 1)\,k^{m-1}]$ to $lsum[pk^m]$;
                       Else Do
                          add the $lsum$ received from $[pk^m + k^{m-1}]$ to $lsum[pk^m]$;
            Endfor
         Endfor
      Endfor
End;
Lemma 2.1: The above fan-in algorithm is time-optimal.
Proof: We first obtain the time required by each step in the algorithm. Step 1 consists of 
Steps 1.1 and 1.2 which are performed concurrently. These steps involve 2 communication 
steps between adjacent nodes and an addition step. Hence, they take $2t_c + t_r$ time units. 
Steps 2.1 through 2.3 are executed in parallel as well. Steps 2.1 and 2.2 consist of a 
communication between adjacent processors while Step 2.3 involves a single addition. 
Hence, these steps can be done concurrently in $t_c$ time units. The first $k/2 - 2$ iterations of 
Step 2 involve concurrent execution of Steps 2.1 through 2.3. During the final iteration of 
Step 2, only Step 2.3 is executed. Hence, Step 2 takes $(k/2 - 2)\,t_c + t_r$ time units. The 
algorithm consists of $n$ iterations of Steps 1 and 2. Therefore, the algorithm takes 
$T(n) = n\,((k/2)\,t_c + 2t_r)$ time units, which equals the lower bound of Theorem 2.1 for even $k$ and is therefore time-optimal. □
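The step times quoted in the proof can be tallied explicitly. The following is only a worked restatement of those figures for even $k$, not an additional result:

```latex
\begin{align*}
T_{\text{Step 1}} &= 2t_c + t_r \\
T_{\text{Step 2}} &= (k/2 - 2)\,t_c + t_r \\
T_{\text{Step 1}} + T_{\text{Step 2}} &= (k/2)\,t_c + 2t_r \\
T(n) &= n\,\big((k/2)\,t_c + 2t_r\big)
\end{align*}
```

Since $\lceil k/2 \rceil = k/2$ for even $k$, the last line coincides with the bound of Theorem 2.1.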
Next, we utilize Theorem 2.1 and Lemma 2.1 to develop an optimal method for 
computing the sum of products of $k^n$ pairs on an $n$-dimensional KNC, i.e., each processor 
holds two values (or several pairs) which must be multiplied and the result added to 
corresponding results at all participating processors. Depending on the ratio of 
communication time to computation time, the optimal solution might utilize only a subcube 
of the physical architecture. Therefore, we will first obtain the minimum time required by 
the computation on an $i$-dimensional KNC assuming that all processors participate in the 
computation ($i \le n$). Theorem 2.2 states this lower bound. Once we compute the lower 
bound, we will find the dimension of the subcube of the $n$-dimensional KNC which provides the 
best execution time.
Theorem 2.2: The minimum time to compute the sum of products of $k^n$ pairs on an $i$-dimensional 
KNC ($i \le n$) is
$$T(i) = i\,(\lceil k/2 \rceil t_c + 2t_r) + (2k^{n-i} - 1)\,t_r \qquad (2.7)$$
Proof: Let $Q_j$ pairs be assigned to processor $j$, where $1 \le j \le k^i$ and $Q_j \ge 1$. It takes $t_r$ time 
units to compute the first product. According to Theorem 2.1, the minimum time required 
to perform the fan-in addition for the first set of products is $i\,(\lceil k/2 \rceil t_c + 2t_r)$ time 
units. Hence, the first set of products can be computed and added to the root in 
$i\,(\lceil k/2 \rceil t_c + 2t_r) + t_r$ time units. The best possible scenario is one in which 
multiplication of the remaining pairs and the addition can be pipelined such that a new 
computation can begin every $t_r$ time units. This results in the following lower bound for 
computation of the sum of products:
$$T(i) = (\max_j Q_j)\,t_r + i\,(\lceil k/2 \rceil t_c + 2t_r) \qquad (2.8)$$
However, we show that this lower bound cannot be achieved. To obtain a tighter lower 
bound, we use the notion of reservation table [77]. Each physical processor is considered 
a stage of the pipeline (used to multiply and add the pairs) and is assigned a row in the 
reservation table. Each column of the reservation table represents either a communication 
time unit or a computation time unit. The average delay between any two successive 
initiations of new computations in the pipeline cannot be less than the maximum number of marks 
in any row of the reservation table [77].
At least ( i log2 k ) tr time units are needed to add the first set of products. Hence, 
there must exist processors which perform at least one addition and one multiplication in the
same step(s), regardless of how multiplication or addition steps take place. Since this is not 
allowed, according to [77] the average delay between initiations must be at least $2t_r$. This 
indicates that it takes $(\max_j Q_j - 1)\,2t_r$ time units to compute and fan-in the remaining 
products. Thus, we have
$$T(i) \ge (2\max_j(Q_j) - 1)\,t_r + i\,(\lceil k/2 \rceil t_c + 2t_r) \qquad (2.9)$$
We need to find the minimum for the RHS of the above inequality by determining the best 
distribution of products among processors. Since $\sum_j Q_j = k^n$ and $Q_j > 0$, clearly 
$\min(\max_j(Q_j))$ is obtained when pairs are distributed uniformly among processors, i.e., 
$\min(\max_j(Q_j)) = k^{n-i}$. Hence, the optimal time is
$$T(i) = (2k^{n-i} - 1)\,t_r + i\,(\lceil k/2 \rceil t_c + 2t_r) \qquad (2.10)$$
Thus the proof is complete. □
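The bound in (2.10) is easy to tabulate for a given machine. The sketch below evaluates it for every candidate subcube dimension; the function name and the sample parameter values are illustrative assumptions, not measurements from the text.

```python
def sum_of_products_time(i, n, k, t_c, t_r):
    """Eq. (2.10): time to compute the sum of products of k**n pairs
    on an i-dimensional subcube (0 <= i <= n) of a k-ary n-cube."""
    ceil_half_k = -(-k // 2)                      # integer ceil(k / 2)
    fan_in = i * (ceil_half_k * t_c + 2 * t_r)    # Theorem 2.1 term
    local = (2 * k ** (n - i) - 1) * t_r          # k**(n-i) pairs per processor
    return fan_in + local
```

For instance, with the made-up values $k = 4$, $n = 3$, $t_c = 10$, $t_r = 1$, the times for $i = 0, 1, 2, 3$ are 127, 53, 51 and 67, so using the full 3-dimensional cube is slower than using a 2-dimensional subcube.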
At this point, we can find the dimension ($i \le n$) of the subcube of the physical KNC 
which provides the best execution time. In other words, we find $n^{opt}$ such that 
$$T(n^{opt}) = \min_{0 \le i \le n} T(i).$$
Theorem 2.3: Let $T(i)$ denote the minimum time to compute the sum of products of $k^n$ 
pairs on an $i$-dimensional KNC. Then, $n^{opt}$ is either one of the two end points of the interval 
$[0, n]$, or is given by either the ceiling or floor function of
$$n^{opt} = n - \log_k\Big( \frac{\lceil k/2 \rceil\, t_c}{2\, t_r \ln k} + \frac{1}{\ln k} \Big) \qquad (2.11)$$
Proof: For now, assume $i$ is a real positive number. Then, $T(i)$, given by (2.10), is a continuous function on 
the set of real numbers. The first derivative of $T(i)$ is given as follows:
$$\frac{dT(i)}{di} = -2\,t_r\,(\ln k)\,k^{n-i} + \lceil k/2 \rceil t_c + 2t_r \qquad (2.12)$$
In addition, the second derivative of $T(i)$ is given by
$$\frac{d^2T(i)}{di^2} = 2\,t_r\,(\ln k)^2\,k^{n-i} \qquad (2.13)$$
According to [9], if $c$ is an extreme point of this function, then one of the following must 
hold: i) $T'(c)$ fails to exist, or ii) $T'(c) = 0$. Since the first derivative exists for every $i$,
the extreme is obtained by setting $T'(i) = 0$. The second derivative is always positive,
indicating that $T(i)$ is a convex function. Hence, $T(i)$ has one minimum which is given by:
$$i_{min} = n - \log_k\Big( \frac{\lceil k/2 \rceil\, t_c}{2\, t_r \ln k} + \frac{1}{\ln k} \Big) \qquad (2.14)$$
We use the above result to find the minimum $T(i)$ in the closed interval $[0, n]$. Clearly, if $i_{min}$, 
obtained from (2.14), is in the closed interval $[0, n]$, then it is also the minimum for the 
function over the closed interval, i.e., $i_{min} = n^{opt}$. However, if $i_{min} < 0$ ($i_{min} > n$) then $n^{opt} = 0$
($n^{opt} = n$). This is due to the fact that $T(i)$ is a convex function.
For our purposes, $n^{opt}$ must be an integer value in $[0, n]$. Hence, either the ceiling or 
the floor of the RHS of (2.14) results in minimum $T(i)$ for integer values of $i$ in the given 
interval. □
We use Theorem 2.3 to find the optimal degree of parallelism for each layer. We 
divide the neurons of each layer into $n^{opt}$ clusters, where $n_l^{opt}$ is the optimal dimension of
a subcube of the VKNC executing a sum-of-products computation involving neurons in layer 
/. The size of each cluster for an even (or odd) layer is denoted by E° (or Oz°), where
Ei = k f o r  an even I 
O z° = k*l for an odd I
The significance of $E_l^0$ (and $O_l^0$) will be clarified shortly. For simplicity, henceforth we
refer to $n_l^{opt}$ as $n_l$.
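The selection rule of Theorem 2.3 can be sketched in a few lines of Python. This is a minimal illustration, assuming the reading of (2.10) with $k^{n-i}$ products per processor; the function names `sum_of_products_time` and `optimal_dimension` are hypothetical and do not come from the original text.

```python
import math

def sum_of_products_time(i, n, k, t_r, t_c):
    """T(i) as in (2.10): k**n product pairs on an i-dimensional k-ary cube."""
    return (2 * k ** (n - i) - 1) * t_r + i * (math.ceil(k / 2) * t_c + 2 * t_r)

def optimal_dimension(n, k, t_r, t_c):
    """Integer i in [0, n] minimizing T(i), using the convexity argument."""
    # Real-valued critical point, obtained by setting dT/di = 0.
    ratio = (math.ceil(k / 2) * t_c + 2 * t_r) / (2 * t_r * math.log(k))
    n_min = n - math.log(ratio, k)
    # Clamp to the interval; convexity means an end point wins outside it.
    n_min = min(max(n_min, 0.0), float(n))
    # Only the two neighbouring integers need to be compared.
    candidates = sorted({math.floor(n_min), math.ceil(n_min)})
    return min(candidates, key=lambda i: sum_of_products_time(i, n, k, t_r, t_c))
```

A slow link (large `t_c` relative to `t_r`) pushes the result toward a smaller subcube, which is the trade-off the theorem formalizes.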
Once the overall sum is calculated for layer l during the forward pass (backward
pass), it should then be broadcast to every node of the subcube associated with neurons
in layer l+1 (l-1) using an optimal one-to-all broadcast algorithm. According to [10], this
task (during the forward pass) can be performed optimally in $n_{l+1}(k/2)\,t_c$ time units when k
is even. For the case of odd k it is shown that the one-to-all broadcast can be performed
in $n_{l+1}[k/2]$ time units [10]. Notice that the size of the subcube used for such broadcasts
is properly chosen for each layer to minimize the communication overhead. This maximizes
the performance even for non-uniform FFANNs.
2.2.4 Optimal Folding of the VKNC
The VKNC will generally be larger in size than the physical KNC. In this subsection, 
we introduce a procedure for partitioning nodes of the VKNC and assigning them to those 
of the physical KNC. The mapping procedure ensures optimal simulation of the FFANN 
network.
The assignment procedure is based on the topological structure of the KNC. An
n-dimensional KNC structure contains k edge-disjoint (n-1)-dimensional KNCs interconnected
as vertices of a 1-dimensional KNC. There are n distinct ways to partition nodes of an
n-dimensional KNC into k (n-1)-dimensional KNCs, obtained by partitioning along any of the
n dimensions. By selecting any of the n digits and fixing its value to, say, r, an
(n-1)-dimensional KNC is uniquely specified. Since r is such that 0 ≤ r ≤ k-1, k distinct
(n-1)-dimensional KNCs are identified uniquely based on that digit.
In our scheme, the VKNC undergoes several contractions (see Definition 2.4) until
its size becomes equal to that of the actual KNC. At each iteration of the contraction
process, the dimensionality of the VKNC is reduced by 1, by identifying all nodes whose
k-ary addresses differ only in a particular digit. We refer to this digit as the folding digit.
A set, called the neuron set, is associated with each node of the contracted VKNC to
represent all neurons assigned to the node. Originally, the neuron set of a node contains
the DATA (defined in Subsection 2.2.1) associated with every neuron in its cluster. After each
contraction, the neuron set of each resulting node (obtained as a result of contracting k
nodes of the original architecture) is the union of the neuron sets of the k identified
nodes. Notice that during each folding step, every set of k VKNC vertices whose
addresses differ in only one digit is contracted to one node, called the identified node.
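The contraction step described above can be illustrated with a short Python sketch. It assumes node addresses are stored as tuples of k-ary digits with index 0 as the least significant digit, and that each neuron set is a plain Python set; the name `fold` and this data layout are illustrative assumptions, not part of the original scheme.

```python
from collections import defaultdict
from itertools import product

def fold(neuron_sets, digit):
    """Contract a k-ary cube along `digit`, merging the neuron sets.

    `neuron_sets` maps a node address (tuple of k-ary digits, index 0 is the
    least significant digit) to the set of neurons assigned to that node.
    """
    folded = defaultdict(set)
    for address, neurons in neuron_sets.items():
        # Nodes that differ only in the folding digit share this reduced address.
        reduced = address[:digit] + address[digit + 1:]
        folded[reduced] |= neurons
    return dict(folded)

# Example: a 2-dimensional 3-ary cube with one neuron per node (each neuron is
# labelled by its original address), folded along digit 1.
cube = {addr: {addr} for addr in product(range(3), repeat=2)}
folded = fold(cube, 1)
```

After the fold, each of the three remaining nodes holds the union of the neuron sets of the three nodes that were identified with it.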
The key issue here is how to select the folding digit during each iteration of the
contraction process to minimize the overall execution time. Obviously, $n_f = n_v - n_a$
iterations are required until the folded VKNC is equal in size to the actual KNC, where $n_v$
and $n_a$ are the dimensions of the VKNC and the physical KNC, respectively. Further
contractions (see Definition 2.1.4) might be necessary to ensure the time optimality of the
overall simulation. As shown earlier, the optimal solution might utilize a subnetwork of the
actual KNC. The best size of the subcube is given by Theorem 2.3.
The folding digit can be selected from either the odd range or the even range. Let
$n_{fo}$ and $n_{fe}$ denote the number of folding digits from the odd and even ranges, respectively.
A particular layer l is affected by odd (even) foldings only if $n_l > n_o - n_{fo}$ (or $n_l > n_e - n_{fe}$);
that is, an odd (or even) layer l would be folded $\max(n_l - n_o + n_{fo}, 0)$
($\max(n_l - n_e + n_{fe}, 0)$) times after $n_{fo}$ odd foldings ($n_{fe}$ even foldings). Folding different
digits in a particular range might result in different processor utilization. Next, we show
how to select the folding digit to maximize processor utilization.
Lemma 2.2: Folding along the most significant digit in a particular range (odd or even) 
maximizes processor utilization.
Proof: Without loss of generality, assume that the folding digit for the next iteration is from
the odd range. If every $BPG_l$ were exactly the same size as the LVL, the proof would be
trivial. It can be easily shown that folding along any odd digit would result in a uniform
distribution of LVL tasks among processors.
Assume that for an odd layer l, $n_l < n_o$. The assignment procedure assigns nodes of
this layer to the first $k^{n_l}$ nodes of the LVL. So, the remaining $k^{n_o} - k^{n_l}$ nodes have one less
task assigned to each. Now, assume that the folding digit for the next iteration is from the
odd range. Folding any of the first (least significant) $n_l$ digits would result in $k^{n_l-1}$ nodes
with k tasks and $k^{n_o-1} - k^{n_l-1}$ nodes with no tasks from layer l. However, folding any of
the $n_o - n_l$ most significant digits results in $k^{n_l}$ nodes with one task each and $k^{n_o-1} - k^{n_l}$
nodes with no tasks from layer l. Clearly, the latter folding results in a more uniform
distribution of tasks among processors, resulting in higher processor utilization (to the extent
indicated by Theorem 2.3). According to Theorem 2.3, this is necessary to guarantee
time-optimal execution. Since $n_l$ varies from one odd layer to another, in general, selecting the
most significant digit always guarantees that the selected digit is among the $n_o - n_l$ most
significant digits. Thus, selecting the most significant digit guarantees best processor
utilization and ensures time optimality since it guarantees the best possible uniform
distribution of tasks. □
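The counting argument in the proof can be checked numerically. The sketch below is an illustration only: it assumes nodes are numbered 0 to $k^{n_o} - 1$, that layer l places exactly one task on each of the first $k^{n_l}$ nodes, and that digit 0 is the least significant digit. The helper name `tasks_after_fold` is hypothetical.

```python
from collections import Counter

def tasks_after_fold(k, n_o, n_l, digit):
    """Tasks per node after folding `digit` (0 = least significant).

    Assumes layer l places one task on each of the first k**n_l nodes of a
    k-ary cube with n_o digits, as in the proof of Lemma 2.2.
    """
    load = Counter()
    for node in range(k ** n_o):
        tasks = 1 if node < k ** n_l else 0
        low = node % (k ** digit)             # digits below the folding digit
        high = node // (k ** (digit + 1))     # digits above the folding digit
        load[high * (k ** digit) + low] += tasks
    return load

# k = 3, n_o = 3, n_l = 1: compare folding digit 0 with folding digit 2.
load_low = tasks_after_fold(3, 3, 1, 0)
load_high = tasks_after_fold(3, 3, 1, 2)
```

Folding the least significant digit stacks all three tasks on a single node, while folding the most significant digit leaves three nodes with one task each, which matches the node counts given in the proof.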
Next, we will derive the total execution time for each learning pass and will show
how to select the folding digits properly to minimize the time. We assume that the
computations of f(x) and f'(x) take $c_1 t_r$ and $c_2 t_r$ time, respectively, where f is the activation
function, and $c_1$ and $c_2$ are constants.
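Theorem 2.4: The time required by each learning pass for an L-layer FFANN (assuming
L is an even integer) on the contracted VKNC after folding α even and β odd digits is
$$\begin{aligned}
T^{t}(\alpha,\beta) ={}& \sum_{\substack{l=1 \\ \text{odd } l}}^{L-1} O_l^{\beta}\left[\, m_1\left(E_{l+1}^{\alpha} + m_4 E_{l-1}^{\alpha}\right) + m_2\left(n_{l-1}^{\alpha} + n_{l+1}^{\alpha}\right) + m_3 \right] \\
&+ \sum_{\substack{l=0 \\ \text{even } l}}^{L} E_l^{\alpha}\left[\, m_1\left(m_4 O_{l-1}^{\beta} + O_{l+1}^{\beta}\right) + m_2\left(n_{l-1}^{\beta} + n_{l+1}^{\beta}\right) + m_3 \right]
\end{aligned} \tag{2.15}$$
where $t = \alpha + \beta$, $m_1 = 2t_r$, $m_2 = (\lceil k/2\rceil + 1)\,t_c + 2t_r$, $m_3 = c_1 + c_2$, and $m_4 = 2.5$.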
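Proof: We compute the overall computation time by determining the execution time
required by each step involved in each learning pass. Without loss of generality, assume that
k is even. The forward pass consists of three steps. For an odd layer l, $E_{l-1}^{\alpha} O_l^{\beta}$
multiplications and $E_{l-1}^{\alpha} O_l^{\beta}$ additions take place during Step 1, requiring a total time of
$2E_{l-1}^{\alpha} O_l^{\beta}\,t_r$. The $net_j(p)$ term is computed in Step 2 and broadcast in Step 3 using the
given fan-in and fan-out algorithms, respectively. Using the results of Theorem 3.3, we can
show that these two steps take $O_l^{\beta}\lfloor k/2\rfloor\,(n_{l-1}^{\alpha}(t_r + t_c) + n_{l+1}^{\alpha}\,t_c)$ time units for each odd
layer l, and $E_l^{\alpha}\lfloor k/2\rfloor\,(n_{l-1}^{\beta}(t_r + t_c) + n_{l+1}^{\beta}\,t_c)$ units for each even layer l. By adding up the
times required by different steps for each layer, we obtain equation (2.16):
$$\begin{aligned}
T_{forward}(t) ={}& \sum_{\substack{l=1 \\ \text{odd } l}}^{L-1} \left[\, 2\,O_l^{\beta}E_{l-1}^{\alpha}\,t_r + O_l^{\beta}\lfloor k/2\rfloor\left(n_{l-1}^{\alpha}(t_r + t_c) + n_{l+1}^{\alpha}\,t_c\right) + O_l^{\beta}c_1 t_r \right] \\
&+ \sum_{\substack{l=0 \\ \text{even } l}}^{L} \left[\, 2\,O_{l-1}^{\beta}E_l^{\alpha}\,t_r + E_l^{\alpha}\lfloor k/2\rfloor\left(n_{l-1}^{\beta}(t_r + t_c) + n_{l+1}^{\beta}\,t_c\right) + E_l^{\alpha}c_1 t_r \right]
\end{aligned} \tag{2.16}$$
Using a similar approach, we obtain the time taken by the backward pass for each
pattern as follows:
$$\begin{aligned}
T_{backward}(t) ={}& \sum_{\substack{l=1 \\ \text{odd } l}}^{L-1} \left[\, 2\,O_l^{\beta}E_{l+1}^{\alpha}\,t_r + O_l^{\beta}\lfloor k/2\rfloor\left(n_{l+1}^{\alpha}(t_r + t_c) + n_{l-1}^{\alpha}\,t_c\right) + O_l^{\beta}c_2 t_r \right] \\
&+ \sum_{\substack{l=0 \\ \text{even } l}}^{L} \left[\, 2\,O_{l+1}^{\beta}E_l^{\alpha}\,t_r + E_l^{\alpha}\lfloor k/2\rfloor\left(n_{l+1}^{\beta}(t_r + t_c) + n_{l-1}^{\beta}\,t_c\right) + E_l^{\alpha}c_2 t_r \right]
\end{aligned} \tag{2.17}$$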
The weight increments are computed after each pattern presentation. The weights,
however, might be updated after each pattern presentation or after presentation of all
patterns (on-line or off-line training). The computation here is done for on-line training.
Computation of each weight increment involves one addition and one multiplication. Thus,
for an even (odd) layer l a total of $2E_l^{\alpha}O_{l-1}^{\beta}\,t_r$ ($2E_{l-1}^{\alpha}O_l^{\beta}\,t_r$) time units are required to
compute weight increments and an additional $E_l^{\alpha}O_{l-1}^{\beta}\,t_r$ ($E_{l-1}^{\alpha}O_l^{\beta}\,t_r$) time units are
required to update the weights. Therefore, for each pattern presentation, the weight update takes
$$\sum_{\substack{l=1 \\ \text{odd } l}}^{L-1} 3\,O_l^{\beta}E_{l-1}^{\alpha}\,t_r + \sum_{\substack{l=2 \\ \text{even } l}}^{L} 3\,O_{l-1}^{\beta}E_l^{\alpha}\,t_r \tag{2.18}$$
time units. The total time required by each learning pass is given by equation (2.19).
$$T(t) = T_{forward}(t) + T_{backward}(t) + T_{weight\text{-}update}(t) \tag{2.19}$$
After some simple algebraic simplifications, $T^{t}(\alpha,\beta)$ can be obtained and is as stated in the
theorem. □
We can use Theorem 2.4 to determine the folding range (odd or even) which would
minimize the execution time at each folding iteration. Assume that after t contractions (see
Definition 2.4), α even and β odd digits have been used for folding. Let $T^{t+1}(\alpha,\beta+1)$ denote
the total execution time on the KNC obtained after the (t+1)th folding when the folding digit
at iteration t+1 is chosen from the odd range. Similarly, let $T^{t+1}(\alpha+1,\beta)$ denote the total
execution time on the KNC obtained after the (t+1)th folding when the folding digit at
iteration t+1 is chosen from the even range. Clearly, folding an odd digit at iteration t+1
results in lower overall time if $T^{t+1}(\alpha,\beta+1) < T^{t+1}(\alpha+1,\beta)$, and there exists an odd folding
digit after t contractions. Based on the above, we have developed the folding algorithm
listed below:
Algorithm 2.3:
/* Folding Algorithm FA */
/* α counts folded even digits, β counts folded odd digits (as in Theorem 2.4) */
Begin:
    α = 0 ; β = 0 ;
    n_f = n_v - n_a ;
    FOR ( i = 1 to n_f ) DO
        IF ( T^{t+1}(α, β+1) < T^{t+1}(α+1, β) and n_o > 0 )
            /* Fold the most significant odd digit */
            n_o = n_o - 1 ;
            β = β + 1 ;
        ELSE
            /* Fold the most significant even digit */
            n_e = n_e - 1 ;
            α = α + 1 ;
        ENDIF
    ENDFOR
End
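The greedy structure of Algorithm FA can be sketched in Python as follows. This is a hedged illustration: the pass time is supplied by the caller as a function `cost(alpha, beta)` standing in for the time of Theorem 2.4, and the guard on exhausted even digits is an added assumption that the listing above does not state. The name `fold_greedy` is hypothetical.

```python
def fold_greedy(n_v, n_a, n_odd, n_even, cost):
    """Greedy choice of folding range, following the structure of Algorithm FA.

    cost(alpha, beta) is a caller-supplied stand-in for the learning-pass time
    after folding `alpha` even and `beta` odd digits.
    Returns the sequence of ranges chosen, one entry per folding iteration.
    """
    alpha = beta = 0
    choices = []
    for _ in range(n_v - n_a):
        odd_ok = n_odd > 0
        even_ok = n_even > 0
        if not (odd_ok or even_ok):
            raise ValueError("no digits left to fold")
        # Fold an odd digit when one exists and it gives the lower pass time
        # (or when no even digit remains, which is an added assumption).
        if odd_ok and (not even_ok or cost(alpha, beta + 1) < cost(alpha + 1, beta)):
            n_odd -= 1
            beta += 1
            choices.append("odd")
        else:
            n_even -= 1
            alpha += 1
            choices.append("even")
    return choices
```

With a toy cost such as `lambda a, b: 3 * a + b`, odd digits are folded first until none remain, after which the remaining iterations fold even digits.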
Next, we show that if FA is used for folding the VKNC, then the overall learning simulation 
will be optimal.
Theorem 2.5: The folding scheme described by Algorithm FA minimizes the total execution 
time of the learning phase on the resulting KNC.
Proof: We use mathematical induction in this proof. Let P(j) represent the following
statement: Algorithm FA results in an optimal simulation after j foldings, for j ≥ 1.
Induction Basis: P(1) is true since by definition for one folding FA folds the digit which
results in the lowest overall execution time.
Induction Step: We need to show that P(j) implies P(j+1).
Assume that there exists a folding algorithm OPT which results in optimal simulation
of the FFANN on the KNC. We assume that after j foldings, OPT has folded α even and
β odd digits, while FA has folded γ even and ε odd digits for the same FFANN and VKNC.
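Clearly, α + β = γ + ε = j. Let $T_{OPT}^{j}(\alpha,\beta)$ and $T_{FA}^{j}(\gamma,\varepsilon)$ denote the overall execution times
of a learning pass when the OPT and FA algorithms are implemented, respectively, where the
first argument counts folded even digits and the second counts folded odd digits.
Using the above notation and Theorem 2.4 we can obtain $T_{OPT}^{j}(\alpha,\beta)$ and $T_{FA}^{j}(\gamma,\varepsilon)$
as follows:
$$\begin{aligned}
T_{OPT}^{j}(\alpha,\beta) ={}& \sum_{\substack{l=1 \\ \text{odd } l}}^{L-1} O_l^{\beta}\left[\, m_1\left(E_{l+1}^{\alpha} + m_4 E_{l-1}^{\alpha}\right) + m_2\left(n_{l-1}^{\alpha} + n_{l+1}^{\alpha}\right) + m_3 \right] \\
&+ \sum_{\substack{l=0 \\ \text{even } l}}^{L} E_l^{\alpha}\left[\, m_1\left(m_4 O_{l-1}^{\beta} + O_{l+1}^{\beta}\right) + m_2\left(n_{l-1}^{\beta} + n_{l+1}^{\beta}\right) + m_3 \right]
\end{aligned} \tag{2.20}$$
$$\begin{aligned}
T_{FA}^{j}(\gamma,\varepsilon) ={}& \sum_{\substack{l=1 \\ \text{odd } l}}^{L-1} O_l^{\varepsilon}\left[\, m_1\left(E_{l+1}^{\gamma} + m_4 E_{l-1}^{\gamma}\right) + m_2\left(n_{l-1}^{\gamma} + n_{l+1}^{\gamma}\right) + m_3 \right] \\
&+ \sum_{\substack{l=0 \\ \text{even } l}}^{L} E_l^{\gamma}\left[\, m_1\left(m_4 O_{l-1}^{\varepsilon} + O_{l+1}^{\varepsilon}\right) + m_2\left(n_{l-1}^{\varepsilon} + n_{l+1}^{\varepsilon}\right) + m_3 \right]
\end{aligned} \tag{2.21}$$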
The induction hypothesis can be stated as follows: $T_{FA}^{j}(\gamma,\varepsilon) = T_{OPT}^{j}(\alpha,\beta)$. We need to prove
that $T_{FA}^{j+1} = T_{OPT}^{j+1}$. We can easily show that the equality holds if α = γ and β = ε by using the
induction hypothesis and the definition of FA. It remains to prove the statement when
α ≠ γ and β ≠ ε.
We need to examine two possible cases: i) α < γ, β > ε and ii) α > γ, β < ε. We
show that for both cases $T_{FA}^{j+1} = T_{OPT}^{j+1}$.
Case I) α < γ and β > ε:
Without loss of generality, assume that OPT folds an odd digit at iteration j+1.
We show that, for this case, $T_{FA}^{j+1} = T_{OPT}^{j+1}$ holds whether FA folds an odd or even digit.
Case I.1) FA folds an odd digit:
In this case we assume that both algorithms fold an odd digit. The execution times
after j+1 foldings based on the above assumption in iteration j+1 are given by (2.22) and (2.23).
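$$\begin{aligned}
T_{OPT}^{j+1}(\alpha,\beta+1) ={}& \sum_{\substack{l=1 \\ \text{odd } l}}^{L-1} O_l^{\beta+1}\left[\, m_1\left(E_{l+1}^{\alpha} + m_4 E_{l-1}^{\alpha}\right) + m_2\left(n_{l-1}^{\alpha} + n_{l+1}^{\alpha}\right) + m_3 \right] \\
&+ \sum_{\substack{l=0 \\ \text{even } l}}^{L} E_l^{\alpha}\left[\, m_1\left(m_4 O_{l-1}^{\beta+1} + O_{l+1}^{\beta+1}\right) + m_2\left(n_{l-1}^{\beta+1} + n_{l+1}^{\beta+1}\right) + m_3 \right]
\end{aligned} \tag{2.22}$$
and,
$$\begin{aligned}
T_{FA}^{j+1}(\gamma,\varepsilon+1) ={}& \sum_{\substack{l=1 \\ \text{odd } l}}^{L-1} O_l^{\varepsilon+1}\left[\, m_1\left(E_{l+1}^{\gamma} + m_4 E_{l-1}^{\gamma}\right) + m_2\left(n_{l-1}^{\gamma} + n_{l+1}^{\gamma}\right) + m_3 \right] \\
&+ \sum_{\substack{l=0 \\ \text{even } l}}^{L} E_l^{\gamma}\left[\, m_1\left(m_4 O_{l-1}^{\varepsilon+1} + O_{l+1}^{\varepsilon+1}\right) + m_2\left(n_{l-1}^{\varepsilon+1} + n_{l+1}^{\varepsilon+1}\right) + m_3 \right]
\end{aligned} \tag{2.23}$$
We need to show that $T_{OPT}^{j+1}(\alpha,\beta+1) - T_{OPT}^{j}(\alpha,\beta) = T_{FA}^{j+1}(\gamma,\varepsilon+1) - T_{FA}^{j}(\gamma,\varepsilon)$. Let
$$T_{OPT}^{j+1}(\alpha,\beta+1) - T_{OPT}^{j}(\alpha,\beta) = \sum_{l=0}^{L} \Delta_l^{\beta+1} \tag{2.24}$$
and
$$T_{FA}^{j+1}(\gamma,\varepsilon+1) - T_{FA}^{j}(\gamma,\varepsilon) = \sum_{l=0}^{L} \Delta_l^{\varepsilon+1} \tag{2.25}$$
where $\Delta_l^{\beta+1}$ ($\Delta_l^{\varepsilon+1}$) represents the additional execution time required by layer l once an odd
digit is folded by the OPT (FA) algorithm assuming that β (ε) odd digits have been folded
previously. We show that for every layer l, $\Delta_l^{\beta+1} = \Delta_l^{\varepsilon+1}$.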
The execution time required by layer l depends on how neurons in layers l-1, l,
and l+1 are assigned to physical processors. Therefore, we have to consider how these
layers are folded. There are at most five possible ways by which folding of an additional
digit might affect execution time of layer l depending on how this folding affects layers l-1,
l, and l+1. Note that each of these layers might or might not be folded at the most recent
iteration. These cases are listed in Table 2.1.

Table 2.1: Possible foldings for a given layer.

          layer l-1     layer l       layer l+1
Case 1    Not Folded    Not Folded    Not Folded
Case 2    Not Folded    Not Folded    Folded
Case 3    Not Folded    Folded        Not Folded
Case 4    Folded        Not Folded    Not Folded
Case 5    Folded        Not Folded    Folded

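We first show that for every odd layer l and for any of the above five cases $\Delta_l^{\beta+1} = \Delta_l^{\varepsilon+1}$.
Using (2.22) and (2.23) we can find $\Delta_l^{\beta+1} - \Delta_l^{\varepsilon+1}$ for an odd layer as follows:
$$\begin{aligned}
\Delta_l^{\beta+1} - \Delta_l^{\varepsilon+1} ={}& (O_l^{\beta+1} - O_l^{\beta})\left[\, m_1\left(E_{l+1}^{\alpha} + m_4 E_{l-1}^{\alpha}\right) + m_2\left(n_{l-1}^{\alpha} + n_{l+1}^{\alpha}\right) + m_3 \right] \\
&- (O_l^{\varepsilon+1} - O_l^{\varepsilon})\left[\, m_1\left(E_{l+1}^{\gamma} + m_4 E_{l-1}^{\gamma}\right) + m_2\left(n_{l-1}^{\gamma} + n_{l+1}^{\gamma}\right) + m_3 \right]
\end{aligned} \tag{2.26}$$
In cases 1, 2, 4, and 5 layer l is not folded. So, $O_l^{\beta+1} - O_l^{\beta} = O_l^{\varepsilon+1} - O_l^{\varepsilon} = 0$. This clearly
implies $\Delta_l^{\beta+1} = \Delta_l^{\varepsilon+1}$. In case 3, however, only layer l is folded. Since β > ε, we have
$$m_3\,(O_l^{\beta+1} - O_l^{\beta}) \ge m_3\,(O_l^{\varepsilon+1} - O_l^{\varepsilon}) \tag{2.27}$$
In addition, $n_{l+1}^{\gamma} \le n_{l+1}^{\alpha}$ and $n_{l-1}^{\alpha} \ge n_{l-1}^{\gamma}$ because α < γ. So,
$$m_1\left(E_{l+1}^{\alpha} + m_4 E_{l-1}^{\alpha}\right) \le m_1\left(E_{l+1}^{\gamma} + m_4 E_{l-1}^{\gamma}\right) \tag{2.28}$$
Let γ = α + θ and β = ε + θ, where θ > 0. Since both algorithms fold layer l and β > ε, we
have
$$\begin{aligned}
O_l^{\beta+1} - O_l^{\beta} &= k^{i+\theta}(k - 1) \\
O_l^{\varepsilon+1} - O_l^{\varepsilon} &= k^{i}(k - 1)
\end{aligned} \tag{2.29}$$
for some i ≥ 0. Since α < γ, we have $E_{l+1}^{\alpha} \ge k^z$, $E_{l-1}^{\alpha} \ge k^z$, $E_{l+1}^{\gamma} \le k^{z+\theta}$, and $E_{l-1}^{\gamma} \le k^{z+\theta}$,
where z ≥ 0. Therefore, we can derive equation (2.30).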
$$(O_l^{\beta+1} - O_l^{\beta})\left[\, m_1\left(E_{l+1}^{\alpha} + m_4 E_{l-1}^{\alpha}\right)\right] \ge (O_l^{\varepsilon+1} - O_l^{\varepsilon})\left[\, m_1\left(E_{l+1}^{\gamma} + m_4 E_{l-1}^{\gamma}\right)\right] \tag{2.30}$$
By combining (2.27), (2.28), and (2.30) we observe that for case 3, $\Delta_l^{\beta+1} - \Delta_l^{\varepsilon+1} \ge 0$.
Therefore, for any of the five possible cases and for an odd layer we have shown that
$\Delta_l^{\beta+1} - \Delta_l^{\varepsilon+1} \ge 0$.
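At this point, we must show that this also holds for every even layer. Using (2.22)
and (2.23) we can find $\Delta_l^{\beta+1} - \Delta_l^{\varepsilon+1}$ for an even layer as follows:
$$\begin{aligned}
\Delta_l^{\beta+1} - \Delta_l^{\varepsilon+1} ={}& E_l^{\alpha}\left[\, m_1\left((O_{l+1}^{\beta+1} - O_{l+1}^{\beta}) + m_4 (O_{l-1}^{\beta+1} - O_{l-1}^{\beta})\right) + m_2\left((n_{l+1}^{\beta+1} - n_{l+1}^{\beta}) + (n_{l-1}^{\beta+1} - n_{l-1}^{\beta})\right) \right] \\
&- E_l^{\gamma}\left[\, m_1\left((O_{l+1}^{\varepsilon+1} - O_{l+1}^{\varepsilon}) + m_4 (O_{l-1}^{\varepsilon+1} - O_{l-1}^{\varepsilon})\right) + m_2\left((n_{l+1}^{\varepsilon+1} - n_{l+1}^{\varepsilon}) + (n_{l-1}^{\varepsilon+1} - n_{l-1}^{\varepsilon})\right) \right]
\end{aligned} \tag{2.31}$$
None of the three layers is folded in case 1, so clearly $\Delta_l^{\beta+1} - \Delta_l^{\varepsilon+1} = 0$.
For case 2,
$$\begin{aligned}
\Delta_l^{\beta+1} - \Delta_l^{\varepsilon+1} ={}& E_l^{\alpha}\left[\, m_1 (O_{l+1}^{\beta+1} - O_{l+1}^{\beta}) + m_2 (n_{l+1}^{\beta+1} - n_{l+1}^{\beta}) \right] \\
&- E_l^{\gamma}\left[\, m_1 (O_{l+1}^{\varepsilon+1} - O_{l+1}^{\varepsilon}) + m_2 (n_{l+1}^{\varepsilon+1} - n_{l+1}^{\varepsilon}) \right]
\end{aligned} \tag{2.32}$$
Similar to the rationale used for (2.30) we can show that
$$m_1\left[\, E_l^{\alpha} (O_{l+1}^{\beta+1} - O_{l+1}^{\beta}) - E_l^{\gamma} (O_{l+1}^{\varepsilon+1} - O_{l+1}^{\varepsilon}) \right] \ge 0 \tag{2.33}$$
In addition, since $n_{l+1}^{\beta+1} - n_{l+1}^{\beta} = -1$ we obtain the following:
$$m_2\left[\, E_l^{\alpha} (n_{l+1}^{\beta+1} - n_{l+1}^{\beta}) - E_l^{\gamma} (n_{l+1}^{\varepsilon+1} - n_{l+1}^{\varepsilon}) \right] = m_2\,(E_l^{\gamma} - E_l^{\alpha}) \ge 0 \tag{2.34}$$
Using the above results, we have $\Delta_l^{\beta+1} - \Delta_l^{\varepsilon+1} \ge 0$.
In case 3, by definition layers l-1 and l+1 are not folded. Layer l is not folded
either since it is an even layer. So, the result is similar to that of case 1. The proof for case
4 can be obtained from that of case 2 by replacing l+1 by l-1. By combining results of cases
2 and 4 we observe that in case 5, $\Delta_l^{\beta+1} - \Delta_l^{\varepsilon+1} \ge 0$. Thus, $\Delta_l^{\beta+1} - \Delta_l^{\varepsilon+1} \ge 0$ holds for every
layer of the FFANN. Based on this result we can derive equation (2.35).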
$$T_{OPT}^{j+1}(\alpha,\beta+1) - T_{OPT}^{j}(\alpha,\beta) \ge T_{FA}^{j+1}(\gamma,\varepsilon+1) - T_{FA}^{j}(\gamma,\varepsilon) \tag{2.35}$$
However, by definition of the OPT algorithm, the above time measures can only be equal.
For this to be true α = γ and β = ε must be true. In other words, FA and OPT must result
in identical mappings.
Case I.2) FA folds an even digit:
We need to show that
$$T_{FA}^{j+1}(\gamma+1,\varepsilon) = T_{OPT}^{j+1}(\alpha,\beta+1) \tag{2.36}$$
If FA folds an even digit at iteration j+1, by definition
$$T_{FA}^{j+1}(\gamma+1,\varepsilon) \le T_{FA}^{j+1}(\gamma,\varepsilon+1) \tag{2.37}$$
However, we have already shown that
$$T_{OPT}^{j+1}(\alpha,\beta+1) = T_{FA}^{j+1}(\gamma,\varepsilon+1) \tag{2.38}$$
Equations (2.37) and (2.38) give $T_{FA}^{j+1}(\gamma+1,\varepsilon) \le T_{OPT}^{j+1}(\alpha,\beta+1)$. Since OPT is optimal, the
two times must be equal, so (2.36) holds.
Case II) α > γ and β < ε:
Using a similar approach we can easily show that this case also leads to the fact that
OPT and FA must result in identical mappings. □
2.3 Mapping Radial Basis Function Networks
In this section we introduce an efficient scheme for simulation of Radial Basis
Function networks (RBFs) on KNCs. We consider both fully supervised and partially
supervised algorithms for training RBFs. We show that as far as mapping is concerned,
fully supervised training of RBF networks is very similar to training a two-layer FFANN.
The partially supervised training takes place in two phases. In the first phase, certain
parameters related to the radial basis functions used in the network (namely the centers and
widths) are determined using an unsupervised scheme. We introduce an efficient parallel 
scheme for performing the unsupervised phase. The second phase is a supervised training 
algorithm similar to training a two-layer FFANN. Hence, we can utilize the scheme we 
already introduced to map the second phase of the partially supervised algorithm. Next, 
we briefly review the RBF network and then present our mapping scheme.
2.3.1 Radial Basis Function ANNs
Radial Basis Function is an ANN model designed based on the “locally tuned”
response observed in many parts of biological nervous systems such as cochlear stereocilia
cells [33]. RBFs have been used in a variety of applications including interpolation
[53][68][12], probability density estimation [60][22][75], and multivariate function
approximation [66][33]. These models are particularly suitable for approximating
continuous or piecewise continuous functions $f: R^n \to R^L$ where n is sufficiently small [33].
Next we describe the basic RBF network and its training algorithm. Our description is
based on material in [33].
RBFs have a feedforward structure as shown in Figure 2.6. The network consists of two
layers of neurons, a hidden layer and an output layer. The model employs a certain number (J)
of hidden units, each receiving an n-dimensional (real-valued) input pattern. We represent
the k-th input pattern by an n-dimensional vector $X^k = (x_1^k, x_2^k, \ldots, x_n^k)$ and its corresponding
output pattern by an L-dimensional vector $Y^k = (y_1^k, y_2^k, \ldots, y_L^k)$. We denote the number of
input/output training patterns by m. The hidden units are fully connected to the output
units. We denote the output of hidden unit j by $z_j$.
Figure 2.6: The RBF network (hidden layer and output layer).
Unlike models we have considered so far, RBFs do not have synaptic weights
associated with hidden layer units. These units do not compute their outputs based on a
weighted sum of their inputs. Instead, each hidden unit computes its output ($z_j$) based on the
“closeness” of the input pattern to an n-dimensional vector ($\mu_j$) which is called the center.
Formally, the output of hidden unit j can be represented as follows:
(2.39)
where G(.) is a strictly positive radially symmetric function with a unique maximum at its
center $\mu_j$ which goes to zero rapidly away from the center. This function is also referred
to as the receptive field in the input space for hidden unit j. The parameter $\sigma_j$ represents the
width of this receptive field. In other words, G(.) has a significant value if the distance
$\|X - \mu_j\|$ is less than the width.
Examples of G(.) are the Gaussian and logistic functions:
$$\text{Gaussian: } z_j(X) = e^{-\frac{\|X - \mu_j\|^2}{2\sigma_j^2}} \qquad \text{Logistic: } z_j(X) = \frac{1}{1 + e^{\frac{\|X - \mu_j\|^2}{\sigma_j^2} - \theta_j}}$$
where $\theta_j$ is an adjustable bias in the logistic function. The output of each output unit is
computed in terms of the weighted sum of hidden-unit outputs. For instance, the l-th
component of the output pattern for input pattern X is given by:
$$y_l = \sum_{j=1}^{J} w_{lj}\,z_j \tag{2.41}$$
Training of an RBF basically corresponds to determining free parameters of the
network for a given set of m training pairs. These parameters are the centers and the widths
of the hidden-unit receptive fields ($\mu_j$ and $\sigma_j$) as well as output-layer weights ($w_{lj}$'s). Several
training strategies are described in [33] including a fully supervised gradient-descent method
[54][66] and several partially supervised schemes [55]. Next we describe each training
algorithm in detail and explore its parallel implementation on k-ary n-cubes.
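The forward computation of the network described above can be sketched in Python. The sketch assumes Gaussian hidden units and output units that take a plain weighted sum of the hidden outputs; the function names and the list-based data layout are illustrative assumptions rather than anything specified in the text.

```python
import math

def gaussian_unit(x, center, width):
    """Gaussian receptive field: exp(-||x - center||^2 / (2 * width^2))."""
    dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-dist_sq / (2.0 * width ** 2))

def rbf_forward(x, centers, widths, weights):
    """Return (z, y): hidden outputs z_j and outputs y_l = sum_j w[l][j] * z_j."""
    z = [gaussian_unit(x, c, s) for c, s in zip(centers, widths)]
    y = [sum(w_lj * z_j for w_lj, z_j in zip(row, z)) for row in weights]
    return z, y
```

For example, an input that coincides with a center gives that hidden unit an output of exactly 1, and the response falls off as the input moves more than about one width away.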
2.3.2 Fully Supervised Training of RBF Networks
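In the fully supervised training, the free parameters of the RBF are updated to
minimize an error function E. Several studies [54][66] have considered the gradient-descent
method over E to update the free parameters. Formally, the training takes place using the
following updates
$$\Delta\mu_j = -\rho_\mu \nabla_{\mu_j} E, \qquad \Delta\sigma_j = -\rho_\sigma \frac{\partial E}{\partial \sigma_j}, \qquad \Delta w_{lj} = -\rho_w \frac{\partial E}{\partial w_{lj}} \tag{2.42}$$
where $\rho_\mu$, $\rho_\sigma$, and $\rho_w$ are small positive constants.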
If the delta rule is used for updating the output-layer weights, weight updates can
be computed as follows:
$$\Delta w_{lj} = \rho_w\,(d_l - y_l)\,f'(net_l)\,z_j \tag{2.43}$$
where $net_l = \sum_{j=1}^{J} w_{lj} z_j$, and $d_l$ is the l-th component of the actual output. In the fully
supervised training, the hidden layer updates are computed in terms of output-layer errors.
Clearly, this scheme updates receptive field centers and widths by back propagating the
output error through the network.
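The delta-rule update for the output-layer weights can be written as a small Python routine. This is a sketch only: the caller is assumed to supply the hidden outputs `z`, the network outputs `y`, the target components `d`, and the derivative values `f_prime_net`; the function name `delta_rule_update` is hypothetical.

```python
def delta_rule_update(weights, z, y, d, f_prime_net, rho_w):
    """Apply w[l][j] += rho_w * (d[l] - y[l]) * f'(net_l) * z[j] in place."""
    for l, row in enumerate(weights):
        # The error term is shared by every weight feeding output unit l.
        step = rho_w * (d[l] - y[l]) * f_prime_net[l]
        for j in range(len(row)):
            row[j] += step * z[j]
    return weights
```

Each output unit needs only its own error term and the shared hidden outputs, which is why this phase maps onto the machine in the same way as the output layer of a two-layer feedforward network.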
Clearly, the computations involved in fully supervised training of RBF networks are
very similar to those of a feedforward neural network trained with the backpropagation
algorithm. In fact, an RBF network with a logistic basis function is very similar to a two-layer
feedforward network [33]. The only difference is in the computation of hidden-layer units.
As far as mapping is concerned, such differences are relatively insignificant. Hence, the
mapping of a fully supervised RBF network can be performed in exactly the same way a
two-layer feedforward neural network would be mapped using the scheme described in
Sections 2.1 and 2.2.
2.3.3 Partially Supervised Training of RBF
The fully supervised scheme leads to training time similar to that of sigmoidal-type 
networks [85]. The slow convergence is mainly due to inefficient use of locally tuned hidden 
units [33]. Hassoun [33] describes several training schemes which “decouple” learning at 
the hidden layer from the output layer errors for RBFs. Essentially, the computations of 
hidden-layer receptive field parameters are performed independent of output-layer errors. 
Several schemes for computing receptive field centers and widths are described in [33].
Moody and Darken [55] proposed an unsupervised training scheme for locating
centers of receptive fields which requires relatively few RBFs [33]. This scheme leads to a
very efficient representation of data because of the small number of receptive fields used.
The algorithm proceeds as follows. Originally, a number of training patterns are selected at
random as centers of receptive fields. Each receptive field represents a region or class of
the input space. The remaining training patterns are then assigned to the class with the
closest center. In other words, pattern $X_i$ is assigned to class j if center $\mu_j$ is closer to $X_i$ than
any other center. Next, the center of each class is recomputed as the average of all patterns
assigned to the class. This process is repeated until all centers remain unchanged. We
formally represent the above algorithm as follows:
Algorithm 2.4:
/* Batch-mode center (μ) search */
Begin:
    For ( class j = 1 to J ) Do
        Select a random input pattern as the center μ_j ;
    While ( centers μ_j's change ) Do
        1. For ( i = 1 to m - k ) Do
            1.1 Initialize minimum_i to a large value ;
            1.2 For ( j = 1 to J ) Do
                1.2.1 Compute ||μ_j - X_i|| ;
                1.2.2 If ( ||μ_j - X_i|| < minimum_i )
                          minimum_i = ||μ_j - X_i|| ;
                          Class( X_i ) = j
                      EndIf
                EndFor
           EndFor
        2. For ( j = 1 to J ) Do
            2.1 μ_j = ( sum of all X_i in Class j ) / |Class j|
           EndFor
    EndWhile
End
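A runnable Python sketch of the batch-mode center search is given below. It is a simplified reading of the listing above with several stated assumptions: every pattern (not only the remaining ones) is reassigned on each pass, squared distances are compared instead of distances, an empty class keeps its previous center, and an iteration cap `max_iters` is added. The function names are hypothetical.

```python
import random

def distance_sq(a, b):
    """Squared Euclidean distance; it orders centers the same way as the norm."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def find_centers(patterns, num_centers, seed=0, max_iters=100):
    """Batch-mode center search: assign to nearest center, then re-average."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(patterns, num_centers)]
    for _ in range(max_iters):
        # Step 1: assign every pattern to the class with the closest center.
        classes = [[] for _ in centers]
        for x in patterns:
            j = min(range(len(centers)), key=lambda c: distance_sq(centers[c], x))
            classes[j].append(x)
        # Step 2: recompute each center as the mean of its class.
        new_centers = []
        for center, members in zip(centers, classes):
            if members:
                dim = len(center)
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)])
            else:
                new_centers.append(center)  # keep the center of an empty class
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

On two well-separated groups of points, the centers settle on the two group means within a few passes regardless of which patterns are drawn as the initial centers.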
Next, we explore parallel implementation of the algorithm. The training algorithm
proceeds in a well defined and uniform manner. It consists of two major steps. During the
first step, each pattern is involved in identical computations. Basically, the pattern is
assigned to the class with the closest center. Clearly, patterns can be processed
independently and concurrently. In the second step, new centers are calculated for each
class as the average of all patterns assigned to the class in the first step. These computations
can take place for each class independently. However, the number of patterns assigned to
each class might be different. Notice that in practice centers are gradually adjusted to
represent densely populated regions in the input space. Clearly, the number of patterns in
each region may not be identical.
There are various ways to exploit parallelism in the above algorithm. We propose
a parallel implementation of the algorithm based on partitioning the training set. This way,
we can fully exploit the parallelism involved in the first step of the algorithm while executing
the second step with reasonable efficiency as we will explain. The key issue here is how to
partition the training set among physical processors of the KNC architecture. We show that
a uniform distribution of patterns leads to efficient simulation of the training algorithm. We
make the following assumptions in subsequent derivations:
•  $\|\mu_j - X_i\|$ can be computed on a single processor in $c_1 n t_r$ time units, where $c_1$ is a
positive constant, n is the dimensionality of the input pattern, and $t_r$ is the
computation time for a simple arithmetic operation.
•  $Q_p$ patterns are assigned to processor p of the s-dimensional KNC, such that at least
one pattern is assigned to each processor; i.e., $1 \le Q_p \le m - k^s + 1$.
We introduce a parallel version of Algorithm 2.4 which efficiently computes the 
centers of receptive fields. Our approach is based on partitioning the training set and 
adheres to the above assumptions. The proposed algorithm is listed below.
Algorithm 2.5:
Begin:
    While ( centers are changing ) Do
        1. For ( all processors p of the s-dimensional KNC ) Parallel Do
            1.1 For ( all patterns X_i assigned to processor p ) Do
                1.1.1 Initialize minimum to a very large value ;
                1.1.2 For ( each hidden unit j ) Do
                    1.1.2.1 Compute ||μ_j - X_i|| ;
                    1.1.2.2 If ( ||μ_j - X_i|| < minimum )
                                minimum = ||μ_j - X_i|| ;
                                Class( X_i ) = j ;
                            EndIf
                      EndFor
                EndFor
            1.2 For ( all hidden units j = 1 to J ) Do
                    For ( each pattern X_i in Class j ) Do
                        sum( j ) = sum( j ) + X_i ;
                    EndFor
                    μ_j^p = sum( j ) / size-of-class j ;
                EndFor
           EndFor
        2. For ( all hidden units j = 1 to J ) Do
                For ( all processors p ) Do
                    sum( μ_j ) = sum( μ_j ) + μ_j^p ;
                EndFor
                μ_j = sum( μ_j ) / k^s ;
           EndFor
    EndWhile
End;
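One iteration of the partitioned scheme can be simulated sequentially in Python. The sketch below is an illustration under stated assumptions: the "processors" are plain list partitions visited in a loop, and the combining step adds per-class partial sums and counts from every partition rather than averaging the per-processor means, so that the result equals the sequential class mean even when class sizes differ across partitions. That choice departs from a literal reading of step 2, and all function names are hypothetical.

```python
def partition_uniform(patterns, num_procs):
    """Deal patterns round-robin so partition sizes differ by at most one."""
    return [patterns[p::num_procs] for p in range(num_procs)]

def local_step(partition, centers):
    """Step 1 on one processor: per-class partial sums and counts."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for x in partition:
        j = min(range(len(centers)),
                key=lambda c: sum((ci - xi) ** 2 for ci, xi in zip(centers[c], x)))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += x[d]
    return sums, counts

def global_step(partials, centers):
    """Step 2: combine partial results (stands in for the fan-in over the cube)."""
    new_centers = []
    for j, old in enumerate(centers):
        total = sum(counts[j] for _, counts in partials)
        if total == 0:
            new_centers.append(list(old))  # an empty class keeps its center
            continue
        new_centers.append([sum(sums[j][d] for sums, _ in partials) / total
                            for d in range(len(old))])
    return new_centers

def iterate(patterns, centers, num_procs):
    """One full iteration: local assignment on each partition, then combine."""
    parts = partition_uniform(patterns, num_procs)
    return global_step([local_step(part, centers) for part in parts], centers)
```

Because only J partial sums and counts leave each processor, the communication volume in the combining step does not depend on how the m patterns are distributed.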
Next, we determine the iteration time of Algorithm 2.5. Iteration time is the execution time 
of one iteration of the training algorithm.
Theorem 2.6: One iteration of Algorithm 2.5 on an s-dimensional KNC takes
$$T(s) = \max(Q_p)\,(c_1 n t_r + 6t_r) + Js\,(\lceil k/2\rceil\,t_c + 2t_r) \tag{2.44}$$
time units.
Proof: We assume that Step 1.1.1 takes negligible time. We have assumed that computing
$\|\mu_j - X_i\|$ takes $c_1 n t_r$ time units. Steps 1.1.2.1 and 1.1.2.2 consist of 4 simple arithmetic
operations. Hence, they require $4t_r$ time units. Therefore Step 1.1 on processor p takes
$Q_p(c_1 n + 4)t_r$ time units. The worst case for Step 1.2 happens when all patterns in at least
one processor belong to the same class. In this case, Step 1.2 takes $2\max(Q_p)t_r$. In Step 2,
the overall average for each of the J centers is calculated. According to the results of
Theorem 2.1 this step takes $Js(\lceil k/2\rceil\,t_c + 2t_r)$. Hence, one iteration of the algorithm takes
at most
$$\max(Q_p)\,(c_1 n t_r + 6t_r) + Js\,(\lceil k/2\rceil\,t_c + 2t_r) \tag{2.45}$$
time units. □
Theorem 2.7: For a given s, a uniform distribution of m patterns among $k^s$ processors
minimizes the iteration time of Algorithm 2.5, T(s).
Proof: Clearly, a uniform distribution minimizes the first term of T(s). The second term
on the other hand is independent of pattern distribution. Hence, a uniform distribution leads
to the minimum value of T(s) for a given s. □
For a uniform distribution, T(s) (2.44) becomes:
$$T(s) = \lceil m k^{-s}\rceil\,(c_1 n t_r + 6t_r) + Js\,(\lceil k/2\rceil\,t_c + 2t_r) \tag{2.46}$$
Theorem 2.8: The minimum of function $T(s)$ with respect to $s$ occurs at

$$s_{cpt} = \log_k \left[ \frac{m (c_x n + 6)\, t_r \ln k}{J (\lceil k/2 \rceil t_c + 2 t_r)} \right] \qquad (2.47)$$

Proof: $T(s)$ is a convex function of $s$ since its second derivative with respect to $s$ is always positive. For instance, when $m$ is divisible by $k^s$ the second derivative of the function is given by:

$$\frac{\partial^2 T(s)}{\partial s^2} = m \ln^2 k \; k^{-s} t_r (n c_x + 6) \qquad (2.48)$$

which is always positive. A similar result can be obtained if $m$ is not divisible by $k^s$. The critical point of the function is obtained by setting the first derivative equal to zero. This leads to (2.47). Notice that since the function is convex, its extreme point is a minimum. □
Now, assume that the actual parallel architecture is an $n_a$-dimensional KNC. Using Theorem 2.8, we can determine the dimensionality of the subcube of the KNC which executes Algorithm 2.5 optimally (in minimum time). Notice that we are looking for an integer value $s_{opt}$, where $0 \le s_{opt} \le n_a$. (2.47) gives the real-valued minimum of the function. However, from basic calculus [9] the minimum of the convex function $T(s)$ (with respect to $s$) in the closed interval $[0, n_a]$ is either the critical point of the function or one of the end points of the interval. Hence, the dimensionality of the subcube of the $n_a$-dimensional KNC which executes Algorithm 2.5 in minimum time is given by (2.49).

$$s_{opt} = \begin{cases} \lceil s_{cpt} \rceil & \text{if } 0 \le s_{cpt} \le n_a \text{ and } T(\lceil s_{cpt} \rceil) \le T(\lfloor s_{cpt} \rfloor) \\ \lfloor s_{cpt} \rfloor & \text{if } 0 \le s_{cpt} \le n_a \text{ and } T(\lceil s_{cpt} \rceil) \ge T(\lfloor s_{cpt} \rfloor) \\ 0 & \text{if } s_{cpt} < 0 \\ n_a & \text{if } s_{cpt} > n_a \end{cases} \qquad (2.49)$$
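The selection rule can be illustrated with a minimal sketch, assuming a uniform pattern distribution and the iteration time of (2.46). The function names and the sample parameter values used with them are hypothetical and serve only to exercise the formulas; they are not taken from this work.

```python
import math

def iteration_time(s, m, n, J, k, c_x, t_r, t_c):
    """T(s) of (2.46): uniform distribution of m patterns over k**s processors."""
    per_processor = -(-m // k**s)  # integer ceiling of m / k**s
    return per_processor * (c_x * n + 6) * t_r + J * s * (math.ceil(k / 2) * t_c + 2 * t_r)

def critical_point(m, n, J, k, c_x, t_r, t_c):
    """Real-valued minimizer of T(s), obtained by setting dT/ds to zero."""
    ratio = m * (c_x * n + 6) * t_r * math.log(k) / (J * (math.ceil(k / 2) * t_c + 2 * t_r))
    return math.log(ratio, k)

def best_subcube_dimension(n_a, m, n, J, k, c_x, t_r, t_c):
    """Integer subcube dimension in [0, n_a]: clamp, then compare floor and ceiling."""
    v = critical_point(m, n, J, k, c_x, t_r, t_c)
    if v < 0:
        return 0
    if v > n_a:
        return n_a
    lo, hi = math.floor(v), math.ceil(v)
    t_lo = iteration_time(lo, m, n, J, k, c_x, t_r, t_c)
    t_hi = iteration_time(hi, m, n, J, k, c_x, t_r, t_c)
    return hi if t_hi <= t_lo else lo
```

When $m$ is divisible by $k^s$ over the whole range, the result can be checked against an exhaustive search over $s = 0, \ldots, n_a$, which is a convenient sanity test for the closed form.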
We can use (2.49) to implement Algorithm 2.5 on a KNC architecture optimally. 
Once centers of the receptive fields are computed using Algorithm 2.5, we can proceed with 
the computation of the output-layer weights. These weights can be computed using the 
delta rule [33] which is similar to finding weights for a two-layer feedforward network. 
Hence, the mapping of the second phase of the partially supervised RBF network can be 
performed using the scheme described in Sections 2.1 and 2.2.
CHAPTER 3 
MAPPING UNIT ALLOCATING NETWORKS
In this chapter we consider mapping of several unit allocating networks. A unit 
allocating neural network is one whose topology is modified during the training. This 
class of neural networks includes several important ANN models such as the cascade 
correlation learning algorithm [25] and adaptive resonance theory models [31]. The 
common feature of these models is the dynamic nature of their architecture which grows 
during the training. Hardware implementation of these models is impractical due to their 
dynamic nature. We develop a general mapping scheme for highly efficient parallel 
simulation of such networks.
We first consider the mapping of the Cascade Correlation learning algorithm. 
Cascade correlation [25] is an efficient supervised learning technique for neural networks. 
The learning algorithm incrementally adds and trains hidden units to a minimal topology 
until a desired error bound is reached. The significant attributes of such a “unit-allocating” 
network are fast learning (with polynomial time complexity) and compact representation of 
data [33]. The resulting architecture is a multi-layer network with cascaded single-unit 
hidden layers. VLSI implementation of this structure is difficult due to its irregular 
connections and unbounded fan-in [25].
In Sections 3.1 and 3.2 we present a formal methodology for efficient parallel 
implementation of the Cascade Correlation algorithm on k-ary n-cubes (KNCs). We 
develop a computational model which captures the inherent parallelism of output-unit and 
hidden-unit training phases of the algorithm. Moreover, our model allows pipelining of
several training patterns in order to further improve the efficiency of the implementation. 
The model we develop can easily be adapted to various parallel topologies. The mapping 
is done in two phases. The computational model is first mapped onto a virtual KNC of 
compatible size denoted by VKNC. Then, the VKNC is folded until a certain metric is 
optimized for a network with a certain number of hidden units, and the resulting size is less 
than or equal to the size of the actual KNC. In the Cascade Correlation algorithm the 
number of hidden units is not known in advance. To efficiently map the training of such a 
dynamic network, we consider an upper bound on the number of hidden units denoted by 
Hmax. We consider two optimization criteria defined based on 1) the execution time of 
the algorithm for a network with Hmax hidden units, and 2) the sum of execution times of 
the algorithm for all instances of the network with 0 through Hmax hidden units.
We propose efficient analytical schemes for mapping based on each criterion. We 
use the parameters for the benchmark application NETTalk [64] to evaluate the 
performance of our mappings. Experimental results show that our approach leads to near- 
optimal results for networks with $H$ hidden units, where $H \le Hmax$. In addition, we show 
that the proposed scheme leads to very efficient simulation of the training algorithm even 
if the number of hidden units exceeds Hmax. We also examine the effect of the choice of Hmax 
on the mapping. The minimization of each metric (assuming Hmax hidden units) has 
computational complexity $O(\log_k(L + Hmax))$, for a network with $L$ output units. Based 
on the proposed mapping, task assignments for networks with 0 through Hmax hidden units 
are known a priori. Hence, no data transfer or task rescheduling is needed as the number 
of hidden units grows.
In Section 3.3 we consider the mapping of a popular clustering network called 
Adaptive Resonance Theory (ART) [31]. We show that the mapping of this algorithm is 
very similar to that of the Cascade Correlation training algorithm. We provide simulation 
results of efficient mapping of a network to implement the benchmark example for this case, 
as well.
3.1 Preliminaries
One of the most desirable attributes of a neural network learning algorithm is its 
efficiency. However, most learning algorithms have exponential time complexity [33]. This 
is particularly true about training multilayer neural networks with fixed topologies [33]. 
On the other hand, unit allocating networks [33] which allocate new units as needed have 
polynomial time complexity. Fahlman and Lebiere [25] introduced an efficient and practical 
unit-allocating learning technique called Cascade Correlation.
Cascade Correlation (CC) learning is a fast and efficient algorithm for supervised 
training of artificial neural networks [25],[33]. For brevity, we shall refer to the cascade 
correlation algorithm as the CC algorithm and to the resulting architecture as the CC 
architecture. The algorithm constructs a layered network by allocating hidden units one 
at a time until a desired error bound is reached. Each new hidden unit is fully connected to 
the input units and to any preexisting hidden units. Moreover, hidden units are fully 
connected to output units. Unlike conventional training algorithms, such as 
backpropagation, which train networks with wide and fixed hidden layers, the CC 
algorithm builds a deep network of cascaded units (see Figure 3.1). Each new hidden unit 
is trained to optimize a performance measure. Once trained, its weights are frozen for the
remainder of the training phase. This is an attempt to solve the moving-target problem [25] 
attributed to constantly changing weights of existing training algorithms. The scheme also 
eliminates the back-propagation of error signals through the network. Fahlman and Lebiere 
[25] have shown that the CC algorithm is much faster than the back-propagation algorithm.
The cascaded structure built by the CC training algorithm is not a good candidate 
for VLSI implementations due to its irregularity and unbounded fan-in [64]. In addition, the 
depth of the resulting structure could lead to long propagation delay since the delay is 
directly proportional to the number of hidden units allocated [64]. Notice that each 
allocated hidden unit serves as one network layer. Phatak and Koren [64] introduced a 
modified version of the CC algorithm which is intended to generate networks with small 
depth and restricted fan-in by controlling the connectivity. Their scheme generates a 
“strictly layered” network in which there are no connections that skip a layer. Their results 
reveal that imposing such restrictions leads to longer training time but results in a structure 
better suited for VLSI implementations. Although the CC algorithm cannot be easily 
implemented in a VLSI context, it can be efficiently implemented on parallel architectures 
as we shall demonstrate.
In this chapter, we investigate the problem of efficiently implementing the CC 
learning algorithm on existing parallel architectures. To our knowledge this has not been 
previously attempted which could be due to the dynamic nature of the architecture produced 
by the CC learning algorithm. We develop necessary computational models to capture the 
inherent parallelism of the algorithm. The models can be adapted to different parallel 
architectures. Here, we propose a mapping methodology for efficient simulation of the
algorithm on k-ary n-cubes. Our scheme does not impose any restrictions on the original 
training algorithm or network connectivity. The proposed scheme achieves efficiency by 
utilizing the inherent parallelism of the training procedure and by pipelining training 
patterns.
The mapping involves implementation of a computational model developed for the 
pipelined version of the Cascade Correlation algorithm (called PCC henceforth) onto an 
na-dimensional KNC. This consists of two steps. First, the PCC model is mapped onto a 
virtual KNC (called VKNC) of compatible size. Since the number of hidden units is not 
known a priori, we consider an upper bound for this value (called Hmax) and use a PCC 
model developed based on a network with Hmax hidden units. The size of the VKNC is 
then reduced until it matches that of the actual KNC. This process is referred to as folding. 
Further foldings may be necessary if mapping onto a subcube of the KNC architecture leads 
to a more efficient simulation of the algorithm. The folding procedure is performed such that 
a desired metric for a network with Hmax hidden units is minimized. Since the actual size 
of the neural network is determined during the training, we consider two optimization 
criteria based on the size of the largest network (one with Hmax hidden units). One metric 
is the iteration time of the largest network and the other is the sum of iteration times of all 
possible network sizes less than or equal to Hmax. Iteration time refers to the execution 
time of one iteration of the algorithm. We show that a minimization approach based on 
either of these two metrics leads to efficient mapping of other instances of the network; that 
is to say, the mapping is optimal for a network with Hmax hidden units and near optimal 
for networks with fewer than Hmax hidden units. We also show that optimizing the sum of iteration times of
all instances of the network (assuming a maximum of Hmax hidden units) leads to a more 
efficient mapping of other network instances. Further, we show that our mapping approach 
leads to a very efficient simulation of the training algorithm even if the number of hidden 
units exceeds Hmax. In addition, we show that the choice of Hmax is not critical (within 
certain limits) if the sum of iteration times is used as the criterion. Based on our approach, no 
task reassignment or data migration is needed on the k-ary n-cube as the number of hidden 
units grows during the training. We show that minimizing each metric for a network with 
Hmax hidden units has time complexity $O(\log_k(L + Hmax))$, where $L$ is the size of the 
output layer. This search can be performed offline without adding any computational 
overhead to the training.
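The two optimization criteria can be sketched as follows. This is a minimal illustration, assuming that an iteration-time function of the number of hidden units and of the chosen subcube dimension is available. The names `metric_largest`, `metric_sum`, `best_dimension`, and in particular the cost model `toy_iteration_time` are hypothetical; the toy cost model is not the iteration time derived in this chapter, and the exhaustive scan over dimensions stands in for the folding search.

```python
import math

def metric_largest(iteration_time, s, h_max):
    """Criterion 1: iteration time of the largest network (Hmax hidden units)."""
    return iteration_time(h_max, s)

def metric_sum(iteration_time, s, h_max):
    """Criterion 2: sum of iteration times over networks with 0..Hmax hidden units."""
    return sum(iteration_time(h, s) for h in range(h_max + 1))

def best_dimension(iteration_time, n_a, h_max, metric):
    """Scan subcube dimensions 0..n_a and keep the one minimizing the metric."""
    return min(range(n_a + 1), key=lambda s: metric(iteration_time, s, h_max))

def toy_iteration_time(h, s, k=2):
    """Purely illustrative cost: work shrinks with k**s, communication grows with s."""
    return math.ceil((h + 4) * 10 / k**s) + 3 * s
```

Since the search depends only on Hmax and the machine parameters, it can be carried out once before training begins.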
3.1.1 The Cascade Correlation Learning Algorithm
We now introduce the notation used in this section. Figure 3.1 shows a CC network 
after $H$ hidden units have been allocated. The external input is an $N$-dimensional vector 
$X = \{x_1, x_2, \ldots, x_N\}^t$ and the output is an $L$-dimensional vector $Y = \{y_1, y_2, \ldots, y_L\}$. The output of 
hidden unit $i$ is denoted by $z_i$. The weight between input unit $j$ and output unit $i$ is 
denoted by $w_{ij}$ for $1 \le i \le L$ and $1 \le j \le N$. $p_{ri}$ ($q_{jr}$) denotes the weight between hidden 
unit $r$ and input unit $i$ (output unit $j$), where $1 \le i \le N$ ($1 \le j \le L$). The weight of the link 
joining hidden units $i$ and $j$ is denoted by $v_{ij}$. We represent the number of patterns in the 
training set by $m$.
CC learning consists of two phases: output-unit training and hidden-unit training. 
The training begins with no hidden units. Input and output units are fully connected. 
Initially during the first phase, input-unit to output-unit weights ($w_{ij}$'s) are adjusted to 
minimize an error measure (typically the sum of squared error). If the error remains above 
a predefined threshold after a certain number of training cycles, residual errors ($E_l^k$'s) are 
recorded, where $E_l^k$ is the difference between the actual ($y_l^k$) and desired ($d_l^k$) outputs for 
output unit $l$ when pattern $k$ is presented.

Figure 3.1: The cascade correlation architecture with $H$ hidden units.
During hidden-unit training a new unit is added to the network. This unit is fully 
connected to all input units and any preexisting hidden unit(s). Its outputs are not 
connected in this phase. Its incoming weights ($p_{ri}$'s and $v_{rj}$'s for hidden unit $r$) are then 
determined to maximize the covariance between its outputs and the residual errors 
computed during the latest output-unit training phase. For the $r$-th hidden unit this measure 
is formally represented as follows:

$$S_r = \sum_{l=1}^{L} \left| \sum_{k=1}^{m} (z_r^k - \bar{z}_r)(E_l^k - \bar{E}_l) \right| \qquad (3.1)$$

where $\bar{z}_r$ and $\bar{E}_l$ are average values taken over all patterns [25]. Generally, several 
randomly initialized candidate hidden units are trained in parallel during this phase. Then,
the candidate unit which best maximizes the covariance measure $S_r$ is chosen. The objective 
is to tune the new unit to the features not yet captured by the existing network [33]. The 
incoming weights of the new unit remain fixed during the subsequent training phases. At this point, the 
output of the new unit is connected to the output units, and its outgoing weights are 
determined during the next output-unit training phase.
The two training phases are repeated, and new hidden units are allocated until the 
desired error bound is achieved. The delta learning rule is used for output-unit training. 
For hidden units on the other hand, a gradient-ascent optimization is performed to maximize 
the covariance measure $S$ [25]. Incoming weights are adjusted to improve $\partial S / \partial p_{ri}$ (or 
$\partial S / \partial v_{rj}$). We denote the output layer (hidden layer) activation function by $F(\cdot)$ ($G(\cdot)$), the 
learning rate by $\rho$, and the sign of correlation between the candidate unit's output and that 
of output unit $l$ by $\sigma_l$. The training phases of the algorithm are listed below.
Algorithm 3.1
/* Output-Unit Training */
Begin:
For (certain number of training cycles) Do
    For (training patterns k = 1 to m) Do
        For (hidden units r = 1 to H) Do
            For (input units i = 1 to N) Parallel Do
                Compute $p_{ri} x_i^k$
            EndFor
            For (hidden units j = 1 to r-1) Parallel Do
                Compute $v_{rj} z_j^k$
            EndFor
            Compute $net_r^k = \sum_{i=1}^{N} p_{ri} x_i^k + \sum_{i=1}^{r-1} v_{ri} z_i^k$
            Compute $z_r^k = G(net_r^k)$
        EndFor
        For (output units j = 1 to L) Parallel Do
            For (input units i = 1 to N) Parallel Do
                Compute $w_{ji} x_i^k$
            EndFor
            For (hidden units r = 1 to H) Parallel Do
                Compute $q_{jr} z_r^k$
            EndFor
            Compute $net_j^k = \sum_{i=1}^{N} w_{ji} x_i^k + \sum_{i=1}^{H} q_{ji} z_i^k$
            Compute $y_j^k = F(net_j^k)$
            Compute $\Delta w_{ji} = \rho\, [d_j^k - F(net_j^k)]\, \frac{\partial F(net_j^k)}{\partial x_i^k}\, x_i^k$
            Compute $\Delta q_{ji} = \rho\, [d_j^k - F(net_j^k)]\, \frac{\partial F(net_j^k)}{\partial z_i^k}\, z_i^k$
        EndFor
    EndFor
EndFor
If (error above the desired threshold) Do
    For (l = 1 to L) Do
        For (k = 1 to m) Do
            $E_l^k = y_l^k - d_l^k$
        EndFor
    EndFor
EndIf
End
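The weight updates at the end of Algorithm 3.1 can be illustrated with a minimal sketch. It assumes the standard form of the delta rule, in which each weight change is the learning rate times the output error times the slope of the activation function times the signal carried by the link. The function name `delta_rule_step` and its argument layout are hypothetical.

```python
def delta_rule_step(x, z, d, W, Q, rate, F, dF):
    """Present one pattern and return updated copies of W and Q.

    x: external inputs, z: hidden-unit outputs for this pattern,
    d: desired outputs, W[l][i]: input-to-output weights,
    Q[l][r]: hidden-to-output weights, F / dF: activation and its slope.
    """
    new_W, new_Q = [], []
    for l in range(len(W)):
        # Weighted sum of external inputs and hidden-unit outputs.
        net = sum(w * xi for w, xi in zip(W[l], x))
        net += sum(q * zr for q, zr in zip(Q[l], z))
        # Common factor of every weight change of output unit l.
        scale = rate * (d[l] - F(net)) * dF(net)
        new_W.append([w + scale * xi for w, xi in zip(W[l], x)])
        new_Q.append([q + scale * zr for q, zr in zip(Q[l], z)])
    return new_W, new_Q
```

Because the updates of different output units do not depend on one another, the loop over `l` is the part that Algorithm 3.1 marks as a parallel loop.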
Algorithm 3.2
/* Hidden-Unit Training */
Begin
For (several candidate hidden units) Parallel Do
    For (training patterns k = 1 to m) Do
        For (hidden units r = 1 to H) Do
            For (input units i = 1 to N) Parallel Do
                Compute $p_{ri} x_i^k$
            EndFor
            For (hidden units j = 1 to r-1) Parallel Do
                Compute $v_{rj} z_j^k$
            EndFor
            Compute $net_r^k = \sum_{i=1}^{N} p_{ri} x_i^k + \sum_{i=1}^{r-1} v_{ri} z_i^k$
            Compute $z_r^k = G(net_r^k)$
        EndFor
    EndFor
    For (input units i = 1 to N) Parallel Do
        $\Delta p_{new,i} = \rho \sum_{l=1}^{L} \sum_{k=1}^{m} \sigma_l (E_l^k - \bar{E}_l)\, \frac{\partial G(net_{new}^k)}{\partial x_i^k}\, x_i^k$
    EndFor
    For (hidden units r = 1 to H) Parallel Do
        $\Delta v_{new,r} = \rho \sum_{l=1}^{L} \sum_{k=1}^{m} \sigma_l (E_l^k - \bar{E}_l)\, \frac{\partial G(net_{new}^k)}{\partial z_r^k}\, z_r^k$
    EndFor
EndFor
End
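The quantities used in hidden-unit training can be illustrated with a short sketch: the covariance measure of a candidate unit and the gradient that drives the ascent on its incoming weights. The sketch assumes a hyperbolic-tangent candidate activation and treats the external inputs and the outputs of existing hidden units uniformly as one input row per pattern. All function names are hypothetical.

```python
import math

def candidate_outputs(inputs, weights):
    """Candidate output per pattern; inputs[k] holds the signals feeding the unit."""
    return [math.tanh(sum(w * v for w, v in zip(weights, row))) for row in inputs]

def covariance_measure(z, E):
    """S = sum over outputs l of |sum over patterns k of (z_k - mean z)(E_kl - mean E_l)|."""
    m, L = len(z), len(E[0])
    z_bar = sum(z) / m
    total = 0.0
    for l in range(L):
        e_bar = sum(E[k][l] for k in range(m)) / m
        total += abs(sum((z[k] - z_bar) * (E[k][l] - e_bar) for k in range(m)))
    return total

def candidate_gradient(inputs, E, weights):
    """Gradient of S with respect to each incoming weight of the candidate."""
    m, L = len(inputs), len(E[0])
    z = candidate_outputs(inputs, weights)
    z_bar = sum(z) / m
    grad = [0.0] * len(weights)
    for l in range(L):
        e_bar = sum(E[k][l] for k in range(m)) / m
        cov = sum((z[k] - z_bar) * (E[k][l] - e_bar) for k in range(m))
        sigma = 1.0 if cov >= 0 else -1.0      # sign of the correlation
        for k in range(m):
            slope = 1.0 - z[k] ** 2            # derivative of tanh at net_k
            for i in range(len(weights)):
                grad[i] += sigma * (E[k][l] - e_bar) * slope * inputs[k][i]
    return grad
```

A gradient-ascent step then adds the learning rate times `candidate_gradient(...)` to the candidate's weights; the candidates are independent of one another, which is why they can be trained in parallel.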
3.2 Parallel Implementation
3.2.1 The computational model
Both output-unit and hidden-unit phases of the training algorithm proceed in a well 
defined manner. The computations in each phase are flow dependent. We model the 
operation in each phase by a task graph. Nodes in this graph correspond to computations 
of units in the CC model and edges represent data communications between adjacent units. 
Due to the cascaded nature of the network, outputs of hidden units for a given training
pattern are calculated at different times. Each hidden unit can compute its output only after 
all the preceding hidden units have computed their outputs for the given pattern. Processing 
one pattern at a time with this flow dependency will certainly lead to poor performance. 
Here, we develop a pipelined version of the training algorithm to reduce the overall 
execution time by processing more than one input pattern in parallel. In the pipelined 
version, hidden and output units compute products associated with different patterns in a 
given cycle (one computation per neuron per cycle).
To illustrate pipelining of patterns we use the following example. Consider a CC 
network with 2 input units, 4 hidden units, and 2 output units. Assume that the network 
is to be trained with 6 training patterns (denoted by $p_1$ through $p_6$). Table 3.1 shows one 
pass of the training set through the network. Each row in this table represents a stage of the 
pipeline whereas each column represents a time step. We have assigned one stage to each 
output-layer unit and to each hidden-layer unit. At each time step a unit is busy with 
computation of the pattern listed in the corresponding stage. For instance, in the first time 
step hidden unit $z_1$ and output units $y_1$ - $y_2$ are computing products associated with pattern 
$p_1$. During the next 3 time steps, hidden units $z_2$ - $z_4$ compute tasks corresponding to 
$p_1$, and so on.
Table 3.1: Timing pattern for a pipelined CC network.

                          Time
Stage    1     2     3     4     5     6     7     8     9
z_1      p_1   p_2   p_3   p_4   p_5   p_6
z_2            p_1   p_2   p_3   p_4   p_5   p_6
z_3                  p_1   p_2   p_3   p_4   p_5   p_6
z_4                        p_1   p_2   p_3   p_4   p_5   p_6
y_1      p_1   p_2   p_3   p_4   p_5   p_6
y_2      p_1   p_2   p_3   p_4   p_5   p_6

In general, when pattern $k$ is presented, hidden unit $r$ computes the products $p_{ri} x_i^k$ 
(for all $1 \le i \le N$) corresponding to the weighted inputs from the input units and the 
products $v_{rj} z_j^k$ (for all $1 \le j \le r-1$ and $r \le k$) corresponding to the weighted inputs from the 
preceding hidden units and stores them. An output unit, say unit $l$, on the other hand 
computes products $w_{li} x_i^k$ (for all $1 \le i \le N$) corresponding to the weighted inputs from the 
input units and products $q_{lr} z_r^k$ (for all $1 \le r \le H$) corresponding to the weighted inputs from the 
allocated hidden units. The output of each hidden and each output unit for pattern $k$ can be 
determined once all the product terms it needs have been computed.
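The timing pattern of the example can be generated with a minimal sketch, assuming that hidden unit $z_r$ starts on pattern $p_1$ at time step $r$ and that every output unit starts on $p_1$ at time step 1, as described for Table 3.1. The function name `pipeline_schedule` and the stage labels are hypothetical.

```python
def pipeline_schedule(num_hidden, num_outputs, num_patterns):
    """Return {stage name: list over time steps} of 1-based pattern numbers.

    A None entry means the stage has no pattern to work on at that time step.
    """
    steps = num_patterns + max(num_hidden - 1, 0)
    schedule = {}
    for r in range(1, num_hidden + 1):
        # Hidden unit r lags the first hidden unit by r - 1 time steps.
        schedule[f"z{r}"] = [
            t - r + 1 if 1 <= t - r + 1 <= num_patterns else None
            for t in range(1, steps + 1)
        ]
    for l in range(1, num_outputs + 1):
        # Output units begin computing input products immediately.
        schedule[f"y{l}"] = [
            t if t <= num_patterns else None for t in range(1, steps + 1)
        ]
    return schedule
```

For the example network, `pipeline_schedule(4, 2, 6)` spans nine time steps, and each row of the returned dictionary corresponds to one stage of the pipeline.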
Our ability to pipeline patterns depends on the ability of processors (of the parallel 
architecture) to store the computed products until the products can be added to the other 
products associated with the same pattern. This implies that the space complexity at each 
processor is increased by a factor of $H$ with respect to a non-pipelined implementation. 
Fortunately, with declining memory cost and increased integration level¹ this increase in 
space complexity does not pose any practical problems. Notice that the number of training 
patterns which can be processed in parallel is equal to the number of allocated hidden units 
in the network. The algorithms for pipelined output-unit and hidden-unit training phases 
are listed next.
¹ DRAMs with a capacity of one gigabit have already been built.
Algorithm 3.3:
/* Pipelined Output-Unit Training */
Begin:
1.  For (certain number of training cycles) Do
      For (k = 1 to m + H) Do
        For (hidden units and output units) Parallel Do
1.1       For (hidden units r = 1 to H) Parallel Do
1.1.1       For (input units i = 1 to N and hidden units j = 1 to r-1) Parallel Do
              Compute $p_{ri} x_i^k$ and $v_{rj} z_j^k$
            EndFor
1.1.2       Compute $net_r^{k-r} = \sum_{i=1}^{N} p_{ri} x_i^{k-r} + \sum_{i=1}^{r-1} v_{ri} z_i^{k-r}$
1.1.3       Compute $z_r^{k-r} = G(net_r^{k-r})$
          EndFor
1.2       For (output units l = 1 to L) Parallel Do
1.2.1       For (input units i = 1 to N and hidden units j = 1 to H) Parallel Do
              Compute $w_{li} x_i^k$ and $q_{lj} z_j^k$
            EndFor
1.2.2       Compute: $net_l^{k-H} = \sum_{i=1}^{N} w_{li} x_i^{k-H} + \sum_{i=1}^{H} q_{li} z_i^{k-H}$
1.2.3       Compute: $y_l^{k-H} = F(net_l^{k-H})$
1.2.4       Compute in parallel: $\Delta w_{li} = \rho\, [d_l^{k-H} - F(net_l^{k-H})]\, \frac{\partial F(net_l^{k-H})}{\partial x_i^{k-H}}\, x_i^{k-H}$
          EndFor
        EndFor
      EndFor
    EndFor
2.  If (error above the desired threshold) Do
      For (l = 1 to L) Parallel Do
2.1     For (k = 1 to m) Parallel Do
          $E_l^k = y_l^k - d_l^k$
        EndFor
2.2     Compute $\bar{E}_l$
      EndFor
    EndIf
End
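The pipelined forward computation can be checked against the one-pattern-at-a-time computation with a small simulation. This is a sketch under simplifying assumptions: hidden units use a hyperbolic-tangent activation, output units are linear, weight updates are omitted, and in pass $k$ hidden unit $r$ finishes pattern $k-r$ while the output units finish pattern $k-H$. The function names and data layout are hypothetical.

```python
import math

def sequential_outputs(patterns, P, V, W, Q):
    """Reference computation: finish one pattern before starting the next."""
    outputs = []
    for x in patterns:
        z = []
        for r in range(len(P)):
            net = sum(p * xi for p, xi in zip(P[r], x))
            net += sum(V[r][j] * z[j] for j in range(r))
            z.append(math.tanh(net))
        outputs.append([
            sum(w * xi for w, xi in zip(W[l], x))
            + sum(Q[l][r] * z[r] for r in range(len(P)))
            for l in range(len(W))
        ])
    return outputs

def pipelined_outputs(patterns, P, V, W, Q):
    """m + H passes; each hidden unit handles a different pattern in a given pass."""
    m, H = len(patterns), len(P)
    z = {}   # (pattern, hidden unit) -> stored hidden output, both 1-based
    y = {}   # pattern -> list of network outputs
    for k in range(1, m + H + 1):
        fresh = {}
        for r in range(1, H + 1):          # conceptually a parallel loop
            t = k - r                      # pattern finished by unit r in this pass
            if 1 <= t <= m:
                x = patterns[t - 1]
                net = sum(p * xi for p, xi in zip(P[r - 1], x))
                net += sum(V[r - 1][j - 1] * z[(t, j)] for j in range(1, r))
                fresh[(t, r)] = math.tanh(net)
        z.update(fresh)                    # results become visible after the pass
        t = k - H
        if 1 <= t <= m:
            x = patterns[t - 1]
            y[t] = [
                sum(w * xi for w, xi in zip(W[l], x))
                + sum(Q[l][r - 1] * z[(t, r)] for r in range(1, H + 1))
                for l in range(len(W))
            ]
    return [y[t] for t in range(1, m + 1)]
```

Hidden outputs computed in a pass are written to `fresh` and merged only afterwards, so no hidden unit reads a value produced by another hidden unit in the same pass; this mirrors the requirement that stored products from earlier passes suffice.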
Algorithm 3.4:
/* Pipelined Hidden-Unit Training */
Begin:
For (several candidate hidden units) Parallel Do
1.  For (training patterns k = 1 to m) Do
      For (hidden units r = 1 to H) Parallel Do
1.1     For (input units i = 1 to N and hidden units j = 1 to r-1) Parallel Do
          Compute $p_{ri} x_i^k$ and $v_{rj} z_j^k$
        EndFor
1.2     Compute $net_r^{k-r} = \sum_{i=1}^{N} p_{ri} x_i^{k-r} + \sum_{i=1}^{r-1} v_{ri} z_i^{k-r}$
1.3     Compute $z_r^{k-r} = G(net_r^{k-r})$
      EndFor
    EndFor
2.  For (training patterns k = 1 to m) Do
      For (input units i = 1 to N and hidden units r = 1 to H) Parallel Do
2.1     $\Delta p_{new,i} = \Delta p_{new,i} + \rho \sum_{l=1}^{L} \sigma_l (E_l^{k-r} - \bar{E}_l)\, \frac{\partial G(net^{k-r})}{\partial x_i^{k-r}}\, x_i^{k-r}$
2.2     $\Delta v_{new,r} = \Delta v_{new,r} + \rho \sum_{l=1}^{L} \sigma_l (E_l^{k-r} - \bar{E}_l)\, \frac{\partial G(net^{k-r})}{\partial z_r^{k-r}}\, z_r^{k-r}$
      EndFor
    EndFor
EndFor
End
We model the pipelined learning architecture by a virtual two-layer network called 
pipelined cascade correlation network or PCC, for short. The PCC consists of a virtual 
input layer and a virtual output layer. The virtual input layer includes the input units as
well as the hidden units. The virtual output layer on the other hand consists of the hidden 
units and the output units. Hidden units are included in both virtual layers since each hidden 
unit receives external inputs as well as the outputs of any preceding hidden units and sends 
its output to output units as well as to any succeeding hidden units. Figure 3.2 illustrates 
the PCC for a network with $H$ hidden units. The hidden units in the virtual input and output 
layers are labeled $z_{i1}, z_{i2}, \ldots, z_{iH}$ and $z_{o1}, z_{o2}, \ldots, z_{oH}$, respectively.
The PCC is intended to model the parallelism inherent to the pipelined CC algorithm. 
Edges of this model represent the communications which can take place in parallel between 
input-layer and output-layer units. Output-layer units on the other hand symbolize the 
concurrent computations. Each output-layer node represents a collection of concurrent 
atomic computations associated with a given output or hidden unit. These operations can 
be broken down into smaller subtasks depending on the degree of parallelism sought. For 
instance, a virtual output-layer node may represent a weighted sum of its inputs 
($\sum_{i=1}^{N} w_{li} x_i$). Clearly, this operation can be broken down into multiplications and additions.
Each pipelined training phase can be expressed in terms of multiple passes through 
the PCC network. For instance, each presentation of the training set during the output-unit 
learning phase consists of $m + H$ passes through the PCC network. Based on the PCC 
model, output-unit training proceeds as follows. During the $k$-th pass, virtual output-layer 
units receive outputs of the virtual input-layer units and compute the products 
corresponding to the $k$-th pattern. Then, every output unit (of the actual CC network) 
computes its output for pattern $k-H$ based on the products computed in previous iterations. 
At the same time, each hidden unit, say unit $r$ ($1 \le r \le H$), computes its output for pattern 
$k-r$. Finally, hidden unit $z_{or}$ (in the virtual output layer) sends its output for pattern $k-r$ to 
$z_{ir}$ (in the virtual input layer) for subsequent operations. Clearly, outputs of each pattern 
are computed after $H$ passes through the PCC network. Notice that multiple patterns (at 
most $H$ patterns) are processed concurrently in the pipelined algorithm.

Figure 3.2: The PCC model.

Due to the nature of the pipelined algorithm, computations and communications in 
multiple passes of the PCC network cannot be overlapped. Hence, the problem of mapping 
the pipelined algorithm on the KNC is simplified to that of mapping the PCC model, which 
is described next.
3.2.2 The Mapping Procedure
In this subsection we consider mapping the PCC for a network with Hmax hidden 
units to an na-dimensional KNC. Our objective is to find a mapping scheme which leads to 
an efficient simulation of the CC learning algorithm. In this algorithm, the number of hidden
units is not known in advance. Hidden units are installed one by one until a desired error 
bound is reached. Consequently, the size and connectivity of the PCC change dynamically. 
One possible approach is to devise an efficient simulation for each instance of the PCC. 
However, such a scheme would require task reassignment whenever a new hidden unit is 
added. This would lead to a very inefficient implementation due to the time wasted in 
reassignment. Here, we propose an efficient method which eliminates the need for task 
reassignment and minimizes a desired metric for a network with Hmax hidden units. 
Instead of mapping all instances of the PCC, we consider mapping a single PCC which 
captures (as shall be seen) the features of all possible instances of the PCC (within certain 
limits).
We consider an upper bound on the number of hidden units ($H$) and denote it by 
Hmax. We assume that $H$ can take any value between 0 and Hmax with equal probability. 
Then, we map the PCC with Hmax hidden units (denoted by $PCC_{Hmax}$ henceforth) such that 
a desired metric is minimized. In particular, we propose two optimization metrics for a 
network with Hmax hidden units and compare the performance of their resulting mappings. 
In our approach, the processor assignment for PCCs with 0 through Hmax hidden units 
is known in advance. In other words, the tasks each processor should perform for any of 
these network sizes are known once the mapping is done. Moreover, no task rescheduling 
or migration is needed as the number of hidden units grows. Each processor basically needs 
to compute more tasks as new hidden units are allocated. The mapping of the $PCC_{Hmax}$ takes 
place in two steps. We first map the $PCC_{Hmax}$ to a virtual KNC of compatible size, called 
VKNC. Then, we fold the VKNC until its dimension matches that of the actual KNC. The
folding is done so as to optimize a desired metric for a network with Hmax 
hidden units. Further foldings may be necessary if a subcube of the parallel architecture 
minimizes the desired metric. In such cases the foldings continue until the metric is 
minimized.
3.2.3 Mapping the $PCC_{Hmax}$ onto a VKNC
The PCC learning model involves both communications and computations. 
The mapping should satisfy both communication and computation requirements in such a 
way that a desired metric is minimized for a network with Hmax hidden units. We need 
to find the degree of parallelism which leads to minimization of the metric assuming Hmax 
hidden units. The granularity of the parallelism should be chosen based on two factors: the 
ratio of communication time to computation time of the actual KNC ($t_c / t_r$) and its 
dimensionality ($n_a$).
Our approach is to first obtain the task assignment for the finest grain parallelism; 
i.e., each simple arithmetic operation is assigned to a virtual processor. Each $PCC_{Hmax}$ node 
represents a computation which can be partitioned into atomic subtasks. Notice that the 
number of atomic computations associated with each node is related to the number of edges 
connected to the node. Such tasks are assigned to processors of the virtual KNC. For 
instance, each multiplication in $\sum_{i=1}^{N} w_{li} x_i$ is considered an atomic task and is assigned to a 
VKNC node. Then, we refine the granularity of parallelism by folding the virtual 
architecture, thus increasing the amount of computation performed by each processor, at least 
until the number of concurrent tasks matches the size of the actual KNC. Further foldings 
might be necessary to optimize the chosen metric when Hmax hidden units are allocated.
For the finest parallelism, we assign each pair of adjacent units of the $PCC_{Hmax}$ 
(which represents an atomic task) to one node of the VKNC. Notice that the finest grain 
parallelism may not lead to minimization of the desired metric. However, we use it as the 
starting point of our mapping. The degree of granularity is increased as the mapping 
proceeds. An atomic task in this work represents a simple arithmetic operation such as 
addition, subtraction, or multiplication. This way, the local memory of the VKNC nodes 
provides the communication between any two adjacent units. Since each $PCC_{Hmax}$ unit has 
several neighbors, multiple copies of each unit would be assigned to different VKNC 
nodes. The KNC interconnection network will provide communications among different 
copies of each unit. The granularity of the parallelism is refined during the folding process.
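The set of atomic tasks at the finest granularity can be enumerated with a minimal sketch, assuming one task per pair of adjacent PCC units: every input unit is adjacent to every hidden and output unit, every hidden unit is adjacent to all succeeding hidden units, and every hidden unit is adjacent to every output unit. The function name and the tuple labels are hypothetical.

```python
def pcc_adjacent_pairs(num_inputs, num_hidden, num_outputs):
    """List (source, destination) pairs of adjacent PCC units.

    Units are labeled ("x", i) for inputs, ("z", r) for hidden units,
    and ("y", l) for outputs, with 1-based indices.
    """
    pairs = []
    for r in range(1, num_hidden + 1):
        for i in range(1, num_inputs + 1):
            pairs.append((("x", i), ("z", r)))      # input feeds hidden unit r
        for j in range(1, r):
            pairs.append((("z", j), ("z", r)))      # preceding hidden unit feeds r
    for l in range(1, num_outputs + 1):
        for i in range(1, num_inputs + 1):
            pairs.append((("x", i), ("y", l)))      # input feeds output unit l
        for r in range(1, num_hidden + 1):
            pairs.append((("z", r), ("y", l)))      # hidden unit feeds output unit l
    return pairs
```

For the earlier example with 2 input units, 4 hidden units, and 2 output units, the enumeration yields 8 + 6 + 4 + 8 = 26 pairs, each of which would occupy one VKNC node before any folding.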
We assign an $n_{in}$-digit ($n_{out}$-digit) k-ary address to each input-layer (output-layer) 
unit of the $PCC_{Hmax}$, where $n_{in} = \lceil \log_k (N + Hmax) \rceil$ and $n_{out} = \lceil \log_k (L + Hmax) \rceil$. These 
labels are used to uniquely identify the atomic computations associated with each pair of 
adjacent $PCC_{Hmax}$ units. $PCC_{Hmax}$ units are then assigned to an ($n_{in} + n_{out}$)-dimensional VKNC 
as follows. A copy of adjacent $PCC_{Hmax}$ units $i$ and $j$ is assigned to the VKNC node whose 
address is obtained by concatenating the addresses of units $i$ and $j$. The VKNC node 
performs all computations involving both of these units. The assignment implies storing in 
the VKNC node (or in some cases initializing) several parameters associated with units $i$ and 
$j$. We refer to these parameters as the data sets of VKNC nodes. Table 3.2 shows possible 
data sets for different adjacent nodes of $PCC_{Hmax}$. Each column in this table represents a 
different data set.
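The addressing scheme can be sketched as follows, assuming that units in each virtual layer are numbered from 0 and that an address is written most significant digit first. The function names are hypothetical, and the digit counts are computed with integer arithmetic rather than a floating-point logarithm.

```python
def digits_needed(count, k):
    """Smallest n with k**n >= count, i.e. the ceiling of log_k(count)."""
    n, capacity = 0, 1
    while capacity < count:
        capacity *= k
        n += 1
    return n

def kary_address(index, k, width):
    """Fixed-width k-ary digits of index, most significant digit first."""
    digits = []
    for _ in range(width):
        digits.append(index % k)
        index //= k
    return tuple(reversed(digits))

def vknc_node(input_unit, output_unit, num_input_layer, num_output_layer, k):
    """Address of the VKNC node hosting the pair (input-layer unit, output-layer unit).

    num_input_layer stands for N + Hmax and num_output_layer for L + Hmax.
    """
    n_in = digits_needed(num_input_layer, k)
    n_out = digits_needed(num_output_layer, k)
    return kary_address(input_unit, k, n_in) + kary_address(output_unit, k, n_out)
```

With `k = 2`, six input-layer units, and four output-layer units, the pair (5, 2) maps to the five-digit address formed by the three digits of 5 followed by the two digits of 2.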
Table 3.2: Data sets for different adjacent units of P C C ^
95
Input $x_i$ and output $y_j$ | Input $x_i$ and hidden $z_j$ | Hidden units $z_i$ and $z_j$ | Hidden $z_i$ and output $y_j$
ir/.x?, x 1 x2 J.‘
.
d j,d 2,..^djk
3V » J; • "1 3; Pi y/.y/.
1 dj, dj,..., dj Zi
This assignment process is repeated for each pair of adjacent $PCC_{H_{max}}$ units. It is
easy to show that with this assignment scheme all copies of any virtual output-layer unit
will lie on an $n_{max}$-dimensional KNC. In all, there are $k^{l_{max}}$ edge-disjoint
$n_{max}$-dimensional KNCs which will provide the communications corresponding to all
virtual output units. We refer to these KNCs as input-to-output subcubes. Notice that in
each phase of the algorithm certain sum-of-products operations are performed which
correspond to the computations that would be performed by each hidden or output unit. We
utilize input-to-output subcubes to provide the communications necessary for such
operations. Similarly, output-to-input subcubes provide communications to the virtual
input units.
3.2.4 Folding the VKNC
The next step in the mapping process involves reducing the size of the VKNC until 
it is less than or equal to that of the actual architecture, depending on the ratio of 
communication time to computation time. We refer to this process as folding the VKNC. 
Our objective is to fold the VKNC of a network with Hmax hidden units such that the 
resulting mapping leads to minimization of a desired metric for the given network. We
consider two optimization metrics defined based on the iteration time of the learning
algorithm for a network with Hmax hidden units, which is defined as the total execution
time taken by one iteration of the two learning phases. Let $n_a$ and $n_v$ denote the
sizes of the actual KNC and the VKNC, respectively. We refer to the architecture obtained
after folding the VKNC $t$ times as $VKNC^t$. In the proposed scheme the VKNC undergoes at
least $n_f$ folding steps, where $n_f = n_v - n_a$. At folding step $t+1$, the
dimensionality of $VKNC^t$ is reduced by one, and the tasks and data sets of all nodes whose
$k$-ary addresses differ only in the digit corresponding to the folding dimension are
assigned to a single node of $VKNC^{t+1}$. Any desired communication among these tasks
would take place through the local memory of the assigned node. The issue here is how to
select the folding digits (dimensions) to satisfy the optimization criterion.
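A single folding step can be illustrated with a small Python sketch. It assumes, purely for illustration, that the current assignment is stored as a dictionary from address tuples to task lists; the function name `fold` and the task labels are hypothetical.

```python
from collections import defaultdict


def fold(assignment, digit_index):
    """Fold one dimension: drop digit `digit_index` from every address and merge the
    task lists of all nodes whose addresses differ only in that digit."""
    folded = defaultdict(list)
    for address, tasks in assignment.items():
        new_address = address[:digit_index] + address[digit_index + 1:]
        folded[new_address].extend(tasks)
    return dict(folded)


# A 2-ary VKNC with a 2-digit address and one atomic task per node (toy example).
vknc = {(a, b): [f"task{a}{b}"] for a in range(2) for b in range(2)}

# Fold the most significant digit: nodes (0, b) and (1, b) collapse onto node (b,).
vknc1 = fold(vknc, 0)
```

After the fold, tasks that previously communicated across the folded dimension share one node, so that communication goes through local memory.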
The folding digit can be selected from the input-segment digits or the output-segment
digits of the $(n_{max} + l_{max})$-digit VKNC address. In Chapter 2 we have shown that
folding the most significant digit from each segment of the VKNC label maximizes processor
utilization. We need to determine how many digits should be folded from each segment. It
should be pointed out that we fold the VKNC obtained for a network with Hmax hidden units.
However, for subsequent derivations we need to determine how a given folding step affects
smaller instances of the network. Figure 3.3 shows all possible ways in which folding the
VKNC can affect different instances of the model. Let $l_h = \lceil \log_k (L + H) \rceil$
and $n_h = \lceil \log_k (N + H) \rceil$. In general, folding $\alpha$ input digits of the
VKNC address of a network with Hmax hidden units reduces the size of the input segment of
other $PCC_H$'s ($0 \le H \le H_{max} - 1$) by $\max(\alpha - n_{max} + n_h, 0)$ digits.
Similarly, folding $\beta$ output digits from the VKNC address of a network with Hmax
hidden units reduces the size of the output segment of other $PCC_H$'s by
$\max(\beta - l_{max} + l_h, 0)$ digits. Next, we find the iteration time for any instance
of a network (with 0 through Hmax hidden units) if the VKNC of the largest instance of the
network (with Hmax hidden units) undergoes a certain number of foldings. We stress that
the folding scheme reduces the size of the VKNC of a network with Hmax hidden units.

Figure 3.3: Input and output segments of $PCC_{H_{max}}$ and $PCC_H$ after $\alpha$ input
and $\beta$ output digit foldings (panels (a) through (d)).
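The effect of a folding on a smaller instance can be checked numerically. The sketch below assumes k = 2 and uses example sizes (N = 196, L = 26, Hmax = 80); the helper names are hypothetical, and the two reductions follow the expressions max(α − n_max + n_h, 0) and max(β − l_max + l_h, 0) given above.

```python
def ceil_log(count, k):
    """ceil(log_k(count)) computed with integers to avoid floating-point rounding."""
    d, capacity = 0, 1
    while capacity < count:
        capacity *= k
        d += 1
    return d


def segment_reduction(folded_digits, seg_max, seg_h):
    """Digits removed from a smaller instance's segment when `folded_digits` most
    significant digits of the largest instance's segment are folded."""
    return max(folded_digits - seg_max + seg_h, 0)


N, L, H_MAX, K = 196, 26, 80, 2          # assumed example values
n_max = ceil_log(N + H_MAX, K)           # 276 units: 9 digits
l_max = ceil_log(L + H_MAX, K)           # 106 units: 7 digits

H = 0                                    # smallest instance
n_h = ceil_log(N + H, K)                 # 196 units: 8 digits
l_h = ceil_log(L + H, K)                 # 26 units: 5 digits

alpha, beta = 2, 3
input_loss = segment_reduction(alpha, n_max, n_h)    # max(2 - 9 + 8, 0)
output_loss = segment_reduction(beta, l_max, l_h)    # max(3 - 7 + 5, 0)
```

The first n_max − n_h folded input digits are unused by the smaller instance, which is why only the excess beyond them shrinks its segment.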
Theorem 3.1: If the CC output-unit training algorithm is to be run on a KNC whose size
equals that of $VKNC^{\alpha+\beta}$, obtained after $\alpha$ input-segment and $\beta$
output-segment foldings of the VKNC's label, then with $m$ input patterns each iteration of
a network instance with $H$ hidden units will take

$$
\begin{aligned}
T_H^{output}(\alpha,\beta) = {} & (m+H)\,\max(k^{\beta-l_{max}+l_h},1)\,\Big[\big(1+8\max(k^{\alpha-n_{max}+n_h},1)\big)\,t_r \\
& + \min(n_h,\,n_{max}-\alpha)\,\big(\lceil k/2\rceil t_c + 2t_r\big) \\
& + \big(\min(n_h,\,n_{max}-\alpha) + c_1\min(l_h,\,l_{max}-\beta)\big)\,\lceil k/2\rceil t_c\Big]
\end{aligned}
\tag{3.2}
$$

time units, where $c_1 = \dfrac{\max(k^{\alpha-n_{max}+n_h},1)}{\max(k^{\beta-l_{max}+l_h},1)}$.
Proof: This phase consists of two major steps (Steps 1 and 2 of Algorithm 3.1). Step 1
consists of Steps 1.1 and 1.2, which are executed in parallel and involve a sum-of-products
computation. It has been shown in Chapter 2 that the minimum time to compute the sum
of products of $k^n$ pairs on an $i$-dimensional KNC ($i \le n$) is

$$
T(i) = i\,\big(\lceil k/2\rceil t_c + 2t_r\big) + \big(2k^{\,n-i} - 1\big)\,t_r
\tag{3.3}
$$

A fan-in algorithm is introduced in Chapter 2 (Algorithm 2.2) which achieves this lower
bound. The dimension of the largest subcube of the folded VKNC involved in Step 1.1.2
and Step 1.2.2 is $\min(n_h, n_{max}-\alpha)$ (see Figure 3.3). The output segment of the
VKNC label is folded $\max(\beta - l_{max} + l_h, 0)$ times. So, each of the
$\min(n_h, n_{max}-\alpha)$-dimensional subcubes has to compute
$\max(k^{\beta-l_{max}+l_h}, 1)$ sums of products. Hence, according to (3.3),
Steps 1.1.2 and 1.2.2 take at most

$$
\max(k^{\beta-l_{max}+l_h},1)\,\Big[\,2\max(k^{\alpha-n_{max}+n_h},1)\,t_r + \min(n_h,\,n_{max}-\alpha)\,\big(\lceil k/2\rceil t_c + 2t_r\big)\Big]
\tag{3.4}
$$

time units.
The computations in Steps 1.2.3 and 1.1.3 take $\max(k^{\beta-l_{max}+l_h},1)\,t_r$ time
units, assuming that $G(\cdot)$ and $F(\cdot)$ take $t_r$ time units. The outputs of the
output units computed in Step 1.2.3 should be broadcast to the other nodes of the
corresponding input-to-output subcubes for subsequent operations. The hidden-unit outputs
($z_i$'s) in Step 1.1.3 should be sent to their corresponding virtual input-layer nodes.
According to [10], one-to-all broadcast on a $k$-ary $n$-cube can be performed optimally in
$n(k/2)\,t_c$ time units when $k$ is even. For the case of odd $k$ it is shown that the
one-to-all broadcast can be performed in $n\lceil k/2\rceil t_c$ time units [10]. Hence,
Steps 1.1.3 and 1.2.3 take at most

$$
\max(k^{\beta-l_{max}+l_h},1)\,\Big[\,t_r + \big(c_1\min(l_h,\,l_{max}-\beta) + \min(n_h,\,n_{max}-\alpha)\big)\,\lceil k/2\rceil t_c\Big]
\tag{3.5}
$$

time units.
Each weight increment in Step 1.2.4 involves 6 simple mathematical operations,
requiring $6t_r$ time units. This step is repeated
$\max(k^{\beta-l_{max}+l_h},1)\cdot\max(k^{\alpha-n_{max}+n_h},1)$ times due to the
folding. Hence, Step 1.2.4 takes

$$
6\,\max(k^{\beta-l_{max}+l_h},1)\,\max(k^{\alpha-n_{max}+n_h},1)\,t_r
\tag{3.6}
$$

time units.
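Step 1 is repeated $m+H$ times during each iteration due to pipelining. Therefore,
using (3.4) through (3.6) we conclude that Step 1 takes

$$
\begin{aligned}
(m+H)\,\max(k^{\beta-l_{max}+l_h},1)\,\Big[ & \big(1+8\max(k^{\alpha-n_{max}+n_h},1)\big)\,t_r
 + \min(n_h,\,n_{max}-\alpha)\,\big(\lceil k/2\rceil t_c + 2t_r\big) \\
& + \big(\min(n_h,\,n_{max}-\alpha) + c_1\min(l_h,\,l_{max}-\beta)\big)\,\lceil k/2\rceil t_c\Big]
\end{aligned}
\tag{3.7}
$$

time units.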
Step 2.1 takes $m\cdot\max(k^{\beta-l_{max}+l_h},1)\,t_r$ time units. To compute
$\bar{E}_l$ in Step 2.2, first the residual errors are added locally in each node. This
takes $m\cdot\max(k^{\beta-l_{max}+l_h},1)\,t_r$ time units. Then, using output-to-input
subcubes of dimension $\min(l_h, l_{max}-\beta)$, $\bar{E}_l$ is calculated and broadcast
to each output node in $2\min(l_h, l_{max}-\beta)\,(\lceil k/2\rceil t_c + t_r)$ time
units. Hence, Step 2 takes

$$
2\,\Big(m\cdot\max(k^{\beta-l_{max}+l_h},1)\,t_r + \min(l_h,\,l_{max}-\beta)\,\big(\lceil k/2\rceil t_c + t_r\big)\Big)
\tag{3.8}
$$

time units.
Step 1 is repeated for a certain number of iterations to see if a desired error bound can
be reached. Step 2, on the other hand, is performed only once to compute the residual error
when the desired error bound is not reached in Step 1. Moreover, by comparing (3.7) and
(3.8) we observe that the time taken by Step 2 is negligible. Hence, we approximate the
iteration time of the output-unit training by the time taken by Step 1. □
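The bound of Theorem 3.1 is straightforward to evaluate numerically. The following Python sketch is a hedged transcription of equation (3.2): the function name `t_output` and its keyword parameters are hypothetical, and $c_1$ is assumed to be the same ratio that Theorem 3.2 defines, namely max(k^(α−n_max+n_h), 1) / max(k^(β−l_max+l_h), 1).

```python
import math


def t_output(H, alpha, beta, *, m, k, n_max, l_max, n_h, l_h, t_c, t_r):
    """Iteration time of output-unit training per equation (3.2), under the
    assumption c1 = max_a / max_b (the ratio defined in Theorem 3.2)."""
    max_a = max(k ** (alpha - n_max + n_h), 1)   # input-side folding factor
    max_b = max(k ** (beta - l_max + l_h), 1)    # output-side folding factor
    c1 = max_a / max_b
    min_n = min(n_h, n_max - alpha)              # input-to-output subcube dimension
    min_l = min(l_h, l_max - beta)               # output-to-input subcube dimension
    hop = math.ceil(k / 2) * t_c                 # ceil(k/2) * t_c
    return (m + H) * max_b * (
        (1 + 8 * max_a) * t_r
        + min_n * (hop + 2 * t_r)
        + (min_n + c1 * min_l) * hop
    )


# Toy numbers chosen so the result is easy to check by hand (not the thesis's values).
params = dict(m=10, k=2, n_max=3, l_max=2, n_h=3, l_h=2, t_c=1.0, t_r=1.0)
unfolded = t_output(0, 0, 0, **params)      # 10 * (9 + 9 + 5)
one_input_fold = t_output(0, 1, 0, **params)  # 10 * (17 + 6 + 6)
```

In this toy setting, folding one input digit doubles the local arithmetic term while shortening the subcube that carries the communication, which is the trade-off the folding search has to balance.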
Theorem 3.2: If the CC hidden-unit training algorithm is to run on a KNC whose size
equals that of $VKNC^{\alpha+\beta}$, obtained after $\alpha$ input-segment and $\beta$
output-segment foldings of the VKNC's label, then for a network instance with $H$ hidden
units and $m$ input patterns each iteration will take

$$
\begin{aligned}
T_H^{hidden}(\alpha,\beta) = {} & (m+H)\,\max(k^{\beta-l_{max}+l_h},1)\,\Big[\big(2\max(k^{\alpha-n_{max}+n_h},1) + 3c_2 + c_1c_2\big)\,t_r \\
& + \min(l_h,\,l_{max}-\beta)\,\big((c_1+c_2)\lceil k/2\rceil t_c + c_2 t_r\big) \\
& + \min(n_h,\,n_{max}-\alpha)\,\big((c_2+1)\lceil k/2\rceil t_c + 2t_r\big)\Big]
\end{aligned}
\tag{3.9}
$$

time units, where $c_1 = \dfrac{\max(k^{\alpha-n_{max}+n_h},1)}{\max(k^{\beta-l_{max}+l_h},1)}$
and $c_2 = \dfrac{m}{m+H}$.
Proof: Step 1 of hidden-unit training is similar to Step 1.1 of the output-unit training
phase (see Algorithm 3.2). This step is repeated $m+H$ times. Hence, Step 1 takes

$$
(m+H)\,\max(k^{\beta-l_{max}+l_h},1)\,\Big[\,2\max(k^{\alpha-n_{max}+n_h},1)\,t_r + c_1\min(l_h,\,l_{max}-\beta)\,\lceil k/2\rceil t_c + \min(n_h,\,n_{max}-\alpha)\,\big(\lceil k/2\rceil t_c + 2t_r\big)\Big]
\tag{3.10}
$$

time units.
Step 2 is repeated $m$ times. The product $\sigma_l (E_l^p - \bar{E}_l)$ in Steps 2.1 and
2.2 for each node can be computed and added locally in
$3\max(k^{\beta-l_{max}+l_h},1)\,t_r$ time units. The overall sum of these local values
for each pattern can be computed using the output-to-input subcubes of dimension
$\min(l_h, l_{max}-\beta)$. Hence, Steps 2.1 and 2.2 take

$$
\max(k^{\beta-l_{max}+l_h},1)\,\Big[\,3t_r + \min(l_h,\,l_{max}-\beta)\,\big(\lceil k/2\rceil t_c + t_r\big)\Big]
\tag{3.11}
$$

time units. The sums should be broadcast to each node of the input-to-output subcubes
associated with the new hidden units. Finally, p]T E\ " r -  El ) is multiplied by
0d G(net*" r)/d x f "r) x* ~r in tT time units. Hence, Step 2 takes:
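$$
m\cdot\max(k^{\beta-l_{max}+l_h},1)\,\Big[(3+c_1)\,t_r + \min(l_h,\,l_{max}-\beta)\,\big(\lceil k/2\rceil t_c + t_r\big)\Big] + m\cdot\max(k^{\beta-l_{max}+l_h},1)\,\min(n_h,\,n_{max}-\alpha)\,\lceil k/2\rceil t_c
\tag{3.12}
$$

time units. Using (3.10) and (3.12) we can express the overall iteration time for
hidden-unit training as in (3.9). □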
Let $T_H(\alpha,\beta)$ denote the iteration time of the pipelined training algorithm for
an instance of the network with $H$ hidden units if $\alpha$ and $\beta$ digits have been
folded from the input and output segments of the VKNC label of the largest network,
respectively. By definition, $T_H(\alpha,\beta)$ is obtained by adding
$T_H^{output}(\alpha,\beta)$ and $T_H^{hidden}(\alpha,\beta)$. Hence,

$$
\begin{aligned}
T_H(\alpha,\beta) = {} & (m+H)\,\max(k^{\beta-l_{max}+l_h},1)\,\Big[\big(10\max(k^{\alpha-n_{max}+n_h},1) + 3c_2 + c_1c_2 + 1\big)\,t_r \\
& + \min(n_h,\,n_{max}-\alpha)\,\big((c_2+3)\lceil k/2\rceil t_c + 4t_r\big) \\
& + \min(l_h,\,l_{max}-\beta)\,\big((2c_1+c_2)\lceil k/2\rceil t_c + c_2 t_r\big)\Big]
\end{aligned}
\tag{3.13}
$$

where $c_1$ and $c_2$ are as defined in Theorem 3.2.
We introduce two optimization metrics and show that a mapping based on either of 
these criteria leads to a very efficient simulation of the learning algorithm, although one 
metric is superior to the other. Our approach is based on the hypothesis that optimizing the 
mapping for a network with a known size will lead to near-optimal mappings for other 
instances of the network. We shall show that our hypothesis is indeed valid.
The first optimization metric is the iteration time of a network with Hmax hidden units.
The other criterion is the sum of the iteration times of all instances of a network with 0
through Hmax hidden units. We denote this metric by
$T^{total}_{H_{max}}(\alpha,\beta) = \sum_{H=0}^{H_{max}} T_H(\alpha,\beta)$.
Notice that this metric is closely related to the average iteration time of the network
assuming that $H$ can take any value from 0 through Hmax with equal probability.
Next we present our analytical approaches for mapping a CC network on a given 
KNC parallel architecture based on the above criteria. We will analyze the computational 
complexity of each approach. Then, we demonstrate the efficiency of the mapping based on 
each of the two metrics.
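The two criteria can be compared with a brute-force baseline before turning to the analytical approach. The Python sketch below is an illustration only: `best_folding`, `largest_instance_metric` and `total_metric` are hypothetical helper names, and `toy_T` is a made-up placeholder cost, not equation (3.13). The constraint α + β ≥ n_f reflects the minimum number of folding steps.

```python
def best_folding(cost, n_max, l_max, n_f):
    """Exhaustively search all (alpha, beta) with alpha + beta >= n_f.

    cost(alpha, beta) is the metric to minimize; returns (value, alpha, beta)."""
    best = None
    for beta in range(l_max + 1):
        for alpha in range(max(0, n_f - beta), n_max + 1):
            value = cost(alpha, beta)
            if best is None or value < best[0]:
                best = (value, alpha, beta)
    return best


def largest_instance_metric(T, h_max):
    """First metric: iteration time of the instance with h_max hidden units."""
    return lambda alpha, beta: T(h_max, alpha, beta)


def total_metric(T, h_max):
    """Second metric: sum of iteration times over H = 0 .. h_max."""
    return lambda alpha, beta: sum(T(H, alpha, beta) for H in range(h_max + 1))


def toy_T(H, alpha, beta):
    """Placeholder iteration-time model (hypothetical): local work grows with the
    folding, communication shrinks with it."""
    return (H + 1) * (2 ** alpha + 2 ** beta) + 5 * ((4 - alpha) + (3 - beta))


by_largest = best_folding(largest_instance_metric(toy_T, 3), n_max=4, l_max=3, n_f=2)
by_total = best_folding(total_metric(toy_T, 3), n_max=4, l_max=3, n_f=2)
```

Dividing the second metric by Hmax + 1 gives the average iteration time under the equal-probability assumption mentioned above, so both quantities are minimized by the same (α, β).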
3.2.4.1 Optimizing $T_{H_{max}}(\alpha,\beta)$ for a Network with Hmax Hidden Units
We need to determine the iteration time of the largest instance of the network. By
substituting Hmax in (3.13) we obtain

$$
\begin{aligned}
T_{H_{max}}(\alpha,\beta) = {} & (m+H_{max})\,k^{\beta}\,\Big[\big(10k^{\alpha} + 3c_2 + c_2k^{\alpha-\beta} + 1\big)\,t_r \\
& + (n_{max}-\alpha)\,\big((c_2+3)\lceil k/2\rceil t_c + 4t_r\big) \\
& + (l_{max}-\beta)\,\big((2k^{\alpha-\beta}+c_2)\lceil k/2\rceil t_c + c_2 t_r\big)\Big]
\end{aligned}
\tag{3.14}
$$

where $c_2 = \dfrac{m}{m+H_{max}}$. We need to optimize this equation in terms of the
selection of $\alpha$ and $\beta$.
Figure 3.4: Possible closed sets for the optimization problem (panels (a) through (e)).
We know that $\alpha + \beta \ge n_f$, where $n_f = n_v - n_a$. Additional foldings might
be necessary depending on the ratio $t_c/t_r$. This optimization problem is basically a
search for $\alpha$ and $\beta$ in a closed set; specifically, one of the five possible
sets shown in Figure 3.4. This is essentially a nonlinear integer programming problem.
Because of the nature of the function it is difficult to use typical optimization
techniques. We introduce a computationally efficient method for finding the $\alpha$ and
$\beta$ that yield the optimal mapping by utilizing properties of the function. Notice that
by considering all points in the search space (shown in Figure 3.4), the size of the folded
graph will be less than or equal to $n_a$, depending on which size yields the minimum value
of $T_{H_{max}}$. Hence, the folding approach may select only a subcube of the parallel
architecture. Since we want to minimize the function in terms of both $\alpha$ and $\beta$,
we can first minimize it in terms of $\alpha$ for each $\beta$ value. Then, we can
minimize the function in terms of $\beta$ using the $\alpha$ values obtained for each
$\beta$ value. This is the approach we adopt here.
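Theorem 3.3: $T_{H_{max}}(\alpha,\beta)$ is a convex function of $\alpha$.
Proof: The first partial derivative of this function with respect to $\alpha$ is given by

$$
\begin{aligned}
\frac{\partial T_{H_{max}}(\alpha,\beta)}{\partial\alpha} = {} & (m+H_{max})\,k^{\beta}\,\Big[\big(10\ln k\;k^{\alpha} + c_2\ln k\;k^{\alpha-\beta}\big)\,t_r \\
& - \big((c_2+3)\lceil k/2\rceil t_c + 4t_r\big) \\
& + (l_{max}-\beta)\,2\ln k\;k^{\alpha-\beta}\,\lceil k/2\rceil t_c\Big]
\end{aligned}
\tag{3.15}
$$

The second partial derivative with respect to $\alpha$ is as follows:

$$
\frac{\partial^2 T_{H_{max}}(\alpha,\beta)}{\partial\alpha^2} = (m+H_{max})\,\ln^2 k\;k^{\beta}\,\Big[\big(10k^{\alpha} + c_2k^{\alpha-\beta}\big)\,t_r + (l_{max}-\beta)\,2k^{\alpha-\beta}\,\lceil k/2\rceil t_c\Big]
\tag{3.16}
$$

The right hand side of (3.16) is always positive. Hence, the function is convex with respect
to $\alpha$. □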
Theorem 3.4: For a given $\beta$, the value of $\alpha$ which minimizes $T_{H_{max}}$ is
given by

$$
\alpha = \begin{cases}
\lceil Q\rceil & \text{if } \max(0,\,n_f-l_{max}) < Q < n_{max} \text{ and } T_{H_{max}}(\lceil Q\rceil,\beta) < T_{H_{max}}(\lfloor Q\rfloor,\beta) \\
\lfloor Q\rfloor & \text{if } \max(0,\,n_f-l_{max}) < Q < n_{max} \text{ and } T_{H_{max}}(\lceil Q\rceil,\beta) > T_{H_{max}}(\lfloor Q\rfloor,\beta) \\
\max(0,\,n_f-l_{max}) & \text{if } Q \le \max(0,\,n_f-l_{max}) \\
n_{max} & \text{if } n_{max} \le Q
\end{cases}
\tag{3.17}
$$
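Parameter $Q$ in (3.17) is obtained as follows:

$$
Q = \log_k \frac{(c_2+3)\lceil k/2\rceil t_c + 4t_r}{\ln k\,\Big[\big(10 + c_2k^{-\beta}\big)\,t_r + 2\,(l_{max}-\beta)\,k^{-\beta}\,\lceil k/2\rceil t_c\Big]}
\tag{3.18}
$$

Proof: $T_{H_{max}}(\alpha,\beta)$ is a convex function with respect to $\alpha$ (see
Theorem 3.3). From basic calculus we know that for a fixed value of $\beta$, the minimum of
such a function is either the critical point or one of the two end points of the $\alpha$
interval. Figure 3.3 shows the range of $\alpha$ to be
$\alpha \in [\max(0,\,n_f-l_{max}),\ n_{max}]$. The critical point of the function is
obtained by setting $\partial T_{H_{max}}/\partial\alpha = 0$ in (3.15). Thus the critical
point is at

$$
\alpha = \log_k \frac{(c_2+3)\lceil k/2\rceil t_c + 4t_r}{\ln k\,\Big[\big(10 + c_2k^{-\beta}\big)\,t_r + 2\,(l_{max}-\beta)\,k^{-\beta}\,\lceil k/2\rceil t_c\Big]}
\tag{3.19}
$$

We denote the right hand side of (3.19) by $Q$. Notice that we are looking for an integer
value of $\alpha$. Hence, for a fixed $\beta$, the minimum of $T_{H_{max}}(\alpha,\beta)$
occurs at

$$
\alpha = \begin{cases}
\lceil Q\rceil & \text{if } \max(0,\,n_f-l_{max}) < Q < n_{max} \text{ and } T_{H_{max}}(\lceil Q\rceil,\beta) < T_{H_{max}}(\lfloor Q\rfloor,\beta) \\
\lfloor Q\rfloor & \text{if } \max(0,\,n_f-l_{max}) < Q < n_{max} \text{ and } T_{H_{max}}(\lceil Q\rceil,\beta) > T_{H_{max}}(\lfloor Q\rfloor,\beta) \\
\max(0,\,n_f-l_{max}) & \text{if } Q \le \max(0,\,n_f-l_{max}) \\
n_{max} & \text{if } n_{max} \le Q
\end{cases}
\tag{3.20}
$$

□

Hence, we can find the values of $\alpha$ and $\beta$ that yield the optimal
$T_{H_{max}}(\alpha,\beta)$ by finding the best value of $\alpha$ for each $\beta$ value,
and then finding the best $\beta$ among those found. As the next theorem shows, this
process has logarithmic time complexity.

Theorem 3.5: The search for the minimum value of $T_{H_{max}}(\alpha,\beta)$ has
computational complexity $O(\log(L + H_{max})) = O(l_{max})$.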
Proof: Clearly, we can find the minimum of the function for each $\beta$ in constant time
using (3.20). We can repeat the above process for each $\beta$. Notice that
$\beta \le l_{max}$. Hence, the overall search has computational complexity
$O(l_{max})$. □
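The selection rule for α (a convex function, a real-valued critical point, and an integer answer clamped to the feasible interval) can be written as a small generic routine. The Python sketch below is an illustration under stated assumptions: `argmin_convex_int` is a hypothetical name, `f` stands for the metric as a function of α at a fixed β, and ties between the two neighbours are broken toward the floor, a choice the text leaves open.

```python
import math


def argmin_convex_int(f, lo, hi, critical):
    """Integer minimizer on [lo, hi] of a function that is convex in its argument,
    given the real-valued critical point of its continuous extension."""
    if critical <= lo:          # minimum lies at or below the lower end point
        return lo
    if critical >= hi:          # minimum lies at or above the upper end point
        return hi
    floor_c, ceil_c = math.floor(critical), math.ceil(critical)
    return floor_c if f(floor_c) <= f(ceil_c) else ceil_c


# Toy convex function with its critical point at 2.6 (illustrative only).
f = lambda a: (a - 2.6) ** 2
inside = argmin_convex_int(f, 0, 5, 2.6)    # compares f(2) = 0.36 with f(3) = 0.16
below = argmin_convex_int(f, 0, 5, -1.0)    # clamped to the lower end point
above = argmin_convex_int(f, 0, 5, 7.2)     # clamped to the upper end point
```

Each call costs a constant number of evaluations, so repeating it once per candidate β gives the O(l_max) search described above.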
3.2.4.2 Optimizing $T^{total}_{H_{max}}(\alpha,\beta)$ Assuming a Network with Hmax Hidden Units
The optimization criterion stated in the previous subsection takes into account only
the iteration time of the largest possible network. The resulting mapping might not be
efficient for smaller networks. Notice that the number of hidden units is not known in
advance. Here we consider an optimization metric defined based on the sum of the iteration
times of all instances of a network with 0 through Hmax hidden units. We denote this
metric by $T^{total}_{H_{max}}(\alpha,\beta)$, where
$T^{total}_{H_{max}}(\alpha,\beta) = \sum_{H=0}^{H_{max}} T_H(\alpha,\beta)$. Now we
introduce an analytical approach for minimizing $T^{total}_{H_{max}}(\alpha,\beta)$. As
before, we can find the minimum by first minimizing the function with respect to $\alpha$.
Then, we find the $\beta$ which results in the overall minimum.
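We define four sets $P_i$, $1 \le i \le 4$, as follows:

$$
\begin{aligned}
P_1 &= \{\,H \mid 0 \le H \le H_{max},\ l_h \le l_{max}-\beta,\ \text{and } n_h \le n_{max}-\alpha\,\} \\
P_2 &= \{\,H \mid 0 \le H \le H_{max},\ l_h \le l_{max}-\beta,\ \text{and } n_h > n_{max}-\alpha\,\} \\
P_3 &= \{\,H \mid 0 \le H \le H_{max},\ l_h > l_{max}-\beta,\ \text{and } n_h > n_{max}-\alpha\,\} \\
P_4 &= \{\,H \mid 0 \le H \le H_{max},\ l_h > l_{max}-\beta,\ \text{and } n_h \le n_{max}-\alpha\,\}
\end{aligned}
\tag{3.21}
$$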
Theorem 3.6: $T^{total}_{H_{max}}(\alpha,\beta)$ is a convex function with respect to
$\alpha$.
Proof: It is easy to show that sets $P_1$ through $P_4$ are mutually disjoint and that
$P_1 \cup P_2 \cup P_3 \cup P_4 = \{0, 1, \ldots, H_{max}\}$. Hence,

$$
T^{total}_{H_{max}}(\alpha,\beta) = \sum_{H=0}^{H_{max}} T_H(\alpha,\beta)
= \sum_{H\in P_1} T_H(\alpha,\beta) + \sum_{H\in P_2} T_H(\alpha,\beta) + \sum_{H\in P_3} T_H(\alpha,\beta) + \sum_{H\in P_4} T_H(\alpha,\beta)
\tag{3.22}
$$
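The four summation terms in (3.22) can be obtained from (3.13) as follows, with $c_1$ and
$c_2$ as defined in Theorem 3.2:

$$
\sum_{H\in P_1} T_H(\alpha,\beta) = \sum_{H\in P_1} (m+H)\,\Big[(10 + 4c_2 + 1)\,t_r + n_h\big((c_2+3)\lceil k/2\rceil t_c + 4t_r\big) + l_h\big((2+c_2)\lceil k/2\rceil t_c + c_2t_r\big)\Big]
\tag{3.23}
$$

$$
\sum_{H\in P_2} T_H(\alpha,\beta) = \sum_{H\in P_2} (m+H)\,\Big[\big(3c_2 + (10+c_2)\,k^{\alpha-n_{max}+n_h} + 1\big)\,t_r + (n_{max}-\alpha)\big((c_2+3)\lceil k/2\rceil t_c + 4t_r\big) + l_h\big((2k^{\alpha-n_{max}+n_h}+c_2)\lceil k/2\rceil t_c + c_2t_r\big)\Big]
\tag{3.24}
$$

$$
\sum_{H\in P_3} T_H(\alpha,\beta) = \sum_{H\in P_3} (m+H)\,k^{\beta-l_{max}+l_h}\,\Big[\big(10k^{\alpha-n_{max}+n_h} + 3c_2 + c_1c_2 + 1\big)\,t_r + (n_{max}-\alpha)\big((c_2+3)\lceil k/2\rceil t_c + 4t_r\big) + (l_{max}-\beta)\big((2c_1+c_2)\lceil k/2\rceil t_c + c_2t_r\big)\Big]
\tag{3.25}
$$

$$
\sum_{H\in P_4} T_H(\alpha,\beta) = \sum_{H\in P_4} (m+H)\,k^{\beta-l_{max}+l_h}\,\Big[\big(10 + 3c_2 + c_1c_2 + 1\big)\,t_r + n_h\big((c_2+3)\lceil k/2\rceil t_c + 4t_r\big) + (l_{max}-\beta)\big((2c_1+c_2)\lceil k/2\rceil t_c + c_2t_r\big)\Big]
\tag{3.26}
$$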
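Notice that $T_H(\alpha,\beta)$ is a constant function of $\alpha$ when $H$ is in sets
$P_1$ or $P_4$, since $n_h \le n_{max}-\alpha$ in both. Hence, the second partial
derivative of $T^{total}_{H_{max}}(\alpha,\beta)$ with respect to $\alpha$ is given by
equation (3.27).

$$
\begin{aligned}
\frac{\partial^2 T^{total}_{H_{max}}(\alpha,\beta)}{\partial\alpha^2}
= {} & \sum_{H\in P_2}\frac{\partial^2 T_H}{\partial\alpha^2} + \sum_{H\in P_3}\frac{\partial^2 T_H}{\partial\alpha^2} \\
= {} & \sum_{H\in P_2} (m+H)\,\ln^2 k\;k^{\alpha-n_{max}+n_h}\,\Big[(10+c_2)\,t_r + 2\,l_h\lceil k/2\rceil t_c\Big] \\
& + \sum_{H\in P_3} (m+H)\,\ln^2 k\;k^{\alpha-n_{max}+n_h}\,\Big[\big(10k^{\beta-l_{max}+l_h}+c_2\big)\,t_r + 2\,(l_{max}-\beta)\lceil k/2\rceil t_c\Big]
\end{aligned}
\tag{3.27}
$$

The right hand side of (3.27) is always positive. Hence, $T^{total}_{H_{max}}(\alpha,\beta)$
is a convex function with respect to $\alpha$.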
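Theorem 3.7: The minimum value of $T^{total}_{H_{max}}(\alpha,\beta)$ for a fixed $\beta$
occurs at

$$
\alpha = \begin{cases}
\lceil A\rceil & \text{if } \max(0,\,n_f-l_{max}) < A < n_{max} \text{ and } T^{total}_{H_{max}}(\lceil A\rceil,\beta) < T^{total}_{H_{max}}(\lfloor A\rfloor,\beta) \\
\lfloor A\rfloor & \text{if } \max(0,\,n_f-l_{max}) < A < n_{max} \text{ and } T^{total}_{H_{max}}(\lceil A\rceil,\beta) > T^{total}_{H_{max}}(\lfloor A\rfloor,\beta) \\
\max(0,\,n_f-l_{max}) & \text{if } A \le \max(0,\,n_f-l_{max}) \\
n_{max} & \text{if } n_{max} \le A
\end{cases}
\tag{3.28}
$$

where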
A ____________ In* ' 1 \P2\ I ^3 1 (c2 + 3) [ */2 j tc +4 rt____________
* £  (1 0 +c2)fr+2 /A\ k l l \ t c +k~t £  (1 0 +c2) tr * 2 in*(I^-P)[ fe/ 2 1 rc (3-29)
HeP2 HePj
Proof: $T^{total}_{H_{max}}(\alpha,\beta)$ is a convex function with respect to $\alpha$
(see Theorem 3.6). Hence, we can find the minimum of the function, for each $\beta$, from
the critical points or the end points of the range of $\alpha$. The critical points of the
function are obtained from $\partial T^{total}_{H_{max}}(\alpha,\beta)/\partial\alpha = 0$.
This results in
‘ £  aa*c2v^nh\ui]tc*k^ £  ( i o i , » 2 in*o^-P)f^ 2 ] <3-30)
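We denote the right hand side of (3.30) by $A$. Since we are looking for an integer
$\alpha$, we need to use $\lceil A\rceil$ or $\lfloor A\rfloor$, whichever results in a
lower value when substituted in $T^{total}_{H_{max}}(\alpha,\beta)$. Hence, we find the
value for $\alpha$ which minimizes the function for each $\beta$ as follows:

$$
\alpha = \begin{cases}
\lceil A\rceil & \text{if } \max(0,\,n_f-l_{max}) < A < n_{max} \text{ and } T^{total}_{H_{max}}(\lceil A\rceil,\beta) < T^{total}_{H_{max}}(\lfloor A\rfloor,\beta) \\
\lfloor A\rfloor & \text{if } \max(0,\,n_f-l_{max}) < A < n_{max} \text{ and } T^{total}_{H_{max}}(\lceil A\rceil,\beta) > T^{total}_{H_{max}}(\lfloor A\rfloor,\beta) \\
\max(0,\,n_f-l_{max}) & \text{if } A \le \max(0,\,n_f-l_{max}) \\
n_{max} & \text{if } n_{max} \le A
\end{cases}
\tag{3.31}
$$

Clearly, we can find the minimum of the function for each $\beta$ in constant time. We can
repeat the above process for each $\beta$. Hence, the computational complexity of this
approach is also $O(\log(L + H_{max})) = O(l_{max})$.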
3.2.4.3 Performance of the Proposed Folding Scheme
We now evaluate the performance of the proposed mapping. We begin by 
examining the efficiency of the resulting learning simulation. We map the learning 
algorithm for a network with Hmax hidden units using the proposed folding scheme. Then, 
we apply the resulting mapping to all instances of the network (networks with 0  through 
Hmax hidden units) and compute their iteration times. Iteration time is the execution time 
of one iteration of the learning algorithm. We then compare the iteration time of each 
instance of the network when mapped using our proposed scheme with its corresponding 
optimal iteration time. We stress that our scheme optimizes a desired metric only
for a network with Hmax hidden units. The optimal iteration time for other instances of the
network is obtained through an exhaustive search for comparison purposes only. As we
shall show, our mapping leads to near-optimal results for other instances of the network.
The results we report here are based on simulation of a network whose parameters are
found by running the benchmark application used in [64]. In this case, there are 196 input
units, 26 output units, and 1114 training patterns. The average number of hidden units
allocated during actual training for this application is reported to be 27. We assume that
computation time ($t_r$) and communication time ($t_c$) are 0.01 μs and 0.4 μs,
respectively. Unless otherwise stated, we assume that the parallel KNC architecture has 64
processors.
For the first set of simulations, we performed the folding so as to minimize
$T_{H_{max}}(\alpha,\beta)$ assuming a network with Hmax hidden units. We then used the
resulting $\alpha$ and $\beta$ to compute the iteration times of all instances of the
network, with 0 through Hmax hidden units. We compared these results with their
corresponding optimal values. Figures 3.5 through 3.8 show the results of our simulations
for different Hmax values; specifically for Hmax = 20, 40, 60, and 80, respectively.
Clearly, the iteration times obtained as a result of our mapping are very close to the
optimal results except for a noticeable deviation for very small networks. This is to be
expected because the optimization criterion takes into account only the iteration time of
the largest network. Hence, the resulting mapping might not be efficient for very small
networks.
We observe several abrupt changes in the plots depicted in Figures 3.5 through 3.8.
These jumps can be attributed to changes in $T_H(\alpha,\beta)$ as $H$ varies. As $H$
grows, several terms in $T_H(\alpha,\beta)$ may vary (see equation (3.13)). Clearly, $H$
grows linearly. $n_h$ and $l_h$, on the other hand, change only when $N + H$ or $L + H$
exceeds a power of $k$. For instance, in Figure 3.8 abrupt changes occur at $H$ values 7,
39, and 61. With $N = 196$ and $L = 26$, the jumps at $H = 7$ and $H = 39$ are due to an
increase in $l_h$ ($L + H$ exceeds 32 and 64, respectively), while the jump at $H = 61$ is
due to an increase in $n_h$ ($N + H$ exceeds 256). Notice that $T_H(\alpha,\beta)$ grows
exponentially in terms of $n_h$ and $l_h$.

Figure 3.5: Simulation results for optimizing $T_{H_{max}}$ with Hmax = 20 (iteration time versus number of hidden units; optimal solution and proposed approach).
Figure 3.6: Simulation results for optimizing $T_{H_{max}}$ with Hmax = 40 (iteration time versus number of hidden units; optimal solution and proposed approach).
Figure 3.7: Simulation results for optimizing $T_{H_{max}}$ with Hmax = 60 (iteration time versus number of hidden units; optimal solution and proposed approach).
Figure 3.8: Simulation results for optimizing $T_{H_{max}}$ with Hmax = 80 (iteration time versus number of hidden units; optimal solution and proposed approach).
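The values of H at which $n_h$ or $l_h$ changes can be listed directly from their definitions. The Python sketch below uses the benchmark sizes quoted earlier (196 input units, 26 output units) and assumes k = 2, which is an assumption of this illustration; the helper names are hypothetical.

```python
def ceil_log(count, k):
    """ceil(log_k(count)) computed with integers to avoid floating-point rounding."""
    d, capacity = 0, 1
    while capacity < count:
        capacity *= k
        d += 1
    return d


def jump_points(N, L, h_max, k):
    """List (H, name) for every H in 1..h_max at which n_h or l_h increases."""
    jumps = []
    for H in range(1, h_max + 1):
        if ceil_log(N + H, k) != ceil_log(N + H - 1, k):
            jumps.append((H, "n_h"))
        if ceil_log(L + H, k) != ceil_log(L + H - 1, k):
            jumps.append((H, "l_h"))
    return jumps


# 196 inputs and 26 outputs as in the benchmark; k = 2 is assumed here.
jumps = jump_points(196, 26, 80, 2)
# -> [(7, 'l_h'), (39, 'l_h'), (61, 'n_h')]
```

Under these assumptions L + H passes 32 and 64 at H = 7 and H = 39, and N + H passes 256 at H = 61, which matches the three positions at which the plots jump.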
Next, we performed the folding by optimizing $T^{total}_{H_{max}}(\alpha,\beta)$. We then
applied the resulting mapping to all instances of the network with 0 through Hmax hidden
units and computed $T^{total}_H(\alpha,\beta)$ for each case. We compared these results
with their corresponding optimal values obtained by exhaustive search. To obtain the
optimal result for a network with $H$ hidden units we performed the mapping to optimize
$T^{total}_H(\alpha,\beta)$. Figures 3.9 through 3.12 show the results of our simulations
with different Hmax values. These results indicate that deviations from the optimal
results appear only for small networks and are relatively insignificant. Comparing these
results with those shown in Figures 3.5 through 3.8, we observe that
$T^{total}_{H_{max}}(\alpha,\beta)$ is a more efficient folding criterion than
$T_{H_{max}}(\alpha,\beta)$, since it leads to near-optimal results for other instances of
the network.
For the next set of tests we varied the number of physical processors to examine the 
speedup of the simulation. Figures 3.13 and 3.14 show how increasing the number of 
physical processors reduces the iteration time. We observed that speedup eventually 
saturated. For instance, increasing the number of processors to 8192 did not improve the 
iteration time when was the optimization metric.
Another issue of relative importance is the selection of Hmax. The number of hidden
units is determined during the training of the given CC architecture. In our mapping scheme
we assume an upper bound on the number of hidden units (Hmax). Clearly, the efficiency
of the proposed mapping depends on the selection of this value. Next, we explore the
effect of Hmax on the mapping and on the iteration time of the learning algorithm.

Figure 3.9: Simulation results for optimizing $T^{total}_{H_{max}}$ with Hmax = 20 (optimal solution and proposed approach versus number of hidden units).
Figure 3.10: Simulation results for optimizing $T^{total}_{H_{max}}$ with Hmax = 40 (optimal solution and proposed approach versus number of hidden units).
Figure 3.11: Simulation results for optimizing $T^{total}_{H_{max}}$ with Hmax = 60 (optimal solution and proposed approach versus number of hidden units).
Figure 3.12: Simulation results for optimizing $T^{total}_{H_{max}}$ with Hmax = 80 (optimal solution and proposed approach versus number of hidden units).
Figure 3.13: The iteration time of the CC algorithm mapped on KNCs of different sizes (64, 256, 1024, and 4096 processors) when $T^{total}_{H_{max}}$ is optimized.
Figure 3.14: The iteration time of the CC algorithm mapped on KNCs of different sizes (64, 256, 1024, and 4096 processors) when $T_{H_{max}}$ is optimized.
The first issue we considered here was the effect of Hmax on the performance of smaller
instances of the network (with 0 through Hmax − 1 hidden units). We mapped a CC
learning architecture with different Hmax values and compared the resulting processor
assignments and iteration times. We repeated these simulations for both optimization
criteria. The results we report here are based on the NETTalk [64] application mentioned
earlier. We assumed a parallel architecture with 32 processors. We used different Hmax
values, namely 50, 100, and 200. Figure 3.15 shows the iteration times of the sample
network when mapped using different Hmax values with $T^{total}_{H_{max}}(\alpha,\beta)$
as the optimization metric. The results show that the selection of Hmax is not critical
when $T^{total}_{H_{max}}(\alpha,\beta)$ is minimized. Notice that different Hmax values
resulted in similar iteration times for smaller instances of the network. Figure 3.16, on
the other hand, shows that the selection of Hmax plays a more significant role if
$T_{H_{max}}(\alpha,\beta)$ is the chosen optimization metric. Notice that for the sample
network, mapping with Hmax = 100 leads to lower iteration times than mapping with
Hmax = 50 or with Hmax = 200.
Figure 3.17 shows the folding digits selected for different Hmax values when
$T_{H_{max}}(\alpha,\beta)$ is the chosen criterion. The folded digits in each case are
indicated by shaded areas. As an illustration, let us consider how different Hmax values
affect the performance of the smallest network (a network with $H = 0$). Based on our
approach, the optimal mapping for a network with no hidden neurons keeps 3 input and 2
output segment digits of the VKNC label intact. Notice that the dimensionality of the
actual architecture is 5. Hence, there is a total of 5 address digits. The mapping with
Hmax = 100 also keeps these digits intact. However, the mappings with Hmax = 50 and
Hmax = 200 choose a different set of folding digits; i.e., they keep 2 input and 3 output
digits. Hence, folding using the most significant input digit would result in different
processor assignments for a network with no hidden units. This would clearly lead to
higher iteration times. Figure 3.18, on the other hand, shows that minimizing
$T^{total}_{H_{max}}(\alpha,\beta)$ with different Hmax values chooses the same folding
digits. We can attribute this significant behavior to the fact that the optimization
metric $T^{total}_{H_{max}}(\alpha,\beta)$ takes into account the iteration times of all
instances of the network. Thus, clearly, $T^{total}_{H_{max}}(\alpha,\beta)$ is a better
criterion than $T_{H_{max}}(\alpha,\beta)$.

Figure 3.15: Simulation results of mapping with $T^{total}_{H_{max}}$ as the optimization metric with different Hmax values (Hmax = 50, 100, and 200).
Figure 3.16: Simulation results of mapping with $T_{H_{max}}$ as the optimization metric with different Hmax values (Hmax = 50, 100, and 200).
So far we have considered only the mapping of networks with up to Hmax hidden 
units. The final issue we need to address here is how to map the algorithm if the number of 
hidden units allocated during the training grows beyond the assumed upper bound, Hmax. 
Notice that in general we cannot estimate the ultimate number of the hidden units with high 
accuracy. Our approach is to apply the same mapping procedure for a network with Hmax 
hidden units to larger networks. Notice that once the folding process concludes, portions 
of the na-digit address of the parallel architecture are dedicated to input and output segments 
of the VKNC address. For larger networks, we fold the corresponding VKNC until its 
address matches the pattern obtained during the mapping of a network with Hmax hidden 
units. For instance, Figure 3.18 shows that after mapping a network with Hmax = 80, 3 
digits have been allocated for the input segment of the VKNC address while 2 digits have 
been allocated to its output segment.
Figure 3.17: Folding digits with $T_{H_{max}}$ as the optimization metric for different Hmax values (Hmax = 0, 50, 100, and 200; folded and intact digits of the input and output segments).
Figure 3.18: Folding digits with $T^{total}_{H_{max}}$ as the optimization metric for different Hmax values (Hmax = 0, 50, 100, and 200; folded and intact digits of the input and output segments).
To examine the efficiency of the above approach, we mapped the network for the 
NETTalk application reported in [64] assuming Hmax = 40. Then, we applied the resulting 
mapping to larger networks (with up to 1000 hidden units). We compared iteration times 
obtained based on this mapping approach with the corresponding optimal values for 
different networks. The results of our simulations are depicted in Figures 3.19 and 3.20. 
These results show that our approach leads to near-optimal results even when the number 
of allocated hidden units grows beyond Hmax.
3.3 Mapping Adaptive Resonance Theory Networks
In this section we address the mapping of another unit-allocating ANN, the Adaptive 
Resonance Theory (ART) model. ART [17] [31] neural networks are capable of adaptively 
categorizing an arbitrary sequence of (input) patterns into several clusters [33]. The 
clustering is typically performed based on a similarity measure [33]. For instance, all input 
patterns close to a particular vector (in Euclidean distance) are classified into the same 
group. ART was originally introduced by Grossberg [31] as a phenomenon in human and 
animal cognitive information processing. This phenomenon has led to the development of a 
series of unsupervised and supervised neural networks capable of pattern clustering and 
recognition. The resulting ANN models include ART1 [18], which can categorize binary 
patterns; ART2 [17], which can group both analog and binary patterns; ARTMAP [16], a 
class of supervised neural networks capable of category recognition and multidimensional 
maps; FUZZY ARTMAP [14], [15], a modified version of ARTMAP which 
utilizes fuzzy neurons; and finally ART-EMAP [13], an ANN model capable of 
recognizing pattern classes after supervised and unsupervised learning.
Figure 3.19: Performance of the mapping approach for values beyond Hmax, 
assuming $T_{H_{max}}$ as the optimization criterion and Hmax = 40. (Plot of the optimal solution and the proposed approach versus the number of hidden units.)
Figure 3.20: Performance of the mapping approach for values beyond Hmax, 
assuming $T^{sum}_{H_{max}}$ as the optimization criterion and Hmax = 40. (Plot of the optimal solution and the proposed approach versus the number of hidden units.)
Most variants of the ART  model have been developed based on the original model, 
ART1. For instance, ART2 is a variant of ART1 which recognizes both binary and analog 
patterns. ARTMAP on the other hand incorporates several ART1 modules. In this section 
we introduce a systematic mapping approach for highly efficient parallel simulation of the 
ART1 training algorithm on KNC parallel architectures. Efficient simulation of other ART 
models can be derived from that of the ART1 model. The simulation of the ART1 model is 
a challenging problem since the corresponding network structure changes dynamically. We 
show that the general mapping approach we developed in Section 3.2 for efficient simulation 
of the CC learning (which is also a unit-allocating model) can be applied to the ART1 model 
as well. We modify the mapping for the ART1 model and examine its performance for a 
network running the NETTalk benchmark application.
3.3.1 Adaptive Resonance Theory 1 (ART1)
In this subsection we introduce the architecture and training algorithm of the ART1 
model. We adopt the abstraction of the ANN model presented in [33]. The ART1 network 
consists of an input layer and an output layer as shown in Figure 3.21. Each output neuron 
represents a category or a cluster. We denote the output pattern for the k-th input pattern 
by an L-dimensional vector: $Y^k = (y_1^k, y_2^k, \ldots, y_L^k)^T$. It is assumed that any arbitrary input 
pattern belongs to only one cluster. Hence, each input pattern leads to only one active 
output unit, a unit whose output is 1. The number of clusters in the ART1 model is not 
known a priori. Hence, the number of output units cannot be determined in advance. These 
units are allocated incrementally during the training until all input patterns can be classified 
based on some criterion which is discussed later. In fact, the ART1 model is capable of
accepting an infinite sequence of input patterns and allocating new clusters whenever 
necessary. Here, we use L to denote the number of output units that already exist in the 
network at a given time.
Figure 3.21: The ART1 model. (An input layer fully connected to the output units $y_1, y_2, y_3, \ldots, y_L$.)
The input units of the ART1 network are fully connected to its output units. The 
links between the input and output neurons are weighted. The k-th input pattern is expressed 
by an N-bit binary vector $X^k = (x_1^k, x_2^k, \ldots, x_N^k)^T$. The weight between input unit i and 
output unit j is denoted by $w_{ji}$. The N-bit vector $W_j = (w_{j1}, \ldots, w_{jN})^T$ is generally referred 
to as the prototype of cluster j. To classify each input pattern, say $X^k$, first the output of 
each existing output-layer unit, say j, is computed for the pattern as follows:
$$y_j = \frac{W_j^T X^k}{\|W_j\|^2} = \frac{\sum_{i=1}^{N} w_{ji} x_i^k}{\|W_j\|^2} \qquad (3.32)$$
where $\|W_j\|$ is the Euclidean norm of vector $W_j$. The output-layer unit with the highest 
output value is considered as the winner. Input pattern $X^k$ is classified in the cluster 
associated with the winning output unit. Since only one winner can exist for each pattern, 
this process is referred to as a winner-takes-all operation [33].
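The winner-takes-all step of equation (3.32) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `art1_outputs` and `winner_takes_all` and the sample prototypes are hypothetical, the output is taken to be $y_j = W_j^T X^k / \|W_j\|^2$, and an all-zero prototype is given output 0 to avoid a division by zero.

```python
def art1_outputs(prototypes, x):
    """Return y_j = (W_j . x) / ||W_j||^2 for every prototype W_j (hypothetical helper)."""
    outputs = []
    for w in prototypes:
        overlap = sum(wi * xi for wi, xi in zip(w, x))  # W_j^T X^k
        norm_sq = sum(wi * wi for wi in w)              # ||W_j||^2
        outputs.append(overlap / norm_sq if norm_sq else 0.0)
    return outputs


def winner_takes_all(prototypes, x):
    """Return (index of the unit with the highest output, list of all outputs)."""
    outputs = art1_outputs(prototypes, x)
    winner = max(range(len(outputs)), key=outputs.__getitem__)
    return winner, outputs


# Three illustrative 4-bit prototypes and one input pattern.
prototypes = [[1, 1, 0, 0], [0, 1, 1, 1], [1, 0, 0, 0]]
x = [0, 1, 1, 0]
winner, outputs = winner_takes_all(prototypes, x)
print(winner, outputs)
```

For binary prototypes, $\|W_j\|^2$ is simply the number of 1's in $W_j$, so the output rewards prototypes that overlap the input without containing many extra 1's.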
3.3.2 Training of the ART1 Model
Each iteration of the ART1 training algorithm involves examining the prototypes of 
the network for a given training pattern $X^k$. If the pattern $X^k$ matches the prototype of an 
existing cluster according to a predetermined similarity test, then the pattern is added to the 
cluster. The cluster's prototype is adjusted to include the features of the new pattern. 
Otherwise, a new cluster is created with the new training pattern as its prototype. The 
similarity test consists of two major phases, a winner-takes-all phase and a verification 
phase. As we stated earlier, during the winner-takes-all phase, the output of each 
output-layer unit is computed for the pattern $X^k$. Then, the output-layer unit with the highest 
output is selected as the winner. During the verification phase, the prototype of the winning 
unit is examined to see if it matches pattern $X^k$ well.
The verification phase consists of two steps. First, the prototype of the winning unit 
is examined to see if it satisfies the following inequality:
$$\frac{W_j^T X^k}{\|W_j\|^2} > \frac{\|X^k\|^2}{N} \qquad (3.33)$$
If the inequality holds, it indicates that a significant fraction of the bits in the vector $W_j$ and 
the bits in the vector $X^k$ match [33]. The second test, known as the vigilance test [33], 
involves testing the following inequality:
$$\frac{W_j^T X^k}{\|X^k\|^2} \geq \rho \qquad (3.34)$$
where $\rho$ ($0 < \rho < 1$) is a user-defined parameter known as the vigilance parameter. If the 
inequality holds, it indicates that a significant fraction of the 1's in the pattern $X^k$ and those 
in the prototype $W_j$ match. The two steps are used to verify if the input pattern $X^k$ can be 
classified in the cluster j (with the vector $W_j$ as its prototype).
If the prototype of the winning output unit ($W_j$) passes both tests of the verification 
phase, then $X^k$ is added to cluster j. In addition, the cluster's prototype is updated as 
follows:
$$W_j^{new} = W_j \wedge X^k \qquad (3.35)$$
where $\wedge$ denotes the component-wise logical AND. 
On the other hand, if the winning prototype passes the first test but fails the vigilance test, 
its corresponding output unit is deactivated (set to 0), and the output unit with the next 
highest output is selected as the new winner. This process is repeated until either a winner 
passes both verification tests or until all output units are deactivated. If no output unit 
passes the two tests, a new output unit (representing a new cluster) whose prototype is the 
pattern $X^k$ is allocated. The above procedure can be formally presented as follows: 
Algorithm 3.5:
/* ART1 Training Algorithm */
Begin:
For (Each training pattern $X^k$) Do
   1. maximum = 0;
   2. For (Clusters j = 1 to L) Do
      2.1 Compute: $y_j = W_j^T X^k / \|W_j\|^2$ ;
      2.2 If ( $y_j$ > maximum )
             maximum = $y_j$;
             winner = j;
          EndIf
      EndFor
   3.1 If ( $W_{winner}^T X^k / \|X^k\|^2 \geq \rho$ and $W_{winner}^T X^k / \|W_{winner}\|^2 > \|X^k\|^2 / N$ )
          $W_{winner} = W_{winner} \wedge X^k$ ;
   3.2 ElseIf ( $W_{winner}^T X^k / \|W_{winner}\|^2 > \|X^k\|^2 / N$ and active units still exist )
          $y_{winner}$ = 0;
          goto Step 2.2;
   3.3 Else
          $W_{L+1} = X^k$ ;
          L = L + 1;
       EndIf
EndFor
END
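The sequential procedure can be illustrated with a small Python sketch. It is not the author's implementation: the names `dot`, `output`, and `train_art1` are hypothetical, the first verification test is assumed to be $W_j^T X^k / \|W_j\|^2 > \|X^k\|^2 / N$ and the vigilance test $W_j^T X^k / \|X^k\|^2 \geq \rho$, both inequalities are cross-multiplied so that zero vectors cause no division by zero, and deactivation of a unit is modelled by removing it from a set of active units.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def output(w, x):
    """y_j = W_j^T X / ||W_j||^2, with 0 for an all-zero prototype."""
    norm_sq = dot(w, w)
    return dot(w, x) / norm_sq if norm_sq else 0.0


def train_art1(patterns, rho):
    """Cluster binary patterns; return (prototypes, cluster label per pattern)."""
    prototypes, labels = [], []
    for x in patterns:
        n, x_sq = len(x), dot(x, x)
        active = set(range(len(prototypes)))
        assigned = None
        while active:
            # Winner-takes-all among the units that are still active.
            winner = max(sorted(active), key=lambda j: output(prototypes[j], x))
            w = prototypes[winner]
            overlap, w_sq = dot(w, x), dot(w, w)
            first_ok = overlap * n > x_sq * w_sq   # first verification test
            vigilant = overlap >= rho * x_sq       # vigilance test
            if first_ok and vigilant:
                prototypes[winner] = [a & b for a, b in zip(w, x)]  # W AND X
                assigned = winner
                break
            if not first_ok:
                break                              # fall through to a new cluster
            active.discard(winner)                 # deactivate, try the next winner
        if assigned is None:
            prototypes.append(list(x))             # new cluster with X as prototype
            assigned = len(prototypes) - 1
        labels.append(assigned)
    return prototypes, labels


patterns = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0]]
prototypes, labels = train_art1(patterns, rho=0.7)
print(labels, prototypes)
```

In this run the fourth pattern first selects cluster 0 as the winner, fails the vigilance test there, and ends up in a newly allocated cluster.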
3.3.3 Mapping the ART1 Model on KNC's
The training algorithm of the ART1 model incrementally adds output neurons until 
all unclassified patterns are properly labeled. The resulting structure is a two-layer network 
as shown by Figure 3.21. The simulation of the ART1 model on parallel architectures has 
not been attempted before. As far as the mapping is concerned, the structure of the network
poses a challenging problem due to its dynamic nature. In principle, this mapping problem 
is similar to that of the CC algorithm we addressed in Section 3.2. Hence, we adopt a 
similar mapping methodology for parallel simulation of the ART1 model. In particular, we 
perform the mapping assuming an upper bound on the number of output units (denoted by 
Lmax). We prove that our scheme minimizes the iteration time of the training algorithm for 
an ART1 network with Lmax output units. Furthermore, through experimental results we 
show that the mapping leads to a very efficient simulation of other ART1 networks with 
fewer than Lmax output units.
We can estimate the growing parameter of the ART1 model better than that of the 
CC model. Clearly, the size of the ART1 network depends on the number of possible 
groups into which the input patterns can be classified. For example, we know that the 
number of groups cannot exceed the number of input patterns. However, we should 
mention that ART  models are capable of clustering an infinite stream of input data [33]. 
Under such circumstances obviously no limit can be set on the number of clusters unless 
we are interested only in certain clusters.
To utilize the parallel architecture efficiently, we first develop a parallel version of 
the training algorithm. We basically utilize the inherent parallelism of the original training 
algorithm, in particular Step 2 of Algorithm 3.5. Furthermore, we modify the algorithm by 
performing the vigilance test for all nodes concurrently. We believe that this would improve 
the overall efficiency of the algorithm. The modified algorithm is listed below:
Algorithm 3.6:
/* Parallel Training Algorithm for ART1 */
Begin:
For (Each training pattern $X^k$) Do
   1. For (Clusters j = 1 to L) Parallel Do
      1.1 Compute: $y_j = W_j^T X^k / \|W_j\|^2$ ;
      1.2 If ( $W_j^T X^k / \|W_j\|^2 > \|X^k\|^2 / N$ and $W_j^T X^k / \|X^k\|^2 \geq \rho$ )
             C-Flag_j = 1; /* Cluster j is a winning candidate */
          Else
             C-Flag_j = 0; /* Cluster j is not a winning candidate */
          EndIf
      EndFor
   2. maximum = 0; active = 0;
      For (Clusters j = 1 to L) Do
      2.1 If ( $y_j$ > maximum & C-Flag_j = 1 )
             maximum = $y_j$;
             winner = j;
             active = 1;
          EndIf
      EndFor
      2.2 If ( active = 1 )
             $W_{winner} = W_{winner} \wedge X^k$ ;
      2.3 Else
             $W_{L+1} = X^k$ ;
             L = L + 1;
          EndIf
EndFor
END
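One training iteration of the parallel formulation can be sketched as follows. This is a sequential Python illustration under stated assumptions, not a parallel implementation: the name `art1_parallel_step` is hypothetical, the candidate flag of a cluster is assumed to be set when it passes both verification tests, and the loop of Step 1 stands in for work that would run concurrently because its iterations do not depend on each other.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def art1_parallel_step(prototypes, x, rho):
    """Process one pattern; update prototypes in place and return the cluster index."""
    n, x_sq = len(x), dot(x, x)

    # Step 1: per-cluster outputs and candidate flags (independent for each j).
    outputs, flags = [], []
    for w in prototypes:
        overlap, w_sq = dot(w, x), dot(w, w)
        outputs.append(overlap / w_sq if w_sq else 0.0)
        flags.append(overlap * n > x_sq * w_sq and overlap >= rho * x_sq)

    # Step 2.1: reduction, the highest output among the winning candidates.
    maximum, winner = 0.0, None
    for j, (y, candidate) in enumerate(zip(outputs, flags)):
        if candidate and y > maximum:
            maximum, winner = y, j

    # Steps 2.2 and 2.3: update the winner or allocate a new cluster.
    if winner is not None:
        prototypes[winner] = [a & b for a, b in zip(prototypes[winner], x)]
    else:
        prototypes.append(list(x))
        winner = len(prototypes) - 1
    return winner


protos = [[1, 1, 1, 0], [1, 1, 0, 0]]
first = art1_parallel_step(protos, [1, 1, 0, 0], rho=0.7)   # both clusters are candidates
second = art1_parallel_step(protos, [0, 0, 1, 1], rho=0.7)  # no candidate, new cluster
print(first, second, protos)
```

Because every cluster is tested up front, the repeated deactivate-and-retry loop of the sequential algorithm collapses into a single maximum search over the flagged clusters.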
The parallel ART1 training algorithm can be modeled by a computational graph as 
shown in Figure 3.22. Nodes in this graph represent concurrent computations performed by 
the ART1 neurons during one training iteration while edges represent the communication 
among adjacent neurons. The first layer of nodes in this graph corresponds to concurrent 
operations taking place during Step 1 of Algorithm 3.6. These include the computation of 
$\|W_j\|$, $\|X^k\|$, and $W_j^T X^k$. The second layer represents computations which take place 
in Step 2.1 of the algorithm. Clearly, the nodes in this graph represent computations which 
can be decomposed into atomic operations.
The computations in the first layer of the task graph precede those in the second 
layer. Hence the mapping problem can be simplified to that of mapping a bipartite graph 
such as that shown in Figure 3.23. We refer to this graph as PARTmax. We assign n-digit 
($l_{max}$-digit) k-ary addresses to nodes of PARTmax representing input-layer (output-layer) 
neurons of the ART1 network, where:
$$n = \lceil \log_k N \rceil, \qquad l_{max} = \lceil \log_k L_{max} \rceil \qquad (3.36)$$
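As a worked instance of equation (3.36), the digit counts can be computed with integer arithmetic, which avoids floating-point rounding in the logarithm. The helper name `kary_digits` is hypothetical, and the radix k = 2 used below is an assumption made only for illustration; N = 196 and Lmax = 20 are values used later in this section.

```python
def kary_digits(count, k):
    """Smallest d with k**d >= count, i.e. ceil(log_k(count)), using integers only."""
    if count < 1 or k < 2:
        raise ValueError("count must be >= 1 and k must be >= 2")
    digits, capacity = 0, 1
    while capacity < count:
        capacity *= k
        digits += 1
    return digits


n = kary_digits(196, 2)      # address digits for N = 196 input units
l_max = kary_digits(20, 2)   # address digits for Lmax = 20 output units
print(n, l_max)
```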
The mapping of the PARTmax graph onto the parallel architecture is similar to that of the 
PCCmax we introduced in Subsections 3.2.3 and 3.2.4. First, the computational graph 
PARTmax is mapped to a virtual KNC architecture called VKNC. Then, the VKNC is folded 
until its size is equal to that of the actual KNC architecture and a metric associated with 
the training time of an ART1 network with Lmax output units is minimized. The
optimization metric we use here is the sum of iteration times of the training algorithm for 
all instances of an ART1 network with 1 through Lmax output units on the VKNC 
architecture. This metric is represented as $T^{sum}_{L_{max}}(\alpha, \beta)$, where
$$T^{sum}_{L_{max}}(\alpha, \beta) = \sum_{L=1}^{L_{max}} T_L(\alpha, \beta) \qquad (3.37)$$
$\alpha$ and $\beta$ denote the number of digits folded from the input and output segments of the VKNC 
node address after $\alpha + \beta$ foldings. $T_{L_{max}}(\alpha, \beta)$ denotes the iteration time of the largest
instance of the ART1 network after $\alpha + \beta$ foldings. We prove that the folding procedure 
minimizes the metric for a given assignment of tasks to processors of the VKNC 
architecture.
Figure 3.22: The ART1 task graph. (Computational Layer 1 feeding Computational Layer 2.)
Theorem 3.8: If the training algorithm for an ART1 network with Lmax output units and 
N input units is executed on a KNC architecture whose size matches that of the VKNC
after $\alpha$ input and $\beta$ output address digit foldings, then each iteration of the algorithm will take:
$$T_L(\alpha, \beta) = \max(k^{\beta - l_{max} + l}, 1)\left[(2k^{\alpha} + 1)t_r + \left(\frac{1}{c_1}\min(l, l_{max} - \beta) + (n - \alpha)\right)(\lceil k/2 \rceil t_c + 2t_r)\right] \qquad (3.38)$$
time units, where $c_1 = \max(k^{\beta - l_{max} + l}, 1)$.
Figure 3.23: The PARTmax graph. (Bipartite graph connecting input nodes 1 through N to output nodes 1 through Lmax.)
Proof: Step 1 of Algorithm 3.6 includes the computation of $\|W_j\|$, $\|X^k\|$, and 
$W_j^T X^k$. These tasks can be performed as sum-of-products operations. According to 
Theorem 2.2, each of these operations takes:
$$\max(k^{\beta - l_{max} + l}, 1)\left[2k^{\alpha}t_r + (n - \alpha)(\lceil k/2 \rceil t_c + 2t_r)\right] \qquad (3.39)$$
time units. Step 2.1 of the algorithm involves the search for the maximum of L numbers. 
Based on our original task assignment, these numbers are uniformly distributed on a 
$\min(l, l_{max} - \beta)$-dimensional KNC. We utilize the binary spanning tree of the KNC 
architecture (for definition see Subsection 2.1.3) to compute the maximum. Basically, each 
node of the BST computes its local maximum and sends it to its parent. Then, the global
maximum is computed through a fan-in process (using an algorithm similar to Algorithm 
2.2). Hence, the global maximum in Step 2.1 can be computed in
$$\min(l, l_{max} - \beta)(\lceil k/2 \rceil t_c + 2t_r) + \max(k^{\beta - l_{max} + l}, 1)\,t_r \qquad (3.40)$$
time units. Steps 2.2 and 2.3 take only $2t_r$ time units, which is relatively insignificant. 
Hence, the iteration time of the algorithm is given by:
$$T_L(\alpha, \beta) = \max(k^{\beta - l_{max} + l}, 1)\left[(2k^{\alpha} + 1)t_r + \left(\frac{1}{c_1}\min(l, l_{max} - \beta) + (n - \alpha)\right)(\lceil k/2 \rceil t_c + 2t_r)\right] \qquad (3.41)$$
where $c_1 = \max(k^{\beta - l_{max} + l}, 1)$. □
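The iteration time and the sum metric of equation (3.37) can be evaluated numerically with the sketch below. It rests on several assumptions that are not confirmed by the text: the iteration time is read as $T_L = c_1[(2k^{\alpha}+1)t_r + (\min(l, l_{max}-\beta)/c_1 + (n-\alpha))(\lceil k/2\rceil t_c + 2t_r)]$ with $c_1 = \max(k^{\beta - l_{max} + l}, 1)$, the quantity l is taken to be $\lceil \log_k L \rceil$, and the function names and the sample parameters (k = 2, n = 8, lmax = 5) are hypothetical.

```python
def digits(count, k):
    """ceil(log_k(count)) computed with integers only."""
    d, capacity = 0, 1
    while capacity < count:
        capacity *= k
        d += 1
    return d


def iteration_time(L, alpha, beta, n, l_max, k, t_r, t_c):
    """Iteration time T_L(alpha, beta) under the reading stated in the lead-in."""
    l = digits(L, k)                        # assumed: l = ceil(log_k L)
    c1 = max(k ** (beta - l_max + l), 1)    # c_1 as defined in the theorem
    hop = ((k + 1) // 2) * t_c + 2 * t_r    # ceil(k/2) * t_c + 2 * t_r
    compute = (2 * k ** alpha + 1) * t_r
    dims = min(l, l_max - beta) / c1 + (n - alpha)
    return c1 * (compute + dims * hop)


def sum_metric(alpha, beta, n, l_max, L_max, k, t_r, t_c):
    """Sum of the iteration times of all instances with 1..L_max output units."""
    return sum(iteration_time(L, alpha, beta, n, l_max, k, t_r, t_c)
               for L in range(1, L_max + 1))


# Illustrative values: t_r and t_c in the same time unit, binary addresses.
params = dict(n=8, l_max=5, k=2, t_r=0.01, t_c=4.0)
t_largest = iteration_time(20, alpha=3, beta=1, **params)
t_smallest = iteration_time(1, alpha=3, beta=1, **params)
total = sum_metric(alpha=3, beta=1, L_max=20, **params)
print(t_largest, t_smallest, total)
```

With these values the communication term dominates, which is why the number of intact input digits (n - alpha) has such a strong effect on the result.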
Corollary 3.1: The iteration time of the training algorithm for the largest instance of an 
ART1 network is given by:
$$T_{L_{max}}(\alpha, \beta) = k^{\beta}\left[(2k^{\alpha} + 1)t_r + \left((l_{max} - \beta)k^{-\beta} + (n - \alpha)\right)(\lceil k/2 \rceil t_c + 2t_r)\right] \qquad (3.42)$$
Theorem 3.9: $T_{L_{max}}(\alpha, \beta)$ is a convex function with respect to $\alpha$.
Proof: The second partial derivative of $T_{L_{max}}(\alpha, \beta)$ with respect to $\alpha$ is always positive: 
$$\frac{\partial^2 T_{L_{max}}(\alpha, \beta)}{\partial \alpha^2} = 2(\ln k)^2 k^{\alpha + \beta} t_r. \qquad \Box$$
The following corollary can be derived from Theorem 3.9 using a derivation similar to that 
utilized in the proof of Theorem 3.6.
Corollary 3.2: $T^{sum}_{L_{max}}(\alpha, \beta)$ is a convex function with respect to $\alpha$.
Define sets $S_1$ and $S_2$ as follows:
$$S_1 = \{L \mid 1 \leq L \leq L_{max},\ l \leq l_{max} - \beta\}, \qquad S_2 = \{L \mid 1 \leq L \leq L_{max},\ l > l_{max} - \beta\} \qquad (3.43)$$
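Theorem 3.10: The minimum value of $T^{sum}_{L_{max}}(\alpha, \beta)$ for a fixed $\beta$ occurs at 
$$\alpha = \begin{cases} \lceil B \rceil & \text{if } \max(0, n_a - l_{max}) < B < n \text{ and } T^{sum}_{L_{max}}(\lceil B \rceil, \beta) < T^{sum}_{L_{max}}(\lfloor B \rfloor, \beta) \\ \lfloor B \rfloor & \text{if } \max(0, n_a - l_{max}) < B < n \text{ and } T^{sum}_{L_{max}}(\lfloor B \rfloor, \beta) \leq T^{sum}_{L_{max}}(\lceil B \rceil, \beta) \\ \max(0, n_a - l_{max}) & \text{if } B \leq \max(0, n_a - l_{max}) \\ n & \text{if } n \leq B \end{cases} \qquad (3.44)$$
where
$$B = \frac{L_{max}(\lceil k/2 \rceil t_c + 2t_r)}{2 \ln k \; t_r \left(\|S_1\| + \sum_{L \in S_2} k^{\beta - l_{max} + l}\right)} \qquad (3.45)$$
and $\|S_1\|$ is the cardinality of set $S_1$.
Proof: $T^{sum}_{L_{max}}(\alpha, \beta)$ is a convex function with respect to $\alpha$. Hence, the minimum of the 
function for each $\beta$ can be obtained from the critical points of the function or the end points 
of the range of $\alpha$. The critical points of the function are obtained by setting the first partial 
derivative with respect to $\alpha$ to zero. This results in
$$\alpha = \frac{L_{max}(\lceil k/2 \rceil t_c + 2t_r)}{2 \ln k \; t_r \left(\|S_1\| + \sum_{L \in S_2} k^{\beta - l_{max} + l}\right)} \qquad (3.46)$$
and $\|S_1\|$ is the cardinality of set $S_1$. We denote the right hand side of the above equation 
by B. We are looking for an integer $\alpha$; hence, to find the minimum, we use $\lceil B \rceil$, $\lfloor B \rfloor$, or 
one of the end points of the range of $\alpha$, whichever leads to the lowest value when 
substituted in $T^{sum}_{L_{max}}(\alpha, \beta)$. This is formally represented by equation (3.47).
$$\alpha = \begin{cases} \lceil B \rceil & \text{if } \max(0, n_a - l_{max}) < B < n \text{ and } T^{sum}_{L_{max}}(\lceil B \rceil, \beta) < T^{sum}_{L_{max}}(\lfloor B \rfloor, \beta) \\ \lfloor B \rfloor & \text{if } \max(0, n_a - l_{max}) < B < n \text{ and } T^{sum}_{L_{max}}(\lfloor B \rfloor, \beta) \leq T^{sum}_{L_{max}}(\lceil B \rceil, \beta) \\ \max(0, n_a - l_{max}) & \text{if } B \leq \max(0, n_a - l_{max}) \\ n & \text{if } n \leq B \end{cases} \qquad (3.47)$$
□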
We use Theorem 3.10 to find the $\alpha$ and $\beta$ which minimize the metric $T^{sum}_{L_{max}}(\alpha, \beta)$. The 
solution space for $\alpha$ and $\beta$ for this optimization problem is similar to the closed sets shown 
in Figure 3.4. The only difference is that in this figure should be replaced by n. We find 
the minimum of the function with respect to $\alpha$ in terms of $\beta$ using Theorem 3.10. Clearly, 
this takes constant time. Then, we find the $\beta$ which leads to the overall minimum. The time 
complexity of the overall search for the minimum is $O(\log_k L_{max})$. This makes our 
mapping approach computationally very efficient.
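The search pattern described above can be sketched generically: for each candidate number of folded output digits, pick the best integer number of folded input digits from the floor and ceiling of a real-valued critical point (or from an end point of the range), then keep the best pair. The sketch below is only an illustration. The names `best_alpha` and `best_folding` are hypothetical, and the metric, critical point, and range used in the example are toy stand-ins rather than the expressions of Theorem 3.10.

```python
import math


def best_alpha(metric, critical_point, lo, hi):
    """Integer minimizer on [lo, hi] of a function that is convex in alpha."""
    if critical_point <= lo:
        return lo
    if critical_point >= hi:
        return hi
    floor_b, ceil_b = math.floor(critical_point), math.ceil(critical_point)
    return ceil_b if metric(ceil_b) < metric(floor_b) else floor_b


def best_folding(metric, critical_point_for, alpha_range_for, betas):
    """Return (value, alpha, beta) with the lowest metric over all candidate betas."""
    best = None
    for beta in betas:
        lo, hi = alpha_range_for(beta)
        alpha = best_alpha(lambda a: metric(a, beta), critical_point_for(beta), lo, hi)
        candidate = (metric(alpha, beta), alpha, beta)
        if best is None or candidate < best:
            best = candidate
    return best


# Toy convex metric (hypothetical): minimum near alpha = 2.6, beta = 1.
def toy_metric(alpha, beta):
    return (alpha - 2.6) ** 2 + (beta - 1) ** 2


value, alpha, beta = best_folding(
    toy_metric,
    critical_point_for=lambda b: 2.6,
    alpha_range_for=lambda b: (0, 5),
    betas=range(0, 4),
)
print(alpha, beta, value)
```

Each value of beta costs a constant number of metric evaluations, so the outer loop over at most lmax + 1 values of beta is what gives the logarithmic search cost quoted in the text.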
3.3.4 Performance of the Proposed Mapping
In this section we evaluate the performance of the proposed mapping. In particular, 
we map the learning algorithm for an ART1 network with Lmax output units on a KNC 
architecture using the proposed folding scheme. Then, we apply the resulting mapping to 
all instances of the network (networks with 1 through Lmax output units) and compute their 
iteration times. We then compare the iteration time of each instance of the network when 
mapped using our proposed scheme with its optimal iteration time. We wish to emphasize 
that our folding scheme minimizes $T^{sum}_{L_{max}}(\alpha, \beta)$ for an ART1 network with 
Lmax output units. As we shall show, such a mapping leads to near-optimal results for
other instances of the network with fewer than Lmax output units. The optimal iteration 
time for these instances of the network is computed through an exhaustive search for 
comparison purposes only. The results we report here are based on simulation of the 
network implementing the NETTalk application presented in [64]. In this case, there are 
196 input units and 1019 training patterns. We assume that computation time ($t_r$) and 
communication time ($t_c$) are 0.01 μs and 4 μs, respectively. We also assume that the parallel 
KNC architecture has 16 processors.
Figure 3.24: Simulation results for Lmax = 20. (Plot of the optimal solution and the proposed approach versus the number of prototypes.)
For our simulations, we performed the folding so as to minimize $T^{sum}_{L_{max}}(\alpha, \beta)$
assuming a network with Lmax output units. We then used the resulting mapping to 
compute the iteration times for ART1 networks with 1 through Lmax output units. We 
compared these results with their corresponding optimal values. We repeated our 
experiments with different Lmax values, namely Lmax = 20, 30, 40, and 100. Figures 3.24
through 3.27 show the results of our simulations. These results indicate that our mapping 
leads to near-optimal simulation of different instances of the ART1 network. Insignificant 
deviations from optimal results appear only for small networks.
Figure 3.25: Simulation results for Lmax = 30. (Plot of the optimal solution and the proposed approach versus the number of prototypes.)
Figure 3.26: Simulation results for Lmax = 40. (Plot of the optimal solution and the proposed approach versus the number of prototypes.)
Figure 3.27: Simulation results for Lmax = 100. (Plot of the optimal solution and the proposed approach.)
CHAPTER 4 
SUMMARY AND CONCLUSIONS
Artificial neural networks with their impressive processing power and inherent fault 
tolerance are viable candidates for solving many large scale scientific problems. These 
models offer a new processing paradigm which can be applied to problems intractable by 
conventional computing approaches. In this research we developed a formal methodology 
for efficient mapping of several contemporary ANN models on a popular class of parallel 
architectures. We considered parallel computers with k-ary n-cube topologies (KNCs) 
since they encompass both mesh-connected and hypercube-based parallel systems. Many 
existing parallel architectures are based on these network topologies. Our mappings were 
designed to efficiently simulate ANN models of arbitrary sizes on KNC architectures.
Unlike earlier studies, we did not restrict our scope to a particular ANN model. 
Rather, we developed mapping schemes for several important classes of ANNs. The 
classification we utilized grouped ANNs based on similarities of their computational 
structures. Although specific implementations might vary from one model to another within 
a class, general mapping steps are similar for a given class of ANNs. We presented a 
systematic mapping scheme for each class and showed how the mapping could be applied 
to specific ANNs within each class. This feature of our study should have a significant 
impact on the study of ANNs. With the availability of efficient implementations of a wide 
range of ANNs at their disposal, researchers in the field can effectively study different 
aspects of neural processing in order to develop powerful problem solving tools. 
Providing users with efficient implementations of a range of ANNs significantly enhances 
the practicality of these models in engineering problems.
One of the most important mapping criteria is to ensure that ANNs are efficiently 
simulated on a given parallel architecture. The amount of parallelism achieved depends on 
the decomposition of the ANN model on the target architecture. To achieve an efficient 
simulation, it is essential to choose an appropriate granularity for partitioning the ANN 
model. The problem of determining the proper amount of parallelism for mapping an ANN 
model has not been addressed in most earlier studies. Most studies impose some restrictions 
on the size of the ANN model or the target architecture. Our unique mapping approach on 
the other hand systematically selected an appropriate degree of parallelism leading to a 
highly efficient realization of the ANN model on the host architecture. The scheme took 
into account several factors for determining the most suitable task granularity. These 
factors included the computational structure of the ANN model and the characteristics of the 
target KNC architecture, specifically its computation time for atomic arithmetic operations 
and per word communication time between adjacent nodes. If necessary, the scheme could 
utilize a subset of the processors of a given KNC architecture (referred to as subcube). In 
such cases, the simulation on a subcube of the target architecture resulted in the most 
efficient simulation.
In Chapter 2, we proposed a formal methodology for optimal implementation of the 
backpropagation and similar algorithms on KNC's. The methodology was developed by 
generalizing the optimal mapping of a bipartite graph. Initially, the FFANN was mapped 
onto a virtual KNC. The extent of parallelism was such that simulation of the learning
phase on the virtual KNC was time optimal. Then, the virtual KNC was recursively folded 
until its dimension matched that of the physical architecture or a subcube thereof, depending 
on the physical size that provided the best execution time. A systematic folding process was 
developed to minimize execution time of each learning pass and to preserve the degree of 
redundancy. This mapping approach was very efficient since its computational complexity 
was a logarithmic function of the network size. We proved that our mapping methodology 
was time-optimal regardless of the structure of the FFANN.
In the same chapter, we showed that the methodology developed for FFANNs could 
be applied to several other classes of ANNs. In particular, we considered the mapping of 
Radial Basis Function (RBF) networks. We showed that the training algorithm for these 
networks had computational structures which were similar to that of the backpropagation 
algorithm. We considered both supervised and partially supervised training algorithms. We 
showed that the mapping of the fully supervised RBF was similar to that of a two-layer 
feedforward neural network trained with the backpropagation algorithm. We stated that 
the partially supervised scheme consisted of two steps. We introduced an efficient scheme 
for implementing the first step of the training algorithm. Initially, we provided 
a parallel version of the step. Then, we mapped the parallel version of the step onto a target 
KNC. Our implementation was based on partitioning the training set among processors of 
the KNC architecture. We proved that a uniform distribution of training patterns among 
processors of a given KNC architecture minimized the iteration time of the first step of the 
algorithm. Furthermore, we obtained the dimensionality of the subcube of the target KNC 
which resulted in the time-optimal execution of that step. We also showed that the second
step of the algorithm could be mapped as a two-layer feedforward network trained with 
the backpropagation algorithm.
In Chapter 3, we addressed the mapping of an important class of ANNs called 
unit-allocating neural networks. This class includes several important ANN models such as the 
Cascade Correlation learning algorithm and the Adaptive Resonance Theory 1 model. The 
common feature of these models is the dynamic nature of their architecture, which grows 
during the learning phase. (Hardware implementation of these models is difficult due to 
their dynamic structure.) We investigated the problem of efficiently implementing these 
unit-allocating ANNs on existing parallel systems. To our knowledge, this has not been 
previously attempted, perhaps due to the dynamic nature of the network architecture.
We first presented a methodology for parallel implementation of the cascade 
correlation neural network learning technique on KNCs. The method rendered efficient 
simulation of the algorithm through pipelining of several training patterns in parallel. 
Moreover, the efficiency of the implementation was enhanced by utilizing the inherent 
parallelism of the training algorithm. Both the output-unit and hidden-unit phases of the 
training were modeled using a computational task graph. A pipelined computational model 
called PCC was developed based on the task graph to accommodate the processing of 
several training patterns in parallel. This overcame the inefficiency of the original network 
due to its potentially large depth. We pointed out that the space complexity (memory 
requirements) of mapping the pipelined CC algorithm was increased by a constant factor 
at each processor with respect to the non-pipelined implementation of the algorithm.
The mapping involved implementation of the computational model of the pipelined 
algorithm, denoted by PCC, on an $n_a$-dimensional KNC. This consisted of two steps. First, 
the PCC model was mapped onto a virtual KNC (called VKNC) of compatible size. Since 
the number of hidden units was not known in advance, we considered an upper bound for 
this value (called Hmax) and used a PCC model developed based on a network with Hmax 
hidden units. The VKNC was then folded repeatedly until the dimensionality of the folded 
graph was less than or equal to $n_a$, depending on which size yielded minimum time. The 
folding process was designed to optimize a desired metric for a network with Hmax hidden 
units. We considered two optimization criteria: one represented the iteration time of the 
largest possible network and the other corresponded to the average iteration time of the 
algorithm for both training phases, calculated based on networks with 0 through Hmax 
hidden units. We showed that mapping based on either of the two criteria led to very 
efficient simulation of all instances of the network (but the smallest). We also examined the 
effect of Hmax on the performance of the simulated algorithm. We showed that the choice 
of Hmax was not critical if the sum of iteration times was to be optimized (assuming Hmax 
hidden units). The minimization of each metric (assuming Hmax hidden units) had 
computational complexity $O(\log_k(L + H_{max}))$ for a network with L output units. Based 
on the proposed mapping, task assignments for networks with 0 through Hmax hidden units 
were known a priori. Hence, no data migration or task rescheduling was needed as the 
number of hidden units grew.
In the same chapter, we used the parameters for a network implementing the 
benchmark application NETTalk to evaluate the performance of our mappings. We
presented experimental results which showed that our approach led to near-optimal results 
for networks with H hidden units, where H ≤ Hmax. In addition, we showed that the 
proposed scheme led to very efficient simulation of the training algorithm even if the number 
of hidden units exceeded Hmax. We also examined the effect of Hmax choice on the 
mapping.
Also in Chapter 3, we addressed the mapping of a popular clustering network called 
Adaptive Resonance Theory (ART). The training algorithm of the ART1 model 
incrementally adds output neurons until all training patterns are properly classified. We 
showed that the mapping of such a unit-allocating model was very similar to that of the 
cascade correlation training algorithm. We first developed a parallel version of the training 
algorithm, which exploited the inherent parallelism of the original training algorithm. 
We modified one step of the original algorithm, namely the vigilance test, in order to 
improve its overall efficiency. We then modeled the parallel version of the algorithm by a 
bipartite task graph, denoted by PARTmax. The overall mapping problem thus reduced to 
mapping PARTmax onto the target KNC. This task graph was implemented on the KNC 
architecture using the same approach applied to the task graph of the cascade correlation 
algorithm. We showed how an ART1 model with a certain number of output units could be 
optimally mapped on a KNC architecture. Through experimental results, we illustrated how 
this mapping could lead to very efficient simulation of other instances of the network with 
fewer output units. The proposed mapping was very efficient because of its logarithmic 
computational complexity.
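
For readers unfamiliar with ART1, the unit-allocating behavior referred to above can be sketched as follows. This is the textbook form of the ART1 vigilance test on binary patterns, written sequentially; it is not the modified vigilance step or the parallel version developed in this dissertation. The function names, the choice parameter `beta`, and the example patterns are assumptions introduced only for illustration.

```python
def vigilance_ok(pattern, template, rho):
    """Textbook ART1 vigilance test: |pattern AND template| / |pattern| >= rho."""
    overlap = sum(p & t for p, t in zip(pattern, template))
    norm = sum(pattern)
    return norm > 0 and overlap / norm >= rho


def art1_present(pattern, templates, rho, beta=0.5):
    """Present one binary pattern to the network.

    Existing output units are searched in decreasing order of the choice
    value |pattern AND template| / (beta + |template|). The first unit that
    passes the vigilance test learns the pattern (template := AND). If no
    unit passes, a new output unit is allocated. Returns the unit index.
    """
    def choice(j):
        overlap = sum(p & t for p, t in zip(pattern, templates[j]))
        return overlap / (beta + sum(templates[j]))

    for j in sorted(range(len(templates)), key=choice, reverse=True):
        if vigilance_ok(pattern, templates[j], rho):
            templates[j] = [p & t for p, t in zip(pattern, templates[j])]
            return j
    templates.append(list(pattern))  # allocate a new output unit
    return len(templates) - 1


if __name__ == "__main__":
    templates = []
    for pat in ([1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 0, 0]):
        print(art1_present(pat, templates, rho=0.7), templates)
```

The number of output units grows only when a pattern fails the vigilance test against every existing unit, which is the incremental growth that makes the mapping problem resemble that of cascade correlation.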
There still remain several open problems related to the mapping of ANNs. One needs 
to explore the mapping of other classes of ANNs that were not considered in this 
dissertation and determine whether the general mapping schemes developed here can be adapted 
to them. For instance, we believe that the mapping we developed here for simulating 
feedforward neural networks can be applied to other ANN models whose computations can 
be captured by a bipartite task graph. In addition, similar mapping schemes for other 
classes of parallel architectures, such as multiple bus systems, need to be developed.
Further, the fault tolerance aspects of simulated ANNs need to be studied. 
Artificial neural networks can inherently tolerate the failure of neurons or neuron connections 
to some extent. Several studies on the fault tolerance of neural networks appear in [4], [5], [6], 
[20], [63], [58], [70], and [76]. In these studies, several schemes were introduced to 
enhance the fault tolerance of ANNs. These schemes mostly modified the training algorithm 
of an ANN model or provided some degree of redundancy to tolerate the failure of some 
neurons or neuron connections. However, for simulated ANNs we should be concerned 
with the failure of processors or connections of the underlying parallel architecture rather 
than neuron or neuron connection failures. It might be possible to utilize the inherent fault 
tolerance capabilities of ANNs to tolerate the failure of processors. In particular, one needs to 
explore the possibility of modifying the training algorithms of ANNs such that they can 
tolerate processor failures. For this, one needs to utilize robust training algorithms that 
can tolerate the failure of multiple neurons, such as the algorithms introduced in [4].
REFERENCES
[1] W. Allen and S. Saha, "Parallel Neural Network Simulation Using Backpropagation 
for the es-kit Environment," in Proc. 1989 Conf. Hypercubes, Concurrent 
Computers and Applications, 1989, pp. 1097-1102.
[2] J. Alspector and R. B. Allen, "A Neuromorphic VLSI Learning System," in Advanced 
Research on VLSI, (P. Losleben, ed.), pp. 313-349, MIT Press, 1987.
[3] J. A. Anderson, "Neural Models for Cognitive Computations," IEEE Transactions 
on Systems, Man, and Cybernetics, SMC-13, pp. 799-815, 1983.
[4] B. Arad, A. El-Amawy, "On Fault Tolerant Training of Feedforward Neural 
Networks", to appear in Neural Networks, 1997.
[5] B. Arad, A. El-Amawy, "Robust Fault Tolerant Training of Feedforward Neural 
Networks", Proc. of 37th Midwest Symposium on Circuits and Systems, August 
1994.
[6] P. Bapat, "Design of Fault Tolerant Feed-forward Neural Networks", M.S. Thesis, 
Louisiana State University, May 1994.
[7] A. G. Barto and P. Anandan, "Pattern Recognizing Stochastic Learning Automata," 
IEEE Transactions on Systems, Man and Cybernetics, SMC-15, pp. 360-375, 
1985.
[8] G. Blelloch and C. Rosenberg, "Network Learning on the Connection Machine," in 
Proceedings 10th International Joint Conference on Artificial Intelligence, Milan, 
Italy, 1987.
[9] D. Berkey, Calculus, Saunders College Publishing, 1984.
[10] B. Bose, B. Broeg, Y. Kwon, and Y. Ashir, "Lee Distance and Topological 
Properties of k-ary n-cubes," IEEE Trans. on Computers, vol. 44, no. 8, pp. 1021-1030, 
August 1995.
[11] B. E. Boser, E. Sackinger, J. Bromley, Y. LeCun, and L. D. Jackel, "Hardware 
Requirements for Neural Network Pattern Classifier," IEEE Micro, vol. 12, pp. 32-40, 
1992.
[12] D. S. Broomhead and D. Lowe, "Multivariate Functional Interpolation and Adaptive 
Networks," Complex Systems, 2, pp. 321-355, 1988.
[13] G. A. Carpenter, "ART-EMAP: A Neural Network Architecture for Object 
Recognition by Evidence Accumulation," IEEE Transactions on Neural Networks, 
vol. 6, pp. 805-819, July, 1995.
[14] G. A. Carpenter, S. Grossberg, N. Markuzon, J. Reynolds, and D. Rosen, "Fuzzy 
ARTMAP: A Neural Network Architecture for Incremental Supervised Learning of 
Analog Multidimensional Maps," IEEE Transactions on Neural Networks, vol. 3, 
pp. 698-713, September, 1992.
[15] G. A. Carpenter, S. Grossberg, and D. Rosen, "Fuzzy ART: Fast Stable Learning and 
Categorization of Analog Patterns by an Adaptive Resonance System," Neural 
Networks, vol. 4, pp. 759-771, 1991.
[16] G. A. Carpenter, S. Grossberg, and J. H. Reynolds, "ARTMAP: Supervised Real-Time 
Learning and Classification of Nonstationary Data by a Self-Organizing Neural 
Network," Neural Networks, vol. 4, pp. 565-588, 1991.
[17] G. A. Carpenter and S. Grossberg, "ART2: Self-Organization of Stable Category 
Recognition Codes for Analog Input Patterns," Proceedings of IEEE International 
Conference on Neural Networks, vol. II, pp. 727-736, San Diego, 1987.
[18] G. A. Carpenter and S. Grossberg, "A Massively Parallel Architecture for a 
Self-Organizing Neural Pattern Recognition Machine," Computer Vision, Graphics, 
and Image Processing, vol. 37, pp. 54-115, 1987.
[19] G. Chartrand and L. Lesniak, Graphs & Digraphs, Wadsworth & Brooks/Cole 
Advanced Books & Software, 1986.
[20] L. C. Chu and B. W. Wah, "Fault Tolerant Neural Networks with Hybrid 
Redundancy", Proc. International Joint Conference on Neural Networks, 
pp. II-639-649, San Diego, CA, June 1990.
[21] T. G. Clarkson, et al., "The pRAM: An Adaptive VLSI Chip," IEEE Trans. on 
Neural Networks, vol. 4, pp. 408-411, May 1993.
[22] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, Wiley, New 
York, 1973.
[23] M. Duranton and J. A. Sirat, "Learning on VLSI: A General Purpose Digital 
Neurochip," International Conference on Neural Networks, Washington DC, 1989.
[24] A. El-Amawy and P. Kulasinghe, "Algorithmic Mapping of Feedforward Neural 
Networks onto Multiple Bus Systems," IEEE Transactions on Parallel and 
Distributed Systems, vol. 8, no. 2, pp. 130-136, February 1997.
[25] S. E. Fahlman and C. Lebiere, "The Cascade Correlation Learning Architecture," in 
Neural Information Processing Systems 2, D. S. Touretzky, ed., Morgan Kaufmann, 
1990, pp. 524-532.
[26] Y. Fujimoto, N. Fukuda, and T. Akabane, "Massively Parallel Architectures for 
Large Scale Neural Network Simulations," IEEE Trans. Neural Networks, vol. 3, 
no. 6, pp. 876-887, November 1992.
[27] P. T. Gaughan and S. Yalamanchili, "A Performance Model of Pipelined k-ary 
n-cubes," IEEE Trans. on Computers, vol. 44, no. 8, pp. 1059-1063, August 1995.
[28] J. Ghosh and K. Hwang, "Mapping Neural Networks onto Message-Passing 
Multicomputers," J. Parallel and Distributed Computing, vol. 6, pp. 291-330, 
Academic Press, 1989.
[29] J. Ghosh and K. Hwang, "Critical Issues in Mapping Neural Networks on 
Message-Passing Multicomputers," Int'l. Symp. on Computer Architecture, pp. 3-11, 
ACM/IEEE, 1988.
[30] H. P. Graf and P. DeVegvar, "A CMOS Implementation of Neural Network Model," 
in Advanced Research on VLSI, (P. Losleben, ed.), pp. 351-367, MIT Press, 1987.
[31] S. Grossberg, "Adaptive Pattern Classification and Universal Recoding: I. Parallel 
Development and Coding of Neural Feature Detectors," Biological Cybernetics, vol. 
23, pp. 121-134, 1976.
[32] D. Hammerstrom, "A VLSI Architecture for High-Performance, Low-Cost, 
On-Chip Learning," International Joint Conference on Neural Networks, vol. 2, pp. 
537-543, 1990.
[33] M. H. Hassoun, Fundamentals of Artificial Neural Networks, MIT Press, 
Cambridge, Mass, 1995.
[34] H. M. Hastings and S. Waner, "Neural Nets on The MPP," in Frontiers of 
Massively Parallel Scientific Computations, J. R. Fisher, Ed., NASA, July 1987.
[35] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College 
Publishing Company, 1994.
[36] D. Hebb, The Organization of Behavior, Wiley, New York, 1949.
[37] A. Hwang, et al., "A Two Level Pipeline RISC Processor Array for ANN," 
International Joint Conference on Neural Networks, pp. 137-140, 1990.
[38] K. Hwang and J. Ghosh, "Hypernet: A Communication-efficient Architecture for 
Constructing Massively Parallel Computers," IEEE Trans. on Computers, vol. C-36, 
pp. 1450-1465, Dec. 1987.
[39] J. S. Kane and T. G. Kincaid, "Optoelectronic Winner-Take-All VLSI Shunting 
Neural Network," IEEE Trans. on Neural Networks, vol. 6, pp. 1275-1279, 
September 1995.
[40] S. Kollias and A. Stafylopatis, "Parallel implementations of the backpropagation 
learning algorithm based on network topology," in I. Pitas (Ed.), Parallel 
Algorithms for Digital Image Processing, Computer Vision and Neural Networks, 
Wiley, pp. 233-258, 1993.
[41] S. Kollias and D. Anastassiou, "An Adaptive Least Squares Algorithm for the 
Efficient Training of Artificial Neural Networks," IEEE Trans. on Circuits and 
Systems, pp. 1092-1101, August 1989.
[42] V. Kumar, S. Shekhar, and M. B. Amin, "A Scalable Parallel Formulation of The 
Backpropagation Algorithms for Hypercubes and Related Architectures," IEEE 
Trans. on Parallel and Distributed Systems, vol. 5, no. 10, pp. 1073-1090, October 
1994.
[43] S.Y. Kung, Digital Neural Networks, PTR Prentice Hall, New Jersey, 1993.
[44] S.Y. Kung, J.N. Hwang, "A Unified Systolic Architecture for Artificial Neural 
Networks," Journal of Parallel and Distributed Computing, vol. 6, pp. 358-387, 
1989.
[45] S.Y. Kung, "Parallel Architectures for Artificial Neural Nets," Proc. Int'l. Conf. on 
Systolic Arrays, pp. 163-174, IEEE, 1988.
[46] J. L. Lansner, et al., "An Analog CMOS Chip Set for Neural Networks with 
Arbitrary Topologies," IEEE Trans. on Neural Networks, vol. 4, pp. 441-443, May 
1993.
[47] C. Lehmann, et al., "A Generic Systolic Array Building Block for Neural Networks 
with On-Chip Learning," IEEE Trans. on Neural Networks, vol. 4, pp. 400-407, 
May 1993.
[48] P. H. W. Leong and M. A. Jabri, "A Low-power VLSI Arrhythmia Classifier," 
IEEE Trans. on Neural Networks, vol. 6, pp. 1435-1445, November 1995.
[49] W. Lin, V. K. Prasanna, and K. W. Przytula, "Algorithmic Mapping of Neural 
Network Models onto Parallel SIMD Machines," IEEE Trans. on Computers, C-40, 
pp. 1390-1401, Dec. 1991.
[50] T. H. Madraswala, et al., "A Reconfigurable ANN Architecture," International 
Symposium on Circuits and Systems, 1991.
[51] Q. M. Malluhi, M. A. Bayoumi, and T. R. N. Rao, "Efficient Mapping of ANNs on 
Hypercube Massively Parallel Machines," IEEE Trans. on Computers, vol. 44, no. 
6, pp. 769-779, June 1995.
[52] W. S. McCulloch and W. Pitts, "A Logical Calculus of the Ideas Immanent in Nervous 
Activity," Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
[53] C. A. Micchelli, "Interpolation of Scattered Data: Distance Matrices and Conditionally 
Positive Definite Functions," Constructive Approximation, 2, pp. 11-22, 1986.
[54] J. Moody and C. Darken, "Learning with Localized Receptive Fields," in Proceedings 
of the 1988 Connectionist Models Summer School (Pittsburgh, 1988), D. 
Touretzky, G. Hinton, and T. Sejnowski, eds., pp. 133-143. Morgan Kaufmann, 
San Mateo, CA, 1989.
[55] J. Moody and C. Darken, "Fast Learning in Networks of Locally-Tuned Processing 
Units," Neural Computation, 1(2), pp. 281-294, 1989.
[56] N. Morgan, et al., "The Ring Array Processor: A Multiprocessing Peripheral for 
Connectionist Applications," Journal of Parallel and Distributed Computing, vol. 14, pp. 
248-259, 1992.
[57] R. Straub, D. Schwarz, and E. Schoneburg, "Simulation of Backpropagation 
Networks on Transputers," Neurocomputing, vol. 2, nos. 5 & 6, pp. 199-208, July 
1991.
[58] C. Neti, M. H. Schneider and E. E. Young, "Maximally Fault Tolerant Neural 
Networks and Nonlinear Programming", Proc. International Joint Conference on 
Neural Networks, vol. 2, pp. 483-496, San Diego, CA, 1990.
[59] T. Nordström and B. Svensson, "Using and Designing Massively Parallel Computers 
for Artificial Neural Networks," Journal of Parallel and Distributed Computing, 
vol. 14, no. 3, pp. 260-285, 1992.
[60] E. Parzen, "On Estimation of a Probability Density Function and Mode," Annals of 
Mathematical Statistics, vol. 33, pp. 1065-1076, 1962.
[61] H. Paugam-Moisy, "Parallel Neural Computing Based on Network Duplicating," in 
I. Pitas (Ed.), Parallel Algorithms for Digital Image Processing, Computer Vision 
and Neural Networks, Wiley, pp. 305-340, 1993.
[62] A. Petrowski and H. Paugam-Moisy, "Parallel Neural Computation Based on 
Algebraic Partitioning," in I. Pitas (Ed.), Parallel Algorithms for Digital Image 
Processing, Computer Vision and Neural Networks, Wiley, pp. 259-304, 1993.
[63] T. Petsche and B. W. Dickinson, "Trellis Codes, Receptive Fields, and Fault 
Tolerant, Self-Repairing Neural Networks," IEEE Trans. on Neural Networks, vol. 
1, no. 2, pp. 154-166, June 1990.
[64] D. S. Phatak and I. Koren, "Connectivity and Performance Tradeoffs in the Cascade 
Correlation Learning Architecture," IEEE Trans. on Neural Networks, vol. 5, no. 
6, pp. 930-934, November 1994.
[65] I. Pitas, Parallel Algorithms for Digital Image Processing, Computer Vision and 
Neural Networks, Wiley, 1993.
[66] T. Poggio and F. Girosi, "A Theory of Networks for Approximation and Learning," 
A. I. Memo 1140, MIT, Cambridge, Mass, 1989.
[67] D. A. Pomerleau, G. L. Gusciora, D. S. Touretzky, and H. T. Kung, "Neural 
Networks at Warp Speed: How We Got 17 Million Connections Per Second," in 
Proceedings of International Conference on Neural Networks, San Diego, CA, June 
1988.
[68] M. J. D. Powell, "Radial Basis Functions for Multivariate Interpolation: A Review," 
in Algorithms for the Approximation of Functions and Data, J. C. Mason and M. 
G. Cox, eds., Clarendon Press, Oxford, England, 1987.
[69] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Internal 
Representations by Error Propagation," in D. E. Rumelhart, J. L. McClelland, and 
the PDP Research Group, Parallel Distributed Processing, Explorations in the 
Microstructure of Cognition, Volume 1: Foundations, Cambridge, MA: MIT Press, 
pp. 318-364, 1986.
[70] C. H. Sequin and R. Clay, "Fault Tolerance in Feed-forward Artificial Neural 
Networks", TR-90_031, July 1990.
[71] N. B. Serbedzija, "Simulating Artificial Neural Networks on Parallel Architectures," 
IEEE Computer, pp. 56-63, March 1996.
[72] G. M. Shepherd and C. Koch, "Introduction to Synaptic Circuits," in The Synaptic 
Organization of The Brain, G. M. Shepherd (ed.), Oxford University Press, New 
York, pp. 3-31, 1990.
[73] R. K. Sitaraman and N. K. Jha, "Optimal Design of Checks for Error Detection and 
Location in Fault-Tolerant Multiprocessor Systems," IEEE Trans. on Computers, 
vol. C-42, pp. 780-793, July 1993.
[74] M. A. Sivilotti et al., "Real Time Visual Computations Using Analog CMOS 
Processing Arrays," in Advanced Research on VLSI, (P. Losleben, ed.), pp. 295-312, 
MIT Press, 1987.
[75] D. F. Specht, "Probabilistic Neural Networks," Neural Networks, 3(1), pp. 109-118, 
1990.
[76] M. Stevenson, R. Winter, B. Widrow, "Sensitivity of Feedforward Neural Networks 
to Weight Errors," IEEE Trans. on Neural Networks, vol. 1, no. 1, pp. 71-80, 
March 1990.
[77] H. S. Stone, High-Performance Computer Architecture, 3rd edition, 
Addison-Wesley Publishing Company, 1993.
[78] A. P. Thakkar, "Content-addressable, High-Density Memories Based on Neural 
Network Models," Technical report, Jet Propulsion Laboratory, JPL D_4166, 
March 1987.
[79] A. Torralba, F. Colodro, E. Ibanez, and L. G. Franquelo, "Two digital circuits for 
a fully parallel stochastic neural network," IEEE Trans. on Neural Networks, vol. 
6, pp. 1264-1268, September 1995.
[80] R. Ulrich, "SYNAPSE - A Neurocomputer that Synthesizes Neural Algorithms on 
Parallel Systolic Engine," Journal of Parallel and Distributed Computing, vol. 14, pp. 
306-318, 1992.
[81] Anujan Varma, "Combinatorial Design of Bus-based Interconnection Structures", 
Research report RC12550, IBM Research Division, Sep. 1986.
[82] B. Vinnakota and N. K. Jha, "Diagnosability and Diagnosis of Algorithm-Based 
Fault-Tolerant Systems," IEEE Trans. on Computers, vol. C-42, pp. 924-937, 
August 1993.
[83] B. W. Wah and L. C. Chu, "Efficient Mapping of Neural Networks on 
Multicomputers," Proc. Int'l Conf. on Parallel Proc., pp. I-234-241, Aug. 1990.
[84] T. Watanabe, et al., "A Single 1.5-V Digital Chip for a 10^6-Synapse Neural 
Network," IEEE Trans. on Neural Networks, vol. 4, pp. 387-393, May 1993.
[85] D. Wettschereck and T. Dietterich, "Improving the Performance of Radial Basis 
Function Networks by Learning Center Locations," in Advances in Neural 
Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, 
eds., pp. 1133-1140. Morgan Kaufmann, San Mateo, Calif. 1991.
[86] H. Yoon and J. H. Nang, "Multilayer Neural Networks on Distributed-memory 
Multiprocessors," in Proc. Int. Conf. Neural Networks (IEEE/EEC), 1991, 
pp. 669-672.
VITA
Behnam Seyed Arad received his B.S. degree in Electrical Engineering with honors 
from the University of Massachusetts, Lowell in 1988. He received his M.S. degree in 
Electrical Engineering from Purdue University, West Lafayette, Indiana, in 1990. His 
research interests include high performance computer architecture, parallel processing, 
neural networks, and data communications.
DOCTORAL EXAMINATION AND DISSERTATION REPORT
Candidate: Behnam Seyed Arad
Major Field: Electrical Engineering
Title of Dissertation: Efficient Mapping of Neural Network Models on a Class of Parallel Architectures
Date of Examination: April 4, 1997
Approved:
Major Professor and
Graduate Schoolof
EXAMINING COMMITTEE:
