Investigation of hybrid message-passing and shared-memory architectures for parallel computer : a case study : turbonet by Li, Xi
New Jersey Institute of Technology
Digital Commons @ NJIT
Dissertations Theses and Dissertations
Spring 1995
Investigation of hybrid message-passing and shared-
memory architectures for parallel computer : a case
study : turbonet
Xi Li
New Jersey Institute of Technology
Follow this and additional works at: https://digitalcommons.njit.edu/dissertations
Part of the Electrical and Electronics Commons
This Dissertation is brought to you for free and open access by the Theses and Dissertations at Digital Commons @ NJIT. It has been accepted for
inclusion in Dissertations by an authorized administrator of Digital Commons @ NJIT. For more information, please contact
digitalcommons@njit.edu.
Recommended Citation
Li, Xi, "Investigation of hybrid message-passing and shared-memory architectures for parallel computer : a case study : turbonet"
(1995). Dissertations. 1119.
https://digitalcommons.njit.edu/dissertations/1119
 
Copyright Warning & Restrictions 
 
 
The copyright law of the United States (Title 17, United 
States Code) governs the making of photocopies or other 
reproductions of copyrighted material. 
 
Under certain conditions specified in the law, libraries and 
archives are authorized to furnish a photocopy or other 
reproduction. One of these specified conditions is that the 
photocopy or reproduction is not to be “used for any 
purpose other than private study, scholarship, or research.” 
If a, user makes a request for, or later uses, a photocopy or 
reproduction for purposes in excess of “fair use” that user 
may be liable for copyright infringement, 
 
This institution reserves the right to refuse to accept a 
copying order if, in its judgment, fulfillment of the order 
would involve violation of copyright law. 
 
Please Note:  The author retains the copyright while the 
New Jersey Institute of Technology reserves the right to 
distribute this thesis or dissertation 
 
 
Printing note: If you do not wish to print this page, then select  
“Pages from: first page # to: last page #”  on the print dialog screen 
 
  
 
 
 
 
 
 
 
 
 
 
 
The Van Houten library has removed some of the 
personal information and all signatures from the 
approval page and biographical sketches of theses 
and dissertations in order to protect the identity of 
NJIT graduates and faculty.  
 
INFORMATION TO USERS
This manuscript has been reproduced from the microfilm master. UMI 
films the text directly from the original or copy submitted. Thus, some 
thesis and dissertation copies are in typewriter face, while others may 
be from any type of computer printer.
The quality of this reproduction is dependent upon the quality of the 
copy submitted. Broken or indistinct print, colored or poor quality 
illustrations and photographs, print bleedthrough, substandard margins, 
and improper alignment can adversely afreet reproduction.
In the unlikely event that the author did not send UMI a complete 
manuscript and there are missing pages, these will be noted. Also, if 
unauthorized copyright material had to be removed, a note will indicate 
the deletion.
Oversize materials (e.g., maps, drawings, charts) are reproduced by 
sectioning the original, beginning at the upper left-hand comer and 
continuing from left to right in equal sections with small overlaps. Each 
original is also photographed in one exposure and is included in 
reduced form at the back of the book.
Photographs included in the original manuscript have been reproduced 
xerographically in this copy. Higher quality 6" x 9" black and white 
photographic prints are available for any photographs or illustrations 
appearing in this copy for an additional charge. Contact UMI directly 
to order.
A Bell & Howell Information Company 
300 North Z eeb  Road. Ann Arbor. Ml 48106-1346 USA 
313-'761-4700 800/521-0600
UMI Number: 9539583
Copyright 1995 by 
Li, Xi 
All rights reserved.
UMI Microform 9539583 
Copyright 1995, by DMI Company. All rights reserved.
This microform edition is protected against unauthorized 
copying under Title 17, United States Code.
UMI
300 North Zeeb Road 
Ann Arbor, MI 48103
ABSTRACT
INVESTIGATION OF 
HYBRID M ESSAGE-PASSING AND SHARED-M EM ORY ARCHITECTURES  
FOR PARALLEL COM PUTERS. A CASE STUDY: TURBONET
by 
Xi Li
Several DSP (Digital Signal Processing) algorithms are developed for the NJIT 
TurboNet parallel computer. In contrast to other parallel computers that implement 
exclusively in hardware either the message-passing or the shared-memory communication 
paradigm, or employ distributed shared-memory architectures characterized by inefficient 
implementation o f  the shared-memory paradigm, the hybrid architecture o f TurboNet 
supports direct, efficient implementation o f  both paradigms. Three versions o f  each 
algorithm are developed, if possible, corresponding to message-passing, shared-memory, 
and hybrid communications, respectively. Theoretical and experimental comparisons o f 
algorithms are employed in the analysis o f performance. The results prove that the hybrid 
versions generally achieve better performance than the other two versions. The main 
conclusion o f this research is that small-scale and medium-scale parallel computers 
should implement directly in hardware both communication paradigms, for high 
performance, robustness in relation to the application space, and ease o f  algorithm 
development. To facilitate theoretical comparisons, a methodology is developed for 
highly accurate prediction o f  algorithm performance. The success o f  this methodology 
proves that such prediction is possible for complex parallel computers, such as TurboNet, 
if  enough information is provided by the data dependence graphs.
INVESTIGATION OF 
HYBRID M ESSAGE-PASSING AND SHARED-M EM ORY ARCHITECTURES
FOR PARALLEL COM PUTERS.
A CASE STUDY: TURBONET
by 
Xi Li
A Dissertation 
Submitted to the Faculty of 
New Jersey Institute of Technology 
in Partial Fulfillment o f the Requirements for the Degree of 
Doctor of Philosophy
Department o f Electrical and Com puter Engineering
May 1995
Copyright © 1995 by Xi Li 
ALL RIGHTS RESERVED
APPROVAL PAGE 
INVESTIGATION OF 
HYBRID MESSAGE-PASSING AND SHARED-MEMORY ARCHITECTURES 
FOR PARALLEL COMPUTERS. A CASE STUDY: TURBONET 
Xi Li 
Dr. Sotirios G. Ziavras, Thesis Advisor 
	 Date 
Assistant Professor of Electrical and Computer Engineering, NJIT 
Dr. Constantine N. Manikopoulos, Thesis Advisor 	 Date 
Associate Professor of Electrical and Computer Engineering, NJIT 
Dr. Edwin Hou, Committee Member 	 'Date 
Assistrofessor of Computer and Information Science, NJIT 
Dr. Michael Palis, Committee Member 	 Date 
Associate Professo of 	 trical and Computer Engineering, NJIT 
Dr. Huifang Sun, Committee Member 	 Date 
Member of Technical Staff, David Sarnoff Research Laboratories 
Mengchu Zhou, Committee Member 	 Date 
Assistant Professor of Electrical and Computer Engineering, NJIT 
BIOGRAPHICAL SKETCH 
Author: 	 Xi Li 
Degree: 	 Doctor of Philosophy in Electrical Engineering 
Date: 	 May 1995 
Undergraduate and Graduate Education: 
• Doctor of Philosophy in Electrical Engineering, 
New Jersey Institute of Technology, Newark, NJ, 1995 
• Master of Science in Computer Engineering, 
Florida Institute of Technology, Melbourne, FL, 1989 
• Bachelor of Science in Electrical Engineering, 
South-China Institute of Technology, Guangzhou, P.R.China, 1983 
Major: 	 Electrical Engineering 
Presentation and Publications: 
1. X. Li, S. G. Ziavras and C. N. Manikopoulos, "Parallel DSP Algorithms on TurboNet: 
An Experimental System with Hybrid Message-Passing and Shared-Memory 
Architecture," Concurrency: Practice and Experience, accepted for publication, 
1995. 
2. X. Li, S. G. Ziavras and C. N. Manikopoulos, "Parallel Generation of Adaptive 
Multiresolution Structures for Image Processing," Concurrency: Practice and 
Experience, submitted for publication. 
3. R. Hross, S. G. Ziavras, C. N. Manikopoulos, N. J. Lad and X. Li, "A Defect 
Identification Algorithm for Sequential and Parallel Computers," IEEE 
International Symposium on Industrial Electronics, Athens, Greece, July 10-14, 
1995, to appear. 
iv 
This thesis is dedicated to 
my parents, my sister and Catherine Cai
v
A C K N O W LED G M EN T
The author would like to express his sincere gratitude to his advisors, Professors 
Ziavras and Manikopoulos, for their guidance, insight and support throughout the process 
o f producing this thesis.
My thanks are due to the other members o f my thesis committee, Drs. E. Hou, M. 
Palis, H. Sun, and M. Zhou.
I am thankful to Catherine Cai for her support and encouragement.
The work presented in this research was supported in part by the National Science 
Foundation under Grants CDA-9121475 and DMI-9500260.
TABLE OF CONTENTS
Chapter Page
1 IN TRO D U CTIO N..........................................................................................................................1
1.1 Parallel Processing System s................................................................................................. 1
1.1.1 Shared-Memory M ultiprocessors............................................................................. 2
1.1.2 Message-Passing M ulticomputers............................................................................ 3
1.2 Message-Passing Versus Shared-M em ory....................................................................... 5
1.3 Motivation, Objectives, and Contributions...................................................................... 8
1.4 O utline...................................................................................................................................... 9
2 TURBONET: A M ESSAGE-PASSING AND SHARED-M EM ORY
HYBRID A R CH ITECTU RE......................................................................................................10
2.1 The Texas Instruments TM S320C40 Processor............................................................10
2.2 Hydra Boards......................................................................................................................... 11
2.3 The Host System................................................................................................................... 14
2.4 The Hybrid Architecture o f  the TurboNet System ...................................................... 14
3 M ULTIDIM ENSIONAL CO NV O LU TIO N ......................................................................... 18
3.1 One-Dimensional Convolution A lgorithm s.................................................................... 18
3.2 Two-Dimensional Convolution A lgorithm .....................................................................20
3.3 Theoretical A nalysis............................................................................................................. 27
3.3.1 1-D C onvolution ....................................................................................................... 27
3.3.2 2-D C onvolution....................................................................................................... 30
TABLE OF CONTENTS
(Continued)
Chapter Page
3.4 Performance Results on T urboN et...................................................................................39
3.4.1 1-D C onvolu tion ........................................................................................................39
3.4.2 2-D C onvolu tion ....................................................................................................... 43
4 FAST FOURIER TR A N SFO R M .............................................................................................50
4.1 One-Dimensional FFT .........................................................................................................50
4.2 Two-Dimensional F F T ....................................................................................................... 54
4.3 Theoretical A nalysis............................................................................................................57
4.3.1 1-D F F T ...................................................................................................................... 57
4.3.2 2-D F F T ..................................................................................................................... 62
4.4 Performance Results on T urboN et.................................................................................... 65
4.4.1 1-D F F T ......................................................................................................................65
4.4.2 2-D F F T ..................................................................................................................... 70
5 M ATRIX M U LTIPLICATION ................................................................................................77
5.1 Matrix M ultiplication A lgorithm ......................................................................................77
5.2 Theoretical A nalysis............................................................................................................84
5.3 Performance Results on T urboN et.................................................................................. 95
6 ADAPTIVE M ULTIRESOLUTION STRUCTURES
FOR IMAGE SEG M ENTA TION........................................................................................... 103
TABLE OF CONTENTS
(Continued)
Chapter Page
6.1 Introduction...........................................................................................................................103
6.2 Algorithm to Generate RAG Pyramids on TurboN et.................................................105
6.3 Implementation on T urboN et........................................................................................... 108
6.4 Performance R esults............................................................................................................113
7 CONCLUSIONS...........................................................................................................................123
APPENDIX A DSP ALGORITHM S.......................................................................................... 125
REFERENCES...................................................................................................................................130
ix
LIST OF TABLES
Table Page
3.1 Estimated running time for 1-D convolution with method 1..........................................29
3.2 Estimated running time for 1 -D convolution with method 2 ..........................................29
3.3 The time ucal-1 (in nsecs) for 1-D convolution with method 1......................................30
3.4 The time ucal-2 (in nsecs) for 1-D convolution with method 2 ......................................30
3.5 The estimated running time o f the hybrid version o f the binary-tree algorithm  36
3.6 The estimated running time o f the message-passing version o f
the binary-tree algorithm .........................................................................................................37
3.7 The estimated running time o f the shared-memory version
o f the binary-tree algorithm ...................................................................................................37
3.8 The time U) (in nsecs) for the binary-tree algorithm ....................................................... 37
3.9 The time uxx2 (in nsecs) for the binary-tree algorithm .................................................. 38
3.10 The time uxx3 (in nsecs) for the binary-tree algorithm .................................................. 38
3.11 The time uxx4 (in nsecs) for the binary-tree algorithm...................................................... 38
3.12 Execution time for 1-D convolution with method 1 ....................................................... 40
3.13 Speedup o f  1-D convolution with method 1 ....................................................................40
3.14 Efficiency o f 1-D convolution with method 1 ................................................................. 40
3.15 Execution time for 1-D convolution with method 2 ........................................................ 40
3.16 Speedup o f  1 -D convolution with method 2 .....................................................................41
3.17 Efficiency o f 1-D convolution with method 2 ................................................................. 41
3.18 Execution time for the hybrid version o f the binary-tree algorithm ............................ 44
x
LIST OF TABLES
(Continued)
Table Page
3.19 Speedup o f the hybrid version o f  the binary-tree algorithm ......................................... 44
3.20 Efficiency o f the hybrid version o f the binary-tree a lgorithm ..................................... 44
3.21 Execution time for the message-passing version o f the binary-tree algorithm ..........44
3.22 Speedup o f the message-passing version o f the binary-tree algorithm ........................45
3.23 Efficiency o f the message-passing version o f the binary-tree algorithm .................... 45
3.24 Execution time for the shared-memory version o f the binary-tree algorithm ............45
3.25 Speedup o f the shared-memory version o f  the binary-tree algorithm ..........................45
3.26 Efficiency o f  the shared-memory version o f the binary-tree algorithm ...................... 45
3.27 The communication time (in psec) between P| and Pi+| in stage 2 .............................. 49
3.28 The communication time (in psec) between P, and P|.2 in stage 3 ................................49
3.29 The comm unication time (in psec) between P| and P|.4 in stage 4 .............................. 49
4.1 The estimated running time for 1-D FFT with m essage-passing.................................. 61
4.2 The estimated running time for 1-D FFT with shared-memory.................................... 61
4.3 The sequential time u (in nsecs) o f  1 -D F F T .....................................................................61
4.4 The time ump (in nsecs) o f 1-D FFT with m essage-passing......................................... 61
4.5 The time usm (in nsecs) o f 1-D FFT with shared-memory............................................ 62
4.6 The estimated running time for 2-D FFT for 2k < (N/P) with shared-m em ory.........64
4.7 The sequential time u (in nsecs) o f 2-D F F T .....................................................................65
4.8 The time uDRAM (in nsecs) for matrix transposition o f  2-D FFT ................................65
LIST OF TABLES
(Continued)
Table Page
4.9 Execution time for 1-D FFT with message-passing......................................................... 66
4.10 Speedup o f 1-D FFT with message-passing...........................................................66
4.11 Efficiency o f 1-D FFT with m essage-passing...................................................................67
4.12. Execution time for 1-D FFT with shared-m em ory...........................................................67
4.13 Speedup o f 1 -D FFT with shared-m em ory............................................................67
4.14 Efficiency o f 1-D FFT with shared-m em ory....................................................................67
4.15 Execution time for 2-D FFT (without matrix transposition).......................................... 71
4.16 Speedup o f 2-D FFT (without matrix transposition)................................................71
4.17 Efficiency o f 2-D FFT (without matrix transposition)....................................................71
4.18 Execution time for 2-D FFT (with matrix transposition)................................................ 71
4.19 Speedup o f 2-D FFT (with matrix transposition).......................................................72
4.20 Efficiency o f 2-D FFT (with matrix transposition)..........................................................72
4.21 Execution time for 2-D FFT with the improved m ethod ................................................ 74
4.22 Speedup o f 2-D FFT with the improved m ethod.......................................................75
4.23 Efficiency o f 2-D FFT with the improved m ethod ......................................................... 75
5.1 Estimated running time for matrix multiplication on 0-D and 1 -D hypercubes 87
5.2 The times u(2Pixx) (in nsecs) for matrix multiplication
on the 0-D and 1-D hypercubes............................................................................................ 87
5.3 The times uxx (in nsecs) for matrix multiplication on the 1-D hypercube.................... 87
xii
LIST OF TABLES
(Continued)
Table Page
5.4 Estimated running time for matrix multiplication with the 2-D hypercube................. 90
5.5 The times u u(4p xx) (in nsecs) for matrix multiplication with the 2-Dhypercube....... 90
5.6 Estimated running time for matrix multiplication with the 3-D hypercube.................93
5.7 The times u(gp xx) (in nsecs) for matrix m ultiplication with the 3-D hypercube..........94
5.8 Execution time o f matrix multiplication on the 1-D hypercube.................................. 95
5.9 Speedup o f matrix multiplication for the 1-D hypercube.............................................95
5.10 Efficiency o f matrix multiplication for the 1-D hypercube......................................... 96
5.11 Execution time o f  matrix multiplication on the 2-D hypercube.................................. 98
5.12 Speedup o f matrix multiplication for the 2-D hypercube..............................................98
5.13 Efficiency o f matrix multiplication for the 2-D hypercube......................................... 98
5.14 Execution time o f  matrix multiplication on the 3-D hypercube................................ 100
5.15 Speedup of matrix multiplication for the 3-D hypercube...........................................100
5.16 Efficiency o f matrix m ultiplication for the 3-D hypercube....................................... 100
5.17 Two implementation results o f matrix multiplication on the 0-D hypercube.
Cal.: calculation time only, Inp. & Cal.: data input and calculation tim e..................102
6.1 Function codes for communication operations................................................................111
6.2 Execution time and numbers o f pyramid levels for a single processor......................115
6.3 Execution time for the message-passing version with P = 2 ........................................ 116
6.4 Execution time for the shared-memory version with P = 2 ..........................................116
LIST OF TABLES
(Continued)
Table Page
6.5 Execution time for the hybrid version with P = 2 .........................................................117
6.6 Execution time for the hybrid version with P = 2 and separate
Phases 4 and 5 ......................................................................................................................... 117
6.7 Execution time for the message-passing version with P = 4 ........................................ 118
6.8 Execution time for the shared-memory version with P = 4 ....................................... 119
6.9 Execution time for the hybrid version with P = 4 .........................................................119
6.10 Execution time for the hybrid version with P = 4 and separate
Phases 4 and 5 ......................................................................................................................... 119
6.11 Execution time for the message-passing version with P = 8........................ 120
6.12 Execution time for the shared-memory version with P = 8 ........................................... 120
6.13 Execution time for the hybrid version with P = 8 .............................................................121
xiv
LIST OF FIGURES
Figure Page
1.1 The UMA m ultiprocessor m odel............................................................................................. 2
1.2 The NUM A m ultiprocessor m o d el..........................................................................................3
1.3 A m ulticom puter.........................................................................................................................4
1.4 Examples o f common network topologies............................................................................. 5
2.1 The Hydra block d iag ram ........................................................................................................ 13
2.2 The interconnection diagram of TurboNet........................................................................... 16
3.1 An example o f one-dimensional convolution..................................................................... 19
3.2 An example o f  2-D convolution............................................................................................. 21
3.3 The data dependencies for convolution o f  two 4x4 m atrices.......................................... 22
3.4 An example o f the binary-tree m ethod................................................................................. 23
3.5 An example o f the grid m ethod.............................................................................................. 25
3.6 An example o f the triangle m ethod....................................................................................... 26
3.7 DDG o f the binary-tree algorithm with 8 processors........................................................ 31
3.8 Execution time o f  1-D convolution (method 1 Vs method 2), with P = 8 ................... 41
3.9 A comparison o f  theoretical and experimental running times for method 1.................42
3.10 The difference between theoretical and experimental running times
for method 1.............................................................................................................................. 42
3.11 The difference o f  theoretical and experimental running times from method 2 ............ 43
3.12 The difference in execution times between the message-passing
and hybrid versions, in this order, for P = 8.......................................................................46
xv
LIST OF FIGURES
(Continued)
Figure Page
3.13 A comparison o f theoretical and experimental running times
o f  the hybrid version for P = 8 .............................................................................................. 46
3.14 The difference o f theoretical and experimental running times
from the hybrid version...........................................................................................................47
3.15 The difference o f theoretical and experimental running times
for the message -passing version ..........................................................................................47
3.16 The difference o f theoretical and experimental running times
for the shared-memory version ............................................................................................. 48
4.1 Data flow graph o f  an 8-point decimation-in-time D FT...................................................52
4.2 An example o f  accessing row data for the FFT, with P = 4 and 2k < (N /P)...............56
4.3 An example o f  transferring data for the FFT, with P = 4 and 2k > (N /P).................... 56
4.4 DDG o f the 1 -D FFT with P = 8 ........................................................................................... 58
4.5 DDG o f the 2-D FFT algorithm for 2k < (N /P )................................................................. 63
4.6 Execution time for 1-D FFT Vs array size, with P = 8 .....................................................68
4.7 Com parison o f theoretical and experimental running times
for 1-D FFT with m essage-passing...................................................................................... 68
4.8 Comparison o f theoretical and experimental running times
for 1-D FFT with shared-memory........................................................................................ 69
4.9 The difference o f theoretical and experimental running times
for the message-passing version o f 1-D F F T .....................................................................69
4.10 The difference o f theoretical and experimental running times
for the shared-memory version o f the 1-D FFT ................................................................ 70
xvi
LIST OF FIGURES
(Continued)
Figure Page
4.11 Comparison o f  theoretical and experimental running times
for 2-D FFT with shared-memory........................................................................................ 72
4.12 The difference o f  theoretical and experimental running times for 2-D FFT..............73
4.13 Execution time for 2-D FFT vs matrix size, with P = 8 ..................................................75
4.14 Speed up o f 2-D FFT vs matrix size, with P = 8...............................................................76
5.1 The decomposition for Cy.......................................................................................................79
5.2 The decomposition o f the matrices A, B and C for four processors............................81
5.3 The communication scheme according to Equation 5 .3 ................................................ 82
5.4 The decomposition o f the matrices A and B for the 3-D hypercube............................82
5.5 The decomposition o f matrices A, B and C for the 1-D hypercube............................ 83
5.6 DDG o f  matrix multiplication for the 1-D hypercube....................................................85
5.7 DDG o f  matrix multiplication for the 2-D hypercube....................................................89
5.8 DDG o f  matrix multiplication for P0 o f  the 3-D hypercube...........................................91
5.9 Performance comparision o f  the three versions for the 1 -D hypercube..................... 96
5.10 The difference o f  theoretical and experimental timings for the 1-D hypercube 97
5.11 Performance comparision o f  the three versions for the 2-D hypercube..................... 99
5.12 Difference o f theoretical and experimental timings for the 2-D hypercube............. 99
5.13 Derformance comparision o f  the three versions for the 3-D hypercube...................101
5.14 Difference o f theoretical and experimental timings for the 3-D hypercube............101
LIST OF FIGURES
(Continued)
Figure Page
6.1 An example for Phases 4 and 5 .......................................................................................... 107
6.2 Assignment o f data for an NxN input im age .................................................................... 110
6.3 A 32-bit long communication package...............................................................................I l l
6.4 Four types of images used in testing: (a) Uniform, (b) Horizontal half,
(c) Three objects, including the background, and (d) Checkerboard........................... 114
6.5 Execution time vs. image types for the hybrid version with P = 4 ............................. 120
6.6 Performance comparison o f  the three versions with image d and P = 2 .................... 121
6.7 Performance comparison o f the three versions with image d and P = 4 .................... 122
6.8 Performance comparison o f  the three versions with image d and P = 8 .................... 122
xviii
CHAPTER 1
INTRODUCTION
Parallel processing has become the most prominent technology in achieving high 
performance computational power. One o f the key problems to be solved with this 
technology is to determine how individual processes cooperate with each other efficiently 
when carrying out a task together. In general, message-passing and shared-memory are 
two techniques parallel computer systems use for coordination and communication, and 
they will be the focus o f  this dissertation. This chapter provides an introductory 
background in this fast growing research area. The motivations and objectives o f  our 
research, as well as an overview o f  its contributions, are also presented. An outline o f  the 
thesis is presented at the end.
1.1 Parallel Processing Systems
A parallel processing system consists o f multiple processors (or nodes), memory 
modules, peripherals, and a switching or interconnection network. There are two major 
categories in classifying parallel processing systems: shared-memory multiprocessors and 
message-passing multicomputers [24], The difference between them lies in how 
communication among nodes is carried out. The following two subsections give more 
details about these two categories.
1
1.1.1 Shared-M emory M ultiprocessors
In a shared-memory m ultiprocessor multiple processors share a common memory and 
some peripherals, and communication is performed through the shared memory by 
writing and reading it. Two shared-memory m ultiprocessor models are primarily used: 
the uniform-memory-access (UMA) model and the nommiform-memory-access (NUMA) 
model [36]. They differ in the way the memory and other resources are distributed. In the 
UMA model, as shown in Figure 1.1, all processors have equal access time to all memory 
locations in all shared-memory modules (marked as SM) under the condition o f no 
network congestion, and that is why it is called uniform-memory-access model. In the 
NUM A model in Figure 1.2, however, accessing the local shared memory (marked as 
LM) is faster than accessing a remote one, because there is no need for a processor to go 
through the switching network when accessing the former.
SMI/O SM, SM ,
S w itch in g  N e tw ork
Figure 1.1 The UMA m ultiprocessor model.
The most popular switching networks are single bus, crossbar, and multistage. 
The single bus can only handle one transaction at a time, employing a single source. The 
crossbar and multistage networks, built with extra hardware, can have more than one on­
3going transaction. Hence, the single bus has low cost and low performance while the 
other two provide high bandwidth with higher cost.
I’m
LM
LM ,
S w itch in g
N e tw ork
Figure 1.2 The NUM A multiprocessor model.
1.1.2 M essage-Passing M ulticomputcrs
A message-passing multicomputer consists o f multiple computers (or nodes) 
interconnected by a point-to-point network, and each node is an autonom ous computer 
including a processor, a private local memory, and possibly disks or I/O peripherals, as 
modeled in Figure 1.3. Internode communication is carried out by passing messages 
through the network while observing certain communication protocols. Such actions may 
involve m ultiple links (i.e. physical connections between nodes) and nodes, if  the source 
is not directly connected to the destination.
Some common network topologies in constructing interconnection networks for 
multicomputers are, as shown in Figure 1.4, binary tree, star, ring, mesh, hypercube, etc. 
They are also called static connection networks because all links between nodes are fixed 
after a network is built. Among these topologies, the hypercube has very high hardware
4complexity but it is very popular [38], A c/-dimensional hypercube consists o f 2'y nodes, 
each o f which is connected to one other node in each dimension. For example, a 0- 
dimensional hypercube, a 0-cube for short, has a single node with no communication 
channels, i.e. it is a standard sequential computer. A 1-cube is constructed from two 0- 
cubes by connecting them with a single communication channel, and a 2-cube is formed 
with two 1-cubes by connecting their corresponding nodes via an additional channel. 
Figure 1.4(e) shows a 3-cube, containing two 2-cubes, and each node in each 2-cube has a 
connection to the corresponding node in the other 2-cube. Hence, in general, a cZ-cube is 
constructed by connecting corresponding pairs o f nodes in two (c/-1 )-cubes with an 
additional channel. The number o f  nodes is P = 2d. The number o f  connections per node 
and the maximum distance between two nodes are d  = log2/5. The node number (i.e. its 
identification) is chosen to be a c/-bit binary code where its f '  bit represents the 
coordinate o f the node in the i,h dimension o f the hypercube. The binary addresses o f  any 
two neighboring nodes differ in a single bit. The number o f  bits that differ between the 
addresses o f  two nodes gives the distance between them.
I.MI.M
I.M
I.M I.M
I.M
M essa g e-P assin g  
In te rco n n ec tio n  N etw ork
Figure 1.3 A multicomputer.
(a) Binary tree (b) Star (c) Ring
(e) Hypercube(d) Mesh
Figure 1.4 Examples o f common network topologies.
1.2 M essage-Passing Versus Shared-M emory
In a general sense, the message-passing architecture is efficient for communicating small 
amounts o f  data in small distance. On the other hand, shared-memory is primarily used 
for I/O with the host, and for otherwise distant communications with large amounts o f 
data. Additionally, the shared-memory paradigm simplifies the development o f 
algorithms.
Chandra et. al. [10] studied the strengths and weaknesses o f these two 
fundamental mechanisms o f message-passing and shared-memory by comparing the 
performance o f equivalent, well-written message-passing and shared-memory programs 
running on similar hardware. Each application program was produced in two versions and 
its performance was measured on closely-related simulators o f a message-passing and a
shared-memory machine. They found that three o f the four shared-memory programs ran 
at roughly the same speed as their message-passing equivalents, even though their 
communication patterns were different. Therefore, the advantage o f message-passing over 
shared-memory, or vice-versa, is not clear-cut, and this strongly suggests that parallel 
computers should support, if possible, both communication mechanisms.
Many parallel computer systems have been developed since the early 1980’s, such 
as the nCUBE, Intel iPSC, Wavetracer, Cray Y-MP, M ultiMax, Symmetry, and 
Connection-M achines CM-2 and CM-5. They primarily implement either the message- 
passing or the shared-memory parallel processing paradigm. For example, the nCUBE, 
iPSC, and Connection Machine CM-2 [9] are message-passing machines, whereas the 
Cray Y-MP employs a shared-memory architecture. Many parallel Digital Signal 
Processing (DSP) algorithms have also been developed for either one o f  the two 
paradigms. For instance, Cvetanovic [7] analyzed the performance o f  a Fast Fourier 
Transform (FFT) algorithm for a shared-memory parallel architecture, and Fox [4] 
developed a matrix multiplication algorithm and an FFT algorithm targeting a message- 
passing architecture.
Recently, the complementary nature o f  the shared-memory and message-passing 
communication mechanisms, together with the advance o f underlying hardware 
technologies have led to a growing interest in hybrid architectures supporting both 
paradigms within a single parallel system. Emphasis has been given to the creation of 
distributed shared-memory systems, where all local memories o f  all processors in a 
message-passing system form a global address space. The underlying message-passing
7architecture is chosen because it supports system scalability. The Stanford FLASH 
Multiprocessor [11] is one o f the few hybrid architectures o f this kind that appeared in 
recent years. It integrates message-passing with shared-memory in a single architecture 
by using a programmable node controller, called MAGIC, to efficiently support both 
communication paradigms. MAGIC provides a tight coupling with the network and 
memory, and allows for efficient protocol handling through parallel implem entation o f 
control processing and data movement. Cache coherence for the shared memory can be 
carried out even during message transfer. Alewife [12] is another example o f  a hybrid 
architecture. Each Alewife node has a hardware controller to handle cache coherence, and 
a DMA unit (within the controller) to facilitate message passing. In addition, the main 
processor has an efficient memory-mapped interface to the controller that is used for 
controlling message sends. However, both systems are based on a m essage-passing 
architecture; shared-memory transactions are simulated by software.
In related work, Dowling et. al. [3] introduced and analyzed the Hybrid Array 
Ring Processor (HARP) architecture. HARP is an application specific architecture 
centered around a host processor, shared memory, and a set o f  memory-mapped 
processors connected into both an open backplane and a bidirectional systolic ring. The 
architecture implements both the message-passing and shared-memory paradigms and has 
been benchmarked through simulation o f matrix multiplication, FFT, QR Decomposition, 
and Singular Value Decomposition algorithms, assuming the TM S34082 floating point 
RISC processor. The HARP system has not been built. In these simulations, the shared 
memory is used only for the purpose o f distributing input data and collecting final results;
8data transfers during the execution of algorithms are all done through the systolic links of 
HARP. Thus, the HARP project does not integrate the message-passing and shared- 
memory paradigms during the execution o f algorithms.
1.3 Motivation, Objectives, and Contributions
It is our vision that a hardware combination o f message-passing with shared-memory has 
potential advantages for many applications, and therefore should be investigated 
extensively for small-scale and medium-scale parallel computers. The current version of 
our TurboNet system, the target system in this dissertation, is an asynchronous three- 
dimensional hypercube system composed o f eight powerful Texas Instruments 
TM S320C40 (C40 in brevity) DSP processor chips. We have built this system with two 
VME Hydra boards o f Ariel Corp., where each board contains four C 40’s and shared- 
memory, in addition to local memory attached to each C40. The shared-memory o f  each 
board is also global so that it can be accessed by any o f the eight processors via the 
shared VM E bus. TurboNet has a more general architecture that implements directly in 
hardware both communication paradigms, in contrast to the proposed FLASH, Alewife, 
and HARP systems. Details about TurboNet follow in the next section.
The objectives o f  this dissertation are: (1) to employ simultaneously both the 
message-passing and shared-memory paradigms in the development o f  parallel DSP 
algorithms, such as one-dimensional (l-D ) and two-dimensional (2-D) FFT and 
convolution, matrix multiplication, and image segmentation; (2) to develop highly 
accurate m ethodology for the theoretical assessment o f algorithms during their design; (3)
9to compare the performance o f our algorithms that employ the hybrid communication 
paradigm (i.e. both the message-passing and shared-memory paradigms) with the 
performance o f algorithms supporting only one o f the two basic paradigms, in order to 
illustrate the superiority o f the former.
Our results prove that several algorithms can take advantage o f  the hybrid 
architecture o f our target system in order to achieve very impressive speedups. Therefore 
the main conclusion o f this dissertation is that small-scale and medium-scale parallel 
computers should support directly in hardware, if  possible, both the message-passing and 
shared-memory paradigms for good performance and versatility in algorithm 
development.
1.4 Outline
This dissertation is organized as follows. Following this introduction, Chapter 2 provides 
a brief overview o f  our target TurboNet system, including a overview o f the C40 and the 
Hydra architecture, and the main characteristics o f  the TurboNet system. In Chapters 3, 4, 
5, and 6, algorithms for convolution, FFT, matrix multiplication, and image segmentation 
are introduced, and relevant performance analysis and experimental results are included. 
Chapter 7 presents the conclusions and further research objectives.
CHAPTER 2
TURBONET: A M ESSAGE-PASSING AND SHARED-M EM ORY  
HYBRID ARCHITECTURE
The TurboNet system is presented in this chapter. Three main aspects o f  the system are 
discussed here: the TM S320C40 DSP’s, the Hydra boards, and the host board. From the 
architecture point o f  view, the TurboNet system implements in hardware both the 
message-passing and shared-memory paradigms, and this is what distinguishes it from 
other systems.
2,1 The Texas Instruments TM S320C40 Processor
The TurboNet system currently contains eight C 40’s. The C40 has the following key 
features [32]:
1. Six 8-bit bidirectional half-duplex communication ports for high speed 
interprocessor communication.
2. Six-channel DMA coprocessor for concurrent I/O and CPU operation, thereby 
maximizing sustained CPU performance by alleviating the CPU o f 
burdensome I/O.
3. High-performance DSP CPU capable o f  275 MOPS and 320 M bytes/sec o f 
data-transfer rate.
10
11
4. Two identical external data and address buses supporting shared-memory 
systems and high data rate, single-cycle transfers. The 32 bit data buses are 
called the Global Bus and the Local Bus. Each bus is capable o f addressing 2 
Gwords (x32 bits) o f address space for a total o f 4 Gwords addressable by 
each C40.
5. On-chip analysis module supporting efficient, state o f the art parallel 
processing debugging.
6. On-chip program cache and dual-access/single-cycle RAM (2 KW ords) for 
increased memory access performance.
7. Separate internal program, data, and DMA coprocessor buses for support o f 
massive concurrent I/O o f program and data throughput, thereby m aximizing 
sustained CPU performance.
2.2 Hydra Boards
Hydra[31] is a single-slot 6U VME-based multi-digital signal processor (DSP) board, 
consisting o f a base card and a daughter card, each one containing two C40 chips. Its 
block diagram is shown in Figure 2.1.
The global or local bus o f each C40 is connected to a bank o f fast zero-wait-state 
static memory which is private to the specific C40, that is, any o f  the DSP’s can access its 
own SRAM without affecting the operations in any o f  the other DSP’s. Each o f the D SP’s 
is, in fact, an independent processing node. Each o f the DSP’s can have 16 Kwords, 64 
Kwords, or 256 Kwords (x32 bits) o f SRAM in each o f its two banks. In our system, the
12
SR A M ’s in Hydra 1 all contain 64 Kwords while there are 16 Kword SRAM ’s in Hydra 
2 .
Hydra has an Internal Shared Bus (ISB) to which the VME bus and the Global bus 
o f  each DSP are attached. The ISB provides access to a bank o f DRAM (1 M words, 4 
M words, or 16 M words in size; 1 Mwords in our system) and other shared resources. Any 
external VME master can also access the ISB. Isolation transceivers have been utilized to 
isolate all DSP modules and the VME bus from the ISB when they are not currently 
granted the ISB.
ISB requests from the VME bus always take higher priority over local DSP 
requests. ISB requests from the Hydra DSP’s are arbitrated on a rotating priority scheme 
to make it impossible for one internal master to lock-out or 'starve' the other devices o f 
the ISB. The ownership o f  the ISB is determined on a cycle-by-cycle basis. If  the current 
master is a local DSP, ownership is challenged after that DSP completes its current ISB 
cycle. If  that DSP is accessing DRAM or any ISB facility, the arbiter will test for new 
requesters between each access, even if  the current owner is operating the DRAM in page 
mode.
Each C40 has six communication ports. Three o f them are used within the board 
to implement a fully-connected system with four C40’s. The remaining three 
comm unication ports are brought out to the front panel o f Hydra for connections with 
multiple Hydra boards via microminiature, 25-pin D-sub connectors.
A m onitor program, including a bootstrap, is stored in the EPROM attached to 
DSP1. The bootstrap will set up the VME interface, including the VME memory window
13
as seen by the internal ISB masters and the base offset o f Hydra as seen by another VME 
bus master. In addition to the VME interface, the monitor also initializes the RS-232 port.
Internal S h a re d  B us
AD bus 
Port
Boot
EPROM
Comm
Ports
(2.4.5)
Comm
Ports
(0 , 1 ,2 )
Comm
Ports
(1,2.5)
EEPRO M
Comm
Ports
(3.4.5)
JTAG
SRAM 
16K. 64K, o r 256K  
(x32 bits)
SRAM 
16K or 64K 
(x32 bits)
SRAM 
16K, 64K o r 256K  
(x32 bits)
SRAM 
16K or 6 4 K 
(x32 bits)
SRAM 
16K. 6 4 K, o r 256k 
(x32 bits)
SRAM 
16K or 64K 
(x32 bits)
VMEbus
Interface
RS-232
Ports
T M S 320C 40 
D SP  #  a2
T M S 320C 40 
D S P  #  a4
T M S 320C 40
D S P # a 1
T M S 320C 40 
D S P  #  a3
Figure 2.1 The Hydra block diagram.
14
After DSP1 is boot up by bootstrap, it will boot up the other three D SP's through their 
parallel communication ports. All the necessary configuration information for Hydra is 
stored in an EEPROM  that is also attached to DSP1.
In our system, the Hydra device driver and the utility library are both made 
available to users. The former is a traditional UNIX device driver while the latter is a 
resource-m apping library, and they both are independent to each other. The driver can be 
a loadable one or it can be configured into the SunOS kernel.
2.3 The Host System
The host system, a SPARC CPU-2CE board, is a complete VM E-based SPARCstation 2 
architecture with Sbus expansion. It offers the same I/O interface as the SPARCstation 2, 
including DMA supported SCSI and Ethernet ports, along with audio, keyboard/mouse, 
and two serial channels with full modem support. Two Sbus sockets allow the installation 
o f standard Sbus modules, such as graphics frame buffers, or accelerators, or any other o f 
over 300 Sbus cards available. In our case, these two sockets are used for a m onitor and a 
cache memory expansion card.
2.4 The Hybrid Architecture o f the TurboNet System
The TurboNet system comprises a VME backplane, a SPARC CPU-2CE [29] host board, 
two Hydra boards, two hard disk drives, a floppy drive, a CDROM , a VM E bus logic 
analyzer, and a dumb terminal, as depicted in Figure 2.2. There are four bidirectional 
links between the two Hydra boards, each o f them connecting two C40’s together in the
15
two different boards. As mentioned earlier, three o f  the six communication ports o f  each 
DSP are dedicated to the interconnection o f the four DSP’s inside Hydra. A fourth port is 
used for an interboard connection. Ignoring the diagonal intraboard connections, the 
processors form a 3-dimensional hypercube. It is our intention to build a larger hypercube 
system with up to 64 C40’s, thus the diagonal intraboard connections will be removed in 
the future in order to facilitate this goal. Therefore, these diagonal connections are 
ignored in this research.
As shown in Figure 2.1, each Hydra board has a shared memory (the DRAM) 
connected to all four DSP’s by a single bus, the ISB. It can be accessed by the DSP’s 
from inside and outside the board. The DSP’s, however, when accessing the shared 
memory from the other board, would have to go through the VME bus before reaching it, 
i.e. the access time will be longer compared to accessing the local shared memory. This 
indicates that we have a UMA shared-memory model when dealing with one board and a 
NUM A shared-memory model when using both boards. And let us not forget that all 
these coexist with the message-passing hypercube connection in our TurboNet system. 
Both comm unication mechanisms (paradigms) can perform their functions at the same 
time, so that DSP’s can talk to each other through communication ports and have their 
DMA controllers competing for the shared memory via their ISB’s and/or VME bus. This 
hybrid architecture is like adding a new communication media to a system which has only 
one overcrowded communication space. It also provides options o f  communication 
paradigms to applications that may have interconnection bias in performance, such as 1-D 
FFT, as explained in Chapter 4.
D um b T erm ina l
P C  m  
VM E B u s  M onitor
O isk A rray
□
TurboNet Parallel System
T x
S u n  T erm ina l
P C  # 2  
H ydra  M onitor
a T H tT T ttT tT tn rT T T T T T T nrT '
H P  L a s e r J e t  4 
P rin te r
□ _ J "
D isk A rray
r
H ost:
S P A R C -2
S u n
T erm ina l
VM E B u s  Logic 
A n a ly z e r  1
VM E B u s Logic 
A n a ly z e r  2 □
-V M E  B U S -
Pgfl
P C # 1
VM E
B u s
M onitor
□ L
T I-C 40
a t
T I-C 40
a2
j t _ ----- iJ
I r ~  ■
T I-C 40
a 4
T I-C 40
a 3
T I-C 40 T I-C 40
b1 b2
T I-C 40 T I-C 40
b4 b3
Hydra 1 □
' H I
P C  #2  
H ydra 
M onitor
Hydra 2
Figure 2.2 The interconnection diagram o f TurboNet.
17
In the course o f  the system integration, two major problems occurred. The two 
Hydra boards are different versions, and the older one can take only the low base address 
in the VME bus while the recent version can use any base address available. The other 
problem is the malfunction o f  three communication ports in DSP7 (b3) o f the second 
Hydra board. Due to this reason, its connection with DSP4 (a4) o f the first Hydra board 
goes through DSP5 (b l)  o f the second Hydra board instead o f going to a4 directly. This 
is not shown in Figure 2.2.
CHAPTER 3
M ULTIDIM ENSIONAL CONVOLUTION
In this chapter algorithms are presented for 1-D and 2-D convolution on the TurboNet 
system. Theoretical analysis o f the algorithms and performance results o f their 
implementation are also included. All running times in this thesis are expresses in 
seconds unless specified otherwise. There are two algorithms for 1 -D convolution. They 
do not require any communication between processors. For 2-D convolution, all three 
comm unication paradigms have been implemented, namely message-passing, shared- 
memory and a combination o f both (the hybrid paradigm).
3.1 One-Dim ensional Convolution Algorithm s
Given two sequences o f numbers /[0], / [ l ] ,  ..., /[N -l] and g[0], g [ l] , . . . ,  g[M-1], 1-D 
convolution produces a sequence y o f  length N + M -1, where
T[/] = Z / [ 7 M / - . / ]  (3.1)
7=0
for / =  0, 1, ..., N+M-2.  Let us consider an example o f this convolution w i t h /  = {1,4,2} 
and g  = {1,2,3,1}. The result o f  convolution is shown in Figure 3.1.
From the figure we can see that each element o f y  is generated by one or more 
multiplications o f some elements from /  and g, and zero or more subsequent additions. 
Suppose we have two processors to perform this convolution. One way to implement this
18
19
is to let processor P0 calculate y[0], _y[l], y[4] and y[5], while let processor P, calculate 
>■[2] and _y[3]. Another way is to let processor P0 calculate y[0], ^[1] and _y[2], while let 
processor P\ calculate ^[3], y[4] and_y[5]. In method  I, if  there are P  processors, where 1 
< P < M, each one o f them calculates M/P  results in the first phase. For example, P{) 
calculates y[0], y [ l] , ..., y[M/P-\],  P, calculates y[M/P], y[M/P+\],  ..., y[2M/P-\] ,  etc. In 
the second phase, each o f the first |"(N -  1) P / M~\ processors calculates up to M/P  results 
from yfM ] to y[M+N-2], In method  2, P0 calculates the first (M+N-]) /P  results while P, 
calculates the second (M+N-l) /P  results, and so on; we assume that M/P  and (M+N-\) /P  
are integer numbers. In method 1 each processor has almost the same amount o f 
computation, but in method 2 processors that generate the central portion o f  the output 
array have more calculation load than the others, and therefore good load balancing is 
rarely accomplished. As a result, method 1 will generally have better performance than 
method 2. In both methods, as the algorithms were designed, no comm unication is 
necessary between processors.
y[0] = ./[0]g[0] = ix i
P [ l ] = / [ 0 M l ] + . / [ l M 0 ] =  1 x2 + 4x 1
y  [2] = / [ 0 ]  g[2] +Al ] g [ l ]  +./[2] g[0] = 1x3 + 4x2 + 2x 1 
y  [3] = /[ 0 ]  g[3] +./I1] g[2] +./[2] M l] = 1x1 + 4x3 + 2x2 
y [  4 ]=  ,/[ l]M 3 ]+ ./[2 ]M 2 ]=  4 x 1 + 2 x 3
y [  5 ]=  ./ [2 ]g [3 ]=  2x 1
Figure 3.1 An example o f one-dimensional convolution.
20
3.2 Two-Dimensional Convolution Algorithm
The basic definition o f  two-dimensional convolution is
A*i—I A '2 - l
(3.2)
/=0  /=0
where x = 0, 1 ,2 , N x+ M x- 2 , y  = 0, 1 ,2 , ..., N2+ M 2-2, and the lengths o f f [ n x, n2] and
g[m t, m2] in the first dimension are N x and A/,, respectively, whereas they are N2 and ;V/2, 
respectively, in the second dimension. In the following discussion we assume that N x and 
A/, are powers o f  2.
involving distinct pairs o f  rows from /  and g, respectively, are grouped together with 
distinct parallelograms, then four such parallelograms are obtained, as in Figure 3.2 (b). If 
we rotate this figure counter-clockwise by 90 degrees, then the parallelograms derived 
for the convolution o f  two 4x4 matrices and the data dependencies for obtaining the 
resulting matrix are shown in Figure 3.3. Again each parallelogram in Figure 3.3 
represents the convolution o f a pair o f rows from / and g, respectively.
As an example, let us consider two-dimensional convolution with /  =
shows the operations to be performed. If  operations
21
c[0, 0 
c[0, 1 
c[0, 2 
c[0, 3 
£'[0, 4 
c [  1, 0 
c[ 1, 1 
c [l , 2 
c[ 1, 3 
c [ l , 4 
c[2, 0 
c[2, 1 
c[2, 2 
c[2, 3 
c[2, 4
= /[0,0]g[0,0]
=./[o,o] g[0,i]+./[0,i]g[0,0]
=./[0,0] g[0,2]+ ./[0,1 ]g [0 ,1 ]+ /[0,2]g[0,0]
./[0 ,1 ]g[0,2]+ /[0 ,2]g[0 ,1 ]
./[0,2]g[0,2]
= /[0 ,0 ]g [l,0 ] + ./[!,0]g[0,0]
=./[0,0] g[ i , 1 ]+ /[0 ,1 ]g[ 1,0] +./[ i ,0]g[0, i M 1, i ]g[0,0]
=./[0,0] g[ 1,2 ]+ /[0 ,1 ]g[ 1,1 ]+./[0,2]g[ 1,0] +./[ 1,0]g[0,2]+/[ 1,1 ]g [0 ,1 ]+/[ 1,2]g[0,0]
./[0 ,l]g [l,2 ]+ ./[0 ,2 M l,l]  + /[M ]g [0 ,2 ]+ /[l,2 ]g [0 ,ll
./[0,2]g[l,2] + /[l,2]g[0,2]
./[U 0]g[l,0]
. / [ l ’0 ]g [ l ,l]+ /l l ,l ]g [ l ,0 ]
./[ 1,0]g[ 1,2]+/[ 1,1 ]g[ 1,1 ]+/[ 1,2]g[ 1,0]
./[ l ,l]g [ l,2 ]+ /[ l,2 ]g [ l.l]
./tl,2 ]g [l,2 ]
(a) An expansion o f  Equation 3.2 with the data a rra y s /a n d  g.
c[0 ,0 ] = ! 1x2'--....
c[0, 1] = j lx l  + 4 x 2 " -- ...
c [0 ,2 ] = j  1x5+  4x1 +3x2":
c|(). 3] = 4x5 + 3x1 j
C [ 0, 4] = 3x5 :
C[1,0] = i lxd" -.... ' + i 2x2
£■[1, 1] = i  1x4 + 4 x 6 "' -... +j 2x1 + 0x2 '
£■[1,2] = j 1x3 + 4x4 + 3x0 j +'; 2 x 5 +  0x1+5x2"
£•[1,3] = 4x3 + 3x4 j 0x5 + 5x1
o [ l ,4 ]  = 3x3 ! ""•••-...+ 5x5
o[2,0] = '"""-.J i 2x0'
£[2,1] = j  2x4 + 0x0"
£[2, 2] = j  2x3 + 0x4 + 5x0
£[2,3] = 0x3 + 5x4
£•[2,4] = " 5x3
(b) The convolution parallelograms.
Figure 3.2 An example o f 2-D convolution.
22
f [0 ,n 2 ]
g [0 ,m 2 ]  g [ 1 ,m 2 l  g [2 ,m 2 ]  g [3 ,m 2 ]
P , n 2 ]
7
f[2 ,n 2 ] Z 7 -Z 7 -Z 7 Y
/
f[1 ,n2]
/
/^r j-a -r j-u
v
c [0 ,y ]
V
c [i,y ]
V
c[2,y] c[3,y]
7
V
c [4 ,y ]
7
V
c[5,y]
7
v
c[6,y]
Figure 3.3 The data dependencies for convolution o f two 4x4 matrices.
Suppose we have a hypercube system o f four processors to perform this 
convolution. Each processor could be responsible for the calculations in one row o f 
parallelogram s in Figure 3.3. The final result will spread unevenly among the processors, 
so that processor P{) will have c [0 j/|, c[l,y], c\2,y] and c[3,y], whereas processors P u P2 
and P3 will have c[4,y], c[5,y] and c[6,y], respectively. This is shown in Figure 3.4. In the 
first stage o f  this algorithm, each processor performs the operations o f all the 
parallelograms assigned to it. In the second stage, processors P] and P3 send the results o f 
their first three parallelograms to processors P0 and P2, respectively, and the receivers add 
them to their corresponding results. In the final stage, processor P2 sends the results o f  its 
first two parallelograms to processor P0 and the third one to processor /*,, and again the
23
receivers add them to their corresponding results. We call this method o f  parallel 
convolution the binary-tree method  because o f the pairwise additions in rows.
g[0,m2] g[1,m2)
/  ]/■
g[2.m2] g[3,m2]
f[3,n2) P3
f[2.n2J
f[1,n2]
f[0.n2|
^rj^n^n-^n
' n ^ n - * b ^ b
c[5.y]
c[4,y]
c[6,y) P2
P1
PO
c[0,y]
V
c [ i ,y l
V
c tz .y ]
v
c[3,y]
Figure 3.4 An example o f the binary-tree method.
It is important to mention that data transfers can be avoided altogether by 
combining pairs o f columns, so that the number o f  parallelograms in each column 
becomes N ] (sim ilar to method l o f  1-D convolution). This is accomplished in Figure 3.3 
by com bining the linear arrays that produce the pairs (c[0j/], c[4.y]), (c[l,y). c[5vy]) and 
{c[2,y], c\6,y]). However, such an implementation is neglected here because programming 
such a task is more difficult in the general case. Also such an implementation could not 
be used to investigate the capabilities o f TurboNet’s hybrid communication architecture.
With P  processors (P > 1), where P is assumed to be a power o f  2, the binary-tree 
algorithm will have log2 / > + l stages, and each processor will have N\/P  rows of
24
parallelograms {N\/P is an integer). P{) will have c[0^v], c[l,y], c[M\-2+N\/P, y] o f the
final result, and Pi will have c[Mr  1+ i N }/P, y], ..., c[Mr 2+ (i+\)N\/P, y],  for / =  1 ,2 , 
. . . , / M .
Now suppose we have a 4-dimensional hypercube with 16 processors. To perform 
the two-dimensional convolution, we can assign each o f the 16 parallelograms in Figure
3.3 to a processor in the hypercube, so that the convolution o f/[3 , n2] and g[0. m2\ goes to 
processor P0, the convolution o f/[3 , /?2] and g [ l ,  m2] goes to processor P\ and so on, as 
shown in Figure 3.5. The arrows for values o f / a n d  g  do not represent systolic array 
processing as in [37, 38, 39], Each arrow implies that all processors on the corresponding 
row or column initially obtain the same values o f f  or g. After each processor finishes its 
1 -D convolution, except for those processors on the south and east borders, it has to send 
the result to its south-east neighbor, and the final result will be distributed among the 
processors on the south and east borders. We call this the grid method.
In the grid method, all communications are non-local, and they are most likely 
done efficiently through shared-memory. A binary Gray code is used for the assignment 
o f  parallelograms to processors, and each communication is between processors whose 
distance is equal to 2. The Gray codes for rows and columns are chosen, so that each set 
o f four processors in each quadrant belong to the same Hydra board. This scheme 
minimizes the number o f VME bus accesses.
25
g[1,m2] g[2,m2] g[3,m2] g[4,m2]
f[3,n2] ------>
f[2,n2] ------>
f[1,n2] ------>
fI0,n2] ------>
->c[6 ,y]
- >  c[5,y]
c[4,y]
c[0,y] c[1,y] c[2,y] c[3,y]
Figure 3.5 An example o f the grid method.
To also make efficient use o f the message-passing capability o f  our target system, 
we invent another method, called the triangle method.  The profile in Figure 3.3 is also a 
parallelogram. If we cut its right angle along the dashed line and paste it in the left side of 
the parallelogram, we have a square configuration. We then assign each parallelogram to 
a processor o f the hypercube as shown in Figure 3.6. In the figure each row and column 
number is transformed according to the binary reflected Gray code. To obtain the address 
o f  a hypercube processor, we concatenate the corresponding transformed column and row 
numbers. This process results in optimal mapping o f  a mesh to the hypercube. Again, the
26
address assignm ent in this algorithm intents to let all the processors in the same column 
o f Figure 3.6 share the same-board DRAM, and therefore avoid going through the VME 
bus.
g[3,m2]
g[0,m2] g[0,m2] g[3,m2]
  f[3,n2]
< ------ f[2,n2]
f[1,n2]
< ------ f[0,n2]
c[0,y] c[1,y] c[2,y] c[3,y]
Figure 3.6 An example o f the triangle method.
After each processor is done with its l-D  convolution, all processors in the lower- 
right triangle (including the diagonal processors) transfer their results to their lower- 
vertical neighbors, while the processors in the upper-left triangle send theirs to their upper 
neighbors, as indicated in Figure 3.6. The final result will be distributed among the
27
processors in the upper and lower sides o f the mesh. All data transfers are between 
neighbors.
In general, for both the grid and triangle methods each processor might be 
assigned more than one parallelogram. In this case, after calculating the convolutions of 
its assigned parallelograms, the processor has to transfer different parallelograms to 
different processors. We did not implement these two methods. Only the binary-tree 
method was implemented on TurboNet because o f its simplicity in programming. We. 
however, consider these as interesting problems for further research. Algorithms for all 
methods are shown in Appendix A.
3.3 Theoretical Analysis
Theoretical analysis for 1-D and 2-D convolution is presented in this section. Since the 
algorithms for 1-D convolution do not need any communication, no Data Dependence 
Graph (DDG) is presented. For the algorithm o f  2-D convolution, three communication 
strategies are discussed and the DDG is also provided for theoretical estimation o f the 
algorithm ’s performance.
3.3.1 1-D Convolution
1-D convolution o f two arrays is as defined in Equation 3.1. Its running time on a single 
processor is
tsq]( M )  = 2 M N t ca] = ( ) ( M 2) (3.3)
28
for M  > N, where M  and N  are the sizes o f the two data arrays, and /ta| is the time for one 
m ultiplication or one addition. Since no communication is involved, the execution times 
o f  both m ethods for 1-D convolution consist o f calculation times only. In both methods, 
let P, the number o f  processors, be a power o f  2, P < M  and M  > N. In method 1, let 
M / P  be an integer. Because the workload is balanced among the processors, each one of 
them performs MN/P  multiplications and additions. The total running time is therefore
( M )  = 2 M N / P / cal = 6>( M 2/ P )  (3.4)
In method 2, n = ( M  + N  -  \ )/  P is assumed to be an integer. Here each processor 
calculates n consecutive elements o f the result. Since the numbers o f  operations required 
for different output elements may differ, this method does not generally balance the 
workload among the processors. The processor(s) that produces the central portion o f the 
output array y  corresponds to the longest running time. The maximum number o f 
multiplications/additions the processor might have to perform is
N n - ( ( P / 2  + \ ) n -  M) ( ( P / 2  + \ ) n -  M + \ ) / 2  if  {P/ 2  - \ )n < N  < nP/2,  or Nn  if
N  < ( P/2 -  \ )n.  Then the running time o f method 2 is
' 2 ( N  n - ( ( P / 2  + \ ) n -  M) ( (  P/2 + \ )n -  M + l ) /2 ) / cal 
2 N n / tal
' ( ) ( M 2/ P )  if ( P / 2 - \ ) n  < N  <nP/2
(3.5)
0 ( M 2/ P )  if N  < ( P/ 2 - \ ) n
In Tables 3.1 and 3.2 the calculated running times for both m ethods o f  1-D
convolution are presented. They are estimated from the following two equations
and
29
<v.- ,A M )=  M N / P u ^ N . P )  (3.6)
(Nn-{(P/2  + 1);?- M){{P/2 + \ )n -  M  + 1 )/2  ) H c a l ( N , P) 
if (P/2 -  1 )n < N  < n P/2 
N n u ^ . 2{N, P) if N<(P/2-\)n
(3.7)
where ucat_\(N, P) and uca\.2(N, P ), obtained from experiment, represent the amount o f 
time for one m ultiplication and one addition, including the time to load and store the 
input and output data in methods 1 and 2, respectively. Due to the pipelined structure o f 
the C40 [33], the unit time may change according to the number o f operations needed to 
be performed. This is why the u parameters are functions o f  N  and P. Their values are 
derived from experiment and are listed in Tables 3.3 and 3.4.
Table 3.1 Estimated running time for 1-D convolution with method 1.
M=  512 N=  8 N=32 JV=128 7V=384
P= 1 0.005132 0.020509 0.082040 0.246151
P =  2 0.003689 0.013160 0.053300 0.174860
P =  4 0.001861 0.006710 0.028462 0.090753
soII 0.000954 0.003519 0.015033 0.046197
Table 3.2 Estimated running time for 1-D convolution with method 2.
M = 513 iV=8 iV= 32 A - 128 N=  384
P= 1 0.005132 0.020509 0.082040 0.246151
P =2 0.004470 0.014063 0.052554 0.154966
P =  4 0.002266 0.007400 0.032507 0.114776
P =  8 0.001112 0.003708 0.016281 0.065480
30
Table 3.3 The time z/ca|., (in nsecs) for 1-D convolution with method 1.
^cal-l N=S yv= 32 7V=128 /V=384
P=\ 1250.7 1251.8 1251.8 1252.0
P =  2 1801.3 1606.5 1626.6 1778.8
II 1817.4 1638.2 1737.2 1846.4
P =  8 1863.3 1718.3 1835.1 1879.8
Table 3.4 The time uca\_2 (in nsecs) for 1-D convolution with method 2.
wcal-2 A^=8 yv=32 7V=128 A^=384
P =  2 2178.4 1720.0 1600.7 1573.3
P =  4 2178.9 1700.4 1587.3 1566.0
P =  8 2138.5 1704.0 1590.0 1563.5
3.3.2 2-D Convolution
The 2-D convolution o f  two matrices performed on a uniprocessor has running time
/sq2 ( M x) = 2 N x N 2 M ] M 2 /caI = 0 (  M \ ) (3.8)
where N h N2 , M ] and M2 are as defined in Section 3.2, and M x is assumed to be the 
largest among them. The binary-tree algorithm for 2-D convolution, as described in the 
previous section, has
.V = log2 /* + 1 (3.9)
stages.
31
P  =  8
Stage 1 (P>=1) 
Stage 2 (P>=2)
Stage 3 (P>=4)
*  P  =  4  ►
< p = 2 ►
PO P 1 P 2 P 3 P 4 P 5 P 6 P 7
Stage 4 (P=8)
CT-
C2C2
CT,
C4
CT,
C5
F igure  3.7 DDG o f  the binary-tree algorithm with 8 processors.
Although the only restriction o f the algorithm is that the number o f processors P 
be a power o f 2, the DDG in Figure 3.7 is for 8 processors, the number o f processors 
available in TurboNet. This diagram also includes the data dependence information for P 
= 2 and P  = 4. From Equation 3.9, there are 4 stages in the algorithm for P = 8. In stage 1 
each processor computes the convolution for each parallelogram assigned to it. As 
indicated in Section 3.2, each processor has A\ /P  rows o f  parallelograms, and from Figure
3.2 it can be seen that each row contains parallelograms, i.e. each processor is 
assigned M \N {!P parallelograms. Inside each one o f them, it is simply a convolution o f
32
two rows from the two input matrices, with lengths o f N2 and M2, respectively, as seen in 
Figure 3.1. Hence, there are N2M 2M ]N i/P multiplications and the same num ber o f 
additions in stage 1 for each processor. With the same assumption as in Equation 3.8, we 
have the running time o f  stage 1 as
c ] ( M ] )  = 2 N 2M 2 M f N j  P /cal = 0 { M \ / P )  (3.10)
The convolution result in each processor is then treated locally as a 1-D array o f length 
{N]/P+ A/,-1)( N2+M2- 1), where N 2+M2- 1 is the number o f elements o f the result in the 1- 
D convolution o f vectors (i.e. one parallelogram) having length N 2 and M2.
In stage 2, each processor P, whose LSB of its binary address / is 0 (i.e. P(), P2, P4 
or Pb) receives from Pi+i (i.e. P u P2, P5 or P7) the first (M r \)( N2+M2- 1) elements o f the 
array that resulted in stage 1 in Pi+,, and adds them to the last (A /,-l)( N2+M2- 1) elements 
o f  its own array from stage 1. Since the binary addresses o f  P { and /,+  , differ in a single 
bit, they are neighbors i f  TurboNet is viewed as a hypercube, and therefore their 
communications can be easily carried out through direct links between them. This is true 
only for the hybrid and message-passing versions o f the algorithm. In the shared-memory 
version, because each pair o f  P , and PM  are located in the same board, no accessing o f 
the shared memory through the VME bus is necessary. But the shared memory in each 
Hydra board is shared by two pairs o f P , and P i+i for P > 4, therefore delay due to 
congestion on the ISB occurs (no delay for P = 2). As described in Section 2.2, the 
ownership o f the ISB is determined on a cycle-by-cycle basis, so the maximum delay for 
P > 4 is about (A /,-1)( N2+M2- 1) cycles. The running times o f the three versions in stage 2 
are
33
c,2.h>., ( M , ) = ( -  1 )(A 2 + M 2 -  1 ) ( /cal + /com) = 0 (  M f ) (3.11)
c,2.mp, ( M ]) = ( M ] - l ) ( N 2 + M 2 -  l ) ( / caI + /con, ) = 0 (  M]  ) (3.12)
( M ] - \ ) ( N 2 + M 2 -  l ) ( /cal + / DRAM ) = 0{  M]  ) for P = 2DRAM
( M) - \ ) ( N 2 + M 2 -  l ) ( /CIll + 1DRAM + /dc,)  = 6>(M,2) for P > 4
(3.13)
where hy, mp and sm mean hybrid, message-passing and shared-memory, respectively. 
Also, /com is the time to transfer a single word between neighbors in the hypercube via the 
direct link that connects them, and /di;I = /DRAM is the time to transfer the same word but 
through the shared memory (the DRAM).
In stage 3 each processor P, whose LSB o f  its /' is 0 and the next bit is 1 (i.e. P2 or 
P6) sends to PiA (i.e. P x or P 5) N i/P(N2+M2-\)  elements o f its resulting array from stage 2 
(starting at element (A/r A^ 1//5-l)(yV2+yW2-l))  and to P jm2 (i.e. P0 or PA) the first {Mr N\/P- 
\){N2+M2-\)  elements. PiA and P,.2 will then add the received data to their own arrays. 
The communication process in this stage can be carried out in three ways. In the hybrid 
way, the communication between Pt and PiA is done through the shared memory while P, 
sends data to P,.2 via a direct link between them, and the later is assumed to dom inate the 
running time in this stage. In this case, there is no communication congestion in either the 
link or the shared memory. In using message-passing, data being sent from P, to PiA may 
need to go through Pim2, that is P,_2 is an intermediate node in the path between P, and PiA. 
This implies a delay o f  N {/P(N2+M2-\)  /com in the communication between P, and PjA. It 
is assumed that this is the dominant time. In the shared-memory implementation Ph PiA
34
and Pj.2 are all in the same board, therefore there will be no accessing to out-board shared 
memory. The ISB delay due to congestion exists when P jA and Pt,2 try to read data at the 
same time from the on-board shared memory through the ISB bus. Since P, starts sending 
data to Pj.2 only after it finishes with PiA, Pj_2 will finish last with a maximum delay o f 
N\IP{N2+M2-\)  /dc|. The running times for this stage are
‘khy, ( M ]) - ( M l - N [/ P - \ ) ( N 2 + M 2 -  1 ) ( /cal + tcom) = 0 (  M [ )
(3.14)
W ,»  { M, )  = { M , - N J P ~ \ ) ( N 1 + M 2 - l ) ( / cal + /com ) + N j P ( N 2 + M 2 - \ )  l mm 
= 0 (  M ] ) + 0 {  M ; / P)
(3.15)
c(<.sm)( M ]) = ( M ] - N j P - \ ) ( N 1 + M 2 - l ) ( / cnl +I„ram) + N J P ( N 2 + M 2 - \ ) i dram 
= 0 (  M ] ) + 0 (  M ; f P )
(3.16)
Stage 4 is similar to stage 3. Each processor P , whose lowest two bits o f  / are 0 's  
and the next bit is 1 (i.e. P4) sends three different sets o f N\IP(N2+M2- \)  elements o f its 
resulting array from stage 3 (starting at element (M r 3N i/P -\)(N 2+M2-\))  to P ,_,, Ph2 and 
P,_3 (i.e. P3, P2 and /*,), and then the first (M r 3 N ]/P-\)(N 2+M2-\)  elements to P,_4 ( i.e. 
P0). The receivers will add the received data to their own data arrays and the algorithm 
will be terminated at this point. In this stage, all the receivers are in one board while Ph 
the sender, is in another board, and /*, has a direct link to P{_4 only. Hence, in the hybrid 
version o f  the algorithm P t sends data to PtA, Pia and P,.3 through the shared memory by 
using three o f its DMA controllers, and then to PiA via the link. The three DMA
controllers work simultaneously and produce congestion delay on the VME bus in this 
case as much as 2N ]/P(N2+Mr \)  /0.DRAm- where /0.DRAM is the time to access the out­
board DRAM for a single value. When using message-passing only, Pt dumps all the 
data to Pj_4, and PlA then distributes them to PiA, Ph2 and P,_3 through communication 
links. So the delay in the link between R, and PiA is 3N l/P(N2+M2-\)  /conv The shared- 
memory only case is similar to the hybrid one in terms o f using the shared-memory, but 
with a bigger delay o f  3N xIP{N2+M2-\)  (/0-oram + 'dram ) in both the VME bus and the 
ISB for Pj_4. Since PjA will finish last, the running times for stage 4 are
c,8..,y) { M X) = { M , - 3 N J P - \ ) { N 2 + M 2 -  l ) ( / t„  + /ra,„ ) + 2 N j P ( N 2 + M 2 - 1 )  /„_DRAM 
= 0 ( M 2) + 0 ( M 2/ P )
(3.17)
( M \ ) = ( M\  -  3 yV, / P  -  1)(yV2 + M 2 -  l ) ( / cal + /co,„ ) + 3 N j P ( N 2 + M 2 - 1 )  tmm 
= 0 (M ,2) + 0 (  M ; / P )
(3.18)
c'i«.sm)( ) = ( M^ -  3 N  J P  -  \ ) ( N 2 + M 2 -  l ) ( / ca, + / DRAM )
+ 3 N J P ( N 2 + M 2 - 1 )  + (mAM )= 0 (  M f  ) + 0 (  M ; / P)
(3.19)
The running times o f PiA, P,_2 and P,_3 will be very close to each other in the hybrid and 
shared-memory cases. In the message-passing case, P jA will finish first and then P;.2 will 
be followed by P,.3.
From the discussion above we can see that the hybrid version has less delay than 
the message-passing and shared-memory versions, and therefore would provide better 
performance over the other two. As described above, the dominant running times in
36
stages 2, 3 and 4 are c2, c4 and c8, respectively, that is, the longest path in the DDG of 
Figure 3.7 is from c, o f  P7 to c2 o f Pb, c4 o f P4 and cs o f  P0 (as marked by thick lines in 
the figure). Hence, the overall running times o f the binary-tree algorithm with P = 8. 
listed in Tables 3.5 to 3.7, are estimated from the following equations:
t E. hy( M i ) = ( N 2M 2M l N j P ) u i ( N , P )  + ( N 2 + M 2 - l ) ( ( M ] - l ) u hy2( N , P )
+ ( -  N  J  P -  \ ) uhy2( N  , P)  + ( M } - 3 N J P -  1) why4 ( N  , P) )
(3.20)
'  ( M ]) = ( N 2M 2M ] N J P )  u J N , P )  + ( N 2 + M 2 -  l ) ( (A /t - 1 )  ump2 ( N , P )
+ ( M ] - N j P - \ ) u m])2( N , P )  + ( M,  - 3 N j P - \ ) u mp4( N , P ) )
(3.21)
 ^E-sm ( M l ) = ( N 2M 2M\  N J P )  u J N  , P)  + ( N 2 + M 2 -1)((A T , - 1 )  »SIIl2 ( N  , P)
+ ( M l - N J P -  l )u m3 ( N  ,P )  + ( M i - 3 N J P -  1 )u m4 ( N , P ) )
(3.22)
where all the u parameters represent the times for one addition and one communication 
(except for itl which is only the time for one multiplication and one addition) in 
corresponding methods and stages, including relevant delay and time to load and store 
data. They are derived from experiment and are listed in Tables 3.8 to 3.11.
Table 3.5 The estimated running time o f the hybrid version o f  the binary-tree algorithm.
N ,=N 2=8 M ,=M 2=8 M |=M 2=16 M ,=M 2=24 M |=M 2=28 M ,=M 2=32
P=1 0.008228 0.031727 0.070469 0.095581 0.124509
P=2 0.004277 0.016290 0.036129 0.048954 0.063698
P=4 0.002381 0.008957 0.019737 0.026673 0.034641
P=8 0.001545 0.005595 0.012062 0.016266 0.021081
37
Tabic 3.6 The estimated running time of the message-passing version o f  the binary-tree 
algorithm.
N ,=N 2=8 M ,=M 2=8 M ,=M 2=16 M |=M 2=24 M ,=M 2=28 M ,=M 2=32
P=1 0.008228 0.031727 0.070469 0.095581 0.124509
P=2 0.004273 0.016281 0.036136 0.048881 0.063701
P=4 0.002425 0.008979 0.019720 0.026669 0.034672
P=8 0.001579 0.005644 0.012138 0.016368 0.021174
Table 3.7 The estimated running time of the shared-memory version o f the binary-tree 
algorithm.
N ,=N 2=8 M ,=M 2=8 M ,=M 2=16 M ,=M 2=24 M ,=M 2=28 M |=M 2=32
P=1 0.008228 0.031727 0.070469 0.095581 0.124509
P=2 0.004418 0.016678 0.036901 0.049883 0.065020
P=4 0.002773 0.010128 0.022053 0.029819 0.038701
P=8 0.002011 0.007060 0.015154 0.020352 0.026261
Table 3.8 The time u x (in nsecs) for the binary-tree algorithm.
" i M ,=M ,=8 M ,=M 2=16 M |=M 2=24 M ,=M 2=28 M |=M 2=32
P=1 2008.8 1936.5 1911.6 1904.9 1899.9
P=2 1999.1 1922.5 1897.8 1890.9 1885.7
P=4 2005.4 1923.8 1898.6 1891.4 1886.1
P=8 2017.2 1928.0 1900.7 1892.1 1887.1
38
Tabic 3.9 The time uxx2 (in nsecs) for the binary-tree algorithm.
«xx2
00'U£II§ M ,=M 2=16 M ,=M 2=24 M ,=M 2=28 M ,=M 2=32
hy 1741.9 1567.8 1610.5 1604.1 1570.0
mp 1707.6 1542.9 1620.9 1526.2 1580.2
sm
P=2 3078.1 2694.2 2691.3 2586.5 2671.6
P=4 3960.0 3719.4 3642.2 3661.5 3669.0
P=8 3976.2 3785.2 3771.4 3759.7 3719.6
Table 3.10 The time uxx2 (in nsecs) for the binary-tree algorithm.
«xx3 M |=M 2=8 M ,=M 2=16 M ,=M 2=24 2 "n 2 II N> 00 1 M ,=M 2=32
hy P=4 1928.0 1793.3 1676.2 1636.5 1627.5
P=8 1807.8 1725.2 1586.5 1589.2 1597.2
mp P=4 2149.3 1531.3 1411.5 1397.1 1402.8
P=8 1718.9 1393.2 1330.0 1329.5 1345.0
sm P=4 4054.7 3219.7 3068.4 3010.5 2972.6
P=8 3717.8 3225.8 3067.6 3025.6 2987.7
Table 3.11 The time uxx4 (in nsecs) for the binary-tree algorithm.
*^ xx4 M |=M 2=8 M ,=M 2=16 M ,=M 2=24 M |=M 2=28 M |=M 2=32
hy 2776.7 1992.8 1731.3 1710.1 1698.5
mp 3171.7 2129.3 1806.3 1759.5 1746.6
sm 3856.7 2763.0 2629.5 2590.5 2563.2
39
3.4 Performance Results on TurboNet
The algorithms for 1-D and 2-D convolution were implemented on TurboNet. Relevant 
performance results are presented in this section. In all implementations each processor 
employs its DMA controllers to fetch the values o f f  and g  from the shared memory, and 
starts the calculation after both data arrays are stored in the local memory. Due to their 
lack o f communication, the 1-D convolution algorithms cannot demonstrate the use o f 
message-passing and/or shared-memory, and therefore will not be evaluated in significant 
detail. They are, however, efficient algorithms that serve as introductory for the algorithm 
o f 2-D convolution.
3.4.1 1-D Convolution
Tables 3.12 to 3.17 contain execution times, speed-ups and efficiencies for both methods 
o f  1-D convolution, with variant data sizes and numbers o f processors. The execution 
time listed here for each case is the longest one among the processors. It can be seen from 
the results that method 1 has better performance than method 2. This agrees with our 
analysis in Section 3.1. In Figure 3.8, the execution tim es for both methods are plotted for 
8 processors. It can also been seen that the estimated running times in Section 3.3.1 are 
very close to the experimental ones. In Figure 3.9, such a comparison for m ethod 1 is 
plotted. The difference between the running times o f  methods 1 and 2 are also shown in 
Figures 3.10 and 3.11.
40
T able  3.12 Execution time for 1-D convolution with method 1.
M=  512 N=8 N=32 7V=128 yv=384
P= 1 0.005129 0.020513 0.082049 0.246145
IIft. 0.003682 0.013148 0.053314 0.174866
P  =4 0.001858 0.006716 0.028450 0.090712
P =  8 0.000946 0.003500 0.015048 0.046178
T ab le  3.13 Speedup o f 1-D convolution with method 1.
M=512 N =8 N=  32 iV=128 7V=384
P= 2 1.39 1.56 1.54 1.41
ii ■c*. 2.76 3.05 2.88 2.71
P =  8 5.42 5.86 5.45 5.33
T able  3.14 Efficiency o f  1-D convolution with method 1.
M=512 N=8 N=32 yv=128 A^ =384
P =  2 0.70 0.78 0.77 0.70
P  =4 0.69 0.76 0.72 0.68
P =  8 0.68 0.73 0.68 0.67
T ab le  3.15 Execution time for 1-D convolution with method 2.
M =513 N=8 N=32 A^=128 /V=384
P =  1 0.005139 0.020553 0.082209 0.246625
P =  2 0.004482 0.014087 0.052504 0.154948
73 II 0.002250 0.007413 0.032529 0.114757
P =  8 0.001129 0.003711 0.016269 0.065491
41
T able  3.16 Speedup o f  1-D convolution with method 2.
M=513 N=8 7V=32 N=128 N=384
P =  2 1.15 1.46 1.57 1.59
P =  4 2.29 2.77 2.52 2.15
ll 00 4.55 5.54 5.05 3.77
T able  3.17 Efficiency o f  1-D convolution with method 2.
A/=513 N=8 7V= 32 A^=l 28 7V=384(NII 0.57 0.73 0.78 0.80
P =  4 0.57 0.69 0.63 0.54
ll 00 0.57 0.69 0.63 0.47
0 . 0 6 0 9  ••
E  0 . 0 5 0 9  ••
X
e  T  0 . 0 4 0 9  ■■ 
c  .1
u  in  0 . 0 3 0 9  ■■ 
t
i
o 
n
M e t h o d  1 ( M = 5 1 2 )  
M e t h o d  2 ( M  =  5 1 3 )
0 . 0 2 0 9
0 . 0 1 0 9
0 . 0 0 0 9  B — 
N  = 8 N  =  3 2 N  =  128 N  = 3 8 4
A r r a y  S i z e
F igure  3.8 Execution time o f 1-D convolution (method 1 vs method 2), with P = 8.
42
0 . 0 4 5  ■■
E  ° ' 0 4  •
x 0 . 0 3 5  ••
e  T  0 . 0 3
> 0 . 0 2 5
1 m 0.02
' 0 . 0 1 5
o 
n
0.01
0 . 0 0 5
0
9 T h e o r e t i c a l  
□  E x p e r i m e n t a l
N  =  3 2  N =  128
A r r a y  S i z e  ( M  =  5 1 2 )
N  =  3 8 4
Figure 3.9 A comparison o f  theoretical and experimental running times for method 1.
N  =  8 N  =  3 2 N =  128 N  =  3 8 4
5 °
E  4 0  ' B  P = 1
X
U
) o
□  P  = 2
e  T 9  P  = 4
c  . 2 0  • □  P  = 8
u
t m  10  •
e
o  0  • H
”  - 1 0  ■
- 2 0  •
A r r a y  S i z e  ( M  =  5 1 2 )
Figure 3.10 The difference between theoretical and experimental running times for 
method 1.
43
N  =  8 N  =  3 2  N =  1 2 8  N  =  3 8 4
A r r a y  S i z e  ( M  =  5 1 3 )
Figure 3.11 The difference o f  theoretical and experimental running times from method 
2 .
3.4.2 2-D Convolution
Tables 3.18 to 3.26 contain execution times, speed-ups and efficiencies for the three 
versions o f the binary-tree algorithm, with variant data sizes and numbers o f  processors 
(iV, and N 2 are always equal to 8). From the results we see that the performance o f  the 
hybrid version is slightly better than that o f the message-passing one, while the shared- 
memory one trails both o f them with a noticeable margin. This matches our prediction in 
Section 3.3.2. From Tables 3.18 and 3.24, it is seen that the hybrid and m essage-passing 
versions are much faster than the shared-memory version. In order to show clearly that 
the hybrid version is also faster than the message-passing version, in Figure 3.12 we plot 
the difference in execution times between the message-passing and hybrid versions, in 
this order, for 8 processors. Again the theoretical and experimental running times o f the
44
binary-tree method are close to each other. A direct comparison o f  both running times for 
the hybrid version with P = 8 is shown in Figure 3.13, and the differences o f these 
running times for the three versions are plotted in Figures 3.14 to 3.16.
Table 3.18 Execution time for the hybrid version o f the binary-tree algorithm.
N ,=N 2=8 M ,=M 2=8 M ,=M 2=16 M ,=M 2=24 M ,=M 2=28 M ,=M 2=32
P=1 0.008237 0.031713 0.070460 0.095560 0.124481
P=2 0.004265 0.016292 0.036094 0.048911 0.063672
P=4 0.002388 0.008948 0.019696 0.026638 0.034628
P=8 0.001553 0.005565 0.012071 0.016254 0.021066
Table 3.19 Speedup o f the hybrid version o f the binary-tree algorithm.
N ,=N 2=8 M ,=M 2=8 M ,=M 2=16 M ,=M 2=24 M ,=M 2=28 M ,=M 2=32
P=2 1.93 1.95 1.95 1.95 1.96
P=4 3.45 3.54 3.58 3.59 3.59
P=8 5.30 5.70 5.84 5.88 5.91
Table 3.20 Efficiency o f the hybrid version o f the binary-tree algorithm.
N ,=N 2=8 M ,=M 2=8 'I,
£II§ M ,=M 2=24 M ,=M 2=28 M ,=M 2=32
P=2 0.97 0.97 0.98 0.98 0.98
P=4 0.86 0.89 0.89 0.90 0.90
P=8 0.66 0.71 0.73 0.73 0.74
Table 3.21 Execution time for the m essage-passing version o f  the binary-tree algorithm.
Z “ll z II 00 M ,=M 2=8 M ,=M 2=16 M ,=M 2=24 M ,=M 2=28 M ,=M 2=32
P=1 0.008237 0.031713 0.070460 0.095560 0.124481
P=2 0.004267 0.016294 0.036096 0.048914 0.063675
P=4 0.002417 0.008989 0.019746 0.026695 0.034690
P=8 0.001595 0.005628 0.012149 0.016343 0.021163
45
Table 3.22 Speedup of the message-passing version o f the binary-tree algorithm.
N ,=N 2=8
00II£IIs M ,=M 2=16 M ,=M 2=24 M ,=M 2=28 M ,=M 2=32
P=2 1.93 1.95 1.95 1.95 1.95
P=4 3.41 3.53 3.57 3.58 3.59
P=8 5.16 5.63 5.80 5.85 5.88
Table 3.23 Efficiency o f the message-passing version o f the binary-tree algorithm.
2 “ll Z II 00 M |=M 2=8 M ,=M ,=16 M |=M 2=24 M ,=M 2=28 M ,=M 2=32
P=2 0.96 0.97 0.98 0.98 0.98
P=4 0.85 0.88 0.89 0.89 0.90
P=8 0.65 0.70 0.72 0.73 0.74
Table 3.24 Execution time for the shared-memory version o f  the binary-tree algorithm.
2 “ll 2 II 00 M ,=M 2=8 M ,=M 2=16 M ,=M 2=24 M ,=M 2=28 M ,=M 2=32
P=1 0.008237 0.031713 0.070460 0.095560 0.124481
P=2 0.004409 0.016666 0.036872 0.049907 0.064997
P=4 0.002763 0.010133 0.022097 0.029843 0.038714
P=8 0.002017 0.007077 0.015137 0.020329 0.026273
Table 3.25 Speedup o f the shared-memory version o f  the binary-tree algorithm.
soIIcl
ZIIz M ,=M 2=8 M ,=M 2=16 M |=M 2=24 M ,=M 2=28 M ,=M 2=32
P=2 1.87 1.90 1.91 1.91 1.92
P=4 2.98 3.13 3.19 3.20 3.22
P=8 4.08 4.48 4.65 4.70 4.74
Table 3.26 Efficiency o f the shared-memory version o f  the binary-tree algorithm.
N ,=N 2=8 M ,=M 2=8 M |=M 2=16 M ,=M 2=24 M ,=M 2=28 M ,=M 2=32
P=2 0.94 0.95 0.96 0.96 0.96
P=4 0.75 0.78 0.80 0.80 0.80
P=8 0.51 0.56 0.58 0.59 0.59
46
0.0001
0 .0 0 0 0 9
0 . 0 0 0 0 8
T
' 0 . 0 0 0 0 7
ill
e
0 . 0 0 0 0 6
0 .0 0 0 0 5
0 . 0 0 0 0 4  T --------------------------- 1---------------------------- 1---------------------------- 1---------------------------- 1
M l  = M 2 = 8  M l  = M 2 =  16 M 1 = M 2 = 2 4  M 1 = M 2  =  28 M 1 = M 2  =  32
M a t r i x  S i z e ( N l =  N 2  =  8)
Figure 3.12 The difference in execution times between the message-passing and hybrid 
versions, in this order, for P  = 8.
■  T h e o r e t i c a l  
□  E x p e r i m e n t a l
M l  = M 2  =  8 M l  = M 2 =  16 M 1 = M 2  =  2 4  M 1 = M 2  =  2 8  M 1 = M 2  =  3 2  
M a t r i x  S i z e  ( N 1 = N 2 = 8 )
Figure 3.13 A comparison o f  theoretical and experimental running times o f  the hybrid 
version for P = 8.
47
50
E  4 0
X
e  T  3 0  
c  .
u  ' 2 0
in
e
10 
0
-10 1 '
M l  = M 2  =  8 M l  =  M 2  =  16 M 1 = M 2  =  2 4  M 1 = M 2  =  2 8  M 1 = M 2  =  3 2
□  P  =  2
□  P  =  8
IT
M a t r i x  S i z e  ( N 1 = N 2  =  8 )
Figure 3.14 The difference o f  theoretical and experimental running times for the hybrid 
version.
M l  = M 2  =  8 M l  = M 2  =  16  M 1 = M 2  =  2 4  M 1 = M 2  =  2 8  M 1 = M 2  =  3 2
E
X
e
T
c •
u
m
4 0
a p= 2
□  P  =  4
a  p=8
3 0
20
10
0
10
-20
- 3 0
- 4 0
M a t r i x  S i z e  ( N 1 =  N 2  =  8)
Figure 3.15 The difference o f theoretical and experimental running tim es for the 
message-passing version.
48
M l  = M 2  =  8 M l  = M 2 =  16 M 1 = M 2  =  2 4  M 1 = M 2  =  2 8  M 1 = M 2  =  3 2
3 0  
20
E
x 10
6 T  0
c
u  ' - 1 0  m
! e  - 2 0i
o  - 3 0
n
- 4 0  
- 5 0
e
Figure 3.16 The difference o f  theoretical and experimental running times for the shared- 
memory version.
In Figure 3.7, CT2, CT3 and CT4 represent the communication times o f  the 
corresponding edges in stages 2, 3 and 4. Their values were measured in the 
implementations and are displayed in Tables 3.27 to 3.29. In the tables, hy, mp and sm 
are abbreviations for hybrid, message-passing and shared-memory. From Equations 3.11 
to 3.13, it is seen that the number of operations in stage 2 is not a function o f P. P 
affects the communication times only in the shared-memory implementation; that is why 
in Table 3.27 only the sm case is categorized for different values o f P. Also, notice that 
Table 3.29 is only for P  = 8.
M a t r i x  S i z e  ( N 1 = N 2  =  8)
49
T able  3.27 The communication time (in gsec) between P, and PH[ in stage 2.
CT, M ,=M 2=8 M ,=M 2=16 M ,=M 2=24 M ,=M 2=28 M ,=M 2=32
hy 133.30 424.50 880.38 1182.12 1545.88
mp 133.40 424.55 880.41 1182.35 1545.98
sm
P=2 274.30 796.75 1654.61 2175.25 2868.48
P=4 370.90 1164.95 2383.61 3179.15 4103.78
P=8 379.60 1190.15 2448.71 3252.95 4162.48
T able  3.28 The communication time (in gsec) between Pt and Pj_2 in stage 3.
c t 3 M ,=M 2=8 M ,=M 2=16 M ,=M 2=24 M ,=M 2=28 M ,=M 2=32
hy P=4 132.90 409.55 867.37 1148.65 1517.02
P=8 138.50 414.80 877.04 1170.96 1539.90
mp P=4 161.20 457.85 918.87 1222.45 1586.62
P=8 154.70 448.60 907.04 1209.80 1573.70
sm P=4 270.60 860.55 1804.67 2374.95 3051.82
P=8 302.40 931.00 1862.04 2464.20 3172.00
T able  3.29 The communication time (in gsec) between P, and Ph4 in stage 4.
c t 4 M ,=M 2=8 M ,=M 2=16 M ,=M 2=24 M ,=M 2=28 M ,=M ,=32
hy 153.80 410.40 868.80 1161.60 1521.05
mp 178.50 476.60 940.80 1246.60 1609.04
sm 210.60 673.00 1418.70 1906.60 2497.24
CHAPTER 4
FAST FOURIER TRANSFORM
The implementation o f  Fast Fourier Transform (FFT) algorithms on parallel systems is 
found in many applications [1, 2, 4, 6, 7, 23], This chapter deals with the 1-D and 2-D 
FFT algorithms and their implementation on TurboNet. Here the number o f  processors 
and the sizes o f any data arrays or matrices are assumed to be powers o f 2.
A 2-D FFT is usually performed by 1-D row FFT’s followed by 1-D column 
FFT’s, that is, as long as we know how to do the 1-D FFT, the 2-D FFT is also solved. 
The problem with the 2-D FFT here, however, is how to overlap the input, calculation 
and output processes, so that the performance can be improved, especially when the input 
data to a processor is bigger than its internal memory and then such overlaps become 
vital. Our 2-D FFT algorithm will also focus on how to perform efficiently the 
transposition o f the intermediate matrix between the row and column FFT’s.
4.1 One-Dim ensional FFT
The Discrete Fourier Transform (DFT) o f a finite-length digital sequence x is a sequence 
y  o f  the same length as x such that
= (4.i)
n= 0
50
51
for k  = 0, 1 , AM,  where N  is the length o f the digital sequence and w,v is a primitive A11' 
root o f  unity, that is w = e'2n/N, where / = V -T .  Its computation requires on the order o f N 2 
m ultiplications and N  additions.
The FFT is the m ost popular algorithm in the implem entation o f the DFT. For a 
parallel implementation o f the FFT, let us look at the data flow graph o f an 8-point 
decimation-in-time (DIT) FFT. Its butterfly configuration is shown in Figure 4.1, where 
the input array is shuffled while the output is in increasing-index order. In the figure, as 
an example, four processors are used to perform the 8-point FFT. Each one o f  them 
calculates two results as shown. Any arrow that crosses a dashed line separating two 
processors indicates a communication between the two end-processors o f the arrow. In 
stage 1, each processor performs a butterfly computation without any communication 
involved. In stage 2, each processor carries out the low or upper halves o f two butterfly 
computations and communicates with one o f its two neighbors the results o f  both 
computations. Stage 3 is ju st like stage 2 except that this time it involves another 
neighbor. The l¥^ ’s are called butterfly-lwiddle factors (these coefficients are complex 
numbers). There is a trade-off between speed and memory in dealing with the twiddle 
factors. In order to speed up the FFT operation, they are usually generated and stored in a 
table before stage 1 starts. To save memory, however, they can be calculated each time 
when they are needed. For an A-point FFT, A/2 coefficients are needed. The butterfly 
network can also be mapped onto a hypercube with the same number o f processors (i.e. P 
= 8 in this example) in an optimal manner. Therefore, assuming a hypercube system with
52
8 processors, each processor will have to exchange data once with each one o f  its 
log2 8 = 3 neighbors.
Stage 1 Stage 2 Stage 3
x[0] o ------
W  0
x[4] o  > -
x[2] O----- $*~~
W  0
x[6] O----- 5>-
x[1] O S*~~
W  u 
x[5] o ------
x[3] o  >
W  u 
x[7] O------
s o  y 1
O y 2
PO
P1
P2
P3
Figure 4.1 Data flow graph o f  an 8-point decimation-in-time DFT.
In general, if  there are P processors in performing an Appoint, radix-2, 
decimation-in-time FFT, then each one o f them calculates N/P  results by performing 
N/(2P) log2 N  butterfly computations, so that F 0 produces y[0], y [ l] , ..., y[N/P-\], P\ 
produces y[N/P], y[N/P+\~\, y\2N/P-\], etc. Note that after the first log2 (jN/ P)
butterfly computations, which is a sequential N/ P-point 1-D FFT, the remaining butterfly
computations are split in half between neighboring processors. This means that when 
calculating half o f a butterfly computation, a processor has to exchange data with its 
neighbor performing the other half o f  the butterfly computation.
The parallel 1-D FFT algorithm has a strong communication bias toward 
hypercube systems, since all communications in the algorithm are between neighboring 
processors, and this results in good performance for the message-passing method. The 
FFT algorithm is described briefly below.
ONE-DIMENSIONAL FFT ALGORITHM:
(1) for all processors do in parallel
set up a DMA controller to read N/P input data from the DRAM with bit- 
reverse addressing (global memory);
(2) for a1 = 1 to \ og2( N  / P)  do
for / = 0 to 272 do
read the coefficient from the DRAM; 
perform the basic butterfly computation;
end for 
end for;
(3) for s = log, ( N  / P)  + 1 to log2 do
for / = 0 to N/P-\  do
read the coefficient from the DRAM; 
exchange data with a neighbor;
54
perform the basic butterfly computation;
end for
end for.
4.2 Two-Dimensional FFT
The definition o f the two-dimensional DFT is given by
y [ k ], k 2\ =  £  «21 <  w n"! (4 -2)
/;, =0  / / ,  =0
for £, = 0, 1 N r \ and k2 = 0, 1, jV2-1, where = e '27t/'V|, vrVi =e '2’LlN'-, and iV, and
N2 are the lengths o f the 2-D matrix x.
To perform a 2-D FFT, we first perform 1-D FFT’s row-by-row on the input 
matrix, and then on the resulting matrix we perform 1-D FFT’s column-by-column; this is 
called the row-column method [26]. The 2-D FFT is, thus, decomposed into 7V,+ N2 1-D 
FFT’s. In order to reduce the communication overhead and simplify the programming 
task, we let each 1-D FFT be performed by a single processor rather than dividing it 
between two or more processors; we assume that the number o f processors is less than the 
num ber o f  rows or columns, which is usually the case in many applications. This means 
that each processor will perform 1-D FFT’s in a sequential way, and comm unication is 
needed only during the transposition o f the resulting matrix o f the row FFT’s.
As mentioned at the beginning o f this chapter, due to the limited amount o f 
processor internal memory, all required input data for a processor may not be fully stored 
in it initially. Instead the processor may have to read few rows or columns at a time, and
55
after completion o f  the current task it will have to start the next read cycle. In this 
situation, the efficiency o f reads and writes is vital to the performance o f  the algorithm. 
Hence, our research focus here is on how efficiently input data are read from and output 
data are written back to the DRAM, for both row and column FFT’s.
Suppose we have P  processors to perform a 2-D FFT on a N xN  matrix, where N/P 
is an integer. Assume row FFT’s for simplicity. Also, assume that each processor can 
store internally 2k  rows o f  data, where 2k < (N/P), and that (N/P)/k is an integer. In order 
to save CPU time, we let a DMA controller o f  each processor read from the DRAM k 
rows o f  data assigned to it, while the processor operates on k  preceding rows. Each time 
the DMA controller reads another k rows, Pk  rows down from the first row o f the last k 
rows read. While the processor works on these newly read k  rows, another DMA 
controller writes back to the DRAM the result for the last k  rows processed. O f course, 
the processor’s internal memory should accommodate 4k  rows if  the memory space 
occupied by the 2k  rows o f the result is also considered. Each processor will repeat this 
process (N/P)/k times. Such a memory interleaving example for input rows, with P = 4, is 
shown in Figure 4.2. For the column FFT, the procedure is the same except that the input 
is the columns o f  the resulting matrix o f  the row FFT’s. This means we have to use 
different address indexing in the DMA controllers when reading or writing the DRAM, or 
perform transposition o f  the matrix directly.
Under the condition o f 2k < (N/P), we have assumed the use o f shared-memory 
for communication, and it is obviously a practical choice over any m essage-passing 
method.
56
SYSTEM DRAM PROCESSOR INTERNAL MEMORY
T he first Pk rows of data T he 2k row m em ory o f PO
k rows
T he 2k row m em ory o f P1
T he sec o n d  Pk rows o f data
T he 2k row m em ory of P 2
The 2k row m em ory o f P3
The (N/P)/k th Pk rows o f data
Figure 4.2 An example o f  accessing row data for the FFT, with P  = 4 and 2k < (N/P).
PROCESSOR INTERNAL MEMORY
PO
P1
P2
P3
to P3no move to P1 to P2
to PO to P2 to P3no ive
to PO to P3no lovetoP1
to P1to PO to P2 no move
Figure 4.3 An example o f transferring data for the FFT, with P = 4 and 2k > (N/P).
57
Now consider the case o f  2k > (N/P).  Each processor now can fully contain all 
N/ P  rows o f its data in its internal memory. In this case the only problem is how to carry 
out the transposition o f  the intermediate matrix. After finishing the row FFT ’s, each 
processor divides the resulting N/ P  rows into P  sections o f columns, and then sends each 
one o f them to that processor whose identification number (ID) is the same as the section 
number. Transposition o f columns into rows is also involved. Each processor keeps the 
section with the same section number as its own ID. Figure 4.3 gives an example for this 
case. In this case, each processor has to talk to every other processor, which results in 
complex programming structures. Flence, this method was not implemented on TurboNet 
and can be considered in further research.
4.3 Theoretical Analysis
Theoretical analysis o f  the 1 -D and 2-D FFT algorithms is presented in this section.
4.3.1 1-D FFT
From Figure 4.1 we can see that there are log2 N  stages in the A-point FFT. In each 
stage there are N/2 butterfly computations, and each one includes two m ultiplications and 
two additions o f complex numbers. Hence, its running time on a single processor is
fsql(/V) = 6/V (log2 N )  ta l = 0 ( N  log, N )  (4.3)
where /ca| is the time for one multiplication or addition.
58
The DDG for P  = 8 is depicted in Figure 4.4. Parts o f the Figure can also be used 
for 2 and 4 processors. If  N/P > 2, each arrow connecting two processors represents the 
communication o f N/P values. As indicated in Section 4.1, the parallel implementation of 
the A-point FFT by P  processors has two parts. The first part o f  the first log2(jV /P)  
stages is an M P-point sequential FFT with running time
c , ( N )  = 6 ( N / P ) \ o g 2( N / P ) t ^  = ( ) ( N / P  \og2( N / P ) )  (4.4)
P  = 4
■* P  = 2 ►
P  = 8
Stages I t o l o g ^ N / p )  PO P1 
(P>=1) ( c l
Stage log2( N /p ) + l  
(P>=2)
Stage log2(N /p )+ 2  
(P>=4)
Stage log2(N /p )+ 3  
(P=8)
Figure 4.4 DDG o f the 1-D FFT with P = 8.
The second part is the last log2 P  stages (they are the last three layers for P = 8 in Figure 
4.4). In this part each butterfly computation in every stage is divided into two halves 
carried out by two neighboring processors. All stages in this part should have the same
59
running time since the numbers o f calculations are equal and the communication time 
between any two neighbors is assumed to be the same. Each processor will have N/P 
values o f  the butterfly computations in each stage, and therefore the running time for each 
stage o f  the second part is
c2 ( N ) =  N / P  (6 tc,  + 2 t cam) = 0 ( N / P )  (4.5)
where /eom is the time to send one word (a complex number is considered to have two 
words) from a processor to its neighbor via the direct link that connects them, and the 
running tim e o f  the parallel algorithm for the message-passing system adds up to
tmp( N )  = cl { N )  + {\og2 P ) c 2 ( N )  = 0 { N /  P \ o g 2 N )  (4.6)
As mentioned in Section 4.1, the nature o f the parallel FFT makes good use o f  the 
hypercube structure in terms o f  efficient communication, since all data exchanges are 
between neighbors. If  the communication channels between neighboring processors are 
full-duplex, there will be no additional delay when they exchange data at the same time. 
For a half-duplex channel, which is the case for the TurboNet system, if  two neighboring 
processors want to send data to each other at the same time, one has to wait until the other 
one finishes. However, we can safely assume that the comm unication between two 
neighbors in the TurboNet system never happens at the same time, because one o f them 
has to multiply the outgoing data with the twiddle-factors while the other does not. It is 
obvious that the integration o f  the shared-memory paradigm with this message-passing 
paradigm  for the 1-D FFT would result in performance degradation. For the shared- 
memory only method, there will be a total bus delay as much as ( P  -  1) N /  P( \ og2 P ) t de], 
where N / P  log2 P is the total number o f complex numbers each processor has to send,
60
and /dc| = /DRAm is the time to transfer one complex number from/to the DRAM. Hence, 
the estimated running time for the shared-memory method is
Pm ( AO = 6(( A^/Plog, P + jV /Rlog, ( N  / P ) ) t m] + 2 TV (log , P ) t mAM)
= N / P ( log, yv)/cal + W (Iog2 P ) t mAM = 0 ( N / P  log2 N  + N  log2 P)
(4.7)
From Equations 4.6 and 4.7, it is obvious that the message-passing method will have 
better performance than the shared-memory one. Tables 4.1 and 4.2 give the estimated 
running times o f  the 1-D FFT for the message-passing and shared-memory methods, 
respectively. They are estimated from the following two equations:
lu-mp( N )  = N / P ( l o g 1( N / P ) ) u ( N , P )  + N / P ( l o g 2 P ) u mp( N , P )
(4.8)
t ^ m( H )  = N / P ( \ o g 2( N / P ) M N , P )  + N / P ( \ o g 2 P ) u im( N , P )
(4.9)
where u(N,P), ump(N,P) and usm(N,P), obtained from experiment, represent the amount o f 
time for ha lf a butterfly computation (i.e. one multiplication and one addition o f  complex 
numbers) in the sequential, parallel with message-passing and parallel with shared- 
memory implementations, respectively. The u parameters also include the time to load 
and store input and output data. Appropriate values are derived from experiment and are 
listed in Tables 4.3, 4.4 and 4.5.
61
Table 4.1 The estimated running time for the 1-D FFT with message-passing.
N=64 N=128 N=256 N=512 N=1024
P=1 0.001163 0.002610 0.005828 0.012921 0.028499
P=2 0.000783 0.001665 0.003563 0.007708 0.016668
P=4 0.000523 0.001050 0.002159 0.004526 0.009605
P=8 0.000372 0.000688 0.001318 0.002678 0.005468
Table 4.2 The estimated running time for the 1-D FFT with shared-memory.
N=64 N=128 N=256 N=512 N=1024
P=1 0.001163 0.002610 0.005828 0.012921 0.028499
P=2 0.000923 0.001928 0.004079 0.008786 0.018734
P=4 0.000820 0.001616 0.003275 0.006523 0.013444
P=8 0.000886 0.001667 0.003250 0.006515 0.013198
Table 4.3 The sequential time u (in nsecs) o f 1-D FFT.
u N=64 N=128 N=256 N=512 N=1024
P=1 3028.6 2912.9 2845.7 2804.0 2783.1
P=2 3300.0 3044.3 2916.3 2841.8 2805.8
P=4 3765.6 3300.0 3046.9 2917.4 2842.3
P=8 4666.7 3781.3 3306.3 3046.9 2917.7
Table 4.4 The time ump (in nsecs) o f 1-D FFT with message-passing.
®mp N=64 N=128 N=256 N=512 N=1024
P=2 7968.8 7750.0 7421.9 7375.0 7302.7
P=4 8812.5 8156.3 7726.6 7468.8 7390.6
P=8 10833.3 9291.7 8218.8 7854.2 7432.3
62
Table 4.5 The time »sm (in nsecs) o f 1-D FFT with shared-memory.
®sm N=64 N=128 N=256 N=512 N=1024
P=2 12343.8 1 1859.4 11453.1 11586.0 11337.9
P=4 18093.8 17000.0 16445.3 15269.5 14888.7
P=8 32250.0 29687.5 28343.8 27838.5 27562.5
4.3.2 2-D FFT
A sequential 2-D FFT on an N xN  matrix is actually 2N  /V-points 1-D FFT’s, and therefore 
its running time is
^ 2  (AO = 2 N ( 6 N (log , W )/cal) = 0 ( N 2 log, N )  (4.10)
Here the matrix transposition is assumed to be done through reading input columns. The 
proposed 2-D FFT algorithm for 2k < (N/P), described in Section 4.2, has a DDG as 
shown in Figure 4.5. In order to speed up the operations, the algorithm overlaps its data 
input, FFT calculation and data output phases for the row and column FFT’s. This is 
indicated in the figure by dashed circles. Inside each circle, the three phases are 
represented by smaller circles marked with “r”, “ffts” and “w” for read, 1-D FFT’s and 
write, respectively. Overlapping o f the three phases iterates until the row (or column) 
FFT’s are completed. Since the time for reading and writing k rows (or columns) o f data 
from and to the shared memory is less than the time for their 1-D FFT’s, after the first k 
rows (or columns) o f  data are fetched each processor can complete without pause all o f  its 
N/P row (or column) FFT’s. For example, from experiment, with N  = 64, P = 8 and k = 2, 
the two 64-point 1-D FFT’s take 2338 gsecs, while the reading and writing o f  data from
63
the shared memory can take up to (2P- \)kN  660(nsecs) = 1267.2 jusecs, where the 
660(nsecs) is the value o f / D r a m - Therefore, the running time o f the row (or column) FFT 
stage is
( N )  -  ( N / P )  N (log2 N ) 6  = 0 ( N 2/ P  log, N )  (4.11)
PO
Row FFTs
P1
► ©  (fits)— ►(©
p-1
- ©
Matrix Transposition
Column FFTs ms ffts
F igure  4.5 DDG o f the 2-D FFT algorithm for 2k < (N/P).
As pointed out earlier, the transposition can be performed during the input o f  columns by 
either using another addressing scheme for the DMA controllers, different from the one 
used for row FFT’s, or directly transposing the matrix in the DRAM. The latter is chosen 
in our algorithm for the following two reasons: (1) re-utilizing the program o f  row FFT’s 
for column FFT’s without any change; (2) avoiding the initialization o f  the DMA 
controllers for the reading o f each column with different starting addresses, since it is 
rather time consuming. In the transposition stage, each processor has to read and write
64
N  N / P  complex numbers with a possible maximum delay in the DRAM of 
( P  -  1) N 2/ P 2  / dram, that is, the running time o f  this stage is
l mt ( N )  = ( N 2/ P  + ( P - \ )  N 2/ P) 2  / dram = 2 N 2 / DRAM = 0 ( N 2)
(4.12)
The total running time o f the algorithm is then the summation o f 2/m(A0 and fnU(A0, as 
'm 2 (■AO = N 2/ P ( \ o g 2 N) 12  /ca, + N 2 2 l nRAM = 0 ( N 2/ P  (log,  N  + />))
= 0 ( N 2/ P log2 N )
(4.13)
for N »  P. The estimated running time o f  the 2-D FFT algorithm for 2k < (N/P)  is 
presented in Table 4.6. It is based on Equation 4.14:
' r:-fit2 (AO = 2 N 2 /  P ( \o g 2 N ) u ( N , P)  + N 2 /  P wPRAM ( N , P )  (4.14)
where u(N, P) is defined as in Equation 4.9 and listed in Table 4.7. The parameter 
udram(N, P) is the unit time o f reading and writing a complex number from and to the 
DRAM with congestion delay found from experiment, and is listed in Table 4.8.
Table 4.6 The estimated running time for 2-D FFT for 2k < (N/P)  with shared-memory.
8 x 8 1 6 x 1 6 3 2 x 3 2 6 4 x 6 4 128 x 128
P=1 0.002021 0.009048 0.037552 0.161429 0.683195
P=2 0.001228 0.005172 0.020977 0.088500 0.372199
P=4 0.000848 0.003398 0.013740 0.057035 0.228818
P=8 0.000686 0.002735 0.010733 0.045891 0.182538
65
Table 4.7 The sequential time u (in nsecs) o f  2-D FFT.
II 8 x 8 16 x 16 3 2 x 3 2 64 x 64 128 x 128
P=1 4685.8 3688.0 3329.8 3003.8 2792.0
P=2 4903.2 3819.4 3383.8 3184.4 3007.6
P=4 5431.6 4120.2 3493.8 3493.4 3193.8
P=8 6490.0 4738.2 4028.4 4295.9 4028.3
Table 4.8 The time uDRAM (in nsecs) for matrix transposition for 2-D FFT.
" dram 8 x 8 16 x 16 3 2 x 3 2 64 x 64 1 2 8 x 128
P=1 3461.8 5840.8 3373.9 2621.3 1083.1
P=2 8967.0 9849.6 7132.9 5000.0 3328.1
P=4 20438.5 20126.0 18733.3 13777.3 11150.6
P=8 46870.0 47561.5 43571.3 38080.0 32733.9
4.4 Performance Results on TurboNet
The 1-D FFT and 2-D FFT algorithms for 2k < (N/P) have been implemented on 
TurboNet. Results are presented in this section. In all o f  our FFT implementations, the 
twiddle factors (or coefficients) are calculated and stored in a table in the DRAM. Each 
processor fetches the table values only once and stores them in its internal memory before 
starting any calculation.
4.4.1 1-D FFT
The 1-D FFT algorithm has been tested in two ways. One way uses message-passing 
only, whereas another one uses shared-memory only. From the experimental results o f  the
66
1-D FFT in Tables 4.9 to 4.12, as our theoretical analysis predicted, the message-passing 
m ethod is shown indeed to be superior to the shared-memory one, and both experimental 
running times are close to the theoretical estimations. Figure 4.6 displays the 
experimental results for both versions, u n d e r/5 = 8. In Figures 4.7 and 4.8, the theoretical 
times o f  both communication methods for P  = 8 are shown together with the experimental 
times. The differences between the theoretical and experimental times for both versions 
are shown in Figures 4.9 and 4.10. In our implementation, programs for complex and real 
FFT’s were written and tested separately. The results show that the real FFT is a little 
faster than the complex one. All results given here are for the complex FFT.
The potential advantage o f using simultaneously both the shared-memory and 
m essage-passing methods in 1-D FFT is not obvious; further research is needed.
Table 4.9 Execution time for 1-D FFT with message-passing.
N=64 N=128 N=256 N=512 N=1024
P=1 0.001169 0.002613 0.005820 0.012928 0.028471
P=2 0.000781 0.001660 0.003558 0.007712 0.016638
P=4 0.000530 0.001049 0.002155 0.004527 0.009590
P=8 0.000377 0.000682 0.001314 0.002656 0.005483
Table 4.10 Speedup o f 1-D FFT with message-passing.
N=64 N=128 N=256 N=512 N=1024
P=2 1.50 1.57 1.64 1.68 1.71
P=4 2.21 2.49 2.70 2.86 2.97
P=8 3.10 3.83 4.43 4.87 5.19
67
Table 4.11 Efficiency o f 1-D FFT with message-passing.
N=64 N=128 N=256 N=512 N=1024
P=2 0.75 0.79 0.82 0.84 0.86
P=4 0.55 0.62 0.68 0.71 0.74
P=8 0.39 0.48 0.55 0.61 0.65
Table 4.12 Execution time for 1-D FFT with shared-memory.
N=64 N=128 N=256 N=512 N=1024
P=1 0.001169 0.002613 0.005820 0.012928 0.028471
P=2 0.000924 0.001926 0.004092 0.008741 0.018736
P=4 0.000823 0.001617 0.003270 0.006512 0.013419
P=8 0.000882 0.001676 0.003260 0.006545 0.013221
Table 4.13 Speedup o f 1-D FFT with shared-memory.
N=64 N=128 N=256 N=512 N=1024
P=2 1.26 1.36 1.42 1.48 1.52
P=4 1.42 1.62 1.78 1.99 2.12
P=8 1.33 1.56 1.79 1.99 2.15
Table 4.14 Efficiency o f  1-D FFT with shared-memory.
N=64 N=128 N=256 N=512 N=1024
P=2 0.63 0.68 0.71 0.74 0.76
P=4 0.36 0.40 0.44 0.50 0.53
P=8 0.17 0.19 0.22 0.25 0.27
68
0 . 0 1 4  T
0.012
0.01  ■■
T
j 0 . 0 0 8  ••
111 0 . 0 0 6  •• 
e
0 . 0 0 4
m p
0.002
N =  128N  =  64 N  =  2 5 6 N  =  5 1 2 N  =  1 0 2 4
A r r a y  S i z e
Figure 4.6 Execution time for 1-D FFT Vs array size, with P = 8.
■  T h e o r e t i c a l  
□  E x p e r i m e n t a l
N  =  6 4  N =  1 28  N  =  2 5 6  N  =  5 1 2  N = 1 0 2 4
A r r a y  S i z e  ( P  =  8)
Figure 4.7 Comparison o f theoretical and experimental running times for 1-D FFT with 
message-passing.
69
01 T h e o r e t i c a l  
□  E x p e r i m e n t a l
N  =  6 4  N =  1 2 8  N  =  2 5 6  N  =  5 1 2  N = 1 0 2 4
A r r a y  S i z e  ( P  =  8)
Figure 4.8 Comparison o f theoretical and experimental running times for 1 -D FFT with 
shared-memory.
N  =  6 4  N =  1 28  N  = 2 5 6  N = 5 1 2  N = 1 0 2 4
B  P = 1
□  P = 2  
■  P = 4
□  P  =  8
E
X
e T
c  . i
u
in
t
. ei
o
ii
-10
- 1 5
A r r a y  S i z e
Figure 4.9 The difference o f theoretical and experimental running times for the message- 
passing version o f the 1-D FFT.
70
N  =  6 4  N =  128 N  =  2 5 6  N  =  5 1 2  N = 1 0 2 4
50
A r r a y  S i z e
Figure 4.10 The difference o f theoretical and experimental running times the shared- 
memory version o f 1 -D FFT.
4.4.2 2-D FFT
The 2-D FFT algorithm for 2k < (N/P) also has been implemented on TurboNet. Two 
groups o f execution times are given; one does not include the time for matrix 
transposition (Table 4.15) while the other does (Table 4.18). The purpose o f  this is to 
show how much time is needed for transposition. The experimental times (with matrix 
transposition) are also compared with the theoretical ones in Figure 4.11 to show that 
they are very close to each other. The differences between them are shown in Figure 4.12.
71
T able  4.15 Execution time for 2-D FFT (without matrix transposition).
8 x 8 16 x 16 3 2 x 3 2 6 4 x 6 4 1 2 8 x 128
P=1 0.001799 0.007557 0.034086 0.150738 0.665432
P=2 0.000942 0.003910 0.017320 0.078233 0.344841
P=4 0.000522 0.002108 0.008952 0.043000 0.183083
P=8 0.000312 0.001214 0.005155 0.026388 0.115527
T able  4.16 Speedup o f  2-D FFT (without matrix transposition).
8 x 8 16 x 16 32 x 3 2 64 x 64 1 2 8 x 128
P=2 1.91 1.93 1.97 1.93 1.93
P=4 3.45 3.58 3.81 3.51 3.63
P=8 5.77 6.22 6.61 5.71 5.76
T able  4.17 Efficiency o f 2-D FFT (without matrix transposition).
8 x 8 16 x 16 32 x 3 2 6 4 x 6 4 128 x 128
P=2 0.95 0.97 0.98 0.96 0.96
P=4 0.86 0.9 0.95 0.88 0.91
P=8 0.72 0.78 0.83 0.71 0.72
T able  4.18 Execution time for 2-D FFT (with matrix transposition).
8 x 8 16 x 16 3 2 x 3 2 64 x 64 128 x 128
P=1 0.002078 0.009052 0.037546 0.161483 0.683177
P=2 0.001229 0.005171 0.020971 0.088475 0.372105
P=4 0.000849 0.003396 0.013748 0.057108 0.228764
P=8 0.000687 0.002736 0.010732 0.045888 0.182566
72
Table 4.19 Speedup o f  2-D FFT (with matrix transposition).
8 x 8 16 x 16 3 2 x 3 2 6 4 x 6 4 128 x 128
P=2 1.78 1.84 1.89 1.91 1.94
P=4 2.77 3.16 3.28 3.29 3. 45
P=8 3.74 4.22 4.60 5.03 5.50
Table 4.20 Efficiency o f 2-D FFT (with matrix transposition).
8 x 8 16 x 16 32 x 3 2 64 x 64 1 2 8 x 128
P=2 0.89 0.92 0.94 0.96 0.97
P=4 0.69 0.79 0.82 0.82 0.86
P=8 0.47 0.53 0.57 0.63 0.69
■  T h e o r e t i c a l  
□  e x p e r i m e n t a l
8 x  8  16 x 1 6  3 2  x  3 2  6 4  x  6 4  128  x 1 2 8
M a t r i x  S i z e  ( P  =  8)
Figure 4.11 Comparison o f theoretical and experimental running times for 2-D FFT with 
shared-memory.
73
8  x 8 16 x 1 6  3 2  x 3 2  6 4  x 6 4  128 x 128
100 T
- 8 0
M a t r i x  S i z e
Figure 4.12 The difference o f theoretical and experimental running times for the 2-D 
FFT.
In the above implementations, elements o f the intermediate matrix were written 
back to the shared memory by each processor, and the transposition was carried out there 
by all processors. To reduce the contention in the shared memory incurred by this action, 
in an improved implementation each processor using its DMA controller writes its 
intermediate matrix into its own local memory, then reads data column-by-column from 
the latter matrix and writes them row-by-row into the shared memory. This not only 
carries out matrix transposition, but also cuts down on the number o f shared-memory 
accesses for each element o f data by each processor from 4 to 2. However, this 
im plem entation has assumes that the internal memory o f  each processor is big enough to 
hold all its output data plus 2k  rows o f input. Furthermore, the overlapping o f  input, 
calculation and output in the algorithm has changed its purpose solely for the speed o f 
execution.
74
We have also taken an additional step to reduce the contention in the shared 
memory, that is, we delay each processor by a certain amount o f time according to:
Ta = ( Processor ID) ^dma M a  p (4.15)
where /DMA is the average time for a DMA controller to transfer a word from the shared 
mem ory to the local memory, M  is the number o f words to be transferred (M  -  2N2/P. if 
the size o f  the internal RAM o f  each processor is larger than the size o f its matrix; 
otherwise, M  equals the half size o f the RAM), a  is a fraction ranging from 0.2 to 0.33 
and is determined by experiment, and P is a number proportional to log, P,  also 
determined by experiment. Results for the resulting improved method are shown in 
Tables 4.21 to 4.23. In Figures 4.10 and 4.11, the execution times and speedups are 
compared for the original (from Table 4.18) and improved methods. It can be concluded 
that the improved method results in significant performance improvement.
Table 4.21 Execution time for 2-D FFT with the improved method.
8 x 8 16 x 16 3 2 x 3 2 64 x 64 128 x 128
P=1 0.002008 0.008182 0.034783 0.151509 0.667038
P=2 0.001126 0.004450 0.018400 0.079082 0.343463
P=4 0.000700 0.002530 0.010500 0.045300 0.190800
P=8 0.000587 0.001800 0.006800 0.029372 0.118000
75
Table 4.22 Speedup of 2-D FFT with the improved method.
8 x 8 16 x 16 3 2 x 3 2 64 x 64 128 x 128
P=2 1.78 1.84 1.89 1.92 1.94
P=4 2.87 3.23 3.31 3.34 3.50
P=8 3.42 4.55 5.12 5.16 5.65
Table 4.23 Efficiency o f 2-D FFT with the improved method.
8 x 8 16 x 16 32 x 3 2 64 x 64 128 x 128
P=2 0.89 0.92 0.95 0.96 0.97
P=4 0.72 0.81 0.83 0.84 0.87
P=8 0.43 0.57 0.64 0.65 0.71
♦  o r i g i n a l  
HU—  i m p r o v e d
8 x 8 16  x  16  3 2  x 3 2  6 4  x 6 4  1 2 8  x 1 2 8
M a t r i x  S i z e
Figure 4.13 Execution time for 2-D FFT Vs matrix size, with P = 8.
76
6
5
4
3
2 ♦  o r i g i n a l  
B —  i m p r o v e d
0
8 8 16 x 16 3 2  x  3 2 6 4  x 6 4 1 2 8  x 128x
M a t r i x  S i z e
Figure 4.14 Speed up o f  2-D FFT Vs matrix size, with P = 8.
CHAPTER 5
M ATRIX M ULTIPLICATION
The numerical solution o f many problems reduces in part or fully to various matrix 
operations. A very common operation is the multiplication o f matrices. A matrix can be 
“full” (with an insignificant number o f  zero elements), banded (non-zero elements 
clustered along the diagonal o f the matrix), or sparse (with a more complicated structure 
in the locations o f  non-zero elements). In this chapter we present an algorithm for the 
multiplication o f full matrices, its performance analysis, and experimental results on 
TurboNet. Naturally for TurboNet we have developed three versions o f the algorithm 
based on message-passing, shared-memory and hybrid communication paradigms.
5.1 Matrix M ultiplication Algorithm
The multiplication o f two matrices A and B  o f  size L by N  and N  by M,  respectively, is a 
matrix C o f  size L by M  whose elements are given by
N -1
c„ = (5-0
A=0
for / = 0, 1 ,..., L - 1, y' = 0, 1 ,..., M-l  and k = 0, 1 ,..., AM. In the following discussion, we 
assume that L, N, M  and P  (the number o f  processors) are always powers o f 2.
The fastest possible algorithm for the multiplication o f  two N b y  N  matrices on the 
hypercube requires N3 processors and consumes time 0 ( lo g , N ) [5]. The basic algorithm 
chosen here is the popular simple algorithm o f  Fox [4, 37] that assumes a message-
77
78
passing architecture. The assignment o f  matrix elements to the hypercube processors 
assumes the mapping o f  the matrix mesh to the hypercube. The run time o f  this algorithm 
is 0 ( N ? /  P + ^  log2 P + N~ j-J~P log2 P) ,  where P is the number o f  processors. It uses 
broadcasts o f  blocks o f A in rows, and circular upward shifts o f  block B  in columns. 
Initially, diagonal blocks A a are selected for broadcast. The algorithm performs -J~P 
iterations o f  the following steps:
• Broadcast the selected block o f A among yJ~P processors in the same row.
• M ultiply the received block o f A with the contained block o f B.
• Shift the block o f  B to the processor directly above, assuming wraparound in 
columns.
• Select the block ^j,(j+i)m0C| p for the next row broadcast.
To understand the basic idea behind this algorithm, relevant calibration follows.
To perform the multiplication in a parallel way, let us expand Equation 5.1 as
follows
N / 2 - \  AM
CU = + 2 / ' A  (5.2)
A=0 k  = N /  2
The sum in the calculation o f  Cy is split into two parts. The first part involves the first M 2 
elements from the i ,h row of A and the j  lh column o f B, while the second part involves 
the last M 2 elements o f each o f them, as illustrated in Figure 5.1. Both parts can be 
further decomposed successively into halves until each resulting part contains only one 
element from each matrix, thus achieving a complete expansion o f Equation 5.1.
79
r o w
I
column
Figure 5.1 The decomposition for c,y.
Now assume a 2-D hypercube (i.e. four processors) for the matrix multiplication 
in Equation 5.1. Both m atrices^  and B  are decomposed into four submatrices, and each 
one o f  them is assigned to a processor as indicated in Figure 5.2 (a). Also the product C  
will be distributed among the processors, as shown in Figure 5.2 (b). From Equation 5.2, 
we know that each ctj o f  C0 is produced from elements o f  the i'h rows o f A 0 and A x and 
the j ,h columns o f B0 and B2, as in the first line o f  the following equation:
Co II o to © + A t B2
C, -  AgB, f  A t B 3
C2 = A>B2 + A 2B0
Cy = a 3b 3 + A 2B t
P h a s e  1 P h a s e  I
4--------- :--------- ►
From this Equation it is seen that Phase 1 (i.e. generating A 0B0 for C 0, A 0B t for C ,. etc.) 
requires communication from PQ(P3) t0 ^ 1(^ 2 ) f ° r submatrix A 0(A3), while in Phase 2 
P\(P2) sends A t(A2) to Pq(P3). Also notice that P0 and P2 (P] and P3) need to exchange 
their submatrices B0 and B2 {B{ and B3) with each other before the calculation in Phase 2 
can start (this scheme is shown in Figure 5.3). In order to save time, these exchanges are 
initiated at the beginning o f Phase 1, and the processors have to wait for the submatrices 
o f  B before starting the calculation in Phase 2 only if  the exchanges are still going on.
In the previous example each submatrix o f A had the same number o f columns as 
the num ber o f  rows in each submatrix o f  B, because the square root o f  P (i.e. 4 in the 
example) was an integer (i.e. 2). Therefore, the processors in the system could be 
arranged in a square mesh configuration. However, this square mesh configuration is not 
possible if  the square root o f  P  is not an integer. For example, the processors in the 3-D 
hypercube (P = 8) can only be arranged in a 2x4 rectangular mesh. As a result in the 
latter case, each submatrix in each processor needs to be divided again in half in the 
colum n or row direction, so that each resulting submatrix o f A has the same number o f 
columns as the number o f  rows in each resulting submatrix o f  B, as shown in Figure 5.4. 
Virtual processing is then needed, because the 3-D hypercube will emulate a 4-D 
hypercube (i.e. P = 16) in order to perform the calculations. Each physical processor will 
emulate two virtual processors by dividing its time between two submatrices assigned to 
it (for example, P,} will first produce A m B(m for C00 and then A 00 /i01 for C01). There are 
four phases o f  operation in this 3-D hypercube example.
81
A B
' I l l ' 1 ,
( / ’ „ ) ( / ’ l )
A  2 ' I ?
( / * 2 ) V ’ > )
( P « ) </'i)
( / ' : ) U ’y )
(a)
C
Ci = ■ 'IA + 'l  |/ i2 
)
C, = A,,B,+ A ,/i,
(P ,)
Cl = Ayl i2+ A tBu
(P 2)
r ,  = / i 2«, 
(/’c
(b)
F igure  5.2 The decomposition o f the matrices A, B and C for four processors.
82
A
(/>,.)
— ► , )
(P2 )
( P A
(a) Phase 1 
A B
11 ( P i)
IS, ( P i ) 11, ( P A
U o )
(Pi)
‘U
V ;i )
* (' 3 )
(b) Phase 2
Figure 5.3 The communication scheme according to Equation 5.3.
B,m (Pt) )Ptn 11 I I I  ( / > !  )  B I I
Pm  ( P i ) Pii ^ 3 1 1  (Pi  ) B 11
Bw ( P A U » B ( P A B n
Bm d \ . ) lli,i Bm ( P 7 ) Hii
A,m d \ ) A>" A w  ( Pi )  A "
A m  ( A ’ j J ' T i •To
(P4 ) / , i  i T o  ( P S ) ^ S I
■To (/'(,) A 7 i i  ( / ' 7 ) T i
F igure  5.4 The decomposition o f the matrices A and B for the 3-D hypercube.
83
Using two processors to implement Equation 5.1 is a special case which can be 
speeded up by selecting a decomposition scheme for matrices different from the 
previously presented one for the 3-D hypercube. More specifically, we divide the 
matrices A and B  in half by rows and columns, respectively, and no further 
decomposition o f submatrices inside each processor is needed. In this case 
com m unication is necessary only for the submatrices o f B. This process is shown in 
Figure 5.5.
A B  C
An
(ft.)
A i
(Pi )
f P\
(Pi) ) (Pi )
G t  — A 0B 0 + A aB  | 
( A  )
G  “  A |/ i ]  +  / /  |/?d
(Pi )
F igu re  5.5 The decomposition o f matrices A, B  and C for the 1-D hypercube.
To summarize the basic algorithm, the matrices A and B  are first divided in such a 
way that the number o f  columns in each submatrix o f  A and the num ber o f  rows in each 
submatrix o f  B are the same, and then (except for the 1-D hypercube case) an iterative 
process is employed that each time broadcasts one o f  the submatrices o f  A horizontally 
(i.e. to all other processors in the same row) and shifts upward the submatrices o f B (no 
shifting is needed in the last phase), so that each processor generates one o f the 
submatrices o f  C. The number o f phases is equal to the number o f rows in the processor 
mesh.
84
When the num ber o f  dimensions in the hypercube is less than or equal to three, all 
communications are local (for higher dimensional hypercubes, the broadcasting o f 
submatrices o f  A is not a local communication when utilizing the m essage-passing 
paradigm). This means that the message-passing communication should be expected to be 
efficient. We have developed three versions o f the matrix multiplication algorithm  based 
on the message-passing, shared-memory and hybrid paradigms. The following sections 
present a theoretical analysis that includes estimated running times, and implementation 
results on TurboNet.
5.2 Theoretical Analysis
Since TurboNet is currently a 3-D hypercube system, our theoretical analysis assumes at 
most eight processors. For simplicity, we assume that all matrices are square o f  size o f  N  
by N.  Based on Equation 5.1, the sequential m ultiplication time is
rsq(A 0 = 2/V3 /L.;,= O ( y V 3) (5.4)
where /ca| is the tim e for one multiplication or one addition.
Let us first look at the 1-D hypercube case. Figure 5.6 shows the DDG o f the 
matrix multiplication algorithm for the 1-D hypercube, derived from Figure 5.5. There are 
two phases in this algorithm. In the first phase, P0 and P , produce A 0B() and A tB h 
respectively. In the last phase the processors exchange B0 and B, with each other and 
generate the final results by adding A 0B, and A tB0 to their previous products, 
respectively. The thick arrows crossing the vertical dotted line in the figure represent the 
exchanges. The purpose o f putting the exchange process in the last phase instead o f in the
85
first one is to see how much effect the communication overhead has on the speedup and 
efficiency o f  the algorithm, since the calculation time is expected to be larger than the 
comm unication overhead (the exchange process would finish before the calculation o f 
phase 1 if  it is initiated at the beginning o f phase 1).
The calculation time in each phase is ( N* / 2 )  . The communication in the last
phase can be implemented in three ways. In the hybrid way, P0 can use a communication 
channel while P, can use the shared memory (or vice verse). The communication times 
for P Q and P, are ( N 2/ 2 ) /com and ( N 2/ 2 ) / DRAM , respectively, where /com is the time to 
transfer one word between neighbors in the hypercube via the direct link, and /DRAM is the 
time to transfer the same word but through the shared memory (the DRAM). From our 
experiment on TurboNet, /com is greater than /DRAM, therefore the total running time for 
the hybrid implementation is
tl2p^ { N )  = 2 { N i / 2 ) t ai + ( N 2/ 2 ) t a m = 0 ( N 3) (5.5)
Phase 1 
P hase 2
Figure 5.6 DDG o f  matrix multiplication for the l-D  hypercube.
86
When using message-passing only with the half-duplex physical channel between the two 
processors, the communication time is 2 { N 2 /2)  tcnm = M 2 tmm . The total running time 
for the message-passing implementation is
W * ,( A O  = 2 ( W 7 2 ) / cal+7V2 , cam= 0 ( N 2) (5.6)
With the shared-memory implementation that involves congestion delay, the 
communication time would be as much as 2( N 2/ 2)  / DRAM= N 2 / DRAM , and the total 
running time in this case could be at most
/(2p.sin, ( ^ )  = 2 ( 2V3/2 )  /cal + N 2 t mAM= 0 ( N 2) (5.7)
From Equations 5.5 to 5.7, it is seen that the performance o f the algorithm depends 
heavily on the communication time, and the hybrid version o f the algorithm is estimated
to have better performance than the other two versions. The actual estimation formulas
are
^n-(2p.hy)( ^ 0  - N 3 W<2p,hy) (AO+(A^2/2 )  wliy ( AO (5.8)
' i H 2 p . m p ) ( W )  = w(2 ( W ) +  M 2 ump( N )  (5.9)
'n-(2p.sm)(W) = (2V)+ N 2 Usm ( N )  (5.10)
where UQp<hy)(N), u(2P,mp)(A0 and w(2piSm)(A0 are the amounts o f  times for one multiplication 
or one addition, including the time for loading operands and storing results, in the hybrid, 
message-passing and shared-memory implementations, respectively. Also, why(AO, ump(N) 
and i/sm(A0 represent the communication times for one word in the three communication 
paradigms, respectively. They are obtained from experiment and are listed in Tables 5.2 
and 5.3. The estimated running times are listed in Table 5.1. The calculation time for a
87
single processor is estimated from Equation 5.11 and is also included in Table 5.1. It is 
obtained by
/,;_sli(A O = 2 A f3 »sq(yV) (5.11)
where wsq(A0 is defined similarly to u2p(N) but now with one processor. Its values are also 
obtained from experiment and are listed in Table 5.2.
Table 5.1 Estimated running time for matrix multiplication on the 0-D and 1-D 
hypercubes.
8 x 8 16 x 16 3 2 x 3 2 6 4 x 6 4 1 2 8 x 128
p =  1 0.000520 0.003866 0.029443 0.224348 1.687490
P = 2
hy 0.000404 0.002880 0.022187 0.175391 1.396054
mp 0.000521 0.003720 0.027400 0.202813 1.500705
sm 0.000509 0.003398 0.024712 0.187066 1.467966
Table 5.2 The times w(2p xx) (in nsecs) for matrix multiplication on the 0-D and 1-D 
hypercubes.
8 x 8 16 x 16 3 2 x 3 2 6 4 x 6 4 128 x 128
"Sq 507.82 471.92 449.26 427.90 402.33
(^2p.hy) 728.52 673.83 664.79 662.98 662.71
H(2p,mp) 886.72 846.68 806.00 759.10 708.63
(^2p,sni) 882.81 782.72 728.52 701.42 693.98
Table 5.3 The times uxx (in nsecs) for matrix multiplication on the 1-D hypercube.
P = 2 8 x 8 16 x 16 3 2 x 3 2 64 x  64 128 x 128
«hv 968.75 937.50 787.11 778.81 763.86
^mp 1046.87 984.37 966.38 932.37 890.66
«sm 875.00 750.00 820.30 779.30 768.20
88
The DDG for the 2-D hypercube is shown in Figure 5.7. As discussed in the 
example o f using the 2-D hypercube in Section 5.1, there are two types o f comm unication 
in this case; the comm unication in both phases o f  submatrices o f  A between processors in 
the same row and the exchange in phase 1 o f submatrices o f  B between processors in the 
same column (the exchange finishes before the end o f phase 1). All these comm unications 
are represented by thick arrows in the DDG. In phase 1, F, (P2) receives one row o f  A 0 
(A 3) from PQ (P3) and performs multiplications o f this row with all columns o f  5 , (B2). 
This process is repeated N/2 times, until all rows o f A 0 (A3) have been multiplied by B i 
(B2). In the meantime, the exchange o f  the submatrices o f  B is also in process. Therefore, 
the running time o f  each iteration in this phase is the calculation time N 2/ 2 / cal plus the 
time for transferring N/2  words between two neighboring processors. In the hybrid 
implementation, as in the 1-D hypercube case, submatrices o f A are transferred via the 
shared memory while submatrices o f  B are exchanged over the communication channels. 
This leads to a communication time in each iteration o f  N / 2  / DRAM and the same amount 
o f  maximum delay. In the message-passing implementation, since submatrices o f A and 
B go through different channels, i.e. there is no congestion on the channels, the time for 
communication is N / 2  lmm. When only shared memory is used for communication, the 
amount o f  time is N / 2 t mAM, plus the maximum delay o f 5 N / 2 t DIKAM due to the 
congestion from exchanges o f submatrices o f A and B. The time for phase 2 is sim ilar to 
the one for phase 1, except that there is a time delay o f  N / 2  / I)RAM when using the shared 
memory in the shared-memory implementation only; since the exchange o f  submatrices 
o f B  has finished, the delay is caused by the communication o f a submatrix o f  A between
89
processors in another row. The total running times o f the three implementations, 
therefore, are
W h y , ( W ) = ( W 7 2 ) / c n . +  W2 t 0MM= O ( N 3) (5.12)
t ^ , nv)( N ) = ( N > / 2 )  tca] + ( N 2/ 2)  lcom = O ( N ' )  (5.13)
l ^ , m)( N ) = ( N ’/ 2 ) l a] + 2 N 2 / DRAM = 0 ( W 3) (5.14)
Po P, P2 P3
P hase 1
Phase 2
Figure 5.7 DDG o f matrix multiplication for the 2-D hypercube.
From the three equations above, since from experiment ( tmm/ 2) > t DRAM, we predict that 
the hybrid version o f the algorithm will provide the best performance. The estimated 
running times o f  the three implementations, based on Equations 5.15 to 5.17 below, are 
presented in Table 5.4.
'..:-(4p.hy,( )^ = ^72«(4p.hy)(^) (5.15)
E^-(4p,mp, ( ^ 0  = A ^/2  w,4p.mp)(A0 (5.16)
90
' E - ( 4 p . « „ ) ( W ) =  ^ V 4 ( M ( 4 p . n n | ) ( ^ ) + M ( 4 p . » n 2 ) ( ^ ) )  ( 5 - 1 7 )
In the equations, w(4p hy)(A0 and % p,mp)(A0 are the amounts o f  time for one multiplication 
or one addition, and the transfer o f at most 2IN and 1 IN word with the hybrid and 
message-passing paradigms, respectively. The parameters M(4PiSmp(A0 and «(4P.Sm2)(^0 are 
the amounts o f time for one multiplication or one addition, and the transfer in phases 1 
and 2 o f at most 6/W and 2/N word with the shared-memory paradigm, respectively. Their 
values are from experiment and are recorded in Table 5.5.
Table 5.4 Estimated running time for matrix multiplication with the 2-D hypercube.
P = 4 8 x 8 16 x  16 3 2 x 3 2 6 4 x 6 4 128 x 128
hy 0.000254 0.001459 0.010211 0.076225 0.567676
mp 0.000199 0.001398 0.010636 0.080952 0.637976
sm 0.000319 0.001713 0.011101 0.079013 0.604105
Table 5.5 The times w(4p xx) (in nsecs) for matrix multiplication with the 2-D hypercube.
P = 4 8 x 8 16 x 16 3 2 x 3 2 64 x  64 1 2 8 x 128
W(4p,hy) 992.19 712.40 623.23 581.55 541.38
(^4p.mp) 777.33 682.62 649.17 617.61 608.42
^(4p,sinl) 1531.91 939.64 713.62 607.04 579.41
**(4n,sm2) 960.28 733.21 641.48 598.60 572.83
The 3-D hypercube implementation o f m atrix multiplication is similar to the 2-D 
case, except for the addition o f  two more phases, more shared memory delay, and that 
each processor has to perform matrix multiplication for two submatrices o f A.  In order to
91
avoid confusion from line crossings, instead o f drawing the entire DDG for eight 
processors, we show in Figure 5.8 the DDG for one processor (P0) only. In the figure, a 
submatrix is identified with a processor number if it comes from another processor. The 
structures o f the DDGs for the other processors are similar to this one except that 
processor numbers identifying some submatrices are different.
Po
Phase 1 Phase 2 Phase 3 Phase 4
F igure  5.8 DDG o f  matrix m ultiplication for P0 o f  the 3-D hypercube.
In each iteration o f phase 1, each processor performs multiplications o f two pairs 
o f submatrices o f  A and B, as indicated in Figure 5.8. Also the comm unication o f N/4 
values is needed. The calculation time in each iteration is W2/ 4 / cal ■ For the hybrid 
implem entation, the amount o f time for the communication o f  N/4 words through the 
shared memory (the exchanges o f  submatrices o f B are via comm unication channels) is
92
N / 4  / DRAM plus the same amount o f time for maximum delay. In the m essage-passing 
implementation, it takes N / 4  lcmn to transfer the N/4 words through a channel. Since the 
number o f these “calculate-communicate” iterations is N/4,  and phases 2 to 4 are the same 
as phase 1, the running times o f the hybrid and message-passing implementations, 
respectively, are
W H y)(A' ) = ( A'V 4 ) ^  + ( ^ V 2 ) / n RAM -C >(A '3) (5.17)
' ,8,„„P) ( AO = ( N 7 4 )  ' c  + ( ^ 7 4 )  = 0( N 5) (5.18)
The shared-memory implementation is a little more complicated than its counterpart in 
the 2-D hypercube case, due to the introduction o f another shared memory module. As 
shown in Figure 5.4, processors P0 to P3 are in one Hydra board while the rest o f  the 
processors are in another, and hence the communications o f submatrices o f  A need not 
cross the VME bus. This limits the congestion delay contributed by each processor when 
transferring submatrices o f  A to N/ 4  / DRAM for one iteration. This delay exists in all four 
phases. In exchanges o f submatrices o f B, two processors in one board have to access the 
shared memory in another board, while the other two processors use only the shared 
memory in the same board. All these operations generate a maximum delay o f 
6 N / 4  / DRAM for each iteration in the ISB (which is connected to the shared memory) o f 
each board, and this delay appears only in the first three phases. Therefore, with N/4 
iterations in each phase, the total running time for the four phases in the shared-memory 
implementation is
W n . ) ( ^ ) = ( ^ 7 4 ) / c a .  + 1 3 iV 7 8 /DRAM= 0 ( A ( 3) (5.19)
93
From Equations 5.17 to 5.19, we see that the hybrid version o f  the algorithm has less 
communication time than the other two (since ( /com/2 )  > t DRAM), hence its execution will 
be the fastest o f all three. The estimation o f running times for the 3-D hypercube will 
again be similar to the one for the 2-D hypercube. The estimated times are based on the 
following equations and are listed in Table 5.6:
l^:.-(8p,liy) ( ^ 0  = N '  /A w(8p.|,y) ( ^ 0  (5.20)
i^;-(8p.mp) ( ^ 0  = N '  /A M(8|U„p) ( jV )  (5.21)
'h-,8p„„,(A0 = N* / \ 6  (3?/(gpsml)(jV )+ »(8PiSm2) (A^)) (5-22)
In the equations above, the definitions o f u(8pj,y){N) and »(8p,mp)(N) are the same as the 
ones for % p,|iy)(A0 and % p,mp)(A0 in Equations 5.15 and 5.16, while «(8p,sni|)(A0 and 
i/(gpsm2 )(A0  are the amounts o f time for one m ultiplication or one addition, plus the 
transfer o f  at most 8/7V and 2/N  word, with the shared-memory paradigm, in phases 1 to 3 
and 4, respectively. Their values are from experiment and are recorded in Table 5.7.
Table 5.6 Estimated running time for matrix multiplication with the 3-D hypercube.
"13 II 00 8 x 8 1 6 x 1 6 3 2 x 3 2 6 4 x 6 4 1 2 8 x 128
hy 0.000207 0.000874 0.004812 0.035357 0.263544
m p 0.000210 0.000876 0.005229 0.040400 0.326750
sin 0.000478 0.001800 0.008074 0.048980 0.315953
94
Tabic 5.7 The times u(8p xx) (in nsecs) for matrix multiplication with the 3-D hypercube.
II 00 8 x 8 16 x  16 3 2 x 3 2 6 4 x 6 4 128 x 128
"(8p,liy) 1617.19 853.52 587.40 539.51 502.67
w(8p,mp) 1640.63 855.47 638.31 616.46 623.23
^(Sp.sml) 4552.08 2089.84 1132.32 829.14 640.00
(^8p,sm2) 1281.25 761.72 545.41 502.01 490.52
In general, the running time o f the matrix multiplication algorithm on a hypercube 
with P  processors, where P >  2,
tc (AT) = (2 Af3/ P )  /c,  + ( N 2/ r )  t x + Tm  = 0 ( N y/ P ) (5.23)
where tx is the amount o f time to transfer a word through the shared memory or 
comm unication channel(s), r  is the number o f rows in the processor array or the number 
o f  phases, and T^i is the congestion delay in the communication. In the hybrid 
implementation o f the algorithm, r dc| is induced only from the communication o f 
submatrices o f  A through the shared memory. With the message-passing paradigm only, 
if  Plr  (the number o f columns in the processor array) is less than or equal to 2, 7’dc| is zero 
(i.e. all communications are local), otherwise it is the delay due to channel congestion. 
For the shared-memory implementation, r de| is caused by the communication o f 
submatrices o f  A and B that requires sharing o f  the bus or memory.
95
5.3 Performance Results on TurboNet
The algorithms o f  matrix multiplication were implemented on TurboNet. This section 
presents experimental results, speedups, efficiencies, and comparisons o f  theoretical and 
experimental results.
In the 1-D hypercube implementation, as predicted in the last section, the hybrid 
version has the best performance among all three versions. The experimental results are 
shown in Table 5.8. Figure 5.9 compares execution times o f the three versions o f  the 
algorithm on the 1-D hypercube, and the difference between the theoretical and 
experimental results is plotted in Figure 5.10.
Table 5.8 Execution time o f matrix multiplication on the 0-D and 1-D hypercubes.
8 x 8 16 x 16 3 2 x 3 2 64 x 64 1 2 8 x 128
p = 1 0.000523 0.003881 0.029438 0.224369 1.687509
P = 2
hy 0.000381 0.002851 0.022170 0.175384 1.396061
mp 0.000513 0.003700 0.027417 0.202837 1.500771
sm 0.000514 0.003387 0.024688 0.187088 1.467935
Table 5.9 Speedup o f matrix m ultiplication for the 1-D hypercube.
*0 n to 8 x 8 16 x 16 3 2 x 3 2 64 x 64 128 x  128
hy 1.37 1.36 1.33 1.33 1.22
mp 1.02 1.05 1.07 1.10 1.12
sm 1.02 1.15 1.19 1.20 1.15
96
Table 5.10 Efficiency o f matrix multiplication for the 1-D hypercube.
*3 II K> 8 x 8 1 6 x  16 3 2 x 3 2 64 x  64 1 2 8 x 128
hy 0.68 0.68 0.66 0.66 0.61
m p 0.51 0.52 0.54 0.55 0.56
sm 0.51 0.57 0.59 0.60 0.57
p  =  2
1 .5
E  1 .3
x
e  1.1
T
c  j 0 .9  
u
, m  0 .7  
. e
' 0 .5
0.1
6 4  x 6 4  1 28  x  128
M a tr ix  S iz e
Figure 5.9 Performance comparison o f the three versions for the 1-D hypercube.
97
P = 2
40
E 30
x  2 0
10
i -1 0  
U m -20 
‘ e  -3 0  
1 -4 0
0  -5 0
n -6 0
-7 0
M a t r i x  S iz e
Figure 5.10 The difference o f theoretical and experimental tim ings for the 1-D 
hypercube.
In the implementations o f  matrix multiplication on both the 2-D and 3-D 
hypercubes, the hybrid versions o f the algorithm again have better performance than the 
others. These results agree with our earlier analysis according to Equations 5.12 to 5.14 
and 5.17 to 5.19. Also notice that in Tables 5.11 and 5.14, where the experimental results 
for the 2-D and 3-D hypercubes are presented, as the size o f the matrices increases, the 
message-passing version become slower than the shared-memory one. This is because the 
m essage-passing paradigm is not suitable for the communication o f large amounts of 
data. Relevant speedups, efficiencies, performance comparisons, and differences between 
theoretical and experimental timings are all given below.
8 x 8  1 6 x 1 6  3 2 x 3 2  6 4 x 6 4  1 2 8 x 1 2 8
B h y  
□  m p  
B  s m
98
Table 5.11 Execution time o f matrix multiplication on the 2-D hypercube.
P = 4 8 x 8 16 x  16 3 2 x 3 2 6 4 x 6 4 128 x  128
hy 0.000258 0.001483 0.010170 0.076173 0.567643
m p 0.000204 0.001390 0.010616 0.080993 0.638016
sm 0.000323 0.001702 0.011051 0.078957 0.604139
Table 5.12 Speedup o f matrix multiplication for the 2-D hypercube.
■*rIIG* 8 x 8 16 x  16 3 2 x 3 2 6 4 x 6 4 128 x  128
hy 2.03 2.62 2.89 2.95 2.97
mp 2.56 2.79 2.77 2.77 2.64
sm 1.62 2.28 2.66 2.84 2.79
Table 5.13 Efficiency o f  matrix multiplication for the 2-D hypercube.
P = 4 8 x 8 16 x  16 3 2 x 3 2 64 x  64 1 2 8 x 128
hy 0.51 0.65 0.72 0.74 0.74
mp 0.64 0.70 0.69 0.69 0.66
sm 0.40 0.57 0.67 0.71 0.69
6 4  x  6 4  1 2 8  x 128
M a t r i x  S iz e
F igure  5.11 Performance comparison o f the three versions for the 2-D hypercube.
1 6 x 1 6 6 4 x 6 4 1 2 8 x 1 2 8
M a t r i x  S iz e
Figure 5.12 Difference o f theoretical and experimental timings for the 2-D hypercube.
100
Table 5.14 Execution time o f matrix multiplication on the 3-D hypercube.
ii 00 8 x 8 16 x  16 3 2 x 3 2 6 4 x 6 4 1 2 8 x 128
hy 0.000217 0.000897 0.004878 0.035379 0.263526
m p 0.000205 0.000889 0.005242 0.040346 0.326720
sm 0.000472 0.001789 0.008110 0.049000 0.316000
Table 5.15 Speedup o f  matrix multiplication for the 3-D hypercube.
ii 00 8 x 8 16 x 16 3 2 x 3 2 6 4 x 6 4 128 x  128
h y 2.41 4.33 6.03 6.34 6.40
mp 2.55 4.37 5.62 5.56 5.17
s m 1 .1 2 2.17 3.63 4.58 5.34
Table 5.16 Efficiency o f matrix multiplication for the 3-D hypercube.
00II 8 x 8 16 x 16 3 2 x 3 2 6 4 x 6 4 128 x 128
hy 0.30 0.54 0.75 0.79 0.80
mp 0.32 0.55 0.70 0.70 0.65
sm 0.14 0.27 0.45 0.57 0.67
101
P = 8
6 4  x 6 4  128  x  1 2 8
M a tr ix  S iz e
Figure 5.13 Performance comparison o f  the three versions for the 3-D hypercube.
p  =  8
8 x 8  1 6 x 1 6  3 2 x 3 2  6 4 x 6 4  1 2 8 x 1 2 8
6 0
-8 0
M a t r ix  S iz e
Figure 5.14 Difference o f theoretical and experimental tim ings for the 3-D hypercube.
102
The execution time on the 0-D hypercube, shown in Table 5.8, involves only 
calculation time, i.e. the timer starts after both input matrices A and B are stored in the 
processor’s memory. For the sake o f  comparison, another algorithm was also developed 
for a single processor so that data inputs and calculations are overlapped. Results for the 
latter case are presented in Table 5.17, together with pure calculation times.
Table 5.17 Two implementation results o f  matrix multiplication on the 0-D hypercube. 
Cal.: calculation time only, Inp. & Cal.: data input and calculation time.
p = 1 8 x 8 16 x  16 3 2 x 3 2 64 x  64 1 2 8 x 128
Cal. 0.000523 0.003881 0.029438 0.224369 1.687509
Inp. & Cal. 0.000590 0.004067 0.029824 0.224966 1.689134
CHAPTER 6
ADAPTIVE M ULTIRESOLUTION STRUCTURES  
FOR IMAGE SEGM ENTATION
In this chapter, an algorithm is described for the creation o f  region-adjacency-graph 
(RAG) pyram ids on TurboNet. The implementation o f  the algorithm is discussed and 
performance results are presented for the three communication techniques which are 
supported by TurboN et’s hybrid architecture. Each level o f these hierarchies o f irregular 
tessellations is generated by independent stochastic processes that adapt the structure of 
the pyramid to the content o f  the image. RAG pyramids can be used in multiresolution 
image analysis to extract connected components from labeled images. The experimental 
results indicate that efficient communication is vital to good performance o f the 
algorithm.
6.1 Introduction
The structure o f  conventional multiresolution representations o f  images, i.e. image 
pyramids [15], is regular, with all parents having the same number o f  children arranged in 
a square neighborhood. The rigid structure o f  image pyramids is not optimal for feature 
extraction tasks where homogeneous images have to be delineated. Techniques have been 
presented that m odify the links o f the image pyramid in order to alleviate this problem, 
however they do not provide satisfactory or efficient solutions. In contrast, if  the child-
103
104
parent links are established according to local homogeneity criteria, a region adjacency 
graph (RAG) is used to represent each level o f the pyramid [13], RAG pyramids are 
hierarchies o f  RA G ’s [14]. The irregular structure o f  RAG pyramids adapts to the image 
content and is, therefore, optimal for feature extraction; it also preserves the logarithmic 
processing time property o f  regular pyramids [13].
For the sake o f  simplicity, this research assumes binary images. The entire image 
array is the RAG corresponding to the highest resolution. To build the next level o f  the 
RAG pyramid, each node in the RAG o f the current level has to determine if each one o f 
its neighbors has the same value (i.e. 0 or 1) as its own. With this class information, 
stochastic processes are applied to decide if  this node will survive and become a node in 
the next lower-resolution level. Links between survivors and non-survivors (i.e. child- 
parent links) are then established by connecting the survivors to their non-surviving 
neighbors in the same class. The survivors also establish new neighborhoods in the new 
level. Eventually, each connected component will have only one survivor in the highest- 
level RAG. By starting at an apex o f  the highest-level RAG and traversing the pyramid 
through previously established links, we can find all the pixels in the corresponding 
connected component.
Algorithms for generating RAG pyramids on the CRCW  (Concurrent-Read, 
Concurrent-W rite) PRAM (Parallel Random-Access M achine) and implementations on 
the Connection M achine CM-2 message-passing hypercube supercomputer have been 
presented in [14]. In the rest o f  this chapter, we present results for the creation o f  RAG 
pyramids on our TurboNet system.
105
6.2 Algorithm to Generate RAG Pyramids on TurboNet
In this section, the algorithm for generating RAG pyramids on TurboNet is described. 
There are 5 phases in this algorithm. Let us assume that a node c0 in level / has r 
neighbors (r = 8 in level 0, assuming 8-connectedness) labeled by ch where / = 1, ..., z\ c() 
is associated with two binary state variables, q{) and p Q, as well as with one random 
variable x0, uniformly distributed between [0, 1].
In Phase 1, we assign to each c, o f  c0 a binary state variable for / = 1, .... z\ 
according to the following:
f 1, if  c, belongs to the same class with c0
[0, otherwise. ^  ^
The stochastic process mentioned in Section 6.1 is applied iteratively until no 
more changes take place. Every iteration has two phases, called Phase 2 and Phase 3. If 
c0survives at the end o f the k{U iteration, its p 0(k) value will be 1, otherwise it will be 0. In 
the beginning o f each stochastic process, /?,(0) = 0, Vz = 0 ,..., r.
In Phase 2, q0{k) in the kih iteration is updated as follows:
This means q0(k) is 1 if  there is no survivor among all neighbors o f  c0, including itself, 
that belong to its own class. I f  q0(k) is 1, a random number will be generated according to 
the uniform distribution and will be stored in x0(k).
In Phase 3, p (i(k) is determined based on the updated q,(k) and x^k):
1, if  XlPl ( k - \ )  = 0 Vz = 0,•••,/• 
0, otherwise.
( 6 .2 )
1, if  q0( k ) x 0( k )  = max ( Xiql ( k ) x i ( k ) )  > 0 V / = (),••■,z- 
Po( k - 1), otherwise.
(6.3)
106
Thus, c0 becomes a survivor if  the outcome o f the random variable for c0 is a local 
maximum in the similarity graph.
All survivors produced in each level by the stochastic processes satisfy the 
following two conditions that are required for the RAG pyramids:
• any two survivors in the same level m ust not be neighbors, and
• each non-survivor has to have at least one survivor among its neighbors in the 
same level.
The stochastic processes will terminate if  ail qo{kys are 0.
The task o f  Phase 4 is to establish the child-parent links between survivors 
(parents) in the new level and non-survivors (children) in the preceding level. These links 
are actually a subset o f the edges in the RAG o f  the preceding level. A survivor c0 looks 
at each one o f  its non-surviving neighbors n \c h for / = 1, ..., r, to see if  n \ c, has any 
surviving neighbor, other than c0 (a non-survivor may have more than one neighboring 
survivor as long as they are not neighbors themselves). Let the neighbors o f n lc , be n2cr 
for /  = 1, ..., /*’. If  one or more n lc j s  are found to be survivors, their x0(A:) values will be 
compared with the one o f c0, and n\Cj will become the child o f  c{) or one o f  the Z72t,/’s, 
w hichever has the largest random value o f  x0(k) drawn when it was chosen as a survivor. 
If  no survivor is found from the n2c/s , c0 will simply connect to n \c , as its parent. For 
example, ignoring the solid lines in Figure 6.1, we use dotted lines to represent 
connections between neighbors (represented by circles). Since the survivors (dark circles) 
c0 and n2c] have x Q(k) values o f  0.3387 and 0.1956, respectively, n lc , becomes the child 
o f c„. There is a child flag for each non-survivor which is set when it is chosen as a child
107
o f  its parent. Hence, by checking the child flag, c0 does not have to go through Phase 4 
for n\Cj if  it already has a parent.
In Phase 5, all survivors in the new level establish connections to create the new 
RAG. Two survivors are neighbors in the RAG o f  the new level if they were not more 
than three edges away in the preceding level. It is implemented as follows. A survivor c0 
looks into the n2c f s  o f  each one o f its n lc ,’s (it must not be the survivor c0 itself). If  an 
n2fj  is also a survivor, then survivor c0 takes it as a neighbor and moves to the next n2cj. 
If  not, cQ goes one step further into the neighbors o f  n2cj, represented by n3ch for / = 1, 
..., /•” . If  an YibCj is a survivor, it becomes a neighbor o f c0, otherwise c0 goes to the next 
n3ci until / = /•” , the number o f  neighbors o f the current n2cj. As an example, let us look 
at Figure 6.1. Because n l c x is a survivor, c0 connects to n2cx (this connection is 
represented by a solid line). Then, since n lc 2 is not a survivor, c0 goes into its neighbors 
and finds n3c2 as a new neighbor. By setting up connections between these survivors, the
Figure 6.1 An example for Phases 4 and 5.
108
construction o f the RAG is completed, and a new iteration o f  the algorithm will start with 
Phase 1 assuming this RAG as the input.
A survivor becomes an apex root in the hierarchical structure when it can no 
longer find any neighbor. The algorithm quits when all survivors are roots.
6.3 Implementation on TurboNet
At the beginning o f the algorithm, an input image (or matrix) is divided into P sections as 
shown in Figure 6.2, where P  is the number o f processors used. Each section has N '/P  
pixels and is assigned to one processor, where N xN  is the size o f  the square matrix. An 
optimal mapping o f  the ring o f image sections to the TurboN et’s hypercube structure is 
carried out using the binary reflected Gray code. Therefore, neighboring sections are 
assigned to neighboring processors in the hypercube [16]. Before Phase 1 for level 0 can 
start, each pixel in the input image has to find all its 8-connected neighbors by recording 
their addresses in a list. This takes 0 (N 2/P) time.
In Phase 1, each processor finds the X /s for all neighbors o f  its pixels c{) for the 
class similarity criterion. This time is proportional to
/, ( N )  = r ( N 2/ P ) ( t ca] + a 0tmm) = 0 ( r  N 2/ P )  (6.4)
where r is the number o f neighbors o f c0, a 0 is the number o f neighbors o f c{) outside of 
the current processor, and /ca) and /com are times for one calculation and one 
comm unication between two processors, respectively. In Phases 2 and 3, the stochastic 
processes select survivors. Again if  the neighbor c, o f  c0 is not within the current
109
processor, comm unication between the respective two processors is needed for the values 
o f  cji, Pj and x,. Both phases have similar asymptotic time complexities as show below 
t2( N )  = r ( N 2/ P ) { 2  >cal + 3 a 0 fcom) = ()(r  N 2/ P)  (6.5)
and
t 3( N )  = r ( N 2l P ) { A t cal + 3 a 0 / com) -  (){r N 2/ P)  (6 .6 )
In Phase 4, if  n 1 c, is not in the same processor as c0, then the addresses, the p- s and x /s  
o f the n2c- s have to be transferred. Otherwise, only the values o f the p / s  and x /s  for 
those n 2 c /s  that are not in the current processor are needed from other processors. Its 
time complexity is
t , ( N )  = ( N 2/ P - d 0) ( a i 3t com + r / cal +</,((*,.2/C0II, + /•'/„, + s  ia]))
i , / (6 -7)= 0 ( r 2 N  / P)
assuming that r »  (/•’, r/, and s /  and N2/P »  d0, where cl, is the number o f non-survivors 
among the n l c / s ,  sf- is the number o f survivors among the n2c/s , and a , and a , are the 
numbers o f  pixels that are not in the current processor in d , and sh respectively. Also, d{) is 
the total number o f  non-survivors o f the original N 2/P. The time complexity o f Phase 5 is 
similar to that o f  Phase 4 except for one additional step; cQ may look into the n3cf s  if 
n2cj is not a survivor. The time consumed is
t 5 ( N )  = ( N  / P  -  d 0 ) ( a 0 /C01ll + /•(/• /c.,| + d f ( a 0/ /C0I1) + /• /c.,|)))
( 6 .8 )
= 0 ( / -  N 2/ P )
2
also assuming that r »  (/•’, r "  and d /  and N  IP »  d0, where dj is the num ber o f  non­
survivors among the n 2 c /s, and a 0/ is the number o f n3cis  outside o f  the current
processor.
110
Since the total number o f levels in the hierarchy is O (log 2 N ) ,  Phases 1, 4 and 5
are performed O (log 2 N )  times, while Phases 2  and 3 are performed O ((lo g 2 AO2)
times. Therefore, the running time o f the entire algorithm is
l ( N )  -  (9(log2 Af(/l (AO + /.l (AO + /5 (AO) + (log 2 N ) 2 ( l2( N )  + t 2( N ) ) )
= 0 ( r ( N 2/ P ) l o g 2 N ( r 2 + \og2 N ) )
(6.9)
N
p-i
TV
Figure 6,2 Assignment o f  data for an N xN  input image.
We have developed message-passing, shared-memory and hybrid versions o f the 
algorithm. In the m essage-passing version, all communications are done through 
communication channels. If  there is no direct link between the source and destination 
processors, a comm unication package has to go through intermediate processor(s). When 
communicating between boards, the source first sends the package to the processor in the 
other board that is directly linked to it. If this processor is not the destination o f the
I l l
package, it will pass it to another processor. For example, in Figure 2.2 from Chapter 2. if 
a2 wants to transmit a package to b3, it will send it to b l ,  and b l will transfer it to b2 (or 
b4). b2 (or b4) then passes it to b3. A communication package is 32 bits long and is 
divided into four fields, as depicted in Figure 6.3. In the diagram, D, S and FC stand for 
destination, source and function code, and they are 3, 3 and 4 bits long, respectively. 
DATA, representing either p h x h n\Cj or n3c, depending on the function code, takes 16 
bits. When DATA is just one item o f a group o f data, the 6 -bit INDEX gives its order. 
For example, when a source requests the neighbors o f  n \c h the destination will reply by 
sending the « 2 c /s  one-by-one with their indices. Because x, is a 32-bit floating point 
number, it can only be sent in two parts with 16 bits each and INDEX is also used here. 
Table 6.1 shows all function codes.
DATA INDEX FC S D
Figure 6.3 A 32-bit long communication package.
Table 6.1 Function codes for communication operations.
FC Request Reply
X 0 1
P 2 3
nlc 4 5
n3c 6 7
112
When using only shared-memory for communication, each processor copies all 
values o f p h <7 ,, x,- and the neighbor addresses o f all its pixels to the shared memory using 
DM A controllers. This happens only at the end o f Phases 2, 3 and 5.
The major advantage o f  TurboNet is that it has both message-passing and shared- 
m emory capabilities. This allows users to choose between communication methods based 
on their particular tasks, and hence benefit form both methods at the same time. In our 
hybrid version o f the algorithm, each processor has the ability to decide whether to use a 
communication link or the shared memory when communicating with others by observing 
the following rules:
• if  the source and destination processors are neighbors in the system, i.e. in the 3- 
dimensional hypercube, communication is done via communication links;
•  otherwise, data is sent through the shared memory.
Because in this system, a processor cannot interrupt another processor unless it sends a 
m essage to a communication port o f the latter processor, all requests for communication 
are delivered via communication links in all o f our implementations. Hence, the 
processors can choose what communication method to use only when replying to 
requests. As mentioned earlier in this section, the values o f  x, have to be split and sent in 
two parts. In order to reduce the communication overhead and simplify the programming 
task, in the hybrid version o f the algorithm the x /s  are always transmitted through the 
shared memory as in the shared-memory implementation.
All processors have to be synchronized at the end o f  Phases 2, 3 and 5 before 
continuing. This is done by using the shared memory. If  a processor finishes its job
113
earlier, it may not quit until all others are also done, because it has to stay “active” to
respond to requests by other processors directed to it.
6.4 Performance Results
All three versions o f the algorithm have been implemented on TurboNet. The 
performance results are presented in this section. All running times are expressed in 
seconds. Four types o f binary images, named a, b, c and d, were used to test the 
algorithm. The image sizes were 4x4, 8 x 8 , 16x16, 32x32, 64x64 and 128x128. The 
image types are depicted in Figure 6.4. The execution times for one processor and the 
numbers o f  pyramid levels are given in Table 6.2. Because we have obtained the same 
numbers o f  pyramid levels for 2, 4 and 8  processors as those shown in Table 6.2, these 
numbers are not repeated in the following tables.
In Tables 6.3 to 6.5, execution times for the three versions o f  the algorithm with 
two processors are shown. We can see that the shared-memory version has the best 
performance over the other two, especially as the image size increases, and the hybrid one 
comes second. As detailed in Section 6.3, to send a piece o f  data a processor has to
construct a header and then merge it with the data. This includes at least 7 logical
operations and doubles the amount o f time for just sending and receiving a piece o f data 
through communication ports. It is also longer than just reading from or writing to the 
shared memory the same piece o f data when there is little or no collision on the ISB bus 
connecting the processors to the memory. The second reason for which the message- 
passing version is slower than the others is the limited amount o f memory that forces us
114
to discard data after use, instead o f  keeping them in the memory. This implies that two or 
more communications for the same piece o f data may be needed. When only two 
processors are used, two neighboring nodes in the hypercube are always chosen, hence in 
the hybrid version communications are almost the same as those in the message-passing 
one, except for transferring the x,’s through the shared memory. This means that the 
hybrid version will have similar performance with the message-passing version. Also, 
when P = 2, the possibility o f  bus collisions is low in the shared-memory version. All 
these factors make the shared-memory version a winner over the others.
(c) (d)
Figure 6.4 Four types o f  images used in testing: (a) Uniform, (b) Horizontal half, (c) 
Three objects, including the background, and (d) Checkerboard.
Table 6.2 Execution time and numbers o f pyramid levels for a single processor.
a b c d
4x4 0.002417 0.001267 j 0.002415 | 0.002416
2  1 2  "l 2  " 2
8 x8 0.015711 0.012314 0.009935 0.010288
3 3 3 3
16x16 0.075226 0.067247 0.059171 0.031162
4 4 4 4
32x32 0.349460 0.329755 0.310074 0.194625
5 5 5 5
64x64 1 1.472487 1.428048 1.351442 1.098416
6 r 6 6 6
128x128 5.919033 5.888514 5.752992 5.087174
6 6 6 6
Let us look at the results from the viewpoint o f image type. In an image, due to 
the nature o f connected components, pixels on edges have fewer neighbors than non­
boundary pixels, i.e. r (or /•’, or /•”) is small. This is why the execution times for image d, 
a checkerboard image, are the smallest among the results. Due to the way an image is 
divided for the algorithm, image b requires no communication between processors when 
P = 2 (or less communication than other images when P = 4 or 8 ). Hence, the 
performance for image b is better than the one for image c as the size o f  the image 
increases, although the latter might have a large number o f  edges than the former (this is 
not true for a single processor since no communication is involved in that situation). This 
indicates that there are two major factors having influence on the performance: the 
num ber o f  communication operations and the value o f r. Generally, the bigger the value
o f i\ the higher the frequency o f communication operations unless a sim ilar situation as 
that for image b with P = 2 happens.
As explained in the last paragraph, one o f  the major problems with the 
performance o f the algorithm is the necessity for frequent communications in the general 
case. All implementation results in Tables 6.3 to 6.5 are obtained with Phases 4 and 5 
m erged due to the fact that both phases search the n lc / s  and their neighbors in a similar 
way. As a comparison, the execution times for the hybrid version with separation o f 
Phases 4 and 5 is presented in Table 6 .6 . Improvement as high as 9 percent is gained 
from this merging o f phases.
Table 6.3 Execution time for the message-passing version with P  = 2.
Image size a b c d
4x4 0.004822 0.001088 0.004898 0.004901
8 x 8 0.020987 0.008731 0.012005 0.011019
16x16 0.075012 0.047472 0.051081 0.035916
32x32 0.278019 0.224320 0.240187 0.159799
64x64 1.150380 0.952072 1.023819 0.831675
128x128 4.413166 4.068171 4.241350 3.659837
Table 6.4 Execution time for the shared-memory version with P  = 2.
Image size a b c d
4x4 0.004820 0 . 0 0 1 2 0 1 0.004909 0.004880
8 x 8 0.020717 0.008967 0.012413 0.010863
16x16 0.074750 0.046817 0.050675 0.033301
32x32 0.261943 0.218780 0.233138 0.147998
64x64 1.099478 0.939378 0.973961 0.777961
128x128 4.183290 3.918180 4.107810 3.496321
117
Table 6.5 Execution time for the hybrid version with P = 2.
Image size a b c d
4x4 0.004838 0 . 0 0 1 1 1 1 0.004839 0.004851
8 x 8 0.020617 0.008697 0.012394 0.011036
16x16 0.075134 0.046943 0.050850 0.034060
32x32 0.276149 0.216675 0.235188 0.150298
64x64 1.110316 0.951002 1.005726 0.786590
128x128 4.363920 4.061670 4.149989 3.556343
Table 6 . 6  Execution time for the hybrid version with P = 2 and separate Phases 4 and 5.
Image size a b c d
4x4 0.005547 0.001226 0.005503 0.005547
8 x8 0 . 0 2 2 1 2 2 0.009569 0.013490 0.012660
16x16 0.081270 0.050118 0.055276 0.037902
32x32 0.297830 0.232514 0.253200 0.168105
64x64 1.187579 1.048526 1.077222 0.883491
128x128 4.013880 4.326028 4.420089 3.891362
In the 4-processor case, bus collisions incurred from accessing the shared memory 
increase, and communication channel congestion may also happen in the message-passing 
version due to non-local communications (e.g. when al sends a group o f  data to a3 via a4, 
and a4 sends a group o f  data to a2 via a3 almost at the same time, congestion could result 
in a3). However, the hybrid version can ease the channel congestion by using the shared 
memory for non-local communications when reducing the number o f bus collisions by 
moving local communications from the shared memory to the links. From Tables 6.7 to 
6.9, we can see that the hybrid version has better performance than the other two, and the
118
earlier discussion about the performance for P = 2 resulting from the image types still 
holds as depicted in Figure 6.5. Again the results for the hybrid version with separation o f 
Phases 4 and 5 are shown in Table 6.10.
When P  = 8 , similarly to the 4-processor case the ISB bus and the communication 
links become more overloaded, and the hybrid version again has better performance than 
the other two versions o f the algorithm. Results for P  = 8  are displayed in Tables 6.11 to 
6.13. The performance differences o f the three versions for image d are visualized by the 
charts in Figures 6 .6 , 6.7 and 6 . 8  with respect to different numbers o f processors.
Better performance improvement should be expected for the hybrid version 
running on a system with more than eight processors. Improvement o f  the rules that 
determine the employment o f  shared-memory or message-passing in the hybrid algorithm 
could definitely enhance the performance further. This task is an objective for future 
research.
Table 6.7 Execution time for the message-passing version with P  = 4.
Image size a b c d
4x4 0.007121 0.003567 0.007577 0.006661
8 x 8 0.032992 0.021199 0.021617 0.021018
16x16 0.116961 0.078435 0.088001 0.038952
32x32 0.309354 0.246113 0.276843 0.137059
64x64 1.029711 0.886206 0.912341 0.646127
128x128 3.594540 3.397266 3.465665 2.826207
119
Table 6.8 Execution time for the shared-memory version with P = 4.
Image size a b c d
4x4 0.007301 0.003532 0.007436 0.007412
8 x 8 0.033181 0.020987 0.020871 0.020140
16x16 0.116385 0.077981 0.085413 0.036887
32x32 0.301351 0.240361 0.256259 0.131873
64x64 0.998678 0.873288 0.897125 0.610231
128x128 3.439710 3.265150 3.356454 2.705943
Table 6.9 Execution time for the hybrid version with P  = 4.
Image size a b c d
4x4 0.007041 0.003423 0.007335 0.007338
8 x 8 0.032782 0.020600 0.020287 0.020163
16x16 0.115116 0.076685 0.081354 0.036087
32x32 0.297675 0.236533 0.249481 0.126750
64x64 0.969237 0.813678 0.882079 0.534873
128x128 3.314809 3.024618 3.099592 2.505213
Table 6.10 Execution time for the hybrid version with P = 4 and separate Phases 4 and 5.
Image size a b c d
4x4 0.008185 0.004167 0.008260 0.008261
8 x 8 0.036674 0.022218 0.023210 0.021983
16x16 0.124501 0.081513 0.089198 0.041297
32x32 0.321667 0.253449 0.271251 0.141354
64x64 0.986330 0.982225 0.949851 0.583320
128x128 3.523334 3.203883 3.302786 3.099013
120
3.5
3
2 .5
2
1 .5
0 .5
0
8 x 84 x 4 1 6 x 1 6 3 2 x 3 2 6 4 x 6 4 1 2 8 x 1 2 8
Im a g e  S iz e
Figure 6.5 Execution time vs. image types for the hybrid version with P = 4.
Table 6.11 Execution time for the message-passing version with P  = 8 .
Image size a b c d
8 x 8 0.052370 0.031772 0.039740 0.041152
16x16 0.188065 0.130690 0.155713 0.043280
32x32 0.349411 0.272126 0.322993 0.118673
64x64 0.931953 0.828863 0.819055 0.506182
128x128 2.930214 2.839261 2.876496 2.192747
Table 6.12 Execution time for the shared-memory version with P = 8 .
Image size a b c d
8 x 8 0.056110 0.049703 0.035482 0.038130
16x16 0.179111 0.130550 0.147927 0.042110
32x32 0.349392 0.267737 0.287105 0.117954
64x64 0.920304 0.819539 0.829105 0.481761
128x128 2.845688 2.720958 2.765861 2.119655
Table 6.13 Execution time for the hybrid version with P = 8.
Image size a b c d
8 x 8 0.052370 0.049730 0.034258 0.038103
16x16 0.179109 0.125714 0.095437 0.039445
32x32 0.326547 0.261412 0.269629 0.106352
64x64 0.846256 0.704534 0.776690 0.367363
128x128 2.551307 2.278012 2.329146 1.778732
8 m e s s a g e - p a s s in g
□  s h a r e d - m e m o r y
□  h y b r id
3 2 x 3 2  6 4 x 6 4  1 2 8 x 1 2 8
I m a g e  S iz e
Figure 6.6 Performance comparison o f  the three versions with image d and P =
122
■  m e s s a g e - p a s s in g
□  s h a r e d - m e m o r y
□  h y b r id
3 2 x 3 2  6 4 x 6 4  1 2 8 x 1 2 8
Im a g e  S iz e
Figure 6.7 Performance comparison o f  the three versions with image d and P  = 4.
B  m e s s a g e -p a s s in g
□  s h a r e d - m e m o r y
□  h y b r id
3 2 x 3 2  6 4 x 6 4  1 2 8 x 1 2 8
Im a g e  S iz e
Figure 6.8 Performance comparison o f  the three versions with image d and P -  8.
CHAPTER 7
CONCLUSIONS
In this dissertation, several DSP algorithms were developed for a new experimental 
parallel system, namely TurboNet. The hybrid architecture o f the TurboNet system 
facilitates direct implementation o f both the message-passing and shared-memory 
fundamental parallel-processing communication paradigms. In contrast, other existing 
systems support directly in hardware only one o f these two fundamental communication 
paradigms. The main objective o f this dissertation was to show that hybrid architectures 
that implement directly both the message-passing and shared-memory paradigms 
introduce flexibility in algorithm developm ent that often results in very good 
performance. Three versions o f  each algorithm were developed, if possible, that employ 
message-passing, shared-memory and hybrid communications, respectively. Theoretical 
and experimental comparisons o f these algorithms were also included. The theoretical 
evaluation o f algorithms was based on the framework o f a methodology developed here 
for good performance prediction. The success o f this methodology proves that highly 
accurate performance prediction is possible for complex systems, such as TurboNet, if 
enough information is provided by data dependence graphs. This m ethodology can also 
be applied in the algorithm design phase to assess the performance before program 
coding.
The theoretical and experimental results show that the hybrid versions o f the 
algorithms generally have better performance than versions that employ only one o f  the
123
124
two fundamental communication paradigms. Therefore, not only do hybrid systems o f 
this type lead to significant performance gains for applications that can take advantage o f 
the hybrid architecture, but these systems also support directly each fundamental 
paradigm for applications with communication bias toward either one o f  the two 
paradigms.
In general, the shared-memory paradigm is good for coarse-grain parallelism, 
while message-passing is more efficient for communicating small amounts o f  data within 
groups o f nearby processors. To be able to use both paradigms for very high performance, 
an application should employ a mixture o f local and/or remote communications with 
simultaneous presence o f coarse-grain and fine-grain parallelism. The main conclusion of 
this research is that small-scale and medium-scale parallel computers should implement 
directly in hardware, if  possible, both communication mechanisms for high performance, 
robustness and ease o f  algorithm development.
The current TurboNet system has eight processors. However, its architecture 
supports straightforward system scalability for up to sixty-four processors. In general, as 
the system size increases, the communication overhead for the system also increases. This 
is because the bandwidths o f communication channels and shared memories are limited. 
However, this effect o f system size is more preeminent in the implementation o f  the 
shared-memory paradigm. Further research is needed to find the maximum system size 
for which the hybrid architecture is still superior. To achieve this objective, both 
theoretical and experimental results must be produced.
APPENDIX A
DSP ALGORITHM S
1-D Convolution Algorithm 1:
(1) Assume that the input data o f f  and g are stored in the shared memory o f  the 
system. Each processor gets the values o f f  through its DMA, and its g values as 
shown below, where i is the ID number o f a processor with 0 < i < P - l :
m = M/P; 
ti = im; 
t2 = ( i+ l)m -l;
for j = 0 to N+m-2 do in parallel 
if  t2-j > 0
giL)] = g[t2-j];
else
Si D] = g[M +t2-j];
end if 
if  j <N-1
fiD] = fI)];
end if
end for
(2 ) for j = t, to t2 do in parallel
y[j-t i ]  =  0; 
if j < N-l 
T = j ;
else
T = N - l ;
end if
for k = t2-j to t2-j+T do in parallel
yLj-tl] = yO -t|]+ g j[k ]f[k -t2+j];
end for
end for
(3) This step is for processor that has 2m elements o f y[t], 
for n = 0 to (N -l)/m  do in parallel
if  i = n
for j = ti+M  to t2+M -l do 
yyU-ti-M] = 0; 
if  j < N+M=2
for k = t2- 1 to t2 +N +M -1 -j do
yyU-ti-M] = yy[j-ti-M ]+gi[k]f[j-M+k-t2 ];
125
126
end for
else
return;
end if 
return;
end for
end if
end for
1-D Convolution Algorithm 2:
(1) Assume that the input data are stored in the shared memory o f the system. Each 
processor gets its f  and g values through its DMA as follows:
m = (N +M -l)/P; 
ti = im; 
t2 = ( i+ l)m -l; 
t3 = ti-(N -l);
for all the processors do in parallel 
if  ti < (N -l) or t2 > (M -l) 
if  ti < (N -l)
for j = t2 to 0  do
giLM2] = g[)]; 
i f j- t2 < N -l
fi LH2] =  f[j-t2];
e n d  i f
end for
else
for j  = t3 to M -l do 
gi0-t3] = g 0 ] 
if j - t3  < N -l
f i D-t3] = f[N -l-j+ t3];
end if  
end for
end if
else
fo r  j  = 0 to  N+m-2 d o  
g i f j ]  =  g [ t2 - j ] ;  
i f  j  < N-l
end if
end for
end if
(2 ) fo r  j = ti to  t2 d o  in p a r a l le l
yLi-ti] = 0;
127
if  j > (M/m)m 
if  j > M
T = M+N-2-j;
else
T = N -l;
end if
for k = j-ti to j- ti+ T  do
y[i-ti] = y[i-ti]+gi[k]fi [k-j+ti];
end for
else
if j < N -l
T = j ;
else
T = N-1;
end if
for k = t2-j to tz-j+T do
yfj-t'] = y 0 -ti]+gi[k]fi[k+j-t2];
end for
end if
end for
2-D Convolution Algorithm (binary-trce):
(1) For i = 0 to P-l do in parallel:
read the g matrix;
fo rj = ix( N /P )  to (i+1) x( N)/P)-l do 
read f[j,y];
end for
end for
(2) For i = 0 to P-l do in parallel:
perform the convolution o f f  and g;
end for
(3) For s = 1 to log2 P
send result to lower neighbor or receiver result from upper neighbor;
if ( ID >(P-vl)) 
break;
end if 
end for
128
2-D Convolution Algorithm (grid):
(1) Processors in the first row and first column get g and f  respectively, from the 
system DRAM  and pass them to the other processors accordingly through DMAs 
and communication ports.
(2) For all the processors do in parallel
for i = 0 to N2-1 do
for j = 0 to M2-1 do
c[i+j] = c[i+ j]+f[i]g[j];
end for
end for
(3) if  ( you are in the first row or the first column )
if  ( not in the last row or the last column)
send your c[] to your south-east neighbor; 
return;
else
return;
end if
end if
wait until you receive result from your north-west neighbor; 
add the received result to yours; 
if  ( you have a south-east neighbor)
send your c[] to your south-east neighbor; 
return;
else
return;
end if  
return;
2-D Convolution Algorithm (triangle):
(1) Processors in the first row, first column and last column get g and f  respectively, 
from the system DRAM and pass them to the other processors accordingly 
through DMAs and communication ports.
(2) For all the processors do in parallel
for i = 0 to N2-1 do
for j = 0 to M2-1 do
c[i+j] = c[i+j]+f[i]g[j];
end for
end for
(3) if  ( you are in the diagonal or right above i t )
if  ( you are in the d iagonal)
if  ( have lower neighbor )
129
send your c[] to it; 
return;
else
return;
end if
else
if  ( have upper neighbor ) 
send your c[] to it; 
return;
else
return;
end if
end if
end if
wait until you receive result from your upper or lower neighbor; 
add the received result to yours; 
if  ( you have received result from your upper neighbor ) 
if  ( have lower neighbor) 
send your c[] to it; 
return;
else
return;
end if
else
if  ( have upper neighbor ) 
send your c[] to it; 
return;
else
return;
end if
end if
REFERENCES
1. S. Horguchi and T. Nakada, “Performance Evaluation o f  Parallel Fast Fourier
Transform on a M ultiprocessor W orkstation,” Journal o f  Parallel and D istributed  
Computing, pp. 158-163, December 1993 .
2. C. Tong and P. N. Swarztrauber, “Ordered Fast Fourier Transform on a M assively
Parallel Hypercube M ultiprocessor,” Journal o f  Parallel and D istributed  
Computing , pp. 50-59, December 1991 .
3. E. M. Dowling and Z. Fu, “HARP: An Open Architecture for Parallel Matrix and
Signal Processing,” IEEE Transactions on Parallel and  Distributed Systems, vol. 
4, pp. 1081-1091, October 1993.
4. G. Fox, et. al., Solving Problems on Concurrent Processors, vol. 1, Prentice-Hall,
Englewood Cliffs, NJ, 1988.
5. S. G. Akl, The Design and  Analysis o f  Parallel A lgorithm s, Prentice-Hall, Englewood
Cliffs, NJ, 1989.
6. D. H. Bailey, “FFTs in External or Hierarchical M emory,” Journal o f
Supercomputing , vol. 4, No. 1, pp. 23-35, March 1990.
7. Z. Cvetanovic, “Performance Analysis o f the FFT Algorithm on a Shared-Memory
Parallel Architecture, ” IB M  Journal o f  Research and Development, vol. 31, No. 
4, pp. 245-451, July 1987.
8 . R. E. Blahut, Fast Algorithms fo r  Digital Signal Processing , Addison-W esley,
Reading, MA, 1987.
9. S. G. Ziavras and M. A. Siddiqui, “Pyramid Mapping onto Hypercubes for Computer
Vision: Connection M achine Comparative Study,” Concurrency: Practice and  
Experience, Vol. 5, No. 6, pp. 471-489, September 1993.
10. S. Chandra, J. R. Larus and A. Rogers, “What Is Time Spent in M essage-Passing and
Shared-M emory Programs?” 6th Int 7 Conference on Architectural Support for  
Program Langs. & OS's, October 4-7, 1994.
11. J. Heinlein, K. Gharachorloo, S. Dresser and A. Gupta, “Integration o f M essage
Passing and Shared Memory in the Stanford FLASH M ultiprocessor,” 6th Int 7 
Conference on Architectural Support fo r  Program Langs. & OS's, October 4-7,
1994.
130
131
12. D. Kranz, K. Jonhson, A. Agarwal, J. Kubiatowicz and B. Lim, “Integrating Message-
Passing and Shared-Memory: Early Experience,” 5th A C M  SIG PLAN Symposium  
on Principles & Practice o f  Parallel Programming, pp. 54-63, May 1993.
13. A. M ontanvert, P. Meer and A. Rosenfeld, “Hierarchical Image Analysis Using
Irregular Tessellations,” IEEE Transactions on Pattern Analysis and Machine 
Intelligence, Vol. 13, No. 4, pp. 307-316, 1991.
14. S. G. Ziavras and P. Meer, “Adaptive Multiresolution Structures for Image Processing
on Parallel Computers,” Journal o f  Parallel and Distributed Computing, Vol. 25, 
p p .475-483, 1994.
15. A. Rosenfeld (Ed.), M ultiresolution Image Processing and  Analysis, Springer-
Verlag, Berlin, 1984.
16. S. G. Ziavras and D. P. Shah, “High Performance Emulation o f  Hierarchical
Structures on Hypercube Supercomputers,” Concurrency: Practice and
Experience, Vol. 6, No. 2, pp. 85-100, 1994.
17. T. II. Cormen, Introduction to Algorithms, McGraw-Hill, New York, NY, pp. 784-
796, 1990.
18. R. Sedgewick, Algorithms, Addison-W esley, Reading, MA, pp. 583-592, 1988.
19. J. H. M cClellam, Number Theory in DSP, Addison-W esley, Reading, MA, pp. 194-
196, 1979.
20. K. M. Chandy, Parallel Program Design: a Foundation, Addison-W esley, Reading,
MA, pp. 148-151, 1989.
21. A. Y. Grama, “ Isoefficiency: M easuring the Scalability o f  Parallel Algorithms and
Architectures,” IEEE Parallel and D istributed Technology, August 1993.
22. A. Gupta and V. Kumar, “ The Scalability o f FFT on Parallel Com puters,” IEEE
Transactions on Parallel and D istributed Systems, vol. 4, August 1993.
23. R. Hross, S. G. Ziavras, C. N. Manikopoulos, N. J. Lad and X. Li, “A Defect
Identification Algorithm for Sequential and Parallel Com puters,” IEEE  
International Symposium on Industrial Electronics, Athens, Greece, July 10-14, 
1995, to appear.
24. T. G. Lewis and H. El-Rewini, Introduction to Parallel Computing, Prentice-Hall,
Englewood Cliffs, NJ, 1992.
132
25. P. A. Lynn and W. Fuerst, Introduction to D igital Signal Processing with Computer
Applications, John Wiley & Son, Chichester, NY, pp.211-246, 1989.
26. J. S. Lim, Two-Dimensional Signal and Image Processing, PTR Prentice-Hall,
Englewood Cliffs, NJ, pp. 163-172, 1990.
27. R. C. Gonzalez and R. E. Woods, D igital Image Processing, Addison-W esley,
Reading, MA, 1992.
28. X. Li, S. G. Ziavras and C. N. Manikopoulos, “Parallel DSP Algorithms on TurboNet:
An Experimental System with Hybrid M essage-Passing and Shared-M emory 
Architecture,” Concurrency: Practice and Experience, accepted for publication,
1995.
29. Force Computers Inc, San Jose, CA, SPARC CPU-2CE Technical Reference Manual,
April 1993.
30. X. Li, S. G. Ziavras and C. N. M anikopoulos, “Parallel Generation o f Adaptive
M ultiresolution Structures for Image Processing,” Concurrency: Practice and  
Experience, submitted for publication.
31. Ariel Corporation, Highland Park, NJ, User's Manual for the V-C40 Hydra, version
0.60, February, 1994.
32. Texas Instruments, Houston, TX, TMS320C4x User's Guide, pp. 1.4-1.6, 1993.
33. Texas Instruments, Houston, TX, TMS320C4x User's Guide, pp.10.1-10.16, 1993.
34. Texas Instruments, Houston, TX, TMS320 Floating-Point DSP Optimal C Compiler
User's Guide, 1991
35. Texas Instruments, Houston, TX, TMS320 Floating-Point DSP Assembly Language
Tools U ser’s Guide, 1991
36. K. Hwang, Advanced Computer Architecture with Parallel Programming, Ed. 2.
M cGraw Hill, Englewood Cliffs, NJ, 1993.
37. G. C. Fox, S. W. Otto and A. J. Hey, “Matrix Algorithms on Hypercube (I): Matrix
M ultiplication,” Parallel Computing, pp. 17-31, 1987.
38. S. G. Ziavras, “ RH: A Versatile Family o f  Reduced Hypercube Interconnection
Networks,” IEEE Transactions on Parallel and D istributed Systems, Vol. 5, No.
11. pp. 1210-1220, November 1994.
